Standard Module sgmllib


This module defines a class SGMLParser which serves as the basis for parsing text files formatted in SGML (Standard Generalized Mark-up Language). In fact, it does not provide a full SGML parser: it only parses SGML insofar as it is used by HTML, and the module exists only as a base for the htmllib module.

In particular, the parser is hardcoded to recognize the following constructs:

The SGMLParser class must be instantiated without arguments. It has the following interface methods:


\begin{funcdesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{funcdesc}


\begin{funcdesc}{setnomoretags}{}
Stop processing tags. Treat all following inpu...
... provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}


\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}


\begin{funcdesc}{feed}{data}
Feed some text to the parser. It is processed insof...
...a is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}


\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were fo...
...e
redefined version should always call \code{SGMLParser.close()}.
\end{funcdesc}
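
The buffering contract of \code{feed()} and \code{close()} can be illustrated with a small stand-alone sketch. The class \code{ChunkedTagReader} below is hypothetical and is not part of this module; it assumes a toy grammar in which anything between \code{<} and \code{>} is a tag and everything else is data. It is meant to show only the pattern: complete pieces are handled at once, an unfinished tail is kept until more data arrives, and \code{close()} flushes whatever is left.

```python
class ChunkedTagReader:
    """Toy illustration of feed()/close() buffering (not sgmllib itself)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Drop any unprocessed data, as reset() is described to do.
        self.rawdata = ""
        self.events = []

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        # Force processing of whatever is still buffered.
        self._process(final=True)

    def _process(self, final):
        raw = self.rawdata
        i = 0
        while i < len(raw):
            if raw[i] == "<":
                j = raw.find(">", i)
                if j < 0:
                    if not final:
                        break  # incomplete tag: wait for more data
                    # On close(), treat the unfinished tail as plain data.
                    self.events.append(("data", raw[i:]))
                    i = len(raw)
                    break
                self.events.append(("tag", raw[i + 1:j]))
                i = j + 1
            else:
                j = raw.find("<", i)
                if j < 0:
                    j = len(raw)
                self.events.append(("data", raw[i:j]))
                i = j
        # Keep only the part that could not be processed yet.
        self.rawdata = raw[i:]


reader = ChunkedTagReader()
reader.feed("<b>bold</")   # "</" stays buffered
reader.feed("b>")          # now the end tag is complete
reader.close()
```

In this sketch the split end tag is reported only after the second \code{feed()} call, which is the behaviour a caller reading a file or socket in chunks would rely on.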


\begin{funcdesc}{handle_starttag}{tag\, method\, attributes}
This method is call...
... calls
\code{method} with \code{attributes} as the only argument.
\end{funcdesc}


\begin{funcdesc}{handle_endtag}{tag\, method}
This method is called to hand...
...s not called. The base implementation simply calls
\code{method}.
\end{funcdesc}
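
As an illustration of how \code{handle_starttag()} and \code{handle_endtag()} might be extended, the following stand-alone sketch records each dispatched tag and then defers to the base behaviour described above. The classes \code{BaseHandlers} and \code{TracingHandlers} are hypothetical stand-ins rather than the real parser class, and the sketch assumes that \code{attributes} is a list of (name, value) pairs.

```python
class BaseHandlers:
    """Mirrors only the base behaviour described above (hypothetical class)."""

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()


class TracingHandlers(BaseHandlers):
    """Records every dispatched tag, then defers to the base behaviour."""

    def __init__(self):
        self.trace = []

    def handle_starttag(self, tag, method, attributes):
        self.trace.append(("start", tag, list(attributes)))
        BaseHandlers.handle_starttag(self, tag, method, attributes)

    def handle_endtag(self, tag, method):
        self.trace.append(("end", tag))
        BaseHandlers.handle_endtag(self, tag, method)

    # Example tag-specific methods that the hooks above are handed.
    def start_a(self, attributes):
        # Assumption: attributes is a list of (name, value) pairs.
        self.trace.append(("start_a ran", dict(attributes)))

    def end_a(self):
        self.trace.append(("end_a ran",))


h = TracingHandlers()
h.handle_starttag("a", h.start_a, [("href", "x.html")])
h.handle_endtag("a", h.end_a)
```

Calling the base class method explicitly keeps the tag-specific method in charge of the actual processing, while the override only adds bookkeeping.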


\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary d...
...n by a derived class; the base class implementation does
nothing.
\end{funcdesc}


\begin{funcdesc}{handle_charref}{ref}
This method is called to process a charact...
...ride this method to provide support for named
character entities.
\end{funcdesc}


\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a gener...
..., \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}
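
A table-driven treatment of entity references can be sketched as follows. This is a stand-alone illustration, not the module's code: the class names \code{EntityResolver} and \code{LatinResolver} are hypothetical, and the sketch assumes that translations live in a class-level dictionary (called \code{entitydefs} here) and that unresolved names are passed to \code{unknown_entityref()}.

```python
class EntityResolver:
    # Assumed translation table; the name entitydefs is used for illustration.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.output = []
        self.unknown = []

    def handle_data(self, data):
        self.output.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        table = self.entitydefs
        if ref in table:
            # Known entity: pass its translation on as ordinary data.
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class LatinResolver(EntityResolver):
    # Extend the table in a subclass instead of mutating the shared dict.
    entitydefs = dict(EntityResolver.entitydefs, eacute="\u00e9")


resolver = LatinResolver()
for name in ("lt", "eacute", "nbsp"):
    resolver.handle_entityref(name)
```

Building a new dictionary in the subclass avoids changing the table seen by every other instance of the base class.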


\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment i...
...with the argument \code{'text'}. The
default method does nothing.
\end{funcdesc}


\begin{funcdesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{funcdesc}


\begin{funcdesc}{unknown_starttag}{tag\, attributes}
This method is called to pr...
...n by a derived class; the base class implementation
does nothing.
\end{funcdesc}


\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknow...
...n by a derived class; the base class implementation
does nothing.
\end{funcdesc}


\begin{funcdesc}{unknown_charref}{ref}
This method is called to process unresolv...
...n by a derived class; the
base class implementation does nothing.
\end{funcdesc}


\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unk...
...n by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived classes may also define methods of the following form to define processing of specific tags. Tag names in the input stream are case independent; the tag occurring in method names must be in lower case:


\begin{funcdesc}{start_\var{tag}}{attributes}
This method is called to process a...
...the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}


\begin{funcdesc}{do_\var{tag}}{attributes}
This method is called to process an o...
...the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}


\begin{funcdesc}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdesc}

Note that the parser maintains a stack of open elements for which no end tag has been found yet. Only tags processed by \code{start_\var{tag}()} are pushed on this stack. Definition of an \code{end_\var{tag}()} method is optional for these tags. For tags processed by \code{do_\var{tag}()} or by \code{unknown_starttag()}, no \code{end_\var{tag}()} method needs to be defined; if one is defined, it will not be used. If both \code{start_\var{tag}()} and \code{do_\var{tag}()} methods exist for a tag, the \code{start_\var{tag}()} method takes precedence.
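
The dispatch rules above can be sketched with a small stand-alone class. \code{TagDispatcher}, \code{Demo}, \code{dispatch_starttag()} and \code{dispatch_endtag()} are hypothetical names chosen for this illustration and are not part of the module. The handling of end tags is a simplification: closing an element also closes everything opened after it, and an end tag is reported as unbalanced only if a matching start method exists.

```python
class TagDispatcher:
    """Toy dispatcher following the rules stated above (not sgmllib code)."""

    def __init__(self):
        self.stack = []   # open elements still waiting for an end tag
        self.log = []

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()  # tag names are case independent
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            # A start method wins over a do method and opens an element.
            self.stack.append(tag)
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            do(attributes)  # no stack entry, so no end method is used
            return
        self.unknown_starttag(tag, attributes)

    def dispatch_endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            if hasattr(self, "start_" + tag):
                self.report_unbalanced(tag)
            else:
                self.unknown_endtag(tag)
            return
        # Close every element opened after `tag`, then `tag` itself.
        while True:
            closed = self.stack.pop()
            end = getattr(self, "end_" + closed, None)
            if end is not None:  # the end method is optional
                end()
            if closed == tag:
                break

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown start", tag))

    def unknown_endtag(self, tag):
        self.log.append(("unknown end", tag))

    def report_unbalanced(self, tag):
        self.log.append(("unbalanced", tag))


class Demo(TagDispatcher):
    def start_p(self, attributes):
        self.log.append(("start_p", attributes))

    def end_p(self):
        self.log.append(("end_p",))

    def do_p(self, attributes):  # never used: start_p takes precedence
        self.log.append(("do_p", attributes))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))


demo = Demo()
demo.dispatch_starttag("P", [])   # handled by start_p, pushed on the stack
demo.dispatch_starttag("br", [])  # handled by do_br, not pushed
demo.dispatch_endtag("p")         # pops the stack and calls end_p
```

Looking methods up by name with \code{getattr()} is one simple way to let derived classes add tag support just by defining methods; whether the real parser does exactly this is not stated here.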