htmllib
This module defines a number of classes which can serve as a basis for
parsing text files formatted in HTML (HyperText Mark-up Language).
The classes are not directly concerned with I/O; they have to be fed
their input in string form, and will make calls to methods of a
"formatter" object in order to produce output. The classes are
designed to be used as base classes for other classes in order to add
functionality, and allow most of their methods to be extended or
overridden. In turn, the classes are derived from and extend the
class SGMLParser defined in module sgmllib.
The following is a summary of the interface defined by
sgmllib.SGMLParser:
- The interface to feed data to an instance is through the feed()
method, which takes a string argument. This can be called with as
little or as much text at a time as desired;
p.feed(a); p.feed(b) has the same effect as p.feed(a+b).
When the data contains complete
HTML elements, these are processed immediately; incomplete elements
are saved in a buffer. To force processing of all unprocessed data,
call the close() method.
Example: to parse the entire contents of a file, do
parser.feed(open(file).read()); parser.close().
- The interface to define semantics for HTML tags is very simple: derive
a class and define methods called start_tag(),
end_tag(), or do_tag(). The parser will
call these at appropriate moments: start_tag or
do_tag is called when an opening tag of the form
<tag ...> is encountered; end_tag is called
when a closing tag of the form </tag> is encountered. If
an opening tag requires a corresponding closing tag, like <H1>
... </H1>, the class should define the start_tag
method; if a tag requires no closing tag, like <P>, the class
should define the do_tag method.
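
The two points above can be sketched in runnable form. Because sgmllib
is not available in Python 3, this sketch uses html.parser.HTMLParser
from the Python 3 standard library as a stand-in and routes its
callbacks to start_tag, end_tag and do_tag style methods. The class
DispatchingParser, the helper parse_in_chunks, the events list and the
handler signatures (an attribute list for start_tag and do_tag, no
arguments for end_tag) are assumptions made for illustration, not part
of the module described here.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical stand-in mimicking the start_tag/end_tag/do_tag convention."""

    def __init__(self):
        super().__init__()
        self.events = []  # record of handler calls, for demonstration only

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags with no closing tag.
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()

    # <H1> ... </H1> needs a closing tag, so it gets a start/end pair.
    def start_h1(self, attrs):
        self.events.append("start h1")

    def end_h1(self):
        self.events.append("end h1")

    # <P> needs no closing tag, so it gets a single do_ method.
    def do_p(self, attrs):
        self.events.append("do p")


def parse_in_chunks(*chunks):
    """Feed each chunk in turn, then close() to flush any buffered input."""
    parser = DispatchingParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


# The first chunk ends in the middle of a closing tag; the parser buffers the
# incomplete tag until the rest arrives with the second chunk.
a, b = "<H1>Title</H", "1><P>Body"
print(parse_in_chunks(a, b))
print(parse_in_chunks(a + b))
```

Both calls are expected to report the same sequence of handler calls,
which is the sense in which p.feed(a); p.feed(b) matches p.feed(a+b).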
The module defines the following classes:
Instances of CollectingParser (and thus also instances of
FormattingParser and AnchoringParser) have the following
instance variables:
The anchors, anchornames and anchortypes lists
are "parallel arrays": items in these lists with the same index
pertain to the same anchor. Missing attributes default to the empty
string. Anchors with neither an HREF nor a NAME
attribute are not entered in these lists at all.
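
The bookkeeping rules for these lists can be illustrated with a small
sketch. The class AnchorLists, its add() method and the use of a
dictionary of lowercase attribute names are hypothetical, and the
assumption that anchortypes holds the value of a TYPE attribute is
not confirmed by the text above. Only the three list names and the
rules themselves come from the description.

```python
class AnchorLists:
    """Hypothetical holder for the three parallel anchor lists."""

    def __init__(self):
        self.anchors = []      # HREF values
        self.anchornames = []  # NAME values
        self.anchortypes = []  # TYPE values (assumed meaning)

    def add(self, attrs):
        """Record one anchor given a dict of its attributes."""
        # Anchors with neither HREF nor NAME are not entered at all.
        if "href" not in attrs and "name" not in attrs:
            return
        # Missing attributes default to the empty string, so all three
        # lists always grow together and stay index-aligned.
        self.anchors.append(attrs.get("href", ""))
        self.anchornames.append(attrs.get("name", ""))
        self.anchortypes.append(attrs.get("type", ""))


lists = AnchorLists()
lists.add({"href": "intro.html"})
lists.add({"name": "top", "type": "section"})
lists.add({"class": "plain"})  # skipped: neither HREF nor NAME

# Items with the same index pertain to the same anchor.
for href, name, kind in zip(lists.anchors, lists.anchornames, lists.anchortypes):
    print(repr(href), repr(name), repr(kind))
```

Keeping the lists the same length is what makes it safe to walk them
together by index or with zip().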
The module also defines a number of style sheet classes. These should
never be instantiated — their class variables are the only behavior
required. Note that style sheets are specifically designed for a
particular formatter implementation. The currently defined style
sheets are:
Style sheets have the following class variables:
Although no documented implementation of a formatter exists, the
FormattingParser class assumes that formatters have a
certain interface. This interface requires the following methods:
A sample formatter implementation can be found in the module
fmt, which in turn uses the module Para. These modules are
not intended as standard library modules; they are available as an
example of how to write a formatter.