Following the earlier example, let's consider a simple XML format for storing information about a comic book collection. Here's a sample document for a collection consisting of a single issue:
    <collection>
      <comic title="Sandman" number='62'>
        <writer>Neil Gaiman</writer>
        <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
        <penciller pages="10-17">Charles Vess</penciller>
      </comic>
    </collection>
An XML document must have a single root element; this is the "collection" element. It has one child comic element for each issue; the book's title and number are given as attributes of the comic element, which can have one or more children containing the issue's writer and artists. There may be several artists or writers for a single issue.
Let's start off with something simple: a document handler named FindIssue that reports whether a given issue is in the collection.
    from xml.sax import saxutils

    class FindIssue(saxutils.DefaultHandler):
        def __init__(self, title, number):
            self.search_title, self.search_number = title, number
The DefaultHandler class inherits from all four interfaces: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. This is what you should use if you want to use one class for everything. When you want separate classes for each purpose, or if you want to implement only a single interface, you can just subclass each interface individually. Neither of the two approaches is always "better" than the other; their suitability depends on what you're trying to do, and on what you prefer.
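As a rough sketch of the "separate classes" approach, the example below subclasses ContentHandler and ErrorHandler individually. It assumes the xml.sax package from the current Python standard library rather than the saxutils module used in this text, and the names ElementCounter, CollectingErrorHandler, and count_elements are hypothetical, chosen only for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class ElementCounter(ContentHandler):
    """Hypothetical content handler: counts every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


class CollectingErrorHandler(ErrorHandler):
    """Hypothetical error handler: records fatal errors, then re-raises."""

    def __init__(self):
        self.fatal = []

    def fatalError(self, exception):
        self.fatal.append(exception)
        raise exception


def count_elements(xml_bytes):
    """Return (number of start tags seen, number of fatal errors)."""
    counter = ElementCounter()
    errors = CollectingErrorHandler()
    try:
        # parseString accepts the content handler and error handler separately
        xml.sax.parseString(xml_bytes, counter, errors)
    except xml.sax.SAXParseException:
        pass  # already recorded by the error handler
    return counter.count, len(errors.fatal)


print(count_elements(b"<a><b/><c/></a>"))
```

Keeping the two roles apart means the same error handler can be reused with any content handler, at the cost of one more object to wire up.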
Since this class is doing a search, an instance needs to know what to search for. The desired title and issue number are passed to the FindIssue constructor, and stored as part of the instance.
Now let's look at the method which actually does all the work. This simple task only requires looking at the attributes of a given element, so only the startElement method is relevant.
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'
The startElement() method is passed a string giving the name of the element, and an instance containing the element's attributes. The latter implements the AttributeList interface, which includes most of the semantics of Python dictionaries. Therefore, the function looks for comic elements, and compares the specified title and number attributes to the search values. If they match, a message is printed out.
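To make the dictionary-like behaviour concrete, here is a minimal sketch that records what a few dictionary-style operations return for a comic element. It assumes the attribute object supplied by the current standard-library xml.sax (which may differ in detail from the AttributeList interface described here); AttributeDump is a hypothetical name.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttributeDump(ContentHandler):
    """Hypothetical handler: records how attrs behaves for comic elements."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        if name != 'comic':
            return
        self.seen.append({
            'title': attrs.get('title'),                     # like dict.get
            'number': attrs.get('number'),                   # always a string
            'publisher': attrs.get('publisher', 'unknown'),  # default if absent
            'names': sorted(attrs.keys()),                   # like dict.keys
            'has_title': 'title' in attrs,                   # membership test
        })


dump = AttributeDump()
xml.sax.parseString(
    b'<collection><comic title="Sandman" number="62"/></collection>', dump)
print(dump.seen[0])
```

Attribute values arrive as strings, which is why the search number has to be given as '62' rather than the integer 62 for the comparison to succeed.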
startElement() is called for every single element in the document. If you added print 'Starting element:', name to the top of startElement(), you would get the following output.

    Starting element: collection
    Starting element: comic
    Starting element: writer
    Starting element: penciller
    Starting element: penciller
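The same trace can be reproduced with a small stand-alone handler. This is only a sketch, assuming the current standard-library xml.sax and Python 3 print syntax; ElementTrace and SAMPLE are hypothetical names, and SAMPLE simply repeats the sample document from the start of this section.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The sample collection document from the beginning of this section
SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class ElementTrace(ContentHandler):
    """Hypothetical handler: remembers the name of every element started."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        print('Starting element:', name)
        self.names.append(name)


trace = ElementTrace()
xml.sax.parseString(SAMPLE, trace)
```

Elements are reported in document order, so a parent always appears before its children.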
To actually use the class, we need top-level code that creates instances of a parser and of FindIssue, associates them, and then calls a parser method to process the input.
    from xml.sax import make_parser
    from xml.sax.handler import feature_namespaces

    if __name__ == '__main__':
        # Create a parser
        parser = make_parser()

        # Tell the parser we are not interested in XML namespaces
        parser.setFeature(feature_namespaces, 0)

        # Create the handler
        dh = FindIssue('Sandman', '62')

        # Tell the parser to use our handler
        parser.setContentHandler(dh)

        # Parse the input
        parser.parse(file)
The make_parser function can automate the job of creating parsers. There are already several XML parsers available to Python, and more might be added in the future. xmllib.py is included with Python 1.5, so it's always available, but it's also not particularly fast. A faster version of xmllib.py is included in xml.parsers. The xml.parsers.expat module is faster still, so it's obviously a preferred choice if it's available. make_parser determines which parsers are available and chooses the fastest one, so you don't have to know what the different parsers are, or how they differ. (You can also tell make_parser to try a list of parsers, if you want to use a specific one.)
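As an illustration of passing a parser list, the sketch below asks make_parser for a specific driver by module name. It assumes the current standard-library xml.sax, where the expat-based reader lives in the module xml.sax.expatreader; the set of drivers available with the installation described in this text may be different.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader

# Ask for the expat-based reader by module name. make_parser tries the
# listed modules first and falls back to its built-in default list if
# none of them can be loaded.
parser = xml.sax.make_parser(['xml.sax.expatreader'])

# Whichever driver was chosen, the result implements the XMLReader interface
print(isinstance(parser, XMLReader))
print(type(parser).__module__)
```

Because every driver exposes the same interface, the rest of the program does not need to change when a different parser is selected.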
In SAX2, XML namespaces are supported. If namespace processing is active, parsers call startElementNS instead of startElement. Since our content handler does not implement the namespace-aware methods, we request that namespace processing be deactivated. The default for this setting varies from parser to parser, so you should always set it to a safe value, unless your handlers support both kinds of method.
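For comparison, here is a sketch of what a namespace-aware variant might look like when namespace processing is switched on instead. It assumes the current standard-library xml.sax; NSFindIssue is a hypothetical class, and the namespace URI http://example.org/comics is invented purely for this example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Hypothetical namespace URI, used only for this illustration
COMICS_NS = 'http://example.org/comics'

NS_DOC = (
    b'<c:collection xmlns:c="http://example.org/comics">'
    b'<c:comic title="Sandman" number="62"/>'
    b'</c:collection>'
)


class NSFindIssue(ContentHandler):
    """Hypothetical namespace-aware counterpart of FindIssue."""

    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = False

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair
        if name != (COMICS_NS, 'comic'):
            return
        # Unprefixed attributes belong to no namespace: key is (None, name)
        title = attrs.get((None, 'title'))
        number = attrs.get((None, 'number'))
        if title == self.search_title and number == self.search_number:
            self.found = True


handler = NSFindIssue('Sandman', '62')
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)   # turn namespace processing on
parser.setContentHandler(handler)
parser.parse(io.BytesIO(NS_DOC))
print(handler.found)
```

Matching on the (URI, local name) pair rather than on the prefixed name means the handler keeps working if the document chooses a different prefix for the same namespace.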
Once you've created a parser instance, calling setContentHandler tells the parser what to use as the handler.
If you run the above code with the sample XML document, it'll output:

    Sandman #62 found
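For readers on a current Python, the whole example can be approximated in one self-contained script. This is a sketch under stated assumptions, not the code from this text: it subclasses xml.sax.handler.ContentHandler in place of saxutils.DefaultHandler, uses Python 3 print syntax, and parses an in-memory copy of the sample document. The matches list, the find_issue helper, and the SAMPLE constant are additions made for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# In-memory copy of the sample collection document
SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssue(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.matches = []  # assumption: kept so callers can inspect results

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic':
            return
        title = attrs.get('title')
        number = attrs.get('number')
        if title == self.search_title and number == self.search_number:
            self.matches.append((title, number))
            print(title, '#' + str(number), 'found')


def find_issue(xml_bytes, title, number):
    """Hypothetical helper: parse xml_bytes and return the matching issues."""
    parser = xml.sax.make_parser()
    # We are not interested in XML namespaces
    parser.setFeature(feature_namespaces, False)
    handler = FindIssue(title, number)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.matches


if __name__ == '__main__':
    find_issue(SAMPLE, 'Sandman', '62')
```

Collecting the matches in a list as well as printing them makes the handler easier to reuse from other code and to check in tests.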