The Simple API for XML isn't a standard in the formal sense, but an informal specification designed by David Megginson, with input from many people on the xml-dev mailing list. SAX defines an event-driven interface for parsing XML. To use SAX, you must create Python class instances which implement a specified interface, and the parser will then call various methods of those objects.
This howto describes version 2 of SAX (also referred to as SAX2). Earlier versions of this text explained SAX1, which is now mainly of historical interest.
SAX is most suitable when you want to read through an
entire XML document from beginning to end and perform some
computation, such as building a data structure representing the
document, or summarizing information in it (computing the
average value of a certain element, for example). It's not very
useful if you want to modify the document structure in some
complicated way that involves changing how elements are nested, though
it could be used if you simply wish to change element contents or
attributes. For example, you would not want to re-order chapters in a
book using SAX, but you might want to change the contents of any
name elements with the attribute lang equal to 'greek' into Greek
letters.
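The "average value of a certain element" case can be sketched as follows. This is a minimal illustration, not a definitive implementation: the element name price, the class name AverageHandler and the sample document are all assumptions made up for the example. It relies only on xml.sax.parseString and the ContentHandler callbacks from the standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AverageHandler(ContentHandler):
    """Average the numeric text of every <price> element (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._buffer = None  # None means we are outside a <price> element

    def startElement(self, name, attrs):
        if name == "price":
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "price" and self._buffer is not None:
            self.values.append(float("".join(self._buffer)))
            self._buffer = None

    def average(self):
        return sum(self.values) / len(self.values) if self.values else None


# Made-up sample data, used only for this illustration.
SAMPLE = (
    b"<items>"
    b"<item><price>10</price></item>"
    b"<item><price>20</price></item>"
    b"</items>"
)

handler = AverageHandler()
xml.sax.parseString(SAMPLE, handler)
average = handler.average()
print(average)
```

Only the running list of values is kept; the document itself is never stored, which is the point of the event-driven style.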
One advantage of SAX is its speed and simplicity. Let's say
you've defined a complicated DTD for listing comic books, and you wish
to scan through your collection and list everything written by Neil
Gaiman. For this specialized task, there's no need to expend effort
examining elements for artists and editors and colourists, because
they're irrelevant to the search. You can therefore write a handler
class that ignores every element that isn't a writer element.
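A handler for the comic book search might look roughly like this. The howto does not show the DTD, so the element names comic, writer and artist, the title attribute, and the sample collection are assumptions invented for the sketch.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class WriterFilter(ContentHandler):
    """Collect the titles of comics whose <writer> text matches `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._title = None
        self._in_writer = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "comic":
            # Remember the title of the comic currently being read.
            self._title = attrs.get("title")
        elif name == "writer":
            self._in_writer = True
            self._text = []
        # Every other element (artist, editor, ...) is simply ignored.

    def characters(self, content):
        if self._in_writer:
            self._text.append(content)

    def endElement(self, name):
        if name == "writer":
            self._in_writer = False
            if "".join(self._text).strip() == self.wanted:
                self.matches.append(self._title)


# Made-up collection, used only for this illustration.
SAMPLE = b"""<collection>
  <comic title="First Example">
    <writer>Neil Gaiman</writer><artist>Some Artist</artist>
  </comic>
  <comic title="Second Example">
    <writer>Another Writer</writer><artist>Some Artist</artist>
  </comic>
</collection>"""

finder = WriterFilter("Neil Gaiman")
xml.sax.parseString(SAMPLE, finder)
print(finder.matches)
```

The handler never looks at artist elements at all; they fall through the if/elif chain untouched.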
Another advantage is that you don't have the whole document resident in memory at any one time, which matters if you are processing really huge documents.
SAX defines four basic interfaces. A SAX-compliant XML parser can be passed any objects that support these interfaces, and will call their methods as data is processed. Your task, therefore, is to implement those interfaces that are relevant to your application.
The SAX interfaces are:
Interface | Purpose |
---|---|
ContentHandler | Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements. |
DTDHandler | Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section 4.7) and unparsed entity declarations (XML spec section 4). |
EntityResolver | Called to resolve references to external entities. If your documents will have no external entity references, you won't need to implement this interface. |
ErrorHandler | Called for error handling. The parser will call methods from this interface to report all warnings and errors. |
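As an illustration of the ErrorHandler interface, the sketch below records every problem the parser reports and re-raises fatal ones. The class name CollectingErrorHandler and the deliberately malformed input are assumptions for the example; the warning, error and fatalError methods and SAXParseException.getMessage() come from the standard xml.sax package.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Record each reported problem; stop parsing on fatal errors."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", exception.getMessage()))

    def error(self, exception):
        self.problems.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        self.problems.append(("fatal", exception.getMessage()))
        raise exception  # a fatal error means the document is unusable


errors = CollectingErrorHandler()
try:
    # Deliberately malformed input: <b> is never closed.
    xml.sax.parseString(b"<a><b></a>", ContentHandler(), errors)
except xml.sax.SAXParseException:
    pass

for severity, message in errors.problems:
    print(severity, message)
```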
Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes. The default method
implementations are defined to do nothing (the method body is just a
Python pass statement), so usually you can simply ignore methods
that aren't relevant to your application.
Pseudo-code for using SAX looks something like this:
    import sys

    # Define your specialized handler classes
    from xml.sax import ContentHandler, ...

    class docHandler(ContentHandler):
        ...

    # Create an instance of the handler classes
    dh = docHandler()

    # Create an XML parser
    parser = ...

    # Tell the parser to use your handler instance
    parser.setContentHandler(dh)

    # Parse the file; your handler's methods will get called
    parser.parse(sys.stdin)
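The pseudo-code above can be filled in to give a small runnable program. This is one possible completion, not the only one: it assumes the parser is obtained from xml.sax.make_parser(), it reads from an in-memory stream rather than sys.stdin so that it is self-contained, and the events list and the sample document are invented for the illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class DocHandler(ContentHandler):
    """Record the order in which the parser calls the handler."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def endDocument(self):
        self.events.append("endDocument")


# Create an instance of the handler class
dh = DocHandler()

# Create an XML parser (assumption: the default parser from make_parser)
parser = xml.sax.make_parser()

# Tell the parser to use the handler instance
parser.setContentHandler(dh)

# Parse a file-like object; the handler's methods get called along the way
parser.parse(io.BytesIO(b"<doc><para>Hello</para></doc>"))

print(dh.events)
```

The characters method is not overridden here, so the text "Hello" is passed to the inherited do-nothing implementation and silently dropped.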