3 SAX: The Simple API for XML

The Simple API for XML isn't a standard in the formal sense, but an informal specification designed by David Megginson, with input from many people on the xml-dev mailing list. SAX defines an event-driven interface for parsing XML. To use SAX, you must create Python class instances which implement a specified interface, and the parser will then call various methods of those objects.

This howto describes version 2 of SAX (also referred to as SAX2). Earlier versions of this text explained SAX1, which is now mainly of historical interest.

SAX is most suitable when you want to read through an entire XML document from beginning to end and perform some computation, such as building a data structure representing the document, or summarizing information in it (computing an average value of a certain element, for example). It's not very useful if you want to modify the document structure in some complicated way that involves changing how elements are nested, though it could be used if you simply wish to change element contents or attributes. For example, you would not want to re-order chapters in a book using SAX, but you might want to change the contents of any name elements with the attribute lang equal to 'greek' into Greek letters.
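As a rough sketch of the summarizing case, the handler below averages the text of every price element in a small document. The element name price, the class name AverageHandler, and the sample data are invented for illustration; only ContentHandler and xml.sax.parseString come from the standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AverageHandler(ContentHandler):
    """Collect the numeric text of every <price> element (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.buffer = []
        self.values = []

    def startElement(self, name, attrs):
        if name == "price":
            self.in_price = True
            self.buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so accumulate it.
        if self.in_price:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "price":
            self.values.append(float("".join(self.buffer)))
            self.in_price = False

    def average(self):
        return sum(self.values) / len(self.values) if self.values else None


# Made-up sample document
SAMPLE = (b"<catalog><item><price>10</price></item>"
          b"<item><price>20</price></item></catalog>")

handler = AverageHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.average())  # prints 15.0
```

The handler keeps only the numbers it needs; nothing else from the document is stored.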

Two advantages of SAX are speed and simplicity. Let's say you've defined a complicated DTD for listing comic books, and you wish to scan through your collection and list everything written by Neil Gaiman. For this specialized task, there's no need to expend effort examining elements for artists and editors and colourists, because they're irrelevant to the search. You can therefore write a handler class which ignores all elements other than writer.
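Here is one way such a search might look. The document layout is an assumption made for this sketch (a comic element carrying a title attribute and containing a writer element), as are the class name FindWriter and the sample data. Apart from reading the title attribute so there is something to list, the handler ignores everything that isn't writer.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FindWriter(ContentHandler):
    """Record the titles of comics whose <writer> matches a given name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_title = None
        self.in_writer = False
        self.text = []
        self.matches = []

    def startElement(self, name, attrs):
        if name == "comic":
            # Assumed layout: the title is stored as an attribute.
            self.current_title = attrs.get("title")
        elif name == "writer":
            self.in_writer = True
            self.text = []
        # Any other element (artist, editor, ...) is simply ignored.

    def characters(self, content):
        if self.in_writer:
            self.text.append(content)

    def endElement(self, name):
        if name == "writer":
            self.in_writer = False
            if "".join(self.text).strip() == self.wanted:
                self.matches.append(self.current_title)


# Made-up sample collection
SAMPLE = b"""<collection>
  <comic title="Sandman">
    <writer>Neil Gaiman</writer><artist>Some Artist</artist>
  </comic>
  <comic title="Example Comic">
    <writer>Someone Else</writer><artist>Another Artist</artist>
  </comic>
</collection>"""

handler = FindWriter("Neil Gaiman")
xml.sax.parseString(SAMPLE, handler)
print(handler.matches)  # prints ['Sandman']
```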

Another advantage is that you don't have the whole document resident in memory at any one time, which matters if you are processing really huge documents.
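To illustrate, the sketch below hands a document to the parser in small chunks and keeps only a running count in memory. It assumes the default expat-based parser returned by xml.sax.make_parser(), which accepts input piece by piece through feed() and close(); the names CountElements and count_in_chunks and the tiny chunk size are chosen purely for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class CountElements(ContentHandler):
    """Count start tags without storing any part of the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_in_chunks(stream, chunk_size=16):
    """Feed a binary stream to the parser chunk_size bytes at a time."""
    handler = CountElements()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)   # events fire as soon as data is available
    parser.close()           # signals the end of the input
    return handler.count


# Made-up document: one <log> element containing 1000 empty <entry/> elements
data = io.BytesIO(b"<log>" + b"<entry/>" * 1000 + b"</log>")
print(count_in_chunks(data))  # prints 1001
```

In ordinary use you rarely need to call feed() yourself, since parser.parse() reads its input in blocks on your behalf; the explicit loop is only there to make the streaming visible.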

SAX defines four basic interfaces; a SAX-compliant XML parser can be passed any objects that support these interfaces, and will call their methods as data is processed. Your task, therefore, is to implement those interfaces that are relevant to your application.

The SAX interfaces are:

Interface: Purpose
ContentHandler: Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements.
DTDHandler: Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section 4.7) and unparsed entity declarations (XML spec section 4).
EntityResolver: Called to resolve references to external entities. If your documents will have no external entity references, you won't need to implement this interface.
ErrorHandler: Called for error handling. The parser will call methods from this interface to report all warnings and errors.

Python doesn't support the concept of interfaces, so the interfaces listed above are implemented as Python classes. Most of the default method implementations are defined to do nothing (the method body is just a Python pass statement), so usually you can simply ignore methods that aren't relevant to your application. The main exception is ErrorHandler, whose default error and fatalError methods raise the exception they are given.
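Subclassing one of these classes and overriding only the methods you care about might look like the following sketch, which uses ErrorHandler. The class name CollectingErrorHandler, the problems list, and the deliberately malformed sample are invented for the example. The fatalError method records the problem and then re-raises it, so parsing still stops at the first fatal error.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Keep a list of (severity, line, message) tuples for later reporting."""

    def __init__(self):
        self.problems = []

    def _record(self, severity, exception):
        self.problems.append(
            (severity, exception.getLineNumber(), exception.getMessage()))

    def warning(self, exception):
        self._record("warning", exception)

    def error(self, exception):
        self._record("error", exception)

    def fatalError(self, exception):
        self._record("fatal", exception)
        raise exception  # the document is not well-formed; stop here


# Malformed on purpose: <writer> is never closed.
BROKEN = b"<comic><writer>Neil Gaiman</comic>"

handler = CollectingErrorHandler()
try:
    # A plain ContentHandler is enough here: its methods all do nothing.
    xml.sax.parseString(BROKEN, ContentHandler(), handler)
except xml.sax.SAXParseException:
    pass

for severity, line, message in handler.problems:
    print(severity, "at line", line, ":", message)
```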

A skeleton for using SAX looks something like this:

import sys
from xml.sax import ContentHandler, make_parser

# Define your specialized handler classes
class docHandler(ContentHandler):
    ...  # override startElement(), characters(), etc. as needed

# Create an instance of the handler class
dh = docHandler()

# Create an XML parser
parser = make_parser()

# Tell the parser to use your handler instance
parser.setContentHandler(dh)

# Parse the input; your handler's methods will get called
parser.parse(sys.stdin)

