Last modified: Wed Dec 16 13:07:44 1998

Chapter 2: Parsing

Table of Contents

2.1 Introduction

2.1.1 XML Processor
2.1.2 Using IBM's XML for Java

2.2 Reading XML Document

2.3 Printing an XML Document from a Parsed Structure

2.4 Programming Interfaces for Document Structure

2.4.1 DOM: Object Model for XML Document
2.4.2 SAX: Event-Driven API of XML Processor
2.4.3 Element Handler: Yet another Event-Driven API

2.1 Introduction

2.1.1 XML Processor

As the first step towards building an XML-based web application, this chapter covers the basics of the most fundamental building block: the XML processor. An XML processor is a software module that reads XML documents and gives application programs access to their content and structure. What an XML processor should do when reading an XML document is precisely defined in the XML 1.0 Recommendation. Throughout this book we use XML for Java, a validating XML processor developed by IBM Research, Tokyo Research Laboratory, as one of the most robust and faithful implementations of an XML processor. You can find the complete XML for Java package on the attached CD-ROM, or you can download the latest version from IBM's web site at http://www.alphaworks.com/formula/xml/. The current release of XML for Java (the one on the CD-ROM) conforms to the XML 1.0 Recommendation issued on February 10th, 1998. IBM offers a free commercial license for XML for Java.

There are several other XML processors written in Java. Some of them are validating processors while others are non-validating. When reading an XML document, a non-validating processor checks the well-formedness constraints defined in the XML 1.0 specification and reports any violations. A validating processor must check the validity constraints as well as the well-formedness constraints. The behavior of a conforming XML processor is highly predictable, so switching to another conforming XML processor should not be difficult.
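The difference between the two kinds of processor can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the JAXP API (javax.xml.parsers) that ships with current JDKs rather than XML for Java, and the class name ConformanceCheck and the method parses() are hypothetical names chosen for this sketch. The DTD is placed in the internal subset so that no external file is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of XML for Java: uses JAXP from the JDK.
final class ConformanceCheck {

    /**
     * Returns true if the document parses without errors.
     * With validating == false only well-formedness is checked;
     * with validating == true validity errors also count.
     */
    static boolean parses(String xml, boolean validating) {
        final boolean[] ok = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported as (recoverable) errors.
                public void error(SAXParseException e) { ok[0] = false; }
                // Well-formedness violations are fatal; stop parsing.
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false;
        }
        return ok[0];
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE employee ["
                   + "<!ELEMENT employee (name)><!ELEMENT name (#PCDATA)>]>";
        String doc = dtd + "<employee/>";   // well-formed, but <name> is missing
        System.out.println("non-validating: " + parses(doc, false));
        System.out.println("validating:     " + parses(doc, true));
    }
}
```

The same document is accepted by the non-validating parse and rejected by the validating one, because the missing name element violates the content model but not the well-formedness rules.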

Although the XML 1.0 Recommendation precisely defines the behavior of an XML processor, it has nothing to say about its application programming interface (API). DOM and SAX are the two API specifications supported by many of the existing XML processors. This chapter introduces basic programming with DOM, SAX, and ElementHandler. ElementHandler is an event-driven API unique to IBM's XML for Java. While writing the sample programs in this book, we generally tried to avoid features specific to XML for Java. Whenever we use an API that is not defined in DOM or SAX, we state so clearly. Appendix B also gives a comparison with some of the major XML processors. Note that not all XML processors support DOM and/or SAX.

Note: XML processor and parser

The term XML processor has a specific meaning as defined in the XML 1.0 Recommendation. Sometimes XML processors are also called parsers because they read an XML document and dissect it into elements. Often these two terms are used interchangeably.

2.1.2 Using IBM's XML for Java

XML for Java is a Java class library, so you need basic Java programming skills to use it. First, let us set up your programming environment. You need the Java Development Kit (JDK) Version 1.1 or later because XML for Java is written in Java 1.1. If you do not have JDK 1.1 yet, download the latest release from Sun Microsystems' web site at http://java.sun.com/. At the time of writing, JDK 1.2 has just been released, but we developed the sample programs in this book with JDK 1.1.7B. We once had trouble executing XML for Java with JDK 1.1.7B. It turned out to be due to a software bug in the JIT compiler of the original JDK 1.1.6 distribution, and applying a service release fixed the problem. XML for Java has been tested against the latest JDK 1.2, so you should be able to use that as well. Since XML for Java is written in Java, in theory it can run on any operating system and any hardware on which Java 1.1 is supported. However, details of the operating environment, such as how to set environment variables, differ from platform to platform. We use Windows (both NT and 95/98) in the command line input/output examples throughout the book. If you use another platform, substitute the command prompts and shell commands appropriate for it.

The version of XML for Java used in developing the sample programs is 1.1.9. However, the CD-ROM may contain a later version of XML for Java. If that is the case, read xml4j_1_1_9 with the version number replaced by the appropriate one (e.g., xml4j_1_1_14 for version 1.1.14). To install XML for Java, first create a directory for it, move to that directory, and unzip xml4j_1_1_9 from the CD-ROM using one of the popular archive tools such as unzip, pkzip, or winzip (note that the tool must support long file names). Here we assume that you installed XML for Java in c:\xml4j.

It is important for you to set the environment variable CLASSPATH properly. CLASSPATH tells the Java interpreter where to find class libraries. In order to execute the examples given in this book you need to have the following jar file in your CLASSPATH:

Classes for XML for Java (c:\xml4j\xml4j_1_1_9.jar)

In addition, it may be convenient to include the current directory (.) in your CLASSPATH. A Windows command that sets the CLASSPATH environment variable accordingly is as follows:

c:\xml4j>set CLASSPATH=".;c:\xml4j\xml4j_1_1_9.jar;c:\xml4j\xml4jSamples_1_1_9.jar"

Note that on Unix platforms the command syntax may be different. Consult the manual of your platform. You may want to put the above line in your profile, such as c:\autoexec.bat, to avoid typing it every time you bring up a new command prompt. If you are using Windows NT 4.0, a convenient way is to open "System Properties" by right-clicking the "My Computer" icon, select the "Environment" tab, and add a new variable called CLASSPATH.

To see if the installation is successful, move to the installation directory (c:\xml4j) and try the following command.

C:\xml4j>java samples.XJParse.XJParse -format data/personal.xml

You will see formatted output of a sample XML file named personal.xml. If not, please double-check your installation, especially the CLASSPATH environment variable. You can also tell the Java interpreter where to find classes without setting CLASSPATH, as follows.

C:\xml4j>java -classpath ".;c:\xml4j\xml4j_1_1_9.jar;c:\xml4j\xml4jSamples_1_1_9.jar;c:\jdk1.1.6\lib\classes.zip" samples.XJParse.XJParse -format data/personal.xml
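If the sample does not run, it can help to check directly whether the Java interpreter is able to locate the parser class. The following is a minimal sketch for a current JDK, not a tool shipped with XML for Java; the class name ClasspathCheck and the method isOnClasspath() are hypothetical names introduced for this illustration. Only com.ibm.xml.parser.Parser is taken from the text.

```java
// Hypothetical diagnostic helper; not part of XML for Java.
final class ClasspathCheck {

    /** Returns true if the named class can be located by the current class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "com.ibm.xml.parser.Parser";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
        // Show what the interpreter actually uses as its class path.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Running it with no arguments reports whether the XML for Java parser class is visible and prints the effective class path, which makes a mistyped jar name easy to spot.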

Now you are ready to try the sample programs in the CD-ROM. Go to the R:\samples\chap2 directory, where R is the drive letter of your CD-ROM. Try the first sample by entering the command:

R:\samples\chap2>java SimpleParse department.xml

You will see nothing, which is expected because this first sample program produces no output. All the example programs shown in this book are included on the CD-ROM. Take a few moments to explore it. The samples in this chapter can be found in the \samples\chap2 directory.

We presume that you have basic skills in writing and executing Java programs. If you do not, there are many good textbooks on Java programming that will serve your purposes. It will also help to have a basic set of program development tools, such as a text editor.

2.2 Reading XML Document

Let us prepare a simple XML document called department.xml and read it with XML for Java. This document represents a set of employee records in a department.

`department.xml`: a simple XML document

<?xml version="1.0"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
  <employee id="J.D">
    <name>John Doe</name>
    <email>John.Doe@foo.ibm.com</email>
  </employee>

  <employee id="B.S">
    <name>Bob Smith</name>
    <email>Bob.Smith@foo.com</email>
  </employee>

  <employee id="A.M">
    <name>Alice Miller</name>
    <url href="http://www.trl.jp.ibm.com/~amiller/"/>
  </employee>
</department>

The DTD of this document is given in a separate file, department.dtd, as shown below. The meanings of the tags should be self-explanatory.

`department.dtd`: DTD for `department.xml`

<!ELEMENT department (employee)*>
<!ELEMENT employee (name, (email | url))>
<!ATTLIST employee id CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url EMPTY>
<!ATTLIST url href CDATA #REQUIRED>

Let us try to read department.xml with XML for Java. Run the sample program SimpleParse in the CD-ROM that is located in the R:\samples\chap2 directory with the following command:

R:\samples\chap2>java SimpleParse department.xml
R:\samples\chap2>

It reads the XML document from a file, analyzes the document structure and creates a corresponding data structure in the computer memory. The structure can be later referred to or manipulated by application programs as Java objects.

As you can see, this program produces no output. However, XML for Java did its job properly. The program opened department.xml, found that department.dtd is the DTD of this document and read it as well, and then analyzed department.xml according to the syntax defined in the DTD. Seeing nothing means that no well-formedness or validity constraints were violated.

Now let us take you through the source code of this program (Listing 2.1). As you can imagine, this is a very short program, but it shows the basics of how to use XML for Java.

XML for Java is in the package com.ibm.xml.parser. In SimpleParse, we need to import the Parser class. The parser returns a DOM structure whose type is the org.w3c.dom.Document interface. The first two lines of the program import these two types.

import com.ibm.xml.parser.Parser;
import org.w3c.dom.Document;

com.ibm.xml.parser.Parser is the class that parses an XML document. Later, more classes from the package will be required, so instead of having one line for each imported class, you can tell the compiler to import all the classes in the com.ibm.xml.parser package.

import com.ibm.xml.parser.*;

In this book, however, we explicitly list all the classes used in each example so that you can see which classes are actually used.

The most important part of this program is condensed into the following three lines:

FileInputStream is = new FileInputStream(argv[0]);
Parser parser = new Parser(argv[0]);
Document doc = parser.readStream(is);

In plain English, these lines

Create an input stream for reading a file,
Create an instance of Parser object, and
Pass the stream to the parser.

In the above fragment, argv[0] refers to the first command line argument, which is supposed to be the name of a file to be opened. Why does the Parser require the file name as its argument when it is created? The answer is that it uses the file name to be included in error messages if there are any. If parsing succeeds without errors, an instance of the class Document, which represents the entire document structure, is created and assigned to the variable doc.
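For readers using a processor other than XML for Java, the same three steps can be sketched with the JAXP API included in current JDKs. This is an illustration under assumptions, not the API described in this book: the class name StreamParseSketch and its parse() method are hypothetical. InputSource.setSystemId() plays a role similar to the file name passed to the Parser constructor, since the system identifier is what the parser reports in its error messages.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical JAXP counterpart of the three lines above; not XML for Java code.
final class StreamParseSketch {

    /** Parses the stream; 'name' identifies the input in error messages. */
    static Document parse(InputStream in, String name) {
        try {
            // Create a parser instance.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the stream and attach the name as its system identifier.
            InputSource source = new InputSource(in);
            source.setSystemId(name);
            // Pass the stream to the parser and get the DOM tree back.
            return builder.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse " + name, e);
        }
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            Document doc = parse(in, args[0]);
            System.out.println("Root element: "
                    + doc.getDocumentElement().getNodeName());
        }
    }
}
```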

Below is the complete program listing of SimpleParse. In this book, we tried to use the standard APIs as much as possible, but sometimes we are forced to use features unique to XML for Java. In the program listings in Part One, we marked the lines that depend on XML for Java with "@XML4J."

Listing 2.1: `SimpleParse.java`: Read and parse an XML document

/**
 *       SimpleParse.java
 **/
import org.w3c.dom.Document;
import com.ibm.xml.parser.Parser;
import java.io.FileInputStream;

public class SimpleParse {
    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Missing filename.");
            System.exit(1);
        }
        try {
            // Open specified file.
            FileInputStream is = new FileInputStream(argv[0]);
            // Start to parse
            Parser parser = new Parser(argv[0]);   // @XML4J
            Document doc = parser.readStream(is);
            // Error?
            if (parser.getNumberOfErrors() > 0) {  // @XML4J
                System.exit(1);                 // If the document has any error,
                // the program is terminated.
            }
            // Codes for process will be here
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

For the sake of completeness, this program also contains a test of the command line arguments, a test of the number of errors found during parsing, and exception handling.

You may think that this program has no practical value because it does not produce any output. In fact, it has a meaningful use as a syntax checker: it can tell you whether the input XML document is well-formed and valid. Let us feed an invalid document, department2.xml, to SimpleParse.

`department2.xml`: invalid XML document

<?xml version="1.0"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
  <employee id="J.D">
    <name>John Doe</name>
  </employee>

  <employee id="B.S">
    <name>Bob Smith</name>
    <email>Bob.Smith@foo.com</email>
  </employee>

  <employee id="A.M">
    <name>Alice Miller</name>
    <url href="http://www.trl.jp.ibm.com/~amiller/"/>
  </employee>
</department>

Note that the first employee (John Doe) has neither an <email> nor a <url> tag. According to the DTD, an <employee> element must contain a <name> element followed by either an <email> or a <url> element, so department2.xml is not a valid document. SimpleParse generates the following error message when reading department2.xml (actual error messages may differ depending on your locale; see the Note "Error messages in XML for Java").

R:\samples\chap2>java SimpleParse department2.xml
department2.xml: 6, 13: Content mismatch in `<employee>'. Content model is ``(name,(email|url))'.

Note: Error messages in XML for Java

Java 1.1 has an ability to switch resources such as messages based on the current locale setting. XML for Java uses this capability to switch error messages. Currently English and Japanese error messages are distributed in the package. If you are running XML for Java in a Japanese environment, the above messages will be shown in Japanese.

This message tells you that the content of the employee element is invalid according to the content model defined in the DTD. In this case, the getNumberOfErrors() method returns a value greater than zero, and the program terminates with return value 1.
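Other processors report errors differently. As a rough counterpart of getNumberOfErrors(), the sketch below collects the errors reported by a validating JAXP parser through a SAX ErrorHandler. It is an illustration under assumptions: ValidityErrors, DEPARTMENT_DTD, and collect() are hypothetical names, the DTD is copied into the internal subset so that the example needs no files, and the message wording produced by the JDK parser differs from the XML for Java message shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper using JAXP; not part of XML for Java.
final class ValidityErrors {

    /** department.dtd, embedded as an internal subset for this sketch. */
    static final String DEPARTMENT_DTD = "<!DOCTYPE department ["
            + "<!ELEMENT department (employee)*>"
            + "<!ELEMENT employee (name, (email | url))>"
            + "<!ATTLIST employee id CDATA #REQUIRED>"
            + "<!ELEMENT name (#PCDATA)>"
            + "<!ELEMENT email (#PCDATA)>"
            + "<!ELEMENT url EMPTY>"
            + "<!ATTLIST url href CDATA #REQUIRED>]>";

    /** Returns one "line, column: message" entry per reported error. */
    static List<String> collect(String xml) {
        final List<String> errors = new ArrayList<>();
        ErrorHandler handler = new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors.add(describe(e)); }
            public void fatalError(SAXParseException e) { errors.add(describe(e)); }
        };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // A fatal error ends the parse; normally the handler recorded it already.
            if (errors.isEmpty()) {
                errors.add(String.valueOf(e.getMessage()));
            }
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    private static String describe(SAXParseException e) {
        return e.getLineNumber() + ", " + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) {
        String invalid = DEPARTMENT_DTD
                + "<department><employee id=\"J.D\"><name>John Doe</name></employee></department>";
        List<String> errors = collect(invalid);
        for (String error : errors) {
            System.out.println(error);
        }
        System.exit(errors.isEmpty() ? 0 : 1);   // same convention as SimpleParse
    }
}
```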

We have shown the simplest program that reads an XML document using XML for Java. It is simple, but it serves as a basis for more complex programs, and it can be used as an XML syntax checker as it is. In the next section, we will show how to generate an XML document from the internal structure.

Note: International language encoding of XML documents

Every XML processor is required to support at least two Unicode encodings, UTF-8 and UTF-16. UTF-16 is a uniform 16-bit encoding of the Unicode character set, while UTF-8 is a variable-length encoding. The XML 1.0 Recommendation allows you to specify the character encoding of an XML document by declaring it in the first line of the document, as in <?xml version="1.0" encoding="UTF-16"?>. XML for Java supports 19 different encodings specified this way. How can an XML processor recognize that this line is encoded in UTF-16 before reading it? Isn't this a chicken-and-egg problem? The answer is that, since an XML document normally starts with <?xml, the processor can read the first four octets (bytes) of the file and determine the basic character encoding scheme (single-byte basis such as UTF-8, double-byte basis such as UTF-16, and so on). Since only a limited set of characters is allowed in the first line, this is sufficient to parse that line. After parsing the encoding declaration, the processor switches to the specified encoding. If the processor cannot determine the basic encoding scheme from the first four octets, it assumes that the encoding is UTF-8. Therefore, a document with an encoding other than UTF-8 must start with <?xml.
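The detection step described in this note can be sketched as a small function over the first four octets. This is a simplified illustration, not the code used by XML for Java: the class name EncodingSniffer, the method guess(), and the returned labels are hypothetical, and cases such as four-byte (UCS-4) and EBCDIC encodings are deliberately left out.

```java
// Simplified, hypothetical sketch of encoding autodetection; not XML for Java code.
final class EncodingSniffer {

    /**
     * Guesses the basic encoding scheme from the first octets of a document.
     * "ASCII-compatible" means the encoding declaration can now be read
     * byte by byte to learn the exact encoding.
     */
    static String guess(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);

        // Byte order marks.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // "<?" encoded with two octets per character, without a byte order mark.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";

        // "<?xm" encoded with one octet per character.
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return "ASCII-compatible";

        // Nothing recognized: fall back to the default encoding.
        return "UTF-8";
    }

    /** Returns the unsigned value of octet i, or -1 if the input is too short. */
    private static int at(byte[] bytes, int i) {
        return i < bytes.length ? (bytes[i] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        byte[] head = { 0x3C, 0x3F, 0x78, 0x6D };   // "<?xm"
        System.out.println(guess(head));
    }
}
```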

2.3 Printing an XML Document from a Parsed Structure

Our second program, SimpleParseAndPrint (Listing 2.2), is a minimal modification of the first one, SimpleParse. After reading and parsing an XML document, SimpleParseAndPrint prints it to the standard output. The only difference in the new program is the addition of the following lines:

import com.ibm.xml.parser.TXDocument;
import java.io.PrintWriter;
  ...
   ((TXDocument)doc).print(new PrintWriter(System.out));

Remember that the variable doc represents the entire internal structure of an XML document. Unfortunately, the DOM interface Document has no method for generating an XML document. Since Document is an interface, the variable doc actually holds an instance of class TXDocument, the implementation of the Document interface provided by XML for Java. TXDocument has a print() method that takes a java.io.Writer as its argument and generates an XML document from the internal structure. In this example, the output goes to the standard output (System.out) with the character encoding determined by the current locale. Note that TXDocument and print() are implementation-specific, so you need to use something else if you are using a different XML processor. Listing 2.2 below shows the complete listing of SimpleParseAndPrint.

Listing 2.2: `SimpleParseAndPrint.java`: Reproduce an XML document from a parsed structure

/**
 *       SimpleParseAndPrint.java
 */
import com.ibm.xml.parser.Parser;
import com.ibm.xml.parser.TXDocument;
import java.io.FileInputStream;
import java.io.PrintWriter;
import org.w3c.dom.Document;

public class SimpleParseAndPrint {
    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Require a filename.");
            System.exit(1);
        }
        try {
            // Open specified file.
            FileInputStream is = new FileInputStream(argv[0]);
            // Start to parse
            Parser parser = new Parser(argv[0]);   // @XML4J
            Document doc = parser.readStream(is);  // @XML4J
            // Error?
            if (parser.getNumberOfErrors() > 0) {  // @XML4J
                System.exit(1);                 // If the document has any error,
                // the program is terminated.
            }
            // Print to the standard output
            // in XML format.
            ((TXDocument)doc).print(new PrintWriter(System.out));  // @XML4J
            // Codes for process will be here
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What will you see when executing this program? Here is the result.

R:\samples\chap2>java SimpleParseAndPrint department.xml
<?xml version="1.0"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
  <employee id="J.D">
    <name>John Doe</name>
    <email>John.Doe@foo.ibm.com</email>
  </employee>

  <employee id="B.S">
    <name>Bob Smith</name>
    <email>Bob.Smith@foo.com</email>
  </employee>

  <employee id="A.M">
    <name>Alice Miller</name>
    <url href="http://www.trl.jp.ibm.com/~amiller/"/>
  </employee>
</department>

Note that the input file department.xml is reproduced exactly on the output. XML for Java is designed to preserve the appearance of the input as much as possible, including white space and line breaks. The handling of white space is discussed further in Chapter 3.
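Because print() is specific to XML for Java, a different mechanism is needed with other processors. One option available in current JDKs is an identity transformation from the javax.xml.transform package, sketched below. This is an assumption-laden illustration rather than the method used in this book: DomPrintSketch and roundTrip() are hypothetical names, and the output of this serializer is not guaranteed to preserve the input layout the way XML for Java does.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical JAXP-based alternative to TXDocument.print(); not XML for Java code.
final class DomPrintSketch {

    /** Parses the text into a DOM tree and writes the tree back out as text. */
    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(
                "<department><employee id=\"J.D\"><name>John Doe</name></employee></department>"));
    }
}
```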

2.4 Programming Interfaces for Document Structure

Through the previous two program examples, we have learned how to read an XML document and how to print it out. The next step is to process an XML document by accessing its internal structure in order to do more useful tasks. To do so, we need application programming interfaces (APIs).

DOM, SAX, ElementHandler

The APIs that are widely used in XML processors today are DOM (Document Object Model) and SAX (Simple API for XML). DOM (Level 1) is a tree structure-based API and was issued as a W3C Recommendation in October 1998. SAX is an event-driven API developed by David Megginson and a number of people on the xml-dev mailing list. Although SAX is not sanctioned by any standards body, most of the available XML processors support it because of its simplicity. In addition to DOM and SAX, XML for Java provides an additional proprietary API called ElementHandler. Let us take a brief comparative look at these three APIs before presenting programming examples.

In DOM, an XML document is represented as a tree whose nodes are elements, texts, and so on. An XML processor generates the tree and hands it to an application program (in Figure 2.1, "DOM" shows this). DOM provides a set of APIs to access and manipulate the nodes in the DOM tree. Since a DOM-based XML processor creates the entire structure of an XML document in memory, it is suitable for applications that operate on the document as a whole. In particular, the following situations are best suited for the use of DOM.

When structurally modifying an XML document. For example,

Sorting elements in a particular order
Moving some elements from one place of the tree to another
Etc.

When sharing the document in memory with other applications.
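As an illustration of a structural modification, the sketch below sorts the employee elements of a department document by name. It is written against the standard org.w3c.dom interfaces, with JAXP from a current JDK standing in for the parser; DomSortSketch and its methods are hypothetical names, and the code assumes that every employee has a non-empty name element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example of reordering a DOM tree; not taken from the CD-ROM samples.
final class DomSortSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Text of the first name element; assumes it exists and is not empty. */
    static String nameOf(Element employee) {
        return employee.getElementsByTagName("name").item(0)
                .getFirstChild().getNodeValue();
    }

    /** Reorders the employee children of the root element by name. */
    static void sortEmployeesByName(Document doc) {
        Element root = doc.getDocumentElement();
        List<Element> employees = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals("employee")) {
                employees.add((Element) n);
            }
        }
        Comparator<Element> byName = Comparator.comparing(DomSortSketch::nameOf);
        employees.sort(byName);
        for (Element employee : employees) {
            // appendChild moves a node that is already in the tree to the end.
            root.appendChild(employee);
        }
    }

    /** Returns the employee names in document order. */
    static List<String> namesInDocumentOrder(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList list = doc.getElementsByTagName("name");
        for (int i = 0; i < list.getLength(); i++) {
            names.add(list.item(i).getFirstChild().getNodeValue());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<department>"
                + "<employee><name>John Doe</name></employee>"
                + "<employee><name>Bob Smith</name></employee>"
                + "<employee><name>Alice Miller</name></employee>"
                + "</department>");
        sortEmployeesByName(doc);
        System.out.println(namesInDocumentOrder(doc));
    }
}
```

The sort needs all employee elements at once, which is why a tree in memory is the natural fit for this kind of task.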

On the other hand, an XML processor with SAX does not create a data structure. Instead, it scans the input XML document and generates events such as the start of an element, the end of an element, and so on (see Figure 2.1, "SAX"). The application program intercepts these events and does whatever is appropriate for its task. SAX is more efficient than DOM because it does not create an explicit data structure. Therefore it is a good choice on the following occasions.

When dealing with a large document that does not fit in the memory.
When doing tasks on elements that are independent of the surrounding document structure, e.g.,

Counting the total number of elements in a document
Extracting contents of a specific element
Etc.

Figure 2.1: DOM and SAX API
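The first SAX task in the list above, counting the elements in a document, can be sketched as follows. Note the assumptions: the example uses the SAX2 interfaces (DefaultHandler, Attributes) and the JAXP SAXParserFactory bundled with current JDKs, which postdate the SAX version available when this chapter was written, and SaxCountSketch and countElements() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical event-driven example; uses SAX2 via JAXP, not XML for Java.
final class SaxCountSketch {

    /** Counts the elements in the document without building a tree. */
    static int countElements(String xml) {
        final int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // One event per start tag (or empty-element tag).
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements(
                "<department><employee><name>John Doe</name></employee></department>"));
    }
}
```

The handler keeps only a counter, so memory use does not grow with the size of the document.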

XML for Java supports both DOM and SAX, as well as its proprietary ElementHandler API. This third API is event-driven like SAX, but it also creates a DOM tree. Therefore, it is suitable for tasks that are event-oriented but in which the application program also needs to manipulate the internal structure of an element.

In the following subsections, we take the simple task of extracting information from an XML document that holds a set of attribute and value pairs, and explain how each of these three APIs can be used for this task.

2.4.1 DOM: Object Model for XML Document

This section introduces the object structure as defined by Document Object Model, or DOM. DOM defines a set of Java interfaces to create, access, and manipulate internal structures of XML documents.

XML is a language for describing tree-structured data. In XML, an element is represented by a start tag and a matching end tag (or by an empty-element tag). An element may contain one or more elements between its start and end tags, so an entire document forms a nested tree. For example, our department example department.xml can be depicted as in Figure 2.2. (Footnote: For simplicity, text nodes consisting solely of white space are not shown in the figure. Chapter 3 elaborates on structures containing white space text.) Each pair of tags (i.e., start and end tags) corresponds to an Element node (square box in the figure), and each chunk of text surrounded by two tags corresponds to a Text node (string in the figure). These nodes are defined as objects in DOM.

Figure 2.2: Structure of an XML Document (department.xml)
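The correspondence between tags and nodes can also be made visible in code. The sketch below walks a DOM tree and prints one line per Element or Text node, skipping text that consists only of white space, as the figure does. It is an illustration under assumptions: JAXP from a current JDK is used as the parser, and DomOutlineSketch and outline() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical tree walker; uses only standard DOM navigation methods.
final class DomOutlineSketch {

    /** Returns an indented outline of the Element and Text nodes in the document. */
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void walk(Node node, int depth, StringBuilder out) {
        if (node instanceof Text) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {          // skip white-space-only text
                indent(depth, out);
                out.append("Text \"").append(text).append("\"\n");
            }
            return;
        }
        if (node instanceof Element) {
            indent(depth, out);
            out.append("Element ").append(node.getNodeName()).append('\n');
        }
        // Visit the children from first to last.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    private static void indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<department>\n  <employee id=\"J.D\">\n"
                + "    <name>John Doe</name>\n  </employee>\n</department>"));
    }
}
```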

The term "document object model" has been used to refer to models that define the structure of an HTML document and allow scripting languages such as JavaScript to access the elements of that structure. You may have experience writing JavaScript programs that manipulate the value of an input field within a Form element. For example, "document.forms(1).username.value" refers to the value of the input field with the name "username" in the first form element in an HTML document. However, because the current HTML object models are dependent on browsers, in general multiple pages have to be prepared for different browsers. One of the goals of the W3C DOM Working Group is to define a common, interoperable document object model for HTML.

The W3C DOM Working Group is also defining a document object model for XML. Since the XML and HTML object models have so much in common, the definitions that overlap between them are called the Core Document Object Model. Most of the XML-related definitions are in the core, but there are a few XML-specific objects (called Extended XML Interfaces) specified in the Core document.

Primarily, DOM defines a platform- and language-neutral interface for application programs in terms of a standard set of objects. To help interoperability, DOM defines APIs for OMG IDL, Java, and ECMAScript (called language bindings). The current DOM specification is called Level-1 DOM. It is expected that DOM will evolve by incorporating additional features.

From an object-oriented programming viewpoint, the DOM API is a set of interfaces that should be implemented by a particular DOM implementation. XML for Java is one example of such a DOM implementation. The following table shows the interfaces and classes defined in DOM (Core) Level 1.

Table 2.1: Core Interface of DOM (based on REC-DOM-19981001)
Interface Name	Description	Implementation classes in XML for Java
Node	The `Node` is the primary datatype representing a single node in the document tree.	Child or Parent
Document	Represents the entire XML document.	TXDocument
Element	Represents an element and any contained nodes.	TXElement
Attr	Represents an attribute in an `Element` object.	TXAttribute
ProcessingInstruction	Represents a processing instruction.	TXPI
CDATASection	Represents a CDATASection	TXCDATASection
DocumentFragment	A lightweight document object used for representing multiple subtrees or partial documents	TXDocumentFragment
Entity	Represents an entity, parsed or unparsed in a `DocumentType` object	EntityDecl.EntityImpl
EntityReference	Represents an entity reference, as appeared in the document tree	GeneralReference
DocumentType	Represents a DTD, which contains a list of entities	DTD
Notation	Represents a notation declared in the DTD. A notation declares, by name, the format of an unparsed entity.	TXNotation.NotationImpl
CharacterData	A parent interface of `Text` and others, which require operations such as insert and delete string.	TXCharacterData
Comment	Represents a comment.	TXComment
Text	Represents a text	TXText
DOMException	An exception thrown when no further processing is possible. Normal errors are reported by return values.	TXDOMException
DOMImplementation	Intended to be a placeholder of methods that are not dependent on specific DOM implementations.	NA
NodeList	Represents an ordered collection of nodes. The items in the NodeList are accessible via an integral index, starting from 0.
NamedNodeMap	Represents a collection of nodes that can be accessed by name.

The table also shows the implementation classes provided by XML for Java. For example, the interface Document is implemented by the class TXDocument in XML for Java.

Note: What does TX mean?

As is always true of good (i.e., long-lived) software, XML for Java contains names that have been inherited from its early development days. TX originally stood for Tokyo Research Laboratory's XML processor; IBM has seven research laboratories worldwide, and Tokyo is one of them. So TX does not really mean anything to you. However, we intentionally kept the prefix in later releases because, without an uncommon prefix, class names such as Element and Text would be so general that application programmers might be forced, at times, to specify the full class name, as in com.ibm.xml.parser.Text, to avoid confusion with other classes. We wanted something uncommon as our prefix. TX is as meaningless as any other random prefix, so why not TX?

Figure 2.3 shows the class hierarchy of the DOM Level 1 Core interfaces. Note that Node is the primary data type from which a tree structure is built. The constituents of a DOM tree, such as Element, Text, and Attr, are all defined as interfaces derived from Node.

Figure 2.3: Class/Interface Hierarchy of DOM Level 1 Specification (W3C Recommendation)

Now let us look at DOM in action. One of the most frequently required operations on XML documents is extracting the text content of elements. The program we use to illustrate the use of DOM (MakeEltTblDOM.java) extracts key-value pairs from an XML document such as keyval.xml, shown below.


<?xml version="1.0"?>
<keyval>
  <key>URL</key>
  <value>http://www.ibm.com/xml</value>
  <key>Owner</key>
  <value>IBM</value>
</keyval>

keyval.xml

To get strings such as "URL" and "http://www.ibm.com/xml", we first find the relevant elements in the DOM tree and then extract their text content. The complete listing of MakeEltTblDOM.java is shown in Listing 2.3.

Listing 2.3: `MakeEltTblDOM.java`: Extract key-value pairs and store them in a hash table (DOM-based implementation).

import com.ibm.xml.parser.Parser;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Hashtable;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.EntityReference;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

/**
 * MakeEltTblDOM.java
 **/
public class MakeEltTblDOM {

    static public void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Missing filename.");
            System.exit(1);
        }
        try {
            // Open specified file

            InputStream is = new FileInputStream(argv[0]);

            // Start parsing
            Parser parser = new Parser(argv[0]);   // @XML4J
            Document doc = parser.readStream(is);  // @XML4J

            // Check whether there are any errors
            if (parser.getNumberOfErrors() > 0) { // @XML4J
                System.exit(1);
            }
            // Document is well-formed

            // Create hashtable for string key-value pairs
            Hashtable hash = new Hashtable();

            String key = null, value = null;
            
            // Traverse all the children of the root element
            for (Node kvchild = doc.getDocumentElement().getFirstChild();
                kvchild != null;
                kvchild = kvchild.getNextSibling()) {
                // When child is an element
                if (kvchild instanceof Element) {
                    // If tag name is "key", store its content in key
                    if (kvchild.getNodeName().equals("key")) {
                        key = makeChildrenText(kvchild);

                    // If tag name is "value"
                    } else if (kvchild.getNodeName().equals("value")) {
                        // Extract the text content from the child
                        value = makeChildrenText(kvchild);
                        // Check that a key is specified and
                        // store the key-value pair in the hashtable
                        if (key != null) {
                            hash.put(key, value);
                            key = null;
                        }
                    }
                }
            }
            // Display the hashtable
            System.out.println(hash);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static String makeChildrenText (Node node){
        // Create a StringBuffer to store the result.
        // StringBuffer is more efficient than String
        StringBuffer buffer = new StringBuffer();
        return makeChildrenText1 (node, buffer);
    }

    private static String makeChildrenText1 (Node node, StringBuffer buffer){
        // Visit all the child nodes
        for (Node ch = node.getFirstChild();
             ch != null;
             ch = ch.getNextSibling()) {
            // Recursively call if the child may have children
            if (ch instanceof Element || ch instanceof EntityReference) {
                buffer.append(makeChildrenText(ch));
            // If the child is a text, append it to the result buffer
            } else if (ch instanceof Text) {
                buffer.append(ch.getNodeValue());
            }
        }

        return buffer.toString();
    }
}

As usual, our program first reads an XML document from a file and creates a DOM tree in the variable doc. We learned how to read an XML document in previous examples. After reading the document, our program scans all the child elements of the root element to locate key elements and value elements.

    // Traverse all the children of the root element
    for (Node kvchild = doc.getDocumentElement().getFirstChild();
           kvchild != null;
           kvchild = kvchild.getNextSibling()) {
    ...
    }

The getDocumentElement() method returns the root element of the document. The getFirstChild() method returns the first child node of a node; this may be an element, but it may also be a text node or another kind of node, which is why Listing 2.3 tests kvchild with instanceof Element. Together, the first line above takes the first child node of the root element and assigns it to kvchild, a variable of type Node. On each iteration, the next child is retrieved by the getNextSibling() method, which returns null after the last child. With DOM, this is the standard way to iterate over the child nodes of a given node. The Node interface has other methods for accessing further information about a node. Chapter 4 covers DOM programming in more detail.
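
The iteration idiom can be tried with any DOM implementation. The following is a minimal sketch, not part of XML for Java: it assumes the JAXP parser bundled with current Java runtimes in place of com.ibm.xml.parser.Parser, and the class name ChildWalk, the method childKinds(), and the sample document in main() are made up for illustration. It labels each child node of the root element so that you can see which nodes the instanceof Element test lets through.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration class; not part of XML for Java. */
class ChildWalk {

    /** Returns one label per child node of the root element, in document order. */
    public static List<String> childKinds(String xml) {
        try {
            // Assumption: the JAXP parser stands in for com.ibm.xml.parser.Parser
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> kinds = new ArrayList<String>();
            // The standard DOM idiom: first child, then next sibling until null
            for (Node child = doc.getDocumentElement().getFirstChild();
                 child != null;
                 child = child.getNextSibling()) {
                String kind = (child instanceof Element) ? "element" : "other";
                kinds.add(kind + ":" + child.getNodeName());
            }
            return kinds;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input with line breaks between the tags
        String xml = "<keyval>\n  <key>URL</key>\n"
                   + "  <value>http://www.ibm.com/xml</value>\n</keyval>";
        System.out.println(childKinds(xml));
    }
}
```

With this parser and no DTD, the line breaks and indentation between the tags are reported as text nodes, so the list contains entries other than the key and value elements.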

The private method makeChildrenText() (which simply calls makeChildrenText1()) in the bottom half of the program extracts all the text in the descendant nodes and returns its concatenation. This is necessary because <key> and <value> elements may themselves contain subelements, nested to any depth. You can see how the program recursively descends into subtrees. The point here is the test on whether a node may have child nodes (that is, whether it is an Element or an EntityReference). If it is, the process descends recursively into its child nodes.

         ...
        // Visit all the child nodes
        for (Node ch = node.getFirstChild();
             ch != null;
             ch = ch.getNextSibling()) {
            // Recursively call if the child may have children
            if (ch instanceof Element || ch instanceof EntityReference) {
                buffer.append(makeChildrenText(ch));
            // If the child is a text, append it to the result buffer
            } else if (ch instanceof Text) {
                buffer.append(ch.getNodeValue());
            }
        }
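
A self-contained variant of the same traversal is sketched below. It assumes the JAXP parser rather than XML for Java, and the names TextCollector, childrenText(), and rootText() are hypothetical. It differs from Listing 2.3 in one respect: the recursion passes the same buffer down instead of creating a new one at each level. Because CDATASection extends Text in the DOM interfaces, the contents of CDATA sections are collected as well, while comments are skipped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.EntityReference;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical illustration class; not part of XML for Java. */
class TextCollector {

    /** Concatenates all the text found in the descendants of node. */
    public static String childrenText(Node node) {
        StringBuffer buffer = new StringBuffer();
        collect(node, buffer);
        return buffer.toString();
    }

    /** Appends the text below node to buffer, descending recursively. */
    private static void collect(Node node, StringBuffer buffer) {
        for (Node ch = node.getFirstChild(); ch != null; ch = ch.getNextSibling()) {
            if (ch instanceof Element || ch instanceof EntityReference) {
                // The child may have children of its own: reuse the same buffer
                collect(ch, buffer);
            } else if (ch instanceof Text) {
                // Text and CDATASection nodes both carry their text as the node value
                buffer.append(ch.getNodeValue());
            }
        }
    }

    /** Parses xml (assumption: JAXP parser) and returns the text of the root element. */
    public static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return childrenText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<value>http://<b>www.ibm.com</b>/xml</value>"));
    }
}
```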

Note:

XML for Java provides a number of useful methods for the class TXElement. One of them is getText(), whose functionality is the same as makeChildrenText(). You can replace the line
value = makeChildrenText(kvchild);
with
value = ((TXElement)kvchild).getText();
and remove the definition of method makeChildrenText(). The method getText() will be used in Section 2.5.

Running this program produces the following output.

R:\samples\chap2>java MakeEltTblDOM keyval.xml
{URL=http://www.ibm.com/xml, Owner=IBM}

In this section, we introduced DOM, the object model for XML documents, and showed a simple program that uses the DOM API. DOM is one of the most fundamental APIs for processing XML documents in programs. Chapter 3 and later chapters make heavy use of DOM, and Chapter 4 has more on programming techniques using DOM. In addition, we provide the complete Java binding of DOM in Appendix D. The next subsection covers the other fundamental API, SAX.

2.4.2 SAX: Event-Driven API of XML Processor

In Section 2.4.1, we surveyed how to use the DOM API to access the structure of an XML document, where the entire process is divided into two phases: parsing an XML document and then accessing the DOM tree. We now introduce two alternative ways to access the document structure, both using event-driven APIs. An application that wishes to receive information on document structure, such as the element type name of each element, the attribute names appearing in an element, and the value of each attribute, can register handlers with the XML processor. The processor notifies the handlers of events such as the start of an element, character data, and so on. Unlike the DOM approach, the entire process is one-pass; that is, any application-specific operations are performed during parsing.

In this subsection and the next, we explain two different event-driven APIs. One is the Simple API for XML (SAX), which is supported by several other XML processors. The other is ElementHandler, which is specific to XML for Java. We first show how SAX can be used for simple rewriting of XML documents.

The Simple API for XML, or SAX, is an event-driven API for XML processors. SAX is designed as a lightweight API that does not involve the generation of internal structures.

Applications are required to register event handlers with a parser object that implements the org.xml.sax.Parser interface. SAX has three handler interfaces: DocumentHandler, DTDHandler, and ErrorHandler. SAX also provides the class HandlerBase, which supplies a default implementation of all of these interfaces.

DocumentHandler is the most important and most frequently used of these because it is called whenever an element is found.

The normal steps required to use DocumentHandler are as follows.

            import org.xml.sax.Parser;
            ...
            Class c = Class.forName("com.ibm.xml.parser.SAXDriver");
            Parser parser = (Parser)c.newInstance();
            parser.setDocumentHandler(new MyHandler());

First, the code must import the Parser interface. The next line

            Class c = Class.forName("com.ibm.xml.parser.SAXDriver");

obtains the Class object of a SAX driver. com.ibm.xml.parser.SAXDriver is the SAX driver provided by XML for Java. Using this class object, you can create a parser object as follows.

            Parser parser = (Parser)c.newInstance();

This is a standard technique in Java for specifying at run time the class to be instantiated. Note that Parser is an interface defined in the SAX package. With the following line, an instance of class MyHandler is registered as a DocumentHandler.

            parser.setDocumentHandler(new MyHandler());

MyHandler can be programmed by implementing the DocumentHandler interface, or by subclassing the adapter class HandlerBase, which provides an empty method for every event.
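
If XML for Java is not on the class path, the same registration steps can be tried with the SAX 1.0 parser that JAXP exposes. The sketch below works under that assumption: SAXParserFactory and getParser() belong to JAXP, not to XML for Java, the SAX 1.0 interfaces are marked as deprecated in current Java releases, and the names RegisterHandler, NameRecorder, and elementNames() are hypothetical. The handler subclasses HandlerBase and overrides only the one event it cares about.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical illustration class; not part of XML for Java. */
@SuppressWarnings("deprecation")
class RegisterHandler {

    /** Plays the role of MyHandler: records the name of every element that starts. */
    static class NameRecorder extends HandlerBase {
        final List<String> names = new ArrayList<String>();

        @Override
        public void startElement(String name, AttributeList atts) {
            names.add(name);
        }
    }

    /** Parses xml and returns the element names in the order their start tags appear. */
    public static List<String> elementNames(String xml) {
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of
            // Class.forName("com.ibm.xml.parser.SAXDriver").newInstance()
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            NameRecorder handler = new NameRecorder();
            parser.setDocumentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
            "<department><employee id=\"J.D\"><name>John Doe</name></employee></department>"));
    }
}
```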

Table 2.2 shows the methods defined in the DocumentHandler interface.

Table 2.2: Methods of `DocumentHandler`
Method	Description
`startDocument()`	Receive notification of the beginning of the document
`endDocument()`	Receive notification of the end of the document
`startElement(String name, AttributeList atts)`	Receive notification of the beginning of an element.
`endElement(String name)`	Receive notification of the end of an element.
`characters(char ch[], int start, int length)`	Receive notification of character data.
`ignorableWhitespace(char ch[], int start, int length)`	Receive notification of ignorable whitespace in element content.
`processingInstruction(String target, String data)`	Receive notification of a processing instruction.
`setDocumentLocator(Locator locator)`	Receive an object for locating the origin of SAX document events. The `Locator` object gives information on the location of the event such as line number and column position.
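
As an illustration of setDocumentLocator(), the following sketch keeps the Locator it receives and asks it for a line number each time an element starts. It is a sketch under stated assumptions: the parser comes from JAXP rather than XML for Java, the names ElementLines and startLines() are hypothetical, and a parser is not obliged to supply a Locator at all, so the code falls back to -1 when none was given.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.Parser;

/** Hypothetical illustration class; not part of XML for Java. */
@SuppressWarnings("deprecation")
class ElementLines extends HandlerBase {

    private Locator locator;
    // Element name -> line number reported when its most recent start tag was seen
    private final Map<String, Integer> lines = new LinkedHashMap<String, Integer>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before the other events, if it supports locators
        this.locator = locator;
    }

    @Override
    public void startElement(String name, AttributeList atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        lines.put(name, Integer.valueOf(line));
    }

    /** Parses xml (assumption: JAXP-supplied SAX 1.0 parser) and returns the recorded lines. */
    public static Map<String, Integer> startLines(String xml) {
        try {
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            ElementLines handler = new ElementLines();
            parser.setDocumentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startLines("<a>\n<b/>\n</a>"));
    }
}
```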

SAX defines several event types and associated methods. To see how these methods are called during parsing, our first SAX application, NotifyStr.java (Listing 2.4), reports SAX events to the standard output.

Listing 2.4: `NotifyStr.java`: Trace methods of `DocumentHandler`

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.Parser;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

/**
 * NotifyStr.java
 */
public class NotifyStr extends HandlerBase {
    static public void main(String[] argv) {
      try {
        Class c = Class.forName(argv[0]);
        Parser parser = (Parser )c.newInstance();
        NotifyStr notifyStr = new NotifyStr();
        parser.setDocumentHandler(notifyStr);    

        parser.parse(argv[1]);

      } catch (Exception e) {
        e.printStackTrace();
      }
    }
    
    public NotifyStr() {
    }

    public void startDocument() throws SAXException {
      System.out.println("startDocument is called:");
    }

    public void endDocument() throws SAXException {
      System.out.println("endDocument is called:");
    }

    public void startElement(String name, AttributeList amap) throws SAXException {
      System.out.println("startElement is called: element name=" + name);
      for (int i = 0; i < amap.getLength(); i++) {     
        String attname = amap.getName(i);
        String type = amap.getType(i);     
        String value = amap.getValue(i);
        System.out.println("  attribute name="+attname+" type="+type+" value="+value);
      } 
    }

    public void endElement(String name) throws SAXException {
      System.out.println("endElement is called: " + name);
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
      System.out.println("characters is called: " + new String(ch, start, length));
    }

}

This program takes a SAX driver class as the first command line argument and the file name of an XML document as the second argument. To run the program using XML for Java as the SAX driver, you have to supply "com.ibm.xml.parser.SAXDriver" as the first argument.

R:\samples\chap2>java NotifyStr com.ibm.xml.parser.SAXDriver department.xml
startDocument is called:
startElement is called: element name=department
startElement is called: element name=employee
  attribute name=id type=CDATA value=J.D
startElement is called: element name=name
characters is called: John Doe
endElement is called: name
startElement is called: element name=email
characters is called: John.Doe@foo.ibm.com
endElement is called: email
endElement is called: employee
startElement is called: element name=employee
  attribute name=id type=CDATA value=B.S
startElement is called: element name=name
characters is called: Bob Smith
endElement is called: name
startElement is called: element name=email
characters is called: Bob.Smith@foo.com
endElement is called: email
endElement is called: employee
startElement is called: element name=employee
  attribute name=id type=CDATA value=A.M
startElement is called: element name=name
characters is called: Alice Miller
endElement is called: name
startElement is called: element name=email
characters is called: Alice.Miller@jp.ibm.com
endElement is called: email
endElement is called: employee
endElement is called: department
endDocument is called:

With this output, you should be able to understand when these methods are called.

To highlight how programming with SAX differs from programming with DOM and ElementHandler, we show the SAX version of our key-value program in Listing 2.5.

Listing 2.5: `MakeEltTblSAX.java`: Extracts key-value pairs and stores them in a hash table (SAX-based implementation)

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.Hashtable;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;

/**
 * MakeEltTblSAX.java
 */
public class MakeEltTblSAX extends HandlerBase {
    static public void main(String[] argv) {
        if (argv.length != 2) {
            System.err.println("Usage: java MakeEltTblSAX <SAX-class-name> <xml-filename>");
            System.exit(1);
        }
        try {
            // get class object for SAX Driver
            Class c = Class.forName(argv[0]);
            // create instance of the class
            Parser parser = (Parser)c.newInstance();
            // create document handler, 
            MakeEltTblSAX makehash = new MakeEltTblSAX();
            // and register it
            parser.setDocumentHandler(makehash);

            parser.parse(argv[1]);

            System.out.println(makehash.m_hash);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    
    Hashtable m_hash;
    public MakeEltTblSAX() {
        m_hash = new Hashtable();
        m_state = STATE_OTHER;
        m_textbuf = new StringBuffer();
    }

    int m_state;
    static final int STATE_KEY = 0, STATE_VALUE = 1, STATE_OTHER = 2;
    StringBuffer m_textbuf;
    String m_key;

    public void startElement(String name, AttributeList amap) throws SAXException {
        if (name.equals("key")) {
            // store status
            m_state = STATE_KEY;
            m_textbuf.setLength(0);
        } else if (name.equals("value")) {
            // store status
            m_state = STATE_VALUE;
            m_textbuf.setLength(0);
        }
    }

    public void endElement(String name) throws SAXException {
        if (name.equals("key")) {
            m_key = m_textbuf.toString();
            this.m_state = STATE_OTHER;
        } else if (name.equals("value")) {
            m_hash.put(m_key, m_textbuf.toString());
            this.m_state = STATE_OTHER;
        }
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        if (STATE_KEY == m_state || STATE_VALUE == m_state) {
            m_textbuf.append(ch, start, length);
        }
    }

    public void endDocument() throws SAXException {
        m_textbuf = null;
        m_key = null;
    }
}

Again, this program produces exactly the same result as the DOM version:

R:\samples\chap2>java MakeEltTblSAX com.ibm.xml.parser.SAXDriver keyval.xml
{URL=http://www.ibm.com/xml, Owner=IBM}

Note that in this sample program, the SAX driver's name is specified in the command line. There is no XML processor-dependent code in this program. MakeEltTblSAX.java should run with any SAX-compatible XML processor.

As you can see in the above examples, for any element in the document, three methods, startElement(), characters() (possibly multiple times), and endElement(), are called in this sequence. However, because these methods are called independently of one another, the application program is responsible for keeping track of which element an event is associated with. In MakeEltTblSAX.java, the variable m_state records whether the current element is "key" or "value". It is set in the startElement() method as shown below.

    public void startElement(String name, AttributeList amap) throws SAXException {
        if (name.equals("key")) {
            m_state = STATE_KEY;
            m_textbuf.setLength(0);
        } else if (name.equals("value")) {
            m_state = STATE_VALUE;
            m_textbuf.setLength(0);
        }
    }

The variable m_state is reset in the endElement() method.

When text is found (that is, when the characters() method is called), the handler checks whether the text appears in the context of either "key" or "value" and, if so, appends the characters to the buffer.

    public void characters(char[] ch, int start, int length) throws SAXException {
        if (STATE_KEY == m_state || STATE_VALUE == m_state) {
            m_textbuf.append(ch, start, length);
        }
    }

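When endElement() is called, the assembled text is put to use: at the end of a "key" element it is saved in m_key, and at the end of a "value" element it is stored in the hash table under that key.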

    public void endElement(String name) throws SAXException {
        if (name.equals("key")) {
            m_key = m_textbuf.toString();
            this.m_state = STATE_OTHER;
        } else if (name.equals("value")) {
            m_hash.put(m_key, m_textbuf.toString());
            this.m_state = STATE_OTHER;
        }
    }
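
Listing 2.5 assumes that every value element is preceded by a key element. Since java.util.Hashtable does not accept null keys, a value with no preceding key would make put() throw a NullPointerException. One possible hardening is sketched below; it mirrors the DOM version in Listing 2.3, which checks the key and resets it after each pair. The choice to skip unpaired values silently is an assumption of this sketch, as are the JAXP-supplied SAX 1.0 parser and the names KeyValueSax and extract().

```java
import java.io.StringReader;
import java.util.Hashtable;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical illustration class; not part of XML for Java. */
@SuppressWarnings("deprecation")
class KeyValueSax extends HandlerBase {

    private final Hashtable<String, String> table = new Hashtable<String, String>();
    private final StringBuffer text = new StringBuffer();
    private boolean collecting = false; // true while inside a key or value element
    private String key = null;          // the key waiting for its value, if any

    @Override
    public void startElement(String name, AttributeList atts) {
        if (name.equals("key") || name.equals("value")) {
            collecting = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (collecting) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String name) {
        if (name.equals("key")) {
            key = text.toString();
            collecting = false;
        } else if (name.equals("value")) {
            // Design choice of this sketch: ignore a value that has no key
            if (key != null) {
                table.put(key, text.toString());
                key = null;
            }
            collecting = false;
        }
    }

    /** Parses xml (assumption: JAXP-supplied SAX 1.0 parser) and returns the pairs. */
    public static Hashtable<String, String> extract(String xml) {
        try {
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            KeyValueSax handler = new KeyValueSax();
            parser.setDocumentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.table;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<kv><value>orphan</value><key>URL</key><value>http://www.ibm.com/xml</value></kv>"));
    }
}
```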

2.4.3 Element Handler: Yet another Event-Driven API

IBM's XML for Java allows you to register an element handler for a particular element name before it starts reading an XML document. An element handler is called whenever the parser encounters the specified element. We will show how an element handler can change the structure of an input document using this API. As with SAX, applications receive events about elements. The difference is that a parser using ElementHandler still creates a DOM tree, while SAX does not. On the other hand, an ElementHandler can be attached to elements only -- you cannot catch events of other types.

ElementHandler is particularly useful when modifying some element while preserving the overall structure. We use SimpleFilter.java (Listing 2.6) to illustrate the use of ElementHandler for this purpose. This program translates the email tags into the url tags. For example,
<email>John.Doe@foo.com</email>
will be converted into
<url href="mailto:John.Doe@foo.com"/>.

An element handler is a class that implements the com.ibm.xml.parser.ElementHandler interface. Therefore, first you need to import it in your program.

import com.ibm.xml.parser.ElementHandler;

Since ElementHandler is an interface, you have to define a class that implements this interface. ElementHandler has only one method, handleElement() to be implemented by its implementation class.

Note:

If you are familiar with C or C++ but not with the Java 1.1 event model, you may think of an element handler as a kind of callback function. These are of course separate concepts, but the comparison gives at least a glimpse of the idea.

Listing 2.6: `SimpleFilter.java` -- Modify structure of an XML document

/**
 *       SimpleFilter.java
 **/

// Note: This class is XML4J specific

import org.w3c.dom.Document;
import com.ibm.xml.parser.Parser;
import com.ibm.xml.parser.TXDocument;
import com.ibm.xml.parser.TXElement;
import com.ibm.xml.parser.ElementHandler;
import java.io.FileInputStream;
import java.io.PrintWriter;

public class SimpleFilter implements ElementHandler {

            // This method  is XML4J-specific.
    public TXElement handleElement(TXElement el) { // ElementHandler
            String addr = el.getText();
            TXElement e = new TXElement("url");  
            e.setAttribute("href", "mailto:"+addr);
            return e;
    }

    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Missing filename.");
            System.exit(1);
        }
        try {
            FileInputStream is = new FileInputStream(argv[0]);
            Parser parser = new Parser(argv[0]);   // @XML4J

            // Register ElementHandler
            parser.addElementHandler(new SimpleFilter(), "email");

            // Start parsing @XML4J
            Document doc = parser.readStream(is);
            // Error?
            if (parser.getNumberOfErrors() > 0) {  // @XML4J
                System.exit(1);                // If the document has any error,
                                               // the program is terminated.
            }

            // Generate modified XML document
            ((TXDocument)doc).printWithFormat(new PrintWriter(System.out));  // @XML4J

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here, the class named SimpleFilter implements an element handler. In the method main(), we create a new instance of this element handler and register it as the handler of the element "email" in the following line.

     parser.addElementHandler(new SimpleFilter(), "email");

During the parsing, when an "email" element is created after seeing an end tag (</email>), the handleElement() method of the element handler is called. The implementation of this method is shown below.

public TXElement handleElement(TXElement el) { // ElementHandler
     String addr = el.getText();
     TXElement e = new TXElement("url");
     e.setAttribute("href", "mailto:"+addr);
     return e;
}

The newly generated "email" element is passed through the argument el. The program obtains the text string in this element (such as "John.Doe@foo.com") by calling getText() (this is functionally equivalent to the makeChildrenText() method in Section 2.4.1) and assigns it to the variable addr. Then, another element named "url" is created and the address is set as its "href" attribute. Finally this new "url" element is returned as the return value of this method. It replaces the original "email" element in the resulting DOM tree. This way, <email>John.Doe@foo.com</email> is effectively replaced by <url href="mailto:John.Doe@foo.com"/>.

Let us execute the program and see the result.

R:\samples\chap2>java SimpleFilter department.xml
<?xml version="1.0"?>
<!DOCTYPE department SYSTEM "department.dtd" >
<department>
  <employee id="J.D">
    <name>John Doe</name>
    <url href="mailto:John.Doe@foo.ibm.com"/>
  </employee>
  <employee id="B.S">
    <name>Bob Smith</name>
    <url href="mailto:Bob.Smith@foo.com"/>
  </employee>
  <employee id="A.M">
    <name>Alice Miller</name>
    <url href="http://www.trl.jp.ibm.com/~amiller/"/>
  </employee>
</department>

As you see, the <email> tags of John Doe and Bob Smith have been replaced by the <url> tags.
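
For comparison, a similar result can be approximated without ElementHandler by rewriting the tree after parsing with standard DOM calls. This is a different technique, sketched here under stated assumptions: it uses the JAXP parser instead of XML for Java, the handler's getText() is replaced by the standard getTextContent() method, and the names EmailToUrl, rewrite(), and hrefsAfterRewrite() are hypothetical. Unlike an element handler, this pass runs only after the whole document has been read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration class; not part of XML for Java. */
class EmailToUrl {

    /** Replaces every email element in doc with a url element, in place. */
    public static void rewrite(Document doc) {
        NodeList emails = doc.getElementsByTagName("email");
        // The NodeList is live, so walk it backwards while replacing nodes
        for (int i = emails.getLength() - 1; i >= 0; i--) {
            Element email = (Element) emails.item(i);
            Element url = doc.createElement("url");
            url.setAttribute("href", "mailto:" + email.getTextContent());
            email.getParentNode().replaceChild(url, email);
        }
    }

    /** Parses xml, rewrites it, and returns the href values of the url elements. */
    public static List<String> hrefsAfterRewrite(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            rewrite(doc);
            List<String> hrefs = new ArrayList<String>();
            NodeList urls = doc.getElementsByTagName("url");
            for (int i = 0; i < urls.getLength(); i++) {
                hrefs.add(((Element) urls.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hrefsAfterRewrite(
            "<department><employee><email>John.Doe@foo.com</email></employee></department>"));
    }
}
```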

Note:

In XML for Java, validity constraints are checked after ElementHandler is called. In other words, the constraints are applied to the modified structure. Therefore, the url element must be declared in the DTD.

You can attach as many element handlers as you like to any element names. By default, the handleElement() method is called after the parser sees the end tag of an element. For example, if you have an element handler attached to "employee", it is called when the parser sees a </employee> tag.

<department>
  <employee id="J.D">
    <name>John Doe</name>
    <email>John.Doe@foo.com</email>
    <title>Manager</title>
  </employee>
</department>

For the input above, the complete element structures for "name," "email," and "title" are all available to the element handler of "employee". On the other hand, the "department" element is not yet complete, so it is not available to the handler. An exception is the attributes of the parent tag -- XML for Java provides methods for accessing them. Refer to the interface specifications in Appendix D for details.

Java 1.1 introduced a powerful concept called inner classes into the Java language. Among other things, it allows you to embed a class definition within an expression as an anonymous class. Inner classes are particularly useful for element handlers because you do not need to give a separate name to each element handler.

In the above example, we defined SimpleFilter as an implementation of ElementHandler and registered it as the element handler of "email" as shown below.

parser.addElementHandler(new SimpleFilter(), "email");

With an inner class, we can embed a class definition in place of new SimpleFilter() as follows:

           parser.addElementHandler(new ElementHandler() {
                public TXElement handleElement(TXElement el) {
                    String addr = el.getText();
                    TXElement e = new TXElement("url");
                    e.setAttribute("href", "mailto:"+addr);
                    return e;
                }}
                , "email");

The entire rewritten program is shown in Listing 2.7.

Listing 2.7: SimpleFilter2 -- Modify structure of an XML document (inner class version)

/**
 *       SimpleFilter2.java
 **/

// Note: This class is XML4J specific

import org.w3c.dom.Document;
import com.ibm.xml.parser.Parser;
import com.ibm.xml.parser.TXDocument;
import com.ibm.xml.parser.TXElement;
import com.ibm.xml.parser.ElementHandler;
import java.io.FileInputStream;
import java.io.PrintWriter;

public class SimpleFilter2 {

    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Missing filename.");
            System.exit(1);
        }
        try {
            FileInputStream is = new FileInputStream(argv[0]);
            Parser parser = new Parser(argv[0]);   // @XML4J

            // ================================================
            // @XML4J
            parser.addElementHandler(new ElementHandler() {
                public TXElement handleElement(TXElement el) {
                    String addr = el.getText();
                    TXElement e = new TXElement("url");
                    e.setAttribute("href", "mailto:"+addr);
                    return e;
                }}
                , "email");

            // Start parsing
            Document doc = parser.readStream(is); // @XML4J
            // Error?
            if (parser.getNumberOfErrors() > 0) {  // @XML4J
                 System.exit(1);                // If the document has any error,
                                                // the program is terminated.
            }

            // Generate modified XML document
            ((TXDocument)doc).printWithFormat(new PrintWriter(System.out)); // @XML4J

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Using element handlers, you can write simple filter programs that, for example, search for particular tags or substitute and insert elements.
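For readers who want to try the same substitution without XML for Java, the following is a minimal sketch that performs it after parsing, using only the DOM and JAXP classes bundled with current JDKs. It is not the ElementHandler mechanism described above: the class name EmailToUrlDom, the sample document, and the address in it are assumptions made for illustration. As in SimpleFilter2, each "email" element is replaced by an empty "url" element whose href attribute holds the mailto: address.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class; a post-parse DOM rewrite, not an XML4J ElementHandler.
class EmailToUrlDom {

    // Parse the XML text, then replace each <email>addr</email>
    // with <url href="mailto:addr"/>.
    static Document rewrite(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList emails = doc.getElementsByTagName("email");
        // The NodeList is live: replaced elements drop out of it,
        // so we walk it from the end towards the beginning.
        for (int i = emails.getLength() - 1; i >= 0; i--) {
            Element email = (Element) emails.item(i);
            Element url = doc.createElement("url");
            url.setAttribute("href", "mailto:" + email.getTextContent());
            email.getParentNode().replaceChild(url, email);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        Document doc = rewrite(
            "<person><name>Sample User</name><email>user@example.com</email></person>");
        Element url = (Element) doc.getElementsByTagName("url").item(0);
        System.out.println(url.getAttribute("href"));
    }
}
```

The difference in design is the point of the comparison: here the whole tree exists before any element is touched, whereas an element handler is called while the parser is still reading the document.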

The next example is functionally equivalent to MakeEltTblDOM.java (Listing 2.3), which we discussed in Section 2.4.1. Comparing the two programs shows the difference between the two approaches.

Listing 2.8: MakeEltTblEltHandler -- Extract key-value pairs and store them in a Hashtable (ElementHandler-based implementation)
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Hashtable;
import com.ibm.xml.parser.ElementHandler;
import com.ibm.xml.parser.Parser;
import com.ibm.xml.parser.TXElement;

/**
 *       MakeEltTblEltHandler.java
 **/
// This class is XML4J specific.
public class MakeEltTblEltHandler implements ElementHandler {

    Hashtable hash = new Hashtable();
    String key = null;
    
    // ElementHandler
    public TXElement handleElement(TXElement el) { 
        String name = el.getNodeName();
        // Extract value of tag "key"
        if (name.equals("key")) {
            this.key = el.getText();
        // Extract value of tag "value" and store the pair into HashTable
        } else if (name.equals("value") && this.key != null) {
            this.hash.put(this.key, el.getText());
            this.key = null;
        }
        return el;
    }
    
    static public void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Missing filename.");
            System.exit(1);
        }
        try {
            InputStream is = new FileInputStream(argv[0]);
            Parser parser = new Parser(argv[0]);

            // Create and register ElementHandler
            MakeEltTblEltHandler handler = new MakeEltTblEltHandler();
            parser.addElementHandler(handler, "key");
            parser.addElementHandler(handler, "value");

            // Start parsing
            parser.readStream(is);

            // Error check
            if (parser.getNumberOfErrors() > 0) {
                System.exit(1);
            }

            // Display HashTable
            System.out.println(handler.hash);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Running this program produces exactly the same result as MakeEltTblDOM.java.

R:\samples\chap2>java MakeEltTblEltHandler keyval.xml
{URL=http://www.ibm.com/xml, Owner=IBM}

The core of this program is the handleElement() method, which is registered for two tag names, "key" and "value". The tag name is obtained by calling getNodeName(), and the content of the element is obtained with getText(). The handleElement() callback and getText() are specific to XML for Java.
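To show what getText() saves us, here is a rough sketch of the same key/value pairing logic written against the SAX classes bundled with current JDKs rather than XML for Java. With plain SAX the element text has to be collected by hand in characters(). The class name KeyValueSaxSketch and the flat layout of the sample document are assumptions for illustration; the actual layout of keyval.xml may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; uses JAXP/SAX from the JDK, not XML4J.
class KeyValueSaxSketch extends DefaultHandler {

    final Map<String, String> table = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String key = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);               // start collecting text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);  // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("key")) {
            key = text.toString();       // remember the key until its value arrives
        } else if (qName.equals("value") && key != null) {
            table.put(key, text.toString());
            key = null;
        }
    }

    static Map<String, String> parse(String xml) throws Exception {
        KeyValueSaxSketch handler = new KeyValueSaxSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.table;
    }

    public static void main(String[] args) throws Exception {
        // Assumed document layout: alternating <key> and <value> elements.
        String xml = "<table>"
                   + "<key>URL</key><value>http://www.ibm.com/xml</value>"
                   + "<key>Owner</key><value>IBM</value>"
                   + "</table>";
        System.out.println(parse(xml));  // prints {URL=http://www.ibm.com/xml, Owner=IBM}
    }
}
```

The state kept between callbacks (the pending key) is the same as in Listing 2.8. Only the text handling differs, because an element handler receives a complete element while a SAX handler receives its pieces one event at a time.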

Comparison of Three APIs

We have examined the three APIs provided by XML for Java. The following table compares the three approaches. You should select the one that best suits your application.

                                  DOM   SAX                           ElementHandler
Create a tree structure?          Yes   No                            Yes
Event-driven?                     No    Yes                           Yes
Types of event available          N/A   Document, Element, Text, PI   Element only
Efficient for large documents?    No    Yes                           Yes

2.5 Summary

In this chapter we explained the basics of programming with XML for Java as an XML processor. We showed how to read and generate XML documents, and discussed two types of APIs: the tree-based DOM and the event-driven APIs (SAX and ElementHandler).

With an event-driven API, simple tasks can be implemented easily as a one-pass process and can be more efficient than a DOM-based process, because with DOM the entire tree is always built before any application-specific work begins. On the other hand, a one-pass process makes it hard to access different parts of the document in arbitrary order. Readers should weigh these pros and cons and select the appropriate method.
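As an illustration of the random access that a tree allows, the sketch below looks up the key that belongs to a given value by walking backwards from the "value" element to the nearest preceding "key" sibling. It uses the DOM and JAXP classes bundled with current JDKs, not XML for Java; the class name DomBackwardLookupSketch and the flat sample document are assumptions for illustration. In a one-pass handler the preceding elements have already gone by, so the program would have to remember them itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical class; uses JAXP/DOM from the JDK, not XML4J.
class DomBackwardLookupSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Return the text of the <key> that most closely precedes the first
    // <value> whose text equals 'wanted', or null if there is none.
    static String keyForValue(Document doc, String wanted) {
        NodeList values = doc.getElementsByTagName("value");
        for (int i = 0; i < values.getLength(); i++) {
            Node value = values.item(i);
            if (!value.getTextContent().equals(wanted)) {
                continue;
            }
            // Walk backwards through the siblings: easy on a tree,
            // awkward in a one-pass event handler.
            for (Node n = value.getPreviousSibling(); n != null; n = n.getPreviousSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals("key")) {
                    return n.getTextContent();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Assumed document layout: alternating <key> and <value> elements.
        Document doc = parse("<table>"
                + "<key>URL</key><value>http://www.ibm.com/xml</value>"
                + "<key>Owner</key><value>IBM</value>"
                + "</table>");
        // Queries may be made in any order once the tree is built.
        System.out.println(keyForValue(doc, "IBM"));                     // prints Owner
        System.out.println(keyForValue(doc, "http://www.ibm.com/xml"));  // prints URL
    }
}
```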

The next chapter deals with generation in more detail.

Hiroshi Maruyama <maruyama@jp.ibm.com>
