Microsoft XML Parser in Java
Release Notes for Version 1.8

December 4, 1997

XML Language Issues
Object Model Issues
Recent Bug Fixes
Earlier Release Notes (version 1.6)

XML Language Issues

Feature Details

Case Sensitivity: XML is now case-sensitive. MSXML uses the exact case for all keywords as defined in the XML Language Specification, so the following are now lowercase:
<?xml version="..." encoding="..." ?>
xml-space=(default|preserve)
<?xml:namespace as="..." href="..."?>
Case-sensitive parsing is enabled by default and is a breaking change from version 1.0 of the parser. For backward compatibility, the object model provides a switch that sets the parser back to case-insensitive mode:
Document d = new Document();
d.setCaseInsensitive(true);
d.load("http://www.foo.com/example.xml");

Namespaces: The parser supports namespaces as outlined in the XML Namespaces document. Namespace support was used to implement the new xml:lang attribute.

XML Encoding: On the Windows platform, MSXML uses a C++ optimization for input. This optimization currently supports only UCS-2, UTF-8, and the Windows code page. On other platforms, the supported encodings depend on which encodings the Java Virtual Machine provides through InputStreamReader. MSXML also supports little-endian and big-endian Unicode storage formats and maintains the same format when the document is saved.
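
As an illustration of telling the two Unicode storage formats apart, here is a minimal sketch that inspects the byte order mark at the start of a document. It is not MSXML's code: the class and method names are hypothetical, and the fallback to UTF-8 when no mark is present is an assumption. The sketch also ignores four-byte encodings.

```java
final class ByteOrderSketch {
    /** Guesses the storage format from the first bytes of a document. */
    static String detect(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // big-endian byte order mark
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // little-endian byte order mark
        }
        return "UTF-8"; // assumption: no byte order mark means UTF-8
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, '<', 0};
        System.out.println(detect(head)); // UTF-16LE
    }
}
```

A parser that remembers the detected format can write the same byte order mark and byte order back out when the document is saved.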

White Space Handling: Section 2.10 says that xml-space can be specified on any element to control whether white space is preserved or normalized. The default is to normalize white space, which means collapsing each run of white-space characters to a single space. To preserve white space, set xml-space to preserve; this setting is inherited down the hierarchy. To switch back to the default, set xml-space to default.
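
The default normalization described above can be pictured with a minimal sketch. This is not MSXML's implementation: the class and method names are hypothetical, and it assumes that "white space" means space, tab, carriage return, and line feed. It does not trim leading or trailing white space, because the notes do not say whether the parser does.

```java
final class WhitespaceSketch {
    /** Collapses each run of space, tab, CR, or LF characters to one space. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inRun = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean ws = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (ws) {
                if (!inRun) out.append(' '); // emit one space per run
                inRun = true;
            } else {
                out.append(c);
                inRun = false;
            }
        }
        return out.toString();
    }

    /** Applies an xml-space setting: "preserve" keeps the text unchanged. */
    static String apply(String text, String xmlSpace) {
        return "preserve".equals(xmlSpace) ? text : normalize(text);
    }

    public static void main(String[] args) {
        System.out.println(apply("The quick \n\t brown fox.", "default"));
    }
}
```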

Standalone Document Declaration: Section 2.9 says that the XML declaration can contain a standalone attribute with the value yes or no. This replaces the old RMD attribute. The parser does not currently use the standalone attribute. To stop the parser from loading external DTDs and entities, use the Document method:
document.setLoadExternal(false);

End-of-Line Handling: Section 2.11 of the spec specifies that all line ends are now returned to the application as the single character 0xa. To make sure that Document.save() still generates a valid text file on each platform, the parser now writes out System.getProperty("line.separator") every time it sees the 0xa character.
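
The save-time translation described above amounts to a simple substitution. The following is a sketch under stated assumptions, not the parser's code: the class and method names are hypothetical, and the separator is passed as a parameter so that the behavior can be checked on any platform.

```java
final class LineSeparatorSketch {
    /** Replaces each 0xa character with the given line separator. */
    static String toPlatformText(String text, String separator) {
        return text.replace("\n", separator);
    }

    /** Uses the separator of the platform the program runs on. */
    static String toPlatformText(String text) {
        return toPlatformText(text, System.getProperty("line.separator"));
    }

    public static void main(String[] args) {
        // On Windows this writes CR LF between the two lines; on Unix, LF only.
        System.out.print(toPlatformText("first line\nsecond line\n"));
    }
}
```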

Language Identification: Section 2.12 adds a new xml:lang attribute. Any element can now have this attribute regardless of its ATTLIST declaration. For example, the following is valid even though the DTD declares no attributes for the test element:
<!DOCTYPE test [
<!ELEMENT test ANY>
]>
<test xml:lang="en">
The quick brown fox.
</test>

Object Model Issues

Change Details

Ignorable White Space: The parser generates ignorable white-space nodes, as defined in the W3C DOM Specification. This makes Element.getChild(index) unreliable, because white-space nodes may or may not be present and shift the index. A more reliable way to get the first child element (FOO in this example) is as follows:
Element root = document.getRoot();
Element foo = root.getChildren().getChild(0);
This works because the default ElementCollection returned from getChildren() automatically filters out the white-space nodes.
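
To show why the raw index is fragile while the filtered collection is not, here is a self-contained sketch. It does not use the MSXML classes: Node and elementsOnly are hypothetical stand-ins for the parser's child nodes and for the filtering that the notes attribute to ElementCollection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChildFilterSketch {
    /** Hypothetical child node: a tag name, or null for ignorable white space. */
    static final class Node {
        final String tag;
        Node(String tag) { this.tag = tag; }
        boolean isWhitespace() { return tag == null; }
    }

    /** Returns the children with white-space nodes filtered out. */
    static List<Node> elementsOnly(List<Node> children) {
        List<Node> out = new ArrayList<>();
        for (Node n : children) {
            if (!n.isWhitespace()) out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        // A line break before <FOO> in the source yields a white-space node first.
        List<Node> children = Arrays.asList(new Node(null), new Node("FOO"));
        System.out.println(children.get(0).isWhitespace());  // true
        System.out.println(elementsOnly(children).get(0).tag); // FOO
    }
}
```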

C++ XML Object Model: JavaBeans are provided that expose the Java object model in a way that is as close as possible to the C++ XML Object Model that shipped with Internet Explorer 4.0. As a result, JavaScript pages work the same regardless of whether the back-end parser is C++ or Java.

ElementFactory: The element factory is now responsible for building the XML tree. It has the following methods:
Element createElement(Element parent, int type, Name tag, String text);
void parsedAttribute(Element e, Name name, Object value);
This makes it possible to provide a factory that builds nothing and returns null from createElement, which speeds up parsing because no tree is built.
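
A factory that builds nothing can still do useful work, such as counting what the parser reports. The sketch below is self-contained and only mirrors the shape of the two methods listed above: Element and ElementFactory here are simplified hypothetical stand-ins (plain strings replace Name), not the MSXML types. How the real parser treats a null return from createElement is not covered by these notes.

```java
final class FactorySketch {
    /** Hypothetical stand-in for the parser's Element type. */
    static final class Element {
        final String tag;
        Element(String tag) { this.tag = tag; }
    }

    /** Simplified stand-in for the factory interface; String replaces Name. */
    interface ElementFactory {
        Element createElement(Element parent, int type, String tag, String text);
        void parsedAttribute(Element e, String name, Object value);
    }

    /** Builds no tree at all; only counts the callbacks it receives. */
    static final class CountingFactory implements ElementFactory {
        int elements;
        int attributes;

        public Element createElement(Element parent, int type, String tag, String text) {
            elements++;
            return null; // nothing is allocated for the tree
        }

        public void parsedAttribute(Element e, String name, Object value) {
            attributes++;
        }
    }

    public static void main(String[] args) {
        CountingFactory f = new CountingFactory();
        f.createElement(null, 0, "doc", null);
        f.parsedAttribute(null, "id", "x");
        System.out.println(f.elements + " element(s), " + f.attributes + " attribute(s)");
    }
}
```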

Name Tokenization: The Name class automatically tokenizes commonly used names. This can save a lot of memory and, as a result, can also speed up parsing. For example, the msnbc.cdf file creates only 58 unique Names and shares 3522 Name objects. All Name objects are created using a static method:
Name foo = Name.create("FOO");
These names are stored in a hash table so that multiple instances of the same name share the actual Name object. This is useful for XML tags and XML entities, so the APIs in the object model now take and return Name objects instead of strings wherever applicable, which lets clients receive these benefits as well.
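
The hash-table sharing described above is a form of interning. Here is a minimal sketch of the idea, assuming a single global table; InternedName is a hypothetical class written for this example and is not the MSXML Name class, whose internals are not shown in these notes.

```java
import java.util.HashMap;
import java.util.Map;

final class InternedName {
    // One shared table: each distinct string maps to exactly one object.
    private static final Map<String, InternedName> TABLE = new HashMap<>();

    private final String text;

    private InternedName(String text) { this.text = text; }

    /** Returns the shared object for this text, creating it on first use. */
    static synchronized InternedName create(String text) {
        return TABLE.computeIfAbsent(text, InternedName::new);
    }

    /** Number of distinct names created so far. */
    static synchronized int uniqueCount() { return TABLE.size(); }

    @Override public String toString() { return text; }

    public static void main(String[] args) {
        InternedName a = create("FOO");
        InternedName b = create("FOO");
        System.out.println(a == b); // true: both refer to the same object
    }
}
```

Because equal names are the same object, callers can compare them with == instead of comparing strings character by character.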

Recent Bug Fixes

Fixed bugs in Details

Parameter Entities: Fixed inclusion of parameter entities inside entity declarations. The new table in section 4.4.0 of the XML Language Specification is a useful guide to how entities are handled.

Portability: The XMLInputStream.java file now compiles on all platforms. It uses the following trick to detect whether the Windows-specific optimization is available:
Class clazz = Class.forName("com.ms.xml.xmlstream.XMLStream");
xmlis = (XMLStreamReader)clazz.newInstance();
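
The detection trick above works because loading a class by name defers the dependency to run time, so the file compiles even where the class does not exist. The sketch below shows the same pattern with a fallback. It is an illustration, not the contents of XMLInputStream.java: the class and method names are hypothetical, and it uses getDeclaredConstructor().newInstance(), the current replacement for the Class.newInstance() call in the original snippet.

```java
final class OptionalClassSketch {
    /** Returns a new instance of className, or fallback if it cannot be loaded. */
    static Object loadOrFallback(String className, Object fallback) {
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class missing, not instantiable, or its native code failed to link.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Object reader = loadOrFallback("com.ms.xml.xmlstream.XMLStream", "portable reader");
        System.out.println(reader);
    }
}
```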

Entity Validation: Fixed a validation bug involving entities. For example, the following now validates correctly:
<!DOCTYPE doc [
<!ELEMENT doc (foo)>
<!ELEMENT foo EMPTY>
<!ENTITY foo "<foo/>">
]>
<doc>&foo;</doc>
This case was complicated by the presence of EntityRef nodes. The validator no longer assumes an entity is PCDATA.

Duplicate Entities: Fixed a bug in the parser's handling of a declaration for an entity that has already been declared. The parser used to use the second definition, but the spec says to ignore it.
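
The corrected rule, where the first declaration wins, can be sketched with a map that refuses to overwrite. EntityTableSketch and its methods are hypothetical names for this example; the parser's own entity table is not shown in these notes.

```java
import java.util.HashMap;
import java.util.Map;

final class EntityTableSketch {
    private final Map<String, String> entities = new HashMap<>();

    /** Records the entity; returns false if an earlier declaration already exists. */
    boolean declare(String name, String value) {
        // putIfAbsent leaves an existing entry untouched and returns it.
        return entities.putIfAbsent(name, value) == null;
    }

    String lookup(String name) {
        return entities.get(name);
    }

    public static void main(String[] args) {
        EntityTableSketch table = new EntityTableSketch();
        table.declare("foo", "first");
        table.declare("foo", "second"); // ignored
        System.out.println(table.lookup("foo")); // first
    }
}
```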
