Microsoft XML Parser in Java -- Release Notes

May 22nd, 1998

XML Language Issues
Object Model Issues
Recent Bug Fixes
Prior Release Notes

XML Language Issues

Feature Details

Case Sensitivity The XML language recently changed to become case-sensitive. MSXML uses the exact case for all keywords as defined in the XML Language Specification, so the following were changed to lowercase:
<?xml version="..." encoding="..." ?>
xml:space=(default|preserve)
<?xml:namespace name="..." src="..." prefix="..."?>
This is clearly a breaking change from the 1.0 version of the parser and is enabled by default, but for backwards compatibility the object model provides a switch to set the parser back to case-insensitive, as follows:
Document d = new Document();
d.setCaseInsensitive(true);
d.load("http://www.foo.com/example.xml");

Namespaces The parser supports namespaces as outlined in the separate XML Namespaces specification. Namespace support is used to implement the xml:space and xml:lang attributes.

XML encoding On the Windows platform, MSXML uses a C++ optimization for input. This currently supports only UCS-2, UTF-8 and the Windows code page. On other platforms, the supported encodings depend on what the Java Virtual Machine provides through InputStreamReader. MSXML also supports the little-endian and big-endian Unicode storage formats and maintains the same format when the document is saved.
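
The two Unicode storage formats are usually told apart by the byte order mark at the start of the stream. The following is a minimal sketch of that check, assuming a hypothetical helper class named BomSniffer; it is not part of MSXML and does not describe how the parser itself detects the format.

```java
// Hypothetical helper, not part of MSXML: guesses the Unicode storage format
// from the byte order mark (BOM) at the start of a document.
final class BomSniffer {
    /** Returns a Java charset name guessed from the first bytes, or null if no BOM is found. */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // little endian
            }
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE"; // big endian
            }
        }
        return null; // no BOM: the caller has to rely on the encoding declaration
    }
}
```

A caller could remember the detected name and reuse it when writing the document back, which is one way to keep the same format on save.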

Whitespace Handling Section 2.10 says that xml:space can be specified on any element to control whether white space is preserved or normalized. The default is to normalize white space (that is, to collapse each run of white-space characters to a single space). To preserve white space, set xml:space to preserve; this setting is inherited down the hierarchy. To switch back to the default, set xml:space to default.
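
The two modes can be illustrated in a few lines. This is only a sketch using a hypothetical WhiteSpace.apply helper, not the parser's actual code; the preserve flag stands in for the xml:space value that is in effect for the element, whether set directly or inherited.

```java
// Hypothetical helper illustrating the two xml:space modes described above.
final class WhiteSpace {
    /**
     * @param text     character data from the document
     * @param preserve true when xml:space="preserve" is in effect, either set on
     *                 the element itself or inherited from an ancestor
     */
    static String apply(String text, boolean preserve) {
        if (preserve) {
            return text; // keep every space, tab and line end as written
        }
        // Default mode: collapse each run of XML white space
        // (space, tab, carriage return, line feed) to a single space.
        return text.replaceAll("[ \\t\\r\\n]+", " ");
    }
}
```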

Standalone Document Declaration Section 2.9 says that the XML declaration can contain a standalone attribute with the value yes or no. This replaces the old RMD attribute. The standalone attribute is currently not used by the parser. To stop the parser from loading external DTDs and entities, use the Document method:
document.setLoadExternal(false);

End-of-Line Handling Section 2.11 of the spec requires line ends to be normalized on input, so all new lines are now returned to the application as the single character 0xA. To make sure that Document.save() still generates a valid text file on each platform, it now writes out System.getProperty("line.separator") every time it sees the 0xA character.
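
The round trip can be sketched as follows. The LineEnds class and its method names are hypothetical and only illustrate the rule; they are not taken from the parser source.

```java
// Hypothetical helper showing the line-end round trip described above.
final class LineEnds {
    /** On input: turn CR LF pairs and lone CRs into the single character 0xA. */
    static String normalize(String raw) {
        return raw.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** On save: write the platform line separator wherever 0xA appears. */
    static String toPlatform(String text) {
        return text.replace("\n", System.getProperty("line.separator"));
    }
}
```

The order of the two replacements in normalize matters: handling the CR LF pair first prevents it from being turned into two line feeds.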

Language Identification Section 2.12 adds a new xml:lang attribute. This means any element can now have this attribute regardless of ATTLIST declaration. For example, the following is valid, even though the DTD says that the test element doesn't have any attributes.
<!DOCTYPE test [
<!ELEMENT test ANY>
]>
<test xml:lang="en"> The quick brown fox. </test>

Object Model Issues

Change Details

Ignorable White Space The parser generates ignorable white-space nodes, as defined in the W3C DOM Specification. This makes Element.getChild(index) unreliable, since there may or may not be white-space nodes that affect the index. A more reliable way to get a first child element named FOO is as follows:
Element root = document.getRoot();
Element foo = root.getChildren().getChild(0);

This works because the default ElementCollection returned from getChildren() automatically filters out the white-space nodes.
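
The same filtering idea can be tried outside MSXML. The sketch below uses the DOM API that ships with the JDK (org.w3c.dom and javax.xml.parsers) rather than the MSXML ElementCollection, and the ElementChildren helper is a hypothetical name introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper built on the JDK DOM API, not on MSXML's ElementCollection.
final class ElementChildren {
    /** Returns only the element children of parent, skipping text and other nodes. */
    static List<Element> of(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList all = parent.getChildNodes();
        for (int i = 0; i < all.getLength(); i++) {
            Node n = all.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    /** Parses a small document held in a string; wraps failures in an unchecked exception. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }
}
```

For a document such as "<ROOT>\n  <FOO/>\n</ROOT>", the raw child list of ROOT starts with a white-space text node, while ElementChildren.of(root).get(0) is the FOO element regardless of how the file is indented.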

C++ XML object model JavaBeans are provided that expose the Java Object Model in a way that is as close as possible to the C++ XML Object Model that shipped with Internet Explorer 4.0. This makes JavaScript pages work the same way regardless of whether the back-end parser is C++ or Java.

ElementFactory The element factory is now responsible for building the XML tree. It now has the following methods:
Element createElement(Element parent, int type, Name tag, String text);
void parsedAttribute(Element e, Name name, Object value);
This makes it possible to provide a factory that builds nothing and returns null from createElement, which results in faster parsing.

Name Tokenization The Name class automatically tokenizes commonly used names. This can save a lot of memory and, as a result, can also speed up parsing. For example, the msnbc.cdf file creates only 58 unique Names, and shares a whopping 3522 Name objects. All Name objects are created using a static method, as follows:
Name foo = Name.create("FOO");
These names are stored in a hash table so that multiple instances of the same name share the actual Name object. This is useful for XML tags and XML entities, so the APIs in the object model now take and return Name objects instead of strings wherever applicable, which lets clients receive these benefits too.
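
The sharing technique itself is small. Below is a minimal sketch, assuming a hypothetical InternedName class as a stand-in for the parser's Name class; only the static create method mirrors the call shown in these notes, and everything else is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parser's Name class, showing the sharing technique only.
final class InternedName {
    // One entry per distinct spelling; every caller gets the same instance back.
    private static final Map<String, InternedName> TABLE = new HashMap<>();

    private final String text;

    private InternedName(String text) {
        this.text = text;
    }

    /** Returns the single shared instance for this spelling, creating it on first use. */
    static synchronized InternedName create(String text) {
        InternedName name = TABLE.get(text);
        if (name == null) {
            name = new InternedName(text);
            TABLE.put(text, name);
        }
        return name;
    }

    @Override
    public String toString() {
        return text;
    }
}
```

In this sketch, two calls to InternedName.create("FOO") return the same object, so names can be compared by reference instead of character by character.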

Recent Bug Fixes

Fixed bugs in Details

Parameter Entities Fixed inclusion of parameter entities inside entity declarations. The new table in section 4.4.0 of the XML Language Specification is a useful guide for how entities are handled.

Portability The XMLInputStream.java file will now compile on all platforms. It uses the following trick to detect whether the Windows specific optimization is available:
Class clazz = Class.forName("com.ms.xml.xmlstream.XMLStream");
xmlis = (XMLStreamReader)clazz.newInstance();
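
The detection relies on Class.forName throwing an exception when the class is absent. A fuller sketch of the pattern, with the fallback path made explicit, might look like this; the OptionalLoader class and its loadOrDefault method are hypothetical and are not taken from XMLInputStream.java.

```java
// Hypothetical sketch of optional-class detection with a portable fallback.
final class OptionalLoader {
    /**
     * Tries to instantiate className by reflection. Returns fallback when the
     * class is missing or cannot be constructed on this platform.
     */
    static Object loadOrDefault(String className, Object fallback) {
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class not found, no usable constructor, or native code failed to
            // link: the optimization is unavailable, so use the portable path.
            return fallback;
        }
    }
}
```

Because the class is named only in a string, the calling code compiles on every platform and the platform-specific class is resolved at run time.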

Entity Validation There was a bug in validation when entities were involved. For example, the following now works correctly:
<!DOCTYPE doc [
<!ELEMENT doc (foo)>
<!ELEMENT foo EMPTY>
<!ENTITY foo "<foo/>">
]>
<doc>&foo;</doc>
This was complicated by the presence of EntityRef nodes. The validator no longer assumes an entity is PCDATA.

Duplicate Entities Fixed a bug in the handling of a declaration for an entity that has already been declared. The parser used to use the second definition, but the spec says to ignore it.
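
The first-declaration-wins rule can be sketched with a map that never overwrites an existing entry. The EntityTable class below is hypothetical and only illustrates the rule; it is not the parser's own data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical entity table in which the first declaration of a name is binding.
final class EntityTable {
    private final Map<String, String> values = new HashMap<>();

    /**
     * Records a declaration. Returns false, and changes nothing, if the name
     * has already been declared.
     */
    boolean declare(String name, String value) {
        return values.putIfAbsent(name, value) == null;
    }

    /** Returns the replacement text for name, or null if it was never declared. */
    String lookup(String name) {
        return values.get(name);
    }
}
```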
