The beauty of XML is that you can easily add or invent tags, and thus
extend this grammar to include a superset of all the data that
any address book or address book-managing application would want. You can even make some tags and attributes optional.
Before the Web existed, there was no real need to make sure all
the data on all the desktop islands was accessible and
understandable. In fact, there were some territorial advantages
for applications having proprietary data formats that were obscure and unpublished.
You can easily extrapolate this address book example to another domain, say word processing. How often
have you hunted for an MS Word reader because you received a document in *.doc format?
That's a lot of work just to read a document.
Now, the momentum is toward user choice in applications. For
example, at home, users choose their favorite mail application,
browser, and word processor. And at work, they push to have the same choices
and functionality. Plug-and-play is becoming a reality with
hardware, with applications, and now, because of XML, with data.
XML is essentially plug-and-play data, or data that
defines itself. For many domains, the application that doesn't
parse XML will soon be considered proprietary.
<OL>
<LI>HTML allows <B><I>improper nesting</B></I>.
<LI>HTML allows start tags, without end tags, like the <BR> tag.
<LI>HTML allows <FONT COLOR=#9900CC>attribute values</FONT> without quotes.
</OL>
An HTML browser will probably render the above snippet just as intended, without complaint.
But now let's convert it to well-formed XML:
<OL>
<LI>XML requires <B><I>proper nesting</I></B>.</LI>
<LI>XML requires empty tags to be identified with a trailing slash, as in <BR/>.</LI>
<LI>XML requires <FONT COLOR="#9900CC">quoted attribute values</FONT>.</LI>
</OL>
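The difference between the two snippets can be checked mechanically. The sketch below uses Python and its standard xml.etree.ElementTree module, which is our choice for illustration and not something this tutorial prescribes. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

HTML_SNIPPET = """<OL>
<LI>HTML allows <B><I>improper nesting</B></I>.
<LI>HTML allows start tags, without end tags, like the <BR> tag.
<LI>HTML allows <FONT COLOR=#9900CC>attribute values</FONT> without quotes.
</OL>"""

XML_SNIPPET = """<OL>
<LI>XML requires <B><I>proper nesting</I></B>.</LI>
<LI>XML requires empty tags to be identified with a trailing slash, as in <BR/>.</LI>
<LI>XML requires <FONT COLOR="#9900CC">quoted attribute values</FONT>.</LI>
</OL>"""


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The parser gives up at the first violation it meets in the HTML snippet.
print(is_well_formed(HTML_SNIPPET))
print(is_well_formed(XML_SNIPPET))
```

A browser forgives all three problems in the first snippet; an XML parser rejects the document outright.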
The following is not well-formed XML, because the <BR> start tag has no end tag:
<LI>HTML allows start tags, without end tags, like
<BR>tags.</LI>
But this is well-formed XML:
<LI>XML requires empty tags to be identified with a
trailing slash, as in <BR/>.</LI>
because the <BR> is an empty tag and
includes the required trailing slash, <BR/>. In
this way, an XML parser knows immediately not to look for
an end tag, because an empty tag is a start
and end tag together as one. A start tag and end tag with no
data within them are also sometimes referred to as an
empty tag, but this
is not the precise definition.
3. Single root elements
XML documents allow only one root element. This restriction makes it easier to verify
that the document is complete.
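The empty-tag and single-root rules can both be tried out in a few lines. This is a minimal sketch using Python's standard xml.etree.ElementTree module (our choice for illustration). The helper name parses_as_one_document and the sample element content are made up for the example.

```python
import xml.etree.ElementTree as ET

# An empty tag and a start/end pair with nothing between them
# produce the same element once parsed.
short_form = ET.fromstring("<LI>line one<BR/>line two</LI>")
long_form = ET.fromstring("<LI>line one<BR></BR>line two</LI>")
same_result = ET.tostring(short_form) == ET.tostring(long_form)


def parses_as_one_document(text):
    """Return True if the text is accepted as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


one_root = parses_as_one_document("<addressBook><person/></addressBook>")
two_roots = parses_as_one_document("<person/><person/>")

print(same_result, one_root, two_roots)
```

The second call fails because the parser treats everything after the first closed root element as an error.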
4. Quoted attribute values
All attribute values must be within single or double quotes.
The FONT tag in the earlier HTML snippet is not well-formed XML.
Note the missing quotes around #9900CC.
5. Declared entities
All entities, apart from the five predefined ones, must be declared in a DTD. XML entities
are analogous to constants in other languages.
Entities can be expanded during processing, like a
macro-preprocessing capability, saving error-prone
duplication of common text. We'll cover DTDs later in this tutorial.
We won't cover entities
any further, though, since we don't
use them in our example. Entities are an important topic, so
you may want to refer to the specification.
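For readers who want one quick look at entity expansion before moving on, here is a minimal sketch in Python with the standard xml.etree.ElementTree module (our choice for illustration). The entity name company and its replacement text are invented for this example and do not come from the address book.

```python
import xml.etree.ElementTree as ET

# The entity is declared once, in a DTD placed inline in the document,
# and then used twice in the data.
DECLARED = """<!DOCTYPE note [
  <!ENTITY company "Example Corporation">
]>
<note>Sent by &company;. Reply to &company;.</note>"""

# The same reference with no declaration anywhere.
UNDECLARED = "<note>Sent by &company;.</note>"

note = ET.fromstring(DECLARED)
print(note.text)

try:
    ET.fromstring(UNDECLARED)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```

The parser substitutes the replacement text wherever the entity is referenced, much as a preprocessor expands a macro, and it refuses a reference to an entity that was never declared.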
Now that you know about well-formed
XML documents, you're almost ready to start writing one. There are just a few other
fine points to cover.
But the H1 and h1 are seen as two
entirely different tags in XML, so you must use the same
case in both tags. This is correct XML:
<H1>Remember XML is case-sensitive!</H1>
Tip: Use all upper case or all lower case for tags,
or adopt a strict convention, such as capitalizing only word
boundaries, which is a common programming practice.
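Case sensitivity is easy to confirm with a parser. The following sketch uses Python's standard xml.etree.ElementTree module (our choice for illustration). The helper name parses and the doc wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


mixed_case = parses("<H1>Remember XML is case-sensitive!</h1>")
same_case = parses("<H1>Remember XML is case-sensitive!</H1>")

# Tags that differ only in case are two distinct element names.
doc = ET.fromstring("<doc><H1>first</H1><h1>second</h1></doc>")
names = [child.tag for child in doc]

print(mixed_case, same_case, names)
```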
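2. Relevant white space
White space in the data between tags is relevant, because XML is a data
format. However, white space within the markup itself is not significant, and within
quoted attribute values it is normalized: line breaks and tabs become spaces, and, for
attributes that a DTD declares with a type other than CDATA, leading and trailing spaces
are removed. For example, assuming form is declared as an enumerated attribute, these two poems are the same in XML: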
<poem form="free ">
To XML or not to XML
That is the question.
</poem>
<poem
form="free">
To XML or not to XML
That is the question.
</poem>
But this poem is different:
<poem form="free">
To XML or
not to XML
That is the question.
</poem>
(Technical point: Although the parser normalizes
white space within tags, the
files produce the same DOM representation when re-parsed,
so data integrity is maintained.)
The specification provides more detail on white space
as well.
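To see which white space a parser keeps, we can parse the three poems and compare the results. This is a sketch in Python using the standard xml.etree.ElementTree module (our choice for illustration). The trailing space in form="free " is left out here, because whether it is trimmed depends on how a DTD declares the attribute, and this sketch uses no DTD.

```python
import xml.etree.ElementTree as ET

POEM_1 = """<poem form="free">
To XML or not to XML
That is the question.
</poem>"""

# Same data; only the white space inside the start tag differs.
POEM_2 = """<poem
form="free">
To XML or not to XML
That is the question.
</poem>"""

# Here the extra line break is inside the data itself.
POEM_3 = """<poem form="free">
To XML or
not to XML
That is the question.
</poem>"""

p1, p2, p3 = (ET.fromstring(poem) for poem in (POEM_1, POEM_2, POEM_3))

first_two_equal = p1.text == p2.text and p1.attrib == p2.attrib
third_equal = p1.text == p3.text

print(first_two_equal, third_equal)
```

The line break inside the start tag of the second poem disappears during parsing, while the line break inside the data of the third poem is preserved and makes it a different document.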
3. Character encoding
XML allows you to specify different character set encodings.
The encoding must be identified within the <?xml ?>
declaration by an encoding attribute, such as encoding="UTF-8".
An XML processor is required to support 'UTF-8'
and 'UTF-16'.
For example:
<?xml version='1.0' encoding='UTF-8' ?>
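The effect of the encoding attribute shows up once a document contains characters outside ASCII. The sketch below uses Python's standard xml.etree.ElementTree module (our choice for illustration). The name "José" is sample data, and the helper name roundtrip is hypothetical.

```python
import xml.etree.ElementTree as ET

# "Jos\u00e9" (José) is sample data: the last letter is stored as different
# bytes in UTF-8 and in ISO-8859-1.
TEMPLATE = "<?xml version='1.0' encoding='{enc}' ?><given>Jos\u00e9</given>"


def roundtrip(encoding):
    """Store the document as bytes in the declared encoding, then parse it."""
    raw = TEMPLATE.format(enc=encoding).encode(encoding)
    return raw, ET.fromstring(raw).text


utf8_bytes, utf8_text = roundtrip("UTF-8")
latin1_bytes, latin1_text = roundtrip("ISO-8859-1")

print(utf8_bytes != latin1_bytes)   # the files differ on disk
print(utf8_text == latin1_text)     # the parsed data does not
```

The declaration is what lets the parser turn two different byte sequences back into the same text.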
The IBM XML Parser for Java (XML4J) supports the following encodings:
US-ASCII
UCS-2
UCS-4
UTF-8
UTF-16
Shift_JIS
EUC-JP
ISO-2022-JP
Big5
GB2312
ISO-8859-1 through ISO-8859-9
4. Special reserved characters
Several characters are part of the syntactic structure of XML and will not be
interpreted as themselves if simply placed within an XML
document. You must substitute a special character sequence,
which XML calls a predefined entity. More about entities later.
Reserved character    Predefined entity to use instead
<                     &lt;
&                     &amp;
>                     &gt;
'                     &apos;
"                     &quot;
Only the "<" character
seems to be automatically interpreted by most HTML browsers as the
start of a markup tag, although the HTML specification may be
stricter.
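Programs that generate XML usually perform the predefined-entity substitution with a small helper. Here is a sketch in Python using escape from the standard xml.sax.saxutils module (our choice for illustration). The helper name to_xml_text and the sample string are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() replaces &, < and > on its own; the two quote characters are
# passed in explicitly so that all five predefined entities are covered.
QUOTES = {"'": "&apos;", '"': "&quot;"}


def to_xml_text(raw):
    """Replace the reserved characters in raw with predefined entities."""
    return escape(raw, QUOTES)


raw = 'if (a < b && b > c) print("it\'s sorted")'
escaped = to_xml_text(raw)
print(escaped)

# A parser turns the entities back into the original characters.
restored = ET.fromstring("<code>" + escaped + "</code>").text
print(restored == raw)
```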
With regard to a DTD, an XML document can do any of the following:
Refer to a DTD, using a URI.
Include a DTD inline as part of the XML document.
Omit a DTD altogether. Without a DTD, an XML document can be checked for well-formedness, but not for validity.
An XML document is valid
if its content conforms to the rules in its DTD. Validity allows an application to make sure the XML data is
complete, is formatted properly, and has appropriate attribute
values. It also allows an application to construct valid
XML that conforms to that DTD, which is a very powerful feature.
For example, the IBM XML Parser for Java (XML4J) can:
Ensure that end-users cannot
create invalid XML data
Help users build XML
documents by showing what is valid at any given point
Currently the IBM Parser is the only one
implementing this powerful capability. More on this later in "Tutorial 3: Parsing XML Using Java."
This DTD statement declares an element called person, which
must consist of a name followed by zero or more e-mail elements.
Type         Element declaration    Element content model
<!ELEMENT    person                 (name, e-mail*)          >
Let's try to avoid reading the specification for now. Instead, refer to the following table for the symbols and
special names used in DTD element rules. The characters A and B represent an
element or an expression found in the Element Content Model.
Element definition    What it means
A?                    Matches A or nothing; optional A.
A+                    Matches one or more occurrences of A.
A*                    Matches zero or more occurrences of A.
A | B                 Matches A or B but not both.
A , B                 Matches A followed by B, in that order.
(A, B)+               Matches one or more occurrences of (A followed by B). Parentheses are a grouping mechanism; the expression inside parentheses is treated as a unit.
#PCDATA               Keyword matches string data in the current character encoding.
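If you know regular expressions, the symbols in the table will look familiar, and the resemblance is close enough to put to work. The sketch below translates an element content model into a Python regular expression and matches it against the names of an element's children. It is a hand-rolled illustration, not how any particular parser validates. The function names are hypothetical, and only element names and the operators from the table are handled (not #PCDATA).

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a DTD element content model into a regex over child names."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # "A , B" is a sequence: plain regex concatenation
        elif token in ")|?*+":
            parts.append(token)  # same meaning in both notations
        else:
            # An element name, followed by the space that separates names.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the sequence of child element names against a content model."""
    names = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(names) is not None


person = ET.fromstring("<person><name/><e-mail/><e-mail/></person>")
print(children_match(person, "(name, e-mail*)"))
print(children_match(person, "(name, e-mail?)"))
```

The second check fails because ? allows at most one e-mail, while the sample person has two.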
That's enough information for the moment. Please feel free to
refer to this table as we continue. Next let's build on our Address Book example.
<?xml encoding="UTF-8"?>
<!ELEMENT addressBook (person)+>
<!ELEMENT person (name,e-mail*)>
<!ELEMENT name (family, given)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT e-mail (#PCDATA)>
Note that the DTD may also begin with an <?xml
encoding="UTF-8"?> declaration. If it does, the declaration must include the encoding attribute.
This DTD says that our addressBook is composed of one or more
people, where each person has a name and zero or more e-mail
addresses. The name is composed of a family
name and a given name. And the content of each of these
is UTF-8 string data.
If you do include the XML declaration in your XML document, you must follow some simple
rules:
<?xml must be the first characters of the document, with no preceding spaces.
The version attribute is required.
In an external document, such as a DTD, the encoding
attribute is required, and the version is
optional (it is inherited from the document).
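The first two rules are enforced by the parser itself, so they are easy to try. This sketch uses Python's standard xml.etree.ElementTree module (our choice for illustration), and the helper name accepted is hypothetical. The third rule concerns external documents and is not exercised here.

```python
import xml.etree.ElementTree as ET


def accepted(document_bytes):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document_bytes)
        return True
    except ET.ParseError:
        return False


ok = accepted(b"<?xml version='1.0' encoding='UTF-8'?><addressBook/>")

# A single space before <?xml is enough to break the first rule.
leading_space = accepted(b" <?xml version='1.0' encoding='UTF-8'?><addressBook/>")

# In the document itself, the version attribute may not be left out.
no_version = accepted(b"<?xml encoding='UTF-8'?><addressBook/>")

print(ok, leading_space, no_version)
```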
The XML declaration rules are simpler than they appear in the specification. These rules exist so that XML processors can interpret the
XML document correctly. The first few characters
specify the character encoding of the file (8- or
16-bit characters, ASCII or EBCDIC) well enough so that a processor can read the rest of the first line, including the "encoding='xxxx'"
attribute. The version is specified so that the XML language may
evolve gracefully without breaking existing XML documents.
Also, because the XML language is designed to be completely international from the
start, any external document that is referenced may have a
different encoding.
The (male|female) expression is an enumerated type,
and #IMPLIED is a keyword designating that this
attribute is optional.
In the following interactive example, the DTD rule is included for the gender attribute. Can you add
the gender attribute to the XML part
for both people?
Now, the gender attribute is still optional in
the XML part, but if unspecified, it will default to "unknown".
Alternatively, if we want to require the existence of the gender
attribute, we use the keyword #REQUIRED, like
so:
Type         Element    Attribute declaration    Attribute definition
<!ATTLIST    person     gender                   (male|female|unknown) #REQUIRED    >
Now, the gender attribute must be present with one
of the valid values, within every person tag.
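To make the difference between #IMPLIED and #REQUIRED concrete, here is a hand-written check that mimics what a validating parser would report for the gender rule. It is a sketch in Python, not the behavior of any particular parser. The function name gender_errors, the required flag, and the sample people are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The enumerated type from the attribute definition.
ALLOWED_GENDERS = {"male", "female", "unknown"}


def gender_errors(address_book, required=True):
    """List the person elements whose gender attribute breaks the rule."""
    errors = []
    for index, person in enumerate(address_book.iter("person"), start=1):
        gender = person.get("gender")
        if gender is None:
            if required:  # #REQUIRED: the attribute may not be left out
                errors.append(f"person {index}: gender is missing")
        elif gender not in ALLOWED_GENDERS:
            errors.append(f"person {index}: gender={gender!r} is not allowed")
    return errors


book = ET.fromstring(
    "<addressBook>"
    "<person gender='female'/>"
    "<person/>"
    "<person gender='other'/>"
    "</addressBook>"
)

print(gender_errors(book, required=True))
print(gender_errors(book, required=False))
```

With required=False the check behaves like #IMPLIED: a missing attribute is fine, but a value outside the enumeration is still an error.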
And, finally, what if we wanted to say that if an attribute is
present, it is always assigned a particular fixed value? That's possible, too, by
using the keyword #FIXED.
Use the tokenized attribute types to represent a fixed set of keyword types with special meanings.
Often, we want to uniquely identify instances of a certain
element, so it has an attribute with a value that must be
unique. In our Address Book example, it would be
helpful to be able to uniquely identify and refer to a
person, even people with the same name and similar data. This
is done using a tokenized attribute type called ID.
The ID
attribute type in the following example, plus the #REQUIRED
keyword, ensures that every person must have an id
attribute whose value is unique within the document.
<!ATTLIST person id ID #REQUIRED>
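The uniqueness rule is simple to state in code. The sketch below is a hand-written check in Python, not a validating parser. It assumes the ID-typed attribute is the one named id, as in the declaration above. The function name duplicate_ids and the id values p1 and p2 are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the id values that occur on more than one element."""
    counts = Counter(
        element.get("id") for element in root.iter() if element.get("id") is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


book = ET.fromstring(
    "<addressBook>"
    "<person id='p1'/>"
    "<person id='p2'/>"
    "<person id='p1'/>"
    "</addressBook>"
)

print(duplicate_ids(book))
```

A validating parser would reject this address book, because two people share the same id.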
Now that we can ensure the uniqueness of element attribute
values, we can refer to them. The next tokenized
type, called IDREF, does just that.
Each IDREF attribute is required to match an ID
attribute on some element in the XML document. Similarly,
attribute values of type IDREFS must contain a whitespace-delimited list of ID values that appear in the document. So, let's define the
ability for a person to link to his or her manager and/or subordinates:
<!ELEMENT link EMPTY>
<!ATTLIST link
manager IDREF #IMPLIED
subordinates IDREFS #IMPLIED>
Notice how we declared an EMPTY link
element, which can contain attributes that refer to other people.
For details on attribute types, see the specification.
<?xml encoding="UTF-8"?>
<!ELEMENT addressBook (person)+>
<!ELEMENT person (name,email*,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (#PCDATA|family|given)*>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link
manager IDREF #IMPLIED
subordinates IDREFS #IMPLIED>
Now, your final XML challenge, if you choose to accept it,
is to start with the addressBook XML document below, where we left
off, and express the fact that Claire is the manager of Bob,
and that Bob is a subordinate of Claire. Give it a
try below, and hopefully our Parser will help you correct any
errors.
Does your XML document look something like this? (Note how the link element attributes refer to the contents
of an id attribute in the document.)
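One possible answer is sketched below, wrapped in a few lines of Python so that the references can be checked. The id values "claire" and "bob", the use of only a given name, and the function name dangling_references are assumptions made for this sketch. The check itself is hand-written and stands in for what a validating parser does with IDREF and IDREFS attributes.

```python
import xml.etree.ElementTree as ET

# The id values "claire" and "bob" are made up for this sketch.
ADDRESS_BOOK = """<addressBook>
  <person id="claire">
    <name><given>Claire</given></name>
    <link subordinates="bob"/>
  </person>
  <person id="bob">
    <name><given>Bob</given></name>
    <link manager="claire"/>
  </person>
</addressBook>"""


def dangling_references(root):
    """Return manager/subordinates values that match no person id."""
    known_ids = {person.get("id") for person in root.iter("person")}
    missing = []
    for link in root.iter("link"):
        # manager holds one id; subordinates holds a whitespace-delimited list.
        refs = link.get("manager", "").split() + link.get("subordinates", "").split()
        missing.extend(ref for ref in refs if ref not in known_ids)
    return missing


book = ET.fromstring(ADDRESS_BOOK)
print(dangling_references(book))
```

An empty result means that every manager and subordinates value points at a person that actually exists in the document.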
Now that you know how to write a valid XML document, the next tutorial will show you how to write a Java program to invoke the IBM XML Parser for Java and manipulate the structure of your address book with the DOM API.
The third tutorial, "Parsing XML Using Java," isn't
available as you read this, so please check
IBM's website for XML Developers.