Inside XML Schemas: Under the Hood

Schemas use declarations to describe rules and constraints for elements and attributes, building a framework for documents out of a fairly small set of declarations. Declarations create vocabulary and set constraints, identifying content and where it is to appear. Many schemas can be built using only a combination of element and attribute declarations, while other declarations (like entities and notations) can supplement these core declarations when needed in a particular situation.

Elements and Attributes

Elements and attributes are the core structures of XML, the key features with which document content is built and annotated. Elements and attributes 'mark up' text into easily processed segments, labeled for identification. XML schemas describe the structure of those labels and identify (with varying degrees of precision) the material they may contain. Both elements and attributes may be used to contain simple data types, from basic text types to the sophisticated and complex data types proposed in XML-Data and other schema proposals. Only elements may contain structured content, storing elements within elements and defining potentially complex structure.

Elements and attributes differ in several key ways. Attributes cannot hold subcomponents - no child elements or attributes are allowed within an attribute. (An application can still parse the content of an attribute value into smaller components if it wants, but XML itself doesn't provide that facility.) Attributes are always assigned to particular elements, while an element can appear as a child of any other element whose content model permits it. The schema can supply a default value for an attribute or fix its value outright. It can require an element to contain certain other elements or text, but it cannot fix the actual value of that content.

Attributes tend to be simpler, typically holding less (though often just as important) information that annotates an element. In general, elements annotate the content of a document, and attributes annotate elements. In practice, the limitations of attributes do restrict their use to some extent, although their ability to have defaults sometimes leads to their use in place of elements.

Atomic Content Models for Elements and Attributes

Schemas can constrain content for elements and attributes, requiring that the content within them meet certain criteria. XML Authority provides a number of data types from the XML-Data schema proposal, in addition to the built-in types provided by XML 1.0. Schema developers can also identify their own data types with notations. Only the built-in XML types are supported by XML 1.0 validating parsers (and only for attributes), but processors supporting the newer schemas should support a fuller selection of types. XML Authority uses a set of conventions to store data typing information within XML 1.0 DTDs, and uses native formats within the other schema types. Atomic data types make it possible to store information inside XML documents and move it back and forth reliably between the document and programs' internal representations of that information.

Default Values (for Attributes Only)

In addition to identifying attribute types, attribute type declarations include a default declaration for each attribute. Default values allow schema developers to provide values for attributes, require that document authors provide values, or fix the value permanently. This makes it much easier to ensure that information appears when it should without requiring an enormous amount of extra markup within documents.
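The effect of attribute defaults can be observed with a small sketch. It assumes Python's standard xml.etree.ElementTree module, whose underlying parser is non-validating but does read declarations in a document's internal DTD subset. The memo element and its attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with its DTD in the internal subset.
DOC = """<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!ATTLIST memo
    status   (draft | final) "draft"
    version  CDATA           #FIXED "1.0"
    author   CDATA           #REQUIRED
    reviewer CDATA           #IMPLIED>
]>
<memo author="jsmith">Quarterly numbers attached.</memo>"""

memo = ET.fromstring(DOC)

# The document supplied only "author"; the parser fills in the
# defaulted and fixed values declared in the DTD.
for name in ("author", "status", "version", "reviewer"):
    print(name, "=", memo.get(name))
```

The status and version attributes appear even though the document never mentions them, while reviewer, which has no default, stays absent. Because this parser does not validate, it would not complain if the required author attribute were missing; enforcing that rule takes a validating parser.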

Structured Content Models (for Elements Only)

Elements can support more sophisticated content models that include sub-elements rather than the atomic types mentioned above. Content models serve several purposes, from setting general rules about the content of the document to providing detailed information about the type or sequence of the contents. Content models give schema designers control over the combination of parts that together make up element content.

The two simplest content models are Empty and Any. The Empty content model prohibits instances of an element type from containing any content, including text, white space, or other elements. For example, the HTML IMG element has an empty content model - all of its information is carried in attributes. The Any content model, on the other hand, allows any combination of text, white space, and other elements to appear within the element. (It is also acceptable for an element with a content model of Any to have no content at all.)

Any isn't exactly a license to do anything you like. For the document to be valid, any child elements or entities within an element with an Any content model must be declared, and all of the text within the element must be acceptable XML characters. Use Any content models very carefully, and preferably very rarely or not at all. While they are very useful in the early stages of development for creating prototypes, schemas that use Any content models often lead to documents that can be very difficult to process. Any content models may be useful in areas where designers intend to offer future extensibility, however.

Sometimes elements only need to contain text, and Any is too broad a tool, allowing child elements to be used when perhaps they shouldn’t. Declaring an element that may only contain text, without any declarations for child elements, is typical for 'leaf' elements, which store the document information and situate it within a particular context of parent elements. Such an element can only hold text - no child elements are permitted. (If you want something more specific than just text, see the entry on data typing.)

If a little more information might be needed in that text, like citations or emphasis, a more sophisticated mixed declaration will permit other elements to appear. Mixed declarations allow a more precise statement of when text is permitted to appear as element content, and can limit the child elements that are allowed to appear in element content as well. Once an element is declared to contain both text and child elements, the two can be freely interspersed. However, doing so means giving up some control over how often or in what order the text and elements appear; mixed declarations only permit you to say that elements may appear, not constrain them to appear in a particular sequence or show up a certain number of times.
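As a rough illustration of mixed content, the sketch below declares a hypothetical para element that may contain text, emphasis, and citation in any order, then walks the parsed result. It assumes Python's standard xml.etree.ElementTree module; the element names and the pieces helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Mixed declaration: text plus two permitted child elements, any order, any count.
DOC = """<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | emphasis | citation)*>
  <!ELEMENT emphasis (#PCDATA)>
  <!ELEMENT citation (#PCDATA)>
]>
<para>Schemas are <emphasis>not</emphasis> optional here, as <citation>XML 1.0</citation> makes clear.</para>"""

def pieces(element):
    """Flatten mixed content into (kind, value) pairs in document order."""
    result = []
    if element.text:
        result.append(("text", element.text))
    for child in element:
        result.append((child.tag, child.text or ""))
        if child.tail:  # text that follows the child inside the parent
            result.append(("text", child.tail))
    return result

sequence = pieces(ET.fromstring(DOC))
for kind, value in sequence:
    print(kind, repr(value))
```

The output alternates between text and child elements, which is exactly the freedom a mixed declaration grants: nothing in the declaration says how many emphasis elements may appear or where.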

When you need more precise control over element content, the more formal model for declaring element-only content can be very helpful. XML 1.0 provides a set of symbols that can be used in combination with the names of potential child elements to create sophisticated content models representing complex structures. Two kinds of symbols are available: sequence indicators and occurrence indicators. The sequence indicators identify the order in which elements may appear, while the occurrence indicators identify how often they may appear.

In combination, sequence and occurrence indicators make it possible to describe complex structures concisely.
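In XML 1.0 the sequence indicators are the comma (children appear in the order listed) and the vertical bar (a choice among alternatives). The occurrence indicators are ? (optional), * (zero or more), and + (one or more), with no indicator meaning exactly once. The sketch below is one way to see how these combine. It is not how a validating parser is built; it simply translates an element-content model into a regular expression over child element names. The function names and the sample model are assumptions made for illustration.

```python
import re

# Simplified element-name pattern (no namespace prefixes).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def model_to_regex(model):
    """Translate a DTD element-content model into a compiled regex.

    Each child is matched as its name followed by one space, so
    'para' can never be confused with 'paragraph'.
    """
    pattern = re.sub(r"\s+", "", model)  # white space is not significant
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + " )", pattern)
    pattern = pattern.replace(",", "")   # a sequence is plain concatenation
    # Parentheses, |, ?, * and + mean the same thing in both notations.
    return re.compile(pattern)

def allows(model, children):
    """Return True if the list of child element names satisfies the model."""
    encoded = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(encoded) is not None

# Hypothetical model: a title, then one or more paras or lists, then an optional note.
MODEL = "(title, (para | list)+, note?)"

print(allows(MODEL, ["title", "para", "list", "para"]))
print(allows(MODEL, ["title", "para", "note"]))
print(allows(MODEL, ["title"]))
print(allows(MODEL, ["para", "title"]))
```

The first two child sequences satisfy the model; the third lacks the required para or list, and the fourth has the children in the wrong order.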

Creating Reusable Content with Entities, and Other Advanced Features

XML schemas also provide features for reusing content, both within the schema and in documents. The content may be explicitly declared within the DTD, or stored in a file that is referenced from the DTD. General entities define content that may be used within documents, simplifying tasks like integrating boilerplate content with the parts of a document that change on a regular basis. A special category of these entities, unparsed entities, allows documents to reference external content that isn't stored in XML and explains to the application what type of content it is. Parameter entities, on the other hand, define content that can only be used within a schema, simplifying the task of building modular schemas and facilitating the combination of multiple schemas into larger structures.

General entities may be internal or external, and external entities may in turn be parsed or unparsed. Internal general entities have only a name and a value, assigning that value to the entity name for expansion when documents include a reference with the syntax &entityName;. If you create an entity named product with a value of ZYM-3150, then whenever the entity reference &product; is encountered in documents using this DTD, the parser will replace &product; with ZYM-3150. The value of a general entity must be either plain text or well-formed markup; an element can't begin in an entity and end outside of it, and elements that weren't started in a particular entity can't be ended there.
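The product example above can be tried directly. This sketch assumes Python's standard xml.etree.ElementTree module, which expands internal general entities declared in the internal DTD subset. The announcement element and the open entity are invented for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE announcement [
  <!ENTITY product "ZYM-3150">
]>
<announcement>The &product; ships next month.</announcement>"""

# The parser replaces &product; before the application sees the text.
expanded_text = ET.fromstring(DOC).text
print(expanded_text)

# An entity whose value starts an element it does not finish is not
# well-formed markup, so a reference to it must be rejected.
BROKEN = """<!DOCTYPE announcement [
  <!ENTITY open "<note>">
]>
<announcement>&open;text</note></announcement>"""

try:
    ET.fromstring(BROKEN)
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("rejected:", error)
```

The second document fails because the note element begins inside the entity but ends outside of it.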

External general entities are required to have a name and a System identifier: a file name or URL where the contents of that entity (its value) are stored. They may optionally have a Public identifier, using the Formal Public Identifier syntax used in SGML. Public identifiers are useful for describing general document types, but since there isn't any standardized way to resolve public identifiers to locations, they remain strictly optional. In general, use public identifiers only if a project requires them.

Notations provide identifiers for different kinds of information, and may be used to create new data types for elements and attributes, or to identify unparsed entities. Adding a notation identifier to an external entity declaration changes the entity from a general entity containing text that should be added to the XML file by the parser into an unparsed entity that can only be included by reference in an attribute value. Unparsed entities are commonly used in SGML to represent graphics and other non-textual content that an application may be able to use or present but which can't be included directly within a text document. The notation must be declared, as discussed above, or documents using this schema will not be valid.
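The three kinds of general entity described above (internal, external parsed, and unparsed) differ only in what their declarations carry. The sketch below lists the declarations in a hypothetical DTD and classifies each one. It assumes Python's standard xml.parsers.expat module and its EntityDeclHandler and NotationDeclHandler callbacks; every entity name, file name, and identifier in the sample is invented.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE catalog [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY product "ZYM-3150">
  <!ENTITY legal PUBLIC "-//Example//TEXT Legal Notice//EN" "legal.xml">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ELEMENT catalog EMPTY>
  <!ATTLIST catalog image ENTITY #REQUIRED>
]>
<catalog image="logo"/>"""

notations = {}
entities = {}

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    if value is not None:
        kind = "internal"          # name and value only
    elif notation is not None:
        kind = "unparsed"          # external, with a notation attached
    else:
        kind = "external parsed"   # external, to be read as XML text
    entities[name] = (kind, value, system_id, public_id, notation)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.Parse(DOC, True)

for name, details in entities.items():
    print(name, details)
print(notations)
```

Note that the unparsed logo entity is never referenced with &logo; in the document. It is named in an attribute value of type ENTITY, and it is up to the application to decide what to do with a resource in the gif notation.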

XML Authority supports parameter entities, though it also provides tools (reusables and the Overview pane) that provide parameter entity functionality without the complexity of direct parameter entity handling. Parameter entities operate only within the schema, and are typically used in two ways. Parameter entities may include modularized groups of complete declarations within a larger schema, or they may define fragments of declarations that can be reused within other declarations (like groups of attribute declarations that are applied to multiple elements, or content model fragments that can be used for multiple elements).

XML Authority supports both of these approaches, allowing the creation of simple 'reusables' and the inclusion of external declaration sets for the construction of modular schemas. Reusables can be created for content models, attribute sets, constraints, and plain text.

Like general entities, parameter entities may be internal or external. Internal parameter entities have only a name and a value containing the replacement content, while external parameter entities may have system identifiers and public identifiers. External parameter entities are typically used to import declarations from external files, and XML Authority provides an option for including those external resources in the current schema.

XML 1.0 schemas may also contain processing instructions. Processing instructions provide a mechanism for passing information to the application outside of the element and attribute declarations. Unless you've built your own parser, however, don't expect processing instructions within the schema to have much effect. (XML Authority has its own parser for reading schemas and thus can use processing instructions for keeping track of version history of the schema as a whole, for instance.)

Schemas also need documentation, which XML 1.0 supports with comments. XML Authority's Notes window allows you to annotate your schema with detailed information (like change history and what components really 'mean'), which will be stored as comments. XML Authority uses a set of conventions to help you keep your comments organized.
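Both mechanisms can be seen from the application's side with a short sketch. It assumes Python's standard xml.parsers.expat module, which reports processing instructions and comments found in the internal DTD subset to whatever handlers the application installs. The schema-history target and its contents are hypothetical, and are not XML Authority's actual conventions.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE memo [
  <!-- memo: a short internal message -->
  <?schema-history version="1.2" editor="jsmith"?>
  <!ELEMENT memo (#PCDATA)>
]>
<memo>Hello</memo>"""

instructions = []
comments = []

def on_instruction(target, data):
    # The parser hands over the target and the raw data string;
    # interpreting the data is entirely up to the application.
    instructions.append((target, data))

def on_comment(text):
    comments.append(text)

parser = xml.parsers.expat.ParserCreate()
parser.ProcessingInstructionHandler = on_instruction
parser.CommentHandler = on_comment
parser.Parse(DOC, True)

print(instructions)
print(comments)
```

An application that installs no handlers simply never sees this information, which is why processing instructions in a schema have no effect unless the software reading the schema has been written to look for them.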

Copyright 2000 Extensibility, Inc.

Suite 250, 200 Franklin Street, Chapel Hill, North Carolina 27516