- <!doctype linuxdoc system
- [ <!entity nsgmls "<tt/nsgmls/" >
- <!entity lt "<">
- ]>
-
- <article>
- <title>The HTML Validation HOWTO
- <author>Keith M. Corbett, <htmlurl url="mailto:kmc@specialform.com" name="kmc@specialform.com">
- <date>v0.2, 29 October 1995
-
- <abstract>
- This document explains how to use the &nsgmls parser
- to validate HTML documents for conformance with the
- HTML 2.0 document type definition, or "DTD".
- This DTD is the most commonly accepted SGML-based definition of HTML,
- and thus defines a subset of current practice in HTML markup
- that is likely to be portable to a wide range of
- HTML user agents (browsers).
- </abstract>
-
- <toc>
-
- <sect>Introduction
-
- <p>
- This is a guide to using the &nsgmls parser to
- validate and process HTML documents.
-
- <sect1>Costs and benefits
-
- <p>
- Using the full features of SGML markup will enrich your
- HTML documents.
- However, validating your documents against the HTML DTD
- involves a tradeoff,
- because you are restricted to
- a more circumscribed dialect of HTML than is currently in vogue.
- The "official" HTML rules for enforcing document structure, and the SGML rules
- for data content markup, are more restrictive than current practice on the Web.
-
- The main issue you must consider is that valid HTML is restricted to a
- standard set of element tags.
-
- There isn't an accepted DTD that accurately reflects "browser HTML"
- as understood by many client browser programs.
- For the most part, the HTML 2.0 DTD reflects tags and attributes
- that were commonly in use on the Web around June 1994.
- Various efforts to define a more advanced HTML+ or HTML 3.0 DTD
- have gotten somewhat bogged down.
- And none of the DTDs in circulation will recognize all of the tags
- that have been popularized recently by browser vendors
- such as Netscape and Microsoft.
-
- <sect1>Getting started
-
- <p>
- Contrary to popular opinion, working with SGML does not have to cost a
- lot of time and money.
- It is possible to build a robust development
- environment consisting entirely of software that is freely available
- on a wide range of platforms, including
- Linux, DOS, and most Unix workstations.
- Thanks to a few very dedicated folks, all the tools you need
- to work with SGML have been made publicly available on the Internet.
-
- Setting up your environment (the parser and supporting program
- libraries) takes a bit of work but not nearly as much as one might
- think.
-
- You may also want to peruse an introductory SGML text
- such as "SGML: An Author's Guide to the Standard Generalized Markup
- Language" by Martin Bryan, or "Practical SGML" by Eric van Herwijnen.
-
- <sect>Tools
-
- <sect1>The <tt/HTML Check toolkit/ package
-
- <p>
- If you want a completely self-installing / canned package,
- check out the HalSoft <it/HTML Check Toolkit/ at URL:
- <htmlurl
- url="http://www.halsoft.com/html-tk/index.html"
- name="http://www.halsoft.com/html-tk/index.html">
-
- <p>
- The only disadvantage of using the HalSoft kit is that it uses the
- older <tt/sgmls/ parser, which produces error messages that are
- sometimes (even) more cryptic than those from &nsgmls;.
-
- <p>
- I've used &nsgmls on Linux and Windows (3.x and NT);
- it is supposed to work on many other platforms as well.
-
- <sect1>The &nsgmls parser
-
- <p>
- James Clark has built a software kit called <tt/sp/
- which includes the validating SGML parser, &nsgmls;.
- (This is the successor to the <tt/sgmls/ parser
- which has long been considered the reference parser.)
-
- For information on the <tt/sp/ kit, see URL:
- <htmlurl
- url="http://www.jclark.com/sp.html"
- name="http://www.jclark.com/sp.html">
-
- <p>
- You can download the kit directly from:
- <htmlurl
- url="ftp://ftp.jclark.com/pub/sp/"
- name="ftp://ftp.jclark.com/pub/sp/">
-
- <p>
- You may be able to pick up prebuilt &nsgmls executable files for your platform.
- Otherwise, download the source kit and follow the directions in the
- <tt/README/ file for running <tt/make/.
-
- <p>
- Consider creating a high-level public directory that will contain
- SGML-related files.
- For example, on my Linux PC I have various SGML-related directories, including:
-
- <list>
- <item>/usr/sgml/bin
- <item>/usr/sgml/html
- <item>/usr/sgml/sgmls
- <item>/usr/sgml/sp
- </list>
-
- <sect1>Download the HTML specification materials
-
- <p>
- The draft standard for HTML 2.0 includes SGML definition files you
- need to run the parser, namely the DTD (Document Type Definition),
- SGML Declaration, and entity catalog. To obtain the HTML 2.0 public
- text, see URL:
-
- <htmlurl
- url="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/"
- name="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/">
-
- <p>
- Download and install the following files:
-
- <list>
- <item>DTD <it/html*.dtd/
- <item>SGML declaration <it/html.decl/
- <item>Entity catalog <it/catalog/
- </list>
-
- <p>
- You can add two entries to the HTML entity catalog
- for ease of use with &nsgmls:
-
- <tscreen><code>
- -- catalog: SGML Open style entity catalog for HTML --
- -- $Id: catalog,v 1.2 1994/11/30 23:45:18 connolly Exp $ --
- :
- :
- -- Additions for ease of use with nsgmls --
- SGMLDECL "html.decl"
- DOCTYPE HTML "html.dtd"
- </code>
- </tscreen>
-
- <p>
- Alternatively, you can create a second catalog containing these
- entries; you will have to pass this catalog to &nsgmls as an argument
- with the <tt/-m/ switch.
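-
- <p>
- As a rough sketch of the second-catalog approach, the shell function
- below writes the two convenience entries to a file of their own.
- The function name and the file name <tt/catalog.nsgmls/ are made up
- for illustration. I am also assuming that the file lives in the same
- directory as <tt/html.dtd/ and <tt/html.decl/, so that the relative
- system identifiers resolve as intended.

```shell
# Hypothetical helper: write an add-on catalog that holds only the two
# nsgmls convenience entries. "catalog.nsgmls" is an illustrative name.
# $1 is the directory that already contains html.dtd and html.decl.
write_nsgmls_catalog() {
    dir=$1
    cat > "$dir/catalog.nsgmls" <<'EOF'
  -- Additions for ease of use with nsgmls --
SGMLDECL "html.decl"
DOCTYPE HTML "html.dtd"
EOF
}
```

- <p>
- After running <tt>write_nsgmls_catalog /usr/sgml/html</tt>, the new
- file would be passed to the parser with its own <tt/-m/ switch,
- alongside the one for the unmodified HTML catalog.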
-
- <sect>Parsing an HTML document
-
- <p>The following is a "cookbook" for validating a single document.
- Simply invoke the &nsgmls parser
- and pass it the pathnames of the HTML catalog file(s)
- and the document:
-
- <tscreen> <verb>
- % nsgmls -s -m /usr/sgml/html/catalog <test.html
- </verb></tscreen>
-
- <p>
- The <tt/-s/ switch suppresses the parser's output; see below.
-
- <sect1>Parser input
- <p>
- Your document must conform to SGML, which means, among other things,
- that the document type must be declared at the beginning of the input.
- (You can work around a missing declaration by prepending it to the document
- instance on the &nsgmls; command line.)
-
- Here's a simple HTML document that can be parsed correctly using the
- scheme I've outlined:
-
- <tscreen><code>
- <!doctype html public "-//IETF//DTD HTML 2.0//EN">
- <html>
- <head>
- <title>Simple HTML document.</title>
- </head>
- <body>
- <h1>Test document</h1>
- <p>This is a test document.</p>
- </body>
- </html>
- </code></tscreen>
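-
- <p>
- For documents that lack a document type declaration, the prepending
- trick mentioned above might look like the sketch below. The function
- name and the file names are hypothetical, and the usage shown in the
- comments assumes that the parser concatenates the files named on its
- command line in the order given.

```shell
# Hypothetical helper: write the HTML 2.0 document type declaration to
# a file of its own, so it can be named ahead of a document lacking one.
# $1 is the path of the file to create.
write_doctype_file() {
    printf '%s\n' '<!doctype html public "-//IETF//DTD HTML 2.0//EN">' > "$1"
}

# Example use (bare.html is an illustrative document with no doctype):
#   write_doctype_file /tmp/html2-doctype.sgml
#   nsgmls -s -m /usr/sgml/html/catalog /tmp/html2-doctype.sgml bare.html
```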
-
- <sect1>Parser output
- <p>
- The standard output of &nsgmls is a digested form of the SGML input
- that downstream processing systems can read, in place of the raw markup,
- to navigate the structure of the document.
- For the purpose of validation, you can throw the standard output away
- and rely on the error output.
-
- <p>
- If you do want the full output, omit the <tt/-s/ switch
- and pipe standard output to a file:
-
- <tscreen> <verb>
- % nsgmls -m /usr/sgml/html/catalog <test.html >test.out
- </verb></tscreen>
-
- <sect1>Parser messages
-
- <p>
- Error and warning messages from &nsgmls can be very cryptic,
- and a document with illegal markup can produce a great many of them.
-
- <p>
- To redirect messages to a file, use the <tt/-f/ switch:
-
- <tscreen> <verb>
- % nsgmls -s -m /usr/sgml/html/catalog -f test.err <test.html
- </verb></tscreen>
-
- <sect1>Return status
-
- <p>The parser indicates whether the input document conforms to the
- HTML DTD in two ways:
-
- <list>
- <item>Return code - the parser returns a 0 exit status on success,
- non-zero otherwise.
- <item>Output - if the document conforms to the DTD, the last line of
- standard output (when the <tt/-s/ switch is omitted) will consist of a single <tt/C/ character.
- </list>
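-
- <p>
- Both indicators lend themselves to scripting. The sketch below is one
- possible way to use them, not a tested recipe: <tt/is_conforming_output/
- inspects a saved copy of the parser's standard output, and
- <tt/validate_all/ checks a batch of documents by exit status alone,
- writing each document's messages to a <tt/.err/ file with the <tt/-f/
- switch. The function names and the <tt/NSGMLS/ and <tt/CATALOG/
- variables are my own inventions; the catalog path is the one used in
- the examples above.

```shell
# NSGMLS and CATALOG are illustrative variables, not part of the sp kit.
NSGMLS=${NSGMLS:-nsgmls}
CATALOG=${CATALOG:-/usr/sgml/html/catalog}

# Succeeds if the last line of a saved nsgmls output file is a lone "C".
# $1 is a file captured without the -s switch, e.g. test.out.
is_conforming_output() {
    [ "$(tail -n 1 "$1")" = "C" ]
}

# Validate each named document, relying only on the parser's exit status.
# Messages for foo.html are written to foo.html.err via the -f switch.
# Returns non-zero if any document failed.
validate_all() {
    failed=0
    for doc in "$@"; do
        if "$NSGMLS" -s -m "$CATALOG" -f "$doc.err" <"$doc"; then
            echo "valid: $doc"
        else
            echo "INVALID: $doc (see $doc.err)"
            failed=1
        fi
    done
    return "$failed"
}
```

- <p>
- For example, <tt>validate_all *.html</tt> would report each document in
- the current directory, and the function's own return status could in
- turn be tested by a calling script or a <tt/make/ rule.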
-
- <sect>Resources
-
- <p>
- The HalSoft <it/HTML Check Toolkit/ is at URL:
- <htmlurl
- url="http://www.halsoft.com/html-tk/index.html"
- name="http://www.halsoft.com/html-tk/index.html">
-
- <p>
- James Clark's page on <tt/sp/ is at URL:
- <htmlurl
- url="http://www.jclark.com/sp.html"
- name="http://www.jclark.com/sp.html">
-
- <p>
- The W3C page on the HTML specification is at URL:
- <htmlurl
- url="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/"
- name="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/">
-
- <p>
- Feel free to contact me via email: <htmlurl url="mailto:kmc@specialform.com" name="kmc@specialform.com">.
-
- </article>
-