2. HTML Specification

2.16 Character Data

2.16.1 - Special Characters
2.16.2 - Control Characters
2.16.3 - Numeric Character References
2.16.4 - Character Entities

Level 0

The characters between HTML tags represent text encoded according to the ISO 8859/1 8-bit single-byte coded graphic character set, known as Latin Alphabet No. 1 or simply Latin-1. There are 256 character positions in the Latin-1 encoding. Latin-1 includes characters from most Western European languages. It consists of the space character, 186 characters that form a subset of the graphic characters in ISO 6937/2 (1983), and four additional characters that are intended for inclusion in ISO 6937/2. Also see Section 2.4.

The lower 128 character positions include a space, 33 control characters, the 26 uppercase and 26 lowercase letters of the English alphabet, 10 numerals, and 32 other printing characters. This subset, functionally identical to ASCII, is defined by the ISO 646 7-bit coded character set for information interchange, also known as the International Reference Version. ISO 646 is identical in most respects to the ANSI standard for ASCII (American Standard Code for Information Interchange). The only significant difference between ISO 646 and ASCII is the specific names assigned to the control characters in positions 00-31 and 127.
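
The counts given for the lower 128 positions can be checked mechanically. The following is a minimal sketch in Python, not part of the specification; the function name `classify` and the category labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def classify(code: int) -> str:
    """Classify one of the lower 128 positions (0-127)."""
    ch = chr(code)
    if code < 32 or code == 127:
        return "control"          # positions 00-31 and 127
    if ch == " ":
        return "space"
    if ch.isalpha():
        return "letter"           # A-Z and a-z
    if ch.isdigit():
        return "numeral"          # 0-9
    return "other printing"

# Tally every position in the lower half of the code table.
counts = Counter(classify(code) for code in range(128))
for name, n in sorted(counts.items()):
    print(name, n)
```

The five tallies add up to 128, matching the breakdown in the paragraph above.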

The upper 128 positions include a non-breaking space, a soft hyphen indicator, 93 graphical characters, 8 unassigned characters, and 25 control characters. Because non-breaking space and soft hyphen indicator are not recognized and interpreted by all HTML user agents, their use is discouraged.

There are 58 character positions occupied by control characters. See Section 2.16.2 for details on the interpretation of control characters.

Because certain special characters are subject to interpretation and special processing, information providers and HTML user agent implementors should follow the guidelines in Section 2.16.1.

Certain characters may not be accessible from your keyboard, or some part of your system (e.g., translation software) may not be equipped to deal with 8-bit character codes. HTML and many HTML user agents provide character entity references (see Section 2.17.2) and numeric character references (see Section 2.17.3) to facilitate the entry and interpretation of characters by name and by numerical position.

Because certain characters will be interpreted as markup, they must be represented by markup as described in Section 2.16.3 and Section 2.16.4.

2.16.1 Special Characters

Certain characters have special meaning in HTML documents. There are two printing characters which may be interpreted by an HTML application to have an effect on the format of the text:

Space

Interpreted as a word space (place where a line can be broken) in all contexts except the Preformatted Text element.
Interpreted as a nonbreaking space within the Preformatted Text element.

Hyphen

Interpreted as a hyphen glyph in all contexts.
Interpreted as a potential word space by a hyphenation engine.

2.16.2 Control Characters

Control characters are non-printable characters that are typically used for communication and device control, as format effectors, and as information separators.

In SGML applications, the use of control characters is limited in order to maximize the chance of successful interchange over heterogeneous networks and operating systems. In HTML, only three control characters are used. The valid control characters and their interpretation are:

Horizontal Tab (HT - 9 dec)

Interpreted as a word space in all contexts except preformatted text.
Within preformatted text, the tab should be interpreted to shift the horizontal column position to the next position which is a multiple of 8 on the same line; that is, col := (col+8) - (col mod 8)

Line Feed (LF - 10 dec)

Interpreted as a word space in all contexts except preformatted text.
Within the Preformatted Text element, the line feed should be interpreted as a shift to the start of a new line; that is, col := 0; row := row+1

Carriage Return (CR - 13 dec)

Interpreted as a word space in all contexts.
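
The treatment of these three characters inside preformatted text can be sketched as a small cursor model. This is an illustrative Python sketch only, under stated assumptions: columns and rows are counted from 0, "the next position which is a multiple of 8" is read as col + 8 - (col mod 8), and a carriage return, like any other character, is assumed to occupy one column. The names `advance` and `final_position` are hypothetical.

```python
from typing import Tuple

HT, LF, CR = 9, 10, 13   # decimal codes of the three valid control characters
TAB_WIDTH = 8

def advance(col: int, row: int, code: int) -> Tuple[int, int]:
    """Return the cursor position after one character of preformatted text."""
    if code == HT:
        # Move to the next column that is a multiple of 8.
        return col + TAB_WIDTH - (col % TAB_WIDTH), row
    if code == LF:
        # Start of a new line: col := 0; row := row+1
        return 0, row + 1
    # Assumption: CR (a word space) and every other character take one column.
    return col + 1, row

def final_position(text: str) -> Tuple[int, int]:
    """Walk a string and report where the cursor ends up."""
    col, row = 0, 0
    for ch in text:
        col, row = advance(col, row, ord(ch))
    return col, row

print(final_position("ab\tc"))   # (9, 0)
print(final_position("abc\nd"))  # (1, 1)
```

A tab at column 2 moves the cursor to column 8, and a tab at column 8 moves it to column 16, so the column after a tab is always a multiple of 8.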

2.16.3 Numeric Character References

Any printing character within the 8-bit character encoding of ISO 8859/1 (256 character positions) or the 7-bit character encoding of ISO 646 (128 character positions) may be represented within the text of an HTML document by a numeric character reference. See Section 2.17.1 for a list of the characters, their names and input syntax.

Two reasons for using a numeric character reference:

the keyboard does not provide a key for the character, such as on U.S. keyboards which do not provide European characters
the character may be interpreted as SGML coding, such as the ampersand (&), double quote ("), less-than (<), and greater-than (>) characters

Numeric character references are represented in an HTML document as SGML entities whose name is a number sign (#) followed by a number in the range 32-126 or 161-255. The HTML DTD includes a numeric character reference for each of the printing characters in Latin-1, so that one may reference them by number if it is inconvenient to enter them directly:

the ampersand (&#38;), double quote (&#34;),
less-than (&#60;), and greater-than (&#62;) characters
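
Producing numeric character references from plain text can be sketched as follows. This is a minimal Python illustration, not a definitive implementation; the function name `to_numeric_refs` and the choice to rewrite only the four markup characters plus positions 161-255 are assumptions made for the example. All other characters are passed through unchanged.

```python
MARKUP_CHARS = '&"<>'   # characters that may be interpreted as SGML coding

def to_numeric_refs(text: str) -> str:
    """Replace markup characters and upper-half printing characters
    with numeric character references of the form &#N;."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in MARKUP_CHARS or 161 <= code <= 255:
            out.append(f"&#{code};")
        else:
            out.append(ch)
    return "".join(out)

print(to_numeric_refs('a < b & "c"'))  # a &#60; b &#38; &#34;c&#34;
```

Because the ampersand itself is rewritten as &#38;, the output contains no bare markup characters that a parser could misread.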

2.16.4 Character Entities

Many of the Latin Alphabet No. 1 set of printing characters may be represented within the text of an HTML document by a character entity. See Section 2.17.2 for a list of the characters, names, input syntax, and descriptions. See Section 5.2.1 for the SGML entity definitions of "Added Latin 1 for HTML".

Two reasons for using a character entity:

the keyboard does not provide a key for the character, such as on U.S. keyboards which do not provide European characters
the character may be interpreted as SGML coding, such as the ampersand (&), double quote ("), less-than (<), and greater-than (>) characters

A character entity is represented in an HTML document as an SGML entity whose name is defined in the HTML DTD. The HTML DTD includes a character entity for each of the SGML markup characters and for each of the printing characters in the upper half of Latin-1, so that one may reference them by name if it is inconvenient to enter them directly:

the ampersand (&amp;), double quote (&quot;),
less-than (&lt;), and greater-than (&gt;) characters
Kurt G&ouml;del was a famous logician and mathematician.
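
The same substitution by name can be sketched in Python. The table `ENTITY_NAMES` below is a hypothetical, deliberately partial mapping covering only the entities that appear in the examples above; a complete table would list every entity defined in the HTML DTD. The standard-library function `html.unescape` is used only to show the reverse direction.

```python
import html

# Partial, illustrative table: character -> entity name.
ENTITY_NAMES = {
    "&": "amp",
    '"': "quot",
    "<": "lt",
    ">": "gt",
    "\u00f6": "ouml",   # o with diaeresis, as in the example sentence above
}

def to_entity_refs(text: str) -> str:
    """Replace each character found in the table with &name;."""
    return "".join(
        f"&{ENTITY_NAMES[ch]};" if ch in ENTITY_NAMES else ch
        for ch in text
    )

encoded = to_entity_refs('<a href="x">')
print(encoded)                 # &lt;a href=&quot;x&quot;&gt;
print(html.unescape(encoded))  # <a href="x">
```

Characters absent from the table are left as they are, so a fuller implementation would need either a complete entity table or a fallback to numeric character references.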

NOTE: To ensure that a string of characters is not interpreted as markup, represent all occurrences of <, >, and & by character or entity references.
NOTE: There are SGML features, CDATA and RCDATA, to allow most <, >, and & characters to be entered without the use of entity or character references. Because these features tend to be used and implemented inconsistently, and because they require 8-bit characters to represent non-ASCII characters, they are not used in this version of the HTML DTD. An earlier HTML specification included an Example element (<XMP>) whose syntax is not expressible in SGML. No markup was recognized inside of the Example element except the </XMP> end tag. While HTML user agents are encouraged to support this idiom, its use is deprecated.

HTML 2.0 Specification (Internet Draft) - 29 NOV 94
