home *** CD-ROM | disk | FTP | other *** search
Text File | 2003-06-11 | 89.7 KB | 2,412 lines |
-
-
-
-
-
-
- Network Working Group F. Yergeau
- Request for Comments: 2070 Alis Technologies
- Category: Standards Track G. Nicol
- Electronic Book Technologies
- G. Adams
- Spyglass
- M. Duerst
- University of Zurich
- January 1997
-
-
- Internationalization of the Hypertext Markup Language
-
- Status of this Memo
-
- This document specifies an Internet standards track protocol for the
- Internet community, and requests discussion and suggestions for
- improvements. Please refer to the current edition of the "Internet
- Official Protocol Standards" (STD 1) for the standardization state
- and status of this protocol. Distribution of this memo is unlimited.
-
- Abstract
-
- The Hypertext Markup Language (HTML) is a markup language used to
- create hypertext documents that are platform independent. Initially,
- the application of HTML on the World Wide Web was seriously
- restricted by its reliance on the ISO-8859-1 coded character set,
- which is appropriate only for Western European languages. Despite
- this restriction, HTML has been widely used with other languages,
- using other coded character sets or character encodings, at the
- expense of interoperability.
-
- This document is meant to address the issue of the
- internationalization (i18n, i followed by 18 letters followed by n)
- of HTML by extending the specification of HTML and giving additional
- recommendations for proper internationalization support. A foremost
- consideration is to make sure that HTML remains a valid application
- of SGML, while enabling its use with all languages of the world.
-
- Table of Contents
-
- 1. Introduction .................................................. 2
- 1.1. Scope ...................................................... 2
- 1.2. Conformance ................................................ 3
- 2. The document character set ..................................... 4
- 2.1. Reference processing model ................................. 4
- 2.2. The document character set ................................. 6
- 2.3. Undisplayable characters ................................... 8
-
-
-
- Yergeau, et. al. Standards Track [Page 1]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 3. The LANG attribute.............................................. 8
- 4. Additional entities, attributes and elements ................... 9
- 4.1. Full Latin-1 entity set .................................... 9
- 4.2. Markup for language-dependent presentation ................ 10
- 5. Forms ..........................................................16
- 5.1. DTD additions ..............................................16
- 5.2. Form submission ............................................17
- 6. External character encoding issues .............................18
- 7. HTML public text ...............................................20
- 7.1. HTML DTD ...................................................20
- 7.2. SGML declaration for HTML ..................................35
- 7.3. ISO Latin 1 character entity set ...........................37
- 8. Security Considerations.........................................40
- Bibliography ......................................................40
- Authors' Addresses ................................................43
-
- 1. Introduction
-
- The Hypertext Markup Language (HTML) is a markup language used to
- create hypertext documents that are platform independent. Initially,
- the application of HTML on the World Wide Web was seriously
- restricted by its reliance on the ISO-8859-1 coded character set,
- which is appropriate only for Western European languages. Despite
- this restriction, HTML has been widely used with other languages,
- using other coded character sets or character encodings, through
- various ad hoc extensions to the language [TAKADA].
-
- This document is meant to address the issue of the
- internationalization of HTML by extending the specification of HTML
- and giving additional recommendations for proper internationalization
- support. It is in good part based on a paper by one of the authors
- on multilingualism on the WWW [NICOL]. A foremost consideration is
- to make sure that HTML remains a valid application of SGML, while
- enabling its use with all languages of the world.
-
- The specific issues addressed are the SGML document character set to
- be used for HTML, the proper treatment of the charset parameter
- associated with the "text/html" content type and the specification of
- some additional elements and entities.
-
- 1.1 Scope
-
- HTML has been in use by the World-Wide Web (WWW) global information
- initiative since 1990. This specification extends the capabilities
- of HTML 2.0 (RFC 1866), primarily by removing the restriction to the
- ISO-8859-1 coded character set [ISO-8859].
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 2]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- HTML is an application of ISO Standard 8879:1986, Information
- Processing Text and Office Systems -- Standard Generalized Markup
- Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
- is a formal definition of the HTML syntax in terms of SGML. This
- specification amends the DTD of HTML 2.0 in order to make it
- applicable to documents encompassing a character repertoire much
- larger than that of ISO-8859-1, while still remaining SGML
- conformant.
-
- Both formal and actual development of HTML are advancing very fast.
- The features described in this document are designed so that they can
- (and should) be added to other forms of HTML besides that described
- in RFC 1866. Where indicated, attributes introduced here should be
- extended to the appropriate elements.
-
- 1.2 Conformance
-
- This specification changes slightly the conformance requirements of
- HTML documents and HTML user agents.
-
- 1.2.1 Documents
-
- All HTML 2.0 conforming documents remain conforming with this
- specification. However, the extensions introduced here make valid
- certain documents that would not be HTML 2.0 conforming, in
- particular those containing characters or character references
- outside of the repertoire of ISO 8859-1, and those containing markup
- introduced herein.
-
- 1.2.2. User agents
-
- In addition to the requirements of RFC 1866, the following
- requirements are placed on HTML user agents.
-
- To ensure interoperability and proper support for at least ISO-
- 8859-1 in an environment where character encoding schemes other
- than ISO-8859-1 are present, user agents MUST correctly interpret
- the charset parameter accompanying an HTML document received from
- the network.
-
- Furthermore, conforming user-agents MUST at least parse correctly
- all numeric character references within the range of ISO 10646-1
- [ISO-10646].
-
- Conforming user-agents are required to apply the BIDI presentation
- algorithm if they display right-to-left characters. If there is
- no displayable right-to-left character in a document, there is no
- need to apply BIDI processing.
-
-
-
- Yergeau, et. al. Standards Track [Page 3]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 2. The document character set
-
- 2.1. Reference processing model
-
- This overview explains a reference processing model used for HTML,
- and in particular the SGML concept of a document character set. An
- actual implementation may widely differ in its internal workings from
- the model given below, but should behave as described to an outside
- observer.
-
- Because there are various widely differing encodings of text, SGML
- does not directly address how the sequence of characters that
- constitutes an SGML document in the abstract sense are encoded by
- means of a sequence of octets (or occasionally bit groups of another
- length than 8) in a concrete realization of the document such as a
- computer file. This encoding is called the external character
- encoding of the concrete SGML document, and it should be carefully
- distinguished from the document character set of the abstract HTML
- document. SGML views the characters as a single set (called a
- "character repertoire"), and a "code set" that assigns an integer
- number (known as "character number") to each character in the
- repertoire. The document character set declaration defines what each
- of the character numbers represents [GOLD90, p. 451]. In most cases,
- an SGML DTD and all documents that refer to it have a single document
- character set, and all markup and data characters are part of this
- set.
-
- HTML, as an application of SGML, does not directly address the
- question of the external character encoding. This is deferred to
- mechanisms external to HTML, such as MIME as used by the HTTP
- protocol or by electronic mail.
-
- For the HTTP protocol [RFC2068], the external character encoding is
- indicated by the "charset" parameter of the "Content-Type" field of
- the header of an HTTP response. For example, to indicate that the
- transmitted document is encoded in the "JUNET" encoding of Japanese
- [RFC1468], the header will contain the following line:
-
- Content-Type: text/html; charset=ISO-2022-JP
-
- The term "charset" in MIME is used to designate a character encoding,
- rather than merely a coded character set as the term may suggest. A
- character encoding is a mapping (possibly many-to-one) of sequences
- of octets to sequences of characters taken from one or more character
- repertoires.
-
- The HTTP protocol also defines a mechanism for the client to specify
- the character encodings it can accept. Clients and servers are
-
-
-
- Yergeau, et. al. Standards Track [Page 4]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- strongly requested to use these mechanisms to assure correct
- transmission and interpretation of any document. Provisions that can
- be taken to help correct interpretation, even in cases where a server
- or client do not yet use these mechanisms, are described in section
- 6.
-
- Similarly, if HTML documents are transferred by electronic mail, the
- external character encoding is defined by the "charset" parameter of
- the "Content-Type" MIME header field [RFC2045], and defaults to US-
- ASCII in its absence.
-
- No mechanisms are currently standardized for indicating the external
- character encoding of HTML documents transferred by FTP or accessed
- in distributed file systems.
-
- In the case any other way of transferring and storing HTML documents
- are defined or become popular, it is advised that similar provisions
- be made to clearly identify the character encoding used and/or to use
- a single/default encoding capable of representing the widest range of
- characters used in an international context.
-
- Whatever the external character encoding may be, the reference
- processing model translates it to the document character set
- specified in Section 2.2 before processing specific to SGML/HTML.
- The reference processing model can be depicted as follows:
-
- [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
- [manager] [parser]
- ^ |
- | |
- +----------+
-
- The decoder is responsible for decoding the external representation
- of the resource to the document character set. The entity manager,
- the parser, and the application deal only with characters of the
- document character set. A display-oriented part of the application
- or the display machinery itself may again convert characters
- represented in the document character set to some other
- representation more suitable for their purpose. In any case, the
- entity manager, the parser, and the application, as far as character
- semantics are concerned, are using the HTML document character set
- only.
-
- An actual implementation may choose, or not, to translate the
- document into some encoding of the document character set as
- described above; the behaviour described by this reference processing
- model can be achieved otherwise. This subject is well out of the
- scope of this specification, however, and the reader is invited to
-
-
-
- Yergeau, et. al. Standards Track [Page 5]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88]
- [GOLD90] [VANH90] [SQ91] for further information.
-
- The most important consequence of this reference processing model is
- that numeric character references are always resolved with respect to
- the fixed document character set, and thus to the same characters,
- whatever the external encoding actually used. For an example, see
- Section 2.2.
-
- 2.2. The document character set
-
- The document character set, in the SGML sense, is the Universal
- Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.
- Currently, this is code-by-code identical with the Unicode standard,
- version 1.1 [UNICODE].
-
- NOTE -- implementers should be aware that ISO 10646 is amended
- from time to time; 4 amendments have been adopted since the
- initial 1993 publication, none of which significantly affects this
- specification. A fifth amendment, now under consideration, will
- introduce incompatible changes to the standard: 6556 Korean Hangul
- syllables allocated between code positions 3400 and 4DFF
- (hexadecimal) will be moved to new positions (and 4516 new
- syllables added), thus making references to the old positions
- invalid. Since the Unicode consortium has already adopted a
- corresponding amendment for inclusion in the forthcoming Unicode
- 2.0, adoption of DAM 5 is considered likely and implementers
- should probably consider the old code positions as already
- invalid. Despite this one-time change, the relevant standard
- bodies have committed themselves not to change any allocated code
- position in the future. To encode Korean Hangul irrespective of
- these changes, the conjoining Hangul Jamo in the range 1110-11F9
- can be used.
-
- The adoption of this document character set implies a change in the
- SGML declaration specified in the HTML 2.0 specification (section 9.5
- of [RFC1866]). The change amounts to removing the first BASESET
- specification and its accompanying DESCSET declaration, replacing
- them with the following declaration:
-
-
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 6]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- BASESET "ISO Registration Number 177//CHARSET
- ISO/IEC 10646-1:1993 UCS-4 with implementation level 3
- //ESC 2/5 2/15 4/6"
- DESCSET 0 9 UNUSED
- 9 2 9
- 11 2 UNUSED
- 13 1 13
- 14 18 UNUSED
- 32 95 32
- 127 1 UNUSED
- 128 32 UNUSED
- 160 2147483486 160
-
- Making the UCS the document character set does not create non-
- conformance of any expression, construct or document that is
- conforming to HTML 2.0. It does make conforming certain constructs
- that are not admissible in HTML 2.0. One consequence is that data
- characters outside the repertoire of ISO-8859-1, but within that of
- UCS-4 become valid SGML characters. Another is that the upper limit
- of the range of numeric character references is extended from 255 to
- 2147483645; thus, И is a valid reference to a "CYRILLIC CAPITAL
- LETTER I". [ERCS] is a good source of information on Unicode and
- SGML, although its scope and technical content differ greatly from
- this specification.
-
- NOTE -- the above SGML declaration, like that of HTML 2.0,
- specifies the character numbers 128 to 159 (80 to 9F hex) as
- UNUSED. This means that numeric character references within that
- range (e.g. ) are illegal in HTML. Neither ISO 8859-1 nor
- ISO 10646 contain characters in that range, which is reserved for
- control characters.
-
- Another change was made from the HTML 2.0 SGML declaration, in the
- belief that the latter did not express its authors' true intent. The
- syntax character set declaration was changed from ISO 646.IRV:1983 to
- the newer ISO 646.IRV:1991, the latter, but not the former, being
- identical with US-ASCII. In principle, this introduces an
- incompatibility with HTML 2.0, but in practice it should increase
- interoperability by i) having the SGML declaration say what everyone
- thinks and ii) making the syntax character set a proper subset of the
- document character set. The characters that differ between the two
- versions of ISO 646.IRV are not actually used to express HTML syntax.
-
- ISO 10646-1:1993 is the most encompassing character set currently
- existing, and there is no other character set that could take its
- place as the document character set for HTML. If nevertheless for a
- specific application there is a need to use characters outside this
- standard, this should be done by avoiding any conflicts with present
-
-
-
- Yergeau, et. al. Standards Track [Page 7]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- or future versions of ISO 10646, i.e. by assigning these characters
- to a private zone of the UCS-4 coding space [ISO-10646 section 11].
- Also, it should be borne in mind that such a use will be highly
- unportable; in many cases, it may be better to use inline bitmaps.
-
- 2.3. Undisplayable characters
-
- With the document character set being the full ISO 10646, the
- possibility that a character cannot be displayed due to lack of
- appropriate resources (fonts) cannot be avoided. Because there are
- many different things that can be done in such a case, this document
- does not prescribe any specific behaviour. Depending on the
- implementation, this may also be handled by the underlaying display
- system and not the application itself. The following considerations,
- however, may be of help:
-
- - A clearly visible, but unobtrusive behaviour should be preferred.
- Some documents may contain many characters that cannot be
- rendered, and so showing an alert for each of them is not the
- right thing to do.
-
- - In case a numeric representation of the missing character is
- given, its hexadecimal (not decimal) form is to be preferred,
- because this form is used in character set standards [ERCS].
-
- 3. The LANG attribute
-
- Language tags can be used to control rendering of a marked up
- document in various ways: glyph disambiguation, in cases where the
- character encoding is not sufficient to resolve to a specific glyph;
- quotation marks; hyphenation; ligatures; spacing; voice synthesis;
- etc. Independently of rendering issues, language markup is useful as
- content markup for purposes such as classification and searching.
-
- Since any text can logically be assigned a language, almost all HTML
- elements admit the LANG attribute. The DTD reflects this; the only
- elements in this version of HTML without the LANG attribute are BR,
- HR, BASE, NEXTID, and META. It is also intended that any new element
- introduced in later versions of HTML will admit the LANG attribute,
- unless there is a good reason not to do so.
-
- The language attribute, LANG, takes as its value a language tag that
- identifies a natural language spoken, written, or otherwise conveyed
- by human beings for communication of information to other human
- beings. Computer languages are explicitly excluded.
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 8]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- The syntax and registry of HTML language tags is the same as that
- defined by RFC 1766 [RFC1766]. In summary, a language tag is composed
- of one or more parts: A primary language tag and a possibly empty
- series of subtags:
-
- language-tag = primary-tag *( "-" subtag )
- primary-tag = 1*8ALPHA
- subtag = 1*8ALPHA
-
- Whitespace is not allowed within the tag and all tags are case-
- insensitive. The namespace of language tags is administered by the
- IANA. Example tags include:
-
- en, en-US, en-cockney, i-cherokee, x-pig-latin
-
- In the context of HTML, a language tag is not to be interpreted as a
- single token, as per RFC 1766, but as a hierarchy. For example, a
- user agent that adjusts rendering according to language should
- consider that it has a match when a language tag in a style sheet
- entry matches the initial portion of the language tag of an element.
- An exact match should be preferred. This interpretation allows an
- element marked up as, for instance, "en-US" to trigger styles
- corresponding to, in order of preference, US-English ("en-US") or
- 'plain' or 'international' English ("en").
-
- NOTE -- using the language tag as a hierarchy does not imply that
- all languages with a common prefix will be understood by those
- fluent in one or more of those languages; it simply allows the
- user to request this commonality when it is true for that user.
-
- The rendering of elements may be affected by the LANG attribute. For
- any element, the value of the LANG attribute overrides the value
- specified by the LANG attribute of any enclosing element and the
- value (if any) of the HTTP Content-Language header. If none of these
- are set, a suitable default, perhaps controlled by user preferences,
- by automatic context analysis or by the user's locale, should be used
- to control rendering.
-
- 4. Additional entities, attributes and elements
-
- 4.1. Full Latin-1 entity set
-
- According to the suggestion of section 14 of [RFC1866], the set of
- Latin-1 entities is extended to cover the whole right part of ISO-
- 8859-1 (all code positions with the high-order bit set), including
- the already commonly used , © and ®. The names of the
- entities are taken from the appendices of SGML [ISO-8879]. A list is
- provided in section 7.3 of this specification.
-
-
-
- Yergeau, et. al. Standards Track [Page 9]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 4.2. Markup for language-dependent presentation
-
- 4.2.1. Overview
-
- For the correct presentation of text in certain languages
- (irrespective of formatting issues), some support in the form of
- additional entities and elements is needed.
-
- In particular, the following features are dealt with:
-
- - Markup of bidirectional text, i.e. text where left-to-right and
- right-to-left scripts are mixed.
-
- - Control of cursive joining behaviour in contexts where the
- default behaviour is not appropriate.
-
- - Language-dependent rendering of short (in-line) quotations.
-
- - Better justification control for languages where this is
- important.
-
- - Superscripts and subscripts for languages where they appear as
- part of general text.
-
- Some of the above features need very little additional support;
- others need more. The additional features are introduced below with
- brief comments only. Explanations on cursive joining behaviour and
- bidirectional text follow later. For cursive joining behaviour and
- bidirectional text, this document follows [UNICODE] in that: i)
- character semantics, where applicable, are identical to [UNICODE],
- and ii) where functionality is moved to HTML as a higher level
- protocol, this is done in a way that allows straightforward
- conversion to the lower-level mechanisms defined in [UNICODE].
-
- 4.2.2. List of entities, elements, and attributes
-
- First, a generic container is needed to carry the LANG and DIR (see
- below) attributes in cases where no other element is appropriate; the
- SPAN element is introduced for that purpose.
-
-
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 10]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- A set of named character entities is added for use with bidirectional
- rendering and cursive joining control:
-
- <!ENTITY zwnj CDATA ""--=zero width non-joiner-->
- <!ENTITY zwj CDATA ""--=zero width joiner-->
- <!ENTITY lrm CDATA ""--=left-to-right mark-->
- <!ENTITY rlm CDATA ""--=right-to-left mark-->
-
- These entities can be used in place of the corresponding formatting
- characters whenever convenient, for example to ease keyboard entry or
- when a formatting character is not available in the character
- encoding of the document.
-
- Next, an attribute called DIR is introduced, restricted to the values
- LTR (left-to-right) and RTL (right-to-left), for the indication of
- directionality in the context of bidirectional text (see 4.2.4 below
- for details). Since any text and many other elements (e.g. tables)
- can logically be assigned a directionality, all elements except BR,
- HR, BASE, NEXTID, and META admit this attribute. The DTD reflects
- this. It is also intended that any new element introduced in later
- versions of HTML will admit the DIR attribute, unless there is a good
- reason not to do so.
-
- A new phrase-level element called BDO (BIDI Override) is introduced,
- which requires the DIR attribute to specify whether the override is
- left-to-right or right-to-left. This element is required for
- bidirectional text control; for detailed explanations, see section
- 4.2.4.
-
- The phrase-level element Q is introduced to allow language-dependent
- rendering of short quotations depending on language and platform
- capability. As the following examples show (rather poorly, because of
- the character set restriction of Internet specifications), the
- quotation marks surrounding the quotation are particularly affected:
- "a quotation in English", `another, slightly better one', ,,a
- quotation in German'', << a quotation in French >>. The contents of
- the Q element does not include quotation marks, which have to be
- added by the rendering process.
-
- NOTE -- Q elements can be nested. Many languages use different
- quotation styles for outer and inner quotations, and this should
- be respected by user-agents implementing this element.
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 11]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- NOTE -- minimal support for the Q element is to surround the
- contents with some kind of quotes, like the plain ASCII double
- quotes. As this is rather easy to implement, and as the lack of
- any visible quotes may affect the perceived meaning of the text,
- user-agent implementors are strongly requested to provide at least
- this minimal level of support.
-
- Many languages require superscript text for proper rendering: as an
- example, the French "Mlle Dupont" should have "lle" in superscript.
- The SUP element, and its sibling SUB for subscript text, are
- introduced to allow proper markup of such text. SUP and SUB contents
- are restricted to PCDATA to avoid nesting problems.
-
- Finally, in many languages text justification is much more important
- than it is in Western languages, and justifies markup. The ALIGN
- attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is
- added to a selection of elements where it makes sense (the block-like
- P, HR, H1 to H6, OL, UL, DIR, MENU, LI, BLOCKQUOTE and ADDRESS). If
- a user-agent chooses to have LEFT as a default for blocks of left-
- to-right directionality, it should use RIGHT for blocks of right-to-
- left directionality.
-
- NOTE -- RFC 1866 section 4.2.2 specifies that an HTML user agent
- should treat an end of line as a word space, except in
- preformatted text. This should be interpreted in the context of
- the script being processed, as the way words are separated in
- writing is script-dependent. For some scripts (e.g. Latin), a
- word space is just a space, but in other scripts (e.g. Thai) it is
- a zero-width word separator, whereas in yet other scripts (e.g.
- Japanese) it is nothing at all, i.e. totally ignored.
-
- NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention
- from user-agent implementers. It is present in many character
- sets (including the whole ISO 8859 series and, of course, ISO
- 10646), and can always be included by means of the reference
- . Its semantics are different from the plain HYPHEN: it
- indicates a point in a word where a line break is allowed. If the
- line is indeed broken there, a hyphen must be displayed at the end
- of the first line. If not, the character is not dispalyed at all.
- In operations like searching and sorting, it must always be
- ignored.
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 12]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- In the DTD, the LANG and DIR attributes are grouped together in a
- parameter entity called attrs. To parallel RFC 1942 [RFC1942], the
- ID and CLASS attributes are also included in attrs. The ID and CLASS
- attributes are required for use with style sheets, and RFC 1942
- defines them as follows:
-
- ID Used to define a document-wide identifier. This can be used
- for naming positions within documents as the destination of a
- hypertext link. It may also be used by style sheets for
- rendering an element in a unique style. An ID attribute value is
- an SGML NAME token. NAME tokens are formed by an initial
- letter followed by letters, digits, "-" and "." characters. The
- letters are restricted to A-Z and a-z.
-
- CLASS A space separated list of SGML NAME tokens. CLASS names
- specify that the element belongs to the corresponding named
- classes. It allows authors to distinguish different roles
- played by the same tag. The classes may be used by style
- sheets to provide different renderings as appropriate to
- these roles.
-
- 4.2.3. Cursive joining behaviour
-
- Markup is needed in some cases to force cursive joining behavior in
- contexts in which it would not normally occur, or to block it when it
- would normally occur.
-
- The zero-width joiner and non-joiner ( and ) are used to
- control cursive joining behaviour. For example, ARABIC LETTER HEH is
- used in isolation to abbreviate "Hijri" (the Islamic calendrical
- system); however, the initial form of the letter is desired, because
- the isolated form of HEH looks like the digit five as employed in
- Arabic script. This is obtained by following the HEH with a zero-
- width joiner whose only effect is to provide context. In Persian
- texts, there are cases where a letter that normally would join a
- subsequent letter in a cursive connection does not. Here a zero-
- width non- joiner is used.
-
- 4.2.4. Bidirectional text
-
- Many languages are written in horizontal lines from left to right,
- while others are written from right to left. When both writing
- directions are present, one talks of bidirectional text (BIDI for
- short). BIDI text requires markup in special circumstances where
- ambiguities as to the directionality of some characters have to be
- resolved. This markup affects the ability to render BIDI text in a
- semantically legible fashion. That is, without this special BIDI
- markup, cases arise which would prevent *any* rendering whatsoever
-
-
-
- Yergeau, et. al. Standards Track [Page 13]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- that reflected the basic meaning of the text. Plain text may contain
- BIDI markup in the form of special-purpose formatting characters.
-
- This is also possible in HTML, which includes the five BIDI-related
- formatting characters (202A - 202E) of ISO 10646. As an alternative,
- HTML provides equivalent SGML markup.
-
- BIDI is a complex issue, and conversion of logical text sequences to
- display sequences has to be done according to the algorithm and
- character properties specified in [UNICODE]. Here, explanations are
- given only as far as they are needed to understand the necessity of
- the features introduced and to define their exact semantics.
-
- The Unicode BIDI algorithm is based on the individual characters of a
- text being stored in logical order, that is the order in which they
- are normally input and in which the corresponding sounds are normally
- spoken. To make rendering of logical order text possible, the
- algorithm assigns a directionality property to each character, e.g.
- Latin letters are specified to have a left-to-right direction, Arabic
- and Hebrew characters have a right-to-left direction.
-
- The left-to-right and right-to-left marks ( and ) are used
- to disambiguate directionality of neutral characters. For example,
- when a double quote sits between an Arabic and a Latin letter, its
- direction is ambiguous; if a directional mark is added on one side
- such that the quotation mark is surrounded by characters of only one
- directionality, the ambiguity is removed. These characters are like
- zero width spaces which have a directional property (but no word/line
- break property).
-
- Nested embeddings of contra-directional text runs, due to nested
- quotations or to the pasting of text from one BIDI context to
- another, is also a case where the implicit directionality of
- characters is not sufficient, requiring markup. Also, it is
- frequently desirable to specify the basic directionality of a block
- of text. For these purposes, the DIR attribute is used.
-
- On block-type elements, the DIR attribute indicates the base
- directionality of the text in the block; if omitted it is inherited
- from the parent element. The default directionality of the overall
- HTML document is left-to-right.
-
- On inline elements, it makes the element start a new embedding level
- (to be explained below); if omitted the inline element does not start
- a new embedding level.
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 14]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- NOTE -- the PRE, XMP and LISTING elements admit the DIR attribute.
- Their contents should not be considered as preformatted with
- respect to bidirectional layout, but the BIDI algorithm should be
- applied to each line of text.
-
- Following is an example of a case where embedding is needed, showing
- its effect:
-
- Given the following latin (upper case) and arabic (lower case)
- letters in backing store with the specified embeddings:
-
- <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw
- </SPAN> EF </SPAN>
-
- One gets the following rendering (with [] showing the directional
- transitions):
-
- [ AB [ wz [ CD ] yx ] EF ]
-
- On the other hand, without this markup and with a base direction
- of LTR one gets the following rendering:
-
- [ AB [ yx ] CD [ wz ] EF ]
-
- Notice that yx is on the left and wz on the right unlike the above
- case where the embedding levels are used. Without the embedding
- markup one has at most two levels: a base directional level and a
- single counterflow directional level.
-
- The DIR attribute on inline elements is equivalent to the formatting
- characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT
- EMBEDDING (202B) of ISO 10646. The end tag of the element is
- equivalent to the POP DIRECTIONAL FORMATTING (202C) character.
-
- Directional override, as provided by the BDO element, is needed to
- deal with unusual short pieces of text in which directionality cannot
- be resolved from context in an unambiguous fashion. For example, it
- can be used to force left-to-right (or right-to-left) display of part
- numbers composed of Latin letters, digits and Hebrew letters.
-
- The effect of BDO is to force the directionality of all characters
- within it to the value of DIR, irrespective of their intrinsic
- directional properties. It is equivalent to using the LEFT-TO-RIGHT
- OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO
- 10646, the end tag again being equivalent to the POP DIRECTIONAL
- FORMATTING (202C) character.
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 15]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- NOTE -- authors and authoring software writers should be aware
- that conflicts can arise if the DIR attribute is used on inline
- elements (including BDO) concurrently with the use of the
- corresponding ISO 10646 formatting characters.
-
- Preferably one or the other should be used exclusively; the markup
- method is better able to guarantee document structural integrity,
- and alleviates some problems when editing bidirectional HTML text
- with a simple text editor, but some software may be more apt at
- using the 10646 characters. If both methods are used, great care
- should be exercised to insure proper nesting of markup and
- directional embedding or override; otherwise, rendering results
- are undefined.
-
- 5. Forms
-
- 5.1. DTD additions
-
- It is natural to expect input in any language in forms, as they
- provide one of the only ways of obtaining user input. While this is
- primarily a UI issue, there are some things that should be specified
- at the HTML level to guide behavior and promote interoperability.
-
- To ensure full interoperability, it is necessary for the user agent
- (and the user) to have an indication of the character encoding(s)
- that the server providing a form will be able to handle upon
- submission of the filled-in form. Such an indication is provided by
- the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements,
- modeled on the HTTP Accept-Charset header (see [HTTP-1.1]), which
- contains a space and/or comma delimited list of character sets
- acceptable to the server. A user agent may want to somehow advise
- the user of the contents of this attribute, or to restrict his
- possibility to enter characters outside the repertoires of the listed
- character sets.
-
- NOTE -- The list of character sets is to be interpreted as an
- EXCLUSIVE-OR list; the server announces that it is ready to accept
- any ONE of these character encoding schemes for each part of a
- multipart entity. The client may perform character encoding
- translation to satisfy the server if necessary.
-
- NOTE -- The default value for the ACCEPT-CHARSET attribute of an
- INPUT or TEXTAREA element is the reserved value "UNKNOWN". A user
- agent may interpret that value as the character encoding scheme
- that was used to transmit the document containing that element.
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 16]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 5.2. Form submission
-
- The HTML 2.0 form submission mechanism, based on the "application/x-
- www-form-urlencoded" media type, is ill-equipped with regard to
- internationalization. In fact, since URLs are restricted to ASCII
- characters, the mechanism is akward even for ISO-8859-1 text.
- Section 2.2 of [RFC1738] specifies that octets may be encoded using
- the "%HH" notation, but text submitted from a form is composed of
- characters, not octets. Lacking a specification of a character
- encoding scheme, the "%HH" notation has no well-defined meaning.
-
- The best solution is to use the "multipart/form-data" media type
- described in [RFC1867] with the POST method of form submission. This
- mechanism encapsulates the value part of each name-value pair in a
- body-part of a multipart MIME body that is sent as the HTTP entity;
- each body part can be labeled with an appropriate Content-Type,
- including if necessary a charset parameter that specifies the
- character encoding scheme. The changes to the DTD necessary to
- support this method of form submission have been incorporated in the
- DTD included in this specification.
-
- A less satisfactory solution is to add a MIME charset parameter to
- the "application/x-www-form-urlencoded" media type specifier sent
- along with a POST method form submission, with the understanding that
- the URL encoding of [RFC1738] is applied on top of the specified
- character encoding, as a kind of implicit Content-Transfer-Encoding.
-
- One problem with both solutions above is that current browsers do not
- generally allow for bookmarks to specify the POST method; this should
- be improved. Conversely, the GET method could be used with the form
- data transmitted in the body instead of in the URL. Nothing in the
- protocol seems to prevent it, but no implementations appear to exist
- at present.
-
- How the user agent determines the encoding of the text entered by the
- user is outside the scope of this specification.
-
- NOTE -- Designers of forms and their handling scripts should be
- aware of an important caveat: when the default value of a field
- (the VALUE attribute) is returned upon form submission (i.e. the
- user did not modify this value), it cannot be guaranteed to be
- transmitted as a sequence of octets identical to that in the
- source document -- only as a possibly different but valid encoding
- of the same sequence of text elements. This may be true even if
- the encoding of the document containing the form and that used for
- submission are the same.
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 17]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- Differences can occur when a sequence of characters can be
- represented by various sequences of octets, and also when a
- composite sequence (a base character plus one or more combining
- diacritics) can be represented by either a different but
- equivalent composite sequence or by a fully precomposed character.
- For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E
- WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed
- into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT
- BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING
- CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other
- equivalent composite sequences.
-
- 6. External character encoding issues
-
- Proper interpretation of a text document requires that the character
- encoding scheme be known. Current HTTP servers, however, do not
- generally include an appropriate charset parameter with the Content-
- Type header. This is bad behaviour, which is even encouraged by the
- continued existence of browsers that declare an unrecognized media
- type when they receive a charset parameter. User agent
- implementators are strongly encouraged to make their software
- tolerant of this parameter, even if they cannot take advantage of it.
- Proper labelling is highly desirable, but some preventive measures
- can be taken to minimize the detrimental effects of its absence:
-
- In the case where a document is accessed from a hyperlink in an
- origin HTML document, a CHARSET attribute is added to the attribute
- list of elements with link semantics (A and LINK), specifically by
- adding it to the linkExtraAttributes entity. The value of that
- attribute is to be considered a hint to the User Agent as to the
- character encoding scheme used by the resource pointed to by the
- hyperlink; it should be the appropriate value of the MIME charset
- parameter for that resource.
-
- In any document, it is possible to include an indication of the
- encoding scheme like the following, as early as possible within the
- HEAD of the document:
-
- <META HTTP-EQUIV="Content-Type"
- CONTENT="text/html; charset=ISO-2022-JP">
-
- This is not foolproof, but will work if the encoding scheme is such
- that ASCII-valued octets stand for ASCII characters only at least
- until the META element is parsed. Note that there are better ways
- for a server to obtain character encoding information, instead of the
- unreliable META above; see [NICOL2] for some details and a proposal.
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 18]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- For definiteness, the "charset" parameter received from the source of
- the document should be considered the most authoritative, followed in
- order of preference by the contents of a META element such as the
- above, and finally the CHARSET parameter of the anchor that was
- followed (if any).
-
- When HTML text is transmitted directly in UCS-2 or UCS-4 form, the
- question of byte order arises: does the high-order byte of each
- multi-byte character come first or last? For definiteness, this
- specification recommends that UCS-2 and UCS-4 be transmitted in big-
- endian byte order (high order byte first), which corresponds to the
- established network byte order for two- and four-byte quantities, to
- the ISO 10646 requirement and Unicode recommendation for serialized
- text data and to RFC 1641. Furthermore, to maximize chances of
- proper interpretation, it is recommended that documents transmitted
- as UCS-2 or UCS-4 always begin with a ZERO-WIDTH NON-BREAKING SPACE
- character (hexadecimal FEFF or 0000FEFF) which, when byte-reversed
- becomes number FFFE or FFFE0000, a character guaranteed to be never
- assigned. Thus, a user-agent receiving an FFFE as the first octets
- of a text would know that bytes have to be reversed for the remainder
- of the text.
-
- There exist so-called UCS Transformation Formats than can be used to
- transmit UCS data, in addition to UCS-2 and UCS-4. UTF-7 [RFC1642]
- and UTF-8 [UTF-8] have favorable properties (no byte-ordering
- problem, different flavours of ASCII compatibility) that make them
- worthy of consideration, especially for transmission of multilingual
- text. Another encoding scheme, MNEM [RFC1345], also has interesting
- properties and the capability to transmit the full UCS. The UTF-1
- transformation format of ISO 10646:1993 (registered by IANA as ISO-
- 10646-UTF-1), has been removed from ISO 10646 by amendment 4, and
- should not be used.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 19]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 7. HTML Public Text
-
- 7.1. HTML DTD
-
- This section contains a DTD for HTML based on the HTML 2.0 DTD of RFC
- 1866, incorporating the changes for file upload as specified in RFC
- 1867, and the changes deriving from this document.
-
- <!-- html.dtd
-
- Document Type Definition for the HyperText Markup Language,
- extended for internationalisation (HTML DTD)
-
- Last revised: 96/08/07
-
- Authors: Daniel W. Connolly <connolly@w3.org>
- Francois Yergeau <yergeau@alis.com>
- See Also:
- http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
- -->
-
- <!ENTITY % HTML.Version
- "-//IETF//DTD HTML i18n//EN"
-
- -- Typical usage:
-
- <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML i18n//EN">
- <html>
- ...
- </html>
- --
- >
-
-
- <!--============ Feature Test Entities ========================-->
-
- <!ENTITY % HTML.Recommended "IGNORE"
- -- Certain features of the language are necessary for
- compatibility with widespread usage, but they may
- compromise the structural integrity of a document.
- This feature test entity enables a more prescriptive
- document type definition that eliminates
- those features.
- -->
-
- <![ %HTML.Recommended [
- <!ENTITY % HTML.Deprecated "IGNORE">
- ]]>
-
-
-
- Yergeau, et. al. Standards Track [Page 20]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ENTITY % HTML.Deprecated "INCLUDE"
- -- Certain features of the language are necessary for
- compatibility with earlier versions of the specification,
- but they tend to be used and implemented inconsistently,
- and their use is deprecated. This feature test entity
- enables a document type definition that eliminates
- these features.
- -->
-
- <!ENTITY % HTML.Highlighting "INCLUDE"
- -- Use this feature test entity to validate that a
- document uses no highlighting tags, which may be
- ignored on minimal implementations.
- -->
-
- <!ENTITY % HTML.Forms "INCLUDE"
- -- Use this feature test entity to validate that a document
- contains no forms, which may not be supported in minimal
- implementations
- -->
-
- <!--============== Imported Names ==============================-->
-
- <!ENTITY % Content-Type "CDATA"
- -- meaning an internet media type
- (aka MIME content type, as per RFC2045)
- -->
-
- <!ENTITY % HTTP-Method "GET | POST"
- -- as per HTTP specification, RFC2068
- -->
-
- <!--========= DTD "Macros" =====================-->
-
- <!ENTITY % heading "H1|H2|H3|H4|H5|H6">
-
- <!ENTITY % list " UL | OL | DIR | MENU " >
-
- <!ENTITY % attrs -- common attributes for elements --
- "LANG NAME #IMPLIED -- RFC 1766 language tag --
- DIR (ltr|rtl) #IMPLIED -- text directionnality --
- ID ID #IMPLIED -- element identifier
- (from RFC1942) --
- CLASS NAMES #IMPLIED -- for subclassing elements
- (from RFC1942) --">
-
- <!ENTITY % just -- an attribute for text justification --
- "ALIGN (left|right|center|justify) #IMPLIED"
-
-
-
- Yergeau, et. al. Standards Track [Page 21]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- -- default is left for ltr paragraphs, right for rtl -- >
-
- <!--======= Character mnemonic entities =================-->
-
- <!ENTITY % ISOlat1 PUBLIC
- "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
- %ISOlat1;
-
- <!ENTITY amp CDATA "&" -- ampersand -->
- <!ENTITY gt CDATA ">" -- greater than -->
- <!ENTITY lt CDATA "<" -- less than -->
- <!ENTITY quot CDATA """ -- double quote -->
-
- <!--Entities for language-dependent presentation (BIDI and
- contextual analysis) -->
- <!ENTITY zwnj CDATA ""-- zero width non-joiner-->
- <!ENTITY zwj CDATA ""-- zero width joiner-->
- <!ENTITY lrm CDATA ""-- left-to-right mark-->
- <!ENTITY rlm CDATA ""-- right-to-left mark-->
-
-
- <!--========= SGML Document Access (SDA) Parameter Entities =====-->
-
- <!-- HTML contains SGML Document Access (SDA) fixed attributes
- in support of easy transformation to the International Committee
- for Accessible Document Design (ICADD) DTD
- "-//EC-USA-CDA/ICADD//DTD ICADD22//EN".
- ICADD applications are designed to support usable access to
- structured information by print-impaired individuals through
- Braille, large print and voice synthesis. For more information on
- SDA & ICADD:
- - ISO 12083:1993, Annex A.8, Facilities for Braille,
- large print and computer voice
- - ICADD ListServ
- <ICADD%ASUACAD.BITNET@ARIZVM1.ccit.arizona.edu>
- - Usenet news group bit.listserv.easi
- - Recording for the Blind, +1 800 221 4792
- -->
-
- <!ENTITY % SDAFORM "SDAFORM CDATA #FIXED"
- -- one to one mapping -->
- <!ENTITY % SDARULE "SDARULE CDATA #FIXED"
- -- context-sensitive mapping -->
- <!ENTITY % SDAPREF "SDAPREF CDATA #FIXED"
- -- generated text prefix -->
- <!ENTITY % SDASUFF "SDASUFF CDATA #FIXED"
- -- generated text suffix -->
- <!ENTITY % SDASUSP "SDASUSP NAME #FIXED"
-
-
-
- Yergeau, et. al. Standards Track [Page 22]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- -- suspend transform process -->
-
-
- <!--========== Text Markup =====================-->
-
- <![ %HTML.Highlighting [
-
- <!ENTITY % font " TT | B | I ">
-
- <!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
-
- <!ENTITY % text "#PCDATA|A|IMG|BR|%phrase|%font|SPAN|Q|BDO|SUP|SUB">
-
- <!ELEMENT (%font;|%phrase) - - (%text)*>
- <!ATTLIST ( TT | CODE | SAMP | KBD | VAR )
- %attrs;
- %SDAFORM; "Lit"
- >
-
- <!ATTLIST ( B | STRONG )
- %attrs;
- %SDAFORM; "B"
- >
- <!ATTLIST ( I | EM | CITE )
- %attrs;
- %SDAFORM; "It"
- >
-
- <!-- <TT> Typewriter text -->
- <!-- <B> Bold text -->
- <!-- <I> Italic text -->
-
- <!-- <EM> Emphasized phrase -->
- <!-- <STRONG> Strong emphasis -->
- <!-- <CODE> Source code phrase -->
- <!-- <SAMP> Sample text or characters -->
- <!-- <KBD> Keyboard phrase, e.g. user input -->
- <!-- <VAR> Variable phrase or substitutable -->
- <!-- <CITE> Name or title of cited work -->
-
- <!ENTITY % pre.content "#PCDATA|A|HR|BR|%font|%phrase|SPAN|BDO">
-
- ]]>
-
- <!ENTITY % text "#PCDATA|A|IMG|BR|SPAN|Q|BDO|SUP|SUB">
-
- <!ELEMENT BR - O EMPTY>
- <!ATTLIST BR
-
-
-
- Yergeau, et. al. Standards Track [Page 23]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- %SDAPREF; "RE;"
- >
-
- <!-- <BR> Line break -->
-
- <!ELEMENT SPAN - - (%text)*>
- <!ATTLIST SPAN
- %attrs;
- %SDAFORM; "other #Attlist"
- >
-
- <!-- <SPAN> Generic inline container -->
- <!-- <SPAN DIR=...> New counterflow embedding -->
- <!-- <SPAN LANG="..."> Language of contents -->
-
- <!ELEMENT Q - - (%text)*>
- <!ATTLIST Q
- %attrs;
- %SDAPREF; '"'
- %SDASUFF; '"'
- >
-
- <!-- <Q> Short quotation -->
- <!-- <Q LANG=xx> Language of quotation is xx -->
- <!-- <Q DIR=...> New conterflow embedding -->
-
- <!ELEMENT BDO - - (%text)+>
- <!ATTLIST BDO
- LANG NAME #IMPLIED
- DIR (ltr|rtl) #REQUIRED
- ID ID #IMPLIED
- CLASS NAMES #IMPLIED
- %SDAPREF "Bidi Override #Attval(DIR): "
- %SDASUFF "End Bidi"
- >
-
- <!-- <BDO DIR=...> Override directionality of text to value of DIR -->
- <!-- <BDO LANG=...> Language of contents -->
-
- <!ELEMENT (SUP|SUB) - - (#PCDATA)>
- <!ATTLIST (SUP)
- %attrs;
- %SDAPREF "Superscript(#content)"
- >
- <!ATTLIST (SUB)
- %attrs;
- %SDAPREF "Subscript(#content)"
- >
-
-
-
- Yergeau, et. al. Standards Track [Page 24]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!-- <SUP> Superscript -->
- <!-- <SUB> Subscript -->
-
- <!--========= Link Markup ======================-->
-
- <!ENTITY % linkType "NAMES">
-
- <!ENTITY % linkExtraAttributes
- "REL %linkType #IMPLIED
- REV %linkType #IMPLIED
- URN CDATA #IMPLIED
- TITLE CDATA #IMPLIED
- METHODS NAMES #IMPLIED
- CHARSET NAME #IMPLIED
- ">
-
- <![ %HTML.Recommended [
- <!ENTITY % A.content "(%text)*"
-
- -- <H1><a name="xxx">Heading</a></H1>
- is preferred to
- <a name="xxx"><H1>Heading</H1></a>
- -->
- ]]>
-
- <!ENTITY % A.content "(%heading|%text)*">
-
- <!ELEMENT A - - %A.content -(A)>
- <!ATTLIST A
- %attrs;
- HREF CDATA #IMPLIED
- NAME CDATA #IMPLIED
- %linkExtraAttributes;
- %SDAPREF; "<Anchor: #AttList>"
- >
- <!-- <A> Anchor; source/destination of link -->
- <!-- <A NAME="..."> Name of this anchor -->
- <!-- <A HREF="..."> Address of link destination -->
- <!-- <A URN="..."> Permanent address of destination -->
- <!-- <A REL=...> Relationship to destination -->
- <!-- <A REV=...> Relationship of destination to this -->
- <!-- <A TITLE="..."> Title of destination (advisory) -->
- <!-- <A METHODS="..."> Operations on destination (advisory) -->
- <!-- <A CHARSET="..."> Charset of destination (advisory) -->
- <!-- <A LANG="..."> Language of contents btw <A> and </A> -->
- <!-- <A DIR=...> Contents is a new counterflow embedding -->
-
- <!--========== Images ==========================-->
-
-
-
- Yergeau, et. al. Standards Track [Page 25]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ELEMENT IMG - O EMPTY>
- <!ATTLIST IMG
- %attrs;
- SRC CDATA #REQUIRED
- ALT CDATA #IMPLIED
- ALIGN (top|middle|bottom) #IMPLIED
- ISMAP (ISMAP) #IMPLIED
- %SDAPREF; "<Fig><?SDATrans Img: #AttList>#AttVal(Alt)</Fig>"
- >
-
- <!-- <IMG> Image; icon, glyph or illustration -->
- <!-- <IMG SRC="..."> Address of image object -->
- <!-- <IMG ALT="..."> Textual alternative -->
- <!-- <IMG ALIGN=...> Position relative to text -->
- <!-- <IMG LANG=...> Image contains "text" in that language -->
- <!-- <IMG DIR=...> Inline image acts as a RTL or LTR
- embedding w/r to BIDI algorithm -->
- <!-- <IMG ISMAP> Each pixel can be a link -->
-
- <!--========== Paragraphs=======================-->
-
- <!ELEMENT P - O (%text)*>
- <!ATTLIST P
- %attrs;
- %just;
- %SDAFORM; "Para"
- >
-
- <!-- <P> Paragraph -->
- <!-- <P LANG="..."> Language of paragraph text -->
- <!-- <P DIR=...> Base directionality of paragraph -->
- <!-- <P ALIGN=...> Paragraph alignment (justification) -->
-
- <!--========== Headings, Titles, Sections ===============-->
-
- <!ELEMENT HR - O EMPTY>
- <!ATTLIST HR
- %just;
- %SDAPREF; "RE;RE;"
- >
-
- <!-- <HR> Horizontal rule -->
-
- <!ELEMENT ( %heading ) - - (%text;)*>
- <!ATTLIST H1
- %attrs;
- %just;
- %SDAFORM; "H1"
-
-
-
- Yergeau, et. al. Standards Track [Page 26]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- >
- <!ATTLIST H2
- %attrs;
- %just;
- %SDAFORM; "H2"
- >
- <!ATTLIST H3
- %attrs;
- %just;
- %SDAFORM; "H3"
- >
- <!ATTLIST H4
- %attrs;
- %just;
- %SDAFORM; "H4"
- >
- <!ATTLIST H5
- %attrs;
- %just;
- %SDAFORM; "H5"
- >
- <!ATTLIST H6
- %attrs;
- %just;
- %SDAFORM; "H6"
- >
-
- <!-- <H1> Heading, level 1 -->
- <!-- <H2> Heading, level 2 -->
- <!-- <H3> Heading, level 3 -->
- <!-- <H4> Heading, level 4 -->
- <!-- <H5> Heading, level 5 -->
- <!-- <H6> Heading, level 6 -->
-
-
- <!--========== Text Flows ======================-->
-
- <![ %HTML.Forms [
- <!ENTITY % block.forms "BLOCKQUOTE | FORM | ISINDEX">
- ]]>
-
- <!ENTITY % block.forms "BLOCKQUOTE">
-
- <![ %HTML.Deprecated [
- <!ENTITY % preformatted "PRE | XMP | LISTING">
- ]]>
-
- <!ENTITY % preformatted "PRE">
-
-
-
- Yergeau, et. al. Standards Track [Page 27]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ENTITY % block "P | %list | DL
- | %preformatted
- | %block.forms">
-
- <!ENTITY % flow "(%text|%block)*">
-
- <!ENTITY % pre.content "#PCDATA | A | HR | BR | SPAN | BDO">
- <!ELEMENT PRE - - (%pre.content)*>
- <!ATTLIST PRE
- %attrs;
- WIDTH NUMBER #implied
- %SDAFORM; "Lit"
- >
-
- <!-- <PRE> Preformatted text -->
- <!-- <PRE WIDTH=...> Maximum characters per line -->
- <!-- <PRE DIR=...> Base direction of preformatted block -->
- <!-- <PRE LANG=...> Language of contents -->
-
- <![ %HTML.Deprecated [
-
- <!ENTITY % literal "CDATA"
- -- historical, non-conforming parsing mode where
- the only markup signal is the end tag
- in full
- -->
-
- <!ELEMENT (XMP|LISTING) - - %literal>
- <!ATTLIST XMP
- %attrs;
- %SDAFORM; "Lit"
- %SDAPREF; "Example:RE;"
- >
- <!ATTLIST LISTING
- %attrs;
- %SDAFORM; "Lit"
- %SDAPREF; "Listing:RE;"
- >
-
- <!-- <XMP> Example section -->
- <!-- <LISTING> Computer listing -->
-
- <!ELEMENT PLAINTEXT - O %literal>
- <!-- <PLAINTEXT> Plain text passage -->
-
- <!ATTLIST PLAINTEXT
- %attrs;
- %SDAFORM; "Lit"
-
-
-
- Yergeau, et. al. Standards Track [Page 28]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- >
- ]]>
-
-
- <!--========== Lists ==================-->
-
- <!ELEMENT DL - - (DT | DD)+>
- <!ATTLIST DL
- %attrs;
- COMPACT (COMPACT) #IMPLIED
- %SDAFORM; "List"
- %SDAPREF; "Definition List:"
- >
-
- <!ELEMENT DT - O (%text)*>
- <!ATTLIST DT
- %attrs;
- %SDAFORM; "Term"
- >
-
- <!ELEMENT DD - O %flow>
- <!ATTLIST DD
- %attrs;
- %SDAFORM; "LItem"
- >
-
- <!-- <DL> Definition list, or glossary -->
- <!-- <DL COMPACT> Compact style list -->
- <!-- <DT> Term in definition list -->
- <!-- <DD> Definition of term -->
-
- <!ELEMENT (OL|UL) - - (LI)+>
- <!ATTLIST OL
- %attrs;
- %just;
- COMPACT (COMPACT) #IMPLIED
- %SDAFORM; "List"
- >
- <!ATTLIST UL
- %attrs;
- %just;
- COMPACT (COMPACT) #IMPLIED
- %SDAFORM; "List"
- >
- <!-- <UL> Unordered list -->
- <!-- <UL COMPACT> Compact list style -->
- <!-- <OL> Ordered, or numbered list -->
- <!-- <OL COMPACT> Compact list style -->
-
-
-
- Yergeau, et. al. Standards Track [Page 29]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ELEMENT (DIR|MENU) - - (LI)+ -(%block)>
- <!ATTLIST DIR
- %attrs;
- %just;
- COMPACT (COMPACT) #IMPLIED
- %SDAFORM; "List"
- %SDAPREF; "<LHead>Directory</LHead>"
- >
- <!ATTLIST MENU
- %attrs;
- %just;
- COMPACT (COMPACT) #IMPLIED
- %SDAFORM; "List"
- %SDAPREF; "<LHead>Menu</LHead>"
- >
-
- <!-- <DIR> Directory list -->
- <!-- <DIR COMPACT> Compact list style -->
- <!-- <MENU> Menu list -->
- <!-- <MENU COMPACT> Compact list style -->
-
- <!ELEMENT LI - O %flow>
- <!ATTLIST LI
- %attrs;
- %just;
- %SDAFORM; "LItem"
- >
-
- <!-- <LI> List item -->
-
- <!--========== Document Body ===================-->
-
- <![ %HTML.Recommended [
- <!ENTITY % body.content "(%heading|%block|HR|ADDRESS|IMG)*"
- -- <h1>Heading</h1>
- <p>Text ...
- is preferred to
- <h1>Heading</h1>
- Text ...
- -->
- ]]>
-
- <!ENTITY % body.content "(%heading | %text | %block |
- HR | ADDRESS)*">
-
- <!ELEMENT BODY O O %body.content>
- <!ATTLIST BODY
- %attrs;
-
-
-
- Yergeau, et. al. Standards Track [Page 30]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- >
-
- <!-- <BODY> Document body -->
- <!-- <BODY DIR=...> Base direction of whole body -->
- <!-- <BODY LANG=...> Language of contents -->
-
- <!ELEMENT BLOCKQUOTE - - %body.content>
- <!ATTLIST BLOCKQUOTE
- %attrs;
- %just;
- %SDAFORM; "BQ"
- >
-
- <!-- <BLOCKQUOTE> Quoted passage -->
-
- <!ELEMENT ADDRESS - - (%text|P)*>
- <!ATTLIST ADDRESS
- %attrs;
- %just;
- %SDAFORM; "Lit"
- %SDAPREF; "Address:RE;"
- >
-
- <!-- <ADDRESS> Address, signature, or byline -->
-
-
- <!--======= Forms ====================-->
-
- <![ %HTML.Forms [
-
- <!ELEMENT FORM - - %body.content -(FORM) +(INPUT|SELECT|TEXTAREA)>
- <!ATTLIST FORM
- %attrs;
- ACTION CDATA #IMPLIED
- METHOD (%HTTP-Method) GET
- ENCTYPE %Content-Type; "application/x-www-form-urlencoded"
- %SDAPREF; "<Para>Form:</Para>"
- %SDASUFF; "<Para>Form End.</Para>"
- >
-
- <!-- <FORM> Fill-out or data-entry form -->
- <!-- <FORM ACTION="..."> Address for completed form -->
- <!-- <FORM METHOD=...> Method of submitting form -->
- <!-- <FORM ENCTYPE="..."> Representation of form data -->
- <!-- <FORM DIR=...> Base direction of form -->
- <!-- <FORM LANG=...> Language of contents -->
-
- <!ENTITY % InputType "(TEXT | PASSWORD | CHECKBOX |
-
-
-
- Yergeau, et. al. Standards Track [Page 31]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- RADIO | SUBMIT | RESET |
- IMAGE | HIDDEN | FILE )">
- <!ELEMENT INPUT - O EMPTY>
- <!ATTLIST INPUT
- %attrs;
- TYPE %InputType TEXT
- NAME CDATA #IMPLIED
- VALUE CDATA #IMPLIED
- SRC CDATA #IMPLIED
- CHECKED (CHECKED) #IMPLIED
- SIZE CDATA #IMPLIED
- MAXLENGTH NUMBER #IMPLIED
- ALIGN (top|middle|bottom) #IMPLIED
- ACCEPT CDATA #IMPLIED --list of content types --
- ACCEPT-CHARSET CDATA #IMPLIED --list of charsets accepted --
- %SDAPREF; "Input: "
- >
-
- <!-- <INPUT> Form input datum -->
- <!-- <INPUT TYPE=...> Type of input interaction -->
- <!-- <INPUT NAME=...> Name of form datum -->
- <!-- <INPUT VALUE="..."> Default/initial/selected value -->
- <!-- <INPUT SRC="..."> Address of image -->
- <!-- <INPUT CHECKED> Initial state is "on" -->
- <!-- <INPUT SIZE=...> Field size hint -->
- <!-- <INPUT MAXLENGTH=...> Data length maximum -->
- <!-- <INPUT ALIGN=...> Image alignment -->
- <!-- <INPUT ACCEPT="..."> List of desired media types -->
- <!-- <INPUT ACCEPT-CHARSET="..."> List of acceptable charsets -->
-
- <!ELEMENT SELECT - - (OPTION+) -(INPUT|SELECT|TEXTAREA)>
- <!ATTLIST SELECT
- %attrs;
- NAME CDATA #REQUIRED
- SIZE NUMBER #IMPLIED
- MULTIPLE (MULTIPLE) #IMPLIED
- %SDAFORM; "List"
- %SDAPREF;
- "<LHead>Select #AttVal(Multiple)</LHead>"
- >
-
- <!-- <SELECT> Selection of option(s) -->
- <!-- <SELECT NAME=...> Name of form datum -->
- <!-- <SELECT SIZE=...> Options displayed at a time -->
- <!-- <SELECT MULTIPLE> Multiple selections allowed -->
-
- <!ELEMENT OPTION - O (#PCDATA)*>
- <!ATTLIST OPTION
-
-
-
- Yergeau, et. al. Standards Track [Page 32]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- %attrs;
- SELECTED (SELECTED) #IMPLIED
- VALUE CDATA #IMPLIED
- %SDAFORM; "LItem"
- %SDAPREF;
- "Option: #AttVal(Value) #AttVal(Selected)"
- >
-
- <!-- <OPTION> A selection option -->
- <!-- <OPTION SELECTED> Initial state -->
- <!-- <OPTION VALUE="..."> Form datum value for this option-->
-
- <!ELEMENT TEXTAREA - - (#PCDATA)* -(INPUT|SELECT|TEXTAREA)>
- <!ATTLIST TEXTAREA
- %attrs;
- NAME CDATA #REQUIRED
- ROWS NUMBER #REQUIRED
- COLS NUMBER #REQUIRED
- ACCEPT-CHARSET CDATA #IMPLIED -- list of charsets accepted --
- %SDAFORM; "Para"
- %SDAPREF; "Input Text -- #AttVal(Name): "
- >
-
- <!-- <TEXTAREA> An area for text input -->
- <!-- <TEXTAREA NAME=...> Name of form datum -->
- <!-- <TEXTAREA ROWS=...> Height of area -->
- <!-- <TEXTAREA COLS=...> Width of area -->
-
- ]]>
-
-
- <!--======= Document Head ======================-->
-
- <![ %HTML.Recommended [
- <!ENTITY % head.extra "">
- ]]>
- <!ENTITY % head.extra "& NEXTID?">
-
- <!ENTITY % head.content "TITLE & ISINDEX? & BASE? %head.extra">
-
- <!ELEMENT HEAD O O (%head.content) +(META|LINK)>
- <!ATTLIST HEAD
- %attrs; >
-
- <!-- <HEAD> Document head -->
-
- <!ELEMENT TITLE - - (#PCDATA)* -(META|LINK)>
- <!ATTLIST TITLE
-
-
-
- Yergeau, et. al. Standards Track [Page 33]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- %attrs;
- %SDAFORM; "Ti" >
-
- <!-- <TITLE> Title of document -->
-
- <!ELEMENT LINK - O EMPTY>
- <!ATTLIST LINK
- %attrs;
- HREF CDATA #REQUIRED
- %linkExtraAttributes;
- %SDAPREF; "Linked to : #AttVal (TITLE) (URN) (HREF)>" >
-
- <!-- <LINK> Link from this document -->
- <!-- <LINK HREF="..."> Address of link destination -->
- <!-- <LINK URN="..."> Lasting name of destination -->
- <!-- <LINK REL=...> Relationship to destination -->
- <!-- <LINK REV=...> Relationship of destination to this -->
- <!-- <LINK TITLE="..."> Title of destination (advisory) -->
- <!-- <LINK CHARSET="..."> Charset of destination (advisory) -->
- <!-- <LINK METHODS="..."> Operations allowed (advisory) -->
-
- <!ELEMENT ISINDEX - O EMPTY>
- <!ATTLIST ISINDEX
- %attrs;
- %SDAPREF;
- "<Para>[Document is indexed/searchable.]</Para>">
-
- <!-- <ISINDEX> Document is a searchable index -->
-
- <!ELEMENT BASE - O EMPTY>
- <!ATTLIST BASE
- HREF CDATA #REQUIRED >
-
- <!-- <BASE> Base context document -->
- <!-- <BASE HREF="..."> Address for this document -->
-
- <!ELEMENT NEXTID - O EMPTY>
- <!ATTLIST NEXTID
- N CDATA #REQUIRED >
-
- <!-- <NEXTID> Next ID to use for link name -->
- <!-- <NEXTID N=...> Next ID to use for link name -->
-
- <!ELEMENT META - O EMPTY>
- <!ATTLIST META
- HTTP-EQUIV NAME #IMPLIED
- NAME NAME #IMPLIED
- CONTENT CDATA #REQUIRED >
-
-
-
- Yergeau, et. al. Standards Track [Page 34]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!-- <META> Generic Meta-information -->
- <!-- <META HTTP-EQUIV=...> HTTP response header name -->
- <!-- <META NAME=...> Meta-information name -->
- <!-- <META CONTENT="..."> Associated information -->
-
- <!--======= Document Structure =================-->
-
- <![ %HTML.Deprecated [
- <!ENTITY % html.content "HEAD, BODY, PLAINTEXT?">
- ]]>
- <!ENTITY % html.content "HEAD, BODY">
-
- <!ELEMENT HTML O O (%html.content)>
- <!ENTITY % version.attr "VERSION CDATA #FIXED '%HTML.Version;'">
-
- <!ATTLIST HTML
- %attrs;
- %version.attr;
- %SDAFORM; "Book"
- >
-
- <!-- <HTML> HTML Document -->
-
- 7.2. SGML Declaration for HTML
-
- <!SGML "ISO 8879:1986"
- --
- SGML Declaration for HyperText Markup Language version 2.x
- (HTML 2.x = HTML 2.0 + i18n).
-
- --
-
- CHARSET
- BASESET "ISO Registration Number 177//CHARSET
- ISO/IEC 10646-1:1993 UCS-4 with
- implementation level 3//ESC 2/5 2/15 4/6"
- DESCSET 0 9 UNUSED
- 9 2 9
- 11 2 UNUSED
- 13 1 13
- 14 18 UNUSED
- 32 95 32
- 127 1 UNUSED
- 128 32 UNUSED
- 160 2147483486 160
- --
- In ISO 10646, the positions with hexadecimal
- values 0000D800 - 0000DFFF, used in the UTF-16
-
-
-
- Yergeau, et. al. Standards Track [Page 35]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- encoding of UCS-4, are reserved, as well as the last
- two code values in each plane of UCS-4, i.e. all
- values of the hexadecimal form xxxxFFFE or xxxxFFFF.
- These code values or the corresponding numeric
- character references must not be included when
- generating a new HTML document, and they should be
- ignored if encountered when processing a HTML
- document.
- --
-
- CAPACITY SGMLREF
- TOTALCAP 150000
- GRPCAP 150000
- ENTCAP 150000
-
- SCOPE DOCUMENT
-
- SYNTAX
- SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
- 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
-
- BASESET "ISO 646IRV:1991//CHARSET
- International Reference Version
- (IRV)//ESC 2/8 4/2"
- DESCSET 0 128 0
-
- FUNCTION
- RE 13
- RS 10
- SPACE 32
- TAB SEPCHAR 9
-
- NAMING LCNMSTRT ""
- UCNMSTRT ""
- LCNMCHAR ".-"
- UCNMCHAR ".-"
- NAMECASE GENERAL YES
- ENTITY NO
- DELIM GENERAL SGMLREF
- SHORTREF SGMLREF
- NAMES SGMLREF
- QUANTITY SGMLREF
- ATTSPLEN 2100
- LITLEN 1024
- NAMELEN 72 -- somewhat arbitrary; taken from
- internet line length conventions --
- PILEN 1024
- TAGLVL 100
-
-
-
- Yergeau, et. al. Standards Track [Page 36]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- TAGLEN 2100
- GRPGTCNT 150
- GRPCNT 64
-
- FEATURES
- MINIMIZE
- DATATAG NO
- OMITTAG YES
- RANK NO
- SHORTTAG YES
- LINK
- SIMPLE NO
- IMPLICIT NO
- EXPLICIT NO
- OTHER
- CONCUR NO
- SUBDOC NO
- FORMAL YES
- APPINFO "SDA" -- conforming SGML Document Access application
- --
- >
-
- 7.3. ISO Latin 1 entity set
-
- The following public text lists each of the characters specified in
- the Added Latin 1 entity set, along with its name, syntax for use,
- and description. This list is derived from ISO Standard
- 8879:1986//ENTITIES Added Latin 1//EN. HTML includes the entire
- entity set, and adds entities for all missing characters in the right
- part of ISO-8859-1.
-
- <!-- (C) International Organization for Standardization 1986
- Permission to copy in any form is granted for use with
- conforming SGML systems and applications as defined in
- ISO 8879, provided this notice is included in all copies.
- -->
- <!-- Character entity set. Typical invocation:
- <!ENTITY % ISOlat1 PUBLIC
- "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML">
- %ISOlat1;
- -->
- <!ENTITY nbsp CDATA " " -- no-break space -->
- <!ENTITY iexcl CDATA "¡" -- inverted exclamation mark -->
- <!ENTITY cent CDATA "¢" -- cent sign -->
- <!ENTITY pound CDATA "£" -- pound sterling sign -->
- <!ENTITY curren CDATA "¤" -- general currency sign -->
- <!ENTITY yen CDATA "¥" -- yen sign -->
- <!ENTITY brvbar CDATA "¦" -- broken (vertical) bar -->
-
-
-
- Yergeau, et. al. Standards Track [Page 37]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ENTITY sect CDATA "§" -- section sign -->
- <!ENTITY uml CDATA "¨" -- umlaut (dieresis) -->
- <!ENTITY copy CDATA "©" -- copyright sign -->
- <!ENTITY ordf CDATA "ª" -- ordinal indicator, feminine -->
- <!ENTITY laquo CDATA "«" -- angle quotation mark, left -->
- <!ENTITY not CDATA "¬" -- not sign -->
- <!ENTITY shy CDATA "" -- soft hyphen -->
- <!ENTITY reg CDATA "®" -- registered sign -->
- <!ENTITY macr CDATA "¯" -- macron -->
- <!ENTITY deg CDATA "°" -- degree sign -->
- <!ENTITY plusmn CDATA "±" -- plus-or-minus sign -->
- <!ENTITY sup2 CDATA "²" -- superscript two -->
- <!ENTITY sup3 CDATA "³" -- superscript three -->
- <!ENTITY acute CDATA "´" -- acute accent -->
- <!ENTITY micro CDATA "µ" -- micro sign -->
- <!ENTITY para CDATA "¶" -- pilcrow (paragraph sign) -->
- <!ENTITY middot CDATA "·" -- middle dot -->
- <!ENTITY cedil CDATA "¸" -- cedilla -->
- <!ENTITY sup1 CDATA "¹" -- superscript one -->
- <!ENTITY ordm CDATA "º" -- ordinal indicator, masculine -->
- <!ENTITY raquo CDATA "»" -- angle quotation mark, right -->
- <!ENTITY frac14 CDATA "¼" -- fraction one-quarter -->
- <!ENTITY frac12 CDATA "½" -- fraction one-half -->
- <!ENTITY frac34 CDATA "¾" -- fraction three-quarters -->
- <!ENTITY iquest CDATA "¿" -- inverted question mark -->
- <!ENTITY Agrave CDATA "À" -- capital A, grave accent -->
- <!ENTITY Aacute CDATA "Á" -- capital A, acute accent -->
- <!ENTITY Acirc CDATA "Â" -- capital A, circumflex accent -->
- <!ENTITY Atilde CDATA "Ã" -- capital A, tilde -->
- <!ENTITY Auml CDATA "Ä" -- capital A, dieresis or umlaut -->
- <!ENTITY Aring CDATA "Å" -- capital A, ring -->
- <!ENTITY AElig CDATA "Æ" -- capital AE diphthong (ligature) -->
- <!ENTITY Ccedil CDATA "Ç" -- capital C, cedilla -->
- <!ENTITY Egrave CDATA "È" -- capital E, grave accent -->
- <!ENTITY Eacute CDATA "É" -- capital E, acute accent -->
- <!ENTITY Ecirc CDATA "Ê" -- capital E, circumflex accent -->
- <!ENTITY Euml CDATA "Ë" -- capital E, dieresis or umlaut -->
- <!ENTITY Igrave CDATA "Ì" -- capital I, grave accent -->
- <!ENTITY Iacute CDATA "Í" -- capital I, acute accent -->
- <!ENTITY Icirc CDATA "Î" -- capital I, circumflex accent -->
- <!ENTITY Iuml CDATA "Ï" -- capital I, dieresis or umlaut -->
- <!ENTITY ETH CDATA "Ð" -- capital Eth, Icelandic -->
- <!ENTITY Ntilde CDATA "Ñ" -- capital N, tilde -->
- <!ENTITY Ograve CDATA "Ò" -- capital O, grave accent -->
- <!ENTITY Oacute CDATA "Ó" -- capital O, acute accent -->
- <!ENTITY Ocirc CDATA "Ô" -- capital O, circumflex accent -->
- <!ENTITY Otilde CDATA "Õ" -- capital O, tilde -->
- <!ENTITY Ouml CDATA "Ö" -- capital O, dieresis or umlaut -->
-
-
-
- Yergeau, et. al. Standards Track [Page 38]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- <!ENTITY times CDATA "×" -- multiply sign -->
- <!ENTITY Oslash CDATA "Ø" -- capital O, slash -->
- <!ENTITY Ugrave CDATA "Ù" -- capital U, grave accent -->
- <!ENTITY Uacute CDATA "Ú" -- capital U, acute accent -->
- <!ENTITY Ucirc CDATA "Û" -- capital U, circumflex accent -->
- <!ENTITY Uuml CDATA "Ü" -- capital U, dieresis or umlaut -->
- <!ENTITY Yacute CDATA "Ý" -- capital Y, acute accent -->
- <!ENTITY THORN CDATA "Þ" -- capital Thorn, Icelandic -->
- <!ENTITY szlig CDATA "ß" -- small sharp s, German (sz ligature) -->
- <!ENTITY agrave CDATA "à" -- small a, grave accent -->
- <!ENTITY aacute CDATA "á" -- small a, acute accent -->
- <!ENTITY acirc CDATA "â" -- small a, circumflex accent -->
- <!ENTITY atilde CDATA "ã" -- small a, tilde -->
- <!ENTITY auml CDATA "ä" -- small a, dieresis or umlaut -->
- <!ENTITY aring CDATA "å" -- small a, ring -->
- <!ENTITY aelig CDATA "æ" -- small ae diphthong (ligature) -->
- <!ENTITY ccedil CDATA "ç" -- small c, cedilla -->
- <!ENTITY egrave CDATA "è" -- small e, grave accent -->
- <!ENTITY eacute CDATA "é" -- small e, acute accent -->
- <!ENTITY ecirc CDATA "ê" -- small e, circumflex accent -->
- <!ENTITY euml CDATA "ë" -- small e, dieresis or umlaut -->
- <!ENTITY igrave CDATA "ì" -- small i, grave accent -->
- <!ENTITY iacute CDATA "í" -- small i, acute accent -->
- <!ENTITY icirc CDATA "î" -- small i, circumflex accent -->
- <!ENTITY iuml CDATA "ï" -- small i, dieresis or umlaut -->
- <!ENTITY eth CDATA "ð" -- small eth, Icelandic -->
- <!ENTITY ntilde CDATA "ñ" -- small n, tilde -->
- <!ENTITY ograve CDATA "ò" -- small o, grave accent -->
- <!ENTITY oacute CDATA "ó" -- small o, acute accent -->
- <!ENTITY ocirc CDATA "ô" -- small o, circumflex accent -->
- <!ENTITY otilde CDATA "õ" -- small o, tilde -->
- <!ENTITY ouml CDATA "ö" -- small o, dieresis or umlaut -->
- <!ENTITY divide CDATA "÷" -- divide sign -->
- <!ENTITY oslash CDATA "ø" -- small o, slash -->
- <!ENTITY ugrave CDATA "ù" -- small u, grave accent -->
- <!ENTITY uacute CDATA "ú" -- small u, acute accent -->
- <!ENTITY ucirc CDATA "û" -- small u, circumflex accent -->
- <!ENTITY uuml CDATA "ü" -- small u, dieresis or umlaut -->
- <!ENTITY yacute CDATA "ý" -- small y, acute accent -->
- <!ENTITY thorn CDATA "þ" -- small thorn, Icelandic -->
- <!ENTITY yuml CDATA "ÿ" -- small y, dieresis or umlaut -->
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 39]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- 8. Security Considerations
-
- Anchors, embedded images, and all other elements which contain URIs
- as parameters may cause the URI to be dereferenced in response to
- user input. In this case, the security considerations of [RFC1738]
- apply.
-
- The widely deployed methods for submitting form requests -- HTTP and
- SMTP -- provide little assurance of confidentiality. Information
- providers who request sensitive information via forms -- especially
- by way of the `PASSWORD' type input field (see section 8.1.2 in
- [RFC1866]) -- should be aware and make their users aware of the lack
- of confidentiality.
-
- Bibliography
-
- [BRYAN88] M. Bryan, "SGML -- An Author's Guide to the Standard
- Generalized Markup Language", Addison-Wesley, Reading,
- 1988.
-
- [ERCS] Extended Reference Concrete Syntax for SGML.
- <http://www.sgmlopen.org/sgml/docs/ercs/ercs-
- home.html>
-
- [GOLD90] C. F. Goldfarb, "The SGML Handbook", Y. Rubinsky, Ed.,
- Oxford University Press, 1990.
-
- [HTTP-1.1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
- and T. Berners-Lee, "Hypertext Transfer Protocol --
- HTTP/1.1", RFC 2068, January 1997.
-
- [ISO-639] ISO 639:1988. International standard -- Code for the
- representation of the names of languages. Technical
- content in <http://www.sil.org/sgml/iso639a.html>
-
- [ISO-8859] ISO 8859. International standard -- Information pro-
- cessing -- 8-bit single-byte coded graphic character
- sets -- Part 1: Latin alphabet No. 1 (1987) -- Part 2:
- Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet
- No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) --
- Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
- Latin/Arabic alphabet (1987) -- Part : Latin/Greek
- alphabet (1987) -- Part 8: Latin/Hebrew alphabet
- (1988) -- Part 9: Latin alphabet No. 5 (1989) -- Part
- 10: Latin alphabet No. 6 (1992)
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 40]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- [ISO-8879] ISO 8879:1986. International standard -- Information
- processing -- Text and office systems -- Standard gen-
- eralized markup language (SGML).
-
- [ISO-10646] ISO/IEC 10646-1:1993. International standard -- Infor-
- mation technology -- Universal multiple-octet coded
- character Sset (UCS) -- Part 1: Architecture and basic
- multilingual plane.
-
- [NICOL] G.T. Nicol, "The Multilingual World Wide Web",
- Electronic Book Technologies, 1995,
- <http://www.ebt.com/docs/multling.html>
-
- [NICOL2] G.T. Nicol, "MIME Header Supplemented File Type", Work
- in Progress, EBT, October 1995.
-
- [RFC1345] Simonsen, K., "Character Mnemonics & Character Sets",
- RFC 1345, Rationel Almen Planlaegning, June 1992.
-
- [RFC1468] Murai, J., Crispin M., and E. van der Poel,
- "Japanese Character Encoding for Internet Messages",
- RFC 1468, Keio University, Panda Programming, June
- 1993.
-
- [RFC2045] Freed, N., and N. Borenstein, "Multipurpose Internet
- Mail Extensions (MIME) Part One: Format of Internet
- Message Bodies", RFC 2045, Innosoft, First Virtual,
- November 1996.
-
- [RFC1641] Goldsmith, D., and M.Davis, "Using Unicode with MIME",
- RFC 1641, Taligent inc., July 1994.
-
- [RFC1642] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
- Transformation Format of Unicode", RFC 1642, Taligent,
- Inc., July 1994.
-
- [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill,
- "Uniform Resource Locators (URL)", RFC 1738, CERN,
- Xerox PARC, University of Minnesota, October 1994.
-
- [RFC1766] Alverstrand, H., "Tags for the Identification of
- Languages", RFC 1766, UNINETT, March 1995.
-
- [RFC1866] Berners-Lee, T., and D. Connolly, "Hypertext Markup
- Language - 2.0", RFC 1866, MIT/W3C, November 1995.
-
- [RFC1867] Nebel, E., and L. Masinter, "Form-based File Upload
- in HTML", RFC 1867, Xerox Corporation, November 1995.
-
-
-
- Yergeau, et. al. Standards Track [Page 41]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- [RFC1942] Raggett, D., "HTML Tables", RFC 1942, W3C, May 1996.
-
- [RFC2068] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
- and T. Berners-Lee, "Hypertext Transfer Protocol --
- HTTP/1.1", RFC 2068, January 1997.
-
- [SQ91] SoftQuad, "The SGML Primer", 3rd ed., SoftQuad Inc.,
- 1991.
-
- [TAKADA] Toshihiro Takada, "Multilingual Information Exchange
- through the World-Wide Web", Computer Networks and
- ISDN Systems, Vol. 27, No. 2, Nov. 1994 , p. 235-241.
-
- [TEI] TEI Guidelines for Electronic Text Encoding and Inter-
- change. <http://etext.virgina.edu/TEI.html>
-
- [UNICODE] The Unicode Consortium, "The Unicode Standard --
- Worldwide Character Encoding -- Version 1.0", Addison-
- Wesley, Volume 1, 1991, Volume 2, 1992, and Technical
- Report #4, 1993. The BIDI algorithm is in appendix A
- of volume 1, with corrections in appendix D of volume
- 2.
-
- [UTF-8] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transfor-
- mation Format 8 (UTF-8).
-
- [VANH90] E. van Hervijnen, "Practical SGML", Kluwer Academicq
- Publishers Group, Norwell and Dordrecht, 1990.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Yergeau, et. al. Standards Track [Page 42]
-
- RFC 2070 HTML Internationalization January 1997
-
-
- Authors' Addresses
-
- Frangois Yergeau
- Alis Technologies
- 100, boul. Alexis-Nihon, bureau 600
- Montrial QC H4M 2P2
- Canada
-
- Tel: +1 (514) 747-2547
- Fax: +1 (514) 747-2561
- EMail: fyergeau@alis.com
-
-
- Gavin Thomas Nicol
- Electronic Book Technologies, Japan
- 1-29-9 Tsurumaki,
- Setagaya-ku,
- Tokyo
- Japan
-
- Tel: +81-3-3230-8161
- Fax: +81-3-3230-8163
- EMail: gtn@ebt.com, gtn@twics.co.jp
-
-
- Glenn Adams
- Spyglass
- 118 Magazine Street
- Cambridge, MA 02139
- U.S.A.
-
- Tel: +1 (617) 864-5524
- Fax: +1 (617) 864-4965
- EMail: glenn@spyglass.com
-
-
- Martin J. Duerst
- Multimedia-Laboratory
- Department of Computer Science
- University of Zurich
- Winterthurerstrasse 190
- CH-8057 Zurich
- Switzerland
-
- Tel: +41 1 257 43 16
- Fax: +41 1 363 00 35
- EMail: mduerst@ifi.unizh.ch
-
-
-
-
- Yergeau, et. al. Standards Track [Page 43]
-
-