- Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: ISO 10646 the final character set?
- Message-ID: <26625@life.ai.mit.edu>
- Date: 15 Aug 92 03:50:22 GMT
- Article-I.D.: life.26625
- References: <BstGEq.7E7@immd4.informatik.uni-erlangen.de> <q++ygqb@rpi.edu>
- Sender: news@ai.mit.edu
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 82
-
- Even though John Jenkins and Erik Naggum have already responded to some
- recent second-hand statements about 10646, I think it is worth emphasizing a
- couple of points.
-
- 1. 10646 is now an official ISO standard which defines a character
- repertoire *aimed* at covering the written form of all languages.
- [Some languages cannot be represented yet since the scripts used
- to write them have yet to be incorporated into 10646; namely, the
- writing systems based on the Burmese, Ethiopic, Khmer, Mongolian,
- Sinhalese, and Tibetan scripts. These are planned for a future
- version along with other archaic and less-used scripts. ]
-
- 2. 10646 defines two encodings for these characters: a 32-bit canonical
- form and a 16-bit Basic Multilingual Plane (BMP). The 16-bit form, also
- called UCS2 (universal character set - 2 byte form), can be transformed
- into the 32-bit form (UCS4) by zero extending the 16-bit value to 32 bits.
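-
- [A minimal sketch of the zero extension in point 2, written in Python.
- The function name and the choice of big-endian packing are assumptions
- made for illustration; neither comes from 10646 or from this post.]

```python
import struct

def ucs2_to_ucs4(units):
    """Zero-extend 16-bit UCS2 code values to 32-bit UCS4 values.

    `units` is a sequence of integers in the range 0..0xFFFF. The result
    is a byte string holding one 32-bit unsigned integer per input value,
    packed big-endian (an assumption of this sketch).
    """
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit value: %r" % (u,))
    # Each value keeps its numeric value; the upper 16 bits are zero.
    return struct.pack(">%dI" % len(units), *units)

ucs4 = ucs2_to_ucs4([0x0041, 0x00E9, 0x4E2D])
print(ucs4.hex())
```

- [The numeric value of each code element is unchanged; only the width of
- the unit that holds it grows from 16 to 32 bits.]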
-
- 3. No characters are currently defined outside of the BMP. The BMP is
- derived from Unicode 1.0 with some changes; Unicode 1.1 will be identical
- to the BMP. [In fact, Unicode 1.0, volume 2, which publishes the Han
- character encodings, is already identical to the Han encodings in the BMP;
- other changes required of Unicode 1.0, volume 1, in order to accommodate
- the new 10646 are also documented in volume 2.]
-
- The first 256 encodings of both the UCS2 and UCS4 forms are identical to
- ISO8859-1 (of which ASCII constitutes the first 128 elements).
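-
- [A small check of the ISO 8859-1 correspondence, sketched in Python. It
- relies on Python's "latin-1" codec and on Python strings, which follow
- present-day Unicode; the variable names are illustrative only.]

```python
# Decode every possible ISO 8859-1 byte and record the resulting code value.
latin1_bytes = bytes(range(256))
code_values = [ord(ch) for ch in latin1_bytes.decode("latin-1")]

# Each byte value b should map to the code value b.
matches = code_values == list(range(256))

# The first 128 of these values are the ASCII range.
ascii_part = code_values[:128]
print(matches)
```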
-
- 4. Software should not interpret 10646 encoded data as a sequence of bytes
- for the purpose of determining character values; only 16-bit and 32-bit
- unsigned integral units should be used. [Byte swapping may be necessary
- to deal with mixed endian environments; 10646 does define a mechanism to
- ensure that the correct endian order can be determined].
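-
- [A sketch of byte-order handling, in Python. The post does not name the
- mechanism; this sketch assumes it is a leading signature with the value
- FEFF, which reads as FFFE when the bytes of each unit are swapped. The
- function names are hypothetical.]

```python
import struct

SIGNATURE = 0xFEFF          # assumed signature value at the start of the data
SWAPPED_SIGNATURE = 0xFFFE  # what the signature looks like in the wrong order

def swap16(unit):
    """Exchange the two bytes of a 16-bit unsigned value."""
    return ((unit & 0xFF) << 8) | (unit >> 8)

def read_ucs2(data):
    """Return the 16-bit code elements in `data`, signature removed."""
    if len(data) % 2 != 0:
        raise ValueError("UCS2 data must be a whole number of 16-bit units")
    # Read the data as 16-bit units, tentatively big-endian.
    units = list(struct.unpack(">%dH" % (len(data) // 2), data))
    if not units:
        return units
    if units[0] == SWAPPED_SIGNATURE:
        # The writer used the other byte order: swap every unit.
        units = [swap16(u) for u in units]
    elif units[0] != SIGNATURE:
        raise ValueError("no signature; byte order must be known some other way")
    return units[1:]

print(read_ucs2(b"\xFE\xFF\x00\x41"), read_ucs2(b"\xFF\xFE\x41\x00"))
```

- [Character values are taken only from whole 16-bit units, after any
- swapping; no character value is derived from an individual byte.]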
-
- 5. There are *no* holes in the encoding space of either UCS4 or UCS2 (BMP).
- There *are* unassigned code points (bit values); but these may be assigned
- in future versions of the standard. [This differs from EUC, Shift JIS,
- and other similar encoding techniques that *do* have holes in their
- encoding spaces.]
-
- 6. The language of the text encoded with 10646 is not determined by 10646;
- a higher level protocol such as an environment variable, an escape
- sequence, or other out-of-band data must supply the language which is
- being represented.
-
- 7. 10646 does not allow creating new *characters* by using "combining
- characters." The semantics of combining characters is that their
- visual form is intended to "combine" with the visual form of a base
- character. The combination so produced is not a "character" in the
- sense of "a member of a character set." However, an application or
- user may choose to interpret the combination as a "letter" (i.e., a
- unit of writing) in the context of their written language.
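-
- [A sketch of how an application might group code elements into
- "letters" as described in point 7, in Python. It assumes the combining
- classes in Python's unicodedata module, which reflect present-day
- Unicode rather than the 1992 standard; the function name is hypothetical.]

```python
import unicodedata

def group_letters(units):
    """Group code elements into lists: a base followed by its combining marks."""
    letters = []
    for u in units:
        is_combining = unicodedata.combining(chr(u)) != 0
        if letters and is_combining:
            # Attach the combining character to the preceding base.
            letters[-1].append(u)
        else:
            # Start a new letter (also covers a stray leading combining mark).
            letters.append([u])
    return letters

# "e" followed by a combining acute accent, then "x".
grouped = group_letters([0x0065, 0x0301, 0x0078])
print(grouped)
```

- [The string still holds three characters in the character-set sense; the
- grouping into two letters is an interpretation made by the application.]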
-
- 8. The issue of combining characters in 10646 is *very* different from
- the issue of variable length encodings in Shift JIS, EUC Multibyte,
- and other similar variable length encoding techniques. In these cases,
- one *cannot* interpret a sequence of bytes *as a character in a
- character set* without scanning from the beginning of the string (or
- by synchronizing to a character encoding boundary). In contrast, with
- 10646, a program can randomly index (by 16-bit unit) any character
- encoding (code element) in a string and be assured of finding a proper
- character encoding; with Shift JIS, a random index (by byte) may land
- one right in the middle of a character encoding and thus yield garbage.
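-
- [A sketch contrasting the two kinds of indexing in point 8, in Python.
- The byte values are assumed to be the Shift JIS encoding of a
- three-character Japanese string, and the lead-byte ranges are the usual
- ones for Shift JIS; the function names are hypothetical.]

```python
# Assumed Shift JIS bytes for a three-character string (two bytes each).
sjis = bytes([0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA])
# The same three characters as 16-bit code elements.
ucs2 = [0x65E5, 0x672C, 0x8A9E]

def is_sjis_lead(byte):
    """True if `byte` starts a two-byte Shift JIS character."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

def sjis_char_starts(data):
    """Scan from the beginning and return the byte offset of each character."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        i += 2 if is_sjis_lead(data[i]) else 1
    return starts

starts = sjis_char_starts(sjis)

# Byte offset 3 is not a character boundary: it is the second byte of a
# two-byte character, yet its value equals that of ASCII "{".
stray = sjis[3]
print(starts, 3 in starts, stray == ord("{"))

# Every index into the 16-bit sequence is a complete code element.
print([hex(ucs2[i]) for i in range(len(ucs2))])
```

- [Finding the boundaries in the byte string requires the scan from the
- start; the 16-bit sequence needs no such scan.]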
-
- 9. Contrary to claims otherwise, 16 bits is adequate for representing all
- of the world's languages if two principles are followed: (1) unify
- different uses of the same script and encode only one instance of
- a given script; and (2) allow characters to combine in a visual sense
- in order to produce more complex shapes or forms from simpler shapes or
- forms. Both of these principles are followed by Unicode (and 10646 BMP).
- It is unlikely that the full 32-bit canonical forms will ever be needed
- except for perhaps compatibility or efficiency reasons. However, in
- terms of actually representing all languages, 16 bits is quite adequate.
-
- [Personally, I will bet my Sparcstation that a number of major vendors,
- including Apple and Microsoft, will never support the 32-bit form for
- internal processing. It simply isn't needed.]
-
- Finally, I'd like to note that an Implementers Workshop for Unicode &
- 10646 is being planned for early December in Frankfurt, Germany. A formal
- announcement will be forthcoming.
-
- Glenn Adams
-