NetNews Usenet Archive 1992 #18

Internet Message Format | 1992-08-14 | 5.1 KB

Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: ISO 10646 the final character set?
Message-ID: <26625@life.ai.mit.edu>
Date: 15 Aug 92 03:50:22 GMT
Article-I.D.: life.26625
References: <BstGEq.7E7@immd4.informatik.uni-erlangen.de> <q++ygqb@rpi.edu>
Sender: news@ai.mit.edu
Organization: MIT Artificial Intelligence Laboratory
Lines: 82

Even though John Jenkins and Erik Naggum have already responded to some
recent second-hand statements about 10646, I think it is worth
emphasizing a couple of points.

1. 10646 is now an official ISO standard which defines a character
   repertoire *aimed* at covering the written form of all languages.
   [Some languages cannot be represented yet since the scripts used to
   write them have yet to be incorporated into 10646; namely, the
   writing systems based on the Burmese, Ethiopic, Khmer, Mongolian,
   Sinhalese, and Tibetan scripts. These are planned for a future
   version along with other archaic and less-used scripts.]

2. 10646 defines two encodings for these characters: a 32-bit canonical
   form and a 16-bit form restricted to the Basic Multilingual Plane
   (BMP). The 16-bit form, also called UCS2 (universal character set,
   2-byte form), can be transformed into the 32-bit form (UCS4) by zero
   extending the 16-bit value to 32 bits.

3. No characters are currently defined outside of the BMP. The BMP is
   derived from Unicode 1.0 with some changes; Unicode 1.1 will be
   identical to the BMP. [In fact, Unicode 1.0, volume 2, which
   publishes the Han character encodings, is already identical to the
   Han encodings in the BMP; other changes required of Unicode 1.0,
   volume 1, in order to accommodate the new 10646 are also documented
   in volume 2.] The first 256 encodings of both the UCS2 and UCS4
   forms are identical to ISO 8859-1 (of which ASCII constitutes the
   first 128 elements).

4. Software should not interpret 10646 encoded data as a sequence of
   bytes for the purpose of determining character values; only 16-bit
   and 32-bit unsigned integral units should be used. [Byte swapping
   may be necessary to deal with mixed-endian environments; 10646 does
   define a mechanism to ensure that the correct endian order can be
   determined.]

5. There are *no* holes in the encoding space of either UCS4 or UCS2
   (BMP). There *are* unassigned code points (bit values), but these
   may be assigned in future versions of the standard. [This differs
   from EUC, Shift JIS, and other similar encoding techniques that *do*
   have holes in their encoding spaces.]

6. The language of the text encoded with 10646 is not determined by
   10646; a higher level protocol such as an environment variable, an
   escape sequence, or other out-of-band data must supply the language
   which is being represented.

7. 10646 does not allow creating new *characters* by using "combining
   characters." The semantics of combining characters is that their
   visual form is intended to "combine" with the visual form of a base
   character. The combination so produced is not a "character" in the
   sense of "a member of a character set." However, an application or
   user may choose to interpret the combination as a "letter" (i.e., a
   unit of writing) in the context of their written language.

8. The issue of combining characters in 10646 is *very* different from
   the issue of variable length encodings in Shift JIS, EUC Multibyte,
   and other similar variable length encoding techniques. In these
   cases, one *cannot* interpret a sequence of bytes *as a character in
   a character set* without scanning from the beginning of the string
   (or synchronizing to a character encoding boundary). In contrast,
   with 10646, a program can randomly index (by 16-bit unit) any
   character encoding (code element) in a string and be assured of
   finding a proper character encoding; with Shift JIS, a random index
   (by byte) may land right in the middle of a character encoding and
   thus yield garbage.

9. Contrary to claims otherwise, 16 bits is adequate for representing
   all of the world's languages if two principles are followed:
   (1) unify different uses of the same script and encode only one
   instance of a given script; and (2) allow characters to combine in a
   visual sense in order to produce more complex shapes or forms from
   simpler shapes or forms. Both of these principles are followed by
   Unicode (and the 10646 BMP). It is unlikely that the full 32-bit
   canonical form will ever be needed, except perhaps for compatibility
   or efficiency reasons. In terms of actually representing all
   languages, 16 bits is quite adequate. [Personally, I will bet my
   Sparcstation that a number of major vendors, including Apple and
   Microsoft, will never support the 32-bit form for internal
   processing. It simply isn't needed.]

Finally, I'd like to note that an Implementers Workshop for Unicode &
10646 is being planned for early December in Frankfurt, Germany. A
formal announcement will be forthcoming.

Glenn Adams
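
Points 2, 4, and 8 of the article above can be illustrated with a minimal Python sketch. This is not part of the original post. The function names `decode_ucs2` and `ucs2_to_ucs4_bytes` are hypothetical. The sketch assumes that the byte-order mechanism mentioned in point 4 is a leading FEFF signature, as in Unicode, and that data without a signature is big-endian. The Shift JIS bytes come from Python's `shift_jis` codec.

```python
import struct

SIGNATURE = 0xFEFF          # assumed byte-order signature (see lead-in)
SIGNATURE_SWAPPED = 0xFFFE  # the same signature read with the wrong byte order


def decode_ucs2(data):
    """Return the 16-bit code elements in `data` as a list of integers.

    Point 4: values are read as 16-bit unsigned units, never as single bytes.
    Assumption: big-endian unless a leading signature says otherwise.
    """
    if len(data) % 2:
        raise ValueError("UCS2 data must be a whole number of 16-bit units")
    count = len(data) // 2
    units = list(struct.unpack(">%dH" % count, data))
    if units and units[0] == SIGNATURE_SWAPPED:
        # The signature came out byte-swapped, so reread as little-endian.
        units = list(struct.unpack("<%dH" % count, data))
    if units and units[0] == SIGNATURE:
        units = units[1:]  # drop the signature itself
    return units


def ucs2_to_ucs4_bytes(units):
    """Point 2: zero extend each 16-bit unit to a 32-bit big-endian unit."""
    return struct.pack(">%dI" % len(units), *units)


# Point 8: the same two Han characters in both encodings.
big = b"\xfe\xff\x65\xe5\x67\x2c"     # signature + U+65E5 U+672C, big-endian
little = b"\xff\xfe\xe5\x65\x2c\x67"  # the same text, little-endian
units = decode_ucs2(big)

# Any index into `units` is a whole code element.
second_char = units[1]                # 0x672C

# In Shift JIS the same text is 93 FA 96 7B. Byte index 3 is the trail byte
# of the second character, yet taken alone it looks like ASCII '{' (0x7B).
sjis = "\u65e5\u672c".encode("shift_jis")
stray = sjis[3]
```

Here `decode_ucs2(big)` and `decode_ucs2(little)` yield the same two code elements, and `ucs2_to_ucs4_bytes` only prepends two zero bytes to each of them. The stray `0x7B` is the kind of garbage point 8 describes: a byte-wise index into Shift JIS data cannot tell a trail byte from a one-byte character without scanning from a known boundary.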