- Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: ISO 10646 questions
- Message-ID: <27192@life.ai.mit.edu>
- Date: 25 Aug 92 22:23:27 GMT
- Article-I.D.: life.27192
- References: <1992Aug25.105342.189801@rrz.uni-koeln.de>
- Sender: news@ai.mit.edu
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 206
-
-
- From: a0047@aix370.rrz.Uni-Koeln.DE (Andreas Strotmann)
- Message-ID: <1992Aug25.105342.189801@rrz.uni-koeln.de>
- Date: 25 Aug 92 10:53:42 GMT
-
- - Is there some sort of official press release by ISO describing key
- features of the new standard? Where published?
-
- ISO doesn't make press releases. Contact your local standards organization
- (in your case DIN). The Unicode Consortium has made a number of announcements
- describing the result of the merger with Unicode. You can contact The
- Benjamin Group in San Jose, CA for information about these releases.
-
- - There were a couple of topics hotly debated between UniCode and 10646
- champions on this group. I would like to know how these have been
- resolved in the unification of these two. Specifically:
-
- * Pre-/Postfix "diacritics" (aka "combining characters", "non-spacing
- characters"): Which are they? Is there a limit defined on the number
- of these in a row allowed?
-
- All non-spacing marks (combining characters, etc.) are encoded in POSTFIX
- order, i.e., after the base character to which they are applied. No limits
- are placed on the number or combinations of these characters. [The
- exception is the implementation levels: in level 1, no non-spacing marks
- (NSMs) are allowed; in level 2, only those NSMs which are absolutely needed
- to represent a language, e.g., Thai or Vowelled Arabic, are allowed; in
- level 3, there are no restrictions at all. Since Unicode does not provide implementation
- levels, it is to be considered as level 3; however, by subsetting Unicode,
- one can operate at either level 2 or level 1. For example, one can build
- a (relatively useless) Unicode implementation which supports the Empty Subset
- of 10646 UCS2; this would be a valid Unicode implementation as long as it
- observed the Unicode conformance criteria: if you can't interpret a
- character, and you intend to interchange it, then you shouldn't damage it
- unintentionally, i.e., simply because you can't interpret it.]
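
As a rough illustration of postfix ordering, here is a minimal Python sketch that groups each base character with the non-spacing marks that follow it. The function names are made up for this example, and using the general category "Mn" from Python's unicodedata module as the test for a non-spacing mark is an assumption; the level-1 check is a simplification, not the standard's own definition.

```python
import unicodedata

def is_nsm(ch):
    # Assumption: treat general category "Mn" as "non-spacing mark".
    return unicodedata.category(ch) == "Mn"

def split_clusters(text):
    """Group each base character with the marks encoded after it."""
    clusters = []
    for ch in text:
        if clusters and is_nsm(ch):
            clusters[-1] += ch      # postfix order: mark follows its base
        else:
            clusters.append(ch)
    return clusters

def uses_no_nsm(text):
    """Simplified stand-in for a level-1 check: no non-spacing marks at all."""
    return not any(is_nsm(ch) for ch in text)

# e + COMBINING ACUTE, then a + COMBINING DIAERESIS + COMBINING MACRON
sample = "e\u0301a\u0308\u0304"
clusters = split_clusters(sample)   # two clusters, the second with two marks
```

Note that nothing in split_clusters limits how many marks may follow one base character, which matches the "no limits" statement above.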
-
- * Does 10646 still reserve about a quarter of all possible codes
- for control characters, a la ISO 8859?
-
- No "byte values" are reserved whatsoever. In particular, NULL byte values
- can appear in any byte of either UCS-2 or UCS-4 character encodings. These
- are more accurately treated as 16-bit and 32-bit integer encodings and not
- a sequence of bytes. The entire 65536 encodings of UCS-2 are available for
- encoding characters; 6144 encoding values in this space are reserved for
- private use by either vendor or end-user. [An informative annex of 10646
- defines a transformation method which can convert either UCS-2 or UCS-4
- into an 8-bit byte stream that preserves C0, C1, SPACE, and DEL encodings,
- thus allowing transmission over many existing paths. Copyleft sources for
- routines which convert to and from this format can be obtained by anonymous
- FTP on the host METIS.COM (140.186.33.40). See the file pub/utf.c.]
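
To see why NULL byte values matter, here is a small Python sketch that serializes 16-bit code values and then looks for zero bytes. The helper names are hypothetical, and the big-endian byte order is an assumption made only for this illustration. It does not implement the transformation format from the annex.

```python
import struct

def ucs2_units(text):
    """Return one 16-bit code value per character."""
    units = [ord(ch) for ch in text]
    if any(u > 0xFFFF for u in units):
        raise ValueError("character outside the 16-bit range")
    return units

def ucs2_bytes(text):
    """Serialize the code values as bytes (big-endian assumed here)."""
    units = ucs2_units(text)
    return struct.pack(">%dH" % len(units), *units)

data = ucs2_bytes("Az")            # b'\x00A\x00z'
first_zero = data.find(b"\x00")    # 0: a C-style string would end at once

# A zero byte can also show up in the low half of a code value:
low_zero = ucs2_bytes("\u0100")    # b'\x01\x00'
```

Any routine that treats a zero byte as a terminator will truncate such data, which is one reason to handle the encodings as integers rather than as byte strings.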
-
- - Another hot debate was on Han unification
- Recent postings cite two numbers: JIS using effectively 14 bits to
- code their script, and 36000 code points being reserved for "unified"
- Han. Now (3x15000)=45000, so unification cannot have been very
- extensive, or was it?
-
- The initial collection of character sets which served as the source for
- the unification process contained over 100,000 characters in all; the end
- result of unification produced 20,992 characters. I judge that to be a very
- high rate of unification.
-
- In the same thread it was suggested that European scripts should be
- unified, too [Capital A in latin, greek, cyrillic, e.g.]. Were they?
-
- No. This is a ridiculous suggestion that no one took seriously. Indeed,
- in a fit of satire, I once suggested creating PCCode - ProtoCanaanite Code
- which would radically unify ALL non-Han scripts. Needless to say,
- I wasn't serious.
-
- You must keep in mind that perhaps the highest priority of the choice of
- what was a character for 10646 had to do with retaining 1-1 correspondence
- with existing character sets; this precludes any sort of coherent theoretic
- approach which, in all likelihood, would have produced something quite
- unusable.
-
- - I'd like to include a couple of statistics in my article:
-
- * How many graphic characters are effectively defined in ISO 10646 UCS2?
-
- I don't have the final count; it's between 35,000 and 40,000 if you count
- the private use zone (6144 encodings). [I will post an accurate character
- count shortly.]
-
- * How many code points are reserved, but not yet assigned in UCS2
- (i.e. how much can still fit in for the next release)?
-
- Say 25,000 or so = ( 65,536 - whatever number one chooses for the above )
-
- * How many/which modern languages/scripts are covered/not (yet) covered?
- Why not if not?
-
- 10646 encodes scripts, not languages; 6 important modern scripts are not yet
- included: Burmese, Ethiopian, Khmer, Mongolian, Sinhalese, and Tibetan.
- Perhaps 10-20 less widely used scripts are also not yet present, including
- Cree, Lanna Thai, Mangyan, Pollard, Tai Nua, Tifinagh, Yi, and others. The
- reason they are not yet encoded is either because no proposal has been made
- for them yet or the details of their encoding haven't been entirely worked
- out. All of the 6 major scripts above are under review for inclusion in 10646.
-
- - Does every UCS2 coded text have to start with the "signature" mentioned
- in earlier postings? How about UCS4?
-
- No signature is required; one is simply recommended in certain contexts.
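
A reader that tolerates both cases might look like the following Python sketch. Note the assumptions: the post does not give the signature's value, so the code value FEFF is taken from general Unicode practice; using a byte-swapped signature to detect byte order, and defaulting to big-endian when no signature is present, are likewise choices made for this example. read_ucs2 is a hypothetical name.

```python
import struct

SIGNATURE = 0xFEFF   # assumed signature value; not stated in the post
SWAPPED = 0xFFFE     # what the signature looks like in the other byte order

def read_ucs2(data):
    """Return 16-bit code values, honoring an optional leading signature."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    n = len(data) // 2
    units = list(struct.unpack(">%dH" % n, data))      # assume big-endian
    if units and units[0] == SWAPPED:
        units = list(struct.unpack("<%dH" % n, data))  # re-read byte-swapped
    if units and units[0] == SIGNATURE:
        units = units[1:]                              # drop the signature
    return units
```

Text with or without the signature yields the same code values, so the signature stays optional for the reader.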
-
- - Are the names of characters that have been cited in numerous postings to
- this group "defined" by ISO 10646, i.e., standardized, too? Does every
- character have a unique name? How are (unified) Han characters named?
- Example? System? How about other scripts (arab, indian...)?
-
- Names are both unique and standardized. These are two essential requirements
- for all ISO character set standards. Han characters are named by using a
- hexadecimal symbol equal to their encoding value; no alternative way is
- acceptable.
-
- Some character names follow:
-
- CJK UNIFIED IDEOGRAPH-4E00
- HANGUL SYLLABLE SSANGSIOS-AE-LIEUL
- THAI LETTER KO KAI
- CYRILLIC CAPITAL LETTER KOPPA
-
- An informative annex of ISO10646 spells out the structure of character names.
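
The Han naming rule is mechanical enough to sketch in a few lines of Python. The function han_name is a hypothetical helper, and it does not check that the code value really falls in the unified Han range. The cross-check uses Python's unicodedata module, which reflects a later version of the character database than the one discussed here.

```python
import unicodedata

def han_name(code_value):
    """Build a unified Han character name from its encoding value."""
    # Four uppercase hexadecimal digits, as in the example name above.
    return "CJK UNIFIED IDEOGRAPH-%04X" % code_value

name = han_name(0x4E00)                  # "CJK UNIFIED IDEOGRAPH-4E00"
same = unicodedata.name("\u4e00") == name
```

Because the name is derived from the encoding value, uniqueness of Han names follows directly from uniqueness of the encoding.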
-
- - Does the *current* set of UniCode definitions really cover UCS2 as voted
- on? How can it be obtained? [Sorry, didn't save the answer to this when
- I should have. I tried finding a reference in university on-line library
- catalogues around the world - no luck! Have they sold any copies at
- all?]
-
- Some minor differences exist between Unicode 1.0, Volume I and the ISO10646
- code charts. Some new characters were added in 10646, some minor adjustments
- were made to the order of some elements, and a couple of characters have been
- deleted. These differences are largely documented in Unicode 1.0, Volume II;
- furthermore, Volume II contains the unified Han characters of Unicode, which
- are identical to those of ISO10646.
-
- These volumes can be ordered from Addison-Wesley Publishing, Route 128,
- Reading, MA 01867 USA, phone 1-800-447-2226. The ISBN # for volume II
- is 0-201-60845-6. I don't have the number for volume I handy.
-
- Who is the UniCode consortium? [Sorry again]
-
- Member companies now include: Adobe, Apple, Borland, Ecological Linguistics,
- Digital, Go, HP, IBM, Lotus Development, Microsoft, NeXT, Novell, Sun,
- Taligent, The Research Libraries Group, Symantec, Unisys, Wordperfect, and
- Xerox.
-
- - How can one listen in on the 10646 discussion forum mentioned by one
- of the experts in a recent posting? I assume it's going to discuss
- 10646 version 2 once version 1 is published, so it might still be
- interesting after most of the work has been done.
- Is it archived? Where?
-
- LISTSERV@JHUVM.BITNET; the list name is ISO10646. Now that the voting
- is over, there is little activity on this list. [I think everyone is
- exhausted and needs a rest.]
-
- - Where do you order the 10646 standard? [Sorry, again]
-
- The final form will not be available until early next year. It is being
- edited to reflect the outcome of the last meeting of WG2 in Seoul.
-
- - What's the legal issue on preparing lists from the 10646 definition
- and providing these on anonymous FTP? Possible candidates are
- name lists (probably no problem) or a reference font produced by
- scanning in the 10646 standard (almost certainly illegal).
-
- Neither of these is illegal. The Association for Font Information
- Interchange (AFII) is responsible for the code page charts of 10646;
- they can be contacted at AFII, 2961 Copa de Oro, Los Alamitos, CA 90720 USA.
- Personally, I think it would be much better to buy fonts than go through
- the effort of scanning. Anyway, I doubt if anyone will market a single
- quality font that covers all of 10646 UCS2; it doesn't make any sense from
- a marketing perspective. Some work is underway to create a 48x48 bitmap
- font of Unihan at AFII. I'm not sure, but they may make this available for
- free or a relatively small fee.
-
- Could someone convince the appropriate authorities that they would
- further acceptance and implementation of the standard considerably
- by donating a possibly copy-lefted reference font and character name
- list to be put on anonymous FTP servers so people can start hacking
- *now*?
-
- ISO charges a rather hefty premium for a copy of any standard. This is how
- they recover the reproduction and archiving costs. The Unicode Consortium
- *may* make a namelist available for individual members; contact Unicode, Inc.,
- 1965 Charleston Road, Mountain View, CA 94043, USA, phone + 415 966 1637
- for more info.
-
- I'm sure there's someone raring to go to do a couple of X11
- applications e.g. since several already exist in 16bit versions for
- Japanese, Korean and Chinese.
-
- The CJK versions of X11 do not process 16-bit character encodings; they
- process Double Byte Character Set (DBCS) encodings that intermix single
- and double byte encodings. 10646 UCS-2 requires each character encoding
- to be processed as a 16-bit integral value, i.e., unsigned short; one cannot
- use any existing functions based on the current C definition of char *.
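
The difference can be sketched in Python, with EUC-JP standing in for a DBCS encoding. The lead-byte rule below is a deliberate simplification (it ignores the single-shift bytes that real EUC-JP also defines), the helper names are hypothetical, and big-endian order for the 16-bit form is an assumption.

```python
import struct

def dbcs_char_lengths(data, is_lead_byte):
    """Walk a DBCS byte string; a lead byte starts a two-byte character."""
    lengths = []
    i = 0
    while i < len(data):
        step = 2 if is_lead_byte(data[i]) else 1
        lengths.append(step)
        i += step
    return lengths

def simple_lead(b):
    # Simplified assumption: any byte with the high bit set is a lead byte.
    return b >= 0x80

text = "A\u3042"                          # LATIN A, HIRAGANA A
dbcs = text.encode("euc_jp")              # b'A\xa4\xa2': 1 byte + 2 bytes
mixed = dbcs_char_lengths(dbcs, simple_lead)

# The 16-bit form: every character is exactly one 16-bit integer.
units = [ord(ch) for ch in text]
fixed = struct.pack(">%dH" % len(units), *units)   # 4 bytes for 2 characters
```

In the DBCS case the scanner has to inspect each byte to find character boundaries; in the 16-bit case the n-th character is simply the n-th integer, but the data can no longer be handed to byte-oriented string routines.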
-
- The Unicode Implementation Subcommittee is quite interested in participants
- who are knowledgeable in implementation issues of Unicode or 10646. Again,
- you can contact the Unicode Consortium address I gave above for more info
- on this work.
-
- Glenn Adams
-