NetNews Usenet Archive 1992 #19

Internet Message Format | 1992-08-25 | 10.7 KB

Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: ISO 10646 questions
Message-ID: <27192@life.ai.mit.edu>
Date: 25 Aug 92 22:23:27 GMT
Article-I.D.: life.27192
References: <1992Aug25.105342.189801@rrz.uni-koeln.de>
Sender: news@ai.mit.edu
Organization: MIT Artificial Intelligence Laboratory
Lines: 206

  From: a0047@aix370.rrz.Uni-Koeln.DE (Andreas Strotmann)
  Message-ID: <1992Aug25.105342.189801@rrz.uni-koeln.de>
  Date: 25 Aug 92 10:53:42 GMT

  - Is there some sort of official press release by ISO describing key features of the new standard? Where published?

ISO doesn't make press releases. Contact your local standards organization (in your case DIN). The Unicode Consortium has made a number of announcements describing the result of the merger of 10646 with Unicode. You can contact The Benjamin Group in San Jose, CA for information about these releases.

  - There were a couple of topics hotly debated between UniCode and 10646 champions on this group. I would like to know how these have been resolved in the unification of these two. Specifically:

  * Pre-/Postfix "diacritics" (aka "combining characters", "non-spacing characters"): Which are they? Is there a limit defined on the number of these in a row allowed?

All non-spacing marks (combining characters, etc.) are encoded in POSTFIX order, i.e., after the base character to which they are applied. No limits are placed on the number or combinations of these characters.

[With the exception that in implementation level 1, no non-spacing marks are allowed; in implementation level 2, only those NSMs which are absolutely needed to represent a language, e.g., Thai or Vowelled Arabic, are allowed; in implementation level 3, there are no restrictions at all. Since Unicode does not provide implementation levels, it is to be considered as level 3; however, by subsetting Unicode, one can operate at either level 2 or level 1.
For example, one can build a (relatively useless) Unicode implementation which supports the Empty Subset of 10646 UCS2; this would be a valid Unicode implementation as long as it observed the Unicode conformance criteria: if you can't interpret a character, and you intend to interchange it, then you shouldn't damage it unintentionally, i.e., simply because you can't interpret it.]

  * Does 10646 still reserve about a quarter of all possible codes for control characters, a la ISO 8859?

No "byte values" are reserved whatsoever. In particular, NULL byte values can appear in any byte of either UCS-2 or UCS-4 character encodings. These are more accurately treated as 16-bit and 32-bit integer encodings and not as a sequence of bytes. All 65,536 encodings of UCS-2 are available for encoding characters; 6144 encoding values in this space are reserved for private use by either vendor or end-user.

[An informative annex of 10646 defines a transformation method which can convert either UCS-2 or UCS-4 into an 8-bit byte stream that preserves C0, C1, SPACE, and DEL encodings, thus allowing transmission over many existing paths. Copyleft sources for routines which convert to and from this format can be obtained by anonymous FTP on the host METIS.COM (140.186.33.40). See the file pub/utf.c.]

  - Another hot debate was on Han unification.

    Recent postings cite two numbers: JIS using effectively 14 bits to code their script, and 36000 code points being reserved for "unified" Han. Now (3x15000)=45000, so unification cannot have been very extensive, or was it?

The initial collection of character sets which served as the source for the unification process contained over 100,000 characters in all; the end result of unification produced 20,992 characters. I judge that to be a very high rate of unification.

    In the same thread it was suggested that European scripts should be unified, too [Capital A in latin, greek, cyrillic, e.g.]. Were they?

No.
This is a ridiculous suggestion that no one took seriously. Indeed, in a fit of satire, I once suggested creating PCCode (ProtoCanaanite Code), which would radically unify ALL non-Han scripts. Needless to say, I wasn't serious. You must keep in mind that perhaps the highest priority in deciding what counted as a character for 10646 was retaining 1-1 correspondence with existing character sets; this precludes any sort of coherent theoretic approach which, in all likelihood, would have produced something quite unusable.

  - I'd like to include a couple of statistics in my article:

  * How many graphic characters are effectively defined in ISO 10646 UCS2?

I don't have the final count; it's between 35,000 and 40,000 if you count the private use zone (6144 encodings). [I will post an accurate character count shortly.]

  * How many code points are reserved, but not yet assigned in UCS2 (i.e. how much can still fit in for the next release)?

Say 25,000 or so, i.e., 65,536 minus whatever number one chooses for the above.

  * How many/which modern languages/scripts are covered/not (yet) covered? Why not if not?

10646 encodes scripts, not languages; 6 important modern scripts are not yet included: Burmese, Ethiopian, Khmer, Mongolian, Sinhalese, and Tibetan. Perhaps 10-20 less widely used scripts are also not yet present, including Cree, Lanna Thai, Mangyan, Pollard, Tai Nua, Tifinagh, Yi, and others. The reason they are not yet encoded is either that no proposal has been made for them yet or that the details of their encoding haven't been entirely worked out. All of the 6 major scripts above are under review for inclusion in 10646.

  - Does every UCS2 coded text have to start with the "signature" mentioned in earlier postings? How about UCS4?

No signature is required; one is simply recommended in certain contexts.

  - Are the names of characters that have been cited in numerous postings to this group "defined" by ISO 10646, i.e., standardized, too? Does every character have a unique name?
    How are (unified) Han characters named? Example? System? How about other scripts (arab, indian...)?

Names are both unique and standardized. These are two essential requirements for all ISO character set standards. Han characters are named using the hexadecimal representation of their encoding value; no alternative way is acceptable. Some character names follow:

        CJK UNIFIED IDEOGRAPH-4E00
        HANGUL SYLLABLE SSANGSIOS-AE-LIEUL
        THAI LETTER KO KAI
        CYRILLIC CAPITAL LETTER KOPPA

An informative annex of ISO 10646 spells out the structure of character names.

  - Does the *current* set of UniCode definitions really cover UCS2 as voted on? How can it be obtained? [Sorry, didn't save the answer to this when I should have. I tried finding a reference in university on-line library catalogues around the world - no luck! Have they sold any copies at all?]

Some minor differences exist between Unicode 1.0, Volume I and the ISO 10646 code charts. Some new characters were added in 10646, some minor adjustments were made to the order of some elements, and a couple of characters have been deleted. These changes are largely documented in Unicode 1.0, Volume II; furthermore, Volume II contains the unified Han characters of Unicode, which are identical to those of ISO 10646. These volumes can be ordered from Addison-Wesley Publishing, Route 128, Reading, MA 01867 USA, phone 1-800-447-2226. The ISBN # for Volume II is 0-201-60845-6. I don't have the number for Volume I handy.

    Who is the UniCode consortium? [Sorry again]

Member companies now include: Adobe, Apple, Borland, Ecological Linguistics, Digital, Go, HP, IBM, Lotus Development, Microsoft, NeXT, Novell, Sun, Taligent, The Research Libraries Group, Symantec, Unisys, WordPerfect, and Xerox.

  - How can one listen in on the 10646 discussion forum mentioned by one of the experts in a recent posting?
    I assume it's going to discuss 10646 version 2 once version 1 is published, so it might still be interesting after most of the work has been done. Is it archived? Where?

The list server is LISTSERV@JHUVM.BITNET; the list name is ISO10646. Now that the voting is over, there is little activity on this list. [I think everyone is exhausted and needs a rest.]

  - Where do you order the 10646 standard? [Sorry, again]

The final form will not be available until early next year. It is being edited to reflect the outcome of the last meeting of WG2 in Seoul.

  - What's the legal issue on preparing lists from the 10646 definition and providing these on anonymous FTP? Possible candidates are name lists (probably no problem) or a reference font produced by scanning in the 10646 standard (almost certainly illegal).

Neither of these is illegal. The Association for Font Information Interchange (AFII) is responsible for the code page charts of 10646; they can be contacted at AFII, 2961 Copa de Oro, Los Alamitos, CA 90720 USA. Personally, I think it would be much better to buy fonts than go through the effort of scanning. Anyway, I doubt if anyone will market a single quality font that covers all of 10646 UCS2; it doesn't make any sense from a marketing perspective. Some work is underway at AFII to create a 48x48 bitmap font of Unihan. I'm not sure, but they may make this available for free or for a relatively small fee.

    Could someone convince the appropriate authorities that they would further acceptance and implementation of the standard considerably by donating a possibly copy-lefted reference font and character name list to be put on anonymous FTP servers so people can start hacking *now*?

ISO charges a rather hefty premium for a copy of any standard. This is how they recover the reproduction and archiving costs. The Unicode Consortium *may* make a namelist available for individual members; contact Unicode, Inc., 1965 Charleston Road, Mountain View, CA 94043, USA, phone +1 415 966 1637 for more info.
    I'm sure there's someone raring to go to do a couple of X11 applications e.g. since several already exist in 16bit versions for Japanese, Korean and Chinese.

The CJK versions of X11 do not process 16-bit character encodings; they process Double Byte Character Set (DBCS) encodings that intermix single and double byte encodings. 10646 UCS-2 requires each character encoding to be processed as a 16-bit integral value, i.e., unsigned short; one cannot use any existing functions based on the current C definition of char *.

The Unicode Implementation Subcommittee is quite interested in participants who are knowledgeable in implementation issues of Unicode or 10646. Again, you can contact the Unicode Consortium at the address I gave above for more info on this work.

Glenn Adams
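To make the char * point concrete, here is a minimal C sketch; it is not taken from the standard or from utf.c. It assumes that unsigned short is exactly 16 bits wide and that a string ends with a 0x0000 element. The names ucs2_t, ucs2_len, and byte_len are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One UCS-2 code element. Assumption of this sketch: unsigned short
   is exactly 16 bits wide, as it is on most current platforms. */
typedef unsigned short ucs2_t;

/* Count 16-bit code elements up to (not including) a terminating
   0x0000 element. The terminator convention is an assumption here,
   not something the standard mandates. */
size_t ucs2_len(const ucs2_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0x0000)
        n++;
    return n;
}

/* What a byte-oriented routine sees when handed the same storage:
   strlen stops at the first zero BYTE, which in UCS-2 is usually
   just one half of an ordinary character such as 0x0041. */
size_t byte_len(const ucs2_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For the array { 0x0041, 0x4E00, 0x0000 }, ucs2_len returns 2, while byte_len returns 0 or 1 depending on the byte order of the machine, because the zero byte inside 0x0041 is taken as the end of the string.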