NetNews Usenet Archive 1992 #19

Newsgroups: comp.std.internat
Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!data.nas.nasa.gov!taligent!tseng
From: jenkinsj@blowfish.taligent.com (John H. Jenkins)
Subject: Re: ISO 10646 questions (longish)
Message-ID: <jenkinsj-250892105325@tseng.taligent.com>
Followup-To: comp.std.internat
Sender: usenet@taligent.com (More Bytes Than You Can Read)
Organization: Taligent, Inc.
References: <1992Aug25.105342.189801@rrz.uni-koeln.de>
Date: Tue, 25 Aug 1992 19:50:08 GMT
Lines: 374

In article <1992Aug25.105342.189801@rrz.uni-koeln.de>,
a0047@aix370.rrz.Uni-Koeln.DE (Andreas Strotmann) wrote:
>
> - Is there some sort of official press release by ISO describing key
>   features of the new standard? Where published?
>

Not that I'm aware.  Such information is available from Unicode, however.

> - There were a couple of topics hotly debated between UniCode and 10646
>   champions on this group. I would like to know how these have been
>   resolved in the unification of these two. Specifically:
>
>   * Pre-/Postfix "diacritics" (aka "combining characters", "non-spacing
>     characters"): Which are they? Is there a limit defined on the number
>     of these in a row allowed?
>

10646 now defines three implementation levels.  In Level 1, combining
characters may not be used.  In Level 3 (the old Level 2), they may be
used without restriction (meaning, among other things, there is no limit
on how many you can use one after the other).  A new Level 2 was defined
which allows use of combining marks for anything except Latin, Greek,
and Cyrillic.  This reflects the fact that few people objected to their
use with languages like Hebrew where they are practically de rigueur.
The main objections were to the use of combining marks with Latin,
Greek, and Cyrillic.

The combining marks are all postfix, BTW.

Unicode represents a Level 3 implementation of 10646.

>   * Does 10646 still reserve about a quarter of all possible codes
>     for control characters, a la ISO 8859?
>

No.
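
To make the Level 1 rule described above concrete, here is a minimal
sketch in Python.  It is only an approximation under stated assumptions:
it treats any character whose general category is a "Mark" (Mn, Mc, Me)
in Python's unicodedata tables as a combining character, and those
tables reflect present-day Unicode rather than the 10646 text discussed
here.  The function names are hypothetical, and the Level 2 rule (marks
allowed except with Latin, Greek, and Cyrillic) is not modelled.

```python
import unicodedata


def uses_combining_marks(text: str) -> bool:
    """True if any character has a Mark general category (Mn, Mc, Me).

    Assumption: the Mark categories stand in for the standard's own
    list of combining characters.
    """
    return any(unicodedata.category(ch).startswith("M") for ch in text)


def fits_level_1(text: str) -> bool:
    """Level 1 as described in the post: no combining characters at all."""
    return not uses_combining_marks(text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (postfix)

print(fits_level_1(precomposed))
print(fits_level_1(decomposed))
```

The decomposed form puts the mark after its base letter, which is what
"postfix" means above; only the precomposed form passes the Level 1
check.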
> - Another hot debate was on Han unification ("evil" comes to mind ;-).
>   Recent postings cite two numbers: JIS using effectively 14 bits to
>   code their script, and 36000 code points being reserved for "unified"
>   Han. Now (3x15000)=45000, so unification cannot have been very
>   extensive, or was it?
>   [15000 because not all code points in JIS are used, 3 for Korean,
>   Japanese, and Chinese]

I don't know where these numbers come from.  JIS uses a total of about
12,000 characters, but there were never more than 22,000 cells reserved
for Unified Han.

The following is taken from pp. 19-20 of "The Unicode standard," vol. 2:

#
# Compatibility with Existing Standards
#
# The compatibility of the Unicode Han character set with the
# repertoire of existing standards is assured by the source
# separation rule described above.  The Unicode standard
# contains additional Han characters that are not included
# in the unified repertoire, but that do occur in widely-used
# corporate character sets.  This practice is recognized by
# CJK-JRG.  The following table lists all the standards that
# comprise the Unicode Han character set, and the number of
# characters included from each.
#
#   Standard                                          Number of Characters
#   ANSI Z39.64-1989 (EACC)                           13,053
#   Big Five                                          13,481
#   CCCII, level 1                                     4,808
#   CNS 11643-1986                                    13,051
#   CNS 11643-1986 User Characters                     3,418
#   GB 2312-80 (GB0)                                   6,763
#   GB 12345-90 (GB1)                                  2,176
#   GB 7589-87 (GB3)                                   7,327
#   GB 7590-87 (GB5)                                   7,039
#   General Use Characters for Modern Chinese (GB7)       41
#   GB 8565-89 (GB8)                                     287
#   GB 12052-89 (Korean)                                  94
#   IBM Selected Japanese                                360
#   IBM Selected Korean                                    6
#   JEF (Fujitsu)                                      3,149
#   JIS X 0208-1990                                    6,355
#   JIS X 0212-1990                                    5,801
#   KS C 5601-1989                                     4,888
#   KS C 5657-1991                                     2,856
#   PRC Telegraph Code                                ~8,000
#   Taiwan Telegraph Code                              9,040
#   Xerox Chinese                                      9,776
#
#   Total characters covered                        ~121,769
#   Total unique characters                           21,001
#

If we had unified the various standards by language but left Japanese,
Korean, and Chinese separate the total unique characters would still
have been on the order of 40,000 to 50,000.

>   In the same thread it was suggested that European scripts should be
>   unified, too [Capital A in latin, greek, cyrillic, e.g.]. Were they?
>

As I understand it, this was never a serious suggestion, but a straw man
raised by opponents of Han unification.  They were pointing out that
they felt the relationship between Japanese kanji and Chinese characters
is analogous to that between the Latin and Greek alphabets -- since
nobody would ever unify the latter, why unify the former?

> - I'd like to include a couple of statistics in my article:
>
>   * How many graphic characters are effectively defined in ISO 10646 UCS2?
>   * How many code points are reserved, but not yet assigned in UCS2
>     (i.e. how much can still fit in for the next release)?

There are roughly 21,000 Han characters, 7000 precomposed Hangul
syllables and 7000 everything-else.  Additionally, about 6000 cells are
reserved for the user zone.  (This is all rounded to the nearest 1000.)
The total is therefore about 41,000 of 65,000 cells currently allocated.
That means there are about 24,000 cells left.  That sounds like a lot,
but it really isn't.
If you add up all the things that people would like to add to the BMP,
you'll find you have well over 24,000 pigeons to squeeze into 24,000
pigeonholes.

>   * How many/which modern languages/scripts are covered/not (yet) covered?
>     Why not if not?
>

The list of blocks in 10646 follows.  Anything not on this list is not
covered.  (This list reflects the second DIS of 10646.  The IS will have
slightly different block allocations.  BTW, remember that some of these
scripts cover a heck of a lot of languages; I'm not sure anyone has
prepared a "languages-covered" list.)

    IRV-646                                    0020 - 007E
    LATIN-1 SUPPLEMENT                         00A0 - 00FF
    EXTENDED LATIN-A                           0100 - 017F
    EXTENDED LATIN-B                           0180 - 024F
    IPA EXTENSIONS                             0250 - 02AF
    SPACING MODIFIER LETTERS                   02B0 - 02FF
    COMBINING DIACRITICAL MARKS                0300 - 036F
    GREEK                                      0370 - 03FF
    CYRILLIC                                   0400 - 04FF
    ARMENIAN                                   0530 - 058F
    HEBREW                                     0590 - 05FF
    ARABIC                                     0600 - 06FF
    DEVANAGARI                                 0900 - 097F
    BENGALI                                    0980 - 09FF
    GURMUKHI                                   0A00 - 0A7F
    GUJARATI                                   0A80 - 0AFF
    ORIYA                                      0B00 - 0B7F
    TAMIL                                      0B80 - 0BFF
    TELUGU                                     0C00 - 0C7F
    KANNADA                                    0C80 - 0CFF
    MALAYALAM                                  0D00 - 0D7F
    THAI                                       0E00 - 0E7F
    LAO                                        0E80 - 0EFF
   *TIBETAN                                    1000 - 105F
    GEORGIAN                                   10A0 - 10FF
    ADDITIONAL EXTENDED LATIN                  1E00 - 1EFF
    GREEK EXTENSIONS                           1F00 - 1FFF
    GENERAL PUNCTUATION                        2000 - 206F
    SUPERSCRIPTS AND SUBSCRIPTS                2070 - 209F
    CURRENCY SYMBOLS                           20A0 - 20CF
    COMBINING DIACRITICAL MARKS FOR SYMBOLS    20D0 - 20FF
    LETTERLIKE SYMBOLS                         2100 - 214F
    NUMBER FORMS                               2150 - 218F
    ARROWS                                     2190 - 21FF
    MATHEMATICAL OPERATORS                     2200 - 22FF
    MISCELLANEOUS TECHNICAL                    2300 - 23FF
    CONTROL PICTURES                           2400 - 243F
    OPTICAL CHARACTER RECOGNITION              2440 - 245F
    ENCLOSED ALPHANUMERICS                     2460 - 24FF
    BOX DRAWINGS                               2500 - 257F
    BLOCK ELEMENTS                             2580 - 259F
    GEOMETRIC SHAPES                           25A0 - 25FF
    MISCELLANEOUS DINGBATS                     2600 - 26FF
    DINGBATS                                   2700 - 27BF
    CJK SYMBOLS AND PUNCTUATION                3000 - 303F
    HIRAGANA                                   3040 - 309F
    KATAKANA                                   30A0 - 30FF
    BOPOMOFO                                   3100 - 312F
    HANGUL JAMO                                3130 - 318F
    CJK MISCELLANEOUS                          3190 - 319F
    COMBINING HANGUL JAMO                      31A0 - 31FF
    ENCLOSED CJK LETTERS AND IDEOGRAPHS        3200 - 32FF
    CJK COMPATIBILITY WORDS                    3300 - 337F
    CJK SQUARED ABBREVIATIONS                  3380 - 33FF
    HANGUL                                     3400 - 4DFF
    CJK UNIFIED IDEOGRAPHS                     4E00 - 9FFF
    PRIVATE USE AREA                           E000 - F7FF
    CJK COMPATIBILITY IDEOGRAPHS               F900 - FAFF
    ALPHABETIC PRESENTATION FORMS              FB00 - FBFF
    ARABIC PRESENTATION FORMS-A                FC00 - FDFF
    CJK COMPATIBILITY FORMS                    FE30 - FE4F
    SMALL FORM VARIANTS                        FE50 - FE6F
    ARABIC PRESENTATION FORMS-B                FE70 - FEFF
    HALFWIDTH AND FULLWIDTH FORMS              FF00 - FFEF
    SPECIALS                                   FFF0 - FFFD

   *Tibetan was withdrawn from the IS.

There are three main reasons why something is _not_ covered.

#1.  The script is ill-documented.  A number of scripts were excluded
only because nobody ever submitted an authoritative list of characters
they contain (e.g., various Native American languages).  These will be
added at a future date.

#2.  The script is in the main well-documented but there is considerable
controversy as to how it should best be encoded (e.g., Tibetan and
Ethiopian).  These are usually well-known scripts, but are rare enough
that Becker's Second Law kicks in ("The fewer experts there are on a
given subject, the more they disagree") and so there's a lot of
controversy that has to be settled.  Musical notation falls into this
category, too.  These will also be added at a future date.

#3.  The script is well-documented but SO-O-O-O rare that few people
other than specialists have even heard of it (e.g., the Deseret
Alphabet) or is out-and-out bogus (Klingon, Quenya, the Seuss script
from "On beyond zebra").  The few people who want to use these scripts
can put them into the Private Use Area -- that's what it's for.

> - Does every UCS2 coded text have to start with the "signature" mentioned
>   in earlier postings? How about UCS4?
>

No, but it would be nice.  The use of signatures is covered in Annex E,
which is only informative.

> - Are the names of characters that have been cited in numerous postings to
>   this group "defined" by ISO 10646, i.e., standardized, too? Does every
>   character have a unique name?

Yes and yes.
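
Going back to the block list and the Private Use Area for a moment:
the list is just a table of ranges, so finding the block a code belongs
to is a simple range search.  The sketch below copies a handful of rows
from the list above (second-DIS allocations, so the published IS may
differ); the names BLOCKS and block_of are hypothetical, not part of any
standard or library.

```python
from bisect import bisect_right

# (first code, last code, block name), copied from the post's list.
# Only a few rows are included here; they must stay sorted by first code.
BLOCKS = [
    (0x0020, 0x007E, "IRV-646"),
    (0x0370, 0x03FF, "GREEK"),
    (0x0400, 0x04FF, "CYRILLIC"),
    (0x0590, 0x05FF, "HEBREW"),
    (0x3400, 0x4DFF, "HANGUL"),
    (0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPHS"),
    (0xE000, 0xF7FF, "PRIVATE USE AREA"),
]
STARTS = [first for first, _, _ in BLOCKS]


def block_of(code: int):
    """Return the block name containing `code`, or None if unallocated
    (or simply not among the rows copied into BLOCKS)."""
    i = bisect_right(STARTS, code) - 1
    if i >= 0 and code <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


# Size of the Private Use Area as listed: E000 through F7FF inclusive.
PRIVATE_USE_CELLS = 0xF7FF - 0xE000 + 1

print(block_of(0x4E00))
print(block_of(0xE123))
print(block_of(0x0500))
print(PRIVATE_USE_CELLS)
```

The Private Use Area works out to 6,144 cells, which is the "about 6000
cells" quoted earlier for the user zone.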
> How are (unified) Han characters named?

By their code point.  You can't name them by their sound because they
have utterly different pronunciations even in different Chinese
dialects, let alone different languages, and you can't name them by
their meaning because they have too many.  The only real alternative was
to name them by their dictionary position.

>   Example? System? How about other scripts (arab, indian...)?
>

Character 0000 4E00 is named "CJK UNIFIED IDEOGRAPH-4E00."  You also
have things like "ARABIC LETTER HAH WITH HAMZA ABOVE" and "DEVANAGARI
LETTER KHA."

> - Does the *current* set of UniCode definitions really cover UCS2 as voted
>   on? How can it be obtained? [Sorry, didn't save the answer to this when
>   I should have. I tried finding a reference in university on-line library
>   catalogues around the world - no luck! Have they sold any copies at all?]
>

Unicode 1.0 has been published and is slightly different from UCS-2.
The Unicode Technical Committee formally agreed last month to issue 1.1
of Unicode which will be code-point identical to UCS-2.

Unicode 1.0 is defined in "The Unicode standard," published by
Addison-Wesley.  Your local technical bookstore should know how to
contact them.  Vol. 1 is US$32.95 and ISBN 0-201-56788-1; vol. 2 is
US$29.95 and ISBN 0-201-60845-6.  They are both selling very nicely.

The Unicode Consortium will also be holding its fourth Implementers
Workshop this December in Frankfurt, Germany.  Contact the Consortium
for more information.  Unicode is also opening a European branch to
promote the use of Unicode within Europe.  The fifth Implementers
Workshop is planned for Japan next spring but the details are still
being worked out.

>   Who is the UniCode consortium? [Sorry again]
>

"The Unicode Consortium is a non-profit organization founded in January
of 1991 to promote the use of Unicode as a character standard.  Its
first accomplishment was to formally set the specifications for Unicode
Version 1.0.
The Unicode Consortium now provides technical information and news about
Unicode, and works to maintain the Unicode standard, expanding and
refining it as necessary."

For more information contact the Consortium at:

    Unicode Consortium, Inc.
    1965 Charleston Road
    Mountain View, CA 94043 USA
    Phone: (415) 961-4189
    FAX: (415) 966-1637
    Internet: unicode-inc@HQ.M4.Metaphor.com

> - How can one listen in on the 10646 discussion forum mentioned by one
>   of the experts in a recent posting? I assume it's going to discuss
>   10646 version 2 once version 1 is published, so it might still be
>   interesting after most of the work has been done.

You subscribe by sending a SUB ISO10646 command to
LISTSERV@JHUVM.BITNET.  It is a rather dull list, though.  :-(  Besides,
there really isn't going to be any work done on 10646 for nearly a year,
so the traffic should be pretty slight for the next several months (say
I with fingers crossed).

> - Where do you order the 10646 standard? [Sorry, again]

You use whatever resource you usually use to order ISO standards.  You
may try your national standards body.  The IS will not be published
until early next year, however.

>
> - Does anyone archive comp.std.internat? Where?
>

Does anyone want to?  :-)

> Finally, a question that's not necessarily relevant for my article:
>
> - What's the legal issue on preparing lists from the 10646 definition
>   and providing these on anonymous FTP? Possible candidates are
>   name lists (probably no problem) or a reference font produced by
>   scanning in the 10646 standard (almost certainly illegal).

Unicode is looking into creating an FTP site with mapping tables, sample
code, names tables, and other delights.  There is no word on if or when
this will be available.  Meanwhile, you can always get the names lists
and mapping tables on disk by writing the Consortium.  You can also get
copies of the proceedings of the implementers workshops, I believe.
Unicode also plans to make available to its members a series of
Macintosh fonts covering everything although, again, I can't guarantee
this will ever materialize.

AFII is hoping to make a 10646 font available.  They are putting an
incredible amount of work into printing the IS, and if they can, in
fact, distribute a font, it should be really top-notch.

----
John H. Jenkins
John_Jenkins@taligent.com
#include <std_disclaimer.h>