- Newsgroups: comp.std.internat
- Path: sparky!uunet!cs.utexas.edu!sun-barr!ames!data.nas.nasa.gov!taligent!tseng
- From: jenkinsj@blowfish.taligent.com (John H. Jenkins)
- Subject: Re: ISO 10646 questions (longish)
- Message-ID: <jenkinsj-250892105325@tseng.taligent.com>
- Followup-To: comp.std.internat
- Sender: usenet@taligent.com (More Bytes Than You Can Read)
- Organization: Taligent, Inc.
- References: <1992Aug25.105342.189801@rrz.uni-koeln.de>
- Date: Tue, 25 Aug 1992 19:50:08 GMT
- Lines: 374
-
- In article <1992Aug25.105342.189801@rrz.uni-koeln.de>,
- a0047@aix370.rrz.Uni-Koeln.DE (Andreas Strotmann) wrote:
- >
- > - Is there some sort of official press release by ISO describing key
- > features of the new standard? Where published?
- >
-
- Not that I'm aware of. Such information is available from Unicode,
- however.
-
- > - There were a couple of topics hotly debated between UniCode and 10646
- > champions on this group. I would like to know how these have been
- > resolved in the unification of these two. Specifically:
- >
- > * Pre-/Postfix "diacritics" (aka "combining characters", "non-spacing
- > characters"): Which are they? Is there a limit defined on the number
- > of these in a row allowed?
- >
-
- 10646 now defines three implementation levels. In Level 1, combining
- characters may not be used. In Level 3 (the old Level 2), they may be
- used without restriction (meaning, among other things, there is no limit
- on how many you can use one after the other). A new Level 2 was defined
- which allows use of combining marks for anything except Latin, Greek,
- and Cyrillic. This reflects the fact that few people objected to their
- use with languages like Hebrew where they are practically de rigueur.
- The main objections were to the use of combining marks with Latin,
- Greek, and Cyrillic.
-
- The combining marks are all postfix, BTW.
-
- Unicode represents a Level 3 implementation of 10646.
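
The three levels can be illustrated with a rough sketch. It assumes a simplified reading of the description above: a combining mark is any character whose general category starts with "M", and Level 2 forbids such a mark only when the preceding base character's name starts with LATIN, GREEK, or CYRILLIC. The standard itself may well draw the Level 2 line differently, so treat `lowest_level` and `RESTRICTED_SCRIPTS` as hypothetical names for illustration only.

```python
import unicodedata

# Scripts the post names as excluded from combining marks at Level 2.
RESTRICTED_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def is_combining(ch: str) -> bool:
    """True if ch has a general category of Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def lowest_level(text: str) -> int:
    """Return the lowest level (1, 2, or 3) that admits `text`
    under the simplified reading described in the lead-in."""
    level = 1
    base = None  # most recent non-combining character
    for ch in text:
        if not is_combining(ch):
            base = ch
            continue
        # Any combining mark at all rules out Level 1.
        level = 2
        base_name = unicodedata.name(base, "") if base else ""
        if base_name.startswith(RESTRICTED_SCRIPTS):
            return 3  # mark applied to a Latin/Greek/Cyrillic base
    return level
```

Under these assumptions, precomposed "caf\u00e9" is Level 1, Hebrew bet plus dagesh ("\u05d1\u05bc") is Level 2, and "cafe\u0301" (e followed by a combining acute) needs Level 3.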
-
- > * Does 10646 still reserve about a quarter of all possible codes
- > for control characters, a la ISO 8859?
- >
-
- No.
-
- > - Another hot debate was on Han unification ("evil" comes to mind ;-).
- > Recent postings cite two numbers: JIS using effectively 14 bits to
- > code their script, and 36000 code points being reserved for "unified"
- > Han. Now (3x15000)=45000, so unification cannot have been very
- > extensive, or was it?
- > [15000 because not all code points in JIS are used, 3 for Korean,
- > Japanese, and Chinese]
-
- I don't know where these numbers come from. JIS uses a total of about
- 12,000 characters, but there were never more than 22,000 cells reserved
- for Unified Han. The following is taken from pp. 19-20 of "The Unicode
- Standard," vol. 2:
-
- #
- # Compatibility with Existing Standards
- #
- # The compatibility of the Unicode Han character set with the
- # repertoire of existing standards is assured by the source
- # separation rule described above. The Unicode standard
- # contains additional Han characters that are not included
- # in the unified repertoire, but that do occur in widely-used
- # corporate character sets. This practice is recognized by
- # CJK-JRG. The following table lists all the standards that
- # comprise the Unicode Han character set, and the number of
- # characters included from each.
- #
- # Standard                                          Number of Characters
- # ANSI Z39.64-1989 (EACC)                                         13,053
- # Big Five                                                        13,481
- # CCCII, level 1                                                   4,808
- # CNS 11643-1986                                                  13,051
- # CNS 11643-1986 User Characters                                   3,418
- # GB 2312-80 (GB0)                                                 6,763
- # GB 12345-90 (GB1)                                                2,176
- # GB 7589-87 (GB3)                                                 7,327
- # GB 7590-87 (GB5)                                                 7,039
- # General Use Characters for Modern Chinese (GB7)                      41
- # GB 8565-89 (GB8)                                                   287
- # GB 12052-89 (Korean)                                                94
- # IBM Selected Japanese                                              360
- # IBM Selected Korean                                                  6
- # JEF (Fujitsu)                                                    3,149
- # JIS X 0208-1990                                                  6,355
- # JIS X 0212-1990                                                  5,801
- # KS C 5601-1989                                                   4,888
- # KS C 5657-1991                                                   2,856
- # PRC Telegraph Code                                               ~8,000
- # Taiwan Telegraph Code                                             9,040
- # Xerox Chinese                                                     9,776
- #
- # Total characters covered                                       ~121,769
- # Total unique characters                                          21,001
- #
-
- If we had unified the various standards by language but left Japanese,
- Korean, and Chinese separate the total unique characters would still
- have been on the order of 40,000 to 50,000.
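
The totals in the table quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the quoted figures: the PRC Telegraph Code entry is approximate in the source (~8,000), so the sum is approximate too, and the dictionary name `source_counts` is just an illustrative choice.

```python
# Character counts per source standard, as quoted from the table above.
source_counts = {
    "ANSI Z39.64-1989 (EACC)": 13_053,
    "Big Five": 13_481,
    "CCCII, level 1": 4_808,
    "CNS 11643-1986": 13_051,
    "CNS 11643-1986 User Characters": 3_418,
    "GB 2312-80 (GB0)": 6_763,
    "GB 12345-90 (GB1)": 2_176,
    "GB 7589-87 (GB3)": 7_327,
    "GB 7590-87 (GB5)": 7_039,
    "General Use Characters for Modern Chinese (GB7)": 41,
    "GB 8565-89 (GB8)": 287,
    "GB 12052-89 (Korean)": 94,
    "IBM Selected Japanese": 360,
    "IBM Selected Korean": 6,
    "JEF (Fujitsu)": 3_149,
    "JIS X 0208-1990": 6_355,
    "JIS X 0212-1990": 5_801,
    "KS C 5601-1989": 4_888,
    "KS C 5657-1991": 2_856,
    "PRC Telegraph Code": 8_000,  # approximate in the source
    "Taiwan Telegraph Code": 9_040,
    "Xerox Chinese": 9_776,
}

total_covered = sum(source_counts.values())
unique_characters = 21_001  # "Total unique characters" from the table

# Average number of source standards in which each unified character appears.
overlap = total_covered / unique_characters

print(f"covered: {total_covered:,}")
print(f"unique:  {unique_characters:,}")
print(f"overlap: {overlap:.1f}x")
```

The sum reproduces the table's ~121,769, so on average each unified character appears in nearly six of the source standards.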
-
- > In the same thread it was suggested that European scripts should be
- > unified, too [Capital A in latin, greek, cyrillic, e.g.]. Were they?
- >
-
- As I understand it, this was never a serious suggestion, but a straw man
- raised by opponents of Han unification. They were pointing out that
- they felt the relationship between Japanese kanji and Chinese characters
- is analogous to that between the Latin and Greek alphabets -- since
- nobody would ever unify the latter, why unify the former?
-
- > - I'd like to include a couple of statistics in my article:
- >
- > * How many graphic characters are effectively defined in ISO 10646 UCS2?
- > * How many code points are reserved, but not yet assigned in UCS2
- > (i.e. how much can still fit in for the next release)?
-
- There are roughly 21,000 Han characters, 7000 precomposed Hangul
- syllables and 7000 everything-else. Additionally, about 6000 cells are
- reserved for the user zone. (This is all rounded to the nearest 1000.)
- The total is therefore about 41,000 of 65,000 cells currently allocated.
- That means there are about 24,000 cells left.
-
- That sounds like a lot, but it really isn't. If you add up all the
- things that people would like to add to the BMP, you'll find you have
- well over 24,000 pigeons to squeeze into 24,000 pigeonholes.
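
For anyone who wants to replay the budget, here is the same arithmetic as a short sketch. All figures are the rounded ones given above; the only number not taken from the post is the exact size of a 16-bit code space, 65,536, which is included for comparison.

```python
# Rounded allocations from the post, in cells (code points).
allocated = {
    "Han": 21_000,
    "precomposed Hangul syllables": 7_000,
    "everything else": 7_000,
    "user zone (private use)": 6_000,
}

BMP_ROUNDED = 65_000  # the post's rounded figure
BMP_EXACT = 2 ** 16   # 65,536 cells in a 16-bit code space

used = sum(allocated.values())
free_rounded = BMP_ROUNDED - used
free_exact = BMP_EXACT - used

print(f"used: {used:,}")
print(f"free (rounded): {free_rounded:,}")
print(f"free (exact 16-bit space): {free_exact:,}")
```

Either way the remaining space is roughly 24,000 cells, which is the number of pigeonholes mentioned above.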
-
- > * How many/which modern languages/scripts are covered/not (yet) covered?
- > Why not if not?
- >
-
- The list of blocks in 10646 follows. Anything not on this list is not
- covered. (This list reflects the second DIS of 10646. The IS will have
- slightly different block allocations. BTW, remember that some of these
- scripts cover a heck of a lot of languages; I'm not sure anyone has
- prepared a "languages-covered" list.)
-
- IRV-646                                  0020 - 007E
- LATIN-1 SUPPLEMENT                       00A0 - 00FF
- EXTENDED LATIN-A                         0100 - 017F
- EXTENDED LATIN-B                         0180 - 024F
- IPA EXTENSIONS                           0250 - 02AF
- SPACING MODIFIER LETTERS                 02B0 - 02FF
- COMBINING DIACRITICAL MARKS              0300 - 036F
- GREEK                                    0370 - 03FF
- CYRILLIC                                 0400 - 04FF
- ARMENIAN                                 0530 - 058F
- HEBREW                                   0590 - 05FF
- ARABIC                                   0600 - 06FF
- DEVANAGARI                               0900 - 097F
- BENGALI                                  0980 - 09FF
- GURMUKHI                                 0A00 - 0A7F
- GUJARATI                                 0A80 - 0AFF
- ORIYA                                    0B00 - 0B7F
- TAMIL                                    0B80 - 0BFF
- TELUGU                                   0C00 - 0C7F
- KANNADA                                  0C80 - 0CFF
- MALAYALAM                                0D00 - 0D7F
- THAI                                     0E00 - 0E7F
- LAO                                      0E80 - 0EFF
- *TIBETAN                                 1000 - 105F
- GEORGIAN                                 10A0 - 10FF
- ADDITIONAL EXTENDED LATIN                1E00 - 1EFF
- GREEK EXTENSIONS                         1F00 - 1FFF
- GENERAL PUNCTUATION                      2000 - 206F
- SUPERSCRIPTS AND SUBSCRIPTS              2070 - 209F
- CURRENCY SYMBOLS                         20A0 - 20CF
- COMBINING DIACRITICAL MARKS FOR SYMBOLS  20D0 - 20FF
- LETTERLIKE SYMBOLS                       2100 - 214F
- NUMBER FORMS                             2150 - 218F
- ARROWS                                   2190 - 21FF
- MATHEMATICAL OPERATORS                   2200 - 22FF
- MISCELLANEOUS TECHNICAL                  2300 - 23FF
- CONTROL PICTURES                         2400 - 243F
- OPTICAL CHARACTER RECOGNITION            2440 - 245F
- ENCLOSED ALPHANUMERICS                   2460 - 24FF
- BOX DRAWINGS                             2500 - 257F
- BLOCK ELEMENTS                           2580 - 259F
- GEOMETRIC SHAPES                         25A0 - 25FF
- MISCELLANEOUS DINGBATS                   2600 - 26FF
- DINGBATS                                 2700 - 27BF
- CJK SYMBOLS AND PUNCTUATION              3000 - 303F
- HIRAGANA                                 3040 - 309F
- KATAKANA                                 30A0 - 30FF
- BOPOMOFO                                 3100 - 312F
- HANGUL JAMO                              3130 - 318F
- CJK MISCELLANEOUS                        3190 - 319F
- COMBINING HANGUL JAMO                    31A0 - 31FF
- ENCLOSED CJK LETTERS AND IDEOGRAPHS      3200 - 32FF
- CJK COMPATIBILITY WORDS                  3300 - 337F
- CJK SQUARED ABBREVIATIONS                3380 - 33FF
- HANGUL                                   3400 - 4DFF
- CJK UNIFIED IDEOGRAPHS                   4E00 - 9FFF
- PRIVATE USE AREA                         E000 - F7FF
- CJK COMPATIBILITY IDEOGRAPHS             F900 - FAFF
- ALPHABETIC PRESENTATION FORMS            FB00 - FBFF
- ARABIC PRESENTATION FORMS-A              FC00 - FDFF
- CJK COMPATIBILITY FORMS                  FE30 - FE4F
- SMALL FORM VARIANTS                      FE50 - FE6F
- ARABIC PRESENTATION FORMS-B              FE70 - FEFF
- HALFWIDTH AND FULLWIDTH FORMS            FF00 - FFEF
- SPECIALS                                 FFF0 - FFFD
-
- *Tibetan was withdrawn from the IS.
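
A block list like the one above lends itself to a simple range lookup. The sketch below uses a handful of the ranges exactly as listed (they reflect the second DIS as given here, not necessarily the final IS), sorted by start, with a binary search over the start values. `BLOCKS` and `block_of` are illustrative names, not part of any standard API.

```python
from bisect import bisect_right
from typing import Optional

# (first, last, name) for a subset of the blocks listed above,
# sorted by first code point. Ranges are inclusive.
BLOCKS = [
    (0x0020, 0x007E, "IRV-646"),
    (0x00A0, 0x00FF, "LATIN-1 SUPPLEMENT"),
    (0x0370, 0x03FF, "GREEK"),
    (0x0400, 0x04FF, "CYRILLIC"),
    (0x0590, 0x05FF, "HEBREW"),
    (0x3040, 0x309F, "HIRAGANA"),
    (0x3400, 0x4DFF, "HANGUL"),
    (0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPHS"),
    (0xE000, 0xF7FF, "PRIVATE USE AREA"),
]
STARTS = [first for first, _last, _name in BLOCKS]


def block_of(code_point: int) -> Optional[str]:
    """Return the name of the block containing code_point, or None
    if it falls in a gap (or in a block omitted from this subset)."""
    i = bisect_right(STARTS, code_point) - 1
    if i < 0:
        return None  # below the first block
    first, last, name = BLOCKS[i]
    return name if code_point <= last else None
```

For example, `block_of(0x4E00)` gives "CJK UNIFIED IDEOGRAPHS", while `block_of(0x0080)` gives `None` because 0080 through 009F are not in any listed block.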
-
- There are three main reasons why something is _not_ covered.
-
- #1. The script is ill-documented. A number of scripts were excluded
- only because nobody ever submitted an authoritative list of characters
- they contain (e.g., various Native American languages). These will be
- added at a future date.
-
- #2. The script is in the main well-documented but there is considerable
- controversy as to how it should best be encoded (e.g., Tibetan and
- Ethiopian). These are usually well-known scripts, but are rare enough
- that Becker's Second Law kicks in ("The fewer experts there are on a
- given subject, the more they disagree") and so there's a lot of
- controversy that has to be settled. Musical notation falls into this
- category, too. These will also be added at a future date.
-
- #3. The script is well-documented but SO-O-O-O rare that few people
- other than specialists have even heard of it (e.g., the Deseret
- Alphabet) or is out-and-out bogus (Klingon, Quenya, the Seuss script
- from "On beyond zebra"). The few people who want to use these scripts
- can put them into the Private Use Area -- that's what it's for.
-
- > - Does every UCS2 coded text have to start with the "signature" mentioned
- > in earlier postings? How about UCS4?
- >
-
- No, but it would be nice. The use of signatures is covered in Annex E,
- which is only informative.
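
Since signatures are optional, a reader has to be prepared for text with or without one. The sketch below assumes the signature values that are not spelled out in this post: the character FEFF at the start of the stream, i.e. the bytes FE FF for UCS-2 and 00 00 FE FF for UCS-4, most significant byte first. `detect_signature` is a hypothetical helper, and it deliberately ignores byte-swapped data.

```python
from typing import Optional


def detect_signature(data: bytes) -> Optional[str]:
    """Guess the coded form from a leading signature, if present.

    Assumes big-endian byte order. Returns "UCS-4", "UCS-2", or
    None when no signature is found (which is allowed).
    """
    # Check the longer UCS-4 signature first so that its leading
    # zero bytes are not mistaken for ordinary UCS-2 content.
    if data[:4] == b"\x00\x00\xfe\xff":
        return "UCS-4"
    if data[:2] == b"\xfe\xff":
        return "UCS-2"
    return None
```

A caller that gets `None` back has to fall back on whatever out-of-band information it has about the encoding.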
-
- > - Are the names of characters that have been cited in numerous postings to
- > this group "defined" by ISO 10646, i.e., standardized, too? Does every
- > character have a unique name?
-
- Yes and yes.
-
- >How are (unified) Han characters named?
-
- By their code point. You can't name them by their sound because they
- have utterly different pronunciations even in different Chinese
- dialects, let alone different languages, and you can't name them by
- their meaning because they have too many. The only real alternative was
- to name them by their position in the code, which follows dictionary
- order.
-
- > Example? System? How about other scripts (arab, indian...)?
- >
-
- Character 0000 4E00 is named "CJK UNIFIED IDEOGRAPH-4E00." You also
- have things like "ARABIC LETTER HAH WITH HAMZA ABOVE" and "DEVANAGARI
- LETTER KHA."
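
The Han naming rule is mechanical enough to write down. The sketch below builds a name from the code point and compares it with what Python's `unicodedata` module reports; that module reflects the current Unicode database rather than the 1992 drafts, so the comparison is only illustrative. `han_name` is a hypothetical helper.

```python
import unicodedata


def han_name(code_point: int) -> str:
    """Build a unified Han character name from its code point,
    following the pattern quoted above (at least four hex digits)."""
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"


# The generated name matches the one in Python's character database.
print(han_name(0x4E00))
print(unicodedata.name("\u4e00") == han_name(0x4E00))

# Other scripts use descriptive names, which can be looked up directly.
kha = unicodedata.lookup("DEVANAGARI LETTER KHA")
print(f"{ord(kha):04X}")
```

Going the other way, `unicodedata.lookup` accepts such a name and returns the character, which is a convenient way to check that a name is unique.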
-
- > - Does the *current* set of UniCode definitions really cover UCS2 as voted
- > on? How can it be obtained? [Sorry, didn't save the answer to this when
- > I should have. I tried finding a reference in university on-line library
- > catalogues around the world - no luck! Have they sold any copies at all?]
- >
-
- Unicode 1.0 has been published and is slightly different from UCS-2.
- The Unicode Technical Committee formally agreed last month to issue
- version 1.1 of Unicode, which will be code-point identical to UCS-2.
-
- Unicode 1.0 is defined in "The Unicode Standard," published by
- Addison-Wesley. Your local technical bookstore should know how to
- contact them. Vol. 1 is US$32.95 and ISBN 0-201-56788-1; vol. 2 is
- US$29.95 and ISBN 0-201-60845-6. They are both selling very nicely.
-
- The Unicode Consortium will also be holding its fourth Implementers
- Workshop this December in Frankfurt, Germany. Contact the Consortium
- for more information.
-
- Unicode is also opening a European branch to promote the use of Unicode
- within Europe. The fifth Implementers Workshop is planned for Japan
- next spring but the details are still being worked out.
-
- > Who is the UniCode consortium? [Sorry again]
- >
-
- "The Unicode Consortium is a non-profit organization founded in January
- of 1991 to promote the use of Unicode as a character standard. Its
- first accomplishment was to formally set the specifications for Unicode
- Version 1.0. The Unicode Consortium now provides technical information
- and news about Unicode, and works to maintain the Unicode standard,
- expanding and refining it as necessary."
-
- For more information contact the Consortium at:
-
- Unicode Consortium, Inc.
- 1965 Charleston Road
- Mountain View, CA 94043 USA
- Phone: (415) 961-4189
- FAX: (415) 966-1637
- Internet: unicode-inc@HQ.M4.Metaphor.com.
-
- > - How can one listen in on the 10646 discussion forum mentioned by one
- > of the experts in a recent posting? I assume it's going to discuss
- > 10646 version 2 once version 1 is published, so it might still be
- > interesting after most of the work has been done.
-
- You subscribe by sending a SUB ISO10646 command to
- LISTSERV@JHUVM.BITNET. It is a rather dull list, though. :-( Besides,
- there really isn't going to be any work done on 10646 for nearly a year,
- so the traffic should be pretty slight for the next several months (say
- I with fingers crossed).
-
- > - Where do you order the 10646 standard? [Sorry, again]
-
- You use whatever resource you usually use to order ISO standards. You
- may try your national standards body. The IS will not be published
- until early next year, however.
-
- >
- > - Does anyone archive comp.std.internat? Where?
- >
-
- Does anyone want to? :-)
-
- > Finally, a question that's not necessarily relevant for my article:
- >
- > - What's the legal issue on preparing lists from the 10646 definition
- > and providing these on anonymous FTP? Possible candidates are
- > name lists (probably no problem) or a reference font produced by
- > scanning in the 10646 standard (almost certainly illegal).
-
- Unicode is looking into creating an FTP site with mapping tables, sample
- code, names tables, and other delights. There is no word on if or when
- this will be available. Meanwhile, you can always get the names lists
- and mapping tables on disk by writing the Consortium. You can also get
- copies of the proceedings of the implementers workshops, I believe.
- Unicode also plans to make available to its members a series of
- Macintosh fonts covering everything although, again, I can't guarantee
- this will ever materialize.
-
- AFII is hoping to make a 10646 font available. They are putting an
- incredible amount of work into printing the IS, and if they can, in
- fact, distribute a font, it should be really top-notch.
-
- ----
- John H. Jenkins
- John_Jenkins@taligent.com
- #include <std_disclaimer.h>
-