- Xref: sparky comp.std.internat:623 comp.std.c:2475
- Path: sparky!uunet!know!hri.com!snorkelwacker.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat,comp.std.c
- Subject: Re: Handling ISO 10646 UCS in C
- Message-ID: <26631@life.ai.mit.edu>
- Date: 15 Aug 92 18:50:26 GMT
- References: <Bt1792.48L@immd4.informatik.uni-erlangen.de>
- Sender: news@ai.mit.edu
- Followup-To: comp.std.internat
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 88
-
-
- From: mskuhn@immd4.informatik.uni-erlangen.de (Markus Kuhn)
- Date: 15 Aug 92 15:34:13 GMT
-
- Are there any proposals about future extensions to Standard C that will
- allow the programmer to use ISO 10646/Unicode strings as easily as
- char []?
-
- No. Feel free to make a proposal to your local standards organization.
-
- Will wchar_t be fixed to UCS-2 or will the standard committees introduce two
- new char types e.g.
-
- ucs2_t for the 16-bit ISO 10646 BMP (also known as Unicode)
- ucs4_t for the full 32-bit ISO 10646 code
-
- together with new string constant notations for both of them and a lot
- of new library functions for arrays of these new types? (I'd prefer
- introducing two new types a lot!)
-
- At least one vendor (Microsoft) is using wchar_t to refer to the 16-bit
- UCS-2 form (they have all but said that they will not support UCS-4). Using
- wchar_t does make it easier to integrate with the existing wchar_t string
- functions that are already part of standard C implementations.
-
- A number of others, including myself, have been using new type definitions
- similar to the ones you propose above.
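
As a rough illustration of the kind of type definitions mentioned above, here is a minimal sketch. The names ucs2_t and ucs4_t come from the question; the helpers ucs2_strlen and ucs2_to_ucs4 are hypothetical, and the fixed-width types assume a compiler that provides <stdint.h>, which is an assumption and not something the C standard offered at the time of this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names taken from the question; not part of any standard. */
typedef uint16_t ucs2_t;   /* one 16-bit code element of the ISO 10646 BMP */
typedef uint32_t ucs4_t;   /* one 32-bit ISO 10646 code element */

/* Count the code elements before the terminating 0x0000, like strlen(). */
size_t ucs2_strlen(const ucs2_t *s)
{
    const ucs2_t *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Widen a zero-terminated UCS-2 string to UCS-4 by zero extension.
   dst must have room for ucs2_strlen(src) + 1 elements. */
ucs4_t *ucs2_to_ucs4(ucs4_t *dst, const ucs2_t *src)
{
    ucs4_t *d = dst;
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}
```

Keeping the two widths as separate types lets the compiler catch accidental mixing of 16-bit and 32-bit strings, which a single wchar_t of implementation-defined width cannot do.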
-
- The Unicode Implementation Subcommittee is working to define a basic
- set of functions for processing Unicode data (10646 UCS-2). I personally
- expect to see many different proposals for new string and text
- processing extensions for 10646 data over the next few years. If you
- have concrete proposals to make, I'm sure the Unicode subcommittee would
- like to hear about them.
-
- To my knowledge, there are *no* commercially available systems which
- support the entirety of Unicode in its full glory, let alone any with
- a standard API for dealing with this data.
-
- A useful analogy might be that of the rise of TCP/IP networking. The
- standard was first available around 1981 or thereabouts; however, it
- took until the 4.2BSD UNIX release from Berkeley in 1983 before it
- started catching on; and that was because Sun Microsystems adopted it
- and began selling systems everywhere. Even today we don't have a
- single API for doing networking, especially when you look across
- different OSs. I expect that the tools needed for processing Unicode
- and 10646 data will take some time in their development; and more
- time still until a standard might emerge. It would be a serious mistake
- to try to create a standard at this time ex nihilo.
-
- Will C programs be allowed to be UCS-2 files in addition to ASCII files
- which would make entering Unicode characters in string constants and
- comments as easy as ASCII characters today?
-
- Compiler developers should seriously consider using 10646 UCS-2 for source
- file representations. Beyond simply string constants and comments, one
- should also allow symbols, keywords, and syntactic tokens to use UCS-2.
- However, changes will be necessary for preprocessors and lexical scanners
- due to (1) the much larger tables potentially needed for a 65,536-element
- character set (UCS-2); and (2) the ability in 10646 to encode a text
- element (e.g., A WITH ACUTE) in multiple ways (e.g., A WITH ACUTE or
- A + NON-SPACING ACUTE). [If a keyword or symbol permits this text
- element to be part of it, then the lexical scanner must provide
- equivalence classes made of strings of character code elements in order
- to recognize these multiple encodings of the same information as being
- equivalent. This is a bit like having lower case and upper case not
- be distinguished in symbols; in this case, the different ways to spell
- out the same symbol should not constitute differences for lexical
- recognition.]
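
The equivalence-class idea in point (2) can be sketched as a comparison routine that a lexical scanner might call on two spellings of a symbol. This is only an illustration under stated assumptions: the names compose_table, next_element and ucs2_ident_equal are hypothetical, the table holds just four pairs, and only a single non-spacing mark after a base letter is handled. A real scanner would need a full composition table and rules for sequences of several marks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit code element type */

/* Tiny illustrative table: base letter + non-spacing mark -> precomposed form. */
struct compose_entry { ucs2_t base, mark, composed; };

static const struct compose_entry compose_table[] = {
    { 0x0041, 0x0301, 0x00C1 },   /* A + NON-SPACING ACUTE -> A WITH ACUTE */
    { 0x0061, 0x0301, 0x00E1 },   /* a + NON-SPACING ACUTE -> a WITH ACUTE */
    { 0x0045, 0x0301, 0x00C9 },   /* E + NON-SPACING ACUTE -> E WITH ACUTE */
    { 0x0065, 0x0301, 0x00E9 },   /* e + NON-SPACING ACUTE -> e WITH ACUTE */
};

/* Return the precomposed code for (base, mark), or 0 if the table has none. */
static ucs2_t compose_pair(ucs2_t base, ucs2_t mark)
{
    size_t i;
    for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++)
        if (compose_table[i].base == base && compose_table[i].mark == mark)
            return compose_table[i].composed;
    return 0;
}

/* Read one text element from *p, folding a known base+mark pair into its
   precomposed code. Returns 0 at the terminator without advancing. */
static ucs2_t next_element(const ucs2_t **p)
{
    ucs2_t c = **p;
    ucs2_t composed;
    if (c == 0)
        return 0;
    (*p)++;
    composed = compose_pair(c, **p);
    if (composed != 0) {
        (*p)++;
        return composed;
    }
    return c;
}

/* Return 1 if a and b spell the same symbol under the table above, else 0. */
int ucs2_ident_equal(const ucs2_t *a, const ucs2_t *b)
{
    for (;;) {
        ucs2_t x = next_element(&a);
        ucs2_t y = next_element(&b);
        if (x != y)
            return 0;
        if (x == 0)
            return 1;
    }
}
```

Comparing element by element avoids allocating normalized copies of each symbol; a scanner that interns symbols might instead normalize once on input and then compare code elements directly.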
-
- What will be the new equivalent of \0? \x0000 or \ffff or another special
- character?
-
- Since both 10646 UCS-2 and UCS-4 incorporate ASCII encodings identically,
- the NUL character is encoded at 0x0000 and 0x00000000 for UCS-2 and UCS-4,
- respectively.
-
- The character encoding 0xFFFF (in UCS-2) is reserved and explicitly not
- assigned a character value. Therefore, it can be used as a special
- marker for software; e.g., it can be coerced to a signed 16-bit integer
- to yield a value of -1 (at least on two's complement machines). Thus, the
- standard C library usage of -1 to signify EOF can easily be maintained.
- [Note that the encoding 0x00FF *is* a character; namely, the last
- character of ISO 8859-1, which was directly re-encoded in Unicode,
- i.e., LATIN SMALL LETTER Y DIAERESIS.]
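
The use of 0xFFFF as an end-of-input marker can be sketched as follows. The names UCS2_EOF, struct ucs2_reader, ucs2_getc and ucs2_as_signed are hypothetical and chosen only for this illustration. The conversion to a signed 16-bit value is implementation-defined in C, so the -1 result is an assumption that holds on the usual two's complement compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit code element type */

/* 0xFFFF is not assigned a character, so it is free to act as a sentinel. */
#define UCS2_EOF ((ucs2_t)0xFFFF)

/* Hypothetical in-memory reader over a buffer of UCS-2 code elements. */
struct ucs2_reader {
    const ucs2_t *buf;
    size_t len;
    size_t pos;
};

/* Return the next code element, or UCS2_EOF once the buffer is exhausted. */
ucs2_t ucs2_getc(struct ucs2_reader *r)
{
    if (r->pos >= r->len)
        return UCS2_EOF;
    return r->buf[r->pos++];
}

/* Reinterpret a code element as a signed 16-bit value. Implementation-defined
   for values above 0x7FFF; on two's complement compilers 0xFFFF becomes -1. */
int ucs2_as_signed(ucs2_t c)
{
    return (int16_t)c;
}
```

Because 0x00FF is an ordinary character here, a reader built this way does not confuse it with the end-of-input marker, unlike code that stores getc() results in an 8-bit char.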
-
- Glenn Adams
-