NetNews Usenet Archive 1992 #18

Internet Message Format | 1992-08-15 | 4.8 KB

Xref: sparky comp.std.internat:623 comp.std.c:2475
Path: sparky!uunet!know!hri.com!snorkelwacker.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat,comp.std.c
Subject: Re: Handling ISO 10646 UCS in C
Message-ID: <26631@life.ai.mit.edu>
Date: 15 Aug 92 18:50:26 GMT
References: <Bt1792.48L@immd4.informatik.uni-erlangen.de>
Sender: news@ai.mit.edu
Followup-To: comp.std.internat
Organization: MIT Artificial Intelligence Laboratory
Lines: 88

  From: mskuhn@immd4.informatik.uni-erlangen.de (Markus Kuhn)
  Date: 15 Aug 92 15:34:13 GMT

  Are there any proposals about future extensions to Standard C that will allow the programmer to use ISO 10646/Unicode strings as easily as char []?

No. Feel free to make a proposal to your local standards organization.

  Will wchar_t be fixed to UCS-2 or will the standard committees introduce two new char types, e.g.

    ucs2_t  for the 16-bit ISO 10646 BMP (also known as Unicode)
    ucs4_t  for the full 32-bit ISO 10646 code

  together with new string constant notations for both of them and a lot of new library functions for arrays of these new types? (I'd prefer introducing two new types a lot!)

At least one vendor (Microsoft) is using wchar_t to refer to the 16-bit UCS-2 form (they have as much as said they will not support UCS-4). Using wchar_t does make certain things easier in terms of integrating the existing wchar string functions already part of standard C implementations. A number of others, including myself, have been using new type definitions similar to the ones you propose above.

Efforts are underway in the Unicode Implementation Subcommittee to define a basic set of functions for processing Unicode data (10646 UCS-2). I personally expect to see a lot of different proposals made over the next few years for new extensions for string and text processing of 10646 data. If you have concrete proposals to make, I'm sure the Unicode subcommittee would like to hear about them.
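
As an illustration of the "new type definitions" approach mentioned above, here is a minimal sketch in C. The names ucs2_t and ucs4_t come from the quoted question; the helper names ucs2_strlen and ucs2_to_ucs4 are hypothetical, and the fixed-width types from <stdint.h> are an assumption that postdates this article. This is one possible shape for such definitions, not a proposed standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names, following the suggestion in the quoted question.
   uint16_t / uint32_t are assumed to be available (C99 <stdint.h>). */
typedef uint16_t ucs2_t;   /* one 16-bit UCS-2 (BMP) code element */
typedef uint32_t ucs4_t;   /* one 32-bit UCS-4 code element */

/* Analogue of strlen(): count code elements before the 0x0000 terminator. */
static size_t ucs2_strlen(const ucs2_t *s)
{
    const ucs2_t *p = s;

    assert(s != NULL);
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Widen a UCS-2 string to UCS-4, terminator included.
   dst must have room for ucs2_strlen(src) + 1 elements. */
static void ucs2_to_ucs4(ucs4_t *dst, const ucs2_t *src)
{
    assert(dst != NULL && src != NULL);
    while ((*dst++ = *src++) != 0)
        ;
}
```

For example, given the array { 0x004D, 0x00E1, 0x0073, 0x0000 }, ucs2_strlen returns 3, and ucs2_to_ucs4 copies the same four values into 32-bit elements.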

To my knowledge, there are *no* commercially available systems which support the entirety of Unicode in its full glory, let alone a standard API for dealing with this data.

A useful analogy might be the rise of TCP/IP networking. The standard was first available around 1981 or thereabouts; however, it took until the 4.2BSD UNIX release from Berkeley in 1983 before it started catching on, and that was because Sun Microsystems adopted it and began selling systems everywhere. Even today we don't have a single API for doing networking, especially when you look across different OSs. I expect that the tools needed for processing Unicode and 10646 data will take some time to develop, and more time still until a standard might emerge. It would be a serious mistake to try to create a standard at this time ex nihilo.

  Will C programs be allowed to be UCS-2 files in addition to ASCII files, which would make entering Unicode characters in string constants and comments as easy as ASCII characters today?

Compiler developers should seriously consider using 10646 UCS-2 for source file representations. Beyond simply string constants and comments, one should also allow symbols, keywords, and syntactic tokens to use UCS-2. However, changes will be necessary for preprocessors and lexical scanners due to (1) the much larger tables potentially needed for a 65,536 element character set (UCS-2); and (2) the ability in 10646 to encode a text element (e.g., A WITH ACUTE) in multiple ways (e.g., A WITH ACUTE or A + NON-SPACING ACUTE).

[If a keyword or symbol permits this text element to be part of it, then the lexical scanner must provide equivalence classes made of strings of character code elements in order to recognize these multiple encodings of the same information as being equivalent.
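
The equivalence-class idea just described could be sketched in C roughly as follows. Everything here is illustrative: the type name ucs2_t, the three-entry composition table, and the functions compose, next_element and ident_equal are hypothetical, and a real scanner would need a complete table and would have to handle sequences of several non-spacing marks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit code element type */

/* A deliberately tiny table: base letter + non-spacing mark, and the
   precomposed code element that encodes the same text element. */
struct composition { ucs2_t base, mark, composed; };

static const struct composition compositions[] = {
    { 0x0041, 0x0301, 0x00C1 },  /* A + non-spacing acute -> A WITH ACUTE */
    { 0x0061, 0x0301, 0x00E1 },  /* a + non-spacing acute -> a WITH ACUTE */
    { 0x0065, 0x0301, 0x00E9 },  /* e + non-spacing acute -> e WITH ACUTE */
};

/* Return the precomposed form of (base, mark), or 0 if the table has none. */
static ucs2_t compose(ucs2_t base, ucs2_t mark)
{
    size_t i;

    for (i = 0; i < sizeof compositions / sizeof compositions[0]; i++)
        if (compositions[i].base == base && compositions[i].mark == mark)
            return compositions[i].composed;
    return 0;
}

/* Read one text element at *s, advance *s past it, and return a single
   canonical code element standing for its equivalence class. */
static ucs2_t next_element(const ucs2_t **s)
{
    ucs2_t c = *(*s)++;

    if (c != 0 && **s != 0) {
        ucs2_t composed = compose(c, **s);
        if (composed != 0) {
            (*s)++;              /* consume the non-spacing mark too */
            return composed;
        }
    }
    return c;
}

/* Nonzero if two 0x0000-terminated identifiers spell the same text elements,
   regardless of which of the two encodings each one uses. */
static int ident_equal(const ucs2_t *a, const ucs2_t *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        ucs2_t ca = next_element(&a);
        ucs2_t cb = next_element(&b);

        if (ca != cb)
            return 0;
        if (ca == 0)
            return 1;
    }
}
```

With this sketch, an identifier spelled c, a, f, 0x00E9 compares equal to one spelled c, a, f, e, 0x0301, while neither compares equal to plain c, a, f, e.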
This is a bit like having lower case and upper case not be distinguished in symbols; in this case, the different ways to spell out the same symbol should not constitute differences for lexical recognition.]

  What will be the new equivalent of \0? \x0000 or \ffff or another special character?

Since both 10646 UCS-2 and UCS-4 incorporate the ASCII encodings identically, the NULL character is encoded at 0x0000 and 0x00000000 for UCS-2 and UCS-4, respectively. The character encoding 0xFFFF (in UCS-2) is reserved and explicitly not assigned a character value. Therefore, it can be used as a special marker for software; e.g., it can be coerced to a signed 16-bit integer to yield a value of -1 (at least on two's complement machines). Thus, the standard C library usage of -1 to signify EOF can easily be maintained. [Note that the encoding 0x00FF *is* a character; namely, the last character of ISO 8859-1, which was directly re-encoded in Unicode, i.e., LATIN SMALL LETTER Y DIAERESIS.]

Glenn Adams
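
The point about 0xFFFF and EOF can be sketched in C as follows. The names ucs2_t, UCS2_EOF, struct ucs2_reader, ucs2_getc and ucs2_as_signed are all hypothetical, and the conversion to int16_t is implementation-defined in C; it yields -1 on two's complement machines, as the article notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit code element type */

#define UCS2_EOF (-1)      /* hypothetical name, mirroring EOF in <stdio.h> */

/* A minimal in-memory "stream" of UCS-2 code elements. */
struct ucs2_reader {
    const ucs2_t *data;
    size_t len;
    size_t pos;
};

/* Analogue of getc(): return the next code element as an int, or UCS2_EOF
   when the input is exhausted. Since 0xFFFF is never an assigned character,
   returning -1 cannot be confused with real data. */
static int ucs2_getc(struct ucs2_reader *r)
{
    assert(r != NULL);
    if (r->pos >= r->len)
        return UCS2_EOF;
    return (int)r->data[r->pos++];
}

/* Reinterpret a code element as a signed 16-bit value. The result for
   values above 0x7FFF is implementation-defined; on two's complement
   machines 0xFFFF becomes -1. */
static int16_t ucs2_as_signed(ucs2_t c)
{
    return (int16_t)c;
}
```

Note that 0x00FF passes through both functions as the ordinary value 255, so the last character of ISO 8859-1 is not mistaken for the end marker.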