Usenet 1994 October

Text File | 1992-08-17 | 2.7 KB | 56 lines

Submitted-by: thorinn@diku.dk (Lars Henrik Mathiesen)

gwyn@smoke.brl.mil (Doug Gwyn) writes:
>The promoters of Unicode were informed of the (numerous) problems in
>fitting Unicode into the existing practice, reflected in Standard C
>multibyte character support, [...]

Why is it so bad that Unicode is not like earlier ``extended
encodings''? Existing code is a large problem, of course, but what is
the problem relative to Standard C itself? The standard provides both
wide and multibyte characters, and Unicode happens to fit the former
definition instead of the latter.

1) Of course it would be nice to be able to use the good old library
calls with the fancy new character set. So, if your text encoding fits
into EUC or whatever, it will work as Standard C multibyte characters,
but what does that mean? As far as I can see, you cannot use strcat,
index, or printf("%s") on a multibyte string anyway, you need extra
magic to make sure that shift states match properly.

2) For the programmer, it seems easier to just use wchar_t strings;
the usual string-handling idioms ("while (*cp++)") carry over with
just a change of declarations, and it is possible to implement direct
equivalents of every string-oriented function in the library. (When
multibyte characters are used, some of the idioms will work, but
everything that tries to interpret the characters, even just to find a
'/', will have to use mbtowc/mbstowcs or a non-standard routine. If
you have to convert to wchar_t anyway, what's the point?)

3) There is a problem in that the standard defines wide character
constants and wide string literals in terms of multibyte characters.
However, Unicode does define a ``data interchange'' format (UTF), for
transmission through channels that like to mung control characters.
This format happens to be usable as a Standard C multibyte encoding as
well --- it maps the source character set to itself, uses NUL only to
code NUL, and has no shift state as such; after editing C source in a
Unicode editor, you can save it ``as UTF'' and the result will be
nicely acceptable to an 8-bit C compiler with the right default
locale. And when you upgrade to a Unicode C compiler, you can convert
your source again.

If the problem is that POSIX wants to insist on a subset of Standard
C, or C bindings for some system functionality, that cannot work with
Unicode, then I have very little sympathy: The direction that Unicode
was taking must have been clear well before anything within POSIX was
finalized, and disregarding it may well turn out to have been, let's
say, infelicitous.

Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@diku.dk> (Humour NOT marked)

Volume-Number: Volume 28, Number 103
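
The contrast drawn in point 2 of the post can be sketched in C roughly
as follows. This is only an illustration, not code from the post: the
function names find_wide and find_multibyte are hypothetical, and the
multibyte version assumes the string is valid in the current locale's
encoding and starts in the initial shift state.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t, NULL */
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */

/* Wide string: the usual pointer-walking idiom carries over,
 * with only the declarations changed from char to wchar_t. */
const wchar_t *find_wide(const wchar_t *ws, wchar_t target)
{
    assert(ws != NULL);
    for (; *ws != L'\0'; ws++)
        if (*ws == target)
            return ws;
    return NULL;
}

/* Multibyte string: every character has to be decoded with mbtowc(),
 * since in some encodings a byte equal to '/' may be part of a
 * longer character. (Hypothetical helper, for illustration only.) */
const char *find_multibyte(const char *s, wchar_t target)
{
    size_t left;

    assert(s != NULL);
    left = strlen(s);
    (void)mbtowc(NULL, NULL, 0);      /* reset to the initial shift state */

    while (left > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, left); /* bytes consumed by this character */

        if (n <= 0)
            return NULL;              /* invalid sequence: give up */
        if (wc == target)
            return s;                 /* start of the matching character */
        s += n;
        left -= (size_t)n;
    }
    return NULL;
}
```

In the default "C" locale, find_multibyte("usr/lib", L'/') and
find_wide(L"usr/lib", L'/') both return a pointer to the fourth
character; the difference only shows up once a locale with a real
multibyte encoding has been selected with setlocale().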