- Path: sparky!uunet!uunet!not-for-mail
- From: thorinn@diku.dk (Lars Henrik Mathiesen)
- Newsgroups: comp.std.unix
- Subject: Re: POSIX update
- Date: 17 Aug 1992 14:42:30 -0700
- Organization: Department of Computer Science, U of Copenhagen
- Lines: 54
- Sender: sef@ftp.UU.NET
- Approved: sef@ftp.uucp (Moderator, Sean Eric Fagan)
- Message-ID: <16p6g6INNs4i@ftp.UU.NET>
- References: <15rucjINNdov@ftp.UU.NET> <166v0lINNpj8@ftp.UU.NET> <16b9drINNero@ftp.UU.NET> <16i6laINNssj@ftp.UU.NET>
- NNTP-Posting-Host: ftp.uu.net
- Keywords: posix, unicode, iso10646, microsoft windows nt
- X-Submissions: std-unix@uunet.uu.net
-
- Submitted-by: thorinn@diku.dk (Lars Henrik Mathiesen)
-
- gwyn@smoke.brl.mil (Doug Gwyn) writes:
- >The promoters of Unicode were informed of the (numerous) problems in
- >fitting Unicode into the existing practice, reflected in Standard C
- >multibyte character support, [...]
-
- Why is it so bad that Unicode is not like earlier ``extended
- encodings''? Existing code is a large problem, of course, but what is
- the problem relative to Standard C itself? The standard provides both
- wide and multibyte characters, and Unicode happens to fit the former
- definition instead of the latter.
-
- 1) Of course it would be nice to be able to use the good old library
- calls with the fancy new character set. So, if your text encoding fits
- into EUC or whatever, it will work as Standard C multibyte characters,
- but what does that mean? As far as I can see, you cannot use strcat,
- strchr, or printf("%s") on a multibyte string anyway; you need extra
- magic to make sure that shift states match properly.
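-
- As an illustration of that extra magic, here is a minimal sketch of one
- way to concatenate two multibyte strings without joining them byte by
- byte: convert both to wide characters, then convert the result back.
- The name mb_concat is hypothetical, and the sketch assumes both inputs
- are complete strings that each begin in the initial shift state.

```c
#include <stdlib.h>   /* mbstowcs, wcstombs, malloc, free, MB_CUR_MAX */
#include <string.h>   /* strlen */

/* Hypothetical helper: concatenate multibyte strings a and b by way of
   wide characters.  Returns a malloc'd string, or NULL if an input is
   not valid in the current locale or memory runs out. */
char *mb_concat(const char *a, const char *b)
{
    /* Every multibyte character occupies at least one byte, so the byte
       lengths give an upper bound on the number of wide characters. */
    size_t cap = strlen(a) + strlen(b) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    char *out = NULL;
    size_t na, nb, outcap;

    if (w == NULL)
        return NULL;

    /* Each call parses its argument starting from the initial shift
       state, so the two inputs are decoded independently. */
    na = mbstowcs(w, a, cap);
    if (na == (size_t)-1)
        goto done;
    nb = mbstowcs(w + na, b, cap - na);   /* overwrites the first L'\0' */
    if (nb == (size_t)-1)
        goto done;

    /* Allow MB_CUR_MAX bytes per wide character, terminator included. */
    outcap = (na + nb + 1) * MB_CUR_MAX;
    out = malloc(outcap);
    if (out != NULL && wcstombs(out, w, outcap) == (size_t)-1) {
        free(out);
        out = NULL;
    }

done:
    free(w);
    return out;
}
```

- In the default "C" locale this behaves like strcat into a fresh buffer;
- the difference only shows up in a locale whose encoding has shift
- states, where wcstombs writes a single fresh multibyte string instead
- of two byte sequences glued together.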
-
- 2) For the programmer, it seems easier to just use wchar_t strings;
- the usual string-handling idioms ("while (*cp++)") carry over with
- just a change of declarations, and it is possible to implement direct
- equivalents of every string-oriented function in the library.
-
- (When multibyte characters are used, some of the idioms will work, but
- everything that tries to interpret the characters, even just to find a
- '/', will have to use mbtowc/mbstowcs or a non-standard routine. If
- you have to convert to wchar_t anyway, what's the point?)
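-
- To make the comparison concrete, here is a rough sketch of the '/'
- search in both representations. The function names are made up for
- this example; only mbtowc and MB_CUR_MAX come from the standard
- library.

```c
#include <stddef.h>   /* wchar_t, NULL */
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX */

/* Multibyte version: every character has to be decoded before it can be
   compared.  Returns a pointer to the first '/' in s, or NULL. */
const char *mb_find_slash(const char *s)
{
    wchar_t wc;
    int n;

    mbtowc(NULL, NULL, 0);   /* reset mbtowc to the initial shift state */
    while ((n = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
        if (wc == L'/')
            return s;
        s += n;              /* step over one whole multibyte character */
    }
    return NULL;             /* n == 0: end of string; n == -1: bad input */
}

/* Wide version: the familiar pointer-walking idiom, with only the
   declarations changed. */
const wchar_t *wc_find_slash(const wchar_t *ws)
{
    while (*ws != L'\0' && *ws != L'/')
        ws++;
    return *ws == L'/' ? ws : NULL;
}
```

- A bytewise scan for '/' is not a safe substitute for the first function
- in general, since in some encodings the byte value of '/' can also
- occur inside a longer multibyte character.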
-
- 3) There is a problem in that the standard defines wide character
- constants and wide string literals in terms of multibyte characters.
-
- However, Unicode does define a ``data interchange'' format (UTF), for
- transmission through channels that like to mung control characters.
- This format happens to be usable as a Standard C multibyte encoding as
- well --- it maps the source character set to itself, uses NUL only to
- code NUL, and has no shift state as such; after editing C source in a
- Unicode editor, you can save it ``as UTF'' and the result will be
- nicely acceptable to an 8-bit C compiler with the right default
- locale. And when you upgrade to a Unicode C compiler, you can convert
- your source again.
-
- If the problem is that POSIX wants to insist on a subset of Standard
- C, or on C bindings for some system functionality, that cannot work
- with Unicode, then I have very little sympathy: the direction that
- Unicode was taking must have been clear well before anything within
- POSIX was finalized, and disregarding it may well turn out to have
- been, let's say, infelicitous.
-
- Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@diku.dk> (Humour NOT marked)
-
-
- Volume-Number: Volume 28, Number 103
-