- Path: sparky!uunet!uunet!not-for-mail
- From: thorinn@diku.dk (Lars Henrik Mathiesen)
- Newsgroups: comp.std.unix
- Subject: Re: POSIX update
- Date: 25 Aug 1992 13:47:22 -0700
- Organization: Department of Computer Science, U of Copenhagen
- Lines: 115
- Sender: sef@ftp.UU.NET
- Approved: sef@ftp.uucp (Moderator, Sean Eric Fagan)
- Message-ID: <17e68qINNj6j@ftp.UU.NET>
- References: <16b9drINNero@ftp.UU.NET> <16i6laINNssj@ftp.UU.NET> <16p6g6INNs4i@ftp.UU.NET> <17bncrINNrpe@ftp.UU.NET>
- NNTP-Posting-Host: ftp.uu.net
- Keywords: posix, unicode, iso10646
- X-Submissions: std-unix@uunet.uu.net
-
- Submitted-by: thorinn@diku.dk (Lars Henrik Mathiesen)
-
- NOTE: quotes from Doug's article have been included out of sequence.
-
- gwyn@smoke.brl.mil (Doug Gwyn) writes:
- >In article <16p6g6INNs4i@ftp.UU.NET> thorinn@diku.dk (Lars Henrik Mathiesen) writes:
- >>Why is it so bad that Unicode is not like earlier ``extended
- >>encodings''? Existing code is a large problem, of course, but what is
- >>the problem relative to Standard C itself? The standard provides both
- >>wide and multibyte characters, and Unicode happens to fit the former
- >>definition instead of the latter.
-
- I think I'll answer my own question: The problem is that one cannot
- reasonably demand of a vendor that their implementation should allow
- the programmer to use wchar_t for Unicode characters.
-
- However, my (perhaps misguided) thought was that the problem could be
- resolved by the vendors choosing to do precisely that when they want
- to add Unicode support to their systems. The rest of the article was
- intended to compare programming for Unicode in such an implementation
- to programming for some (non-trivial) 8-bit-byte multibyte character
- set, and not to imply that such a use would be portable --- but on
- rereading I can see that this was less than clear.
-
- >And that's the problem. Unicode appears to be intended for use in
- >external text objects, such as disk files. The only way an
- >implementation could simultaneously conform to the C standard and
- >to 10646 used for external storage representation would be for it
- >to make char (NOT just wchar_t) 16 bits wide.
-
- Correct me if I'm wrong, but a C implementation will not become
- non-conforming if it documents that |fread| can be used to read
- Unicode characters into a wchar_t array, and that library functions such
- as |wfgets| exist (declared in a non-standard header, of course).
- Programs that exploit this will not be strictly conforming, but
- neither will programs that call setlocale(LC_CTYPE, "").
-
- (I don't think that 10646 defines what conformance is for programming
- languages, but an implementation like this would offer one way of
- dealing with it.)
-
- >> 3) There is a problem in that the standard defines wide character
- >> constants and wide string literals in terms of multibyte
- >> characters.
-
- >There is no problem there, and it's not relevant to the issue anyway;
- >perhaps you don't understand the intent of this part of the standard.
-
- The intent seems to be to enable the programmer to get some constants
- ``preconverted'' from multibyte characters into wide characters.
-
- It is exactly because this special syntactic sugar exists to initialize
- wchar_t objects that the type becomes attractive --- any old 16-bit
- integral type would do perfectly fine (and be more portable) if you
- could live with writing all the constant strings as lists of hex
- values.
-
- In the normal usage, the wide character set is derived from the
- multibyte character set with the sole purpose of one-to-one
- convertibility (of characters). But again, a conforming implementation
- could use a multibyte character set that was constructed with the sole
- purpose of allowing Unicode in wide character constants.
- -----------------------------------------------
- >[...] it is important for others now trying to work in the same areas
- >to (a) fully understand and (b) cooperate with this existing standard
- >model. From what I have seen, ISO 10646 failed on both counts (a)
- >and (b), so now we do have a practical programming problem.
-
- The problem may be that the Unicode consortium and/or the ISO
- committee work under the implicit assumption that their character sets
- will _not_ have to coexist with any others; this may even have been a
- conscious decision. If all files on your system are Unicode, there's
- no problem, of course, because all your compilers work in Unicode too.
- -----------------------------------------------
- >>If the problem is that POSIX wants to insist on a subset of Standard
- >>C, [...]
-
- >[...] The prior art here is that which was adopted into the C
- >standard, not Unicode which came later.
-
- It seems that I was both unclear, and confused on the timing of POSIX
- development. I included that paragraph because the thread was named
- "POSIX update" (and this is comp.std.unix), and I was not sure if the
- perceived problem was with the C standard or with POSIX.
-
- The C standard explicitly allows a ``byte'' to be more than eight
- bits, and choosing 16-bit bytes for Unicode is a reasonable decision in some
- contexts. But it might be that POSIX has additional requirements that
- make it impossible for such a system to conform --- and if the POSIX
- committee had been aware of Unicode, it would have made sense to avoid
- such a conflict.
-
- I have since been told, by e-mail, that POSIX has no such conflict,
- and that Unicode came later --- and now, from your article, that
- Unicode started out differently, anyway.
-
- >During the Unicode evolution it was very fluid, for example at one
- >point using non-0-valued bytes for the "padding". When the Unicode
- >proposal started heading in a direction that would cause
- >incompatibilities with the established multibyte encoding model,
- >[...]
-
- As far as I can see, padding is not enough. A character set encoding
- needs to have some way of shifting in and out of an 8-bit subset if it
- is to work as an 8-bit C multibyte encoding. I was not aware that
- Unicode, in its early stages, provided such a mechanism. It is true
- that the first DIS 10646 (which had nothing to do with Unicode) could
- be fitted into the multibyte framework, even though the intention
- seems to have been that a file would be either 8-bit or 16-bit, except
- perhaps for some annunciators at the start.
-
- Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@diku.dk> (Humour NOT marked)
-
-
- Volume-Number: Volume 29, Number 12
-