Usenet 1994 January

Text File | 1992-12-26 | 5.6 KB | 117 lines

Submitted-by: thorinn@diku.dk (Lars Henrik Mathiesen)

NOTE: quotes from Doug's article have been included out of sequence.

gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <16p6g6INNs4i@ftp.UU.NET> thorinn@diku.dk (Lars Henrik Mathiesen) writes:
>>Why is it so bad that Unicode is not like earlier ``extended
>>encodings''? Existing code is a large problem, of course, but what is
>>the problem relative to Standard C itself? The standard provides both
>>wide and multibyte characters, and Unicode happens to fit the former
>>definition instead of the latter.

I think I'll answer my own question: The problem is that one cannot
reasonably demand of a vendor that their implementation should allow
the programmer to use wchar_t for Unicode characters.

However, my (perhaps misguided) thought was that the problem could be
resolved by the vendors choosing to do precisely that when they want
to add Unicode support to their systems. The rest of the article was
intended to compare programming for Unicode in such an implementation
to programming for some (non-trivial) 8-bit-byte multibyte character
set, and not to imply that such a use would be portable --- but on
rereading I can see that this was less than clear.

>And that's the problem. Unicode appears to be intended for use in
>external text objects, such as disk files. The only way an
>implementation could simultaneously conform to the C standard and
>to 10646 used for external storage representation would be for it
>to make char (NOT just wchar_t) 16 bits wide.

Correct me if I'm wrong, but a C implementation will not become
non-conforming if it documents that |fread| can be used to input
Unicode characters to a wchar_t array, and that library functions such
as |wfgets| exist (declared in a non-standard header, of course).
Programs that exploit this will not be strictly conforming, but
neither will programs that call setlocale(LC_CTYPE, "").
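
A minimal sketch of the usage described above, under stated
assumptions: the width of wchar_t is implementation-defined, so the
hypothetical type ucs2_t stands in for a 16-bit wchar_t, and the
hypothetical helper read_ucs2 assumes that the file stores 16-bit code
units in the host's byte order. Neither name comes from the C standard
or from any vendor's library.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a 16-bit wchar_t; the width of the real
   wchar_t is implementation-defined. */
typedef uint16_t ucs2_t;

/* Read up to max 16-bit code units from fp directly into buf, the way
   an implementation could document fread for a wchar_t array.
   Assumes the file's byte order matches the host's.
   Returns the number of complete units read. */
size_t read_ucs2(FILE *fp, ucs2_t *buf, size_t max)
{
    return fread(buf, sizeof buf[0], max, fp);
}
```

A program relying on this is not strictly conforming, for the reason
given above: it depends on a documented property of one implementation
rather than on anything the standard guarantees.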
(I don't think that 10646 defines what conformance is for programming
languages, but an implementation like this would offer one way of
dealing with it.)

>> 3) There is a problem in that the standard defines wide character
>> constants and wide string literals in terms of multibyte characters.

>There is no problem there, and it's not relevant to the issue anyway;
>perhaps you don't understand the intent of this part of the standard.

The intent seems to be to enable the programmer to get some constants
``preconverted'' from multibyte characters into wide characters. It is
exactly because this special syntax sugar exists to initialize wchar_t
objects that the type becomes attractive --- any old 16-bit integral
type would do perfectly fine (and be more portable) if you could live
with writing all the constant strings as lists of hex values.

In the normal usage, the wide character set is derived from the
multibyte character set with the sole purpose of one-to-one
convertibility (of characters). But again, a conforming implementation
could use a multibyte character set that was constructed with the sole
purpose of allowing Unicode in wide character constants.

-----------------------------------------------

>[...] it is important for others now trying to work in the same areas
>to (a) fully understand and (b) cooperate with this existing standard
>model. From what I have seen, ISO 10646 failed on both counts (a)
>and (b), so now we do have a practical programming problem.

The problem may be that the Unicode consortium and/or the ISO
committee work under the implicit assumption that their character sets
will _not_ have to coexist with any others; this may even have been a
conscious decision. If all files on your system are Unicode, there's
no problem, of course, because all your compilers work in Unicode too.

-----------------------------------------------

>>If the problem is that POSIX wants to insist on a subset of Standard
>>C, [...]

>[...]
>The prior art here is that which was adopted into the C
>standard, not Unicode which came later.

It seems that I was both unclear, and confused on the timing of POSIX
development. I included that paragraph because the thread was named
"POSIX update" (and this is comp.std.unix), and I was not sure if the
perceived problem was with the C standard or with POSIX.

The C standard explicitly allows a ``byte'' to be more than eight
bits, and choosing 16-bit Unicode bytes is a very reasonable decision
in some contexts. But it might be that POSIX has additional
requirements that make it impossible for such a system to conform ---
and if the POSIX committee had been aware of Unicode, it would have
made sense to avoid such a conflict.

I have since been told, by e-mail, that POSIX has no such conflict,
and that Unicode came later --- and now, from your article, that
Unicode started out differently, anyway.

>During the Unicode evolution it was very fluid, for example at one
>point using non-0-valued bytes for the "padding". When the Unicode
>proposal started heading in a direction that would cause
>incompatibilities with the established multibyte encoding model,
>[...]

As far as I can see, padding is not enough. A character set encoding
needs to have some way of shifting in and out of an 8-bit subset if it
is to work as an 8-bit C multibyte encoding. I was not aware that
Unicode, in its early stages, provided such a mechanism. It is true
that the first DIS 10646 (which had nothing to do with Unicode) could
be fitted into the multibyte framework, even though the intention
seems to have been that a file would be either 8-bit or 16-bit, except
perhaps for some annunciators at the start.

Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@diku.dk>
(Humour NOT marked)

Volume-Number: Volume 29, Number 12
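
To make the remark about "padding" in the quoted text concrete, here is
a small sketch of why zero-valued padding bytes clash with the 8-bit
multibyte model, in which a zero byte always means the null character.
The array hi_16bit_be and the helper visible_length are hypothetical
names chosen for this illustration; the byte layout assumes 16-bit
code units stored high byte first.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" as 16-bit code units, high byte first: every ASCII value is
   preceded by a zero padding byte, and the terminator is two zero
   bytes. (Assumed layout, for illustration only.) */
static const char hi_16bit_be[] = { 0x00, 'H', 0x00, 'i', 0x00, 0x00 };

/* Length of the data as seen by byte-oriented string functions,
   which stop at the first zero byte. */
size_t visible_length(const char *bytes)
{
    return strlen(bytes);
}
```

Here visible_length(hi_16bit_be) is 0, because the very first byte is
zero. Non-zero padding avoids that particular failure, but it still
leaves the issue raised above: there is no way to shift in and out of
an 8-bit subset.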