Usenet 1994 January

Text File | 1992-12-26 | 4.4 KB | 83 lines

Submitted-by: gwyn@smoke.brl.mil (Doug Gwyn)

In article <16p6g6INNs4i@ftp.UU.NET> thorinn@diku.dk (Lars Henrik Mathiesen) writes:
>Why is it so bad that Unicode is not like earlier ``extended
>encodings''? Existing code is a large problem, of course, but what is
>the problem relative to Standard C itself? The standard provides both
>wide and multibyte characters, and Unicode happens to fit the former
>definition instead of the latter.

And that's the problem. Unicode appears to be intended for use in
external text objects, such as disk files. The only way an
implementation could simultaneously conform to the C standard and to
10646 used for external storage representation would be for it to make
char (NOT just wchar_t) 16 bits wide.

While that suffices in theory, it was the desire to implement char as
an 8-bit byte object type regardless of character set needs that led to
the adoption of the multibyte character encoding model for IS 9899
(C standard) over alternate proposals, one sponsored by ITSCJ and one
sponsored by just me, that would have made character and byte object
types both basic data types in C, with standard I/O functions dealing
with the most appropriate type (e.g. fread() would get bytes while
fscanf() would get characters).

The observation was made that all significant character set encodings
at the time could be accommodated within the multibyte sequence model;
that was an important factor in deciding on the model adopted for
Standard C.

The point is that these issues were thoroughly discussed in /usr/group
and X/Open internationalization working groups, members of which (as
well as Japanese who had already amassed considerable practical
experience in this area) then assisted during X3J11/WG14 deliberations
that adopted the multibyte approach. Yes, other, conceptually cleaner
models are possible, but they were considered and deliberately rejected
in favor of the external:multibyte/internal:wchar_t model.

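For readers who have not used the external:multibyte/internal:wchar_t
model, the following is a minimal sketch of the conversion step, not
anything taken from the standard's text. The helper name
external_to_internal() is hypothetical; the only library call it relies
on is the Standard C function mbstowcs(), and the external encoding is
assumed to be whatever the current locale defines.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (illustration only): convert an external
 * multibyte string to the internal wchar_t form with mbstowcs().
 * The external encoding is the one selected by the current locale's
 * LC_CTYPE category; in the default "C" locale, plain ASCII text
 * converts one byte per wide character.
 *
 * Returns the number of wide characters stored (not counting the
 * terminator), or (size_t)-1 if the input is not a valid multibyte
 * sequence or the buffer has no room for a terminator.  Input longer
 * than the buffer is silently truncated.
 */
size_t external_to_internal(const char *mb, wchar_t *wc, size_t max)
{
    size_t n;

    if (max == 0)
        return (size_t)-1;

    n = mbstowcs(wc, mb, max - 1);   /* leave room for the terminator */
    if (n == (size_t)-1)
        return n;

    wc[n] = L'\0';                   /* mbstowcs does not terminate a full buffer */
    return n;
}
```

A caller would normally select the locale first, for example with
setlocale(LC_CTYPE, ""), and then pass each line read from the external
file through a helper of this kind.
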
X3.159-1989/IS 9899:1990 was the first major programming language
standard to tackle internationalization issues, and it was not done in
isolation but rather with extensive liaison with those working the
problem at the time. Therefore, it is important for others now trying
to work in the same areas to (a) fully understand and (b) cooperate
with this existing standard model. From what I have seen, ISO 10646
failed on both counts (a) and (b), so now we do have a practical
programming problem.

> 2) For the programmer, it seems easier to just use wchar_t strings;

But he CAN'T -- there is no standard way to convert from external
Unicode to internal wchar_t! (Unless char is made 16 bits wide, which
many C vendors have said is not feasible for their clients.) (The root
problem is that there are 0-valued "bytes" within the Unicode character
set.)

> 3) There is a problem in that the standard defines wide character
> constants and wide string literals in terms of multibyte characters.

There is no problem there, and it's not relevant to the issue anyway;
perhaps you don't understand the intent of this part of the standard.

>If the problem is that POSIX wants to insist on a subset of Standard
>C, or C bindings for some system functionality, that cannot work with
>Unicode; then I have very little sympathy: The direction that Unicode
>was taking must have been clear well before anything within POSIX was
>finalized, and disregarding it may well turn out to have been, let's
>say, infelicitous.

I cannot express how angry such a statement makes me. The prior art
here is that which was adopted into the C standard, not Unicode which
came later. During the Unicode evolution it was very fluid, for example
at one point using non-0-valued bytes for the "padding".

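The 0-valued "bytes" problem mentioned above can be illustrated with a
short sketch. The byte layouts below are assumptions chosen for
illustration (the two-character text "AB" stored as 16-bit units in
each byte order), and both function names are hypothetical. The point
is only that byte-oriented Standard C string functions such as strlen()
stop at the first 0-valued byte.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustration only: "AB" as two 16-bit units plus a 16-bit
 * terminator, in both byte orders.  These layouts are assumptions
 * for the example, not quoted from any standard.
 */
const char ab_big_endian[]    = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };
const char ab_little_endian[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

/* What a byte-oriented C library sees: bytes before the first 0. */
size_t length_seen_by_strlen(const char *bytes)
{
    return strlen(bytes);
}

/* What was meant: 16-bit units before the first all-zero unit. */
size_t count_16bit_units(const char *bytes)
{
    size_t n = 0;

    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)
        n++;
    return n;
}
```

With these layouts strlen() sees an empty string in the big-endian
case and a one-byte string in the little-endian case, although both
buffers hold two characters.
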
When the Unicode proposal started heading in a direction that would
cause incompatibilities with the established multibyte encoding model,
attempts were made to bring this to the attention of the DIS 10646
developers so that it could be remedied before any harm was done.
Reports from those involved were that all such suggestions were
denounced as intrusion on 10646's turf, and that the rest of the world
should change to accommodate the forthcoming IS 10646 rather than the
character set standard being compatible with existing practice. Why
10646 would ever have been approved by the International Organization
for Standardization under such circumstances I don't know, but it must
have been for political reasons since technically it is an immense
mistake.

Volume-Number: Volume 29, Number 11