- Submitted-by: gwyn@smoke.brl.mil (Doug Gwyn)
-
- In article <16p6g6INNs4i@ftp.UU.NET> thorinn@diku.dk (Lars Henrik Mathiesen) writes:
- >Why is it so bad that Unicode is not like earlier ``extended
- >encodings''? Existing code is a large problem, of course, but what is
- >the problem relative to Standard C itself? The standard provides both
- >wide and multibyte characters, and Unicode happens to fit the former
- >definition instead of the latter.
-
- And that's the problem. Unicode appears to be intended for use in
- external text objects, such as disk files. The only way an
- implementation could simultaneously conform to the C standard and
- to 10646 used for external storage representation would be for it
- to make char (NOT just wchar_t) 16 bits wide. While that suffices
- in theory, it was the desire to implement char as an 8-bit byte
- object type, regardless of character set needs, that led to the
- adoption of the multibyte character encoding model for IS 9899 (the
- C standard). The alternative proposals, one sponsored by ITSCJ and
- one sponsored by just me, would have made character and byte object
- types both basic data types in C, with standard I/O functions
- dealing with the most appropriate type (e.g. fread() would get bytes
- while fscanf() would get characters). The observation was made that
- all significant character set encodings at the time could be
- accommodated within the multibyte sequence model; that was an
- important factor in deciding on the model adopted for Standard C.
-
- The point is that these issues were thoroughly discussed in /usr/group
- and X/Open internationalization working groups, members of which (as
- well as Japanese who had already amassed considerable practical
- experience in this area) then assisted during X3J11/WG14 deliberations
- that adopted the multibyte approach. Yes, other, conceptually cleaner
- models are possible, but they were considered and deliberately
- rejected in favor of the external:multibyte/internal:wchar_t model.
- X3.159-1989/IS 9899:1990 was the first major programming language
- standard to tackle internationalization issues, and it was not done
- in isolation but rather with extensive liaison with those working the
- problem at the time. Therefore, it is important for others now trying
- to work in the same areas to (a) fully understand and (b) cooperate
- with this existing standard model. From what I have seen, ISO 10646
- failed on both counts (a) and (b), so now we do have a practical
- programming problem.
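
As an illustration of the external:multibyte/internal:wchar_t model
described above, here is a minimal sketch in Standard C. The function
name external_to_internal() is hypothetical and is not part of the
standard; only mbstowcs() is. The sketch assumes the external text is
a valid multibyte string in the current locale's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert an external multibyte string (an array
   of char) into the internal wide-character form (an array of wchar_t).
   Returns the number of wide characters stored, not counting the
   terminator, or (size_t)-1 if the input is not a valid multibyte
   sequence in the current locale. */
size_t external_to_internal(const char *mb, wchar_t *wc, size_t wcsize)
{
    size_t n;

    assert(mb != NULL && wc != NULL && wcsize > 0);

    /* Leave one element free so the result can always be terminated. */
    n = mbstowcs(wc, mb, wcsize - 1);
    if (n == (size_t)-1)
        return (size_t)-1;

    wc[n] = L'\0';
    return n;
}
```

In the default "C" locale, passing "abc" and an 8-element wchar_t array
should store three wide characters followed by a terminating L'\0'.
Note that the input is an ordinary null-terminated char string, which
is what the model expects of external text.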
-
- > 2) For the programmer, it seems easier to just use wchar_t strings;
-
- But he CAN'T -- there is no standard way to convert from external
- Unicode to internal wchar_t! (Unless char is made 16 bits wide,
- which many C vendors have said is not feasible for their clients.)
- (The root problem is that there are 0-valued "bytes" within the
- Unicode character set.)
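
The 0-valued byte problem can be seen in a small sketch. The helper
names below are hypothetical, and the sketch assumes a big-endian
16-bit encoding and an ASCII execution character set; it is an
illustration of the issue, not a description of any particular
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store 7-bit text as big-endian 16-bit code
   units, the way 16-bit characters might appear in an external file.
   Returns the number of bytes written, including a 16-bit terminator,
   or 0 if the buffer cannot hold even the terminator. */
size_t encode_16bit_be(const char *text, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    assert(text != NULL && out != NULL);
    if (outsize < 2)
        return 0;

    for (; *text != '\0'; text++) {
        if (n + 4 > outsize)          /* keep room for the terminator */
            break;
        out[n++] = 0x00;              /* high byte is 0 for 7-bit text */
        out[n++] = (unsigned char)*text;
    }
    out[n++] = 0x00;
    out[n++] = 0x00;
    return n;
}

/* What a byte-oriented string function sees in that same buffer. */
size_t length_seen_as_char_string(const unsigned char *buf)
{
    return strlen((const char *)buf);
}
```

Encoding "Hi" this way yields six bytes (00 48 00 69 00 00), yet
strlen() on the same buffer reports a length of 0, because the very
first byte is the null character that terminates a C string. Any
interface that treats external text as a null-terminated char array
would stop there.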
-
- > 3) There is a problem in that the standard defines wide character
- > constants and wide string literals in terms of multibyte characters.
-
- There is no problem there, and it's not relevant to the issue anyway;
- perhaps you don't understand the intent of this part of the standard.
-
- >If the problem is that POSIX wants to insist on a subset of Standard
- >C, or C bindings for some system functionality, that cannot work with
- >Unicode; then I have very little sympathy: The direction that Unicode
- >was taking must have been clear well before anything within POSIX was
- >finalized, and disregarding it may well turn out to have been, let's
- >say, infelicitous.
-
- I cannot express how angry such a statement makes me. The prior art
- here is that which was adopted into the C standard, not Unicode which
- came later. During its evolution Unicode was very fluid, for
- example at one point using non-0-valued bytes for the "padding".
- When the Unicode proposal started heading in a direction that would
- cause incompatibilities with the established multibyte encoding model,
- attempts were made to bring this to the attention of the DIS 10646
- developers so that it could be remedied before any harm was done.
- Reports from those involved were that all such suggestions were
- denounced as intrusion on 10646's turf, with the position taken that
- the rest of the world should change to accommodate the forthcoming
- IS 10646 rather than the character set standard being made compatible
- with existing practice. Why 10646 would ever have been approved by
- the International Organization for Standardization under such
- circumstances I don't know, but it must have been for political
- reasons since technically it is an immense mistake.
-
-
- Volume-Number: Volume 29, Number 11
-
-