- Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Unicode ISO10646 CharacterEncoding
- Message-ID: <1ipo2kINN6g2@life.ai.mit.edu>
- Date: 10 Jan 93 17:57:40 GMT
- Article-I.D.: life.1ipo2kINN6g2
- References: <1993Jan9.024546.26934@fcom.cc.utah.edu> <1in2c8INNmbj@life.ai.mit.edu> <1993Jan10.000115.28150@fcom.cc.utah.edu>
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 87
- NNTP-Posting-Host: wheat-chex.ai.mit.edu
-
- In article <1993Jan10.000115.28150@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >I agree; however, given my objections to UTF, the only potential use for
- >it in my book would be interchange of data between systems, and in
- >particular, with Plan 9 systems, which are already committed to UTF for
- >storage, interchange, and processing.
-
- The original UTF (UTF1), as described in DIS10646-1.2:1992, was defined
- for one purpose only: to facilitate interchange of 10646 data over
- 8-bit interchange channels sensitive to C0/C1/SPACE/DEL octet values
- (i.e., 8-bit ISO2022 interchange channels). There was no intent to use
- it for direct processing or for file system storage (though these
- weren't ruled out).
-
- The FSS-UTF (UTF2), developed by Ken Thompson and used in Plan9, had a
- different set of goals, established by its usage requirements. I would
- not claim that UTF2 is inappropriate given the goals articulated by the
- Plan9 designers; however, I do not believe one should take their goals
- to be those of the Unicode/10646 community at large. Most Unicode/10646
- implementors that I'm aware of tend to use the fixed-width 16-bit 10646
- UCS2 (Unicode) encoding only, for both processing and interchange.
-
- >In my opinion, UTF and other species of what I have been calling "Runic
- >encoding" are inappropriate for storage of non per-language attributable
- >data.
-
- I think you may be misusing the term: Ken Thompson used it (I believe)
- to refer to fixed-width 16-bit 10646 UCS2 (Unicode) encodings, not to
- the variable-width UTF encodings. The term "multibyte" encoding, as
- currently used in the Posix, X/Open, and Unix communities, tends to
- refer to what is equivalent to UTF variable-length multibyte encodings.
-
- >This [language attribution] *must* be in-band for multilingual Unicode
- >documents if we are to overcome the (I believe reasonable) objection to
- >the lack of information for display localization.
-
- No, this is incorrect. Language attribution *is not* necessary for
- the legible display of any multilingual Unicode data. That is, unless
- you consider that font attribution is necessary for display. Neither
- Ohta-san's claims nor Vadim's claims have shown that language attributes
- are necessary to perform legible display. If you (or anyone else) can
- demonstrably show this to be the case, then I welcome you to do it.
- Otherwise, you can take my word (having implemented a Unicode rendering
- engine) that language attributes are not necessary.
-
- I agree that language attribution is desirable for typographically
- correct display; however, so are font and style attributes.
-
- >The destruction of information, like record boundaries or an arithmetic
- >relationship between the character count in a document and the number of
- >bytes in a file, is unavoidable in a Unicode implementation (10646 can
- >escape this by implementing Vadim or Ohta's "super character set" elsewhere
- >in the vastly larger space provided by the 32 bit space).
-
- No, this is incorrect:
-
- byte_count = unicode_char_count * sizeof (unsigned short)
-
- Your claim is only true in the context of UTF(s), not Unicode (or 10646)
- in general.
-
- >until such time as the high 16 bits of 10646 come into use
-
- Actually, there are only 15 usable high bits in 10646; bit 31 is
- reserved and must be zero (MBZ). I wouldn't look for any characters to
- be defined outside the BMP (the lower 16 bits) for five years or more.
- Even then, I don't see any need for it, nor do I see any reason that,
- should additional non-BMP characters be defined, any major vendor would
- pay attention to them.
-
- Why should someone support 10646 UCS4 when UCS2 gives you everything
- that is needed? [I am assuming here that ISO JTC1/SC2/WG2 will not
- sanction encodings outside of UCS2 simply for the purpose of providing
- implicit language attributions. If this were proposed, I suspect WG2
- would claim that language attributes are beyond the scope of character
- encoding, and, instead, require it be addressed in another standards
- context.]
-
- >To my mind, the way UTF is formulated seems like a buy-off of the 7-bit
- >ASCII world, with little benefit to non (or only partial) 7-bit ASCII
- >users.
-
- UTF2 buys significant backward compatibility with the many existing
- applications that are 8-bit clean but ASCII-only. I believe that was
- its design goal, so it can't be faulted on that basis.
-
- Glenn Adams
-