- Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Unicode ISO10646 CharacterEncoding
- Message-ID: <1ipo2kINN6g2@life.ai.mit.edu>
- Date: 10 Jan 93 17:57:40 GMT
- Article-I.D.: life.1ipo2kINN6g2
- References: <1993Jan9.024546.26934@fcom.cc.utah.edu> <1in2c8INNmbj@life.ai.mit.edu> <1993Jan10.000115.28150@fcom.cc.utah.edu>
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 87
- NNTP-Posting-Host: wheat-chex.ai.mit.edu
-
- In article <1993Jan10.000115.28150@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >I agree; however, given my objections to UTF, the only potential use for
- >it in my book would be interchange of data between systems, and in
- >particular, with Plan 9 systems, which are already committed to UTF for
- >storage, interchange, and processing.
-
- The original UTF (UTF1), as described in DIS10646-1.2:1992, was defined
- for one purpose only: to facilitate interchange of 10646 data over
- 8-bit interchange channels sensitive to C0/C1/SPACE/DEL octet values
- (i.e., 8-bit ISO2022 interchange channels). There was no intent to use
- it for direct processing or for file system storage (though these
- weren't ruled out).
-
- The FSS-UTF (UTF2), developed by Ken Thompson and used in Plan9, had a
- different set of goals, established by its usage requirements. I would
- not claim that UTF2 is inappropriate given the goals articulated by the
- Plan9 designers; however, I do not believe one should take their goals
- to be those of the Unicode/10646 community at large. Most Unicode/10646
- implementors that I'm aware of tend to use the fixed-width 16-bit 10646
- UCS2 (Unicode) encoding only, for both processing and interchange.
-
- >In my opinion, UTF and other species of what I have been calling "Runic
- >encoding" are inappropriate for storage of non per-language attributable
- >data.
-
- I think you may be misusing the term: Ken Thompson used it (I believe)
- to refer to fixed-width 16-bit 10646 UCS2 (Unicode) encodings, not to
- the variable-width UTF encodings. The term "multibyte" encoding, as
- currently used in the Posix, X/Open, and Unix communities, tends to
- refer to what is equivalent to UTF variable-length multibyte encodings.
-
- >This [language attribution] *must* be in-band for multilingual Unicode
- >documents if we are to overcome the (I believe reasonable) objection to
- >the lack of information for display localization.
-
- No, this is incorrect. Language attribution *is not* necessary for
- the legible display of any multilingual Unicode data. That is, unless
- you consider that font attribution is necessary for display. Neither
- Ohta-san's claims nor Vadim's claims have shown that language attributes
- are necessary to perform legible display. If you (or anyone else) can
- demonstrably show this to be the case, then I welcome you to do it.
- Otherwise, you can take my word (having implemented a Unicode rendering
- engine) that language attributes are not necessary.
-
- I agree that language attribution is desirable for typographically
- correct display; however, so are font and style attributes.
-
- >The destruction of information, like record boundaries or an arithmetic
- >relationship between the character count in a document and the number of
- >bytes in a file, is unavoidable in a Unicode implementation (10646 can
- >escape this by implementing Vadim or Ohta's "super character set" elsewhere
- >in the vastly larger space provided by the 32 bit space).
-
- No, this is incorrect:
-
- byte_count = unicode_char_count * sizeof (unsigned short)
-
- Your claim is only true in the context of UTF(s), not Unicode (or 10646)
- in general.
-
- >until such time as the high 16 bits of 10646 come into use
-
- Actually, there are only 15 usable high bits in 10646; bit 31 is
- reserved and must be zero (MBZ). I wouldn't look for any characters to
- be defined outside the BMP (the lower 16 bits) for five years or more.
- Even then, I don't see any need for it, nor do I see any reason that,
- should additional non-BMP characters be defined, any major vendor would
- pay attention to them.
-
- Why should someone support 10646 UCS4 when UCS2 gives you everything
- that is needed? [I am assuming here that ISO JTC1/SC2/WG2 will not
- sanction encodings outside of UCS2 simply for the purpose of providing
- implicit language attributions. If this were proposed, I suspect WG2
- would claim that language attributes are beyond the scope of character
- encoding, and, instead, require it be addressed in another standards
- context.]
-
- >To my mind, the way UTF is formulated seems like a buy-off of the 7-bit
- >ASCII world, with little benefit to non (or only partial) 7-bit ASCII
- >users.
-
- UTF2 buys significant backward compatibility with the many existing
- applications that are 8-bit clean but ASCII-only. I believe that was
- its design goal, so it can't be faulted on that basis.
-
- Glenn Adams
-