NetNews Usenet Archive 1993 #1

Text File | 1993-01-09 | 6.0 KB | 111 lines

Newsgroups: comp.std.internat
Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan10.000115.28150@fcom.cc.utah.edu>
Keywords: Unicode ISO10646 CharacterEncoding
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1993Jan8.092754.6344@prl.dec.com> <1993Jan9.024546.26934@fcom.cc.utah.edu> <1in2c8INNmbj@life.ai.mit.edu>
Date: Sun, 10 Jan 93 00:01:15 GMT
Lines: 98

In article <1in2c8INNmbj@life.ai.mit.edu> glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
>In article <1993Jan9.024546.26934@fcom.cc.utah.edu> you write:
>>[ First a clarification of something which is my fault because of my
>>  background in comm software:  I have been informed that the currently
>>  "blessed" correct terminology for what I have been calling "Runic
>>  encoding" is "Process code", "File code", or "Interchange code".  I'll
>>  try to call it "Interchange code" from now on (I feel the other terms
>>  imply applications, some of which I disagree with). ]
>
>I should have been more clear.  A "process code" is a fixed-width encoding
>suitable for internal processing, e.g., ASCII, Unicode, 10646 UCS2, and
>10646 UCS4, EUC wide char; a "file code" or "interchange code" is a
>potentially variable length encoding suitable for file storage (non memory
>mapped environments) or interchange, e.g., UTF1 and UTF2 (FSS-UTF),
>Shift JIS, EUC Multibyte.
>
>[My objection to your use of the word "rune" was (1) you weren't clear
>about which of these encodings you were referring to, and (2) I hate
>cute terminology which is opaque when perfectly transparent terminology
>already exists.]
>
>One should not in general use an interchange code (UTF1 or UTF2) for
>processing.
>While one may use a process code for interchange, some
>communication channels may have difficulties with data transparency
>(e.g., Unicode and 10646 UCS[24] allow NULL bytes and ISO2022 C0/C1 control
>code bytes in any byte position of their "process codes").

I agree; however, given my objections to UTF, the only potential use for
it in my book would be interchange of data between systems, and in
particular, with Plan 9 systems, which are already committed to UTF for
storage, interchange, and processing.

In my opinion, UTF and other species of what I have been calling "Runic
encoding" are inappropriate for storage of non per-language attributable
data.  They could be useful in 10646 in the future, or in Vadim or Ohta's
suggested "super character set", but there are other mechanisms required
for language attribution in Unicode and 10646 as they currently stand.

This attribution *must* be in-band for multilingual Unicode documents if
we are to overcome the (I believe reasonable) objection to the lack of
information for display localization.  For monolingual Unicode documents,
or a restricted localization environment (such as one would need to limit
the font data downloaded to an X server), attributing the data in-band is
possible, but attributing it out-of-band preserves a lot of information
that is destroyed by in-band attribution.

The destruction of information, like record boundaries or an arithmetic
relationship between the character count in a document and the number of
bytes in a file, is unavoidable in a Unicode implementation (10646 can
escape this by implementing Vadim or Ohta's "super character set"
elsewhere in the vastly larger space provided by the 32 bit space).  There
is other information and other potential uses disallowed as well; this is
simply by way of example.
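As a rough illustration of two points raised above (the arithmetic relationship between character count and byte count, and the NULL-byte transparency problem Glenn mentions), here is a minimal sketch in Python.  It is not taken from any implementation discussed in this thread.  The modern "utf-16-be" and "utf-8" codecs are used as assumed stand-ins for 10646 UCS2 and UTF2 (FSS-UTF) respectively, which should hold only for characters that fit in 16 bits; the helper names are hypothetical.

```python
# Sketch only: modern Python codecs stand in for the 1993 encodings.
# Assumption: "utf-16-be" behaves like 10646 UCS2 for 16-bit characters,
# and "utf-8" behaves like UTF2 (FSS-UTF) for the same range.

def byte_length(text, codec):
    """Number of bytes needed to store text in the given codec."""
    return len(text.encode(codec))

def has_nul_byte(text, codec):
    """True if the encoded form contains a zero byte anywhere."""
    return b"\x00" in text.encode(codec)

# ASCII only, one Latin-1 range character, three kanji.
samples = ["abc", "na\u00efve", "\u65e5\u672c\u8a9e"]

for s in samples:
    chars = len(s)
    fixed = byte_length(s, "utf-16-be")   # always 2 * chars for these samples
    variable = byte_length(s, "utf-8")    # depends on which characters occur
    print(chars, fixed, variable,
          has_nul_byte(s, "utf-16-be"), has_nul_byte(s, "utf-8"))
```

With the fixed-width code, the byte count is always twice the character count, so a character offset can be computed without reading the data; with the variable-length code there is no such ratio and the data has to be scanned.  The fixed-width form of plain ASCII text contains zero bytes, which is the transparency problem for channels that treat NULL specially.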
I think it is acceptable to lose this information and be required to
carry enough baggage to recreate it, or to maintain it in-band, for the
minority case of multilingual documents; for the majority case of
monolingual documents in a localized environment, I think out-of-band
attribution can maintain most if not all of the information.

It is possible to avoid having to do out-of-band attribution for
monolingual documents by relying on a locale mechanism, and assuming an
attribution on all files of that locale, or by not doing "compact storage"
of files which would otherwise take half the storage in a simple 8-bit
clean internationalization.

I think the main problem with non-compact storage is the price paid by
small glyph-set users for internationalization from which they gain no
benefit (as long as they are currently using 8-bit clean systems).  Going
to UTF storage alleviates this for 7-bit ASCII users, but someone who
could use an 8-bit clean system with no penalty still pays extra for
characters not in (or replacing -- not possible in Unicode) the 7-bit
ASCII set.  The further one diverges, the higher the penalty.  What is the
benefit to these users?  In short, little or none, since they are paying
in storage for internationalization which they can do without.

The penalty for UTF storage is exorbitant for large glyph-set users, like
Japanese and Chinese, since it will almost invariably cost more than
simply raw 16-bit storage of the characters until such time as the high 16
bits of 10646 come into use; even then, it is questionable whether raw
32-bit storage would still be less expensive than the UTF encoding (this
would depend on whether a particular character set in the expanded 10646
averaged 4 or fewer bytes to encode the average character).

To my mind, the way UTF is formulated seems like a buy-off of the 7-bit
ASCII world, with little benefit to non (or only partial) 7-bit ASCII
users.
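The storage argument above can be checked with some simple arithmetic.  The following is a minimal sketch in Python, under stated assumptions: the modern "utf-8" codec stands in for UTF2 (FSS-UTF), "utf-16-be" and "utf-32-be" stand in for raw 16-bit and raw 32-bit storage, and "latin-1" stands in for one possible 8-bit clean character set.  The function name storage_cost and the sample strings are hypothetical.

```python
# Sketch only: byte counts for the same text under different storage
# schemes.  The codec choices are assumptions standing in for the 1993
# encodings; they are meant to agree for characters that fit in 16 bits.

def storage_cost(text):
    """Return a dict of bytes needed to store text under each scheme.

    The "8-bit" entry is None when the text cannot be represented in the
    assumed 8-bit character set (latin-1 here).
    """
    costs = {
        "UTF2": len(text.encode("utf-8")),
        "raw 16-bit": len(text.encode("utf-16-be")),
        "raw 32-bit": len(text.encode("utf-32-be")),
    }
    try:
        costs["8-bit"] = len(text.encode("latin-1"))
    except UnicodeEncodeError:
        costs["8-bit"] = None
    return costs

for sample in ["hello", "Gr\u00fc\u00dfe", "\u65e5\u672c\u8a9e"]:
    print(repr(sample), storage_cost(sample))
```

For the 7-bit ASCII sample, UTF2 costs the same 5 bytes as 8-bit storage.  For the German sample, an 8-bit clean system needs 5 bytes while UTF2 needs 7.  For the three kanji, raw 16-bit storage needs 6 bytes while UTF2 needs 9.  These figures cover only these three short strings; real documents mix in ASCII punctuation, digits and whitespace, which shifts the averages.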
					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
--
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------