- Newsgroups: comp.std.internat
- Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1993Jan10.000115.28150@fcom.cc.utah.edu>
- Keywords: Unicode ISO10646 CharacterEncoding
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <1993Jan8.092754.6344@prl.dec.com> <1993Jan9.024546.26934@fcom.cc.utah.edu> <1in2c8INNmbj@life.ai.mit.edu>
- Date: Sun, 10 Jan 93 00:01:15 GMT
- Lines: 98
-
- In article <1in2c8INNmbj@life.ai.mit.edu> glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
- >In article <1993Jan9.024546.26934@fcom.cc.utah.edu> you write:
- >>[ First a clarification of something which is my fault because of my
- >> background in comm software: I have been informed that the currently
- >> "blessed" correct terminology for what I have been calling "Runic
- >> encoding" is "Process code", "File code", or "Interchange code". I'll
- >> try to call it "Interchange code" from now on (I feel the other terms
- >> imply applications, some of which I disagree with). ]
- >
- >I should have been more clear. A "process code" is a fixed-width encoding
- >suitable for internal processing, e.g., ASCII, Unicode, 10646 UCS2, and
- >10646 UCS4, EUC wide char; a "file code" or "interchange code" is a
- >potentially variable length encoding suitable for file storage (non memory
- >mapped environments) or interchange, e.g., UTF1 and UTF2 (FSS-UTF),
- >Shift JIS, EUC Multibyte.
- >
- >[My objection to your use of the word "rune" was (1) you weren't clear
- >about which of these encodings you were referring to, and (2) I hate
- >cute terminology which is opaque when perfectly transparent terminology
- >already exists.]
- >
- >One should not in general use an interchange code (UTF1 or UTF2) for
- >processing. While one may use a process code for interchange, some
- >communication channels may have difficulties with data transparency
- >(e.g., Unicode and 10646 UCS[24] allow NULL bytes and ISO2022 C0/C1 control
- >code bytes in any byte position of their "process codes").
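The transparency problem described in the quoted paragraph can be seen in a few lines of Python. This is only a sketch: it assumes Python's "utf-16-be" codec as a stand-in for 16-bit Unicode / UCS2 (the two agree for characters below U+10000) and "utf-8" as a stand-in for UTF2 (FSS-UTF). Neither stand-in comes from the post itself.

```python
# Assumed stand-ins: "utf-16-be" for a 16-bit process code,
# "utf-8" for a variable-length file/interchange code.
text = "Hi"

ucs2 = text.encode("utf-16-be")   # fixed width, 2 bytes per character
utf = text.encode("utf-8")        # variable width

print(ucs2)  # b'\x00H\x00i'
print(utf)   # b'Hi'

# A channel that treats NUL as a terminator would cut the 16-bit form short.
print(b"\x00" in ucs2)  # True
print(b"\x00" in utf)   # False

# Control-code bytes can also turn up inside a single 16-bit character:
# U+010A encodes as the bytes 0x01 0x0A, and 0x0A is an ASCII line feed.
ctrl = "\u010a".encode("utf-16-be")
print(ctrl)  # b'\x01\n'
```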
-
- I agree; however, given my objections to UTF, the only potential use for
- it in my book would be interchange of data between systems, and in
- particular, with Plan 9 systems, which are already committed to UTF for
- storage, interchange, and processing.
-
- In my opinion, UTF and other species of what I have been calling "Runic
- encoding" are inappropriate for storage of data that cannot be attributed
- on a per-language basis. They could be useful in 10646 in the future, or in
- Vadim or Ohta's suggested "super character set", but other mechanisms are
- required for language attribution in Unicode and 10646 as they currently
- stand. This attribution *must* be in-band for multilingual Unicode documents
- if we are to overcome the (I believe reasonable) objection to the lack of
- information for display localization. For monolingual Unicode documents, or a restricted
- localization environment (such as one would need to limit the font data
- downloaded to an X server), attributing the data in-band is possible,
- but attributing it out-of-band preserves a lot of information that is
- destroyed by in-band attribution.
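As a rough illustration of the two attribution styles, the sketch below tags a two-language string both ways. The `<lang=xx>` marker format and the `(start, end, language)` run list are hypothetical, invented only for this example; neither Unicode nor 10646 defines them, and "utf-16-be" is again an assumed stand-in for raw 16-bit storage.

```python
# Hypothetical multilingual text: English followed by two Japanese characters.
text = "Tokyo \u6771\u4eac"

# Out-of-band: the text is stored untouched and the language runs are kept
# somewhere else. Each run is (start, end, language), with end exclusive.
runs = [(0, 6, "en"), (6, 8, "ja")]

def tag_in_band(text, runs):
    """Splice made-up language markers into the character stream itself."""
    return "".join(f"<lang={lang}>{text[start:end]}" for start, end, lang in runs)

in_band = tag_in_band(text, runs)

# Out-of-band storage keeps the two-bytes-per-character relationship.
print(len(text.encode("utf-16-be")) == 2 * len(text))  # True

# In-band storage does not: the stream is longer than the visible text, and
# a reader has to parse the markers to recover the real character count.
print(len(text), len(in_band))  # 8 26
```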
-
- The destruction of information, like record boundaries or an arithmetic
- relationship between the character count in a document and the number of
- bytes in a file, is unavoidable in a Unicode implementation (10646 can
- escape this by implementing Vadim or Ohta's "super character set" elsewhere
- in the vastly larger 32-bit code space). Other information and other
- potential uses are lost as well; these are offered simply
- by way of example.
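The arithmetic relationship between character count and byte count can be made concrete. This sketch assumes Python's "utf-16-be" codec as a stand-in for raw 16-bit storage and "utf-8" as a stand-in for a variable-width encoding; the sample strings are arbitrary and the helper names are hypothetical.

```python
def char_count_from_size(num_bytes):
    """With fixed 16-bit storage, the character count follows from file size."""
    return num_bytes // 2

def byte_offset(char_index):
    """With fixed 16-bit storage, character n starts at byte 2 * n."""
    return 2 * char_index

samples = [
    "record",
    "r\u00e9sum\u00e9",
    "\u65e5\u672c\u8a9e\u30c6\u30ad\u30b9\u30c8",
]

for s in samples:
    fixed = len(s.encode("utf-16-be"))   # always 2 * len(s) for these samples
    variable = len(s.encode("utf-8"))    # depends on which characters appear
    print(len(s), fixed, char_count_from_size(fixed), variable)

# prints:
# 6 12 6 6
# 6 12 6 8
# 7 14 7 21
```

With the variable-width form, neither the character count nor the offset of the n-th character can be computed from the byte count alone; the data has to be scanned.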
-
- I think it is acceptable to lose this information and be required to carry
- enough baggage to recreate it, or to maintain it in-band, for the minority
- case of multilingual documents; for the majority case of monolingual
- documents in a localized environment, I think out-of-band attribution
- can preserve most, if not all, of the information.
-
- It is possible to avoid out-of-band attribution for monolingual documents
- by relying on a locale mechanism and assuming the same attribution for all
- files of that locale, or by not doing "compact storage" of files which would
- otherwise take half the storage in a simple 8-bit clean internationalization.
-
-
- I think the main problem with non-compact storage is the price paid by
- small glyph-set users for internationalization from which they gain no
- benefit (as long as they are currently using 8-bit clean systems). Going
- to UTF storage alleviates this for 7-bit ASCII users, but someone who could
- use an 8-bit clean system with no penalty still pays extra for characters
- not in (or replacing -- not possible in Unicode) the 7-bit ASCII set. The
- further one diverges, the higher the penalty. What is the benefit to these
- users? In short, little or none, since they are paying in storage for
- internationalization which they can do without. The penalty for UTF storage
- is exorbitant for large glyph-set users, like Japanese and Chinese, since
- it will almost invariably cost more than simply raw 16-bit storage of the
- characters until such time as the high 16 bits of 10646 come into use; even
- then, it is questionable whether raw 32-bit storage would still
- be less expensive than the UTF encoding (this would depend on whether a
- particular character set in the expanded 10646 averaged four or fewer bytes
- per character in UTF).
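The storage comparison can be worked through with a small calculation. This is a sketch under stated assumptions: the byte-length limits below are taken from the FSS-UTF (UTF2) proposal's table rather than from this post, the function names are hypothetical, and the sample strings are arbitrary.

```python
def fss_utf_length(code_point):
    """Bytes needed for one character under the FSS-UTF length table."""
    # Largest code point representable in 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return nbytes
    raise ValueError("code point outside the 31-bit range")

def storage_cost(text):
    """Return (fss_utf_bytes, raw_16_bit_bytes, raw_32_bit_bytes) for text.

    The raw 16-bit figure is only meaningful for characters below 0x10000.
    """
    utf = sum(fss_utf_length(ord(ch)) for ch in text)
    return utf, 2 * len(text), 4 * len(text)

print(storage_cost("hello"))                            # (5, 10, 20)
print(storage_cost("gr\u00fc\u00dfe"))                  # (7, 10, 20)
print(storage_cost("\u3053\u3093\u306b\u3061\u306f"))   # (15, 10, 20)

# In the expanded 31-bit space a single character can cost up to 6 bytes,
# against 4 bytes for raw 32-bit storage.
print(fss_utf_length(0x10000), fss_utf_length(0x4000000))  # 4 6
```

The second sample fits in 5 bytes on an 8-bit clean system but takes 7 in UTF, and the third takes 15 bytes in UTF against 10 as raw 16-bit characters.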
-
- To my mind, the way UTF is formulated seems like a buy-off of the 7-bit
- ASCII world, with little benefit to non (or only partial) 7-bit ASCII
- users.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-