NetNews Usenet Archive 1993 #1

Text File | 1993-01-11 | 5.9 KB | 123 lines

Newsgroups: comp.std.internat
Path: sparky!uunet!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan11.193710.29580@fcom.cc.utah.edu>
Keywords: Unicode ISO10646 CharacterEncoding
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1in2c8INNmbj@life.ai.mit.edu> <1993Jan10.000115.28150@fcom.cc.utah.edu> <1ipo2kINN6g2@life.ai.mit.edu>
Date: Mon, 11 Jan 93 19:37:10 GMT
Lines: 110

In article <1ipo2kINN6g2@life.ai.mit.edu> glenn@wheat-chex.ai.mit.edu (Glenn A. Adams) writes:
>The FSS-UTF (UTF2), developed by Ken Thompson, and used in Plan9, had
>a different set of goals which were established by their usage requirements.
>I would not claim that UTF2 is inappropriate given the goals articulated
>by the Plan9 designers; however, I do not believe one should take
>their goals to be those of the Unicode/10646 community at large. Most
>Unicode/10646 implementors that I'm aware of, tend to use 10646 UCS2
>(Unicode) fixed width 16-bit encodings only, for both processing and
>interchange.

I definitely agree with this approach for processing. I *don't* agree
for interchange, if that interchange takes the form of a
non-internationalized or 8-bit internationalized (with an 8-bit or
smaller character set) NFS client of an internationalized system.

Storage should be locale-specific, both for interoperability and for
storage savings, so that non-international users do not pay the higher
price of raw encoding, even if the NFS argument is thrown out the window.

>>In my opinion, UTF and other species of what I have been calling "Runic
>>encoding" are inappropriate for storage of non per-language attributable
>>data.
>
>I think you may be misusing the term as Ken Thompson used it (I believe)
>to refer to fixed width 16-bit 10646 UCS2 (Unicode) encodings, and not to
>the variable width UTF encodings. The term "multibyte" encoding as
>currently used in the Posix, X/Open, and Unix communities tends to be
>used for what is equivalent to UTF variable length multibyte encodings.

This is indeed what I meant. A variable number of bytes per character
is nearly intolerable.

>>This [language attribution] *must* be in-band for multilingual Unicode
>>documents if we are to overcome the (I believe reasonable) objection to
>>the lack of information for display localization.
>
>No, this is incorrect. Language attribution *is not* necessary for
>the legible display of any multilingual Unicode data. That is, unless
>you consider that font attribution is necessary for display. Neither
>Ohta-san's claims nor Vadim's claims have shown that language attributes
>are necessary to perform legible display. If you (or anyone else) can
>demonstrably show this to be the case, then I welcome you to do it.
>Otherwise, you can take my word (having implemented a Unicode rendering
>engine) that language attributes are not necessary.

You misunderstand the basis of the objections. The objections are not
made on the basis of legibility, but rather on the [apparent] imposition
of cultural imperialism on those languages undergoing unification. The
point is esthetic in many cases, rather than technical. I can say that
an English text mixing normal, italic, and bold characters will "look
like hell" when printed in a single font.

The point of language attribution is to make font selection possible.
In a monolingual document, the locale information (a la file attribute
or per system) is sufficient to provide the rendering clues; a
multilingual document is a compound document.
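
As an illustration of what in-band language attribution buys, here is a
minimal sketch in C. Everything in it is hypothetical: the struct
lang_run, the function font_for_lang(), the language tags "ja" and "zh",
and the font labels are invented for this example and do not come from
Unicode, 10646, or any rendering engine. It only shows that once each run
of text carries a language tag, the same unified code point can be sent
to different fonts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: one run of text that carries its own language tag. */
struct lang_run {
    const char *lang;            /* assumed tag such as "ja" or "zh" */
    const unsigned short *text;  /* 16-bit code points in this run */
    size_t length;               /* number of code points in the run */
};

/* Hypothetical font table keyed on the language tag.
   The font labels are placeholders, not real font names. */
const char *font_for_lang(const char *lang)
{
    if (strcmp(lang, "ja") == 0)
        return "mincho";
    if (strcmp(lang, "zh") == 0)
        return "songti";
    return "default";
}

/* Pick the font for a run; an untagged run falls back to the default,
   which is the monolingual (locale only) case. */
const char *font_for_run(const struct lang_run *run)
{
    if (run == NULL || run->lang == NULL)
        return "default";
    return font_for_lang(run->lang);
}
```

Two runs holding the identical code point, one tagged "ja" and one
tagged "zh", come back with different font labels; an untagged run
cannot be told apart and falls back to the default.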

It *is* possible to separate out language attribution for any compound
(multilingual) document by splitting the document into one file per
language. This is unacceptable if the file is, for instance, a document
like Roland Lange's "Japanese Verbs", with drastically mixed text. The
issue is not that the text is mixed in and of itself, but that the text
is closely interspersed, and that it isn't practical to separate it into
files by language, with a document description file to do the
compounding. Consider the worst case scenario: a Japanese text telling
how to write Chinese characters.

>I agree that language attribution is desirable for typographically
>correct display; however, so are font and style attributes.

Agreed; and a good argument could be made that this is indeed a thin
line.

>>The destruction of information, like record boundaries or an arithmetic
>>relationship between the character count in a document and the number of
>>bytes in a file, is unavoidable in a Unicode implementation (10646 can
>>escape this by implementing Vadim or Ohta's "super character set" elsewhere
>>in the vastly larger space provided by the 32 bit space).
>
>No, this is incorrect:
>
>    byte_count = unicode_char_count * sizeof (unsigned short)
>
>Your claim is only true in the context of UTF(s), not Unicode (or 10646)
>in general.

Right. Again, UTF encoding was what I was referring to.

>>To my mind, the way UTF is formulated seems like a buy-off of the 7-bit
>>ASCII world, with little benefit to non (or only partial) 7-bit ASCII
>>users.
>
>UTF2 buys significant software backward compatibility with many 8-bit
>clean for ASCII only applications. I believe that was its design goal,
>so it can't be faulted on that basis.

Perhaps not, but there are cleaner mechanisms which buy backward
compatibility with character sets other than simply 7-bit US ASCII.
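
To make the byte count arithmetic quoted above concrete, here is a
minimal sketch in C comparing the two cases. The function names are
invented for this example. It assumes a 16-bit code unit for UCS2 (fixed
at 2 bytes rather than sizeof (unsigned short), which C only guarantees
to be at least 16 bits) and the usual FSS-UTF lengths of one, two, or
three bytes per 16-bit character.

```c
#include <stddef.h>

/* Fixed-width 16-bit UCS2: the byte count follows directly from the
   character count. The width of 2 bytes is an assumption stated above. */
size_t ucs2_byte_count(size_t char_count)
{
    return char_count * 2;
}

/* Number of bytes FSS-UTF uses for one 16-bit character. */
size_t utf_bytes_for(unsigned short c)
{
    if (c < 0x80)
        return 1;   /* 7-bit ASCII range */
    if (c < 0x800)
        return 2;
    return 3;       /* remainder of the 16-bit range */
}

/* Variable-width FSS-UTF: the byte count depends on the content, so
   every character has to be examined. */
size_t utf_byte_count(const unsigned short *text, size_t char_count)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < char_count; i++)
        total += utf_bytes_for(text[i]);
    return total;
}
```

For three ASCII characters, ucs2_byte_count() gives 6 and
utf_byte_count() gives 3; for one ASCII character followed by two Han
characters, the results are 6 and 7. The UCS2 figure never depends on
the text, while the UTF figure cannot be known from the character count
alone, which is the lost arithmetic relationship being discussed.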

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------