- Newsgroups: comp.unix.bsd
- Path: sparky!uunet!usc!zaphod.mps.ohio-state.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1993Jan5.090747.29232@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <2564@titccy.cc.titech.ac.jp> <1992Dec28.062554.24144@fcom.cc.utah.edu> <2615@titccy.cc.titech.ac.jp>
- Date: Tue, 5 Jan 93 09:07:47 GMT
- Lines: 268
-
- In article <2615@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
- >In article <1992Dec28.062554.24144@fcom.cc.utah.edu>
- > terry@cs.weber.edu (A Wizard of Earth C) writes:
- >
- >>|> Do you know what Shift JIS is? It's a de facto standard for character encoding
- >>|> established by Microsoft, NEC, ASCII etc. and common in the Japanese PC market.
- >>
- [ ... ]
- >Sigh... With DOS/V you can use Japanese on YOUR "commodity hardware".
- >
- >>I think other mechanisms, such as ATOK, Wnn, and KanjiHand, deserve to be
- >>examined. One method would be to adopt exactly the input mechanism of
- >>"Ichi-Taro" (the most popular NEC 98 word processor).
- >
- >They run also on IBM/PC.
-
- Well, then this makes them worthy of consideration in the same light as
- Shift JIS as input mechanisms.
-
- >>|> In the workstation market in Japan, some support Shift JIS, some
- >>|> support EUC and some support both. Of course, many US companies
- >>|> sell Japanized UNIX on their workstations.
- >>
- >>I think this is precisely what we want to avoid -- localization. The basic
- >>difference, to my mind, is that localization involves the maintenance of
- >>multiple code sets, whereas internationalization requires maintenance of
- >>multiple data sets, a much smaller job.
- >
- >>This I don't understand. The maximum translation table from one 16 bit value
- >>to another is 16k.
- >
- >WHAAAAT? It's 128KB, not 16k.
-
- Not true for designated sets, the maximum of which spans a 16K region. If,
- for instance, I designate that I am using the German language, I can say
- that there will be no input of characters other than German characters;
- thus the set of characters is reduced to the span from the first German
- character to the last. 16K is sufficient for most spanning sets for
- a particular language, as it covers the lexical distance between characters
- in a particular designated language.
-
- 128k is only necessary if you are to include translation of all characters;
- for a translation involving all characters, no spanning set smaller than the
- full set exists. Thus a 128k translation table is [almost] never useful
- (an exception exists for JIS input translation, and that table can exist
- in the input mechanism just as easily as the existing JIS ordering table
- does now).
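- 
- To make the arithmetic concrete, here is a rough sketch of what I mean by a
- designated-set table; the names and the 16K-entry span are illustrative, not
- taken from any standard:
- 
-     #define SPAN    0x4000  /* 16K-entry span assumed for the set */
- 
-     struct desig_set {
-         unsigned short base;        /* first code value used by the language */
-         unsigned short limit;       /* one past the last code value used */
-         unsigned short map[SPAN];   /* targets, indexed by (code - base) */
-     };
- 
-     /* Translate one 16-bit value; values outside the span pass through. */
-     unsigned short
-     xlate(const struct desig_set *set, unsigned short code)
-     {
-         if (code < set->base || code >= set->limit)
-             return code;
-         return set->map[code - set->base];
-     }
- 
- A full translation of all 64K code points, by contrast, is where the 128KB
- figure comes from.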
-
- >>This means 2 16k tables for translation into/out of
- >>Unicode for Input/Output devices,
- >
- >I'm afraid you don't know what Unicode is. What do you mean by "tables for
- >translation"?
-
- In this particular case, I was thinking of an ASCII->Unicode or Unicode->ASCII
- translation to reduce the storage penalty paid by Western languages for
- the potential support for large glyph set (ie: greater than 256 character)
- languages. Failure to do this translation (presumably to an ISO font for
- storage) will result in 1) space loss due to raw encoding, or 2) information
- loss due to Runic encoding (ie: the byte count of the file no longer reflects
- the file's character count).
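- 
- As a rough sketch of the sort of translation I have in mind (purely
- illustrative; it relies only on the fact that Unicode's first 256 code
- points coincide with ISO 8859-1):
- 
-     #include <stddef.h>
- 
-     /* Narrow 16-bit text for storage; fails if any value won't fit a byte. */
-     int
-     narrow(const unsigned short *text, unsigned char *out, size_t n)
-     {
-         size_t i;
- 
-         for (i = 0; i < n; i++) {
-             if (text[i] > 0xFF)
-                 return -1;    /* needs the wide (or tagged) storage form */
-             out[i] = (unsigned char)text[i];
-         }
-         return 0;
-     }
- 
-     /* Widen stored bytes back to 16-bit values for processing. */
-     void
-     widen(const unsigned char *in, unsigned short *text, size_t n)
-     {
-         size_t i;
- 
-         for (i = 0; i < n; i++)
-             text[i] = in[i];
-     }
- 
- A Western text file stored this way stays one byte per character; the wide
- form is only paid for when the data actually needs it.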
-
- >>I don't see why the storage mechanism in any way affects the validity of the
- >>data
- >
- >*I* don't see why the storage mechanism in any way affects the validity of the
- >data
- >
- >>and thus I don't understand *why* you say "with Unicode, we can't
- >>achieve internationalization."
- >
- >Because we can't process data mixed with Japanese and Chinese.
-
- Unicode is a storage mechanism, *not* the display mechanism. The Japanese
- and Chinese data within the file can be tagged as such, and this information
- can be used by programs (such as the print utility) to perform font selection.
- If the use of embedded control information is so much more undesirable than
- Runic encoding, which seems to be accepted without a peep, then a DEC-style
- CDA (Compound Document Architecture) can be used.
-
- In any case, font and language selection is not a property of the storage
- mechanism, but a property of the data stored. Thus the storage mechanism
- (in this case, Unicode) is not responsible for the language tagging.
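- 
- One possible shape for such tagging, just as a sketch (the structure, the
- language codes, and the font names are all invented for illustration, not
- any particular compound-document format):
- 
-     #include <stddef.h>
-     #include <string.h>
- 
-     struct tagged_run {
-         const char           *lang;  /* e.g. "ja" or "zh"; a display hint */
-         const unsigned short *text;  /* Unicode values, the storage form */
-         size_t                len;
-     };
- 
-     /* The display side, not the storage, resolves the tag to a font. */
-     const char *
-     font_for_run(const struct tagged_run *run)
-     {
-         if (strcmp(run->lang, "ja") == 0)
-             return "japanese-mincho";   /* hypothetical font names */
-         if (strcmp(run->lang, "zh") == 0)
-             return "chinese-song";
-         return "default-latin";
-     }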
-
- A mixed Japanese and Chinese document is a multilingual document; I can
- go into the rationale as to why this is not a penalty specific to languages
- with intersecting glyph variants if necessary. I think that this is beside
- the point (which is enabling 386BSD for data-driven localization).
-
- >>I don't understand this, either. This is like saying PC ASCII cannot cover
- >>both the US and the UK because the American and English pound signs are not
- >>the same, or that it can't cover German or Dutch because of the 7-character
- >>difference needed for support of those languages.
- >
- >Wrong. The US and UK signs are the same character, while they might be assigned
- >different code points in different countries.
- >
- >Thus, in a universal coded character set, it is correct to assign a
- >single code point to the single pound sign, even though the character
- >is used both in the US and UK.
- >
- >But, corresponding characters in China/Japan, which do not share the
- >same graphical representation even on moderate quality printers and
- >are thus different characters, are assigned the same code point in Unicode.
-
- This is not a requirement for unification if an assumption of language
- tagging for the files (either externally to the files or within the data
- stream for mixed language documents) is made. I am making that assumption.
-
- The fact is that localization is a much more desirable goal than
- multilingual word processing. Thus if a penalty is to be paid, it should
- be paid in the multinational application rather than in all localized
- applications.
-
- One could make the same argument regarding the Unification of the Latin,
- Latvian, and Lappish { SMALL LETTER G CEDILLA }, since there are glyph
- variants there as well.
-
- The point is that the storage technology is divorced from the font
- selection process; that is something that is done either based on an
- ISO 2022 style mechanism within the document itself, or on a per-file basis
- during localization, or in a Compound Document (what Unicode calls a "Fancy
- text" file).
-
- The font selection is based on the language tagging of the file at the time
- of display (to a screen or to a hard copy device). The use of a Unicode
- character set for storage does not necessarily imply the use of a Unicode
- (or other unified) font for display.
-
- When printing Latin text, a Latin font will be used; when printing Latvian
- text, a Latvian font, etc.
-
- >>|> Of course, it is possible to LOCALIZE Unicode so that it produces
- >>|> Japanese characters only or Chinese characters only. But don't we
- >>|> need internationalization?
- >>
- >>The point of an internationalization effort (as *opposed* to a localization
- >>effort) is the coexistence of languages within the same processing means.
- >>The point is not to produce something which is capable of "only English" or
- >>"only French" or "only Japanese" at the flick of an environment variable;
- >>the point is to produce something which is *data driven* and localized by
- >>a change of data rather than by a change of code. To do otherwise would
- >>require the use of multiple code trees for each language, which was the
- >>entire impetus for an internationalization effort in the first place.
- >
- >That is THE problem of Unicode.
-
- It's not a problem if you don't expect your storage mechanism to dictate
- your display mechanism.
-
- >I was informed that MicroSoft will provide a LOCALIZATION mechanism
- >to print corresponding Chinese/Japanese characters of Unicode
- >differently.
-
- Yes. We will have to do the same if we use Unicode.
-
- >So, HOW can we MIX Chinese and Japanese without LOCALIZATION?
-
- You can't... but neither can you type in both Chinese and Japanese on the
- same keyboard without switching the "LOCALIZATION" of the keyboard to the
- language being input. This gives an automatic attribution cue to the
- compounding mechanism (ISO 2022 style or DEC style or whatever).
-
- >>This involves yet another
- >>set of localization-specific storage tables to translate from an ISO or
- >>other local font to Unicode and back on attributed file storage.
- >
- >FILE ATTRIBUTE!!!!!????? *IT* *IS* *EVIL*. Do you really know UNIX?
-
- The basis for this is an assumption of an optimization necessary to reduce
- the storage requirements of Western text files to current levels, and the
- assumption that it would be better if the file size of a monolingual
- document (the most common kind of document) bore an arithmetic relationship
- to the number of glyphs within the file. This will incidentally reduce
- the number of bytes required for the storage of Japanese documents from
- 2-5 down to 2 per glyph. Of course you can always turn the optimization
- off...
-
- >How can you "cat" two files with different file attributes?
-
- By localizing the display output mechanism. In particular, in a multilingual
- environment (the only place you will ever have two files with different
- file attributes), there would be no alternative to a graphic or other
- complex display device (be it a printer or an X terminal).
-
- Ideally, there will be a language-localized font for every supported language,
- defining only the glyphs in Unicode which pertain to that language.
-
- >What attribute can you attach to semi binary file, in which some field
- >contains an ASCII string and some other field contains a JIS string?
-
- The point of a *data driven* localization is to avoid the existence of
- semi-binary files. In terms of internationalization, you are describing
- not a generally usable application, but a bilingual application. This
- is against the rules. It is possible to do language tagging of the
- users, wherein any data not associated with a language tag (such as data
- embedded in a printf() call) in a badly behaved application is assumed
- to be tagged for output as a particular language. Thus a
- German user on a 7-bit VT220 set for the German-language NRCS (National
- Replacement Character Set) might see some odd characters on their screen
- when running a badly behaved application from France.
-
- >>To do
- >>otherwise would require 16 bit storage of files, or worse, runic encoding
- >>of any non-US ASCII characters in a file. This either doubles the file
- >>size for all text files (something the west _will_not_accept_),
- >
- >Do you know what UTF is?
-
- Yes, Runic encoding. I'll post the Plan-9 and Metis sources for it if you
- want. I describe it fairly accurately below as "pollution":
-
- >>or
- >>"pollutes" the files (all files except those stored in US-ASCII have file
- >>sizes which no longer reflect true character counts on the file).
-
- Destruction of this information is basically unacceptable for Western users
- and other multinational users currently used to internationalization by way
- of 8-bit clean environments.
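- 
- To make the point concrete, here is a sketch of a UTF-style variable-length
- encoding of 16-bit values (the bit layout shown is the FSS-UTF style; the
- exact tables in the Plan-9 sources may differ):
- 
-     #include <stddef.h>
- 
-     /* Encode one 16-bit value; returns the number of bytes emitted (1-3). */
-     size_t
-     utf_encode(unsigned short c, unsigned char *out)
-     {
-         if (c < 0x80) {
-             out[0] = (unsigned char)c;
-             return 1;
-         }
-         if (c < 0x800) {
-             out[0] = (unsigned char)(0xC0 | (c >> 6));
-             out[1] = (unsigned char)(0x80 | (c & 0x3F));
-             return 2;
-         }
-         out[0] = (unsigned char)(0xE0 | (c >> 12));
-         out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
-         out[2] = (unsigned char)(0x80 | (c & 0x3F));
-         return 3;
-     }
- 
- A three-character Japanese string can thus occupy nine bytes on disk; the
- byte count of the file no longer tells you its character count.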
-
- >That's already true for languages like Japanese, whose characters are
- >NOT ALWAYS (but sometimes) represented with a single byte.
- >
- >But, what's wrong with that?
-
- It doesn't have to be true of Japanese, especially with a fixed storage
- coding. What's wrong with it is that file size equates to character count
- for Western users, and there are users who depend on this information. As
- a VMS user (I know, *bletch*), I had the wonderful experience of finding
- out that tell() didn't return a byte offset for record-oriented files. As
- a UNIX user, I depended on that information, and it was vastly frustrating
- to me when the information was destroyed by the storage mechanism. This
- is precisely analogous, except that it is the file size rather than the data
- within the file that is affected.
-
- Consider for a moment, if you will, changing the first character of a
- field in a 1M file. Not only does this cause the record size to become
- variant on the data within it (thus rendering a computed lseek() useless,
- since the records are no longer fixed size), but it requires that the entire
- file contents be shifted to accommodate what used to be the rewrite of a
- single block.
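- 
- A sketch of the computed seek that fixed-size records permit (RECLEN and the
- layout are illustrative):
- 
-     #include <sys/types.h>
-     #include <unistd.h>
- 
-     #define RECLEN  128     /* fixed record size, in bytes */
- 
-     /* Overwrite record `recno` in place, leaving the rest of the file alone. */
-     ssize_t
-     rewrite_record(int fd, long recno, const char buf[RECLEN])
-     {
-         if (lseek(fd, (off_t)recno * RECLEN, SEEK_SET) == (off_t)-1)
-             return -1;
-         return write(fd, buf, RECLEN);
-     }
- 
- With a variable-length encoding, the byte offset of record N depends on the
- data in records 0 through N-1, so neither the one-line seek nor the in-place
- overwrite is possible any more.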
-
- I realize that as a Japanese user, you are "used to" this behaviour; the
- average Western user (or *anyone* using a language representable in only
- an 8-bit clean environment) is *not*.
-
- >>Admittedly, these mechanisms are adaptable for XPG4 (not widely available)
- >>and XPG3 (does not support eastern languages), but the MicroSoft adoption
- >>of Unicode tells us that at least 90% of the market is now committed to
- >>Unicode, if not now, then in the near future.
- >
- >Do you think MicroSoft will use file attributes?
-
- Probably not. Most likely, the fact that Microsoft did not invent it and the
- fact that the HPFS for NT is pretty well locked in place argue against it.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-