Newsgroups: comp.std.internat
Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan9.230114.27491@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan9.024546.26934@fcom.cc.utah.edu> <24579@alice.att.com>
Date: Sat, 9 Jan 93 23:01:14 GMT
Lines: 195

In article <24579@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>In article <1993Jan9.024546.26934@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
>~ In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
>~ >Using Plan 9 utf you know the maximum size in bytes that a Rune can
>~ >be encoded into. Fixed fields only have to be a multiple of this value,
>~ >defined to be UTFmax. So where's the problem?

[ ... ]

>~ A good question. The answer is fourfold:
>~
>~ 1) The field length is still not descriptive of the number of
>~ characters a field may contain; you have only set a minimum
>~ "maximum character count". It is still possible to enter
>~ more characters in a non-maximally encoded character set
>~ than this value, unless you adopt the abstraction of a
>~ maximum character count separate from the maximum necessary
>~ storage length. The maximum character count is what we
>~ would normally think of as the displayed field length (ie:
>~ what you see as the size of an input box). If my field
>~ length is 6 bytes times the maximum record length for 31
>~ bits (as with UTF encoding), would I not be better off using
>~ raw 32-bit encoding, yielding only 4 times the maximum
>~ record length for the same data (or only 2 times for 16-bit
>~ data)?
>
> what objective function are you maximising? frankly,
> if i were doing fixed records, i would keep char/byte counts.
> the reason to use UTF is precisely system-independence (see below).
> if density is an issue (it rarely is for us), then i agree --
> 16-bit encodings save you space for kanji.

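(To put the field-sizing arithmetic from my point 1 in concrete terms,
a throwaway sketch; the figures are the thread's worst cases, not
measurements from any real system:)

    #include <stdio.h>

    enum { UTFmax = 6, NCHARS = 10 };  /* 6-byte worst case for 31 bits */

    int main(void)
    {
        printf("UTF worst case: %d bytes\n", NCHARS * UTFmax); /* 60 */
        printf("raw 32-bit:     %d bytes\n", NCHARS * 4);      /* 40 */
        printf("raw 16-bit:     %d bytes\n", NCHARS * 2);      /* 20 */
        return 0;
    }

A fixed field sized for the UTF worst case carries half again the
storage of raw 32-bit encoding, and triple that of 16-bit.
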
One objective is flat-file databases with a direct correlation between
seek offset into the file and record boundaries. COBOL and FORTRAN data
files are notorious in this context. I realize that there are indexing
mechanisms for COBOL, but there's a lot of old COBOL code out there; I'd
argue that's why COBOL is still out there at all. The problem isn't one
of new code.

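As a sketch of the flat-file property I mean (the names and record size
are illustrative, not from any particular COBOL shop):

    #include <stdio.h>

    #define RECLEN 80L   /* assumed fixed record length, in bytes */

    /* Read record n by seeking directly to n * RECLEN.  Old flat-file
     * code depends on this working, and it only works while RECLEN is
     * simultaneously a byte count and a character count. */
    int read_record(FILE *fp, long n, char *buf)
    {
        if (fseek(fp, n * RECLEN, SEEK_SET) != 0)
            return -1;
        return fread(buf, 1, RECLEN, fp) == RECLEN ? 0 : -1;
    }

Under a variable-width encoding the second assumption quietly breaks,
and every piece of code built on that seek has to be found and fixed.
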
If the database contents are shared between systems at the file system
level (via NFS and remote locks, etc.) rather than at a client/server
level, then there's no means of user-space translation of the data in a
server for the non-internationalized clients.

With all due respect to DMR, FORTRAN is terrible at stored data manipulation
in anything other than flat files.

My main issue is grandfathering old code and allowing gradual rather than
catastrophic update of older systems. Until all older systems are replaced,
I think interoperability will be an issue. With a large distributed system
like the Internet, this is even more of a problem. Imagine NetNews going
UTF a piece at a time, with no interface mechanism.

I think density is a problem, at least in the small-systems arena. Remember
that the reference target I began with was 386BSD; in the 386BSD
community, the average hard drive size is 80-100M.

[ ... possible problems with using UTF internal to applications ... ]

> as it turns out, MANY (but not all) applications can
> work quite well on the UTF-encoded text. mostly, apps search
> for literal strings, and all that happens is that their length
> increases a little. our experience in plan 9 is that relatively
> few applications do enough that it is more efficient to convert
> to Runes before processing.

I'll agree with this for small glyph-set languages (those which can be
represented normally in an 8-bit clean environment), although the further
they diverge from ASCII, the less true it is. For large glyph-set
languages, like Japanese or Chinese, it is less true still. I think Vadim
pointed out that the waste for non-7-bit-ASCII text is about 12.5 percent
with the "best" encoding scheme (he wasn't referring to UTF, and he
didn't give a reference; sorry). If I were a 7-bit ASCII user concerned
only with the US, this may be sufficient argument.

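For reference, here is how I understand the per-character byte costs of
the UTF encoding under discussion (a sketch following the FSS-UTF
proposal's thresholds as I read them, not Plan 9 source):

    /* Bytes needed to encode one 31-bit character value under the
     * FSS-UTF scheme; thresholds per the proposal, as I read it. */
    int utflen(unsigned long c)
    {
        if (c < 0x80)      return 1;  /* 7-bit ASCII unchanged */
        if (c < 0x800)     return 2;  /* ISO 8859-x repertoires land here */
        if (c < 0x10000)   return 3;  /* 16-bit chars: Kanji/Han */
        if (c < 0x200000)  return 4;
        if (c < 0x4000000) return 5;
        return 6;                     /* full 31-bit range */
    }

Three bytes against two for raw 16-bit storage is a 50 percent penalty
on Kanji text, which is exactly the density worry above.
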
[ ... conversion and application size problems if UTF isn't universal in
an application ... ]

> ahhh, therein lies the nub! Plan 9 is UTF all the way
> through. by and large, the problem with this approach for
> unix systems should only be in the moral equivalent of the tty driver.
> once characters are entered, everything is just UTF. the OS
> and libraries and compilers and ... all need to be recompiled
> with care. this is not as bad as it seems but is a lot of fussing.
> on the other hand, ascii systems are there already!

Conceded. I wasn't thinking that applications would be doing processing in
UTF. Given that, what you say is true.

>~ 4) Backward compatibility with existing systems requires the ability
>~ to crossmount using existing mechanisms (NFS, RFS, etc.). How
>~ can this be supported without kernel code, when the remote
>~ system is ignorant of UTF decoding? With kernel code, what
>~ is the purpose in using the UTF encoding as the native processing
>~ mode in the application?
>
> well, if the existing remote systems are just ascii,
> then you have no problem. by and large, everything just works
> (if your system is 8-bit clean). if the remote systems are
> (and remain) non-ascii, then you have problems. the purpose
> (at least in plan 9) of the kernel using UTF is that the entire
> system uses UTF -- all text strings are 10646 chars UTF encoded into
> a byte stream.

Here's the major sticking point for me. I can't conceive of updating all
software on an enterprise-wide basis without a great deal of kicking,
scratching, and bleeding all over. Even when users want to update, there
are problems.

Even on an 8-bit clean remote system, you may be able to store UTF-encoded
data, but you will only be able to use it locally on the storage system
(assuming it's older) if it's 7-bit US ASCII. This works well in the
international market only in those countries using a 7-bit ASCII variant
(an NRCS -- "National Replacement Character Set") that is stored as if it
were simply 7-bit ASCII. That is hardly real internationalization.

For this to work, I think there has to be some way to make older 8-bit
clean internationalizations (ie: ISO 8859-x character sets) for small
glyph-set languages interoperate with the newer machines as they are
brought on line; this pretty much rules out any form of variant encoding.

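To illustrate why variant encodings can't silently coexist: a receiving
system can only guess what it has by scanning the data, with a check
something like this hypothetical one (bit patterns per the FSS-UTF
proposal as I read it; sequences past 4 bytes elided):

    /* Does this buffer even parse as UTF?  Typical ISO 8859-x text
     * with accented characters fails almost immediately.  A sketch. */
    int looks_like_utf(const unsigned char *s, int n)
    {
        int i = 0, extra;

        while (i < n) {
            if (s[i] < 0x80) { i++; continue; }        /* plain ASCII */
            if      ((s[i] & 0xE0) == 0xC0) extra = 1;
            else if ((s[i] & 0xF0) == 0xE0) extra = 2;
            else if ((s[i] & 0xF8) == 0xF0) extra = 3;
            else return 0;                             /* bad lead byte */
            while (extra--)
                if (++i >= n || (s[i] & 0xC0) != 0x80)
                    return 0;                          /* bad continuation */
            i++;
        }
        return 1;
    }

An ISO 8859-1 file containing "café" fails the continuation check at the
'é'; the two can be told apart, but only by scanning, because nothing
in-band marks which one you were handed.
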
>~ There are workarounds to these, but they are much uglier than compact
>~ storage processing, where storage is done in a locale-specific set if
>~ possible. This would also allow a business to convert relatively
>~ painlessly by leveraging current investments in prior localization
>~ technology.
>
> i'd be the first to admit the plan 9 approach has its
>difficulties. however, there is a nice coherence to the system
>that applies to everything it connects to. i am unimpressed with
>the alternative you suggest (although i admire the length and thoroughness
>you have shown in discussing it) because it seems to me like
>``here is an island, there is an island, interchange is a bitch''.
>all interchange must be explicitly typed and converted on system
>boundaries. and i have no idea how you do that. you can't convert
>binaries, just text; but how do you tell the difference? and how
>do you recognise system boundaries? (for example, when i mount
>a remote system's files, i may be getting several systems' files;
>how does the file server know which ones to convert, and how?)

I think this is only an issue if you use "compact storage", as I have been
calling storage in 8-bit character sets when the data fits. Even then, it's
quite possible to arrange for interchange problems to be taken care of
during initial configuration.

Recognizing machine boundaries for NFS wouldn't be necessary; by
definition, "compact storage" presents the localized data in the format
expected by the remote machine. For multilingual files, it's not possible
to use them on a remote system in any case; if there is truly a need, an
internationalized application on the localized machine could be made to
understand the compound document format.

In the case of NFS mounts, on an internationalized system, of localized
data on an older machine, you could always tag the mount point with an
attribute indicating that all files below that point are to be treated
as having that locale in the localized vnodes for the remote files.

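A sketch of what I mean by attributing the mount point (every name here
is invented for illustration; this is not from any existing VFS):

    enum charset { CS_UTF, CS_ISO8859_1, CS_ISO8859_7 };

    struct mount {
        enum charset locale;    /* declared once, at mount time */
    };

    struct vnode {
        struct mount *v_mount;
    };

    /* Every file below the attributed mount point inherits the
     * mount's locale; the internationalized client then knows how to
     * interpret the old machine's bytes without the old machine
     * knowing anything about it. */
    enum charset vnode_charset(const struct vnode *vp)
    {
        return vp->v_mount ? vp->v_mount->locale : CS_UTF;
    }

The old machine keeps serving raw bytes; all the intelligence lives on
the new system, which is the whole point of a gradual update.
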
Or, if it were too difficult, throw out compacted data entirely (I would
probably be loath to do this, now that I've talked about it enough for it
to go from "idea" to "really neat idea" 8-)). The potential for
compression of data where a given 8 bits hold the same value for the
entirety of the file is real, real high. Storage could be at nearly the
8-bit level. For a UNIX system (I am at a disadvantage when it comes to
Plan 9 internals!), this could be handled at the device level (ie: bread,
bwrite). The savings of UTF storage over and above this type of encoding
would probably be minimal enough that the only locales that would feel it
would be 7-bit ASCII and close variants thereof.

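Here's a minimal sketch of that compression idea, assuming 16-bit
characters whose high byte is constant for the whole file (the names are
invented; this is not a device-level implementation):

    /* If every 16-bit character in a file shares one high byte (true
     * of monolingual text within a single code "row"), store that
     * byte once and keep only the low bytes: storage at nearly the
     * 8-bit level.  Assumes unsigned short is 16 bits. */
    int compact16(const unsigned short *in, long n,
                  unsigned char *hibyte, unsigned char *out)
    {
        long i;

        if (n <= 0)
            return -1;
        *hibyte = in[0] >> 8;
        for (i = 0; i < n; i++) {
            if ((in[i] >> 8) != *hibyte)
                return -1;           /* mixed rows: can't compact */
            out[i] = in[i] & 0xFF;
        }
        return 0;
    }

Expansion is just the reverse: re-attach the stored high byte to each
low byte on the way back in (ie: at bread/bwrite time).
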
I don't think any of this results in an island where interchange is not
possible. I think UTF encoding results in the same problems in interchange,
since most of them arise from (to continue the analogy) "an island with
different natives". There are already inherent difficulties in relaying
attributes over data-stream technologies: attribution is mandatory in-band
for multilingual data, and either in-band or out-of-band (if we can find
a way to overcome the obstacles out-of-band raises) for locale-specific
monolingual data stored using Unicode. I think the "island" problem is
smaller than a lot of people have presented it (ie: it *is* soluble),
but I think it's shared by any approach requiring data above and beyond
what the character set itself communicates. This is the primary
objection voiced to Unicode and Unicode-based systems (ie: 10646).


Terry Lambert
terry@icarus.weber.edu
terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------