Newsgroups: comp.std.internat
Path: sparky!uunet!gatech!usenet.ins.cwru.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan9.230114.27491@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan9.024546.26934@fcom.cc.utah.edu> <24579@alice.att.com>
Date: Sat, 9 Jan 93 23:01:14 GMT
Lines: 195

In article <24579@alice.att.com> andrew@alice.att.com (Andrew Hume) writes:
>In article <1993Jan9.024546.26934@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
>~ In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
>~ >Using Plan 9 utf you know the maximum size in bytes that a Rune can
>~ >be encoded into. Fixed fields only have to be a multiple of this value,
>~ >defined to be UTFmax. So where's the problem?

[ ... ]

>~ A good question. The answer is fourfold:
>~
>~ 1) The field length is still not descriptive of the number of
>~ characters a field may contain; you have only set a minimum
>~ "maximum character count". It is still possible to enter
>~ more characters in a non-maximally encoded character set
>~ than this value, unless you adopt the abstraction of a
>~ maximum character count separate from the maximum necessary
>~ storage length. The maximum character count is what we
>~ would normally think of as the displayed field length (ie:
>~ what you see as the size of an input box). If my field
>~ length is 6 bytes times the maximum record length for 31
>~ bits (as with UTF encoding), would I not be better off using
>~ raw 32-bit encoding, yielding only 4 times the maximum
>~ record length for the same data (or only 2 times for 16-bit
>~ data)?
>
> what objective function are you maximising? frankly,
> if i were doing fixed records, i would keep char/byte counts.
> the reason to use UTF is precisely system-independence (see below).
> if density is an issue (it rarely is for us), then i agree --
> 16-bit encodings save you space for kanji.

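(To put the field-sizing arithmetic from my point 1 in concrete terms,
a throwaway sketch; the figures are the thread's worst cases, not
measurements from any real system:)

    #include <stdio.h>

    enum { UTFmax = 6, NCHARS = 10 };  /* 6-byte worst case for 31 bits */

    int main(void)
    {
        printf("UTF worst case: %d bytes\n", NCHARS * UTFmax); /* 60 */
        printf("raw 32-bit:     %d bytes\n", NCHARS * 4);      /* 40 */
        printf("raw 16-bit:     %d bytes\n", NCHARS * 2);      /* 20 */
        return 0;
    }

A fixed field sized for the UTF worst case carries half again the
storage of raw 32-bit encoding, and triple that of 16-bit.
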
One objective is flat-file databases with a direct correlation between
seek offset into the file and record boundaries. COBOL and FORTRAN data
files are notorious in this context. I realize that there are indexing
mechanisms for COBOL, but there's a lot of old COBOL code out there; I'd
argue that's why COBOL is still out there at all. The problem isn't one
of new code.

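As a sketch of the flat-file property I mean (the names and record size
are illustrative, not from any particular COBOL shop):

    #include <stdio.h>

    #define RECLEN 80L   /* assumed fixed record length, in bytes */

    /* Read record n by seeking directly to n * RECLEN.  Old flat-file
     * code depends on this working, and it only works while RECLEN is
     * simultaneously a byte count and a character count. */
    int read_record(FILE *fp, long n, char *buf)
    {
        if (fseek(fp, n * RECLEN, SEEK_SET) != 0)
            return -1;
        return fread(buf, 1, RECLEN, fp) == RECLEN ? 0 : -1;
    }

Under a variable-width encoding the second assumption quietly breaks,
and every piece of code built on that seek has to be found and fixed.
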
If the database contents are shared between systems at the file system
level (via NFS and remote locks, etc.) rather than at a client/server
level, then there's no means of user-space translation of the data in a
server for the non-internationalized clients.

With all due respect to DMR, FORTRAN is terrible at stored data manipulation
in anything other than flat files.

My main issue is grandfathering old code and allowing gradual rather than
catastrophic update of older systems. Until all older systems are replaced,
I think interoperability will be an issue. With a large distributed system
like the Internet, this is even more of a problem. Imagine NetNews going
UTF a piece at a time, with no interface mechanism.

I think density is a problem, at least in the small-systems arena. Remember
that the reference target I began with was 386BSD; in the 386BSD
community, the average hard drive size is 80-100M.

[ ... possible problems with using UTF internal to applications ... ]

> as it turns out, MANY (but not all) applications can
> work quite well on the UTF-encoded text. mostly, apps search
> for literal strings, and all that happens is that their length
> increases a little. our experience in plan 9 is that relatively
> few applications do enough that it is more efficient to convert
> to Runes before processing.

I'll agree with this for small glyph-set languages (those which can be
represented normally in an 8-bit clean environment), although the further
they diverge from ASCII, the less true it is. For large glyph-set
languages, like Japanese or Chinese, it is less true still. I think Vadim
pointed out that the waste for non-7-bit-ASCII text is about 12.5 percent
with the "best" encoding scheme (he wasn't referring to UTF, and he
didn't give a reference; sorry). If I were a 7-bit ASCII user concerned
only with the US, this may be sufficient argument.

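For reference, here is how I understand the per-character byte costs of
the UTF encoding under discussion (a sketch following the FSS-UTF
proposal's thresholds as I read them, not Plan 9 source):

    /* Bytes needed to encode one 31-bit character value under the
     * FSS-UTF scheme; thresholds per the proposal, as I read it. */
    int utflen(unsigned long c)
    {
        if (c < 0x80)      return 1;  /* 7-bit ASCII unchanged */
        if (c < 0x800)     return 2;  /* ISO 8859-x repertoires land here */
        if (c < 0x10000)   return 3;  /* 16-bit chars: Kanji/Han */
        if (c < 0x200000)  return 4;
        if (c < 0x4000000) return 5;
        return 6;                     /* full 31-bit range */
    }

Three bytes against two for raw 16-bit storage is a 50 percent penalty
on Kanji text, which is exactly the density worry above.
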
[ ... conversion and application size problems if UTF isn't universal in
an application ... ]

> ahhh, therein lies the nub! Plan 9 is UTF all the way
> through. by and large, the problem with this approach for
> unix systems should only be in the moral equivalent of the tty driver.
> once characters are entered, everything is just UTF. the OS
> and libraries and compilers and ... all need to be recompiled
> with care. this is not as bad as it seems but is a lot of fussing.
> on the other hand, ascii systems are there already!

Conceded. I wasn't thinking that applications would be doing processing in
UTF. Given that, what you say is true.

>~ 4) Backward compatibility with existing systems requires the ability
>~ to crossmount using existing mechanisms (NFS, RFS, etc.). How
>~ can this be supported without kernel code, when the remote
>~ system is ignorant of UTF decoding? With kernel code, what
>~ is the purpose in using the UTF encoding as the native processing
>~ mode in the application?
>
> well, if the existing remote systems are just ascii,
> then you have no problem. by and large, everything just works
> (if your system is 8-bit clean). if the remote systems are
> (and remain) non-ascii, then you have problems. the purpose
> (at least in plan 9) of the kernel using UTF is that the entire
> system uses UTF -- all text strings are 10646 chars UTF encoded into
> a byte stream.

Here's the major sticking point for me. I can't conceive of updating all
software on an enterprise-wide basis without a great deal of kicking,
scratching, and bleeding all over. Even when users want to update, there
are problems.

Even on an 8-bit clean remote system, you may be able to store UTF-encoded
data, but you will only be able to use it locally on the storage system
(assuming it's older) if it's 7-bit US ASCII. This works well in the
international market only in those countries using a 7-bit ASCII variant
(an NRCS -- "National Replacement Character Set") that is stored as if it
were simply 7-bit ASCII. That is hardly real internationalization.

For this to work, I think there has to be some way to make older 8-bit
clean internationalizations (ie: ISO 8859-x character sets) for small
glyph-set languages interoperate with the newer machines as they are
brought on line; this pretty much rules out any form of variant encoding.

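To illustrate why variant encodings can't silently coexist: a receiving
system can only guess what it has by scanning the data, with a check
something like this hypothetical one (bit patterns per the FSS-UTF
proposal as I read it; sequences past 4 bytes elided):

    /* Does this buffer even parse as UTF?  Typical ISO 8859-x text
     * with accented characters fails almost immediately.  A sketch. */
    int looks_like_utf(const unsigned char *s, int n)
    {
        int i = 0, extra;

        while (i < n) {
            if (s[i] < 0x80) { i++; continue; }        /* plain ASCII */
            if      ((s[i] & 0xE0) == 0xC0) extra = 1;
            else if ((s[i] & 0xF0) == 0xE0) extra = 2;
            else if ((s[i] & 0xF8) == 0xF0) extra = 3;
            else return 0;                             /* bad lead byte */
            while (extra--)
                if (++i >= n || (s[i] & 0xC0) != 0x80)
                    return 0;                          /* bad continuation */
            i++;
        }
        return 1;
    }

An ISO 8859-1 file containing "café" fails the continuation check at the
'é'; the two can be told apart, but only by scanning, because nothing
in-band marks which one you were handed.
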
>~ There are workarounds to these, but they are much uglier than compact
>~ storage processing, where storage is done in a locale-specific set if
>~ possible. This would also allow a business to convert relatively
>~ painlessly by leveraging current investments in prior localization
>~ technology.
>
> i'd be the first to admit the plan 9 approach has its
>difficulties. however, there is a nice coherence to the system
>that applies to everything it connects to. i am unimpressed with
>the alternative you suggest (although i admire the length and thoroughness
>you have shown in discussing it) because it seems to me like
>``here is an island, there is an island, interchange is a bitch''.
>all interchange must be explicitly typed and converted on system
>boundaries. and i have no idea how you do that. you can't convert
>binaries, just text; but how do you tell the difference? and how
>do you recognise system boundaries? (for example, when i mount
>a remote system's files, i may be getting several systems' files;
>how does the file server know which ones to convert, and how?)

I think this is only an issue if you use "compact storage", as I have been
calling storage in 8-bit character sets when the data fits. Even then, it's
quite possible to arrange for interchange problems to be taken care of
during initial configuration.

Recognizing machine boundaries for NFS wouldn't be necessary; by
definition, "compact storage" presents the localized data in the format
expected by the remote machine. For multilingual files, it's not possible
to use them on a remote system in any case; if there is truly a need, an
internationalized application on the localized machine could be made to
understand the compound document format.

In the case of NFS mounts, on an internationalized system, of localized
data on an older machine, you could always tag the mount point with an
attribute indicating that all files below that point are to be treated
as having that locale in the localized vnodes for the remote files.

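A sketch of what I mean by attributing the mount point (every name here
is invented for illustration; this is not from any existing VFS):

    enum charset { CS_UTF, CS_ISO8859_1, CS_ISO8859_7 };

    struct mount {
        enum charset locale;    /* declared once, at mount time */
    };

    struct vnode {
        struct mount *v_mount;
    };

    /* Every file below the attributed mount point inherits the
     * mount's locale; the internationalized client then knows how to
     * interpret the old machine's bytes without the old machine
     * knowing anything about it. */
    enum charset vnode_charset(const struct vnode *vp)
    {
        return vp->v_mount ? vp->v_mount->locale : CS_UTF;
    }

The old machine keeps serving raw bytes; all the intelligence lives on
the new system, which is the whole point of a gradual update.
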
Or, if it were too difficult, throw out compacted data entirely (I would
probably be loath to do this, now that I've talked about it enough for it
to go from "idea" to "really neat idea" 8-)). The potential for
compression of data where a given 8 bits hold the same value for the
entirety of the file is real, real high. Storage could be at nearly the
8-bit level. For a UNIX system (I am at a disadvantage when it comes to
Plan 9 internals!), this could be handled at the device level (ie: bread,
bwrite). The savings of UTF storage over and above this type of encoding
would probably be minimal enough that the only locales that would feel it
would be 7-bit ASCII and close variants thereof.

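Here's a minimal sketch of that compression idea, assuming 16-bit
characters whose high byte is constant for the whole file (the names are
invented; this is not a device-level implementation):

    /* If every 16-bit character in a file shares one high byte (true
     * of monolingual text within a single code "row"), store that
     * byte once and keep only the low bytes: storage at nearly the
     * 8-bit level.  Assumes unsigned short is 16 bits. */
    int compact16(const unsigned short *in, long n,
                  unsigned char *hibyte, unsigned char *out)
    {
        long i;

        if (n <= 0)
            return -1;
        *hibyte = in[0] >> 8;
        for (i = 0; i < n; i++) {
            if ((in[i] >> 8) != *hibyte)
                return -1;           /* mixed rows: can't compact */
            out[i] = in[i] & 0xFF;
        }
        return 0;
    }

Expansion is just the reverse: re-attach the stored high byte to each
low byte on the way back in (ie: at bread/bwrite time).
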
I don't think any of this results in an island where interchange is not
possible. I think UTF encoding results in the same problems in interchange,
since most of them arise from (to continue the analogy) "an island with
different natives". There are already inherent difficulties in relaying
attributes over data-stream technologies: attribution is mandatory in-band
for multilingual data, and either in-band or out-of-band (if we can find
a way to overcome the obstacles out-of-band raises) for locale-specific
monolingual data stored using Unicode. I think the "island" problem is
smaller than a lot of people have presented it (ie: it *is* soluble),
but I think it's shared by any approach requiring data above and beyond
what the character set itself communicates. This is the primary
objection voiced to Unicode and Unicode-based systems (ie: 10646).


Terry Lambert
terry@icarus.weber.edu
terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------