- Path: sparky!uunet!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!att!dptg!ulysses!allegra!alice!andrew
- From: andrew@alice.att.com (Andrew Hume)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Message-ID: <24579@alice.att.com>
- Date: 9 Jan 93 07:51:57 GMT
- Article-I.D.: alice.24579
- References: <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan9.024546.26934@fcom.cc.utah.edu>
- Organization: AT&T Bell Laboratories, Murray Hill NJ
- Lines: 119
-
- In article <1993Jan9.024546.26934@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
- ~ In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
- ~ >Using Plan 9 utf you know the maximum size in bytes that a Rune can
- ~ >be encoded into. Fixed fields only have to be a multiple of this value,
- ~ >defined to be UTFmax. So where's the problem?
- ~
- ~ [ First a clarification of something which is my fault because of my
- ~ background in comm software: I have been informed that the currently
- ~ "blessed" correct terminlogy for what I have been calling "Runic
- ~ encoding" is "Process code", "File code", or "Interchange code". I'll
- ~ try to call it "Interchange code" from now on (I feel the other terms
- ~ imply applications, some of which I disagree with). ]
- ~
- ~ A good question. The answer is fourfold:
- ~
- ~ 1) The field length is still not descriptive of the number of
- ~ characters a field may contain; you have only set a minimum
- ~ "maximum character count". It is still possible to enter
- ~ in more characters in a non-maximally encoded character set
- ~ than this value, unless you adopt the abstraction of a
- ~ maximum character count seperate from the maximum necessary
- ~ storage length. The maximum character count is what we
- ~ would normally think of as the displayed field length (ie:
- ~ what you see as the size of an input box). If my field
- ~ length is 6 bytes times the maximum record length for 31
- ~ bits (as with UTF encoding), would I not be better off using
- ~ raw 32 bit encoding, yielding only 4 times the maximum
- ~ record length for the same data (or only 2 times for 16 bit
- ~ data)?
-
- what objective function are you maximising? frankly,
- if i were doing fixed records, i would keep char/byte counts.
- the reason to use UTF is precisely system-independence (see below).
- if density is an issue (it rarely is for us), then i agree --
- 16bit encodings save you space for kanji.
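- 
- to make the bookkeeping concrete, here is a sketch in C (my own
- illustration, not plan 9 source; NCHARS and UTFMAX are made-up
- names) of a fixed record that keeps both counts:
- 
-     #include <stdio.h>
- 
-     enum {
-         NCHARS = 20,    /* display width: max characters */
-         UTFMAX = 6      /* worst-case bytes per 31-bit char in UTF */
-     };
- 
-     /* hypothetical fixed record keeping char and byte counts */
-     struct Field {
-         int  nchars;                /* characters actually stored */
-         int  nbytes;                /* bytes actually used */
-         char data[NCHARS*UTFMAX];   /* worst-case storage */
-     };
- 
-     int
-     main(void)
-     {
-         struct Field f = { 0, 0, "" };
- 
-         printf("%d chars need at most %d bytes\n",
-             NCHARS, (int)sizeof f.data);
-         return 0;
-     }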
-
- ~ 2) A database application with memory mapped I/O for both the
- ~ front and back ends cannot keep a consistent coding for
- ~ data throughout the application. If I memory map a UTF
- ~ file, conversion is required before searches. This will
- ~ either have to be done at search time, or will have to be
- ~ done via non-UTF encoded cache buffers within each database
- ~ engine (there may be several for a single database). This
- ~ is wasteful of user-space memory and requires additional
- ~ swapping of pages for the user-space cache implementation.
- ~ Further, UTF encoded data must be converted into some
- ~ native representation before it can be written to display
- ~ memory if memory mapped output is used.
-
- as it turns out, MANY (but not all) applications can
- work quite well on the UTF-encoded text. mostly, apps search
- for literal strings, and all that happens is that their length
- increases a little. our experience in plan 9 is that relatively
- few applications do enough that it is more efficient to convert
- to Runes before processing.
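- 
- a small illustration of the literal-string point (my own sketch;
- the bytes are hand-encoded for the example):
- 
-     #include <stdio.h>
-     #include <string.h>
- 
-     int
-     main(void)
-     {
-         /* "héllo", with é as the two bytes 0xC3 0xA9 */
-         char *text = "they say h\xc3\xa9llo here";
-         char *pat = "h\xc3\xa9llo";
-         char *hit;
- 
-         /* plain byte search, no decoding: since no character's
-            encoding can start inside another's, a byte match
-            is a character match */
-         hit = strstr(text, pat);
-         if(hit != NULL)
-             printf("match at byte offset %ld\n", (long)(hit-text));
-         return 0;
-     }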
-
- ~ 3) If UTF is used for internal encoding in the application,
- ~ (such as would be necessary for direct use of memory mapped
- ~ files), then you have a dilemma: the natural input mechanism
- ~ into such a system is also UTF encoded data. This means that
- ~ either the device must directly supply UTF encoded data (a
- ~ problem, in that the reason the memory-mapped files are UTF in
- ~ user space is our reluctance to do translation to and from
- ~ UTF within the kernel), or each application must carry around
- ~ its own raw-input-to-UTF-conversion code. This enlarges the
- ~ size of the application. Note that this conversion code is
- ~ not "already there" if the native processing mode for an
- ~ application is UTF, since other conversions would not take
- ~ place.
-
- ahhh, therein lies the nub! Plan 9 is UTF all the way
- through. by and large, the problem with this approach for
- unix systems should only be in the moral equivalent of the tty driver.
- once characters are entered, everything is just UTF. the OS
- and libraries and compilers and ... all need to be recompiled
- with care. this is not as bad as it seems but is a lot of fussing.
- on the other hand, ascii systems are there already!
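- 
- for flavor, a simplified decoder in the spirit of plan 9's
- chartorune() -- an illustrative sketch, not the library routine;
- it handles only the 1-, 2- and 3-byte forms that cover 16-bit
- chars, and skips the validation real code needs:
- 
-     typedef unsigned short Rune;
- 
-     int
-     sketch_chartorune(Rune *r, const char *s)  /* hypothetical name */
-     {
-         unsigned char c = s[0];
- 
-         if(c < 0x80){               /* 0xxxxxxx: ascii, 1 byte */
-             *r = c;
-             return 1;
-         }
-         if((c & 0xE0) == 0xC0){     /* 110xxxxx 10xxxxxx */
-             *r = ((c & 0x1F)<<6) | (s[1] & 0x3F);
-             return 2;
-         }
-         if((c & 0xF0) == 0xE0){     /* 1110xxxx 10xxxxxx 10xxxxxx */
-             *r = ((c & 0x0F)<<12) | ((s[1] & 0x3F)<<6) | (s[2] & 0x3F);
-             return 3;
-         }
-         *r = '?';                   /* real code returns a
-                                        distinguished error rune */
-         return 1;
-     }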
-
- ~ 4) Backward compatibility with existing systems requires the ability
- ~ to crossmount using existing mechanisms (NFS, RFS, etc.). How
- ~ can this be supported without kernel code, when the remote
- ~ system is ignorant of UTF decoding? With kernel code, what
- ~ is the purpose in using the UTF encoding as the native processing
- ~ mode in the application?
-
- well, if the existing remote systems are just ascii,
- then you have no problem. by and large, everything just works
- (if your system is 8-bit clean). if the remote systems are
- (and remain) non-ascii, then you have problems. the purpose
- (at least in plan 9) of the kernel using UTF is that the entire
- system uses UTF -- all text strings are 10646 chars, UTF-encoded into
- a byte stream.
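- 
- the ``8-bit clean'' remark is worth spelling out: no byte of a
- multi-byte UTF character falls in the ascii range, so NUL and '/'
- only ever mean themselves on the wire. a quick check of that
- property for the two-byte form (again my own sketch):
- 
-     #include <assert.h>
- 
-     int
-     main(void)
-     {
-         unsigned long r;
-         unsigned char b0, b1;
- 
-         /* two-byte form 110xxxxx 10xxxxxx covers chars 0x80-0x7FF */
-         for(r = 0x80; r <= 0x7FF; r++){
-             b0 = 0xC0 | (r >> 6);
-             b1 = 0x80 | (r & 0x3F);
-             assert(b0 >= 0x80 && b1 >= 0x80);   /* never an ascii byte */
-         }
-         return 0;
-     }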
-
- ~ There are workarounds to these, but they are much uglier than compact
- ~ storage processing, where storage is done in a locale-specific set, if
- ~ possible. This would also allow a business to convert relatively
- ~ painlessly by leveraging current investments in prior localization
- ~ technology.
-
- i'd be the first to admit the plan 9 approach has its
- difficulties. however, there is a nice coherence to the system
- that applies to everything it connects to. i am unimpressed with
- the alternative you suggest (although i admire the length and thoroughness
- you have shown in discussing it) because it seems to me like
- ``here is an island, there is an island, interchange is a bitch''.
- all interchange must be explicitly typed and converted on system
- boundaries. and i have no idea how you do that. you can't convert
- binaries, just text; but how do you tell the difference? and how
- do you recognise system boundaries? (for example, when i mount
- a remote system's files, i may be getting several systems' files;
- how does the file server know which ones to convert and how?)
- plan 9 is inherently very distributed and so we see
- the difficulties of the island approach earlier or more clearly.
- on the other hand, i'd be the last person to advocate massive system
- upheavals just for the sake of using 10646. if you are willing to be
- an island, and for many systems it may be the practical choice, then
- all power to you.
-
- andrew
-