- Path: sparky!uunet!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!att!dptg!ulysses!allegra!alice!andrew
- From: andrew@alice.att.com (Andrew Hume)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Message-ID: <24579@alice.att.com>
- Date: 9 Jan 93 07:51:57 GMT
- Article-I.D.: alice.24579
- References: <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan9.024546.26934@fcom.cc.utah.edu>
- Organization: AT&T Bell Laboratories, Murray Hill NJ
- Lines: 119
-
- In article <1993Jan9.024546.26934@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
- ~ In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
- ~ >Using Plan 9 utf you know the maximum size in bytes that a Rune can
- ~ >be encoded into. Fixed fields only have to be a multiple of this value,
- ~ >defined to be UTFmax. So where's the problem?
- ~
- ~ [ First a clarification of something which is my fault because of my
- ~ background in comm software: I have been informed that the currently
- ~ "blessed" correct terminlogy for what I have been calling "Runic
- ~ encoding" is "Process code", "File code", or "Interchange code". I'll
- ~ try to call it "Interchange code" from now on (I feel the other terms
- ~ imply applications, some of which I disagree with). ]
- ~
- ~ A good question. The answer is fourfold:
- ~
- ~ 1) The field length is still not descriptive of the number of
- ~ characters a field may contain; you have only set a minimum
- ~ "maximum character count". It is still possible to enter
- ~ in more characters in a non-maximally encoded character set
- ~ than this value, unless you adopt the abstraction of a
- ~ maximum character count seperate from the maximum necessary
- ~ storage length. The maximum character count is what we
- ~ would normally think of as the displayed field length (ie:
- ~ what you see as the size of an input box). If my field
- ~ length is 6 bytes times the maximum record length for 31
- ~ bits (as with UTF encoding), would I not be better off using
- ~ raw 32 bit encoding, yielding only 4 times the maximum
- ~ record length for the same data (or only 2 times for 16 bit
- ~ data)?
-
- what objective function are you maximising? frankly,
- if i were doing fixed records, i would keep char/byte counts.
- the reason to use UTF is precisely system-independence (see below).
- if density is an issue (it rarely is for us), then i agree --
- 16bit encodings save you space for kanji.
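- 
- to make the bookkeeping concrete, here is a sketch in C (my own
- illustration, not plan 9 source; NCHARS and UTFMAX are made-up
- names) of a fixed record that keeps both counts:
- 
-     #include <stdio.h>
- 
-     enum {
-         NCHARS = 20,    /* display width: max characters */
-         UTFMAX = 6      /* worst-case bytes per 31-bit char in UTF */
-     };
- 
-     /* hypothetical fixed record keeping char and byte counts */
-     struct Field {
-         int  nchars;                /* characters actually stored */
-         int  nbytes;                /* bytes actually used */
-         char data[NCHARS*UTFMAX];   /* worst-case storage */
-     };
- 
-     int
-     main(void)
-     {
-         struct Field f = { 0, 0, "" };
- 
-         printf("%d chars need at most %d bytes\n",
-             NCHARS, (int)sizeof f.data);
-         return 0;
-     }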
-
- ~ 2) A database application with memory mapped I/O for both the
- ~ front and back ends cannot keep a consistent coding for
- ~ data throughout the application. If I memory map a UTF
- ~ file, conversion is required before searches. This will
- ~ either have to be done at search time, or will have to be
- ~ done via non-UTF encoded cache buffers within each database
- ~ engine (there may be several for a single database). This
- ~ is wasteful of user-space memory and requires additional
- ~ swapping of pages for the user-space cache implementation.
- ~ Further, UTF encoded data must be converted into some
- ~ native representation before it can be written to display
- ~ memory if memory mapped output is used.
-
- as it turns out, MANY (but not all) applications can
- work quite well on the UTF-encoded text. mostly, apps search
- for literal strings, and all that happens is that their length
- increases a little. our experience in plan 9 is that relatively
- few applications do enough that it is more efficient to convert
- to Runes before processing.
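- 
- a small illustration of the literal-string point (my own sketch;
- the bytes are hand-encoded for the example):
- 
-     #include <stdio.h>
-     #include <string.h>
- 
-     int
-     main(void)
-     {
-         /* "héllo", with é as the two bytes 0xC3 0xA9 */
-         char *text = "they say h\xc3\xa9llo here";
-         char *pat = "h\xc3\xa9llo";
-         char *hit;
- 
-         /* plain byte search, no decoding: since no character's
-            encoding can start inside another's, a byte match
-            is a character match */
-         hit = strstr(text, pat);
-         if(hit != NULL)
-             printf("match at byte offset %ld\n", (long)(hit-text));
-         return 0;
-     }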
-
- ~ 3) If UTF is used for internal encoding in the application,
- ~ (such as would be necessary for direct use of memory mapped
- ~ files), then you have a dilemma: the natural input mechanism
- ~ into such a system is also UTF encoded data. This means that
- ~ either the device must directly supply UTF encoded data (a
- ~ problem, in that the reason the memory-mapped files are UTF in
- ~ user space is our reluctance to do translation to and from
- ~ UTF within the kernel), or each application must carry around
- ~ its own raw-input-to-UTF-conversion code. This enlarges the
- ~ size of the application. Note that this conversion code is
- ~ not "already there" if the native processing mode for an
- ~ application is UTF, since other conversions would not take
- ~ place.
-
- ahhh, therein lies the nub! Plan 9 is UTF all the way
- through. by and large, the problem with this approach for
- unix systems should only be in the moral equivalent of the tty driver.
- once characters are entered, everything is just UTF. the OS
- and libraries and compilers and ... all need to be recompiled
- with care. this is not as bad as it seems but is a lot of fussing.
- on the other hand, ascii systems are there already!
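- 
- for flavor, a simplified decoder in the spirit of plan 9's
- chartorune() -- an illustrative sketch, not the library routine;
- it handles only the 1-, 2- and 3-byte forms that cover 16-bit
- chars, and skips the validation real code needs:
- 
-     typedef unsigned short Rune;
- 
-     int
-     sketch_chartorune(Rune *r, const char *s)  /* hypothetical name */
-     {
-         unsigned char c = s[0];
- 
-         if(c < 0x80){               /* 0xxxxxxx: ascii, 1 byte */
-             *r = c;
-             return 1;
-         }
-         if((c & 0xE0) == 0xC0){     /* 110xxxxx 10xxxxxx */
-             *r = ((c & 0x1F)<<6) | (s[1] & 0x3F);
-             return 2;
-         }
-         if((c & 0xF0) == 0xE0){     /* 1110xxxx 10xxxxxx 10xxxxxx */
-             *r = ((c & 0x0F)<<12) | ((s[1] & 0x3F)<<6) | (s[2] & 0x3F);
-             return 3;
-         }
-         *r = '?';                   /* real code returns a
-                                        distinguished error rune */
-         return 1;
-     }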
-
- ~ 4) Backward compatibility with existing systems requires the ability
- ~ to crossmount using existing mechanisms (NFS, RFS, etc.). How
- ~ can this be supported without kernel code, when the remote
- ~ system is ignorant of UTF decoding? With kernel code, what
- ~ is the purpose in using the UTF encoding as the native processing
- ~ mode in the application?
-
- well, if the existing remote systems are just ascii,
- then you have no problem. by and large, everything just works
- (if your system is 8-bit clean). if the remote systems are
- (and remain) non-ascii, then you have problems. the purpose
- (at least in plan 9) of the kernel using UTF is that the entire
- system uses UTF -- all text strings are 10646 chars, UTF-encoded into
- a byte stream.
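- 
- the ``8-bit clean'' remark is worth spelling out: no byte of a
- multi-byte UTF character falls in the ascii range, so NUL and '/'
- only ever mean themselves on the wire. a quick check of that
- property for the two-byte form (again my own sketch):
- 
-     #include <assert.h>
- 
-     int
-     main(void)
-     {
-         unsigned long r;
-         unsigned char b0, b1;
- 
-         /* two-byte form 110xxxxx 10xxxxxx covers chars 0x80-0x7FF */
-         for(r = 0x80; r <= 0x7FF; r++){
-             b0 = 0xC0 | (r >> 6);
-             b1 = 0x80 | (r & 0x3F);
-             assert(b0 >= 0x80 && b1 >= 0x80);   /* never an ascii byte */
-         }
-         return 0;
-     }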
-
- ~ There are workarounds to these, but they are much uglier than compact
- ~ storage processing, where storage is done in a locale-specific set, if
- ~ possible. This would also allow a business to convert relatively
- ~ painlessly by leveraging current investments in prior localization
- ~ technology.
-
- i'd be the first to admit the plan 9 approach has its
- difficulties. however, there is a nice coherence to the system
- that applies to everything it connects to. i am unimpressed with
- the alternative you suggest (although i admire the length and thoroughness
- you have shown in discussing it) because it seems to me like
- ``here is an island, there is an island, interchange is a bitch''.
- all interchange must be explicitly typed and converted on system
- boundaries. and i have no idea how you do that. you can't convert
- binaries, just text; but how do you tell the difference? and how
- do you recognise system boundaries? (for example, when i mount
- a remote system's files, i may be getting several systems' files;
- how does the file server know which ones to convert and how?)
- plan 9 is inherently very distributed and so we see
- the difficulties of the island approach earlier or more clearly.
- on the other hand, i'd be the last person to advocate massive system
- upheavals just for the sake of using 10646. if you are willing to be
- an island, and for many systems it may be the practical choice, then
- all power to you.
-
- andrew
-