Newsgroups: comp.std.internat
Path: sparky!uunet!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!overload.lbl.gov!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan9.024546.26934@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan8.092754.6344@prl.dec.com>
Date: Sat, 9 Jan 93 02:45:46 GMT
Lines: 92

In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
>In article <1993Jan7.033153.12133@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
>>
>> Consider a newline-terminated text database containing fixed-length lines,
>> or consider a database consisting of variant text records in fixed fields.
>> In either case, the amount of data per field is now variant on Runic
>> encoding.  For instance, if we accept the Plan 9 solution, an application
>> used in both England and the US will vary as to how much data is
>> representable per fixed field based on whether or not that data contains
>> the English "#" character.  This gets worse the further you get from base
>> ASCII coding.
>
>Using Plan 9 UTF you know the maximum size in bytes that a Rune can
>be encoded into.  Fixed fields only have to be a multiple of this value,
>defined to be UTFmax.  So where's the problem?

[ First, a clarification of something which is my fault because of my
background in comm software: I have been informed that the currently
"blessed" correct terminology for what I have been calling "Runic
encoding" is "Process code", "File code", or "Interchange code".  I'll
try to call it "Interchange code" from now on (I feel the other terms
imply applications, some of which I disagree with). ]

A good question.  The answer is fourfold:

1)	The field length is still not descriptive of the number of
	characters a field may contain; you have only set a minimum
	"maximum character count".  It is still possible to enter
	more characters in a non-maximally encoded character set
	than this value, unless you adopt the abstraction of a
	maximum character count separate from the maximum necessary
	storage length.  The maximum character count is what we
	would normally think of as the displayed field length (i.e.,
	what you see as the size of an input box).  If my field
	length is 6 bytes times the maximum record length for 31
	bits (as with UTF encoding), would I not be better off using
	raw 32-bit encoding, yielding only 4 times the maximum
	record length for the same data (or only 2 times for 16-bit
	data)?  The sketch below walks through this arithmetic.

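To make that arithmetic concrete, here is a minimal sketch; the macro
names and the 80-character field width are mine for illustration, not
from any real header, and UTFMAX_31BIT assumes the 6-byte worst case
for one 31-bit character:

	#include <stdio.h>

	#define UTFMAX_31BIT	6	/* worst-case bytes per 31-bit character in UTF */
	#define RAW32_BYTES	4	/* fixed bytes per character, raw 32-bit code */

	int
	main(void)
	{
		int chars = 80;		/* displayed field length: max character count */

		printf("UTF field reservation:   %d bytes\n", chars * UTFMAX_31BIT);
		printf("raw 32-bit reservation:  %d bytes\n", chars * RAW32_BYTES);
		/* 480 vs. 320 bytes to guarantee the same 80 characters */
		return 0;
	}
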
2)	A database application with memory-mapped I/O for both the
	front and back ends cannot keep a consistent coding for
	data throughout the application.  If I memory-map a UTF
	file, conversion is required before searches.  This will
	either have to be done at search time, or will have to be
	done via non-UTF-encoded cache buffers within each database
	engine (there may be several for a single database).  This
	is wasteful of user-space memory and requires additional
	swapping of pages for the user-space cache implementation.
	Further, UTF-encoded data must be converted into some
	native representation before it can be written to display
	memory if memory-mapped output is used.  The sketch below
	shows the search-time conversion cost.

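A rough sketch of the search-time case: every byte of the mapped region
has to pass through a decoder before it can be compared.  The decoder
here is a simplified stand-in of my own, not Plan 9's chartorune(); it
assumes valid input and handles only the 1- to 3-byte forms:

	typedef unsigned long Rune;

	/* decode one character, return the number of bytes consumed */
	static int
	decode(const unsigned char *s, Rune *r)
	{
		if (s[0] < 0x80) {			/* 0xxxxxxx */
			*r = s[0];
			return 1;
		}
		if ((s[0] & 0xe0) == 0xc0) {		/* 110xxxxx 10xxxxxx */
			*r = ((Rune)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
			return 2;
		}
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		*r = ((Rune)(s[0] & 0x0f) << 12) | ((Rune)(s[1] & 0x3f) << 6)
		    | (s[2] & 0x3f);
		return 3;
	}

	/* linear search of a memory-mapped UTF region for one rune */
	long
	findrune(const unsigned char *map, long len, Rune want)
	{
		long off = 0;
		Rune r;

		while (off < len) {
			int n = decode(map + off, &r);
			if (r == want)
				return off;
			off += n;
		}
		return -1;
	}

None of this decoding would be needed if the mapped data were already
in a fixed-width process code.
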
3)	If UTF is used for internal encoding in the application
	(such as would be necessary for direct use of memory-mapped
	files), then you have a dilemma: the natural input mechanism
	into such a system is also UTF-encoded data.  This means that
	either the device must directly supply UTF-encoded data (a
	problem, in that the reason the memory-mapped files are UTF in
	user space is our reluctance to do translation to and from
	UTF within the kernel), or each application must carry around
	its own raw-input-to-UTF conversion code.  This enlarges the
	size of the application.  Note that this conversion code is
	not "already there" if the native processing mode for an
	application is UTF, since other conversions would not take
	place.  A sketch of the per-application baggage follows.

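For instance, if the kernel won't translate, every program reading a
raw 8-bit device (say, a Latin-1 terminal) must carry something like
the following.  The function is hypothetical, not a real Plan 9 or
UNIX interface:

	/* encode one Latin-1 input byte as UTF; returns bytes written */
	static int
	latin1toutf(unsigned char c, unsigned char *out)
	{
		if (c < 0x80) {			/* ASCII passes through */
			out[0] = c;
			return 1;
		}
		out[0] = 0xc0 | (c >> 6);	/* 110xxxxx */
		out[1] = 0x80 | (c & 0x3f);	/* 10xxxxxx */
		return 2;
	}

Small by itself, but it (and its equivalents for other input codesets)
ends up duplicated in every application rather than living in one place.
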
4)	Backward compatibility with existing systems requires the
	ability to cross-mount using existing mechanisms (NFS, RFS,
	etc.).  How can this be supported without kernel code, when
	the remote system is ignorant of UTF decoding?  With kernel
	code, what is the purpose in using the UTF encoding as the
	native processing mode in the application?

There are workarounds for these, but they are much uglier than compact
storage processing, where storage is done in a locale-specific character
set where possible.  This would also allow a business to convert
relatively painlessly by leveraging current investments in prior
localization technology.  A sketch of what I mean by tagged,
locale-specific storage follows.

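Something like the following, where the tag values are invented for
illustration: text is stored in its locale's own (usually 8-bit)
codeset and converted only when it crosses a locale boundary, so
existing data stays compact and usable as-is:

	enum codeset {
		CS_ASCII,	/* 7-bit */
		CS_LATIN1,	/* ISO 8859-1, 1 byte per character */
		CS_SJIS		/* Shift-JIS, 1-2 bytes per character */
	};

	struct textfield {
		enum codeset	set;	/* codeset the bytes are stored in */
		unsigned short	len;	/* length in bytes, not characters */
		unsigned char	data[80];
	};
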

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------