Newsgroups: comp.std.internat
Path: sparky!uunet!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!overload.lbl.gov!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan9.024546.26934@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1993Jan8.092754.6344@prl.dec.com>
Date: Sat, 9 Jan 93 02:45:46 GMT
Lines: 92

In article <1993Jan8.092754.6344@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
>In article <1993Jan7.033153.12133@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
>>
>> Consider a newline-terminated text database containing fixed-length lines,
>> or consider a database consisting of variant text records in fixed fields.
>> In either case, the amount of data per field is now variant on Runic
>> encoding.  For instance, if we accept the Plan 9 solution, an application
>> used in both England and the US will vary as to how much data is
>> representable per fixed field based on whether or not that data contains
>> the English "#" character.  This gets worse the further you get from base
>> ASCII coding.
>
>Using Plan 9 UTF you know the maximum size in bytes that a Rune can
>be encoded into.  Fixed fields only have to be a multiple of this value,
>defined to be UTFmax.  So where's the problem?

[ First, a clarification of something which is my fault because of my
background in comm software: I have been informed that the currently
"blessed" correct terminology for what I have been calling "Runic
encoding" is "Process code", "File code", or "Interchange code".  I'll
try to call it "Interchange code" from now on (I feel the other terms
imply applications, some of which I disagree with). ]

A good question.  The answer is fourfold:

1)	The field length is still not descriptive of the number of
	characters a field may contain; you have only set a minimum
	"maximum character count".  It is still possible to enter
	more characters in a non-maximally encoded character set
	than this value, unless you adopt the abstraction of a
	maximum character count separate from the maximum necessary
	storage length.  The maximum character count is what we
	would normally think of as the displayed field length (i.e.,
	what you see as the size of an input box).  If my field
	length is 6 bytes times the maximum record length for 31
	bits (as with UTF encoding), would I not be better off using
	raw 32-bit encoding, yielding only 4 times the maximum
	record length for the same data (or only 2 times for 16-bit
	data)?  The sketch below walks through this arithmetic.

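To make that arithmetic concrete, here is a minimal sketch; the macro
names and the 80-character field width are mine for illustration, not
from any real header, and UTFMAX_31BIT assumes the 6-byte worst case
for one 31-bit character:

	#include <stdio.h>

	#define UTFMAX_31BIT	6	/* worst-case bytes per 31-bit character in UTF */
	#define RAW32_BYTES	4	/* fixed bytes per character, raw 32-bit code */

	int
	main(void)
	{
		int chars = 80;		/* displayed field length: max character count */

		printf("UTF field reservation:   %d bytes\n", chars * UTFMAX_31BIT);
		printf("raw 32-bit reservation:  %d bytes\n", chars * RAW32_BYTES);
		/* 480 vs. 320 bytes to guarantee the same 80 characters */
		return 0;
	}
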
2)	A database application with memory-mapped I/O for both the
	front and back ends cannot keep a consistent coding for
	data throughout the application.  If I memory-map a UTF
	file, conversion is required before searches.  This will
	either have to be done at search time, or will have to be
	done via non-UTF-encoded cache buffers within each database
	engine (there may be several for a single database).  This
	is wasteful of user-space memory and requires additional
	swapping of pages for the user-space cache implementation.
	Further, UTF-encoded data must be converted into some
	native representation before it can be written to display
	memory if memory-mapped output is used.  The sketch below
	shows the search-time conversion cost.

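A rough sketch of the search-time case: every byte of the mapped region
has to pass through a decoder before it can be compared.  The decoder
here is a simplified stand-in of my own, not Plan 9's chartorune(); it
assumes valid input and handles only the 1- to 3-byte forms:

	typedef unsigned long Rune;

	/* decode one character, return the number of bytes consumed */
	static int
	decode(const unsigned char *s, Rune *r)
	{
		if (s[0] < 0x80) {			/* 0xxxxxxx */
			*r = s[0];
			return 1;
		}
		if ((s[0] & 0xe0) == 0xc0) {		/* 110xxxxx 10xxxxxx */
			*r = ((Rune)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
			return 2;
		}
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		*r = ((Rune)(s[0] & 0x0f) << 12) | ((Rune)(s[1] & 0x3f) << 6)
		    | (s[2] & 0x3f);
		return 3;
	}

	/* linear search of a memory-mapped UTF region for one rune */
	long
	findrune(const unsigned char *map, long len, Rune want)
	{
		long off = 0;
		Rune r;

		while (off < len) {
			int n = decode(map + off, &r);
			if (r == want)
				return off;
			off += n;
		}
		return -1;
	}

None of this decoding would be needed if the mapped data were already
in a fixed-width process code.
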
3)	If UTF is used for internal encoding in the application
	(such as would be necessary for direct use of memory-mapped
	files), then you have a dilemma: the natural input mechanism
	into such a system is also UTF-encoded data.  This means that
	either the device must directly supply UTF-encoded data (a
	problem, in that the reason the memory-mapped files are UTF in
	user space is our reluctance to do translation to and from
	UTF within the kernel), or each application must carry around
	its own raw-input-to-UTF conversion code.  This enlarges the
	size of the application.  Note that this conversion code is
	not "already there" if the native processing mode for an
	application is UTF, since other conversions would not take
	place.  A sketch of the per-application baggage follows.

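For instance, if the kernel won't translate, every program reading a
raw 8-bit device (say, a Latin-1 terminal) must carry something like
the following.  The function is hypothetical, not a real Plan 9 or
UNIX interface:

	/* encode one Latin-1 input byte as UTF; returns bytes written */
	static int
	latin1toutf(unsigned char c, unsigned char *out)
	{
		if (c < 0x80) {			/* ASCII passes through */
			out[0] = c;
			return 1;
		}
		out[0] = 0xc0 | (c >> 6);	/* 110xxxxx */
		out[1] = 0x80 | (c & 0x3f);	/* 10xxxxxx */
		return 2;
	}

Small by itself, but it (and its equivalents for other input codesets)
ends up duplicated in every application rather than living in one place.
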
4)	Backward compatibility with existing systems requires the
	ability to cross-mount using existing mechanisms (NFS, RFS,
	etc.).  How can this be supported without kernel code, when
	the remote system is ignorant of UTF decoding?  With kernel
	code, what is the purpose in using the UTF encoding as the
	native processing mode in the application?

There are workarounds for these, but they are much uglier than compact
storage processing, where storage is done in a locale-specific character
set where possible.  This would also allow a business to convert
relatively painlessly by leveraging current investments in prior
localization technology.  A sketch of what I mean by tagged,
locale-specific storage follows.

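Something like the following, where the tag values are invented for
illustration: text is stored in its locale's own (usually 8-bit)
codeset and converted only when it crosses a locale boundary, so
existing data stays compact and usable as-is:

	enum codeset {
		CS_ASCII,	/* 7-bit */
		CS_LATIN1,	/* ISO 8859-1, 1 byte per character */
		CS_SJIS		/* Shift-JIS, 1-2 bytes per character */
	};

	struct textfield {
		enum codeset	set;	/* codeset the bytes are stored in */
		unsigned short	len;	/* length in bytes, not characters */
		unsigned char	data[80];
	};
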

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
--
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------