- Newsgroups: comp.std.internat
- Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1993Jan7.033153.12133@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com>
- Date: Thu, 7 Jan 93 03:31:53 GMT
- Lines: 119
-
- In article <id.EAHW.92A@ferranti.com> peter@ferranti.com (peter da silva) writes:
- >In article <1993Jan5.090747.29232@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >> >>or
- >> >>"pollutes" the files (all files except those stored in US-ASCII have file
- >> >>sizes which no longer reflect true character counts on the file).
- >
- >> Destruction of this information is basically unacceptable for Western users
- >
- >Gee, since this information is already nonexistent for anything but computer
- >source (computer programs and other data intended to be read by a compiler
- >or other plain text parser), and even there information is hidden in white
- >space (is that a tab or 8 spaces?), I find it hard to understand this
- >statement.
-
- Consider a newline terminated text database containing fixed length lines,
- or consider a database consisting of variant text records in fixed fields.
- In either case, the amount of data per field is now variant on Runic encoding.
- For instance, if we accept the Plan-9 solution, an application used in
- both England and the US will vary as to how much data is representable per
- fixed field based on whether or not that data contains the English "#"
- character. This gets worse the further you get from base ASCII coding.
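
To make the fixed-field point concrete, here is a minimal sketch. It assumes Python's "utf-8" codec as a stand-in for the Plan-9 encoding, an 8-byte field (FIELD_BYTES, a hypothetical width), and the pound sterling sign as the English character in question; the helper name chars_that_fit is likewise made up for illustration.

```python
FIELD_BYTES = 8  # hypothetical fixed field width, in bytes


def chars_that_fit(text, field_bytes=FIELD_BYTES, encoding="utf-8"):
    """Count how many leading characters of text fit in field_bytes bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))  # bytes this one character needs
        if used + size > field_bytes:
            break
        used += size
        count += 1
    return count


us_price = "$1234.50"       # 8 characters, all single-byte
uk_price = "\u00a31234.50"  # 8 characters, the pound sign takes 2 bytes

print(chars_that_fit(us_price))  # 8
print(chars_that_fit(uk_price))  # 7
```

The same 8-byte field holds the whole US value but truncates the UK one by a character, which is the variance described above.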
-
- Now consider a database, one of the primary methods of implementation being
- memory mapped files. Thus the storage and the in core representation must
- be the same.
-
- Now consider the mapping, which, by definition, must be done in the kernel,
- for NFS mounts to and from our internationalized machine and both other
- internationalized machines and currently installed machines. Of course
- Ohta would term this mapping a political (re interoperability) or economic
- (re updating all machines) issue. I do not believe this to be the case.
- Remember that the primary distinction will not be made for compact files,
- but for non-compact (multilingual and therefore 16 bit or larger characters)
- file contents, as well as for directory information, which, by definition,
- must be legible to all users (without proposing a localized per user or
- per language file system view -- both Ohta and Vadim abhor "locale").
-
- Further suppose a distributed application with clients for the database
- running on old and new hardware. Only those data with the same localization
- as the (non-updated) client machine's software will be usable remotely. A
- potential example of this is available in OLTP for ticketing and reservation
- systems for international airlines.
-
- Any COBOL program stores data in this fashion. Many other programs store
- data in this fashion to avoid a direct wire-translation of important values
- by character encoding them in the files. This allows a direct record seek
- based on offset. Many Ada programs also take advantage of this.
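
A rough sketch of the direct record seek, assuming 16-byte newline-terminated records in single-byte ASCII. RECORD_SIZE, make_record and read_record are hypothetical names, and an in-memory io.BytesIO stands in for the data file.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record length in bytes, newline included


def make_record(text, record_size=RECORD_SIZE):
    """Pad ASCII text with spaces to record_size - 1 bytes and add a newline."""
    body = text.encode("ascii")
    if len(body) > record_size - 1:
        raise ValueError("text does not fit in one record")
    return body.ljust(record_size - 1) + b"\n"


def read_record(f, index, record_size=RECORD_SIZE):
    """Fetch record number index with one computed seek, no scanning."""
    f.seek(index * record_size)
    return f.read(record_size)


data = io.BytesIO(b"".join(make_record(n) for n in ["ALPHA", "BRAVO", "CHARLIE"]))

print(read_record(data, 2).rstrip())  # b'CHARLIE'
```

The seek target is pure arithmetic on the record number, which only works while every record occupies exactly RECORD_SIZE bytes.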
-
- Consider that Runic encoding is antithetical in terms of single character
- changes for fixed record length files by virtue of its ability to either
- change record size (destroying the seek-offset record addressing) or by
- changing the amount of data representable in a field (destroying the
- ability to use fixed-length fields for input in the front end client).
-
- Consider the program that performs a fast record count based on a division
- of the number of characters in a file by the *fixed* record size.
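
As an illustration of how that fast count breaks, the sketch below assumes 8-byte records (7 characters plus a newline) and again uses Python's "utf-8" codec as a stand-in for a Runic encoding. The names fast_record_count and pack are hypothetical.

```python
RECORD_SIZE = 8  # hypothetical: 7 characters plus a newline per record


def fast_record_count(file_bytes, record_size=RECORD_SIZE):
    """Record count by division; valid only if every record is record_size bytes."""
    count, leftover = divmod(file_bytes, record_size)
    if leftover:
        raise ValueError("file size is not a whole number of records")
    return count


def pack(fields, encoding):
    """Pad each field to 7 characters, add a newline, then encode the file."""
    text = "".join(f.ljust(RECORD_SIZE - 1) + "\n" for f in fields)
    return text.encode(encoding)


fields = ["$10.00", "$20.00", "$30.00"]
ascii_file = pack(fields, "ascii")
print(len(ascii_file), fast_record_count(len(ascii_file)))  # 24 3

fields[0] = "\u00a310.00"  # change one character to the pound sign
runic_file = pack(fields, "utf-8")
print(len(runic_file))  # 25, no longer a multiple of RECORD_SIZE
```

A single-character change grows the first record by one byte, so the division no longer yields a whole record count and every later record starts at a shifted offset.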
-
- >> Consider for a moment, if you will, changing the first character of a
- >> field in a 1M file. Not only does this cause the record size to become
- >> variant on the data within it (thus rendering a computed lseek() useless,
- >> since the records are no longer fixed size), but it requires that the entire
- >> file contents be shifted to accommodate what used to be the rewrite of a
- >> single block.
- >
- >Files containing fixed sized fields will be unable to contain runic data within
- >the fields, or the fields will have to be padded, just as they are now. The
- >occasions you can do this sort of meddling in a plaintext file are pretty
- >limited already.
-
- Limited in applicability, but, I would argue, wide in use.
-
- Padding is unacceptable, both because it destroys the meaning of field width
- in the front end and because of the need for memory-mapping of files. A
- database program is likely to require the ability to either partially or
- fully control its own pages. Consider a two-stage commit database with
- a need to compute transitive closure on potentially intersecting record
- range locks (ie: locks spanning more than one record) during the acquisition
- of a lock in a potential deadlock situation (ie: P1 wants locks on a+b and P2
- wants locks on b+c and P3 wants locks on c+a [in that locking order]). This
- is a rather common need in any multi-client database (a file server is an
- instance of a multi-client database).
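
The P1/P2/P3 case can be sketched as a transitive closure over a wait-for graph. This is a simplification rather than a real lock manager: it assumes each record (a, b, c) is held by exactly one process and ignores record ranges, and the names transitive_closure and deadlocked are hypothetical.

```python
def transitive_closure(waits_for):
    """Warshall-style closure: reach[p] is every process p ends up waiting on."""
    nodes = sorted(set(waits_for) | {q for qs in waits_for.values() for q in qs})
    reach = {p: set(waits_for.get(p, ())) for p in nodes}
    for k in nodes:
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]  # i waits on k, so on all k waits on
    return reach


def deadlocked(waits_for):
    """A process is deadlocked if it transitively waits on itself."""
    reach = transitive_closure(waits_for)
    return {p for p in reach if p in reach[p]}


# P1 holds a and wants b, P2 holds b and wants c, P3 holds c and wants a.
holds = {"a": "P1", "b": "P2", "c": "P3"}
wants = {"P1": "b", "P2": "c", "P3": "a"}
waits_for = {p: {holds[r]} for p, r in wants.items()}

print(sorted(deadlocked(waits_for)))  # ['P1', 'P2', 'P3']
```

With range locks, the holder of a record would be found by intersecting byte ranges, which is where fixed record offsets matter.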
-
- There are a great many reasons to avoid Runic encoding, not the least of
- which is that storage requirements are diminished to current levels for
- all small glyph-set languages (<=256 characters) at the cost of local
- attribution. We must accept partial locale information for the input
- mechanisms if nowhere else, and this provides a promiscuous tagging for
- us at nearly 0 additional cost.
-
- Attributed non-Runic encoding also buys us 2 bytes rather than 2-5 bytes
- per glyph for large glyph-set languages (ie: Japanese).
-
- Runic encoding has too many significant drawbacks, and the only penalty
- on non-Runic encoding is +17% of disk space on an average UNIX system
- (20% of the files on an average UNIX system are text files) for 16-bit
- encoding, or +35% for 32-bit encoding, or 0% (with a space reduction and
- a restoration of meaning to file, field, and record lengths for all
- non-7-bit ASCII languages) with the assumption of a "locale" mechanism.
-
-
- PS: I notice that you, as well as Vadim, are dragging me from 386BSD
- localization (in comp.unix.bsd) into a discussion of multinationalization
- (in comp.std.internat). It was not my intent to be involved in the
- now distorted scope of this discussion when I started the news thread
- "INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion
- of OS localization issues. I hope you know what you are letting me in for.
- 8-(.
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-