- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Date: 7 Jan 1993 14:48:06 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 115
- Message-ID: <1ii1dmINNcu6@rodan.UU.NET>
- References: <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu>
- NNTP-Posting-Host: rodan.uu.net
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
-
- In article <1993Jan7.033153.12133@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- >Consider a newline terminated text database containing fixed length lines,
- >or consider a database consisting of variant text records in fixed fields.
-
- Databases with fixed-length fields deserve to lose.
-
- >In either case, the amount of data per field is now variant on Runic encoding.
-
- In either case, the amount of data per field is now variant on ASCII encoding.
- I can store 'Terry Lambert' next to 'Thelma Louis Maria Anna Bertran'.
- I always thought that wizards think before speaking.
-
- >Now consider a database, one of the primary methods of implementation being
- >memory mapped files. Thus the storage and the in core representation must
- >be the same.
-
- You can just as well keep your stuff in runic encoding in memory. In case
- you didn't know, virtual memory is the same ol' disk.
-
- >Now consider the mapping, which, by definition, must be done in the kernel,
-
- Ever heard of user-level file system servers?
-
- >(without proposing a localized per user or
- >per language file system view -- both Ohta and Vadim abhor "locale").
-
- I do not abhor locale -- I flatly state that it is not sufficient for
- multilingual environments, and that Unicode/ISO10646 is too much for
- localization, where you can do everything *with* locale and 8-bit encoding
- (16-bit for Oriental languages using existing standards).
-
- >Further suppose a distributed application with clients for the database
- >running on old and new hardware. Only those datum with the same localization
- >as the (non-updated) client machine's software will be usable remotely. A
- >potential example of this is available in OLTP for ticketing and reservation
- >systems for international airlines.
-
- Now tell us how you will sort data from various locations with *local*
- sorting algorithms?
-
- >Any COBOL program stores data in this fashion, Many other programs store
- >data in this fashion to avoid a direct wire-translation of important values
- >by character encoding them in the files. This allows a direct record seek
- >based on offset. Many Ada programs also take advantage of this.
-
- And some people never grow out of BASIC.
-
- >Consider that Runic encoding is antithetical in terms of single character
- >changes for fixed record length files by virtue of it's ability to either
- >change record size (destroying the seek-offset record addressing) or by
- >changing the amount of data representable in a field (destroying the
- >ability to use fixed-length fields for input in the front end client).
-
- I do not see why s/A/B/ should be different from s/A/AAA/.
- Fixed-length fields impose upper bounds on the length of the text;
- if you want to be sure that every string containing N characters will fit
- into the field, just reserve N*K bytes, where K is the maximal length of
- a rune in bytes. The field remains fixed-size.
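
A minimal sketch of the N*K reservation in Python. It assumes UTF-8 stands in for the runic encoding (so K = 4 bytes per character), that fields are padded with NUL bytes, and that the text itself contains no NUL characters. The names N, K, pack_field and unpack_field are made up for illustration.

```python
N = 8               # assumed maximum number of characters per field
K = 4               # assumed maximum bytes per character (UTF-8 here)
FIELD_SIZE = N * K  # every field occupies exactly this many bytes


def pack_field(text: str) -> bytes:
    """Encode text and pad it with NUL bytes to exactly FIELD_SIZE bytes."""
    if len(text) > N:
        raise ValueError(f"field holds at most {N} characters")
    # At most N characters of at most K bytes each always fits in N*K bytes.
    return text.encode("utf-8").ljust(FIELD_SIZE, b"\x00")


def unpack_field(raw: bytes) -> str:
    """Strip the NUL padding and decode the field back to text."""
    return raw.rstrip(b"\x00").decode("utf-8")
```

Whatever mix of one-byte and multi-byte characters goes in, the stored field has the same size. The cost is the unused padding for short or ASCII-only text.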
-
- >Consider the program that performs a fast record count based on a division
- >of the number of characters in a file by the *fixed* record size.
-
- So what? You can do it with runic encoding as well.
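
A sketch of why the division still works, again assuming UTF-8 as the runic encoding and NUL-padded records of a fixed byte size. RECORD_SIZE, write_record, record_count and read_record are hypothetical names, not taken from any real database.

```python
import io

RECORD_SIZE = 32  # assumed fixed record size in bytes


def write_record(f, text: str) -> None:
    """Append one record, padded with NUL bytes to RECORD_SIZE bytes."""
    data = text.encode("utf-8")
    if len(data) > RECORD_SIZE:
        raise ValueError("text does not fit in one record")
    f.seek(0, io.SEEK_END)
    f.write(data.ljust(RECORD_SIZE, b"\x00"))


def record_count(f) -> int:
    """Count records by dividing the file size by the fixed record size."""
    size = f.seek(0, io.SEEK_END)  # seek returns the new absolute offset
    return size // RECORD_SIZE


def read_record(f, index: int) -> str:
    """Seek directly to record number index (0-based) and decode it."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00").decode("utf-8")
```

The count and the seek offset depend only on bytes per record, not on how many characters each record happens to hold.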
-
- I'll discard the rest of the babble about fixed-sized fields for
- its obvious meaninglessness.
-
- >There are a great many reasons to avoid Runic encoding, not the least of
- >which is that storage requirements are diminished to current levels for
- >all small glyph-set languages (<=256 characters) at the cost of local
- >attribution.
-
- >We must accept partial locale information for the input
- >mechanisms if nowhere else, and this provides a promiscuous tagging for
- >us at nearly 0 additional cost.
-
- You'd have to change the semantics of Unix first. What do you think
- cat of two files created in different locales should produce?
- If you have a single locale for the entire system, the problem is
- non-existent, as is any necessity to use Unicode.
-
- >Attributed non-Runic encoding also buys us 2 bytes rather than 2-5 bytes
- >per glyph for large glyph-set languages (ie: Japanese).
-
- The best runic encoding wastes 12.5% of space but wins over fixed-size
- characters on European languages greatly.
-
- UTF is space-inefficient for everything above ASCII.
-
- >Runic encoding has too many significant drawbacks, and the only penalty
- >on non-Runic encoding is +17% of disk space on an average UNIX system
- >(20% of the files on an average UNIX system are text files)
-
- Even with that we have 3.4% (17%*20%) average wasted space.
-
- >for 16-bit
- >encoding, or +35% for 32-bit encoding,
-
- Don't use UTF for 32-bit encoding, period.
-
- >or 0% (with a space reduction and
- >a restoration of meaning to file, field, and record lengths for all
- >non-7-bit ASCII languages) with the assumption of a "locale" mechanism.
-
- >PS: I notice that you, as well as Vadim, are dragging me from 386BSD
- >localization (in comp.unix.bsd) into a discussion of multinationalization
- >(in comp.std.internat). It was not my intent to be involved in the
- >now distorted scope of this discussion when I started the news thread
- >"INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion
- >of OS localization issues. I hope you know what you are letting me in for.
-
- OS *localization* does not require Unicode -- this has been proven in
- practice many, many times. The real issue is multinationalization.
-
- --vadim
-