NetNews Usenet Archive 1993 #1

Internet Message Format | 1993-01-07 | 5.6 KB

Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 7 Jan 1993 14:48:06 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 115
Message-ID: <1ii1dmINNcu6@rodan.UU.NET>
References: <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages

In article <1993Jan7.033153.12133@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>Consider a newline terminated text database containing fixed length lines,
>or consider a database consisting of variant text records in fixed fields.

Databases with fixed-length fields deserve to lose.

>In either case, the amount of data per field is now variant on Runic encoding.

In either case, the amount of data per field is now variant on ASCII
encoding. I can store 'Terry Lambert' next to the 'Thelma Louis Maria
Anna Bertran'. I always thought that wizards think before speaking.

>Now consider a database, one of the primary methods of implementation being
>memory mapped files.  Thus the storage and the in core representation must
>be the same.

You can just as well keep your stuff in runic encoding in memory.
In case you didn't know, virtual memory is the same ol' disk.

>Now consider the mapping, which, by definition, must be done in the kernel,

Ever heard of user-level file system servers?

>(without proposing a localized per user or
>per language file system view -- both Ohta and Vadim abhor "locale").

I do not abhor locale -- I flatly state that it is not sufficient for
multilingual environments, and that Unicode/ISO10646 is too much for
localizing, where you can do everything *with* locale and 8-bit encoding
(16 bit for Oriental languages using existing standards).
>Further suppose a distributed application with clients for the database
>running on old and new hardware.  Only those data with the same localization
>as the (non-updated) client machine's software will be usable remotely.  A
>potential example of this is available in OLTP for ticketing and reservation
>systems for international airlines.

Now tell us how you will sort data from various locations with *local*
sorting algorithms?

>Any COBOL program stores data in this fashion.  Many other programs store
>data in this fashion to avoid a direct wire-translation of important values
>by character encoding them in the files.  This allows a direct record seek
>based on offset.  Many Ada programs also take advantage of this.

And some people never grow out of BASIC.

>Consider that Runic encoding is antithetical in terms of single character
>changes for fixed record length files by virtue of its ability to either
>change record size (destroying the seek-offset record addressing) or by
>changing the amount of data representable in a field (destroying the
>ability to use fixed-length fields for input in the front end client).

I do not see why s/A/B/ should be different from s/A/AAA/.
Fixed-length fields impose upper bounds on the length of the text; if you
want to be sure that every string containing N characters will fit into
the field, just reserve N*K bytes, where K is the maximal length of a rune.
The field remains fixed-size.

>Consider the program that performs a fast record count based on a division
>of the number of characters in a file by the *fixed* record size.

And what? You can do it with runic encoding as well.

I'll discard the rest of the babble about fixed-sized fields for
its obvious meaninglessness.

>There are a great many reasons to avoid Runic encoding, not the least of
>which is that storage requirements are diminished to current levels for
>all small glyph-set languages (<=256 characters) at the cost of local
>attribution.
>We must accept partial locale information for the input
>mechanisms if nowhere else, and this provides a promiscuous tagging for
>us at nearly 0 additional cost.

You'd have to change the semantics of Unix first. What do you think
cat of two files created within different locales should produce?
If you have a single locale for the entire system, the problem is
non-existent, and so is any necessity to use Unicode.

>Attributed non-Runic encoding also buys us 2 bytes rather than 2-5 bytes
>per glyph for large glyph-set languages (ie: Japanese).

The best runic encoding wastes 12.5% of space but wins greatly over
fixed-size characters on European languages. UTF is space-inefficient for
everything above ASCII.

>Runic encoding has too many significant drawbacks, and the only penalty
>on non-Runic encoding is +17% of disk space on an average UNIX system
>(20% of the files on an average UNIX system are text files)

Even with that we have 3.4% (17%*20%) average wasted space.

>for 16-bit
>encoding, or +35% for 32-bit encoding,

Don't use UTF for 32-bit encoding, period.

>or 0% (with a space reduction and
>a restoration of meaning to file, field, and record lengths for all
>non-7-bit ASCII languages) with the assumption of a "locale" mechanism.

>PS: I notice that you, as well as Vadim, are dragging me from 386BSD
>localization (in comp.unix.bsd) into a discussion of multinationalization
>(in comp.std.internat).  It was not my intent to be involved in the
>now distorted scope of this discussion when I started the news thread
>"INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion
>of OS localization issues.  I hope you know what you are letting me in for.

OS *localization* does not require Unicode -- that has been proven in
practice many, many times. The real issue is multinationalization.

--vadim
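
[Editorial sketch, not part of the original post.] The "reserve N*K bytes"
argument in the message can be illustrated roughly as follows. This is a
minimal sketch under stated assumptions: UTF-8 stands in for the "runic"
(variable-length) encoding, so K = 4; NUL padding is an arbitrary choice;
and the helper names pack_field, unpack_field, record_count and read_record
are hypothetical, not taken from any system discussed in the thread.

```python
# Assumption: UTF-8 is the variable-length ("runic") encoding, so the
# longest encoded character (rune) is 4 bytes. This is K in the post.
MAX_RUNE_BYTES = 4
PAD = b"\x00"  # assumption: NUL never appears inside field text


def field_width(max_chars: int) -> int:
    """Bytes reserved for a field holding up to max_chars characters (N*K)."""
    return max_chars * MAX_RUNE_BYTES


def pack_field(text: str, max_chars: int) -> bytes:
    """Encode text into a field of exactly N*K bytes, padded with NULs."""
    if len(text) > max_chars:
        raise ValueError("text exceeds the field's character limit")
    if "\x00" in text:
        raise ValueError("NUL is reserved for padding in this sketch")
    # Each character needs at most K bytes, so the encoding always fits.
    return text.encode("utf-8").ljust(field_width(max_chars), PAD)


def unpack_field(field: bytes) -> str:
    """Strip the padding and decode the field back to text."""
    return field.rstrip(PAD).decode("utf-8")


def record_count(file_size: int, max_chars: int) -> int:
    """Fast record count: file size divided by the fixed record size."""
    count, remainder = divmod(file_size, field_width(max_chars))
    if remainder:
        raise ValueError("file size is not a multiple of the record size")
    return count


def read_record(data: bytes, index: int, max_chars: int) -> str:
    """Direct seek by offset, as with any fixed-length record file."""
    width = field_width(max_chars)
    return unpack_field(data[index * width:(index + 1) * width])
```

With one field per record, a file of such records keeps both properties the
quoted text worries about: the record count is still a single division, and
record i still starts at byte offset i * N * K. The cost is the reserved
space, which is the trade-off the thread argues over.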