NetNews Usenet Archive 1993 #1

Newsgroups: comp.std.internat
Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan7.033153.12133@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com>
Date: Thu, 7 Jan 93 03:31:53 GMT
Lines: 119

In article <id.EAHW.92A@ferranti.com> peter@ferranti.com (peter da silva) writes:
>In article <1993Jan5.090747.29232@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>> >>or
>> >>"pollutes" the files (all files except those stored in US-ASCII have file
>> >>sizes which no longer reflect true character counts on the file).
>
>> Destruction of this information is basically unacceptable for Western users
>
>Gee, since this information is already non-existent for anything but computer
>source (computer programs and other data intended to be read by a compiler
>or other plain text parser), and even there information is hidden in white
>space (is that a tab or 8 spaces?), I find it hard to understand this
>statement.

Consider a newline-terminated text database containing fixed-length lines,
or consider a database consisting of variable text records in fixed fields.
In either case, the amount of data per field now varies under Runic
encoding. For instance, if we accept the Plan 9 solution, an application
used in both England and the US will vary in how much data is representable
per fixed field, based on whether or not that data contains the English "#"
character. This gets worse the further you get from base ASCII coding.

Now consider a database, one of whose primary methods of implementation is
memory-mapped files.
Thus the storage and the in-core representation must be the same.

Now consider the mapping, which, by definition, must be done in the kernel,
for NFS mounts to and from our internationalized machine and both other
internationalized machines and currently installed machines. Of course Ohta
would term this mapping a political (re interoperability) or economic (re
updating all machines) issue. I do not believe this to be the case.
Remember that the primary distinction will not be made for compact files,
but for non-compact (multilingual and therefore 16-bit or larger characters)
file contents, as well as for directory information, which, by definition,
must be legible to all users (without proposing a localized per-user or
per-language file system view -- both Ohta and Vadim abhor "locale").

Further suppose a distributed application with clients for the database
running on old and new hardware. Only those data with the same localization
as the (non-updated) client machine's software will be usable remotely. A
potential example of this is available in OLTP for ticketing and
reservation systems for international airlines.

Any COBOL program stores data in this fashion. Many other programs store
data in this fashion to avoid a direct wire-translation of important values,
by character-encoding them in the files. This allows a direct record seek
based on offset. Many Ada programs also take advantage of this.

Runic encoding is antithetical to single-character changes in
fixed-record-length files, because such a change can either alter the
record size (destroying seek-offset record addressing) or alter the amount
of data representable in a field (destroying the ability to use
fixed-length fields for input in the front-end client).

Consider the program that performs a fast record count based on a division
of the number of characters in a file by the *fixed* record size.
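
For readers who want to see the record-addressing argument concretely, here
is a minimal sketch in Python (not part of the original post). The names
RECORD_SIZE, make_record, read_record, record_count and demo are all
hypothetical, the 16-byte record is an arbitrary choice, and modern UTF-8
stands in for the Plan 9 "Runic" encoding. It also assumes that the English
"#" mentioned above means the pound sterling sign, U+00A3.

```python
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed record length in bytes, newline included


def make_record(text, encoding="ascii"):
    """Pad text to 15 characters, add a newline, then encode it."""
    return (text.ljust(RECORD_SIZE - 1) + "\n").encode(encoding)


def read_record(fd, index):
    """Fetch record `index` with one computed seek, no scanning."""
    os.lseek(fd, index * RECORD_SIZE, os.SEEK_SET)
    return os.read(fd, RECORD_SIZE)


def record_count(fd):
    """Fast count: file size divided by the fixed record size.

    Returns (whole records, leftover bytes).
    """
    return divmod(os.fstat(fd).st_size, RECORD_SIZE)


def demo(texts, encoding):
    """Write the records to a temporary file.

    Returns (count, leftover bytes, bytes read back for record 1).
    """
    with tempfile.TemporaryFile() as f:
        for text in texts:
            f.write(make_record(text, encoding))
        f.flush()
        count, leftover = record_count(f.fileno())
        return count, leftover, read_record(f.fileno(), 1)


if __name__ == "__main__":
    # Pure ASCII: every record is exactly RECORD_SIZE bytes.
    print(demo(["alice #1", "bob #2", "carol #3"], "ascii"))
    # One pound sign encoded in UTF-8 makes record 0 one byte longer.
    print(demo(["alice \u00a31", "bob #2", "carol #3"], "utf-8"))
```

With ASCII data the division is exact and the computed seek lands on the
start of "bob". With a single pound sign in record 0 the file is one byte
longer, the division leaves a remainder, and the read for record 1 begins
with the previous record's newline.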

>> Consider for a moment, if you will, changing the first character of a
>> field in a 1M file. Not only does this cause the record size to become
>> variant on the data within it (thus rendering a computed lseek() useless,
>> since the records are no longer fixed size), but it requires that the entire
>> file contents be shifted to accommodate what used to be the rewrite of a
>> single block.
>
>Files containing fixed sized fields will be unable to contain runic data within
>the fields, or the fields will have to be padded, just as they are now. The
>occasions you can do this sort of meddling in a plaintext file are pretty
>limited already.

Limited in applicability, but, I would argue, wide in use. Padding is
unacceptable, both because it destroys the meaning of field width in the
front end and because of the need for memory-mapping of files. A database
program is likely to require the ability to either partially or fully
control its own pages.

Consider a two-stage commit database with a need to compute transitive
closure on potentially intersecting record range locks (i.e., locks spanning
more than one record) during the acquisition of a lock in a potential
deadlock situation (i.e., P1 wants locks on a+b and P2 wants locks on b+c
and P3 wants locks on c+a [in that locking order]). This is a rather common
need in any multi-client database (a file server is an instance of a
multi-client database).

There are a great many reasons to avoid Runic encoding, not the least of
which is that avoiding it diminishes storage requirements to current levels
for all small glyph-set languages (<=256 characters) at the cost of local
attribution. We must accept partial locale information for the input
mechanisms if nowhere else, and this provides a promiscuous tagging for us
at nearly 0 additional cost. Attributed non-Runic encoding also buys us
2 bytes rather than 2-5 bytes per glyph for large glyph-set languages
(i.e., Japanese).
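
The per-glyph storage comparison can be checked roughly with a short Python
sketch (again, not part of the original post). The assumptions are loud
ones: modern UTF-8 stands in for Runic encoding, UTF-16 (big-endian, no
byte-order mark) stands in for a flat 16-bit encoding, and Latin-1 stands in
for a locale-tagged 8-bit character set. The results therefore need not
match the 2-5 byte range quoted above. SAMPLES, bytes_per_glyph and compare
are hypothetical names.

```python
# Hypothetical sample strings, written with escapes so the source stays ASCII.
SAMPLES = {
    "ascii": "record",
    "latin": "\u00a3\u00e9\u00fc",      # pound sign, e-acute, u-umlaut
    "japanese": "\u65e5\u672c\u8a9e",   # three kanji
}


def bytes_per_glyph(text, encoding):
    """Average encoded bytes per character (code point) of `text`."""
    return len(text.encode(encoding)) / len(text)


def compare(samples=SAMPLES):
    """Map each sample name to (UTF-8, UTF-16) bytes per glyph."""
    return {
        name: (bytes_per_glyph(text, "utf-8"),
               bytes_per_glyph(text, "utf-16-be"))
        for name, text in samples.items()
    }


if __name__ == "__main__":
    for name, (multibyte, fixed16) in compare().items():
        print(f"{name:9} utf-8: {multibyte:.1f}  16-bit: {fixed16:.1f}")
    # The 8-bit, locale-tagged case for the Latin sample.
    print("latin-1:", bytes_per_glyph(SAMPLES["latin"], "latin-1"))
```

Under these assumptions the Japanese sample costs 3 bytes per glyph in the
variable-width form and 2 in the flat 16-bit form, the Latin sample costs 1
byte per glyph in its 8-bit set, and ASCII text doubles in size in the
16-bit form.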

Runic encoding has too many significant drawbacks, and the only penalty on
non-Runic encoding is +17% of disk space on an average UNIX system (20% of
the files on an average UNIX system are text files) for 16-bit encoding, or
+35% for 32-bit encoding, or 0% (with a space reduction and a restoration
of meaning to file, field, and record lengths for all non-7-bit ASCII
languages) with the assumption of a "locale" mechanism.

PS: I notice that you, as well as Vadim, are dragging me from 386BSD
localization (in comp.unix.bsd) into a discussion of multinationalization
(in comp.std.internat). It was not my intent to be involved in the
now-distorted scope of this discussion when I started the news thread
"INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion of
OS localization issues. I hope you know what you are letting me in for.
8-(.

                                        Terry Lambert
                                        terry@icarus.weber.edu
                                        terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
--
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------