NetNews Usenet Archive 1993 #1

Text File | 1993-01-08 | 13.7 KB | 269 lines

Newsgroups: comp.std.internat
Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan8.051317.7820@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: University of Utah Computer Center
References: <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1ii1dmINNcu6@rodan.UU.NET>
Date: Fri, 8 Jan 93 05:13:17 GMT
Lines: 256

In article <1ii1dmINNcu6@rodan.UU.NET>, avg@rodan.UU.NET (Vadim Antonov) writes:
|> In article <1993Jan7.033153.12133@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
|> >Consider a newline terminated text database containing fixed length lines,
|> >or consider a database consisting of variant text records in fixed fields.
|>
|> Databases with fixed-length fields deserve to lose.

This dictates coding practice, a no-no. It is not a refutation of the point, only a statement of opinion on your part.

|> >In either case, the amount of data per field is now variant on Runic encoding.
|>
|> In either case, the amount of data per field is now variant on ASCII encoding.
|> I can store 'Terry Lambert' next to the 'Thelma Louis Maria Anna Bertran'.
|> I always thought that wizards think before speaking.

Your example is of two pieces of *information* of differing lengths, not a change in the amount of data potentially representable by the field. For instance, using Ken Thompson's FSS-UTF encoding from Plan 9, the amount of space taken by a single glyph is 1-6 bytes, depending on the ordinal value of each glyph within the set.

The problem is I have a fixed length field and/or storage. Say I am storing English, Cyrillic, and Japanese text in a database with fixed field input and storage.

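To make the 1-6 byte spread concrete, here is a minimal sketch in Python. It assumes the commonly cited FSS-UTF length boundaries (1 byte up to 0x7f, 2 up to 0x7ff, 3 up to 0xffff, 4 up to 0x1fffff, 5 up to 0x3ffffff, 6 up to 0x7fffffff) and a hypothetical 24-byte fixed field; the function names are invented for illustration and are not part of any library.

```python
# Byte length of one code point under FSS-UTF (the original 1-6 byte scheme).
# The boundaries are the commonly cited FSS-UTF ranges; the 24-byte field
# size and all names below are illustrative assumptions.
FSS_UTF_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def fss_utf_len(code_point):
    """Return how many bytes FSS-UTF needs for one code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for nbytes, limit in enumerate(FSS_UTF_LIMITS, start=1):
        if code_point <= limit:
            return nbytes
    raise ValueError("code point exceeds 31 bits")


def glyphs_per_field(field_bytes, code_point):
    """How many copies of this glyph fit in a fixed-size byte field."""
    return field_bytes // fss_utf_len(code_point)


FIELD_BYTES = 24  # hypothetical fixed field width in bytes
samples = [
    ("7-bit ASCII", 0x41),
    ("0x80-0x7ff range", 0x410),
    ("0x4000000 and up", 0x4000000),
]
for label, cp in samples:
    # Prints the label, bytes per glyph, and glyphs that fit in the field.
    print(label, fss_utf_len(cp), glyphs_per_field(FIELD_BYTES, cp))
```

The same field holds 24, 12, or 4 glyphs depending only on where the character set happens to sit in the code space.
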
As discussed before, there are penalties in both places, so it is irrelevant if you eliminate the encoding in one location or the other, if you do not do so in both. I can store 24 7-bit ASCII glyphs, 12 11-bit Cyrillic glyphs (assume an encoding somewhere between 0x80 and 0x7ff; no matter what encoding scheme is used, some character set will fall in that range, or worse), or 4 Kanji characters (assume a location between 0x4000000 and 0x7ffffff; it could be slightly better, although either Chinese or Japanese will get more characters per field, depending on who wins the argument over who's first lexically). This is unacceptable for either a fixed input field in a database client, or for fixed field storage in a file system. It absolutely rots for modern extent-based file systems, like vxfs (Veritas).

|> >Now consider a database, one of the primary methods of implementation being
|> >memory mapped files. Thus the storage and the in core representation must
|> >be the same.
|>
|> You can as well keep your stuff in runic encoding in memory. If you
|> didn't know the virtual memory is the same ol' disk.

The problem occurs if one wants to simplify user-space processing of data; Runic encoding in both places *is* possible (assuming a client-based decode for display), but is undesirable for the glyph-count limitations on fields.

|> >Now consider the mapping, which, by definition, must be done in the kernel,
|>
|> Ever heard of user-level file system servers?

Yes, I'm "a bit" familiar with the concept (hint: check my signature). The problem is NFS mounts and backward compatibility. Also, you multiply your wire traffic from 1-7 times for Runic-encoded 32-bit values (as opposed to a constant 4 times for flat 32-bit encoding or a constant 2 times for flat 16-bit encoding -- or not at all, using compacted storage).

|> >(without proposing a localized per user or
|> >per language file system view -- both Ohta and Vadim abhor "locale").
|>
|> I do not abhor locale -- I flatly state that it is not sufficient for
|> multilingual environments and Unicode/ISO10646 is too much for localizing
|> where you can do everything *with* locale and 8-bit encoding (16 bit
|> for Oriental languages using existing standards).

I disagree. You can not make system localization the norm rather than application localization. Admittedly, application localization will still need to occur for cases such as word processing. I have *never* suggested that "locale was sufficient" ... only a necessary part of the whole picture.

|> >Further suppose a distributed application with clients for the database
|> >running on old and new hardware. Only those datum with the same localization
|> >as the (non-updated) client machine's software will be usable remotely. A
|> >potential example of this is available in OLTP for ticketing and reservation
|> >systems for international airlines.
|>
|> Now tell us how will you sort data from various locations with *local*
|> sorting algorithms?

By locale tables. I assume that if I have the fonts, I have the sorting tables. I don't see a great deal of use in sorting in the application I suggested, in any case, since insertion is likely to be in sorted order rather than sorting the data each time it is displayed. Insertion is done locally.

|> >Any COBOL program stores data in this fashion. Many other programs store
|> >data in this fashion to avoid a direct wire-translation of important values
|> >by character encoding them in the files. This allows a direct record seek
|> >based on offset. Many Ada programs also take advantage of this.
|>
|> And some people never grow out of BASIC.

Some banking applications have legislation *requiring* their computations take place in decimal rather than binary. Other than BCD, which is not a native computation mode for many processors (I know that the Intel chips and the IBM 360/370 are exceptions to this).
What do you do where storage by sign/exponent/mantissa isn't tolerable?

|> >Consider that Runic encoding is antithetical in terms of single character
|> >changes for fixed record length files by virtue of its ability to either
|> >change record size (destroying the seek-offset record addressing) or by
|> >changing the amount of data representable in a field (destroying the
|> >ability to use fixed-length fields for input in the front end client).
|>
|> I do not see why s/A/B/ should be different from s/A/AAA/.
|> Fixed-length fields impose upper bounds on the length of the text;
|> if you want to be sure that every string containing N characters will fit
|> into the field just reserve N*K bytes where K is the maximal length of
|> rune. The field remains fixed-size.

But much larger, and wasteful of memory where it is stored in encoded format (be it disk or core). The idea is not to ensure a fixed field size per se, but to ensure the *smallest possible* fixed field size. If we consider our database application, this is unacceptable for even moderately large databases.

Also: If I increase my field length to N*K bytes, and the minimal length of the rune is 1, then what prevents me from typing N*K one-byte glyphs into the field? I still have a disparity between the amount of data I may enter, and the input mechanism is "prejudiced" based on language. Of course, I could maintain a glyph count separately from the field count, but the maintenance of external information defeats your arguments with regard to "all information in the set".

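The N*K objection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of any real database: Python's built-in UTF-8 codec (1-4 bytes per code point) stands in for the 1-6 byte FSS-UTF, N = 8 and K = 4 are arbitrary, and the helper names are hypothetical.

```python
# Python's built-in UTF-8 codec (1-4 bytes per code point) stands in here
# for the 1-6 byte FSS-UTF discussed in the thread. N, K, and the helper
# names are illustrative assumptions.
N = 8                  # glyphs the field is meant to hold
K = 4                  # maximal bytes per glyph in the stand-in encoding
FIELD_BYTES = N * K    # the "just reserve N*K bytes" field


def glyph_count(encoded):
    """Count glyphs by skipping continuation bytes (binary 10xxxxxx)."""
    return sum(1 for b in encoded if (b & 0xC0) != 0x80)


def fits(text):
    """Byte-length check only: all that a plain fixed field enforces."""
    return len(text.encode("utf-8")) <= FIELD_BYTES


def fits_with_glyph_limit(text):
    """Byte-length check plus a separately maintained glyph limit."""
    encoded = text.encode("utf-8")
    return len(encoded) <= FIELD_BYTES and glyph_count(encoded) <= N


ascii_text = "A" * FIELD_BYTES    # 32 one-byte glyphs
wide_text = "\U00010348" * N      # 8 four-byte glyphs

print(fits(ascii_text), fits(wide_text))            # True True
print(fits_with_glyph_limit(ascii_text))            # False
print(fits_with_glyph_limit(wide_text))             # True
```

With only the byte check, the field accepts four times as many one-byte glyphs as maximal-length ones; making the limit uniform requires the extra glyph count, which is information kept outside the encoded data.
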
To calculate the length of the string to determine if I have exceeded the boundary, I will have to place some type of bounding (in excess of "length of field") on the input procedure, and re-check the length each time (unless I accept the concept of locale, or do it arithmetically based on placing each possible character set entirely within a fixed number of bytes per encoding (ie: all US ASCII characters take one byte, all Cyrillic, German and French two, all Tamil three, etc.)); this is extremely awkward if we are to fit the idea into existing standards for display I/O (OpenLook, Motif, Athena, etc.).

|> >Consider the program that performs a fast record count based on a division
|> >of the number of characters in a file by the *fixed* record size.
|>
|> And what? You can do it with runic encoding as well.

Yes, if you accept the "N*K bytes" argument. You will have to convince me of this first.

|> I'll discard the rest of the babble about fixed-sized fields for
|> its obvious meaninglessness.
|>
|> >There are a great many reasons to avoid Runic encoding, not the least of
|> >which is that storage requirements are diminished to current levels for
|> >all small glyph-set languages (<=256 characters) at the cost of local
|> >attribution.
|>
|> >We must accept partial locale information for the input
|> >mechanisms if nowhere else, and this provides a promiscuous tagging for
|> >us at nearly 0 additional cost.
|>
|> You'd have to change the semantic of Unix first. What do you think
|> cat of two files created within different locales should produce?

A single file in the "compound document" locale (with embedded "rich text" information) if the source files are not from the same locale; a single file with the same locale attribution as the source files if they are all from the same locale.

|> If you have a single locale for entire system the problem is non-existent
|> as well as any necessity to use Unicode.

Not true.
It is a requirement with the size of the set that you propose that there be "localization" of some kind, if only to reduce the size of the character sets which must be downloaded to a display device.

|> >Attributed non-Runic encoding also buys us 2 bytes rather than 2-5 bytes
|> >per glyph for large glyph-set languages (ie: Japanese).
|>
|> The best runic encoding wastes 12.5% of space but wins over fixed-size
|> characters on European languages greatly.
|> UTF is space-inefficient for everything above ASCII.

This is true, if you don't do any further optimization (as in locale-based storage optimization for small glyph-set languages). It is possible to optimize storage down to 8 bits for all small glyph-set languages with file attribution (or non-spacing language selection "tag" characters -- I am against in-band tagging, as it destroys file size information). In this case, there is no 12.5% loss (12.5% is based on the assumption of a selected set of languages, not the full possible expansion set of all languages).

|> >Runic encoding has too many significant drawbacks, and the only penalty
|> >on non-Runic encoding is +17% of disk space on an average UNIX system
|> >(20% of the files on an average UNIX system are text files)
|>
|> Even with that we have 3.4% (17%*20%) average wasted space.

Sorry; the 17% wasn't 17% of 20%, it was an *additional* 17% on top of the 20% (ie: if 20% of your disk was taken up by your text files, now 37% is taken up using 16 bit encoding; this grows to 54% (17%*2+20%) for 32 bit raw storage). This means that for users with an 80M drive, 16M is text. This means you will need a 93.6M drive for the same contents with 16 bit encoding, or a 107.2M drive for 32 bit encoding. The 17% is based on 3% of the stored characters being non-simply encodable (for instance, well known paths or well known data can not be renamed).

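Taking the stated figures at face value (an 80M drive, 20% of it text, and a further 17 points of the original drive for each doubling of character width), the arithmetic can be checked with a short Python sketch. The variable and function names are illustrative only.

```python
# Figures from the post: an 80M drive, 20% of it text, and an extra
# 17 points of the original drive per doubling of character width.
# Variable and function names are illustrative assumptions.
DRIVE_MB = 80.0
TEXT_FRACTION = 0.20
EXTRA_PER_DOUBLING = 0.17


def text_share_of_original(doublings):
    """Text size as a fraction of the original drive size."""
    return TEXT_FRACTION + EXTRA_PER_DOUBLING * doublings


def required_drive_mb(doublings):
    """Drive size needed to hold the same contents after widening."""
    return DRIVE_MB * (1.0 + EXTRA_PER_DOUBLING * doublings)


print("text on the 8-bit drive:", DRIVE_MB * TEXT_FRACTION, "MB")
for label, doublings in [("16-bit", 1), ("32-bit", 2)]:
    share = round(text_share_of_original(doublings) * 100)
    size = round(required_drive_mb(doublings), 1)
    print(label, "text share:", share, "%", "drive needed:", size, "MB")
```

Under these assumptions the 16-bit case works out to 37% and 93.6M, and the 32-bit case to 54% and 107.2M.
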
The original 20% is from another engineer, but I have no reason to doubt its accuracy; if anything, it's low, since it doesn't include file names within directories. This number will also go up with the use of message catalogs for localization (I expect error messages for commands to be in English), since there is not just a little text hard-coded within non-internationalized commands and applications, and I expect that this will have to change.

|> >for 16-bit
|> >encoding, or +35% for 32-bit encoding,
|>
|> Don't use UTF for 32-bit encoding, period.

This is raw 32 bit encoding (32 bits are stored as 32 bits without *any* Runic encoding); UTF, of course, makes the numbers much, much worse.

|> >or 0% (with a space reduction and
|> >a restoration of meaning to file, field, and record lengths for all
|> >non-7-bit ASCII languages) with the assumption of a "locale" mechanism.
|>
|> >PS: I notice that you, as well as Vadim, are dragging me from 386BSD
|> >localization (in comp.unix.bsd) into a discussion of multinationalization
|> >(in comp.std.internat). It was not my intent to be involved in the
|> >now distorted scope of this discussion when I started the news thread
|> >"INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion
|> >of OS localization issues. I hope you know what you are letting me in for.
|>
|> OS *localization* does not require Unicode -- it was proven in practice
|> many many times. The real issue is multinationalization.

I beg your pardon, but the subject line "INTERNATIONALIZATION: JAPAN, FAR EAST" comes from a discussion in comp.unix.bsd regarding internationalization as a means of enabling localization for 386BSD. Internationalization as a means of providing a base for either localization or multinationalization is what Unicode is about. Your suggestions deal with multinationalization only, and at least several of the suggestions are unworkable for technological and/or political reasons.

I began the discussion "INTERNATIONALIZATION: JAPAN, FAR EAST" looking for an existing or draft standard for use in 386BSD localization and potential future multinationalization (no matter how grungy in implementation). Usurping the subject doesn't change the original intent. 8-).


                                        Terry Lambert
                                        terry@icarus.weber.edu
                                        terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------