- Newsgroups: comp.std.internat
- Path: sparky!uunet!elroy.jpl.nasa.gov!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Message-ID: <1993Jan8.051317.7820@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: University of Utah Computer Center
- References: <1993Jan5.090747.29232@fcom.cc.utah.edu> <id.EAHW.92A@ferranti.com> <1993Jan7.033153.12133@fcom.cc.utah.edu> <1ii1dmINNcu6@rodan.UU.NET>
- Date: Fri, 8 Jan 93 05:13:17 GMT
- Lines: 256
-
- In article <1ii1dmINNcu6@rodan.UU.NET>, avg@rodan.UU.NET (Vadim Antonov) writes:
- |> In article <1993Jan7.033153.12133@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
- |> >Consider a newline terminated text database containing fixed length lines,
- |> >or consider a database consisting of variant text records in fixed fields.
- |>
- |> Databases with fixed-length fields deserve to lose.
-
- This dictates coding practice, a no-no. It is not a refutation of the point,
- only a statement of opinion on your part.
-
- |> >In either case, the amount of data per field is now variant on Runic encoding.
- |>
- |> In either case, the amount of data per field is now variant on ASCII encoding.
- |> I can store 'Terry Lambert' next to the 'Thelma Louis Maria Anna Bertran'.
- |> I always thought that wizards think before speaking.
-
- Your example is of two pieces of *information* of differing lengths,
- not a change in the amount of data potentially representable by the field.
- For instance, using Ken Thompson's FSS-UTF encoding from Plan 9, the amount
- of space taken by a single glyph is 1-6 bytes, depending on the ordinal
- value of each glyph within the set.
-
- The problem is that I have a fixed-length field and/or storage. Say I am
- storing English, Cyrillic, and Japanese text in a database with fixed field
- input and storage. As discussed before, there are penalties in both places,
- so eliminating the encoding in one location is irrelevant if you do not
- also eliminate it in the other.
-
- I can store 24 7-bit ASCII glyphs, 12 11-bit Cyrillic glyphs (assume an
- encoding somewhere between 0x80 and 0x7ff; no matter what encoding scheme
- is used, some character set will fall in that range, or worse), or 4 Kanji
- characters (assume a location between 0x4000000 and 0x7ffffff; it could be
- slightly better, although either Chinese or Japanese will get more
- characters per field, depending on who wins the argument over whose
- characters come first lexically).
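-
- Using the length function sketched above, the arithmetic for a 24-byte
- field is:
-
-     24 / fss_utf_len(0x41)      = 24 / 1 = 24 ASCII glyphs
-     24 / fss_utf_len(0x400)     = 24 / 2 = 12 Cyrillic glyphs
-     24 / fss_utf_len(0x4000000) = 24 / 6 =  4 Kanji glyphs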
-
- This is unacceptable for either a fixed input field in a database client,
- or for fixed field storage in a file system. It absolutely rots for modern
- extent-based file systems, like vxfs (Veritas).
-
- |> >Now consider a database, one of the primary methods of implementation being
- |> >memory mapped files. Thus the storage and the in core representation must
- |> >be the same.
- |>
- |> You can as well keep your stuff in runic encoding in memory. If you
- |> didn't know, virtual memory is the same ol' disk.
-
- The problem occurs if one wants to simplify user-space processing of data;
- Runic encoding in both places *is* possible (assuming a client-based decode
- for display), but is undesirable because of the glyph-count limitations on
- fields.
-
- |> >Now consider the mapping, which, by definition, must be done in the kernel,
- |>
- |> Ever heard of user-level file system servers?
-
- Yes, I'm "a bit" familiar with the concept (hint: check my signature). The
- problem is NFS mounts and backward compatibility. Also, you multiply your
- wire traffic from 1-6 times for Runic-encoded 32-bit values (as opposed to
- a constant 4 times for flat 32-bit encoding or a constant 2 times for flat
- 16-bit encoding -- or not at all, using compacted storage).
-
- |> >(without proposing a localized per user or
- |> >per language file system view -- both Ohta and Vadim abhor "locale").
- |>
- |> I do not abhor locale -- I flatly state that it is not sufficient for
- |> multilingual environments and Unicode/ISO10646 is too much for localizing
- |> where you can do everything *with* locale and 8-bit encoding (16 bit
- |> for Oriental languages using existing standards).
-
- I disagree. You cannot make system localization the norm rather than
- application localization. Admittedly, application localization will still
- need to occur for cases such as word processing.
-
- I have *never* suggested that "locale was sufficient" ... only that it is
- a necessary part of the whole picture.
-
- |> >Further suppose a distributed application with clients for the database
- |> >running on old and new hardware. Only those datum with the same localization
- |> >as the (non-updated) client machine's software will be usable remotely. A
- |> >potential example of this is available in OLTP for ticketing and reservation
- |> >systems for international airlines.
- |>
- |> Now tell us how will you sort data from various locations with *local*
- |> sorting algorithms?
-
- By locale tables. I assume that if I have the fonts, I have the sorting
- tables. In any case, I don't see a great deal of use for sorting in the
- application I suggested, since insertion is likely to be in sorted order
- rather than the data being sorted each time it is displayed. Insertion is
- done locally.
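-
- A sketch of what I mean, using the ANSI C locale hooks (the empty locale
- name means "take the collation tables from the environment"):
-
-     #include <locale.h>
-     #include <stdlib.h>
-     #include <string.h>
-
-     /* qsort comparator: collate by the current LC_COLLATE tables */
-     static int
-     collate(const void *a, const void *b)
-     {
-         return strcoll(*(char *const *)a, *(char *const *)b);
-     }
-
-     /* sort at insertion time, under the *local* collation tables */
-     void
-     sort_names(char **names, size_t n)
-     {
-         setlocale(LC_COLLATE, "");
-         qsort(names, n, sizeof *names, collate);
-     }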
-
- |> >Any COBOL program stores data in this fashion. Many other programs store
- |> >data in this fashion to avoid a direct wire-translation of important values
- |> >by character encoding them in the files. This allows a direct record seek
- |> >based on offset. Many Ada programs also take advantage of this.
- |>
- |> And some people never grow out of BASIC.
-
- Some banking applications have legislation *requiring* that their
- computations take place in decimal rather than binary. BCD is not a
- native computation mode for many processors (I know that the Intel chips
- and the IBM 360/370 are exceptions to this). What do you do where storage
- by sign/exponent/mantissa isn't tolerable?
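-
- What you do is character-encode the values, as in the COBOL case above; a
- minimal sketch (unsigned fields only; the field width and names are mine):
-
-     /* Add two fixed-width, character-encoded decimal fields in
-      * place (a += b); no binary conversion is done, so no sign/
-      * exponent/mantissa representation is ever involved. */
-     void
-     dec_add(char *a, const char *b, int width)
-     {
-         int i, carry = 0;
-
-         for (i = width - 1; i >= 0; i--) {
-             int d = (a[i] - '0') + (b[i] - '0') + carry;
-             carry = d / 10;
-             a[i] = d % 10 + '0';
-         }
-         /* carry != 0 here means the fixed field overflowed */
-     }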
-
- |> >Consider that Runic encoding is antithetical in terms of single character
- |> >changes for fixed record length files by virtue of its ability to either
- |> >change record size (destroying the seek-offset record addressing) or by
- |> >changing the amount of data representable in a field (destroying the
- |> >ability to use fixed-length fields for input in the front end client).
- |>
- |> I do not see why s/A/B/ should be different from s/A/AAA/.
- |> Fixed-length fields impose upper bounds on the length of the text;
- |> if you want to be sure that every string containing N characters will fit
- |> into the field just reserve N*K bytes where K is the maximal length of
- |> rune. The field remains fixed-size.
-
- But it is much larger, and wasteful of memory wherever it is stored in
- encoded format (be it disk or core). The idea is not to ensure a fixed
- field size per se, but to ensure the *smallest possible* fixed field size.
-
- If we consider our database application, this is unacceptable for even
- moderately large databases.
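-
- To put numbers on it (sizes invented for illustration; K=6 as for FSS-UTF):
-
-     struct rec {                  /* 24 glyphs guaranteed ...   */
-         char name[24 * 6];        /* ... at 144 bytes per field */
-     };
-
- A million such records cost 144M on disk or in core, against 24M for flat
- 8-bit storage with locale attribution, or 48M for flat 16-bit storage.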
-
- Also: if I increase my field length to N*K bytes, and the minimal length
- of a rune is 1, then what prevents me from typing N*K one-byte glyphs
- into the field? I still have a disparity between the amount of data I may
- enter, and the input mechanism is "prejudiced" based on language.
-
- Of course, I could maintain a glyph count separately from the field count,
- but the maintenance of external information defeats your arguments with
- regard to "all information in the set". To calculate the length of the
- string to determine if I have exceeded the boundary, I will have to place
- some type of bounding (in excess of "length of field") on the input
- procedure, and re-check the length each time -- unless I accept the concept
- of locale, or do it arithmetically by placing each possible character
- set entirely within a fixed number of bytes per encoding (ie: all US ASCII
- characters take one byte, all Cyrillic, German and French two, all Tamil
- three, etc.); this is extremely awkward if we are to fit the idea into
- existing standards for display I/O (OpenLook, Motif, Athena, etc.).
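-
- The re-check I am objecting to looks something like this (assuming the
- final FSS-UTF byte format, in which continuation bytes have the form
- 10xxxxxx), and it must run on every change to the field:
-
-     /* Count glyphs (not bytes) in an FSS-UTF string: bytes of the
-      * form 10xxxxxx are continuations and do not start a glyph. */
-     int
-     glyph_count(const unsigned char *s, int nbytes)
-     {
-         int i, n = 0;
-
-         for (i = 0; i < nbytes; i++)
-             if ((s[i] & 0xc0) != 0x80)
-                 n++;
-         return n;
-     }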
-
- |> >Consider the program that performs a fast record count based on a division
- |> >of the number of characters in a file by the *fixed* record size.
- |>
- |> And what? You can do it with runic encoding as well.
-
- Yes, if you accept the "N*K bytes" argument. You will have to convince me
- of this first.
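-
- For reference, the idiom in question is just the following (a sketch; the
- division is only meaningful while the record size is a *byte* invariant):
-
-     #include <sys/types.h>
-     #include <sys/stat.h>
-
-     /* Fast record count for a fixed-record-size file. */
-     long
-     record_count(const char *path, long recsize)
-     {
-         struct stat st;
-
-         if (stat(path, &st) == -1)
-             return -1;
-         return (long)(st.st_size / recsize);
-     }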
-
- |> I'll discard the rest of the babble about fixed-sized fields for
- |> its obvious meaninglessness.
- |>
- |> >There are a great many reasons to avoid Runic encoding, not the least of
- |> >which is that storage requirements are diminished to current levels for
- |> >all small glyph-set languages (<=256 characters) at the cost of local
- |> >attribution.
- |>
- |> >We must accept partial locale information for the input
- |> >mechanisms if nowhere else, and this provides a promiscuous tagging for
- |> >us at nearly 0 additional cost.
- |>
- |> You'd have to change the semantic of Unix first. What do you think
- |> cat of two files created within different locales should produce?
-
- A single file in the "compound document" locale (with embedded "rich text"
- information) if the source files are not from the same locale; a single
- file with the same locale attribution as the source files if they are all
- from the same locale.
-
- |> If you have a single locale for entire system the problem is non-existent
- |> as well as any necessity to use Unicode.
-
- Not true. With a character set of the size you propose, it is a requirement
- that there be "localization" of some kind, if only to reduce the size of
- the character sets which must be downloaded to a display device.
-
- |> >Attributed non-Runic encoding also buys us 2 bytes rather than 2-5 bytes
- |> >per glyph for large glyph-set languages (ie: Japanese).
- |>
- |> The best runic encoding wastes 12.5% of space but wins over fixed-size
- |> characters on European languages greatly.
- |> UTF is space-inefficient for everything above ASCII.
-
- This is true, if you don't do any further optimization (as in locale-based
- storage optimization for small glyph-set languages). It is possible to
- optimize storage down to 8 bits for all small glyph-set languages with
- file attribution (or non-spacing language selection "tag" characters -- I
- am against in-band tagging, as it destroys file size information). In this
- case, there is no 12.5% loss (12.5% is based on the assumption of a selected
- set of languages, not the full possible expansion set of all languages).
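-
- A sketch of the compacted storage I mean (the names and base value are
- illustrative; Cyrillic happens to fit within a 256-value window of the
- 10646 code space):
-
-     /* The file carries a locale tag once, out of band; each glyph
-      * is then stored as an 8-bit offset from that locale's base. */
-     #define CYRILLIC_BASE 0x0400    /* illustrative base value */
-
-     unsigned char                   /* 10646 value -> 8 bits */
-     compact(unsigned long glyph)
-     {
-         return (unsigned char)(glyph - CYRILLIC_BASE);
-     }
-
-     unsigned long                   /* 8 bits -> 10646 value */
-     expand(unsigned char byte)
-     {
-         return CYRILLIC_BASE + (unsigned long)byte;
-     }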
-
- |> >Runic encoding has too many significant drawbacks, and the only penalty
- |> >on non-Runic encoding is +17% of disk space on an average UNIX system
- |> >(20% of the files on an average UNIX system are text files)
- |>
- |> Even with that we have 3.4% (17%*20%) average wasted space.
-
- Sorry; the 17% wasn't 17% of 20%, it was an *additional* 17% on top of the
- 20% (ie: if 20% of your disk was taken up by your text files, 37% is now
- taken up using 16 bit encoding; this grows to 54% (17%*2+20%) for 32 bit
- raw storage). This means that for users with an 80M drive, 16M of which is
- text, you will need a 93.6M drive for the same contents with 16 bit
- encoding, or a 107.2M drive for 32 bit encoding. The 17% is based on 3% of
- the stored characters being non-simply encodable (for instance, well known
- paths or well known data can not be renamed). The original 20% is from
- another engineer, but I have no reason to doubt its accuracy; if anything,
- it's low, since it doesn't include file names within directories. This
- number will also go up with the use of message catalogs for localization
- (I expect error messages for commands to be in English), since there is
- not just a little text hard-coded within non-internationalized commands
- and applications, and I expect that this will have to change.
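-
- In short:
-
-     16M of text on an 80M drive (the 20%)
-     16 bit raw:  80M + (17% * 80M) = 80M + 13.6M =  93.6M
-     32 bit raw:  80M + (34% * 80M) = 80M + 27.2M = 107.2M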
-
- |> >for 16-bit
- |> >encoding, or +35% for 32-bit encoding,
- |>
- |> Don't use UTF for 32-bit encoding, period.
-
- This is raw 32 bit encoding (32 bits are stored as 32 bits without *any* Runic
- encoding); UTF, of course, makes the numbers much, much worse.
-
- |> >or 0% (with a space reduction and
- |> >a restoration of meaning to file, field, and record lengths for all
- |> >non-7-bit ASCII languages) with the assumption of a "locale" mechanism.
- |>
- |> >PS: I notice that you, as well as Vadim, are dragging me from 386BSD
- |> >localization (in comp.unix.bsd) into a discussion of multinationalization
- |> >(in comp.std.internat). It was not my intent to be involved in the
- |> >now distorted scope of this discussion when I started the news thread
- |> >"INTERNATIONALIZATION: JAPAN, FAR EAST" in comp.unix.bsd for discussion
- |> >of OS localization issues. I hope you know what you are letting me in for.
- |>
- |> OS *localization* does not require Unicode -- it was proven in practice
- |> many many times. The real issue is multinationalization.
-
- I beg your pardon, but the subject line "INTERNATIONALIZATION: JAPAN, FAR
- EAST" comes from a discussion in comp.unix.bsd regarding internationalization
- as a means of enabling localization for 386BSD. Internationalization as a
- means of providing a base for either localization or multinationalization is
- what Unicode is about. Your suggestions deal with multinationalization only,
- and at least several of the suggestions are unworkable for technological
- and/or political reasons.
-
- I began the discussion "INTERNATIONALIZATION: JAPAN, FAR EAST" looking for an
- existing or draft standard for use in 386BSD localization and potential future
- multinationalization (no matter how grungy in implementation). Usurping the
- subject doesn't change the original intent. 8-).
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-