- Path: sparky!uunet!zaphod.mps.ohio-state.edu!uwm.edu!linac!att!att!allegra!alice!andrew
- From: andrew@alice.att.com (Andrew Hume)
- Newsgroups: comp.std.internat
- Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Message-ID: <24565@alice.att.com>
- Date: 7 Jan 93 07:19:55 GMT
- Article-I.D.: alice.24565
- References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <1993Jan7.033153.12133@fcom.cc.utah.edu>
- Organization: AT&T Bell Laboratories, Murray Hill NJ
- Lines: 56
-
- let me start by thanking terry for a posting full of content.
-
- In article <1993Jan7.033153.12133@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
- > Consider a newline terminated text database containing fixed length lines,
- > or consider a database consisting of variant text records in fixed fields.
- > In either case, the amount of data per field is now variant on Runic encoding.
- > For instance, if we accept the Plan-9 solution, an application used in
- > both England and the US will vary as to how much data is representable per
- > fixed field based on whether or not that data contains the English "#"
- > character. This gets worse the further you get from base ASCII coding.
-
- this is true (and is a bummer) but has been true for a
- bunch of folks already who seem to be able to cope.
- anyone who uses 2022 (presumably ohta-san) already has a variable
- character-to-byte ratio. all you know is that a character (in UTF-2)
- cannot take more than 5 bytes (or 3 right now).
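
to put numbers on the variable character-to-byte ratio, here is a minimal sketch. it assumes modern UTF-8 as a stand-in for UTF-2 (the two agree on the 1, 2 and 3 byte forms as far as i know); the sample characters and the name `byte_lengths` are illustrative, not from the thread.

```python
# assumption: python's "utf-8" codec stands in for the UTF-2 encoding
# discussed in the post; the sample characters are illustrative only.
samples = ["#", "\u00a3", "\u6f22"]  # number sign, pound sign, a han character

# map each character to the number of bytes its encoding occupies
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```

so a field measured in bytes holds fewer pound signs than number signs, which is terry's point; the question is only how much that matters.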
-
- > Consider that Runic encoding is antithetical in terms of single character
- > changes for fixed record length files by virtue of it's ability to either
- > change record size (destroying the seek-offset record addressing) or by
- > changing the amount of data representable in a field (destroying the
- > ability to use fixed-length fields for input in the front end client).
-
- i don't understand this. all runic encoding does is imply that
- there MAY be situations where if you change a single character, another
- character may need to be dropped. is this such a big deal? fixed-length
- records have always had a similar property; sometimes, you can't add
- anything more. if the records are truly fixed size, then they are
- of fixed size and you can continue to address them directly.
-
- > Padding is unacceptable, both because it destroys the meaning of field width
- > in the front end and because of the need for memory-mapping of files. A
- > database program is likely to require the ability to either partially or
- > fully control it's own pages.
-
- say what? there seems to be a confusion here between what is visible
- to the user and what is stored. let's make it concrete. say we have a fixed length
- record of 5 fields, each exactly 7 characters wide. on ascii systems, the
- record is almost certainly 35 bytes long. if you have runic (variable-length)
- encoding, you would probably (for the current 10646) make the record 105 bytes
- long, and right justify each field in a 21 byte space. the padding here
- is not visible to the user and clearly has no effect on memory-mapping
- or any kind of sharing of data structures. you can say this is wasteful
- of space (and i'll agree; that's why we love fixed-length records!), but
- it certainly doesn't have the impact you describe.
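
the 5 x 7 record above can be sketched roughly as follows. this is only an illustration under stated assumptions: UTF-8 stands in for the runic encoding, a NUL byte is the pad (the post does not say what to pad with), and the names `pack_record`, `unpack_record` and `record_offset` are made up for the example.

```python
FIELDS = 5                # fields per record
CHARS_PER_FIELD = 7       # characters visible to the user in each field
MAX_BYTES_PER_CHAR = 3    # worst case per character "right now"
FIELD_BYTES = CHARS_PER_FIELD * MAX_BYTES_PER_CHAR   # 21
RECORD_BYTES = FIELDS * FIELD_BYTES                  # 105
PAD = b"\x00"             # assumption: NUL padding, never shown to the user


def pack_record(fields):
    """Encode each field and right justify it in its 21 byte slot."""
    if len(fields) != FIELDS:
        raise ValueError("wrong number of fields")
    out = bytearray()
    for text in fields:
        if len(text) > CHARS_PER_FIELD:
            raise ValueError("field longer than 7 characters")
        raw = text.encode("utf-8")
        if len(raw) > FIELD_BYTES:
            raise ValueError("character needs more than 3 bytes")
        out += raw.rjust(FIELD_BYTES, PAD)
    return bytes(out)


def unpack_record(record):
    """Strip the leading pad bytes and decode each field."""
    return [
        record[i:i + FIELD_BYTES].lstrip(PAD).decode("utf-8")
        for i in range(0, RECORD_BYTES, FIELD_BYTES)
    ]


def record_offset(n):
    """Byte offset of record n; direct addressing still works."""
    return n * RECORD_BYTES
```

every record is 105 bytes whatever characters it holds, so seeking to record n stays a single multiplication, and the front end still sees fields of 7 characters.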
-
- > text deleted, all depicting the advantages of attributed non-runic
- > encoding...
-
- your arithmetic makes it seem the attribution is on a per file
- basis. that doesn't seem to handle the notion of files containing
- characters from multiple character sets. could you please elaborate
- a little, or tell me where to look for such an elaboration?
-
-
- andrew hume
-