NetNews Usenet Archive 1993 #1

Internet Message Format | 1993-01-07 | 3.6 KB

Path: sparky!uunet!zaphod.mps.ohio-state.edu!uwm.edu!linac!att!att!allegra!alice!andrew
From: andrew@alice.att.com (Andrew Hume)
Newsgroups: comp.std.internat
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Message-ID: <24565@alice.att.com>
Date: 7 Jan 93 07:19:55 GMT
Article-I.D.: alice.24565
References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <1993Jan7.033153.12133@fcom.cc.utah.edu>
Organization: AT&T Bell Laboratories, Murray Hill NJ
Lines: 56

let me start by thanking terry for a posting full of content.

In article <1993Jan7.033153.12133@fcom.cc.utah.edu>, terry@cs.weber.edu (A Wizard of Earth C) writes:
> Consider a newline terminated text database containing fixed length lines,
> or consider a database consisting of variant text records in fixed fields.
> In either case, the amount of data per field is now variant on Runic encoding.
> For instance, if we accept the Plan-9 solution, an application used in
> both England and the US will vary as to how much data is representable per
> fixed field based on whether or not that data contains the English "#"
> character. This gets worse the further you get from base ASCII coding.

this is true (and is a bummer) but has been true for a bunch of folks
already who seem to be able to cope. anyone who uses ISO 2022
(presumably ohta-san) already has a variable character-to-byte ratio.
all you know is that a character (in UTF-2) cannot take more than
5 bytes (or 3 right now).

> Consider that Runic encoding is antithetical in terms of single character
> changes for fixed record length files by virtue of its ability to either
> change record size (destroying the seek-offset record addressing) or by
> changing the amount of data representable in a field (destroying the
> ability to use fixed-length fields for input in the front end client).

i don't understand this. all runic encoding does is imply that there
MAY be situations where if you change a single character, another
character may need to be dropped. is this such a big deal?
fixed-length records have always had a similar property; sometimes,
you can't add anything more. if the records are truly fixed size,
then they are of fixed size and you can continue to address them
directly.

> Padding is unacceptable, both because it destroys the meaning of field width
> in the front end and because of the need for memory-mapping of files. A
> database program is likely to require the ability to either partially or
> fully control its own pages.

say what? there seems to be a confusion here between what is visible
to the user and what is stored. let's make it concrete. say we have a
fixed length record of 5 fields, each exactly 7 characters wide. on
ascii systems, the record is almost certainly 35 bytes long. if you
have runic (variable-length) encoding, you would probably (for the
current 10646) make the record 105 bytes long, and right justify each
field in a 21 byte space. the padding here is not visible to the user
and clearly has no effect on memory-mapping or any kind of sharing of
data structures. you can say this is wasteful of space (and i'll
agree; that's why we love fixed-length records!), but it certainly
doesn't have the impact you describe.

> text deleted, all depicting the advantages of attributed non-runic
> encoding...

your arithmetic makes it seem the attribution is on a per file basis.
that doesn't seem to handle the notion of files containing characters
from multiple character sets. could you please elaborate a little, or
tell me where to look for such an elaboration?

andrew hume
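
The fixed-length record arithmetic in the post (5 fields of 7 characters, 21 bytes per field, 105 bytes per record) can be illustrated with a minimal Python sketch. Nothing here comes from the original thread: it assumes UTF-8 as a stand-in for UTF-2, a 3-byte ceiling per character, and a NUL pad byte, and the names `pack_field`, `unpack_field`, `pack_record`, `unpack_record`, and `record_offset` are hypothetical.

```python
FIELDS_PER_RECORD = 5
CHARS_PER_FIELD = 7
MAX_BYTES_PER_CHAR = 3  # assumption: the "3 right now" ceiling from the post
FIELD_BYTES = CHARS_PER_FIELD * MAX_BYTES_PER_CHAR  # 21 bytes per field
RECORD_BYTES = FIELDS_PER_RECORD * FIELD_BYTES      # 105 bytes per record
PAD = b"\x00"  # hypothetical pad byte; the post does not name one


def pack_field(text):
    """Encode a field and right-justify it in a fixed 21-byte slot."""
    if len(text) > CHARS_PER_FIELD:
        raise ValueError("field is wider than 7 characters")
    raw = text.encode("utf-8")
    if len(raw) > FIELD_BYTES:
        # Only reachable with characters needing more than 3 bytes each.
        raise ValueError("encoded field does not fit in 21 bytes")
    return raw.rjust(FIELD_BYTES, PAD)


def unpack_field(slot):
    """Strip the storage-only padding and decode what the user sees."""
    return slot.lstrip(PAD).decode("utf-8")


def pack_record(fields):
    """Concatenate exactly 5 packed fields into one 105-byte record."""
    if len(fields) != FIELDS_PER_RECORD:
        raise ValueError("a record has exactly 5 fields")
    return b"".join(pack_field(f) for f in fields)


def unpack_record(record):
    """Split a 105-byte record back into its 5 text fields."""
    return [
        unpack_field(record[i:i + FIELD_BYTES])
        for i in range(0, RECORD_BYTES, FIELD_BYTES)
    ]


def record_offset(n):
    """Byte offset of record n; it depends only on n, not on the content."""
    return n * RECORD_BYTES
```

Under these assumptions every record occupies the same 105 bytes whatever characters it holds, so record n still starts at byte 105 * n and seek-offset addressing is unaffected; the padding exists only in storage and is stripped before anything reaches the user.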