NetNews Usenet Archive 1993 #1


Text File | 1993-01-04 | 2.6 KB | 55 lines

Newsgroups: comp.std.internat
Path: sparky!uunet!psinntp!ficc!peter
From: peter@ferranti.com (peter da silva)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <id.N4FW.SAC@ferranti.com>
Keywords: ISO10646 Unicode
Organization: Xenix Support, FICC
References: <1i13rrINNars@rodan.UU.NET> <id.68CW.A16@ferranti.com> <1i2m57INN4vr@rodan.UU.NET>
Date: Mon, 4 Jan 1993 17:35:38 GMT
Lines: 43

In article <1i2m57INN4vr@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
> In article <id.68CW.A16@ferranti.com> peter@ferranti.com (peter da silva) writes:
> >In article <1i13rrINNars@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
> >> We were talking about lexicographical sorting, not about phonetics.

> >But lexicographic sorting (actually, lexicographic ordering) is a minor part of
> >this. Most sorting computers do is algorithmic ordering, to optimise some
> >combination of operations on data structures (searching, for example). The
> >character set is irrelevant there.

> Wrong-o. Nobody does numerical sorts since invention of secondary
> indices.

I'm afraid you'll have to translate this sentence. It parses as valid
English, and uses appropriate sentence structure and terminology for the
context, but seems almost completely irrelevant to anything I said. For
efficient lookup the index needs to be ordered in some fashion whether
it's a flat table or a tree. Unless you find hashing adequate for all
possible purposes, perhaps?

> The problem is not in searching -- the problem is in presenting
> the information and in regular expressions ([a-z] - does it include "o?)

No. The regular expression '[a-z]' is a side effect of the fact that
ASCII happens to be in numerical order for the base alphanumeric
characters used in English computer text. It's invalid for EBCDIC, for
example.
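
A minimal Python sketch of the point about '[a-z]': a range in a bracket
expression is defined by code values, not by any notion of "letter".
Python's re module and its cp037 codec (one EBCDIC code page) are used
here purely for illustration and are not part of the original discussion;
the helper name codes_in_range is made up for this sketch.

```python
import re
import string

# A range is code-value based: o-umlaut (U+00F6) lies outside U+0061..U+007A,
# so '[a-z]' does not match it.
matches_o_umlaut = re.fullmatch(r"[a-z]", "\u00f6") is not None

def codes_in_range(codec: str) -> list:
    """Return every code value from 'a' to 'z' inclusive in a single-byte codec."""
    lo = "a".encode(codec)[0]
    hi = "z".encode(codec)[0]
    return list(range(lo, hi + 1))

ascii_span = codes_in_range("ascii")   # contiguous: only the 26 letters
ebcdic_span = codes_in_range("cp037")  # letters sit in three separate runs

# Code values inside the EBCDIC span that are not lowercase English letters.
letters = {ch.encode("cp037")[0] for ch in string.ascii_lowercase}
ebcdic_extras = [c for c in ebcdic_span if c not in letters]

print(matches_o_umlaut, len(ascii_span), len(ebcdic_span), len(ebcdic_extras))
```

In ASCII the span from 'a' to 'z' holds exactly the letters; in this EBCDIC
code page the same span also covers code values that are not letters, which
is why a numeric range is not a portable way to say "lowercase letter".
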
The POSIX alternative for what you *mean* here is something like the
character class '[:lower:]' (written '[[:lower:]]' as a complete bracket
expression), and I would hope that in the long term this will be extended
to specify localization information, for example '[:lower/english/usa:]',
so that it would allow loan words like clich'e or na"ive, or names like
'da Silva' (with a non blank space between the 'a' and 'S'). Sure, it's
a mouthful, so you'd do this:

	setenv LOWER '[:lower/english/usa:]'

You need to do that for scripts, anyway, since you want your program to
continue to work when it's downloaded from some site in Finland and used
in London or Beirut.
-- 
Peter da Silva                                            `-_-'
Ferranti International Controls Corporation                'U`
Sugar Land, TX  77487-5012 USA
+1 713 274 5180
"Zure otsoa besarkatu al duzu gaur?"
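
A rough Python sketch of the 'setenv LOWER' idea above: the script takes
its notion of "lowercase letter" from the environment instead of
hard-coding a range. Everything here is an assumption made for
illustration: the function name lower_pattern, the '[a-z]' fallback, and
the use of Python's re syntax. Python's re module does not implement
POSIX classes such as '[:lower:]', so in this sketch LOWER must hold a
character class written in Python's own regex syntax.

```python
import os
import re

# Fallback used when LOWER is unset; "[a-z]" is an assumption of this sketch.
DEFAULT_LOWER = "[a-z]"

def lower_pattern(env=None):
    """Compile a 'run of lowercase letters' pattern from the LOWER variable.

    `env` is any mapping; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    char_class = env.get("LOWER", DEFAULT_LOWER)
    # Wrap the class in a non-capturing group and require one or more matches.
    return re.compile("(?:" + char_class + ")+")

# A site that wants loan words accepted widens the class in one place.
site_env = {"LOWER": "[a-z\u00e9\u00ef]"}  # adds e-acute and i-diaeresis
accepts_loan_word = lower_pattern(site_env).fullmatch("na\u00efve") is not None
```

The script itself never mentions a particular alphabet; moving it to
another site means changing one variable rather than every pattern.
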