NetNews Usenet Archive 1993 #1

Internet Message Format | 1993-01-07 | 4.0 KB

Path: sparky!uunet!cs.utexas.edu!sun-barr!sh.wide!wnoc-tyo-news!cs.titech!titccy.cc.titech!necom830!mohta
From: mohta@necom830.cc.titech.ac.jp (Masataka Ohta)
Newsgroups: comp.std.internat
Subject: Re: Language tagging
Message-ID: <2643@titccy.cc.titech.ac.jp>
Date: 7 Jan 93 13:09:38 GMT
References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET> <TT.93Jan7085019@tarzan.jyu.fi>
Sender: news@titccy.cc.titech.ac.jp
Organization: Tokyo Institute of Technology
Lines: 96

In article <TT.93Jan7085019@tarzan.jyu.fi> tt@tarzan.jyu.fi (Tapani Tarvainen) writes:

>>Now, i think everybody agrees that the "ultimate encoding" is
>>the one which provides the complete information about which
>>language is used -- it solves all the problems.
>
>While such an encoding might be called "ultimate",

Agreed.

As the proposal is much more rational than 10646 and cannot be
compatible with 10646, let's throw away 10646/Unicode and make
the UCS (Ultimate coded Character Set).

>it wouldn't solve
>all problems.  In particular, the encoding mustn't force one to
>provide information that's irrelevant or not available.

There should be one special language code: a wild card.

If the language distinction information is not available, we should
assign the wild card code to the character, which will then match
characters of any language.

>>2) there is an *algorithm* for converting from upper case to lower case
>>   and vice versa.
>
>>   This property is extremely useful
>
>Agreed.  It is also very difficult to do completely for all languages
>without knowing the language.  E.g., in some languages several lower
>case characters may map to same upper case character.  Perhaps it
>would be sufficient to have a case-insensitive comparison algorithm?

Agreed.

>A pure locale-system clearly won't do in multilingual environments.
>Nonetheless some things are, IMHO, best handled with locales.

Agreed.
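The wild card language code suggested above can be illustrated with a small sketch. It is only one possible reading of the idea, not part of the original proposal: the names `WILD`, `TaggedChar`, `chars_match`, `strings_match` and `tag`, and the tags "de" and "sv", are all hypothetical, and the language tag is modelled as a separate field rather than as part of the character code.

```python
from typing import List, NamedTuple, Sequence

WILD = "*"  # hypothetical wild card language code (not defined in the post)


class TaggedChar(NamedTuple):
    char: str  # the character itself
    lang: str  # hypothetical language tag such as "de" or "sv", or WILD


def chars_match(a: TaggedChar, b: TaggedChar) -> bool:
    """Same character, and the language tags agree or either one is WILD."""
    if a.char != b.char:
        return False
    return a.lang == b.lang or WILD in (a.lang, b.lang)


def strings_match(xs: Sequence[TaggedChar], ys: Sequence[TaggedChar]) -> bool:
    """Compare two tagged strings position by position."""
    return len(xs) == len(ys) and all(chars_match(x, y) for x, y in zip(xs, ys))


def tag(text: str, lang: str) -> List[TaggedChar]:
    """Attach one language code to every character of text."""
    return [TaggedChar(c, lang) for c in text]


german = tag("name", "de")
swedish = tag("name", "sv")
unknown = tag("name", WILD)  # language information not available

print(strings_match(german, unknown))  # True: the wild card matches any language
print(strings_match(german, swedish))  # False: the tags differ and neither is WILD
```

One consequence of this reading is that matching is not transitive: a wild card character matches both the "de" and the "sv" character, while those two do not match each other, so such a comparison cannot serve directly as an equivalence relation for hashing or sorting.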

>However, there's another issue with sorting where your proposal is
>disastrous: A single list of names may need to be sorted differently
>on different occasions depending on the target language.  If Spanish
>ll and ch are considered individual characters, and German and Finnish
>a-umlaut as distinct characters, this means that a Finnish reader must
>know the language a name originated in in order to find it.

Are 'll' and 'ch' considered individual characters having distinct
code points? Or are they 'l', 'l', 'c' and 'h'? I think the latter
is better.

>It is
>impossible to explain to a layman that M"oller and M"oller are
>different names and in different places in the directory because one is
>German and the other Swedish.

In this case, of course, language distinction should be the last key
used for sorting, shouldn't it?

>Even worse, the problem isn't limited to sorting: E.g., assume you
>want to search a database for the works of one Mr. M"oller without
>knowing where he comes from.  Or when entering a reference you have
>on paper to a database: how do you know which "o to type?

I think the problem can be solved by introducing a new notation for
regular expressions.

For example, let '[[A]]' represent all Latin alphabet 'A's regardless
of language, and let '[[[a]]]' represent all variants of the Latin
alphabet 'A', such as "A with ring above".

Then, if you want to search for Mr. M"oller of any nationality, use
the following pattern

    [[M]][["o]][[l]][[l]][[e]][[r]]

or just use characters of wild card nationality.

>>5) typographical-quality fonts often differentiate between
>>   "similar" glyphs reproducible on lower-resolution devices.

It should be stressed that corresponding characters in China and
Japan are different even on high resolution CRTs.

>>The idea is to leave ALL language specifics at the point of input
>>where the language is supposedly known.

How many nationality distinctions, do you think, are necessary for
a given single character?
It seems to me that, for Han characters, we need 5 or so and it could
be represented with 3 bits.

                                                Masataka Ohta
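The bracket notation proposed in the post for language-independent searching might be sketched as follows. This is an illustration under several assumptions, not the author's method: text is modelled as a list of (character, language) pairs; `[[x]]` matches the exact character x under any language tag; `[[[x]]]` is read as "same base letter, ignoring diacritics and case", approximated here with Unicode NFD decomposition. The names `TaggedChar`, `compile_pattern`, `search` and the tags "de" and "sv" are hypothetical, and the sketch uses the real o-umlaut character where the post writes `"o` in ASCII.

```python
import re
import unicodedata
from typing import Callable, List, NamedTuple, Sequence


class TaggedChar(NamedTuple):
    char: str  # the character itself
    lang: str  # hypothetical language tag, e.g. "de" or "sv"


Predicate = Callable[[TaggedChar], bool]

# The '[[[x]]]' alternative is listed first so the longer bracket form wins.
TOKEN = re.compile(r"\[\[\[(.)\]\]\]|\[\[(.)\]\]")


def base_letter(ch: str) -> str:
    """Drop combining marks and case (an assumed reading of 'all variants')."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def any_language(x: str) -> Predicate:
    """[[x]]: the exact character x, whatever its language tag."""
    return lambda tc: tc.char == x


def any_variant(x: str) -> Predicate:
    """[[[x]]]: any character whose base letter equals that of x."""
    target = base_letter(x)
    return lambda tc: base_letter(tc.char) == target


def compile_pattern(pattern: str) -> List[Predicate]:
    """Turn a bracket pattern into one predicate per character position."""
    preds: List[Predicate] = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        variant, plain = m.group(1), m.group(2)
        preds.append(any_variant(variant) if variant is not None
                     else any_language(plain))
        pos = m.end()
    if pos != len(pattern):
        raise ValueError(f"unexpected text at offset {pos}")
    return preds


def search(preds: Sequence[Predicate], text: Sequence[TaggedChar]) -> List[int]:
    """Return every start index where all predicates match consecutively."""
    n = len(preds)
    return [i for i in range(len(text) - n + 1)
            if all(p(tc) for p, tc in zip(preds, text[i:i + n]))]


def tag(text: str, lang: str) -> List[TaggedChar]:
    return [TaggedChar(c, lang) for c in text]


directory = tag("M\u00f6ller", "de") + tag(" ", "de") + tag("M\u00f6ller", "sv")
moller = compile_pattern("[[M]][[\u00f6]][[l]][[l]][[e]][[r]]")
print(search(moller, directory))  # [0, 7]: both the German and the Swedish entry
```

Because the predicates never look at the language field, the German and the Swedish spelling are found by the same pattern, which is the behaviour the post asks for when the searcher does not know where Mr. M"oller comes from.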