NetNews Usenet Archive 1993 #1



Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Language tagging
Date: 7 Jan 1993 16:12:20 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 90
Message-ID: <1ii6bkINNf6c@rodan.UU.NET>
References: <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET> <TT.93Jan7085019@tarzan.jyu.fi>
NNTP-Posting-Host: rodan.uu.net

In article <TT.93Jan7085019@tarzan.jyu.fi> tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
>Let's look at it this way: How would a Finn want to see Chinese
>names sorted?

For many reasons -- say, to print the list and send it to his Chinese
correspondent.

(This reminds me of an old Soviet anecdote. Evening news in 2011:
"The fields of Washingtonshina* bring a good harvest of crops. Today's
light armed clashes on the Finno-Chinese border caused no casualties."
[* - a Russified province name in the form specific to officialese
newspeak])

Another problem locales are unable to solve is multilingual sorting,
especially with closely related languages.

>If (as is likely) he doesn't know Chinese he either
>couldn't care less, or would want them transliterated into Latin
>characters (and then sorted by Finnish rules).

Ever seen Chinese transliterated into Latin? :-) You generally can't
do it and keep it intelligible, because the phonetic structures of the
languages are quite different.

>How about sorting a list of European names (from various Latin-based
>languages) in Japan? Here a default "proto-Latin" sorting might make
>sense. However, if the list is to be handed to guests who come from
>various countries it might be desirable to sort it differently for
>each country, or use the most common language (which may but need
>not be English). A Spanish delegation in Japan would probably
>appreciate seeing their names correctly sorted by Spanish rules.

Proto-Latin, proto-Cyrillic, etc. sorting makes a lot of sense for
business applications, providing a uniform way to deal with
multilingual lists (and such sorting has an attractive quality -- it
"automatically" reduces to a national sorting after deletion of
foreign letters). In a way, it is a simplified version of the sorting
scheme large libraries use.

>A pure locale-system clearly won't do in multilingual environments.
>Nonetheless some things are, IMHO, best handled with locales.
>Perhaps it should be possible to specify the default language (like,
>in an environment variable) separately for each script one is
>concerned with (setenv LANG 'LATIN:finnish;CYRILLIC:russian;HAN:chinese')
>or whatever) and fall to a default in the rest.

It is not enough. Ukrainian, say, uses characters of both the Cyrillic
and Latin scripts, etc.

>However, there's another issue with sorting where your proposal is
>disastrous: A single list of names may need to be sorted differently
>on different occasions depending on the target language. If Spanish
>ll and ch are considered individual characters, and German and Finnish
>a-umlaut as distinct characters, this means that a Finnish reader must
>know the language a name originated in in order to find it. It is
>impossible to explain to a layman that M"oller and M"oller are
>different names and in different place in the directory because one is
>German and the other Swedish.

Most likely an end user will have a local sorting algorithm which
basically reduces glyphs to those of the particular language by
transliteration and then applies the same generic sorting algorithm.
At the same time, if you're sending a monolingual file of Swedish
words to China, it will be sorted correctly even if nobody around
knows what the rules are!

>Even worse, the problem isn't limited to sorting: E.g., assume you
>want to search a database for the works of one Mr. M"oller without
>knowing where he comes from. Or when entering a reference you have
>on paper to a database: how do you know which "o to type?

You have this problem with Unicode/ISO10646 right now -- consider,
say, the word BETA -- it can be encoded in more than a hundred ways!
The search routines should simply treat identical glyphs as identical;
you have to do it now anyway.

>This is not strictly true, as similarly spelled words occur
>in different languages and are hyphenated differently.

Any realistic example? :-)

>>The idea is to leave ALL language specifics at the point of input
>>where the language is supposedly known.
>
>That assumption is simply false far too often. Not only may the
>language of a word be unknown, it may not even exist. (What language
>is M"oller in this sentence?)

I'd say that this word belongs to the native language of Mr. M"oller :-)
You just didn't specify it.

--vadim
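
[Editorial sketch, not part of the original post.] The remark that
search routines should treat identical glyphs as identical can be
illustrated roughly as below. The LOOKALIKES table and the names
fold_glyphs and glyph_search are hypothetical, made up for this
illustration; the table is hand-written, covers only the four letters
of BETA, and is not taken from any standard.

```python
# Hypothetical, hand-written look-alike table (an assumption for this
# sketch): Greek and Cyrillic capitals that are drawn with the same
# glyph as a Latin capital are mapped to that Latin letter.
LOOKALIKES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
}


def fold_glyphs(text):
    """Replace each look-alike character with its Latin representative."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def glyph_search(needle, haystack):
    """Return the items of haystack that contain needle after folding."""
    folded_needle = fold_glyphs(needle)
    return [item for item in haystack if folded_needle in fold_glyphs(item)]


if __name__ == "__main__":
    greek = "\u0392\u0395\u03a4\u0391"     # BETA typed with Greek letters
    cyrillic = "\u0412\u0415\u0422\u0410"  # BETA typed with Cyrillic letters
    # Both spellings fold to the same Latin string, so either one finds
    # the other.
    print(fold_glyphs(greek) == fold_glyphs(cyrillic))
    print(len(glyph_search(greek, ["ALPHA", cyrillic, "BETA"])))
```

With three look-alike letters per position, this table alone already
admits 3**4 = 81 spellings of the same four glyphs. Folding both the
query and the stored text means the person typing does not have to
guess which script the original author used.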