- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Language tagging
- Date: 7 Jan 1993 16:12:20 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 90
- Message-ID: <1ii6bkINNf6c@rodan.UU.NET>
- References: <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET> <TT.93Jan7085019@tarzan.jyu.fi>
- NNTP-Posting-Host: rodan.uu.net
-
- In article <TT.93Jan7085019@tarzan.jyu.fi> tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
- >Let's look at it this way: How would a Finn want to see Chinese
- >names sorted?
-
- For many reasons -- say, to print the list and send it to his Chinese
- correspondent. (This reminds me of an old Soviet anecdote:
- Evening news in 2011: "The fields of Washingtonshina* bring a good
- harvest of crops. Today's light armed clashes on the Finno-Chinese
- border caused no casualties." [* - a Russified province name in the
- form specific to official newspeak])
-
- Another problem locales are unable to solve is multilingual sorting,
- especially with closely related languages.
-
- >If (as is likely) he doesn't know Chinese he either
- >couldn't care less, or would want them transliterated into Latin
- >characters (and then sorted by Finnish rules).
-
- Ever seen Chinese transliterated into Latin? :-) You generally
- can't do it and keep it intelligible because the phonetic structures
- of the languages are quite different.
-
- >How about sorting a list of European names (from various Latin-based
- >languages) in Japan? Here a default "proto-Latin" sorting might make
- >sense. However, if the list is to be handed to guests who come from
- >various countries it might be desirable to sort it differently for
- >each country, or use the most common language (which may but need
- >not be English). A Spanish delegation in Japan would probably
- >appreciate seeing their names correctly sorted by Spanish rules.
-
- Proto-Latin, proto-Cyrillic etc. sorting makes a lot of sense for
- business applications, providing a uniform way to deal with
- multilingual lists (and such sorting has an attractive quality --
- it "automatically" reduces to a national sorting after deletion
- of foreign letters). In a way, it is a simplified version of the
- sorting scheme large libraries use.
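
A minimal sketch of that reduction property, in Python. The merged alphabet below is made up for illustration and is not taken from any standard; it simply places a few national letters where their own alphabets put them. Sorting uses the position of each letter in the merged alphabet, and deleting the letters a language does not use leaves that language's national order.

```python
import unicodedata

# Hypothetical merged "proto-Latin" alphabet (an assumption for this sketch):
# basic Latin plus Spanish n-tilde and the Swedish letters at the end.
PROTO_LATIN = "abcdefghijklmnñopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(PROTO_LATIN)}

def proto_key(word):
    """Sort key: position of each letter in the merged alphabet."""
    # NFC keeps accented letters as single precomposed code points.
    letters = unicodedata.normalize("NFC", word).lower()
    # Letters outside the merged alphabet sort after all known ones.
    return [RANK.get(ch, len(PROTO_LATIN)) for ch in letters]

def proto_sort(words):
    return sorted(words, key=proto_key)

def restrict(alphabet, national_letters):
    """Delete the letters that a national alphabet does not use."""
    return "".join(ch for ch in alphabet if ch in national_letters)

SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
SPANISH = "abcdefghijklmnñopqrstuvwxyz"

names = ["Öberg", "Zorn", "Nuñez", "Nunez", "Åkesson", "Nuz"]
print(proto_sort(names))
print(restrict(PROTO_LATIN, SWEDISH) == SWEDISH)
print(restrict(PROTO_LATIN, SPANISH) == SPANISH)
```

A real scheme would also need rules for digraphs such as Spanish ch and ll, which this sketch leaves out.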
-
- >A pure locale-system clearly won't do in multilingual environments.
- >Nonetheless some things are, IMHO, best handled with locales.
- >Perhaps it should be possible to specify the default language (like,
- >in an environment variable) separately for each script one is
- >concerned with (setenv LANG 'LATIN:finnish;CYRILLIC:russian;HAN:chinese')
- >or whatever) and fall to a default in the rest.
-
- It is not enough. Ukrainian, say, uses characters of both the
- Cyrillic and Latin scripts, etc.
-
- >However, there's another issue with sorting where your proposal is
- >disastrous: A single list of names may need to be sorted differently
- >on different occasions depending on the target language. If Spanish
- >ll and ch are considered individual characters, and German and Finnish
- >a-umlaut as distinct characters, this means that a Finnish reader must
- >know the language a name originated in in order to find it. It is
- >impossible to explain to a layman that M"oller and M"oller are
- >different names and in different place in the directory because one is
- >German and the other Swedish.
-
- Most likely an end user will have a local sorting algorithm which
- basically reduces foreign glyphs to those of his own language by
- transliteration and then applies the same generic sorting algorithm.
- At the same time, if you're sending a monolingual file of
- Swedish words to China, it would be sorted correctly even if nobody
- around knows what the rules are!
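
Roughly what such a local algorithm could look like for a Finnish reader, as a Python sketch. The reduction table is an assumption chosen for illustration (foreign letters mapped to the local letter they would be filed under), and the generic step is a plain alphabet-position sort.

```python
# Hypothetical reduction table for a Finnish reader (illustrative only):
# each foreign letter is replaced by the local letter it is filed under.
TO_FINNISH = str.maketrans({"ü": "y", "ø": "ö", "æ": "ä", "ß": "ss"})

FINNISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ORDER)}

def local_key(word):
    """Step 1: reduce foreign glyphs. Step 2: generic alphabet-order key."""
    reduced = word.lower().translate(TO_FINNISH)
    # Anything still unknown after reduction sorts last.
    return [RANK.get(ch, len(FINNISH_ORDER)) for ch in reduced]

def local_sort(words):
    # sorted() is stable, so names with equal keys keep their input order.
    return sorted(words, key=local_key)

names = ["Müller", "Mäkinen", "Myller", "Møller", "Möller", "Maas"]
print(local_sort(names))
```

With this table the reader finds Møller and Möller side by side without having to know which language either name came from.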
-
- >Even worse, the problem isn't limited to sorting: E.g., assume you
- >want to search a database for the works of one Mr. M"oller without
- >knowing where he comes from. Or when entering a reference you have
- >on paper to a database: how do you know which "o to type?
-
- You have this problem with Unicode/ISO 10646 right now -- consider, say,
- the word BETA -- it can be encoded in more than a hundred ways!
- The search routines should simply treat identical glyphs as identical;
- you have to do it now anyway.
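
A small Python sketch of a search that treats identical glyphs as identical. The lookalike table is a hand-picked subset assumed for this example (only the Greek and Cyrillic capitals that look like Latin B, E, T, A), not a complete table, so the number of spellings it enumerates is only what this subset produces.

```python
from itertools import product

# Hand-picked lookalike capitals (illustrative subset, an assumption):
# Greek and Cyrillic letters folded to the Latin letter they resemble.
LOOKALIKES = {
    "\u0391": "A", "\u0392": "B", "\u0395": "E", "\u03a4": "T",  # Greek
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u0422": "T",  # Cyrillic
}
FOLD = str.maketrans(LOOKALIKES)

def glyph_fold(text):
    """Replace each known lookalike with its Latin twin."""
    return text.translate(FOLD)

def glyph_find(haystack, needle):
    """Like str.find, but compares glyph-folded text."""
    # The fold is one character to one character, so offsets are preserved.
    return glyph_fold(haystack).find(glyph_fold(needle))

# Every spelling of BETA this table can produce: three choices per letter.
CHOICES = {
    latin: [latin] + [k for k, v in LOOKALIKES.items() if v == latin]
    for latin in "BETA"
}
variants = {"".join(p) for p in product(*(CHOICES[ch] for ch in "BETA"))}

greek_beta = "\u0392\u0395\u03a4\u0391"
print(len(variants))
print(glyph_find("the " + greek_beta + " release", "BETA"))
```

Folding both the text and the query keeps the search independent of which script the person typing happened to use.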
-
- >This is not strictly true, as similarly spelled words occur
- >in different languages and are hyphenated differently.
-
- Any realistic example? :-)
-
- >>The idea is to leave ALL language specifics at the point of input
- >>where the language is supposedly known.
- >
- >That assumption is simply false far too often. Not only may the
- >language of a word be unknown, it may not even exist. (What language
- >is M"oller in this sentence?)
-
- I'd say that this word belongs to the native language of Mr. M"oller :-)
- It's just that you didn't specify it.
-
- --vadim
-