NetNews Usenet Archive 1993 #1

Newsgroups: comp.std.internat
Path: sparky!uunet!mcsun!news.funet.fi!network.jyu.fi!tarzan!tt
From: tt@tarzan.jyu.fi (Tapani Tarvainen)
Subject: Re: Language tagging
In-Reply-To: avg@rodan.UU.NET's message of 5 Jan 1993 20:42:38 -0500
Message-ID: <TT.93Jan7085019@tarzan.jyu.fi>
Summary: "o is "o is "o
Originator: tt@tarzan.math.jyu.fi
Sender: news@jyu.fi (News articles)
Nntp-Posting-Host: tarzan.math.jyu.fi
Organization: University of Jyvaskyla
References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET>
Date: Thu, 7 Jan 1993 06:50:19 GMT
Lines: 180

In article <1iddeeINN58g@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
[heavily edited]

>Now, I think everybody agrees that the "ultimate encoding" is
>the one which provides the complete information about which
>language is used -- it solves all the problems.

While such an encoding might be called "ultimate", it wouldn't solve
all problems. In particular, the encoding mustn't force one to
provide information that's irrelevant or not available. A good
encoding is one which allows one to provide the information one
wants, no less, no more. E.g., things like the typeface used may be
relevant in some contexts but not always, and I doubt many would want
them in the character set. Whether language should be there, or in
how great detail, is debatable (indeed the subject of the present
debate).

>Let's analyse the other properties of the ultimate encoding:

(I make comments about Vadim's proposed encoding below but I've
deleted the description. Those interested in this should read
Vadim's article first.)

>1) it explicitly distinguishes between the natural languages.
>   I think this property can be easily dropped.

Agreed.

>2) there is an *algorithm* for converting from upper case to lower case
>   and vice versa.
>   This property is extremely useful

Agreed. It is also very difficult to do completely for all languages
without knowing the language.
E.g., in some languages several lower case characters may map to the
same upper case character. Perhaps it would be sufficient to have
a case-insensitive comparison algorithm? A glyph-based encoding
could probably handle that in most cases; not all, though (like the
Turkish dotless i). Here your system clearly wins.

>3) there is an *algorithm* for lexicographical sorting (and for
>   that matter for range comparisons).
>   This is necessary if the now-ubiquitous semantics of regular
>   expressions, shell globbing and sorted directories (and, of
>   course, neat sorted reports) is to be preserved.
>   Several people argued that this functionality is not actually
>   necessary because users will want to see strings sorted
>   according to their native language's rules).

Rather it isn't sufficient.

>   While it is
>   a neat idea it still does not eliminate the necessity of a
>   generic algorithm used as fallback for "customized" sorting
>   algorithms -- what, for example, is the idea of sorting Chinese
>   using rules of Finnish?

Nor does the universal algorithm eliminate the necessity of
language-specific algorithms. (More about this below.)

Nonetheless you make a good point. Finnish sorting rules are
no good for non-Latin scripts.

Let's look at it this way: How would a Finn want to see Chinese
names sorted? If (as is likely) he doesn't know Chinese he either
couldn't care less, or would want them transliterated into Latin
characters (and then sorted by Finnish rules). If he knows Chinese,
he'll want the Chinese ordering -- or one of them, if (as I suspect)
there are several. Or if he knows another language that uses the
Chinese character set but different sorting he might prefer that.
In any case, he wants an order that makes sense to _him_.

How about sorting a list of European names (from various Latin-based
languages) in Japan? Here a default "proto-Latin" sorting might make
sense.
However, if the list is to be handed to guests who come from various
countries it might be desirable to sort it differently for each
country, or use the most common language (which may but need not be
English). A Spanish delegation in Japan would probably appreciate
seeing their names correctly sorted by Spanish rules. I don't know
how that should be handled.

Nonetheless, since several languages using the same script have
different sorting rules, external information about the language is
still needed and a generic algorithm is effectively just providing
a default language for each script.

A pure locale-system clearly won't do in multilingual environments.
Nonetheless some things are, IMHO, best handled with locales.
Perhaps it should be possible to specify the default language (like,
in an environment variable) separately for each script one is
concerned with (setenv LANG 'LATIN:finnish;CYRILLIC:russian;HAN:chinese'
or whatever) and fall back to a default for the rest.

Your encoding helps here only by making programming easier, and that
only if the default is usable often enough -- and even then it can be
harmful by leading programmers to ignore situations where it wouldn't
fit (and consequently no program could sort Finnish correctly, for
I'm sure it wouldn't fit the default anyway; oh well, I guess that
would mean more jobs for Finnish programmers ...).

>   Moreover, the usage of a "universal" algorithm is unavoidable
>   when consumers of the pre-sorted information do not share
>   a cultural background.

Any information must be targeted to people who understand the
language it is in. Sorting multilingual wordlists should be done
according to the language of context, or some language the recipient
is expected to know. In any case, I don't see this as requiring
anything but a "default" language -- and it might be different for
different groups of consumers.
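
The per-script default suggested above can be made concrete with
a small sketch. This is only an illustration under stated assumptions,
not a proposal for a real interface: the function names, the German
fallback, and above all the two sorting rules are hypothetical and
heavily simplified (Finnish placing å, ä, ö after z; German dictionary
order treating ö like o). Real collation rules are more involved, and
M"oller is written here with a real o-umlaut.

```python
import unicodedata

# Simplified, illustrative rules only (assumption, not a full collation).
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def finnish_key(word):
    """Rank letters by position in the Finnish alphabet (å, ä, ö after z)."""
    text = unicodedata.normalize("NFC", word.lower())
    # Characters outside the alphabet get -1 and therefore sort first.
    return [FINNISH_ALPHABET.find(ch) for ch in text]


def german_key(word):
    """Drop diacritics so that ä, ö, ü sort together with a, o, u."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical registry of language-specific sort keys.
SORT_KEYS = {"finnish": finnish_key, "german": german_key}


def parse_script_defaults(spec):
    """Parse 'LATIN:finnish;CYRILLIC:russian' into {'LATIN': 'finnish', ...}."""
    defaults = {}
    for item in spec.split(";"):
        if item:
            script, _, language = item.partition(":")
            defaults[script.strip().upper()] = language.strip().lower()
    return defaults


def sort_latin(words, spec):
    """Sort Latin-script words by the default language named in spec."""
    # Falling back to German is an arbitrary choice made for this sketch.
    language = parse_script_defaults(spec).get("LATIN", "german")
    return sorted(words, key=SORT_KEYS.get(language, german_key))


names = ["Möller", "Moller", "Muller", "Zetterberg"]
print(sort_latin(names, "LATIN:finnish;CYRILLIC:russian;HAN:chinese"))
print(sort_latin(names, "LATIN:german"))
```

With the Finnish default, Möller lands after Muller; with the German
one it sorts next to Moller. The same list thus comes out in
a different order depending only on the external language setting,
which is the point made above: the script alone does not determine
the order.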

However, there's another issue with sorting where your proposal is
disastrous: A single list of names may need to be sorted differently
on different occasions depending on the target language. If Spanish
ll and ch are considered individual characters, and German and
Finnish a-umlaut as distinct characters, this means that a Finnish
reader must know the language a name originated in in order to find
it. It is impossible to explain to a layman that M"oller and M"oller
are different names and in different places in the directory because
one is German and the other Swedish.

Even worse, the problem isn't limited to sorting: E.g., assume you
want to search a database for the works of one Mr. M"oller without
knowing where he comes from. Or when entering a reference you have
on paper into a database: how do you know which "o to type?

>4) indication of the language is useful for hyphenation.

Yes. However, I think it's really useful only if the exact language
is known, and we already agreed not to put that in the character
encoding.

>   Dictionary-based hyphenation algorithms do not need the
>   language specification because the words themselves indicate
>   the language;

This is not strictly true, as similarly spelled words occur
in different languages and are hyphenated differently.

>   if the word is not in the dictionary the
>   generic rules should be applied.
>   Therefore it is desirable (though not necessary) that
>   the letter classes (i.e. which ones are vowels) are preserved.

I doubt any generic rule would do much good. At least among
languages using Latin characters recognizing vowels won't help
anything. The only universal hyphenation-related thing I might put
in the character set is a hyphenation hint character.

>5) typographical-quality fonts often differentiate between
>   "similar" glyphs reproducible on lower-resolution devices.
>   The proposed encoding methodology (see later) "automatically"
>   separates graphic sets, so the ability to output
>   typography-quality text is essentially "free" in the proposed
>   encoding.

I'm not sure about this one. Isn't it just a question of deciding
exactly which are considered different glyphs in a glyph-based
encoding?

****

>The idea is to leave ALL language specifics at the point of input
>where the language is supposedly known.

That assumption is simply false far too often. Not only may the
language of a word be unknown, it may not even exist.
(What language is M"oller in this sentence?)

--
Tapani Tarvainen (tt@math.jyu.fi, tarvainen@finjyu.bitnet)