- Newsgroups: comp.std.internat
- Path: sparky!uunet!mcsun!news.funet.fi!network.jyu.fi!tarzan!tt
- From: tt@tarzan.jyu.fi (Tapani Tarvainen)
- Subject: Re: Language tagging
- In-Reply-To: avg@rodan.UU.NET's message of 5 Jan 1993 20:42:38 -0500
- Message-ID: <TT.93Jan7085019@tarzan.jyu.fi>
- Summary: "o is "o is "o
- Originator: tt@tarzan.math.jyu.fi
- Sender: news@jyu.fi (News articles)
- Nntp-Posting-Host: tarzan.math.jyu.fi
- Organization: University of Jyvaskyla
- References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp>
- <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET>
- Date: Thu, 7 Jan 1993 06:50:19 GMT
- Lines: 180
-
- In article <1iddeeINN58g@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- [heavily edited]
-
- >Now, I think everybody agrees that the "ultimate encoding" is
- >the one which provides the complete information about which
- >language is used -- it solves all the problems.
-
- While such an encoding might be called "ultimate", it wouldn't solve
- all problems. In particular, the encoding mustn't force one to
- provide information that's irrelevant or not available.
-
- A good encoding is one which allows one to provide the information
- one wants, no less, no more. E.g., things like typeface used
- may be relevant in some contexts but not always, and I doubt
- many would want them in the character set. Whether language
- should be there, or in how great detail, is debatable
- (indeed the subject of the present debate).
-
-
- >Let's analyse the other properties of the ultimate encoding:
-
- (I make comments about Vadim's proposed encoding below but I've
- deleted the description. Those interested in this should read Vadim's
- article first.)
-
- >1) it explicitly distinguishes between the natural languages.
-
- > I think this property can be easily dropped.
-
- Agreed.
-
-
- >2) there is an *algorithm* for converting from upper case to lower case
- > and vice versa.
-
- > This property is extremely useful
-
- Agreed. It is also very difficult to do completely for all languages
- without knowing the language. E.g., in some languages several lower
- case characters may map to the same upper case character. Perhaps it
- would be sufficient to have a case-insensitive comparison algorithm?
- A glyph-based encoding could probably handle that in most cases;
- not all, though (like the Turkish dotless i).
- Here your system clearly wins.
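
To make the case-mapping point concrete, here is a minimal sketch. It assumes Python 3 and its built-in Unicode case mapping, neither of which is part of the discussion above; `caseless_equal` and `turkish_lower` are hypothetical helper names. It shows two lower case letters sharing one upper case letter, and a language-independent caseless comparison failing for the Turkish dotless i unless the language is supplied from outside.

```python
# Default (language-independent) Unicode case mapping, as exposed by Python 3.
DOTLESS_I = "\u0131"     # Turkish lower case dotless i
DOTTED_CAP_I = "\u0130"  # Turkish upper case dotted I

# Two different lower case letters map to the same upper case letter.
same_upper = "i".upper() == DOTLESS_I.upper() == "I"

# The round trip is therefore lossy: upper-casing the dotless i and
# lower-casing the result yields the ordinary dotted "i".
round_trip = DOTLESS_I.upper().lower()


def caseless_equal(a: str, b: str) -> bool:
    """Language-independent caseless comparison using default case folding."""
    return a.casefold() == b.casefold()


def turkish_lower(s: str) -> str:
    """Hypothetical helper: lower-case with the Turkish I/dotless-i pairing."""
    return s.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()


if __name__ == "__main__":
    print(same_upper, repr(round_trip))
    print(caseless_equal("M\u00d6LLER", "m\u00f6ller"))
    print(caseless_equal("ISPARTA", DOTLESS_I + "sparta"))
    print(turkish_lower("ISPARTA"))
```

The default folding handles the o-umlaut without knowing the language, but the Turkish pair only comes out right when the caller already knows the text is Turkish.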
-
-
- >3) there is an *algorithm* for lexicographical sorting (and for
- > that matter for range comparisons).
-
- > This is necessary if the now-ubiquitous semantics of regular
- > expressions, shell globbing and sorted directories (and, of
- > course, neat sorted reports) is to be preserved.
-
- > Several people argued that this functionality is not actually
- > necessary because users will want to see strings sorted
- > according to their native language's rules.
-
- Rather, it isn't sufficient.
-
- > While it is
- > a neat idea it still does not eliminate the necessity of a
- > generic algorithm used as fallback for "customized" sorting
- > algorithms -- what, for example, is the idea of sorting Chinese
- > using rules of Finnish?
-
- Nor does the universal algorithm eliminate the necessity of
- language-specific algorithms. (More about this below.)
-
- Nonetheless you make a good point. Finnish sorting rules are no good
- for non-Latin scripts.
-
- Let's look at it this way: How would a Finn want to see Chinese
- names sorted? If (as is likely) he doesn't know Chinese he either
- couldn't care less, or would want them transliterated into Latin
- characters (and then sorted by Finnish rules). If he knows Chinese,
- he'll want the Chinese ordering -- or one of them, if (as I suspect)
- there are several. Or if he knows another language that uses
- the Chinese character set but different sorting he might prefer that.
- In any case, he wants an order that makes sense to _him_.
-
- How about sorting a list of European names (from various Latin-based
- languages) in Japan? Here a default "proto-Latin" sorting might make
- sense. However, if the list is to be handed to guests who come from
- various countries it might be desirable to sort it differently for
- each country, or use the most common language (which may but need
- not be English). A Spanish delegation in Japan would probably
- appreciate seeing their names correctly sorted by Spanish rules.
-
- I don't know how that should be handled. Nonetheless, since several
- languages using the same script have different sorting rules, external
- information about the language is still needed, and a generic algorithm
- is effectively just providing a default language for each script.
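
A small sketch of why the same Latin-script list needs outside language information to sort. It is written in Python purely for illustration, and the key functions and name list are made up. The collation tables are deliberately tiny: they assume only that Finnish files a-ring, a-umlaut and o-umlaut as separate letters after z, and that a German dictionary-style ordering files umlauts with their base vowels.

```python
# Toy collation keys for two Latin-script conventions (illustrative only).
FINNISH_TAIL = {"å": "{a", "ä": "{b", "ö": "{c"}  # "{" sorts after "z" in ASCII
GERMAN_FOLD = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}


def finnish_key(name):
    # a-ring, a-umlaut and o-umlaut are separate letters placed after z.
    return "".join(FINNISH_TAIL.get(ch, ch) for ch in name.lower())


def german_key(name):
    # Umlauts are filed with their base vowels; the raw spelling breaks ties.
    folded = "".join(GERMAN_FOLD.get(ch, ch) for ch in name.lower())
    return (folded, name.lower())


names = ["Möller", "Mutka", "Moller", "Mäkinen", "Maki"]
by_finnish = sorted(names, key=finnish_key)
by_german = sorted(names, key=german_key)

if __name__ == "__main__":
    print(by_finnish)
    print(by_german)
```

Neither order is wrong; each only makes sense to a reader who expects it, which is why a single built-in order amounts to picking a default language for the script.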
-
- A pure locale system clearly won't do in multilingual environments.
- Nonetheless some things are, IMHO, best handled with locales.
- Perhaps it should be possible to specify the default language (say,
- in an environment variable) separately for each script one is
- concerned with (setenv LANG 'LATIN:finnish;CYRILLIC:russian;HAN:chinese'
- or whatever) and fall back to a default for the rest.
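
One way such a per-script setting might be read is sketched below. No existing system is known to interpret LANG this way; the `SCRIPT:language;...` syntax is taken from the example above, while the function names and the "generic" fallback value are assumptions made for illustration.

```python
def parse_script_languages(spec):
    """Parse 'LATIN:finnish;CYRILLIC:russian' into {'LATIN': 'finnish', ...}."""
    table = {}
    for entry in spec.split(";"):
        script, sep, language = entry.partition(":")
        if sep and script.strip() and language.strip():
            table[script.strip().upper()] = language.strip().lower()
        # Entries without a "SCRIPT:language" shape are silently skipped.
    return table


def language_for(script, table, default="generic"):
    """Return the preferred language for a script, or the fallback."""
    return table.get(script.upper(), default)


prefs = parse_script_languages("LATIN:finnish;CYRILLIC:russian;HAN:chinese")

if __name__ == "__main__":
    print(language_for("latin", prefs))   # a script the user has configured
    print(language_for("greek", prefs))   # a script left to the default
```

A sorting routine could then look up the collation rules by `language_for(script, prefs)` and use the generic order only for scripts the user has not listed.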
-
- Your encoding helps here only by making programming easier and that
- only if the default is usable often enough -- and even then it can be
- harmful by leading programmers to ignore situations where it wouldn't
- fit (and consequently no program could sort Finnish correctly,
- for I'm sure it wouldn't fit the default anyway; oh well, I guess
- that would mean more jobs for Finnish programmers ...).
-
- > Moreover, the usage of a "universal" algorithm is unavoidable
- > when consumers of the pre-sorted information do not share
- > a cultural background.
-
- Any information must be targeted to people who understand the
- language it is in. Sorting multilingual wordlists should
- be done according to the language of context, or some language
- the recipient is expected to know. In any case, I don't see this
- as requiring anything but a "default" language -- and it might
- be different for different groups of consumers.
-
- However, there's another issue with sorting where your proposal is
- disastrous: A single list of names may need to be sorted differently
- on different occasions depending on the target language. If Spanish
- ll and ch are considered individual characters, and German and Finnish
- a-umlaut as distinct characters, this means that a Finnish reader must
- know the language a name originated in in order to find it. It is
- impossible to explain to a layman that M"oller and M"oller are
- different names and in different places in the directory because one is
- German and the other Swedish.
-
- Even worse, the problem isn't limited to sorting: E.g., assume you
- want to search a database for the works of one Mr. M"oller without
- knowing where he comes from. Or when entering a reference you have
- on paper into a database: how do you know which "o to type?
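
The search problem can be modelled roughly as follows. This is a hypothetical model in Python, not the encoding actually proposed in the article being answered: each character simply carries a language tag, and `tag` and `untagged` are illustrative names. The o-umlaut in the code is the character written "o above.

```python
def tag(text, language):
    """Model a language-tagged string as a tuple of (language, char) pairs."""
    return tuple((language, ch) for ch in text)


def untagged(tagged):
    """Drop the language tags and recover the plain spelling."""
    return "".join(ch for _, ch in tagged)


database = [
    tag("Möller", "german"),
    tag("Möller", "swedish"),
    tag("Virtanen", "finnish"),
]

# The searcher has to guess a language just to type the name.
query = tag("Möller", "swedish")

# Comparing tagged strings directly misses the German entry.
exact_hits = [entry for entry in database if entry == query]

# Finding every M"oller requires throwing the tags away again.
loose_hits = [entry for entry in database if untagged(entry) == untagged(query)]

if __name__ == "__main__":
    print(len(exact_hits), len(loose_hits))
```

In this model the tag is only useful once it has been discarded, which is the situation of someone typing a name from paper without knowing its origin.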
-
-
- >4) indication of the language is useful for hyphenation.
-
- Yes. However, I think it's really useful only if the exact language
- is known, and we already agreed not to put that in the character
- encoding.
-
- > Dictionary-based hyphenation algorithms do not need the
- > language specification because the words themselves indicate
- > the language;
-
- This is not strictly true, as similarly spelled words occur
- in different languages and are hyphenated differently.
-
- > if the word is not in the dictionary the
- > generic rules should be applied.
-
- > Therefore it is desirable (though not necessary) that
- > the letter classes (ie. which ones are vowels) are preserved.
-
- I doubt any generic rule would do much good. At least among
- languages using Latin characters recognizing vowels won't help
- anything. The only universal hyphenation-related thing I
- might put in the character set is a hyphenation hint character.
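
A hyphenation hint character could work along these lines. The sketch assumes the hint is the SOFT HYPHEN (0xAD in ISO 8859-1, U+00AD in Unicode), which is one possible choice rather than anything settled in this thread; `break_at_hint` is a hypothetical helper written in Python.

```python
SHY = "\u00ad"  # SOFT HYPHEN, used here as the hyphenation hint


def break_at_hint(word, width):
    """Split `word` at the last hint whose head, plus '-', fits in `width`.

    Returns (head, tail) with the hints removed; head is '' when no
    hinted break point fits.
    """
    parts = word.split(SHY)
    split_at = 0
    for i in range(1, len(parts)):
        # Length of the text before the i-th hint, plus the visible hyphen.
        if len("".join(parts[:i])) + 1 <= width:
            split_at = i
    if split_at == 0:
        return "", "".join(parts)
    return "".join(parts[:split_at]) + "-", "".join(parts[split_at:])


hinted = "hy" + SHY + "phen" + SHY + "ation"

if __name__ == "__main__":
    for width in (8, 5, 2):
        print(width, break_at_hint(hinted, width))
```

The formatter needs no knowledge of the language here: whoever wrote the word supplied the break points, so the rules stay outside the character set.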
-
-
- >5) typographical-quality fonts often differentiate between
- > "similar" glyphs reproducible on lower-resolution devices.
-
- > The proposed encoding methodology (see later) "automatically"
- > separates graphic sets, so the ability to output typography-
- > quality text is essentially "free" in the proposed encoding.
-
- I'm not sure about this one. Isn't it just a question of
- deciding exactly which are considered different glyphs
- in a glyph-based encoding?
-
- ****
-
- >The idea is to leave ALL language specifics at the point of input
- >where the language is supposedly known.
-
- That assumption is simply false far too often. Not only may the
- language of a word be unknown, it may not even exist. (What language
- is M"oller in this sentence?)
- --
- Tapani Tarvainen (tt@math.jyu.fi, tarvainen@finjyu.bitnet)
-