- Path: sparky!uunet!cs.utexas.edu!sun-barr!sh.wide!wnoc-tyo-news!cs.titech!titccy.cc.titech!necom830!mohta
- From: mohta@necom830.cc.titech.ac.jp (Masataka Ohta)
- Newsgroups: comp.std.internat
- Subject: Re: Language tagging
- Message-ID: <2643@titccy.cc.titech.ac.jp>
- Date: 7 Jan 93 13:09:38 GMT
- References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET> <TT.93Jan7085019@tarzan.jyu.fi>
- Sender: news@titccy.cc.titech.ac.jp
- Organization: Tokyo Institute of Technology
- Lines: 96
-
- In article <TT.93Jan7085019@tarzan.jyu.fi>
- tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
-
- >>Now, I think everybody agrees that the "ultimate encoding" is
- >>the one which provides the complete information about which
- >>language is used -- it solves all the problems.
- >
- >While such an encoding might be called "ultimate",
-
- Agreed. As the proposal is much more rational than 10646 and cannot
- be made compatible with 10646, let's throw away 10646/Unicode and make
- the UCS (Ultimate Coded Character Set).
-
- >it wouldn't solve
- >all problems. In particular, the encoding mustn't force one to
- >provide information that's irrelevant or not available.
-
- There should be one special language code: a wild card.
-
- If the language distinction information is not available, we should
- assign the wild card code to the character, which will then match
- characters of any language.
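
A minimal sketch of the wild card rule above, assuming each character carries a language tag. The `TaggedChar` type, the `WILDCARD` value and the tag strings ("de", "sv") are hypothetical names chosen for illustration, not part of any standard.

```python
from typing import NamedTuple

# Hypothetical tag value standing for "language unknown or irrelevant".
WILDCARD = "*"


class TaggedChar(NamedTuple):
    """A character together with an assumed language tag."""
    char: str
    lang: str


def chars_match(a: TaggedChar, b: TaggedChar) -> bool:
    """Two tagged characters match if the characters are equal and the
    language tags are equal or at least one of them is the wild card."""
    if a.char != b.char:
        return False
    return a.lang == b.lang or WILDCARD in (a.lang, b.lang)
```

With this rule, a wild card 'o' matches both a German and a Swedish 'o', while the German and Swedish ones do not match each other.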
-
- >>2) there is an *algorithm* for converting from upper case to lower case
- >> and vice versa.
- >
- >> This property is extremely useful
- >
- >Agreed. It is also very difficult to do completely for all languages
- >without knowing the language. E.g., in some languages several lower
- >case characters may map to same upper case character. Perhaps it
- >would be sufficient to have a case-insensitive comparison algorithm?
-
- Agreed.
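
As an illustration of comparing without converting case, here is a small sketch that uses Python's built-in `str.casefold`. It is only one possible approach, is not language-aware, and the helper name `caseless_equal` is made up for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case, without picking a single
    upper-case or lower-case form as the canonical one."""
    return a.casefold() == b.casefold()


# Both lower-case sigmas (medial and final) correspond to the same capital,
# so no lossless upper-to-lower conversion exists, yet comparison still works.
medial_sigma = "\u03c3"
final_sigma = "\u03c2"
same_letter = caseless_equal(medial_sigma, final_sigma)
```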
-
- >A pure locale-system clearly won't do in multilingual environments.
- >Nonetheless some things are, IMHO, best handled with locales.
-
- Agreed.
-
- >However, there's another issue with sorting where your proposal is
- >disastrous: A single list of names may need to be sorted differently
- >on different occasions depending on the target language. If Spanish
- >ll and ch are considered individual characters, and German and Finnish
- >a-umlaut as distinct characters, this means that a Finnish reader must
- >know the language a name originated in in order to find it.
-
- Are 'll' and 'ch' considered individual characters having distinct code
- points? Or are they 'l', 'l', 'c' and 'h'?
-
- I think the latter is better.
-
- >It is
- >impossible to explain to a layman that M"oller and M"oller are
- >different names and in different place in the directory because one is
- >German and the other Swedish.
-
- In this case, of course, language distinction should be the last key
- used for sorting, shouldn't it?
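
A rough sketch of "language as the last key", assuming directory entries are (name, language tag) pairs. The sample entries and the `sort_key` helper are hypothetical, and the primary key here is plain code point order after case folding, not a real collation.

```python
def sort_key(entry):
    """Sort by the name first; the language tag only breaks ties
    between names that are otherwise identical."""
    name, lang = entry
    return (name.casefold(), lang)


entries = [
    ("M\u00f6ller", "sv"),
    ("Meyer", "de"),
    ("M\u00f6ller", "de"),
]
sorted_entries = sorted(entries, key=sort_key)
```

The two Möllers end up next to each other, so a reader does not need to know a name's origin to find it; the tag only decides which of the two comes first.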
-
- >Even worse, the problem isn't limited to sorting: E.g., assume you
- >want to search a database for the works of one Mr. M"oller without
- >knowing where he comes from. Or when entering a reference you have
- >on paper to a database: how do you know which "o to type?
-
- I think the problem can be solved by introducing a new notation for
- regular expressions.
-
- For example, let '[[A]]' represent all Latin alphabet 'A's regardless of
- their language, and let '[[[a]]]' represent all variants of the Latin
- alphabet 'A', such as "A with ring above".
-
- Then, if you want to search for Mr. M"oller of any nationality, use the
- following pattern
-
- [[M]][["o]][[l]][[l]][[e]][[r]]
-
- or just use characters of wild card nationality.
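
The bracket notation could be prototyped roughly as follows. This is a sketch under several assumptions: text is a list of (character, language tag) pairs, '[[x]]' matches the character x with any tag, and '[[[x]]]' matches any character whose base letter (found via Unicode decomposition, ignoring case) is x. The function names are invented for this example, and the precomposed o-umlaut stands in for the "o notation.

```python
import re
import unicodedata

# One token is either [[[x]]] (all variants of x) or [[x]] (x in any language).
TOKEN = re.compile(r"\[\[\[(.)\]\]\]|\[\[(.)\]\]")


def base_letter(ch):
    """Strip diacritics and case: the first code point of the NFD form, case-folded."""
    return unicodedata.normalize("NFD", ch)[0].casefold()


def compile_pattern(pattern):
    """Translate the bracket notation into a list of per-character predicates."""
    matchers = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported pattern syntax at offset {pos}")
        variants, any_lang = m.groups()
        if variants is not None:
            matchers.append(lambda c, b=base_letter(variants): base_letter(c) == b)
        else:
            matchers.append(lambda c, x=any_lang: c == x)
        pos = m.end()
    return matchers


def matches(pattern, tagged_text):
    """True if every (char, lang) pair satisfies its predicate; tags are ignored."""
    matchers = compile_pattern(pattern)
    return len(matchers) == len(tagged_text) and all(
        m(c) for m, (c, _lang) in zip(matchers, tagged_text)
    )


german = [(c, "de") for c in "M\u00f6ller"]
swedish = [(c, "sv") for c in "M\u00f6ller"]
pattern = "[[M]][[\u00f6]][[l]][[l]][[e]][[r]]"
found_both = matches(pattern, german) and matches(pattern, swedish)
```

Writing '[[[o]]]' in the second position would additionally match a plain 'o', which may help when the person searching does not know which "o to type.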
-
- >>5) typographical-quality fonts often differentiate between
- >> "similar" glyphs reproducible on lower-resolution devices.
-
- It should be stressed that corresponding characters in China and Japan
- are different even on high-resolution CRTs.
-
- >>The idea is to leave ALL language specifics at the point of input
- >>where the language is supposedly known.
-
- How many nationality distinctions, do you think, are necessary for
- a given single character?
-
- It seems to me that, for Han characters, we need 5 or so, which could
- be represented with 3 bits.
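
The arithmetic behind the 3-bit figure can be checked with a short sketch. The helper name `bits_needed` is made up here, and adding one extra value for a wild card is an assumption about how the tag space would be used.

```python
def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can encode n_values distinct tags."""
    if n_values < 1:
        raise ValueError("need at least one value")
    return (n_values - 1).bit_length()


han_distinctions = 5                      # "5 or so" distinctions
with_wildcard = han_distinctions + 1      # assumed extra wild card tag

bits_for_han = bits_needed(han_distinctions)
bits_with_wildcard = bits_needed(with_wildcard)
spare_codes = 2 ** bits_with_wildcard - with_wildcard
```

Three bits give 8 codes, so 5 distinctions plus a wild card would still leave 2 codes unused.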
-
- Masataka Ohta
-