- Newsgroups: comp.std.internat
- Path: sparky!uunet!mcsun!news.funet.fi!network.jyu.fi!tarzan!tt
- From: tt@tarzan.jyu.fi (Tapani Tarvainen)
- Subject: Re: Language tagging
- In-Reply-To: avg@rodan.UU.NET's message of 7 Jan 1993 16:12:20 -0500
- Message-ID: <TT.93Jan10143855@tarzan.jyu.fi>
- Summary: universal sorting and hyphenation are impossible (cute hyphenation
- examples): two "o's would be a royal pain; transliteration is useful.
- Originator: tt@tarzan.math.jyu.fi
- Sender: news@jyu.fi (News articles)
- Nntp-Posting-Host: tarzan.math.jyu.fi
- Organization: University of Jyvaskyla
- References: <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET>
- <TT.93Jan7085019@tarzan.jyu.fi> <1ii6bkINNf6c@rodan.UU.NET>
- Date: Sun, 10 Jan 1993 12:38:55 GMT
- Lines: 135
-
- In article <1ii6bkINNf6c@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- [parts deleted and re-ordered]
-
- >In article <TT.93Jan7085019@tarzan.jyu.fi> tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
- >>Let's look at it this way: How would a Finn want to see Chinese
- >>names sorted?
-
- >For many reasons
-
- The question wasn't "why" but "how" -- and my point was that the how
- depends on the why.
-
-
- >Most likely an end user will have a local sorting algorithm which
- >basically reduces glyphs to the particular language's by
- >transliteration and then applies the same generic sorting algorithm.
- >At the same time, if you're sending a monolingual file with
- >Sweden words to China it would be sorted correctly even if nobody
- >around knows what the rules are!
-
- Who's going to read the words in China? If he doesn't know Latin
- characters at all he won't care how they're sorted. If he knows
- English but not Swedish he'll prefer English sorting rules. If he
- knows Swedish he'll want Swedish sorting -- and can specify Swedish
- locale for the purpose.
-
- People will not care whether a list of words in a language they don't
- know is sorted correctly in that language. The correct language-
- specific ordering is useful only when the reader knows it.
- When producing pre-sorted material you generally have an idea of at
- least some language the recipient will understand -- if he can't be
- expected to know any language that uses the characters in question he
- won't care about their ordering (or indeed about the material at all).
- Maybe sometimes one'll have to guess -- but a fixed proto-Latin order
- would in effect mean making the guess once and for all, a true
- procrustean solution.
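
To make the point concrete, here is a minimal Python sketch of two language-specific sort keys applied to the same list of names. The rules are deliberately simplified assumptions (Swedish places å, ä, ö after z; German dictionary order treats an umlaut as its base vowel), and the names `swedish_key` and `german_key` are made up for this illustration. It is not a full collation implementation.

```python
import unicodedata

# Simplified Swedish alphabet: a-z followed by å, ä, ö as separate letters.
SWEDISH_RANK = {
    ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6")
}


def swedish_key(word):
    """Sort key under the simplified Swedish rule: ö comes after z."""
    text = unicodedata.normalize("NFC", word).lower()
    # Characters outside the table sort after everything else.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in text]


def german_key(word):
    """Sort key under the simplified German rule: ö sorts like o."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop combining marks so that "ö" compares equal to "o".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["M\u00f6ller", "Mutter", "Molin"]

swedish_order = sorted(names, key=swedish_key)
german_order = sorted(names, key=german_key)

print(swedish_order)
print(german_order)
```

Under these assumptions the Swedish key puts Möller last and the German key puts it between Molin and Mutter, so no single fixed order satisfies both readers.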
-
-
- >>E.g., assume you
- >>want to search a database for the works of one Mr. M"oller without
- >>knowing where he comes from. Or when entering a reference you have
- >>on paper to a database: how do you know which "o to type?
-
- >You have this problem with Unicode/ISO10646 right now -- consider, say,
- >the word BETA -- it can be encoded in more than a hundred ways!
- >The search routines should simply treat identical glyphs as identical;
- >you have to do it now anyway.
-
- Yes, but it isn't sufficient:
- If you have a reference in a book (one that's made of paper!) to a
- Mr. M"oller whose native languages you don't know: how are you going
- to write his name to the file that's going to China?
-
- In the case of hypothetical Mr. BETA the problem isn't as likely, for
- when books use multiple scripts they usually use sufficiently
- different fonts or other means for distinguishing them. (And no,
- people won't accept the idea of using different fonts for German and
- Swedish whenever they occur together.) How often have you actually
- encountered words of which you don't immediately know whether they're
- written in Cyrillic or Latin letters? Names that can be either
- German or Swedish occur _often_.
-
- I think the _script_ of the word in question is usually known from
- context, even if you can find individual words that look identical.
- The same is not true of languages using the same script.
- (Are there languages that use a genuine mixture of multiple scripts,
- as opposed to one base script and a few letters borrowed from others?)
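
The "treat identical glyphs as identical" search can be sketched roughly as follows. The `LOOKALIKES` table is a tiny hand-picked assumption covering only the letters of BETA (real lookalike data would be far larger), and `glyph_key` is a hypothetical helper name.

```python
# Greek and Cyrillic capitals that are drawn like Latin capitals.
# Hand-picked for this example only; not a complete table.
LOOKALIKES = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
}


def glyph_key(text):
    """Map each lookalike character to its Latin twin for comparison."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


latin = "BETA"
greek = "\u0392\u0395\u03a4\u0391"   # all Greek code points
mixed = "\u0412E\u03a4\u0410"        # Cyrillic, Latin, Greek, Cyrillic

# The raw strings differ, but their glyph keys agree.
print(latin == greek, glyph_key(latin) == glyph_key(greek))
print(glyph_key(mixed) == glyph_key(latin))
```

This kind of folding only helps across scripts. A German M"oller and a Swedish M"oller are already the same string, so no glyph table can tell the search routine which language's rules to apply.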
-
-
-
- >>similarly spelled words occur
- >>in different languages and are hyphenated differently.
-
- >Any realistic example? :-)
-
- Why the smiley? Examples are not hard to come by, even from languages
- as different as English and Finnish:
- pat-i-na/pa-ti-na, pi-an-o/pia-no, piv-ot/pi-vot, tal-on/ta-lon,
- tel-ex/te-lex, home/ho-me, pore/po-re, pure/pu-re, ma-lar-i-a/ma-la-ria,
- vale/va-le, gig-o-lo/gi-go-lo, pet-it/pe-tit, des-per-a-do/des-pe-ra-do
- (some have same meanings, most don't).
- The situation is even more common with proper names:
- Ev-er-ett/E-ve-rett, Fa-bri-ti-us/Fab-ri-ti-us, Far-a-day/Fa-ra-day,
- Fahr-en-heit/Fah-ren-heit, Flem-ing/Fle-ming, Fitz-ger-ald/Fitz-ge-rald,
- Fred-er-ick/Fre-de-rick, Gan-y-mede/Ga-ny-me-de, ... (I presume you'll
- argue proper names should be hyphenated according to where the person
- in question came from. I don't think that's possible in practice.)
-
- It gets even worse with two closely related languages like English and
- German: des-ig-na-tion/de-si-gna-ti-on, des-per-a-do/de-spe-ra-do,
- fight-er/figh-ter, lead-er/lea-der, limon-ade/li-mo-na-de,
- meth-od-ist/me-tho-dist, min-i-mal/mi-ni-mal, mod-est/mo-dest,
- orig-i-nal/ori-gi-nal, par-the-no-gen-e-sis/par-the-no-ge-ne-sis,
- pref-er-ence/pre-fe-rence, pseu-do-nym/pseud-onym, rav-age/ra-va-ge,
- re-tal-i-a-tion/re-ta-lia-ti-on, rit-u-al/ri-tu-al, sep-a-ra-tion/
- se-pa-ra-ti-on (I know, German nouns should be capitalized, but
- capitalization isn't a reliable indicator of language anyway).
-
- (Just noticed that "desperado" occurs in all three languages and is
- hyphenated three different ways!)
-
- In case you're wondering, I have some experience in fixing by hand
- Finnish and German texts hyphenated by English rules ...
-
- I suggest you forget the idea of language-independent hyphenation.
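
A small Python sketch of what this implies for software: the hyphenation lookup has to take the language as an input, because the spelling alone does not determine the break points. The entries are taken from the examples above; the table layout and the names `hyphenate` and `break_points` are assumptions made for illustration, not any real hyphenation library.

```python
# Per-language exception tables, filled with a few of the examples above.
HYPHENATION = {
    "en": {
        "desperado": "des-per-a-do",
        "minimal": "min-i-mal",
        "original": "orig-i-nal",
    },
    "fi": {
        "desperado": "des-pe-ra-do",
    },
    "de": {
        "desperado": "de-spe-ra-do",
        "minimal": "mi-ni-mal",
        "original": "ori-gi-nal",
    },
}


def hyphenate(word, language):
    """Return the listed hyphenation of word in language, or None if unlisted."""
    return HYPHENATION.get(language, {}).get(word.lower())


def break_points(word):
    """Collect every language's hyphenation of the same spelling."""
    key = word.lower()
    return {lang: table[key] for lang, table in HYPHENATION.items() if key in table}


for lang, pattern in break_points("desperado").items():
    print(lang, pattern)
```

Calling `hyphenate("desperado", lang)` with "en", "fi" and "de" gives three different answers for one spelling, which is exactly why a language-independent rule cannot work.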
-
-
-
- >Ever saw Chinese transliterated into Latin? :-) You generally
- >can't do it and keep it intelligible because phonetic structures
- >of languages are fairly different.
-
- I see Chinese names transliterated with Latin letters practically
- every time I open a newspaper. Even though a Finn who doesn't know
- Chinese can't pronounce them understandably, he can easily remember
- them (as he couldn't ideograms he doesn't know), he can talk about
- them with other Finns, and he can copy them in writing and be
- understood by someone who knows Chinese and the transliteration method
- used.
-
-
-
- >>>The idea is to leave ALL language specifics at the point of input
- >>>where the language is supposedly known.
- >>
- >>That assumption is simply false far too often. Not only may the
- >>language of a word be unknown, it may not even exist. (What language
- >>is M"oller in this sentence?)
-
- >I'd say that this word belong to the native language of Mr. M"oller :-)
- >It's just you didn't specify that.
-
- Maybe he's bilingual. :-)
- --
- Tapani Tarvainen (tt@math.jyu.fi, tarvainen@finjyu.bitnet)
-