- Xref: sparky comp.std.internat:647 soc.culture.turkish:9822 soc.culture.nordic:5592
- Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat,soc.culture.turkish,soc.culture.nordic
- Subject: Re: Latin unification in ISO 10646
- Message-ID: <27738@life.ai.mit.edu>
- Date: 10 Sep 92 14:18:05 GMT
- Article-I.D.: life.27738
- References: <HAAVARDF.92Sep8012952@gluon.uio.no> <TT.92Sep9114439@tarzan.jyu.fi> <1992Sep9.163417.8803@corax.udac.uu.se>
- Sender: news@ai.mit.edu
- Followup-To: comp.std.internat
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 61
-
-
- In article <1992Sep9.163417.8803@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
- >In article <TT.92Sep9114439@tarzan.jyu.fi>, tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
- >> In article <1992Sep8.160511.1976@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
- >> >The Turkish alphabet is different here (and more consistent, in my
- >> >opinion), as it has two different vowels 'i'; one with dot and the
- >> >other without (both letters of course appear in upper- and lowercase).
- >> >I'm afraid even ISO 10646 fails to support them properly...
-
- I must disagree that 10646 "fails to support [the Turkish alphabet] properly."
- 10646 contains the following characters (UCS2 codes given):
-
- 0131 LATIN SMALL LETTER DOTLESS I
- 0049 LATIN CAPITAL LETTER I
-
- 0069 LATIN SMALL LETTER I (with a dot)
- 0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
-
- These *adequately* support representations of both European and Turkish
- language texts. The fact that one has to take language into account in
- performing case conversion is not relevant. The coding structure of
- 10646 does not represent case transformations nor does it represent
- sorting order; an application (or system) must use table lookup to
- perform these operations correctly, taking language into account as
- necessary. The efficiencies obtained by programmers using ASCII for
- representing English text are simply not possible with a universal character
- set.
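
As an illustration of the table-lookup approach described above, here is a minimal Python sketch. The override tables, the language tag "tr", and the function names `to_upper` and `to_lower` are assumptions made for this example, not part of 10646 or of any particular library; only the four code points listed above are taken from the standard.

```python
# Per-language overrides consulted before the default case mapping.
# Table contents and function names are illustrative assumptions.
UPPER_OVERRIDES = {
    "tr": {
        "\u0069": "\u0130",  # small i (dotted)  -> capital I with dot above
        "\u0131": "\u0049",  # small dotless i   -> capital I
    },
}
LOWER_OVERRIDES = {
    "tr": {
        "\u0049": "\u0131",  # capital I                -> small dotless i
        "\u0130": "\u0069",  # capital I with dot above -> small i (dotted)
    },
}


def to_upper(text, lang="en"):
    """Uppercase text, applying language-specific overrides first."""
    table = UPPER_OVERRIDES.get(lang, {})
    return "".join(table.get(ch, ch.upper()) for ch in text)


def to_lower(text, lang="en"):
    """Lowercase text, applying language-specific overrides first."""
    table = LOWER_OVERRIDES.get(lang, {})
    return "".join(table.get(ch, ch.lower()) for ch in text)


if __name__ == "__main__":
    print(to_upper("i", "en"))   # default mapping
    print(to_upper("i", "tr"))   # Turkish mapping
    print(to_lower("I", "tr"))
```

The same code points serve both conventions; only the table chosen at conversion time differs, which is why the coding structure itself need not encode case relationships.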
-
- Other examples of language (or regional) differences in case conversion:
-
- LATIN SMALL LETTER SHARP S (ESS-ZED) -> "SS" or "SZ" in uppercase, the latter
- sometimes used in Austrian German
-
- LATIN SMALL LETTER E ACUTE -> "CAPITAL E" or "CAPITAL E ACUTE",
- the former sometimes used in France
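
These two cases can be sketched the same way, with the regional choice passed in as a parameter. The function name `upper_variant` and its parameters `sharp_s` and `keep_accents` are hypothetical, chosen only for this example; stripping accents by decomposing and dropping combining marks is one possible technique, not a prescribed one.

```python
import unicodedata


def upper_variant(text, sharp_s="SS", keep_accents=True):
    """Uppercase text with caller-selected regional conventions.

    sharp_s      -- replacement for LATIN SMALL LETTER SHARP S ("SS" or "SZ")
    keep_accents -- if False, drop accents from the uppercased letters
    """
    out = []
    for ch in text:
        if ch == "\u00df":          # sharp s has no single-letter capital here
            out.append(sharp_s)
            continue
        up = ch.upper()
        if not keep_accents:
            # Decompose, then discard combining marks (the accents).
            decomposed = unicodedata.normalize("NFD", up)
            up = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(up)
    return "".join(out)


if __name__ == "__main__":
    print(upper_variant("stra\u00dfe"))
    print(upper_variant("stra\u00dfe", sharp_s="SZ"))
    print(upper_variant("\u00e9t\u00e9", keep_accents=False))
```

Note that the sharp s case is a one-to-many mapping, so a per-character lookup table must be allowed to return strings rather than single characters.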
-
- An ISO character set is simply a repertoire of characters, their mappings to
- code points, and a list of unique names for those characters. No semantics
- regarding usage are specified for "graphic characters." Other standards
- are free to address semantics or particular usages of characters.
-
- >Latin and Cyrillic capital 'M' look the same, while the small forms
- >don't.
-
- 10646 encodes the elements of scripts, independent of their use by particular
- languages. Latin and Cyrillic, though both derived from the Greek alphabet
- along with influences from Etruscan and Aramaic, are clearly distinct
- scripts. On the other hand, the Han script, as used in China, Japan, Korea,
- and Vietnam, is clearly one script, though some innovations were introduced
- in its different writing systems.
-
- Since Cyrillic, Greek, and Latin are separate scripts, their elements are
- encoded separately, even when they happen to have some overlap of form.
-
- An even more important reason for their distinction in 10646 is that they
- are distinct in ISO 8859-5. The prime directive, as it were, for 10646 was
- to facilitate a 1-1 round-trip mapping between 10646 and existing character
- sets. This precludes any unification of the few similar Cyrillic, Greek,
- and Latin characters.
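
The round-trip requirement can be checked with a short Python sketch, assuming Python's built-in `iso8859_5` codec as a stand-in for the ISO 8859-5 mapping and Python strings as a stand-in for 10646 text. The helper name `round_trips` is hypothetical.

```python
def round_trips(data, codec="iso8859_5"):
    """Return True if decoding and re-encoding reproduces the same bytes."""
    return data.decode(codec).encode(codec) == data


LATIN_CAPITAL_M = "\u004d"      # LATIN CAPITAL LETTER M
CYRILLIC_CAPITAL_EM = "\u041c"  # CYRILLIC CAPITAL LETTER EM

latin_byte = LATIN_CAPITAL_M.encode("iso8859_5")
cyrillic_byte = CYRILLIC_CAPITAL_EM.encode("iso8859_5")

if __name__ == "__main__":
    # The two look-alike letters occupy different positions in ISO 8859-5.
    print(latin_byte != cyrillic_byte)
    # Every byte value survives the trip through the universal set and back.
    print(round_trips(bytes(range(256))))
```

Had the two letters been unified into one code point, both bytes would decode to the same character and re-encoding could recover at most one of them, so the mapping would no longer be 1-1.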
-
- Glenn Adams
-