NetNews Usenet Archive 1992 #20

Internet Message Format | 1992-09-10 | 3.6 KB

Xref: sparky comp.std.internat:647 soc.culture.turkish:9822 soc.culture.nordic:5592
Path: sparky!uunet!ogicse!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat,soc.culture.turkish,soc.culture.nordic
Subject: Re: Latin unification in ISO 10646
Message-ID: <27738@life.ai.mit.edu>
Date: 10 Sep 92 14:18:05 GMT
Article-I.D.: life.27738
References: <HAAVARDF.92Sep8012952@gluon.uio.no> <TT.92Sep9114439@tarzan.jyu.fi> <1992Sep9.163417.8803@corax.udac.uu.se>
Sender: news@ai.mit.edu
Followup-To: comp.std.internat
Organization: MIT Artificial Intelligence Laboratory
Lines: 61

In article <1992Sep9.163417.8803@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
>In article <TT.92Sep9114439@tarzan.jyu.fi>, tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
>> In article <1992Sep8.160511.1976@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
>> >The Turkish alphabet is different here (and more consistent, in my
>> >opinion), as it has two different vowels 'i'; one with dot and the
>> >other without (both letters of course appear in upper- and lowercase).
>> >I'm afraid even ISO 10646 fails to support them properly...

I must disagree that 10646 "fails to support [the Turkish alphabet]
properly."  10646 contains the following characters (UCS2 codes given):

  0131  LATIN SMALL LETTER DOTLESS I
  0049  LATIN CAPITAL LETTER I
  0069  LATIN SMALL LETTER I (with a dot)
  0130  LATIN CAPITAL LETTER I WITH DOT ABOVE

These *adequately* support representations of both European and Turkish
language texts.  The fact that one has to take language into account in
performing case conversion is not relevant.  The coding structure of
10646 does not represent case transformations nor does it represent
sorting order; an application (or system) must use table lookup to
perform these operations correctly, taking language into account as
necessary.
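
The table lookup described above could be sketched roughly as follows.
This is only an illustration, not anything specified by 10646: the table
names, the "tr" language tag, and the functions to_upper/to_lower are
hypothetical, and Python's built-in str.upper()/str.lower() stand in for
the language-independent default tables.

```python
# Hypothetical per-language override tables, consulted before the defaults.
# Only the four Turkish letters from the list above need special handling.
UPPER_OVERRIDES = {
    "tr": {"i": "\u0130",       # small i (with dot) -> capital I with dot above
           "\u0131": "I"},      # small dotless i    -> capital I
}
LOWER_OVERRIDES = {
    "tr": {"I": "\u0131",       # capital I                -> small dotless i
           "\u0130": "i"},      # capital I with dot above -> small i (with dot)
}


def to_upper(text, lang=None):
    """Uppercase text, checking the language's table before the default."""
    table = UPPER_OVERRIDES.get(lang, {})
    return "".join(table.get(ch, ch.upper()) for ch in text)


def to_lower(text, lang=None):
    """Lowercase text, checking the language's table before the default."""
    table = LOWER_OVERRIDES.get(lang, {})
    return "".join(table.get(ch, ch.lower()) for ch in text)


if __name__ == "__main__":
    print(to_upper("diyarbak\u0131r", "tr"))  # Turkish rules
    print(to_upper("diyarbak\u0131r"))        # default rules
```

The same four code values serve both cases; only the table chosen by
the caller changes, which is the sense in which case conversion lies
outside the coding structure itself.
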
The efficiencies obtained by programmers using ASCII for representing
English text are simply not possible with a universal character set.

Other examples of language (or regional) differences in case conversion:

  LATIN SMALL LETTER SHARP S (ESS-ZED) -> "SS" or "SZ" in uppercase,
    the latter sometimes used in Austrian German

  LATIN SMALL LETTER E ACUTE -> "CAPITAL E" or "CAPITAL E ACUTE",
    the former sometimes used in France

An ISO character set is simply a repertoire of characters, their
mappings to code points, and a list of unique names for those
characters.  No semantics regarding usage are specified for "graphic
characters."  Other standards are free to address semantics or
particular usages of characters.

>Latin and Cyrillic capital 'M' look the same, while the small forms
>don't.

10646 encodes the elements of scripts, independent of their use by
particular languages.  Latin and Cyrillic, though both derived from the
Greek alphabet along with influences from Etruscan and Aramaic, are
clearly distinct scripts.  On the other hand, the Han script, as used
in China, Japan, Korea, and Vietnam, is clearly one script, though some
innovations were introduced in its different writing systems.

Since Cyrillic, Greek, and Latin are separate scripts, their elements
are encoded separately, even when they happen to have some overlap of
form.  An even more important reason for their distinction in 10646 is
that they are distinct in ISO 8859-5.  The prime directive, as it were,
for 10646 was to facilitate a 1-1 round-trip mapping between 10646 and
existing character sets.  This precludes any unification of the few
similar Cyrillic, Greek, and Latin characters.

Glenn Adams
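
The round-trip requirement in the post can be illustrated with a small
sketch.  It assumes Python's standard "iso8859_5" codec, and it treats
Python string code points as stand-ins for 10646 code values; the
function names round_trips and unify_lookalikes are hypothetical, and
the "unification" shown is a toy, not a proposal anyone made.

```python
CODEC = "iso8859_5"  # Python's standard codec name for ISO 8859-5

LATIN_M = "M"            # U+004D LATIN CAPITAL LETTER M
CYRILLIC_EM = "\u041c"   # U+041C CYRILLIC CAPITAL LETTER EM


def round_trips(data: bytes) -> bool:
    """True if decoding from ISO 8859-5 and re-encoding restores the bytes."""
    return data.decode(CODEC).encode(CODEC) == data


def unify_lookalikes(text: str) -> str:
    """Toy 'unification': fold Cyrillic capital Em into Latin capital M."""
    return text.replace(CYRILLIC_EM, LATIN_M)


# ISO 8859-5 holds both letters: 0x4D is Latin M, 0xBC is Cyrillic Em.
sample = bytes([0x4D, 0xBC])

if __name__ == "__main__":
    print(round_trips(sample))
    folded = unify_lookalikes(sample.decode(CODEC)).encode(CODEC)
    print(folded == sample)
```

With separate code values the two bytes survive the trip unchanged.
Once the look-alikes are folded together, both come back as 0x4D, so
the original 8859-5 data can no longer be recovered.
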