NetNews Usenet Archive 1992 #20

Internet Message Format | 1992-09-09 | 4.5 KB

Xref: sparky comp.std.internat:643 soc.culture.turkish:9799 soc.culture.nordic:5580
Newsgroups: comp.std.internat,soc.culture.turkish,soc.culture.nordic
Path: sparky!uunet!mcsun!sunic!corax.udac.uu.se!Riga.DoCS.UU.SE!andersa
From: andersa@Riga.DoCS.UU.SE (Anders Andersson)
Subject: Latin unification in ISO 10646
Message-ID: <1992Sep9.163417.8803@corax.udac.uu.se>
Followup-To: comp.std.internat
Sender: news@corax.udac.uu.se
Organization: Uppsala University, Sweden
References: <1992Sep7.195212.2614@boole.uucp> <HAAVARDF.92Sep8012952@gluon.uio.no> <TT.92Sep9114439@tarzan.jyu.fi>
Date: Wed, 9 Sep 1992 16:34:17 GMT
Lines: 79

[Note move of thread from soc.culture.nordic to comp.std.internat,
as well as a single hint to readers of soc.culture.turkish in case
of interest.]

In article <TT.92Sep9114439@tarzan.jyu.fi>, tt@tarzan.jyu.fi (Tapani Tarvainen) writes:
> In article <1992Sep8.160511.1976@corax.udac.uu.se> andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
> >The Turkish alphabet is different here (and more consistent, in my
> >opinion), as it has two different vowels 'i'; one with dot and the
> >other without (both letters of course appear in upper- and lowercase).
> >I'm afraid even ISO 10646 fails to support them properly...
>
> I'm fairly sure even some less ambitious ISO character set
> (probably 8859-n, where n>1) supports Turkish completely, including
> the dotless i (treated as a separate character).

I suggest we look at ISO 8859-3 (which I understand is the official
name for Latin Alphabet Nr 3) for reference, as it claims to support
Turkish.  In the following, I'm disputing the 'completeness' of that
support:

Latin-3 contains among other, mostly southern European, characters

    0xA9 capital letter I with dot above, and
    0xB9 small letter i without dot above.

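
The Latin-3 positions mentioned here can be inspected with a few lines of
Python. This is only an illustrative sketch: it assumes Python's `iso8859_3`
codec and the Unicode character names as modern stand-ins for the Latin-3
table, and the helper name `describe` is hypothetical, not something from
the original post.

```python
import unicodedata

# Byte values of interest in ISO 8859-3 (Latin-3): the two ASCII letters
# I and i, plus the two additional Turkish letters named in the post.
LATIN3_POSITIONS = [0x49, 0x69, 0xA9, 0xB9]


def describe(byte_value):
    """Return (character, code point, Unicode name) for one Latin-3 byte.

    Assumption: Python's "iso8859_3" codec is used as the Latin-3 table.
    """
    char = bytes([byte_value]).decode("iso8859_3")
    return char, ord(char), unicodedata.name(char)


for position in LATIN3_POSITIONS:
    char, code_point, name = describe(position)
    print(f"0x{position:02X} -> U+{code_point:04X} {name}")
```

Each of the four visually distinct letters gets its own code here, which is
the typographic sense of 'complete' support.
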
Of course, these are supposed to be used in conjunction with the
'normal' ASCII characters of the left-hand half of the table, in
particular

    0x49 (Latin) capital letter I, and
    0x69 (Latin) small letter i,

to make up the two different kinds of 'i' used in Turkish, each in
upper- and lowercase.  From a mere typographic standpoint (having a
unique code for each visually distinguishable glyph), I consider this
support complete.

Programmers are used to being able to perform case conversion on
letters of the ASCII table by simply adding a certain constant to, or
subtracting it from, the character code, given that the code is within
a particular range (A-Z or a-z).  With later ISO standards, this is not
quite such a simple task due to the sometimes ad-hoc layout of
lowercase letters with respect to corresponding uppercase letters
(examples available upon request), but it would still be possible
using tables showing the relationship.

However, since the same character code is now used for both Latin
capital 'I' and Turkish capital dotless 'I', case conversion is no
longer a trivial matter.  Consider TO_LOWER(TO_UPPER(dotless 'i')).
It ought to be symmetric, but what's the result?

Is it somehow understood that automatic case conversion of letters of
the Latin, Greek and Cyrillic alphabets (and possibly others) is
beyond the scope of ISO character standards, or is this just an odd
case having been overlooked?

Judging from the little I've seen of ISO 10646, it contains no better
support for Turkish 'i' variants than Latin-3 does (see positions
0x0130 and 0x0131 in UCS-2).

My proposal: Add two specifically Turkish letters to ISO 10646, one
capital 'I' without dot and one small 'i' with dot, and consider them
different from the Latin 'I' and 'i'.

I have no formal relationship with any standardization body, so I'll
have to leave this proposal for any interested party to bring it up in
the proper forum.

Latin and Cyrillic capital 'M' look the same, while the small forms don't.
Those capital 'M' letters have different codes in ISO 10646, though
maybe for reasons of systematic tabulation rather than in order to
support case conversion.

We did away with the old typewriter unification of '1' and 'l' long
ago (and the same for '0' and 'O', if that ever was a problem).  Is
Turkish 'i' vs. Latin 'i' in that different a ballpark?

Are there word processors today that know how to case-convert a word
containing Turkish letters?  What are Turkish typists used to?

Are there other letters in other alphabets suffering from similar
unification problems in current ISO standards?
--
Anders Andersson, Dept. of Computer Systems, Uppsala University
Paper Mail: Box 325, S-751 05 UPPSALA, Sweden
Phone: +46 18 183170    EMail: andersa@DoCS.UU.SE
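
The round trip TO_LOWER(TO_UPPER(dotless 'i')) asked about in the post can
be made concrete with a small Python sketch over Latin-3 byte values. It
combines the ASCII constant-offset rule with a lookup table for the
exceptions. Both tables below are assumptions made for illustration (one
'language-neutral' pairing, one Turkish pairing); neither is taken from an
ISO standard, and all names are hypothetical.

```python
ASCII_CASE_OFFSET = 0x20  # ord('a') - ord('A')

# Latin-3 byte values named in the post.
CAPITAL_I = 0x49        # capital letter I (no dot)
SMALL_I = 0x69          # small letter i (with dot)
CAPITAL_I_DOT = 0xA9    # capital letter I with dot above
SMALL_DOTLESS_I = 0xB9  # small letter i without dot above

# Assumed language-neutral pairing: I/i remain an ASCII pair, and the two
# extra Turkish letters fold onto them.
NEUTRAL_UPPER = {SMALL_DOTLESS_I: CAPITAL_I}
NEUTRAL_LOWER = {CAPITAL_I_DOT: SMALL_I}

# Assumed Turkish pairing: dotted pairs with dotted, dotless with dotless.
TURKISH_UPPER = {SMALL_I: CAPITAL_I_DOT, SMALL_DOTLESS_I: CAPITAL_I}
TURKISH_LOWER = {CAPITAL_I: SMALL_DOTLESS_I, CAPITAL_I_DOT: SMALL_I}


def to_upper(code, overrides):
    """Consult the exception table first, then apply the ASCII rule for a-z."""
    if code in overrides:
        return overrides[code]
    if 0x61 <= code <= 0x7A:
        return code - ASCII_CASE_OFFSET
    return code


def to_lower(code, overrides):
    """Consult the exception table first, then apply the ASCII rule for A-Z."""
    if code in overrides:
        return overrides[code]
    if 0x41 <= code <= 0x5A:
        return code + ASCII_CASE_OFFSET
    return code


def round_trip(code, upper_overrides, lower_overrides):
    """Compute TO_LOWER(TO_UPPER(code)) with the given exception tables."""
    return to_lower(to_upper(code, upper_overrides), lower_overrides)


for label, upper, lower in [
    ("neutral", NEUTRAL_UPPER, NEUTRAL_LOWER),
    ("turkish", TURKISH_UPPER, TURKISH_LOWER),
]:
    result = round_trip(SMALL_DOTLESS_I, upper, lower)
    print(f"{label}: 0x{SMALL_DOTLESS_I:02X} -> 0x{result:02X}")
```

With the neutral tables the dotless 'i' comes back as the dotted 0x69, so
the round trip is not symmetric. The Turkish tables restore symmetry, but
only by lowercasing every 0x49 to the dotless 0xB9, which is wrong for
non-Turkish text, since the byte 0x49 alone does not say which alphabet it
belongs to.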