NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-28 | 6.6 KB

Path: sparky!uunet!charon.amdahl.com!pacbell.com!ames!haven.umd.edu!darwin.sura.net!spool.mu.edu!caen!batcomputer!cornell!uw-beaver!micro-heart-of-gold.mit.edu!news.media.mit.edu!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
Newsgroups: comp.std.internat
Subject: Re: Alphabets
Date: 26 Jan 1993 18:26:21 GMT
Organization: MIT Artificial Intelligence Laboratory
Lines: 119
Message-ID: <1k3vodINN7qb@life.ai.mit.edu>
References: <8719@charon.cwi.nl> <1k100eINNs9n@life.ai.mit.edu> <8732@charon.cwi.nl>
NNTP-Posting-Host: wheat-chex.ai.mit.edu

In article <8732@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In my opinion you can not give absolute criteria. While Suetterlin is not
>derived from the Latin script and there is no formal similarity, it is
>functionally equivalent. So, although it is in fact a different script it
>may just as well be viewed as a different font. I do not think that Unicode
>(where it is extended to extinct scripts) should reserve codepoints for this
>script.

I think it is clear from my initial posting on script unification that
I consider pragmatics (utility) to be the most important criterion.
Obviously there can be no absolute utilitarian judgement on these
matters.

I do not believe functional equivalence should be given much priority.
It can, but it introduces problems of its own. First of all it assumes
that one can identify an unambiguous functional value for each symbol.
However, no writing system that I know of is purely unique and
unambiguous in assigning functional values to forms. Some writing
systems come pretty close to being pure in this respect, e.g., Finnish
orthography and Japanese Kana; however, even these are impure in
certain cases.
That is, phonographic writing systems (e.g., English & German) tend to
incorporate morphographic or lexographic functional identifications;
whereas, morphographic writing systems (e.g., Chinese & Japanese) tend
to incorporate phonographic functional identifications. Because of
these impurities, it is quite impossible to determine a precise
functional value for any particular symbol.

Not that this hasn't been tried: the ISCII character sets, designed to
encode the various scripts of India derived from the Brahmi script,
were designed to fit into a 7/8-bit encoding space. The only way to do
this with 9 scripts (Devanagari, Gurmukhi (Punjabi), Gujarati, Bengali,
Oriya, Kannada, Malayalam, Telugu, Tamil) was to unify according to
phonemic function, ignoring form completely. There are a number of
problems in doing this: (1) the writing systems based on these scripts
don't share a single set of functional units -- they employ different
functional units; and (2) these writing systems (like most others)
diverge from a pure functional encoding of sounds, incorporating a
mixture of morphological and lexical values in various places -- since
these morphological and lexical functions have a reflection in the
orthography, e.g., in choosing alternative variant forms, or
orthographic conventions, the encoding must diverge from encoding pure
phonological units, to encoding other kinds of functional units and
also formal units.

>What I indicated was that functional equivalence (as you
>correctly stated it) might be the most important deciding factor for some
>scripts.

I won't deny that it *may* be useful in some cases, e.g., there is some
opinion that Glagolitic should be considered a font shift from
Cyrillic, even though the two have different forms; however, because
Cyrillic has grown and has incurred a considerable number of
innovations (in order to write various non-Slavic languages), an
equivalence would be quite strained.
Currently, Unicode considers Glagolitic a separate script subject to
distinct encoding.

>But I think that the identification process may ignore the actual form. When
>Suetterlin was used it was mixed with the normal Roman form without change of
>meaning. It was more or less viewed as a different font, although the letter
>forms are completely different.

This raises an interesting point related to the notion of "plain text
form." Unicode requires that the encoding be adequate to display the
text in legible form without the use of language or font shifts (or any
other kind of rich text extensions). If Suetterlin were unified with
Latin, then it would not be possible to determine which forms to
display, Suetterlin forms or Latin forms, when a plain text string is
displayed. This would argue against unification, that is, unless it
was considered reasonable (from the point of legibility) to use Latin
forms in all cases (even when Suetterlin was expected).

>But that is only an afterthought. How about the Turkish I with and without
>dot? It would not have cost much to give them separate coding points. (Yes,
>I understand the compatibility reasons. Are there other reasons?)

There is a general policy against encoding identical forms which differ
in usage alone. In the case of Turkish vs. Spanish there is no
functional difference. The character in common character sets doesn't
really even designate a single functional value, but as many values as
the different writing systems that use it give to it; e.g., English
interprets it in a variety of ways depending on context. In the case
of Turkish vs., say, Spanish, the only difference is one of writing
system dependent orthographic behavior regarding case conversion. The
case also holds with lowercasing <SS> in German (and <SZ> in Austrian
German); or going between <ck> and <kk> when hyphenating German.

About the only way I can figure how to solve the problem of mixing
Turkish and English (French, Spanish, German, etc.)
and still get display and case conversion to work correctly is to do
something like the following:

    Turkish     <dotless i>
                <dotless i> + <dot above>
                + <dot above>

    English

Using this scheme, for the purpose of case conversion and equivalence
testing, we would have:

    != <dotless i> + <dot above>
    != + <dot above>

Furthermore, we would have to display with a dotless 'I' glyph,
violating the formal semantics of the character.

Two visual ambiguities are introduced by this scheme:

    Glyph 'i' -> or <dotless i> + <dot above>
    Glyph 'I' -> or

However, there is no ambiguity in the encoding as far as display or
case conversion is concerned.

If anyone can think of a better way to do this without adding new
characters to Unicode which are different in usage only, then I'd like
to hear about it.

Glenn Adams
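
The case conversion problem discussed above can be illustrated with a
rough Python sketch. It shows that uppercasing the same code point gives
different results depending on the writing system, and that a
<dotless i> + <dot above> sequence compares unequal to a plain dotted i.
The function names, the `turkish` flag, and the use of present-day
Unicode code points (U+0131, U+0130, U+0307) are assumptions made for
this example only; it is one possible reading, not the scheme as posted.

```python
# Illustrative sketch only: names and the `turkish` flag are hypothetical.
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
CAPITAL_I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOT_ABOVE = "\u0307"      # COMBINING DOT ABOVE


def upper_tailored(text: str, turkish: bool = False) -> str:
    """Uppercase text; in Turkish mode, dotted 'i' keeps its dot."""
    if turkish:
        text = text.replace("i", CAPITAL_I_DOT)
    return text.upper()


def lower_tailored(text: str, turkish: bool = False) -> str:
    """Lowercase text; in Turkish mode, 'I' becomes dotless."""
    if turkish:
        text = text.replace(CAPITAL_I_DOT, "i").replace("I", DOTLESS_I)
    return text.lower()


# The same code point uppercases differently per writing system,
# so a unified encoding needs outside knowledge of the language.
english_upper = upper_tailored("i")
turkish_upper = upper_tailored("i", turkish=True)

# A decomposed Turkish dotted i is a different code point sequence from
# plain 'i', so an equivalence test tells them apart without a language tag.
turkish_dotted_i = DOTLESS_I + DOT_ABOVE
distinct = turkish_dotted_i != "i"

# Python's default uppercasing maps dotless i to 'I' and leaves the
# combining dot in place, giving capital I + dot above.
decomposed_upper = turkish_dotted_i.upper()
```

The tailored functions stand in for the language or locale information
that plain text does not carry; the decomposed sequence is the
alternative that keeps the distinction in the encoding itself.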