- Path: sparky!uunet!charon.amdahl.com!pacbell.com!ames!haven.umd.edu!darwin.sura.net!spool.mu.edu!caen!batcomputer!cornell!uw-beaver!micro-heart-of-gold.mit.edu!news.media.mit.edu!mintaka.lcs.mit.edu!ai-lab!wheat-chex!glenn
- From: glenn@wheat-chex.ai.mit.edu (Glenn A. Adams)
- Newsgroups: comp.std.internat
- Subject: Re: Alphabets
- Date: 26 Jan 1993 18:26:21 GMT
- Organization: MIT Artificial Intelligence Laboratory
- Lines: 119
- Message-ID: <1k3vodINN7qb@life.ai.mit.edu>
- References: <8719@charon.cwi.nl> <1k100eINNs9n@life.ai.mit.edu> <8732@charon.cwi.nl>
- NNTP-Posting-Host: wheat-chex.ai.mit.edu
-
- In article <8732@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
- >In my opinion you can not give absolute criteria. While Suetterlin is not
- >derived from the Latin script and there is no formal similarity, it is
- >functionally equivalent. So, although it is in fact a different script it
- >may just as well be viewed as a different font. I do not think that Unicode
- >(where it is extended to extinct scripts) should reserve codepoints for this
- >script.
-
- I think it is clear from my initial posting on script unification that I
- consider pragmatics (utility) to be the most important criterion. Obviously
- there can be no absolute utilitarian judgement on these matters.
-
- I do not believe functional equivalence should be given much priority. It
- can be, but it introduces problems of its own. First of all, it assumes that
- one can identify an unambiguous functional value for each symbol. However,
- no writing system that I know of is purely unique and unambiguous in
- assigning functional values to forms. Some writing systems come pretty close
- to being pure in this respect, e.g., Finnish orthography and Japanese Kana;
- however, even these are impure in certain cases. That is, phonographic
- writing systems (e.g., English & German) tend to incorporate morphographic
- or lexographic functional identifications; whereas, morphographic writing
- systems (e.g., Chinese & Japanese) tend to incorporate phonographic
- functional identifications. Because of these impurities, it is quite
- impossible to determine a precise functional value for any particular
- symbol.
-
- Not that this hasn't been tried: the ISCII character sets, designed to
- encode the various scripts of India derived from the Brahmi script, were
- designed to fit into a 7/8-bit encoding space. The only way to do this with
- 9 scripts (Devanagari, Gurmukhi (Punjabi), Gujarati, Bengali, Oriya, Kannada,
- Malayalam, Telugu, Tamil) was to unify according to phonemic function,
- ignoring form completely. There are a number of problems in doing this:
- (1) the writing systems based on these scripts don't share a single set of
- functional units -- they employ different functional units; and (2) these
- writing systems (like most others) diverge from a pure functional encoding
- of sounds, incorporating a mixture of morphological and lexical values in
- various places. Since these morphological and lexical functions have a
- reflection in the orthography, e.g., in choosing alternative variant forms,
- or orthographic conventions, the encoding must diverge from encoding pure
- phonological units to encoding other kinds of functional units and also
- formal units.
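Unicode itself preserves a trace of ISCII's functional unification: the Indic blocks copy the ISCII layout, so the same phonemic unit normally sits at the same offset within each script's block, while slots for units a script lacks are left unassigned. A small check in Python, using real assigned code points (nothing hypothetical here):

```python
import unicodedata

# KA sits at offset 0x15 in each of these ISCII-derived Unicode blocks.
BLOCKS = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Tamil": 0x0B80,
}
for script, base in BLOCKS.items():
    print(script, unicodedata.name(chr(base + 0x15)))
# Prints DEVANAGARI LETTER KA, BENGALI LETTER KA, TAMIL LETTER KA.

# But point (1) above still bites: Tamil has no KHA, so the slot that
# holds KHA in Devanagari (offset 0x16) is unassigned in the Tamil block.
print(unicodedata.name("\u0916"))  # DEVANAGARI LETTER KHA
try:
    unicodedata.name("\u0B96")     # unassigned: Tamil lacks this unit
except ValueError:
    print("U+0B96 is unassigned")
```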
-
- >What I indicated was that functional equivalence (as you
- >correctly stated it) might be the most important deciding factor for some
- >scripts.
-
- I won't deny that it *may* be useful in some cases, e.g., there is some
- opinion that Glagolitic should be considered a font shift from Cyrillic,
- even though the two have different forms; however, because Cyrillic has
- grown and has incurred a considerable number of innovations (in order to
- write various non-Slavic languages), an equivalence would be quite
- strained. Currently, Unicode considers Glagolitic a separate script subject
- to distinct encoding.
-
- >But I think that the identification process may ignore the actual form. When
- >Suetterlin was used it was mixed with the normal Roman form without change of
- >meaning. It was more or less viewed as a different font, although the letter
- >forms are completely different.
-
- This raises an interesting point related to the notion of "plain text form."
- Unicode requires that the encoding be adequate to display the text in legible
- form without the use of language or font shifts (or any other kind of rich text
- extensions). If Suetterlin were unified with Latin, then it would not be
- possible to determine which forms to display, Suetterlin forms or Latin forms,
- when a plain text string is displayed. This would argue against unification,
- that is, unless it was considered reasonable (from the point of legibility)
- to use Latin forms in all cases (even when Suetterlin was expected).
-
- >But that is only an afterthought. How about the Turkish I with and without
- >dot? It would not have cost much to give them separate coding points. (Yes,
- >I understand the compatibility reasons. Are there other reasons?)
-
- There is a general policy against encoding identical forms which differ in
- usage alone. In the case of Turkish <i> vs. Spanish <i> there is
- no functional difference. The <i> character in common character sets
- doesn't really even designate a single functional value, but as many values
- as the different writing systems that use it give to it; e.g., English
- interprets it in a variety of ways depending on context.
-
- In the case of Turkish <i> vs., say, Spanish <i>, the only difference is one
- of writing-system-dependent orthographic behavior regarding case conversion.
- The same holds for lowercasing <SS> in German (and <SZ> in Austrian
- German), or for going between <ck> and <kk> when hyphenating German.
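Case conversion is indeed where these usage differences surface. Unicode encodes <dotless i> as U+0131 and <I with dot above> as U+0130, and leaves the Turkish <i>/<I> pairing to locale-tailored case mapping. A quick illustration in Python, whose str.upper()/str.lower() implement the default, locale-blind full case mappings:

```python
# German sharp s uppercases to SS under the default full case mapping.
assert "straße".upper() == "STRASSE"

# Plain i uppercases to plain I -- the English/Spanish behavior; a
# Turkish user would instead want İ (U+0130), which needs locale tailoring.
assert "i".upper() == "I"

# Unicode's separate dotless/dotted characters:
dotless = "\u0131"   # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
assert dotless.upper() == "I"
# The full lowercase mapping of İ is i + COMBINING DOT ABOVE (two code points):
assert dotted_I.lower() == "i\u0307"
print("default case mappings behave as described")
```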
-
- About the only way I can figure how to solve the problem of mixing
- Turkish and English (French, Spanish, German, etc.) and still get
- display and case conversion to work correctly is to do something like
- the following:
-
- Turkish
-
-     <dotless i>                    <I>
-     <dotless i> + <dot above>      <I> + <dot above>
-
- English
-
-     <i>                            <I with dot above>
-
- Using this scheme, for the purpose of case conversion and equivalence
- testing, we would have:
-
- <i> != <dotless i> + <dot above>
- <I with dot above> != <I> + <dot above>
-
- Furthermore, we would have to display <I with dot above> with a dotless
- 'I' glyph, violating the formal semantics of the character.
-
- Two visual ambiguities are introduced by this scheme:
-
- Glyph 'i' -> <i> or <dotless i> + <dot above>
- Glyph 'I' -> <I> or <I with dot above>
-
- However, there is no ambiguity in the encoding as far as display or
- case conversion is concerned.
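The whole scheme can be modeled as a toy case-conversion table. A minimal sketch in Python; the token names (DOTLESS_I, DOT_ABOVE, and so on) stand for the abstract characters in the table above, not for real code points:

```python
# Abstract characters of the proposed scheme (hypothetical tokens).
DOTLESS_I = "dotless i"
DOT_ABOVE = "dot above"            # combining mark
I_LOWER = "i"                      # English lowercase i
I_UPPER = "I"
I_UPPER_DOT = "I with dot above"

# Uppercase mapping per the scheme: Turkish pairs dotless i with I,
# and dotless i + dot above with I + dot above; English pairs i with
# I-with-dot-above. Combining marks carry through unchanged.
UPPER = {
    DOTLESS_I: [I_UPPER],
    I_LOWER: [I_UPPER_DOT],
    DOT_ABOVE: [DOT_ABOVE],
}

def to_upper(chars):
    out = []
    for c in chars:
        out.extend(UPPER.get(c, [c]))
    return out

# Turkish i (dotless i + dot above) uppercases to I + dot above,
# while English i uppercases to I-with-dot-above:
assert to_upper([DOTLESS_I, DOT_ABOVE]) == [I_UPPER, DOT_ABOVE]
assert to_upper([I_LOWER]) == [I_UPPER_DOT]
# Equivalence testing keeps the encodings distinct, as required:
assert [I_LOWER] != [DOTLESS_I, DOT_ABOVE]
```

Display would then map both encodings of the lowercase letter to the same 'i' glyph, which is exactly the visual ambiguity noted above.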
-
- If anyone can think of a better way to do this without adding new characters
- to Unicode which are different in usage only, then I'd like to hear about it.
-
- Glenn Adams
-