- Path: sparky!uunet!europa.asd.contel.com!darwin.sura.net!jvnc.net!newsserver.jvnc.net!yale.edu!ira.uka.de!fauern!uni-erlangen.de!not-for-mail
- From: unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn)
- Newsgroups: comp.std.internat
- Subject: Re: Let's develop ISO sorting rules
- Date: 6 Jan 1993 16:49:27 +0100
- Organization: Regionales Rechenzentrum Erlangen
- Message-ID: <1iev27EINNmc4@uni-erlangen.de>
- References: <1i0vnmINN352@rodan.UU.NET> <8494@charon.cwi.nl> <1i2durINN2pj@rodan.UU.NET> <8496@charon.cwi.nl> <C0Cuz5.2wy@flatlin.ka.sub.org> <1ibmdcEINNooe@uni-erlangen.de> <1993Jan5.222627.29561@jarvis.csri.toronto.edu>
- Reply-To: mskuhn@immd4.informatik.uni-erlangen.de
- NNTP-Posting-Host: cd4680fs.rrze.uni-erlangen.de
- Lines: 96
-
- flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
-
- >unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
- >>Believe me, German users who know what they are talking about won't
- >>complain if "a, "o and "u are not sorted as ae, oe and ue.
-
- >I believe you. I'm sure that the analogous statement is true for French. But
- >I'm almost as sure that Swedish readers won't want "a-circle" to be anywhere
- >near "a". I think your flexibility on this is a language-relative phenomenon.
-
- This would be really bad news. Could someone from Sweden comment on this?
-
- My idea is that there is one Latin alphabet that is known by everyone
- using a Latin script:
-
- abcdefghijklmnopqrstuvwxyz
-
- And the following rule can be understood by every user in less than
- 10 seconds:
-
- sort Latin characters with diacritics (e.g. "a, 'a, ^a, ...) near their
- pure Latin version a.
-
- The time needed to understand this rule is independent of which
- Latin-alphabet-based language the user normally uses.
-
- Then you still have to insert special characters (e.g. Icelandic Thorn)
- at suitable places THAT ONLY NEED TO BE REMEMBERED BY PEOPLE INTERESTED
- IN THIS LANGUAGE! That's really simple to understand. OK, perhaps users
- should not be forced to use this standard multilingual sorting, but if it
- is offered as one possible 'international locale', I bet many people will
- come to like it very quickly.
-
- My algorithm is a kind of generalized upcase conversion: all a letters
- (a, A, A-ring, a-ring, "a, "A, ... perhaps even Greek alpha and Alpha)
- form one group. Without the huge Far East character sets, there might
- be about 50-80 groups. Each ISO 10646 character is assigned a group
- number, and members of the same group share the same group number.
- Then we sort according to group numbers, not according to character
- codes. If the comparison between two words is a tie because their group
- strings do not differ, then we compare by the positions within a group.
- Each character is also given a group position number, which provides a total
- order on all characters within a group. E.g. we might define in the standard
- that ring-above always comes after acute-accent, etc. These definitions are
- only necessary to provide a total ordering and need NOT be known by 99% of
- the users, because they will be significant very rarely (e.g. does
- cooperation come before or after coöperation [ö="o]?). In addition,
- there is a punctuation group, which is ignored in the first pass,
- and a space group. Not to forget 10 digit groups, in which all the
- different digit versions in Unicode are collected, etc.
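
The two-pass comparison described above could be sketched in Python roughly as follows. Everything concrete here is an assumption made for illustration: the group numbers, the handful of accented letters, the within-group order (lowercase before uppercase, then acute, circumflex, diaeresis, ring) and the names `build_table` and `sort_key` are hypothetical and not taken from any standard. A real table would have to cover every ISO 10646 character.

```python
import string

PUNCT_GROUP = 0          # ignored in the first pass
SPACE_GROUP = 1
FIRST_DIGIT_GROUP = 2    # ten digit groups: 2..11
FIRST_LETTER_GROUP = 12  # one group per Latin base letter: 12..37


def build_table():
    """Return a toy mapping: character -> (group number, position in group)."""
    table = {" ": (SPACE_GROUP, 0)}
    for pos, ch in enumerate("-.,;:!?'"):
        table[ch] = (PUNCT_GROUP, pos)
    for digit in string.digits:
        table[digit] = (FIRST_DIGIT_GROUP + int(digit), 0)
    for i, ch in enumerate(string.ascii_lowercase):
        table[ch] = (FIRST_LETTER_GROUP + i, 0)
        table[ch.upper()] = (FIRST_LETTER_GROUP + i, 1)
    # Accented letters join the group of their base letter (illustrative subset).
    accented = {"a": "áâäå", "e": "éêë", "o": "óôö", "u": "úûü"}
    for base, variants in accented.items():
        group = table[base][0]
        for k, ch in enumerate(variants):
            table[ch] = (group, 2 + 2 * k)
            table[ch.upper()] = (group, 3 + 2 * k)
    return table


TABLE = build_table()


def sort_key(word):
    """First pass: group numbers only. Second pass: positions within groups.

    Punctuation is skipped in both passes; characters missing from the toy
    table raise KeyError.
    """
    entries = [TABLE[ch] for ch in word if TABLE[ch][0] != PUNCT_GROUP]
    groups = [group for group, _ in entries]
    positions = [position for _, position in entries]
    return (groups, positions)
```

With this toy table, `sorted(words, key=sort_key)` places "Äpfel" before "apple" (decided in the first pass, f before p) and "cooperation" before "coöperation" (decided only in the second pass). Words that differ only in punctuation compare equal here, so a complete specification would need a further tie-breaker.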
-
- I'd like to discuss here where the Cyrillic, Greek etc. letters
- should be included. The positions that correspond to the Latin letters
- used in international transcription systems might be a good starting
- point, if this is possible while preserving the ordering of e.g. the
- Greek alphabet. If this is absolutely impossible (e.g. with Han characters),
- then of course completely separate groups have to be used.
-
- I have started to write a full algebraic specification of the algorithm,
- but it is really a pain to do this in ASCII. :-) Perhaps a 20-line C
- function will be a better specification.
-
- It would be possible to encode group number and position within a group
- in the character code, but this would increase the number of bits needed,
- because there would have to be big gaps in the code space. I prefer
- 16-bit ISO 10646 together with a table that gives me a 2x2 byte code
- (group x position in group) for each character. Clever programmers
- will store this table efficiently in much less than 128 kBytes.
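
One way such a table might be stored compactly is a two-stage lookup: the high byte of the 16-bit code selects a 256-entry page, and identical pages (for example, all pages of unassigned codes) are stored only once. The sketch below is only an assumption about how this could be done, not the method proposed in this post; the names `pack`, `unpack`, `build_two_stage` and `lookup` and the `UNASSIGNED` marker are hypothetical. It also assumes that `array("I")` items are at least 32 bits wide, which holds on common platforms.

```python
from array import array


def pack(group, position):
    """Combine a 16-bit group and a 16-bit position into one 32-bit code."""
    return (group << 16) | position


def unpack(code):
    """Split a packed 32-bit code back into (group, position)."""
    return code >> 16, code & 0xFFFF


UNASSIGNED = pack(0xFFFF, 0)  # hypothetical marker for characters without an entry


def build_two_stage(entries):
    """entries: dict mapping a 16-bit character code to (group, position).

    Returns (index, pages): index[high_byte] is a page number, and each page
    holds 256 packed codes. Identical pages are shared.
    """
    pages = []
    page_ids = {}
    index = array("H")
    for high in range(256):
        page = tuple(
            pack(*entries[(high << 8) | low])
            if ((high << 8) | low) in entries
            else UNASSIGNED
            for low in range(256)
        )
        if page not in page_ids:
            page_ids[page] = len(pages)
            pages.append(array("I", page))
        index.append(page_ids[page])
    return index, pages


def lookup(index, pages, code):
    """Return (group, position) for a 16-bit character code."""
    return unpack(pages[index[code >> 8]][code & 0xFF])
```

The memory needed is one 256-entry index plus one page per distinct block of 256 characters, so sparsely assigned regions of the code space cost almost nothing.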
-
- >>Word lists produced by my algorithm are pretty easy to scan for human eyes.
-
- >is language-relative as well. If people are used to thinking of a-circle as
- >being at the end of the alphabet, it may become a different letter. For all I
- >know they think of it as a circle with an "a" as a diacritical mark rather than
- >the other way around.
-
- If we want ONE alphabetical order for an international locale, then we
- have to use what is easily understood by the majority of the world
- population. No one should be forced to use the international sorting,
- but for many people, a systematic approach will be very useful.
-
- Are there any experts in the Unicode Consortium who believe that specifying
- the details of this algorithm together with a well-designed group table
- would be worth the effort? This would be a very nice next standard submitted
- by the Unicode Consortium to ISO ... :-) Without it, terrible
- ISO 10646 sorting methods based on the Unicode code number of each character
- will become common practice!!! The definition of these groups might perhaps also be
- useful for case-and-diacritic-invariant searching, because case-only-invariant
- searching, which was OK with ASCII, is only half a solution in Unicode.
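
As a rough illustration of case-and-diacritic-invariant searching, both strings can be reduced to a "group string" before comparing. The sketch below does not use a group table as proposed here. As a stand-in, and purely as an assumption, it derives the base letter with Python's `unicodedata` module by decomposing each character and dropping the combining marks. Letters without such a decomposition (for example, Icelandic Thorn) are left unchanged, so a real group table would still be needed for them. The function names are hypothetical.

```python
import unicodedata


def group_string(text):
    """Reduce text to base letters: strip combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def find_invariant(haystack, needle):
    """Return True if needle occurs in haystack, ignoring case and diacritics."""
    return group_string(needle) in group_string(haystack)
```

For example, `find_invariant("Die Universität Erlangen", "universitat")` succeeds, whereas a search that ignores only case would miss it.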
-
- Markus
-
- --
- Markus Kuhn, Computer Science student -=-=- University of Erlangen, Germany
- Internet: mskuhn@immd4.informatik.uni-erlangen.de | X.500 entry available
- --- Who, how, what? How come, what for, why? Those who ask nothing stay dumb. ---
-