NetNews Usenet Archive 1993 #1

Text File | 1993-01-06 | 5.6 KB | 109 lines

Path: sparky!uunet!europa.asd.contel.com!darwin.sura.net!jvnc.net!newsserver.jvnc.net!yale.edu!ira.uka.de!fauern!uni-erlangen.de!not-for-mail
From: unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn)
Newsgroups: comp.std.internat
Subject: Re: Let's develop ISO sorting rules
Date: 6 Jan 1993 16:49:27 +0100
Organization: Regionales Rechenzentrum Erlangen
Message-ID: <1iev27EINNmc4@uni-erlangen.de>
References: <1i0vnmINN352@rodan.UU.NET> <8494@charon.cwi.nl> <1i2durINN2pj@rodan.UU.NET> <8496@charon.cwi.nl> <C0Cuz5.2wy@flatlin.ka.sub.org> <1ibmdcEINNooe@uni-erlangen.de> <1993Jan5.222627.29561@jarvis.csri.toronto.edu>
Reply-To: mskuhn@immd4.informatik.uni-erlangen.de
NNTP-Posting-Host: cd4680fs.rrze.uni-erlangen.de
Lines: 96

flaps@dgp.toronto.edu (Alan J Rosenthal) writes:

>unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
>>Believe me, German users who know what they are talking about won't
>>complain if "a, "o and "u are not sorted as ae, oe and ue.

>I believe you. I'm sure that the analogous statement is true for French. But
>I'm almost as sure that Swedish readers won't want "a-circle" to be anywhere
>near "a". I think your flexibility on this is a language-relative phenomenon.

This would be really bad news. Could someone from Sweden comment on this?

My idea is that there is one Latin alphabet, known to everyone who uses
a Latin script:

    abcdefghijklmnopqrstuvwxyz

The following rule can be understood by any user in less than 10 seconds:
sort Latin characters with diacritics (e.g. "a, 'a, ^a, ...) near their
plain Latin version a. The time needed to understand this rule is
independent of which Latin-alphabet-based language the user normally uses.
You then still have to insert special characters (e.g. Icelandic Thorn) at
suitable places THAT ONLY NEED TO BE REMEMBERED BY PEOPLE INTERESTED
IN THIS LANGUAGE! That's really simple to understand.

OK, perhaps users should not be forced to use this standard multilingual
sorting, but if it is offered as one possible 'international locale', I bet
many people will come to like it very quickly.

My algorithm is a kind of generalized upcase conversion: all a letters
(a, A, A-ring, a-ring, "a, "A, ... perhaps even Greek alpha and Alpha) form
one group. Without the huge Far East character sets, there might be about
50-80 groups. Each ISO 10646 character is assigned a group number, and
members of the same group have the same group number. Then we sort
according to group numbers, not according to character codes.

If the comparison between two words fails because their group strings do
not differ, then we compare by the positions within a group. Each character
is also given a group position number, which provides a total order on all
characters within a group. E.g. we might define in the standard that
ring-above always comes after acute-accent, etc. These definitions are only
necessary to provide a total ordering and need NOT be known by 99% of the
users, because they will be significant very rarely (e.g. does cooperation
come before or after coöperation [ö="o]?).

In addition, there is a punctuation group, which is ignored in the first
pass, and a space group. Not to forget 10 digit groups, into which all the
different digit versions in Unicode are collected, etc.

I'd like to discuss here where the Cyrillic, Greek etc. letters should be
included. The positions that correspond to the Latin letters used in
international transcription systems might be a good starting point, if this
is possible while preserving the ordering of e.g. the Greek alphabet. If
this is absolutely impossible (e.g. with Han characters), then of course
completely separate groups have to be used.

I have started to write a full algebraic specification of the algorithm,
but it is really a pain to do this in ASCII. :-) Perhaps a 20-line C
function will be a better specification.

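As a rough illustration of the group/position comparison described above,
here is a minimal sketch in C. It is not a specification. Everything
concrete in it is an assumption made for the example: the names
intl_compare, key_of and marked, the group numbers, the position numbers,
the use of ISO 8859-1 bytes (the first 256 ISO 10646 code points) instead
of 16-bit codes, and the final byte-wise tie-break that makes the order
total when punctuation is the only difference.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical group numbering; the post does not fix any of these values. */
enum { GROUP_PUNCT = 0, GROUP_SPACE = 1, GROUP_DIGIT0 = 2, GROUP_LETTER_A = 12 };

typedef struct { unsigned group; unsigned pos; } Key;

/* Illustrative entries only: a few ISO 8859-1 letters with diacritics.
 * Positions 0 and 1 are reserved for the plain lower/upper case letter. */
static const struct { unsigned char code; char base; unsigned pos; } marked[] = {
    {0xE1, 'a', 2}, {0xC1, 'a', 3},   /* a-acute, A-acute */
    {0xE4, 'a', 4}, {0xC4, 'a', 5},   /* a-diaeresis, A-diaeresis */
    {0xE5, 'a', 6}, {0xC5, 'a', 7},   /* a-ring, A-ring: after acute */
    {0xF6, 'o', 4}, {0xD6, 'o', 5},   /* o-diaeresis, O-diaeresis */
    {0xFC, 'u', 4}, {0xDC, 'u', 5},   /* u-diaeresis, U-diaeresis */
};

/* Map one character to (group, position in group). Assumes an ASCII-based
 * execution character set, so that a..z and A..Z are contiguous. */
static Key key_of(unsigned char c)
{
    Key k;
    size_t i;
    if (c >= 'a' && c <= 'z') { k.group = GROUP_LETTER_A + (c - 'a'); k.pos = 0; return k; }
    if (c >= 'A' && c <= 'Z') { k.group = GROUP_LETTER_A + (c - 'A'); k.pos = 1; return k; }
    if (c >= '0' && c <= '9') { k.group = GROUP_DIGIT0 + (c - '0'); k.pos = 0; return k; }
    if (c == ' ') { k.group = GROUP_SPACE; k.pos = 0; return k; }
    for (i = 0; i < sizeof marked / sizeof marked[0]; i++) {
        if (marked[i].code == c) {
            k.group = GROUP_LETTER_A + (unsigned)(marked[i].base - 'a');
            k.pos = marked[i].pos;
            return k;
        }
    }
    /* Anything not listed is treated as punctuation in this sketch. */
    k.group = GROUP_PUNCT;
    k.pos = c;
    return k;
}

/* Advance past characters of the punctuation group. */
static const unsigned char *skip_punct(const unsigned char *s)
{
    while (*s != '\0' && key_of(*s).group == GROUP_PUNCT)
        s++;
    return s;
}

/* One pass over both strings, comparing either group numbers (use_pos == 0)
 * or positions within the group (use_pos != 0); punctuation is skipped. */
static int compare_pass(const char *a, const char *b, int use_pos)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    for (;;) {
        unsigned x, y;
        p = skip_punct(p);
        q = skip_punct(q);
        if (*p == '\0' || *q == '\0')
            return (*p != '\0') - (*q != '\0');  /* shorter string first */
        x = use_pos ? key_of(*p).pos : key_of(*p).group;
        y = use_pos ? key_of(*q).pos : key_of(*q).group;
        if (x != y)
            return x < y ? -1 : 1;
        p++;
        q++;
    }
}

/* Returns <0, 0 or >0, like strcmp. */
int intl_compare(const char *a, const char *b)
{
    int r = compare_pass(a, b, 0);        /* pass 1: group strings */
    if (r == 0)
        r = compare_pass(a, b, 1);        /* pass 2: positions within groups */
    if (r == 0)
        r = strcmp(a, b);                 /* assumed last resort: raw codes */
    return r;
}
```

With this toy table, "Bar" sorts before "Bär", which sorts before "Bas",
whereas a plain comparison by character code puts "Bär" after "Bas". A real
table would have to cover all of ISO 10646 and would be the actual subject
of standardization.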
It would be possible to encode the group number and the position within a
group in the character code itself, but this would increase the number of
bits needed, because there would have to be big gaps in the code space. I
prefer 16-bit ISO 10646 together with a table that gives me a 2x2 byte code
(group x position in group) for each character. Clever programmers will
store this table efficiently in much less than 128 kBytes.

>>Word lists produced by my algorithm are pretty easy to scan for human eyes.

>is language-relative as well. If people are used to thinking of a-circle as
>being at the end of the alphabet, it may become a different letter. For all I
>know they think of it as a circle with an "a" as a diacritical mark rather than
>the other way around.

If we want ONE alphabetical order for an international locale, then we
have to use what is easily understood by the majority of the world
population. No one should be forced to use the international sorting, but
for many people a systematic approach will be very useful.

Are there any experts in the Unicode Consortium who believe that
specifying the details of this algorithm, together with a well-designed
group table, would be worth the effort? This would be a very nice next
standard submitted by the Unicode Consortium to ISO ... :-)

Without it, terrible ISO 10646 sorting methods based on the Unicode code
number of each character will become common practice!!!

The definition of these groups might perhaps also be useful for
case-and-diacritic-invariant searching, because case-only-invariant
searching, which was OK with ASCII, is only half a solution in Unicode.

Markus

-- 
Markus Kuhn, Computer Science student -=-=- University of Erlangen, Germany
Internet: mskuhn@immd4.informatik.uni-erlangen.de | X.500 entry available
 --- Wer, wie, was? Wieso, weshalb, warum? Wer nichts fragt bleibt dumm. ---