- Path: sparky!uunet!gatech!enterpoop.mit.edu!eru.mt.luth.se!lunic!sunic!corax.udac.uu.se!Riga.DoCS.UU.SE!andersa
- From: andersa@Riga.DoCS.UU.SE (Anders Andersson)
- Newsgroups: comp.std.internat
- Subject: Re: Let's develop ISO sorting rules
- Message-ID: <1ifr9sINNd0o@corax.udac.uu.se>
- Date: 6 Jan 93 23:51:24 GMT
- References: <1i0vnmINN352@rodan.UU.NET> <8494@charon.cwi.nl> <1i2durINN2pj@rodan.UU.NET> <8496@charon.cwi.nl> <C0Cuz5.2wy@flatlin.ka.sub.org> <1ibmdcEINNooe@uni-erlangen.de> <1993Jan5.222627.29561@jarvis.csri.toronto.edu> <1iev27EINNmc4@uni-erlangen.de>
- Organization: Uppsala University, Sweden
- Lines: 87
- NNTP-Posting-Host: riga.docs.uu.se
-
- In article <1iev27EINNmc4@uni-erlangen.de>, unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
- > flaps@dgp.toronto.edu (Alan J Rosenthal) writes:
- > >I believe you. I'm sure that the analogous statement is true for French. But
- > >I'm almost as sure that Swedish readers won't want "a-circle" to be anywhere
- > >near "a". I think your flexibility on this is a language-relative phenomenon.
- >
- > This would be really bad news, could someone from Sweden comment on this?
-
- I can support Alan's assumption. The three common Swedish letters
- that are the issue here are _always_ sorted after Z when done here
- properly. The only place where I've seen them occur elsewhere is in
- foreign-made gazetteers (such as the index of a world atlas), where
- it wouldn't make sense to accommodate readers of Swedish in particular.
-
- This of course doesn't mean that we would be totally lost in the
- dark if a foreign sorting method were forced down our throats, but
- I'm sure a lot of people would complain about it if they were told
- it was for the benefit of computer standardization... I can see
- that A and A-ring have a common graphical component, but so what?
- We consider them different. A Swedish typist might consider 'l'
- and '1' equivalent, but not A and A-ring.
-
- The situation is the same in Finland, Norway and Denmark, though
- the relative positions of (and the actual glyphs used for) the
- three vowels differ somewhat. When we sort Norwegian and Danish
- words in a Swedish context, we usually regard their vowels as
- equivalent to ours based on phonetics (e.g. Danish AE ligature =
- Swedish A-diaeresis).
-
- Further, W is sorted as V, and U-diaeresis as Y. Your suggested
- rule does apply to most other accented and special letters of the
- Latin alphabet, though, whenever they occur in a Swedish context
- (most often E-acute, and in proper names of people).
-
- The original language of the sorted word does not matter to us
- (I don't think it does to anybody); O-diaeresis comes last in
- the alphabet regardless of whether it belongs to a German,
- Hungarian, or Turkish name.
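
The rules above can be sketched as a sort key. This is only an illustration under assumptions, not an official collation table: the names `ALPHABET`, `EQUIV` and `swedish_key` are made up for the example, the mapping of O-slash to O-diaeresis is assumed by analogy with the AE case, and characters outside the alphabet (spaces, digits) are simply ignored.

```python
import unicodedata

# Letters in the order described: a-z, then the three extra vowels after z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Equivalences stated in the post (w as v, ü as y, Danish æ as ä).
# ø -> ö is an assumption made by analogy, not stated explicitly.
EQUIV = {"w": "v", "ü": "y", "æ": "ä", "ø": "ö"}

def swedish_key(word):
    """Return a list of ranks usable as a sort key (hypothetical helper)."""
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        ch = EQUIV.get(ch, ch)
        if ch not in RANK:
            # Other accented letters sort with their base letter (é -> e).
            ch = unicodedata.normalize("NFD", ch)[0]
        if ch in RANK:
            key.append(RANK[ch])
    return key

names = ["Öberg", "Zetterberg", "Åberg", "Andersson", "Ärlig", "Édith", "Eva"]
ordered = sorted(names, key=swedish_key)
```

With this key, `ordered` starts with Andersson, Édith, Eva and ends with Zetterberg, Åberg, Ärlig, Öberg. The origin of a name plays no part: an Ö in a Turkish name gets the same rank as in a Swedish one.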
-
- > And the following rule may be understood by every user within less than
- > 10 seconds:
- >
- > sort latin characters with diacritics (e.g. "a, 'a, ^a, ...) near their
- > pure latin version a.
-
- The issue is not whether the rule is understood, but whether it's
- accepted. It may not turn into a political issue like the one
- about nuclear power, but people generally prefer doing things
- the way they have always done them.
-
- Btw, where would you put the AE ligature? With a or with ae?
-
- > I'd like to discuss here, where the cyrillic, greek etc. letters
- > should be included. The positions that correspond to the latin letters
- > used in international transcription systems might be a good starting
- > point, if this is possible while preserving the ordering of e.g. the
- > greek alphabet.
-
- If you take the Latin alphabet as your frame of reference, then
- it's not possible to preserve the alphabetic order of either
- Cyrillic (A, B, V, G, D...) or Greek (A, B, G, D, E...) letters.
- However, Cyrillic appears to have more in common with Greek than
- with Latin. Maybe we should settle for Greek order? :-)
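
A small check of the claim above, using only the first five letters of each alphabet and the Latin transcriptions given in the previous paragraph. The helper name `order_preserved` is invented for the example, and the transcription pairs are a simplification, not a full transliteration standard.

```python
# First five letters of each alphabet paired with the Latin transcription
# listed above (A, B, V, G, D and A, B, G, D, E).
CYRILLIC = [("а", "a"), ("б", "b"), ("в", "v"), ("г", "g"), ("д", "d")]
GREEK = [("α", "a"), ("β", "b"), ("γ", "g"), ("δ", "d"), ("ε", "e")]

def order_preserved(pairs):
    """True if placing each letter at its Latin position keeps native order."""
    latin = [lat for _, lat in pairs]
    return latin == sorted(latin)

cyrillic_ok = order_preserved(CYRILLIC)
greek_ok = order_preserved(GREEK)
```

Both results are False: V sorts after G and D in Latin order, and G sorts after D and E, so the native sequence is broken within the first five letters already.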
-
- > Without it, terrible
- > ISO 10646 sorting methods based on the Unicode code number of each character
- > will become common practice!!! The definition of these groups might perhaps also be
- > useful for case-and-diacritic-invariant searching, because case-only-invariant
- > searching is only half a solution in Unicode that was ok with ASCII.
-
- Why bother, really? If you need a sorting order that is the same
- all over the world, for the purpose of building a database where
- ISO 10646 strings are used as keys, then simply use code point order.
- We have done that with ASCII for decades; it's not pretty, but it's
- probably not intended for human consumption anyway. If, on the
- other hand, the purpose is to produce a human-readable index or
- something, then strive to accommodate the human, which means using
- the user's preferred sorting order. If the index is to be printed
- once and not changeable thereafter, and to be used by many users,
- decide upon one existing natural language, and sort according to
- its rules (English has worked fine for the gazetteers I've seen).
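
To make the distinction concrete, here is a rough sketch of the two cases. The names `codepoint_key`, `user_key` and `USER_ALPHABET` are hypothetical, and the Swedish-style alphabet merely stands in for whatever order a given user prefers.

```python
def codepoint_key(s):
    # One integer per character: its ISO 10646 / Unicode code point.
    return [ord(ch) for ch in s]

names = ["Zebra", "apple", "Åberg", "Öberg", "Ärlig"]

# Machine-oriented order for database keys: the same on every system,
# with no language or locale involved.
db_order = sorted(names, key=codepoint_key)

# Reader-oriented order: an assumed per-user alphabet (Swedish-style here).
USER_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def user_key(s):
    # Characters outside the assumed alphabet are ignored in this sketch.
    return [USER_ALPHABET.index(ch) for ch in s.lower() if ch in USER_ALPHABET]

index_order = sorted(names, key=user_key)
```

Code point order gives Zebra, apple, Ärlig, Åberg, Öberg: uppercase before lowercase, and Ä before Å. That is fine for keys nobody reads. The user-oriented key gives apple, Zebra, Åberg, Ärlig, Öberg, which is what a Swedish reader would expect in a printed index.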
-
- Do you have an example of where your method would be used?
- --
- Anders Andersson, Dept. of Computer Systems, Uppsala University
- Paper Mail: Box 325, S-751 05 UPPSALA, Sweden
- Phone: +46 18 183170 EMail: andersa@DoCS.UU.SE
-