- From: arnet@hpcupt1.cup.hp.com (Arne Thormodsen)
- Date: Wed, 6 Jan 1993 17:28:17 GMT
- Subject: Re: Let's develop ISO sorting rules
- Message-ID: <140950002@hpcupt1.cup.hp.com>
- Organization: Hewlett Packard, Cupertino
- Path: sparky!uunet!spool.mu.edu!sdd.hp.com!hpscit.sc.hp.com!hplextra!hpcss01!hpcupt1!arnet
- Newsgroups: comp.std.internat
- References: <1ibmdcEINNooe@uni-erlangen.de>
- Lines: 67
-
- Markus Kuhn writes:
-
- >My vision is NOT a sorting order that is embedded in the character set.
- >That would be too trivial, of course. The Unicode developers had good
- >reasons to embed one into the code table. No, I have a slightly more clever
- >algorithm in mind, that will do 2 passes:
- >
- >  1. ignore punctuation etc. and group letters together before
- >     comparing the strings.
- >
- > 2. No. 1 will not offer a total order, which should be supplied by a
- >     beautiful sorting standard. So if 1 fails, then compare the strings
- >     completely without throwing any trivial information away. Rule 2 must not
- >     conflict with rule 1; the partial ordering must only be completed.
- >
- >I have been playing around with an algorithm that works this way for a few
- >days, and the results are very promising and easy to understand intuitively.
- :
- :
- >The method deals fine with punctuation in the strings (e.g. in
- >bibliographic references and person names), is pretty efficient, and is
- >easy to implement. Word lists produced by my algorithm are pretty easy
- >for human eyes to scan. The solution is much more general than the simple
- >upcase conversion before lexicographic character code comparison that
- >is often used today with US-ASCII.
- >
- >I still don't know whether I should post the algorithm here, or whether
- >I should write a paper or tech report first, as it is much more promising
- >than all the character-code-based lexical orderings that have been proposed
- >here so far.
-
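- A minimal sketch of what such a two-pass comparison might look like (my
- own guess from the description above, not Markus's actual algorithm; the
- helper names are made up). Pass 1 compares only the letters, case folded;
- pass 2 falls back to the untouched strings so the order stays total:
-
-     # Sketch only: a guess at the two-pass idea, not the posted algorithm.
-     # Pass 1 key ignores everything except letters; pass 2 breaks ties
-     # with the raw string so the resulting order is total.
-     import unicodedata
-
-     def letters_only(s):
-         return ''.join(c for c in s.casefold()
-                        if unicodedata.category(c).startswith('L'))
-
-     def sort_key(s):
-         return (letters_only(s), s)
-
-     words = ["O'Brien", "Obrien", "de la Cruz", "Delacroix"]
-     print(sorted(words, key=sort_key))
-
- With this key "O'Brien" and "Obrien" end up next to each other, and the
- raw-string tiebreak keeps the order total, which seems to be the intent
- of rule 2.
-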
- Markus,
-
- I must admit severe scepticism here, but I am sincerely interested
- in what you are doing. Please post details if you feel you can.
-
- My scepticism is basically this: I guess you have come up with a
- method which you believe is more or less acceptable to Europeans
- and North/South Americans. What about Chinese, Japanese, Korean,
- Hebrew, Arabic, all the Indic languages, Thai, etc.? These are all
- commercially important markets, and they have very different sorting
- requirements. Some things you mention, like ignoring "punctuation",
- simply won't work for some languages. For example, in Japanese there
- are "punctuation" marks like the "ditto" mark (which means duplicate the
- last Kanji) or the Katakana "dash" (which means extend the previous
- vowel). These (should) affect collations. I am sure there are dozens,
- if not hundreds, of similar cases.
-
- Personally (and professionally :-) I believe that the support of
- multiple and complex collation algorithms is a necessary and permanent
- feature of internationalized programs. This problem cannot be wished away.
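-
- As a quick illustration (not from Markus's proposal; the locale names
- below are assumptions and have to be installed on the system), the same
- word list really does sort differently under different collations:
-
-     import locale
-
-     words = ["zebra", "apple", "Zürich", "Ärger"]
-
-     # German collation: umlauts sort next to their base letters.
-     locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")  # assumed installed
-     print(sorted(words, key=locale.strxfrm))
-
-     # Swedish collation: "Ä" sorts after "Z" instead.
-     locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")  # assumed installed
-     print(sorted(words, key=locale.strxfrm))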
-
- For *internal* purposes (B-trees, etc.) a simple sort based on numeric
- character codes (8, 16, or however many bits) will always work, and it
- is the fastest method possible.
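-
- For instance (a sketch with a made-up key list), raw code-point or byte
- order gives a total, machine-independent ordering that is fine for index
- keys, even though it is not what a user would expect to see:
-
-     keys = ["Zürich", "Oslo", "apple", "Öl"]
-
-     # Raw code-point order: total and fast, but not user-friendly.
-     print(sorted(keys))
-
-     # Byte order of a fixed encoding, e.g. for on-disk index keys.
-     print(sorted(keys, key=lambda s: s.encode("utf-8")))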
-
- I don't see a market need for the kind of "in-between" solution you
- are proposing (but I am still interested...).
-
- --arne
-
- Arne Thormodsen
- CSO Internationalization
- Hewlett-Packard
-
- DISCLAIMER: These views are my own and do not necessarily represent
- any views of the Hewlett-Packard Company.
-