- From: arnet@hpcupt1.cup.hp.com (Arne Thormodsen)
- Date: Wed, 6 Jan 1993 17:28:17 GMT
- Subject: Re: Let's develop ISO sorting rules
- Message-ID: <140950002@hpcupt1.cup.hp.com>
- Organization: Hewlett Packard, Cupertino
- Path: sparky!uunet!spool.mu.edu!sdd.hp.com!hpscit.sc.hp.com!hplextra!hpcss01!hpcupt1!arnet
- Newsgroups: comp.std.internat
- References: <1ibmdcEINNooe@uni-erlangen.de>
- Lines: 67
-
- Markus Kuhn writes:
-
- >My vision is NOT a sorting order that is embedded in the character set.
- >That would be too trivial, of course. The Unicode developers had good
- >reasons to embed one into the code table. No, I have a slightly more clever
- >algorithm in mind, that will do 2 passes:
- >
- >  1. ignore punctuation etc. and group letters together before
- >     comparing the strings.
- >
- > 2. No. 1 will not offer a total order, which should be supplied by a
- >     beautiful sorting standard. So if 1 fails, then compare the strings
- >     completely without throwing any trivial information away. Rule 2 must not
- >     conflict with rule 1; the partial ordering must only be completed.
- >
- >I have been playing around with an algorithm that works this way for a few
- >days, and the results are very promising and easy to understand intuitively.
- :
- :
- >The method deals fine with punctuation in the strings (e.g. in
- >bibliographic references and person names), is pretty efficient, and is
- >easy to implement. Word lists produced by my algorithm are pretty easy
- >for human eyes to scan. The solution is much more general than the simple
- >upcase conversion before lexicographic character code comparison that
- >is often used today with US-ASCII.
- >
- >I still don't know whether I should post the algorithm here, or whether
- >I should write a paper or tech report first, as it is much more promising
- >than all the character-code-based lexical orderings that have been proposed
- >here so far.
-
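- A minimal sketch of what such a two-pass comparison might look like (my
- own guess from the description above, not Markus's actual algorithm; the
- helper names are made up). Pass 1 compares only the letters, case folded;
- pass 2 falls back to the untouched strings so the order stays total:
-
-     # Sketch only: a guess at the two-pass idea, not the posted algorithm.
-     # Pass 1 key ignores everything except letters; pass 2 breaks ties
-     # with the raw string so the resulting order is total.
-     import unicodedata
-
-     def letters_only(s):
-         return ''.join(c for c in s.casefold()
-                        if unicodedata.category(c).startswith('L'))
-
-     def sort_key(s):
-         return (letters_only(s), s)
-
-     words = ["O'Brien", "Obrien", "de la Cruz", "Delacroix"]
-     print(sorted(words, key=sort_key))
-
- With this key "O'Brien" and "Obrien" end up next to each other, and the
- raw-string tiebreak keeps the order total, which seems to be the intent
- of rule 2.
-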
- Markus,
-
- I must admit severe scepticism here, but I am sincerely interested
- in what you are doing. Please post details if you feel you can.
-
- My scepticism is basically this: I guess you have come up with a
- method which you believe is more or less acceptable to Europeans
- and North/South Americans. What about Chinese, Japanese, Korean,
- Hebrew, Arabic, all the Indic languages, Thai, etc.? These are all
- commercially important markets, and they have very different sorting
- requirements. Some things you mention, like ignoring "punctuation",
- simply won't work for some languages. For example, in Japanese there
- are "punctuation" marks like the "ditto" mark (which means duplicate the
- last Kanji) or the Katakana "dash" (which means extend the previous
- vowel). These (should) affect collations. I am sure there are dozens,
- if not hundreds, of similar cases.
-
- Personally (and professionally :-) I believe that the support of
- multiple and complex collation algorithms is a necessary and permanent
- feature of internationalized programs. This problem cannot be wished away.
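-
- As a quick illustration (not from Markus's proposal; the locale names
- below are assumptions and have to be installed on the system), the same
- word list really does sort differently under different collations:
-
-     import locale
-
-     words = ["zebra", "apple", "Zürich", "Ärger"]
-
-     # German collation: umlauts sort next to their base letters.
-     locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")  # assumed installed
-     print(sorted(words, key=locale.strxfrm))
-
-     # Swedish collation: "Ä" sorts after "Z" instead.
-     locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")  # assumed installed
-     print(sorted(words, key=locale.strxfrm))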
-
- For *internal* purposes (B-trees, etc.) a simple sort based on numeric
- character codes (8, 16, or however many bits) will always work, and it
- is the fastest method possible.
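-
- For instance (a sketch with a made-up key list), raw code-point or byte
- order gives a total, machine-independent ordering that is fine for index
- keys, even though it is not what a user would expect to see:
-
-     keys = ["Zürich", "Oslo", "apple", "Öl"]
-
-     # Raw code-point order: total and fast, but not user-friendly.
-     print(sorted(keys))
-
-     # Byte order of a fixed encoding, e.g. for on-disk index keys.
-     print(sorted(keys, key=lambda s: s.encode("utf-8")))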
-
- I don't see a market need for the kind of "in-between" solution you
- are proposing (but I am still interested...).
-
- --arne
-
- Arne Thormodsen
- CSO Internationalization
- Hewlett-Packard
-
- DISCLAIMER: These views are my own and do not necessarily represent
- any views of the Hewlett-Packard Company.
-