NetNews Usenet Archive 1993 #1

Text File | 1993-01-06 | 43.8 KB | 1,023 lines

Path: sparky!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.std.internat
Subject: Re: Language tagging
Date: 5 Jan 1993 20:42:38 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 1011
Message-ID: <1iddeeINN58g@rodan.UU.NET>
References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu>
NNTP-Posting-Host: rodan.uu.net

Excuse me for the delay -- i've got a lot of things i have to do besides playing with flammable materials in USENET.

Well, i'll try to outline the rationale behind my proposal.

First, let's define terms. Localization is when a system is supposed to work with only one (the user's) language. In practice, bilinguality is required because most programming languages and command-line interfaces heavily use English mnemonics. A localized system's components know a priori the language of the texts they work with (including sorting/case conversion rules and, more importantly, the alphabet). Since a (supposedly small) alphabet is used, there is no need to use multibyte characters (at least for European and Slavic languages) and there are already national standards in place. For Greek or Hebrew or Russian users Unicode/ISO10646 offers nothing except wasted disk space.

Therefore, i'm discarding all claims that Unicode/ISO10646 is intended to provide means for localization ONLY. It is designed for internationalization, i.e. for supporting sufficiently multilingual environments; and this function is the only justification for its introduction.

Now, i think everybody agrees that the "ultimate encoding" is the one which provides the complete information about which language is used -- it solves all the problems. Such an encoding can be implemented with:

1) register switching with (say) escape sequences.
This is highly impractical; moreover, it is impossible to determine the language if the information is read from some point in the middle of a file -- this situation is especially troubling with Unix file pointer sharing.

2) every character code is a pair (language-code, letter-number-in-alphabet).
It is hardly practical because of storage considerations. Codifying languages requires at least 10-12 bits, i.e. every letter turns into at least a 3-byte sequence.

Now, we can sacrifice some information to achieve better compression by assigning similar codes to similar LETTERS (note that i didn't use the word GLYPH). First we need to determine which information CAN be dropped without making the encoding impractical. Obviously, the visual representation (i.e. the set of glyphs) should be preserved -- it sets the lower bound on the compression.

Let's analyse the other properties of the ultimate encoding:

1) it explicitly distinguishes between the natural languages. This information cannot be used for processing the texts anyway because machines do not *understand* languages and therefore do not care about the contents. The language indication can be useful for selecting a dictionary; but the words themselves are already good indicators of the language involved (with few exceptions). Moreover, spell-checking is usually done at the input, where the language in question is known. I think this property can be easily dropped.

2) there is an *algorithm* for converting from upper case to lower case and vice versa. This property is extremely useful for everything involving searches in texts and databases. Case-insensitive comparison is simply unavoidable because of the rules of capitalization in sentences (i.e. you can't easily tell if the word at the beginning of a sentence is capitalized because it is a name or because of the common rule).
Database applications often require preserving strings as they were entered while searching for words with undefined capitalization. If an encoding does not provide this property it is nearly useless for most library/groupware and business applications. Therefore this property MUST be preserved.

(I have to remark here that neither Unicode nor ISO 10646 provides a way to perform case-insensitive comparison without specifying the exact language of the string -- and if the language is kept OUTSIDE the text, the text itself can be encoded within the alphabet of this particular language much more effectively.)

3) there is an *algorithm* for lexicographical sorting (and, for that matter, for range comparisons). This is necessary if the now-ubiquitous semantics of regular expressions, shell globbing and sorted directories (and, of course, neat sorted reports) is to be preserved. Several people argued that this functionality is not actually necessary because users will want to see strings sorted according to their native language's rules. While it is a neat idea, it still does not eliminate the necessity of a generic algorithm used as a fallback for "customized" sorting algorithms -- what, for example, is the idea of sorting Chinese using the rules of Finnish? Moreover, the usage of a "universal" algorithm is unavoidable when consumers of the pre-sorted information do not share a cultural background. Therefore this property must be preserved as well.

4) indication of the language is useful for hyphenation. There is no such thing as an ideal hyphenation algorithm -- hyphenation algorithms are nothing more than heuristics which yield more or less reasonable results. While high-quality hyphenation algorithms require dictionaries and heavily use language specifics, reasonable results (sufficient for use in, say, columns of tables or as a base for subsequent manual rearrangement) can be achieved by relatively generic algorithms based on knowledge of the specific roles of letters (i.e.
vowels and modifiers like hard/soft signs in Russian). Dictionary-based hyphenation algorithms do not need the language specification because the words themselves indicate the language; if the word is not in the dictionary the generic rules should be applied. Therefore it is desirable (though not necessary) that the letter classes (i.e. which ones are vowels) are preserved. Fortunately, if a letter is shared between languages it is nearly certain that it is (or is not) a vowel in both languages.

5) typographical-quality fonts often differentiate between "similar" glyphs reproducible on lower-resolution devices. However, it is possible to create reasonable-looking fonts shared between several languages if the basic graphic set of the languages is common (say, Latin for European languages, Cyrillic for most Slavic languages). Actually, it is the way font designers work in multi-lingual environments. The proposed encoding methodology (see later) "automatically" separates graphic sets, so the ability to output typography-quality text is essentially "free" in the proposed encoding.

The way the ultimate encoding is compressed into the proposed encoding is quite simple:

The alphabets are separated into several groups; in every group, conflicts between incompatible orderings are resolved by duplicating codes for the conflicting letters.

If there are letters which do not have upper/lowercase counterparts (like the German eszett), new codes should be reserved to distinguish the corresponding printable signs as "special" (i.e. the special German S) or new glyphs should be added (SS).

If there is ambiguity in converting case (like the Cyrillic/Greek/German T), the glyph should be assigned several codes to eliminate the ambiguity.

If a glyph can represent differently-behaving letters in different languages (like "o or l, c and h in Spanish), it should be assigned several codes to eliminate the ambiguity.
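
The disambiguation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code numbers, the `Letter` record and the helper names are all hypothetical, since the post does not assign actual codes. What it tries to show is that once the look-alike capital T has a separate code in each alphabet group, case conversion becomes a plain table lookup that needs no language tag.

```python
from typing import NamedTuple


class Letter(NamedTuple):
    group: str      # alphabet group the code belongs to
    glyph: str      # what gets drawn
    is_upper: bool
    partner: int    # code of the same letter in the other case


# Hypothetical code assignments, invented for this sketch only.
LETTERS = {
    0x0054: Letter("latin", "T", True, 0x0074),
    0x0074: Letter("latin", "t", False, 0x0054),
    0x0154: Letter("greek", "T", True, 0x0174),
    0x0174: Letter("greek", "\u03c4", False, 0x0154),     # small tau
    0x0254: Letter("cyrillic", "T", True, 0x0274),
    0x0274: Letter("cyrillic", "\u0442", False, 0x0254),  # small te
}


def to_lower(codes):
    """Lower-case a code sequence by table lookup; unknown codes pass through."""
    out = []
    for code in codes:
        letter = LETTERS.get(code)
        out.append(letter.partner if letter and letter.is_upper else code)
    return out


def render(codes):
    """Map codes to glyphs; several codes may share one glyph."""
    return "".join(LETTERS[c].glyph if c in LETTERS else "?" for c in codes)


def equal_ignoring_case(a, b):
    """Case-insensitive comparison that never asks which language is in use."""
    return to_lower(a) == to_lower(b)


capitals = [0x0054, 0x0154, 0x0254]
print(render(capitals))            # three identical-looking glyphs
print(render(to_lower(capitals)))  # three different lower-case letters
```

The price is visible in the same table: three codes render as one glyph, which is exactly the ambiguity the post acknowledges as the main inconvenience of the scheme.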
The "canonical" sorting order is the _dictionary_ one (dictionaries are usually more logical in their sorting conventions).

The criteria for separating alphabets into groups are simple as well -- it is simply a matter of comparing the number of new codes which are introduced by adding a new group with the number of new codes assigned by attempting to fit the language into existing groups (well, it's not that simple, but still the problem is well-formalized and can be solved with known mathematical methods without reverting to truth-searching by voting).

It makes sense to keep codes in an order similar to the order of letters in the alphabet, to eliminate the need for large tables used by the universal sorting algorithm.

The canonical universal sorting algorithm should be included in the standard (contrary to popular opinion it would not be THAT large -- i think everything can be fitted in less than 50 lines, because sorting irregularities are specific to European languages for historical reasons [namely, the idea of a dictionary with sorted entries is European -- and the irregular rules were accepted before the importance of simple lexicographical sorting was realized -- see Knuth vol.3]; the sorting rules for ancient languages were introduced _later_, as were rules for languages which had no established writing at the time (most Cyrillic-based languages, for example, had no written form before the nations were occupied by chauvinistic Russian imperialists who obviously did not care about the culture or ethnic minorities and oppressed them by inventing and introducing writing to be able to give orders in the natives' languages -- that's for soc.culture.soviet readers)).

The "customized" sorting algorithms (sorting everything Latin-based according to, say, Finnish rules) can in most cases be reduced to the universal sorting by trivial transliteration, which makes possible the introduction of table-driven generic customized sorting routines.
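
The reduction of customized sorting to the universal order can be sketched as follows. This is a minimal Python sketch, not the algorithm the post wants standardized: `universal_key` merely stands in for the canonical order by comparing numeric codes after case folding (assuming codes were assigned in alphabet order), and the "local rule" in the example is invented purely for illustration.

```python
def universal_key(word):
    """Stand-in for the canonical order: fold case, then compare codes numerically.

    Assumes codes were assigned in alphabet order, as the proposal suggests.
    """
    return tuple(ord(ch) for ch in word.lower())


def customized_key(translit):
    """Build a sort key that transliterates first, then reuses the universal order."""
    table = str.maketrans(translit)

    def key(word):
        return universal_key(word.lower().translate(table))

    return key


words = ["wolf", "vase", "Zebra", "apple", "vulture"]

# Hypothetical local rule, invented for illustration: treat "w" as "v".
local_key = customized_key({"w": "v"})

print(sorted(words, key=universal_key))
print(sorted(words, key=local_key))
```

The customized routine contains no sorting logic of its own; the only per-locale data is the small transliteration table, which is what makes a table-driven generic implementation plausible.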
Apparently it makes sense to violate the lexicographical letter ordering for Latin-based languages to make the proposed encoding upward-compatible with ASCII. The price is nothing more than one additional transliteration table in the comparison/sorting algorithm.

Since the proposed encoding preserves properties 2 & 3 of the ultimate encoding, it is POSSIBLE to create truly internationalized applications without getting into local specifics (i.e. you can sort a German file on a machine in Japan without any changes in the environment). No additional off-text information is required, which surely allows preserving existing interfaces like regular expressions/shell globbing/mailing systems etc. and minimizes changes in the existing software.

To ease the hyphenation problem it may make sense to add a zero-width "hyphenate hint" symbol (aka \% in troff). This symbol can be added manually or by a language-specific hyphenation algorithm at the input (it may as well be a part of the spell-checker; in this role it is essentially "free"). The idea is to leave ALL language specifics at the point of input, where the language is supposedly known. If a "generic" hyphenation algorithm is included in the standard it may be possible to omit hyphenation hints in words which are split correctly by the standard algorithm.

The main inconvenience introduced by the proposed encoding is the possibility of confusion between several codes for a glyph. However, since practically all input is done in one or two languages, it is not really a problem. Some people will object to explicit switching between languages; however, if they LIKE to include a foreign word in a letter as native, it is supposed to be treated as native. Introducing spell-checking at input and emphasising selected languages by means of color, brightness or different font styles practically eliminates the problem.
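
The hyphenation hint described above can be illustrated with a small Python sketch. Everything concrete here is an assumption: the post names no code for the hint, so `HINT` is an arbitrary placeholder character, and `split_at_hint` is a hypothetical helper. The sketch shows the division of labour the post argues for: the hints are placed once at input, and any later consumer can break a word without knowing its language.

```python
HINT = "\x1f"  # hypothetical code for the zero-width hint; not from any standard


def strip_hints(text):
    """What a consumer that never breaks words does: drop the hints."""
    return text.replace(HINT, "")


def split_at_hint(word, room):
    """Break `word` at the last hint such that head plus hyphen fits in `room` columns.

    Returns (head_with_hyphen, tail) or None if no hinted break fits.
    The tail keeps its remaining hints so it can be broken again later.
    """
    parts = word.split(HINT)
    head = ""
    best = None
    for i, part in enumerate(parts[:-1]):
        head += part
        if len(head) + 1 <= room:  # +1 for the visible hyphen
            best = (head + "-", HINT.join(parts[i + 1:]))
    return best


# Hints placed at input time, e.g. by a language-specific spell-checker.
word = "hy" + HINT + "phen" + HINT + "ation"

print(strip_hints(word))
print(split_at_hint(word, 8))
print(split_at_hint(word, 4))
```

A display routine with eight columns left breaks after "hyphen-", one with four columns breaks after "hy-", and neither needed a dictionary or a language tag.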
ISO 10646 (unlike Unicode as in Plan 9) already provides several codes for identical glyphs in places where it is obviously absurd to do otherwise (as with T in Cyrillic, Greek and Latin); but as the old saying goes: "you can't be 30% pregnant". This does not solve all sorting and case conversion problems (unlike the systematic approach i outlined before) but introduces all the inconveniences of having ambiguous glyph encodings.

Now, flame mode on (type 'n' if you think the following material can be offensive, tasteless or plain stupid):

-------
From: Terry Lambert

>Other word processing operations (such as dictionary and thesaurus use
>within the program) require knowledge of which language to use.

Phew! If a word is IN the dictionary it is already known which language it is from. Thesaurus-collecting algorithms generally do not care about language at all.

>The idea of sort order should be (and is in Unicode) divorced from the
>idea of information storage. The fact that one will have text files,
>data files, and text files which act as data files on the same machine
>*requires* some type of promiscuous [out of data band] method of
>determining the format of the data within a file.

NEVER say "impossible". As i just argued, it IS possible and no out-of-band information is required for 99% of applications.

>This method, whether
>it be language tagging of the files in the inode, or language tagging
>of the user during the login process is imperative. To do otherwise
>means that your localization data coexists with system data rather
>than system data being localized as well.

Has it ever occurred to you that texts may contain arbitrary numbers of languages? No external (out-of-text) tagging is sufficient in the general case.

>The operations you wish to perform are the province of applications running
>on the system, not the system itself.
>Regardless of whether this is done
>by an application programmer (as a per application localization) or by the
>creator of a library used by applications (as part of development system
>localization), THE CODE BELONGS IN USER SPACE.

Nobody (surely, not me) ever said anything about putting everything in the kernel. However, moving multilingual support from SYSTEM software to APPLICATION software is the best way to kill the idea. Somehow you forgot that what is not done once in the system will be done in a zillion incompatible ways in user programs. Take MS-DOS as a live example.

>A point of contention: a logician is, by discipline, a philosopher, not a
>mathematician. Being the latter does not qualify one as the former.

If that IS the way logic is taught here i feel pity for American education. Philosophy is a matter of speculation; logic is a precise science, and attempts to make "logical" decisions based on philosophical word games are plain stupid.

>The point of having to know what language a particular document is written
>in in order to manipulate it was *not* omitted; it was taken as an *axiom*.

Er? That's the lack of logical thinking -- never question *axioms*.

>With all due respect, the ISO-8859-5 is an international standard to
>which engineering outside of Russia is done for use in Russia. Barring
>another published standard for external use, this is probably what
>Russian users are going to be stuck with for code originating outside
>of Russia. I suggest that if this concerns you, you should have the
>"defacto standard" codified for use by external agencies.

KOI-8 became a standard in the early 70s and sure is codified (apparently not by ISO but by CCITT, because it is based on the "phonetical" system for sending Cyrillic messages over international telegraph, which predates computers).

>One wonders at the ECMA registration of a supposed "non-standard" by Russian
>nationals if the standard is not used in Russia.
One wonders about X.25, X.29, X.400 and a lot of other stuff which was standardized BEFORE it became a proven practical solution. Need i remind you that ASCII defeated EBCDIC in the market before it became an international standard?

As for ISO committees -- the USSR surely was represented by petty bureaucrats from the Ministry of Communications who are utterly incompetent. Somehow international bodies never got interested in looking for really competent people if they do not represent government or corporate interests. Unfortunately we can do nothing about it. The only remedy is to oppose attempts to standardize things which have gained no market acceptance.

>The process of "asking the user" is near 0 cost regardless of whether the
>implementation is some means of file attribution per language or some
>method of user attribution (ala proc struct, password file, or environment).

How do you think you're going to attribute a single language to the user avg? I somehow write in Russian and English and often use both in the same phrase.

>It becomes more complicated if you are attempting a multinational document;
>the point here is to enable localization with user supplied data sets
>rather than providing a tool for linguistic scholars or multilingual
>word processors. It is possible to do both of these things within the
>confines of Unicode, penalizing only the authors of the applications.

As i already said, Unicode/ISO 10646 are USELESS for localization -- we don't need one more standard where old standards work just fine.

>I believe that multinational documents will be the exception, not the
>rule.

Learn some foreign language and try to use it in real-life communications -- you'll find that mixed-language texts are far, far more common than you think. Russian, for example, does not have equivalents for "political correctness" or "mentally disadvantaged" simply because that plague avoided the Russian culture.

>I believe the goal is *NOT* multinationalization, but internationalization.
>In this context, internationalization refers not to the ability to provide
>perfect access to all supported languages (by way of glyph preference), but
>refers instead to an enabling technology to allow better operating system
>support for localization.

For "internationalization" as you understand it, Unicode is nothing more than extra bytes and a lot of pain in the ass.

>Multinational use is out of the question until modifications are made to
>the file system in terms of supporting multiple nation name spaces for the
>files.

You don't need changes in file systems to support multilingual texts. Having developed a bilingual Unix i can assure you of that.

>Localization in terms of multinationalization requires other considerations
>not directly possible, in particular, the concept of "well known file system
>objects" must be adjusted. Consider, if you will, the fact that such a
>localization of the existing UNIX infrastructure is currently impossible
>in this framework. I am thinking in particular renaming the /etc directory
>or the /etc/passwd file to localized non-English equivalents.

Renaming /etc or /etc/passwd in order to achieve "multinationalization" is plain stupid because these aren't human words but rather a kind of hieroglyphs. Somehow French, Russian and German programmers never had trouble understanding that IF THEN ELSE means conditional execution. It may not be so obvious to English speakers who grew up on COBOL, but computerish semantic units are easier to understand if they are not mixed with natural languages. I bet that schoolchildren in Russia have much less trouble understanding the difference between the usage of conditional clauses in natural and algorithmic languages (simply *because* they're different).

>The idea
>of multinationalization falls under its own weight. Consider a system used
>by users of several languages (ie: a multinational environment).
>Providing
>each user with their own language's view of file requires a minimum of the
>number of well known file names times the number of languages (bearing in
>mind that translation may effect name length) for directory information
>alone. Now consider that each of these users will want their names and
>passwords to be in their own language in a single password file.

It means only that you don't understand what multinationalization is. I myself developed a system which has been successfully used by people knowing English or Russian for a decade -- somehow it didn't collapse.

To make a system multinational you don't need to rename system utilities and files. What you need is to give users:

- the ability to operate on texts in their own language
- diagnostic messages printed in the user's language
- documentation in the user's language
- means for development of multinational applications (i.e. make development tools understand the user's language)
- support for filenames in the user's language
- a way to customize the user's environment (aka aliases)

Unix already has adequate customization mechanisms.

>Multinationalization is possible, but of questionable utility and merit
>in current computing systems. We need only worry about providing the
>mechanisms for concurrency of use for the translators.

Tell it to the secretary in DEMOS -- she somehow kept her whole stuff on a Unix machine for years without bothering to learn any English. What you think is impossible or of questionable utility has already been used by lots of people for many, many years.

>Consider now the goal of data-driven localization (a single translation
>for all system application programs and switching of language environments
>without application recompilation.

Ever heard about string files? It is as easy as that. You may also want to steal a look at Macintrash -- they had it from the very beginning.
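
For readers who have not met the idea, a string file can be sketched in Python as below. The `ID=message` file format, the message identifiers and the function names are all hypothetical, chosen only for this illustration; real systems use their own catalog formats. The sketch shows messages being selected by language at run time, with the program text itself unchanged.

```python
def parse_string_file(text):
    """Parse 'ID=message' lines; blank lines and '#' comments are skipped."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, msg = line.partition("=")
        catalog[key.strip()] = msg.strip()
    return catalog


# In practice these would be separate files shipped next to the program.
ENGLISH = """
# hypothetical string file, English
NO_FILE=File not found: {name}
DONE=Done
"""

RUSSIAN = """
# hypothetical string file, Russian
NO_FILE=Файл не найден: {name}
"""

CATALOGS = {"en": parse_string_file(ENGLISH), "ru": parse_string_file(RUSSIAN)}


def message(lang, key, **args):
    """Look up a message in the user's language, falling back to English."""
    template = CATALOGS.get(lang, {}).get(key) or CATALOGS["en"][key]
    return template.format(**args)


print(message("ru", "NO_FILE", name="a.txt"))
print(message("ru", "DONE"))  # not translated yet, so the English text is used
```

Adding a language means adding one more string file; nothing is recompiled.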
>Rather than rewriting all applications which use text as data (cf: the C
>compiler example), unification of the glyph sets makes more sense.

No logic here. Having several codes for a glyph does not hinder an application's ability to treat it as a glyph.

>The only goal I am espousing here is enabling for localization. For this
>task, Unicode is far from useless.

Ah. Checked your logic lately? Tell the Russian users they'll need two bytes per letter instead of one to be able to do what they already can do with KOI-8 (well, not exactly; they'll be unable to sort without telling which language they mean). The bi-lingual encodings with pairs of separate alphabets fitting in one byte (or fitting in two bytes for Chinese and Japanese) are quite sufficient for localization; and moreover there already are standards.

>Then "real users" can supply a codified alternative in short order or lump it.

Real users do not play bureaucratic games in international bodies, they simply pay money. Need i teach Americans the basics of capitalism?

>T: the lexical order argument is one of the sticking points against the
>T: Japanese acceptance of Unicode, and is a valid argument in that arena.
>T: The fact of the matter is that Unicode is not an information manipulation
>T: standard, but (for the purposes of its use in internationalization) a
>T: storage and an I/O standard. View this way, the lexical ordering argument
>T: is nonapplicable.
>V: It'd be sticking point about Slavic languages as well, you may be sure.
>V: Knowing ex-Soviet standard-making routine i think the official fishy-eyed
>V: representatives will silently vote pro to get some more time for raving
>V: in Western stores and nobody will use it since then. The "working" standards
>V: in Russia aren't made by committees.
>Then this will have to change, or the Russian users will pay the price.
>Those of us external to Russia are in no position to involve ourselves in
>this process.
>Any changes will have to originate in Russia.

Aha. I like your naivete. The fastest way to achieve the "changes" is to butcher every ex-communist official. Sure, Russian users will pay the price -- to companies which will offer them usable products which do not follow dead-born standards.

>Address information can not be hyphenated, at least in US and other Western
>mail of which I am personally aware. This is a non-issue. This is also
>something that is not the responsibility of the operating system or the
>storage mechanism therein... unless you are arguing that UFS knows to store
>documents without hyphenation, and that the "cat" and "more" programs will
>hyphenate for you. If you are talking about ANY OTHER APPLICATION, THE
>HYPHENATION IS THE APPLICATIONS RESPONSIBILITY. PERIOD.

I always thought that the hyphenation routine's place is in the SYSTEM library. It's a reusable piece of code. And there are a lot of places where strings need to be hyphenated (long diagnostics in windows of variable size) even by trivial programs. The behaviour of Unix utilities simply wrapping strings over the right edge is at best a kludge. I somehow hope that it'll change.

>T: Find another standard to tell you how to write a word processor.
>V: Is there any? :-)
>
>No, there isn't; that was the point. It is not the intent of the Unicode
>standard to provide a means of performing the operations normally
>associated with word processing. That is the job of the word processor, and
>is the reason people who write word processors are paid money by an employer
>rather than starving to death.

The second and the third sentences have no logical link. Sure, Unicode does not help with word processing.

>Again, font *changes* only become a problem if one attempts to print a
>*multinational* document. Since we aren't interested in multinationalization,
>it's unlikely that a Unicode font containing all Unicode glyphs will be
>used for that purpose.
Since Unicode does not provide any benefits for localization your statement is meaningless.

>In all likelihood, use will be in a localized environment *NOT* a
>multinational one. Since this is the case, it follows that the sum total of
>the Unicode font implemented in the US will be the ISO Latin-1 set.
>Similarly, if you are printing a Cyrillic document, you will be using a
>Cyrillic font; the "X" character you are concerned about will be *localized*
>to the Cyrillic "X", *NOT* the Latin "X".

Somehow i print documents in both Russian and English without bothering to switch fonts. If Unicode does not allow me to do that i will simply never switch to it (nor will many other users). I don't need to spend disk space for the benefit of being unable to do things i have been doing for years.

>V: I see no reasons why we should treat the regular expression matching
>V: as "fancy" feature.
>
>Because globbing characters are language dependent.
>The globbing ("regular expression pattern match") characters
>DO change for any patterns more complicated than "*".

Globbing is nothing more than sequential matching to a regular expression.

>V: Don't you think the ANY text is going to be fancy because Unicode
>V: as it is does not provide adequate means for the trivial operations?
>Perhaps any multinational text, yes; for normal text, processing will be
>done using the localized form, not the Unicode form; therefore the issue
>will never come up, unless the application requires embedded attributes
>(like a desktop publishing package. Since multinational processing is
>the exception rather than the rule, let the multinational users pay the
>price in terms of "Fancy text".

If processing is done using a localized (non-Unicode) form then why do we need Unicode at all? The assertion that multilingual processing requires register-switching (aka "fancy mode") is simply wrong.
>Again, multinational software is not being addressed; however, were we to
>address the issue, I suspect that it would, in all cases, be implementation
>dependent upon the multinational application.

Then forget about Unicode. It is not supposed to be a means of localization.

>I don't know how stridently I can express this: runic encoding destroys
>information (such as file size = character count) and makes file system
>processing re character substitution totally unacceptable...

Variable-length line encodings destroy information (such as file size = character count / record size) and make file system processing re word substitution totally unacceptable... (You've got to be an OS/370 zealot).

>consider the
>case of a substitution of a character requiring 3 bytes to encode for one
>that takes 1 byte (or 4 bytes) currently. Say further that it is the
>first character in a 2M file. You are talking about either shifting the
>contents of the entire file, or, MUCH WORSE, going to record oriented files
>for text. If there is defacto attribution of text vs. other files (shifting
>the data is unacceptable. period.), there is no reason to avoid making
>that attribution as meaningful as possible.

consider the case of a substitution of a word requiring 3 bytes for one that takes 1 byte (or 4 bytes) currently. Say further that it is the first character in a 2M file. You are talking about either shifting the contents of the entire file, or, MUCH WORSE, going to record oriented files for text. If there is defacto attribution of text vs. other files (shifting the data is unacceptable. period.), there is no reason to avoid making that attribution as meaningful as possible.

I wonder why you don't run around crying that Unix requires record-oriented files and defacto attribution of text vs other files.

You'd better think before posting anything to USENET.

V: Pretty soon it will be a dead standard because of the logical problems
V: in the design.
V: Voting is inadequate replacement for logic, you know.
V: I'd better stick to a good standard from Zambia than to the brain-dead
V: creature of ISO even if every petty bureaucrat voted for it.
>I agree; however, the people involved were slightly more knowledgeable about
>the subject than your average "petty bureaucrat".

If they understand the problem as well as you do i somehow doubt that.

>And there has not been
>a suggested alternative, only rantings of "not Unicode".

If someone comes to me declaring that he invented a perpetuum mobile he can also claim that i criticize him without providing an alternative. Moreover, it is not the case. Although i cannot replace ISO in blessing standards, i have already provided my ideas on how to create a standard which will not be dead before birth.

>V: I expressed my point of view (and proposed some kind of solution) in
>V: comp.std.internat, where the discussion should belong. I'd like you to
>V: see the problem not as an exercise in wrestling consensus from an
>V: international body but as a mathematical problem. From the logistical
>V: point of view the solution is simply incorrect and no standard committee
>V: can vote out that small fact. The fundamental assumption Unicode is
>V: based upon (i.e. one glyph - one code) makes the whole construction
>V: illogical and it, unfortunately, cannot be mended without serious
>V: redesign of the whole thing.
>
>Wrong, wrong, wrong.
>
>1) We are not discussing the embodiment of a standard, but the
>   applicability of existing standards to a particular problem.
>   Basically, we could care less about anything other than the
>   existing or draft standards and their suitability to the task
>   at hand, the international enabling of 386BSD.

Do you REALLY believe that if something is blessed by a committee it will work in real life? I thought this mindset died with communism.
>2) We are not interested in "arriving" at a new standard or defending
>   existing or draft standards, except as regards their suitability
>   to our goal of enabling.

*We*'re not interested in spending life on perpetuating old mistakes and doing useless work. I'd prefer you not to talk in the name of all labouring people; although i'm immune to that kind of speech and it does not impress me, i still find it repulsive.

>3) The proposal of new solutions (new standards) is neither useful
>   nor interesting, in light of our need being "now" and the adoption
>   of a new solution or standard being "at some future date".

If it is not interesting to you, you can simply say 'k'.

>4) Barring a suggestion of a more suitable standard, I and others
>   will begin coding to the Unicode standard.

Good luck. It is my belief that most people are unable to learn from the mistakes of others, only from their own. I also do not care about 386BSD for a number of reasons.

>5) Since we are discussing adoption of a standard for enabling of
>   localization of 386BSD, and are neither intent on a general defense
>   of any existing standard, nor the proposal of changes to an
>   existing standard or the embodiment of a new standard, this
>   discussion does *NOT* belong in comp.std.internat, since the
>   subscribers of comp.unix.bsd are infinitely more qualified to
>   determine which existing or draft standard they wish to use
>   without a discussion of multinationalization (something only
>   potentially useful to a limited audience, and then only at some
>   future date when multinational processing on 386BSD becomes a
>   generally desirable feature.

See before.

>V: Try to understand the argument about the redundance of encoding with
>V: external restrictions provided i used earlier in this letter. The
>V: Unicode committee really get caught in a logical trap and it's a pity
>V: few people realize that.
>
>I *understand* the argument; I simply *disagree* with its applicability
>to anything other than enabling multinationalization as opposed to
>enabling localization, which is the goal.

"I understand that 2+2=4; I simply *disagree* with its applicability
to anything other than drawing figures on a blackboard as opposed to
counting apples, which is the goal". I wonder if you ever wrote a
working program.

---------
From: Peter da Silva

>You have identified two problems with Unicode and ISO 10646: case conversion
>and lexical ordering.
>> See how Unicode renders itself useless?
>Unless you want to work on multilingual documents, yes. It could be better,
>certainly, but to say it's *useless* is hyperbole.

Yep, although it depends on where your letter goes. If it'll be fed to
something like SM/1 or grapeVINE, case conversion and sorting are
necessary. Now, let's be fair. "Unicode is the standard applicable to
writing letters only", right? Do *you* need such a standard?

---------
From: Dik T. Winter

>I move the discussion a bit: would we like sorting according to the text's
>language or the user's language?
>I do not know, but I do know that in many cases a Finn would like to handle
>it differently from a German, regardless of the language of the text involved!

Neat idea. I'd like to see it implemented, though i don't know what the
Finnish idea of sorting Chinese is. Even if you have a customized sorting
routine, a fall-back to a universal routine is necessary.

> So, even if sorting is not regular there always is a way around --
> with Unicode you can't do even that.
Eh? I would think the same way around!

Nope. I propose an encoding in which sorting is algorithmic (i agree,
arithmetic sorting is not sufficient). You can't do it with
Unicode/ISO 10646 because you have lost the necessary information.

>You are confused about a few things. Unicode will help your sorting
>Cyrillic names and Latin names just fine.
It does not help me with Ukrainian and Russian, and the fact that it has
a separate Cyrillic page is quite irrelevant.

>You are mixing sorting with different
>scripts and sorting different languages.

See the example before. BTW, Ukrainian has a letter i which is found in
the Latin code page.

>The latter makes no sense if
>you use for each word the place it should occupy in its native language
>(especially for loan-words).

If you're typing in loan-words you're typing them in *your* language
(and the character codes used belong to your language) and they will be
treated as you wished.

> > >It won't make sense. Lexical sorting makes only sense, if at all, in
> > >*one* single language.
> >
> > See before.
>
>See before, you give different scripts not different languages.

See before, you missed the point completely.

es> You've talked a lot about regular expressions etc. Frankly I
es> don't give a damn about those. The main bulk of computer users
es> are not programmers and don't know what a regular expression
es> is, so why focus such specific issues?
>I agree completely here. Shell globbing, regular expressions and
>sorting by 'ls' are not relevant.

Tell it to the secretary who knows that ls gives a sorted list of files
from her directory with a couple hundred names like "Letter1" and
"IRS_request" in it.

>(There are still systems around
>that do not sort the list of files at all. Can you say IBM VM/CS?)

I knew that you ought to be an MS-DOS zealot.

>Those things are indeed used by programmers only...

Obviously you don't earn your living by writing programs for real
users. You're underestimating people's intelligence; there *are*
non-programmers who know a hell of a lot of shell and use it.

--------
From: Lars Wirzenius

>If I understand your proposed solution correctly, it is the "each
>language has a completely different set of character codes" solution
>which I have disagreed with above.

You understood it incorrectly and therefore further discussion
makes no sense.
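
The Ukrainian/Russian sorting point made in the reply to Dik Winter can
be made concrete with a small sketch. It is only an illustration, not
the encoding proposed in this thread: the alphabet strings, `sort_key`
and the sample words are assumptions introduced here, and the code
points are those of present-day Unicode as exposed by Python. It shows
that plain code-point order is wrong for both languages, and that the
right order is only recoverable once the language is supplied from
outside the text.

```python
# Conventional dictionary order of each alphabet (precomposed letters).
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
UKRAINIAN = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def sort_key(alphabet):
    """Build a key function ranking letters by their place in `alphabet`.

    Characters outside the alphabet sort after all known letters,
    ordered among themselves by code point.
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)

    def key(word):
        return [(rank.get(ch, unknown), ord(ch)) for ch in word.lower()]

    return key


russian_words = ["яма", "ёж", "зима", "ель"]
ukrainian_words = ["яма", "іній", "зима"]

# Code-point order: ё (U+0451) and і (U+0456) land after я (U+044F).
print(sorted(russian_words))
print(sorted(ukrainian_words))

# Language-aware order: the language has to be named explicitly.
print(sorted(russian_words, key=sort_key(RUSSIAN)))
print(sorted(ukrainian_words, key=sort_key(UKRAINIAN)))
```

The key function has to be chosen by the caller; nothing in the strings
themselves says which alphabet applies.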
--------
From: Glenn A. Adams

>You're uninformed. Try looking at Xerox Viewpoint (West European,
>East European, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, et al);
>try looking at BB&N Slate (West & East European, Greek, Hebrew, Arabic,
>Korean, Thai). There are others: Nota Bene, Multilingual Scholar,
>AlKatiib, etc.

Thank you for enlightening me, but i wonder if you can tell the
difference between an entire operating system and a single application
-- localizing those are tasks differing in complexity by a couple of
orders of magnitude. So far, MNOS RL 1.2 and DEMOS 2 are THE ONLY
EXISTING truly bilingual systems.

>After reading this yet again, I now believe that this entire conversation
>may be based on a misunderstanding. Unicode does not unify Latin T,
>Cyrillic T, and Greek T! They are separate characters, as are Latin A,
>Cyrillic A, and Greek A. Nor does Unicode unify LATIN A WITH RING and
>ANGSTROM SYMBOL.

Which Unicode? 16-bit or 32-bit? It already has multiple codes for
similar glyphs but still does not allow algorithmic sorting or
case conversion. What's the point? To combine the worst of both
"one glyph - one code" and my proposal?

-------
From: Professor David J. Birnbaum

[the text omitted]

Thank you for understanding!

-------
From: Erland Sommarskog
Xref: uunet comp.std.internat:2106

>>Aw, don't be silly. It's trivial.
>When you can't explain write off the problem as trivial.

Nope. Simply add the conversion CH -> _CH_ to the text editor on writing
the output file and the reverse conversion on reading. Also, add the same
conversion at the copying from the raw list to the cooked list in the tty
driver. Isn't it trivial?

>>>What the hell has the number of bits to do with anything? Do computers
>>>exist for the programmers or the users?
>>
>>Look, you've missed the logic completely. Read it please again. I also
>>explained it several times in other postings.
>
>What logic? I want to be able to write and read text in European
>languages. Period.
You can do it right now without Unicode, period.

>Then how many bits you use is not my issue, as
>long as you give me something which I consider user-friendly. (Being
>forced to keep track whether a certain dotted "a" is German or
>Swedish is not.)
How many bits you use is completely irrelevant.
>(But since I want more than 256 symbols, you will have a pain if
>you stay with eight bits.)

It IS an issue when you're getting no benefits -- if you have to specify
the language outside of the text you can as well live with several
8-bit codepages. The only difference is that you don't need 2 bytes for
every letter.

>>What do you tell the poor user when he has a database with English
>>and Russian company names (a case from my practice, to be real) --
>>in both upper and lower case and the smart guys (apparently Erland's
>>pupils) made a terminal which converts cyrillic codes for the letters
>>of the same shape as latin to the latin codes? Go get a rope?
>
>Yes, that is precisely the confusion which is likely to happen when
>you assign the same character different codes depending on language,
>and when the application program is not smart enough to equate them.

You missed the example again completely. The problem was with the lot
of letters like T which AREN'T the same in Russian and English when
converted to lowercase.

>>The basic ASCII principles (after reordering and replacing several
>>characters) remained the same -- there is a way to convert upper<->lower
>>case and there is a way to sort without asking which language every word
>>came from (it's known a priori).
>
>Nope. Not with German.

ASCII is not a German standard, nor was it designed for German.

>The problem with your idea is that you believe that everything is
>known at input time. It isn't.

If the originator of the information does not know which language he
used, who will know it? In many cases this information is available
ONLY at input.

>You've talked a lot about regular expressions etc.
>Frankly I don't give a damn about those. The main bulk of computer users
>are not programmers and don't know what a regular expression
>is, so why focus such specific issues?

I got to like that attitude. "I don't give a damn"... The users don't
give a damn (and a dime, btw) for your programs which are unable to do
things they now take for granted.

-------
From: Kosta Kostis

>> The one which does not ask user which language he means every time
>> he runs more -i.
>
>Nice. Your local language will be implied somehow and you can use it as
>a default. What's your problem?

Nice. I have a file in both English and Russian. What is my local
language?

>> Unfortunately there is a logical gap. I don't care WHICH algorithm
>> is used as long as it is an ALGORITHM. There is no way to convert Unicode
>> strings uppercase without "external" information.
>
>There is no way to convert non-US ASCII strings without "external"
>information.

Non-US ASCII is an oxymoron.

>Simple "solutions" may work for "Russian and English"
>or for "Greek and English", where you imply the language, but there's
>no *general* solution. You don't seem to understand that.

Who said that there is NO solution for N languages where N is
arbitrary? You don't seem to understand yourself.

>Nice for you you're bilingual, but there are companies and the like
>that need support for much more than two languages and their
>"common alphabet" won't fit in 8-bit, 9-bit or 10-bit.

That's ok, but why should i sacrifice things i now take for granted
(like working sorting which does not ask me what i mean)?

>> Already discussed. I sure don't know everything but i know that
>> you can make a minimal strictly ordered set from a unification of
>> strictly ordered sets by merging similar elements.
>
>You think you can do so. Implement it and try to sell it. Good luck.

I'm in a different business. When i did the internationalization
project i did it quite successfully.
You know, a 22-year-old hippie programmer does not usually get a gold
medal from the government for nothing.

>I can see 202 cyrillic characters (including diacritic marks) in
>UniCode Version 1.0 - that's better than ISO 8859-5 (96 characters).
>Does KOI-8 cover more than 202 cyrillic characters?

Nope. It was designed for English and Russian ONLY and it does its work
well. The other Cyrillic languages have their own code pages.

>You have your way of sorting names, others have other ways of sorting.
>Foreign names are written in German with German letters in Germany.

Then they are German words, and what is the fuss about?

>UniCode is not a Russian-English encoding. It's a multilingual
>encoding with advantages and drawbacks.

Thank you for reminding me, i do not have amnesia. I do not care about
"multilingual" encodings which do not allow me to do things i can do
with the bilingual one. If you don't understand the difference between
an example and a proposal you'd better keep out of USENET or buy an
asbestos suit.

>There are partial solutions for local problems, fine, but that's
>it and that's what will be. No universal character set will ever
>solve that [period]. (Now hear my shoe on the table ... ;-) )

First get that shoe out of your mouth.

--------
From: Johnny Eriksson

>Why not tell everyone about KOI-8 and the other cyrillic coding methods,
>code tables, design principles, whatever. There may be more than one
>of us that are interested in that information.

KOI-8 is nothing more than an 8-bit ASCII extension with Cyrillic letters
in codes from 0300 to 0377. It has completely separate alphabets for
Russian and English even though there are a number of similar letters.
I used that code only as an illustration of a model bilingual encoding.

There were Unicode-style encodings (GOST and DKOI) which followed the
"one code -- one glyph" principle. Those encodings are dead by now.

Flame mode off.

--vadim
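
As an illustration of the KOI-8 description above, here is a minimal
sketch of case conversion that needs no language tag. It assumes the
KOI8-R layout shipped with Python as the `koi8_r` codec, which may
differ in detail from the KOI-8 variant meant in the post, and the
function name `koi8_upper` is hypothetical. Because the Latin and
Cyrillic alphabets occupy separate code ranges, a byte-wise rule is
enough for mixed English/Russian text.

```python
def koi8_upper(data: bytes) -> bytes:
    """Uppercase mixed English/Russian KOI8-R bytes, byte by byte."""
    out = bytearray()
    for b in data:
        if 0x61 <= b <= 0x7A:      # ASCII a-z: uppercase is 0x20 lower
            b -= 0x20
        elif 0xC0 <= b <= 0xDF:    # lowercase Cyrillic, octal 0300-0337
            b += 0x20              # uppercase Cyrillic, octal 0340-0377
        elif b == 0xA3:            # KOI8-R keeps the letter yo outside the block
            b = 0xB3
        out.append(b)
    return bytes(out)


text = "Letter письмо"
print(koi8_upper(text.encode("koi8_r")).decode("koi8_r"))  # -> LETTER ПИСЬМО
```

Latin T and Cyrillic Т have different codes here, so each one maps to
its own lowercase form without the program asking which language a word
came from.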