- Path: sparky!uunet!not-for-mail
- From: avg@rodan.UU.NET (Vadim Antonov)
- Newsgroups: comp.std.internat
- Subject: Re: Language tagging
- Date: 5 Jan 1993 20:42:38 -0500
- Organization: UUNET Technologies Inc, Falls Church, VA
- Lines: 1011
- Message-ID: <1iddeeINN58g@rodan.UU.NET>
- References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu>
- NNTP-Posting-Host: rodan.uu.net
-
-
- Excuse me for the delay -- i've got a lot of things i have to do
- besides playing with flammable materials in USENET.
-
- Well, i'll try to outline the rationale behind my proposal.
-
- First, let's define terms.
-
- Localization is when a system is supposed to work with only one
- language (the user's). In practice, bilingualism is required because
- most programming languages and command-line interfaces heavily use
- English mnemonics.
-
- A localized system's components know a priori the language of the texts
- they work with (including sorting/case-conversion rules and, more
- importantly, the alphabet). Since a (supposedly small) alphabet is
- used, there is no need to use multibyte characters (at least for
- European and Slavic languages) and there are already national
- standards in place. For Greek, Hebrew or Russian users Unicode/ISO 10646
- offers nothing except wasted disk space.
-
- Therefore, i'm discarding all claims that Unicode/ISO10646 is intended
- to provide means for localization ONLY. It is designed for
- internationalization, i.e. for supporting sufficiently multilingual
- environments; and this function is the only justification for
- its introduction.
-
- Now, i think everybody agrees that the "ultimate encoding" is
- the one which provides complete information about which
- language is used -- it solves all the problems.
-
- Such an encoding can be implemented with:
-
- 1) register switching with (say) escape sequences.
- This is highly impractical; moreover, it is impossible
- to determine the language when reading starts at some point
- in the middle of a file -- a situation which is especially
- troubling given Unix file-pointer sharing.
-
- 2) every character code is a pair (language-code, letter-number-in-alphabet).
- This is hardly practical because of storage considerations:
- encoding the language requires at least 10-12 bits, i.e. every
- letter turns into a sequence of at least 3 bytes.
-
- Now, we can sacrifice some information to achieve better compression
- by assigning similar codes to similar LETTERS (note that i did not
- use the word GLYPH).
-
- First we need to determine which information CAN be dropped without
- making the encoding impractical.
-
- Obviously, the visual representation (i.e. the set of glyphs) should
- be preserved -- it sets the lower boundary on the compression.
-
- Let's analyse the other properties of the ultimate encoding:
-
- 1) it explicitly distinguishes between the natural languages.
-
- This information cannot be used for processing the texts anyway,
- because machines do not *understand* languages and therefore
- do not care about the contents. The language indication can
- be useful for selecting a dictionary; but the words themselves
- are already good indicators (with few exceptions) of the language
- involved. Moreover, spell-checking is usually done at
- input, where the language in question is known.
-
- I think this property can be easily dropped.
-
- 2) there is an *algorithm* for converting from upper case to lower case
- and vice versa.
-
- This property is extremely useful for everything involving
- searches in texts and databases. Case-insensitive comparison
- is simply unavoidable because of the rules of capitalization
- in sentences (i.e. you can't easily tell whether a word at the
- beginning of a sentence is capitalized because it is a name or
- because of the general rule). Database applications often require
- preserving strings exactly as they were entered while searching for
- words with undefined capitalization.
-
- If an encoding does not provide this property it is nearly
- useless for most library/groupware and business applications.
-
- Therefore this property MUST be preserved.
-
- (I have to remark here that neither Unicode nor ISO 10646 provides
- a way to perform case-insensitive comparison without specifying
- the exact language of the string -- and if the language is kept
- OUTSIDE the text, the text itself can be encoded within the
- alphabet of that particular language much more efficiently.)
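-
- As an illustration only (the sketch below is mine, not part of any
- standard; the code ranges and names in it are invented): if an encoding
- lays each alphabet group out so that a lowercase letter sits at a fixed
- offset from its uppercase partner -- the way ASCII does for A..Z --
- then case conversion is a tag-free algorithm, something like:
-
-     /* hypothetical code layout; only the idea matters */
-     #include <stddef.h>
-
-     typedef unsigned short ucode;      /* a code point of the proposed encoding */
-
-     struct case_range {
-         ucode upper_first, upper_last; /* contiguous run of uppercase letters   */
-         int   offset;                  /* distance to the lowercase partners    */
-     };
-
-     static const struct case_range ranges[] = {
-         { 0x0041, 0x005A, 0x20 },      /* Latin A..Z -> a..z (as in ASCII)      */
-         { 0x0100, 0x011F, 0x20 },      /* an assumed Cyrillic group             */
-         { 0x0140, 0x0158, 0x20 },      /* an assumed Greek group                */
-     };
-
-     ucode to_lower(ucode c)
-     {
-         size_t i;
-         for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
-             if (c >= ranges[i].upper_first && c <= ranges[i].upper_last)
-                 return (ucode)(c + ranges[i].offset);
-         return c;                      /* caseless codes map to themselves      */
-     }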
-
- 3) there is an *algorithm* for lexicographical sorting (and for
- that matter for range comparisons).
-
- This is necessary if the now-ubiquitous semantics of regular
- expressions, shell globbing and sorted directories (and, of
- course, neat sorted reports) are to be preserved.
-
- Several people argued that this functionality is not actually
- necessary because users will want to see strings sorted
- according to their native language's rules. While that is
- a neat idea, it still does not eliminate the need for a
- generic algorithm used as a fallback for "customized" sorting
- algorithms -- what, for example, would it mean to sort Chinese
- by the rules of Finnish?
-
- Moreover, the use of a "universal" algorithm is unavoidable
- when the consumers of pre-sorted information do not share
- a cultural background.
-
- Therefore this property must be preserved as well.
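-
- To make the idea concrete (a sketch of mine, not a normative algorithm;
- the exception table below is left empty on purpose): if the codes are
- kept roughly in alphabetical order, as proposed further down, the
- "universal" comparison degenerates into comparing per-letter weights,
- where the weight is the code itself except for a short list of
- irregular letters:
-
-     #include <stddef.h>
-
-     typedef unsigned short ucode;
-
-     struct weight_fix { ucode code; unsigned weight; };
-
-     /* the handful of historical irregularities would be listed here */
-     static const struct weight_fix fixes[] = { { 0, 0 } };
-
-     unsigned weight(ucode c)
-     {
-         size_t i;
-         for (i = 0; fixes[i].code != 0; i++)
-             if (fixes[i].code == c)
-                 return fixes[i].weight;
-         return c;              /* common case: code order == dictionary order */
-     }
-
-     /* compare two zero-terminated strings in the proposed encoding */
-     int ucmp(const ucode *a, const ucode *b)
-     {
-         while (*a && *b && weight(*a) == weight(*b)) { a++; b++; }
-         return (int)weight(*a) - (int)weight(*b);
-     }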
-
- 4) indication of the language is useful for hyphenation.
-
- There is no such thing as an ideal hyphenation algorithm --
- hyphenation algorithms are nothing more than heuristics
- which yield more or less reasonable results.
-
- While high-quality hyphenation algorithms require
- dictionaries and make heavy use of language specifics,
- reasonable results (sufficient for use in, say, columns
- of tables or as a base for subsequent manual rearrangement)
- can be achieved by relatively generic algorithms based on
- knowledge of the roles specific letters play (i.e. vowels and
- modifiers like the hard/soft signs in Russian).
-
- Dictionary-based hyphenation algorithms do not need the
- language specification because the words themselves indicate
- the language; if the word is not in the dictionary the
- generic rules should be applied.
-
- Therefore it is desirable (though not necessary) that
- the letter classes (i.e. which ones are vowels) be preserved.
-
- Fortunately, if a letter is shared between languages it is
- nearly certain that it is (or is not) a vowel in both languages.
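-
- A possible shape for that letter-class information (purely illustrative;
- the classes and codes below are placeholders, not a proposal):
-
-     typedef unsigned short ucode;
-
-     enum letter_class { LC_OTHER, LC_VOWEL, LC_CONSONANT, LC_SIGN };
-
-     /* a real table would be generated from the code charts */
-     enum letter_class letter_class(ucode c)
-     {
-         switch (c) {
-         case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
-             return LC_VOWEL;           /* vowels shared by Latin alphabets     */
-         /* ... Cyrillic vowels, hard/soft signs etc. would follow ... */
-         default:
-             return (c >= 'a' && c <= 'z') ? LC_CONSONANT : LC_OTHER;
-         }
-     }
-
- A generic hyphenator then needs only this classification, never the
- language itself.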
-
- 5) typographic-quality fonts often differentiate between
- "similar" glyphs that are rendered identically on lower-resolution devices.
-
- However, it is possible to create reasonable-looking fonts
- shared between several languages if the basic graphic set
- of the languages is common (say, Latin for European languages,
- Cyrillic for most Slavic languages). Actually, that is the way
- font designers work in multilingual environments.
-
- The proposed encoding methodology (see later) "automatically"
- separates graphic sets, so the ability to output typography-
- quality text is essentially "free" in the proposed encoding.
-
-
- The way the ultimate encoding is compressed into the proposed
- encoding is quite simple:
-
- The alphabets are separated into several groups; within every group,
- conflicts between incompatible orderings are resolved by duplicating
- codes for the conflicting letters. If there are letters which do
- not have upper/lowercase counterparts (like the German eszett), new
- codes should be reserved to distinguish the corresponding printable
- signs as "special" (i.e. the special German S), or new glyphs should
- be added (SS). If there is ambiguity in converting case (like the
- Cyrillic/Greek/Latin T), the glyph should be assigned several codes
- to eliminate the ambiguity. If a glyph can represent differently-behaving
- letters in different languages (like "o, or "ll" and "ch" in Spanish), it
- should be assigned several codes to eliminate the ambiguity.
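-
- One way to picture an entry of such a code table (the field names and
- layout are my own invention for illustration; the scheme above does not
- mandate any particular structure) -- the point is that a single shared
- glyph may appear in several rows, once per alphabet whose ordering or
- case rules differ:
-
-     struct code_entry {
-         unsigned short code;      /* code point in the proposed encoding        */
-         unsigned short glyph;     /* index into the shared glyph (font) set     */
-         unsigned char  group;     /* alphabet group: Latin, Cyrillic, Greek ... */
-         unsigned short case_pair; /* code of the other-case partner, 0 if none  */
-         unsigned char  is_vowel;  /* letter class used by generic hyphenation   */
-     };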
-
- The "canonical" sorting order is the _dictionary_ one (dictionaries
- are usually more logical in their sorting conventions).
-
- The criteria for separating alphabets into groups are simple as well --
- one simply compares the number of new codes introduced
- by adding a new group with the number of new codes assigned by
- attempting to fit the language into existing groups (well, it's not
- that simple, but the problem is still well-formalized and can be
- solved with known mathematical methods without resorting to
- truth-seeking by voting).
-
- It makes sense to keep the codes in an order similar to the order of
- letters in the alphabet, to eliminate the need for large tables in
- the universal sorting algorithm. The canonical universal sorting algorithm
- should be included in the standard (contrary to popular opinion
- it would not be THAT large -- i think everything can fit in
- less than 50 lines, because sorting irregularities are specific to
- European languages for historical reasons [namely, the idea of a dictionary
- with sorted entries is European -- and the irregular rules were
- accepted before the importance of simple lexicographical sorting was
- realized -- see Knuth vol. 3]; the sorting rules for ancient languages
- were introduced _later_, as were the rules for languages which had no
- established writing at the time (most Cyrillic-based languages, for example,
- had no written form before the nations were occupied by chauvinistic
- Russian imperialists who obviously did not care about the culture
- of ethnic minorities and oppressed them by inventing and introducing
- writing so as to be able to give orders in the natives' languages -- that's
- for soc.culture.soviet readers).
-
- The "customized" sorting algorithms (sorting everything Latin-based
- according to, say, Finnish rules) in most cases can be reduced to
- the universal sorting by trivial transliteration which makes possible
- introduction of table-driven generic customized sorting routines.
-
- Apparently it makes sense to violate the lexicographical letter
- ordering for Latin-based languages to make the proposed encoding
- upward-compatible with ASCII. The price is nothing more than
- one additional transliteration table in the comparison/sorting algorithm.
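-
- Roughly (my sketch; translit[] is a hypothetical 64K-entry per-locale
- table, and weight() is the universal weight from the earlier sketch):
- a customized comparison is just the universal one applied after mapping
- every code through the locale's transliteration table:
-
-     typedef unsigned short ucode;
-
-     unsigned weight(ucode c);    /* universal weight, sketched earlier */
-
-     /* translit[0] must be 0 so the terminator maps to itself */
-     int custom_cmp(const ucode *a, const ucode *b, const ucode *translit)
-     {
-         for (;; a++, b++) {
-             unsigned wa = weight(translit[*a]);
-             unsigned wb = weight(translit[*b]);
-             if (wa != wb) return (int)wa - (int)wb;
-             if (*a == 0)  return 0;
-         }
-     }
-
- The ASCII-compatibility table mentioned above would be just one more
- such translit[] array.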
-
- Since the proposed encoding preserves properties 2 & 3 of the
- ultimate encoding, it is POSSIBLE to create truly internationalized
- applications without getting into local specifics (i.e. you can
- sort a German file on a machine in Japan without any changes to the
- environment). No additional off-text information is required, which
- allows the existing interfaces (regular expressions, shell globbing,
- mailing systems, etc.) to be preserved and minimizes changes
- to the existing software.
-
- To ease the hyphenation problem it may make sense to add a zero-width
- "hyphenate hint" symbol (aka \% in troff). This symbol can be added
- manually or by a language-specific hyphenation algorithm at
- input (it might as well be part of the spell-checker; in that role
- it is essentially "free"). The idea is to leave ALL language
- specifics at the point of input, where the language is supposedly
- known. If a "generic" hyphenation algorithm is included in the standard,
- it may be possible to omit hyphenation hints in words which are
- split correctly by the standard algorithm.
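-
- The only constraint such a hint code puts on the rest of the system is
- that everything except line breaking must skip it; for example (the
- code value is invented and the name UC_HYPHEN_HINT is mine):
-
-     typedef unsigned short ucode;
-     #define UC_HYPHEN_HINT 0x0001        /* hypothetical code point */
-
-     /* copy src to dst, dropping hyphenation hints; dst must be big enough */
-     void strip_hints(ucode *dst, const ucode *src)
-     {
-         do {
-             if (*src != UC_HYPHEN_HINT)
-                 *dst++ = *src;
-         } while (*src++);
-     }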
-
- The main inconvenience introduced by the proposed encoding is
- the possibility of confusion between the several codes for a glyph.
- However, since practically all input is done in a single language
- or two, it is not really a problem. Some people will object
- to explicit switching between languages; however, if they LIKE
- to include a foreign word in a letter as if it were native, it is supposed
- to be treated as native. Introducing spell-checking at input
- and emphasising the selected languages by means of color, brightness
- or different font styles practically eliminates the problem.
-
- ISO 10646 (unlike Unicode as in Plan 9) already provides several
- codes for identical glyphs in places where it is obviously
- absurd to do otherwise (as with the T in Cyrillic, Greek and Latin);
- but as the old saying goes: "you can't be 30% pregnant". This
- does not solve all the sorting and case-conversion problems (unlike the
- systematic approach i outlined before) but introduces all the
- inconveniences of having ambiguous glyph encodings.
-
- Now, flame mode on (type 'n' if you think the following material
- can be offensive, tasteless or plain stupid):
-
- -------
- From: Terry Lambert
-
- >Other word processing operations (such as dictionary and thesaurus use
- >within the program) require knowledge of which language to use.
-
- Phew! If a word is IN the dictionary it is already known which language
- it is from. Thesaurus-collecting algorithms generally do not care
- about language at all.
-
- >The idea of sort order should be (and is in Unicode) divorced from the
- >idea of information storage. The fact that one will have text files,
- >data files, and text files which act as data files on the same machine
- >*requires* some type of promiscuous [out of data band] method of
- >determining the format of the data within a file.
-
- NEVER say "impossible". As i just argued above, it IS possible,
- and no out-of-band information is required for 99% of applications.
-
- >This method, whether
- >it be language tagging of the files in the inode, or language tagging
- >of the user during the login process is imperitive. To do otherwise
- >means that your localization data coexists with system data rather
- >than system data being localized as well.
-
- Has it ever occurred to you that texts may contain arbitrary numbers
- of languages? No external (out-of-text) tagging is sufficient
- in the general case.
-
- >The operations you wish to perform are the province of applications running
- >on the system, not the system itself. Regardless of whether this is done
- >by an application programmer (as a per application localization) or by the
- >creator of a library used by applications (as part of developement system
- >localization), THE CODE BELONGS IN USER SPACE.
-
- Nobody (surely not me) ever said anything about putting everything
- in the kernel. However, moving multilingual support from SYSTEM
- software to APPLICATION software is the best way to kill the idea.
- Somehow you forgot that what is not done once in the system will be
- done in a zillion incompatible ways in user programs. Take MS-DOS
- as a living example.
-
- >A point of contention: a logician is, by disipline, a philosopher, not a
- >mathematician. Being the latter does not qualify one as the former.
-
- If that IS the way logic is taught here, i feel pity for
- American education. Philosophy is a matter of speculation;
- logic is a precise science, and attempts to make "logical"
- decisions based on philosophical word games are plain stupid.
-
- >The point of having to know what language a particular document is written
- >in in order to manipulate it was *not* omitted; it was taken as an *axiom*.
-
- Er? That's a lack of logical thinking -- never question *axioms*.
-
- >With all due respect, the ISO-8859-5 is an international standard to
- >which engineering outside of Russia is done for use in Russia. Barring
- >another published standard for external use, this is probably what
- >Russian users are going to be stuck with for code originating outside
- >of Russia. I suggest that if this concerns you, you should have the
- >"defacto standard" codified for use by external agencies.
-
- KOI-8 became a standard in the early 70s and certainly is codified
- (apparently not by ISO but by CCITT, because it is based on the "phonetic"
- system for sending Cyrillic messages over the international telegraph,
- which predates computers).
-
- >One wonders at the ECMA registration of a supposed "non-standard" by Russian
- >nationals if the standard is not used in Russia.
-
- One wonders about X.25, X.29, X.400 and a lot of other stuff which
- was standardized BEFORE it became a proven practical solution.
- Need i remind you that ASCII defeated EBCDIC in the market before
- it became an international standard?
-
- As for ISO committees -- the USSR was surely represented by petty
- bureaucrats from the Ministry of Communications who are utterly
- incompetent. Somehow international bodies never get interested
- in looking for really competent people if those people do not represent
- government or corporate interests. Unfortunately we can do nothing
- about it. The only remedy is to oppose attempts to standardize
- things which have gained no market acceptance.
-
- >The process of "asking the user" is near 0 cost regardless of whether the
- >implementation is some means of file attribution per language or some
- >method of user attribution (ala proc struct, password file, or environment).
-
- How do you think you're going to attribute a single language to the user avg?
- I somehow write in both Russian and English and often use both in
- the same phrase.
-
- >It becomes more complicate if you are attemting a multinational document;
- >the point here is to enable localization with user supplied data seta
- >rather than providing a tool for linguistic scholars or multilingual
- >word processors. It is possible to do both of these things within the
- >confines of Unicode, penalizing only the authors of the applications.
-
- As i already said, Unicode/ISO 10646 is USELESS for localization
- -- we don't need one more standard where the old standards work
- just fine.
-
- >I believe that multinational documented will be the exception, not the
- >rule.
-
- Learn some foreign language and try to use it in real-life
- communication -- you'll find that mixed-language texts are
- far, far more common than you think. Russian, for example,
- does not have equivalents for "political correctness" or
- "mentally disadvantaged" simply because that plague passed
- Russian culture by.
-
- >I believe the goal is *NOT* multinationalization, but internationalization.
- >In this context, internationalization refers not to the ability to provide
- >perfect access to all supported languages (by way of glyph preference), but
- >refers instead to an enabling technology to allow better operating system
- >support for localization.
-
- For "internationalization" as you understand it Unicode is
- nothing more than extra bytes and lot of pain in the ass.
-
- >Multinational use is out of the question until modifications are made to
- >the file system in terms of supporting multiple nation name spaces for the
- >files.
-
- You don't need changes in file systems to support multilingual texts.
- Having developed a bilingual Unix, i can assure you of that.
-
- >Localization in terms of multinationalization requires other considerations
- >not directly possible, in particular, the concept of "well known file system
- >objects" must be adjusted. Consider, if you will, the fact that such a
- >localization of the existing UNIX infrastructure is currently impossible
- >in this framework. I am thinking in particular renaming the /etc directory
- >or the /etc/passwd file to localized non-English equivalents.
-
- Renaming /etc or /etc/passwd in order to achieve "multinationalization"
- is plain stupid, because these aren't human words but rather a kind
- of hieroglyph. Somehow French, Russian and German programmers never
- had trouble understanding that IF THEN ELSE means conditional
- execution. It may not be so obvious to English speakers who
- grew up on COBOL, but computerish semantic units are easier to
- understand if they are not mixed with the natural languages.
- I bet that schoolchildren in Russia have much less trouble
- understanding the difference between the usage of conditional
- clauses in natural and algorithmic languages (simply *because*
- they're different).
-
- >The idea
- >of multinationalization falls under its own weight. Considr a system used
- >by users of several languages (ie: a multinational environment). Providing
- >each use with their own language's view of file requires a minimum of the
- >number of well known file names times the number of languages (bearing in
- >mind that translation may effect name length) for directory information
- >alone. Now consider that each of these users will want their names and
- >passwords to be in their own language in a single password file.
-
- It means only that you don't understand what multinationalization is.
- I myself developed a system which has been successfully used by
- people knowing English or Russian for a decade -- somehow it
- didn't collapse.
-
- To make a system multinational you don't need to rename system
- utilities and files. What you need is to give users:
-
- - the ability to operate on texts in their own language
- - diagnostic messages in the user's language
- - documentation in the user's language
- - means for developing multinational applications
- (i.e. development tools that understand the user's language)
- - filenames in the user's language
- - a way to customize the user's environment (aka aliases);
- Unix already has adequate customization mechanisms.
-
- >Multinationalization is possible, but of questionalble utility and merit
- >in current computing systems. We ned only worry about providing the
- >mechanisms for concurrency of use for the translators.
-
- Tell it to the secretary at DEMOS -- she somehow kept all
- her work on a Unix machine for years without bothering
- to learn any English. What you think is impossible or of
- questionable utility has already been used by lots of people for
- many, many years.
-
- >Consider now the goal of data-driven localization (a single translation
- >for all system application programs and switching of language environments
- >without application recompilation.
-
- Ever heard of string files? It is as easy as that.
- You may also want to steal a look at Macintrash -- they have had
- it from the very beginning.
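-
- For readers who haven't: the mechanism is nothing more than a lookup of
- a message by key in a per-language text file, so switching languages
- needs no recompilation. A minimal sketch (the file format and the
- catalog path are invented for the example):
-
-     #include <stdio.h>
-     #include <string.h>
-
-     /* look up 'key' in a file of "key<TAB>message" lines,
-        e.g. /usr/lib/msg/ru/cc.msg; fall back to the key itself */
-     const char *message(const char *catalog, const char *key)
-     {
-         static char buf[512];
-         FILE *f = fopen(catalog, "r");
-         if (f == NULL)
-             return key;
-         while (fgets(buf, sizeof buf, f)) {
-             char *tab = strchr(buf, '\t');
-             if (tab && (size_t)(tab - buf) == strlen(key)
-                     && strncmp(buf, key, (size_t)(tab - buf)) == 0) {
-                 fclose(f);
-                 buf[strcspn(buf, "\n")] = '\0';
-                 return tab + 1;
-             }
-         }
-         fclose(f);
-         return key;
-     }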
-
- >Rather than rewriting all applications which use text as data (cf: the C
- >compiler example), unification of the glyph sets makes more sense.
-
- No logic here. Having several codes for a glyph does not
- hinder an application's ability to treat it as a glyph.
-
- >The only goal I am esposing here is enabling for localization. For this
- >task, Unicode is far from useless.
-
- Ah. Checked your logic lately? Tell the Russian users they'll
- need two bytes per letter instead of one to be able to do
- what they can already do with KOI-8 (well, not exactly; they'll
- be unable to sort without saying which language they mean).
- Bilingual encodings with pairs of separate alphabets
- fitting in one byte (or in two bytes for Chinese and
- Japanese) are quite sufficient for localization; and moreover,
- the standards already exist.
-
- >Then "real users" can supply a codified alternative in short order or lump it.
-
- Real users do not play bureaucratic games in international bodies;
- they simply pay money. Need i teach Americans the basics of capitalism?
-
- >T: the lexical order argument is one of the sticking points against the
- >T: Japanese acceptance of Unicode, and is a valid argument in that arena.
- >T: The fact of the matter is that Unicode is not an information manipulation
- >T: standard, but (for the purposes of it's use in internationalization) a
- >T: storage and an I/O standard. View this way, the lexical ordering argument
- >T: is nonapplicable.
-
- >V: It'd be sticking point about Slavic languages as well, you may be sure.
- >V: Knowing ex-Soviet standard-making routine i think the official fishy-eyed
- >V: representatives will silentlly vote pro to get some more time for raving
- >V: in Western stores and nobody will use it since then. The "working" standards
- >V: in Russia aren't made by commitees.
-
- >Then this will have to change, or the Russian users will pay the price.
- >Those of us external to Russia are in no position to involve ourselves in
- >this process. Any changes will have to originate in Russia.
-
- Aha. I like your naivete. The fastest way to achieve the "changes"
- is to butcher every ex-communist official. Sure, Russian users
- will pay the price -- to the companies which offer them usable
- products that do not follow stillborn standards.
-
- >Address information can not be hyphenated, at least in US and other Western
- >mail of which I am personally aware. This is a non-issue. This is also
- >something that is not the responsibility of the operating system or the
- >storage mechanism therein... unless you are arguing that UFS knows to store
- >documents without hyphenation, and that the "cat" and "more" programs will
- >hyphenate for you. If you are talking about ANY OTHER APPICATION, THE
- >HYPHENATION IS THE APPLICATIONS RESPONSIBILITY. PERIOD.
-
- I always thought that the hyphenation routine's place is in the
- SYSTEM library. It's a reusable piece of code. And there are
- a lot of places where strings need to be hyphenated (long diagnostics
- in windows of variable size) even by trivial programs.
- The behaviour of Unix utilities that simply wrap strings over
- the right edge is at best a kludge. I somehow hope
- that it'll change.
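-
- The kind of reusable routine i mean -- given a string (in the
- hypothetical encoding sketched earlier, with zero-width hyphenation
- hints inserted at input), pick a break position for a window 'width'
- columns wide. This is only an illustration of the interface, not a
- proposal:
-
-     typedef unsigned short ucode;
-     #define UC_HYPHEN_HINT 0x0001            /* hypothetical code point */
-
-     /* return the index to break at, or -1 if the string fits */
-     long pick_break(const ucode *s, int width)
-     {
-         long last_hint = -1;
-         long i;
-         int  col = 0;
-
-         for (i = 0; s[i]; i++) {
-             if (s[i] == UC_HYPHEN_HINT) {
-                 last_hint = i;               /* zero width: no column used */
-                 continue;
-             }
-             if (++col > width)
-                 return last_hint >= 0 ? last_hint : i;
-         }
-         return -1;
-     }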
-
- >T: Find another standard to tell you how to write a word processor.
- >V: Is there any? :-)
- >
- >No, there isn't; that was the point. It is not the intent of the Unicode
- >standard to provide a means of performing the operations normally
- >associated with word processing. That is the job of the word processor, and
- >is the reason people who write word processors are paid money by an employer
- >rather than starving to death.
-
- The second and the third sentences have no logical link.
- Sure, Unicode does not help with word processing.
-
- >Again, font *changes* only become a problem if one attempts to print a
- >*multinational* document. Since we aren't interested in multinationalization,
- >it's unlikely that a Unicode font containing all Unicode glyphs will be
- >used for that purpose.
-
- Since Unicode does not provide any benefits for localization
- your statement is meaningless.
-
- >In all likelyhood, use will be in a localized environmentn *NOT* a
- >multinational one. Since this is the case, it follows that the sum total of
- >the Unicode font implemented in the US will be the ISO Latin-1 set.
- >Similarly, if you are printing a Cyrillic document, you will be using a
- >Cyrillic font; the "X" character you are concerned about will be *localized*
- >to the Cyrillic "X", *NOT* the Latin "X".
-
- Somehow i am used to printing documents in both Russian and English
- without bothering to switch fonts. If Unicode does not allow me to do
- that, i will simply never switch to it (and neither will many other users).
- I don't need to spend disk space for the privilege of being unable
- to do things i have done for years.
-
- >V: I see no reasons why we should treat the regular expression matching
- >V: as "fancy" feature.
- >
- >Because globbing characters are language dependant.
-
- >The globbing ("regular expression pattern match") characters
- >DO change for any patterns more complicated than "*".
-
- Globbing is nothing more than sequential matching to a regular expression.
-
- >V: Don't you think the ANY text is going to be fancy because Unicode
- >V: as it is does not provide adequate means for the trivial operations?
-
- >Perhaps any multinational text, yes; for normal text, processing will be
- >done using the localized form, not the Unicode form; therefore the issue
- >will never come up, unless the application requires embedded attributes
- >(like a desktop publishing package. Since multinational processing is
- >the exception rather than the rule, let the multinational users pay the
- >proce in terms of "Fancy text".
-
- If processing is done using a localized (non-Unicode) form, then
- why do we need Unicode at all?
-
- The assertion that multilingual processing requires
- register switching (aka "fancy mode") is simply wrong.
-
- >Again, multinational software is not being addressed; however, were we to
- >address the issue, I suspect that it would, in all cases, be implementation
- >dependant upon the multinational application.
-
- Then forget about Unicode. It is not supposed to be a means of
- localization.
-
- >I don't know how stridently I can express this: runic encoding destroys
- >information (such as file size = character count) and makes file system
- >processing re character substituion totally unacceptable...
-
- Variable-length line encodings destroy information (such as
- file size = character count / record size) and make file system
- processing re word substitution totally unacceptable...
-
- (You've got to be an OS/370 zealot).
-
- >consider the
- >case of a substitution of a character requiring 3 bytes to encode for on
- >that takes 1 byte (or 4 bytes) currently. Say further that it is the
- >first character in a 2M file. You are talking about either shifting the
- >contents of the entire file, or, MUCH WORSE, going to record oriented files
- >for text. If there is defacto attribution of text vs. other files (shifting
- >the data is unacceptable. period.), there is no reason to avoid making
- >that attribution as meaningful as possible.
-
- consider the
- case of a substitution of a word requiring 3 bytes for on
- that takes 1 byte (or 4 bytes) currently. Say further that it is the
- first character in a 2M file. You are talking about either shifting the
- contents of the entire file, or, MUCH WORSE, going to record oriented files
- for text. If there is defacto attribution of text vs. other files (shifting
- the data is unacceptable. period.), there is no reason to avoid making
- that attribution as meaningful as possible.
-
- I wonder why you don't run around crying that Unix requires
- record-oriented files and de facto attribution of text vs. other
- files. You'd better think before posting anything to USENET.
-
- V: Pretty soon it will be a dead standard because of the logical problems
- V: in the design. Voting is inadequate replacement for logic, you know.
- V: I'd better stick to a good standard from Zambia than to the brain-dead
- V: creature of ISO even if every petty bureaucrat voted for it.
-
- >I agree; however, the peope involved were slightly more knowledgable about
- >the subject than your average "petty bureaucrat".
-
- If they understand the problem as well as you do i somehow doubt that.
-
- >And there has not been
- >a suggested alternative, only rantings of "not Unicode".
-
- If someone comes to me declaring that he has invented a perpetuum mobile,
- he too can claim that i criticize him without providing an
- alternative.
-
- Moreover, it is not the case. Although i cannot replace ISO
- in blessing standards, i have already provided my ideas on how to
- create a standard which will not be dead before birth.
-
- >V: I expressed my point of view (and proposed some kind of solution) in
- >V: comp.std.internat, where the discussion should belong. I'd like you to
- >V: see the problem not as an excercise in wrestling consensus from an
- >V: international body but as a mathematical problem. From the logistical
- >V: point of view the solution is simply incorrect and no standard commitee
- >V: can vote out that small fact. The fundamental assumption Unicode is
- >V: based upon (i.e. one glyph - one code) makes the whole construction
- >V: illogical and it, unfortunately, cannot be mended without serious
- >V: redesign of the whole thing.
- >
- >Wrong, wrong, wrong.
- >
- >1) We are not discussiong the embodiment of a standard, but the
- > applicability of existing standards to a particular problem.
- > Basically, we could care less about anything other than the
- > existing or draft standards and their suitability to the task
- > at hand, the international enabling of 386BSD.
-
- Do you REALLY believe that if something is blessed by a committee it
- will work in real life? I thought this mindset died
- with communism.
-
- >2) We are not interested in "arriving" at a new standard or defending
- > existing or draft standards, except as regards their suitability
- > to our goal of enabling.
-
- *We*'re not interested in spending our lives perpetuating old mistakes
- and doing useless work. I'd prefer you not to talk in the name of
- all labouring people; although i'm immune to that kind of speech
- and it does not impress me, i still find it repulsive.
-
- >3) The proposal of new soloutions (new standards) is neither useful
- > nor interesting, in light of our need being "now" and the adoption
- > of a new soloution or standard being "at some future date".
-
- If it is not interesting to you, you can simply say 'k'.
-
- >4) Barring a suggestion of a more suitable standard, I and others
- > will begin coding to the Unicode standard.
-
- Good luck. It is my belief that most people are unable to learn
- from the mistakes of others, only from their own. I also do not care
- about 386BSD, for a number of reasons.
-
- >5) Since we are discussing adoption of a standard for enabling of
- > localization of 386BSD, and are neither intent on a general defense
- > of any existing standard, nor the proposal of changes to an
- > existing standard or the emobodiment of a new standard, this
- > discussion doe *NOT* belong in comp.std.internat, since the
- > subscribers of comp.unix.bsd are infinitely more qualified to
- > determine which existing or draft standard they wish to use
- > without a discussion of multinationalization (something only
- > potentially useful to a limited audience, and then only at some
- > future date when multinational processing on 386BSD becomes a
- > generally desirable feature.
-
- See before.
-
- >V: Try to understand the argument about the redundance of encoding with
- >V: external restrictions provided i used earlier in this letter. The
- >V: Unicode commitee really get caught in a logical trap and it's a pity
- >V: few people realize that.
- >
- >I *understand* the argument; I simply *disagree* with it's applicability
- >to anything other than enabling multinationalization as opposed to
- >enabling localization, which is the goal.
-
- "I understand that 2+2=4; I simply *disagree* with it's applicability
- to anything other than drawing figures on a blackboard as opposed to
- counting apples, which is the goal".
-
- I wonder if you ever wrote a working program.
-
- ---------
- From: Peter da Silva
-
- >You have identified two problems with Unicode and ISO 10646: case conversion
- >and lexical ordering.
-
- >> See how Unicode renders itself useless?
-
- >Unless you want to work on multilingual documents, yes. It could be better,
- >certainly, but to say it's *useless* is hyperbole.
-
- Yep, although it depends on where your letter goes. If it'll
- be fed to something like SM/1 or grapeVINE, case conversion and
- sorting are necessary.
-
- Now, let's be fair. "Unicode is the standard applicable to writing
- letters only", right? Do *you* need such a standard?
-
- ---------
- From: Dik T. Winter
-
- >I move the discussion a bit: would we like sorting according to the texts
- >language or the users language?
-
- >I do not know, but I do know that in many cases a Finn would like to handle
- >it differently from a German, regardless of the language of the text involved!
-
- Neat idea. I'd like to see it implemented, though i don't know
- what the Finnish idea of sorting Chinese is.
-
- Even if you have a customized sorting routine, the fallback
- to a universal routine is necessary.
-
- > So, even if sorting is not regular there always is a way around --
- > with Unicode you can't do even that.
-
- >Eh? I would think the same way around!
-
- Nope. I propose an encoding in which sorting is algorithmic (i agree,
- plain arithmetic sorting is not sufficient).
- You can't do that with Unicode/ISO 10646 because the
- necessary information has been lost.
-
- >You are confused about a few things. Unicode will help your sorting
- >Cyrillic names and Latin names just fine.
-
- It does not help me with Ukrainian and Russian, and the fact that
- it has a separate Cyrillic page is quite irrelevant.
-
- >You are mixing soring with different
- >scripts and sorting different languages.
-
- See the example above. BTW, Ukrainian has a letter "i" which is found
- in the Latin code page.
-
- >The latter makes no sense if
- >you use for each word the place it should occupy in its native language
- >(especially for loan-words).
-
- If you're typing in loan-words, you're typing them in *your* language
- (and the character codes used belong to your language), so they will be
- treated as you wished.
-
- > > >It won't make sense. Lexical sorting makes only sense, if at all, in
- > > >*one* single language.
- > >
- > > See before.
- >
- >See before, you give different scripts not different languages.
-
- See before, you missed the point completely.
-
- es> You've talked a lot about regular expressions etc. Frankly I
- es> don't give a damn about those. The main bulk of computer users
- es> are not programmers and don't know what a regular expression
- es> is, so why focus such specific issues?
-
- >I agree completely here. Shell globbing, regular expressions and
- >sorting by 'ls' are not relevant.
-
- Tell it to the secretary who knows that ls gives a sorted list
- of the files in her directory, with a couple of hundred names like
- "Letter1" and "IRS_request" in it.
-
- >(There are still systems around
- >that do not sort the list of files at all. Can you say IBM VM/CS?)
-
- I knew you had to be an MS-DOS zealot.
-
- >Those things are indeed used by programmers only...
-
- Obviously you don't earn your living by writing programs for
- real users. You're underestimating people's intelligence;
- there *are* non-programmers who know a hell of a lot of shell
- and use it.
-
- --------
- From: Lars Wirzenius
-
- >If I understand your proposed solution correctly, it is the "each
- >language has a completely different set of character codes" solution
- >which I have disagreed with above.
-
- You understood it incorrectly and therefore further discussion
- makes no sense.
-
- --------
- From: Glenn A. Adams
-
- >You're uninformed. Try looking at Xerox Viewpoint (West European,
- >East European, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, et al);
- >try looking at BB&N Slate (West & East European, Greek, Hebrew, Arabic,
- >Korean, Thai). There are others: Nota Bene, Multilingual Scholar,
- >AlKatiib, etc.
-
- Thank you for enlightening me, but i wonder if you can tell the
- difference between an entire operating system and a single
- application -- the localization of the two are tasks differing
- in complexity by a couple of orders of magnitude.
-
- So far, MNOS RL 1.2 and DEMOS 2 are THE ONLY EXISTING true
- bilingual systems.
-
-
- >After reading this yet again, I now believe that this entire conversation
- >may be based on a misunderstanding. Unicode does not unify Latin T,
- >Cyrillic T, and Greek T! They are separate characters, as are Latin A,
- >Cyrillic A, and Greek A. Nor does Unicode unify LATIN A WITH RING and
- >ANGSTROM SYMBOL.
-
- Which Unicode? 16-bit or 32-bit? It already has multiple
- codes for similar glyphs but still does not allow algorithmic
- sorting or case conversion. What's the point? To combine the
- worst of both "one glyph - one code" and my proposal?
-
- -------
- From: Professor David J. Birnbaum
-
- [the text omitted]
-
- Thank you for understanding!
-
-
- -------
- From: Erland Sommarskog
-
- >>Aw, don't be silly. It's trivial.
-
- >When you can't explain write off the problem as trivial.
-
- Nope. Simply add the conversion CH -> _CH_ to the text editor on
- writing the output file, and the reverse conversion on reading.
- Also, add the same conversion at the copy from the raw list
- to the cooked list in the tty driver.
-
- Isn't it trivial?
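-
- In code it amounts to this (UC_CH is an invented code for the digraph
- treated as a single letter; only uppercase is shown, lowercase works the
- same way; Latin letters are assumed to keep their ASCII codes):
-
-     #include <stddef.h>
-
-     typedef unsigned short ucode;
-     #define UC_CH 0x0180                     /* hypothetical digraph code */
-
-     /* expansion on output: one internal code -> two characters */
-     size_t expand_ch(const ucode *in, ucode *out)
-     {
-         size_t n = 0;
-         for (; *in; in++) {
-             if (*in == UC_CH) { out[n++] = 'C'; out[n++] = 'H'; }
-             else              { out[n++] = *in; }
-         }
-         out[n] = 0;
-         return n;
-     }
-
-     /* the reverse conversion on input */
-     size_t collapse_ch(const ucode *in, ucode *out)
-     {
-         size_t n = 0;
-         while (*in) {
-             if (in[0] == 'C' && in[1] == 'H') { out[n++] = UC_CH; in += 2; }
-             else                              { out[n++] = *in++; }
-         }
-         out[n] = 0;
-         return n;
-     }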
-
-
- >>>What the hell has the number of bits to do with anything? Do computers
- >>>exist for the programmers of the users?
- >>
- >>Look, you've missed the logic completely. Read it please again. I also
- >>explained it several times in other postings.
- >
- >What logic? I want to be able to write and read text in European
- >languages. Period.
-
- You can do it right now without Unicode, period.
-
- >Then how many bits you use is not my issue, as
- >long as you give me something which I consider user-friendly. (Being
- >forced to keep track whether a certain dotted "a" is German or
- >Swedish is not.) How many bits you use is completely irrelevant.
- >(But since I want more than 256 symbols, you will have a pain if
- >you stay with eight bits.)
-
- It IS an issue when you're getting no benefits -- if you have to
- specify the language outside of the text, you can just as well live with
- several 8-bit codepages. The only difference is that you don't need
- 2 bytes for every letter.
-
- >>What do you tell the poor user when he has a database with English
- >>and Russian company names (a case from my practice, to be real) --
- >>in both upper and lower case and the smart guys (apparently Erlands
- >>pupils) made a terminal which converts cyrillic codes for the letters
- >>of the same shape as latin to the latin codes? Go get a rope?
- >
- >Yes, that is precisely the confusion which is likely to happen when
- >you assign the same character different codes depending on language,
- >and when the application program is not smart enough to equate them.
-
- You missed the example again, completely. The problem was with
- the many letters like T which AREN'T the same in Russian
- and English when converted to lowercase.
-
- >>The basic ASCII principles (after reordering and replacing several
- >>characters) remained the same -- there is a way to convert upper<->lower
- >>case and there is a way to sort without asking which language every word
- >>came from (it's known apriori).
- >
- >Nope. Not with German.
-
- ASCII is not a German standard, nor was it designed for German.
-
- >The problem with your idea is that you believe that everything is
- >known at input time. It isn't.
-
- If the originator of the information does not know which language he
- used, who will? In many cases this information is available
- ONLY at input.
-
- >You've talked a lot about regular expressions etc. Frankly I
- >don't give a damn about those. The main bulk of computer users
- >are not programmers and don't know what a regular expression
- >is, so why focus such specific issues?
-
- I've got to like that attitude. "I don't give a damn"...
- The users don't give a damn (or a dime, btw) for your
- programs if they are unable to do things users now take
- for granted.
-
-
- -------
- From: Kosta Kostis
-
- >> The one which does not ask user which language he means every time
- >> he runs more -i.
- >
- >Nice. Your local language will be implied somehow and you can use it as
- >a default. What's your problem?
-
- Nice. I have a file in both English and Russian. What is my local
- language?
-
- >> Unfortunately there is a logical gap. I don't care WHICH algorithm
- >> is used as long as it is an ALGORITHM. There is no way to convert Unicode
- >> strings to uppercase without "external" information.
- >
- >There is no way to convert non-US ASCII strings without "external"
- >information.
-
- Non-US ASCII is an oxymoron.
-
- >Simple "solutions" may work for "Russian and English"
- >or for "Greek and English", where you imply the language, but there's
- >no *general* solution. You don't seem to understand that.
-
- Who said that there is NO solution for N languages where
- N is arbitrary? You don't seem to understand yourself.
-
- >Nice for you you're bilingual, but there are companies and the like
- >that need support for much more than two languages and their
- >"common alphabet" won't fit in 8-bit, 9-bit or 10-bit.
-
- That's OK, but why should i sacrifice things i now take for
- granted (like working sorting that does not ask me what i mean)?
-
- >> Already discussed. I sure don't know everything but i know that
- >> you can made a minimal strictly ordered set from unification of
- >> strictly ordered sets by merging similar elements.
- >
- >You think you can do so. Implement it and try to sell it. Good luck.
-
- I'm in a different business. When i did the internationalization
- project i did it quite successfully. You know, a 22-year-old
- hippie programmer does not usually get a gold medal from the
- government for nothing.
-
- >I can see 202 cyrillic characters (including diacritic marks) in
- >UniCode Version 1.0 - that's better than ISO 8859-5 (96 characters).
- >Does KOI-8 cover more than 202 cyrillic characters?
-
- Nope. It was designed for English and Russian ONLY, and it
- does that job well. The other Cyrillic languages have their
- own code pages.
-
- >You have your way of sorting names, others have other ways of sorting.
- >Foreign names are written in German with German letters in Germany.
-
- Then they're German words, so what is the fuss about?
-
- >UniCode is not a Russian-English encoding. It's a multilingual
- >encoding with advantages and drawbacks.
-
- Thank you for the reminder; i do not have amnesia. I do not care
- about "multilingual" encodings which do not allow me to
- do things i can do with the bilingual one.
-
- If you don't understand the difference between an example and
- a proposal you'd better keep out of USENET, or buy an asbestos
- suit.
-
- >There are partial solutions for local problems, fine, but that's
- >it and that's what will be. No universal character set will ever
- >solve that [period]. (Now hear my shoe on the table ... ;-) )
-
- First get that shoe out of your mouth.
-
-
- --------
- From: Johnny Eriksson
-
- >Why not tell everyone about KOI-8 and the other cyrillic coding methods,
- >code tables, design principles, whatever. There may be more than one
- >of us that are interested in that information.
-
- KOI-8 is nothing more than an 8-bit ASCII extension with Cyrillic
- letters at codes 0300 to 0377 (octal). It has completely separate
- alphabets for Russian and English even though there are a number
- of similar letters.
-
- I used that code only as an illustration of a model bilingual
- encoding.
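-
- (From memory, so treat the exact constants as illustrative rather than
- authoritative: the two Cyrillic case halves in KOI-8 differ by a single
- bit, so case conversion stays the same one-line trick as in ASCII --
- which is exactly the property this posting keeps arguing for.)
-
-     /* sketch: case toggling in a KOI-8-style bilingual 8-bit encoding */
-     int koi8_is_cyrillic(unsigned char c)
-     {
-         return c >= 0300;                    /* octal 0300..0377 */
-     }
-
-     unsigned char koi8_toggle_case(unsigned char c)
-     {
-         if (koi8_is_cyrillic(c))
-             return (unsigned char)(c ^ 040); /* the case halves differ by one bit */
-         if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
-             return (unsigned char)(c ^ 040); /* the ASCII half works the same way */
-         return c;
-     }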
-
- There were Unicode-style encodings (GOST and DKOI) which
- followed the "one-code -- one glyph" principle.
- Those encodings are dead by now.
-
-
- Flame mode off.
-
- --vadim
-