NetNews Usenet Archive 1993 #1

Path: sparky!uunet!zaphod.mps.ohio-state.edu!rpi!bu.edu!att!mcdchg!mcdphx!udc!preece
From: preece@urbana.mcd.mot.com (Scott E. Preece)
Newsgroups: comp.std.internat
Subject: Re: Language tagging
Message-ID: <PREECE.93Jan6092809@predator.urbana.mcd.mot.com>
Date: 6 Jan 93 15:28:16 GMT
References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp> <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET>
Sender: news@urbana.mcd.mot.com (News)
Distribution: comp
Organization: Motorola MCG, Urbana Design Center
Lines: 70
In-Reply-To: avg@rodan.UU.NET's message of 6 Jan 93 01:42:38 GMT
Nntp-Posting-Host: predator.urbana.mcd.mot.com

I don't want to get into this too deeply, because I don't feel I really understand the issues about Unicode and 10646, but I would like to respond to one particular point in Vadim's presentation.

In article <1iddeeINN58g@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
|...
| Now, i think everybody agrees that the "ultimate encoding" is
| the one which provides the complete information about which
| language is used -- it sovles all the problems.
|
| Such an encoding can be implemented with:
|
| 1) register switching with (say) escape sequences.
|    This is highly impractical; moreover it is impossible
|    to determine the language if the information is available
|    from some point in the middle of file -- this situation
|    is especially troubling with Unix file pointer sharing.
|
| 2) every character code is a pair (language-code, letter-number-in-alphabet)
|    It is hardly practical because of the storage considerations.
|    Codifying languages require at least 10-12 bits, ie. every
|    letter turns into at least 3-byte sequence.
---

There is an important additional approach: Text is represented as a sequence of text objects, each of which has its own locale and cultural tagging.
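As a rough illustration of that third approach, here is a minimal sketch of text held as a sequence of tagged objects. The class name TextObject, its fields, the tag values, and the helper language_at are all hypothetical choices made for this example; the post does not define a concrete representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextObject:
    """One run of text plus the context needed to process it."""
    text: str
    language: str                     # hypothetical tag such as "en" or "de"
    sort_rules: Optional[str] = None  # hypothetical tag such as "AACR2"


def language_at(document: List[TextObject], offset: int) -> str:
    """Return the language of the character at `offset` in the whole text."""
    start = 0
    for obj in document:
        end = start + len(obj.text)
        if start <= offset < end:
            return obj.language
        start = end
    raise IndexError("offset is outside the document")


document = [
    TextObject("The word ", "en"),
    TextObject("Straße", "de", sort_rules="AACR2"),
    TextObject(" means street.", "en"),
]

print(language_at(document, 10))  # de
```

Because each object carries its own tags, the language at any offset can be recovered without replaying escape sequences from the start of the text, and the per-character storage cost of option 2 is avoided.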

Much of the arguing in this string rests on the broken notion that the important entity is "a file" and that a file is "a string of characters" which may be entered at any point. Someone appears to have heard the phrase "mechanism, not policy" and decided that it means *nobody* should have policy. That's stupid. Any interesting application (or family of cooperating applications) is going to overlay more complex semantics on its basic entities than "a string of characters".

The real problem is that everyone is so concerned with maintaining the notion that a file is a string of characters (which is a perfectly good model for the underlying operating system) that no common policy has emerged for representing attributed text *on top of that simple model*.

Any piece of text must have significant context associated with it if we are to process it automatically (for indexing or display or retrieval or whatever). The "right" way to sort a list of text entries depends on the containing context. A list of German words may sort differently in an English context than in a German context; moreover, in an English context a list of German words may sort differently than a list of lexically identical English words. This says, to me, that it is critically important that a useful text processing system be able to mark a section of text as being a German word and that it be able to mark a section of text as following AACR2 sorting rules.

What we need is a common model for text representation that has more structure than "a string of characters". Then we could all build applications that used the interfaces defined by that model and they could all work together.

I think the object-oriented model fits this quite naturally and potentially handles a lot of the problems mentioned (a text object might provide, for instance, an ordering method which accepted two text objects as arguments and returned an indication of their ordering specific to its language and context).
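
The ordering method suggested above might look something like the following sketch. Everything here is an assumption made for illustration: the class TaggedText, the order method, the context strings, and above all the two toy collation keys, which only approximate German phone-book ordering (umlauts expanded to "ae", "oe", "ue") and a naive English ordering (accents dropped). Real collation rules are considerably more involved.

```python
import unicodedata
from dataclasses import dataclass


def german_phonebook_key(s: str) -> str:
    """Toy key: expand umlauts and sharp s, as German phone books roughly do."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return s.lower().translate(table)


def english_key(s: str) -> str:
    """Toy key: drop accents, so that 'ü' is filed as plain 'u'."""
    decomposed = unicodedata.normalize("NFD", s.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("ß", "ss")


# (language of the text, containing context) -> key function; hypothetical table
COLLATION_KEYS = {
    ("de", "de"): german_phonebook_key,
    ("de", "en"): english_key,
}


@dataclass(frozen=True)
class TaggedText:
    text: str
    language: str

    def order(self, other: "TaggedText", context: str) -> int:
        """Return -1, 0 or 1 for self relative to other within `context`."""
        key = COLLATION_KEYS.get((self.language, context), english_key)
        a, b = key(self.text), key(other.text)
        return (a > b) - (a < b)


cap = TaggedText("Mütze", "de")
pattern = TaggedText("Muster", "de")
print(cap.order(pattern, "de"))  # -1: "muetze" sorts before "muster"
print(cap.order(pattern, "en"))  # 1: "mutze" sorts after "muster"
```

The same two German words come out in opposite orders depending on the containing context, which is the behaviour a plain string of characters cannot express on its own.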

scott
--
scott preece
motorola/mcg urbana design center    1101 e. university, urbana, il 61801
uucp: uunet!uiucuxc!udc!preece,      arpa: preece@urbana.mcd.mot.com
phone: 217-384-8589                  fax: 217-384-8550