- Path: sparky!uunet!zaphod.mps.ohio-state.edu!rpi!bu.edu!att!mcdchg!mcdphx!udc!preece
- From: preece@urbana.mcd.mot.com (Scott E. Preece)
- Newsgroups: comp.std.internat
- Subject: Re: Language tagging
- Message-ID: <PREECE.93Jan6092809@predator.urbana.mcd.mot.com>
- Date: 6 Jan 93 15:28:16 GMT
- References: <1993Jan3.203017.232@enea.se> <2609@titccy.cc.titech.ac.jp>
- <1iav6tINNee2@life.ai.mit.edu> <1iddeeINN58g@rodan.UU.NET>
- Sender: news@urbana.mcd.mot.com (News)
- Distribution: comp
- Organization: Motorola MCG, Urbana Design Center
- Lines: 70
- In-Reply-To: avg@rodan.UU.NET's message of 6 Jan 93 01:42:38 GMT
- Nntp-Posting-Host: predator.urbana.mcd.mot.com
-
- I don't want to get into this too deeply, because I don't feel I really
- understand the issues about Unicode and 10646, but I would like to
- respond to one particular point in Vadim's presentation.
-
-
- In article <1iddeeINN58g@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
- |...
- | Now, I think everybody agrees that the "ultimate encoding" is
- | the one which provides the complete information about which
- | language is used -- it solves all the problems.
- |
- | Such an encoding can be implemented with:
- |
- | 1) register switching with (say) escape sequences.
- | This is highly impractical; moreover it is impossible
- | to determine the language if the information is available
- | from some point in the middle of file -- this situation
- | is especially troubling with Unix file pointer sharing.
- |
- | 2) every character code is a pair (language-code, letter-number-in-alphabet)
- | It is hardly practical because of the storage considerations.
- | Codifying languages requires at least 10-12 bits, i.e. every
- | letter turns into a sequence of at least 3 bytes.
- ---
- There is an important additional approach:
-
- Text is represented as a sequence of text objects, each of which
- has its own locale and cultural tagging.
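-
- As a rough illustration only (the names TextRun, TaggedText and
- language_at are made up for this sketch, and the language tags are
- just example values), such a representation might look like this.
- Note that the plain "string of characters" view is still recoverable,
- and the language at any offset can be looked up without scanning for
- escape sequences:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRun:
    """A piece of text plus the tags an application needs to process it."""
    text: str
    language: str      # hypothetical tag values such as "en" or "de"
    tags: tuple = ()   # further cultural/context tags, left open here


@dataclass
class TaggedText:
    """Text as a sequence of tagged runs rather than a bare string."""
    runs: list = field(default_factory=list)

    def plain(self) -> str:
        # The untagged character-string view, for tools that want it.
        return "".join(run.text for run in self.runs)

    def language_at(self, offset: int) -> str:
        # Walk the runs until the one containing the offset is found.
        pos = 0
        for run in self.runs:
            if pos <= offset < pos + len(run.text):
                return run.language
            pos += len(run.text)
        raise IndexError(offset)


doc = TaggedText([
    TextRun("The German word ", "en"),
    TextRun("Straße", "de"),
    TextRun(" means street.", "en"),
])
```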
-
- Much of the arguing in this thread rests on the broken assumption that
- the important notion is "a file" and that a file is "a string of characters"
- which may be entered at any point.
-
- Someone appears to have heard the phrase "mechanism, not policy" and
- decided that it means *nobody* should have policy. That's stupid.
- Any interesting application (or family of cooperating applications) is
- going to overlay more complex semantics on its basic entities than "a
- string of characters". The real problem is that everyone is so
- concerned with maintaining the notion that a file is a string of
- characters (which is a perfectly good model for the underlying operating
- system) that no common policy has emerged for representing attributed
- text *on top of that simple model*.
-
- Any piece of text must have significant context associated with it if we
- are to process it automatically (for indexing or display or retrieval or
- whatever). The "right" way to sort a list of text entries depends on
- the containing context. A list of German words may sort differently in
- an English context than in a German context; moreover, in an English
- context a list of German words may sort differently than a list of
- lexically identical English words. This says, to me, that it is
- critically important that a useful text processing system be able to
- mark a section of text as being a German word and that it be able to
- mark a section of text as following AACR2 sorting rules.
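-
- To make the point concrete, here is a small sketch of the same three
- names sorted under two different rules. Both key functions are
- deliberately simplified assumptions: one ignores diacritics and case,
- as an English-language index might, and the other expands umlauts in
- the style of German phone-book ordering. Neither is a complete
- collation implementation, and the word list is invented:

```python
import unicodedata


def english_context_key(word: str) -> str:
    # Assumed rule: drop diacritics and ignore case.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def german_phonebook_key(word: str) -> str:
    # Assumed rule: expand umlauts (a-umlaut -> ae, and so on), ignore case.
    composed = unicodedata.normalize("NFC", word).casefold()
    return composed.translate(str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"}))


words = ["Mugler", "Müller", "Muffat"]

in_english_context = sorted(words, key=english_context_key)
in_german_context = sorted(words, key=german_phonebook_key)
```

- Under the first rule "Müller" sorts last; under the second it sorts
- first. Which result is "right" cannot be decided from the characters
- alone.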
-
- What we need is a common model for text representation that has more
- structure than "a string of characters". Then we could all build
- applications that used the interfaces defined by that model and they
- could all work together. I think the object-oriented model fits this
- quite naturally and potentially handles a lot of the problems mentioned
- (a text object might provide, for instance, an ordering method which
- accepted two text objects as arguments and returned an indication of
- their ordering specific to its language and context).
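-
- A minimal sketch of what such an ordering method might look like,
- assuming a lookup table keyed by (language, context). Everything here
- is hypothetical: the TextObject class, the COLLATION_RULES table, the
- context names, and the "skip a leading article" rule, which merely
- stands in for whatever filing rules a real context would carry:

```python
from dataclasses import dataclass
from functools import cmp_to_key


def _plain_key(s: str) -> str:
    return s.casefold()


def _skip_article_key(s: str) -> str:
    # Illustrative filing rule: ignore a leading English article.
    s = s.casefold()
    for article in ("the ", "a ", "an "):
        if s.startswith(article):
            return s[len(article):]
    return s


# Hypothetical registry mapping (language, context) to a sort-key rule.
COLLATION_RULES = {
    ("en", "default"): _plain_key,
    ("en", "title-list"): _skip_article_key,
}


@dataclass(frozen=True)
class TextObject:
    text: str
    language: str
    context: str = "default"

    def order(self, a: "TextObject", b: "TextObject") -> int:
        """Return -1, 0 or 1 for a relative to b, using this object's
        own language and context to pick the rule."""
        key = COLLATION_RULES[(self.language, self.context)]
        ka, kb = key(a.text), key(b.text)
        return (ka > kb) - (ka < kb)


items = [TextObject("The Zebra", "en"), TextObject("Apple", "en"),
         TextObject("A Mango", "en")]

plain_list = TextObject("", "en", "default")
title_list = TextObject("", "en", "title-list")

as_plain = [t.text for t in sorted(items, key=cmp_to_key(plain_list.order))]
as_titles = [t.text for t in sorted(items, key=cmp_to_key(title_list.order))]
```

- The containing object, not the items being compared, decides the
- ordering here, so the same entries come out differently in the two
- lists. An application only needs to know the order() interface.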
-
- scott
-
- --
- scott preece
- motorola/mcg urbana design center 1101 e. university, urbana, il 61801
- uucp: uunet!uiucuxc!udc!preece, arpa: preece@urbana.mcd.mot.com
- phone: 217-384-8589 fax: 217-384-8550
-