- Xref: sparky comp.unix.bsd:10985 comp.std.internat:1073
- Newsgroups: comp.unix.bsd,comp.std.internat
- Path: sparky!uunet!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
- From: terry@cs.weber.edu (A Wizard of Earth C)
- Subject: Re: INTERNATIONALIZATION: JAPAN, FAR EAST
- Message-ID: <1993Jan7.045612.13244@fcom.cc.utah.edu>
- Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
- Sender: news@fcom.cc.utah.edu
- Organization: Weber State University (Ogden, UT)
- References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <2628@titccy.cc.titech.ac.jp>
- Date: Thu, 7 Jan 93 04:56:12 GMT
- Lines: 226
-
- Before I proceed, I will [ once again ] remove the "dumb Americans" from my
- original topic line. You and I both know the original attribution was by
- way of a query by an American, and was not intended to be reintegrated
- into the main thread.
-
- In article <2628@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
- >In article <1993Jan5.090747.29232@fcom.cc.utah.edu>
- > terry@cs.weber.edu (A Wizard of Earth C) writes:
- >
- >>>>This I don't understand. The maximum translation table from one 16 bit value
- >>>>to another is 16k.
- >>>
- >>>WHAAAAT? It's 128KB, not 16k.
- >>
- >>Not true for designated sets, the maximum of which spans a 16K region.
- >
- >Uuuurrrghhhh!!! Ia Ia!
- >
- >>128k is only necessary if you are to include translation of all characters;
- >>for a translation involving all characters, no spanning set smaller than the
- >>full set exists. Thus a 128k translation table is [almost] never useful
- >
- >Then, don't say 16 bit. If you say 16 bit, it IS 128K.
-
- It is still a translation of one 16 bit value to another. It is *not* an
- *arbitrary* translation we are talking about, since the spanning sets will
- be known.
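
To put numbers on the two figures being argued over, here is a small arithmetic sketch in Python. The function and variable names are illustrative only, and it assumes that "16K" counts bytes with two-byte table entries, which the thread itself does not settle.

```python
# Sizing sketch for a direct-indexed 16-bit -> 16-bit translation table.
BYTES_PER_ENTRY = 2  # each slot holds one 16-bit output value


def table_bytes(entries, bytes_per_entry=BYTES_PER_ENTRY):
    """Storage needed for a table with `entries` slots."""
    return entries * bytes_per_entry


# Arbitrary translation: every possible 16-bit input value needs a slot.
full_table = table_bytes(2 ** 16)

# Assumed reading of "16K": how many two-byte slots fit in 16 KB.
spanning_entries = (16 * 1024) // BYTES_PER_ENTRY

print(full_table // 1024, "KB for the full table")
print(spanning_entries, "two-byte entries fit in a 16 KB table")
```

Under that reading, the smaller table is enough whenever the known spanning set fits in 8192 slots; the full table is needed only for an arbitrary translation.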
-
- >>>>This means 2 16k tables for translation into/out of
- >>>>Unicode for Input/Output devices,
- >
- >>In this particular case, I was thinking an ASCII->Unicode or Unicode->ASCII
- >>translation to reduce the storage penalty paid by Western languages for
- >
- >ASCII <-> Unicode translation does not need any tables. PERIOD.
-
- Sorry; I misspoke (mistyped?) here. I meant to refer to any arbitrary 8-bit
- set for which a localization set is available (example: an ISO 8859-x set).
- I did not mean to limit the scope of the discussion to ASCII.
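
A minimal sketch of the pair of tables for one 8-bit set, written in Python. ISO 8859-2 is picked only as an example of an "ISO 8859-x set", and the names `to_unicode`, `from_unicode`, `widen` and `narrow` are hypothetical; Python's built-in `iso8859_2` codec is used to fill the table.

```python
# Build both translation tables for one 8-bit set.
CODEC = "iso8859_2"  # assumption: stands in for any 8-bit localization set

# 8-bit -> Unicode: a 256-entry table indexed by the byte value.
to_unicode = [ord(bytes([b]).decode(CODEC)) for b in range(256)]

# Unicode -> 8-bit: only the code points in the spanning set get entries.
from_unicode = {cp: b for b, cp in enumerate(to_unicode)}


def widen(data):
    """Translate 8-bit storage into a list of 16-bit code points."""
    return [to_unicode[b] for b in data]


def narrow(code_points):
    """Translate code points back to 8-bit storage (KeyError if outside the set)."""
    return bytes(from_unicode[cp] for cp in code_points)


print(hex(to_unicode[0xA1]))
```

The forward table here has 256 slots, so it is far smaller than either figure in the argument above; the reverse direction is shown as a dictionary only to keep the sketch short.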
-
- For any set containing more than 256 characters, raw representation in 16-bit
- form is preferable, as it saves space (an 8-bit representation is not
- attainable for such a large glyph-set character set). This would *not*
- result in direct Unicode storage of the large glyph-set characters, since
- language attribution is still necessary.
-
- >>>How can you "cat" two files with different file attributes?
- >>
- >>By localizing the display output mechanism.
- >
- >Wow! Apparently he thinks "cat" is a command to display content of
- >files. No wonder he think file attributes good.
-
- Obviously, by this response, you meant "cat two files to a third file" rather
- than what you stated, which would have resulted in the files going to the
- screen. Display device attribution based on supported character sets has
- been well discussed, hopefully to both our satisfaction.
-
- "cat" command pseudo-code in a language attributed file environment:
- [ cat f1 f2 > f3 ]
-
- Shell opens output file (default attribute: Unicode)
-     Open first file (attribute: Japanese)
-         Read data out (as Unicode data)
-         Write Unicode data to stdout
-     Close first file
-     Open second file (attribute: Japanese)
-         Read data out (as Unicode data)
-         Write Unicode data to stdout
-     Close second file
- Shell closes output file (a la "cat" program exit)
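
The steps above can be sketched in Python as follows. This is a toy model under stated assumptions, not a real file system interface: `AttributedFile` is a hypothetical stand-in for an inode with a language attribute, the `euc_jp` codec stands in for "attribute: Japanese", and UTF-16 stands in for 16-bit Unicode storage.

```python
class AttributedFile:
    """Hypothetical stand-in for an inode carrying a language attribute."""

    def __init__(self, attribute, data):
        self.attribute = attribute  # codec name standing in for the attribute
        self.data = data            # bytes as stored on disk


def read_as_unicode(f):
    # "Read data out (as Unicode data)": the attribute drives the decoding.
    return f.data.decode(f.attribute)


def cat(inputs):
    # Inputs are processed sequentially, one after another, as plain cat does.
    text = "".join(read_as_unicode(f) for f in inputs)
    # The shell-opened output file gets the default attribute (Unicode).
    return AttributedFile("utf_16_be", text.encode("utf_16_be"))


f1 = AttributedFile("euc_jp", "日本語".encode("euc_jp"))
f2 = AttributedFile("euc_jp", "カタカナ".encode("euc_jp"))
f3 = cat([f1, f2])
print(f3.attribute, len(f3.data))
```

Note that the output carries only the default attribute: the fact that both inputs were Japanese is not recorded anywhere in `f3`.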
-
- Obviously what you are asking is "how do I make two monolingual/bilingual/
- multilingual files of different language attribution into a single bilingual/
- multilingual file using cat" -- not the question as you have phrased it, nor
- as I have answered it, but in the context of the discussion, clearly the
- intended tack. Rather than pretending I don't know what you are getting
- at, I will answer the target question to avoid several more cryptic postings
- on your part as you slowly work towards the point.
-
- The answer is "you don't use 'cat'". The "cat" command does not deal with
- attribution of files because its redirected output is arbitrarily assigned,
- and because the input files are processed sequentially.
-
- I could argue for open descriptor type attribution and parallelization of
- the open mechanism such that all stream types addressed are known to the
- "cat" program, since attribution of an inode implies that attribution
- can be directly applied to an open in-core file descriptor for an instance
- of a pipe on either end and to input and output redirection. This defeats
- the cat program somewhat, and it is still possible to contrive examples
- where this would fail:
-
- #
- # Contrived example 1.0
- #
- for i in f1 f2 f3
- do
-     cat "$i"
- done | cat > f4
-
- In this case it would not be possible to provide the parallelization I
- spoke of above (no doubt you will include this construct in all your
- shell scripts from this point on if you read no further).
-
- The correct approach is to note that Unicode does not directly provide a
- mechanism for language attribution, and that file attribution is only a
- partial solution, since it does not deal with non-monolingual texts.
- While this has already been discussed to death, I will answer in good
- faith.
-
- What this means is that all files which are multilingual in nature require
- a compound document architecture. One way to do this is "bit stuffing": a
- "magic cookie" which can not occur in normal text files is allocated a code
- space in the Unicode character set as a non-spacing language designation
- character (yes, this means that the data within the files is stateful).
- Whether it is done this way or by some other mechanism implementing what
- the Unicode committee unfortunately called "Fancy text", or what most
- people call "Rich text", or what Adobe calls "Page Description Language",
- is irrelevant.
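
A minimal sketch of the "magic cookie" idea in Python, to show what "stateful" means in practice. Everything specific here is an assumption made for illustration: the cookie range (a few private-use code points starting at U+E000), the language list, and the names `tag` and `split_runs` are all hypothetical.

```python
# Hypothetical cookie range: one private-use code point per language.
COOKIE_BASE = 0xE000
LANGUAGES = ["en", "fr", "ja", "zh"]  # illustrative list only


def tag(lang):
    """Return the non-spacing designation character for `lang`."""
    return chr(COOKIE_BASE + LANGUAGES.index(lang))


def split_runs(text, default="en"):
    """Split a stateful stream into (language, text) runs."""
    runs, lang, buf = [], default, []
    for ch in text:
        idx = ord(ch) - COOKIE_BASE
        if 0 <= idx < len(LANGUAGES):
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            lang = LANGUAGES[idx]  # state change: holds until the next cookie
        else:
            buf.append(ch)
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


doc = tag("fr") + "déjà vu " + tag("ja") + "漢字"
print(split_runs(doc))
```

The cost of statefulness shows up immediately: a reader that starts in the middle of `doc` cannot know the language of the text it sees until it reaches the next cookie.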
-
- What this means is that a utility to combine documents (let's call it
- "combine") must have the ability to generate either language attributed
- files (if the source files are all of a single language attribution) or
- our default compound document format (TBD). This also means that the
- only difference between its operation and that of "cat" is that it must
- not accept input piped to it, or input arriving in any other way that
- may be contrived to fool such a utility. The solution to this may be
- documentation of "undefined behaviour" (as ANSI is fond of saying about
- the C language) rather than specific limitations on its use.
-
- The ability to redirect or pipe output is predicated on the ability to
- attribute existing files and/or pipes by language as well; if this is
- not allowed, then it would be necessary to specify a direct output file
- (most likely as the first argument, like the "tar" command). Sufficient
- definition of "undefined" behaviour could allow us to rename "combine"
- back to "cat" if we wished to risk it.
-
- Thus, to combine two arbitrary files to produce a third without the user
- worrying about language attribution:
-
- combine f1 f2 > f3
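
The attribution decision that "combine" has to make can be sketched in Python as below. The function name `output_attribute` and the placeholder string for the compound format are hypothetical; the actual compound document format is still TBD, as noted above.

```python
COMPOUND = "compound-document"  # placeholder; the real format is TBD


def output_attribute(input_attributes):
    """Pick the attribute for the output of a hypothetical `combine`."""
    distinct = set(input_attributes)
    if len(distinct) == 1 and COMPOUND not in distinct:
        return distinct.pop()  # all sources share a single language
    return COMPOUND            # mixed sources need the compound format


print(output_attribute(["Japanese", "Japanese"]))
print(output_attribute(["Japanese", "French"]))
```

This only works because "combine" sees every input attribute before it writes anything, which is exactly what a pipe into plain "cat" cannot offer.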
-
- Attribution of output and clever construction of our output device drivers
- would even allow us to switch fonts as dictated by the compound document
- architecture controls embedded in the file and/or the attribution of the
- file descriptor (the absence of such attribution being an indicator of a
- compound document). Thus a modified "more" could:
-
- more f3
-
- successfully, and a modified "lpr" could:
-
- lpr f3
-
- successfully.
-
- In the case of two monolingual documents of the same attribution, the
- standard "cat" command would produce an output file missing language
- attribution (unless both files were compound documents, and there was
- no aspect of the compounding mechanism causing the files to be unable
- to be simply conjoined).
-
- A POSSIBLE OPTIMIZATION OF STORAGE FOR SMALL GLYPH-SET LANGUAGES.
-
- As a possible optimization to reduce storage costs, we add to the existing
- list of potential language attributes for a file, which are:
-
- Compound Document
- Raw Unicode
- <Specific Language> This may take on many values
-
- To these we add:
-
- <Language Class>
-
- A language class is assigned by batch reduction of files, done when the
- system is idle, both to reduce storage costs and to provide attribution.
- For instance, if I have a file which contains US ASCII, French, and German,
- a spanning reduction of the language attribute of the file from Raw
- Unicode (the result of the standard "cat" output) to a language class of
- "ISO Latin-1" is possible, thus reducing storage requirements. For some
- languages, a 100 percent reduction is possible; for others, such as US
- ASCII, French, Japanese, or Chinese, it is not. In these cases, the system
- locale can be used to provide a "strong hint". Ideally, none of this would
- be necessary, since we have replaced the standard cat, and all
- non-international user code will be specific to the locale as long as
- output of a stream with multiple potential source languages is restricted.
- In a monolingual environment, this will be possible. A strict/nonstrict
- locale switch is one potential solution to this problem. This would cause
- instant reduction to the locale language in the strict case.
-
- A direct language attribute reduction from Unicode to the language class
- ISO Latin-1 will result in a halving of file storage requirements.
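
A minimal sketch of the reduction in Python, assuming UTF-16 (big-endian, no byte order mark) as a model for Raw Unicode storage and Python's `latin_1` codec as a model for the ISO Latin-1 language class. The function name `reduce_storage` is hypothetical.

```python
RAW = "utf_16_be"   # models Raw Unicode: two bytes per character
LATIN1 = "latin_1"  # models the ISO Latin-1 language class


def reduce_storage(raw):
    """Return (attribute, data) after attempting a spanning reduction."""
    text = raw.decode(RAW)
    try:
        return LATIN1, text.encode(LATIN1)
    except UnicodeEncodeError:
        return RAW, raw  # some character falls outside the class


western = "Grüße, déjà vu, hello".encode(RAW)
mixed = "hello 日本語".encode(RAW)

attribute, data = reduce_storage(western)
print(attribute, len(western), "->", len(data))
print(reduce_storage(mixed)[0])
```

The first file shrinks to half its size; the second contains characters outside the class, so it is left as Raw Unicode.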
-
- Note that a reduction from Raw Unicode to Compound Document is possible
- based on our "magic cookie" scenario above.
-
- Admittedly, it is possible to misreduce a file if its contents are not
- specific to a particular language or language class. In this case, no
- harm is done directly apart from misattribution of the file; the file
- can be considered to have been migrated to a slightly slower storage. Any
- additions to the file not fitting into the language class will cause a
- remigration back to the Raw Unicode class; hopefully this would be the
- minority case, since most utilities will already be internationalized, and
- there should be intrinsic internationalization of new utilities.
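
The remigration step can be sketched in Python like this. `StoredFile` is a hypothetical toy class, with the `latin_1` codec modelling a reduced language class and UTF-16 (big-endian) modelling Raw Unicode; it is not meant as a description of any real file system.

```python
class StoredFile:
    """Hypothetical file whose storage class follows its contents."""

    def __init__(self, text=""):
        self.attribute = "latin_1"  # starts out in the reduced class
        self.data = b""
        self.append(text)

    def append(self, text):
        try:
            self.data += text.encode(self.attribute)
        except UnicodeEncodeError:
            # The addition does not fit the class: remigrate the whole
            # file back to Raw Unicode, then store the new text.
            old = self.data.decode(self.attribute)
            self.attribute = "utf_16_be"
            self.data = (old + text).encode("utf_16_be")


f = StoredFile("café")
print(f.attribute, len(f.data))
f.append(" 漢字")
print(f.attribute, len(f.data))
```

The append that triggers remigration pays for rewriting the existing contents, which is the "slightly slower storage" cost mentioned above.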
-
- Note that since this is simply a storage compaction optimization, there is
- no explicit reason that it could not be switchable as to whether or not it
- is enforced, at the system administrator's discretion.
-
-
- Does this answer your "cat" question sufficiently? The problem seemed to
- be that there was not a means around the problem from your point of view.
-
-
- Terry Lambert
- terry@icarus.weber.edu
- terry_lambert@novell.com
- ---
- Any opinions in this posting are my own and not those of my present
- or previous employers.
- --
- -------------------------------------------------------------------------------
- "I have an 8 user poetic license" - me
- Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
- -------------------------------------------------------------------------------
-