Path: sparky!uunet!zaphod.mps.ohio-state.edu!uwm.edu!linac!att!att!allegra!alice!andrew
From: andrew@alice.att.com (Andrew Hume)
Newsgroups: comp.std.internat
Subject: Re: Data tagging (was: 8-bit representation, plus an X problem)
Summary: systems and standards
Keywords: magic codes, portable data
Message-ID: <24455@alice.att.com>
Date: 20 Dec 92 06:37:55 GMT
Article-I.D.: alice.24455
References: <24426@alice.att.com> <1gpruaINNhfm@frigate.doc.ic.ac.uk> <1gtrpdINN6c4@corax.udac.uu.se>
Organization: AT&T Bell Laboratories, Murray Hill NJ
Lines: 151

In article <1gtrpdINN6c4@corax.udac.uu.se>, andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
> [note Followup-To: comp.std.internat]
>
> In article <1gt5a2EINNin3@uni-erlangen.de>, unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
> > It should also be noted that at least one existing OS (Windows NT)
> > uses a 2-byte encoding both internally (e.g. in filenames in Fnodes
> > on the disc) as well as in text files. Text files always begin with
> ^^
> > FEFF as a magic code for ISO 10646 texts. This code also indicates
> > whether it is a little-endian file.
>
> Is this magic code visible to the user without any special tricks,
> or is it filtered away by the operating system when the file is
> opened for reading? Suppose I obtain a file that is labeled as
> containing IS 10646 text, via FTP from a server running Windows NT,
> to a client running a different system--will I then get this 0xFEFF
> magic code (which is meaningless on my system) too, or will I get a
> 'clean' IS 10646 text?
>
> I remember seeing text files containing an explicit ^Z (0x1A) at
> the end, due to their origin on some home computer where ^Z was the
> ordinary EOF marker, even though I was sitting on a system with
> perfectly functional EOF pointers in the file descriptor blocks...
>
> I hope the above isn't yet another version of that problem (non-
> standard tags or markers floating around with standards-compliant
> data on systems not understanding them)?
>
> Alternatively, does this magic code have any chance of becoming
> a standard itself?
> --
> Anders Andersson, Dept. of Computer Systems, Uppsala University

This is a quite complicated set of questions that strikes at the
heart of how to handle 10646 text streams and even how to migrate
to the point where you can handle them.

firstly, we can answer what FEFF is. it is not a character as such
(in fact, it and FFFE are defined as never being characters). the meaning,
defined in 10646, is that the following byte stream *should* be in MSB-first
order (FFFE indicating LSB first). note that this is informative and
not normative.
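
to make that concrete, here is a minimal sketch in C of how a reader
might classify the first 16 bits of a stream (the names are mine and
nothing here comes from the standard itself):

	#include <stdio.h>

	enum order { MSBFIRST, LSBFIRST, NOMARK };

	/* classify the first two bytes of a stream; hi is the first
	 * byte read, lo the second.  if we return NOMARK the caller
	 * must treat both bytes as real data. */
	enum order
	byteorder(int hi, int lo)
	{
		if (hi == 0xFE && lo == 0xFF)
			return MSBFIRST;	/* FEFF: MSB-first text follows */
		if (hi == 0xFF && lo == 0xFE)
			return LSBFIRST;	/* FFFE: the byte-swapped mark */
		return NOMARK;			/* no mark at all */
	}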

how would you use such a thing? good question. what i tell you now
is not in any standard. there is a convention, proposed by the unicoders,
that text streams have a FEFF as their first 16 bits that is not part of
the text stream; the FEFF serves as a byte-order indicator. (it has never
been clear to me how FEFFs after the first 16 bits are handled.) so
a program like cat would strip the FEFF from each of its inputs, swab
those inputs that were marked FFFE, and emit one FEFF before the
catenation of the (processed) inputs.
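
such a cat might look something like this (a sketch only, assuming
16-bit units and the convention above; no real cat works this way):

	#include <stdio.h>

	/* copy one 16-bit stream: strip any leading mark, swab
	 * LSB-first input, pass everything else through.  an odd
	 * trailing byte is silently dropped. */
	void
	copy16(FILE *in, FILE *out)
	{
		int hi, lo, swab;

		hi = getc(in);
		if (hi == EOF)
			return;			/* empty input */
		lo = getc(in);
		swab = (hi == 0xFF && lo == 0xFE);	/* FFFE mark */
		if (!swab && !(hi == 0xFE && lo == 0xFF)) {
			putc(hi, out);		/* no mark: data, not header */
			if (lo == EOF)
				return;
			putc(lo, out);
		}
		while ((hi = getc(in)) != EOF && (lo = getc(in)) != EOF) {
			putc(swab ? lo : hi, out);
			putc(swab ? hi : lo, out);
		}
	}

	int
	main(int argc, char **argv)
	{
		FILE *f;
		int i;

		putc(0xFE, stdout);		/* one mark up front */
		putc(0xFF, stdout);
		for (i = 1; i < argc; i++) {
			if ((f = fopen(argv[i], "r")) == NULL)
				continue;	/* a real cat would complain */
			copy16(f, stdout);
			fclose(f);
		}
		return 0;
	}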

can we answer the question yet of what happens to an ftp'ed file?
not yet. before we can do that, we need to know a little about your system.
one way of classifying systems is whether or not files are typed.

in unix (and plan 9), files are not typed from the system
point of view. there are certainly files which are typed at the application
level: archives, executables. these files typically need to be massaged
by specific sets of utilities. for example, the tools for manipulating
archives are ar, ranlib, and ld. however, it is legal to cat archives,
copy them with cp, and so on, and we know this in advance, because
the system just views these files as byte streams.

in such systems, how do you handle 10646 text? if you insist upon
the FEFF header, then ALL the utilities handling text have to acquire
code to handle the header (and the byte-swabbing that seeing an FFFE implies).
this also means a new cat (otherwise cat'ing together two binary files
might inadvertently cause one to be swabbed). it means that all tools
that might handle text have to know when their input is text so that they can
handle it. of course, some folks say that all the files on their system
will have the same byte order and thus the FEFF is not necessary. this
is a plausible position but highly restrictive; it fails utterly in the
presence of networked filesystems.

the other kind of system supports explicit typing of files.
this would allow you to designate a file as being 10646-LSB, S-JIS,
8859-1, or whatever. such systems will find migration easy, but of course
typed files have problems too: what is the type of a pipeline?
what is the type of the output of cat (when its inputs have different types)?
of course, such systems are often small, closed universes (like the world
according to Word) with carefully planned allowable user actions.
such systems are limited but can be really smooth for the user.

there is also the problem of migration; how the hell do you
migrate to this new scheme? it amounts to being able to guess which
files on your system are text and converting them (to 16-bit chars
or whatever).
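
the guessing step is the ugly part. a crude sketch of it, in the
spirit of file(1) (a real migration tool needs much better heuristics):

	#include <stdio.h>

	/* call a file text if every byte is printable ascii or
	 * common whitespace.  good enough for a sketch, no more. */
	int
	istext(FILE *f)
	{
		int c;

		while ((c = getc(f)) != EOF)
			if ((c < 0x20 || c > 0x7E)
			&& c != '\n' && c != '\t' && c != '\r')
				return 0;
		return 1;
	}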

the next system issue is deciding on the representation of
text streams throughout the system. for example, as the body of a file?
as an argument to a system call? as an argument to a library call?
as text shown on the user's display? these are not necessarily
the same. for example, the system might allow either byte sex as the
contents of a file but insist on one byte order (and thus no FEFF)
for library and system calls and display (this example has difficulties
if your display is actually networked to another architecture).
by and large, this becomes a mess of interfaces, each trying to guess
how other parts need their text streams.

for mostly these reasons, Plan 9 chose a byte-stream encoding
(initially UTF-1 and then UTF-2) and applied it uniformly according
to a single rule: all byte streams interpreted as characters shall
be interpreted as a sequence of 10646 characters encoded as UTF-2.
this applies everywhere: it applies to the kernel and file server,
it applies to the window system and the user's display, it applies
to names in archives and tar files. and best of all, the existing
system and its text was, because we were an ascii site, already
correctly encoded. (actually, we were a Latin-1 system, but we were
willing to make users convert latin-1 text to the new format.)
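
the heart of that encoding (UTF-2, the scheme that was later
standardized as UTF-8) is simple enough to sketch. note the key
property: bytes 00-7F encode as themselves, which is why ascii
text needed no conversion at all:

	/* encode one 16-bit 10646 character; returns the byte count.
	 * ascii survives unchanged; everything else becomes 2 or 3
	 * bytes whose values are all >= 0x80. */
	int
	encode(unsigned c, unsigned char *s)
	{
		if (c < 0x80) {
			s[0] = c;			/* ascii: one byte, unchanged */
			return 1;
		}
		if (c < 0x800) {
			s[0] = 0xC0 | (c >> 6);		/* 110xxxxx */
			s[1] = 0x80 | (c & 0x3F);	/* 10xxxxxx */
			return 2;
		}
		s[0] = 0xE0 | (c >> 12);		/* 1110xxxx */
		s[1] = 0x80 | ((c >> 6) & 0x3F);	/* 10xxxxxx */
		s[2] = 0x80 | (c & 0x3F);		/* 10xxxxxx */
		return 3;
	}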

normally, such a solution requires that everything entering/leaving
the plan 9 universe be converted. however, as the encoding we use is
backward compatible with ASCII, no conversion need be done for the only
important case (text files on networked filesystems). it also has the
advantage that all programs can display text uniformly; users don't
have to write S-JIS editors, because the regular editor (sam or ed)
edits kana/kanji just fine. all the conversion effort can be, and is,
confined to one place (a program called tcs [translate character sets]).
the hope is that in most cases, this conversion can happen automatically
(which is how this thread arose originally; the case of mail and news
should be easy to make happen).
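
for a character set like latin-1, the conversion that gets confined
to that one place is tiny. a sketch of just that one direction (not
the real tcs, whose job covers many sets in both directions):

	#include <stdio.h>

	/* latin-1 -> byte-stream encoding: each input byte is one
	 * 10646 character in 00-FF, so only the 1- and 2-byte cases
	 * of the encoding arise. */
	int
	main(void)
	{
		int c;

		while ((c = getchar()) != EOF) {
			if (c < 0x80)
				putchar(c);		/* ascii passes through */
			else {
				putchar(0xC0 | (c >> 6));
				putchar(0x80 | (c & 0x3F));
			}
		}
		return 0;
	}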

to finally come back to the original question, one would presume
that ftp simply transfers files without diddling them, and as such,
if the original had an FEFF, then the result would as well. a more
aggressive ftp might convert to local format, inserting the FEFF as
necessary, but this would require another mode (don't want compress'ed
files swabbed, do we?) for transmission.

finally, you must understand that 10646 doesn't mandate
solutions to any of these issues. it has accomplished an admirable job
in that we can now unambiguously refer to explicit characters.
however, as i hope i have shown above, there is much more to the job
of converting and migrating to a ``10646 system''. i believe plan 9
was the first such system, mainly because we had the will and the source
(and rather less of it than most systems). there are still lingering
problems, mainly in talking to other systems (for example, most of
the printers we use are postscript printers driven from unix machines;
it has been a long and tedious process to get them to understand 10646
characters), but on the whole, within Plan 9, it just works.

i believe these system (design and migration) issues have been
essentially ignored in all the work and fuss on unicode/10646.
i know that deep within unicode and in places like X/Open there are
efforts to develop support libraries for wide characters, but this
simply ignores the system issues.


andrew hume