Path: sparky!uunet!zaphod.mps.ohio-state.edu!uwm.edu!linac!att!att!allegra!alice!andrew
From: andrew@alice.att.com (Andrew Hume)
Newsgroups: comp.std.internat
Subject: Re: Data tagging (was: 8-bit representation, plus an X problem)
Summary: systems and standards
Keywords: magic codes, portable data
Message-ID: <24455@alice.att.com>
Date: 20 Dec 92 06:37:55 GMT
Article-I.D.: alice.24455
References: <24426@alice.att.com> <1gpruaINNhfm@frigate.doc.ic.ac.uk> <1gtrpdINN6c4@corax.udac.uu.se>
Organization: AT&T Bell Laboratories, Murray Hill NJ
Lines: 151

In article <1gtrpdINN6c4@corax.udac.uu.se>, andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
> [note Followup-To: comp.std.internat]
>
> In article <1gt5a2EINNin3@uni-erlangen.de>, unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
> > It should also be noted that at least one existing OS (Windows NT)
> > uses a 2-byte encoding both internally (e.g. in filenames in Fnodes
> > on the disc) as well as in text files. Text files always begin with
> ^^
> > FEFF as a magic code for ISO 10646 texts. This code also indicates
> > whether it is a little-endian file.
>
> Is this magic code visible to the user without any special tricks,
> or is it filtered away by the operating system when the file is
> opened for reading? Suppose I obtain a file that is labeled as
> containing IS 10646 text, via FTP from a server running Windows NT,
> to a client running a different system--will I then get this 0xFEFF
> magic code (which is meaningless on my system) too, or will I get a
> 'clean' IS 10646 text?
>
> I remember seeing text files containing an explicit ^Z (0x1A) at
> the end, due to their origin on some home computer where ^Z was the
> ordinary EOF marker, even though I was sitting on a system with
> perfectly functional EOF pointers in the file descriptor blocks...
>
> I hope the above isn't yet another version of that problem (non-
> standard tags or markers floating around with standards-compliant
> data on systems not understanding them)?
>
> Alternatively, does this magic code have any chance of becoming
> a standard itself?
> --
> Anders Andersson, Dept. of Computer Systems, Uppsala University

This is a quite complicated set of questions that strikes at the
heart of how to handle 10646 text streams and even how to migrate
to the point where you can handle them.

firstly, we can answer what FEFF is. it is not a character as such
(in fact, it and FFFE are defined as never being characters). the meaning,
defined in 10646, is that the following byte stream *should* be in MSB-first
order (FFFE indicating LSB first). note that this is informative and
not normative.
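
to make that concrete, here is a minimal sketch in C of how a reader
might classify the first 16 bits of a stream (the names are mine and
nothing here comes from the standard itself):

	#include <stdio.h>

	enum order { MSBFIRST, LSBFIRST, NOMARK };

	/* classify the first two bytes of a stream; hi is the first
	 * byte read, lo the second.  if we return NOMARK the caller
	 * must treat both bytes as real data. */
	enum order
	byteorder(int hi, int lo)
	{
		if (hi == 0xFE && lo == 0xFF)
			return MSBFIRST;	/* FEFF: MSB-first text follows */
		if (hi == 0xFF && lo == 0xFE)
			return LSBFIRST;	/* FFFE: the byte-swapped mark */
		return NOMARK;			/* no mark at all */
	}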

how would you use such a thing? good question. what i tell you now
is not in any standard. there is a convention, proposed by the unicoders,
that text streams have a FEFF as their first 16 bits that is not part of
the text stream; the FEFF serves as a byte-order indicator. (it has never
been clear to me how FEFFs after the first 16 bits are handled.) so
a program like cat would strip the FEFF from each of its inputs, swab
those inputs that were marked FFFE, and emit one FEFF before the
catenation of the (processed) inputs.
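
such a cat might look something like this (a sketch only, assuming
16-bit units and the convention above; no real cat works this way):

	#include <stdio.h>

	/* copy one 16-bit stream: strip any leading mark, swab
	 * LSB-first input, pass everything else through.  an odd
	 * trailing byte is silently dropped. */
	void
	copy16(FILE *in, FILE *out)
	{
		int hi, lo, swab;

		hi = getc(in);
		if (hi == EOF)
			return;			/* empty input */
		lo = getc(in);
		swab = (hi == 0xFF && lo == 0xFE);	/* FFFE mark */
		if (!swab && !(hi == 0xFE && lo == 0xFF)) {
			putc(hi, out);		/* no mark: data, not header */
			if (lo == EOF)
				return;
			putc(lo, out);
		}
		while ((hi = getc(in)) != EOF && (lo = getc(in)) != EOF) {
			putc(swab ? lo : hi, out);
			putc(swab ? hi : lo, out);
		}
	}

	int
	main(int argc, char **argv)
	{
		FILE *f;
		int i;

		putc(0xFE, stdout);		/* one mark up front */
		putc(0xFF, stdout);
		for (i = 1; i < argc; i++) {
			if ((f = fopen(argv[i], "r")) == NULL)
				continue;	/* a real cat would complain */
			copy16(f, stdout);
			fclose(f);
		}
		return 0;
	}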

can we answer the question yet of what happens to an ftp'ed file?
not yet. before we can do that, we need to know a little about your system.
one way of classifying systems is whether or not files are typed.

in unix (and plan 9), files are not typed from the system
point of view. there are certainly files which are typed at the application
level: archives, executables. these files typically need to be massaged
by specific sets of utilities. for example, the tools for manipulating
archives are ar, ranlib, and ld. however, it is legal to cat archives,
copy them with cp, and so on, and we know this in advance, because
the system just views these files as byte streams.

in such systems, how do you handle 10646 text? if you insist upon
the FEFF header, then ALL the utilities handling text have to acquire
code to handle the header (and the byte-swabbing that seeing an FFFE implies).
this also means a new cat (otherwise cat'ing together two binary files
might inadvertently cause one to be swabbed). it means that all tools
that might handle text have to know when their input is text so that they can
handle it. of course, some folks say that all the files on their system
will have the same byte order and thus the FEFF is not necessary. this
is a plausible position but highly restrictive; it fails utterly in the
presence of networked filesystems.

the other kind of system supports explicit typing of files.
this would allow you to designate a file as being 10646-LSB, S-JIS,
8859-1, or whatever. such systems will find migration easy, but of course
typed files have problems too: what is the type of a pipeline?
what is the type of the output of cat (when its inputs have different types)?
of course, such systems are often small, closed universes (like the world
according to Word) with carefully planned allowable user actions.
such systems are limited but can be really smooth for the user.

there is also the problem of migration; how the hell do you
migrate to this new scheme? it amounts to being able to guess which
files on your system are text and converting them (to 16-bit chars
or whatever).
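
the guessing step is the ugly part. a crude sketch of it, in the
spirit of file(1) (a real migration tool needs much better heuristics):

	#include <stdio.h>

	/* call a file text if every byte is printable ascii or
	 * common whitespace.  good enough for a sketch, no more. */
	int
	istext(FILE *f)
	{
		int c;

		while ((c = getc(f)) != EOF)
			if ((c < 0x20 || c > 0x7E)
			&& c != '\n' && c != '\t' && c != '\r')
				return 0;
		return 1;
	}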

the next system issue is deciding on the representation of
text streams throughout the system. for example, as the body of a file?
as an argument to a system call? as an argument to a library call?
as text shown on the user's display? these are not necessarily
the same. for example, the system might allow either byte sex as the
contents of a file but insist on one byte order (and thus no FEFF)
for library and system calls and display (this example has difficulties
if your display is actually networked to another architecture).
by and large, this becomes a mess of interfaces, each trying to guess
how other parts need their text streams.

for mostly these reasons, Plan 9 chose a byte-stream encoding
(initially UTF-1 and then UTF-2) and applied it uniformly according
to a single rule: all byte streams interpreted as characters shall
be interpreted as a sequence of 10646 characters encoded as UTF-2.
this applies everywhere: it applies to the kernel and file server,
it applies to the window system and the user's display, it applies
to names in archives and tar files. and best of all, the existing
system and its text was, because we were an ascii site, already
correctly encoded. (actually, we were a Latin-1 system, but we were
willing to make users convert latin-1 text to the new format.)
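
the heart of that encoding (UTF-2, the scheme that was later
standardized as UTF-8) is simple enough to sketch. note the key
property: bytes 00-7F encode as themselves, which is why ascii
text needed no conversion at all:

	/* encode one 16-bit 10646 character; returns the byte count.
	 * ascii survives unchanged; everything else becomes 2 or 3
	 * bytes whose values are all >= 0x80. */
	int
	encode(unsigned c, unsigned char *s)
	{
		if (c < 0x80) {
			s[0] = c;			/* ascii: one byte, unchanged */
			return 1;
		}
		if (c < 0x800) {
			s[0] = 0xC0 | (c >> 6);		/* 110xxxxx */
			s[1] = 0x80 | (c & 0x3F);	/* 10xxxxxx */
			return 2;
		}
		s[0] = 0xE0 | (c >> 12);		/* 1110xxxx */
		s[1] = 0x80 | ((c >> 6) & 0x3F);	/* 10xxxxxx */
		s[2] = 0x80 | (c & 0x3F);		/* 10xxxxxx */
		return 3;
	}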

normally, such a solution requires that everything entering/leaving
the plan 9 universe be converted. however, as the encoding we use is
backward compatible with ASCII, no conversion need be done for the only
important case (text files on networked filesystems). it also has the
advantage that all programs can display text uniformly; users don't
have to write S-JIS editors, because the regular editor (sam or ed)
edits kana/kanji just fine. all the conversion effort can be, and is,
confined to one place (a program called tcs [translate character sets]).
the hope is that in most cases, this conversion can happen automatically
(which is how this thread arose originally; the case of mail and news
should be easy to make happen).
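
for a character set like latin-1, the conversion that gets confined
to that one place is tiny. a sketch of just that one direction (not
the real tcs, whose job covers many sets in both directions):

	#include <stdio.h>

	/* latin-1 -> byte-stream encoding: each input byte is one
	 * 10646 character in 00-FF, so only the 1- and 2-byte cases
	 * of the encoding arise. */
	int
	main(void)
	{
		int c;

		while ((c = getchar()) != EOF) {
			if (c < 0x80)
				putchar(c);		/* ascii passes through */
			else {
				putchar(0xC0 | (c >> 6));
				putchar(0x80 | (c & 0x3F));
			}
		}
		return 0;
	}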

to finally come back to the original question, one would presume
that ftp simply transfers files without diddling them, and as such,
if the original had an FEFF, then the result would as well. a more
aggressive ftp might convert to local format, inserting the FEFF as
necessary, but this would require another mode (don't want compress'ed
files swabbed, do we?) for transmission.

finally, you must understand that 10646 doesn't mandate
solutions to any of these issues. it has accomplished an admirable job
in that we can now unambiguously refer to explicit characters.
however, as i hope i have shown above, there is much more to the job
of converting and migrating to a ``10646 system''. i believe plan 9
was the first such system, mainly because we had the will and the source
(and rather less of it than most systems). there are still lingering
problems, mainly in talking to other systems (for example, most of
the printers we use are postscript printers driven from unix machines;
it has been a long and tedious process to get them to understand 10646
characters), but on the whole, within Plan 9, it just works.

i believe these system (design and migration) issues have been
essentially ignored in all the work and fuss on unicode/10646.
i know that deep within unicode and in places like X/Open there are
efforts to develop support libraries for wide characters, but this
simply ignores the system issues.


andrew hume