NetNews Usenet Archive 1992 #31

Internet Message Format | 1992-12-29 | 8.0 KB

Xref: sparky comp.arch:11989 comp.sys.dec:6642 comp.sys.sgi:18373 comp.sys.hp:14415
Path: sparky!uunet!spool.mu.edu!agate!dog.ee.lbl.gov!horse.ee.lbl.gov!torek
From: torek@horse.ee.lbl.gov (Chris Torek)
Newsgroups: comp.arch,comp.sys.dec,comp.sys.sgi,comp.sys.hp
Subject: Re: Comparison of Alpha, MIPS and PA-RISC-II wanted
Date: 30 Dec 1992 00:01:31 GMT
Organization: Lawrence Berkeley Laboratory, Berkeley CA
Lines: 136
Message-ID: <28164@dog.ee.lbl.gov>
References: <1992Dec29.044012.1@cc.curtin.edu.au> <3623363@zl2tnm.gen.nz>
NNTP-Posting-Host: 128.3.112.15

(Note, I have no idea what any of this is doing in any of these
newsgroups, but....)

[someone---the names have been deleted to protect the guilty :-) ---writes]
>>But U*** still loses because it has no way of telling the application what
>>TYPE of file structure it is...

[someone else writes]
>Exactly. Well, you can pull kludgery with magic numbers....

OK, I can see it is time to go back to basics here.  (Not BASIC,
basics :-) )

The first principle is that there must be *some* shared ideas in order
to make communication possible.  If I were to write this article in
Swahili, you could read it if and only if:

 a) you knew it was in Swahili; and
 b) you knew Swahili, or had a translation system.

Without both of those, no communication would occur.  This article,
like too many others, would be all noise.

This sort of thing---links via shared ideas that allow one to
understand part of a message, which then allows still further
understanding, and so on---is just what happened with the Rosetta
Stone.  Pre-Rosetta, trying to read Egyptian hieroglyphs was something
like trying to read a structured file without any idea what the
structure was.  (I may have some details confused; I am not into
archaeology.  But the idea is valid: you need to have some key to get
started.)

Now let us consider computer files.  A computer data store simply
holds a sequence of bits.
Those bits, those 1s and 0s, have no inherent *meaning*---all
the meaning arises through interpretation.  We need some way to
interpret each bit.  This interpretation occurs at many levels: some of
the bits on a disk, for instance, are for error correction.  Others
mark sector headers; still others (`gap' bits) are left simply for
timing.  The remainder are what we usually think of as `user and system
data'.

(Incidentally, note that I have started several levels up.  The disk
does not actually contain *bits* but rather merely records magnetic
domain changes.  When moved across a conductor, these generate a
sequence of electrical signals which must be carefully interpreted in
order to derive the individual bits.)

In our case, these `data bits' are grouped in sets of 8.  This was not
always true, and there are probably a number of systems that still
group them differently, but most modern machines use `bytes' as a basic
building block.  Each set of 8 again has an arbitrary interpretation:
01101001 may mean `the number 105' (a `big-endian' integral
interpretation) or `the number 150' (a `little-endian' integral
interpretation) or perhaps `the letter lowercase i' (an ASCII
interpretation of code 105).  They could be part of a larger integer,
or maybe they are even part of a floating point number or something
equally bizarre.

Fortunately, most of us use computers with a few standard
interpretations.  If we agree, for purposes of netnews, to write the
bits in a big-endian fashion, we can talk about 8, 16, 32, and 64 bit
integers---these are available on practically every modern system---and
32, 64, 80, and often 128-bit floating point numbers in either IEEE or
some proprietary notation.  Textual strings can be represented in
ASCII, either with a leading count, or more typically just by enclosing
in special `quote' characters.
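As a rough illustration of the byte example above, the following Python
sketch reads the same eight bits three ways.  The function names are
invented for this sketch and are not from the article; the two-byte
example at the end is likewise only an assumed illustration of "part of
a larger integer".

```python
def bits_msb_first(bits: str) -> int:
    """Treat the leftmost bit as the most significant one."""
    return int(bits, 2)


def bits_lsb_first(bits: str) -> int:
    """Treat the leftmost bit as the least significant one."""
    return int(bits[::-1], 2)


pattern = "01101001"

as_big = bits_msb_first(pattern)     # 105
as_little = bits_lsb_first(pattern)  # 150
as_text = chr(as_big)                # 'i' (ASCII code 105)
print(as_big, as_little, as_text)

# The same byte as half of a 16-bit integer: its value now depends on
# which byte order the reader assumes.
pair = bytes([0x69, 0x00])
print(int.from_bytes(pair, "big"))     # 26880
print(int.from_bytes(pair, "little"))  # 105
```

Nothing in the bit pattern itself says which reading is intended; that
choice lives entirely in the program doing the reading.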
More complicated interpretations, such as Lisp-like S-expression
structures or relational database tables, we generally represent using
some ad-hoc notation and hope that our readers apply the proper
interpretation from that.

Of course, our computer systems are generally not as flexible as our
people, and our people have far more experience with various forms of
communication than do our machines.  So when we use a computer we
generally have to instruct it, in gory detail, as to just how we want
those 1s and 0s interpreted.

And here we reach the point where the flamage occurs.

Some OSes provide an enormous collection of `standard interpretations',
and some provide only a few.  The VMS people insist that it is better
to have 16 built-in record formats; the UNIX people insist that it is
better to have none.  Both are right, both are wrong; neither way is
`better'.

(By the way, the `16' above was selected at random.  I have no idea how
many formats RMS officially supports, and if you count each record
length as a different format you will get a different number than if
you count all `fixed block' variants as one format.  It does little
good to set an exact number.  The fact is that RMS provides a number of
standard formats---fixed and variable block, stream-LF, etc.---and the
usual UNIX interface is just `a stream of bytes'.)

Let us suppose you want to, say, keep an inventory.  Under VMS, you can
choose a fixed block format with a particular record size, and store a
sequence of records: manufacturer, model, serial number, quantity on
hand, stock level, prices, date last ordered, date last sold, etc.  On
a UNIX system you have to package up the records yourself.  But this is
the easy part!  Managing a sequence of records is trivial.
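To give a sense of how little is involved in packaging up records by
hand on a byte stream, here is a minimal Python sketch.  The record
layout (two 16-byte text fields and a 32-bit quantity) and the helper
names are assumptions made up for illustration; a real inventory record
would carry more fields.

```python
import io
import struct

# Hypothetical layout: 16-byte manufacturer, 16-byte model, and an
# unsigned 32-bit quantity, big-endian with no padding (36 bytes total).
RECORD = struct.Struct(">16s16sI")


def pack_record(manufacturer: str, model: str, quantity: int) -> bytes:
    # struct pads short byte strings with NULs up to the field width.
    return RECORD.pack(manufacturer.encode("ascii"),
                       model.encode("ascii"),
                       quantity)


def read_record(stream, index: int):
    """Seek to record number `index` (0-based) and decode it."""
    stream.seek(index * RECORD.size)
    manufacturer, model, quantity = RECORD.unpack(stream.read(RECORD.size))
    return (manufacturer.rstrip(b"\0").decode("ascii"),
            model.rstrip(b"\0").decode("ascii"),
            quantity)


# An in-memory byte stream stands in for a plain file here.
store = io.BytesIO()
store.write(pack_record("Acme", "Baby Spits Up", 12))
store.write(pack_record("Acme", "Anvil", 3))

print(read_record(store, 1))
```

Because every record has the same size, finding record N is a single
multiplication and a seek, which is all a fixed-block format amounts to
at this level.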
The hard work goes into deciding what data to keep, how to cross-index,
what operations to support, what kind of security to provide, and so
forth, and into developing tools for answering questions about the data
(`should we drop the Baby Spits Up doll?').  As far as I know, no OS
solves these problems directly (yet?).  One can buy canned solutions
for both VMS and UNIX systems, of course, but in that case the
underlying OS implementation is irrelevant; one can hardly argue that
having or lacking RMS is inherently superior just because FooBase sells
a system for VMS or UNIX (or both or neither).

Now, one can say that RMS is provided but need not be used.  This is a
valid point.  But the argument cuts both ways: features provided may be
used inappropriately.  Indeed, for some purposes a largely-unstructured
text file may prove better than a highly-structured collection of
records.  Grep's regular expressions are quite simple and yet act as a
powerful query tool (this reminds me of Kernighan's story of using
egrep to find words one can spell on a calculator, but never mind
that).  Fortunately or not, grep does not work on records.  If one had
records, grep would need some way of `flattening' them for display
anyway.

And then there are `magic numbers'.  Are they a kludge?  Well, no and
yes.  They are an ad-hoc method to insert a `representation key' in a
file.  Given a simple `stream of bytes' file, if one takes the first
two or four bytes of a file and puts an `unlikely' value there, one can
write programs that find these `magic numbers' and change their
behavior based upon them.

One might further wonder how exactly this differs from reading a few
bytes from a specially-defined location in every file, and deciding the
file's type based on that information.
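The magic-number convention described above can be sketched in a few
lines of Python.  The table below is a small illustrative sample chosen
for this sketch, not anything the article lists, and the function name
`sniff` is hypothetical.

```python
# A tiny sample table mapping leading bytes to a file type.  Any
# program is free to keep its own table, or none at all.
MAGIC = {
    b"\x1f\x8b": "gzip data",
    b"\x1f\x9d": "compress(1) data",
    b"#!": "interpreter script",
}


def sniff(data: bytes) -> str:
    """Guess a type from the first bytes of an otherwise opaque stream."""
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return kind
    return "unknown (plain stream of bytes)"


print(sniff(b"\x1f\x8b\x08 more bytes"))
print(sniff(b"#!/bin/sh\necho hello\n"))
print(sniff(b"just some text"))
```

The file system knows nothing about this table; the `key' is simply the
first few bytes of the data, read by whichever program cares to look.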
Indeed, there *is* *no* difference except in convention: RMS stores
file types in a system-decreed manner, and RMS is included in
everything used in the system, so that, by convention, everything makes
the same interpretation of the magic number.  UNIX systems are, in this
sense, looser: each individual program can attempt to interpret file
types or not, in an individual manner.

A rigid system avoids chaos, but also stifles novel solutions.  This is
why, when we speak in generalities, it does no good to say that either
VMS or UNIX is `better'.  I know which one *I* prefer, but then I know
the kinds of tasks I do and how I go about them.  I also know whether I
prefer raspberry ice cream over peach, but I would not claim one is
`better' and the other `loses'. :-)
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 510 486 5427)
Berkeley, CA    Domain: torek@ee.lbl.gov