- @c *************************************************************************
- @c CHAPTER: Introduction to VVcode
- @c *************************************************************************
- @c node-name, next, previous, up
- @node Introduction, Introduction, Top, Top
- @chapter Introduction to VVcode
-
- Reliable and faithful exchange of binary files between computers over
- networks is a well-known problem, especially if the computers use different
- operating systems and are connected to different networks via a gateway.
- Unfortunately inter-networking and electronic mail are very much children
- of the 1960s: they might have had to wait until the 1970s for their
- naissance, but their progenitors were mentally locked-in to the concept of
- the 7-bit ASCII code for conveying textual information. The @TeX{}
- community has long been aware of this problem when trying to exchange
- ``machine-independent'' @file{.dvi} files and font-related data such as
- @file{.tfm} and @file{.pk} files. It has sometimes been possible to
- exchange this binary data by using encoding schemes that allow the data to
- be represented using a subset of the seven-bit ASCII character set.
-
- Academics and authors in many fields have hitherto been able to pass
- @file{.tex} files back-and-forth by electronic mail---apart from a few
- minor quirks and blemishes, such @TeX{} source files pass unharmed across
- the planet's networks. Problems are encountered when mail passes through
- certain gateway machines which introduce irreversible character
- corruptions. Particularly notorious is the Janet/Bitnet gateway
- which has the unfortunate habit of converting @samp{^} to @samp{~} and
- @samp{~} to @samp{%}: since it leaves @samp{%} itself unaffected, this
- makes recovery of the original file a non-trivial exercise. It sometimes
- also changes the brace characters @samp{@{@}} into odd characters above
- 128: this is particularly embarrassing, of course, for @file{.tex} files!
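The difficulty of undoing this corruption can be illustrated with a small sketch. It models only the two substitutions described above; the function names @code{gateway_map}, @code{gateway_corrupt} and @code{gateway_collide} are hypothetical and do not come from any real gateway software.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the gateway: '^' becomes '~', '~' becomes '%',
 * and every other character (including '%') passes through unchanged. */
char gateway_map(char c)
{
    switch (c) {
    case '^': return '~';
    case '~': return '%';
    default:  return c;
    }
}

/* Apply the mapping in place to a NUL-terminated string. */
void gateway_corrupt(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        *s = gateway_map(*s);
}

/* Return 1 if two different strings become identical after the gateway,
 * i.e. the receiver cannot tell which one was sent; 0 otherwise. */
int gateway_collide(const char *a, const char *b)
{
    size_t i, n = strlen(a);

    if (n != strlen(b) || strcmp(a, b) == 0)
        return 0;
    for (i = 0; i < n; ++i)
        if (gateway_map(a[i]) != gateway_map(b[i]))
            return 0;
    return 1;
}
```

Under this model a received @samp{%} may have started life as either @samp{~} or @samp{%}, so no fixed reverse mapping can restore every file.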
-
- For some years many @TeX{} users, particularly those working in languages
- other than English, and thus familiar with character set encodings
- containing other than the basic ASCII set, have been agitating for
- @TeX{} to be able to handle input in their mother tongues, using their own
- languages' character sets. In 1989, Knuth announced @TeX{} V3, and
- implementors world-wide beavered away to bring each implementation
- up-to-date. @TeX{} V3 now supports eight-bit character sets and so
- @file{.tex} source files are now effectively `binary' files and will
- therefore suffer from the same exchange problems experienced with
- @file{.dvi} files.
-
- All those authors who had previously been able to cooperate, despite being
- separated by hundreds or thousands of miles, might once again be forced to
- entrust floppy disks to the vagaries of the world's postal systems
- (although one shouldn't underestimate the bandwidth of the Royal [or other]
- Mail system).
-
- Unless or until the various e-mail protocols, networks and software are
- converted to support uncorrupted transmission of character codes
- @code{0x20 @dots{} 0x7e} and @code{0xa1 @dots{} 0xfe}, it will have to
- become the norm for @file{.tex} sources to be encoded for transmission by
- e-mail.
-
- This problem is of course well known outside the @TeX{} community.
-
-
- @c =========================================================================
- @c SECTION: The Aston Archive
- @c =========================================================================
- @section The Aston Archive
- The author is a volunteer assistant to Peter Abbott in running the world's
- principal repository of @TeX{}-related material at Aston University in
- Birmingham. The archive (host: @code{TeX.Ac.Uk}) holds several hundred
- megabytes of text and binary files including:
-
- @itemize @bullet
- @item program sources for @TeX{}, @code{METAFONT}, DVI drivers and many
- other utilities;
- @item binary executables for a variety of popular operating systems (e.g.
- Atari, Macintosh, MS-DOS, Unix, VAX/VMS and VM/CMS);
- @item @code{METAFONT} sources for Computer Modern and other fonts;
- @item binary font files (mainly @file{.tfm} and @file{.pk}) for a number of
- different output devices;
- @item text macro and style files.
- @end itemize
-
- The archive provides access to these files via the following services:
-
- @itemize @bullet
- @item NIFTP@footnote{Network Independent File Transfer Protocol --- in the
- UK, one does not perform the pseudo-login that
- Internet users are accustomed to using with the FTP
- protocol: instead, one issues a ``transfer request''
- for a file to be sent to or from the remote
- machine --- the transfer itself takes place
- asynchronously. One nice consequence is that such
- transfers can be queued for overnight execution,
- leaving daytime bandwidth free for e-mail and true
- remote interactive logins.}
- from Janet hosts---typically 300 megabytes of data are transferred
- every month; this would probably be much greater if we were not
- limited by the bandwidth of our 9600Bd connection to Janet.
- @item FTP and Telnet access from Internet hosts.
- @item Interactive browsing service via Janet PAD, including the facility
- to send files out using NIFTP (and later FTP).
- @item Interactive browsing service via dialup modem lines, including the
- facility to download files using Kermit and similar protocols.
- @item An e-mail file server which typically sends 150 megabytes of data per
- month to sites all over the world (though predominantly to
- EARN/Bitnet sites).
- @item A magnetic media distribution service via surface carriers.
- Copies of the entire archive have been sent to embryonic @TeX{}
- communities in Czechoslovakia, Hungary and Poland.
- @end itemize
-
- We have experienced many problems trying to support all of these file
- types, operating systems and access methods. The e-mail file server
- clearly needs a reliable method of encoding files if its many customers are
- not to be denied access to the non-text files in the archive.
-
- Binary files such as @file{.pk} font files are stored in different ways to
- accommodate the requirements of the different operating systems supported.
- Currently we maintain multiple font directory trees for the Macintosh,
- MS-DOS, Unix and VAX/VMS with all the attendant problems of
- synchronization, disk space and archivists' time. We need a single storage
- format which allows export to all of our supported operating systems.
-
-
- @c =========================================================================
- @c SECTION: Specification for a Coding Scheme
- @c =========================================================================
- @section Specification for a Coding Scheme
- In mid-1990, the archivists concluded that a universal encoding scheme was
- required to accommodate the many different kinds of file and file
- organization that the archive needed to support.
-
- Niel Kempson formulated the first draft of the specification at that time;
- the requirements of the encoding scheme may be summarized as follows:
-
- @table @strong
- @item Preserving File Structure
- It is insufficient, especially for an archive holding text and
- binary files for a variety of machine types, merely to encode data
- simply as a stream of bytes:
-
- @itemize @bullet
- @item Virtually all operating systems (except Unix) make a
- distinction between binary and text files, so the coding
- system should recognize and maintain this distinction.
- @item Unix and most PC-based operating systems treat files as
- streams of bytes with no further structure imposed. On the
- other hand, certain widely-used operating systems (e.g.
- VAX/VMS and VM/CMS) have record-oriented file systems where
- different types of file are stored in a format appropriate to
- the type of file@footnote{It is often argued that the
- increase in efficiency more than offsets the increase in
- complexity.}.
-
- For these operating systems, we consider it essential that
- the encoding scheme should identify, preserve and record the
- most commonly used file organizations. The decoding program
- should be able to use this information to create the output
- file using the organization appropriate to the operating
- system in use. If the information is of no consequence to
- the receiving system, the default file structure (if any)
- should be created. If the encoding system does not have
- structure in its files, the receiving system may provide
- suitable defaults automatically. In all cases the programs
- should permit the user to override or supplement file
- structure information.
-
- @item Whenever possible, these details of structure should be
- determined automatically by the encoding program; at the very
- least, an indication of whether the file is text or binary
- shall be provided, even under an operating system such as
- Unix that need make no such distinction for its own use, to
- allow decoding to an appropriate file organization on those
- systems that @emph{do} make such a distinction.
- @end itemize
-
- @item Coding Scheme
- Whatever method is used must allow encoded data to be e-mailed:
-
- @itemize @bullet
- @item It should be possible to specify the coding table to be used
- to encode the data. The coding table used shall be recorded
- with each part of the encoded data.
- @item If a recorded coding table is found while decoding, it should
- be used to construct an appropriate decoding table. Simple
- one-to-one character corruptions should be corrected as long
- as only one of the input characters is mapped to any one
- output character.
- @item The recommended encoding uses only the following characters:
- @quotation
- @code{+-0123456789}@*
- @code{abcdefghijklmnopqrstuvwxyz}@*
- @code{ABCDEFGHIJKLMNOPQRSTUVWXYZ}
- @end quotation
-
- Such an encoding as originally used for XXcode has been shown
- to pass successfully through all the gateways which are known
- to corrupt characters.
- @end itemize
-
- @item Integrity of Encoded Data
- We want to ensure that the @emph{whole} encoded file passes through
- the e-mail network.
-
- @itemize @bullet
- @item Encoded lines should be prefixed by an appropriate character
- string to distinguish them from unwanted lines such as mail
- headers and trailers. Whilst not essential, this feature
- does assist the decoding program in ignoring these spurious
- data.
- @item Lines should not end with whitespace characters as some
- mailers and operating systems strip off trailing whitespace.
- @item The encoding program should calculate parameters of the input
- file such as the number of bytes and CRC and record them at
- the end of the encoded data.
-
- The decoding program should calculate the same parameters
- from the decoded data and compare the values obtained with
- those recorded at the end of the encoded data.
- @end itemize
- @item Making Files Mailable
- A mechanism is needed to overcome some gateways' refusal to handle
- large files.
-
- @itemize @bullet
- @item The encoding program should be able to split the encoded
- output into parts, each no larger than a maximum specified
- size. Splitting the output into smaller parts is useful if
- the encoded data is to be transmitted using electronic mail
- or over unreliable network links that do not stay up long
- enough to transmit a large file. The recommended default
- maximum part size is 30kB.
- @item The decoding program should be able to decode a multi-part
- encoded file very flexibly. It should @emph{not} be
- necessary to:
-
- @enumerate
- @item strip out mail headers and trailers;
- @item combine all of the parts into one file in the
- correct order;
- @item process each part of the encoded data as a
- separate file.
- @end enumerate
- @item In addition, any file specifications from the operating system
- on which the VVE file was created must not prevent the file
- from being decoded.
- @end itemize
-
- @item Miscellaneous
- Further considerations include:
-
- @itemize @bullet
- @item Support for character sets other than ASCII is essential
- if the encoding scheme is to be useful to IBM hosts. The
- encoding program should label the character set used by the
- encoded data, and both encoder and decoder should enable the
- conversion between the local character set and another
- character set. For example a user on an EBCDIC host
- should be able to encode text files for transmission to another
- EBCDIC host, or to convert them to ASCII before
- encoding and transmission to an ASCII host. Similarly,
- that user should be able to decode text files from ASCII
- and EBCDIC machines, creating EBCDIC output files.
- @item Where possible, the original file's timestamp should be encoded
- and used by the decoding program when recreating the file: this
- will permit archives to retain the originator's time of
- creation for files, and thus permit the users (not to mention
- the archivists) to identify more clearly when a new version of
- a file has been made available. Timezones should be supported
- where possible.
- @item The encoding and decoding schemes should be able to read and
- write files that are compatible with one or more of the well
- established coding schemes (e.g. UUcode, XXcode).
- @item The source code for the programs should be freely available.
- It should also be portable and usable with as many computers,
- operating systems and compilers as possible.
- @end itemize
- @end table
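As a minimal sketch of the integrity check required above, the encoder can accumulate a byte count and a CRC over the input, and the decoder can recompute both over its output and compare them. The specification does not say which CRC is meant; the widely used CRC-32 (reflected polynomial @code{0xEDB88320}) is assumed here purely for illustration, and the names @code{vv_summary}, @code{crc32_update}, @code{summary_add} and @code{summary_equal} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values recorded at the end of the encoded data (assumed layout). */
typedef struct {
    unsigned long bytes;  /* number of bytes seen so far */
    uint32_t crc;         /* running CRC-32 of those bytes */
} vv_summary;

/* Bitwise CRC-32; start with crc == 0 and feed the data in any chunking. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Fold one more chunk of file data into the summary. */
void summary_add(vv_summary *s, const unsigned char *buf, size_t len)
{
    assert(s != NULL);
    s->bytes += (unsigned long)len;
    s->crc = crc32_update(s->crc, buf, len);
}

/* The decoder accepts the file only if both values agree. */
int summary_equal(const vv_summary *a, const vv_summary *b)
{
    return a->bytes == b->bytes && a->crc == b->crc;
}
```

Because the summary is updated chunk by chunk, the decoder can compute it while writing the output file and needs no second pass over the data.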
-
-
- @c =========================================================================
- @c SECTION: The Search Commences
- @c =========================================================================
- @section The Search Commences
- Naturally, the first step was to examine the existing coding schemes in
- comparison with the above ideal specification. Such schemes fell into two
- broad classes: @dfn{portable schemes}, which were intended to permit the
- encoding of files on any computer architecture into a form that could be
- transmitted electronically, and decoded on the same or a different
- architecture; and @dfn{platform-specific schemes}, which provided rather
- better support for transferring files between two computers using the same
- architecture and operating system.
-
- @subsection Portable Coding Schemes
- The most commonly used coding schemes supported by a variety of platforms
- are:
-
- @itemize @bullet
- @item @code{BOO}
- @item @code{UU}
- @item @code{XX}
- @end itemize
-
- Most implementations of these schemes known to the authors are designed for
- use with stream file systems. These programs have no means of recording,
- let alone preserving, record structure and are thus unsuitable for our
- purposes. This is not surprising since @code{UUcode} and its mutation
- @code{XXcode} were developed specifically for exchanging files between Unix
- systems. In fairness to these schemes, they are well suited to the
- transmission of text files and certain unstructured binary files.
-
- Standard @code{UUcode} encodes files using characters @samp{ } @dots{}
- @samp{_} of ASCII. This can result in one or more spaces appearing at
- the ends of lines: some mailers decide that this is information not worth
- transmitting, with consequent inability to reconstruct the original file.
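The trailing-space problem can be seen in a short sketch of the classic mapping, in which a 6-bit value @code{v} is sent as the character @code{' ' + v}. The sketch encodes a single three-byte group only and omits the length character that begins each real encoded line; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classic mapping: 6-bit value v (0..63) becomes the character ' ' + v,
 * so the value 0 is transmitted as a space. */
char uu_char(unsigned v)
{
    assert(v < 64);
    return (char)(' ' + v);
}

/* Encode exactly three bytes as four characters plus a terminating NUL. */
void uu_encode_group(const unsigned char in[3], char out[5])
{
    out[0] = uu_char(in[0] >> 2);
    out[1] = uu_char(((in[0] & 0x03u) << 4) | (in[1] >> 4));
    out[2] = uu_char(((in[1] & 0x0Fu) << 2) | (in[2] >> 6));
    out[3] = uu_char(in[2] & 0x3Fu);
    out[4] = '\0';
}

/* Model of a mailer that strips trailing blanks; returns the new length. */
size_t strip_trailing_spaces(char *line)
{
    size_t n = strlen(line);

    while (n > 0 && line[n - 1] == ' ')
        line[--n] = '\0';
    return n;
}
```

A group whose last bytes are zero ends in spaces, so a mailer behaving like @code{strip_trailing_spaces} shortens the line and the decoder no longer receives all four characters of the group.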
-
- Files containing characters such as @samp{^} are often irreversibly
- corrupted by mail gateways; this problem led to the development of
- @code{XXcode} which uses a rather more robust character set, namely:
-
- @quotation
- @code{+-0123456789}@*
- @code{abcdefghijklmnopqrstuvwxyz}@*
- @code{ABCDEFGHIJKLMNOPQRSTUVWXYZ}
- @end quotation
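A table-driven encoder for such a character set might look like the following sketch. It assumes that the 64 characters are used in the order in which they are listed above, and that a short final group is padded with zero bits; neither assumption is confirmed by this chapter, and the names @code{xx_table} and @code{table_encode} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 64-character coding table; the order is assumed from the listing. */
const char xx_table[] =
    "+-0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Encode len bytes from src using a 64-entry table.
 * dst must have room for 4 * ((len + 2) / 3) + 1 characters.
 * Returns the number of characters written, excluding the NUL. */
size_t table_encode(const char *table, const unsigned char *src,
                    size_t len, char *dst)
{
    size_t n = 0;

    assert(strlen(table) == 64);
    for (size_t i = 0; i < len; i += 3) {
        /* Missing bytes in the final group are treated as zero. */
        unsigned b0 = src[i];
        unsigned b1 = (i + 1 < len) ? src[i + 1] : 0u;
        unsigned b2 = (i + 2 < len) ? src[i + 2] : 0u;

        dst[n++] = table[b0 >> 2];
        dst[n++] = table[((b0 & 0x03u) << 4) | (b1 >> 4)];
        dst[n++] = table[((b1 & 0x0Fu) << 2) | (b2 >> 6)];
        dst[n++] = table[b2 & 0x3Fu];
    }
    dst[n] = '\0';
    return n;
}
```

Since the table is passed as a parameter, the same routine serves any 64-character alphabet, which is what makes a user-specified coding table cheap to support.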
-
- The encoding table used is recorded with the encoded data to allow the
- detection of character corruptions, and the correction of reversible
- character transpositions. Whilst superficially a step forward,
- @code{XXcode} offered little more than most existing versions of
- @code{UUcode}, which already supported coding tables. Its major
- contribution was in formalizing the encoding table, and in particular its
- default table was proof against all the known gateway-induced corruptions.
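One way a decoder could exploit the recorded table is sketched below. The table travels through the same gateways as the data, so it suffers the same substitutions; building the decoding table from the table @emph{as received} therefore undoes any one-to-one corruption automatically. The function name @code{build_decode_table} and the constant @code{VV_TABLE_SIZE} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VV_TABLE_SIZE 64

/* Build a decoding table from the coding table as it arrived.
 * On success decode[c] is the 6-bit value for character c, or -1 if c
 * is not part of the table, and the function returns 0.
 * It returns -1 if the table is too short or if two entries arrived as
 * the same character (the corruption was not one-to-one). */
int build_decode_table(const char *received, int decode[256])
{
    assert(received != NULL);
    for (int c = 0; c < 256; ++c)
        decode[c] = -1;
    for (int v = 0; v < VV_TABLE_SIZE; ++v) {
        unsigned char c = (unsigned char)received[v];

        if (c == '\0' || decode[c] != -1)
            return -1;
        decode[c] = v;
    }
    return 0;
}
```

If a gateway turned every @samp{+} into @samp{!}, for instance, the received table would begin with @samp{!} and the decoder would simply treat @samp{!} as the character for the value zero. When two table entries collapse into the same character, the function reports failure instead, since the data can no longer be decoded unambiguously.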
-
-
- @subsection Platform Specific Coding Schemes
-
- Encoding schemes have been developed to support transfer of files
- possessing some structure which therefore cannot be reconstructed correctly
- when encoded by the portable schemes. When the encoding and decoding
- programs of such a platform specific scheme are each used on the same
- computer and operating system type, files may be encoded and transmitted
- with a great deal of confidence that the decoded file will reproduce the
- original's structure and attributes in their entirety.
-
- Examples of such programs are @code{TELCODE} and @code{MFTU} for VMS,
- @code{NETDATA} for IBM mainframes, and @code{Stuffit} and @code{MacBinary}
- for the Macintosh. But these programs have the major disadvantage that
- they have each been implemented @emph{only} on the single architecture for
- which they were designed: thus the only two of these schemes that could be
- used on the VMS-based Aston Archive would be of minimal interest elsewhere!
-
- The Archive's content is in some respects artificially inflated by the
- presence of @file{.hqx} files for Macintoshes, @file{.boo} for MS-DOS,
- etc., which have to be held in pre-encoded form for transfer by those
- requiring them.
-
-
- @c =========================================================================
- @c SECTION: VVcode is Born
- @c =========================================================================
- @section VVcode is Born
-
- Realizing that none of the existing portable schemes was close enough to
- our ideal, Niel Kempson circulated an early version of our specification on
- various mailing lists towards the end of 1990. When the anticipated
- ``nil return'' was all that resulted, Brian Hamilton Kelly went ahead and
- created a rudimentary @code{VVencode} by modifying an existing VAX Pascal
- implementation of @code{uuencode}. After generating the companion
- @code{VVdecode}, he then re-implemented the programs in Turbo C under the
- MS-DOS operating system on the IBM-PC, and thereby was able to prove that
- the new scheme was both viable and sufficient.
-
- This version didn't support file formats, time stamping, file splitting,
- character sets or CRC checking.
-
-
- @subsection A Production VVcode
-
- Following the minor feasibility study, Niel Kempson re-engineered the pair
- of programs from scratch (adding certain features of the evolving
- specification), paying particular attention to making the code portable
- across a wide variety of operating systems. Particular care was taken to
- avoid the use of supposedly ``standard'' C functions that experience had
- shown to behave differently under individual manufacturers' implementations,
- or even to be absent from some. The code may therefore sometimes appear
- to perform certain operations in a very long-winded way; it is very
- easy to look at it and say ``why didn't the author use the @code{foo()}
- function, which does this much more efficiently?'', but that function may
- not even exist under another implementation of C, or may behave in a subtly
- different manner there.
-
- The core functions of @code{VVcode} are implemented as a collection of
- routines written in as portable a fashion as possible, and a separate
- module of a few operating system specific routines for file I/O,
- timestamping, command-line or other interface, etc. Porting @code{VVcode}
- to a new platform should require only that this latter module be
- re-implemented, in most cases by adapting an existing one.
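The division between portable core and system-specific module can be pictured with the following sketch. It is not taken from the @code{VVcode} sources, which may simply link a different source file on each platform; a table of function pointers is used here only so that the example fits in one file. All names (@code{vv_os_ops}, @code{generic_ops}, @code{vv_copy}) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Operations that a port must supply (assumed, illustrative set). */
typedef struct vv_os_ops {
    FILE *(*open_input)(const char *name, int is_text);
    FILE *(*open_output)(const char *name, int is_text);
    int (*get_timestamp)(const char *name, time_t *stamp);
} vv_os_ops;

/* Fallback implementations using only ISO C library calls. */
FILE *generic_open_input(const char *name, int is_text)
{
    return fopen(name, is_text ? "r" : "rb");
}

FILE *generic_open_output(const char *name, int is_text)
{
    return fopen(name, is_text ? "w" : "wb");
}

int generic_get_timestamp(const char *name, time_t *stamp)
{
    (void)name;
    (void)stamp;
    return -1;  /* ISO C offers no way to read a file's timestamp */
}

const vv_os_ops generic_ops = {
    generic_open_input, generic_open_output, generic_get_timestamp
};

/* A "core" routine: it touches files only through the ops table.
 * Returns the number of bytes copied, or -1 on error. */
long vv_copy(const vv_os_ops *os, const char *from, const char *to,
             int is_text)
{
    FILE *in, *out;
    long n = 0;
    int c;

    assert(os != NULL);
    in = os->open_input(from, is_text);
    if (in == NULL)
        return -1;
    out = os->open_output(to, is_text);
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF) {
            n = -1;
            break;
        }
        ++n;
    }
    fclose(in);
    if (fclose(out) != 0)
        n = -1;
    return n;
}
```

A port to a record-oriented system would replace only the three @code{generic_} functions, for example to open the output file with the record format named in the encoded data, and would leave @code{vv_copy} and the rest of the core untouched.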
-
- @code{VVcode} implements all of the features listed in the specification,
- apart from the ability to generate @code{UUcode} and @code{XXcode}
- compatible files. However, the decoding program is backwards compatible
- and can decode files generated by @code{UUcode} and @code{XXcode}.
-
-
- @subsection Arguments against VVcode
-
- When the advent of the @code{VVcode} system was first aired in the various
- electronic digests, some heated debate followed along the lines that a new
- encoding scheme was unnecessary, since @code{UUcode}/@code{XXcode} sufficed
- @emph{for them}. However, all these correspondents were Unix users who had
- interpreted the @samp{VV} as meaning ``Vax-to-Vax'' by analogy with
- @samp{uu}@footnote{@samp{V} was chosen simply because it followed @samp{U};
- at one time, we'd seriously considered calling it @code{YAFES} --- Yet
- Another File Encoding Scheme!} and who felt that such a scheme should be
- private to VAXen. The authors' reply was to the effect that the encoding
- scheme was intended to support the needs of archives like Aston's, and as
- such, had to provide
-
- @enumerate
- @item an automated tool (it would be somewhat difficult to expect our
- users to be able to tell the encoder what sort of file structure
- it was handling, when this concept was entirely alien to many of
- them);
- @item facilities to encode binaries for many operating systems;
- @item mail server features, such as splitting of large files;
- @item operation across the widest possible combination of platforms.
- @end enumerate
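The splitting of large files, one of the mail server features listed above, comes down to simple arithmetic, sketched here. The sketch works in bytes and ignores the fact that a real encoder would presumably break parts at line boundaries and add headers to each part; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of parts needed so that no part exceeds max_part bytes. */
size_t part_count(size_t total, size_t max_part)
{
    assert(max_part > 0);
    return total / max_part + (total % max_part != 0 ? 1u : 0u);
}

/* Length of part number index (counting from 0): every part is full
 * except possibly the last, which carries the remainder. */
size_t part_length(size_t total, size_t max_part, size_t index)
{
    size_t start;

    assert(max_part > 0);
    assert(index < part_count(total, max_part));
    start = index * max_part;
    return (total - start < max_part) ? total - start : max_part;
}
```

With a maximum part size of 30720 bytes, for example, 70000 bytes of encoded output would be sent as two full parts followed by one part of 8560 bytes.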
-
- The overhead of using the @code{VVcode} system is at most a couple of
- hundred bytes over using @code{UUcode}, and the extra functionality and
- @emph{universality} with respect to @code{UUcode} or @code{XXcode} thereby
- comes almost for free.
-
-
- @c =========================================================================
- @c SECTION: Availability of VVcode
- @c =========================================================================
- @section Availability of VVcode
-
- At present, the @code{VVcode} system is only available in C, but it has
- been shown to run successfully on the following combinations of hardware,
- operating system and compiler:
-
- @table @strong
- @item Macintosh
- At the time of writing (May 1991) John Rawnsley of the University
- of Warwick had commenced development of a Macintosh port, which
- will encode the resource and data forks in a manner that will
- permit the former to be ignored by non-Macintosh systems.
-
- @item MS-DOS
- @itemize @bullet
- @item IBM PS/2, PC (and clones); MS-DOS 3.3, 4.01, 5.00;
- Borland Turbo C 1.5, 2.0, Borland C++ 1.0, 2.0, 3.0 and
- Microsoft C 5.1, 6.0
- @end itemize
-
- @item OS/2
- @itemize @bullet
- @item IBM PS/2, PC (and clones); OS/2 2.0; Microsoft C 6.0 and
- GNU C 2.1
- @end itemize
-
- @item Unix
- @itemize @bullet
- @item Sun 3; SunOS 3.x and 4.0.3; native C and GNU C
- @item Sun Sparcstation 1; SunOS 4.1; native C and GNU C
- @item SCO Unix V/386 v3.2.2, Microsoft C compiler
- @end itemize
-
- @item VAX/VMS
- @itemize @bullet
- @item All VAXen; VMS 5.2--5.4-1; VAX C V3.0--V3.2 and GNU C 1.40
- @end itemize
-
- @item VM/CMS
- @itemize @bullet
- @item VM/CMS; Whitesmith C compiler v1.0. (Rainer Sch@"opf ported this
- implementation, basing it upon the Unix implementation; the port
- took him about one day.)
- @end itemize
-
- @end table
-