OS/2 Professional

Text File | 1993-01-07 | 23.1 KB | 334 lines

Introduction to VVencode/VVdecode

Table of Contents

1 Introduction to VVcode
  1.1 The Aston Archive
  1.2 Specification for a Coding Scheme
  1.3 The Search Commences
    1.3.1 Portable Coding Schemes
    1.3.2 Platform Specific Coding Schemes
  1.4 VVcode is Born
    1.4.1 A Production VVcode
    1.4.2 Arguments against VVcode
  1.5 Availability of VVcode

1 Introduction to VVcode

Reliable and faithful exchange of binary files between computers over networks is a well-known problem, especially if the computers use different operating systems and are connected to different networks via a gateway. Unfortunately inter-networking and electronic mail are very much children of the 1960s: they might have had to wait until the 1970s for their naissance, but their progenitors were mentally locked-in to the concept of the 7-bit ASCII code for conveying textual information.

The TeX community has long been aware of this problem when trying to exchange "machine-independent" `.dvi' files and font-related data such as `.tfm' and `.pk' files. It has sometimes been possible to exchange this binary data by using encoding schemes that allow the data to be represented using a subset of the seven-bit ASCII character set.

Academics and authors in many fields have hitherto been able to pass `.tex' files back-and-forth by electronic mail; apart from a few minor quirks and blemishes, such TeX source files pass unharmed across the planet's networks. Problems are encountered when mail passes through certain gateway machines which introduce irreversible character corruptions.
Particularly notorious is the Janet/Bitnet gateway which has the unfortunate habit of converting `^' to `"' and `"' to `%': since it leaves `%' itself unaffected, this makes recovery of the original file a non-trivial exercise. It sometimes also changes the brace characters `{}' into odd characters above 128: this is particularly embarrassing, of course, for `.tex' files!

For some years many TeX users, particularly those working in languages other than English, and thus familiar with character set encodings containing other than the basic ASCII set, have been agitating for TeX to be able to handle input in their mother tongues, using their own languages' character sets.

In 1989, Knuth announced TeX V3, and implementors world-wide beavered away to bring each implementation up-to-date. TeX V3 now supports eight-bit character sets and so `.tex' source files are now effectively `binary' files and will therefore suffer from the same exchange problems experienced with `.dvi' files.

All those authors that had previously been able to cooperate, despite being separated by hundreds or thousands of miles, might once again be forced to entrust floppy disks to the vagaries of the world's postal systems (although one shouldn't underestimate the bandwidth of the Royal [or other] Mail system).

Unless or until the various e-mail protocols, networks and software are converted to support uncorrupted transmission of character codes 0x20...0x7e and 0xa1...0xfe, it will have to become the norm for `.tex' sources to be encoded for transmission by e-mail. This problem is of course well known outside the TeX community.

1.1 The Aston Archive

The author is a volunteer assistant to Peter Abbott in running the world's principal repository of TeX-related material at Aston University in Birmingham.
The archive (host: TeX.Ac.Uk) holds several hundred megabytes of text and binary files including:

o program sources for TeX, METAFONT, DVI drivers and many other utilities;
o binary executables for a variety of popular operating systems (e.g. Atari, Macintosh, MS-DOS, Unix, VAX/VMS and VM/CMS);
o METAFONT sources for Computer Modern and other fonts;
o binary font files (mainly `.tfm' and `.pk') for a number of different output devices;
o text macro and style files.

The archive provides access to these files via the following services:

o NIFTP [1] from Janet hosts: typically 300 megabytes of data are transferred every month; this would probably be much greater if we were not limited by the bandwidth of our 9600Bd connection to Janet.
o FTP and Telnet access from Internet hosts.
o Interactive browsing service via Janet PAD, including the facility to send files out using NIFTP (and later FTP).
o Interactive browsing service via dialup modem lines, including the facility to download files using Kermit and similar protocols.
o An e-mail file server which typically sends 150 megabytes of data per month to sites all over the world (though predominantly to EARN/Bitnet sites).
o A magnetic media distribution service via surface carriers. Copies of the entire archive have been sent to embryonic TeX communities in Czechoslovakia, Hungary and Poland.

We have experienced many problems trying to support all of these file types, operating systems and access methods.

The e-mail file server clearly needs a reliable method of encoding files if its many customers are not to be denied access to the non-text files in the archive.

Binary files such as `.pk' font files are stored in different ways to accommodate the requirements of the different operating systems supported. Currently we maintain multiple font directory trees for the Macintosh, MS-DOS, Unix and VAX/VMS with all the attendant problems of synchronization, disk space and archivists' time.
We need a single storage format which allows export to all of our supported operating systems.

_________________________________________
[1] Network Independent File Transfer Protocol: in the UK, one does not perform the pseudo-login that Internet users are accustomed to using with the FTP protocol; instead, one issues a "transfer request" for a file to be sent to or from the remote machine, and the transfer itself takes place asynchronously. One nice consequence is that such transfers can be queued for overnight execution, leaving daytime bandwidth free for e-mail and true remote interactive logins.

1.2 Specification for a Coding Scheme

In mid-1990, the archivists came to the conclusion that a universal encoding scheme was required to accommodate the many different kinds of file and file organizations that needed to be supported by the archive. Niel Kempson formulated the first draft of this specification in mid-1990; the requirements of the encoding scheme may be summarized as follows:

Preserving File Structure

It is insufficient, especially for an archive holding text and binary files for a variety of machine types, merely to encode data simply as a stream of bytes:

o Virtually all operating systems (except Unix) make a distinction between binary and text files, so the coding system should recognize and maintain this distinction.
o Unix and most PC-based operating systems treat files as streams of bytes with no further structure imposed. On the other hand, certain widely-used operating systems (e.g. VAX/VMS and VM/CMS) have record-oriented file systems where different types of file are stored in a format appropriate to the type of file [2]. For these operating systems, we consider it essential that the encoding scheme should identify, preserve and record the most commonly used file organizations. The decoding program should be able to use this information to create the output file using the organization appropriate to the operating system in use.
  If the information is of no consequence to the receiving system, the default file structure (if any) should be created. If the encoding system does not have structure in its files, the receiving system may provide suitable defaults automatically. In all cases the programs should permit the user to override or supplement file structure information.
o Whenever possible, these details of structure should be determined automatically by the encoding program; at the very least, an indication of whether the file is text or binary shall be provided, even under an operating system such as Unix that need make no such distinction for its own use, to allow decoding to an appropriate file organization on those systems that do make such a distinction.

_________________________________________
[2] It is often argued that the increase in efficiency more than offsets the increase in complexity.

Coding Scheme

Whatever method is used must allow encoded data to be e-mailed:

o It should be possible to specify the coding table to be used to encode the data. The coding table used shall be recorded with each part of the encoded data.
o If a recorded coding table is found while decoding, it should be used to construct an appropriate decoding table. Simple one-to-one character corruptions should be corrected as long as only one of the input characters is mapped to any one output character.
o The recommended encoding uses only the following characters:

      +-0123456789
      abcdefghijklmnopqrstuvwxyz
      ABCDEFGHIJKLMNOPQRSTUVWXYZ

  Such an encoding, as originally used for XXcode, has been shown to pass successfully through all the gateways which are known to corrupt characters.

Integrity of Encoded Data

We want to ensure that the whole encoded file passes through the e-mail network.

o Encoded lines should be prefixed by an appropriate character string to distinguish them from unwanted lines such as mail headers and trailers.
  Whilst not essential, this feature does assist the decoding program in ignoring these spurious data.
o Lines should not end with whitespace characters as some mailers and operating systems strip off trailing whitespace.
o The encoding program should calculate parameters of the input file such as the number of bytes and CRC and record them at the end of the encoded data. The decoding program should calculate the same parameters from the decoded data and compare the values obtained with those recorded at the end of the encoded data.

Making Files Mailable

A mechanism is needed to overcome some gateways' refusal to handle large files.

o The encoding program should be able to split the encoded output into parts, each no larger than a maximum specified size. Splitting the output into smaller parts is useful if the encoded data is to be transmitted using electronic mail or over unreliable network links that do not stay up long enough to transmit a large file. The recommended default maximum part size is 30kB.
o The decoding program should be able to decode a multi-part encoded file very flexibly. It should not be necessary to:
    1. strip out mail headers and trailers;
    2. combine all of the parts into one file in the correct order;
    3. process each part of the encoded data as a separate file.
o In addition, any file specifications from the operating system on which the VVE file was created must not prevent the file from being decoded.

Miscellaneous

Further considerations include:

o Support for character sets other than ASCII is essential if the encoding scheme is to be useful to IBM hosts. The encoding program should label the character set used by the encoded data, and both encoder and decoder should enable the conversion between the local character set and another character set. For example, a user on an EBCDIC host should be able to encode text files for transmission to another EBCDIC host, or to convert them to ASCII before encoding and transmission to an ASCII host.
  Similarly, that user should be able to decode text files from ASCII and EBCDIC machines, creating EBCDIC output files.
o Where possible, the original file's timestamp should be encoded and used by the decoding program when recreating the file: this will permit archives to retain the originator's time of creation for files, and thus permit the users (not to mention the archivists) to identify more clearly when a new version of a file has been made available. Timezones should be supported where possible.
o The encoding and decoding schemes should be able to read and write files that are compatible with one or more of the well established coding schemes (e.g. UUcode, XXcode).
o The source code for the programs should be freely available. It should also be portable and usable with as many computers, operating systems and compilers as possible.

1.3 The Search Commences

Naturally, the first step was to examine the existing coding schemes in comparison with the above ideal specification. Such schemes fell into two broad classes: portable schemes, which were intended to permit the encoding of files on any computer architecture into a form that could be transmitted electronically, and decoded on the same or a different architecture; and platform-specific schemes, which provided rather better support for transferring files between two computers using the same architecture and operating system.

1.3.1 Portable Coding Schemes

The most commonly used coding schemes supported by a variety of platforms are:

o BOO
o UU
o XX

Most implementations of these schemes known to the authors are designed for use with stream file systems. These programs have no means of recording, let alone preserving, record structure and are thus unsuitable for our purposes. This is not surprising since UUcode and its mutation XXcode were developed specifically for exchanging files between Unix systems.
In fairness to these schemes, they are well suited to the transmission of text files and certain unstructured binary files.

Standard UUcode encodes files using characters ` ' ... `_' of ASCII. This can result in one or more spaces appearing at the ends of lines: some mailers decide that this is information not worth transmitting, with consequent inability to reconstruct the original file. Files containing characters such as `^' are often irreversibly corrupted by mail gateways; this problem led to the development of XXcode which uses a rather more robust character set, namely:

      +-0123456789
      abcdefghijklmnopqrstuvwxyz
      ABCDEFGHIJKLMNOPQRSTUVWXYZ

The encoding table used is recorded with the encoded data to allow the detection of character corruptions, and the correction of reversible character transpositions.

Whilst superficially a step forward, XXcode offered little more than most existing versions of UUcode, which already supported coding tables. Its major contribution was in formalizing the encoding table, and in particular its default table was proof against all the known gateway-induced corruptions.

1.3.2 Platform Specific Coding Schemes

Encoding schemes have been developed to support transfer of files possessing some structure which therefore cannot be reconstructed correctly when encoded by the portable schemes. When the encoding and decoding programs of such a platform specific scheme are each used on the same computer and operating system type, files may be encoded and transmitted with a great deal of confidence that the decoded file will reproduce the original's structure and attributes in their entirety.

Examples of such programs are TELCODE and MFTU for VMS, NETDATA for IBM mainframes, and Stuffit and MacBinary for the Macintosh.
But these programs have the major disadvantage that they have each been implemented only on the single architecture for which they were designed: thus the only two of these schemes that could be used on the VMS-based Aston Archive would be of minimal interest elsewhere!

The Archive's content is in some respects artificially inflated by the presence of `.hqx' files for Macintoshes, `.boo' for MS-DOS, etc., which have to be held in pre-encoded form for transfer by those requiring them.

1.4 VVcode is Born

Realizing that none of the existing portable schemes were close enough to our ideal, an early version of our specification was circulated on various mailing lists by Niel Kempson towards the end of 1990. When the anticipated "nil return" was all that resulted, Brian Hamilton Kelly went ahead and created a rudimentary VVencode by modifying an existing VAX Pascal implementation of uuencode. After generating the companion VVdecode, he then re-implemented the programs in Turbo C under the MS-DOS operating system on the IBM-PC, and thereby was able to prove that the new scheme was both viable and sufficient. This version didn't support file formats, time stamping, file splitting, character sets or CRC checking.

1.4.1 A Production VVcode

Following the minor feasibility study, Niel Kempson re-engineered the pair of programs from scratch (adding certain features of the evolving specification), paying particular attention to making the code portable across a wide variety of operating systems. Particular care was taken to avoid the use of supposedly "standard" C functions that experience had shown behaved differently under individual manufacturers' implementations, or were even non-existent in some.
Therefore the code may sometimes appear to be performing certain operations in a very long-winded way; it's very easy to look at it and say "why didn't the author use the foo() function, which does this much more efficiently?", but this function may not even exist under another implementation of C, or may behave in a subtly different manner.

The core functions of VVcode are implemented as a collection of routines written in as portable a fashion as possible; a separate module holds the few operating-system-specific routines for file I/O, timestamping, the command-line or other interface, etc. Porting VVcode to a new platform should require only that this latter module be re-implemented, in most cases by adapting an existing one.

VVcode implements all of the features listed in the specification, apart from the ability to generate UUcode and XXcode compatible files. However, the decoding program is backwards compatible and can decode files generated by UUcode and XXcode.

1.4.2 Arguments against VVcode

When the advent of the VVcode system was first aired in the various electronic digests, some heated debate followed along the lines that a new encoding scheme was unnecessary, since UUcode/XXcode sufficed for them. However, all these correspondents were Unix users who had interpreted the `VV' as meaning "Vax-to-Vax" by analogy with `uu' [3] and who felt that such a scheme should be private to VAXen.

The authors' reply was to the effect that the encoding scheme was intended to support the needs of archives like Aston's, and as such, had to provide

1. an automated tool (it would be somewhat difficult to expect our users to be able to tell the encoder what sort of file structure it was handling, when this concept was entirely alien to many of them);
2. facilities to encode binaries for many operating systems;
3. mail server features, such as splitting of large files;
4. operation across the widest possible combination of platforms.
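
The splitting of large files mentioned in item 3 can be illustrated with a minimal sketch. It is not the VVencode implementation (which is written in C) and does not reproduce the real part headers or trailers; the function name `split_into_parts`, the parameter `max_part_bytes` and the reading of the recommended 30kB default as 30 * 1024 bytes are all assumptions made for illustration.

```python
def split_into_parts(encoded_lines, max_part_bytes=30 * 1024):
    """Group already-encoded lines into parts of at most max_part_bytes.

    Hypothetical helper: lines are never broken in the middle, so a single
    line longer than the limit would still get a part to itself.
    """
    parts, current, size = [], [], 0
    for line in encoded_lines:
        line_bytes = len(line.encode("ascii")) + 1  # +1 for the line terminator
        if current and size + line_bytes > max_part_bytes:
            parts.append(current)   # close the part that is now full
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        parts.append(current)       # last, possibly short, part
    return parts
```

Splitting on line boundaries keeps each part independently mailable and lets a decoder skip mail headers between parts without having to rejoin a broken line.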

The overhead of using the VVcode system is at most a couple of hundred bytes over using UUcode, and the extra functionality and universality with respect to UUcode or XXcode thereby comes almost for free.

_________________________________________
[3] `V' was chosen simply because it followed `U'; at one time, we'd seriously considered calling it YAFES: Yet Another File Encoding Scheme!

1.5 Availability of VVcode

At present, the VVcode system is only available in C, but it has been shown to run successfully on the following combinations of hardware, operating system and compiler:

Macintosh
  At the time of writing (May 1991) John Rawnsley of the University of Warwick had commenced development of a Macintosh port, which will encode the resource and data forks in a manner that will permit the former to be ignored by non-Macintosh systems.

MS-DOS
  o IBM PS/2, PC (and clones); MS-DOS 3.3, 4.01, 5.00; Borland Turbo C 1.5, 2.0, Borland C++ 1.0, 2.0, 3.0 and Microsoft C 5.1, 6.0

OS/2
  o IBM PS/2, PC (and clones); OS/2 2.0; Microsoft C 6.0 and GNU C 2.1

Unix
  o Sun 3; SunOS 3.x and 4.0.3; native C and GNU C
  o Sun Sparcstation 1; SunOS 4.1; native C and GNU C
  o SCO Unix V/386 v3.2.2, Microsoft C compiler

VAX/VMS
  o All VAXen; VMS 5.2-5.4-1; VAX C V3.0-V3.2 and GNU C 1.40

VM/CMS
  o VM/CMS; Whitesmith C compiler v1.0 (This implementation was ported by Rainer Schöpf; basing it upon the Unix implementation, this took him about one day.)
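
As an illustration of the coding-table requirement in section 1.2 (a recorded table lets the decoder undo one-to-one character corruptions), the following sketch may help. It is not taken from the VVcode sources: the names `DEFAULT_TABLE`, `build_decode_table` and `decode_values` are hypothetical, and the ordering of the 64 characters is assumed to be the one used by common xxencode implementations. It also assumes that the recorded table travels through the same gateway as the data, so both suffer the same substitutions.

```python
import string

# Assumed ordering: '+' encodes value 0, '-' value 1, then digits,
# upper-case letters and lower-case letters, ending with 'z' as value 63.
DEFAULT_TABLE = "+-" + string.digits + string.ascii_uppercase + string.ascii_lowercase


def build_decode_table(received_table):
    """Map each character of the table, as received, to its 6-bit value."""
    if len(received_table) != 64:
        raise ValueError("coding table must contain exactly 64 characters")
    if len(set(received_table)) != 64:
        # Two table characters arrived as the same character, so the
        # corruption is not one-to-one and cannot be reversed.
        raise ValueError("coding table corruption is not reversible")
    return {ch: value for value, ch in enumerate(received_table)}


def decode_values(data, decode_table):
    """Translate received characters back into their 6-bit values."""
    return [decode_table[ch] for ch in data]
```

If a gateway rewrote every `+' as `#', the received table would begin with `#', so `#' would decode to value 0 and the data would be recovered unchanged. A gateway that mapped two table characters onto the same character would trigger the error, matching the specification's condition that only one input character may map to any one output character.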