home *** CD-ROM | disk | FTP | other *** search
-
- JLPAK and JLUNPAK
- xxxxxxxxxxxxxxxxx
-
- 1 Purpose
- ==========
-
- These programs use "forward compression" to compress a file for
- distribution. The principle is very simple - we assume an old version of
- the file (text or binary) is already held on the receiving system. The
- compression algorithm compares the new version with the old and ships only
- instructions to reconstitute the new version using the old version,
- resulting in typical compression ratios of between 5 to 1 and 10 to 1,
- although where there are almost no changes between the two versions, a
- ratio of almost a thousand to one is possible!
-
- JLPAK and JLUNPAK provide the starting point for efficient distribution of
- versions of a package when the earlier version is already held. The normal
- procedure would be to pack first with JLPAK, then use a conventional
- packing program like PKPAK, then convert to ASCII with MKBOO. The receiver
- reverses the process.
-
- The term "forward compression" is used for compression based on an
- assumption that an earlier version of a file is held at a receiving system.
- It is thus mainly intended for regular software or text-file updates. The
- approach has been successfully applied for compression of successive frames
- of a television transmission, but the author is not aware of any existing
- general-purpose software package supporting the use of this technique for PC
- software distribution.
-
- Note that programs such as PKPAK usually talk about "percentage gains" -
- maybe as much as 60% with a following wind, but often only twenty or thirty
- percent. JLPAK talks about the "compression ratio" - maybe very high, but
- figures of between 5 and 10 occur regularly for practical distribution of
- new versions.
-
- The present programs, with sources, are placed in the public domain in the
- hope that faster versions will be produced by others. The present packages
- demonstrate very clearly the viability of the approach, and the compression
- ratios that can easily be achieved. Free use and development of the
- approach is encouraged, including the production of commercial software
- based on these programmes. All that is asked is acknowledgement of the
- contribution of
-
- The IT Institute
- University of Salford
- Salford M5 4WT
- England
-
- and free licencing to the IT Institute of any software produced using this
- work. Sources (Microsoft QuickBasic) are provided to encourage further
- development.
-
- If you find the programs useful, and particularly if you develop them
- further, please inform the Administrator of the IT Institute at the above
- address, or try E-mail to IT01 @ UK.AC.SALFORD.SYSB.
-
- 2 Installation
- ===============
-
- The package consists of two .EXE files, JLPAK.EXE and JLUNPAK.EXE, this
- documentation (JLPAK.DOC), and two .BAS source files JLPAK.BAS and
- JLUNPAK.BAS. The source files are in Microsoft QuickBasic.
-
- 3 Invoking JLPAK.EXE
- =====================
-
- * Command line * O/M/C * Use of parameter
- * parameter * *
-
- ("M" means mandatory, "O" means optional, "C" means conditional)
-
- JLPAK M Command name
-
- New file path M The path name for the new version (the file
- to be compressed) on the sender's system
-
- Ref file path M The path name for the old version (the
- reference file which must also exist on the
- system of any receiver of the compressed
- format file
-
- Verbosity M The letters B or P for "Brief" or "Progress"
- messages, and optionally a T for "Trace".
- In the case of B, a one-line "percentage
- complete" message is output. In the case of
- P, a full-screen display showing the state
- of the compression and the sort of matches
- being found is displayed. For a full
- understanding of this, the technical
- section below needs to be read. In the case
- of T, a full trace of the matches produced
- is output to a separate file. This is
- generally useful only for debugging or
- further development work. Use "B" normally.
-
- FCM file path M The path name for the file (conventionally
- ".FCM" for forward compression) that is to
- be produced. This file will normally be
- compressed further using, for example,
- PKPAK, then converted to ASCII using, for
- example, MKBOO, then sent to recipients.
-
- Trace file path C This has to be present if "T" is included in
- the verbosity, and specifies the file to
- contain the trace information. If "T" is
- not present, this parameter must be absent.
-
- Receiver's new O If omitted, JLUNPAK will (by default) place
- file path the unpacked file in the path name used for
- "New file path" when JLPAK was invoked. If
- present, then this determines the default
- file to be produced by JLUNPAK (see below).
-
- Receiver's ref O If omitted, JLUNPAK will (by default) use
- file path the path name used for "New file path" when
- JLPAK was invoked as the reference file for
- unpacking. If present, then this
- determines the default reference file to be
- used by JLUNPAK (see below).
-
- 4 Invoking JLUNPAK.EXE
- =======================
-
- * Command line * O/M/C * Use of parameter
- * parameter * *
-
- JLUNPAK M Command name
-
- FCM file path M The path name for the file containing the
- packed version that was produced by JLPAK.
-
- Verbosity O Default is "B", meaning simple percentage
- complete messages on one line. The other
- option is "P", giving a full-screen display
- of progress. Both "B" and "P" may be
- combined with "T", in which case a trace of
- the decompression process is produced in a
- separate file. This is mainly useful for
- debugging and further development.
-
- Trace file path C This has to be present if "T" is included in
- the verbosity, and specifies the file to
- contain the trace information. If "T" is
- not present, this parameter must be absent.
-
- Receiver's new O If omitted, JLUNPAK will (by default) place
- file path the unpacked file in the file determined
- when JLPAK was invoked (see above), and
- whose name is carried in the FCM format
- file. If present, this parameter overrides
- the name in the FCM format file.
-
- Receiver's ref O If omitted, JLUNPAK will (by default) use a
- file path file identified by the FCM format file as
- the reference file. If present, this
- parameter overrides the name in the FCM
- format file.
-
- Workfile path O If omitted, JLUNPAK will first create a
- file TEMP.TMP in the current directory,
- unpacking into that, then copy that to the
- required new file, then delete TEMP.TMP.
- If present, this parameter specifies the
- work file to be used.
-
- 5 Technical details
- ====================
-
- 5.1 Overall approach
- ---------------------
-
- 5.1.1 There are undoubtedly other and perhaps better ways of establishing
- the information needed to reconstitute the new file after transmission.
- The following approach, implemented in version 1.0 of the package, works
- adequately, but I would like to encourage experiments to improve on it, as
- well as simply recoding in assembler or with assembler assistance to speed
- up the present package. (JLUNPAK is OK - JLPAK is sometimes a bit too slow,
- even on an AT).
-
- 5.1.2 Binary and text files are both treated as binary. End of line is
- never significant.
-
- 5.1.3 16 byte segments of the new file are taken, and are matched against
- the old file. There are a number of options:
-
- a) Type A match: An identical set of 16 bytes exists somewhere in the
- search region (see below) in the old file;
-
- b) Type B match: An identical set of 16 bytes exists somewhere in the
- search region in the old file, subject to the possible addition of a
- single fixed value (called a differ) to some of the 16 bytes;
-
- c) Type C match: Similar to a type B, but there are two possible
- differes, one or other of which may need to be added in to each of the
- sixteen bytes in the segment;
-
- d) Type D1 match: Three differs;
-
- e) Type D2 match: Four differs;
-
- f) No match: It was not possible to find a match within the search
- region.
-
- 5.1.4 The search region is determined by the "resynch state". In resynch
- state zero (the initial state), every segment is tested against every
- possible set of 16 bytes of the old file within a 1K window. The window's
- position is dependent on the location of earlier matches (it is actually
- centred on a position that moves halfway towards the location of each match
- that is obtained).
-
- 5.1.5 If there are never more than ten no matches between successive
- matches, we remain in resynch state zero, otherwise we move to resynch
- state 1.
-
- 5.1.6 In resynch state 1, we only attempt to match every nth segment
- (outputting the rest as "no match" options, but broaden the search window
- to nK. Thus the search time is not increased. We maintain a "no match
- count" (initially ten), n is the no match count less nine. On a failure to
- match (of the segements we attempt), we increment the no match count, and
- hence increase our window but reduce the proportion of segments we try to
- match. As soon as we get a match, we not only reduce the no match count by
- one, but we also go back to trying every segment until there is a failure
- again, in which case n is applied as above. If the nomatch count gets down
- to ten, then we reenter resych state zero. If the nomatch count reaches
- twenty, we go into resynch state two.
-
- 5.1.7 In resynch state two, we maintain n at ten, ignoring further
- failures to match. On a match, we revert to resynch state 1, and again go
- back to trying every segment. Also, in resynch state 2, we continuously
- recalculate the position of the centre of the window to maintain it in the
- right proportional position - that is, if we are 40% through the new file,
- we centre the window 40% through the old file.
-
- 5.1.8 The above algorithm represents a compromise between a poor
- compression ratio (a lot of "no match" segments output) and an excessive
- time spent compressing the file. It seems to work OK on an AT with the
- QuickBasic code. With assembler assist, we could probably improve the
- algorithm to put more effort into the packing and improve the compression
- ratio. In particular, more research into appropriate techniques when we
- reach resynch state 2 could be valuable.
-
- 5.2 The FCM file format
- ------------------------
-
- 5.2.1 The FCM file format is largely independent of the details of the
- heuristic algorithm used to determine matches. The format is based on the
- retention of five pieces of state information, a pointer, and four
- differs.
-
- 5.2.2 The basic structure is
-
- a) a header (see below);
-
- b) a series of single-octet coded commands, each followed by some
- parameter information for the command (frequently a bit-map);
-
- 5.2.3 The header contains
-
- a) The path name to be used for the new file after decompression;
-
- b) A two-byte checksum (the standard sigma Ai MOD 255 plus sigma iAi
- MOD 255) is used of the new file; if the results of decompression do
- not produce a matching checksum, the new file is not written out from
- the work file;
-
- c) The path name to be used for the reference file;
-
- d) A two-byte checksum of the reference file; if the purported
- reference file for decompression does not have the same checksum, then
- decompression is abandonned.
-
- 5.2.4 The coded commands consist of:
-
- a) Move the pointer; (four codes, to cope with positive and negative
- movements and one or two following octets with the movement value);
-
- b) Change one of the four differs; (two codes each, for positive and
- negative values of 0 to 255, held in the following octet);
-
- c) Copy up to 64 segments (of 16 bytes each) from the old file at the
- given pointer position (these commands reflect type A matches);
-
- d) Copy up to 8 segments (of 16 bytes each), using a given differ
- (four codes, one for each differ), with the addition of the differ
- according to the following bit-map where 0 means no addition and 1
- means add; there will be two following octets for each segment; this
- corresponds to type B matches;
-
- e) A similar set of six codes require the use of two out of the four
- differs to be added in, with the bit-map being variable length (it
- ends when we have got all we need), a zero bit meaning no addition, a
- 10 meaning the first of the selected pair of differs (selected by the
- code), and a 11 the second of the selected pair of differs. Again,
- we cope with up to 8 segments in one command;
-
- f) A further set of four codes which deal with type D1 and D2 matches
- (three or four differs); in retrospect, we could probably have got
- away with treating all three differ cases as four differ cases, which
- could have given PKPAK more scope for further compression of the FCM
- format file;
-
- g) A code which specifies the copying of up to 64 16-byte segments
- from the FCM format file itself (no match cases);
-
- h) A code to flag the discard of parts of the old file for buffer
- management purposes (the software handles arbitrarily large files,
- using 30K buffers);
-
- i) A code to handle termination, and any remaining octets after the
- last full 16-byte segment (no attempt is made to match these, they are
- always sent explicitly).
-
- 5.2.5 For further details of formats or operation, the source code should
- be consulted.
-
- JL
-
- 20 August 1990
-