ProfitPress Mega CDROM2 Shareware Freeware (MSDOS)(1992)(Eng)

home *** CD-ROM | disk | FTP | other *** search

/ ProfitPress Mega CDROM2 …eeware (MSDOS)(1992)(Eng) / ProfitPress-MegaCDROM2.B6I / UTILITY / FILE / JLPAK10.ZIP / JLPAK.DOC < prev next >

Wrap

Text File | 1990-08-20 | 15.9 KB | 321 lines

JLPAK and JLUNPAK xxxxxxxxxxxxxxxxx 1 Purpose ========== These programs use "forward compression" to compress a file for distribution. The principle is very simple - we assume an old version of the file (text or binary) is already held on the receiving system. The compression algorithm compares the new version with the old and ships only instructions to reconstitute the new version using the old version, resulting in typical compression ratios of between 5 to 1 and 10 to 1, although where there are almost no changes between the two versions, a ratio of almost a thousand to one is possible! JLPAK and JLUNPAK provide the starting point for efficient distribution of versions of a package when the earlier version is already held. The normal procedure would be to pack first with JLPAK, then use a conventional packing program like PKPAK, then convert to ASCII with MKBOO. The receiver reverses the process. The term "forward compression" is used for compression based on an assumption that an earlier version of a file is held at a receiving system. It is thus mainly intended for regular software or text-file updates. The approach has been successfully applied for compression of successive frames of a television transmission, but the author is not aware of any existing general-purpose software package supporting the use of this technique for PC software distribution. Note that programs such as PKPAK usually talk about "percentage gains" - maybe as much as 60% with a following wind, but often only twenty or thirty percent. JLPAK talks about the "compression ratio" - maybe very high, but figures of between 5 and 10 occur regularly for practical distribution of new versions. The present programs, with sources, are placed in the public domain in the hope that faster versions will be produced by others. The present packages demonstrate very clearly the viability of the approach, and the compression ratios that can easily be achieved. Free use and development of the approach is encouraged, including the production of commercial software based on these programmes. All that is asked is acknowledgement of the contribution of The IT Institute University of Salford Salford M5 4WT England and free licencing to the IT Institute of any software produced using this work. Sources (Microsoft QuickBasic) are provided to encourage further development. If you find the programs useful, and particularly if you develop them further, please inform the Administrator of the IT Institute at the above address, or try E-mail to IT01 @ UK.AC.SALFORD.SYSB. 2 Installation =============== The package consists of two .EXE files, JLPAK.EXE and JLUNPAK.EXE, this documentation (JLPAK.DOC), and two .BAS source files JLPAK.BAS and JLUNPAK.BAS. The source files are in Microsoft QuickBasic. 3 Invoking JLPAK.EXE ===================== * Command line * O/M/C * Use of parameter * parameter * * ("M" means mandatory, "O" means optional, "C" means conditional) JLPAK M Command name New file path M The path name for the new version (the file to be compressed) on the sender's system Ref file path M The path name for the old version (the reference file which must also exist on the system of any receiver of the compressed format file Verbosity M The letters B or P for "Brief" or "Progress" messages, and optionally a T for "Trace". In the case of B, a one-line "percentage complete" message is output. In the case of P, a full-screen display showing the state of the compression and the sort of matches being found is displayed. For a full understanding of this, the technical section below needs to be read. In the case of T, a full trace of the matches produced is output to a separate file. This is generally useful only for debugging or further development work. Use "B" normally. FCM file path M The path name for the file (conventionally ".FCM" for forward compression) that is to be produced. This file will normally be compressed further using, for example, PKPAK, then converted to ASCII using, for example, MKBOO, then sent to recipients. Trace file path C This has to be present if "T" is included in the verbosity, and specifies the file to contain the trace information. If "T" is not present, this parameter must be absent. Receiver's new O If omitted, JLUNPAK will (by default) place file path the unpacked file in the path name used for "New file path" when JLPAK was invoked. If present, then this determines the default file to be produced by JLUNPAK (see below). Receiver's ref O If omitted, JLUNPAK will (by default) use file path the path name used for "New file path" when JLPAK was invoked as the reference file for unpacking. If present, then this determines the default reference file to be used by JLUNPAK (see below). 4 Invoking JLUNPAK.EXE ======================= * Command line * O/M/C * Use of parameter * parameter * * JLUNPAK M Command name FCM file path M The path name for the file containing the packed version that was produced by JLPAK. Verbosity O Default is "B", meaning simple percentage complete messages on one line. The other option is "P", giving a full-screen display of progress. Both "B" and "P" may be combined with "T", in which case a trace of the decompression process is produced in a separate file. This is mainly useful for debugging and further development. Trace file path C This has to be present if "T" is included in the verbosity, and specifies the file to contain the trace information. If "T" is not present, this parameter must be absent. Receiver's new O If omitted, JLUNPAK will (by default) place file path the unpacked file in the file determined when JLPAK was invoked (see above), and whose name is carried in the FCM format file. If present, this parameter overrides the name in the FCM format file. Receiver's ref O If omitted, JLUNPAK will (by default) use a file path file identified by the FCM format file as the reference file. If present, this parameter overrides the name in the FCM format file. Workfile path O If omitted, JLUNPAK will first create a file TEMP.TMP in the current directory, unpacking into that, then copy that to the required new file, then delete TEMP.TMP. If present, this parameter specifies the work file to be used. 5 Technical details ==================== 5.1 Overall approach --------------------- 5.1.1 There are undoubtedly other and perhaps better ways of establishing the information needed to reconstitute the new file after transmission. The following approach, implemented in version 1.0 of the package, works adequately, but I would like to encourage experiments to improve on it, as well as simply recoding in assembler or with assembler assistance to speed up the present package. (JLUNPAK is OK - JLPAK is sometimes a bit too slow, even on an AT). 5.1.2 Binary and text files are both treated as binary. End of line is never significant. 5.1.3 16 byte segments of the new file are taken, and are matched against the old file. There are a number of options: a) Type A match: An identical set of 16 bytes exists somewhere in the search region (see below) in the old file; b) Type B match: An identical set of 16 bytes exists somewhere in the search region in the old file, subject to the possible addition of a single fixed value (called a differ) to some of the 16 bytes; c) Type C match: Similar to a type B, but there are two possible differes, one or other of which may need to be added in to each of the sixteen bytes in the segment; d) Type D1 match: Three differs; e) Type D2 match: Four differs; f) No match: It was not possible to find a match within the search region. 5.1.4 The search region is determined by the "resynch state". In resynch state zero (the initial state), every segment is tested against every possible set of 16 bytes of the old file within a 1K window. The window's position is dependent on the location of earlier matches (it is actually centred on a position that moves halfway towards the location of each match that is obtained). 5.1.5 If there are never more than ten no matches between successive matches, we remain in resynch state zero, otherwise we move to resynch state 1. 5.1.6 In resynch state 1, we only attempt to match every nth segment (outputting the rest as "no match" options, but broaden the search window to nK. Thus the search time is not increased. We maintain a "no match count" (initially ten), n is the no match count less nine. On a failure to match (of the segements we attempt), we increment the no match count, and hence increase our window but reduce the proportion of segments we try to match. As soon as we get a match, we not only reduce the no match count by one, but we also go back to trying every segment until there is a failure again, in which case n is applied as above. If the nomatch count gets down to ten, then we reenter resych state zero. If the nomatch count reaches twenty, we go into resynch state two. 5.1.7 In resynch state two, we maintain n at ten, ignoring further failures to match. On a match, we revert to resynch state 1, and again go back to trying every segment. Also, in resynch state 2, we continuously recalculate the position of the centre of the window to maintain it in the right proportional position - that is, if we are 40% through the new file, we centre the window 40% through the old file. 5.1.8 The above algorithm represents a compromise between a poor compression ratio (a lot of "no match" segments output) and an excessive time spent compressing the file. It seems to work OK on an AT with the QuickBasic code. With assembler assist, we could probably improve the algorithm to put more effort into the packing and improve the compression ratio. In particular, more research into appropriate techniques when we reach resynch state 2 could be valuable. 5.2 The FCM file format ------------------------ 5.2.1 The FCM file format is largely independent of the details of the heuristic algorithm used to determine matches. The format is based on the retention of five pieces of state information, a pointer, and four differs. 5.2.2 The basic structure is a) a header (see below); b) a series of single-octet coded commands, each followed by some parameter information for the command (frequently a bit-map); 5.2.3 The header contains a) The path name to be used for the new file after decompression; b) A two-byte checksum (the standard sigma Ai MOD 255 plus sigma iAi MOD 255) is used of the new file; if the results of decompression do not produce a matching checksum, the new file is not written out from the work file; c) The path name to be used for the reference file; d) A two-byte checksum of the reference file; if the purported reference file for decompression does not have the same checksum, then decompression is abandonned. 5.2.4 The coded commands consist of: a) Move the pointer; (four codes, to cope with positive and negative movements and one or two following octets with the movement value); b) Change one of the four differs; (two codes each, for positive and negative values of 0 to 255, held in the following octet); c) Copy up to 64 segments (of 16 bytes each) from the old file at the given pointer position (these commands reflect type A matches); d) Copy up to 8 segments (of 16 bytes each), using a given differ (four codes, one for each differ), with the addition of the differ according to the following bit-map where 0 means no addition and 1 means add; there will be two following octets for each segment; this corresponds to type B matches; e) A similar set of six codes require the use of two out of the four differs to be added in, with the bit-map being variable length (it ends when we have got all we need), a zero bit meaning no addition, a 10 meaning the first of the selected pair of differs (selected by the code), and a 11 the second of the selected pair of differs. Again, we cope with up to 8 segments in one command; f) A further set of four codes which deal with type D1 and D2 matches (three or four differs); in retrospect, we could probably have got away with treating all three differ cases as four differ cases, which could have given PKPAK more scope for further compression of the FCM format file; g) A code which specifies the copying of up to 64 16-byte segments from the FCM format file itself (no match cases); h) A code to flag the discard of parts of the old file for buffer management purposes (the software handles arbitrarily large files, using 30K buffers); i) A code to handle termination, and any remaining octets after the last full 16-byte segment (no attempt is made to match these, they are always sent explicitly). 5.2.5 For further details of formats or operation, the source code should be consulted. JL 20 August 1990