Technical Abstract

CRUNCH 1.x maintained a table representing up to 4096 strings of
varying lengths using the so-called LZW algorithm, which has been
described in the earlier documentation. These strings were entered
into the table in a manner where the string's content was used to
determine the physical location (hashing), and that location was
used as the output code. Hash "collisions" were resolved by
maintaining another 12 bits of data per entry, which served as a
"link", or pointer, to another entry.

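A minimal sketch, in C, of such a hash-addressed table with 12-bit
collision links; the names and the hash function here are
illustrative, not CRUNCH's actual code:

    #include <stdint.h>

    #define TABLE_SIZE 4096
    #define NO_LINK    0xFFF            /* end-of-chain sentinel */

    struct entry {
        uint16_t prefix;                /* code of string minus last byte */
        uint8_t  suffix;                /* last byte of the string */
        uint16_t link;                  /* 12-bit pointer to next entry */
        uint8_t  used;
    };

    static struct entry table[TABLE_SIZE];

    /* Find (prefix, suffix); the slot index found doubles as the
       output code, since the hash fixes the physical location. */
    static int find_code(uint16_t prefix, uint8_t suffix)
    {
        uint16_t i = (prefix ^ ((uint16_t)suffix << 4)) % TABLE_SIZE;

        while (table[i].used) {
            if (table[i].prefix == prefix && table[i].suffix == suffix)
                return i;               /* found: i is the code */
            if (table[i].link == NO_LINK)
                break;                  /* end of collision chain */
            i = table[i].link;
        }
        return -1;                      /* string not in table */
    }
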
In contrast, CRUNCH 2.x uses an "open-addressing, double hashing"
method similar to that employed in the UNIX COMPRESS. This method
involves a table of length 5003, where only 4096 entries are ever
made, ensuring the table never gets much more than about 80% full.
When a hash collision occurs, a secondary hash function is used to
check a series of additional entries until an empty entry is
encountered. This method creates a table filled with many
criss-crossed "virtual" chains, without the use of a "link" entry
in the table.

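The probing discipline can be sketched as follows, assuming (in
the spirit of UNIX COMPRESS) a 5003-slot table and a key packed
from the prefix code and the new byte; identifiers are
illustrative:

    #include <stdint.h>

    #define HSIZE    5003               /* prime table length */
    #define MAXCODES 4096               /* caller stops inserting here */

    static int32_t keys[HSIZE];         /* packed keys, preset to -1 */

    /* Return the slot holding the key, or the empty slot where it
       belongs.  The secondary hash supplies the probe step, so
       colliding strings walk criss-crossed chains with no stored
       link fields. */
    static int probe(uint16_t prefix, uint8_t suffix)
    {
        int32_t key  = ((int32_t)suffix << 12) | prefix;
        int     i    = (int)(key % HSIZE);         /* primary hash */
        int     step = (i == 0) ? 1 : HSIZE - i;   /* never zero */

        while (keys[i] != -1 && keys[i] != key) {
            i -= step;                  /* next slot on the chain */
            if (i < 0)
                i += HSIZE;
        }
        return i;
    }
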
One reason this is important is that [without using any additional
memory] the 1 1/2 bytes which were previously allocated as a link
can now become the [output] code number. This enables us to assign
code numbers, which are kept right alongside the entry itself,
independently of the entry's physical location. This allows the
codes to be assigned in order, permitting the use of 9-bit
representations until there are 512 codes in the table, after
which 10-bit representations are output, etc.

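The width rule amounts to only a few lines. In this sketch the
12-bit ceiling matches the 4096-entry table; treating code 256 as
the first assignable code (ignoring any reserved control codes) is
an assumption:

    static int next_code = 256;         /* next logical code to assign */
    static int code_bits = 9;           /* current output width */

    static void assign_next_code(void)
    {
        next_code++;
        /* once 512 codes exist, 9 bits can no longer name a new
           one; likewise 1024 -> 11 bits, 2048 -> 12 bits */
        if (next_code >= (1 << code_bits) && code_bits < 12)
            code_bits++;
    }
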
The variable bit length method has three ramifications. It is
particularly helpful when encoding very short files, where the
table never even fills up. It also provides a fixed additional
savings (not insubstantial) even when the table does fill up.
Thirdly, it reduces the overhead associated with an "adaptive
reset" to the point where it becomes a very viable alternative.
"Adaptive reset" simply means throwing out the whole table and
starting over. This can be quite advantageous when used properly.
CRUNCH v2.x employs this provision, which was not incorporated in
the v1.x algorithm.

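The mechanics of a reset can be sketched as below. The reserved
clear code and the falling-ratio trigger are assumptions in the
style of UNIX COMPRESS, not necessarily CRUNCH's exact rule:

    #define CLEAR_CODE 256              /* assumed reserved code */

    extern void output_code(int code);  /* assumed helper: emit code */
    extern void init_table(void);       /* assumed helper: empty table */

    /* A reset pays off when the table has gone stale: the data's
       character has drifted and the ratio has stopped improving. */
    static void maybe_reset(long bytes_in, long bytes_out)
    {
        static long best_ratio;
        long ratio = bytes_out ? (bytes_in * 100) / bytes_out : 0;

        if (ratio > best_ratio) {
            best_ratio = ratio;         /* table still earning its keep */
        } else {
            output_code(CLEAR_CODE);    /* signal the uncruncher */
            init_table();               /* start over at 9-bit codes */
            best_ratio = 0;
        }
    }
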
- "Code reassignment" is an advancement I introduced with the re-
- lease of CRUNCH v2.0 based on original work. It is not used in
- COMPRESS, any MS-DOS ARC program, or [to the best of my know-
- ledge] any other data compression utility currently available.
- There are many ways one might go about this (and at least as many
- possible pitfalls). The algorithm I selected seemed to represent
- a good tradeoff between speed, memory used, and improved perfor-
- mance, while maintaining "soundness of algorithm" (ie it works).
-
-
Briefly, it works as follows: Once the table fills up, the code
reassignment process begins. (At this same time, the possibility
of adaptive reset is also enabled.) Whenever a new code would
otherwise be made (if the table weren't full), the entries along
the hash chain which would normally contain the entry are scanned.
The first, if any, of those entries which was made but never
subsequently referenced is bumped in favor of the new entry. The
uncruncher, which would not normally need to perform any hash-type
function, has an auxiliary physical-to-logical translation table,
where it simulates the hashing going on in the cruncher. In this
fashion it is able to exactly reproduce the reassignments made by
the cruncher, which is essential.

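The cruncher's half of the scheme can be sketched as follows, with
hypothetical names; each entry carries a "referenced" flag that is
set whenever the entry's string is extended into a longer one:

    #include <stdint.h>

    #define HSIZE 5003

    struct slot {
        int32_t  key;                   /* packed key; -1 = empty */
        uint16_t code;                  /* logical code, fixed to slot */
        uint8_t  referenced;            /* string ever extended? */
    };

    static struct slot tab[HSIZE];

    /* With the table full, walk the chain where (prefix, suffix)
       would have gone and evict the first entry that was made but
       never referenced.  Returns the reused slot, or -1 if every
       entry on the chain is still live. */
    static int reassign(uint16_t prefix, uint8_t suffix)
    {
        int32_t key  = ((int32_t)suffix << 12) | prefix;
        int     i    = (int)(key % HSIZE);
        int     step = (i == 0) ? 1 : HSIZE - i;

        while (tab[i].key != -1) {
            if (!tab[i].referenced) {
                tab[i].key = key;       /* new string inherits the
                                           slot's logical code */
                return i;
            }
            i -= step;
            if (i < 0)
                i += HSIZE;
        }
        return -1;                      /* no victim on this chain */
    }

The uncruncher would maintain the same table keyed by physical
slot, consulting its physical-to-logical translation table to
replay each eviction exactly as the cruncher made it.
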
---

I hope to write an article soon on "Recent Advancements in Data
Compression". It would cover the recent history generally, along
with a more detailed description of some of the algorithms, and a
series of additional ideas for future enhancement.

Steven Greenberg
16 November 1986