Technical Abstract

CRUNCH 1.x maintained a table representing up to 4096 strings of
varying lengths using the so-called LZW algorithm, which has been
described in the earlier documentation. These strings were entered
into the table in a manner where the string's content was used to
determine the physical location (hashing), and that location was
used as the output code. Hash "collisions" were resolved by
maintaining another 12 bits of data per entry, which served as a
"link", or pointer, to another entry.

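A minimal sketch, in C, of such a hash-addressed table with 12-bit
collision links; the names and the hash function here are
illustrative, not CRUNCH's actual code:

    #include <stdint.h>

    #define TABLE_SIZE 4096
    #define NO_LINK    0xFFF            /* end-of-chain sentinel */

    struct entry {
        uint16_t prefix;                /* code of string minus last byte */
        uint8_t  suffix;                /* last byte of the string */
        uint16_t link;                  /* 12-bit pointer to next entry */
        uint8_t  used;
    };

    static struct entry table[TABLE_SIZE];

    /* Find (prefix, suffix); the slot index found doubles as the
       output code, since the hash fixes the physical location. */
    static int find_code(uint16_t prefix, uint8_t suffix)
    {
        uint16_t i = (prefix ^ ((uint16_t)suffix << 4)) % TABLE_SIZE;

        while (table[i].used) {
            if (table[i].prefix == prefix && table[i].suffix == suffix)
                return i;               /* found: i is the code */
            if (table[i].link == NO_LINK)
                break;                  /* end of collision chain */
            i = table[i].link;
        }
        return -1;                      /* string not in table */
    }
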
In contrast, CRUNCH 2.x uses an "open-addressing, double hashing"
method similar to that employed in the UNIX COMPRESS. This method
involves a table of length 5003, where only 4096 entries are ever
made, ensuring the table never gets much more than about 80% full.
When a hash collision occurs, a secondary hash function is used to
check a series of additional entries until an empty entry is
encountered. This method creates a table filled with many
criss-crossed "virtual" chains, without the use of a "link" entry
in the table.

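The probing discipline can be sketched as follows, assuming (in
the spirit of UNIX COMPRESS) a 5003-slot table and a key packed
from the prefix code and the new byte; identifiers are
illustrative:

    #include <stdint.h>

    #define HSIZE    5003               /* prime table length */
    #define MAXCODES 4096               /* caller stops inserting here */

    static int32_t keys[HSIZE];         /* packed keys, preset to -1 */

    /* Return the slot holding the key, or the empty slot where it
       belongs.  The secondary hash supplies the probe step, so
       colliding strings walk criss-crossed chains with no stored
       link fields. */
    static int probe(uint16_t prefix, uint8_t suffix)
    {
        int32_t key  = ((int32_t)suffix << 12) | prefix;
        int     i    = (int)(key % HSIZE);         /* primary hash */
        int     step = (i == 0) ? 1 : HSIZE - i;   /* never zero */

        while (keys[i] != -1 && keys[i] != key) {
            i -= step;                  /* next slot on the chain */
            if (i < 0)
                i += HSIZE;
        }
        return i;
    }
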
One reason this is important is that [without using any additional
memory] the 1 1/2 bytes which were previously allocated as a link
can now become the [output] code number. This enables us to assign
code numbers, which are kept right alongside the entry itself,
independently of the entry's physical location. This allows the
codes to be assigned in order, permitting the use of 9-bit
representations until there are 512 codes in the table, after
which 10-bit representations are output, etc.

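The width rule amounts to only a few lines. In this sketch the
12-bit ceiling matches the 4096-entry table; treating code 256 as
the first assignable code (ignoring any reserved control codes) is
an assumption:

    static int next_code = 256;         /* next logical code to assign */
    static int code_bits = 9;           /* current output width */

    static void assign_next_code(void)
    {
        next_code++;
        /* once 512 codes exist, 9 bits can no longer name a new
           one; likewise 1024 -> 11 bits, 2048 -> 12 bits */
        if (next_code >= (1 << code_bits) && code_bits < 12)
            code_bits++;
    }
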
The variable bit length method has three ramifications. It is
particularly helpful when encoding very short files, where the
table never even fills up. It also provides a fixed additional
savings (not insubstantial) even when the table does fill up.
Thirdly, it reduces the overhead associated with an "adaptive
reset" to the point where it becomes a very viable alternative.
"Adaptive reset" simply means throwing out the whole table and
starting over. This can be quite advantageous when used properly.
CRUNCH v2.x employs this provision, which was not incorporated in
the v1.x algorithm.

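The mechanics of a reset can be sketched as below. The reserved
clear code and the falling-ratio trigger are assumptions in the
style of UNIX COMPRESS, not necessarily CRUNCH's exact rule:

    #define CLEAR_CODE 256              /* assumed reserved code */

    extern void output_code(int code);  /* assumed helper: emit code */
    extern void init_table(void);       /* assumed helper: empty table */

    /* A reset pays off when the table has gone stale: the data's
       character has drifted and the ratio has stopped improving. */
    static void maybe_reset(long bytes_in, long bytes_out)
    {
        static long best_ratio;
        long ratio = bytes_out ? (bytes_in * 100) / bytes_out : 0;

        if (ratio > best_ratio) {
            best_ratio = ratio;         /* table still earning its keep */
        } else {
            output_code(CLEAR_CODE);    /* signal the uncruncher */
            init_table();               /* start over at 9-bit codes */
            best_ratio = 0;
        }
    }
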
- "Code reassignment" is an advancement I introduced with the re-
- lease of CRUNCH v2.0 based on original work. It is not used in
- COMPRESS, any MS-DOS ARC program, or [to the best of my know-
- ledge] any other data compression utility currently available.
- There are many ways one might go about this (and at least as many
- possible pitfalls). The algorithm I selected seemed to represent
- a good tradeoff between speed, memory used, and improved perfor-
- mance, while maintaining "soundness of algorithm" (ie it works).
-
-
Briefly, it works as follows: Once the table fills up, the code
reassignment process begins. (At this same time, the possibility
of adaptive reset is also enabled.) Whenever a new code would
otherwise be made (if the table weren't full), the entries along
the hash chain which would normally contain the entry are scanned.
The first, if any, of those entries which was made but never
subsequently referenced is bumped in favor of the new entry. The
uncruncher, which would not normally need to perform any hash-type
function, has an auxiliary physical-to-logical translation table,
where it simulates the hashing going on in the cruncher. In this
fashion it is able to exactly reproduce the reassignments made by
the cruncher, which is essential.

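The cruncher's half of the scheme can be sketched as follows, with
hypothetical names; each entry carries a "referenced" flag that is
set whenever the entry's string is extended into a longer one:

    #include <stdint.h>

    #define HSIZE 5003

    struct slot {
        int32_t  key;                   /* packed key; -1 = empty */
        uint16_t code;                  /* logical code, fixed to slot */
        uint8_t  referenced;            /* string ever extended? */
    };

    static struct slot tab[HSIZE];

    /* With the table full, walk the chain where (prefix, suffix)
       would have gone and evict the first entry that was made but
       never referenced.  Returns the reused slot, or -1 if every
       entry on the chain is still live. */
    static int reassign(uint16_t prefix, uint8_t suffix)
    {
        int32_t key  = ((int32_t)suffix << 12) | prefix;
        int     i    = (int)(key % HSIZE);
        int     step = (i == 0) ? 1 : HSIZE - i;

        while (tab[i].key != -1) {
            if (!tab[i].referenced) {
                tab[i].key = key;       /* new string inherits the
                                           slot's logical code */
                return i;
            }
            i -= step;
            if (i < 0)
                i += HSIZE;
        }
        return -1;                      /* no victim on this chain */
    }

The uncruncher would maintain the same table keyed by physical
slot, consulting its physical-to-logical translation table to
replay each eviction exactly as the cruncher made it.
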
---

I hope to write an article soon on "Recent Advancements in Data
Compression". It would cover the recent history generally, along
with a more detailed description of some of the algorithms, and a
series of additional ideas for future enhancement.

Steven Greenberg
16 November 1986