NetNews Usenet Archive 1992 #30

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #30 / NN_1992_30.iso / spool / comp / os / msdos / programm / 11361 < prev next >

Wrap

Text File | 1992-12-13 | 4.4 KB | 132 lines

Newsgroups: comp.os.msdos.programmer Path: sparky!uunet!caen!batcomputer!theory.TC.Cornell.EDU!homer From: homer@theory.TC.Cornell.EDU (Homer Smith) Subject: HASHING AND LZW COMPRESSION Message-ID: <1992Dec13.182946.5344@tc.cornell.edu> Sender: news@tc.cornell.edu Nntp-Posting-Host: theory.tc.cornell.edu Organization: Cornell Theory Center Date: Sun, 13 Dec 1992 18:29:46 GMT Lines: 120 This is long, please bomb out of it when ever it gets boring. Could someone enlighten me as to what is HASHING? It arises as an algorithm to search a table of numbers for the existence of a particular entry. For example if you have a N by 2 table of random entries of the form 259 4 34 34 245 123 259 4 what is the absolutely fastest way to determine if any other random pair of numbers is already in the table and its position. The problem arises in LZW compression where in you build a table of strings of characters. For example DC DCB DCBE DCBEA ETC. This example is highly simplified. The nature of these LZW tables is that each string that is in the table consists of a PRIOR string that is also in the table plus one more character. The ordering however can be very random and complex. Thus such an LZW table can be expressed as a two column table where each entry consists of the index of the prior string and the one more character. INDEX STRING 1 A 2 B 3 C 4 D 5 E . . 259 DC 4 3 260 DCB --> 259 2 261 DCBA 260 1 262 DCBAE 261 5 Thus if one has an arbitrary string such as DCB, and one has ALREADY found it in the table at position 260, one then gets the next character in the material to be coded, say F, and now one needs to know if DCBF is already in the table. If it is, one goes on to the next character. If not, one adds the new string DCBF to the table. DCBF is clearly 260 6, so the problem becomes searching through the two column table for the entry 260 6. LZW tables can be relatively long, usually up to 4096 entries. The right hand column needs only be a byte wide, but the left hand column, being an index into previous entries in the table needs to span 4096 or at least two bytes. Thus the tables do not take up a terrible amount of space, but the searching through them can take forever, especially if you proceed with a straight forward linear search from the beginning to the end of the table. Obviously you only check the right column if you have found a match in the left column. But remember this has to be done over and over again for each character in the input stream to be compressed. For a simple 1200 by 800 fractal, the algorithm can take upwards of 8 minutes on a 486/33 pc, which is ridiculous. One solution to this does away with the table searching completely. This is done by recognizing that each ordered pair of numbers such as (261,23) is a multiple base number, the right column is in base 256 and the left hand column is in base 4096. Thus (261,23) can be expressed as a unique number of the form 261*256 + 23 = 66839. The total span of ordered pairs from (0,0) to (4095,255) thus spans 0 to 4095*256 + 255 or 1048575. Thus rather than adding a new ordered pair (A,B) to the end of a simple 2 column table, one places it directly into position A*256 + B in the longer array. Actually what one stores there is the INDEX of the position that the ordered pair would have occupied in a normal table. The longer array should intially be set to to -1's. Thus when you wish to know if any ordered pair is in the table, you merely compute A*256 + B and see if anything is in the longer array. If it is, then that entry is already 'in the table' at the implied index position specified by the number found in the longer array. The brilliance of this is that LZW compression becomes very fast, the 8 minute example mentioned above reduces down to 38 seconds, as there is no table searching whatsoever. The catch is that it takes over 2 Meg of memory to store the longer array, space that most people can ill afford. The original two columned table only takes 12,288 bytes, 2 bytes for the left column and 1 for the right column, times 4096. I would appreciate any enlightening on this subject that you can throw my way. Homer