home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: comp.os.msdos.programmer
- Path: sparky!uunet!caen!batcomputer!theory.TC.Cornell.EDU!homer
- From: homer@theory.TC.Cornell.EDU (Homer Smith)
- Subject: HASHING AND LZW COMPRESSION
- Message-ID: <1992Dec13.182946.5344@tc.cornell.edu>
- Sender: news@tc.cornell.edu
- Nntp-Posting-Host: theory.tc.cornell.edu
- Organization: Cornell Theory Center
- Date: Sun, 13 Dec 1992 18:29:46 GMT
- Lines: 120
-
-
- This is long, please bomb out of it when ever it gets boring.
-
- Could someone enlighten me as to what is HASHING?
-
- It arises as an algorithm to search a table of numbers for
- the existence of a particular entry.
-
- For example if you have a N by 2 table of random entries of the form
-
- 259 4
- 34 34
- 245 123
- 259 4
-
-
- what is the absolutely fastest way to determine if any other random
- pair of numbers is already in the table and its position.
-
- The problem arises in LZW compression where in you build
- a table of strings of characters.
-
- For example
-
- DC
- DCB
- DCBE
- DCBEA ETC.
-
- This example is highly simplified. The nature of these
- LZW tables is that each string that is in the table
- consists of a PRIOR string that is also in the table plus one
- more character. The ordering however can be very random and complex.
-
- Thus such an LZW table can be expressed as a two column table
- where each entry consists of the index of the prior string and the
- one more character.
-
- INDEX STRING
- 1 A
- 2 B
- 3 C
- 4 D
- 5 E
- . .
- 259 DC 4 3
- 260 DCB --> 259 2
- 261 DCBA 260 1
- 262 DCBAE 261 5
-
- Thus if one has an arbitrary string such as DCB, and one has ALREADY
- found it in the table at position 260, one then gets the next character in the
- material to be coded, say F, and now one needs to know if DCBF is
- already in the table. If it is, one goes on to the next character.
- If not, one adds the new string DCBF to the table.
-
- DCBF is clearly 260 6, so the problem becomes searching through
- the two column table for the entry 260 6.
-
- LZW tables can be relatively long, usually up to 4096 entries.
- The right hand column needs only be a byte wide, but the left
- hand column, being an index into previous entries in the table
- needs to span 4096 or at least two bytes.
-
- Thus the tables do not take up a terrible amount of space, but
- the searching through them can take forever, especially
- if you proceed with a straight forward linear search from the
- beginning to the end of the table.
-
- Obviously you only check the right column if you have found
- a match in the left column.
-
- But remember this has to be done over and over again for each
- character in the input stream to be compressed.
-
- For a simple 1200 by 800 fractal, the algorithm can take upwards
- of 8 minutes on a 486/33 pc, which is ridiculous.
-
- One solution to this does away with the table searching
- completely.
-
- This is done by recognizing that each ordered pair of numbers
- such as (261,23) is a multiple base number, the right column
- is in base 256 and the left hand column is in base 4096.
-
- Thus (261,23) can be expressed as a unique number of the form
-
- 261*256 + 23 = 66839.
-
- The total span of ordered pairs from (0,0) to (4095,255) thus
- spans 0 to 4095*256 + 255 or 1048575.
-
- Thus rather than adding a new ordered pair (A,B) to the end
- of a simple 2 column table, one places it directly into position
- A*256 + B in the longer array. Actually what one stores there
- is the INDEX of the position that the ordered pair would have occupied
- in a normal table.
-
- The longer array should intially be set to to -1's.
-
- Thus when you wish to know if any ordered pair is in the table,
- you merely compute A*256 + B and see if anything is in the longer array.
- If it is, then that entry is already 'in the table' at the
- implied index position specified by the number found in the longer array.
-
- The brilliance of this is that LZW compression becomes very fast,
- the 8 minute example mentioned above reduces down to 38 seconds,
- as there is no table searching whatsoever.
-
- The catch is that it takes over 2 Meg of memory to store the
- longer array, space that most people can ill afford.
-
- The original two columned table only takes 12,288 bytes, 2 bytes
- for the left column and 1 for the right column, times 4096.
-
- I would appreciate any enlightening on this subject that you
- can throw my way.
-
- Homer
-
-