home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!kosak
- From: kosak+@cs.cmu.edu (Corey Kosak)
- Newsgroups: comp.lang.c
- Subject: Re: LZW Compression algorithm
- Message-ID: <Bxs4oA.9s6.2@cs.cmu.edu>
- Date: 15 Nov 92 22:48:09 GMT
- Article-I.D.: cs.Bxs4oA.9s6.2
- References: <BxMAwz.D6C@willamette.edu> <1992Nov15.200257.20378@thunder.mcrcim.mcgill.edu>
- Sender: Corey Kosak <kosak+@cmu.edu>
- Organization: School of Computer Science, Carnegie Mellon
- Lines: 96
- Nntp-Posting-Host: ibis.warp.cs.cmu.edu
-
- In article <1992Nov15.200257.20378@thunder.mcrcim.mcgill.edu> mouse@thunder.mcrcim.mcgill.edu (der Mouse) writes:
- >In article <BxMAwz.D6C@willamette.edu>, bbrown@willamette.edu (Brian Brown) writes:
- >
- >> Does anybody have any information on how to implement LZW
- >> compression/decompression?
- >
- >You have a table of 65536 (2^16) entries. Each table entry
- >conceptually contains a string of bytes, or is empty. Initially, 256
- >of the entries contain the 256 possible single-byte strings, with the
- >rest empty. Initialize the algorithm by reading the first byte of
- >input into a working variable which conceptually contains a string of
- >bytes.
- >
- >While the current working string of bytes is listed in the table, read
- >another byte. As soon as the string is not listed in the table, output
- >the table index corresponding to the last string that was in the table,
- >add the working string to the table, and reset the working string so it
- >contains just the last byte. Then go back to the beginning of this
- >paragraph. Dealing with EOF on input complicates this slightly, but
- >only slightly. Dealing with a full table is fairly easy: just skip the
- >"add the working string to the table" step.
-
- Mr./Ms. Mouse leaves off the description of the decoding process.
- However, it is fairly straightforward. Input a table index, decode it
- by retrieving the associated string from your table, and then add a new entry
- to your table. This new entry is the concatenation of the previous string
- found and the first character of the current string. (Note that for
- the first time through the loop, there is no "previous string", so
- you do not do a table update)
-
- The description that Herr/Frau Mouse gives for the encoding process
- is almost correct. The problem is that in the given algorithm,
- the encoder's table updates are one step ahead of the decoder's;
- this usually doesn't matter, but will cause incorrect behavior for some
- input.
-
- Consider an input that begins with "aaaa", and trace the algorithm.
-
- 1. Initialize the table
-
- 2. read a byte. (the first 'a'. Our working string is "a". This is
- in the table, so we loop)
-
- 3. read another byte. (the second 'a'. Now our working string is "aa",
- which is not in the table. Following the instructions, we:
-
- - output the index for the last string that was in the table
- OUTPUT (INDEX ("a")) -- assume for the sake of argument that this index is 97
- - add the working string to the table
- TABLE [256] = "aa"
- - reset the working string so it contains the last byte
- WORKING_STRING = "a"
- - loop
-
- 4. read another byte. (the third 'a'. Now our working string is "aa".
- This is in the table, so we loop)
-
- 5. read another byte. (the fourth 'a'. Now our working string is "aaa",
- which is not in the table. Following the instructions we:
-
- - output the index for the last string that was in the table
- OUTPUT (INDEX ("aa")) -- this is 256
- - add the working string to the table
- TABLE [257] = "aaa"
- - reset the working string so it contains the last byte
- WORKING_STRING = "a"
- - loop
-
-
- Let's stop here and look at what happens when we try to decompress
- this data. So far, we've emitted two codes: 97 and 256.
-
- Decoding 97 tells us that the first byte is 'a'. But when we encounter
- the second code, 256, we have no way of knowing what the sender has
- in slot 256 of its array. Maybe "aa". Maybe not.
-
- As I mentioned above, this problem is caused by the encoder being
- one step ahead of the decoder with respect to table updates.
- The way to fix this is to delay sender table updates by one iteration.
- So the revised instructions might read:
-
- >While the current working string of bytes is listed in the table, read
- >another byte. As soon as the string is not listed in the table, output
- >the table index corresponding to the last string that was in the table,
- ***add LAST_UPDATE to the table, set LAST_UPDATE to the current working string,***
- >and reset the working string so it
- >contains just the last byte. Then go back to the beginning of this
- >paragraph. Dealing with EOF on input complicates this slightly, but
- >only slightly. Dealing with a full table is fairly easy: just skip the
- >"add the working string to the table" step.
-
- (The same note applies to the compressor: for the first time through the
- loop, LAST_UPDATE is not valid, so you do not add anything to the table.)
-
-
- -Corey Kosak
-