home *** CD-ROM | disk | FTP | other *** search
- Path: spinifex!elecvax!usage!basser!munnari!otc!softway!swine
- From: swine@softway.oz (Peter Swain)
- Newsgroups: comp.theory
- Subject: Zev-Limpel (Lempel-Ziv?) compression algorithm (long)
- Keywords: Zev_Limpel,compression
- Message-ID: <1057@softway.oz>
- Date: 4 Jan 89 00:10:48 GMT
- References: <182@syacus.acus.oz>
- Reply-To: swine@softway.oz (Peter Swain)
- Distribution: aus
- Organization: Softway Pty Ltd, Sydney, Australia
- Lines: 77
-
- In article <182@syacus.acus.oz> petri@syacus.UUCP (Petri Nuuttila) writes:
- > Does anybody have a reference for Zev-Limpel compression allgorithm?
- > Better yet, could somebody post some information on it. Is it some variant
- > of Huffman encoding, which it improves?
-
- Quite different, really.
-
- Huffman codes are designed through a static examination of
- the symbol space, giving the shortest names to most common
- symbols.
-
- Lempel-Ziv (I think that's them) is a dynamic encoding,
- which need not see the whole file before compressing.
- It's best seen as a smart communication channel which
- generates its own abbreviations for common strings.
-
- We have an encoder, E, and a decoder, D, with a stream of
- symbols being passed from E to D (the compressed text).
- E & D both start out with an initial alphabet (call it a,b,c...)
- and a common boredom threshold.
- The initial alphabet is just the character set of the input
- stream (8bit bytes, 7bit ascii, 64bit chunks or wotever).
- E starts sending text to D:
- thecatsatonthematthecatsatonthemat....
- and soon notices that some pairs of symbols are common.
-
- When any pair reaches the boredom threshold, E picks the
- next free symbol to encode future occurences of the pair.
- Say we get bored after two occurences of a pair,
- and pick new symbols (A,B,C,...) to represent them:
- thecatsatonthemABecAsAonBemA....
- On the second occurence of "at", we invent a macro "A".
- Similarly "B" for "th".
- E *does not* send this new code the first time, but lets the
- old pair pass. D, using the same algorithm, expects that E
- will have been bored by "at", and intuits that it will have
- assigned the next free symbol "A".
- Some time later "Be" (from "the") will be assigned (say)
- "E", and "Em" (from "Bem" from "them") will get (say) "G".
- {appropriately, "Bem" == "BugEyedMonster" == "them"}
-
- As a reasonably redundant text passes under their noses, E &
- D invent a more compact notation, but it costs them in size
- of symbol space (bits per symbol).
-
- The PD compress utility (for Unix and much else) is a good
- example of Lempel-Ziv. Initially knowing nothing about the
- plaintext, it encodes input bytes into 8 bit symbols
- (themselves). On getting bored with "at" above, it would
- invent the symbol "A" as 256, thus blowing out the symbol
- space to 9 bits. Both E & D realise this on the last
- occurrence of "at", and shift to a 9 bit encoding from
- the next symbol.
- After inventing 256 symbols, they shift to 10 bit encoding,
- and so on, limited only by the size of the decoding table
- they need to maintain.
- Compress can be told to limit its symbol space size to
- enable the decoding to run on small machines (usually 13 bit
- space for 64k addressing).
-
- This algorithm picks up local text features quickly, and can
- fill its symbol space with anachronisms if the text shape
- changes (say from English to machine code).
- If E notices a decrease in compression (perhaps because it's
- now sending 10 bit symbols for essentially incompressible
- data), it can blow away that excess baggage and go back to
- 8 bit encoding again, moving to 9 bits when it sees new
- idioms in the text.
-
- If my library were not in such disarry i'd feed you some
- references.
- Seek out the Compress sources, from me if necessary.
- --
- Peter Swain - Softway Pty Ltd (swine@softway.oz)
-
- PHONE: +61-2-698-2322 UUCP: uunet!softway.oz!swine
- FAX: +61-2-699-9174 INTERNET: swine@softway.oz.au
-