ftp.cse.unsw.edu.au

home *** CD-ROM | disk | FTP | other *** search

/ ftp.cse.unsw.edu.au / 2014.06.ftp.cse.unsw.edu.au.tar / ftp.cse.unsw.edu.au / pub / doc / algorithms / Lempel_Ziv_Compress_Algorithm < prev

Wrap

Internet Message Format | 1992-10-18 | 3.7 KB

Path: spinifex!elecvax!usage!basser!munnari!otc!softway!swine From: swine@softway.oz (Peter Swain) Newsgroups: comp.theory Subject: Zev-Limpel (Lempel-Ziv?) compression algorithm (long) Keywords: Zev_Limpel,compression Message-ID: <1057@softway.oz> Date: 4 Jan 89 00:10:48 GMT References: <182@syacus.acus.oz> Reply-To: swine@softway.oz (Peter Swain) Distribution: aus Organization: Softway Pty Ltd, Sydney, Australia Lines: 77 In article <182@syacus.acus.oz> petri@syacus.UUCP (Petri Nuuttila) writes: > Does anybody have a reference for Zev-Limpel compression allgorithm? > Better yet, could somebody post some information on it. Is it some variant > of Huffman encoding, which it improves? Quite different, really. Huffman codes are designed through a static examination of the symbol space, giving the shortest names to most common symbols. Lempel-Ziv (I think that's them) is a dynamic encoding, which need not see the whole file before compressing. It's best seen as a smart communication channel which generates its own abbreviations for common strings. We have an encoder, E, and a decoder, D, with a stream of symbols being passed from E to D (the compressed text). E & D both start out with an initial alphabet (call it a,b,c...) and a common boredom threshold. The initial alphabet is just the character set of the input stream (8bit bytes, 7bit ascii, 64bit chunks or wotever). E starts sending text to D: thecatsatonthematthecatsatonthemat.... and soon notices that some pairs of symbols are common. When any pair reaches the boredom threshold, E picks the next free symbol to encode future occurences of the pair. Say we get bored after two occurences of a pair, and pick new symbols (A,B,C,...) to represent them: thecatsatonthemABecAsAonBemA.... On the second occurence of "at", we invent a macro "A". Similarly "B" for "th". E *does not* send this new code the first time, but lets the old pair pass. D, using the same algorithm, expects that E will have been bored by "at", and intuits that it will have assigned the next free symbol "A". Some time later "Be" (from "the") will be assigned (say) "E", and "Em" (from "Bem" from "them") will get (say) "G". {appropriately, "Bem" == "BugEyedMonster" == "them"} As a reasonably redundant text passes under their noses, E & D invent a more compact notation, but it costs them in size of symbol space (bits per symbol). The PD compress utility (for Unix and much else) is a good example of Lempel-Ziv. Initially knowing nothing about the plaintext, it encodes input bytes into 8 bit symbols (themselves). On getting bored with "at" above, it would invent the symbol "A" as 256, thus blowing out the symbol space to 9 bits. Both E & D realise this on the last occurrence of "at", and shift to a 9 bit encoding from the next symbol. After inventing 256 symbols, they shift to 10 bit encoding, and so on, limited only by the size of the decoding table they need to maintain. Compress can be told to limit its symbol space size to enable the decoding to run on small machines (usually 13 bit space for 64k addressing). This algorithm picks up local text features quickly, and can fill its symbol space with anachronisms if the text shape changes (say from English to machine code). If E notices a decrease in compression (perhaps because it's now sending 10 bit symbols for essentially incompressible data), it can blow away that excess baggage and go back to 8 bit encoding again, moving to 9 bits when it sees new idioms in the text. If my library were not in such disarry i'd feed you some references. Seek out the Compress sources, from me if necessary. -- Peter Swain - Softway Pty Ltd (swine@softway.oz) PHONE: +61-2-698-2322 UUCP: uunet!softway.oz!swine FAX: +61-2-699-9174 INTERNET: swine@softway.oz.au