- --<Start of LZRW4 paper>--
-
- LZRW4: ZIV AND LEMPEL MEET MARKOV
- =================================
- Author : Ross Williams
- Date : 23-Jul-1991.
-
- This is a public domain document and may be copied freely so long as
- it is not modified (though text formatting transformations are
- permitted).
-
- Note: This paper has not been polished to publication-standard. It has
- been prepared under considerable time pressure and is released mainly
- to make the ideas it contains public so as to avoid others
- re-inventing and patenting them.
-
- Abstract
- --------
- In 1977 and 1978, Ziv and Lempel[Ziv77][Ziv78] introduced the LZ class
- of adaptive dictionary data compression techniques that have become so
- very popular today. In competition with these techniques are the
- Markov algorithms that give better compression but run much more
- slowly. This paper compares the two classes of algorithm and shows how
- algorithms that combine techniques from both classes have the
- potential to provide higher performance (i.e. a better speed/compression
- trade-off) than we have seen so far. An algorithm called LZRW4 that combines
- Markov and LZ77 ideas is sketched.
-
- Comparison of Ziv/Lempel and Markov Algorithms
- ----------------------------------------------
- To the casual observer, the Ziv/Lempel (LZ) data compression
- algorithms may appear to be worlds away from the lesser known Markov
- algorithms. LZ algorithms construct trees of strings and transmit
- strings as codewords. Markov algorithms construct large numbers of
- contexts and fiddle with arithmetic ranges and probabilities. The two
- kinds of algorithms seem to operate with totally different objects.
-
- However, deep down, both classes of algorithm are using the same
- probabilistic engine. Deep down, both classes must be representing
- more common events with short codes and less common events with longer
- codes. This is the law of data compression. But, without some careful
- thought, it's hard to see how the two classes relate.
-
- In the early eighties, Glen Langdon put in some careful thought and a
- lot of insight and came up with two papers, [Langdon83] and [Langdon84],
- which are pivotal to my understanding of the field of data
- compression. I cannot recommend these papers highly enough to anyone
- who wishes to gain an understanding of what REALLY makes LZW tick or
- to anyone who wants a broader perspective of the field. The papers
- seem highly focused and obscure, but in fact they cast a broad soft
- light over the whole field of text compression, illuminating areas
- that for me were very dark (but might not be so dark for people who
- read IEEE Transactions on Information Theory like Dickens). The two
- papers have to be read carefully though. An earlier paper [Rissanen81]
- by Rissanen and Langdon gives an even more general view, but does not
- deal directly with LZ algorithms as the later papers do and so I have
- focussed on the later papers.
-
- The idea behind the two papers is that all dictionary algorithms,
- including the LZ classes (LZ77 and LZ78), are subsets of (i.e. can be
- emulated by) Markov algorithms, but the converse is not true. Langdon
- shows us a clear view of LZ algorithms through Markov eyes. This view
- is highly illuminating.
-
- In [Langdon83], Langdon examines the LZ78 class of algorithm closely
- and explains the underlying engine in probabilistic terms. In
- [Langdon84], he has a more thorough byte at the problem, formalizing
- the whole case and proving that dictionary algorithms can always be
- emulated by Markov algorithms, but not the converse.
-
- I will attempt to explain the idea of [Langdon83] briefly. I will
- review LZW but not give a detailed explanation of how it works.
-
- LZ78/LZW builds its dictionary by parsing the input as phrases
- consisting of the longest dictionary match to date. At the end of each
- phrase parse, the algorithm adds a new phrase to the dictionary
- consisting of the current phrase plus the next input character. The
- dictionary grows forever, adding one new phrase per phrase parsed.
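-
- As an illustration of this rule, here is a toy sketch (my own, in
- ANSI C rather than the K&R C of the Appendix) of the parse-and-grow
- loop. It ignores everything else a real LZW coder needs (variable
- code widths, table reset, output packing) and simply prints the code
- numbers of the phrases it parses.
-
- #include <stdio.h>
-
- #define MAXCODES 4096
-
- static int prefix[MAXCODES];   /* code of the parent string          */
- static int suffix[MAXCODES];   /* character appended to the parent   */
- static int ncodes;             /* codes 0..255 are the single bytes  */
-
- static int find(int w, int c)  /* code for string(w)+c, or -1        */
- {
-    int i;
-    for (i = 256; i < ncodes; i++)
-       if (prefix[i] == w && suffix[i] == c)
-          return i;
-    return -1;
- }
-
- int main(void)
- {
-    const unsigned char *s = (const unsigned char *) "abababab";
-    int i, w = s[0];
-    ncodes = 256;
-    for (i = 1; s[i] != '\0'; i++) {
-       int c = s[i], x = find(w, c);
-       if (x >= 0) { w = x; continue; }    /* extend the current match */
-       printf("emit code %d\n", w);        /* longest match found      */
-       if (ncodes < MAXCODES) {            /* add match + next char    */
-          prefix[ncodes] = w; suffix[ncodes] = c; ncodes++;
-       }
-       w = c;                              /* next phrase starts here  */
-    }
-    printf("emit code %d\n", w);
-    return 0;
- }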
-
- The dictionary may be viewed in many ways. We choose to view it as a
- (forwards) trie structure whose leaves are to the right and whose root
- is at the left. The following diagram illustrates such a tree. Each
- arc carries a single character label. Each node represents a single
- string. In the diagram, arcs are labelled in lower case and nodes in
- upper case.
-
- AA
- /
- a/
- A
- a/ \b
- / \
- Root AB
- b\
- \
- B
-
- All dictionaries can be viewed in this way.
-
- In LZW, each node is allocated a code value. Suppose we have two-bit
- code words and the codes are allocated as follows:
-
- 0: A
- 1: B
- 2: AA
- 3: AB
-
- This code allocation implies that each event has an equal probability
- of one quarter. We can thus label our tree with node occurrence
- probabilities.
-
- AA 1/4
- /
- a/
- A 1/4
- a/ \b
- / \
- Root AB 1/4
- b\
- \
- B 1/4
-
- These node probabilities can then be converted into probabilities for
- the ARCS:
-
- AA
- /
- /a 1/3
- A
- 3/4a/ \b 1/3
- / \
- Root AB
- b\
- 1/4\
- B
-
- (The two 1/3 arcs out of A don't sum to one because the A node itself
- (which represents an imaginary arc labelled by the empty string) also
- carries a probability of 1/3.)
-
- These probabilities are exactly the sort of fodder that we need to
- feed a Markov algorithm. From the above, it can be seen that the act
- of LZW parsing the longest match and assigning a new codeword can be
- viewed as the execution of a series of single-byte prediction
- decisions, each with its own separate probabilities and each capable
- of being coded independently, and with identical efficiency, using
- arithmetic coding. (For example, the two-bit codeword for AA is
- exactly the sum of the entropies of the arcs on its path:
- -log2(3/4) - log2(1/3) = 0.415 + 1.585 = 2 bits.) You can't see this
- normally because LZW does the parse all in one go and outputs (for
- each phrase parsed) a codeword which looks, for all the world, to be
- indivisible. But deep down, we all know that it isn't. Instead, it's
- made up of little pieces of entropy accumulated during the traversal
- and cleverly bundled into fixed-width (e.g. 16-bit) codewords at the
- last minute.
-
- So far, we have converted our LZW tree to a Markov-like prediction
- tree. Now let's analyse it.
-
- Consider a single node in the LZW tree. Associated with the node is a
- string (being the characters on the arcs leading from the root). Each
- time that the parse reaches the node, it means that the node's string
- has occurred at the start of a phrase. LZW then goes on further into
- the node's child nodes and eventually reaches a leaf where it adds a
- new node AND ALLOCATES A NEW CODEWORD. The result is that each node
- accumulates a subtree whose cardinality (= the number of subnodes,
- including the node itself, which in this paper is also the number of
- codewords allocated to the subtree) is proportional to the probability
- of the node's string arising as a prefix of a parse. As the number of
- subnodes is also the node's share of the probability space (because
- there is one codeword allocated to each node), we see that the amount
- of code space allocated to a node is proportional to the number of
- times it has occurred. Very neat.
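-
- To make the arithmetic concrete, the following toy program (mine, not
- part of any LZRW algorithm) computes the subtree cardinalities for the
- four-node example dictionary above and prints the resulting arc
- probabilities; its output reproduces the 3/4, 1/4 and 1/3 figures in
- the earlier diagram.
-
- #include <stdio.h>
-
- struct node { const char *name; int nkids; int kids[2]; };
-
- /* 0:Root 1:A 2:B 3:AA 4:AB -- the example dictionary. */
- static struct node tree[5] = {
-    {"Root", 2, {1, 2}}, {"A", 2, {3, 4}}, {"B", 0, {0, 0}},
-    {"AA", 0, {0, 0}},   {"AB", 0, {0, 0}}
- };
-
- /* Codewords in a subtree = nodes in it (the root owns no codeword). */
- static int subtree(int n)
- {
-    int i, count = (n == 0) ? 0 : 1;
-    for (i = 0; i < tree[n].nkids; i++)
-       count += subtree(tree[n].kids[i]);
-    return count;
- }
-
- int main(void)
- {
-    int n, c;
-    for (n = 0; n < 5; n++) {
-       int denom = subtree(n);
-       if (tree[n].nkids == 0)
-          continue;                       /* leaves predict nothing    */
-       if (n != 0)                        /* the phrase may end here   */
-          printf("P(phrase ends at %s) = 1/%d\n", tree[n].name, denom);
-       for (c = 0; c < tree[n].nkids; c++)
-          printf("P(%s -> %s) = %d/%d\n", tree[n].name,
-                 tree[tree[n].kids[c]].name,
-                 subtree(tree[n].kids[c]), denom);
-    }
-    return 0;
- }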
-
- All this stuff is in [Langdon83] which quantifies it.
-
- The above explains why LZW compresses so well. Why then does it
- compress so badly (compared to Markov models)? Without going into
- details, here are some reasons:
-
- * Approximately half of the code space is wasted at any point of time
- in unallocated codewords.
- * Probability estimates are inaccurate near the leaves.
-
- Other such reasons eat away at the compression. In particular, there
- is one inefficiency of LZW-type compression that stands out.
-
- In our analysis, each LZW parse of length d consists of d
- single-character predictions. Each prediction is made in the context
- of previous predictions. Thus, after having parsed part of a phrase
- "ab", the LZW algorithm has two characters of Markov context with
- which to "predict" the next character. After having parsed
- "compressi", it would have nine characters of context. Thus we see
- that, to the extent that we can view LZ78/LZW algorithms as making
- predictions, they make predictions using a sawtooth context-length.
-
- Order
- ^
- 4| *.
- 3| * . *. *.
- 2| *. * . *. * . * .
- 1| * . * . * . * . *. * .
- 0|* .* .* .* .* .* .
- +-----|---------|-----|-------|---|-------|-----> Characters
- phrase1 phr2 phr3 phr4 phr5 phr6
-
- In contrast, Markov algorithms make predictions using a fairly flat
- (or sometimes totally flat) context line. Thus, while a Markov
- algorithm may make all of its predictions based on a 3-character
- context, an LZW algorithm will make only some of its predictions from
- that depth.
-
- The thing that really kills LZW's compression is the first one or two
- characters of each phrase, the first of which is predicted based on no
- context at all and the second of which is predicted based on a context
- of a single character. After that the context is sufficient not to
- introduce too much inefficiency. See Figure 64 of [Williams91] to get
- an idea of how much impact this has on compression.
-
- To compensate for its poor performance in the first one or two
- characters, LZW grows an enormous tree. This has the effect of
- increasing the average length of a phrase so that the beginning of
- phrases occurs less often. This is like Pollyanna adding extra days to
- the week to avoid Sundays. As the length of the input increases to
- infinity, so does the average phrase length, with the startling result
- that by the time you get to infinity, LZW has converged on the entropy
- of the source (as Ziv and Lempel proved in their founding
- paper[Ziv78]). The rate of convergence depends on the average phrase
- length which depends on the entropy of the source. In practice, for
- ordinary sources (e.g. text files), the average phrase length does not
- rise fast enough to properly offset the effect of the
- low-context-length predictions at the start of each phrase.
-
- An additional reason for the relatively poor compression performance
- of LZW can be found in its phrasing. Markov algorithms predict each
- character of the input one at a time and record contextual information
- at each step. In contrast LZW divides the input up into phrases each
- of which causes a single parse through its tree. Thus, the
- "zero-order" model that LZW builds up for the first character of each
- phrase is based only on the first character of all the previous
- phrases it has seen whereas a zero-order Markov model builds its
- model based on ALL the characters it has seen. This applies for all
- orders.
-
- To summarize:
-
- * All dictionary algorithms can be cast into Markov form.
-
- * The compression grunt (order) that LZW applies is in the shape
- of an irregular sawtooth.
-
- * LZW collects statistics for each context only on phrase boundaries.
-
- * A little thought reveals that all this applies to the LZ77 class
- as well.
-
- A brief discussion of all this can be found in Chapter one of my book
- [Williams91] and in particular section 1.13:"Dictionary Methods vs
- Markov Methods". However, for a full treatment, the reader is referred
- to [Langdon83] and [Langdon84].
-
- Ziv and Lempel meet Markov
- --------------------------
- Currently, the field of text data compression has two separate classes
- of method: statistical (Markov) and dictionary (Ziv and Lempel). Markov
- algorithms inch through the input character by character, recording
- and using context information at each step. This yields the best
- compression, but makes Markov algorithms slow. Ziv and Lempel
- algorithms gallop through the input one phrase at a time, losing lots
- of non-phrase-aligned context information and using zero order models
- at the start of each phrase. This yields a higher speed but poorer
- compression.
-
- Obviously it would be highly desirable to combine the speed of the Ziv
- and Lempel algorithms with the compression power of the Markov
- algorithms. As time goes by this seems less and less likely (without
- special (e.g. parallel) hardware). However, there does seem to be one
- way of combining the two classes of algorithm and to my surprise, I
- have not yet seen it used.
-
- The idea is to create a group of contexts, but to parse phrases from
- each context. For example, we might create 256 1-character contexts
- and grow an LZW tree in each of them. Compression would proceed
- exactly as with LZW except that the most recent character transmitted
- (the last character of the previous phrase) would be used to select
- one of the 256 contexts, each of which would contain an LZW tree which
- would then be used to parse the phrase.
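-
- A minimal, self-contained sketch of this context-selection step
- follows (it is my own illustration, not LZRW4 and not an official
- algorithm). It keeps 256 LZW-style dictionaries, one per possible
- previous byte, and uses the last character of the previous phrase to
- choose the dictionary that parses the next phrase; dictionary lookup
- is a naive linear scan for brevity, and the very first byte is sent
- as a literal because it has no context.
-
- #include <stdio.h>
- #include <string.h>
-
- #define MAXCODES 1024
-
- typedef struct {
-    int prefix[MAXCODES], suffix[MAXCODES], ncodes;
- } Dict;
-
- static Dict context[256];              /* one tree per previous byte  */
-
- static int find(Dict *d, int w, int c) /* code for string(w)+c, or -1 */
- {
-    int i;
-    for (i = 256; i < d->ncodes; i++)
-       if (d->prefix[i] == w && d->suffix[i] == c)
-          return i;
-    return -1;
- }
-
- int main(void)
- {
-    const unsigned char *s = (const unsigned char *) "the theme thereof";
-    size_t len = strlen((const char *) s);
-    size_t pos = 1;
-    int i;
-
-    for (i = 0; i < 256; i++)
-       context[i].ncodes = 256;          /* codes 0..255 = single bytes */
-
-    printf("literal %d\n", s[0]);        /* first byte has no context   */
-    while (pos < len) {
-       int ctx = s[pos - 1];             /* last char of previous phrase */
-       Dict *d = &context[ctx];          /* select that context's tree   */
-       int w = s[pos++];                 /* greedy longest-match parse   */
-       while (pos < len) {
-          int x = find(d, w, s[pos]);
-          if (x < 0) break;
-          w = x;
-          pos++;
-       }
-       printf("context %d: emit code %d\n", ctx, w);
-       if (pos < len && d->ncodes < MAXCODES) {
-          d->prefix[d->ncodes] = w;      /* grow: match + next character */
-          d->suffix[d->ncodes] = s[pos];
-          d->ncodes++;
-       }
-    }
-    return 0;
- }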
-
- The result would be to take the low-order edge off LZW without losing
- any of its speed. Each character of the phrase would be "predicted"
- using a model one order higher than would have been used in LZW. At
- high orders this would have negligible effect, but at low orders it
- would have a significant effect, for the first character of each
- phrase, formerly predicted by a zero-order model, would now be
- predicted by a first-order model. Each phrase would be coded using
- only the range required by the size of the tree in its context.
-
- By using contexts at the start of the LZW tree, we raise the
- bottom of the sawtooth a few notches.
-
- Order
- ^
- 6| *.
- 5| * . *. *.
- 4| *. * . *. * . * .
- 3| * . * . * . * . *. * .
- 2|* .* .* .* .* .* .
- 1|
- 0|
- +-----|---------|-----|-------|---|-------|-----> Characters
- phrase1 phr2 phr3 phr4 phr5 phr6
-
-
- This idea is not at all specific to LZW, but could be applied in
- dozens of ways to many different fast-parsing adaptive algorithms. In
- fact my example algorithm, LZRW4 (described below), uses an LZ77-like
- algorithm.
- Here are some ideas:
-
- * Use a two-byte context to hash into a table of LZW trees (a sketch
- of such a context hash appears after this list).
- * Grow backwards context trees as in DHPC, but grow LZW parsing trees
- out of each node.
- * Somehow use a single LZW tree BUT START PARSING ONE OR MORE
- CHARACTERS BACK! This avoids having two different kinds of data structure.
- * Set up a set of contexts and bolt an LZ77 compressor on the
- back. Each context stores a list of pointers to strings that
- occurred in that context (see LZRW4 below).
- * Use context with any other dynamic dictionary scheme.
- * Use a rolling hash function in hardware to avoid hash delays at the
- start of phrases.
-
- ...combinations and permutations.
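-
- As an example of the first idea in the list above, here is a small
- sketch of a two-byte context hash. The sketch is my own, but it uses
- the same multiplicative hash that the experimental code in Appendix A
- uses to map the two preceding characters to one of 4096 partitions
- (each partition could hold an LZW tree or, as in LZRW4, a list of
- pointers into the input).
-
- #include <stdio.h>
-
- static unsigned int context_hash(const unsigned char *p)
- {
-    unsigned int index = (p[-2] << 8) ^ p[-1];  /* two-byte context     */
-    index = (40543 * index) >> 4;               /* multiplicative hash  */
-    return index & 0xFFF;                       /* 4096 partitions      */
- }
-
- int main(void)
- {
-    const unsigned char text[] = "the context hash";
-    /* Hash the context of the character that follows "th". */
-    printf("partition = %u\n", context_hash(text + 2));
-    return 0;
- }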
-
- There are two drawbacks to all this. First of all we are up for a lot
- more memory. Second, we run up against the context-depth/sample-size
- tradeoff encountered by Markov algorithms. The deeper the contexts we
- establish, the more specific they become, but the smaller the amount
- of information that each context has on which to base its mini-model
- and hence its predictions. The first problem can be overcome with more
- memory. The second problem requires careful tuning.
-
- LZRW4
- -----
- To test out the idea, I devised LZRW4. I didn't implement LZRW4, but I
- did write a program to measure the compression that it would yield.
- This program is provided in Appendix A on an "as is" basis.
-
- Note: The name LZRW4 is used here to describe an algorithm. However,
- the eventual official code release of LZRW4 may differ in some details
- from the algorithm presented here.
-
- With a huge range of LZ algorithms from which to choose, I chose the
- LZ77 class with which I am familiar. This avoided patent problems.
-
- LZRW4 uses a hash table of 4096 partitions, each containing 32 pointers
- into the input, which is held in a 1 Megabyte buffer. At the start of
- each phrase, LZRW4 hashes the previous two characters to obtain a
- partition number in the range 0..4095. The 32 pointers in that
- partition are searched for the longest match, and the length (3 bits,
- representing the lengths 2,3,4,5,6,7,8 and 16) and the pointer number
- (5 bits, in the range 0..31) are coded and transmitted. LZRW4 uses a
- single control bit for each phrase to say whether a literal byte or a
- copy byte is coming next. These control bits are buffered as in the
- other LZRW* algorithms. The equal length of literal and copy items is
- an interesting feature of the algorithm which seems bound to be useful
- to someone someday!
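-
- To make the item format concrete, here is a small sketch of how a
- copy item might be packed into its eight data bits. The general shape
- (one control bit plus eight data bits, a 3-bit length covering the
- lengths 2..8 and 16, and a 5-bit pointer number) is as described
- above, but the exact bit positions and length-code mapping are my own
- assumptions and may differ from any official LZRW4 release.
-
- #include <stdio.h>
-
- /* Map a match length in {2..8,16} to a 3-bit code and back.
-    (Assumed mapping: 2..8 -> 0..6, 16 -> 7.) */
- static unsigned int length_to_code(unsigned int len)
- {
-    return (len == 16) ? 7 : len - 2;
- }
-
- static unsigned int code_to_length(unsigned int code)
- {
-    return (code == 7) ? 16 : code + 2;
- }
-
- /* Pack a copy item: 3-bit length code in the top bits, 5-bit pointer
-    number (0..31) in the bottom bits. */
- static unsigned char pack_copy(unsigned int len, unsigned int ptr_num)
- {
-    return (unsigned char) ((length_to_code(len) << 5) | (ptr_num & 0x1F));
- }
-
- int main(void)
- {
-    unsigned char item = pack_copy(6, 19);   /* copy 6 bytes, pointer 19 */
-    printf("length=%u pointer=%u\n",
-           code_to_length((unsigned int) (item >> 5)),
-           (unsigned int) (item & 0x1F));
-    return 0;
- }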
-
- To update the partition, LZRW4 swaps the winning pointer with the
- pointer halfway towards the front of the partition. If the winning
- pointer matched less than 4 bytes then a pointer to the current
- position (start of the Ziv) is shifted into the partition.
-
- See the code in Appendix A for the details. An understanding of the
- LZRW3-A algorithm may also assist.
-
- The results of LZRW4 are not startling, but demonstrate that the idea
- is working. By using contexts, LZRW4 gets away with using 3-bit
- lengths and 5-bit "indexes" whereas its ancestor algorithm LZRW3-A
- uses 4-bit lengths and 12-bit "indexes", yielding comparable
- compression. Here are some rough results on some Calgary corpus files
- (percentage remaining):
-
- LZRW4 LZRW3-A Unix Compress
- bib 39.5 44.1 41.8
- book1 51.4 54.7 43.2
- book2 41.2 45.5 41.1
- obj1 65.5 58.8 65.3
- obj2 43.4 43.8 52.1
- paper1 46.5 46.1 47.2
- progc 46.4 45.2 48.3
- trans 29.1 32.3 40.8
-
- I have experimented with lots of variants of LZRW4 but have not
- formalized them nor written them up. One particularly interesting idea
- is to subdivide each partition into subpartitions based on the first
- character of the phrase. LZRW3-A partition management techniques can
- then be used.
-
- Closing Thoughts
- ----------------
- I don't know if this idea is original or whether it will yield any
- benefit. However, I have not heard of any Ziv/Lempel/Markov hybrids to
- date, and the experiments above indicate at least that the idea works,
- if not that it has real competitive benefit. With the right tuning,
- the use of contexts could provide extra compression with little impact
- on speed (but usually a lot of impact on memory!). At the very least,
- it is another weapon in the data compression armoury.
-
- References
- ----------
- [Langdon83] Langdon G.G., "A Note on the Ziv-Lempel Model for
- Compressing Individual Sequences", IEEE Transactions on Information
- Theory, Vol. 29, No. 2, pp. 284-287, 1983.
-
- [Langdon84] Langdon G.G., "On Parsing Versus Mixed-Order Model
- Structures for Data Compression", IBM Research Report RJ-4163 (46091)
- 1/18/84, IBM Research Laboratory, San Jose, CA 95193, 1984.
-
- [Rissanen81] Rissanen J.J., Langdon G.G., "Universal Modeling and
- Coding", IEEE Transactions on Information Theory, Vol. 27, No. 1,
- pp. 12-23, 1981.
-
- [Williams91] Williams R.N., "Adaptive Data Compression", Kluwer
- Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell,
- Massachusetts 02061, 1991.
-
- [Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
- Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
- pp. 337-343, 1977.
-
- [Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via
- Variable-Rate Coding", IEEE Transactions on Information Theory,
- Vol. 24, No. 5, pp. 530-536, 1978.
-
- Appendix: LZRW4: Experimental Code
- ----------------------------------
- The following code is public domain and is provided "as is". As with
- most experimental code, it is poorly designed, structured, coded, and
- commented.
-
- #include <stdio.h>
- #include <stdlib.h>    /* exit() */
- #include <fcntl.h>     /* open() */
- #include <unistd.h>    /* read(), close() (was "#include <unix.h>") */
- #include "port.h"      /* UBYTE, UWORD, ULONG, TRUE */
-
- #define MY_BUFFER_SIZE (1024*1024)
- #define HASHLENGTH 4096
- #define HASHDEPTH 32
- #define CONTEXT 2
- #define MEDMATCH 8 /* Expressible lengths are [2..MEDMATCH,MAXMATCH]. */
- #define MAXMATCH 16
- #define NOSHIFTTHRESH 4 /* This or longer match and we don't shift */
-
- main(argc,argv)
- int argc;
- char **argv;
- {
- typedef UBYTE *queue[HASHDEPTH];
- /* Element zero is most recent element. */
-
- UBYTE buf[MY_BUFFER_SIZE+2000];
- queue hash[HASHLENGTH];
-
- char *input_file_name;
- int infile;
-
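- /* Running bit totals used to estimate the compression ratio. */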
- ULONG bits_in =0;
- ULONG bits_out =0;
- if (argc != 2)
- {
- printf("To run lzcontext alg: x <filename>\n");
- exit(0);
- }
-
- printf("Welcome to LZCONTEXT\n");
- printf("--------------------\n");
- printf("Buffer size = %u\n",MY_BUFFER_SIZE);
- printf("Hash table length = %u\n",HASHLENGTH );
- printf("Hash table depth = %u\n",HASHDEPTH );
- printf("Context = %u\n",CONTEXT );
- printf("Med match = %u\n",MEDMATCH );
- printf("Max match = %u\n",MAXMATCH );
- printf("No shift threshold = %u\n",NOSHIFTTHRESH );
-
- input_file_name=argv[1];
-
- infile =open(input_file_name ,O_RDONLY); /* O_BINARY*/
- while(TRUE)
- {
- ULONG i,j,k;
- UBYTE *p_next_byte;
- ULONG bytes_in;
-
- /* Read in a block of bytes to compress. */
- bytes_in=read(infile,buf,(unsigned int) MY_BUFFER_SIZE);
- if (bytes_in == 0)
- break;
-
- /* Initialize the hash table so all entries point to a defined string. */
- for (i=0;i<HASHLENGTH;i++)
- for (j=0;j<HASHDEPTH;j++)
- hash[i][j]=(UBYTE *) "0123456789ABCDEF";
-
-
- /* Get past the first few context bytes. */
- bits_in +=CONTEXT*8;
- bits_out+=CONTEXT*9;
- p_next_byte=buf+CONTEXT;
-
- /* Loop once for each item. */
- /* The end is sloppy but this is only approximate so who cares? */
- while (p_next_byte < (buf+bytes_in))
- {
- ULONG index;
- UWORD maxlength;
- UBYTE **this_queue;
- UBYTE *old_p_next_byte=p_next_byte;
- ULONG longest_match_index;
-
- index= (*(p_next_byte-2)<<8) ^ (*(p_next_byte-1));
-
- /* context of 3
- index= (*(p_next_byte-3)<<8) ^
- (*(p_next_byte-2)<<4) ^
- (*(p_next_byte-1) );
- */
-
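- /* Multiplicative hash of the two-byte context into 0..4095. */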
- index=(40543*index) >> 4;
- /* index&=0x3FFF; */
- index&=0xFFF;
- this_queue=&hash[index][0];
-
- /* Find the longest match in the 32 positions stored in the partition. */
- /* Note: Because we are only measuring entropy here, we never encode */
- /* the number of the best position (though we track it for the swap). */
- maxlength=0;
- for (j=0;j<HASHDEPTH;j++)
- {
- UBYTE *p = this_queue[j];
- for (k=0;k<MAXMATCH;k++)
- if (p_next_byte[k] != p[k])
- break;
- if (k>maxlength)
- {
- maxlength=k;
- longest_match_index=j;
- }
- }
-
- /* Both literals and copies output one control bit and eight
- data bits. They differ on how much input they consume. */
- if (maxlength == 0) maxlength=1;
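- /* Lengths 9..15 are not expressible, so truncate them to MEDMATCH (8). */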
- if (maxlength>MEDMATCH && maxlength<MAXMATCH) maxlength=MEDMATCH;
-
- p_next_byte+=maxlength;
- bits_in += 8*maxlength;
- bits_out += 9;
-
- /* If there was a match of 2 or greater, swap the matching
- element with the one half the distance to the head. */
- if (maxlength > 1)
- {
- UBYTE *t;
- ULONG half=longest_match_index/2;
- t=this_queue[half];
- this_queue[half]=this_queue[longest_match_index];
- this_queue[longest_match_index]=t;
- }
-
- /* Shift the queue and put the new value in the recent end. */
- if (maxlength < NOSHIFTTHRESH)
- {
- for (j=HASHDEPTH-1;j>0;j--)
- this_queue[j]=this_queue[j-1];
- this_queue[0]=old_p_next_byte;
- }
- }
- }
-
- close(infile);
- printf("Bytes in = %lu.\n",bits_in /8);
- printf("Bytes out = %lu.\n",bits_out/8);
- printf("Percentage remaining=%5.1f\n",100.0*((float) bits_out)/((float)
- bits_in));
- }
-
- void error(mess)
- /* Prints out the string and terminates. */
- char mess[];
- {
- printf("%s\n",mess);
- exit(0);
- }
-
- --<End of LZRW4 paper>--
-
-