NetNews Usenet Archive 1992 #16

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #16 / NN_1992_16.iso / spool / comp / compress / 2774 < prev next >

Wrap

Internet Message Format | 1992-07-22 | 5.3 KB

Path: sparky!uunet!decwrl!concert!duke!srt From: srt@duke.cs.duke.edu (Stephen R. Tate) Newsgroups: comp.compression Subject: Re: entropy Message-ID: <711820693@pike.cs.duke.edu> Date: 22 Jul 92 15:58:14 GMT References: <1992Jul22.144104.27576@usenet.ins.cwru.edu> Organization: Duke University Computer Science Dept.; Durham, N.C. Lines: 99 In article <1992Jul22.144104.27576@usenet.ins.cwru.edu> daf10@po.CWRU.Edu (David A. Ferrance) writes: >Right now I compute the entropy of a file by summing the probability of >each character ocurring times the log (base2) of the reciprocal of that >probability. This gives me the theoretical minimum number of bits >needed to represent each character, on average (from what I understand). >What I would like to know is, does this number only work for first order >(1-ary?) compression techniques or does it serve also as a limit for >more complex techniques? If it does not serve for the more complex >techniques (and I suspect it doesn't), is there any way to determine >entropies for different orders? Warning: another one of my long-winded explanations of information theory follows! What you have calculated is the order-0 entropy of the sequence. (Incidentally, there is a lot of disagreement about the terminology here --- some people would call it the order-1 entropy.) This is the minimum average code length if every character is encoded independently of the other characters. Needless to say, this only applies to the most basic of compression methods. If you're not scared of a little math, the following will explain a little more about entropy --- it's really not all that hard, but you need to know the basics of probability, and you might have to think about it for a while to let it really sink in. Just as you calculated the entropy from the probabilities of individual characters, you can calculated it based on larger groups of characters. In particular, let P_m(x_1,x_2,...,x_m) denote the probability of seeing the sequence of m characters x_1,x_2,...,x_m (pairs or triples of characters for example.) Now you can calculate the expected value of the log (base 2) of the reciprocal of these probabilities over all m-tuples, divide by m, and you get the per-character entropy of the sequence of m-ary subsequences. This is the minimum average code length possible per character if each m-tuple is coded independently. Clearly, (well, think about it) the entropy never increases as m increases (in other words, you can always do better -- or at least no worse -- by encoding bigger and bigger chunks at a time). Now the *entropy* (this is the true entropy) of a data source is: lim_{m->infinity} 1/m * E[ log(1/P_m(X_1,X_2,...,X_m)) ]. This is where there is some math you might want to think about. The E[] stands for expected value, and here X_1,... are random variables. In other words, you just take the limit as m goes to infinity of the per-character entropy that I described above. This limit always exists if the data source is what is called "stationary" and "ergodic" --- I won't define these terms, but they're very common assumptions made in information theory. This true entropy is the absolute minimum average code length that can be obtained (assuming a stationary ergodic source, of course!) for *any* coding/compression technique, regardless of how complicated it is. Computing the true entropy is impossible from a finite sequence generated by the source, but it can be estimated by looking at reasonably sized m-tuples (for long samples of English text, you should be able to calculate up to m=8 or more). There are many other ways of calculating entropy --- the best is possible if you know exactly what the source model is (i.e., the Markov chain describing the data source). Then you can calculate the exact entropy --- you don't need to estimate. Needless to say, it is rather rare that you know the source model. To throw one more monkey-wrench in to things, let me add that the entropy lower bound is tight only if you assume that both encoder and decoder know the source model/encoding method. In most cases this is not the case --- most compressors learn the source model as they encounter the input, and cannot reach the entropy lower bound. In this case, you need to look at what is called the "stochastic complexity" of the sequence. In particular, if you want a universal compressor (i.e., one that works for any stationary ergodic source), the best you can ever do in bits/character is H + 1/2 * k * log(n)/n, where H is the entropy of the source, k is a constant that depends on how many states are in the source, and n is the number of characters in the sequence. There are some source models that can beat this, but it's really not even worth discussing --- the set of models that can beat this lower bound has measure zero, so let's just pretend they don't exist. In the limit (as you look at longer and longer strings) this approaches the entropy, but for finite length strings you can never quite reach the entropy. That's about it for this weeks installment of Dr. Tate's "information theory on Usenet"... :-) Next week's class will be held on the beach with the topic being "information content of the human mind after the consumption of alcoholic beverages".... Hope to see you there! -- Steve Tate srt@duke.cs.duke.edu Dept. of Computer Science Duke University Durham, NC 27706