- Path: sparky!uunet!decwrl!concert!duke!srt
- From: srt@duke.cs.duke.edu (Stephen R. Tate)
- Newsgroups: comp.compression
- Subject: Re: entropy
- Message-ID: <711820693@pike.cs.duke.edu>
- Date: 22 Jul 92 15:58:14 GMT
- References: <1992Jul22.144104.27576@usenet.ins.cwru.edu>
- Organization: Duke University Computer Science Dept.; Durham, N.C.
- Lines: 99
-
-
- In article <1992Jul22.144104.27576@usenet.ins.cwru.edu> daf10@po.CWRU.Edu (David A. Ferrance) writes:
- >Right now I compute the entropy of a file by summing the probability of
- >each character occurring times the log (base 2) of the reciprocal of that
- >probability. This gives me the theoretical minimum number of bits
- >needed to represent each character, on average (from what I understand).
- >What I would like to know is, does this number only work for first order
- >(1-ary?) compression techniques or does it serve also as a limit for
- >more complex techniques? If it does not serve for the more complex
- >techniques (and I suspect it doesn't), is there any way to determine
- >entropies for different orders?
-
- Warning: another one of my long-winded explanations of information
- theory follows!
-
- What you have calculated is the order-0 entropy of the sequence.
- (Incidentally, there is a lot of disagreement about the terminology
- here --- some people would call it the order-1 entropy.) This is
- the minimum average code length if every character is encoded
- independently of the other characters. Needless to say, this only
- applies to the most basic of compression methods.
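-
- If it helps, here is a small Python sketch of that order-0
- calculation (the file name is just a placeholder; this is only an
- illustration, not anybody's production code):
-
-     import math
-     from collections import Counter
-
-     def order0_entropy(data):
-         """Order-0 entropy in bits per character (byte)."""
-         counts = Counter(data)
-         n = len(data)
-         # sum over characters of p * log2(1/p)
-         return sum((c / n) * math.log2(n / c) for c in counts.values())
-
-     with open("sample.txt", "rb") as f:    # placeholder file name
-         print(order0_entropy(f.read()))
-
- On a long sample of English text this typically comes out somewhere
- around 4 to 5 bits per character.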
-
- If you're not scared of a little math, the following will explain
- a little more about entropy --- it's really not all that hard, but
- you need to know the basics of probability, and you might have to
- think about it for a while to let it really sink in.
-
- Just as you calculated the entropy from the probabilities of
- individual characters, you can calculated it based on larger groups of
- characters. In particular, let P_m(x_1,x_2,...,x_m) denote the
- probability of seeing the sequence of m characters x_1,x_2,...,x_m
- (pairs or triples of characters, for example). Now you can calculate
- the expected value of the log (base 2) of the reciprocal of these
- probabilities over all m-tuples, divide by m, and you get the
- per-character entropy based on m-character blocks. This is
- the minimum average code length possible per character if each m-tuple
- is coded independently. Clearly, (well, think about it) the entropy
- never increases as m increases (in other words, you can always do
- better -- or at least no worse -- by encoding bigger and bigger chunks
- at a time). Now the *entropy* (this is the true entropy) of a data
- source is:
-
- lim_{m->infinity} 1/m * E[ log(1/P_m(X_1,X_2,...,X_m)) ].
-
- This is where there is some math you might want to think about. The
- E[] stands for expected value, and here X_1,... are random variables.
- In other words, you just take the limit as m goes to infinity of the
- per-character entropy that I described above.
-
- This limit always exists if the data source is what is called "stationary"
- and "ergodic" --- I won't define these terms, but they're very common
- assumptions made in information theory.
-
- This true entropy is the absolute minimum average code length that
- can be obtained (assuming a stationary ergodic source, of course!)
- for *any* coding/compression technique, regardless of how complicated
- it is. Computing the true entropy is impossible from a finite
- sequence generated by the source, but it can be estimated by looking
- at reasonably sized m-tuples (for long samples of English text, you
- should be able to calculate up to m=8 or more).
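-
- The same empirical calculation extends directly to m-tuples. Here is
- a sketch (again purely an illustration --- with a finite sample the
- estimate for large m is biased downward, which is why you can only
- trust it up to m=8 or so even on long English samples):
-
-     import math
-     from collections import Counter
-
-     def block_entropy_per_char(data, m):
-         """Empirical 1/m * E[log2(1/P_m)] over all m-tuples in data."""
-         blocks = [data[i:i + m] for i in range(len(data) - m + 1)]
-         counts = Counter(blocks)
-         n = len(blocks)
-         h_m = sum((c / n) * math.log2(n / c) for c in counts.values())
-         return h_m / m
-
-     with open("sample.txt", "rb") as f:    # placeholder file name
-         data = f.read()
-     for m in range(1, 6):
-         print(m, block_entropy_per_char(data, m))
-
- The numbers should decrease (or at least not increase) as m grows,
- just as described above.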
-
- There are many other ways of calculating entropy --- the best is
- possible if you know exactly what the source model is (i.e., the
- Markov chain describing the data source). Then you can calculate the
- exact entropy --- you don't need to estimate. Needless to say, it is
- rather rare that you know the source model.
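-
- For example, for a stationary ergodic Markov chain with known
- transition matrix P, the exact entropy rate is
- -sum_i pi_i sum_j P_ij log2(P_ij), where pi is the stationary
- distribution. A sketch of that calculation (numpy assumed, and the
- two-state matrix is made up purely for illustration):
-
-     import numpy as np
-
-     def markov_entropy_rate(P):
-         """Entropy rate in bits/character of a stationary Markov
-         chain with transition matrix P (rows sum to 1)."""
-         # Stationary distribution: left eigenvector for eigenvalue 1.
-         vals, vecs = np.linalg.eig(P.T)
-         pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
-         pi = pi / pi.sum()
-         # H = -sum_i pi_i sum_j P_ij log2(P_ij), with 0 log 0 = 0.
-         logs = np.zeros_like(P)
-         logs[P > 0] = np.log2(P[P > 0])
-         return float(-(pi[:, None] * P * logs).sum())
-
-     P = np.array([[0.9, 0.1],
-                   [0.5, 0.5]])
-     print(markov_entropy_rate(P))    # about 0.56 bits/character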
-
- To throw one more monkey-wrench into things, let me add that the
- entropy lower bound is tight only if you assume that both encoder and
- decoder know the source model/encoding method. Usually that isn't so
- --- most compressors learn the source model as they encounter the
- input, and so cannot reach the entropy lower bound. In
- this case, you need to look at what is called the "stochastic
- complexity" of the sequence. In particular, if you want a universal
- compressor (i.e., one that works for any stationary ergodic source),
- the best you can ever do in bits/character is
-
- H + 1/2 * k * log(n)/n,
-
- where H is the entropy of the source, k is a constant that depends on
- how many states are in the source, and n is the number of characters
- in the sequence. There are some source models that can beat this, but
- it's really not even worth discussing --- the set of models that can
- beat this lower bound has measure zero, so let's just pretend they
- don't exist. In the limit (as you look at longer and longer strings)
- this approaches the entropy, but for finite length strings you can
- never quite reach the entropy.
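-
- To get a feel for the size of that overhead term, here is a tiny
- numerical sketch (the values of H and k are made up, and I'm taking
- the log to be base 2):
-
-     import math
-
-     def universal_bound(H, k, n):
-         """H + (k/2) * log2(n) / n, in bits per character."""
-         return H + 0.5 * k * math.log2(n) / n
-
-     # Made-up numbers: H = 2.0 bits/char, k = 256.
-     for n in (10**3, 10**6, 10**9):
-         print(n, universal_bound(2.0, 256, n))
-
- The overhead shrinks like log(n)/n, so for long inputs you get very
- close to H, but never all the way there.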
-
- That's about it for this week's installment of Dr. Tate's "information
- theory on Usenet"... :-) Next week's class will be held on the beach
- with the topic being "information content of the human mind after the
- consumption of alcoholic beverages".... Hope to see you there!
-
-
- --
- Steve Tate srt@duke.cs.duke.edu
- Dept. of Computer Science
- Duke University
- Durham, NC 27706
-