NetNews Usenet Archive 1992 #16

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #16 / NN_1992_16.iso / spool / comp / compress / 2796 < prev next >

Wrap

Internet Message Format | 1992-07-25 | 2.6 KB

Path: sparky!uunet!zaphod.mps.ohio-state.edu!sdd.hp.com!usc!hacgate!rigel!abirenbo From: abirenbo@rigel.cel.scg.hac.com (Aaron Birenboim) Newsgroups: comp.compression Subject: Re: entropy Message-ID: <22552@hacgate.SCG.HAC.COM> Date: 24 Jul 92 18:49:11 GMT References: <1992Jul22.144104.27576@usenet.ins.cwru.edu> Sender: news@hacgate.SCG.HAC.COM Organization: Hughes Aircraft Colorado Engineering Labs Lines: 45 In article <1992Jul22.144104.27576@usenet.ins.cwru.edu> daf10@po.CWRU.Edu (David A. Ferrance) writes: > >Right now I compute the entropy of a file by summing the probability of >each character ocurring times the log (base2) of the reciprocal of that >probability. This gives me the theoretical minimum number of bits >needed to represent each character, on average (from what I understand). >What I would like to know is, does this number only work for first order >(1-ary?) compression techniques 2 points for david. good intuition. I only stumbled through my info-theory class, but this would seem to be the case to me. As soon as you start using some kind of memory, or markov chain, or dictionary the rules of the game change. > If it does not serve for the more complex >techniques (and I suspect it doesn't), is there any way to determine >entropies for different orders? i dunno. i'd like to hear some discussion on that though. I'd immagine that there are papers out covering this. anybody care to enlighten us? My guess as to how this works is as follows: Lets talk about entropy as bits/SYMBOL. If your symbols are all characters, you have the usual case. However, if, for example, you go to a DICTIONARY technique, you start getting SYMBOLS which are STRINGS of CHARACTERS. Now... the usual (first order ?!?) entropy law still holds. You can compute a theoretical minimum number fo bits per SYMBOL, but your symbol table is much more complex than before. This is why most good compression techniques can beat single-character entropy. Now... i'd like to see some kind of markov chain applied to strings of SYMBOLS... will this be somehow analagous to say encodeing the error from a predictor of order x. (like in a DPCM with prediction based of the last X symbols) This idea fits well to signals, but not so well to language where DPCM makes less sense... (but is conceivable). Will the info theory gurus please stand up ! ;-) -- Aaron Birenboim |I am just too tired to think abirenbo%rigel.cel.scg.hac.com@hac2arpa.hac.com|of a good quote. abirenbo@isis.cs.du.edu | - ChickenHead w (303) 344-6486 |