NetNews Usenet Archive 1992 #16

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #16 / NN_1992_16.iso / spool / comp / compress / 2776 < prev next >

Wrap

Internet Message Format | 1992-07-22 | 1.8 KB

Path: sparky!uunet!news.uiowa.edu!ns-mx!pyrite.cs.uiowa.edu From: jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) Newsgroups: comp.compression Subject: Re: entropy Message-ID: <13282@ns-mx.uiowa.edu> Date: 22 Jul 92 15:46:14 GMT References: <1992Jul22.144104.27576@usenet.ins.cwru.edu> Sender: news@ns-mx.uiowa.edu Lines: 27 From article <1992Jul22.144104.27576@usenet.ins.cwru.edu>, by daf10@po.CWRU.Edu (David A. Ferrance): > > Right now I compute the entropy of a file by summing the probability of > each character ocurring times the log (base2) of the reciprocal of that > probability. This gives me the theoretical minimum number of bits > needed to represent each character, on average (from what I understand). > What I would like to know is, does this number only work for first order > (1-ary?) compression techniques or does it serve also as a limit for > more complex techniques? This is a universal way to compute entropy, but the answer you get depends on how you compute the probability of a character occuring. The first order entropy is obtained by measuring probabilities in isolation -- the probability of a letter is taken out of context. To get the second order entropy, you look at the probability of each letter in the context of the immediately preceeding letter. For the third order, you look at each letter in the context of each pair of preceeding letters, and so on. The absolute entropy of a bit of text can only be determined if you use an infinite order model, taking into account every bit of context information available. Shannon did a very elegant experiment to measure this by having human subjects read bits of text up to random points and then guess the next letter. This is how he arrived at the estimate that the entropy of English text is a bit over one bit per character. Doug Jones jones@cs.uiowa.edu