NetNews Usenet Archive 1992 #18

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #18 / NN_1992_18.iso / spool / comp / compress / 3047 < prev next >

Wrap

Internet Message Format | 1992-08-19 | 6.9 KB

Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!darwin.sura.net!Sirius.dfn.de!zrz.tu-berlin.de!math.fu-berlin.de!news.netmbx.de!Germany.EU.net!mcsun!uknet!acorn!armltd!dseal From: dseal@armltd.co.uk (David Seal) Newsgroups: comp.compression Subject: Re: Compression technique mentioned in PCW Message-ID: <5504@armltd.uucp> Date: 19 Aug 92 13:19:20 GMT References: <1992Aug18.155518@sees.bangor.ac.uk> Sender: news@armltd.uucp Distribution: comp Organization: A.R.M. Ltd, Swaffham Bulbeck, Cambs, UK Lines: 122 In article <1992Aug18.155518@sees.bangor.ac.uk> mather@sees.bangor.ac.uk (Paul Mather) writes: >A friend showed me a letter in a recent issue (Sep 1992) of Personal >Computer World magazine. ... > >Anyway, the thrust of the letter was not about those but was to >outline a method which seemed to illustrate that it was possible >to compress data outside the laws of information theory ... > >Well, the method the person outlined was the familiar one of representing >the bytes of a data stream as a decimal fraction N, where N is between 0 >and 1. ... This is effectively what arithmetic compression does. You may be glad to hear that it doesn't contravene the laws of information theory... The reason why people tend to think it would is that they see it as compressing a large amount of data down to just one item of information - i.e. a number. The fallacy in this is that a number is *not* a unit of information: you need to reduce everything to bits or some equivalent measure. The number sent needs to be known *very* accurately in order to reproduce the original text correctly, and representing a number accurately requires a lot of bits. In fact, you will find that the number of bits required to represent an arithmetically compressed message is approximately equal to the theoretical limit for the source model being used - i.e. in technical terms, to the entropy of that message with respect to the source model used in the arithmetic compression. (In theory, it would be almost exactly equal to the entropy; in practice, it's very slightly higher, due to compromises used in arithmetic compression programs to avoid the necessity for infinite precision arithmetic.) >However, in the compression scheme mentioned in the letter, instead of >using a stout piece of oak :-), the decimal fraction was compressed by >determining two numbers, a and b, such that a/b = N. To transmit the >original file, one need only transmit a and b. Decompression would >involve calculating a/b and enumerating the result. > >A further modification was suggested to the algorithm whereby the >file is partitioned into arbitrary sized chunks and each chunk treated >as a decimal number between 0 and 1 as before. ... > >When my friend asked me what I thought of the proposed method, my gut >reaction was to say that finding the values of a and b sounded pretty >intractable to me. It also occurred to me that it was likely that for >some N the enumeration of the integers a and b could be longer than N >itself but my number theory is very rusty so I didn't speculate on that >to my friend. :-) Finding a and b is not especially intractable, using continued fraction techniques. This is assuming you keep the size of the chunks down to the point where multi-precision arithmetic is reasonably feasible - this allows quite long chunks, but not enormously long ones. Thousands of bits are OK, millions are not. But the real problem is the second one you've raised: the integers a and b are going to require at least as many bits to represent as the original fraction. To see an example of this, look at what numbers you can represent with a and b both containing 3 bits. We're only interested in fractions between 0 and 1 inclusive. We're obviously not interested in b=0, so we can represent values of a from 0 to 7, of b from 1 to 8. This allows us to represent the following fractions, in ascending order: 0 1 1 1 1 1 2 1 3 2 3 1 4 3 5 2 5 3 4 5 6 7 1 - - - - - - - - - - - - - - - - - - - - - - - 1 8 7 6 5 4 7 3 8 5 7 2 7 5 8 3 7 4 5 6 7 8 1 The smallest gap between two of these numbers is 1/56 (between 1/8 and 1/7, and between 6/7 and 7/8); the largest gap is 1/8 (between 0/1 and 1/8, and between 7/8 and 1/1). So by rounding to the nearest of these fractions, we are using 6 bits to represent a fraction between 0 and 1 with an error that can be as bad as 1/16, and even in the most densely covered area can be as bad as 1/112. But if we simply use the bits to represent the binary number 0.xxxxxx1 and round the desired fraction to the nearest such number, we never get an error of more than 1/128. These results generalise: if you use 2N bits as a straightforward binary expansion, you can represent any fraction with a maximum error of 1/(2^(2N+1)); if you instead use them as an N-bit a divided by an N-bit b, the worst case error is 1/(2^(N+1)) and even in the best-covered areas, the worst case error is 1/(2^(N+1) * (2^N - 1)). So even when it's at its best, the a/b method represents fractions a bit less accurately than a straightforward binary expansion; at its worst, it represents them much less accurately. The conclusion is that the a/b form is *not* an efficient use of bits to represent a number. The reason why it is inefficient is that it is redundant in two ways: first, pairs with a>b don't occur because they don't represent fractions between 0 and 1, and secondly, pairs in which a and b have a common factor don't occur because the fractions are generally used in lowest terms. Thus the 6 bits used for a and b above can only represent 23 distinct useful values, whereas 6 bits used for a binary expansion represent 64 distinct useful values. By being a bit more clever about it, you can increase the bounds on a and b a bit further and represent a few more numbers. E.g. let the three bits for b represent a number in the range 2 to 9, and the three bits for a represent a number in the range 1 to 8 if b=9, 0 to 7 otherwise. Then you can represent the following 29 fractions: 0 1 1 1 1 1 2 1 2 1 3 2 3 4 1 5 4 3 5 2 5 3 7 4 5 6 7 8 2 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 2 9 8 7 6 5 9 4 7 3 8 5 7 9 2 9 7 5 8 3 7 4 9 5 6 7 8 9 2 This represents the fractions better, and occasionally achieves better accuracy than the binary expansion technique (the minimum gap is now 1/72 compared with 1/64). But the odds are still that it will represent a number less accurately than the binary expansion. It's still inefficient - i.e. we're still not using all of the available combinations of 6 bits (in fact, we're using less than half of them). Further improvements are possible, but it should be fairly clear that the a/b approach is doomed: even if we get it to represent 64 distinct fractions, spread uniformly over the [0,1] interval in order to minimise the worst-case error, all we've done is equalled what the binary expansion technique can do! David Seal dseal@armltd.co.uk All opinions are mine only...