Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!darwin.sura.net!Sirius.dfn.de!zrz.tu-berlin.de!math.fu-berlin.de!news.netmbx.de!Germany.EU.net!mcsun!uknet!acorn!armltd!dseal
From: dseal@armltd.co.uk (David Seal)
Newsgroups: comp.compression
Subject: Re: Compression technique mentioned in PCW
Message-ID: <5504@armltd.uucp>
Date: 19 Aug 92 13:19:20 GMT
References: <1992Aug18.155518@sees.bangor.ac.uk>
Sender: news@armltd.uucp
Distribution: comp
Organization: A.R.M. Ltd, Swaffham Bulbeck, Cambs, UK
Lines: 122

In article <1992Aug18.155518@sees.bangor.ac.uk> mather@sees.bangor.ac.uk
(Paul Mather) writes:

>A friend showed me a letter in a recent issue (Sep 1992) of Personal
>Computer World magazine. ...
>
>Anyway, the thrust of the letter was not about those but was to
>outline a method which seemed to illustrate that it was possible
>to compress data outside the laws of information theory ...
>
>Well, the method the person outlined was the familiar one of representing
>the bytes of a data stream as a decimal fraction N, where N is between 0
>and 1. ...

This is effectively what arithmetic compression does. You may be glad to
hear that it doesn't contravene the laws of information theory...

The reason why people tend to think it would is that they see it as
compressing a large amount of data down to just one item of information -
i.e. a number. The fallacy in this is that a number is *not* a unit of
information: you need to reduce everything to bits or some equivalent
measure. The number sent needs to be known *very* accurately in order to
reproduce the original text correctly, and representing a number accurately
requires a lot of bits.

In fact, you will find that the number of bits required to represent an
arithmetically compressed message is approximately equal to the theoretical
limit for the source model being used - i.e. in technical terms, to the
entropy of that message with respect to the source model used in the
arithmetic compression. (In theory, it would be almost exactly equal to the
entropy; in practice, it's very slightly higher, due to compromises used in
arithmetic compression programs to avoid the necessity for infinite
precision arithmetic.)

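To see the bit-counting at work, here is a toy Python sketch of the
interval-narrowing idea, using exact fractions and a made-up three-symbol
source model (real coders use finite precision instead). The final
interval's width is the product of the symbol probabilities, so pinning a
number down to within it takes about -log2(width) bits - exactly the
entropy figure above:

  from fractions import Fraction
  from math import ceil, log2

  def interval_for(message, model):
      """Narrow [low, low + width) exactly, as an idealised arithmetic
      coder would. model maps a symbol to (cumulative prob, prob)."""
      low, width = Fraction(0), Fraction(1)
      for sym in message:
          cum, p = model[sym]
          low += width * cum
          width *= p
      return low, width

  # A hypothetical order-0 model: P(a)=1/2, P(b)=1/4, P(c)=1/4.
  model = {'a': (Fraction(0), Fraction(1, 2)),
           'b': (Fraction(1, 2), Fraction(1, 4)),
           'c': (Fraction(3, 4), Fraction(1, 4))}

  low, width = interval_for("aabacab", model)
  print(width, -log2(width))       # 1/1024, 10.0 bits of entropy
  print(ceil(-log2(width)) + 1)    # about 11 bits to transmit in practice
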
>However, in the compression scheme mentioned in the letter, instead of
>using a stout piece of oak :-), the decimal fraction was compressed by
>determining two numbers, a and b, such that a/b = N. To transmit the
>original file, one need only transmit a and b. Decompression would
>involve calculating a/b and enumerating the result.
>
>A further modification was suggested to the algorithm whereby the
>file is partitioned into arbitrary sized chunks and each chunk treated
>as a decimal number between 0 and 1 as before. ...
>
>When my friend asked me what I thought of the proposed method, my gut
>reaction was to say that finding the values of a and b sounded pretty
>intractable to me. It also occurred to me that it was likely that for
>some N the enumeration of the integers a and b could be longer than N
>itself but my number theory is very rusty so I didn't speculate on that
>to my friend. :-)

Finding a and b is not especially intractable, using continued fraction
techniques. This is assuming you keep the size of the chunks down to the
point where multi-precision arithmetic is reasonably feasible - this allows
quite long chunks, but not enormously long ones. Thousands of bits are OK,
millions are not.

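For what it's worth, Python's standard library provides this: the
Fraction.limit_denominator method uses the continued fraction expansion of
N to find the closest a/b with b below a given bound. A sketch, with a
made-up 24-bit chunk:

  from fractions import Fraction

  # A hypothetical 24-bit chunk, read as a binary fraction N in [0, 1).
  chunk = 0b101101110001010011011101
  N = Fraction(chunk, 1 << 24)    # reduced to lowest terms automatically

  # Fast even for chunks of thousands of bits - but only the final,
  # full-sized denominator reproduces the chunk *exactly*.
  for bound in (10, 1000, 1 << 24):
      approx = N.limit_denominator(bound)
      print(approx, float(abs(approx - N)))

  # The exact a/b needs about as many bits as the chunk itself:
  print(N.numerator.bit_length() + N.denominator.bit_length())
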
But the real problem is the second one you've raised: the integers a and b
are going to require at least as many bits to represent as the original
fraction. To see an example of this, look at what numbers you can represent
with a and b both containing 3 bits. We're only interested in fractions
between 0 and 1 inclusive. We're obviously not interested in b=0, so we can
represent values of a from 0 to 7, of b from 1 to 8. This allows us to
represent the following fractions, in ascending order:

0 1 1 1 1 1 2 1 3 2 3 1 4 3 5 2 5 3 4 5 6 7 1
- - - - - - - - - - - - - - - - - - - - - - -
1 8 7 6 5 4 7 3 8 5 7 2 7 5 8 3 7 4 5 6 7 8 1

The smallest gap between two of these numbers is 1/56 (between 1/8 and 1/7,
and between 6/7 and 7/8); the largest gap is 1/8 (between 0/1 and 1/8, and
between 7/8 and 1/1). So by rounding to the nearest of these fractions, we
are using 6 bits to represent a fraction between 0 and 1 with an error that
can be as bad as 1/16, and even in the most densely covered area can be as
bad as 1/112. But if we simply use the bits to represent the binary number
0.xxxxxx1 and round the desired fraction to the nearest such number, we
never get an error of more than 1/128.

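These figures are easy to check mechanically; a brute-force Python sketch:

  from fractions import Fraction

  # Every fraction reachable with a 3-bit a (0..7) over a 3-bit b (1..8).
  vals = sorted({Fraction(a, b) for a in range(8)
                 for b in range(1, 9) if a <= b})
  gaps = [hi - lo for lo, hi in zip(vals, vals[1:])]
  print(len(vals))    # 23 distinct fractions
  print(min(gaps))    # 1/56
  print(max(gaps))    # 1/8
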
These results generalise: if you use 2N bits as a straightforward binary
expansion, you can represent any fraction with a maximum error of
1/(2^(2N+1)); if you instead use them as an N-bit a divided by an N-bit b,
the worst case error is 1/(2^(N+1)) and even in the best-covered areas, the
worst case error is 1/(2^(N+1) * (2^N - 1)). So even when it's at its best,
the a/b method represents fractions a bit less accurately than a
straightforward binary expansion; at its worst, it represents them much less
accurately. The conclusion is that the a/b form is *not* an efficient use of
bits to represent a number.

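The same brute-force check confirms these formulas for small N (an
enumeration, not a proof):

  from fractions import Fraction

  for n in range(2, 6):
      top = 1 << n
      vals = sorted({Fraction(a, b) for a in range(top)
                     for b in range(1, top + 1) if a <= b})
      gaps = [hi - lo for lo, hi in zip(vals, vals[1:])]
      # Worst-case error is half the largest gap; the best-covered
      # area's worst case is half the smallest gap.
      assert max(gaps) / 2 == Fraction(1, 1 << (n + 1))
      assert min(gaps) / 2 == Fraction(1, (1 << (n + 1)) * (top - 1))
      # Compare with 1/2^(2n+1): the error if the same 2n bits were
      # used as a plain binary expansion instead.
      print(n, max(gaps) / 2, min(gaps) / 2, Fraction(1, 1 << (2*n + 1)))
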
The reason why it is inefficient is that it is redundant in two ways: first,
pairs with a>b don't occur because they don't represent fractions between 0
and 1, and secondly, pairs in which a and b have a common factor don't occur
because the fractions are generally used in lowest terms. Thus the 6 bits
used for a and b above can only represent 23 distinct useful values, whereas
6 bits used for a binary expansion represent 64 distinct useful values.

By being a bit more clever about it, you can increase the bounds on a and b
a bit further and represent a few more numbers. E.g. let the three bits for
b represent a number in the range 2 to 9, and the three bits for a represent
a number in the range 1 to 8 if b=9, 0 to 7 otherwise. Then you can
represent the following 29 fractions:

0 1 1 1 1 1 2 1 2 1 3 2 3 4 1 5 4 3 5 2 5 3 7 4 5 6 7 8 2
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2 9 8 7 6 5 9 4 7 3 8 5 7 9 2 9 7 5 8 3 7 4 9 5 6 7 8 9 2

This represents the fractions better, and occasionally achieves better
accuracy than the binary expansion technique (the minimum gap is now 1/72
compared with 1/64). But the odds are still that it will represent a number
less accurately than the binary expansion. It's still inefficient - i.e.
we're still not using all of the available combinations of 6 bits (in fact,
we're using less than half of them).

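Again, brute force confirms the count and the new minimum gap:

  from fractions import Fraction

  # The modified code: b covers 2..9; a covers 1..8 when b=9, else 0..7.
  vals = sorted({Fraction(a, b) for b in range(2, 10)
                 for a in (range(1, 9) if b == 9 else range(8))
                 if a <= b})
  gaps = [hi - lo for lo, hi in zip(vals, vals[1:])]
  print(len(vals))    # 29 distinct fractions
  print(min(gaps))    # 1/72, versus 1/64 steps for a 6-bit expansion
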
Further improvements are possible, but it should be fairly clear that the
a/b approach is doomed: even if we get it to represent 64 distinct
fractions, spread uniformly over the [0,1] interval in order to minimise the
worst-case error, all we've done is equalled what the binary expansion
technique can do!

David Seal
dseal@armltd.co.uk

All opinions are mine only...