ENT
A Pseudorandom Number Sequence Test Program
This directory contains a program which applies various tests to
sequences of bytes stored in files and reports the results of those
tests. The program is useful for those evaluating pseudorandom number
generators for encryption and statistical sampling applications,
compression algorithms, and other applications where the information
density of a file is of interest.
The program is provided as random.tar.gz,
a GZIPped TAR archive containing an MS-DOS executable program
(compiled using Microsoft Quick C for Windows, creating an MS-DOS file
which does not require Windows to execute), and in source code form
along with a Makefile
to build the program under Unix.
The program produces output as follows on the standard output stream:
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (128 = random).
Monte Carlo value for PI is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
The values calculated are as follows:
- ENTROPY
- The information density of the contents of the file,
expressed as a number of bits per character. The
results above, which resulted from processing an image
file compressed with JPEG, indicate that the file is
extremely dense in information--essentially random.
Hence, compression of the file is unlikely to reduce
its size. By contrast, the C source code of the
program has entropy of 4.931743 bits per character,
indicating that optimal compression of the file would
reduce its size by 38%. [Hamming, pp. 104-108]
- CHI-SQUARE TEST
- The chi-square test is the most commonly used test for
the randomness of data, and is extremely sensitive to
errors in pseudorandom sequence generators. The
chi-square distribution is calculated for the stream
of bytes in the file and expressed as an absolute
number and a percentage which indicates how frequently
a truly random sequence would exceed the value
calculated. We interpret the percentage as the degree
to which the sequence tested is suspected of being
non-random. If the percentage is greater than 99% or
less than 1%, the sequence is almost certainly not
random. If the percentage is between 99% and 95% or
between 1% and 5%, the sequence is suspect.
Percentages between 90% and 95% and 5% and 10%
indicate the sequence is "almost suspect". Note that
our JPEG file, while very dense in information, is far
from random as revealed by the chi-square test.
Applying this test to the output of various
pseudorandom sequence generators is interesting. The
low-order 8 bits returned by the standard Unix
rand()
function, for example, yields:
Chi square distribution for 500000 samples is 0.01,
and randomly would exceed this value 99.99 percent
of the times.
While an improved generator [Park & Miller] reports:
Chi square distribution for 500000 samples is
212.53, and randomly would exceed this value 95.00
percent of the times.
Thus, the standard Unix generator (or at least the
low-order bytes it returns) is unacceptably
non-random, while the improved generator is much
better but still sufficiently non-random to cause
concern for demanding applications. Contrast both of
these software generators with the chi-square result
of a genuine random sequence created by timing
radioactive decay events.
Chi square distribution for 32768 samples is 237.05,
and randomly would exceed this value 75.00 percent
of the times.
See [Knuth, pp. 35-40] for more information on the
chi-square test.
- ARITHMETIC MEAN
- This is simply the result of summing the all the bytes in the file
and dividing by the file length. If the data are
close to random, this should be about 128. If the
mean departs from this value, the values are
consistently high or low.
- MONTE CARLO VALUE FOR PI
- Each successive sequence of four bytes is used as 16 bit X and Y
co-ordinates within a square. If the distance of the
randomly-generated point is less than the radius of a
circle inscribed within the square, the four-byte
sequence is considered a "hit". The percentage of
hits can be used to calculate the value of Pi. For
very large streams (this approximation converges very
slowly), the value will approach the correct value of
Pi if the sequence is close to random. A 32768 byte
file created by radioactive decay yielded:
Monte Carlo value for PI is 3.139648438 (error 0.06
percent).
- SERIAL CORRELATION COEFFICIENT
- This quantity measures the extent to which each byte in the file
depends upon the previous byte. For random sequences,
this value (which can be positive or negative) will,
of course, be close to zero. A non-random byte stream
such as a C program will yield a serial correlation
coefficient on the order of 0.5. Wildly predictable
data such as uncompressed bitmaps will exhibit serial
correlation coefficients approaching 1. See [Knuth,
pp. 64-65] for more details.
Distribution
This program is in the public domain. You can do anything you like
with it.
References
- [Hamming]
- Hamming, Richard W., Coding and Information Theory,
Englewood Cliffs NJ: Prentice-Hall, 1980.
- [Knuth]
- Knuth, Donald E., The Art of Computer Programming,
Volume 2/Seminumerical Algorithms, Reading MA:
Addison-Wesley, 1969.
- [Park & Miller]
- Park, Stephen K. and Miller, Keith W., "Random Number
Generators: Good Ones Are Hard to Find".
Communications of the ACM, New York, October 1988, p. 1192.
by John Walker
kelvin@fourmilab.ch
September 24th, 1992