Text File | 1984-12-09 | 14KB | 301 lines
Data Compression Techniques
Using SQ and USQ
Mark DiVecchio
San Diego IBM PC Users Group
In any computer system, efficient management of the file
storage space is important. The two programs SQ and USQ
reduce the size of data files by using the Huffman Data
Compression Algorithm.
A file is composed of a sequence of bytes. A byte is
composed of 8 bits. That byte can have a decimal value
between 0 and 255. A typical text file, like a C language
program, is composed of the ASCII character set (decimal 0 to
127). This character set only requires seven of the eight
bits in a byte. Just think -- if there were a way to use
that extra bit for another character, you could reduce the
size of the file by 12.5%. For those of you who use upper-case
characters only, you use only about 100 of the 256 possible
byte values. You could save 60% of your storage if
the bits not needed within each byte could be used for other
characters. What if you could encode the most frequently
used characters in the file with only one bit? For example,
if you could store the letter "e" (the most commonly used
letter in the English language) using only one bit : "1",
you could save 87.5% of the space that the "e" would normally
take if it were stored as the ASCII "0110 0101" binary.
The answer to these questions is the SQ and USQ programs.
The SQ program uses the Huffman coding technique to search
for the frequency of use of each of the 256 possible byte
patterns, and it then assigns a translation for each
character to a bit string. All of these bit strings are
placed end to end and written onto a disk file. The encoding
information is also put on the file since the USQ program
needs to know the character distribution of the original
file.
The USQ program reads in the encoding information and then
reads in the encoded file. It is a simple matter to scan the
encoded file and produce an output file which is identical to
the file that SQ started with.
Huffman Coding Technique
This is by far the most popular encoding technique in use
today. The Huffman encoding replaces fixed bit characters
with variable length bit strings. The length of the bit
string is roughly inversely proportional to the frequency of
occurrence of the character. For those of you inclined to
such symbolism:
    Length of bit string ~= -log2(character probability)
The implementation of the algorithm which we will discuss
encodes fixed bit strings of length 8.
This algorithm requires two passes through the input file.
The first pass builds a table of 256 entries showing the
frequency of each occurrence of each of the 256 possible
values for a byte of information.
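That counting pass amounts to a 256-entry tally. A minimal sketch follows; Python is used purely for illustration (the real SQ program long predates it):

```python
def count_frequencies(data: bytes) -> list[int]:
    """First pass: tally how often each of the 256 possible
    byte values occurs in the input."""
    table = [0] * 256
    for b in data:
        table[b] += 1
    return table

freq = count_frequencies(b"ABRACADABRA")
# Byte values that never appear keep a count of zero and will
# need no string association later.
```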
Once the counting is complete, the algorithm must decide
which bit strings to associate with each of the 256
characters that were found in the file. Note that if a
particular byte value was never used, no string association
is needed.
The second pass through the input file converts each byte
into its encoded string. Remember that when the output file
is created, the information used for encoding must also be
written on the file for use by the decoding program.
The decoding program reads the encoding information from the
file and then starts reading the bit strings. As soon as
enough bits are read to interpret a character, that character
is written onto the final output file. See the next two
sections on how SQ and USQ actually implement this.
Even though this article has primarily addressed ASCII input
files, there is nothing which restricts this algorithm to
ASCII. It will work on binary files (.COM or .EXE) as well.
But since the length of each encoded bit string shrinks only
as the frequency of occurrence of its 8 bit byte grows, a
binary file may not compress very much. This is because a
binary file most likely has a roughly uniform distribution
over the 256 values in a byte. A
machine language program is not like the English language
where the letter "e" is used far more than other letters. If
the distribution is uniform, the encoded bit strings will all
be the same length and the encoded file could be longer than
the original (because of the encoding information on the
front of the file). All of this has to be qualified, because
machine language programs tend to use a lot of "MOV"
instructions and have a lot of bytes of zeros so that
encoding .COM and .EXE files does save some disk space.
SQ
The SQ program is an example of the Huffman algorithm.
The first thing that SQ does is read through the input file
and create a distribution array for the 256 possible
characters. This array contains counts of the number of
occurrences of each of the 256 characters. The program
counts these values in a 16 bit number. It makes sure that,
if you are encoding a big file, counts do not exceed a 16 bit
value. This is highly unlikely, but it must be accounted
for.
At the same time, SQ removes strings of identical characters
and replaces them with the ASCII character DLE followed by a
character count of 2-255. SQ replaces the ASCII DLE with the
pair of characters: DLE DLE. This is not related to the
Huffman algorithm but just serves to compress the file a
little more.
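The run-length step might look like the sketch below; the exact on-disk field order is an assumption here, based only on the description above:

```python
DLE = 0x10  # ASCII Data Link Escape character

def rle_squeeze(data: bytes, max_run: int = 255) -> bytes:
    """Run-length step as described above: a run of 2-255
    identical bytes becomes the byte, DLE, then the count, and
    a literal DLE byte is doubled (DLE DLE). The field layout
    is an assumption; the real SQ format may differ."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == b and run < max_run:
            run += 1
        if b == DLE:
            out += bytes([DLE, DLE]) * run   # escape literal DLEs
        elif run >= 2:
            out += bytes([b, DLE, run])      # byte, DLE, count
        else:
            out.append(b)                    # single byte passes through
        i += run
    return bytes(out)
```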
Once SQ has scanned the input file, it creates a binary tree
structure containing this frequency occurrence information.
The most frequently occurring characters have the shortest
path from the root to the node, and the least frequently
occurring characters have the longest path. For example, if
your file were:
ABRACADABRA (a very simple and
magical example)
The table of frequency of occurrences would be:
Letter # of Occurrences
------ ---------------
A 5
B 2
C 1
D 1
R 2
all the rest 0
Since the letter "A" occurs most often, it should have the
shortest encoded bit string. The letters "C" and "D" should
have the longest. The other characters which don't appear in
the input file don't need to be considered.
SQ would create a binary tree to represent this information.
The tree might look something like this (for purposes of
discussion only):
              root            <--- computer trees are
              /  \                 always upside down!
             1    0           <--- this is called a node
             |   / \
             A  1   0         <--- this is called a branch
                |  / \
                B 1   0
                  |  / \
                  C 1   0
                    |   |
                    D   R     <--- this is a leaf
From this our encoded bit strings which are kept in a
translation table would be:
Table Entry Character Binary
----------- --------- ------
1 A 1
2 B 01
3 C 001
4 D 0001
5 R 0000
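A conventional Huffman build over the same input can be sketched as follows. Note that it produces an optimal tree, so the exact codes differ slightly from the simplified chain-shaped tree above, but "A" still gets the shortest string:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(data: bytes) -> dict[int, str]:
    """Build a Huffman code table from byte frequencies.
    The resulting optimal tree may differ from the illustrative
    tree in the article, but frequent bytes get short codes."""
    freq = Counter(data)
    tiebreak = count()  # keeps heap entries comparable
    heap = [(n, next(tiebreak), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                     # degenerate one-symbol file
        return {heap[0][2]: "0"}
    while len(heap) > 1:                   # merge the two rarest nodes
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
    codes: dict[int, str] = {}
    def walk(node, prefix):                # read codes off the tree
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes(b"ABRACADABRA")
```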
The output file would be:
A B R A C A D A B R A
----------------------------------
1 01 0000 1 001 1 0001 1 01 0000 1
(binary)
A1 31 A1
(hex)
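Using the translation table above, the bit packing can be checked mechanically (a sketch of the packing step only, not SQ's actual output format, which also carries header and tree information):

```python
def pack_bits(bitstring: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, most
    significant bit first, zero-padding the final byte."""
    out = bytearray()
    for i in range(0, len(bitstring), 8):
        out.append(int(bitstring[i:i + 8].ljust(8, "0"), 2))
    return bytes(out)

# The article's translation table for ABRACADABRA:
table = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}
bits = "".join(table[ch] for ch in "ABRACADABRA")
packed = pack_bits(bits)
# packed.hex() == "a131a1" -- the A1 31 A1 shown above
```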
We have reduced the size of your file from eleven bytes to
three bytes, a savings of about 73%. For this simple example,
things aren't really that rosy, since we must put the binary
tree encoding information onto the file as well, so the file
actually grew. But consider a file with the word ABRACADABRA
repeated 100,000 times. Now the encoding information is
going to be a very, very small percentage of the output file
and the file will shrink tremendously.
SQ opens the output file and writes out the binary tree
information. Then SQ rewinds the input file and rereads it
from the beginning. As it reads each character, it looks
into the translation table and outputs the corresponding bit
string.
SQ is a little more complicated than what I have outlined
since it must operate in the real world of hardware, but this
is a fairly complete description of the algorithm.
USQ
The USQ program is very straightforward. It reads in the
encoding information written out by SQ and builds the
identical binary tree that SQ used to encode the file.
USQ then reads the input file as if it were a string of bits.
Starting at the root of the tree, it traverses one branch of
the tree with each input bit. If it has reached a leaf, it
has a character and that character is written to the output
file. USQ then starts at the root again with the next bit
from the input file.
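That decoding loop can be sketched with a table lookup in place of a literal tree walk; because the codes are prefix-free, the two are equivalent:

```python
def decode_bits(bitstring: str, table: dict[str, str]) -> str:
    """Accumulate bits until they match a complete code, emit
    that character, then restart -- the table-lookup equivalent
    of walking from the root to a leaf for each character."""
    reverse = {code: ch for ch, code in table.items()}
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in reverse:      # reached a leaf
            out.append(reverse[buf])
            buf = ""            # back to the root
    return "".join(out)

table = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}
decoded = decode_bits("101000010011000110100001", table)
# decoded == "ABRACADABRA"
```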
What does it all mean?
Now that we understand the algorithm and a little about how
the SQ and USQ programs work, we can use that knowledge to
help us run our systems a little more efficiently. (So all
of this theory is worth something after all!).
1. Files must be above a threshold size, or else the output
file will be longer than the input file because of the
decoding information put at the beginning of the compressed
data. We don't know the exact size of the threshold because
the encoding binary tree information depends on the
distribution of the characters in a file. At least we know
to check the size of the encoded file after we run SQ to make
sure our file didn't grow.
2. Some files will not compress well if they have a uniform
distribution of byte values, for example, .COM or .EXE files.
This is because of the way SQ builds the tree. Remember that
bytes with the same frequency of occurrence are at the same
depth (usually) in the tree. So if all of the bytes have the
same depth, the output strings are all the same length.
3. SQ reads the input file twice. If you can, use RAM disk
at least for the input file and for both files if you have
the room. The next best case is to use two floppy drives,
one for input and one for output. This will cause a lot of
disk starts and stops but not much head movement. Worst case
is to use one floppy drive for both input and output. This
will cause a lot of head movement as the programs alternate
between the input and output files.
Other Compression Techniques
RUN-LENGTH ENCODING
Run-length encoding is a technique whereby sequences of
identical bytes are replaced by the repeated byte and a byte
count. As you might guess, this method is effective only on
very specialized files. One good candidate is a screen
display buffer. A screen is made up mostly of "spaces". A
completely blank line could be reduced from 80 bytes of
spaces to one space followed by a value of 80. To go from 80
bytes down to two bytes is a savings of almost 98%. You
might guess that for text files or binary files, this
technique does not work well at all.
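A bare-bones run-length encoder illustrating the blank-line case described above:

```python
def run_length_encode(data: bytes) -> list[tuple[int, int]]:
    """Replace each run of identical bytes with a
    (byte value, count) pair."""
    runs: list[tuple[int, int]] = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)   # extend current run
        else:
            runs.append((b, 1))               # start a new run
    return runs

blank_line = b" " * 80
# A fully blank 80-column line collapses to one (space, 80) pair.
```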
ADAPTIVE COMPRESSION
This technique replaces strings of characters with a single
code. For example, the string "ABRACADABRA" would be replaced
by a code. Typical algorithms use a 12 bit code. The algorithm
is unique in that it only requires a single pass through the
input file as the encoding is taking place. The current
incarnation of this procedure is called the LZW method (after
its co-inventors: A. Lempel, J. Ziv, and T. Welch). This
algorithm claims a savings of 66% on machine language files
and up to 83% on COBOL files.
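A minimal single-pass LZW compressor, sketched without the fixed 12 bit code-width cap that production implementations impose:

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW: grow a dictionary of byte strings while
    scanning the input once, emitting one code per longest
    string already in the dictionary."""
    table = {bytes([i]): i for i in range(256)}  # seed: single bytes
    next_code = 256
    out: list[int] = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                  # keep extending
        else:
            out.append(table[current])           # emit longest match
            table[candidate] = next_code         # learn the new string
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out
```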
Other Reading
If you are interested in reading more about data compression
techniques, you may be interested in these articles:
H.K. Reghbati, "An Overview of Data Compression Techniques,"
Computer Magazine, Vol. 14, No. 4, April 1981, pp. 71-76.
T.A. Welch, "A Technique for High-Performance Data
Compression," Computer Magazine, Vol. 17, No. 6, June 1984,
pp. 8-19.
J. Ziv and A. Lempel, "A Universal Algorithm for Sequential
Data Compression," IEEE Transactions on Information Theory,
Vol. IT-23, No. 3, May 1977, pp. 337-343.