backgrnd.doc | 1993-12-31 | 14KB | 335 lines
7. BACKGRND: concepts, program design, compressor design, benchmarks
====================================================================
Although many aspects of UC look simple and straightforward, UC is a
very advanced program.
This document contains the following paragraphs:
- A. Inside UC
- B. Compression technology
- C. Damage protection technology
- D. Benchmarking
To jump to a certain paragraph, press the corresponding letter. To
jump to a specific document/chapter, press the corresponding digit.
See chapter 1 paragraph E for an overview of all documentation.
7.A INSIDE UC.
==============
This section gives a rough overview of the internals of UC. It can
give those who are interested more insight into what is going on
INSIDE UltraCompressor II (tm).
If you want more detailed technical information (e.g. for making add-on
tools) you can always contact us. Please note however we do not intend
to supply the source code of UC to third parties. Sample add-on tools
(including their source code) are available on request.
UC is programmed in C++ and assembly language (for the time critical
sections). It is divided into a number of modules, which are briefly
described in this section.
Overview
--------
GENERAL CONTROL
    main
    command interpreter
    super manager
    archive I/O

SYSTEM SERVICES             COMPRESSION SERVICES
    video                       neuro manager
    lowlevel I/O                ultra engine
    memory manager
    virtual memory manager

RELIABILITY SERVICES        OTHER
    error handler               lister
    damage protection           viewer
    guard
    test
General control
---------------
MAIN
The main module calls all initialization routines and takes care
of the configuration and help menus.
COMMAND INTERPRETER
The command line and script files are interpreted in this
module.
SUPER MANAGER
Finds out what should happen (e.g. ask the command interpreter,
scan the disk and the archive), and makes it happen (call the
datacompressor etc.). This module is the most complex module of
UC, but it delegates a lot of hard work to other modules.
ARCHIVE I/O
Here the archive file format is managed. Conversion between the
external format (a compressed flat file) and the internal format
(an object database) is performed, along with all kinds of
integrity checks.
System services
---------------
VIDEO
The keyboard, screen and output file (in case of redirection)
are managed here.
LOWLEVEL I/O
Here lowlevel file specifics and networking are handled. Also
this module takes care of file caching, an essential issue in
making UC perform fast on (floppy) disk and network access.
MEMORY MANAGER (MM)
This module manages base memory, XMS and EMS. The MM decides
which component of the program can get memory, and how much
memory it can get, to optimize overall performance.
VIRTUAL MEMORY MANAGER (VMM)
This module manages the 'persistent object database' containing
the central directory of an archive and numerous links needed to
make processing as efficient as possible. No matter the amount
of data in the VMM, it can always work (fast) with only a small
amount of memory. VMM uses the memory manager to optimize its
performance wherever possible. The VMM also supplies pipeline
services for decoupling translation (archive I/O) and compression
logic.
Compression services
--------------------
NEURO MANAGER
This component makes 'neuro models' of collections of files.
These neuro models allow UC to use similarities between
different files to compress all files even further. The
analysis phase of UC refers to the creation of neuro models
from the files that were found. The optimizing phase (which is
different from the optimize command) refers to the compression of
the neuro models. One special neuro model (the 'master') is built
into the compressor. This model is used to improve compression
of the neuro models. The master contains information about the
general structure of common file types such as documents,
spreadsheets, databases, sources, etc.
ULTRA ENGINE
The bare high speed compressor/decompressor. Based on a neuro
model the redundancy of a stream of data is analyzed and
compressed.
Reliability services
--------------------
ERROR HANDLER
The error handler traps DOS system errors and handles them in a
neat and consistent way. It often allows the user to go to DOS,
solve the problem, leave DOS, and continue where UC stopped.
The error handler makes it impossible for the user to specify
'ignore' (an unfortunate option DOS offers when something goes
wrong).
DAMAGE PROTECTION
This module handles damage protection, damage detection and
damage recovery.
GUARD
The guard continuously checks the internal integrity of UC. It
is often able to prevent damage caused by e.g. software
conflicts or hardware problems.
TEST (only available in beta versions)
Tests and interrogates various components of UC during
execution.
Other
-----
LISTER
Shows contents of archive in various formats.
VIEWER
Online help viewer.
7.B COMPRESSION TECHNOLOGY.
===========================
UC significantly outperforms existing datacompression technology. For
those interested in background information, we will discuss WHY and
HOW UC is capable of doing this.
AIP-NL would like to specifically thank the University of Utrecht, Drs.
Ing. D. Bezemer, Prof. Dr. J. van Leeuwen, Dr. P. van Oostrum and Drs.
Ing. N.E. de Vries for their significant participation in the design
of the UC compression engine. The detailed design is available on
request at the public library of the University of Utrecht.
To begin, we will give a brief explanation of AR002. Most modern
datacompressors (ARJ, LHA, ZIP, ZOO) are derivatives of AR002. AR002
was designed by Haruhiko Okumura.
AR002, a two phase compressor
-----------------------------
DATA <-> (#1, LZ) <-> (grouping) <-> (#2, Huffman) <-> COMPRESSED
In phase #1 (the 'Lempel-Ziv' phase) 'matches' are located. Matches
are strings of data that occur more than once. The string 'hello
there hello' can be converted to 'hello there '(-12,5), since
'hello' is matched.
In general ARJ, LHA, ZIP and ZOO can find matches with a maximum
distance of about 32700, and a maximum length of about 500 (rough
estimates).
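As an illustration of phase #1, here is a minimal brute-force match
search in Python. It is a sketch of the idea only (real engines use
hash chains or trees for speed, and this is not UC's or AR002's
actual code); it reproduces the (-12,5) match from the 'hello there
hello' example:

```python
def find_match(data, pos, max_dist=32700, max_len=500):
    """Find the longest earlier occurrence of the string starting
    at `pos` (naive Lempel-Ziv search, illustrative only)."""
    best_dist, best_len = 0, 0
    start = max(0, pos - max_dist)
    for cand in range(start, pos):
        length = 0
        while (pos + length < len(data)
               and length < max_len
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len

# at position 12 the earlier 'hello' is found 12 bytes back, length 5
print(find_match("hello there hello", 12))   # -> (12, 5)
```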
In phase #2 (the 'Huffman' phase) the output of phase #1 is divided
into blocks. For each block statistics are analyzed (for example
the compressor notices the character 'e' is used much more often
than 'q'). From this analysis an optimal translation table
('Huffman tree') is generated (e.g. 'e'=001 'q'=100100010). The
compressed block then consists of the Huffman tree and the
translated contents of the block.
The storage of the Huffman tree, at the beginning of each block, is
a complicated matter which we will not cover here.
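To make phase #2 concrete, here is a minimal Python sketch of
building a Huffman code for one block. It is illustrative only:
real archivers use canonical codes plus the tree-storage tricks
mentioned above, and the sample string is invented.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code for one block of data: repeatedly merge
    the two lightest subtrees, prefixing '0' and '1' to their codes."""
    freq = Counter(data)
    # each heap item: (weight, tie-breaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                 # degenerate one-symbol block
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("a wall follows all fallows")
# the frequent 'l' gets a code no longer than the rarer 'w'
assert len(codes["l"]) <= len(codes["w"])
```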
The number of different values phase #1 can generate is very large
(e.g. all 256 possible values of a byte, and about 32700 * 500
different match descriptions). Since it is very inefficient to make
a Huffman tree for all those possible values, the values are
grouped, and statistics are gathered for the groups instead of the
single values. LHA, ZOO, ARJ and ZIP all use various methods to
realize this grouping.
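The grouping can be pictured as a logarithmic bucketing of values:
the Huffman tree only sees the bucket number, and a few raw 'extra
bits' select the exact value within the bucket. The scheme below is
a hypothetical minimal version (the actual tables in ZIP, ARJ etc.
are hand-tuned, not this simple):

```python
def bucket(value):
    """Map a value >= 1 to (bucket, extra): the bucket is the
    position of the top bit, the extra bits are the remaining low
    bits. Statistics are then gathered per bucket only."""
    b = value.bit_length() - 1
    return b, value - (1 << b)

def unbucket(b, extra):
    """Inverse mapping: reconstruct the exact value."""
    return (1 << b) + extra

# a match distance of 300 lands in bucket 8 with extra value 44
assert bucket(300) == (8, 44)
assert unbucket(*bucket(300)) == 300
```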
UltraEngine
-----------
The UltraEngine is also an AR002 derivative, but with some
fundamental enhancements. These enhancements fall into two major
areas: the core engine and the neuro manager.
The neuro manager is the most significant enhancement. Ordinary
datacompression engines always start completely 'blank', knowing
absolutely nothing about the data they are going to compress. The
neuro-manager collects generic information about a group of files.
For example a collection of *.DOC files will all have similar
characteristics, and share common text strings like 'paper'. The
neuro-manager collects this common 'knowledge' in a neuro model.
This neuro model is used to make the core engine much smarter when
similar files are compressed.
UC has a neuro model ('master') built into the executable as well.
This master contains information about common file formats. It
contains for example information about spreadsheets, databases,
wordprocessors, programming languages, common words in English,
French, German and Dutch, etc. This built-in information improves
compression.
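The effect can be approximated with zlib's preset dictionary, a far
simpler mechanism than a neuro model but the same idea: prime the
compressor with strings the file group has in common. The shared
strings and document below are invented for the demonstration.

```python
import zlib

# shared 'knowledge' about a group of document files (made-up data)
shared = b"paper abstract introduction conclusion references"
doc = b"the paper has an introduction and a conclusion"

def deflate(data, zdict=None):
    """Compress with or without a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(9)
    else:
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                             zlib.Z_DEFAULT_STRATEGY, zdict)
    return c.compress(data) + c.flush()

plain = deflate(doc)
primed = deflate(doc, shared)
assert len(primed) < len(plain)     # the primed stream is smaller

# decompression needs the same dictionary
d = zlib.decompressobj(zdict=shared)
assert d.decompress(primed) == doc
```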
The core engine also has some enhancements over the basic AR002
engine.
The match engine has been replaced by a completely new one. This
new engine is capable of finding matches at a distance of 64000 and
with a maximum length of 30000.
Also the way in which blocks of data are gathered for further
Huffman compression is completely different. Instead of looking at
one block at a time, the core engine attempts to 'learn' from
previous blocks so less information is needed to encode statistical
information of a block.
Alternative
-----------
A collection of files can also be compressed further with the
often-used TAR+COMPRESS 'trick': first collect all files in a
single large file, and then compress this single large file at
once. Especially for large collections of small files, this
improves compression significantly. The same approach can be
achieved with ARJ, by first compressing with ARJ -m0 and then
compressing the resulting archive again with ARJ -jm1.
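The gain from the TAR+COMPRESS trick is easy to demonstrate with
zlib in Python (the file contents below are invented): compressing
many similar small files as one stream beats compressing each one
separately.

```python
import zlib

# fifty small, similar files (made-up contents)
files = [b"common header text " + bytes([65 + i % 26]) * 20
         for i in range(50)]

per_file = sum(len(zlib.compress(f, 9)) for f in files)  # ZIP-style
solid = len(zlib.compress(b"".join(files), 9))           # TAR+COMPRESS

assert solid < per_file    # the solid stream is much smaller
```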
It should be noted, however, that the UltraEngine approach works
much better. First of all, the compression is better. The neuro
manager is much smarter than just a simple concatenation. This can
be verified by compressing the output of ARJ -m0 with UC -tt. Also
the core engine is fine-tuned for optimal performance in
cooperation with neuro models. The 64000-byte match distance is one
of the essential enhancements.
Another reason why the UC approach is much better is random access.
Try to delete a single file from a collection of 5000 C files when
ARJ -m0/-jm1 has been used, or try to add or extract SOME files. UC
will usually do this much faster than ordinary archivers (a result
of better caching and a better file format), and remarkably faster
than a collect/compress archiver.
7.C DAMAGE PROTECTION TECHNOLOGY.
=================================
When an archive is 'damage protected' with the P command, it becomes
resistant to a certain amount of damage. This paragraph explains how
UC achieves this.
We will start with a mathematical operator called 'XOR'. XOR has the
following properties:
0 XOR 0 = 0; 0 XOR 1 = 1; 1 XOR 0 = 1; 1 XOR 1 = 0
if a1 XOR a2 XOR a3 XOR a4 XOR a5 = xx
then a1 XOR xx XOR a3 XOR a4 XOR a5 = a2
and a1 XOR a2 XOR a3 XOR xx XOR a5 = a4
etc.
This shows exactly the essence of damage protection: a1..a5 are the
bare data, and xx is special (damage recovery) information which can
be used to restore bare data if it has somehow been destroyed.
In practice UC XORs groups of 512 bytes (sectors) instead of single
bits. To determine whether a specific sector is correct or damaged,
a checksum of all sectors is included in the damage recovery
information.
About 1% of damage protection information is added to an archive. For
short files this is a single recovery sector (meaning up to 1 damaged
sector can be recovered). But if the archive is longer, multiple
damage recovery sectors are added, for example one for odd and one
for even sectors.
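A minimal Python sketch of the recovery scheme (the idea only; UC's
actual on-disk layout and the sector checksums are not shown):

```python
def xor_parity(sectors):
    """XOR equal-sized sectors together into one recovery sector."""
    parity = bytearray(len(sectors[0]))
    for sec in sectors:
        for i, b in enumerate(sec):
            parity[i] ^= b
    return bytes(parity)

def recover(sectors, bad_index, parity):
    """Rebuild one damaged sector from the others plus the parity."""
    rest = [s for i, s in enumerate(sectors) if i != bad_index]
    return xor_parity(rest + [parity])

sectors = [bytes([i] * 512) for i in range(5)]   # a1..a5
xx = xor_parity(sectors)                         # recovery sector
assert recover(sectors, 1, xx) == sectors[1]     # a2 restored
```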
7.D BENCHMARKING.
=================
Benchmarking is a very complicated matter. It is always possible to
create a legitimate looking benchmark which makes product A or product
B look better. A nice example of this can be seen by just looking at
some ads.
Some products are even optimized for running benchmarks. An
unfortunate habit which makes it even more difficult to compare
products.
We have made a successful attempt to optimize UC for daily 'real
life' use. This means it compresses fast and decompresses very fast.
But when archives are used intensively, the bare compression/
decompression speed is only one aspect of the whole picture. What is
also very important is how efficiently updates are performed. Especially
when 'the going gets tough', for example when a large archive resides
on a floppy disk, or when the archive is on a network, UC can be much
faster than other products.
An unfortunate problem in benchmarking archivers is the 'Calgary
Compression Corpus', which is often used to compare datacompressors.
Well, UC will do noticeably better than other compressors with this
test, yet the real power of the compression engine is not used in this
unrealistic test. When collections of 'real life' files (e.g. a large
collection of documents, or sources, or the average archive one finds
on a BBS) are compressed, the difference between UC and other products
becomes much more significant.
Of course, although performance is very important, it is only one
aspect in the usability of a product.
In general it is our opinion that you, the user, are the best person
to perform benchmarks. Try the product the way you would actually USE
it and see for yourself.