COMPACT

Section: User Commands (1)
Updated: 10/27/90
Index Return to Main Contents
 

NAME

compact, uncompact - compress/uncompress large files  

SYNOPSIS

compact-clqsSuvxX ] [ -a cmd ] [ -b cbs ] [ -B ebs ] [ -f cfile ] [ -F efile ] [ -m memory ] [ -n cblks ] [ -N eblks ] [ -p outpad ] [ -r runsize ] [ -t cbuf ] [ -T ebuf ] [ -w bitwidth ] [ -z cmd ]
-clqsSuvxX ] [ -a cmd ] [ -b cbs ] [ -B ebs ] [ -f cfile ] [ -F efile ] [ -n cblks ] [ -N eblks ] [ -p outpad ] [ -t cbuf ] [ -T ebuf ] [ -z cmd ]
 

DESCRIPTION

compact

is a modified Lempel-Ziv-Welsh data compression utility designed to do high speed data compression/decompression of large files, particularly system backups. The algorithm resembles the V.42bis compression engine, but was developed independently.

This program is similar to the well known compress program but runs faster, compresses large files better, and has a superior worst case expansion of incompressable data. The program supports multi-volume data sets, and can run as multiple concurrent tasks to overlap compute and I/O activities. These characteristics make it particularly useful when doing periodic system backups, especially with streaming tape drives.

The magic number for big-endian compacted files is C5A3; little endian files use the byte-swapped A3C5. The recommended suffix for compacted files is (.W).

This program may be used for research into data compression techniques only. Be sure to read WARNING section for further discussion.  

OPTIONS

In the options below, upper case letters refer to the expanded input or output stream, while opposite lower case letters refer to the compressed stream. All numbers may be suffixed by b (512 byte blocks), k (kilobytes), or m (megabytes).

-a cmd
Execute cmd before accessing each volume of a multi-volume set. This is useful when the media must be retensioned, erased or formatted before accessing.
-b cbs
Use a block size of cbs to read/write the compressed stream. Default is 1K.
-B ebs
Use a block size of ebs to read/write the expanded stream. Default is 1K.
-c
Compress data. This is default for compact.
-f cfile
Read/write compressed data using file/device cfile. This option is required for multi-volume compressed data sets.
-F efile
Read/write compressed data using file/device efile. This option is required for multi-volume expanded data sets.
-l
Lock the data compression program in memory to avoid page contention in systems where file I/O and page I/O use the same buffer pool. This is a priveleged function since blithely locking large sections of main memory can lead to system deadlocks.
-m memory
Limit the maximum amount of memory the program can use for data compression to memory bytes. This overrides -w below. Useful on machines where even 1 meg of working space is too large.
-n cblks
The volume size of the compressed file is cblks blocks of size cbs bytes. When this much data has been read/written, the program will prompt for another volume. This is useful with multiple-volume backups.
-N eblks
The volume size of the expanded file is eblks blocks of size ebs bytes.
-r runsize
Resynchronize data compression (and discard the compression table) every runsize bytes. This reduces compression efficiency, but allows resynchronization of the compression data stream after a corrupted section of the input file. The current default is 10 megabytes.
-p outpad
Pad the final output record to a multiple of outpad bytes. Raw devices often require multiples of 512 byte physical records. Default is 512 bytes in compress mode.
-q
Run in quiet mode. Do not report compression statistics or byte-swapping actions.
-s
Swap bytes on the compressed file. This option need only be selected when writing a tape, since the magic number in the compress file automatically swaps input bytes when reading a byte-swapped tape. The most common use of this option is to produce a tape that can be more efficiently read by a machine with opposite byte ordering; the -S option should be selected at the same time.
-S
Swap bytes on the expanded file. This option is automatically toggled when reading a compressed file written with opposite byte ordering.
-t cbuf
Spawn a child task to do compressed file I/O, and allocate cbuf shared memory buffers of size cbs for buffering.
-T ebuf
Spawn a child task to do expanded file I/O, and allocate cbuf shared memory buffers of size cbs for buffering.
-u
Uncompact data. This is default for uncompact.
-v
Print program version number.
-w width
Limit the bit width of an output code. Must be 17-26 bits. See text.
-x
Do read/write compressed file operations to a separately allocated unshared buffer. This avoids driver bugs on some systems which manifest themselves as hung devices or system panics.
-X
Do read/write expanded file operations to a separately allocated unshared buffer.
-z cmd
Execute cmd after each complete volume has been read or written. This is useful to position a tape or eject a floppy after each volume of a multi-volume media set.
 

THEORY OF OPERATION

The program replaces commonly encountered strings of 16-bit input data codes as 17-bit or larger output data codes. As with Lempel-Ziv-Welsh, the program allocates compression code entries in a fixed size table until the table fills. Unlike LZW (but like V.42bis) less recently encountered strings are replaced with more recently encountered strings when the table becomes full. This modification of the algorithm is more expensive in processor time, and requires more memory for the hash table, but generally reduces the size of the output file by about 10%.

The 16-bit input code size halves the number of operations required by the 8-bit compress algorithm, but uses more memory, and takes longer to fill the string table. The larger input code size also improves the worst-case expansion of incompressible data, since (with default 17-bit output) the worst-case expansion is only 1/16 or about 6%.

Because the program compresses 16-bit codes, it is susceptable to big-endian/little-endian byte ordering incompatibilities. In practice this is handled automatically when reading the compressed file, because the magic number is seen as byte-swapped; the program knows to byte-swap both input and output streams when reading the data back in.  

PERFORMANCE

On systems available to the author, with compressed file sizes of 1 megabyte or larger, compact runs 2.5 to 3 times faster than compress while producing an output file typically 10% smaller. These characteristics are largely insensitive to the maximum output code size. On highly compressible data, larger output code width further improves compression perhaps 10%; however with incompressable input data (eg *.Z files) compression deteriorates by about the same amount.

An output code width of 17 bits requires about a megabyte of memory. Each additional output bit about doubles this requirement. Hence output code sizes of 20 bits or more are impractical on most systems, but values up to 26 bits or so could theoretically be used.

On systems without previously compressed files, compact generally compresses anything (including program binaries) by 50% or better. With 35% compressed files (the author's system disk) data compression is still 42%; this compares to 31% feeding the same input stream to compress.

With the -t and -T the program uses multiple processes, pipes, and System V shared memory to overlap compute and I/O activities. When both options are selected, one child does input directly to shared memory, while another child does output directly from shared memory, and the parent does data compression from shared memory to shared memory without device I/O delays.

To achieve maximum performance on a system with adequate real memory, it is usually good to specify both input and output buffering. Buffering on the expanded side is useful to smooth out the inherent bursty performance of the filesystem; 100K to a megabyte or so works well here. Buffering on the compressed side should be sufficient to hold all data produced while the streaming tape stops, reverses and restarts; generally this takes 1-3 seconds.

At the time of this writing, peak performance is limited on most systems by the host processor. If input-output buffering is well tuned, processor utilization of 98% or better is common. The compression algorithm itself compresses 20-30 KB/second per processor VAX MIP available. However some system time is required to do I/O, so uniprocessor systems cannot achieve this speed in practice.

As an example, a particular SparcStation 1 (12 VAX MIPS) with a fast SCSI disk, 35% compressed files and a cartridge tape drive can do a 273 Megabyte tar backup to 159 MB of tape in 36 minutes using about 3 megabytes of working storage, 50% user and 13% system time. The remaining processor power is absorbed by the tar program to bring total system processor utilization to 99%. This runs the tape drive at 74% of rated speed. A similar uncompressed backup running the drive at full speed would take 46 minutes, except that the data wouldn't fit on the tape. The specific shell command used is:

       tar cf - /home | compact -T400 -B4k -t6 -b100k >/dev/rct0
 

WARNING

This program may be used only for research into data compression, as a case study in program optimization, or as an example of System V interprocess communication.

The author originally intended the program for practical day-to-day usage. However, in final preparation for program release, it was learned the program probably infringes one or more widely licensed patents in data compression. There is no choice but to release the program for research purposes only.

Compact was derived exclusively from the compress program and from the paper quoted in the references. It appears the author duplicated much of the research leading to V.42bis, and the discussion given there is useful in understanding this program. There remain significant differences the author has not analyzed in detail. Careful study of those differences might yield something new.  

SEE ALSO

compress(1)

A Technique for High Performance Data Compression, Terry A. Welch, computer, vol. 17, no. 6 (June 1984), pp. 8-19.  

BUGS

The program attempts to always read full input records. If for example a 10KB read returns only 4KB, the program tries again with 6KB. If less than 6KB is received, it will try again with the remaining buffer space. Some tape drivers may consider this antisocial behavior.

The program won't run on machines with 16 bit integers and 64K addressing. Intel 80286 users, eat your hearts out.

An alarmingly large number of systems seem unable to do floppy or tape I/O to shared memory, presumably because the DMA lock-down mechanism is confused by shared pages. On these systems, the x or X options must be used to prevent device hangs or system crashes.


 

Index

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
THEORY OF OPERATION
PERFORMANCE
WARNING
SEE ALSO
BUGS

This document was created by man2html, using the manual pages.
Time: 14:30:29 GMT, July 24, 2024