home *** CD-ROM | disk | FTP | other *** search
- ACB, DMC, PPM, LZ - CLASSIFICATION, IDEA, COMPARISON
-
- George Buyanovsky
-
- 1. Introduction
-
- This article provides a brief overview of the existing technologies for
- compression of information and describes the place taken by ACB compression.
- The overview discusses only the technologies for building compression
- models: LZ, PPM, DMC, ACB; coding is not discussed.
-
- LZ has two basic modifications, LZ77 and LZ78 along with a large number of
- variations. At present LZ models are widely used (zip,rar,arj,lzh,lzw...).
- The theoretical approach for LZ models was suggested in 1977 by
- Lempel A. Ziv J. (LZ), with software realization in the early eighties.
-
- PPM, or context modelling, is based on theoretical work conducted from 1986-95
- (T.C.Bell, J.G.Cleary, I.H.Witten, W.J.Teahan, A.Moffat ....).
- Software realization came about in the early nineties (HA, X1, ....).
-
- DMC - Several DMC, or Dynamic Markov Coding, are known, in particular,
- Ross Williams'DHPC algorithm, as well as the DMC algorithm of
- Cormack & Horspool, which uses many heuristics methods (as well as PPM).
- I am not aware of industrial archivators using the DMC-technology.
-
-
- 2. LZ, PPM, ACB<-->DMC
-
-
- When two persons speak with one another, they have to formulate their
- ideas in detail at the beginning of their conversation, but the longer
- their conversation lasts, less detail is required. Why does this happen?
- Many ideas have been expressed and are understood by conversants without
- detail.
-
- 2.1 LZ-compression
-
- LZ-compression substitutes the initial text with references to a dictionary;
- it seems that this scheme resembles the methodology used by two interlocutors.
-
- However, with the growth of the dictionary, the number of bits
- necessary for formulating the reference grows proportionally by a binary
- logarithm to the size of the dictionary. The length of the phrases
- in the dictionary (in the simplest case, a binary tree) at the beginning
- considerably surpasses the growth of the length of the references, but after
- some saturation, the speed of their growth asymptotically tends to
- logarithmic dependence.
- The optimal size of the dictionary varies for different types of data;
- the more variable are the data, the smaller the optimal size of the
- dictionary.
-
- 2.2 PPM - algorithms (context modelling)
-
- The idea of context modeling is based on the fact that distribution of
- possibilities in the alphabet depends on the nearest context, in other
- words, letters are more likely to appear in a particular pattern, that
- is, next to or near other particular letters. For this technology, there
- also exist principal restrictions for the increase of a sliding frame,
- or a moving processor of data or letters, for which a context model is
- built. A small sliding frame with a corresponding meagre context model
- is optimal on variable data, for which the problem of zero frequency is
- especially acute. In the course of increasing the size of a sliding frame,
- more and more various information is processed (e.g., a text in French,
- then a text in German and then a text in Russian enters this frame),
- with the result that the distribution of possibilities widens and the
- effectiveness of context modelling (with short context) decreases quickly.
- With the increase of the context length, costs also increase exponentially.
- The transition to context-mixed models has improved the situation to some
- extent; however, the absence of theoretically substantiated schemes of
- intermingling probabilities is compensated by a large number of heuristics,
- which takes us back to the times of alchemy.
-
- 2.3 ACB-compression
-
- However, the brain of interlocutors in a mysterious fashion can overcome
- this barrier and can use all accumulated information through a mechanism
- called associative memory.
-
- The general principle of ACB-compression is very simple: the ACB-algorithm
- puts on glasses with an associative filter and looking at the past, it sees
- not the whole frame but rather only fragments, which are close by context
- to the nearest context. Fragments or phrases of the received picture differ
- much by quality; phrases close by context are distinct in their relation to
- other phrases, less close phrases are less distinct or are not distinct at
- all. In this case, the ACB-algorithm has ideal conditions for work, in that
- it works more quickly by considering larger units, moreover, it can
- distinguish the value of phrases, through consideration of the probability
- of the appearance or use. In the case of an unlimited increase of a frame,
- ACB-compression always ensures an optimal size of a context dictionary.
-
- Some ditailes about ACB-compression see APPENDIX_A & APPENDIX_B
-
- 2.4 DMC - Dynamic Markov Coding
-
- Probabilistic models with a finite number of states can be described by a
- finite automate. A set of states S(i) and a set of probabilities of
- transition P(i,j) from the state i into the state J are called Markov's
- models. Markov's Dynamic Coding (DMC) is of practical interest, in that
- it works adaptively, starting from a simple initial model and adding,
- if necessary, new states. PPM technology is a particular case of the DMC
- approach. DMC allows the construction of context models not only for
- single symbols (as with PPM and consideration of letters of the alphabet),
- but also for phrases or lines. In this sense, ACB-compression can be
- classified as a variety of the DMC approach. I am not aware of software
- realizations of DMC algorithms; I would appreciate receiving any
- information on this issue (especially software realizations for the IBM_PC).
-
- 3. Comparison: ACB, PPM, LZ
-
- ACB.EXE v1.14a & v1.23b works only in a "solid" mode, in which all files being
- packed are lined in a solid stream; this allows the most effective use the
- advantages of a large frame. The least favourable data for this solid-mode
- test data. As a rule, various information is taken into the catalogue to
- conduct tests. However, the high adaptivity of ACB-compression reduces to a
- minimum the possible loss, and on real data, the solid mode gives gain in
- compression in 99% cases. Below please find a comparative table (as test
- files, I have selected those transmitted by modem):
-
- Compared: ACB v1.14a, X1 v0.93e, RAR v1.54b, ZIP v2.04g
- DATA - Borland_C++_v_3.1 and GS_system_(convertor for *.ps)
- (exe,asm,c,cpp,h,hpp,obj,rtf,hlp,txt,doc,rc,res,bmp,lib,dll,wav,prj,ps,....)
-
- Regim (Options):
- ACB - default ( "TAUGHT CHANNEL"-mode is not used)
- X1 - select mode with max compression (m1/m3/m4/m1-Solid/m3-Solid/m4-Solid)
- RAR - all Solid, max compression
- ZIP - max compression
-
- ACB PPM LZ LZ
- *.ACB *.X *.RAR *.ZIP
- TLIB 78,232 124,950 197,383 219,503
- LIB 551,153 798,122 1,020,380 1,157,399
- BGI 80,441 91,310 88,763 111,801
- DOC 145,801 157,010 193,922 209,196
- OWL 813,594 1,306,961 1,306,723 1,569,437
- EXMPL 413,790 568,216 565,897 837,843
- GS 605,931 653,809 671,410 725,455
- 2,688,942 3,700,378 4,044,478 4,830,634
-
- Time on OWL Pentium-100 (486SX-33)
- ACB - 232 sec. 1057 sec.
- X1 -(m4-Solid) 190 sec. 710 sec.
- RAR - 101 sec. 341 sec.
- ZIP - 77 sec. 170 sec.
-
- The non-linear nature of the gain in speed in switching to a Pentium processor
- is explained by three reasons:
- 1) Time costs for disk operations in switching over to Pentium are not
- substantially decreased; their share for ACB is very small.
- 2) Optimization of the code for Pentium.
- 3) More effective caching of RAM, which is very critical for ACB.
-
- The advantage of ACB-compression can be increased if ACB is tuned to some type
- of data, using the "TAUGHT CHANNEL" mode, having created in advance the
- context from relative information. Such a possibility is only a side effect of
- this mode. Something similar occurs in choosing options in order to obtain the
- maximum compression coefficient (external information about the data type).
- This approach is good if one wants to set the world records, but is of little
- practical application.
-
- In the APPENDIX_B see the comparative table between the best research
- algorithms and ACB (on "Calgary Corpus" ).
-
- In the APPENDIX_C see the comparative table between ACB_1.23b and ACB_1.14a.
-
-
- 4. "TAUGHT CHANNEL"
-
- The idea: in compressing information transmitted by a channel of
- communications, all of ALL!!! the information earlier transmitted by
- this channel is tranmitted, so that any repetitions of the data
- transmitted in the past through the channel will be used in
- establishing the algorithm for compression.
-
- Realization: The ACB-compression algorithm was implemented in the
- archivator ACB.EXE for Dos-WIN95 (ACB_1.23b- the latest version);
- it was fully written on Assembler (32_Protection-mode) and
- optimized for the Pentium processor.
- It supports the TAUGHT CHANNEL mode.
-
- Innovations: The ACB-compression algorithm is the only algorithm of
- compression, the compression coefficient of which asymptotically
- increases with the growth of the sliding frame. Other algorithms
- can track not more than 8-65 Kb of the channel's background,
- depending on the types of the data, as compared with
- 822 Kb in ACB_1.14a & 1671 Kb in ACB_1.23b.
-
- 4.1 Testing with "taught channel" - mode
-
- Stream of data consists of two types of data:
- - ARCHIVE COMPARISON TEST (A.C.T.) Author: JEFF GILCHRIST
- for June, July, August, September, October;
- - and *.exe files these are different version ACB.EXE:
-
- Regim (Options):
- ACB - "TAUGHT CHANNEL"-mode
- X1 - aem#
- RAR - all Solid, max compression
- ZIP - max compression
-
- Stream: TXT --> EXE --> TXT --> EXE .... TXT --> EXE
-
- ACB PPM LZ LZ
- Stream *.ACB *.X *.RAR *.ZIP
- of data size size size size
- A.C.T_6.95 10323 10314 12201 12199
- ACB_1.10 35605 36713 37150 37605
- A.C.T_7.95 3976 11039 13118 13121
- ACB_1.11 7903 36751 37172 37620
- A.C.T_8.95 2120 10847 12855 12849
- ACB_1.12 6963 36977 37381 37853
- A.C.T_9.95 1905 11004 12962 12949
- ACB_1.13 6123 37036 37402 37886
- A.C.T_10.95 1847 11381 13465 13480
- ACB_1.14 6873 36201 36410 36978
- 83638 238263 250116 252540
- Time (sec.) 20 28 7 5
-
- Testing was conducted on Pentium-75.
-
- 5. Conclusion
-
- Pentium-100 encourages the creation of new algorithms, one of which is
- ACB-compression.
-
- If you are working on the problem of data compression for communications
- channels, ACB-compression is the most effective algorithm for compressing
- the stream of various information.
-
- ---------------------------------------------------------------------------
- APPENDIX_A:
- Complexity of ACB-compression
-
- Time: T(n) = n/k * ( S + Log2( n/(2*k) ) ) + o(1)
- Memory: M(n) = n + n/8 + n*(Log2(n)+1)/4 + 11000
-
- n - sliding frame size (for ACB v1.14 n=822000 bytes & v1.23b n=1671000 bytes),
- S - context dictionary size
- for ACB_1.14 : S=53, for ACB_1.23b : S=100 (MAX.), S=55 (NORMAL), S=16 (FAST)
- k - I/C I - data size, C - code size,
- T(n)code=T(n)decode,
- M(n)code=M(n)decode.
-
- Note:
- - memory requirements decrease proportionally to an ratio.
- However, the latter possibility in ACB.EXE is not used
- to simplify memory manager.
- M'(n) = n + n/8 + n*(Log2(n)+1)/(4*k) + 11000
-
- - If S=1, the compression rate will be approximately the same as LZ-schemes
- have on small files (32-64 Kb). However, with the inrease of their size,
- the advantage of the ACB-compression increases (S=1). And all its valuable
- properties (on_the_fly, unrestricted growth of a sliding frame ...) remain.
-
- - Log2(n)+1 is the size of the pointer (bits), necessary for addressing
- each byte of the sliding frame
-
- For n=4096 bytes M(n)=28920 bytes
- n=32768 M(n)=178936
- n=65536 M(n)=363256
- n=1048576 M(n)=6695672
-
- Property:
- _____ _____
- Data | | Cod | | Data
- ----->| ACB |----->| ACB |-------->
- Time1 |_____| |_____| Time2
-
- Time1=Time2
-
- The first compressed byte generates the code sufficient for
- restoration of this byte (on_the_fly).
-
- On Pentium_100 the speed of the code is 40000 bits/sec, if the transmition
- speed in the communication line is <= than 40000 b/s, then time work
- of the ACB does not influence on the total transmition time.
- ---------------------------------------------------------------------------
- APPENDIX_B:
-
- This overview shows the results obtained by the best universal compression
- algorithms (research implementations) on the test "Calgary Compression Corpus".
-
- The columns are --
- LZB Bell [1987] one of the best in the family LZ77
- order-0 BWT with a simple order-0 arithmetic final stage
- Shann BWT with a "Shannon coder"
- Struct model BWT with a "structured model" ( ditto)
- PPM*+, PPMD+ two recent PPM compressors
- BW95 6/2 arith The best of David Wheeler's BWT compressors.
- LZP4 Author is Charles Bloom
- ACB Buyanovsky's Associativety coding (ACB.EXE DOS/ver.1.23a)
- BEST Combination of the best results from the
- algorithms being considered
-
- The results on all BWTs & PPMD+ were kindly provided
- by Dr Peter Fenwick <p_fenwick@cs.auckland.ac.nz>.
-
- The results on PPM*+ were kindly provided
- by Dr Bill Teahan <wjt@kauri.cs.waikato.ac.nz>
-
- The results on LZP4 were kindly provided by the author of this
- algorithm Charles Bloom <cbloom@mail.utexas.edu>.
-
- The results on ACB (ACB.EXE v.1.23a for DOS) were kindly provided
- by Leonid Broukhis <leob@mailcom.com>
-
-
- order-0 Shann Struct BW94 PPM*+ PPMD+ BW95 LZP4 ACB LZB BEST
- model
- Bib 2.13 1.97 1.95 2.07 +1.86 +1.86 2.02 1.915 1.95 3.17 | 1.86
- Book1 2.52 2.40 2.39 2.49 2.41 +2.30 2.48 2.350 2.34 3.86 | 2.30
- Book2 2.19 2.06 2.04 2.13 2.00 +1.96 2.10 2.011 +1.96 3.28 | 1.96
- Geo 4.81 4.55 +4.50 4.45 4.78 4.73 4.73 4.740 4.69 6.17 | 4.50
- News 2.67 2.53 2.50 2.59 2.37 2.35 2.56 2.350 +2.34 3.55 | 2.34
- Obj1 4.20 3.93 3.87 3.98 3.83 3.73 3.88 3.744 +3.54 4.26 | 3.54
- Obj2 2.70 2.49 2.46 2.64 2.31 2.38 2.53 2.388 +2.24 3.14 | 2.24
- Paper1 2.60 2.48 2.46 2.55 +2.33 +2.33 2.52 2.378 2.37 3.22 | 2.33
- Paper2 2.57 2.44 2.42 2.51 2.34 +2.32 2.50 2.389 2.37 3.43 | 2.32
- Pic 0.83 0.77 0.77 0.83 0.84 0.80 0.79 0.814 +0.75 1.01 | 0.75
- ProgC 2.68 2.52 2.50 2.58 +2.34 2.36 2.54 2.392 2.35 3.08 | 2.34
- ProgL 1.85 1.73 1.72 1.80 1.61 1.68 1.75 1.589 +1.53 2.11 | 1.53
- ProgP 1.83 1.72 1.70 1.79 1.55 1.70 1.74 1.585 +1.52 2.08 | 1.52
- Trans 1.61 1.50 1.50 1.57 1.39 1.47 1.52 1.343 +1.31 2.12 | 1.31
- ----------------------------------------------------------------------|-----
- AVG 2.52 2.36 2.34 2.43 2.28 2.28 2.40 2.291 +2.23 3.18 | 2.20
-
- ------ ACB_1.23a -----
- Computer: (Intel - P90/RAM 16M/MS-DOS)
-
- -----FAST------- ------NORM----- ------MAX------
- Size Code /bpc /time Code /bpc /time Code /bpc /time
- sec. sec. sec.
- bib.acb 111261 29215 2.10 7 27410 1.97 13 27178 1.95 20
- book1.acb 768771 241257 2.51 92 226612 2.36 199 224529 2.34 319
- book2.acb 610856 159132 2.08 55 150598 1.97 115 149753 1.96 193
- geo.acb 102400 61052 4.77 11 60189 4.70 24 60047 4.69 34
- news.acb 377109 116998 2.48 30 111149 2.36 64 110131 2.34 100
- obj1.acb 21504 9652 3.59 1 9535 3.55 2 9527 3.54 3
- obj2.acb 246814 72692 2.36 15 69781 2.26 29 69228 2.24 42
- paper1.acb 53161 16418 2.47 3 15797 2.38 6 15773 2.37 9
- paper2.acb 82199 25473 2.48 6 24391 2.37 12 24313 2.37 17
- pic.acb 513216 52604 0.82 16 49372 0.77 30 48272 0.75 45
- progc.acb 39611 12056 2.43 2 11673 2.36 4 11647 2.35 6
- progl.acb 71646 14284 1.59 3 13740 1.53 6 13686 1.53 8
- progp.acb 49379 9690 1.57 1 9379 1.52 3 9358 1.52 4
- trans.acb 93695 16062 1.37 3 15411 1.32 6 15308 1.31 8
- Total: 3141622 836585 245 795037 513 788750 808
- ------------------------------------------------------------------------------
- AVR. 2.33 2.24 2.23
- --------------------------------------------------------------------------
-
- APPENDIX_C:
-
- Here are the results comparing ACB v1.23b against ACB v1.14a:
-
- i486DX-75 16 RAM (256 Kb cache)
-
- ///////////////// TXT ////////////////////////
-
- Non-formatted text 8 files of parts of fiction - 2359210 bytes
-
- ................
- . <DIR> 13.02.96 21:43
- .. <DIR> 13.02.96 21:43
- JELYAZ02 TXT 485 162 09.09.91 17:35
- LKNKPLNT TXT 357 680 25.11.95 14:17
- VAGNER01 TXT 312 153 10.03.94 22:25
- STARWIND TXT 206 486 05.05.93 16:50
- ELEPHANT TXT 112 611 24.11.93 8:19
- ANDERS07 TXT 560 582 21.11.93 2:04
- BSELIKS TXT 86 491 02.04.91 15:13
- WARRIOR TXT 237 851 16.01.94 23:41
- 10 file(s) 2 359 016 bytes
- ................
-
- Size (bytes) Time (sec.)
- ACB_1.14a 724821 473
- ACB_1.23b(NORMAL) 710498 427
- ACB_1.23b(MAX) 702851 646
- ACB_1.23b(FAST) 758146 229
- PKZIP_2.04g (-ex) 1005277 35
-
- ///////////////// EXE ////////////////////////
-
- WINWORD.EXE (MS_WINWORD v.6.0_rus) - 3483156 bytes
-
- Size (bytes) Time (sec.)
- ACB_1.14a 1753719 715
- ACB_1.23b(NORMAL) 1701156 704
- ACB_1.23b(MAX) 1680749 1064
- ACB_1.23b(FAST) 1780426 369
- PKZIP_2.04g (-ex) 2031837 50
-
- //////////////////////////////////////////////
-