- Path: sparky!uunet!gatech!destroyer!sol.ctr.columbia.edu!eff!snorkelwacker.mit.edu!bloom-beacon!eru.mt.luth.se!lunic!sunic!news.funet.fi!funic!nntp.hut.fi!nntp!eye
- From: eye@lesti.hut.fi (Petri M Kuittinen)
- Newsgroups: comp.compression
- Subject: Re: Best compression technique & ratio for (ascii) text
- Message-ID: <EYE.92Sep3163746@lesti.hut.fi>
- Date: 3 Sep 92 14:37:46 GMT
- References: <BtztDo.3oB@casper.cs.uct.ac.za>
- Sender: usenet@nntp.hut.fi (Usenet pseudouser id)
- Organization: Helsinki University of Technology, Finland
- Lines: 345
- In-Reply-To: roland@sun-2.cs.uct.ac.za's message of Thu, 3 Sep 1992 08:10:35 GMT
- Nntp-Posting-Host: lesti.hut.fi
-
- In article <BtztDo.3oB@casper.cs.uct.ac.za> roland@sun-2.cs.uct.ac.za (Roland Paterson-Jones) writes:
-
- Sorry if this is a FAQ, but I am interested in knowing what compression ratios
- are possible for English text in ascii. Unix's compress (adaptive Lempel-Ziv)
- claims 50-60% for such files. Is this as good as can be done? Quick
- decompression would be essential.
-
- Thanks for any responses...
-
-
- Hmm, I recently wrote a UNIX compress-like program which uses my own
- compression algorithm (yet another LZSS variant). It compresses slightly
- better than UNIX compress, is about 3x slower at compression and is
- very fast at decompression (about 2x faster).
-
- ---------------------Here are a few test results---------------------------
-
- Compression test results
-
- In the charts below I have compared Bcpack3 with other widely
- used compressor and archiver programs for UNIX systems. The test files
- are from the "Calgary Text Compression Corpus"; files paper3-6 are excluded
- from the comparison. (It's common to leave them out in data compression
- comparisons.) These files can be obtained via anonymous ftp from:
- fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus.tar.Z
-
- Most of the files are text files, a few of them are binary files
- (geo, obj1, obj2) and one is a picture file (pic). Most of the files are
- relatively large, so this comparison favours compressors which are good
- at compressing large text files.
-
- All the comparisons were done on a Sony NEWS NWS-1510
- workstation (CPU Motorola 68030, 25 MHz). All the times in the
- charts are CPU times, not real times, which are slightly bigger.
- For the compressors (Compress, Bcpack3) the file lengths also include
- the file headers. The total length of an archive is about 0.1%
- larger than the one reported here.
-
- program              size       arc   bcpack3  compress       lha
- version                        5.00      3.00       4.0      1.00
- options                                   -om
-
- bib                111261     53537     43132     46528     40740
- book1              768771    390356    389156    332056    339074
- book2              610856    326083    254142    250759    228442
- geo                102400     73884     74792     77777     68574
- news               377109    227938    167678    182121    155084
- obj1                21504     14269     10754     14048     10310
- obj2               246814    141583     89032    128659     84981
- paper1              53161     28936     21516     25077     19676
- paper2              82199     41318     36114     36161     32096
- pic                513216     67658     85164     62215     52221
- progc               39611     22810     14936     19143     13941
- progl               71646     30851     18658     27148     16914
- progp               49379     23117     12642     19209     11507
- trans               93695     50234     22728     38240     22578
- -----------------------------------------------------------------
- total             3141622   1492574   1240444   1259141   1096138
- comp. time (s)          -     234,6     460,0      67,4     731,8
- decomp. time (s)        -     155,2      15,5      36,7      71,8
-
-
- program              size     lharc       zip       zoo
- version                        1.02       1.0       2.1
- options                                    -9       -ah
-
- bib                111261     46502     40717     40742
- book1              768771    369479    339932    339076
- book2              610856    252540    229419    228444
- geo                102400     70955     69837     68576
- news               377109    166048    154865    155086
- obj1                21504     10748     10522     10312
- obj2               246814     90848     86661     84983
- paper1              53161     21748     19761     19678
- paper2              82199     32275     32296     32098
- pic                513216     61394     56828     52221
- progc               39611     15399     13995     13943
- progl               71646     18760     16954     16916
- progp               49379     12792     11558     11509
- trans               93695     28092     22737     22580
- -------------------------------------------------------
- total             3141622   1200580   1106013   1096166
- comp. time (s)          -    496,03     254,0     503,0
- decomp. time (s)        -     85,56      75,7      43,9
-
- Compression result
- 1. lha * 34,9%
- 2. zoo * 34,9%
- 3. zip 35,2%
- 4. lharc 38,2%
- 5. bcpack3 39,5%
- 6. compress 40,1%
- 7. arc 47,5%
-
- * A file compressed with zoo is always 2 bytes longer than one
- compressed with lha. Both programs use the same compression
- algorithm (AR002), but the implementation and the file structure
- differ. That explains the slight speed and size differences.
-
- Compression speed
- 1. compress 45,5 kb/s
- 2. arc 13,1 kb/s
- 3. zip 12,1 kb/s
- 4. bcpack3 6,7 kb/s
- 5. lharc 6,1 kb/s
- 6. zoo 6,1 kb/s
- 7. lha 4,2 kb/s
-
- Decompression speed
- 1. bcpack3 197,9 kb/s
- 2. compress 83,6 kb/s
- 3. zoo 69,9 kb/s
- 4. lha 42,7 kb/s
- 5. zip 40,5 kb/s
- 6. lharc 35,9 kb/s
- 7. arc 19,8 kb/s
-
- Per-file compression times don't appear in the charts. The other
- compressors handled every file at close to their average speed, but
- Bcpack3 used most of its time on the pic file (251,92 s of CPU time).
- The search method used in Bcpack3 probably failed miserably
- with this file. The average compression speed of Bcpack3 is 12,3 kb/s
- if the file pic is left out of the calculations.
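
The percentages and kb/s figures in the rankings can be re-derived from the totals and CPU times in the tables. Here is a minimal Python sketch of that arithmetic. The function names are made up for illustration, and it assumes that 1 kb means 1024 bytes, which is the convention that reproduces the figures quoted above.

```python
def ratio_percent(compressed_size, original_size):
    """Compressed size as a percentage of the original (smaller is better)."""
    return 100.0 * compressed_size / original_size


def speed_kb_per_s(original_size, cpu_seconds, bytes_per_kb=1024):
    """Throughput in kb/s, measured against the uncompressed size.

    bytes_per_kb=1024 is an assumption; it matches the posted numbers.
    """
    return original_size / cpu_seconds / bytes_per_kb


TOTAL = 3141622   # total size of the corpus files used in the test
PIC = 513216      # size of the pic file

# Bcpack3 on the whole corpus
print(round(ratio_percent(1240444, TOTAL), 1))   # 39.5
print(round(speed_kb_per_s(TOTAL, 460.0), 1))    # 6.7  (compression)
print(round(speed_kb_per_s(TOTAL, 15.5), 1))     # 197.9 (decompression)

# Bcpack3 compression speed with pic (251,92 s) left out
print(round(speed_kb_per_s(TOTAL - PIC, 460.0 - 251.92), 1))  # 12.3
```

The same two functions apply to the other tables in this post if you plug in their totals and times.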
-
- Below is another comparison, done with small files found
- at: nic.funet.fi:pub/msdos/utilities/txtutlo/
-
- program              size       arc   bcpack3  compress       lha
- version                        5.00      3.00       4.0      1.00
- options                                   -om
-
- readme               1704      1118      1110      1120       923
- shell.com           12616     11880      9820     11970      9278
- shell.pas            3101      1358      1080      1392       928
- sort.c               5909      1800      1246      1838      1108
- sort.exe             9807      8117      6538      8130      6224
- sortdemo.doc         2944      1763      1678      1774      1447
- sortf.com            3235      2774      2438      2794      2321
- sortf.doc           12753      5378      4668      5420      4170
- sortf.obj            4056      3789      3244      3799      3153
- -----------------------------------------------------------------
- total               56125     37977     31882     38237     29552
- comp. time (s)          -      5,52      4,38      2,22     10,86
- decomp. time (s)        -      3,50      0,92      1,75      2,59
-
- program              size     lharc       zip       zoo
- version                        1.02       1.0       2.1
- options                                    -9       -ah
-
- readme               1704      1007       993       925
- shell.com           12616      9426      9595      9280
- shell.pas            3101       997      1023       930
- sort.c               5909      1141      1188      1109
- sort.exe             9807      6389      6420      6226
- sortdemo.doc         2944      1533      1544      1449
- sortf.com            3235      2330      2385      2323
- sortf.doc           12753      4555      4235      4172
- sortf.obj            4056      3153      3244      3155
- -------------------------------------------------------
- total               56125     30531     30627     29569
- comp. time (s)          -      9,02      6,55      6,91
- decomp. time (s)        -      3,53      2,09      1,36
-
- compression result
- 1. lha 52,7%
- 2. zoo 52,7%
- 3. lharc 54,4%
- 4. zip 54,6%
- 5. bcpack3 56,8%
- 6. arc 67,7%
- 7. compress 68,1%
-
- compression speed:
- 1. compress 24,7 kb/s
- 2. bcpack3 12,5 kb/s
- 3. arc 9,9 kb/s
- 4. zip 8,4 kb/s
- 5. zoo 7,9 kb/s
- 6. lharc 6,1 kb/s
- 7. lha 5,0 kb/s
-
- decompression speed:
- 1. bcpack3 59,6 kb/s
- 2. zoo 40,3 kb/s
- 3. compress 31,3 kb/s
- 4. zip 26,2 kb/s
- 5. lha 21,2 kb/s
- 6. arc 15,7 kb/s
- 7. lharc 15,5 kb/s
-
- Tests done on the small files favour the archiver programs even more.
- The archives are about 2-3% bigger than the total length shown here.
- In addition, the archiver programs are now proportionally faster at
- decompression than the compressors, because an archiver only needs
- to read one bigger file instead of reading many small ones, as the
- compression programs must do.
-
- Effects of the variable parameters in Bcpack3
-
- There are two variable parameters in Bcpack3 which affect
- the compression. The first one is the offset range (you can set it with
- the option -oN, where N is the offset range; if N=m then the maximum
- offset range is used). The offset range sets the ring buffer size. Here
- is a table which shows its effect on a Finnish text file.
-
- decom2.txt, original length 97044 bytes
-
-  offset     size   compression   decompression
- (bytes)               time (s)        time (s)
-     300    68356           3,6             0,8
-    1000    60698           4,2             0,7
-    3000    56046           5,1             0,7
-    5000    54102           5,7             0,6
-    7000    53096           6,3             0,6
-    9000    52552           6,8             0,6
-   11000    51954           7,3             0,6
-   13000    51512           7,8             0,7
-   15000    50780           8,6             0,6
-   17000    50694           9,0             0,6
-   19000    50604           9,4             0,6
-   21000    50008          10,2             0,6
-
- As you can see from the table, compression results get better
- with a bigger offset range. Compression time also gets bigger with
- bigger offset ranges. Decompression time is slightly smaller for files
- which were compressed using a bigger offset range.
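
To illustrate why a bigger offset range helps, here is a minimal Python sketch of a generic greedy LZSS-style parser with a limited search window. It is NOT the Bcpack3 algorithm (which is unpublished); the names `lzss_tokens` and `lzss_decode`, the token format and the `min_match`/`max_match` limits are all assumptions made for the example. It only shows that a repeat lying further back than the window cannot be used as a match, and that the brute-force search gets slower as the window grows.

```python
def lzss_tokens(data, window, min_match=3, max_match=18):
    """Greedy LZSS-style parse; only the last `window` bytes are searched."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search: the cost grows with the window size.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("literal", data[i]))
            i += 1
    return tokens


def lzss_decode(tokens):
    """Rebuild the data; decoding does no searching at all."""
    out = bytearray()
    for token in tokens:
        if token[0] == "literal":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # byte-wise copy handles overlaps
    return bytes(out)


# The second "ABCDEFGH" lies 28 bytes after the first one.
sample = b"ABCDEFGH" + bytes(range(20)) + b"ABCDEFGH"
print(len(lzss_tokens(sample, window=8)))    # 36: literals only
print(len(lzss_tokens(sample, window=64)))   # 29: 28 literals + 1 match
```

The decoder only copies bytes, which fits the observation that decompression time hardly depends on the offset range.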
-
- Bcpack3 also has another variable parameter. It is
- called relative mode (option '-r'). With relative mode on, the program
- first delta codes the file before compression. Delta coding means
- transforming the data to a form in which each byte is a signed value
- which tells the difference from the previous byte.
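
As a rough illustration of delta coding, here is a minimal Python sketch. The function names are hypothetical, and two details are assumptions not stated above: the first byte is coded as a difference from 0, and differences wrap around modulo 256 so that each one still fits in a single byte (200 to 10 becomes 66, which is -190 mod 256).

```python
def delta_encode(data):
    """Replace each byte by its difference from the previous byte (mod 256)."""
    out = bytearray()
    previous = 0  # assumption: the first byte is coded relative to 0
    for value in data:
        out.append((value - previous) & 0xFF)
        previous = value
    return bytes(out)


def delta_decode(deltas):
    """Undo delta_encode by keeping a running sum (mod 256)."""
    out = bytearray()
    previous = 0
    for diff in deltas:
        previous = (previous + diff) & 0xFF
        out.append(previous)
    return bytes(out)


# A slowly changing signal turns into small differences.
samples = bytes([10, 12, 15, 15, 13])
print(list(delta_encode(samples)))   # [10, 2, 3, 0, 254]  (254 is -2)
assert delta_decode(delta_encode(samples)) == samples
```

On smoothly varying data the differences cluster around zero, so the same few byte values repeat much more often than in the raw samples.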
-
- program              size   bcpack3   bcpack3
- version                        3.00      3.00
- options                         -om      -rom
-
- bib                111261     43132     48924
- book1              768771    389156    440106
- book2              610856    254142    286102
- geo                102400     74792     95064
- news               377109    167678    195392
- obj1                21504     10754     12674
- obj2               246814     89032    107898
- paper1              53161     21516     24738
- paper2              82199     36114     39956
- pic                513216     85164     96180
- progc               39611     14936     17678
- progl               71646     18658     21678
- progp               49379     12642     15220
- trans               93695     22728     26842
- ---------------------------------------------
- total             3141622   1240444   1428452
- comp. time (s)                460,0     425,8
- decomp. time (s)               15,5      18,0
-
- From the above chart you might conclude that relative
- mode is useless. It's not meant for compressing ordinary files.
- It's meant for compressing files which mostly consist of continuous
- periodic data, e.g. sound files. Below are a few test results
- done on a few sound/music files found at
- nic.funet.fi:pub/amiga/audio.
-
- program              size       arc   bcpack3   bcpack3  compress
- version                        5.00      3.00      3.00      4.00
- options                                   -om      -rom
-
- beethoven.snd       96360     59296     53612     56088     53608
- explosion2.snd      24312     19728     19218     15392     19166
- illbeback.snd       11716     11239     10792      9590     11716 *
- mod.BeyondMusic    402166    338422    332032    305242    343905
- mod.spectral_in    109152     87824     77662     74508     91001
- -----------------------------------------------------------------
- total              643706    516509    493316    460820    519396
- comp. time (s)          -     92,81     71,77    120,53     28,14
- decomp. time (s)        -     38,61      5,84      6,05     12,73
-
- program              size       lha     lharc       zip       zoo
- version                        1.00      1.02       1.0       2.1
- options                                              -9       -ah
-
- beethoven.snd       96360     48820     51246     50392     48822
- explosion2.snd      24312     17489     18096     18265     17491
- illbeback.snd       11716      9977     10086     10466      9979
- mod.BeyondMusic    402166    307261    314144    321918    307263
- mod.spectral_in    109152     72578     77012     75820     72580
- -----------------------------------------------------------------
- total              643706    456125    470548    476861    456135
- comp. time (s)          -    164,22    101,64     77,06    106,97
- decomp. time (s)        -     25,78     39,16     25,58     14,66
-
- compression result:
- 1. lha 70,9%
- 2. zoo 70,9%
- 3. bcpack3 -r 71,6%
- 4. lharc 73,1%
- 5. zip 74,1%
- 6. bcpack3 76,6%
- 7. arc 80,2%
- 8. compress 80,7% *
-
- * compress wasn't able to compress file illbeback.snd at all.
-
- compression speed:
- 1. compress 22,3 kb/s
- 2. bcpack3 8,8 kb/s
- 3. zip 8,2 kb/s
- 4. arc 6,8 kb/s
- 5. lharc 6,2 kb/s
- 6. zoo 5,9 kb/s
- 7. bcpack3 -r 5,2 kb/s
- 8. lha 3,8 kb/s
-
- decompression speed:
- 1. bcpack3 107,6 kb/s
- 2. bcpack3 -r 103,9 kb/s
- 3. compress 49,4 kb/s
- 4. zoo 42,9 kb/s
- 5. zip 24,6 kb/s
- 6. lha 24,4 kb/s
- 7. arc 16,3 kb/s
- 8. lharc 16,1 kb/s
-
- As you can see, delta coding improves the compression result on
- these files by about 5 percentage points (from 76,6% to 71,6%),
- increases compression time and doesn't affect the decompression speed
- much. It's usually useful to use relative mode on
- files which consist mostly of 8-bit sound data.
- ---------------------------end of results stuff------------------------------
-
- The text compression results above could have been improved by about
- 5% if there were a special ASCII mode which coded bytes in just
- 7 bits.
- The Bcpack3 program, C source code & the algorithm are copyrighted
- at this moment, but if enough interest comes from the field, I will publish
- them and make them Public Domain (PD). If nobody is interested in my
- program, I won't continue improving it.
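
To show what coding bytes in 7 bits means, here is a minimal Python sketch that packs plain 7-bit ASCII into a continuous bit stream. The names `pack7` and `unpack7` are hypothetical, and this works on raw text only; how such a mode would be built into the Bcpack3 output format is not described here, so treat it as an illustration of the idea and nothing more.

```python
def pack7(text):
    """Pack 7-bit ASCII bytes tightly, 7 bits per character."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for value in text:
        if value > 0x7F:
            raise ValueError("not 7-bit ASCII")
        acc = (acc << 7) | value
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack7(packed, count):
    """Recover `count` characters from a pack7 stream."""
    acc = 0
    nbits = 0
    out = bytearray()
    for value in packed:
        acc = (acc << 8) | value
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
            acc &= (1 << nbits) - 1
    return bytes(out)


packed = pack7(b"ABCDEFGH")
print(len(packed))           # 7: eight characters fit in seven bytes
assert unpack7(packed, 8) == b"ABCDEFGH"
```

The character count has to be stored separately, because the zero padding in the last byte could otherwise be read as an extra NUL character.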
-
- Petri -The Eye- Kuittinen
-
-
- --
- | The Eye of Brainwash Company | LEA POINT(PC),A3 ;Don't try |
- | E-mail:eye@niksula.hut.fi | ADD.L #$465707B7,(A3) ;this M68000|
- | S-mail:Timpurinkuja 1 C 39 | LEA $8000,A4 ;assembler |
- | 02600 Espoo, Finland |POINT: ORI.W #$5945,D5 ;program!! |
-