home *** CD-ROM | disk | FTP | other *** search
- Xref: sparky comp.compression:2780 news.software.b:2593
- Newsgroups: comp.compression,news.software.b
- Path: sparky!uunet!pipex!ibmpcug!exnet!dhd
- From: dhd@exnet.co.uk (Damon)
- Subject: COLLATED ANSWERS---Preloading `compress' for better NEWS batching.
- Message-ID: <Brt81I.IE@exnet.co.uk>
- Organization: ExNet Systems Ltd Public Access News, London, UK
- References: <1992Jul22.144104.27576@usenet.ins.cwru.edu> <711820693@pike.cs.duke.edu>
- Date: Wed, 22 Jul 1992 21:36:53 GMT
- Lines: 555
-
- I asked about an idea for tuning compress for news batches... Here is my
- original question and a selection of the answers received. Thanks to all those
- who replied.
-
- Take the trouble to read this and maybe followup on (bit of!) it; there's a lot
- of good info here. I may start a second summary.
-
- I may well have missed posted rather than e-mailed replies. Sorry if any of
- you tried to mail and it bounced.
-
- Replies quoted are from:
- Nigel Metheringham <nigelm@ohm.york.ac.uk>
- Alfred Kayser <Alfred.Kayser@dnpap.et.tudelft.nl>
- Matthew Farwell <dylan@ibmpcug.co.uk>
- Don Taylor <dont@klic.rain.com>
- Rob Stampfli <res@colnet.cmhnet.org>
-
- Thanks again...
-
- Damon
-
- -------------------------------------------------------------------------------
- Newsgroups: comp.compression,news.software.b
- Path: exnet!dhd
- From: dhd@exnet.co.uk (Damon)
- Subject: Preloading compress for news...
- Message-ID: <BrE95s.26L@exnet.co.uk>
- Keywords: compress, LZ, preloading, news, batch, n-graphs
- Organization: ExNet Systems Ltd Public Access News, London, UK
- Date: Tue, 14 Jul 1992 19:37:03 GMT
-
- I'm sure this one has been discussed before... but I can't remember...
-
- Suppose, for a particular application (in this case batch compression
- for news) we record the frequency of occurrence of the most common 2,
- 3,
- ... 6 character sequences in real batches.
-
- (I have the tools to do this; I may even do something about it this
- evening...)
-
- Then pick the 255 sequences that if shrunk to 9 bits would save the
- most space (ie bits saved * freq of occurrence). Preload these into a
- version of compress (and also uncompress, being the same program). If
- we want we could then do the same with the next most `useful' sequences
- for the 10-bit codes.
-
- Call this something new like `newscomp' and supply it with (say) new C
- News software for participating pairs of sites to use to
- compress/uncompress batches.
-
- The idea is that this `tuned' compress/uncompress may be much more
- effective than a normal compress, and although incompatible with the
- normal compress because of this preloaded info, it might save
- significant bandwidth between cooperating pairs of sites, starting
- better than compress by itself, but still adapting well for different
- traffic.
-
- If you e-mail I'll gather responses together and post as one article.
- I don't promise to catch all articles in the news spool, since my
- machine is about to overflow!
-
- If someone knows how to modify compress with this preloaded info, I'll
- collect character frequencies in real batches for initially a day or
- two and then for a few months, or supply them with the automatic tools
- to do so. This could be a very worthwhile saving in
- net.telecomms.costs.
-
- Damon
-
-
- ===============================================================================
- RESPONSES
- -------------------------------------------------------------------------------
- Newsgroups: comp.compression,news.software.b
- Path: exnet!dhd
- From: dhd@exnet.co.uk (Damon)
- Subject: Re: Preloading compress for news...
- Message-ID: <BrEJ51.581@exnet.co.uk>
- Keywords: compress, LZ, preloading, news, batch, n-graphs
- Organization: ExNet Systems Ltd Public Access News, London, UK
- References: <BrE95s.26L@exnet.co.uk>
- Date: Tue, 14 Jul 1992 23:12:36 GMT
-
- In article <BrE95s.26L@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >I'm sure this one has been discussed before... but I can't remember...
- >
- >Suppose, for a particular application (in this case batch compression
- >for news) we record the frequency of occurrence of the most common 2,
- >3,
- >... 6 character sequences in real batches.
- >
- >(I have the tools to do this; I may even do something about it this
- >evening...)
- >
- >Then pick the 255 sequences that if shrunk to 9 bits would save the
- >most space (ie bits saved * freq of occurrence). Preload these into a
- >version of compress (and also uncompress, being the same program). If
- >we want we could then do the same with the next most `useful' sequences
- >for the 10-bit codes.
-
- I know that following up your own articles is a bit naughty, but...
-
- OK, well I measured the frequency of a selection of n-graphs in 581991
- characters-worth of uk.misc articles with cut-down headers (this info was
- collected for other purposes, so is not exactly ideal for the purpose in hand,
- but is close). The headers are cut down to just the Subject: line. Note that
- not all n-graphs present in the data were sampled for frequency of occurrence,
- especially for the longer graphs.
-
- I picked out the 254 sequences that would save most bits if they had been
- replaced by a single 9-bit code word. They are listed below as three fields.
-
- Length of n-graph, *bits* saved, n-graph with ctrl chars escaped (space is \40)
-
- Now the numbers given are not true simultaneously, since if all sequences of,
- say, `---------------------' were replaced then the additional saving gained by
- replacing all the `-------'s would be much smaller than that listed. But
- adding up the non-overlapping n-graphs such as the run of 16 `-'s and the run
- of 16 ` 's, and the occurences of `the' and `Subject: Re:' and so on comes, by
- eye, to around 30% compression straight off. The 254th code might be saving
- about .4% of the bits.
-
- Of course, this selection of graphs would most favour traffic in the uk groups,
- so saving us the most money... Perhaps alt.sex should be sampled instead? B^O
-
- Table of 254 putative preloaded strings (n-graphs) follows:
-
- 16 633556 ----------------
- 15 603618 ---------------
- 14 571856 --------------
- 13 538555 -------------
- 12 503469 ------------
- 16 491827 \40\40\40\40\40\40\40\40\40\40\40\40\40\40\40\40
- 15 488733 \40\40\40\40\40\40\40\40\40\40\40\40\40\40\40
- 14 482761 \40\40\40\40\40\40\40\40\40\40\40\40\40\40
- 13 474715 \40\40\40\40\40\40\40\40\40\40\40\40\40
- 11 466495 -----------
- 12 465363 \40\40\40\40\40\40\40\40\40\40\40\40
- 11 452196 \40\40\40\40\40\40\40\40\40\40\40
- 10 436082 \40\40\40\40\40\40\40\40\40\40
- 10 427704 ----------
- 9 415674 \40\40\40\40\40\40\40\40\40
- 8 394240 \40\40\40\40\40\40\40\40
- 9 387072 ---------
- 7 367681 \40\40\40\40\40\40\40
- 8 344520 --------
- 6 333684 \40\40\40\40\40\40
- 7 300095 -------
- 5 292454 \40\40\40\40\40
- 6 253734 ------
- 4 240051 \40\40\40\40
- 5 205468 -----
- 3 179340 \40\40\40
- 4 155664 ----
- 4 106099 \40the
- 2 105812 \40\40
- 3 103965 ---
- 3 102885 \40th
- 5 100719 \40the\40
- 2 88578 e\40
- 3 83565 the
- 4 78706 the\40
- 2 71792 \40t
- 2 64344 th
- 3 58980 he\40
- 2 52507 --
- 2 52500 s\40
- 2 51107 \40a
- 2 50757 he
- 2 50323 t\40
- 2 43477 in
- 4 43171 ing\40
- 4 42090 \40to\40
- 4 40710 \40of\40
- 3 38370 ing
- 2 38325 er
- 13 37525 Subject:\40Re:\40
- 6 36972 \40that\40
- 2 36715 n\40
- 12 34365 Subject:\40Re:
- 12 34365 ubject:\40Re:\40
- 2 34265 an
- 2 34230 re
- 12 33756 In\40article\40<
- 5 33728 \40that
-
- [etc]
- -------------------------------------------------------------------------------
- Newsgroups: comp.compression,news.software.b
- From: dhd@exnet.co.uk (Damon)
- Subject: Re: Preloading compress for news...
- Message-ID: <BrEJuK.5HA@exnet.co.uk>
- Followup-To: comp.compression
- Keywords: compress, LZ, preloading, news, batch, n-graphs
- Organization: ExNet Systems Ltd Public Access News, London, UK
- References: <BrE95s.26L@exnet.co.uk> <BrEJ51.581@exnet.co.uk>
- Date: Tue, 14 Jul 1992 23:27:55 GMT
-
- In article <BrEJ51.581@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >In article <BrE95s.26L@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >I know that following up your own articles is a bit naughty, but...
-
- And I'm doing it again. Too tired to think straight...
-
- >Now the numbers given are not true simultaneously, since if all sequences of,
- >say, `---------------------' were replaced then the additional saving gained by
- >replacing all the `-------'s would be much smaller than that listed. But
- >adding up the non-overlapping n-graphs such as the run of 16 `-'s and the run
- >of 16 ` 's, and the occurences of `the' and `Subject: Re:' and so on comes, by
- >eye, to around 30% compression straight off. The 254th code might be saving
- >about .4% of the bits.
-
- The big thing I overlooked which may invalidate all that above (!) was that
- where we have a repeating sequence, eg `---', a subsequence of it will be
- counted on each match, eg `--' will be counted twice in `---'. So the Big
- Saving `-' and ` ' runs will not save nearly as much as the crude figures
- suggest... My data needs more processing. I need sleep.
-
- I've now altered the followups line so I can make a fool of myself in just one
- global newsgroup Bv<.
-
- Other people are invited to get their own words in edgeways...
-
- Damon
-
- -------------------------------------------------------------------------------
- From: Nigel Metheringham <nigelm@ohm.york.ac.uk>
- Date: Wed, 15 Jul 1992 09:03:45 +0100
- Message-Id: <3246.199207150803@glenlivet.ohm.york.ac.uk>
- To: dhd@exnet.co.uk
- Subject: Re: Preloading compress for news...
- Newsgroups: comp.compression,news.software.b
- References: <BrE95s.26L@exnet.co.uk>
-
- Compress puts the compression table in the compressed data stream.
- It should indeed be possible to make a version of compress that does
- more stringent analysis of the file to be compressed (ie looks at
- the whole file rather than taking it as an incoming data stream),
- and so get a more thoroughly compressed file. The good point is
- that you can get a more compressed data stream which will still
- uncompress with the same uncompressor!
-
- Nigel.
-
- --
- # Nigel Metheringham # (NeXT) EMail: nigelm@ohm.york.ac.uk #
- # System Administrator ####### Phone: +44 904 432374 #
- # Department of Electronics # Fax: +44 904 432335 #
- # University of York, Heslington, York, UK, YO1 5DD #
- -------------------------------------------------------------------------------
- From: dhd (Damon)
- To: nigelm@ohm.york.ac.uk
- Subject: Re: Preloading compress for news...
-
- I'm not sure you can do that with compress. Each time it adds a new
- string to its database it transmits the new character to the other
- end to build the receiver's table.
-
- But it would be good.
-
- Still, preloading the table at both ends will remove the need to transmit
- it at all. Since we might only preload a relatively small number of
- strings the new compressor would still work well on other sorts of data
- than that we sampled to chose the preload strings.
-
- Damon
- -------------------------------------------------------------------------------
- From: Nigel Metheringham <nigelm@ohm.york.ac.uk>
- Date: Wed, 15 Jul 1992 16:37:22 +0100
- To: Damon <dhd@exnet.co.uk>
- Subject: Re: Preloading compress for news...
-
- There was some discussion a while back in comp.compression about
- prescanning the file and making sure you had an optimal compression
- table before emmitting the table and the data. This has apprently
- been implemented and works fine with the standard uncompress.
-
- I think trying to use a fixed string table and specialised
- compressor/uncompressor pairs would not be a win at all. The
- performance would be good on some batches and probably appaling on
- others (ie program code or encoded binary). The current methods work
- pretty much OK - probably safer to leave well alone!
-
- Nigel.
-
- -------------------------------------------------------------------------------
- Date: Wed, 15 Jul 92 18:06:53 BST
- From: dhd (Damon)
- Message-Id: <9207151706.AA01504@exnet.co.uk>
- To: dhd
- Status: R
-
- >I think trying to use a fixed string table and specialised
-
- No, no, you don't *fix* the string table; you just load the first few at
- the compressor and decompressor and let it extend the table as usual. So
- its performance is going to me more-or-less the same as the default
- compress/uncompress in the worst case, and in the best case somewhat better.
-
- Damon
- -------------------------------------------------------------------------------
- Newsgroups: comp.compression
- From: Alfred.Kayser@dnpap.et.tudelft.nl (Alfred Kayser)
- Subject: Re: Preloading compress for news...
- Message-ID: <alfred.711184930@dutepp3>
- Organization: Delft University of Technology, Dept. of Electrical Engineering
- References: <BrE95s.26L@exnet.co.uk> <BrEJ51.581@exnet.co.uk> <BrEJuK.5HA@exnet.co.uk>
- Date: Wed, 15 Jul 1992 07:22:10 GMT
-
- dhd@exnet.co.uk (Damon) writes:
-
- >In article <BrEJ51.581@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >>In article <BrE95s.26L@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >>I know that following up your own articles is a bit naughty, but...
- >And I'm doing it again. Too tired to think straight...
- >>Now the numbers given are not true simultaneously, since if all sequences of,
- >>say, `---------------------' were replaced then the additional saving gained by
- >>replacing all the `-------'s would be much smaller than that listed. But
- >>adding up the non-overlapping n-graphs such as the run of 16 `-'s and the run
- >>of 16 ` 's, and the occurences of `the' and `Subject: Re:' and so on comes, by
- >>eye, to around 30% compression straight off. The 254th code might be saving
- >>about .4% of the bits.
-
- >The big thing I overlooked which may invalidate all that above (!) was that
- >where we have a repeating sequence, eg `---', a subsequence of it will be
- >counted on each match, eg `--' will be counted twice in `---'. So the Big
- >Saving `-' and ` ' runs will not save nearly as much as the crude figures
- >suggest... My data needs more processing. I need sleep.
- You need to sleep indeed!
-
- Why don't we use plain 'compress' or other compressors which are indenpendent
- of the data? The 30% compression rate with handtweaked 'n-graphs' is
- certainly not better than the rates achieved by standard and widely availeble
- compressors. IMHO there two places where one should compress, that is on the high end
- of the protocol (ie in the application), so the compression is ideally suited
- to the data (take for instance gif and jpg for images, and zoo for executables).
- The other place is somewhere in the transport layer of the network protocol (ie
- in modems via v42bis and MNP5) transparantly to the applications (such at usenet).
- The first place can achieve compression ratios up to 1:1000 (such as ifs based
- compression of images), while the second is mostly in the order of 1:2 to 1:4.
-
- It is overkill to want to compress in between those two places, especially if one
- want to do this solely to cutdown transport costs.
-
- Just my two cents, Alfred
-
- --
- -- Ir. Alfred Kayser. PACS, OS/2, TCP/IP. --- Email: AKayser@et.tudelft.nl --
- -- CARDIT, Delft University of Technology ------------ Tel: (31)-15-786179 --
- -- P.O.Box 5031, 2600 GA Delft, The Netherlands ------ Fax: (31)-15-784898 --
- -------------------------------------------------------------------------------
- Newsgroups: comp.compression
- From: dhd@exnet.co.uk (Damon)
- Subject: Re: Preloading compress for news...
- Message-ID: <BrFMuC.43M@exnet.co.uk>
- Organization: ExNet Systems Ltd Public Access News, London, UK
- References: <BrEJ51.581@exnet.co.uk> <BrEJuK.5HA@exnet.co.uk> <alfred.711184930@dutepp3>
- Date: Wed, 15 Jul 1992 13:30:11 GMT
-
- In article <alfred.711184930@dutepp3> Alfred.Kayser@dnpap.et.tudelft.nl (Alfred Kayser) writes:
- >dhd@exnet.co.uk (Damon) writes:
- >
- >>In article <BrEJ51.581@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >>>In article <BrE95s.26L@exnet.co.uk> dhd@exnet.co.uk (Damon) writes:
- >>>I know that following up your own articles is a bit naughty, but...
- >>And I'm doing it again. Too tired to think straight...
- >>>Now the numbers given are not true simultaneously, since if all sequences of,
- >>>say, `---------------------' were replaced then the additional saving gained by
- >>>replacing all the `-------'s would be much smaller than that listed. But
- >>>adding up the non-overlapping n-graphs such as the run of 16 `-'s and the run
- >>>of 16 ` 's, and the occurences of `the' and `Subject: Re:' and so on comes, by
- >>>eye, to around 30% compression straight off. The 254th code might be saving
- >>>about .4% of the bits.
- >
- >>The big thing I overlooked which may invalidate all that above (!) was that
- >>where we have a repeating sequence, eg `---', a subsequence of it will be
- >>counted on each match, eg `--' will be counted twice in `---'. So the Big
- >>Saving `-' and ` ' runs will not save nearly as much as the crude figures
- >>suggest... My data needs more processing. I need sleep.
- >You need to sleep indeed!
- >
- >Why don't we use plain 'compress' or other compressors which are indenpendent
- >of the data? The 30% compression rate with handtweaked 'n-graphs' is
-
- We do. By default `compress' is used to compress batches of news between uucp
- news sites. What I am suggesting is *improving* on its default behaviour by
- tuning it for this specific application. What I'm saying is that we might be
- able to get 30% before compress even has to start `learning'.
-
- >The other place is somewhere in the transport layer of the network protocol (ie
- >in modems via v42bis and MNP5) transparantly to the applications (such at usenet).
-
- Less good, as you point out, because the user data is broken up into
- effectively random pieces by the packet headers/trailers...
-
- Damon
- -------------------------------------------------------------------------------
- To: dhd@exnet.co.uk
- Subject: news compression
- Date: Wed, 15 Jul 92 15:24:44 BST
- From: Matthew Farwell <dylan@ibmpcug.co.uk>
- Message-Id: <9207151524.aa17631@kate.ibmpcug.co.uk>
-
- Have you looked at Freeze/melt?
-
- Script started on Wed Jul 15 15:22:04 1992
- ; ls -l 71*
- -rw-r--r-- 1 news news 1035353 Jul 10 03:12 710734342
- ; -- time
- time compress < 71* > out
-
- real 26.4
- user 3.2
- sys 0.8
- ; ls -l out
- -rw-r--r-- 1 dylan source 490825 Jul 15 15:22 out
- ; time freeze < 71* > out
-
- real 29.6
- user 19.7
- sys 0.9
- ; ls -l out
- -rw-r--r-- 1 dylan source 429503 Jul 15 15:23 out
- ;
- script done on Wed Jul 15 15:23:26 1992
-
- Dylan.
-
- --
- It is no coincidence that in no known language does the phrase 'As
- pretty as an Airport' appear -- Douglas Adams
-
-
- [Freeze is available in source form.]
- -------------------------------------------------------------------------------
- Message-Id: <m0m8No6-0003NMC@klic.rain.com>
- Date: Wed, 15 Jul 92 21:51 PDT
- From: Don taylor <dont@klic.rain.com>
- To: dhd@exnet.co.uk
-
- I too have been considering preloading compressors with tables, and other
- methods to get 75% compression, as is quoted as the underlying information
- content of english text.
-
- I have been thinking that the compressor needs to make use of lookahead, not
- just lookback. I suspect the receiver needs to make use of a dictionary, not
- just a pointer to the previous text (since there are duplications in the text
- which provide less material available for a fixed sized pointer or a larger
- pointer for the same material). For text, since most words include a trailing
- blank, how about encoding the word with a trailing blank and have a less common
- code that says to discard the last character reconstructed. Since the typical
- word is 5 characters, 6 with the blank, and the blank is free in about 90% of
- the cases if it is encoded in this way, that should provide some savings.
-
- I have been looking for algorithms to efficiently do the compression as
- follows
- Given the dictionary known to the receiver
- and the next x kbytes of lookahead
- what is the least cost encoding to send those next x kbytes
- considering the price of dictionary updates+transmission
- and only send a dictionary update when needed, i.e. for prefix string.
-
- Thanks
- dont@klic.rain.com
- -------------------------------------------------------------------------------
- To: dhd@exnet.co.uk
- Subject: Re: Preloading compress for news...
- In-Reply-To: <BrE95s.26L@exnet.co.uk>
- Organization: Little to None
- Message-Id: <9207152332.AA27348@colnet.cmhnet.org>
- Date: Wed, 15 Jul 92 23:32:13 EDT (-0400)
- From: Rob Stampfli <res@colnet.cmhnet.org>
-
- A couple of us here actually thought of this, although we never did
- anything with it except talk about it. I think it would be a win,
- although I don't know how much of one. Do you have any statistics?
- Also, the internal tables in compress, which are now in bss, would
- have to be moved to .data, meaning a *large* program on disk.
-
- One possibility would be to give compress the option of preloading
- the internal tables from a command-line specified file. You could
- also have it write this information out to a file after it was done
- with a compression.
-
- Another possibility would be to get a "core to a.out" translator. I
- understand these exist for some machines, but I have never actually
- seen one. If you had one, you might be able to relatively easily
- rig up a version of compress that drops core at a strategic point
- and then is capable of restarting with its already populated memory
- when invoked from scratch.
-
- As a stop-gap here, I have gone to using zip and unzip in lieu of
- compress between me and my feed site. It saves 10-20% on the size
- of the compressed batches, but it takes a lot longer to do the
- compression and decompression. Right now, though, I have more cpu
- cycles than modem time, so it's a win.
-
- I don't have the time to work on something like this myself, but I
- would be interested in hearing anything you may come up with on it.
- Good luck with it and keep us informed.
- --
- Rob Stampfli rob@colnet.cmhnet.org The neat thing about standards:
- 614-864-9377 HAM RADIO: kd8wk@n8jyv.oh There are so many to choose from.
- -------------------------------------------------------------------------------
- Date: Thu, 16 Jul 92 14:32:06 BST
- From: dhd (Damon)
- Message-Id: <9207161332.AA05405@exnet.co.uk>
- To: res@colnet.cmhnet.org
- Subject: Re: Preloading compress for news...
-
- >One possibility would be to give compress the option of preloading
- >the internal tables from a command-line specified file. You could
- >also have it write this information out to a file after it was done
- >with a compression.
-
- I like this. Compat. with extant compress/uncompress, and retunable by
- pairs of sites as and when it suits them... Maybe even depending on
- which groups they get...
-
- [[Maybe even auto-tuning... This added after original mailer sent]]
-
- >Another possibility would be to get a "core to a.out" translator. I
- >understand these exist for some machines, but I have never actually
- >seen one. If you had one, you might be able to relatively easily
- >rig up a version of compress that drops core at a strategic point
- >and then is capable of restarting with its already populated memory
- >when invoked from scratch.
-
- This probably won't work for dynamically-linked library versions (it
- breaks Sun's sendmail.fc [frozen config] file which works by reloading
- .data).
-
- >As a stop-gap here, I have gone to using zip and unzip in lieu of
- >compress between me and my feed site. It saves 10-20% on the size
- >of the compressed batches, but it takes a lot longer to do the
- >compression and decompression. Right now, though, I have more cpu
- >cycles than modem time, so it's a win.
-
- Freeze/melt seems much the same.
-
- I thought I might quickly modify compress to only compress a 7-bit
- input stream (eg the plain ASCII that news ships). This might make a
- significant differernce. I might try it later. I think just ignoring
- the 8th bit might be OK. But this is obviously not safe for anything
- other than ASCII text.
-
- Damon
- ===============================================================================
- # 1.8 compr 92/07/22 #
- --
- Damon Hart-Davis | Tel/Fax: +44 81 755 0077 |1.21|| ALL MAIL FREE.
- Internet: dhd@exnet.co.uk | Also: Damon@ed.ac.uk || Hotrod motor groups soon.
- --------------------------+----------------------++ >12 mail&news polls / day.
- Public access UNIX (Suns), news and mail for #5 per month. FIRST MONTH FREE.
-