Columbia Kermit

home *** CD-ROM | disk | FTP | other *** search

/ Columbia Kermit / kermit.zip / protocol / compress.txt < prev next >

Wrap

Text File | 2020-01-01 | 152KB | 3,525 lines

25-Oct-91 14:45:19-GMT,4410;000000000401 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA03372; Fri, 25 Oct 91 10:45:06 EDT Date: Fri, 25 Oct 91 10:45:06 EDT From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: "Glenn E. Thobe" <thobe@getunx.info.com> Subject: Re: Recent Kermit Protocol Extensions In-Reply-To: Your message of Thu, 24 Oct 91 23:31:06 PDT Message-Id: <CMM.0.90.0.688401906.fdc@watsun.cc.columbia.edu> > Frank- > In comp.protocols.kermit you write: > > >The next major effort in Kermit protocol expansion (no commitment expressed > >or implied!) is the addition of an effective and portable compression > >method. ... For maximum transportability, the > >result of compression should be a sequence of 7-bit ASCII printable > >characters (32-126 or a subset of these). > > No, it should be flexible enough to efficiently use whatever character > space is available: 32-126 or 32-126,160-255 or ... > > Development of a compression scheme should be combined with the > development of an efficient means of transporting binary data > uniformly distributed over the codes 0-255. > We have already done this, to an extent. But because Kermit will always have to work on 7-bit connections, and connections where any number of C0 or C1 control characters could interfere with transmission, it would save us a lot of complexity if the compression method always encodes into ASCII -- that relieves us of the overhead of prefixing, quoting, and shifting already-compressed text (to get an idea of what I mean, see the lshift.txt paper that was mentioned in the same message). > The current method, control-code prefixing, increases the size of such > a binary file by a good 25% on an 8-bit channel, dissipating about half > the gains achieved by precompressing the file. On a 7-bit channel, the > combination of 8-th bit and control-code prefixing increases the > effective file size by some 75%. Compare these losses the theoretical > minimums of 6.5% and 21.5%, respectively. > That's why we need a good compression scheme that automatically gives 8-bit and C0/C1 transparency. > For 7-bit channels, a 4-to-5 byte scheme such as used in the tarmail > utility solves this problem (the basis of this algorithm is to treat > each four bytes as a 32-bit unsigned integer and express it as 5 digits > in base 95). For 8-bit channels, a similar scheme could be worked out > to utilize both the lower and upper ascii printables efficiently. > Right, but better still to design an algorithm that optimally compresses into ASCII graphics from the start -- then there is no need to expand the compressed data for transmission. > Such an alternative to control-code (and even 8th-bit) prefixing > should be added along with compression, but should be selectable > (negotiable) even without compression. > There has actually been a lot of talk about this, and this is a possibility. But the gain in efficiency over current methods is minimal compared (I hope) to what an effective compression algorithm will yield, so I'd much rather concentrate on compression than on yet-another-way-of-achieving-transparency- without-compression. > (Since a protocol extension is envisioned anyway, you might consider > extending the character space to optionally include at least the > upper ASCII control characters.) > That'll never work. C1 controls can be misinterpreted as C0 controls by communication devices or hosts that ignore the 8th bit, and even when they are correctly seen as C1 controls can trigger unwanted actions like shifting. > Sorry, I don't have any good alternative compression algorithms. > It is just possible that the LZW *program* might be incorporated into > Kermit or left as a separate module without legal problems on the > grounds of prior art, publication, etc. Or the patent holders > might write an exemption for Kermit use only. > It's a possibility. > Best wishes. > > -Glenn Thobe <getunx!thobe@uunet.uu.net> > (A long-time Kermit enthusiast.) > --- > P.S. Bravo for your pioneering internationalization efforts! > Thanks! (Do you use this stuff yourself?) > P.P.S. What can be done to get the most popular BBS's to support Kermit? > Beats me. Maybe as BBS's become popular outside the USA -- where (inter)national character set conversion is important and 8-bit transparency cannot be assumed (because of the PTTs) -- we'll see more Kermit support in BBS software. 26-Oct-91 6:20:53-GMT,1200;000000000011 Return-Path: <@mail.swip.net:pkmab!ske@kullmar> Received: from mail.swip.net by watsun.cc.columbia.edu (5.59/FCB) id AA15980; Sat, 26 Oct 91 02:20:44 EDT Received: by mail.swip.net (5.61+IDA/KTH/LTH/1.2) id AAmail11771; Sat, 26 Oct 91 07:20:39 +0100 Received: by kullmar.se (smail2.5) id AA04013; 26 Oct 91 06:15:01 SWT (Sat) Received: by pkmab.se (smail2.5) id AA13833; 26 Oct 91 02:33:04 MET (Sat) Subject: Compression Protocol To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Sat, 26 Oct 91 2:33:03 MET From: Kristoffer Eriksson <ske@pkmab.se> X-Mailer: ELM (ver 2.2 PL16) Message-Id: <9110260233.AA13833@pkmab.se> >The next major effort in Kermit protocol expansion (no commitment expressed >or implied!) is the addition of an effective and portable compression >method. Isn't this a bit over-ambitious? Why not just provide a portable but separate compression program that you can run before and after transfer? That way you would avoid bloating the kermit program itself. ---- Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden Phone: +46 19-13 03 60 ! e-mail: ske@pkmab.se Fax: +46 19-11 51 03 ! or ...!{uunet,mcsun}!mail.swip.net!kullmar!pkmab!ske 28-Oct-91 15:18:19-GMT,1076;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA12440; Mon, 28 Oct 91 10:18:10 EST Date: Mon, 28 Oct 91 10:18:09 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: Kristoffer Eriksson <ske@pkmab.se> Subject: Re: Compression Protocol In-Reply-To: Your message of Sat, 26 Oct 91 2:33:03 MET Message-Id: <CMM.0.90.0.688663089.fdc@watsun.cc.columbia.edu> > >The next major effort in Kermit protocol expansion (no commitment expressed > >or implied!) is the addition of an effective and portable compression > >method. > > Isn't this a bit over-ambitious? Why not just provide a portable but > separate compression program that you can run before and after transfer? > That way you would avoid bloating the kermit program itself. > Because the sending Kermit has to do the record format and character-set conversions before compressing, and the receiving Kermit has to do them after decompressing. Thus we would need not only "n" portable compression programs, but also "n x (n - 1)" character-set and record-format conversion programs. - Frank 28-Oct-91 18:25:13-GMT,3632;000000000011 Return-Path: <afbf05!afbf02!afbf14!les@uunet.UU.NET> Received: from relay2.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA16062; Mon, 28 Oct 91 13:25:11 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA24113; Mon, 28 Oct 91 13:25:04 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 132450.6045; Mon, 28 Oct 1991 13:24:50 EST Received: id <m0kbbcJ-0000meC@afbf05.fb.com> Mon, 28 Oct 91 13:23 EST Received: id <m0kbbXk-00026bC@afbf02.fb.com> Mon, 28 Oct 91 12:18 CST Received: id <m0kbcTI-0002Y3C@afbf14.fb.com> Mon, 28 Oct 91 13:18 CST Message-Id: <m0kbcTI-0002Y3C@afbf14.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Mon, 28 Oct 91 13:17:55 CST In-Reply-To: <CMM.0.90.0.688662617.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 28, 91 10:10 am [Compression to printable chars...] > Right, but better still to design an algorithm that optimally compresses > into ASCII graphics from the start -- then there is no need to expand the > compressed data for transmission. I'm not sure I agree here. I suppose it might depend on the algorithm, but I don't see how a 1-pass compress can do any optimization into a specified character set to make it come out any better than compressing into 256 characters, then expanding through something like btoa. That is, the compress pass is always going to generate an essentially random output and thus can't do any optimization for the output character set. The extra processing would be irrelevant compared to the compress pass anyway. > > Such an alternative to control-code (and even 8th-bit) prefixing > > should be added along with compression, but should be selectable > > (negotiable) even without compression. > There has actually been a lot of talk about this, and this is a possibility. > But the gain in efficiency over current methods is minimal compared (I hope) > to what an effective compression algorithm will yield, so I'd much rather > concentrate on compression than on yet-another-way-of-achieving-transparency- > without-compression. But any way you look at it, control-code prefixing is going to cost about 25% of your possible bandwidth, and on many (probably most) connections it isn't necessary. I think you need both options and they need to be coordinated so the compression can use the full character set if the transfer link allows it. Likewise, if the material is pre-compressed with unix compress, zip, zoo, arc or the like, the kermit compression is unlikely to help and quite likely to expand the data unless you are careful to allow compression to be disabled when it isn't working (per-packet would be nice here). In this case if the transfer link needs control/hi-bit escaping you would still be ahead to pass the data stream through a btoa style encoder instead of using traditional kermit escaping, so that should be a separate choice with or without the compression. Also, this stuff needs to be processed so that the CRC checking sees the same thing on both ends. The current 8-bit quoting is done such that it can be set on one end and not on the other and the error checking still passes. > > Sorry, I don't have any good alternative compression algorithms. You might want to check with: brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein). He has posted some original code to usenet and made references to some work in progress. I think the intent was to have something available to use without patent restrictions. Les Mikesell les@fb.com 28-Oct-91 18:56:12-GMT,4752;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA16439; Mon, 28 Oct 91 13:56:04 EST Date: Mon, 28 Oct 91 13:56:03 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 In-Reply-To: Your message of Mon, 28 Oct 91 13:17:55 CST Message-Id: <CMM.0.90.0.688676163.fdc@watsun.cc.columbia.edu> > I'm not sure I agree here. I suppose it might depend on the algorithm, > but I don't see how a 1-pass compress can do any optimization into > a specified character set to make it come out any better than compressing > into 256 characters, then expanding through something like btoa. That > is, the compress pass is always going to generate an essentially random > output and thus can't do any optimization for the output character set. > The extra processing would be irrelevant compared to the compress pass > anyway. > You might be right, but I think it's worth trying to come up with an algorithm that compresses into printable ASCII before deciding it can't be done. You should be able to use codewords of any bit length you like, so (for example) we could compress into 6-bit codewords and add some offset (like 33) to get printable ASCII. Wasteful but simple. Needs analysis. > But any way you look at it, control-code prefixing is going to cost about > 25% of your possible bandwidth, and on many (probably most) connections > it isn't necessary. > But you never know which control characters are going to cause problems. The Kermit programs themselves can't know, because they don't know what kinds of things are sitting between them. They can't even send a "test pattern" because that might put some intermediate box out of commission (e.g. Ctrl-P might put an X.3 PAD into local mode; Series/1-style 3270 protocol converters normally will not pass *any* control characters through to the host; etc etc). I don't think we can ever assume that any particular control characters are safe, even though some subset of them usually are on any particular connection. > I think you need both options and they need to be coordinated so the > compression can use the full character set if the transfer link allows it. > But then you put an awful burden on the user. I'd prefer to see compression work automatically if both Kermit programs support it, without having to make the user figure whether and to what degree the link is transparent to control and 8-bit characters. > Likewise, if the material is pre-compressed with unix compress, zip, zoo, > arc or the like, the kermit compression is unlikely to help and quite likely > to expand the data unless you are careful to allow compression to be > disabled when it isn't working (per-packet would be nice here). > Like all Kermit options (except control-character escaping), compression will be disablable by the user. If a file (precompressed or not) has a random distribution of characters, there will indeed be about 25% expansion for control prefixing. On a 7-bit connection, there would also be an additional 50% for 8th-bit escaping, except that now we have a locking shift mechanism that reduces that to practically nil. In the interest of minimizing the internal complexity of the Kermit program and its user interface, I think it might be better, when compressing, to strip away the already-complicated control- and 8th-bit escaping methods and compress directly into a form that is guaranteed to be transparent to even the most hostile connections. It's a tradeoff -- everybody pays a price in some additional overhead, but everybody gets a compression method that works without requiring just the right combination of manual tweaking. > In this case if the transfer link needs control/hi-bit escaping you would > still be ahead to pass the data stream through a btoa style encoder instead > of using traditional kermit escaping, so that should be a separate choice > with or without the compression. > btoa is a 4-for-3 scheme, right? For most types of data (all text, many kinds of binaries), Kermit is already better than that. > Also, this stuff needs to be processed > so that the CRC checking sees the same thing on both ends. The current > 8-bit quoting is done such that it can be set on one end and not on > the other and the error checking still passes. > That only happens if there is a bug in one of the Kermit programs. The 8th-bit prefixing negotiation is supposed to work. > You might want to check with: brnstnd@KRAMDEN.ACF.NYU.EDU (Dan Bernstein). > He has posted some original code to usenet and made references to some > work in progress. I think the intent was to have something available > to use without patent restrictions. > Thanks for the pointer! - Frank 28-Oct-91 20:36:19-GMT,4650;000000000011 Return-Path: <afbf05!afbf02!afbf14!les@uunet.UU.NET> Received: from relay2.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA17697; Mon, 28 Oct 91 15:36:15 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA27497; Mon, 28 Oct 91 15:36:17 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 153546.21143; Mon, 28 Oct 1991 15:35:46 EST Received: id <m0kbden-00015qC@afbf05.fb.com> Mon, 28 Oct 91 15:33 EST Received: id <m0kbdbA-00028IC@afbf02.fb.com> Mon, 28 Oct 91 14:30 CST Received: id <m0kbeWY-0002WRC@afbf14.fb.com> Mon, 28 Oct 91 15:29 CST Message-Id: <m0kbeWY-0002WRC@afbf14.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Mon, 28 Oct 91 15:29:26 CST In-Reply-To: <CMM.0.90.0.688676163.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 28, 91 1:56 pm > You might be right, but I think it's worth trying to come up with an algorithm > that compresses into printable ASCII before deciding it can't be done. No question that it can be done - I just don't think it can be done much more efficiently than compression without regard to the output character set followed by remapping. > > > But you never know which control characters are going to cause problems. Right - in my original message I suggested explicit commands for SET CONTROL-ESCAPING ON/OFF/MINIMAL/LIST so you could tell it how to handle a particular link optimally. Minimal would be xon/xoff/ctl-p/and the kermit start-of-packet. Off would probably still have to include the start-of-packet. This would need to be negotiated with the remote, although I suspect that most kermits would receive correctly anyway if the sender unilaterally only escaped the minimal set. > > I think you need both options and they need to be coordinated so the > > compression can use the full character set if the transfer link allows it. > > > But then you put an awful burden on the user. I'd prefer to see compression > work automatically if both Kermit programs support it, without having to make > the user figure whether and to what degree the link is transparent to control > and 8-bit characters. It's not a burden if the default is for the worst-case. Then the user doesn't need to know, but if he does know about a particular link he can easily set the option to take advantage of it. > In the interest of minimizing the internal complexity of the Kermit program > and its user interface, I think it might be better, when compressing, to strip > away the already-complicated control- and 8th-bit escaping methods and > compress directly into a form that is guaranteed to be transparent to even the > most hostile connections. It's a tradeoff -- everybody pays a price in some > additional overhead, but everybody gets a compression method that works > without requiring just the right combination of manual tweaking. Yes, that is the right thing to do for the default case, but I still think you should accomodate the people who are using expensive but transparent links and want the optimum speed. In this case, the data is very likely going to be pre-compressed anyway. > > In this case if the transfer link needs control/hi-bit escaping you would > > still be ahead to pass the data stream through a btoa style encoder instead > > of using traditional kermit escaping, so that should be a separate choice > > with or without the compression. > > > btoa is a 4-for-3 scheme, right? For most types of data (all text, many > kinds of binaries), Kermit is already better than that. But the people who care about transfer time are probably compressing the files already. They are probably also using some other transfer protocol if they have transparent links, but kermit could handle it with a simple option. The 4-3 coding just makes things fall out on byte boundaries. You could do 8-bit to n-bit coding any way you like as long as it is internal to kermit. The ideal thing would be to make the compression and encoding options settable per-packet (at least when you use long packets). Then the sender could prepare compressed/encoded, straight encoded, and traditional kermit escaping packets and send whichever happens to be shorter. No matter what compression scheme you pick there will always be data that won't compress. If you look at tests of MNP5 vs. V.42bis modem transfer rates you will see that the V.42bis handles pre-compressed data much better because it knows enough not to expand the stream accidentally for any length of time. Les Mikesell les@fb.com 28-Oct-91 21:34:03-GMT,5365;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA18316; Mon, 28 Oct 91 16:33:53 EST Date: Mon, 28 Oct 91 16:33:52 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 In-Reply-To: Your message of Mon, 28 Oct 91 15:29:26 CST Message-Id: <CMM.0.90.0.688685632.fdc@watsun.cc.columbia.edu> > > You might be right, but I think it's worth trying to come up with an > > algorithm that compresses into printable ASCII before deciding it can't > > be done. > > No question that it can be done - I just don't think it can be done much > more efficiently than compression without regard to the output character > set followed by remapping. > Right, probably somewhat less efficiently -- but more portably (in the same way that Kermit protocol is more portable than Xmodem, but at some expense in efficiency -- however, Xmodem's efficiency is zero on a 7-bit link, so it kind of averages out). > > But you never know which control characters are going to cause problems. > > Right - in my original message I suggested explicit commands for > SET CONTROL-ESCAPING ON/OFF/MINIMAL/LIST so you could tell it how > to handle a particular link optimally. > This is, of course, a possibility and there's nothing in the protocol that prevents this. You are correct that most Kermit programs would accept control characters in data packets "bare" (but there's no good way to take a complete census). But it's a cost vs benefit thing -- the cost in complexity and confusion would be high, few people would actually use it; those that did would become confused when the "transparency mask" changed from connection to connection (and it will), etc. The overall gain for text is minimal, and even for binary files with uniform distribution of chars, 25% at best. So yes, it can be done, but the payoff would not be nearly so dramatic as for a good compression scheme, so it's lower on the priority list. Btw, there have also been some very clever alternate escaping mechanisms proposed for combinations of control and 8-bit chars, but again, the benefit/cost ratio is low. > Minimal would be xon/xoff/ctl-p/and the kermit start-of-packet. > ...and carriage return, and linefeed, and 0xff (telnet), and Ctrl-C (most things), and Ctrl-I (often gets expanded to spaces), and Ctrl-N/Ctrl-O (shift-in/shift-out), and various 8-bit shifts (SS2, SS3, LS2, LS3, LS2R,...), etc etc. > Off would probably still have to include the start-of-packet. This would > need to be negotiated with the remote, > But it can't be. They don't know, and there is no way (unless I'm being incredibly dense) that they can figure it out for themselves. "Hey, here's a Ctrl-S, did you get it?" -- what happens next (now that the link is Xoff'd). Similar scenarios for each Ctrl character... > It's not a burden if the default is for the worst-case. Then the user > doesn't need to know, but if he does know about a particular link he can > easily set the option to take advantage of it. > Well, yes, that's an option. The thing can be designed (for example -- truly off the top of the head) to use, say, 6-bit codewords on a 7-bit link and 7-bit codewords on an 8-bit link (etc, into the future). But we'd have to be pretty clever about designing an algorithm that generalizes to n-bit codewords (and we should be!). > But the people who care about transfer time are probably compressing the > files already. > Actually, there are two things going on here. Precompression for quick transfers is one thing, and hopefully Kermit will make that relatively unnecessary. People also put things into archives for EASE of transport -- many unlike files collected into a single flat file, which floats around from place to place until it finally reaches its final destination where it is de-archived. Unfortunately, these schemes are mostly system-specific -- DOS ZIP archives, UNIX TAR archives, VMS BACKUP archives, etc -- they depend on the home system's encoding and record format conventions. > The 4-3 coding just makes things fall out on byte boundaries. You could do > 8-bit to n-bit coding any way you like as long as it is internal to kermit. > The ideal thing would be to make the compression and encoding options > settable per-packet (at least when you use long packets). Then the sender > could prepare compressed/encoded, straight encoded, and traditional kermit > escaping packets and send whichever happens to be shorter. > This is a worthy goal, but phew! -- Try to picture the programs that implement it. My instinct tells me that if we don't keep it pretty simple, programmers are just not going to tackle it. Also we have the issue of CPU speed here -- on your typical 80286 and below, the computation could easily wind up being the bottleneck. > No matter what compression scheme you pick there will always be data that > won't compress. If you look at tests of MNP5 vs. V.42bis modem transfer > rates you will see that the V.42bis handles pre-compressed data much better > because it knows enough not to expand the stream accidentally for any length > of time. > Just the kind of info I'm looking for. Have you seen any analyses of V.42bis published anywhere? I have (at least part of) the V.42bis specification (text, no pictures), but no analysis at all. Thanks! - Frank 28-Oct-91 23:51:52-GMT,6495;000000000011 Return-Path: <afbf05!afbf02!afbf14!les@uunet.UU.NET> Received: from relay2.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA19986; Mon, 28 Oct 91 18:51:49 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA09251; Mon, 28 Oct 91 18:51:37 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 185002.12615; Mon, 28 Oct 1991 18:50:02 EST Received: id <m0kbgfw-0001DyC@afbf05.fb.com> Mon, 28 Oct 91 18:47 EST Received: id <m0kbgb0-0002MCC@afbf02.fb.com> Mon, 28 Oct 91 17:42 CST Received: id <m0kbhWd-0001C5C@afbf14.fb.com> Mon, 28 Oct 91 18:41 CST Message-Id: <m0kbhWd-0001C5C@afbf14.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Mon, 28 Oct 91 18:41:42 CST In-Reply-To: <CMM.0.90.0.688685632.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 28, 91 4:33 pm > > No question that it can be done - I just don't think it can be done much > > more efficiently than compression without regard to the output character > > set followed by remapping. > Right, probably somewhat less efficiently -- but more portably Keeping the phases separate doesn't affect portability - and it lets you use each for its own purpose. [reduced escaping...] > This is, of course, a possibility and there's nothing in the protocol that > prevents this. > The overall gain for text is minimal, and > even for binary files with uniform distribution of chars, 25% at best. So > yes, it can be done, but the payoff would not be nearly so dramatic as for a > good compression scheme, so it's lower on the priority list. Btw, there have > also been some very clever alternate escaping mechanisms proposed for > combinations of control and 8-bit chars, but again, the benefit/cost ratio is > low. What's low? We spend about $6-10,000/month on phone charges for data communications and that's probably low for a national organization. I'd think 25% of that would be worth thinking about. Certainly worth figuring out whether you should issue a set command in your start-up scripts. Most of it isn't using kermit now, but I'd switch some things over if it could match the other protocols in speed. > > Off would probably still have to include the start-of-packet. This would > > need to be negotiated with the remote, > > > But it can't be. They don't know, and there is no way (unless I'm being > incredibly dense) that they can figure it out for themselves. Right - we have to assume that the sender knows what he is doing, or at least is capable of learning from the transfer failures when he does it wrong. The negotiation part would just be to be assured that the remote knows how to receive the unescaped control characters - it's really up to the sender to do it all. Basically you are just asking the user to be aware of the same things they currently need to know when they chose a protocol, if the list of choices includes xmodem, etc. > Actually, there are two things going on here. Precompression for quick > transfers is one thing, and hopefully Kermit will make that relatively > unnecessary. People also put things into archives for EASE of transport -- > many unlike files collected into a single flat file, which floats around from > place to place until it finally reaches its final destination where it is > de-archived. Unfortunately, these schemes are mostly system-specific -- DOS > ZIP archives, UNIX TAR archives, VMS BACKUP archives, etc -- they depend on > the home system's encoding and record format conventions. This is becoming much less of an issue with portable versions of zoo, arc and zip around, as well as tar and cpio. I happen to prefer zoo at the moment, but I may change my mind when I get around to pickup up the zip source. The real challenge for portable kermit compression has got to be the contortions you have to go through when ascii<->ebcdic conversions are done by something else between the sender and receiver, but I'm not sure it's worth crippling other transfers just to allow this one to be a little bit faster. (I do use a unix <-> ibm mainframe kermit link, but it's local and completely script driven, so the speed doesn't matter much to me.) > > The 4-3 coding just makes things fall out on byte boundaries. You could do > > 8-bit to n-bit coding any way you like as long as it is internal to kermit. > > The ideal thing would be to make the compression and encoding options > > settable per-packet (at least when you use long packets). Then the sender > > could prepare compressed/encoded, straight encoded, and traditional kermit > > escaping packets and send whichever happens to be shorter. > > > This is a worthy goal, but phew! -- Try to picture the programs that > implement it. My instinct tells me that if we don't keep it pretty simple, > programmers are just not going to tackle it. Also we have the issue of CPU > speed here -- on your typical 80286 and below, the computation could easily > wind up being the bottleneck. The worst case of this is the one you mostly want to try, though. That is, you need to go through the motions of compressing and doing the portable encoding anyway. The other 2 choices (encoding only/traditional kermit) are trivial in computation for the sender, and all you need are 3 bits in a packet header to allow an on-the-fly choice of which to send (although you do have to be careful about compression methods that change their tables dynamically). If the encoding/decompression steps are separate, the receiver doesn't have to be much more complicated at all. You will already need traditional kermit packet handling, so you just add the new alternative routines and process each step according to the header indication. I'm not sure it's worth going down to the short packets with this, though. > Just the kind of info I'm looking for. Have you seen any analyses of V.42bis > published anywhere? I have (at least part of) the V.42bis specification > (text, no pictures), but no analysis at all. No, I've only seen some commentary on usenet. Toby Nixon (an engineer at Hayes) has explained some of the theory (like what happens when you send a worst-case data stream), but unfortunately I didn't save any of it. Basically, after a certain interval of expanding data it will dump its tables and start retraining, somehow syncing this with the remote end. Les Mikesell les@chinet.chi.il.us 29-Oct-91 0:19:42-GMT,2071;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA20307; Mon, 28 Oct 91 19:18:02 EST Date: Mon, 28 Oct 91 19:18:01 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 In-Reply-To: Your message of Mon, 28 Oct 91 18:41:42 CST Message-Id: <CMM.0.90.0.688695481.fdc@watsun.cc.columbia.edu> > What's low? We spend about $6-10,000/month on phone charges for data > communications and that's probably low for a national organization. > I'd think 25% of that would be worth thinking about. > Total agreement. But wouldn't you rather reduce your phone bill by, say, 50%, 100%, or more, instead of 25%? With only a couple people willing to put in whatever spare-time hours they can grab, nights and weekends, we have to put our efforts where the biggest payoff is. Changing Kermit protocol and software to transmit & negotiate bare controls would take a lot longer than you think -- probably a year for the flame wars alone... > Right - we have to assume that the sender knows what he is doing, or at least > is capable of learning from the transfer failures when he does it wrong. > Yes, but... Can you suggest a method that a mere mortal can follow? > The real challenge for portable kermit compression has got to > be the contortions you have to go through when ascii<->ebcdic conversions > are done by something else between the sender and receiver... > Not just ASCII/EBCDIC, but Latin-1, JIS Kanji, etc etc. We've already figured this one out -- it's how mainframe Kermit works. We're adding compression at a level (sort of "high presentation") where this is not an issue (long story). > The worst case of this is the one you mostly want to try, though. That is, > you need to go through the motions of compressing and doing the portable > encoding anyway. The other 2 choices (encoding only/traditional kermit) > are trivial... > I must say, this is an interesting idea (mixing compressed and uncompressed packets) but, no doubt, a lot trickier than it sounds. - Frank 29-Oct-91 16:06:22-GMT,5542;000000000011 Return-Path: <afbf05!afbf02!afbf14!les@uunet.UU.NET> Received: from relay2.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA28757; Tue, 29 Oct 91 11:06:20 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA10445; Tue, 29 Oct 91 11:06:21 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 110506.14627; Tue, 29 Oct 1991 11:05:06 EST Received: id <m0kbvuR-0001GaC@afbf05.fb.com> Tue, 29 Oct 91 11:03 EST Received: id <m0kbvkL-0001YRC@afbf02.fb.com> Tue, 29 Oct 91 09:52 CST Received: id <m0kbwgQ-0002WWC@afbf14.fb.com> Tue, 29 Oct 91 10:52 CST Message-Id: <m0kbwgQ-0002WWC@afbf14.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Tue, 29 Oct 91 10:52:50 CST In-Reply-To: <CMM.0.90.0.688695481.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 28, 91 7:18 pm > Total agreement. But wouldn't you rather reduce your phone bill by, say, > 50%, 100%, or more, instead of 25%? With only a couple people willing to put > in whatever spare-time hours they can grab, nights and weekends, we have to > put our efforts where the biggest payoff is. Changing Kermit protocol and > software to transmit & negotiate bare controls would take a lot longer than > you think -- probably a year for the flame wars alone... I can't argue about the politics, but from a practical point of view, I'd rather have the ability to pass the control characters. You are unlikely to beat V.42bis compression, and that is increasingly available in the modem hardware, and in the cases where it really matters we can always pre-compress before transmission. The thing we currently can't do is move the bytes with any efficiency after they have been compressed. I'd be happy with a simple non-negotiated command to turn off escaping on everything but start-of-packet, ^C, ^P, and if flow control is enabled, ^S and ^Q (I suppose you'd have to look at how most versions handle received nulls, too...). Then just start fixing versions to do the right thing on the receiving side if they don't already. I don't see how anyone can flame about something that is an option and not enabled unless you want it. But the important thing is to get this ability in place so an alternate encoding method can follow the setting. Typical text compression might be in the 60-70% range using the bandwith provided by an 8-bit data stream. The nature of compression is going to make the character distribution essentially random (otherwise the compression isn't optimal), so restricting the character set for transmission is going to take a statistical 1/256th of your efficiency for every character you disallow, plus the overhead of the mapping. There is just no magical way to avoid this. >>Right - we have to assume that the sender knows what he is doing, or at least >> is capable of learning from the transfer failures when he does it wrong. >> > Yes, but... Can you suggest a method that a mere mortal can follow? Yes - the way you learn everything else: if it hurts, don't do it! It's no more complicated than getting your parity set right or any of the other things you already have to know. People are already making this particular decision, except that instead of telling kermit to work efficiently they just use a different protocol that is more efficient by design. But, programs like C-kermit 5A and the MSDOS 3.11 are so good that they could handle just about everything else anyone would want to do. Maybe it would be simpler to just graft a zmodem option into those programs. > Not just ASCII/EBCDIC, but Latin-1, JIS Kanji, etc etc. We've already figured > this one out -- it's how mainframe Kermit works. We're adding compression at > a level (sort of "high presentation") where this is not an issue (long story). I hope this means that the decompression will be done before the checksum/ CRC is computed so that any screwups will appear as a transmission error and not go undetected. I've had this happen with 8-bit quoting with one end happily gobbling all my "&" characters and converting the next character and reporting that the transfer went fine. C source is not a pretty sight after this happens. I know you said that this was supposed to be negotiated, so it was probably an error in the program, but it still should have failed the transfer instead of reporting success with different file contents at the other end. > I must say, this is an interesting idea (mixing compressed and uncompressed > packets) but, no doubt, a lot trickier than it sounds. The only tricky part is that the compressor needs to know about it. You couldn't just take the stream from unix compress and do that because the compression tables are part of the data stream and they adapt to the data on the fly. Also, compression code generally builds a bit-stream, so you might have to fill out to a byte boundary, effectively going through the motions of ending a file at the compression level and then starting up from scratch on the next compressable packet. How complex this is would depend on the nature of the compression code. Off the top of my head, though, I would bet that you could build non-adaptive tables for english text that would work as well as any adaptive scheme if you simply pass through the pieces that don't compress. But, I suppose you need to support other languages as well so you need some sort of adaptive method. Les Mikesell les@fb.com 29-Oct-91 17:06:01-GMT,2579;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA29982; Tue, 29 Oct 91 12:04:57 EST Date: Tue, 29 Oct 91 12:04:56 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 In-Reply-To: Your message of Tue, 29 Oct 91 10:52:50 CST Message-Id: <CMM.0.90.0.688755896.fdc@watsun.cc.columbia.edu> > I can't argue about the politics, but from a practical point of view, I'd > rather have the ability to pass the control characters. You are unlikely > to beat V.42bis compression, and that is increasingly available in the > modem hardware, and in the cases where it really matters we can always > pre-compress before transmission. > But pre-compression usually only works between like systems, or a small set of unlike systems (e.g. where "portable zip" runs). Even then, it doesn't handle the character set conversions, and often doesn't record format conversions either, which is why you really want to slip compression into Kermit's presentation layer after all that has been done. > > Yes, but... Can you suggest a method that a mere mortal can follow? > Yes - the way you learn everything else: if it hurts, don't do it! > It's no more complicated than getting your parity set right or any of > the other things you already have to know. > Well, it's actually much more complicated. Parity is usually an "on" or "off" decision; only in rare cases does it actually matter which kind of (non-none) parity to use. At most we have 5 choices. For optimal control character transparency the user has to test over 60 different characters for each connection. That could take hours, since any given test could blow the connection away. There goes your efficiency. > I hope this means that the decompression will be done before the checksum/ > CRC is computed so that any screwups will appear as a transmission error > and not go undetected. > Definitely. The entire packet is not compressed, only the contents of the data field, and the checksum is computed on the characters that go into the packet. > I've had this happen with 8-bit quoting with one > end happily gobbling all my "&" characters and converting the next character > and reporting that the transfer went fine. > Like I said, that's a bug -- it should not happen. Before C-Kermit 5A is released, I'll be sure this is working right. > I would bet that you could build non-adaptive tables for english text that... > We actually did this once in a pilot project. But it's a no-no for Kermit. No anglocentrism allowed. - Frank 29-Oct-91 18:55:14-GMT,4225;000000000005 Return-Path: <afbf05!afbf02!afbf14!les@uunet.UU.NET> Received: from relay2.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA02047; Tue, 29 Oct 91 13:55:06 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay2.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA20384; Tue, 29 Oct 91 13:55:09 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 135439.19267; Tue, 29 Oct 1991 13:54:39 EST Received: id <m0kbyYb-0001I8C@afbf05.fb.com> Tue, 29 Oct 91 13:52 EST Received: id <m0kbyUM-0001rgC@afbf02.fb.com> Tue, 29 Oct 91 12:48 CST Received: id <m0kbzQT-0001geC@afbf14.fb.com> Tue, 29 Oct 91 13:48 CST Message-Id: <m0kbzQT-0001geC@afbf14.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Tue, 29 Oct 91 13:48:32 CST In-Reply-To: <CMM.0.90.0.688755896.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 29, 91 12:04 pm > But pre-compression usually only works between like systems, or a small set > of unlike systems (e.g. where "portable zip" runs). Then that's a problem in its own right and should be addressed directly, perhaps by promoting the use of "zoo" which should be available for the common ascii machines. > Even then, it doesn't > handle the character set conversions, and often doesn't record format > conversions either, which is why you really want to slip compression into > Kermit's presentation layer after all that has been done. I hesitate to comment on this because I think this belongs elsewhere as well. That is, any machine needing file format preservation should have a way to encapsulate the file into a byte stream and do the reverse. A file being sent over a communications line may traverse many machines before it's actually used again. I'd prefer for the transfer protocol to transfer the file contents unchanged so that the proper conversions can be done only at the end point. For the terminal emulation portion you need to do it in real time, and you don't have any choice about the ascii-ebcdic conversions when going through a mainframe front-end, but files can always be run through a separate conversion. Isn't that why we put things in data files in the first place? Plus there are many things that you might want to preserve (like the ownership, permissions, filenames that may not map reversibly on the current machine) that can only be done by encapsulating the file and attributes for transmission because intermediate filesystems may not share the same concepts. > Well, it's actually much more complicated. Parity is usually an "on" or > "off" decision; only in rare cases does it actually matter which kind of > (non-none) parity to use. At most we have 5 choices. Make that 5 choices at each end, thus 25 possibilities, plus flow control, plus file type, plus a few other settings that have to be set right just to get a single file across unchanged. If you are arguing that kermit is now simple to use and would be rendered complex by another settable option, I disagree. Kermit already has a deserved reputation for being difficult to use - one settable option more or less isn't going to change that. > For optimal control > character transparency the user has to test over 60 different characters for > each connection. That could take hours, since any given test could blow the > connection away. There goes your efficiency. That's an unlikely worst-case scenario. In practice, someone will know if a particular link isn't transparent (and why), and tell the people using it the proper settings to use. How long would it take you to find the right mainframe handshake character? The same argument would apply there. And I still say that people already know this and just use different protocols without being overwhelmed by complexity. Ideally, kermits could be set up to send an attribute packet defining any known problem characters for the link at start-up (escaped or in hex) but that would only work for point to point links with computers that have some concept of which device they are using, so the lack of such an indication shouldn't automatically mean the link is transparent. Les Mikesell les@fb.com 29-Oct-91 20:21:58-GMT,2050;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA04080; Tue, 29 Oct 91 15:21:42 EST Date: Tue, 29 Oct 91 15:21:40 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 In-Reply-To: Your message of Tue, 29 Oct 91 13:48:32 CST Message-Id: <CMM.0.90.0.688767700.fdc@watsun.cc.columbia.edu> > > Even then, it doesn't handle the character set conversions, and often > > doesn't record format conversions either, which is why you really want to > > slip compression into Kermit's presentation layer after all that has been > > done. > > I hesitate to comment on this because I think this belongs elsewhere as > well. > Here we're getting into very old arguments. I'll only say that the notion of a "common intermediate representation" is essential to all protocols. Without them, every computer would need knowledge of the formats and codes of every other kind of computer. > That is, any machine needing file format preservation should have > a way to encapsulate the file into a byte stream and do the reverse. > A file being sent over a communications line may traverse many machines > before it's actually used again. I'd prefer for the transfer protocol > to transfer the file contents unchanged so that the proper conversions > can be done only at the end point. > Done, not only by zip/zoo/tar/backup and friends, but also by Kermit's very own "set file type binary". You can even have Kermit do record format conversion but skip the character-set translations: "set file type text", "set transfer character-set transparent". The Kermit protocol also has an "archiving" notion that has never been implemented (basically, saving the attribute packet along with the data). > ...but files can always be run through a separate conversion. > Then if we have n kinds of computers, we need n x (n-1) conversion programs. Now add a new kind of computer: write a new conversion program for it, and modify the other n x (n-1) programs to know about it. - Frank 30-Oct-91 3:45:41-GMT,1771;000000000001 Return-Path: <afbf05!afbf02!afbhub!les@uunet.UU.NET> Received: from relay1.UU.NET by watsun.cc.columbia.edu (5.59/FCB) id AA09857; Tue, 29 Oct 91 22:45:35 EST Received: from uunet.uu.net (via LOCALHOST.UU.NET) by relay1.UU.NET with SMTP (5.61/UUNET-internet-primary) id AA03817; Tue, 29 Oct 91 22:45:13 -0500 Received: from afbf05.UUCP by uunet.uu.net with UUCP/RMAIL (queueing-rmail) id 224431.10244; Tue, 29 Oct 1991 22:44:31 EST Received: id <m0kc6pa-0001GUC@afbf05.fb.com> Tue, 29 Oct 91 22:43 EST Received: id <m0kc6mm-0002M2C@afbf02.fb.com> Tue, 29 Oct 91 21:40 CST Received: id <m0kc6nD-0005EkC@afbhub.fb.com> Tue, 29 Oct 91 21:40 CST Message-Id: <m0kc6nD-0005EkC@afbhub.fb.com> From: les@fb.com (Les Mikesell) Subject: Re: Info-Kermit Digest V14 To: fdc@watsun.cc.columbia.edu (Frank da Cruz) Date: Tue, 29 Oct 91 21:40:32 CST In-Reply-To: <CMM.0.90.0.688767700.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Oct 29, 91 3:21 pm > > ...but files can always be run through a separate conversion. > > > Then if we have n kinds of computers, we need n x (n-1) conversion programs. > Now add a new kind of computer: write a new conversion program for it, and > modify the other n x (n-1) programs to know about it. This has nothing to do with the concept of including such conversions in a file transfer protocol - it could still be done by with a single program for each machine to handle the local <-> portable conversions. I don't really want to argue against having those options available in kermit, but the one thing that can't be done anywhere but in the communication code is efficiently moving the bytes. I just think it is a mistake to let functions that could be done elsewhere interfere with the thing that can't. Les Mikesell les@fb.com 30-Oct-91 10:39:12-GMT,1072;000000000005 Return-Path: <tankus@hsi86.hsi.com> Received: from hsi86.hsi.com by watsun.cc.columbia.edu (5.59/FCB) id AA12825; Wed, 30 Oct 91 05:39:09 EST Received: by hsi86.hsi.com (5.61+++/1.34) id AA28816; Wed, 30 Oct 91 05:34:34 -0500 Message-Id: <9110301034.AA28816@hsi86.hsi.com> From: tankus@hsi86.hsi.com (Ed Tankus) Date: Wed, 30 Oct 1991 05:34:33 EST X-Mailer: Mail User's Shell (7.1.2 7/11/90) To: fdc@watsun.cc.columbia.edu Subject: Compresssion for Kermit Frank, You may wish to speak with the author of ARJ, Robert K. Jung. The ARJ program is the fastest and best-compression scheme in the DOS world. Robert has a C source de-archiver which is very portable and plans to release a C source simple archiver during 1Q92 when he releases a UNIX flavor of his commercial product. Robert K. Jung can be reached via email at robjung@world.std.com. The ARJ software is available for FTP from simtel20, wuarchive, or me. Hope this helps. "For every word, there is a song upon which inspiration lies ...". Ed Tankus. {uunet,yale}!hsi!tankus -- OR -- tankus@hsi.com 30-Oct-91 10:54:57-GMT,1332;000000000005 Return-Path: <rommel@dszenger9.informatik.tu-muenchen.de> Received: from tuminfo2.informatik.tu-muenchen.de by watsun.cc.columbia.edu (5.59/FCB) id AA12906; Wed, 30 Oct 91 05:53:03 EST Received: from dszenger9.informatik.tu-muenchen.de ([131.159.8.67]) by tuminfo2.informatik.tu-muenchen.de with SMTP id <16914>; Wed, 30 Oct 91 10:57:34 +0100 Received: by dszenger9.informatik.tu-muenchen.de (5.61++/TUMinfo2.5Ni) id AA25284; Wed, 30 Oct 91 10:57:30 +0100 From: Kai Uwe Rommel <rommel@informatik.tu-muenchen.de> Reply-To: <rommel@dszenger9.informatik.tu-muenchen.de> Message-Id: <9110300957.AA25284@dszenger9.informatik.tu-muenchen.de> Subject: Re: [Bo Kullmar <bk@kullmar.se>: C-Kermit for OS/2] To: fdc@watsun.cc.columbia.edu Date: Wed, 30 Oct 91 10:57:29 +0100 In-Reply-To: <CMM.0.90.0.680906442.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Jul 30, 91 4:40 pm X-Mailer: ELM [version 2.3 PL11] I forgot to mention, I have the compression part of the programs already working on 16-bit Microsoft C, only the decompressor fails currently. I needed the algorithms for my thesis which I am currently preparing. Kai Uwe Rommel -- /* Kai Uwe Rommel, Munich ----- rommel@informatik.tu-muenchen.de */ DOS ... is still a real mode only non-reentrant interrupt handler, and always will be. -Russell Williams 30-Oct-91 11:21:29-GMT,1713;000000000005 Return-Path: <rommel@dszenger9.informatik.tu-muenchen.de> Received: from tuminfo2.informatik.tu-muenchen.de by watsun.cc.columbia.edu (5.59/FCB) id AA13067; Wed, 30 Oct 91 06:20:31 EST Received: by tuminfo2.informatik.tu-muenchen.de via suspension id <16916>; Wed, 30 Oct 91 10:56:31 +0100 Received: from dszenger9.informatik.tu-muenchen.de ([131.159.8.67]) by tuminfo2.informatik.tu-muenchen.de with SMTP id <16914>; Wed, 30 Oct 91 10:56:01 +0100 Received: by dszenger9.informatik.tu-muenchen.de (5.61++/TUMinfo2.5Ni) id AA25262; Wed, 30 Oct 91 10:55:55 +0100 From: Kai Uwe Rommel <rommel@informatik.tu-muenchen.de> Reply-To: <rommel@dszenger9.informatik.tu-muenchen.de> Message-Id: <9110300955.AA25262@dszenger9.informatik.tu-muenchen.de> Subject: Re: [Bo Kullmar <bk@kullmar.se>: C-Kermit for OS/2] To: fdc@watsun.cc.columbia.edu Date: Wed, 30 Oct 91 10:55:54 +0100 In-Reply-To: <CMM.0.90.0.680906442.fdc@watsun.cc.columbia.edu>; from "Frank da Cruz" at Jul 30, 91 4:40 pm X-Mailer: ELM [version 2.3 PL11] I don't have news regarding the OS/2 port of C-Kermit. Probably I will only have time during christmas. On another topic - you are looking for a compression algorithm. You could ask Ross Williams for his LZRW-[123][a] programs. They are portable C versions, running at 200k/sec to 100k/sec on my 24 MHz 386 and work blocked, so you can back up in case of transmission errors. They currently work on 32-bit machines only, but that can be fixed, I think. His address is ross@spam.ua.oz.au. Kai Uwe Rommel -- /* Kai Uwe Rommel, Munich ----- rommel@informatik.tu-muenchen.de */ DOS ... is still a real mode only non-reentrant interrupt handler, and always will be. -Russell Williams 30-Oct-91 19:38:23-GMT,4639;000000000015 Return-Path: <jloup@chorus.fr> Received: from chorus.chorus.fr by watsun.cc.columbia.edu (5.59/FCB) id AA19569; Wed, 30 Oct 91 14:38:04 EST Received: from nocturne.chorus.fr by chorus.chorus.fr, Wed, 30 Oct 91 20:24:44 +0100 Received: from localhost.chorus.fr by nocturne.chorus.fr, Wed, 30 Oct 91 20:21:58 +0100 Message-Id: <9110301921.AA11176@nocturne.chorus.fr> To: Frank da Cruz <fdc@watsun.cc.columbia.edu> Cc: INFO-ZIP@CS.UCLA.EDU, Jean-loup Gailly <jloup@chorus.fr> Subject: Re: Recent Kermit Protocol Extensions Date: Wed, 30 Oct 91 20:21:56 +0100 From: Jean-loup Gailly <jloup@chorus.fr> Frank, > The next major effort in Kermit protocol expansion (no commitment > expressed or implied!) is the addition of an effective and portable > compression method. We're in the information-gathering phase now. I suggest that you use an algorithm in the LZ77 class [Ziv77]. They are often slower than algorithms of the LZ78 class [Ziv78], but generally produce a much better compression ratio. They also require less memory, and it is quite easy to set any desired tradeoff between speed and compression ratio. In short, the main difference between LZ77 and LZ78 is that duplicated strings are designated with a (pointer,length) pair in LZ77 and a pointer only in LZ78. Algorithms using Markov modeling can get an excellent compression ratio but they are slow and eat a lot of memory. > The method that is finally chosen must be single-pass, zip 1.0 uses two passes, but future versions will have an option to use one pass only. > in the clear legally (unlike LZW, which is tied up in patent > infringement litigation, Given the rate at which software patents on data compression are currently granted, it will be very soon impossible to write an efficient non-patented algorithm. LZ78 has been patented twice and several patents on LZ77 have appeared recently. The fastest LZ77 algorithm, published in particular by Ross Williams in his LZRW series, is patented by Gibson and Graybill (U.S. Patent 5,049,881). As mentioned by Rich Wales, Phil Katz also obtained a patent on LZ77, but surprisingly his algorithm does not seem practical and does not cover the simpler technique used in zip, arj and freeze. The algorithms used in lha and zoo 2.1 are probably covered by the Fiala and Greene patent, but I have not read the patent claims. yabba written by Dan Bernstein is free of patents, but requires a lot of memory and does not compress as well as the LZ77 compressors. whap, also written by Dan Bernstein, is patented by Storer. zip, arj, freeze and lharc seem to be free of patents for the moment, but Robert Jung (author of arj) has applied for a patent in this area. (One correction about LZW: this algorithm is patented by IBM and Unisys, but both companies are not actually suing people for using Unix 'compress'.) > effective for all types of data (7-bit text or 8-bit text > in any language, binary executables, numeric data, raster images, > etc), LZ77 is much better than LZ78 for binary data. > For maximum transportability, the result of > compression should be a sequence of 7-bit ASCII printable characters > (32-126 or a subset of these). I think that Kermit should take care of transforming the compressed binary data to printable characters. It would be easier to plug in existing compression algorithms. > Suggestions, pointers to specifications for various methods and > analyses of them, etc, would be most appreciated. For more details on the compression programs mentioned above, you can contact: Rahul Dhesi <dhesi@cirrus.com> about zoo Leonid Broukhis <leo@s514.ipmce.su> about freeze Ross Williams <ross@spam.ua.oz.au> about lzrw* Dan Bernstein <brnstnd@kramden.acf.nyu.edu> about yabba Y.Tagawa ??? about Unix lharc myself <jloup@chorus.fr> about zip zip is currently much faster than its portable LZ77 competitors (zoo, freeze, lharc), but zoo compresses slightly better. The non patented variants of lzrw should be fast as well, but I have not tested them. I don't give references for lha and arj since they only run under Msdos and the source code of arj is not available anyway. lharc, zip, zoo, and freeze have been posted to comp.sources.misc. Jean-loup Gailly References: [Ziv77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3, pp. 337-343. [Ziv78] Ziv J., Lempel A., "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5, pp. 530-536. 30-Oct-91 20:05:31-GMT,3110;000000000001 Return-Path: <jloup@chorus.fr> Received: from chorus.chorus.fr by watsun.cc.columbia.edu (5.59/FCB) id AA19897; Wed, 30 Oct 91 15:05:17 EST Received: from nocturne.chorus.fr by chorus.chorus.fr, Wed, 30 Oct 91 21:03:53 +0100 Received: from localhost.chorus.fr by nocturne.chorus.fr, Wed, 30 Oct 91 21:01:07 +0100 Message-Id: <9110302001.AA11323@nocturne.chorus.fr> To: Frank da Cruz <fdc@watsun.cc.columbia.edu> Cc: Jean-loup Gailly <jloup@chorus.fr> Subject: Re: Recent Kermit Protocol Extensions In-Reply-To: Your message of Wed, 30 Oct 91 14:42:51 -0500. <CMM.0.90.0.688851771.fdc@watsun.cc.columbia.edu> Date: Wed, 30 Oct 91 21:01:04 +0100 From: Jean-loup Gailly <jloup@chorus.fr> > It is also a little bit discouraging (as to the patents, etc) Yes, I am quite upset with these software patents. The US patent office is absolutely incompetent to determine obviousness and the existence of prior art. Most software patents are invalid on both grounds. But the simple threat of a lawsuit is often too much for individuals like us, developers of zip. We learned about the Phil Katz patent a few days before zip became public and we had to study it in detail to convince ourselves that it did not cover our implementation. > What do you know about V.42bis? Is it a derivative of one of the > patented methods? I am afraid so. I think it is LZW. But it is very unlikely that Unisys or IBM will do anything about this, because LZW is used about everywhere now, and moreover they probably do not want to fight each other because both patents apply to the same algorithm. Of course the patent office did not see that. > Have you seen analyses or benchmarks published anywhere? Benchmarks are regularly published in comp.compression. Ask Peter Gutman <pgut1@cs.aukuni.ac.nz> for his latest results, because the last posting did not include zip 1.0 which was not yet public. I am giving below an extract of a posting I made in comp.compression on Oct 7. Jean-loup ----------------------------------------------------------------------- Here are some results on the standard text compression corpus (the 14 files bib book1 book2 geo news obj1 obj2 paper1 paper2 pic progc progl progp trans) available at fsa.cpsc.ucalgary.ca in /pub/text.compression.corpus. user system archive size $ time zip ../corp * 37.630u 5.750s 1,111,595 $ time zoo ah ../corp * 88.070u 12.280s 1,097,258 $ time lharc a ../corp * 99.180u 9.720s 1,201,148 $ tar cf - * | time freeze > ../corp.F 110.590u 9.140s 1,137,586 So, on large files, zip is more than twice faster than all other unix compressors with comparable compression ratio (yabba and compress are faster but have much worse compression ratio). zoo has however slightly better compression because its archive format is better. (zip cannot change its archive format because of the compatibility constraint with PKZIP.) The versions used where zip 1.0, zoo 2.1, lharc 1.02 and freeze 2.2.1, run under SunOS 4.1 on a Sparc. 30-Oct-91 20:53:44-GMT,5208;000000000011 Return-Path: <jloup@chorus.fr> Received: from chorus.chorus.fr by watsun.cc.columbia.edu (5.59/FCB) id AA20527; Wed, 30 Oct 91 15:53:27 EST Received: from nocturne.chorus.fr by chorus.chorus.fr, Wed, 30 Oct 91 21:51:37 +0100 Received: from localhost.chorus.fr by nocturne.chorus.fr, Wed, 30 Oct 91 21:48:50 +0100 Message-Id: <9110302048.AA11583@nocturne.chorus.fr> To: Frank da Cruz <fdc@watsun.cc.columbia.edu> Cc: info-zip@cs.ucla.edu, Jean-loup Gailly <jloup@chorus.fr> Subject: zip compression algorithm (was: Info-Zip and compression for Kermit) Date: Wed, 30 Oct 91 21:48:48 +0100 From: Jean-loup Gailly <jloup@chorus.fr> Frank da Cruz asks: > is there a single document, or at least a small number of files > (or books, or papers) where Zip's compression algorithm is > presented and analyzed? Let me try a short summary. Zip's implosion algorithm finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance, length). Distances are limited to 8K bytes, and lengths are limited to 320 bytes. When a string does not occur anywhere in the previous 8K bytes, it is emitted as a sequence of literal bytes. (In this description, 'string' must be taken as an arbitrary sequence of bytes, and is not restricted to printable characters.) In the compressed output, literal bytes are distinguished from pointers with one leading bit. (This is one of the unfortunate features of the zip format, since better formats use less than one bit.) Distances and lengths are themselves compressed using Shannon-Fano encoding, which is somewhat similar to Huffman coding: frequent values are encoded on fewer bits than unlikely values. Literal bytes can also be optionally encoded using a Shannon-Fano tree. The Shannon-Fano trees are stored in a compacted (but non optimal) form at the head of compressed output. zip 1.0 uses a two pass algorithm to determine optimal trees, but a single pass algorithm is possible using static trees. Duplicated strings are found using a hash table. All input strings of length 2 (for binary data) or 3 (for ascii data) are inserted in the hash table and removed when they fall off the sliding window. A hash index is computed for the next 2 or 3 bytes. If the hash chain for this index is not empty, all strings in the chain are compared with the current input string, and the longest match is selected. The hash chains are searched starting with the most recent strings, to favor small distances and thus take advantage of the Shannon-Fano encoding. The hash chains are doubly linked to get fast deletion. To avoid a worst case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a runtime option (zip -0 to -9). So zip does not always find the longest possible match but generally finds a match which is long enough. zip also defers the selection of matches with a lazy evaluation mechanism. After a match of length N has been found, zip searches for a longer match at the next input byte. If a longer match is found, the previous match is truncated to a length of one (thus producing a single literal byte) and the longer match is emitted afterwards. Otherwise, the original match is kept, and the next match search is attempted only N steps later. The lazy match evaluation is also subject to a runtime parameter. If the current match is long enough, zip does not search for a longer match, thus speeding up the whole process. If compression ratio is more important than speed, zip attempts a second search even the first one is already quite long. For small files, zip also tries a completely different algorithm called shrink, which is an improvement on the LZW algorithm used in the well known 'compress' program. The improvement is to re-use leaves of the string tree, instead of erasing the whole tree when the compression ratio degrades. As in implosion, the shrinking algorithm finds duplicated strings, but it can only finds a subset of them, because new strings are inserted in the dictionary only at the end of the previous string match. (Implosion inserts a new string at every input byte.) The decompressor reconstructs the same dictionary as the compressor, and is thus able to determine string lengths given a string pointer only. So the shrink output is only a sequence of pointers and does not include lengths. The pointers are encoded on the minimal number of bits required to represent them, but they are not further compressed with Huffman or Shannon-Fano encoding. Shrinking is not attempted by default on large files because implosion is generally better, once the initial cost of storing the Shanon-Fano trees is amortized. I hope that this little introduction has been helpful. For more details about data compression algorithms, I recommend the book Text Compression Timothy C. Bell, John G. Cleary, Ian H. Witten ISBN 0-13-911991-4 Price: approx. US$40, much higher outside the US Publisher: Prentice Hall or the review paper Modeling for Text Compression Timothy C. Bell, Ian H. Witten, John G. Cleary ACM computing surveys, Vol 21, No 4, Dec 1989, p557-591 Jean-loup 31-Oct-91 3:05:17-GMT,3931;000000000411 Return-Path: <thobe@getunx.info.com> Received: from RUTGERS.EDU by watsun.cc.columbia.edu (5.59/FCB) id AA24528; Wed, 30 Oct 91 22:05:13 EST Received: from usc.UUCP by rutgers.edu (5.59/SMI4.0/RU1.4/3.08) with UUCP id AA16340; Wed, 30 Oct 91 20:50:57 EST Received: from srhqla.UUCP by usc.edu (5.64+/SMI-3.0DEV3) from user uucp id AA18977; Wed, 30 Oct 91 17:16:29 PST Received: by srhqla.SR.COM (5.51/smail2.5/12-06-88) id AA01857; Wed, 30 Oct 91 16:50:11 PST Received: from somewhere by avatar.com id aa15393; Wed, 30 Oct 91 16:23:11 PST Received: by getunx.info.com (5.58/smail2.5/09-28-87) id AA04026; Wed, 30 Oct 91 16:26:32 PST Date: Wed, 30 Oct 91 16:26:32 PST From: "Glenn E. Thobe" <thobe@getunx.info.com> Message-Id: <9110310026.AA04026@getunx.info.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Recent Kermit Protocol Extensions Frank- Thank you for your detailed reply of 25 Oct 91 to my letter regarding proposed Kermit extensions, namely compression. On the issue of whether the compressed data should be able to use the upper ASCII printables when available, a quick calculation reveals that using the extra codes adds about 15% to the channel capacity, i.e. not using them increases the message size by 15%. (log 192 / log 96 = 1.15) That's 15% extra cost no matter how good the compression. >> P.S. Bravo for your pioneering internationalization efforts! >Thanks! (Do you use this stuff yourself?) I do a lot of Cyrillic text processing on my 3b1/UnixPC. I have even built a Russian TeX. As for Cyrillic data communications, I'm pretty much limited to network e-mail just because there is no one to exhange texts with via Kermit. I am hoping that when 8-bit mailers become universal, ISO 8895 (-5 for Cyrillic) will become the standard of exchange as you have made it for Kermit. For now, it seems impossible to get two people to agree on the same Cyrillic transliteration. I found the articles in issue 4 of your Kermit mailout regarding Kermit in Russia and internationalization to be quite informative. (I haven't gotten #5 yet, the announcement said if you got #4, they would send #5 automatically.) >> P.P.S. What can be done to get the most popular BBS's to support Kermit? >Beats me. Maybe as BBS's become popular outside the USA -- where >(inter)national character set conversion is important and 8-bit transparency >cannot be assumed (because of the PTTs) -- we'll see more Kermit support in >BBS software. Chuck Forsberg has had pretty much success getting BBS's to use his Z-modem. And he doesn't have a world-wide army of volunteers like Kermit does. He even charges money for his product. I think part of the reason is that Kermit has gotten a rep for being slow partly because of this prefixing issue, and for other reasons which have been fixed, but people don't know it yet. A stripped MS-Kermit for remote use on BBS's, if such doesn't already exist, would help, along with specific information on how to install it on the major BBS systems (FIDO, Waffle, etc.). By stripped I mean no terminal emulation facilities (e.g. rz/sz), do everything from the command line. Most BBS's give you a choice of protocols, and it shouldn't be hard to make one of them Kermit. The designers and purveyors of BBS systems could be approached directly. I don't see a lot of technical obstacles here. Maybe a note in info-kermit would be sufficient to mobilize the forces. When I encounter a BBS that doesn't support Kermit, I first proselytize the sysop. Often they are interested, but I don't know exactly what to say to them other than to start explaining about the WATSUN archive, the Kermit books, the universality of Kermit implementations, the importance of getting a current release, etc. Then I escape to my local Kermit and "!rz" to download files. (It's easier to swim around a rock than through it.) -Glenn <getunx!thobe@uunet.uu.net> 31-Oct-91 16:19:02-GMT,728;000000000001 Return-Path: <mtune> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA01558; Thu, 31 Oct 91 11:19:01 EST Date: Thu, 31 Oct 91 11:19:01 EST From: Peter Mauzey <mtune> Message-Id: <9110311619.AA01558@watsun.cc.columbia.edu> To: fdc Frank - (from Peter Mauzey) October 31, 1991 Regarding V.42bis being "in the clear," there is no doubt that Microcom, which claims to represent UNISYS, has asserted that the technology represented by V.42bis is covered by U.S. Patent 4,558,302. The contact at Microcom (VP Technology Planning) is Gregory Pearson, 818-986-4212, last I heard (Sherman Oaks, CA). My Department's representatives to the CCITT are now overseas but I will ask them when they return "what's new" with V42bis. 4-Nov-91 6:19:26-GMT,1485;000000000001 Return-Path: <brnstnd@KRAMDEN.ACF.NYU.EDU> Received: from KRAMDEN.ACF.NYU.EDU by watsun.cc.columbia.edu (5.59/FCB) id AA15188; Mon, 4 Nov 91 01:19:25 EST Received: from LOCALHOST by KRAMDEN.ACF.NYU.EDU (5.61/1.34) id AA17921; Mon, 4 Nov 91 06:19:23 GMT Message-Id: <9111040619.AA17921@KRAMDEN.ACF.NYU.EDU> To: Frank da Cruz <fdc@watsun.cc.columbia.edu> Cc: brnstnd@nyu.edu Subject: Re: Compression In-Reply-To: Your message of Mon, 28 Oct 91 14:21:04 EST. <CMM.0.90.0.688677664.fdc@watsun.cc.columbia.edu> Date: Mon, 04 Nov 91 01:19:21 +0000 From: brnstnd@KRAMDEN.ACF.NYU.EDU In message <CMM.0.90.0.688677664.fdc@watsun.cc.columbia.edu> you write: > Pointers to specifications, analyses, etc, would be much appreciated. > Thanks! You should pick up my yabbawhap package, comp.sources.unix volume 24. The yabba part uses Y coding, which is described in a paper included with the package and which is free of patents. Y is an LZ variant which generally outperforms compress by 10-20%; on English text it usually beats even zip, lharc, freeze, and other LZ-Huffman hybrids. I'm working on dabba, which applies a statistical coder to the output of Y coding and which so far appears to be competitive with PPMC, the most effective known compression method. (PPMC's problem is that it's awfully complex to implement and rather slow.) In the meantime, yabba is public-domain. Feel free to use it. Last I heard the GNU project was switching to yabba from compress. ---Dan 4-Nov-91 14:48:12-GMT,897;000000000001 Return-Path: <jloup@chorus.fr> Received: from chorus.chorus.fr by watsun.cc.columbia.edu (5.59/FCB) id AA18341; Mon, 4 Nov 91 09:47:39 EST Received: from nocturne.chorus.fr by chorus.chorus.fr, Mon, 4 Nov 91 15:46:13 +0100 Received: from localhost.chorus.fr by nocturne.chorus.fr, Mon, 4 Nov 91 15:43:13 +0100 Message-Id: <9111041443.AA21073@nocturne.chorus.fr> To: Frank da Cruz <fdc@watsun.cc.columbia.edu> Cc: Jean-loup Gailly <jloup@chorus.fr> Subject: v42bis In-Reply-To: Your message of Wed, 30 Oct 91 14:42:51 -0500. <CMM.0.90.0.688851771.fdc@watsun.cc.columbia.edu> Date: Mon, 04 Nov 91 15:43:12 +0100 From: Jean-loup Gailly <jloup@chorus.fr> > What doyou know about V.42bis? The specification is available by anonymous ftp at bruno.cs.colorado.edu in pub/standards/ccitt/1992/v/v42bis.asc. It is indeed the shrink algorithm used in zip, and derived from LZW. Jean-loup 14-Nov-91 17:57:26-GMT,10517;000000000005 Return-Path: <PK6811S@ACAD.DRAKE.EDU> Received: from DRAKE.EDU by watsun.cc.columbia.edu (5.59/FCB) id AA22990; Thu, 14 Nov 91 12:57:19 EST Received: from DRAKE.BITNET by DRAKE.BITNET (PMDF #12569) id <01GCXSH8PZ280012TQ@DRAKE.BITNET>; Thu, 14 Nov 1991 11:58 CST Date: Thu, 14 Nov 1991 11:58 CST From: PK6811S@ACAD.DRAKE.EDU Subject: Kermit Compression Experiments To: fdc@watsun.cc.columbia.edu Message-Id: <01GCXSH8PZ280012TQ@DRAKE.BITNET> X-Envelope-To: fdc@watsun.cc.columbia.edu X-Vms-To: IN%"fdc@watsun.cc.columbia.edu" One more weekend volunteer reporting in. Let's keep it simple, and let's try it out to see what happens. My experiment uses run-length and 'Huffman' coding. I'm no expert on the differences between Huffman and Shannon-Fano coding, but I understand the principle of using short codes for frequently-occuring characters. I had a go at writing a program on the Mac which reduce the number of transmitted characters significantly. I used four test files: DuTermSource - tokenized zbasic source code (terminal emulator) - has lots of text, with some 8-bit Kingdon200 - 200 dpi 2x2 scan of 8.5 by 14 document Excel.mb - Microsoft Excel program converted by Macbinary - program to one data fork Crashbang - sound file Following are my results, given in 1024 bytes. Transmit sizes include packet headers and blockcheck, with packetlength=1000. Repeat prefixing was allowed in all cases. WITHOUT COMPRESSION ! WITH COMPRESSION ! ---------------------------------------------------- ! Original 8bit 7bit ! 8bit 7bit ! File Transmit Transmit ! Transmit Transmit File ! Size Size Size ! Size Size ---- ! -------- -------- -------- ! -------- -------- DuTermSource ! 255 273 299 ! 210 245 Kingdon200 ! 262 439 521 ! 216 252 Excel.mb ! 729 1010 1202 ! 777 907 Crashbang ! 46 68 90 ! 46 53 I'll summarize my algorithm, then give the details and reasoning. But first, we must understand the concept of a compression window. This is a fixed number of input characters considered together for compression. The window size can be smaller, equal to, or greater than the packet length; for my tests, the window size was 256 with the packet length 1000 Summary of Algorithm: 1. Apply run-length encoding to enough input characters to fill a window. Somewhat like kermit's repeat-prefixing. 2. Convert window to 'Huffman' codes 3. Update Huffman code tables 4. Convert window to printable characters 5. Accumulate enough windows to fill a packet. RUN-LENGTH ENCODING: Any character 'c' repeating from 5 to 255 times is replaced with c:r:n, where c is the character, r is the repeat-flag character, and n is the number of characters replaced. When the repeat-flag character is encountered, it is replaced with r:0. When decoding, when the repeat-flag is found, first check the next character for count. If count is zero, then output the repeat-flag character, else output 'count' number of the previously decoded character. Usually this pass will compress the input data somewhat, and can be very effective against certain kinds of files, or portions of files. It is possible in cases where there are few or no repeating characters, and many occurances of the repeat-flag character that this will actually add to the input size. It might be possible to choose either the encoded or the unencoded version of the window, appropriately flagged, and pass that to step 2, to prevent undue expansion of the file, but I did not do that in my tests. HUFFMAN CODING Huffman coding allows the most frequently occuring characters to be replaced by short (< 8 bits) codes, and less frequently occuring ones to be replaced by long (= or > 8 bits) codes. For example, if hex 00 might be the most frequently occuring character, then every hex 00 should be replaced by the shortest possible code. We first construct a table of codes, from shortest to longest, my table is as follows: binary 1xxxx - the 16 most frequent characters binary 01xxxxx - the next 32 most frequent characters binary 001xxxxxx - the next 64 most frequent characters binary 000xxxxxxxx - the rest To decode, check the next bit, bit1. If bit1 = 1 decode the first group from the next 4 bits else check the next bit, bit2 if bit2 = 1 decode the second group from the next 5 bits else check the next bit, bit3 if bit3 = 1 decode the third group from the next 6 bits else decode the fourth group from the next 8 bits. This produces a stream of bits which should be much shorter than the original input data where every character is represented by 8 bits. UPDATE HUFFMAN CODE TABLES It is important to identify the most frequently occuring characters. Therefore we count the occurances of each character in the window, and sort them from highest-count to lowest. This becomes the table used in encoding the next window. Here is an important concept. The receiver can't decode a window without having the same table that the sender used, therefore the table must be based on the previous window. Hopefully, the distribution of characters doesn't change much from one window to the next. There will be instances in the data where changes occur, but there should be long stretches where they don't. As a side note, I found that the efficiency of the sort algorithm was a limiting factor in speed of processing. A simple bubble sort slowed the compression program terribly, and required changing to a shell-sort which with some tuning gave satisfactory results. CONVERT WINDOW TO PRINTABLE CHARACTERS Huffman coding leaves us with a stream of bits stored in 8-bit bytes. We now must make them kermit-legal to avoid eight-bit and control prefixing. Alas, now is where we give up much of our hard-won compression. If we are using a 7-bit channel, I simply take 6 bits at a time - giving values hex 00 to hex 3f, and add hex 20 to them to make an 8-bit character whose values lie between hex 20 and hex 5f, inclusively, corresponding to printable characters, space through underscore. If we are using an 8-bit channel, I take 7 bits at a time - giving values hex 00 to hex 7f. For values hex 00 to hex 3f, I add hex 20 as above. For values hex 40 to hex 7f, I add hex a0 to place them into the 8-bit versions of the same characters, space through underscore. These conversions give characters which require no prefixing for transmission. MAKE UP A PACKET Now we simply accumulate windows until we have a full packet, append the appropriate packet header and block-check, and write the packets to our output file. Now we encounter a problem of granularity. If our window size is not a factor of our packet size, we will not be able to fill each packet to the maximum allowable size. Consider this example: our packet size is 1000 and our window size is 256. After Huffman coding and printable- conversion, we could produce a series of windows sized 200, 190, 210, 220, 230. The first four windows add up to 820, adding the fifth gives 1050. Our actual average packet size will then be somewhat smaller than that allowable, creating additional overhead. I did not follow up on this problem, but some method which allows windows to span packets may be applied. CONCLUSION AND SUGGESTIONS My experiments show that a reduction of transmitted characters is possible, up to about 50%. The coding needed is on the order of that required for kermit-windows (which makes me think I need to find another name for these compression 'windows'). There is a way to make a further reduction in characters. You will notice that this program doesn't output printable characters in the range hex 60 to hex 7e. Now whenever we have unused character values we should consider how to use them to pass information that aids our compression. Let's say we output a hex 40 (space). We could just as well output a hex 60 "`" - if the receiver understood that it really was a hex 40, cleverly encoded to flag some kind of information. So a space is either hex 40 or hex 60, the flag is either up or down, on or off, zero or one. If we have a series of such flags we can construct a stream of bits out of them; and when we have accumulated enough bits to make a character, we insert that character in the decoded stream. Over a 7-bit channel we are taking our huffman stream in 6-bit chunks, so that we need 6 flags to insert a character. Over an 8-bit channel we need 7 flags. Since half of our input characters, hex 40 to hex 5e, can serve as flags we can insert a character for every 12 or 14 transmitted characters on the average - reducing our transmittal by one-twelfth or one-fourteenth There is also another approach to Huffman tables. Rather than sort characters by their frequency in the window, we can take advantage of the natural assocation of one character to another. For example, in English text, the letter 'd' may be the most frequent character to follow the letter 'e', and the letter 'u' is certainly the most frequent character to follow the letter 'q'. If 'd' follows 'e' in the input, it should be replaced by the shortest possible code, regardless of how many 'd's appear in the window. And 'u' following 'q' should be replaced by the same shortest possible code. To do this, we keep a table of counts of characters occuring after one another, and sort each character's 'follower table' separately. The receiver, having constructed the same tables, knows that the same code following an 'e', means 'd', but after a 'q', means 'u'. Unfortunately this implies a large table, at least 256 by 256 bytes, plus lots of sort time. I did not attempt this experiment, but someone may be able to find an efficient implementation. Hope this is helpful. A lot of us would like to see shorter file transmission times. Paul D. Kline Associate Director of Adminstrative Computing Drake University Des Moines, IA 50311 (515)271-2166 internet pk6811s@acad.drake.edu 15-Nov-91 20:46:42-GMT,714;000000000005 Return-Path: <mtdcc!ptm@mtune.att.com> Received: from att.att.com by watsun.cc.columbia.edu (5.59/FCB) id AA10254; Fri, 15 Nov 91 15:46:29 EST Message-Id: <9111152046.AA10254@watsun.cc.columbia.edu> From: mtdcc!ptm@mtune.att.com Date: Fri, 15 Nov 91 15:42 EST To: fdc@watsun.cc.columbia.edu Frank - (from Peter Mauzey) November 15, 1991 A little more about V.42bis. In addition to UNISYS claiming a relevant patent, British Telecom claims the algorithm was the subject of a UK patent application GB 8815978 filed 5th July 1988. Our representative to the standards organizations says that companies that use V.42bis will have to pay an amount that is separately negotiated with each company. Bye for now. 16-Nov-91 20:56:15-GMT,3212;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA22451; Sat, 16 Nov 91 15:56:05 EST Date: Sat, 16 Nov 91 15:56:05 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: PK6811S@ACAD.DRAKE.EDU Subject: Re: Kermit Compression Experiments In-Reply-To: Your message of Thu, 14 Nov 1991 11:58 CST Message-Id: <CMM.0.90.0.690324965.fdc@watsun.cc.columbia.edu> Paul, many thanks for your message! It's a good idea, too. I had always thought of Huffman encoding as a two-pass process if we were to develop an ideal key for a particular file, but your idea (let me see if I understand it right) is to read the first chunk ("window"), get the byte frequencies and compute the key, send the first chunk uncompressed, then use the first chunk's key to encode and send with the second chunk, etc. A refinement on this might be: read a chunk, get the frequencies, construct the key, send the key with its own chunk. This assumes we can keep a whole chunk in memory. But in most cases we do that anyway -- most Kermit programs read data from the input file into a buffer, a fairly long piece at a time -- so now we just make a pass over the input buffer to compute the frequencies before compressing the data putting it in the packet. How do you encode and transmit the key? I assume the key would be 256 bytes long, before Kermit encoding. Later, in an example, you mention a window size of 256. This means the key for a window is as long as the window itself. Obviously we need either a compact way of transmitting the key, or a much bigger window size, or both. Tricky business, because we have memory and addressing limitations to cope with on small machines -- so we'd need to have two compression-capable Kermit programs negotiation the compression chunk size. If it is too small, compression won't pay off. I agree there is no reason why compression chunks and packet boundaries need to have anything to do with each other. We just need a way of distinguishing the key from the compressed data in the packet -- easy to do, using either length fields or flagging via unused characters as you suggested. Another refinement would be to make better use of the GL and GR character code space by "tighter packing", using, say, base 94 (or 188) rather than 64 (or 128). I'm not sure that second-or-higher-order analysis of the data (for pairs, triplets, etc) would be worthwhile -- as you point out, we'd need at least a 64K byte (or word!) table, lots of sorting, etc. Maybe in 10 years, when CPU speed and memory are no longer the bottlenecks. More detailed analysis and testing will tell (higher-order analysis is comparable to the string-matching approach used by LZW, V.24bis, and friends -- all of which are patented). Anyway, the final results of your tests are definitely better than today's Kermit, achieving up to a 100% savings in transmission time -- nothing to sneeze at. Your method is simple, easy to understand and program, relatively compact in memory requirements, and best of all it is clear of the kind of legal hassles impinging on all the fancier LZW-class (string-based) algorithms. It's very much worth considering, and much appreciated. Thanks! - Frank 21-Nov-91 21:48:30-GMT,900;000000000001 Return-Path: <mtdcc!ptm@mtune.att.com> Received: from att.att.com by watsun.cc.columbia.edu (5.59/FCB) id AA06760; Thu, 21 Nov 91 16:48:24 EST Message-Id: <9111212148.AA06760@watsun.cc.columbia.edu> From: mtdcc!ptm@mtune.att.com Date: Thu, 21 Nov 91 16:31 EST To: fdc@watsun.cc.columbia.edu Frank - (from Peter Mauzey) November 21, 1991 Regarding file compression, here's a quote: "Using the 'implode' public domain compression algorithm that is optimized for speed rather than compression, we were able to achieve speed-ups ranging from 18% to 85% on the transfer of a variety of files, varying from 70 kbytes to 1.1 Mbytes." That's from "Software Technology for Wireless Mobile Computing" by Daniel Duchamp, Steven K. Feiner, and Gerald Q. Maguire, Jr., in the November 1991 IEEE Network magazine (ISSN 0890-8044). Page 17. All three authors are at Columbia, Computer Science Department. 22-Nov-91 1:34:03-GMT,7604;000000000001 Return-Path: <PK6811S@ACAD.DRAKE.EDU> Received: from DRAKE.EDU by watsun.cc.columbia.edu (5.59/FCB) id AA09431; Thu, 21 Nov 91 20:33:52 EST Received: from DRAKE.BITNET by DRAKE.BITNET (PMDF #12569) id <01GD7YJJ66N4000625@DRAKE.BITNET>; Thu, 21 Nov 1991 18:40 CST Date: Thu, 21 Nov 1991 18:40 CST From: PK6811S@ACAD.DRAKE.EDU Subject: Re: Kermit Compression Experiments To: fdc@watsun.cc.columbia.edu Message-Id: <01GD7YJJ66N4000625@DRAKE.BITNET> X-Envelope-To: fdc@watsun.cc.columbia.edu X-Vms-To: BITNET%"fdc@watsun.cc.columbia.edu" Frank, you're welcome. With regard to transmitting the key - I don't. I start with a default key, then each chunk is encoded using the key calculated from the previous chunk. Here's an attempt at diagramming: For the sender: chunk1 + key0 -> encode1 chunk1 -> key1 chunk2 + key1 -> encode2 chunk2 -> key2 chunk3 + key2 -> encode3 chunk3 -> key3 . . . . . . . . . encode1 thru encodeX are blocked into packets and transmitted. For the receiver: encode1 + key0 -> chunk1 chunk1 -> key1 encode2 + key1 -> chunk2 chunk2 -> key2 encode3 + key2 -> chunk3 chunk3 -> key3 . . . . . . . . . It's obvious that each key is non-optimal for any given chunk of data, because it is derived from the previous chunk, however, my Huffman codes are not so tight as to make much difference. The first 16 characters are all coded in 5 bits, the next 32 in 7 bits, etc., so if any character is one or two positions off - there won't be much loss of efficiency. This works well when there are long stretches of homogeneous data. When there is a change in type, like in the resource fork of a Macintosh program with pics, sounds, text strings, and executable code - there will be inefficient coding for the first chunk of each different stretch, then the key will adapt and be correct for subsequent chunks. Now, if you were to compute an optimal key for a chunk, by counting the frequencies therein, you would need to send the key. You wouldn't need to send the last group of characters (144 count), so you would only be sending the first 112 characters. Or you might only send the changes in the key... Char(i)+Newposition(i). ----------------------------------------------------------------------- After sending you those results, I made two tweaks to the program. First, I changed the key calculation routine slightly, to allow it to ride over rough parts in the data without disrupting the keys completely for the next chunk. Instead of zeroing out the count for each character, I simply divide the previous count by 2 (shift right 1). This allows some 'memory' of previous chunks and prevents a frequently- recurring character from going to the end of the table just because it was left out for a short stretch. Second, I divided the fourth group of 144 characters into two groups, so that my Huffman codes now look as follows: binary 1xxxx - 1st 16 characters binary 01xxxxx - next 32 characters binary 001xxxxxx - next 64 characters binary 0001xxxxxx - next 64 characters binary 0000xxxxxxx - next 80 characters The fourth group is now encoded in 10 bits instead of 11 as before. With these two changes, I ran the compression against some of the previous files: (all are 8bit) before after File tweak tweak ---- ------ ------ DuTermSource 210 207 Kingdon200 216 214 Crashbang 46 45 I tried the program against the text compression corpus available at fsa.cpsc.ucalgary.ca in /pub/text.compression.corpus. This time I monitored the total size of chunks after Huffman encoding as well as the total size after kermit-legal expansion: Huffman- Kermit- Original Encoded Transmit File File Size Size Size ---- ------- ------ -------- bib 80375 92806 111261 book1 517426 601001 768771 bbok2 416858 483608 610856 geo 86897 100230 102400 news 265430 308191 377109 obj1 15486 16861 21504 obj2 195992 226658 246814 paper1 36699 41581 53161 paper2 55451 63852 82199 pic 79312 91365 513216 progc 28243 31960 39611 progl 46402 53255 71646 progp 32737 37227 49379 trans 67333 77783 93695 ----- ------- ------- ------- total 1924641 2226378 3141622 Some comments: - First, no file transmit size was larger than it's original size, though 'geo' was pretty close. This bodes well for applying the algorithm in practice. - Second, the expansion from Huffman to transmit size is right at 16% for all files, which is due to the data-independent algorithm. - Third, the amazing result for the file 'pic' can be explained by the existence of long strings of repeated characters which were reduced through run-length encoding. - Fourth, these results can be compared with those given by Jean-loup Gailly for zip, zoo, lharc, and tar - whose archive totals ranged from 1097258 to 1201148. They are much smaller, but he did not give the transmit sizes for those archives, which is the real comparison. ---------------------------------------------------------------------- As to the question of long input buffers and such, I tried chunks as large as 4k, but the compression varied from slightly worse to slightly better, on the order of 1% difference from the 256-byte chunks. Reason this way: either a character occurs frequently or it doesn't. If it does, then most chunks, whatever size, will show that. And, as before, my Huffman codes are not so tight that a character's frequency can't move around some without affecting efficiency. There is also the characteristic of 'locality', which means that different stretches within the data will have somewhat different profiles. A good compressor will adapt as it proceeds. In my case adaptivity is determined by chunk size. The smaller the chunk, the more adaptive. This is an advantage above some 'noise' level, below which, the counts become too low to be meaningful - they no longer predict the actual frequency of the characters. Above that level, increasing the chunk size doesn't improve the efficiency, and at some higher level the efficiency decreases as we lose adaptivity. We also need to conceptually separate our input buffers, compression chunks, and packets. They can be optimized independently. The input routine reads large buffers to reduce i/o delays. The compression routine reads small chunks out of the input buffer for processing, and puts them somewhere. The packetizing routine accumulates enough chunks to make an efficient packet. The windowing routine manages packets within it's window space. The leg bone is connected to the thigh bone. ------------------------------------------------------------------------ I like the idea of selecting the unencoded chunk or the encoded one, whichever is shorter. I use a terminator character, hex 7e, to separate the chunks within packets. If we used two terminators, 7e and 7d, they could indicate whether the chunk was encoded or not. This would be useful for data that was extremely 'noisy' across all 256 8bit codes. ------------------------------------------------------------------------ Sorry about this late response, I was attending an IA-FRS User Conference in Orlando this week. 4-Dec-91 21:15:59-GMT,1516;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA26097; Wed, 4 Dec 91 16:14:39 EST Date: Wed, 4 Dec 91 16:14:38 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: Dick Elnicki <DICKE%NERVM@cuvmb.cc.columbia.edu> Cc: Network Mailer <MAILER@cuvmb.cc.columbia.edu>, Joe Doupnik <JRD@cc.usu.edu> Subject: Re: File Compression In-Reply-To: Your message of Tue, 03 Dec 91 15:59:32 EST Message-Id: <CMM.0.90.0.691881278.fdc@watsun.cc.columbia.edu> > The results clearly suggest anyone serious about using compression > to save file Xmit times should pre-compress rather than do it on the > fly during the Xmit. The Kermit compression procedures would have > to be at least as good as PKZIP's result to make the inclusion worth > the trouble. Or, am I missing something... > Yup -- Kermit runs on hundreds of different kinds of computers. You can't rely on any two of them to have the same compression methods. Even when they do, they might have different file formats, or use different character sets. So the sending Kermit has to convert data to canonical form first, then compress. So you can't separate compression from Kermit. I'm beginning to think we can't really have a super-effective method built into Kermit, however, because of complexity, size, execution overhead, and patent/legal issues. It might well turn out that we'll settle for something less effective than PKZIP and friends, but something simpler. It'll be much better than what we have now. - Frank 10-Feb-92 19:09:26-GMT,7097;000000000015 Return-Path: <getunx.info.com!thobe@avatar.com> Received: from elroy.jpl.nasa.gov by watsun.cc.columbia.edu (5.59/FCB) id AA02183; Mon, 10 Feb 92 14:09:18 EST Received: from avatar.com by elroy.Jpl.Nasa.Gov (4.1/SMI-4.1+DXR) id AA20906; Mon, 10 Feb 92 11:09:11 PST Received: from getunx.info.com by avatar.com id aa12417; Mon, 10 Feb 92 10:22:00 PST Received: by getunx.info.com id AA00280 (5.65c/IDA-1.4.4 for fdc@watsun.cc.columbia.edu); Mon, 10 Feb 1992 09:06:13 -0800 Date: Mon, 10 Feb 1992 09:06:13 -0800 From: "Glenn E. Thobe" <thobe@getunx.info.com> Message-Id: <199202101706.AA00280@getunx.info.com> To: fdc@watsun.cc.columbia.edu Subject: Encoding speeds transfer of compressed files Frank- I have some experimental results in support of my theory that Kermit's performance in transferring compressed files can be greatly improved by pre- and post-processing of the files. I have designed a special encoding scheme and programmed encoders and decoders to implement it. ------------------------------------------------------------------ Preprocessors Speed Kermit File Transfer by Glenn E. Thobe February 10, 1992 INTRODUCTION I have written a set of four preprocessors for Kermit which greatly improve it's efficiency when transferring compressed files. Kermit's control prefixing algorithm, although efficient for text files, is expensive for compressed files. Compressed files are fundamentally different from other binary files in the way in which the data are statistically distributed: roughly speaking the bytes of a compressed file are uncorrelated and uniformly distributed over the 256 8-bit codes are uniformly distributed over. The new routines eliminate Kermit's control prefixing overhead while utilizing the maximum available code space of 190 or 95 characters. This minimizes the number of characters which get transferred, thereby shortening the time and reducing the cost. They filters are: k7enc - encode for 7-bit channels k7dec - decode for 7-bit channels k8enc - encode for 8-bit channels k8dec - decode for 8-bit channels For testing purposes, additional filters were written which emulate Kermit's prefixing algorithms: getpkt7 - Kermit encoding (control + 8th-bit prefixing) getpkt8 - Kermit encoding (control prefixing) In practice, one would apply the following processes to transfer a compressed file: k8enc (encode file on sending computer) kermit (transfer file) k8dec (decode file on receiving computer) Here we simulate the transfer and predict the performance of the above process by performing the following steps on a single computer: k8enc (encode file) getpkt8 (apply control prefixing of non-printable charaters) k8dec (decode file) We can view k8enc and Kermit's encoding as alternative ways of dealing with the limited 190 (or 95) letter code space, and compare the performance of the two. We can also compare the performance of k7enc, k8enc, and other popular binary-to-printable filters. RESULTS Following are simulation results for a typical file using k8enc. file=DNLNK/dnlnk.tar.Z raw file transfer: in = 57725, out = 75436, expansion = 30.68%, efficiency = 76.52% encoding alone: in = 57725, out = 62118, expansion = 7.61%, efficiency = 92.92% encoded file transfer alone: in = 62118, out = 63072, expansion = 1.53%, efficiency = 98.48% combined coding and transfer: in = 57725, out = 63072, expansion = 9.26%, efficiency = 91.52% Note that k8enc encoding alone significantly outperforms Kermit's getpkt8 encoding. Also, that k8enc preprocessed files enjoy greatly improved transfer performance compared the raw compressed file. Here, for the same file, are the results with different preprocessors: ============================================================ Combined Efficiencies for Kermit used with various preprocessors ============================================================ file name = DNLNK/dnlnk.tar.Z, file size = 57725 bytes 7-bit channel (eighth bit prefixing): cat 56.77% uuencode x 69.2% btoa 75.96% k7enc 77.36% 8-bit channel: cat 76.52% k8enc 91.52% ============================================================ Note that k7enc used with a 7-bit channel actually outperforms raw file transfer (cat) over an 8-bit channel. It also outperforms both uuencode (where each three bytes are spread over four) and atob (where each four bytes are spread over five). K8enc is obviously the big winner here. PROTOCOL ENHANCEMENT PROPOSAL While the improvements are impressive, the user is still left with the clumsiness of manually running the pre- and postprocessors, and keeping track of the file names. It would be much more convenient to incorporate the encoding and decoding processes into the Kermit protocol. A compressed-file mode should be introduced. The user would select this mode when transferring LZ-compressed, zipped, Gif, and other compressed binary data types. The implementation would be especially easy, since in this mode most of the intricate prefixing, RLE, and locking shift preprocessing involved in packetizing the data can be switched off and built-in k7enc or k8enc processing switched on in its place. This protocol enhancement should be viewed as separate from the proposed built-in file compression, since the latter is of no great benefit when the file is already compressed. Built-in compression should be used for text files, executables, etc., but not for already compressed files. A slightly surprising result of this study, is that the cost of Kermit's design decision to use only a 188 character subset (190 extended ascii printables minus two control prefix characters) of the 256 data characters, can have so little adverse impact on efficiency. Typically the cost of mapping white noise data over a 256 character alphabet into a 190 character subset (8-bit channel) is between 7 and 8 percent as shown experimentally by this study. That is, the files are expanded by only 7 to 8 percent. This would seem to be a fair trade off for the ability to operate over channels where full 8-bit transparency cannot be guaranteed. CONCLUSION The traditional Kermit prefixing algorithm for mapping all file characters into the ASCII and extended ASCII printable codes is optimized for text files but performs poorly for compressed files. File transfer performance with compressed files can be improved greatly by complementary pre- and post processing of the files using the filter programs presented here. These filters significantly outperform the traditional uuencode and btoa, optimally exploiting the available channel capacity. These filters should be introduced into the Kermit protocol in support of a special mode for optimizing the transfer of compressed files. ----------------------------------------------------------------------- -Glenn E. Thobe <thobe@getunx.info.com> 29-Jan-92 0:24:32-GMT,4031;000000000015 Return-Path: <sturdeva> Received: by watsun.cc.columbia.edu (5.59/FCB) id AA19299; Tue, 28 Jan 92 19:24:29 EST Date: Tue, 28 Jan 92 19:24:28 EST From: James Sturdevant <sturdeva@watsun.cc.columbia.edu> To: fdc Cc: jrd Subject: Kermit Data Compression Message-Id: <CMM.0.90.0.696644668.sturdeva@watsun.cc.columbia.edu> Greetings, In my home directory, you will find my compression routines in both C and 8086 assembler. Below, you will find my attempt to explain them. (The routines are better than the explaination.) The compression code challenge: I have added my code to xx??co.c and added the module ckccmp.c. Changes in the XX code have [jrs] comments to flag it. I have done my best to speed it up. Any other code jockies are welcome to prove themselves better. Unliketechniques, the compression does not change the bit width of any non-compressed characters. (Is that enough difference to patent it?) It flags compressed data with the grave character (`-0x60). The compression takes place after the conversion of the character (or run of like characters) for the packet. For ease of description, I will call these tokens. The basis of this algorithm came from an article in Computer Language (V8#5, May 91) on a technique called Sliding Windows. It's supposed to be a variation on LZW. My variations are the position encoding and fixed sizes for window and data. The compression works like this: Initialize a compression buffer and window dictionary. Step: Create a token. Add the token to the end of the existing buffer. Search for a match on the buffer in a "window" of bytes processed previously. If found, get another token. When not found, check the previous match. If it is greater than 4 characters (the length of a compression token), output the string as a compressed token, start with the new token in the buffer and search again. If the buffer without the current token is less than 5 characters, output the first token normally and try to find a match starting at the next token. In both cases, the text of the output (whether token or compressed string) is added to the window. This is done only when data is actually output so the search doesn't find data which is pending. The compressed data consists of the flag character a two character (base 94) position, and a one character (base 94 offset 4) length. When the file is complete, flush the remaining characters from the compression buffer. To speed the search of the "window", I maintain a linked list for each character with a pointer to the beginning and end. When a character is removed (to add a new character in its place) the starting link is moved forward. The ending link is moved with each character. Searches are done by starting at the first character, comparing the string starting in that location with the buffer. On success, the search stops (why continue?). On failure, the search follows the link to the next matching character, and compares again. On the receiving side, the decompression routine read characters and processes them until it sees the ` character at the beginning of a token (i.e. it is not quoted by #, or a length field for a repeat). As the characters are processed, they are added to the window. It is not necessary for the receiving side to maintain a list of characters, but it doesn't hurt. When the ` character is seen, the next three characters are used to compute the location and length of the real string. The current data buffer is saved and data is consumed from the calculated string. When the string is exausted, processing continues on the original data. The initial design criteria included low memory usage (24K+slop of data), consistancy with the current Kermit packet style, and no increase on data which cannot compress. There is an exception for ` which must be quoted if compression is turned on. I chose ` because it is in the allowable range for flags and is not likely to appear in much text, c programs or assembler programs (that I am aware of). 13-Feb-92 12:11:52-GMT,27174;000000000005 Return-Path: <getunx.info.com!thobe@avatar.com> Received: from elroy.jpl.nasa.gov by watsun.cc.columbia.edu (5.59/FCB) id AA14153; Thu, 13 Feb 92 07:11:18 EST Received: from avatar.com by elroy.Jpl.Nasa.Gov (4.1/SMI-4.1+DXR) id AA14382; Thu, 13 Feb 92 03:13:52 PST Received: from getunx.info.com by avatar.com id aa01247; Thu, 13 Feb 92 1:39:04 PST Received: by getunx.info.com id AA01901 (5.65c/IDA-1.4.4 for fdc@watsun.cc.columbia.edu); Thu, 13 Feb 1992 01:22:25 -0800 Date: Thu, 13 Feb 1992 01:22:25 -0800 From: "Glenn E. Thobe" <thobe@getunx.info.com> Message-Id: <199202130922.AA01901@getunx.info.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Encoding speeds transfer of compressed files You wrote: >Thanks, Glenn! It'll take me a while to digest this. As you probably know, >yours is only one entry of many in the Great Compression Contest (you saw >kermit/e/compress.txt, right?). Anyway, we really do want to do something >like what you have suggested, but we're all pretty swamped right now and will >have to come back to this later. Meanwhile, do you want to send in your >filters? Thanks again! - Frank Frank- Here is the whole k7- and k8-encoding package. I revised the document I sent you the other day (it should be clearer) and included it. You should be able to build and test everything in minutes on a Unix system. Please let me know if any problems show up. The technique is not a compression scheme as such, it is more of an anti-expansion scheme for already compressed files. Such a mode should be incorporated whether or not a built in compression is adopted. Nearly all the files I transfer are compressed already, so I feel the need personally. I can send you my filters which do Kermit-style prefixing and the comparison shell scripts if you want to repeat my efficiency tests, but I presume you already have a way of doing such things. I just ordered bin/e/compress.txt. Thanks for pointing it out. -Glenn <thobe@getunx.info.com> #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of shell archive." # Contents: READ.ME Makefile filters.doc k7enc.c k7dec.c k8enc.c # k8dec.c k7verify.ksh k8verify.ksh # Wrapped by thobe@getunx on Thu Feb 13 00:56:42 1992 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'READ.ME' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'READ.ME'\" else echo shar: Extracting \"'READ.ME'\" $2058 characters$ sed "s/^X//" >'READ.ME' <<'END_OF_FILE' X/* READ ME */ XThis is the first release of two pairs of complementary filters for Xfor pre- and post-processing of COMPRESSED FILES in order to speed Xup the transfer of those files via Kermit. One pair is for 7 bit Xdata paths and the other is for 8. X XThe reason these filters are needed is that Kermit allows only the XASCII or 8-bit extended ASCII codes to be to be transferred unchanged. XKermit prefixes control codes and replaces them with printable codes, Xwhich increases the amount of data to be transferred. This is efficient Xfor text files where control codes are infrequent but it is inefficient Xfor compressed files. X XThe code has been compiled and tested on an AT&T 7300/UnixPC/3b1 Xcomputer with both gcc and cc. A make file and a document describing Xthe rationale are included. After you build them, test by running Xk7verify.ksh and k8verify.ksh under either the Bourne or Korn shell. X X /bin/sh k7verify.ksh * X /bin/sh k8verify.ksh * X XInstructions on use are found in the Xheaders of the four C source files. Sorry, no man page. X XIt is hoped that these routines will be found helpful in their own Xright, but most of all it is hoped that they will find their way Xinto the Kermit protocol as a special mode for efficiently transferring Xcompressed files. They may also be combined with future built-in file Xcompression. This mode should be adopted whether or not built in file Xcompression is added. X XThe design and coding of these filters is a completely original Xcreation of the author. Note the copyright notice. It is essentially XColumbia University's copyright, which should in no way restrict the XKermit project from making full use of the code or algorithm. X X Copyright (C) 1992, Glenn E. Thobe, Canoga Park, California, USA. X All rights reserved. Permission is granted to any individual X or institution to use, copy, or redistribute this software so X long as it is not sold for profit, provided this copyright X notice is retained. X XEnjoy. X X-Glenn Thobe X<thobe@getunx.info.com> XWed Feb 12 23:27:51 PST 1992 END_OF_FILE if test 2058 -ne `wc -c <'READ.ME'`; then echo shar: \"'READ.ME'\" unpacked with wrong size! fi # end of 'READ.ME' fi if test -f 'Makefile' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'Makefile'\" else echo shar: Extracting \"'Makefile'\" $721 characters$ sed "s/^X//" >'Makefile' <<'END_OF_FILE' X# Makefile for Kermit pre- and postfilters X# Author: Glenn E. Thobe <thobe@getunx.info.com> X# Wed Feb 12 00:23:44 PST 1992 X XCC = gcc XCFLAGS = -O X#LDFLAGS = -s XLDFLAGS = -s -shlib #shared library option for gcc on 3b1/UnixPC X XOBJS= k7enc.o k7dec.o k8enc.o k8dec.o XPROGS= k7enc k7dec k8enc k8dec XSOURCES= k7enc.c k7dec.c k8enc.c k8dec.c X Xall: $(PROGS) X Xk7enc: k7enc.o X $(CC) -o $@ k7enc.o $(LDFLAGS) X Xk7dec: k7dec.o X $(CC) -o $@ k7dec.o $(LDFLAGS) X Xk8enc: k8enc.o X $(CC) -o $@ k8enc.o $(LDFLAGS) X Xk8dec: k8dec.o X $(CC) -o $@ k8dec.o $(LDFLAGS) X Xclean: X rm core $(PROGS) $(OBJS) X Xshar: READ.ME Makefile filters.doc $(SOURCES) k[78]verify.ksh X shar READ.ME Makefile filters.doc $(SOURCES) k[78]verify.ksh \ X >filters.shar END_OF_FILE if test 721 -ne `wc -c <'Makefile'`; then echo shar: \"'Makefile'\" unpacked with wrong size! fi # end of 'Makefile' fi if test -f 'filters.doc' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'filters.doc'\" else echo shar: Extracting \"'filters.doc'\" $6957 characters$ sed "s/^X//" >'filters.doc' <<'END_OF_FILE' X X Preprocessors Speed Kermit File Transfer X X by Glenn E. Thobe X X rev. February 12, 1992 X X XINTRODUCTION X X XI have written a set of four pre- and post-processors which greatly Ximprove Kermit's efficiency when transferring compressed files. XKermit's control prefixing algorithm, although efficient for text Xfiles, is expensive for compressed files. X XCompressed files are fundamentally different from other binary files Xin the way in which the data are statistically distributed. Roughly Xspeaking, the bytes of a compressed file are uncorrelated and uniformly Xdistributed over the 256 8-bit codes. Text files, for which Kermit's Xcontrol prefixing was originally designed, have few characters which Xare not in the printable range. For them, control prefixing adds Xlittle extra cost. By contrast, compressed files contain many Xunprintable characters for which prefixing is inefficient. Refinements, Xsuch as Kermit's run length and shift lock encoding, are ineffective Xfor compressed files. X XThe preprocessors work by preempting Kermit's control prefixing overhead Xwhich was designed to optimize the transmission of text files. The entire Xavailable code space of 190 or 95 (7-bits) printable characters is used in a Xstatistically uniform way, and none of the control characters are used at Xall. All forms of encoding will expand compressed files. Our Xpreprocessors expand them as little as possible, certainly much less than XKermit's control prefixing strategy. The test measurements bear out this Xcontention. X XThe filters are: X X k7enc - encode for 7-bit channels X k7dec - decode for 7-bit channels X k8enc - encode for 8-bit channels X k8dec - decode for 8-bit channels X XIn actual use these filters are employed in the following sequence to Xtransfer a compressed file in the least amount of time: X X * Using k8enc, encode the file on sending computer. X * Transfer the encoded file using Kermit. X * Using k8dec, decode file on receiving computer. X XFor testing purposes, additional filters were written which implement XKermit's prefixing algorithms: X X getpkt7 - Kermit encoding (control + 8th-bit prefixing) X getpkt8 - Kermit encoding (control prefixing) X XIn order to experimentally evaluate our strategy, we simulated Kermit Xtransfers using the filters k7enc, k8enc, getpkt7, and getpkt8 in Xvarious combinations. We compared the efficiencies of transferring Xfiles with or without the pre- and post-processing. Efficiency is Xdefined as the ratio of the number of bytes in the original file to Xthe number of bytes transmitted over the line (i.e. fully encoded). XThe wc utility gives the number of bytes emerging from the preceding Xfilters in the pipeline: X X getpkt8 <file.Z | wc #raw file transmitted by kermit X k8enc <file.Z | wc #k8encoded file X k8enc <file.Z | getpkt8 | wc #k8encoded file transmitted by Kermit X XWe can view k8enc and Kermit's encoding as alternative ways of dealing Xwith the limited 190 (or 95) letter code space, and compare the Xperformance of the two. We can also compare the performance of k7enc Xand k8enc with other popular binary-to-printable filters. X X XRESULTS X X XFollowing are simulation results for a typical file using k8enc. X X file=DNLNK/dnlnk.tar.Z X raw file transfer: X in = 57725, out = 75436, expansion = 30.68%, efficiency = 76.52% X encoding alone: X in = 57725, out = 62118, expansion = 7.61%, efficiency = 92.92% X encoded file transfer alone: X in = 62118, out = 63072, expansion = 1.53%, efficiency = 98.48% X combined coding and transfer: X in = 57725, out = 63072, expansion = 9.26%, efficiency = 91.52% X XNote that k8enc encoding alone significantly outperforms Kermit's Xgetpkt8 encoding. Also, that k8enc preprocessed files enjoy greatly Ximproved transfer performance compared to the raw compressed file. X XHere, for the same file, are the results with different preprocessors: X X ============================================================ X Combined Efficiencies for Kermit used with various preprocessors X ============================================================ X file name = DNLNK/dnlnk.tar.Z, file size = 57725 bytes X 7-bit channel (eighth bit prefixing): X no encoding 56.77% X uuencode 69.2% X btoa 75.96% X k7enc 77.36% X 8-bit channel: X no encoding 76.52% X k8enc 91.52% X ============================================================ X XNote that k7enc used with a 7-bit channel actually outperforms raw Xfile transfer over an 8-bit channel. It also outperforms both Xuuencode (where each three bytes are spread over four) and atob X(where each four bytes are spread over five). K8enc is obviously Xthe big winner here. X X XPROTOCOL ENHANCEMENT PROPOSAL X X XWhile the improvements are impressive, the user is still left with Xthe clumsiness of manually running the pre- and postprocessors, and Xkeeping track of the file names. It would be much more convenient Xto incorporate the encoding and decoding processes into the Kermit Xprotocol. X XA compressed-file mode should be introduced. The user would select Xthis mode when transferring LZ-compressed, zipped, Gif, and other Xcompressed binary data types. The implementation would be especially Xeasy, since in this mode most of the intricate prefixing, RLE, and Xlocking shift preprocessing involved in packetizing the data can be Xswitched off and built-in k7enc or k8enc processing switched on in Xits place. X XThis protocol enhancement should be viewed as separate from the proposed Xbuilt-in file compression, since the latter is of no great benefit when Xthe file is already compressed. Built-in compression should be used Xfor text files, executables, etc., but not for already compressed files. X XThe good news here is that restricting the code space to the 190 extended Xascii printables can have so little adverse impact on efficiency. When Xthe data are encoded using our k8enc filter, a compressed file is expanded Xtypically by only 7 to 8 percent, as shown experimentally by our study. X X XCONCLUSION X X XThe traditional Kermit prefixing algorithm for mapping all file characters Xinto the ASCII and extended ASCII printable codes is optimized for text Xfiles but performs poorly for compressed files. File transfer performance Xwith compressed files can be improved greatly by complementary pre- and Xpost processing of the files using the filter programs presented here. XThese filters significantly outperform not only Kermit's built-in Xcontrol prefixing, but also the traditional uuencode and btoa. This Xis accomplished by optimally exploiting the available channel capacity. X XThese filters should be introduced into the Kermit protocol in support of Xa special mode for optimizing the transfer of compressed files. X X----------------------------------------------------------------------- X X-Glenn E. Thobe X<thobe@getunx.info.com> END_OF_FILE if test 6957 -ne `wc -c <'filters.doc'`; then echo shar: \"'filters.doc'\" unpacked with wrong size! fi # end of 'filters.doc' fi if test -f 'k7enc.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k7enc.c'\" else echo shar: Extracting \"'k7enc.c'\" $2881 characters$ sed "s/^X//" >'k7enc.c' <<'END_OF_FILE' X/* k7enc.c - encode for Kermit's 7-bit window (0x20-0x7e). XSpeeds Kermit transfer of compressed files for 7-bit channels, use Xwith k7dec. For 8-bit channels with no parity use k8enc and k8dec. X XAuthor: Glenn E. Thobe <thobe@getunx.info.com> X XCopyright (C) 1992, Glenn E. Thobe, Canoga Park, California, USA. XAll rights reserved. Permission is granted to any individual Xor institution to use, copy, or redistribute this software so Xlong as it is not sold for profit, provided this copyright Xnotice is retained. X XK7enc prepares a compressed file to be transferred by Kermit in Xbinary mode and decoded by k7dec. As a filter, its typical Xusage with the compressed file.Z might be: X X k7enc < file.Z | kermit -is - (sender) X kermit -ik | k7dec > file.Z (receiver) X XWith a text file the following is also useful: X X compress < file.txt | k7enc | kermit -is - (sender) X kermit -ik | k7dec | uncompress > file.txt (receiver) X XK7enc and k7dec feature a special non-linear mapping which avoids Xall ASCII control characters and makes optimal use of the Kermit's Xentire 7-bit ASCII-printable code space. Thus Kermit has no need Xof control or 8th-bit prefixing (except for the control prefix itself Xwhich is doubled). X*/ X X/* REVISION HISTORY: XSat Feb 8 22:15:47 PST 1992 initial release -g.t. X*/ X X/* call coroutine */ X#define A_CALL_B(x,Ax) do{An=x;goto B;Ax:;}while(0) /* K&R C version */ X#define B_CALL_A(x,Bx) do{Bn=x;goto A;Bx:;}while(0) /* K&R C version */ X X#define min(x,y) ((x)>(y))?(y):(x) X#define MAP(x) ((x)+0x20) /* map to ASCII printables */ X X#include <stdio.h> X Xvoid Xmain() X{ X int An, Bn; X int c, m, h, w, inbuf, u, eof; X static int rmask[]={0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff}; X X /* initialize coroutine linkage */ X An = 0; X Bn = 0; X goto A; X X /* input 1-8 bits as per request */ XA0: X h = 0; X eof = 0; X inbuf = 0; X while(1) X { X A_CALL_B(1,A1); X if (h < w) X { X u = getchar(); X eof = feof(stdin); X inbuf <<= 8; X if(!eof) inbuf += u; X h += 8; X } X c = (inbuf >> (h - w)) & rmask[w]; X h -= w; X } X X /* assemble and output encoded bytes */ XB0: X w = 8; X B_CALL_A(1,B1); X while(1) X { X if (c < 0x88) X { X m = c; X putchar(MAP(m >> 2)); X if (eof && h < 6) exit(0); X w = 6; X B_CALL_A(2,B2); X c += (m & 0x03) << 6; X } X else if (c < 0xfe) X { X m = c; X putchar(MAP((m >> 1) - 0x22)); X if (eof && h < 7) exit(0); X w = 7; X B_CALL_A(3,B3); X c += (m & 0x01) << 7; X } X else X { X putchar(MAP(c - 0xa1)); X if (eof && h < 8) exit(0); X w = 8; X B_CALL_A(4,B4); X } X } X exit(0); X X/* more coroutine stuff */ X XA: X switch(An) X { X case 0: goto A0; X case 1: goto A1; X } X XB: X switch(Bn) X { X case 0: goto B0; X case 1: goto B1; X case 2: goto B2; X case 3: goto B3; X case 4: goto B4; X } X} END_OF_FILE if test 2881 -ne `wc -c <'k7enc.c'`; then echo shar: \"'k7enc.c'\" unpacked with wrong size! fi # end of 'k7enc.c' fi if test -f 'k7dec.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k7dec.c'\" else echo shar: Extracting \"'k7dec.c'\" $2585 characters$ sed "s/^X//" >'k7dec.c' <<'END_OF_FILE' X/* k7dec.c - decode for Kermit's 7-bit window (0x20-0x7e). XSpeeds Kermit transfer of compressed files for 7-bit channels, use Xwith k7enc. For 8-bit channels with no parity use k8enc and k8dec. X XAuthor: Glenn E. Thobe <thobe@getunx.info.com> X XCopyright (C) 1992, Glenn E. Thobe, Canoga Park, California, USA. XAll rights reserved. Permission is granted to any individual Xor institution to use, copy, or redistribute this software so Xlong as it is not sold for profit, provided this copyright Xnotice is retained. X XK7dec recovers a compressed file which has been encoded by k7enc Xand transferred by Kermit in binary mode. As a filter, its typical Xusage with the compressed file.Z might be: X X k7enc < file.Z | kermit -is - (sender) X kermit -ik | k7dec > file.Z (receiver) X XWith a text file the following is also useful: X X compress < file.txt | k7enc | kermit -is - (sender) X kermit -ik | k7dec | uncompress > file.txt (receiver) X XK7enc and k7dec feature a special non-linear mapping which avoids Xall ASCII control characters and makes optimal use of the Kermit's Xentire 7-bit ASCII-printable code space. Thus Kermit has no need Xof control or 8th-bit prefixing (except for the control prefix itself Xwhich is doubled). X*/ X X/* REVISION HISTORY: XSat Feb 8 22:15:47 PST 1992 initial release -g.t. X*/ X X/* call coroutine */ X#define A_CALL_B(x,Ax) do{An=x;goto B;Ax:;}while(0) /* K&R C version */ X#define B_CALL_A(x,Bx) do{Bn=x;goto A;Bx:;}while(0) /* K&R C version */ X X#define UNMAP(x) ((x)-0x20) /* unmap from ASCII printables */ X X#include <stdio.h> X#include <assert.h> X Xvoid Xmain() X{ X int An, Bn; X int c, m, h, w, t, u, inbuf; X static int rmask[]={0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff}; X X /* initialize coroutine linkage */ X An = 0; X Bn = 0; X goto A; X X /* input 1-8 bits as per request */ XA0: X h = 0; X inbuf = 0; X while(1) X { X A_CALL_B(1,A1); X while (h < w) X { X t = getchar(); X if (feof(stdin)) exit(0); X u = UNMAP(t); X if (u < 0x22) X { X inbuf <<= 6; X inbuf += u; X h += 6; X } X else if (u < 0x5d) X { X u += 0x22; X inbuf <<= 7; X inbuf += u; X h += 7; X } X else X { X u += 0xa1; X inbuf <<= 8; X inbuf += u; X h += 8; X } X } X c = (inbuf >> (h - w)) & rmask[w]; X h -= w; X } X X /* dissemble encoded bytes and output */ XB0: X while(1) X { X w = 8; X B_CALL_A(1,B1); X putchar(c); X } X X/* more coroutine stuff */ X XA: X switch(An) X { X case 0: goto A0; X case 1: goto A1; X } X XB: X switch(Bn) X { X case 0: goto B0; X case 1: goto B1; X } X} END_OF_FILE if test 2585 -ne `wc -c <'k7dec.c'`; then echo shar: \"'k7dec.c'\" unpacked with wrong size! fi # end of 'k7dec.c' fi if test -f 'k8enc.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k8enc.c'\" else echo shar: Extracting \"'k8enc.c'\" $2737 characters$ sed "s/^X//" >'k8enc.c' <<'END_OF_FILE' X/* k8enc.c - encode for Kermit's 8-bit window (0x20-0x7e, 0xa0-0xfe). XSpeeds Kermit transfer of compressed files for 8-bit channels, use Xwith k8enc. For 7-bit channels with parity use k7enc and k7dec. X XAuthor: Glenn E. Thobe <thobe@getunx.info.com> X XCopyright (C) 1992, Glenn E. Thobe, Canoga Park, California, USA. XAll rights reserved. Permission is granted to any individual Xor institution to use, copy, or redistribute this software so Xlong as it is not sold for profit, provided this copyright Xnotice is retained. X XK8enc prepares a compressed file to be transferred by Kermit in Xbinary mode and decoded by k8dec. As a filter, its typical Xusage with the compressed file.Z might be: X X k8enc < file.Z | kermit -is - (sender) X kermit -ik | k8dec > file.Z (receiver) X XWith a text file the following is also useful: X X compress < file.txt | k8enc | kermit -is - (sender) X kermit -ik | k8dec | uncompress > file.txt (receiver) X XK8enc and k8dec feature a special non-linear mapping which avoids Xall ASCII control characters and makes optimal use of the Kermit's Xentire 8-bit ASCII-printable code space. Thus Kermit has no need Xof control prefixing (except for the control prefix itself which Xis doubled). X*/ X X/* REVISION HISTORY: XSat Feb 8 22:15:47 PST 1992 initial release -g.t. X*/ X X/* call coroutine */ X#define A_CALL_B(x,Ax) do{An=x;goto B;Ax:;}while(0) /* K&R C version */ X#define B_CALL_A(x,Bx) do{Bn=x;goto A;Bx:;}while(0) /* K&R C version */ X X#define min(x,y) ((x)>(y))?(y):(x) X#define MAP(x) ((x)+(((x)<0x5f)?0x20:0x41)) /* map to printables */ X X#include <stdio.h> X Xvoid Xmain() X{ X int An, Bn; X int c, m, h, w, inbuf, u, eof, old_eof; X static int rmask[]={0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff}; X X /* initialize coroutine linkage */ X An = 0; X Bn = 0; X goto A; X X /* input 1-8 bits as per request */ XA0: X h = 0; X old_eof = eof = 0; X inbuf = 0; X while(1) X { X A_CALL_B(1,A1); X if (h < w) X { X u = getchar(); X eof = feof(stdin); X inbuf <<= 8; X if(!eof) inbuf += u; X h += 8; X } X c = (inbuf >> (h - w)) & rmask[w]; X h -= w; X } X X /* assemble and output encoded bytes */ XB0: X w = 8; X B_CALL_A(1,B1); X while(1) X { X if (c < 0x84) X { X m = c; X putchar(MAP(m >> 1)); X if (eof && h < 7) exit(0); X w = 7; X B_CALL_A(2,B2); X c += (m & 0x01) << 7; X } X else X { X putchar(MAP(c-0x42)); X if (eof && h < 8) exit(0); X w = 8; X B_CALL_A(3,B3); X } X old_eof = eof; X } X exit(0); X X/* more coroutine stuff */ X XA: X switch(An) X { X case 0: goto A0; X case 1: goto A1; X } X XB: X switch(Bn) X { X case 0: goto B0; X case 1: goto B1; X case 2: goto B2; X case 3: goto B3; X } X} END_OF_FILE if test 2737 -ne `wc -c <'k8enc.c'`; then echo shar: \"'k8enc.c'\" unpacked with wrong size! fi # end of 'k8enc.c' fi if test -f 'k8dec.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k8dec.c'\" else echo shar: Extracting \"'k8dec.c'\" $2500 characters$ sed "s/^X//" >'k8dec.c' <<'END_OF_FILE' X/* k8dec.c - decode for Kermit's 8-bit window (0x20-0x7e, 0xa0-0xfe). XSpeeds Kermit transfer of compressed files for 8-bit channels, use Xwith k8enc. For 7-bit channels with parity use k7enc and k7dec. X XAuthor: Glenn E. Thobe <thobe@getunx.info.com> X XCopyright (C) 1992, Glenn E. Thobe, Canoga Park, California, USA. XAll rights reserved. Permission is granted to any individual Xor institution to use, copy, or redistribute this software so Xlong as it is not sold for profit, provided this copyright Xnotice is retained. X XK8dec recovers a compressed file which has been encoded by k8enc Xand transferred by Kermit in binary mode. As a filter, its typical Xusage with the compressed file.Z might be: X X k8enc < file.Z | kermit -is - (sender) X kermit -ik | k8dec > file.Z (receiver) X XWith a text file the following is also useful: X X compress < file.txt | k8enc | kermit -is - (sender) X kermit -ik | k8dec | uncompress > file.txt (receiver) X XK8enc and k8dec feature a special non-linear mapping which avoids Xall ASCII control characters and makes optimal use of the Kermit's Xentire 8-bit ASCII-printable code space. Thus Kermit has no need Xof control prefixing (except for the control prefix itself Xwhich is doubled). X*/ X X/* REVISION HISTORY: XSat Feb 8 22:15:47 PST 1992 initial release -g.t. X*/ X X/* call coroutine */ X#define A_CALL_B(x,Ax) do{An=x;goto B;Ax:;}while(0) /* K&R C version */ X#define B_CALL_A(x,Bx) do{Bn=x;goto A;Bx:;}while(0) /* K&R C version */ X X#define UNMAP(x) ((x)-(((x)<0x80)?0x20:0x41)) /* unmap from printables */ X X#include <stdio.h> X#include <assert.h> X Xvoid Xmain() X{ X int An, Bn; X int c, m, h, w, t, u, inbuf; X static int rmask[]={0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff}; X X /* initialize coroutine linkage */ X An=0; X Bn=0; X goto A; X X /* input 1-8 bits as per request */ XA0: X h = 0; X inbuf = 0; X while(1) X { X A_CALL_B(1,A1); X while (h < w) X { X t = getchar(); X if (feof(stdin)) exit(0); X u = UNMAP(t); X if (u < 0x42) X { X inbuf <<= 7; X inbuf += u; X h += 7; X } X else X { X u += 0x42; X inbuf <<= 8; X inbuf += u; X h += 8; X } X } X c = (inbuf >> (h - w)) & rmask[w]; X h -= w; X } X X /* dissemble encoded bytes and output */ XB0: X while(1) X { X w = 8; X B_CALL_A(1,B1); X putchar(c); X } X X/* more coroutine stuff */ X XA: X switch(An) X { X case 0: goto A0; X case 1: goto A1; X } X XB: X switch(Bn) X { X case 0: goto B0; X case 1: goto B1; X } X} END_OF_FILE if test 2500 -ne `wc -c <'k8dec.c'`; then echo shar: \"'k8dec.c'\" unpacked with wrong size! fi # end of 'k8dec.c' fi if test -f 'k7verify.ksh' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k7verify.ksh'\" else echo shar: Extracting \"'k7verify.ksh'\" $90 characters$ sed "s/^X//" >'k7verify.ksh' <<'END_OF_FILE' Xfor i X{ X echo "testing on file $i" X k7enc < $i | k7dec | cmp - $i X} Xecho "finished" END_OF_FILE if test 90 -ne `wc -c <'k7verify.ksh'`; then echo shar: \"'k7verify.ksh'\" unpacked with wrong size! fi chmod +x 'k7verify.ksh' # end of 'k7verify.ksh' fi if test -f 'k8verify.ksh' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'k8verify.ksh'\" else echo shar: Extracting \"'k8verify.ksh'\" $90 characters$ sed "s/^X//" >'k8verify.ksh' <<'END_OF_FILE' Xfor i X{ X echo "testing on file $i" X k8enc < $i | k8dec | cmp - $i X} Xecho "finished" END_OF_FILE if test 90 -ne `wc -c <'k8verify.ksh'`; then echo shar: \"'k8verify.ksh'\" unpacked with wrong size! fi chmod +x 'k8verify.ksh' # end of 'k8verify.ksh' fi echo shar: End of shell archive. exit 0 17-Mar-93 21:42:31-GMT,2861;000000000001 Return-Path: <mtdcr!ptm@mtune.att.com> Received: from att-out.att.com by watsun.cc.columbia.edu (5.59/FCB/jba) id AA07725; Wed, 17 Mar 93 16:42:24 EST Message-Id: <9303172142.AA07725@watsun.cc.columbia.edu> From: mtdcr!ptm@mtune.att.com Date: Wed, 17 Mar 93 16:40 EST To: fdc@watsun.cc.columbia.edu Subject: Compression FYI Frank - (from Peter Mauzey) March 17, 1993 You may recall months ago asking about a compression algorithm that is free of patent infringements. Hence the following quote, FYI. I'm not sure Keith is correct in his assertion about "patented algorithms." ------ quote start ----- Date: Sun, 14 Mar 1992 20:41:10 EST [ Really 1993 ] From: w8sdz@TACOM-EMH1.Army.Mil (Keith Petersen) To: MSDOS-Ann@TACOM-EMH1.Army.Mil Subject: GZIP106.ZIP - GNU zip: Unix compress/uncompress replacement Message-Id: <9303142041.12979.w8sdz@TACOM-EMH1.Army.Mil> I have uploaded to WSMR-SIMTEL20.Army.Mil and OAK.Oakland.Edu: pd1:<msdos.compress> GZIP106.ZIP GNU zip: Unix compress/uncompress replacement gzip (GNU zip) is a compression utility designed to be a replacement for 'compress'. Its main advantages over compress are much better compression and freedom from patented algorithms. The GNU Project uses it as the standard compression program for its system. gzip currently uses by default the LZ77 algorithm used in zip 1.9 (the portable pkzip compatible archiver). The gzip format was however designed to accommodate several compression algorithms. gunzip can currently decompress files created by gzip, zip (with restrictions), compress or pack. (The SCO 'compress -H' format will be supported in a future version). The detection of the input format is automatic. When using the first two formats, gunzip checks a 32 bit CRC. For pack, gunzip checks the uncompressed length. The 'compress' format was not designed to allow consistency checks. However gunzip is sometimes able to detect a bad .Z file because there is some redundancy in the .Z compression format. If you get an error when uncompressing a .Z file, do not assume that the .Z file is correct simply because the standard uncompress does not complain. This generally means that the standard uncompress does not check its input, and happily generates garbage output. Those who need the source in order to compile this program on other platforms please see files GZIP106.TAR-Z and GZIPREAD.ME in SIMTEL20 directory PD8:<MISC.UNIX> or OAK directory /pub/misc/unix. Keith -- Keith Petersen Maintainer of the MS-DOS archive at WSMR-SIMTEL20.Army.Mil [192.88.110.20] Internet: w8sdz@TACOM-EMH1.Army.Mil or w8sdz@Vela.ACS.Oakland.Edu Uucp: uunet!umich!vela!w8sdz BITNET: w8sdz@OAKLAND ------ quote end ----- Peter Mauzey 908 957-3887 ptm@mtdcr.att.com Bell Labs, 200 Laurel Avenue, Room 1A-301, Middletown, NJ 07748-4801 1-Dec-93 21:34:56-GMT,9258;000000000001 Return-Path: <beebe@math.utah.edu> Received: from csc-sun.math.utah.edu by watsun.cc.columbia.edu (5.59/FCB/jba) id AA11988; Wed, 1 Dec 93 16:34:53 EST Received: from solitude.math.utah.edu by math.utah.edu (4.1/SMI-4.1-utah-csc-server) id AA06920; Wed, 1 Dec 93 14:34:50 MST Date: Wed, 1 Dec 93 14:34:50 MST From: "Nelson H. F. Beebe" <beebe@math.utah.edu> To: fdc@watsun.cc.columbia.edu Cc: beebe@math.utah.edu, bowman@math.utah.edu X-Us-Mail: "Center for Scientific Computing, South Physics, University of Utah, Salt Lake City, UT 84112" X-Telephone: +1 801 581 5254 X-Fax: +1 801 581 4148 Subject: Kermit and dynamic compression Message-Id: <CMM.0.90.2.754781688.beebe@solitude.math.utah.edu> During the last year, implementations of the UNIX ftpd have become available that support dynamic archiving (tar, arc, zip, zoo) and compression (compress, gzip) for ftp "get" and "mget" commands. For compression, the data transfer volume is often reduced to 1/3 to 1/4 of what it would be without that compression; some representative figures are given below. For compression at least, the code changes are very very small: just replace fp = fopen(filename,"r"); ... (void)fclose(fp); with char buffer[MAXFILENAME + xxx]; sprintf(command,"gzip < %s",filename); fp = popen(command,"r"); ... (void)pclose(fp); popen()/pclose() are defined in POSIX and X/Open, so most UNIX implementations already have them. Otherwise, they can be synthesized from fork() and exec(), and I'm confident that freely-distribution versions of suitable code could be found on the Internet, or straightforwardly written. [xarchie found these a few seconds ago: irisa.irisa.fr:/News/Jukebox/comp.sources.atari.st/volume04/popen spud.hyperion.com:/Atari/sources/volume4/popen ] ftpd is somewhat fancier, since it allows the command lines to be specified in an external file, along with matching extensions, making it possible to extend the choice of archive and compression techniques without code modification. A typical example is appended to this message. The current Kermit implementation, 5A(189), for UNIX uses run-length encoding, as implemented in function encode() in ckcfns.c. For certain types of files (e.g. those with long strings of blanks or NULs, or binary images with constant backgrounds), this works quite well. However, the compression algorithms of gzip at present seem to outperform almost everything that I've tried (gzip often beats compress by about 25%, and both are much better than the dynamic Huffman encoding of UNIX pack/unpack). Here is a single comparison of gzip and compress on the UNIX (Sun SunOS 4.1.3) executable, /usr/local/bin/reduce, the largest one in that directory on our system: -rwxrwxr-x 1 bowman 5414912 Jun 10 1991 /usr/local/bin/reduce -rw-rw-r-- 1 beebe 601721 Dec 1 14:06 /tmp/reduce.Z -rw-rw-r-- 1 beebe 294912 Dec 1 14:06 /tmp/reduce.gz compress decreased the file size by a factor of 601721/5414912 = 0.11, while gzip decreased it by a factor of 294912/5414912 = 0.05. For a large text file, the Steele/Raymond `New Hacker's Dictionary', we have -rw-rw-r-- 1 beebe 1367403 Jul 27 21:04 jargon.info -rw-rw-r-- 1 beebe 587751 Dec 1 14:19 /tmp/jargon.info.Z -rw-rw-r-- 1 beebe 523845 Dec 1 14:20 /tmp/jargon.info.gz compress reduces the file by a factor of 0.43, and gzip, by 0.35. For a much more rigorous test, here are the compression results for our last night's file system backup with amanda (Advanced Maryland Automatic Network Disk Archiver), which uses the standard UNIX compress utility. The COMP% column shows the degree of compression achieved for each file system, a total of 3.4GB of data: STATISTICS: Total Full Daily -------- -------- -------- Dump Time (hrs:min) 6:57 4:27 0:39 (0:32 start, 1:19 idle) Output Size (meg) 1550.2 995.8 554.4 Original Size (meg) 3430.4 2389.0 1041.5 Avg Compressed Size (%) 39.9 41.7 34.2 Tape Used (%) 36.9 23.7 13.2 (level:#disks ...) Filesystems Dumped 23 3 20 (1:15 2:4 3:1) Avg Dump Rate (k/s) 51.0 61.6 38.9 Avg Tp Write Rate (k/s) 86.4 63.7 241.6 DUMP SUMMARY: DUMPER STATS TAPER STATS HOSTNAME DISK LV ORIG-KB OUT-KB COMP% MMM:SS KB/s MMM:SS KB/s -------------------- -------------------------------------- ------------- alfred rz0a 1 198 100 50.8 0:12 8.2 0:16 8.0 alfred rz0h 1 70152 40085 57.1 13:18 50.2 1:33 433.2 alfred rz2a 1 4967 1020 20.5 1:48 9.5 0:12 85.6 alfred rz3a 0 1649058 630695 38.2 230:27 45.6 230:36 45.6 alfred rz5a 2 154888 58652 37.9 28:05 34.8 2:20 417.9 csc-sun xd0a 1 18412 18412 -- 2:05 147.8 0:46 403.9 csc-sun xd0g 1 272 272 -- 1:22 3.3 0:08 34.9 csc-sun xd0h 2 57012 57012 -- 9:03 105.0 2:05 455.1 csc-sun xd1g 1 67337 50299 74.7 17:09 48.9 17:17 48.5 csc-sun xd1h 1 34372 34372 -- 12:25 46.1 1:28 391.7 csc-sun xd3c 2 26432 26432 -- 6:28 68.1 1:10 376.9 csc-sun xd4c 1 33992 33992 -- 7:14 78.2 1:26 395.4 csc-sun xd5c 3 138462 138462 -- 25:30 90.5 5:22 430.6 eros erosb 0 138333 28705 20.8 10:26 45.8 1:18 370.7 eros root 1 25940 8855 34.1 3:54 37.8 0:26 340.0 geronimo geroni 1 355416 59176 16.6 12:17 80.3 2:14 441.4 geronimo root 0 658900 360265 54.7 34:58 171.7 35:06 171.0 honeycomb root 2 14495 1910 13.2 1:07 28.6 0:13 145.0 saturna root 1 412 131 31.9 0:29 4.5 0:07 21.5 sparky sd0a 1 6076 1237 20.4 3:10 6.5 0:19 67.4 sparky sd0g 1 107 48 44.9 0:28 1.7 0:07 9.1 sparky sd1c 1 55902 36969 66.1 96:50 6.4 1:33 399.8 todd root 1 1642 308 18.7 0:30 10.2 0:08 38.9 So, the question arises, have the Kermit developers considered supporting this kind of dynamic compression, at least with a UNIX kermit server? The general idea would be that the user would type kermit> get foo to get the original file foo, and kermit> get foo.gz or kermit> get foo.Z to get foo.gz or foo.Z if they exist, but OTHERWISE, to dynamically invoke gzip or compress via popen(), and send the compressed file to the user's machine. If foo is a directory, a "get foo.tar" or "get foo.tar.gz" or "get foo.tar.Z" could dynamically invoke tar and optionally, gzip or compress. ftpd/ftp provide no facility for dynamic decompression or de-archiving, and I see no strong need to do so for kermit either, although, clearly, it could be an option, e.g. "get/decompress foo.Z" or "get/untar foo.tar". The latter would make it possible for kermit to move entire directory trees with a single user command. ------------------------------------------------------------------------ # This is /usr/local/etc/ftpconversions # File format: # strip prefix:strip postfix:addon prefix:addon postfix:external command:types:options:description # # For example, # : : :.tar.Z:/bin/tar -c --compress -f - %s:T_REG|T_DIR:O_COMPRESS|O_TAR:TAR+COMPRESS # says that a get of foo.tar.Z should execute the command # /bin/tar -c --compress -f - foo # return the piped output as foo.tar.Z. # # The line # :.Z: : :/bin/compress -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS # says that a get of foo for the file foo.Z should run # /bin/compress -d -c foo.Z # and return the piped output as foo # :.Z: : :/bin/compress -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS :-z: : :/bin/compress -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS :.z: : :/bin/gunzip -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS :.trz: :.tar:/bin/gunzip -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS :.tar: :.trz:/bin/gunzip -d -c %s:T_REG|T_ASCII:O_UNCOMPRESS:UNCOMPRESS : : :.Z:/bin/compress -c %s:T_REG:O_COMPRESS:COMPRESS : : :.z:/bin/gzip -c %s:T_REG:O_COMPRESS:COMPRESS : : :.tar:/bin/tar -c -f - %s:T_REG|T_DIR:O_TAR:TAR : : :.tar.Z:/bin/tar -c --compress -f - %s:T_REG|T_DIR:O_COMPRESS|O_TAR:TAR+COMPRESS : : :.tar.z:/bin/tar -c --gzip -f - %s:T_REG|T_DIR:O_COMPRESS|O_TAR:TAR+COMPRESS : : :.zoo:/bin/ZOO %s:T_REG|T_DIR:O_COMPRESS|O_TAR:TAR+COMPRESS : : :.zip:/bin/ZIP %s:T_REG|T_DIR:O_COMPRESS|O_TAR:TAR+COMPRESS ======================================================================== Nelson H. F. Beebe Tel: +1 801 581 5254 Center for Scientific Computing FAX: +1 801 581 4148 Department of Mathematics, 105 JWB Internet: beebe@math.utah.edu University of Utah Salt Lake City, UT 84112, USA ======================================================================== 1-Dec-93 22:02:41-GMT,2148;000000000001 Return-Path: <fdc> Received: by watsun.cc.columbia.edu (5.59/FCB/jba) id AA14144; Wed, 1 Dec 93 17:02:36 EST Date: Wed, 1 Dec 93 17:02:35 EST From: Frank da Cruz <fdc@watsun.cc.columbia.edu> To: "Nelson H. F. Beebe" <beebe@math.utah.edu> Cc: bowman@math.utah.edu Subject: Re: Kermit and dynamic compression In-Reply-To: Your message of Wed, 1 Dec 93 14:34:50 MST Message-Id: <CMM.0.90.4.754783355.fdc@watsun.cc.columbia.edu> > During the last year, implementations of the UNIX ftpd have become > available that support dynamic archiving (tar, arc, zip, zoo) and > compression (compress, gzip) for ftp "get" and "mget" commands... > ... > So, the question arises, have the Kermit developers considered > supporting this kind of dynamic compression, at least with a UNIX > kermit server? > Yes indeed. There is a discussion of it in the file kermit/e/compress.txt on kermit.columbia.edu, where you will find your own message as the latest entry. (Well, except for this one.) And no, nothing as simplistic as running things through gzip would do the trick. Sure it's easy in UNIX, but any higher-order compression method we adopt is going to have to meet certain important criteria: . It must be portable among all OS's and file systems. . It must be designed to produce a data stream that contains no control or 8-bit characters (depending on Kermit's PARITY and SET CONTROL settings), so that Kermit's transparency encoding does not undo the effects of compression. Even if we liked the idea of using popen() to run a file through gzip before sending it, it still wouldn't be right. Before compressing a text file, we need to convert its character set and record format, otherwise it'll be garbage after decompression on the receiving end. And vice versa. So don't worry, compression is high up on the list of items to add to the Kermit protocol, pretty much right behind checkpoint/restart. But it's not simple. Meanwhile, of course, when transferring between two UNIX systems, you can already pipe stuff through C-Kermit, and that applies to compress, gzip, crypt, sort, or anything else that works with stdio. - Frank