- Newsgroups: comp.dcom.cell-relay
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!spool.mu.edu!sgiblab!sgigate!sgi!rigden.wpd.sgi.com!rpw3
- From: rpw3@rigden.wpd.sgi.com (Rob Warnock)
- Subject: Hop-by-hop flow control (was: Re: clarifying the role of SSCOP?)
- Message-ID: <s90ppd8@sgi.sgi.com>
- Sender: rpw3@rigden.wpd.sgi.com
- Organization: Silicon Graphics, Inc. Mountain View, CA
- Date: Thu, 12 Nov 1992 11:28:50 GMT
- Lines: 163
-
- craig@sics.se (Craig Partridge) writes:
- +---------------
- | Can we back up a little bit? ...the likelihood that the tried and true
- | principles of data networking would all get thrown out the window seemed
- | small. So I've stuck to the philosophy that what we have learned from experience
- | is right, until someone proves it wrong. So I stick to using the end-to-end
- | argument (which is not TCP/IP specific and was formulated after TCP/IP was
- | developed), the principle that end-to-end error recovery is best, etc.
- +---------------
-
- Craig, I completely agree: End-to-end error recovery is quite sufficient,
- *if* the likelihood of needing to use it is small. "Tried & true; learned
- from experience."
-
- But when specific links are unreliable, then you do certain minimal local
- things to improve the unreliable links, so that your end-to-end error recovery
- remains useful. This is *also* a "tried and true principle of data networking"
- which many of us think we have "learned from experience". Otherwise, why is
- it that satellites use forward error-correcting codes? Because the uncorrected
- bit error rate is so high and the delay so long that end-to-end retransmission
- would yield practically no throughput. But even here, we know not to try and
- usurp the role of end-to-end error recovery: You put just enough FEC on your
- link that the errors "bunch up" into infrequent disastrous miscorrections,
- which are caught by an overall CRC (orthogonal to your FEC), and when that
- happens (*much* less often than a single bit error on the uncorrected link)
- you drop the whole packet and let the E2E retransmission handle it.
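-
- [A rough sketch, in C, of the receive path I'm describing; fec_decode(),
- crc32(), and deliver_up() are invented stand-ins for whatever codes and
- hand-offs the actual link uses:]
-
-     #include <stddef.h>
-     #include <stdint.h>
-
-     struct pkt { uint8_t *data; size_t len; uint32_t crc; };
-
-     extern void     fec_decode(uint8_t *buf, size_t len);
-     extern uint32_t crc32(const uint8_t *buf, size_t len);
-     extern int      deliver_up(struct pkt *p);
-
-     /* FEC corrects the common bit errors; the orthogonal CRC catches
-      * the rare disastrous miscorrection, and then we just drop the
-      * whole packet and leave recovery to the E2E retransmission.   */
-     int link_receive(struct pkt *p)
-     {
-         fec_decode(p->data, p->len);
-         if (crc32(p->data, p->len) != p->crc)
-             return -1;                 /* drop: E2E's problem now */
-         return deliver_up(p);
-     }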
-
- Or for a really noisy short-delay link, you might put an "invisible" (to the
- IP layer) small-fragmentation plus go-back-N-retransmission on just that link.
- It doesn't add much overall delay (because you fragmented into small pieces),
- but *significantly* reduces the required E2E retransmission rate... without
- eliminating it (because there are other sources of errors), and thus still
- leaving the necessity for E2E error recovery.
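-
- [Again a sketch in C, assuming a simple cumulative-ack go-back-N on the
- sending side of that one link; send_frag() and wait_for_ack() are
- invented names:]
-
-     #define WIN 8           /* small window of small fragments */
-
-     extern int send_frag(int seq, const void *frag, int len);
-     extern int wait_for_ack(int *acked);   /* 1 = ack, 0 = timeout */
-
-     /* Link-local go-back-N, invisible to the IP layer: fragments go
-      * out in sequence; on a timeout we back up to the oldest
-      * unacknowledged fragment and resend everything from there.   */
-     void gbn_send(const void *frags[], int lens[], int nfrags)
-     {
-         int base = 0, next = 0, acked;
-
-         while (base < nfrags) {
-             while (next < base + WIN && next < nfrags) {
-                 send_frag(next, frags[next], lens[next]);
-                 next++;
-             }
-             if (wait_for_ack(&acked))
-                 base = acked + 1;  /* cumulative ack slides window */
-             else
-                 next = base;       /* timeout: go back N           */
-         }
-     }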
-
- Likewise, when every hop a packet may take has similar buffering
- characteristics, end-to-end congestion avoidance makes sense. But when some
- hops have
- drastically different buffering characteristics -- for example, some ATM
- switches' output buffers might not even hold an AAL5 MTU's number of cells
- (64KB = 1366 cells) -- then you have to do something "local" to raise the
- link functionality to the level where your E2E congestion management can
- handle it. NOTE: I didn't say "solve" it locally, I just said do enough to
- make it "look like" the regime for which the E2E congestion management was
- designed and is suited.
-
- In particular, many people are talking about and experimenting with various
- methods of "local" or hop-by-hop flow-control for ATM. I think this is useful,
- in general [but see below]. The "hop-by-hop" domain doesn't have to extend
- outside the small-buffered "core" ATM network, just to the next large-buffered
- thingies (whether they be traditional routers or special ATM switches with
- mongo buffers).
-
- But most have proposed "credit" systems which keep a separate account of
- credits for each VPI/VCI, with the credits bubbling backwards through the
- network as user cells flow forward. I believe this approach:
-
- - is entirely too complex,
- - will be hard (and thus expensive) to implement,
- - will be hard to get enough agreement to standardize
- (since there are many proposals on the table),
- - will thus be too late to be deployed universally,
- - and is not needed anyway.
-
- I'd like to propose a radically simpler hop-by-hop alternative to those credit
- systems, yet one which I believe will interact well with traditional E2E
- congestion management, and one which I also hope "hop-by-hop critics" such as
- yourself will find acceptable as well.
-
- Let every VC be declared at call-setup time to be either a "rate-limited" call
- [WILL NOT exceed "x" cells/averaging-interval] or a "bursty" call. And let
- every ATM switch output port have *two* buffers: one used only for "rate-
- limited" VCs and one used only for "bursty" calls. [The "only" is what keeps
- us from having to worry about re-ordering.] The "rate-limited" VCs get first
- dibs on outgoing cells; the "bursty" VCs fill in the gaps, consuming whatever
- cells are left over. Both buffers can be fairly small, if you like. [I don't.]
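-
- [In C, roughly; the queue type and its operations are invented for this
- sketch:]
-
-     #include <stdbool.h>
-
-     struct cellq;                            /* FIFO of cells; opaque  */
-     extern bool q_empty(struct cellq *q);
-     extern void xmit_cell(struct cellq *q);  /* dequeue & send one     */
-     extern void xmit_idle(void);             /* nothing to send        */
-
-     struct outport {
-         struct cellq *rl_q;                  /* "rate-limited" VCs only */
-         struct cellq *by_q;                  /* "bursty" VCs only       */
-     };
-
-     /* One cell slot on the outgoing link: rate-limited traffic gets
-      * first dibs; bursty traffic fills in whatever gaps are left.  */
-     void cell_slot(struct outport *p)
-     {
-         if (!q_empty(p->rl_q))
-             xmit_cell(p->rl_q);
-         else if (!q_empty(p->by_q))
-             xmit_cell(p->by_q);
-         else
-             xmit_idle();
-     }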
-
- Now pick one of the four (as-yet-undefined) "GFC" bits in the cell header,
- and call it "reverse bursty congestion experienced" (RBCE). This shall be a
- *link* signal, not specific to any VC, and when activated should be set in
- every cell regardless of VC [and it will probably have to include idle or
- empty cells, too]. When an output port sees RBCE coming back on the link
- (that is, to its associated input port), it simply quits sending cells from
- its "bursty" queue.
-
- When a "bursty" queue exceeds a watermark on some output port (either because
- its output is shut off or because it has too much bursty input traffic), it
- sends the RBCE indication (out-of-band, internally to the switch, not in cells)
- back to *all* the input ports that have any "bursty" calls set up that go
- through this output port. [The table or bit-mask or whatever of which input
- ports have "bursty" calls to this output port need be adjusted only at call
- setup and teardown, i.e., it's not realtime.] When an input port is thus
- notified, it tells its associated output port to start sending an RBCE back
- upstream [to wherever the input comes from]. That is, RBCE is a sort of
- "not-Clear-To-Send-bursty" signal that fans out to all bursty sources which
- could have caused the bursty output queue to hit the watermark.
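-
- [A sketch of the fan-out, assuming a small switch where a bit-mask per
- output port will do; all the names are invented:]
-
-     #define NPORTS    16
-     #define WATERMARK 32               /* cells, say */
-
-     /* bursty_srcs[out]: bit-mask of input ports with bursty calls
-      * routed through output port "out".  Updated only at call setup
-      * and teardown, so maintaining it is not a realtime job.       */
-     extern unsigned bursty_srcs[NPORTS];
-     extern int  bursty_qlen(int port);
-     extern void assert_rbce(int inport); /* tell inport's associated
-                                             output port to send RBCE
-                                             back upstream           */
-
-     /* Out-of-band, internal to the switch: when a bursty queue
-      * crosses its watermark, notify every input port that could
-      * have contributed to it.                                      */
-     void check_watermark(int out)
-     {
-         int in;
-
-         if (bursty_qlen(out) < WATERMARK)
-             return;
-         for (in = 0; in < NPORTS; in++)
-             if (bursty_srcs[out] & (1u << in))
-                 assert_rbce(in);
-     }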
-
- And when a host or other end station sees RBCE, it quits sending cells for
- all "bursty" VCs (just like an "XOFF" or "CTS false" on an RS-232 port).
-
- Now, at first, to certain purists, this may look horrid! Why, I'm proposing
- to block a host from sending *all* bursty traffic when perhaps only *one*
- bursty VC has experienced congestion!
-
- Yep, that's exactly what I'm proposing. That's also exactly what happens to
- bursty traffic today on LANs, be they Ethernet or FDDI or Token Ring or ARCnet.
- When a host experiences congestion, *all* of its connections experience
- congestion. And we live with it today; we can live with it in an ATM fabric.
-
- Why? Because, in fact, "bursty" traffic is *bursty*! It is extremely rare for
- a high-speed host to be sending congestion-creating bursts for multiple
- connections at the same instant [provided that the link data rate is high
- enough].
- And why is that? Because high-speed hosts connected to high-speed links tend
- to send their bursty data directly from the process that originated it, during
- that process's scheduled run time. And anyway, even if the LAN or link is busy,
- data from other processes is queued (almost universally) in FIFO order waiting
- to go out. So what you see on the LAN (or link) is a burst of data from one
- process (in TCP, usually a 1/2-window's worth or so), then a burst of data
- from another, etc. [Or else a continuous link-saturating stream all from one
- process, as with many vendors' "ttcp" these days, but that's another story!
- ...and even supports the argument further.]
-
- I am assuming that hosts will be able to continuously saturate the access
- line to the ATM switch with useful data. Count everybody who can do close to
- 100 Mb/s of FDDI today (several workstation vendors), and it doesn't take
- a mystic to see that the entire next generation of workstations will easily
- be able to saturate a 155 Mb/s link (or for the "big, fast guys", a 622 Mb/s
- link) with ordinary TCP and UDP data (and the like).
-
- Therefore, since (repeating my short list of assumptions):
-
- 1. The ATM switch buffers are *SMALL* (perhaps not even a single AAL5's MTU,
- or at most a few MTUs),
-
- 2. Over a time span that *exceeds* the size of the switch buffers, but is less
- than the transport protocol's "window" size (64 KB for TCP, but can easily
- be *megabytes* for TCP with RFC 1323 large windows), a fast host is in fact
- usually sending traffic for only one connection (from the process that's
- running "right now"),
-
- 3. [New one:] Congestion created by a given host (or set of hosts) will be
- worst relatively near to the host (or to the first switch they have in
- common), and will decrease as you get farther into the network fabric
- away from the host(s) [think about it -- local flow-control makes for
- lots of leaky buckets leading away from the traffic source],
-
- I claim that blocking *all* bursty traffic from a host when congestion is
- experienced due to *any* of his (or even his neighbors') bursty traffic
- is perfectly reasonable, and the (or at least, a) right thing to do.
-
- Implementing it is certainly simple enough.
-
- And yet [punchline for Craig], to get reasonable overall flow on bursty
- traffic, I believe you still do need end-to-end congestion avoidance on
- top of the above-suggested hop-by-hop flow control.
-
-
- -Rob
-
- -----
- Rob Warnock, MS-9U/510 rpw3@sgi.com
- Silicon Graphics, Inc. (415)390-1673
- 2011 N. Shoreline Blvd.
- Mountain View, CA 94043
-
-