- Newsgroups: comp.dcom.cell-relay
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!spool.mu.edu!sgiblab!sgigate!sgi!rigden.wpd.sgi.com!rpw3
- From: rpw3@rigden.wpd.sgi.com (Rob Warnock)
- Subject: Hop-by-hop flow control (was: Re: clarifying the role of SSCOP?)
- Message-ID: <s90ppd8@sgi.sgi.com>
- Sender: rpw3@rigden.wpd.sgi.com
- Organization: Silicon Graphics, Inc. Mountain View, CA
- Date: Thu, 12 Nov 1992 11:28:50 GMT
- Lines: 163
-
- craig@sics.se (Craig Partridge) writes:
- +---------------
- | Can we back up a little bit? ...the likelihood that the tried and true
- | principles of data networking would all get thrown out the window seemed
- | small. So I've stuck to the philosophy that what we have learned from experience
- | is right, until someone proves it wrong. So I stick to using the end-to-end
- | argument (which is not TCP/IP specific and was formulated after TCP/IP was
- | developed), the principle that end-to-end error recovery is best, etc.
- +---------------
-
- Craig, I completely agree: End-to-end error recovery is quite sufficient,
- *if* the likelihood of needing to use it is small. "Tried & true; learned
- from experience."
-
- But when specific links are unreliable, then you do certain minimal local
- things to improve the unreliable links, so that your end-to-end error recovery
- remains useful. This is *also* a "tried and true principle of data networking"
- which many of us think we have "learned from experience". Otherwise, why is
- it that satellites use forward error-correcting codes? Because the uncorrected
- bit error rate is so high and the delay so long that end-to-end retransmission
- would yield practically no throughput. But even here, we know not to try and
- usurp the role of end-to-end error recovery: You put just enough FEC on your
- link that the errors "bunch up" into infrequent disastrous miscorrections,
- which are caught by an overall CRC (orthogonal to your FEC), and when that
- happens (*much* less often than a single bit error on the uncorrected link)
- you drop the whole packet and let the E2E retransmission handle it.
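-
- [A rough sketch, in C, of the receive path I'm describing; fec_decode(),
- crc32(), and deliver_up() are invented stand-ins for whatever codes and
- hand-offs the actual link uses:]
-
-     #include <stddef.h>
-     #include <stdint.h>
-
-     struct pkt { uint8_t *data; size_t len; uint32_t crc; };
-
-     extern void     fec_decode(uint8_t *buf, size_t len);
-     extern uint32_t crc32(const uint8_t *buf, size_t len);
-     extern int      deliver_up(struct pkt *p);
-
-     /* FEC corrects the common bit errors; the orthogonal CRC catches
-      * the rare disastrous miscorrection, and then we just drop the
-      * whole packet and leave recovery to the E2E retransmission.   */
-     int link_receive(struct pkt *p)
-     {
-         fec_decode(p->data, p->len);
-         if (crc32(p->data, p->len) != p->crc)
-             return -1;                 /* drop: E2E's problem now */
-         return deliver_up(p);
-     }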
-
- Or for a really noisy short-delay link, you might put an "invisible" (to the
- IP layer) small-fragmentation plus go-back-N-retransmission on just that link.
- It doesn't add much overall delay (because you fragmented into small pieces),
- but *significantly* reduces the required E2E retransmission rate... without
- eliminating it (because there are other sources of errors), and thus still
- leaving the necessity for E2E error recovery.
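-
- [Again a sketch in C, assuming a simple cumulative-ack go-back-N on the
- sending side of that one link; send_frag() and wait_for_ack() are
- invented names:]
-
-     #define WIN 8           /* small window of small fragments */
-
-     extern int send_frag(int seq, const void *frag, int len);
-     extern int wait_for_ack(int *acked);   /* 1 = ack, 0 = timeout */
-
-     /* Link-local go-back-N, invisible to the IP layer: fragments go
-      * out in sequence; on a timeout we back up to the oldest
-      * unacknowledged fragment and resend everything from there.   */
-     void gbn_send(const void *frags[], int lens[], int nfrags)
-     {
-         int base = 0, next = 0, acked;
-
-         while (base < nfrags) {
-             while (next < base + WIN && next < nfrags) {
-                 send_frag(next, frags[next], lens[next]);
-                 next++;
-             }
-             if (wait_for_ack(&acked))
-                 base = acked + 1;  /* cumulative ack slides window */
-             else
-                 next = base;       /* timeout: go back N           */
-         }
-     }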
-
- Likewise, when every hop a packet may take has similar buffering
- characteristics, end-to-end congestion avoidance makes sense. But when some
- hops have
- drastically different buffering characteristics -- for example, some ATM
- switches' output buffers might not even hold an AAL5 MTU's number of cells
- (64KB = 1366 cells) -- then you have to do something "local" to raise the
- link functionality to the level where your E2E congestion management can
- handle it. NOTE: I didn't say "solve" it locally, I just said do enough to
- make it "look like" the regime for which the E2E congestion management was
- designed and is suited.
-
- In particular, many people are talking about and experimenting with various
- methods of "local" or hop-by-hop flow-control for ATM. I think this is useful,
- in general [but see below]. The "hop-by-hop" domain doesn't have to extend
- outside the small-buffered "core" ATM network, just to the next large-buffered
- thingies (whether they be traditional routers or special ATM switches with
- mongo buffers).
-
- But most have proposed "credit" systems which keep a separate account of
- credits for each VPI/VCI, with the credits bubbling backwards through the
- network as user cells flow forward. I believe this approach:
-
- - is entirely too complex,
- - will be hard (and thus expensive) to implement,
- - will be hard to get enough agreement to standardize
- (since there are many proposals on the table),
- - will thus be too late to be deployed universally,
- - and is not needed anyway.
-
- I'd like to propose a radically simpler hop-by-hop alternative to those credit
- systems, yet one which I believe will interact well with traditional E2E
- congestion management, and one which I also hope "hop-by-hop critics" such as
- yourself will find acceptable as well.
-
- Let every VC be declared at call-setup time to be either a "rate-limited" call
- [WILL NOT exceed "x" cells/averaging-interval] or a "bursty" call. And let
- every ATM switch output port have *two* buffers: one used only for "rate-
- limited" VCs and one used only for "bursty" calls. [The "only" is what keeps
- us from having to worry about re-ordering.] The "rate-limited" VCs get first
- dibs on outgoing cells; the "bursty" VCs fill in the gaps, consuming whatever
- cells are left over. Both buffers can be fairly small, if you like. [I don't.]
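-
- [In C, roughly; the queue type and its operations are invented for this
- sketch:]
-
-     #include <stdbool.h>
-
-     struct cellq;                            /* FIFO of cells; opaque  */
-     extern bool q_empty(struct cellq *q);
-     extern void xmit_cell(struct cellq *q);  /* dequeue & send one     */
-     extern void xmit_idle(void);             /* nothing to send        */
-
-     struct outport {
-         struct cellq *rl_q;                  /* "rate-limited" VCs only */
-         struct cellq *by_q;                  /* "bursty" VCs only       */
-     };
-
-     /* One cell slot on the outgoing link: rate-limited traffic gets
-      * first dibs; bursty traffic fills in whatever gaps are left.  */
-     void cell_slot(struct outport *p)
-     {
-         if (!q_empty(p->rl_q))
-             xmit_cell(p->rl_q);
-         else if (!q_empty(p->by_q))
-             xmit_cell(p->by_q);
-         else
-             xmit_idle();
-     }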
-
- Now pick one of the four (as-yet-undefined) "GFC" bits in the cell header,
- and call it "reverse bursty congestion experienced" (RBCE). This shall be a
- *link* signal, not specific to any VC, and when activated should be set in
- every cell regardless of VC [and it will probably have to include idle or
- empty cells, too]. When an output port sees RBCE coming back on the link
- (that is, to its associated input port), it simply quits sending cells from
- its "bursty" queue.
-
- When a "bursty" queue exceeds a watermark on some output port (either because
- its output is shut off or because it has too much bursty input traffic), it
- sends the RBCE indication (out-of-band, internally to the switch, not in cells)
- back to *all* the input ports that have any "bursty" calls set up that go
- through this output port. [The table or bit-mask or whatever of which input
- ports have "bursty" calls to this output port need be adjusted only at call
- setup and teardown, i.e., it's not realtime.] When an input port is thus
- notified, it tells its associated output port to start sending an RBCE back
- upstream [to wherever the input comes from]. That is, RBCE is a sort of
- "not-Clear-To-Send-bursty" signal that fans out to all bursty sources which
- could have caused the bursty output queue to hit the watermark.
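-
- [A sketch of the fan-out, assuming a small switch where a bit-mask per
- output port will do; all the names are invented:]
-
-     #define NPORTS    16
-     #define WATERMARK 32               /* cells, say */
-
-     /* bursty_srcs[out]: bit-mask of input ports with bursty calls
-      * routed through output port "out".  Updated only at call setup
-      * and teardown, so maintaining it is not a realtime job.       */
-     extern unsigned bursty_srcs[NPORTS];
-     extern int  bursty_qlen(int port);
-     extern void assert_rbce(int inport); /* tell inport's associated
-                                             output port to send RBCE
-                                             back upstream           */
-
-     /* Out-of-band, internal to the switch: when a bursty queue
-      * crosses its watermark, notify every input port that could
-      * have contributed to it.                                      */
-     void check_watermark(int out)
-     {
-         int in;
-
-         if (bursty_qlen(out) < WATERMARK)
-             return;
-         for (in = 0; in < NPORTS; in++)
-             if (bursty_srcs[out] & (1u << in))
-                 assert_rbce(in);
-     }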
-
- And when a host or other end station sees RBCE, it quits sending cells for
- all "bursty" VCs (just like an "XOFF" or "CTS false" on an RS-232 port).
-
- Now, at first, to certain purists, this may look horrid! Why, I'm proposing
- to block a host from sending *all* bursty traffic when perhaps only *one*
- bursty VC has experienced congestion!
-
- Yep, that's exactly what I'm proposing. That's also exactly what happens to
- bursty traffic today on LANs, be they Ethernet or FDDI or Token Ring or ARCnet.
- When a host experiences congestion, *all* of its connections experience
- congestion. And we live with it today; we can live with it in an ATM fabric.
-
- Why? Because, in fact, "bursty" traffic is *bursty*! It is extremely rare for
- a high-speed host to be sending congestion-creating bursts for multiple
- connections at the same instant [provided that the link data rate is high
- enough].
- And why is that? Because high-speed hosts connected to high-speed links tend
- to send their bursty data directly from the process that originated it, during
- that process's scheduled run time. And anyway, even if the LAN or link is busy,
- data from other processes is queued (almost universally) in FIFO order waiting
- to go out. So what you see on the LAN (or link) is a burst of data from one
- process (in TCP, usually a 1/2-window's worth or so), then a burst of data
- from another, etc. [Or else a continuous link-saturating stream all from one
- process, as with many vendors' "ttcp" these days, but that's another story!
- ...and even supports the argument further.]
-
- I am assuming that hosts will be able to continuously saturate the access
- line to the ATM switch with useful data. Count everybody who can do close to
- 100 Mb/s of FDDI today (several workstation vendors), and it doesn't take
- a mystic to see that the entire next generation of workstations will easily
- be able to saturate a 155 Mb/s link (or for the "big, fast guys", a 622 Mb/s
- link) with ordinary TCP and UDP data (and the like).
-
- Therefore, since (repeating my short list of assumptions):
-
- 1. The ATM switch buffers are *SMALL* (perhaps not even a single AAL5's MTU,
- or at most a few MTUs),
-
- 2. Over a time span that *exceeds* the size of the switch buffers, but is less
- than the transport protocol's "window" size (64 KB for TCP, but can easily
- be *megabytes* for TCP with RFC 1323 large windows), a fast host is in fact
- usually sending traffic for only one connection (from the process that's
- running "right now"),
-
- 3. [New one:] Congestion created by a given host (or set of hosts) will be
- worst relatively near to the host (or to the first switch they have in
- common), and will decrease as you get farther into the network fabric
- away from the host(s) [think about it -- local flow-control makes for
- lots of leaky buckets leading away from the traffic source],
-
- I claim that blocking *all* bursty traffic from a host when congestion is
- experienced due to *any* of his (or even his neighbors') bursty traffic
- is perfectly reasonable, and the (or at least, a) right thing to do.
-
- Implementing it is certainly simple enough.
-
- And yet [punchline for Craig], to get reasonable overall flow on bursty
- traffic, I believe you still do need end-to-end congestion avoidance on
- top of the above-suggested hop-by-hop flow control.
-
-
- -Rob
-
- -----
- Rob Warnock, MS-9U/510 rpw3@sgi.com
- Silicon Graphics, Inc. (415)390-1673
- 2011 N. Shoreline Blvd.
- Mountain View, CA 94043
-
-