- Newsgroups: comp.sys.isis
- Path: sparky!uunet!cs.utexas.edu!usc!rpi!batcomputer!cornell!ken
- From: ken@cs.cornell.edu (Ken Birman)
- Subject: More warnings about time warps
- Message-ID: <1992Sep7.185432.26204@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- Date: Mon, 7 Sep 1992 18:54:32 GMT
- Lines: 76
-
- I spent part of Labor Day staring at one or perhaps two ISIS bugs
- that can be triggered (only?) by big scheduling differences between
- members of a process group. For example, say that A,B,C are in
- group G and that D joins and B crashes between 1:00pm and 1:03pm.
- Normally, one would expect A and C and D to emerge in consistent
- states by the end of this sequence.
-
- Say that C is not cooperating with the ISIS scheduler, though, and
- tends to hold the CPU for minutes at a time without letting ISIS
- run (and, let's add another half dozen or so members to the group
- to liven things up, and have them all do this sporadically -- ignore
- ISIS for a minute or two at a time).
-
- On your console, this will provoke vigorous complaints from ISIS
- about "Time warp: 39.37 seconds" -- or whatever the delay was.
-
- Internally, though, this creates an extreme of asynchrony, in which
- the slow processes may be several group views and hundreds of
- messages behind the fast ones, and in which the fast ones may be
- close to trying to kill the slow ones off for non-responsiveness.
-
- In principle, our protocols handle this case, but in practice, we
- seem to have two remaining bugs in V3.0.6 (V3.0.5 and earlier
- had significantly MORE than "only" two problems in this case).
- The observed problems are:
- First, the group can hang, neither permitting further joins/departures
- nor killing members. If you look at a client log you will find
- some members "waiting for flush viewid 154.0" (for example), but with
- "Got flush viewid 153.0" listed on the same channel.
-
- Apparently, in some timer-related situation the flush protocol
- either resends the flush for 153.0 or the protocol for 154.0
- advances without using a fall-back scheme and gets confused when
- the 153.0 flush message shows up late. This jams the channel
- between the processes involved and leaves the group stuck, since
- ISIS is not tolerant of this sort of implementation bug.
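-
- To check a client log for this symptom, something like the following
- C sketch may help. It is only an illustration: the two quoted phrases
- come from the log excerpts above, but the rest of the line format, the
- "major.minor" reading of the viewid, and the helper name
- flush_looks_stuck() are assumptions. The caller is expected to pass in
- only the lines that belong to a single channel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan log lines (all from one channel) for the symptom described above:
 * a "waiting for flush viewid N" entry together with a
 * "Got flush viewid M" entry where M is older than N.
 * Only the two quoted phrases are taken from the post; everything else
 * about the log format is assumed. Returns 1 if the symptom is present.
 */
static int flush_looks_stuck(const char *const *lines, size_t n)
{
    int wait_major = -1, wait_minor = -1;
    int got_major = -1, got_minor = -1;
    size_t i;

    assert(lines != NULL || n == 0);

    for (i = 0; i < n; i++) {
        const char *p;
        int major, minor;

        p = strstr(lines[i], "waiting for flush viewid ");
        if (p != NULL &&
            sscanf(p, "waiting for flush viewid %d.%d", &major, &minor) == 2) {
            wait_major = major;
            wait_minor = minor;
        }

        p = strstr(lines[i], "Got flush viewid ");
        if (p != NULL &&
            sscanf(p, "Got flush viewid %d.%d", &major, &minor) == 2) {
            got_major = major;
            got_minor = minor;
        }
    }

    if (wait_major < 0 || got_major < 0)
        return 0;               /* one of the two entries never appeared */

    /* Stuck if the last flush received is older than the one awaited. */
    return got_major < wait_major ||
           (got_major == wait_major && got_minor < wait_minor);
}
```

- Feeding it the two example lines quoted above (waiting for 154.0, got
- 153.0) reports the symptom; a log where the awaited flush did arrive
- does not.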
-
- Second (perhaps), RPCs via protos (non-bypass) can "lose their
- replies". One sees an RPC pending in the client log, but no sign of
- the reply and no sign of a corresponding task in the protos log.
-
- Again, these bugs are NOT seen except in combination with timewarp
- complaints in the 60-120 second range, which are always due to software
- that doesn't schedule ISIS in a timely manner. In a sense, these are
- ISIS bugs triggered by unreasonable behavior in the application itself.
-
- We will fix our software, but in the meantime there is a work-around for
- both problems. Code with long timewarps should use the ISIS_USEITIMER
- option and bracket long compute segments with a call to THREAD_LEAVE_ISIS()
- before the long computation and THREAD_ENTER_ISIS() after it. This will
- run ISIS off of timer interrupts during the long computation and eliminate
- the timewarps. Since the group members will now advance more or less
- at once (at least, no really extreme delays will occur), the timer
- associated with very slow flush protocols won't trigger. (The timeout
- underlying the bug is actually 15 seconds, but I think you need a pretty
- extreme situation to see the problem, which didn't show up at all in
- testing.)
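-
- The bracketing described above might look roughly like this in C.
- This is a sketch, not toolkit code: the two macros are replaced by
- counting stand-ins so the example compiles without the ISIS headers,
- sum_samples() is a made-up placeholder for a long computation, and
- enabling the ISIS_USEITIMER option is not shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-ins so this sketch compiles without the ISIS headers. In a real
 * ISIS program the toolkit supplies THREAD_LEAVE_ISIS/THREAD_ENTER_ISIS;
 * the counters exist only to make the bracketing visible.
 */
#ifndef THREAD_LEAVE_ISIS
static int leave_calls = 0;
static int enter_calls = 0;
#define THREAD_LEAVE_ISIS() (leave_calls++)
#define THREAD_ENTER_ISIS() (enter_calls++)
#endif

/* Hypothetical long-running computation: sums n samples. */
static double sum_samples(const double *samples, size_t n)
{
    double total = 0.0;
    size_t i;

    assert(samples != NULL || n == 0);

    THREAD_LEAVE_ISIS();        /* before the long computation */
    for (i = 0; i < n; i++)
        total += samples[i];
    THREAD_ENTER_ISIS();        /* after it, before using ISIS again */

    return total;
}
```

- The point of the pattern is that every long compute segment gets
- exactly one THREAD_LEAVE_ISIS() before it and one THREAD_ENTER_ISIS()
- after it, with no ISIS calls in between.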
-
- The fix will be in the V3.0.7 release, which only contains bug fixes.
- The other problems V3.0.7 will fix include a bug when isis_disconnect()
- is called after a fork by the child process (workaround: the child
- should call isis_has_crashed(-1) instead; other uses of isis_disconnect()
- are fine and should NOT be replaced with isis_has_crashed() in this
- manner!). We are also fixing a duplicate declaration that upsets some C
- compilers but not others; an alignment issue raised by our inclusion of
- BSD malloc code for use with ISIS_USEITIMER ("purify" wants 8-byte
- aligned memory allocation for some reason); and a bug that can cause a
- program to hang during reconnect after a disconnect at an "inopportune"
- stage of one of our protocols -- very hard to trigger.
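-
- For the fork case, the workaround might be sketched as follows. The
- isis_has_crashed() defined here is only a stand-in so the example
- compiles without the toolkit (its void return type is an assumption),
- and run_child() and example_child_work() are hypothetical names used
- for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stand-in for the ISIS routine named in the post. The real declaration
 * comes from the ISIS headers; the void return type is an assumption.
 */
static void isis_has_crashed(int arg)
{
    (void)arg;
}

/* Hypothetical work done by the child; returns its exit status. */
static int example_child_work(void)
{
    return 0;
}

/*
 * Fork a child that applies the workaround and then runs child_work().
 * Returns the child's exit status, or -1 if fork/waitpid failed or the
 * child did not exit normally.
 */
static int run_child(int (*child_work)(void))
{
    pid_t pid;
    int status;

    assert(child_work != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: call this in place of isis_disconnect(). */
        isis_has_crashed(-1);
        _exit(child_work());
    }

    /* Parent: wait for the child and report how it exited. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

- Only the child branch after fork() uses isis_has_crashed(-1); code that
- calls isis_disconnect() anywhere else should be left alone.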
-
- V3.0.7 will be out in about six weeks. It won't include any new code
- at all -- only these bug fixes.
-
- --
- Kenneth P. Birman E-mail: ken@cs.cornell.edu
- 4105 Upson Hall, Dept. of Computer Science TEL: 607 255-9199 (office)
- Cornell University Ithaca, NY 14853 (USA) FAX: 607 255-4428
-