Path: sparky!uunet!elroy.jpl.nasa.gov!swrinde!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!ken
From: ken@cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: crash discovery in ISIS
Message-ID: <1992Sep16.001715.11391@cs.cornell.edu>
Date: 16 Sep 92 00:17:15 GMT
References: <13162@ztivax.UUCP>
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 64

In article <13162@ztivax.UUCP> mach@ztivax.UUCP (Juergen Schmitz) writes:
>Hello,
>after reading some of your documents describing ISIS, I would like to know
>how your site failure detection works. Is this simply done by polling a node
>and waiting a while for an answer, or are there sophisticated protocols for
>recognizing a crashed site?
>It would be great if you could give me hints about papers describing this.

The current algorithm does polling:

- For a site (== backbone site), we organize the sites as a ring.
Every F/3 seconds, site i sends a message to site i+1, mod the number of
sites, and expects that message to be acknowledged.  After F seconds with
no ack, site i times out and considers site i+1 to be faulty.
The default value for F is 60 seconds, but you can change this in isis.rc:
	../bin/protos <isis-protos> -f10
would, as an example, run protos using a 10-second timer on failure
detection.  (A C sketch of this timing rule appears after the list.)

- For a process on a site, we rely on several mechanisms:
  * A graceful exit by a process (in this case, the process sends a
    "close" message to protos).
  * If the connection from the backbone to the process breaks, we signal
    a failure immediately.  (Unfortunately, TCP breaks connections very
    slowly, and this is part of the standard...)
  * Under the rules for isis_probe(i,j): process P must send a message
    to protos every i seconds.  After i+j seconds with no message, protos
    cuts the channel.  (Also sketched after the list.)
  * For a UDP connection (when Isis runs out of TCP connections, or if
    the ISISPORT value is set wrong but ISISREMOTE is set correctly), we
    declare a failure once a message has gone unacked for more than 45
    seconds despite more than 5 retransmissions.
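
In rough C, the site-level timing rule looks something like the sketch
below.  This is an illustration, not the protos source; F, the simulated
crash time, and peer_alive() are all made up:

    /* Ring heartbeat: every F/3 seconds site i pings site i+1 (mod n); */
    /* if no ack arrives for F seconds, site i declares site i+1 faulty. */
    #include <stdio.h>

    #define F 60                     /* failure-detection window, seconds */

    static int peer_alive(int now)   /* stand-in for the network; the     */
    {                                /* peer "crashes" at t = 100         */
        return now < 100;
    }

    int main(void)
    {
        int last_ack = 0;            /* time of last ack from site i+1    */

        for (int now = 0; now < 300; now += F / 3) {
            if (peer_alive(now)) {
                last_ack = now;      /* ping acked promptly               */
            } else if (now - last_ack >= F) {
                printf("t=%d: site i+1 declared faulty (silent %ds)\n",
                       now, now - last_ack);
                return 0;
            } else {
                printf("t=%d: ping unacked, still waiting (%d of %ds)\n",
                       now, now - last_ack, F);
            }
        }
        return 0;
    }
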
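The isis_probe(i,j) bookkeeping is equally simple at heart.  Again a
sketch, not ISIS code; the struct and function names here are invented:

    /* A process must send protos some message every i seconds; protos */
    /* cuts the channel after i+j silent seconds.                      */
    #include <stdio.h>
    #include <time.h>

    struct chan {
        int    i, j;       /* parameters from isis_probe(i,j)        */
        time_t last_msg;   /* last time the process sent us anything */
    };

    static void chan_note_msg(struct chan *c, time_t now)
    {
        c->last_msg = now; /* any traffic counts as a liveness probe */
    }

    static int chan_expired(const struct chan *c, time_t now)
    {
        return now - c->last_msg > c->i + c->j;
    }

    int main(void)
    {
        struct chan c = { 10, 5, 0 };
        chan_note_msg(&c, 100);                   /* message at t = 100  */
        printf("t=110: cut? %d\n", chan_expired(&c, 110));  /* 0: in time */
        printf("t=116: cut? %d\n", chan_expired(&c, 116));  /* 1: i+j past */
        return 0;
    }
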
To summarize: we use a maze of heuristics, introduced over time to deal
with real problems we saw in our large installations.  For example, one
large New York investment bank ran tests for 72 hours at a time with
processes coming and going roughly 2 or 3 times per second, and with
up to 4 or 5 backbone sites and perhaps 80 or 100 processes at a time.
The processes exited by just calling the UNIX exit system call.  Tests
like this really stressed our code, and the complex mixture of detection
schemes above is intended to minimize the impact of a faulty process or
site on the rest of the system.

I can't recommend any papers on detecting failures.  I can make some
observations from experience, though:
  * The fastest form of failure detection is when the faulty process
    senses its own failure and shuts down gracefully and promptly.  So:
    design high-reliability software to be self-checking.  (A sketch of
    one way to do this follows the list.)
  * Processes hang under UNIX more often than one might expect.  In a
    big system, it is cheaper to disconnect the slow process than to
    wait; the disconnected process can always use isis_failed() to trap
    this and reconnect.
  * Uncooperative applications are very hard to work with -- code that
    doesn't use ISIS_USEITIMER and doesn't call isis_accept_events()
    often enough.
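
To make the first point concrete, here is one way a process can check
itself.  This is generic UNIX code, not part of ISIS, and the 5-second
tick is arbitrary:

    /* SIGALRM watchdog: if the main loop hasn't checked in since the */
    /* last tick, exit promptly instead of lingering half-dead.       */
    #include <signal.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t checked_in = 1;

    static void watchdog(int sig)
    {
        (void)sig;
        if (!checked_in)
            _exit(1);      /* hung: die fast so peers detect it sooner */
        checked_in = 0;    /* main loop must set this before next tick */
    }

    int main(void)
    {
        struct itimerval iv = { { 5, 0 }, { 5, 0 } };  /* tick every 5s */
        signal(SIGALRM, watchdog);
        setitimer(ITIMER_REAL, &iv, NULL);

        for (;;) {         /* the application's event loop             */
            checked_in = 1;
            sleep(1);      /* real work would go here                  */
        }
    }
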
We haven't written this up because it would never get accepted by a
reputable journal.  This is engineering, not science...

You may want to read Jim Gray's paper on why systems fail.  He wrote this
about 5 years ago, then revised it about 3 years ago.  I think it appeared
in IEEE Computer.
--
Kenneth P. Birman                              E-mail:  ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science     TEL:  607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)      FAX:  607 255-4428