Path: sparky!uunet!elroy.jpl.nasa.gov!swrinde!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!ken
From: ken@cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: crash discovery in ISIS
Message-ID: <1992Sep16.001715.11391@cs.cornell.edu>
Date: 16 Sep 92 00:17:15 GMT
References: <13162@ztivax.UUCP>
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 64

In article <13162@ztivax.UUCP> mach@ztivax.UUCP (Juergen Schmitz) writes:
>Hello,
>after reading some of your documents describing ISIS, I would like to know
>how your site failure detection works. Is this simply done by polling a node
>and waiting a while for an answer, or are there sophisticated protocols for
>recognizing a crashed site?
>It would be great if you could give me hints about papers describing this.

The current algorithm does polling:

- For a site (== backbone site), we organize the sites as a ring.
Every F/3 seconds, site i sends a message to site i+1, mod the number of
sites, and expects that message to be acknowledged.  After F seconds with
no ack, site i times out and considers site i+1 to be faulty.
The default value for F is 60 seconds, but you can change this in isis.rc:
	../bin/protos <isis-protos> -f10
would, as an example, run protos using a 10-second timer on failure
detection.  (A C sketch of this timing rule appears after the list.)

- For a process on a site, we rely on several mechanisms:
  * A graceful exit by a process (in this case, the process sends a
    "close" message to protos).
  * If the connection from the backbone to the process breaks, we signal
    a failure immediately.  (Unfortunately, TCP breaks connections very
    slowly, and this is part of the standard...)
  * Under the rules for isis_probe(i,j): process P must send a message
    to protos every i seconds.  After i+j seconds with no message, protos
    cuts the channel.  (Also sketched after the list.)
  * For a UDP connection (when Isis runs out of TCP connections, or if
    the ISISPORT value is set wrong but ISISREMOTE is set correctly), we
    declare a failure once a message has gone unacked for more than 45
    seconds despite more than 5 retransmissions.
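
In rough C, the site-level timing rule looks something like the sketch
below.  This is an illustration, not the protos source; F, the simulated
crash time, and peer_alive() are all made up:

    /* Ring heartbeat: every F/3 seconds site i pings site i+1 (mod n); */
    /* if no ack arrives for F seconds, site i declares site i+1 faulty. */
    #include <stdio.h>

    #define F 60                     /* failure-detection window, seconds */

    static int peer_alive(int now)   /* stand-in for the network; the     */
    {                                /* peer "crashes" at t = 100         */
        return now < 100;
    }

    int main(void)
    {
        int last_ack = 0;            /* time of last ack from site i+1    */

        for (int now = 0; now < 300; now += F / 3) {
            if (peer_alive(now)) {
                last_ack = now;      /* ping acked promptly               */
            } else if (now - last_ack >= F) {
                printf("t=%d: site i+1 declared faulty (silent %ds)\n",
                       now, now - last_ack);
                return 0;
            } else {
                printf("t=%d: ping unacked, still waiting (%d of %ds)\n",
                       now, now - last_ack, F);
            }
        }
        return 0;
    }
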
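The isis_probe(i,j) bookkeeping is equally simple at heart.  Again a
sketch, not ISIS code; the struct and function names here are invented:

    /* A process must send protos some message every i seconds; protos */
    /* cuts the channel after i+j silent seconds.                      */
    #include <stdio.h>
    #include <time.h>

    struct chan {
        int    i, j;       /* parameters from isis_probe(i,j)        */
        time_t last_msg;   /* last time the process sent us anything */
    };

    static void chan_note_msg(struct chan *c, time_t now)
    {
        c->last_msg = now; /* any traffic counts as a liveness probe */
    }

    static int chan_expired(const struct chan *c, time_t now)
    {
        return now - c->last_msg > c->i + c->j;
    }

    int main(void)
    {
        struct chan c = { 10, 5, 0 };
        chan_note_msg(&c, 100);                   /* message at t = 100  */
        printf("t=110: cut? %d\n", chan_expired(&c, 110));  /* 0: in time */
        printf("t=116: cut? %d\n", chan_expired(&c, 116));  /* 1: i+j past */
        return 0;
    }
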
To summarize: we use a maze of heuristics, introduced over time to deal
with real problems we saw in our large installations.  For example, one
large New York investment bank ran tests for 72 hours at a time with
processes coming and going roughly 2 or 3 times per second, and with
up to 4 or 5 backbone sites and perhaps 80 or 100 processes at a time.
The processes exited by just calling the UNIX exit system call.  Tests
like this really stressed our code, and the complex mixture of detection
schemes above is intended to minimize the impact of a faulty process or
site on the rest of the system.

I can't recommend any papers on detecting failures.  I can make some
observations from experience, though:
  * The fastest form of failure detection is when the faulty process
    senses its own failure and shuts down gracefully and promptly.  So:
    design high-reliability software to be self-checking.  (A sketch of
    one way to do this follows the list.)
  * Processes hang under UNIX more often than one might expect.  In a
    big system, it is cheaper to disconnect the slow process than to
    wait; the disconnected process can always use isis_failed() to trap
    this and reconnect.
  * Uncooperative applications are very hard to work with -- code that
    doesn't use ISIS_USEITIMER and doesn't call isis_accept_events()
    often enough.
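
To make the first point concrete, here is one way a process can check
itself.  This is generic UNIX code, not part of ISIS, and the 5-second
tick is arbitrary:

    /* SIGALRM watchdog: if the main loop hasn't checked in since the */
    /* last tick, exit promptly instead of lingering half-dead.       */
    #include <signal.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t checked_in = 1;

    static void watchdog(int sig)
    {
        (void)sig;
        if (!checked_in)
            _exit(1);      /* hung: die fast so peers detect it sooner */
        checked_in = 0;    /* main loop must set this before next tick */
    }

    int main(void)
    {
        struct itimerval iv = { { 5, 0 }, { 5, 0 } };  /* tick every 5s */
        signal(SIGALRM, watchdog);
        setitimer(ITIMER_REAL, &iv, NULL);

        for (;;) {         /* the application's event loop             */
            checked_in = 1;
            sleep(1);      /* real work would go here                  */
        }
    }
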
We haven't written this up because it would never get accepted by a
reputable journal.  This is engineering, not science...

You may want to read Jim Gray's paper on why systems fail.  He wrote this
about 5 years ago, then revised it about 3 years ago.  I think it appeared
in IEEE Computer.
--
Kenneth P. Birman                              E-mail:  ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science     TEL:  607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)      FAX:  607 255-4428