NetNews Usenet Archive 1993 #1

Newsgroups: comp.sys.isis
Path: sparky!uunet!zaphod.mps.ohio-state.edu!malgudi.oar.net!caen!destroyer!cs.ubc.ca!uw-beaver!cornell!ken
From: ken@cs.cornell.edu (Ken Birman)
Subject: The "27 day" problem (actually, 21.7 day problem)
Message-ID: <1993Jan7.151550.6171@cs.cornell.edu>
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Date: Thu, 7 Jan 1993 15:15:50 GMT
Lines: 59

This relates to a previously described bug in V3.0.7:

In V3.0.7 and all previous releases of Isis, there is a bug that causes
remote connections to get dropped if they are active or established any
time starting 21.7 days after protos was booted on a given node.

For example, say that protos has been up on a host "server2" for 3
weeks. You will find that programs running directly on server2 are
fine, as are programs that use the slower UDP connection scheme to
become remote clients of server2. But, programs that connect to server2
via TCP will disconnect from protos after about 45 seconds, thinking
that protos has crashed.

In addition to this, it turns out that if protos is very active at the
instant when this timer wraps, Isis can develop other problems too.
They would show up by Isis becoming congested and getting stuck in a
congested state -- basically, an internal garbage collection mechanism
might freeze up when this timer wraps.

Protos always prints
    ISIS TIMER RESET
in the protos log when it tries to wrap this timer. The problems occur
after that.

I've tracked down the precise bug and I know how to fix it. The fix
involves changes to protos/pr_main.c and protos/pr_inter.c, which then
need to be recompiled. They take effect when protos is next restarted.

Users with an urgent need for this fix can have it via email or FAX.
Contact me if your situation requires this. However, I would prefer,
if possible, to have people get this fix as part of V3.0.8, which is
still on target for a release later this month.
My reasoning is based on (bitter) experience: even the most minor fix
can destabilize the system in unexpected ways. A known problem, however
annoying, is probably better than a patch that hasn't been tested for
several weeks and burned in.

Previously, I suggested a second workaround (other than just rebooting
protos once every few weeks): disabling the disconnect sensing code.
That works, too, and if you are using that scheme you can go on doing
so until we release V3.0.8.

Now that I understand the problem, a third way of working around it has
surfaced. If your system is such that you can shut down all the remote
connections to a protos server right before it wraps its timer, then
restart them after the ISIS TIMER RESET message is found in the log,
this would also work. But, having even one "rtcp" connection to protos
at the time of the "wrap" is enough to trigger the bug.

Sorry about this! I know that it won't look good to have to restart
protos this way, but perhaps it can be worked in with normal T/M
activities to minimize disruption to your users. After all, you are
safe for 3 weeks at a time.

-- 
Kenneth P. Birman                           E-mail: ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science  TEL: 607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)   FAX: 607 255-4428
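
For sites that schedule the workarounds described in the post, the
timing can be sketched in a few lines of Python. This is only an
illustration and not part of Isis: the function names, the one-day
safety margin, and the idea of feeding log lines to a helper are all
assumptions. The only facts taken from the post are the 21.7-day
interval (written here as 21 days, 16 hours, 48 minutes) and the
"ISIS TIMER RESET" log string.

```python
from datetime import datetime, timedelta

# From the post: the timer wraps about 21.7 days after protos boots.
# 0.7 day = 16 h 48 min, so the interval is spelled out exactly.
WRAP_INTERVAL = timedelta(days=21, hours=16, minutes=48)

# From the post: protos prints this string in its log at the wrap.
RESET_MARKER = "ISIS TIMER RESET"


def estimated_wrap_time(boot_time):
    """Approximate time at which protos will wrap its timer."""
    return boot_time + WRAP_INTERVAL


def should_restart(boot_time, now, margin=timedelta(days=1)):
    """True once `now` is within `margin` of the estimated wrap.

    The one-day default margin is an arbitrary, hypothetical choice.
    """
    return now >= estimated_wrap_time(boot_time) - margin


def log_shows_reset(lines):
    """True if any log line contains the timer-reset message.

    Only the marker text is known from the post; the rest of the log
    line format is not assumed.
    """
    return any(RESET_MARKER in line for line in lines)
```

A site could call should_restart() from a daily job with the recorded
protos boot time, or use log_shows_reset() on the protos log to decide
when remote connections are safe to re-establish.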