- Newsgroups: comp.sys.isis
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!malgudi.oar.net!caen!destroyer!cs.ubc.ca!uw-beaver!cornell!ken
- From: ken@cs.cornell.edu (Ken Birman)
- Subject: The "27 day" problem (actually, 21.7 day problem)
- Message-ID: <1993Jan7.151550.6171@cs.cornell.edu>
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- Date: Thu, 7 Jan 1993 15:15:50 GMT
- Lines: 59
-
- This relates to a previously described bug in V3.0.7:
-
- In V3.0.7 and all previous releases of Isis, there is a bug that
- causes remote connections to be dropped if they are active, or are
- established, at any time from 21.7 days after protos was booted
- on a given node.
-
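- To plan around that 21.7-day window, it can help to work out when a
- given protos will reach it. The following is only a sketch, not part
- of Isis: the function names are made up for illustration, the boot
- time is assumed to be something you recorded yourself, and 21.7 days
- is just the rounded figure quoted above, so treat the result as
- approximate.
-
```python
from datetime import datetime, timedelta

# 21.7 days, the rounded interval quoted in the post (0.7 day = 16 h 48 min).
# The exact internal timer period is not given there, so this is approximate.
WRAP_INTERVAL = timedelta(days=21, hours=16, minutes=48)


def expected_wrap_time(boot_time: datetime) -> datetime:
    """Return the approximate moment at which protos wraps its timer."""
    return boot_time + WRAP_INTERVAL


def time_until_wrap(boot_time: datetime, now: datetime) -> timedelta:
    """Return the time left before the wrap; negative once it has passed."""
    return expected_wrap_time(boot_time) - now
```
-
- For a protos booted at midnight on 1 January, expected_wrap_time gives
- 16:48 on 22 January, which is the latest point by which a restart
- should have been scheduled.
-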
- For example, say that protos has been up on a host "server2" for
- 3 weeks. You will find that programs running directly on server2
- are fine, as are programs that use the slower UDP connection scheme
- to become remote clients of server2. But, programs that connect
- to server2 via TCP will disconnect from protos after about 45 seconds,
- thinking that protos has crashed.
-
- In addition to this, it turns out that if protos is very active at
- the instant when this timer wraps, Isis can develop other problems too.
- These show up as Isis becoming congested and getting stuck in
- a congested state -- basically, an internal garbage collection mechanism
- might freeze up when this timer wraps.
-
- Protos always prints ISIS TIMER RESET in the protos log when it
- tries to wrap this timer. The problems occur after that.
-
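- Since that message is the only visible sign of the wrap, a small
- script can look for it. This is a minimal sketch under assumptions:
- it relies only on the text ISIS TIMER RESET appearing somewhere in a
- log line, and makes no claim about the rest of the protos log format
- or where the log lives on your system. The function name is made up.
-
```python
from typing import Iterable, List

# Message text quoted in the post; nothing else about the log is assumed.
RESET_MARKER = "ISIS TIMER RESET"


def find_timer_resets(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of the lines that contain the marker."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if RESET_MARKER in line
    ]
```
-
- Passing an open log file works, because a file object yields its
- lines one at a time. An empty result means that this protos has not
- yet tried to wrap its timer.
-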
- I've tracked down the precise bug and I know how to fix it. The
- fix involves changes to protos/pr_main.c and protos/pr_inter.c,
- which then need to be recompiled. They take effect when protos is
- next restarted.
-
- Users with an urgent need for this fix can have it via email or FAX.
- Contact me if your situation requires this.
-
- However, I would prefer, if possible, to have people get this fix as
- part of V3.0.8, which is still on target for a release later this month.
- My reasoning is based on (bitter) experience: even the most minor fix
- can destabilize the system in unexpected ways. A known problem, however
- annoying, is probably better than a patch that hasn't been tested for
- several weeks and burned in.
-
- Previously, I suggested a second workaround (other than just rebooting
- protos once every few weeks): disabling the disconnect sensing code.
- That works, too, and if you are using that scheme you can go on doing
- so until we release V3.0.8.
-
- Now that I understand the problem, a third workaround has surfaced.
- If your system is such that you can shut down all the remote connections
- to a protos server right before it wraps its timer, then restart them
- after the ISIS TIMER RESET message appears in the log, this would also
- work. But having even one "rtcp" connection to protos at the time of
- the "wrap" is enough to trigger the bug.
-
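- The ordering in this third approach matters, so here is a sketch of
- it. Everything in it is hypothetical: Isis provides no such function,
- and the three callables stand for whatever site-specific steps you
- use to stop your remote clients, to check the protos log for a new
- ISIS TIMER RESET message, and to start the clients again.
-
```python
import time
from typing import Callable


def ride_out_wrap(
    stop_remote_clients: Callable[[], None],
    reset_logged: Callable[[], bool],
    start_remote_clients: Callable[[], None],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Stop remote clients, wait for the reset message, then restart them.

    Returns how many polls came back negative before the reset was seen.
    """
    # Every remote connection must be gone before the timer wraps.
    stop_remote_clients()

    waits = 0
    # reset_logged() is expected to report a NEW reset message in the log.
    while not reset_logged():
        sleep(poll_seconds)
        waits += 1

    # Reconnect only after the wrap has been logged.
    start_remote_clients()
    return waits
```
-
- Call it shortly before the expected wrap. The check passed in should
- ignore any reset message left in the log by an earlier wrap, otherwise
- the clients would be restarted too early.
-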
- Sorry about this! I know that it won't look good to have to restart
- protos this way, but perhaps it can be worked in with normal T/M
- activities to minimize disruption to your users. After all, you are
- safe for 3 weeks at a time.
-
- --
- Kenneth P. Birman E-mail: ken@cs.cornell.edu
- 4105 Upson Hall, Dept. of Computer Science TEL: 607 255-9199 (office)
- Cornell University Ithaca, NY 14853 (USA) FAX: 607 255-4428
-