- Newsgroups: comp.sys.isis
- Path: sparky!uunet!caen!batcomputer!cornell!ken
- From: ken@cs.cornell.edu (Ken Birman)
- Subject: Re: clients loosing connection
- Message-ID: <1993Jan4.214514.3759@cs.cornell.edu>
- Keywords: isis,clients,connection
- Organization: Cornell Univ. CS Dept, Ithaca NY 14853
- References: <1ia2mnINN9ak@fnnews.fnal.gov>
- Date: Mon, 4 Jan 1993 21:45:14 GMT
- Lines: 102
-
- In article <1ia2mnINN9ak@fnnews.fnal.gov> udumula@fndaug.fnal.gov (Lourdu Udumula) writes:
- >Hi,
- > I am a new user of this system. I installed the isis system on
- >a sun machine. I started up a client on the same machine which just waits for
- >isis messages. The client is up and waiting as expected. I started the same
- >client on a different machine. The client comes up and runs ok but quits if
- >it is left running for a while. The message I get is this
- >
- >
- >ISIS client rtcp_159: lost connection to <isis-protos>
- >
- >I dont know what is wrong. Did I install the isis toolkit properly or am i
- >making a mistake in the client programs. The same thing happened with a demo
- >program that I got with isis. I am talking about the bank service program.
- >So I suspect that there could be something wrong with isis installation.
-
- Actually, if it gets far enough to print this and nothing else, I
- suspect something is wrong with Isis or some sort of O/S interaction
- is at fault.
-
- A connection to protos can break under several conditions:
- 1. protos crashes.
- 2. the client crashes, or TCP thinks it did.
- 3. protos doesn't send pings through and the client concludes that
- protos is dead
- 4. the client doesn't send "I'm alive" messages to protos and protos
- concludes that the client is dead.
- 5. some other program is trying to talk to this client and has not seen
- acknowledgements despite having retransmitted some message 10 times
- AND having waited until "RTTIMEOUT" seconds elapsed (currently, 45)
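-
- As a rough illustration of condition 5 (a sketch, not code from Isis): the
- sender gives up only when both the retry count and the elapsed time have
- been exceeded. The names pending_msg, MAX_RETRIES and should_give_up are
- made up for this example, and the #ifndef guard is only an assumption about
- how a compile-time RTTIMEOUT override might be honored.

```c
#include <stdbool.h>

/* Defaults taken from the text: 10 retransmissions, 45 seconds.
 * The #ifndef guard is an assumption about how -DRTTIMEOUT=... could
 * take effect; it is not copied from cl_inter.c. */
#ifndef RTTIMEOUT
#define RTTIMEOUT 45            /* seconds */
#endif
#define MAX_RETRIES 10          /* hypothetical name */

/* Hypothetical bookkeeping for one unacknowledged message. */
struct pending_msg {
    int  retries;               /* retransmissions sent so far */
    long first_sent;            /* time of first transmission, in seconds */
};

/* True only when BOTH conditions from case 5 hold. */
bool should_give_up(const struct pending_msg *m, long now)
{
    return m->retries >= MAX_RETRIES &&
           (now - m->first_sent) >= RTTIMEOUT;
}
```

- Under this reading, a peer that has already missed 10 retransmissions is
- not dropped until the full RTTIMEOUT interval has also passed.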
-
- You can distinguish these cases:
- 1. Pretty obvious. Not your problem...
-
- 2. This is usually due to a very flaky network, e.g. if you try to
- run Isis as if a node in Italy and a node in New York were on the
- same LAN...
-
- Workaround -- get a better link, or split the system into two LANs and
- use the wide-area system to connect them.
-
- 3. We think there may be a bug in this part of Isis, but have only seen
- it after protos has been up for 27 days. This is long enough to overflow
- a signed 32-bit timer with units in milliseconds, so probably there is
- some bug in the timer code -- I am studying this and will fix it in
- ISIS V3.0.8. In this case protos puts nothing in its log, and the
- client suddenly jumps into its isis_disconnect() routine -- you can
- deduce that this case occurred by ruling out cases 4 and 5, below.
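-
- The arithmetic behind the overflow guess can be checked with a few lines
- of C. This is only an illustration; the function name is made up, and
- nothing here says where in the Isis timer code the problem actually lies.

```c
#include <stdint.h>

/* Days until a signed 32-bit millisecond counter reaches INT32_MAX. */
double days_until_int32_ms_overflow(void)
{
    const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0;  /* 86,400,000 */
    return (double)INT32_MAX / ms_per_day;
}
```

- The result is a little under 25 days, so a protos that has been up for
- 27 days has already passed the point where such a counter would wrap.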
-
- A workaround for this was posted around December 23.
- You need to comment out one line of cl_isis.c:
- /* Line 2288 of cl_isis.c */
- VOID
- protos_disconnect()
- {
-     protos_timer = T47e9b894(protos_timer, 45000, protos_disconnect, 0, 0);
-     if((P366170ce-last_input >= 40000) && !R76d124b4(&H46d3cd8))
-         /* Ld0b19d0(-1); */   /* <----- COMMENT THIS OUT */
-         ;                     /* empty statement so the if still compiles */
- }
- The problem this can introduce, however, is that once you
- make this change, Isis may run into a UNIX bug in TCP.
-
- TCP on some systems doesn't report that channels have broken,
- so cases (1) and (2) are never reported -- SUN 4.1.1 has this bug, for
- example.
-
- With this line commented out, what can happen is that a protos failure
- (case 1) will not be detected. So, if you do make this change, be
- aware that this could now become a problem, and test to make sure that
- if protos does crash, your application notices. The underlying problem
- is a bug in TCP, so whether you are affected depends on which version of
- TCP your vendor happens to have used; many users won't have a problem...
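-
- For readers who want to see what a wraparound-safe silence check can look
- like, here is a minimal sketch. It assumes the obfuscated snippet above
- compares "current time minus last_input" against 40000 ms; that reading,
- and the names elapsed_ms and peer_silent_too_long, are assumptions and
- not Isis code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SILENCE_LIMIT_MS 40000u   /* threshold seen in the snippet */

/* Elapsed milliseconds between two readings of a free-running 32-bit
 * counter. Unsigned subtraction is defined modulo 2^32, so the result
 * stays correct across a single wraparound of the counter. */
uint32_t elapsed_ms(uint32_t now, uint32_t earlier)
{
    return now - earlier;
}

/* Hypothetical watchdog test: has the peer been silent too long? */
bool peer_silent_too_long(uint32_t now, uint32_t last_input)
{
    return elapsed_ms(now, last_input) >= SILENCE_LIMIT_MS;
}
```

- Keeping the timestamps unsigned avoids a spurious "silent for a very long
- time" result at the moment the millisecond counter wraps.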
-
- 4. In old copies of Isis, like V3.0.5, there was a bug that caused
- this. Currently, we think this is working and the issue is
- usually that isis_probe() has been set (or allowed to default)
- to too low a value. This is mostly a problem in Lisp or Ada, where a
- flaky scheduler can starve Isis for long periods of time.
-
- In this case protos puts a message in its log (look in 1.logdir/1.log
- for site number 1):
- *** TIMEOUT: client failed, rtcp_67
-
- Work-around: increase isis_probe() parameters.
-
- 5. In this case the client that had problems sending prints a message to
- its console:
- WARNING: unable to communicate with (1/0:1234.0)
- ... asking <isis/protos> to break connection
- and protos subsequently puts this in the log on site 1:
- *** DISCONNECTING from (1/0:1234.0)
- [host fafnir.cs.cornell.edu (128.19.27.0:1876)]
- ... not responding to UDP messages after 10 retries and 30 seconds
- (NB: the number 30 is always shown; the number should be the value of
- RTTIMEOUT that was used -- a minor thing, not a bug).
-
- Work-around: recompile cl_inter.c with -DRTTIMEOUT=99 or some other
- number larger than 45.
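-
- The log checks for cases 4 and 5 can be automated along these lines.
- This is a sketch only: classify_log_line and the enum are invented for
- the example, and the marker strings are copied from the log excerpts
- above. A line matching neither marker is reported as unknown; if protos
- logged nothing at all, case 3 remains the candidate.

```c
#include <string.h>

enum disconnect_cause {
    CAUSE_UNKNOWN = 0,          /* no recognizable marker in this line */
    CAUSE_CLIENT_TIMEOUT = 4,   /* case 4: protos timed the client out */
    CAUSE_UDP_UNREACHABLE = 5   /* case 5: peer gave up on UDP retries */
};

/* Classify one line of the protos log by its marker text. */
enum disconnect_cause classify_log_line(const char *line)
{
    if (strstr(line, "*** TIMEOUT: client failed") != NULL)
        return CAUSE_CLIENT_TIMEOUT;
    if (strstr(line, "*** DISCONNECTING from") != NULL)
        return CAUSE_UDP_UNREACHABLE;
    return CAUSE_UNKNOWN;
}
```

- Run over each line of the site's log file, this narrows down which of
- the work-arounds above is worth trying first.
-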
- --
- Kenneth P. Birman E-mail: ken@cs.cornell.edu
- 4105 Upson Hall, Dept. of Computer Science TEL: 607 255-9199 (office)
- Cornell University Ithaca, NY 14853 (USA) FAX: 607 255-4428
-