- Path: sparky!uunet!ogicse!das-news.harvard.edu!cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!andrew.cmu.edu!<UNAUTHENTICATED>+
- From: snl+@cs.cmu.edu (Sean Levy)
- Newsgroups: comp.sys.isis
- Subject: Re: join never returns
- Message-ID: <kf0i4o600hNS8CkVpk@cs.cmu.edu>
- Date: 12 Nov 92 12:56:52 GMT
- Article-I.D.: cs.kf0i4o600hNS8CkVpk
- References: <Af05oZO00hNSI1DYZ2@cs.cmu.edu>
- <1992Nov12.012210.29358@cs.cornell.edu>
- Organization: Carnegie Mellon, Pittsburgh, PA
- Lines: 67
- In-Reply-To: <1992Nov12.012210.29358@cs.cornell.edu>
-
- Excerpts from netnews.comp.sys.isis: 12-Nov-92 Re: join never returns
- Ken Birman@cs.cornell.ed (2033)
-
- > In article <Af05oZO00hNSI1DYZ2@cs.cmu.edu> Sean Levy <snl+@cs.cmu.edu> writes:
- > (description of a join problem)
-
- > I can see from your log that everything is piling up waiting for
- > replies from one or two of your clients (e.g.: sent to xxxx, status W
- > means "waiting for a reply from xxxx"). But, lacking logs from
- > xxxx I don't know why.
-
- My processes are very simple right now, and don't do logging. Sigh.
-
- > Some random ideas: if TCP channel breakage is not always working
- > right on your systems (and this is a common thing we see on SUN
- > systems, for example), then if isis_probe isn't set you might have
- > Isis fail to notice that xxxx is dead and so hang. But, I bet that
- > this is not the problem. V2.2.7 and V3.0.7, at least, would not have
- > such a problem.
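To illustrate why probing matters here, a minimal sketch of a probe-based failure detector follows. It is a toy model, not how Isis implements isis_probe: the class `ProbeMonitor` and its methods are hypothetical names made up for this example. The idea is only that a peer which stops answering probes is eventually suspected, instead of being waited on forever when the TCP channel fails to report the break.

```python
class ProbeMonitor:
    """Toy failure detector (hypothetical, not the Isis API)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated before suspicion
        self.last_reply = {}     # peer name -> time of its last probe reply

    def record_reply(self, peer, now):
        """Note that `peer` answered a probe at time `now`."""
        self.last_reply[peer] = now

    def suspected(self, now):
        """Return the peers whose last reply is older than the timeout."""
        return sorted(
            peer for peer, t in self.last_reply.items()
            if now - t > self.timeout
        )
```

Without some mechanism of this kind, a sender in status W has no way to tell a slow client from a dead one.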
-
- > Some evidence that your TCP is having trouble is the failure to restart
- > after shutting down: seems that UNIX is not deallocating the TCP
- > data structure in the kernel and hence Isis can't reopen it.
-
- > SUN had problems in this part of TCP in one of their releases a while
- > back. If you are on ISIS V2.2.5 on a SUN 4.1.1c platform, for example,
- > this could explain it. But later releases of SunOS, and also of ISIS
- > (either of them), would probably not have this problem.
-
- Welp, this may be it. I have V3.0.5. We are running SunOS 4.1.1 (not
- sure about the "c"). Since there is zero probability I can get 3.0.7
- between now and tomorrow (my demo is tomorrow), I'm going to fail-over
- to my "b" option and move everything to pmaxen running Ultrix.
- Unfortunately, this means I have to swab my RDBMS gunk offline (I hate
- byte order problems). Sigh.
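For readers unfamiliar with the byte order problem: SPARC machines are big-endian and the MIPS-based DECstations (pmaxen) are little-endian, so binary data written on one must have its multi-byte fields reversed before the other can read it. A minimal sketch, assuming purely for illustration that the data is a flat run of 32-bit words (the actual RDBMS layout is not described in this post, and `swab32_words` is a hypothetical helper name):

```python
import struct

def swab32_words(data):
    """Reverse the byte order of every 32-bit word in a bytes buffer."""
    if len(data) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    count = len(data) // 4
    # Decode as big-endian unsigned 32-bit words (the SPARC layout) ...
    words = struct.unpack(">%dI" % count, data)
    # ... and re-encode the same values as little-endian (the DECstation layout).
    return struct.pack("<%dI" % count, *words)
```

Real records usually mix field widths and character data, so an offline conversion has to know the record layout; swapping every word blindly would scramble strings.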
-
- > Another idea: suppose you join multiple groups, say p joins A and
- > then B while q joins B and then A. If they don't call isis_start_done
- > FIRST, then they can deadlock, because p needs to help q on its join
- > and vice versa. You would only see this in "concurrent" join situations.
- > This could explain why adding some extra groups caused the problem: maybe
- > you did so in a way that introduced a cyclic join pattern?
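The cyclic join pattern described here can be pictured as a cycle in a wait-for graph: p (already in A, joining B) waits for q, while q (already in B, joining A) waits for p. The sketch below is a generic cycle check over such a graph, not anything taken from Isis; `find_wait_cycle` and the dictionary representation are assumptions made for illustration.

```python
def find_wait_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    `waits_for` maps each process name to the processes it is blocked on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(waits_for):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

For the two-process case above, `find_wait_cycle({"p": ["q"], "q": ["p"]})` reports the loop through p and q; if either process finishes its startup before joining, its outgoing edge disappears and no cycle remains.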
-
- Again, I don't THINK this is the problem, but I will look into it.
-
- > Did you find client-created xxxx.log files after your snapshot? When
- > you see protos log files that show people waiting for certain programs
- > to take an action, the next step is to have a close look at the state
- > of those programs...
-
- Agreed. Unfortunately, my demo tomorrow calls.
-
- Any more info you can give me on this (like any quick patches I can make
- to the 3.0.5 sources to work around the problem) would be greatly
- appreciated.
-
- > --
- > Kenneth P. Birman E-mail: ken@cs.cornell.edu
- > 4105 Upson Hall, Dept. of Computer Science TEL: 607 255-9199 (office)
- > Cornell University Ithaca, NY 14853 (USA) FAX: 607 255-4428
-
- Cheers,
- -- Sean
-
- --
- Sean Levy, n-dim Group, EDRC, CMU, 5000 Forbes Ave, PGH, PA 15213
- Email: snl+@cmu.edu, Phone: +1 412 268 5221, Fax: +1 412 268 5229
-