- Organization: Carnegie Mellon, Pittsburgh, PA
- Path: sparky!uunet!news.tek.com!psgrain!charnel!rat!usc!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!cis.ohio-state.edu!news.sei.cmu.edu!fs7.ece.cmu.edu!crabapple.srv.cs.cmu.edu!andrew.cmu.edu!<UNAUTHENTICATED>+
- Newsgroups: comp.sys.isis
- Message-ID: <Af05oZO00hNSI1DYZ2@cs.cmu.edu>
- Date: Tue, 10 Nov 1992 20:08:53 -0500
- From: Sean Levy <snl+@cs.cmu.edu>
- Subject: join never returns
- Lines: 353
-
- This is with ISIS V3.0.5 on a sun4. I have the following, admittedly
- ugly, code which intends to join a process group called /ndim/ws and
- generally set things up.
- ---
- void
- join()
- {
- void recieve(), change();
-
- #ifdef TALKY
- fprintf(stderr, "ISIS setup -- joining ndim process group\n");
- #endif
- isis_remote_init(0, 0, 0, ISIS_NOCOPYRIGHTNOTICE | ISIS_PANIC);
- isis_task(join, "join");
- isis_entry(NDIM_RECV, recieve, "recieve");
- gaddr = pg_join("/ndim/ws",
- PG_MONITOR, change, 0,
- 0);
- if (addr_isnull(gaddr)) {
- fprintf(stderr, "Error initializing ISIS on port %d - aborting.\n",
- port_no);
- exit(200);
- }
- #ifndef OWN_MAINLOOP
- #ifdef TALKY
- fprintf(stderr, "initializing ISIS/Tk interaction\n");
- #endif
- init_ISIS_Tk();
- #endif
- isis_start_done();
- #ifdef TALKY
- fprintf(stderr, "ISIS startup completed\n");
- #endif
- }
- ---
- This application uses the Tcl/Tk language/UI toolkit, so I can't use
- isis_mainloop() because Tk has its own main loop. The code to set up the
- Tk/ISIS linkage (called from join(), above) is:
- ---
- void
- init_ISIS_Tk()
- {
- extern int isis_socket;
- extern int intercl_socket;
- void handle_ISIS_input();
- void add_next_ISIS_timeout();
-
- #ifdef TALKY
- fprintf(stderr, "ISIS/Tk: isis_socket=%d intercl_socket=%d\n",
- isis_socket,intercl_socket);
- #endif
- add_next_ISIS_timeout();
- Tk_CreateFileHandler(isis_socket, TK_READABLE|TK_EXCEPTION,
- handle_ISIS_input, 0);
- Tk_CreateFileHandler(intercl_socket, TK_READABLE|TK_EXCEPTION,
- handle_ISIS_input, 0);
- #ifdef TALKY
- fprintf(stderr, "ISIS/Tk connection complete\n");
- #endif
- }
-
- void
- add_next_ISIS_timeout()
- {
- void handle_ISIS_timeout();
- unsigned long next_timeout, timeout;
-
- next_timeout = isis_next_timeout();
- if (_ISIS_Tk_timeout != (Tk_TimerToken)0)
- Tk_DeleteTimerHandler(_ISIS_Tk_timeout);
- timeout = (unsigned long)(next_timeout + 1000);
- if (timeout > 1000)
- timeout = 1000;
- #ifdef TALKY2MUCH
- fprintf(stderr, "ISIS: next timeout at %lu (%lu)\n", next_timeout,timeout);
- #endif
- _ISIS_Tk_timeout = Tk_CreateTimerHandler(timeout,handle_ISIS_timeout,0);
- }
-
- void
- handle_ISIS_timeout(clientData)
- ClientData clientData;
- {
- void add_next_ISIS_timeout();
-
- _ISIS_Tk_timeout = (Tk_TimerToken)0;
- if (!_calling_ISIS) {
- #ifdef USE_ISIS_TIMEOUT
- static struct timeval timeout={1,0};
- #endif
- #ifdef TALKY2MUCH
- fprintf(stderr, "ISIS timeout - handling events\n");
- #endif
- _calling_ISIS = 1;
- #ifdef USE_ISIS_TIMEOUT
- isis_accept_events(ISIS_TIMEOUT, &timeout);
- #else
- isis_accept_events(ISIS_ASYNC);
- #endif
- _calling_ISIS = 0;
- #ifdef TALKY2MUCH
- fprintf(stderr, "ISIS events drained - scheduling next timeout\n");
- #endif
- }
- add_next_ISIS_timeout();
- }
-
- void
- handle_ISIS_input(clientData, mask)
- ClientData clientData;
- int mask;
- {
- void add_next_ISIS_timeout();
-
- #ifdef TALKY2MUCH
- fprintf(stderr, "ISIS input available\n");
- #endif
- if (!_calling_ISIS) {
- #ifdef USE_ISIS_TIMEOUT
- static struct timeval timeout={1,0};
- #endif
- _calling_ISIS = 1;
- #ifdef USE_ISIS_TIMEOUT
- isis_accept_events(ISIS_TIMEOUT, &timeout);
- #else
- isis_accept_events(ISIS_ASYNC);
- #endif
- _calling_ISIS = 0;
- }
- add_next_ISIS_timeout();
- }
- ---
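- Both handlers above repeat the same guard around isis_accept_events().
- A minimal sketch of factoring that guard into one helper follows. The
- names pump_fn, guarded_pump and counting_pump are hypothetical, and the
- callback stands in for the real isis_accept_events() call so that the
- fragment compiles without ISIS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: stands in for a call such as
 * isis_accept_events(ISIS_ASYNC) so the sketch builds without ISIS. */
typedef void (*pump_fn)(void *arg);

static int calling_pump = 0;   /* plays the role of _calling_ISIS */

/* Run pump(arg) unless a pump is already in progress on this stack.
 * Returns 1 if the pump ran, 0 if the call was skipped as re-entrant. */
int
guarded_pump(pump_fn pump, void *arg)
{
    assert(pump != NULL);
    if (calling_pump)
        return 0;
    calling_pump = 1;
    pump(arg);
    calling_pump = 0;
    return 1;
}

/* Example pump that counts invocations and tries to re-enter once, the
 * way a Tk handler could fire while ISIS is already being serviced. */
void
counting_pump(void *arg)
{
    int *count = (int *)arg;

    (*count)++;
    (void)guarded_pump(counting_pump, arg);   /* skipped by the guard */
}
```

- With a helper of this shape, each Tk handler would reduce to one
- guarded_pump() call followed by add_next_ISIS_timeout().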
- Finally, from my main(), I do the equivalent of:
- ---
- ...
- join();
- THREAD_ENTER_ISIS()
- ...
- Tk_MainLoop();
- ---
- "the equivalent of", because join() is called from something else that
- does uninteresting (for my purposes here) Tcl stuff. When I run this
- program, I sometimes but not always see the
- ISIS setup -- joining ndim process group
- message and then nothing. Nada. Zip. We are stuck in the join() routine.
- Running under gdb and ^C'ing after a while yields the following
- (predictable) stack:
- ---
- Starting program: /afs/cs.cmu.edu/user/snl/project/ndim/@sys/bin/ndim+ix+isis+DBG+MD -noinit -loads "bos/base"
- ISIS setup -- joining ndim process group
-
- Program received signal 2, Interrupt
- 0xf7663910 in select ()
- (gdb) where
- #0 0xf7663910 in select ()
- #1 0xe044 in run_isis ()
- #2 0x2c8f4 in isis_accept_events_loop ()
- #3 0x2d3fc in invoke ()
- (gdb)
- ---
- "cmd group /ndim/ws" gives the following lossage:
- ---
- 216 % cmd group /ndim/ws
- ld.so: warning: /usr/lib/libc.so.1.6 has older revision than expected 7
- *** gid = [site 7 / incarn 0 : gid 9]
- view = ["/ndim/ws" incarn 0 viewid 18.0 nmemb 2]
- members = (7/0:457.0)(14/1:rtcp_1.0)
- Network locations of members:
- (7/0:457.0) is at host loos.edrc.cmu.edu, pid 457
- (14/1:rtcp_1.0) ** timeout -- process in an infinite loop? **
- ... just a moment while I check with protos on site 14 (barbera.nectar.cs.cmu.edu)
- --> Sorry! unimplemented feature. use "cmd dump" at site 14 instead
- ---
- A dump of the ISIS system yields:
- ---
- << ISIS SYSTEM DUMP >>
- ... Time is now Tue Nov 10 20:03:12 1992
-
- PROTOCOLS PROCESS 14/1 INTERNAL DUMP REQUESTED: ISIS protocols process status dump
- Memory mgt: 9593 allocs, 9246 frees 385020 bytes in use
- Message counts: 9232 allocs 9202 frees (30 in use)
-
- tasks: scheduler 3ac40 ctp f55b4
- TASK[b96fc]: cl_abcast(7122c), wants 1 of 2 replies, got 0+0 null, msgid=2315
- Dests: (14/1:rtcp_1.join_req)(7/0:457.join_req); stat <WW>
- TASK[bd704]: cl_abcast(71d84), wants 1 of 2 replies, got 0+0 null, msgid=2341
- Dests: (14/1:rtcp_1.join_req)(7/0:457.join_req); stat <WW>
- TASK[b56f4]: cl_abcast(70d88), wants 1 of 2 replies, got 0+0 null, msgid=2369
- Dests: (14/1:rtcp_1.join_req)(7/0:457.join_req); stat <WW>
- TASK[a96dc]: cl_cbcast(710a0), wants 1 of 1 replies, got 0+0 null, msgid=2381
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[b16ec]: cl_cbcast(71124), wants 1 of 1 replies, got 0+0 null, msgid=2383
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[c170c]: cl_cbcast(70a70), wants 1 of 1 replies, got 0+0 null, msgid=2397
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[ad6e4]: cl_cbcast(7101c), wants 1 of 1 replies, got 0+0 null, msgid=2399
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[c5714]: cl_cbcast(71e08), wants 1 of 1 replies, got 0+0 null, msgid=2407
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[d957c]: cl_cbcast(70f98), wants 1 of 1 replies, got 0+0 null, msgid=2409
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[e158c]: cl_cbcast(71c7c), wants 1 of 1 replies, got 0+0 null, msgid=2417
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[dd584]: cl_cbcast(725c4), wants 1 of 1 replies, got 0+0 null, msgid=2419
- Dests: (14/1:rtcp_1.gethostname); stat <W>
- TASK[e959c]: cl_abcast(723b4), wants 1 of 2 replies, got 0+0 null, msgid=2423
- Dests: (14/1:rtcp_1.join_req)(7/0:457.join_req); stat <WW>
- TASK[e5594]: cl_abcast(70f14), wants 1 of 2 replies, got 0+0 null, msgid=2431
- Dests: (14/1:rtcp_1.join_req)(7/0:457.join_req); stat <WW>
- TASK[f55b4]: cl_wantdump(721a4), ** running **
- runqueue:
- Site view 7/4: 7 14/1
- Scope <edrc> = `000339fe000000000000000000000000'
- Scope <pmax_mach> = `0001220e000000000000000000000000'
- Scope <sun3_mach> = `00000030000000000000000000000000'
- Scope <sun4c_411> = `00000440000000000000000000000000'
- Scope <pmax_ul4> = `00000180000000000000000000000000'
- Scope <ri> = `00008600000000000000000000000000'
- Scope <hp800_ux3> = `00000800000000000000000000000000'
- Scope <rs_aix31> = `00001000000000000000000000000000'
- Scope <cs> = `00004000000000000000000000000000'
- Scope <sun4_41> = `0000c000000000000000000000000000'
- Scope <unknown> = `00020000000000000000000000000000'
-
- Process group views: root 6826c
- Client(14/1:isis.0)
- [ca01c] (gid 14/0.1[0])</sys/counters> =
- VID 4 = (7/0:isis.0)(14/1:isis.0)
- Client(14/1:-7.0)
- [ca58c] (gid 14/1.1[0])</XMGR-service> =
- VID 1 = (14/1:-7.0)
- Client(14/1:1827.0)
- [ca844] (gid 14/1.2[0])</ndim/db> =
- VID 1 = (14/1:1827.0)
- [ca2d4] (gid 14/1.3[0])(* cached *) </ndim/db/backend> =
- VID 1 = (14/1:1828.0)
- Client(14/1:1828.0)
- [caafc] (gid 14/1.3[0])</ndim/db/backend> =
- VID 1 = (14/1:1828.0)
- Client(14/1:rtcp_1.0)
- [cadb4] (gid 14/0.1[0])(* cached *) </sys/counters> =
- VID 4 = (7/0:isis.0)(14/1:isis.0)
- [cb06c] (gid 7/0.9[0])</ndim/ws> =
- VID 18 = (7/0:457.0)(14/1:rtcp_1.0)
- [cb324] (gid 14/1.3[0])(* cached *) </ndim/db/backend> =
- VID 1 = (14/1:1828.0)
- Client(14/1:2725.0)
- [cb894] (gid 14/1.10[0])</ndim/ws/published> =
- VID 1 = (14/1:2725.0)
- Client(14/1:2733.0)
- [cbe04] (gid 7/0.9[0])(* cached 1 iterating failed *) </ndim/ws> =
- VID 18 = (7/0:457.0)(14/1:rtcp_1.0)
- Client(14/1:2784.0)
- [cb5dc] (gid 7/0.9[0])(* cached 1 iterating failed *) </ndim/ws> =
- VID 18 = (7/0:457.0)(14/1:rtcp_1.0)
- Client(14/1:2793.0)
- [cbb4c] (gid 7/0.9[0])(* cached 1 iterating failed *) </ndim/ws> =
- VID 18 = (7/0:457.0)(14/1:rtcp_1.0)
- Client(14/1:2796.0)
- [cc0bc] (gid 7/0.9[0])(* cached 1 iterating *) </ndim/ws> =
- VID 18 = (7/0:457.0)(14/1:rtcp_1.0)
-
- Associative store: as_ndelete 0, as_nlocdelete 1
- id= 92c0039(ab-deletable)[ =[<qu_message: 0x72648><as_scope: bitvec `00004080000000000000000000000000'><phase: 2><priority: 0x5b0e><as_freed: bitvec `00004000000000000000000000000000'>]]
-
- abq:
- max_priority = 5b0e
- cbcast data structures:
- pbufs:
- pb_itemlist:
- idlists:
- piggylists:
- gbcast data structures:
- wait1:
- wait queues:
- glocks:
-
- Failure detector: current view 7/4:
- slist: 700 e01
- incarn: 7/0 14/1
- failed: `00000000000000000000000000000000'
- recovered: `00004080000000000000000000000000'
- not coord, no fork, no fail, no prop, no oprop, not sent_oprop
- Pending failures:
- Pending recoveries:
- Replies wanted:
- View r_locks: `00000000000000000000000000000000'
- View w_locks: `00000000000000000000000000000000'
- View want_w_locks: `00000000000000000000000000000000'
-
- clients:
- [06] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1450) pid 1800]
- <bin/isis>, idle, monitoring < (14/1:rtcp_1.0) > watched by sites < 7 >
- [07] [host barbera.nectar.cs.cmu.edu (128.2.214.54:2325) pid 1805]
- <xmgr>, idle
- [08] [host barbera.nectar.cs.cmu.edu (128.2.214.54:2341) pid 1802]
- <rexec>, idle
- [09] [host barbera.nectar.cs.cmu.edu (128.2.214.54:2646) pid 1804]
- <rmgr>, idle
- [10] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1509) pid 1827]
- client pid=1827, idle, monitoring < (14/1:1828.0) >
- [11] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1065) pid 1828]
- client pid=1828, idle watched by sites < 7 >
- [12] [host urals.edrc.cmu.edu (128.2.214.68:3001) pid 13454]
- rtcp_1, idle, monitoring < (7/0:isis.0) (14/1:1828.0) > watched by sites < 7 >
- [13] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1660) pid 2725]
- client pid=2725, idle
- [14] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1054) pid 2796]
- client pid=2796, idle
- [15] [host barbera.nectar.cs.cmu.edu (128.2.214.54:1920) pid 2800]
- client pid=2800, idle
- Active remote UDP clients:
-
- Intersite:
- 7/0 [loos.edrc.cmu.edu]:
- estab;alive; got: 624/627+0 dups, sent: 725+1 ret, 205 acks, backlog 0
- Message tank: 0 messages, 0 bytes
- ---
- An interesting fact is that all of those /ndim/ws clients are deceased
- -- I ^C'ed out of them ages ago trying to get around this problem. ISIS
- does not notice that they have died. Is this the problem?
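- One way to avoid killing clients abruptly with ^C, sketched under
- assumptions: catch SIGINT, record it in a flag, and run an orderly
- shutdown from the event loop instead of dying in place. The hook type
- shutdown_fn and both function names are hypothetical, and whether the
- hook should call an ISIS leave routine is not settled by the dump above.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict modes */
#include <signal.h>
#include <stddef.h>

/* Set from the signal handler; sig_atomic_t is the object type that is
 * safe to write from a handler. */
static volatile sig_atomic_t interrupted = 0;

static void
on_sigint(int signo)
{
    (void)signo;
    interrupted = 1;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int
install_sigint_flag(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical hook type: whatever orderly shutdown the client needs. */
typedef void (*shutdown_fn)(void);

/* Call from the event loop (for example from a timer handler).
 * Runs the hook once if ^C was seen; returns 1 if it was seen, else 0. */
int
shutdown_if_interrupted(shutdown_fn hook)
{
    if (!interrupted)
        return 0;
    interrupted = 0;
    if (hook != NULL)
        hook();
    return 1;
}
```

- In a Tk program the check could sit in the same timer handler that
- already services ISIS, so the shutdown runs outside signal context.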
-
- This doesn't always happen. I cannot figure out exactly why it happens or
- when, but totally restarting ISIS seems to clear the problem up,
- temporarily. I then have a DIFFERENT problem: when I try to restart
- ISIS, it complains that ISISPORT (whatever I said as the second port
- number in sites) cannot be assigned and panics during firing up protos
- (this is different than the "port in use, please try again in a minute"
- message). No amount of fiddling fixes the problem -- I have to edit all
- my sites files to fool ISIS into starting again.
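- The two startup failures described above may correspond to different
- bind() errors. As a sketch, assuming ISIS is reporting a failed bind()
- on ISISPORT (nothing in this post confirms that), the errno values can
- be told apart as follows. classify_bind_errno and probe_port are
- hypothetical helpers, not part of ISIS.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Map a bind() errno to a short diagnosis. */
const char *
classify_bind_errno(int err)
{
    switch (err) {
    case 0:             return "ok";
    case EADDRINUSE:    return "port in use";
    case EADDRNOTAVAIL: return "address not assignable on this host";
    case EACCES:        return "permission denied (privileged port?)";
    default:            return "other error";
    }
}

/* Try to bind a socket of the given type (SOCK_DGRAM or SOCK_STREAM)
 * to INADDR_ANY:port, then release it. Returns 0 on success, else the
 * errno from socket() or bind(). */
int
probe_port(int type, unsigned short port)
{
    struct sockaddr_in sin;
    int fd, err = 0;

    fd = socket(AF_INET, type, 0);
    if (fd < 0)
        return errno;
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sin, sizeof sin) < 0)
        err = errno;   /* save before close() can overwrite it */
    close(fd);
    return err;
}
```

- Running probe_port() against the port numbers listed in the sites file
- before starting ISIS would show whether the port is merely busy or
- cannot be bound at all.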
-
- Have I done something drastically wrong in my ISIS code? It was working
- perfectly until I added a few more process groups to the system (i.e.
- everything under "/ndim" except "ws").
-
- Sorry for the length of this message. Any help urgently needed.
-
- Cheers,
- -- Sean
- --
- Sean Levy, n-dim Group, EDRC, CMU, 5000 Forbes Ave, PGH, PA 15213
- Email: snl+@cmu.edu, Phone: +1 412 268 5221, Fax: +1 412 268 5229
-