- Path: sparky!uunet!stanford.edu!sun-barr!sh.wide!wnoc-tyo-news!scslwide!wsgw!wsservra!daemon
- From: irvin@buffett.nas.nasa.gov (Timothy B. Irvin)
- Newsgroups: fj.mail-lists.x-window
- Subject: Keeping X going after fileserver system crash??
- Message-ID: <1992Nov18.091449.5658@sm.sony.co.jp>
- Date: 18 Nov 92 09:14:49 GMT
- Sender: daemon@sm.sony.co.jp (The devil himself)
- Distribution: fj
- Organization: NAS Facility, NASA Ames Research Center, Moffett Field, CA
- Lines: 82
- Approved: michael@sm.sony.co.jp
-
- Date: Tue, 10 Nov 92 18:11:18 GMT
- Message-Id: <1992Nov10.181118.5523@nas.nasa.gov>
- Newsgroups: comp.windows.x
- Sender: xpert-request@expo.lcs.mit.edu
-
- We maintain all the X stuff for our Sun SPARCs on a few replicated fileservers.
- We are beginning to use Amd to mount the closest fileserver, and to fall back
- to one of the replicated ones if the primary server fails.
-
- The problem is that when the fileserver fails while X is running, X gets hosed
- and a reboot becomes necessary, even though a backup fileserver has been
- mounted. That makes some sense, since the binary currently being run still
- points to the primary fileserver.
-
- So, we decided to move some of the X stuff to local disk, to see if most
- of X could survive losing the fileserver. I moved X, twm, and xterm to the
- local disk. But with the dynamically linked libraries still on the fileserver
- this didn't buy us much. So I moved libX11.so.4.3, libXaw.so.4.0,
- libXmu.so.4.0, and libXt.so.4.0 to the local disk as well.
-
- Now X stays up. Xterms are still usable, and all looked right with the
- world, until we tried to start up a new X client (keep in mind that by now
- one of the replicated fileservers had been mounted, so the new application
- was coming off there -- actually, the problem occurred when running
- the client on another machine and displaying it on this machine).
-
- After starting the new client, the server freezes for a few minutes. Little
- to no mouse activity is possible. After a few minutes the application
- finally arrives on the screen and X returns to normal. It appears
- to me that the X server has some file open on the first fileserver and
- has to go through an NFS timeout trying to read this file. But we can't
- figure out which file it would be, or how to get around it.
-
- If the user exits out of X and restarts it, everything returns to normal.
- But we are trying to make a lost fileserver as seamless as possible (given
- the limited amount of extra disk space we have on each SPARCstation).
-
- Below is a segment of a trace of the xterm client starting up. The lines
- with *** in front are lines where activity froze waiting for a timeout.
-
- Now the question:
- Does anyone know what file the server is trying to access, so we can move it
- to local disk (if not too big)? Or how to get the server to close this file
- and reopen it (which would then use the replicated fileserver)? Or any other
- suggestions which might help us out?
-
- If possible please e-mail any suggestions: irvin@nas.nasa.gov.
-
- TRACE OUTPUT SEGMENT
- --------------------------------------------------------------------------------
- writev (3, 0xf7ffdca8, 3) = 16
- read (3, 0xf7ffdd20, 32) = -1 EWOULDBLOCK (Operation would block)
- ***select (4, 0xf7ffdbd8, 0, 0, 0) = 1
- read (3, "".., 32) = 32
- writev (3, 0xf7ffdca8, 3) = 20
- read (3, 0xf7ffdd20, 32) = -1 EWOULDBLOCK (Operation would block)
- ***select (4, 0xf7ffdbd8, 0, 0, 0) = 1
- read (3, "".., 32) = 32
- brk (0x422e8) = 0
- brk (0x432e8) = 0
- writev (3, 0xf7ffd620, 3) = 72
- read (3, 0xf7ffd698, 32) = -1 EWOULDBLOCK (Operation would block)
- ***select (4, 0xf7ffd550, 0, 0, 0) = 1
- read (3, "".., 32) = 32
- writev (3, 0xf7ffd620, 3) = 20
- read (3, 0xf7ffd698, 32) = -1 EWOULDBLOCK (Operation would block)
- ***select (4, 0xf7ffd550, 0, 0, 0) = 1
- read (3, "".., 32) = 32
- write (3, "".., 352) = 352
- read (3, 0xf7fff850, 32) = -1 EWOULDBLOCK (Operation would block)
- ***select (4, 0xf7fff708, 0, 0, 0) = 1
- ---------------------------------------------------------------------------------
-
- PS Each time it stopped on the "select" statements we'd get the following on
- the console: NFS getattr failed for server fs01: Timed out
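
For anyone working from a longer trace than the segment above, the stalled calls can be tallied mechanically rather than by eye. The following is only a minimal sketch in Python, assuming the trace has been saved as text in the same format as the segment above, with the hand-added `***` marker on each line where activity froze. The names `parse_trace` and `stalled_summary` are hypothetical helpers made up for this illustration; they are not part of any X or SunOS tool.

```python
import re
from collections import Counter

# Matches trace lines such as "***select (4, 0xf7ffdbd8, 0, 0, 0) = 1".
# The leading "***" is the hand-added stall marker, not real trace output.
CALL_RE = re.compile(
    r"^(?P<stalled>\*{3})?\s*(?P<name>\w+)\s*\((?P<args>.*)\)\s*=\s*(?P<ret>-?\d+)"
)

def parse_trace(lines):
    """Return one dict per recognisable system call line."""
    calls = []
    for line in lines:
        text = line.strip()
        # Tolerate the "- " prefix that the archived copy puts on every line.
        if text.startswith("- "):
            text = text[2:]
        match = CALL_RE.match(text)
        if match is None:
            continue  # separator lines, blank lines, prose
        calls.append({
            "name": match.group("name"),
            "args": match.group("args"),
            "ret": int(match.group("ret")),
            "stalled": match.group("stalled") is not None,
        })
    return calls

def stalled_summary(calls):
    """Count the marked (stalled) calls by system call name."""
    return Counter(call["name"] for call in calls if call["stalled"])

# A few lines copied from the trace segment above.
SAMPLE = [
    "writev (3, 0xf7ffdca8, 3) = 16",
    "read (3, 0xf7ffdd20, 32) = -1 EWOULDBLOCK (Operation would block)",
    "***select (4, 0xf7ffdbd8, 0, 0, 0) = 1",
    'read (3, "".., 32) = 32',
]

if __name__ == "__main__":
    for name, count in stalled_summary(parse_trace(SAMPLE)).items():
        print(name, count)
```

In the segment shown, every marked line is a `select` that follows a `read` on descriptor 3 returning EWOULDBLOCK. If descriptor 3 is the client's connection to the X server (an assumption, since the trace does not show where it was opened), this would be consistent with the client simply waiting for the server to reply, which suggests the stall is on the server side.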
-
-
- Thanks in advance,
-
- Tim Irvin
- Systems Administrator (CSC)
- NASA Ames Research Center
- Moffett Field, CA
-