NetNews Usenet Archive 1992 #20

Text File | 1992-09-08 | 4.3 KB | 126 lines

Xref: sparky comp.unix.questions:10833 comp.sys.sgi:13388 comp.unix.misc:3586
Newsgroups: comp.unix.questions,comp.sys.sgi,comp.unix.misc
Path: sparky!uunet!stanford.edu!microunity!brendan
From: brendan@microunity.com (Brendan Eich)
Subject: Re: File Locking across NFS mounts
Message-ID: <1992Sep8.210250.7387@microunity.com>
Sender: usenet@microunity.com (news)
Nntp-Posting-Host: acteon.microunity.com
Organization: MicroUnity Systems Engineering, Inc.
References: <9209080829.PN04389@LL.MIT.EDU>
Distribution: comp
Date: Tue, 8 Sep 1992 21:02:50 GMT
Lines: 111

In article <9209080829.PN04389@LL.MIT.EDU> lebrun@ll.mit.edu (Steven F. LeBrun) writes:
>I am looking for a method of file locking what works across NFS
>mounted directories and am interested on what methods other
>people are using.
>
>I tried flock() and discovered a bug that only exists if the file
>being locked is a remote (NFS) file. If a program terminates without
>unlocking an NFS file (due to a system reboot, user abort, program
>bug [this is how I discovered the problem -- bugs can be useful at
>times :-) etc.] the file remains locked so that no other programs
>can access the file.

First let me say that this lockd bug may well be fixed in a release
that's available now to customers on support. It was a known bug
within SGI last I heard, and I believe the fix was in hand.

With that in mind, here's a homebrew locking scheme.
Try using a symlink (which is created exclusively and can contain an
arbitrary NUL-free string) as a locking device. These functions return
0 on success and an errno value on failure:

    #include <sys/param.h>      /* MAXHOSTNAMELEN */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* BIGENOUGH (a buffer size) and BIGTIMEOUT (seconds) are yours to define. */

    int
    lockit(char *lockname)
    {
        int pid, timeout, cc, lockpid;
        char hostname[MAXHOSTNAMELEN];
        char lockhost[BIGENOUGH];   /* as large as locksig, so sscanf cannot overrun it */
        char signature[BIGENOUGH], locksig[BIGENOUGH];

        pid = (int) getpid();
        (void) gethostname(hostname, sizeof hostname);
        sprintf(signature, "%s:%d.%ld", hostname, pid, (long) time(0));

        /* symlink(contents, linkname): the link is the lock, the signature its contents. */
        for (timeout = 1; symlink(signature, lockname) < 0; timeout <<= 1) {
            if (errno != EEXIST)
                return errno;
            /* Leave room for the terminating NUL, which readlink does not add. */
            cc = readlink(lockname, locksig, sizeof locksig - 1);
            if (cc < 0)
                return errno;
            locksig[cc] = '\0';
            /* "%[^:]" stops at the colon; a plain "%s" would swallow it. */
            if (sscanf(locksig, "%[^:]:%d", lockhost, &lockpid) != 2)
                return EINVAL;
            if (strcmp(lockhost, hostname) == 0) {
                /*
                 * If we already own the lock, succeed.
                 */
                if (lockpid == pid)
                    break;
                /*
                 * Probe for a dead process that failed to unlock.
                 */
                if (kill(lockpid, 0) < 0 && errno == ESRCH) {
                    (void) unlink(lockname);
                    continue;
                }
            } else {
                /*
                 * A process on another host claims the lock.
                 * Do something fancy, perhaps with rcmd, kill -0,
                 * and error-message echoing, to verify that the
                 * process exists.
                 */
                /* . . . */
            }
            if (timeout >= BIGTIMEOUT)
                return ETIMEDOUT;
            sleep(timeout);
        }
        return 0;
    }

    int
    unlockit(char *lockname)
    {
    #ifdef DEBUG
        /*
         * Use readlink to sanity-check that we claim the lock.
         */
        /* . . . */
    #endif
        return (unlink(lockname) < 0) ? errno : 0;
    }

Remember that NFS lacks reliable transactions. Under very heavy load
(a congested internet between client and server, with bad delays), or
with old NFS servers that fail to use the "idempotency cache" (a weak
at-most-once NFS transaction mechanism) for symlink and unlink, several
processes can get the lock by mistake, because a delayed retransmission
of an unlink request removed someone else's lock.

If the lock ever gets "stuck" you can use `ls -l` to find out who claims
it and kill the process or remove the symlink, as appropriate.
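For checking a suspicious lock from a program rather than by eye, here is a rough sketch that reads the symlink and reports how old the claim is. It assumes the "host:pid.time" signature layout written by lockit(); the names parse_locksig, lock_age, SIG_MAX and HOST_MAX are made up for this illustration and are not part of the scheme above.

```c
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SIG_MAX  512    /* assumed buffer size (the post calls it BIGENOUGH) */
#define HOST_MAX 256    /* assumed upper bound on host name length */

/*
 * Hypothetical helper: split "host:pid.time" into its three fields.
 * host must point to at least HOST_MAX bytes.
 * Returns 0 on success, -1 if the text does not match the layout.
 */
int
parse_locksig(const char *sig, char *host, long *pid, long *stamp)
{
    /* %255[^:] reads up to 255 non-colon characters (HOST_MAX - 1). */
    if (sscanf(sig, "%255[^:]:%ld.%ld", host, pid, stamp) != 3)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: age in seconds of the lock named lockname,
 * measured against now.  Returns -1 if the link cannot be read or
 * its contents do not parse.
 */
long
lock_age(const char *lockname, time_t now)
{
    char sig[SIG_MAX], host[HOST_MAX];
    long pid, stamp;
    ssize_t cc;

    cc = readlink(lockname, sig, sizeof sig - 1);
    if (cc < 0)
        return -1;
    sig[cc] = '\0';     /* readlink does not NUL-terminate */
    if (parse_locksig(sig, host, &pid, &stamp) != 0)
        return -1;
    return (long) (now - (time_t) stamp);
}
```

A caller might compare lock_age(lockname, time(0)) against whatever it considers stale before deciding to look at the owning process. Note that the time stamp comes from the locking host's clock, so clock skew between hosts makes the age only approximate.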
I added the initial time to the lock signature so you can see how old
and bogus a suspicious lock might be.

On SGI servers, you can increase MAXDUPREQS in /usr/sysgen/master.d/snfs
and '/etc/init.d/autoconfig; answer y; reboot' to get a bigger idempotency
cache. In 4.0.5 and later, you can use `nfsstat -sr` to get server-side
RPC statistics including "dupage", which gives a smoothed average of the
age of the least-recently-used idempotency cache entry that was recycled.
This should be a larger number than BIGTIMEOUT (I forget whether the units
have to be converted to seconds) for the above scheme to work reliably.

On a single Ethernet with few clients, this scheme should work well (the
probing for a remote process, not shown above, can be slow if done using
rcmd(3N) and kill(1)).

/be
--
Brendan Eich
MicroUnity Systems Engineering, Inc.
brendan@microunity.com
Any opinions above are mine.