- Path: usage.csd.unsw.oz.au!metro!munnari.oz.au!samsung!mips!swrinde!zaphod.mps.ohio-state.edu!caen!umich!terminator!pisa.citi.umich.edu!rees
- From: rees@pisa.citi.umich.edu (Jim Rees)
- Newsgroups: comp.sys.apollo
- Subject: Re: DN4500 arbitrarily overloads itself (was Re: (none))
- Message-ID: <52629c2f.1bc5b@pisa.citi.umich.edu>
- Date: 25 Jun 91 16:50:54 GMT
- References: <0677436884@INESCN.RCCN.PT> <DPASSAGE.91Jun20122344@soda.berkeley.edu> <1991Jun24.002623.18899@gtephx.UUCP>
- Sender: usenet@terminator.cc.umich.edu (usenet news)
- Reply-To: rees@citi.umich.edu (Jim Rees)
- Organization: University of Michigan IFS Project
- Lines: 63
-
- In article <1991Jun24.002623.18899@gtephx.UUCP>, wilsonj@gtephx.UUCP (Jay Wilson) writes:
-
- > I saw this posting and I could not resist having one of my partners
- > in crime (there are 6 of us Sys_admins) respond to it. He has been tracking
- > the Mutex Lock problem for over a year now and this is what he had to say.
-
- ...
-
- > The error will rear its ugly head with no warning or pattern,
- > and once you get it you MUST reboot and run the long SALVOL
- > to appease its appetite for disaster.
-
- I find this hard to believe. I've never had to run a long salvol after
- getting a stuck sfcb mutex, and I can't think of anything that a long salvol
- might do that would fix it.
-
- > If you would like more information on exactly what "sfcb hash table" and
- > "mutex lock" are, please refer to a copy of the "Domain/OS Design Principles"
- > 014962-A00 pages 9-14,9-15.
-
- That paper was also published in the Atlanta Usenix Proceedings, which I
- think was summer 1986. It's also available by ftp from pisa.citi.umich.edu.
- I think it's an excellent paper and everyone should read it.
-
- The basic problem with putting mutex locks in shared memory is that any old
- program can go and trash them, and then you're stuck. What's needed is a
- true object-oriented architecture with tagged storage, like the old Intel
- 432 or the IBM System/38. But the trend seems to be in the opposite
- direction, and operating systems seem to be getting more primitive, yet
- bloated, every year. Multitasking was common on all computers in the mid-1960s,
- then pretty much disappeared in the 80s when everyone started running
- MS-DOS. I'm waiting for the days when we all have to start using batch
- again. All this is enough to make an old systems guy like me want to retire
- to a small midwestern town and spend his summers in places like Tanjung
- Pinang.
-
- Anyway, where was I? Oh yes, mutex locks. These problems are nearly
- impossible to track down. Since the sfcbs are central to ios, it's very
- hard to do anything in the debugger if, for example, you've set a breakpoint
- in mutex_$lock. Everybody and his mother calls mutex_$lock, so you might
- have to hit it 10000000 times before catching the one time that has no
- matching unlock. And then, how do you know when that happens? It's
- essentially the halting problem. How do you tell the debugger to breakpoint on
- something that isn't going to happen? "Please stop the next time
- mutex_$unlock *isn't* called." And if you do manage to catch it, there you
- are with the sfcbs locked, and you can't do any IO. And remember that the
- missing unlock can happen in any process.
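
One way around the "breakpoint on something that doesn't happen" problem is to record who took the lock at the time it is taken, so a stuck lock can be blamed after the fact. What follows is only a rough sketch in Python, not anything from Domain/OS: TrackedMutex, leaky and the thread name are all made up for illustration, and a real sfcb lock lives in shared memory across processes, which this does not model.

```python
import threading
import traceback


class TrackedMutex:
    """A lock that remembers who holds it (hypothetical helper, not a Domain/OS API)."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None        # thread name of the current holder, if any
        self.acquired_at = None  # stack captured when the lock was taken

    def lock(self, timeout=-1):
        got = self._lock.acquire(timeout=timeout)
        if got:
            # Record the holder only after the lock is really ours.
            self.owner = threading.current_thread().name
            self.acquired_at = traceback.format_stack()
        return got

    def unlock(self):
        self.owner = None
        self.acquired_at = None
        self._lock.release()

    def report(self):
        if self.owner is None:
            return f"{self.name}: free"
        return f"{self.name}: held by {self.owner}"


def leaky(m):
    m.lock()  # the bug: no matching unlock


sfcb = TrackedMutex("sfcb")
t = threading.Thread(target=leaky, args=(sfcb,), name="leaky-worker")
t.start()
t.join()

# Nobody will ever release it, so a timed attempt fails and we ask who has it.
stuck = not sfcb.lock(timeout=0.1)
print(stuck, sfcb.report())
```

The saved stack in acquired_at points at the call site that forgot to unlock. The catch is that the bookkeeping would itself have to sit in shared memory to work across processes, where it can be trashed just like the lock.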
-
- Even worse is the case where it isn't an unmatched lock, it's some random
- trashing of memory that happens to scribble on the sfcb. It can happen at
- any time, in any process.
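
For the scribbling case, about the best a plain shared-memory design can do is notice the damage afterwards. A minimal sketch, assuming a guard word on each side of the lock word; the layout, the MAGIC value and the function names are invented for this example, and a bytearray stands in for the shared segment:

```python
import struct

MAGIC = 0x5FCB5FCB   # arbitrary guard value, chosen for this sketch
LAYOUT = "<III"      # guard word, lock word, guard word (4 bytes each)


def init_lock(mem, off):
    """Write an unlocked lock record with guards at byte offset off."""
    struct.pack_into(LAYOUT, mem, off, MAGIC, 0, MAGIC)


def check_lock(mem, off):
    """Return 'ok', or a description of what looks damaged."""
    head, word, tail = struct.unpack_from(LAYOUT, mem, off)
    if head != MAGIC or tail != MAGIC:
        return "guard word overwritten"
    if word not in (0, 1):
        return "lock word has impossible value"
    return "ok"


shared = bytearray(64)   # stand-in for a shared memory segment
init_lock(shared, 16)
before = check_lock(shared, 16)

# A stray write from "any old program": 8 bytes that run into the lock record.
shared[12:20] = b"\xff" * 8
after = check_lock(shared, 16)
print(before, "->", after)
```

This tells you that something scribbled, not who or when, and a stray write that lands only on the lock word and leaves a 0 or 1 there goes unnoticed.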
-
- Last time I wrote a type manager (for AFS), I had a few of these problems.
- I couldn't debug them. I fixed them by tenacious examination of source
- code. I suspect that's how Apollo engineers fix them too, the few who are
- left who even know what an sfcb is.
-
- To Apollo's credit, I have to say that I haven't seen a single stuck mutex
- since I installed sr10.3. I suspect there were some problems with TCP
- before this but I wouldn't swear to it -- it may have been my screwy type
- manager.
-
- Enough ranting.
-