- Newsgroups: comp.unix.ultrix
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!sun-barr!decwrl!pa.dec.com!rdg.dec.com!decvax.dec.com!decvax.DEC.COM!jag
- From: jag@decvax.DEC.COM (John A. Gallant UEG)
- Subject: Re: SCSI/CAM problems
- Message-ID: <1992Sep8.164333.19412@decvax.dec.com>
- Sender: usenet@decvax.dec.com (Usenet News System)
- Nntp-Posting-Host: witsend.zk3.dec.com
- Reply-To: jag@zk3.dec.com
- Organization: OSF Engineering, Digital Equipment Corp.
- References: <ROBM.92Sep4150116@ataraxia.Berkeley.EDU> <1992Sep5.011659.7675@news.iastate.edu>
- Date: Tue, 8 Sep 1992 16:43:33 GMT
- Lines: 100
-
- In article <1992Sep5.011659.7675@news.iastate.edu>, john@iastate.edu (John Hascall) writes:
-
- >}Anyone having similar experiences? Anyone have any suggestions?
- > Yes, yes, yes! No (sorry).
-
- Well, not exactly the same: your messages are dealing with a memory
- resource warning. Rob's looks like a device-related problem.
-
- >We upgraded our central NFS servers from 5000/200s to 5000/240s
- >and since we had to upgrade from 4.2, we installed the SCSI/CAM
- >at the same time. Our machines have 2 or 3 SCSI boards, 8-12 RZ57s,
- >and a TLZ04 each (the DEC support guy said he thought the other
- >problem report he had on this also was a multi-SCSI machine).
-
- The 5000/240 code is a memory hog: there is a bug in the low-level
- DMA code that allocates resources for all possible devices, including
- LUNs. This results in about 40k extra being allocated for each device on
- the motherboard bus. We are currently working through the patch process;
- the ETA at this point is unknown.
-
- >What happens is we start getting *tons* of:
- > XPT Packet Pool HIGH Water Mark Reached.
- > cam_logger: CAM_ERROR packet
- > cam_logger: No associated bus target lun
-
- I would expect that there are both LOW and HIGH warnings coming out.
- The CAM subsystem attempts to keep a local pool of CCBs to allow
- a device driver to get one quickly instead of always having to go to
- the system alloc code. The pool parameters are in the
- /usr/sys/data/cam_data.c file. The CAM subsystem has an initial size
- for this pool, and when the pool drops to the low water mark, more
- CCBs are added. As the outstanding I/Os complete, the CCBs are returned
- to the pool, and if there are too many (the high water mark), the memory
- is returned to the system. For each of these marks a warning message is
- logged to the system.
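-
- As a rough illustration of the mechanism described above, here is a toy
- model in Python. It is not the actual XPT code: the class name CcbPool,
- its methods, and the choice to release one increment's worth of CCBs at
- the high mark are all assumptions made for the sketch.
-
```python
class CcbPool:
    """Toy model of a CCB pool with low and high water marks (hypothetical)."""

    def __init__(self, size, low, high, increment):
        self.free = size            # CCBs sitting in the pool, ready to hand out
        self.busy = 0               # CCBs out with drivers for outstanding I/O
        self.low = low              # grow the pool when free count is at or below this
        self.high = high            # shrink the pool when free count exceeds this
        self.increment = increment  # CCBs added (or, by assumption, released) per step
        self.warnings = []          # stand-in for the system error log

    def get(self):
        """Hand a CCB to a driver, growing the pool at the low water mark."""
        if self.free <= self.low:
            self.free += self.increment   # stand-in for the system alloc code
            self.warnings.append("LOW")
        self.free -= 1
        self.busy += 1

    def put(self):
        """Take a CCB back after I/O completion, shrinking above the high mark."""
        self.busy -= 1
        self.free += 1
        if self.free > self.high:
            self.free -= self.increment   # stand-in for returning memory to the system
            self.warnings.append("HIGH")


# A burst of 8 requests followed by 8 completions against a small pool.
pool = CcbPool(size=4, low=1, high=6, increment=4)
for _ in range(8):
    pool.get()
for _ in range(8):
    pool.put()
print(pool.warnings)
```
-
- With these made-up numbers, a single burst trips the low mark on the way
- down and the high mark on the way back up, which is the pattern of paired
- warnings I would expect to see in the log.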
-
- My SWAG for your problem is that you have "spikes" of disk
- I/O that deplete the pool and then, when finished, fill the pool back
- up.
-
- For a heavy NFS server you would want to increase cam_ccb_pool_size
- to a larger value and also increase the high/low marks. The
- cam_ccb_increment value is probably OK. With a larger pool the CAM
- subsystem can "ride" out such spikes without a flood of error messages.
- A note of caution: a larger pool will result in more memory being
- allocated. The pool control data structure in the kernel is
- xpt_qhead. The "current" pool size, as the number of free and busy CCBs,
- is tracked in there. You may want to look at it via dbx also.
-
- For a quick test you can modify the high/low marks with dbx to find
- the right size for your system.
-
- >................................ (in the two times I have seen it happen)
- >the dreaded "cant get mbufs" message also appears. It seems to be somehow
- >related to uptime (memory leak?) and load (so, of course, today, with a
- >machine which hung last night and being the Friday before a long weekend,
- >we had neither and I couldn't reproduce it for DEC *sigh*).
-
- It sounds like an overall low amount of memory on your system?
- The new CAM code tries to get *all* the disk and tape I/O through the
- system as fast as it can. There is no longer a limit of one I/O per
- device at a time that can "stall" I/O requests. The peripheral device
- drivers get a request, bundle up a CCB, issue the command down to the
- sublayers, and let the I/O queuing go on "down there". This results in
- a completely different level of "demand" for system memory.
-
- >We noticed this because we started getting more and more complaints about
- >I/O to the NFS servers slowing down more and more. So I ran the "iozone"
- >benchmark around 2pm and got horrid numbers, later in the afternoon they
- >were even lower, after dinner I decided to run it directly on the server
- >(thinking we had some sort of NFS problem at first). I happened to login
- >on the console to do this and the CAM errors just cam pouring out and then
- >they stopped (little grey button time).
-
- If all of these were scrolling onto the console, then for some reason
- the error logger daemon was no longer running. Was it killed, or did the
- error log area fill up? This could also be part of why the overall
- system slowed down.
-
- >}If I decide to back out on this "upgrade", will I have to reinstall
- >}the 4.2A kernel config files? (I'd hate to have to do that since I've
- >}since installed a bunch of patches.)
- >
- >I was told that backing out of SCSI/CAM was NOT as simple as
- >just "setld -d ..." If we can't get a fix by next week we are
- >resigned to going backwards to 4.2a.
-
- If you must, you can delete the CAMBIN* subset, but not the CAMBASE*
- subset. In the installation code for V4.2/CAM and V4.2c/CAM the utilities
- were not saved aside as part of the installation. Deleting the CAMBASE*
- subset will unfortunately remove the mt utility and not restore it.
-
- --
- John A. Gallant jag@zk3.dec.com
- Software Engineer - OSF Engineering Group
- Digital Equipment Corp. (603) 881-2472
-
- In the common people there is no wisdom, no penetration, no
- power of judgment.
- Marcus Cicero
-