NetNews Usenet Archive 1992 #20


Text File | 1992-09-08 | 5.6 KB | 114 lines

Newsgroups: comp.unix.ultrix
Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!sun-barr!decwrl!pa.dec.com!rdg.dec.com!decvax.dec.com!decvax.DEC.COM!jag
From: jag@decvax.DEC.COM (John A. Gallant UEG)
Subject: Re: SCSI/CAM problems
Message-ID: <1992Sep8.164333.19412@decvax.dec.com>
Sender: usenet@decvax.dec.com (Usenet News System)
Nntp-Posting-Host: witsend.zk3.dec.com
Reply-To: jag@zk3.dec.com
Organization: OSF Engineering, Digital Equipment Corp.
References: <ROBM.92Sep4150116@ataraxia.Berkeley.EDU> <1992Sep5.011659.7675@news.iastate.edu>
Date: Tue, 8 Sep 1992 16:43:33 GMT
Lines: 100

In article <1992Sep5.011659.7675@news.iastate.edu>, john@iastate.edu (John Hascall) writes:
>}Anyone having similar experiences? Anyone have any suggestions?
> Yes, yes, yes! No (sorry).

Well, not exactly the same: your messages are dealing with a memory
resource warning message. Rob's looks like a device-related problem.

>We upgraded our central NFS servers from 5000/200s to 5000/240s
>and since we had to upgrade from 4.2, we installed the SCSI/CAM
>at the same time. Our machines have 2 or 3 SCSI boards, 8-12 RZ57s,
>and a TLZ04 each (the DEC support guy said he thought the other
>problem report he had on this also was a multi-SCSI machine).

The 5000/240 has code that is a memory hog: there is a bug in the
low-level DMA code that allocs resources for all possible devices,
including LUNs. This results in about 40k extra being alloc-ed for each
device on the motherboard bus. We are currently working through the
patch process; the ETA at this point is unknown.

>What happens is we start getting *tons* of:
> XPT Packet Pool HIGH Water Mark Reached.
> cam_logger: CAM_ERROR packet
> cam_logger: No associated bus target lun

I would expect that there are both LOW and HIGH warnings coming out.
The CAM subsystem attempts to keep a local pool of CCBs to allow a
device driver to get one quickly instead of always having to go to the
system alloc code.
The pool parameters are in the /usr/sys/data/cam_data.c file. The CAM
subsystem has an initial size for this pool, and when the pool gets down
to the low water mark, more CCBs are added. As the outstanding I/Os are
completed the CCBs are returned to the pool, and if there are too many
(the high water mark), the memory is returned to the system. For each of
these marks there is a warning message logged to the system.

My SWAG for your problem is that you have "spikes" of disk I/O that
deplete the pool and then, when finished, fill the pool back up. For a
heavy NFS server you would want to increase cam_ccb_pool_size to a nice
number and also increase the high/low marks. The cam_ccb_increment value
is probably ok. With a larger pool the CAM subsystem can "ride" out such
spikes without massive error messages. On a NOTE of caution, a larger
pool will result in more memory being allocated.

The pool control data structure in the kernel is the xpt_qhead. The
"current" pool size, as the number of free and busy CCBs, is tracked in
there. You may want to look at it via dbx also. For a quick test you can
modify the high/low marks with dbx to see what the right size for your
system would be.

>................................ (in the two times I have seen it happen)
>the dreaded "cant get mbufs" message also appears. It seems to be somehow
>related to uptime (memory leak?) and load (so, of course, today, with a
>machine which hung last night and being the Friday before a long weekend,
>we had neither and I couldn't reproduce it for DEC *sigh*).

It sounds like an overall low amount of memory on your system? The new
CAM code tries to get *all* the disk and tape I/O through the system as
fast as it can. There is no longer a one-I/O-per-device-at-a-time limit
that can "stall" I/O requests. The peripheral device drivers get a
request, bundle up a CCB, issue the command down to the sublayers, and
let the I/O queuing go on "down there".
This results in a completely different level of "demand" for the system
memory.

>We noticed this because we started getting more and more complaints about
>I/O to the NFS servers slowing down more and more. So I ran the "iozone"
>benchmark around 2pm and got horrid numbers, later in the afternoon they
>were even lower, after dinner I decided to run it directly on the server
>(thinking we had some sort of NFS problem at first). I happened to login
>on the console to do this and the CAM errors just came pouring out and then
>they stopped (little grey button time).

If all of these were scrolling onto the console, then for some reason
the error logger daemon was no longer running. Was it killed, or did the
error log area fill up? This could also be part of the overall system
slowdown.

>}If I decide to back out on this "upgrade", will I have to reinstall
>}the 4.2A kernel config files? (I'd hate to have to do that since I've
>}since installed a bunch of patches.)
>
>I was told that backing out of SCSI/CAM was NOT as simple as
>just "setld -d ..." If we can't get a fix by next week we are
>resigned to going backwards to 4.2a.

If you must, you can delete the CAMBIN* subset, but not the CAMBASE*
subset. In the installation code for V4.2/CAM and V4.2c/CAM the original
utilities were not saved aside as part of the installation. Deleting the
CAMBASE* subset will unfortunately remove the mt utility and not restore
it.

--
John A. Gallant                jag@zk3.dec.com
Software Engineer - OSF Engineering Group
Digital Equipment Corp.        (603) 881-2472

In the common people there is no wisdom, no penetration, no power of judgment.
                                                        Marcus Cicero
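
The CCB pool behaviour described in the post (an initial pool size, a
refill at the low water mark, a release back to the system at the high
water mark, and a logged warning at each mark) can be modelled with a
toy simulation. The sketch below is an illustration in Python, not the
Ultrix CAM code. The class, its field names, the sample numbers, and the
choice to grow and shrink by exactly one increment are all assumptions;
only cam_ccb_pool_size and cam_ccb_increment are names taken from the
post.

```python
class CcbPool:
    """Toy model of a free-CCB pool with low/high water marks.

    Hypothetical names: low_water and high_water stand in for the
    "high/low marks" in the post; pool_size and increment loosely
    mirror cam_ccb_pool_size and cam_ccb_increment.
    """

    def __init__(self, pool_size, low_water, high_water, increment):
        self.free = pool_size      # CCBs sitting idle in the pool
        self.busy = 0              # CCBs attached to outstanding I/Os
        self.low_water = low_water
        self.high_water = high_water
        self.increment = increment
        self.warnings = []         # stands in for the error logger

    def get_ccb(self):
        # At or below the low mark: grow the pool and log a warning.
        if self.free <= self.low_water:
            self.free += self.increment
            self.warnings.append("LOW")
        self.free -= 1
        self.busy += 1

    def release_ccb(self):
        # Completed I/O returns its CCB to the pool.
        self.busy -= 1
        self.free += 1
        # At or above the high mark: give memory back and log a warning.
        # Assumption: one increment is released at a time.
        if self.free >= self.high_water:
            self.free -= self.increment
            self.warnings.append("HIGH")


def run_spike(pool, burst):
    """Issue `burst` I/Os at once, complete them all, count warnings."""
    for _ in range(burst):
        pool.get_ccb()
    for _ in range(burst):
        pool.release_ccb()
    return len(pool.warnings)


if __name__ == "__main__":
    small = CcbPool(pool_size=10, low_water=4, high_water=12, increment=5)
    large = CcbPool(pool_size=40, low_water=10, high_water=60, increment=5)
    print("small pool warnings:", run_spike(small, 20))
    print("large pool warnings:", run_spike(large, 20))
```

With these made-up numbers, a burst of 20 I/Os makes the small pool log
a run of LOW and then HIGH warnings, while the larger pool with wider
marks absorbs the same burst silently, at the cost of holding more
memory.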