NetNews Usenet Archive 1992 #16

Text File | 1992-07-21 | 20.6 KB | 450 lines

Newsgroups: comp.sys.sgi
Path: sparky!uunet!darwin.sura.net!mips!odin!fido!zola!zuni!anchor!olson
From: olson@anchor.esd.sgi.com (Dave Olson)
Subject: Re: SCSI Disk Problem? (1.6 MB Seagate) [ with SCSI error info ]
Message-ID: <njj76k8@zuni.esd.sgi.com>
Sender: news@zuni.esd.sgi.com (Net News)
Organization: Silicon Graphics, Inc.  Mountain View, CA
References: <1992Jul22.022754.10599@ringer.cs.utsa.edu>
Date: Wed, 22 Jul 92 03:17:33 GMT
Lines: 438

In <1992Jul22.022754.10599@ringer.cs.utsa.edu> senseman@ricky.brainlab.utsa.edu (David M. Senseman) writes:

| We recently installed a 1.6 GB Seagate drive on the external
| SCSI connector of an Indigo XS-24 (IRIX 4.0.5). The hardware
| inventory is as follows:

| sc0,4,0: cmd=0x1b timeout after 60 sec. Resetting SCSI Bus
| SCSI device/cable diagnostic * FAILED *

You almost certainly don't have the drive jumpered to not spinup
until a startunit cmd is received.  When not jumpered this way,
various drives will not respond to commands, or will accept
commands but not complete them, etc.  In this case, the firmware
appears to be somewhat confused, since a startunit command
was issued (0x1b), but it timed out.  These drives should spinup
in far less than 60 seconds.  If the drive *is* jumpered to not
spin up at poweron, then something is wrong with the drive.

| If you continue on with the start-up, the computer seems to
| be fine. At least the 1.6 Gb disk seems to work alright. (see below).
|
| I tried fiddling with the Motor On jumpers -- didn't seem to help
| either way (on or off). The drive is NOT terminated so I removed
| the "tp" jumpers.
|
| We are also have some problems with the QIC and DAT tape drives
| on this machine -- they tend to quit prematurely with
| unrecoverable errors. Sometimes they report I/O errors. The errors
| _seem_ to be more numerous since we installed the Seagate. Is
| is possible for the disk to work fine but screw up the tape drives?

Sounds like something is wrong with the SCSI bus.
Are you sure that the cable to the external drive is a real,
honest to god, SCSI cable, and not a cheap lookalike ;)
Seriously, some centronics printer cables look just like SCSI
cables, but have smaller conductors, and don't always have all
pins connected, particularly some of the grounds.  Of course,
even that wouldn't explain the earlier errors.  Check for
unix: messages in /usr/adm/SYSLOG, and see if they indicate
anything.

| Senseman's First Rule of Computers says that 99% of all computer
| hardware errors is due to a bad cable -- and I suspect that
| this is the most likely candidate here. However, before I go
| into the cable making business, I wanted to know if anyone else
| might had seen this problem with the 1.6 GB Seagate drive before.
| I had a 140 Toshiba drive in there previously and didn't experience
| any problems with it so maybe the cable not bad?

That drive (but not necessarily the firmware rev you have) has been
qualified by the other division, and should work on an Indigo.

By the way, here is a (slightly out of date) list of SCSI error
messages from the wd93 scsi driver, along with sense codes, and in
some cases, likely causes for the error message.  I provided the
raw data, and one of the TAC engineers cleaned it up.  There may be
minor errors or inconsistencies that we haven't spotted, but in
general it should prove helpful.

=================================================================

DATE      November 8, 1991

TITLE     Small Computer System Interface (SCSI) Controller Error Messages

QUESTION  What do the phase and sense errors mean?

ANSWER

This article lists the error strings that are printed by the
device drivers, tpsc, dksc, and (in some cases) devscsi.  The
specific parts of these error strings are contained in Tables 1 - 4,
which follow the introduction of this article.

In the IRIX 3.3.3 release the Jaguar SCSI controller board (jag)
driver prints the information differently than the dksc driver;
in 4.0.1, they use the same form.

In IRIX 3.3.3, the information is printed in the form:

    sense key %x asc%x asq%x

where key is the number from Table 1, asc (additional sense code)
is from Table 2, and asq (additional sense qualifier) sometimes
provides additional info.

NOTE: often there is only one possible asq for a given asc.

An example of a media error string is:

    jag0d2: sense key 3 asc11 asq0 (retry 2)
      |               |    |     |
      |               |    |     No additional information
      |               |    |
      |               |    "Unrecovered data block read error" (see Table 2)
      |               |
      |               "Media Error" (see Table 1)
      |
      device number

The asq tends to be more vendor specific, although the ANSI SCSI 2
specification defines the "standard" sense qualifiers.

The integral SCSI controller on your system also often prints
messages in the form:

    sc#[,#,#]: message

where the first # is the SCSI adapter involved (0 for all systems
except those with the IO3 (input/output board), which supports up
to 4 adapters, 0-3).  The #,# pair that follows is printed only if
it is known which device is causing the problem.

An example of this type of message string is:

    SC0,1: resetting SCSI bus: Spurious SCSI interrupt, no connected channel

In a number of cases, a phase and possibly a state are printed.
These error codes come from the files /usr/include/sys/scsidev.h
and /usr/include/sys/scsi.h.

The state and phase meanings are listed in Table 4.  A few comments
have been added.  Some of the messages are also included.

NOTE: These tables are used by the smfd, tpsc, and dksc drivers
and will probably be used by other SCSI drivers in the future.
That is why they are now in scsi.c, even though not referenced here.
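
The two message forms described above can be picked apart mechanically
when scanning SYSLOG.  What follows is only a rough sketch, not anything
shipped with IRIX: the function names parse_sense and parse_sc are made
up for illustration, the numbers are read as hex because the format
string uses %x, and the names "first" and "second" for the optional #,#
pair are placeholders, since the article does not say what each number
means.  The sense key names are copied from Table 1.

```python
import re

# Sense key names, copied from Table 1 of this article.
SENSE_KEYS = {
    0x0: "No sense", 0x1: "Recovered Error", 0x2: "Drive not ready",
    0x3: "Media error", 0x4: "Unrecoverable device error",
    0x5: "Illegal request", 0x6: "Unit Attention",
    0x7: "Data protect error", 0x8: "Unexpected blank media",
    0x9: "Vendor Unique error", 0xA: "Copy Aborted",
    0xB: "Aborted command", 0xC: "Search data successful",
    0xD: "Volume overflow", 0xE: "Reserved (0xE)", 0xF: "Reserved (0xF)",
}

# 3.3.3 jag-style line: "<device>: sense key K ascNN asqN ..."
SENSE_RE = re.compile(
    r"^(?P<device>\w+):\s+sense key\s+(?P<key>[0-9a-fA-F]+)"
    r"\s+asc\s*(?P<asc>[0-9a-fA-F]+)\s+asq\s*(?P<asq>[0-9a-fA-F]+)"
)

# Integral controller line: "sc#[,#,#]: message".  The pattern also
# tolerates a single trailing number, as in the "SC0,1:" example.
SC_RE = re.compile(
    r"^sc(?P<adapter>\d+)(?:,(?P<first>\d+)(?:,(?P<second>\d+))?)?:"
    r"\s*(?P<message>.*)$",
    re.IGNORECASE,
)


def parse_sense(line):
    """Return the fields of a 'sense key' line, or None if it doesn't match."""
    m = SENSE_RE.match(line.strip())
    if m is None:
        return None
    key = int(m.group("key"), 16)  # %x in the format string means hex
    return {
        "device": m.group("device"),
        "key": key,
        "key_name": SENSE_KEYS.get(key, "unknown"),
        "asc": int(m.group("asc"), 16),
        "asq": int(m.group("asq"), 16),
    }


def parse_sc(line):
    """Return adapter, optional device numbers, and message text, or None."""
    m = SC_RE.match(line.strip())
    if m is None:
        return None
    numbers = (m.group("first"), m.group("second"))
    device = tuple(int(n) for n in numbers if n is not None)
    return {
        "adapter": int(m.group("adapter")),
        "device": device or None,  # None when only sc# was printed
        "message": m.group("message"),
    }


if __name__ == "__main__":
    print(parse_sense("jag0d2: sense key 3 asc11 asq0 (retry 2)"))
    print(parse_sc("sc0,4,0: cmd=0x1b timeout after 60 sec. Resetting SCSI Bus"))
```

The asc value that comes back can then be looked up in Table 2, or
in the device manual, by hand.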

Table 1 - Primary sense key information
-------

Message                      Sense Key  Most common cause(s)

No sense                     0x0        No error information available
Recovered Error              0x1        The device recovered by itself
Drive not ready              0x2        No media or not spun up
Media error                  0x3        An actual media problem
Unrecoverable device error   0x4        Usually a device hardware error
Illegal request              0x5        Invalid command or data issued
Unit Attention               0x6        Device was reset or power cycled
Data protect error           0x7        Usually device is write protected
Unexpected blank media       0x8        Tried to read at end of a tape
Vendor Unique error          0x9        Varies
Copy Aborted                 0xa        Copy cmd aborted by host (not used)
Aborted command              0xb        Target aborted command
Search data successful       0xc        Search Data command OK (not used)
Volume overflow              0xd        Tried to write past EOT on tape
Reserved (0xE)               0xe        should not be seen
Reserved (0xF)               0xf        should not be seen

Table 2 - Additional Sense Code
-------

This table provides further information on the cause of an error.
The ASQ (additional sense qualifier) is printed numerically when
non-zero (in 4.0; in 3.3.3 it is always printed by the Jaguar driver).
Missing numerical values are not printed, either because they are
not defined, or because the drivers treat them specially.  This table
is provided primarily so the additional sense codes can be looked up
in the device manual.  Some are self explanatory, others quite obscure.

Message                              Additional Sense Code

No index/sector signal               0x01
No seek complete                     0x02
Write fault                          0x03
Drive not ready                      0x04
Drive not selected                   0x05
No track 0                           0x06
Multiple drives selected             0x07
LUN communication error              0x08
Track error                          0x09
Error log overflow                   0x0a
Write error                          0x0c
ID CRC or ECC error                  0x10
Unrecovered data block read error    0x11
No addr mark found in ID field       0x12
No addr mark found in Data field     0x13
No record found                      0x14
Seek position error                  0x15
Data sync mark error                 0x16
Read data recovered with retries     0x17
Read data recovered with ECC         0x18
Defect list error                    0x19
Parameter overrun                    0x1a
Synchronous transfer error           0x1b
Defect list not found                0x1c
Compare error                        0x1d
Recovered ID with ECC                0x1e
Invalid command code                 0x20
Illegal logical block address        0x21
Illegal function                     0x22
Illegal field in CDB                 0x24
Invalid LUN                          0x25
Invalid field in parameter list      0x26
Media write protected                0x27
Media change                         0x28
Device reset                         0x29
Log parameters changed               0x2a
Copy requires disconnect             0x2b
Command sequence error               0x2c
Update in place error                0x2d
Tagged commands cleared              0x2f
Incompatible media                   0x30
Media format corrupted               0x31
No defect spare location available   0x32
Media length error                   0x33
Toner/ink error                      0x36
Parameter rounded                    0x37
Saved parameters not supported       0x39
Medium not present                   0x3a
Forms error                          0x3b
Invalid ID msg                       0x3d
Self config in progress              0x3e
Device config has changed            0x3f
RAM failure                          0x40
Data path diagnostic failure         0x41
Power on diagnostic failure          0x42
Message reject error                 0x43
Internal controller error            0x44
Select/Reselect failed               0x45
Soft reset failure                   0x46
SCSI interface parity error          0x47
Initiator detected error             0x48
Inappropriate/Illegal message        0x49

Table 3
-------

This next section lists the messages that are printed by the SCSI
driver.  After the message is printed, the driver resets the SCSI bus.
These messages are from IRIX 4.0, but similar ones are printed by
earlier releases.

"timeout after %d %ssec"
        A SCSI command didn't complete within the reported number of
        seconds (or milliseconds, if it wasn't an even number of
        seconds).  This happens only if the command was successfully
        sent to the device; i.e., the device is at least partly alive.

"Spurious SCSI interrupt, no connected channel"
        This happens in cases that would have panic'ed 3.3.  We got a
        SCSI interrupt that should only happen when a device is
        logically connected on the bus (command or transfer in
        progress) but the driver doesn't think any device is active.
        Seems to happen with some configurations after a SCSI bus
        reset has occurred; possibly could be due to SCSI bus problems
        also.

"illegal disconnection interrupt: phase %x"
        A disconnect message has been received in a phase where it
        shouldn't have occurred.  (A disconnect message is sent by a
        device when relinquishing the bus temporarily, as when it is
        seeking.)  This could be caused (improbably) by some kind of
        SCSI bus problem.

"reselect without ID"
        Some device has reselected (after a disconnect), but the ID of
        the reselector hasn't been received by the SCSI chip.  This
        could be a cabling problem, or some kind of device failure.

"unexpected message in %x, phase %x",
"Unexpected msg in %x, phase %x"
        A device has either gone to a message in phase at a point
        where neither the SCSI chip nor the driver is prepared for it,
        or a message phase is on the bus when no devices are connected.
        The first number printed is the message byte.  This can also
        be caused by cabling problems, and in some cases by devices
        failing a SCSI sync negotiation in a way the driver isn't
        prepared to handle.

"Unexpected info phase %x, state %x"
        Same as above, but it was some SCSI bus phase other than
        message in.

"Hardware error"
        Almost always caused by a driver or chip problem, despite the
        word hardware.  The SCSI chip has reported a failure to select
        some SCSI device on the bus, but the driver didn't think it
        was at the point where it should have been doing a select.

"unexpected reselection"
        Some device has reselected us, but either no device was active
        (only sc# in message), or the device that selected us wasn't
        in the disconnected state (sc#,#,# in msg).

"disconnected on non-word boundary (addr=%x, 0x%x left)"
        On all of the 4D series, DMA has to be aligned on a 32 bit
        boundary at the start of DMA.  If a device disconnects from
        the bus in the middle of a transfer, and then reconnects to
        transfer more data, but the disconnect wasn't after a multiple
        of 4 bytes were transferred, then we can't transfer the rest
        of the data correctly, so we abort the transfer.  This has
        been seen to happen on the Exabyte 8500 when used under 3.3,
        and will happen with some models of DAT drives (other than the
        one qualified by SGI).

Table 4
-------

Several of the SCSI states and phases are listed below.  There are
other possible states and phases but they rarely occur.  The SCSI
states and phases are listed in the file /usr/include/sys/scsidev.h.
The comments below have been extracted from this file and
supplemented with additional information.

Note: "out" is from the CPU to the SCSI device in these descriptions
and "receive" and "send" are also from the SCSI device point of view,
since the target controls all of the bus phases except for initial
selection.

ST_RESET          0x00  SCSI chip Reset by reset command or power-up
ST_SELECT         0x11  Selection of target complete (after C93SELATN)
ST_SATOK          0x16  Select-And-Transfer completed successfully
                        that is, all phases have completed in a
                        normal manner
ST_TR_DATAOUT     0x18  transfer cmd done, target requesting data
ST_TR_DATAIN      0x19  transfer cmd done, target sending data
ST_TR_STATIN      0x1b  Target is sending status in
ST_TR_MSGIN       0x1f  transfer cmd done, target sending msg
ST_TRANPAUSE      0x20  transfer cmd has paused with ACK
                        (above 5 seen during sync negotiations)
ST_SAVEDP         0x21  Save Data Pointers message during SAT
                        normal state when device is disconnecting
                        from the bus
ST_A_RESELECT     0x27  reselected after disc (93A)
ST_UNEXPDISC      0x41  An unexpected disconnect
                        device disconnected without sending a
                        disconnect message; sometimes happens when
                        devices with removable media have had the
                        media removed during a transfer.
ST_PARITY         0x43  cmd terminated due to parity error on the
                        SCSI bus
ST_PARITY_ATN     0x44  cmd terminated due to parity error (ATN is
                        asserted so that host can send a message to
                        device; we just abort the transfer)
ST_TIMEOUT        0x42  Time-out during Select or Reselect
                        that is, the device never responded to an
                        attempt to select it; normally seen only
                        during hardware inventory probing, but
                        sometimes happens after a SCSI bus reset, if
                        device takes a long time to recover from the
                        reset, or is powered off
ST_INCORR_DATA    0x47  incorrect message or status byte
ST_UNEX_RDATA     0x48  Unexpected receive data phase
                        device tried to send us more data than we
                        programmed the SCSI chip to expect.  This can
                        be OK, as when a high level request is made
                        to transfer more data than the DMA hardware
                        can map on a single request.  In this case,
                        we simply reprogram the DMA hardware for the
                        next chunk of data, and restart the transfer
                        (but we don't send a new SCSI command to the
                        device).
                        When printed as part of an error message, it
                        can sometimes be caused by a SCSI cabling
                        problem, or (particularly with devscsi user
                        drivers) by a mismatch in the byte count
                        given to the driver and the byte count
                        implied by the SCSI command sent to the
                        device.
ST_UNEX_SDATA     0x49  Unexpected send data phase
                        Same as above, but device is asking us to
                        send more data.
ST_UNEX_CMDPH     0x4a  Unexpected cmd phase
ST_UNEX_SSTATUS   0x4b  Unexpected send status phase
                        status phases occur at the end of a SCSI
                        command (i.e. byte count remaining is 0); if
                        they happen at other times, the chip
                        interrupts.  This can frequently happen when
                        we ask a device for more data than it can
                        give us, and in this case we just return a
                        short i/o count to the caller.  When printed
                        as part of an error message, it usually
                        implies a cabling or termination problem.
ST_UNEX_RMESGOUT  0x4e  Unexpected request msg out phase
                        usually indicates a SCSI cabling problem.
ST_UNEX_SMESGIN   0x4f  Unexpected send message in phase
                        when printed in a message, usually indicates
                        a SCSI cabling problem; also happens when
                        device sends a disconnect message in normal
                        use when preparing to disconnect from the bus
ST_RESELECT       0x80  WD33C93 has been reselected
ST_93A_RESEL      0x81  reselected while idle (93A)
ST_DISCONNECT     0x85  Disconnect has occurred
ST_NEEDCMD        0x8a  Target is ready for a cmd
ST_REQ_SMESGOUT   0x8e  REQ signal for send message out
ST_REQ_SMESGIN    0x8f  REQ signal for send message in
                        (above 3 usually seen only during sync
                        negotiations)

Phases during a Select and Transfer command

PH_NOSELECT       0x00  Selection not successful
PH_SELECT         0x10  Selection successful
PH_IDENTSEND      0x20  Identify message sent (during selection when
                        sending initial command to a device)

Phase 30 indicates none of the cmd bytes have yet been sent; every
cmd byte sent increments that by one.

PH_CDB_START      0x30  Start of CDB transfers
PH_CDB_6          0x36  6th cmd byte sent
PH_CDB_10         0x3a  10th cmd byte sent
PH_CDB_12         0x3c  12th cmd byte sent
PH_SAVEDP         0x41  Save data pointers
PH_DISCRECV       0x42  Disconnect message received
PH_DISCONNECT     0x43  Target disconnected
PH_RESELECT       0x44  Original target reselected
PH_IDENTRECV      0x45  Correct identify (right LUN) message rcv'd
                        (during reselection)
PH_DATA           0x46  Data transfer completed (expect status next)
PH_STATUSRECV     0x50  Status byte received (expect cmd complete
                        next)
PH_COMPLETE       0x60  Command complete message received; SCSI
                        command is finished, and SCSI bus is free.

=================================================================
--
Let no one tell me that silence gives consent,  |   Dave Olson
because whoever is silent dissents.             |   Silicon Graphics, Inc.
    Maria Isabel Barreno                        |   olson@sgi.com