- Newsgroups: comp.sys.sgi
- Path: sparky!uunet!darwin.sura.net!mips!odin!fido!zola!zuni!anchor!olson
- From: olson@anchor.esd.sgi.com (Dave Olson)
- Subject: Re: SCSI Disk Problem? (1.6 MB Seagate) [ with SCSI error info ]
- Message-ID: <njj76k8@zuni.esd.sgi.com>
- Sender: news@zuni.esd.sgi.com (Net News)
- Organization: Silicon Graphics, Inc. Mountain View, CA
- References: <1992Jul22.022754.10599@ringer.cs.utsa.edu>
- Date: Wed, 22 Jul 92 03:17:33 GMT
- Lines: 438
-
- In <1992Jul22.022754.10599@ringer.cs.utsa.edu> senseman@ricky.brainlab.utsa.edu (David M. Senseman) writes:
-
-
- | We recently installed a 1.6 GB Seagate drive on the external
- | SCSI connector of an Indigo XS-24 (IRIX 4.0.5). The hardware
- | inventory is as follows:
-
- | sc0,4,0: cmd=0x1b timeout after 60 sec. Resetting SCSI Bus
- | SCSI device/cable diagnostic * FAILED *
-
- You almost certainly don't have the drive jumpered to not spinup
- until a startunit cmd is received. When not jumpered this way,
- various drives will not respond to commands, or will accept commands
- but not complete them, etc. In this case, the firmware appears to
- be somewhat confused, since a startunit command was issued (0x1b),
- but it timed out. These drives should spinup in far less than 60
- seconds. If the drive *is* jumpered to not spin up at poweron,
- then something is wrong with the drive.
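For reference, cmd=0x1b is the SCSI START STOP UNIT command. A minimal sketch of how its 6-byte command block is laid out, assuming the standard SCSI field positions; the function name is made up for illustration and this is not the firmware's or the driver's actual code:

```python
START_STOP_UNIT = 0x1B  # opcode reported as cmd=0x1b in the error message


def start_unit_cdb(start=True, immed=False):
    """Build a 6-byte START STOP UNIT command descriptor block (hypothetical helper)."""
    cdb = bytearray(6)
    cdb[0] = START_STOP_UNIT
    cdb[1] = 0x01 if immed else 0x00  # IMMED bit: return status before spin-up finishes
    cdb[4] = 0x01 if start else 0x00  # START bit: 1 = spin up, 0 = spin down
    return bytes(cdb)


print(start_unit_cdb().hex())
```

With immed left off, the drive does not return status until it has spun up, which is why a drive that never comes ready shows up as a timeout on this command.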
-
- | If you continue on with the start-up, the computer seems to
- | be fine. At least the 1.6 Gb disk seems to work alright. (see below).
- |
- | I tried fiddling with the Motor On jumpers -- didn't seem to help
- | either way (on or off). The drive is NOT terminated so I removed
- | the "tp" jumpers.
- |
- | We are also having some problems with the QIC and DAT tape drives
- | on this machine -- they tend to quit prematurely with
- | unrecoverable errors. Sometimes they report I/O errors. The errors
- | _seem_ to be more numerous since we installed the Seagate. Is
- | it possible for the disk to work fine but screw up the tape drives?
-
- Sounds like something is wrong with the SCSI bus. Are you sure
- that the cable to the external drive is a real, honest to god,
- SCSI cable, and not a cheap lookalike ;) Seriously, some centronics
- printer cables look just like SCSI cables, but have smaller conductors,
- and don't always have all pins connected, particularly some of the
- grounds.
-
- Of course, even that wouldn't explain the earlier errors. Check
- for unix: messages in /usr/adm/SYSLOG, and see if they indicate
- anything.
-
- | Senseman's First Rule of Computers says that 99% of all computer
- | hardware errors are due to a bad cable -- and I suspect that
- | this is the most likely candidate here. However, before I go
- | into the cable making business, I wanted to know if anyone else
- | might have seen this problem with the 1.6 GB Seagate drive before.
- | I had a 140 Toshiba drive in there previously and didn't experience
- | any problems with it so maybe the cable is not bad?
-
- That drive (but not necessarily the firmware rev you have) has been
- qualified by the other division, and should work on an Indigo.
-
- By the way, here is a (slightly out of date) list of SCSI error messages
- from the wd93 scsi driver, along with sense codes, and in some cases,
- likely causes for the error message. I provided the raw data, and
- one of the TAC engineers cleaned it up.
-
- There may be minor errors or inconsistencies that we haven't spotted,
- but in general it should prove helpful.
-
- =================================================================
- DATE
- November 8, 1991
-
- TITLE
- Small Computer System Interface (SCSI) Controller Error Messages
-
- QUESTION
- What do the phase and sense errors mean?
-
- ANSWER
-
- This article lists the error strings that are printed by the
- device drivers, tpsc, dksc, and (in some cases) devscsi. The
- specific parts of these error strings are contained in
- Tables 1 - 4, which follow the introduction of this article.
-
- In the IRIX 3.3.3 release the Jaguar SCSI controller board (jag)
- driver prints the information differently than the dksc driver;
- in 4.0.1, they use the same form. In IRIX 3.3.3, the information
- is printed in the form:
-
- sense codes. key %x asc%x asq%x
-
- where
- key is the number from Table 1,
- asc (additional sense code) is from Table 2,
- asq (additional sense qualifier) sometimes provides additional info.
-
- NOTE: often there is only one possible asq for a given asc.
-
-
- An example of a media error string is:
-
-     jag0d2: sense key 3 asc11 asq0 (retry 2)
-     |                 | |     |
-     |                 | |     No additional information
-     |                 | |
-     |                 | "Unrecovered data block read error"
-     |                 | (see Table 2)
-     |                 |
-     |                 "Media Error" (see Table 1)
-     |
-     device number
-
- The asq tends to be more vendor specific, although the ANSI SCSI-2
- specification defines the "standard" sense qualifiers.
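A minimal sketch of pulling the three numbers out of a 3.3.3-style line such as the example above, assuming the values are hex (the format uses %x) and that the line follows the "sense key N ascN asqN" form shown. The function and pattern names are made up for illustration:

```python
import re

# Matches the "sense key N ascN asqN" part of the example line; all values are hex.
SENSE_RE = re.compile(
    r"sense key\s*([0-9a-fA-F]+)\s+asc\s*([0-9a-fA-F]+)\s+asq\s*([0-9a-fA-F]+)"
)


def parse_sense(line):
    """Return (key, asc, asq) as integers, or None if the line has no sense info."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, asq = (int(group, 16) for group in match.groups())
    return key, asc, asq


print(parse_sense("jag0d2: sense key 3 asc11 asq0 (retry 2)"))
```

The key can then be looked up in Table 1 and the asc in Table 2.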
-
- The integral SCSI controller on your system also often prints messages
- in the form:
-
- sc#[,#,#]: message
-
- where
-
- the first # is the SCSI adapter involved (0 for all
- systems except those with the IO3 (input/output board) which
- supports up to 4 adapters, 0-3),
-
- the #,# pair that follows (the SCSI ID and LUN of the device) is
- printed only if the driver knows which device is causing the problem.
-
- An example of this type of message string is:
-
- SC0,1: resetting SCSI bus: Spurious SCSI interrupt, no connected channel
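A rough sketch of splitting the sc#[,#,#] prefix from the message text, for anyone scanning SYSLOG by script. It handles only the sc#[,#,#] form described above; reading the two optional numbers as SCSI ID and LUN, and the field names used here, are assumptions for illustration:

```python
import re

# adapter, then an optional ",id,lun" pair, then the message text
SC_RE = re.compile(r"^sc(\d+)(?:,(\d+),(\d+))?:\s*(.*)$", re.IGNORECASE)


def parse_sc(line):
    """Split an 'sc#[,#,#]: message' line into its parts, or return None."""
    match = SC_RE.match(line)
    if match is None:
        return None
    adapter, target, lun, message = match.groups()
    return {
        "adapter": int(adapter),
        "target": None if target is None else int(target),  # None: device unknown
        "lun": None if lun is None else int(lun),
        "message": message,
    }


print(parse_sc("sc0,4,0: cmd=0x1b timeout after 60 sec. Resetting SCSI Bus"))
```

A line with only the adapter number (for example "sc0: unexpected reselection") comes back with target and lun set to None.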
-
- In a number of cases, a phase and possibly a state are printed.
- These error codes come from the files
-
- /usr/include/sys/scsidev.h
- /usr/include/sys/scsi.h
-
- The state and phase meanings are listed in Table 4. A few comments
- have been added. Some of the messages are also included.
-
- NOTE: These tables are used by the smfd, tpsc, and dksc drivers
- and will probably be used by other SCSI drivers in the
- future. That is why they are now in scsi.c, even though
- not referenced here.
-
- Table 1 - Primary sense key information
- -------
-
-
- Message                      Sense Key   Most common cause(s)
-
- No sense                     0x0         No error information available
- Recovered Error              0x1         The device recovered by itself
- Drive not ready              0x2         No media or not spun up
- Media error                  0x3         An actual media problem
- Unrecoverable device error   0x4         Usually a device hardware error
- Illegal request              0x5         Invalid command or data issued
- Unit Attention               0x6         Device was reset or power cycled
- Data protect error           0x7         Usually device is write protected
- Unexpected blank media       0x8         Tried to read at end of a tape
- Vendor Unique error          0x9         Varies
- Copy Aborted                 0xa         Copy cmd aborted by host (not used)
- Aborted command              0xb         Target aborted command
- Search data successful       0xc         Search Data command OK (not used)
- Volume overflow              0xd         Tried to write past EOT on tape
- Reserved (0xE)               0xe         should not be seen
- Reserved (0xF)               0xf         should not be seen
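Table 1 as a lookup, as a minimal sketch. It assumes the standard SCSI sense data layout, where the sense key is the low 4 bits of byte 2 of the data returned by REQUEST SENSE. The names are illustrative and not taken from the driver source:

```python
# Messages from Table 1, indexed by sense key 0x0 through 0xf.
SENSE_KEYS = [
    "No sense", "Recovered Error", "Drive not ready", "Media error",
    "Unrecoverable device error", "Illegal request", "Unit Attention",
    "Data protect error", "Unexpected blank media", "Vendor Unique error",
    "Copy Aborted", "Aborted command", "Search data successful",
    "Volume overflow", "Reserved (0xE)", "Reserved (0xF)",
]


def sense_key_message(sense_byte2):
    """Map byte 2 of the sense data to its Table 1 message."""
    # The upper 4 bits of this byte carry other flags, so mask them off.
    return SENSE_KEYS[sense_byte2 & 0x0F]


print(sense_key_message(0x03))
```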
-
-
- Table 2 - Additional Sense Code
- This table provides further information on the cause of an error.
- The ASQ (additional sense qualifier) is printed numerically when it
- is non-zero (in 4.0; in 3.3.3 it is always printed by the Jaguar driver).
- Missing numerical values are not printed, either because they are not
- defined, or because the drivers treat them specially.
- This table is provided primarily so the additional sense codes
- can be looked up in the device manual. Some are self explanatory,
- others quite obscure.
-
- Message Additional Sense Code
-
- No index/sector signal 0x01
- No seek complete 0x02
- Write fault 0x03
- Drive not ready 0x04
- Drive not selected 0x05
- No track 0 0x06
- Multiple drives selected 0x07
- LUN communication error 0x08
- Track error 0x09
- Error log overflow 0x0a
- Write error 0x0c
- ID CRC or ECC error 0x10
- Unrecovered data block read error 0x11
- No addr mark found in ID field 0x12
- No addr mark found in Data field 0x13
- No record found 0x14
- Seek position error 0x15
- Data sync mark error 0x16
- Read data recovered with retries 0x17
- Read data recovered with ECC 0x18
- Defect list error 0x19
- Parameter overrun 0x1a
- Synchronous transfer error 0x1b
- Defect list not found 0x1c
- Compare error 0x1d
- Recovered ID with ECC 0x1e
- Invalid command code 0x20
- Illegal logical block address 0x21
- Illegal function 0x22
- Illegal field in CDB 0x24
- Invalid LUN 0x25
- Invalid field in parameter list 0x26
- Media write protected 0x27
- Media change 0x28
- Device reset 0x29
- Log parameters changed 0x2a
- Copy requires disconnect 0x2b
- Command sequence error 0x2c
- Update in place error 0x2d
- Tagged commands cleared 0x2f
- Incompatible media 0x30
- Media format corrupted 0x31
- No defect spare location available 0x32
- Media length error 0x33
- Toner/ink error 0x36
- Parameter rounded 0x37
- Saved parameters not supported 0x39
- Medium not present 0x3a
- Forms error 0x3b
- Invalid ID msg 0x3d
- Self config in progress 0x3e
- Device config has changed 0x3f
- RAM failure 0x40
- Data path diagnostic failure 0x41
- Power on diagnostic failure 0x42
- Message reject error 0x43
- Internal controller error 0x44
- Select/Reselect failed 0x45
- Soft reset failure 0x46
- SCSI interface parity error 0x47
- Initiator detected error 0x48
- Inappropriate/Illegal message 0x49
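A small sketch of how a Table 2 lookup might be done, with the asq appended only when it is non-zero as described for 4.0. Only a handful of entries are copied in here, and the output layout is invented for illustration; it is not the exact text the drivers print:

```python
# A few entries copied from Table 2; extend as needed.
ASC_MESSAGES = {
    0x11: "Unrecovered data block read error",
    0x27: "Media write protected",
    0x29: "Device reset",
    0x3A: "Medium not present",
    0x47: "SCSI interface parity error",
}


def describe_asc(asc, asq=0):
    """Return the Table 2 message for asc, falling back to the raw number."""
    text = ASC_MESSAGES.get(asc, "asc 0x%02x" % asc)
    if asq:  # qualifier shown only when non-zero
        text += " (asq 0x%x)" % asq
    return text


print(describe_asc(0x11))
print(describe_asc(0x0B, 2))
```

Codes missing from the table fall through to the numeric form, which is the value to look up in the device manual.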
-
-
- Table 3
- -------
-
- This next section lists the messages that are printed by the SCSI driver.
- After the message is printed, the driver resets the SCSI bus.
- These messages are from IRIX 4.0, but similar ones are printed by
- earlier releases.
-
- "timeout after %d %ssec"
- A SCSI command didn't complete within the reported number of
- seconds (or milliseconds, if it wasn't an even number of seconds).
- This happens only if the command was successfully sent to the
- device; i.e., the device is at least partly alive.
-
- "Spurious SCSI interrupt, no connected channel"
- This happens in cases that would have panic'ed 3.3. We got
- a SCSI interrupt that should only happen when a device is
- logically connected on the bus (command or transfer in progress)
- but the driver doesn't think any device is active. Seems to
- happen with some configurations after a SCSI bus reset has
- occurred; possibly could be due to SCSI bus problems also.
-
- "illegal disconnection interrupt: phase %x"
- A disconnect message has been received in a phase where it shouldn't
- have occurred. (A disconnect message is sent by a device when
- relinquishing the bus temporarily, as when it is seeking.)
- This could be caused (improbably) by some kind of SCSI bus problem.
-
- "reselect without ID"
- Some device has reselected (after a disconnect), but the ID
- of the reselector hasn't been received by the SCSI chip. This
- could be a cabling problem, or some kind of device failure.
-
- "unexpected message in %x, phase %x",
- "Unexpected msg in %x, phase %x"
- A device has either gone to a message in phase at a point where
- neither the SCSI chip nor the driver is prepared for it, or a
- message phase is on the bus when no devices are connected. The
- first number printed is the message byte. This can also be caused
- by cabling problems, and in some cases by devices failing a SCSI
- sync negotiation in a way the driver isn't prepared to handle.
-
- "Unexpected info phase %x, state %x"
- Same as above, but it was some SCSI bus phase other than message in.
-
- "Hardware error"
- Almost always caused by a driver or chip problem, despite
- the word hardware. The SCSI chip has reported a failure to
- select some SCSI device on the bus, but the driver didn't think
- it was at the point where it should have been doing a select.
-
- "unexpected reselection"
- Some device has reselected us, but either no device was
- active (only sc# in message), or the device that selected
- us wasn't in the disconnected state (sc#,#,# in msg).
-
- "disconnected on non-word boundary (addr=%x, 0x%x left)"
- On all of the 4D series, DMA has to be aligned on a 32
- bit boundary at the start of DMA. If a device disconnects
- from the bus in the middle of a transfer, and then reconnects
- to transfer more data, but the disconnect wasn't after a multiple
- of 4 bytes were transferred, then we can't transfer the rest of
- the data correctly, so we abort the transfer. This has been seen
- to happen on the Exabyte 8500 when used under 3.3, and will happen
- with some models of DAT drives (other than the one qualified by SGI).
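The alignment rule in the last message can be sketched as follows. This is an illustration of the check only, not the driver's code; it assumes addr is the address where the transfer would have to resume and that the second value is the byte count still outstanding, and the function name is hypothetical:

```python
WORD_SIZE = 4  # bytes; DMA has to start on a 32-bit boundary


def disconnect_error(start_addr, total_bytes, bytes_done):
    """Return the error text if a disconnect left the transfer unaligned, else None."""
    if bytes_done % WORD_SIZE == 0:
        return None  # safe to restart DMA at the next word
    return "disconnected on non-word boundary (addr=%x, 0x%x left)" % (
        start_addr + bytes_done,
        total_bytes - bytes_done,
    )


print(disconnect_error(0x1000, 512, 256))
print(disconnect_error(0x1000, 512, 254))
```

A device that disconnects after 256 bytes is fine; one that disconnects after 254 bytes leaves the next DMA start two bytes short of a word boundary, so the transfer is aborted.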
-
- Table 4
- -------
-
- Several of the SCSI states and phases are listed below. There
- are other possible states and phases but they rarely occur.
- The SCSI states and phases are listed in the file,
- /usr/include/sys/scsidev.h. The comments below have been extracted
- from this file and supplemented with additional information.
-
-
- Note: "out" is from the CPU to the SCSI device in these
- descriptions and "receive" and "send" are also from the
- SCSI device point of view, since the target controls all
- of the bus phases except for initial selection.
-
- ST_RESET 0x00 SCSI chip Reset by reset command or power-up
- ST_SELECT 0x11 Selection of target complete (after C93SELATN)
- ST_SATOK 0x16 Select-And-Transfer completed successfully
- that is, all phases have completed in a
- normal manner
-
- ST_TR_DATAOUT 0x18 transfer cmd done, target requesting data
- ST_TR_DATAIN 0x19 transfer cmd done, target sending data
- ST_TR_STATIN 0x1b Target is sending status in
- ST_TR_MSGIN 0x1f transfer cmd done, target sending msg
- ST_TRANPAUSE 0x20 transfer cmd has paused with ACK
- above 5 seen during sync negotiations
-
- ST_SAVEDP 0x21 Save Data Pointers message during SAT
- normal state when device is disconnecting
- from the bus
- ST_A_RESELECT 0x27 reselected after disc (93A)
- ST_UNEXPDISC 0x41 An unexpected disconnect
- device disconnected without sending a
- disconnect message; sometimes happens when
- devices with removable media have had
- the media removed during a transfer.
- ST_PARITY 0x43 cmd terminated due to parity error on the SCSI bus
- ST_PARITY_ATN 0x44 cmd terminated due to parity error
- (ATN is asserted so that host can send a
- message to device; we just abort the transfer)
- ST_TIMEOUT 0x42 Time-out during Select or Reselect
- that is, the device never responded to an
- attempt to select it; normally seen only
- during hardware inventory probing, but sometimes
- happens after a SCSI bus reset, if device takes
- a long time to recover from the reset, or is
- powered off
- ST_INCORR_DATA 0x47 incorrect message or status byte
- ST_UNEX_RDATA 0x48 Unexpected receive data phase
- device tried to send us more data than we
- programmed the SCSI chip to expect. This
- can be OK, as when a high level request is
- made to transfer more data than the DMA
- hardware can map on a single request. In
- this case, we simply reprogram the DMA
- hardware for the next chunk of data, and
- restart the transfer (but we don't send a
- new SCSI command to the device). When
- printed as part of an error message, it can
- sometimes be caused by a SCSI cabling
- problem, or (particularly with devscsi
- user drivers) by a mismatch in the byte
- count given to the driver and the byte
- count implied by the SCSI command sent to
- the device.
-
- ST_UNEX_SDATA 0x49 Unexpected send data phase
- Same as above, but device is asking us
- to send more data.
- ST_UNEX_CMDPH 0x4a Unexpected cmd phase
- ST_UNEX_SSTATUS 0x4b Unexpected send status phase
- status phases occur at the end of
- a SCSI command (i.e. byte count remaining
- is 0); if they happen at other times, the
- chip interrupts. This can frequently
- happen when we ask a device for more data
- than it can give us, and in this case
- we just return a short i/o count to the
- caller. When printed as part of an
- error message, it usually implies a
- cabling or termination problem.
- ST_UNEX_RMESGOUT 0x4e Unexpected request msg out phase
- usually indicates a SCSI cabling problem.
- ST_UNEX_SMESGIN 0x4f Unexpected send message in phase
- when printed in an error message, usually indicates a SCSI
- cabling problem; also happens when device
- sends a disconnect message in normal use
- when preparing to disconnect from the bus
- ST_RESELECT 0x80 WD33C93 has been reselected
- ST_93A_RESEL 0x81 reselected while idle (93A)
- ST_DISCONNECT 0x85 Disconnect has occurred
- ST_NEEDCMD 0x8a Target is ready for a cmd
- ST_REQ_SMESGOUT 0x8e REQ signal for send message out
- ST_REQ_SMESGIN 0x8f REQ signal for send message in
- above 3 usually seen only during sync
- negotiations.
-
- Phases during a Select and Transfer command
-
- PH_NOSELECT 0x00 Selection not successful
- PH_SELECT 0x10 Selection successful
- PH_IDENTSEND 0x20 Identify message sent (during selection
- when sending initial command to a device)
- phase 30 indicates none of the cmd bytes
- have yet been sent; every cmd byte sent
- increments that by one.
- PH_CDB_START 0x30 Start of CDB transfers
- PH_CDB_6 0x36 6th cmd byte sent
- PH_CDB_10 0x3a 10th (0xA) cmd byte sent
- PH_CDB_12 0x3c 12th (0xC) cmd byte sent
- PH_SAVEDP 0x41 Save data pointers
- PH_DISCRECV 0x42 Disconnect message received
- PH_DISCONNECT 0x43 Target disconnected
- PH_RESELECT 0x44 Original target reselected
- PH_IDENTRECV 0x45 Correct identify (right LUN) message rcv'd
- (during reselection)
- PH_DATA 0x46 Data transfer completed (expect status next)
- PH_STATUSRECV 0x50 Status byte received (expect cmd complete next)
- PH_COMPLETE 0x60 Command complete message received;
- SCSI command is finished, and SCSI bus
- is free.
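Since every command byte sent bumps the phase by one from 0x30, the number of command bytes that went out before a failure can be read straight off the phase value. A minimal sketch, using the constants from the table above with an illustrative function name:

```python
PH_CDB_START = 0x30  # no cmd bytes sent yet
PH_CDB_12 = 0x3C     # 12th cmd byte sent (longest command listed above)


def cdb_bytes_sent(phase):
    """Return how many cmd bytes were sent, or None if phase is not a CDB phase."""
    if PH_CDB_START <= phase <= PH_CDB_12:
        return phase - PH_CDB_START
    return None


print(cdb_bytes_sent(0x36))
```

So a phase of 0x33 reported for a 6-byte command would mean the target stopped taking the command after only 3 bytes.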
-
- =================================================================
- --
- Let no one tell me that silence gives consent, | Dave Olson
- because whoever is silent dissents. | Silicon Graphics, Inc.
- Maria Isabel Barreno | olson@sgi.com
-