home *** CD-ROM | disk | FTP | other *** search
- .th CRASH VIII 2/12/75
- .tr |
- .sh NAME
- crash \*- what to do when the system crashes
- .sh DESCRIPTION
- This section gives at least a few clues about how to proceed if the
- system crashes.
- It can't pretend to be complete.
- .s3
- .it "How to bring it back up.||"
- If the reason for the crash is not evident
- (see below for guidance on `evident')
- you may want to try to dump the system if you feel up to
- debugging.
- At the moment a dump can be taken only on magtape.
- With a tape mounted and ready,
- stop the machine, load address 44, and start.
- This should write a copy of all of core
- on the tape with an EOF mark.
- Caution:
- Any error is taken to mean the end of core has been reached.
- This means that you must be sure the ring is in,
- the tape is ready, and the tape is clean and new.
- If the dump fails, you can try again,
- but some of the registers will be lost.
- See below for what to do with the tape.
- .s3
- In restarting after a crash,
- always bring up the system single-user.
- This is accomplished by following the directions in
- .it "boot procedures"
- (VIII)
- as modified for your particular installation;
- a single-user system is indicated by having a particular value
- in the switches (173030 unless you've changed
- .it init)
- as the system starts executing.
- When it is running,
- perform a
- .it dcheck
- and
- .it icheck
- (VIII)
- on all file systems which could have been in use at the time
- of the crash.
- If any serious file system problems are found, they should be repaired.
- When you are satisfied with the health of your disks,
- check and set the date if necessary,
- then come up multi-user.
- This is most easily accomplished by changing the
- single-user value in the switches to something else,
- then logging out
- by typing an EOT.
- .s3
- To even boot \s8UNIX\s10 at all,
- three files (and the directories leading to them)
- must be intact.
- First,
- the initialization program
- .it /etc/init
- must be present and executable.
- If it is not,
- the CPU will loop in user mode at location 6.
- For
- .it init
- to work correctly,
- .it /dev/tty8
- and
- .it /bin/sh
- must be present.
- If either does not exist,
- the symptom is best described
- as thrashing.
- .it Init
- will go into a
- .it fork/exec
- loop trying to create a
- Shell with proper standard input and output.
- .s3
- If you cannot get the system to boot,
- a runnable system must be obtained from
- a backup medium.
- The root file system may then be doctored as
- a mounted file system as described below.
- If there are any problems with the root
- file system,
- it is probably prudent to go to a
- backup system to avoid working on a
- mounted file system.
- .s3
- .it "Repairing disks.||"
- The first rule to keep in mind is that an addled disk
- should be treated gently;
- it shouldn't be mounted unless necessary,
- and if it is very valuable yet
- in quite bad shape, perhaps it should be dumped before
- trying surgery on it.
- This is an area where experience and informed courage count for much.
- .s3
- The problems reported by
- .it icheck
- typically fall into two kinds.
- There can be
- problems with the free list:
- duplicates in the free list, or free blocks also in files.
- These can be cured easily with an
- .it "icheck \*-s."
- If the same block appears in more than one file
- or if a file contains bad blocks,
- the files should be deleted, and the free list reconstructed.
- The best way to delete such a file is to use
- .it clri
- (VIII),
- then remove its directory entries.
- If any of the affected files is really precious,
- you can try to copy it to another device
- first.
- .s3
- .it Dcheck
- may report files which
- have more directory entries than links.
- Such situations are potentially dangerous;
- .it clri
- discusses a special case of the problem.
- All the directory entries for the file should be removed.
- If on the other hand there are more links than directory entries,
- there is no danger of spreading infection, but merely some disk space
- that is lost for use.
- It is sufficient to copy the file (if it has any entries and is useful)
- then use
- .it clri
- on its inode and remove any directory
- entries that do exist.
- .s3
- Finally,
- there may be inodes reported by
- .it dcheck
- that have 0 links and 0 entries.
- These occur on the root device when the system is stopped
- with pipes open, and on other file systems when the system
- stops with files that have been deleted while still open.
- A
- .it clri
- will free the inode, and an
- .it "icheck -s"
- will
- recover any missing blocks.
- .s3
- .it "Why did it crash?||"
- UNIX types a message
- on the console typewriter when it voluntarily crashes.
- Here is the current list of such messages,
- with enough information to provide
- a hope at least of the remedy.
- The message has the form `panic: ...',
- possibly accompanied by other information.
- Left unstated in all cases
- is the possibility that hardware or software
- error produced the message in some unexpected way.
- .s3
- .lp +5 5
- blkdev
- .br
- The
- .it getblk
- routine was called with a nonexistent major device as argument.
- Definitely hardware or software error.
- .s3
- .lp +5 5
- devtab
- .br
- Null device table entry for the major device used as argument to
- .it getblk.
- Definitely hardware or software error.
- .s3
- .lp +5 5
- iinit
- .br
- An I/O error reading the super-block for the root file system
- during initialization.
- .s3
- .lp +5 5
- out of inodes
- .br
- A mounted file system has no more i-nodes when creating a file.
- Sorry, the device isn't available;
- the
- .it icheck
- should tell you.
- .s3
- .lp +5 5
- no fs
- .br
- A device has disappeared from the mounted-device table.
- Definitely hardware or software error.
- .s3
- .lp +5 5
- no imt
- .br
- Like `no fs', but produced elsewhere.
- .s3
- .lp +5 5
- no inodes
- .br
- The in-core inode table is full.
- Try increasing NINODE in param.h.
- Shouldn't be a panic, just a user error.
- .s3
- .lp +5 5
- no clock
- .br
- During initialization,
- neither the line nor programmable clock was found to exist.
- .s3
- .lp +5 5
- swap error
- .br
- An unrecoverable I/O error during a swap.
- Really shouldn't be a panic,
- but it is hard to fix.
- .s3
- .lp +5 5
- unlink \- iget
- .br
- The directory containing a file being deleted can't be found.
- Hardware or software.
- .s3
- .lp +5 5
- out of swap space
- .br
- A program needs to be swapped out, and there is no more swap space.
- It has to be increased.
- This really shouldn't be a panic, but there is no easy fix.
- .s3
- .lp +5 5
- out of text
- .br
- A pure procedure program is being executed,
- and the table for such things is full.
- This shouldn't be a panic.
- .s3
- .lp +5 5
- trap
- .br
- An unexpected trap has occurred within the system.
- This is accompanied by three numbers:
- a `ka6', which is the contents of the segmentation
- register for the area in which the system's stack is kept;
- `aps', which is the location where the hardware stored
- the program status word during the trap;
- and a `trap type' which encodes
- which trap occurred.
- The trap types are:
- .s3
- .lp +10 5
- 0 bus error
- .lp +10 5
- 1 illegal instruction
- .lp +10 5
- 2 BPT/trace
- .lp +10 5
- 3 IOT
- .lp +10 5
- 4 power fail
- .lp +10 5
- 5 EMT
- .lp +10 5
- 6 recursive system call (TRAP instruction)
- .lp +10 5
- 7 11/70 cache parity, or programmed interrupt
- .lp +10 5
- 10 floating point trap
- .lp +10 5
- 11 segmentation violation
- .i0
- .s3
- In some of these cases it is
- possible for octal 20 to be added into the trap type;
- this indicates that the processor was in user mode when the trap occurred.
- If you wish to examine the stack after such a trap,
- either dump the system, or use the console switches to examine core;
- the required address mapping is described below.
- .s3
- .it "Interpreting dumps.||"
- All file system problems
- should be taken care of before attempting to look at dumps.
- The dump should be read into the file
- .it /usr/sys/core;
- .it cp
- (I) will do.
- At this point, you should execute
- .it "ps \*-alxk"
- and
- .it who
- to print the process table and the users who were on
- at the time of the crash.
- You should dump (
- .it od
- (I))
- the first 30 bytes of
- .it /usr/sys/core.
- Starting at location 4,
- the registers R0, R1, R2, R3, R4, R5, SP
- and KDSA6 (KISA6 for 11/40s) are stored.
- If the dump had to be restarted,
- R0 will not be correct.
- Next, take the value of KA6 (location 22(8) in the dump)
- multiplied by 100(8) and dump 1000(8) bytes starting from there.
- This is the per-process data associated with the process running
- at the time of the crash.
- Relabel
- the addresses 140000 to 141776.
- R5 is C's frame or display pointer.
- Stored at (R5) is the old R5 pointing to the previous
- stack frame.
- At (R5)+2
- is the saved PC of the calling procedure.
- Trace
- this calling chain until
- you obtain an R5 value of 141756, which
- is where the user's R5 is stored.
- If the chain is broken,
- you have to look for a plausible
- R5, PC pair and continue from there.
- Each PC should be looked up in the system's name list
- using
- .it db
- (I) and its `:' command,
- to get a reverse calling order.
- In most cases this procedure will give
- an idea of what is wrong.
- A more complete discussion
- of system debugging is impossible here.
- .sh "SEE ALSO"
- clri, icheck, dcheck, boot procedures (VIII)
- .sh BUGS
-