- Path: sparky!uunet!spool.mu.edu!agate!ucbvax!lrw.com!leichter
- From: leichter@lrw.com (Jerry Leichter)
- Newsgroups: comp.os.vms
- Subject: re: Looking for advice on tracking down a system crash
- Message-ID: <9212181327.AA17581@uu3.psi.com>
- Date: 18 Dec 92 11:58:11 GMT
- Sender: daemon@ucbvax.BERKELEY.EDU
- Distribution: world
- Organization: The Internet
- Lines: 39
-
-
- I have a system running VMS 5.4. We have batch jobs that run one
- process per job that control communications cards. The software was
- recently upgraded from VMS 4.7 to 5.4. We have seen a few random
- non-reproducible system crashes recently. Every system crash is due
- to the same executable. Each bugcheck is a Machine check while in
- kernel mode. I am looking for advice on how to track this down. I
- have saved the system dump files but not knowing how to read them puts
- me at a disadvantage. I have suggested adding a SET PROCESS/DUMP
- before the executable is run, so that is in place now, but we have not
- had a crash since.
-
- Using debug is out of the question since the device the executable
- controls is realtime. Would compiling with debug but not linking with
- debug also help? I am looking for any help I can get since I am
- experienced at debugging but not real-time devices or those that cause
- system crashes!
-
- SET PROCESS/DUMP, DEBUG, and such will not help; a machine check will yank the
- machine out from under them. However, even if it didn't, they are beside the
- point: They are user-mode debugging utilities, and we are talking about a
- kernel-mode problem.
-
- A machine check usually means just that: A hardware problem. Given the
- symptom of a machine check, my guess as to the cause is:
-
- 99% Hardware flaky.
- <1% Privileged code problem "faking" a machine check.
- <1% VMS bug.
-
- Those "<1%" are actually MUCH less than 1% - in fact, I've never seen this
- happen. While bugs in privileged code and in VMS certainly cause crashes,
- it takes some doing to simulate a machine check.
-
- Have you checked the system's error logs? A real-time process is presumably
- making heavy use of one or more I/O devices - look particularly at the state
- of health of those devices.
- -- Jerry
-
-