- Path: sparky!uunet!dtix!darwin.sura.net!wupost!waikato.ac.nz!comp.vuw.ac.nz!zl2tnm!toyunix!don
- From: don@zl2tnm.gen.nz (Don Stokes)
- Newsgroups: comp.os.vms
- Subject: Re: HELP!!! Security problem for gurus. [Directories]
- Message-ID: <86009@zl2tnm.gen.nz>
- Date: 10 Jan 93 13:45:01 GMT
- References: <1iou2eINN372@gap.caltech.edu>
- Sender: news@zl2tnm.gen.nz (GNEWS Version 2.0 news poster.)
- Organization: The Wolery
- Lines: 91
-
- carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writes:
- > Agreed. That's one of the many uses of standalone BACKUP. Shut the system
- > down *NOW*, before you run the chance of rendering the disk unreadable.
-
- Not necessarily. It's still running. Start looking. SHOW ERROR. Run
- Did a file get overwritten? When? What's running that might have done it?
- Is the disk structure intact? ANALYZE/DISK. We need to know now. Time
- is money.
-
- I've already outlined a case where the very act of shutting down was one
- of the causes of the problem. You're advocating taking an action before
- diagnosing the problem. If it looks serious, I can at least stabilise
- things by chucking the users off and stopping queues. If it's really bad
- I'm going to have to restore from backup anyway.
-
- > And how certain can you be that you've *REALLY* fixed the problem? Sure, you
- > can track down and repair the symptoms, but do you really want to trust a
- > system that's been running for some time with either the system disk or memory
- > corrupted?
-
- If you can find a cause, you can often find the effect. If in doubt,
- restore. But no bugcheck handler can ever make that decision for you
- reliably.
-
- > True. I'm assuming that you've got at least two disks on the system. Shut
- > down, do a standalone BACKUP of the non-system disk, restore it from your last
- > known good image BACKUP of the system disk, and THEN try to diagnose the
- > problems with the system disk.
-
- ... yeah, you get to find out, three hours and many thousand $ of lost
- production time later, that you accidentally deleted a non-critical file
- and could have fixed the problem in less than two minutes. I'd rather
- you explained that to the CEO.
-
- > And if the system manager isn't there 24 hours a day, 7 days a week, 52 weeks a
- > year? I once managed a 780 that had its floating point processor fail. At
- > the time that happened, I was involved with some stuff that didn't use floating
- > point. It took a user who *WAS* doing lots of floating point work several
- > hours and about a dozen crashes of his program with machine checks for him to
- > decide to notify me of the problem.
-
- Sites that care about their production have people available. I've
- crawled out of bed in the wee smalls more times than I care to count to
- attend to sick systems (usually hardware problems). Operators were
- instructed that if they didn't know what to do, they were to call; they'd
- get a far bigger blasting if they didn't than the odd curse they'd get for
- digging me out unnecessarily. Incidentally, these are the same folks you
- accused of being "so goddamned stupid" in your previous message....
-
- > >It needs to be possible for the system manager to make that decision.
- >
- > For sufficiently minor problems, yes; once the problem gets severe enough, you
- > don't want to risk having the corrupted system running any longer than it takes
- > it to notice that it IS corrupted.
-
- The nature of bugcheck code is that it must be simpleminded.
-
- Taking an action that may be destructive (and rebooting onto a sick system
- disk is a Bad Thing), that may render a recoverable situation unrecoverable,
- or that even just increases the downtime required to fix the problem, is not
- an appropriate response.
-
- The correct response is to make it very clear to all who might be in a
- position to fix it that there is a problem. Software systems that cannot
- continue because of the problem should stop, but as far as possible it
- should be up to something with a brain to sort out what happens next.
-
- > That's why not all BUGCHECKs are fatal. Some are continuable.
-
- Bugchecks come in two varieties: those that blow the system out of the
- water and those that don't. The latter type get logged to the error
- log, and that's about it; otherwise they look like normal exceptions to
- the average user. If you're not watching the error log, you'll never
- know one happened.
-
- Personally, I'm rather a fan of things like VAXsim which will take action
- to notify you when something (usually hardware) goes amiss. Back at that
- printing co I had it automatically email me at home.... (I even got a
- notification of a dead disk after I left the company.... 8-)
-
- > Fine. Then perhaps we're simply arguing about just how severe the problem must
- > be before the system takes matters out of our hands. It might be useful if VMS
- > had more than two classes of BUGCHECKS, and allowed the system manager to set a
- > SYSGEN parameter that specified the lowest class that was considered fatal.
-
- Maybe, maybe not. I like how it is.
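
For concreteness, the proposal quoted above (more than two bugcheck classes, plus a manager-settable parameter naming the lowest class treated as fatal) might look roughly like the sketch below. It is a hypothetical illustration in Python, not VMS code: the class names, `FATAL_THRESHOLD`, and `handle_bugcheck` are all invented for the example, and nothing here reflects how VMS actually classifies bugchecks.

```python
from enum import IntEnum


class BugcheckClass(IntEnum):
    """Hypothetical severity classes; a higher value is more severe."""
    INFO = 0
    WARNING = 1
    SEVERE = 2
    CRITICAL = 3


def handle_bugcheck(severity, fatal_threshold, error_log):
    """Log the bugcheck, then decide whether it is fatal.

    Returns "crash" when severity is at or above the threshold and
    "continue" otherwise. Every bugcheck is logged in both cases.
    """
    error_log.append(severity.name)
    if severity >= fatal_threshold:
        return "crash"
    return "continue"


# Hypothetical stand-in for the proposed SYSGEN parameter.
FATAL_THRESHOLD = BugcheckClass.SEVERE

log = []
results = [handle_bugcheck(s, FATAL_THRESHOLD, log) for s in BugcheckClass]
```

In this sketch, raising `FATAL_THRESHOLD` leaves more of the decision with the system manager, and lowering it hands more of it to the bugcheck handler, which is the trade-off under debate in this thread.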
-
- --
- Don Stokes, ZL2TNM (DS555) don@zl2tnm.gen.nz (home)
- Network Manager, Computing Services Centre don@vuw.ac.nz (work)
- Victoria University of Wellington, New Zealand +64-4-495-5052
-