NetNews Usenet Archive 1993 #1

Internet Message Format | 1993-01-10 | 5.1 KB

Path: sparky!uunet!dtix!darwin.sura.net!wupost!waikato.ac.nz!comp.vuw.ac.nz!zl2tnm!toyunix!don
From: don@zl2tnm.gen.nz (Don Stokes)
Newsgroups: comp.os.vms
Subject: Re: HELP!!! Security problem for gurus. [Directories]
Message-ID: <86009@zl2tnm.gen.nz>
Date: 10 Jan 93 13:45:01 GMT
References: <1iou2eINN372@gap.caltech.edu>
Sender: news@zl2tnm.gen.nz (GNEWS Version 2.0 news poster.)
Organization: The Wolery
Lines: 91

carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writes:
> Agreed. That's one of the many uses of standalone BACKUP. Shut the system
> down *NOW*, before you run the chance of rendering the disk unreadable.

Not necessarily. It's still running. Start looking. SHOW ERROR. Run
Did a file get overwritten? When? What's running that might have done it?
Is the disk structure intact? ANALYZE/DISK. We need to know now. Time is
money.

I've already outlined a case where the very act of shutting down was one
of the causes of the problem. You're advocating taking an action before
diagnosing the problem. If it looks serious, I can at least stabilise
things by chucking the users off and stopping queues. If it's really bad
I'm going to have to restore from backup anyway.

> And how certain can you be that you've *REALLY* fixed the problem? Sure, you
> can track down and repair the symptoms, but do you really want to trust a
> system that's been running for some time with either the system disk or memory
> corrupted?

If you can find a cause, you can often find the effect. If in doubt,
restore. But no bugcheck handler can ever make that decision for you
reliably.

> True. I'm assuming that you've got at least two disks on the system. Shut
> down, do a standalone BACKUP of the non-system disk, restore it from your last
> known good image BACKUP of the system disk, and THEN try to diagnose the
> problems with the system disk.

...
yeah, you get to find out, three hours and many thousand $ of lost
production time later, that you accidentally deleted a non-critical file
and could have fixed the problem in less than two minutes. I'd rather you
explained that to the CEO.

> And if the system manager isn't there 24 hours a day, 7 days a week, 52 weeks a
> year? I once managed a 780 that had it's floating point processor fail. At
> the time that happened, I was involved with some stuff that didn't use floating
> point. It took a user who *WAS* doing lots of floating point work several
> hours and about a dozen crashes of his program with machine checks for him to
> decide to notify me of the problem.

Sites that care about their production have people available. I've
crawled out of bed in the wee smalls more times than I care to count to
attend to sick systems (usually hardware problems). Operators were
instructed that if they didn't know what to do, they were to call; they'd
get a far bigger blasting if they didn't than the odd curse they'd get for
digging me out unnecessarily.

Incidentally, these are the same folks you accused of being "so goddamned
stupid" in your previous message....

> >It needs to be possible for the system manager to make that decision.
>
> For sufficiently minor problems, yes; once the problem gets severe enough, you
> don't want to risk having the corrupted system running any longer than it takes
> it to notice that it IS corrupted.

The nature of bugcheck code is that it must be simpleminded. Taking an
action that may be destructive (and rebooting onto a sick system disk is a
Bad Thing), may render a recoverable situation unrecoverable, or even just
increases the downtime required to fix the problem, is not an appropriate
response. The correct response is to make it very clear to all who might
be in a position to fix it that there is a problem.
Software systems that cannot continue because of the problem should stop,
but as far as possible it should be up to something with a brain to sort
out what happens next.

> That's why not all BUGCHECKs are fatal. Some are continuable.

Bugchecks come in two varieties: those that blow the system out of the
water and those that don't. The latter type get logged to the error log,
and that's about it; otherwise they look like normal exceptions to the
average user. If you're not watching the error log, you'll never know one
happened.

Personally, I'm rather a fan of things like VAXsim which will take action
to notify you when something (usually hardware) goes amiss. Back at that
printing co I had it automatically email me at home.... (I even got a
notification of a dead disk after I left the company.... 8-)

> Fine. Then perhaps we're simply arguing about just how severe the problem must
> be before the system takes matters out of our hands. It might be useful if VMS
> had more than two classes of BUGCHECKS, and allowed the system manager to set a
> SYSGEN parameter that specified the lowest class that was considered fatal.

Maybe, maybe not. I like how it is.

--
Don Stokes, ZL2TNM (DS555)                        don@zl2tnm.gen.nz (home)
Network Manager, Computing Services Centre            don@vuw.ac.nz (work)
Victoria University of Wellington, New Zealand             +64-4-495-5052
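
The idea floated at the end of the exchange above (more than two bugcheck
classes, with a manager-set threshold for the lowest class treated as
fatal) could be sketched roughly as follows. This is only an illustration
in Python, not anything VMS provides: the class names, handle_bugcheck(),
and the notify callback are all hypothetical, and the one firm rule it
encodes is the one argued for in the post, namely that somebody always
gets told, whatever else happens.

```python
from enum import IntEnum


class BugcheckClass(IntEnum):
    # Hypothetical severity classes, invented for this sketch.
    # The post only describes two real varieties: fatal and continuable.
    INFORMATIONAL = 1
    DEGRADED = 2
    CORRUPTING = 3


def handle_bugcheck(severity, fatal_threshold, notify):
    """Decide whether a bugcheck halts the system.

    severity        -- BugcheckClass of the event
    fatal_threshold -- lowest BugcheckClass the manager considers fatal
                       (stands in for the proposed SYSGEN-style setting)
    notify          -- callable taking one message string; always called
    """
    action = "halt" if severity >= fatal_threshold else "continue"
    # Notification happens in both cases, so a continuable bugcheck
    # is never visible only in an unread error log.
    notify(f"bugcheck class={severity.name} action={action}")
    return action


if __name__ == "__main__":
    # Example policy: only CORRUPTING events take the system down.
    handle_bugcheck(BugcheckClass.DEGRADED, BugcheckClass.CORRUPTING, print)
    handle_bugcheck(BugcheckClass.CORRUPTING, BugcheckClass.CORRUPTING, print)
```

Setting the threshold to the lowest class reproduces "every bugcheck is
fatal"; setting it to the highest leaves almost everything to the person
being notified, which is closer to the position taken in the reply above.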