NetNews Usenet Archive 1992 #19

Internet Message Format | 1992-09-01 | 3.1 KB

Path: sparky!uunet!gumby!wupost!usc!hacgate!nuntius
From: mark@hti.hac.com (Mark Johnson)
Newsgroups: comp.software-eng
Subject: Re: Testing Complex Systems
Message-ID: <23103@hacgate.SCG.HAC.COM>
Date: 1 Sep 92 15:23:15 GMT
References: <1992Aug31.135414.5265@linus.mitre.org> <1992Aug31.222254.19432@cactus.org>
Sender: news@hacgate.SCG.HAC.COM
Organization: Hughes Training, Inc.
Lines: 69
X-UserAgent: Nuntius v1.1

I'll provide some background on the techniques used here to get a system
out the door, with a concrete example of working around a serious design
flaw. [long]

a. Functional requirements testing. Enough said.
b. Tracking system (& software) MTBF to flag non-deterministic
   problems; then using the system architecture and post-mortems of
   failures to determine the root cause of the problems.
c. Modelling the system tasking & data flow to provide early warning of
   problems in the distributed system.
d. Host-based testing using multitasking to simulate distributed
   operations. We run the "application level" operational code on the
   host machine on top of a multitasking test bed.
e. Rerunning the host-based testing scenarios on the target hardware to
   resolve problems with compilers, bugs in target-specific code, etc.
f. Delivering systems with diagnostic routines included to aid on-site
   analysis.

A system I once worked on would stop working "sometimes". It was
basically a star configuration, with a central computer & remote
computers connected by RS-232 lines. A graphics processor was attached
to one of the remote computers by another RS-232 line.

We isolated the problem to characters being dropped in the interrupt
handler (IH) for the line to the graphics processor. We traced it to a
design flaw: excessive latency in a problem interrupt handler (PIH).
Redesigning this part of the system was judged "too hard". We did add a
capability to stress this part of the system, which is how we got the
MTBFs indicated below.

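
The failure mode described above can be sketched as a small timing model.
This is only an illustration under stated assumptions, not the actual
system: the post gives no baud rate, handler timings, or UART details, so
the 9600 baud rate, the 10-bit character frame, the single-character
receive register, and the names `char_time_us`, `service_time` and
`count_dropped` are all hypothetical. The idea is that a character is lost
whenever the line handler is held off for longer than one character time.

```python
def char_time_us(baud, bits_per_char=10):
    """Microseconds per character (assumed frame: start + 8 data + stop)."""
    return bits_per_char * 1_000_000 // baud


def service_time(arrival, blocked):
    """Earliest time the line handler can read a character that arrived at
    `arrival`, given intervals (start, end) when a higher-priority handler
    holds it off. Reading itself is treated as instantaneous."""
    for start, end in blocked:
        if start <= arrival < end:
            return end
    return arrival


def count_dropped(arrivals, blocked):
    """Count characters overwritten in the assumed one-character receive
    register: a character is lost if the next one arrives before it is read."""
    dropped = 0
    for current, following in zip(arrivals, arrivals[1:]):
        if service_time(current, blocked) > following:
            dropped += 1
    return dropped


# Hypothetical stress run: ten back-to-back characters at 9600 baud.
period = char_time_us(9600)
arrivals = [i * period for i in range(10)]

short_latency = [(2000, 2500)]   # PIH holds the line IH off for 0.5 ms
long_latency = [(2000, 5000)]    # PIH holds the line IH off for 3 ms

print(count_dropped(arrivals, short_latency))
print(count_dropped(arrivals, long_latency))
```

In this model a hold-off shorter than one character time costs nothing,
while a longer one loses characters, which is why the failures only show
up "sometimes" and why stressing the line shortens the MTBF.
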
Normal operations had much better MTBFs than those listed below.

Fix 1 was to lower the interrupt level after entry to the PIH. MTBF grew
from 5 minutes to 15 minutes. A new problem was seen when the graphics
line IH was interrupted by the PIH.

Fix 2 was to recognize when the graphics line IH had been pre-empted,
resume it, and then complete the PIH handling at a low interrupt level.
MTBF grew to 30 minutes. A new problem was seen when the clock
interrupted the graphics line IH & THAT was interrupted by the PIH.

Fix 3 was to recognize that situation as well. MTBF grew to over a day,
which was OK for this application (typical runtime was less than 1 hour).

Needless to say, this took several months to resolve in the field; the
flaw should have been removed in the design over a year earlier. Testing
won't fix a broken system; a robust design that can be verified matters
much more.

As a result of this problem, the systems we build here are much more
robust, since we review the basic system architecture & design much more
closely. System timing is a key component of that review.

--Mark Johnson
<mark@bart.dnet.hac.com>
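
The difference between Fix 2 and Fix 3 can be sketched with a toy model of
nested interrupt handlers. This is a simulation under assumptions, not the
code from the system described: the class `InterruptModel`, the function
`simulate`, the handler names, and the `check_whole_chain` switch are all
hypothetical. With the switch off, the PIH only looks at the handler it
directly interrupted (Fix 2); with it on, the PIH looks at the whole chain
of pre-empted handlers (Fix 3). In both cases, when the PIH sees that the
graphics line IH is pending, it postpones its own work until every handler
has finished.

```python
from collections import deque

GRAPHICS, CLOCK, PIH = "graphics_ih", "clock_ih", "pih"


class InterruptModel:
    """Toy model of handler nesting; records the order handlers complete."""

    def __init__(self, check_whole_chain):
        self.check_whole_chain = check_whole_chain
        self.active = []         # handlers currently running or pre-empted
        self.deferred = deque()  # PIH work postponed to low interrupt level
        self.trace = []          # completion order

    def graphics_preempted(self):
        if self.check_whole_chain:           # Fix 3: anywhere in the chain
            return GRAPHICS in self.active
        # Fix 2: only the handler that was directly interrupted
        return bool(self.active) and self.active[-1] == GRAPHICS

    def run(self, name, nested=()):
        """Run handler `name`; `nested` lists (handler, nested) pairs that
        interrupt it while it is running."""
        if name == PIH and self.graphics_preempted():
            self.deferred.append(name)       # let the graphics IH resume
            return
        self.active.append(name)
        for child, grandchildren in nested:
            self.run(child, grandchildren)
        self.active.pop()
        self.trace.append(name)
        if not self.active:                  # back at low interrupt level
            while self.deferred:
                self.trace.append(self.deferred.popleft() + " (deferred)")


def simulate(check_whole_chain, scenario):
    model = InterruptModel(check_whole_chain)
    name, nested = scenario
    model.run(name, nested)
    return model.trace


# PIH interrupts the graphics line IH directly.
direct = (GRAPHICS, [(PIH, [])])
# Clock interrupts the graphics line IH, and the PIH interrupts the clock.
via_clock = (GRAPHICS, [(CLOCK, [(PIH, [])])])

for label, whole_chain in (("Fix 2", False), ("Fix 3", True)):
    print(label, simulate(whole_chain, direct), simulate(whole_chain, via_clock))
```

In the `via_clock` scenario the Fix 2 check misses the pending graphics
line IH, so the PIH completes first and the graphics line IH is still held
off; the Fix 3 check defers the PIH in both scenarios. Note that the model
only shows ordering; it says nothing about the real timing margins.
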