- Path: sparky!uunet!gumby!wupost!usc!hacgate!nuntius
- From: mark@hti.hac.com (Mark Johnson)
- Newsgroups: comp.software-eng
- Subject: Re: Testing Complex Systems
- Message-ID: <23103@hacgate.SCG.HAC.COM>
- Date: 1 Sep 92 15:23:15 GMT
- References: <1992Aug31.135414.5265@linus.mitre.org>
- <1992Aug31.222254.19432@cactus.org>
- Sender: news@hacgate.SCG.HAC.COM
- Organization: Hughes Training, Inc.
- Lines: 69
- X-UserAgent: Nuntius v1.1
-
- I'll give some background on the techniques we use here to get a system
- out the door, followed by a concrete example of working around a serious
- design flaw. [long]
- a. Functional requirements testing. Enuf said.
- b. Tracking system (& software) MTBF to reveal non-deterministic problems,
-    then using the system architecture and post-mortems of failures to
-    determine the root cause of the problems.
- c. Modelling the system tasking & data flow to provide early warning of
-    problems in the distributed system.
- d. Host based testing using multitasking to simulate distributed
-    operations. We run the "application level" operational code on the
-    host machine on top of a multitasking test bed.
- e. Rerunning the host based testing scenarios on the target hardware to
-    resolve problems with compilers, bugs in target specific code, etc.
- f. Delivering systems with diagnostic routines included to aid on-site
-    analysis.
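
As a rough illustration of item (b), observed MTBF is just total operating time divided by the number of failures seen in that time. The sketch below is not from the original system; the function names, the sample run log, and the exponential failure model used to estimate the chance of a failure during one run are all assumptions made for illustration.

```python
import math

def mtbf(run_hours, failures):
    """Observed MTBF: total operating time divided by failure count."""
    total = sum(run_hours)
    if failures == 0:
        # No failures observed yet, so the estimate is unbounded.
        return math.inf
    return total / failures

def failure_probability(run_time, mtbf_value):
    """Chance of at least one failure during run_time.

    Assumes failures arrive at a constant rate (exponential model),
    which is a simplification and may not hold for a real system.
    """
    return 1.0 - math.exp(-run_time / mtbf_value)

# Hypothetical log: four test runs (hours) with three failures in total.
runs = [2.0, 3.5, 1.5, 5.0]
observed = mtbf(runs, 3)
print("observed MTBF (hours):", observed)
print("P(failure in a 1 hour run):", failure_probability(1.0, observed))
```

Recomputing this after each change gives a simple trend line: if a fix is real, the observed MTBF under the same stress should move noticeably.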
-
- A system I once worked on would stop working "sometimes". It was
- basically a star configuration: a central computer & remote computers
- connected by RS-232 lines. A graphics processor was attached to one of
- the remote computers by another RS-232 line. We had isolated the problem
- to characters being dropped in the interrupt handler for the line to the
- graphics processor. We identified a design flaw: excessive latency caused
- by a problem interrupt handler (PIH). It was determined to be "too hard"
- to redesign this part of the system. We did add a capability to stress
- this part of the system, which produced the MTBFs indicated below. Normal
- operations had much better MTBFs than those listed below.
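
To make the latency problem concrete: on an asynchronous serial line, a receive handler has roughly one character time per buffered character to pick up incoming data before it is overwritten. The sketch below works out that budget. The baud rate, the 10 bits per character (start + 8 data + stop), and the single-character receive buffer are assumptions for illustration; the post does not give the actual line settings or hardware.

```python
def char_time_s(baud, bits_per_char=10):
    """Time on the wire for one character, in seconds."""
    return bits_per_char / baud

def drops_character(latency_s, baud, buffered_chars=1, bits_per_char=10):
    """True if the handler is held off longer than the receiver can buffer.

    buffered_chars=1 models a receiver with no FIFO (assumed here).
    This is a simplified worst-case model, not a hardware simulation.
    """
    return latency_s > buffered_chars * char_time_s(baud, bits_per_char)

# Hypothetical numbers: 9600 baud line, handler blocked for 2 ms.
print("character time (s):", char_time_s(9600))
print("2 ms latency drops data:", drops_character(0.002, 9600))
print("0.5 ms latency drops data:", drops_character(0.0005, 9600))
```

Under these assumptions, any other handler that can hold off the line's handler for longer than about one character time will eventually cost a character, which is why the failure showed up only "sometimes".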
- Fix 1 was to lower the interrupt level after entry to the PIH. MTBF grew
-    from 5 minutes to 15 minutes. A new problem was seen when the graphics
-    line IH was interrupted by the PIH.
- Fix 2 was to recognize when the graphics line IH was pre-empted, resume
-    it, and then complete the PIH handling at a low interrupt level. MTBF
-    grew to 30 mins. A new problem was seen when the clock interrupted the
-    graphics line IH & THAT was interrupted by the PIH.
- Fix 3 was to recognize that situation as well. MTBF grew to over a day,
-    which was OK for this application (typical runtime was less than 1
-    hour).
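
The progression from Fix 2 to Fix 3 can be modelled as looking further down the chain of pre-empted handlers. The sketch below is only a model of that idea, not the code used in the system described: the handler names, the list representation of the interrupted chain, and the `depth` parameter are all hypothetical. A real handler would inspect saved interrupt state rather than a Python list.

```python
GRAPHICS_IH = "graphics_ih"
CLOCK_IH = "clock_ih"

def graphics_ih_preempted(interrupted, depth=None):
    """Check whether the graphics line IH is among the pre-empted handlers.

    interrupted: handlers pre-empted at PIH entry, innermost first.
    depth: how many levels to inspect; None inspects the whole chain.
    """
    window = interrupted if depth is None else interrupted[:depth]
    return GRAPHICS_IH in window

def pih_entry(interrupted, depth=None):
    """Decide whether the PIH should defer its work to a low level."""
    if graphics_ih_preempted(interrupted, depth):
        return "defer"      # resume the graphics IH first, finish later
    return "run_now"

direct = [GRAPHICS_IH]              # PIH interrupted the graphics IH
nested = [CLOCK_IH, GRAPHICS_IH]    # PIH interrupted clock, which had
                                    # interrupted the graphics IH

print(pih_entry(direct, depth=1))   # Fix 2 style check sees this case
print(pih_entry(nested, depth=1))   # ...but misses the nested case
print(pih_entry(nested, depth=2))   # Fix 3 style check catches it
```

Each fix in this model widens the window by one level, which matches the pattern in the post: every patch exposed the next, rarer nesting. Checking the whole chain (`depth=None`) closes the class in the model, though only a redesign removes the underlying latency.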
- Needless to say, this took several months to resolve in the field, for
- something that should have been removed in the design over a year
- previously. Testing won't fix a broken system; a robust design that can
- be verified is much more significant. As a result of this problem, the
- systems we build here are much more robust, since we now review the basic
- system architecture & design much more closely. System timing is a key
- component of that review.
-
- --Mark Johnson <mark@bart.dnet.hac.com>
-