requires correction of data, or killing of call and return of control
to a safe point.
Inhibit (pest) interrupts while audits are correcting the problem.
This is risky, but it assumes a single software fault.
Where the out-of-range error can be isolated to a single unit, use frame-level pesting; otherwise use system-level pesting.
Software recovery does not consider the possibility of a hardware fault.
Recovery cannot fix a program bug. Running pested may allow the system to
operate in a degraded fashion while maintenance personnel analyze data and
correct the program.
The buffer overflow problem - may be caused by program error.
Buffers protected by hardware overflow interrupts.
Recovery runs the buffer unloader program to unload the buffer and audits the task dispenser program to ensure the unloader is scheduled properly.
The overflow interrupt is pested.
If problem continues, hardware is suspect.
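A minimal sketch of the overflow recovery sequence noted above. The function name, the list-based buffer and task list, and the repeat limit are all assumptions made for illustration; the paper does not give them.

```python
def recover_from_overflow(buffer, task_list, repeat_count, repeat_limit=3):
    """Model the overflow recovery steps and report what was done.

    buffer       -- list standing in for the overflowed buffer (assumed)
    task_list    -- list of task names standing in for the task dispenser (assumed)
    repeat_count -- how many times this overflow has been seen so far
    repeat_limit -- hypothetical threshold after which hardware is suspected
    """
    # Run the unloader: drain the buffer.
    unloaded = list(buffer)
    buffer.clear()

    # Audit the task dispenser: make sure the unloader is scheduled.
    rescheduled = "unloader" not in task_list
    if rescheduled:
        task_list.append("unloader")

    return {
        "unloaded": unloaded,
        "unloader_rescheduled": rescheduled,
        "interrupt_pested": True,  # the overflow interrupt is inhibited
        "hardware_suspect": repeat_count >= repeat_limit,
    }
```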
"No. 4 ESS: Maintenance Software"
"by M. N. Meyers, W. A. Routt and K. W. Yoder,"
"BSTJ Vol. 56, No. 7, September 1977"
"Software Error Recovery"
Since system operation is dependent on data in memories, and memories can be written, there is a possibility the memory will be in a state that precludes operation.
System must be as error-free as possible.
Since system cannot be completely error-free, it must be error tolerant.
"Classification of software errors"
Errors in interfaces between software modules.
Non-conformity to systems rules.
Logic errors.
Coding errors.
Complex man-machine interfaces that lead to procedural errors.
"Effects of software errors"
Loss of viability
Loss of call processing
Loss of a facility
Loss of a function
Loss of capacity
Loss of single call
No effect
"Error Prevention"
Standardization,
Simplification,
Improved Documentation,
I wonder why improved testing isn't listed
"Error Tolerance"
Error tolerance achieved through defensive programming and defensive memory.
Programs attempt to remain operational in presence of errors
Restrict access to data to noncritical regions where errors have minor impact (using special instructions to write to critical memory).
Checking state codes
Range Checks
Positive Decisions
Symbolic Addressing
Interpreting all possible stimuli
Link by index rather than absolute address
Special purpose memory allocators
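Two of the techniques listed above, range checks and linking by index rather than absolute address, can be sketched together. This is an illustration only; the class and method names are hypothetical and not taken from the paper.

```python
class LinkTable:
    """Fixed-size table whose records link to each other by index."""

    def __init__(self, size):
        self.size = size
        # -1 marks "no link"; any other value must be a valid index.
        self.next_index = [-1] * size

    def in_range(self, index):
        """Range check: True only for indices inside the table."""
        return 0 <= index < self.size

    def link(self, src, dst):
        """Write a link only if both ends pass the range check."""
        if not (self.in_range(src) and self.in_range(dst)):
            return False
        self.next_index[src] = dst
        return True

    def follow(self, src):
        """Return the linked index, or None if there is no valid link.

        The stored value is range-checked on read as well, so a
        mutilated link is reported instead of being followed.
        """
        if not self.in_range(src):
            return None
        dst = self.next_index[src]
        return dst if self.in_range(dst) else None
```

Because a link is an index into a table of known size, a bad value can be detected with a single comparison, which is not possible for an arbitrary absolute address.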
"Error Handling"
"Software Integrity Control"
Receives reports of all software errors, decides and activates corrective action.
Corrective action: request an audit, or if history indicates repetition, request a phase of system initialization.
Receives reports of overload - must decide if these are true, or are the results of software errors and take appropriate action.
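The audit-or-initialize decision can be sketched as a counter over error history. The class name, the action strings, and the repeat limit of 3 are assumptions; the paper only says that repetition in the history leads to a phase of initialization.

```python
from collections import Counter


class IntegrityControl:
    """Chooses a corrective action from the history of reported errors."""

    def __init__(self, repeat_limit=3):
        self.repeat_limit = repeat_limit  # hypothetical threshold
        self.history = Counter()          # error kind -> times reported

    def report(self, error_kind):
        """Record one error report and return the action to take."""
        self.history[error_kind] += 1
        if self.history[error_kind] >= self.repeat_limit:
            # History indicates repetition: an audit alone has not helped.
            return "request_initialization_phase"
        return "request_audit"
```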
"Audits"
Detect and correct data errors.
Control structure and audit routines specific to particular data structures.
Control structure schedules routine audits.
Demand audits run when errors are found or suspected.
Error detection through: comparison of duplicate structures, association (correct structures linked together), and semantics (data is reasonably correct).
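The three detection techniques can each be sketched as a small check that returns the indices of suspect entries. The function names and the list-based data layout are assumptions for illustration.

```python
def audit_duplicates(primary, backup):
    """Comparison: indices where the duplicated structures disagree."""
    return [i for i, (a, b) in enumerate(zip(primary, backup)) if a != b]


def audit_association(forward, backward):
    """Association: forward[i] == j must be matched by backward[j] == i.

    None means "not linked". Returns indices of forward entries whose
    partner is out of range or does not point back.
    """
    bad = []
    for i, j in enumerate(forward):
        if j is None:
            continue
        if not (0 <= j < len(backward)) or backward[j] != i:
            bad.append(i)
    return bad


def audit_semantics(values, low, high):
    """Semantics: indices of values outside the reasonable range."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]
```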
"Integrity monitor system"
Detects scheduling and cycling irregularities and loss of major system functions.
Time monitors detect basic sanity
Base level monitor verifies validity of base cycle
Test call program observes progress of test calls
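A time monitor of this kind behaves like a watchdog: healthy software resets it regularly, and a missed reset is taken as loss of sanity. A minimal sketch, with the class name and the caller-supplied clock values as assumptions:

```python
class SanityTimer:
    """Watchdog that expires if it is not reset within `timeout`."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_reset = now

    def reset(self, now):
        """Called by the monitored program on each healthy cycle."""
        self.last_reset = now

    def expired(self, now):
        """True if more than `timeout` has passed since the last reset."""
        return now - self.last_reset > self.timeout
```

Time is passed in by the caller rather than read from a clock, which keeps the sketch deterministic.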
"Recovery"
Remedial actions - audits - correct specific errors.
If remedial actions fail, undertake system recovery.
System recovery reconfigures hardware and initializes memory.
System recovery initiated upon detection of:
Mutilation of program memory
Loss of vital system function,
Loss of major facility,
Escalation of remedial actions,
Software integrity monitor problems,
Sanity Timeout
Mutilation of software clock,
Duplex failure
System recovery phases:
Phase 1 - initialize specific areas of transient memory - does not kill calls.
Phase 2 - reconfigure peripheral hardware and initialize all transient memory - kills only non-stable calls.
Phase 3 - reconfigure processors, initialize all memory - kills only non-stable calls.
Phase 4 - totally initialize system - kills all calls - manual activation only.
More detailed and specific manual recovery procedures can be initiated.
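The four phases can be written down as a table, with a helper that picks the next phase to try. The scope and call-loss columns follow the list above; the step-by-step escalation rule in next_phase is an assumption, as are all names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    number: int
    scope: str
    calls_killed: str
    manual_only: bool = False


PHASES = [
    Phase(1, "specific areas of transient memory", "none"),
    Phase(2, "peripheral hardware and all transient memory", "non-stable"),
    Phase(3, "processors and all memory", "non-stable"),
    Phase(4, "entire system", "all", manual_only=True),
]


def next_phase(current, manual=False):
    """Return the next phase number after `current` (0 = none tried yet).

    Phase 4 is returned only when `manual` is True. Returns None when
    no further phase is available.
    """
    for phase in PHASES:
        if phase.number > current and (manual or not phase.manual_only):
            return phase.number
    return None
```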