
12. MTBF (Mean Time Between Flareups, er, Failures)

There is a short FAQ-like document available from IBM at www.storage.ibm.com. It has no math for the statistically inclined, but it explains in clear prose what IBM, at least, means when they say MTBF.

I will also note that, for a complex but reparable system such as an autochanger, each subsystem may have a separate MTBF and a different lifetime, which may be combined to give one figure for the unit as a whole.
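One common way to do that combination, sketched here in Python, is to assume the unit fails whenever any one subsystem fails (a "series" system) and that each subsystem is in its constant failure rate phase, so that the failure rates simply add. Both assumptions are mine, as are the function name and the autochanger figures, which are invented for illustration and are not from any vendor.

```python
def combined_mtbf(subsystem_mtbfs_hours):
    """Combine subsystem MTBFs into one figure for the whole unit.

    Assumes a series system (any subsystem failure fails the unit) and
    constant failure rates, so the rates (1 / MTBF) add.
    """
    total_rate = sum(1.0 / mtbf for mtbf in subsystem_mtbfs_hours)
    return 1.0 / total_rate


# Hypothetical autochanger; these MTBF figures are made up for illustration.
subsystems = {
    "robotics": 50_000.0,
    "drive": 100_000.0,
    "power_supply": 200_000.0,
}

unit_mtbf = combined_mtbf(subsystems.values())
print(f"unit MTBF is about {unit_mtbf:.0f} hours")
```

Note that the combined figure is always lower than the MTBF of the weakest subsystem. This says nothing about lifetime, which has to be tracked per subsystem separately.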

Here is a reasonably understandable, but somewhat long, description of MTBF. Thanks to Kevin Daly (president of Odetics, kdaly@odetics.com), who wrote it in 10/95 for this FAQ. After some waffling, I've included the whole thing, despite its length.

===============================================================

M T B F

In order to understand MTBF (Mean Time Between Failures) it is best to start with something else -- something for which it is easier to develop an intuitive feel. This other concept is failure rate, which is, not surprisingly, the average (mean) rate at which things fail. A "thing" could be a component, an assembly, or a whole system. Some things -- rocks, for example -- are accepted to have very low failure rates while others -- British sports cars, for example -- are (or should be) expected to have relatively high failure rates.

It is generally accepted among reliability specialists (and you, therefore, must not question it) that a thing's failure rate isn't constant, but generally goes through three phases over a thing's lifetime. In the first phase the failure rate is relatively high, but decreases over time -- this is called the "infant mortality" phase (sensitive guys, these reliability specialists). In the second phase the failure rate is low and essentially constant -- this is (imaginatively) called the "constant failure rate" phase. In the third phase the failure rate begins increasing again, often quite rapidly -- this is called the "wearout" phase. The reliability specialists noticed that when plotted as a function of time the failure rate resembled a familiar bathroom appliance -- but they called it a "bathtub" curve anyway. The units of failure rate are failures per unit of "thing-time"; e.g. failures per machine-hour or failures per system-year.

What, you may ask, does all this have to do with MTBF? MTBF is the inverse of the failure rate in the constant failure rate phase. Nothing more and nothing less. The units of MTBF are (or should be) units of "thing-time" per failure; e.g. machine-hours per failure or system-years per failure, but the "thing" part and the "per failure" part are almost always omitted to enhance the mystique and confusion and to make MTBF appear to have the units of "time", which it doesn't. We will bow to the convention of speaking of MTBF in hours or years -- but we all know what we really mean.

What does MTBF have to do with lifetime? Nothing at all! It is not at all unusual for things to have MTBF's which significantly exceed their lifetime as defined by wearout -- in fact, you know many such things. A "thirty-something" American (well within his constant failure rate phase) has a failure (death) rate of about 1.1 deaths per 1000 person-years and, therefore, has an MTBF of about 900 years (of course it's really 900 person-years per death). Even the best ones, however, wear out long before that.
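The arithmetic behind that figure is just the inversion of the failure rate, keeping the "thing-time per failure" units honest. A minimal sketch in Python (the function name is mine, purely for illustration):

```python
def mtbf_from_failures(failures, thing_time):
    """MTBF in units of thing-time per failure.

    Only meaningful in the constant failure rate phase, and only for a
    population of things, not for one individual thing.
    """
    return thing_time / failures


# 1.1 deaths per 1000 person-years, as quoted in the text above.
mtbf_years = mtbf_from_failures(failures=1.1, thing_time=1000.0)
print(f"about {mtbf_years:.0f} person-years per death")  # about 909
```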

This example points out one other important characteristic of MTBF -- it is an ensemble characteristic which applies to populations (i.e. "lots") of things; not a sample characteristic which applies to one specific thing. In the good old days when failure rates were relatively high (and, therefore, MTBF relatively low) this characteristic of MTBF was a curiosity which created lively (?) debate at conventions of reliability specialists (them) but otherwise didn't unduly bother right-thinking people (us). Things, however, have changed. For many systems of interest today the required failure rates are so low that the MTBF substantially exceeds the lifetime (obviously nature had this right a long time ago). In these cases MTBF's are not only "not necessarily" sample characteristics, but are "necessarily not" sample characteristics. In the terms of the reliability cognoscenti, failure processes are not ergodic (i.e. you can't blithely trade population statistics for time statistics). The key implication of this essential characteristic of MTBF is that it can only be determined from populations and it should only be applied to populations.

MTBF is, therefore, an excellent characteristic for determining how many spare hard drives are needed to support 1000 PC's, but a poor characteristic for guiding you on when you should change your hard drive to avoid a crash.
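As a rough sketch of the spares calculation, assume every drive is in its constant failure rate phase and failures are independent, so the number of failures in a period follows a Poisson distribution. The drive MTBF of 500,000 hours, the one-year period, and the 95% coverage target are all hypothetical values chosen for illustration, and the function names are mine.

```python
import math


def expected_failures(n_units, hours_per_unit, mtbf_hours):
    """Mean number of failures across the whole population."""
    return n_units * hours_per_unit / mtbf_hours


def spares_needed(n_units, hours_per_unit, mtbf_hours, coverage=0.95):
    """Smallest spare count k such that P(failures <= k) >= coverage.

    Assumes failures are Poisson distributed with the mean computed above.
    """
    mean = expected_failures(n_units, hours_per_unit, mtbf_hours)
    k = 0
    term = math.exp(-mean)  # P(exactly 0 failures)
    cumulative = term
    while cumulative < coverage:
        k += 1
        term *= mean / k    # P(exactly k failures) from P(exactly k - 1)
        cumulative += term
    return k


# 1000 PCs powered on all year, hypothetical 500,000-hour drive MTBF.
mean = expected_failures(1000, 8760, 500_000)
spares = spares_needed(1000, 8760, 500_000, coverage=0.95)
print(f"expect about {mean:.2f} failures per year; stock {spares} spares")
```

The mean alone (about 17.5 failures a year here) would leave you short roughly half the time, which is why the sketch asks for a coverage level rather than just rounding the mean.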

MTBF's are best determined from large populations. How large? From every point of view (theoretical, practical, statistical) but cost, the answer is "the larger, the better". There are, however, well established techniques for planning and conducting test programs to develop specified levels of confidence in a thing's MTBF. Establishing an MTBF at the 80% confidence level, for example, is clearly better, but much more difficult and expensive, than doing it at a 60% confidence level. As an example, a test designed to demonstrate a thing's MTBF at the 80% confidence level requires a total thing-time of 160% of the MTBF if it can be conducted with no failures. You don't want to know how much thing-time is required to achieve reasonable confidence levels if any failures occur during the test.
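The 160% figure can be reproduced with a short sketch, assuming a constant failure rate so that the failure count over the test is Poisson distributed. The test must run long enough that seeing so few failures would be unlikely (probability 1 minus the confidence level) if the true MTBF were only the claimed value. The function name and the bisection approach are mine; a statistics package would normally use a chi-square quantile instead.

```python
import math


def required_test_multiple(confidence, failures_allowed=0):
    """Total thing-time needed, as a multiple of the MTBF to be shown.

    Solves P(failures <= failures_allowed) = 1 - confidence for a
    Poisson count whose mean is the returned multiple.
    """
    target = 1.0 - confidence

    def poisson_cdf(mean):
        term = math.exp(-mean)
        total = term
        for i in range(1, failures_allowed + 1):
            term *= mean / i
            total += term
        return total

    # poisson_cdf decreases as the mean grows, so bracket and bisect.
    lo, hi = 0.0, 1.0
    while poisson_cdf(hi) > target:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if poisson_cdf(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


print(f"{required_test_multiple(0.80):.2f}")      # about 1.61, i.e. 160%
print(f"{required_test_multiple(0.60):.2f}")      # about 0.92
print(f"{required_test_multiple(0.80, 1):.2f}")   # about 2.99 with one failure
```

With no failures allowed the answer reduces to -ln(1 - confidence), and a single failure during the test nearly doubles the required thing-time at the 80% level.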

What, by the way, is "thing-time"? An important subtlety is that "thing-time" isn't "clock time" (unless, of course, your thing is a clock). The question of how to compute "thing-time" is a critical one in reliability engineering. For some things (e.g. living things) time always counts, but for others the passage of "thing-time" may be highly dependent upon the state of the thing. Various ad hoc time corrections (such as "power on hours" (POH)) have been used, primarily in the electronics area. There is significant evidence that, in the mechanical area, "thing-time" is much more related to activity rate than it is to clock time. Measures such as "Mean Cycles Between Failures (MCBF)" are becoming accepted as more accurate ways to assess the "duty cycle effect". Well-founded, if heuristic, techniques have been developed for combining MCBF and MTBF effects for systems in which the average activity rate is known.
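The text does not say which combining technique is meant. One simple heuristic, offered here only as an assumption, is to treat time-driven and cycle-driven failures as independent and add their rates, converting the cycle rate to a per-hour rate using the average activity level. The function name and all numbers below are hypothetical.

```python
def effective_mtbf(mtbf_hours, mcbf_cycles, cycles_per_hour):
    """Effective MTBF in hours at a known average activity rate.

    Assumes time-related and cycle-related failures are independent,
    each with a constant rate, so the two rates add.
    """
    time_rate = 1.0 / mtbf_hours               # failures per hour
    cycle_rate = cycles_per_hour / mcbf_cycles  # failures per hour
    return 1.0 / (time_rate + cycle_rate)


# Hypothetical mechanism: 100,000-hour MTBF, 1,000,000-cycle MCBF.
for rate in (0.0, 5.0, 20.0):
    hours = effective_mtbf(100_000.0, 1_000_000.0, rate)
    print(f"{rate:5.1f} cycles/hour -> about {hours:.0f} hours")
```

Under this assumption an idle unit keeps its quoted MTBF, while a busy one can fall well below it, which is the "duty cycle effect" in miniature.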

MTBF need not, then, be "Mysterious Time Between Failures" or "Misleading Time Between Failures"; it can be an important system characteristic which helps to quantify the suitability of a system for a potential application. While rising demands on system integrity may make this characteristic seem "unnatural", remember you live in a country of 250 million people with an MTBF of roughly 8 million hours (900 years) each!

===================================================================
Kevin C. Daly
President
ATL Products
kdaly@odetics.com
(714) 774-6900


email me at rdv@isi.edu