- Path: sparky!uunet!cs.utexas.edu!swrinde!sdd.hp.com!scd.hp.com!hplextra!hpfcso!hpfcmdd!hpbbrd!hpbbn!hpcc05!hpdmd48!jgm
- From: jgm@hpdmd48.boi.hp.com (John McBride--in my own private Idaho)
- Newsgroups: comp.arch
- Subject: Error Correcting Memory
- Message-ID: <14900030@hpdmd48.boi.hp.com>
- Date: 4 Sep 92 16:57:54 GMT
- References: <Sep03.210730.68303@yuma.ACNS.ColoState.EDU>
- Organization: HP-Boise, ID
- Lines: 66
-
- >I'm interested in learning about the difference in reliability
- >between parity memory and error checking-correction memory. Can
- >anyone provide me with some pointers to related books/articles?
- >Or does anyone know straight off what the relative difference is?
-
- I don't know of any papers on the subject (but then again I haven't looked),
- but you can figure out the difference by applying s(t)atistics.
-
- Assumptions: DRAM Soft error rate (due to alpha particles) = 1000 FIT per chip
- FIT = Failure In Time = expected # of failures in 10^9 hours
- DRAM Chip failure rate = 100 FIT per chip
- Data protection is either byte parity or 32 bit word ECC
- (ECC over 32 bits requires 7 error detection/correction bits
- for single bit correct and double error detect)
- 4Mbit (4M x 1) DRAMs in a 64Mbyte system = 128 data chips
- with either 16 parity chips or 28 ECC chips
- All failures are independent
- The ECC memory includes scrubbing, which continuously looks
- for single bit errors in memory and corrects them.
- Scrubbing all but eliminates the possibility that two
- soft errors occur in the same word. The failure mode
- would then be a soft error in a word that has a failed
- chip, assuming the failed chip is replaced within a week of
- the chip failing (MTTR = 1 week = 168 hours), which
- also assumes that the system can notify an operator of
- the failure.
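
As a quick check on the chip counts in the assumptions above, here is a minimal Python sketch. The function names (`secded_check_bits`, `chip_counts`) are made up for illustration; the check-bit count uses the standard Hamming bound plus one overall parity bit for double-error detection.

```python
MBIT = 2**20  # bits in one "Mbit"

def secded_check_bits(data_bits):
    """Check bits for single-error-correct, double-error-detect (SEC-DED)."""
    r = 1
    # Hamming bound: r check bits must address data_bits + r positions plus "no error".
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit gives double-error detection

def chip_counts(system_mbytes=64, chip_mbits=4, word_bits=32):
    """Data, parity, and ECC chip counts for x1 DRAMs (one bit per chip per word)."""
    data_chips = (system_mbytes * 8 * MBIT) // (chip_mbits * MBIT)
    parity_chips = data_chips // 8  # one parity bit per byte
    ecc_chips = data_chips * secded_check_bits(word_bits) // word_bits
    return data_chips, parity_chips, ecc_chips

data, parity, ecc = chip_counts()
print(secded_check_bits(32))      # 7
print(data, parity, ecc)          # 128 16 28
print(data + parity, data + ecc)  # 144 156
```

The totals of 144 and 156 chips are the ones used in the parity and ECC calculations.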
-
-
- Parity protection calculations:
- Failure rate of 144 chips = 1000 * 144 = 144,000 FITs
- MTBF (Mean Time Between Failure) of 144 chips = 10^9 / 144,000 = 6944 hours
- 1 Year = 365 days * 24 hours = 8760 hours
- AFR (Annual Failure Rate) = 1 - exp(-8760/6944) = 1 - .28 = 72%
- Expected number of failures per year = 8760 / 6944 hours = 1.26 failures
- With parity, you would expect each system with 64Mbytes to lose data
- a little more than once a year on average (1.26 expected failures). This
- is quite unacceptable for most computers (except PCs) and data storage
- devices (disks, etc.)
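
The parity arithmetic can be reproduced with a few lines of Python. This is only a sketch of the model as stated (soft errors only, independent failures, exponential failure distribution); the function name `parity_afr` and its parameters are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760
FIT_HOURS = 1e9            # FIT = failures per 10^9 device-hours

def parity_afr(chips=144, soft_fit=1000):
    """Return (MTBF in hours, expected failures per year, AFR) for parity memory.

    Every soft error is treated as a data loss, since parity can only detect it.
    """
    total_fit = chips * soft_fit
    mtbf_hours = FIT_HOURS / total_fit
    expected_per_year = HOURS_PER_YEAR / mtbf_hours
    afr = 1 - math.exp(-expected_per_year)  # P(at least one failure in a year)
    return mtbf_hours, expected_per_year, afr

mtbf, per_year, afr = parity_afr()
print(f"MTBF = {mtbf:.0f} h, failures/year = {per_year:.2f}, AFR = {afr:.0%}")
# MTBF = 6944 h, failures/year = 1.26, AFR = 72%
```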
-
- ECC protection calculations:
- Probability that a soft error occurs given that a chip has failed =
- P(Soft error | Chip failure) = 1 - exp(-MTTR * (Number of chips - 1)
- / MTBF of each chip)
- MTBF (of soft errors) = 10^9 / 1000 = 10^6 hours
- Number of chips = 156 chips
- MTTR (Mean Time To Repair) = 1 week = 168 hours
- P(Soft error | Chip failure) = 1 - exp(-168 * 155 / 10^6) = .0257
- Chip failure rate of 156 chips = 100 * 156 = 15,600 FITs
- MTBF of chip failures = 10^9 / 15,600 = 64,103 hours
- AFR of chip failures = 1 - exp(-8760/64,103) = 12.8%
- AFR of data loss = AFR of chip failures * P(Soft error | Chip failure)
- = .128 * .0257 = 0.3% AFR
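
The ECC figures can be checked the same way. The sketch below follows the model exactly as written in the calculation (a data loss needs a chip failure followed by a soft error in any of the other chips before the repair); `ecc_afr` and its parameter names are illustrative, not from any library.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760
FIT_HOURS = 1e9            # FIT = failures per 10^9 device-hours

def ecc_afr(chips=156, soft_fit=1000, chip_fit=100, mttr_hours=168):
    """Return (P(soft error | chip failure), AFR of chip failures, AFR of data loss)."""
    soft_mtbf = FIT_HOURS / soft_fit  # per-chip soft-error MTBF, 10^6 hours
    # Chance that one of the remaining chips takes a soft error during the repair window.
    p_soft = 1 - math.exp(-mttr_hours * (chips - 1) / soft_mtbf)
    chip_mtbf = FIT_HOURS / (chips * chip_fit)  # MTBF of hard chip failures, all chips
    afr_chip = 1 - math.exp(-HOURS_PER_YEAR / chip_mtbf)
    return p_soft, afr_chip, afr_chip * p_soft

p_soft, afr_chip, afr_loss = ecc_afr()
print(f"P(soft | chip) = {p_soft:.4f}, chip AFR = {afr_chip:.1%}, data-loss AFR = {afr_loss:.2%}")
# P(soft | chip) = 0.0257, chip AFR = 12.8%, data-loss AFR = 0.33%
```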
-
- (Any errors in calculations, anyone? As you can tell, I have spent a
- little time thinking about this, so I think I am an authority. I would
- be delighted to hear where I am wrong, if I am indeed wrong.)
-
- I believe that the assumed soft error FIT rate is low, the chip failure
- FIT rate is high, and the MTTR is high, all of which make the ECC
- and parity reliability numbers come out closer than they probably are.
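
To see how sensitive the comparison is to these three assumptions, one can rerun both models with different inputs. The sketch below is self-contained and uses the same formulas as the calculations above; only the first scenario is the baseline from this post, and the other rows are made-up what-if values, not measured data.

```python
import math

HOURS_PER_YEAR = 8760
FIT_HOURS = 1e9

def parity_afr(soft_fit, chips=144):
    """AFR of data loss with byte parity: any soft error counts."""
    return 1 - math.exp(-HOURS_PER_YEAR * chips * soft_fit / FIT_HOURS)

def ecc_afr(soft_fit, chip_fit, mttr_hours, chips=156):
    """AFR of data loss with ECC: chip failure, then a soft error before repair."""
    p_soft = 1 - math.exp(-mttr_hours * (chips - 1) * soft_fit / FIT_HOURS)
    afr_chip = 1 - math.exp(-HOURS_PER_YEAR * chips * chip_fit / FIT_HOURS)
    return afr_chip * p_soft

# (soft_fit, chip_fit, mttr_hours); rows after the first are hypothetical.
scenarios = [
    (1000, 100, 168),  # baseline assumptions from the post
    (1000, 50, 168),   # hypothetical: more reliable chips
    (1000, 100, 24),   # hypothetical: repair within a day
    (5000, 100, 168),  # hypothetical: higher soft error rate
]
for soft_fit, chip_fit, mttr in scenarios:
    p = parity_afr(soft_fit)
    e = ecc_afr(soft_fit, chip_fit, mttr)
    print(f"soft={soft_fit:5d} chip={chip_fit:4d} MTTR={mttr:4d}h  "
          f"parity AFR={p:.1%}  ECC AFR={e:.3%}  ratio={p / e:.0f}x")
```

Because AFR is capped at 100%, the parity figure saturates as the soft error rate grows, so comparing expected failures per year may be more informative at high rates.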
-
- From these calculations, the ECC system is at least two orders of magnitude
- more reliable than a comparable parity protected system (72% vs 0.3% AFR),
- as measured by AFR.
-
- John McBride
- I speak for myself and not anyone else, etc., etc., etc.
-