NetNews Usenet Archive 1992 #20


Internet Message Format | 1992-09-07 | 3.9 KB

Path: sparky!uunet!cs.utexas.edu!swrinde!sdd.hp.com!scd.hp.com!hplextra!hpfcso!hpfcmdd!hpbbrd!hpbbn!hpcc05!hpdmd48!jgm
From: jgm@hpdmd48.boi.hp.com (John McBride--in my own private Idaho)
Newsgroups: comp.arch
Subject: Error Correcting Memory
Message-ID: <14900030@hpdmd48.boi.hp.com>
Date: 4 Sep 92 16:57:54 GMT
References: <Sep03.210730.68303@yuma.ACNS.ColoState.EDU>
Organization: HP-Boise, ID
Lines: 66

>I'm interested in learning about the difference in reliability
>between parity memory and error checking-correction memory. Can
>anyone provide me with some pointers to related books/articles?
>Or does anyone know straight off what the relative difference is?

I don't know of any papers on the subject (but then again I haven't
looked), but you can figure out the difference by applying s(t)atistics.

Assumptions:

    DRAM soft error rate (due to alpha particles) = 1000 FIT
        FIT = Failure In Time = expected # of failures in 10^9 hours
    DRAM chip failure rate = 100 FIT
    Data protection is either byte parity or 32 bit word ECC
        (ECC over 32 bits requires 7 error detection/correction bits
        for single bit correct and double error detect)
    4Mbit (4M x 1) DRAMs in a 64Mbyte system = 128 data chips
        with either 16 parity chips or 28 ECC chips
    All failures are independent

The ECC memory includes scrubbing, which continuously looks for single
bit errors in memory and corrects them. Scrubbing all but eliminates
the possibility that two soft errors occur in the same word. The
failure mode would then be a soft error in a word that has a failed
chip, assuming the failed chip is replaced within a week of the chip
failing (MTTR = 1 week = 168 hours), which also assumes that the system
can notify an operator of the failure.
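The chip counts in the assumptions can be rederived with a few lines of Python. This is only a sketch and not part of the original post: the function names are made up for illustration, and the check-bit count assumes a standard Hamming code extended with one overall parity bit (single error correct, double error detect).

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code plus one overall parity bit (SEC-DED).

    Assumption: smallest r with 2**r >= data_bits + r + 1, then one
    extra bit for double error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def chips_needed(capacity_mbytes, chip_mbits, check_bits, data_bits):
    """Return (data_chips, check_chips) for a memory built from x1 DRAMs."""
    data_chips = capacity_mbytes * 8 // chip_mbits
    check_chips = data_chips * check_bits // data_bits
    return data_chips, check_chips


ecc_bits = secded_check_bits(32)
print("check bits per 32 bit word:", ecc_bits)
print("byte parity (data, check):", chips_needed(64, 4, 1, 8))
print("32 bit ECC  (data, check):", chips_needed(64, 4, ecc_bits, 32))
```

Under these assumptions the script gives 7 check bits per word, 128 + 16 chips for parity, and 128 + 28 chips for ECC, matching the totals of 144 and 156 chips used in the calculations.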
Parity protection calculations:

    Failure rate of 144 chips = 1000 * 144 = 144,000 FITs
    MTBF (Mean Time Between Failure) of 144 chips
        = 10^9 / 144,000 = 6944 hours
    1 Year = 365 days * 24 hours = 8760 hours
    AFR (Annual Failure Rate) = 1 - exp(-8760/6944) = 1 - .28 = 72%
    Expected number of failures per year
        = 8760 hours / 6944 hours = 1.26 failures

With parity, you would expect each system with 64Mbytes to lose data
more than once a year on average. This is quite unacceptable for most
computers (except PCs) and data storage devices (disks, etc.)

ECC protection calculations:

    Probability that a soft error occurs while a failed chip is
    waiting to be replaced
        = P(Soft error | Chip failure)
        = 1 - exp(-MTTR * (Number of chips - 1) / MTBF of each chip)

    MTBF (of soft errors, per chip) = 10^9 / 1000 = 10^6 hours
    Number of chips = 156 chips
    MTTR (Mean Time To Repair) = 1 week = 168 hours

    P(Soft error | Chip failure) = 1 - exp(-168 * 155 / 10^6) = .0257

    Chip failure rate of 156 chips = 100 * 156 = 15,600 FITs
    MTBF of chip failures = 10^9 / 15,600 = 64,103 hours
    AFR of chip failures = 1 - exp(-8760/64,103) = 12.8%

    AFR of data loss = AFR of chip failures * P(Soft error | Chip failure)
                     = .128 * .0257 = 0.3% AFR

(Any errors in calculations, anyone? As you can tell, I have spent a
little time thinking about this, so I think I am an authority. I would
be delighted to hear where I am wrong, if I am indeed wrong.)

I believe that the assumed soft error FIT rate is low, the chip failure
FIT rate is high, and the MTTR is high, all of which make the ECC and
parity reliability numbers come out closer than they probably are.

From these calculations, the ECC system is at least two orders of
magnitude more reliable than a comparable parity protected system as
measured by AFR: 72% vs 0.3%.

John McBride
I speak for myself and not anyone else, etc., etc., etc.
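For anyone who wants to check the arithmetic, the calculations above can be rerun with a short Python script. This is a minimal sketch and not part of the original post: the names are illustrative, and the model is exactly the one stated in the assumptions (independent failures, exponentially distributed times to failure, one soft error during the repair window counted as data loss).

```python
import math

HOURS_PER_YEAR = 365 * 24   # 8760 hours
SOFT_FIT = 1000.0           # soft errors per chip per 10^9 hours (assumed in the post)
HARD_FIT = 100.0            # chip failures per chip per 10^9 hours (assumed in the post)
MTTR_HOURS = 168.0          # one week to replace a failed chip (assumed in the post)


def mtbf_hours(fit):
    """Convert a FIT rate (failures per 10^9 hours) to MTBF in hours."""
    return 1e9 / fit


def afr(mtbf):
    """Probability of at least one failure within a year."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf)


# Parity: any soft error in any of the 144 chips counts as a failure.
parity_chips = 144
parity_mtbf = mtbf_hours(SOFT_FIT * parity_chips)
parity_afr = afr(parity_mtbf)
parity_failures_per_year = HOURS_PER_YEAR / parity_mtbf

# ECC: data loss needs a chip failure, then a soft error in one of the
# other 155 chips before the failed chip is replaced.
ecc_chips = 156
p_soft_given_failure = 1.0 - math.exp(
    -MTTR_HOURS * (ecc_chips - 1) / mtbf_hours(SOFT_FIT)
)
chip_failure_afr = afr(mtbf_hours(HARD_FIT * ecc_chips))
ecc_afr = chip_failure_afr * p_soft_given_failure

print(f"parity MTBF            = {parity_mtbf:.0f} hours")
print(f"parity AFR             = {parity_afr:.1%}")
print(f"parity failures / year = {parity_failures_per_year:.2f}")
print(f"P(soft | chip failure) = {p_soft_given_failure:.4f}")
print(f"chip failure AFR       = {chip_failure_afr:.1%}")
print(f"ECC data loss AFR      = {ecc_afr:.2%}")
print(f"parity / ECC ratio     = {parity_afr / ecc_afr:.0f}")
```

Changing SOFT_FIT, HARD_FIT, or MTTR_HOURS shows how sensitive the comparison is to the three assumptions the post flags as making the parity and ECC numbers look closer than they probably are.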