Newsgroups: comp.sys.ibm.pc.hardware
Path: sparky!uunet!paladin.american.edu!europa.asd.contel.com!darwin.sura.net!jvnc.net!yale.edu!ira.uka.de!rz.uni-karlsruhe.de!usenet
From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
Subject: What you always wanted to know about math coprocessors 3/4
Message-ID: <1992Sep15.162827.11073@rz.uni-karlsruhe.de>
Sender: usenet@rz.uni-karlsruhe.de (USENET News System)
Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
Date: Tue, 15 Sep 1992 16:28:27 GMT
X-News-Reader: VMS NEWS 1.23
Lines: 933

Whetstone [2,3,4] is a synthetic benchmark based upon statistics
collected about the use of certain control and data structures in
programs written in high level languages. Based on these statistics,
Whetstone tries to mirror a 'typical' HLL program. Whetstone performance
is expressed by how many theoretical 'whetstone' instructions are
executed per second. It was originally implemented in ALGOL. Unlike
PEAKFLOP, LLL, and Linpack, Whetstone not only uses addition and
multiplication but exercises all basic arithmetic operations as well as
some transcendental functions. Whetstone performance depends on the
speed of the coprocessor as well as on the speed of the CPU, while
PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.
There exist an old and a new version of Whetstone. Note that results
from the two versions can differ by as much as 20% for the same test
configuration. For this test, the new version in Pascal from [3] was
used. It was compiled with Turbo Pascal 6.0 and my own library (see
above) with all 'optimizations' on.

SAVAGE tests the performance of transcendental function evaluation. It
is basically a small loop in which the sin, cos, arctan, ln, exp, and
sqrt functions are combined in a single expression.
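The loop just described can be sketched as follows. This is only an
illustration in Python, not the Pascal source from [7]: the expression
is the commonly quoted form of the Savage loop, tan is written as
sin/cos so that exactly six functions are called per pass, and the
names savage and evaluations_per_second are made up for this sketch.

```python
import math
import time


def savage(passes: int = 10_000) -> float:
    """Run a Savage-style loop for `passes` iterations, starting from a = 1.0.

    Assumed expression: a := tan(arctan(exp(ln(sqrt(a*a))))) + 1,
    with tan computed as sin/cos. Ideally a grows by exactly 1 per pass.
    """
    a = 1.0
    for _ in range(passes):
        # Six function calls per pass: sqrt, ln, exp, arctan, sin, cos.
        angle = math.atan(math.exp(math.log(math.sqrt(a * a))))
        a = math.sin(angle) / math.cos(angle) + 1.0
    return a


def evaluations_per_second(passes: int = 10_000) -> float:
    """Time the loop and report function evaluations per second."""
    start = time.perf_counter()
    savage(passes)
    elapsed = time.perf_counter() - start
    return 6 * passes / elapsed


if __name__ == "__main__":
    result = savage()
    print("final value:", result, "ideal value:", 10_000 + 1)
    print("function evaluations per second:", round(evaluations_per_second()))
```

The deviation of the final value from the ideal one gives a rough feel
for the accumulated error of the function evaluations, while the timing
gives the speed figure.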
While sin, cos, arctan, and sqrt can each be evaluated directly with a
single 387 coprocessor instruction, ln and exp need additional
preprocessing for argument reduction and result conversion. According
to [14], the Savage benchmark was devised by Bill Savage, and is
distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway,
Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make
250,000 passes through the loop. Here only 10,000 loops are executed for
a total of 60,000 transcendental function evaluations. The result is
expressed in function evaluations per second. SAVAGE source code was
taken from [7] and compiled with Turbo Pascal 6.0 and my own run-time
library (see above).

Benchmark results for 387 coprocessors, coprocessor emulators and the
Intel RapidCAD and Intel 486 CPUs.

40 MHz          PEAKFLOP  TRNSFORM      LLL  Linpack  Whetstone    Savage
                  MFLOPS    MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec

386, EM87         0.0084    0.0080   0.0060   0.0060         31       502  ##
386, Franke387    0.0369    0.0295   0.0233   0.0215        164      4002  $$
386, TP 6 Emu     0.0316    0.0273   0.0200   0.0190        160      3794  %%
Intel 387DX       0.9204    0.7212   0.3932   0.3211       2428     52677
ULSI 83C87        1.2093    0.7936   0.3890   0.3120       2528     56926
IIT 3C87          1.0196    0.7145   0.3834   0.3179       2663     58766
IIT 3C87,4X4      1.0196    1.7244   0.3834   0.3179       2663     58766  ??
C&T 38700         1.0722    0.7908   0.4007   0.3222       2837     74906
Cyrix 387+        1.1305    0.8162   0.3945   0.3208       2946     80322
Intel RapidCAD    2.2128    1.8931   0.7377   0.5432       4810     86957
Intel 486         2.4762    2.1335   1.1110   0.8204       6195     98522

33.3 MHz        PEAKFLOP  TRNSFORM      LLL  Linpack  Whetstone    Savage
                  MFLOPS    MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec

386, EM87         0.0070    0.0040   0.0050   0.0050         26       418  ##
386, Franke387    0.0307    0.0246   0.0194   0.0179        137      3335  $$
386, TP 6 Emu     0.0263    0.0227   0.0167   0.0158        133      3160  %%
Intel 387DX       0.7647    0.6004   0.3283   0.2676       2046     43860
ULSI 83C87        1.0097    0.6609   0.3239   0.2598       2089     47431
IIT 3C87          0.8455    0.5957   0.3198   0.2646       2203     49020
IIT 3C87,4X4      0.8455    1.4334   0.3198   0.2646       2203     49020  ??
C&T 38700         0.9455    0.6907   0.3338   0.2700       2376     62565
Cyrix 387+        0.9286    0.6806   0.3293   0.2669       2435     66890
Cyrix 83D87        1.013       N/A    0.333    0.273       2550       N/A
Intel RapidCAD    1.8572    1.5798   0.6072   0.4533       3953     72464
Intel 486         2.0800    1.7779   0.9387   0.6682       5143     82192

For comparison:

                PEAKFLOP  TRNSFORM      LLL  Linpack  Whetstone    Savage
                  MFLOPS    MFLOPS   MFLOPS   MFLOPS  kWhet/sec  Func/sec

i486DX2-66        4.1601    3.4227   1.6531   1.3010      10655    163934
i486DX2-50        3.0589    2.6665   1.2537   0.9744       7962    123203
i387, 20 MHz      0.2253    0.3271   0.1434   0.1171        952     21739  ++
i387DX, 20 MHz    0.3567    0.4444   0.1484   0.1161       1034     24155  &&
i80287, 5 MHz     0.0281    0.0310   0.0242   0.0222        150      3261  !!
i8087,9.54 MHz    0.0636    0.0705   0.0321   0.0219        234      5782  **

HW configuration for test of 387 coprocessors and Intel RapidCAD:
   System A: Motherboard with Forex chip set, 128 kB CPU Cache, 8 MB RAM
HW configuration for test of 486 FPU (extra fan for 40 MHz operation):
   System B: Motherboard with SIS chip set, 256 kB CPU Cache, 8 MB RAM

## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
   loads as a TSR. It catches coprocessor instructions by means of the
   INT 7 traps that 80286 and 80386 systems without a coprocessor raise
   when they encounter such instructions, and then emulates them.
   Whetstone and Savage benchmarks for this test were compiled with the
   original TP 6.0 library, as EM87 chokes on the 387 specific FSIN and
   FCOS instructions used in my own library if a 387 is detected.
   Obviously EM87 identifies itself as a 387, but has no support for 387
   specific instructions.
$$ Franke387 is a commercial 387 emulator that is also available in a
   shareware version. For this test, shareware version V2.4 was used.
   Franke387, unlike many other emulators, supports all 387
   instructions. It is loaded as a device driver and uses INT 7 to trap
   coprocessor instructions.
%% These benchmarks were run using the built-in coprocessor emulators of
   the TP 6.0 and the MS FORTRAN 5.0 run-time libraries.
?? The 3C87 specific F4X4 instruction was used in the vector
   transformation benchmark.
++ Older motherboard with no chip set (discrete logic), no CPU cache,
   16 MB RAM
&& System A, CPU cache disabled via extended set-up, turbo-switch set to
   half speed (that is, 20 MHz)
!! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to
   the fast CPU used here, performance figures are somewhat higher than
   can be expected for a 80286/287 combination, except for the PEAKFLOP
   benchmark, which is basically coprocessor limited.
** 8086/8087 system with 640 kB RAM

Since neither a Weitek coprocessor nor a compiler that generates code
for the Weitek chips was available, performance data for the Weitek
Abacus is given here according to [31,32] and scaled to show performance
of a 33 MHz system. The benchmarks were compiled using highly optimizing
32-bit compilers.

                       Single Prec.    Double Prec.    Double Prec.
                       3167    4167    3167    4167     387     486

Linpack MFLOPS          1.8     5.0     0.8     3.2     0.4     1.6
Whetstone kWhet/sec    7470   22700    4900   14000    3290   12300

Note that for the Intel coprocessors, running programs in single vs.
double precision doesn't provide much of a performance advantage since
all internal calculations are always done in extended precision. Using
Weitek coprocessors however, performance nearly doubles when switching
from double to single precision. For double precision calculations using
only basic arithmetic, the Weitek Abacus can provide at most twice the
performance of the respective Intel coprocessor (387/486) clocked at the
same speed.

Speed of various coprocessor instructions in clock cycles, as measured
with my program 87TIMES. Error is +/- one clock cycle, except for the
Intel 80287. Times for the 80287 were determined on a system with a
20 MHz 80386 and a 5 MHz Intel 80287. Therefore, times may differ from a
genuine 80286/287 system, especially for those instructions that access
an operand in memory.
Since the times are stated as the number of coprocessor clock cycles
used, the faster 386, which executes four clock cycles in the time the
80287 executes one, may decrease memory access times as seen by the
coprocessor.

                Intel    Intel  Cyrix  Cyrix    C&T   ULSI    IIT  Intel  Intel
                 i486 RapidCAD  83D87   387+  38700  83C87   3C87  387DX  80387

FLD1                4        3     14     14     14     18     24     23     26
FLDZ                4        3     14     14     14     18     24     23     31
FLDPI               7        8     14     15     14     18     24     38     45
FLDLG2              7        8     14     14     14     18     24     33     45
FLDL2T              7        8     14     14     14     19     24     38     45
FLDL2E              7        8     14     14     14     19     24     38     45
FLDLN2              7        8     14     14     14     19     24     38     45
FLD ST(0)           4        4     14     14     14     14     24     20     21
FST ST(1)           3        4     14     14     14     14     19     18     22
FSTP ST(0)          4        4     14     14     14     15     19     19     22
FSTP ST(1)          4        4     15     15     14     15     19     20     22
FLD ST(1)           4        4     14     14     14     14     24     18     21
FXCH ST(1)          4        4     14     20     14     19     24     24     27
FILD [Word]        12       16     33     37     32     42     38     47     62
FILD [DWord]        8       11     26     26     21     32     28     35     45
FILD [QWord]        9       15     30     30     25     36     32     34     54
FLD [DWord]         3        5     26     26     21     23     28     20     25
FLD [QWord]         3        7     30     30     25     27     32     24     35
FLD [TByte]         5       11     46     46     46     46     47     46     57
FBLD [TByte]       83       90     66     86    106    146    197     71    278
FIST [Word]        31       31     37     40     37     42     51     69     90
FIST [DWord]       29       30     35     40     35     40     49     66     84
FST [DWord]         7        7     35     37     32     40     33     37     40
FST [QWord]         8        9     43     43     39     47     40     45     51
FISTP [Word]       32       32     42     40     37     43     46     70     90
FISTP [DWord]      31       31     40     40     35     41     50     67     87
FISTP [QWord]      29       29     44     44     42     48     56     73     92
FSTP [DWord]        8        8     38     36     32     41     35     38     43
FSTP [QWord]        9        9     46     43     39     48     42     46     49
FSTP [TByte]        8        8     50     45     49     50     48     53     58
FBSTP [TByte]     170      172     98     98    114    129    218    144    533
FINIT              17       31     15     16     15     15     16     16     25
FCLEX               7       20     15     16     16     16     16     16     25
FCHS                7        8     14     15     14     14     19     30     33
FABS                5        5     14     15     14     14     19     30     33
FXAM               12       13     14     15     14     14     19     39     43
FTST                5        5     19     25     14     24     24     34     38
FSTENV             67       82    125    125    124    132    124    159    165
FLDENV             44       59    106    106    112    120    106    119    129
FSAVE             181      169    355    355    374    361    376    469    511
FRSTOR            130      203    358    358    385    372    371    420    456
FSTSW [mem]         4        5     14     14     14     14     14     14     17
FSTSW AX            3        4     12     12     11     11     11     11     14
FSTCW [mem]         4        5     14     14     13     13     13     14     18
FLDCW [mem]         4       11     26     26     31     32     27     32     36
FADD ST,ST(0)       8        9     19     20     19     19     24     24     32
FADD ST,ST(1)       9        9     19     20     19     18     24     20     32
FADD ST(1),ST      10       10     19     20     19     18     24     24     37
FADDP ST(1),ST     11       11     19     19     19     16     24     25     37
FADD [DWord]        9       10     25     28     22     23     23     21     34
FADD [QWord]        9       10     32     32     26     27     27     25     38
FIADD [Word]       20       21     34     34     33     40     40     52     80
FIADD [DWord]      20       21     27     28     27     30     30     37     61
FSUB ST(1),ST      10       10     19     20     19     19     24     24     38
FSUBR ST(1),ST      9       10     19     22     19     19     24     27     38
FSUBRP ST(1),ST    10       10     19     19     22     20     24     25     38
FSUB [DWord]       11       12     27     28     27     23     29     27     32
FSUB [QWord]       11       12     32     32     31     27     33     26     44
FISUB [Word]       21       21     34     34     34     40     40     52     80
FISUB [DWord]      21       22     27     28     27     29     30     40     60
FMUL ST,ST(1)      16       17     19     25     24     24     29     38     57
FMUL ST(1),ST      16       17     19     24     24     24     29     40     62
FMULP ST(1),ST     17       17     19     24     24     25     29     40     58
FIMUL [Word]       22       23     40     40     37     46     46     52     80
FIMUL [DWord]      22       23     27     28     27     36     35     45     68
FMUL [DWord]       11       12     27     28     27     28     29     25     45
FMUL [QWord]       14       15     32     32     31     32     33     37     61
FDIV ST,ST(0)      73       74     26     40     59     54     54     89    100
FDIV ST,ST(1)      73       74     36     45     59     54     54     77    100
FDIV ST(1),ST      73       74     36     45     59     55     54     78    102
FDIVR ST(1),ST     73       74     36     45     59     54     54     77    102
FDIVRP ST(1),ST    73       74     36     44     59     55     54     76    106
FIDIV [Word]       84       85     52     58     75     76     76    105    141
FIDIV [DWord]      84       85     45     46     65     65     65    101    123
FDIV [DWord]       73       74     45     46     63     56     59     77    101
FDIV [QWord]       73       74     50     50     67     60     63     78    103
FSQRT (0.0)        25       25     19     19     14     19     24     29     37
FSQRT (1.0)        83       84     36     74     54     89     59    109    132
FSQRT (L2T)        86       87     36     74     54     89     59    104    137
FXTRACT (L2T)      17       17     19     19     19     28     79     53     72
FSCALE (PI,5)      30       30     36     24     24     49     79     59     82
FRNDINT (PI)       31       31     19     29     24     34     29     49     82
FPREM (99,PI)      58       59     54     99     44     54     49     79     96
FPREM1(99,PI)      90       91     54     99     44     59     54    104    121
FCOM                5        6     15     20     19     25     19     29     32
FCOMP               6        6     15     19     19     25     19     30     33
FCOMPP              7        7     15     19     19     25     19     31     40
FICOM [Word]       16       17     34     34     33     46     34     58     76
FICOM [DWord]      16       16     21     28     21     35     23     45     57
FCOM [DWord]        5        6     21     28     22     23     23     27     34
FCOM [QWord]        5        8     27     32     25     27     27     31     39
FSIN (0.0)         24       24     14     99     14     19     24     39     43
FSIN (1.0)        310      313    114    164    144    494    219    509    596
FSIN (PI)          88       89    118    189     64     64    214    134    152
FSIN (LG2)        292      295     72     89    139    454    184    449    531
FSIN (L2T)        299      302    123    179    164    469    214    454    536
FCOS (0.0)         24       24     19    159     14     19     24     34     42
FCOS (1.0)        302      305     84    104    139    489    214    459    547
FCOS (PI)          88       89    154    254     64     64    224    199    232
FCOS (LG2)        300      303    108    149    139    454    194    504    583
FCOS (L2T)        307      310    159    239    164    469    224    509    601
FSINCOS (0.0)      25       25     14     19     19     18     34     38     55
FSINCOS (1.0)     353      356    124    174    254    493    419    538    636
FSINCOS (PI)      105      106    162    263     79     68    424    228    277
FSINCOS (LG2)     340      343    119    159    249    458    359    533    627
FSINCOS (L2T)     347      350    168    248    274    473    424    538    646
FPTAN (0.0)        25       25     14     19     19     18     29     38     46
FPTAN (1.0)       266      269    119    149    184    538    309    323    396
FPTAN (PI)        145      146    134    228    104    108    304    168    211
FPTAN (LG2)       244      246     94    129    179    498    274    298    363
FPTAN (L2T)       247      249    139    219    204    513    304    298    365
FPATAN (0.0)       38       39     19     24     19     20     29     95     93
FPATAN (1.0)      294      298    124    159     29    375    604    360    433
FPATAN (PI)       304      308    139    188    279    360    424    375    472
FPATAN (LG2)      290      293    128    154    269    365    379    375    448
FPATAN (L2T)      304      308    144    189    274    359    424    375    468
F2XM1 (0.0)        25       25     14     14     14     19     24     34     37
F2XM1 (LN2)       209      211     89    119    169    394    284    299    348
F2XM1 (LG2)       204      206     78    104    159    379    284    294    337
FYL2X (1.0)        60       61     36     39     24     75     94    115    127
FYL2X (PI)        294      297    108    163    249    450    359    395    504
FYL2X (LG2)       311      314    108    159    249    460    339    410    518
FYL2X (L2T)       293      296    108    164    249    439    359    390    501
FYL2XP1 (LG2)     334      337     99    169    234    460    284    435    538

                                    80386 +    80386 +    80386 +
                   Intel   Intel  Franke387     TP 6.0       EM87
                    8087   80287   Emulator   Emulator   Emulator

FSTP ST(0)      |     26      54        507        358       2115
FLD1            |     26      55        481        422       1626
FLDZ            |     21      53        480        416       1646
FLDPI           |     26      55        486        443       1626
FLDLG2          |     26      56        486        423       1626
FLDL2T          |     26      55        486        440       1626
FLDL2E          |     26      53        486        423       1626
FLDLN2          |     26      55        486        441       1626
FLD ST(0)       |     31      55        493        362       1851
FST ST(1)       |     26      54        489        355       1931
FSTP ST(1)      |     21      55        507        356       2116
FLD ST(1)       |     26      55        493        362       1852
FXCH ST(1)      |     21      57        497        486       2187
FILD [Word]     |     58      90        667        712       2259
FILD [DWord]    |     64      74        608        812       2164
FILD [QWord]    |     74      93        652        707       2971
FLD [DWord]     |     49      44        633        473       2077
FLD [QWord]     |     54      57        641        524       2336
FLD [TByte]     |     59      45        607        492       2063
FBLD [TByte]    |    309     310       2019       1512      17827
FIST [Word]     |     79      72        854        766       2418
FIST [DWord]    |     84      80        865        518       2325
FST [DWord]     |     89      85        686        441       2200
FST [QWord]     |     99      92        703        516       2481
FISTP [Word]    |     79      80        864        794       2620
FISTP [DWord]   |     79      81        879        541       2523
FISTP [QWord]   |     88      75        904        916       3226
FSTP [DWord]    |     89      75        713        467       2400
FSTP [QWord]    |     93      72        732        538       2678
FSTP [TByte]    |     49      21        685        467       2124
FBSTP [TByte]   |    528     472       3305       1555      27013
FINIT           |     11      10        742        641       1369
FCLEX           |     11      10        440        323        912
FCHS            |     21      54        460        354       1744
FABS            |     21      54        456        349       1738
FXAM            |     21      54        481        380       1551
FTST            |     51      75        585        386       2721
FSTENV          |     54      57        928        519       2104
FLDENV          |     48      50       1125        450       1631
FSAVE           |    214     244       1949        976       2749
FRSTOR          |    209     227       2182        657       2225
FSTSW [mem]     |     28      10        516        401       1189
FSTSW AX        |    N/A      55        451        N/A        N/A
FSTCW [mem]     |     28      10        506        359       1167
FLDCW [mem]     |     19      47        524        437       1584
FADD ST,ST(0)   |     86     128        643        706       2805
FADD ST,ST(1)   |     85     116        707        808       3093
FADD ST(1),ST   |     92     131        664        812       3146
FADDP ST(1),ST  |     92     129        704        799       3143
FADD [DWord]    |    105     122        874        969       3139
FADD [QWord]    |    115     122        888       1021       3396
FIADD [Word]    |    115     122        940       1211       3330
FIADD [DWord]   |    125     122        882       1297       3215
FSUB ST(1),ST   |     88     130        738        817       3156
FSUBR ST(1),ST  |     96     132        740        868       3004
FSUBRP ST(1),ST |     99     132        733        805       3301
FSUB [DWord]    |    119     122        918       1018       3127
FSUB [QWord]    |    129     123        932       1070       3632
FISUB [Word]    |    115     123        977       1081       3802
FISUB [DWord]   |    125     125        940        980       4161
FMUL ST,ST(1)   |    145     151        810       1368       3924
FMUL ST(1),ST   |    145     151        817       1377       3962
FMULP ST(1),ST  |    148     168        840       1365       4164
FIMUL [Word]    |    132     151       1039       1517       4039
FIMUL [DWord]   |    141     151        980       1643       3976
FMUL [DWord]    |    125     123        948       1480       3445
FMUL [QWord]    |    175     192        991       1602       4416
FDIV ST,ST(0)   |    201     207        726       1536       9789
FDIV ST,ST(1)   |    203     218        808       1658      10332
FDIV ST(1),ST   |    207     214        825       1655      10342
FDIVR ST(1),ST  |    201     206        819       1806      10213
FDIVRP ST(1),ST |    201     205        845       1803      10409
FIDIV [Word]    |    237     227        980       1779      11225
FIDIV [DWord]   |    246     227        944       1680      11572
FDIV [DWord]    |    229     226        893       1722      10577
FDIV [QWord]    |    236     227        993       1777      10829
FSQRT (0.0)     |     21      57        512        382       1755
FSQRT (1.0)     |    186     206       1106       2504      37836
FSQRT (L2T)     |    186     207       1398       2467      37925
FXTRACT (L2T)   |     51      56        726        571       3326
FSCALE (PI,5)   |     41      56        817        443       3194
FRNDINT (PI)    |     51      58        808        800       7092
FPREM (99,PI)   |     81     131       1696        941       4098
FPREM1(99,PI)   |    N/A     N/A       1625        N/A        N/A
FCOM            |     56      75        582        483       2799
FCOMP           |     61      92        616        485       2983
FCOMPP          |     61      90        661        476       3198
FICOM [Word]    |     79      77        808        861       3654
FICOM [DWord]   |     89      77        750        964       3684
FCOM [DWord]    |     74      75        741        625       3643
FCOM [QWord]    |     74      76        754        667       3771
FSIN (0.0)      |    N/A     N/A        639        N/A        N/A
FSIN (1.0)      |    N/A     N/A       4640        N/A        N/A
FSIN (PI)       |    N/A     N/A       2488        N/A        N/A
FSIN (LG2)      |    N/A     N/A       3911        N/A        N/A
FSIN (L2T)      |    N/A     N/A       3767        N/A        N/A
FCOS (0.0)      |    N/A     N/A        740        N/A        N/A
FCOS (1.0)      |    N/A     N/A       4777        N/A        N/A
FCOS (PI)       |    N/A     N/A       2557        N/A        N/A
FCOS (LG2)      |    N/A     N/A       4176        N/A        N/A
FCOS (L2T)      |    N/A     N/A       3905        N/A        N/A
FSINCOS (0.0)   |    N/A     N/A        714        N/A        N/A
FSINCOS (1.0)   |    N/A     N/A       6049        N/A        N/A
FSINCOS (PI)    |    N/A     N/A       4091        N/A        N/A
FSINCOS (LG2)   |    N/A     N/A       5640        N/A        N/A
FSINCOS (L2T)   |    N/A     N/A       5405        N/A        N/A
FPTAN (0.0)     |     41      58        752       8381       2324
FPTAN (1.0)     |    581     582       6366      10817      29824
FPTAN (PI)      |    606     587       4388      12410       2300
FPTAN (LG2)     |    516     513       5939      12502      26770
FPTAN (L2T)     |    576     586       5723      12483       2301
FPATAN (0.0)    |     41      55        616       1208      10578
FPATAN (1.0)    |    736     736       1426      13446      34208
FPATAN (PI)     |    206     207      12835      13305      46903
FPATAN (LG2)    |    756     736      12490      13319      41312
FPATAN (L2T)    |    206     204      12922      13364      50149
F2XM1 (0.0)     |     16      56        563        723       1722
F2XM1 (LN2)     |    631     624       4178      11070      33823
F2XM1 (LG2)     |    611     585       4798      11116      32163
FYL2X (1.0)     |     56      57        961       1214       4327
FYL2X (PI)      |    946     961       8987      12858      40148
FYL2X (LG2)     |   1081    1038       8933      12748      46821
FYL2X (L2T)     |    926     886       8982      12712      38986
FYL2XP1 (LG2)   |   1026    1037      10485      11867      44708

The Weitek 3167 and 4167 coprocessors only implement the basic
arithmetic functions (add, subtract, multiply, divide, square root) in
hardware.
Transcendental functions are implemented by means of a software library
supplied by Weitek that uses the Weitek hardware to approximate the
transcendental functions with polynomial and rational approximations.
The clock cycle timings for the transcendental functions are average
values, since execution time differs with the value of the argument. The
speed of transcendental functions for the 4167 is estimated based on the
numbers in [31,33], from which this timing information has been
extracted.

Execution time for floating-point operations in clock cycles on Weitek
coprocessors

              Single Precision    Double Precision
               3167      4167      3167      4167

ABS               3         2         3         2
NEG               6         2         6         2
ADD               6         2         6         2
SUB               6         2         6         2
SUBR              6         2         6         2
MUL               6         2        10         3
DIVR             38        17        66        31
SQRT             60        17       118        31
SIN             146       ~50       292      ~100
COS             140       ~50       285      ~100
TAN             188       ~60       340      ~110
EXP             179       ~60       401      ~130
LOG             171       ~60       365      ~120
F->ASCII       1000       N/A      1700       N/A   //
ASCII->F       1100       N/A      1800       N/A   //

// rough average of the timings given for different numeric formats by
   Weitek. Note that these conversion routines do much more work than
   the FBLD and FBSTP instructions provided by the 80x87 coprocessors.
   FBLD and FBSTP are useful for conversion routines but quite a bit of
   additional code is needed for this purpose.

Accuracy

The IEEE-754 Standard for Binary Floating-Point Arithmetic [10,11] is
fully implemented by Intel's 387 coprocessor [17]. Among other things,
this means that the add, subtract, multiply, divide, remainder, and
square root operations always deliver the 'exact' result. By exact it is
meant that the coprocessor always delivers the machine number closest to
the real result, which may not be representable exactly in the available
numeric format. The 80387 implements the single, double, and double
extended formats as specified in the standard as well as all functions
required by it [17]. Note that earlier Intel coprocessors (the 8087 and
the 80287) comply with a draft version of the standard that differs from
the final version.
These chips came out before the IEEE-754 standard was finally accepted
in 1985. As in the 80387, the basic arithmetic in the 8087 and the 80287
is exact in the sense that the computed result is always the machine
number closest to the real result. However, there are some differences
regarding certain operands like infinities, and some operations like the
remainder are defined differently. Some instructions have been added in
the 80387, most notably the FSIN and FCOS operations. The argument range
for some transcendental functions has been extended [17].

Note that the IEEE-754 standard says nothing about the quality of the
implementation of transcendental functions like sin, cos, tan, arctan,
log. Intel uses a modified CORDIC [18,19] technique to compute the
transcendental functions. Intel claims that the maximum error in the
8087, 80287, and 80387 for all transcendental functions does not exceed
two bits in the mantissa of the double extended format, which features
64 mantissa bits for an accuracy of approximately 19 decimal places
[22,23]. This claim has been independently verified by a competing
vendor [13]. This means that at least 62 of the 64 mantissa bits in a
transcendental function result are correct.

The Weitek Abacus 3167 and 4167 are 'mostly compatible' with IEEE-754
[31,32,33]. They support the single precision and double precision
numeric formats described in the standard as well as the four rounding
modes required by it. However, due to the need for extremely high speed
operation, some of the finer points of IEEE-754 have not been
implemented. One of the most notable omissions is the missing support
for denormal numbers. Denormals are always flushed to zero.

The 387 clone makers claim 100% compatibility with Intel's 80387, so one
would expect the same accuracy from their chips. For example, on the
packaging of the IIT 3C87 it says that ".. the requirements of ANSI/IEEE
standards are fulfilled and exceeded".
Cyrix states that their 83D87 complies fully with the IEEE-754 standard
[12].

Cyrix ships some diagnostic software with their coprocessors. This
includes the program IEEETEST, which is based on the IEEE test vectors
from the Ph.D. thesis of Jerome T. Coonen [9]. A test using the IEEE
test vectors has also been included in the RUNDIAG program on the Intel
RapidCAD diagnostic disk. Rather than performing random tests, the test
vectors check specific cases that may be hard to get right. Each test
vector specifies the operation to be performed, the operands, precision
and rounding mode to be used, and the result (including flags set) to be
expected according to IEEE-754.

I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486,
Intel RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87, and the Cyrix 387+
passed with no errors. The ULSI 83C87 showed some minor flaws in the
FCOM, FDIV, FMUL, and FSCALE operations, getting flag errors in about 1%
of the tested cases, but no computational errors. However, for the IIT
3C87, the IEEETEST program showed flag *and* some computational errors
(that is, wrong results) for all tested operations except FXTRACT and
FCHS. The Intel 80287 shows numerous errors, but this is not surprising,
since the 80287 does not comply with IEEE-754 but with an earlier draft
of that standard, so it does some things differently than required by
the final version of the standard.

Although IEEETEST is written in Turbo Pascal, the coprocessor emulator
in the TP 6.0 library could not be tested since IEEETEST was compiled
with the $E- switch, which excludes the emulator from the program code.
The public domain emulator EM87 could be tested, but hung in the last
test, which checks the implementation of the remainder operation. This
is probably caused by some bug in the emulation of the FPREM instruction
exercised by this test.
It is interesting to note that the error profile of EM87 matches exactly
that of the Intel 80287, so it can be assumed that EM87 is a very good
emulation of the 80287. The Franke387 V2.4 emulator hung in the division
test quite early in IEEETEST. The tests performed up to the division
test reported several errors.

Explanatory text printed at the start of the IEEETEST program:

JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities
as a member of the floating-point working group that defined the IEEE
754-1985 Standard for Binary Floating-Point Arithmetic. Appendix C of
his thesis presents FPTEST, a Pascal program written by J Thomas and JT
Coonen. IEEETEST is a port of FPTEST and runs on PCs whose math
coprocessor accepts 80387 compatible floating-point instructions.

IEEETEST reads test vectors from the file TESTVECS and compares the
answer returned by the math coprocessor with the answer listed in the
test vector. If these answers differ an 'F' is displayed, otherwise a
'.' is displayed. Answers can differ due to two types of failures:
numeric failures or flag failures. Numeric failures occur when the
computed answer has the wrong value. Flag failures occur when the status
(invalid operation, divide by zero, underflow, overflow, inexact) is
incorrectly identified.

TESTVECS is the concatenation of unmodified versions of all the test
vectors distributed by UC Berkeley. The test data base is copyrighted by
UC Berkeley (1985) and is being distributed with their permission.
FPTEST and the test data base can be obtained by asking for 'IEEE-754
Test Vector' from UC Berkeley, Electrical Engineering and Computer
Science, Industrial Liaison Program, 479 Corey Hall, Berkeley, CA, 94720
(415)643-6687. The initial version of this test data base for the
proposed IEEE 754 binary floating-point standard (draft 8.0) was
developed for Zilog, Inc. and was donated to the floating-point working
group for dissemination.
Errors in or additions to the distributed data base should be reported
to the agency of distribution, with copies to Zilog, Inc., 1315 Dell
Avenue, Campbell, CA, 95008.

IEEETEST output for Intel 80387, Intel 387DX, Intel 486, C&T 38700,
Cyrix 83D87, Cyrix 387+, Intel RapidCAD

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |     0    0    0 |     0    0    0
Addition         + |   3528      0 |     0    0    0 |     0    0    0
Comparison       C |   4320      0 |     0    0    0 |     0    0    0
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   4311      0 |     0    0    0 |     0    0    0
Fraction Part    F |    624      0 |     0    0    0 |     0    0    0
Logb             L |    960      0 |     0    0    0 |     0    0    0
Multiplication   * |   3978      0 |     0    0    0 |     0    0    0
Negation         - |    216      0 |     0    0    0 |     0    0    0
Next After       N |   2832      0 |     0    0    0 |     0    0    0
Round to Integer I |    558      0 |     0    0    0 |     0    0    0
Scalb            S |    948      0 |     0    0    0 |     0    0    0
Square Root      V |    744      0 |     0    0    0 |     0    0    0
Subtraction      - |   3528      0 |     0    0    0 |     0    0    0
Remainder        % |   2984      0 |     0    0    0 |     0    0    0
Totals             |  31235      0 |

IEEETEST output for ULSI 83C87 (manufactured 91/48)

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |     0    0    0 |     0    0    0
Addition         + |   3528      0 |     0    0    0 |     0    0    0
Comparison       C |   4312      8 |     0    0    0 |     0    0    8
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   4250     61 |     0    0    0 |    28   28    5
Fraction Part    F |    624      0 |     0    0    0 |     0    0    0
Logb             L |    960      0 |     0    0    0 |     0    0    0
Multiplication   * |   3936     42 |     0    0    0 |    19   19    4
Negation         - |    216      0 |     0    0    0 |     0    0    0
Next After       N |   2828      4 |     0    0    0 |     0    0    4
Round to Integer I |    558      0 |     0    0    0 |     0    0    0
Scalb            S |    930     18 |     0    0    0 |     6    6    6
Square Root      V |    744      0 |     0    0    0 |     0    0    0
Subtraction      - |   3528      0 |     0    0    0 |     0    0    0
Remainder        % |   2984      0 |     0    0    0 |     0    0    0
Totals             |  31102    133 |

IEEETEST output for ULSI 83S87 (manufactured 92/17)
(data kindly supplied by Bengt Ask, f89ba@efd.lth.se)

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |     0    0    0 |     0    0    0
Addition         + |   3528      0 |     0    0    0 |     0    0    0
Comparison       C |   4320      0 |     0    0    0 |     0    0    0
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   4296     15 |     0    0    0 |     5    5    5
Fraction Part    F |    624      0 |     0    0    0 |     0    0    0
Logb             L |    960      0 |     0    0    0 |     0    0    0
Multiplication   * |   3966     12 |     0    0    0 |     4    4    4
Negation         - |    216      0 |     0    0    0 |     0    0    0
Next After       N |   2828      4 |     0    0    0 |     0    0    4
Round to Integer I |    558      0 |     0    0    0 |     0    0    0
Scalb            S |    930     18 |     0    0    0 |     6    6    6
Square Root      V |    744      0 |     0    0    0 |     0    0    0
Subtraction      - |   3528      0 |     0    0    0 |     0    0    0
Remainder        % |   2984      0 |     0    0    0 |     0    0    0
Totals             |  31102     45 |

IEEETEST output for IIT 3C87 (manufactured 92/20)

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    200     16 |     0    0   16 |     0    0    0
Addition         + |   3336    192 |     0    0  128 |     0    0   96
Comparison       C |   4224     96 |     0    0   96 |     0    0    0
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   4159    152 |     0    0  124 |     0    0  116
Fraction Part    F |    600     24 |     0    0   24 |     0    0   24
Logb             L |    960      0 |     0    0    0 |     0    0    0
Multiplication   * |   3702    276 |     0    0  248 |     0    0  100
Negation         - |    200     16 |     0    0   16 |     0    0    0
Next After       N |   2248    584 |     0    0  584 |     0    0  168
Round to Integer I |    542     16 |     0    0    4 |     0    0   16
Scalb            S |    874     74 |     5    5   44 |     8    8   20
Square Root      V |    688     56 |     0    0   56 |     0    0   56
Subtraction      - |   3336    192 |     0    0  128 |     0    0   96
Remainder        % |   2844    140 |     0    0  140 |     0    0  116
Totals             |  29401   1834 |

IEEETEST output for Intel 80287 run together with a 80386 CPU

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |     0    0    0 |     0    0    0
Addition         + |   2886    642 |    16   16  112 |   174  174  174
Comparison       C |      0   4320 |  1324 1324 1324 |  1332 1332 1332
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   3777    534 |    18   18   37 |   169  169  165
Fraction Part    F |    552     72 |    24   24   24 |    24   24   24
Logb             L |    900     60 |    12   12   12 |    20   20   20
Multiplication   * |   2944   1034 |   105  105  197 |   303  303  231
Negation         - |    216      0 |     0    0    0 |     0    0    0
Next After       N |    348   2484 |   768  768  768 |   504  504  526
Round to Integer I |    546     12 |     0    0    0 |     4    4    4
Scalb            S |    663    285 |    45   43   26 |   102   98   46
Square Root      V |    720     24 |     4    4    4 |     8    8    8
Subtraction      - |   2886    642 |    16   16  112 |   174  174  174
Remainder        % |    708   2276 |   768  768  560 |   216  216  216
Totals             |  18850  12385 |

IEEETEST output for EM87 coprocessor emulator run on an Intel 386 CPU

IEEE-754 Test Vector    Precisions: S=Single D=Double E=Double Extended
                   |     TESTS     | numeric   TYPE OF FAILURE    flag
Operation     Code | Passed Failed |     S    D    E |     S    D    E
----------------------------------------------------------------------
Absolute Value   A |    216      0 |     0    0    0 |     0    0    0
Addition         + |   2886    642 |    16   16  112 |   174  174  174
Comparison       C |      0   4320 |  1324 1324 1324 |  1332 1332 1332
Copy Sign        @ |   1488      0 |     0    0    0 |     0    0    0
Division         / |   3777    534 |    18   18   37 |   169  169  165
Fraction Part    F |    552     72 |    24   24   24 |    24   24   24
Logb             L |    900     60 |    12   12   12 |    20   20   20
Multiplication   * |   2944   1034 |   105  105  197 |   303  303  231
Negation         - |    216      0 |     0    0    0 |     0    0    0
Next After       N |    348   2484 |   768  768  768 |   504  504  526
Round to Integer I |    546     12 |     0    0    0 |     4    4    4
Scalb            S |    663    285 |    45   43   26 |   102   98   46
Square Root      V |    720     24 |     4    4    4 |     8    8    8
Subtraction      - |   2886    642 |    16   16  112 |   174  174  174

To complement the checks done by IEEETEST I wrote some short programs
(DENORMTS, RCTRL, PCTRL) in Turbo Pascal 6.0 that test the following
features:

1. support for denormals in all precisions (single, double, extended)
2. support for the four IEEE rounding modes (up, down, nearest, chop)
3. support for precision control

Note that passing all tests is required both for IEEE conformance and
for compatibility with Intel's coprocessors. Precision control forces
the results of the FADD, FSUB, FMUL, FDIV, and FSQRT instructions to be
rounded to the specified precision (single, double, double extended).
This feature is provided to obtain compatibility with certain
programming languages [17]. By specifying lower precision, one
effectively nullifies the advantages of extended precision intermediate
results. The IEEE-754 standard for floating point arithmetic demands
that processors/fp-packages that cannot store the result of operations
*directly* to single and double precision locations must provide
precision control.

The programs that test precision control and rounding control are
designed to return a different result for each of the modes for the same
sequence of operations. The source code of the programs can be found in
appendix A. The Intel 8087 and 80287 were not tested with DENORMTS since
Turbo Pascal does not support extended precision denormals on 8087/80287
processors, so the denormal test fails anyway. The 8087 and 287 pass the
RCTRL and PCTRL tests, though.
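To illustrate what precision control and rounding control do to a single
result, here is a small model in Python using exact rational arithmetic.
It is not the PCTRL or RCTRL program from appendix A, and it touches no
coprocessor control word; the function round_to_bits and the table
PRECISION_BITS are hypothetical names for this sketch, which ignores
exponent range, denormals and flags.

```python
from fractions import Fraction
import math

# Significand widths selected by the three precision-control settings.
PRECISION_BITS = {"single": 24, "double": 53, "extended": 64}


def round_to_bits(x: Fraction, bits: int, mode: str = "nearest") -> Fraction:
    """Round the exact value x to `bits` significand bits (exponent unbounded)."""
    if x == 0:
        return x
    ax = abs(x)
    # Find e with 2**e <= |x| < 2**(e+1).
    e = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** e > ax:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)   # spacing of representable values
    q = x / ulp                           # exact quotient, usually not an integer
    if mode == "down":                    # toward minus infinity
        n = math.floor(q)
    elif mode == "up":                    # toward plus infinity
        n = math.ceil(q)
    elif mode == "chop":                  # toward zero
        n = math.trunc(q)
    elif mode == "nearest":               # to nearest, ties to even
        n = math.floor(q)
        rest = q - n
        if rest > Fraction(1, 2) or (rest == Fraction(1, 2) and n % 2 == 1):
            n += 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return n * ulp


if __name__ == "__main__":
    x = Fraction(-1, 3)
    for name, bits in PRECISION_BITS.items():
        print(name, round_to_bits(x, bits))
    for mode in ("nearest", "down", "up", "chop"):
        print(mode, round_to_bits(x, 24, mode))
```

For a negative operand such as -1/3, "up" and "chop" agree while "down"
differs, which is why a test program needs a sequence of operations
chosen so that all four modes end up with different results.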
These are the results for the Intel 387, Intel 387DX, Intel 486, Intel RapidCAD, Cyrix 83D87, Cyrix 387+, C&T 38700, and the EM87 emulator (on an 80386 machine):

Precision Control
SINGLE    1.13311278820037842E+0000
DOUBLE    1.23456789006442125E+0000
EXTENDED  1.23456789012337585E+0000

Rounding Control
NEAREST  -1.23427629010100635E+0100
DOWN     -1.23427623555772409E+0100
UP       -1.23457760966801097E+0100
CHOP     -1.23397493540770643E+0100

Denormal support
SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041
DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311
EXTENDED denormals supported
EXTENDED denormal prints as:  1.31640625000000E-4934
Denormal should be printed as 1.3164...E-4934

These are the results for the ULSI 83C87:

Precision Control
SINGLE    1.23456789012337585E+0000
DOUBLE    1.23456789012337585E+0000
EXTENDED  1.23456789012337585E+0000

Rounding Control
NEAREST  -1.23427629010100635E+0100
DOWN     -1.23427623555772409E+0100
UP       -1.23457760966801097E+0100
CHOP     -1.23397493540770643E+0100

Denormal support
SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041
DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311
EXTENDED denormals supported
EXTENDED denormal prints as:  1.31640625000000E-4934
Denormal should be printed as 1.3164...E-4934

These are the results for the IIT 3C87:

Precision Control
SINGLE    1.13311278820037842E+0000
DOUBLE    1.23456789006442125E+0000
EXTENDED  1.23456789012337585E+0000

Rounding Control
NEAREST  -1.23427629010100635E+0100
DOWN     -1.23427623555772409E+0100
UP       -1.23457760966801097E+0100
CHOP     -1.23397493540770643E+0100

Denormal support
SINGLE denormals supported
SINGLE denormal prints as:    4.60943116855005E-0041
Denormal should be printed as 4.60943...E-0041
DOUBLE denormals supported
DOUBLE denormal prints as:    8.75000000000016E-0311
Denormal should be printed as 8.75...E-0311
EXTENDED denormals not supported

These are the results for the TP 6.0 coprocessor emulator:

Precision Control
SINGLE    1.23456789012351396E+0000
DOUBLE    1.23456789012351396E+0000
EXTENDED  1.23456789012351396E+0000

Rounding Control
NEAREST  -1.23457766383395931E+0100
DOWN     -1.23457766383395931E+0100
UP       -1.23457766383395931E+0100
CHOP     -1.23457766383395931E+0100

Denormal support
SINGLE denormals not supported
DOUBLE denormals not supported
EXTENDED denormals not supported

The test results show that the IIT 3C87 does not conform to the IEEE-754 floating-point standard in that it does not support denormals in double extended precision. The ULSI 83C87 does not conform to that standard in that it does not support precision control, but uses double extended precision for all operations. The TP 6.0 emulator supports neither precision control nor rounding control, nor does it support any denormals. In addition, its basic arithmetic operations do not seem to conform to the IEEE standard, as the results of the test programs differ from every result computed by a coprocessor in any mode.

With regard to the accuracy of transcendental functions, Cyrix claims that the relative error of the transcendental functions on the 83D87 never exceeds 0.5 units in the last place (0.5 ULP) of the double extended format [13]. This means that the maximum relative error is below 2**-64, while Intel's published error limit is 2**-62. While Intel uses a modified CORDIC algorithm [18,19] to compute the transcendental functions, Cyrix uses rational approximations that utilize a very fast array multiplier. For an explanation of why this approach is superior to CORDIC with today's technology, see [61].
Also, Cyrix uses an internal 75 bit data path for the mantissa [15], so intermediate computations in the generation of transcendental function values enjoy some additional accuracy over the 64 bits provided by the double extended format. Using 75 mantissa bits also provides an advantage over other coprocessors like the Intel 387DX and ULSI 83C87, which use only a 68 bit data path for the mantissa [58,59].

Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor does not mean that it returns the 'exact' result (the machine number closest to the infinitely precise result) all the time. Just consider the case where the infinitely precise result of a transcendental function falls nearly halfway between two machine numbers. A relative error of 0.5 ULP can cause the result to be either of the two numbers after rounding, depending on the direction of the error. But the 83D87 should deliver results that never differ from the 'exact' result by more than one ULP.

Please note that the claim of the relative error being below 0.5 ULPs is slightly exaggerated; 0.6 ULPs would be a more realistic error limit. Imagine that the infinitely precise result for some argument to a transcendental function was xxx..xxx1001... (where the xxx..xxx represent the first 64 bits of the result), but that the coprocessor computes the result as xxx..xxx0111 and then rounds this down to xxx..xxx0000. Then the relative error is (1001b-0000b)/10000b = 9/16 = 0.5625 ULPs. I tested some of the transcendental functions of the Cyrix 387+ and found the relative error to be always below 0.6 ULPs.

Cyrix also claims that its transcendental functions satisfy the monotonicity criterion [13]. None of the competitors makes this claim, which does not mean that the transcendental functions on the other 387 compatibles are not monotonic, too. Monotonicity means that for all x1 > x2, it always follows that f(x1) >= f(x2) for an increasing function like sin on [0..pi/4].
Likewise, for a decreasing function like cos on [0..pi/4], for all x1 > x2, it follows that f(x1) <= f(x2).

The Weitek Abacus 3167 and 4167 implement only the basic arithmetic operations (add, subtract, negate, multiply, divide, square root) in hardware. Transcendental functions are provided via a software library supplied by Weitek. For these library functions Weitek claims a maximum relative error of 5 ULPs [31,33] (ULP = Unit in the Last Place, the numeric weight of the least significant mantissa bit). This means that the last three bits in the mantissa of a double precision result can be wrong. Note that the Intel 387 and compatible math coprocessors generate the transcendental functions with a small relative error with regard to the *extended double precision* format. Thus, when rounded to double precision, their function values are nearly always 'exact'. 387-type coprocessors therefore have superior accuracy when compared with Weitek's coprocessors.

The test diskette distributed with early versions of the Cyrix 83D87 contained a program TRANCK that checks the accuracy of the transcendental functions in the coprocessor against a more precise software arithmetic [16]. I used this program to compare the accuracy of the transcendental functions on those 287/387/486 coprocessors/FPUs available to me. As TRANCK will not accept negative numbers as interval limits, I tested each function on an interval along the positive x-axis. The functions tested are F2XM1 (2**x-1), FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent), FYL2X (y * log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the transcendental functions implemented on the 80387. Note that the square root (FSQRT) is *not* a transcendental function. For every function, 100,000 arguments were evaluated. The arguments were uniformly distributed within the interval tested.
The EM87 emulator could not be checked with TRANCK, since the multiple precision package in TRANCK would always return with an error message immediately. However, the Franke387 emulator could be tested.
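The kind of measurement described above (uniformly distributed arguments, comparison against a more precise software arithmetic, error reported in ULPs) can be sketched in Python as follows. This is an illustration of the general method only, not a reconstruction of TRANCK, whose internals are not described here. It assumes Python 3.9 or later, measures errors in ULPs of the *double* format (Python floats) rather than double extended, and uses the standard decimal module as the reference arithmetic. All function names are hypothetical.

```python
import math
import random
from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 50  # reference arithmetic: 50 significant decimal digits

def ulp_error(computed: float, exact: Fraction) -> Fraction:
    """Distance between computed and exact, in double precision ULPs."""
    ulp = Fraction(math.ulp(float(exact)))
    return abs(Fraction(computed) - exact) / ulp

def reference_exp(x: float) -> Fraction:
    """exp(x) evaluated in the 50 digit reference arithmetic."""
    return Fraction(Decimal(x).exp())

def max_ulp_error(func, reference, lo, hi, samples=1000, seed=1):
    """Largest ULP error of func over uniformly distributed arguments."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        worst = max(worst, ulp_error(func(x), reference(x)))
    return worst

def is_monotonic_increasing(func, xs):
    """True if func never decreases as the arguments in xs increase."""
    ys = [func(x) for x in sorted(xs)]
    return all(a <= b for a, b in zip(ys, ys[1:]))

if __name__ == "__main__":
    worst = max_ulp_error(math.exp, reference_exp, 0.0, 1.0)
    print("max error of exp on [0,1]: %.4f ULPs" % float(worst))
```

Sampling can only give a lower bound on the true maximum error, and a sampled monotonicity check can only find violations, never prove their absence. Both limitations apply to any test of this kind that does not enumerate every argument.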