NetNews Usenet Archive 1992 #18

Internet Message Format | 1992-08-19 | 62.0 KB

Path: sparky!uunet!kithrup!hoptoad!decwrl!elroy.jpl.nasa.gov!usc!sol.ctr.columbia.edu!ira.uka.de!uka!uka!news
From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
Newsgroups: comp.sys.intel
Subject: What you always wanted to know about math coprocessors for 80x86 2/4
Date: 19 Aug 1992 15:01:07 GMT
Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
Lines: 930
Distribution: world
Message-ID: <16tnnjINNcqt@iraul1.ira.uka.de>
NNTP-Posting-Host: irav1.ira.uka.de
X-News-Reader: VMS NEWS 1.23

Cyrix EMC87 is basically a special version of the Cyrix 83D87. In addition to the normal 387 operating mode, in which coprocessor-CPU communication is handled through reserved IO-ports, it also offers a memory-mapped mode of operation similar to the operating principle of the Weitek Abacus. Please note that the EMC87 is *not* compatible with Weitek's Abacus coprocessor. Both use the same interface technique (memory mapping), but while the EMC87 uses the standard 387 instruction set, the Weitek coprocessors use a different instruction set of their own. Like the Weitek Abacus, the EMC87 occupies a 64 kByte memory block starting at physical address C0000000h. It can therefore only be accessed in the protected or virtual modes of the 386 CPU. DOS programs can access the EMC87 with the help of DOS-extenders or memory managers like EMM386, which run in protected/virtual mode themselves.

Since the EMC87 also provides the standard CPU interface via IO-ports, it can be used just like any other 387 compatible coprocessor and delivers the same performance as the Cyrix 83D87 in this mode. However, using the memory-mapped mode of the EMC87 provides a significant speed advantage. The traditional 387 CPU-coprocessor interface via IO-ports has an overhead of about 16-20 clock cycles. Since the Cyrix 83D87 executes some operations like addition and multiplication in much less time, its performance is limited by the CPU-coprocessor interface.
The memory-mapped mode has much less overhead and allows all coprocessor instructions to be executed at full speed with no penalty. For this reason, Cyrix introduced the EMC87 in 1990. In one test, the EMC87 at 33 MHz ran the single precision Whetstone benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz reached only 5049 kWhetstones/sec, an increase of about 50.7% [63]. In another test, the EMC87 ran a fractal computation at twice the speed of the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third test found the EMC87's overall performance to be 20% higher than that of the Cyrix 83D87 [65].

The Cyrix FasMath EMC87 has also been sold as Cyrix AutoMATH by Cyrix. The two chips are 100% identical. Unlike the Cyrix 83D87, which fits into the 68-pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that not all boards have such a socket, a notable exception being IBM's PS/2s, for example.

Originally, Cyrix claimed support for the fast memory-mapped mode of the EMC87 from a lot of software vendors (including Borland and Microsoft). However, only very few applications make use of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP FORTRAN-386 compiler, and Intusoft's Spice [63]. I haven't seen the EMC87 being offered for about nine months now. It may be that Cyrix has discontinued this product due to a lack of sufficient software support. The EMC87 was available in 25 and 33 MHz versions at the end of 1991.

Cyrix 387+ seems to be the successor to the Cyrix 83D87. On ordering a Cyrix coprocessor about a month ago, I was automatically supplied with a 387+. In my tests, I found the Cyrix 387+ to be about 5 to 10 percent *slower* than the Cyrix 83D87 overall. Some instructions, however, such as the square root (FSQRT), now run at only half the speed at which they ran on the 83D87 (see performance results below).
I also found the transcendental functions on the 387+ to be a bit more accurate than those implemented in the 83D87. Why Cyrix has brought out a new coprocessor that is slower than the 83D87, I don't know. I have written to Cyrix about this question but haven't received a reply yet. Maybe the new coprocessor solves the one small hardware compatibility problem the 83D87 had (see the paragraph on the 83D87 above). It could also be that Cyrix had to design around the three patents Intel claims the 83D87 violates. I have no idea whether the Cyrix 387+ is to replace the 83D87 or whether both chips will coexist in the market. Like the 83D87, the 387+ is available for speeds of up to 40 MHz.

Cyrix 83S87 is the SX version of the Cyrix 83D87. Just as the Cyrix 83D87 is the fastest 387 compatible coprocessor, the Cyrix 83S87 is the fastest of the 387SX compatible coprocessors [1]. Besides being the fastest 387SX 'clone', the Cyrix 83S87 also features the most accurate transcendental functions. The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, and 25 MHz versions. Due to the advanced power saving features of the Cyrix coprocessor, the typical power consumption of the 20 MHz version is about 350 mW [67].

ULSI 83C87 is a 387 'clone' that came out in early 1991, well after the IIT 3C87 and Cyrix 83D87. Like all clones, it is somewhat faster than the Intel 387DX. The basic arithmetic functions are especially fast, while the transcendental functions show only a slight speed improvement over the Intel 387DX (see benchmark results below). In my tests, the ULSI had the most inaccurate transcendental functions. However, the maximum relative error is still within the limits set by Intel, so this is probably not an important issue in all but very few applications. The ULSI shows some minor flaws in the tests for IEEE-754 compatibility, but this, too, is unimportant under typical operating conditions.
ULSI claims that the program IEEETEST, which was used to test for IEEE compatibility, contains many personal interpretations of the IEEE standard by the program's author and states that there is no ANSI-certified IEEE-754 compliance test. While this is most probably true, it is also a fact that the IEEE test vectors used in IEEETEST are something of an industry standard and that Intel's 387, 486, and RapidCAD chips pass it without a single failure. Since the ULSI Math*Co 83C87 fails some of the tests, it is certainly less than 100% compatible with Intel's chips, although this will hardly make any difference under typical operating conditions.

The ULSI 83C87 is also not fully compatible with the Intel 387DX in that it does not implement the precision control feature of Intel's coprocessor [58]. While all the internal operations of 80x87 coprocessors are usually done with the maximum precision available (double extended precision with 64 mantissa bits), the 80x87 also offers the possibility to force lower precision to be used for the basic arithmetic functions add, subtract, multiply, divide, and square root. This feature was included for compatibility with floating-point implementations that existed at the time the 8087 was devised. All coprocessors except the ones from ULSI support this feature. Since precision control is rarely used, this incompatibility with the Intel 387DX does not pose major problems. IEEE-754 mentions precision control, but requires it only for those systems that cannot store single and double precision results. Therefore, the standard does not call for precision control in the 387 coprocessor, and the ULSI 83C87's failure to provide precision control does not constitute a conflict with the IEEE-754 standard for floating point arithmetic.

Like the other 387 'clones', the 83C87 does not support asynchronous operation of the CPU and the coprocessor. This means that the 83C87 always runs at the full speed of the CPU.
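
To make the precision control feature discussed above more concrete, here is a minimal Python sketch of how the setting is encoded. On the 80x87 it lives in the PC field of the control word (bits 8 and 9). The function names are my own and purely illustrative; the sketch only decodes and builds control word values and does not touch any hardware.

```python
# Precision-control (PC) field of the 80x87 control word: bits 8-9.
# Values: 00 = single (24 mantissa bits), 01 = reserved,
#         10 = double (53 bits), 11 = double extended (64 bits).
PC_MODES = {
    0b00: ("single", 24),
    0b01: ("reserved", None),
    0b10: ("double", 53),
    0b11: ("double extended", 64),
}

def precision_control(control_word):
    """Return (name, mantissa_bits) selected by the PC field."""
    pc = (control_word >> 8) & 0b11
    return PC_MODES[pc]

def set_precision(control_word, pc):
    """Return a copy of control_word with the PC field replaced by pc."""
    return (control_word & 0xFFFF & ~0x0300) | ((pc & 0b11) << 8)

default_cw = 0x037F                      # control word after FINIT
print(precision_control(default_cw))     # ('double extended', 64)
print(hex(set_precision(default_cw, 0b10)))  # 0x27f
```

A program that wants double precision rounding for the basic arithmetic functions would load a control word such as 027Fh with FLDCW. On a coprocessor without precision control, the results would still be computed with the full 64 mantissa bits.
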

The ULSI 83C87 is available in 20, 25, 33, and 40 MHz versions. The ULSI is produced in high performance, low power CMOS. Power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625 mW typical), and at 40 MHz the ULSI Math*Co 83C87 consumes max. 1500 mW (750 mW typical) [58]. The 83C87 is packaged in a 68-pin ceramic PGA. ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc. will replace the coprocessor up to three times free of charge should it ever fail.

ULSI 83S87 is the SX version of the ULSI 83C87 for operation with an Intel 386SX or an AMD Am386SX. It is functionally equivalent to the 83C87. To aid low power laptop designs, the ULSI 83S87 features an advanced power saving design with a sleep mode and a standby mode with only minimal power requirements. Power consumption under normal operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25 MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.

Intel RapidCAD is not a coprocessor, strictly speaking, although it is marketed as one. Rather, it is a CPU replacement. It is basically an Intel 486DX without the cache and with a 386 pinout. RapidCAD is delivered as a set of two chips. RapidCAD-1 goes into the 386 socket and contains the CPU and FPU. RapidCAD-2 goes into the coprocessor socket and contains a PAL that generates the Ferr signal, which is normally generated by a coprocessor and used by the motherboard circuitry to provide 287 compatible coprocessor exception handling in 386/387 systems. The RapidCAD instruction set is compatible with the 386, so it doesn't know the 486 specific instructions like BSWAP. Since the RapidCAD CPU core is very similar to the 486 CPU core, most of the register-to-register instructions execute in the same number of clock cycles as on the 486.
The use of the 386 bus interface causes instructions that access memory to execute at about the same speed as on the 386. The integer performance of the RapidCAD is definitely limited by the low memory bandwidth provided by the 386 bus interface (2 clock cycles per bus cycle) and the lack of an internal cache. CPU instructions often execute faster than they can be fetched from memory, even with a big and fast external cache. Therefore, the integer performance of the RapidCAD exceeds that of a 386 by at most 25%. This value was derived by running some programs that use mostly register-to-register operations and few memory accesses. This finding is supported by the SPEC ratings that Intel reports for the 386-33 and the RapidCAD-33. While the 386-33 has a SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase. Note that these tests used the old (1989) SPEC benchmark suite.

While CPU instructions often execute in one clock cycle on the RapidCAD, FPU instructions always take more than seven clock cycles. They are therefore rarely slowed down by the low memory bandwidth provided by the 386 bus interface. My tests show a 70%-100% performance increase for floating-point intensive benchmarks (see below) over a 386 based system using the Intel 387DX math coprocessor. This is consistent with the SPECfp rating reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85% increase. This means that a system that uses the RapidCAD is faster than any 386/387 combination, regardless of the type of 387 used (Intel 387DX or faster clone).
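
The percentage figures quoted above for the SPEC ratings can be checked with a few lines of Python. This is just the arithmetic on the four ratings given in the text; the helper name is my own.

```python
def percent_increase(old, new):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0

# SPEC ratings at 33 MHz as quoted in the text (1989 SPEC suite).
specint = percent_increase(6.4, 7.3)   # 386-33 -> RapidCAD-33
specfp = percent_increase(3.3, 6.1)    # 386/387-33 -> RapidCAD-33

print(f"SPECint increase: {specint:.0f}%")  # 14%
print(f"SPECfp increase:  {specfp:.0f}%")   # 85%
```
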

The diagnostic disk for the RapidCAD also gives some application performance data for the RapidCAD compared to the Intel 387DX:

Application             Time w/ RapidCAD   Time w/ 387DX   Speedup

AUTOCAD 11                   32 sec           52 sec         63%
AutoShade/Renderman         108 sec          180 sec         67%
Mathematica(Windows)        103 sec          139 sec         35%
SPSS/PC+ 4.01                14 sec           17 sec         21%

RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed through different channels than the other Intel math coprocessors. Therefore, I have been unable to obtain a data sheet for it. The RapidCAD-1 chip gets quite hot when operating, and it can be assumed that its power consumption is similar to that of the 486-33. Therefore, I recommend extra cooling for this chip (see the paragraph below on the 486 for details). The RapidCAD-1 is packaged in a 132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a 68-pin PGA like an 80387 coprocessor.

Intel 486DX is not a coprocessor. This chip, brought out in 1989, functionally combines the CPU (a heavily pipelined implementation of the 386 architecture) with an enhanced 387 (the floating-point unit, FPU) and 8 kB of unified code/data cache on one chip. Of course, this description is simplified; for a detailed hardware description, see [52]. The 486DX offers about two to three times the integer performance of a 386 at the same frequency. Floating point performance is about three to four times as high as on the Intel 387DX at the same clock rate [29]. Since the FPU is on the same chip as the CPU, the considerable communication overhead between CPU and coprocessor in a 386/387 system is eliminated, letting FPU instructions run at the full speed permitted by the implementation. The FPU also takes advantage of the on-chip cache and the highly pipelined execution unit. Besides the higher speed, the 486 FPU features more accurate transcendental functions than the Intel 387DX coprocessor according to my own tests (see below).
To achieve better interrupt latency, FPU instructions with a long execution time have been made abortable in case an interrupt occurs during their execution. The concurrent execution of CPU and coprocessor instructions typical for 80x86/80x87 systems still exists on the 486, but some FPU instructions like FSIN have nearly no concurrency with CPU instructions, indicating that they make heavy use of both CPU and FPU resources [53, 1].

The 486DX comes in a 168-pin ceramic PGA (pin grid array). It is available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz version produced in a CHMOS V process has also been available (the 25 MHz and 33 MHz versions are produced using the CHMOS IV process). Maximum power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500 mW for the 33 MHz version (3500 mW typical), and 5000 mW (4000 mW typical) for the 50 MHz chip.

Due to the considerable amount of heat produced by these chips, and taking into consideration the slow air flow provided by the fan in garden variety PC tower cases, I recommend an extra fan directly above the CPU for safer operation. If you measure the surface temperature of an i486 in a normal tower case without extra cooling after some time of operation, you may well come up with something like 80 - 90 degrees Celsius (that is 176 - 194 degrees Fahrenheit for those not familiar with metric units) [54,55]. You don't need the well known and expensive IceCap(tm) to cool your CPU effectively. A simple fan mounted directly above the CPU can bring the temperature down to about 50 to 60 degrees Celsius (122 - 140 degrees Fahrenheit), depending on the room temperature and the temperature within the PC case (which in turn depends on the total power dissipation of all the components and the cooling provided by the fan in the power unit).
According to a simple rule known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius slows down chemical reactions by a factor of two. Lowering the temperature of your CPU by 30 degrees should therefore prolong the life of the device by a factor of eight (2 x 2 x 2) due to the slower aging process. If you are reluctant to add a fan to your system because of the additional noise, settle for a low-noise fan like those available from the German manufacturer Papst (this is not meant to be an advertisement; I am just the happy owner of such a fan and have no other connection to the firm).

Intel 486DX2 is the name for Intel's latest generation of 486 CPUs. The DX2 suffix, used instead of simply DX, indicates that these are clock-doubled versions. A normal 486DX operates at the frequency provided by the incoming clock signal. A 486DX2 generates a new clock signal from the incoming clock by means of a PLL (phase locked loop). In the DX2, this clock signal has twice the frequency of the incoming clock, hence the name clock-doubler. All internal parts of the 486DX2 (cache, CPU core, FPU) run at this higher frequency. Only the bus interface runs at the normal speed. That way, a 486DX2-50 can run on a motherboard designed for 25 MHz operation. Since motherboards for 50 MHz operation are much harder to design than those for 25 MHz, this makes a 486DX2-50 system easier to build and cheaper than a 486DX-50 system. For all operations that don't access off-chip resources (e.g. register operations), a 486DX2-50 provides exactly the same performance as a 486DX-50 and twice the performance of a 486DX-25. However, since the main memory in a 486DX2-50 system still operates at 25 MHz, all instructions involving memory accesses are potentially slower than in a 486DX-50 system, whose memory also runs at 50 MHz.
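
The trade-off between the three chips just described can be illustrated with a deliberately crude Python model. It is my own simplification and not taken from any Intel document: it assumes that a program consists of a number of core clock cycles plus a number of external bus clock cycles that do not overlap, and it ignores the on-chip cache. The cycle counts in the example are made up.

```python
# Toy model (an assumption, not measured data): run time is core work at
# the internal clock plus bus work at the external clock, with no overlap.
CHIPS = {
    # name: (internal clock in MHz, external bus clock in MHz)
    "486DX-25": (25, 25),
    "486DX2-50": (50, 25),
    "486DX-50": (50, 50),
}

def run_time_us(core_cycles, bus_cycles, core_mhz, bus_mhz):
    """Run time in microseconds under the toy model."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz

def compare(core_cycles, bus_cycles):
    """Run time in microseconds for each chip in CHIPS."""
    return {
        name: run_time_us(core_cycles, bus_cycles, core, bus)
        for name, (core, bus) in CHIPS.items()
    }

# Pure register code: the DX2-50 matches the DX-50.
print(compare(core_cycles=1000, bus_cycles=0))
# Code with off-chip accesses: the DX2-50 falls between the other two.
print(compare(core_cycles=1000, bus_cycles=200))
```

How close a real 486DX2-50 gets to a 486DX-50 depends on how many accesses are satisfied by the internal cache, which this sketch does not model.
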

The internal cache of the 486 mitigates the effect of the slower memory somewhat, but the overall performance of a 486DX2-50 is still lower than that of a 486DX-50, although Intel's documentation [32] shows this drop to be quite small. It depends a lot on the code one runs, though. The nice thing about the 486DX2 is that it allows easy upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely pin-compatible with the 486DX. Just take out the 486DX and plug in the new 486DX2. Note that the power consumption of the 486DX2-50 equals that of the 486DX-50 (4000 mW typical), and that the 486DX2-66 exceeds this by about 30%. These chips get really hot in a standard PC case with no extra cooling. See the paragraph above for more detailed information on this problem.

Intel 487SX is the coprocessor intended for use in 486SX systems. The 486SX is basically a 486DX without the floating-point unit (FPU) [48, 50]. Originally, Intel sold 486DXs with a defective FPU as 486SXs, but it has now completely removed the FPU part from the 486SX mask for mass production. The introduction of the 486SX in 1991 has been viewed mainly as a marketing 'trick' by Intel to take market share from the 386 based systems once AMD became successful with their Am386 (AMD has taken as much as 40% of the 386 market due to some superior features such as higher clock frequency, lower power consumption, and a fully static design). A 486SX at 20 MHz delivers a bit less integer performance than a 40 MHz Am386.

To add floating-point capabilities to a 486SX based system, it would be easiest to swap the 486SX for a 486DX, which includes the FPU. However, Intel has prevented this easy solution by giving the 486SX a slightly different pinout [48, 51]. Since only three pins are assigned differently, clever board manufacturers have come out with boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU socket and provide a clean upgrade path this way.
A set of three jumpers ensures correct signal assignment to the pins for either configuration. To upgrade systems without this feature, one has to buy the 487SX and put it into the "Performance Upgrade Socket" present in most 486SX systems. Once the 487SX was available, it was quickly found out that it is just a normal 486DX with a slightly different pinout [49]. Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX system, so the 486SX could be removed once the 487SX is installed. Since the shutdown is logical, not electrical, the 486SX still uses power if used with the 487SX, although it is inoperative.

Technically speaking, the solution Intel chose was the only practical way to provide a 486SX system with the high level of floating-point performance the 486DX offers. The CPU and FPU have to be on the same chip, otherwise the FPU cannot make use of the cache on the CPU chip and there would be considerable overhead in CPU-FPU communication (similar to a 386/387 system), nullifying most of the arithmetic speedups over the 387. That the 486SX, 487SX, and 486DX are not pin-compatible seems to be purely for marketing reasons.

To upgrade a 486SX based system, Intel also offers the OverDrive chip, which is just the same as a 487SX with internal clock doubling. It also goes into the "Performance Upgrade Socket" found in 486SX systems. The OverDrive roughly doubles the performance of a 486SX/487SX based system. For an explanation of clock doubling, see the description of the 486DX2 above.

Like the 486SX, the 487SX is available in 20 MHz and 25 MHz versions. At 20 MHz, the 487SX has a power consumption of max. 4000 mW. It is available in a 169-pin ceramic PGA (pin grid array).

Weitek 3167 was introduced in 1989 to provide the fastest floating point performance possible on a 386 based system at that time. The Weitek Abacus 3167 is not a real coprocessor, strictly speaking, but rather a memory-mapped peripheral device.
The Weitek 3167 was optimized for speed wherever possible. Besides using the faster memory-mapped interface to the CPU (the 80x87 uses IO-ports), it does not support many of the features of the 80x87 coprocessors, allowing all of the chip's resources to be concentrated on the fast execution of the basic arithmetic operations. For a more detailed description of the Weitek 3167, see the first chapter of this document.

In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167 the Whetstone benchmark performed at 7574 kWhetstones/sec, compared with 3743 kWhetstones/sec for the Intel 387DX. Note, however, that these are single precision results and that the Weitek 3167's performance would drop to about half the stated rate for double precision, while the value for the Intel 387DX would not change much. Anyhow, before the advent of the Intel RapidCAD, the Weitek 3167 usually beat all 387 compatible coprocessors even for double precision operations [63,65,69]. For typical applications, the advantage of the Weitek 3167 over the 387 clones is much smaller. In a benchmark test using AutoDesk's 3D-Studio, the Weitek 3167 performed at 123% of the Intel 387DX's performance, compared with 106% for the Cyrix FasMath 83D87 and 118% for the Intel RapidCAD.

The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into the EMC socket provided by most 386 based systems. It does *not* fit into the normal coprocessor socket designed to hold a 387 compatible coprocessor in a 68-pin PGA. To get the best of both worlds, one might want to use a Weitek 3167 and a 387 compatible coprocessor in the same system. These coprocessors can coexist in the same system just fine. The only problem is that most 386 based systems contain only one coprocessor socket, usually of the EMC (extended math coprocessor) type. Thus, you can install either a 387 coprocessor or a Weitek 3167, but not both.
There are little daughter boards available, though, that fit into the EMC socket and provide two sockets, an EMC socket and a standard coprocessor socket. At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At 33 MHz, the max. power consumption is 2250 mW.

Weitek 4167 is a memory-mapped coprocessor that has the same architecture as the 3167 and is designed to provide 486 based systems with the highest floating point performance available. It executes coprocessor instructions at three to four times the speed of the Weitek 3167. Although it is up to 80% faster than the Intel 486 in some benchmarks [1,69], the performance advantage for real applications is more like 10%. The introduction of the 486DX2 processors has more or less obliterated the need for a Weitek 4167, since the DX2 CPUs provide the same performance and all the additional features the 80x87 has over the Weitek Abacus. The Weitek 4167 is packaged in a 142-pin PGA package that is only slightly smaller than the 486's package. At 25 MHz, it has a max. power consumption of 2500 mW [32].

Chips & Technologies has shipped samples of their 38700 and 38700SX coprocessors, which are compatible with the Intel 387DX and Intel 387SX coprocessors, respectively. Both have already been tested in [1]. However, C&T's German distributor (Rein Elektronik, Nettetal) states that these coprocessors will not become generally available before 4Q 1992. The samples tested in [1] showed about the same performance as the Cyrix 83D87.

Pricing

Due to recent price cuts by Cyrix and subsequently by Intel for 387 coprocessors, prices have dropped significantly for all 287 and 387 compatible coprocessors, with hardly any price difference between manufacturers. 387DX compatible coprocessors typically sell for ~US$ 100 for all speeds except for the 40 MHz versions, which are typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90 at all speeds except for the 33 MHz versions, which are ~US$ 100.
The Intel 287XL sells for ~US$ 100, while the IIT 2C87 and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive, the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD for US$ 320 and haven't seen it offered for a better price. I see the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33 being offered for US$ 1100. This price information reflects the price situation as of 08-14-92. Prices can be expected to drop slightly in the near future.

If you need high floating-point performance, you should consider buying a 486 based system rather than a 386 based system with an additional coprocessor. A 386 motherboard for 33 MHz operation sells for ~US$ 300 and, together with the coprocessor, costs a total of ~US$ 400. A 486-33 ISA-board sells for US$ 650. While the 486-33 system is about 60% more expensive than the 386/387 system, it also provides 100% more integer and floating-point performance (twice the performance).

If you want to push your 386 based system to maximum floating-point performance and can't switch to a 486 based system for some reason, I recommend the Intel RapidCAD. It is both faster [1] and cheaper than installing a Weitek Abacus 3167 with your 386, which used to be the highest performing combination before the RapidCAD came out. Similarly, the introduction of the 486DX2 clock-doubler chips has obliterated the need for a Weitek 4167 to get maximum floating-point performance out of a 486 based system. A 486DX2-66 performs at or above the performance level of a 33 MHz Weitek 4167, even if the latter uses single precision rather than double precision. The 486DX2-66 is rated by Intel at 24700 double precision kWhetstones/sec and 3.1 double precision Linpack MFLOPS. Of course, these benchmarks used the highest performance compilers available.
But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description of the benchmarks mentioned, see the paragraph on benchmarks below). Although I haven't yet seen 486DX2-66 processors offered to end users for upgrade purposes, I recommend the 486DX2-66 to those who need the highest floating-point performance and are planning on buying a new PC. The price difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is around US$ 600, well below the price of the Weitek Abacus 4167.

Operation

In an 80x86/80x87 system, CPU instructions and coprocessor instructions are executed concurrently. This means that the CPU can execute CPU instructions while the coprocessor executes a coprocessor instruction at the same time. The concurrency is restricted somewhat by the fact that the CPU has to aid the coprocessor in certain operations. As the CPU and the coprocessor are fed from the same instruction stream and both may operate on the same data, there has to be a synchronizing mechanism between the CPU and the coprocessor.

In an 8086/8087 or 8088/8087 system, both chips look at the opcodes coming in from the bus. To do this, both chips have the same BIU (bus interface unit), and the 8086 BIU sends the status signals of its prefetch queue to the 8087 BIU. This assures that both processors always decode the same instructions in parallel. Since all coprocessor instructions start with the bit pattern 11011, it is easy for the 8087 to ignore all other instructions. Likewise, the CPU ignores all coprocessor instructions unless they access memory. In this case, the CPU computes the address of the LSB (least significant byte) of the memory operand and does a dummy read. The 8087 then takes the data from the data bus.
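
The opcode test described above is simple enough to sketch in a few lines of Python. The function name is my own; the sketch only shows which first opcode bytes carry the 11011 bit pattern and says nothing about how the 8087 implements the check in hardware.

```python
def is_esc_opcode(byte):
    """True if the top five bits of the opcode byte are 11011."""
    return (byte >> 3) == 0b11011

# Enumerate all first opcode bytes that the coprocessor would pick up.
esc_bytes = [b for b in range(256) if is_esc_opcode(b)]
print([hex(b) for b in esc_bytes])
# ['0xd8', '0xd9', '0xda', '0xdb', '0xdc', '0xdd', '0xde', '0xdf']
```

The remaining three bits of the first byte, together with the byte that follows, select the particular coprocessor instruction.
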

If more than one memory access is needed to load a memory operand, the 8087 requests the bus from the CPU, generates the consecutive addresses of the operand's bytes, and fetches them from the data bus. After completing the operation, the 8087 hands bus control back to the CPU. Since the 8087 and the CPU are hooked up to the same synchronous bus, they have to run at the same speed. This means that with the 8087, only synchronous operation of CPU and coprocessor is possible.

Another 8087 coprocessor instruction can only be started once the previous one has been completed in the NEU (numerical execution unit) of the 8087. To prevent the 8086 from decoding a new coprocessor instruction while the 8087 is still executing the previous coprocessor instruction, the following mechanism is used: The compilers and assemblers automatically generate a WAIT instruction before each coprocessor instruction. The WAIT instruction tests the CPU's /TEST pin until its input becomes "LOW". In 8086/8087 systems, the 8086 /TEST pin is connected to the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it forces its BUSY pin "HIGH". Thus the WAIT instruction in front of every coprocessor instruction stops the CPU until a still executing previous coprocessor instruction has finished. The same synchronization is used before the CPU accesses data that was written by the coprocessor. A WAIT instruction after the coprocessor instruction that writes to memory causes the CPU to stop until the coprocessor has transferred the data to memory, after which the CPU can safely access the data.

With the help of an additional chip, the 8087 can also be interfaced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g. from Philips, Siemens) in the 1982/1983 time frame, but with the introduction of the IBM AT, which used the 80286, it lost all significance for the PC market.
The 80C186 (CMOS version of the 80186) nowadays sells as an embedded controller and can be combined with an 80C187 coprocessor, which is based on the internals of the Intel 387 [37].

The 80287 CPU interface is totally different from the solution used in the 8087. Since the 80286 implements memory protection via an MMU based on segmentation, it would have been much too expensive to duplicate the whole protection logic on the coprocessor for an interface solution similar to the 8087. In an 80286/80287 system, the CPU fetches and stores all opcodes and operands for the coprocessor. Information is passed through ports F8h - FFh. As these ports are accessible under program control, care must be taken not to accidentally write to them, as this could corrupt the information in the math coprocessor.

The execution unit of the 80287 is practically identical to that of the 8087, that is, nearly all coprocessor instructions execute in the same number of clock cycles on both coprocessors. Due to the additional overhead of the CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz 80286/80287 combination can be slower than an 8086/8087 system running at the same speed for floating point intensive programs. Additionally, most of the older 286 boards were configured to run the coprocessor at 2/3 the speed of the CPU, making use of the ability of the 80287 to run asynchronously with the CPU. The 80287 has a CKM (ClocK Mode) pin that causes the incoming system clock to be divided by three for the coprocessor if it is tied to ground. The 80286 always divides the system clock by two internally, hence the ratio of 2/3. However, when the CKM pin is tied high on the 80287, it does not divide the CLK input. This feature has been exploited by the makers of coprocessor speed sockets. These sockets tie CKM high and supply their own CLK signal with a built-in oscillator, thereby allowing the 80287 or a compatible chip to run at a much higher speed than the CPU.
With an IIT or Cyrix 287 one can have a 20 MHz coprocessor running with an 8 MHz 80286. Note however that the floating-point performance in such a configuration does not scale linearly with the coprocessor clock, since all the data has to be passed through the much slower CPU. If the coprocessor executes mostly simple instructions such as addition and multiplication, doubling the coprocessor clock in a 10 MHz system to 20 MHz does not show any performance increase at all [24].

The 80C287 by AMD is a 100% clone of the original Intel 80287, but is produced in CMOS rather than in NMOS like the original Intel chip. This makes for lower power consumption. The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of a 387 coprocessor, but are pin-compatible to the original 287. However, these chips divide the system clock by two internally, as opposed to three in the original Intel 80287. Since the 80286 also divides the system clock by two, they usually run synchronously with the CPU. They can also run asynchronously, though.

The 8086/8087 combination can be characterized as a cooperation of partners with equal rights, while the 80286/287 is more a master-slave relationship. This makes synchronization much easier, since the complete instruction and data flow of the coprocessor goes thru the CPU. Before executing most coprocessor instructions, the 80286 tests its /BUSY pin, which is hooked up to the 287 coprocessor and signals if the 80287 is still executing a previous coprocessor instruction or has encountered an exception. The 80286 then waits until the 80287 is not busy before loading the coprocessor instruction into the coprocessor. Therefore, a WAIT instruction before every coprocessor instruction is not required. These WAITs are permissible, but not necessary in 80287 programs. The second form of WAIT synchronization, after the coprocessor has written a memory operand, is still necessary on 286/287 systems.
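The remark above that a faster coprocessor clock need not help when all data passes thru the CPU can be illustrated with a deliberately crude model. The sketch below assumes that interface transfer and coprocessor execution overlap completely, so that the slower of the two sets the time per instruction; it takes the ~40 cycle interface overhead quoted for the 286/287 as being measured in CPU clocks. The cycle counts of 20 and 200 are hypothetical values chosen only to show the two regimes, not measured timings of any chip.

```python
def time_per_instruction_us(exec_cycles, fpu_mhz, cpu_mhz, interface_cycles=40):
    """Crude bottleneck model (an assumption, not a measured behaviour):
    the slower of CPU-side interface time and coprocessor execution time
    determines how fast coprocessor instructions can be issued."""
    interface_us = interface_cycles / cpu_mhz   # spent on the CPU side
    exec_us = exec_cycles / fpu_mhz             # spent inside the coprocessor
    return max(interface_us, exec_us)

CPU_MHZ = 10
for label, cycles in (("short op (hypothetical 20 cycles)", 20),
                      ("long op (hypothetical 200 cycles)", 200)):
    slow = time_per_instruction_us(cycles, 10, CPU_MHZ)
    fast = time_per_instruction_us(cycles, 20, CPU_MHZ)
    print(f"{label}: {slow:.1f} us at 10 MHz, {fast:.1f} us at 20 MHz")
```

In this model the short operation is limited by the interface at both coprocessor clocks, while the long operation benefits fully from the doubled clock. Real systems fall somewhere in between, depending on the instruction mix.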
The coprocessor interface in 80386/80387 systems is very similar to the one found in 286/287 systems. However, to prevent corruption of the coprocessor's contents by programming errors, the IO-ports 800000F8h - 800000FFh are used, which are not user accessible. The interface has been optimized and uses 32-bit transfers. The overhead of the interface has been reduced to about 16-20 clock cycles. For some operations on the 387 'clones' that take less than 16 clock cycles to complete, this effectively limits the execution rate of coprocessor instructions. The only sensible solution to provide even higher floating point performance was to integrate the CPU and coprocessor functionality onto the same chip. This is what Intel did with the 80486. The FPU in the 486 also benefits from the instruction pipelining and from the integrated cache.

Performance

Several computer magazines have published performance comparisons at the application level for the 387 coprocessors and Weitek's Abacus 3167 and 4167 chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests, performance improvements for the 387 clones over Intel's 387DX were small to marginal, the clones running the applications no more than 5% to 15% faster than the Intel 387DX. In the test of 3D-Studio, one of the few programs that supports the Weitek Abacus, the Weitek 3167 improved performance by 23% over an Intel 387DX and the 4167 improved performance by 10% over the 486 [1].

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX coprocessor has a demonstration program that shows the speedup of certain application programs when run with the Intel coprocessor vs. a system with no coprocessor.
  Application         Time w/o 387   Time w/ 387   Speedup

  Art&Letters           87.0 sec       34.8 sec     150%
  Quattro Pro            8.0 sec        4.0 sec     100%
  Wingz                 17.9 sec        9.1 sec      97%
  Mathematica          420.2 sec      337.0 sec      25%

The following table is an excerpt from [70]:

  Application         Time w/o 387   Time w/ 387   Speedup

  Corel Draw           471.0 sec      416.0 sec      13%
  Freedom Of Press     163.0 sec       77.0 sec     112%
  Lotus 1-2-3          257.0 sec       43.0 sec     597%

The following table is an excerpt from [25]:

  Application         Time w/o 387   Time w/ 387   Speedup

  Design CAD, Test1     98.1 sec       50.0 sec      96%
  Design CAD, Test2     75.3 sec       35.0 sec     115%
  Excel, Test 1          9.2 sec        6.8 sec      35%
  Excel, Test 2         12.6 sec        9.3 sec      35%

The performance statistics below were put together with the help of four widely known numeric benchmarks and two benchmarks developed by me. Three Pascal programs, one FORTRAN program, and two assembly language programs were used. The assembly language programs were linked with Turbo Pascal 6.0 for library support, especially to include the coprocessor emulator of the TP 6.0 run-time library. The Pascal programs were compiled with Turbo Pascal 6.0 from Borland International, a non-optimizing compiler that produces 16-bit code. The FORTRAN program was compiled using MS FORTRAN 5.0, an optimizing compiler that generates 16-bit code. All programs except PEAKFLOP and SAVAGE, which use double extended precision, use double precision variables.

Note that with a highly optimizing compiler producing 32-bit code, you will see much higher performance for some benchmarks. For example, Intel rates the 33 MHz 386/387DX at 3290 KWhetstones/sec and 0.4 double precision LINPACK MFLOPS [28,29]. The 33 MHz Intel 486 is rated by Intel at 12300 KWhetstones/sec and 1.6 double precision LINPACK MFLOPS [30]. The compilers used in these benchmarks run by the chip vendor are the ones that give the highest performance available. These compilers are in the US$ 1000+ price range. Some of them may be experimental or prereleased versions not available to the general public.
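The "Speedup" column of the Intel utilities disk table above is consistent with the usual definition of speedup as the ratio of the two run times minus one, expressed in percent. The short Python sketch below recomputes that column from the quoted times; the function name is an illustrative choice, and the formula is inferred from the numbers rather than stated by the source.

```python
def speedup_percent(time_without, time_with):
    """Speedup in percent: how much faster the run with coprocessor is.
    Inferred definition: (t_without / t_with - 1) * 100."""
    return (time_without / time_with - 1.0) * 100.0

# Times in seconds, copied from the Intel utilities disk table.
intel_demo = [
    ("Art&Letters", 87.0, 34.8),
    ("Quattro Pro", 8.0, 4.0),
    ("Wingz", 17.9, 9.1),
    ("Mathematica", 420.2, 337.0),
]

for name, t_without, t_with in intel_demo:
    print(f"{name:12s} {speedup_percent(t_without, t_with):5.0f}%")
```

The same function can be applied to the rows excerpted from [70] and [25] to cross-check their percentages against the quoted times.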
The relative performance of one coprocessor to another could vary depending on the code generated by compilers. Non-optimizing compilers tend to generate a high percentage of operations which access variables in memory, while optimizing compilers produce code that contains many operations involving registers. Thus it is quite possible that coprocessor A beats coprocessor B running benchmark Z if compiled with compiler C, but B beats A when the same benchmark is compiled using compiler D.

All benchmarks in this overview were run from floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This ensured that no TSR or other program unnecessarily stole computing resources from the benchmarks.

Coprocessor performance also depends on the motherboard, or more specifically the chip set used on the motherboard. In [34] and [35], identically configured motherboards using different 386 chip sets were tested. Among other tests, a coprocessor benchmark based on a fractal computation was run and its execution time recorded. The following tables, which show how coprocessor performance varies with the chip set, have been copied from these articles in abridged form.

                     Cyrix                                   Cyrix
  chip set           387+                chip set            83D87

  Opti,  40 MHz      24.57 sec   97.0%   PC-Chips, 33 MHz    26.97 sec   93.0%
  Elite, 40 MHz      24.46 sec   97.4%   UMC,      33 MHz    27.69 sec   90.5%
  ACT,   40 MHz      23.84 sec  100.0%   Headland, 33 MHz    25.08 sec  100.0%
  Forex, 40 MHz      23.84 sec  100.0%   Eteq,     33 MHz    27.38 sec   91.6%

This shows that performance of the same coprocessor can vary by up to ~10% depending on the chip set used on your board, at least for 386 motherboards (similar numbers for 286, 386sx, and 486 are unfortunately not available). The benchmarks for this article were run on a board with the Forex chip set, which is one of the fastest 386 chip sets there is, not only with respect to floating-point performance [35].

Description of benchmarks

PEAKFLOP is the kernel of a fractal computation.
It consists mainly of a tight loop written in assembly code and fine tuned to give maximum performance. All variables are held in the CPU's and coprocessor's registers, so the only memory access is for opcode fetches. The main loop contains three multiplications and five additions/subtractions. This ratio is fairly typical for other floating point intensive programs as well. The whole program fits nicely into even a very small CPU cache. Due to the nature of this program, its MFLOPS rate is unlikely to be exceeded by any program that calculates anything useful. Thus the name PEAKFLOP. You will find the source code for PEAKFLOP in appendix B.

TRNSFORM multiplies an array of 8191 vectors by a 3D-transformation matrix (a 4x4 matrix). Each vector consists of four double precision values. Multiplying vectors by a matrix is a typical operation in the manipulation (e.g. rotation) of 3D objects which are made up of many vectors describing the object. This benchmark stresses addition and multiplication as well as memory access. For each vector, 16 multiplications and 12 additions are used. About 256 kByte of data is accessed during the benchmark. TRNSFORM is implemented as an optimized assembler program linked with the Turbo Pascal 6.0 library. For the IIT 3C87, a special version was written that makes use of the F4X4 instruction available on that coprocessor. F4X4 does a full multiplication of a 4x4 matrix by a 4x1 vector in a single instruction. The full source code for the TRNSFORM program is in appendix B.

LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from real floating point intensive programs. Some of these loops are vectorizable, but since we don't deal with vector processors here, this doesn't matter. For this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal 6.0.
By variable overlaying (similar to FORTRAN's EQUIVALENCE statement), memory allocation for data was reduced to 64 kB, so all data fits into a single 64 kB segment. The older version of LLL is used here, which contains 14 loops. There also exists a newer, more elaborate version consisting of 24 kernels. The kernels in LLL exercise only multiplication and addition. The MFLOPS rate reported is the average of the MFLOPS rates of all 14 kernels as reported by the LLL program.

LLL and Whetstone results (see below) are reported as returned by my COMPTEST test program, in which they have been included as a measure of coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal 6.0 with all 'optimizations' on and using my own run-time library, which gives higher performance than the one included with TP 6.0. My library is available as TPL60N15.ZIP from garbo.uwasa.fi and ftp-sites that mirror this site.

Linpack [5] is a well known floating-point benchmark that also heavily exercises the memory system. Linpack operates on large matrices and takes up about 570 kB in the version used for this test. This is about the largest program size a pure DOS system can accommodate. Linpack was originally designed to estimate performance of BLAS, a library of FORTRAN subroutines that handles various vector and matrix operations. It uses two routines from BLAS which are thought to be typical of the matrix operations used by BLAS. Both routines only use addition/subtraction and multiplication. The FORTRAN source code for Linpack can be obtained from the automated mail server netlib@ornl.gov. Linpack was compiled using MS Fortran 5.0 in the HUGE memory model (which can handle data structures larger than 64 kB) and with compiler switches set for maximum optimization. Linpack repeatedly does the same test. The number reported is the maximum MFLOPS rate returned by Linpack. Linpack MFLOPS ratings for a great number of machines are contained in [6].
This PostScript document is also available from netlib@ornl.gov.

Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected about the use of certain control and data structures in programs written in high level languages. Based on these statistics, Whetstone tries to mirror a 'typical' HLL program. Whetstone performance is expressed by how many theoretical 'whetstone' instructions are executed per second. It was originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack, Whetstone not only uses addition and multiplication but exercises all basic arithmetic operations as well as some transcendental functions. Whetstone performance depends on the speed of the coprocessor as well as on the speed of the CPU, while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU. There exist an old and a new version of Whetstone. Note that results from the two versions can differ by as much as 20% for the same test configuration. For this test, the new version in Pascal from [3] was used. It was compiled with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations' on.

SAVAGE tests the performance of transcendental function evaluation. It is basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt functions are combined in a single expression. While sin, cos, arctan, and sqrt can be evaluated directly with a single 387 coprocessor instruction each, ln and exp need additional preprocessing for argument reduction and result conversion. According to [14], the Savage benchmark was devised by Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make 250,000 passes through the loop. Here only 10,000 loops are executed for a total of 60,000 transcendental function evaluations. The result is expressed in function evaluations per second.
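To make the structure of such a loop concrete, here is a minimal Python sketch. It uses the commonly quoted form of the Savage expression, a := tan(arctan(exp(ln(sqrt(a*a))))) + 1, with tan written as sin/cos so that six functions are evaluated per pass. That this is the exact expression used in the benchmark source is an assumption, since the code itself is not reproduced in this article, and the function names below are illustrative. Timings obtained from an interpreted language are of course not comparable to the results reported in this article.

```python
import math
import time

def savage(passes=10_000):
    """Run the assumed Savage loop; mathematically each pass adds 1 to a."""
    a = 1.0
    for _ in range(passes):
        # ln, exp, sqrt and arctan, followed by tan expressed as sin/cos
        x = math.atan(math.exp(math.log(math.sqrt(a * a))))
        a = math.sin(x) / math.cos(x) + 1.0
    return a

def evaluations_per_second(passes=10_000):
    """Return (function evaluations per second, final value of a)."""
    start = time.perf_counter()
    a = savage(passes)
    elapsed = time.perf_counter() - start
    return 6 * passes / elapsed, a   # six function evaluations per pass

rate, a = evaluations_per_second()
print(f"{rate:.0f} function evaluations/sec, final a = {a:.9f}")
```

Because every pass should add exactly one, the deviation of the final value from passes + 1 also gives a rough indication of the accumulated rounding error of the transcendental functions.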
SAVAGE source code was taken from [7] and compiled with Turbo Pascal 6.0 and my own run-time library (see above).

Benchmark results for 387 coprocessors, coprocessor emulators and the Intel RapidCAD and Intel 486 CPUs.

40 MHz             PEAKFLOP  TRNSFORM       LLL   Linpack  Whetstone    Savage
                     MFLOPS    MFLOPS    MFLOPS    MFLOPS  kWhet/sec  Func/sec

386, EM87            0.0084    0.0080    0.0060    0.0060         31       502  ##
386, Franke387       0.0369    0.0295    0.0233    0.0215        164      4002  $$
386, TP 6 Emu        0.0316    0.0273    0.0200    0.0190        160      3794  %%
Intel 387DX          0.9204    0.7212    0.3932    0.3211       2428     52677
ULSI 83C87           1.2093    0.7936    0.3890    0.3120       2528     56926
IIT 3C87             1.0196    0.7145    0.3834    0.3179       2663     58766
IIT 3C87,4x4         1.0196    1.7244    0.3834    0.3179       2663     58766  ??
Cyrix 387+           1.1305    0.8162    0.3945    0.3208       2946     80322
Intel RapidCAD       2.2128    1.8931    0.7377    0.5432       4810     86957
Intel 486            2.4762    2.1335    1.1110    0.8204       6195     98522

33.3 MHz           PEAKFLOP  TRNSFORM       LLL   Linpack  Whetstone    Savage
                     MFLOPS    MFLOPS    MFLOPS    MFLOPS  kWhet/sec  Func/sec

386, EM87            0.0070    0.0040    0.0050    0.0050         26       418  ##
386, Franke387       0.0307    0.0246    0.0194    0.0179        137      3335  $$
386, TP 6 Emu        0.0263    0.0227    0.0167    0.0158        133      3160  %%
Intel 387DX          0.7647    0.6004    0.3283    0.2676       2046     43860
ULSI 83C87           1.0097    0.6609    0.3239    0.2598       2089     47431
IIT 3C87             0.8455    0.5957    0.3198    0.2646       2203     49020
IIT 3C87,4x4         0.8455    1.4334    0.3198    0.2646       2203     49020  ??
Cyrix 387+           0.9286    0.6806    0.3293    0.2669       2435     66890
Cyrix 83D87          1.013        N/A    0.333     0.273        2550       N/A
Intel RapidCAD       1.8572    1.5798    0.6072    0.4533       3953     72464
Intel 486            2.0800    1.7779    0.9387    0.6682       5143     82192

For comparison:

                   PEAKFLOP  TRNSFORM       LLL   Linpack  Whetstone    Savage
                     MFLOPS    MFLOPS    MFLOPS    MFLOPS  kWhet/sec  Func/sec

i486DX2-66           4.1601    3.4227    1.6531    1.3010      10655    163934
i486DX2-50           3.0589    2.6665    1.2537    0.9744       7962    123203
i387, 20 MHz         0.2253    0.3271    0.1434    0.1171        952     21739  ++
i387DX, 20 MHz       0.3567    0.4444    0.1484    0.1161       1034     24155  &&
i80287, 5 MHz        0.0281    0.0310    0.0242    0.0222        150      3261  !!
i8087, 9.54 MHz      0.0636    0.0705    0.0321    0.0219        234      5782  **