- Path: sparky!uunet!kithrup!hoptoad!decwrl!elroy.jpl.nasa.gov!usc!sol.ctr.columbia.edu!ira.uka.de!uka!uka!news
- From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
- Newsgroups: comp.sys.intel
- Subject: What you always wanted to know about math coprocessors for 80x86 2/4
- Date: 19 Aug 1992 15:01:07 GMT
- Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
- Lines: 930
- Distribution: world
- Message-ID: <16tnnjINNcqt@iraul1.ira.uka.de>
- NNTP-Posting-Host: irav1.ira.uka.de
- X-News-Reader: VMS NEWS 1.23
-
- Cyrix EMC87 is basically a special version of the Cyrix 83D87.
- In addition to the normal 387 operating mode, in
- which coprocessor-CPU communication is handled thru
- reserved IO-ports, it also offers a memory-mapped
- mode of operation similar to the operation principle
- of the Weitek Abacus. Please note that the EMC87 is
- *not* compatible with Weitek's Abacus coprocessor.
- They both use the same interface technique (memory
- mapping) but while the EMC87 uses the standard 387
- instruction set, the Weitek coprocessors use a
- different instruction set of their own. Like the
- Weitek Abacus, the EMC87 occupies a 64 kByte memory
- block starting at physical address C0000000h. It can
- therefore only be accessed in the protected or virtual
- modes of the 386 CPU. DOS programs can access the
- EMC87 with the help of DOS-extenders or memory
- managers like EMM386 which run in protected/virtual
- mode themselves. Since the EMC87 also provides the
- standard CPU interface via IO-ports, it can be used
- just like any other 387 compatible coprocessor and
- delivers the same performance as the Cyrix 83D87 in
- this mode. However, using the memory mapped mode of
- the EMC87 provides a significant speed advantage.
- The traditional 387 CPU-coprocessor interface via
- IO-ports has an overhead of about 16-20 clock cycles.
- Since the Cyrix 83D87 executes some operations like
- addition and multiplication in much less time, its
- performance is limited by the CPU-coprocessor
- interface. The memory-mapped mode has much less
- overhead and allows all coprocessor instructions to
- be executed at full speed and with no penalty. For
- this reason, Cyrix introduced the EMC87 in 1990.
- In a test, the EMC87 at 33 MHz ran the single
- precision Whetstone benchmark at 7608 kWhetstones/sec,
- while the Cyrix 83D87 at 33 MHz had a speed of
- only 5049 kWhetstones/sec, an increase of 50.6% [63].
- In another test, the EMC87 ran a fractal computation
- at two times the speed of the Cyrix 83D87 and 2.6
- times as fast as an Intel 387DX [64]. A third test
- found the EMC87's overall performance to be 20%
- higher than the performance of the Cyrix 83D87
- [65]. The Cyrix FasMath EMC87 has also been sold
- as Cyrix AutoMATH by Cyrix. The two chips are 100%
- identical. Unlike the Cyrix 83D87, which fits into
- the 68-pin 387 coprocessor socket, the EMC87 comes
- in a 121-pin PGA and requires the 121-pin EMC
- (Extended Math Coprocessor) socket. Note that not
- all boards have such a socket, a notable exception
- being IBM's PS/2s, for example. Originally, Cyrix
- claimed support for the fast memory mapped mode of
- the EMC87 from a lot of software vendors (including
- Borland and Microsoft). However, there are only
- very few applications that make use of it, among
- them Evolution Computing's FastCAD 3D, MicroWay
- Inc.'s NDP FORTRAN-386 compiler and Intusoft's
- Spice [63]. I haven't seen the EMC87 being offered
- for about nine months now. It may be that Cyrix
- has discontinued this product due to lack of
- sufficient software support. The EMC87 was available
- in 25 and 33 MHz versions at the end of 1991.
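-
- As a rough illustration of why the memory-mapped mode is out of
- reach for plain real-mode DOS programs, the sketch below compares
- the 64 kByte window at C0000000h with the highest physical address
- a 16-bit segment:offset pair can form. The function and constant
- names are made up for this example.
-

```python
# Sketch: the 64 kByte window the text gives for the EMC87 (and Weitek Abacus).
EMC87_BASE = 0xC0000000          # physical base address from the text
EMC87_SIZE = 64 * 1024           # 64 kByte block

# Highest physical address a real-mode segment:offset pair can form:
# segment FFFFh shifted left by 4, plus offset FFFFh.
REAL_MODE_MAX = (0xFFFF << 4) + 0xFFFF   # 0x10FFEF


def in_emc87_window(phys_addr: int) -> bool:
    """True if phys_addr falls inside the memory-mapped coprocessor block."""
    return EMC87_BASE <= phys_addr < EMC87_BASE + EMC87_SIZE


def reachable_in_real_mode(phys_addr: int) -> bool:
    """True if a 16-bit segment:offset pair can address phys_addr."""
    return 0 <= phys_addr <= REAL_MODE_MAX


if __name__ == "__main__":
    print(hex(REAL_MODE_MAX), reachable_in_real_mode(EMC87_BASE))
```

- The window starts far above the real-mode limit, which is why a
- DOS extender or a memory manager running in protected/virtual
- mode is needed to map it.
-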
- Cyrix 387+ seems to be the successor to the Cyrix 83D87. On
- ordering a Cyrix coprocessor about a month ago,
- I was automatically supplied with a 387+. In my
- tests, I found the Cyrix 387+ to be about 5 to
- 10 percent *slower* than the Cyrix 83D87 overall.
- Some instructions, however, such as the square root
- (FSQRT), now run at only half the speed at which they ran in the
- 83D87 (see performance results below). I also found
- the transcendental functions on the 387+ to be a bit
- more accurate than those implemented in the 83D87.
- Why Cyrix has brought out a new coprocessor slower
- than the 83D87 I don't know. I have written to Cyrix
- about this question but haven't received a reply yet.
- Maybe the new coprocessor solves the one small
- hardware compatibility problem the 83D87 had (see
- above paragraph on the 83D87). It could also be that
- Cyrix had to design around the three Intel patents
- Intel claims the 83D87 has violated. I have no idea
- whether the Cyrix 387+ is to replace the 83D87 or
- if both chips will coexist in the market. Like the
- 83D87, the 387+ is available for speeds of up to
- 40 MHz.
- Cyrix 83S87 is the SX version of the Cyrix 83D87. Just as the
- Cyrix 83D87 is the fastest 387 compatible coprocessor,
- the Cyrix 83S87 is the fastest of the 387SX compatible
- coprocessors [1]. Besides being the fastest 387SX
- 'clone', the Cyrix 83S87 also features the most
- accurate transcendental functions. The 83S87 is
- packaged in a 68-pin PLCC and is available in 16,
- 20 and 25 MHz versions. Due to the advanced power
- saving features of the Cyrix coprocessor, the typical
- power consumption of the 20 MHz version is about
- 350 mW [67].
- ULSI 83C87 is a 387 'clone' that came out in early 1991, well
- after the IIT 3C87 and Cyrix 83D87. Like all clones,
- it is somewhat faster than the Intel 387DX. Especially
- the basic arithmetic functions are fast, while the
- transcendental functions show only a slight speed
- improvement over the Intel 387DX (see benchmark
- results below). In my tests, the ULSI had the most
- inaccurate transcendental functions. However, the
- maximum relative error is still within the limits
- set by Intel, so this is probably not an important
- issue in all but very few applications. The ULSI
- shows some minor flaws in the tests for IEEE-754
- compatibility, but this, too, is unimportant under
- typical operating conditions. ULSI claims that the
- program IEEETEST, which was used to test for IEEE
- compatibility, contains many personal interpretations
- of the IEEE standard by the program's author and
- states that there is no ANSI-certified IEEE-754
- compliance test. While this is most probably true,
- it is also a fact that the IEEE test vectors used in
- IEEETEST are sort of an industry standard and that
- Intel's 387, 486, and RapidCAD chips pass it
- without a single failure. Since the ULSI Math*Co
- 83C87 fails some of the tests, it is certainly less
- than 100% compatible with Intel's chips, although
- this will hardly make any difference in typical
- operating conditions. The ULSI 83C87 is also not fully
- compatible with the Intel 387DX in that it does
- not implement the precision control feature of
- Intel's coprocessor [58]. While all the internal
- operations of 80x87 coprocessors are usually done
- with the maximum precision available (double extended
- precision with 64 mantissa bits), the 80x87 also
- offers the possibility to force lower precision to
- be used for the basic arithmetic functions add,
- subtract, multiply, divide, and square root. This
- feature was included for compatibility with existing
- floating-point implementations at the time the 8087
- was devised. All coprocessors except the ones from
- ULSI support this feature. Since precision control
- is rarely used, this incompatibility with the Intel
- 387DX does not pose major problems. IEEE-754 mentions
- precision control, but requires it only for those
- systems that don't have the possibility to store
- single and double precision results. Therefore, the
- standard does not call for precision control in the
- 387 coprocessor, so the ULSI 83C87's failure to
- provide precision control does not constitute a
- conflict with the IEEE-754 standard for floating
- point arithmetic. Like the other 387 'clones', the
- 83C87 does not support asynchronous operation of the
- CPU and the coprocessor. This means that the 83C87
- always runs at the full speed of the CPU. The ULSI
- 83C87 is available in 20, 25, 33, and 40 MHz versions.
- The ULSI is produced in high performance, low power
- CMOS. Power consumption at 20 MHz is max. 800 mW
- (400 mW typical), at 25 MHz it is max. 1000 mW
- (500 mW typical), at 33 MHz it is max. 1250 mW
- (625 mW typical), and at 40 MHz the ULSI Math*Co 83C87
- consumes max. 1500 mW (750 mW typical) [58]. The
- 83C87 is packaged in a 68-pin ceramic PGA. ULSI
- coprocessors come with a lifetime warranty. ULSI
- Systems, Inc. will replace the coprocessor up to
- three times free of charge should it ever fail.
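-
- To make the precision control feature discussed in this entry
- more concrete, here is a minimal sketch that decodes the precision
- control (PC) field of an 80x87 control word. It assumes the usual
- 80x87 layout with PC in bits 8 and 9 (00 = 24 mantissa bits,
- 10 = 53 bits, 11 = 64 bits, 01 reserved); the helper names are
- hypothetical.
-

```python
# Sketch: decoding/setting the precision control field of an x87 control word.
PC_SHIFT = 8                      # PC occupies bits 8 and 9
PC_MASK = 0b11
PC_TO_BITS = {0b00: 24, 0b10: 53, 0b11: 64}   # 0b01 is reserved
BITS_TO_PC = {bits: pc for pc, bits in PC_TO_BITS.items()}


def precision_bits(control_word: int):
    """Return the mantissa width selected by the control word,
    or None if the reserved encoding is used."""
    pc = (control_word >> PC_SHIFT) & PC_MASK
    return PC_TO_BITS.get(pc)


def with_precision(control_word: int, bits: int) -> int:
    """Return a copy of control_word with the PC field set for `bits`."""
    pc = BITS_TO_PC[bits]         # KeyError for unsupported widths
    cleared = control_word & ~(PC_MASK << PC_SHIFT)
    return cleared | (pc << PC_SHIFT)


if __name__ == "__main__":
    default_cw = 0x037F           # control word after FINIT
    print(precision_bits(default_cw))
    print(hex(with_precision(default_cw, 53)))
```

- On a coprocessor without precision control, changing this field
- would simply have no effect on the width of the results.
-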
- ULSI 83S87 is the SX version of the ULSI 83C87 for operation
- with an Intel 386SX or an AMD Am386SX. It is
- functionally equivalent to the 83C87. To aid low
- power laptop designs, the ULSI 83S87 features an
- advanced power saving design with a sleep mode and
- a standby mode with only minimal power requirements.
- Power consumption under normal operating conditions
- (dynamic mode) is max. 400 mW at 16 MHz (300 mW
- typical), max. 450 mW at 20 MHz (350 mW typical),
- and max. 500 mW at 25 MHz (400 mW typical) [58].
- The ULSI 83S87 is packaged in a 68-pin PLCC.
- Intel RapidCAD is not a coprocessor, strictly speaking, although it
- is marketed as one. Rather, it is a CPU replacement.
- It is basically an Intel 486DX without the cache and
- with a 386 pinout. RapidCAD is delivered as a set of
- two chips. RapidCAD-1 goes into the 386 socket and
- contains the CPU and FPU, RapidCAD-2 goes into the
- coprocessor socket and contains a PAL that generates
- the Ferr signal that is normally generated by a
- coprocessor and used by the motherboard circuitry to
- provide 287 compatible coprocessor exception handling
- in 386/387 systems. The RapidCAD instruction set is
- compatible with the 386, so it doesn't know the 486
- specific instructions like BSWAP. Since the RapidCAD
- CPU core is very similar to the 486 CPU core, most of the
- register to register instructions execute in the same
- number of clock cycles as on the 486. The use of the
- 386 bus interface causes instructions that access memory
- to execute at about the same speed as on the 386. The
- integer performance on the RapidCAD is definitely
- limited by the low memory bandwidth provided by the
- 386 bus interface (2 clock cycles per bus cycle)
- and the lack of an internal cache. CPU instructions
- often execute faster than they can be fetched from
- memory, even with a big and fast external cache.
- Therefore, the integer performance of the RapidCAD
- exceeds that of a 386 by at most 25%. This value
- was derived by running some programs that use
- mostly register-to-register operations and few
- memory accesses. This finding is supported by the
- SPEC ratings that Intel reports for the 386-33
- and the RapidCAD-33. While the 386-33 has a
- SPECint of 6.4, the RapidCAD has a SPECint of 7.3
- [28], a 14% increase. Note that these tests used
- the old (1989) SPEC benchmarks suite. While CPU
- instructions often execute in one clock cycle on
- the RapidCAD, FPU instructions always take more
- than seven clock cycles. They are therefore rarely
- slowed down by the low memory bandwidth provided
- by the 386 bus interface. My tests show a 70%-100%
- performance increase for floating-point intensive
- benchmarks (see below) over a 386 based system
- using the Intel 387DX math coprocessor. This is
- consistent with the SPECfp rating reported by Intel.
- The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
- the RapidCAD is rated at 6.1 SPECfp at the same
- frequency, an 85% increase. This means that a system
- that uses the RapidCAD is faster than any 386/387
- combination, regardless of the type of 387 used
- (Intel 387DX or faster clone). The diagnostic disk
- for the RapidCAD also gives some application
- performance data for the RapidCAD compared to the
- Intel 387DX:
-
- Application             Time w/ RapidCAD   Time w/ 387DX   Speedup
-
- AutoCAD 11                   32 sec            52 sec        63%
- AutoShade/Renderman         108 sec           180 sec        67%
- Mathematica (Windows)       103 sec           139 sec        35%
- SPSS/PC+ 4.01                14 sec            17 sec        21%
-
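- The speedup column above appears to be the ratio of the longer
- run time to the shorter one, minus one. The following sketch
- recomputes it from the four pairs of times in the table; the
- function name is made up for this example, and the rounding used
- by the diagnostic disk is an assumption.
-

```python
# Sketch: recomputing the speedup column from the run times in the table.
def speedup_percent(slower_sec: float, faster_sec: float) -> float:
    """Percentage by which the faster run beats the slower one."""
    return (slower_sec / faster_sec - 1.0) * 100.0


# (longer time, shorter time, speedup reported in the table)
RUNS = {
    "AutoCAD 11": (52, 32, 63),
    "AutoShade/Renderman": (180, 108, 67),
    "Mathematica (Windows)": (139, 103, 35),
    "SPSS/PC+ 4.01": (17, 14, 21),
}

if __name__ == "__main__":
    for name, (slow, fast, reported) in RUNS.items():
        computed = speedup_percent(slow, fast)
        print(f"{name:24s} computed {computed:5.1f}%  reported {reported}%")
```

- All four computed values land within one percentage point of
- the reported figures.
-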
- RapidCAD is available in 25 MHz and 33 MHz versions.
- It is distributed through other channels than the
- other Intel math coprocessors. Therefore, I have been
- unable to obtain a data sheet for it. The RapidCAD-1
- chip gets quite hot when operating and it can be
- assumed that its power consumption is similar to
- the 486-33. Therefore, I recommend extra cooling
- for this chip (see the paragraph below on the 486 for
- details). The RapidCAD-1 is packaged in a 132-pin
- PGA, just like the 80386, and the RapidCAD-2 is
- packaged in a 68-pin PGA like an 80387 coprocessor.
- Intel 486DX is not a coprocessor. This chip, brought out in
- 1989, functionally combines the CPU (a heavily pipelined
- implementation of the 386 architecture) with an
- enhanced 387 (the floating-point unit, FPU) and
- 8 kB of unified code/data cache on one chip. Of
- course, this description is simplified; for a
- detailed hardware description, see [52]. The
- 486DX offers about two to three times the integer
- performance of a 386 at the same frequency.
- Floating point performance is about three to four
- times as high as on the Intel 387DX at the same
- clock rate [29]. Since the FPU is on the same
- chip as the CPU, the considerable communication
- overhead between CPU and coprocessor in a 386/387
- system is omitted, letting FPU instructions run
- at the full speed permitted by the implementation.
- The FPU also takes advantage of the on-chip cache
- and the highly pipelined execution unit. Besides
- the higher speed, the 486 FPU features more accurate
- transcendental functions than the Intel 387DX
- coprocessor according to tests run by me (see below).
- To achieve better interrupt latency, FPU instructions
- with a long execution time have been made abortable
- in the case an interrupt occurs during their
- execution. The concurrent execution of CPU and
- coprocessor instructions typical for 80x86/80x87
- systems is still in existence on the 486, but
- some FPU instructions like FSIN have nearly no
- concurrency with CPU instructions, indicating
- that they make heavy use of both, CPU and FPU
- resources [53, 1]. The 486DX comes in a 168 pin
- ceramic PGA (pin grid array). It is available in
- 25 MHz and 33 MHz versions. Since the end of 1991,
- there is also a 50 MHz version available done in
- a CHMOS V process (the 25 MHz and 33 MHz are
- produced using the CHMOS IV process). Maximum
- power consumption is 3500 mW for the 25 MHz 486
- (2600 mW typical), 4500 mW for the 33 MHz version
- (3500 mW typical), and 5000 mW (4000 mW typical)
- for the 50 MHz chip. Due to the considerable amount
- of heat produced by these chips, and taking into
- consideration the slow air flow provided by the
- fan in garden variety PC tower cases, I recommend
- an extra fan directly above the CPU for safer
- operation. If you measure the surface temperature
- of an i486 in a normal tower case without extra
- cooling after some time of operation, you may well
- come up with something like 80 - 90 degrees Celsius
- (that is 176 - 194 degrees Fahrenheit for those not
- familiar with metric units) [54,55]. You don't need
- the well known and expensive IceCap(tm) to effectively
- cool your CPU. A simple fan mounted directly above
- the CPU can bring the temperature down to about 50
- to 60 degrees Celsius (122 - 140 degrees Fahrenheit)
- depending on the room temperature and the temperature
- within the PC case (which depends on the total power
- dissipation of all the components and the cooling
- provided by the fan in the power unit). According
- to a simple rule of thumb based on Arrhenius' Law, lowering
- the temperature by 10 degrees Celsius slows down
- chemical reactions by a factor of two, thus lowering
- the temperature of your CPU by 30 degrees should
- prolong the life of the device by a factor of eight
- due to the slower aging process. If you are reluctant
- to add a fan to your system because of the additional
- noise, settle for a low-noise fan like those
- available from the German manufacturer Pabst (this
- is not meant to be an advertisement. I am just the
- happy owner of such a fan. Besides that, I have no
- connections to the firm).
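-
- The temperature figures and the factor-of-two-per-10-degrees
- rule quoted in this entry can be checked with a few lines. This
- is only a sketch of the rule of thumb as stated above, not a
- reliability model; the function names are invented here.
-

```python
# Sketch: unit conversion and the "factor of two per 10 degrees" rule of thumb.
def celsius_to_fahrenheit(deg_c: float) -> float:
    """Convert a temperature from degrees Celsius to Fahrenheit."""
    return deg_c * 9 / 5 + 32


def lifetime_factor(cooling_deg_c: float, doubling_step_c: float = 10.0) -> float:
    """Estimated lifetime gain when the chip runs cooling_deg_c cooler,
    assuming aging slows by a factor of two per doubling_step_c degrees."""
    return 2 ** (cooling_deg_c / doubling_step_c)


if __name__ == "__main__":
    for deg_c in (50, 60, 80, 90):
        print(deg_c, "C =", celsius_to_fahrenheit(deg_c), "F")
    print("30 degrees cooler ->", lifetime_factor(30), "x lifetime")
```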
- Intel 486DX2 is the name for Intel's latest generation of 486 CPUs.
- Using the DX2 suffix instead of simply DX is meant
- to be an indicator that these are clock-doubled
- versions. A normal 486DX operates at the frequency
- provided by the incoming clock signal. A 486DX2
- generates a new clock signal from the incoming clock
- by means of a PLL (phase locked loop). In the DX2,
- this clock signal has twice the frequency of the
- incoming clock, hence the name clock-doubler. All
- internal parts of the 486DX2 (cache, CPU core, FPU)
- run at this higher frequency. Only the bus interface
- runs at the normal speed. That way, a 486DX2-50 can
- run on a motherboard designed for 25 MHz operation.
- Since motherboards for 50 MHz operations are much
- harder to design than those for 25 MHz, this makes
- a 486DX2-50 system easier to build and cheaper than
- a 486DX-50 system. For all operations that don't
- access off-chip resources (e.g. register operations)
- a 486DX2-50 provides exactly the same performance as
- a 486DX-50 and twice the performance of a 486DX-25.
- However, since the main memory in a 486DX2-50 system
- still operates at 25 MHz, all instructions involving
- memory accesses are potentially slower than in a
- 486DX-50 system, whose memory also runs at 50 MHz.
- The internal cache of the 486 alleviates this problem a
- bit, but overall performance of a 486DX2-50 is still
- lower than that of a 486DX-50, although Intel's
- documentation [32] shows this drop to be quite small.
- It depends a lot on the code one runs, though. The
- nice thing about the 486DX2 is that it allows easy
- upgrading of 25 and 33 MHz 486 systems, since the
- 486DX2 is completely pin-compatible with the 486DX.
- Just take out the 486DX and plug in the new 486DX2.
- Note that power consumption of the 486DX2-50 equals
- that of the 486DX-50 (4000 mW typical), and that the
- 486DX2-66 exceeds this by about 30%. These chips get
- really hot in a standard PC case with no extra cooling.
- See the above paragraph for more detailed information
- on this problem.
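-
- The trade-off described in this entry can be sketched with a very
- crude timing model: work done inside the chip is charged at the
- core clock, work that goes off-chip at the bus clock. The model
- deliberately ignores the on-chip cache, and the cycle counts in
- the example are invented, so only the ordering of the results is
- meaningful.
-

```python
# Sketch: crude timing model for a clock-doubled CPU (cache effects ignored).
def run_time_us(core_cycles: float, bus_cycles: float,
                core_mhz: float, bus_mhz: float) -> float:
    """Microseconds needed for on-chip cycles plus off-chip bus cycles."""
    return core_cycles / core_mhz + bus_cycles / bus_mhz


def compare(core_cycles: float, bus_cycles: float) -> dict:
    """Run the same (hypothetical) workload on three configurations."""
    return {
        "486DX-25": run_time_us(core_cycles, bus_cycles, 25, 25),
        "486DX2-50": run_time_us(core_cycles, bus_cycles, 50, 25),
        "486DX-50": run_time_us(core_cycles, bus_cycles, 50, 50),
    }


if __name__ == "__main__":
    print("register-only:", compare(1000, 0))
    print("with memory  :", compare(1000, 500))
```

- With no bus cycles the DX2-50 matches the DX-50 and takes half
- the time of the DX-25; as soon as bus cycles are added it falls
- between the two.
-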
- Intel 487SX is the coprocessor intended for use in 486SX systems.
- The 486SX is basically a 486DX without the floating-
- point unit (FPU) [48, 50]. Originally Intel sold
- 486DXs with a defective FPU as 486SXs but it has
- now completely removed the FPU part from the 486SX
- mask for mass production. The introduction of the
- 486SX in 1991 has been viewed mainly as a marketing
- 'trick' by Intel to take market share from the 386
- based systems once AMD became successful with their
- Am386 (AMD has taken as much as 40% of the 386 market
- due to some superior features such as higher clock
- frequency, lower power consumption, and a fully
- static design). A 486SX at 20 MHz delivers a bit
- less integer performance than a 40 MHz Am386. To add
- floating-point capabilities to a 486SX based system,
- it would be easiest to swap the 486SX with a 486DX
- which includes the FPU. However, Intel has prevented
- this easy solution by giving the 486SX a slightly
- different pin out [48, 51]. Since only three pins
- are assigned differently, clever board manufacturers
- have come out with boards that accept anything from
- a 486SX-20 to a 486DX2-50 in their CPU socket and
- provide a clean upgrade path this way. A set of
- three jumpers ensures correct signal assignment to
- the pins for either configuration. To upgrade systems
- without this feature, one has to buy the 487SX and
- put it into the "Performance Upgrade Socket" present
- in most 486SX systems. Once the 487SX was available,
- it was quickly found out that it is just a normal
- 486DX with a slightly different pin out [49]. Inserting
- the 487SX effectively shuts down the 486SX in the
- 486SX/487SX system, so the 486SX could be removed
- once the 487SX is installed. Since the shut down is
- logical, not electrical, the 486SX still uses power
- if used with the 487SX, although it is inoperative.
- Technically speaking, the solution Intel chose was
- the only practical way to provide a 486SX system with
- the high level of floating-point performance the
- 486DX offers. The CPU and FPU have to be on the same
- chip, otherwise the FPU can not make use of the cache
- on the CPU chip and there would be considerable
- overhead in CPU-FPU communication (similar to a
- 386/387 system), nullifying most of the arithmetic
- speedups over the 387. That the 486SX, 487SX, and
- 486DX are not pin-compatible seems to be purely for
- marketing reasons. To upgrade a 486SX based system,
- Intel also offers the OverDrive chip, which is just
- the same as a 487SX with internal clock doubling. It
- also goes into the "Performance Upgrade Socket"
- found in 486SX systems. The OverDrive roughly doubles
- the performance of a 486SX/487SX based system. For an
- explanation of clock doubling, see the description
- of the 486DX2 above. Like the 486SX, the 487SX is
- available in 20 MHz and 25 MHz versions. At 20 MHz,
- the 487SX has a power consumption of max. 4000 mW.
- It is available in a 169 pin ceramic PGA (pin grid
- array).
- Weitek 3167 was introduced in 1989 to provide the fastest
- floating point performance possible on a 386 based
- system at that time. The Weitek Abacus 3167 is not
- a real coprocessor, strictly speaking, but rather
- a memory mapped peripheral device. The Weitek 3167
- was optimized for speed wherever possible. Besides
- using the faster memory mapped interface to the CPU
- (the 80x87 uses IO-ports), it does not support many
- of the features of the 80x87 coprocessors, allowing
- all of the chip's resources to be concentrated on
- the fast execution of the basic arithmetic operations.
- For a more detailed description of the Weitek 3167 see
- the first chapter of this document. In benchmark
- comparisons, the Weitek 3167 provided up to 2.5 times
- the performance of an Intel 387DX coprocessor. For
- example, on a 33 MHz 3167 the Whetstone benchmark
- performed at 7574 kWhetstones/sec compared with
- the 3743 kWhetstones/sec for the Intel 387DX. Note
- however that these are single precision results and
- that the Weitek 3167's performance would drop to
- about half the stated rate for double precision,
- while the value for the Intel 387DX would not change
- much. Anyhow, before the advent of the Intel RapidCAD,
- the Weitek 3167 usually beat all 387 compatible
- coprocessors even for double precision operations
- [63,65,69]. For typical applications the advantage
- of the Weitek 3167 over the 387 clones is much smaller.
- In a benchmark test using AutoDesk's 3D-Studio the
- Weitek 3167 performed at 123% of the Intel 387DX's
- performance compared with 106% for the Cyrix FasMath
- 83D87 and 118% for the Intel RapidCAD. The Weitek
- Abacus 3167 is packaged in a 121-pin PGA that fits
- into an EMC socket provided by most 386 based systems.
- It does *not* fit into the normal coprocessor socket
- designed to hold a 387 compatible coprocessor in a
- 68-pin PGA. To get the best of both worlds, one might
- want to use a Weitek 3167 and a 387 compatible
- coprocessor in the same system. These coprocessors
- can coexist in the same system just fine. The only problem
- is that most 386 based systems contain only one
- coprocessor socket, usually of the EMC (extended math
- coprocessor) type. Thus, you can install either a
- 387 coprocessor or a Weitek 3167, but not both. There
- are little daughter boards available though that fit
- into the EMC socket and provide two sockets, an EMC
- and a standard coprocessor socket. At 25 MHz, the
- Weitek 3167 has a power consumption of max. 1750 mW.
- At 33 MHz, the max. power consumption is 2250 mW.
- Weitek 4167 is a memory mapped coprocessor that has the same
- architecture as the 3167 and is designed to provide
- 486 based systems with the highest floating point
- performance available. It executes coprocessor
- instructions at three to four times the speed of
- the Weitek 3167. Although it is up to 80% faster
- than the Intel 486 in some benchmarks [1,69], the
- performance advantage for real application is more
- like 10%. The introduction of the 486DX2 processors
- has more or less obliterated the need for a Weitek
- 4167, since the DX2 CPUs provide the same performance
- and all the additional features the 80x87 has over
- the Weitek Abacus. The Weitek 4167 is packaged in
- a 142-pin PGA package that is only slightly smaller
- than the 486's package. At 25 MHz, it has a max.
- power consumption of 2500 mW [32].
-
- Chips & Technologies has shipped samples of their 38700 and
- 38700SX coprocessors, which are compatible with the Intel 387DX
- and Intel 387SX coprocessors, respectively. Both have already
- been tested in [1]. However, C&T's German distributor (Rein
- Elektronik, Nettetal) states that these coprocessors will
- not become generally available before 4Q 1992. The samples
- tested in [1] showed about the same performance as the Cyrix
- 83D87.
-
-
-
- Pricing
-
- Due to a recent price slashing by Cyrix and subsequently by Intel
- for 387 coprocessors, prices have dropped significantly for all
- 287 and 387 compatible coprocessors with hardly any price difference
- between manufacturers. 387DX compatible coprocessors typically sell
- for ~US$ 100 for all speeds except for 40 MHz versions which are
- typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90
- regardless of speed with the exception of the 33 MHz versions, which
- are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87
- and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive,
- the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD
- for US$ 320 and haven't seen it offered for a better price. I see
- the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33
- being offered for US$ 1100. This price information reflects the
- price situation as of 08-14-92. Prices can be expected to drop
- slightly in the near future.
-
- If you have a demand for high floating-point performance, you
- should consider buying a 486 based system rather than buying
- a 386 based system with an additional coprocessor. A 386 mother
- board for 33 MHz operation sells for ~ US$ 300; together with the
- coprocessor, the total cost is ~ US$ 400. A 486-33 ISA-board sells for
- US$ 650. While the 486-33 system is about 60% more expensive than the
- 386/387 system, it also provides 100% more integer and floating-
- point performance (twice the performance). If you want to push
- your 386 based system to maximum floating-point performance and
- can't switch to a 486 based system for some reason, I recommend
- the Intel RapidCAD. It is both faster [1] and cheaper than installing
- a Weitek Abacus 3167 with your 386, which used to be the highest
- performing combination before the RapidCAD came out. Similarly,
- the introduction of the 486DX2 clock-doubler chips has obliterated
- the need for a Weitek 4167 to get maximum floating-point performance
- out of a 486 based system. A 486DX2-66 performs at or above the
- performance level of a 33 MHz Weitek 4167, even if the latter
- uses single precision rather than double precision. The 486DX2-66
- is rated by Intel at 24700 double precision kWhetstones/sec and
- 3.1 double precision Linpack MFLOPS. Of course, these benchmarks
- used the highest performance compilers available. But even with
- a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision
- MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description
- of the benchmarks mentioned, see the paragraph on benchmarks below).
- Although I haven't yet seen 486DX2-66 processors offered to
- the end users for upgrade purposes, I'll recommend the 486DX2-66
- to those that need highest floating-point performance and are
- planning on buying a new PC. The price difference between a
- 33 MHz 486DX motherboard and a 486DX2-66 motherboard is around
- US$ 600, well below the price for the Weitek Abacus 4167.
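-
- The 386/387 versus 486-33 comparison above boils down to simple
- arithmetic, sketched here with the prices quoted in this section.
- The relative performance of 2.0 is the "twice the performance"
- estimate from the text, not a measurement, and the helper names
- are invented for the example.
-

```python
# Sketch: price and price/performance of the two board options in the text.
def percent_more(value: float, baseline: float) -> float:
    """How much larger value is than baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


PRICE_386_387 = 400.0   # 386-33 board plus coprocessor, US$ (from the text)
PRICE_486_33 = 650.0    # 486-33 ISA board, US$ (from the text)
PERF_386_387 = 1.0      # baseline
PERF_486_33 = 2.0       # "twice the performance" per the text

if __name__ == "__main__":
    print("extra cost       : %.1f%%" % percent_more(PRICE_486_33, PRICE_386_387))
    print("US$ per perf unit: 386/387 %.0f, 486-33 %.0f"
          % (PRICE_386_387 / PERF_386_387, PRICE_486_33 / PERF_486_33))
```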
-
-
-
- Operation
-
- In an 80x86/80x87 system CPU instructions and coprocessor
- instructions are executed concurrently. This means that
- the CPU can execute CPU instructions while the coprocessor
- executes a coprocessor instruction at the same time. The
- concurrency is restricted somewhat by the fact that the
- CPU has to aid the coprocessor in certain operations. As
- the CPU and the coprocessor are fed from the same instruction
- stream and both instruction streams may operate on the same
- data, there has to be a synchronizing mechanism between the
- CPU and the coprocessor.
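-
- The benefit of this concurrency, and the cost of having to
- synchronize, can be sketched with a toy cycle count. Each step
- below is one coprocessor instruction followed by some independent
- CPU work; with overlap, a step takes as long as the slower of the
- two, because the next coprocessor instruction cannot start before
- the previous one has finished. All cycle counts are invented.
-

```python
# Sketch: toy model of CPU/coprocessor overlap (cycle counts are made up).
def serial_cycles(steps) -> int:
    """Total cycles if the CPU simply waited for every coprocessor result."""
    return sum(fpu + cpu for fpu, cpu in steps)


def overlapped_cycles(steps) -> int:
    """Total cycles if independent CPU work runs while the coprocessor
    is busy and the two resynchronize at the end of every step."""
    return sum(max(fpu, cpu) for fpu, cpu in steps)


if __name__ == "__main__":
    # (coprocessor cycles, independent CPU cycles) per step
    steps = [(90, 30), (20, 50), (120, 10)]
    print("serial    :", serial_cycles(steps))
    print("overlapped:", overlapped_cycles(steps))
```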
-
- In an 8086/8087 or 8088/8087 system, both of the chips look at the
- opcodes coming in from the bus. To do this, both chips have
- the same BIU (bus interface unit) and the 8086 BIU sends the
- status signals of its prefetch queue to the 8087 BIU. This
- assures that both processors always decode the same instructions
- in parallel. Since all coprocessor instructions start with the
- bit pattern 11011, it is easy for the 8087 to ignore all other
- instructions. Likewise the CPU ignores all coprocessor instructions
- except if they access memory. In this case, the CPU computes
- the address of the LSB (least significant byte) of the memory
- operand and does a dummy read. The 8087 then takes the data
- from the data bus. If more than one memory access is needed to
- load a memory operand, the 8087 requests the bus from the CPU,
- generates the consecutive addresses of the operand's bytes
- and fetches them from the data bus. After completing the operation,
- the 8087 hands bus control back to the CPU. Since 8087 and CPU
- are hooked up to the same synchronous bus, they have to run at
- the same speed. This means that with the 8087, only synchronous
- operation of CPU and coprocessor is possible. Another 8087
- coprocessor instruction can only be started if the previous one
- has been completed in the NEU (numerical execution unit) of the
- 8087. To prevent the 8086 from decoding a new coprocessor
- instruction while the 8087 is still executing the previous
- coprocessor instruction, the following mechanism is used: The
- compilers and assemblers automatically generate a WAIT instruction
- before each coprocessor instruction. The WAIT instruction tests
- the /TEST pin until its input becomes "LOW". In 8086/8087 systems,
- the 8086 /TEST pin is connected to the 8087 BUSY pin. As long
- as the NEU executes a coprocessor instruction, it forces its
- BUSY pin "HIGH". Thus the WAIT instruction in front of every
- coprocessor instruction stops the CPU until a still executing
- previous coprocessor instruction has finished. The same
- synchronization is used before the CPU accesses data that
- was written by the coprocessor. A WAIT instruction after the
- coprocessor instruction that writes to memory causes the CPU to
- stop until the coprocessor has transferred the data to memory,
- after which the CPU can safely access the data.
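-
- Two details from the paragraph above lend themselves to a small
- sketch: recognizing a coprocessor instruction by its leading bit
- pattern 11011 (first opcode bytes D8h through DFh), and placing a
- WAIT (opcode 9Bh) in front of each one the way an assembler would.
- The sketch works on a list holding only the first byte of each
- instruction, which is a simplification, and the function names
- are hypothetical.
-

```python
# Sketch: spotting coprocessor (ESC) opcodes and prefixing them with WAIT.
ESC_PATTERN = 0b11011   # upper five bits of every coprocessor opcode
WAIT_OPCODE = 0x9B      # WAIT/FWAIT


def is_coprocessor_opcode(first_byte: int) -> bool:
    """True if the upper five bits of the opcode byte are 11011."""
    return (first_byte >> 3) == ESC_PATTERN


def insert_waits(first_bytes):
    """Return a new list with WAIT placed before each coprocessor opcode.
    `first_bytes` holds one entry per instruction (its first opcode byte)."""
    out = []
    for byte in first_bytes:
        if is_coprocessor_opcode(byte):
            out.append(WAIT_OPCODE)
        out.append(byte)
    return out


if __name__ == "__main__":
    stream = [0x90, 0xD9, 0x90, 0xDE]      # NOP, ESC, NOP, ESC
    print([hex(b) for b in insert_waits(stream)])
```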
-
- With the help of an additional chip, the 8087 can also be inter-
- faced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
- from Philips, Siemens) in the 1982/1983 time frame, but with
- the introduction of the IBM AT which used the 80286, it lost all
- significance for the PC market. The 80C186 (CMOS version of the
- 80186) nowadays sells as an embedded controller and can be combined
- with an 80C187 coprocessor which is based on the internals of the
- Intel 387 [37].
-
- The 80287 CPU-interface is totally different from the solution
- used in the 8087. Since the 80286 implements memory protection
- via an MMU based on segmentation, it would have been much too
- expensive to duplicate the whole protection logic on the coprocessor
- for an interface solution similar to the 8087. In a 80286/80287
- system, the CPU fetches and stores all opcodes and operands for
- the coprocessor. Information is passed through ports F8h - FFh.
- As these ports are accessible under program control, care must
- be taken to not accidentally perform write operation to them, as
- this could corrupt the information in the math coprocessor.
- The execution unit of the 80287 is practically identical to that
- of the 8087, that is, nearly all coprocessor instructions execute
- in the same number of clock cycles on both coprocessors. Due to
- the additional overhead of the CPU/coprocessor interface (at
- least ~40 clock cycles), an 8 MHz 80286/80287 combination can be
- slower than an 8086/8087 system running at the same speed for
- floating point intensive programs. Additionally, most of the
- older 286 boards were configured to run the coprocessor at 2/3
- the speed of the CPU, making use of the ability of the 80287
- to run asynchronously with the CPU. The 80287 has a CKM (ClocK
- Mode) pin that causes the incoming system clock to be divided by
- three for the coprocessor if it is tied to ground. The 80286 always
- divides the system clock by two internally. Thus the ratio 2/3.
- However, when the CKM pin is tied high on the 80287, it does not
- divide the CLK input. This feature has been exploited by the
- makers of coprocessor speed sockets. These sockets tie CKM high
- and supply their own CLK signal with a built-in oscillator,
- thereby allowing the 80287 or compatible to run at a much higher
- speed than the CPU. With an IIT or Cyrix 287 one can have a
- 20 MHz coprocessor running with an 8 MHz 80286. Note however that
- the floating-point performance in such a configuration does not
- scale linearly with the coprocessor clock, since all the data
- has to be passed through the much slower CPU. If the coprocessor
- executes mostly simple instructions such as addition and
- multiplication, doubling the coprocessor clock in a 10 MHz system
- to 20 MHz does not show any performance increase at all [24]. The
- 80C287 by AMD is a 100% clone of the original Intel 80287, but is
- produced in CMOS rather than NMOS as the original Intel chip. This
- makes for lower power consumption.
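-
- The clock relationships described above can be checked with a few
- lines of Python. This is only a sketch: the function names and the
- 12 MHz system clock are assumptions chosen for illustration and are
- not taken from any data sheet.
-
```python
from fractions import Fraction

def cpu_clock_mhz(system_clock_mhz):
    # The 80286 always divides the system clock by two internally.
    return Fraction(system_clock_mhz) / 2

def npx_clock_mhz(clk_mhz, ckm_high):
    # Original 80287: CKM tied to ground -> CLK is divided by three,
    # CKM tied high -> CLK is used undivided.
    return Fraction(clk_mhz) if ckm_high else Fraction(clk_mhz) / 3

system_clock = 12  # hypothetical board clock in MHz (assumption)
cpu = cpu_clock_mhz(system_clock)
npx = npx_clock_mhz(system_clock, ckm_high=False)
ratio = npx / cpu  # coprocessor speed relative to the CPU
print(cpu, npx, ratio)
```
-
- With CKM grounded the ratio comes out as 2/3 regardless of the
- system clock chosen. With ckm_high=True the function simply returns
- the CLK value supplied, which is how a speed socket with its own
- oscillator decouples the coprocessor clock from the CPU clock.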
-
- The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals
- of a 387 coprocessor, but are pin-compatible with the original 287.
- However, these chips divide the system clock by two internally,
- as opposed to three in the original Intel 80287. Since the 80286
- also divides the system clock by two, they usually run synchronously
- with the CPU. They can also run asynchronously, though.
-
- The 8086/8087 combination can be characterized as a cooperation of
- partners with equal rights, while the 80286/287 is more a master-
- slave relationship. This makes synchronization much easier, since
- the complete instruction and data flow of the coprocessor goes through
- the CPU. Before executing most coprocessor instructions, the 80286
- tests its /BUSY pin which is hooked up to the 287 coprocessor and
- signals if the 80287 is still executing a previous coprocessor
- instruction or has encountered an exception. The 80286 then waits
- until the 80287 is not busy before loading the coprocessor instruction
- into the coprocessor. Therefore, a WAIT instruction before every
- coprocessor instruction is not required. These WAITs are permissible,
- but not necessary in 80287 programs. The second form of WAIT
- synchronisation after the coprocessor has written a memory operand is
- still necessary on 286/287 systems.
-
- The coprocessor interface in 80386/80387 systems is very similar to
- the one found in 286/287 systems. However, to prevent corruption
- of the coprocessor's contents by programming errors, the IO-ports
- 800000F8h - 800000FFh are used, which are not user accessible. The
- interface has been optimized and uses 32-bit transfers. The overhead
- of the interface has been reduced to about 16-20 clock cycles. For
- some operations on the 387 'clones' that take less than 16 clock
- cycles to complete, this effectively limits the execution rate of
- coprocessor instructions. The only sensible solution to provide
- even higher floating point performance was to integrate the CPU
- and coprocessor functionality onto the same chip. This is what
- Intel did with the 80486. The FPU in the 486 also benefits from
- the instruction pipelining and from the integrated cache.
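-
- To get a feeling for what a fixed interface overhead means, the
- following sketch computes an upper bound on the coprocessor
- instruction rate. It assumes that every instruction pays the full
- overhead and that nothing overlaps, which is a simplification and
- not a measured figure. The function name is made up for this
- example.
-
```python
def max_instruction_rate(cpu_mhz, overhead_clocks):
    """Upper bound in million coprocessor instructions per second.

    Assumption: each instruction costs at least `overhead_clocks`
    CPU clocks for the CPU/coprocessor interface, with no overlap.
    """
    return cpu_mhz / overhead_clocks

# 386/387 interface: about 16-20 clocks of overhead per instruction
for overhead in (16, 20):
    rate = max_instruction_rate(33.3, overhead)
    print(f"{overhead} clocks overhead: at most {rate:.2f} M instr/sec")
```
-
- Under these assumptions a 33.3 MHz 386 cannot issue much more than
- about two million coprocessor instructions per second, no matter
- how fast the coprocessor itself executes them.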
-
-
-
- Performance
-
- Several computer magazines have published performance comparisons
- at the application level for the 387 coprocessors and Weitek's
- ABACUS 3167 and 4167 chips [1,25,68,70]. Applications tested included
- AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's
- 3D-Studio. For most tests, performance improvements for the 387
- clones over Intel's 387DX were small to marginal, the clones running
- the applications no more than 5% to 15% faster than the Intel 387DX.
- In the test of 3D-Studio, one of the few programs that supports
- the Weitek Abacus, the Weitek 3167 improved performance by 23%
- over an Intel 387DX and the 4167 improved performance by 10% over
- the 486 [1].
-
-
- The Intel Math Coprocessor Utilities Disk that accompanies the
- Intel 387DX coprocessor has a demonstration program that shows
- the speedup of certain application programs when run with the
- Intel coprocessor vs. a system with no coprocessor.
-
- Application        Time w/o 387  Time w/ 387   Speedup
-
- Art&Letters         87.0 sec      34.8 sec        150%
- Quattro Pro          8.0 sec       4.0 sec        100%
- Wingz               17.9 sec       9.1 sec         97%
- Mathematica        420.2 sec     337.0 sec         25%
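-
- The Speedup column appears to be computed as (time without 387 /
- time with 387 - 1) * 100. This formula is inferred from the numbers
- in the table and is not stated by Intel. A small sketch that
- recomputes the column (the names are made up for this example):
-
```python
def speedup_percent(time_without, time_with):
    # Speedup relative to the run without a coprocessor, in percent.
    return (time_without / time_with - 1.0) * 100.0

# (time w/o 387, time w/ 387) in seconds, copied from the table above
intel_demo = {
    "Art&Letters": (87.0, 34.8),
    "Quattro Pro": (8.0, 4.0),
    "Wingz": (17.9, 9.1),
    "Mathematica": (420.2, 337.0),
}

for name, (t_without, t_with) in intel_demo.items():
    print(f"{name:12s} {speedup_percent(t_without, t_with):6.0f}%")
```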
-
-
- The following table is an excerpt from [70]:
-
- Application        Time w/o 387  Time w/ 387   Speedup
-
- Corel Draw         471.0 sec     416.0 sec         13%
- Freedom Of Press   163.0 sec      77.0 sec        112%
- Lotus 1-2-3        257.0 sec      43.0 sec        498%
-
-
- The following table is an excerpt from [25]:
-
- Application        Time w/o 387  Time w/ 387   Speedup
-
- Design CAD, Test 1  98.1 sec      50.0 sec         96%
- Design CAD, Test 2  75.3 sec      35.0 sec        115%
- Excel, Test 1        9.2 sec       6.8 sec         35%
- Excel, Test 2       12.6 sec       9.3 sec         35%
-
-
-
- The performance statistics below were put together with the
- help of four widely known numeric benchmarks and two benchmarks
- developed by me. Three Pascal programs, one FORTRAN program,
- and two assembly language programs were used. The assembly language
- programs were linked with Turbo-Pascal 6.0 for library support,
- especially to include the coprocessor emulator of the TP 6.0
- run-time library. The Pascal programs were compiled with Turbo
- Pascal 6.0 from Borland International, a non-optimizing compiler
- that produces 16-bit code. The FORTRAN program was compiled using
- MS FORTRAN 5.0, an optimizing compiler that generates 16-bit
- code. All programs except PEAKFLOP and SAVAGE, which use double
- extended precision, use double precision variables. Note that
- with a highly optimizing compiler producing 32-bit code you
- will see much higher performance for some benchmarks. For example,
- Intel rates the 33 MHz 386/387DX at 3290 KWhetstones/sec and 0.4
- double precision LINPACK MFLOPS [28,29]. The 33 MHz Intel 486 is
- rated by Intel at 12300 KWhetstones/sec and 1.6 double precision
- LINPACK MFLOPS [30]. The compilers used in these benchmarks run by
- the chip vendor are the ones that give the highest performance
- available. These compilers are in the US$ 1000+ price range.
- Some of them may be experimental or prereleased versions not
- available to the general public. The relative performance of
- one coprocessor to another could vary depending on the code
- generated by compilers. Non-optimizing compilers tend to generate
- a high percentage of operations which access variables in memory,
- while optimizing compilers produce code that contains many
- operations involving registers. Thus it is quite possible that
- coprocessor A beats coprocessor B running benchmark Z if compiled
- with compiler C, but B beats A when the same benchmark is compiled
- using compiler D. All benchmarks in this overview were run from
- floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS
- and AUTOEXEC.BAT files. This ensured that no TSR or other
- program unnecessarily stole computing resources from the
- benchmarks.
-
- Coprocessor performance also depends on the motherboard, or more
- specifically the chip set used on the motherboard. In [34] and [35]
- identically configured motherboards using different 386 chip sets
- were tested. Among other tests, a coprocessor benchmark based on
- a fractal computation was run and its execution time recorded.
- The following tables, which show that coprocessor performance
- varies with the chip set, have been copied from these articles
- in abridged form.
-
-                Cyrix                                 Cyrix
- chip set       387+                chip set          83D87
-
- Opti, 40 MHz   24.57 sec   97.0%   PC-Chips, 33 MHz  26.97 sec   93.0%
- Elite, 40 MHz  24.46 sec   97.4%   UMC, 33 MHz       27.69 sec   90.5%
- ACT, 40 MHz    23.84 sec  100.0%   Headland, 33 MHz  25.08 sec  100.0%
- Forex, 40 MHz  23.84 sec  100.0%   Eteq, 33 MHz      27.38 sec   91.6%
-
- This shows that performance of the same coprocessor can vary by
- up to ~10% depending on the chip set used on your board, at least
- for 386 motherboards (similar numbers for 286, 386sx, and 486 are
- unfortunately not available). The benchmarks for this article were
- run on a board with the Forex chip set, which is one of the fastest
- 386 chip sets there is, not only with respect to floating-point
- performance [35].
-
-
- Description of benchmarks
-
- PEAKFLOP is the kernel of a fractal computation. It consists
- mainly of a tight loop written in assembly code and fine tuned
- to give maximum performance. All variables are held in the
- CPU's and coprocessor's registers, so the only memory access
- is for opcode fetches. The main loop contains three multiplications
- and five additions/subtractions. This ratio is fairly typical
- for other floating point intensive programs as well. The whole
- program fits nicely into even a very small CPU cache. Due to
- the nature of this program, its MFLOPS rate is unlikely to be
- exceeded by any program that calculates anything useful. Thus
- the name PEAKFLOP. You will find the source code for PEAKFLOP
- in appendix B.
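-
- For readers without access to appendix B, here is a rough Python
- sketch of what such a kernel can look like. The text only says that
- PEAKFLOP is the kernel of a fractal computation; the Mandelbrot
- iteration used below is an assumption, chosen because its inner loop
- happens to need exactly three multiplications and five
- additions/subtractions. It is not the actual PEAKFLOP source, and
- the function names are made up.
-
```python
FLOPS_PER_ITERATION = 8  # 3 multiplications + 5 additions/subtractions

def mandelbrot_iterations(cx, cy, max_iter=256):
    """Count iterations of z -> z*z + c until |z|^2 exceeds 4."""
    x = y = 0.0
    for i in range(max_iter):
        xx = x * x            # multiplication 1
        yy = y * y            # multiplication 2
        if xx + yy > 4.0:     # addition 1 (escape test)
            return i
        xy = x * y            # multiplication 3
        y = xy + xy + cy      # additions 2 and 3
        x = xx - yy + cx      # subtraction 4 and addition 5
    return max_iter

def mflops(iterations, seconds):
    # MFLOPS rate for a run that performed `iterations` loop passes.
    return FLOPS_PER_ITERATION * iterations / seconds / 1e6

print(mandelbrot_iterations(0.0, 0.0, 100))
print(mandelbrot_iterations(2.0, 2.0))
```
-
- In an assembly language version all of x, y, xx, yy, xy, cx, and cy
- can be kept on the coprocessor stack, which is what makes a loop of
- this kind suitable for measuring peak rates.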
-
- TRNSFORM multiplies an array of 8191 vectors by a 3D-transformation
- matrix (a 4x4 matrix). Each vector consists of four double precision
- values. Multiplying vectors by a matrix is a typical operation in
- the manipulation (e.g. rotation) of 3D objects which are made up of
- many vectors describing the object. This benchmark stresses addition
- and multiplication as well as memory access. For each vector, 16
- multiplications and 12 additions are used. About 256 kByte of data
- is accessed during the benchmark. TRNSFORM is implemented as an
- optimized assembler program linked with the Turbo Pascal 6.0 library.
- For the IIT 3C87, a special version was written that makes use of
- the special F4X4 instruction available on that coprocessor. F4X4
- does a full multiplication of a 4x4 matrix by a 4x1 vector in a
- single instruction. The full source code for the TRNSFORM program is
- in appendix B.
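-
- The operation counts given above follow directly from the
- definition of a matrix-vector product. The sketch below is not the
- TRNSFORM source, just a plain Python illustration with made-up
- names and a hypothetical translation matrix as example data.
-
```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix (list of four rows) by a 4-element vector."""
    result = []
    for row in matrix:
        # 4 multiplications and 3 additions per output component
        acc = row[0] * vector[0]
        for k in range(1, 4):
            acc += row[k] * vector[k]
        result.append(acc)
    return result

VECTORS = 8191
MULS_PER_VECTOR = 4 * 4           # 16 multiplications
ADDS_PER_VECTOR = 4 * 3           # 12 additions
DATA_BYTES = VECTORS * 4 * 8      # four 8-byte doubles per vector

# Hypothetical example: move the point (1, 2, 3) by (10, 20, 30)
translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(transform(translate, [1.0, 2.0, 3.0, 1.0]))
print(MULS_PER_VECTOR, ADDS_PER_VECTOR, DATA_BYTES / 1024)
```
-
- The array of 8191 vectors alone occupies 262112 bytes, which is
- where the figure of about 256 kByte comes from.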
-
- LLL is short for Lawrence Livermore Loops [21], a set of kernels
- taken from real floating point intensive programs. Some of these
- loops are vectorizable, but since we don't deal with vector
- processors here, this doesn't matter. For this test, LLL was
- adapted from the FORTRAN original [20] to Turbo Pascal 6.0. By
- variable overlaying (similar to FORTRAN's EQUIVALENCE statement)
- memory allocation for data was reduced to 64 kB, so all data fits
- into a single 64 kB segment. The older version of LLL is used here
- which contains 14 loops. There also exists a newer, more elaborate
- version consisting of 24 kernels. The kernels in LLL exercise only
- multiplication and addition. The MFLOPS rate reported is the
- average of the MFLOPS rate of all 14 kernels as reported by the
- LLL program. LLL and Whetstone results (see below) are reported
- as returned by my COMPTEST test program in which they have been
- included as a measure of coprocessor/FPU performance. COMPTEST
- has been compiled under Turbo Pascal 6.0 with all 'optimizations'
- on and using my own run-time library, which gives higher perfor-
- mance than the one included with TP 6.0. My library is available
- as TPL60N15.ZIP from garbo.uwasa.fi and ftp-sites that mirror
- this site.
-
- Linpack [5] is a well known floating-point benchmark that also
- heavily exercises the memory system. Linpack operates on large
- matrices and takes up about 570 kB in the version used for this
- test. This is about the largest program size a pure DOS system
- can accommodate. Linpack was originally designed to estimate
- performance of BLAS, a library of FORTRAN subroutines that
- handles various vector and matrix operations. It uses two routines
- from BLAS which are thought to be typical of the matrix operations
- used by BLAS. Both routines only use addition/subtraction and
- multiplication. The FORTRAN source code for Linpack can be
- obtained from the automated mail server netlib@ornl.gov. Linpack
- was compiled using MS Fortran 5.0 in the HUGE memory model (which
- can handle data structures larger than 64 kB) and with compiler
- switches set for maximum optimization. Linpack repeatedly does
- the same test. The number reported is the maximum MFLOPS rate
- returned by Linpack. Linpack MFLOPS ratings for a great number
- of machines are contained in [6]. This PostScript document is
- also available from netlib@ornl.gov.
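-
- The text above does not name the two BLAS routines. As an
- illustration of the kind of operation involved, the sketch below
- shows DAXPY (y := a*x + y), which is assumed here to be one of them
- since it is usually described as the routine that dominates Linpack
- run time. This is plain Python for illustration, not the FORTRAN
- BLAS code.
-
```python
def daxpy(a, x, y):
    """Return a*x + y element-wise for two equally long vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # One multiplication and one addition per element
    return [a * xi + yi for xi, yi in zip(x, y)]

def daxpy_flops(n):
    # Floating-point operations for vectors of length n
    return 2 * n

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(daxpy_flops(3))
```
-
- Each element needs one load of x, one load of y, and one store of
- y for just two floating-point operations, which is why a benchmark
- built on such routines exercises the memory system so heavily.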
-
- Whetstone [2,3,4] is a synthetic benchmark based upon statistics
- collected about the use of certain control and data structures
- in programs written in high level languages. Based on these
- statistics, Whetstone tries to mirror a 'typical' HLL program.
- Whetstone performance is expressed by how many theoretical
- 'whetstone' instructions are executed per second. It was
- originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and
- Linpack, Whetstone not only uses addition and multiplication
- but exercises all basic arithmetic operations as well as some
- transcendental functions. Whetstone performance depends on the
- speed of the coprocessor as well as on the speed of the CPU,
- while PEAKFLOP, LLL, and Linpack place a heavier burden on the
- coprocessor/FPU. There exist an old and a new version of
- Whetstone. Note that results from the two versions can differ
- by as much as 20% for the same test configuration. For this
- test, the new version in Pascal from [3] was used. It was
- compiled with Turbo Pascal 6.0 and my own library (see above)
- with all 'optimizations' on.
-
- SAVAGE tests the performance of transcendental function
- evaluation. It is basically a small loop in which the sin,
- cos, arctan, ln, exp, and sqrt functions are combined in a
- single expression. While sin, cos, arctan, and sqrt can be
- evaluated directly with a single 387 coprocessor instruction
- each, ln and exp need additional preprocessing for argument
- reduction and result conversion. According to [14], the Savage
- benchmark was devised by Bill Savage, and is distributed by:
- The Wohl Engine Company, Ltd., 8200 Shore Front Parkway,
- Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
- to make 250,000 passes through the loop. Here only 10,000 loops
- are executed for a total of 60,000 transcendental function
- evaluations. The result is expressed in function evaluations
- per second. SAVAGE source code was taken from [7] and compiled
- with Turbo Pascal 6.0 and my own run-time library (see above).
-
-
- Benchmark results for 387 coprocessors, coprocessor emulators and
- the Intel RapidCAD and Intel 486 CPUs.
-
-
- 40 MHz          PEAKFLOP  TRNSFORM  LLL     Linpack  Whetstone    Savage
-                 MFLOPS    MFLOPS    MFLOPS  MFLOPS   kWhet/sec  Func/sec
-
- 386, EM87       0.0084    0.0080    0.0060  0.0060          31       502  ##
- 386, Franke387  0.0369    0.0295    0.0233  0.0215         164      4002  $$
- 386, TP 6 Emu   0.0316    0.0273    0.0200  0.0190         160      3794  %%
- Intel 387DX     0.9204    0.7212    0.3932  0.3211        2428     52677
- ULSI 83C87      1.2093    0.7936    0.3890  0.3120        2528     56926
- IIT 3C87        1.0196    0.7145    0.3834  0.3179        2663     58766
- IIT 3C87,4x4    1.0196    1.7244    0.3834  0.3179        2663     58766  ??
- Cyrix 387+      1.1305    0.8162    0.3945  0.3208        2946     80322
- Intel RapidCAD  2.2128    1.8931    0.7377  0.5432        4810     86957
- Intel 486       2.4762    2.1335    1.1110  0.8204        6195     98522
-
-
- 33.3 MHz        PEAKFLOP  TRNSFORM  LLL     Linpack  Whetstone    Savage
-                 MFLOPS    MFLOPS    MFLOPS  MFLOPS   kWhet/sec  Func/sec
-
- 386, EM87       0.0070    0.0040    0.0050  0.0050          26       418  ##
- 386, Franke387  0.0307    0.0246    0.0194  0.0179         137      3335  $$
- 386, TP 6 Emu   0.0263    0.0227    0.0167  0.0158         133      3160  %%
- Intel 387DX     0.7647    0.6004    0.3283  0.2676        2046     43860
- ULSI 83C87      1.0097    0.6609    0.3239  0.2598        2089     47431
- IIT 3C87        0.8455    0.5957    0.3198  0.2646        2203     49020
- IIT 3C87,4x4    0.8455    1.4334    0.3198  0.2646        2203     49020  ??
- Cyrix 387+      0.9286    0.6806    0.3293  0.2669        2435     66890
- Cyrix 83D87     1.013     N/A       0.333   0.273         2550       N/A
- Intel RapidCAD  1.8572    1.5798    0.6072  0.4533        3953     72464
- Intel 486       2.0800    1.7779    0.9387  0.6682        5143     82192
-
- For comparison:
-
-                 PEAKFLOP  TRNSFORM  LLL     Linpack  Whetstone    Savage
-                 MFLOPS    MFLOPS    MFLOPS  MFLOPS   kWhet/sec  Func/sec
-
- i486DX2-66      4.1601    3.4227    1.6531  1.3010       10655    163934
- i486DX2-50      3.0589    2.6665    1.2537  0.9744        7962    123203
- i387, 20 MHz    0.2253    0.3271    0.1434  0.1171         952     21739  ++
- i387DX, 20 MHz  0.3567    0.4444    0.1484  0.1161        1034     24155  &&
- i80287, 5 MHz   0.0281    0.0310    0.0242  0.0222         150      3261  !!
- i8087, 9.54 MHz 0.0636    0.0705    0.0321  0.0219         234      5782  **
-