- Newsgroups: comp.sys.ibm.pc.hardware
- Path: sparky!uunet!paladin.american.edu!darwin.sura.net!jvnc.net!yale.edu!ira.uka.de!rz.uni-karlsruhe.de!usenet
- From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
- Subject: What you always wanted to know about math coprocessors 2/4
- Message-ID: <1992Sep15.162718.11001@rz.uni-karlsruhe.de>
- Sender: usenet@rz.uni-karlsruhe.de (USENET News System)
- Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
- Date: Tue, 15 Sep 1992 16:27:18 GMT
- X-News-Reader: VMS NEWS 1.23
- Lines: 922
-
- Cyrix EMC87 is basically a special version of the Cyrix 83D87.
- In addition to the normal 387 operating mode, in
- which coprocessor-CPU communication is handled thru
- reserved IO-ports, it also offers a memory-mapped
- mode of operation similar to the operation principle
- of the Weitek Abacus. Please note that the EMC87 is
- *not* compatible with Weitek's Abacus coprocessor.
- They both use the same interface technique (memory
- mapping) but while the EMC87 uses the standard 387
- instruction set, the Weitek coprocessors use a
- different instruction set of their own. Like the
- Weitek Abacus, the EMC87 occupies a 64 kByte memory
- block starting at physical address C0000000h. It can
- therefore only be accessed in the protected or virtual
- modes of the 386 CPU. DOS programs can access the
- EMC87 with the help of DOS-extenders or memory
- managers like EMM386 which run in protected/virtual
- mode themselves. Since the EMC87 also provides the
- standard CPU interface via IO-ports, it can be used
- just like any other 387 compatible coprocessor and
- delivers the same performance as the Cyrix 83D87 in
- this mode. However, using the memory mapped mode of
- the EMC87 provides a significant speed advantage.
- The traditional 387 CPU-coprocessor interface via
- IO-ports has an overhead of about 14-20 clock cycles.
- Since the Cyrix 83D87 executes some operations like
- addition and multiplication in much less time, its
- performance is limited by the CPU-coprocessor
- interface. The memory-mapped mode has much less
- overhead and allows all coprocessor instructions to
- be executed at full speed and with no penalty. For
- this reason, Cyrix introduced the EMC87 in 1990.
- In a test, the EMC87 at 33 MHz ran the single
- precision Whetstone benchmark at 7608 kWhetstones/sec,
- while the Cyrix 83D87 at 33 MHz had a speed of
- only 5049 kWhetstones/sec, an increase of 50.7% [63].
- In another test, the EMC87 ran a fractal computation
- at two times the speed of the Cyrix 83D87 and 2.6
- times as fast as an Intel 387DX [64]. A third test
- found the EMC87's overall performance to be 20%
- higher than the performance of the Cyrix 83D87
- [65]. The Cyrix FasMath EMC87 has also been sold
- as Cyrix AutoMATH by Cyrix. The two chips are 100%
- identical. Unlike the Cyrix 83D87, which fits into
- the 68-pin 387 coprocessor socket, the EMC87 comes
- in a 121-pin PGA and requires the 121-pin EMC
- (Extended Math Coprocessor) socket. Note that not
- all boards have such a socket, a notable exception
- being IBM's PS/2s, for example. Originally, Cyrix
- claimed support for the fast memory mapped mode of
- the EMC87 from a lot of software vendors (including
- Borland and Microsoft). However, there are only
- very few applications that make use of it, among
- them Evolution Computing's FastCAD 3D, MicroWay
- Inc.'s NDP FORTRAN-386 compiler, Metaware's High-C
- compiler version 1.6 and newer, and Intusoft's
- Spice [63,73]. Cyrix has implemented some additional
- instructions in the EMC87: FRICHOP, FRINT2, and
- FRINEAR. These instructions enable rounding to
- integer without setting the rounding mode by
- manipulating the coprocessor control word and are
- intended to make life easier for compiler writers.
- The EMC87 is available in 25 and 33 MHz versions. Cyrix
- is currently phasing out the EMC87.
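To make the interface bottleneck described for the EMC87 concrete, here is a minimal sketch of how much of each instruction's time goes into CPU-coprocessor handshaking. The 14-20 clock overhead comes from the text above; the 5-clock execution time is a hypothetical figure chosen only for illustration, not a measured value for any Cyrix chip.

```python
def interface_share(exec_clocks: int, overhead_clocks: int) -> float:
    """Fraction of the total instruction time spent on the IO-port interface."""
    return overhead_clocks / (exec_clocks + overhead_clocks)


# Hypothetical execution time of a fast add/multiply (an assumption, not a datasheet value).
HYPOTHETICAL_EXEC_CLOCKS = 5

if __name__ == "__main__":
    for overhead in (14, 20):  # overhead range quoted in the text
        share = interface_share(HYPOTHETICAL_EXEC_CLOCKS, overhead)
        print(f"overhead {overhead} clocks: {share:.0%} of the time is interface")
```

Under these assumptions most of the time is spent talking to the chip rather than computing, which is why a lower-overhead memory-mapped mode pays off for short operations.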
- Cyrix 387+ is the successor to the Cyrix 83D87. The name 387+
- is only used for European distribution. In other
- parts of the world, the second generation 387 'clone'
- from Cyrix still goes by the name 83D87. In my tests,
- I found the Cyrix 387+ to be about five to 10 percent
- *slower* than the Cyrix 83D87. However, some
- instructions like the square root (FSQRT) now
- run at only half the speed at which they ran in
- the 83D87 and the transcendental functions show
- a 40% drop in performance compared with the 83D87
- on the average (see performance results below). I
- also found the transcendental functions on the 387+
- to be a bit *more* accurate than those implemented
- in the 83D87. According to a source with Cyrix [73],
- the 387+ was designed to make a smaller and thus
- cheaper coprocessor chip, that also can go at
- higher frequencies than the 83D87. The new design
- uses a slower hardware multiplier that needs 6
- clock cycles to multiply the floating point mantissa
- of an internal precision number, while the multiplier
- in the 83D87 takes only 4 clocks to accomplish the
- same task. Since the transcendental functions are
- generated by polynomial and rational approximations
- in Cyrix math coprocessors, this slows them down
- quite a bit. The divide/square root logic has also
- been changed from the 83D87 design. The original design
- used an algorithm that could generate both the
- quotient and the square root, so the execution times for
- these instructions were nearly identical. The algorithm
- chosen for the division in the 387+ doesn't allow
- the square root to be taken so easily, so it takes
- nearly twice as long. The 387+ is available in
- versions of up to 40 MHz. In the 387+, the available
- argument range for the FYL2XP1 instruction has been
- extended from the usual range -1+sqrt(2)/2..sqrt(2)/2,
- that is found on all 80x87 coprocessors, to all
- floating-point numbers. Also, four additional
- instructions have been implemented: FRICHOP (opcode
- DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC),
- and FTSTP (opcode D9 E6).
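FYL2XP1 computes y * log2(x + 1). The following sketch (plain Python, not x87 code) shows why a dedicated "log of 1 plus x" operation is useful for small arguments: forming 1 + x first throws away the low-order bits of x. The function names are made up for this illustration, and `math.log1p` only stands in for the hardware instruction.

```python
import math


def fyl2xp1(x: float, y: float) -> float:
    """Emulate y * log2(x + 1) without forming 1 + x explicitly (requires x > -1)."""
    return y * math.log1p(x) / math.log(2.0)


def naive_yl2xp1(x: float, y: float) -> float:
    """Same quantity computed the naive way; loses accuracy for tiny x."""
    return y * math.log2(1.0 + x)


if __name__ == "__main__":
    x = 1e-20
    print(naive_yl2xp1(x, 1.0))  # 1.0 + 1e-20 rounds to 1.0, so this prints 0.0
    print(fyl2xp1(x, 1.0))       # small but nonzero
```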
- Cyrix 83S87 is the SX version of the Cyrix 83D87. Just as the
- Cyrix 83D87 is the fastest 387 compatible coprocessor,
- the Cyrix 83S87 is the fastest of the 387SX compatible
- coprocessors [1]. Besides being the fastest 387SX
- 'clone', the Cyrix 83S87 also features the most
- accurate transcendental functions. 83S87 chips sold
- after about August 1992 use the internals of the
- Cyrix 387+, the successor to the original 83D87 [73].
- The 83S87 is packaged in a 68-pin PLCC and is available
- in 16, 20 and 25 MHz versions. Due to the advanced
- power saving features of the Cyrix coprocessor, the
- typical power consumption of the 20 MHz version is
- about 350 mW [67].
- ULSI 83C87 is a 387 'clone' that came out in early 1991, well
- after the IIT 3C87 and Cyrix 83D87. Like all clones,
- it is somewhat faster than the Intel 387DX. Especially
- the basic arithmetic functions are fast, while the
- transcendental functions show only a slight speed
- improvement over the Intel 387DX (see benchmark
- results below). In my tests, the ULSI had the most
- inaccurate transcendental functions. However, the
- maximum relative error is still within the limits
- set by Intel, so this is probably not an important
- issue in all but very few applications. The ULSI
- 83C87 shows some minor flaws in the tests for IEEE
- 754 compatibility, but this, too, is unimportant
- under typical operating conditions. It is interesting
- to note that a ULSI 83S87 manufactured in 92/17
- showed fewer errors in the IEEETEST test run [74] than
- the ULSI 83C87, manufactured in 91/48, that I used in
- my original test. This indicates that ULSI might
- have applied some quick fixes to newer revisions
- of their math coprocessors. ULSI claims that the
- program IEEETEST, which was used to test for IEEE
- compatibility, contains many personal interpretations
- of the IEEE standard by the program's author and
- states that there is no ANSI-certified IEEE-754
- compliance test. While this may be true, it
- is also a fact that the IEEE test vectors used in
- IEEETEST are sort of an industry standard and that
- Intel's 387, 486, and RapidCAD chips pass it
- without a single failure, as do the coprocessors from
- Cyrix. Since the ULSI Math*Co 83C87 fails some of
- the tests, it is certainly less than 100% compatible
- with Intel's chips, although this will hardly make
- any difference in typical operating conditions.
- The ULSI 83C87 is not fully compatible with
- IEEE-754 in that it does not implement the precision
- control feature. While all the internal operations of
- 80x87 coprocessors are usually done with the maximum
- precision available (double extended precision with
- 64 mantissa bits), the 80x87 also offers the possibility
- to force lower precision to be used for the basic
- arithmetic functions add, subtract, multiply, divide,
- and square root. This feature is required by IEEE-754
- for all coprocessors that cannot store results
- *directly* to a single or double precision location.
- Since the 80x87 coprocessors lack this capability,
- they have to implement precision control to provide
- correctly rounded single and double precision results
- according to the floating-point standard. All 80x87
- coprocessors except the ones from ULSI support this
- feature. For programs that make use of precision
- control, e.g. Interactive UNIX, correct implementation
- of the feature may be essential for correct arithmetic
- results. Like the other 387 'clones', the 83C87 does
- not support asynchronous operation of the CPU and the
- coprocessor. This means that the 83C87 always runs at
- the full speed of the CPU. The ULSI 83C87 is available
- in 20, 25, 33, and 40 MHz versions. The ULSI is
- produced in high performance, low power CMOS. Power
- consumption at 20 MHz is max. 800 mW (400 mW typical),
- at 25 MHz it is max. 1000 mW (500 mW typical), at 33
- MHz it is max. 1250 mW (625 mW typical), and at 40 MHz the
- ULSI Math*Co 83C87 consumes max. 1500 mW (750 mW
- typical) [58]. The 83C87 is packaged in a 68-pin
- ceramic PGA. ULSI coprocessors come with a lifetime
- warranty. ULSI Systems, Inc. will replace the
- coprocessor up to three times free of charge should
- it ever fail.
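The reason precision control matters, as discussed in the ULSI 83C87 entry, is double rounding: rounding a result to 64 mantissa bits and then again to 53 bits can give a different answer than rounding once to 53 bits. The sketch below models this with exact rational arithmetic. The helper `round_to_bits` and the test value are constructed for this illustration and model only round-to-nearest-even on positive numbers.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round a positive rational to `bits` significant bits, ties to even."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    return round(x / ulp) * ulp  # round() on a Fraction rounds half to even


# A value just above the midpoint between two adjacent double precision numbers.
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**65)

direct = round_to_bits(x, 53)                      # one rounding, as with precision control
double = round_to_bits(round_to_bits(x, 64), 53)   # extended first, then double

if __name__ == "__main__":
    print(direct == 1 + Fraction(1, 2**52))  # True: rounded up
    print(double == 1)                       # True: rounded down after two steps
```

The first rounding to 64 bits lands exactly on the midpoint, so the second rounding resolves a tie that never existed in the exact result.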
- ULSI 83S87 is the SX version of the ULSI 83C87 for operation
- with an Intel 387SX or an AMD Am387SX. It is
- functionally equivalent to the 83C87. To aid low
- power laptop designs, the ULSI 83S87 features an
- advanced power saving design with a sleep mode and
- a standby mode with only minimal power requirements.
- Power consumption under normal operating conditions
- (dynamic mode) is max. 400 mW at 16 MHz (300 mW
- typical), max. 450 mW at 20 MHz (350 mW typical),
- and max. 500 mW at 25 MHz (400 mW typical) [58].
- The ULSI 83S87 is packaged in a 68-pin PLCC.
- C&T 38700DX is the latest entry into the 387 'clone' market.
- Originally announced in October, 1991, it has
- apparently not been available to end users before
- third quarter of 1992, at least here in Germany.
- The C&T 38700DX is compatible with the Intel 387DX.
- My tests show that compatibility is indeed very good,
- even for the more arcane features of the 387DX and
- comparable to the coprocessors from Cyrix. Like
- the coprocessors from Cyrix and Intel, it passes
- the IEEETEST program without a single failure. It
- passes, of course, all tests in Chips & Technologies'
- own compatibility test program SMDIAG. However, some
- of the tests (transcendental functions) in this program
- are selected in such a way that the C&T 38700 passes
- while the Cyrix 83D87 or Intel RapidCAD fail, so they
- are not very useful. There is also a 'bug' in the test
- for FSCALE that hides a true bug in the C&T 38700. In
- my own speed tests [see below] and those reported in
- [1], the C&T 38700DX showed performance at about 90-
- 100% the level of the Cyrix 83D87, which is the 387
- 'clone' with the highest performance. For floating
- point intensive benchmarks the C&T 38700DX provides up
- to 50% more computational power than the Intel 387DX.
- However, as with all other 387 compatible coprocessors,
- the speed advantage over the Intel 387DX is hardly
- measurable in real applications. The accuracy of the
- transcendental functions on the C&T 38700DX varies.
- Overall accuracy of the transcendental functions is
- slightly better than on the Intel 387DX. The SuperMATH
- 38700DX is implemented in 1.2 micron CMOS with on-chip
- power management, which makes for low power consumption.
- The 38700DX is packaged in a 68-pin ceramic PGA (pin
- grid array) and is available in speeds of 16, 20, 25, 33,
- and 40 MHz.
- C&T 38700SX is the SX version of the 38700DX and compatible with
- the Intel 387SX. It provides performance similar to
- the Cyrix 83S87 [1], the 387SX 'clone' with the
- highest performance. Compatibility with the Intel
- 387SX is very good and comparable with the high degree
- of compatibility found in the Cyrix 83S87. It
- has low power consumption. The SuperMATH 38700SX is
- packaged in a 68-pin PLCC (plastic leaded chip carrier)
- and available in speeds of 16, 20, and 25 MHz.
- Intel RapidCAD is not a coprocessor, strictly speaking, although it
- is marketed as one. Rather, it is a CPU replacement.
- It is basically an Intel 486DX without the cache and
- with a 386 pinout. RapidCAD is delivered as a set of
- two chips. RapidCAD-1 goes into the 386 socket and
- contains the CPU and FPU, RapidCAD-2 goes into the
- coprocessor socket and contains a PAL that generates
- the Ferr signal that is normally generated by a
- coprocessor and used by the motherboard circuitry to
- provide 287 compatible coprocessor exception handling
- in 386/387 systems. The RapidCAD instruction set is
- compatible with the 386, so it doesn't know the 486
- specific instructions like BSWAP. Since the RapidCAD
- CPU core is very similar to the 486 CPU core, most of the
- register to register instructions execute in the same
- number of clock cycles as on the 486. The use of the
- 386 bus interface causes instructions that access memory
- to execute at about the same speed as on the 386. The
- integer performance on the RapidCAD is definitely
- limited by the low memory bandwidth provided by the
- 386 bus interface (2 clock cycles per bus cycle)
- and the lack of an internal cache. CPU instructions
- often execute faster than they can be fetched from
- memory, even with a big and fast external cache.
- Therefore, the integer performance of the RapidCAD
- exceeds that of a 386 by *at most* 35%. This value
- was derived by running some programs that use
- mostly register-to-register operations and few
- memory accesses. This finding is supported by the
- SPEC ratings that Intel reports for the 386-33
- and the RapidCAD-33. While the 386-33 has a
- SPECint of 6.4, the RapidCAD has a SPECint of 7.3
- [28], a 14% increase. Note that these tests used
- the old (1989) SPEC benchmark suite. While CPU
- instructions often execute in one clock cycle on
- the RapidCAD, FPU instructions always take more
- than seven clock cycles. They are therefore rarely
- slowed down by the low memory bandwidth provided
- by the 386 bus interface. My tests show a 70%-100%
- performance increase for floating-point intensive
- benchmarks (see below) over a 386 based system
- using the Intel 387DX math coprocessor. This is
- consistent with the SPECfp rating reported by Intel.
- The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
- the RapidCAD is rated at 6.1 SPECfp at the same
- frequency, an 85% increase. This means that a system
- that uses the RapidCAD is faster than any 386/387
- combination, regardless of the type of 387 used
- (Intel 387DX or faster clone). The diagnostic disk
- for the RapidCAD also gives some application
- performance data for the RapidCAD compared to the
- Intel 387DX:
-
- Application            Time w/ 387DX   Time w/ RapidCAD   Speedup
-
- AUTOCAD 11                    52 sec             32 sec       63%
- AutoShade/Renderman          180 sec            108 sec       67%
- Mathematica (Windows)        139 sec            103 sec       35%
- SPSS/PC+ 4.01                 17 sec             14 sec       21%
-
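The speedup column in the table above follows from the two timing columns as (old time / new time - 1). A minimal sketch that recomputes it is shown below; the rounding to the nearest whole percent (half up) is an assumption about how the published figures were derived.

```python
def speedup_percent(time_387dx: float, time_rapidcad: float) -> int:
    """Speedup of the RapidCAD over the 387DX in whole percent (rounded half up)."""
    return int((time_387dx / time_rapidcad - 1.0) * 100.0 + 0.5)


# (seconds with 387DX, seconds with RapidCAD), taken from the table
TIMINGS = {
    "AUTOCAD 11": (52, 32),
    "AutoShade/Renderman": (180, 108),
    "Mathematica (Windows)": (139, 103),
    "SPSS/PC+ 4.01": (17, 14),
}

if __name__ == "__main__":
    for name, (old, new) in TIMINGS.items():
        print(f"{name:24s}{speedup_percent(old, new):3d}%")
```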
- RapidCAD is available in 25 MHz and 33 MHz versions.
- It is distributed through different channels than the
- other Intel math coprocessors. Therefore, I have been
- unable to obtain a data sheet for it. The RapidCAD-1
- chip gets quite hot when operating and it can be
- assumed that its power consumption is similar to
- the 486-33. Therefore, I recommend extra cooling
- for this chip (see the paragraph below on the 486 for
- details). The RapidCAD-1 is packaged in a 132-pin
- PGA, just like the 80386, and the RapidCAD-2 is
- packaged in a 68-pin PGA like an 80387 coprocessor.
- Intel 486DX is not a coprocessor. This chip, brought out in
- 1989, functionally combines the CPU (a heavily pipelined
- implementation of the 386 architecture) with an
- enhanced 387 (the floating-point unit, FPU) and
- 8 kB of unified code/data cache on one chip. Of
- course, this description is simplified; for a
- detailed hardware description, see [52]. The
- 486DX offers about two to three times the integer
- performance of a 386 at the same frequency.
- Floating point performance is about three to four
- times as high as on the Intel 387DX at the same
- clock rate [29]. Since the FPU is on the same
- chip as the CPU, the considerable communication
- overhead between CPU and coprocessor in a 386/387
- system is omitted, letting FPU instructions run
- at the full speed permitted by the implementation.
- The FPU also takes advantage of the on-chip cache
- and the highly pipelined execution unit. Besides
- the higher speed, the 486 FPU features more accurate
- transcendental functions than the Intel 387DX
- coprocessor according to tests run by me (see below).
- To achieve better interrupt latency, FPU instructions
- with a long execution time have been made abortable
- in the case an interrupt occurs during their
- execution. The concurrent execution of CPU and
- coprocessor instructions typical for 80x86/80x87
- systems is still in existence on the 486, but
- some FPU instructions like FSIN have nearly no
- concurrency with CPU instructions, indicating
- that they make heavy use of both CPU and FPU
- resources [53, 1]. The 486DX comes in a 168 pin
- ceramic PGA (pin grid array). It is available in
- 25 MHz and 33 MHz versions. Since the end of 1991,
- a 50 MHz version has also been available, produced in
- a CHMOS V process (the 25 MHz and 33 MHz versions are
- produced using the CHMOS IV process). Maximum
- power consumption is 3500 mW for the 25 MHz 486
- (2600 mW typical), 4500 mW for the 33 MHz version
- (3500 mW typical), and 5000 mW (4000 mW typical)
- for the 50 MHz chip. Due to the considerable amount
- of heat produced by these chips, and taking into
- consideration the slow air flow provided by the
- fan in garden variety PC tower cases, I recommend
- an extra fan directly above the CPU for safer
- operation. If you measure the surface temperature
- of an i486 in a normal tower case without extra
- cooling after some time of operation, you may well
- come up with something like 80 - 90 degrees Celsius
- (that is 176 - 194 degrees Fahrenheit for those not
- familiar with metric units) [54,55]. You don't need
- the well known and expensive IceCap(tm) to effectively
- cool your CPU. A simple fan mounted directly above
- the CPU can bring the temperature down to about 50
- to 60 degrees Celsius (122 - 140 degrees Fahrenheit)
- depending on the room temperature and the temperature
- within the PC case (which depends on the total power
- dissipation of all the components and the cooling
- provided by the fan in the power unit). According
- to a simple rule known as Arrhenius' Law, lowering
- the temperature by 10 degrees Celsius slows down
- chemical reactions by a factor of two. Lowering
- the temperature of your CPU by 30 degrees should thus
- prolong the life of the device by a factor of eight
- due to the slower aging process. If you are reluctant
- to add a fan to your system because of the additional
- noise, settle for a low-noise fan like those
- available from the German manufacturer Pabst (this
- is not meant to be an advertisement. I am just the
- happy owner of such a fan. Besides that, I have no
- connections to the firm).
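The temperature figures and the rule of thumb quoted in the 486DX entry can be checked with a few lines. This is only the simple "factor of two per 10 degrees Celsius" rule stated in the text, not a full Arrhenius model; the function names are made up for this sketch.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def lifetime_factor(cooling_celsius: float, doubling_step: float = 10.0) -> float:
    """Expected lifetime gain when the chip runs `cooling_celsius` degrees cooler."""
    return 2.0 ** (cooling_celsius / doubling_step)


if __name__ == "__main__":
    for c in (80, 90, 50, 60):
        print(f"{c} C = {celsius_to_fahrenheit(c):.0f} F")
    print(f"30 degrees cooler: lifetime x {lifetime_factor(30):.0f}")
```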
- Intel 486DX2 is the name for Intel's latest generation of 486 CPUs.
- Using the DX2 suffix instead of simply DX is meant
- to be an indicator that these are clock-doubled
- versions. A normal 486DX operates at the frequency
- provided by the incoming clock signal. A 486DX2
- generates a new clock signal from the incoming clock
- by means of a PLL (phase locked loop). In the DX2,
- this clock signal has twice the frequency of the
- incoming clock, hence the name clock-doubler. All
- internal parts of the 486DX2 (cache, CPU core, FPU)
- run at this higher frequency. Only the bus interface
- runs at the normal speed. That way, a 486DX2-50 can
- run on a motherboard designed for 25 MHz operation.
- Since motherboards for 50 MHz operation are much
- harder to design than those for 25 MHz, this makes
- a 486DX2-50 system easier to build and cheaper than
- a 486DX-50 system. For all operations that don't
- access off-chip resources (e.g. register operations)
- a 486DX2-50 provides exactly the same performance as
- a 486DX-50 and twice the performance of a 486DX-25.
- However, since the main memory in a 486DX2-50 system
- still operates at 25 MHz, all instructions involving
- memory accesses are potentially slower than in a
- 486DX-50 system, whose memory also runs at 50 MHz.
- The internal cache of the 486 alleviates this problem a
- bit, but overall performance of a 486DX2-50 is still
- lower than that of a 486DX-50, although Intel's
- documentation [32] shows this drop to be quite small.
- It depends a lot on the code one runs, though. The
- nice thing about the 486DX2 is that it allows easy
- upgrading of 25 and 33 MHz 486 systems, since the
- 486DX2 is completely pin-compatible with the 486DX.
- Just take out the 486DX and plug in the new 486DX2.
- Note that power consumption of the 486DX2-50 equals
- that of the 486DX-50 (4000 mW typical), and that the
- 486DX2-66 exceeds this by about 30%. These chips get
- really hot in a standard PC case with no extra cooling.
- See the above paragraph for more detailed information
- on this problem.
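A rough way to reason about the clock-doubling trade-off described for the 486DX2 is to split run time into on-chip work, which runs at the doubled clock, and bus-bound work, which does not. The sketch below is a simplified model under that assumption; it ignores cache effects and overlap, and `core_fraction` is a hypothetical parameter, not a figure from Intel's documentation.

```python
def dx2_speedup(core_fraction: float, multiplier: float = 2.0) -> float:
    """Estimated speedup of a clock-multiplied CPU over the same CPU at bus speed.

    core_fraction is the share of the original run time spent on operations
    that never leave the chip (registers, internal cache hits).
    """
    if not 0.0 <= core_fraction <= 1.0:
        raise ValueError("core_fraction must lie between 0 and 1")
    bus_bound = 1.0 - core_fraction
    return 1.0 / (bus_bound + core_fraction / multiplier)


if __name__ == "__main__":
    for fraction in (1.0, 0.8, 0.5, 0.0):
        print(f"core share {fraction:.0%}: speedup x {dx2_speedup(fraction):.2f}")
```

Pure register code gets the full factor of two, while code that waits on the 25 MHz bus gains little, which matches the qualitative picture in the text.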
- Intel 487SX is the coprocessor intended for use in 486SX systems.
- The 486SX is basically a 486DX without the floating-
- point unit (FPU) [48, 50]. Originally Intel sold
- 486DXs with a defective FPU as 486SXs but it has
- now completely removed the FPU part from the 486SX
- mask for mass production. The introduction of the
- 486SX in 1991 has been viewed mainly as a marketing
- 'trick' by Intel to take market share from the 386
- based systems once AMD became successful with their
- Am386 (AMD has taken as much as 40% of the 386 market
- due to some superior features such as higher clock
- frequency, lower power consumption, and a fully
- static design). A 486SX at 20 MHz delivers a bit
- less integer performance than a 40 MHz Am386. To add
- floating-point capabilities to a 486SX based system,
- it would be easiest to swap the 486SX with a 486DX
- which includes the FPU. However, Intel has prevented
- this easy solution by giving the 486SX a slightly
- different pin out [48, 51]. Since only three pins
- are assigned differently, clever board manufacturers
- have come out with boards that accept anything from
- a 486SX-20 to a 486DX2-50 in their CPU socket and
- provide a clean upgrade path this way. A set of
- three jumpers ensures correct signal assignment to
- the pins for either configuration. To upgrade systems
- without this feature, one has to buy the 487SX and
- put it into the "Performance Upgrade Socket" present
- in most 486SX systems. Once the 487SX was available,
- it was quickly found out that it is just a normal
- 486DX with a slightly different pin out [49]. Inserting
- the 487SX effectively shuts down the 486SX in the
- 486SX/487SX system, so the 486SX could be removed
- once the 487SX is installed. Since the shut down is
- logical, not electrical, the 486SX still uses power
- if used with the 487SX, although it is inoperative.
- Technically speaking, the solution Intel chose was
- the only practical way to provide a 486SX system with
- the high level of floating-point performance the
- 486DX offers. The CPU and FPU have to be on the same
- chip, otherwise the FPU can not make use of the cache
- on the CPU chip and there would be considerable
- overhead in CPU-FPU communication (similar to a
- 386/387 system), nullifying most of the arithmetic
- speedups over the 387. That the 486SX, 487SX, and
- 486DX are not pin-compatible seems to be purely for
- marketing reasons. To upgrade a 486SX based system,
- Intel also offers the OverDrive chip, which is just
- the same as a 487SX with internal clock doubling. It
- also goes into the "Performance Upgrade Socket"
- found in 486SX systems. The OverDrive roughly doubles
- the performance of a 486SX/487SX based system. For an
- explanation of clock doubling, see the description
- of the 486DX2 above. Like the 486SX, the 487SX is
- available in 20 MHz and 25 MHz versions. At 20 MHz,
- the 487SX has a power consumption of max. 4000 mW.
- It is available in a 169 pin ceramic PGA (pin grid
- array).
- Weitek 1167 was the predecessor to the Weitek Abacus 3167 math
- coprocessor. It was actually a small printed circuit
- board with three chips mounted on it. As opposed to
- the Weitek 3167, the 1167 did not have a square root
- instruction, instead the square root function was
- computed by means of a subroutine in the Weitek
- transcendental function library. However, the 1167
- did have a mode in which it supported denormals.
- The Weitek 3167 and 4167 only implement the 'fast'
- mode, in which denormals are not supported. Overall
- performance of the 1167 is slightly less than that
- of the Weitek 3167.
- Weitek 3167 was introduced in 1989 to provide the fastest
- floating point performance possible on a 386 based
- system at that time. The Weitek Abacus 3167 is not
- a real coprocessor, strictly speaking, but rather
- a memory mapped peripheral device. The Weitek 3167
- was optimized for speed wherever possible. Besides
- using the faster memory mapped interface to the CPU
- (the 80x87 uses IO-ports), it does not support many
- of the features of the 80x87 coprocessors, allowing
- all of the chip's resources to be concentrated on
- the fast execution of the basic arithmetic operations.
- For a more detailed description of the Weitek 3167 see
- the first chapter of this document. In benchmark
- comparisons, the Weitek 3167 provided up to 2.5 times
- the performance of an Intel 387DX coprocessor. For
- example, on a 33 MHz 3167 the Whetstone benchmark
- performed at 7574 kWhetstones/sec compared with the
- 3743 kWhetstones/sec for the Intel 387DX. Note
- however that these are single precision results and
- that the Weitek 3167's performance would drop to
- about half the stated rate for double precision,
- while the value for the Intel 387DX would not change
- much. Anyhow, before the advent of the Intel RapidCAD,
- the Weitek 3167 usually beat all 387 compatible
- coprocessors even for double precision operations
- [63,65,69]. For typical applications the advantage
- of the Weitek 3167 over the 387 clones is much smaller.
- In a benchmark test using AutoDesk's 3D-Studio the
- Weitek 3167 performed at 123% of the Intel 387DX's
- performance compared with 106% for the Cyrix FasMath
- 83D87 and 118% for the Intel RapidCAD. The Weitek
- Abacus 3167 is packaged in a 121-pin PGA that fits
- into an EMC socket provided by most 386 based systems.
- It does *not* fit into the normal coprocessor socket
- designed to hold a 387 compatible coprocessor in a
- 68-pin PGA. To get the best of both worlds, one might
- want to use a Weitek 3167 and a 387 compatible
- coprocessor in the same system. These coprocessors
- can coexist in the same system just fine. The only problem
- is that most 386 based systems contain only one
- coprocessor socket, usually of the EMC (extended math
- coprocessor) type. Thus, you can install either a
- 387 coprocessor or a Weitek 3167, but not both. There
- are little daughter boards available though that fit
- into the EMC socket and provide two sockets, an EMC
- and a standard coprocessor socket. At 25 MHz, the
- Weitek 3167 has a power consumption of max. 1750 mW.
- At 33 MHz, the max. power consumption is 2250 mW.
- Weitek 4167 is a memory mapped coprocessor that has the same
- architecture as the 3167 and is designed to provide
- 486 based systems with the highest floating point
- performance available. It executes coprocessor
- instructions at three to four times the speed of
- the Weitek 3167. Although it is up to 80% faster
- than the Intel 486 in some benchmarks [1,69], the
- performance advantage for real applications is more
- like 10%. The introduction of the 486DX2 processors
- has more or less obliterated the need for a Weitek
- 4167, since the DX2 CPUs provide the same performance
- and all the additional features the 80x87 has over
- the Weitek Abacus. The Weitek 4167 is packaged in
- a 142-pin PGA package that is only slightly smaller
- than the 486's package. At 25 MHz, it has a max.
- power consumption of 2500 mW [32].
-
-
-
- Pricing
-
- Due to recent price cuts by Cyrix and subsequently by Intel
- for 387 coprocessors, prices have dropped significantly for all
- 287 and 387 compatible coprocessors with hardly any price difference
- between manufacturers. 387DX compatible coprocessors typically sell
- for ~US$ 100 for all speeds except for 40 MHz versions which are
- typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90
- regardless of speed with the exception of the 33 MHz versions, which
- are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87
- and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive,
- the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD
- for US$ 320 and haven't seen it offered for a better price. I see
- the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33
- being offered for US$ 1100. This price information reflects the
- price situation as of 08-14-92. Prices can be expected to drop
- slightly in the near future.
-
- If you have a demand for high floating-point performance, you
- should consider buying a 486 based system rather than
- a 386 based system with an additional coprocessor. A 386 mother
- board for 33 MHz operation sells for ~ US$ 300, together with the
- coprocessor, cost totals ~ US$ 400. A 486-33 ISA-board sells for
- US$ 650. While the 486-33 system is 60% more expensive than the
- 386/387 system, it also provides 100% more integer and floating-
- point performance (twice the performance). If you want to push
- your 386 based system to maximum floating-point performance and
- can't switch to a 486 based system for some reason, I recommend
- the Intel RapidCAD. It is both faster [1] and cheaper than installing
- a Weitek Abacus 3167 with your 386, which used to be the highest
- performing combination before the RapidCAD came out. Similarly,
- the introduction of the 486DX2 clock-doubler chips has obliterated
- the need for a Weitek 4167 to get maximum floating-point performance
- out of a 486 based system. A 486DX2-66 performs at or above the
- performance level of a 33 MHz Weitek 4167, even if the latter
- uses single precision rather than double precision. The 486DX2-66
- is rated by Intel at 24700 double precision kWhetstones/sec and
- 3.1 double precision Linpack MFLOPS. Of course, these benchmarks
- used the highest performance compilers available. But even with
- a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision
- MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description
- of the benchmarks mentioned, see the paragraph on benchmarks below).
- Although I haven't yet seen 486DX2-66 processors being offered
- to end users for upgrade purposes, I recommend the 486DX2-66
- to those that need highest floating-point performance and are
- planning to buy a new PC. The price difference between a 33 MHz
- 486DX motherboard and a 486DX2-66 motherboard is around US$ 600,
- well below the price for the Weitek Abacus 4167.
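
The 386/387 versus 486 price comparison in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch: the prices and the factor of two in performance are the figures quoted in the text, and the variable names are made up for illustration.

```python
# Figures quoted in the text (US$, August 1992); names are illustrative only.
cost_386_387 = 400.0   # 33 MHz 386 motherboard plus 387 coprocessor
cost_486 = 650.0       # 486-33 ISA board
perf_386_387 = 1.0     # baseline performance
perf_486 = 2.0         # "twice the performance"

# How much more the 486 board costs, in percent.
extra_cost_pct = (cost_486 / cost_386_387 - 1.0) * 100.0

# Dollars paid per unit of relative performance.
cost_per_perf_386 = cost_386_387 / perf_386_387
cost_per_perf_486 = cost_486 / perf_486

print(f"486 board costs {extra_cost_pct:.1f}% more")
print(f"US$ per unit of performance: 386/387 = {cost_per_perf_386:.0f}, "
      f"486 = {cost_per_perf_486:.0f}")
```

The extra cost works out to 62.5%, which the text rounds to 60%, while the cost per unit of performance is lower for the 486 board.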
-
-
-
- Operation
-
- In an 80x86/80x87 system, CPU instructions and coprocessor
- instructions are executed concurrently. This means that
- the CPU can execute CPU instructions while the coprocessor
- executes a coprocessor instruction at the same time. The
- concurrency is restricted somewhat by the fact that the
- CPU has to aid the coprocessor in certain operations. As
- the CPU and the coprocessor are fed from the same instruction
- stream and both instruction streams may operate on the same
- data, there has to be a synchronizing mechanism between the
- CPU and the coprocessor.
-
- In an 8086/8087 or 8088/8087 system, both of the chips look at the
- opcodes coming in from the bus. To do this, both chips have
- the same BIU (bus interface unit) and the 8086 BIU sends the
- status signals of its prefetch queue to the 8087 BIU. This
- assures that both processors always decode the same instructions
- in parallel. Since all coprocessor instructions start with the
- bit pattern 11011, it is easy for the 8087 to ignore all other
- instructions. Likewise the CPU ignores all coprocessor instructions
- except if they access memory. In this case, the CPU computes
- the address of the LSB (least significant byte) of the memory
- operand and does a dummy read. The 8087 then takes the data
- from the data bus. If more than one memory access is needed to
- load a memory operand, the 8087 requests the bus from the CPU,
- generates the consecutive addresses of the operand's bytes
- and fetches them from the data bus. After completing the operation,
- the 8087 hands bus control back to the CPU. Since 8087 and CPU
- are hooked up to the same synchronous bus, they have to run at
- the same speed. This means that with the 8087, only synchronous
- operation of CPU and coprocessor is possible. Another 8087
- coprocessor instruction can only be started if the previous one
- has been completed in the NEU (numerical execution unit) of the
- 8087. To prevent the 8086 from decoding a new coprocessor
- instruction while the 8087 is still executing the previous
- coprocessor instruction, the following mechanism is used: The
- compilers and assemblers automatically generate a WAIT instruction
- before each coprocessor instruction. The WAIT instruction tests
- the /TEST pin until its input becomes "LOW". In 8086/8087 systems,
- the 8086 /TEST pin is connected to the 8087 BUSY pin. As long
- as the NEU executes a coprocessor instruction, it forces its
- BUSY pin "HIGH". Thus the WAIT instruction in front of every
- coprocessor instruction stops the CPU until a still executing
- previous coprocessor instruction has finished. The same
- synchronization is used before the CPU accesses data that
- was written by the coprocessor. A WAIT instruction after the
- coprocessor instruction that writes to memory causes the CPU to
- stop until the coprocessor has transferred the data to memory,
- after which the CPU can safely access the data.
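
The opcode test described above (all coprocessor instructions start with the bit pattern 11011) can be sketched in a few lines. This is a minimal illustration, not a model of the 8087 decoder; the function name is made up, and instruction prefixes are ignored.

```python
def is_coprocessor_opcode(opcode_byte: int) -> bool:
    """Return True if the top five bits of the opcode byte are 11011."""
    return (opcode_byte >> 3) == 0b11011

# Enumerate every byte value that carries the coprocessor bit pattern.
esc_opcodes = [b for b in range(256) if is_coprocessor_opcode(b)]
print([f"{b:02X}h" for b in esc_opcodes])
```

Only the eight values D8h through DFh match, so both chips can separate coprocessor instructions from CPU instructions by looking at a single byte. The WAIT instruction (opcode 9Bh) does not match and is therefore executed by the CPU.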
-
- With the help of an additional chip, the 8087 can also be inter-
- faced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
- from Philips, Siemens) in the 1982/1983 time frame, but with
- the introduction of the IBM AT which used the 80286, it lost all
- significance for the PC market. The 80C186 (CMOS version of the
- 80186) nowadays sells as an embedded controller and can be combined
- with a 80C187 coprocessor which is based on the internals of the
- Intel 387 [37].
-
- The 80287 CPU-interface is totally different from the solution
- used in the 8087. Since the 80286 implements memory protection
- via an MMU based on segmentation, it would have been much too
- expensive to duplicate the whole protection logic on the coprocessor
- for an interface solution similar to the 8087. In a 80286/80287
- system, the CPU fetches and stores all opcodes and operands for
- the coprocessor. Information is passed through ports F8h - FFh.
- As these ports are accessible under program control, care must
- be taken not to accidentally perform write operations to them, as
- this could corrupt the information in the math coprocessor.
- The execution unit of the 80287 is practically identical to that
- of the 8087, that is, nearly all coprocessor instructions execute
- in the same number of clock cycles on both coprocessors. Due to
- the additional overhead of the CPU/coprocessor interface (at
- least ~40 clock cycles), an 8 MHz 80286/80287 combination can be
- slower than an 8086/8087 system running at the same speed for
- floating point intensive programs. Additionally, most of the
- older 286 boards were configured to run the coprocessor at 2/3
- the speed of the CPU, making use of the ability of the 80287
- to run asynchronously with the CPU. The 80287 has a CKM pin that
- causes the incoming system clock to be divided by three for
- the coprocessor if it is tied to ground. The 80286 always
- divides the system clock by two internally. Thus the ratio 2/3.
- However, when the CKM (ClocK Mode) pin is tied high on the 80287,
- it does not divide the CLK input. This feature has been exploited
- by the makers of coprocessor speed sockets. These sockets tie
- CKM high and supply their own CLK signal with a built-in oscillator,
- thereby allowing the 80287 or compatible to run at a much higher
- speed than the CPU. With an IIT or Cyrix 287 one can have a
- 20 MHz coprocessor running with a 8 MHz 80286. Note however that
- the floating-point performance in such a configuration does not
- scale linearly with the coprocessor clock, since all the data
- has to be passed through the much slower CPU. If the coprocessor
- executes mostly simple instructions such as addition and multiplication,
- doubling the coprocessor clock in a 10 MHz system to 20 MHz does
- not show any performance increase at all [24]. The 80C287 by AMD
- is a 100% clone of the original Intel 80287, but is produced in
- CMOS not in NMOS as the original Intel chip. This makes for lower
- power consumption.
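
The clock relationships described above can be written down as a small calculation. This is a sketch under the divider rules stated in the text (80286: divide by two; 80287: divide by three with CKM grounded, no division with CKM high); the 16 MHz system clock and the function names are assumptions chosen for illustration.

```python
def cpu_clock_mhz(system_clk_mhz: float) -> float:
    """The 80286 always divides the system clock by two internally."""
    return system_clk_mhz / 2.0

def coprocessor_clock_mhz(clk_input_mhz: float, ckm_high: bool) -> float:
    """80287 clock: CLK undivided if CKM is tied high, CLK/3 if tied to ground."""
    return clk_input_mhz if ckm_high else clk_input_mhz / 3.0

# Hypothetical board: 16 MHz system clock, which gives an 8 MHz 80286.
system_clk = 16.0
cpu = cpu_clock_mhz(system_clk)
fpu = coprocessor_clock_mhz(system_clk, ckm_high=False)
print(f"CPU {cpu:.2f} MHz, 80287 {fpu:.2f} MHz, ratio {fpu / cpu:.3f}")

# Speed socket: CKM tied high and a separate 20 MHz oscillator on CLK.
fpu_socket = coprocessor_clock_mhz(20.0, ckm_high=True)
print(f"80287 in speed socket: {fpu_socket:.2f} MHz")
```

With CKM grounded the coprocessor runs at 2/3 of the CPU speed regardless of the system clock chosen, since the ratio (CLK/3)/(CLK/2) does not depend on CLK.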
-
- The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals
- of a 387 coprocessor, but are pin-compatible to the original 287.
- However, these chips divide the system clock by two internally,
- as opposed to three in the original Intel 80287. Since the 80286
- also divides the system clock by two, they usually run synchronously
- with the CPU. They can also run asynchronously, though.
-
- The 8086/8087 combination can be characterized as a cooperation of
- partners with equal rights, while the 80286/287 is more a master-
- slave relationship. This makes synchronization much easier, since
- the complete instruction and data flow of the coprocessor goes thru
- the CPU. Before executing most coprocessor instructions, the 80286
- tests its /BUSY pin which is hooked up to the 287 coprocessor and
- signals if the 80287 is still executing a previous coprocessor
- instruction or has encountered an exception. The 80286 then waits
- until the 80287 is not busy before loading the coprocessor instruction
- into the coprocessor. Therefore, a WAIT instruction before every
- coprocessor instruction is not required. These WAITs are permissible,
- but not necessary in 80287 programs. The second form of WAIT
- synchronisation after the coprocessor has written a memory operand is
- still necessary on 286/287 systems.
-
- The coprocessor interface in 80386/80387 systems is very similar to
- the one found in 286/287 systems. However, to prevent corruption
- of the coprocessor's contents by programming errors, the IO-ports
- 800000F8h - 800000FFh are used, which are not user accessible. The
- interface has been optimized and uses 32-bit transfers. The overhead
- of the interface has been reduced to about 16-20 clock cycles. For
- some operations on the 387 'clones' that take less than 16 clock
- cycles to complete, this effectively limits the execution rate of
- coprocessor instructions. The only sensible solution to provide
- even higher floating point performance was to integrate the CPU
- and coprocessor functionality onto the same chip. This is what
- Intel did with the 80486. The FPU in the 486 also benefits from
- the instruction pipelining and from the integrated cache.
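
The effect of the interface overhead on fast coprocessor operations can be illustrated with a deliberately simplified model: assume each coprocessor instruction occupies at least as many clock cycles as the interface overhead, however fast the coprocessor itself is. The 16-cycle overhead is the lower figure from the text; the model, the function names, and the example cycle counts of 5 and 30 are assumptions, not measurements.

```python
def effective_cycles(exec_cycles: int, overhead_cycles: int = 16) -> int:
    """Cycles one instruction occupies if the interface overhead is a floor."""
    return max(exec_cycles, overhead_cycles)

def max_rate_per_sec(clock_hz: float, exec_cycles: int,
                     overhead_cycles: int = 16) -> float:
    """Upper bound on coprocessor instructions per second under this model."""
    return clock_hz / effective_cycles(exec_cycles, overhead_cycles)

clock_hz = 33.0e6  # 33 MHz 386/387 system
for exec_cycles in (5, 30):  # hypothetical fast and slow operations
    rate = max_rate_per_sec(clock_hz, exec_cycles)
    print(f"{exec_cycles:2d}-cycle operation: at most {rate / 1e6:.2f} million/sec")
```

Under this model, any operation faster than 16 cycles is capped at roughly 2 million instructions per second at 33 MHz, so making the coprocessor core faster brings no further gain for such operations.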
-
-
-
- Performance
-
- Several computer magazines have published performance comparisons
- at the application level for the 387 coprocessors and Weitek's
- ABACUS 3167 and 4167 chips [1,25,68,70]. Applications tested included
- AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's
- 3D-Studio. For most tests, performance improvements for the 387
- clones over Intel's 387DX were small to marginal, the clones running
- the applications no more than 5% to 15% faster than the Intel 387DX.
- In the test of 3D-Studio, one of the few programs that supports
- the Weitek Abacus, the Weitek 3167 improved performance by 23%
- over an Intel 387DX and the 4167 improved performance by 10% over
- the 486 [1].
-
-
- The Intel Math Coprocessor Utilities Disk that accompanies the
- Intel 387DX coprocessor has a demonstration program that shows
- the speedup of certain application programs when run with the
- Intel coprocessor vs. a system with no coprocessor.
-
- Application        Time w/o 387    Time w/ 387    Speedup
-
- Art&Letters           87.0 sec       34.8 sec       150%
- Quattro Pro            8.0 sec        4.0 sec       100%
- Wingz                 17.9 sec        9.1 sec        97%
- Mathematica          420.2 sec      337.0 sec        25%
-
-
- The following table is an excerpt from [70]:
-
- Application        Time w/o 387    Time w/ 387    Speedup
-
- Corel Draw           471.0 sec      416.0 sec        13%
- Freedom Of Press     163.0 sec       77.0 sec       112%
- Lotus 1-2-3          257.0 sec       43.0 sec       498%
-
-
- The following table is an excerpt from [25]:
-
- Application        Time w/o 387    Time w/ 387    Speedup
-
- Design CAD, Test1     98.1 sec       50.0 sec        96%
- Design CAD, Test2     75.3 sec       35.0 sec       115%
- Excel, Test 1          9.2 sec        6.8 sec        35%
- Excel, Test 2         12.6 sec        9.3 sec        35%
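
The Speedup column in the tables above appears to be the ratio of the two run times minus one, expressed in percent. The following sketch recomputes it for a few rows; the function name is made up, and the formula is inferred from the table values rather than stated by the sources.

```python
def speedup_percent(time_without: float, time_with: float) -> float:
    """Speedup of the run with coprocessor over the run without, in percent."""
    return (time_without / time_with - 1.0) * 100.0

# (application, seconds without 387, seconds with 387), taken from the tables.
rows = [
    ("Art&Letters", 87.0, 34.8),
    ("Quattro Pro", 8.0, 4.0),
    ("Design CAD, Test1", 98.1, 50.0),
]
for name, t_without, t_with in rows:
    print(f"{name:20s} {speedup_percent(t_without, t_with):6.1f}%")
```

A speedup of 100% therefore means the program runs in half the time, not that the run time drops to zero.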
-
-
-
- The performance statistics below were put together with the
- help of four widely known numeric benchmarks and two benchmarks
- developed by me. Three Pascal programs, one FORTRAN program,
- and two assembly language programs were used. The assembly language
- programs were linked with Turbo-Pascal 6.0 for library support,
- especially to include the coprocessor emulator of the TP 6.0
- run-time library. The Pascal programs were compiled with Turbo
- Pascal 6.0 from Borland International, a non-optimizing compiler
- that produces 16-bit code. The FORTRAN program was compiled using
- MS FORTRAN 5.0, an optimizing compiler that generates 16-bit
- code. All programs except PEAKFLOP and SAVAGE, which use double
- extended precision, use double precision variables. Note that
- using a highly optimizing compiler producing 32-bit code you
- will see much higher performance for some benchmarks. For example,
- Intel rates the 33 MHz 386/387DX at 3290 KWhetstones/sec and 0.4
- double precision LINPACK MFLOPS [28,29]. The 33 MHz Intel 486 is
- rated by Intel at 12300 KWhetstones/sec and 1.6 double precision
- LINPACK MFLOPS [30]. The compilers used in these benchmarks run by
- the chip vendor are the ones that give the highest performance
- available. These compilers are in the US$ 1000+ price range.
- Some of them may be experimental or prereleased versions not
- available to the general public. The relative performance of
- one coprocessor to another could vary depending on the code
- generated by compilers. Non-optimizing compilers tend to generate
- a high percentage of operations which access variables in memory,
- while optimizing compilers produce code that contains many
- operations involving registers. Thus it is well possible that
- coprocessor A beats coprocessor B running benchmark Z if compiled
- with compiler C, but B beats A when the same benchmark is compiled
- using compiler D. All benchmarks in this overview were run from
- floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS
- and AUTOEXEC.BAT files. This way, it was made sure no TSR or
- other program unnecessarily stole computing resources from the
- benchmarks.
-
- Coprocessor performance also depends on the motherboard, or more
- specifically the chip set used on the motherboard. In [34] and [35]
- identically configured motherboards using different 386 chip sets
- were tested. Among other tests a coprocessor benchmark was run
- which is based on a fractal computation and its execution time
- recorded. The following tables, which show how coprocessor
- performance varies with the chip set, have been copied from these
- articles in abridged form.
-
-                  Cyrix                                    Cyrix
- chip set         387+                 chip set            83D87
-
- Opti,  40 MHz    24.57 sec    97.0%   PC-Chips, 33 MHz    26.97 sec    93.0%
- Elite, 40 MHz    24.46 sec    97.4%   UMC,      33 MHz    27.69 sec    90.5%
- ACT,   40 MHz    23.84 sec   100.0%   Headland, 33 MHz    25.08 sec   100.0%
- Forex, 40 MHz    23.84 sec   100.0%   Eteq,     33 MHz    27.38 sec    91.6%
-
- This shows that performance of the same coprocessor can vary by
- up to ~10% depending on the chip set used on your board, at least
- for 386 motherboards (similar numbers for 286, 386sx, and 486 are
- unfortunately not available). The benchmarks for this article were
- run on a board with the Forex chip set, which is one of the fastest
- 386 chip sets there is, not only with respect to floating-point
- performance [35].
-
-
- Description of benchmarks
-
- PEAKFLOP is the kernel of a fractal computation. It consists
- mainly of a tight loop written in assembly code and fine tuned
- to give maximum performance. All variables are held in the
- CPU's and coprocessor's registers, so the only memory access
- is for opcode fetches. The main loop contains three multiplications
- and five additions/subtractions. This ratio is fairly typical
- for other floating point intensive programs as well. The whole
- program fits nicely into even a very small CPU cache. Due to
- the nature of this program, its MFLOPS rate is unlikely to be
- exceeded by any program that calculates anything useful. Thus
- the name PEAKFLOP. You will find the source code for PEAKFLOP
- in appendix B.
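
To give a feeling for what such a kernel looks like, here is a sketch of a fractal inner loop with three multiplications and five additions/subtractions per iteration. It assumes the fractal is the Mandelbrot set, which the text does not state; it is not the PEAKFLOP source (that is assembly code), and Python timings say nothing about coprocessor speed. The function name and the operation counting are illustrative only.

```python
def mandelbrot_iterations(cx: float, cy: float, max_iter: int = 100):
    """Iterate z -> z*z + c; return (iterations done, floating-point ops)."""
    x = y = 0.0
    flops = 0
    n = 0
    while n < max_iter:
        xx = x * x             # multiplication 1
        yy = y * y             # multiplication 2
        if xx + yy > 4.0:      # addition 1 (escape test)
            flops += 3         # partial iteration: 2 mults + 1 add
            break
        xy = x * y             # multiplication 3
        y = xy + xy + cy       # additions 2 and 3 (2*x*y done by adding)
        x = xx - yy + cx       # subtraction and addition (4 and 5)
        flops += 8             # full iteration: 3 mults + 5 adds/subs
        n += 1
    return n, flops

print(mandelbrot_iterations(0.0, 0.0))   # point inside the set
print(mandelbrot_iterations(2.0, 2.0))   # point that escapes quickly
```

An MFLOPS figure follows by dividing the operation count by the elapsed time in microseconds. All working values (x, y, cx, cy and the three products) fit into the eight coprocessor stack registers, which is what allows a loop of this kind to run without data memory accesses.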
-
- TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation
- matrix (a 4x4 matrix). Each vector consists of four double precision
- values. Multiplying vectors with a matrix is a typical operation in
- the manipulation (e.g. rotation) of 3D objects which are made up from
- many vectors describing the object. This benchmark stresses addition
- and multiplication as well as memory access. For each vector, 16
- multiplications and 12 additions are used. About 256 kByte of data
- is accessed during the benchmark. TRNSFORM is implemented as an
- optimized assembler program linked with the Turbo Pascal 6.0 library.
- For the IIT 3C87, a special version was written that makes use of
- the special F4X4 instruction available on that coprocessor. F4X4
- does a full multiplication of a 4x4 matrix by a 4x1 vector in a
- single instruction. The full source code for the TRNSFORM program is
- in appendix B.
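
The operation counts and data size quoted for TRNSFORM can be reproduced with a short sketch. This is not the benchmark itself (which is optimized assembly code); the function name, the identity matrix, and the sample vector are chosen only for illustration.

```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-element vector.

    Returns (result, multiplications, additions).
    """
    result = []
    mults = adds = 0
    for row in matrix:
        acc = row[0] * vector[0]
        mults += 1
        for k in range(1, 4):
            acc += row[k] * vector[k]   # one multiplication, one addition
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds

NUM_VECTORS = 8191
BYTES_PER_DOUBLE = 8
vector_bytes = NUM_VECTORS * 4 * BYTES_PER_DOUBLE  # size of the vector array

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
out, mults, adds = transform(identity, [1.0, 2.0, 3.0, 1.0])
print(out, mults, adds, vector_bytes)
```

Each vector costs 16 multiplications and 12 additions, and the vector array occupies 262112 bytes, just under 256 kByte (262144 bytes).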
-
- LLL is short for Lawrence Livermore Loops [21], a set of kernels
- taken from real floating-point intensive programs. Some of these
- loops are vectorizable, but since we don't deal with vector
- processors here, this doesn't matter. For this test, LLL was
- adapted from the FORTRAN original [20] to Turbo Pascal 6.0. By
- variable overlaying (similar to FORTRAN's EQUIVALENCE statement)
- memory allocation for data was reduced to 64 kB, so all data fits
- into a single 64 kB segment. The older version of LLL is used here
- which contains 14 loops. There also exists a newer, more elaborate
- version consisting of 24 kernels. The kernels in LLL exercise only
- multiplication and addition. The MFLOPS rate reported is the
- average of the MFLOPS rate of all 14 kernels as reported by the
- LLL program. LLL and Whetstone results (see below) are reported
- as returned by my COMPTEST test program in which they have been
- included as a measure of coprocessor/FPU performance. COMPTEST
- has been compiled under Turbo Pascal 6.0 with all 'optimizations'
- on and using my own run-time library, which gives higher perfor-
- mance than the one included with TP 6.0. My library is available
- as TPL60N16.ZIP from garbo.uwasa.fi and ftp-sites that mirror
- this site.
-
- Linpack [5] is a well known floating-point benchmark that also
- heavily exercises the memory system. Linpack operates on large
- matrices and takes up about 570 kB in the version used for this
- test. This is about the largest program size a pure DOS system
- can accommodate. Linpack was originally designed to estimate
- performance of BLAS, a library of FORTRAN subroutines that
- handles various vector and matrix operations. It uses two routines
- from BLAS which are thought to be typical of the matrix operations
- used by BLAS. Both routines only use addition/subtraction and
- multiplication. The FORTRAN source code for Linpack can be
- obtained from the automated mail server netlib@ornl.gov. Linpack
- was compiled using MS Fortran 5.0 in the HUGE memory model (which
- can handle data structures larger than 64 kB) and with compiler
- switches set for maximum optimization. Linpack repeatedly does
- the same test. The number reported is the maximum MFLOPS rate
- returned by Linpack. Linpack MFLOPS ratings for a great number
- of machines are contained in [6]. This PostScript document is
- also available from netlib@ornl.gov.
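
As an illustration of the kind of BLAS routine involved, the sketch below shows DAXPY (y := a*x + y), which uses exactly one multiplication and one addition per vector element. The text above does not name the two routines, so picking DAXPY is an assumption; the helper names and the sample numbers are also made up. The last function shows how an MFLOPS rate follows from an operation count and a run time.

```python
def daxpy(a, x, y):
    """Return a*x + y element-wise (one multiplication, one addition each)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def daxpy_flops(n: int) -> int:
    """Floating-point operations for one DAXPY call on vectors of length n."""
    return 2 * n

def mflops(flop_count: float, seconds: float) -> float:
    """Millions of floating-point operations per second."""
    return flop_count / (seconds * 1.0e6)

print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(daxpy_flops(100))
print(mflops(2_000_000, 1.0))   # hypothetical: 2 million operations in 1 sec
```

Because a routine like this touches two operands in memory for every two arithmetic operations, its speed depends on the memory system about as much as on the coprocessor.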
-
-