- Path: sparky!uunet!kithrup!hoptoad!decwrl!elroy.jpl.nasa.gov!usc!sol.ctr.columbia.edu!ira.uka.de!uka!uka!news
- From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
- Newsgroups: comp.sys.intel
- Subject: What you always wanted to know about math coprocessors for 80x86 1/4
- Date: 19 Aug 1992 15:00:08 GMT
- Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
- Lines: 905
- Distribution: world
- Message-ID: <16tnloINNcqt@iraul1.ira.uka.de>
- NNTP-Posting-Host: irav1.ira.uka.de
- X-News-Reader: VMS NEWS 1.23
-
-
- WHAT YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS
-
-
- This document has been created to provide the net.community
- with some detailed information about mathematical coprocessors
- for the Intel 80x86 CPU family. It may also help to answer
- some of the FAQs (frequently asked questions) about this topic.
- The focus of this document is on 387 compatible chips, but
- there is also some information on the other chips in the 80x87
- family and the Weitek coprocessors. Care was taken to make the
- information included as accurate as possible. If you think you
- have discovered erroneous information in this text, or think
- that a certain detail needs to be clarified, or want to suggest
- additions to this text, feel free to contact me at:
-
- S_JUFFA@IRAVCL.IRA.UKA.DE
-
- or at my snail mail address:
-
- Norbert Juffa
- Wielandtstr. 14
- 7500 Karlsruhe 1
- Germany
-
-
- CONTENTS of this document
-
- 1) What are math coprocessors?
- 2) What applications benefit from using a math coprocessor?
- 3) Installing a math coprocessor
- 4) Description of available math coprocessors, special features,
- available speeds, packaging, power consumption
- 5) Price information
- 6) How do math coprocessors work?
- 7) Performance comparison of math coprocessors
- 8) Test for IEEE-754 conformance and accuracy of transcendental
- functions for different math coprocessors
- 9) References (literature)
- 10) Addresses of manufacturers of math coprocessors
- 11) Appendix A: Test programs for partial compatibility checks
- 12) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP
-
- What are math coprocessors?
-
- A coprocessor in the traditional sense is a processor that extends
- the capabilities of a CPU in a transparent manner. This means that
- from the programmer's view the CPU and coprocessor together look
- like one machine. The 80x87 math coprocessors are typical examples
- of such coprocessors. The 80x86 CPUs (with the exception of the 80486,
- which has a built-in 'coprocessor') can only handle 8, 16, or 32 bit
- integers as their primary data types. However, many applications
- require the use of floating-point numbers. Simply put, use of floating
- point numbers enables one to express not only integers, but also
- fractional values over a wide range. The most common application
- of floating point numbers is in scientific applications, where very
- small (e.g. Planck's constant) and very large numbers (e.g. speed
- of light) have to be expressed. But floating-point numbers are also
- useful for business applications such as computing interest. Since
- the 80x86 CPUs do not support floating-point numbers or operations
- on them directly, they have to be programmed using the CPU's integer
- capabilities. This results in slow computations when floating-point
- numbers are used. This is where the 80x87 coprocessors come in.
- Adding an 80x87 to an 80x86 based system augments the CPU architecture
- with eight floating point registers, five additional data types and
- over 70 additional mnemonics. This greatly enhances the system's
- capability to do floating-point computations, as the coprocessor is
- specifically designed to handle floating-point numbers efficiently.
- Like most things in life, floating-point arithmetic has been
- standardized. The relevant standard, to which I will refer quite
- often in this document, is IEEE-754 Standard for Binary Floating-Point
- Arithmetic [10,11]. The standard specifies numeric formats, value
- sets and how the basic arithmetic (+,-,*,/,sqrt, remainder) has to
- work. All the coprocessors covered in this document claim full or
- at least partial compliance with this standard. When browsing the
- literature for information on math coprocessors, you will also
- encounter quite a few acronyms that refer to them: MCP (Math
- CoProcessor), NDP (Numerical Data Processor), NPX (Numerical
- Processor eXtension), FPU (Floating Point Unit). The latter usually
- refers to the 'built-in coprocessor' of the i486.
-
-
- The only data type the 80x87 coprocessors (and the 80486 floating
- point unit, or FPU) can hold in their registers is an 80-bit long
- floating-point number. This data type (called temporary real or
- double extended precision) can represent numbers which range in
- size between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to
- 1.19*10^4932 including denormal numbers) where the '^' denotes the
- power operator. For those familiar with floating point formats, this
- format has 64 mantissa bits, 15 exponent bits and 1 sign bit for
- the total of 80 bits. This format provides a precision of about
- 19 decimal places. The 80x87 can handle additional data types
- that are converted to/from the internal format upon being loaded/
- stored to/from the coprocessor. These include 16 bit, 32 bit, and
- 64 bit integers as well as an 18 digit BCD (binary coded decimal)
- number occupying 10 bytes and two additional floating point types.
- The short real data type, also called single precision, has 32 bits
- that split into 23 mantissa bits, 8 exponent bits and a sign
- bit. This format provides a precision of about 6-7 decimal places
- and can represent numbers between 1.17*10^-38 and 3.40*10^38
- (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
- real, or double precision, data type has 64 bits, consisting of
- 52 mantissa bits, 11 exponent bits and the sign bit. It provides
- 15-16 decimal digits of precision and can handle numbers from
- 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308 including
- denormal numbers).
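-
- The ranges and precisions quoted above follow from the bit counts
- of each format. The following Python sketch recomputes them. The
- helper names (sci, format_limits, FORMATS) are made up for this
- illustration, and the code assumes the usual IEEE-754 conventions:
- an exponent bias of 2^(e-1)-1, an implicit leading mantissa bit in
- the short and long real formats, and an explicitly stored one in
- the 80-bit temporary real format.

```python
import math
from fractions import Fraction


def sci(value):
    """Return (mantissa, exponent) with value ~= mantissa * 10**exponent."""
    # Work with logarithms of the exact integers, because the 80-bit
    # limits are far outside the range of a Python float.
    lg = math.log10(value.numerator) - math.log10(value.denominator)
    exponent = math.floor(lg)
    return 10 ** (lg - exponent), exponent


def format_limits(mantissa_bits, exponent_bits, explicit_integer_bit):
    """Derive precision and range of a binary floating-point format."""
    # Short and long real gain one bit of precision from the implicit
    # leading 1; temporary real stores that bit among its 64 mantissa bits.
    precision = mantissa_bits if explicit_integer_bit else mantissa_bits + 1
    bias = 2 ** (exponent_bits - 1) - 1
    min_normal = Fraction(2) ** (1 - bias)
    max_normal = (2 - Fraction(1, 2 ** (precision - 1))) * Fraction(2) ** bias
    min_denormal = Fraction(2) ** (1 - bias - (precision - 1))
    return {
        "decimal_digits": precision * math.log10(2),
        "min_normal": sci(min_normal),
        "max_normal": sci(max_normal),
        "min_denormal": sci(min_denormal),
    }


# (mantissa bits, exponent bits, explicit integer bit) as given in the text
FORMATS = {
    "short real": (23, 8, False),
    "long real": (52, 11, False),
    "temporary real": (64, 15, True),
}

for name, layout in FORMATS.items():
    limits = format_limits(*layout)
    print(f"{name}: about {limits['decimal_digits']:.1f} decimal digits")
    for key in ("min_denormal", "min_normal", "max_normal"):
        mantissa, exponent = limits[key]
        print(f"  {key:13s} {mantissa:.4f}*10^{exponent}")
```

- Exact rational arithmetic is used so that the temporary real limits
- can be computed even though they cannot be held in a Python float.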
-
- In addition to loading and storing the above mentioned operand types,
- the 80x87 coprocessors can perform all the basic arithmetic operations
- on floating point numbers. Besides 'knowing' how to add, subtract,
- multiply and divide they can also compare floating-point numbers,
- change the sign, take the square root or absolute value, compute
- the remainder and compute some of the transcendental functions,
- like the logarithm. The eight registers in the 80x87 are organized
- in a stack-like manner which takes some time getting used to if
- one programs the coprocessor directly in assembler. However,
- nowadays the compilers or interpreters for most high level
- languages (HLL) can give the programmer access to the coprocessor's
- data types and use their instructions, so there is not much need
- to worry about the rather unusual architecture of the 80x87.
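-
- To give a feel for the stack-like register organization, here is a
- toy Python model. It is only a sketch: the class name X87Stack is
- invented for this example, the methods merely borrow the mnemonics
- FLD, FADDP, FMULP and FSTP, and Python floats stand in for the
- 80-bit registers, so results are held in double precision only.

```python
class X87Stack:
    """Toy model of the eight-register 80x87 stack (values only)."""

    SIZE = 8

    def __init__(self):
        self.regs = []                  # regs[-1] plays the role of ST(0)

    def st(self, i):
        """Return the value of ST(i), counted from the top of the stack."""
        return self.regs[-1 - i]

    def fld(self, value):
        """Push a value; it becomes the new ST(0)."""
        if len(self.regs) == self.SIZE:
            raise OverflowError("all eight registers are in use")
        self.regs.append(float(value))

    def faddp(self):
        """ST(1) <- ST(1) + ST(0), then pop the stack."""
        top = self.regs.pop()
        self.regs[-1] += top

    def fmulp(self):
        """ST(1) <- ST(1) * ST(0), then pop the stack."""
        top = self.regs.pop()
        self.regs[-1] *= top

    def fstp(self):
        """Pop ST(0) and return it (a store to memory in real code)."""
        return self.regs.pop()


# Evaluate (1.5 + 2.5) * 3.0 the way stack code would do it.
fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.5)
fpu.faddp()         # ST(0) now holds 4.0
fpu.fld(3.0)
fpu.fmulp()         # ST(0) now holds 12.0
result = fpu.fstp()
print(result)       # prints 12.0
```

- Operands are pushed, combined at the top of the stack and popped
- again, which is why expressions map naturally onto postfix order.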
-
- Strictly speaking, the Weitek Abacus 3167 and 4167 are not
- coprocessors in that they do not transparently extend the
- CPU architecture. Rather they could be described as special
- memory mapped IO-devices. Since the term coprocessor has been
- traditionally used for these chips, they are also called by
- that term in this document. The architecture of the Weitek
- chips differs significantly from the 80x87. The Weitek's register
- file consists of 31 32-bit registers, each one capable of holding
- an IEEE single precision number. Pairs of consecutive single
- precision registers can also be used as 64-bit IEEE double
- precision registers. Thus there are 15 double precision registers.
- The Weitek register file has the standard organization known from
- other register files like those in the 80386, not the special
- stack-like organization of the 80x87 coprocessors. The Weitek
- coprocessors have been tuned for maximum performance. Therefore,
- only a small instruction set has been implemented, but each
- instruction executes at a very high speed, usually only a few
- clock cycles each. Instructions available are load/store, add,
- subtract, subtract reverse, multiply, multiply and negate,
- multiply and accumulate, multiply and take absolute value,
- divide reverse, negate, absolute value, compare/test, convert
- fix/float, and square root. Note that the Weitek Abacus does not
- support a double extended format, has no built-in transcendental
- functions, and does not support denormals. The resources required
- to implement such features have instead been devoted to making
- the basic arithmetic operations as fast as possible. While the
- 80x87 coprocessors perform all internal calculations in double
- extended precision and therefore have about the same performance
- for single and double precision calculations, the Weitek features
- explicit single and double precision operations. For applications
- that require only single precision the Weitek provides additional
- performance that way, as single precision operations are about
- twice as fast as their double precision counterparts. Since the
- Weitek Abacus has more registers than the 80x87 coprocessors,
- values can be kept in registers more often and have to be loaded
- from memory less frequently. This also leads to a performance gain.
- To the CPU, the Weitek Abacus looks like a 64 kB block of memory
- starting at physical address 0C0000000h. Every address in this
- range corresponds to a coprocessor instruction. Accessing a
- specified memory location within this block with the 386/486's
- MOV instruction causes the corresponding instruction to be executed.
- The instructions have been assigned to memory locations in such a
- way that loads to consecutive coprocessor registers can make use
- of the 386/486 MOVS string instruction. The memory mapped interface
- of the Weitek coprocessors is much faster than the IO-oriented
- protocol that is used to couple the CPU to the 80287 and 80387
- coprocessors. The Weitek's starting address of 0C0000000h is only
- a physical address. The Weitek's memory block can be assigned to
- any logical address using the MMU (memory management unit) in the
- 386/486's protected and virtual modes. This also means that the
- Weitek Abacus 3167 and 4167 can *not* be used in the real mode
- of those processors, since the physical start address of the
- Weitek coprocessors is not within the 1 MByte address range and
- the MMU is inoperable in real mode. However, DOS programs can
- make use of the Weitek Abacus by using a DOS extender or a
- memory manager like EMM386 that run in protected/virtual mode
- themselves and can therefore map the Weitek's memory block to
- any desired location in the 1 MByte address range. Typically
- the FS segment register is then set up to point to the Weitek's
- memory block. The Weitek Abacus 3167 and 4167 are also supported
- by the UNIX operating system [33].
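-
- The real mode restriction can be checked with a little address
- arithmetic. The Python sketch below assumes the standard real mode
- address formation (segment * 16 + offset); the constant and function
- names are invented for this example. The base address and the 64 kB
- block size are the ones given above.

```python
WEITEK_BASE = 0xC0000000        # physical start address of the block
WEITEK_SIZE = 64 * 1024         # 64 kB, one address per instruction


def in_weitek_block(physical_address):
    """True if the address falls inside the Weitek's memory block."""
    return WEITEK_BASE <= physical_address < WEITEK_BASE + WEITEK_SIZE


def real_mode_address(segment, offset):
    """Physical address formed from a 16-bit segment and 16-bit offset."""
    return (segment << 4) + offset


# Highest address any real mode segment:offset pair can form.
REAL_MODE_MAX = real_mode_address(0xFFFF, 0xFFFF)

print(hex(REAL_MODE_MAX))                 # prints 0x10ffef
print(in_weitek_block(REAL_MODE_MAX))     # prints False
print(in_weitek_block(0xC0000004))        # prints True
```

- Even the highest address reachable in real mode lies far below the
- block, so the paging hardware has to be used to bring it into view.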
-
-
- What applications will profit by using a math coprocessor?
-
- According to the Intel 387DX User's Guide, there are more
- than 2100 commercial programs that can make use of a 387
- compatible coprocessor. Every program that uses floating
- point arithmetic somewhere and supports a 80x87 coprocessor
- can gain speed by installing a coprocessor. However, the
- speedup will vary from program to program and even within
- the same program depending on how computation intensive the
- program or operation within the program is. Typical applications
- that benefit from the use of a 80x87 coprocessor are:
- - Business graphics programs, such as Arts&Letters, Freedom
- of Press, and Freelance
- - Spreadsheet programs like Lotus 1-2-3, Excel, Quattro, and
- Wingz
- - CAD programs such as AutoCAD, VersaCAD, and GenericCAD
- - Database programs such as dBase IV, FoxBase, and Paradox
- - Math and Science programs such as Mathematica, TKSolver,
- SPSS/PC, and Statgraphics
- Note that for spreadsheets and databases, a coprocessor
- only helps if some kind of floating point computation
- is performed. This is true more often for spreadsheets
- than for databases. Also note that the speed of many
- programs depends quite heavily on the speed of the graphics
- adaptor (CAD) or the disk performance (databases), so the
- computational performance is only a (small) part of the
- total performance of the application. There are some programs
- that won't run without a coprocessor, among them AutoCAD R10
- and later and Mathematica. GUIs (graphical user interfaces)
- such as Windows do *not* gain additional speed from using a
- *mathematical* coprocessor, since their graphics operations
- only use integer arithmetic. They do benefit, though, from a
- graphics board with a graphical 'coprocessor' that speeds up
- certain common operations such as BitBlt or line drawing.
- However, applications running under Windows may take advantage
- of a math coprocessor, e.g. Excel.
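-
- Why the speedup depends so strongly on the application can be made
- concrete with Amdahl's law, which is not named in the text above
- but describes the same effect. The Python sketch below is only an
- illustration: the function name is invented, and the fractions and
- the factor of 20 are made-up sample figures, not measurements.

```python
def overall_speedup(fp_fraction, fp_speedup):
    """Amdahl's law: speedup of the whole program when only the
    floating-point part (fp_fraction of the run time) gets faster."""
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must lie between 0 and 1")
    if fp_speedup <= 0.0:
        raise ValueError("fp_speedup must be positive")
    return 1.0 / ((1.0 - fp_fraction) + fp_fraction / fp_speedup)


# Hypothetical programs: share of run time spent in floating-point code.
for label, fraction in (("database query", 0.10),
                        ("spreadsheet recalculation", 0.50),
                        ("numerical simulation", 0.90)):
    print(f"{label}: {overall_speedup(fraction, 20.0):.2f}x")
```

- A program that spends only a tenth of its time in floating-point
- code gains about 10% overall, however fast the coprocessor is.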
-
- While support for 80x87 coprocessors is very common in application
- programs, the Weitek Abacus coprocessors do not enjoy such
- widespread support. Due to their high price, only a few high-end PCs
- have been equipped with Weitek coprocessors. Therefore most of
- the programs that support these coprocessors are also high-end
- products like AutoCAD and Versacad-386.
-
-
-
- Installing a math coprocessor
-
- Usually, installing a coprocessor doesn't pose much of a problem,
- as every coprocessor comes with installation instructions and a
- diagnostic disk that lets you check for correct operation once
- the coprocessor has been installed. In addition, the user manuals
- of most computers have a section on coprocessor installation.
-
- 1) Make sure to get the right coprocessor for your system. An
- 8087 works together with 8086, 8088, V20, and V30 CPUs. An
- 80287, 287XL or compatible works together with a 80286 CPU.
- There are also some old 386 motherboards that accept a 80287
- coprocessor, but they usually also provide a socket for the
- 387 and I recommend getting a 387 for use with these
- systems. A 80387, 387DX or compatible coprocessor is for 386
- based systems, as is the Intel RapidCAD. 387 coprocessors
- also work together with Cyrix' 486DLC CPU which despite its
- name does not include an FPU. Similarly, 387SX or compatible
- coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC.
- The Weitek Abacus 3167 works with a 386 CPU but requires a
- 121-pin EMC socket in your system. Some computers, such as
- IBM's PS/2s don't have this socket. The Weitek Abacus 4167
- works together with the 486 and requires the appropriate
- 142-pin socket to be present.
- Always install a coprocessor that is rated at the same speed
- as the CPU. That is, for a 40 MHz 386 system using AMD Am386-40,
- install a coprocessor rated for 40 MHz such as a Cyrix 83D87-40,
- IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its
- specified frequency rating may cause it to produce false results,
- which you might fail to recognize as such. I have personally
- experienced this problem with a Cyrix 83D87-33 that I tried
- to push to 40 MHz. It passed all the diagnostic benchmarks
- on the Cyrix diagnostic disk and the tests of some commercial
- system test programs. However, I found it to fail the
- Whetstone and Linpack benchmarks, which include accuracy
- checks. So although there is usually no problem with overheating
- when pushing a coprocessor over the specified maximum frequency
- rating, be warned that operation of a coprocessor above the
- maximum ratings stated by the manufacturer makes operation
- unreliable. Some 386 boards allow the coprocessor to be clocked
- differently than the CPU. This is called asynchronous operation
- and allows you to run the coprocessor at 33 MHz while the CPU
- runs at 40 MHz, for example. Please note that only the Intel
- 80387 and 387DX support asynchronous operation. The 387 'clones'
- from Cyrix, IIT and ULSI always run at the full speed of the
- CPU, even if you have set up your motherboard for asynchronous
- operation.
- 2) Once you've got the correct coprocessor for your system you
- can start the actual installation process:
- - turn off the computer's power switch and unplug the power
- cord from the wall outlet
- - remove the cover of your computer
- - locate the math coprocessor socket. This socket is located
- right next to the CPU, which can be identified by the
- printing on top of the chip. The CPU usually is one of the
- biggest chips on the board. The 8087 and 80287 DIL sockets
- are rectangular sockets with 20 pin holes on each of the
- longer sides. The 387SX PLCC socket is a square socket that
- has 17 vertical connector strips on the 'wall' of each side.
- The 387 PGA socket is square and has two rows of pin holes
- on each side. The EMC socket is similar but has three rows
- of holes on each side. The PGA socket for the Weitek 4167 is
- also square with three rows of holes on each side. If the CPU
- and coprocessor socket are on a separate card rather than on
- the motherboard (typical for modular systems), you have to
- remove the card and place it on a flat and hard surface free
- of static electricity. If you can't find the math coprocessor
- socket, consult your owner's manual or your computer dealer.
- If you want to install the Intel RapidCAD in a 386 system,
- you will have to remove the 386 CPU before starting to
- install the two RapidCAD chips. Intel provides an easy to
- use chip extractor and a storage box for the 386 chip for
- this purpose. Just follow the instructions in the RapidCAD's
- installation manual.
- - Be sure you are properly grounded before you remove the
- coprocessor from its antistatic box. Static electricity
- can damage the coprocessor. Make sure you do not touch
- the pins.
- - Check that all pins are straight and not bent. If you find
- bent pins, carefully straighten them with needle-nose pliers
- or tweezers.
- - Match the coprocessor's orientation with the orientation
- of the socket. 8087 and 287 coprocessors have a notch on one of
- the shorter sides of their rectangular DIL package that should
- be matched with the notch of the coprocessor socket. Usually
- the 286 CPU and the 287 coprocessor are placed alongside each
- other and both have the same orientation, that is their
- respective notches point in the same direction. 387SX
- coprocessors feature a white dot or similar mark that matches
- with some sort of marking on the socket. 387 coprocessors
- have a beveled corner that is also marked with a white dot
- or similar marking. This should be matched with the beveled
- or otherwise marked corner of the socket. If you install
- a 387 coprocessor in an EMC socket, leave one row of holes
- free on each side. Correct orientation of the coprocessor
- is absolutely essential, because if you insert it the
- wrong way it may be damaged. If you have found the correct
- orientation, make sure all pins are correctly aligned with
- their respective holes. Press firmly and evenly on the chip.
- You may have to press hard to seat the coprocessor all the
- way. Make sure your motherboard does not bend more than
- slightly under the insertion pressure. Otherwise it may
- develop cracks that could damage the signal lines on the
- board. For 8087, 287, and 387 coprocessors it is normal that
- the coprocessor does not go all the way in but about one
- millimeter (1/25 inch) of space is left between the socket
- and the bottom of the coprocessor chip. This enables the
- insertion of an extraction device should it become necessary
- to remove the coprocessor. Note that the construction of the
- 387SX's PLCC socket makes it next to impossible to remove
- the coprocessor once fully inserted, as the top of the chip
- is level with the socket's 'walls' then.
- 3) Check your computer's manual for the jumpers and/or switches
- you may have to set for coprocessor operation.
- Put the cover back on the system unit and reconnect the power.
- Turn on your computer. Depending on your BIOS, you may have
- to run the setup or configuration program to register the
- coprocessor.
- Use the diagnostic disk included with your coprocessor to
- check for correct operation of your coprocessor.
-
-
-
- Coprocessor emulations
-
- In the absence of a coprocessor, floating-point calculations
- are most often performed by a software package that simulates
- the operations of the coprocessor. Such a program is called
- a coprocessor emulator. Simulating the coprocessor has the
- advantage that identical code can be generated for the
- coprocessor and the emulator so that it is possible to write
- programs that run both on systems with and on systems without a
- coprocessor. Whether the program is to use the coprocessor or the
- emulator can then be decided at run-time by checking if a
- math coprocessor is present in the system.
-
- Two approaches to interface an 80x87 emulator to programs are
- common. While the first method works with all 80x86 processors,
- the second only works from the 80286 on. The first method makes
- use of the fact that all coprocessor instructions start with the
- same five bit pattern 11011. Thus the first byte of a coprocessor
- instruction will be in the range D8-DF hexadecimal. In addition,
- coprocessor instructions usually are preceded by a WAIT instruction
- (opcode 9Bh) which is one byte long (the reason for doing this
- is described in a later chapter on the operation of the 80x87).
- One common approach is to replace the WAIT instruction and the
- first byte of the coprocessor instruction with one of eight
- interrupts; the remaining bytes of the coprocessor instruction
- are left unchanged. Interrupts 34 to 3B hexadecimal are used for
- this emulation technique. Note that the sequences 9B D8 .. 9B DF
- can be easily converted to the interrupt instructions CD 34 .. CD 3B
- by simple addition and subtraction of constants. The compiler or
- assembler produces code that contains the appropriate interrupt
- calls instead of the coprocessor instructions. If a coprocessor
- is detected at run-time, the emulator interrupts point to a short
- routine that converts the interrupt calls back to coprocessor
- instructions (self modifying code). If no coprocessor is found
- the interrupts point to an emulation package which examines the
- byte(s) following the interrupt instruction to determine what
- operation to perform. The method described is used by the compilers
- from Microsoft and Borland for example. It works with every
- 80x86 CPU from the 8086/8088 on.
- The second method to interface an emulator is only available on
- 286 and 386 machines. If the emulation bit in the machine status
- word of these processors is set, the processors will generate an
- interrupt 7 whenever a coprocessor instruction is encountered.
- The interrupt vector then points to an emulation package that
- decodes the instruction and performs the desired operation. This
- approach has the advantage that the emulator doesn't have to be
- included in the program code, but can be loaded as a TSR or
- device driver once and then used by every program that requires
- a coprocessor. Emulation via interrupt 7 is transparent, which
- means that programs containing coprocessor instructions execute
- just as if a coprocessor were present, only slower. This approach
- is taken by the public domain EM87 emulator and the commercial
- Franke387 emulator, for example. Even programs that require a
- coprocessor to run like AutoCAD are 'fooled' into believing that
- a coprocessor is present by emulators using INT 7.
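-
- The byte patching used by the first method can be sketched in a
- few lines of Python. This is an illustration only: the function
- and constant names are invented, and the code works on the bytes
- of a single instruction rather than scanning a whole program. The
- opcode values (WAIT = 9Bh, coprocessor opcodes D8h-DFh, interrupts
- 34h-3Bh) are those given above; CDh is the opcode of INT n.

```python
WAIT_OPCODE = 0x9B      # one-byte WAIT instruction
INT_OPCODE = 0xCD       # first byte of the two-byte INT n instruction
FIRST_ESC = 0xD8        # coprocessor opcodes span D8h..DFh
FIRST_VECTOR = 0x34     # emulator interrupts span 34h..3Bh


def is_coprocessor_opcode(byte):
    """True if the top five bits of the byte are 11011."""
    return (byte >> 3) == 0b11011


def to_interrupt_form(instruction):
    """Rewrite 'WAIT + coprocessor opcode' as 'INT 34h..3Bh'."""
    if (len(instruction) < 2 or instruction[0] != WAIT_OPCODE
            or not is_coprocessor_opcode(instruction[1])):
        raise ValueError("expected 9B followed by D8..DF")
    vector = instruction[1] - FIRST_ESC + FIRST_VECTOR
    return bytes([INT_OPCODE, vector]) + bytes(instruction[2:])


def to_coprocessor_form(instruction):
    """Undo the rewrite, as the run-time fix-up routine would."""
    if (len(instruction) < 2 or instruction[0] != INT_OPCODE
            or not FIRST_VECTOR <= instruction[1] <= FIRST_VECTOR + 7):
        raise ValueError("expected CD followed by 34..3B")
    opcode = instruction[1] - FIRST_VECTOR + FIRST_ESC
    return bytes([WAIT_OPCODE, opcode]) + bytes(instruction[2:])


# WAIT followed by FADD ST, ST(0), which is encoded as D8 C0.
fadd = bytes([0x9B, 0xD8, 0xC0])
patched = to_interrupt_form(fadd)
print(patched.hex())                        # prints cd34c0
print(to_coprocessor_form(patched).hex())   # prints 9bd8c0
```

- Both directions keep the instruction length unchanged, which is
- what makes patching the code in place possible.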
-
- The size of the emulator used by TP 6.0 is about 9.5 kB, EM87
- occupies about 15.8 kB as a TSR, and Franke387 uses about 13.4 kByte
- as a device driver. Note that Franke387 and especially EM87 model
- a real coprocessor much more closely than Turbo Pascal's emulator
- does. In particular, EM87 supports denormal numbers, precision
- control, and rounding control. The emulator in TP 6.0 does not
- implement these features. The version of Franke387 tested (V2.4)
- supports denormals in single and double precision, but not
- double extended precision. It supports precision control, but
- not rounding control. Intel's E80287 is supposed to be a 100%
- exact emulation of the 80287 coprocessor [44]. Generally, the
- more closely a real coprocessor is modelled by the emulator,
- the slower the emulator runs and the larger the code for the
- emulator is.
-
-
- Relative execution times of coprocessor vs. software emulators
- for particular coprocessor instructions
-
-                    Intel 387DX   TP 6.0 Emulator   EM87 Emulator
-
- FADD ST, ST(0)          1               26              104
- FDIV [DWord]            1               22              136
- FXAM                    1               10               73
- FYL2X                   1               33              102
- FPATAN                  1               36              110
- F2XM1                   1               38              110
-
-
- The following table is an excerpt from [44]:
-
-                    Intel 80287   Intel E80287 Emulator
-
- FADD ST, ST(0)          1                 42
- FDIV [DWord]            1                266
- FXAM                    1                139
- FYL2X                   1                 99
- FPATAN                  1                153
- F2XM1                   1                 41
-
-
-
- The following has been adapted from [43] and merged with my own
- data:
-
-                    Intel 8087   TP 6.0 Emul. (8086)   Intel Emul. (8086)
-
- FADD ST, ST(0)          1               20                    94
- FDIV [DWord]            1               22                    82
- FPTAN                   1               18                   144
- F2XM1                   1                6                   171
- FSQRT                   1               44                   544
-
-
-
- One of the reasons emulators are so slow is that they are
- often designed to run with every CPU from the 8086/8088 on.
- This is the case with the emulators built into the compiler
- libraries of the Turbo Pascal 6.0 (also used by Turbo C/C++)
- and Microsoft C 6.0 compiler (probably also used in other
- Microsoft products) and is also true for the EM87 emulator
- in the public domain. By using code that can run on a 8086/8088,
- these emulators forego the speed advantage offered by the
- additional instructions and architectural enhancements (such
- as 32-bit registers) of the more advanced Intel 80x86 processors.
- A notable exception is the Franke387 emulator, a commercial
- emulator that is also sold as shareware. It uses 386 specific
- 32-bit code and only runs on 386/386SX computers.
-
- Besides being slow, coprocessor emulators have other drawbacks
- compared with real coprocessors. Most of the emulators do not
- support the additional instructions that the 387 compatible
- coprocessors offer over the 80287. Often, some of the low-level
- stack-manipulating instructions like FDECSTP are not emulated.
- The coprocessor status register is not or only partially emulated.
- Some emulators do not conform to the IEEE-754 standard in their
- implementation of the basic arithmetic functions, while the
- coprocessors do. Also, they sometimes lack the support for
- denormals (a special class of floating point numbers) although
- it is required by the standard. Not all the 80x87 emulators
- support rounding control (a feature required by IEEE-754) and
- precision control (a feature of the 80x87 coprocessor). Most of
- the omissions are aimed at making the emulator faster and smaller.
- Because of the shortcomings of coprocessor emulators, a real
- coprocessor is a must for anybody planning to do some serious
- computations. At today's prices, this shouldn't pose much of a
- problem to anybody.
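-
- Precision control and rounding control, mentioned above, are two
- 2-bit fields in the 16-bit control word of the 80x87 (bits 8-9 and
- bits 10-11). The Python sketch below shows how such a control word
- value can be put together. The helper names are invented for this
- example, and 037Fh is used only as a sample value; it selects
- extended precision and round to nearest.

```python
PRECISION_SHIFT = 8     # precision control occupies bits 8-9
ROUNDING_SHIFT = 10     # rounding control occupies bits 10-11

PRECISION_CONTROL = {"single": 0b00, "double": 0b10, "extended": 0b11}
ROUNDING_CONTROL = {"nearest": 0b00, "down": 0b01, "up": 0b10, "chop": 0b11}


def set_field(control_word, shift, value):
    """Replace the 2-bit field at 'shift' in a 16-bit control word."""
    cleared = control_word & ~(0b11 << shift) & 0xFFFF
    return cleared | (value << shift)


def set_precision(control_word, mode):
    return set_field(control_word, PRECISION_SHIFT, PRECISION_CONTROL[mode])


def set_rounding(control_word, mode):
    return set_field(control_word, ROUNDING_SHIFT, ROUNDING_CONTROL[mode])


def describe(control_word):
    """Return the (precision, rounding) names encoded in a control word."""
    pc = (control_word >> PRECISION_SHIFT) & 0b11
    rc = (control_word >> ROUNDING_SHIFT) & 0b11
    precision = {v: k for k, v in PRECISION_CONTROL.items()}.get(pc, "reserved")
    rounding = {v: k for k, v in ROUNDING_CONTROL.items()}[rc]
    return precision, rounding


sample = 0x037F
print(describe(sample))                         # ('extended', 'nearest')
print(hex(set_precision(sample, "single")))     # prints 0x7f
print(hex(set_rounding(sample, "chop")))        # prints 0xf7f
```

- A real program would load the resulting value with FLDCW. An
- emulator without these features simply ignores the two fields.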
-
-
- Available coprocessors, CPU+FPU as of 08-10-92:
-
-
- Intel 8087 [43] was the first coprocessor that Intel brought
- out for the 80x86 family. It was introduced in 1980
- and therefore does not have full compatibility with
- the IEEE-754 standard for floating point arithmetic,
- which was finally released in 1985. It complements
- the 8088 and 8086 CPUs and can also be interfaced
- to the 80188 and 80186 processors. It comes in a
- 40 pin CERDIP (ceramic dual inline package). It
- is available in 5 MHz, 8 MHz (8087-2), and 10 MHz
- (8087-1) versions. The 8087 is implemented using
- NMOS. Power consumption is rated at max. 2400 mW [42].
- A neat trick to enhance the processing power of the
- 8087 for computations that use only the basic
- arithmetic operations (+,-,*,/) and do not require
- high precision is to set the precision control to
- single precision. This gives one a performance
- increase of up to 20%. For details about programming
- the precision control, see program PCtrl in appendix A.
- Intel 80187 is a rather new coprocessor designed to support the
- 80C186 embedded controller. It was introduced in 1989
- and implements the complete 80387 instruction set.
- It is available in a 40 pin CERDIP (ceramic dual
- inline package) and a 44 pin PLCC (plastic leaded
- chip carrier) for 12.5 and 16 MHz operation. Power
- consumption is rated at max. 675 mW for the
- 12.5 MHz version and max. 780 mW for the 16 MHz
- version [37].
- Intel 80287 [44] is the original Intel coprocessor for the 80286
- and was introduced in 1983. It uses the same execution
- unit as the 8087 and therefore has the same speed
- (sometimes slower due to additional overhead in CPU
- coprocessor communication). Like the 8087, it does not
- provide full compatibility with the IEEE-754 floating
- point standard released in 1985. It was manufactured
- in NMOS technology. There are 6 MHz, 8 MHz, and 10
- MHz versions. The chip comes in a 40 pin CERDIP
- (ceramic dual inline package). Power consumption can
- be estimated to be the same as that for the 8087,
- which is max. 2400 mW. The 80287 has been replaced
- in the Intel 80x87 family with its successor, the
- Intel 287XL, which was introduced in 1990. The
- 287XL is done in CMOS. It is based on the 387 core
- and therefore much faster than the 80287. There may
- still be a few of the old 80287 chips on the market
- though.
- Intel 80287XL is the second generation 287 introduced by Intel
- in 1990. Since it is based on the 387 core, it
- features full IEEE 754 compatibility and faster
- execution of coprocessor instructions. Intel claims
- about 50% faster operation than the 80287 for typical
- benchmark tests such as Whetstone [45]. Comparison
- with benchmark results for the AMD 80C287, which is
- identical to the Intel 80287, supports this claim [1].
- The Intel 287XL performed 66% faster than the AMD
- 80C287 on the fractal benchmark and 66% faster on
- the Whetstone benchmark in these tests. Whetstone
- results from [46] show the Intel 287XL at 12.5 MHz
- to perform 552 kWhets/sec as opposed to the AMD
- 80C287's 289 kWhets/sec, a 91% performance increase.
- A benchmark using the MathPak program showed the
- Intel 287XL to be 59% faster than the Intel 80287
- (6.9 sec. vs. 11.0 sec.) [26]. Since the 287XL
- has all the additional instructions and enhancements
- of a 387, most software automatically identifies
- it as an 80387 compatible coprocessor and makes
- use of the extra features available like the FSIN
- and FCOS instructions. The 287XL is done in CMOS
- and therefore uses less power than the older 80287,
- which was done in NMOS. The 287XL is rated for
- speeds of up to 12.5 MHz. At 12.5 MHz, the power
- consumption is rated at max. 675 mW, about 1/4 of
- the 80287 power consumption. The 287XL comes in
- either a 40 pin CERDIP (ceramic dual inline package)
- or a 44 pin PLCC (plastic leaded chip carrier). The
- latter version is called the 287XLT and intended
- mainly for laptop use.
- AMD 80C287 is an exact clone of the old Intel 80287 that was
- brought to market by AMD in 1989. It contains the
- original microcode of the 80287 and is therefore
- 100% compatible with this chip. However, as the name
- indicates, the 80C287 is manufactured in CMOS and
- therefore uses less power than an equivalent Intel
- 80287. At 12.5 MHz, its power consumption is rated
- at max. 625 mW or slightly less than that of the
- Intel 80287XL [27]. There is also another version
- called AMD 80EC287 that uses an 'intelligent' power
- save feature to reduce the power consumption below
- 80C287 levels. Tests at 10.7 MHz show typical power
- consumption for the 80EC287 to be at 30mW, compared
- to 150 mW for the AMD 80C287, 300 mW for the Intel
- 287XL and 1500 mW for the Intel 80287 [57]. The
- 80EC287 is therefore ideally suited for low power
- laptop systems. The AMD 80C287 is available in speeds
- of 10, 12, and 16 MHz. I have only seen it being
- offered in 10 MHz and 12 MHz versions though. At
- about US$ 50, it is the cheapest coprocessor available.
- Note that it provides less performance than the
- newer Intel 287XL (see above for details). The AMD
- 80C287 is available in 40 pin ceramic and plastic
- DIPs (dual inline package) and as 44 pin PLCC
- (plastic leaded chip carrier). Due to recent legal
- battles with Intel over the right to use the 287
- microcode, which AMD lost, AMD may have to discontinue
- this product (disclaimer: I am not a legal expert).
- Cyrix 82S87 was developed from the Cyrix 83D87, Cyrix' 387 'clone'
- and has been available since 1991. It implements the
- full 387 instruction set. It totally complies with
- the IEEE-754 standard for floating point arithmetic
- and features nearly total compatibility with Intel's
- coprocessors. It implements the transcendental
- functions with the same degree of accuracy and the
- superior speed of the Cyrix 83D87. This makes the
- Cyrix 82S87 the fastest [1] and most accurate 287
- compatible coprocessor available. Documentation by
- Cyrix [46] rates the 82S87 at 730 kWhets/sec for a
- 12.5 MHz system, while the Intel 287XL performs only
- 552 kWhets/sec. The 82S87 is a fully static CMOS
- design with very low power requirements that can
- run at speeds of 6 to 20 MHz. Cyrix documentation
- shows the 82S87 to consume about the same amount of
- power as the AMD 80C287 (see above). The 82S87 comes
- in a 40 pin DIP or a 44 pin PLCC (plastic leaded
- chip carrier) compatible with the pinout of the
- Intel 287XLT and ideally suited for laptop use.
- IIT 2C87 was the first 287 clone available. It was introduced
- to the market in 1989. It has about the same speed
- as the Intel 287XL [1]. The 2C87 implements the
- full 387 instruction set [38]. Tests I ran on the
- 3C87 seem to indicate that it is not fully compatible
- with the IEEE-754 standard for floating-point
- arithmetic (see below for details), so it can be
- assumed that the 2C87 also fails these tests as it
- presumably uses the same core as the 3C87. The IIT
- 2C87 provides extra functions not available on any
- other 287 chip [38]. It has 24 user accessible
- floating-point registers organized into three register
- banks. Additional instructions (FSBP0, FSBP1, FSBP2)
- allow switching from one bank to another. Transfers
- between registers in different banks are not
- supported however, so this feature by itself
- is of limited usefulness. Also there seems to
- be only one status register (containing the
- stack top pointer), so it has to be manually
- loaded and stored when switching between banks
- with a different number of registers in use [40].
- The register banks' main purpose is to aid the
- fourth additional instruction the 2C87 has
- (F4X4), which does a full multiply of a 4x4 matrix
- by a 4x1 vector, an operation common in 3D graphics
- applications [39]. The built-in matrix multiply
- speeds this operation up by a factor of 6 to 8
- compared with a programmed solution according to
- the manufacturer [38]. Tests show the speed-up
- to be indeed in this range [40]. For the 3C87, I
- measured the execution time of F4X4 to be about
- 280 clock cycles; the execution time on the 2C87
- should be somewhat longer. I estimate it to be
- around 310 clock cycles due to the higher CPU-NDP
- communication overhead in instruction execution in
- 286/287 systems (~45-50 clock cycles) compared with
- 386/387 systems (~16-20 clock cycles). As useful as
- the F4X4 instruction may seem, there are only very
- few applications that make use of this feature if
- an IIT coprocessor is detected at run time, among
- them Schroff Development's Silver Screen and
- Evolution Computing's Fast-CAD 3-D [25]. The 2C87
- is available for speeds of up to 20 MHz. It is
- implemented in an advanced CMOS process and has
- therefore a low power consumption of typically
- about 500 mW [38].
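- The estimate of about 310 clock cycles for F4X4 on the
- 2C87 can be reproduced with a short Python sketch. It
- assumes (my simplification, not a measured fact) that the
- CPU-NDP communication overhead is paid exactly once per
- F4X4 instruction, so the 386/387 overhead is subtracted
- from the 3C87 measurement and the 286/287 overhead added:

```python
F4X4_CYCLES_3C87 = 280        # measured on the 3C87 in a 386/387 system
OVERHEAD_386_387 = (16, 20)   # approx. CPU-NDP overhead in clock cycles
OVERHEAD_286_287 = (45, 50)   # approx. CPU-NDP overhead in clock cycles


def estimate_2c87_cycles(measured, overhead_386, overhead_286):
    """Return (low, high) bounds for the F4X4 execution time on a 2C87.

    Assumes the communication overhead is incurred once per instruction.
    """
    low = measured - max(overhead_386) + min(overhead_286)
    high = measured - min(overhead_386) + max(overhead_286)
    return low, high


low, high = estimate_2c87_cycles(
    F4X4_CYCLES_3C87, OVERHEAD_386_387, OVERHEAD_286_287
)
print(f"F4X4 on 2C87: roughly {low} to {high} clock cycles")
```

- The bounds come out as 305 and 314 clock cycles, which
- brackets the figure of around 310 given above.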
- Intel 387 was the first generation of coprocessors for the
- Intel 386. It was introduced in 1986, about one
- year after introduction of the 80386. Early 386
- systems were therefore equipped with an 80287 and an
- 80387 socket. The 80386 works together with the
- 80287 but the numerical performance is hardly
- adequate for such a system. The 80387 has since
- been superseded by the Intel 387DX introduced
- by a quiet change in 1990. You might find it
- when acquiring an old 386 machine, though. The
- 80387 is about 20% slower than the newer 387DX
- (see the paragraph below for detailed information).
- Like the other 387 coprocessors, the 80387 is packaged
- in a 68-pin ceramic PGA. The Intel 80387 is
- manufactured using Intel's older 1.5 micron CHMOS
- III technology that has moderate power requirements.
- Power consumption at 16 MHz is max. 1250 mW (750 mW
- typical), at 20 MHz it is max. 1550 mW (950 mW
- typical), and at 25 MHz it is max. 1950 mW (1250 mW
- typical) [60].
- Intel 387DX is the second generation Intel 387 that was quietly
- introduced in 1989. This version is done in a more
- advanced CMOS process than the 80387 that enables
- the coprocessor to run at a maximum frequency of 33
- MHz, while the 80387 had a maximum frequency of 25 MHz.
- The 387DX is about 20% faster than the 80387 on the
- average for the same clock frequency. For a 386/387
- system operating at 29 MHz the Whetstone benchmark
- compiled with the highly optimizing Metaware High-C
- V1.6 runs at 2377 kWhetstones/sec for the 80387 and
- at 2693 kWhetstones/sec for the 387DX, a 13% increase.
- In a fractal calculation programmed in assembly
- language, the 387DX performance was 28% higher than
- the performance of the 80387. The transcendental
- functions have also sped up from the 80387 to the
- 387DX. In the Savage benchmark compiled with the
- Metaware High-C V1.6 optimizing compiler and running
- on a 29 MHz system, the 80387 evaluated 77600 function
- calls/second, while the 387DX evaluated 97800 function
- calls/second, a 26% increase [7]. Some instructions
- have been sped up a lot more than the average
- 20%. For example the FBSTP instruction has been sped
- up by a factor of 3.64. The Intel 387DX (and its
- predecessor 80387) are the only 387 coprocessors
- that support asynchronous operation of CPU and NDP.
- The 387 consists of a bus interface unit and a
- numerical execution unit. The bus interface unit
- always runs at the speed of the CPU clock (CPUCLK2).
- If the CKM (ClocK Mode) pin of the 387 is strapped
- to Vcc, the numerical execution unit runs at the
- same speed as the bus interface unit. If CKM is tied
- to ground, the numerical execution unit runs at the
- speed provided by the NUMCLK2 input. The ratio of
- NUMCLK2 (coprocessor clock) to CPUCLK2 (CPU clock)
- must lie within the range 10:16 to 14:10. For example,
- for a 20 MHz 386, the Intel 387DX could be clocked
- from 12.5 MHz to 28 MHz via the NUMCLK2 input. On
- the Cyrix 83D87, Cyrix 387+, ULSI 83C87, and the IIT
- 387, the CKM pin is not connected. These coprocessors
- always run at the speed of the CPU. The Intel 387DX
- is manufactured using Intel's advanced low power
- CHMOS IV technology. Power consumption at 20 MHz is
- max. 900 mW (525 mW typical), at 25 MHz it is max.
- 1050 mW (625 mW typical), and at 33 MHz it is 1250
- mW (750 mW typical) [59].
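- The permitted clock range for asynchronous operation can
- be worked out from the 10:16 to 14:10 ratio limits given
- above. A minimal Python sketch (the function names are
- hypothetical; only the two ratio limits come from the
- text):

```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)   # lowest allowed NUMCLK2 : CPUCLK2
MAX_RATIO = Fraction(14, 10)   # highest allowed NUMCLK2 : CPUCLK2


def numclk2_range_mhz(cpu_mhz):
    """Return (lowest, highest) coprocessor clock in MHz for a CPU clock."""
    cpu = Fraction(cpu_mhz)
    return float(cpu * MIN_RATIO), float(cpu * MAX_RATIO)


def ratio_is_allowed(ndp_mhz, cpu_mhz):
    """True if the coprocessor/CPU clock ratio lies within the limits."""
    ratio = Fraction(ndp_mhz) / Fraction(cpu_mhz)
    return MIN_RATIO <= ratio <= MAX_RATIO


print(numclk2_range_mhz(20))      # the 20 MHz 386 example from the text
print(ratio_is_allowed(25, 20))   # 25 MHz 387DX next to a 20 MHz 386
print(ratio_is_allowed(33, 20))   # 33 MHz 387DX next to a 20 MHz 386
```

- For a 20 MHz CPU this gives 12.5 MHz to 28 MHz, so a
- coprocessor clocked at 25 MHz is within the limits while
- one clocked at 33 MHz is not.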
- Intel 387SX is the coprocessor for the Intel 386SX. The 386SX is
- an Intel 386 with a 16-bit data path. This reduces
- somewhat the costs to build a complete system as
- compared to a full 32-bit design required by the
- 80386DX. The 386SX's main purpose was to replace the
- 80286 CPU, which Intel subsequently stopped producing.
- Due to the 16-bit data path, the 386SX is slower than
- the 386DX and offers about the same speed as an 80286
- at the same clock frequency for 16-bit applications.
- As the 386SX is a complete 80386, it offers also the
- possibility to run 32-bit applications and supports
- the virtual 8086 mode used for example by Windows'
- enhanced mode. The 387SX has all the features the
- Intel 387DX offers, including the ability for
- asynchronous operation of CPU and coprocessor
- (see the above paragraph on the Intel 387DX for
- details). Due to the 16-bit data path between the
- CPU and the coprocessor, the 387SX is a bit slower
- than a 387DX operating at the same frequency. The
- 387SX comes in a 68-pin PLCC (plastic leaded chip
- carrier) package and is available in 16 MHz and 20
- MHz versions. Coprocessors for faster 386SX systems
- based on the Am386SX CPU are available from IIT,
- Cyrix, and ULSI. Power consumption for the 387SX
- at 16 MHz is max. 1250 mW (740 mW typical), for
- the 20 MHz version it is max. 1500 mW (1000 mW
- typical) [62].
- IIT 3C87 came out in 1989 at about the same time as the
- Cyrix 83D87. Both coprocessors are faster than
- Intel's 387DX coprocessor. Tests I ran with the
- IEEETEST program show that the 3C87 is not fully
- compatible with the IEEE-754 standard for
- floating-point arithmetic although the manufacturer
- claims differently. It is quite possible that the
- reported errors are due to personal interpretations
- of the standard by the program's author that have
- been incorporated into IEEETEST and that the
- standard also supports the different interpretation
- chosen by IIT. On the other hand, the IEEE test
- vectors incorporated into IEEETEST have become
- somewhat of an industry standard [66] and Intel's
- 387, 486, and RapidCAD chips pass the test without
- a single failure, so the fact that the IIT 3C87
- fails some of the tests indicates that it is not
- fully compatible with the Intel 387 coprocessor.
- My tests also show that the IIT 3C87 does not
- support denormals for the double extended format.
- It is not entirely clear whether the IEEE standard
- mandates support for extended precision denormals,
- as the IEEE-754 document explicitly only mentions
- single and double precision denormals. Missing
- support for denormals is not a critical issue with
- most applications but there are some programs for
- which support of denormals is quite helpful, if not
- important [41]. Anyhow, failure of the 3C87 to
- support extended precision denormal numbers is an
- incompatibility with the Intel 387 and 486. The 3C87
- provides extra functions not available on any other
- 387 chip [38]. It has 24 user accessible floating-point
- registers organized into three register banks.
- Additional instructions (FSBP0, FSBP1, FSBP2)
- allow switching from one bank to another. Transfers
- between registers in different banks are not
- supported however, so this feature by itself
- is of limited usefulness. Also there seems to
- be only one status register (containing the
- stack top pointer), so it has to be manually
- loaded and stored when switching between banks
- with a different number of registers in use [40].
- The register banks' main purpose is to aid the
- fourth additional instruction the 3C87 has
- (F4X4), which does a full multiply of a 4x4 matrix
- by a 4x1 vector, an operation common in 3D graphics
- applications [39]. I measured this instruction to
- execute in about 280 clock cycles, during which
- time it executes 16 multiplications and 12 additions.
- The built-in matrix multiply speeds the matrix by
- vector multiply up by a factor of 3 compared
- with a programmed solution according to IIT [39].
- The results for my own TRNSFORM benchmark support
- this claim (see results below), showing a performance
- increase by a factor of about 2.5. This makes
- matrix multiplies on the IIT 3C87 nearly as fast as
- on an Intel 486 at the same clock frequency. However,
- there are only very few applications that make use
- of this feature if an IIT 3C87 is detected at run time,
- among them Schroff Development's Silver Screen and
- Evolution Computing's Fast-CAD 3-D [25]. Like the
- 387 'clones' from Cyrix and ULSI, the 3C87 does not
- support asynchronous operation of the CPU and the
- coprocessor. The 3C87 always runs at the full speed
- of the CPU. The 3C87 is implemented in an advanced
- CMOS process and has low power requirements of
- typically about 600 mW. It is available in 16, 20,
- 25, 33, and 40 MHz versions.
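- The operation performed by F4X4 is an ordinary 4x4 matrix
- by 4x1 vector product. The following Python sketch does
- not use or emulate the F4X4 instruction or the register
- banks; it only shows the arithmetic and counts the
- operations, confirming the 16 multiplications and 12
- additions mentioned above. The function name and the
- example matrix are my own:

```python
def mat4_times_vec4(m, v):
    """Multiply a 4x4 matrix by a 4x1 vector, counting the operations.

    Returns (result_vector, multiplications, additions).
    """
    mults = 0
    adds = 0
    result = []
    for row in m:
        acc = row[0] * v[0]          # first product needs no addition
        mults += 1
        for a, x in zip(row[1:], v[1:]):
            acc += a * x             # one multiply and one add per term
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


# Example: translate the point (1, 2, 3) by (10, 20, 30) using
# homogeneous coordinates, as is common in 3D graphics.
translate = [
    [1, 0, 0, 10],
    [0, 1, 0, 20],
    [0, 0, 1, 30],
    [0, 0, 0, 1],
]
point = [1, 2, 3, 1]

moved, mults, adds = mat4_times_vec4(translate, point)
print(moved, mults, adds)

# Average cost per arithmetic operation at the measured 280 clock cycles
cycles_per_op = 280 / (mults + adds)
print(cycles_per_op)
```

- With 28 arithmetic operations in about 280 clock cycles,
- F4X4 averages about 10 clock cycles per operation.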
- IIT 3C87SX is the version of the IIT 3C87 that is intended for
- use with Intel's 386SX or AMD's Am386SX CPU. It is
- functionally equivalent to the IIT 3C87. Due to the
- 16-bit data path between the CPU and the coprocessor
- in a 386SX based system, coprocessor instructions
- will execute somewhat slower than on the 3C87. The
- IIT 3C87SX is the only 387SX coprocessor that is
- offered at speeds of 16, 20, 25, and 33 MHz right
- now. I have read that Cyrix has also announced an
- 83S87-33, but haven't seen it being offered yet.
- The 3C87SX is packaged in a 68-pin PLCC.
- Cyrix 83D87 was introduced in 1989, only shortly after the
- coprocessors from IIT. It has been the fastest
- 387 compatible coprocessor in several benchmark
- comparisons [1,7,68,69]. It also came out as the
- fastest coprocessor in my own tests (see benchmark
- results below). Although the Cyrix 83D87 provides
- up to 50% more performance than the Intel 387DX
- in benchmark comparisons, the speed advantage
- over other 387 compatible coprocessors in real
- applications is usually much smaller. For example,
- in a test using the program 3D-Studio, the Cyrix
- 83D87 was 6% faster than the Intel 387DX [1].
- Besides being the fastest 387 coprocessor, the
- 83D87 also offers the most accurate transcendental
- functions results of all coprocessors tested (see
- test results below). Unlike the Intel coprocessors,
- which use the CORDIC [18,19] algorithm to compute
- the transcendental functions, Cyrix uses rational
- approximations to the functions. In the past the
- CORDIC method has been popular since it requires
- only shifts and adds which makes it easy to implement.
- It is also reasonably fast. Recently, the cost for
- the implementation for fast floating-point multipliers
- has dropped significantly due to the availability of
- VLSI, making the use of rational approximations
- superior to CORDIC for the generation of transcendental
- functions [61]. The Cyrix 83D87 uses a very fast
- array multiplier, making its transcendental functions
- faster than those of any other 387 compatible
- coprocessor. It also uses 75 bits for the mantissa
- for intermediate calculations (as opposed to 68 bits
- on other coprocessors), making its transcendental
- functions more accurate than those of any other
- coprocessor or FPU (see results below). The 83D87
- and its successor, the 387+, are the 387 'clones'
- with the highest degree of compatibility. There
- are only very few SW and HW incompatibilities with
- the Intel 387DX. These have been documented by
- Cyrix [12]. The software differences are caused
- by some bugs present in the 387DX that Cyrix fixed
- for the 83D87. Unlike the Intel 387DX, the 83D87
- (and all other 387 'clones' as well) does not support
- asynchronous operation of CPU and coprocessor. There
- have also been problems in the past with the CPU -
- coprocessor communication, causing the 83D87 to
- hang on some machines. The reason was that Cyrix
- shaved off a wait state in the communication protocol,
- which caused a communications breakdown between the
- CPU and the 83D87 for some systems running at 25 MHz
- or faster. One notable example of this behavior was
- the Intel 302 board. The problem is only rarely
- encountered with the current generation of 386
- motherboards. It is possible that the problem has
- been entirely eliminated in the 387+, the successor
- to the 83D87. To reduce power consumption the 83D87
- features advanced power saving features. Those
- portions of the coprocessor that are not needed
- are automatically shut down. If no coprocessor
- instructions are being executed, all parts except
- the bus interface unit are shut down [12]. Maximal
- power consumption of the Cyrix 83D87 at 33 MHz is
- 1900 mW, typical power consumption at this clock
- frequency is 500 mW [15].
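- To illustrate why CORDIC needs only shifts and adds, here
- is a minimal Python sketch of the generic textbook
- rotation-mode CORDIC for sine and cosine. It is not the
- microcode of any Intel coprocessor. As assumptions of
- this sketch, multiplication by 2**-i stands in for the
- right shift a fixed-point hardware implementation would
- use, and the arctangent table is built with math.atan
- where hardware would hold precomputed constants:

```python
import math

ITERATIONS = 32

# Table of rotation angles atan(2^-i); constants in a real implementation.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector; GAIN is the accumulated factor.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Return (sin, cos) of theta, for theta in [-pi/2, pi/2] radians."""
    x = 1.0 / GAIN      # pre-scaled so that the final length is 1
    y = 0.0
    z = theta           # remaining angle still to be rotated through
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0   # rotate towards z == 0
        shift = 2.0 ** -i               # a right shift by i in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ANGLES[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5))
print(c, math.cos(0.5))
```

- Each iteration contributes roughly one more bit of
- accuracy, so the running time grows with the precision
- required. A rational approximation instead needs a small,
- fixed number of multiplications and one division, which
- is why it benefits from a fast hardware multiplier.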
-