NetNews Usenet Archive 1992 #20

Text File | 1992-09-15 | 62.9 KB | 975 lines

Newsgroups: comp.sys.ibm.pc.hardware
Path: sparky!uunet!spool.mu.edu!yale.edu!ira.uka.de!uni-heidelberg!rz.uni-karlsruhe.de!usenet
From: S_JUFFA@iravcl.ira.uka.de (|S| Norbert Juffa)
Subject: What you always wanted to know about math coprocessors 1/4
Message-ID: <1992Sep15.162554.10917@rz.uni-karlsruhe.de>
Sender: usenet@rz.uni-karlsruhe.de (USENET News System)
Organization: University of Karlsruhe (FRG) - Informatik Rechnerabt.
Date: Tue, 15 Sep 1992 16:25:54 GMT
X-News-Reader: VMS NEWS 1.23
Lines: 963

WHAT YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS

This document has been created to provide the net.community with some detailed information about mathematical coprocessors for the Intel 80x86 CPU family. It may also help to answer some of the FAQs (frequently asked questions) about this topic. The focus of this document is on 387 compatible chips, but there is also some information on the other chips in the 80x87 family and the Weitek coprocessors. Care was taken to make the information included as accurate as possible. If you think you have discovered erroneous information in this text, or think that a certain detail needs to be clarified, or want to suggest additions to this text, feel free to contact me at:

S_JUFFA@IRAVCL.IRA.UKA.DE

or at my snail mail address:

Norbert Juffa
Wielandtstr. 14
7500 Karlsruhe 1
Germany

This is the second version of this document and I'd like to thank those who have helped improve it by commenting on the previous version: Fred Dunlap (cyrix!volt!fred@texsun.Central.Sun.COM), Peter Forsberg (peter@vnet.ibm.com), Richard Krehbiel (richk@grevyn.com), Arto Viitanen (av@cs.uta.fi), Jerry Whelan (guru@stasi.bradley.edu), Eric Johnson (johnson%camax01@uunet.UU.NET), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John Levine (johnl@iecc.cambridge.ma.us)

CONTENTS of this document

1) What are math coprocessors?
2) What applications benefit from using a math coprocessor
3) Installing a math coprocessor
4) Description of available math coprocessors, special features, available speeds, packaging, power consumption
5) Price information
6) How do math coprocessors work
7) Performance comparison of math coprocessors
8) Test for IEEE-754 conformance and accuracy of transcendental functions for different math coprocessors
9) References (literature)
10) Addresses of manufacturers of math coprocessors
11) Appendix A: Test programs for partial compatibility checks
12) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP

What are math coprocessors?

A coprocessor in the traditional sense is a processor that extends the capabilities of a CPU in a transparent manner. This means that from the programmer's view the CPU and coprocessor together look like one machine. The 80x87 math coprocessors are typical examples of such coprocessors.

The 80x86 CPUs (with the exception of the 80486, which has a built-in 'coprocessor') can only handle 8, 16, or 32 bit integers as their primary data types. However, many applications require the use of floating-point numbers. Simply put, use of floating point numbers enables one to express not only integers, but also fractional values over a wide range. The most common application of floating point numbers is in scientific applications, where very small (e.g. Planck's constant) and very large numbers (e.g. speed of light) have to be expressed. But floating-point numbers are also useful for business applications such as computing interest.

Since the 80x86 CPUs do not support floating-point numbers or operations on them directly, they have to be programmed using the CPU's integer capabilities. This results in slow computations when floating-point numbers are used. This is where the 80x87 coprocessors come in.
Adding an 80x87 to an 80x86 based system augments the CPU architecture with eight floating point registers, five additional data types and over 70 additional mnemonics. This greatly enhances the system's capability to do floating-point computations, as the coprocessor is specifically designed to handle floating-point numbers efficiently.

Like most things in life, floating-point arithmetic has been standardized. The relevant standard, to which I will refer quite often in this document, is the IEEE-754 Standard for Binary Floating-Point Arithmetic [10,11]. The standard specifies numeric formats, value sets and how the basic arithmetic (+, -, *, /, sqrt, remainder) has to work. All the coprocessors covered in this document claim full or at least partial compliance with this standard.

When browsing the literature for information on math coprocessors, you will also encounter quite a few acronyms that refer to them: MCP (Math CoProcessor), NDP (Numerical Data Processor), NPX (Numerical Processor eXtension), FPU (Floating Point Unit). The latter usually refers to the 'built-in coprocessor' of the i486.

The only data type the 80x87 coprocessors (and the 80486 floating point unit, or FPU) can hold in their registers is an 80-bit long floating-point number. This data type (called temporary real or double extended precision) can represent numbers which range in size between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932 including denormal numbers), where '^' denotes the power operator. For those familiar with floating point formats, this format has 64 mantissa bits, 15 exponent bits and 1 sign bit for a total of 80 bits. This format provides a precision of about 19 decimal places. The 80x87 can handle additional data types that are converted to and from the internal format when they are loaded into or stored from the coprocessor.
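The range figures quoted above for the double extended format follow directly from the field widths. The following sketch recomputes them with Python's decimal module, since an ordinary double cannot hold such values. The helper name extended_limits is made up for this illustration, and the code assumes the usual IEEE-style layout: an exponent bias of 2^14 - 1 and 63 fraction bits below the explicit integer bit.

```python
from decimal import Decimal

EXP_BITS = 15    # exponent field width of the 80-bit format
FRAC_BITS = 63   # mantissa bits below the explicit integer bit (64 in total)


def extended_limits():
    """Return (smallest normal, smallest denormal, largest) magnitudes.

    Assumes an IEEE-style biased exponent; names are illustrative only.
    """
    bias = 2 ** (EXP_BITS - 1) - 1                       # 16383
    two = Decimal(2)
    min_normal = two ** (1 - bias)                       # 2^-16382
    min_denormal = two ** (1 - bias - FRAC_BITS)         # 2^-16445
    max_normal = (2 - two ** -FRAC_BITS) * two ** bias   # all mantissa bits set
    return min_normal, min_denormal, max_normal


labels = ("smallest normal", "smallest denormal", "largest")
for label, value in zip(labels, extended_limits()):
    # Print each limit with three significant digits.
    print(f"{label:18s} {value:.2E}")
```

Changing the two constants gives the corresponding limits for other formats with the same layout, provided FRAC_BITS counts only the bits below the leading mantissa bit.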
These additional data types include 16 bit, 32 bit, and 64 bit integers as well as an 18 digit BCD (binary coded decimal) number occupying 10 bytes and two additional floating point types. The short real data type, also called single precision, has 32 bits that split into 23 mantissa bits, 8 exponent bits and a sign bit. This format provides a precision of about 6-7 decimal places and can represent numbers between 1.17*10^-38 and 3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long real, or double precision, data type has 64 bits, consisting of 52 mantissa bits, 11 exponent bits and the sign bit. It provides 15-16 decimal digits of precision and can handle numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308 including denormal numbers).

In addition to loading and storing the operand types mentioned above, the 80x87 coprocessors can perform all the basic arithmetic operations on floating point numbers. Besides 'knowing' how to add, subtract, multiply and divide, they can also compare floating-point numbers, change the sign, take the square root or absolute value, compute the remainder and compute some of the transcendental functions, like the logarithm.

The eight registers in the 80x87 are organized in a stack-like manner, which takes some time getting used to if one programs the coprocessor directly in assembler. However, nowadays the compilers or interpreters for most high level languages (HLL) give the programmer access to the coprocessor's data types and use its instructions, so there is not much need to worry about the rather unusual architecture of the 80x87.

Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors, in that they do not transparently extend the CPU architecture. Rather, they could be described as special memory mapped IO-devices. Since the term coprocessor has traditionally been used for these chips, they are also called by that term in this document. The architecture of the Weitek chips differs significantly from the 80x87.
The Weitek's register file consists of 31 32-bit registers, each one capable of holding an IEEE single precision number. Pairs of consecutive single precision registers can also be used as 64-bit IEEE double precision registers. Thus there are 15 double precision registers. The Weitek register file has the standard organization known from other register files like those in the 80386, not the special stack-like organization of the 80x87 coprocessors.

The Weitek coprocessors have been tuned for maximum performance. Therefore, only a small instruction set has been implemented, but each instruction executes at a very high speed, usually in only a few clock cycles. Instructions available are load/store, add, subtract, subtract reverse, multiply, multiply and negate, multiply and accumulate, multiply and take absolute value, divide reverse, negate, absolute value, compare/test, convert fix/float, and square root. Note that the Weitek Abacus does not support a double extended format, has no built-in transcendental functions, and does not support denormals. The resources required to implement such features have instead been devoted to implementing the basic arithmetic operations as fast as possible.

While the 80x87 coprocessors perform all internal calculations in double extended precision and therefore have about the same performance for single and double precision calculations, the Weitek features explicit single and double precision operations. For applications that require only single precision, the Weitek provides additional performance that way, as single precision operations are about twice as fast as their double precision counterparts. Since the Weitek Abacus has more registers than the 80x87 coprocessors, values can be kept in registers more often and have to be loaded from memory less frequently. This also leads to a performance gain.

To the CPU, the Weitek Abacus looks like a 64 kB block of memory starting at physical address 0C0000000h.
Every address in this range corresponds to a coprocessor instruction. Accessing a specified memory location within this block with the 386/486's MOV instruction causes the corresponding instruction to be executed. The instructions have been assigned to memory locations in such a way that loads to consecutive coprocessor registers can make use of the 386/486 MOVS string instruction. The memory mapped interface of the Weitek coprocessors is much faster than the IO-oriented protocol that is used to couple the CPU to the 80287 and 80387 coprocessors.

The Weitek's starting address of 0C0000000h is only a physical address. The Weitek's memory block can be assigned to any logical address using the MMU (memory management unit) in the 386/486's protected and virtual modes. This also means that the Weitek Abacus 3167 and 4167 can *not* be used in the real mode of those processors, since the physical start address of the Weitek coprocessors is not within the 1 MByte address range and the MMU is inoperable in real mode. However, DOS programs can make use of the Weitek Abacus by using a DOS extender or a memory manager like EMM386, which run in protected/virtual mode themselves and can therefore map the Weitek's memory block to any desired location in the 1 MByte address range. Typically the FS segment register is then set up to point to the Weitek's memory block. On the 80486, this technique has severe drawbacks, as using the FS: prefix takes an additional clock cycle, thereby nearly halving the performance of the 4167. Most DOS based compilers exhibit this problem, so the only way around it is to code in assembly language [75]. The Weitek Abacus 3167 and 4167 are also supported by the UNIX operating system [33].

What applications will profit by using a math coprocessor?

According to the Intel 387DX User's Guide, there are more than 2100 commercial programs that can make use of a 387 compatible coprocessor.
Every program that uses floating point arithmetic somewhere and supports an 80x87 coprocessor can gain speed when a coprocessor is installed. However, the speedup will vary from program to program, and even within the same program, depending on how computation intensive the program or the operation within the program is. Typical applications that benefit from the use of an 80x87 coprocessor are:

- Business graphics programs, such as Arts&Letters, Freedom of Press, and Freelance
- Spreadsheet programs like Lotus 1-2-3, Excel, Quattro, and Wingz
- CAD programs such as AutoCAD, VersaCAD, and GenericCAD
- Database programs such as dBase IV, FoxBase, and Paradox
- Math and Science programs such as Mathematica, TKSolver, SPSS/PC, and Statgraphics

Note that for spreadsheets and databases, a coprocessor only helps if some kind of floating point computation is performed. This is true more often for spreadsheets than for databases. Also note that the speed of many programs depends quite heavily on the speed of the graphics adapter (CAD) or the disk performance (databases), so the computational performance is only a (small) part of the total performance of the application. There are some programs that won't run without a coprocessor, among them AutoCAD R10 and later and Mathematica.

Most GUIs (graphical user interfaces) such as Microsoft Windows or OS/2's Presentation Manager do *not* gain additional speed from using a *mathematical* coprocessor, since their graphics operations only use integer arithmetic [71]. They do benefit, though, from a graphics board with a graphics 'coprocessor' that speeds up certain common operations such as BitBlt or line drawing. A few GUIs used on PCs, e.g. X-Windows, use a certain amount of floating point arithmetic for operations such as arc drawing. However, the use of floating-point operations in X-Windows seems to have decreased significantly in versions after X11R3, so the overall performance impact of a coprocessor is small [72].
Applications running under any GUI may take advantage of a math coprocessor though, e.g. Excel under MS-Windows.

While support for 80x87 coprocessors is very common in application programs, the Weitek Abacus coprocessors do not enjoy such widespread support. Due to their high price, only a few high-end PCs have been equipped with Weitek coprocessors. Therefore most of the programs that support these coprocessors are also high-end products like AutoCAD and Versacad-386.

Installing a math coprocessor

Usually, installing a coprocessor doesn't pose much of a problem, as every coprocessor comes with installation instructions and a diagnostic disk that lets you check for correct operation once the coprocessor has been installed. In addition, the user manuals of most computers have a section on coprocessor installation.

1) Make sure to get the right coprocessor for your system. An 8087 works together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or compatible works together with an 80286 CPU. There are also some old 386 motherboards that accept an 80287 coprocessor, but they usually also provide a socket for the 387, and I recommend getting a 387 for use with these systems. An 80387, 387DX or compatible coprocessor is for 386 based systems, as is the Intel RapidCAD. 387 coprocessors also work together with Cyrix' 486DLC CPU, which despite its name does not include an FPU. Similarly, the 387SX or a compatible coprocessor goes into systems whose CPU is a 386SX or Cyrix 486SLC. The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC socket in your system. Some computers, such as IBM's PS/2s, don't have this socket. The Weitek Abacus 4167 works together with the 486 and requires the appropriate 142-pin socket to be present.

Always install a coprocessor that is rated at the same speed as the CPU. That is, for a 40 MHz 386 system using an AMD Am386-40, install a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40, IIT 3C87-40, or ULSI 83C87-40.
Running a coprocessor above its specified frequency rating may cause it to produce false results, which you might fail to recognize as such. I have personally experienced this problem with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some commercial system test programs. However, I found it to fail the Whetstone and Linpack benchmarks, which include accuracy checks. So although there is usually no problem with overheating when pushing a coprocessor over the specified maximum frequency rating, be warned that operating a coprocessor above the maximum ratings stated by the manufacturer makes it unreliable.

Some 386 boards allow the coprocessor to be clocked differently than the CPU. This is called asynchronous operation and allows you to run the coprocessor at 33 MHz while the CPU runs at 40 MHz, for example. Please note that only the Intel 80387 and 387DX support asynchronous operation. The 387 'clones' from Cyrix, C&T, IIT and ULSI always run at the full speed of the CPU, even if you have set up your motherboard for asynchronous operation.

2) Once you've got the correct coprocessor for your system you can start the actual installation process:

- Turn off the computer's power switch and unplug the power cord from the wall outlet.
- Remove the cover of your computer.
- Locate the math coprocessor socket. This socket is located right next to the CPU, which can be identified by the printing on top of the chip. The CPU usually is one of the biggest chips on the board. The 8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on each of the longer sides. The 387SX PLCC socket is a square socket that has 17 vertical connector strips on the 'wall' of each side. The 387 PGA socket is square and has two rows of pin holes on each side. The EMC socket is similar but has three rows of holes on each side.
The PGA socket for the Weitek 4167 is also square with three rows of holes on each side. If the CPU and coprocessor socket are on a separate card rather than on the motherboard (typical for modular systems), you have to remove the card and place it on a flat and hard surface free of static electricity. If you can't find the math coprocessor socket, consult your owner's manual or your computer dealer. If you want to install the Intel RapidCAD in a 386 system, you will have to remove the 386 CPU before starting to install the two RapidCAD chips. Intel provides an easy to use chip extractor and a storage box for the 386 chip for this purpose. Just follow the instructions in the RapidCAD installation manual.
- On many systems, the motherboard is only supported at a small number of points. Since considerable force is required to insert a pin grid chip like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the board may bend quite a lot due to the insertion pressure. This may lead to cracks in the board that may render it inoperable. Damage done to the board in this way is usually not covered by the warranty on the part. Therefore, it may be a good idea to check how much the board bends by pressing on the math coprocessor socket with a finger. If you find it to bend easily, try to put something under the board directly beneath the coprocessor socket. If this is impossible, as in most desktop cases, consider removing the whole motherboard from the case and placing it on a hard and flat surface free of static electricity.
- Be sure you are properly grounded before you remove the coprocessor from its antistatic box. Static electricity can damage the coprocessor. Make sure you do not touch the pins.
- Check that all pins are straight and not bent. If you find bent pins, carefully straighten them with needle-nose pliers or tweezers.
- Match the coprocessor's orientation with the orientation of the socket.
8087 and 287 coprocessors have a notch on one of the shorter sides of their rectangular DIL package that should be matched with the notch of the coprocessor socket. Usually the 286 CPU and the 287 coprocessor are placed alongside each other and both have the same orientation, that is, their respective notches point in the same direction. 387SX coprocessors feature a white dot or similar mark that matches with some sort of marking on the socket. 387 coprocessors have a beveled corner that is also marked with a white dot or similar marking. This should be matched with the beveled or otherwise marked corner of the socket. If you install a 387 coprocessor in an EMC socket, leave one row of holes free on each side. Correct orientation of the coprocessor is absolutely essential, because if you insert it the wrong way it may be damaged. Once you have found the correct orientation, make sure all pins are correctly aligned with their respective holes. Press firmly and evenly on the chip. You may have to press hard to seat the coprocessor all the way. Make sure your motherboard does not bend more than slightly under the insertion pressure. Otherwise it may develop cracks that could damage the signal lines on the board. For 8087, 287, and 387 coprocessors it is normal that the coprocessor does not go all the way in, but that about one millimeter (1/25 inch) of space is left between the socket and the bottom of the coprocessor chip. This enables the insertion of an extraction device should it become necessary to remove the coprocessor. Note that the construction of the 387SX's PLCC socket makes it next to impossible to remove the coprocessor once fully inserted, as the top of the chip is then level with the socket's 'walls'.

3) Check your computer's manual for the jumpers and/or switches you may have to set for coprocessor operation. Put the cover back on the system unit and reconnect the power. Turn on your computer.
Depending on your BIOS, you may have to run the setup or configuration program to register the coprocessor. Use the diagnostic disk included with your coprocessor to check for correct operation of your coprocessor.

Coprocessor emulations

In the absence of a coprocessor, floating-point calculations are most often performed by a software package that simulates the operations of the coprocessor. Such a program is called a coprocessor emulator. Simulating the coprocessor has the advantage that identical code can be generated for use with either the coprocessor or the emulator, so that it is possible to write programs that run without any changes on both systems with and systems without a coprocessor. Whether the program is to use the coprocessor or the emulator can then be determined at run-time by checking if a math coprocessor is present in the system.

Two approaches to interfacing an 80x87 emulator to programs are common. While the first method works with all 80x86 processors, the second only works from the 80286 on.

The first method makes use of the fact that all coprocessor instructions start with the same five bit pattern 11011. Thus the first byte of a coprocessor instruction will be in the range D8-DF hexadecimal. In addition, coprocessor instructions usually are preceded by a WAIT instruction (opcode 9Bh), which is one byte long (the reason for doing this is described in a later chapter on the operation of the 80x87). One common approach is to replace the WAIT instruction and the first byte of the coprocessor instruction with one of eight interrupts; the remaining bytes of the coprocessor instruction are left unchanged. Interrupts 34 to 3B hexadecimal are used for this emulation technique. Note that the sequences 9B D8 .. 9B DF can be easily converted to the interrupt instructions CD 34 .. CD 3B by simple addition and subtraction of constants.
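The conversion between the two byte sequences can be sketched as follows. This only illustrates the arithmetic on the first two bytes, assuming the simple case of a WAIT byte (9Bh) immediately followed by a coprocessor opcode byte (D8h-DFh). The function names are made up for this example, and it is not meant to reproduce what any particular compiler or run-time library actually does.

```python
WAIT = 0x9B        # one-byte WAIT opcode
INT = 0xCD         # opcode of the two-byte INT n instruction
FIRST_ESC = 0xD8   # lowest coprocessor opcode byte (11011000b)
FIRST_INT = 0x34   # lowest interrupt number used for emulation


def to_emulator_call(code: bytes) -> bytes:
    """Rewrite 'WAIT; coprocessor instruction' as 'INT 34h..3Bh; ...'."""
    if len(code) < 2 or code[0] != WAIT or not 0xD8 <= code[1] <= 0xDF:
        raise ValueError("not a WAIT-prefixed coprocessor instruction")
    first = code[0] + (INT - WAIT)                # 9Bh + 32h = CDh
    second = code[1] - (FIRST_ESC - FIRST_INT)    # D8h..DFh - A4h = 34h..3Bh
    return bytes([first, second]) + code[2:]      # remaining bytes unchanged


def to_coprocessor_instruction(code: bytes) -> bytes:
    """Inverse mapping: turn 'INT 34h..3Bh; ...' back into 'WAIT; ...'."""
    if len(code) < 2 or code[0] != INT or not 0x34 <= code[1] <= 0x3B:
        raise ValueError("not an emulator interrupt call")
    first = code[0] - (INT - WAIT)
    second = code[1] + (FIRST_ESC - FIRST_INT)
    return bytes([first, second]) + code[2:]


# Example: 9B D9 E8 is WAIT followed by FLD1 (opcode bytes D9 E8).
original = bytes([0x9B, 0xD9, 0xE8])
patched = to_emulator_call(original)
print(original.hex(), "->", patched.hex())
```

Because the mapping is a fixed offset on each of the two bytes, it is reversible, which is what allows a fix-up routine to restore the real coprocessor instruction in place.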
The compiler or assembler produces code that contains the appropriate interrupt calls instead of the coprocessor instructions. If a coprocessor is detected at run-time, the emulator interrupts point to a short routine that converts the interrupt calls back to coprocessor instructions (self modifying code). If no coprocessor is found, the interrupts point to an emulation package which examines the byte(s) following the interrupt instruction to determine what operation to perform. The method described is used by the compilers from Microsoft and Borland, for example. It works with every 80x86 CPU from the 8086/8088 on.

The second method to interface an emulator is only available on 286 and 386 machines. If the emulation bit in the machine status word of these processors is set, the processors will generate an interrupt 7 whenever a coprocessor instruction is encountered. The interrupt vector then points to an emulation package that decodes the instruction and performs the desired operation. This approach has the advantage that the emulator doesn't have to be included in the program code, but can be loaded as a TSR or device driver once and then used by every program that requires a coprocessor. Emulation via interrupt 7 is transparent, which means that programs containing coprocessor instructions execute just as if a coprocessor were present, only slower. This approach is taken by the public domain EM87 emulator and the commercial Franke387 emulator, for example. Even programs that require a coprocessor to run, like AutoCAD, are 'fooled' into believing that a coprocessor is present by emulators using INT 7.

The size of the emulator used by TP 6.0 is about 9.5 kB, EM87 occupies about 15.8 kB as a TSR, and Franke387 uses about 13.4 kB as a device driver. Note that Franke387 and especially EM87 model a real coprocessor much more closely than Turbo Pascal's emulator does. In particular, EM87 supports denormal numbers, precision control, and rounding control.
The emulator in TP 6.0 does not implement these features. The version of Franke387 tested (V2.4) supports denormals in single and double precision, but not double extended precision. It supports precision control, but not rounding control. Intel's E80287 is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Generally, the more closely a real coprocessor is modelled by the emulator, the slower the emulator runs and the larger its code gets.

Relative execution times of coprocessor vs. software emulators for particular coprocessor instructions:

                  Intel 387DX   TP 6.0 Emulator   EM87 Emulator
FADD ST, ST(0)         1              26               104
FDIV [DWord]           1              22               136
FXAM                   1              10                73
FYL2X                  1              33               102
FPATAN                 1              36               110
F2XM1                  1              38               110

The following table is an excerpt from [44]:

                  Intel 80287   Intel E80287 Emulator
FADD ST, ST(0)         1                 42
FDIV [DWord]           1                266
FXAM                   1                139
FYL2X                  1                 99
FPATAN                 1                153
F2XM1                  1                 41

The following has been adapted from [43] and merged with my own data:

                  Intel 8087   TP 6.0 Emul. (8086)   Intel Emul. (8086)
FADD ST, ST(0)         1               20                    94
FDIV [DWord]           1               22                    82
FPTAN                  1               18                   144
F2XM1                  1                6                   171
FSQRT                  1               44                   544

One of the reasons emulators are so slow is that they are often designed to run with every CPU from the 8086/8088 on. This is the case with the emulators built into the libraries of the Turbo Pascal 6.0 compiler (also used by Turbo C/C++) and the Microsoft C 6.0 compiler (probably also used in other Microsoft products), and it is also true for the EM87 emulator in the public domain. By using code that can run on an 8086/8088, these emulators forego the speed advantage offered by the additional instructions and architectural enhancements (such as 32-bit registers) of the more advanced Intel 80x86 processors. A notable exception is the Franke387 emulator, a commercial emulator that is also sold as shareware. It uses 386 specific 32-bit code and only runs on 386/386SX computers.

Besides being slow, coprocessor emulators have other drawbacks compared with real coprocessors.
Most of the emulators do not support the additional instructions that the 387 compatible coprocessors offer over the 80287. Often, some of the low-level stack-manipulating instructions like FDECSTP are not emulated. For example, [76] lists the coprocessor instructions not emulated by Microsoft's emulator (included in the MS-C and MS-Fortran libraries) as follows:

FCOS      FRSTOR    FSINCOS   FXTRACT
FDECSTP   FSAVE     FUCOM
FINCSTP   FSETPM    FUCOMP
FPREM1    FSIN      FUCOMPP

Often, some parts of the coprocessor architecture, like the status register, are not emulated or only partially emulated. Some emulators do not conform to the IEEE-754 standard in their implementation of the basic arithmetic functions, while the coprocessors do. Also, they sometimes lack support for denormals (a special class of floating point numbers) although it is required by the standard. Not all the 80x87 emulators support rounding control and precision control (features required by IEEE-754). Most of the omissions are aimed at making the emulator faster and smaller.

Because of the shortcomings of coprocessor emulators, a real coprocessor is a must for anybody planning to do some serious computations. At today's prices, this shouldn't pose much of a problem to anybody.

Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone coprocessor emulators for PCs, among them the two emulators already mentioned, EM87 and Franke387 V2.4. He found Franke387 to be the best in terms of reliability, speed, and accuracy.

Available coprocessors, CPU+FPU as of 08-10-92:

Intel 8087 [43] was the first coprocessor that Intel brought out for the 80x86 family. It was introduced in 1980 and therefore does not have full compatibility with the IEEE-754 standard for floating point arithmetic, which was finally released in 1985. It complements the 8088 and 8086 CPUs and can also be interfaced to the 80188 and 80186 processors. It comes in a 40 pin CERDIP (ceramic dual inline package).
It is available in 5 MHz, 8 MHz (8087-2), and 10 MHz (8087-1) versions. The 8087 is implemented using NMOS. Power consumption is rated at max. 2400 mW [42]. A neat trick to enhance the processing power of the 8087 for computations that use only the basic arithmetic operations (+, -, *, /) and do not require high precision is to set the precision control to single precision. This gives a performance increase of up to 20%. For details about programming the precision control, see program PCtrl in appendix A.

Intel 80187 is a rather new coprocessor designed to support the 80C186 embedded controller. It was introduced in 1989 and implements the complete 80387 instruction set. It is available in a 40 pin CERDIP (ceramic dual inline package) and a 44 pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation. Power consumption is rated at max. 675 mW for the 12.5 MHz version and max. 780 mW for the 16 MHz version [37].

Intel 80287 [44] is the original Intel coprocessor for the 80286 and was introduced in 1983. It uses the same execution unit as the 8087 and therefore has the same speed (sometimes it is slower due to additional overhead in CPU coprocessor communication). Like the 8087, it does not provide full compatibility with the IEEE-754 floating point standard released in 1985. It was manufactured in NMOS technology. There are 6 MHz, 8 MHz, and 10 MHz versions. The chip comes in a 40 pin CERDIP (ceramic dual inline package). Power consumption can be estimated to be the same as that for the 8087, which is max. 2400 mW. The 80287 has been replaced in the Intel 80x87 family by its successor, the Intel 287XL, which was introduced in 1990. The 287XL is done in CMOS. It is based on the 387 core and therefore much faster than the 80287. There may still be a few of the old 80287 chips on the market though.

Intel 80287XL is the second generation 287 introduced by Intel in 1990.
Since it is based on the 387 core, it features full IEEE 754 compatibility and faster execution of coprocessor instructions. Intel claims about 50% faster operation than the 80287 for typical benchmark tests such as Whetstone [45]. Comparison with benchmark results for the AMD 80C287, which is identical to the Intel 80287, supports this claim [1]. The Intel 287XL performed 66% faster than the AMD 80C287 on the fractal benchmark and 66% faster on the Whetstone benchmark in these tests. Whetstone results from [46] show the Intel 287XL at 12.5 MHz to perform 552 kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91% performance increase. A benchmark using the MathPak program showed the Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0 sec.) [26]. Since the 287XL has all the additional instructions and enhancements of a 387, most software automatically identifies it as an 80387 compatible coprocessor and makes use of the extra features available, like the FSIN and FCOS instructions. The 287XL is done in CMOS and therefore uses less power than the older 80287, which was done in NMOS. The 287XL is rated for speeds of up to 12.5 MHz. At 12.5 MHz, the power consumption is rated at max. 675 mW, about 1/4 of the 80287 power consumption. The 287XL comes in either a 40 pin CERDIP (ceramic dual inline package) or a 44 pin PLCC (plastic leaded chip carrier). The latter version is called the 287XLT and is intended mainly for laptop use.

AMD 80C287 is an exact clone of the old Intel 80287 that was brought to market by AMD in 1989. It contains the original microcode of the 80287 and is therefore 100% compatible with this chip. However, as the name indicates, the 80C287 is manufactured in CMOS and therefore uses less power than an equivalent Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW, or slightly less than that of the Intel 80287XL [27].
There is also another version called AMD 80EC287 that uses an 'intelligent' power save feature to reduce the power consumption below 80C287 levels. Tests at 10.7 MHz show typical power consumption for the 80EC287 to be 30 mW, compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL, and 1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally suited for low power laptop systems. The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. I have only seen it being offered in 10 MHz and 12 MHz versions though. At about US$ 50, it is the cheapest coprocessor available. Note that it provides less performance than the newer Intel 287XL (see above for details). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs (dual inline package) and as 44 pin PLCC (plastic leaded chip carrier). Due to recent legal battles with Intel over the right to use the 287 microcode, which AMD lost, AMD may have to discontinue this product (disclaimer: I am not a legal expert).

Cyrix 82S87 was developed from the Cyrix 83D87, Cyrix' 387 'clone', and has been available since 1991. It implements the full 387 instruction set. It fully complies with the IEEE-754 standard for floating point arithmetic and features nearly total compatibility with Intel's coprocessors. It implements the transcendental functions with the same degree of accuracy and the superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the fastest [1] and most accurate 287 compatible coprocessor available. Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5 MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87 chips sold after about August 1992 use the internals of the Cyrix 387+, which succeeds the original 83D87 [73]. The 82S87 is a fully static CMOS design with very low power requirements that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the 82S87 to consume about the same amount of power as the AMD 80C287 (see above).
The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded chip carrier) compatible with the pinout of the Intel 287XLT and ideally suited for laptop use.

IIT 2C87 was the first 287 clone available. It was introduced to the market in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87 implements the full 387 instruction set [38]. Tests I ran on the 3C87 seem to indicate that it is not fully compatible with the IEEE-754 standard for floating-point arithmetic (see below for details), so it can be assumed that the 2C87 also fails these tests, as it presumably uses the same core as the 3C87.

The IIT 2C87 provides extra functions not available on any other 287 chip [38]. It has 24 user accessible floating-point registers organized into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. Transfers between registers in different banks are not supported however, so this feature by itself is of limited usefulness. Also there seems to be only one status register (containing the stack top pointer), so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40]. The main purpose of the register banks is to aid the fourth additional instruction the 2C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D graphics applications [39]. The built-in matrix multiply speeds this operation up by a factor of 6 to 8 compared with a programmed solution according to the manufacturer [38]. Tests show the speed-up to be indeed in this range [40]. For the 3C87, I measured the execution time of F4X4 to be about 280 clock cycles; the execution time on the 2C87 should be somewhat longer. I estimate it to be around 310 clock cycles due to the higher CPU-NDP communication overhead in instruction execution in 286/287 systems (~45-50 clock cycles) compared with 386/387 systems (~16-20 clock cycles).
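
As a rough illustration of what F4X4 computes, here is a minimal Python sketch of the "programmed solution": a 4x4 matrix times a 4x1 vector. The function names, the translation matrix, and the sample point are made up for this example and are not taken from IIT documentation. The second helper shows one plausible way to arrive at the ~310 cycle estimate, assuming the only difference between the 3C87 and the 2C87 is the CPU-NDP communication overhead quoted above.

```python
# Plain-software equivalent of the product that F4X4 computes in hardware:
# a 4x4 matrix times a 4x1 column vector. All names here are hypothetical.

def mat4_times_vec4(m, v):
    """Return m * v for a 4x4 matrix m (list of rows) and a 4-vector v."""
    return [sum(m[row][col] * v[col] for col in range(4)) for row in range(4)]

# Example data (assumed): translate the point (1, 2, 3) by (10, 20, 30)
# using homogeneous coordinates, as is common in 3D graphics.
TRANSLATE = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
POINT = [1.0, 2.0, 3.0, 1.0]

def estimate_2c87_cycles(cycles_3c87, overhead_387, overhead_287):
    """Assumption: swap the 386/387 communication overhead for the 286/287 one."""
    return cycles_3c87 - overhead_387 + overhead_287

if __name__ == "__main__":
    print(mat4_times_vec4(TRANSLATE, POINT))   # [11.0, 22.0, 33.0, 1.0]
    print(estimate_2c87_cycles(280, 16, 45))   # 309
    print(estimate_2c87_cycles(280, 20, 50))   # 310
```

Both ends of the quoted overhead ranges give an estimate of roughly 310 clock cycles, which matches the figure in the text.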
As useful as the F4X4 instruction may seem, there are only very few applications that make use of this feature if an IIT coprocessor is detected at run time, among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25]. The 2C87 is available for speeds of up to 20 MHz. It is implemented in an advanced CMOS process and therefore has a low power consumption of typically about 500 mW [38].

Intel 387 was the first generation of coprocessors for the Intel 386. It was introduced in 1986, about one year after introduction of the 80386. Early 386 systems were therefore equipped with an 80287 and an 80387 socket. The 80386 works together with the 80287 but the numerical performance is hardly adequate for such a system. The 80387 has since been superseded by the Intel 387DX introduced by a quiet change in 1990. You might find it when acquiring an old 386 machine, though. The 80387 is about 20% slower than the newer 387DX (see the paragraph below for detailed information). Like the other 387 coprocessors, the 80387 is packaged in a 68-pin ceramic PGA. The Intel 80387 is manufactured using Intel's older 1.5 micron CHMOS III technology that has moderate power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW typical), at 20 MHz it is max. 1550 mW (950 mW typical), and at 25 MHz it is max. 1950 mW (1250 mW typical) [60].

Intel 387DX is the second generation Intel 387 that was quietly introduced in 1989. This version is done in a more advanced CMOS process than the 80387 that enables the coprocessor to run at a maximum frequency of 33 MHz, while the 80387 had a maximum frequency of 25 MHz. The 387DX is about 20% faster than the 80387 on average for the same clock frequency. For a 386/387 system operating at 29 MHz the Whetstone benchmark compiled with the highly optimizing Metaware High-C V1.6 runs at 2377 kWhetstones/sec for the 80387 and at 2693 kWhetstones/sec for the 387DX, a 13% increase.
In a fractal calculation programmed in assembly language, the 387DX performance was 28% higher than the performance of the 80387. The transcendental functions have also been sped up from the 80387 to the 387DX. In the Savage benchmark compiled with the Metaware High-C V1.6 optimizing compiler and running on a 29 MHz system, the 80387 evaluated 77600 function calls/second, while the 387DX evaluated 97800 function calls/second, a 26% increase [7]. Some instructions have been sped up a lot more than the average 20%. For example the FBSTP instruction has been sped up by a factor of 3.64.

The Intel 387DX (and its predecessor 80387) are the only 387 coprocessors that support asynchronous operation of CPU and NDP. The 387 consists of a bus interface unit and a numerical execution unit. The bus interface unit always runs at the speed of the CPU clock (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc, the numerical execution unit runs at the same speed as the bus interface unit. If CKM is tied to ground, the numerical execution unit runs at the speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10. For example, for a 20 MHz 386, the Intel 387DX could be clocked from 12.5 MHz (20 x 10/16) to 28 MHz (20 x 14/10) via the NUMCLK2 input. On the Cyrix 83D87, Cyrix 387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These coprocessors always run at the speed of the CPU.

The Intel 387DX is manufactured using Intel's advanced low power CHMOS IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW typical), at 25 MHz it is max. 1050 mW (625 mW typical), and at 33 MHz it is max. 1250 mW (750 mW typical) [59].

Intel 387SX is the coprocessor for the Intel 386SX. The 386SX is an Intel 386 with a 16-bit data path. This somewhat reduces the cost of building a complete system as compared to the full 32-bit design required by the 80386DX.
The main purpose of the 386SX was to replace the 80286 CPU, which Intel subsequently stopped producing. Due to the 16-bit data path, the 386SX is slower than the 386DX and offers about the same speed as an 80286 at the same clock frequency for 16-bit applications. As the 386SX is a complete 80386, it also offers the possibility to run 32-bit applications and supports the virtual 8086 mode used for example by Windows' enhanced mode. The 387SX has all the features the Intel 387DX offers, including the ability for asynchronous operation of CPU and coprocessor (see the above paragraph on the Intel 387DX for details). Due to the 16-bit data path between the CPU and the coprocessor, the 387SX is a bit slower than a 387DX operating at the same frequency. The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package and is available in 16 MHz and 20 MHz versions. Coprocessors for faster 386SX systems based on the Am386SX CPU are available from IIT, Cyrix, and ULSI. Power consumption for the 387SX at 16 MHz is max. 1250 mW (740 mW typical), for the 20 MHz version it is max. 1500 mW (1000 mW typical) [62].

IIT 3C87 came out in 1989 at about the same time as the Cyrix 83D87. Both coprocessors are faster than Intel's 387DX coprocessor. Tests I ran with the IEEETEST program show that the 3C87 is not fully compatible with the IEEE-754 standard for floating-point arithmetic although the manufacturer claims otherwise. It is quite possible that the reported errors are due to personal interpretations of the standard by the program's author that have been incorporated into IEEETEST and that the standard also supports the different interpretation chosen by IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST have become somewhat of an industry standard [66] and Intel's 387, 486, and RapidCAD chips pass the test without a single failure, so the fact that the IIT 3C87 fails some of the tests indicates that it is not fully compatible with the Intel 387 coprocessor.
My tests also show that the IIT 3C87 does not support denormals for the double extended format. It is not entirely clear whether the IEEE standard mandates support for extended precision denormals, as the IEEE-754 document explicitly only mentions single and double precision denormals. Missing support for denormals is not a critical issue with most applications but there are some programs for which support of denormals is quite helpful, if not important [41]. In any case, failure of the 3C87 to support extended precision denormal numbers is an incompatibility with the Intel 387 and 486.

The 3C87 provides extra functions not available on any other 387 chip [38]. It has 24 user accessible floating-point registers organized into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. Transfers between registers in different banks are not supported however, so this feature by itself is of limited usefulness. Also there seems to be only one status register (containing the stack top pointer), so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40]. The main purpose of the register banks is to aid the fourth additional instruction the 3C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D graphics applications [39]. I measured this instruction to execute in about 280 clock cycles, during which time it executes 16 multiplications and 12 additions. The built-in matrix multiply speeds the matrix by vector multiply up by a factor of 3 compared with a programmed solution according to IIT [39]. The results for my own TRNSFORM benchmark support this claim (see results below), showing a performance increase by a factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly as fast as on an Intel 486 at the same clock frequency.
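
The operation counts quoted for F4X4 follow directly from the shape of the product, as this small Python sketch shows. The function names are invented for the example; the only measured input is the figure of about 280 clock cycles given above, and the per-operation average is just that figure divided by the operation count, not a statement about how the chip schedules its work internally.

```python
def matvec_op_counts(rows=4, cols=4):
    """Count the multiplications and additions in a (rows x cols) * vector product."""
    mults = rows * cols          # one multiply per matrix element
    adds = rows * (cols - 1)     # summing `cols` products takes cols - 1 additions
    return mults, adds

def cycles_per_operation(total_cycles, mults, adds):
    """Average clock cycles per floating-point operation."""
    return total_cycles / (mults + adds)

if __name__ == "__main__":
    m, a = matvec_op_counts()
    print(m, a)                              # 16 12
    print(cycles_per_operation(280, m, a))   # 10.0
```

On that basis F4X4 averages about 10 clock cycles per floating-point operation, including all overhead.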
However, there are only very few applications that make use of this feature if an IIT 3C87 is detected at run time, among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25].

Like the 387 'clones' from Cyrix and ULSI, the 3C87 does not support asynchronous operation of the CPU and the coprocessor. The 3C87 always runs at the full speed of the CPU. The 3C87 is implemented in an advanced CMOS process and has low power requirements of typically about 600 mW. It is available in 16, 20, 25, 33, and 40 MHz versions.

IIT 3C87SX is the version of the IIT 3C87 that is intended for use with Intel's 386SX or AMD's Am386SX CPU. It is functionally equivalent to the IIT 3C87. Due to the 16-bit data path between the CPU and the coprocessor in a 386SX based system, coprocessor instructions will execute somewhat slower than on the 3C87. The IIT 3C87SX is the only 387SX coprocessor that is offered at speeds of 16, 20, 25, and 33 MHz right now. I have read that Cyrix has also announced an 83S87-33, but haven't seen it being offered yet. The 3C87SX is packaged in a 68-pin PLCC.

Cyrix 83D87 was introduced in 1989, only shortly after the coprocessors from IIT. It has been the fastest 387 compatible coprocessor in several benchmark comparisons [1,7,68,69]. It also came out as the fastest coprocessor in my own tests (see benchmark results below). Although the Cyrix 83D87 provides up to 50% more performance than the Intel 387DX in benchmark comparisons, the speed advantage over other 387 compatible coprocessors in real applications is usually much smaller. For example, in a test using the program 3D-Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1]. Besides being the fastest 387 coprocessor, the 83D87 also offers the most accurate transcendental function results of all coprocessors tested (see test results below). The new version of the 83D87, which is sold as 387+ in Europe, even surpasses the level of accuracy of the original 83D87 design.
Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to compute the transcendental functions, Cyrix uses polynomial and rational approximations to the functions. In the past the CORDIC method has been popular since it requires only shifts and adds, which makes it easy to implement. It is also reasonably fast. Recently, the cost of implementing fast floating-point multipliers has dropped significantly due to the availability of VLSI, making the use of polynomial and rational approximations superior to CORDIC for the generation of transcendental functions [61]. The Cyrix 83D87 uses a fast array multiplier, making its transcendental functions faster than those of any other 387 compatible coprocessor. It also uses 75 bits for the mantissa in intermediate calculations (as opposed to 68 bits on other coprocessors), making its transcendental functions more accurate than those of any other coprocessor or FPU (see results below).

The 83D87 and its successor, the 387+, are the 387 'clones' with the highest degree of compatibility. There are only very few SW and HW incompatibilities with the Intel 387DX. These have been documented by Cyrix [12]. The software differences are caused by some bugs present in the 387DX that Cyrix fixed for the 83D87. Unlike the Intel 387DX, the 83D87 (and all other 387 'clones' as well) does not support asynchronous operation of CPU and coprocessor. There have also been problems in the past with the CPU-coprocessor communication, causing the 83D87 to hang on some machines. The reason was that Cyrix shaved off a wait state in the communication protocol, which caused a communications breakdown between the CPU and the 83D87 for some systems running at 25 MHz or faster. One notable example of this behavior was the Intel 302 board. The problem is only rarely encountered with the current generation of 386 motherboards. It is possible that the problem has been entirely eliminated in the 387+, the successor to the 83D87.
To reduce power consumption, the 83D87 has advanced power saving features. Those portions of the coprocessor that are not needed are automatically shut down. If no coprocessor instructions are being executed, all parts except the bus interface unit are shut down [12]. Maximum power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW; typical power consumption at this clock frequency is 500 mW [15].