home *** CD-ROM | disk | FTP | other *** search
Text File | 1993-04-27 | 69.6 KB | 1,679 lines |
- MFLOAT Version 1.00
- ===================
-
- A brief description of "MFLOAT":
-
- * "MFLOAT" handles arithmetic using high precision numbers of up to 72 decimal
- digits.
-
- * "MFLOAT" is optimized for maximum speed for high precision numbers.
- These subroutines are the fastest for the 8086 microprocessor known to us.
- If you find faster subroutines for a 8086, please send me a mail.
-
- * "MFLOAT" subroutines are much faster than a standard emulation of a
- coprocessor for the same accuracy. "MFLOAT" does not use the coprocessor.
-
- * "MFLOAT" is written in assembler for the 8086 microprocessor and
- compatibles. The calculation is 10 to 20 times faster than that of
- subroutines written in C or PASCAL. This is important for tedious
- calculations like an optimization, where you have problems with the
- accuracy. For this problems you can avoid the use of a super computer
- which you feed with a C or PASCAL program for high precision numbers.
-
- * All important functions of PASCAL and of a C library ("MATH.H") are
- included. This includes the transcedental functions like sin(x), atan(x),
- exp(x), log(x)....
-
- * Programs written in C or C++, can use MFLOAT simply by replacing the
- double data format with the "MFLOAT" data format. The operators and the
- functions are overloaded for C++ and so only little changes of the source
- code are necessary. (You need BORLAND C 3.1).
-
- * The subroutines are included in compiled form (object files) in a program.
- Therfore you can use them for PASCAL, C and C++. The subroutines are
- tested for BORLAND PASCAL 7.0, TURBO PASCAL 6.0 and BORLAND C 3.1.
-
- * This package is limited to about 72 decimal digits. You may license the
- source code. Then you can extend the precision for the algebraic
- calculations up to 10000 decimal digits (upper limit is due to the memory
- limit of one segment of the 8086) and for transcendental function up to 500
- digits.
-
- * The internal data format is a binary floating point format. The length of
- the mantissa can be choosen by the user from one word to 15 words.
- (1 word = 16 bits). This correspondeds to about 5, 10, 14, 19, 24,
- 29, 34, 39, 43, 48, 53, 58, 63, 67 and 72 digits of a decimal number.
- 1 word represents 4.81648 digits of a decimal number in average.
- The exponent of two has the length of 16 bits. The largest representable
- number is about 7.07E+9863, the smallest is about 7.07E-9865.
-
- ***************************************************************************
-
- The package includes following files:
- READ.ME - general information
- MFLOAT.DOC - describtion of the subroutines and mfloat numbers
- MFLOATA.OBJ - subroutines written in assembler
- MFLOATB.OBJ - subroutines written in C
- MFLOAT.H - header file for C
- MFLOAT.CXX - file for C++ : overloading functions and operators
- PFLOAT.PAS - unit: declarations of the external subroutines
- PI.PAS - example in PASCAL, calculation of pi
- BESSEL.PAS - example in PASCAL, calculation of the bessel function
- POLAR.PAS - example in PASCAL, conversion cartesian to polar coordinates
- PI.C - example in C, calculation of pi
- BESSEL.C - example in C, calculation of the bessel function
- POLAR.C - example in C, conversion cartesian to polar coordinates
- PI.CPP - example in C++, calculation of pi
- BESSEL.CPP - example in C++, calculation of the bessel function
- POLAR.CPP - example in C++, conversion cartesian to polar coordinates
- PI.EXE - example program executable
-
- The source code option includes further files:
- MFLOATA.ASM - source code in assembler (basic subroutines)
- MFLOATB.C - source code in C (transcendental functions and extensions)
- MFLOATC.ASM - constants (include file of "MFLOATA.ASM")
- CALCCON.PAS - calculation of constants in "MFLOATC.ASM" in PASCAL source
- CALCCON.EXE - calculation of constants in "MFLOATC.ASM"
-
- Registration:
- see READ.ME
-
- The authors of MFLOAT are interested in a feedback from the users of "MFLOAT".
- If you have problems or you find some bugs (the subroutines are checked very
- thorougly, but the probability, that a software of large dimension has no bug
- is not large in reality), please describe them and send us a mail per
- internet. You help us to remove this errors of the subroutines.
-
- Many thanks for your help and much success with "MFLOAT".
-
-
- Kaufmann Friedrich & Mueller Walter
- Students at the Technical University of Graz
-
- surface mail:
- Kaufmann Friedrich
- Schuetzenhofgasse 22
- A-8010 GRAZ
- AUSTRIA
- EUROPE
-
- email address:
- fkauf@fstgds06.tu-graz.ac.at
-
- ****************************************************************************
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ****************************************************************************
-
-
- Multiprecision calculations with "MFLOAT"
- =========================================
-
- "MFLOAT" is a package of subroutines written in assembler for a 8086 micro-
- processor or compatibles. It is made for users having problems with the
- accuracy of standard IEEE arithmetic.
-
- The precision of IEEE-754 standard floating point numbers is good enough
- for most technical and scientific applications. But sometimes a scientist
- or a mathematician has problems in his calculations because the algorithm or
- the mathematical problem is very ill-conditioned (for example finding the
- roots of a polynomial). As the floating point numbers are only appoximations
- of real or rational numbers, there is an additional error at every floating
- point operation.
-
- There is no method known to represent every real number exactly in a
- binary format and therefore an approximation is necessary. A practical method
- is to extend the mantissa of the floating point number, which results in
- less error at every operation. But there are limitations because of higher
- memory needs for every floating point number and increased execution time.
- This is why a scientist having numerical problems of this kind is looking
- for very fast subroutines for multiprecision floating point numbers.
-
- This package represents such a package of subroutines. It uses a binary
- data format because there are optimal instructions for the manipulation of
- binary numbers in the instruction set of the 8086. A binary data format has
- also the advantage, that the used memory is minimized. The only disadvantage
- is the higher effort that is necessary for the input and output of the
- numbers in ASCII-Format.
-
- The main goal is a minimal computing time for very accurate numbers.
- Therefore no time is wasted for rounding (every result is truncated) and
- the handling of abnormal numbers as NANs or denormals. The only special
- number is zero. For saving time, there is no stack overflow check.
-
- In case of an overflow or some other calculation error an error flag is set,
- which can be tested during or at the end of the calculation. Of course in
- such a case the numerical result is wrong. In case of an underflow the result
- is set to zero without any message.
-
- A multiprecision number is consists of an array of words (1 word = 2 byte
- = 16 bit). The first word represents the exponent to the base of two. It is
- represented with an offset of 8000 Hex (= 32768) and ranges from 1 to
- FFFF Hex (= 65535). The special value 0 is used to represent the floating
- point number zero. In this case the rest of the array is arbitrary. The
- second word of the array represents the first word of mantissa. The mantissa
- is normalized for every floating point number and therefore the most
- significant bit of the first word of mantissa is always set. As there is no
- information in this bit, it is removed and at its place the sign bit is
- stored. The sign bit is one if the floating point number is negative, and
- zero if the floating point number is positive. The further words of
- the multiprecision number are words of mantissa. The length of mantissa
- ranging from one to a constant (15 in this release) can be choosen by the
- user.
-
- Example : 8002 490F Hex represents 3.141540528..
-
- Every word of mantissa represents in average 4.8165 (= 16 * log10(2)) digits
- of a decimal number. Therefore you need 15 words of mantissa for a accuracy
- of 72 decimal digits and the whole number uses 16 words or 32 bytes.
-
- If you have the source code, you can increase the accuracy. The upper limit
- of the accuracy is given by the used stack of the subroutines, which is
- about 3 times the memory of a "MFLOAT" number for the arithmetic operations
- addition, subtraction, multiplication and division. The stack is limited to
- one segment of the 8086 which is 65536 byte. A further limit for
- transcendental functions is 125 mantissa words (602 digits of a
- decimal number) for the series of the sine. If you extend this limit, you
- will get an error, for an argument near pi/4.
- "MFLOATA.OBJ" includes 19 "MFLOAT" constants. This constants are stored in
- the code segment, which is limited to 65536 bytes.
-
- There is a very large range for the exponent. The greatest number, that can
- by represented is about +/-7.07E+9863, the smallest is about +/-7.07E-9865.
-
- Now lets take a look inside the subroutines. The coprocessor 8087 or a
- compatible is not used because the IEEE format does not match the data
- format. There are subroutines for the algebraic operations addition,
- subtraction, multiplication and division.
-
- Addition and subtraction are done by the same subroutine. The time for an
- addition (or subtraction) is proportional to the actual number of mantissa-
- words.
-
- The implementation of the multiplication uses the instruction "MUL".
- which is much faster than shifted additions. The multiplication result
- is two times as large as its operands but only the half of the accuracy is
- needed. Therfore its a waste of time to calculate the second part of the
- mantissa. If exactly half the number of mantissawords are calculated and the
- rest is neglected, there is an accumulation of truncating errors, which can
- be very large. Therfore an additional extra word is added to the result during
- calculation, but for the final result this extra word is ignored. By this
- way half of the calculation time is saved and only small additional errors
- occure. The calculation time increases quadratic with the number of
- mantissawords.
-
- The division is implemented using an algorithm similar to hand calculation.
- First an estimate of the first word of the result is made and the rest is
- calculated. If the rest is negativ, the estimation is corrected until it is
- ok. Using the rest the next word of mantissa is estimated and eventually
- corrected. This goes on until the whole mantissa of the result is calculated.
- For saving calculation time, the rest is truncated at every step by one word.
- Only at the first step an extra word is added (no truncation). Because of this
- method half of the calculation time is saved (same priciple as by the
- multiplication). The calculation time increases quadratic with the number of
- mantissawords.
-
- The first transcendental function is the square root. There is a hand
- calculation method known similar to a division. The implemented algorithm
- uses this method. For saving calcultion time the same tricks are used as
- in the implementation of multiplication and division. The calculation time is
- smaller than those of a division. There are also iterative methods known to
- calculate the square root. These methods are much slower than the implemented
- method. (3 - 5 times slower).
-
- Further trancendental functions are the trigonometric, hyperbolic functions,
- the exponential function and their invers functions. This functions can
- be derived from four series:
-
- sin(x) = x - x^3/6 + x^5/120 - .....
-
- sinh(x) = x + x^3/6 + x^5/120 + .....
-
- arctan(x) = x - x^3/3 + x^5/5 - ..... |x| < 1
-
- artanh(x) = x + x^3/3 + x^5/5 + ..... |x| < 1
-
- These are the basic functions for further calculations. All other functions
- can be derived from them. Here are some examples:
-
- cos(x) = sin(x + pi/2)
- tan(x) = sin(x) / cos(x)
- cot(x) = cos(x) / sin(x)
- cosh(x) = sqrt(1 + sqr(sinh(x))
- tanh(x) = sinh(x) / cosh(x)
- coth(x) = cosh(x) / sinh(x)
- exp(x) = sinh(x) + cosh(x)
- arcsin(x) = arctan(x / sqrt(1 - sqr(x)))
- arccos(x) = pi/2 - arcsin(x)
- arccot(x) = pi/2 - arctan(x)
- arsinh(x) = artanh(x / sqrt(1 + sqr(x)) = ln(x + sqrt(1 + sqr(x)))
- arcosh(x) = artanh(sqrt(sqr(x) - 1) / x) = ln(x + sqrt(sqr(x) - 1))
- arcoth(x) = artanh(1/x)
- ln(x) = 2 * artanh((x - 1) / (x + 1))
-
- Some identities are used to save calculation time:
- sin(x + 2 * pi) = sin(x)
- exp(x + ln(2)) = 2 * exp(x)
- ....
- and so on.
-
- Most of the time is used for the calculation of the series. Therefore it is
- important to write an optimal assembler subroutine for the whole series.
- This series could also be programmed directly by the former subroutines
- but using a specialized assembler subroutine the calculation time can be
- reduced to one third at even higher accuracy. The calculation time for a
- series increases with the third power of the number of mantissawords.
-
-
-
- The subroutines are testet using the following compilers:
- Turbo Pascal 6.0, Borland Pascal 7.0, Borland C/C++ 3.1
-
- You may use the library in combination with other compilers but in this
- case you may have to adabt the header files. Please note, that pascal
- calling conventions (upercase only, argument order, no underlines, ...)
- is implemented. All routines are 'far' with 'far' data pointers.
-
-
-
- ===========================================================================
- ***************************************************************************
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ***************************************************************************
- ===========================================================================
-
- Using the subroutines for BORLAND C 3.1 :
- =========================================
-
- To make the functions available you have to include MFLOAT.H in your source
- and to specify MFLOATA.OBJ and MFLOATB.OBJ in your project file.
-
- #define mfloatwords 16 /* maximum precision */
- typedef int mfloata[mfloatwords];/* mfloat array */
- #define mfloat_ int far * /* pointer to array */
- #define mfloat mfloata /* for use in 'C' programs */
-
- (declared in "MFLOAT.H")
- This type declaration for multiprecision floating point numbers sets
- a maximal precision of 15 mantissawords or about 72 decimal digits. The
- maximum precision is an assembly time constant. In this release it is fixed
- to 15 mantissawords. It is possible to change the type declaration to reduce
- the number of mantissawords. This leads to a saving in memory needs and to
- higher execution speed at the cost of precision. It is also possible to adjust
- the precision dynamically using the function 'setmantissawords'.
-
- The programmer must take care, that the stack is large enough, because
- there is no check of a stack overflow.
-
- ****************************************************************************
- ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
- ****************************************************************************
-
- void far pascal setmantissawords(int number);
-
- This procedure sets the length of the mantissa of mfloat numbers. The number
- of word must be in the range of [1..15], otherwise is is bounded to these
- values. If the number of mantissawords is increased and the numbers of
- former calculations are used further, the additional words are not
- initialized to zero, their values are not known. The error of the number
- due to this further mantissa words is of course very small and is in the
- range of the accuracy of the former calculation. If it is important, that
- this further mantissa words are zero, the user has to initialize them (see
- data format). If the number of mantissa words is reduced, every number can
- be seen as truncated, although nothing is changed at this numbers. The number
- of mantissa words refers only to calculations.
-
- ****************************************************************************
-
- int far pascal getmantissawords(void);
-
- The function returns the actual number of mantissa words.
-
- ****************************************************************************
-
- void far pascal reseterror(void);
-
- The error flag is reset.
-
- ****************************************************************************
-
- char far pascal geterror(void);
-
- The error flag is returned. If there ocures a runtime error during the
- calculation, the error flag is set to true (=1).
-
- ****************************************************************************
-
- mfloat_ far pascal equm(mfloat_ a, /* a <-- b */
- const mfloat_ b);
-
- The value of b is copied to variable a with the defined accuracy. The value
- of b is not changed. The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal addm(mfloat_ a, /* a <-- a + b */
- const mfloat_ b);
-
- The value of b is added to the value a with the defined accuracy. The
- result is variable a. The value of b is not changed. The return value is a
- pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal subm(mfloat_ a, /* a <-- a - b */
- const mfloat_ b);
-
- The value of b is subtracted from the value a with the defined accuracy.
- The result is variable a. The value of b is not changed. The return value is
- a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal multm(mfloat_ a, /* a <-- a * b */
- const mfloat_ b);
-
- The value of a is multiplied by value b with the defined accuracy. The result
- is variable a. The value of b is not changed. The return value is a pointer
- to a.
-
- ****************************************************************************
-
- mfloat_ far pascal divm(mfloat_ a, /* a <-- a / b */
- const mfloat_ b);
-
- The value of a is divided by the value b with the defined accuracy.
- The result is variable a. The value of b is not changed. The return value is
- a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal multi(mfloat_ a, /* a <-- a * b */
- int b);
-
- The value of a is multiplied by the integer value b with the defined
- accuracy. The result is variable a. The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal divi(mfloat_ a, /* a <-- a / b */
- int b);
-
- The value of a is divided by the integer value b with the defined accuracy.
- The result is variable a. The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal inversm(mfloat_ a); /* a <-- 1 / a */
-
- The number one is divided by the number a with the defined accuracy.
- The result is variable a. The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal negm(mfloat_ a); /* a <-- - a */
-
- This procedure changes the sign of the number of variable a. The return value
- is a pointer to a.
-
- ****************************************************************************
-
- char far pascal eqzero(const mfloat_ a); /* eqzero <-- (a == 0) */
-
- This function checks, if the multiprecision number a is zero. The value of a
- is not changed. The return value is 0 or 1.
-
- ****************************************************************************
-
- char far pascal gtzero(const mfloat_ a); /* gtzero <-- (a > 0) */
-
- This function checks, if the multiprecision number a is greater than zero.
- The value of a is not changed. The return value is 0 or 1.
-
- ****************************************************************************
-
- char far pascal gezero(const mfloat_ a); /* gezero <-- (a >= 0) */
-
- This function checks, if the multiprecision number a is greater than zero or
- equal zero. The value of a is not changed. The return value is 0 or 1.
-
- ****************************************************************************
-
- char far pascal gtm(const mfloat_ a, /* gtm <-- (a > b) */
- const mfloat_ b);
-
- This function checks, if the multiprecision number a is greater than the
- multiprecision number b. The values of a and b are not changed. The return
- value is 0 or 1.
-
- ****************************************************************************
-
- char far pascal eqm(const mfloat_ a, /* eqm <-- (a == b) */
- const mfloat_ b);
-
- This function checks, if the multiprecision number a is equal to the
- multiprecision number b. The values of a and b are not changed. The return
- value is 0 or 1.
-
- ****************************************************************************
-
- mfloat_ far pascal getzerom(mfloat_ a); /* a <-- 0 */
-
- This procedure copies the value zero = 0.0000.. to the variable a. The return
- value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal getonem(mfloat_ a); /* a <-- 1 */
-
- This procedure copies the value one = 1.0000.. to the variable a. The return
- value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal getpim(mfloat_ a); /* a <-- pi */
-
- This procedure copies the value pi = 3.14159.. to the variable a. The return
- value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal getln2m(mfloat_ a); /* a <-- ln(2) */
-
- This procedure copies the value ln(2) = 0.69314.. to the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal getln10m(mfloat_ a); /* a <-- ln(10) */
-
- This procedure copies the value ln(10) = 2.3025.. to the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- int far pascal strtomf(mfloat_ a, /* a <-- b */
- const char far * b);
-
- This procedure converts the string b to the multiprecision number a. If the
- conversion is correct, the integer zero is returned. Otherwise the number
- of the character, where the first error occurs, is retured. For conversion
- the accuracy of the calculation is set to maximum and afterwards is set back
- to the original value. This function is not optimized for speed, but for
- accuracy, because it is not used very often normally.
-
- ****************************************************************************
-
- char far * far pascal mftoa(char far * str, /* str <-- a */
- const mfloat_ a,
- int len);
-
- This procedure converts the multiprecision number a to a string str with
- the number of ASCII - characters, which is defined in len. If the number
- of characters is too less, the string is set to stars '*'. The format used
- for this conversion is '.32767F' (see mftostr). For conversion the
- accuracy of the calculation is set to maximum and afterwards is set back to
- the original value. This function is not optimized for speed, but for
- accuracy, because it is not used very often normally.
- The return value is a pointer to the character array a.
-
- ****************************************************************************
-
- char far * far pascal mftostr(char far * str, /* str <-- a */
- const mfloat_ a,
- int len,
- const char far * format);
-
- This procedure converts the multiprecision number a to a string str with
- a maximum number of ASCII - characters, which is defined in len. If the
- number of characters is too less, the string is set to stars '*'.
- The characters in the format - string have following meaning:
-
- '.' the point is shown ervery time, not significant zeros behind the point
- are shown
- 'E' scientific format with an exponent of ten
- 'F' fixpoint format, if the number is too large, a scientific format
- with an exponent of ten is used.
- 'G' fixpoint format, if the number is too large, or is less than 0.0001,
- a scientific format is used.
- '1..' an integer, which declares the number of digits behind the point
- for fixpoint and scientific format.
-
- The order of the characters (but not the digits for the precision) in the
- format string is arbitrary. It is also allowed to use lower case characters
- for the format, but in this case the letter for the exponent of ten is
- also the lower case letter 'e'. For conversion the accuracy of the
- calculation is set to maximum and afterwards is set back to the original
- value. This function is not optimized for speed, but for accuracy, because
- it is not used very often normally.
- The return value is a pointer to the character array a.
-
- ****************************************************************************
-
- double far pascal mftod(const mfloat_ b); /* mftod <-- b */
-
- This function converts a multiprecision number to a double number.
- The value of b is not changed.
-
- ****************************************************************************
-
- long double far pascal mftold(const mfloat_ b); /* mftold <-- b */
-
- This function converts a multiprecision number to a long double number.
- The value of b is not changed.
-
- ****************************************************************************
-
- mfloat_ far pascal dtomf(mfloat_ a, /* a <-- b */
- double b);
-
- This procedure converts a double precision number to a multiprecision number.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal ldtomf(mfloat_ a, /* a <-- b */
- long double b);
-
- This procedure converts a long double number to a multiprecision number.
- The return value is a pointer to a.
-
- ****************************************************************************
- +++++++++++++++ standart functions (Borland C: MATH.H) +++++++++++++++
- ****************************************************************************
-
- mfloat_ far pascal acosm(mfloat_ a); /* a <-- acos(a) */
-
- This procedure calculates the invers cos(x) function of the argument a. The
- result is the variable a.
- -1 <= a <= 1 range of the argument, otherwise the error flag is set
- 0 <= a <= pi range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal asinm(mfloat_ a); /* a <-- asin(a) */
-
- This procedure calculates the invers sin(x) function of the argument a. The
- result is the variable a.
- -1 <= a <= 1 range of the argument, otherwise the error flag is set
- -pi/2 <= a <= pi/2 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal atanm(mfloat_ a); /* a <-- atan(a) */
-
- This procedure calculates the invers tan(x) function of the argument a. The
- result is the variable a.
- -pi/2 < a < pi/2 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal atan2m(mfloat_ a, /* a <-- atan2(a,b) */
- const mfloat_ b);
-
- This function calulates the extended arctan function phi = arctan(a/b). If
- a is zero or less than zero, the correct value phi is retured, which is
- in the intervall (-pi, pi]:
- a = r * sin(phi)
- b = r * cos(phi)
- r = sqrt(sqr(a) + sqr(b)) >= 0
-
- If a = 0 and b = 0, the value phi is not unique. The error flag is set and
- the number zero is returned. The return variable is the variable a. Variable
- b is not changed.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal ceilm(mfloat_ a); /* a <-- ceil(a) */
-
- The least integer number, which is greater than or equal the mfloat number
- a, is calculated and returned to the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal cosm(mfloat_ a); /* a <-- cos(a) */
-
- This procedure calculates the cos(x) function of the argument a. The result
- is the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal coshm(mfloat_ a); /* a <-- cosh(a) */
-
- This procedure calculates the cosh(x) function of the argument a. The result
- is the variable a.
- -22712 < a < 22712 range of the argument without an overflow
- a >= 1 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal expm(mfloat_ a); /* a <-- exp(a) */
-
- This procedure calculates exponential function exp(x) of the argument a. The
- result is the variable a.
-
- a < 22712 range of the argument without an overflow
- a < -22712 there is an underflow result = 0
- a >= 0 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal fabsm(mfloat_ a); /* a <-- fabs(a) */
-
- This procedure returns the absolute value of the number a. The result is
- the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal floorm(mfloat_ a); /* a <-- floor(a) */
-
- The greatest integer number, which is less than or equal the mfloat number
- a, is calculated and returned to the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal fmodm(mfloat_ a, /* a <-- fmod(a,b) */
- const mfloat_ b);
-
- This is a floating point modul function. The number a is divided by b, so
- that the result is an integer. The rest of this division is returned by the
- variable a. The variable b is not changed.
-
- a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a
-
- If the variable b is zero, the error flag is set. For negative b, the result
- is negativ or zero. For very large k, which can not be represented as an
- mfloat number exactly, the result is zero.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal frexpm(mfloat_ a, /* a <-- frexp(a,b) */
- int * b);
-
- The mfloat number a is splitted in the exponent and in the mantissa:
-
- a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a
-
- This procedure returns two results. IF the mfloat number is zero, it remains
- zero and b is set to zero.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal hypotm(mfloat_ a, /* a <-- hypot(a,b) */
- const mfloat_ b);
-
- This function is used for the calculation of the absolute value of a complex
- number:
-
- c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged.
- |a + i*b| = hypot(a,b);
- This procedure calculates the correct result also if sqr(a) can not be
- represented as a mfloat number, because it is too great or too small.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal ldexpm(mfloat_ a, /* a <-- ldexp(a,b) */
- int b);
-
- The mfloat number a is multiplied by the power of two 2**b.
-
- a <-- a * (2**b);
-
- The calculation is very fast because only the exponent of two is manipulated.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal logm(mfloat_ a); /* a <-- log(a) */
-
- This procedure calculates the natural logarithm ln(x) of the argument a. The
- result is the variable a.
-
- a > 0 range of the argument
- -22712 <= a <= 22712 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal log10m(mfloat_ a); /* a <-- log10(a) */
-
- This procedure calculates the decade logarithm log(x) of the argument a. The
- result is the variable a.
-
- a > 0 range of the argument
- -9864 <= a <= 9863 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal modfm(mfloat_ a, /* a,b <-- modf(a) */
- mfloat_ b);
-
- This procedure splits the number a in an integer part and in an fractional
- part:
-
- a = k + b; k is an integer; 0 <= b < 1; k is assigned to a
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal powm(mfloat_ a, /* a <-- pow(a,b) */
- const mfloat_ b);
-
- This procedure calculates the power of the basis a by the exponent b.
- The result is assigned to the variable a, the variable b is not changed.
- If the exponent is an integer, the calculation is made by repeated
- multiplications. The number of multiplications is proportional to
- log(|b|). If b has a fractional, the calculation is made by logarithm. In
- this case, the base a must be positiv, otherwise the error flag is set,
- because the result is not real. If the number b is very large so that
- no fractional can be represented by the mfloat format and the base
- is negativ, the error flag is set too. If a = 0 and b <= 0 the result is
- not defined and the error flag is set.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal pow10m(mfloat_ a, /* a <-- 10**b */
- int b);
-
- The power of the basis ten is build with the integer exponent b. The result
- is assigned to a.
-
- a <-- 10**b
-
- if b < -9864 then a = 0
- if b > 9863 then there is an overflow -> error flag is set
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal sinm(mfloat_ a); /* a <-- sin(a) */
-
- This procedure calculates the sin(x) function of the argument a. The result
- is the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal sinhm(mfloat_ a); /* a <-- sinh(a) */
-
- This procedure calculates the sinh(x) function of the argument a. The result
- is the variable a.
- -22712 < a < 22712 range of the argument without an overflow
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal sqrtm(mfloat_ a); /* a <-- sqrt(a) */
-
- This procedure calculates the square root of the number a. The result is
- variable a. If a is negativ, the error flag is set.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal tanm(mfloat_ a); /* a <-- tan(a) */
-
- This procedure calculates the tan(x) function of the argument a. The result
- is the variable a. It is faster than the calculation of sin(a) / cos(a).
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal tanhm(mfloat_ a); /* a <-- tanh(a) */
-
- This procedure calculates the tanh(x) function of the argument a. The result
- is the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
- ++++++++++++++++++ extended standart functions ++++++++++++++++++++++++
- ****************************************************************************
-
- mfloat_ far pascal acoshm(mfloat_ a); /* a <-- acosh(a) */
-
- This procedure calculates the invers cosh(x) function of the argument a. The
- result is the variable a.
- 1 <= a range of the argument
- 0 <= a < 22712 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal acotm(mfloat_ a); /* a <-- acot(a) */
-
- This procedure calculates the invers cot(x) function of the argument a. The
- result is the variable a.
- 0 < a < pi range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal acothm(mfloat_ a); /* a <-- acoth(a) */
-
- This procedure calculates the invers coth(x) function of the argument a. The
- result is the variable a.
- -1 > a or 1 < a range of the argument
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal asinhm(mfloat_ a); /* a <-- asinh(a) */
-
- This procedure calculates the invers sinh(x) function of the argument a. The
- result is the variable a.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal atanhm(mfloat_ a); /* a <-- atanh(a) */
-
- This procedure calculates the invers tanh(x) function of the argument a. The
- result is the variable a.
- -1 < a < 1 range of the argument
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal cossinm(mfloat_ a, /* a <-- cos(a), */
- mfloat_ b); /* b <-- sin(a) */
-
- This procedure calculates the sin(x) and cos(x) function by only one
- evulation of the series for the sine function. This procedure is therefore
- faster than two individual calculations. It is recommended for example
- for the calculation of the exponential function of complex arguments.
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal cotm(mfloat_ a); /* a <-- cot(a) */
-
- This procedure calculates the cot(x) function of the argument a. The result
- is the variable a. It is faster than the calculation of cos(a) / sin(a).
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal cothm(mfloat_ a); /* a <-- coth(a) */
-
- This procedure calculates the coth(x) function of the argument a. The result
- is the variable a.
- range of the result: a >=1 or a <= -1
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal exp10m(mfloat_ a); /* a <-- 10 ** a */
-
- This procedure calculates the power of 10 of the argument a. The
- result is the variable a.
- a < 9863 range of the argument without an overflow
- a < -9864 there is an underflow, result = 0
- a >= 0 range of the result
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal sqrm(mfloat_ a); /* a <-- sqr(a) */
-
- This procedure calculates the sqare of a. The result is variabel a.
- sqr(a) = a * a
- The return value is a pointer to a.
-
- ****************************************************************************
-
- mfloat_ far pascal truncm(mfloat_ a); /* a <-- trunc(a) */
-
- This procedure returns the integer part of the number a (truncation). The
- result is variable a. The sign has no influence and remains unchanged.
- The return value is a pointer to a.
-
-
-
-
-
- ===========================================================================
- ***************************************************************************
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ***************************************************************************
- ===========================================================================
-
- Using the subroutines for Borland C++ 3.1 :
- ===========================================
-
- Using the mfloat library with C++ all of the mathematical operators are
- overloaded to allow 'normal' calculations with the mfloat class.
-
- To make the functions available you have to include MFLOAT.CXX (includes
- MFLOAT.H) in your source and to specify MFLOATA.OBJ and MFLOATB.OBJ
- in your project file.
-
- All operators available for the type 'double' are also defined for 'mfloat',
- So it is very easy to convert a program using double variables to use the
- mfloat library. Please note that it is not allowed to assign 'mfloat'
- variables to other numeric types. To solve this problem please use 'mftod'
- and 'mftold' (see the example programs!).
-
- overloaded operators:
- =
- ++ (prefix, suffix) -- (prefix, suffix)
- + (unary, binary) - (unary, binary)
- * /
- += -= *= /=
- < > <= >= == !=
- << (output to stream) >> (input from stream)
-
- The following functions are overloaded mfloat equivalents of the
- functions defined in 'math.h':
-
- acos, asin, atan, atan2, ceil, cos, cosh, exp, fabs, floor, fmod, frexp,
- hypot, ldexp, log, log10, modf, pow, sin, sinh, sqrt, tan, tanh.
-
- Instead of 'atof' use 'atofm' (overloading problem).
- Instead of 'pow10' use 'pow10m' (overloading problem).
-
- The following functions are extensions to math.h:
-
- acosh, asinh, atanh, acoth, cot, coth, exp10, sqr, trunc.
-
- Further functions are available:
-
- void pascal setmantissawords(int number);
- int pascal getmantissawords(void);
- void pascal reseterror(void);
- char pascal geterror(void);
- int pascal strtomf(mfloat &a, const char *s);
- char * pascal mftoa(char *str, const mfloat &a, int len);
- char * pascal mftostr(char *str, const mfloat &a, int len,
- const char * format);
- double pascal mftod(const mfloat &b);
- long double pascal mftold(const mfloat &b);
-
- All functions are mapped to the functions defined in MFLOAT.H. So please
- take a look in description for using the functions with Borland C above.
-
-
-
-
- ===========================================================================
- ***************************************************************************
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ***************************************************************************
- ===========================================================================
-
- Using the subroutines for TURBO PASCAL 6.0, 7.0 or BORLAND PASCAL 7.0 :
- =======================================================================
-
- CONST MfloatWords = 16;
- TYPE mfloat = ARRAY[0..MfloatWords-1] OF integer;
-
- (declared in "PFLOAT.PAS")
-
- This type declaration for multiprecision floating point numbers sets
- a maximal precision of 15 mantissawords or about 72 decimal digits. The
- maximum precision is an assembly time constant. In this release it is fixed
- to 15 mantissawords. It is possible to change the type declaration to reduce
- the number of mantissawords. This leads to a saving in memory needs and to
- higher execution speed at the cost of precision. It is also possible to adjust
- the precision dynamically using the function 'setmantissawords'.
-
- The programmer must take care, that the stack is large enough, because
- there is no check of a stack overflow.
-
- ****************************************************************************
- ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
- ****************************************************************************
-
- void far pascal setmantissawords(int number);
-
- ****************************************************************************
- ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++
- ****************************************************************************
-
- PROCEDURE Setmantissawords(number : integer);
-
- This procedure sets the length of the mantissa of mfloat numbers. The number
- of word must be in the range of [1..15], otherwise is is bounded to these
- values. If the number of mantissawords is increased and the numbers of
- former calculations are used further, the additional words are not
- initialized to zero, their values are not known. The error of the number
- due to this further mantissa words is of course very small and is in the
- range of the accuracy of the former calculation. If it is important, that
- this further mantissa words are zero, the user has to initialize them (see
- data format). If the number of mantissa words is reduced, every number can
- be seen as truncated, although nothing is changed at this numbers. The number
- of mantissa words refers only to calculations.
-
- ****************************************************************************
-
- FUNCTION GetMantissawords : integer;
-
- The function returns the actual number of mantissa words.
-
- ****************************************************************************
-
- PROCEDURE ResetError;
-
- The error flag is reset.
-
- ****************************************************************************
-
- FUNCTION GetError : boolean;
-
- The error flag is returned. If there ocures a runtime error during the
- calculation, the error flag is set to true.
-
- ****************************************************************************
-
- PROCEDURE equm(VAR a, b : mfloat); { *** a <-- b *** }
-
- The value of b is copied to variable a with the defined accuracy. The value
- of b is not changed.
-
- ****************************************************************************
-
- PROCEDURE addm(VAR a, b : mfloat); { *** a <-- a + b *** }
-
- The value of b is added to the value a with the defined accuracy. The
- result is variable a. The value of b is not changed.
-
- ****************************************************************************
-
- PROCEDURE subm(VAR a, b : mfloat); { *** a <-- a - b *** }
-
- The value of b is subtracted from the value a with the defined accuracy.
- The result is variable a. The value of b is not changed.
-
- ****************************************************************************
-
- PROCEDURE multm(VAR a, b : mfloat); { *** a <-- a * b *** }
-
- The value of a is multiplied by value b with the defined accuracy.
- The result is variable a. The value of b is not changed.
-
- ****************************************************************************
-
- PROCEDURE divm(VAR a, b : mfloat); { *** a <-- a / b *** }
-
- The value of a is divided by the value b with the defined accuracy.
- The result is variable a. The value of b is not changed.
-
- ****************************************************************************
-
- PROCEDURE multi(VAR a : mfloat; b : integer); { *** a <-- a * b *** }
-
- The value of a is multiplied by the integer value b with the defined
- accuracy. The result is variable a.
-
- ****************************************************************************
-
- PROCEDURE divi(VAR a : mfloat; b : integer); { *** a <-- a / b *** }
-
- The value of a is divided by the integer value b with the defined accuracy.
- The result is variable a.
-
- ****************************************************************************
-
- PROCEDURE inversm(VAR a : mfloat); { *** a <-- 1 / a *** }
-
- The number one is divided by the number a with the defined accuracy.
- The result is variable a.
-
- ****************************************************************************
-
- PROCEDURE negm(VAR a : mfloat); { *** a <-- - a *** }
-
- This procedure changes the sign of the number of variable a.
-
- ****************************************************************************
-
- FUNCTION eqZero(VAR a : mfloat) : boolean; { *** eqZero <-- a = 0 *** }
-
- This function checks, if the multiprecision number a zero. The value of a
- is not changed.
-
- ****************************************************************************
-
- FUNCTION gtZero(VAR a : mfloat) : boolean; { *** gtZero <-- a > 0 *** }
-
- This function checks, if the multiprecision number a is greater than zero.
- The value of a is not changed.
-
- ****************************************************************************
-
- FUNCTION geZero(VAR a : mfloat) : boolean; { *** geZero <-- a >= 0 *** }
-
- This function checks, if the multiprecision number a is greater than zero or
- equal zero. The value of a is not changed.
-
- ****************************************************************************
-
- FUNCTION gtm(VAR a, b : mfloat) : boolean; { *** gtm <-- a > b *** }
-
- This function checks, if the multiprecision number a is greater than the
- multiprecision number b. The values of a and b are not changed.
-
- ****************************************************************************
-
- FUNCTION eqm(VAR a, b : mfloat) : boolean; { *** eqm <-- a = b *** }
-
- This function checks, if the multiprecision number a is equal to the
- multiprecision number b. The values of a and b are not changed.
-
- ****************************************************************************
-
- PROCEDURE GetZerom(VAR a : mfloat); { *** a <- 0 *** }
-
- This procedure copies the value zero = 0.0000.. to the variable a.
-
- ****************************************************************************
-
- PROCEDURE GetOnem(VAR a : mfloat); { *** a <- 1 *** }
-
- This procedure copies the value one = 1.0000.. to the variable a.
-
- ****************************************************************************
-
- PROCEDURE GetPim(VAR a : mfloat); { *** a <- pi *** }
-
- This procedure copies the value pi = 3.14159.. to the variable a.
-
- ****************************************************************************
-
- PROCEDURE GetLn2m(VAR a : mfloat); { *** a <- ln(2) *** }
-
- This procedure copies the value ln(2) = 0.69314.. to the variable a.
-
- ****************************************************************************
-
- PROCEDURE GetLn10m(VAR a : mfloat); { *** a <- ln(10) *** }
-
- This procedure copies the value ln(10) = 2.3025.. to the variable a.
-
- ****************************************************************************
-
- FUNCTION strtomf(VAR a : mfloat; { *** a <-- string *** }
- b : string) : integer;
-
- This procedure converts the string b to the multiprecision number a. If the
- conversion is correct, the integer zero is returned. Otherwise the number
- of the character, where the first error occurs, is retured. For conversion
- the accuracy of the calculation is set to maximum and afterwards is set back
- to the original value. This function is not optimized for speed, but for
- accuracy, because it is not used very often normally.
-
- ****************************************************************************
-
- FUNCTION mftoa(VAR a : mfloat; { *** string <-- a *** }
- len : integer) : string;
-
- This procedure converts the multiprecision number a to a string str with
- the number of ASCII - characters, which is defined in len. If the number
- of characters is too less, the string is set to stars '*'. The format used
- for this conversion is '.32767F' (see FUNCTION mftostr). For conversion the
- accuracy of the calculation is set to maximum and afterwards is set back to
- the original value. This function is not optimized for speed, but for
- accuracy, because it is not used very often normally.
-
- ****************************************************************************
-
- FUNCTION mftostr(VAR a : mfloat; { *** string <-- a *** }
- len : integer;
- format : string) : string;
-
- This procedure converts the multiprecision number a to a string str with
- a maximum number of ASCII - characters, which is defined in len. If the
- number of characters is too less, the string is set to stars '*'.
- The characters in the format - string have following meaning:
-
- '.' the point is shown ervery time, not significant zeros behind the point
- are shown
- 'E' scientific format with an exponent of ten
- 'F' fixpoint format, if the number is too large, a scientific format
- with an exponent of ten is used
- 'G' fixpoint format, if the number is too large, or less than 0.0001,
- a scientific format is used.
- '1..' an integer, which declares the number of digits behind the point
- for fixpoint and scientific format
-
- The order of the characters (but not the digits for the precision) in the
- format string is arbitrary. It is also allowed to use lower case characters
- for the format, but in this case the letter for the exponent of ten is
- also the lower case letter 'e'. For conversion the accuracy of the
- calculation is set to maximum and afterwards is set back to the original
- value. This function is not optimized for speed, but for accuracy, because
- it is not used very often normally.
-
- ****************************************************************************
-
- FUNCTION MfToD(VAR a : mfloat) : double; { *** MfToD <- a *** }
-
- This function converts a multiprecision number to a double precision number.
- The value of a is not changed.
-
- ****************************************************************************
-
- FUNCTION MfToLd(VAR a : mfloat) : extended; { *** MfToLd <- a *** }
-
- This function converts a multiprecision number to an extended number. The
- value of a is not changed.
-
- ****************************************************************************
-
- PROCEDURE DToMf(VAR a : mfloat; b : double); { *** a <- b *** }
-
- This procedure converts a double precision number to a multiprecision number.
-
- ****************************************************************************
-
- PROCEDURE LdToMf(VAR a : mfloat; b : extended);{ *** a <- b *** }
-
- This procedure converts an extended number to a multiprecision number.
-
- ****************************************************************************
- ++++++++++++++++ standart functions (Borland C: MATH.H) ++++++++++++++
- ****************************************************************************
-
- PROCEDURE acosm(VAR a : mfloat); { *** a <- arccos(a) *** }
-
- This procedure calculates the invers cos(x) function of the argument a. The
- result is the variable a.
- -1 <= a <= 1 range of the argument, otherwise the error flag is set
- 0 <= a <= pi range of the result
-
- ****************************************************************************
-
- PROCEDURE asinm(VAR a : mfloat); { *** a <- arcsin(a) *** }
-
- This procedure calculates the invers sin(x) function of the argument a. The
- result is the variable a.
- -1 <= a <= 1 range of the argument, otherwise the error flag is set
- -pi/2 <= a <= pi/2 range of the result
-
- ****************************************************************************
-
- PROCEDURE atanm(VAR a : mfloat); { *** a <- arctan(a) *** }
-
- This procedure calculates the invers tan(x) function of the argument a. The
- result is the variable a.
- -pi/2 < a < pi/2 range of the result
-
- ****************************************************************************
-
- PROCEDURE atan2m(VAR a, b : mfloat); { *** a <- arg(b + ia) *** }
-
- This function calulates the extended arctan function phi = arctan(a/b). If
- a is zero or less than zero, the correct value phi is retured, which is
- in the intervall (-pi, pi]:
- a = r * sin(phi)
- b = r * cos(phi)
- r = sqrt(sqr(a) + sqr(b)) >= 0
-
- If a = 0 and b = 0, the value phi is not unique. The error flag is set and
- the number zero is returned. The return variable is the variable a. Variable
- b is not changed.
-
- ****************************************************************************
-
- PROCEDURE ceilm(VAR a : mfloat); { *** a <-- ceil(a) *** }
-
- The least integer number, which is greater than or equal the mfloat number
- a, is calculated and returned to the variable a.
-
- ****************************************************************************
-
- PROCEDURE cosm(VAR a : mfloat); { *** a <- cos(a) *** }
-
- This procedure calculates the cos(x) function of the argument a. The result
- is the variable a.
-
- ****************************************************************************
-
- PROCEDURE coshm(VAR a : mfloat); { *** a <- cosh(a) *** }
-
- This procedure calculates the cosh(x) function of the argument a. The result
- is the variable a.
- -22712 < a < 22712 range of the argument without an overflow
- a >= 1 range of the result
-
- ****************************************************************************
-
- PROCEDURE expm(VAR a : mfloat); { *** a <- exp(a) *** }
-
- This procedure calculates exponential function exp(x) of the argument a. The
- result is the variable a.
-
- a < 22712 range of the argument without an overflow
- a < -22712 there is an underflow result = 0
- a >= 0 range of the result
-
- ****************************************************************************
-
- PROCEDURE fabsm(VAR a : mfloat); { *** a <-- fabs(a) *** }
-
- This procedure returns the absolute value of the number a. The result is
- the variable a.
-
- ****************************************************************************
-
- PROCEDURE floorm(VAR a : mfloat); { *** a <-- floor(a) *** }
-
- The greatest integer number, which is less than or equal the mfloat number
- a, is calculated and returned to the variable a.
-
- ****************************************************************************
-
- PROCEDURE fmodm(VAR a, b : mfloat); { *** a <-- fmod(a,b) *** }
-
- This is a floating point modul function. The number a is divided by b, so
- that the result is an integer. The rest of this division is returned by the
- variable a. The variable b is not changed.
-
- a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a
-
- If the variable b is zero, the error flag is set. For negative b, the result
- is negativ or zero. For very large k, which can not be represented as an
- mfloat number exactly, the result is zero.
-
- ****************************************************************************
-
- PROCEDURE frexpm(VAR a : mfloat; { *** a <-- frexp(a,b) *** }
- VAR b : integer);
-
- The mfloat number a is splitted in the exponent and in the mantissa:
-
- a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a
-
- This procedure returns two results. IF the mfloat number is zero, it remains
- zero and b is set to zero.
-
- ****************************************************************************
-
- PROCEDURE hypotm(VAR a, b : mfloat); { *** a <-- hypot(a,b) *** }
-
- This function is used for the calculation of the absolute value of a complex
- number:
-
- c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged.
- |a + i*b| = hypot(a,b);
- This procedure calculates the correct result also if sqr(a) can not be
- represented as a mfloat number, because it is too great or too small.
-
- ****************************************************************************
-
- PROCEDURE ldexpm(VAR a : mfloat; b : integer); { *** a <-- ldexp(a,b) *** }
-
- The mfloat number a is multiplied by the power of two 2**b.
-
- a <-- a * (2**b);
-
- The calculation is very fast because only the exponent of two is manipulated.
-
- ****************************************************************************
-
- PROCEDURE logm(VAR a : mfloat); { *** a <- log(a) *** }
-
- This procedure calculates the natural logarithm ln(x) of the argument a. The
- result is the variable a.
-
- a > 0 range of the argument
- -22712 <= a <= 22712 range of the result
-
- ****************************************************************************
-
- PROCEDURE log10m(VAR a : mfloat); { *** a <- log10(a) *** }
-
- This procedure calculates the decade logarithm log(x) of the argument a. The
- result is the variable a.
-
- a > 0 range of the argument
- -9864 <= a <= 9863 range of the result
-
- ****************************************************************************
-
- PROCEDURE modfm(VAR a, b : mfloat); { *** a, b <-- modf(a) *** }
-
- This procedure splits the number a in an integer part and in an fractional
- part:
-
- a = k + b; k is an integer; 0 <= b < 1; k is assigned to a
-
- ****************************************************************************
-
- PROCEDURE powm(VAR a, b : mfloat); { *** a <-- pow(a,b) *** }
-
- This procedure calculates the power of the basis a by the exponent b.
- The result is assigned to the variable a, the variable b is not changed.
- If the exponent is an integer, the calculation is made by repeated
- multiplications. The number of multiplications is proportional to
- log(|b|). If b has a fractional, the calculation is made by logarithm. In
- this case, the base a must be positiv, otherwise the error flag is set,
- because the result is not real. If the number b is very large so that
- no fractional can be represented by the mfloat format and the base
- is negativ, the error flag is set too. If a = 0 and b <= 0 the result is
- not defined and the error flag is set.
-
- ****************************************************************************
-
- PROCEDURE pow10m(VAR a : mfloat; b : integer); { *** a <-- pow10(a) *** }
-
- The power of the basis ten is build with the integer exponent b. The result
- is assigned to a.
-
- a <-- 10**b
-
- if b < -9864 then a = 0
- if b > 9863 then there is an overflow -> error flag is set
-
- ****************************************************************************
-
- PROCEDURE sinm(VAR a : mfloat); { *** a <- sin(a) *** }
-
- This procedure calculates the sin(x) function of the argument a. The result
- is the variable a.
-
- ****************************************************************************
-
- PROCEDURE sinhm(VAR a : mfloat); { *** a <- sinh(a) *** }
-
- This procedure calculates the sinh(x) function of the argument a. The result
- is the variable a.
- -22712 < a < 22712 range of the argument without an overflow
-
- ****************************************************************************
-
- PROCEDURE sqrtm(VAR a : mfloat); { *** a <- sqrt(a) *** }
-
- This procedure calculates the square root of the number a. The result is
- variable a. If a is negativ, the error flag is set.
-
- ****************************************************************************
-
- PROCEDURE tanm(VAR a : mfloat); { *** a <- tan(a) *** }
-
- This procedure calculates the tan(x) function of the argument a. The result
- is the variable a. It is faster than the calculation of sin(a) / cos(a).
-
- ****************************************************************************
-
- PROCEDURE tanhm(VAR a : mfloat); { *** a <- tanh(a) *** }
-
- This procedure calculates the tanh(x) function of the argument a. The result
- is the variable a.
-
- -1 <= a <= 1 range of the result
-
- ****************************************************************************
- ++++++++++++++++++ extended standart functions ++++++++++++++++++++++++
- ****************************************************************************
-
- PROCEDURE acoshm(VAR a : mfloat); { *** a <- arcosh(a) *** }
-
- This procedure calculates the invers cosh(x) function of the argument a. The
- result is the variable a.
- 1 <= a range of the argument
- 0 <= a < 22712 range of the result
-
- ****************************************************************************
-
- PROCEDURE acotm(VAR a : mfloat); { *** a <- arccot(a) *** }
-
- This procedure calculates the invers cot(x) function of the argument a. The
- result is the variable a.
- 0 < a < pi range of the result
-
- ****************************************************************************
-
- PROCEDURE acothm(VAR a : mfloat); { *** a <- arcoth(a) *** }
-
- This procedure calculates the invers coth(x) function of the argument a. The
- result is the variable a.
- -1 > a or 1 < a range of the argument
-
- ****************************************************************************
-
- PROCEDURE asinhm(VAR a : mfloat); { *** a <- arsinh(a) *** }
-
- This procedure calculates the invers sinh(x) function of the argument a. The
- result is the variable a.
-
- ****************************************************************************
-
- PROCEDURE atanhm(VAR a : mfloat); { *** a <- artanh(a) *** }
-
- This procedure calculates the invers tanh(x) function of the argument a. The
- result is the variable a.
- -1 < a < 1 range of the argument
-
- ****************************************************************************
-
- PROCEDURE cossinm(VAR a, b : mfloat); { *** a <-- cos(a), b <-- sin(a) *** }
-
- This procedure calculates the sin(x) and cos(x) function by only one
- evulation of the series for the sine function. This procedure is therefore
- faster than two individual calculations. It is recommended for example
- for the calculation of the exponential function of complex arguments.
-
- ****************************************************************************
-
- PROCEDURE cotm(VAR a : mfloat); { *** a <- cot(a) *** }
-
- This procedure calculates the cot(x) function of the argument a. The result
- is the variable a. It is faster than the calculation of cos(a) / sin(a).
-
- ****************************************************************************
-
- PROCEDURE cothm(VAR a : mfloat); { *** a <- coth(a) *** }
-
- This procedure calculates the coth(x) function of the argument a. The result
- is the variable a.
- range of the result: a >=1 or a <= -1
-
- ****************************************************************************
-
- PROCEDURE exp10m(VAR a : mfloat); { *** a <- 10 ** a *** }
-
- This procedure calculates the power of 10 of the argument a. The
- result is the variable a.
- a < 9863 range of the argument without an overflow
- a < -9864 there is an underflow, result = 0
- a >= 0 range of the result
-
- ****************************************************************************
-
- PROCEDURE sqrm(VAR a : mfloat); { *** a <- sqr(a) *** }
-
- This procedure calculates the sqare of a. The result is variabel a.
- sqr(a) = a * a
-
- ****************************************************************************
-
- PROCEDURE truncm(VAR a : mfloat); { *** a <-- trunc(a) *** }
-
- This procedure returns the integer part of the number a (truncation). The
- result is variable a. The sign has no influence and remains unchanged.
-
-
- ===========================================================================
- ***************************************************************************
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ***************************************************************************
- ===========================================================================
-
- Examples with MFLOAT:
-
- a) The calculation of the number PI:
- There are many methods known to calculate the irrational number pi,
- which is the relation between the perimeter and the diameter of a circle.
-
- Pi = 16 * arctan(1 / 5) - 4 * arctan(1 / 239);
- Pi = 48 * arctan(1 / 18) + 32 * arctan(1 / 57) - 20 * arctan(1 / 239);
- Pi = 6 * arctan(1 / sqrt(3));
- PI = 4 * (1 - 1/3 + 1/5 - 1/7 + ....)
-
- The function arctan(x) is calculated with the series:
-
- arctan(x) = x - x**3 / 3 + x**5 / 5 - x**7 / 7 + ......
-
- The operation "x**.." means the power of x.
-
- The advantage of the first two equations is, that they converge very
- fast and every element of the series can be calculated recursivly from
- the former element by only two divisions by an integer. A division by
- an integer is very fast for long MFLOAT-numbers.
- The third equation is build by the known value : tan(pi/6) = 1 / sqrt(3).
- This method is slower than the former. The forth equation has only
- theoretical importance, in the numerical use this equation is
- very bad.
- The example program compares the different methods to calculate pi.
-
- b) The evulation of the bessel function for large arguments:
- There are some series known, where some elements are very large and
- the result is small. In this case numerical errors are very large.
- An example is the bessel function J0(x) for an argument x = 100.0.
-
- J0(x) = 1 - (x/2)**2 / (1! * 1!) + (x/2)**4 / (2! * 2!) - ......
-
- The series converges very good, but there are problems with the accuracy
- of the calculation. The test program shows the influence of the accuracy
- of the calculation to the result.
-
- c) Conversion of cartesian coordinates to polar coordinates:
- This example shows the use of the functions hypot(x,y) and atan2(x,y).
- This program also demonstrates the input of MFLOAT-numbers.
-
- ===========================================================================
- ***************************************************************************
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- ***************************************************************************
- ===========================================================================
-
- Graz, 28.4.1993
-
- Kaufmann Friedrich & Mueller Walter
- Students at the Technical University of Graz
-
- Registration address:
-
- Kaufmann Friedrich
- Schuetzenhofgasse 22
- A-8010 GRAZ
- AUSTRIA
- EUROPE
-
- email address:
- fkauf@fstgds06.tu-graz.ac.at
-