ARM Club 3

home *** CD-ROM | disk | FTP | other *** search

/ ARM Club 3 / TheARMClub_PDCD3.iso / hensa / documentation / documents / a252sqrt < prev next >

Wrap

Internet Message Format | 1999-04-27 | 19.3 KB

From: chughes@maths.tcd.ie (Conrad Hughes) Subject: Re: 32/64 bit square root. Date: 25 Sep 91 16:30:11 GMT Here's the best code I've managed yet; it manages to fit into four instructions per bit. If anyone has managed better, please let me know.. If you do use the code, it'd be nice to receive a mention :-) DEFFNsqrt(N,R,T,S):LOCALA%:IFS<>N [OPTpass:MOV S,N:] [OPTpass:MOV R,#0:]:FORA%=7TO0STEP-1:[OPTpass:ADD T,R,#1<<A%+15 CMP S,T,LSL#A%+1:SUBGE S,S,T,LSL#A%+1:ADDGE R,R,#1<<A%+16:]NEXT:[OPTpass ADD T,R,#1<<14:CMP S,T:SUBGE S,S,T:ADDGE R,R,#1<<15:]FORA%=2TO15:[OPTpass ADD T,R,#1<<15-A%:RSBS S,T,S,LSL#1:ADDLT S,S,T:ADDGE R,R,#1<<16-A%:]NEXT [OPTpass:ADD T,R,R:ADD T,T,#1:RSBS S,T,S,LSL#2:ADDLT S,S,T:ADDGE R,R,#1 CMP S,R:ADDGE R,R,#1:]=0 ..defines a function which calculates the square root of a number stored in one register, with the fixed point between bits 15 and 16; I didn't write down behaviour for negative numbers, but you should be able to test this out for yourself, and it shouldn't be too difficult to extend to different formats. To convert a number into this format, you just multiply it by 65536, and to convert back, you divide by 65536 (i.e., 1.000 would be represented as 65536 in a register). So, for example, you want R1=SQRT(R0): FNsqrt(0,1,2,3) will preserve the contents of R0, and use R2 and R3 for data - both will be corrupted. FNsqrt(0,1,2,0) will corrupt R0, but doesn't need R3. There are no other circumstances under which two of the registers the routine uses can safely be the same. The loops are unwrapped for speed (actually using loops would take up much less space, but would probably use up another register, and make the routine twice as slow). The routine takes up 400 or 404 bytes, depending on whether N=S or not. One thing - this is slow :-}, so wherever possible you should avoid using square roots.. Similarly division. Conrad From: gcwillia@daisy.waterloo.edu (Graeme Williams) Subject: Re: The (non-charity) ray tracer Date: 20 Nov 91 19:21:07 GMT The algorithm I'd use would be a Newton-Raphson iteration scheme designed to solve x^2=z. (Whether this is the fastest method I don't know) The way this works is you make a guess for sqrt(z), say x_0. (The guess can be very rough, I'd do something like find out how many bits in z (say there were 23) half that (11 or 12) and then guess x_0 (2^11 or 2^12.) One then makes successive approximations via x_{n+1} = x_{n} - (x_{n}^2 - z)/2/x_{n} This scheme will converge very rapidly to sqrt(z) if your initial guess is at all close to z (give or take a factor of 2 or 4 is close enough). You might want to write (x_{n}^2-z) as (x_{n}+z)*(x_{n}-z) it might be quicker (I've no idea). As an example: if z=529020918 and we guess x0=16384 then we find x1=24336.43719 x2=23037.12504 x3=23000.48392 x4=23000.45473 x5=23000.45473 x6=23000.45473 so you can see 4 or maybe 5 iterations at most is enough. Each iteration (if you arrange things *carefully*) will need only a 16x16 bit multiply and a 32x16 bit divide. Graeme Williams From: torbenm@diku.dk (Torben AEgidius Mogensen) Date: 21 Nov 91 15:59:37 GMT Sender: torbenm@freke.diku.dk gcwillia@daisy.waterloo.edu (Graeme Williams) writes: >The algorithm I'd use would be a Newton-Raphson iteration scheme designed >to solve x^2=z. (Whether this is the fastest method I don't know) >... >so you can see 4 or maybe 5 iterations at most is enough. Each iteration >(if you arrange things *carefully*) will need only a 16x16 bit multiply >and a 32x16 bit divide. There is a *much* faster way. If your number is of the form a*2^n, where 0.5 <= a < 1.0 then do the following: sqrt(a*2^(2n)) = sqrt(a)*2^n sqrt(a*2^(2n+1)) = sqrt(a/2)*2^(n+1) This reduces the problem to finding a square root of a number a, 0.25 <= a < 1.0. This can be done by this method, which should have a cost comparable to 2 or 3 divisions. { 0.25 <= a < 1.0 } x := 4*(a-0.25); r := 1; n := 1; while (n<number-of-bits) and (a<>0) do begin { invariant: r^2+x = 4^n*a, x<2*r+1 } n := n+1; t := 4*r+1; x := 4*x; r := 2*r; if x >= t then begin x := x-t; r := r+1 end end; r := r/2^n; { sqrt(a)-r < 2^-n } Note that only addition, subtractions and multiplications by 2 or 4 are used in each part of the loop. At the end a single n-bit shift is done. Note that r and t are integers, while x is a real. The (a<>0) test allows you to exit the loop if an exact root is found, but it can be omitted without causing error (so the loop can be unwound). We can reformulate the algorithm to keep x < 1.0 throughout the algorithm: { 0.25 <= a < 1.0 } x := (a-0.25)/2; r := 1; for n := 2 to number-of-bits do begin { invariant: r^2+2^n*x= 4^n*a, x<(2*r+1)*2^-n } t := 4*r+1; x := 2*x; r := 2*r; if x >= t/2^(n+1) then begin x := x-t/2^(n+1); r := r+1 end end; r := r/2^number-of-bits; { sqrt(a)-r < 2^-number-of-bits } Assuming we have a 32 bit mantissa which is stored with the most significant bit as 0.5, we would get the following ARM code, which uses 213 cycles to get a 32 bit result. A test of x=0 could be added at every iteration. This would make minimum time shorter, but assuming there is no large number of squares in the input, this would make the average time longer. \ on entry Ra holds a \ on exit Rr holds sqrt(a) \ uses Rx and Rt (either can be equal to Ra) SUB Rx,Ra,#2^30 MOV Rx,Rx LSR #1 \ x := (a-0.25)/2; MOV Rr,#1 \ r := 1; MOV Rt,Rr ASL #2 ADD Rt,Rt,#1 \ t := 4*r+1; MOV Rx,Rx ASL #1 \ x := 2*x; MOV Rr,Rr, ASL #1 \ r := 2*r; SUBS Rt,Rx,Rt ASL #29 \ if x >= t/2^(n+1) then begin MOVPL Rx,Rt \ x := x-t/2^(n+1); ADDPL Rr,Rr,#1 \ r := r+1 ... MOV Rt,Rr ASL #2 ADD Rt,Rt,#1 \ t := 4*r+1; MOV Rx,Rx ASL #1 \ x := 2*x; MOV Rr,Rr, ASL #1 \ r := 2*r; SUBS Rt,Rx,Rt ASL #0 \ if x >= t/2^(n+1) then begin MOVPL Rx,Rt \ x := x-t/2^(n+1); ADDPL Rr,Rr,#1 \ r := r+1 \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa We can actually get this better by avoiding the doubling of r in each iteration. This way r and t are no longer integers, but the code is better anyway: { 0.25 <= a < 1.0 } x := a-0.25; r := 0.5; for n := 2 to number-of-bits do begin { invariant: r^2+x/2^(n-1) = a, x<r+2^-(n+1) } t := r+2^-(n+2); x := 2*x if x >= t then begin x := x-t; r := r+2^-(n+1) end end; { sqrt(a)-r < 2^-number-of-bits } In ARM this uses 149 cycles to get a 32 bit result. Try to better that! \ on entry Ra holds a \ on exit Rr holds sqrt(a) \ uses Rx and Rt (either can be equal to Ra) SUB Rx,Ra,#2^30 \ x := a-0.25; MOV Rr,#2^31 \ r := 0.5; ADD Rt,Rr,#2^28 \ t := r+2^-(n+2); MOV Rx,Rx ASL #1 \ x := 2*x; SUBS Rt,Rx,Rt \ if x >= t then begin MOVPL Rx,Rt \ x := x-t; ADDPL Rr,Rr,#2^29 \ r := r+2^-(n+1) ... ADD Rt,Rr,#2^0 \ t := r+2^-(n+2); MOV Rx,Rx ASL #1 \ x := 2*x; SUBS Rt,Rx,Rt \ if x >= t then begin MOVPL Rx,Rt \ x := x-t; ADDPL Rr,Rr,#2^1 \ r := r+2^-(n+1) \ t := r+2^-(n+2); \ { 2^-(n+2) is less than precision } \ x := 2*x; RSBS Rt,Rr,Rx ASL #1 \ if x >= t then begin \ x := x-t; { x not needed } ADDPL Rr,Rr,#2^0 \ r := r+2^-(n+1) \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa Note: I haven't tried this code on my Archimedes, as it is at home and I am at work. I have, however, tried the pseudo-Pascal code. From: torbenm@diku.dk (Torben AEgidius Mogensen) Date: 22 Nov 91 10:46:26 GMT I made a small error in this. The correct code is: { 0.25 <= a < 1.0 } x := a-0.25; r := 0.5; for n := 2 to number-of-bits do begin { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) } t := r+2^-(n+1); x := 2*x if x >= t then begin x := x-t; r := r+2^-n end end; { sqrt(a)-r < 2^-number-of-bits } There are, however, still a small problem: x might be greater than 1 after the doubling. We can solve this by delaying the doubling: { 0.25 <= a < 1.0 } x := a-0.25; r := 0.5; for n := 2 to number-of-bits do begin { invariant: r^2+x/2^(n-2) = a, x<r+2^-(n+1) } t := r+2^-(n+1) if x >= t/2 then begin x := x-t/2; r := r+2^-n end; x := 2*x end; { sqrt(a)-r < 2^-number-of-bits } Also, my ARM code used unsigned numbers but compared them as if they were signed (using PL). Comparison of unsigned numbers requires HS. The code is still untested, so there might be further bugs. Note also that you can't write the constant 2^31 in BASIC assembler. Use 1<<31 instead. The number of cycles have increased to 154, as an extra pass through the loop is required to get 32 bit precision: \ on entry Ra holds a \ on exit Rr holds sqrt(a) \ uses Rx and Rt (either can be equal to Ra) SUB Rx,Ra,#2^30 \ x := a-0.25; MOV Rr,#2^31 \ r := 0.5; ADD Rt,Rr,#2^29 \ t := r+2^-(n+1); SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin MOVHS Rx,Rt \ x := x-t/2; ADDHS Rr,Rr,#2^30 \ r := r+2^-n MOV Rx,Rx,ASL #1 \ x := 2*x; ... ADD Rt,Rr,#2^0 \ t := r+2^-(n+1); SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin MOVHS Rx,Rt \ x := x-t/2; ADDHS Rr,Rr,#2^1 \ r := r+2^-n MOV Rx,Rx,ASL #1 \ x := 2*x; \ t := r+2^-(n+1); \ { 2^-(n+1) is less than precision } SUBS Rt,Rx,Rr,LSR #1 \ if x >= t/2 then begin \ x := x-t/2; { x not needed } ADDHS Rr,Rr,#2^0 \ r := r+2^-n \ x := 2*x; { x not needed } \ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa Torben Mogensen (torbenm@diku.dk) From: dseal@armltd.co.uk (David Seal) Subject: Square roots (was Re: The (non-charity) ray tracer) Date: 21 Nov 91 15:52:14 GMT The Newton-Raphson technique is a good one when you've got a fast division - e.g. if you've got division in hardware (and haven't got square root in hardware :-). However, if you're going to do division by a software routine (as you must on an ARM), there is a "long square root" algorithm which is very similar to the "long division" algorithms used to do division in software. It takes only a bit more time than one division *in total*, compared with a division, an addition and a shift *for each of several iterations* in Newton-Raphson. I would therefore recommend it for ARM square root routines when speed is important. I won't supply code, since (a) I don't know what type of numbers you want to take square roots of, how accurate a result you want, etc. - these details will obviously affect the code; (b) I don't have time to write code, whereas I have a written description of the algorithm on file which I can simply include here! I'll illustrate the "long square root" algorithm in base 10 - suppose we want to take the square roots of 2 and 20: (A) First write down the number you want to take the square root of, split into groups of two digits, with one split at the decimal point. In these particular cases, this gives us: ----------------- ----------------- V 02. 00 00 00 ... V 20. 00 00 00 ... (Note that the two numbers chosen have the same pattern of digits, but different splits because of the different positions of the decimal point. This is the only reason they produce different answers.) (B) Initialise by finding the largest square which is smaller than the first pair of digits: in the first case the first pair of digits is 02 = 2, so this square is 1^2 = 1, and in the second the first pair of digits is 20, so this square is 4^2 = 16. Subtract the square from the first pair of digits, and put its square root in as the first digit of the result: 1. 4. ----------------- ----------------- V 02. 00 00 00 ... V 20. 00 00 00 ... - 1 -16 -- -- 1 4 Then bring the next pair of digits down: 1. 4. ----------------- ----------------- V 02. 00 00 00 ... V 20. 00 00 00 ... - 1 -16 ------ ------ 100 400 (C) To find the next digit of the result: take twice the answer so far. Look at numbers of the form N * (twice answer so far concatenated with N) for N a digit. The largest one of these that is still less than or equal to the remainder we've got gives the next digit of the result, and the value of N * (twice answer so far concatenated with N) is what we should subtract from the remainder before bringing down the next pair of digits: Twice answer so far is 2*1=2. Twice answer so far is 2*4=8. 0*20 = 0, less than 100 0*80 = 0, less than 400 1*21 = 21, less than 100 1*81 = 81, less than 400 2*22 = 44, less than 100 2*82 = 164, less than 400 3*23 = 69, less than 100 3*83 = 249, less than 400 4*24 = 96, less than 100 4*84 = 336, less than 400 5*25 = 125, more than 100 5*85 = 425, more than 400 So the next digit of the answer So the next digit of the answer is 4, and we must subtract 96 is 4, and we must subtract 336 1. 4 4. 4 ----------------- ----------------- V 02. 00 00 00 ... V 20. 00 00 00 ... - 1 -16 ------ ------ 100 400 - 96 -336 ------ ------ 400 6400 (D) Repeat step (C) as many times as you need result digits - I'll do it once more for the examples: Twice answer so far is 2*14=28. Twice answer so far is 2*44=88. 0*280 = 0, less than 400 0*880 = 0, less than 6400 1*281 = 281, less than 400 1*881 = 881, less than 6400 2*282 = 564, more than 400 2*882 = 1764, less than 6400 So the next digit of the answer : is 1, and we must subtract 281 : 7*887 = 6209, less than 6400 8*888 = 7104, more than 6400 So the next digit of the answer is 7, and we must subtract 6209 1. 4 1 4. 4 7 ----------------- ----------------- V 02. 00 00 00 ... V 20. 00 00 00 ... - 1 -16 ------ ------ 100 400 - 96 -336 ------ ------ 400 6400 - 281 -6209 ------- ------- 11900 19100 As to why the method works, I'll give some hints and leave you to work out the details yourself. To give the rough reason for it, note that the "remainder" in the above calculations is actually the difference between the number you are trying to take the square root of and the square of the answer so far: 2 20 = 1^2 + 1 = 4^2 + 4 = 1.4^2 + 0.04 = 4.4^2 + 0.64 = 1.41^2 + 0.0119 = 4.47^2 + 0.0191 = ... = ... [ Aside: this fact yields another useful property of the algorithm, which is that (as in long division) your answer is exactly right if the remainder becomes 0 and there are no further non-zeros to bring down. This can be very helpful e.g. for calculating square roots in floating point systems that conform to IEEE standard 754. ] Then prove this - it depends on the identity: (10x+y)^2 - (10x)^2 = 20xy + y^2 = (20x + y) * y The quantities (20x + y) * y are what the digit calculation in part (C) is actually determining. One important observation: this algorithm simplifies immensely in binary! I'll run through it, illustrating it for the case of the square root of two again: (A) Write down the argument, split into pairs aligned with the binary point: -------------------- V 10. 00 00 00 00 ... (B) If the first pair is 00, the first digit of the answer is 0; otherwise it is 1. Subtract the square of this first digit from the first pair and bring down the next pair: 1. -------------------- V 10. 00 00 00 00 ... - 1 ------ 100 (C) There are only two possible next digits, namely 0 and 1. But 0 * (twice the answer so far concatenated with 0) is always 0, which must be less than or equal to the remainder. So all that matters is whether 1 * (twice the answer so far concatenated with 1) is less than or equal to the remainder: if it is, the next result digit is 1; if it is greater than the remainder, the next result digit is 0. One more thing: twice the answer so far concatenated with 1 is the answer so far concatenated with 01. So the result digit selection procedure becomes: concatenate the answer so far with 01, and compare with the remainder. If greater, leave the remainder unchanged and the result digit is 0; if less than or equal, subtract it from the remainder and the result digit is 1. To continue the example: Result so far concatenated with 01 is 101, which is greater than 100, so result digit is 0 and subtract 0: 1. 0 -------------------- V 10. 00 00 00 00 ... - 1 ------ 100 - 0 ------ 10000 Result so far concatenated with 01 is 1001, which is less than 10000, so result digit is 1 and subtract 1001: 1. 0 1 -------------------- V 10. 00 00 00 00 ... - 1 ------- 100 - 0 ------- 10000 - 1001 -------- 11100 Result so far concatenated with 01 is 10101, which is less than 11100, so result digit is 1 and subtract 10101: 1. 0 1 1 -------------------- V 10. 00 00 00 00 ... - 1 ------- 100 - 0 ------- 10000 - 1001 -------- 11100 - 10101 --------- 11100 Result so far concatenated with 01 is 101101, which is greater than 11100, so result digit is 0 and subtract 0: 1. 0 1 1 0 -------------------- V 10. 00 00 00 00 ... - 1 ------- 100 - 0 ------- 10000 - 1001 -------- 11100 - 10101 --------- 11100 - 0 ---------- 1110000 One further point: Step (B) is just a special case of step (C) if we initialise the "square root so far" to 0 - i.e. you don't need special case hardware for the first iteration. Hope this helps. David Seal dseal@armltd.co.uk