
Texture Mapping Optimization for Games


Sample Implementation

All of the ideas discussed on this page are being used in the Alpha engine. To see an example of these ideas in a working application, try Poly2.hpp and Poly2.cpp.

Texture Mapping Optimization

Now that we have a basic perspective correct 3D renderer running (see the previous page), let's see what makes it tick.

Span Sorting and Hidden Surface Removal

Before we get up to our armpits in Pentium instructions, let's take a step back and look at the bigger picture. The fastest instruction is the one you never execute. Likewise, if the front part of the display pipeline is asking the renderer to draw and redraw and re-re-redraw pixels, a lot of time is being wasted no matter how fast the renderer is.

In order to avoid the redraw overhead, the Alpha engine uses a span sorting algorithm in the hidden surface removal stage. The span sorter converts each visible (or partially visible) polygon into a list of pixel spans. Each span describes a visible horizontal slice of the polygon in screen coordinates. Only the visible (non-occluded) parts of each polygon are converted to spans. So the overall polygon rendering routine is to iterate through the list of spans and draw a single horizontal strip of pixels on the screen for each span.

The basic rendering function is:


void DrawSpan(char* screen, int pitch, int x, int y, int count);

which will draw count texture mapped pixels on the screen, starting at location (x, y). This is the function that we need to optimize.
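To make the span-walking idea concrete, here is a minimal sketch in C. The Span record and DrawSpanList names are mine (the Alpha engine's actual structures are in Poly2.hpp), and the stub DrawSpan just fills pixels with a constant value so the walking logic is visible on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical span record: one visible horizontal slice of a polygon,
   in screen coordinates. The real structures live in Poly2.hpp. */
typedef struct {
    int x, y;    /* screen position of the leftmost pixel */
    int count;   /* number of pixels in the span */
} Span;

/* Stand-in for the texture-mapped DrawSpan: fill count pixels with a
   constant value so the span-walking logic can be exercised alone. */
static void DrawSpan(char* screen, int pitch, int x, int y, int count)
{
    memset(screen + (size_t)y * pitch + x, 1, (size_t)count);
}

/* The overall polygon rendering routine: one strip of pixels per span. */
static void DrawSpanList(char* screen, int pitch, const Span* spans, size_t n)
{
    for (size_t i = 0; i < n; i++)
        DrawSpan(screen, pitch, spans[i].x, spans[i].y, spans[i].count);
}
```

Since only the non-occluded parts of each polygon ever become spans, nothing in this loop draws a pixel twice.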

Assembly Language Optimization

The first step is to optimize the layout of the data, shift variables into registers, and code up a tight inner loop in assembly language. Then we'll look at floating point performance and FPU/IPU parallelization. After that we'll do some low level timing of the inner loop, outer loop, and entire span function. Finally, we'll tune up the parts that don't get executed very much, because they can be slow, too!

Inner Loop Optimization

One key idea is to store each row of the texture bitmap on a 256 byte boundary, and the entire texture bitmap on a 64 Kbyte boundary. This allows the inner loop to dispense with some unnecessary shifting and masking of the texture indices (u and v). Note that this doesn't mean that the textures all need to be 256x256; it just means that they need to be stored in memory that way.
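A quick sketch of why this layout pays off (the helper name is mine): with rows 256 bytes apart and the bitmap on a 64 Kbyte boundary, texel (u, v) sits at base + v*256 + u, so the low 16 bits of the offset are just v in the high byte and u in the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* With 256-byte rows and a 64 KB aligned base, the low 16 bits of a
   texel's offset are (v << 8) | u: v lands in the high byte and u in
   the low byte, with no multiply, shift, or mask in the addressing. */
static uint32_t TexelOffset(uint8_t u, uint8_t v)
{
    return ((uint32_t)v << 8) | u;
}
```

This byte layout is exactly what lets the assembly loop keep v in BH and u in BL and use EBX directly as the texture pointer.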

This algorithm also avoids doing any lighting or depth cueing in the inner loop. As in Quake, we assume that all textures have been MIP mapped and pre-lit in the cache. We'll look at texture caching in a later article.

A second performance enhancement can be had by using 16.16 fixed point numbers instead of floating point inside the inner loop. This allows the use of faster (parallel) integer instructions for the texture indexing. Even more important, it avoids some very expensive floating-to-integer conversions: fixed point numbers can be converted to integers with a simple shift.
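The listings use Double2Fix and Fix2Int without showing them; plausible 16.16 definitions (my assumption, not the Alpha engine's actual code) look like this:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t FIXED;   /* 16.16 fixed point */

/* Assumed definitions of the conversion helpers used in the listings:
   multiply by 2^16 going in, arithmetic shift right 16 coming out. */
#define Double2Fix(d)  ((FIXED)((d) * 65536.0))
#define Fix2Int(f)     ((f) >> 16)
```

The cheap Fix2Int shift is the whole point: no floating-to-integer conversion ever happens in the inner loop.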

NOTE: It is also possible to use 24.8 fixed point numbers in the inner loop. This can save up to a whole clock cycle per pixel, but at the expense of image correctness. The reduced precision introduces unacceptable distortion, IMHO, so Alpha uses 16.16.

Just to make sure we're all together so far, here is the inner loop in C:


FIXED u  = Double2Fix(u0);

FIXED v  = Double2Fix(v0);

FIXED du = Double2Fix((u1 - u0) / run);

FIXED dv = Double2Fix((v1 - v0) / run);



while (run--) {

   *screen++ = *(texture + Fix2Int(v)*256 + Fix2Int(u));

   

   v += dv;

   u += du;

}

A third performance enhancement comes from dividing up the integer and fractional parts of the fixed point index values. The integral parts of u and v are kept in the register that points to the texture map (in our case, v is in BH and u is in BL). In effect, this premultiplies v by 256 so that it points to the beginning of a texture map row, while completely eliminating the fixed-to-integer conversion costs.

However, this means that each fixed point addition now requires two ADD instructions. The first ADD instruction adds the fractional part of the two fixed point numbers. The addition is completed by an ADC instruction which carries the overflow bit from the fractional add up into the integral part of the fixed point number. This also means that the fractional parts of the fixed point numbers must all be stored in the upper 16 bits of a 32 bit register.
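Here is a C sketch of what the ADD/ADC pair computes, under the register layout described above (the function and variable names are mine): because the fractional part lives in the top 16 bits of a 32-bit word, wrap-around on the 32-bit add is exactly the carry that ADC folds into the one-byte integral part.

```c
#include <assert.h>
#include <stdint.h>

/* One u step of the inner loop, emulated in C. ufrac holds the
   fractional part of u in its top 16 bits (like EDX); ui is the
   one-byte integral part (like BL); dufrac/dui are the step values. */
static void StepU(uint32_t* ufrac, uint8_t* ui, uint32_t dufrac, uint8_t dui)
{
    uint32_t before = *ufrac;
    *ufrac += dufrac;                    /* ADD edx, esi */
    uint8_t carry = (*ufrac < before);   /* CF out of the 32-bit add */
    *ui = (uint8_t)(*ui + dui + carry);  /* ADC bl, dl */
}
```
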

The basic register layout for our inner loop looks like this:

Register    31..24      23..16      15..8       7..0
----------------------------------------------------------
EAX         Vf                      0           color
EBX         Tex Base                Vi          Ui
ECX         dVf                     count
EDX         Uf                      dVi         dUi
ESI         dUf                     0           0
EDI         Screen Pointer

Table 1: Pentium Register Utilization

As shown in Table 1, EBX points into the texture bitmap, and EDI points to the destination pixel on the screen. CX contains the loop counter. And AL is used to copy the texel from the texture bitmap to the screen destination pixel.

The inner loop is just nine instructions long, of which two are loop overhead:


ml5:

      mov   al, BYTE PTR [ebx]   // read from texture

      mov   BYTE PTR [edi], al   // write to screen



      add   edx, esi             // update fractional part of u

      adc   bl,  dl              // update integral part of u

      add   eax, ecx             // update fractional part of v

      adc   bh,  dh              // update integral part of v



      inc   edi                  // next pixel in screen

      dec   cx                   // decrement count

      jnz   ml5                  // and do it again

By taking advantage of instruction pairing, this loop executes in 6-7 clocks per pixel (on average) on a Pentium, including memory access time.

Oddly enough, reordering these instructions will not speed things up. For example, there is a read-after-write dependency in the first two instructions that keeps them from pairing. And there is an unnecessary read-after-write in the third and fourth instructions. So it seems that swapping instructions two and three should improve instruction pairing and reduce the cycle count.

Empirically, however, this is not the case. It turns out that ADC instructions can only pair in the U pipe of the Pentium processor. Since the ADC instruction is the fourth instruction in the loop, an execution slot is being wasted. We can fix this by reordering the instructions so that the ADC falls on an odd cycle:


ml5:

      mov   al, BYTE PTR [ebx]   // read from texture

      add   edx, esi             // update fractional part of u



      adc   bl,  dl              // update integral part of u

      add   eax, ecx             // update fractional part of v



      adc   bh,  dh              // update integral part of v

      mov   BYTE PTR [edi], al   // write to screen



      inc   edi                  // next pixel in screen

      dec   cx                   // decrement count

      jnz   ml5                  // and do it again

This gets things paired up nicely, but will not work correctly. The fourth instruction (add eax, ecx) is incidentally adding the run count (in CX) to the texel value in AL. We can fix this by getting the run count out of CX. If we use the BP register to hold the count, we can free up CX. Actually, we want to use a full 32 bit register for the count anyway, to avoid a prefix byte in the DEC instruction.

The final version of the loop works like this:


      push  ebp                  // save the base pointer

      xor   ebp, ebp             // init it to zero

      mov   bp, cx               // now copy over the 16 bit count

      xor   cx, cx               // and clear out the bottom of ECX



ml5:

      mov   al, BYTE PTR [ebx]   // read from texture

      add   edx, esi             // update fractional part of u



      adc   bl,  dl              // update integral part of u

      add   eax, ecx             // update fractional part of v



      adc   bh,  dh              // update integral part of v

      mov   BYTE PTR [edi], al   // write to screen



      inc   edi                  // next pixel in screen

      dec   ebp                  // decrement count

      jnz   ml5                  // and do it again

      

      pop   ebp                  // restore the base pointer

That loop should execute in 5 cycles, not counting cache misses. It is theoretically possible to do it in 4 cycles, by combining the count decrement and index increment operations. I haven't tried this yet myself, so I won't show it here.

Floating Point Optimization

Now that the inner loop is coded using strictly integer instructions, we can take advantage of the parallel integer and floating point execution units on the Pentium Processor. That expensive (39 cycle) floating point divide that needs to be done once every sixteen pixels can actually be done in parallel with the inner loop. This converts a 40 cycle operation to 2 cycles.

Here's how it's done. The original algorithm looks a bit like this:


while (count) {

   u  = Fix2Int(u0);

   v  = Fix2Int(v0);

   du = (u1-u0) >> 4;

   dv = (v1-v0) >> 4;

   

   // inner loop goes here

   

   a += A.x;

   b += B.x;

   c += C.x;

   

   cinv = 1.0 / c;   // here's the divide

   

   u0 = u1;

   v0 = v1;

   

   u1 = (long)(a*cinv*65536.0);

   v1 = (long)(b*cinv*65536.0);



   count -= run;

}

By reordering the operations, and using a bit of inline assembly, we get the following:


while (count) {

   u  = Fix2Int(u0);

   v  = Fix2Int(v0);

   du = (u1-u0) >> 4;

   dv = (v1-v0) >> 4;

   

   a += A.x;

   b += B.x;

   c += C.x;

   

   // here's the divide

   _asm {

      fld1

      fdiv c

   }



   // inner loop goes here



   _asm {

      fstp cinv

   }



   u0 = u1;

   v0 = v1;

   

   u1 = (long)(a*cinv*65536.0);

   v1 = (long)(b*cinv*65536.0);



   count -= run;

}

Because the results of the FDIV instruction aren't used until the FSTP instruction, and because no other floating point instructions are needed, the Pentium pipeline will execute the entire integer inner loop concurrently with the floating point divide. Instead of taking 40 clock cycles, the divide and store operations only take 2 clock cycles -- one apiece.

As an aside, a competent C++ compiler optimizer should be able to do the same instruction reordering to eliminate the divide overhead. And in fact Visual C++ can do so, up to a point. The optimizer will insert as many operations as possible between the divide and store, but it will not cross an inline assembly code block to do so. In this case, that's the whole point of the optimization and so we use a bit more inline assembly to do the job right. Still, it's good to know that the compiler optimizer is in there batting for us!

Floating Point Precision

Another useful trick is to shift the Pentium Floating Point Unit into low-precision mode. This causes all intrinsic operations (including those divides) to be performed at 32 bit precision. In the case of an FDIV instruction, this reduces the execution time for the divide from 39 clocks to 19 clocks.

short OldFPUCW, FPUCW;



_asm {

   // Put the FPU in 32 bit mode

   fstcw   [OldFPUCW]            // store copy of CW

   mov     ax,OldFPUCW           // get it in ax

   and     eax,NOT 0x0300        // 24 bit precision

   mov     [FPUCW],ax            // store it

   fldcw   [FPUCW]               // load the FPU

}



// Main algorithm code goes here



_asm {

   fldcw   [OldFPUCW]            // restore the FPU

}

Timing and Profiling

At this point in the optimization process we found that even though the inner loop was running pretty well, the overall performance of the renderer was not all that great. In order to know what to do about it, we first had to understand the problem. In this case, that meant gathering more detailed performance information.

The Pentium is actually a big help here. Every Pentium has an onboard clock cycle counter that is maintained in a semi-undocumented register called the Time Stamp Counter. This sixty-four bit register is incremented on every clock cycle. The Pentium instruction set includes a new instruction RDTSC to ReaD the Time Stamp Counter and store its value in EDX:EAX.

Unfortunately, the Microsoft Visual C++ compiler does not recognize that opcode. To make use of the timer for profiling, I wrote this little CPP macro:


#define TIMESNAP(clock) _asm push eax\

                        _asm push edx\

                        _asm _emit 0x0F\

                        _asm _emit 0x31\

                        _asm mov clock, eax\

                        _asm pop edx\

                        _asm pop eax

To find the precise number of clock cycles expended executing a block of code, just do this:


unsigned long start, stop;



TIMESNAP(start);

// code under test

TIMESNAP(stop);



printf("Test code took %lu cycles\n", stop-start-16);

When doing this kind of testing, the TIMESNAP macros seem to add about 16 clock cycles of overhead, so remember to subtract that out of your calculations.
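The same calibrate-and-subtract idea can be sketched with the portable C11 timer instead of RDTSC (on modern compilers the counter itself is available directly as the __rdtsc() intrinsic, so the _emit dance is no longer needed). The function names here are mine:

```c
#include <assert.h>
#include <time.h>

/* Nanosecond timestamp from the standard C11 clock. */
static long long NowNs(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Estimate the measurement overhead by timing an empty region, the
   same way the 16-cycle TIMESNAP overhead is measured and subtracted. */
static long long TimerOverheadNs(void)
{
    long long start = NowNs();
    long long stop  = NowNs();   /* nothing in between */
    return stop - start;
}
```
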

Outer Loop Optimization

Armed with a new diagnostic tool, we set about to see where the clock cycles were going throughout the DrawSpan function.

Watch Out for Casting

First discovery: did you know that it takes about 40 clock cycles to convert a double to an int in Visual C++? Neither did I.

Float to int conversion is done through a C intrinsic function called _ftol. Aside from the function call overhead, this function performs a FISTP instruction which takes 6 clock cycles. Unfortunately it also monkeys with the FPU control word a couple of times (7 cycles each), and does some bit masking and pushing and popping, which adds up to 38 clock cycles in the _ftol function plus the call overhead.

Well OK, so it's expensive. But we only need to do that twice every sub-span, right? Let's see now...80 clock cycles every 16 pixels is...FIVE CLOCKS PER PIXEL?!? I'm glad I spent all that time getting the inner loop down to six clocks per pixel!

Check for Common Subexpressions, Too

Fortunately this is easily solved. It turns out that a simple FISTP does the right thing anyways, at six clocks per cast. As long as we're using a little more inline assembly, let's see what else is going on. The code in question is:


u0 = (long)(a * cinv * 65536.0); // a/c => fixed point

v0 = (long)(b * cinv * 65536.0); // b/c => fixed point

It turns out that the C++ optimizer isn't eliminating that common subexpression either. Let's clean this up:

const double sixtyfive = 65536.0;



#define UPDATE_U_V(u,v)\

            _asm fld   sixtyfive\

            _asm fmul  cinv\

            _asm fld   st(0)\

            _asm fmul  a\

            _asm fistp u\

            _asm fmul  b\

            _asm fistp v

This saves us a multiply and a load, and all of the overhead of the _ftol function. Not bad.
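For reference, the same fix expressed in portable C (without the FPU scheduling benefits of the inline assembly macro): hoist the shared cinv * 65536.0 factor so it is computed only once. The names follow the listings above.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t FIXED;   /* 16.16 fixed point */

/* Hoist the common subexpression: one multiply for the shared scale
   factor, then one multiply and one conversion per coordinate. */
static void UpdateUV(double a, double b, double cinv, FIXED* u, FIXED* v)
{
    double scale = cinv * 65536.0;   /* computed once, used twice */
    *u = (FIXED)(a * scale);
    *v = (FIXED)(b * scale);
}
```
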

Don't Forget the Function Setup Code

Lastly, let's take a look at the code that only executes once per span. There isn't much room for improvement here. A few instructions can be reordered, and we can use the UPDATE_U_V macro twice more per span. This will net a savings of about 100 clock cycles per span.

Why bother? How often does this code get executed anyways? More than I originally thought. In high resolution (640x480), if we assume an average of four spans per screen line (this is way too low), we have 1920 spans per frame. So you do the math.

Processor Cache Effects

Aside from the obvious effect that the Pentium only executes integer instructions in parallel when they are read from the L1 instruction cache, what other cache effects are there to worry about? Frankly, I'm not really sure. We do know that an L1 data cache miss costs up to 18 clock cycles. So it's very important to have the texture data fit easily into the L1 data cache. Anything beyond this simple advice is more than I have had time to investigate.

Results

So how did we do?

Simply using a tight inner loop produced an average of 28 clocks per pixel (even though the inner loop code only used about 7 or 8). The worst case performance was about 40 clocks per pixel.

The first round of optimization (using the parallel divide and reduced-precision FPU) produced an average of 24 clocks per pixel, getting up over 30 for complex scenes and down to 21 for the simplest scenes.

Timing our current code shows an average of about 18 clocks per pixel, getting up to about 24 clocks per pixel in very complex scenes and down to about 16 or 17 clocks per pixel in the simplest scenes.

So our comprehensive optimization of the entire renderer produced a performance improvement of about 50% over just using a fast inner loop alone, and it also smoothed out the performance in the worst scenes. In practical terms, this puts the peak performance of our renderer at 22 frames per second at 640x480 on my Pentium 133. The early tight loop code had a peak performance of only 15 frames per second.

Acknowledgements

Many of the ideas described here came from Mike Abrash's work on the Quake engine. The low precision FPU code came from Chris Hecker's texture mapping code from Game Developer Magazine. Finally, stupid Pentium tricks were inspired by Mike Schmit's Pentium Processor Optimization Tools (ISBN 0-12-627230-1).


This page is designed and implemented by John DiCamillo
Copyright © 1995-1997 John DiCamillo. All rights reserved.
