Texture Mapping Optimization for Games

Sample Implementation

All of the ideas discussed on this page are being used in the Alpha engine. To see an example of these ideas in a working application, try Poly2.hpp and Poly2.cpp.

Texture Mapping Optimization

Now that we have a basic perspective-correct 3D renderer running (see the previous page), let's see what makes it tick.

Span Sorting and Hidden Surface Removal

Before we get up to our armpits in Pentium instructions, let's take a step back and look at the bigger picture. The fastest instruction is the one you never execute. Likewise, if the front part of the display pipeline is asking the renderer to draw and redraw and re-re-redraw pixels, a lot of time is being wasted no matter how fast the renderer is.

In order to avoid the redraw overhead, the Alpha engine uses a span sorting algorithm in the hidden surface removal stage. The span sorter converts each visible (or partially visible) polygon into a list of pixel spans. Each span describes a visible horizontal slice of the polygon in screen coordinates. Only the visible (non-occluded) parts of each polygon are converted to spans. So the overall polygon rendering routine is to iterate through the list of spans and draw a single horizontal strip of pixels on the screen for each span. The basic rendering function is:

    void DrawSpan(char* screen, int pitch, int x, int y, int count);

which will draw count texture mapped pixels on the screen, starting at location (x, y). This is the function that we need to optimize.
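To make the flow concrete, here is a minimal sketch of the span-drawing loop. The Span struct and DrawSpanList are illustrative names only; just DrawSpan's signature comes from the engine:

    void DrawSpan(char* screen, int pitch, int x, int y, int count);

    // Hypothetical span record: one visible horizontal strip of a polygon.
    struct Span {
        int x, y;       // screen coordinates of the left end of the strip
        int count;      // number of visible pixels in the strip
    };

    // Draw every visible strip produced by the span sorter.
    void DrawSpanList(char* screen, int pitch, const Span* spans, int numSpans)
    {
        for (int i = 0; i < numSpans; i++)
            DrawSpan(screen, pitch, spans[i].x, spans[i].y, spans[i].count);
    }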
Assembly Language Optimization

The first step is to optimize the layout of the data, shift variables into registers, and code up a tight inner loop in assembly language. Then we'll look at floating point performance and FPU/IPU parallelization. After that we'll do some low level timing of the inner loop, outer loop, and entire span function. Finally, we'll tune up the parts that don't get executed very much, because they can be slow, too!

Inner Loop Optimization

One key idea is to store each row of the texture bitmap on a 256 byte boundary, and the entire texture bitmap on a 64 Kbyte boundary. This allows the inner loop to dispense with some unnecessary shifting and masking of the texture indices (u and v). Note that this doesn't mean that the textures all need to be 256x256; it just means that they need to be stored in memory that way.

This algorithm also avoids doing any lighting or depth cueing in the inner loop. As in Quake, we assume that all textures have been MIP mapped and pre-lit in the cache. We'll look at texture caching in a later article.

A second performance enhancement can be had by using 16.16 fixed point numbers instead of floating point inside the inner loop. This allows the use of faster (parallel) integer instructions for the texture indexing. Even more important, it avoids some very expensive floating-to-integer conversions: fixed point numbers can be converted to integers with a simple shift.
NOTE: It is also possible to use 24.8 fixed point numbers in the inner loop. This can save up to a whole clock cycle per pixel, but at the expense of image correctness. The reduced precision introduces unacceptable distortion, IMHO, so Alpha uses 16.16.

Just to make sure we're all together so far: in C, the inner loop would look like this:
    FIXED u  = Double2Fix(u0);
    FIXED v  = Double2Fix(v0);
    FIXED du = Double2Fix((u1 - u0) / run);
    FIXED dv = Double2Fix((v1 - v0) / run);

    while (run--) {
        *screen++ = *(texture + Fix2Int(v)*256 + Fix2Int(u));
        v += dv;
        u += du;
    }

A third performance enhancement comes from dividing up the integer and fractional parts of the fixed point index values. The integral parts of u and v are kept in the register that points to the texture map (in our case, v is in BH and u is in BL). In effect, this premultiplies v by 256 so that it points to the beginning of a texture map row, while completely eliminating the fixed-to-integer conversion costs. However, this means that each fixed point addition now requires two instructions: an ADD for the fractional part, followed by an ADC (add with carry) to propagate the carry into the integral part.

The basic register layout for our inner loop looks like this:

    Table 1: Inner loop register layout
        EAX     fractional part of v in the high word; AL holds the texel
        EBX     texture pointer: BH = integer part of v, BL = integer part of u
        ECX     fractional part of dv in the high word; CX holds the pixel count
        EDX     fractional part of u in the high word; DH = integer part of dv,
                DL = integer part of du
        ESI     fractional part of du in the high word
        EDI     screen pointer

As shown in Table 1, the inner loop is just nine instructions long, of which two are loop overhead:

    ml5:    mov al, BYTE PTR [ebx]      // read from texture
            mov BYTE PTR [edi], al      // write to screen
            add edx, esi                // update fractional part of u
            adc bl, dl                  // update integral part of u
            add eax, ecx                // update fractional part of v
            adc bh, dh                  // update integral part of v
            inc edi                     // next pixel in screen
            dec cx                      // decrement count
            jnz ml5                     // and do it again

By taking advantage of instruction pairing, this loop executes in 6-7 clocks per pixel (on average) on a Pentium, including memory access time.

Oddly enough, reordering these instructions will not speed things up. For example, there is a read-after-write dependency in the first two instructions that keeps them from pairing. And there is an unnecessary read-after-write in the third and fourth instructions. So it seems that swapping instructions two and three should improve instruction pairing and reduce the cycle count. Empirically, however, this is not the case. It turns out that a more aggressive reordering is needed to get everything paired:
    ml5:    mov al, BYTE PTR [ebx]      // read from texture
            add edx, esi                // update fractional part of u
            adc bl, dl                  // update integral part of u
            add eax, ecx                // update fractional part of v
            adc bh, dh                  // update integral part of v
            mov BYTE PTR [edi], al      // write to screen
            inc edi                     // next pixel in screen
            dec cx                      // decrement count
            jnz ml5                     // and do it again

This gets things paired up nicely, but will not work correctly. The fourth instruction (add eax, ecx) updates the fractional part of v, but AL is still holding the texel, which hasn't been written to the screen yet. Since the low word of ECX holds the pixel count, the addition corrupts the texel before it is stored. The fix is to move the count into another register (EBP) and clear out the bottom of ECX so that the addition can no longer disturb AL.

The final version of the loop works like this:

            push ebp                    // save the base pointer
            xor ebp, ebp                // init it to zero
            mov bp, cx                  // now copy over the 16 bit count
            xor cx, cx                  // and clear out the bottom of ECX
    ml5:    mov al, BYTE PTR [ebx]      // read from texture
            add edx, esi                // update fractional part of u
            adc bl, dl                  // update integral part of u
            add eax, ecx                // update fractional part of v
            adc bh, dh                  // update integral part of v
            mov BYTE PTR [edi], al      // write to screen
            inc edi                     // next pixel in screen
            dec ebp                     // decrement count
            jnz ml5                     // and do it again
            pop ebp                     // restore the base pointer

That loop should execute in 5 cycles, not counting cache misses. It is theoretically possible to do it in 4 cycles, by combining the count decrement and index increment operations. I haven't tried this yet myself, so I won't show it here.

Floating Point Optimization

Now that the inner loop is coded using strictly integer instructions, we can take advantage of the parallel integer and floating point execution units on the Pentium Processor. That expensive (39 cycle) floating point divide that needs to be done once every sixteen pixels can actually be done in parallel with the inner loop. This converts a 40 cycle operation to 2 cycles.

Here's how it's done. The original algorithm looks a bit like this:
    while (count) {
        u = Fix2Int(u0);
        v = Fix2Int(v0);
        du = (u1-u0) >> 4;
        dv = (v1-v0) >> 4;

        // inner loop goes here

        a += A.x;
        b += B.x;
        c += C.x;

        cinv = 1.0 / c;                     // here's the divide

        u0 = u1;
        v0 = v1;
        u1 = Double2Fix(a*cinv*65536.0);
        v1 = Double2Fix(b*cinv*65536.0);
        count -= run;
    }

By reordering the operations, and using a bit of inline assembly, we get the following:
    while (count) {
        u = Fix2Int(u0);
        v = Fix2Int(v0);
        du = (u1-u0) >> 4;
        dv = (v1-v0) >> 4;

        a += A.x;
        b += B.x;
        c += C.x;

        _asm {                              // here's the divide
            fld1
            fdiv c
        }

        // inner loop goes here

        _asm {
            fstp cinv
        }

        u0 = u1;
        v0 = v1;
        u1 = Double2Fix(a*cinv*65536.0);
        v1 = Double2Fix(b*cinv*65536.0);
        count -= run;
    }

Because the results of the FDIV instruction aren't used until the FSTP instruction, and because no other floating point instructions are needed, the Pentium pipeline will execute the entire integer inner loop concurrently with the floating point divide. At five cycles per pixel, a sixteen pixel sub-span gives the divide roughly 80 cycles of integer work to hide behind -- more than enough to cover it. Instead of taking 40 clock cycles, the divide and store operations only take 2 clock cycles -- one apiece.
As an aside, a competent C++ compiler optimizer should be able to do the same instruction reordering to eliminate the divide overhead. And in fact Visual C++ can do so, up to a point. The optimizer will insert as many operations as possible between the divide and store, but it will not cross an inline assembly code block to do so. In this case, that's the whole point of the optimization, so we use a bit more inline assembly to do the job right. Still, it's good to know that the compiler optimizer is in there batting for us!

Floating Point Precision

Another useful trick is to shift the Pentium Floating Point Unit into low-precision mode. This causes all intrinsic operations (including those divides) to be performed at 32 bit (single) precision. In the case of an FDIV instruction, this reduces the execution time for the divide from 39 clocks to 19 clocks.
    short OldFPUCW, FPUCW;

    _asm {                      // Put the FPU in 32 bit mode
        fstcw [OldFPUCW]        // store copy of CW
        mov ax, OldFPUCW        // get it in ax
        and eax, NOT 0x0300     // 24 bit precision
        mov [FPUCW], ax         // store it
        fldcw [FPUCW]           // load the FPU
    }

    // Main algorithm code goes here

    _asm {
        fldcw [OldFPUCW]        // restore the FPU
    }

Timing and Profiling

At this point in the optimization process we found that even though the inner loop was running pretty well, the overall performance of the renderer was not all that great. In order to know what to do about it, we first had to understand the problem. In this case, that meant gathering more detailed performance information.
The Pentium is actually a big help here. Every Pentium has an onboard clock cycle counter that is maintained in a semi-undocumented register called the Time Stamp Counter. This sixty-four bit register is incremented on every clock cycle. The Pentium instruction set includes a new instruction, RDTSC (opcode 0F 31), that copies the current value of the counter into the EDX:EAX register pair.

Unfortunately, the Microsoft Visual C++ compiler does not recognize that opcode. To make use of the timer for profiling, I wrote this little CPP macro:
    // Emit RDTSC (0F 31) by hand and copy the low dword of the
    // timestamp into 'clock', preserving EAX and EDX:
    #define TIMESNAP(clock) _asm push eax\
                            _asm push edx\
                            _asm _emit 0x0F\
                            _asm _emit 0x31\
                            _asm mov clock, eax\
                            _asm pop edx\
                            _asm pop eax

To find the precise number of clock cycles expended executing a block of code, just do this:
    unsigned long start, stop;

    TIMESNAP(start);
    // code under test
    TIMESNAP(stop);

    printf("Test code took %lu cycles\n", stop-start-16);

When doing this kind of testing, the TIMESNAP macros seem to add about 16 clock cycles of overhead, so remember to subtract that out of your calculations.
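Here's a quick way to double-check that constant on your own machine: time an empty block, using nothing beyond the macro above.

    unsigned long t0, t1;

    TIMESNAP(t0);       // nothing at all between the two snapshots...
    TIMESNAP(t1);

    printf("TIMESNAP overhead: %lu cycles\n", t1 - t0);    // prints about 16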
Outer Loop Optimization

Armed with a new diagnostic tool, we set about to see where the clock cycles were going throughout the DrawSpan function.
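For example, a clocks-per-pixel figure for the whole function can be had by bracketing each call with the timer. The DrawSpan arguments and the two accumulators in this sketch are just illustrative:

    unsigned long t0, t1;
    unsigned long span_cycles = 0;      // running total of cycles in DrawSpan
    unsigned long span_pixels = 0;      // running total of pixels drawn

    TIMESNAP(t0);
    DrawSpan(screen, pitch, x, y, count);
    TIMESNAP(t1);

    span_cycles += (t1 - t0) - 16;      // subtract the macro overhead
    span_pixels += count;               // cycles/pixels = clocks per pixel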
Watch Out for Casting

First discovery: did you know that it takes about 40 clock cycles to convert a double to an int in Visual C++? Neither did I.

Well OK, so it's expensive. But we only need to do that twice every sub-span, right? Let's see now... 80 clock cycles every 16 pixels is... FIVE CLOCKS PER PIXEL?!? I'm glad I spent all that time getting the inner loop down to six clocks per pixel!
Check for Common Subexpressions, Too

Fortunately this is easily solved. It turns out that a simple FISTP does the right thing anyway, at six clocks per cast. As long as we're using a little more inline assembly, let's see what else is going on. The code in question is:

    u0 = (long)(a * cinv * 65536.0);    // a/c => fixed point
    v0 = (long)(b * cinv * 65536.0);    // b/c => fixed point

It turns out that the C++ optimizer isn't eliminating that common subexpression (cinv * 65536.0) either. Let's clean this up:

    const double sixtyfive = 65536.0;

    #define UPDATE_U_V(u,v)\
        _asm fld sixtyfive\
        _asm fmul cinv\
        _asm fld st(0)\
        _asm fmul a\
        _asm fistp u\
        _asm fmul b\
        _asm fistp v

This saves us a multiply and a load, and all of the overhead of the _ftol function. Not bad.
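With the macro in place, the pair of casts above collapses to a single invocation (assuming a, b, and cinv are doubles visible to the inline assembly, and u0 and v0 are 32 bit integers):

    UPDATE_U_V(u0, v0);     // u0 = a*cinv*65536, v0 = b*cinv*65536

One subtlety worth knowing: FISTP honors the FPU's rounding mode (round-to-nearest by default) while a C cast truncates toward zero, so the results can differ by one in the last fixed point bit. For texture coordinates, that difference is invisible.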
Don't Forget the Function Setup Code

Lastly, let's take a look at the code that only executes once per span. There isn't much room for improvement here. A few instructions can be reordered, and we can use the UPDATE_U_V macro twice more per span. This will net a savings of about 100 clock cycles per span.

Why bother? How often does this code get executed anyway? More than I originally thought. In high resolution (640x480), if we assume an average of four spans per screen line (this is way too low), we have 1920 spans per frame. So you do the math: at 100 clock cycles saved per span, that's nearly 200,000 cycles per frame, and at 30 frames per second it adds up to almost six million cycles -- over four percent of a Pentium 133 -- every second.

Processor Cache Effects

Aside from the obvious effect that the Pentium only executes integer instructions in parallel when they are read from the L1 instruction cache, what other cache effects are there to worry about? Frankly, I'm not really sure. We do know that an L1 data cache miss costs up to 18 clock cycles. So it's very important to have the texture data fit easily into the data cache. Anything beyond this simple advice is more than I have had time to investigate.

Results

So how did we do? Simply using a tight inner loop produced an average of 28 clocks per pixel (even though the inner loop code only used about 7 or 8). The worst case performance was about 40 clocks per pixel. The first round of optimization (using the parallel divide and reduced-precision FPU) produced an average of 24 clocks per pixel, getting up over 30 for complex scenes and down to 21 for the simplest scenes. Timing our current code shows an average of about 18 clocks per pixel, getting up to about 24 clocks per pixel in very complex scenes and down to about 16 or 17 clocks per pixel in the simplest scenes.

So our comprehensive optimization of the entire renderer produced a performance improvement of about 50% over just using a fast inner loop alone, and it also smoothed out the performance in the worst scenes. In practical terms, this puts the peak performance of our renderer at 22 frames per second at 640x480 on my Pentium 133. The early tight loop code had a peak performance of only 15 frames per second.

Acknowledgements

Many of the ideas described here came from Mike Abrash's work on the Quake engine. The low precision FPU code came from Chris Hecker's texture mapping articles in Game Developer Magazine. Finally, stupid Pentium tricks were inspired by Mike Schmit's Pentium Processor Optimization Tools (ISBN 0-12-627230-1).
Copyright © 1995-1997 John DiCamillo. All rights reserved.