Optimizing Database Rendering Code

This section includes some suggestions for writing peak-performance code for inner rendering loops.

Ideally, an application spends most of its time traversing the database and sending data to the graphics pipeline. Instructions in the display loop are executed many times every frame, creating hot spots. Any extra overhead in a hot spot is greatly magnified by the number of times it is executed.

When using simple, high-performance graphics primitives, the application is even more likely to be CPU limited. The data traversal must be optimized so that it does not become a bottleneck.

During rendering, the sections of code that actually issue graphics commands should be the hot spots in application code. These subroutines should use peak-performance coding methods. Small improvements to a line that is executed for every vertex in a database accumulate to have a noticeable effect when the entire frame is rendered.

The rest of this section looks at examples and techniques for optimizing immediate-mode rendering:

"Examples for Optimizing Data Structures for Drawing"
"Examples for Optimizing Program Structure"
"Using Specialized Drawing Subroutines and Macros"
"Preprocessing Drawing Data: Introduction"
"Preprocessing Meshes Into Fixed-Length Strips"
"Preprocessing Vertex Loops"

Examples for Optimizing Data Structures for Drawing

Follow these suggestions for optimizing how your application accesses data:

One-Dimensional Arrays. Use one-dimensional arrays traversed with a pointer that always holds the address for the current drawing command. Avoid array-element addressing or multidimensional array accesses.
bad: glVertex3fv(&data[i][j][k]);
good: glVertex3fv(dataptr);
Adjacent structures. Keep all static drawing data for a given object together in a single contiguous array traversed with a single pointer. Keep this data separate from other program data, such as pointers to drawing data, or interpreter flags.
Flat structures. Use flat data structures and do not use multiple pointer indirection when rendering:
bad: glVertex3fv(object->data->vert);
ok: glVertex3fv(dataptr->vert);
best: glVertex3fv(dataptr);
The following code fragment is an example of efficient code to draw a single smooth-shaded, lit polygon from quad-word aligned data. Notice that a single data pointer is used. It is updated once at the end of the polygon, after the glEnd() call.

glBegin(GL_QUADS);
glNormal3fv(ptr);
glVertex3fv(ptr+4);
glNormal3fv(ptr+8);
glVertex3fv(ptr+12);
glNormal3fv(ptr+16);
glVertex3fv(ptr+20);
glNormal3fv(ptr+24);
glVertex3fv(ptr+28);
glEnd();
ptr += 32;
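
The layout that the fragment above walks through can be produced once, before rendering. The following is a minimal sketch, assuming the source data starts out as two separate arrays of three floats per vertex; the names interleave_vertices and FLOATS_PER_VERTEX are hypothetical and do not come from the original text. Each normal and each position is padded to four floats, which matches the ptr+4 and ptr+8 offsets used above. The sketch only arranges the data; aligning the start of the output array to a quad-word boundary is left to whatever allocator the application uses.

```c
#include <stddef.h>

/* 4 floats for the padded normal + 4 floats for the padded position. */
#define FLOATS_PER_VERTEX 8

/* Interleave separate normal and position arrays (3 floats per vertex each)
 * into one contiguous array laid out as: nx ny nz pad  x y z pad.
 * 'out' must have room for count * FLOATS_PER_VERTEX floats. */
static void interleave_vertices(const float *normals, const float *positions,
                                size_t count, float *out)
{
    size_t i;
    for (i = count; i > 0; i--) {
        out[0] = normals[0];
        out[1] = normals[1];
        out[2] = normals[2];
        out[3] = 0.0f;            /* padding */
        out[4] = positions[0];
        out[5] = positions[1];
        out[6] = positions[2];
        out[7] = 0.0f;            /* padding */
        normals += 3;
        positions += 3;
        out += FLOATS_PER_VERTEX;
    }
}
```

After this pass, a quad occupies 32 consecutive floats, so the drawing code needs only one pointer and one update per polygon.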

Examples for Optimizing Program Structure

Loop unrolling (1). Avoid short, fixed-length loops, especially around vertices. Instead, unroll these loops:
bad:
for(i=0; i < 4; i++){
glColor4ubv(poly_colors[i]);
glVertex3fv(poly_vert_ptr[i]);
}
good:
glColor4ubv(poly_colors[0]);
glVertex3fv(poly_vert_ptr[0]);
glColor4ubv(poly_colors[1]);
glVertex3fv(poly_vert_ptr[1]);
glColor4ubv(poly_colors[2]);
glVertex3fv(poly_vert_ptr[2]);
glColor4ubv(poly_colors[3]);
glVertex3fv(poly_vert_ptr[3]);
Loop unrolling (2). Minimize the work done in a loop to maintain and update variables and pointers. Unrolling can often assist in this:
bad:
glNormal3fv(*(ptr++)); glVertex3fv(*(ptr++));
or
glNormal3fv(ptr); ptr += 4;
glVertex3fv(ptr); ptr += 4;
good:
glNormal3fv(*(ptr)); glVertex3fv(*(ptr+1)); glNormal3fv(*(ptr+2)); glVertex3fv(*(ptr+3));
or
glNormal3fv(ptr); glVertex3fv(ptr+4); glNormal3fv(ptr+8); glVertex3fv(ptr+12);
Note: On some processors, such as the R8000™, loop unrolling may hurt performance more than it helps, so use it with caution. Unrolling too far hurts on any processor because the loop may occupy an excessive portion of the instruction cache. If it occupies a large enough portion, it may interfere with itself: the whole loop may not fit (unlikely), or it may conflict with the instructions of one of the subroutines it calls.
Loops accessing buffers. Minimize the number of different buffers accessed in a loop:
bad:
glNormal3fv(normaldata);
glTexCoord2fv(texdata);
glVertex3fv(vertdata);
good:
glNormal3fv(dataptr);
glTexCoord2fv(dataptr+4);
glVertex3fv(dataptr+8);
Loop end conditions. Make end conditions on loops as trivial as possible; for example, compare the loop variable to a constant, preferably zero. Decrementing loops are often more efficient than their incrementing counterparts:
bad:
for (i = 0; i < (end-beginning)/size; i++)
{...}
better:
for (i = beginning; i < end; i += size)
{...}
good:
for (i = total; i > 0; i--)
{...}
Conditional statements.
- Use switch statements instead of multiple if-else-if control structures.
- Avoid if tests around vertices; use duplicate code instead.
Division. Avoid division. Shift or multiply by a reciprocal instead:
f = x * 0.5 instead of f = x / 2.0
Integer division is even slower than floating-point division:
i = j >> 1 instead of i = j/2
The shift gives the same result as the division only when j is non-negative.
Subroutine prototyping. Prototype subroutines in ANSI C style to avoid runtime typecasting of parameters:
void drawit(float f, int count)
{
.......
}
Typecasting. Avoid typecasting of values, which happens at run-time:
val = (float) *f;
Instead, use typecasting of pointers, which occurs at compile time and is efficient:

int *ptr;
*(float *) ptr = float_val;
float_val = *(float *) ptr;
Multiple polygons. Send multiple polygons between glBegin()/glEnd() whenever possible:
glBegin(GL_TRIANGLES);
....
..../* many triangles */
....
glEnd();

glBegin(GL_QUADS);
....
..../* many quads */
....
glEnd();
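
The advice above on division and on loop end conditions can be combined in one small routine. This is a sketch only; the function name scale_by_reciprocal is hypothetical. It performs a single division before the loop and one multiplication per element inside it, and the loop counts down to zero.

```c
#include <stddef.h>

/* Divide every element by 'divisor' using one division in total:
 * the reciprocal is computed once, outside the loop. */
static void scale_by_reciprocal(float *values, size_t count, float divisor)
{
    const float inv = 1.0f / divisor;   /* the only division */
    size_t i;
    for (i = count; i > 0; i--) {       /* count down, compare against zero */
        *values *= inv;
        values++;
    }
}
```

Multiplying by a reciprocal can differ from a true division in the last bit when the reciprocal is not exactly representable, so this trade is appropriate for drawing data but may not be for code that needs exactly rounded results.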

Using Specialized Drawing Subroutines and Macros

This section looks at several ways to improve performance by making appropriate choices about display modes, geometry, and so on.

Geometry display choices. Make decisions about which geometry to display and which modes to use at the highest possible level in the program organization.
The drawing subroutines should be highly specialized leaves in the program's call tree. Decisions made too far down the tree can be redundant. For example, consider a program that switches back and forth between flat-shaded and smooth-shaded drawing. Once this choice has been made for a frame, the decision is fixed and the flag is set. The following code is inefficient because it tests that flag again inside the drawing loop:

/* Inefficient way to toggle modes */
void draw_object(float *data, int npolys, int smooth) {
    int i;
    glBegin(GL_QUADS);
    for (i = npolys; i > 0; i--) {
        if (smooth) glColor3fv(data);
        glVertex3fv(data + 4);
        if (smooth) glColor3fv(data + 8);
        glVertex3fv(data + 12);
        if (smooth) glColor3fv(data + 16);
        glVertex3fv(data + 20);
        if (smooth) glColor3fv(data + 24);
        glVertex3fv(data + 28);
        data += 32;    /* advance to the next quad */
    }
    glEnd();
}
Even though the program chooses the drawing mode before entering the draw_object() routine, the flag is checked for every vertex in the scene. A simple if test may seem innocuous; however, when done on a per-vertex basis, it can accumulate a noticeable amount of overhead.

Compare the number of instructions in the disassembled code for a call to glColor3fv(), first without, and then with, the if test.

Assembly code for a call without if test (six instructions):

lw a0,32(sp)
lw t9,glColor3fv
addiu a0,a0,32
jalr ra,t9
nop
lw gp,24(sp)
Assembly code for a call with an if test (eight instructions):

lw t7,40(sp)
beql t7,zero,0x78
nop
lw t9,glColor3fv
lw a0,32(sp)
jalr ra,t9
addiu a0,a0,32
lw gp,24(sp)
Notice the two extra instructions required to implement the if test. The extra if test per vertex increases the number of instructions executed for this otherwise optimal code by 33%. These effects may not be visible if the code is used only to render objects that are always graphics limited. However, if the process is CPU-limited, then moving decision operations such as this if test higher up in the program structure improves performance.
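
One way to move the test higher up is to write one specialized routine per mode and select between them once per frame. The sketch below illustrates that structure under stated assumptions: emit_color() and emit_vertex() are hypothetical stand-ins for glColor3fv() and glVertex3fv() that only count calls, so the example compiles without an OpenGL context, and the names draw_quads_smooth, draw_quads_flat, and select_draw are likewise invented for illustration. The surrounding glBegin()/glEnd() pair is omitted for the same reason.

```c
/* Hypothetical stand-ins for glColor3fv()/glVertex3fv(); they only count calls. */
static int color_calls, vertex_calls;
static void emit_color(const float *c)  { (void)c; color_calls++; }
static void emit_vertex(const float *v) { (void)v; vertex_calls++; }

/* Smooth-shaded quads: one color per vertex, no flag test in the loop. */
static void draw_quads_smooth(const float *data, int npolys)
{
    int i;
    for (i = npolys; i > 0; i--) {
        emit_color(data);      emit_vertex(data + 4);
        emit_color(data + 8);  emit_vertex(data + 12);
        emit_color(data + 16); emit_vertex(data + 20);
        emit_color(data + 24); emit_vertex(data + 28);
        data += 32;
    }
}

/* Flat-shaded quads: positions only, again with no flag test. */
static void draw_quads_flat(const float *data, int npolys)
{
    int i;
    for (i = npolys; i > 0; i--) {
        emit_vertex(data + 4);
        emit_vertex(data + 12);
        emit_vertex(data + 20);
        emit_vertex(data + 28);
        data += 32;
    }
}

typedef void (*draw_fn)(const float *, int);

/* The mode is examined once, high in the call tree, not once per vertex. */
static draw_fn select_draw(int smooth)
{
    return smooth ? draw_quads_smooth : draw_quads_flat;
}
```

A caller would obtain the routine once per frame, for example draw_fn draw = select_draw(smooth);, and then call draw(data, npolys) for each object. The cost is some duplicated code, which is the trade the text above recommends for per-vertex paths.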

Preprocessing Drawing Data: Introduction

Putting some extra effort into generating a simpler database makes a significant difference when traversing that data for display. A common tendency is to leave the data in a format that is good for loading or generating the object, but not optimal for actually displaying it. For peak performance, do as much of the work as possible before rendering.

Preprocessing turns a difficult database into a database that is easy to render quickly. This is typically done at initialization or when changing from a modeling to a fast-rendering mode. This section discusses "Preprocessing Meshes Into Fixed-Length Strips" and "Preprocessing Vertex Loops" to illustrate this point.

Preprocessing Meshes Into Fixed-Length Strips

Preprocessing can be used to turn general meshes into fixed-length strips.

The following sample code shows a commonly used, but inefficient, way to write a triangle strip render loop:

float* dataptr;
...
while (!done) switch((int) *dataptr) { /* switch requires an integer operand */
    case BEGINSTRIP:
        glBegin(GL_TRIANGLE_STRIP);
        dataptr++;
        break;
    case ENDSTRIP:
        glEnd();
        dataptr++;
        break;
    case EXIT:
        done = 1;
        break;
    default: /* have a vertex !!! */
        glNormal3fv(dataptr);
        glVertex3fv(dataptr + 4);
        dataptr += 8;
}

This traversal method incurs a significant amount of per-vertex overhead. The loop is evaluated for every vertex and every vertex must also be checked to make sure that it is not a flag. This wastes time and also brings all of the object data through the cache. This practice reduces the performance advantage of using triangle strips. Any variation of this code that has per-vertex overhead is likely to be CPU limited for most types of simple graphics operations.

Preprocessing Vertex Loops

Preprocessing is also possible for vertex loops:

glBegin(GL_TRIANGLE_STRIP);
for (i=num_verts; i > 0; i--) {
    glNormal3fv(dataptr); 
    glVertex3fv(dataptr+4);
    dataptr += 8;
    }
glEnd();

For peak immediate mode performance, precompile strips into specialized primitives of fixed length. Only a few fixed lengths are needed. For example, use strips that consist of 12, 8, and 2 primitives.

Note: The optimal length may vary depending on the hardware the program runs on. For more information, see Chapter 14, "System-Specific Tuning."

These specialized strips are then sorted by size, resulting in the efficient loop shown in this sample code:

/* dump out N 8-triangle strips */
for (i=N; i > 0; i--) {
    glBegin(GL_TRIANGLE_STRIP);
    glNormal3fv(dataptr);
    glVertex3fv(dataptr+4);
    glNormal3fv(dataptr+8);
    glVertex3fv(dataptr+12);
    glNormal3fv(dataptr+16);
    glVertex3fv(dataptr+20);
    glNormal3fv(dataptr+24);
    glVertex3fv(dataptr+28);
    ...
    glEnd();
    dataptr += 64;
}

A mesh of length 12 is about the maximum for unrolling. Unrolling reduces the per-loop overhead, but beyond a certain point it produces no further gain.

Note: Over-unrolling eventually hurts performance by increasing code size and reducing effectiveness of the instruction cache. The degree of unrolling depends on the processor; run some benchmarks to understand the optimal program structure on your system.
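
The preprocessing step that decides how a longer strip is cut into the fixed lengths mentioned above (12, 8, and 2 primitives) can be sketched as a simple greedy split. This is an illustration, not the method of the original text: the struct strip_plan and the function plan_strips are hypothetical names, and the sketch assumes a non-negative triangle count and that "primitives" means triangles. Because every fixed length is an even number of triangles, each piece starts with the same winding as the original strip; adjacent pieces share two vertices, which the preprocessed data must duplicate.

```c
/* Counts of fixed-length strips cut from one longer triangle strip. */
struct strip_plan {
    int n12;       /* strips of 12 triangles (14 vertices) */
    int n8;        /* strips of 8 triangles (10 vertices)  */
    int n2;        /* strips of 2 triangles (4 vertices)   */
    int leftover;  /* triangles that fit none of the sizes (0 or 1) */
};

/* Greedy split: use the longest fixed length first.
 * Assumes triangles >= 0. Division is acceptable here because this
 * runs once during preprocessing, not in the display loop. */
static struct strip_plan plan_strips(int triangles)
{
    struct strip_plan p;
    p.n12 = triangles / 12;
    triangles -= p.n12 * 12;
    p.n8 = triangles / 8;
    triangles -= p.n8 * 8;
    p.n2 = triangles / 2;
    triangles -= p.n2 * 2;
    p.leftover = triangles;
    return p;
}
```

Summing the counts over all strips in the database gives the N for each fixed-length loop. A leftover triangle has to be drawn by some other path, for example as an independent triangle.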