Ideally, an application spends most of its time traversing the database and sending data to the graphics pipeline. Instructions in the display loop are executed many times every frame, creating hot spots. Any extra overhead in a hot spot is greatly magnified by the number of times it is executed.
When using simple, high-performance graphics primitives, the application is even more likely to be CPU limited. The data traversal must be optimized so that it does not become a bottleneck.
During rendering, the sections of code that actually issue graphics commands should be the hot spots in application code. These subroutines should use peak-performance coding methods. Small improvements to a line that is executed for every vertex in a database accumulate to have a noticeable effect when the entire frame is rendered.
The rest of this section looks at examples and techniques for optimizing immediate-mode rendering:
bad: glVertex3fv(&data[i][j][k]);
good: glVertex3fv(dataptr);
bad: glVertex3fv(object->data->vert);
ok: glVertex3fv(dataptr->vert);
best: glVertex3fv(dataptr);
The following code fragment is an example of efficient code to draw a single smooth-shaded, lit polygon from quad-word aligned data. Notice that a single data pointer is used. It is updated once at the end of the polygon, after the glEnd() call.
glBegin(GL_QUADS);
glNormal3fv(ptr);
glVertex3fv(ptr+4);
glNormal3fv(ptr+8);
glVertex3fv(ptr+12);
glNormal3fv(ptr+16);
glVertex3fv(ptr+20);
glNormal3fv(ptr+24);
glVertex3fv(ptr+28);
glEnd();
ptr += 32;
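The layout that this fragment reads can be produced by a preprocessing step. The following is a minimal sketch under stated assumptions, not code from this guide: `pack_quad` and the `FLOATS_PER_*` constants are hypothetical names, and each normal and each vertex is assumed to occupy a four-float slot (three values plus one float of padding), so that every slot starts on a 16-byte boundary whenever the buffer itself does.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FLOATS_PER_SLOT   = 4,                   /* x, y, z plus one pad float */
    FLOATS_PER_VERTEX = 2 * FLOATS_PER_SLOT, /* normal slot + vertex slot  */
    FLOATS_PER_QUAD   = 4 * FLOATS_PER_VERTEX /* matches ptr += 32         */
};

/* Hypothetical helper: interleave four normals and four positions
 * (12 floats each, xyz per vertex) into the layout read by
 * glNormal3fv(ptr), glVertex3fv(ptr+4), and so on. Returns the position
 * at which the next quad should be written. */
float *pack_quad(float *out, const float *normals, const float *verts)
{
    assert(out != NULL && normals != NULL && verts != NULL);
    for (int v = 0; v < 4; v++) {
        float *n = out + v * FLOATS_PER_VERTEX; /* normal slot   */
        float *p = n + FLOATS_PER_SLOT;         /* position slot */
        for (int c = 0; c < 3; c++) {
            n[c] = normals[v * 3 + c];
            p[c] = verts[v * 3 + c];
        }
        n[3] = 0.0f; /* padding, never read by the 3fv calls */
        p[3] = 0.0f;
    }
    return out + FLOATS_PER_QUAD;
}
```

Calling `pack_quad` once per polygon at load time leaves the display loop with nothing to do but the fixed-offset calls and a single pointer update.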
bad:
for(i=0; i < 4; i++){
glColor4ubv(poly_colors[i]);
glVertex3fv(poly_vert_ptr[i]);
}
good:
glColor4ubv(poly_colors[0]);
glVertex3fv(poly_vert_ptr[0]);
glColor4ubv(poly_colors[1]);
glVertex3fv(poly_vert_ptr[1]);
glColor4ubv(poly_colors[2]);
glVertex3fv(poly_vert_ptr[2]);
glColor4ubv(poly_colors[3]);
glVertex3fv(poly_vert_ptr[3]);
bad:
glNormal3fv(*(ptr++)); glVertex3fv(*(ptr++));
or
glNormal3fv(ptr); ptr += 4;
glVertex3fv(ptr); ptr += 4;
good:
glNormal3fv(*(ptr)); glVertex3fv(*(ptr+1)); glNormal3fv(*(ptr+2)); glVertex3fv(*(ptr+3));
or
glNormal3fv(ptr); glVertex3fv(ptr+4); glNormal3fv(ptr+8); glVertex3fv(ptr+12);
Note: On some processors, such as the R8000, loop unrolling may hurt performance more than it helps, so use it with caution. Unrolling too far hurts on any processor because the loop may occupy an excessive portion of the cache. If it occupies a large enough portion, it may interfere with itself: the whole loop may not fit (unlikely), or it may conflict with the instructions of one of the subroutines it calls.
bad:
glNormal3fv(normaldata);
glTexCoord2fv(texdata);
glVertex3fv(vertdata);
good:
glNormal3fv(dataptr);
glTexCoord2fv(dataptr+4);
glVertex3fv(dataptr+8);
bad:
for (i = 0; i < (end-beginning)/size; i++)
{...}
better:
for (i = beginning; i < end; i += size)
{...}
good:
for (i = total; i > 0; i--)
{...}
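The three loop forms above are interchangeable only when they run the same number of passes. The sketch below, with hypothetical function names, counts the passes of each form. It suggests that the forms agree when `end - beginning` is a multiple of `size`, and that the stride form runs one extra, partial pass when it is not.

```c
#include <assert.h>

/* Form 1: the bound, division included, is re-evaluated on every pass
 * unless the compiler manages to hoist it. */
int iterations_divide(int beginning, int end, int size)
{
    int n = 0;
    assert(size > 0);
    for (int i = 0; i < (end - beginning) / size; i++)
        n++;
    return n;
}

/* Form 2: no division, but the exit test compares two variables. */
int iterations_stride(int beginning, int end, int size)
{
    int n = 0;
    assert(size > 0);
    for (int i = beginning; i < end; i += size)
        n++;
    return n;
}

/* Form 3: divide once, then count down so the exit test is a
 * comparison against zero. */
int iterations_count_down(int beginning, int end, int size)
{
    int n = 0;
    assert(size > 0);
    int total = (end - beginning) / size;
    for (int i = total; i > 0; i--)
        n++;
    return n;
}
```

When converting an existing loop to the count-down form, it is worth checking which rounding behavior the original loop relied on.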
f = x * 0.5 instead of f = x / 2.0
Integer division is even slower than floating-point division.
i = j >> 1 instead of i = j/2
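The shift and the division are guaranteed to agree only for non-negative values. The following minimal sketch (function names are hypothetical) checks the unsigned case and records how signed division rounds. Right-shifting a negative signed value is implementation-defined in C, so the sketch deliberately makes no claim about its result.

```c
#include <assert.h>

/* For unsigned operands the shift and the division always agree. */
unsigned half_by_shift(unsigned j)  { return j >> 1; }
unsigned half_by_divide(unsigned j) { return j / 2u; }

/* Signed division truncates toward zero (C99 and later), so -3 / 2
 * is -1. A right shift of a negative value need not produce that. */
int half_signed(int j) { return j / 2; }

/* Self-check over the first `count` unsigned values. */
void verify_halving(unsigned count)
{
    for (unsigned j = 0; j < count; j++)
        assert(half_by_shift(j) == half_by_divide(j));
    assert(half_signed(-3) == -1);
}
```

If `j` can be negative, keeping the division (or making the variable unsigned) avoids a silent change in rounding.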
void drawit(float f, int count)
{
.......
}
val = (float) *f;
Instead, use typecasting of pointers, which occurs at compile time and is efficient:
int *ptr;
*(float *) ptr = float_val;
float_val = *(float *) ptr;
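The pointer cast reinterprets the stored bits instead of converting a value. Compilers that apply type-based alias analysis may not treat `*(float *) ptr` on an `int` object as well defined, so the sketch below swaps in `memcpy`, a different technique from the cast shown above that compilers typically reduce to the same single load or store. The function names are hypothetical, and the sketch assumes that `float` and `int` have the same size.

```c
#include <assert.h>
#include <string.h>

/* Store a float's bit pattern in an int-typed slot without converting it. */
void store_float(int *slot, float value)
{
    assert(sizeof(float) == sizeof(int)); /* assumption of this sketch */
    memcpy(slot, &value, sizeof value);
}

/* Read the bit pattern back as a float, again with no conversion. */
float load_float(const int *slot)
{
    float value;
    memcpy(&value, slot, sizeof value);
    return value;
}
```

A mixed buffer of flags and coordinates can then be declared as `int` and still carry float data without a conversion on each access.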
glBegin(GL_TRIANGLES);
....
..../* many triangles */
....
glEnd();
glBegin(GL_QUADS);
....
..../* many quads */
....
glEnd();
The drawing subroutines should be highly specialized leaves in the program's call tree. Decisions made too far down the tree can be redundant. For example, consider a program that switches back and forth between flat-shaded and smooth-shaded drawing. Once this choice has been made for a frame, the decision is fixed and the flag is set, so testing the flag again inside the drawing routine is redundant. The following code is inefficient:
/* Inefficient way to toggle modes */
void draw_object(float *data, int npolys, int smooth) {
    int i;
    glBegin(GL_QUADS);
    for (i = npolys; i > 0; i--) {
        if (smooth) glColor3fv(data);
        glVertex3fv(data + 4);
        if (smooth) glColor3fv(data + 8);
        glVertex3fv(data + 12);
        if (smooth) glColor3fv(data + 16);
        glVertex3fv(data + 20);
        if (smooth) glColor3fv(data + 24);
        glVertex3fv(data + 28);
        data += 32; /* advance to the next quad */
    }
    glEnd();
}
Even though the program chooses the drawing mode before entering the draw_object() routine, the flag is checked for every vertex in the scene. A simple if test may seem innocuous; however, when done on a per-vertex basis, it can accumulate a noticeable amount of overhead.
Compare the number of instructions in the disassembled code for a call to glColor3fv(), first without, and then with, the if test.
Assembly code for a call without if test (six instructions):
lw a0,32(sp)
lw t9,glColor3fv
addiu a0,a0,32
jalr ra,t9
nop
lw gp,24(sp)
Assembly code for a call with an if test (eight instructions):
lw t7,40(sp)
beql t7,zero,0x78
nop
lw t9,glColor3fv
lw a0,32(sp)
jalr ra,t9
addiu a0,a0,32
lw gp,24(sp)
Notice the two extra instructions required to implement the if test. The extra if test per vertex increases the number of instructions executed for this otherwise optimal code by 33%. These effects may not be visible if the code is used only to render objects that are always graphics limited. However, if the process is CPU-limited, then moving decision operations such as this if test higher up in the program structure improves performance.
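One way to move the test higher up is to provide one specialized routine per mode and choose between them once per object. The following is a minimal sketch, not code from this guide: `emit_color` and `emit_vertex` are hypothetical stand-ins for glColor3fv() and glVertex3fv() that only count calls, so the example runs without a GL context. In real code, glBegin(GL_QUADS) and glEnd() would bracket each loop.

```c
#include <assert.h>

/* Call counters used by the stand-in functions below. */
int color_calls, vertex_calls;

static void emit_color(const float *c)  { (void)c; color_calls++; }
static void emit_vertex(const float *v) { (void)v; vertex_calls++; }

/* Flat-shaded quads: no per-vertex test and no color calls. */
static void draw_object_flat(const float *data, int npolys)
{
    for (int i = npolys; i > 0; i--) {
        emit_vertex(data + 4);
        emit_vertex(data + 12);
        emit_vertex(data + 20);
        emit_vertex(data + 28);
        data += 32;
    }
}

/* Smooth-shaded quads: a color for every vertex, still no test. */
static void draw_object_smooth(const float *data, int npolys)
{
    for (int i = npolys; i > 0; i--) {
        emit_color(data);      emit_vertex(data + 4);
        emit_color(data + 8);  emit_vertex(data + 12);
        emit_color(data + 16); emit_vertex(data + 20);
        emit_color(data + 24); emit_vertex(data + 28);
        data += 32;
    }
}

/* The mode is tested once per object rather than once per vertex. */
void draw_object(const float *data, int npolys, int smooth)
{
    assert(data != 0 && npolys >= 0);
    if (smooth)
        draw_object_smooth(data, npolys);
    else
        draw_object_flat(data, npolys);
}
```

The cost is some duplicated code. The test could be hoisted further still, for example by selecting a function pointer once per frame.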
Preprocessing turns a difficult database into a database that is easy to render quickly. This is typically done at initialization or when changing from a modeling to a fast-rendering mode. This section discusses "Preprocessing Meshes Into Fixed-Length Strips" and "Preprocessing Vertex Loops" to illustrate this point.
The following sample code shows a commonly used, but inefficient, way to write a triangle strip render loop:
float* dataptr;
...
while (!done) switch(*dataptr) {
    case BEGINSTRIP:
        glBegin(GL_TRIANGLE_STRIP);
        dataptr++;
        break;
    case ENDSTRIP:
        glEnd();
        dataptr++;
        break;
    case EXIT:
        done = 1;
        break;
    default: /* have a vertex !!! */
        glNormal3fv(dataptr);
        glVertex3fv(dataptr + 4);
        dataptr += 8;
}
This traversal method incurs a significant amount of per-vertex overhead. The loop is evaluated for every vertex and every vertex must also be checked to make sure that it is not a flag. This wastes time and also brings all of the object data through the cache. This practice reduces the performance advantage of using triangle strips. Any variation of this code that has per-vertex overhead is likely to be CPU limited for most types of simple graphics operations.
glBegin(GL_TRIANGLE_STRIP);
for (i=num_verts; i > 0; i--) {
    glNormal3fv(dataptr);
    glVertex3fv(dataptr+4);
    dataptr += 8;
}
glEnd();
For peak immediate mode performance, precompile strips into specialized primitives of fixed length. Only a few fixed lengths are needed. For example, use strips that consist of 12, 8, and 2 primitives.
Note: The optimal length may vary depending on the hardware the program runs on. For more information, see Chapter 14, "System-Specific Tuning."
These specialized strips are then sorted by size, resulting in the efficient loop shown in this sample code:
/* dump out N 8-triangle strips */
for (i=N; i > 0; i--) {
    glBegin(GL_TRIANGLE_STRIP);
    glNormal3fv(dataptr);
    glVertex3fv(dataptr+4);
    glNormal3fv(dataptr+8);
    glVertex3fv(dataptr+12);
    glNormal3fv(dataptr+16);
    glVertex3fv(dataptr+20);
    glNormal3fv(dataptr+24);
    glVertex3fv(dataptr+28);
    ...
    glEnd();
    dataptr += 64;
}
A mesh of length 12 is about the maximum for unrolling. Unrolling helps to reduce the per-loop overhead, but after a point, it produces no further gain.
Note: Over-unrolling eventually hurts performance by increasing code size and reducing effectiveness of the instruction cache. The degree of unrolling depends on the processor; run some benchmarks to understand the optimal program structure on your system.
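The preprocessing step that chooses fixed lengths can be very simple. The sketch below is an illustration under stated assumptions rather than the guide's method: `split_strip` and `strip_lengths` are hypothetical names, the lengths 12, 8, and 2 are taken to count triangles, and a single leftover triangle is assumed to be drawn some other way (for example, as an independent triangle).

```c
#include <assert.h>

/* Fixed strip lengths in triangles, largest first. */
static const int strip_lengths[3] = { 12, 8, 2 };

/* Decide how many strips of each fixed length cover a strip of ntris
 * triangles, taking the longest pieces first. counts[k] receives the
 * number of pieces of length strip_lengths[k]. Returns the number of
 * triangles left over (0 or 1) that need separate handling. */
int split_strip(int ntris, int counts[3])
{
    assert(ntris >= 0);
    for (int k = 0; k < 3; k++) {
        counts[k] = ntris / strip_lengths[k];
        ntris %= strip_lengths[k];
    }
    return ntris;
}
```

Because this runs once at initialization, the divisions cost nothing at display time. Summing the counts over all strips gives the `N` for each specialized loop.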