Tips for Improving Time-Critical Code


Writing fast, time-critical code requires that you carefully analyze all aspects of your code and how it interacts with the system. This section will suggest alternatives to some of the more obvious coding techniques when fast execution times are required.

One of the key factors in writing fast code is to "know your code." If you are using a library, make sure you have the source code, or a thorough understanding of how the library behaves in performance-critical paths, so that you can make informed decisions.

In addition to the suggestions in this section, tools such as a profiler and the Windows NT Performance Monitor (perfmon.exe) can help you analyze and maximize your code's performance.

This section offers suggestions to ensure that the time-critical portions of your code perform satisfactorily.

Summary of Improving Time-Critical Sections of Code

The following areas, each covered in its own section below, summarize where you should concentrate your efforts to make your code run as efficiently as possible:

- Cache hits and page faults
- Sorting and searching
- MFC and class libraries
- Shared libraries
- Heaps
- Threads
- Small working set

Cache Hits and Page Faults

Cache misses, in both the internal and external cache, as well as page faults (going to secondary storage for program instructions and data) slow the performance of a program.

It is estimated that a CPU (internal) cache miss costs 10–20 clock cycles and an external cache miss costs 20–40 clock cycles. A page fault costs on the order of 1 million instructions. It is therefore in the best interest of program execution to write code that reduces the number of cache misses and page faults.

One reason for slow programs is that they take more page faults or miss the cache more often than necessary. To avoid this, use data structures with good locality of reference, which means keeping related things close together in memory. Sometimes a data structure that looks great turns out to be horrible because of poor locality of reference, and sometimes the reverse is true.

Sorting and Searching

Sorting is an inherently time-consuming operation (compared to other typical operations you might have to perform). The best way to speed up code that is slow because of sorting is to find a way to avoid sorting at all.

You might be able to build the list in sorted order. But this might slow you down overall because inserting each new element in order could require a more complicated data structure with possibly worse locality of reference. There are no easy answers; try several approaches and measure the differences.

If you find that you must sort, choose an algorithm suited to your data, keep locality of reference in mind, and measure the result on realistic input.

There are fewer ways to speed up searching than sorting. If a search is time-critical, a binary search or hash table lookup is almost always the right answer. But again, keep locality in mind: a linear search through a small array can be faster than a binary search through a data structure with many pointers, which can cause page faults or cache misses.

MFC and Class Libraries

The Microsoft Foundation Class library (MFC) provides a framework that can greatly simplify writing code. Other class libraries are available for the same purpose. The following discussion is specific to MFC but is valid for similar idioms in other libraries.

When writing time-critical code, be aware of the overhead implied by some of the classes. Examine the MFC code that your time-critical section uses to see whether it meets your performance requirements.

Shared Libraries

Code reuse is desirable. However, if you are going to use someone else's code, you should make sure you know exactly what it does in those cases where performance is critical to you. The best way to understand this is by stepping through the source code or by measuring with tools such as the profiler, PView, or the Windows NT Performance Monitor.

Heaps

Use multiple heaps with discretion. Additional heaps, created with HeapCreate and allocated from with HeapAlloc, let you manage and then dispose of a related set of allocations together. Don't commit more memory than you need: if you use multiple heaps, pay special attention to the amount of memory that is initially committed in each.

Instead of multiple heaps, you can use helper functions that interface between your code and the default heap. Helper functions can implement custom allocation strategies that improve the performance of your application. For example, if you perform many small allocations frequently, you may want to localize them to one part of the default heap: allocate one large block of memory, then use a helper function to suballocate from that block. Because the memory comes out of the default heap, you avoid leaving additional heaps with unused committed memory.

On the other hand, using the default heap might get you less locality of reference than you desire. You really have to measure the effects of moving objects from heap to heap with tools such as the profiler, PView, Spy++, or the Windows NT Performance Monitor.

Measure your heaps so you can account for every allocation. Use the C run-time debug heap routines to checkpoint and dump your heap; you can then read the output into Excel and use pivot tables to view the results. Note the total number and size of allocations as well as their distribution. You may want to compare these with the size of your working set, and you may also be interested in the clustering of similarly sized objects.

You can also use the Windows NT performance counters to monitor memory usage.

Threads

When you need a background task, threads may not be your best solution; handling the work during idle processing is often better. It's easier to understand the locality of reference of a program with one thread than with two or more.

A good rule of thumb is that you should only use a thread if an operating system notification that you need to block on is at the root of the background work. Threads are best in such a case because it is not practical to block a main thread on an event.

Threads also present communication problems. You must manage the communication link between your threads, whether with a list of messages or by allocating and using shared memory. Managing the link will likely require synchronization to avoid race conditions and deadlocks. Such complexity can easily turn into bugs and performance problems.

For additional information, see Idle Loop Processing and Multithreading Topics.

Small Working Set

Smaller working sets mean fewer page faults and more cache hits, which translates directly into performance gains. We've mentioned locality of reference throughout this document; a small working set is a sign of good locality of reference. The process working set is the closest metric the operating system provides for telling you whether you have good or bad locality of reference.

See SetProcessWorkingSetSize for information on how to set the upper and lower limits of the working set, and GetProcessWorkingSetSize for how to obtain the minimum and maximum working set sizes. You can use Spy++ or PView to view the size of the working set.