
Performance Issues: Memory Usage

When the G4 arrived, the data processing capacity of a PowerPC based machine immediately jumped 4, 8 or 16 fold. The CPU system bus, however, saw no comparable jump in throughput until the G5 was introduced. As a result, the AltiVec developer must be much more sensitive to how and when she touches memory, because whether the data she needs lies in the caches or in RAM may exert a profound influence on the speed of her function. Let us look at a sample function, vec_add(), to better illustrate this issue.

Vec_add(), part of the AltiVec C Programming Model, adds two 16 byte vectors together. It can consume 32 bytes of data per CPU cycle. On a 400 MHz G4, this translates to (32 bytes * 400 MHz =) 12.8 gigabytes of data per second. The bus on a G4/400 runs at 100 MHz and can deliver 8 bytes per bus cycle, so at top speed it can supply (8 bytes * 100 MHz =) 800 megabytes of data per second. A quick comparison of these two numbers reveals that a function that does nothing more than apply vec_add() to two data streams may spend up to 93% of its time waiting for data to appear!
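
To make the discussion concrete, here is a minimal sketch of such a memory-bound loop. The function name and argument layout are illustrative only; the point is that each iteration does nothing but load 32 bytes, add, and store 16 bytes.

    #include <altivec.h>

    /* Minimal sketch of a purely memory-bound AltiVec loop: one vec_add
       per 32 bytes loaded. Names are illustrative, not from the text. */
    void add_arrays( vector float *c, const vector float *a,
                     const vector float *b, int n_vectors )
    {
        int i;
        for( i = 0; i < n_vectors; i++ )
            c[i] = vec_add( a[i], b[i] );   /* reads 32 bytes, writes 16 */
    }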

With a 2.0 GHz G5, vec_add() can consume (32 bytes * 2000 MHz =) 64 gigabytes of data per second and produce another 32 gigabytes per second of results at peak throughput. The peak bandwidth of the 1 GHz front side bus on a G5 is 3.2 GB/s in each direction (read and write). In the read direction the ratio is therefore 20:1, meaning that with everything operating at peak throughput, the code may be spending 95% of its time waiting for data to appear.

Note: This is not an extreme example. It is possible to consume 1.5 times that much data per cycle in the vector unit. If one considers that data must also be written back out, then that number rises to 2 times. (Though on G5, writing to memory does not contend for bandwidth with reading memory.) Nor are functions like this uncommon. You can find memory bound functions everywhere, from the C standard library to linear algebra packages to audio and video. Note that the full throughput of 12.8 gigabytes per second cannot be reached even when the data is in the L1 cache, because the load/store unit can only achieve a throughput of one vector load per cycle. Thus, 6.4 GB/s is a more realistic top speed for read-only operations on G4. Nonetheless it reveals a substantial peak performance disparity between bus and CPU. On G5, two load/store units can operate in parallel and up to 4 non-branch instructions can issue per cycle, so it is realistic to imagine code that does two loads and one add every cycle. You will not be able to issue two loads, an add and a store every cycle, however, because that would require three load/store units.

In general, the average developer will find that her code comes in two varieties. There will be a set that is compute bound -- the speed of the CPU is the limiting factor determining the speed of the function. There will also be a (probably larger) set that is I/O bound -- when executing these functions, the CPU spends much or most of its time waiting for data to appear. Code micro-optimization can be helpful for compute bound code, but is generally a waste of time for I/O bound code. The best optimization strategy for I/O bound functions is to pay careful attention to how and when you use memory.

 

Optimize Memory Usage First, Micro-Optimize Code Second

AltiVec is not about a 20 percent improvement. It is about 500 or 1000 percent. When 1500% or 2000% is potentially available, as in the addition example above, you need to pay attention to it. This means paying strict attention to when and how you touch data. The only optimization strategy likely to give more improvement than that is careful selection of the right algorithm to save work, so by all means do that first. (For example, there is no reason to swap the elements of two large arrays when all you need to do is swap the pointers to the arrays.) If you have two good algorithms, choose the one that makes better use of memory.
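
To illustrate the parenthetical example, a pointer swap moves a handful of bytes no matter how large the arrays are. The sketch below is hypothetical; the names are not from the original text.

    /* Swap two large buffers by exchanging pointers rather than elements.
       A few bytes move instead of the entire arrays. */
    void swap_buffers( float **a, float **b )
    {
        float *tmp = *a;
        *a = *b;
        *b = tmp;
    }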

Once your algorithm has been chosen, try to figure out how to optimize its usage of memory. One approach is simply to reduce the amount of memory used. Another is to structure your program to do more work on the data while you have it loaded into the caches. Do both, if you can.

 

Doing More with the Data

As we have seen, very simple functions can spend most or nearly all of their time waiting for data to load from RAM. If your data is in the caches, then you are lucky. The time to access this data is much shorter. However, data only appears in the caches if you have loaded it from RAM at some earlier time. Thus, at some point you had to load that data from RAM the slow way.

Question:

When you loaded that data from RAM, did you stand around and wait for the data to appear, or were you busy working during that time?

Answer:

If you didn't prefetch the data, you were probably standing around wasting cycles.

Data can be prefetched into the caches using cache hints such as DSTs or dcbt. Fetching data into the caches in this manner happens completely asynchronously, meaning that while the data is loading your program can be doing other things. This is generally not true if you do not prefetch. If a load instruction takes too long to complete, it will stall the processor until the data arrives. This happens because the processor uses a first-in first-out instruction completion queue. In order to start executing, an instruction must acquire a slot at the end of the queue, and it can only complete when it reaches the head of the queue. If the instruction ahead of it isn't done, then it and every instruction behind it sit in the queue. In such circumstances, the queue quickly fills, preventing any other instructions from entering execution and effectively stalling the processor. Thus, even instructions that do not need the data being loaded will stall until the data arrives if they started execution after the load. Since loads from DRAM can be slow, this may mean tens or hundreds of cycles during which nothing is happening.

The solution to this problem is to fetch the data into the caches ahead of time so that the load takes just a few cycles to complete and the instruction completion queue doesn't back up. As long as you are able to keep pre-fetching data into the caches ahead of time, your function will not stall and will gallop along at top speed.
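
As a rough illustration of what issuing such a prefetch looks like in source, the sketch below starts a data stream a little ahead of the working loop with vec_dst(). The DST_CTRL macro, the function name, and the 256 byte stream size are assumptions made here for illustration; consult your compiler's documentation for the exact control word encoding.

    #include <altivec.h>

    /* Pack block size (in 16-byte vectors), block count and stride (bytes)
       into a DST control word. This layout is the commonly documented one,
       but verify it against your own headers. */
    #define DST_CTRL( size, count, stride ) \
        ( (((size) & 0x1F) << 24) | (((count) & 0xFF) << 16) | ((stride) & 0xFFFF) )

    void process( const vector float *src, int n_vectors )
    {
        int i;
        for( i = 0; i < n_vectors; i += 16 )
        {
            /* hint that the next 256 bytes (16 vectors) will be needed soon */
            vec_dst( src + i, DST_CTRL( 16, 1, 0 ), 0 );

            /* ... work on src[i] through src[i+15] here ... */
        }
        vec_dss( 0 );   /* stop the stream when the loop is done */
    }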

Here we have performance results from a test program that allows us to vary how much time is spent processing data between attempts to load individual cachelines. We vary it all the way from a function that essentially does nothing but read data as fast as it can, to a function that reads data, then works on the data for 200 cycles before reading the next cacheline. This is repeated with and without data prefetch, and with or without storing the results back to the same place the data was read from. Adding a store operation in effect doubles the amount of activity on the memory bus -- the data has to be stored back out to RAM eventually to make room for new data being read in -- without significantly changing the architecture of the function. This allows us to confirm which changes in the curves arise from bus traffic vs. other factors.

For the red and blue lines, the data is then written back to memory; the orange and green lines don't do the write back. The tests were run on MacOS X.1 on a stock G4/866 with a PPC 7450 on a 133 MHz system bus, using data sets much larger than the L3 cache.

First of all, it should be clear that using cache hints helps throughput. The red and orange lines use vec_dst() to maintain an active prefetch over the 256 bytes needed most immediately at all times, using the overlap method suggested in the Caches section. The green and blue lines do not prefetch data. Performance is up to 4 times better for the cases that use cache hints to pre-fetch data, though 20-50% is more typical.


Though we said earlier that prefetch hints do not make the bus go any faster, they do help saturate the bus to make sure that it is always working. Without prefetch hints, there is always a delay between one set of loads and the next because we have to wait until we finish the calculation on the last cacheline before the function loops and is able to issue the next load. Thus, without prefetch hints, loading data through the system bus and execution can be thought to happen in series. With prefetch hints they happen more in parallel.

We can see that with prefetching, code performs equally well for very complex calculations -- those that do more than 50 cycles of work per cacheline -- regardless of how much data is travelling over the bus. This slope simply represents the maximum speed the CPU can do the calculation. Above 50 cycles of work per cacheline loaded, the function is compute-bound. Below about 50 cycles, the red prefetch + store case plateaus. This is the point at which the calculation consumes data faster than the bus can provide it -- i.e. it becomes I/O bound. We see a similar thing happen for the orange line (prefetch + no store) at about 25 cycles. Because we are not doing any stores in the orange case, we are able to use all of the bus bandwidth for loads, rather than just half. This allows the orange line to achieve up to twice the throughput before bandwidth becomes a problem.

Finally, take a look at the dotted line added to the plot that is collinear with the orange and red cases. This shows what happens when we extrapolate back to a workload of zero -- an indication of the amount of overhead in the function. By inspection, it seems to cross the Y axis at about 3-5 cycles of work. This is consistent with loop overhead and a few cycles for a pair of L1 cache hit vector loads. From this we can draw a very interesting conclusion: when a function is compute bound, the overhead for loading the data into the cache from RAM is near zero -- if and only if you prefetch the data.

That is a very important observation. Expanding that out to its logical conclusion, it leads to the following principles (valid when you prefetch data):

If the speed of a function is limited by the speed of the CPU:

  • It runs at top speed, as if all the data were in an infinitely large L1 cache.
  • It performs equally well regardless of where the data actually comes from.
  • Since the data can be accessed from DRAM without performance penalty, transient cache instructions and/or LRU loads and stores will help protect the contents of the caches, making nearby functions faster.

If the speed of a function is limited by the speed of the System bus:

  • The function will run at the same speed regardless of code micro-optimization.
  • Any additional work you do in the function is free so long as the additional work does not create more bus traffic and the function doesn't become CPU bound.
  • Do more work in these functions so you don't have to do it later.

It should be clear then that the fastest AltiVec functions are those that do more. 50 cycles is a HUGE amount of time to fill with calculations on 32 bytes. Most AltiVec functions do not do that much work. (There just isn't enough to do!) This means that any more work you can find to do on data loaded into your function can probably be done for free. Look for ways to combine many small functions into one larger function that does everything in a single pass. You may also be able to afford a more costly calculation that gives better results for near zero net cost.
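
As a hypothetical illustration of combining passes, the sketch below folds what might otherwise be two separate loops -- a scale pass and an offset pass -- into a single pass that uses vec_madd() while the data is already in registers. Names are illustrative.

    #include <altivec.h>

    /* One fused pass instead of two: each vector is scaled and offset
       in a single vec_madd while it is loaded. */
    void scale_and_offset( vector float *data, int n_vectors,
                           vector float scale, vector float offset )
    {
        int i;
        for( i = 0; i < n_vectors; i++ )
            data[i] = vec_madd( data[i], scale, offset );
    }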

If it is likely that your data is not to be found in the caches -- perhaps because it is too large -- then micro-optimization of that code is also likely to be unproductive. The real bottleneck is loading the data. This is why this section begins with a title that urges you to optimize for best memory usage before doing any micro-optimization.

Please note that this discussion concerns those functions that touch memory only. Small AltiVec functions that pass data in and out by value in registers won't have these problems. Also, do not let this discussion cause you to be afraid to use the caches if you need them. Some complex functions are not able to fit everything in registers, and so will need to use a little bit of space to temporarily hold data. This can be a good strategy. The caches are very fast. Use them when you need to.

Finally, please note that the point at which the prefetch accelerated functions plateau is machine dependent. The same red prefetch + stores function running on a G4/400 may plateau at 35 cycles rather than 50. The G4/400 enjoys a smaller bus speed multiplier between system bus and CPU clock. If clock speed ratios continue to grow, it seems likely that it will require increasingly more complex functions to be compute bound.

 

Reducing Memory Usage

There are many different methods to reduce the amount of memory you use. The simplest example would be to just pick algorithms that use less data and storage formats that use less space.

AltiVec-friendly storage formats
You can often reduce the number of cachelines that have to be loaded by adjusting your storage formats so that data used together are stored together. If you are working with large arrays of data, it may not be strictly important to have the data stored in adjacent bytes. What is most important is to maximize the data content of each cacheline that you load, since it is the process of loading cachelines into the L1 cache from RAM (and storing cachelines back out to RAM) that is the performance bottleneck. If you rearrange your data so that there isn't any "unused" space in each cacheline, then you will likely only have to load the minimum number of cachelines required for your function to do its job. So, for example, if your functions frequently touch some fields in a struct and not others, then you may benefit from storing your data as one struct that contains many arrays, rather than an array of structs. Remember that in AltiVec you will be manipulating many data elements at once, so optimizing for the case where you only manipulate a single element is not helpful. If you are using AltiVec and you touch the 32nd element of an array, it is exceptionally likely that you will also touch the 33rd, 34th, 35th and so on. Converting interleaved data formats to planar data formats will also give you uniform vectors, which may enhance code performance and simplicity significantly.
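
The struct-of-arrays suggestion might look something like the sketch below. The type and field names are hypothetical; the point is that a function touching only x now loads cachelines that are full of x values rather than one-quarter x values.

    /* Interleaved (array of structs): each cacheline holds a mix of fields. */
    typedef struct { float x, y, z, w; } ParticleAoS;

    /* Planar (struct of arrays): each field is contiguous, so it loads as
       a stream of uniform vectors with no wasted cacheline space. */
    typedef struct
    {
        float x[1024];
        float y[1024];
        float z[1024];
        float w[1024];
    } ParticlesSoA;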

It is also important that your data be arranged so that your work is on adjacent data and not data scattered throughout the array. Scattered loads and stores make very poor use of the caches and tend to be quite slow. They also do not offer much parallelism. Try to avoid algorithms that do a lot of scattered access to memory. Sometimes you will find that sparse calculations are beaten by brute force ones applied to the entire data set, even if for many individual data the brute force calculation does nothing.

Memory Consuming "Optimizations"
Many classic programming techniques were developed when CPU cycles were scarce and memory access was cheap. These trade memory for CPU cycles. You may wish to review your use of them to determine if they are really all that helpful. With increasing bus speed multipliers, such strategies become increasingly likely to be unsuccessful. One such strategy is precalculating and storing a frequently used value -- for example, storing both x and 1/x in a struct. See if that actually helped your code performance. If it didn't, take it out.

A lookup table is another such strategy. We earlier demonstrated that you can complete at least 16 vec_adds, and perhaps as many as 50, in the time it takes to load one cacheline from RAM. Thus, it should not be surprising that a brute force calculation is faster than a data lookup in many cases. Brute force methods are also more friendly to other parts of your application because they do not load lookup tables, which may displace other important data from the caches. Finally, lookup tables are not usually parallelizable using AltiVec, so when you use a lookup table you will in effect be installing a scalar code bottleneck in front of your high throughput vectorized function. Exceptions are those lookup tables that are able to load all 4, 8 or 16 lookups with a single load, or those lookup tables implemented using vec_perm(). In one example, vector code that converts unsigned char data to float and then applies a 9th order polynomial to it is still marginally faster than hand tuned scalar code that does a lookup into a 256 entry lookup table containing floats. Part of the reason for this is superscalarism. A lookup table does all its work with the LSU. The vector replacement does some work with the LSU and some with the vector ALU, meaning that more work can be done in parallel. A great many functions can be fit quite well by a 9th order polynomial.
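
For reference, a brute force polynomial is just a chain of vec_madd operations (Horner's rule). The sketch below shows a low-order version with placeholder coefficients; the 9th order case mentioned above simply has more terms.

    #include <altivec.h>

    /* Evaluate ((c3*x + c2)*x + c1)*x + c0 for four floats at once.
       Coefficient names and values are placeholders for whatever fit
       replaces your lookup table. */
    vector float poly_eval( vector float x,
                            vector float c3, vector float c2,
                            vector float c1, vector float c0 )
    {
        vector float r = c3;
        r = vec_madd( r, x, c2 );
        r = vec_madd( r, x, c1 );
        r = vec_madd( r, x, c0 );
        return r;
    }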

Code Counts Too
Data is not the only thing that has to be loaded through the system bus. Instructions must be too. While the L1 cache has a dedicated area for instructions, it is only 32 kB in size. Excess code and data must compete for the same space in the L2 and L3 caches. This makes it important to keep your code from becoming unnecessarily large, especially code that is rarely executed. Uncached code can be measured to execute at a rate of one instruction per five cycles. (This may vary by hardware.) Cached code can execute at a rate of up to three instructions per cycle. This means that size-increasing code optimizations that look good in a benchmark loop may be really lousy in a large application. This is a good reason not to optimize code that is rarely called. Loops are the exception; they get great temporal locality.

Globals and Constants
Globals and Constants can also be cycle and bandwidth wasters. This is especially true of globals. When you use a global in a tight loop, the compiler can't be sure that some other thread isn't changing that global. As a result, the global is freshly loaded every time through the loop! If you just want the compiler to load it once, declare a stack variable and copy the global into it. Use the stack variable instead of the global in your function.
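
In code, the hoisting advice looks something like this sketch; gGain and the function name are hypothetical.

    /* A global the compiler must assume can change at any time. */
    extern float gGain;

    void apply_gain( float *samples, int n )
    {
        const float gain = gGain;   /* load the global exactly once */
        int i;
        for( i = 0; i < n; i++ )
            samples[i] *= gain;     /* 'gain' can now live in a register */
    }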

Constants also frequently trigger a load from storage if the compiler can't figure out how to generate the constant quickly. If you can figure out how to synthesize the constant in four or fewer cycles without triggering a load, in general you will be better off doing so. If you can remove all of the loaded constants from your function, some compilers may be able to save a few cycles by doing an abbreviated stack frame creation. In one example, we found that this as much as doubled the speed of a small function. Every use of vec_perm means that you have to create a constant, and these are usually fairly difficult to synthesize. Sometimes you can save a constant by figuring out how to do the permute using vec_merge or vec_pack instead.
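
As one example of in-register synthesis (an illustration, not from the original text), a vector of 0x80 bytes can be built from two cheap immediate-form instructions instead of being loaded from memory:

    #include <altivec.h>

    /* Build a splat of 0x80 in registers: splat 1, splat 7, shift left. */
    vector unsigned char sign_bit_mask( void )
    {
        return vec_sl( vec_splat_u8(1), vec_splat_u8(7) );   /* 1 << 7 = 0x80 */
    }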

Stores Can Cause Unnecessary Bus Activity
With a few exceptions, a cacheline must be in the L1 cache for a store to complete. ("Complete" is used here in the sense of being discharged from the Completed Store Queue. The actual store instruction may complete much sooner than that.) If it is not in the cache, you will incur additional bandwidth cost to load it in. Once it is loaded, the appropriate bytes will be modified and the cacheline will be labelled modified. If you intend to overwrite an entire cacheline (for example by two adjacent vector stores) then loading in the data is a waste of bandwidth, since all of the data will be changed by the store operations. In some cases, the processor will sense when this occurs and not load the data, but that is not guaranteed to happen.

You can introduce a cacheline into the cache without loading data over the bus if you use the dcbz instruction to zero a cacheline. Some significant performance improvement can be found on some hardware if you zero cachelines before writing to them. Care must be exercised to make sure that you do not zero data that you do need. (Alignment mistakes have the unfortunate tendency to zero the 16 bytes immediately preceding a heap block, for example, which is where the heap data is stored.) In addition, it is not helpful when bandwidth is not the problem, or for data marked cache inhibited, such as pixels residing in VRAM. In the latter case, an exception is thrown and the routine runs considerably slower.
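
From C, dcbz is typically issued with a compiler builtin or inline assembly; a hedged inline-assembly sketch is shown below. The helper name is hypothetical, and it only makes sense when the whole (32-byte, on G4) line will be overwritten and the memory is cacheable.

    /* Zero the cacheline containing p. Only safe when every byte of the
       line will be overwritten anyway and the memory is not cache
       inhibited (e.g. not VRAM). */
    static inline void zero_cacheline( void *p )
    {
        __asm__ __volatile__ ( "dcbz 0, %0" : : "r" (p) : "memory" );
    }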

Memory Usage and the Caches

Once you have paid to move data into the processor caches, it pays to know how much space is in there so you have some idea how long you can expect the data to be able to remain there before something else displaces it. The cache sizes are as follows:

              L1      L2          L3
750           32k     0.25-1 MB   --
7400          32k     0.5-1 MB    --
7450          32k     256k        0-2 MB
970           32k     512k        --
In general, a least common denominator approach works best, choosing to try to live in 32k L1 or 256k L2.

However, knowing the physical size of the caches is not always the complete answer. If a cache is 512kB, can you use up all of the 512 kB from a single linear 512kB buffer and expect to stay in the cache? Currently, no. Why? Long story:

PowerPC caches are set associative. When you map an arbitrary cacheline from an N gigabyte memory space into a cache that is orders of magnitude smaller, multiple different pieces of memory have to map to the same spot in the cache. This can set up pathological situations where you bounce back and forth between two or more chunks of memory that compete for the same spot in the cache. As you touch each one, it flushes another one out. For N data accesses, you may take N cache misses.

To solve this problem, cachelines are organized in sets. This means that the N gigabyte memory space maps into even fewer bins in the cache (meaning more probability of overlap), but each bin can hold many cachelines concurrently, so you can work with many cachelines that map to the same place in the cache at once. Since code tends to bounce back and forth between only a small number of competing chunks of memory, this is usually a performance win. When a cacheline is added to the cache, some other cacheline in the set is pushed out. Which one is thrown out is usually decided based on least recently used (or a pseudo least recently used algorithm). A cache organized into sets that can each contain 8 cachelines is described as 8-way set associative.
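
To put numbers on that (a worked example with illustrative parameters, not figures from the text): a 512 kB, 8-way cache with 128-byte lines holds 4096 lines in 512 sets, so 9 address bits select the set.

    /* Hypothetical 512 kB, 8-way set associative cache with 128-byte lines:
       512 kB / 128 B = 4096 lines, 4096 lines / 8 ways = 512 sets. */
    #define LINE_SIZE  128u
    #define NUM_SETS   512u

    unsigned int set_index( unsigned long physical_address )
    {
        return (unsigned int) ( (physical_address / LINE_SIZE) % NUM_SETS );
    }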

Caches are also physically addressed. This means that the address used to determine which set you land in is based on the physical address in RAM where the data lives, not the virtual address that you, the programmer, see in the debugger. There are virtual and physical addresses for the same byte? Yes. You use the virtual one. The hardware uses the physical one. Every load, store or prefetch hint goes through an address translation process in the load/store unit hardware to translate the virtual address you provide into the actual physical address of the RAM in hardware. This indirection makes possible features like virtual memory, memory mapping data on disk, protected memory, etc. The key thing to note here is that your data is likely mapping into a different set of cachelines than you might have expected based on the virtual address.

How do these two design elements relate? A 512k, cacheline aligned segment of physically contiguous data should always fit in a 512k cache, because the cachelines falling into each set will be evenly distributed: there will be 8 in each. A 512k stretch of virtually contiguous data, like what you would get from malloc, is unlikely to fit in a 512k cache, because the physical pages it maps to are unlikely to be evenly distributed. Some of the sets may end up with more than eight cachelines and spill out of the cache, while other sets may only get a few. Assuming your range of contiguous virtual pages has a random physical mapping, then statistically speaking, you are only very likely to fit in the cache if your data size is somewhere in the range of 1/2 to 2/3 of the size of the cache. Since, as demonstrated by our back-of-the-envelope calculation at the top of this page, code taking cache misses can be up to 20 times slower than code operating out of the level 1 cache, it doesn't take very many misses to severely degrade performance.

If you have ever noticed that your benchmark timings are highly inconsistent from run to run for certain data sizes near the size of the cache, that is likely why. If you want consistently fast performance on current operating systems (MacOS X.3, Panther, and earlier) it is best to keep your data sets or data tiles well under the size of the critical level of cache that is limiting your performance. This also speaks to the importance of running your benchmark applications multiple times (or allocating your buffers multiple times inside the benchmark loop) to look for page mapping associated performance variability. Speed differences between one test and another might not come from the change in the code that you made, as you had assumed; they might be because the page mapping of your buffers was different from test to test. Once your test data set is much larger or much smaller than the cache size, benchmark times will usually become much more reproducible.

Additional Notes Regarding DCB*

With the introduction of the G5, several things have changed. First and foremost, the cacheline size is now 128 bytes on a G5, versus 32 bytes on a G4. This means you must be careful with cache block instructions that change memory, such as dcbz. The old form of dcbz will continue to zero only 32 bytes of memory. It has no performance advantage on G5 because it does not operate on full cachelines, and it may result in a performance degradation. There is a newer dcbz form, with an extra bit set in the opcode, that means "zero the entire cacheline, whatever its size". This may provide a performance improvement on G5 in some situations. Care should be taken in its use, however, since it zeros differently aligned, differently sized chunks of memory on G4 and G5, and if the cacheline size should change again in the future, a third alignment and data size possibility may be introduced.
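
This full-line form is conventionally written with the dcbzl mnemonic. The inline-assembly sketch below is an assumption on our part -- verify the mnemonic and any assembler flags needed to accept it against your toolchain before relying on it.

    /* Zero the full cacheline containing p, whatever its size
       (128 bytes on G5). Check that your assembler accepts dcbzl. */
    static inline void zero_full_cacheline( void *p )
    {
        __asm__ __volatile__ ( "dcbzl 0, %0" : : "r" (p) : "memory" );
    }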

DCBA is no longer legal on a G5. The kernel traps the exception and emulates the instruction as a no-op. This is expensive. Its use in AltiVec code should be avoided.

DST instructions are serializing on G5, and if used inside an inner loop they may cause a multi-cycle stall. This is especially true if you are using multiple DSTs in concert. Since the new cacheline size on the G5 is similar to the most common DST prefetch segment size, code that needs to run well on G5 will likely be better served by replacing the DSTs with dcbt instructions. The G5 also introduces a special streaming dcbt instruction that may be used to begin a prefetch stream that reads in the part of a 4k page that appears after the address provided. Earlier processors will see this as a regular dcbt. Many functions work equally well on G4 with DST or dcbt; DSTs were just easier. Since they no longer perform quite as well on G5s, Apple recommends that you revert to using dcbt as your preferred method of prefetching.
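
A dcbt-based version of a prefetching loop might look like the sketch below. It assumes GCC's __builtin_prefetch, which ordinarily lowers to a dcbt on PowerPC; the 512 byte prefetch distance and the function name are illustrative choices, not recommendations from the text.

    #include <altivec.h>

    /* Scale an array in place, touching a line several hundred bytes
       ahead of the current work so the bus stays busy. */
    void scale_in_place( vector float *data, int n_vectors, vector float k )
    {
        const vector float zero = (vector float) vec_splat_u32(0);
        int i;
        for( i = 0; i < n_vectors; i++ )
        {
            __builtin_prefetch( (const char *) (data + i) + 512 );
            data[i] = vec_madd( data[i], k, zero );
        }
    }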

G5 also has 4 dedicated automatic hardware driven prefetchers that respond to consecutive cache misses and introduce short lived data streams. In many cases, these do quite well, and may obviate the need for prefetching. In some cases, they may interfere with your user initiated prefetches and cause performance degradation. These can be disabled for testing purposes by adjusting a bit in the HID special purpose registers, using the CHUD toolkit. You should not change these settings in a shipping application.
