
The Caches

Caches are small segments of extremely fast memory used to hold recently used data and instructions. They can greatly accelerate calculations that repeatedly touch the same regions of memory over a short period of time. The G4 has a 32 kB level 1 (L1) data cache and a matching 32 kB L1 instruction cache. In addition, there is a level 2 (L2) cache, usually between 256 kB and 1 MB in size, that holds both data and instructions. In some cases there is also a level 3 (L3) cache, as on the PPC 7450, that may be as large as 2 MB. The G5 has a larger L1 instruction cache (64 kB), a 32 kB two-way set associative L1 data cache, and a 512 kB eight-way set associative L2 cache.

Data moves in and out of caches in aligned chunks called cachelines. Cachelines are customarily 32 bytes in size, but can be larger. Before data can be loaded into a register and used by the processor, it must first be loaded into the level 1 (L1) cache, except where the memory is marked non-cacheable. Thus, if you make a lot of scattered memory accesses, you may end up loading up to 32 times as much data into the processor as you actually need. This is the pathological case. Most of the time, applications are highly repetitive in their data access patterns and the caches are very effective.

The normal data access process is as follows: in response to a load instruction, the processor first examines the L1 cache to see if the cacheline holding that piece of data is present. If it is, the data is simply loaded from there. This is called an L1 cache hit and is extremely fast, taking only a few cycles. If the data is not in the L1, the processor checks the L2 cache. If it finds the data there, the entire cacheline containing it is loaded into the L1 from the L2, and the data is loaded into a register from there. Some machines have a level 3 cache that is checked if the data is not in the level 2 cache; it behaves similarly. If the data isn't in any of the caches, the processor takes a long, slow trip to DRAM for it. In each case, once the data is found, the entire cacheline that contains it is loaded into the L1 cache over four beats of the PowerPC's 64-bit front side bus.

When the new data arrives in the L1 cache, some other cacheline must be displaced to make room for it. The cachelines are grouped in sets. As the new cacheline is added to a set, another element of the set must be flushed out to make room for it. Which element is flushed is decided using a pseudo-least-recently-used (pseudo-LRU) algorithm. The L1 data cache on the 7400, 7410, 7450 and 7455 is eight-way set associative, meaning each set holds eight cachelines. The 32 kB L1 data cache therefore contains 128 sets. All of the cacheline elements in a set live at addresses some integer multiple of 4 kB apart. Thus, if you stride through memory 4 kB at a time (for example, drawing a vertical line in a 1024 pixel wide 32-bit color GWorld, as sketched below) you will only be using a single set in the L1 data cache, and your code will operate as if the L1 were only 256 bytes in size. This again is the pathological case. Try to avoid large strides whose size in bytes is a power of two.
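
To make the cache arithmetic concrete, here is a minimal sketch of that pathological vertical line, assuming a hypothetical 1024 pixel wide, 32-bit GWorld (4096 bytes per row); the function name and pixel value are invented for illustration:

       void DrawVerticalLine( unsigned long *baseAddr, int x, int height )
       {
           const int kRowPixels = 1024;   /* 1024 pixels * 4 bytes = a 4096-byte stride */
           int y;

           for( y = 0; y < height; y++ )
           {
               /* Every store lands 4 kB from the previous one, so all of them
                  compete for the same eight-line set: the 32 kB L1 behaves as
                  if it were only 256 bytes in size. */
               baseAddr[ y * kRowPixels + x ] = 0xFF000000UL;
           }
       }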

Where does the displaced data go? Unless you tell it to do otherwise, the processor will send the displaced cacheline to the L2 cache. This is the only way for data to get into the L2 cache on the PowerPC 7400 and 7410; the L2 is called a victim cache on these processors for that reason. There the process repeats itself as the cacheline displaces another cacheline from the appropriate set in the L2 cache. The process continues until we run out of caches and a (modified) cacheline is written to DRAM. (Unmodified cachelines are simply discarded when displaced from the caches.) The newer PPC 7450 and 7455 usually populate the L2 and L3 caches at the same time that a cacheline is moved into the L1. For these processors, when a cacheline is cast out of the L1 it is already resident in the L2, so little needs to be done. If the cacheline was fetched with a transient prefetch instruction such as dstt, it is only loaded into the L1.

The fact that data is read from and written to DRAM as full cachelines, not as bytes, shorts, words, doubles or vectors, may have a profound impact on how you think about the costs of memory access, and on the importance of keeping frequently accessed data close together. Scattered memory accesses can be very costly.

Software Directed Prefetch (Cache Management)

AltiVec provides advanced tools for helping you to control when data is loaded into the caches and how soon it is evicted. Bandwidth between the processor and memory can be managed explicitly by the programmer through the use of streaming data prefetch instructions. These instructions allow software to give hints to the cache hardware about how it should prefetch and prioritize write back of data. Because it works exclusively through the caches, this streaming data prefetch mechanism is also suitable for use with scalar code on G4 processors.

In the G4 processor there are four Data Stream (DST) channels. Each channel runs independently of the others and can prefetch up to 128K bytes of data. Here are some of the important characteristics of the DST prefetch engine.

  • Four independent channels
  • Lowest priority access to the bus so they will not interfere with normal bus activity
  • Start and forget.
    • once started, no further program interaction is required
    • will not cause a page fault if unmapped memory is accessed
    • will not cause a page fault if protected memory is accessed
    • will continue to run through interrupts
  • Can access up to 128K bytes in a contiguous block
  • Can access up to 256 evenly spaced blocks of up to 512 bytes
  • Can load data into cache optimized for reads or writes
  • Can load data into cache as Most Recently Used (MRU) or Least Recently Used (LRU)
  • When a new command is issued on a stream, any command currently running on that channel is stopped.
  • One or all of the streams can be explicitly stopped

There are four instructions for starting a stream and two for stopping them:

    • DST    - Data Stream Touch
    • DSTT   - Data Stream Touch Transient
    • DSTST  - Data Stream Touch for Store
    • DSTSTT - Data Stream Touch for Store Transient
    • DSS    - Data Stream Stop
    • DSSALL - Data Stream Stop All

DST

  • The DST instruction will start one of the four data streams, causing it to read data into the cache. You should use this instruction when the data will be read by the program several times and the data has some reasonable degree of locality.

DSTT

  • The DSTT instruction will start one of the four data streams, causing it to read data into the cache. The cachelines will be marked transient, so that when it comes time for them to be displaced from the L1 cache they are flushed straight to RAM, rather than to the L2 and L3 caches. You should use this instruction when the data will be read by the program only a few times, or when the data does not have a reasonable degree of locality. It is a good way to protect the contents of the L2 and L3 caches.

DSTST

  • The DSTST instruction will start one of the four data streams, causing it to read data into the cache. You should use this instruction when the data will be read and written by the program several times and has some reasonable degree of locality.

DSTSTT

  • The DSTSTT instruction will start one of the four data streams, causing it to read data into the cache. The cachelines will be marked transient, so that when it comes time for them to be displaced from the L1 cache they are flushed straight to RAM, rather than to the L2 and L3 caches. You should use this instruction when the data will be read and written by the program just a few times, or when the data does not have a reasonable degree of locality. It is a good way to protect the contents of the L2 and L3 caches.

DSS

  • The DST instructions have a counterpart called data stream stop (DSS). This instruction allows the program to stop any given stream prefetch by executing a dss with the tag of the stream it wants to stop. This is useful if, for example, the program wants to start a stream prefetch speculatively, but later determines that the instruction stream went the wrong way. DSS provides a mechanism to stop the stream so no more bandwidth is wasted.
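
For illustration, here is a minimal sketch of that speculative pattern, assuming hypothetical names and a 256-byte prefetch control word:

       #include <altivec.h>   /* AltiVec intrinsics under GCC */

       void MaybeProcess( const vector float *data, int useData )
       {
           /* Speculatively start fetching 256 bytes on channel 1 */
           vec_dst( data, 0x10010100, 1 );

           if( !useData )
           {
               /* The branch went the other way: stop the stream so no
                  further bus bandwidth is spent on data we will not use */
               vec_dss( 1 );
               return;
           }

           /* ... consume data here ... */
       }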

DSSALL

  • All active streams may be stopped by using the DSSALL instruction. This is useful when, for example, the operating system needs to stop all active streams (e.g. process switch) but has no knowledge of how many streams are in progress.

The Basic Form

All of the DST instructions have the same form; which one to use depends on how the data will be used later by the program. Each of the above instructions takes three parameters: the starting address of the first block of data, a control word, and a two-bit immediate value for the channel number (0-3). The function definition for DST looks like this:

       void vec_dst( const void *address, int control, int channel );
  • The address is a pointer to a byte in the first cacheline to be fetched
  • The control word is built from three parts: the block size, the block count and the distance between the blocks. The three values are packed into a long word as follows:

         bit:    0-2       3-7          8-15            16-31
               +-----+-----------+------------+---------------------+
               | 000 | BlockSize | BlockCount | Signed Block Stride |
               +-----+-----------+------------+---------------------+

  • BlockSize is a five-bit value from 0 to 31 indicating the number of 16-byte chunks (0 = 32 16-byte chunks).
  • BlockCount is an eight-bit value from 0 to 255 indicating the number of blocks (0 = 256 blocks).
  • BlockStride is a 16-bit value from -32,768 to +32,767 (0 = +32,768 bytes), denoting the address increment or decrement from the beginning of one block to the beginning of the next. While stride can be any value, programmers are discouraged from specifying BlockStrides smaller than 1 block (16 bytes). Note also that optimum performance can be achieved by aligning blocks on 16-byte boundaries.
  • The channel is a two bit literal constant {0, 1, 2, 3} that indicates which of the four hardware streams to use. Apple recommends that in order to better avoid collisions with streams started by system software, you start with stream 0 and work towards stream 3. Interrupt level tasks on MacOS 9 should start at stream 2 and work towards stream 0.
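
As a sanity check, here is a minimal sketch of packing those three fields into a control word. The macro name is invented; the layout follows the diagram above (PowerPC bit 0 is the most significant bit of the word):

       #define VEC_DST_CONTROL( size, count, stride )   \
           ( (((size)  & 0x1F) << 24) |                 \
             (((count) & 0xFF) << 16) |                 \
              ((stride) & 0xFFFF) )

       /* Example: blocks of 16 16-byte chunks (256 bytes), one block,
          stride of 256 bytes. This evaluates to 0x10010100, the control
          word used by the MyBlitter() example below. */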

Special Notes

The fact that a program issues a vec_dst command does not guarantee that the requested data will be in the cache. There are several things that can prevent the DST from completing or cause the data to be removed from the cache after it is loaded. Here is a list of some of the things that can prevent your data from being in the L1 when you want it:

  • The requested data is not currently in main memory (VM paged out to disk).
  • The requested data is in memory that is marked as non-cacheable.
  • Your task does not have read access to the requested data.
  • An interrupt occurs and the interrupting routine starts a dst of its own, canceling yours (MacOS 9).
  • The scheduler swaps out your thread in favor of another one (MacOS X). This will cause the DST to terminate. It will not restart when you get the CPU back.
  • If the DST crosses a page boundary, it may terminate.
  • DST's have a lower priority than demand loads and stores, so it is possible that extremely load/store rich code (e.g. a copy routine) may push the DST aside. Look at the Darwin bcopy() source for an example of a case where dcbt was used instead.
  • The code calls a subroutine or system call that does its own dst, canceling yours.
  • There is not enough time between dst initiation and when the data is needed for the data to be read in.
  • The code is accessing data between the time the dst is started and the data is used, so the data is flushed out in favor of the other data the program is using. (Only likely with extended periods of time and lots of intervening data.)
  • Finally, you may not observe any improved performance if your data is already in the caches.

 

Suggested DST Usage

The particular block size, count and stride that work best for a function may be highly function- and hardware-dependent. It is generally necessary to test a wide variety of combinations to find the one that works best, so some experimentation is typically required to see good results with DST's. If you are seeing no improvement, then one of many factors may be the cause:

  1. You are not prefetching far enough ahead of time
  2. You are prefetching so far ahead of time that the data has left the cache by the time you need it.
  3. You are prefetching far more data than you need, causing additional unnecessary bus traffic that is sapping performance
  4. The DST is stopping prematurely (see Special Notes above)
  5. The data is already in the caches -- this is a special problem for benchmarking applications that call a function repeatedly
  6. Your function is so complicated that cache miss stalls are an immeasurably small fraction of total function cost -- VERY few functions are this complex, but it does happen. This would typically require that your function spend fifty or more cycles processing each 32-byte segment of data after it is loaded into registers.

Typical gains vary according to how much computation is done on each cacheline (usually 32 bytes) of data, but are often in the 10-40% range for functions loading uncached data. In rare cases, DST's can provide up to a fourfold speed improvement starting with uncached data.

While you are free to use the DST instructions in the manner that works best for you, DST users should note that Motorola has drafted some guidelines for best results. These appear in section 5.2.1.8 of the AltiVec Technology Programming Environments Manual. In short, this document recommends that you prefetch the data in short overlapping blocks.

There are several reasons why short overlapping blocks are likely to work better than long non-overlapping ones:

If you prefetch your data in large blocks up front, it is quite likely that either the DST will outrun your function or (more likely) your function will outrun the DST. In either case, the DST won't be very helpful. It is exceptionally difficult to stay synchronized with such a long DST, and even if you managed it, the result would probably only work on one particular collection of hardware and not another.

If you lose control of the CPU (for example, due to a preemptive thread swap on MacOS X), the new task may flush your hard-won data back out of the caches in the course of its work. Thus, prefetching far ahead of time may turn out to be a waste of memory bandwidth -- the data still might not be in the cache when you need it.

DST's may "spontaneously" stop and not restart for reasons that you have no control over. These are described in the Special Notes section above. It is unlikely that any exceptionally long stream will actually reach completion. There is no way to find out if a DST has stopped or has completed. There is no way to find out which cacheline the DST is working on. Therefore, all you can do is repeatedly stop the stream and start it again in the right place.

The cost for starting a prefetch on data that has already been prefetched is low. The cost for frequent stopping and restarting a stream is fairly low as well. This cost is worth worrying about only if your function consumes data faster than the lower level caches can provide it. In such cases, appreciable amounts of overlap may have some detrimental influence on dst performance when your data is starting in the L2 or L3 caches. A good strategy to deal with this issue is to simply write more complex functions (see Performance Issues: Memory). As long as the calculation takes more cycles than loading the data, it does not matter how long loading the data takes. The two operate in parallel. Tuning DST's in these cases is usually quite easy.

On G5, DSTs terminate early when they encounter page boundaries.

In short, prefetching a series of short overlapping blocks, timed to coincide with the data that you are using, is the most effective way to keep the stream active and synchronized with your function. If you are working with a one-dimensional array, typically your function will be looping over that array, processing many bytes at a time. To implement a just-in-time prefetch strategy, each time you reach the top of the loop, start a prefetch spanning from the data you need immediately to the data you will need several loop iterations into the future. If you are working with a 2D array, you may find it easier to simply prefetch the next row as you are working on the current one. In both cases, reusing the same stream ID will stop the old stream and start the new one in the correct place. The example below, MyBlitter(), shows a simple copy function that uses DST to prefetch in short overlapping blocks.

void MyBlitter( vector signed char *src, vector signed char *dst, int vectorCount )
{
    int i;
    const int kPrefetchLeadDistance = 0;   /* measured in 16-byte vectors */

    for( i = 0; i < vectorCount / 4; i++ )
    {
        //Prefetch from src to 256 bytes away
        vec_dst( src + kPrefetchLeadDistance, 0x10010100, 0 );

        //Copy 64 bytes from src to dst
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];

        //Advance our pointers 64 bytes
        dst += 4;
        src += 4;
    }

    //Now copy any extra data left over
    if( vectorCount & 2 )
    {
        dst[0] = src[0];
        dst[1] = src[1];
        dst += 2;
        src += 2;
    }

    if( vectorCount & 1 )
        dst[0] = src[0];

    //Stop the stream now that we are finished with it
    vec_dss( 0 );
}

Experience suggests that it won't hurt to ask for too much data in each block; if your loop iterates frequently, the memory system probably will not have enough time to load all of it anyway. However, performance will be dramatically worse if the block is too small, or if the DST is not started early enough. In some cases, better performance can be obtained by starting your prefetch some small distance ahead of the data that you need immediately. In the example above, this could be done by setting kPrefetchLeadDistance to a small positive value such as 4 vectors (64 bytes). See the section Performance Issues: Doing More with the Data for more information on exactly how and why data prefetching is helpful.

When tuning, be aware that in addition to the prefetch segment length, the offset from your working region to the first byte of the prefetch segment is something to look at, although this will often produce smaller improvements. The key thing is to make sure you are prefetching far enough ahead of time that the data actually reaches the caches before you need it. Typically this is 4-6 loop iterations ahead. The distance may need to be larger on G5, which has longer latencies to memory as measured in cycles.
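
For the 2D case mentioned earlier, a minimal sketch might look like this. ProcessImage and its arguments are hypothetical, and the control word would need tuning to the row length; the idea is simply that reissuing vec_dst on channel 0 restarts the stream at the next row while the current row is processed:

       #include <altivec.h>   /* AltiVec intrinsics under GCC */

       void ProcessImage( vector unsigned char *image, int vectorsPerRow, int height )
       {
           int y;

           for( y = 0; y < height; y++ )
           {
               vector unsigned char *row = image + y * vectorsPerRow;

               /* Start fetching the next row; issuing a new dst on channel 0
                  stops the old stream and starts the new one in its place */
               if( y + 1 < height )
                   vec_dst( row + vectorsPerRow, 0x10010100, 0 );

               /* ... process the current row here ... */
           }

           vec_dss( 0 );   /* finished: stop the stream */
       }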

LRU Loads and Stores

In addition to vec_ld and vec_st, you may opt to use vec_ldl() and vec_stl(). These operate in just the same way, except that they mark the cacheline least recently used ('LRU') and transient. LRU simply means that the cacheline will be the first to be flushed from its set in the L1 when a cacheline from that set needs to be flushed. Furthermore, because it is marked transient, it will be flushed straight to DRAM. This is an effective way of preserving L1, L2 and L3 cache contents.
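
For example, a copy loop that touches its data only once might use the LRU variants so the data passes through the L1 without displacing anything you care about. This is a minimal sketch with a hypothetical function name:

       #include <altivec.h>   /* AltiVec intrinsics under GCC */

       void CopyOnce( const vector signed char *src, vector signed char *dst, int vectorCount )
       {
           int i;

           for( i = 0; i < vectorCount; i++ )
           {
               /* Load marked LRU and transient: first out of its set,
                  flushed straight to DRAM when displaced */
               vector signed char v = vec_ldl( 0, &src[i] );

               /* Store marked the same way, so destination lines do not linger */
               vec_stl( v, 0, &dst[i] );
           }
       }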

It should be noted that because LRU loads and stores flush straight to RAM, their performance typically suffers. The reason to use these instructions is to protect needed data in the caches. It is therefore entirely possible, even likely, that adding an LRU load or store will slow down your function while speeding up the functions around it. For this reason, you must be careful when measuring the speed effect of LRU loads and stores: performance measured as relative percentages of CPU execution time will appear to suggest that the LRU instruction is only making your application slower, regardless of whether it actually is or not. The best benchmark in this situation is to time overall application performance for specific tasks, with the LRU loads and stores in place.

The LRU hint is ignored on G5; there, LRU loads and stores are performed as regular loads and stores.
