The Caches

Caches are small segments of extremely fast memory used to hold recently used data and instructions. They can greatly accelerate calculations that repeatedly touch the same regions of memory over a short period of time. The G4 has a 32 kB level 1 (L1) data cache and a matching 32 kB L1 instruction cache. In addition, there is a level 2 (L2) cache, usually in the range of 256 kB to 1 MB, that holds both data and instructions. In some cases there is a level 3 (L3) cache that may be as large as 2 MB, as on the PPC 7450. The G5 has a larger L1 instruction cache (64 kB), a 32 kB 2-way L1 data cache and a 512 kB 8-way L2 cache.

Data moves in and out of caches in aligned chunks called cachelines. Cachelines are customarily 32 bytes in size, but can be larger. Before data can be loaded into register and used by the processor, it must first be loaded into the L1 cache, except where memory is marked non-cacheable. Thus, if you make a lot of scattered memory accesses, you may end up loading up to 32 times as much data as you need into the processor. This is the pathological case. Most of the time, applications are highly repetitive in their data access patterns and the caches are very effective.

The normal data access process is as follows: in response to a load instruction, the processor first examines the L1 cache to see if the cacheline holding that piece of data is present. If it is, the data is simply loaded from there. This is called an L1 cache hit and is extremely fast, taking only a few cycles. If the data is not in the L1, the processor checks the L2 cache. If it finds it there, the entire cacheline containing the data is loaded into the L1 from the L2, and the data is loaded into register from there. Some machines have a level 3 cache that is checked if the data is not in the level 2 cache; it behaves similarly. If the data isn't in any of the caches, the processor takes a long, slow trip to DRAM for it. In each case, once the data is found, the entire cacheline that contains it is loaded into the L1 cache over four beats of the PowerPC's 64-bit front side bus.

When the new data arrives in the L1 cache, some other cacheline must be displaced to make room for it. The cachelines are grouped in sets. As the new cacheline is added to a set, another element of the set must be flushed out to make room for it. Which element is flushed is decided using a pseudo-least-recently-used (LRU) algorithm. The L1 data cache on the 7400, 7410, 7450 and 7455 is eight-way set associative, meaning each set has eight cachelines. The 32 kB L1 data cache therefore contains 128 sets. All of the cacheline elements in a set live at addresses some integer multiple of 4 kB apart. Thus, if you stride through memory 4 kB at a time (for example, drawing a vertical line in a 1024 pixel wide 32-bit color GWorld) you will only be using a single set in the L1 data cache, and your code will operate as if the L1 were only 256 bytes in size. This again is the pathological case. Try to avoid large strides whose size in bytes is a power of two.

Where does the displaced data go? Unless you tell it to do otherwise, the processor will send the displaced cacheline to the L2 cache. This is the only way for data to get into the L2 cache on the PowerPC 7400 and 7410, which is why the L2 is called a victim cache on those processors.
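To make the set arithmetic concrete, the sketch below models the 8-way, 32 kB L1 with 32-byte cachelines described above (128 sets) and shows why addresses 4 kB apart always land in the same set. The constants and the CacheSetFor helper are illustrative, not part of any real API.

#include <stdio.h>
#include <stdint.h>

/* Model of the 8-way, 32 kB L1 data cache with 32-byte lines:
   32768 bytes / 32 bytes per line / 8 ways = 128 sets.             */
enum { kLineSize = 32, kNumSets = 128 };

/* Hypothetical helper: which set an address maps to. The set index
   comes from address bits 5-11, so any two addresses that differ by
   a multiple of 32 * 128 = 4096 bytes share a set.                  */
static unsigned CacheSetFor( uintptr_t address )
{
    return (unsigned) ( ( address / kLineSize ) % kNumSets );
}

int main( void )
{
    uintptr_t base = 0x10000;
    int row;

    /* A 4 kB stride (one pixel column of a 1024-pixel-wide, 32-bit
       GWorld) hits the same set on every access, so only 8 lines
       (256 bytes) of the 32 kB cache are actually in play.          */
    for ( row = 0; row < 4; ++row )
        printf( "address 0x%lx -> set %u\n",
                (unsigned long)( base + row * 4096 ),
                CacheSetFor( base + row * 4096 ) );

    return 0;
}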
There the process repeats itself as the cacheline displaces another cacheline from the appropriate set in the L2 cache. The process continues until we run out of caches and a (modified) cacheline is written to DRAM. (Unmodified cachelines are simply discarded when displaced from the caches.) The newer PPC 7450 and 7455 usually populate the L2 and L3 caches at the same time that a cacheline is moved into the L1. For these processors, when a cacheline is cast out of the L1 it is already resident in the L2, so little needs to be done. If the cacheline was fetched with a transient prefetch instruction such as dstt, it is only loaded into the L1.

The fact that data is read from and written to DRAM as full cachelines, and not bytes, shorts, words, doubles or vectors, may have a profound impact on how you think about the cost of memory access and the importance of keeping frequently accessed data close together. Scattered memory accesses can be very costly.

Software Directed Prefetch (Cache Management)

AltiVec provides advanced tools for helping you control when data is loaded into the caches and how soon it is evicted. Bandwidth between the processor and memory can be managed explicitly by the programmer through the use of streaming data prefetch instructions. These instructions allow software to give hints to the cache hardware about how it should prefetch and prioritize write back of data. Because it works exclusively through the caches, this streaming data prefetch mechanism is also suitable for use with scalar code on G4 processors. In the G4 processor there are four Data Stream (DST) channels. Each channel runs independently of the others and can prefetch up to 128 kB of data. Here are some of the important characteristics of the DST prefetch engine.
There are four instructions for starting a stream and two for stopping them:

• DST - Data Stream Touch
• DSTT - Data Stream Touch Transient
• DSTST - Data Stream Store
• DSTSTT - Data Stream Store Transient
• DSS - Data Stream Stop
• DSSALL - Data Stream Stop All
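For orientation, the sketch below shows the C intrinsics that correspond to these instructions: vec_dst, vec_dstt, vec_dstst, vec_dststt, vec_dss and vec_dssall. The function name, pointers and control word are placeholders; construction of the control word is described in the next section.

#include <altivec.h>

void StreamIntrinsicsSketch( const unsigned char *readData,
                             unsigned char *writeData,
                             int control )
{
    /* Start prefetch streams; the last argument is the channel (0-3). */
    vec_dst   ( readData,  control, 0 );  /* data we will read and reuse   */
    vec_dstt  ( readData,  control, 1 );  /* read-once (transient) data    */
    vec_dstst ( writeData, control, 2 );  /* data we will modify and reuse */
    vec_dststt( writeData, control, 3 );  /* modify-once (transient) data  */

    /* ... do some work ... */

    /* Stop a single stream by channel number, or stop all four at once. */
    vec_dss( 2 );
    vec_dssall();
}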
The Basic Form

All of the DST instructions have the same form; which one to use depends on how the data will be used later by the program. Each of the above instructions takes three parameters: the starting address of the first block of data, a control word, and a two-bit immediate value for the channel number (0-3). The function definition for DST looks like this:

void vec_dst( void *address, int control, int channel );
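The control word packs three fields, as described in the AltiVec Technology Programming Environments Manual: a block size in 16-byte vectors (bits 24-28, where 0 means 32), a block count (bits 16-23, where 0 means 256), and a signed block stride in bytes (bits 0-15). The sketch below builds such a control word with a hypothetical DST_CONTROL macro and starts a stream with it; the function name, pointer and field values are illustrative only.

#include <altivec.h>

/* Hypothetical helper: pack block size (in 16-byte vectors, 1-32),
   block count (1-256) and signed byte stride into a DST control word.
   Layout: size in bits 24-28, count in bits 16-23, stride in bits 0-15. */
#define DST_CONTROL( size, count, stride ) \
    ( ((size) << 24) | ((count) << 16) | ((stride) & 0xFFFF) )

void PrefetchRows( const float *pixels, int rowBytes )
{
    /* Illustrative values: fetch the first 64 bytes (4 vectors) of each
       of the next 4 image rows, one block per row, on stream channel 0. */
    vec_dst( pixels, DST_CONTROL( 4, 4, rowBytes ), 0 );
}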
Special Notes

The fact that a program issues a vec_dst command does not guarantee that the requested data will be in the cache. Several things can prevent the DST from completing, or can cause the data to be removed from the cache after it is loaded, so your data may not be in the L1 when you want it.
Suggested DST Usage

The particular block size, count and stride that work best for a function may be highly function and hardware dependent. In each case, it is generally necessary to test a wide variety of combinations to find the one that works best. This means that some experimentation typically must be done to see good results with DSTs. If you are seeing no improvement, any one of many factors may be the cause.
Typical improvement gains vary according to how much computation is done on each cacheline (usually 32 bytes) of data, but are often in the 10-40% range for functions loading uncached data. In rare cases, they can provide up to a four-fold speed improvement starting with uncached data. While you are free to use the DST instructions in the manner that works best for you, note that Motorola has drafted some guidelines for best results. These appear in section 5.2.1.8 of the AltiVec Technology Programming Environments Manual. In short, that document recommends that you prefetch the data in short overlapping blocks.

There are several reasons why short overlapping blocks are likely to work better than long non-overlapping ones. If you prefetch your data in large blocks up front, it is quite likely that either the DST will outrun your function or (more likely) your function will outrun the DST. In either case, the DST won't be very helpful. It is exceptionally difficult to stay synchronized with such a long DST, and even if you managed it, it would probably only work on one particular collection of hardware and not another. In short, prefetching a series of short overlapping blocks timed to coincide with the data that you are using is the most effective way to keep the stream active and synchronized with your function.

If you are working with a one-dimensional array, typically your function will be looping over that array, processing many bytes at a time. To implement a just-in-time prefetch strategy, each time you reach the top of the loop, start a prefetch segment that runs from the data you need immediately towards the data that you will need several loop iterations into the future. If you are working with a 2D array, you may find it easier to just prefetch the next row as you are working on the current one. In both cases, simply reusing the same stream ID will stop the old stream and start the new one in the correct place. The example below illustrates the just-in-time strategy for a simple vector copy; the loop body and prefetch constants are illustrative and should be tuned for your own function:
#include <altivec.h>

void MyBlitter( vector signed char *src, vector signed char *dst, int vectorCount )
{
    int i;

    // Bytes ahead of the working position at which each prefetch
    // segment begins; 0 means start at the data we need right now.
    const int kPrefetchLeadDistance = 0;

    // Illustrative control word: 4 blocks of 4 vectors (64 bytes each),
    // 64 bytes apart -- a short segment reissued on every iteration.
    const int kDSTControl = ( 4 << 24 ) | ( 4 << 16 ) | 64;

    for( i = 0; i < vectorCount; i++ )
    {
        // Reusing channel 0 stops the previous stream and restarts it
        // at the data needed over the next few iterations.
        vec_dst( (unsigned char *)( src + i ) + kPrefetchLeadDistance,
                 kDSTControl, 0 );

        dst[i] = src[i];
    }

    vec_dss( 0 );
}
Experience suggests that it won't hurt to ask for too much data in each block; the memory system probably will not have enough time to load all of it if your loop iterates frequently. However, performance will be dramatically worse if the block is too small or the DST is not issued early enough. In some cases, better performance can be obtained by starting your prefetch some small distance ahead of the data that you need immediately. This could be done in the above example by setting kPrefetchLeadDistance to a small non-zero positive value such as 64 bytes. See the section Performance Issues: Doing More with the Data for more information on exactly how and why data prefetching is helpful.

When tuning, be aware that in addition to the prefetch segment length, the offset from your working region to the first byte of the prefetch segment is worth examining, though adjusting it will often produce smaller improvements. The key thing is to make sure you are prefetching far enough ahead of time that the data actually reaches the caches before you need it. Typically this is 4-6 loop iterations ahead of time. It may be larger on the G5, which has longer latencies to memory as measured in cycles.

LRU Loads and Stores

In addition to vec_ld and vec_st, you may opt to use their LRU variants, vec_ldl and vec_stl. It should be noted that because LRU loads and stores flush straight to RAM, their performance typically suffers. The reason to use these instructions is to protect needed data in the caches. For this reason, it is entirely possible, even likely, that adding an LRU load or store will slow down your function and speed up the functions around it. You must therefore be careful when measuring the speed effect of LRU loads and stores. Performance measured as relative percentages of CPU execution time will appear to suggest that the LRU instruction is only making your application slower, regardless of whether it actually is or not. The best benchmark in this situation is to time overall application performance for specific tasks with the LRU loads and stores in place. The LRU hint is ignored on the G5, where LRU loads and stores are performed as regular loads and stores.
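As a brief illustration of the intent behind these hints, the sketch below uses vec_ldl and vec_stl to copy a large, touch-once buffer so that its cachelines are marked for early eviction rather than displacing data the rest of the application is still reusing. The function name and scenario are made up for the example.

#include <altivec.h>

// Copy a large, touch-once buffer using LRU loads and stores so its
// cachelines are the first candidates for eviction, protecting the
// working set of surrounding code. (On G5 these behave as ordinary
// vec_ld / vec_st.)
void CopyTransientBuffer( const vector float *src, vector float *dst, int count )
{
    int i;
    for( i = 0; i < count; i++ )
    {
        vector float v = vec_ldl( i * 16, src );   // LRU load  (lvxl)
        vec_stl( v, i * 16, dst );                 // LRU store (stvxl)
    }
}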