
Throughput and Latency

Some programmers write functions that do a small amount of work on a data atom and return a result. They optimize the function to do its task and return as quickly as possible. This is frequently described as the low-latency approach to optimization, because it seeks to reduce the function latency -- the time between when a function is called and when it returns. Low-latency functions are simple and easy to use. Data is traditionally passed by value and the functions are typically small.

On the other hand, when the overall goal is to complete an operation on your entire data set as quickly as possible, rather than just a small fragment of it, pursue a high-throughput design. High-throughput designs are those that focus on moving as much data through the calculation as possible in the shortest amount of time. In order to achieve that, any part of the calculation that doesn't need to be repeated for each new piece of data is moved out of the inner loop. Just the essential core operations are repeated for each piece of data. These functions can be complex and are most effective when operating over large data sets. In C, they are usually passed data by reference. They also typically have a higher up-front overhead.

Calculation Time = Overhead Time + Data Count / Throughput
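For example, with a hypothetical function that has 100 cycles of one-time overhead and a throughput of one data element per cycle, a single element costs 100 + 1 = 101 cycles, but 1000 elements cost 100 + 1000 = 1100 cycles -- about 1.1 cycles per element, with the overhead amortized almost to nothing.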

The AltiVec engine is specifically designed for the highest possible throughput. It is superscalar in design, highly parallel and aggressively pipelined. Functions that use AltiVec often have higher overhead associated with them than do scalar functions that do the same task. For this reason, you are more likely to benefit from high-throughput designs when writing AltiVec accelerated functions, since the goals and usage patterns for that style align with the characteristics of the vector unit. Vectorized low-latency function designs can in some cases be slower than their scalar counterparts due to the added overhead associated with the vector unit. Thus they should be avoided, except where they can be inlined into the caller or other steps can be taken to minimize their overhead.

Overhead in AltiVec

Overhead associated with AltiVec arises from a few different factors. These include the cost of saving and restoring vector registers, organizing data within register, and loading / assembling constants for use in your function.

Register Save and Restore

Setting up a stack frame for a function that uses AltiVec is more expensive than for other types of functions. This is true for two reasons. To begin with, the vector register file holds a lot of data. The 32 AltiVec registers together hold 512 bytes of data. Saving and restoring their contents touches more of the caches, meaning that other important data may be flushed out. In addition, other tasks that involve saving or restoring vector registers, such as exception handling or thread context switching, also become more expensive.

So that the processor doesn't have to save and restore every vector register every time an exception is thrown or there is a thread context switch, the processor maintains a special purpose register called vrsave. The 32 bits of the vrsave special purpose register indicate which vector registers are in use. Using information stored in this register, code that needs to save and restore registers can safely just save and restore the ones that are in use. By convention on MacOS, the unused registers are filled with 0x7FFFDEAD to indicate their status on register restore. Use of the vrsave register is required on MacOS. This means that every AltiVec function has added overhead involved to properly configure the vrsave register before vector calculations can commence. It must also restore the vrsave register to its former setting before the function can exit. For these reasons, it is in general a little bit more expensive to call into an AltiVec accelerated function.

Data Organization

The fact that a vector register can hold many pieces of data at a time means that the developer must address the new problem of how data is organized within the vector register. If data is stored in memory in some format that is not especially suitable for the vector unit, there may be some up-front cost to reorganizing the data for efficient calculation in register. One example of this was given in the Data Handling section. Another example might be unaligned vector loads. These require that you load N+1 aligned vectors and do N permutes to move N unaligned vectors into register. Clearly if you only load one unaligned vector in your function as part of a low-latency design, you will be paying an extra 100% load overhead, but if you load 10 vectors then the added load will only constitute a 10% extra overhead, and 100 vectors will only have 1% overhead.
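As a concrete illustration of that last point, here is a minimal sketch of the misaligned load sequence described above. The function name and loop structure are illustrative, not taken from this text; the point is simply that loading N unaligned vectors costs N+1 aligned loads plus N permutes, so the relative cost falls as N grows.

#include <altivec.h>

void LoadUnalignedVectors( const float *src, vector float *dst, int n )
{
    vector unsigned char fixup = vec_lvsl( 0, src );   // alignment permute mask
    vector float prev = vec_ld( 0, src );              // the extra (N+1st) aligned load
    int i;

    for( i = 0; i < n; i++ )
    {
        vector float next = vec_ld( (i + 1) * 16, src );
        dst[i] = vec_perm( prev, next, fixup );        // one permute per unaligned vector
        prev = next;
    }
}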

Constants

Finally, constants are not as easy to generate in the vector units as they are in the scalar integer units. Getting them into register may involve a load from memory or a series of operations to synthesize the constant in register. Vector constants also take up more room in the caches.
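To make the distinction concrete, here are two hedged examples of how a constant typically reaches a vector register (GCC-style vector literal syntax is assumed; it varies by compiler). Small integer splats can be synthesized in register with a single instruction, while arbitrary values such as floats generally have to live in memory and be loaded.

#include <altivec.h>

// 16 bytes in the data segment; this one costs a load and cache space when used
static const vector float kOneHalf = { 0.5f, 0.5f, 0.5f, 0.5f };

vector signed int SmallConstant( void )
{
    // Synthesized directly in register -- no load, no cache footprint
    return vec_splat_s32( 7 );
}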

Overhead Summary

AltiVec functions can have substantial setup cost compared to their scalar counterparts. It is important to minimize the fractional overhead contribution to the cost of the calculation as a whole, in order to avoid having the overhead become the dominant time component in your function. The best strategy is to try to pay the setup cost once and then amortize it over as much data as possible. Very small AltiVec functions with non-trivial setup costs can be slower than their scalar counterparts, so these are best avoided. Low latency functions are usually only a large performance win if they are inlined.

Throughput -- Pipelines, Parallelism and Superscalar execution

The other half of the speed equation is the throughput of the core calculation. The vector unit has a very large capacity to do work. It can begin executing two or three vector instructions in each cycle, and up to eleven or twelve instructions may be in the execution units concurrently, involving up to 32 vectors. This means quite a lot of data in flight at any given moment.

How is it that all this data can be in motion at once? A common problem in processor design is that some operations take quite a long time. If you waited for one operation to complete before starting the next one on the same unit, it would be quite slow. A solution to this problem is to break down the calculation into many shorter steps, similar to an automobile assembly line. As the clock advances, intermediate calculations are advanced down the assembly line from one pipeline stage to the next until a finished calculation emerges at the end of the pipe. As a calculation moves from stage 1 to stage 2, another instruction can move in behind it and occupy stage 1. With the next tick of the clock, the first instruction moves to stage 3, the second to stage 2 and a third instruction can begin execution in stage 1 and so forth. Were this process to go on long enough, we would achieve a throughput of one instruction completed per cycle, and the pipelines would be full of data undergoing calculation:

The AltiVec unit, the load store unit and the scalar floating point unit are pipelined. The following table details the number of stages in each of these pipelines for the different G4 generations and the G5:

                 VPERM    VSIU    VCIU    VFPU     LSU    scalar FPU
7400 / 7410:     1        1       3       4 (5)    2      3
7450 / 7455:     2        1       4       4 (5)    3      5
970 (G5):        2        2       5       8        4-5    6

In addition, the AltiVec unit is capable of superscalar execution. This means that the processor can start executing instructions on multiple units in the same cycle. The PowerPC 7400 and 7410 can dispatch two instructions per cycle to different execution units. Up to two of these may be vector units. However, only the permute unit is capable of receiving simultaneous dispatch with any of the other three vector units on the 7400 and 7410. If you need to dispatch a complex and a simple vector integer instruction, then they must happen on separate cycles on those processors.

The PowerPC 7450 and 7455 can dispatch (send) up to three instructions per cycle. These must all go to different execution units, any two of which can be vector units. The third may be anything else, including a vector load, store or vec_lvsl / vec_lvsr.

It should be clear now how 11-12 instructions can come to be executing concurrently. If we saturate the VFPU, VCIU and the load/store unit (which has a pipeline length of 3), we can have 11 or 12 instructions in flight at once. As each instruction can involve more than one vector, this could in principle involve up to 39 vectors in motion in a given cycle. In actuality, we only have 32 registers and it is pretty rare to use both the VFPU and VCIU at the same time, so this won't happen. A more likely scenario might be a code segment that keeps the VFPU, VPERM and LSU units busy all at the same time on the PPC 7450/7455. This is fairly typical of what you would see with basic linear algebra functions that deal with unaligned vectors. Similarly, you could have the VCIU, VPERM and LSU all busy in typical functions involving unaligned pixel buffers. Functions of this type could involve up to 27 vectors in flight at any given cycle. (This is derived from 4 vec_madd, 2 vec_perm, and 3 vec_ld executing concurrently.) That is a few vectors short of half a kilobyte! If you write low latency functions that only consume a vector or two of data before returning, you won't be making the most of the vector unit.

The G5 can dispatch 4 instructions per cycle plus one branch instruction. Its dispatch limits are similar to the 7400, however. Only one permute and one VALU instruction may be dispatched per cycle. Up to two vector loads or one vector load and one vector store can also be dispatched that cycle. Latencies are one cycle longer to move data back and forth between the permute unit and the VALU. Loading data to the permute unit takes four cycles. To the VALU takes five.

The key to achieving maximum sustained performance in the vector unit is to keep the execution units busy all the time. For our initial discussion about how to do that, we will provide some examples using the scalar FPU, because it makes for simpler examples. There is only one FPU instead of the four vector units, and scalar floating point code is compact and easily understood. However, as we will show at the end, exactly the same principles apply to the vector unit.

We shall warn you ahead of time that the simple function that we will choose has some serious performance problems, as many simple functions do. It will be a long and difficult road to bring its real world performance up to something that comes close to saturating the FPU. Hopefully the trip will prove educational, and you will have a better idea what sorts of things help and harm processor performance.

 

Data Dependency Stalls

The trouble with pipelines and superscalar execution is that they don't do you any good if you don't give the execution units enough work to keep them busy. Let's take a simple scalar function as an example:

float DotProduct( float a[3], float b[3] )
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

}

Ignoring for the moment the task of loading and storing data and any work required to set up and take down the function's stack frame, this compiles to two fmadds and an fmuls:

fmuls  temp, a0, b0             # Calculate a[0] * b[0]
fmadds temp, a1, b1, temp       # add that to a[1] * b[1]
fmadds result, a2, b2, temp     # add that to a[2] * b[2]

The problem with this is that each instruction depends on the result from the instruction before it, so one instruction can not start executing until the one before it moves all the way through the pipeline. On PowerPC 7450 and 7455, this means 5 cycles for each instruction for a total of 15 cycles to finish this three instruction sequence. During that time, four out of five of the pipeline stages are empty at all times, meaning that the FPU is only doing about 20% of the work that it is capable of doing in that time. Here we depict the 7450/7455's five stage FPU pipeline, and the progress of the instructions as time progresses (cycles 1 - 15).

            Stage 1     Stage 2     Stage 3     Stage 4     Stage 5

cycle 1:    fmuls
cycle 2:                fmuls
cycle 3:                            fmuls
cycle 4:                                        fmuls
cycle 5:                                                    fmuls
cycle 6:    fmadds
cycle 7:                fmadds
cycle 8:                            fmadds
cycle 9:                                        fmadds
cycle 10:                                                   fmadds
cycle 11:   fmadds
cycle 12:               fmadds
cycle 13:                           fmadds
cycle 14:                                       fmadds
cycle 15:                                                   fmadds

Now, suppose instead we had written this function to do five dot products simultaneously:

void DotFive( float a[5][3], float b[5][3], float result[5] )
{
result[0] = a[0][0] * b[0][0] + a[0][1] * b[0][1] + a[0][2] * b[0][2];
result[1] = a[1][0] * b[1][0] + a[1][1] * b[1][1] + a[1][2] * b[1][2];
result[2] = a[2][0] * b[2][0] + a[2][1] * b[2][1] + a[2][2] * b[2][2];
result[3] = a[3][0] * b[3][0] + a[3][1] * b[3][1] + a[3][2] * b[3][2];
result[4] = a[4][0] * b[4][0] + a[4][1] * b[4][1] + a[4][2] * b[4][2];

}

Again, ignoring the problem of loads, stores and stack frames, we now have five fmuls and two sets of five fmadds. Most importantly, we have five independent chains of execution that can cohabit the FPU's five stage pipeline without causing a data dependency stall. This means that we always have work available to fill the FPU's five stage pipeline. Each fmuls or fmadds below is tagged with the number of the dot product it belongs to, so you can see how each dot product progresses through the pipeline: the first dot product is (1), the second (2), and so forth:

            Stage 1     Stage 2     Stage 3     Stage 4     Stage 5

cycle 1:    fmuls(1)
cycle 2:    fmuls(2)    fmuls(1)
cycle 3:    fmuls(3)    fmuls(2)    fmuls(1)
cycle 4:    fmuls(4)    fmuls(3)    fmuls(2)    fmuls(1)
cycle 5:    fmuls(5)    fmuls(4)    fmuls(3)    fmuls(2)    fmuls(1)
cycle 6:    fmadds(1)   fmuls(5)    fmuls(4)    fmuls(3)    fmuls(2)
cycle 7:    fmadds(2)   fmadds(1)   fmuls(5)    fmuls(4)    fmuls(3)
cycle 8:    fmadds(3)   fmadds(2)   fmadds(1)   fmuls(5)    fmuls(4)
cycle 9:    fmadds(4)   fmadds(3)   fmadds(2)   fmadds(1)   fmuls(5)
cycle 10:   fmadds(5)   fmadds(4)   fmadds(3)   fmadds(2)   fmadds(1)
cycle 11:   fmadds(1)   fmadds(5)   fmadds(4)   fmadds(3)   fmadds(2)
cycle 12:   fmadds(2)   fmadds(1)   fmadds(5)   fmadds(4)   fmadds(3)
cycle 13:   fmadds(3)   fmadds(2)   fmadds(1)   fmadds(5)   fmadds(4)
cycle 14:   fmadds(4)   fmadds(3)   fmadds(2)   fmadds(1)   fmadds(5)
cycle 15:   fmadds(5)   fmadds(4)   fmadds(3)   fmadds(2)   fmadds(1)
cycle 16:               fmadds(5)   fmadds(4)   fmadds(3)   fmadds(2)
cycle 17:                           fmadds(5)   fmadds(4)   fmadds(3)
cycle 18:                                       fmadds(5)   fmadds(4)
cycle 19:                                                   fmadds(5)

Given our hypothetical perfect pipeline, in 19 cycles you would be able to complete five dot products! Since 19 is not a lot more than 15, the developer might choose to write his dot product function to always do five at a time, even if he didn't always have five dot products to do. If all you have is two dot products to do, you will still be nearly twice as fast using that function as you would be calling a function that does only one dot product twice. It would also be a constant reminder, every time you use it, that you could be doing more work in parallel.
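As a usage sketch (the values below are placeholders, not from this text), padding a short batch out to five and calling DotFive() once might look like this:

void TwoDotProducts( float *r0, float *r1 )
{
    float a[5][3] = { { 1, 2, 3 }, { 4, 5, 6 } };   // rows 2-4 are zero padding
    float b[5][3] = { { 7, 8, 9 }, { 1, 0, 1 } };
    float r[5];

    DotFive( a, b, r );       // one call computes both real dot products
    *r0 = r[0];
    *r1 = r[1];
}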

More than One Pipeline

What happens when there is more than one pipeline? There are two different kinds of scenarios here that need to be discussed.

Parallel Pipes

When there are two pipelines that do pretty much the same thing -- e.g. the G5's dual floating point units or dual load/store units -- the two pipes operate in parallel, as long as there is enough data to keep them full. Looking at the G5's pair of FPUs, each has a pipeline length of 6. As long as there are 12 (6+6) independent operations able to execute, the pipelines will be full and the FPUs executing to capacity. If the sequential operations are dependent on one another, such that the result of one is required to begin calculation on the next, then the FPUs will be largely underutilized and one might not get used at all. For simplicity's sake, you can think of the dual FPUs as a single double-clocked 12 stage FPU. This ignores some important dispatch limitations, but it is a good start and will help you understand the degree of parallelism required in your code. Most of the examples below focus on the 7450's 5 cycle FPU pipe. To do similar things on a G5, we would want to unroll the loops by 12 instead of 5. Happily, unrolling by 12 will work fine on a G4 or G3 too. Extra parallelism rarely hurts.

Serial Pipeline Limitations

Pipelines can also be organized by code to operate serially with one another. For example, if you do a load, then an add, then a store, you have linked the LSU, the FPU and the LSU again in series to get the job done. This can change the throughput characteristics of the entire system, because the FPU isn't the only potential bottleneck anymore.

To continue with our example, we must ask whether the claim from our theoretical discussion above -- that we can do five dot products in 19 cycles -- is realistic. Can we really get five dot products done in scarcely more time than it takes to do one? Yes, and no. We can stuff the pipelines like that and get a lot more done in the FPU in roughly the same amount of time. However, we can only do that in practice if all the data is in register, ready to enter the pipeline at the moment it is needed. Clearly we have ignored in the discussion so far any load or store instructions needed to get the data in and out of register, as well as any stack setup costs.

What really happens with these two functions when the full picture is taken into account? Below are shown actual execution times for these functions on a PowerPC 7450, when called in a tight loop:

                                      DotProduct()    DotFive()
Latency (time / function call):       0.0025          0.0108
Throughput (dot products / time):     400             462

DotFive() doesn't get close to five times faster, though throughput is improved. In the switch to DotFive(), our overall throughput improved by 15%. If we needed to do large numbers of dot products, DotFive() would be the faster choice. It just was not the nearly four-fold gain (5 x 15 / 19 = ~3.9) we might expect based on the FPU alone, nor does it seem that calling DotFive() to do two dot products is going to be faster than the low latency function design.

What was wrong with our original approach? As it turns out, for every call to fmuls or fmadds that we make, we have to load in two floats using lfs:

Dump of assembler code for function DotProduct:

0x2780 <DotProduct+ 0>: lfs f13,4(r4)
0x2784 <DotProduct+ 4>: lfs f0,4(r3)
0x2788 <DotProduct+ 8>: lfs f1,0(r3)
0x278c <DotProduct+12>: fmuls f0,f0,f13
0x2790 <DotProduct+16>: lfs f12,0(r4)
0x2794 <DotProduct+20>: lfs f11,8(r3)
0x2798 <DotProduct+24>: lfs f13,8(r4)
0x279c <DotProduct+28>: fmadds f1,f1,f12,f0
0x27a0 <DotProduct+32>: fmadds f1,f11,f13,f1
0x27a4 <DotProduct+36>: blr #return to the caller

In addition, in at least the case of DotFive() we have to store out the result as well. There are no store instructions in DotProduct() because we return the result by value in the f1 register when the function exits.

The Load Store Unit (LSU) can achieve a throughput of one load or store per cycle, and the PPC 7450/7455 FPU can do five multiplies or fused multiply-adds every six cycles. Unfortunately, we have to complete two loads before we can start each FPU instruction. Thus, for large numbers of dot products, we can predict that the actual bottleneck would be the Load Store Unit.
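To put numbers on it: each dot product needs six loads but only three FPU instructions, so a call to DotFive() issues 30 loads and 5 stores against just 15 FPU operations. At one LSU operation per cycle, the loads and stores alone need at least 35 cycles -- more than the FPU needs even when its pipeline is kept completely full.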

When you are optimizing code for pipelining, you have to make sure you are optimizing the right pipeline!

However, a quick look at the diagram below reveals that for the single DotProduct() example, the empty FPU pipeline caused by the instruction-to-instruction data dependency seems to remain the largest source of inefficiency, so perhaps the LSU isn't the bottleneck there?


A simplified depiction of the 3 and 5 cycle pipelines of the Load Store Unit and FPU, and their interaction over the course of performing a simple DotProduct(). Over the course of this function, there are 160 total pipeline stages that might be occupied with work. Of these, only 33 are occupied.

The diagram below shows an idealized view of the interaction between the LSU and FPU for the PPC 7450 in the harder working DotFive() function. Because the LSU is the bottleneck here, the FPU is running about half full. Based on this information, we might predict that DotThree() -- a function that did three dot products simultaneously -- would have been sufficient to reach this level of throughput. Doing three dot products at once would have allowed us to make the load to load delay between the first and second pair of values for each dot product six cycles, which is long enough to cover the five cycle latency of the PPC 7450/7455's FPU.

Is that the whole story? No. The keen observer has by now noticed that the pipeline saturation depicted in the two pipeline diagrams suggests that we should see much more than a 42% throughput increase for DotFive(). After all, the DotProduct() example only utilizes ~20% of its pipeline stages, while the DotFive() example utilizes 55%. Clearly this model doesn't tell the whole story. Why don't the theoretical representation and the actual measurements match up?

One could note that we did not show the handful of stores required to write the data back out, meaning that the LSU is somewhat more of a bottleneck than previously presented. The LSU cost per dot product is actually seven instructions for the DotFive() example, rather than six. (The DotProduct() function returns its result by value in register so does not need to store the result.)

Also, there is stack frame creation overhead that we have thus far resisted discussing. DotProduct() is small enough that we can do the entire calculation in the FPU's volatile registers. These don't have to be saved and restored, and so there actually is no stack frame to set up for that little function. DotFive(), on the other hand, consumes enough registers that we may have to set up a stack frame and save/restore some of the non-volatile FPU registers. This extra overhead could be sapping what might otherwise have been a larger performance advantage. We might posit that since the stack frame needs only to be set up once per function call, had we fully committed to a high throughput design that did MANY dot products rather than just five, we might have seen more of an improvement.


A simplified depiction of the pipeline behavior of DotFive(). Of the 296 pipeline segments in play, 165 are full.

To find out what really happened, we examine the code the compiler produced for DotFive(), and if that still doesn't satisfy us, then we can run a SimG4 trace.

DotFive() -- Compiler A:

lfs f4,4(r3)
lfs f5,4(r4)
lfs f11,0(r3)
fmuls f3,f4,f5
lfs f1,0(r4)
lfs f8,8(r3)
lfs f2,8(r4)
fmadds f9,f11,f1,f3
fmadds f7,f8,f2,f9
stfs f7,0(r5)
lfs f10,10(r3)
lfs f12,10(r4)
lfs f5,c(r3)
fmuls f13,f10,f12
lfs f6,c(r4)
lfs f1,14(r3)
lfs f4,14(r4)
fmadds f0,f5,f6,f13
fmadds f3,f1,f4,f0
stfs f3,4(r5)
lfs f2,1c(r3)
lfs f11,1c(r4)
lfs f7,18(r3)
fmuls f8,f2,f11
lfs f9,18(r4)
lfs f4,20(r3)
lfs f6,20(r4)
fmadds f5,f7,f9,f8
fmadds f1,f4,f6,f5
stfs f1,8(r5)
lfs f10,28(r3)
lfs f12,28(r4)
lfs f3,24(r4)
fmuls f13,f10,f12
lfs f11,24(r3)
lfs f0,2c(r4)
lfs f9,2c(r3)
fmadds f2,f11,f3,f13
fmadds f8,f9,f0,f2
stfs f8,c(r5)
lfs f6,34(r3)
lfs f7,34(r4)
lfs f1,30(r3)
fmuls f4,f6,f7
lfs f5,30(r4)
lfs f2,38(r3)
lfs f3,38(r4)
fmadds f0,f1,f5,f4
fmadds f1,f2,f3,f0
stfs f1,10(r5)
blr

DotFive() -- Compiler B:

lfs f1,4(r3)
lfs f0,4(r4)
lfs f2,0(r3)
fmuls f0,f1,f0
lfs f1,0(r4)
lfs f4,8(r3)
lfs f3,8(r4)
fmadds f0,f2,f1,f0
fmadds f0,f4,f3,f0
stfs f0,0(r5)
lfs f1,16(r3)
lfs f0,16(r4)
lfs f2,12(r3)
fmuls f0,f1,f0
lfs f1,12(r4)
lfs f4,20(r3)
lfs f3,20(r4)
fmadds f0,f2,f1,f0
fmadds f0,f4,f3,f0
stfs f0,4(r5)
lfs f1,28(r3)
lfs f0,28(r4)
lfs f2,24(r3)
fmuls f0,f1,f0
lfs f1,24(r4)
lfs f4,32(r3)
lfs f3,32(r4)
fmadds f0,f2,f1,f0
fmadds f0,f4,f3,f0
stfs f0,8(r5)
lfs f1,40(r3)
lfs f0,40(r4)
lfs f2,36(r3)
fmuls f0,f1,f0
lfs f1,36(r4)
lfs f4,44(r3)
lfs f3,44(r4)
fmadds f0,f2,f1,f0
fmadds f0,f4,f3,f0
stfs f0,12(r5)
lfs f1,52(r3)
lfs f0,52(r4)
lfs f2,48(r3)
fmuls f0,f1,f0
lfs f1,48(r4)
lfs f4,56(r3)
lfs f3,56(r4)
fmadds f0,f2,f1,f0
fmadds f0,f4,f3,f0
stfs f0,16(r5)
blr

DotProduct() (for reference):

lfs f1,4(r3)
lfs f0,4(r4)
lfs f2,0(r3)
fmuls f0,f1,f0
lfs f1,0(r4)
lfs f4,8(r3)
lfs f3,8(r4)
fmadds f0,f2,f1,f0
fmadds f1,f4,f3,f0
blr

The first thing we find out is that both compilers just replicated the DotProduct() scheduling five times for DotFive(), rather than scheduling the FPU instructions so that they do not stall. Let us rewrite DotFive() so that it is extremely clear to the compiler what code we want. The new function DotFive2() will explicitly spell out what we want the compiler to do, as sketched below. This way we are at least looking at optimized code. How does it do?
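The actual DotFive2() source is not reproduced here, so the arrangement below is only a sketch of the idea: interleave the five dot products so that consecutive FPU instructions are independent of one another.

void DotFive2( float a[5][3], float b[5][3], float result[5] )
{
    float m0 = a[0][0] * b[0][0];
    float m1 = a[1][0] * b[1][0];
    float m2 = a[2][0] * b[2][0];
    float m3 = a[3][0] * b[3][0];
    float m4 = a[4][0] * b[4][0];

    m0 += a[0][1] * b[0][1];
    m1 += a[1][1] * b[1][1];
    m2 += a[2][1] * b[2][1];
    m3 += a[3][1] * b[3][1];
    m4 += a[4][1] * b[4][1];

    result[0] = m0 + a[0][2] * b[0][2];
    result[1] = m1 + a[1][2] * b[1][2];
    result[2] = m2 + a[2][2] * b[2][2];
    result[3] = m3 + a[3][2] * b[3][2];
    result[4] = m4 + a[4][2] * b[4][2];
}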

                                      DotProduct()    DotFive()    DotFive2()
Latency (time / function call):       0.0025          0.0108       0.0094
Throughput (dot products / time):     400             462          532

The new code is faster (15%) and compiled as we intended but it is still not the huge speedup that we were looking for. (Recall that based on a perfect pipeline model, we were hoping for nearly a factor of five boost in throughput.) So, that is clearly not the complete story. The reader will note that our little optimization was still just a blind guess. We still haven't run the one clear experiment to find out why the two code samples do not perform significantly differently from each other!

To find out what really happens, go to the trace. Sim_G4 will show us exactly how this code executes on a PowerPC 7400 or 7410. (Remember that the FPU pipeline is only three stages on those processors.) Here we see the first two dot products from DotFive() -- the version of the function that was scheduled poorly:

DotFive():

Instruction    Arguments       Cycle: 012345678901234567890123456789

 1 lfs         F1,0x4(R3)             ..IDER........................
 2 lfs         F0,0x4(R4)             ..IIDER.......................
 3 lfs         F3,0x0(R3)             ..IIIDER......................
 4 lfs         F2,0x0(R4)             ..IIIIDER.....................
 5 lfs         F4,0x8(R3)             ...IIIIDER....................
 6 fmuls       F0,F1,F0               ...IIIIDEEFR..................
 7 lfs         F1,0x8(R4)             ....IIIIDEFR..................
 8 fmadds      F0,F3,F2,F0            .....IIIDDDEEFR...............
 9 fmadds      F0,F4,F1,F0            ......IIIIIDDDEEFR............
10 stfs        F0,0x0(R5)             .......IIIIDEEEEEER...........

11 lfs         F1,0x10(R3)            ........IIIIDEFFFFR...........
12 lfs         F0,0x10(R4)            ........IIIIIDEFFFFR..........
13 lfs         F3,0xc(R3)             .........IIIIIDEFFFR..........
14 lfs         F2,0xc(R4)             .........IIIIIIDEFFFR.........
15 lfs         F4,0x14(R3)            ............IIIIDEFFR.........
16 fmuls       F0,F1,F0               ............IIIIIIDEEFR.......
17 lfs         F1,0x14(R4)            .............IIIIIIDEFR.......
18 fmadds      F0,F3,F2,F0            ..............IIIIIIDDEEFR....
19 fmadds      F0,F4,F1,F0            ...............IIIIIIIDDDEEFR.
20 stfs        F0,0x4(R5)             ................IIIIIIDEEEEEER
.
.
.
A Sim_G4 trace of the execution of the early part of DotFive(). Each instruction goes through four stages: Instruction Fetch (I), Dispatch (D), Execution (E or F) and finally Completion (R).

In this single trace, we find the complete story. So far as the instruction order is concerned, the first ten instructions are the first dot product. The second ten are the second dot product. The third ten instructions, if they were shown, would be the third dot product, and so forth. When talking about the order of appearance of instructions, the dot products don't intermingle. This does not mean however that the first dot product has to finish before the second one can start, even though they reuse the same registers.

You can clearly see that they do in fact overlap as they execute: The first load from the second dot product (instruction 11) was able to begin execution (regions containing 'E' or 'F') before the first dot product even started its second Multiply-Add Fused (MAF, instruction 9). In addition, though instructions 9, 10 and 12 (the last MAF and store from the first dot product, and the second load from the second dot product) all use F0, they all executed at nearly the same time, completing only one cycle apart. Thus, so far as the timing of instruction execution is concerned, there is a very high degree of overlap between the dot products, meaning that the LSU is a lot more full than the earlier simple models would imply. This is not to say that LSU usage is 100%. There is a two cycle "stall" between instructions 7 and 10 during which we begin executing no instructions in the LSU. This happens again after instructions 15 and 17.

The reason our optimizations are not having a larger effect is that the original unoptimized code was executing a lot more efficiently than we originally posited based on the simple pipeline models presented earlier. Were these models wrong? No, not so far as they go. However, they did not take into account the interaction between multiple dot products called in series.

Significantly, it is now clear that even with single dot products dispatched one after the other, the primary factor determining speed is not FPU usage. It is LSU throughput. When we reorganize the usage of the FPU so that we don't see stalls there, we force fewer LSU pipeline stalls. This is why the DotFive2() function is faster than DotFive(). In DotFive2(), because we are not stalling in the FPU, we don't have any long load to store dependencies causing trouble in the LSU pipeline.

In the end, our ability to calculate dot products quickly rests almost entirely on our ability to keep the load store unit busy.

You can see below why the DotFive2() function is more efficient. The code contains 35 LSU operations and 15 FPU operations, just as DotFive() does. Both the FPU and LSU can accept and complete one instruction per cycle, and they can do so at the same time -- the PPC 7400 and 7410 can dispatch and complete two instructions per cycle. This means that so long as the FPU instructions do not stall the LSU due to a data dependency or rename register depletion, the FPU operations have no effect on the throughput of the function. This is because the largest linear instruction chain is in the LSU, and the two chains are sufficiently independent. Therefore, the execution time depends on how quickly we can get the 35 LSU operations through the pipeline. In DotFive2(), we are able to begin a load or store virtually every cycle, except after those cycles in which we run out of rename registers on the PowerPC 7400 and 7410 (see below). We know we run out of renames here because Sim_G4's vertical scroll pipe layout for the same code (not shown) tells us so. (We will show an example of the vertical scroll pipe later.) In the DotFive() function, our FPU stalls were causing LSU stalls, meaning that it took longer to push the data in and out of the CPU.

DotFive2():

Instruction    Arguments          Cycle: 0123456789012345678901234567890123456789012345

 4:lfs         F0,0x0(R3)                IDER..........................................
 5:lfs         F3,0x0(R4)                IIDER.........................................
 6:lfs         F2,0x4(R3)                IIIDER........................................
 7:lfs         F5,0x4(R4)                IIIIDER.......................................
 8:lfs         F1,0x14(R3)               .IIIIDER......................................
 9:fmuls       F8,F0,F3                  .IIIIDEEFR....................................
10:lfs         F0,0x14(R4)               ..IIIIDEFR....................................
11:lfs         F4,0x8(R3)                ...IIIIDEFR...................................
12:lfs         F6,0x8(R4)                ....IIIIDER...................................
13:fmuls       F9,F2,F5                  .....IIIDEEFR.................................
14:lfs         F3,0x18(R3)               ......IIIDEFR.................................
15:lfs         F2,0x18(R4)               ......IIIIDEFR................................
16:fmadds      F8,F1,F0,F8               .......IIIDEEFR...............................
17:lfs         F5,0xc(R3)                ........IIIDEFR...............................
18:lfs         F7,0xc(R4)                .........IIIDEFR..............................
19:fmuls       F10,F4,F6                 .........IIIIDEEFR............................
20:lfs         F1,0x28(R3)               ..........IIIDEFFR............................
21:lfs         F0,0x28(R4)               ...........IIIDEFFR...........................
22:fmadds      F9,F3,F2,F9               ...........IIIIDEEFR..........................
23:lfs         F3,0x1c(R3)               ............IIIDEFFR..........................
24:lfs         F2,0x1c(R4)               .............IIIDEFFR.........................
25:fmuls       F11,F5,F7                 ..............IIIIDEEFR.......................
26:lfs         F4,0x10(R3)               ..............IIIIDEFFR.......................
27:lfs         F5,0x10(R4)               ...............IIIIDEFFR......................
28:fmadds      F8,F1,F0,F8               ................IIIIDEEFR.....................
29:lfs         F1,0x2c(R3)               ................IIIIDEFFR.....................
30:lfs         F0,0x2c(R4)               .................IIIIDEFFR....................
31:fmadds      F10,F3,F2,F10             ...................IIIIDEEFR..................
32:lfs         F3,0x20(R3)               ...................IIIIDEFFR..................
33:lfs         F2,0x20(R4)               ....................IIIIDEFFR.................
34:fmuls       F12,F4,F5                 .....................IIIIDEEFR................
35:lfs         F7,0x24(R3)               .....................IIIIDEFFR................
36:lfs         F6,0x24(R4)               ......................IIIIDEFFR...............
37:fmadds      F9,F1,F0,F9               ........................IIIIDEEFR.............
38:lfs         F5,0x30(R3)               ........................IIIIDEFFR.............
39:lfs         F4,0x30(R4)               .........................IIIIDEFFR............
40:fmadds      F11,F3,F2,F11             ..........................IIIIDEEFR...........
41:lfs         F3,0x34(R3)               ..........................IIIIDEFFR...........
42:lfs         F2,0x34(R4)               ...........................IIIIDEFFR..........
43:fmadds      F12,F7,F6,F12             .............................IIIIDEEFR........
44:lfs         F1,0x38(R3)               ..............................IIIDEFFR........
45:lfs         F0,0x38(R4)               ..............................IIIIDEFFR.......
46:fmadds      F10,F5,F4,F10             ...............................IIIIDEEFR......
47:fmadds      F11,F3,F2,F11             ...............................IIIIIDEEFR.....
48:stfs        F8,0x0(R5)                ................................IIIIDEEEER....
49:fmadds      F12,F1,F0,F12             ..................................IIIDEEFR....
50:stfs        F9,0x4(R5)                ..................................IIIDEEEER...
51:stfs        F10,0x8(R5)               ...................................IIIDDDEER..
52:stfs        F11,0xc(R5)               ....................................IIIDDDEER.
53:stfs        F12,0x10(R5)              .....................................IIIIDDEER
A Sim_G4 trace of the execution of DotFive2(). Each instruction goes through four stages: Instruction Fetch (I), Dispatch (D), Execution (E or F) and finally Completion (R).

Write Non-trivial Functions

It isn't really that satisfying to leave our function limited by the performance of the LSU. It is hard to get multiple-GigaFLOP performance if you leave the FPU largely idle. What can we do to fix that problem? Unfortunately, there is not much we can do to change the fact that we have to load two pieces of data for every FPU calculation that we do in the DotProduct() function. That is just what dot products do. Therefore, all we can do is change the function to do some other calculation that does more work in the FPU per piece of data loaded.

It is often not hard to figure out what to do. Look at the caller. Find out what it is using the dot product for and merge that operation into the function, either directly or by relying on the compiler to inline it for you. A dot product is fundamentally too simple to function efficiently -- it doesn't process enough data to keep the processor busy. A larger, more complex function can, however. This is a consistent trend that you will likely find in your code. Small simple functions are often the worst performers.

We already described a situation where load throughput issues were causing performance problems in the Memory Performance section (that time due to limited throughput from loads that miss the caches). This is another. The solution to both is the same. Write less trivial functions so that you have enough work to do to fill the spare cycles.

What we will seek to do is find ways to add more FPU instructions into the function without significantly worsening the load on the LSU. Because our dot product FPU pipeline is running less than half full, we hope that we will be able to squeeze much of the additional work into the available pipeline holes, thereby getting some or all of the work for free! Because we are doing the additional work now when it is free rather than later when we have to pay for it, we hope to see a net performance win in the application.

Another way one might look at this is that we removed some load/store overhead. However this interpretation understates the gains we make using this sort of optimization. Because the LSU is the bottleneck, by removing these loads and stores, we not only avoid the LSU overhead, but also remove the downstream inefficiencies caused by the bottleneck. Thus, the net performance win is much more than the time cost recovered by removing the extra instructions in the LSU alone.

Our Example Continues...

Let us suppose that the reason we wanted to use the dot product was to calculate a Doppler effect for 3D audio for a game. To calculate the Doppler effect, we need to know how fast two objects are approaching each other. We want to find the speed S along the vector (D) between the listener and sound source. If the listener is at the origin and is not moving, this is calculated from the velocity (Vn) of the noise and the distance (D) as follows:

S = (D dot Vn) / |D| = (D dot Vn) / sqrt( D dot D )

In a real program, we probably could have gone on to use that result to further calculate audio pitch deltas for each sound using the speed S along the D vector and perhaps audio gain settings, but this unnecessarily complicates the example. We have enough material to look at already.

This problem is somewhat close to the problem of vector normalization that we described in the Data Handling section. Since we are still working in the scalar domain, and we are talking about filling pipelines, the routine we will present here uses all of the same ideas as the faster scalar vector normalization routine proposed there but not implemented. Specifically, division does not pipeline in the FPU. Also, the sqrt() C standard library routine will break up our pipelines, and in this particular case sqrt() is unnecessarily precise -- we only need about 8 bits of precision to satisfy the Sound Manager volume functions. So that we can show full pipeline use through the entire function, we will use the frsqrte instruction to obtain a limited precision estimate of the reciprocal square root of a value, with an accuracy of about 5 bits, and then refine that further using a Newton-Raphson step to get a higher precision answer. (WARNING: frsqrte is not present on the PowerPC 601!)
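For reference, the refinement used below is the standard Newton-Raphson iteration for the reciprocal square root, y' = y + 0.5 * y * (1 - x * y * y), which roughly doubles the number of good bits per step -- taking frsqrte's ~5 bits to the ~10 bits we need.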

Since the Sound Manager volume is quantized to 256 volume levels (provided we avoid Spinal Tap volumes), the full 23 bits of single precision accuracy are not strictly necessary, and so we will only go through one round of Newton-Raphson, providing about 10 bits of accuracy. We will rely on the compiler to properly schedule the instructions to fill as many pipeline holes as it can. We also rely on it to throw away some of the registers we declared, because we really don't need them all the whole time. This saves some stack overhead for saving and later restoring them at the beginning and end of the function:

void CalcSpeed( float velocities[][3], float distances[][3], float *results, int count )
{

register float vX1, vY1, vZ1; //velocity 1
register float vX2, vY2, vZ2; //velocity 2
register float vX3, vY3, vZ3; //velocity 3
register float dX1, dY1, dZ1; //distance 1
register float dX2, dY2, dZ2; //distance 2
register float dX3, dY3, dZ3; //distance 3
register float DD1, DD2, DD3; //D dot D, 1 through 3
register float DV1, DV2, DV3; //D dot V, 1 through 3
register float rsqrt1, rsqrt2, rsqrt3;
register float one = 1.0;
register float oneHalf = 0.5;
register float *vel = (float*) velocities;
register float *pos = (float*) distances;

//Process three sounds per loop
for( ; count >= 3; count -= 3 )
{

//Load X, Y and Z distance components for three
//input distances from the listener
dX1 = pos[0];
dY1 = pos[1];
dZ1 = pos[2];
dX2 = pos[3];
dY2 = pos[4];
dZ2 = pos[5];
dX3 = pos[6];
dY3 = pos[7];
dZ3 = pos[8];
pos += 9;

//Load X, Y and Z velocity components for
//three input velocities
vX1 = vel[0];
vY1 = vel[1];
vZ1 = vel[2];
vX2 = vel[3];
vY2 = vel[4];
vZ2 = vel[5];
vX3 = vel[6];
vY3 = vel[7];
vZ3 = vel[8];
vel += 9;

//Calculate D dot V for each of the three vectors
DV1 = vX1 * dX1 + vY1 * dY1 + vZ1 * dZ1;
DV2 = vX2 * dX2 + vY2 * dY2 + vZ2 * dZ2;
DV3 = vX3 * dX3 + vY3 * dY3 + vZ3 * dZ3;

//Calculate D dot D for each of the three vectors
DD1 = dX1 * dX1 + dY1 * dY1 + dZ1 * dZ1;
DD2 = dX2 * dX2 + dY2 * dY2 + dZ2 * dZ2;
DD3 = dX3 * dX3 + dY3 * dY3 + dZ3 * dZ3;

//Calculate (D dot D)^-0.5
//First get an estimate of the reciprocal square root
rsqrt1 = __frsqrte( DD1 );
rsqrt2 = __frsqrte( DD2 );
rsqrt3 = __frsqrte( DD3 );

//Do one round of Newton-Raphson refinement
//because the results from frsqrte are
//only accurate to one part in 32. This should
//give us about one part in 1024 accuracy, which
//is enough for audio volume for which we only need
//one part in 256 for the Sound Manager
rsqrt1 += oneHalf * rsqrt1 * ( one - DD1 * rsqrt1 * rsqrt1 );
rsqrt2 += oneHalf * rsqrt2 * ( one - DD2 * rsqrt2 * rsqrt2 );
rsqrt3 += oneHalf * rsqrt3 * ( one - DD3 * rsqrt3 * rsqrt3 );

//store out the result
//result = (D dot V) * 1.0 / sqrt( D dot D)
(results++)[0] = DV1 * rsqrt1;
(results++)[0] = DV2 * rsqrt2;
(results++)[0] = DV3 * rsqrt3;

}

// Here we would handle the leftovers -- there could be
// 1 or 2 vectors left unprocessed by the above loop.
// That code is not shown, for brevity and clarity.

}

The vector version of this function uses vec_rsqrte() in place of frsqrte, which provides 12 bits of accuracy. That is more than enough for our purposes, so we don't need to bother with Newton-Raphson in the vector case.

Scalar Results

How did we do? It's a little difficult to run a side by side comparison between the scalar code and some other imaginary preexisting code in a non-existent game. However, we can verify that we did a better job of filling the FPU pipelines by returning again to the Sim_G4 trace. Because we show the vector code for this next, we can guess that the 7400, which SimG4 models, will never actually see this particular bit of scalar code. However, for single precision floating point the 7400 FPU is substantially like the G3, so it is a reasonable model to use. Here is a SimG4 vertical scroll pipe trace for the core function loop in execution:

FPU Instructions                 Other Instructions         Comment

(bc+ from last loop)             1354:lfs F13,0x4(R4)       Maximum Allowed Dispatched
1356:fmuls F1,F13,F13            1355:lfs F21,0x10(R4)      Maximum Allowed Dispatched
                                                            CB Full
                                 1357:lfs F22,0x0(R4)       CB Full
                                 1358:lfs F12,0x4(R3)       CB Full
1359:fmuls F2,F21,F21                                       Maximum Rename FPR's Reached
                                 1360:lfs F4,0xc(R4)        Double Load/Store
                                 1361:lfs F7,0x1c(R4)       Maximum Rename FPR's Reached
1362:fmadds F8,F22,F22,F1        1363:lfs F3,0x8(R4)        Maximum Allowed Dispatched
1364:fmadds F2,F4,F4,F2                                     Maximum Rename FPR's Reached
1365:fmuls F1,F7,F7              1366:lfs F6,0x18(R4)       Maximum Allowed Dispatched
                                                            FPU Unit Busy
1367:fmadds F28,F3,F3,F8                                    Maximum Rename FPR's Reached
                                 1368:lfs F5,0x14(R4)       Double Load/Store
1370:fmuls F13,F12,F13           1369:lfs F8,0x20(R4)       Maximum Allowed Dispatched
1371:fmadds F1,F6,F6,F1          1372:lfs F31,0x10(R3)      Maximum Allowed Dispatched
                                                            Maximum Rename FPR's Reached
1373:fmadds F30,F5,F5,F2         1374:lfs F26,0x0(R3)       Maximum Allowed Dispatched
1375:frsqrte F9,F28                                         Maximum Rename FPR's Reached
                                 1376:lfs F23,0x1c(R3)      Maximum Rename FPR's Reached
1377:fmadds F20,F8,F8,F1         1378:lfs F29,0xc(R3)       Maximum Allowed Dispatched
                                                            Maximum Rename FPR's Reached
1379:frsqrte F10,F30             1380:lfs F25,0x8(R3)       Maximum Allowed Dispatched
1381:frsp F9,F9                  1382:lfs F24,0x18(R3)      Maximum Allowed Dispatched
                                                            Maximum Rename FPR's Reached
1383:frsqrte F11,F20             1384:lfs F1,0x14(R3)       Maximum Allowed Dispatched
                                                            Maximum Rename FPR's Reached
1385:frsp F10,F10                1386:lfs F2,0x20(R3)       Maximum Allowed Dispatched
1387:fmuls F28,F28,F9            1388:addi R4,R4,0x24       Maximum Allowed Dispatched
1389:frsp F11,F11                1390:addi R3,R3,0x24       Maximum Allowed Dispatched
                                                            FPU Unit Busy
1391:fmuls F12,F31,F21                                      FPU Unit Busy
1392:fmadds F13,F26,F22,F13                                 FPU Unit Busy
1393:fmuls F30,F30,F10                                      FPU Unit Busy
                                                            FPU Unit Busy
1394:fmadds F12,F29,F4,F12                                  FPU Unit Busy
1395:fmuls F7,F23,F7                                        FPU Unit Busy
1396:fmadds F3,F25,F3,F13                                   FPU Unit Busy
                                                            FPU Unit Busy
1397:fmuls F26,F0,F9                                        FPU Unit Busy
1398:fnmsubs F28,F9,F28,F27                                 FPU Unit Busy
1399:fmuls F31,F20,F11                                      FPU Unit Busy
                                                            FPU Unit Busy
1400:fmadds F4,F24,F6,F7                                    FPU Unit Busy
1401:fmadds F9,F26,F28,F9                                   FPU Unit Busy
1402:fmuls F29,F0,F10                                       FPU Unit Busy
                                                            FPU Unit Busy
1403:fnmsubs F30,F10,F30,F27                                FPU Unit Busy
1404:fmuls F6,F3,F9                                         FPU Unit Busy
1405:fmadds F3,F1,F5,F12                                    FPU Unit Busy
                                                            FPU Unit Busy
1406:fmuls F13,F0,F11                                       FPU Unit Busy
1407:fnmsubs F7,F11,F31,F27      1408:stfs F6,0x0(R5)       Maximum Allowed Dispatched
1409:fmadds F10,F29,F30,F10                                 FPU Unit Busy
                                                            FPU Unit Busy
1410:fmadds F1,F2,F8,F4                                     FPU Unit Busy
1411:fmadds F11,F13,F7,F11                                  FPU Unit Busy
1412:fmuls F2,F3,F10                                        FPU Unit Busy
                                                            FPU Unit Busy
1413:fmuls F1,F1,F11             1414:stfs F2,0x4(R5)       Maximum Allowed Dispatched
1415:stfs F1,0x8(R5)             1416:addi R5,R5,0xc        Maximum Allowed Dispatched

In the above table, pipeline holes in the FPU appear as lines in the left column that are blank. The comments column names the various stalls we encountered along the way. Overall, we have done very well -- better than it might appear at first glance. The following section details each of the stalls that we see.

Common Stalls You Will See in SimG4 Traces:

The first comment we see is "Maximum Allowed Dispatched." The PPC 7400 can only dispatch two instructions per cycle to all of the execution units in total. It can also only retire two per cycle. The PowerPC 7450/7455 can do three of each. When you see "Maximum Allowed Dispatched," you know that the processor is operating at maximum efficiency. This is not a stall.

The "CB Full" stalls near the beginning happen because the Instruction Completion Queue (ICQ or IC queue) is full. The PowerPC 7400 has an 8 entry IC queue used to serialize the PowerPC out-of-order core post-execution. Before an instruction enters execution in any unit, it has to acquire a slot at the tail of the instruction completion queue. The instruction cannot be retired until it reaches the front of the queue. (The queue operates in first-in, first-out order, and is there to make sure instructions complete in the order that they were issued to execution units.) In this case, some store operations from the preceding loop iteration are sitting at the head of the queue. They haven't finished executing yet (loads and stores can take a long time) and so the items behind them in the queue can't be retired. This causes the eight entry queue to fill up, meaning that no more instructions can enter execution. As a result, some of our initial lfs instructions are delayed by a cycle or two until a slot in the ICQ opens up for them.

Another thing you have to acquire before an instruction can enter execution is a rename register, which helps manage the process of writing your results to register and helps forward the data to other instructions that might need it immediately. The PowerPC 7400/7410 have 6 integer rename registers, 6 vector rename registers and 6 floating point rename registers. Since we can have 8 instructions in flight in the ICQ, it is possible that if nearly all of these write to any one register file we will run out of rename registers for that register file. If your code strictly uses the FPU, this can't happen, because the FPU pipelines are not long enough to saturate the rename registers. However, towards the beginning of the function we have both the FPU and the load store unit writing to the FP register file, so we do in fact run out of rename registers. This is where we see "Maximum Rename FPR's Reached." We suffer a few pipeline bubbles in both the LSU and FPU while we are in the early part of the loop and still using both units.

Finally we see a number of instructions labelled "FPU Unit Busy". All this is telling us is that our FPU pipeline is saturated and there isn't anything else that could be scheduled in a different execution unit. This is what we intended to happen. It just means we are using the FPU to its fullest extent. The stall every third FP instruction happens because of some internal design elements that are part of the 7400. The PowerPC 7450 does not suffer from this problem quite as much, nor do we see it in the integer units or the vector units.

One other optimization note: the compiler has added some frsp instructions that don't need to be there. These round the result from our reciprocal square root estimate to single precision. As the estimate itself has a real precision significantly less than single precision, this artifact of the C language is unproductive in our case. You can remove them by using the following frsqrtes() inline function instead of the __frsqrte() intrinsic we used:

inline float frsqrtes(float in)
{
register float out;
asm { frsqrte out,in }
return out;

}

 

Scalar Performance Discussion

Our new function takes a standard dot product (D dot V) and appends a lot more work onto it, by adding a vector normalization, division and a second dot product using the same small set of data. What kinds of performance advantages did this buy us in the end?

In the Sim_G4 results above, nine of the 38 FPU instructions correspond to what would be issued by a dot product function that does three dot products; the other 29 are new. Our changes introduced no new load/store instructions. We simply do more with the data while it is in register. As a result, a routine that was load/store bound is now primarily FPU bound again, which means our calculation throughput is approaching 1 FPU instruction per cycle. One can see this effect by comparing the percent utilization of the FPU for an optimized high-throughput DotProduct2() function vs. CalcSpeed():

                    FPU utilization
DotProduct2():      ~30%
CalcSpeed():        89%

We can also see this in terms of raw execution times. The function CalcSpeed() does over four times as much work in the FPU as does DotProduct() but does not take four times as long:

                           DotProduct()    CalcSpeed()    ratio
Execution Time (sec):      0.72            1.47           1 : 2.04

The timings and the SimG4 trace above both suggest that by increasing the complexity of the calculation by a factor of four, we were able to saturate the FPU, saving quite a bit of time. Actual execution times only went up by a factor of two. Recall that this new function actually does two dot products. Our benchmarks show that calling DotProduct() twice takes about as long as calling CalcSpeed() once, so by this simple metric, it seems that we were able to get the rest of the CalcSpeed() function -- the reciprocal estimate and Newton-Raphson refinement (the square root and the divide) -- for free!

Vector Results

We also present a similar trace for the vectorized version of the code. Compared to the scalar implementation, we save some time because we don't have to do a Newton-Raphson optimization step, but we have to surrender some time dealing with the interleaved float[3] data format, a format highly inconvenient for the vector unit.

                           scalar version    vector 1    vector 2
Execution Time (sec):      1.47              0.45        0.33

The first vector version, vector 1, takes the same interleaved {x, y, z} data format that the scalar code used. The second vector version, vector 2, uses a vector-friendly planar data format instead, which requires no on-the-fly internal format conversions before the data can be processed. This second variety is for those who are curious what these interleaved formats sometimes cost. In this case, 27% of our time goes to dealing with swizzling, and we can expect a 36% performance increase if we switch to planar data layouts.
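For the curious, here is a hedged sketch of what a planar-format core loop (the vector 2 variety) might look like. The real CalcSpeedVec() source is not shown in this text, so the function name, argument layout and intrinsic usage below are illustrative only; the arrays are assumed to be 16-byte aligned and count a multiple of four.

#include <altivec.h>

void CalcSpeedVecPlanar( const float *dX, const float *dY, const float *dZ,
                         const float *vX, const float *vY, const float *vZ,
                         float *results, int count )
{
    vector float zero = (vector float) vec_splat_u32(0);
    int i;

    for( i = 0; i < count; i += 4 )
    {
        // One component of four sounds per load -- no swizzling required
        vector float Dx = vec_ld( 0, dX + i );
        vector float Dy = vec_ld( 0, dY + i );
        vector float Dz = vec_ld( 0, dZ + i );
        vector float Vx = vec_ld( 0, vX + i );
        vector float Vy = vec_ld( 0, vY + i );
        vector float Vz = vec_ld( 0, vZ + i );

        // D dot V and D dot D for four sounds at once
        vector float DV = vec_madd( Dz, Vz, vec_madd( Dy, Vy, vec_madd( Dx, Vx, zero ) ) );
        vector float DD = vec_madd( Dz, Dz, vec_madd( Dy, Dy, vec_madd( Dx, Dx, zero ) ) );

        // ~12 bit reciprocal square root estimate -- plenty for Sound Manager volumes
        vector float rsqrt = vec_rsqrte( DD );

        // S = (D dot V) / sqrt( D dot D )
        vec_st( vec_madd( DV, rsqrt, zero ), 0, results + i );
    }
}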

The four-fold parallelism in the vector unit lets us achieve throughput 3.3 to 4.2 times higher than what we were able to get in the scalar units.

Earlier we promised to show that the same processor features that lead to enhanced scalar performance are also at work in the vector unit. As a result, we now ask, how much better is this than the vectorized version of DotProduct() written using a low-latency design?

//An AltiVec version of DotProduct()
// distance holds { x, y, z, 0 } of the distance
// velocity holds {x, y, z, 0 } of the velocity
float SimpleVectorDotProduct( vector float distance, vector float velocity )
{

vector float zero = (vector float) vec_splat_u32(0);
float result;

//Do the multiplications
vector float D_dot_V = vec_madd( distance, velocity, zero );

//Add across the vectors
D_dot_V = vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 8 ) );
D_dot_V = vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 4 ) );

//Copy it to the FPU, so we can return it in the FPU register
vec_ste( D_dot_V, 0, &result );

return result;

}

Here is the performance comparison between various flavors of the DotProduct() function. The first one, DotProduct(), is the simple one-line scalar implementation first presented. The second one, DotProduct2(), is the same idea except that we move the loop inside the function so that it only has to be called once -- we get rid of stack overhead and some of the FPU pipeline problems this way. The vector version is the code immediately above. Each does forty dot products.

                              Execution Time (microseconds to process 40 data)
DotProduct():                 0.99
DotProduct2():                0.72
SimpleVectorDotProduct():     2.25

What?! The low latency vector routine is slower? Yes, it is. Perhaps you might think we handicapped it by forcing the routine to return a float. Let's fix that:

vector float VectorVectorDotProduct( vector float distance, vector float velocity )
{

vector float zero = (vector float) vec_splat_u32(0);

//Do the multiplications
vector float D_dot_V = vec_madd( distance, velocity, zero );

//Add across the vectors
D_dot_V = vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 8 ) );
return vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 4 ) );

}

Here is the performance comparison between various flavors of the DotProduct function again.

                              Execution Time (microseconds to process 40 data)
DotProduct():                 0.99
DotProduct2():                0.72
SimpleVectorDotProduct():     2.25
VectorVectorDotProduct():     1.68

Clearly that wasn't the whole problem! Just to drive the point home, this is the speed difference between CalcSpeedVec() and the same calculation done using SimpleVectorDotProduct():

                                               Execution Time (microseconds to process 40 data)
CalcSpeedVec():                                0.45
float dd = SimpleVectorDotProduct( d, d );
float dv = SimpleVectorDotProduct( d, v );
results = dv / sqrt( dd );                     14.8

Why is the low latency vector dot product itself 3.7 times slower than CalcSpeedVec(), a function that calculates twice as many dot products and does a divide and half precision square root as well, and roughly 33 times slower for the entire operation?

Vector Performance Discussion

The problem with tiny low-latency vector functions is just as we described above for scalar functions, except worse because vector overhead is higher. We spend so much time dealing with overhead (setting up vrsave and coping with data layouts) that the overhead itself becomes the dominant time consumer in the function. If we look at it, we find that VectorVectorDotProduct() does very little right. The low-latency design forces us to do so many things wrong that performance is destroyed.

vector float VectorVectorDotProduct( vector float distance, vector float velocity )
{

vector float zero = (vector float) vec_splat_u32(0);

To begin with, we see above that we have to spend a cycle or two generating the zero constant, and (unseen) we also spend some more time setting up the vrsave special purpose register.

//Do the multiplications
vector float D_dot_V = vec_madd( distance, velocity, zero );

Next, the multiplication operation is only 3/4 efficient, because there is a do-nothing element in the fourth position on each vector. We waste a little time there. Also, because the VFPU pipeline is 4 stages long, we take a three cycle stall while we wait for vec_madd to complete before the next instruction can start.

//Add across the vectors
D_dot_V = vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 8 ) );
return vec_add( D_dot_V, vec_sld( D_dot_V, D_dot_V, 4 ) );

We now have to sum across the vector. The problem is that because we are using a non-uniform vector, the Y and Z components are in the wrong place to add to the X component. Thus, we have to do two left rotates to fix the problem. In addition, because this routine is again data starved, we incur a total of six more cycles worth of stalls after the adds, before we can return the data. Also, the two adds are doing redundant work -- there is less and less parallelism in register as this stage nears completion.

}

Finally, the computer also has to restore the vrsave special purpose register.

The good news is that the function does complete from start to finish about as quickly as one could imagine doing a single dot product in the vector unit. However, it seems that in the end, optimizing for latency alone costs us dearly in overall performance. We can make a somewhat boastful claim of a factor of 33 performance improvement between the low latency vector implementation and the high-throughput one.

It should be clear that for proper performance of the vector unit, it is very important to make sure you give it enough work to keep it busy! More data will help fill the pipelines, and more data beyond that will help reduce the fractional cost of the function overhead.

Compilers and Inlining

The skeptical reader may have realized that much of the work we show here may normally be expected to be done by the compiler when it inlines small utility functions into larger routines. Much of what we have done above is exactly that. Unfortunately, compilers do not always inline things -- one did, one didn't when preparing these examples -- and even when they do, they are sometimes unclever about how they do it.

A compiler could in principle take a low latency function found in a loop, inline it, unroll the loop to cover the pipeline lengths and schedule everything to avoid pipeline bubbles and move one-time calculations out of the loop, thereby converting a low latency C function into high throughput assembly automatically for you. For whatever reason, this didn't happen in the examples presented here, so the improvements from inlining were modest. That is why it makes sense to verify the compiler is doing what you expect with performance sensitive code.

In addition, often you might need to merge code horizontally between two functions called one after the other that operate on the same buffer. Sometimes these code segments may come from parts of the application called at very different times. Few compilers yet implement features like loop fusion or the high level of analysis required to make code move large distances. Thus, with today's compilers the notion of high throughput vs. low latency design for your functions remains important for proper performance in both scalar code and AltiVec.
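As a small illustration of the idea (the names here are made up for the example), merging two passes over the same buffer by hand might look like this, so the data only moves through the LSU and the caches once:

float ScaleAndSum( float *buf, int n, float gain )
{
    float sum = 0.0f;
    int i;

    // Previously: one loop to scale the buffer, then a second loop to sum it.
    // Fused: both operations ride along on a single trip through the data.
    for( i = 0; i < n; i++ )
    {
        buf[i] *= gain;
        sum += buf[i];
    }
    return sum;
}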

Summary

The PowerPC processor is capable of dispatching multiple instructions per cycle into many parallel execution units that have pipelines many cycles deep. To saturate this system requires much more data than is typically passed into small functions. In some cases, an AltiVec function is capable of operating on as much as half a kilobyte of data at any given time. Thus, low-latency designs often leave the processor data starved and performance suffers greatly.

In order to overcome these problems, the high throughput design must seek to keep ALU pipelines full. In many cases, this may mean that the functions themselves must be redesigned to do non-trivial amounts of work and operate in concert with better temporal locality of data reference. The end result can be a many-fold rate acceleration in the operation of the function. This is true even of code that uses only the scalar units.

 
