
Code Optimization

Micro-optimization of AltiVec code is one of the last steps in the optimization process, because the performance wins from micro-optimization are usually small. It should be the last thing you do, if you do it at all. Long before tackling micro-optimization, you should first make sure that you are using the right algorithm. The performance difference between a good algorithm and a bad one can be many thousandfold. Next, read the earlier sections about SIMD and data handling. The wrong data handling methods or poor cache usage can easily prevent optimally written code from performing well. Poor data use strategies are the most common reason why naively written AltiVec accelerated functions fail to outperform their scalar counterparts, so it is important to address those problems first. No amount of code micro-optimization will help code that is memory bandwidth limited. Finally, once these things are dealt with, check that your code makes proper use of the processor's pipelines and superscalar execution. As we show in the High Throughput vs. Low Latency section, poor design in these areas can make your code run many times slower -- one example there runs 33 times slower than its pipeline saturating counterpart.

Listed below are some general micro-optimization themes that will help improve the performance of CPU bound calculations. You may also find hints in the Algorithms section for dealing with specific problems such as square root, divide, permutes, alignment, etc.

Work Maximization

Some AltiVec instructions do more than others. Why use vec_mule() to just multiply four shorts, when you can use vec_mradds() to multiply eight shorts, right shift them 15 bits (with rounding) and then add a vector full of eight shorts to them in the same amount of time? AltiVec provides 17 forms of integer multiplication. Using the right one can save you quite a bit of time.
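For instance, a single vec_mradds() stands in for what would otherwise be a separate multiply, a rounding shift and an add. The sketch below is not from the original text; the function name and the Q15 fixed point framing are illustrative assumptions, but it shows the instruction doing all three steps for eight signed shorts at once:

//Illustrative sketch: scale eight Q15 samples by eight Q15 gains and add
//a bias, all in one vec_mradds: ((samples * gain + 0x4000) >> 15) + bias,
//with saturation.
vector signed short scale_and_accumulate( vector signed short samples,
                                          vector signed short gain,
                                          vector signed short bias )
{
    return vec_mradds( samples, gain, bias );
}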

For example, among all of AltiVec's 17 flavors of integer multiplication, there is no instruction that does the vector equivalent of mullw -- the low 32 bits of the result of a 32 x 32 bit multiplication. The closest is vec_mladd, which uses 16 bit integers. You can solve this problem in software by breaking each word down into halfwords and then doing the multiplication according to the following decomposition:

Represent 32-bit quantities X and Y as two pairs of shorts, A & B and C & D:

X = (A << 16) + B
Y = (C << 16) + D

Rewrite X*Y in terms of A, B, C and D:

X * Y = ((A << 16) + B) * ((C << 16) + D)

Expand:

X * Y = (A*C << 32) + (A*D << 16) + (B*C << 16) + B*D

The partial result A * C ends up shifted left by 32 bits, so it falls entirely outside the low 32 bits and plays no part in our result, but the other three terms do. They can be calculated and added together as follows:

typedef vector unsigned long vUInt32;
typedef vector unsigned short vUInt16;

//The straightforward, slow way of doing 32 bit multiplication
vUInt32 vMullw( vUInt32 X, vUInt32 Y )
{
    vUInt32 YSwapped, BD, AD, BC;
    vUInt32 sixteen = vec_splat_u32( -16 ); //only the low 5 bits matter here
    YSwapped = vec_rl( Y, sixteen ); //swap the halfwords of Y: {D, C}

    //Calculate A*D, B*C, and B*D
    BD = vec_mulo( (vUInt16) X, (vUInt16) Y );
    AD = vec_mule( (vUInt16) X, (vUInt16) YSwapped );
    BC = vec_mulo( (vUInt16) X, (vUInt16) YSwapped );

    //Sum A*D and B*C, then left shift them by 16 bits
    AD = vec_add( AD, BC );
    AD = vec_sl( AD, sixteen );

    //Add in the B*D component
    return vec_add( AD, BD );
}

However, whenever you have a vec_mule / vec_mulo pair followed by an addition of the results, you can usually use a single vec_msum instead. This lets you do eight multiplies and an add in a single operation. In fact, we could have done a second add as well, had we anything to add in before the left shift:

//The more efficient way of doing 32 bit multiplication
vUInt32 vMullw_Fast( vUInt32 X, vUInt32 Y )
{
    vUInt32 YSwapped, BD, AD_plus_BC;
    vUInt32 sixteen = vec_splat_u32( -16 ); //only the low 5 bits matter here
    vector unsigned int zero = vec_splat_u32( 0 );
    YSwapped = vec_rl( Y, sixteen ); //swap the halfwords of Y: {D, C}

    //Calculate A*D + B*C, and B*D
    BD = vec_mulo( (vUInt16) X, (vUInt16) Y );
    AD_plus_BC = vec_msum( (vUInt16) X, (vUInt16) YSwapped, zero );

    //Left shift the high results by 16 bits
    AD_plus_BC = vec_sl( AD_plus_BC, sixteen );

    //Add in the B*D component
    return vec_add( AD_plus_BC, BD );
}

Making this modification shaves about two cycles off the function latency, because we replace a vec_mule, a vec_mulo and a vec_add with a single vec_msum. Even more importantly, it opens up a pipeline hole in the vector complex integer unit (VCIU) that we might exploit with more data, increasing the overall throughput of this function for larger data sets by up to 50%! The vec_msum, vec_msums, vec_mladd, vec_mradds and vec_madds operations are particularly powerful.
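These multiply-accumulate operations pay off whenever you keep a running sum in a vector register. As a minimal sketch (not from the original text; the function name is illustrative), the routine below uses vec_msum to accumulate an integer dot product eight elements at a time. It assumes the arrays are 16 byte aligned, that len is a multiple of 8, and that the total fits in 32 bits:

//Illustrative sketch: integer dot product built around vec_msum.
//Assumes 16 byte aligned arrays and len a multiple of 8.
int dot_product16( signed short *a, signed short *b, int len )
{
    union { vector signed int v; signed int s[4]; } result;
    vector signed int acc = vec_splat_s32( 0 );
    int i;

    for( i = 0; i < len; i += 8 )
    {
        vector signed short va = vec_ld( 0, &a[i] );
        vector signed short vb = vec_ld( 0, &b[i] );

        //Eight multiplies plus the running sums in one instruction
        acc = vec_msum( va, vb, acc );
    }

    //Fold the four partial sums down to a single value
    result.v = vec_sums( acc, vec_splat_s32( 0 ) );
    return result.s[3]; //vec_sums leaves the total in the last element
}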

 

The Identity Permute Vector (also byte swapping)

The identity permute vector (shown below) can in some cases be quite useful for turning complex multi-instruction permute operations into a single call to vec_perm():

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Rather than push each data item through the expensive multistep permute sequence, push the identity vector through it once. The resulting vector is the permute constant that performs the entire calculation in a single call to vec_perm(). This can be used to avoid situations in which, because part of the calculation cannot be anticipated at compile time, you would otherwise have to place a complex series of permute operations in your main data handling loop.
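In its simplest form, the trick composes permutes: apply each permute step to the identity vector instead of to the data, and the output is a single permute map equivalent to the whole sequence. The sketch below is a minimal illustration (not from the original text); permA and permB stand for whatever constants your two permute steps would have used, and the alignment plus byte swap example that follows applies the same idea to a more useful case:

//Illustrative sketch: fold two single-vector permute steps into one map.
//Pushing the identity vector {0...15} through the steps yields a permute
//constant that applies permA and then permB in a single vec_perm.
vector unsigned char compose_permutes( vector unsigned char permA,
                                       vector unsigned char permB )
{
    //vec_lvsl never touches memory; with a 16 byte aligned address it
    //simply returns the identity sequence {0...15}
    vector unsigned char identity = vec_lvsl( 0, (unsigned char *) 0 );
    vector unsigned char step1 = vec_perm( identity, identity, permA );

    return vec_perm( step1, step1, permB );
}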

Here we use this method to handle alignment and byte swapping in a single step. You don't know until runtime what the alignment is going to be, and in some cases you might not even know what the endianness of the data is going to be, so a generic piece of code that handles these conditions would probably have to align the data first and then byte swap it. To do both in a single step, we can take the 32 byte identity sequence (two AltiVec vectors, A and B):

A:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
B: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Now apply our alignment permute to it

temp = vec_perm( A, B, vec_lvsl( 0, unalignedPtr ) );

which might yield something like:

12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

and then apply a 16 bit byte swap operation to it:

permute = vec_rl( temp, vec_splat_u16(8) );

13 12 15 14 17 16 19 18 21 20 23 22 25 24 27 26

The result, permute, is the correct permute constant to do the whole align + byte swap operation in a single call to vec_perm()! This works because the values stored in the identity vector track where each byte ends up as you push the identity vector through the complex permute sequence. When you are done, the resulting vector is a permute map -- in effect a lookup table that performs the expensive transformation in a single call to vec_perm(). With permute in hand, we can now apply the alignment plus byte swap to our data set much more quickly:

for( i = 0; i < size; i++ )
{
    //Byte swap and align in a single step
    result[i] = vec_perm( unaligned_input[i], unaligned_input[i+1], permute );
}

Some care is in order with this approach to make sure that all of your data convolutions don't cause data to magically disappear into the aether. Obviously, this example only works in cases when the entire permute sequence is representable as a single permute. If the result bytes can be drawn from a pool larger than two vectors in size, then you will not be able to use this technique to do the calculation in a single step. In some cases, you may use it to do the calculation in fewer steps.

Note: In this particular example, we can do this a bit more simply because vec_lvsl( 0, unalignedPtr) is the same as the value for temp. (Using vec_perm() with identity vectors as source vectors just gives you your permute constant back.) This allows us to change:

temp = vec_perm( A, B, vec_lvsl( 0, unalignedPtr ) );

to:

temp = vec_lvsl( 0, unalignedPtr );

Also, vec_rl() is not the only way to byte swap data. With permute constants you can sometimes achieve the same effect more efficiently by XOR-ing the permute constant with ( data_size - 1). So, for example, with 32 bit integers, you are usually better off "byte swapping" the permute vector using permute XOR 3, rather than two calls to vec_rl().

//16 bit byteswap
permute = vec_xor( temp, vec_splat_u8(1) );

//32 bit byteswap
permute = vec_xor( temp, vec_splat_u8(3) );

 

Efficient Safe Unaligned Vector Loads

When loading N consecutive unaligned vectors into registers, it is common to load N+1 aligned vectors and then align them using vec_perm:

vector float v0, v1, v2, v3, vExtra;
vector unsigned char fixAlignment;

//Load four unaligned vectors as five aligned vectors
v0 = vec_ld( 0 * sizeof( vector float), ptr );
v1 = vec_ld( 1 * sizeof( vector float), ptr );
v2 = vec_ld( 2 * sizeof( vector float), ptr );
v3 = vec_ld( 3 * sizeof( vector float), ptr );
vExtra = vec_ld( 4 * sizeof( vector float), ptr );

//Use vec_perm to extract out the desired unaligned vectors
fixAlignment = vec_lvsl( 0, ptr );
v0 = vec_perm( v0, v1, fixAlignment );
v1 = vec_perm( v1, v2, fixAlignment );
v2 = vec_perm( v2, v3, fixAlignment );
v3 = vec_perm( v3, vExtra, fixAlignment );

This is highly efficient, because we only do one more vector load than we would have to do for the aligned case. In the context of a loop, vExtra could be copied to v0 for the next loop iteration to avoid having to redundantly reload that data:

vector float v0, v1, v2, v3, vExtra;
vector unsigned char fixAlignment;

//Do the initial starter load and set up the alignment permute
vExtra = vec_ld( 0, ptr );
fixAlignment = vec_lvsl( 0, ptr );

for( int i = 0; i < loopCount; i++ )
{
    //Load four unaligned vectors as five aligned vectors.
    //The first aligned vector is already in vExtra, loaded either by
    //the starter load or by the previous loop iteration.
    v0 = vExtra;
    v1 = vec_ld( 1 * sizeof( vector float), ptr );
    v2 = vec_ld( 2 * sizeof( vector float), ptr );
    v3 = vec_ld( 3 * sizeof( vector float), ptr );
    vExtra = vec_ld( 4 * sizeof( vector float), ptr );
    ptr += 4 * vec_step( vector float );

    //Use vec_perm to extract out the desired unaligned vectors
    v0 = vec_perm( v0, v1, fixAlignment );
    v1 = vec_perm( v1, v2, fixAlignment );
    v2 = vec_perm( v2, v3, fixAlignment );
    v3 = vec_perm( v3, vExtra, fixAlignment );

    ...

}

In this manner, we only ever load one more vector than we need to.

CAUTION: The only problem with this approach is that if ptr should turn out to be correctly 16 byte aligned, the last vExtra load will contain no valid data. The danger is that if that load happens to land on a new memory page, and the page is not readable by the current process, the application could crash with an access exception. (Because the load falls outside the data region we are supposed to be reading, we have no way of knowing whether it will cause an access exception or not.) Thus, you must be very careful to load only vectors that you know contain at least one valid byte.

When you need to load an array that is small enough that no looping is necessary, this is quickly and simply done safely by loading the last vector from the last valid data element in the array, rather than from the first invalid element past the end of the array:

vector float v0, v1, v2, v3, vExtra;
vector unsigned char fixAlignment;

//Load four unaligned vectors as five aligned vectors
v0 = vec_ld( 0 * sizeof( vector float), ptr );
v1 = vec_ld( 1 * sizeof( vector float), ptr );
v2 = vec_ld( 2 * sizeof( vector float), ptr );
v3 = vec_ld( 3 * sizeof( vector float), ptr );
vExtra = vec_ld( 4 * sizeof( vector float) - sizeof(float), ptr );

//Use vec_perm to extract out the desired unaligned vectors
fixAlignment = vec_lvsl( 0, ptr );
v0 = vec_perm( v0, v1, fixAlignment );
v1 = vec_perm( v1, v2, fixAlignment );
v2 = vec_perm( v2, v3, fixAlignment );
v3 = vec_perm( v3, vExtra, fixAlignment );

If ptr turns out to be 16 byte aligned, the last load fetches the same vector that we already loaded for v3; otherwise it fetches the next vector. Because fixAlignment is the identity permute when ptr is 16 byte aligned, the values in vExtra are never used in that case, so loading the "wrong" data there is not a problem. When ptr is not 16 byte aligned, the correct data is loaded, so long as ptr is naturally aligned for the scalar data type in question (i.e. four byte aligned for floats). This neatly avoids the problem of spilling into unmapped memory, with no branching or laborious calculation.

We could easily extend this practice in the context of a loop, but we would end up doing one unnecessary load every loop iteration, which one might like to avoid. It would be nice to just have to do one extra load over the course of the entire function instead. This can be done by making sure that all loads in the loop are always valid.

The reasoning behind the algorithm below is a bit involved, but the result is highly efficient. The trick is to get the second aligned load to occur at the same address as the first load in the aligned case, and 16 bytes after the first load in all other cases. We can borrow the 16 - sizeof(element) trick from above to accomplish that. The other thing we have to do is a little trickery with the lvsl-generated permute, so that the second load's result is used instead of the first load's result in the aligned case:

vector float v0, v1, v2, v3, vExtra;
vector unsigned char fixAlignment;

//Set up our alignment permute.
//This returns the same result as vec_lvsl( 0, ptr ),
//except in the aligned case, where we get {16...31}
fixAlignment = vec_add( vec_lvsl( 15, ptr ), vec_splat_u8(1) );

//Do initial starter load
v0 = vec_ld( 0, ptr );

for( int i = 0; i < loopCount; i++ )
{
    //Load four unaligned vectors as five aligned vectors.
    //These loads are addressed to the last byte of the
    //previous load so that they overlap with the previous
    //load in the aligned case only
    v1 = vec_ld( 16 - 1, ptr );
    v2 = vec_ld( 32 - 1, ptr );
    v3 = vec_ld( 48 - 1, ptr );
    vExtra = vec_ld( 64 - 1, ptr );
    ptr += 4 * vec_step( vector float ); //advance 64 bytes (ptr is a float*)

    //Use vec_perm to extract out the desired unaligned vectors
    v0 = vec_perm( v0, v1, fixAlignment );
    v1 = vec_perm( v1, v2, fixAlignment );
    v2 = vec_perm( v2, v3, fixAlignment );
    v3 = vec_perm( v3, vExtra, fixAlignment );

    ...

    v0 = vExtra;
}

Similar things can be done with unaligned vector stores. Please note that although we described the technique as 16 - sizeof(element), in actuality we just assume the element size is 1, even when it is larger. This works for any standard data size, and does not run into problems when the elements are not themselves naturally aligned (e.g. 4 byte aligned for a float), as 16 - sizeof(element) would.
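A matching sketch for the store side (not from the original text, and with an illustrative function name): read the two aligned vectors that straddle the destination, rotate the data into place with a vec_lvsr permute, and merge the edges back in with vec_sel. Note that this read-modify-write touches bytes outside the 16 being stored, so both aligned blocks must be writable and no other thread may own the edge bytes:

//Illustrative sketch: store one vector to an unaligned address.
void vec_store_unaligned( vector unsigned char v, unsigned char *p )
{
    vector unsigned char low = vec_ld( 0, p );   //aligned block holding the first bytes
    vector unsigned char high = vec_ld( 15, p ); //aligned block holding the last bytes
    vector unsigned char perm = vec_lvsr( 0, p );
    vector unsigned char mask = vec_perm( vec_splat_u8(0), vec_splat_u8(-1), perm );
    vector unsigned char rotated = vec_perm( v, v, perm );

    //Store the tail first, so that in the aligned case (where both stores
    //hit the same block) the final store leaves the correct data in memory
    vec_st( vec_sel( rotated, high, mask ), 15, p );
    vec_st( vec_sel( low, rotated, mask ), 0, p );
}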

In some cases, especially where the source and destination buffers are aligned with one another but not necessarily to a multiple of 16 bytes, you may find that you are better off doing the aligned portion of the calculation in the vector unit and handling the edge cases in the scalar units. To the extent that there are unused instruction dispatch and completion slots available in your vector code, you may be able to get the edge work for free this way -- the scalar units can execute in parallel with the vector unit.
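A hypothetical sketch of that structure (not from the original text) is a simple fill routine: scalar loops take care of the unaligned head and tail bytes while the vector unit stores the aligned middle. The pointer arithmetic assumes addresses fit in an unsigned long:

//Illustrative sketch: scalar edges, vector middle.
void vec_fill_bytes( unsigned char *dst, unsigned char value, unsigned long count )
{
    //Splat the byte across a vector register
    vector unsigned char pattern = vec_lde( 0, &value );
    vector unsigned char shuffle = vec_lvsl( 0, &value );
    pattern = vec_perm( pattern, pattern, vec_splat( shuffle, 0 ) );

    //Scalar head: bytes up to the first 16 byte boundary
    while( count > 0 && ((unsigned long) dst & 15) != 0 )
    {
        *dst++ = value;
        count--;
    }

    //Vector middle: whole aligned vectors
    while( count >= 16 )
    {
        vec_st( pattern, 0, dst );
        dst += 16;
        count -= 16;
    }

    //Scalar tail: whatever is left over
    while( count > 0 )
    {
        *dst++ = value;
        count--;
    }
}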

 

Avoid Branching

Most texts on performance will instruct you to avoid branching, because there is usually some cost to mispredicted branches in terms of restarting the instruction pipeline with a new instruction stream. For AltiVec code, however, the cost can be much larger than that, because branching is inherently not parallelizable. The worst case would be code that inspects vector elements one at a time and branches according to their values. Not only do you run the risk of branch misprediction, you also incur expensive overhead isolating individual elements from the vectors you are inspecting. Avoid this at all costs.

The SIMD approach to solving this sort of problem is to make use of special functions for saturated clipping, min and max where appropriate, and to use simple tests and masks elsewhere. A simple example is the vector abs() function. vec_abs() is one of the few AltiVec C Programming Interface functions that does not correspond 1:1 to an AltiVec instruction.

The simple scalar code for abs() might be:

if( value < 0 ) value = -value;

In the integer domain, you can usually replace a conditional expression like the one above with a mask whose value, 0 or -1L, depends on the result of the conditional expression. In this case we simply smear the sign bit across the word with an arithmetic right shift (signed right shifts are arithmetic on PowerPC):

long mask = ((long) value) >> 31; //-1 if value < 0, 0 otherwise
long negated = -value;

value = (negated & mask) | (value & ~mask);

In this special case, one can be a bit more clever and use xor to generate the negated quantity only if the mask is -1L:

long mask = ((long) value) >> 31; //-1 if value < 0, 0 otherwise
value = (value ^ mask) - mask; //-1: ~value + 1; 0: value

You can do similar things in the vector unit. In this case, vec_cmplt() generates a vector mask that holds -1L where the test is true and 0 elsewhere. You can then use vec_sel() to select one of two results based on the value of the mask:

vector long zero = vec_splat_s32(0);
vector bool long mask = vec_cmplt( value, zero ); //-1L where element < 0
vector long negated = vec_sub( zero, value ); //-value

value = vec_sel( value, negated, mask );

We can merge the vec_cmplt() and vec_sel() steps into a single vec_max() instruction, which is what the C Programming Interface actually does:

vector long zero = vec_splat_s32(0);
vector long negated = vec_sub( zero, value ); //-value

value = vec_max( value, negated );

The drawback of the select approach is that one often has to calculate both results and then select between them at the end. This is what we do in the vectorized lookup table code example, where for a 256 entry lookup table we actually do eight lookups from eight 32-entry tables and select among the eight possible results. This is still faster than using the LSU to do the lookups, because we can do 16 of them in parallel this way. In cases where both calculations are the same but use different data, you may instead use the mask to select between the different data inputs to the calculation, thereby avoiding having to calculate both results. This can be a large performance win.
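As a minimal sketch of that last point (not from the original text; the "square the larger element" operation is purely illustrative), the mask below picks between the two inputs first, so the actual calculation only runs once:

//Illustrative sketch: select between inputs, then calculate once.
vector float square_of_larger( vector float a, vector float b )
{
    vector bool int mask = vec_cmpgt( a, b );    //-1 where a > b
    vector float larger = vec_sel( b, a, mask ); //choose the input, not the result

    return vec_madd( larger, larger, (vector float) vec_splat_u32(0) );
}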

 

Matrix Transpose

Matrix transposes are incredibly useful in nearly all AltiVec code, because they are a highly efficient way to convert many vectors of identical non-uniform type into as many vectors of uniform type. The cost of the transpose is N * log2(N) operations, for N = 4, 8 or 16.

This is in general more efficient than other data swizzling methods when the amount of data is large. Many examples of their use appear in the code samples presented in these pages. A general algorithm for efficient matrix transposes may be found in the Algorithms section.
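For reference, a 4x4 single precision transpose (a minimal sketch, not the general routine described in the Algorithms section) can be written with eight merge operations, which is the N * log2(N) count mentioned above for N = 4:

//Illustrative sketch: transpose a 4x4 matrix held in four vector floats,
//in place, using vec_mergeh / vec_mergel.
void transpose_4x4( vector float *rows )
{
    vector float t0 = vec_mergeh( rows[0], rows[2] ); //a0 c0 a1 c1
    vector float t1 = vec_mergeh( rows[1], rows[3] ); //b0 d0 b1 d1
    vector float t2 = vec_mergel( rows[0], rows[2] ); //a2 c2 a3 c3
    vector float t3 = vec_mergel( rows[1], rows[3] ); //b2 d2 b3 d3

    rows[0] = vec_mergeh( t0, t1 ); //a0 b0 c0 d0
    rows[1] = vec_mergel( t0, t1 ); //a1 b1 c1 d1
    rows[2] = vec_mergeh( t2, t3 ); //a2 b2 c2 d2
    rows[3] = vec_mergel( t2, t3 ); //a3 b3 c3 d3
}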
