Code Optimization

Micro-optimization of AltiVec code is one of the last steps in the optimization process, because the performance wins obtained from micro-optimization are usually small. It should be the last thing you do, if you do it at all. Long before tackling micro-optimization, you should first make sure that you are using the right algorithm. The performance difference between a good algorithm and a bad one can be many thousandfold. Next, read the earlier sections about SIMD and data handling. The wrong data handling methods or poor cache usage can easily prevent optimally written code from performing well. Poor data use strategies are the most common reason why naively written AltiVec-accelerated functions fail to outperform their scalar counterparts, so it is important to address these problems first. No amount of code micro-optimization will help code that is memory bandwidth limited. Finally, once these things are dealt with, check that your code makes proper use of the pipelines and superscalarism in the processor. As we show in the High Throughput vs. Low Latency section, poor design in these areas can make your code run many times slower; one example there is 33 times slower than its pipeline-saturating counterpart. Listed below you will find some general micro-optimization themes that will help improve the performance of CPU-bound calculations. In addition, you may find some additional hints in the Algorithms sections for dealing with specific problems such as square root, divide, permutes, alignment, etc.

Work Maximization

Some AltiVec instructions do more work than others; why use several instructions when one will do? For example, among all of AltiVec's 17 flavors of integer multiplication, there is no instruction that does the vector equivalent of a full 32-bit by 32-bit multiply. One can, however, be synthesized from 16-bit multiplies. Represent 32-bit quantities X and Y as two pairs of shorts, A & B and C & D:
Rewrite X*Y in terms of A, B, C and D:

    X = (A << 16) + B
    Y = (C << 16) + D

Expand:

    X * Y = ((A*C) << 32) + ((A*D) << 16) + ((B*C) << 16) + (B*D)
The partial result A * C ends up left shifted by 32 bits, so does not play a part in our result, but the other three terms do. They can be calculated and added together as follows:
However, whenever you have a vec_mule/vec_mulo pair whose results are then added together, you can usually fold all three operations into a single vec_msum.
Making this modification shaves about two cycles off the function latency, because we replace a vec_mule, a vec_mulo and a vec_add with a single vec_msum. Even more important, it opens up a pipeline hole in the vector complex integer unit (VCIU) that we might exploit with more data to increase the overall throughput of this function for larger data sets by up to 50%!
The Identity Permute Vector (also byte swapping)

The identity permute vector, { 0x00, 0x01, 0x02, ..., 0x0F }, can in some cases be quite useful for turning complex multi-instruction permute operations into a single call to vec_perm.
Rather than push each data item through the expensive multistep permute, push the identity vector through it. The resulting permute vector will be the correct one that does the entire calculation in a single call to vec_perm. This can be used to avoid situations in which, because part of the calculation cannot be anticipated at compile time, you would otherwise have to place a complex series of permute operations in your main data handling loop. Here we use this method to handle alignment and byte swapping in a single step. You don't know until runtime what the alignment is going to be, and in some cases you might not even know what the endianness of the data is going to be, so a generic piece of code to handle these conditions would probably have to align the data first and then byte swap it. To do both in a single step, we can take the 32-byte identity vector as two AltiVec vectors, A = { 0x00, 0x01, ..., 0x0F } and B = { 0x10, 0x11, ..., 0x1F }.
Now apply our alignment permute to the identity, which for a 3-byte misalignment might yield something like { 0x03, 0x04, ..., 0x12 }, and then apply a 16-bit byte swap operation to that result. The final vector, used as the permute argument to a single vec_perm, performs the alignment and the byte swap on the data in one step.
Some care is in order with this approach to make sure that all of your data convolutions don't cause data to magically disappear into the aether. Obviously, this example only works in cases when the entire permute sequence is representable as a single permute. If the result bytes can be drawn from a pool larger than two vectors in size, then you will not be able to use this technique to do the calculation in a single step. In some cases, you may use it to do the calculation in fewer steps.
Efficient Safe Unaligned Vector Loads

When loading N consecutive unaligned vectors into registers, it is common to load N+1 aligned vectors and then align them using vec_perm.
This is highly efficient, because we do only one more vector load than we would have to do for the aligned case. In the context of a loop, each iteration can reuse the second aligned load from the previous iteration as its first, so we only ever load one more vector than we need.
When you need to load an array that is small enough that no looping is necessary, this is quickly, simply and safely done by loading the last vector from the address of the last valid element in the array, rather than from the first invalid element past the end of the array:
We could easily extend this practice in the context of a loop, but we would end up doing one unnecessary load every loop iteration, which one might like to avoid. It would be nicer to do just one extra load over the course of the entire function. This can be done by making sure that all loads in the loop are always valid. The reasoning behind the algorithm below is complicated, but it is highly optimal. The trick is to get the second aligned load for a misaligned vector to occur at the same address as the first load for the aligned case, and 16 bytes after the first load for all other cases. We can borrow the +16-sizeof(element) trick from above to accomplish that. The other thing we have to do is a little bit of trickery with the lvsl-generated permute so that the second load product is used instead of the first load product in the aligned case:
Similar things can be done with unaligned vector stores. Please note that though we described 16-sizeof(element) as the technique we are using, in actuality we just assume the element size is 1, even when it is larger. This works for any standard data size, and doesn't run into problems when the elements are not themselves naturally aligned (e.g. 4-byte aligned for a float), as 16 - sizeof(element) would.

In some cases, especially where the source and destination buffers are aligned with one another but not necessarily to a multiple of 16 bytes, you may find that you are better off doing the aligned portion of the calculation in the vector unit and the edge cases in the scalar units. To the extent that there are unused instruction dispatch and completion slots available in your vector code, you may be able to get the extra edge work for free this way: the scalar units can execute in parallel with the vector unit.
Avoid Branching

Most texts on performance will instruct you to avoid branching, because there is usually some cost to mispredicted branches in terms of restarting the instruction pipeline with a new instruction stream. For AltiVec code, however, the cost can be much larger than that, because branching is inherently not parallelizable. The worst case is code that inspects vector elements one at a time and branches according to their value. Not only do you run the risk of branch misprediction, you also incur expensive overhead isolating individual elements from the vectors you are inspecting. Avoid these at all costs. The SIMD approach to this sort of problem is to make use of the special functions for saturated clipping, min and max when appropriate, and to use simple tests and masks elsewhere. A simple example is the vector abs() function. vec_abs() is one of the few AltiVec C Programming Interface functions that does not correlate 1:1 with an AltiVec instruction. The simple scalar code for abs() might be:
In the integer domain, you can usually replace conditional expressions like the one above with a mask whose value, 0 or -1L, depends on the result of the conditional expression. In this case we just right-extend the sign bit:
In this special case, one can be a bit more clever and use xor to generate the negated quantity only if the mask is -1L:
You can do similar things in the vector unit. In this case, vec_cmplt() generates a vector mask that holds -1L where the test is true and 0 elsewhere. You can then use vec_sel() to select from one of two results based on the value of the mask:
We can merge the vec_cmplt() and vec_sel() steps into a single vec_max() instruction, which is what the C Programming Interface actually does:
The drawback of the select approach is often that one has to calculate both results and then select between them at the end. This is something we do in the vectorized lookup table code example, where for a 256-entry lookup table, we actually do 8 table lookups from 8 32-entry lookup tables and select among the 8 possible results. This is still faster than using the LSU to do the lookups, because we can do 16 at a time in parallel this way. In cases where both calculations are the same but use different data, you may instead opt to use the mask to select between different data inputs into the calculation, thereby avoiding having to calculate both results. This can be a large performance win.
Matrix Transpose

Matrix transposes are incredibly useful in nearly all AltiVec code, because they are a highly efficient way to convert many vectors of identical non-uniform type into as many vectors of uniform type. The cost of the transpose is N*log2(N) for N = {4, 8, 16}. This is in general more efficient than other data swizzling methods when the amount of data is large. Many examples of their use appear in the code examples presented in these pages. A general algorithm for efficient matrix transposes may be found in the Algorithms section.