Data Handling and Data Formats

Sometimes it seems that everywhere you look, somebody has found a way to take two perfectly good arrays of data and interleave them to make one array in a more complicated format, e.g. an array of structs. In fact, the basic data pattern for nearly every preexisting data format you are likely to use with AltiVec (color pixels, stereo sound, 3D geometry...) is an array of non-uniform vectors (see Understanding SIMD). The originators of these formats were probably trying to enforce good locality of reference for optimal cache usage, or perhaps trying to make use of a special processor feature that automatically incremented a pointer after use. Unfortunately, interleaved data is a fact of life and is something you are going to have to deal with. There are two common solutions to this problem:
Vector Friendly Data Formats

The easiest, most efficient thing to do is to refuse to use these sorts of data formats. Instead, use formats that are comprised of arrays of uniform vectors. Uniform vectors can be loaded directly into vector registers without swizzling or alignment headaches and can be used as such without modification. This may be an interleaved format that interleaves data channels every sixteen bytes rather than every byte, or it may be a format that uses an independent array for each data channel, with no interleaving whatsoever. Formats of the latter kind are sometimes called planar data formats.

Unfortunately, such "new" data formats make it difficult to work with external libraries or operating systems that use traditional interleaved formats. If neither the OS nor third party libraries need to see your data, then your own vector-friendly data format may work very well. Where infrequent interaction with the OS or third party libraries is required, it may be necessary to convert between formats "on demand."

When designing a vector friendly data format, try to keep related data in the same cacheline. For example, if you need to change a pixel, it would be nice if the red, green and blue values for that pixel all lived in the same aligned 32 byte block. Some consideration as to format may be required to achieve this. For example, you can probably do this quite nicely with YUV-422 pixel data, but probably not with RGB pixel data. (YUV-422 data shares U and V channels between adjacent pixels, which means you can store 16 YUV pixels in a cacheline. Sixteen RGB pixels, by contrast, take up 48 or 64 bytes, depending on the presence of an alpha channel.) This only really matters if you expect to be changing individual data values, however. Since the basic idea in the vector unit is to change 4, 8 or 16 values in parallel, you may find that this is a non-issue; the net cost gradually disappears as you change larger numbers of points.

If one goes so far as to use a separate array for each channel, you may discover significant memory-associated performance gains if this allows you to avoid touching one of the channels (e.g. a pixel's alpha channel) much of the time. Many applications never touch the alpha channel, yet incur 33% more memory overhead because it is interleaved with their data. As the performance of much AltiVec code is limited by the rate at which data can be fetched from memory, this can have significant implications for application performance. Applications that use vector friendly data storage formats are the most likely to see their performance scale a full 4, 8 or 16 fold as a result of vector parallelism, as the sketch below illustrates.
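As an illustration, here is a hypothetical planar image layout in C. This is a sketch only; the type names, kPixelCount, and the GCC alignment attribute are our own choices, not part of any existing API.

    #include <altivec.h>

    #define kPixelCount 1024    /* assumed; a multiple of 16 */

    /* Traditional interleaved layout: R, G, B and A repeat every four
       bytes, so every load drags the alpha channel into the cache. */
    typedef struct { unsigned char r, g, b, a; } RGBAPixel;

    /* Planar layout: one independent, 16-byte-aligned array per
       channel. Sixteen values of one channel load with a single
       vec_ld, and a channel the code never uses never touches the
       cache. */
    typedef struct
    {
        unsigned char r[ kPixelCount ] __attribute__(( aligned( 16 ) ));
        unsigned char g[ kPixelCount ] __attribute__(( aligned( 16 ) ));
        unsigned char b[ kPixelCount ] __attribute__(( aligned( 16 ) ));
        unsigned char a[ kPixelCount ] __attribute__(( aligned( 16 ) ));
    } PlanarImage;

    /* Load sixteen red values (bytes i through i+15) in one load */
    vector unsigned char LoadReds( const PlanarImage *image, int i )
    {
        return vec_ld( i, image->r );
    }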
On-the-Fly Format Conversion

The other solution to the problem is to use the non-uniform storage format but convert it to a vector-friendly one in register before using it. This process is called swizzling or data recombination. There are a number of approaches to it. In many cases, working with uniform types internally makes things a lot easier, and usually a lot more efficient, as we show in the following example.

Suppose you want to normalize the following four vectors, V1-4, to make vectors of unit length (i.e. whose sum of squares, X² + Y² + Z², is 1.0):

    V1 = { X1, Y1, Z1, W1 }
    V2 = { X2, Y2, Z2, W2 }
    V3 = { X3, Y3, Z3, W3 }
    V4 = { X4, Y4, Z4, W4 }
Further suppose that the W element in each is an artificial value used to indicate whether the quartet represents a point or a vector. (Its value is 1.0 if the quartet represents a point, and 0.0 if it represents a vector.) If you try to find the length of the vector as:

    length = √(X² + Y² + Z² + W²)

...and W is non-zero, you will get the wrong answer. Unfortunately, this is exactly what SIMD is going to do to you if you simply square all the elements in each vector and then add across it. In addition, once we know the length, we don't want to divide W by it, only X, Y and Z. Working directly with this non-uniform format therefore requires that we mask off the W element, normalize the other three values, then put the W value back. The following function normalizes four vectors at once using this approach. We do four at once for efficiency and also to provide a better comparison with the approach presented immediately afterwards.
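A minimal sketch of such a function, assuming GCC-style AltiVec syntax and an {X, Y, Z, W} element order. The function names are ours, and vec_rsqrte yields only an estimate; production code would refine it with a Newton-Raphson step:

    #include <altivec.h>

    /* Normalize one interleaved {X,Y,Z,W} vector, masking W off and
       restoring it afterwards. */
    static vector float NormalizeOneMasked( vector float v )
    {
        const vector float        zero  = (vector float) vec_splat_u32( 0 );
        const vector unsigned int maskW =
            (vector unsigned int){ 0, 0, 0, 0xFFFFFFFF };

        /* Zero W so it cannot pollute the sum of squares */
        vector float xyz = vec_andc( v, (vector float) maskW );

        /* sq = { X*X, Y*Y, Z*Z, 0 } */
        vector float sq = vec_madd( xyz, xyz, zero );

        /* Add across the vector: two rotate-and-add steps in the
           permute unit leave X*X + Y*Y + Z*Z in every element */
        vector float sum = vec_add( sq, vec_sld( sq, sq, 4 ) );
        sum = vec_add( sum, vec_sld( sum, sum, 8 ) );

        /* Scale X, Y and Z by the reciprocal square root estimate */
        vector float scaled = vec_madd( xyz, vec_rsqrte( sum ), zero );

        /* Put the original W element back */
        return vec_sel( scaled, v, maskW );
    }

    void NormalizeFourMasked( vector float *v1, vector float *v2,
                              vector float *v3, vector float *v4 )
    {
        *v1 = NormalizeOneMasked( *v1 );
        *v2 = NormalizeOneMasked( *v2 );
        *v3 = NormalizeOneMasked( *v3 );
        *v4 = NormalizeOneMasked( *v4 );
    }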
Inspection reveals that we don't seem to be able to do anything in this function without first juggling data around in the permute unit. About half or two-thirds of this routine seems to be permute operations. Would you be surprised to learn that this was actually a relatively easy example because there was a lot of symmetry in the operation? Imagine the chaos had we decided to do a cross product! Suppose instead we transpose the data first and use it that way. (We will transpose it back later when we are done.) Here is the transpose:
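In AltiVec, the 4x4 transpose of four vector floats is idiomatically eight merge instructions. A sketch, with comments showing how the rows become uniform channel vectors (the name Transpose4x4 is ours):

    /* Transpose four {X,Y,Z,W} rows into X, Y, Z and W rows in place */
    void Transpose4x4( vector float *v1, vector float *v2,
                       vector float *v3, vector float *v4 )
    {
        vector float t1 = vec_mergeh( *v1, *v3 );   /* X1 X3 Y1 Y3 */
        vector float t2 = vec_mergel( *v1, *v3 );   /* Z1 Z3 W1 W3 */
        vector float t3 = vec_mergeh( *v2, *v4 );   /* X2 X4 Y2 Y4 */
        vector float t4 = vec_mergel( *v2, *v4 );   /* Z2 Z4 W2 W4 */

        *v1 = vec_mergeh( t1, t3 );                 /* X1 X2 X3 X4 */
        *v2 = vec_mergel( t1, t3 );                 /* Y1 Y2 Y3 Y4 */
        *v3 = vec_mergeh( t2, t4 );                 /* Z1 Z2 Z3 Z4 */
        *v4 = vec_mergel( t2, t4 );                 /* W1 W2 W3 W4 */
    }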
It should be clear that each row now contains a uniform vector: one vector holds the X's, one the Y's, one the Z's and one the W's. Using this new data layout, we can normalize V1-4 without worrying about what happens to W. The W elements just sit quietly in place, and we never touch them. When we are done normalizing, we simply transpose back to return to the original data layout. Notice how the core of the calculation becomes much simpler and almost self-documenting. The new function for normalizing four vectors looks much like the scalar code for normalizing a single XYZW vector, except that we do four in parallel:
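A minimal sketch of this uniform-vector version, reusing the hypothetical Transpose4x4 from above (again, the names and the unrefined vec_rsqrte estimate are our own choices):

    /* Normalize four {X,Y,Z,W} vectors by transposing to uniform
       channel vectors, normalizing, and transposing back. */
    void NormalizeFour( vector float *v1, vector float *v2,
                        vector float *v3, vector float *v4 )
    {
        const vector float zero = (vector float) vec_splat_u32( 0 );

        Transpose4x4( v1, v2, v3, v4 );     /* rows are now X, Y, Z, W */

        vector float x = *v1, y = *v2, z = *v3;

        /* lengthSquared = X*X + Y*Y + Z*Z, four vectors at a time.
           The W row never enters the calculation. */
        vector float lengthSq = vec_madd( x, x, zero );
        lengthSq = vec_madd( y, y, lengthSq );
        lengthSq = vec_madd( z, z, lengthSq );

        vector float rLength = vec_rsqrte( lengthSq ); /* ~ 1/length */

        *v1 = vec_madd( x, rLength, zero );
        *v2 = vec_madd( y, rLength, zero );
        *v3 = vec_madd( z, rLength, zero );
        /* *v4, the W row, is left untouched */

        Transpose4x4( v1, v2, v3, v4 );     /* back to interleaved form */
    }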
Swizzling with Transposes vs. ad hoc Swizzling

In general, swizzling with transposes works best when you have complex functions that work with a large amount of interleaved data (>= 64 bytes). As functions go, normalizing a vector is a fairly trivial operation. If this function grew to do more than just normalize four vectors, the transpose overhead would become a smaller fraction of the overall function cost. Thus, more complex functions tend to incur proportionally less overhead from the swizzling costs associated with non-uniform storage formats, and tend to be more efficient. As we will learn in the Memory Performance section, there are other very good reasons to prefer a few complex AltiVec functions over many simple ones.

Of course, you will not always want to do a full matrix transpose to swizzle and unswizzle data. There are many special cases with cheaper solutions, particularly in small functions that handle limited amounts of data. Matrix transposes just happen to be a cheap way of swizzling a lot of interleaved data that, by happy coincidence, also produces uniform vectors.
Speed

How much does the transpose actually cost us in this case? The execution times of the two functions, measured on a G4/866, tell the story.
Skeptical observers will note that the second function is not a lot faster, despite its apparent brevity. The reason is that while the first function does a good job of filling the processor pipelines, the second leaves them three-quarters empty much of the time. This is not a reason to condemn the second implementation; rather, it is an opportunity to achieve greater throughput. If one fills those pipeline holes by processing more data concurrently (i.e. normalizing 16 vectors at once), the second approach gets four times as much work done in not much more time than the four-vector function takes to execute. A sketch of such a 16-vector version follows.
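A minimal sketch of the idea, assuming the NormalizeFour routine sketched earlier (the name NormalizeSixteen is ours):

    /* Normalize 16 vectors per call. The four NormalizeFour bodies are
       independent of one another, so once inlined the compiler can
       interleave their instructions and fill the pipeline holes that a
       single four-vector call leaves behind. */
    void NormalizeSixteen( vector float v[16] )
    {
        NormalizeFour( &v[0],  &v[1],  &v[2],  &v[3]  );
        NormalizeFour( &v[4],  &v[5],  &v[6],  &v[7]  );
        NormalizeFour( &v[8],  &v[9],  &v[10], &v[11] );
        NormalizeFour( &v[12], &v[13], &v[14], &v[15] );
    }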
Can we do the same with the non-uniform vector function? Not to nearly the same effect: as noted above, it already does a good job of filling the pipelines, so there are few idle slots left to exploit by feeding it more data at once.
How is it that the uniform-vector function gets by with so much less work per vector? Calculations with uniform vectors almost never involve do-nothing calculations, since each element is unique. This is why they generally make for cleaner, more efficient code. A discussion of how to spot pipeline holes appears in the Performance Measurement section.
Economies of Scale

We do give up one thing to achieve this sort of throughput efficiency: the freedom to normalize just one vector at a time. If you prefer writing functions that complete a limited task in the least time possible, using a somewhat slower function that does more work (potentially unnecessary work) might make you uneasy. However, before condemning this approach, it is important to ask what each approach really costs. To find out, consider what happens if we rewrite the routine as simple scalar code that normalizes a single vector at a time.
The scalar version isn't very optimized, so understandably it doesn't perform all that well. Four-fold parallelism in the vector registers suggests that we should only be able to beat its throughput by a factor of four; that the vector functions can do better than that is a measure of how much keeping the pipelines full is worth. Surprisingly, or perhaps not so surprisingly, the higher capacity function comes out ahead even when some of its capacity goes to waste.
If, most of the time that you normalize a vector, you actually normalize two or more at once, you are a lot better off with the higher capacity function, even if some of the time you just give it one vector and a bunch of garbage data to normalize, as in the sketch below. The same is true of the scalar units, which, for reasons of pipelining and superscalar design, also benefit from being given more data rather than less. A discussion of high throughput vs. low latency programming and processor pipelining can be found in the section High Throughput vs. Low Latency.
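For instance, a hypothetical single-vector wrapper might simply pad the unused slots, trading wasted work for the throughput of the wider routine:

    /* Normalize one vector with the four-wide routine by filling the
       remaining slots with throwaway copies. The duplicate work rides
       in pipeline slots that would otherwise sit idle. */
    vector float NormalizeSingle( vector float v )
    {
        vector float pad1 = v, pad2 = v, pad3 = v;   /* garbage data */
        NormalizeFour( &v, &pad1, &pad2, &pad3 );
        return v;
    }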