Apple Developer Connection

Data Handling and Data Formats

Sometimes it seems that everywhere you look, somebody has found a way to take two perfectly good arrays of data and interleave them into one array in a more complicated format, e.g. an array of structs. In fact, the basic data pattern for nearly every preexisting data format you are likely to use with AltiVec (color pixels, stereo sound, 3D geometry...) is an array of non-uniform vectors (see Understanding SIMD). The originators of these formats were probably trying to enforce good locality of reference for optimal cache usage, or perhaps trying to make use of a special processor feature that automatically incremented a pointer after use. Unfortunately, interleaved data is a fact of life and something you are going to have to deal with. There are two common solutions to this problem:

 

Vector Friendly Data Formats

The easiest, most efficient thing to do is to refuse to use these sorts of data formats. Instead, use formats that are composed of arrays of uniform vectors. Uniform vectors can be loaded directly into vector registers without swizzling or alignment headaches and can be used as-is, without modification. This may be an interleaved format that interleaves data channels every sixteen bytes rather than every byte, or it may be a format that uses an independent array for each data channel -- no interleaving whatsoever. The latter are sometimes called planar data formats. Unfortunately, such "new" data formats make it difficult to work with external libraries or operating systems that use traditional interleaved formats. If neither the OS nor third-party libraries need to see your data, then your own vector-friendly data format may work very well. Where infrequent interaction with the OS or third-party libraries is required, it may be necessary to convert between formats "on demand."

When designing a vector-friendly data format, try to keep related data in the same cacheline. For example, if you need to change a pixel, it would be nice if the red, green and blue values for that pixel all lived in the same aligned 32-byte block. Some consideration as to format may be required to achieve this. For example, you can probably do this quite nicely with YUV-422 pixel data, but probably not with RGB pixel data. (YUV-422 data shares U and V channels between adjacent pixels, which means you can store 16 YUV pixels in a cacheline. However, 16 RGB pixels would take up 48 or 64 bytes, depending on the presence of an alpha channel.) This is only really important if you expect to be changing individual data values, however. Since the basic idea in the vector unit is to change 4, 8 or 16 values in parallel, you may find that this is a non-issue: the net cost gradually disappears as you change larger numbers of points. If you go so far as to use a separate array for each channel, you may actually discover significant memory-associated performance gains if this allows you to completely avoid touching one of the channels (e.g. a pixel's alpha channel) much of the time. Many applications never touch the alpha channel, but incur 33% more memory overhead because it is interleaved with their data. As the performance of much AltiVec code is limited by the rate at which data can be fetched from memory, this may have significant implications for application performance.
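To illustrate the planar idea in scalar terms, here is a minimal sketch (the type and function names are hypothetical, not part of any API discussed above): an interleaved RGBA buffer, a planar equivalent with one independent array per channel, and an "on demand" conversion between them. With the planar layout, code that never needs alpha never has to load it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleaved ("array of structs") layout: R,G,B,A repeat every 4 bytes. */
typedef struct {
    uint8_t *pixels;        /* 4 * count bytes */
    size_t   count;
} InterleavedImage;

/* Planar layout: one uniform array per data channel, no interleaving. */
typedef struct {
    uint8_t *r, *g, *b, *a; /* count bytes each */
    size_t   count;
} PlanarImage;

/* One-time "on demand" conversion from interleaved to planar. */
void Deinterleave( const InterleavedImage *src, PlanarImage *dst )
{
    for (size_t i = 0; i < src->count; i++) {
        dst->r[i] = src->pixels[4*i + 0];
        dst->g[i] = src->pixels[4*i + 1];
        dst->b[i] = src->pixels[4*i + 2];
        dst->a[i] = src->pixels[4*i + 3];
    }
}
```

A routine that, say, desaturates the image would then read only the r, g and b arrays, leaving the a array cold in memory.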

Applications that use vector friendly data storage formats are the most likely to see their performance directly scale a full 4, 8 or 16 fold as a result of vector parallelism.

 

On-the-Fly Format Conversion

The other solution to the problem is to use the non-uniform storage format but convert it to a vector-friendly one in register before using it. This process is called swizzling or data recombination. There are a number of approaches to dealing with this problem. In many cases, vec_perm() is a simple and effective approach to extracting arbitrary bytes from a 32 byte block of data. Similarly, you can often get quite far with vec_pack() and vec_merge(h/l)(). However, these suffer from the problem that they only are able to draw from 32 byte inputs to produce a 16 byte output. If your input data is a little bit more scattered than that, for example, if your interleaved data has four channels spread over 64 bytes, this problem can usually be solved more efficiently instead by doing a matrix transpose on a block of N vectors of identical non-uniform type, to yield a block of N vectors of uniform type.
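The transpose idea can be shown without any intrinsics at all. The scalar sketch below treats four XYZW quartets as the rows of a 4x4 matrix; transposing it turns the non-uniform rows into uniform rows of all X's, all Y's, all Z's and all W's. (A vector implementation would of course do this in registers with merge or permute operations rather than element by element.)

```c
/* Swizzle by transpose, scalar sketch: rows in are XYZW quartets,
   rows out are uniform vectors (all X, all Y, all Z, all W). */
void Transpose4x4( float m[4][4] )
{
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++) {
            float t = m[i][j];      /* swap across the diagonal */
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}
```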

Working with uniform types internally makes things a lot easier, and usually a lot more efficient, as we show in the following example. Suppose you want to normalize the following four vectors, V1-4, to make vectors of unit length (i.e. whose sum of squares, X² + Y² + Z², is 1.0):

V1:  X1  Y1  Z1  W1
V2:  X2  Y2  Z2  W2
V3:  X3  Y3  Z3  W3
V4:  X4  Y4  Z4  W4

(Each row represents the contents of a vector float.)

Further suppose that the W element in each is an artificial value used to indicate whether the quartet represents a point or a vector. (Its value is 1.0 if the vector represents a point, and 0.0 if it is a vector.) If you try to find the length of the vector as:

length = (X² + Y² + Z² + W²)^(1/2)

...and W is non-zero, you would get the wrong answer. Unfortunately, this is exactly what SIMD is going to do to you if you simply square all the elements in each vector and then add across it. In addition, once we know the length, we don't want to divide W by it, only X, Y and Z. Working directly with this non-uniform format therefore requires that we mask off the W element, normalize the other three values, then put the W value back.

The following function normalizes four vectors at once using this approach. We do four at once for efficiency and also to provide a better comparison with the approach presented immediately afterwards.

//Using non-uniform vectors
void Normalize4( vector float v[4] )
{
    vector float v1, v2, v3, v4, lengthSquared, oneOverLength;
    vector float zero = (vector float) (0);
    vector unsigned char minusOne = vec_splat_u8(-1);
    vector float wMask = vec_sld( zero, (vector float) minusOne, 4 );

    //Load in our four vectors and mask out W
    v1 = vec_andc( v[0], wMask );
    v2 = vec_andc( v[1], wMask );
    v3 = vec_andc( v[2], wMask );
    v4 = vec_andc( v[3], wMask );

    //Square each term
    v1 = vec_madd( v1, v1, zero );
    v2 = vec_madd( v2, v2, zero );
    v3 = vec_madd( v3, v3, zero );
    v4 = vec_madd( v4, v4, zero );

    //Now add across each vector.
    //The sum will be written back to the vector
    //and duplicated across all four cells.
    v1 = vec_add( v1, vec_sld( v1, v1, 8) );
    v2 = vec_add( v2, vec_sld( v2, v2, 8) );
    v3 = vec_add( v3, vec_sld( v3, v3, 8) );
    v4 = vec_add( v4, vec_sld( v4, v4, 8) );
    v1 = vec_add( v1, vec_sld( v1, v1, 4) );
    v2 = vec_add( v2, vec_sld( v2, v2, 4) );
    v3 = vec_add( v3, vec_sld( v3, v3, 4) );
    v4 = vec_add( v4, vec_sld( v4, v4, 4) );

    //Now we have a problem: the next thing we need to do is
    //calculate 1/length from length squared, but the four
    //lengths squared are in four different vectors!

    //Put them into one vector
    lengthSquared = vec_sld( v1, v2, 4 );
    lengthSquared = vec_sld( lengthSquared, v3, 4 );
    lengthSquared = vec_sld( lengthSquared, v4, 4 );

    //Find 1 / length for each
    oneOverLength = ReciprocalSquareRoot( lengthSquared );

    //Multiply this back into the starting vectors
    v1 = vec_madd( v[0], vec_splat( oneOverLength, 0 ), zero );
    v2 = vec_madd( v[1], vec_splat( oneOverLength, 1 ), zero );
    v3 = vec_madd( v[2], vec_splat( oneOverLength, 2 ), zero );
    v4 = vec_madd( v[3], vec_splat( oneOverLength, 3 ), zero );

    //Mask in the original W values
    v[0] = vec_sel( v1, v[0], (vector bool int) wMask );
    v[1] = vec_sel( v2, v[1], (vector bool int) wMask );
    v[2] = vec_sel( v3, v[2], (vector bool int) wMask );
    v[3] = vec_sel( v4, v[3], (vector bool int) wMask );
}

Inspection reveals that we don't seem to be able to do anything in this function without first juggling data around in the permute unit. About half or two-thirds of this routine seems to be permute operations. Would you be surprised to learn that this was actually a relatively easy example because there was a lot of symmetry in the operation? Imagine the chaos had we decided to do a cross product!

Suppose instead we transpose the data first and use it that way. (We will transpose it back later when we are done.) Here is the transpose:

X1  X2  X3  X4
Y1  Y2  Y3  Y4
Z1  Z2  Z3  Z4
W1  W2  W3  W4

(Each row represents the contents of a vector float; column N holds the elements of the original vector VN.)

It should be clear that each row now contains a uniform vector. We have one vector to hold the X's, one for the Y's, one for the Z's and one for the W's. Using this new data layout, we can normalize V1-4 without worrying about what happens to W. The W elements just sit quietly in place and we never touch them. When we are done normalizing, we simply transpose back to return to the original data layout. Notice how much simpler and almost self-documenting the core of the calculation becomes. The new function for normalizing four vectors looks much like the scalar code for normalizing a single XYZW vector, except that we do four in parallel:

void NormalizeUniform4( vector float v[4] )
{
    vector float lengthSquared, oneOverLength;
    vector float zero = (vector float) (0);

    //Load the four vectors into x, y, z, w.
    //These won't hold all-X, all-Y, all-Z and all-W data
    //until after the transpose.
    register vector float x = v[0];
    register vector float y = v[1];
    register vector float z = v[2];
    register vector float w = v[3];

    //Transpose the 4 vectors to create 4 uniform vectors for
    //X, Y, Z and W
    TRANSPOSE4( x, y, z, w ); //an eight-instruction macro

    // length² = X² + Y² + Z²
    lengthSquared = vec_madd( x, x, zero );
    lengthSquared = vec_madd( y, y, lengthSquared );
    lengthSquared = vec_madd( z, z, lengthSquared );

    // 1 / length = 1 / sqrt( length² )
    oneOverLength = ReciprocalSquareRoot( lengthSquared );

    //Normalize X, Y and Z by multiplying each by 1 / length
    x = vec_madd( x, oneOverLength, zero );
    y = vec_madd( y, oneOverLength, zero );
    z = vec_madd( z, oneOverLength, zero );

    //Convert the X, Y, Z and W vectors back to our starting
    //four vectors, now normalized
    TRANSPOSE4( x, y, z, w );

    v[0] = x;
    v[1] = y;
    v[2] = z;
    v[3] = w;
}

 

Swizzling with Transposes vs. ad hoc Swizzling

In general, swizzling with transposes works best for complex functions that work with a large amount of interleaved data (>= 64 bytes). As functions go, normalizing a vector is a fairly trivial operation. If this function grew to do more than just normalize four vectors, the transpose overhead would become a smaller fraction of the overall function cost. Thus, more complex functions tend to incur proportionally less overhead from the swizzling costs associated with non-uniform storage formats, and tend to be more efficient. As we will learn in the Memory Performance section, there are other very good reasons to prefer a few complex AltiVec functions over many simple ones. Of course, you do not always want to do a full matrix transpose to swizzle and unswizzle data. There are many special cases with cheaper solutions, particularly in small functions that handle limited amounts of data. Matrix transposes just happen to be a cheap way of swizzling a lot of interleaved data that, by happy coincidence, also produces uniform vectors.

 

Speed

How much does the transpose actually cost us in this case? Each TRANSPOSE4() macro is eight vec_merge() instructions. (You can see the source for the TRANSPOSE4() macro here. We present versions for 8x8 and 16x16 matrix transposes in the Programming Examples. Matrix transposes typically have a cost that rises as N·log₂(N) operations.) Thus, for six vector instructions plus about six more for ReciprocalSquareRoot(), we pay a swizzle overhead of sixteen permutes! You may ask yourself again whether using classic data types is really worth all this overhead. As it turns out, the earlier function that tried to do everything with non-uniform vectors, Normalize4(), spent even more time swizzling data into the right place, so doing it all at once in this way seems to have been a wise trade-off after all. It was also a lot easier to write the function.
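The eight-merge structure is easy to see with scalar stand-ins for vec_mergeh and vec_mergel. This sketch (the Vec4 type and helper names are hypothetical) performs a 4x4 transpose as two passes of four merges each, i.e. N·log₂(N) = 4·2 = 8 merge operations:

```c
#include <assert.h>

typedef struct { float e[4]; } Vec4;

/* Scalar stand-in for vec_mergeh: interleave the high halves. */
static Vec4 MergeHigh( Vec4 a, Vec4 b )
{ Vec4 r = {{ a.e[0], b.e[0], a.e[1], b.e[1] }}; return r; }

/* Scalar stand-in for vec_mergel: interleave the low halves. */
static Vec4 MergeLow( Vec4 a, Vec4 b )
{ Vec4 r = {{ a.e[2], b.e[2], a.e[3], b.e[3] }}; return r; }

/* 4x4 transpose in eight merges: two passes of four. */
void Transpose4( Vec4 *a, Vec4 *b, Vec4 *c, Vec4 *d )
{
    Vec4 t0 = MergeHigh( *a, *c );  /* a0 c0 a1 c1 */
    Vec4 t1 = MergeLow ( *a, *c );  /* a2 c2 a3 c3 */
    Vec4 t2 = MergeHigh( *b, *d );  /* b0 d0 b1 d1 */
    Vec4 t3 = MergeLow ( *b, *d );  /* b2 d2 b3 d3 */
    *a = MergeHigh( t0, t2 );       /* a0 b0 c0 d0 */
    *b = MergeLow ( t0, t2 );       /* a1 b1 c1 d1 */
    *c = MergeHigh( t1, t3 );       /* a2 b2 c2 d2 */
    *d = MergeLow ( t1, t3 );       /* a3 b3 c3 d3 */
}
```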

This is how the two functions compare:

Execution times (G4/866):

                       Execution Time (seconds)    Throughput (vectors per second)
Normalize4():          1.29 x 10^-7                3.10 x 10^7
NormalizeUniform4():   1.20 x 10^-7                3.33 x 10^7

Skeptical observers may have noted that the second function is not much faster despite its apparent brevity. The reason is that while the first function does a good job of filling the processor pipelines, the second one leaves them three-quarters empty much of the time. This is not a reason to condemn the second implementation. Instead, it is an opportunity to achieve greater throughput! If we fill those pipeline holes by processing more data concurrently (i.e. normalize 16 vectors at once), the second function gets four times as much done in not much more time than it takes now.

As proof, here is the performance of NormalizeUniform16(), which is the same algorithm as NormalizeUniform4(), except that it normalizes 16 vectors at once rather than four. Because we are better able to saturate the pipeline, its throughput is 2.6 times greater:

                       Execution Time (seconds)    Throughput (vectors per second)
NormalizeUniform16():  1.83 x 10^-7                8.74 x 10^7

Can we do the same with Normalize4(), the function that uses non-uniform vectors? No. It does not see similar throughput gains if we increase the amount of data fourfold, because it does not have as many available pipeline holes to fill with work. The only region of that code with many pipeline holes is the part that calculates the reciprocal square root. All we really manage to do by rewriting Normalize4() to process more data concurrently is make a larger, more ungainly function that no longer fits in the vector register file.

How is it that Normalize4() managed to fill all of its pipeline holes while NormalizeUniform4() did not? This is a typical by-product of the use of non-uniform vectors in a function. Often, in order to get around problems associated with trying to work on only some of the data in a vector, the function does a lot of redundant work (e.g. identical calculations within a vector), uses vectors with do-nothing elements that follow along for the ride (e.g. W), spends quite a bit of time shuffling data around with permute operations, or does work on only part of a vector and not the rest (e.g. vec_sel). All this inefficiency has the general effect of filling up pipeline holes in hand-optimized code.

A successful optimization strategy is to look for ways to create pipeline holes by removing do-nothing calculations and then fill them back up again with useful work.

Calculations with uniform vectors almost never involve do-nothing calculations, since each element is unique. This is why they in general make for cleaner, more efficient code.

A discussion of how to spot pipeline holes appears in the Performance Measurement section.

 

Economies of Scale

We do give up one thing to achieve this sort of throughput efficiency: the freedom to normalize just one vector at a time. If you prefer writing functions that complete a limited task in the least amount of time possible, using a somewhat slower function that does more work (potentially unnecessary work) might make you uneasy. However, before condemning this approach, it is important to ask what each approach really costs. To find out, we look at what would happen if we rewrote Normalize4() to normalize just one vector at a time. This new function, Normalize1(), takes a lowest-possible-latency approach, normalizing a single vector in the smallest possible amount of time. You may have already written some function like this if you work with 3D geometry. For reference, performance data for a scalar version is provided as well.

                       Execution Time (seconds)    Throughput (vectors per second)
Normalize1():          9.69 x 10^-8                1.03 x 10^7
ScalarNormalize():     2.92 x 10^-7                3.42 x 10^6

The scalar version isn't very optimized, so understandably it doesn't perform all that well. Fourfold parallelism in the vector register suggests that we should only be able to beat its throughput by a factor of four. The fact that NormalizeUniform16() has 25 times greater throughput shows clearly that the scalar method suffers quite a bit from its use of division and sqrt().

Surprisingly, or perhaps not so surprisingly, Normalize1() does not return much more quickly than does Normalize4(). Rather than reclaim a lot of cycles by normalizing fewer vectors, we mostly just created pipeline bubbles in our experiment. (Recall that we just said that they were full for the Normalize4() function earlier.) Importantly, the net additional cost for normalizing four vectors instead of one is fairly small. Furthermore, for less than the cost of normalizing two vectors with Normalize1(), you could have normalized four of them with NormalizeUniform4() or sixteen of them with NormalizeUniform16()! The lesson to learn should be clear:

Find ways to process more data in parallel in your application.
Small functions that operate on a single datum run very inefficiently.

If, most of the time that you normalize vectors, you normalize two or more at once, you would be a lot better off with a higher-capacity function, even if some of the time you just gave it one vector and a bunch of garbage data to normalize. This is also true for the scalar units, which, for reasons of either pipelining or superscalar design, likewise benefit from processing more data rather than less.

A discussion on high throughput vs. low latency programming and processor pipelining can be found in the section High Throughput vs. Low Latency.