
AltiVec Address Alignment

Unlike the scalar integer and floating point units, the AltiVec unit does not automatically or transparently handle unaligned loads and stores for you. If you try to load or store to an address that is not 16-byte aligned, the address is silently rounded down to the nearest aligned address, in an operation similar to (addr & ~15), and the data is loaded or stored from there. The vec_ld instruction and the other loads and stores always operate starting at a 16-byte boundary. Therefore, if a program wishes to use AltiVec instructions on data that may not be 16-byte aligned, the program must handle alignment in software.
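For reference, the aligned address that the hardware actually uses, and a quick test for 16-byte alignment, can be written as follows (a minimal sketch; the macro names are illustrative and not part of any AltiVec API):

#include <stdint.h>

// The address the load/store hardware actually uses: the low four bits cleared
#define ALIGN_DOWN_16( addr )       ( (void *) ( (uintptr_t)(addr) & ~(uintptr_t) 15 ) )

// True if an address already sits on a 16-byte boundary
#define IS_16_BYTE_ALIGNED( addr )  ( ( (uintptr_t)(addr) & 15 ) == 0 )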

 

Unaligned Loads and Stores of Entire Vectors

Using two vec_ld instructions, 32 contiguous bytes of data may be loaded, beginning at the 16-byte aligned boundary preceding the target data. This brings the desired data into register. To extract the desired vector from the two aligned vectors that contain it, an alignment permute vector is created using the vec_lvsl instruction (load vector for shift left). This instruction uses the four least significant bits of the target address to create a vector that the vec_perm instruction can use as a permute control. After execution of vec_perm, the desired unaligned data is returned.

Put simply, the target data is located at target. The first vec_ld loads the aligned 16 bytes containing the start of the target data into MSQ, while the second vec_ld loads the following 16 aligned bytes into LSQ. The permute vector is then created in mask using the four least significant bits of target. Finally, the permute instruction uses those 32 bytes of data and the permute vector to place the target data, now aligned, into result.

However, while this approach appears in the AltiVec Programming Environments Manual (3.1.6.1), it is not safe to do misaligned loads and stores that way if the target address is 16-byte aligned. If the target pointer is 16-byte aligned, the first load (MSQ) will load all 16 bytes of the desired data, and the second load (LSQ) will load 16 bytes of completely unknown data. If the target pointer happens to point to the last aligned vector in a page, then pointer + 16 will point to a completely new page. That page might be unmapped, in which case the load will trigger a segmentation fault. Direct use of the algorithms in the AltiVec PEM (section 3.1.6.1) for handling misaligned loads and stores may crash your application and is strongly discouraged.

The solution is to use pointer + 15 for the second load instead of pointer + 16. If pointer is not 16 byte aligned, then pointer + 15 will yield the same results as a load using pointer + 16 would. If pointer is 16 byte aligned, then pointer + 15 falls on the same vector as the one pointed to by pointer + 0. In this way, we always load vectors that contain valid data. (A vector load is always safe to do if at least one byte in the vector is known to exist.)

// Safe to use with aligned and unaligned addresses
vector unsigned char LoadUnaligned( unsigned char *target )
{
    vector unsigned char MSQ, LSQ;
    vector unsigned char mask;

    MSQ = vec_ld( 0, target );            // most significant quadword
    LSQ = vec_ld( 15, target );           // least significant quadword
    mask = vec_lvsl( 0, target );         // create the permute mask
    return vec_perm( MSQ, LSQ, mask );    // align the data
}

Similarly, you can use the vector unit to do unaligned stores, though the operation is somewhat more complicated. You must load in the two aligned vectors that bracket the store destination, write your data onto those vectors being careful to preserve those areas that are not to be overwritten, and then store them back. The code below uses the +15 technique to avoid crashing with aligned pointers:

// Mostly safe to use with aligned and unaligned addresses
void StoreUnaligned( vector unsigned char src, unsigned char *target )
{
    vector unsigned char MSQ, LSQ, edges;
    vector unsigned char edgeAlign, align;

    MSQ = vec_ld( 0, target );                  // most significant quadword
    LSQ = vec_ld( 15, target );                 // least significant quadword
    edgeAlign = vec_lvsl( 0, target );          // permute map to extract the edges
    edges = vec_perm( LSQ, MSQ, edgeAlign );    // extract the edges
    align = vec_lvsr( 0, target );              // permute map to misalign the data
    MSQ = vec_perm( edges, src, align );        // misalign the data (MSQ)
    LSQ = vec_perm( src, edges, align );        // misalign the data (LSQ)
    vec_st( LSQ, 15, target );                  // store the LSQ part first
    vec_st( MSQ, 0, target );                   // store the MSQ part
}

Some caution is still in order when using the StoreUnaligned() function above, for two reasons:

1) The code above reads in the two bracketing aligned vectors, splices fragments of the misaligned vector into each, and writes them back out again. If another thread changes data in the area covered by the two aligned vectors between the read and the write, that change can be undone. For this reason, the function above is not thread safe.

2) Storing single misaligned vectors in this fashion is not efficient when you have to store multiple contiguous misaligned vectors.

The only way to solve the thread safety problem is to store data using vector element stores. Here is one function that does that:

// Almost completely safe
void StoreUnaligned( vector unsigned char src, unsigned char *target )
{
    src = vec_perm( src, src, vec_lvsr( 0, target ) );
    vec_ste( (vector unsigned char)  src,  0, (unsigned char *)  target );
    vec_ste( (vector unsigned short) src,  1, (unsigned short *) target );
    vec_ste( (vector unsigned int)   src,  3, (unsigned int *)   target );
    vec_ste( (vector unsigned int)   src,  4, (unsigned int *)   target );
    vec_ste( (vector unsigned int)   src,  8, (unsigned int *)   target );
    vec_ste( (vector unsigned int)   src, 12, (unsigned int *)   target );
    vec_ste( (vector unsigned short) src, 14, (unsigned short *) target );
    vec_ste( (vector unsigned char)  src, 15, (unsigned char *)  target );
}

...but the above function is clearly still too expensive to be satisfactory for general use. It is also slightly dissatisfying because it is not atomic, which is why it isn't labeled completely safe. There is no way to do an atomic misaligned AltiVec load or store without falling back on other synchronization primitives, such as a mutex or the operations in OSAtomic.h.
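If atomicity with respect to other threads is required, one option is to serialize access to the destination region with a lock. The sketch below assumes a pthread mutex shared by every writer to the region and calls one of the StoreUnaligned() variants shown above; it is an illustration, not a recommendation of any particular locking scheme:

#include <pthread.h>

static pthread_mutex_t gRegionLock = PTHREAD_MUTEX_INITIALIZER;

void StoreUnalignedLocked( vector unsigned char src, unsigned char *target )
{
    // Every thread that touches this region must take the same lock, or the
    // read-modify-write inside StoreUnaligned can still lose updates.
    pthread_mutex_lock( &gRegionLock );
    StoreUnaligned( src, target );
    pthread_mutex_unlock( &gRegionLock );
}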

For reasons of performance and code simplicity, it is usually better simply to avoid misaligned stores. The best, fastest and simplest approach is to "just" align your store data. When that can't be done, it is often just fine to use the scalar engine to do the unaligned cases at the edges of the array and handle the aligned part in the middle using the vector unit. This approach can fail in two common cases:

1) The array consists of elements that are not themselves aligned — in this case a scalar loop may never arrive at a 16 byte aligned address.

2) You have to store to more than one misaligned array concurrently — there may be no single loop iteration at which both destination addresses are aligned.

In these cases, the optimum solution is a hybrid of store-by-element and full vector stores. Use element stores to do the partial vector stores at the edges of the misaligned array. That will prevent you from running into threading issues with other data near the edges of your data structures as your code matures. Then use aligned full vector stores with appropriately shifted data to handle the aligned region in the middle, optimized to remove redundant work. As an example of this strategy, consider a function that adds two arrays together to produce a third with arbitrary alignment (a simplified sketch appears below); such a function may be quickly modified to do a variety of different arithmetic operations. For best performance, if you have to choose between unaligned loads and unaligned stores, pick unaligned loads.
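The sketch below assumes the arrays are at least naturally (4-byte) aligned so that the scalar loop can reach a 16-byte boundary, and for brevity it uses plain scalar stores (rather than element stores) at the edges; the function name is illustrative:

void AddFloatArrays( const float *a, const float *b, float *dst, long count )
{
    // Scalar loop until the destination reaches a 16-byte boundary
    while( count > 0 && ( (unsigned long) dst & 15 ) )
    {
        *dst++ = *a++ + *b++;
        count--;
    }

    // Vector loop: aligned stores to dst, possibly misaligned loads from a and b
    while( count >= 4 )
    {
        vector float a0 = vec_ld( 0, a );      // the +15 trick keeps these loads
        vector float a1 = vec_ld( 15, a );     // safe even for aligned sources
        vector float b0 = vec_ld( 0, b );
        vector float b1 = vec_ld( 15, b );
        vector float vA = vec_perm( a0, a1, vec_lvsl( 0, a ) );
        vector float vB = vec_perm( b0, b1, vec_lvsl( 0, b ) );

        vec_st( vec_add( vA, vB ), 0, dst );   // destination is aligned here

        a += 4;   b += 4;   dst += 4;   count -= 4;
    }

    // Scalar loop for any trailing elements
    while( count-- > 0 )
        *dst++ = *a++ + *b++;
}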

Unaligned Loads and Stores of Single Elements

You may also load and store data from vector registers a single element at a time. Here too, the addresses are treated as quadword (16-byte) aligned: the address that you pass to vec_lde() or vec_ste() is rounded down to the nearest 16-byte boundary, and the element stored or loaded is the one in the aligned vector that corresponds to the actual address you pass. For example, an unsigned short loaded from address 0x14 lands in the third element (byte offset 4) of the resulting vector unsigned short:

 vector unsigned short A = vec_lde( offset, addr );

When the load is complete, the values in the other seven elements are undefined. Even though you may observe the other elements to be zeroed today, that may not always happen, so do not depend on that behavior.

As another example, suppose we store the element at byte offset 3 of a vector to an aligned vector:

vector unsigned char theVector = (vector unsigned char)
(10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
vector unsigned char vector2 = (vector unsigned char) (0);

//Store out the element in theVector at 3 + (char*)&vector2
vec_ste( theVector, 3, &vector2 );

vector2 now holds:

(vector unsigned char) (0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0);

Note that alignment can be tricky:

char array[256] = { 0, 1, 2, 3, 4, 5, 6...};

vector signed char result = vec_lde( 3, array );

In this example, the location of the value in the result vector is difficult to predict because we may not know the alignment of array. If the first byte of the array were at an address ending in 0x2, then we would end up loading the value 3 into the sixth element of the vector result. Similarly, if the first byte of the array were at an address ending in 0xF, the 3 would appear in the third element. If the array started at an address ending in 0x0, the 3 would be loaded into the fourth element.
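Put another way, the element index is simply the scalar's byte offset within its aligned quadword, divided by the element size. A one-line sketch (uintptr_t comes from <stdint.h>):

// Zero-based index of the element that receives array[3]
size_t elementIndex = ( (uintptr_t) &array[3] & 15 ) / sizeof( array[0] );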

To load a 16-byte aligned scalar from memory and splat it across all elements of a vector, simply load the value and splat it out:

typedef vector unsigned char Vec_UInt8;
typedef vector unsigned short Vec_UInt16;

// For loading scalars that are aligned to a 16-byte boundary
Vec_UInt16 LoadAndSplatVectorAlignedShort( unsigned short *dataPtr )
{
    Vec_UInt16 result = vec_lde( 0, dataPtr );

    // return the splatted result
    return vec_splat( result, 0 );
}

Unfortunately, the second argument of vec_splat must be a literal constant, so that method is only useful for scalars whose alignment within a 16-byte segment is known at compile time. Clearly, not every scalar has known alignment at compile time. In some cases, you can force data to have 16-byte alignment by placing it in a union with a vector type. Where that is not possible, scalars that are naturally aligned (i.e. the scalar itself is 1-, 2-, or 4-byte aligned, as appropriate for its size) can be rotated into a known position and splatted from there:

typedef vector unsigned char Vec_UInt8;
typedef vector unsigned short Vec_UInt16;

Vec_UInt16 LoadAndSplatAlignedShort( unsigned short *dataPtr )
{
    Vec_UInt16 result = vec_lde( 0, dataPtr );
    Vec_UInt8 moveToStart = vec_lvsl( 0, dataPtr );

    // Rotate the loaded value to position 0 of the vector
    result = vec_perm( result, result, moveToStart );

    // return the splatted result
    return vec_splat( result, 0 );
}

For scalars with arbitrary alignment, you must load the complete vector at the address and possibly the vector immediately following it. Load the second vector only if the scalar spills over into it; otherwise, an unmapped memory exception could occur. This can be done safely, without branching, as follows:

typedef vector unsigned char Vec_UInt8;
typedef vector unsigned short Vec_UInt16;

Vec_UInt16 LoadAndSplatUnalignedShort( unsigned short *dataPtr )
{
    Vec_UInt16 result = vec_ld( 0, dataPtr );
    Vec_UInt16 temp = vec_ld( sizeof(*dataPtr) - 1, dataPtr );
    Vec_UInt8 moveToStart = vec_lvsl( 0, dataPtr );

    // Rotate the loaded value to position 0 of the vector
    result = vec_perm( result, temp, moveToStart );

    // return the splatted result
    return vec_splat( result, 0 );
}

Data Alignment in MacOS

Blocks returned by the MacOS heap are all at least sixteen-byte aligned. To obtain a page-aligned heap block on MacOS X, you may use valloc(). In Carbon, you may use MPAllocateAligned() to create blocks that are aligned to a boundary that you specify. The MacOS stack frame is 16-byte aligned. Likewise, global storage also starts 16-byte aligned. Thus, maintaining proper alignment is typically just a matter of preserving the alignment of the memory areas given to you. On the stack, in globals, and in data structures (classes, structs), the compiler will automatically align data of vector type to 16 bytes. However, in some cases you may wish to access scalar arrays with the vector unit. In these cases, a good method of ensuring 16-byte alignment of a scalar array is to union it with a vector type:

union
{
    float scalarArray[ kArrayLength ];
    vector float v[ kArrayLength / vec_step( vector float ) ];
} alignedUnion;
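The vector member can then be used to process the same storage with the vector unit. For instance (a minimal sketch, assuming kArrayLength is a multiple of four):

// Double every element of scalarArray by running it through the vector unit
long i;
for( i = 0; i < kArrayLength / vec_step( vector float ); i++ )
    alignedUnion.v[i] = vec_add( alignedUnion.v[i], alignedUnion.v[i] );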

Unions are also a good way to transfer data from the scalar units to the vector unit, since they place data in areas of known alignment.

vector float FillVectorFloat( float f1, float f2, float f3, float f4 )
{
    union
    {
        float scalars[ vec_step( vector float ) ];
        vector float v;
    } buffer;

    buffer.scalars[0] = f1;
    buffer.scalars[1] = f2;
    buffer.scalars[2] = f3;
    buffer.scalars[3] = f4;

    return buffer.v;
}

Please note that it is usually highly inefficient to pass data between the scalar units and the vector unit, because the data cannot be transferred directly between register files. It must be written out to the caches and read back in by the new unit. If you are transferring data between units frequently, you are likely to be better off doing more or all of your calculation in the vector unit. The same is true for calculations that frequently transfer data between the two scalar units.

The vector unit can convert between integer and floating point types much more quickly, with no added load/store overhead, so if you are doing a lot of calculation that hops back and forth between the scalar integer and scalar FP units, you may be better off doing the whole thing in the vector unit, even if you only have a single int or float to work on. This is especially true on the G5, where passing data between units can cause a pipeline failure, flush, and retry, because the store and the subsequent load to the same address fall in the same dispatch group. (See the section on type conversion.)
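As a small illustration of the conversion point, four integers can be converted to floats entirely within the vector unit using vec_ctf, with no trip through the scalar register files (the function name is illustrative):

// Convert four signed ints to floats in the vector unit.
// The second argument of vec_ctf is a power-of-two scale exponent; 0 means no scaling.
vector float IntsToFloats( vector signed int v )
{
    return vec_ctf( v, 0 );
}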
