
Accelerate.framework errata

Below is a list of known issues with Accelerate.framework / vecLib that may cause your application to operate incorrectly:

vsub
LAPACK thread safety
Fortran calling vecLib's CDOTC, CDOTU, ZDOTC, and ZDOTU.
vImage Scale operations
vImage Shear operations
PEM contains broken misalignment code

vsub

In MacOS X.2.7 (G5 only) and MacOS X.3.0 (G4 and G5 only), but not in MacOS X.2.8, earlier releases, or later revisions of MacOS X, vsub and vsubD swap their two input arguments. Rather than performing c = b - a over the length of the array, they do c = a - b. G3 is unaffected on any OS. There are two ways to work around this problem:

workaround 1

Determine the version of MacOS X at runtime and do the correct thing.

#include <CoreServices/CoreServices.h>
#include <vecLib/vDSP.h> //OR #include <Accelerate/Accelerate.h>

void Workaround_vsub( float *a, int aStride, float *b, int bStride, float *c, int cStride, int size )
{
    static SInt32 version = 0;
    float *temp;
    const SInt32 kSMEAGOL = 0x1027;
    const SInt32 kPANTHER_0 = 0x1030;

    //Do this only if the vector code path inside vsub is to be taken
    if(
        (1 == aStride) && (1 == bStride) &&
        (1 == cStride) && (8 <= size) &&
        ( ((int) a & 15) == ((int) b & 15) ) &&
        ( ((int) a & 15) == ((int) c & 15) )
      )
    {
        //Only call Gestalt once
        if( 0 == version )
            Gestalt( gestaltSystemVersion, &version );

        //Swap the arguments if necessary
        if( (version == kSMEAGOL) || (version == kPANTHER_0) )
        {    temp = a;    a = b;    b = temp;    }
    }

    vsub( a, aStride, b, bStride, c, cStride, size );
}

You can also use sysctl to determine the OS revision. This may be lighter weight for Mach-O applications.
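
For example (a sketch only; kern.osrelease reports the Darwin kernel release, and the correspondence of Darwin 6.x to MacOS X.2.x and Darwin 7.x to MacOS X.3.x is an assumption of this example):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdlib.h>

//Return the Darwin kernel release as major*100 + minor, e.g. 700 for Darwin 7.0
static int DarwinRelease( void )
{
    char release[32] = "";
    size_t size = sizeof( release );
    char *rest = NULL;
    long major, minor = 0;

    if( 0 != sysctlbyname( "kern.osrelease", release, &size, NULL, 0 ) )
        return 0;

    major = strtol( release, &rest, 10 );
    if( rest && '.' == *rest )
        minor = strtol( rest + 1, NULL, 10 );

    return (int) ( 100 * major + minor );
}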

workaround 2

Another way is to use vsmul() to negate a and then use vadd():

#include <vecLib/vDSP.h> //OR #include <Accelerate/Accelerate.h>

void Workaround_vsub( float *a, int aStride, float *b, int bStride, float *c, int cStride, int size )
{
    const float minusOne = -1.0f;

    //c = -a
    vsmul( a, aStride, &minusOne, c, cStride, size );

    //c = b + (-a)
    vadd( c, cStride, b, bStride, c, cStride, size );
}

This method is likely to be slower.

LAPACK thread safety

MacOS X applications that intend to call the LAPACK linear algebra APIs from multiple threads must take the following precautions to ensure correct results. LAPACK is part of the Accelerate and vecLib frameworks. Prototypes for its APIs can be found in:

/System/Library/Frameworks/vecLib.framework/Headers/clapack.h

In MacOS X Release 10.2, LAPACK is not thread-safe. Applications that intend to call the LAPACK APIs from multiple threads must implement their own locking discipline to prevent simultaneous execution of LAPACK routines.
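
A minimal sketch of such a locking discipline follows; sgetrf_ is used only as a representative routine, and the prototype shown assumes the f2c calling conventions used by clapack.h:

#include <pthread.h>

//f2c-style prototype for one representative LAPACK routine (see clapack.h)
extern int sgetrf_( int *m, int *n, float *a, int *lda, int *ipiv, int *info );

static pthread_mutex_t gLapackLock = PTHREAD_MUTEX_INITIALIZER;

//On 10.2, route every LAPACK call through a single lock so that no two
//threads are ever inside a LAPACK routine at the same time
void Locked_sgetrf( int *m, int *n, float *a, int *lda, int *ipiv, int *info )
{
    pthread_mutex_lock( &gLapackLock );
    sgetrf_( m, n, a, lda, ipiv, info );
    pthread_mutex_unlock( &gLapackLock );
}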

In MacOS X Release 10.3, LAPACK thread-safety is greatly enhanced. Applications that intend to call the LAPACK APIs from multiple threads must ensure that the following two initialization calls are completed before commencing simultaneous execution of LAPACK routines.

In C:

extern double slamch_(char *), dlamch_(char *);

(void) slamch_("e");
(void) dlamch_("e");

In FORTRAN:

REAL A, SLAMCH
DOUBLE PRECISION D, DLAMCH
EXTERNAL SLAMCH, DLAMCH

A = SLAMCH('e')
D = DLAMCH('e')

Fortran calling vecLib's CDOTC, CDOTU, ZDOTC, and ZDOTU.

The FORTRAN entry points in Mac OS X's vecLib adhere to the call/return conventions of g77.

In particular, with g77, the return value of a COMPLEX or DOUBLE COMPLEX function is stored to memory through a pointer. The caller must take care to pass that pointer in PPC general purpose register R3 according to the g77 ABI.

With xlf (and the emerging g95), COMPLEX and DOUBLE COMPLEX function return values are left in the PowerPC floating point register file. Modern implementations of the C language use the same approach and no doubt gave impetus to this characteristic of modern FORTRAN.

Just four Level 1 BLAS functions are at issue: CDOTC, CDOTU, ZDOTC, and ZDOTU. Each returns a COMPLEX (or DOUBLE COMPLEX) value. When xlf compiles a function invocation into a call to one of these routines, it expects to find the *return* value in the floating point register file. When g77 compiles a function invocation into a call to one of these routines, it expects to find the return value in a pre-allocated *memory* location. The vecLib implementation of these four functions is compatible with the g77 scheme, but not the xlf scheme.

xlf codes may incorporate the following "wrappers" that re-implement CDOTC, CDOTU, ZDOTC, and ZDOTU in terms of a utility *subroutine* already present in vecLib. There is no conflict between xlf and the call/return scheme of these vecLib subroutines. It is crucial, though, that the same compiler (e.g. xlf) compile both the callers of these replacements and the replacements themselves, so that the *function* return ABI matches. The utility subroutines (cblas_*_sub) are fully optimized for PowerPC.

!
! scp% /opt/ibmcmp/xlf/8.1/bin/xlf95 -o xlfabi xlfabi.f -Wl,-framework -Wl,vecLib
! ** abitest === End of Compilation 1 ===
! ** zdotc === End of Compilation 2 ===
! ** zdotu === End of Compilation 3 ===
! ** cdotc === End of Compilation 4 ===
! ** cdotu === End of Compilation 5 ===
! 1501-510 Compilation successful for file xlfabi.f.
! scp% ./xlfabi
! (0.000000000000000000E+00,-2.00000000000000000)
! (2.00000000000000000,0.000000000000000000E+00)
! (0.0000000000E+00,-2.000000000)
! (2.000000000,0.0000000000E+00)

      program abitest
      double complex zx(1), zy(1), ztemp
      double complex zdotc, zdotu
      complex cx(1), cy(1), ctemp
      complex cdotc, cdotu

      zx(1)=(1.0, 1.0)
      zy(1)=(1.0, -1.0)

      ztemp = zdotc(1, zx, 1, zy, 1)
      print *, ztemp

      ztemp = zdotu(1, zx, 1, zy, 1)
      print *, ztemp

      cx(1)=(1.0, 1.0)
      cy(1)=(1.0, -1.0)

      ctemp = cdotc(1, cx, 1, cy, 1)
      print *, ctemp

      ctemp = cdotu(1, cx, 1, cy, 1)
      print *, ctemp

      stop
      end

      double complex function zdotc(n, zx, incx, zy, incy)
      double complex zx(*), zy(*), z
      integer n, incx, incy

      call cblas_zdotc_sub(%val(n), zx, %val(incx), zy, %val(incy), z)

      zdotc = z
      return
      end

      double complex function zdotu(n, zx, incx, zy, incy)
      double complex zx(*), zy(*), z
      integer n, incx, incy

      call cblas_zdotu_sub(%val(n), zx, %val(incx), zy, %val(incy), z)

      zdotu = z
      return
      end

      complex function cdotc(n, cx, incx, cy, incy)
      complex cx(*), cy(*), c
      integer n, incx, incy

      call cblas_cdotc_sub(%val(n), cx, %val(incx), cy, %val(incy), c)

      cdotc = c
      return
      end

      complex function cdotu(n, cx, incx, cy, incy)
      complex cx(*), cy(*), c
      integer n, incx, incy

      call cblas_cdotu_sub(%val(n), cx, %val(incx), cy, %val(incy), c)

      cdotu = c
      return
      end

vImage Scale Operations

On MacOS X.3.{0,1,2}, the vImage Scale function may fail to properly translate the image vertically while it is scaling it. This can result in a resized image that is also translated, with the last pixel row expanded to occupy part of the image. It is recommended that you use the Affine Warp function instead, which does not have this problem. It may be slightly faster to use the low-level shearing functions to do scaling, since that is a two-pass algorithm instead of a three-pass algorithm.

On MacOS X.3 (any revision), the vImage Scale function does not correctly set the kvImageEdgeExtend flag internally. To avoid edge artifacts, pass this flag yourself when calling vImageScale* on MacOS X.3. This problem is fixed in MacOS X.4.
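
A minimal sketch of scaling via the affine warp path is shown below; the ARGB8888 variant, the caller-allocated buffers, and the zero-translation transform are illustrative assumptions only:

#include <Accelerate/Accelerate.h>

//Scale src into dest with vImageAffineWarp_ARGB8888 instead of vImageScale_ARGB8888.
//Both vImage_Buffers are assumed to be allocated and initialized by the caller.
vImage_Error ScaleViaAffineWarp( const vImage_Buffer *src, const vImage_Buffer *dest )
{
    Pixel_8888 backColor = { 0, 0, 0, 0 };
    vImage_AffineTransform transform;

    //A pure scale: map the source extents onto the destination extents
    transform.a = (float) dest->width / (float) src->width;
    transform.b = 0.0f;
    transform.c = 0.0f;
    transform.d = (float) dest->height / (float) src->height;
    transform.tx = 0.0f;
    transform.ty = 0.0f;

    return vImageAffineWarp_ARGB8888( src, dest, NULL, &transform, backColor, kvImageEdgeExtend );
}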

vImage Shear Operations

The 1D shear operations do not support the case where the destination buffer size in the dimension orthogonal to the shear dimension (plus the srcOffset in that dimension, if non-zero) is larger than the size of the source buffer in that dimension. These functions attempt to fill all of the destination buffer. So, for example, when doing a horizontal shear, if the destination buffer height is larger than the source buffer height, a crash may occur, since the destination buffer has more scanlines than the source buffer and filling the entire destination buffer would involve reading source scanlines that do not exist.

This limitation does not extend to size disparities in the shear dimension. In our horizontal shear example, if the width of the destination buffer is larger than the source buffer, the function handles the case gracefully, filling the residual space that does not map to any location in the source buffer with either the background color or the nearest edge pixel if kvImageEdgeExtend is used.

We do support oversized destination buffers in the orthogonal dimension through the AffineWarp functionality. The 1D shears are intended to be low-level bottleneck functions and have a few limitations that the higher-level functions do not.

AltiVec PEM misalignment algorithm errata

Section 3.1.6.1 of the AltiVec PEM details algorithms for dealing with loading and storing misaligned vectors. These algorithms are broken for the case where the data is actually 16 byte aligned, and may in rare circumstances lead to a segmentation fault. At issue is the second aligned vector load or store at address + 16 bytes. If the address is 16 byte aligned, this second load or store contains no valid bytes. If it happens to fall on a new page, that page may be unmapped, in which case your application will receive a segmentation fault. On MacOS 9, the memory space was one large contiguous region, so such segmentation faults rarely or never occurred. MacOS X is more at risk because the address space is fragmented into mapped and unmapped areas. Changes in malloc/valloc may expose your application to new arrangements of mapped and unmapped areas, which may cause applications that "worked" under previous OS releases to fail on later ones. This is not a bug in the operating system; it is a bug in the PEM misalignment algorithm, which causes accesses to already-aligned data to touch memory beyond that data.

The simplest solution to this problem is to do the second load or store at address + 15 bytes instead of address + 16. For single vector loads, the algorithm is as follows:

#define TYPE float /*put your vector type here*/
vector TYPE vec_load_unaligned( long offset, TYPE *address )
{
    vector unsigned char align = vec_lvsl( offset, address );
    vector TYPE load1 = vec_ld( offset, address );
    vector TYPE load2 = vec_ld( offset + 15, address );

    return vec_perm( load1, load2, align );
}

The situation for stores is similar, though some attention must be paid to store order.
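
One possible store sequence is sketched below, using the same TYPE macro and naming convention as the load example above; the function name and the high-before-low store order are choices of this sketch, not part of the PEM:

void vec_store_unaligned( vector TYPE v, long offset, TYPE *address )
{
    //Load the aligned vectors the store touches. Note +15, not +16, so that
    //a 16 byte aligned address never touches the next (possibly unmapped) vector.
    vector TYPE low = vec_ld( offset, address );
    vector TYPE high = vec_ld( offset + 15, address );

    //Rotate the new data into position and build a select mask:
    //0x00 bytes keep the old memory contents, 0xFF bytes take the new data
    vector unsigned char perm = vec_lvsr( offset, address );
    vector unsigned char mask = vec_perm( vec_splat_u8(0), vec_splat_u8(-1), perm );
    vector TYPE rotated = vec_perm( v, v, perm );

    //Merge the new bytes into the two aligned vectors
    high = vec_sel( rotated, high, (vector unsigned int) mask );
    low = vec_sel( low, rotated, (vector unsigned int) mask );

    //Store high before low. If the address is already 16 byte aligned, both
    //loads and stores refer to the same vector, and storing low last leaves
    //the new data in memory.
    vec_st( high, offset + 15, address );
    vec_st( low, offset, address );
}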

Note: Blind use of the code provided above may lead to disappointing performance. Clearly, quite a bit of this work can be recycled between adjacent misaligned vectors. Please see the section on fast misalignment handling for how to efficiently and correctly load misaligned vectors.

Note: Though we take great pains to avoid it (we have an entire cluster set up to detect this error condition), we have in the past occasionally fallen victim to this problem ourselves. If you experience crashes, the workaround is to allocate your image buffer and/or temp buffer to be 1 byte larger than it needs to be. If the problem is in the Accelerate.framework, please also file a bug report at http://bugreporter.apple.com against the Accelerate/X component.
