Hardware - SIMD Executive Summary



Executive Summary

What is SIMD?

SIMD stands for Single Instruction Multiple Data. It is a way of packing N like operations (e.g. 8 adds), where N is usually a power of 2, into a single instruction. The operands for the instruction are packed into registers wide enough to hold the extra data. The advantage of this format is that, for the cost of a single instruction, N instructions' worth of work is performed. This can translate into very large speedups for parallelizable algorithms.
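As a rough illustration of the packing described above, here is a minimal sketch in C. It assumes a compiler that supports the GCC/Clang `vector_size` extension; the type name `float4` and the function `add4` are hypothetical names chosen for this example and are not part of AltiVec or SSE.

```c
#include <string.h>

/* Assumption: GCC/Clang vector extension. A float4 holds four 32-bit
   floats in one 128-bit value, which the compiler can keep in a SIMD
   register when the target has one. */
typedef float float4 __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float *a, const float *b, float *out)
{
    float4 va, vb, vsum;

    /* Pack four scalar operands into each vector value. */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    /* One vector add stands in for four scalar adds. */
    vsum = va + vb;

    /* Unpack the four results. */
    memcpy(out, &vsum, sizeof vsum);
}
```

On a target with a 128-bit SIMD unit, the compiler can typically turn the `va + vb` line into a single vector add instruction; on other targets it falls back to scalar code with the same result.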

Both the PowerPC and IA-32 architectures have SIMD vector extensions. On PowerPC, the extension is called AltiVec. On IA-32, the vector extensions were introduced gradually, first as the Intel MultiMedia eXtensions (MMX) and later as the Intel Streaming SIMD Extensions (SSE, SSE2, SSE3). Common areas where SIMD can yield very large improvements in speed include 3-D graphics (Electric Image, games), image processing (Quartz, Photoshop filters), video processing (MPEG, MPEG2, MPEG4), theater-quality audio (Dolby AC-3, DTS, mp3), and high performance scientific calculations. SIMD units are present on all G4, G5 and Pentium 3/4/M class processors.

Why do we need SIMD?

SIMD offers greater flexibility and more opportunities for better performance in video, audio and communications tasks, which are increasingly important for applications. It provides a cornerstone for robust and powerful multimedia capabilities that significantly extend the scalar instruction set.

How do AltiVec Features Compare with SSE, SSE2 & SSE3?

AltiVec and SSE/SSE2/SSE3 are similar in some ways. They are both Single Instruction Multiple Data (SIMD) vector units with what are formally 128-bit register files. A single instruction (e.g. add) encodes the parallel addition of all elements in one register to the corresponding elements in another register. Indeed, approximately 60% of the instructions in the AltiVec ISA have direct counterparts in the Intel SSE/SSE2/SSE3 architecture. There are some differences, however:

AltiVec	SSE, SSE2 & SSE3
32 separate registers	8 XMM registers
max throughput: 8 Flops / cycle	max throughput: 4 Flops / cycle
32-bit saturated arithmetic	no 32-bit saturated arithmetic
unsigned compares	no unsigned compares
throughput of 1/cycle for all instructions	throughput of one every other cycle for most instructions
IEEE-754 (Java subset) compliant	Fully IEEE-754 compliant
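For readers unfamiliar with the term, saturated arithmetic clamps a result to the largest (or smallest) representable value instead of wrapping around. The table above concerns 32-bit elements; SSE2 does provide saturating adds for 8-bit and 16-bit elements, so the sketch below uses unsigned 8-bit values to show the behavior. The function name `add_sat_u8x16` is hypothetical, and the scalar branch is only a fallback for builds where SSE2 is not available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: adds 16 unsigned bytes element by element,
   clamping each sum to 255 instead of letting it wrap around. */
void add_sat_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__SSE2__)
    /* Unaligned loads and stores, so callers need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One instruction performs all 16 saturating adds. */
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
#else
    /* Scalar fallback with the same clamping behavior. */
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
#endif
}
```

With saturation, 200 + 100 yields 255; an ordinary wrapping 8-bit add would yield 44. This matters for pixel and audio data, where wraparound shows up as visible or audible glitches.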

Am I going to get 38.2 GFlops on a Dual 2.5 GHz G5 for everything?

No. The actual performance depends on the function and the algorithm used. The theoretical peak performance of a 2.5 GHz dual processor G5 machine is calculated as:

(2.5 x 10⁹ cycles / s) * (8 FP ops / cycle) * (2 processors) = 40 x 10⁹ FP ops / s = 40 GFlops

Because this peak exceeds 38.2 GFlops, it is possible that you may write a function that performs even better than 38.2 GFlops. Other functions may never reach this speed. We advertise 38.2 GFlops because that is the speed of the fastest function we have tested. It comes from a convolution function that is among the many vectorized functions in Accelerate.framework, a standard part of Mac OS X. Below is a small table of a few Accelerate.framework functions and the average number of GFlops we measure for them over a number of runs on a 2.5 GHz dual processor machine.

		GFlops
convolution (2048 x 256)		38.2
complex 1024 FFT		23.0
real 1024 FFT		19.8
dot product (1024)		18.3

Skeptical about these times? You can download our complete source here, and benchmark for yourself.
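The peak-rate arithmetic in this section can be sketched as a pair of small C functions. The names `peak_gflops` and `fraction_of_peak` are hypothetical helpers introduced only for illustration; the inputs are the figures quoted in the text (clock rate in GHz, floating-point operations per cycle, processor count).

```c
/* Hypothetical helper: theoretical peak in GFlops.
   GHz is 10^9 cycles per second, so
   GHz * (FP ops / cycle) * processors is already in 10^9 FP ops / s. */
double peak_gflops(double clock_ghz, int flops_per_cycle, int processors)
{
    return clock_ghz * (double)flops_per_cycle * (double)processors;
}

/* Hypothetical helper: how close a measured rate comes to the peak,
   as a fraction between 0 and 1. */
double fraction_of_peak(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

For example, `peak_gflops(2.5, 8, 2)` gives 40.0, and a function measured at 38.2 GFlops is running at roughly 95% of that peak. Most real functions land well below this, because memory traffic and data dependencies keep the vector unit from issuing a full-width operation every cycle.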

Why should a developer care about SIMD?

SIMD can provide a substantial boost in performance and capability for an application that makes significant use of 3D graphics, image processing, audio compression or other calculation-intensive functions. Other features of a program may be accelerated by recoding them to take advantage of the parallelism and additional operations of SIMD. Apple is adding SIMD capabilities to Core Graphics, QuickDraw and QuickTime. An application that calls them today will see improvements from SIMD without any changes. SIMD also offers the potential to create new applications that take advantage of its features and power. To take advantage of SIMD in its own code, an application must be reprogrammed or at least recompiled; however, you do not need to rewrite the entire application. SIMD typically works best for that 10% of the application that consumes 80% of your CPU time -- these functions typically have heavy computational and data loads, two areas where SIMD excels.

Is SIMD easy to learn?

Neither SIMD environment supported by Apple requires you to write in assembly. By taking advantage of the AltiVec C Programming Model and the Intel C Programming Model, developers may leverage their experience with C, C++ or Objective-C for easier entry into SIMD.
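To give a feel for the two C programming models, the sketch below computes a four-element multiply-add with AltiVec intrinsics when the compiler targets AltiVec, with SSE intrinsics when it targets SSE, and with a plain loop otherwise. The function name `madd4` is hypothetical. The AltiVec branch assumes all four pointers are 16-byte aligned, since `vec_ld` and `vec_st` ignore the low four address bits; the SSE branch uses unaligned loads and stores for simplicity.

```c
#if defined(__ALTIVEC__)
#include <altivec.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i = 0..3. */
void madd4(const float *a, const float *b, const float *c, float *out)
{
#if defined(__ALTIVEC__)
    /* AltiVec: assumes 16-byte aligned pointers. */
    vector float va = vec_ld(0, a);
    vector float vb = vec_ld(0, b);
    vector float vc = vec_ld(0, c);
    /* vec_madd is a single multiply-add instruction. */
    vec_st(vec_madd(va, vb, vc), 0, out);
#elif defined(__SSE__)
    /* SSE: separate multiply and add instructions. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    _mm_storeu_ps(out, _mm_add_ps(_mm_mul_ps(va, vb), vc));
#else
    /* Scalar fallback for targets with neither extension. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] * b[i] + c[i];
    }
#endif
}
```

In both models the intrinsics look like ordinary C function calls on vector types, and the compiler handles register allocation and instruction scheduling.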

However, before writing your own code, look to see what we have already done for you! Apple provides a number of highly tuned vector libraries in Accelerate.framework on Mac OS X. There you will find BLAS, FFTs, and a suite of basic vector math routines such as sine, cosine and square root. Mac OS X 10.3, Panther, also provides high performance image processing primitives in the vImage framework.
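As one example of leaning on the library instead of writing vector code by hand, the sketch below computes a dot product with `vDSP_dotpr` from Accelerate.framework when built on an Apple platform. The wrapper name `dot_product` is hypothetical, and the scalar loop is only a stand-in so the same file builds elsewhere. On Mac OS X the program would need to be linked with `-framework Accelerate`.

```c
#include <stddef.h>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: returns the sum of a[i] * b[i] for i = 0..n-1. */
float dot_product(const float *a, const float *b, size_t n)
{
#if defined(__APPLE__)
    float result = 0.0f;
    /* Strides of 1 mean both inputs are read as contiguous arrays. */
    vDSP_dotpr(a, 1, b, 1, &result, (vDSP_Length)n);
    return result;
#else
    /* Scalar stand-in for platforms without Accelerate.framework. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
#endif
}
```

The caller never touches a vector type or intrinsic here; the library picks the vectorized code path.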

Steps to get started with SIMD:

These web pages and the accompanying materials provide an introduction to SIMD. Developers can gain familiarity with SIMD by downloading the examples and tools. Don't miss our Quick Start page for developers new to SIMD who want to get their hands dirty quickly. FORTRAN developers should also see the new FORTRAN page for tips on integrating SIMD into FORTRAN apps.

Why is this information provided?

This page, along with its tool download area, has been created to encourage and facilitate the development of SIMD code benefiting Apple technologies.
