Mac OS X Reference Library Apple Developer

OpenCL Overview

Introduced with Mac OS X v10.6, OpenCL is a Mac OS X framework as well as an open standard for writing applications that make use of GPUs and multi-core CPUs. Using OpenCL you can make your applications faster by moving the most time-consuming routines to a separate device within the system (most commonly, a GPU). OpenCL abstracts the nuances of the particular hardware so you don’t need to write vendor-specific code in order to offload computation.

This chapter briefly summarizes the architecture of OpenCL and relates it to the OpenCL API. For detailed information about the OpenCL architecture and API, see The OpenCL Specification, available from the Khronos Group at http://www.khronos.org/registry/cl/.

OpenCL Terminology

All OpenCL applications make use of the same set of elements: hosts, devices, compute units, contexts, command queues, program objects, kernel functions, kernel objects, and memory objects. If you have past experience with parallel computing environments, some of these terms may seem familiar. However, understanding the nuances of these terms will help you get the most out of OpenCL on the Macintosh platform.

Devices

The OpenCL specification refers to any computational device on a computer system as a device. Within each device there are one or more units—referred to as compute units—that handle the actual computation. A compute unit is a hardware unit capable of performing computation by interpreting an instruction set. A device such as a central processing unit (CPU) may have one or more compute units. A CPU with multiple compute units is known as a multi-core CPU. Generally speaking, the number of compute units corresponds to the number of independent instructions that a device can execute at the same time. A dual-core CPU, for example, can execute two distinct instructions simultaneously.

CPUs commonly contain two to eight compute units, with the maximum increasing year-to-year. A graphics processing unit (GPU) typically contains many compute units—the GPUs in current Macintosh systems feature tens of compute units, and future GPUs may contain hundreds. As used by OpenCL, a CPU with 8 compute units is considered a single device, as is a GPU with 100 compute units.

Kernels

When you write a set of instructions in the OpenCL-C language intended for compilation and execution on a device, you’re creating an OpenCL kernel (also called a kernel function or a compute kernel). A kernel is essentially a function written in a language that enables it to be compiled for execution on any device that supports OpenCL. Although kernels are enqueued for execution by host applications written in C, C++, or Objective-C, a kernel must be compiled separately in order to be customized for the device on which it is going to run. You can write your OpenCL kernel source code in a separate file or include it inline in your host application source code. You can compile OpenCL kernels at runtime when launching the host application, or you can use a previously built binary.
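For example, the following OpenCL-C kernel adds two arrays element by element (a minimal sketch; the function and parameter names are illustrative):

```
__kernel void add_arrays(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    /* Each kernel instance (work-item) processes the element
       at its own position in the index space. */
    int i = get_global_id(0);
    result[i] = a[i] + b[i];
}
```

Because each instance operates on a different element, a single enqueue of this kernel can process an entire array in parallel.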

Kernel Objects

A kernel object encapsulates a specific kernel declared in a program, along with the argument values to use when executing this kernel.

Programs

An OpenCL program is a set of OpenCL kernels, auxiliary functions called by the kernels, and constants used by the kernels.

Contexts

The context is the environment in which OpenCL kernels execute. The context includes a set of devices, the memory accessible to those devices, and one or more command queues used to schedule execution of one or more kernels. A context is needed to share memory objects between devices.

Program Objects

An OpenCL program object is a data type that represents your OpenCL program. It encapsulates the following data:

  * the context associated with the program

  * the program source or binary

  * the latest successfully built program executable, along with the build options and build log used to produce it

  * the kernel objects currently attached to the program

Command Queues

OpenCL command queues are used to submit work to a device. Commands in a queue launch kernels on a device and manipulate memory objects. OpenCL executes the commands in the order in which you enqueue them.

Hosts

The program that calls OpenCL functions to set up the context in which kernels run and enqueue the kernels for execution is known as the host application. The device on which the host application executes is known as the host device. Before kernels can be run, the host application must complete the following steps:

  1. determine what compute devices are available

  2. select compute devices appropriate for the application

  3. create command queues for selected compute devices

  4. allocate the memory objects needed by the kernels for execution

Note that the host device (the CPU) can itself be an OpenCL device and can be used to execute kernel instances.
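The steps above can be sketched using the OpenCL API (a simplified illustration with error handling omitted; a real host application should check every result code):

```
#include <OpenCL/opencl.h>  /* on Mac OS X; use <CL/cl.h> on other platforms */

void set_up_opencl(void)
{
    cl_int err;
    cl_device_id device;

    /* 1. and 2. Determine what devices are available and select one
       (here, simply the first available GPU). */
    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create a context containing the selected device. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* 3. Create a command queue for the selected device. */
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    /* 4. Allocate a memory object (a buffer) for the kernel's data. */
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1024 * sizeof(float), NULL, &err);
}
```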

Memory Objects

A memory object is a handle to a region of global memory (see “Memory Model”). You can create memory objects to reserve memory on a device to store your application data. OpenCL uses two types of memory objects: buffer objects, which can contain any type of data, and image objects, which specifically represent images. The host application can enqueue commands to read from and write to memory objects.

OpenCL Operation Model

The operation of OpenCL can be described in terms of four interrelated models:

  * the platform model, which describes the relationship between the host and the OpenCL devices

  * the execution model, which describes how kernel instances are organized and run

  * the memory model, which describes the types of memory available to kernels

  * the programming models, which describe how an algorithm can be mapped onto OpenCL

Platform Model

As shown in Figure 1-1, the OpenCL device communicates with the host device—that is, the device on which the controlling application is running. Normally, the throughput of a compute device’s internal busses is much higher than the throughput of the external bus between the compute device and the host, so transferring data between the host and the compute device is comparatively expensive. Because the data transfer takes a long time, you need to do enough computation in each kernel to ensure that you are not limited by this latency. Note that the host, normally a CPU, can also be an OpenCL device.

Figure 1-1  OpenCL Platform Model

Execution Model

As described in “OpenCL Terminology,” execution of an OpenCL program involves simultaneous execution of multiple instances of a kernel on one or more OpenCL devices as queued and controlled by the host application. Each instance of a kernel is known as a work-item. Each work-item executes the same code, but on different parts of the data. Each work-item runs on a single core of a multiprocessor. When you submit a kernel to execute on a device, you define the number of work-items that you need to completely process all of your data. This is known as an index space. OpenCL supports index spaces of 1, 2, or 3 dimensions. For example, if you have a kernel that changes the color value of a single pixel and you have an image that is 64 pixels wide by 64 pixels high, you might want to define a 2-dimensional index space of 64 by 64 so that there is a work-item for each pixel in the image. The total number of work-items is practically unlimited; use the number that maps best to your algorithm. OpenCL takes care of distributing the work-items among the available processors.

Work-items can be organized into work-groups. OpenCL supports synchronization of computation between work-items in a work-group using barriers and memory fences, but does not permit such synchronization between different work-groups or work-items in separate work-groups.

Each work-item has a unique global ID, which is the location of the work-item in the index space. For example, the work-item in a 2-dimensional index space that is number 23 on the X axis and number 6 on the Y axis (counting from 0) has the global ID (23,6). Each work-group has a unique work-group ID, which is similar to the work-item global ID in that it specifies the work-group’s position in the index space. The work-group size in each dimension must divide evenly into the global number of work-items in that dimension. For example, if your global work size is 64 by 144, then your work-groups could each contain 8 x 24 work-items, so that the work-group array is 8 x 6. (8 work-groups in the X dimension x 8 work-items per work-group = 64 work-items in the X dimension; 6 work-groups in the Y dimension x 24 work-items per work-group = 144 work-items in the Y dimension.) The work-group at index 3 on the X axis and index 5 on the Y axis (again, counting from 0) has the work-group ID (3,5).

A work-item can also be located by a combination of the index of that item within its work-group (its local ID) plus the work-group ID of its work-group. For example, the work-item in the preceding example with the global ID (23,6) would be the last work-item in the third work-group in the X dimension (local index 7, work-group index 2) and the seventh work-item in the first work-group in the Y dimension (local index 6, work-group index 0).

In order to generalize your OpenCL application to run on a variety of hardware, you cannot hard-code into your program the number of work-items and work-groups to be used. Use the clGetDeviceInfo function with the selector CL_DEVICE_MAX_WORK_ITEM_SIZES and the clGetKernelWorkGroupInfo function with the selector CL_KERNEL_WORK_GROUP_SIZE to determine the maximum work-group size for a given device and kernel. These values vary from device to device and from kernel to kernel.
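For example, these limits can be queried as follows (a sketch that assumes `device` and `kernel` have already been created):

```
/* Maximum total work-group size for this kernel on this device. */
size_t max_work_group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_work_group_size),
                         &max_work_group_size, NULL);

/* Maximum number of work-items in each dimension of a work-group. */
size_t max_work_item_sizes[3];
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(max_work_item_sizes),
                max_work_item_sizes, NULL);
```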

The host application sets up the context in which the kernels run, including allocation of memory of various types (see the following section), transfer of data among memory objects, and creation of command queues used to control the sequence in which commands—including the commands that execute kernels—are run. You are responsible for synchronizing any necessary order of execution. The OpenCL API includes synchronization commands for this purpose.

You can use the OpenCL API to query the OpenCL runtime about its status and find out when your kernel has finished executing. Once it’s done, you can retrieve the results of the computation from OpenCL and repeat as necessary.

Memory Model

OpenCL generalizes the different types of memory available into global memory, constant memory, local memory, and private memory, as follows:

  * Global memory is available to all work-items executing on a device, and can be read and written by both the host and the device.

  * Constant memory is a region of global memory that remains constant during the execution of a kernel; work-items can read it but not write to it.

  * Local memory is shared by all of the work-items in a work-group.

  * Private memory is visible only to a single work-item.

When writing kernels in the OpenCL-C language, you declare data with address space qualifiers (__global, __constant, __local, or __private) to indicate whether the data resides in global, constant, local, or private memory; variables declared inside a kernel without a qualifier default to private memory.
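For example, a kernel’s declarations might combine several address spaces (an illustrative sketch; the names are made up):

```
__kernel void scale(__global float *data,      /* global memory */
                    __constant float *factor,  /* constant memory */
                    __local float *scratch)    /* local memory, shared by a
                                                  work-group (unused here) */
{
    /* i is declared without a qualifier, so it defaults to __private. */
    int i = get_global_id(0);
    data[i] = data[i] * factor[0];
}
```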

Programming Models

OpenCL supports data parallel and task parallel programming models:

As used in OpenCL, data parallel processing refers to many instances of the same kernel being executed simultaneously, with each instance operating on its own set of data. Each set of data is associated with a point in a 1-, 2-, or 3-dimensional index space.

Task parallel programming is similar to the familiar process of spawning multiple threads, each performing different tasks. In OpenCL terms, task-parallel programming involves enqueuing many kernels, and letting OpenCL run them in parallel using available processors.

OpenCL Architecture

The OpenCL framework includes three main components:

  * the OpenCL-C language

  * the OpenCL compiler

  * the OpenCL runtime

The OpenCL-C Language

The OpenCL-C language is based on the ISO/IEC 9899:1999 C language specification (also known as the C99 specification) with specific extensions and restrictions. It is a C-like programming language that can be compiled for, and run natively on, any processing unit that supports the OpenCL standard. The OpenCL-C language extensions include mathematical functions that make it easier to implement graphics and numerical-method algorithms to solve problems in engineering and science. You can use this language to implement computation-intensive portions of your code as OpenCL programs that can process many data sets in parallel.

The OpenCL compiler is included in the framework so that your application can compile your OpenCL program during execution. Therefore, you don’t need to compile and distribute different versions of your application for every possible OpenCL-enabled device on the market. Instead, your application can compile the OpenCL source code for the specific OpenCL devices on the system the first time your application runs. You can then cache the compiled code so that your application does not need to recompile it each time it runs.

The OpenCL runtime abstracts the underlying hardware from the operating system on which OpenCL is running.




Last updated: 2009-06-10
