
1 OpenCL An Introduction for HPC programmers Benedict Gaster, AMD Tim Mattson, Intel - Page 1

2 Preliminaries: Disclosures - The views expressed in this tutorial are those of the people delivering the tutorial. - We are not speaking for our employers. - We are not speaking for Khronos. We take our tutorials VERY seriously: - Help us improve: give us feedback and tell us how you would make this tutorial even better. - Page 2

3 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Radix Sort: A survey of the rest of OpenCL - Page 3

4 It's a heterogeneous world A modern platform includes: one or more CPUs, one or more GPUs, DSP processors, other? [diagram: CPUs, GPU, GMCH, ICH, DRAM] OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform. GMCH = graphics memory controller hub, ICH = input/output controller hub - Page 4

5 Microprocessor trends Individual processors are many-core (and often heterogeneous) processors - e.g., 48 cores (Intel SCC processor), 160 cores (ATI RV770), 240 cores (NVIDIA Tesla C-series), 80 cores, 1 CPU + 6 cores (IBM Cell). 3rd-party names are the property of their owners. The heterogeneous many-core challenge: How are we to build a software ecosystem for the heterogeneous many-core platform? - Page 5

6 Industry Standards for Programming Heterogeneous Platforms CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP. GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages. The emerging intersection of the two is heterogeneous computing. OpenCL (Open Computing Language): an open, royalty-free standard for portable, parallel programming of heterogeneous platforms of CPUs, GPUs, and other processors. - Page 6

7 The BIG idea behind OpenCL Replace loops with functions (a kernel) executing at each point in a problem domain. - E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.
Traditional loops:
    void trad_mul(int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Data-parallel OpenCL:
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }   // execute over n work-items
- Page 7

8 The origins of OpenCL AMD/ATI: merged, needed commonality across products. Nvidia: GPU vendor - wants to steal market share from CPUs. Intel: CPU vendor - wants to steal market share from GPUs. Apple: was tired of recoding for many-core CPUs and GPUs; pushed the vendors to standardize and wrote a rough draft straw-man API. The Khronos Compute group formed, with Ericsson, Nokia, IBM, Sony, EA, Freescale, TI + many more. Third-party names are the property of their owners. Dec 08 - Page 8

9 OpenCL Working Group within Khronos Diverse industry participation - Processor vendors, system OEMs, middleware vendors, application developers. OpenCL became an important standard on release by virtue of the market coverage of the companies behind it. - Page 9

10 OpenCL Timeline 6 months to get from strawman to final draft: rapid innovation required to match the pace of HW innovation. 18 months from 1.0 to 1.1: backwards compatibility to protect SW investments. OpenCL 1.2 expected around Dec 12. During 2H09, multiple conformant implementations shipped across a diverse range of platforms. Timeline: Jun 08 - OpenCL working group formed within Khronos. Dec 08 - Khronos publicly releases the OpenCL 1.0 specification. Jun 10 - Khronos publicly releases the OpenCL 1.1 specification; conformant implementations available shortly thereafter. - Page 10

11 OpenCL: From cell phone to supercomputer OpenCL Embedded profile for mobile and embedded silicon - Relaxes some data type and precision requirements - Avoids the need for a separate ES specification Khronos APIs provide computing support for imaging & graphics - Enabling advanced applications in, e.g., augmented reality OpenCL will enable parallel computing in new markets - Mobile phones, cars, avionics Example: a camera phone with GPS processes images to recognize buildings and landmarks and provides relevant data from the internet - Page 11

12 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Reduction: - synchronization A survey of the rest of OpenCL - Page 12

13 OpenCL Platform Model One Host + one or more Compute Devices - Each Compute Device is composed of one or more Compute Units - Each Compute Unit is further divided into one or more Processing Elements - Page 13

14 Execution model The OpenCL execution model: define a problem domain and execute a kernel invocation for each point in the domain. - E.g., process a 1024 x 1024 image: global problem dimensions 1024 x 1024 = 1 kernel execution per pixel: 1,048,576 total kernel executions.
Scalar:
    void scalar_mul(int n, const float *a, const float *b, float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Data parallel:
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }   // execute over n work-items
- Page 14

15 An N-dimensional domain of work-items Global dimensions: 1024 x 1024 (whole problem space). Local dimensions: 128 x 128 (a work-group, which executes together). Synchronization between work-items is possible only within work-groups: barriers and memory fences. You cannot synchronize outside of a work-group. Choose the dimensions that are best for your algorithm. - Page 15

16 Examples: Work-Items and Work-Groups For a one-dimensional input of 26 elements split into two work-groups of 13: get_work_dim() = 1, get_global_size(0) = 26, get_num_groups(0) = 2, get_local_size(0) = 13. Work-group ids start at get_group_id(0) = 0; the highlighted work-item in the second group has get_local_id(0) = 8 and get_global_id(0) = 21. - Page 16

17 Kernel A data-parallel function executed for each work-item:
    kernel void square(global float* input, global float* output)
    {
        int i = get_global_id(0);
        output[i] = input[i] * input[i];
    }
[figure: each work-item squares the input element at get_global_id(0) into the corresponding output element]
- Page 17

18 OpenCL Memory Model Private memory - per work-item. Local memory - shared within a work-group. Global/constant memory - visible to all work-groups. Host memory - on the CPU. [figure: host plus a compute device; each work-item has private memory, each work-group has local memory, and all work-groups share global/constant memory] Memory management is explicit: you must move data from host -> global -> local and back. - Page 18

19 Memory Consistency OpenCL uses a relaxed-consistency memory model; i.e. - the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times. Within a work-item: - memory has load/store consistency. Within a work-group: - local memory is consistent between work-items at a barrier. Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups. Consistency of memory shared between commands (e.g. kernel invocations) is enforced through synchronization (events). - Page 19
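A small kernel sketch (illustrative names, not from the slides) of the work-group rule above: the barrier is what makes one work-item's local write safely visible to its neighbors.
    // Without the barrier, a work-item might read a stale value of its
    // neighbor's slot; local memory is only guaranteed consistent across
    // the work-group at the barrier.
    kernel void shift_left(global float* data, local float* scratch)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int n   = get_local_size(0);
        scratch[lid] = data[gid];           // each work-item writes its slot
        barrier(CLK_LOCAL_MEM_FENCE);       // local memory now consistent
        data[gid] = scratch[(lid + 1) % n]; // safe to read a neighbor's write
    }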

20 OpenCL C Language Derived from ISO C99 - No standard C99 headers, function pointers, recursion, variable-length arrays, or bit-fields Additions to the language for parallelism - Work-items and work-groups - Vector types - Synchronization Address space qualifiers Optimized image access Built-in functions - Page 20

21 Data Types Scalar data types - char, uchar, short, ushort, int, uint, long, ulong, float, (double?) - bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage) Image types - image2d_t, image3d_t, sampler_t Vector data types - Page 21

22 Vector Types Portable. Vector lengths of 2, 4, 8, and 16: char2, ushort4, int8, float16, double2, ... Endian safe. Aligned at the vector length. Vector operations (elementwise) and built-in functions. - Page 22

23 Vector Operations Vector literals:
    int4 vi0 = (int4) -7;            // (-7, -7, -7, -7)
    int4 vi1 = (int4)(0, 1, 2, 3);
Vector components:
    vi0.lo = vi1.hi;                 // vi0 is now (2, 3, -7, -7)
    int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);  // (2, 3, -7, -7, 0, 1, 1, 3)
Vector ops (elementwise):
    vi0 += vi1;                      // (2, 4, -5, -4)
    vi0 = abs(vi0);                  // (2, 4, 5, 4)
- Page 23

24 Contexts and Queues Contexts are used to contain and manage the state of the world. Kernels are executed in contexts defined and manipulated by the host: - Devices - Kernels - OpenCL functions - Program objects - kernel source and executable - Memory objects. A command-queue coordinates execution of kernels: - Kernel execution commands - Memory commands - transfer or mapping of memory object data - Synchronization commands - constrain the order of commands. Applications queue compute-kernel execution instances: - Queued in-order - Executed in-order or out-of-order - Events are used to synchronize execution instances. - Page 24
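As a sketch (OpenCL 1.x API; the context and device handles are assumed to exist and error handling is omitted), the in-order vs. out-of-order behavior is chosen with a property flag when the queue is created:
    // In-order queue (default): commands complete in the order enqueued.
    cl_command_queue q_in = clCreateCommandQueue(context, device, 0, &err);

    // Out-of-order queue: execution order is unconstrained unless you
    // synchronize explicitly with events or queue barriers.
    cl_command_queue q_ooo = clCreateCommandQueue(context, device,
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);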

25 OpenCL summary [figure: a context spanning a CPU and a GPU contains programs (dp_mul CPU program binary, dp_mul GPU program binary), kernels (dp_mul with arg[0], arg[1], arg[2] values), memory objects (images, buffers), and command queues (in-order and out-of-order) feeding the compute devices]
    kernel void dp_mul(global const float *a, global const float *b, global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }
- Page 25

26 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Reduction: - synchronization A survey of the rest of OpenCL - Page 26

27 Example: vector addition The "hello world" program of data-parallel programming is a program to add two vectors: C[i] = A[i] + B[i] for i = 1 to N. For the OpenCL solution, there are two parts - Kernel code - Host code - Page 27

28 Vector Addition - Kernel
    kernel void vec_add(global const float *a, global const float *b, global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
- Page 28

29 Vector Addition - Host Program
    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // get the list of GPU devices associated with the context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float)*n, NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel
    kernel = clCreateKernel(program, "vec_add", NULL);

    // set the args values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;

    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

    // read the output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);
- Page 29

30 Vector Addition - Host Program (annotated) The same host code as the previous slide, broken into its logical steps: define the platform and queues (context, device list, command-queue); define the memory objects (the three buffers); create and build the program; create and set up the kernel (create the kernel, set its arguments); execute the kernel (enqueue the NDRange); read the results back on the host (clEnqueueReadBuffer). It's complicated, but most of this is boilerplate and not as bad as it looks. - Page 30

31 Platform Layer: Basic discovery The platform layer allows applications to query for platform-specific features. Querying platform info. Querying devices: - clGetDeviceIDs() - find out what compute devices are on the system - device types include CPUs, GPUs, and Accelerators - clGetDeviceInfo() - queries the capabilities of the discovered compute devices, such as: - number of compute cores - maximum work-item and work-group size - sizes of the different memory spaces - maximum memory object size - Page 31
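A small discovery sketch using the calls above (OpenCL 1.x API; error handling omitted):
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);           // first available platform

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint compute_units;                          // number of compute units
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    size_t max_wg_size;                             // max work-group size
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);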

32 Platform Layer: Contexts Creating contexts - Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices - Contexts are associated with one or more devices - Multiple contexts could be associated with the same device - clCreateContext() and clCreateContextFromType() return a handle to the created context - Page 32

33 Platform layer: Command-Queues Command-queues store a set of operations to perform. Command-queues are associated with a context. Multiple command-queues can be created to handle independent commands that don't require synchronization (in the figure, one context holds a GPU queue and a CPU queue). Execution of the command-queue is guaranteed to be complete at sync points. - Page 33

34 VecAdd: Context, Devices, Queue
    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(
        0,                   // platform ID
        CL_DEVICE_TYPE_GPU,  // ask for a GPU
        NULL,                // error callback
        NULL,                // user data for callback
        NULL);               // error code

    // get the list of GPU devices associated with the context
    size_t cb;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    cl_device_id *devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cl_command_queue cmd_queue = clCreateCommandQueue(
        context,
        devices[0],  // use the first GPU device
        0,           // default options
        NULL);       // error code
- Page 34

35 Memory Objects Buffers - Simple chunks of memory - Kernels can access them however they like (arrays, pointers, structs) - Kernels can read and write buffers Images - Opaque 2D or 3D formatted data structures - Kernels access them only via read_image() and write_image() - Each image can be read or written in a kernel, but not both - Page 35

36 Creating Memory Objects Memory objects are created within an associated context - clCreateBuffer(), clCreateImage2D(), and clCreateImage3D() Memory can be created as read-only, write-only, or read-write Where objects are created in the platform memory space can be controlled - Device memory - Device memory with data copied from a host pointer - Host memory - Host memory associated with a pointer - Memory at that pointer is guaranteed to be valid at synchronization points - Page 36
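A sketch of those placement choices with clCreateBuffer flags (the ctx, bytes, and src names are assumptions for illustration):
    cl_int err;
    // device memory, uninitialized
    cl_mem d_only = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    // device memory, initialized by copying from the host pointer src
    cl_mem d_copy = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, src, &err);
    // use the host memory at src directly (must stay valid at sync points)
    cl_mem h_used = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, bytes, src, &err);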

37 VecAdd: Create Memory Objects
    cl_mem memobjs[3];
    // allocate input buffer memory objects
    memobjs[0] = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,  // flags
        sizeof(cl_float)*n,                       // size
        srcA,                                     // host pointer
        NULL);                                    // error code
    memobjs[1] = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(cl_float)*n, srcB, NULL);
    // allocate output buffer memory object
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
        sizeof(cl_float)*n, NULL, NULL);
- Page 37

38 Build the Program object The program object encapsulates: - A context - The program source/binary - A list of target devices and build options. The build process to create a program object: - clCreateProgramWithSource() - clCreateProgramWithBinary(). Example kernel code, compiled for the GPU and for the CPU within one program object:
    kernel void horizontal_reflect(read_only image2d_t src, write_only image2d_t dst)
    {
        int x = get_global_id(0);  // x-coord
        int y = get_global_id(1);  // y-coord
        int width = get_image_width(src);
        float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
        write_imagef(dst, (int2)(x, y), src_val);
    }
- Page 38

39 VecAdd: Create and Build the Program
    // create the program
    cl_program program = clCreateProgramWithSource(
        context,
        1,                // string count
        &program_source,  // program strings
        NULL,             // string lengths
        NULL);            // error code

    // build the program
    cl_int err = clBuildProgram(program,
        0,      // number of devices in the device list
        NULL,   // device list
        NULL,   // options
        NULL,   // notifier callback function ptr
        NULL);  // user data for callback function
- Page 39

40 Kernel Objects Kernel objects encapsulate - Specific kernel functions declared in a program - Argument values used for kernel execution Creating kernel objects - clCreateKernel() - creates a kernel object for a single function in a program Setting arguments - clSetKernelArg(<kernel>, <argument index>, ...) - Data for each argument must be set for the kernel function - Argument values are copied and stored in the kernel object Kernel vs. program objects - Kernels are related to program execution - Programs are related to program source - Page 40

41 VecAdd: Create the Kernel and Set the Arguments
    // create the kernel
    cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

    // set the "a" vector argument
    err = clSetKernelArg(kernel,
        0,                     // argument index
        sizeof(cl_mem),        // argument data size
        (void *)&memobjs[0]);  // argument data

    // set the "b" vector argument
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);

    // set the "c" vector argument
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);
- Page 41

42 Kernel Execution A command to execute a kernel must be enqueued to the command-queue - The command-queue can be explicitly flushed to the device - Command-queues execute in-order or out-of-order - In-order - commands complete in the order queued and memory is consistent - Out-of-order - no guarantee of (1) when commands are executed or (2) if memory is consistent, unless specific synchronization is used. clEnqueueNDRangeKernel() - Data-parallel execution model - Describes the index space for kernel execution - Requires information on NDRange dimensions and work-group size clEnqueueTask() - Task-parallel execution model (multiple queued tasks) - Kernel is executed on a single work-item clEnqueueNativeKernel() - Task-parallel execution model - Executes a native C/C++ function not compiled using the OpenCL compiler - This mode does not use a kernel object, so arguments must be passed in - Page 42
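A sketch of the three styles (OpenCL 1.x entry points; the queue and kernel handles are assumed, error handling omitted, and the native-kernel call is shown only schematically):
    // data-parallel: one kernel instance per point in a 2D index space
    size_t gws[2] = {1024, 1024};
    err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);

    // task-parallel: the kernel runs as a single work-item
    err = clEnqueueTask(queue, kernel, 0, NULL, NULL);

    // native kernel: enqueue a plain host C function (no kernel object);
    // its arguments are marshalled through an args block, e.g.:
    // err = clEnqueueNativeKernel(queue, host_func, &args, sizeof(args),
    //                             1, &mem_list, &args_mem_loc, 0, NULL, NULL);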

43 VecAdd: Invoke Kernel
    // set work-item dimensions
    size_t global_work_size[1] = {n};

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel,
        1,                 // work dimensions
        NULL,              // global work offset (must be NULL)
        global_work_size,
        NULL,              // automatic local work size
        0,                 // no events to wait on
        NULL,              // event wait list
        NULL);             // event for this kernel
- Page 43

44 Synchronization Synchronization - Signals when commands are completed, to the host or to other commands in the queue - Blocking calls - Commands that do not return until complete - clEnqueueReadBuffer() can be called as blocking and will block until complete - Event objects - Track the execution status of a command - Some commands can be blocked until event objects signal completion of a previous command - clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer) is complete - Queue barriers - queued commands that can block command execution - Page 44
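A sketch of event-based synchronization between two commands, reusing the names from the vector-add example (the a_buf buffer handle is an assumption for illustration):
    cl_event write_done;
    // non-blocking write: returns immediately, event signals completion
    err = clEnqueueWriteBuffer(cmd_queue, a_buf, CL_FALSE,
                               0, n * sizeof(cl_float), srcA,
                               0, NULL, &write_done);
    // the kernel will not start until the write has completed
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, NULL,
                                 1, &write_done, NULL);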

45 VecAdd: Read Output
    // read output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2],
        CL_TRUE,             // blocking
        0,                   // offset
        n*sizeof(cl_float),  // size
        dst,                 // pointer
        0, NULL, NULL);      // events
- Page 45

46 OpenCL C for Compute Kernels Derived from ISO C99 - A few restrictions: no recursion, no function pointers, no functions from the C99 standard headers... - Preprocessing directives defined by C99 are supported Built-in Data Types - Scalar and vector data types, pointers - Data-type conversion functions: convert_<type><_sat><_roundingMode> - Image types: image2d_t, image3d_t and sampler_t Built-in Functions - Required - work-item functions, math.h, read and write image - Relational, geometric functions, synchronization functions Built-in Functions - Optional - double precision, atomics to global and local memory - selection of rounding mode, writes to image3d_t surfaces - Page 46
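For example, a small sketch of that conversion syntax (values chosen to show the rounding and saturation behavior):
    float4 f = (float4)(1.6f, -2.4f, 3.5f, 1e12f);
    // default conversion: round toward zero -> (1, -2, 3, undefined: out of range)
    int4 a = convert_int4(f);
    // saturate + round to nearest even -> (2, -2, 4, INT_MAX)
    int4 b = convert_int4_sat_rte(f);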

47 OpenCL C Language Highlights Function qualifiers - The kernel qualifier declares a function as a kernel - Kernels can call other kernel functions Address space qualifiers - global, local, constant, private - Pointer kernel arguments must be declared with an address space qualifier Work-item functions - Query work-item identifiers - get_work_dim(), get_global_id(), get_local_id(), get_group_id() Synchronization functions - Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue - Memory fences - provide ordering between memory operations - Page 47

48 OpenCL C Language Restrictions Pointers to functions are not allowed Pointers to pointers allowed within a kernel, but not as an argument Bit-fields are not supported Variable length arrays and structures are not supported Recursion is not supported Writes to a pointer of types less than 32-bit are not supported Double types are not supported, but reserved - Page 48

49 Vector Addition Kernel
    kernel void vec_add(global const float *a, global const float *b, global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
- Page 49

50 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Radix Sort: A survey of the rest of OpenCL - Page 50

51 Linear Algebra Definition: - The branch of mathematics concerned with the study of vectors, vector spaces, linear transformations, and systems of linear equations. Example: consider the following system of linear equations: x + 2y + z = 1; x + 3y + 3z = 2; x + y + 4z = 6. This system can be represented in terms of vectors and a matrix as the classic Ax = b problem:
    | 1 2 1 | | x |   | 1 |
    | 1 3 3 | | y | = | 2 |
    | 1 1 4 | | z |   | 6 |
- Page 51

52 Solving Ax=b LU decomposition: - transform a matrix into the product of a lower triangular and an upper triangular matrix, A = L U; it is used to solve a linear system of equations. Solving for x, given a problem Ax = b: Ax = b -> LUx = b -> Ux = L^-1 b -> x = U^-1 L^-1 b - Page 52

53 Linear transformations A matrix A in R^(M x P) multiplies a vector x in R^P to define a linear transformation of that vector into R^M. Matrix multiplication defines the composition of two linear transformations on that vector space. Compute C = A*B, where C in R^(M x N), A in R^(M x P), B in R^(P x N). Matrix multiplication is the core building block of linear algebra. - Page 53

54 Matrix Multiplication: Sequential code
    void mat_mul(int Mdim, int Ndim, int Pdim, float *A, float *B, float *C)
    {
        int i, j, k;
        for (i = 0; i < Ndim; i++) {
            for (j = 0; j < Mdim; j++) {
                for (k = 0; k < Pdim; k++) {
                    // C(i,j) = sum(over k) A(i,k) * B(k,j)
                    C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
                }
            }
        }
    }
C(i,j) = C(i,j) + A(i,:) * B(:,j): a dot product of a row of A and a column of B for each element of C.
- Page 54

55 Matrix Multiplication Performance Results on an Apple laptop with an Intel CPU: CPU, sequential C (not OpenCL) - 167 MFLOPS. Device is an Intel Core 2 Duo CPU at 2.40 GHz. 3rd-party names are the property of their owners. - Page 55

56 Matrix Multiplication: OpenCL kernel (1/4) Start from the sequential function:
    void mat_mul(int Mdim, int Ndim, int Pdim, float *A, float *B, float *C)
    {
        int i, j, k;
        for (i = 0; i < Ndim; i++) {
            for (j = 0; j < Mdim; j++) {
                for (k = 0; k < Pdim; k++) {
                    // C(i,j) = sum(over k) A(i,k) * B(k,j)
                    C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
                }
            }
        }
    }
- Page 56

57 Matrix Multiplication: OpenCL kernel (2/4) Mark it as a kernel function and specify memory qualifiers; the signature changes from
    void mat_mul(int Mdim, int Ndim, int Pdim, float *A, float *B, float *C)
to
    kernel void mat_mul(const int Mdim, const int Ndim, const int Pdim,
                        global float *A, global float *B, global float *C)
The loop body is unchanged:
    {
        int i, j, k;
        for (i = 0; i < Ndim; i++) {
            for (j = 0; j < Mdim; j++) {
                for (k = 0; k < Pdim; k++) {
                    // C(i,j) = sum(over k) A(i,k) * B(k,j)
                    C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
                }
            }
        }
    }
- Page 57

58 Matrix Multiplication: OpenCL kernel (3/4) Remove the outer loops and set the work-item coordinates from the index space:
    kernel void mat_mul(const int Mdim, const int Ndim, const int Pdim,
                        global float *A, global float *B, global float *C)
    {
        int i, j, k;
        i = get_global_id(0);
        j = get_global_id(1);
        for (k = 0; k < Pdim; k++) {
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
- Page 58

59 Matrix Multiplication: OpenCL kernel (4/4) The resulting kernel:
    kernel void mat_mul(const int Mdim, const int Ndim, const int Pdim,
                        global float *A, global float *B, global float *C)
    {
        int i, j, k;
        i = get_global_id(0);
        j = get_global_id(1);
        for (k = 0; k < Pdim; k++) {
            // C(i,j) = sum(over k) A(i,k) * B(k,j)
            C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
        }
    }
- Page 59

60 Matrix Multiplication: OpenCL kernel Rearrange a bit and use a scalar temporary (private to the work-item) for intermediate C element values (a common optimization in matrix multiplication functions):
    kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                     global float* A, global float* B, global float* C)
    {
        int k;
        int i = get_global_id(0);
        int j = get_global_id(1);
        float tmp = 0.0f;
        for (k = 0; k < Pdim; k++)
            tmp += A[i*Ndim+k] * B[k*Pdim+j];
        C[i*Ndim+j] = tmp;
    }
- Page 60

61 Matrix Multiplication host program
    #include "mult.h"
    int main(int argc, char **argv)
    {
        float *A, *B, *C;
        int Mdim, Ndim, Pdim;
        int err, szA, szB, szC;
        size_t global[DIM];
        size_t local[DIM];
        cl_device_id device_id;
        cl_context context;
        cl_command_queue commands;
        cl_program program;
        cl_kernel kernel;
        cl_uint nd;
        cl_mem a_in, b_in, c_out;

        Ndim = ORDER; Pdim = ORDER; Mdim = ORDER;
        szA = Ndim*Pdim; szB = Pdim*Mdim; szC = Ndim*Mdim;
        A = (float *)malloc(szA*sizeof(float));
        B = (float *)malloc(szB*sizeof(float));
        C = (float *)malloc(szC*sizeof(float));
        initmat(Mdim, Ndim, Pdim, A, B, C);

        err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
        context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
        commands = clCreateCommandQueue(context, device_id, 0, &err);

        a_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * szA, NULL, NULL);
        b_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * szB, NULL, NULL);
        c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * szC, NULL, NULL);
        err = clEnqueueWriteBuffer(commands, a_in, CL_TRUE, 0, sizeof(float) * szA, A, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(commands, b_in, CL_TRUE, 0, sizeof(float) * szB, B, 0, NULL, NULL);

        program = clCreateProgramWithSource(context, 1, (const char **) &C_elem_KernelSource, NULL, &err);
        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
        kernel = clCreateKernel(program, "mmul", &err);

        err  = clSetKernelArg(kernel, 0, sizeof(int), &Mdim);
        err |= clSetKernelArg(kernel, 1, sizeof(int), &Ndim);
        err |= clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
        err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &a_in);
        err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &b_in);
        err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &c_out);

        global[0] = (size_t) Ndim; global[1] = (size_t) Mdim; nd = 2;
        err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, NULL, 0, NULL, NULL);
        clFinish(commands);
        err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0, sizeof(float) * szC, C, 0, NULL, NULL);
        test_results(A, B, C);
    }
- Page 61

62 Matrix Multiplication host program The same host program as the previous slide. Note: this isn't as bad as you might think. It's almost the same as the host code we wrote for vector add. It's boilerplate - you get it right once and just re-use it. - Page 62

63 Matrix Multiplication host program The same host program, annotated by phase: declare and initialize data; set up the platform (device, context, command-queue); set up buffers and write the A and B matrices to device memory; build the program, define the kernel, and set up its arguments; run the kernel and collect the results. - Page 63

64 Matrix Multiplication host program The only parts new to us: 1. The 2D ND range is set to the dimensions of the C matrix: global[0] = (size_t) Ndim; global[1] = (size_t) Mdim; nd = 2; 2. Local sizes are set to NULL in clEnqueueNDRangeKernel() to tell the system to pick local dimensions for us. - Page 64

65 Matrix Multiplication Performance Results on an Apple laptop with an NVIDIA GPU and an Intel CPU; matrices are stored in global memory. CPU: sequential C (not OpenCL) - 167 MFLOPS. GPU: C(i,j) per work-item, all global - 511 MFLOPS. CPU: C(i,j) per work-item, all global - 744 MFLOPS. GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 65

66 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Reduction: - synchronization A survey of the rest of OpenCL - Page 66

67 Optimizing Matrix Multiplication Cost is determined by flops and memory movement: 2*n^3 = O(n^3) flops operating on 3*n^2 numbers. To optimize matrix multiplication, we must ensure that for every memory movement, we execute as many flops as possible. Outer-product algorithms are faster, but for pedagogical reasons, let's stick to the simple dot-product algorithm: C(i,j) = C(i,j) + A(i,:) * B(:,j) - a dot product of a row of A and a column of B for each element of C. We will work with work-item/work-group sizes and the memory model to optimize matrix multiplication. - Page 67
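To make this concrete: for n = 1000 the dot-product algorithm performs 2*n^3 = 2 x 10^9 flops on 3*n^2 = 3 x 10^6 numbers (about 12 MB as floats), so roughly 667 flops are available per number moved between memory levels. The optimizations that follow are all about actually capturing that reuse.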

68 An N-dimensional domain of work-items Global dimensions: 1024 x 1024 (whole problem space). Local dimensions: 128 x 128 (a work-group, which executes together). Synchronization between work-items is possible only within work-groups: barriers and memory fences. You cannot synchronize outside of a work-group. Choose the dimensions that are best for your algorithm. - Page 68

69 OpenCL Memory Model Private memory - per work-item. Local memory - shared within a work-group. Global/constant memory - visible to all work-groups. Host memory - on the CPU. [figure: host plus a compute device; each work-item has private memory, each work-group has local memory, and all work-groups share global/constant memory] Memory management is explicit: you must move data from host -> global -> local and back. - Page 69

70 Optimizing Matrix Multiplication There may be significant overhead to manage work-items and work-groups. Let's have each work-item compute a full row of C: C(i,j) = C(i,j) + A(i,:) * B(:,j) - a dot product of a row of A and a column of B for each element of C. - Page 70

71 An N-dimensional domain of work-items Global dimensions: 1000 x 1000 (whole problem space). Local dimensions: 250 x 1000 (one work-group per compute unit). - Page 71

72 Reduce work-item overhead: do one row of C per work-item
    kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                     global float* A, global float* B, global float* C)
    {
        int k, j;
        int i = get_global_id(0);
        float tmp;
        for (j = 0; j < Mdim; j++) {
            tmp = 0.0f;
            for (k = 0; k < Pdim; k++)
                tmp += A[i*Ndim+k] * B[k*Pdim+j];
            C[i*Ndim+j] = tmp;
        }
    }
- Page 72

73 MatMult host program: one row of C per work-item The host program is unchanged except for how the kernel is enqueued: 1. The 1D ND range is set to the number of rows in the C matrix. 2. The local dimension is set to 250, so the number of work-groups matches the number of compute units (4 in this case) for our order-1000 matrices: global[0] = (size_t) Ndim; local[0] = (size_t) 250; nd = 1; err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global, local, 0, NULL, NULL); - Page 73

74 Results: MFLOPS Results on an Apple laptop with an NVIDIA GPU and an Intel CPU. CPU: sequential C (not OpenCL) - 167. GPU: C(i,j) per work-item, all global - 511. GPU: C row per work-item, all global - 258. CPU: C(i,j) per work-item - 744. This on its own didn't help. GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 74

75 Optimizing Matrix Multiplication Notice that each element of C in a row uses the same row of A. Let's copy A into private memory so we don't incur the overhead of pulling it from global memory for each C(i,j) computation: C(i,j) = C(i,j) + A(i,:) * B(:,j), with A(i,:) in the private memory of each work-item. - Page 75

76 Row of C per work-item, A row private Set up a work array for A in private memory and copy into it from global memory before we start with the matrix multiplications:
    kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                     global float* A, global float* B, global float* C)
    {
        int k, j;
        int i = get_global_id(0);
        float Awrk[1000];
        float tmp;
        for (k = 0; k < Pdim; k++)
            Awrk[k] = A[i*Ndim+k];
        for (j = 0; j < Mdim; j++) {
            tmp = 0.0f;
            for (k = 0; k < Pdim; k++)
                tmp += Awrk[k] * B[k*Pdim+j];
            C[i*Ndim+j] = tmp;
        }
    }
- Page 76

77 MatMult host program: one row of C per work-item The host program is the same as for the previous row-per-work-item kernel: a 1D ND range set to the number of rows in C, with the local dimension set to 250 so the number of work-groups matches the number of compute units (4 in this case) for our order-1000 matrices. - Page 77

78 Matrix Multiplication Performance Results on an Apple laptop with an NVIDIA GPU and an Intel CPU. CPU: sequential C (not OpenCL) - 167. GPU: C(i,j) per work-item, all global - 511. GPU: C row per work-item, all global - 258. GPU: C row per work-item, A row private - 873 (big impact). CPU: C(i,j) per work-item - 744. GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 78

79 Optimizing Matrix Multiplication Notice that each element of C uses the same row of A, and each work-item in a work-group uses the same columns of B. Let's store the B columns in local memory: C(i,j) = C(i,j) + A(i,:) * B(:,j), with A(i,:) in the private memory of each work-item and B(:,j) in the local memory of each work-group. - Page 79

80 Row of C per work-item, A row private, B columns local Pass in a pointer to local memory. Work-items in a group start by copying the columns of B they need into the local memory:
    kernel void mmul(const int Mdim, const int Ndim, const int Pdim,
                     global float* A, global float* B, global float* C,
                     local float* Bwrk)
    {
        int k, j;
        int i = get_global_id(0);
        int iloc = get_local_id(0);
        int nloc = get_local_size(0);
        float Awrk[1000];
        float tmp;
        for (k = 0; k < Pdim; k++)
            Awrk[k] = A[i*Ndim+k];
        for (j = 0; j < Mdim; j++) {
            for (k = iloc; k < Pdim; k += nloc)
                Bwrk[k] = B[k*Pdim+j];
            barrier(CLK_LOCAL_MEM_FENCE);
            tmp = 0.0f;
            for (k = 0; k < Pdim; k++)
                tmp += Awrk[k] * Bwrk[k];
            C[i*Ndim+j] = tmp;
        }
    }
- Page 80

81 MatMult host program: almost no change from before The only change: modify the host program so we pass local memory to the kernel. This requires a change to the kernel argument list - a new call to clSetKernelArg is needed: err = clSetKernelArg(kernel, 6, sizeof(float)*Pdim, NULL); - Page 81

82 Matrix Multiplication Performance Results on an Apple laptop with an NVIDIA GPU and an Intel CPU. CPU: sequential C (not OpenCL) - 167. GPU: C(i,j) per work-item, all global - 511. GPU: C row per work-item, all global - 258. GPU: C row per work-item, A row private - 873. GPU: C row per work-item, A private, B local - 2472. CPU: C(i,j) per work-item - 744. GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 82

83 Matrix Multiplication Performance Results (speedup relative to sequential C) on an Apple laptop with an NVIDIA GPU and an Intel CPU. CPU: sequential C (not OpenCL) - 1. GPU: C(i,j) per work-item, all global - 3. GPU: C row per work-item, all global - 1.5. GPU: C row per work-item, A row private - 5.2. GPU: C row per work-item, A private, B local - 15. CPU: C(i,j) per work-item - 4.5. Wow!!! OpenCL on a GPU is radically faster than C on a CPU, right? GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 83

84 CPU vs GPU: Let's be fair We made no attempt to optimize the CPU/C code, but we worked hard to optimize the OpenCL/GPU code. Let's optimize the CPU code: - Use compiler optimization (level O3). - Replace float with double (CPU ALUs like double). - Reorder the loops:
    void mat_mul_ijk(int Mdim, int Ndim, int Pdim, double *A, double *B, double *C)
    {
        int i, j, k;
        for (i = 0; i < Ndim; i++)
            for (j = 0; j < Mdim; j++)
                for (k = 0; k < Pdim; k++)
                    C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
    }
    void mat_mul_ikj(int Mdim, int Ndim, int Pdim, double *A, double *B, double *C)
    {
        int i, j, k;
        for (i = 0; i < Ndim; i++)
            for (k = 0; k < Pdim; k++)
                for (j = 0; j < Mdim; j++)
                    C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
    }
Results: float, no opt - 167 MFLOPS; double, O3 - 272 MFLOPS. Loop orders: ijk - 272 MFLOPS; ikj - 1130 MFLOPS; kij - 481 MFLOPS.
- Page 84

85 Matrix Multiplication Performance Results (speedup relative to the optimized sequential C) on an Apple laptop with an NVIDIA GPU and an Intel CPU. CPU: sequential C (not OpenCL) - 1. GPU: C(i,j) per work-item, all global - 0.45. GPU: C row per work-item, all global - 0.23. GPU: C row per work-item, A row private - 0.77. GPU: C row per work-item, A private, B local - 2.2. CPU: C(i,j) per work-item - 0.66. And we are still only using one core, and we are not using SSE, so there is lots of room to further optimize the CPU code. GPU device is a GeForce 8600M GT from NVIDIA with a max of 4 compute units; CPU device is an Intel Core 2 Duo at 2.40 GHz. 3rd-party names are the property of their owners. - Page 85

86 Agenda Heterogeneous computing and the origins of OpenCL OpenCL overview Exploring the spec through a series of examples - Vector addition: - the basic platform layer - Matrix multiplication: - writing simple kernels - Optimizing matrix multiplication: - work groups and the memory model - Radix Sort: - synchronization A survey of the rest of OpenCL - Page 86

87 Parallel work group optimized Radix Sort Based on work by Lee Howes, AMD Original Radix Sort for the GPU by Mark Harris, Nvidia - Page 87

88 Principle of the radix sort Sorts a list of fixed-size integer keys - Separates the key into individual digits - Sorts digit-by-digit In this case we use a least-significant-digit sort - Sorts first the least significant digit of the key, then the next, and so on - Provides a stable sort Radix sort is O(kn), where n is the number of keys and k the key length - Grows linearly with the data set size, assuming constant memory performance - It is not necessarily the fastest sort available - Page 88

89 Steps in computation Take the least significant digit of the key. Group the keys based on the value of that digit: - Use a counting sort to achieve this - Maintain the ordering of keys within a value of the digit. Repeat the process, moving on to the next digit (a minimal sequential sketch follows below). - Page 89
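A minimal sequential sketch of this digit-by-digit counting sort, in plain C with radix-16 digits (the names are illustrative; the GPU version described next parallelizes each of these steps):
    #include <string.h>

    // Sort 32-bit keys by one 4-bit digit at a time, least significant first.
    void radix_sort(unsigned *keys, unsigned *tmp, int n)
    {
        for (int shift = 0; shift < 32; shift += 4) {
            int count[16] = {0};
            for (int i = 0; i < n; i++)              // histogram of this digit
                count[(keys[i] >> shift) & 0xF]++;
            int offset[16], sum = 0;
            for (int d = 0; d < 16; d++) {           // exclusive prefix sum
                offset[d] = sum;
                sum += count[d];
            }
            for (int i = 0; i < n; i++)              // stable scatter
                tmp[offset[(keys[i] >> shift) & 0xF]++] = keys[i];
            memcpy(keys, tmp, n * sizeof(unsigned));
        }
    }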

90 Sorting a simple radix-2 key We wish to sort the following set of keys in radix-2: - Each digit is 1 bit wide - Sort the least significant bit first. First iteration, least significant digit (or bit): take the input keys (the original data), count the number of 1s and 0s in the set, then order the keys by that digit. The keys are now sorted based on the first digit. - Page 90

91 Sorting a simple radix-2 key We wish to sort the following set of keys in radix-2: - Each digit is 1 bit wide - Sort the least significant bit first. Second iteration, second-least-significant digit (or bit): take the output of the previous iteration as the input keys, count the number of 1s and 0s in the set, then order the keys by that digit. The keys are now sorted based on the second digit. - Page 91

92 Implementing on the GPU Sort the keys in radix-16: 4-bit chunks on each sort pass - Only 16 counting buckets, easily within the scope of an efficient counting sort Divide the set of keys into chunks - Sort into 16 buckets in local memory - More efficient global memory traffic, scattering into only 16 address ranges Global prefix sum to obtain write locations - Page 92

93 High level view Dataset -> divide into blocks -> sort each block into 16 bins based on the current digit -> compute global offsets for the bins -> write partially sorted data into the output - Page 93

94 Sorting individual blocks Each block in a radix-16 sort is performed using four iterations of the binary sort: Take a single block of 512 elements. Perform a 1-bit prefix sum of 1s (we know the number of 0s: location - number of 1s). Re-order the data into 0s and 1s. Repeat 4 times until we have our data sorted by the 4-bit digit. Compute counts in each bin, giving us a histogram for the block. Store the sorted block and the histogram data. - Page 94

95 Producing an efficient local prefix sum Each sorting pass requires a 1-bit prefix sum to be performed in local memory. We use an efficient barrier-free local prefix sum for blocks 2x the wavefront size: - Take the block of data and load it into local memory - Write 0s into earlier local memory locations - Add [index - power of 2] into [index], with increasing powers of 2. The added 0s allow us to do this without conditionals. The stride of 2 means that we can cover more elements with the wavefront and fix up at the end. This can be completely barrier-free within a single wavefront; when we want to cross a wavefront boundary we must be more careful. - Page 95

98 Producing an efficient local prefix sum The same barrier-free local prefix sum, as code (run by the first 64 work-items of a 128-wide block):
    if (groupThreadId < 64)
    {
        sorterSharedMemory[idx] += sorterSharedMemory[idx-1];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-2];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-4];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-8];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-16];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-32];
        sorterSharedMemory[idx] += sorterSharedMemory[idx-64];
        sorterSharedMemory[idx-1] += sorterSharedMemory[idx-2];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
- Page 98

99 Extending to 8x Use vector registers and efficient 128-bit memory loads Each work-item (WI) operates on a uint4 Do the internal 4-way prefix sum on the vector first, in registers Then write the result into local memory - Page 99

100 Extending to 8x Each work-item sums across its uint4 in registers, then writes the result into local memory:

    // Do sum across vector in two stages
    prefixsumdata.y += prefixsumdata.x;
    prefixsumdata.w += prefixsumdata.z;

    prefixsumdata.z += prefixsumdata.y;
    prefixsumdata.w += prefixsumdata.y;

    // Now just 128 values, each the sum of a block of 4
    sortersharedmemory[groupthreadid] = 0;
    sortersharedmemory[groupthreadid+128] = prefixsumdata.w;

- Page 100

101 Computing a local histogram A histogram can be implemented in multiple ways: - Reduction - Direct atomics The Evergreen architecture supports very efficient local atomics: - Per-channel integer atomics - Computed at the memory interface So we choose direct atomics - Page 101

102 Computing a local histogram As before we work on a vector4 per work-item to increase arithmetic density We first clear the histogram:

    if( get_local_id(0) < (1<<BITS_PER_PASS) )
        histogram[addresses.x] = 0;

Then obtain the appropriate 4-bit chunk of the key:

    sorteddata.x >>= startbit;
    sorteddata.y >>= startbit;
    sorteddata.z >>= startbit;
    sorteddata.w >>= startbit;

    int andvalue = ((1<<BITS_PER_PASS)-1);
    sorteddata &= (uint4)(andvalue, andvalue, andvalue, andvalue);

- Page 102

103 Computing a local histogram We barrier to allow the two wavefronts to see the cleared histogram, then run the atomic operations:

    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_inc( &(histogram[sorteddata.x]) );
    atomic_inc( &(histogram[sorteddata.y]) );
    atomic_inc( &(histogram[sorteddata.z]) );
    atomic_inc( &(histogram[sorteddata.w]) );

Finally we output the histogram for global summation, in both group-major and radix-major layouts:

    if( get_local_id(0) < 16 )
    {
        uint histvalues;
        histvalues = histogram[get_local_id(0)];

        unsigned globaloffset = 16*get_group_id(0);
        uint globaladdresses = get_local_id(0) + globaloffset;

        uint globaladdressesradixmajor = numgroups;
        globaladdressesradixmajor = globaladdressesradixmajor * get_local_id(0) + get_group_id(0);

        histogramoutputgroupmajor[globaladdresses] = histvalues;
        histogramoutputradixmajor[globaladdressesradixmajor] = histvalues;
    }

- Page 103
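The two stores above write the same 16 counts in two layouts. Our reading of the indexing, as illustrative helpers (not from the slides):

    /* Group-major: the 16 bin counts of one group are contiguous; handy for
       the per-block scatter. */
    static int group_major_index(int group, int radix)
    {
        return 16 * group + radix;
    }

    /* Radix-major: all groups' counts for one radix are contiguous, which is
       exactly the order a single global prefix sum must visit them so that
       every lower radix, across all groups, is counted first. */
    static int radix_major_index(int group, int radix, int numgroups)
    {
        return numgroups * radix + group;
    }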

105 The global prefix sum After we have prefix sums for each block we need the global versions - each work-group needs to know, for each radix, where its output range starts For each group we have the histogram giving the count of each radix in the block We perform a global prefix sum across these work-group histograms to obtain global addresses: in the slide's example figure, radix 3 starts at location 328 for the 2nd (green) group - Page 105
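A serial reference for this global step (our sketch, assuming the radix-major layout described above; a real implementation would scan in parallel):

    /* Exclusive prefix sum over the radix-major histograms. Afterwards,
       offset[numgroups*r + g] is the global start of the keys with radix r
       contributed by group g - the slide's example would read
       offset[numgroups*3 + 1] == 328. */
    static void global_offsets(const unsigned *hist, unsigned *offset,
                               int numgroups, int numbins)
    {
        unsigned running = 0;
        for (int i = 0; i < numgroups * numbins; ++i) {
            offset[i] = running;
            running += hist[i];
        }
    }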
