Mali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)

Size: px

Start display at page:

Reginald Ferguson
5 years ago
Views:

2 ARM Mali-T600 Series GPU OpenCL Developer Guide Copyright ARM. All rights reserved. Release Information The following changes have been made to this book. Change history Date Issue Confidentiality Change 12 July 2012 A Confidential First release 07 November 2012 D Confidential Second release 27 February 2013 E Non-confidential Third release 03 December 2013 F Non-confidential Fourth release Proprietary Notice Words and logos marked with or are registered trademarks or trademarks of ARM in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners. Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded. This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product. Where the term ARM is used it means ARM or any of its subsidiaries as appropriate. Confidentiality Status This document is. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to. Product Status The information in this document is final, that is for a developed product. Web Address DUI0538F Copyright ARM. All rights reserved. ii

3 Contents ARM Mali-T600 Series GPU OpenCL Developer Guide Preface About this book... vii Feedback... ix Chapter 1 Chapter 2 Chapter 3 Introduction 1.1 About GPU compute About OpenCL About the Mali-T600 Series GPU Linux OpenCL driver About the Mali OpenCL SDK Parallel Processing Concepts 2.1 Types of parallelism Concurrency Limitations of parallel processing Embarrassingly parallel applications Mixing different types of parallelism OpenCL Concepts 3.1 About OpenCL OpenCL applications OpenCL execution model OpenCL data processing The OpenCL memory model The Mali GPU memory model OpenCL concepts summary DUI0538F Copyright ARM. All rights reserved. iii

4 Contents Chapter 4 Chapter 5 Chapter 6 Chapter 7 Chapter 8 Chapter 9 Chapter 10 Appendix A Appendix B Developing an OpenCL Application 4.1 Software and hardware required for OpenCL development Development stages Execution Stages of an OpenCL Application 5.1 About the execution stages Finding the available compute devices Initializing and creating OpenCL contexts Creating a command queue Creating OpenCL program objects Building a program executable Creating kernel and memory objects Executing the kernel Reading the results Cleaning up Converting Existing Code to OpenCL 6.1 Profile your application Analyzing code for parallelization Parallel processing techniques in OpenCL Using parallel processing with non-parallelizable code Dividing data for OpenCL Retuning Existing OpenCL Code for Mali GPUs 7.1 About retuning existing OpenCL code for Mali GPUs Differences between desktop based architectures and Mali GPUs Procedure for retuning existing OpenCL code for Mali GPUs Optimizing OpenCL for Mali GPUs 8.1 The optimization process for OpenCL applications Load balancing between the application processor and the Mali GPU Sharing memory between I/O devices and OpenCL OpenCL Optimizations List 9.1 General optimizations Memory optimizations Kernel optimizations Code optimizations Execution optimizations Reducing the effect of serial computations The Mali OpenCL SDK OpenCL Data Types OpenCL Built-in Functions B.1 Work-item functions... B-2 B.2 Math functions... B-3 B.3 half_ and native_ math functions... B-4 B.4 Integer functions... B-5 B.5 Common functions... B-6 B.6 Geometric functions... B-7 B.7 Relational functions... B-8 B.8 Vector data load and store functions... B-9 B.9 Synchronization... B-10 B.10 Asynchronous copy functions... B-11 B.11 Atomic functions... B-12 B.12 Miscellaneous vector functions... B-13 DUI0538F Copyright ARM. All rights reserved. iv

6 Preface This preface introduces the Mali-T600 Series GPU OpenCL Developer Guide. It contains the following sections: About this book on page vii. Feedback on page ix. DUI0538F Copyright ARM. All rights reserved. vi

7 Preface About this book This book is for Mali-T600 Series GPU OpenCL. Product revision status The rnpn identifier indicates the revision status of the product described in this book, where: rn Identifies the major revision of the product. pn Identifies the minor revision or modification status of the product. Intended audience This guide is written for software developers with experience in C or C-like languages who want to develop OpenCL applications for Mali-T600 Series GPUs. Using this book This book is organized into the following chapters: Chapter 1 Introduction Read this for an introduction to OpenCL and the Mali OpenCL SDK. Chapter 2 Parallel Processing Concepts Read this for an introduction to parallel processing concepts. Chapter 3 OpenCL Concepts Read this for a description of the OpenCL concepts. Chapter 4 Developing an OpenCL Application Read this for a description of the development stages of an OpenCL application. Chapter 5 Execution Stages of an OpenCL Application Read this for a description of the execution stages of an OpenCL application. Chapter 6 Converting Existing Code to OpenCL Read this for a description of how to convert existing code to OpenCL. Chapter 7 Retuning Existing OpenCL Code for Mali GPUs Read this for a description of how to retune existing OpenCL code for the Mali-T600 Series GPUs. Chapter 8 Optimizing OpenCL for Mali GPUs Read this for a description of how to optimize OpenCL for the Mali-T600 Series GPUs. Chapter 9 OpenCL Optimizations List Read this for a list of optimizations for OpenCL on the Mali-T600 Series GPUs. Chapter 10 The Mali OpenCL SDK Read this for an introduction to the Mali OpenCL SDK. Appendix A OpenCL Data Types Read this for a list of the data types available. DUI0538F Copyright ARM. All rights reserved. vii

8 Preface Appendix B OpenCL Built-in Functions Read this for a list of the OpenCL built-in functions implemented in the Mali-T600 Series GPU Linux OpenCL driver. Appendix C OpenCL Extensions Read this for a list of extensions the Mali-T600 Series GPU Linux OpenCL driver supports. Glossary The ARM Glossary is a list of terms used in ARM documentation, together with definitions for those terms. The ARM Glossary does not contain terms that are industry standard unless the ARM meaning differs from the generally accepted meaning. See ARM Glossary, Typographical conventions This book uses the following typographical conventions: italic bold monospace monospace monospace italic monospace bold Introduces special terminology, denotes cross-references, and citations. Highlights interface elements, such as menu names. Denotes signal names. Also used for terms in descriptive lists, where appropriate. Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code. Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name. Denotes arguments to monospace text where the argument is to be replaced by a specific value. Denotes language keywords when used outside example code. < and > Enclose replaceable terms for assembler syntax where they appear in code or code fragments. For example: MRC p15, 0 <Rd>, <CRn>, <CRm>, <Opcode_2> Additional reading This section lists publications by ARM and by third parties. See Infocenter, for access to ARM documentation. Other publications This section lists relevant documents published by third parties: OpenCL 1.1 Specification, DUI0538F Copyright ARM. All rights reserved. viii

9 Preface Feedback ARM welcomes feedback on this product and its documentation. Feedback on this product If you have any comments or suggestions about this product, contact your supplier and give: The product name. The product revision or version. An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate. Feedback on content If you have comments on content then send an to Give: The title. The number, DUI0538F. The page numbers to which your comments apply. A concise explanation of your comments. ARM also welcomes general suggestions for additions and improvements. Note ARM tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the represented document when used with any other PDF reader. DUI0538F Copyright ARM. All rights reserved. ix

10 Chapter 1 Introduction This chapter introduces GPU compute, OpenCL, the Mali-T600 Series GPU Linux OpenCL driver, and the Mali OpenCL SDK. It contains the following sections: About GPU compute on page 1-2. About OpenCL on page 1-3. About the Mali-T600 Series GPU Linux OpenCL driver on page 1-4. About the Mali OpenCL SDK on page 1-5. DUI0538F Copyright ARM. All rights reserved. 1-1

11 Introduction 1.1 About GPU compute GPU compute, or General Purpose computing on Graphics Processing Units (GPGPU), is the practice of using the parallel computing power of a GPU for tasks other than 3D graphics rendering. Application processors are designed to execute a single thread as quickly as possible. This sort of processing typically includes scalar operations and control code. GPUs are designed to execute many threads at the same time. They run compute intensive data processing tasks in parallel that contain relatively little control code. GPUs typically contain many more processing elements than application processors so can compute at a much higher rate than application processors. OpenCL is the first open standard language to enable developers to run general purpose computing tasks on GPUs, application processors, and other types of processors. DUI0538F Copyright ARM. All rights reserved. 1-2

12 Introduction 1.2 About OpenCL The Open Computing Language (OpenCL) is an open standard for writing applications to run on heterogeneous multi-processor systems. OpenCL provides a single development environment for applications that can run on different processors. OpenCL includes a platform-independent C99-based language for writing functions called kernels that execute on OpenCL devices, and APIs that define and control the platforms. OpenCL enables you to execute some applications faster by moving intensive data processing routines to the GPU instead of the application processor. OpenCL makes multiprocessor applications easier to write because it manages the execution of your application across multiple application processors and GPUs. The OpenCL language includes vectors and built in functions that enable you to easily utilize the features of accelerators. OpenCL is an open standard developed by the Khronos Group, DUI0538F Copyright ARM. All rights reserved. 1-3

13 Introduction 1.3 About the Mali-T600 Series GPU Linux OpenCL driver The Mali-T600 Series GPU Linux OpenCL driver is the implementation of OpenCL for the Mali-T600 Series GPUs. In this document, it is known as the Mali OpenCL driver. The Mali OpenCL driver: Supports OpenCL version 1.1, Full Profile. Is binary-compatible with OpenCL 1.0 applications. This includes compatibility with the APIs deprecated in OpenCL 1.1. Note The Mali OpenCL driver is for the Mali-T600 Series GPUs. It does not support the Mali-300, Mali-400, or Mali-450 GPUs. DUI0538F Copyright ARM. All rights reserved. 1-4

14 Introduction 1.4 About the Mali OpenCL SDK The Mali OpenCL SDK contains code examples and tutorials to help you understand OpenCL development. See Chapter 10 The Mali OpenCL SDK. DUI0538F Copyright ARM. All rights reserved. 1-5

15 Chapter 2 Parallel Processing Concepts Parallel processing is the processing of computations on multiple processors simultaneously. OpenCL enables applications to use hardware resources such as GPUs to accelerate computations with parallel processing. This chapter introduces the main concepts of parallel processing. It contains the following sections: Types of parallelism on page 2-2. Concurrency on page 2-4. Limitations of parallel processing on page 2-5. Embarrassingly parallel applications on page 2-6. Mixing different types of parallelism on page 2-7. DUI0538F Copyright ARM. All rights reserved. 2-1

16 Parallel Processing Concepts 2.1 Types of parallelism There are the following types of parallelism: Data parallelism In a data-parallel application, data is divided into data elements that can be processed in parallel. Multiple data elements are read and processed simultaneously by different processors. The data being processed must be in data structures that can be read and written in parallel. An example of a data parallel application is rendering three dimensional graphics. The generated pixels are independent so the computations required to generate them can be performed in parallel. This sort of parallelism is very fine grained and there can be hundreds or thousands of threads active simultaneously. OpenCL is primarily used for data parallel processing. Task parallelism Task parallelism is where the application is broken up into tasks and these tasks are executed in parallel. Task parallelism is also known as functional parallelism. An example of an application that can use task parallelism is playing an on-line video. To display a web page your device must do several tasks: Run a network stack that performs communication. Request data from external server. Read data from external server. Parse data. Decode video data. Decode audio data. Draw video frames. Play audio data. Figure 2-1 shows parts of an application and operating system that operate simultaneously when playing an on-line video. Draw video frame Decode sound Playback sound Decode video Parse data Request data from external server Read data from external server Operating system Network stack Figure 2-1 Task parallel processing Pipelines Pipelines process data in a series of stages. In a pipeline the stages can operate simultaneously but they do not process the same data. A pipeline typically has a relatively small number of stages. DUI0538F Copyright ARM. All rights reserved. 2-2

17 Parallel Processing Concepts An example of a pipeline is a video recorder application that must perform the following stages: 1. Capture image data from an image sensor and measure light levels. 2. Modify the image data to correct for lens effects. 3. Modify the contrast, color balance, and exposure of the image data. 4. Compress the image 5. Add the data to video file. 6. Write the video file to storage. These stages must be performed in order but they can be all be operating on data from a different video frame at the same time. Figure 2-2 shows parts of an application that can operate simultaneously as a pipeline playing a video from the internet. Capture data from image sensor Correct image for lens effects Modify: Contrast Color balance Exposure Compress image Add data to video file Write video file to storage Figure 2-2 Pipeline processing DUI0538F Copyright ARM. All rights reserved. 2-3

18 Parallel Processing Concepts 2.2 Concurrency Concurrent applications have multiple operations in progress at the same time. These can operate in parallel or in serial using a time sharing system. In a concurrent application multiple tasks are typically trying to share the same data. Access to this data must be managed carefully otherwise there can be complex problems such as: Race conditions A race condition occurs when two or more threads try to modify the value of one variable at the same time. The final value of the variable should always be the same, but when a race condition occurs the variable can get a different value depending on the order of the writes. Deadlocks A deadlock occurs when two threads become blocked by each other and neither thread can progress with their operations. This can happen when the threads each obtain a lock that the other thread requires. Live locks Live locks are similar to a deadlocks but the threads keep running. However, because of the lock the threads can never complete their task. A concurrent data structure is a data structure that can be accessed by multiple tasks without causing concurrency problems. Data parallel applications use concurrent data structures. These are the sorts of data structures that you typically use in OpenCL. DUI0538F Copyright ARM. All rights reserved. 2-4

19 Parallel Processing Concepts 2.3 Limitations of parallel processing There are limitations of parallel processing that you must consider when developing parallel applications. For example, if your application parallelizes perfectly, executing the application on 10 processors makes it run 10 times faster. Applications rarely parallelize perfectly because part of the application is serial. This serial component imposes a limit on the amount of parallelization the application can use. Amdahl s law describes the speedup you can get from parallel processing. The formula for Amdahl s law is shown in Figure 2-3 where the terms in the equation are; S Fraction of the application that is serial. P Fraction of the application that is parallelizable. N Number of processors. Speedup = S + 1 P N Figure 2-3 Formula for Amdahl s law Figure 2-4 shows the speedup that different numbers of processors provide for applications with different serial components. Speedup 10 X Perfect scaling 8 X 6 X 4 X 5% serial 10% serial 20% serial 2 X 1 X Processors Figure 2-4 Speedup for application with different serial components The biggest speedups are achieved with relatively small numbers of processors. However, as the number of processors rises the gains reduce. You cannot avoid Amdahl s law in your application but you can reduce the impact. See Reducing the effect of serial computations on page For high performance with a large number of processors the application must have a very small serial component. These sorts of applications are said to be embarrassingly parallel. See Embarrassingly parallel applications on page 2-6. DUI0538F Copyright ARM. All rights reserved. 2-5

Parallel Processing Concepts 2.4 Embarrassingly parallel applications If an application can be parallelized across a large number of processors easily, it is said to be embarrassingly parallel.

20 Parallel Processing Concepts 2.4 Embarrassingly parallel applications If an application can be parallelized across a large number of processors easily, it is said to be embarrassingly parallel. An example of an embarrassingly parallel application is rendering three dimensional graphics. The pixels are completely independent so they can be computed and drawn in parallel. OpenCL is ideally suited for developing and executing embarrassingly parallel applications. Figure 2-5 Embarrassingly parallel processing Figure 2-5 shows an image divided into many small parts. These parts can all be processed simultaneously. DUI0538F Copyright ARM. All rights reserved. 2-6

21 Parallel Processing Concepts 2.5 Mixing different types of parallelism You can mix these types of parallelism in your applications. For example, an audio synthesizer might use a combination of all three types of parallelism: Task parallelism computes the notes independently. A pipeline of audio generation and processing modules creates the sound of an individual note. Within the pipeline some stages can use data parallelism to accelerate the computation of processing. DUI0538F Copyright ARM. All rights reserved. 2-7

22 Chapter 3 OpenCL Concepts This chapter describes the OpenCL concepts. It contains the following sections: About OpenCL on page 3-2. OpenCL applications on page 3-3. OpenCL execution model on page 3-4. OpenCL data processing on page 3-5. The OpenCL memory model on page 3-8. The Mali GPU memory model on page 3-9. OpenCL concepts summary on page DUI0538F Copyright ARM. All rights reserved. 3-1

23 OpenCL Concepts 3.1 About OpenCL OpenCL is an open standard that enables you to use the parallel processing capabilities of multiple types of processors including application processors, GPUs, and other computing devices. OpenCL specifies an API for parallel programming that is designed for portability: It uses an abstracted memory and execution model. There is no requirement to know the application processor or GPU instruction set. There is scope for specific hardware optimizations. Functions executing on OpenCL devices are called kernels. These are written in a language called OpenCL C that is based on C99. The Mali-T600 Series GPUs support OpenCL 1.1, Full Profile. DUI0538F Copyright ARM. All rights reserved. 3-2

24 OpenCL Concepts 3.2 OpenCL applications OpenCL applications consist of the following parts: Application, or host, side code Calls the OpenCL APIs. Compiles the CL kernels. Allocates memory buffers to pass data into and out of the OpenCL kernels. Sets up command queues. Sets up dependencies between the tasks. Sets up the NDRanges that the kernels execute over. OpenCL kernels Written in OpenCL C language. Perform the parallel processing. Runs on the compute devices such as GPU shader cores. You must write both of these parts correctly to get the best performance. DUI0538F Copyright ARM. All rights reserved. 3-3

25 OpenCL Concepts 3.3 OpenCL execution model The OpenCL execution model includes: Kernels that run on compute devices. A host application that runs on the application processor. The host application The host application manages the execution of the kernels by setting up command queues for: Memory commands. Kernel execution commands. Synchronization. The context The host application defines the context for the kernels. The context includes: The kernels. Compute devices. Program objects. Memory objects. Operation of OpenCL kernels A kernel is a block of code that is executed on a compute device in parallel with other kernels. Kernels operate in the following sequence: 1. A kernel is defined in a host application. 2. The host application submits the kernel for execution on a compute device. A compute device can be an application processor, GPU, or another type of processor. 3. When the application issues a command to submit a kernel, OpenCL creates the NDRange of work-items. 4. An instance of the kernel is created for each element in the NDRange. This enables each element to be processed independently in parallel. DUI0538F Copyright ARM. All rights reserved. 3-4

26 OpenCL Concepts 3.4 OpenCL data processing This section describes OpenCL data processing. It contains the following sections: Work-items and the NDRange. OpenCL work-groups on page 3-6. Identifiers in OpenCL on page Work-items and the NDRange The data processed by OpenCL is in an index space of work-items. The work-items are organized in an N-Dimensional Range (NDRange) where: N is the number of dimensions minus one. N can be zero, one, or two. One kernel instance is executed for each work-item in the index space. Figure 3-1 shows NDRanges with one, two and three dimensions. One dimensional NDRange Work-items Two dimensional NDRange Work items Three dimensional NDRange Work-items Figure 3-1 NDRanges and work-items You group work-items into work-groups for processing. Figure 3-2 on page 3-6 shows a three dimensional NDRange that is split into 16 work-groups each with 16 work-items. DUI0538F Copyright ARM. All rights reserved. 3-5

27 OpenCL Concepts Work-groups Work-items Figure 3-2 Work-items and work-groups OpenCL work-groups Work-groups have a number of properties and limitations: Properties of work-groups Work-groups are independent of each other. You can issue multiple work-groups for execution in parallel. The work-items in a work-group can communicate with each other using shared data buffers. You must synchronize access to these buffers. Limitations of work-groups Work-groups typically do not directly share data. They can share data using global memory. The following are not supported across different work-groups: Barriers. Dependencies. Ordering. Coherency. Global atomics are available but these can be slower than local atomics. Work-items in a work-group The work-items in a work-group can do the following: Perform barrier operations to synchronize execution points. For example: barrier(clk_local_mem_fence); // Wait for all kernels in // this work-group to catch up Use local atomic operations. Access shared memory. DUI0538F Copyright ARM. All rights reserved. 3-6

28 OpenCL Concepts Identifiers in OpenCL There are a number of identifiers in OpenCL: global ID local ID Every work-item has a unique global ID that identifies it within the index space. Within each work-group, each work-item has a unique local ID that identifies it within its work-group. work-group ID Each work-group has a unique work-group ID. DUI0538F Copyright ARM. All rights reserved. 3-7

29 OpenCL Concepts 3.5 The OpenCL memory model The OpenCL memory model contains a number of components. Figure 3-3 shows the OpenCL memory model. Private memory Private memory Private memory Private memory Work-item Work-item Work-item Work-item Local memory Local memory Work-group Work-group Constant memory Global memory Private memory Private memory is specific to a work-item. It is not visible to other work-items. Local memory Local to a work-group. Accessible by the work-items in the work-group. Consistent to all work-items in the work-group. Accessed with the local keyword. Figure 3-3 OpenCL memory model Constant memory A memory region used for objects allocated and initialized by the host. Accessible as read-only by all work-items. Global memory Accessible to all work-items executing in a context, Accessible to the host using read, write, and map commands. Consistent across work-items in a single work-group. Implements a relaxed consistency, shared memory model. There are no guarantee of memory consistency between different work-groups. Accessed with the global keyword. DUI0538F Copyright ARM. All rights reserved. 3-8

30 OpenCL Concepts 3.6 The Mali GPU memory model The Mali GPU has a different memory model to desktop workstations: Desktop Mali GPU Traditional desktop workstations have physically separate global, local and private memories. Typically a graphics-card has its own local memory. Data must be copied to the local memory and back again. Mali GPUs have a unified memory system. Local and private memory is physically global memory. Moving data from global to local or private memory typically does not improve performance. The traditional copying of data is not required. Each compute device, that is the shader cores, have their own data caches. DUI0538F Copyright ARM. All rights reserved. 3-9

31 OpenCL Concepts 3.7 OpenCL concepts summary OpenCL primarily uses data parallel processing. OpenCL uses the following terminology: Computations in OpenCL are performed by pieces of code called kernels that execute on compute devices. Compute devices can be application processors, GPUs, or other types of processors. The data processed by OpenCL is in an index space of work-items. The work-items are organized in an NDRange. One kernel instance is executed for each work-item in the index space. Kernels instances execute in parallel. Work-items group together to form work-groups. The work-items in a work-group can communicate with each other using shared data buffers, but access to the buffers must be synchronised. Work-groups typically do not directly share data with each other. They can share data using global memory. Issue multiple work-groups for execution in parallel. DUI0538F Copyright ARM. All rights reserved. 3-10

32 Chapter 4 Developing an OpenCL Application This chapter describes the development stages of an OpenCL application. It contains the following sections: Software and hardware required for OpenCL development on page 4-2. Development stages on page 4-3. DUI0538F Copyright ARM. All rights reserved. 4-1

33 Developing an OpenCL Application 4.1 Software and hardware required for OpenCL development To develop OpenCL applications for Mali GPUs, you require: A platform with a Mali-T600 Series GPU. An implementation of OpenCL for the Mali-T600 Series GPU. You can develop on other hardware platforms with implementations of OpenCL but you cannot use them to estimate performance on a Mali-T600 Series GPU. Implementations of OpenCL are available for a number of operating systems. For a list of available hardware and the OpenCL drivers for the Mali-T600 Series GPUs see Mali developer center, DUI0538F Copyright ARM. All rights reserved. 4-2

34 Developing an OpenCL Application 4.2 Development stages These are the stages for developing an OpenCL application: Determine what you want to parallelize The first step when deciding to use OpenCL is to look at what your application does and identify the parts of the application that can run in parallel. This is often the hardest part of developing an OpenCL application. See Analyzing code for parallelization on page 6-3. Note Only convert the parts of an application to OpenCL where there is likely to be a benefit. Profile your application to find the most active parts and consider these parts for conversion. Write kernels OpenCL applications consists of a set of kernel functions. You must write the kernels that perform the computations. Write infrastructure for kernels OpenCL applications require infrastructure to set-up and run the kernels. DUI0538F Copyright ARM. All rights reserved. 4-3

35 Chapter 5 Execution Stages of an OpenCL Application This chapter describes the execution stages of an OpenCL application. It contains the following sections: About the execution stages on page 5-2. Finding the available compute devices on page 5-4. Initializing and creating OpenCL contexts on page 5-5. Creating a command queue on page 5-6. Creating OpenCL program objects on page 5-7. Building a program executable on page 5-8. Creating kernel and memory objects on page 5-9. Executing the kernel on page Reading the results on page Cleaning up on page Note This chapter provides an overview of the execution stages of an OpenCL application. It is not intended as a comprehensive lesson in OpenCL. DUI0538F Copyright ARM. All rights reserved. 5-1

36 Execution Stages of an OpenCL Application 5.1 About the execution stages This section describes a high level view of the OpenCL execution stages. Your OpenCL application must obtain information about your hardware then setup the runtime environment. The following sections describe these steps: Platform setup. Runtime setup Platform setup Runtime setup You use the platform API to: Determine what OpenCL devices are available. Query to find out what OpenCL devices are available on the system using OpenCL platform layer functions. See Finding the available compute devices on page 5-4. Set up the OpenCL context. Create and set up a OpenCL context and one or more command queues to schedule execution of your kernels. See Initializing and creating OpenCL contexts on page 5-5. You use the runtime API to: Create a command queue. See Creating a command queue on page 5-6. Compile and build your program objects. Issue commands to compile and build your source code and extracts kernel objects from the compiled code. You must follow this sequence of commands: 1. The program object is created either by calling clcreateprogramwithsource() or clcreateprogramwithbinary(). clcreateprogramwithsource() creates the program object from the kernel source code. clcreateprogramwithbinary() creates the program with a pre-compiled binary file. 2. Call the clbuildprogram() function to compile the program object for the specific devices on the system. See Creating OpenCL program objects on page 5-7. Build a program executable. See Building a program executable on page 5-8. Create the Kernel and memory objects. 1. Call the clcreatekernel() function for each kernel, or call the clcreatekernelsinprogram() function to create kernel objects for all the kernels in the OpenCL application. 2. Use the OpenCL API to allocate memory buffers. You can use the map() and unmap() operations to enable both the application processor and the Mali GPU to access the data. See Creating kernel and memory objects on page 5-9. Enqueue and execute the kernels. DUI0538F Copyright ARM. All rights reserved. 5-2

37 Execution Stages of an OpenCL Application Enqueue to the command queues the commands that control the sequence and synchronization of kernel execution, mapping and unmapping of memory, and manipulation of memory objects. To execute a kernel function, you must do the following: 1. Call clsetkernelarg() for each parameter in the kernel function definition to set the kernel parameter values. 2. Determine the work-group size and index space to use to execute the kernel. 3. Enqueue the kernel for execution in the command queue. See Executing the kernel on page Enqueue commands that make the results from the work-items available to the host. Reading the results on page Clean up. Cleaning up on page DUI0538F Copyright ARM. All rights reserved. 5-3

38 Execution Stages of an OpenCL Application 5.2 Finding the available compute devices To set up OpenCL you must choose compute devices. Call clgetdeviceids() to query the OpenCL driver for a list of devices on the machine that support OpenCL. You can restrict your search to a particular type of device or to any combination of device types. You must also specify the maximum number of device IDs that you want returned. DUI0538F Copyright ARM. All rights reserved. 5-4

39 Execution Stages of an OpenCL Application 5.3 Initializing and creating OpenCL contexts After you know the available OpenCL devices on the machine and have at least one valid device ID, you can create an OpenCL context. The context groups devices together to enable memory objects to be shared across different compute devices. Pass the device information to the clcreatecontext() function. For example: // Create an OpenCL context context = clcreatecontext( NULL, 1, &device_id, notify_function, NULL, &err ); if (err!= CL_SUCCESS) { Cleanup(); return 1; } You can optionally specify an error notification callback function when creating an OpenCL context. Leaving this parameter as a NULL value results in no error notification being registered. Providing a callback function can be useful if you want to receive runtime errors for the particular OpenCL context. For example: // Optionally user_data can contain contextual information // Implementation specific data of size cb, can be returned in private_info void context_notify( const char *notify_message, const void *private_info, size_t cb, void *user_data ) { printf("notification:\n\t%s\n", notify_message); } DUI0538F Copyright ARM. All rights reserved. 5-5

40 Execution Stages of an OpenCL Application 5.4 Creating a command queue After creating your OpenCL context, use clcreatecommandqueue() to create a command queue. For example: // Create a command-queue on the first device available // on the created context commandqueue = clcreatecommandqueue( context, &device); if (commandqueue == NULL) { Cleanup(); return 1; } If you have multiple OpenCL devices, such as an application processor and a GPU, you must: 1. Create a command queue for each device. 2. Divide up the work. 3. Submit commands separately to each device. DUI0538F Copyright ARM. All rights reserved. 5-6

41 Execution Stages of an OpenCL Application 5.5 Creating OpenCL program objects The program object encapsulates: Your OpenCL program source. The latest successfully built program executable. The build options. The build log. A list of devices the program is built for. The program object is loaded with the kernel source code and then the code is compiled on the devices attached to the context. All kernel functions must be identified in the application source with the kernel qualifier. OpenCL applications can also include functions you can call from your kernel functions. Load the OpenCL C kernel source and create an OpenCL program object from it. To create a program object use the clcreateprogramwithsource() function. For example: // Create OpenCL program program = clcreateprogramwithsource( context, device, <kernel source> ); if (program == NULL) { Cleanup(); return 1; } There are different options for building OpenCL programs: You can create a program object directly from the source code of an OpenCL application and compile it at runtime. Do this at application startup to save compute resources while the application is running. To avoid compilation at runtime, you can build a program object with a previously built binary. Note Applications with pre-built program objects are not portable. Creating a program object from a binary is a similar process to creating a program object from source code, except that you must supply the binary for each device that you want to execute the kernel on. Use the clcreateprogramwithbinary() function to do this. Use the clgetprograminfo() function to obtain the binary after you have generated it. DUI0538F Copyright ARM. All rights reserved. 5-7

42 Execution Stages of an OpenCL Application 5.6 Building a program executable After you have created a program object, you must build a program executable from the contents of the program object. Use the clbuildprogram() function to build your executable. Compile all kernels in the program object: err = clbuildprogram( program, 1, &device_id, "", NULL, NULL ); if (err == NULL) { Cleanup(); return 1; } DUI0538F Copyright ARM. All rights reserved. 5-8

43 Execution Stages of an OpenCL Application 5.7 Creating kernel and memory objects This section describes creating kernel and memory objects. It contains the following sections: Creating kernel objects. Creating memory objects Creating kernel objects Call the clcreatekernel() function to create a single kernel object, or call the clcreatekernelsinprogram() function to create kernel objects for all the kernels in the OpenCL application. For example: // Create OpenCL kernel kernel = clcreatekernel(program, <kernel_name>", NULL); if (kernel == NULL) { Cleanup(); return 1; } Creating memory objects After you have created and registered your kernels, send the program data to the kernels: 1. Package the data in a memory object. 2. Associate the memory object with the kernel. There are two types of memory objects: Buffer objects Simple blocks of memory. Image objects These are structures specifically for representing 2D or 3D images. These are opaque structures, that is, you cannot see the implementation details of these structures. To create buffer objects, use the clcreatebuffer() function. To create image objects, use the clcreateimage2d() or clcreateimage3d() functions. DUI0538F Copyright ARM. All rights reserved. 5-9

44 Execution Stages of an OpenCL Application 5.8 Executing the kernel This section describes the stages in executing the kernel. It contains the following sections: Determining the data dimensions. Determining the optimal global work size. Determining the local work-group size. Enqueuing kernel execution on page Executing kernels on page Determining the data dimensions If your data is an image x pixels wide by y pixels high, it is a two-dimensional data set. If you are dealing with spatial data that involves the x, y, and z position of nodes, it is a three-dimensional data set. The number of dimensions in the original data set does not have to be the same in OpenCL. You can for example, process a three dimensional data set as a single dimensional data set in OpenCL Determining the optimal global work size The global work size is the total number of work-items required for all dimensions combined. You can change the global work size by processing multiple data items in a single work-item. The new global worksize is then the original global work size divided by the number of data items processed by each work-item. The global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device Determining the local work-group size You can specify the size of the work-group that OpenCL uses when you enqueue a kernel to execute on a device. To do this, you must know the maximum work-group size permitted by the OpenCL device your work-items execute on. To find the maximum work-group size for a specific kernel, use the clgetkernelworkgroupinfo() function and request the CL_KERNEL_WORK_GROUP_SIZE property. If your application is not required to share data among work-items, set the local_work_size parameter to NULL when enqueuing your kernel. This enables the OpenCL driver to determine the most efficient work-group size for your kernel. To get the maximum possible work-group size in each dimension, call clgetdeviceinfo() with CL_DEVICE_MAX_WORK_ITEM_SIZES. This is for the simplest kernel and dimensions might be lower for more complex kernels. The product of the dimensions of your work group might limit the size of the work group To get the total work-group size call clgetkernelworkgroupinfo() with CL_KERNEL_WORK_GROUP_SIZE. If the maximum work-group size for a kernel is lower than 128 performance is reduced. If this is the case, try simplifying the kernel. The work-group size for each dimension must divide evenly into the total data-size for that dimension. That is, the x size of the work-group must divide evenly into the x size of the total data. If this requirement means padding the work-group with extra work-items, ensure the additional work-items return immediately and do no work. DUI0538F Copyright ARM. All rights reserved. 5-10

45 Execution Stages of an OpenCL Application Enqueuing kernel execution When you have identified the dimensions necessary to represent your data, the necessary work-items for each dimension, and an appropriate work-group size, enqueue the kernel for execution using clenqueuendrangekernel(). For example: size_t globalworksize[1] = { ARRAY_SIZE }; size_t localworksize[1] = { 4 } ; // Queue the kernel up for execution across the array errnum = clenqueuendrangekernel(commandqueue, kernel, 1, NULL, globalworksize, localworksize, 0, NULL, NULL); if (errnum!= CL_SUCCESS) { printf( "Error queuing kernel for execution.\n" ); Cleanup(); return 1; } Executing kernels Queuing the kernel for execution does not mean that it executes immediately. The kernel execution is put into the command queue for later processing by the device. The call to clengueuendrangekernel() is not a blocking call and returns before the kernel has executed. It can sometimes return before the kernel has started executing. It is possible to make a kernel wait for execution until previous events are finished. You can specify certain kernels wait until other specific kernels are completed before executing. Kernels are executed in the order they are enqueued unless the property CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is set when the command queue is created. Kernels that are enqueued to an in-order queue automatically wait for kernels that were previously enqueued on the same queue. You are not required to write any code to synchronize them. DUI0538F Copyright ARM. All rights reserved. 5-11

46 Execution Stages of an OpenCL Application 5.9 Reading the results After your kernels have finished execution, you must make the result accessible to the host. To access the results from the kernel, use clenqueuemapbuffer() to map the buffer into host memory. For example: local_buffer = clenqueuemapbuffer( queue, buffer, CL_FALSE, CL_MAP_READ, 0, ( sizeof(unsigned char) * buffer_size), num_deps, deps[1], NULL, &err ); ASSERT( CL_SUCCESS == err ); Note This call does not guarantee to make the buffer available until you call clfinish(). If you change the third parameter of clfinish() or clenqueuebuffer(), CL_FALSE to CL_TRUE, the call becomes a blocking call and the read is completed before clenqueuemapbuffer() returns. DUI0538F Copyright ARM. All rights reserved. 5-12

47 Execution Stages of an OpenCL Application 5.10 Cleaning up When the application no longer requires the various objects associated with the OpenCL runtime and context, you must free these resources. Use the following functions to release your OpenCL objects. These functions decrement the reference count for the associated object: clreleasememobject(). clreleasekernel(). clreleaseprogram(). clreleasecommandqueue(). clreleasecontext(). Ensure the reference counts for all OpenCL objects reach zero when your application no longer requires them. You can obtain the reference count by querying the object. For example, by calling clgetmemobjectinfo(). DUI0538F Copyright ARM. All rights reserved. 5-13

48 Chapter 6 Converting Existing Code to OpenCL This section describes converting existing code to OpenCL. It contains the following sections: Profile your application on page 6-2. Analyzing code for parallelization on page 6-3. Parallel processing techniques in OpenCL on page 6-6. Using parallel processing with non-parallelizable code on page Dividing data for OpenCL on page DUI0538F Copyright ARM. All rights reserved. 6-1

49 Converting Existing Code to OpenCL 6.1 Profile your application Profile your application to find the most compute intensive parts. These are the parts that might be worth porting to OpenCL. The proportion of an application that requires high performance is typically a relatively small part of the code. This is the part of the code that can probably make best use of OpenCL. Porting any more of the application to OpenCL is unlikely to provide a benefit. You can use profilers such as DS-5 to profile your application. You can download DS-5 from the Mali developer web site, DUI0538F Copyright ARM. All rights reserved. 6-2

50 Converting Existing Code to OpenCL 6.2 Analyzing code for parallelization This section describes how to analyze compute intensive code for parallelization. It contains the following sections: About analyzing code for parallelization. Look for data parallel operations. Look for operations with few dependencies Analyze loops on page About analyzing code for parallelization When you have identified the most compute intensive parts of your application, analyze the code to see if you can run it in parallel. Parallelizing code can be the following: Straight forward Parallelizing the code requires small modifications. See Use the global ID instead of the loop counter on page 6-6. Difficult Parallelizing the code requires complex modifications. See Compute values in a loop with a formula instead of using counters on page Look for data parallel operations Difficult and includes dependencies Parallelizing the code requires complex modifications and the use of techniques to avoid dependencies. See the following sections: Compute values per frame on page 6-7. Perform computations with dependencies in multiple-passes on page 6-8. Pre-compute values to remove dependencies on page 6-8. Appears to be impossible If parallelizing the code appears to be impossible, this only means that a particular code implementation cannot be parallelized. The purpose of code is to perform a function. There might be different algorithms that perform the same function but work in a different ways. Some of these might be parallelizable. Investigate different alternatives to the algorithms and data structures the code uses. These might make parallelization possible. See Using parallel processing with non-parallelizable code on page Look for tasks that do large numbers of operations that: Complete without sharing data. Do not depend on the results from each other. These types of operations are data parallel so are ideal for OpenCL Look for operations with few dependencies If tasks have few dependencies, it might be possible to run them in parallel. DUI0538F Copyright ARM. All rights reserved. 6-3

51 Converting Existing Code to OpenCL Dependencies between tasks prevent parallelization because it forces tasks to be performed sequentially. If the code has dependencies, consider: Is there a way to remove the dependencies? Can you delay the dependencies to later in execution? Analyze loops Loops are good targets for parallelization because they repeat computations many times, often independently. Loops that process a small number of elements If the loop only processes a relatively small number of elements it might not be appropriate for data parallel processing. It might be better to parallelize these sorts of loops with task parallelism on one or more application processors. Nested loops If the loop is part of a series of nested loops and the total number of iterations is large, this loop is probably appropriate for parallel processing. Perfect loops Look for loops that: Process thousands of items. Have no dependencies on previous iterations. Accesses data independently in each iteration. These types of loops are data parallel so are ideal for OpenCL. Simple loop parallelization If the loop includes a variable that is incremented based on a value from the previous iteration, this is a dependency between iterations that prevents parallelization. See if you can work out a formula that enables you to compute the value of the variable based on the main loop counter. In OpenCL work-items are processed in parallel, not in a sequential loop. However, work-item processing acts in a similar way to a loop. Every work-item has a unique global id that identifies it and you can use this value in place of a loop counter. See Use the global ID instead of the loop counter on page 6-6. It is also possible to have loops within work-items but these are independent of other work-items. Loops that require data from previous iterations If your loop involves dependencies based on data processed by a previous iteration, this is a more complex problem. Can the loop be restructured to remove the dependency? If not, it might not be possible to parallelize the loop. There are a number of techniques that help you deal with dependencies. See if you can use these techniques to parallelize the loop. See Parallel processing techniques in OpenCL on page 6-6. DUI0538F Copyright ARM. All rights reserved. 6-4

52 Converting Existing Code to OpenCL Non-parallelizable loops If the loop contains dependencies that you cannot remove, investigate alternative methods of performing the computation. These might be parallelizable. See Using parallel processing with non-parallelizable code on page DUI0538F Copyright ARM. All rights reserved. 6-5

53 Converting Existing Code to OpenCL 6.3 Parallel processing techniques in OpenCL This section describes parallel processing techniques you can use in OpenCL. It contains the following sections: Use the global ID instead of the loop counter. Compute values in a loop with a formula instead of using counters on page 6-7. Compute values per frame on page 6-7. Perform computations with dependencies in multiple-passes on page 6-8. Pre-compute values to remove dependencies on page 6-8. Use software pipelining on page 6-9. Use task parallelism on page Use the global ID instead of the loop counter In OpenCL you use kernels to perform the equivalent of loop iterations. This means there is no loop counter to use in computations. The global ID of the work-item provides the equivalent of the loop counter. Use the global ID to perform any computations based on the loop counter. Note You can include loops in OpenCL kernels but they can only iterate over the data for that work-item, not the entire NDRange. The following example shows a simple loop in C that assigns the value of the loop counter to each array element. Loop example in C: The following loop fills an array with numbers. void SetElements(void) { int loop_count; int my_array[4096]; for (loop_count = 0; loop_count < 4096; loop_count++) { my_array[loop_count] = loop_count; } printf("total %d\n", loop_count); } This loop is parallelizable because the loop elements are all independent. There is no main loop counter loop_count in the OpenCL kernel so it is replaced by the global ID. The equivalent code in an OpenCL kernel: kernel void example( global int * restrict my_array) { int id; id = get_global_id(0); my_array[id] = id; } DUI0538F Copyright ARM. All rights reserved. 6-6

loop counter. The global ID of the work-item provides the equivalent of the loop counter. 6.3.

54 Converting Existing Code to OpenCL Compute values in a loop with a formula instead of using counters If you are using work-items in place of loop iterations, compute variables based on the value of the global ID rather using than a loop counter. The global ID of the work-item provides the equivalent of the loop counter Compute values per frame If your application requires continuous updates of data elements and there are dependencies between them, try breaking the computations into discrete units and perform one iteration per image frame displayed. For example, the image shown in Figure 6-1 is of an application that runs a continuous physics simulation of a flag. Figure 6-1 Flag simulation The flag is made up of a grid of nodes that are connected to the neighboring nodes. These are shown in Figure 6-2. Figure 6-2 Flag simulation grid DUI0538F Copyright ARM. All rights reserved. 6-7

Cortex -A53 MPCore Processor Cryptography Extension ARM. Technical Reference Manual. Revision: r0p4

ARM Cortex -A53 MPCore Processor Cryptography Extension Revision: r0p4 Technical Reference Manual Copyright 2013-2014 ARM. All rights reserved. ARM DDI 0501F () ARM Cortex-A53 MPCore Processor Cryptography