Mali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)

Size: px
Start display at page:

Download "Mali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)"

Transcription

1 ARM Mali -T600 Series GPU OpenCL Version 2.0 Developer Guide Copyright ARM. All rights reserved. DUI0538F ()

2 ARM Mali-T600 Series GPU OpenCL Developer Guide Copyright ARM. All rights reserved. Release Information The following changes have been made to this book. Change history Date Issue Confidentiality Change 12 July 2012 A Confidential First release 07 November 2012 D Confidential Second release 27 February 2013 E Non-confidential Third release 03 December 2013 F Non-confidential Fourth release Proprietary Notice Words and logos marked with or are registered trademarks or trademarks of ARM in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners. Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded. This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product. Where the term ARM is used it means ARM or any of its subsidiaries as appropriate. Confidentiality Status This document is. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to. Product Status The information in this document is final, that is for a developed product. Web Address DUI0538F Copyright ARM. All rights reserved. ii

3 Contents ARM Mali-T600 Series GPU OpenCL Developer Guide Preface About this book... vii Feedback... ix Chapter 1 Chapter 2 Chapter 3 Introduction 1.1 About GPU compute About OpenCL About the Mali-T600 Series GPU Linux OpenCL driver About the Mali OpenCL SDK Parallel Processing Concepts 2.1 Types of parallelism Concurrency Limitations of parallel processing Embarrassingly parallel applications Mixing different types of parallelism OpenCL Concepts 3.1 About OpenCL OpenCL applications OpenCL execution model OpenCL data processing The OpenCL memory model The Mali GPU memory model OpenCL concepts summary DUI0538F Copyright ARM. All rights reserved. iii

4 Contents Chapter 4 Chapter 5 Chapter 6 Chapter 7 Chapter 8 Chapter 9 Chapter 10 Appendix A Appendix B Developing an OpenCL Application 4.1 Software and hardware required for OpenCL development Development stages Execution Stages of an OpenCL Application 5.1 About the execution stages Finding the available compute devices Initializing and creating OpenCL contexts Creating a command queue Creating OpenCL program objects Building a program executable Creating kernel and memory objects Executing the kernel Reading the results Cleaning up Converting Existing Code to OpenCL 6.1 Profile your application Analyzing code for parallelization Parallel processing techniques in OpenCL Using parallel processing with non-parallelizable code Dividing data for OpenCL Retuning Existing OpenCL Code for Mali GPUs 7.1 About retuning existing OpenCL code for Mali GPUs Differences between desktop based architectures and Mali GPUs Procedure for retuning existing OpenCL code for Mali GPUs Optimizing OpenCL for Mali GPUs 8.1 The optimization process for OpenCL applications Load balancing between the application processor and the Mali GPU Sharing memory between I/O devices and OpenCL OpenCL Optimizations List 9.1 General optimizations Memory optimizations Kernel optimizations Code optimizations Execution optimizations Reducing the effect of serial computations The Mali OpenCL SDK OpenCL Data Types OpenCL Built-in Functions B.1 Work-item functions... B-2 B.2 Math functions... B-3 B.3 half_ and native_ math functions... B-4 B.4 Integer functions... B-5 B.5 Common functions... B-6 B.6 Geometric functions... B-7 B.7 Relational functions... B-8 B.8 Vector data load and store functions... B-9 B.9 Synchronization... B-10 B.10 Asynchronous copy functions... B-11 B.11 Atomic functions... B-12 B.12 Miscellaneous vector functions... B-13 DUI0538F Copyright ARM. All rights reserved. iv

5 Contents B.13 Image read and write functions... B-14 Appendix C OpenCL Extensions DUI0538F Copyright ARM. All rights reserved. v

6 Preface This preface introduces the Mali-T600 Series GPU OpenCL Developer Guide. It contains the following sections: About this book on page vii. Feedback on page ix. DUI0538F Copyright ARM. All rights reserved. vi

7 Preface About this book This book is for Mali-T600 Series GPU OpenCL. Product revision status The rnpn identifier indicates the revision status of the product described in this book, where: rn Identifies the major revision of the product. pn Identifies the minor revision or modification status of the product. Intended audience This guide is written for software developers with experience in C or C-like languages who want to develop OpenCL applications for Mali-T600 Series GPUs. Using this book This book is organized into the following chapters: Chapter 1 Introduction Read this for an introduction to OpenCL and the Mali OpenCL SDK. Chapter 2 Parallel Processing Concepts Read this for an introduction to parallel processing concepts. Chapter 3 OpenCL Concepts Read this for a description of the OpenCL concepts. Chapter 4 Developing an OpenCL Application Read this for a description of the development stages of an OpenCL application. Chapter 5 Execution Stages of an OpenCL Application Read this for a description of the execution stages of an OpenCL application. Chapter 6 Converting Existing Code to OpenCL Read this for a description of how to convert existing code to OpenCL. Chapter 7 Retuning Existing OpenCL Code for Mali GPUs Read this for a description of how to retune existing OpenCL code for the Mali-T600 Series GPUs. Chapter 8 Optimizing OpenCL for Mali GPUs Read this for a description of how to optimize OpenCL for the Mali-T600 Series GPUs. Chapter 9 OpenCL Optimizations List Read this for a list of optimizations for OpenCL on the Mali-T600 Series GPUs. Chapter 10 The Mali OpenCL SDK Read this for an introduction to the Mali OpenCL SDK. Appendix A OpenCL Data Types Read this for a list of the data types available. DUI0538F Copyright ARM. All rights reserved. vii

8 Preface Appendix B OpenCL Built-in Functions Read this for a list of the OpenCL built-in functions implemented in the Mali-T600 Series GPU Linux OpenCL driver. Appendix C OpenCL Extensions Read this for a list of extensions the Mali-T600 Series GPU Linux OpenCL driver supports. Glossary The ARM Glossary is a list of terms used in ARM documentation, together with definitions for those terms. The ARM Glossary does not contain terms that are industry standard unless the ARM meaning differs from the generally accepted meaning. See ARM Glossary, Typographical conventions This book uses the following typographical conventions: italic bold monospace monospace monospace italic monospace bold Introduces special terminology, denotes cross-references, and citations. Highlights interface elements, such as menu names. Denotes signal names. Also used for terms in descriptive lists, where appropriate. Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code. Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name. Denotes arguments to monospace text where the argument is to be replaced by a specific value. Denotes language keywords when used outside example code. < and > Enclose replaceable terms for assembler syntax where they appear in code or code fragments. For example: MRC p15, 0 <Rd>, <CRn>, <CRm>, <Opcode_2> Additional reading This section lists publications by ARM and by third parties. See Infocenter, for access to ARM documentation. Other publications This section lists relevant documents published by third parties: OpenCL 1.1 Specification, DUI0538F Copyright ARM. All rights reserved. viii

9 Preface Feedback ARM welcomes feedback on this product and its documentation. Feedback on this product If you have any comments or suggestions about this product, contact your supplier and give: The product name. The product revision or version. An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate. Feedback on content If you have comments on content then send an to Give: The title. The number, DUI0538F. The page numbers to which your comments apply. A concise explanation of your comments. ARM also welcomes general suggestions for additions and improvements. Note ARM tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the represented document when used with any other PDF reader. DUI0538F Copyright ARM. All rights reserved. ix

10 Chapter 1 Introduction This chapter introduces GPU compute, OpenCL, the Mali-T600 Series GPU Linux OpenCL driver, and the Mali OpenCL SDK. It contains the following sections: About GPU compute on page 1-2. About OpenCL on page 1-3. About the Mali-T600 Series GPU Linux OpenCL driver on page 1-4. About the Mali OpenCL SDK on page 1-5. DUI0538F Copyright ARM. All rights reserved. 1-1

11 Introduction 1.1 About GPU compute GPU compute, or General Purpose computing on Graphics Processing Units (GPGPU), is the practice of using the parallel computing power of a GPU for tasks other than 3D graphics rendering. Application processors are designed to execute a single thread as quickly as possible. This sort of processing typically includes scalar operations and control code. GPUs are designed to execute many threads at the same time. They run compute intensive data processing tasks in parallel that contain relatively little control code. GPUs typically contain many more processing elements than application processors so can compute at a much higher rate than application processors. OpenCL is the first open standard language to enable developers to run general purpose computing tasks on GPUs, application processors, and other types of processors. DUI0538F Copyright ARM. All rights reserved. 1-2

12 Introduction 1.2 About OpenCL The Open Computing Language (OpenCL) is an open standard for writing applications to run on heterogeneous multi-processor systems. OpenCL provides a single development environment for applications that can run on different processors. OpenCL includes a platform-independent C99-based language for writing functions called kernels that execute on OpenCL devices, and APIs that define and control the platforms. OpenCL enables you to execute some applications faster by moving intensive data processing routines to the GPU instead of the application processor. OpenCL makes multiprocessor applications easier to write because it manages the execution of your application across multiple application processors and GPUs. The OpenCL language includes vectors and built in functions that enable you to easily utilize the features of accelerators. OpenCL is an open standard developed by the Khronos Group, DUI0538F Copyright ARM. All rights reserved. 1-3

13 Introduction 1.3 About the Mali-T600 Series GPU Linux OpenCL driver The Mali-T600 Series GPU Linux OpenCL driver is the implementation of OpenCL for the Mali-T600 Series GPUs. In this document, it is known as the Mali OpenCL driver. The Mali OpenCL driver: Supports OpenCL version 1.1, Full Profile. Is binary-compatible with OpenCL 1.0 applications. This includes compatibility with the APIs deprecated in OpenCL 1.1. Note The Mali OpenCL driver is for the Mali-T600 Series GPUs. It does not support the Mali-300, Mali-400, or Mali-450 GPUs. DUI0538F Copyright ARM. All rights reserved. 1-4

14 Introduction 1.4 About the Mali OpenCL SDK The Mali OpenCL SDK contains code examples and tutorials to help you understand OpenCL development. See Chapter 10 The Mali OpenCL SDK. DUI0538F Copyright ARM. All rights reserved. 1-5

15 Chapter 2 Parallel Processing Concepts Parallel processing is the processing of computations on multiple processors simultaneously. OpenCL enables applications to use hardware resources such as GPUs to accelerate computations with parallel processing. This chapter introduces the main concepts of parallel processing. It contains the following sections: Types of parallelism on page 2-2. Concurrency on page 2-4. Limitations of parallel processing on page 2-5. Embarrassingly parallel applications on page 2-6. Mixing different types of parallelism on page 2-7. DUI0538F Copyright ARM. All rights reserved. 2-1

16 Parallel Processing Concepts 2.1 Types of parallelism There are the following types of parallelism: Data parallelism In a data-parallel application, data is divided into data elements that can be processed in parallel. Multiple data elements are read and processed simultaneously by different processors. The data being processed must be in data structures that can be read and written in parallel. An example of a data parallel application is rendering three dimensional graphics. The generated pixels are independent so the computations required to generate them can be performed in parallel. This sort of parallelism is very fine grained and there can be hundreds or thousands of threads active simultaneously. OpenCL is primarily used for data parallel processing. Task parallelism Task parallelism is where the application is broken up into tasks and these tasks are executed in parallel. Task parallelism is also known as functional parallelism. An example of an application that can use task parallelism is playing an on-line video. To display a web page your device must do several tasks: Run a network stack that performs communication. Request data from external server. Read data from external server. Parse data. Decode video data. Decode audio data. Draw video frames. Play audio data. Figure 2-1 shows parts of an application and operating system that operate simultaneously when playing an on-line video. Draw video frame Decode sound Playback sound Decode video Parse data Request data from external server Read data from external server Operating system Network stack Figure 2-1 Task parallel processing Pipelines Pipelines process data in a series of stages. In a pipeline the stages can operate simultaneously but they do not process the same data. A pipeline typically has a relatively small number of stages. DUI0538F Copyright ARM. All rights reserved. 2-2

17 Parallel Processing Concepts An example of a pipeline is a video recorder application that must perform the following stages: 1. Capture image data from an image sensor and measure light levels. 2. Modify the image data to correct for lens effects. 3. Modify the contrast, color balance, and exposure of the image data. 4. Compress the image 5. Add the data to video file. 6. Write the video file to storage. These stages must be performed in order but they can be all be operating on data from a different video frame at the same time. Figure 2-2 shows parts of an application that can operate simultaneously as a pipeline playing a video from the internet. Capture data from image sensor Correct image for lens effects Modify: Contrast Color balance Exposure Compress image Add data to video file Write video file to storage Figure 2-2 Pipeline processing DUI0538F Copyright ARM. All rights reserved. 2-3

18 Parallel Processing Concepts 2.2 Concurrency Concurrent applications have multiple operations in progress at the same time. These can operate in parallel or in serial using a time sharing system. In a concurrent application multiple tasks are typically trying to share the same data. Access to this data must be managed carefully otherwise there can be complex problems such as: Race conditions A race condition occurs when two or more threads try to modify the value of one variable at the same time. The final value of the variable should always be the same, but when a race condition occurs the variable can get a different value depending on the order of the writes. Deadlocks A deadlock occurs when two threads become blocked by each other and neither thread can progress with their operations. This can happen when the threads each obtain a lock that the other thread requires. Live locks Live locks are similar to a deadlocks but the threads keep running. However, because of the lock the threads can never complete their task. A concurrent data structure is a data structure that can be accessed by multiple tasks without causing concurrency problems. Data parallel applications use concurrent data structures. These are the sorts of data structures that you typically use in OpenCL. DUI0538F Copyright ARM. All rights reserved. 2-4

19 Parallel Processing Concepts 2.3 Limitations of parallel processing There are limitations of parallel processing that you must consider when developing parallel applications. For example, if your application parallelizes perfectly, executing the application on 10 processors makes it run 10 times faster. Applications rarely parallelize perfectly because part of the application is serial. This serial component imposes a limit on the amount of parallelization the application can use. Amdahl s law describes the speedup you can get from parallel processing. The formula for Amdahl s law is shown in Figure 2-3 where the terms in the equation are; S Fraction of the application that is serial. P Fraction of the application that is parallelizable. N Number of processors. Speedup = S + 1 P N Figure 2-3 Formula for Amdahl s law Figure 2-4 shows the speedup that different numbers of processors provide for applications with different serial components. Speedup 10 X Perfect scaling 8 X 6 X 4 X 5% serial 10% serial 20% serial 2 X 1 X Processors Figure 2-4 Speedup for application with different serial components The biggest speedups are achieved with relatively small numbers of processors. However, as the number of processors rises the gains reduce. You cannot avoid Amdahl s law in your application but you can reduce the impact. See Reducing the effect of serial computations on page For high performance with a large number of processors the application must have a very small serial component. These sorts of applications are said to be embarrassingly parallel. See Embarrassingly parallel applications on page 2-6. DUI0538F Copyright ARM. All rights reserved. 2-5

20 Parallel Processing Concepts 2.4 Embarrassingly parallel applications If an application can be parallelized across a large number of processors easily, it is said to be embarrassingly parallel. An example of an embarrassingly parallel application is rendering three dimensional graphics. The pixels are completely independent so they can be computed and drawn in parallel. OpenCL is ideally suited for developing and executing embarrassingly parallel applications. Figure 2-5 Embarrassingly parallel processing Figure 2-5 shows an image divided into many small parts. These parts can all be processed simultaneously. DUI0538F Copyright ARM. All rights reserved. 2-6

21 Parallel Processing Concepts 2.5 Mixing different types of parallelism You can mix these types of parallelism in your applications. For example, an audio synthesizer might use a combination of all three types of parallelism: Task parallelism computes the notes independently. A pipeline of audio generation and processing modules creates the sound of an individual note. Within the pipeline some stages can use data parallelism to accelerate the computation of processing. DUI0538F Copyright ARM. All rights reserved. 2-7

22 Chapter 3 OpenCL Concepts This chapter describes the OpenCL concepts. It contains the following sections: About OpenCL on page 3-2. OpenCL applications on page 3-3. OpenCL execution model on page 3-4. OpenCL data processing on page 3-5. The OpenCL memory model on page 3-8. The Mali GPU memory model on page 3-9. OpenCL concepts summary on page DUI0538F Copyright ARM. All rights reserved. 3-1

23 OpenCL Concepts 3.1 About OpenCL OpenCL is an open standard that enables you to use the parallel processing capabilities of multiple types of processors including application processors, GPUs, and other computing devices. OpenCL specifies an API for parallel programming that is designed for portability: It uses an abstracted memory and execution model. There is no requirement to know the application processor or GPU instruction set. There is scope for specific hardware optimizations. Functions executing on OpenCL devices are called kernels. These are written in a language called OpenCL C that is based on C99. The Mali-T600 Series GPUs support OpenCL 1.1, Full Profile. DUI0538F Copyright ARM. All rights reserved. 3-2

24 OpenCL Concepts 3.2 OpenCL applications OpenCL applications consist of the following parts: Application, or host, side code Calls the OpenCL APIs. Compiles the CL kernels. Allocates memory buffers to pass data into and out of the OpenCL kernels. Sets up command queues. Sets up dependencies between the tasks. Sets up the NDRanges that the kernels execute over. OpenCL kernels Written in OpenCL C language. Perform the parallel processing. Runs on the compute devices such as GPU shader cores. You must write both of these parts correctly to get the best performance. DUI0538F Copyright ARM. All rights reserved. 3-3

25 OpenCL Concepts 3.3 OpenCL execution model The OpenCL execution model includes: Kernels that run on compute devices. A host application that runs on the application processor. The host application The host application manages the execution of the kernels by setting up command queues for: Memory commands. Kernel execution commands. Synchronization. The context The host application defines the context for the kernels. The context includes: The kernels. Compute devices. Program objects. Memory objects. Operation of OpenCL kernels A kernel is a block of code that is executed on a compute device in parallel with other kernels. Kernels operate in the following sequence: 1. A kernel is defined in a host application. 2. The host application submits the kernel for execution on a compute device. A compute device can be an application processor, GPU, or another type of processor. 3. When the application issues a command to submit a kernel, OpenCL creates the NDRange of work-items. 4. An instance of the kernel is created for each element in the NDRange. This enables each element to be processed independently in parallel. DUI0538F Copyright ARM. All rights reserved. 3-4

26 OpenCL Concepts 3.4 OpenCL data processing This section describes OpenCL data processing. It contains the following sections: Work-items and the NDRange. OpenCL work-groups on page 3-6. Identifiers in OpenCL on page Work-items and the NDRange The data processed by OpenCL is in an index space of work-items. The work-items are organized in an N-Dimensional Range (NDRange) where: N is the number of dimensions minus one. N can be zero, one, or two. One kernel instance is executed for each work-item in the index space. Figure 3-1 shows NDRanges with one, two and three dimensions. One dimensional NDRange Work-items Two dimensional NDRange Work items Three dimensional NDRange Work-items Figure 3-1 NDRanges and work-items You group work-items into work-groups for processing. Figure 3-2 on page 3-6 shows a three dimensional NDRange that is split into 16 work-groups each with 16 work-items. DUI0538F Copyright ARM. All rights reserved. 3-5

27 OpenCL Concepts Work-groups Work-items Figure 3-2 Work-items and work-groups OpenCL work-groups Work-groups have a number of properties and limitations: Properties of work-groups Work-groups are independent of each other. You can issue multiple work-groups for execution in parallel. The work-items in a work-group can communicate with each other using shared data buffers. You must synchronize access to these buffers. Limitations of work-groups Work-groups typically do not directly share data. They can share data using global memory. The following are not supported across different work-groups: Barriers. Dependencies. Ordering. Coherency. Global atomics are available but these can be slower than local atomics. Work-items in a work-group The work-items in a work-group can do the following: Perform barrier operations to synchronize execution points. For example: barrier(clk_local_mem_fence); // Wait for all kernels in // this work-group to catch up Use local atomic operations. Access shared memory. DUI0538F Copyright ARM. All rights reserved. 3-6

28 OpenCL Concepts Identifiers in OpenCL There are a number of identifiers in OpenCL: global ID local ID Every work-item has a unique global ID that identifies it within the index space. Within each work-group, each work-item has a unique local ID that identifies it within its work-group. work-group ID Each work-group has a unique work-group ID. DUI0538F Copyright ARM. All rights reserved. 3-7

29 OpenCL Concepts 3.5 The OpenCL memory model The OpenCL memory model contains a number of components. Figure 3-3 shows the OpenCL memory model. Private memory Private memory Private memory Private memory Work-item Work-item Work-item Work-item Local memory Local memory Work-group Work-group Constant memory Global memory Private memory Private memory is specific to a work-item. It is not visible to other work-items. Local memory Local to a work-group. Accessible by the work-items in the work-group. Consistent to all work-items in the work-group. Accessed with the local keyword. Figure 3-3 OpenCL memory model Constant memory A memory region used for objects allocated and initialized by the host. Accessible as read-only by all work-items. Global memory Accessible to all work-items executing in a context, Accessible to the host using read, write, and map commands. Consistent across work-items in a single work-group. Implements a relaxed consistency, shared memory model. There are no guarantee of memory consistency between different work-groups. Accessed with the global keyword. DUI0538F Copyright ARM. All rights reserved. 3-8

30 OpenCL Concepts 3.6 The Mali GPU memory model The Mali GPU has a different memory model to desktop workstations: Desktop Mali GPU Traditional desktop workstations have physically separate global, local and private memories. Typically a graphics-card has its own local memory. Data must be copied to the local memory and back again. Mali GPUs have a unified memory system. Local and private memory is physically global memory. Moving data from global to local or private memory typically does not improve performance. The traditional copying of data is not required. Each compute device, that is the shader cores, have their own data caches. DUI0538F Copyright ARM. All rights reserved. 3-9

31 OpenCL Concepts 3.7 OpenCL concepts summary OpenCL primarily uses data parallel processing. OpenCL uses the following terminology: Computations in OpenCL are performed by pieces of code called kernels that execute on compute devices. Compute devices can be application processors, GPUs, or other types of processors. The data processed by OpenCL is in an index space of work-items. The work-items are organized in an NDRange. One kernel instance is executed for each work-item in the index space. Kernels instances execute in parallel. Work-items group together to form work-groups. The work-items in a work-group can communicate with each other using shared data buffers, but access to the buffers must be synchronised. Work-groups typically do not directly share data with each other. They can share data using global memory. Issue multiple work-groups for execution in parallel. DUI0538F Copyright ARM. All rights reserved. 3-10

32 Chapter 4 Developing an OpenCL Application This chapter describes the development stages of an OpenCL application. It contains the following sections: Software and hardware required for OpenCL development on page 4-2. Development stages on page 4-3. DUI0538F Copyright ARM. All rights reserved. 4-1

33 Developing an OpenCL Application 4.1 Software and hardware required for OpenCL development To develop OpenCL applications for Mali GPUs, you require: A platform with a Mali-T600 Series GPU. An implementation of OpenCL for the Mali-T600 Series GPU. You can develop on other hardware platforms with implementations of OpenCL but you cannot use them to estimate performance on a Mali-T600 Series GPU. Implementations of OpenCL are available for a number of operating systems. For a list of available hardware and the OpenCL drivers for the Mali-T600 Series GPUs see Mali developer center, DUI0538F Copyright ARM. All rights reserved. 4-2

34 Developing an OpenCL Application 4.2 Development stages These are the stages for developing an OpenCL application: Determine what you want to parallelize The first step when deciding to use OpenCL is to look at what your application does and identify the parts of the application that can run in parallel. This is often the hardest part of developing an OpenCL application. See Analyzing code for parallelization on page 6-3. Note Only convert the parts of an application to OpenCL where there is likely to be a benefit. Profile your application to find the most active parts and consider these parts for conversion. Write kernels OpenCL applications consists of a set of kernel functions. You must write the kernels that perform the computations. Write infrastructure for kernels OpenCL applications require infrastructure to set-up and run the kernels. DUI0538F Copyright ARM. All rights reserved. 4-3

35 Chapter 5 Execution Stages of an OpenCL Application This chapter describes the execution stages of an OpenCL application. It contains the following sections: About the execution stages on page 5-2. Finding the available compute devices on page 5-4. Initializing and creating OpenCL contexts on page 5-5. Creating a command queue on page 5-6. Creating OpenCL program objects on page 5-7. Building a program executable on page 5-8. Creating kernel and memory objects on page 5-9. Executing the kernel on page Reading the results on page Cleaning up on page Note This chapter provides an overview of the execution stages of an OpenCL application. It is not intended as a comprehensive lesson in OpenCL. DUI0538F Copyright ARM. All rights reserved. 5-1

36 Execution Stages of an OpenCL Application 5.1 About the execution stages This section describes a high level view of the OpenCL execution stages. Your OpenCL application must obtain information about your hardware then setup the runtime environment. The following sections describe these steps: Platform setup. Runtime setup Platform setup Runtime setup You use the platform API to: Determine what OpenCL devices are available. Query to find out what OpenCL devices are available on the system using OpenCL platform layer functions. See Finding the available compute devices on page 5-4. Set up the OpenCL context. Create and set up a OpenCL context and one or more command queues to schedule execution of your kernels. See Initializing and creating OpenCL contexts on page 5-5. You use the runtime API to: Create a command queue. See Creating a command queue on page 5-6. Compile and build your program objects. Issue commands to compile and build your source code and extracts kernel objects from the compiled code. You must follow this sequence of commands: 1. The program object is created either by calling clcreateprogramwithsource() or clcreateprogramwithbinary(). clcreateprogramwithsource() creates the program object from the kernel source code. clcreateprogramwithbinary() creates the program with a pre-compiled binary file. 2. Call the clbuildprogram() function to compile the program object for the specific devices on the system. See Creating OpenCL program objects on page 5-7. Build a program executable. See Building a program executable on page 5-8. Create the Kernel and memory objects. 1. Call the clcreatekernel() function for each kernel, or call the clcreatekernelsinprogram() function to create kernel objects for all the kernels in the OpenCL application. 2. Use the OpenCL API to allocate memory buffers. You can use the map() and unmap() operations to enable both the application processor and the Mali GPU to access the data. See Creating kernel and memory objects on page 5-9. Enqueue and execute the kernels. DUI0538F Copyright ARM. All rights reserved. 5-2

37 Execution Stages of an OpenCL Application Enqueue to the command queues the commands that control the sequence and synchronization of kernel execution, mapping and unmapping of memory, and manipulation of memory objects. To execute a kernel function, you must do the following: 1. Call clsetkernelarg() for each parameter in the kernel function definition to set the kernel parameter values. 2. Determine the work-group size and index space to use to execute the kernel. 3. Enqueue the kernel for execution in the command queue. See Executing the kernel on page Enqueue commands that make the results from the work-items available to the host. Reading the results on page Clean up. Cleaning up on page DUI0538F Copyright ARM. All rights reserved. 5-3

38 Execution Stages of an OpenCL Application 5.2 Finding the available compute devices To set up OpenCL you must choose compute devices. Call clgetdeviceids() to query the OpenCL driver for a list of devices on the machine that support OpenCL. You can restrict your search to a particular type of device or to any combination of device types. You must also specify the maximum number of device IDs that you want returned. DUI0538F Copyright ARM. All rights reserved. 5-4

39 Execution Stages of an OpenCL Application 5.3 Initializing and creating OpenCL contexts After you know the available OpenCL devices on the machine and have at least one valid device ID, you can create an OpenCL context. The context groups devices together to enable memory objects to be shared across different compute devices. Pass the device information to the clcreatecontext() function. For example: // Create an OpenCL context context = clcreatecontext( NULL, 1, &device_id, notify_function, NULL, &err ); if (err!= CL_SUCCESS) { Cleanup(); return 1; } You can optionally specify an error notification callback function when creating an OpenCL context. Leaving this parameter as a NULL value results in no error notification being registered. Providing a callback function can be useful if you want to receive runtime errors for the particular OpenCL context. For example: // Optionally user_data can contain contextual information // Implementation specific data of size cb, can be returned in private_info void context_notify( const char *notify_message, const void *private_info, size_t cb, void *user_data ) { printf("notification:\n\t%s\n", notify_message); } DUI0538F Copyright ARM. All rights reserved. 5-5

40 Execution Stages of an OpenCL Application 5.4 Creating a command queue After creating your OpenCL context, use clcreatecommandqueue() to create a command queue. For example: // Create a command-queue on the first device available // on the created context commandqueue = clcreatecommandqueue( context, &device); if (commandqueue == NULL) { Cleanup(); return 1; } If you have multiple OpenCL devices, such as an application processor and a GPU, you must: 1. Create a command queue for each device. 2. Divide up the work. 3. Submit commands separately to each device. DUI0538F Copyright ARM. All rights reserved. 5-6

41 Execution Stages of an OpenCL Application 5.5 Creating OpenCL program objects The program object encapsulates: Your OpenCL program source. The latest successfully built program executable. The build options. The build log. A list of devices the program is built for. The program object is loaded with the kernel source code and then the code is compiled on the devices attached to the context. All kernel functions must be identified in the application source with the kernel qualifier. OpenCL applications can also include functions you can call from your kernel functions. Load the OpenCL C kernel source and create an OpenCL program object from it. To create a program object use the clcreateprogramwithsource() function. For example: // Create OpenCL program program = clcreateprogramwithsource( context, device, <kernel source> ); if (program == NULL) { Cleanup(); return 1; } There are different options for building OpenCL programs: You can create a program object directly from the source code of an OpenCL application and compile it at runtime. Do this at application startup to save compute resources while the application is running. To avoid compilation at runtime, you can build a program object with a previously built binary. Note Applications with pre-built program objects are not portable. Creating a program object from a binary is a similar process to creating a program object from source code, except that you must supply the binary for each device that you want to execute the kernel on. Use the clcreateprogramwithbinary() function to do this. Use the clgetprograminfo() function to obtain the binary after you have generated it. DUI0538F Copyright ARM. All rights reserved. 5-7

42 Execution Stages of an OpenCL Application 5.6 Building a program executable After you have created a program object, you must build a program executable from the contents of the program object. Use the clbuildprogram() function to build your executable. Compile all kernels in the program object: err = clbuildprogram( program, 1, &device_id, "", NULL, NULL ); if (err == NULL) { Cleanup(); return 1; } DUI0538F Copyright ARM. All rights reserved. 5-8

43 Execution Stages of an OpenCL Application 5.7 Creating kernel and memory objects This section describes creating kernel and memory objects. It contains the following sections: Creating kernel objects. Creating memory objects Creating kernel objects Call the clcreatekernel() function to create a single kernel object, or call the clcreatekernelsinprogram() function to create kernel objects for all the kernels in the OpenCL application. For example: // Create OpenCL kernel kernel = clcreatekernel(program, <kernel_name>", NULL); if (kernel == NULL) { Cleanup(); return 1; } Creating memory objects After you have created and registered your kernels, send the program data to the kernels: 1. Package the data in a memory object. 2. Associate the memory object with the kernel. There are two types of memory objects: Buffer objects Simple blocks of memory. Image objects These are structures specifically for representing 2D or 3D images. These are opaque structures, that is, you cannot see the implementation details of these structures. To create buffer objects, use the clcreatebuffer() function. To create image objects, use the clcreateimage2d() or clcreateimage3d() functions. DUI0538F Copyright ARM. All rights reserved. 5-9

44 Execution Stages of an OpenCL Application 5.8 Executing the kernel This section describes the stages in executing the kernel. It contains the following sections: Determining the data dimensions. Determining the optimal global work size. Determining the local work-group size. Enqueuing kernel execution on page Executing kernels on page Determining the data dimensions If your data is an image x pixels wide by y pixels high, it is a two-dimensional data set. If you are dealing with spatial data that involves the x, y, and z position of nodes, it is a three-dimensional data set. The number of dimensions in the original data set does not have to be the same in OpenCL. You can for example, process a three dimensional data set as a single dimensional data set in OpenCL Determining the optimal global work size The global work size is the total number of work-items required for all dimensions combined. You can change the global work size by processing multiple data items in a single work-item. The new global worksize is then the original global work size divided by the number of data items processed by each work-item. The global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device Determining the local work-group size You can specify the size of the work-group that OpenCL uses when you enqueue a kernel to execute on a device. To do this, you must know the maximum work-group size permitted by the OpenCL device your work-items execute on. To find the maximum work-group size for a specific kernel, use the clgetkernelworkgroupinfo() function and request the CL_KERNEL_WORK_GROUP_SIZE property. If your application is not required to share data among work-items, set the local_work_size parameter to NULL when enqueuing your kernel. This enables the OpenCL driver to determine the most efficient work-group size for your kernel. To get the maximum possible work-group size in each dimension, call clgetdeviceinfo() with CL_DEVICE_MAX_WORK_ITEM_SIZES. This is for the simplest kernel and dimensions might be lower for more complex kernels. The product of the dimensions of your work group might limit the size of the work group To get the total work-group size call clgetkernelworkgroupinfo() with CL_KERNEL_WORK_GROUP_SIZE. If the maximum work-group size for a kernel is lower than 128 performance is reduced. If this is the case, try simplifying the kernel. The work-group size for each dimension must divide evenly into the total data-size for that dimension. That is, the x size of the work-group must divide evenly into the x size of the total data. If this requirement means padding the work-group with extra work-items, ensure the additional work-items return immediately and do no work. DUI0538F Copyright ARM. All rights reserved. 5-10

45 Execution Stages of an OpenCL Application Enqueuing kernel execution When you have identified the dimensions necessary to represent your data, the necessary work-items for each dimension, and an appropriate work-group size, enqueue the kernel for execution using clenqueuendrangekernel(). For example: size_t globalworksize[1] = { ARRAY_SIZE }; size_t localworksize[1] = { 4 } ; // Queue the kernel up for execution across the array errnum = clenqueuendrangekernel(commandqueue, kernel, 1, NULL, globalworksize, localworksize, 0, NULL, NULL); if (errnum!= CL_SUCCESS) { printf( "Error queuing kernel for execution.\n" ); Cleanup(); return 1; } Executing kernels Queuing the kernel for execution does not mean that it executes immediately. The kernel execution is put into the command queue for later processing by the device. The call to clengueuendrangekernel() is not a blocking call and returns before the kernel has executed. It can sometimes return before the kernel has started executing. It is possible to make a kernel wait for execution until previous events are finished. You can specify certain kernels wait until other specific kernels are completed before executing. Kernels are executed in the order they are enqueued unless the property CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is set when the command queue is created. Kernels that are enqueued to an in-order queue automatically wait for kernels that were previously enqueued on the same queue. You are not required to write any code to synchronize them. DUI0538F Copyright ARM. All rights reserved. 5-11

46 Execution Stages of an OpenCL Application 5.9 Reading the results After your kernels have finished execution, you must make the result accessible to the host. To access the results from the kernel, use clenqueuemapbuffer() to map the buffer into host memory. For example: local_buffer = clenqueuemapbuffer( queue, buffer, CL_FALSE, CL_MAP_READ, 0, ( sizeof(unsigned char) * buffer_size), num_deps, deps[1], NULL, &err ); ASSERT( CL_SUCCESS == err ); Note This call does not guarantee to make the buffer available until you call clfinish(). If you change the third parameter of clfinish() or clenqueuebuffer(), CL_FALSE to CL_TRUE, the call becomes a blocking call and the read is completed before clenqueuemapbuffer() returns. DUI0538F Copyright ARM. All rights reserved. 5-12

47 Execution Stages of an OpenCL Application 5.10 Cleaning up When the application no longer requires the various objects associated with the OpenCL runtime and context, you must free these resources. Use the following functions to release your OpenCL objects. These functions decrement the reference count for the associated object: clreleasememobject(). clreleasekernel(). clreleaseprogram(). clreleasecommandqueue(). clreleasecontext(). Ensure the reference counts for all OpenCL objects reach zero when your application no longer requires them. You can obtain the reference count by querying the object. For example, by calling clgetmemobjectinfo(). DUI0538F Copyright ARM. All rights reserved. 5-13

48 Chapter 6 Converting Existing Code to OpenCL This section describes converting existing code to OpenCL. It contains the following sections: Profile your application on page 6-2. Analyzing code for parallelization on page 6-3. Parallel processing techniques in OpenCL on page 6-6. Using parallel processing with non-parallelizable code on page Dividing data for OpenCL on page DUI0538F Copyright ARM. All rights reserved. 6-1

49 Converting Existing Code to OpenCL 6.1 Profile your application Profile your application to find the most compute intensive parts. These are the parts that might be worth porting to OpenCL. The proportion of an application that requires high performance is typically a relatively small part of the code. This is the part of the code that can probably make best use of OpenCL. Porting any more of the application to OpenCL is unlikely to provide a benefit. You can use profilers such as DS-5 to profile your application. You can download DS-5 from the Mali developer web site, DUI0538F Copyright ARM. All rights reserved. 6-2

50 Converting Existing Code to OpenCL 6.2 Analyzing code for parallelization This section describes how to analyze compute intensive code for parallelization. It contains the following sections: About analyzing code for parallelization. Look for data parallel operations. Look for operations with few dependencies Analyze loops on page About analyzing code for parallelization When you have identified the most compute intensive parts of your application, analyze the code to see if you can run it in parallel. Parallelizing code can be the following: Straight forward Parallelizing the code requires small modifications. See Use the global ID instead of the loop counter on page 6-6. Difficult Parallelizing the code requires complex modifications. See Compute values in a loop with a formula instead of using counters on page Look for data parallel operations Difficult and includes dependencies Parallelizing the code requires complex modifications and the use of techniques to avoid dependencies. See the following sections: Compute values per frame on page 6-7. Perform computations with dependencies in multiple-passes on page 6-8. Pre-compute values to remove dependencies on page 6-8. Appears to be impossible If parallelizing the code appears to be impossible, this only means that a particular code implementation cannot be parallelized. The purpose of code is to perform a function. There might be different algorithms that perform the same function but work in a different ways. Some of these might be parallelizable. Investigate different alternatives to the algorithms and data structures the code uses. These might make parallelization possible. See Using parallel processing with non-parallelizable code on page Look for tasks that do large numbers of operations that: Complete without sharing data. Do not depend on the results from each other. These types of operations are data parallel so are ideal for OpenCL Look for operations with few dependencies If tasks have few dependencies, it might be possible to run them in parallel. DUI0538F Copyright ARM. All rights reserved. 6-3

51 Converting Existing Code to OpenCL Dependencies between tasks prevent parallelization because it forces tasks to be performed sequentially. If the code has dependencies, consider: Is there a way to remove the dependencies? Can you delay the dependencies to later in execution? Analyze loops Loops are good targets for parallelization because they repeat computations many times, often independently. Loops that process a small number of elements If the loop only processes a relatively small number of elements it might not be appropriate for data parallel processing. It might be better to parallelize these sorts of loops with task parallelism on one or more application processors. Nested loops If the loop is part of a series of nested loops and the total number of iterations is large, this loop is probably appropriate for parallel processing. Perfect loops Look for loops that: Process thousands of items. Have no dependencies on previous iterations. Accesses data independently in each iteration. These types of loops are data parallel so are ideal for OpenCL. Simple loop parallelization If the loop includes a variable that is incremented based on a value from the previous iteration, this is a dependency between iterations that prevents parallelization. See if you can work out a formula that enables you to compute the value of the variable based on the main loop counter. In OpenCL work-items are processed in parallel, not in a sequential loop. However, work-item processing acts in a similar way to a loop. Every work-item has a unique global id that identifies it and you can use this value in place of a loop counter. See Use the global ID instead of the loop counter on page 6-6. It is also possible to have loops within work-items but these are independent of other work-items. Loops that require data from previous iterations If your loop involves dependencies based on data processed by a previous iteration, this is a more complex problem. Can the loop be restructured to remove the dependency? If not, it might not be possible to parallelize the loop. There are a number of techniques that help you deal with dependencies. See if you can use these techniques to parallelize the loop. See Parallel processing techniques in OpenCL on page 6-6. DUI0538F Copyright ARM. All rights reserved. 6-4

52 Converting Existing Code to OpenCL Non-parallelizable loops If the loop contains dependencies that you cannot remove, investigate alternative methods of performing the computation. These might be parallelizable. See Using parallel processing with non-parallelizable code on page DUI0538F Copyright ARM. All rights reserved. 6-5

53 Converting Existing Code to OpenCL 6.3 Parallel processing techniques in OpenCL This section describes parallel processing techniques you can use in OpenCL. It contains the following sections: Use the global ID instead of the loop counter. Compute values in a loop with a formula instead of using counters on page 6-7. Compute values per frame on page 6-7. Perform computations with dependencies in multiple-passes on page 6-8. Pre-compute values to remove dependencies on page 6-8. Use software pipelining on page 6-9. Use task parallelism on page Use the global ID instead of the loop counter In OpenCL you use kernels to perform the equivalent of loop iterations. This means there is no loop counter to use in computations. The global ID of the work-item provides the equivalent of the loop counter. Use the global ID to perform any computations based on the loop counter. Note You can include loops in OpenCL kernels but they can only iterate over the data for that work-item, not the entire NDRange. The following example shows a simple loop in C that assigns the value of the loop counter to each array element. Loop example in C: The following loop fills an array with numbers. void SetElements(void) { int loop_count; int my_array[4096]; for (loop_count = 0; loop_count < 4096; loop_count++) { my_array[loop_count] = loop_count; } printf("total %d\n", loop_count); } This loop is parallelizable because the loop elements are all independent. There is no main loop counter loop_count in the OpenCL kernel so it is replaced by the global ID. The equivalent code in an OpenCL kernel: kernel void example( global int * restrict my_array) { int id; id = get_global_id(0); my_array[id] = id; } DUI0538F Copyright ARM. All rights reserved. 6-6

54 Converting Existing Code to OpenCL Compute values in a loop with a formula instead of using counters If you are using work-items in place of loop iterations, compute variables based on the value of the global ID rather using than a loop counter. The global ID of the work-item provides the equivalent of the loop counter Compute values per frame If your application requires continuous updates of data elements and there are dependencies between them, try breaking the computations into discrete units and perform one iteration per image frame displayed. For example, the image shown in Figure 6-1 is of an application that runs a continuous physics simulation of a flag. Figure 6-1 Flag simulation The flag is made up of a grid of nodes that are connected to the neighboring nodes. These are shown in Figure 6-2. Figure 6-2 Flag simulation grid DUI0538F Copyright ARM. All rights reserved. 6-7

Cortex -A53 MPCore Processor Cryptography Extension ARM. Technical Reference Manual. Revision: r0p4

Cortex -A53 MPCore Processor Cryptography Extension ARM. Technical Reference Manual. Revision: r0p4 ARM Cortex -A53 MPCore Processor Cryptography Extension Revision: r0p4 Technical Reference Manual Copyright 2013-2014 ARM. All rights reserved. ARM DDI 0501F () ARM Cortex-A53 MPCore Processor Cryptography

More information

GPU Shader Library. Mali. User Guide. Version: 1.0. Copyright 2009 ARM. All rights reserved. ARM DUI 0510A (ID101409)

GPU Shader Library. Mali. User Guide. Version: 1.0. Copyright 2009 ARM. All rights reserved. ARM DUI 0510A (ID101409) Mali GPU Shader Library Version: 1.0 User Guide Copyright 2009 ARM. All rights reserved. ARM DUI 0510A () Mali GPU Shader Library User Guide Copyright 2009 ARM. All rights reserved. Release Information

More information

GPU Offline Shader Compiler. Mali. User Guide. Version: 2.2. Copyright 2009 ARM. All rights reserved. ARM DUI 0513A (ID101409)

GPU Offline Shader Compiler. Mali. User Guide. Version: 2.2. Copyright 2009 ARM. All rights reserved. ARM DUI 0513A (ID101409) Mali GPU Offline Shader Compiler Version: 2.2 User Guide Copyright 2009 ARM. All rights reserved. ARM DUI 0513A () Mali GPU Offline Shader Compiler User Guide Copyright 2009 ARM. All rights reserved. Release

More information

Offline Shader Compiler. Mali. User Guide. Version: 3.0. Copyright ARM. All rights reserved. ARM DUI 0513B (ID032912)

Offline Shader Compiler. Mali. User Guide. Version: 3.0. Copyright ARM. All rights reserved. ARM DUI 0513B (ID032912) Mali Offline Shader Compiler Version: 3.0 User Guide Copyright 2009-2012 ARM. All rights reserved. ARM DUI 0513B () Mali Offline Shader Compiler User Guide Copyright 2009-2012 ARM. All rights reserved.

More information

OpenGL ES 2.0 SDK for Android. Mali. User Guide. Version: Copyright 2011 ARM. All rights reserved. ARM DUI 0587A (ID120411)

OpenGL ES 2.0 SDK for Android. Mali. User Guide. Version: Copyright 2011 ARM. All rights reserved. ARM DUI 0587A (ID120411) Mali OpenGL ES 2.0 SDK for Android Version: 1.0.0 User Guide Copyright 2011 ARM. All rights reserved. ARM DUI 0587A () Mali OpenGL ES 2.0 SDK for Android User Guide Copyright 2011 ARM. All rights reserved.

More information

Cortex -A53 MPCore Processor Advanced SIMD and Floating-point Extension ARM. Technical Reference Manual. Revision: r0p4

Cortex -A53 MPCore Processor Advanced SIMD and Floating-point Extension ARM. Technical Reference Manual. Revision: r0p4 ARM Cortex -A53 MPCore Processor Advanced SIMD and Floating-point Extension Revision: r0p4 Technical Reference Manual Copyright 2013-2016 ARM. All rights reserved. ARM DDI 0502G () ARM Cortex-A53 MPCore

More information

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813) ARM Mali GPU Version: 3.0 OpenGL ES Application Optimization Guide Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C () ARM Mali GPU OpenGL ES Application Optimization Guide Copyright 2011,

More information

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413) ARM Mali GPU Version: 2.0 OpenGL ES Application Optimization Guide Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B () ARM Mali GPU OpenGL ES Application Optimization Guide Copyright 2011,

More information

ARM CoreLink DPE-400 Data Parity Extension for NIC-400

ARM CoreLink DPE-400 Data Parity Extension for NIC-400 ARM CoreLink DPE-400 Data Parity Extension for NIC-400 Revision: r1p0 Technical Reference Manual Copyright 2014, 2016 ARM. All rights reserved. ARM 100591_0100_00_en ARM CoreLink DPE-400 Data Parity Extension

More information

Integrator /CP Board Support Package for Microsoft Windows CE.NET

Integrator /CP Board Support Package for Microsoft Windows CE.NET Integrator /CP Board Support Package for Microsoft Windows CE.NET Revision: r0p0 Application Developer s Guide Copyright 2004 ARM Limited. All rights reserved. ARM DUI 0272A Integrator/CP Board Support

More information

ARM. Streamline. Performance Analyzer. Using ARM Streamline. Copyright 2010 ARM Limited. All rights reserved. ARM DUI 0482A (ID100210)

ARM. Streamline. Performance Analyzer. Using ARM Streamline. Copyright 2010 ARM Limited. All rights reserved. ARM DUI 0482A (ID100210) ARM Streamline Performance Analyzer Using ARM Streamline Copyright 2010 ARM Limited. All rights reserved. ARM DUI 0482A () ARM Streamline Performance Analyzer Using ARM Streamline Copyright 2010 ARM Limited.

More information

GPU Shader Development Studio. Mali. User Guide. Version: Copyright ARM. All rights reserved. DUI0504B (ID072410)

GPU Shader Development Studio. Mali. User Guide. Version: Copyright ARM. All rights reserved. DUI0504B (ID072410) Mali GPU Shader Development Studio Version: 1.2.0 User Guide Copyright 2009-2010 ARM. All rights reserved. DUI0504B () Mali GPU Shader Development Studio User Guide Copyright 2009-2010 ARM. All rights

More information

Design Simulation Model ARM. User Guide. for SystemC. Copyright 2016 ARM. All rights reserved. ARM ARM DUI 1031B (ID111116)

Design Simulation Model ARM. User Guide. for SystemC. Copyright 2016 ARM. All rights reserved. ARM ARM DUI 1031B (ID111116) ARM Design Simulation Model for SystemC User Guide Copyright 2016 ARM. All rights reserved. ARM ARM DUI 1031B () ARM Design Simulation Model User Guide Copyright 2016 ARM. All rights reserved. Release

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

OpenCL API. OpenCL Tutorial, PPAM Dominik Behr September 13 th, 2009

OpenCL API. OpenCL Tutorial, PPAM Dominik Behr September 13 th, 2009 OpenCL API OpenCL Tutorial, PPAM 2009 Dominik Behr September 13 th, 2009 Host and Compute Device The OpenCL specification describes the API and the language. The OpenCL API, is the programming API available

More information

Arm Design Simulation Model

Arm Design Simulation Model Arm Design Simulation Model for SystemC User Guide Copyright 2017 Arm. All rights reserved. 101167_0100_00 () Arm Design Simulation Model User Guide Copyright 2017 Arm Limited (or its affiliates). All

More information

ARMv8-M processor power management

ARMv8-M processor power management ARMv8-M processor power management Version 1.0 secure state protection Copyright 2016 ARM Limited or its affiliates. All rights reserved. ARM 100737_0100_0100_en ARMv8-M processor power management ARMv8-M

More information

Component Architecture Debug Interface

Component Architecture Debug Interface Component Architecture Debug Interface Version 2.0 Developer Guide Copyright 2014, 2015 ARM. All rights reserved. ARM DUI0842F Component Architecture Debug Interface Component Architecture Debug Interface

More information

GPU Shader Development Studio. Mali. User Guide. Version: Beta. Copyright ARM. All rights reserved. DUI0504C (ID011412)

GPU Shader Development Studio. Mali. User Guide. Version: Beta. Copyright ARM. All rights reserved. DUI0504C (ID011412) Mali GPU Shader Development Studio Version: 1.3.0 User Guide Beta Copyright 2009-2011 ARM. All rights reserved. DUI0504C () Mali GPU Shader Development Studio User Guide Copyright 2009-2011 ARM. All rights

More information

Media Processing Engine. Cortex -A7 NEON. Technical Reference Manual. Revision: r0p5

Media Processing Engine. Cortex -A7 NEON. Technical Reference Manual. Revision: r0p5 Cortex -A7 NEON Media Processing Engine Revision: r0p5 Technical Reference Manual Copyright 2011-2013 ARM. All rights reserved. ARM DDI 0462F () Cortex-A7 NEON Media Processing Engine Technical Reference

More information

OpenCL The Open Standard for Heterogeneous Parallel Programming

OpenCL The Open Standard for Heterogeneous Parallel Programming OpenCL The Open Standard for Heterogeneous Parallel Programming March 2009 Copyright Khronos Group, 2009 - Page 1 Close-to-the-Silicon Standards Khronos creates Foundation-Level acceleration APIs - Needed

More information

Heterogeneous Computing

Heterogeneous Computing OpenCL Hwansoo Han Heterogeneous Computing Multiple, but heterogeneous multicores Use all available computing resources in system [AMD APU (Fusion)] Single core CPU, multicore CPU GPUs, DSPs Parallel programming

More information

Designer ADR-400 AMBA. User Guide. Revision: r3p2. Copyright ARM. All rights reserved. ARM DUI 0333M (ID011213)

Designer ADR-400 AMBA. User Guide. Revision: r3p2. Copyright ARM. All rights reserved. ARM DUI 0333M (ID011213) AMBA Designer ADR-400 Revision: r3p2 User Guide Copyright 2006-2012 ARM. All rights reserved. ARM DUI 0333M () AMBA Designer ADR-400 User Guide Copyright 2006-2012 ARM. All rights reserved. Release Information

More information

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1 Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 Introduction and aims of OpenCL - Neil Trevett, NVIDIA OpenCL Specification walkthrough - Mike

More information

ARMv8-A Synchronization primitives. primitives. Version 0.1. Version 1.0. Copyright 2017 ARM Limited or its affiliates. All rights reserved.

ARMv8-A Synchronization primitives. primitives. Version 0.1. Version 1.0. Copyright 2017 ARM Limited or its affiliates. All rights reserved. Connect ARMv8-A User Synchronization Guide primitives Version 0.1 Version 1.0 Page 1 of 9 Revision Information The following revisions have been made to this User Guide. Date Issue Confidentiality Change

More information

DS-5 ARM. Using Eclipse. Version Copyright ARM. All rights reserved. ARM DUI 0480L (ID100912)

DS-5 ARM. Using Eclipse. Version Copyright ARM. All rights reserved. ARM DUI 0480L (ID100912) ARM DS-5 Version 5.12 Using Eclipse Copyright 2010-2012 ARM. All rights reserved. ARM DUI 0480L () ARM DS-5 Using Eclipse Copyright 2010-2012 ARM. All rights reserved. Release Information The following

More information

PrimeCell Infrastructure AMBA 3 AXI File Reader Master (BP144)

PrimeCell Infrastructure AMBA 3 AXI File Reader Master (BP144) PrimeCell Infrastructure AMBA 3 AXI File Reader Master (BP144) Revision: r0p0 Technical Overview Copyright 2004 ARM Limited. All rights reserved. DTO0016A PrimeCell Infrastructure AMBA 3 AXI File Reader

More information

GPU Performance Analysis. Mali. User Guide. Version: 2.2. Copyright 2009 ARM. All rights reserved. DUI 0502A (ID101409)

GPU Performance Analysis. Mali. User Guide. Version: 2.2. Copyright 2009 ARM. All rights reserved. DUI 0502A (ID101409) Mali GPU Performance Analysis Tool Version: 2.2 User Guide Copyright 2009 ARM. All rights reserved. DUI 0502A () Mali GPU Performance Analysis Tool User Guide Copyright 2009 ARM. All rights reserved. Release

More information

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President

Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President 4 th Annual Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 CPUs Multiple cores driving performance increases Emerging Intersection GPUs Increasingly

More information

ARM TrustZone True Random Number Generator

ARM TrustZone True Random Number Generator ARM TrustZone True Random Number Generator Revision: r0p0 Technical Reference Manual Copyright 2017 ARM Limited or its affiliates. All rights reserved. ARM 100976_0000_00_en ARM TrustZone True Random Number

More information

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined

More information

Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications

Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications Ben Ashbaugh IWOCL 2018 https://github.com/intel/opencl-intercept-layer Why am I here? Intercept Layer for OpenCL Applications

More information

Intel SDK for OpenCL* - Sample for OpenCL* and Intel Media SDK Interoperability

Intel SDK for OpenCL* - Sample for OpenCL* and Intel Media SDK Interoperability Intel SDK for OpenCL* - Sample for OpenCL* and Intel Media SDK Interoperability User s Guide Copyright 2010 2012 Intel Corporation All Rights Reserved Document Number: 327283-001US Revision: 1.0 World

More information

Mali Offline Compiler User Guide

Mali Offline Compiler User Guide Mali Offline Compiler 6.0.0 User Guide ARM Limited 2017 Introduction 2 Contents 3 Contents Chapter 1: Introduction... 5 Overview... 6 Installation Package... 6 This Release...6 Chapter 2: Installation...

More information

WebCL Overview and Roadmap

WebCL Overview and Roadmap Copyright Khronos Group, 2011 - Page 1 WebCL Overview and Roadmap Tasneem Brutch Chair WebCL Working Group Samsung Electronics Copyright Khronos Group, 2011 - Page 2 WebCL Motivation Enable high performance

More information

High Dynamic Range Tone Mapping Post Processing Effect Multi-Device Version

High Dynamic Range Tone Mapping Post Processing Effect Multi-Device Version High Dynamic Range Tone Mapping Post Processing Effect Multi-Device Version Intel SDK for OpenCL* Application Sample Documentation Copyright 2010 2012 Intel Corporation All Rights Reserved Document Number:

More information

PrimeCell AHB SRAM/NOR Memory Controller (PL241)

PrimeCell AHB SRAM/NOR Memory Controller (PL241) PrimeCell AHB SRAM/NOR Memory Controller (PL241) Revision: r0p1 Technical Reference Manual Copyright 2006 ARM Limited. All rights reserved. ARM DDI 0389B PrimeCell AHB SRAM/NOR Memory Controller (PL241)

More information

ARM. Compiler toolchain v4.1 for. Creating Static Software Libraries with armar. Copyright 2011 ARM. All rights reserved. ARM DUI 0590A (ID061811)

ARM. Compiler toolchain v4.1 for. Creating Static Software Libraries with armar. Copyright 2011 ARM. All rights reserved. ARM DUI 0590A (ID061811) ARM Compiler toolchain v4.1 for µvision Creating Static Software Libraries with armar Copyright 2011 ARM. All rights reserved. ARM DUI 0590A () ARM Compiler toolchain v4.1 for µvision Creating Static Software

More information

ARM DS-5. Eclipse for DS-5 User Guide. Version 5. Copyright ARM. All rights reserved. ARM DUI0480Q

ARM DS-5. Eclipse for DS-5 User Guide. Version 5. Copyright ARM. All rights reserved. ARM DUI0480Q ARM DS-5 Version 5 Eclipse for DS-5 User Guide Copyright 2010-2015 ARM. All rights reserved. ARM DUI0480Q ARM DS-5 ARM DS-5 Eclipse for DS-5 User Guide Copyright 2010-2015 ARM. All rights reserved. Release

More information

CoreLink MMU-500 System Memory Management Unit ARM. Technical Reference Manual. Revision: r2p2

CoreLink MMU-500 System Memory Management Unit ARM. Technical Reference Manual. Revision: r2p2 ARM CoreLink MMU-500 System Memory Management Unit Revision: r2p2 Technical Reference Manual Copyright 2013, 2014 ARM. All rights reserved. ARM DDI 0517E () ARM CoreLink MMU-500 System Memory Management

More information

Mali. GPU OpenVG. Application Development Guide. Copyright ARM Limited. All rights reserved. ARM DUI 0380D (ID121709)

Mali. GPU OpenVG. Application Development Guide. Copyright ARM Limited. All rights reserved. ARM DUI 0380D (ID121709) Mali GPU OpenVG Application Development Guide Copyright 2008-2009 ARM Limited. All rights reserved. ARM DUI 0380D () Mali GPU OpenVG Application Development Guide Copyright 2008-2009 ARM Limited. All rights

More information

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different

More information

DK2. Handel-C code optimization

DK2. Handel-C code optimization DK2 Handel-C code optimization Celoxica, the Celoxica logo and Handel-C are trademarks of Celoxica Limited. All other products or services mentioned herein may be trademarks of their respective owners.

More information

ARM DS-5. Eclipse for DS-5 User Guide. Version 5. Copyright ARM Limited or its affiliates. All rights reserved.

ARM DS-5. Eclipse for DS-5 User Guide. Version 5. Copyright ARM Limited or its affiliates. All rights reserved. ARM DS-5 Version 5 Eclipse for DS-5 User Guide ARM DS-5 ARM DS-5 Eclipse for DS-5 User Guide Release Information Document History Issue Date Confidentiality Change A June 2010 First release B September

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 6 April 2017 HW#8 will be posted Announcements HW#7 Power outage Pi Cluster Runaway jobs (tried

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Trace Debug Tools Version 1.2 Installation Guide

Trace Debug Tools Version 1.2 Installation Guide Trace Debug Tools Version 1.2 Installation Guide Copyright 2000-2002 ARM Limited. All rights reserved. Proprietary Notice Words and logos marked with or are registered trademarks or trademarks owned by

More information

ETM -A5. CoreSight. Technical Reference Manual. Revision: r0p2. Copyright 2009, 2010 ARM. All rights reserved. ARM DDI 0435C (ID072915)

ETM -A5. CoreSight. Technical Reference Manual. Revision: r0p2. Copyright 2009, 2010 ARM. All rights reserved. ARM DDI 0435C (ID072915) CoreSight ETM -A5 Revision: r0p2 Technical Reference Manual Copyright 2009, 2010 ARM. All rights reserved. ARM DDI 0435C () CoreSight ETM-A5 Technical Reference Manual Copyright 2009, 2010 ARM. All rights

More information

ARM. System Memory Management Unit Architecture Specification. 64KB Translation Granule Supplement

ARM. System Memory Management Unit Architecture Specification. 64KB Translation Granule Supplement ARM System Memory Management Unit Architecture Specification 64KB Translation Granule Supplement Copyright 2013 ARM. All rights reserved. ARM IHI 0067A.b () ARM System Memory Management Unit Architecture

More information

Texture Compression Tool. Mali. User Guide. Version: Copyright 2009, ARM. All rights reserved. DUI 0503E (ID092912)

Texture Compression Tool. Mali. User Guide. Version: Copyright 2009, ARM. All rights reserved. DUI 0503E (ID092912) Mali Texture Compression Tool Version: 4.0.1 User Guide Copyright 2009, 2011-2012 ARM. All rights reserved. DUI 0503E () Mali Texture Compression Tool User Guide Copyright 2009, 2011-2012 ARM. All rights

More information

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011 GPGPU COMPUTE ON AMD Udeepta Bordoloi April 6, 2011 WHY USE GPU COMPUTE CPU: scalar processing + Latency + Optimized for sequential and branching algorithms + Runs existing applications very well - Throughput

More information

Fault Handling and Detection

Fault Handling and Detection Fault Handling and Detection Version 1.0 Copyright 2016 ARM. All rights reserved. ARM 100691_0100_00_en Fault Handling and Detection Fault Handling and Detection Copyright 2016 ARM. All rights reserved.

More information

Fault Handling and Detection

Fault Handling and Detection Fault Handling and Detection Version 1.1 Copyright 2016 ARM Limited or its affiliates. All rights reserved. ARM 100691_0101_00_en Fault Handling and Detection Fault Handling and Detection Copyright 2016

More information

OpenCL* Device Fission for CPU Performance

OpenCL* Device Fission for CPU Performance OpenCL* Device Fission for CPU Performance Summary Device fission is an addition to the OpenCL* specification that gives more power and control to OpenCL programmers over managing which computational units

More information

ARM DS-5. Using the Debugger. Copyright 2010 ARM. All rights reserved. ARM DUI 0446A (ID070310)

ARM DS-5. Using the Debugger. Copyright 2010 ARM. All rights reserved. ARM DUI 0446A (ID070310) ARM DS-5 Using the Debugger Copyright 2010 ARM. All rights reserved. ARM DUI 0446A () ARM DS-5 Using the Debugger Copyright 2010 ARM. All rights reserved. Release Information The following changes have

More information

AMBA Keyboard/Mouse PS/2 Interface

AMBA Keyboard/Mouse PS/2 Interface AMBA Keyboard/Mouse PS/2 Interface Datasheet Copyright 1996-1998 ARM Limited. All rights reserved. ARM DDI 0096B AMBA Keyboard/Mouse PS/2 Interface Datasheet Copyright 1996-1998 ARM Limited. All rights

More information

OpenCL on the GPU. San Jose, CA September 30, Neil Trevett and Cyril Zeller, NVIDIA

OpenCL on the GPU. San Jose, CA September 30, Neil Trevett and Cyril Zeller, NVIDIA OpenCL on the GPU San Jose, CA September 30, 2009 Neil Trevett and Cyril Zeller, NVIDIA Welcome to the OpenCL Tutorial! Khronos and industry perspective on OpenCL Neil Trevett Khronos Group President OpenCL

More information

GPGPU IGAD 2014/2015. Lecture 1. Jacco Bikker

GPGPU IGAD 2014/2015. Lecture 1. Jacco Bikker GPGPU IGAD 2014/2015 Lecture 1 Jacco Bikker Today: Course introduction GPGPU background Getting started Assignment Introduction GPU History History 3DO-FZ1 console 1991 History NVidia NV-1 (Diamond Edge

More information

MOVE. Coprocessor. Technical Reference Manual. Copyright 2001, 2002, ARM Limited. All rights reserved. ARM DDI 0235C

MOVE. Coprocessor. Technical Reference Manual. Copyright 2001, 2002, ARM Limited. All rights reserved. ARM DDI 0235C MOVE Coprocessor Technical Reference Manual Copyright 00, 00, 004. ARM Limited. All rights reserved. ARM DDI 035C MOVE Coprocessor Technical Reference Manual Copyright 00, 00, 004. ARM Limited. All rights

More information

Compute Subsystem SCP ARM. Message Interface Protocols. Version: 1.2

Compute Subsystem SCP ARM. Message Interface Protocols. Version: 1.2 ARM Compute Subsystem SCP Version: 1.2 Message Interface Protocols Copyright 2015, 2016 ARM Limited or its affiliates. All rights reserved. ARM DUI 0922G () ARM Compute Subsystem SCP Message Interface

More information

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

MICHAL MROZEK ZBIGNIEW ZDANOWICZ MICHAL MROZEK ZBIGNIEW ZDANOWICZ Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY

More information

Krzysztof Laskowski, Intel Pavan K Lanka, Intel

Krzysztof Laskowski, Intel Pavan K Lanka, Intel Krzysztof Laskowski, Intel Pavan K Lanka, Intel Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR

More information

Quadros. RTXC Kernel Services Reference, Volume 1. Levels, Threads, Exceptions, Pipes, Event Sources, Counters, and Alarms. Systems Inc.

Quadros. RTXC Kernel Services Reference, Volume 1. Levels, Threads, Exceptions, Pipes, Event Sources, Counters, and Alarms. Systems Inc. Quadros Systems Inc. RTXC Kernel Services Reference, Volume 1 Levels, Threads, Exceptions, Pipes, Event Sources, Counters, and Alarms Disclaimer Quadros Systems, Inc. makes no representations or warranties

More information

Compute Subsystem SCP ARM. Message Interface Protocols. Version: 1.0. Copyright 2015 ARM. All rights reserved. ARM DUI 0922B (ID050115)

Compute Subsystem SCP ARM. Message Interface Protocols. Version: 1.0. Copyright 2015 ARM. All rights reserved. ARM DUI 0922B (ID050115) ARM Compute Subsystem SCP Version: 1.0 Message Interface Protocols Copyright 2015 ARM. All rights reserved. ARM DUI 0922B () ARM Compute Subsystem SCP Message Interface Protocols Copyright 2015 ARM. All

More information

Single Master DMA Controller (PL081) PrimeCell. Technical Reference Manual. Revision: r1p2

Single Master DMA Controller (PL081) PrimeCell. Technical Reference Manual. Revision: r1p2 PrimeCell Single Master DMA Controller (PL081) Revision: r1p2 Technical Reference Manual Copyright 2001, 2003-2005 ARM Limited. All rights reserved. ARM DDI 0218E PrimeCell Single Master DMA Controller

More information

AMBA Remap and Pause. Technical Reference Manual. Revision: r0p0. Copyright 1997 ARM Limited. All rights reserved. ARM DDI 0048AC

AMBA Remap and Pause. Technical Reference Manual. Revision: r0p0. Copyright 1997 ARM Limited. All rights reserved. ARM DDI 0048AC Revision: r0p0 Technical Reference Manual Copyright 1997 ARM Limited. All rights reserved. ARM DDI 0048AC Technical Reference Manual Copyright 1997 ARM Limited. All rights reserved. Release Information

More information

Introduction to OpenCL!

Introduction to OpenCL! Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

Network Interconnect NIC-301. CoreLink. Technical Reference Manual. Revision: r2p3

Network Interconnect NIC-301. CoreLink. Technical Reference Manual. Revision: r2p3 CoreLink Network Interconnect NIC-301 Revision: r2p3 Technical Reference Manual Copyright 2006-2011 ARM. All rights reserved. ARM DDI 0397I () CoreLink Network Interconnect NIC-301 Technical Reference

More information

OpenCL Overview Benedict R. Gaster, AMD

OpenCL Overview Benedict R. Gaster, AMD Copyright Khronos Group, 2011 - Page 1 OpenCL Overview Benedict R. Gaster, AMD March 2010 The BIG Idea behind OpenCL OpenCL execution model - Define N-dimensional computation domain - Execute a kernel

More information

NIC-400 Network Interconnect. CoreLink. Technical Reference Manual. Revision: r0p0. Copyright 2012 ARM. All rights reserved. ARM DDI 0475A (ID081712)

NIC-400 Network Interconnect. CoreLink. Technical Reference Manual. Revision: r0p0. Copyright 2012 ARM. All rights reserved. ARM DDI 0475A (ID081712) CoreLink NIC-400 Network Interconnect Revision: r0p0 Technical Reference Manual Copyright 2012 ARM. All rights reserved. ARM DDI 0475A () CoreLink NIC-400 Network Interconnect Technical Reference Manual

More information

Introduction à OpenCL

Introduction à OpenCL 1 1 UDS/IRMA Journée GPU Strasbourg, février 2010 Sommaire 1 OpenCL 2 3 GPU architecture A modern Graphics Processing Unit (GPU) is made of: Global memory (typically 1 Gb) Compute units (typically 27)

More information

Secure software guidelines for ARMv8-M. for ARMv8-M. Version 0.1. Version 2.0. Copyright 2017 ARM Limited or its affiliates. All rights reserved.

Secure software guidelines for ARMv8-M. for ARMv8-M. Version 0.1. Version 2.0. Copyright 2017 ARM Limited or its affiliates. All rights reserved. Connect Secure software User Guide guidelines for ARMv8-M Version 0.1 Version 2.0 Page 1 of 19 Revision Information The following revisions have been made to this User Guide. Date Issue Confidentiality

More information

Versatile Express. Boot Monitor. Reference Manual. Copyright ARM. All rights reserved. ARM DUI 0465F (ID110612)

Versatile Express. Boot Monitor. Reference Manual. Copyright ARM. All rights reserved. ARM DUI 0465F (ID110612) Versatile Express Boot Monitor Reference Manual Copyright 2009-2012 ARM. All rights reserved. ARM DUI 0465F () Versatile Express Boot Monitor Reference Manual Copyright 2009-2012 ARM. All rights reserved.

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information

ETM -R4. CoreSight. Technical Reference Manual. Revision: r2p1. Copyright ARM. All rights reserved. ARM DDI 0367D (ID072915)

ETM -R4. CoreSight. Technical Reference Manual. Revision: r2p1. Copyright ARM. All rights reserved. ARM DDI 0367D (ID072915) CoreSight ETM -R4 Revision: r2p1 Technical Reference Manual Copyright 2005-2011 ARM. All rights reserved. ARM DDI 0367D (ID072915) CoreSight ETM-R4 Technical Reference Manual Copyright 2005-2011 ARM. All

More information

PrimeCell AHB SDR and SRAM/NOR Memory Controller (PL243)

PrimeCell AHB SDR and SRAM/NOR Memory Controller (PL243) PrimeCell AHB SDR and SRAM/NOR Memory Controller (PL243) Revision: r0p1 Technical Reference Manual Copyright 2006 ARM Limited. All rights reserved. ARM DDI 0391B PrimeCell AHB SDR and SRAM/NOR Memory Controller

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 1 OpenCL Overview Shanghai March 2012 Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 2 Processor

More information

ARM AMBA. Designer ADR-400. User Guide. Revision: r3p7

ARM AMBA. Designer ADR-400. User Guide. Revision: r3p7 ARM AMBA Designer ADR-400 Revision: r3p7 User Guide Copyright 2006-2014, 2016, 2017 ARM Limited or its affiliates. All rights reserved. ARM DUI 0333R () ARM AMBA Designer ADR-400 User Guide Copyright 2006-2014,

More information

Multi-Core Programming

Multi-Core Programming Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

ARM v8-m Processor Debug

ARM v8-m Processor Debug ARM v8-m Processor Debug Version 1.0 Copyright 2016 ARM Limited or its affiliates. All rights reserved. ARM 100734_0100_0100_en ARM v8-m Processor Debug ARM v8-m Processor Debug Copyright 2016 ARM Limited

More information

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

OpenCL in Action. Ofer Rosenberg

OpenCL in Action. Ofer Rosenberg pencl in Action fer Rosenberg Working with pencl API pencl Boot Platform Devices Context Queue Platform Query int GetPlatform (cl_platform_id &platform, char* requestedplatformname) { cl_uint numplatforms;

More information

Fundamentals of the Java Programming Language

Fundamentals of the Java Programming Language Fundamentals of the Java Programming Language Student Guide SL-110 REV E D61798GC10 Edition 1.0 2009 D62399 Copyright 2006, 2009, Oracle and/or its affiliates. All rights reserved. Disclaimer This document

More information

OpenCL Common Runtime for Linux on x86 Architecture

OpenCL Common Runtime for Linux on x86 Architecture OpenCL Common Runtime for Linux on x86 Architecture Installation and Users Guide Version 0.1 Copyright International Business Machines Corporation, 2011 All Rights Reserved Printed in the United States

More information

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 ANECDOTE DISCOVERING A BUFFER OVERFLOW CPU GPU MEMORY MEMORY Data Data Data Data Data 2 clarmor: A

More information

Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers

Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers Collecting Important OpenCL*-related Metrics with Intel GPA System Analyzer Introduction Intel SDK for OpenCL* Applications

More information

Application Programming

Application Programming Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

EMBEDDED SYSTEMS: Jonathan W. Valvano INTRODUCTION TO THE MSP432 MICROCONTROLLER. Volume 1 First Edition June 2015

EMBEDDED SYSTEMS: Jonathan W. Valvano INTRODUCTION TO THE MSP432 MICROCONTROLLER. Volume 1 First Edition June 2015 EMBEDDED SYSTEMS: INTRODUCTION TO THE MSP432 MICROCONTROLLER Volume 1 First Edition June 2015 Jonathan W. Valvano ii Jonathan Valvano First edition 3 rd printing June 2015 The true engineering experience

More information

TrustZone Address Space Controller TZC-380. CoreLink. Technical Reference Manual. Revision: r0p1

TrustZone Address Space Controller TZC-380. CoreLink. Technical Reference Manual. Revision: r0p1 CoreLink TrustZone Address Space Controller TZC-380 Revision: r0p1 Technical Reference Manual Copyright 2008, 2010 ARM Limited. All rights reserved. ARM DDI 0431C () CoreLink TrustZone Address Space Controller

More information

MPI: A Message-Passing Interface Standard

MPI: A Message-Passing Interface Standard MPI: A Message-Passing Interface Standard Version 2.1 Message Passing Interface Forum June 23, 2008 Contents Acknowledgments xvl1 1 Introduction to MPI 1 1.1 Overview and Goals 1 1.2 Background of MPI-1.0

More information

SIMULATOR AMD RESEARCH JUNE 14, 2015

SIMULATOR AMD RESEARCH JUNE 14, 2015 AMD'S gem5apu SIMULATOR AMD RESEARCH JUNE 14, 2015 OVERVIEW Introducing AMD s gem5 APU Simulator Extends gem5 with a GPU timing model Supports Heterogeneous System Architecture in SE mode Includes several

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games Viewdle Inc. 1 Use cases Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games 2 Why OpenCL matter? OpenCL is going to bring such

More information

Colin Riddell GPU Compiler Developer Codeplay Visit us at

Colin Riddell GPU Compiler Developer Codeplay Visit us at OpenCL Colin Riddell GPU Compiler Developer Codeplay Visit us at www.codeplay.com 2 nd Floor 45 York Place Edinburgh EH1 3HP United Kingdom Codeplay Overview of OpenCL Codeplay + OpenCL Our technology

More information