ARM Mali-T600 Series GPU OpenCL Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)
ARM Mali-T600 Series GPU OpenCL Developer Guide
Copyright ARM. All rights reserved.

Release Information
The following changes have been made to this book.

Change history
Date              Issue  Confidentiality   Change
12 July 2012      A      Confidential      First release
07 November 2012  D      Confidential      Second release
27 February 2013  E      Non-confidential  Third release
03 December 2013  F      Non-confidential  Fourth release

Proprietary Notice
Words and logos marked with or are registered trademarks or trademarks of ARM in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners. Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded. This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product. Where the term ARM is used it means ARM or any of its subsidiaries as appropriate.

Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.

Product Status
The information in this document is final, that is, for a developed product.

Web Address
Contents
ARM Mali-T600 Series GPU OpenCL Developer Guide

Preface
    About this book
    Feedback
Chapter 1  Introduction
    1.1 About GPU compute
    1.2 About OpenCL
    1.3 About the Mali-T600 Series GPU Linux OpenCL driver
    1.4 About the Mali OpenCL SDK
Chapter 2  Parallel Processing Concepts
    2.1 Types of parallelism
    2.2 Concurrency
    2.3 Limitations of parallel processing
    2.4 Embarrassingly parallel applications
    2.5 Mixing different types of parallelism
Chapter 3  OpenCL Concepts
    3.1 About OpenCL
    3.2 OpenCL applications
    3.3 OpenCL execution model
    3.4 OpenCL data processing
    3.5 The OpenCL memory model
    3.6 The Mali GPU memory model
    3.7 OpenCL concepts summary
Chapter 4  Developing an OpenCL Application
    4.1 Software and hardware required for OpenCL development
    4.2 Development stages
Chapter 5  Execution Stages of an OpenCL Application
    5.1 About the execution stages
    5.2 Finding the available compute devices
    5.3 Initializing and creating OpenCL contexts
    5.4 Creating a command queue
    5.5 Creating OpenCL program objects
    5.6 Building a program executable
    5.7 Creating kernel and memory objects
    5.8 Executing the kernel
    5.9 Reading the results
    5.10 Cleaning up
Chapter 6  Converting Existing Code to OpenCL
    6.1 Profile your application
    6.2 Analyzing code for parallelization
    6.3 Parallel processing techniques in OpenCL
    6.4 Using parallel processing with non-parallelizable code
    6.5 Dividing data for OpenCL
Chapter 7  Retuning Existing OpenCL Code for Mali GPUs
    7.1 About retuning existing OpenCL code for Mali GPUs
    7.2 Differences between desktop-based architectures and Mali GPUs
    7.3 Procedure for retuning existing OpenCL code for Mali GPUs
Chapter 8  Optimizing OpenCL for Mali GPUs
    8.1 The optimization process for OpenCL applications
    8.2 Load balancing between the application processor and the Mali GPU
    8.3 Sharing memory between I/O devices and OpenCL
Chapter 9  OpenCL Optimizations List
    9.1 General optimizations
    9.2 Memory optimizations
    9.3 Kernel optimizations
    9.4 Code optimizations
    9.5 Execution optimizations
    9.6 Reducing the effect of serial computations
Chapter 10  The Mali OpenCL SDK
Appendix A  OpenCL Data Types
Appendix B  OpenCL Built-in Functions
    B.1 Work-item functions
    B.2 Math functions
    B.3 half_ and native_ math functions
    B.4 Integer functions
    B.5 Common functions
    B.6 Geometric functions
    B.7 Relational functions
    B.8 Vector data load and store functions
    B.9 Synchronization
    B.10 Asynchronous copy functions
    B.11 Atomic functions
    B.12 Miscellaneous vector functions
    B.13 Image read and write functions
Appendix C  OpenCL Extensions
Preface This preface introduces the Mali-T600 Series GPU OpenCL Developer Guide. It contains the following sections: About this book on page vii. Feedback on page ix.
About this book
This book describes OpenCL application development for the Mali-T600 Series GPUs.

Product revision status
The rnpn identifier indicates the revision status of the product described in this book, where:
rn Identifies the major revision of the product.
pn Identifies the minor revision or modification status of the product.

Intended audience
This guide is written for software developers with experience in C or C-like languages who want to develop OpenCL applications for Mali-T600 Series GPUs.

Using this book
This book is organized into the following chapters:
Chapter 1 Introduction Read this for an introduction to OpenCL and the Mali OpenCL SDK.
Chapter 2 Parallel Processing Concepts Read this for an introduction to parallel processing concepts.
Chapter 3 OpenCL Concepts Read this for a description of the OpenCL concepts.
Chapter 4 Developing an OpenCL Application Read this for a description of the development stages of an OpenCL application.
Chapter 5 Execution Stages of an OpenCL Application Read this for a description of the execution stages of an OpenCL application.
Chapter 6 Converting Existing Code to OpenCL Read this for a description of how to convert existing code to OpenCL.
Chapter 7 Retuning Existing OpenCL Code for Mali GPUs Read this for a description of how to retune existing OpenCL code for the Mali-T600 Series GPUs.
Chapter 8 Optimizing OpenCL for Mali GPUs Read this for a description of how to optimize OpenCL for the Mali-T600 Series GPUs.
Chapter 9 OpenCL Optimizations List Read this for a list of optimizations for OpenCL on the Mali-T600 Series GPUs.
Chapter 10 The Mali OpenCL SDK Read this for an introduction to the Mali OpenCL SDK.
Appendix A OpenCL Data Types Read this for a list of the data types available.
Appendix B OpenCL Built-in Functions Read this for a list of the OpenCL built-in functions implemented in the Mali-T600 Series GPU Linux OpenCL driver.
Appendix C OpenCL Extensions Read this for a list of extensions the Mali-T600 Series GPU Linux OpenCL driver supports.

Glossary
The ARM Glossary is a list of terms used in ARM documentation, together with definitions for those terms. The ARM Glossary does not contain terms that are industry standard unless the ARM meaning differs from the generally accepted meaning. See ARM Glossary.

Typographical conventions
This book uses the following typographical conventions:
italic Introduces special terminology, denotes cross-references, and citations.
bold Highlights interface elements, such as menu names. Also used for terms in descriptive lists, where appropriate.
monospace Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code.
monospace (underlined) Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name.
monospace italic Denotes arguments to monospace text where the argument is to be replaced by a specific value.
monospace bold Denotes language keywords when used outside example code.
< and > Enclose replaceable terms for assembler syntax where they appear in code or code fragments. For example: MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>

Additional reading
This section lists publications by ARM and by third parties. See Infocenter for access to ARM documentation.
Other publications This section lists relevant documents published by third parties: OpenCL 1.1 Specification.
Feedback
ARM welcomes feedback on this product and its documentation.
Feedback on this product If you have any comments or suggestions about this product, contact your supplier and give: The product name. The product revision or version. An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate.
Feedback on content If you have comments on content, send an e-mail giving: The title. The number, DUI0538F. The page numbers to which your comments apply. A concise explanation of your comments. ARM also welcomes general suggestions for additions and improvements.
Note ARM tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the represented document when used with any other PDF reader.
Chapter 1 Introduction This chapter introduces GPU compute, OpenCL, the Mali-T600 Series GPU Linux OpenCL driver, and the Mali OpenCL SDK. It contains the following sections: About GPU compute on page 1-2. About OpenCL on page 1-3. About the Mali-T600 Series GPU Linux OpenCL driver on page 1-4. About the Mali OpenCL SDK on page 1-5.
1.1 About GPU compute GPU compute, or General Purpose computing on Graphics Processing Units (GPGPU), is the practice of using the parallel computing power of a GPU for tasks other than 3D graphics rendering. Application processors are designed to execute a single thread as quickly as possible. This sort of processing typically includes scalar operations and control code. GPUs are designed to execute many threads at the same time. They run compute-intensive data processing tasks in parallel that contain relatively little control code. GPUs typically contain many more processing elements than application processors, so they can compute at a much higher rate than application processors. OpenCL is the first open standard language to enable developers to run general purpose computing tasks on GPUs, application processors, and other types of processors.
1.2 About OpenCL The Open Computing Language (OpenCL) is an open standard for writing applications to run on heterogeneous multi-processor systems. OpenCL provides a single development environment for applications that can run on different processors. OpenCL includes a platform-independent C99-based language for writing functions called kernels that execute on OpenCL devices, and APIs that define and control the platforms. OpenCL enables you to execute some applications faster by moving intensive data processing routines to the GPU instead of the application processor. OpenCL makes multiprocessor applications easier to write because it manages the execution of your application across multiple application processors and GPUs. The OpenCL language includes vectors and built-in functions that enable you to easily utilize the features of accelerators. OpenCL is an open standard developed by the Khronos Group.
1.3 About the Mali-T600 Series GPU Linux OpenCL driver The Mali-T600 Series GPU Linux OpenCL driver is the implementation of OpenCL for the Mali-T600 Series GPUs. In this document, it is known as the Mali OpenCL driver. The Mali OpenCL driver: Supports OpenCL version 1.1, Full Profile. Is binary-compatible with OpenCL 1.0 applications. This includes compatibility with the APIs deprecated in OpenCL 1.1. Note The Mali OpenCL driver is for the Mali-T600 Series GPUs. It does not support the Mali-300, Mali-400, or Mali-450 GPUs.
1.4 About the Mali OpenCL SDK The Mali OpenCL SDK contains code examples and tutorials to help you understand OpenCL development. See Chapter 10 The Mali OpenCL SDK.
Chapter 2 Parallel Processing Concepts Parallel processing is the processing of computations on multiple processors simultaneously. OpenCL enables applications to use hardware resources such as GPUs to accelerate computations with parallel processing. This chapter introduces the main concepts of parallel processing. It contains the following sections: Types of parallelism on page 2-2. Concurrency on page 2-4. Limitations of parallel processing on page 2-5. Embarrassingly parallel applications on page 2-6. Mixing different types of parallelism on page 2-7.
2.1 Types of parallelism There are the following types of parallelism:
Data parallelism In a data-parallel application, data is divided into data elements that can be processed in parallel. Multiple data elements are read and processed simultaneously by different processors. The data being processed must be in data structures that can be read and written in parallel. An example of a data-parallel application is rendering three dimensional graphics. The generated pixels are independent, so the computations required to generate them can be performed in parallel. This sort of parallelism is very fine grained and there can be hundreds or thousands of threads active simultaneously. OpenCL is primarily used for data-parallel processing.
Task parallelism Task parallelism is where the application is broken up into tasks and these tasks are executed in parallel. Task parallelism is also known as functional parallelism. An example of an application that can use task parallelism is playing an on-line video. To play the video, your device must perform several tasks: Run a network stack that performs communication. Request data from an external server. Read data from the external server. Parse data. Decode video data. Decode audio data. Draw video frames. Play audio data. Figure 2-1 shows parts of an application and operating system that operate simultaneously when playing an on-line video. Figure 2-1 Task parallel processing
Pipelines Pipelines process data in a series of stages. In a pipeline, the stages can operate simultaneously but they do not process the same data. A pipeline typically has a relatively small number of stages.
An example of a pipeline is a video recorder application that must perform the following stages: 1. Capture image data from an image sensor and measure light levels. 2. Modify the image data to correct for lens effects. 3. Modify the contrast, color balance, and exposure of the image data. 4. Compress the image. 5. Add the data to the video file. 6. Write the video file to storage. These stages must be performed in order, but they can all be operating on data from a different video frame at the same time. Figure 2-2 shows parts of an application that can operate simultaneously as a pipeline. Figure 2-2 Pipeline processing
2.2 Concurrency Concurrent applications have multiple operations in progress at the same time. These can operate in parallel or in serial using a time-sharing system. In a concurrent application, multiple tasks are typically trying to share the same data. Access to this data must be managed carefully, otherwise there can be complex problems such as:
Race conditions A race condition occurs when two or more threads try to modify the value of one variable at the same time. The final value of the variable should always be the same, but when a race condition occurs the variable can get a different value depending on the order of the writes.
Deadlocks A deadlock occurs when two threads become blocked by each other and neither thread can progress with their operations. This can happen when the threads each obtain a lock that the other thread requires.
Livelocks A livelock is similar to a deadlock but the threads keep running. However, because of the lock, the threads can never complete their task.
A concurrent data structure is a data structure that can be accessed by multiple tasks without causing concurrency problems. Data parallel applications use concurrent data structures. These are the sorts of data structures that you typically use in OpenCL.
2.3 Limitations of parallel processing There are limitations of parallel processing that you must consider when developing parallel applications. For example, if your application parallelizes perfectly, executing the application on 10 processors makes it run 10 times faster. However, applications rarely parallelize perfectly because part of the application is serial. This serial component imposes a limit on the amount of parallelization the application can use. Amdahl's law describes the speedup you can get from parallel processing. The formula for Amdahl's law is shown in Figure 2-3, where the terms in the equation are:
S Fraction of the application that is serial.
P Fraction of the application that is parallelizable.
N Number of processors.
Speedup = 1 / (S + P / N)
Figure 2-3 Formula for Amdahl's law
Figure 2-4 shows the speedup that different numbers of processors provide for applications with different serial components: perfect scaling, and 5%, 10%, and 20% serial. Figure 2-4 Speedup for applications with different serial components
The biggest speedups are achieved with relatively small numbers of processors. However, as the number of processors rises, the gains reduce. You cannot avoid Amdahl's law in your application, but you can reduce the impact. See Reducing the effect of serial computations in Chapter 9. For high performance with a large number of processors, the application must have a very small serial component. These sorts of applications are said to be embarrassingly parallel. See Embarrassingly parallel applications on page 2-6.
2.4 Embarrassingly parallel applications If an application can be parallelized across a large number of processors easily, it is said to be embarrassingly parallel. An example of an embarrassingly parallel application is rendering three dimensional graphics. The pixels are completely independent so they can be computed and drawn in parallel. OpenCL is ideally suited for developing and executing embarrassingly parallel applications. Figure 2-5 Embarrassingly parallel processing Figure 2-5 shows an image divided into many small parts. These parts can all be processed simultaneously.
2.5 Mixing different types of parallelism You can mix these types of parallelism in your applications. For example, an audio synthesizer might use a combination of all three types of parallelism: Task parallelism computes the notes independently. A pipeline of audio generation and processing modules creates the sound of an individual note. Within the pipeline, some stages can use data parallelism to accelerate the computation.
Chapter 3 OpenCL Concepts This chapter describes the OpenCL concepts. It contains the following sections: About OpenCL on page 3-2. OpenCL applications on page 3-3. OpenCL execution model on page 3-4. OpenCL data processing on page 3-5. The OpenCL memory model on page 3-8. The Mali GPU memory model on page 3-9. OpenCL concepts summary on page 3-10.
3.1 About OpenCL OpenCL is an open standard that enables you to use the parallel processing capabilities of multiple types of processors including application processors, GPUs, and other computing devices. OpenCL specifies an API for parallel programming that is designed for portability: It uses an abstracted memory and execution model. There is no requirement to know the application processor or GPU instruction set. There is scope for specific hardware optimizations. Functions executing on OpenCL devices are called kernels. These are written in a language called OpenCL C that is based on C99. The Mali-T600 Series GPUs support OpenCL 1.1, Full Profile.
3.2 OpenCL applications OpenCL applications consist of the following parts:
Application, or host, side code Calls the OpenCL APIs. Compiles the CL kernels. Allocates memory buffers to pass data into and out of the OpenCL kernels. Sets up command queues. Sets up dependencies between the tasks. Sets up the NDRanges that the kernels execute over.
OpenCL kernels Written in the OpenCL C language. Perform the parallel processing. Run on compute devices such as GPU shader cores.
You must write both of these parts correctly to get the best performance.
3.3 OpenCL execution model The OpenCL execution model includes: Kernels that run on compute devices. A host application that runs on the application processor.
The host application The host application manages the execution of the kernels by setting up command queues for: Memory commands. Kernel execution commands. Synchronization.
The context The host application defines the context for the kernels. The context includes: The kernels. Compute devices. Program objects. Memory objects.
Operation of OpenCL kernels A kernel is a block of code that is executed on a compute device in parallel with other kernels. Kernels operate in the following sequence: 1. A kernel is defined in a host application. 2. The host application submits the kernel for execution on a compute device. A compute device can be an application processor, GPU, or another type of processor. 3. When the application issues a command to submit a kernel, OpenCL creates the NDRange of work-items. 4. An instance of the kernel is created for each element in the NDRange. This enables each element to be processed independently in parallel.
3.4 OpenCL data processing This section describes OpenCL data processing. It contains the following sections: Work-items and the NDRange. OpenCL work-groups on page 3-6. Identifiers in OpenCL on page 3-7.
Work-items and the NDRange The data processed by OpenCL is in an index space of work-items. The work-items are organized in an N-Dimensional Range (NDRange), where N is the number of dimensions. N can be one, two, or three. One kernel instance is executed for each work-item in the index space. Figure 3-1 shows NDRanges with one, two, and three dimensions. Figure 3-1 NDRanges and work-items
You group work-items into work-groups for processing. Figure 3-2 on page 3-6 shows a three dimensional NDRange that is split into 16 work-groups, each with 16 work-items.
Figure 3-2 Work-items and work-groups
OpenCL work-groups Work-groups have a number of properties and limitations:
Properties of work-groups Work-groups are independent of each other. You can issue multiple work-groups for execution in parallel. The work-items in a work-group can communicate with each other using shared data buffers. You must synchronize access to these buffers.
Limitations of work-groups Work-groups typically do not directly share data. They can share data using global memory. The following are not supported across different work-groups: Barriers. Dependencies. Ordering. Coherency. Global atomics are available but these can be slower than local atomics.
Work-items in a work-group The work-items in a work-group can do the following: Perform barrier operations to synchronize execution points. For example:
barrier(CLK_LOCAL_MEM_FENCE); // Wait for all work-items in this work-group to catch up
Use local atomic operations. Access shared memory.
Identifiers in OpenCL There are a number of identifiers in OpenCL:
global ID Every work-item has a unique global ID that identifies it within the index space.
local ID Within each work-group, each work-item has a unique local ID that identifies it within its work-group.
work-group ID Each work-group has a unique work-group ID.
3.5 The OpenCL memory model The OpenCL memory model contains a number of components. Figure 3-3 shows the OpenCL memory model. Figure 3-3 OpenCL memory model
Private memory Private memory is specific to a work-item. It is not visible to other work-items.
Local memory Local to a work-group. Accessible by the work-items in the work-group. Consistent to all work-items in the work-group. Accessed with the local keyword.
Constant memory A memory region used for objects allocated and initialized by the host. Accessible as read-only by all work-items.
Global memory Accessible to all work-items executing in a context. Accessible to the host using read, write, and map commands. Consistent across work-items in a single work-group. Implements a relaxed consistency, shared memory model. There is no guarantee of memory consistency between different work-groups. Accessed with the global keyword.
3.6 The Mali GPU memory model The Mali GPU has a different memory model to desktop workstations:
Desktop Traditional desktop workstations have physically separate global, local, and private memories. Typically, a graphics card has its own local memory. Data must be copied to the local memory and back again.
Mali GPU Mali GPUs have a unified memory system. Local and private memory is physically global memory. Moving data from global to local or private memory typically does not improve performance, so the traditional copying of data is not required. Each compute device, that is each shader core, has its own data caches.
3.7 OpenCL concepts summary OpenCL primarily uses data parallel processing. OpenCL uses the following terminology: Computations in OpenCL are performed by pieces of code called kernels that execute on compute devices. Compute devices can be application processors, GPUs, or other types of processors. The data processed by OpenCL is in an index space of work-items. The work-items are organized in an NDRange. One kernel instance is executed for each work-item in the index space. Kernel instances execute in parallel. Work-items group together to form work-groups. The work-items in a work-group can communicate with each other using shared data buffers, but access to the buffers must be synchronized. Work-groups typically do not directly share data with each other. They can share data using global memory. You can issue multiple work-groups for execution in parallel.
Chapter 4 Developing an OpenCL Application This chapter describes the development stages of an OpenCL application. It contains the following sections: Software and hardware required for OpenCL development on page 4-2. Development stages on page 4-3.
4.1 Software and hardware required for OpenCL development To develop OpenCL applications for Mali GPUs, you require: A platform with a Mali-T600 Series GPU. An implementation of OpenCL for the Mali-T600 Series GPU. You can develop on other hardware platforms with implementations of OpenCL, but you cannot use them to estimate performance on a Mali-T600 Series GPU. Implementations of OpenCL are available for a number of operating systems. For a list of available hardware and the OpenCL drivers for the Mali-T600 Series GPUs, see the Mali developer center.
4.2 Development stages These are the stages for developing an OpenCL application:
Determine what you want to parallelize The first step when deciding to use OpenCL is to look at what your application does and identify the parts of the application that can run in parallel. This is often the hardest part of developing an OpenCL application. See Analyzing code for parallelization on page 6-3. Note Only convert the parts of an application to OpenCL where there is likely to be a benefit. Profile your application to find the most active parts and consider these parts for conversion.
Write kernels An OpenCL application consists of a set of kernel functions. You must write the kernels that perform the computations.
Write infrastructure for kernels OpenCL applications require infrastructure to set up and run the kernels.
Chapter 5 Execution Stages of an OpenCL Application This chapter describes the execution stages of an OpenCL application. It contains the following sections: About the execution stages on page 5-2. Finding the available compute devices on page 5-4. Initializing and creating OpenCL contexts on page 5-5. Creating a command queue on page 5-6. Creating OpenCL program objects on page 5-7. Building a program executable on page 5-8. Creating kernel and memory objects on page 5-9. Executing the kernel. Reading the results. Cleaning up. Note This chapter provides an overview of the execution stages of an OpenCL application. It is not intended as a comprehensive lesson in OpenCL.
36 Execution Stages of an OpenCL Application 5.1 About the execution stages This section describes a high level view of the OpenCL execution stages. Your OpenCL application must obtain information about your hardware then setup the runtime environment. The following sections describe these steps: Platform setup. Runtime setup Platform setup Runtime setup You use the platform API to: Determine what OpenCL devices are available. Query to find out what OpenCL devices are available on the system using OpenCL platform layer functions. See Finding the available compute devices on page 5-4. Set up the OpenCL context. Create and set up a OpenCL context and one or more command queues to schedule execution of your kernels. See Initializing and creating OpenCL contexts on page 5-5. You use the runtime API to: Create a command queue. See Creating a command queue on page 5-6. Compile and build your program objects. Issue commands to compile and build your source code and extracts kernel objects from the compiled code. You must follow this sequence of commands: 1. The program object is created either by calling clcreateprogramwithsource() or clcreateprogramwithbinary(). clcreateprogramwithsource() creates the program object from the kernel source code. clcreateprogramwithbinary() creates the program with a pre-compiled binary file. 2. Call the clbuildprogram() function to compile the program object for the specific devices on the system. See Creating OpenCL program objects on page 5-7. Build a program executable. See Building a program executable on page 5-8. Create the Kernel and memory objects. 1. Call the clcreatekernel() function for each kernel, or call the clcreatekernelsinprogram() function to create kernel objects for all the kernels in the OpenCL application. 2. Use the OpenCL API to allocate memory buffers. You can use the map() and unmap() operations to enable both the application processor and the Mali GPU to access the data. 
See Creating kernel and memory objects on page 5-9.

Enqueue and execute the kernels.
Enqueue to the command queues the commands that control the sequence and synchronization of kernel execution, mapping and unmapping of memory, and manipulation of memory objects. To execute a kernel function, you must do the following:
1. Call clSetKernelArg() for each parameter in the kernel function definition to set the kernel parameter values.
2. Determine the work-group size and index space to use to execute the kernel.
3. Enqueue the kernel for execution in the command queue.
See Executing the kernel on page 5-10.

Enqueue commands that make the results from the work-items available to the host. See Reading the results on page 5-12.

Clean up. See Cleaning up on page 5-13.
5.2 Finding the available compute devices

To set up OpenCL you must choose compute devices. Call clGetDeviceIDs() to query the OpenCL driver for a list of devices on the machine that support OpenCL. You can restrict your search to a particular type of device or to any combination of device types. You must also specify the maximum number of device IDs that you want returned.
5.3 Initializing and creating OpenCL contexts

After you know the available OpenCL devices on the machine and have at least one valid device ID, you can create an OpenCL context. The context groups devices together to enable memory objects to be shared across different compute devices. Pass the device information to the clCreateContext() function. For example:

// Create an OpenCL context
context = clCreateContext(NULL, 1, &device_id, notify_function, NULL, &err);
if (err != CL_SUCCESS)
{
    Cleanup();
    return 1;
}

You can optionally specify an error notification callback function when creating an OpenCL context. Leaving this parameter as a NULL value results in no error notification being registered. Providing a callback function can be useful if you want to receive runtime errors for the particular OpenCL context. For example:

// Optionally user_data can contain contextual information
// Implementation-specific data of size cb can be returned in private_info
void context_notify(const char *notify_message, const void *private_info,
                    size_t cb, void *user_data)
{
    printf("Notification:\n\t%s\n", notify_message);
}
5.4 Creating a command queue

After creating your OpenCL context, use clCreateCommandQueue() to create a command queue. For example:

// Create a command queue on the first device available
// on the created context
commandQueue = clCreateCommandQueue(context, device, 0, &err);
if (commandQueue == NULL)
{
    Cleanup();
    return 1;
}

If you have multiple OpenCL devices, such as an application processor and a GPU, you must:
1. Create a command queue for each device.
2. Divide up the work.
3. Submit commands separately to each device.
5.5 Creating OpenCL program objects

The program object encapsulates:
Your OpenCL program source.
The latest successfully built program executable.
The build options.
The build log.
A list of devices the program is built for.

The program object is loaded with the kernel source code and then the code is compiled on the devices attached to the context. All kernel functions must be identified in the application source with the kernel qualifier. OpenCL applications can also include functions you can call from your kernel functions.

Load the OpenCL C kernel source and create an OpenCL program object from it. To create a program object use the clCreateProgramWithSource() function. For example:

// Create OpenCL program
// kernel_source is a const char * pointing to the kernel source code
program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &err);
if (program == NULL)
{
    Cleanup();
    return 1;
}

There are different options for building OpenCL programs:
You can create a program object directly from the source code of an OpenCL application and compile it at runtime. Do this at application startup to save compute resources while the application is running.
To avoid compilation at runtime, you can build a program object with a previously built binary.

Note
Applications with pre-built program objects are not portable.

Creating a program object from a binary is a similar process to creating a program object from source code, except that you must supply the binary for each device that you want to execute the kernel on. Use the clCreateProgramWithBinary() function to do this. Use the clGetProgramInfo() function to obtain the binary after you have generated it.
5.6 Building a program executable

After you have created a program object, you must build a program executable from the contents of the program object. Use the clBuildProgram() function to build your executable. Compile all kernels in the program object:

err = clBuildProgram(program, 1, &device_id, "", NULL, NULL);
if (err != CL_SUCCESS)
{
    Cleanup();
    return 1;
}
5.7 Creating kernel and memory objects

This section describes creating kernel and memory objects. It contains the following sections:
Creating kernel objects.
Creating memory objects.

Creating kernel objects

Call the clCreateKernel() function to create a single kernel object, or call the clCreateKernelsInProgram() function to create kernel objects for all the kernels in the OpenCL application. For example:

// Create OpenCL kernel
kernel = clCreateKernel(program, "<kernel_name>", NULL);
if (kernel == NULL)
{
    Cleanup();
    return 1;
}

Creating memory objects

After you have created and registered your kernels, send the program data to the kernels:
1. Package the data in a memory object.
2. Associate the memory object with the kernel.

There are two types of memory objects:

Buffer objects
Simple blocks of memory.

Image objects
These are structures specifically for representing 2D or 3D images. These are opaque structures, that is, you cannot see the implementation details of these structures.

To create buffer objects, use the clCreateBuffer() function. To create image objects, use the clCreateImage2D() or clCreateImage3D() functions.
5.8 Executing the kernel

This section describes the stages in executing the kernel. It contains the following sections:
Determining the data dimensions.
Determining the optimal global work size.
Determining the local work-group size.
Enqueuing kernel execution on page 5-11.
Executing kernels on page 5-11.

Determining the data dimensions

If your data is an image x pixels wide by y pixels high, it is a two-dimensional data set. If you are dealing with spatial data that involves the x, y, and z position of nodes, it is a three-dimensional data set. The number of dimensions in the original data set does not have to be the same in OpenCL. You can, for example, process a three-dimensional data set as a one-dimensional data set in OpenCL.

Determining the optimal global work size

The global work size is the total number of work-items required for all dimensions combined. You can change the global work size by processing multiple data items in a single work-item. The new global work size is then the original global work size divided by the number of data items processed by each work-item. The global work size must be large if you want to ensure high performance. Typically the number is several thousand, but the ideal number depends on the number of shader cores in your device.

Determining the local work-group size

You can specify the size of the work-group that OpenCL uses when you enqueue a kernel to execute on a device. To do this, you must know the maximum work-group size permitted by the OpenCL device your work-items execute on. To find the maximum work-group size for a specific kernel, use the clGetKernelWorkGroupInfo() function and request the CL_KERNEL_WORK_GROUP_SIZE property. If your application is not required to share data among work-items, set the local_work_size parameter to NULL when enqueuing your kernel. This enables the OpenCL driver to determine the most efficient work-group size for your kernel.
To get the maximum possible work-group size in each dimension, call clGetDeviceInfo() with CL_DEVICE_MAX_WORK_ITEM_SIZES. This is for the simplest kernel, and the dimensions might be lower for more complex kernels. The product of the dimensions of your work-group might limit the size of the work-group. To get the total work-group size, call clGetKernelWorkGroupInfo() with CL_KERNEL_WORK_GROUP_SIZE. If the maximum work-group size for a kernel is lower than 128, performance is reduced. If this is the case, try simplifying the kernel. The work-group size for each dimension must divide evenly into the total data size for that dimension. That is, the x size of the work-group must divide evenly into the x size of the total data. If this requirement means padding the work-group with extra work-items, ensure the additional work-items return immediately and do no work.
Enqueuing kernel execution

When you have identified the dimensions necessary to represent your data, the necessary work-items for each dimension, and an appropriate work-group size, enqueue the kernel for execution using clEnqueueNDRangeKernel(). For example:

size_t globalWorkSize[1] = { ARRAY_SIZE };
size_t localWorkSize[1] = { 4 };

// Queue the kernel up for execution across the array
errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                globalWorkSize, localWorkSize, 0, NULL, NULL);
if (errNum != CL_SUCCESS)
{
    printf("Error queuing kernel for execution.\n");
    Cleanup();
    return 1;
}

Executing kernels

Queuing the kernel for execution does not mean that it executes immediately. The kernel execution is put into the command queue for later processing by the device. The call to clEnqueueNDRangeKernel() is not a blocking call and returns before the kernel has executed. It can sometimes return before the kernel has started executing.

It is possible to make a kernel wait for execution until previous events are finished. You can specify that certain kernels wait until other specific kernels are completed before executing. Kernels are executed in the order they are enqueued unless the property CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is set when the command queue is created. Kernels that are enqueued to an in-order queue automatically wait for kernels that were previously enqueued on the same queue. You are not required to write any code to synchronize them.
5.9 Reading the results

After your kernels have finished execution, you must make the results accessible to the host. To access the results from the kernel, use clEnqueueMapBuffer() to map the buffer into host memory. For example:

local_buffer = clEnqueueMapBuffer(queue, buffer, CL_FALSE, CL_MAP_READ, 0,
                                  sizeof(unsigned char) * buffer_size,
                                  num_deps, deps, NULL, &err);
ASSERT(CL_SUCCESS == err);

Note
This call does not guarantee that the buffer is available until you call clFinish(). If you change the third parameter of clEnqueueMapBuffer() from CL_FALSE to CL_TRUE, the call becomes a blocking call and the read is complete before clEnqueueMapBuffer() returns.
5.10 Cleaning up

When the application no longer requires the various objects associated with the OpenCL runtime and context, you must free these resources. Use the following functions to release your OpenCL objects. These functions decrement the reference count for the associated object:
clReleaseMemObject().
clReleaseKernel().
clReleaseProgram().
clReleaseCommandQueue().
clReleaseContext().

Ensure the reference counts for all OpenCL objects reach zero when your application no longer requires them. You can obtain the reference count by querying the object, for example, by calling clGetMemObjectInfo().
Chapter 6 Converting Existing Code to OpenCL

This chapter describes converting existing code to OpenCL. It contains the following sections:
Profile your application on page 6-2.
Analyzing code for parallelization on page 6-3.
Parallel processing techniques in OpenCL on page 6-6.
Using parallel processing with non-parallelizable code.
Dividing data for OpenCL.
6.1 Profile your application

Profile your application to find the most compute-intensive parts. These are the parts that might be worth porting to OpenCL. The proportion of an application that requires high performance is typically a relatively small part of the code. This is the part of the code that can probably make best use of OpenCL. Porting any more of the application to OpenCL is unlikely to provide a benefit. You can use profilers such as DS-5 to profile your application. You can download DS-5 from the Mali developer web site.
6.2 Analyzing code for parallelization

This section describes how to analyze compute-intensive code for parallelization. It contains the following sections:
About analyzing code for parallelization.
Look for data parallel operations.
Look for operations with few dependencies.
Analyze loops on page 6-4.

About analyzing code for parallelization

When you have identified the most compute-intensive parts of your application, analyze the code to see if you can run it in parallel. Parallelizing code can be the following:

Straightforward
Parallelizing the code requires small modifications. See Use the global ID instead of the loop counter on page 6-6.

Difficult
Parallelizing the code requires complex modifications. See Compute values in a loop with a formula instead of using counters on page 6-7.

Difficult and includes dependencies
Parallelizing the code requires complex modifications and the use of techniques to avoid dependencies. See the following sections:
Compute values per frame on page 6-7.
Perform computations with dependencies in multiple passes on page 6-8.
Pre-compute values to remove dependencies on page 6-8.

Appears to be impossible
If parallelizing the code appears to be impossible, this only means that a particular code implementation cannot be parallelized. The purpose of code is to perform a function. There might be different algorithms that perform the same function but work in different ways. Some of these might be parallelizable. Investigate alternatives to the algorithms and data structures the code uses. These might make parallelization possible. See Using parallel processing with non-parallelizable code.

Look for data parallel operations

Look for tasks that do large numbers of operations that:
Complete without sharing data.
Do not depend on the results from each other.
These types of operations are data parallel, so they are ideal for OpenCL.

Look for operations with few dependencies

If tasks have few dependencies, it might be possible to run them in parallel.
Dependencies between tasks prevent parallelization because they force tasks to be performed sequentially. If the code has dependencies, consider:
Is there a way to remove the dependencies?
Can you delay the dependencies to later in execution?

Analyze loops

Loops are good targets for parallelization because they repeat computations many times, often independently.

Loops that process a small number of elements
If the loop only processes a relatively small number of elements, it might not be appropriate for data parallel processing. It might be better to parallelize these sorts of loops with task parallelism on one or more application processors.

Nested loops
If the loop is part of a series of nested loops and the total number of iterations is large, this loop is probably appropriate for parallel processing.

Perfect loops
Look for loops that:
Process thousands of items.
Have no dependencies on previous iterations.
Access data independently in each iteration.
These types of loops are data parallel, so are ideal for OpenCL.

Simple loop parallelization
If the loop includes a variable that is incremented based on a value from the previous iteration, this is a dependency between iterations that prevents parallelization. See if you can work out a formula that enables you to compute the value of the variable based on the main loop counter. In OpenCL, work-items are processed in parallel, not in a sequential loop. However, work-item processing acts in a similar way to a loop. Every work-item has a unique global ID that identifies it, and you can use this value in place of a loop counter. See Use the global ID instead of the loop counter on page 6-6. It is also possible to have loops within work-items, but these are independent of other work-items.

Loops that require data from previous iterations
If your loop involves dependencies based on data processed by a previous iteration, this is a more complex problem.
Can the loop be restructured to remove the dependency? If not, it might not be possible to parallelize the loop. There are a number of techniques that help you deal with dependencies. See if you can use these techniques to parallelize the loop. See Parallel processing techniques in OpenCL on page 6-6.
Non-parallelizable loops
If the loop contains dependencies that you cannot remove, investigate alternative methods of performing the computation. These might be parallelizable. See Using parallel processing with non-parallelizable code.
6.3 Parallel processing techniques in OpenCL

This section describes parallel processing techniques you can use in OpenCL. It contains the following sections:
Use the global ID instead of the loop counter.
Compute values in a loop with a formula instead of using counters on page 6-7.
Compute values per frame on page 6-7.
Perform computations with dependencies in multiple passes on page 6-8.
Pre-compute values to remove dependencies on page 6-8.
Use software pipelining on page 6-9.
Use task parallelism.

Use the global ID instead of the loop counter

In OpenCL you use kernels to perform the equivalent of loop iterations. This means there is no loop counter to use in computations. The global ID of the work-item provides the equivalent of the loop counter. Use the global ID to perform any computations based on the loop counter.

Note
You can include loops in OpenCL kernels, but they can only iterate over the data for that work-item, not the entire NDRange.

The following example shows a simple loop in C that assigns the value of the loop counter to each array element.

Loop example in C. The following loop fills an array with numbers:

void SetElements(void)
{
    int loop_count;
    int my_array[4096];
    for (loop_count = 0; loop_count < 4096; loop_count++)
    {
        my_array[loop_count] = loop_count;
    }
    printf("Total %d\n", loop_count);
}

This loop is parallelizable because the loop elements are all independent. There is no main loop counter loop_count in the OpenCL kernel, so it is replaced by the global ID. The equivalent code in an OpenCL kernel:

__kernel void example(__global int * restrict my_array)
{
    int id;
    id = get_global_id(0);
    my_array[id] = id;
}
Compute values in a loop with a formula instead of using counters

If you are using work-items in place of loop iterations, compute variables based on the value of the global ID rather than using a loop counter. The global ID of the work-item provides the equivalent of the loop counter.

Compute values per frame

If your application requires continuous updates of data elements and there are dependencies between them, try breaking the computations into discrete units and performing one iteration per image frame displayed. For example, the image shown in Figure 6-1 is of an application that runs a continuous physics simulation of a flag.

Figure 6-1 Flag simulation

The flag is made up of a grid of nodes that are connected to the neighboring nodes. These are shown in Figure 6-2.

Figure 6-2 Flag simulation grid
More informationARM DS-5. Eclipse for DS-5 User Guide. Version 5. Copyright ARM Limited or its affiliates. All rights reserved.
ARM DS-5 Version 5 Eclipse for DS-5 User Guide ARM DS-5 ARM DS-5 Eclipse for DS-5 User Guide Release Information Document History Issue Date Confidentiality Change A June 2010 First release B September
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 6 April 2017 HW#8 will be posted Announcements HW#7 Power outage Pi Cluster Runaway jobs (tried
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationTrace Debug Tools Version 1.2 Installation Guide
Trace Debug Tools Version 1.2 Installation Guide Copyright 2000-2002 ARM Limited. All rights reserved. Proprietary Notice Words and logos marked with or are registered trademarks or trademarks owned by
More informationETM -A5. CoreSight. Technical Reference Manual. Revision: r0p2. Copyright 2009, 2010 ARM. All rights reserved. ARM DDI 0435C (ID072915)