OpenCL in Handheld Devices. Kari Pulli, Research Fellow, Nokia Research Center Palo Alto. Material from Jyrki Leskelä, Jarmo Nikula, Mika Salmela



OpenCL 1.0 Embedded Profile
- Enables OpenCL on mobile and embedded silicon
  - Relaxes some data type and precision requirements
  - Avoids the need for a separate "ES" specification
- Khronos APIs provide computing support for imaging & graphics
  - Enabling advanced applications in, e.g., Augmented Reality
- OpenCL will enable parallel computing in new markets
  - Mobile phones, cars, avionics
  - Example: a camera phone with GPS processes images to recognize buildings and landmarks, and provides relevant data from the internet

Embedded Profile: main differences
- Adapting code for the embedded profile
  - Added macro __EMBEDDED_PROFILE__
  - The CL_PLATFORM_PROFILE capability returns the string EMBEDDED_PROFILE if only the embedded profile is supported
- Online compiler is optional
- No 64-bit integers
- Reduced requirements for constant buffers, object allocation, constant argument count, and local memory
- Image & floating-point support matches OpenGL ES 2.0 texturing
- The extensions of the full profile can be applied to the embedded profile

Floating-point numbers
- INF and NaN values for floats are not mandated
- Most accuracy requirements are the same, but some single-precision floating-point operations are relaxed from the full profile:
  - x / y: <= 3 ulp [units in the last place]
  - exp: <= 4 ulp
  - log: <= 4 ulp
- Float add, sub, mul, and mad can be rounded toward zero, resulting in an error <= 1 ulp (helps to keep HW area down)
- Denormalized half floats can be flushed to zero
- The precision of conversions from normalized integers is <= 2 ulp for the embedded profile (instead of <= 1.5 ulp)

Image support in the Embedded Profile
- Image support is an optional feature within an OpenCL device
- If images are supported, the minimum requirements for the supported image capabilities are lowered to the level of OpenGL ES 2.0 textures:
  - Kernel must be able to read >= 8 simultaneous image objects
  - Kernel must be able to write >= 1 simultaneous image object
  - Width and height of a 2D image >= 2048
  - Number of samplers >= 8
- Image formats match OpenGL ES 2.0 texture formats
- Support for 3D images is optional
- Float 2D/3D images support only point sampling (nearest neighbor)

Benefits for mobile devices
- Easier programming in a heterogeneous processor environment
  - Instead of learning different programming methods for CPU, GPU, and DSP
  - The OpenCL framework also handles event queuing
- Code developed once will run on future hardware
  - If the application conforms to the specification, it will run
  - Can be used with paravirtualized or fully virtualized operating systems
  - No manufacturer or HW (even CPU) dependencies
- Area- and energy-constrained embedded devices
  - In embedded devices it is less likely that other processors (e.g., the GPU) are 100x faster than the main CPU (area consumption, energy constraints); they are all closer to the "sweet spot"
  - OpenCL can allocate workload over multiple processors; for example, the OMAP 4430 has 2x ARM Cortex-A9, an ISP, a GPU, and a DSP

OpenCL for image processing
[Figure: a three-step image-processing algorithm in a single OpenCL kernel — Input Image -> Step 1: Adjust Geometry -> Step 2: Gaussian Blur -> Step 3: Adjust Colour -> Output Image]
- A single kernel for all three steps gives:
  - Smaller execution overheads
  - Lower memory bandwidth requirements
  - Better performance comparisons, as computation time dominates
- Three tests on TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz SGX530 GPU):
  - CPU alone
  - GPU alone
  - CPU+GPU in parallel
- Both time and power consumption were measured

Kernel program: Function definition

    kernel void imageprocessingtest(
        float bend,
        float zoom,
        float4 adjust_r,     // Conversion matrix column 1
        float4 adjust_g,     // Conversion matrix column 2
        float4 adjust_b,     // Conversion matrix column 3
        float4 gauss_coeff,
        read_only image2d_t srcimage,
        write_only image2d_t destimage,
        int width,
        int height)
    {
        // Local variables
        // Step 1
        // Step 2
        // Step 3
    }

Kernel program: Local variables

    // Local variables
    const sampler_t sampler = CLK_NORMALIZED_COORDS_TRUE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
    float2 size, xy, mycoord;
    int2 destpos;

    destpos.x = get_global_id( 0 );  // x index
    destpos.y = get_global_id( 1 );  // y index

Kernel program: Adjust geometry

    // Step 1
    // Coordinate adjustment
    mycoord.x = (float)(destpos.x) / (float)(width);
    mycoord.y = (float)(destpos.y) / (float)(height);
    float2 center = (float2)(0.5f, 0.5f);
    xy = (mycoord - center) / center;
    xy = (xy * pow(dot(xy, xy), bend) * zoom + 1.0f) * 0.5f;

Kernel program: Read & blur pixels

    // Step 2
    // Read a 3x3 gaussian blurred image
    float2 south, east;
    south.x = 0.0f;
    south.y = zoom / (float)(height);
    east.x = zoom / (float)(width);
    east.y = 0.0f;

    float4 srccolor = read_imagef(srcimage, sampler, xy) * gauss_coeff.x;
    srccolor += gauss_coeff.y * (
        read_imagef(srcimage, sampler, xy + south) +
        read_imagef(srcimage, sampler, xy - south) +
        read_imagef(srcimage, sampler, xy + east) +
        read_imagef(srcimage, sampler, xy - east) );
    srccolor += gauss_coeff.z * (
        read_imagef(srcimage, sampler, xy + south + east) +
        read_imagef(srcimage, sampler, xy - south - east) +
        read_imagef(srcimage, sampler, xy + south - east) +
        read_imagef(srcimage, sampler, xy - south + east) );

Kernel program: Adjust colors & write

    // Step 3
    // Adjust contrast and saturation
    float4 adjcolor;
    adjcolor.x = dot(adjust_r, srccolor);
    adjcolor.y = dot(adjust_g, srccolor);
    adjcolor.z = dot(adjust_b, srccolor);
    adjcolor.w = srccolor.w;

    // Clamp and write image
    write_imagef(destimage, destpos, clamp(adjcolor, 0.0f, 1.0f));

CPU versus GPU considerations
- The GPU kernel uses the image2d_t type to access images
  - Maps directly to the GPU's HW-based texture access
  - Our OpenCL GPU back-end doesn't support memory buffers
- The CPU works well with memory buffers and native memory pointers
  - Emulating image2d_t behavior in SW is very expensive!
  - Using buffers dropped the execution time by 77% compared to emulated image2d_t
  - We also replaced the very slow standard pow() function with a fast approximation
- The results are measured using these two separate kernels:
  - GPU: image2d_t
  - CPU: memory buffers
  - A fairer comparison

Performance: Processors alone
- The GPU is 3.5x faster than the CPU
- The GPU's 2.4 s average per frame breaks down as:
  - 0.3 s of input data set-up/transfer
  - 0.4 s of result data read-back
  - only 1.7 s is spent on actual execution
- Why does the GPU win?
  - The GPU's single-program-multiple-data architecture fits OpenCL
  - The GPU has fast vectored floats, compared to the slow VFPLite of the ARM Cortex-A8
  - ARM's NEON is not fully utilized
  - The algorithm could also be implemented with integers; the CPU could even outperform the GPU

100 frames, 3 MPix, 32-bit RGBA      Alone              Combined
                                     CPU      GPU       CPU      GPU
Frames/device                        100      100       19       81
Execute time (s)                     864.9    240.4     237.4    247.5
Average s/frame                      8.6      2.4       12.5     3.1    (2.5 overall)
Min s/frame                          6.98     2.33      10.87    2.39
Max s/frame                          10.94    2.49      12.62    3.16

Performance: CPU+GPU
- In the combined run, frames were scheduled to either processor based on which one becomes free next
- Running the CPU and GPU in parallel gives worse performance than the GPU alone: average frame time 2.5 seconds
- [Figure: kernel execution of the CPU and GPU shown in red; black bars are idle time caused by scheduling]
- Scheduling made the CPU idle more than necessary
- But why does the GPU now need more time for 81 frames than for 100 frames when running alone?

CPU+GPU combined run analyzed
- Transfer times doubled; total data transfer overhead is 1.4 s
  - input data set-up/transfer: from 0.3 s to 0.6 s
  - result read-back: from 0.4 s to 0.8 s
- GPU execution time remained the same as when running alone: 3.1 s - 1.4 s = 1.7 s
- Parallel execution doesn't make the algorithm bandwidth-limited, so why does the parallel version perform worse?
  - The data transfer overhead is doubled, and it is purely CPU-bound
  - That makes the GPU idle more between frames
  - Frame computation by the CPU cannot compensate for that loss

Scheduling is not easy
- Better scheduling would run data transfers at full speed and use the CPU for frame computation only when it is free from serving GPU data transfers
- 84 frames on the GPU and 16 on the CPU => the optimum, 202 s:
  - For the GPU, the consumed time becomes 84 x 2.4 s = 202 s, including 84 x 0.7 s = 59 s of CPU time due to data transfers
  - For the CPU, data transfers leave 202 s - 59 s = 143 s for frame processing, enough for 143 s / 8.6 s = 16 complete frames
- Even with the current scheduling, the CPU+GPU combination can give a speed-up for an algorithm with more computation per frame
- Optimal scheduling depends on the relative speeds of the processors and their dependencies (such as data transfers)
  - E.g., if the CPU happens to be faster than the GPU, it may be best to avoid wasting CPU time on data transfers and ignore the GPU, running on the CPU only

Energy measurements
- 3.8 V input voltage
- The 516 mA idle current was subtracted from the results
- The GPU consumes 0.56 J per frame, only 14% of the CPU's 3.93 J
  - For pure execution without data transfers the GPU used only 0.26 J (7%!)
  - Over half of the energy in the GPU case is due to data transfers run by the CPU!
- The combined run is worse than GPU alone

Efficacy            CPU alone   GPU alone   CPU+GPU
Frames (n)          4           20          20
Time (s)            43.6        48.6        52.1
Current (mA)        95          61          132
Power (mW)          361         232         502
Energy/frame (J)    3.93        0.56        1.31

Analysis of energy measurements
- The GPU architecture is better for OpenCL-like data-parallel computing
  - All execution units are busy; ALU utilization is maximized
- The CPU suffers from poor ALU utilization
  - We didn't have long pipelined vector computations for ARM NEON
  - The compiler and our OpenCL implementation do not utilize NEON instructions very well
- The GPU's lower operating frequency (110 MHz vs. 550 MHz) saves energy
  - Power dissipation increases non-linearly with frequency
  - Generally, high parallelism at low frequency is more energy-efficient than low parallelism at high frequency

Summary
- OpenCL 1.0 Embedded Profile is a subset of the full profile
  - Not an "ES" specification of its own
- Easier programming of heterogeneous multi-processor systems
  - Fast multiprocessor code without portability hassle
- Speedups and energy efficiency are possible via parallelism
  - But scheduling can be tricky