Next Generation Visual Computing
(Making GPU Computing a Reality with Mali)
Taipei, 18 June 2013
Roberto Mijat, ARM

Addressing Computational Challenges
Trends
- Growing display sizes and resolutions
- Increasing computational power and novel applications
- Persistent user expectation of an improved experience
Limitations
- Restricted energy and thermal budgets
- In mobile, processing power is greatly outgrowing battery capacity
- Traditional scaling solutions are not sustainable
Necessities
- Increase the computational efficiency of processing platforms
- Make use of heterogeneous and parallel computing
- Leverage new technologies such as GPU Compute

Complementary Compute Architectures
Note: characteristics of generic CPUs and GPUs

Heterogeneous Computing
CPU
- Operating system
- Most application processing
GPU
- Programmable through C-like languages and APIs
- Cost effective, efficient, great floating-point performance
[Diagram labels: Control, ALU, Caches, RAM]
GPU used as a computational accelerator or companion processor
- 2D/3D graphics
- Advanced image processing
- Accelerate/complement ISP functionality
- Offload video codec blocks
- Accelerate physics computation

Benefits of GPU Computing
Performance
- Faster computation
- Offload and acceleration of non-graphical applications
Energy Efficiency
- Free up CPU resources by offloading to the GPU
- Better load balance across system resources
- Increased system efficiency by using the best processor for the job
Cost Reduction
- Reduced cost through hardware consolidation and software flexibility
- Simpler interface to parallel programming through modern APIs
Improved User Experience
- Remove computational barriers
- Enable new use cases and applications

Adoption of Mobile GPU Compute
[Timeline axis: 2012, 2013, 2014, 2015+]
- OpenCL Full Profile Khronos-conformant GPUs in mobile SoCs
- GPU Compute capable devices start shipping
- OEMs and SiPs evaluating leading GPU Compute solutions
- Gradual roll-out of GPU Compute APIs in mobile/embedded platforms
- Android RenderScript computation first enabled on GPU

Adoption of Mobile GPU Compute
[Timeline axis: 2012, 2013, 2014, 2015+]
- First public demonstrations of GPU Compute
- Mobile benchmarks
- ISVs and OEMs start porting/optimizing libraries and key use-case functionality using GPU Compute
- Computational photography and advanced imaging GPU acceleration
- Codec vendors develop GPU Compute enabled HEVC decoders
- Exploration by mainstream developers

Adoption of Mobile GPU Compute
[Timeline axis: 2012, 2013, 2014, 2015+]
- Mainstream support for GPU Computing in mobile and embedded
- GPU Compute widely available and utilized by developers/libraries
- Introduction of GPUs implementing HSA features, full system coherency
- Hardware consolidation and software cost reduction through migration of selected ISP/DSP functionality to the GPU
- New use cases, innovation

OPENCL

OpenCL Overview
OpenCL is:
- A framework to enable general purpose parallel computing
- A computing language portable across heterogeneous processing platforms
- An API to define and control the platforms
- A royalty-free open standard, interoperable with existing APIs
OpenCL enables easier, better programming of heterogeneous parallel compute systems, and unleashes the general purpose computational power of GPUs needed by emerging workloads.
OpenCL and the OpenCL logo are trademarks of Apple Inc.

OpenCL Programming Model
- Application: optimize performance-critical code
- Program
- Kernel: OpenCL kernel or native kernel
- Runtime compiler
  - Can use static compilation
  - Binaries are cached
- Kernel object: can be built to target any supported device
- Execute command
- Index space (NDRange): the kernel is executed over each element of the N-dimensional index space
- Work-item: instance of a kernel executing on a point in the index space
- Work-group: collection of work-items
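
The index-space terms above can be illustrated in plain C. The following is a minimal, serial, CPU-only sketch and not OpenCL host code: `work_item_t`, `vec_add_kernel` and `enqueue_1d` are hypothetical names invented for this illustration, and a real OpenCL runtime would execute the work-items in parallel on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the IDs a runtime hands to each work-item. */
typedef struct {
    size_t global_id; /* position in the whole 1-D index space */
    size_t group_id;  /* work-group this work-item belongs to  */
    size_t local_id;  /* position inside its work-group        */
} work_item_t;

/* "Kernel": the body that runs once per point in the index space. */
static void vec_add_kernel(work_item_t wi, const float *a, const float *b,
                           float *out)
{
    out[wi.global_id] = a[wi.global_id] + b[wi.global_id];
}

/* Serial emulation of a 1-D NDRange launch: global_size work-items,
 * split into work-groups of local_size work-items each. */
static void enqueue_1d(size_t global_size, size_t local_size,
                       const float *a, const float *b, float *out)
{
    /* Mirrors the OpenCL 1.x rule that the global size must be a
     * multiple of the work-group size. */
    assert(local_size > 0 && global_size % local_size == 0);

    for (size_t g = 0; g < global_size / local_size; ++g) {
        for (size_t l = 0; l < local_size; ++l) {
            work_item_t wi = { g * local_size + l, g, l };
            vec_add_kernel(wi, a, b, out);
        }
    }
}
```

Calling `enqueue_1d(8, 4, a, b, out)` on 8-element arrays runs the kernel body eight times, as two work-groups of four work-items.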

The ARM OpenCL Implementation
- Implements the latest version of the standard
- Implements Full Profile, supports 64-bit
- Optimized for interoperability with the existing Mali software stack
- Optimized for interoperability between CPU and GPU
- Architected for Cache Coherent Interconnect support
- Extensible design

With Full Profile you know what you get
- Full Profile defines the baseline set of features for OpenCL
- Embedded Profile defines a subset of the specification
  - Designed to enable OpenCL on less capable devices
  - Makes a large set of features optional, restricting developers
  - Reduces the precision of floating-point maths

Key Feature                                      Embedded   Full
FP32 precision                                   Relaxed    IEEE-754
Built-in atomic operations                       Optional   Supported
64-bit integer                                   Optional   Supported
Online compiler                                  Optional   Supported
3D image writes                                  Optional   Supported
Linear interpolation for floating-point images   Optional   Supported
Size of buffers and memory                       Limited    Supported
Image data type requirements                     Reduced    Supported

RENDERSCRIPT

Introduction to RenderScript
- Compute framework and API for Android
  - Officially introduced in Honeycomb
  - Cross-platform control-slave architecture, with runtime compilation
  - The graphics engine component has been deprecated since Jelly Bean
- Complements existing APIs by adding:
  - A compute API for parallel processing, similar to OpenCL
  - A scripting language based on C99, supporting vector data types
- Designed for portability, performance, usability
  - On-device JIT compilation and dynamic thread launch
  - Native code optimization to maximize the performance of critical algorithms
- Mali-T604 is the first GPU to support RenderScript

How RenderScript works
[Diagram labels: Java App, Reflected Layer, RenderScript Script, llvm-rs-cc, Portable Bitcode, Online compilation, Dalvik JIT, libbcc, Executable, librs, Machine Code, ARM Compute System (Cortex CPU + Mali GPU + AMBA 4)]

DESIGNED FOR GPU COMPUTE

Mali-T600: Designed for GPU Compute
Comprehensive support for general purpose data types
- 8/16/32/64-bit signed/unsigned integer
- FP16, FP32, FP64
- 2, 3, 4, 8, 16 wide vectors
- 2D/3D images
Floating-point precision and performance
- Full IEEE 754-2008 compliance
- 100s of GFLOPS performance for non-graphical workloads
- Sustainable and proven performance for real-life workloads

Mali-T600: Designed for GPU Compute
Hardware acceleration
- Most common mathematical functions implemented in hardware
- >70% coverage within the newest industry APIs
- Most operations compute in one cycle
Optimal memory throughput and latency
- Optimized for stream and generic load/store operations
- Tight integration with the system using the latest AMBA interfaces
- Leverages new Cache Coherent Interconnect technologies
Task management implemented in hardware
- Optimal automatic distribution of compute workloads
- Optimal dynamic power management
- Efficient use of processing resources

GPU Compute on Mali: here today!
Passed Khronos Conformance
- The only OpenCL 1.1 Full Profile implementation on Linux and Android outside the console and desktop space
Proven in Silicon
- Samsung Exynos 5 Dual implements Full Profile OpenCL and RenderScript
- DDKs available now
Mali-T600 shipping in real products
- Google Chromebook
- Google Nexus 10
- InSignal Arndale Community Board
API exposed for developers
- RenderScript on Android for Nexus 10

USE CASES
Examples of the benefits of GPU Compute from the real world

Example use cases for GPU Computing
Mobile
- Computational photography
- Physics in games
- Moving and still image real-time stabilization
- Information extraction: object detection, classification and tracking
- Imaging: correction, improvement, consolidation
- Content and context understanding
- HDR
- Augmented reality
DTV/STB
- 2D to 3D conversion
- Super resolution
- Pre and post processing
- Camera-based UI
- Transcoding
- Information extraction and superimposition
Automotive
- Lane detection
- Smart headlight
- Road sign recognition
- Night vision
- Object classification
- Pedestrian, vehicle and collision detection
- Vehicle detection
- Dynamic cruise control
100s of GFLOPS of efficient processing power: improve existing use cases, enable next generation use cases

Advanced Image Processing
- RenderScript is the official heterogeneous compute API for Android
- Since Android 4.2 (Jelly Bean) it has been enabled to target the GPU
- Complex image filters can be greatly accelerated by GPU Compute

Filter            Speed-up [1]
MotionBlur        3.5x
Cloud             4.2x
Labyrinth         3.8x
TitleReflection   7.3x
WhirlPinch        3.6x
Wave              7.0x
Bicubic           15.4x

Image size: 2560x1920
[1] Acceleration compares RenderScript compiled on device (LLVM) on dual-core Cortex-A15 and Mali-T604, on a stock Google Nexus 10

Video Processing
- APK with a proprietary transcoding/processing pipeline
- Image filters implemented using RenderScript
- Optimized for ARM + Mali-T600 GPU Compute

Filter                       FPS (GPU+CPU / CPU only)   Speed-up
Deshake (720p)               28 / 8                     3.5x
Upscaling (720p to 1080p)    20 / 3                     6.7x
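
The speed-up column in the video processing figures is the ratio of the two frame rates. A minimal sketch of that arithmetic in C follows; `fps_speedup` and `frame_time_ms` are hypothetical helper names used only for this illustration, and the per-frame times are derived from the quoted frame rates rather than measured.

```c
#include <assert.h>

/* Speed-up of an accelerated run over a baseline run, both given in
 * frames per second (FPS). Hypothetical helper for illustration. */
static double fps_speedup(double accelerated_fps, double baseline_fps)
{
    assert(accelerated_fps > 0.0 && baseline_fps > 0.0);
    return accelerated_fps / baseline_fps;
}

/* Average time spent on one frame, in milliseconds, at a given FPS. */
static double frame_time_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

With the Deshake figures, `fps_speedup(28.0, 8.0)` gives 3.5, and `frame_time_ms(8.0)` gives 125 ms per frame for the CPU-only case. With the Upscaling figures, `fps_speedup(20.0, 3.0)` gives about 6.67, which the table rounds to 6.7x.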

GPU Compute accelerated superscaling
- Accelerated using RenderScript
- On Google Nexus 10 (Mali-T604)

Next Generation Multimedia Codecs
High Efficiency Video Coding (HEVC)
- Latest video compression standard, ratified by ITU in Jan 2013
- Improved video quality and double the data compression of H.264
- Can support up to 8K UHD
ARM is collaborating with multiple codec vendors
- Ensuring the widest availability of HEVC across multiple ARM platforms
- Enabling HEVC early, in software, through NEON and GPU Compute
- Flexibility of software solutions is critical as HEVC rolls out

Why GPU Compute for HEVC
- High resolution HEVC decoding maximises CPU load
- GPUs are traditionally idle during video playback
- GPU architecture suits acceleration of parallel codec blocks
- Offloading computation to the GPU frees up the CPU to perform other (system) tasks
- Combining CPU (NEON) and GPU Compute enables the most efficient HEVC decode
Mali GPUs are well suited for video acceleration, with significant power/performance benefits
Ittiam Systems

Physics (Cloth Simulation)

ISP Pipeline Offload to GPU (OpenCL)
- Entire ISP pipeline offloaded to the GPU using OpenCL
- More flexibility
  - Sensor and camera module vendors can invest in optimized, portable software libraries instead of a hardware ISP
  - SoC implementers can reduce BoM by offloading ISP blocks to the GPU
- Mali-T604 demo was previewed at MWC13
[Diagram labels: OpenCL, Raw data from HDR sensor, Noise reduction, HDR reconstruction, Tone mapping, Colour conversion, Rendering, De-noising, Gamma correction, OpenGL ES]

Gesture User Interfaces
eyeSight™'s gesture recognition technology, using GPU Compute on ARM's Mali-T600, offers unique capabilities:
- Reduction of overall power consumption
- Reduction of load on the CPU
- Robust recognition in challenging lighting conditions
- Enhanced user experience
- Higher FPS for more gesture capabilities and features

Computer Vision Based Applications
- Computer Vision entails the acquisition, processing, analysis and understanding of sensor data (images), in order to derive information that enables decisions to be made
- In this example (face detection study on Mali-T604 based silicon):
  - Consistent 6x speed-up
  - ~5x more energy efficiency
[Chart: energy used per unit of work (lower is better)]
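
The two face detection figures can be related through energy = average power x time. The sketch below is an interpretation, not a result from the study: it assumes "~5x more energy efficiency" means one fifth of the energy per unit of work, and that both figures refer to the same workload. `relative_power` is a hypothetical helper name.

```c
#include <assert.h>

/* Average power of the accelerated run relative to the baseline run,
 * derived from energy = average power x time.
 * Assumption: an energy efficiency gain of N means 1/N of the energy
 * for the same unit of work. */
static double relative_power(double speedup, double energy_efficiency_gain)
{
    assert(speedup > 0.0 && energy_efficiency_gain > 0.0);
    double relative_time = 1.0 / speedup;                   /* 1/6 of the time   */
    double relative_energy = 1.0 / energy_efficiency_gain;  /* 1/5 of the energy */
    return relative_energy / relative_time;
}
```

Under those assumptions, `relative_power(6.0, 5.0)` is roughly 1.2: the accelerated run would draw about 20% more average power, but for one sixth of the time, which is where the energy saving would come from.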

Conclusions
Improve energy efficiency through heterogeneous computing
- Use the best processor for the task
- Balance workload across system resources
- Offload heavy parallel computation to the GPU
Bring the benefits of GPU Compute to key use cases
- Computational photography and advanced imaging
- Next generation of multimedia codecs
- Computer vision applications
The Mali Ecosystem is making GPU Compute a reality