Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs. Lihua Zhang, Ph.D. MulticoreWare Inc.

Similar documents
Next Generation Visual Computing

Unleashing the benefits of GPU Computing with ARM Mali TM Practical applications and use-cases. Steve Steele, ARM

Our Technology Expertise for Software Engineering Services. AceThought Services Your Partner in Innovation

Enabling a Richer Multimedia Experience with GPU Compute. Roberto Mijat Visual Computing Marketing Manager

The Benefits of GPU Compute on ARM Mali GPUs

The OpenVX Computer Vision and Neural Network Inference

ECE 8823: GPU Architectures. Objectives

Mapping C++ AMP to OpenCL / HSA Wen-Heng Jack Chung

High Quality Real Time Image Processing Framework on Mobile Platforms using Tegra K1. Eyal Hirsch

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

LLVM An Introduction. Linux Collaboration Summit, April 7, 2011 David Kipping, Qualcomm Incorporated

Next Generation OpenGL Neil Trevett Khronos President NVIDIA VP Mobile Copyright Khronos Group Page 1

Take GPU Processing Power Beyond Graphics with Mali GPU Computing

A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous computing

3D Graphics in Future Mobile Devices. Steve Steele, ARM

Fusion Enabled Image Processing

Altera SDK for OpenCL

SIGGRAPH Briefing August 2014

Navigating the Vision API Jungle: Which API Should You Use and Why? Embedded Vision Summit, May 2015

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Higher Level Programming Abstractions for FPGAs using OpenCL

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Multimedia in Mobile Phones. Architectures and Trends Lund

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

More performance options

Khronos Connects Software to Silicon

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level

Modern Processor Architectures. L25: Modern Compiler Design

Optimization of Tele-Immersion Codes

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

Handheld Devices. Kari Pulli. Research Fellow, Nokia Research Center Palo Alto. Material from Jyrki Leskelä, Jarmo Nikula, Mika Salmela

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

Standards for Vision Processing and Neural Networks

OpenCL: History & Future. November 20, 2017

Windowing System on a 3D Pipeline. February 2005

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

Renderscript. Lecture May Android Native Development Kit. NDK Renderscript, Lecture 10 1/41

AR Standards Update Austin, March 2012

Module 12 Floating-Point Considerations

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Deep Learning Accelerators

HSA FOR APPLICATION PROGRAMMING

Getting Started with MATLAB Francesca Perino

Many-core Computing. Can compilers and tools do the heavy lifting? Wen-mei Hwu

Spring 2009 Prof. Hyesoon Kim

Module 12 Floating-Point Considerations

Spring 2011 Prof. Hyesoon Kim

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Threading Hardware in G80

Multi2sim Kepler: A Detailed Architectural GPU Simulator

WebGL Meetup GDC Copyright Khronos Group, Page 1

ARM Multimedia IP: working together to drive down system power and bandwidth

HKG OpenCL Support by NNVM & TVM. Jammy Zhou - Linaro

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

Graphics Processor Acceleration and YOU

An introduction to Halide. Jonathan Ragan-Kelley (Stanford) Andrew Adams (Google) Dillon Sharlet (Google)

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Neural Network Exchange Format

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CUDA Experiences: Over-Optimization and Future HPC

Antonio R. Miele Marco D. Santambrogio

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

POWERVR MBX & SGX OpenVG Support and Resources

Profiling and Debugging Games on Mobile Platforms

Using Graphics Chips for General Purpose Computation

Mention driver developers in the room. Because of time this will be fairly high level, feel free to come talk to us afterwards

Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

Introduction to CUDA (1 of n*)

April 4-7, 2016 Silicon Valley VISIONWORKS A CUDA ACCELERATED COMPUTER VISION LIBRARY S6783. Elif Albuz, April 4, 2016

viewdle! - machine vision experts

Next Generation Verification Process for Automotive and Mobile Designs with MIPI CSI-2 SM Interface

Introduction to OpenGL ES 3.0

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Introduction to GPU hardware and to CUDA

CUDA Programming Model

Serial and Parallel Sobel Filtering for multimedia applications

The Changing Face of Edge Compute

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

LegUp: Accelerating Memcached on Cloud FPGAs

designing a GPU Computing Solution

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Accelerating Cloud Graphics

Open API Standards for Mobile Graphics, Compute and Vision Processing GTC, March 2014

To hear the audio, please be sure to dial in: ID#

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1

Lecture 1: Gentle Introduction to GPUs

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors

Mobile AR Hardware Futures

Transcription:

Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs Lihua Zhang, Ph.D. MulticoreWare Inc. lihua@multicorewareinc.com

Overview More & more mobile apps are beginning to require higher compute performance, in order to deliver new cool features to end-users. RenderScript allows the programmer to develop high performance kernels that can leverage the power of GPU, which allows compute intensive parts to be accelerated, while freeing the CPU to manage user interactions, etc. This talk will describe the technical work involved in implementing high performance image and video processing engines based on RenderScript, leveraging both the ARM CPU and Mali GPU. 2

Agenda Introduction to Mali-T604 GPU Renderscript Brief Renderscript on Mali GPU Use Renderscript to Accelerate Image & Video Processing Image Processing Video Processing Highlights of Optimizations Summary About MulticoreWare Inc. Copyright 2013, Confidential, MulticoreWare Inc

Introduction to Mali-T604 GPU First GPU based on the Midgard architecture and offers scalability from one to four cores. Support atomic & true IEEE double-precision floating-point math in hardware for Full Profile OpenCL & Renderscript compute support, world s 1 st for mobile devices Production-quality software support, via a single driver stack for all multicore configurations Multicore scheduling and performance scaling is fully handled within the graphics system, with no special considerations required from the application developer. 4

Renderscript Brief High performance computation API at the native level that developer writes in C (C99 standard). Give apps the ability to run operations with automatic parallelization across all available processor cores. Supports different types of processors such as the CPU, GPU or DSP. Single code base to support different arch or different core# Do not need to recompile Useful for apps that do image processing, mathematical modeling, or any operations that require lots of mathematical computation. 5

Renderscript on Mali GPU Two (same or different) Renderscript kernels can run on CPU & GPU in parallel without influencing each other. Key for heterogeneous compute Currently, Renderscript kernel can run on Mali GPU automatically if your kernel, doesn t use global variables has no recursive function calling does not call rsdebug function 6

Use Renderscript to Accelerate Image & Video Processing Why used Renderscript in our image/video APKs? Video/image s sizes are becoming bigger and bigger Advanced image/video processing algorithms need more compute power (such as deshake, deblur) Renderscript supports vector operations, suitable for RGBA compute Two APKs developed & demoed in CES2013 & MWC2013, Advanced Photo Editing APK: process images with filters such as Motionblur, Sobel, pencil, gray, blur, etc Video Transcode APK for video de-shake and upscaling By using Renderscript we are seeing 20X maximum and 4X average speedup on Mali-T604 GPU than on CPU. 7

Edit Mode Image Processing allows user to take pictures and edit pictures 8

Image editing Image Processing Up to 21 filers, and support both CPU and GPU Can save or share the result picture onto Facebook and Twitter. 9

Image Processing 21 filters implemented for Advanced Photo Editing APKs bw: motionblur: negative: oldphoto: pixelate: sobel: relievo: sepia: emboss: whirlpinch: black & white motionblur effect color negative oldphoto effect pixelate effect edge detect relievo effect sepia effect emboss effect Distortion effect 10

Image Processing histogram: gaussian: pencil: histequal: cloud: labyrinth: tile: wave: light: Bicubic: gray: three colors histogram Gaussian blur effect Sketch effect histogram equal cloudy effect labyrinth effect tile effect wave effect light effect bicubic Interpolation scale gray effect 11

Image Processing Live Camera support Select filters to apply to live video from device camera & check effects right away Choose between CPU and GPU 12

Batch mode Image Processing To compare the performance between CPU and GPU for image processing Allows user to Select single or multiple images Select single or multiple filters Select devices: CPU, GPU or both 13

Image Processing Result: (for 2560x1920 image) 14

Video Processing MCW Video Transcode/Processing Pipeline A well designed parallel pipeline is key to design performance critical video processing apps, e.g. DirectShow. Android doesn t supply APIs to help construct video pipeline We used Khronos OpenMAX framework to implement video pipeline in native code with C++. OpenMAX provides cross-platform open source API for multimedia applications Its basic element is called component, and any component runs on its own thread in parallel. By implementing some filters on CPU and others on GPU, we can use CPU & GPU together to achieve maximum throughput. 15

Video Processing Video Transcode Pipeline Diagram Some details of our pipeline HW decoder & HW encoder Color space conversion: YCbCr to RGBA Complicated filters can be split to multiple phases, and every phase as an independent component running in parallel. 16

Deshake Video Transcode APK 17

Upscaling Video Transcode APK 18

Video Transcode APK Filters A few filters implemented with Renderscript, and can execute on either CPU or GPU. As deshake needs lot of computation, we uses both CPU & GPU. Filter FPS (GPU/CPU) X-factor Deshake (720P) 28 / 8 3.5 Upscaling (720P to 1080P) 20 / 3 6.7 19

Video Transcode APK Using GPU can gain more than just performance, but also help reduce energy consumption! 20

Highlights of Optimizations For Image processing Optimized most of the algorithms in order to let them more suitable to run on GPU Use vector to process one pixel s RGBA channels together or several pixels together. Use shift and logical operations to replace divide and modulo calculation Optimized Bitmap loading time Load bitmaps in a separate background thread Live camera mode Implemented YUV to RGBA with CPU using NEON as a new pipeline component to execute in a separate thread. 21

Highlights of Optimizations For video processing Optimized the deshake algorithm itself Adopt a pipeline & split deshake algorithm to two phases Compute frame s transform parameters: on CPU with NEON Transform frame according to the transformation: on GPU with RS Some general opt hints Use intrinsic RS functions if possible (clamp, mad, dot ) Minimize branch codes (if, while, for ) Leave same calculation for all the RS threads outside the kernels, and feed to kernel as UserData parameter. This has helped reduce GPU computation a lot. Use integer instead of float to do the computation if precision is ok. 22

Summary Mobile CPU & GPU are becoming more powerful, which has enabled the possibility of implementing real-time compute intensive practical applications, like advanced image & video processing, on mobile devices. The best performance always comes out of efficient task allocations & scheduling on heterogeneous platform, i.e. tasks must be allocated to CPU & GPU, based on their types. We can expect to see more & more mobile apps to follow the trend and these practice would also help shape proper tools chain to ease the software development effort while get better performance. 23

About MulticoreWare Founded in 2008 Leading tools, libraries & services provider for heterogeneous computing CTO: Prof. Wen-Mei Hwu from UIUC Our Mission To empower people to leverage heterogeneous computing so as to deliver enhanced value everywhere One of largest heterogeneous computing teams worldwide More than 200 strong engineers globally with offices in the USA, China, India, and Singapore St. Louis Champaign IL Changchun Beijing Chennai Copyright 2013, Confidential, MulticoreWare Inc 24

Industry Leadership Tools leadership role on HSA Foundation Khronos Contributor Member Strategic relationship with University of Illinois at Urbana Champaign Partnerships with CPU/GPU/FPGA vendors Mobile, Desktop, and Cloud Copyright 2013, Confidential, MulticoreWare Inc 25

Compiler Technology and Tools Expertise in LLVM and Clang based compiler technologies MxPA High Performance OpenCL stack for CPUs GMAC Automated Memory Data Movement and Coherence DL Optimized Data Structure Layout TM Heterogeneous Task Scheduler SM Automated Packing of Work Items PPA Profiler and Analyzer for Heterogeneous System Copyright 2013, Confidential, MulticoreWare Inc 26

Heterogeneous Libraries MulticoreWare has built up accelerated libraries implemented in OpenCL, that we can integrate for use in your application Video Broadcasting, Video Processing, Encoding, Decoding, Transcoding Imaging Machine Vision, Image Filters, Feature Detection Security Data Encryption Storage Data Compression Copyright 2013, Confidential, MulticoreWare Inc 27

Expertise across Platforms Video and Imaging implementations done across many platforms Experience across heterogeneous compute platforms Mobile device platforms to workstations and cloud based platforms AMD Fusion, Nvidia Kepler GPUs, ARM Mali and NEON Altera OpenCL FPGA FPGA development in HDL Experience across heterogeneous programming models CUDA OpenCL Renderscript C++AMP etc Copyright 2012, Confidential, Multicoreware Inc 28

MulticoreWare Services MulticoreWare will work with you through your application development cycle Allows you to keep focus on your core product and feature development Leave all optimization work across different processor platforms to MulticoreWare Maximize your application s performance with every new generation processor released Develop the Application MulticoreWare optimizes Multiple processor architectures Services involvement span from full software or silicon architecture design to porting, benchmarking, and support Copyright 2013, Confidential, MulticoreWare Inc 29

Products & Services for Mobile Platforms Mobile Video & Image Library (MVIL) Across platforms Easy to use APIs Ready to use real-scenario samples for quick product prototype Full parallel pipeline design to enable optimal performance & energy efficiency on heterogeneous mobile system Tools for Heterogeneous Mobile Compute PPA for Android: powerful perf analysis tools for heterogeneous platform, which supports C/C++, Java, Renderscript & OpenCL MxPA for Android: Enable OpenCL on any multi-core mobile devices Single code base across platforms Mobile GPU compute application services MCW China is going to hold technical seminars focusing on heterogeneous compute on mobile platforms We are happy to share our knowledge and expertise and help promote the adoption of mobile GPU compute Copyright 2012, Confidential, Multicoreware Inc 30

Thank You! Lihua Zhang, Ph.D. Email: lihua@multicorewareinc.com MulticoreWare website: http://www.multicorewareinc.com 31