GPGPU on ARM. Tom Gall, Gil Pitney, 30th Oct 2013


Session Description
This session discusses the current state of the art of GPGPU technologies on ARM SoC systems. What standards are there? Where are the drivers? What is the best hardware? What are the problem areas? What open source software has been accelerated? What opportunities exist for taking advantage of GPGPU computing on ARM? There will also be some time for interactive discussion about current and potential plans.

GWG: New GPGPU Sub-group
At the previous Connect (Dublin) we discussed GPGPU and what Linaro should do in that area.
Recognized that:
- GPUs are much more energy efficient for highly parallel computing problems than the main processor alone, even when it uses NEON SIMD.
- The GPU gives big performance gains (> 20x) for certain open source workloads.
Decided:
- To get hardware and start gaining some experience with GPGPU drivers.
- It would be good to accelerate some OSS projects with GPGPU.
So we formed a GPGPU sub-team as part of the GWG.

GPU to GPGPU: Hardware Evolution
Graphics pipeline operations migrate from the CPU to the GPU:
- 1980s: CPU + framebuffers with blitters.
- 1990s: GPUs output one pixel per clock cycle and add pipeline operations: texture mapping, z-buffering, rasterization; transform and lighting operations (still fixed-function hardware).
Fixed functions become programmable kernels, adding multiple pipelines:
- 2000s: GPUs become programmable and multicore:
  - Programmable kernels (shaders) are added to the per-pixel and vertex processing stages.
  - More floating point precision, more GPU memory.
  - Unified shaders (e.g. the GeForce 8 "Streaming Multiprocessor") handle vertex, pixel and geometry computations.
GPU: initially a single-core, fixed-function hardware pipeline designed solely to accelerate graphics.
GPGPU: a set of parallel, highly programmable cores (shaders) which can be leveraged for more general-purpose computation.

GPU to GPGPU: Software Evolution
Software co-evolves with GPU hardware:
- 1990s: Graphics languages: OpenGL (led GPU hardware features), DirectX.
- 2000s: Shading languages: OpenGL/GLSL, Direct3D/HLSL, Cg, others...
Shading languages require an understanding of graphics pipeline operations. But shaders basically perform matrix and vector operations, which are ideal for many non-graphics, scientific applications.
Enter higher-level parallel computing and stream processing languages:
- 2007: CUDA, Compute Unified Device Architecture (NVIDIA).
- 2008: OpenCL, Open Computing Language (Apple/Khronos).
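The kernel model that CUDA and OpenCL share can be sketched without a GPU. The snippet below is only a serial Python emulation, not either API: `launch` and `saxpy_kernel` are hypothetical names chosen for illustration. It shows the key idea that one small function runs once per index, and each run touches only its own element.

```python
from typing import Callable, List


def launch(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Run `kernel` once per work-item index.

    A real GPU runtime would run these invocations in parallel;
    this stand-in simply loops over the index space serially.
    """
    for gid in range(global_size):
        kernel(gid, *args)


def saxpy_kernel(gid: int, a: float, x: List[float],
                 y: List[float], out: List[float]) -> None:
    # Each work-item reads and writes only element `gid`,
    # so no work-item depends on another.
    out[gid] = a * x[gid] + y[gid]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because the loop body has no cross-iteration dependency, the iterations could run in any order or all at once, which is what makes this kind of vector operation a good fit for a GPU.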

GPGPU Use Cases
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements:
- Advanced imaging (face recognition, cloth simulation, object detection).
- Cryptography searches (Bitcoin, distributed.net).
- Video decode and post-processing (IDCT, motion compensation, VLD, IQ, edge enhancement).
- Bioinformatics (BLAST for protein and genome sequence comparisons, Folding@Home).
- Genetic algorithms, weather prediction, particle physics simulations.
- The MapReduce algorithm.
The 13 Dwarfs of GP/GPU computing [2] and benchmarks [1] evaluate parallel programming models and hardware architectures for algorithms sharing similar computation and inter-processor communication patterns:
- Computationally limited: dense linear algebra, N-body methods, cryptography.
- Memory bandwidth limited: sparse linear algebra, structured grid.
- Memory latency limited: graph traversal, dynamic programming, unstructured grid, FFT.
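One rough way to see why dense linear algebra is computationally limited while sparse linear algebra is bandwidth limited is to estimate arithmetic intensity, the number of floating point operations per byte of memory traffic. The sketch below uses simplified operation and byte counts that are assumptions for illustration (4-byte floats and indices, each matrix moved exactly once, row pointers and output writes of the sparse case ignored); the function names are hypothetical and the counts are not taken from the cited benchmarks.

```python
def intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity in flops per byte of memory traffic."""
    return flops / bytes_moved


def dense_matmul(n: int, elem_bytes: int = 4) -> float:
    # C = A * B for n x n matrices: n^3 multiply-add pairs = 2*n^3 flops.
    # Assumed best case: A and B are read once, C is written once.
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * elem_bytes
    return intensity(flops, bytes_moved)


def sparse_matvec(nnz: int, elem_bytes: int = 4, index_bytes: int = 4) -> float:
    # y = A * x with nnz stored nonzeros: one multiply-add per nonzero.
    # Assumed traffic per nonzero: the value, its column index, one x element.
    flops = 2 * nnz
    bytes_moved = nnz * (2 * elem_bytes + index_bytes)
    return intensity(flops, bytes_moved)


for n in (64, 1024):
    print(f"dense matmul n={n}: {dense_matmul(n):.1f} flops/byte")
print(f"sparse mat-vec: {sparse_matvec(1_000_000):.2f} flops/byte")
```

Under these assumptions the dense product's intensity grows with the matrix size (n/6 flops per byte), so the arithmetic units become the limit, while the sparse product stays at a constant 1/6 flops per byte however large the problem, so memory bandwidth is the limit.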

Standards / Programming Platforms
Hardware:
- The HSA (Heterogeneous System Architecture) Foundation creates hardware and software standards. Implementations are coming (AMD Kaveri, PS4: 4Q13?). Hardware must be certified HSA compliant.
- HSA requires one fully coherent shared memory heap, with unified addressing, between CPUs and compute devices (i.e. a pointer can be passed between CPU and GPU directly).
- An HSA-specified MMU (hmmu) allows the compute devices to share the page table mappings with the host CPU(s).
Software:
- OpenCL (Open Computing Language) is an open standard by the Khronos Group. Anyone is free to implement the standard and can pay for certification. A large and growing number of problem-domain libraries and apps have been built on OpenCL.
- CUDA (Compute Unified Device Architecture) by NVIDIA is limited to NVIDIA hardware. There is no standardization body and no certification. Historically CUDA could not be licensed, but on June 18th, 2013 NVIDIA announced that it was going to start licensing its GPU IP.
- Renderscript, part of Android, by Google. Renderscript is not a standard. There is no certification, and there is no implementation outside of Android.
- HSA tools: a compiler to HSAIL, an HSAIL finalizer to accelerator code, and HSA host runtime software.

Open Source Software accelerated with GPGPU
Many OSS libraries support CUDA and/or OpenCL:
- Linear algebra libraries: ViennaCL, MAGMA, clMath (clBLAS), VexCL.
- Vision: OpenCV (computer vision), with some modules written for CUDA/OpenCL.
- Database: sqlite on CUDA (20x-70x speedup for SELECT queries).
- Finance: QuantLib (faster option pricing for HFT, Monte Carlo simulations, and company credit risk).
- Scientific domains: OpenMM (molecular modeling), OpenBR (biometric recognition).
Language bindings ease programmability and proliferate GPU usage:
- WebCL: JavaScript binding to OpenCL, and a Khronos standard. Demos have shown 50x-100x performance improvements over pure JavaScript web apps.
- Python: PyOpenCL.
- Java: Aparapi, JogAmp/JOCL.

OpenCL Drivers on Various Hardware
- ZiiLabs:
  H/W: ZMS-40: quad ARM Cortex-A9 with 96 StemCell cores.
  S/W: OpenCL 1.1 (desktop profile) drivers from Creative for custom devices.
- Vivante:
  H/W: GC600+ series, licensed to Marvell ARMADA, Freescale i.MX6, etc.
  S/W: OpenCL 1.1 EP drivers that work on Freescale processors (SabreBoard).
- Qualcomm:
  H/W: Adreno 300+ series, in Snapdragon SoCs.
  S/W: OpenCL 1.2 EP Android drivers in the Adreno SDK (not shipped on device).
- ARM Mali:
  H/W: Samsung Exynos 4/5 series, many others (e.g. Arndale, Chromebook).
  S/W: Drivers for Exynos seem to work on the Arndale board only when the LCD is also ordered; Chromebook Mali drivers were just made available (Oct 2013).
- Imagination Technologies:
  H/W: TI OMAP, Exynos 4/5 series with PowerVR SGX GPUs.
  S/W: OpenCL 1.1 EP: TI requires an NDA; Samsung ODROID-XU Android drivers are available (Linux drivers require an NDA).
Source: http://streamcomputing.eu/blog/2013-09-05/mobile-processor-opencl-drivers-q3-2013/

Challenges and Opportunities
Challenges:
- GPGPU drivers are mostly closed, and only slowly becoming available.
- Google invested in Renderscript rather than OpenCL on Android.
- HSA-compliant hardware can extend GPGPU into new problem domains by increasing performance for memory-latency-constrained algorithms, but new hardware designs take a long time to implement.
- Programming models like Renderscript/OpenCL are still not so simple.
Opportunities:
- Performance: select a set of key algorithm libraries to benchmark and accelerate.
- OpenCL: invest in a CPU OSS version of OpenCL, and update it to the latest OpenCL spec.
- Simplify programmability: explore directive-based models, e.g. OpenACC/OpenMP 4.0?

GWG Roadmap: GPGPU Next Steps
- Address the availability of OpenCL drivers through the use of Clover (an open source, CPU-only driver), and advance the use of non-GPU devices such as TI DSPs, which will also be supported by Clover. Update Clover to OpenCL 1.2.
- Accelerate libjpeg-turbo: using an existing OpenCL port, validate that it works on an ARM OpenCL implementation, and compare it to a NEON-optimized version.
- Accelerate sqlite: sqlite is a database found in mobile and embedded solutions for both Android and Linux. A past project accelerated it using CUDA. We propose to accelerate the database using OpenCL.
- Accelerate OpenCV: OpenCV is an open source computer vision and machine learning software library built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Accelerate it using OpenCL.
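The "validate, then compare" step for an accelerated port can be sketched as a small harness: first check that the candidate produces the same output as the reference, then time both and report the ratio. This is only an illustration under stated assumptions. Both functions below are hypothetical pure-Python stand-ins (a clamped brightness adjustment), not libjpeg-turbo, NEON or OpenCL code, and `validate_and_compare` is a name invented here.

```python
import time
from typing import Callable, List, Sequence, Tuple


def reference_brighten(pixels: Sequence[int]) -> List[int]:
    # Hypothetical stand-in for the existing (e.g. NEON-optimized) path.
    out = []
    for p in pixels:
        out.append(min(p + 16, 255))
    return out


def candidate_brighten(pixels: Sequence[int]) -> List[int]:
    # Hypothetical stand-in for the accelerated (e.g. OpenCL) path.
    return [p + 16 if p < 240 else 255 for p in pixels]


def best_time(fn: Callable, data: Sequence[int], repeats: int = 5) -> float:
    """Return the fastest of `repeats` wall-clock runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def validate_and_compare(reference: Callable, candidate: Callable,
                         data: Sequence[int],
                         repeats: int = 5) -> Tuple[bool, float]:
    """Check that outputs agree, then return (match, speedup of candidate)."""
    matches = list(reference(data)) == list(candidate(data))
    ref_t = best_time(reference, data, repeats)
    cand_t = best_time(candidate, data, repeats)
    return matches, ref_t / max(cand_t, 1e-12)  # guard against a zero timing


data = [i % 256 for i in range(100_000)]
matches, speedup = validate_and_compare(reference_brighten,
                                        candidate_brighten, data)
print(f"outputs match: {matches}, speedup: {speedup:.2f}x")
```

Checking correctness before timing matters because a port that is fast but wrong is not a valid comparison. A real harness would also need to account for data transfer between host and device, which this sketch does not model.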

More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
How to join: http://www.linaro.org/about/how-to-join
Linaro members: www.linaro.org/members