ACCELERATING MATRIX PROCESSING WITH GPUs
Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte, AMD Research


MOTIVATION
Matrix operations are ubiquitous: they are critical in HPC, machine learning, 3D rendering, gaming, signal processing, and more. Serial performance is no longer doubling every ~2 years, so parallel solutions are needed, and emerging GPUs provide tremendous compute capabilities.

Important matrix processing tasks include:
- SpMV: Sparse Matrix-Vector Multiply
- SpTS: Sparse Triangle Solve
- Graph processing: sparse-matrix operations
- GEMM: General Matrix-Matrix Multiply

These tasks are representative of a range of challenges, and the solutions also apply to manycore CPUs.

REPRESENTATIVE PROBLEMS
- SpMV: memory-bound problem with divergence
- SpTS: heavily researched sparse BLAS routine
- Graph processing: difficult to find general solutions
- GEMM: compute-bound problem with new challenges

Precision and accuracy requirements may vary greatly, and optimizing data movement and memory accesses can be very important.

SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)

    [ 1.0   -   2.0   -   3.0 ]   [ 1 ]   [ 22 ]
    [  -   4.0   -   5.0   -  ]   [ 2 ]   [ 28 ]
    [  -    -   6.0   -    -  ] x [ 3 ] = [ 18 ]
    [ 7.0   -    -   8.0   -  ]   [ 4 ]   [ 39 ]
    [  -   9.0   -    -    -  ]   [ 5 ]   [ 18 ]

    (first entry: 1.0*1 + 2.0*3 + 3.0*5 = 22)

SpMV applications include iterative solvers, machine learning, and graph analytics. SpMV is memory-bound, with performance dominated by how the sparse matrix is stored in memory. Compressed sparse row (CSR) is the most common storage format: it compresses most matrices well and runs efficiently on CPUs.

COMPRESSED SPARSE ROW (CSR)

    [ 1.0   -   2.0   -   3.0 ]
    [  -   4.0   -   5.0   -  ]
    [  -    -   6.0   -    -  ]
    [ 7.0   -    -   8.0   -  ]
    [  -   9.0   -    -    -  ]

    row delimiters: 0 3 5 6 8 9
    columns:        0 2 4 1 3 2 0 3 1
    values:         1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
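To make the storage layout concrete, here is a minimal serial CSR SpMV in C; the array names mirror the slide (row delimiters, columns, values) and the main function replays the 5x5 example above. This is an illustrative sketch, not AMD's clSPARSE implementation.

```c
#include <stdio.h>

/* y = A*x for a CSR matrix: row_delims has n_rows+1 entries,
 * cols/vals hold the column index and value of each non-zero. */
static void spmv_csr(int n_rows, const int *row_delims, const int *cols,
                     const double *vals, const double *x, double *y)
{
    for (int row = 0; row < n_rows; ++row) {
        double sum = 0.0;
        for (int j = row_delims[row]; j < row_delims[row + 1]; ++j)
            sum += vals[j] * x[cols[j]];
        y[row] = sum;
    }
}

int main(void)
{
    /* The 5x5 example matrix from the slide. */
    const int    row_delims[] = {0, 3, 5, 6, 8, 9};
    const int    cols[]       = {0, 2, 4, 1, 3, 2, 0, 3, 1};
    const double vals[]       = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
    const double x[]          = {1, 2, 3, 4, 5};
    double y[5];

    spmv_csr(5, row_delims, cols, vals, x, y);
    for (int i = 0; i < 5; ++i)
        printf("%g ", y[i]);   /* prints: 22 28 18 39 18 */
    printf("\n");
    return 0;
}
```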

CSR-ADAPTIVE
An efficient technique for SpMV with the CSR format on GPUs:
- Measures the number of non-zero values in each row when the matrix is first created
- Groups together rows with roughly the same number of non-zero values
- Determines the number of rows each SIMD unit operates on based on the number of non-zero values in each row
- Loads contiguous rows into on-chip scratch-pad storage without causing memory divergence

This solves the problems of memory divergence and lack of parallelism. CSR-Adaptive achieves up to 95% efficiency for many input matrices and is on average 28% faster than the previous fastest technique. It is available as part of AMD's clSPARSE library. (A sketch of the row-grouping pass follows below.)
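The following is a minimal sketch, in plain C, of the row-grouping idea described above: walk the CSR row delimiters and close a row block whenever the accumulated non-zero count would exceed a per-block budget, so each block's non-zeros can be streamed into on-chip scratchpad. The budget NNZ_PER_BLOCK is an assumed illustrative constant; the production clSPARSE code handles more cases than shown here.

```c
/* Sketch of a CSR-Adaptive-style row-blocking pass: group consecutive
 * rows so that each block holds at most NNZ_PER_BLOCK non-zeros (a stand-in
 * for the amount of data that fits in on-chip scratchpad).  A single row
 * with more non-zeros than the budget becomes its own "long row" block,
 * which would be handled by a CSR-Vector-style kernel. */
#define NNZ_PER_BLOCK 1024

/* row_delims: n_rows + 1 CSR row pointers.
 * row_blocks: output array of block boundaries (at most n_rows + 1 entries).
 * Returns the number of boundaries written (block count + 1). */
static int build_row_blocks(int n_rows, const int *row_delims, int *row_blocks)
{
    int n = 0;
    int block_start = 0;
    row_blocks[n++] = 0;

    int row = 1;
    while (row <= n_rows) {
        int nnz = row_delims[row] - row_delims[block_start];
        if (nnz > NNZ_PER_BLOCK) {
            /* Close the block before this row, or make the oversized
             * row its own block if it alone exceeds the budget. */
            int end = (row - 1 > block_start) ? row - 1 : row;
            row_blocks[n++] = end;
            block_start = end;
            row = end + 1;
        } else {
            ++row;
        }
    }
    if (row_blocks[n - 1] != n_rows)
        row_blocks[n++] = n_rows;
    return n;
}
```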

CSR-ADAPTIVE: A COMPLETE SPMV SOLUTION
[Figure: CSR-Adaptive assigns each block of rows to one of three kernels: CSR-Stream for short rows, CSR-Vector for medium-sized rows, and CSR-VectorL for long rows.]

Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format, J. Greathouse and M. Daga, SC 2014.

SPMV FUTURE RESEARCH
- Sparse matrix operations are becoming increasingly important in machine learning, where small weights and inputs are set to zero to reduce the number of computations
- Some problems can leverage much lower precision, but methods are needed to determine how much precision is acceptable
- Some input matrices perform much worse because their vector inputs do not cache well: very long rows require a large number of vector accesses, which can displace useful data in the caches and cause memory conflicts; cache bypassing for large rows may improve performance
- New systems have closely coupled CPUs and GPUs; it is interesting to investigate which calculations should occur on the CPU and which should occur on the GPU

SPTS: SPARSE TRIANGLE SOLVE - START WITH THE DENSE CASE
SpTS: solve Ax = b, where A is lower triangular. Used in direct solvers, iterative methods, least squares, etc.

Consider a dense 3x3 lower triangular system:

    [ a11   0    0  ] [ x1 ]   [ b1 ]
    [ a21  a22   0  ] [ x2 ] = [ b2 ]
    [ a31  a32  a33 ] [ x3 ]   [ b3 ]

The solution for each row is:

    x1 = b1 / a11
    x2 = (b2 - a21*x1) / a22
    x3 = (b3 - a31*x1 - a32*x2) / a33

Every row must be solved in series, which creates contention and dependencies. (A forward-substitution sketch follows below.)
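Written as code, the forward substitution above looks like the dense, serial C sketch below; it makes the row-to-row dependency explicit, since each x[i] consumes every previously computed x[j].

```c
/* Solve L*x = b for a dense lower-triangular L (row-major, n x n).
 * Each row i consumes every previously computed x[j], j < i,
 * which is the serial dependency chain discussed above. */
static void dense_lower_solve(int n, const double *L, const double *b, double *x)
{
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int j = 0; j < i; ++j)
            sum -= L[i * n + j] * x[j];   /* b_i - sum_j a_ij * x_j */
        x[i] = sum / L[i * n + i];        /* divide by the diagonal a_ii */
    }
}
```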

SPTS: SPARSE TRIANGLE SOLVE - IMPACT OF SPARSITY
For x2 = (b2 - a21*x1) / a22: if a21 = 0, the first two rows can be solved in parallel. However, the data dependencies are not known a priori.

Essential challenges of SpTS:
- Determining the data dependencies between rows
- Lack of parallelism for some problems due to dependencies

Existing solutions:
- Level sets: requires an analysis phase to determine the data dependencies between rows; enables load balancing and fast traversal of the solution (see the level-set sketch below)
- Graph coloring: requires a simpler analysis phase plus row re-ordering; the re-ordering can perturb the problem solution
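As an illustration of the level-set approach (not any particular library's implementation), the sketch below assigns each row of a lower-triangular CSR matrix a level equal to one more than the deepest level among the rows it depends on; rows that share a level have no dependencies on each other and can be solved in parallel.

```c
/* Level-set analysis for SpTS on a lower-triangular CSR matrix.
 * level[row] = 1 + max(level of every row this row depends on), where the
 * dependencies are the off-diagonal column indices of that row.  Rows in
 * the same level are independent and can be solved in parallel.
 * Returns the number of levels. */
static int build_level_sets(int n_rows, const int *row_delims,
                            const int *cols, int *level)
{
    int n_levels = 0;
    for (int row = 0; row < n_rows; ++row) {
        int lvl = 0;
        for (int j = row_delims[row]; j < row_delims[row + 1]; ++j) {
            int dep = cols[j];
            if (dep < row && level[dep] + 1 > lvl)   /* skip the diagonal */
                lvl = level[dep] + 1;
        }
        level[row] = lvl;
        if (lvl + 1 > n_levels)
            n_levels = lvl + 1;
    }
    return n_levels;
}
```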

SPTS: SPARSE TRIANGLE SOLVE - AVOIDING THE SPARSITY ANALYSIS, AND FUTURE RESEARCH
One solution [Liu et al.]: transpose from the CSR format to Compressed Sparse Column (CSC)
- No longer requires pre-processing for data dependencies
- The transpose is expensive, but faster than a full analysis
- Requires additional memory

Efficient implementations of SpTS on GPUs and manycore CPUs remain an important open research area:
- It may be possible to apply solutions similar to those used for SpMV
- Determining the accuracy and precision needed based on the problem to be solved

A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves, W. Liu, A. Li, J. Hogg, I. S. Duff, and B. Vinter, Euro-Par, 2016.

GRAPH PROCESSING WITH BELRED - OVERVIEW AND AN EXAMPLE
Graph algorithms are widely used in diverse application domains: business analytics, social networks, life sciences, healthcare, infrastructure planning, engineering simulations, and more.

BelRed uses sparse-matrix routines to implement graph applications on GPUs:
- Includes a set of key sparse-matrix and vector routines
- Similar to GraphBLAS, but optimized for GPUs
- Initial implementation with OpenCL and SNACK
- Clean abstraction and various underlying optimizations

BelRed: Constructing GPGPU Graph Applications with Software Building Blocks, S. Che, B. M. Beckmann, and S. K. Reinhardt, HPEC, 2014.

BELRED LIBRARY ROUTINES AND GRAPH ALGORITHMS
[Table: implemented linear-algebra routines]

BelRed implements important graph algorithms using sparse linear algebra operations:
- PageRank (SpMV)
- Graph coloring (SegReduc)
- Maximal independent set (velementwise, SpMV)
- K-truss (SpMV, SpGeMM, SpElemWise)

(A PageRank-via-SpMV sketch follows below.)

Programming GPGPU Graph Applications with Linear Algebra Building Blocks, S. Che, B. M. Beckmann, and S. K. Reinhardt, IJPP, 2016.
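To make the "PageRank (SpMV)" pairing concrete, here is a hedged C sketch of one PageRank power-iteration step expressed as a CSR SpMV plus a damping update; it assumes the matrix already stores column-stochastic link weights and omits dangling-node handling and convergence checks. It is not the BelRed API.

```c
/* One PageRank power iteration: rank_new = d * (A * rank) + (1 - d) / n,
 * where A is the column-stochastic link matrix stored in CSR. */
static void pagerank_step(int n, const int *row_delims, const int *cols,
                          const double *vals, double damping,
                          const double *rank, double *rank_new)
{
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;                      /* (A * rank)[row] via CSR SpMV */
        for (int j = row_delims[row]; j < row_delims[row + 1]; ++j)
            sum += vals[j] * rank[cols[j]];
        rank_new[row] = damping * sum + (1.0 - damping) / n;
    }
}
```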

BELRED FUTURE RESEARCH
- Optimizations for the Radeon Open Compute Platform (ROCm): optimize for lower-precision arithmetic when appropriate, and leverage the very efficient on-chip memories to improve performance
- Additional optimizations can build on previous work (Greathouse and Daga, SC'14, for SpMV; Liu and Vinter, IPDPS'14, for SpGEMM): classify matrix regions into different bins (e.g., rows with different sizes) and launch different optimized GPU kernels to process each bin
- Multi-GPU implementations with efficient static and dynamic work partitioning across GPUs
- Some graph applications have interesting dependencies across different sparse-matrix routines, which provides opportunities for more parallelism and asynchronous execution

ACCELERATING MATRIX-MATRIX MULTIPLICATION (GEMM) - OVERVIEW AND AN EXAMPLE
The product of two dense matrices, C = AB, is a common operation in scientific computing and machine learning. It is computationally expensive and compute-bound, and precision and accuracy requirements may vary greatly.

Consider a 2x2 matrix multiply:

    [ c11 c12 ]   [ a11 a12 ] [ b11 b12 ]
    [ c21 c22 ] = [ a21 a22 ] [ b21 b22 ]

Written out:

    c11 = a11*b11 + a12*b21    c12 = a11*b12 + a12*b22
    c21 = a21*b11 + a22*b21    c22 = a21*b12 + a22*b22

This requires a total of 8 multiplies and 4 additions. Matrix-matrix multiply scales asymptotically as O(n^3).
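For reference, the schoolbook algorithm behind the example above is the triple loop below (a row-major C sketch): it performs n^3 multiply-adds for n x n matrices, which is the O(n^3) scaling just quoted.

```c
/* Naive dense GEMM: C = A * B for row-major n x n matrices.
 * The three nested loops perform n^3 multiply-adds, matching the
 * asymptotic O(n^3) cost quoted above. */
static void gemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```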

GEMM - MACHINE LEARNING
In HPC, matrices are often square or close to square:

    A (MxK) x B (KxN) = C (MxN), with M, N, and K of similar size

In machine learning, the matrix dimensions can vary greatly depending on the problem being solved and the layer in the network, so A, B, and C may be highly rectangular.

Optimized GEMM routines are available in AMD's MIOpen library for a wide range of matrix sizes.

TECHNIQUES FOR FAST MATRIX-MATRIX MULTIPLICATION
Strassen-Winograd matrix multiplication:
- A recursive approach that reduces the number of multiplies while increasing the number of additions
- Reduces complexity from O(n^3) to O(n^2.807)
- Can increase numerical error and the need for communication

A one-level Strassen sketch is shown below. Several other techniques exist for speeding up matrix-matrix multiplication, but they may not work as well in practice.

Important future research includes:
- Algorithms for non-square matrices of various sizes
- Optimizing low-precision GEMM
- Optimizing SpGEMM performance on GPUs and manycore CPUs

Accelerating Strassen-Winograd's Matrix Multiplication Algorithm on GPUs, W. Lai, H. Arafat, V. Elango, and P. Sadayappan, HiPC, 2013.
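To show where the savings in multiplies comes from, here is the scalar 2x2 Strassen step as a C sketch: seven products M1 through M7 replace the eight products of the schoolbook formula, and applying the same substitution recursively to matrix blocks yields the O(n^2.807) bound. This is the classic Strassen formulation for illustration only, not the Winograd variant cited above.

```c
/* One Strassen step on a 2x2 multiply: 7 products m1..m7 instead of the
 * 8 of the schoolbook formula.  Used recursively on matrix blocks, this
 * gives the O(n^2.807) complexity cited above. */
static void strassen_2x2(const double a[2][2], const double b[2][2],
                         double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) *  b[0][0];
    double m3 =  a[0][0]            * (b[0][1] - b[1][1]);
    double m4 =  a[1][1]            * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) *  b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}
```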

CONCLUSIONS - SOLUTIONS FOR ACCELERATING MATRIX PROCESSING
- Better algorithms, e.g., designed to expose more parallelism
- Careful mapping of algorithms to hardware
- New instructions and specialized hardware for fast matrix computations
- Fitting the problem into scratchpad memory, which often requires direct programmer management
- Matching precision to application and problem requirements:
  - Scientific computing: high precision
  - Machine learning training: low precision
  - Machine learning inference: very low precision
- Libraries that capture the above and provide it to users

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.