EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

Similar documents
ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

Understanding GPGPU Vector Register File Usage

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD


INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

HyperTransport Technology

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

Panel Discussion: The Future of I/O From a CPU Architecture Perspective

SIMULATOR AMD RESEARCH JUNE 14, 2015

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

Anatomy of AMD s TeraScale Graphics Engine

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Sequential Consistency for Heterogeneous-Race-Free

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

RegMutex: Inter-Warp GPU Register Time-Sharing

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

Automatic Intra-Application Load Balancing for Heterogeneous Systems

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

Generic System Calls for GPUs

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

viewdle! - machine vision experts

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

DR. LISA SU

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

Fusion Enabled Image Processing

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

ROCm: An open platform for GPU computing exploration

FUSION PROCESSORS AND HPC

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

Designing Natural Interfaces

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING

AMD EPYC CORPORATE BRAND GUIDELINES

Heterogeneous Computing

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Fan Control in AMD Radeon Pro Settings. User Guide. This document is a quick user guide on how to configure GPU fan speed in AMD Radeon Pro Settings.

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

Vulkan (including Vulkan Fast Paths)

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

Pattern-based analytics to estimate and track yield risk of designs down to 7nm

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

AMD AIB Partner Guidelines. Version February, 2015

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

1 Presentation Title Month ##, 2012

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

Changing your Driver Options with Radeon Pro Settings. Quick Start User Guide v3.0

AMD RYZEN CORPORATE BRAND GUIDELINES

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

AMD SEV Update Linux Security Summit David Kaplan, Security Architect

Driver Options in AMD Radeon Pro Settings. User Guide

EPYC VIDEO CUG 2018 MAY 2018

High Performance Graphics 2010

Linux Network Tuning Guide for AMD EPYC Processor Based Servers

INTRODUCING RYZEN MARCH

Changing your Driver Options with Radeon Pro Settings. Quick Start User Guide v2.1

Memory Population Guidelines for AMD EPYC Processors

Resource Saving: Latest Innovation in Optimized Cloud Infrastructure

Cilk Plus: Multicore extensions for C and C++

Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

Optimizing the operations with sparse matrices on Intel architecture

AMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008

1401 HETEROGENEOUS HPC How Fusion Designs Can Advance Science

Microsoft Windows 2016 Mellanox 100GbE NIC Tuning Guide

Dynamic Sparse Matrix Allocation on GPUs. James King

PCCC WORKSHOP:AMD の最新製品戦略とプラットフォームソリューション FEBRUARY 19 TH 2016 HIDETOSHI IWASA, FAE MANAGER AMD JAPAN

Graphics Hardware 2008

* ENDNOTES: RVM-26 AND RZG-01.

Sparse Matrix Formats

Transcription:

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014

THIS TALK IN ONE SLIDE Demonstrate how to save space and time by using the CSR format for GPU-based SpMV Optimized GPU-based SpMV algorithm that increases memory coalescing by using LDS 14.7x faster than other CSR-based algorithms 2.3x faster than using other matrix formats 2 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

Background

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 4 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 5 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 6 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 7 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 8 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 9 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPARSE MATRIX-VECTOR MULTIPLICATION 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - 1 2 3 4 5 = 22 28 18 39 18 Traditionally Bandwidth Limited 10 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 11 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 12 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - columns: 0 2 4 1 3 2 0 3 1 values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 13 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - columns: 0 2 4 1 3 2 0 3 1 values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 14 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - row delimiters: 0 3 5 6 8 9 columns: 0 2 4 1 3 2 0 3 1 values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 15 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - row delimiters: 0 3 5 6 8 9 columns: 0 2 4 1 3 2 0 3 1 values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 16 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

COMPRESSED SPARSE ROW (CSR) 1.0-2.0-3.0-4.0-5.0 - - - 6.0 - - 7.0 - - 8.0 - - 9.0 - - - row delimiters: 0 3 5 6 8 9 columns: 0 2 4 1 3 2 0 3 1 values: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 17 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MEMORY SYSTEM I$ K$ I$ K$ L1 L1 L1 L1 L1 L1 L1 L1 L2 64-bit Dual Channel Memory Controller L2 64-bit Dual Channel Memory Controller L2 64-bit Dual Channel Memory Controller 18 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MEMORY SYSTEM I$ K$ I$ K$ 64 Bytes per clock L1 bandwidth per CU L1 L1 L1 L1 L1 L1 L1 L1 L2 64-bit Dual Channel Memory Controller L2 64-bit Dual Channel Memory Controller L2 64-bit Dual Channel Memory Controller 19 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MEMORY SYSTEM I$ K$ I$ K$ 64 Bytes per clock L1 bandwidth per CU L1 L1 L1 L1 L1 L1 L1 L1 64 Bytes per clock L2 bandwidth per partition L2 L2 L2 64-bit Dual Channel Memory Controller 64-bit Dual Channel Memory Controller 64-bit Dual Channel Memory Controller 20 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MEMORY SYSTEM I$ K$ I$ K$ 64 Bytes per clock L1 bandwidth per CU L1 L1 L1 L1 L1 L1 L1 L1 64 Bytes per clock L2 bandwidth per partition L2 L2 L2 Uncoalesced accesses are much less efficient! 64-bit Dual Channel Memory Controller 64-bit Dual Channel Memory Controller 64-bit Dual Channel Memory Controller 21 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU COMPUTE UNIT Vector Units (4x SIMD-16) Vector Registers (4x 64KB) Local Data Share (64KB) L1 Cache (16 KB) 22 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU COMPUTE UNIT Vector Units (4x SIMD-16) Vector Registers (4x 64KB) Local Data Share (64KB) L1 Cache (16 KB) Must do parallel loads to maximize bandwidth 23 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

LOCAL DATA SHARE (LDS) MEMORY Local Data Share (64KB) Scratchpad (software-addressed) on-chip cache Statically divided between all workgroups in a CU Highly ported to allow scatter/gather (uncoalesced) accesses 24 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-SCALAR One thread/work-item per matrix row 25 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-SCALAR One thread/work-item per matrix row Workgroup #1 Workgroup #2 Workgroup #3 26 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-SCALAR One thread/work-item per matrix row Workgroup #1 Workgroup #2 Workgroup #3 27 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-SCALAR One thread/work-item per matrix row Workgroup #1 Workgroup #2 Workgroup #3 Uncoalesced accesses 28 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row 29 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2... Workgroup #12 30 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2... Workgroup #12 31 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2... Workgroup #12 Good: coalesced accesses 32 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2... Workgroup #12 33 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR One workgroup (multiple threads) per matrix row Workgroup #1 Workgroup #2... Workgroup #12 Bad: poor parallelism 34 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

MAKING GPU SPMV FAST Neither CSR-Scalar nor CSR-Vector offer ideal performance. CSR works well on CPUs and is widely used Traditional solution: new matrix storage format + new algorithms ELLPACK/ITPACK, BCSR, DIA, BDIA, SELL, SBELL, ELL+COO Hybrid, many others Auto-tuning frameworks like clspmv try to find the best format for each matrix Can require more storage space or dynamic transformation overhead Can we find a good GPU algorithm that does not modify the original CSR data? 35 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-Adaptive

INCREASING BANDWIDTH EFFICIENCY 37 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

INCREASING BANDWIDTH EFFICIENCY 38 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

INCREASING BANDWIDTH EFFICIENCY CSR-Vector coalesces long rows by streaming into the LDS 39 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share 40 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share 41 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share 42 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share 43 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share + + + + 44 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share + + + + Fast uncoalesced accesses! 45 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share 46 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-VECTOR USING THE LDS Local Data Share Fast accesses everywhere 47 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

INCREASING BANDWIDTH EFFICIENCY How can we get good bandwidth for other short rows? 48 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

INCREASING BANDWIDTH EFFICIENCY How can we get good bandwidth for other short rows? Load these into the LDS 49 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share 50 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share 51 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 52 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 53 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 54 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 55 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 56 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Block 1 Block 2 Block 3 Block 4 57 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-STREAM: STREAM MATRIX INTO THE LDS Local Data Share Fast accesses everywhere Block 1 Block 2 Block 3 Block 4 58 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU row delimiters: 0 3 6 8 10 14 16 17 19 20 22 24 32 columns: values: 59 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU row delimiters: 0 3 6 8 10 14 16 17 19 20 22 24 32 columns: values: 60 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU row delimiters: columns: values: 61 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: 62 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: 63 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: 64 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: 65 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: Block 1 66 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: Block 1 Block 2 67 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: columns: values: Block 1 Block 2 Block 3 Block 4 68 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

FINDING BLOCKS FOR CSR-STREAM DONE ONCE ON THE CPU Assuming 8 LDS entries per workgroup row blocks: 0 3 6 11 12 row delimiters: CSR Structure Unchanged columns: values: Block 1 Block 2 Block 3 Block 4 69 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? Local Data Share 70 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? Local Data Share Block 1 Block 2 Block 3 71 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? Local Data Share Block 1 Block 2 Block 3 Block 4 72 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? Local Data Share CSR-Stream Block 1 Block 2 Block 3 Block 4 73 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? Local Data Share CSR-Stream CSR-Vector Block 1 Block 2 Block 3 Block 4 74 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

WHAT ABOUT VERY LONG ROWS? CSR-Adaptive CSR-Stream CSR-Vector Block 1 Block 2 Block 3 Block 4 75 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE 76 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks 77 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries 78 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries Single Row? 79 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries Single Row? Yes CSR-Vector 80 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries Single Row? Yes CSR-Vector Vector Load Row into LDS Perform Parallel Reduction out of LDS 81 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries CSR-Stream No Single Row? Yes CSR-Vector Vector Load Row into LDS Perform Parallel Reduction out of LDS 82 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-ADAPTIVE (Once per matrix) On CPU: Create Row Blocks On GPU (each workgroup assigned one row block): Read Row Block Entries CSR-Stream No Single Row? Yes CSR-Vector Vector Load Multiple Rows into LDS Perform Reduction out of LDS (serially or in parallel) Vector Load Row into LDS Perform Parallel Reduction out of LDS 83 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

Experiments

EXPERIMENTAL SETUP CPU: AMD A10-7850K (3.7 GHz) 32 GB dual-channel DDR3-2133 GPU: AMD FirePro W9100 44 parallel compute units (CUs) 930 MHz core clock rate (5.2 single-precision TFLOPs) 16 GDDR5 channels at 1250 MHz (320 GB/s) CentOS 6.4 (kernel 2.6.32-358.23.2) AMD FirePro Driver version 14.20 beta AMD APP SDK v2.9 85 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

SPMV SETUP SHOC Benchmark Suite (Aug. 2013 version) CSR-Vector (details in paper: CSR-Scalar, ELLPACK) ViennaCL v1.5.2 CSR-Scalar, COO, ELLPACK, ELLPACK+COO HYB 4-value and 8-value padded versions of CSR-Scalar Best of these for each matrix: ViennaCL Best clspmv v0.1 CSR-Scalar, CSR-Vector, BCSR, COO, DIA, BDIA, ELLPACK, SELL, BELL, SBELL Best of these for each matrix: clspmv Best Details in paper: clspmv Cocktail format CSR-Adaptive 20 matrices from Univ. of Florida Sparse Matrix Collection 86 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

Generation time vs. COO CSR DATA STRUCTURE GENERATION TIMES 1024 512 256 128 64 32 16 8 4 2 1 Row Blocking ELLPACK BCSR Cocktail 87 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014 Note: Some matrices could not be held in ELLPACK or BCSR

PERFORMANCE IN SINGLE-PRECISION GFLOPS 88 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

GFLOPS (Single Precision) PERFORMANCE IN SINGLE-PRECISION GFLOPS 90 80 70 60 50 40 30 20 10 0 Roughly Equal Performance (7/20) 89 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

GFLOPS (Single Precision) PERFORMANCE IN SINGLE-PRECISION GFLOPS 90 80 70 60 50 40 30 20 10 0 Roughly Equal Performance (7/20) 90 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

GFLOPS (Single Precision) PERFORMANCE IN SINGLE-PRECISION GFLOPS 90 80 70 60 50 40 30 20 10 0 Roughly Equal Performance (7/20) 110.7 Others Better (4/20) 91 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

GFLOPS (Single Precision) PERFORMANCE IN SINGLE-PRECISION GFLOPS 90 80 70 60 50 40 30 20 10 0 Roughly Equal Performance (7/20) 110.7 Others Better (4/20) 92 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

GFLOPS (Single Precision) PERFORMANCE IN SINGLE-PRECISION GFLOPS 90 80 70 60 50 40 30 20 10 0 CSR-Adaptive Performance Higher (9/20) 93 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CSR-Adaptive Speedup PERFORMANCE BENEFIT OF CSR-ADAPTIVE 35 30 25 20 15 10 5 0 CSR-Scalar CSR-Vector ELLPACK ViennaCL Best clspmv Cocktail clspmv Best Single 94 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CONCLUSION CSR-Adaptive up to 14.7x faster than previous GPU-based SpMV algorithms Requires little work beyond generating the traditional CSR data structure Less than 0.1% extra storage compared to CSR Algorithm is not complex: easy to integrate into libraries and existing SW 95 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CONCLUSION CSR-Adaptive up to 14.7x faster than previous GPU-based SpMV algorithms Requires little work beyond generating the traditional CSR data structure Less than 0.1% extra storage compared to CSR Algorithm is not complex: easy to integrate into libraries and existing SW AMD Sample Code Available Soon Ask Us! 96 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

CONCLUSION CSR-Adaptive up to 14.7x faster than previous GPU-based SpMV algorithms Requires little work beyond generating the traditional CSR data structure Less than 0.1% extra storage compared to CSR Algorithm is not complex: easy to integrate into libraries and existing SW AMD Sample Code Available Soon Ask Us! Independently Implemented Version in ViennaCL 1.6.1 97 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. 98 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

Backup Slides

AMD GPU MICROARCHITECTURE 100 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MICROARCHITECTURE 101 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014

AMD GPU MICROARCHITECTURE 102 EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT NOVEMBER 20, 2014