Flexible Batched Sparse Matrix-Vector Product on GPUs

ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 13, 2017
Flexible Batched Sparse Matrix-Vector Product on GPUs
Hartwig Anzt, Gary Collins, Jack Dongarra, Goran Flegar, Enrique S. Quintana-Ortí

Input: A, x, y. Output: y = A x.
Matrix A contains only a few nonzero elements. Storing all entries results in large overhead (memory and computation).

Idea: Store only the nonzero elements [nz] explicitly.
[Figure: example matrix A and the array "value" holding its nonzero values.]

Need to also store the location of the nonzero elements!
COO format: three arrays of length nz: value (values), colidx (column indices), rowidx (row indices).
Memory footprint of COO format: nz(val) + 2*nz(int)
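The COO layout described above can be sketched in a few lines of Python. The example matrix on the slide did not survive transcription, so the 4x4 matrix below is a hypothetical stand-in, and the byte sizes (8-byte values, 4-byte indices) are assumptions made only for the footprint estimate.

```python
# Hypothetical 4x4 example matrix (not the one shown on the slides).
dense = [
    [5.0, 0.0, 0.0, 2.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 4.0, 6.0],
]


def dense_to_coo(a):
    """Collect the nonzeros of a dense matrix into the three COO arrays."""
    value, colidx, rowidx = [], [], []
    for i, row in enumerate(a):
        for j, v in enumerate(row):
            if v != 0.0:
                value.append(v)
                colidx.append(j)
                rowidx.append(i)
    return value, colidx, rowidx


def coo_footprint(nz, val_bytes=8, int_bytes=4):
    """Bytes for nz(val) + 2*nz(int); the byte sizes are assumptions."""
    return nz * val_bytes + 2 * nz * int_bytes


value, colidx, rowidx = dense_to_coo(dense)
coo_bytes = coo_footprint(len(value))
```

For this matrix the dense storage would need 16 values, while COO stores 6 values plus 12 indices.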

CSR format: arrays value (values), colidx (column indices), and rowptr (points to the first element in each row; the final entry is the number of nonzero elements).
Memory footprint of COO format: nz(val) + 2*nz(int)
Memory footprint of CSR format: nz(val) + nz(int) + (n+1)(int)
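A minimal sketch of how the CSR row pointer can be derived from COO data, assuming the COO entries are already sorted by row. The arrays are the COO form of a hypothetical 4x4 matrix, and the byte sizes in the footprint helper are assumptions.

```python
# COO arrays of a hypothetical 4x4 matrix, sorted by row.
value = [5.0, 2.0, 3.0, 1.0, 4.0, 6.0]
colidx = [0, 3, 1, 0, 2, 3]
rowidx = [0, 0, 1, 3, 3, 3]
n = 4


def coo_to_csr_rowptr(n, rowidx):
    """Build rowptr from row indices: count entries per row, then prefix-sum."""
    rowptr = [0] * (n + 1)
    for i in rowidx:
        rowptr[i + 1] += 1
    for i in range(n):
        rowptr[i + 1] += rowptr[i]
    return rowptr


def csr_footprint(n, nz, val_bytes=8, int_bytes=4):
    """Bytes for nz(val) + nz(int) + (n+1)(int); the byte sizes are assumptions."""
    return nz * val_bytes + nz * int_bytes + (n + 1) * int_bytes


rowptr = coo_to_csr_rowptr(n, rowidx)
csr_bytes = csr_footprint(n, len(value))
```

Row i occupies positions rowptr[i] up to (but not including) rowptr[i+1] of value and colidx, so an empty row shows up as two equal consecutive entries.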

How to parallelize this?
[Figure: matrix A with the arrays value, colidx (column index), and rowidx (row index).]

Parallelize by rows: Every thread handles the computation of one sum in local memory. Significant workload imbalance! Needs branching logic, which causes branch divergence on vector machines.
[Figure: threads T1 to T6 assigned to the rows of A, with accesses to input vector x and output vector y.]
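The row-parallel scheme can be illustrated with a sequential Python sketch in which each iteration of the outer loop plays the role of one GPU thread. The CSR data is a hypothetical 4x4 matrix, and the `work` list is added only to make the per-thread imbalance visible.

```python
# CSR arrays of a hypothetical 4x4 matrix.
value = [5.0, 2.0, 3.0, 1.0, 4.0, 6.0]
colidx = [0, 3, 1, 0, 2, 3]
rowptr = [0, 2, 3, 3, 6]
x = [1.0, 2.0, 3.0, 4.0]


def csr_spmv_by_rows(value, colidx, rowptr, x):
    """y = A x with one (simulated) thread per row."""
    n = len(rowptr) - 1
    y = [0.0] * n
    work = [0] * n
    for t in range(n):  # on a GPU, each t would be a separate thread
        acc = 0.0  # the partial sum lives in thread-local storage
        for k in range(rowptr[t], rowptr[t + 1]):
            acc += value[k] * x[colidx[k]]
        y[t] = acc
        work[t] = rowptr[t + 1] - rowptr[t]  # loop trip count of this thread
    return y, work


y, work = csr_spmv_by_rows(value, colidx, rowptr, x)
```

The trip counts differ per thread (here 2, 1, 0, and 3), which is the workload imbalance and the source of divergence when threads execute in lockstep.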

ELL format: Parallelize by rows: Every thread handles the computation of one sum in local memory. Balanced workload. Can result in significant overhead for unbalanced problems.
Values and column indices are padded for uniform row length.
[Figure: threads T1 to T6 assigned to the rows of A, with the padded arrays value and colidx.]
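A small sketch of the ELL padding, starting from CSR data for a hypothetical 4x4 matrix. Rows are padded with zero values and column index 0 (one common choice, assumed here). The arrays are kept row by row for readability; GPU implementations typically store them column-major so that neighbouring threads read neighbouring memory.

```python
# CSR arrays of a hypothetical 4x4 matrix.
value = [5.0, 2.0, 3.0, 1.0, 4.0, 6.0]
colidx = [0, 3, 1, 0, 2, 3]
rowptr = [0, 2, 3, 3, 6]
x = [1.0, 2.0, 3.0, 4.0]


def csr_to_ell(value, colidx, rowptr):
    """Pad every row to the length of the longest row."""
    n = len(rowptr) - 1
    width = max(rowptr[i + 1] - rowptr[i] for i in range(n))
    ell_value = [[0.0] * width for _ in range(n)]
    ell_colidx = [[0] * width for _ in range(n)]
    for i in range(n):
        for slot, k in enumerate(range(rowptr[i], rowptr[i + 1])):
            ell_value[i][slot] = value[k]
            ell_colidx[i][slot] = colidx[k]
    return ell_value, ell_colidx, width


def ell_spmv(ell_value, ell_colidx, x):
    """y = A x; every row runs the same number of iterations, no branching."""
    y = []
    for vals, cols in zip(ell_value, ell_colidx):
        acc = 0.0
        for v, j in zip(vals, cols):
            acc += v * x[j]  # padded slots contribute 0.0 * x[0]
        y.append(acc)
    return y


ell_value, ell_colidx, width = csr_to_ell(value, colidx, rowptr)
padded_slots = len(ell_value) * width - len(value)
y = ell_spmv(ell_value, ell_colidx, x)
```

Here half of the 12 stored slots are padding, which is the overhead the slide warns about for unbalanced problems.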

Parallelize by rows: Every thread handles the computation of one sum in local memory. Significant workload imbalance! Ordered access to input vector x.

Parallelize by elements: Balanced workload. Partial sums need synchronization: write conflicts!
[Figure: thread assignment T1 to T6, with accesses to input vector x and output vector y.]
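The element-parallel scheme can be sketched as follows, with each loop iteration standing in for one thread that owns a single nonzero. The update of `y` is the point where several threads may target the same row; on a GPU this would need an atomic add or a reduction. The COO data is a hypothetical 4x4 matrix, and the `order` parameter is an illustration device, not part of any real kernel.

```python
# COO arrays of a hypothetical 4x4 matrix.
value = [5.0, 2.0, 3.0, 1.0, 4.0, 6.0]
colidx = [0, 3, 1, 0, 2, 3]
rowidx = [0, 0, 1, 3, 3, 3]
x = [1.0, 2.0, 3.0, 4.0]


def coo_spmv_by_elements(n, value, colidx, rowidx, x, order=None):
    """y = A x with one (simulated) thread per nonzero element."""
    y = [0.0] * n
    if order is None:
        order = range(len(value))
    for k in order:  # every k does exactly one multiply: balanced workload
        # Write conflict: several k share the same rowidx[k].
        # On a GPU this update must be atomic (or use a segmented reduction).
        y[rowidx[k]] += value[k] * x[colidx[k]]
    return y


y_forward = coo_spmv_by_elements(4, value, colidx, rowidx, x)
y_reverse = coo_spmv_by_elements(4, value, colidx, rowidx, x,
                                 order=reversed(range(len(value))))
```

Both orders agree here because the values are small integers; with general floating-point data the summation order can change the rounding of the result.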

Reference: G. Flegar et al.: Overcoming Load Imbalance for Irregular Sparse Matrices, IA^3, 2017.

Different kernels are optimal for different problem classes:
CSR: small memory footprint; needs some logic for row-parallel processing.
ELL: zero-padding allows for efficient SIMD execution; efficient for balanced matrices.
COO: can compensate workload imbalance for irregular patterns.

For a single problem, we can usually find an optimal kernel, BUT

What if we process many different matrices at a time? (Assume they are all small.)

Design a batched SpMV kernel: process a large number of data-independent problems in parallel.

Are the problems the same size? Do they have the same number of nonzeros overall? The same number of nonzeros in every row? The same sparsity pattern?

Let's be as flexible as possible! Flexible means everything can be different.
Every thread block handles one system.
Memory pointers to distinct systems.
Load input vector x into shared memory.
Kernel for all matrices in CSR, COO, ELL.

[Figure: GPU thread blocks, each loading its respective vector x.]
All matrices are stored in the same format.
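The flexible batched design can be sketched in Python as a loop over systems, where each iteration stands in for one GPU thread block. This is a sequential illustration under assumptions, not the kernel from the talk: the batch is a list of per-system CSR triples (playing the role of the memory pointers to distinct systems), and a local copy of x stands in for the shared-memory load. Both matrices are hypothetical.

```python
# Two hypothetical systems of different size, nonzero count, and pattern,
# each given as (value, colidx, rowptr) in CSR.
batch = [
    ([1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3]),     # 2x2 system
    ([2.0, 4.0, 1.0], [0, 2, 0], [0, 1, 2, 3]),  # 3x3 system
]
xs = [
    [1.0, 1.0],
    [1.0, 2.0, 3.0],
]


def batched_csr_spmv(batch, xs):
    """Compute y_b = A_b x_b for every system b in the batch."""
    ys = []
    for b, (value, colidx, rowptr) in enumerate(batch):  # one thread block per system
        x_shared = list(xs[b])  # stands in for loading x into shared memory
        n = len(rowptr) - 1
        y = [0.0] * n
        for i in range(n):  # threads of the block would split these rows
            for k in range(rowptr[i], rowptr[i + 1]):
                y[i] += value[k] * x_shared[colidx[k]]
        ys.append(y)
    return ys


ys = batched_csr_spmv(batch, xs)
```

Nothing in the loop body assumes that two systems share a size or pattern; the only common property is the storage format, matching the constraint stated on the slide.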

Flexible batched SpMV
First experiment: Use different batched SpMV kernels (COO, CSR, ELL) on a batch consisting of the same matrices (homogeneous batch).
Hardware: NVIDIA P100 GPU, 56 SMX, 5.3 TF DP, […] GB/s
Test matrices: CAGE_8, CAN838, DWT_922, EX25, EX27, GR_3_3

Disclaimer: This is an artificial problem setting! In a real-world scenario, a homogeneous batched SpMV would be handled as SpMM.
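The disclaimer can be made concrete with a short sketch: when every system in the batch uses the same matrix A, stacking the input vectors as columns of a dense matrix X turns the batch into a single sparse matrix times dense matrix product Y = A X, in which each nonzero of A is read once and reused for all columns. The matrix and vectors below are hypothetical.

```python
# CSR arrays of a hypothetical 2x2 matrix shared by the whole batch.
value = [1.0, 2.0, 3.0]
colidx = [0, 1, 1]
rowptr = [0, 2, 3]
xs = [[1.0, 1.0], [2.0, 0.0], [0.0, 5.0]]  # three input vectors


def csr_spmv(value, colidx, rowptr, x):
    """y = A x for a single vector."""
    n = len(rowptr) - 1
    return [sum(value[k] * x[colidx[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(n)]


def csr_spmm(value, colidx, rowptr, X):
    """Y = A X, with X given as a list of rows (n x m)."""
    n = len(rowptr) - 1
    m = len(X[0])
    Y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            a, j = value[k], colidx[k]  # each nonzero is loaded once ...
            for c in range(m):          # ... and reused for every column
                Y[i][c] += a * X[j][c]
    return Y


# Stack the vectors as columns: X[j][c] = xs[c][j].
X = [[x[j] for x in xs] for j in range(len(xs[0]))]
Y = csr_spmm(value, colidx, rowptr, X)
ys = [csr_spmv(value, colidx, rowptr, x) for x in xs]
columns_of_Y = [[Y[i][c] for i in range(len(Y))] for c in range(len(xs))]
```

Column c of Y equals the SpMV result for vector c, so the homogeneous batch carries no information that SpMM would not already exploit.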

[Figure: GFLOPs versus batch size for batched ELL, std. ELL, batched CSR, std. CSR, and batched COO, one panel per test matrix. Panel titles as transcribed: CAGE8, CAN838, DWT_922, ex, ex2, GR_3_.]

[Figure: GFLOPs for batched ELL, batched CSR, and batched COO.]

Second experiment: Use different batched SpMV kernels (COO, CSR, ELL) on a batch consisting of different matrices (inhomogeneous batch):
1. somewhat similar (similar size, nonzero count)
2. completely different

Batch of random similar-sized problems: n ∈ [9, 1], nz ∈ [3, 4]
[Figure: performance (GFLOPs) versus batch size for batched ELL, batched CSR, batched COO, std. CSR, and std. ELL; bandwidth (GB/s) versus batch size for batched ELL, batched CSR, and batched COO.]

Batch of random any-sized problems: n ∈ [1, 1], nz ∈ [1, 4]
[Figure: performance (GFLOPs) versus batch size for batched ELL, batched CSR, batched COO, std. CSR, and std. ELL; bandwidth (GB/s) versus batch size for batched ELL, batched CSR, and batched COO.]

Flexible batched SpMV on GPUs
Large number of small SpMV simultaneously.
Matrices can be different in size, nnz, pattern.
COO format most suitable for inhomogeneous batches.
This work is in collaboration with:

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran

More information

Iterative Sparse Triangular Solves for Preconditioning

Iterative Sparse Triangular Solves for Preconditioning Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra Incomplete Factorization Preconditioning Incomplete LU factorizations

More information

Dynamic Sparse Matrix Allocation on GPUs. James King

Dynamic Sparse Matrix Allocation on GPUs. James King Dynamic Sparse Matrix Allocation on GPUs James King Graph Applications Dynamic updates to graphs Adding edges add entries to sparse matrix representation Motivation Graph operations (adding edges) (e.g.

More information

Sparse Matrix Formats

Sparse Matrix Formats Christopher Bross Friedrich-Alexander-Universität Erlangen-Nürnberg Motivation Sparse Matrices are everywhere Sparse Matrix Formats C. Bross BGCE Research Day, Erlangen, 09.06.2016 2/16 Motivation Sparse

More information

Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra

Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra Generating and Automatically Tuning OpenCL Code for Sparse Linear Algebra Dominik Grewe Anton Lokhmotov Media Processing Division ARM School of Informatics University of Edinburgh December 13, 2010 Introduction

More information

Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication. Steve Rennich Nvidia Developer Technology - Compute

Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication. Steve Rennich Nvidia Developer Technology - Compute Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication Steve Rennich Nvidia Developer Technology - Compute Block Sparse Matrix Vector Multiplication Sparse Matrix-Vector Multiplication

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi Erik Saule 1, Kamer Kaya 1 and Ümit V. Çatalyürek 1,2 esaule@uncc.edu, {kamer,umit}@bmi.osu.edu 1 Department of Biomedical

More information

Lecture 6: Input Compaction and Further Studies

Lecture 6: Input Compaction and Further Studies PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 6: Input Compaction and Further Studies 1 Objective To learn the key techniques for compacting input data for reduced consumption of

More information

GPU-Based Acceleration for CT Image Reconstruction

GPU-Based Acceleration for CT Image Reconstruction GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed

More information

Efficient Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors

Efficient Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors Efficient Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors Xing Liu School of Computational Science and Engineering Georgia Institute of Technology, Atlanta, Georgia xing.liu@gatech.edu

More information

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors Kaixi Hou, Wu-chun Feng {kaixihou, wfeng}@vt.edu Shuai Che Shuai.Che@amd.com Sparse

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Optimizing the operations with sparse matrices on Intel architecture

Optimizing the operations with sparse matrices on Intel architecture Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs

A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs Procedia Computer Science Volume 80, 2016, Pages 178 189 ICCS 2016. The International Conference on Computational Science A Performance Prediction and Analysis Integrated Framework for SpMV on GPUs Ping

More information

Master Thesis. Master Program of Computer Science

Master Thesis. Master Program of Computer Science Hochschule Bonn-Rhein-Sieg University of Applied Sciences Fachbereich Informatik Computer Science Department Master Thesis Master Program of Computer Science Requirement Analysis and Realization of Efficient

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices CSE 599 I Accelerated Computing - Programming GPUS Parallel Pattern: Sparse Matrices Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

More information

Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU

Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking

More information

Parallel Combinatorial BLAS and Applications in Graph Computations

Parallel Combinatorial BLAS and Applications in Graph Computations Parallel Combinatorial BLAS and Applications in Graph Computations Aydın Buluç John R. Gilbert University of California, Santa Barbara SIAM ANNUAL MEETING 2009 July 8, 2009 1 Primitives for Graph Computations

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

A Cross-Platform SpMV Framework on Many-Core Architectures

A Cross-Platform SpMV Framework on Many-Core Architectures A Cross-Platform SpMV Framework on Many-Core Architectures YUNQUAN ZHANG and SHIGANG LI, State Key Laboratory of Computer Architecture, Institute of Computing Technologies, Chinese Academy of Sciences

More information

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu

More information

Module Memory and Data Locality

Module Memory and Data Locality GPU Teaching Kit Accelerated Computing Module 4.5 - Memory and Data Locality Handling Arbitrary Matrix Sizes in Tiled Algorithms Objective To learn to handle arbitrary matrix sizes in tiled matrix multiplication

More information

SpVM Acceleration with Latency Masking Threads on FPGAs

SpVM Acceleration with Latency Masking Threads on FPGAs Technical Report UCR-CSE-2014-04001 SpVM Acceleration with Latency Masking Threads on FPGAs Robert J. Halstead, Walid A. Najjar and Omar Huseini Computer Science & Engineering UC Riverside Riverside, CA

More information

Portable, usable, and efficient sparse matrix vector multiplication

Portable, usable, and efficient sparse matrix vector multiplication Portable, usable, and efficient sparse matrix vector multiplication Albert-Jan Yzelman Parallel Computing and Big Data Huawei Technologies France 8th of July, 2016 Introduction Given a sparse m n matrix

More information

Efficient sparse matrix-vector multiplication on cache-based GPUs

Efficient sparse matrix-vector multiplication on cache-based GPUs Efficient sparse matrix-vector multiplication on cache-based GPUs ABSTRACT István Reguly Faculty of Information Technology Pázmány Péter Catholic University Hungary reguly.istvan@itk.ppke.hu Sparse matrix-vector

More information

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/

More information

State of Art and Project Proposals Intensive Computation

State of Art and Project Proposals Intensive Computation State of Art and Project Proposals Intensive Computation Annalisa Massini - 2015/2016 Today s lecture Project proposals on the following topics: Sparse Matrix- Vector Multiplication Tridiagonal Solvers

More information

Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations

Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations Hartwig Anzt 1, Marc Baboulin 2, Jack Dongarra 1, Yvan Fournier 3, Frank Hulsemann 3, Amal Khabou 2, and Yushan Wang 2 1 University

More information

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 18C (217) 1783 1792 International Conference on Computational Science, ICCS 217, 12-14 June 217, Zurich, Switzerland Variable-Size

More information

Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU

Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU Markus Steinberger Max Planck Institute for Informatics Saarland Informatics Campus msteinbe@mpi-inf.mpg.de ABSTRACT

More information

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011!

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! Administrative Class cancelled, Tuesday, November 15 Guest Lecture, Thursday, November 17, Ganesh Gopalakrishnan CUDA Project

More information

SPARSE PERSISTENT RNN. Feiwen Zhu, 5/9/2017

SPARSE PERSISTENT RNN. Feiwen Zhu, 5/9/2017 SPARSE PERSISTENT RNN Feiwen Zhu, 5/9/2017 Motivation Introduction Algorithm AGENDA Naïve Implementation Optimizations Experiments Conclusion 2 MOTIVATION Exploit sparsity for faster, larger networks Recurrent

More information

Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU

Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU 2011 International Conference on Parallel Processing Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU Kiran Kumar Matam CSTAR, IIIT-Hyderabad Gachibowli, Hyderabad, India

More information

Tiled Matrix Multiplication

Tiled Matrix Multiplication Tiled Matrix Multiplication Basic Matrix Multiplication Kernel global void MatrixMulKernel(int m, m, int n, n, int k, k, float* A, A, float* B, B, float* C) C) { int Row = blockidx.y*blockdim.y+threadidx.y;

More information

A Hybrid Format for Better Performance of Sparse Matrix-Vector Multiplication on a GPU

A Hybrid Format for Better Performance of Sparse Matrix-Vector Multiplication on a GPU A Hybrid Format for Better Performance of Sparse Matrix-Vector Multiplication on a GPU Dahai Guo dahai@illinois.edu William Gropp wgropp@illinois.edu Luke N. Olson lukeo@illinois.edu Abstract In this paper,

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO Saurav Muralidharan University of Utah nitro-tuner.github.io Disclaimers This research was funded in part by the U.S. Government. The

More information

Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation

Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation Dominik Grewe Institute for Computing Systems Architecture School of Informatics University

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA

More information

Optimizing Sparse Data Structures for Matrix-Vector Multiply

Optimizing Sparse Data Structures for Matrix-Vector Multiply Summary Optimizing Sparse Data Structures for Matrix-Vector Multiply William Gropp (UIUC) and Dahai Guo (NCSA) Algorithms and Data Structures need to take memory prefetch hardware into account This talk

More information

Design of efficient sparse matrix-vector multiplication for Fermi GPUs

Design of efficient sparse matrix-vector multiplication for Fermi GPUs Design of efficient sparse matrix-vector multiplication for Fermi GPUs ABSTRACT The evaluation of sparse matrix-vector products is an integral part of a great variety of scientific algorithms. Several

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100 CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,

More information

Case study: OpenMP-parallel sparse matrix-vector multiplication

Case study: OpenMP-parallel sparse matrix-vector multiplication Case study: OpenMP-parallel sparse matrix-vector multiplication A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory Sparse matrix-vector multiply (spmvm)

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors

Unveiling the performance-energy trade-off in iterative linear system solvers for multithreaded processors CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com)..3341 SPECIAL ISSUE PAPER Unveiling the

More information

Sparse Matrix-Vector Multiplication on GPU

Sparse Matrix-Vector Multiplication on GPU Sparse Matrix-Vector Multiplication on GPU DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Arash

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUs Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research ACCELERATING MATRIX PROCESSING WITH GPUS MOTIVATION Matrix operations

More information

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL (stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

HPCG Performance Improvement on the K computer ~short brief~

HPCG Performance Improvement on the K computer ~short brief~ HPCG Performance Improvement on the K computer ~short brief~ Kiyoshi Kumahata, Kazuo Minami RIKEN AICS HPCG BoF@Room15 SC15 Austin Evaluate Original Code 1/2 Weak Scaling Measurement 9,500 GFLOPS in 32,768

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Adaptable benchmarks for register blocked sparse matrix-vector multiplication

Adaptable benchmarks for register blocked sparse matrix-vector multiplication Adaptable benchmarks for register blocked sparse matrix-vector multiplication Berkeley Benchmarking and Optimization group (BeBOP) Hormozd Gahvari and Mark Hoemmen Based on research of: Eun-Jin Im Rich

More information

Accelerated Load Balancing of Unstructured Meshes

Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel MKL BLAS/Sparse BLAS Original (dense) BLAS available from www.netlib.org Additionally Intel MKL provides

A Study of SpMV Implementation using MPI and OpenMP on Intel Many-Core Architecture

Fan Ye 1,2, Christophe Calvin 1, Serge Petiton 2,3 1 CEA/DEN/DANS/DM2S, CEA Saclay, France 2 LIFL, Université de Lille

Tesla Architecture, CUDA and Optimization Strategies

Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Outline Tesla Architecture & CUDA CUDA Programming Optimization

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

Introduction to GPGPU and GPU-architectures

Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Last time: Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

CS 677: Parallel Programming for Many-core Processors Lecture 7

Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11

Lecture 1: Introduction and Computational Thinking

PASI Summer School Advanced Algorithmic Techniques for GPUs Course Objective To master the most commonly used algorithm techniques and computational

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College

Solving Dense Linear Systems on Graphics Processors

Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Ahmad Abdelfattah 1, Jack Dongarra 2, David Keyes 1 and Hatem Ltaief 3 1 KAUST Division of Mathematical and Computer Sciences and

Optimizing Parallel Reduction in CUDA

Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each

Alleviating memory-bandwidth limitations for scalability and energy efficiency

Lessons learned from the optimization of SpMxV Georgios Goumas goumas@cslab.ece.ntua.gr Computing Systems Laboratory National

clmf: A Fine-Grained and Portable Alternating Least Squares Algorithm for Parallel Matrix Factorization

Jing Chen a, Jianbin Fang a, Weifeng Liu b, Tao Tang a, Canqun Yang a a College of Computer, National

Three Storage Formats for Sparse Matrices on GPGPUs

Davide Barbieri Valeria Cardellini Alessandro Fanfarillo Salvatore Filippone Dipartimento di Ingegneria Civile e Ingegneria Informatica Università di

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators

Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost

Optimizing Sparse Matrix-Matrix Multiplication on a Heterogeneous CPU-GPU Platform

Georgia State University ScholarWorks @ Georgia State University Computer Science Theses Department of Computer Science 12-16-2015

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31st May 2016 L. Wang, Y. Wang, C. Yang,

Efficient PageRank and SpMV Computation on AMD GPUs

2010 39th International Conference on Parallel Processing Tianji WU, Bo WANG, Yi SHAN, Feng YAN, Yu WANG and Ningyi XU Department of Electronic Engineering,

GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer

ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock

Towards a Universal FPGA Matrix-Vector Multiplication Architecture

Srinidhi Kestur, John D. Davis, Eric S. Chung Dept. of Computer Science and Engineering, The Pennsylvania State University Microsoft Research

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

Dynamic Sparse-Matrix Allocation on GPUs

James King, Thomas Gilray, Robert M. Kirby, and Matthew Might University of Utah {jsking2,tgilray,kirby,might}@cs.utah.edu Abstract. Sparse matrices are a core

Solving the heat equation with CUDA

Oliver Meister January 09th 2013 Last Tutorial CSR kernel - scalar One row per thread No coalesced memory access Non-uniform matrices CSR kernel - vectorized One row

CUDA Toolkit 4.1 CUSPARSE Library. PG _v01 January 2012

PG-05329-041_v01 January 2012 Contents 1 Introduction 1.1 New and Legacy CUSPARSE API 1.2 Naming Convention

Gerald Schubert 1, Georg Hager 2, Holger Fehske 1, Gerhard Wellein 2,3 1

Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming Gerald Schubert 1, Georg Hager 2, Holger Fehske 1, Gerhard Wellein 2,3 1 Institute of Physics, University

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let's look at it

Fengguang Song Department of Computer Science IUPUI

Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures

Alexander Monakov 1, Anton Lokhmotov 2, and Arutyun Avetisyan 1 1 Institute for System Programming of RAS, 25 Solzhenitsyna

GPU Sparse Graph Traversal

Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform

J Supercomput (2013) 63:710-721 DOI 10.1007/s11227-011-0626-0 Shiming Xu Wei Xue Hai Xiang Lin Published

Data Structures for sparse matrices

The use of proper data structures is critical to achieving good performance. Generate a symmetric sparse matrix A in MATLAB and time the operations of accessing (only)

Generating Optimized Sparse Matrix Vector Product over Finite Fields

Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.