Flexible Batched Sparse Matrix-Vector Product on GPUs
Transcription
1 ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 13, 2017. Flexible Batched Sparse Matrix-Vector Product on GPUs. Hartwig Anzt, Gary Collins, Jack Dongarra, Goran Flegar, Enrique S. Quintana-Ortí
2 Input: A, x, y. Output: y = A x. Matrix A contains only a few nonzero elements. Storing all entries results in large overhead (memory & computation).
3 Input: A, x, y. Output: y = A x. Matrix A contains only a few nonzero elements. Storing all entries results in large overhead (memory & computation). Idea: Store only the nonzero elements [nz] explicitly. [Figure: example matrix A and its value array.]
4 Idea: Store only the nonzero elements [nz] explicitly. We then also need to store the location of the nonzero elements! COO format: memory footprint nz (val) + 2 nz (int). [Figure: example matrix A and its value, colidx (column-index), and rowidx (row-index) arrays.]
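As a small sketch (not taken from the slides), here is the COO layout for an invented 4x4 example matrix, with a sequential reference SpMV on the host:

// Hypothetical 4x4 example matrix (nz = 7 nonzeros):
//     [ 1 0 2 0 ]
// A = [ 0 3 0 0 ]
//     [ 4 0 5 6 ]
//     [ 0 0 0 7 ]
#include <stdio.h>

int main(void) {
    const int n = 4, nz = 7;
    double value[]  = {1, 2, 3, 4, 5, 6, 7};   // nonzero values
    int    rowidx[] = {0, 0, 1, 2, 2, 2, 3};   // row of each nonzero
    int    colidx[] = {0, 2, 1, 0, 2, 3, 3};   // column of each nonzero
    double x[] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};

    for (int k = 0; k < nz; ++k)               // y = A*x: one update per nonzero
        y[rowidx[k]] += value[k] * x[colidx[k]];

    for (int i = 0; i < n; ++i)
        printf("y[%d] = %g\n", i, y[i]);       // expected: 3 3 15 7
    return 0;
}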
5 Matrix A contains only a few nonzero elements, so we store only the nonzero elements [nz] explicitly, together with their locations. CSR format: instead of one row index per element, store one pointer to the first element of each row. Memory footprint of COO format: nz (val) + 2 nz (int). Memory footprint of CSR format: nz (val) + nz (int) + (n+1) (int). [Figure: example matrix A and its value, colidx (column-index), and rowptr arrays; rowptr points to the first element in each row, and its last entry is the number of nonzero elements.]
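For comparison, a sketch of the same invented 4x4 example in CSR; rowptr compresses the per-element row indices into n+1 row pointers:

// y = A*x for an n-row CSR matrix (sequential reference)
void spmv_csr(int n, const int *rowptr, const int *colidx,
              const double *value, const double *x, double *y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)  // nonzeros of row i
            sum += value[k] * x[colidx[k]];
        y[i] = sum;
    }
}
// For the 4x4 example above:
//   value  = {1, 2, 3, 4, 5, 6, 7}
//   colidx = {0, 2, 1, 0, 2, 3, 3}
//   rowptr = {0, 2, 3, 6, 7}   // n+1 entries; rowptr[n] == nz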
6 How to parallelize this? [Figure: example matrix A in COO format with its value, colidx, and rowidx arrays.]
7 How to parallelize this? Parallelize by rows: every thread handles the computation of one row sum in local memory. Significant workload imbalance! Needs branching logic; branch divergence on vector machines. [Figure: threads T1-T6 assigned one row each; marks show accesses to the input vector x and the output vector y.]
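A minimal CUDA sketch (not the authors' code) of this row-parallel strategy for CSR, one thread per row:

__global__ void spmv_csr_scalar(int n, const int *rowptr, const int *colidx,
                                const double *value, const double *x,
                                double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        // loop length = nonzeros in this row: short rows sit idle while
        // long rows keep their warp busy (branch divergence / imbalance)
        for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
            sum += value[k] * x[colidx[k]];
        y[row] = sum;
    }
}
// launch: spmv_csr_scalar<<<(n + 255) / 256, 256>>>(n, rowptr, colidx, value, x, y);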
8 How to parallelize this? Parallelize by rows: every thread handles the computation of one row sum in local memory. Balanced workload, but can result in significant overhead for unbalanced problems. ELL format: values and column indices are padded to a uniform row length. [Figure: padded value and colidx arrays; threads T1-T6 assigned one row each.]
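A minimal CUDA sketch of an ELL kernel, assuming the usual GPU convention that the padded arrays are stored column-major so consecutive threads access consecutive memory:

__global__ void spmv_ell(int n, int max_nz_per_row, const int *colidx,
                         const double *value, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        // every thread runs exactly max_nz_per_row iterations (no divergence);
        // padded slots hold value 0 and any valid column index
        for (int k = 0; k < max_nz_per_row; ++k) {
            int idx = k * n + row;            // column-major padded storage
            sum += value[idx] * x[colidx[idx]];
        }
        y[row] = sum;
    }
}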
9 How to parallelize this? Parallelize by rows: every thread handles the computation of one row sum in local memory. Significant workload imbalance! Ordered access to the input vector x. [Figure: threads T1-T6 assigned one row each of the COO arrays.]
10-11 How to parallelize this? Parallelize by elements: balanced workload, but the partial sums need synchronization: write conflicts! [Figure: threads T1-T6 assigned one nonzero element each; several threads update the same entry of the output vector y.] G. Flegar et al.: Overcoming Load Imbalance for Irregular Sparse Matrices, IA^3, 2017.
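A minimal CUDA sketch of the element-parallel strategy for COO. The write conflicts are resolved here with atomicAdd purely for illustration; the cited paper resolves them with a more refined warp-level segmented reduction:

__global__ void spmv_coo(int nz, const int *rowidx, const int *colidx,
                         const double *value, const double *x, double *y) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nz)                                // one thread per nonzero
        atomicAdd(&y[rowidx[k]], value[k] * x[colidx[k]]);
}
// y must be zeroed before the launch; double-precision atomicAdd
// requires compute capability >= 6.0 (e.g., the P100 used later).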
12-13 Different kernels are optimal for different problem classes. CSR: small memory footprint; needs some logic for row-parallel processing. ELL: zero-padding allows for efficient SIMD execution; efficient for balanced matrices. COO: can compensate workload imbalance for irregular patterns. For a single problem, we can usually find an optimal kernel, BUT...
14 What if we process many different matrices at a time? (Assume they are all small.)
15-16 What if we process many different matrices at a time? (Assume they are all small.) Design a batched SpMV kernel: process a large number of data-independent problems in parallel.
17-18 Are the problems the same size? Same number of nonzeros overall? Same number of nonzeros in every row? Same sparsity pattern? Let's be as flexible as possible! Flexible means everything can be different: every thread block handles one system; memory pointers to distinct systems; load the input vector x into shared memory; kernels for all matrices stored in CSR, COO, or ELL.
19 [Figure: GPU thread blocks, each loading its respective input vector x.] All matrices are stored in the same format. Every thread block handles one system, memory pointers address the distinct systems, and each block loads its input vector x into shared memory; a sketch of this design is given below.
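A hedged sketch of this design for the CSR case; the function name and parameter layout are hypothetical, not the authors' actual interface. One thread block serves one system, per-system pointers allow the matrices to differ in size and pattern, and x is staged in shared memory:

__global__ void batched_spmv_csr(const int *n,                 // size of each system
                                 const int *const *rowptr,     // per-system pointers
                                 const int *const *colidx,
                                 const double *const *value,
                                 const double *const *x,
                                 double *const *y) {
    extern __shared__ double xs[];       // this system's input vector x
    int b = blockIdx.x;                  // one thread block per system
    int nb = n[b];
    for (int i = threadIdx.x; i < nb; i += blockDim.x)
        xs[i] = x[b][i];                 // cooperative load of x into shared memory
    __syncthreads();
    for (int row = threadIdx.x; row < nb; row += blockDim.x) {
        double sum = 0.0;
        for (int k = rowptr[b][row]; k < rowptr[b][row + 1]; ++k)
            sum += value[b][k] * xs[colidx[b][k]];
        y[b][row] = sum;
    }
}
// launch with one block per system and shared memory sized for the largest x:
// batched_spmv_csr<<<num_systems, 128, max_n * sizeof(double)>>>(n, rowptr, colidx, value, x, y);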
20-21 Flexible batched SpMV. First experiment: use the different batched SpMV kernels (COO, CSR, ELL) on a batch consisting of copies of the same matrix (homogeneous batch). NVIDIA P100 GPU (56 SMs, 5.3 TFLOP/s double precision). Test matrices: CAGE_8, CAN838, DWT_922, EX25, EX27, GR_3_3. Disclaimer: this is an artificial problem setting! In a real-world scenario, a homogeneous batched SpMV would be handled as SpMM.
22 Flexible batched SpMV. [Performance plots: GFLOPs vs. batch size for CAGE8, CAN838, DWT_922, GR_3_3, EX25, and EX27, comparing batched ELL, standard ELL, batched CSR, standard CSR, and batched COO kernels.]
23 Flexible batched SpMV. [Performance plot: GFLOPs for the batched ELL, batched CSR, and batched COO kernels.]
24 Flexible batched SpMV. Second experiment: use the different batched SpMV kernels (COO, CSR, ELL) on a batch consisting of different matrices (inhomogeneous batch): 1. somewhat similar (similar size, nonzero count); 2. completely different.
25 Flexible batched SpMV. Batch of random similar-sized problems: n ∈ [90, 100], nz ∈ [30, 40]. [Plots: performance (GFLOPs) and bandwidth (GB/s) vs. batch size for batched ELL, batched CSR, batched COO, standard CSR, and standard ELL.]
26 Flexible batched SpMV. Batch of random similar-sized problems: n ∈ [90, 100], nz ∈ [30, 40]. Batch of random any-sized problems: n ∈ [10, 100], nz ∈ [10, 40]. [Plots: performance (GFLOPs) and bandwidth (GB/s) vs. batch size for both batches, comparing batched ELL, batched CSR, batched COO, standard CSR, and standard ELL.]
27 ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 13, 2017. Flexible batched SpMV on GPUs: a large number of small SpMVs processed simultaneously; the matrices can differ in size, nnz, and pattern; the COO format is most suitable for inhomogeneous batches. This work is in collaboration with: [collaborator logos]