Dynamic Sparse Matrix Allocation on GPUs. James King
1 Dynamic Sparse Matrix Allocation on GPUs
James King
2 Graph Applications
- Dynamic updates to graphs
- Adding edges adds entries to the sparse matrix representation
3 Motivation
- Graph operations that add edges (e.g. transitive closure of a graph)
- Iterative updates (repeated): Ax = b, A = A + B, A = A
4 Sparse Matrices
- Store only the nonzero values
- Sparsity of an MxN matrix: nnz / (M N)
- Common formats: COO, ELL, CSR, HYB, DIA, etc.
5 Coordinate Format (COO)
- Stores tuples of row index, column index, and value using 3 arrays
[Figure: Row Indices, Column Indices, and Values arrays]
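
The three-array layout can be illustrated with a minimal CPU-side sketch in Python. The function name `dense_to_coo` and the example matrix are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def dense_to_coo(dense):
    """Convert a dense row-major matrix (list of lists) into three COO arrays."""
    row_idx, col_idx, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:  # only nonzero entries are stored
                row_idx.append(i)
                col_idx.append(j)
                values.append(v)
    return row_idx, col_idx, values


example = [[1, 0, 2],
           [0, 0, 3],
           [4, 5, 0]]
row_idx, col_idx, values = dense_to_coo(example)
```

Entry k of the matrix is the tuple (row_idx[k], col_idx[k], values[k]), so appending a new nonzero only requires appending one element to each array.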
6 Compressed Sparse Row (CSR)
- Row indices are compressed into row offsets
- Column indices and values are stored uncompressed
[Figure: Row Offsets, Column Indices, and Values arrays]
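
A small Python sketch of how row offsets can be derived from COO row indices, assuming a counting pass followed by a prefix sum. The name `coo_to_csr` is hypothetical, and this version stores M+1 offsets so that the last entry equals nnz.

```python
def coo_to_csr(num_rows, rows, cols, vals):
    """Build CSR arrays from COO arrays; input entries may be in any order."""
    counts = [0] * num_rows
    for r in rows:
        counts[r] += 1
    # Prefix sum of per-row counts gives the start of each row.
    row_offsets = [0] * (num_rows + 1)
    for i in range(num_rows):
        row_offsets[i + 1] = row_offsets[i] + counts[i]
    col_idx = [0] * len(vals)
    values = [0] * len(vals)
    cursor = row_offsets[:-1].copy()  # next free slot in each row
    for r, c, v in zip(rows, cols, vals):
        col_idx[cursor[r]] = c
        values[cursor[r]] = v
        cursor[r] += 1
    return row_offsets, col_idx, values
```

Row i occupies positions row_offsets[i] up to (but not including) row_offsets[i + 1] in the column index and value arrays, which is why inserting a single entry into the middle of the structure forces later entries to move.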
7 Ellpack (ELL) / Hybrid-Ellpack (HYB)
- ELL stores all rows with a fixed column width W
- HYB combines ELL with a COO matrix for overflow rows
[Figure: HYB example with an ELL part (Column Indices, Values) and a COO part (Row Indices, Column Indices, Values)]
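
The split between the fixed-width ELL part and the COO overflow can be sketched as follows. This is an illustrative CPU-side version: the function name `coo_to_hyb` and the `PAD` sentinel for unused ELL slots are assumptions, not details taken from the slides.

```python
PAD = -1  # assumed sentinel column index marking an unused ELL slot


def coo_to_hyb(num_rows, width, rows, cols, vals):
    """Place up to `width` entries per row in ELL; spill the rest into COO."""
    ell_cols = [[PAD] * width for _ in range(num_rows)]
    ell_vals = [[0] * width for _ in range(num_rows)]
    fill = [0] * num_rows  # number of ELL slots used in each row
    coo_rows, coo_cols, coo_vals = [], [], []
    for r, c, v in zip(rows, cols, vals):
        if fill[r] < width:
            ell_cols[r][fill[r]] = c
            ell_vals[r][fill[r]] = v
            fill[r] += 1
        else:  # row is wider than W: entry overflows into the COO part
            coo_rows.append(r)
            coo_cols.append(c)
            coo_vals.append(v)
    return (ell_cols, ell_vals), (coo_rows, coo_cols, coo_vals)
```

With W = 2, a row holding three entries keeps two in ELL and sends the third to COO, while a row holding one entry wastes one padded slot.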
8 Current Formats
- Current formats are inefficient for dynamic updates
- Compressed sparse formats like CSR must be rebuilt for each update
- COO-like formats must be sorted for efficient SpMV
9 Dynamic Compressed Sparse Row (DCSR)
- Stores K segments per row, with the starting and ending index of each segment
- Segments can be dynamically allocated and need not be in order
[Figure: Row Offsets (segments), Column Indices, Values, and Row Sizes arrays]
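
A minimal Python sketch of a segmented row structure in the spirit of DCSR. The class layout, the method names `append_segment` and `row_entries`, and the fixed-capacity pool are assumptions made for illustration; the actual GPU data layout is not specified in this transcript.

```python
class DCSR:
    """Toy DCSR: each row owns up to K (start, end) segments in a shared pool."""

    def __init__(self, num_rows, max_segments, capacity):
        self.K = max_segments
        self.seg_bounds = [[] for _ in range(num_rows)]  # (start, end), end exclusive
        self.row_sizes = [0] * num_rows
        self.col_idx = [0] * capacity
        self.values = [0.0] * capacity
        self.alloc_ptr = 0  # first unallocated slot in the pool

    def append_segment(self, row, cols, vals):
        """Write new entries for `row` into a freshly allocated segment."""
        if len(self.seg_bounds[row]) == self.K:
            raise RuntimeError("row has no free segment; defragment first")
        start, end = self.alloc_ptr, self.alloc_ptr + len(cols)
        if end > len(self.col_idx):
            raise MemoryError("pool exhausted")
        self.col_idx[start:end] = cols
        self.values[start:end] = vals
        self.alloc_ptr = end
        self.seg_bounds[row].append((start, end))
        self.row_sizes[row] += len(cols)

    def row_entries(self, row):
        """Gather a row's (column, value) pairs by walking its segments."""
        out = []
        for start, end in self.seg_bounds[row]:
            out.extend(zip(self.col_idx[start:end], self.values[start:end]))
        return out
```

Because a row's segments may sit anywhere in the pool, an insertion never moves existing entries; it only appends a segment and updates that row's bookkeeping.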
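10 Memory Footprint
Memory footprint for an MxN sparse matrix:
Matrix Format   Memory Footprint
COO             3(nnz)
ELL             2MW
HYB             2MW + 3(ovf)
CSR             M + 2(nnz)
DCSR            2MK + 2(nnz)
(W: ELL column width; ovf: number of overflow entries stored in COO; K: segments per row)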
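
The footprint formulas can be compared numerically with a short Python sketch. The factors of two are an assumption here: they correspond to storing one index word and one value word per slot (and a start and an end index per DCSR segment). The function name `footprints` and the sample sizes are hypothetical.

```python
def footprints(M, nnz, W, ovf, K):
    """Return the number of stored words per format (assumed formulas)."""
    return {
        "COO": 3 * nnz,                # row index, column index, value per entry
        "ELL": 2 * M * W,              # column index and value per padded slot
        "HYB": 2 * M * W + 3 * ovf,    # ELL part plus COO overflow
        "CSR": M + 2 * nnz,            # row offsets plus column index and value
        "DCSR": 2 * M * K + 2 * nnz,   # start/end per segment plus entries
    }


sizes = footprints(M=4, nnz=10, W=3, ovf=1, K=2)
```

Under these assumptions DCSR costs 2MK - M more words than CSR, so the overhead depends on the segment count K rather than on nnz.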
11 COO Memory Fill
- Elements are appended to the end of the matrix
12 DCSR Memory Fill
- Elements are added in new segments
- Gaps between segments are allowed
13 Dynamic Allocation
- A memory offset pointer keeps track of the currently allocated space
- When a new segment is allocated, the offset pointer is atomically adjusted
- Defragmentation compacts each row's segments into 1 segment
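
The atomically adjusted offset pointer behaves like a bump allocator. The sketch below emulates it on the CPU in Python, with a lock standing in for a GPU atomic add; the class name `OffsetPointer`, the helper `run_demo`, and the segment sizes are hypothetical.

```python
import threading


class OffsetPointer:
    """Bump allocator; the lock stands in for an atomic add on the GPU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0
        self._lock = threading.Lock()

    def allocate(self, size):
        """Reserve `size` slots and return their start index, or -1 if full."""
        with self._lock:
            start = self.offset
            if start + size > self.capacity:
                return -1
            self.offset = start + size
            return start


def run_demo():
    """Let several threads allocate segments concurrently."""
    pool = OffsetPointer(capacity=1000)
    granted = []

    def worker(size):
        granted.append((pool.allocate(size), size))

    threads = [threading.Thread(target=worker, args=(s,)) for s in (3, 5, 2, 7)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool, sorted(granted)
```

Whatever order the threads run in, each one receives a disjoint range, and the final offset equals the total number of slots handed out.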
14 Dynamic Insertions
[Figure: DCSR arrays (Segs., Row Offsets, Column Indices, Values) before and after inserting entries (Row Indices, Column Indices, Values)]
15 Defragmentation
- Gaps can form between segments due to insertion and deletion
- Defragmentation performs a prefix sum operation on the row sizes
- Row segments are scattered to their appropriate location in newly allocated arrays
16 Defragmentation Algorithm
1. Exclusive scan on row sizes to get offsets
2. Allocate new memory space
3. Shuffle column indices and values to offsets within new memory space
4. Adjust entries in row offsets array
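
The four steps can be sketched sequentially in Python. This is a CPU-side illustration under assumed data layouts (a list of (start, end) segments per row, end exclusive); the function names are hypothetical, and on a GPU the scan and scatter would run in parallel.

```python
def exclusive_scan(sizes):
    """Return the exclusive prefix sums of `sizes` and their total."""
    offsets, total = [], 0
    for s in sizes:
        offsets.append(total)
        total += s
    return offsets, total


def defragment(seg_bounds, row_sizes, col_idx, values):
    """Compact every row's segments into one contiguous segment."""
    # Step 1: exclusive scan on row sizes to get the new row offsets.
    new_offsets, total = exclusive_scan(row_sizes)
    # Step 2: allocate new memory space.
    new_cols = [0] * total
    new_vals = [0.0] * total
    new_bounds = []
    for row, segments in enumerate(seg_bounds):
        dst = new_offsets[row]
        # Step 3: move each segment's entries to the row's new offset.
        for start, end in segments:
            n = end - start
            new_cols[dst:dst + n] = col_idx[start:end]
            new_vals[dst:dst + n] = values[start:end]
            dst += n
        # Step 4: the row now has a single segment.
        new_bounds.append([(new_offsets[row], dst)])
    return new_bounds, new_cols, new_vals
```

This sketch keeps entries in segment order within a row; it does not sort columns inside a row.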
17 Defragmentation
[Figure: defragmentation example showing Segs., Row Offsets, Column Indices, Values, Row Sizes, and New offsets]
18 Inserting Entries: Defragmentation
[Figure: entries (Row Indices, Column Indices, Values) inserted into the Segs./Row Offsets, Column Indices, and Values arrays, followed by defragmentation using the New offsets]
19 SpMV
- Compatible with standard CSR SpMV algorithms (CSR-scalar, CSR-vector, etc.)
- A loop is added to SpMV to iterate over segments
[Chart: SpMV GFLOPS for CSR, DCSR, Def. DCSR, and HYB on the matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]
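
The extra loop over segments can be shown with a sequential Python sketch modeled on a CSR-scalar kernel, where the outer loop body corresponds to the work of one thread. The function name `dcsr_spmv` and the per-row list of (start, end) segments are assumptions for illustration.

```python
def dcsr_spmv(seg_bounds, col_idx, values, x):
    """Compute y = A x for a matrix stored as per-row segments."""
    y = [0.0] * len(seg_bounds)
    for row, segments in enumerate(seg_bounds):  # one thread per row in CSR-scalar
        acc = 0.0
        for start, end in segments:  # the loop added for DCSR
            for k in range(start, end):  # ordinary CSR inner loop
                acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```

For a defragmented matrix every row has exactly one segment, so the added loop runs once per row and the computation reduces to plain CSR SpMV.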
20 SpMV Optimizations
- DCSR is compatible with CSR optimizations
- Bin rows by row size for optimized performance (Ashari et al. 14)
- Sort tuples of bin ID, row index, and row size
- Prefix-sum over permuted row sizes to get offsets
- Defragmentation will group by row sizes
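
A minimal Python sketch of the binning steps, assuming power-of-two bins (bin b holds rows with 2^(b-1) < size <= 2^b). The bin boundaries and the function names `bin_id` and `bin_rows` are assumptions; the slides do not state the exact binning rule.

```python
def bin_id(row_size):
    """Assumed rule: bin b holds sizes in (2^(b-1), 2^b]; sizes 0 and 1 share bin 0."""
    return max(row_size - 1, 0).bit_length()


def bin_rows(row_sizes):
    """Sort rows by (bin ID, row index, row size) and scan the permuted sizes."""
    tuples = sorted((bin_id(s), row, s) for row, s in enumerate(row_sizes))
    permuted_rows = [row for _, row, _ in tuples]
    offsets, total = [], 0
    for _, _, size in tuples:  # exclusive prefix sum over permuted row sizes
        offsets.append(total)
        total += size
    return tuples, permuted_rows, offsets
```

Using the resulting offsets as scatter targets during defragmentation places rows of similar size next to each other in memory.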
21 Optimized SpMV
- Improved memory access due to row size groupings and row sorting
[Chart: SpMV GFLOPS for CSR, ADCSR, Def. ADCSR, and HYB on the matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]
22 Results
- Iterative updates with SpMV operations
[Chart: relative speedup of DCSR, HYB, and CSR on the matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]
23 Conversion Between Formats
[Chart: relative conversion times for COO -> CSR, COO -> DCSR, COO -> HYB, CSR -> DCSR, CSR -> HYB, and DCSR -> CSR]
24 Sorting vs. Defragmentation
[Chart: relative time comparison of DCSR and HYB on the matrices AMA, CNR, DBL, ENR, EU, FLI, HOL, IN, IND, INT, KRO, LJO, RAL, SOC, WEB, WIK]
25 Sparse Matrix-Matrix Multiplication (SpMM)
- The GEMM algorithm is inefficient in the sparse case
- Many entries are zero and need not be operated on
[Figure: sparse matrix product]
26 Related Work
Bell et al. 13, "Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods":
1. Compute partial products
2. Sort partial products
3. Reduce partial products
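
The compute, sort, and reduce steps can be sketched in Python on a small example. This sequential version is only an illustration of the three steps, not the cited GPU implementation; the function name `esc_spmm` and the input layout (A as COO triples, B as a mapping from row index to (column, value) pairs) are assumptions.

```python
from itertools import groupby


def esc_spmm(a_coo, b_rows):
    """Multiply sparse A and B via expanded partial products."""
    # 1. Compute partial products a_ij * b_jk for every nonzero a_ij.
    partial = [(i, k, a * b) for i, j, a in a_coo for k, b in b_rows.get(j, [])]
    # 2. Sort partial products by (row, column).
    partial.sort(key=lambda t: (t[0], t[1]))
    # 3. Reduce runs that share the same (row, column).
    result = []
    for (i, k), group in groupby(partial, key=lambda t: (t[0], t[1])):
        result.append((i, k, sum(v for _, _, v in group)))
    return result


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]
a_coo = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
b_rows = {0: [(0, 4)], 1: [(0, 5), (1, 6)]}
c_coo = esc_spmm(a_coo, b_rows)
```

The intermediate list holds one tuple per partial product, which is why the sort dominates the cost when many products collapse onto the same output entry.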
27 SpMM Work Efficient
- Given A (MxK), B (KxN), and C (MxN): AB = C
- Set of partial products: a_ij * b_jk, k = 1...nnz(B_j), for all nonzeros a_ij in A
28 SpMM Example
[Figure: A x B = C worked example showing the set of partial products as Row Indices, Column Indices, and Values arrays, then sorted by row and column, then reduced by row and column]
29 Format Conversions
- CSR -> COO
- Compute C matrix
- Sort & reduction performed in COO format
- COO -> CSR
30 Improved SpMM
- Compute rows asynchronously
- Dynamically update C matrix in DCSR format
- Defragment C matrix to get result in sorted CSR format
- Avoids conversion to and from COO format
31 Adaptive Binning
- Partial product row size of C is computed by a first pass over A and B
- rs_i = sum of nnz(b_j) over the nonzeros a_ij of row a_i
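
The first pass over A and B can be written as a short Python sketch, assuming both matrices are available in CSR form with M+1 row offsets. The function name `partial_product_row_sizes` is hypothetical.

```python
def partial_product_row_sizes(a_offsets, a_cols, b_offsets):
    """For each row i of A, sum nnz(b_j) over the nonzeros a_ij of that row."""
    sizes = []
    for i in range(len(a_offsets) - 1):
        total = 0
        for p in range(a_offsets[i], a_offsets[i + 1]):
            j = a_cols[p]  # column of a_ij selects row j of B
            total += b_offsets[j + 1] - b_offsets[j]  # nnz(b_j)
        sizes.append(total)
    return sizes
```

The value computed for a row is an upper bound on that row's final size in C, since partial products that land on the same column are merged during reduction.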
32 Adaptive Binning
- Rows are binned by partial product row size
33 Adaptive Binning
- Rows are grouped by size of partial product set
- Up to shared memory limit
34 Asynchronous Computations
- Kernels asynchronously compute rows by bin
- Bandwidth is reduced since row is implicit
- Shared memory is faster than global memory
- Row size is not known until after reduction
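
A sequential Python sketch of computing rows of C independently and writing each finished row as its own segment. The dictionary accumulator is a stand-in for the per-row reduction in shared memory, and the function name `spmm_rows_to_segments`, the input layout, and the explicit `row_order` argument (which mimics rows finishing out of order) are all assumptions for illustration.

```python
def spmm_rows_to_segments(a_rows, b_rows, row_order):
    """Compute C = A B one row at a time, appending each row as a segment."""
    col_idx, values = [], []
    segments = {}
    for i in row_order:  # rows may complete in any order
        acc = {}  # stands in for a shared-memory accumulator
        for j, a in a_rows[i]:
            for k, b in b_rows[j]:
                acc[k] = acc.get(k, 0) + a * b
        start = len(col_idx)
        for k in sorted(acc):  # only (column, value) is stored; the row is implicit
            col_idx.append(k)
            values.append(acc[k])
        segments[i] = (start, len(col_idx))  # row size is known only after reduction
    return segments, col_idx, values
```

Since each row records where its segment starts and ends, the rows can land in the output in completion order and be put into row order afterwards by defragmentation.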
35 Row Updates
[Figure: Kernels 1-3 writing row updates into the Column Indices and Values arrays of the C matrix]
36 Defragmentation
- Defragmentation is >x faster than sorting equivalent COO matrix
- Bandwidth is reduced by a third since rows are compressed
- Segments are shuffled directly to offset location without sort
37 SpMM Results
[Chart: relative speedup of DCSR SpMM and CSR SpMM]
38 Summary
- DCSR allows for efficient dynamic updates to sparse matrices
- Suitable for graph applications
- Defragmentation method is faster than sorting equivalent COO matrix
- Code will soon be available through the Scientific Computing and Imaging (SCI) Institute GPUTUM library
39 Questions