Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Aydin Buluc and John R. Gilbert, University of California, Santa Barbara
ICPP 2008, September 11, 2008
Support: DOE Office of Science, MIT Lincoln Labs

Reasons (pronounced "Outline")
Motivation? Building block for many (graph) algorithms.
Challenge? Parallel scaling.
Tools? Novel work-scalable sequential kernel; 2D parallel decomposition/algorithm; overlapping communication with computation.
This work? Comparative theoretical analysis of 1D and 2D algorithms; preliminary parallel implementation of the first 2D algorithm; challenges arising from sparsity, and ways to overcome them.

Matrices over Semirings
Matrix multiplication C = AB (or matrix/vector): C(i,j) = A(i,1)·B(1,j) + A(i,2)·B(2,j) + ... + A(i,n)·B(n,j)
Replace the scalar operations · and + by ⊗ and ⊕:
⊗: associative, distributes over ⊕, identity 1
⊕: associative, commutative, identity 0, which annihilates under ⊗
Then C(i,j) = A(i,1)⊗B(1,j) ⊕ A(i,2)⊗B(2,j) ⊕ ... ⊕ A(i,n)⊗B(n,j)
Examples (⊗, ⊕): (×, +); (and, or); (+, min); ...
Same data reference pattern and control flow.
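To illustrate the operator substitution (a minimal sketch of my own, not from the talk), here is a dense matrix multiply templated on the two semiring operations; instantiating it with (+, min) gives one relaxation step of all-pairs shortest paths:

#include <cstdio>
#include <vector>
#include <limits>
#include <algorithm>
#include <functional>

// C(i,j) = ⊕-reduction over k of A(i,k) ⊗ B(k,j).
// `otimes` plays the role of ⊗, `oplus` of ⊕, `zero` is the ⊕-identity.
template <typename T, typename Otimes, typename Oplus>
std::vector<T> semiring_matmul(const std::vector<T>& A, const std::vector<T>& B,
                               int n, Otimes otimes, Oplus oplus, T zero) {
    std::vector<T> C(n * n, zero);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i*n + j] = oplus(C[i*n + j], otimes(A[i*n + k], B[k*n + j]));
    return C;
}

int main() {
    const double INF = std::numeric_limits<double>::infinity();
    // Edge-length matrix of a small 3-vertex graph (INF = no edge, 0 on the diagonal).
    std::vector<double> A = { 0,   2,   INF,
                              INF, 0,   3,
                              1,   INF, 0 };
    // Tropical semiring: ⊗ = +, ⊕ = min, identity of ⊕ is +infinity.
    auto D2 = semiring_matmul(A, A, 3, std::plus<double>(),
                              [](double x, double y) { return std::min(x, y); }, INF);
    std::printf("shortest distance from 0 to 2 using at most 2 hops: %g\n", D2[0*3 + 2]);  // 2 + 3 = 5
}

Only the data reference pattern of ordinary matrix multiplication is reused; the two operators are swapped out.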

Multiple-source breadth-first search

[Figure, shown over several slides: a 7-vertex example graph; the transposed adjacency matrix Aᵀ is multiplied by a sparse matrix X of frontiers, one column per source, to advance all searches at once.]

Sparse array representation => space efficient
Sparse matrix-matrix multiplication => work efficient
Load balance depends on the SpGEMM implementation
Not a panacea for the memory latency wall!

SpGEMM: Sparse Matrix × Sparse Matrix
Shortest path calculations (APSP)
Betweenness centrality
BFS from multiple source vertices
Subgraph / submatrix indexing
Graph contraction
Cycle detection
Multigrid interpolation & restriction
Colored intersection searching
Applying constraints in finite element computations
Context-free parsing

Challenges of Parallel SpGEMM
Scalable sequential kernel (A_ik * B_kj): solved [IPDPS 08]
Load balancing, especially for real-world graphs
Communication costs: the communication-to-computation ratio is much higher than for dense GEMM
Updates (additions): each scalar multiplication produces a result that must be accumulated into a sparse C

Standard data structure: CSC

Example matrix:
    31   0  53
     0  59   0
    41  26   0

NUM: 31 41 59 26 53
IR:   1  3  2  3  1
JC:   1  3  5  6

JC holds column pointers (size n+1)
IR holds row indices (size nnz)
NUM holds numerical values (size nnz)
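The same example in zero-based C++ arrays (a minimal sketch for illustration, not the talk's actual data structure):

#include <cstdio>
#include <vector>

// Compressed Sparse Column storage for the 3x3 example matrix
//   [31  0 53]
//   [ 0 59  0]
//   [41 26  0]
// using zero-based indexing (the slide uses one-based indexing).
struct CscMatrix {
    int n;                       // number of columns
    std::vector<int> jc;         // column pointers, size n+1
    std::vector<int> ir;         // row indices, size nnz
    std::vector<double> num;     // numerical values, size nnz
};

int main() {
    CscMatrix A;
    A.n   = 3;
    A.jc  = {0, 2, 4, 5};            // column j occupies positions jc[j] .. jc[j+1]-1
    A.ir  = {0, 2, 1, 2, 0};
    A.num = {31, 41, 59, 26, 53};

    // Print the nonzeros column by column.
    for (int j = 0; j < A.n; ++j)
        for (int k = A.jc[j]; k < A.jc[j + 1]; ++k)
            std::printf("A(%d,%d) = %g\n", A.ir[k], j, A.num[k]);
}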

Organizations of Matrix Multiplication

1) Outer product:
for k = 1:n
    C = C + A(:, k) * B(k, :)     % rank-1 update

2) Inner product:
for i = 1:n
    for j = 1:n
        C(i, j) = A(i, :) * B(:, j)

Matlab's Algorithm

3) Column-by-column formulation:
for j = 1:n
    forall k s.t. B(k, j) ≠ 0
        C(:, j) = C(:, j) + A(:, k) * B(k, j)

Scanning B(:, j) to find the nonzero elements is a sequential scan.
Scanning A(:, k) for the scalar*vector product is a sequential scan for each k.
But the size and structure of C(:, j) are not known in advance, so updates may be costly.

CSC Sparse Matrix Multiplication with SPA

for j = 1:n
    C(:, j) = A * B(:, j)

[Figure: scaled columns of A are scattered/accumulated into the SPA, which is then gathered into C(:, j) in compressed form.]

All matrix columns and vectors are stored compressed, except the SPA.

SPA (Sparse Accumulator)
Supports efficient vector addition by allowing random access to the currently active column.
Dense Boolean array (BA) to determine efficiently whether an entry has already been inserted.
Dense vector of real values (RE).
Unordered list of the indices that have been inserted.
Complexity:
Initialize SPA, set x = 0: O(n)   // only once
x = x + α*y: O(nnz(y))
Output x, reset x = 0: O(nnz(x))
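A minimal sketch of the SPA and the column-by-column formulation in C++ (names such as Spa and spgemm_columnwise are mine; this illustrates the idea, not the talk's implementation):

#include <cstdio>
#include <vector>

// Compressed sparse column matrix (zero-based), as in the earlier sketch.
struct Csc {
    int n;                        // number of rows and columns (square for simplicity)
    std::vector<int> jc;          // column pointers, size n+1
    std::vector<int> ir;          // row indices
    std::vector<double> num;      // values
};

// Sparse accumulator: dense flags + dense values + unordered list of occupied indices.
struct Spa {
    std::vector<bool> occupied;   // BA: has row i been written in the active column?
    std::vector<double> value;    // RE: dense value array
    std::vector<int> indices;     // unordered list of occupied row indices

    explicit Spa(int n) : occupied(n, false), value(n, 0.0) {}   // O(n), done once

    void add(int row, double v) {              // x(row) += v
        if (!occupied[row]) { occupied[row] = true; indices.push_back(row); }
        value[row] += v;
    }
    void flush(int col, Csc& C) {              // output x into C(:, col) and reset: O(nnz(x))
        for (int row : indices) {
            C.ir.push_back(row);
            C.num.push_back(value[row]);       // note: rows within a column come out unsorted
            occupied[row] = false;
            value[row] = 0.0;
        }
        indices.clear();
        C.jc[col + 1] = static_cast<int>(C.ir.size());
    }
};

// Column-by-column SpGEMM: C(:, j) = sum over k with B(k, j) != 0 of A(:, k) * B(k, j).
Csc spgemm_columnwise(const Csc& A, const Csc& B) {
    Csc C;
    C.n = A.n;
    C.jc.assign(C.n + 1, 0);
    Spa spa(A.n);
    for (int j = 0; j < B.n; ++j) {
        for (int kb = B.jc[j]; kb < B.jc[j + 1]; ++kb) {     // scan B(:, j)
            int k = B.ir[kb];
            double bkj = B.num[kb];
            for (int ka = A.jc[k]; ka < A.jc[k + 1]; ++ka)   // scatter/accumulate A(:, k) * bkj
                spa.add(A.ir[ka], A.num[ka] * bkj);
        }
        spa.flush(j, C);                                     // gather the SPA into C(:, j)
    }
    return C;
}

int main() {
    Csc A{3, {0, 2, 4, 5}, {0, 2, 1, 2, 0}, {31, 41, 59, 26, 53}};   // the matrix from the CSC slide
    Csc C = spgemm_columnwise(A, A);
    std::printf("nnz(A*A) = %zu\n", C.num.size());
}

Each output column costs work proportional to the flops performed plus nnz(C(:, j)) for the final gather, matching the complexity bullets above.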

Parallel 1D Algorithm

[Figure: A partitioned into block columns A_1 ... A_p, one per processor.]

Distribution by columns: only A needs to be communicated (only B for distribution by rows).
A_i refers to the n × n/p block column that processor i owns.
The algorithm uses the formula C_i = C_i + A * B_i.
Processor i is likely to need columns from all processors to fully form C_i:
All-to-all broadcast at once? We don't have space for that.
So: multiple iterations to fully form C_i.

Sparse 1D Algorithm

for all processors P_i in parallel
    for j = 1 to p do
        Broadcast(A_j)
        for k = 1 to N/p do
            SPA ← Load(C_i(:, k))
            SPA ← SPA + A_j * B_ij(:, k)
            C_i(:, k) ← UnLoad(SPA)
        end
    end
end

Computation per processor is independent of p → no speedup at all.
Time spent on loads/unloads of the SPA is not amortized by flops.

T_comp = γ · N(1 + c²)

where γ is the cost of one arithmetic operation and c is the sparsity parameter (the average number of nonzeros per column/row).

Utopian 1D Algorithm
Assume we found a way to amortize the cost of loads/unloads of the SPA.
It would still be unscalable due to communication.

2D Algorithms: The Dense Case

[Figure: a 3 × 3 processor grid, p(0,0) through p(2,2).]

For the dense case, 2D scales better with the number of processors; likely to be the same for the sparse case.

Parallel efficiency:
1D layout: E = 1 / (1 + O(p/n))
2D layout: E = 1 / (1 + O(√p/n))

The O(·) term should be zero for perfect efficiency.

Submatrices are hypersparse (i.e., nnz << n)

[Figure: the matrix is split into a √p × √p grid of blocks. With an average of c nonzeros per column overall, each block holds about cn/p nonzeros, i.e., c/√p nonzeros per block column → 0 as p grows.]

Total storage: O(n + nnz) for the whole matrix in CSC, but O(n·√p + nnz) when every block is stored in CSC.

A data structure or algorithm that depends on the matrix dimension n (e.g., CSR or CSC) is asymptotically too wasteful for submatrices.
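To make the hypersparsity concrete (illustrative numbers of my own, not from the talk): with n = 10^6, c = 8 nonzeros per column, and p = 4096 processors, each block is an (n/√p) × (n/√p) = 15,625 × 15,625 submatrix holding about cn/p ≈ 1,953 nonzeros, i.e., c/√p = 8/64 = 0.125 nonzeros per column. A per-block CSC column-pointer array would spend 15,626 words to index fewer than 2,000 nonzeros.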

Sequential Kernel [IPDPS 2008]
Strictly O(nnz) data structure
Complexity independent of the matrix dimension
Outer-product formulation with multi-way merging
Scalable in terms of the amount of work it performs
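A simplified sketch of the outer-product formulation (my illustration: it uses sort-based accumulation in place of the paper's heap-based multi-way merge, and assumes B is also available by rows, i.e., as the CSC of Bᵀ):

#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

struct Csc {                         // CSC input, as in the earlier sketches
    int n;
    std::vector<int> jc, ir;
    std::vector<double> num;
};

struct Triple { int row, col; double val; };

// Outer-product SpGEMM: for every k, expand A(:, k) * B(k, :) into triples,
// then merge triples that share (row, col). The work and storage depend only
// on the nonzeros touched, not on the matrix dimension n.
std::vector<Triple> spgemm_outer(const Csc& A, const Csc& Bt /* CSC of B transpose */) {
    std::vector<Triple> t;
    for (int k = 0; k < A.n; ++k)
        for (int a = A.jc[k]; a < A.jc[k + 1]; ++a)          // A(:, k)
            for (int b = Bt.jc[k]; b < Bt.jc[k + 1]; ++b)    // B(k, :)
                t.push_back({A.ir[a], Bt.ir[b], A.num[a] * Bt.num[b]});

    // Accumulate duplicates; the IPDPS 2008 kernel does this with a multi-way
    // merge of the per-k lists instead of a global sort.
    std::sort(t.begin(), t.end(), [](const Triple& x, const Triple& y) {
        return std::tie(x.col, x.row) < std::tie(y.col, y.row);
    });
    std::vector<Triple> c;
    for (const Triple& x : t) {
        if (!c.empty() && c.back().row == x.row && c.back().col == x.col)
            c.back().val += x.val;
        else
            c.push_back(x);
    }
    return c;
}

int main() {
    // The matrix from the CSC slide, multiplied by itself: pass A by columns and A by rows.
    Csc A  {3, {0, 2, 4, 5}, {0, 2, 1, 2, 0}, {31, 41, 59, 26, 53}};
    Csc At {3, {0, 2, 3, 5}, {0, 2, 1, 0, 1}, {31, 53, 59, 41, 26}};
    for (const Triple& x : spgemm_outer(A, At))
        std::printf("C(%d,%d) = %g\n", x.row, x.col, x.val);
}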

2D Example: Sparse SUMMA

[Figure: processor (i, j) accumulates C_ij += A_ik * B_kj as k sweeps across the processor grid.]

C_ij += A_ik * B_kj
At worst doubles local storage
Based on SUMMA (block size = n/√p)
Easy to generalize to nonsquare matrices, etc.
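A skeleton of the SUMMA communication pattern in MPI (my sketch only: the blocks and the local multiply are stubbed out, and local_spgemm_accumulate is a placeholder, not the talk's kernel):

#include <mpi.h>
#include <cmath>
#include <vector>

// Placeholders for a locally stored sparse block and for the hypersparse kernel.
struct SparseBlock { std::vector<char> bytes; };
void local_spgemm_accumulate(const SparseBlock&, const SparseBlock&, SparseBlock&) { /* C_ij += A_ik * B_kj */ }
SparseBlock pack(const SparseBlock& b) { return b; }   // serialization stub

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int q = static_cast<int>(std::sqrt(static_cast<double>(nprocs)));   // assume a q x q grid
    int myrow = rank / q, mycol = rank % q;

    // Row and column communicators of the process grid.
    MPI_Comm rowcomm, colcomm;
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &rowcomm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &colcomm);

    SparseBlock A_local, B_local, C_local;   // blocks owned by this processor (left empty here)

    // SUMMA: at stage k, the k-th grid column broadcasts its A block along processor
    // rows, and the k-th grid row broadcasts its B block along processor columns.
    for (int k = 0; k < q; ++k) {
        SparseBlock A_recv = (mycol == k) ? pack(A_local) : SparseBlock{};
        SparseBlock B_recv = (myrow == k) ? pack(B_local) : SparseBlock{};

        int a_count = static_cast<int>(A_recv.bytes.size());
        MPI_Bcast(&a_count, 1, MPI_INT, k, rowcomm);
        A_recv.bytes.resize(a_count);
        MPI_Bcast(A_recv.bytes.data(), a_count, MPI_BYTE, k, rowcomm);

        int b_count = static_cast<int>(B_recv.bytes.size());
        MPI_Bcast(&b_count, 1, MPI_INT, k, colcomm);
        B_recv.bytes.resize(b_count);
        MPI_Bcast(B_recv.bytes.data(), b_count, MPI_BYTE, k, colcomm);

        local_spgemm_accumulate(A_recv, B_recv, C_local);   // C_ij += A_ik * B_kj
    }

    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&colcomm);
    MPI_Finalize();
}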

Comparative Speedup of Sparse 1D & 2D

[Figure: modeled speedup surfaces of the 1D and 2D algorithms as functions of N and P.]

Even utopian 1D algorithms cannot scale beyond 40x.
The break-even point is around 50 processors.

Parallel efficiency of 2D algorithm?

Parallel efficiency due to computation:
E_comp = cN / (c²N/p + cN·lg(c²N/p))
As long as N grows linearly with p (weak scaling), this is constant at 1/lg(c²N/p): constant, but not 1.

Parallel efficiency due to latency:
E_latency = γ·c²·N / (α·p·√p)
N should grow on the order of p^1.5 to keep this constant.

Parallel efficiency due to bandwidth:
E_bw = γ·c / (β·√p)
Better than 1D, yet still not scalable.

(Here γ is the cost of one arithmetic operation, α the per-message latency, and β the per-word transfer cost.)

Hiding Communication Costs
SpGEMM is much harder to scale than dense GEMM: both communication and computation costs increase at the same rate with increasing problem size!
Overlap communication with computation, as much as possible (possible for p < 4000).
Asynchronous, one-sided communication: GASNet, ARMCI (truly one-sided) communication layers.
Myrinet, InfiniBand, etc.: hardware supporting zero-copy RDMA.
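One way to express the overlap with plain two-sided MPI (my sketch: a double-buffered ring pipeline that moves the next block while the current one is multiplied; the talk itself advocates truly one-sided GASNet/ARMCI-style communication over RDMA rather than this):

#include <mpi.h>
#include <vector>

// Placeholder for the local sparse multiply on the block currently held.
void local_multiply(const std::vector<double>& block) { (void)block; /* C += ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int left = (rank + p - 1) % p, right = (rank + 1) % p;

    const int BLOCK = 1 << 16;                    // illustrative fixed block size
    std::vector<double> cur(BLOCK, rank), nxt(BLOCK);

    for (int stage = 0; stage < p; ++stage) {
        MPI_Request reqs[2];
        // Start moving the next block before computing on the current one.
        MPI_Irecv(nxt.data(), BLOCK, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(cur.data(), BLOCK, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        local_multiply(cur);                      // computation overlaps the transfers

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        cur.swap(nxt);                            // the received block becomes current
    }
    MPI_Finalize();
}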

Speedup of Asynchronous 2D
Slightly better performance for smaller matrices.
Remember, this assumes random matrices (Erdős–Rényi).

Addressing the Load Balance (in Real-World Matrices)
Random permutations are useful.
Bulk-synchronous algorithms may still suffer; asynchronous algorithms have no notion of stages.
Overall, no significant imbalance.

Strong Scaling of Real Code
Synchronous implementation with C++ and MPI.
Runs on TACC's Lonestar cluster (2.66 GHz dual-core nodes).
Good up to p = 64; flattens out at p = 256.

Barriers to Scalability (pronounced "Future Work")
Submatrix additions (C_ij += A_ik * B_kj)
o We currently use scanning-based merging.
o Fast unbalanced merging is required.
Load-balance problem
o We used real-world matrices for testing the implementation.
o One-sided communication through RDMA is required.
Extra log(p) factor in Sparse SUMMA
o Creates extra communication costs.
o Better overlap of communication with computation is needed.
o Again, the solution is an asynchronous implementation.

Thank You!