Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
1 Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication. Penporn Koanantakool 1,2, Ariful Azad 2, Aydin Buluç 2, Dmitriy Morozov 2, Sang-Yun Oh 2,3, Leonid Oliker 2, Katherine Yelick 1,2. 1 Computer Science Division, University of California, Berkeley; 2 Lawrence Berkeley National Laboratory; 3 University of California, Santa Barbara. May 26, 2016
2 Outline: Background and Motivation; Algorithms (2D SUMMA, 3D SUMMA, 1.5D variants); Experimental Results: 100x speedup; Iterative Multiplication; Conclusions
3 Parallel Algorithm Cost Model. p distributed, homogeneous processors connected through a network. Per-processor costs along the critical path: time = #flops × time_per_flop + #messages × time_per_message + #words × time_per_word (computation + communication). time_per_flop, time_per_message, and time_per_word are machine-specific constants. Goal: minimize #messages (S) and #words (W). Image courtesy of James Demmel.
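The cost model above is simple enough to state directly in code. A minimal sketch (the machine constants below are illustrative placeholders, not measured Cray values):

```python
def run_time(flops, messages, words,
             time_per_flop=1e-10, time_per_message=1e-6, time_per_word=1e-9):
    """Critical-path time from the slide's model:
    #flops*time_per_flop + #messages*time_per_message + #words*time_per_word."""
    return (flops * time_per_flop
            + messages * time_per_message
            + words * time_per_word)

# Same data volume, two message schedules: latency makes the chatty one slower.
many_small = run_time(flops=0, messages=1000, words=1000)
one_large = run_time(flops=0, messages=1, words=1000)
print(many_small > one_large)
```

Because time_per_message dwarfs time_per_word on real networks, minimizing the message count matters even when the word count is fixed.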
4 Sparse-Dense Matmul. A (sparse) × B (dense) = C (dense). C(i,j) += A(i,k) × B(k,j): each entry of C pairs row A(i,:) with column B(:,j).
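A quick way to see the sparse-dense kernel in action, using SciPy's CSR format (the same layout the experiments later use for the local computation):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# A is sparse, B is dense, C = A @ B is dense: each C[i, j] accumulates
# A[i, k] * B[k, j] only over the nonzeros A[i, k] in row i of A.
rng = np.random.default_rng(0)
A = sparse_random(6, 5, density=0.3, format="csr", random_state=0)
B = rng.standard_normal((5, 4))
C = A @ B  # SciPy dispatches to a CSR sparse-dense kernel

assert np.allclose(C, A.toarray() @ B)  # matches the dense definition
```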
5 Matrix Multiplication Algorithms: 1D, 1.5D, 2D (SUMMA), 2.5D, 3D. Here A is sparse, B and C dense. The 2D/3D families are optimal for dense-dense and sparse-sparse matmul [Solomonik and Demmel 2011] [Ballard et al. 2013]. 2D and 3D images courtesy of Grey Ballard.
6 Existing Algorithms
7-10 1D Ring Algorithm. p=4, 1D layout. Shifts the sparse matrix A around cyclically while B stays in place. #msgs: p; msg size: nnz(A)/p; #words: nnz(A)
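The ring algorithm can be simulated serially to check its logic. A sketch, assuming each block row of A carries its partial C with it as it shifts (the name ring_spdm3 is mine, not the paper's):

```python
import numpy as np

def ring_spdm3(A_blocks, B_blocks):
    """Serial simulation of the 1D ring algorithm: the (sparse) block rows
    of A circulate; B stays put. Each block row of A picks up one
    contribution to its C block per round."""
    p = len(A_blocks)
    col_splits = np.cumsum([b.shape[0] for b in B_blocks])[:-1]
    # packages[i] = (owner j of the A block row currently at proc i, partial C_j)
    packages = [(i, np.zeros((A_blocks[i].shape[0], B_blocks[0].shape[1])))
                for i in range(p)]
    A_cur = list(A_blocks)
    for _ in range(p):      # p rounds, i.e. p messages of size nnz(A)/p each
        for i in range(p):  # every "processor" computes with the A it holds
            j, C_j = packages[i]
            A_cols = np.hsplit(A_cur[i], col_splits)[i]  # columns matching B_i
            C_j += A_cols @ B_blocks[i]
        # cyclic shift: A (and its partial C) moves to the next processor
        A_cur = A_cur[-1:] + A_cur[:-1]
        packages = packages[-1:] + packages[:-1]
    C = [None] * p
    for j, C_j in packages:
        C[j] = C_j
    return np.vstack(C)

# Demo: p=4 "processors", an 8x8 A times an 8x3 B, both split into block rows.
rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 3))
C = ring_spdm3(np.vsplit(A, 4), np.vsplit(B, 4))
print(np.allclose(C, A @ B))
```

After p shifts every A block row has visited every B block row, so each C block is complete.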
11-17 2D SUMMA Algorithm. p=16, 2D layout, √p × √p grid, outer products. At each stage, A broadcasts a block column row-wise and B broadcasts a block row column-wise. √p broadcasts of size nnz(A)/p for A; √p broadcasts of size nnz(B)/p for B.
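The stages of SUMMA are easy to mimic serially, with the two inner loops standing in for the row-wise and column-wise broadcasts. A sketch (summa is an assumed helper name, not the paper's code):

```python
import numpy as np

def summa(A, B, grid=4):
    """Serial sketch of 2D SUMMA on a grid x grid mesh: stage k broadcasts
    A's k-th block column along processor rows and B's k-th block row along
    processor columns; every C_ij accumulates one outer-product term per stage."""
    Ab = [np.hsplit(r, grid) for r in np.vsplit(A, grid)]  # Ab[i][k]
    Bb = [np.hsplit(r, grid) for r in np.vsplit(B, grid)]  # Bb[k][j]
    Cb = [[np.zeros((Ab[i][0].shape[0], Bb[0][j].shape[1]))
           for j in range(grid)] for i in range(grid)]
    for k in range(grid):            # sqrt(p) stages
        for i in range(grid):        # "row-wise broadcast" of Ab[i][k]
            for j in range(grid):    # "column-wise broadcast" of Bb[k][j]
                Cb[i][j] += Ab[i][k] @ Bb[k][j]
    return np.block(Cb)

# grid=4 corresponds to the slide's p=16 processor mesh.
rng = np.random.default_rng(2)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
print(np.allclose(summa(A, B), A @ B))
```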
18-20 3D SUMMA Algorithm. p=32, c=2 (c is #copies); grid with teams of size c. Each team member calculates 1/c of all the outer products (the first member the first half, the second member the latter half). Broadcasts of size nnz(A)c/p for A and nnz(B)c/p for B.
21 3D SUMMA: Costs. Three parts, each with its own #messages and #words terms: replication (replicate A, replicate B), propagation (broadcast A, broadcast B), and collection (reduce C).
22 1.5D Algorithms
23-24 1.5D Col A. p=4, c=2. B has p/c block columns; processors are grouped into teams of c, and A shifts cyclically in rounds. #msgs: p/c; msg size: nnz(A)c/p; #words: nnz(A)
25-26 1.5D Col ABC. p=8, c=2. A and B both have p/c block columns; each team member does 1/c of the work. #msgs: p/c
27-28 1.5D Inner ABC. p=8, c=2. A has p/c block rows; B has p/c block columns; each team member does 1/c of the work. #msgs: p/c²
29 Cost Comparison. Latency cost per algorithm (append the bandwidth terms for the bandwidth cost). Even replication and collection can be the bottleneck if nnz(A) ≪ nnz(B), nnz(C).
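Plugging the per-processor word counts quoted on these slides into a toy comparison shows why the 1.5D variants win when A is much sparser than B. This sketch ignores latency, replication, and collection terms:

```python
from math import sqrt

def words_2d_summa(nnzA, nnzB, p):
    # sqrt(p) broadcasts of size nnz/p for each of A and B (slides 11-17)
    return (nnzA + nnzB) / sqrt(p)

def words_15d_colA(nnzA, p, c):
    # p/c shifts of size nnz(A)*c/p each (slides 23-24): only sparse A moves
    return (p / c) * (nnzA * c / p)   # = nnz(A)

# The slides' experimental setup: p = 3,072, n = 64k, 41 nonzeros per row,
# B fully dense.
n, p, c = 64 * 1024, 3072, 2
nnzA, nnzB = 41 * n, n * n
ratio = words_2d_summa(nnzA, nnzB, p) / words_15d_colA(nnzA, p, c)
print(ratio)  # 1.5D Col A moves far fewer words because dense B never moves
```

The dense matrix dominates the 2D bandwidth term, which is exactly the "moving the dense matrix is costly" observation from slide 33.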
30 Experimental Results
31 Test environments. Comparison: 1.5D variants (Col A, Col ABC, and Inner ABC) against the baseline (best of 2D/2.5D/3D SUMMA ABC). C++ with MPI, hybrid (12 threads per MPI process); MKL's multithreaded routine mkl_dcsrmm for local computation (double precision, CSR format, row major). Machine: Cray XC30, 2.57 PFLOPS, 2 sockets of 12 cores each per node, Cray Aries network with Dragonfly topology.
32 Cost Breakdown. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. Existing vs. new algorithms.
33 Moving the Dense Matrix is Costly. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. Existing vs. new algorithms.
34 Replication might not pay off. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. 17x. Existing vs. new algorithms.
35 Different Computation Efficiency. p=3,072, n=64k × 64k, 41 nnz per row, Cray XC30. 17x. Existing vs. new algorithms.
36 100x Improvement. A 66k × 172k, B 172k × 66k, % nnz, Cray XC30. 100x (up is good).
37 When is 1.5D better? Best theoretical bandwidth-cost regions, assuming nnz(A) = nnz(B). Also applicable to dense-dense and sparse-sparse multiplication. Most sparse matrices have < 5% nonzeroes.
38 Iterative Multiplication
39 Sparse ICM Estimation [S.-Y. Oh et al. 2014]. ICM = inverse covariance matrix, which captures direct pairwise dependency. The data could be brain fMRI scans or gene expression. With r observations of n variables/features, X is the observed data and S the sample covariance (rank-deficient). S is usually rank-deficient or too expensive to invert directly, so the sparse inverse is estimated by setting an objective function and optimizing it.
40 CONCORD-ISTA. CONCORD objective function over diagonal elements, off-diagonal elements, and Ω, the estimated sparse inverse of S. Iteratively compute ΩS, or ΩXᵀ times X. Replication (and sometimes collection) costs can be amortized across iterations.
41 Conclusions. 1.5D is beneficial when the nonzero counts of the matrices are imbalanced; it is also applicable to dense-dense/sparse-sparse multiplication. Observed up to 100x speedup. 1.5D can handle moderately imbalanced load with greedy partitioning. Iterative multiplication gains even more benefit. Future work: parallel CONCORD-ISTA, new communication lower bounds, trying a recursive approach, considering different computation efficiency.
42 References I. P. Koanantakool, A. Azad, A. Buluç, D. Morozov, S.-Y. Oh, L. Oliker, K. Yelick. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In IPDPS, 2016. E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par, 2011, vol. 6853. G. Ballard, A. Buluç, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the 25th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013.
43 References II. M. Driscoll, E. Georganas, P. Koanantakool, E. Solomonik, K. Yelick. A communication-optimal N-body algorithm for direct interactions. In IPDPS, 2013. S.-Y. Oh, O. Dalal, K. Khare, and B. Rajaratnam. Optimization methods for sparse pseudo-likelihood graphical model selection. In NIPS 27, 2014.
44 Thank you! Questions? Suggestions/Applications?
45 Communication Costs. Three phases: replication (to have c copies), propagation (to multiply), and collection (of matrix C). The non-replicating version only has the propagation cost. Replication and collection happen between team members, i.e., only c processors, so they are usually cheaper than propagation.
46 CONCORD-ISTA. The CONCORD objective function is non-smooth; Ω is the estimated sparse inverse of S, with diagonal elements and λ-penalized off-diagonal elements. Proximal gradient method: gradient descent on the smooth part, soft-thresholding on the non-smooth part (ISTA: Iterative Soft-Thresholding Algorithm). Does not assume a Gaussian distribution.
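The soft-thresholding (proximal) operator that ISTA applies each iteration is one line of NumPy:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||x||_1: shrink each entry toward zero by t,
    zeroing anything with magnitude below t -- this is what produces sparsity."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(soft_threshold(x, 1.0))  # only entries with |x| > 1 survive, shrunk by 1
```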
47 Pseudocode (Ω: estimated sparse inverse of S).
S = XᵀX/r
Ω = I
do
  Ω₀ = Ω
  compute gradient G
  try τ = {1, 1/2, 1/4, ...}  // find appropriate step size
    Ω = soft_thresholding(Ω₀ − τG, τλ)
  until descent condition satisfied
while max(|Ω₀ − Ω|) > ε  // until no more progress
Distributed operations: S = XᵀX/r is a dense-dense matrix multiplication; SΩ + ΩS is a sparse-dense matrix multiplication.
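The pseudocode above can be fleshed out into a toy serial NumPy version. This is a sketch under stated assumptions (CONCORD smooth part −Σᵢ log ωᵢᵢ + ½ tr(ΩSΩ), whose gradient contains the SΩ + ΩS term from the slide, with λ penalizing off-diagonals only), not the paper's distributed C++/MPI implementation:

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding (proximal operator of the l1 penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(Omega, S, lam):
    """Assumed CONCORD objective: -sum_i log w_ii + (1/2) tr(Omega S Omega)
    + lam * ||off-diagonal of Omega||_1."""
    off = Omega - np.diag(np.diag(Omega))
    return (-np.sum(np.log(np.diag(Omega)))
            + 0.5 * np.trace(Omega @ S @ Omega)
            + lam * np.sum(np.abs(off)))

def concord_ista(X, lam=0.1, eps=1e-6, max_iter=200):
    """Toy serial CONCORD-ISTA following the slide's pseudocode."""
    r, n = X.shape
    S = X.T @ X / r                     # dense-dense matrix multiplication
    Omega = np.eye(n)
    for _ in range(max_iter):
        Omega0 = Omega
        # gradient of the smooth part; S@W + W@S is the sparse-dense matmul
        G = -np.diag(1.0 / np.diag(Omega0)) + (S @ Omega0 + Omega0 @ S) / 2
        tau = 1.0
        while True:                     # try tau = 1, 1/2, 1/4, ...
            Omega = soft(Omega0 - tau * G, tau * lam)
            # lam penalizes off-diagonals only: plain step on the diagonal
            np.fill_diagonal(Omega, np.diag(Omega0 - tau * G))
            ok = (np.all(np.diag(Omega) > 0)
                  and objective(Omega, S, lam) <= objective(Omega0, S, lam))
            if ok or tau < 1e-12:       # descent condition satisfied
                break
            tau /= 2
        if np.max(np.abs(Omega0 - Omega)) < eps:
            break                       # no more progress
    return Omega

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))       # r = 200 observations, n = 5 features
Omega = concord_ista(X, lam=0.2)
print(np.count_nonzero(np.abs(Omega) > 1e-8))
```

In the distributed setting, Ω's replication cost can be paid once and amortized over all the iterations, which is the point of the preceding slide.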
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report
ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new
More informationSparse matrices, graphs, and tree elimination
Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationGraph Algorithms on A transpose A. Benjamin Chang John Gilbert, Advisor
Graph Algorithms on A transpose A. Benjamin Chang John Gilbert, Advisor June 2, 2016 Abstract There are strong correspondences between matrices and graphs. Of importance to this paper are adjacency matrices
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26
More informationThe Fast Multipole Method on NVIDIA GPUs and Multicore Processors
The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied
More informationA Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression
A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression Xiaoye Sherry Li, xsli@lbl.gov Liza Rebrova, Gustavo Chavez, Yang Liu, Pieter Ghysels Lawrence Berkeley National
More informationParallel Graph Algorithms. Richard Peng Georgia Tech
Parallel Graph Algorithms Richard Peng Georgia Tech OUTLINE Model and problems Graph decompositions Randomized clusterings Interface with optimization THE MODEL The `scale up approach: Have a number of
More informationCME 213 SPRING Eric Darve
CME 213 SPRING 2017 Eric Darve LINEAR ALGEBRA MATRIX-VECTOR PRODUCTS Application example: matrix-vector product We are going to use that example to illustrate additional MPI functionalities. This will
More informationHypergraph Partitioning for Computing Matrix Powers
Hypergraph Partitioning for Computing Matrix Powers Nicholas Knight Erin Carson, James Demmel UC Berkeley, Parallel Computing Lab CSC 11 1 Hypergraph Partitioning for Computing Matrix Powers Motivation
More informationParallel Computing: Parallel Algorithm Design Examples Jin, Hai
Parallel Computing: Parallel Algorithm Design Examples Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! Given associative operator!! a 0! a 1! a 2!! a
More informationCombinatorial problems in a Parallel Hybrid Linear Solver
Combinatorial problems in a Parallel Hybrid Linear Solver Ichitaro Yamazaki and Xiaoye Li Lawrence Berkeley National Laboratory François-Henry Rouet and Bora Uçar ENSEEIHT-IRIT and LIP, ENS-Lyon SIAM workshop
More informationGauss-Jordan Matrix Inversion Speed-Up using GPUs with the Consideration of Power Consumption
Gauss-Jordan Matrix Inversion Speed-Up using GPUs with the Consideration of Power Consumption Mahmoud Shirazi School of Computer Science Institute for Research in Fundamental Sciences (IPM) Tehran, Iran
More informationCommunication-efficient parallel generic pairwise elimination
Communication-efficient parallel generic pairwise elimination Alexander Tiskin Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom Abstract The model of bulk-synchronous
More informationAll About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004
All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding James S. Plank Logistical Computing and Internetworking Laboratory Department of Computer Science University of Tennessee ICL - August 2, 24
More information