Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures


Slide 1: Exploiting Matrix Reuse and Data Locality in Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures. Ozan Karsavuran (1), Kadir Akbudak (2), Cevdet Aykanat (1) (speaker); sites.google.com/site/kadircs, kadir.cs@gmail.com. (1) Bilkent University, Turkey; (2) KAUST, KSA. SIAM Workshop on Combinatorial Scientific Computing (CSC16), Albuquerque, NM, USA, October 10-12, 2016. Based on: O. Karsavuran, K. Akbudak, and C. Aykanat, "Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Architectures," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27, no. 6, 2016, available at ieeexplore.ieee.org.

Slide 2: Outline. 1. Introduction: y = AA^T x. 2. Open problems & related work. 3. Parallel SpAA^Tx based on 1D partitioning of the A and A^T matrices: quality criteria for efficient parallelization of SpAA^Tx; proposed SpAA^Tx algorithms; experiments. 4. References.

Slide 3: Introduction: y = AA^T x. Thread-level parallelization of y = AA^T x (SpAA^Tx). y = AA^T x is computed as two Sparse Matrix-Vector Multiplies (SpMV): a Sparse Matrix-Transpose-Vector Multiply (SpA^Tx), z = A^T x, followed by a Sparse Matrix-Vector Multiply (SpAx), y = Az. The target is thread-level parallelization of repeated and consecutive SpA^Tx and SpAx operations that involve the same sparse matrix A. Examples:
- Linear Programming (LP) problems via interior point methods
- nonsymmetric systems via Bi-CG, CGNE, and Lanczos bi-orthogonalization
- least squares problems via LSQR
- the linear feasibility problem via the Surrogate Constraints method
- Krylov-based balancing algorithms used as preconditioners for sparse eigensolvers
- web page ranking via the HITS algorithm
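To make the kernel concrete, here is a minimal serial C sketch of SpAA^Tx over CSR storage; the csr struct and helper names are illustrative assumptions, not the talk's data structures. Note that the same CSR arrays serve both phases: z = A^T x is a scatter along the rows of A, while y = Az is the usual row-wise gather, which is what makes single-storage schemes possible.

```c
#include <stddef.h>

/* Minimal CSR storage for an m-by-n sparse matrix (illustrative). */
typedef struct {
    size_t m, n;          /* dimensions */
    const size_t *rowptr; /* length m+1 */
    const size_t *colind; /* length nnz */
    const double *val;    /* length nnz */
} csr;

/* z = A^T x (SpA^Tx): scatter along the CSR rows of A. */
static void spatx(const csr *A, const double *x, double *z) {
    for (size_t j = 0; j < A->n; ++j) z[j] = 0.0;
    for (size_t i = 0; i < A->m; ++i)
        for (size_t p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            z[A->colind[p]] += A->val[p] * x[i];
}

/* y = Az (SpAx): row-wise gather over the same CSR storage. */
static void spax(const csr *A, const double *z, double *y) {
    for (size_t i = 0; i < A->m; ++i) {
        double sum = 0.0;
        for (size_t p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            sum += A->val[p] * z[A->colind[p]];
        y[i] = sum;
    }
}

/* y = AA^T x as two consecutive SpMVs sharing the matrix A. */
void spaatx(const csr *A, const double *x, double *z, double *y) {
    spatx(A, x, z);  /* z = A^T x */
    spax(A, z, y);   /* y = Az   */
}
```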

4 Open problems & Related work Open problems Utilize the opportunity of reusing A-matri nonzeros? Obtain close performance for both z = A and y = Az at the same time? Single storage of A for both z = A and y = Az Storage of A for z = A and a separate storage of A for y = Az Related work Optimized Sparse Kernel Interface (OSKI), Berkeley Serial z Each row/column is reused. y A z A Compressed Sparse Blocks (CSB) by Buluc et. al. [10] Parallel Same data structure for both SpA and SpA operations without any performance degradation wo phase, i.e., SpA and SpA are not performed simultaneously 4 / 14

Slide 5: Parallel SpAA^Tx based on 1D partitioning of the A and A^T matrices — thread-level baseline parallelization of SpAA^Tx. Four baseline SpAA^Tx algorithms compute y = Az after z = A^T x by four threads, each applying 1D rowwise or columnwise partitioning to the two SpMVs: CCp (Column-Column parallel, over column slices C_1, ..., C_4), RRp (Row-Row parallel, over row slices R_1, ..., R_4), CRp (Column-Row parallel), and RCp (Row-Column parallel). In the figures, yellow scale tones mark exclusive accesses by a single thread and red marks concurrent accesses by multiple threads.
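Following the CRp picture (also shown on the quality-criteria slide), a minimal OpenMP sketch could look like this; the csc_slice block type and all names are illustrative assumptions. Thread k forms z_k = C_k^T x exclusively and then scatters C_k z_k into y, the atomic updates being exactly the red concurrent accesses:

```c
#include <omp.h>
#include <stddef.h>

/* Hypothetical per-thread column slice C_k of the m-by-n matrix A,
 * stored CSC-like; slice k covers columns cbeg[k]..cbeg[k+1]-1. */
typedef struct {
    size_t m, ncols;      /* rows of A, columns in this slice */
    const size_t *colptr; /* length ncols+1 */
    const size_t *rowind; /* row indices of nonzeros */
    const double *val;
} csc_slice;

/* CRp baseline: z_k = C_k^T x (exclusive writes), then
 * y += C_k z_k (concurrent writes by all threads). */
void crp_spaatx(const csc_slice *slices, int K, const size_t *cbeg,
                const double *x, double *z, double *y, size_t m) {
    for (size_t i = 0; i < m; ++i) y[i] = 0.0;
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < K; ++k) {
        const csc_slice *C = &slices[k];
        double *zk = z + cbeg[k];            /* thread-exclusive z_k */
        for (size_t j = 0; j < C->ncols; ++j) {
            /* z_k(j) = C_k(:,j)^T x -- no races, exclusive to thread k */
            double sum = 0.0;
            for (size_t p = C->colptr[j]; p < C->colptr[j + 1]; ++p)
                sum += C->val[p] * x[C->rowind[p]];
            zk[j] = sum;
            /* y += C_k(:,j) z_k(j) -- column-parallel on A: different
             * threads may hit the same y entry (concurrent writes) */
            for (size_t p = C->colptr[j]; p < C->colptr[j + 1]; ++p) {
                #pragma omp atomic
                y[C->rowind[p]] += C->val[p] * zk[j];
            }
        }
    }
}
```

Read against the criteria on the next slides, the sketch shows why CRp is attractive: each z_k entry and each column's nonzeros are produced and consumed by the same thread, while the atomic scatter into y is precisely the concurrent-write cost that the SB-based methods confine to a small border.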

Slide 6: Contributions. We identify five quality criteria (QC) that have an impact on the performance of parallel SpAA^Tx, and we propose two methods based on the singly-bordered block-diagonal (SB) form: sbCRp and sbRCp. Matrix A is partitioned into four slices and the submatrices are processed by four threads. For sbCRp (the SB-based Column-Row parallel algorithm), we permute matrix A into a rowwise SB form with diagonal blocks A_11, ..., A_44 and row-border blocks A_B1, ..., A_B4, which induces a columnwise SB form of matrix A^T. For sbRCp (the SB-based Row-Column parallel algorithm), we permute matrix A into a columnwise SB form, which induces a rowwise SB form of matrix A^T. Both methods achieve QC (a) (z-vector reuse) and QC (b) (A-matrix reuse); the objective of minimizing the size of the row/column border in the SB form of A achieves QC (c), (d), and (e) in sbCRp/sbRCp.

Slide 7: Quality criteria for efficient parallelization of SpAA^Tx.
(a) Reusing z-vector entries generated in z = A^T x and then read in y = Az.
(b) Reusing matrix nonzeros (together with their indices) in z = A^T x and y = Az.
(c) Exploiting temporal locality in reading input-vector entries in row-parallel SpMVs.
(d) Exploiting temporal locality in updating output-vector entries in column-parallel SpMVs.
(e) Minimizing the number of concurrent writes performed by different threads in column-parallel SpMVs.
The criteria are tabulated for RRp, CRp, RCp, sbCRp, and sbRCp with the legend: satisfied; satisfied except for the z_B border subvectors; satisfied except for the A_kB/A_Bk border submatrices; not applicable; not satisfied; may be satisfied through row/column reordering. The accompanying figure illustrates CRp (Column-Row parallel): thread k computes z_k = C_k^T x exclusively and then contributes y += C_k z_k concurrently.

Slide 8: Proposed SpAA^Tx algorithms. The SB forms are obtained by partitioning that addresses the criteria directly: maintaining balance on the number of nonzeros in each slice reduces the parallel time under arbitrary task scheduling, while reducing the border size reduces both the number of cache misses due to loss of temporal locality and the number of concurrent writes. For a matrix A partitioned into K slices A_1, ..., A_K (the example shows three slices processed by three threads), the connectivity of a column is λ(c_j) = {A_k : c_j has at least one nonzero in A_k, k = 1, ..., K}, and the connectivity of a row is λ(r_i) = {A_k : r_i has at least one nonzero in A_k, k = 1, ..., K}; rows/columns with connectivity greater than one end up in the border.
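As a small illustration, |λ(c_j)| can be computed from a rowwise partition in one pass over the nonzeros. The bitmask sketch below (assuming K ≤ 64, CSR storage, GCC/Clang builtins, and hypothetical names) is mine; the talk obtains the partitions with the hypergraph partitioner PaToH rather than post-computing connectivities like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* |lambda(c_j)| for every column j of an m-by-n CSR matrix under a
 * K-way rowwise partition (K <= 64 for the bitmask trick). Columns
 * with |lambda(c_j)| > 1 must go to the column border of the SB form. */
int *column_connectivity(size_t m, size_t n,
                         const size_t *rowptr, const size_t *colind,
                         const int *part /* row i -> slice id in 0..K-1 */) {
    uint64_t *seen = calloc(n, sizeof(uint64_t)); /* slice bitmask per column */
    for (size_t i = 0; i < m; ++i)
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; ++p)
            seen[colind[p]] |= UINT64_C(1) << part[i];
    int *lambda = malloc(n * sizeof(int));
    for (size_t j = 0; j < n; ++j)
        lambda[j] = __builtin_popcountll(seen[j]); /* # distinct slices */
    free(seen);
    return lambda;
}
```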

Slide 9: Merits of the singly-bordered block-diagonal (SB) form on CRp. Permuting A into the SB form shrinks the concurrently accessed regions: CRp accesses the whole x and y vectors concurrently, whereas sbCRp accesses only the x_B and y_B border subvectors concurrently. The SB form thus exploits temporal locality in reading x-vector entries in the row-parallel z = A^T x and in updating y-vector entries in the column-parallel y = Az, and minimizing the border size in the SB form minimizes the number of concurrent writes by different threads in the column-parallel y = Az.

Slide 10: Proposed SpAA^Tx algorithms.

sbCRp (SB-based Column-Row parallel), on the rowwise SB form of A (diagonal blocks A_kk, row-border blocks A_Bk; x_B and y_B are the border subvectors):
Require: A_kk and A_Bk matrices; x, y, and z vectors
1: for k = 1 to K in parallel do
2:   z_k = A_kk^T x_k
3:   z_k = z_k + A_Bk^T x_B
4:   y_k = A_kk z_k
5:   y_B = y_B + A_Bk z_k   (concurrent writes)
6: end for

sbRCp (SB-based Row-Column parallel), on the columnwise SB form of A (diagonal blocks A_kk, column-border blocks A_kB; z_B is the border subvector of z, complete only after the first loop finishes):
Require: A_kk and A_kB matrices; x, y, and z vectors
1: for k = 1 to K in parallel do
2:   z_k = A_kk^T x_k
3:   z_B = z_B + A_kB^T x_k   (concurrent writes)
4:   y_k = A_kk z_k
5: end for
6: for k = 1 to K in parallel do
7:   y_k = y_k + A_kB z_B
8: end for
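A minimal OpenMP rendering of sbCRp might look as follows; the blk type, the block-local indexing assumption, and all helper names are mine, and atomic updates stand in for the concurrent writes into y_B (the talk's kernels are more heavily tuned). The zk and yk subvectors are assumed zero on entry:

```c
#include <omp.h>
#include <stddef.h>

/* Minimal CSR block with block-local row/column indices (illustrative). */
typedef struct {
    size_t m; /* rows in this block */
    const size_t *rowptr, *colind;
    const double *val;
} blk;

/* z += B^T x over one CSR block; 'atomic' guards concurrent writes. */
static void add_spmtv(const blk *B, const double *x, double *z, int atomic) {
    for (size_t i = 0; i < B->m; ++i)
        for (size_t p = B->rowptr[i]; p < B->rowptr[i + 1]; ++p) {
            double v = B->val[p] * x[i];
            if (atomic) {
                #pragma omp atomic
                z[B->colind[p]] += v;
            } else
                z[B->colind[p]] += v;
        }
}

/* y += B z over one CSR block; 'atomic' guards concurrent writes. */
static void add_spmv(const blk *B, const double *z, double *y, int atomic) {
    for (size_t i = 0; i < B->m; ++i) {
        double sum = 0.0;
        for (size_t p = B->rowptr[i]; p < B->rowptr[i + 1]; ++p)
            sum += B->val[p] * z[B->colind[p]];
        if (atomic) {
            #pragma omp atomic
            y[i] += sum;
        } else
            y[i] += sum;
    }
}

/* sbCRp sketch: thread k owns diagonal block Akk[k] and border block
 * ABk[k] of the rowwise SB form; xk/zk/yk are the conformal subvectors. */
void sbcrp(const blk *Akk, const blk *ABk, int K,
           double *const *xk, double *const *zk, double *const *yk,
           const double *xB, double *yB) {
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < K; ++k) {
        add_spmtv(&Akk[k], xk[k], zk[k], 0); /* z_k  = A_kk^T x_k (exclusive)  */
        add_spmtv(&ABk[k], xB,    zk[k], 0); /* z_k += A_Bk^T x_B (exclusive)  */
        add_spmv (&Akk[k], zk[k], yk[k], 0); /* y_k  = A_kk z_k   (exclusive)  */
        add_spmv (&ABk[k], zk[k], yB,    1); /* y_B += A_Bk z_k   (concurrent) */
    }
}
```

An sbRCp sketch would instead call add_spmtv with atomic updates to accumulate z_B concurrently, and it needs the barrier between its two parallel loops before y_k += A_kB z_B can read the completed z_B.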

Slide 11: Applicability of CRp, sbCRp, and sbRCp to iterative methods.
- LP via interior point methods [1, 2] (z = A^T x; y = Az): directly applicable.
- CGNE [3] (z = A^T x; β = (z, z)/(q, q); y = Az): directly applicable; the inner product introduces no dependency since it can be delayed.
- LSQR [4] and the Surrogate Constraints method [5, 6] (z = A^T x; w = f(z); y = Aw): directly applicable; only linear vector operations, which need no synchronization, separate the two multiplies.
- BiCG [3], Lanczos bi-orthogonalization [3], and HITS [7, 8] (z = A^T x; y = Aw): independent SpMVs; the two for-loops of sbRCp can be fused.
- Krylov-based balancing [9] and CGNR [3] (z = A^T x; α = ||y||₂² / ||z||₂²; y = A·αw): not applicable due to the inner product and inter-dependency.

Slide 12: Experiments — performance results on Intel Xeon Phi. Average results over 28 sparse matrices from the UFL collection, with up to 20M nonzeros and 3.5M rows/columns. Baseline methods: RRp, CRp, and RCp (OpenMP), plus RRp with the vendor-provided MKL; Reverse Cuthill-McKee (RCM) reordering targets QC (c) and (d). Proposed methods: sbCRp and sbRCp (OpenMP; SB forms computed with PaToH, 3 runs); highly tuned SpMV libraries can be integrated. Parallel SpAA^Tx times are normalized with respect to RRp under the original ordering (smaller is better), reporting the best of 1, 2, 3, and 4 threads per core, for the original and RCM orderings of the best of RRp, MKL, and CRp/RCp, and for the SB forms of the proposed methods. The slide also shows the web-berkstan matrix in its original ordering and in the columnwise SB form used by sbCRp.

Slide 13: Experiments — performance profiles. The profiles compare the proposed methods sbCRp and sbRCp against methods with double storage of A (RRp, MKL) and methods with single storage of A (CRp, RCp), each under the original and RCM orderings. The vertical axis gives the percentage of test instances and the horizontal axis the parallel SpAA^Tx time relative to the best; curves are shown for the best of sbCRp/sbRCp, the best of RRp/MKL, and the best of CRp/RCp.

Slide 14: Experiments — performance results on Xeon. Two Intel Xeon E5 processors; 16 threads with Hyper-Threading. Normalized parallel SpAA^Tx times (smaller is better) are reported for the matrices degme, LargeRegFile, Stanford, and web-berkstan, under the original and RCM orderings of the best of RRp, MKL, and CRp/RCp, and under the SB forms of the proposed methods. Preprocessing overhead, measured as the number of SpAA^Tx operations using RRp needed to amortize it (sbCRp/sbRCp): degme 136; LargeRegFile 143; Stanford 2.

Slide 15: References.
[1] N. Karmarkar, "A new polynomial-time algorithm for linear programming," Proc. 16th Annual ACM Symposium on Theory of Computing, 1984.
[2] S. Mehrotra, "On the implementation of a primal-dual interior point method," SIAM Journal on Optimization, vol. 2, no. 4, 1992.
[3] Y. Saad, Iterative Methods for Sparse Linear Systems. SIAM, 2003.
[4] C. C. Paige and M. A. Saunders, "LSQR: An algorithm for sparse linear equations and sparse least squares," ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 1, 1982.
[5] K. Yang and K. G. Murty, "New iterative methods for linear inequalities," Journal of Optimization Theory and Applications, vol. 72, no. 1, 1992.
[6] B. Uçar, C. Aykanat, M. Ç. Pınar, and T. Malas, "Parallel image restoration using surrogate constraint methods," Journal of Parallel and Distributed Computing, vol. 67, no. 2, 2007.
[7] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, no. 5, 1999.
[8] X. Yang, S. Parthasarathy, and P. Sadayappan, "Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining," Proc. VLDB Endow., vol. 4, no. 4, Jan. 2011.
[9] T.-Y. Chen and J. W. Demmel, "Balancing sparse matrices for computing eigenvalues," Linear Algebra and its Applications, vol. 309, no. 1-3, 2000.
[10] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," Proc. 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2009.
[11] K. Akbudak, E. Kayaaslan, and C. Aykanat, "Hypergraph partitioning based models and methods for exploiting cache locality in sparse matrix-vector multiplication," SIAM Journal on Scientific Computing, vol. 35, no. 3, pp. C237-C262, 2013.
