RELIABLE GENERATION OF HIGH-PERFORMANCE MATRIX ALGEBRA

1 RELIABLE GENERATION OF HIGH-PERFORMANCE MATRIX ALGEBRA
Jeremy G. Siek, University of Colorado at Boulder
Joint work with Liz Jessup, Boyana Norris, Geoffrey Belter, Thomas Nelson
Lake Isabelle, Colorado
Our SC '12 submission is available on my web page.

2 ABSTRACTION VS. PERFORMANCE
A few lines from the PETSc biconjugate gradient method:

    betaold = beta;
    KSP_MatMult(ksp,Amat,Pr,Zr);            /* z <- Kp */
    VecConjugate(Pl);
    KSP_MatMultTranspose(ksp,Amat,Pl,Zl);
    VecConjugate(Pl);
    VecConjugate(Zl);
    VecDot(Zr,Pl,&dpi);                     /* dpi <- z'p */
    a = beta/dpi;                           /* a = beta/p'z */
    VecAXPY(X,a,Pr);                        /* x <- x + ap */

3-5 LARGER SCOPE = BETTER PERFORMANCE
[Figure, built up over three slides: optimization opportunities plotted against scope, with points for BLAS 1, BLAS 2, BLAS 2.5, and BLAS 3. Annotation: 1/1 compute to data.]
But larger scope = more kernels = higher cost.

6 COST & PERFORMANCE
[Figure: man-hours per kernel plotted against reliable performance for four approaches: hand-tuned BLAS, auto-tuned BLAS, general-purpose compilers, and domain-specific compilers.]

7 BUILD TO ORDER BLAS
Kernel Specification:

    A = A + u1 * v1' + u2 * v2'
    x = beta * (A' * y) + z
    w = alpha * (A * x)

-> Build to Order BLAS Compiler
-> Optimized Kernel Implementation in C

8 BUILD TO ORDER BLAS
[Architecture diagram with two groups of components, Linear Algebra Specific Knowledge Bases and General Purpose Infrastructure. Labeled elements: Kernel Specification, Linear Algebra Implementation, Dataflow Transformation Engine, Dataflow Graph, Linear Algebra Optimizations, Analytic Performance Predictor, Stack of Alternatives, Hardware Database, Empirical Performance Evaluation, and the language labels C, C++, and Fortran.]

9 KERNEL SPECIFICATION

    GEMVER
    in u1 : vector, u2 : vector, v1 : vector, v2 : vector,
       alpha : scalar, beta : scalar, y : vector, z : vector
    inout A : dense column matrix
    out x : vector, w : vector
    {
      A = A + u1 * v1' + u2 * v2'
      x = beta * (A' * y) + z
      w = alpha * (A * x)
    }
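
To make the meaning of the GEMVER specification concrete, here is a minimal unoptimized reference sketch in plain Python. It is not output of the BTO compiler. The function name `gemver_reference`, the list-of-rows matrix layout, and returning the updated matrix as a new value `B` are all assumptions made for illustration.

```python
def gemver_reference(A, u1, v1, u2, v2, alpha, beta, y, z):
    """Unfused reference for GEMVER; A is an m-by-n list of rows (assumed layout)."""
    m, n = len(A), len(A[0])
    # A = A + u1 * v1' + u2 * v2'   (two rank-1 updates)
    B = [[A[i][j] + u1[i] * v1[j] + u2[i] * v2[j] for j in range(n)]
         for i in range(m)]
    # x = beta * (A' * y) + z       (uses the updated matrix)
    x = [beta * sum(B[i][j] * y[i] for i in range(m)) + z[j] for j in range(n)]
    # w = alpha * (A * x)           (uses the updated matrix and the new x)
    w = [alpha * sum(B[i][j] * x[j] for j in range(n)) for i in range(m)]
    return B, x, w
```

Each of the three statements makes its own pass over the matrix here, which is the kind of repeated memory traffic that a compiler with a larger scope can try to remove.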

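10 Speed Up Relative to ICC
[Figure: bar chart of speedup relative to ICC for PGI, BLAS (MKL), HAND, BTO, and PLUTO on each kernel: AXPYDOT, VADD, WAXPBY, ATAX, BICGK, DGEMV, DGEMVT, GESUMMV, GEMVER.]
Intel Westmere, 24 core, 2.66 GHz

Kernel    Operation
AXPYDOT   $z \leftarrow w - \alpha v$; $\beta \leftarrow z^T u$
VADD      $x \leftarrow w + y + z$
WAXPBY    $w \leftarrow \alpha x + \beta y$
ATAX      $y \leftarrow A^T A x$
BICGK     $q \leftarrow A p$; $s \leftarrow A^T r$
DGEMV     $z \leftarrow \alpha A x + \beta y$
DGEMVT    $x \leftarrow \beta A^T y + z$; $w \leftarrow \alpha A x$
GEMVER    $B \leftarrow A + u_1 v_1^T + u_2 v_2^T$; $x \leftarrow \beta B^T y + z$; $w \leftarrow \alpha B x$
GESUMMV   $y \leftarrow \alpha A x + \beta B x$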

11 RESULTS ON AMD
Speedups relative to PGI.
[Two tables with columns Kernel, BLAS, Pluto, HAND, BTO and rows AXPYDOT, VADD, WAXPBY, ATAX, BICGK, DGEMV, DGEMVT, GEMVER, GESUMMV: one for AMD Phenom, 6 core, 3.3 GHz, and one for AMD Interlagos, 64 core, 2.2 GHz.]

12 BUILD TO ORDER BLAS
[Architecture diagram, as on slide 8.]

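13 DATAFLOW GRAPH

    A = A + u1 * v1' + u2 * v2'
    x = beta * (A' * y) + z
    w = alpha * (A * x)

[Figure: dataflow graph of these three statements. The inputs u1, v1, u2, v2, A, y, z, alpha, and beta feed transpose (T), multiply (×), and add (+) nodes that produce the outputs x and w.]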

14 TRAVERSAL PATTERNS
orientations  O ::= C | R
types         τ ::= O<τ> | S
Examples: R<S>, R<R<S>>, C<R<S>>

15 LINEAR ALGEBRA DB
Table 1: Sample of the linear algebra knowledge base. Columns on the slide: Algo, Op and Operands, Result Type, Pipe (the Algo entries did not survive).

Op and Operands      Result Type           Pipe
O<τl> + O<τr>        O<τl + τr>            yes
S + S                S                     no
O<τ>^T               O^T<τ^T>              yes
S × S                S                     no
R<τl> × R<τr>        R<R<τl> × τr>         yes
C<τl> × C<τr>        C<τl × C<τr>>         yes
R<τl> × C<τr>        Σ(τl × τr)            no
C<τl> × R<τr>        C<τl × R<τr>>         yes
C<τl> × R<τr>        R<C<τl> × τr>         yes
S × O<τ>             O<S × τ>              yes

16-19 TYPE/ALGO INFERENCE
[Figure, built up over four slides: the dataflow graph from slide 13, annotated step by step with inferred types (C, R<>, R<S>) and selected algorithms (add, outer2).]

20 BUILD TO ORDER BLAS
[Architecture diagram, as on slide 8.]

21 DATAFLOW REFINEMENT
[Figure in three stages. First, a single + node adds column matrices A and B to produce column matrix C. Second, the node is refined into a loop j = 1..n that adds the column vectors A(:,j) and B(:,j) and stores the result in C(:,j). Third, each column addition is refined into a loop i = 1..m that adds the scalars A(i,j) and B(i,j) and stores the result in C(i,j).]
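
The refinement steps on this slide can be sketched as three nested levels of the same addition. This is an illustrative sketch only: the function names and the representation of a column matrix as a Python list of columns are assumptions, not part of BTO.

```python
def add_scalar(a, b):
    # Innermost level: C(i,j) <- A(i,j) + B(i,j)
    return a + b

def add_column(a_col, b_col):
    # Loop i = 1..m: a column-vector add refined into scalar adds
    return [add_scalar(a_col[i], b_col[i]) for i in range(len(a_col))]

def add_matrix(A, B):
    # Loop j = 1..n: a column-matrix add refined into column-vector adds.
    # A and B are lists of columns (assumed layout).
    return [add_column(A[j], B[j]) for j in range(len(A))]
```

Calling `add_matrix([[1, 2], [3, 4]], [[10, 20], [30, 40]])` adds two 2-by-2 matrices column by column.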

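22 OPTIMIZATION FUSION (1)
[Figure: separate loops over i = 1..n that each read A(i) from the same A are merged into a single loop.]

23 OPTIMIZATION FUSION (2)
[Figure: a loop over i = 1..n that writes A(i) and a loop over i = 1..n that reads A(i) are merged into a single loop.]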
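
As we read the two fusion slides, one case merges loops that read the same data and the other merges a producer loop with its consumer. The sketch below shows both on a small made-up computation. The functions `unfused` and `fused` and the quantities they compute are hypothetical examples, not kernels from the talk.

```python
def unfused(A):
    n = len(A)
    # Producer loop: writes a temporary vector t.
    t = [0.0] * n
    for i in range(n):
        t[i] = 2.0 * A[i]
    # Consumer loop: reads t.
    total = 0.0
    for i in range(n):
        total += t[i]
    # A third loop that reads A again.
    norm_sq = 0.0
    for i in range(n):
        norm_sq += A[i] * A[i]
    return total, norm_sq

def fused(A):
    total = 0.0
    norm_sq = 0.0
    for i in range(len(A)):
        a = A[i]           # A[i] is read once and reused
        t_i = 2.0 * a      # the temporary shrinks from a vector to a scalar
        total += t_i
        norm_sq += a * a
    return total, norm_sq
```

Both versions return the same values; the fused one makes a single pass over `A` and needs no temporary vector.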

24 BUILD TO ORDER BLAS
[Architecture diagram, as on slide 8.]

25 SEARCH SPACE
[Figure: the complete search space divided into illegal and legal points. Regions marked: the search space BTO considers, the legal points BTO considers, and the points BTO prunes.]

26 ENUMERATING THE SPACE
We try to avoid even considering illegal points.
Loop fusion is an equivalence relation.
Can't fuse inner loops if you haven't already fused their outer loops.

Example: $y \leftarrow \beta A^T A x$

    a : {{1}}{{2}}{{3}}
    b : {{1}{2}}{{3}}
    c : {{12}}{{3}}
    d : {{1}{3}}{{2}}
    e : {{1}{2}{3}}
    f : {{123}}

Loop nests for a (the slide also labels a listing b, which did not survive):

    for i in 1 to M
      for j in 1 to N
        t0[i] += A[i,j] * x[j]
    for i in 1 to M
      for j in 1 to N
        t1[j] += A[i,j] * t0[i]
    for j in 1 to N
      y[j] = t1[j] * beta
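
Because fusion is an equivalence relation, each fusion choice is a partition of the loop nests, and the rule about inner loops makes the partitions nested. The sketch below enumerates that two-level structure and prints it in the brace notation of this slide. It is an illustration under assumptions: the function names are invented, and it ignores the data-dependence checks that decide which points are actually legal.

```python
from itertools import product

def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partition

def nested_fusions(ops):
    """Yield outer-loop groupings, each refined by an inner-loop grouping.

    Inner loops are only grouped within one outer group, so an inner
    fusion never appears without the corresponding outer fusion.
    """
    for outer in set_partitions(ops):
        inner_choices = [list(set_partitions(block)) for block in outer]
        for combo in product(*inner_choices):
            yield list(combo)

def show(config):
    """Format a configuration in the slide's brace notation."""
    return "".join(
        "{" + "".join("{" + "".join(map(str, inner)) + "}" for inner in outer) + "}"
        for outer in config
    )
```

For three loop nests, `[show(c) for c in nested_fusions([1, 2, 3])]` yields 12 structurally valid configurations, among them `{{1}}{{2}}{{3}}`, `{{12}}{{3}}`, and `{{123}}`. The order of groups within a string may differ from the slide's.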

27 PARTITIONING

    for i in 1 to M
      for j in 1 to N
        t0[i] += A[i,j] * x[j]

[The rest of the slide is garbled. Recoverable fragments: a second loop nest beginning "for i in 1 to M", and descriptors of the forms {i{j 1}}, {p(i){i{j 1}}}, and {p(j){i{j 1}}}.]

28 MFGA SEARCH ALGORITHM
We start with a greedy search technique that we call max-fuse (MF). Then we mutate to seed a genetic algorithm (GA).
  add or remove fusions
  add or remove partitions
  change direction of partition (horizontal/vertical)
  increment/decrement number of threads assigned to a partition
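
The overall shape of such a search (a greedy seed, then mutation and selection) can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the configuration is just four fusion flags plus a thread count, `is_legal` and `cost` are invented stand-ins for BTO's legality checks and timing runs, partition mutations are left out, and the loop uses mutation and elitist selection only, with no crossover.

```python
import random

NUM_FUSIONS = 4    # hypothetical: four candidate loop fusions
MAX_THREADS = 8    # hypothetical thread limit

def is_legal(config):
    """Toy legality rule (assumption): fusion 3 requires fusion 2."""
    fusions, threads = config
    return (not fusions[3] or fusions[2]) and 1 <= threads <= MAX_THREADS

def cost(config):
    """Toy stand-in for a measured run time (lower is better)."""
    fusions, threads = config
    memory_passes = NUM_FUSIONS + 1 - sum(fusions)
    penalty = 2.0 if fusions[1] else 0.0   # pretend fusion 1 hurts
    return (memory_passes + penalty) / min(threads, 4) + 0.05 * threads

def max_fuse():
    """Greedy seed: switch on every fusion that keeps the config legal."""
    fusions = [False] * NUM_FUSIONS
    for k in range(NUM_FUSIONS):
        fusions[k] = True
        if not is_legal((tuple(fusions), 1)):
            fusions[k] = False
    return (tuple(fusions), 1)

def mutate(config, rng):
    """Toggle one fusion or change the thread count by one."""
    fusions, threads = list(config[0]), config[1]
    if rng.random() < 0.5:
        k = rng.randrange(NUM_FUSIONS)
        fusions[k] = not fusions[k]
    else:
        threads += rng.choice([-1, 1])
    candidate = (tuple(fusions), threads)
    return candidate if is_legal(candidate) else config

def mfga(generations=20, population_size=8, seed=0):
    rng = random.Random(seed)
    start = max_fuse()
    # Seed the population with the greedy result and mutations of it.
    population = [start] + [mutate(start, rng) for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: population_size // 2]   # keep the best half
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return start, min(population, key=cost)
```

Because the best half always survives, the result returned by `mfga()` is never worse than the greedy seed under the toy `cost`.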

29 SEARCH TIME VS. PERFORMANCE
[Figure: search time versus performance for GEMVER on Intel Westmere, 24 core.]

30 FUTURE WORK / CONCLUSIONS
We obtain reliable, high-performance matrix algebra:
  1. high-level specification language
  2. careful enumeration of optimization choices
  3. search algorithm: max-fuse + genetic
Future work:
  More parallelism using MPI, GPUs
  More matrix formats: banded, triangular, sparse

Reliable Generation of High-Performance Matrix Algebra
THOMAS NELSON, University of Colorado, Boulder
GEOFFREY BELTER, University of Colorado, Boulder
JEREMY G. SIEK, University of Colorado, Boulder