Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends


Paolo Bientinesi, AICES, RWTH Aachen
ComplexHPC Spring School 2013: Heterogeneous computing, impact on algorithms
June 7th, 2013, Uppsala University, Sweden

Outline
1. Setting the stage
2. Part 1: blocked algorithms
3. Part 2: multithreading, fork-join
4. Part 3: multithreading, algorithms-by-blocks
5. Part 4: streaming

Dense Linear Algebra
[figure: a dense matrix M, shown filling up over several animation steps]

Dense Linear Algebra
- Matrix equations: AX + XB = C, A := (A + A^{-1})/2, ...
- Linear systems: Ax = b, AX = B, least squares, ...
- Eigenproblems: Ax = λx, AX = BXΛ, SVD, ...
- Support routines: factorizations, reductions, ...

Organization in layers

Other libraries: ScaLAPACK, Elemental, PETSc, ...
LAPACK:  LU = A,  LL^T = A,  QR = A,  QTQ^T = A,  ...
BLAS-3:  C := C + AB     A, B, C, L ∈ R^{n×n}          also C := L^{-1}B
BLAS-2:  y := y + Ax     A, L ∈ R^{n×n}, x, y ∈ R^n    also y := L^{-1}x
BLAS-1:  y := y + αx     x, y ∈ R^n, α ∈ R             also dot := α + x^T y
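To make the layering concrete, here is a hedged sketch in C with one call per layer, using the standard CBLAS/LAPACKE interfaces (the function name, sizes, and leading dimensions are illustrative; error checking omitted):

```c
#include <cblas.h>
#include <lapacke.h>

void layers_demo(int n, double *A, double *B, double *C,
                 double *x, double *y, lapack_int *ipiv) {
    /* BLAS-1: y := y + 1.0 * x  (axpy: O(n) flops on O(n) data) */
    cblas_daxpy(n, 1.0, x, 1, y, 1);

    /* BLAS-2: y := y + A x  (gemv: O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 1.0, y, 1);

    /* BLAS-3: C := C + A B  (gemm: O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);

    /* LAPACK: LU = A with partial pivoting, built on top of BLAS-3 */
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
}
```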

Example: AX = B (full A)

AX = B (linear system)
  → LU = A (LU factorization) + two triangular systems (LY = B, then UX = Y)
  → each of these, in turn, is cast in terms of GEMM (C := AB + C)
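In library terms the whole chain is a single driver call; a minimal sketch (LAPACKE interface, column-major, assuming the helper name solve_axb), where dgesv internally performs the LU factorization and the triangular solves:

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve AX = B for a full n x n matrix A and nrhs right-hand sides.
 * On exit, A holds its LU factors (PA = LU) and B is overwritten with X. */
int solve_axb(int n, int nrhs, double *A, double *B) {
    lapack_int *ipiv = malloc(n * sizeof(lapack_int));  /* pivot indices */
    int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, B, n);
    free(ipiv);
    return info;   /* 0 on success; >0 if U is exactly singular */
}
```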

Why BLAS-3? Why GEMM?

  BLAS level                                           #FLOPS   Mem. refs.   Ratio
  Level 3:  C := C + AB   (A, B, C ∈ R^{n×n})          2n^3     4n^2         n/2
  Level 2:  y := y + Ax   (A ∈ R^{n×n}, x, y ∈ R^n)    2n^2     n^2          2
  Level 1:  y := y + αx   (x, y ∈ R^n, α ∈ R)          2n       3n           2/3

Morale
BLAS-3: the larger the problem, the better, as long as it fits in memory.
GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.
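As a worked check of the level-3 row (under the usual convention of counting one memory reference per matrix element read or written):

```latex
\text{flops}(C := C + AB) = 2n^3
  \quad\text{(one multiply and one add per } (i,j,k) \text{ triple)},
\qquad
\text{mem.\ refs} = \underbrace{3n^2}_{\text{read } A,\,B,\,C}
                  + \underbrace{n^2}_{\text{write } C} = 4n^2,
\qquad
\text{ratio} = \frac{2n^3}{4n^2} = \frac{n}{2}.
```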

Part 1: Blocked algorithms

Simple example: Cholesky factorization
Input: matrix A, symmetric and positive definite.
Goal: determine L (lower triangular) such that LL^T = A.

L = [ L_TL    0   ]  = ?
    [ L_BL  L_BR ]

Cholesky factorization, iteration i
[figure: partitioned matrix; the already-factored top-left part is marked DONE]

Cholesky factorization, iteration i + 1

unblocked algorithm: sqrt, trsv, syr (scalar and BLAS-1/2 operations)
blocked algorithm:   CHOL, TRSM, SYRK (BLAS-3 operations)
[figure: in both variants, the already-factored parts are marked DONE]

Cholesky: unblocked vs. blocked algorithms
[figure: performance comparison]
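The blocked variant maps each iteration onto exactly the three kernels above; a hedged C sketch of a right-looking blocked Cholesky (column-major storage; the block size nb and the function name are illustrative assumptions):

```c
#include <cblas.h>
#include <lapacke.h>

/* LL^T = A, lower triangle of A overwritten with L. Returns 0 on success. */
int blocked_cholesky(int n, int nb, double *A, int lda) {
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb;        /* current block size */
        /* CHOL: factor the b x b diagonal block (unblocked kernel) */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b,
                                  &A[k + k * lda], lda);
        if (info != 0) return info;               /* not positive definite */
        if (k + b < n) {
            int m = n - k - b;                    /* rows below the block */
            /* TRSM: L_BL := A_BL * L_TL^{-T} */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                        CblasTrans, CblasNonUnit, m, b, 1.0,
                        &A[k + k * lda], lda, &A[k + b + k * lda], lda);
            /* SYRK: A_BR := A_BR - L_BL * L_BL^T  (trailing update, BLAS-3) */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        m, b, -1.0, &A[k + b + k * lda], lda,
                        1.0, &A[k + b + (k + b) * lda], lda);
        }
    }
    return 0;
}
```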

Part 2: Parallelism? Fork-join

Solution #1: multithreaded BLAS (+ vector instructions)
[diagram: Chol/LU decomposed into TRSM and GEMM calls; parallelism opens and closes inside each call]

Advantage: ease of use. Legacy code!
Drawback: unnecessary synchronization points.
OpenBLAS, ATLAS, BLIS, old versions of MKL, ...
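The appeal is that unchanged sequential code becomes parallel just by linking a threaded BLAS; a small sketch, assuming OpenBLAS (openblas_set_num_threads is OpenBLAS-specific; MKL, ATLAS, and BLIS expose analogous controls or environment variables):

```c
#include <cblas.h>

/* Declared in OpenBLAS's cblas.h; repeated here for self-containment. */
extern void openblas_set_num_threads(int num_threads);

void update(int n, const double *A, const double *B, double *C) {
    openblas_set_num_threads(8);   /* threads are forked inside the library */
    /* One ordinary GEMM call: all parallelism (and the implicit join
     * at the end of the call) happens inside the BLAS, invisible here. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
}
```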

Multithreaded BLAS: Xeon, 32 physical cores, MKL
[figure: efficiency of GEMM vs. matrix dimension, for 1, 2, 4, 8, 16, and 32 threads]

Development of LA libraries

Whenever a new architecture / new architectural features appear, support is built up in stages:
- GEMM                                      → peak performance [cf. FFT, SpMV]
- BLAS-3, factorizations, AX = B            → LINPACK benchmark
- factorizations, AX = B
- factorizations, AX = B, support routines
- ...
- eigenproblems

Examples: multicores
GEMM, multithreaded BLAS
HPL:
  2004: Blue Gene/L, dual-core nodes (PowerPC)
  2008: Roadrunner, dual-core Opterons
  2009: Jaguar, 6-core Opterons
  2011: K Computer, 8-core SPARC64
  ...
libflame, PLASMA, MKL
Eigensolvers: 2010

Examples: Cell
GEMM: 99% of peak; no full BLAS [FFT]
HPL:
  2008: Roadrunner, >1 PetaFLOP, 6k * (Opteron + 2 Cell)
No libraries.
2009: discontinued
Eigensolvers: (none)

Examples: GPGPUs (since 2005)
cuBLAS (*)
HPL:
  2010: Tianhe-1A, 2.5k dual-GPU nodes (ATI Radeon)
CULA (*); libflame, MAGMA (multi-GPU)
Eigensolvers: (none)

Present: Intel Xeon Phi (MIC)
GEMM (*), MKL
HPL (predicted)
2013: Tianhe-2, 16k * (2 Ivy Bridge + 3 Phi)

Also ARM, FPGAs, DSPs, ...

Future
2017: Exascale (?); heterogeneous architecture (???)

Back to the algorithms: Cholesky factorization

Fork-join: unnecessary synchronizations
[diagram: iteration 1 (CHOL, then TRSMs, then SYRKs) followed by iteration 2, with a full join between stages]

Part 3: Parallelism? Algorithms-by-blocks

Idea: decompose the tasks. Solution #2: dependent tasks + scheduling.

Partition the operands into blocks, turning one large operation into many small, schedulable tasks:

A = [A_0; A_1; A_2],  B = [B_0; B_1; B_2]   ⇒   A_0 X = B_0,  A_1 X = B_1,  A_2 X = B_2

Storage by blocks
[diagram: the Cholesky kernels of iterations 1 and 2 (CHOL, TRSM, SYRK), now operating on individually stored tiles]

Creating tasks

Iteration 1: CHOL, TRSM, SYRK, TRSM, GEMM, SYRK, TRSM, GEMM, GEMM, SYRK
Iteration 2: CHOL, TRSM, SYRK, TRSM, GEMM, SYRK
Iteration 3: CHOL, TRSM, SYRK
Iteration 4: CHOL
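The task-creation loop can be written directly on top of a task-based runtime; a hedged sketch using OpenMP task dependencies, from which the runtime derives the DAG shown on the next slides (the tile size B, tile count NT, and the pointer-per-tile layout are illustrative assumptions; libflame/SuperMatrix and PLASMA use their own runtimes):

```c
#include <cblas.h>
#include <lapacke.h>

#define NT 4     /* number of tile rows/cols (4 x 4-tile matrix) */
#define B  256   /* tile size; illustrative */

/* Tiled Cholesky: tasks + dependencies instead of fork-join.
 * T[i][j] points to a column-major B x B tile. */
void chol_by_blocks(double *T[NT][NT]) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: T[k][k][0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', B, T[k][k], B);     /* CHOL */
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: T[k][k][0]) depend(inout: T[i][k][0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,    /* TRSM */
                        CblasTrans, CblasNonUnit, B, B, 1.0,
                        T[k][k], B, T[i][k], B);
        }
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: T[i][k][0]) depend(inout: T[i][i][0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,  /* SYRK */
                        B, B, -1.0, T[i][k], B, 1.0, T[i][i], B);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: T[i][k][0], T[j][k][0]) \
                                 depend(inout: T[i][j][0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, /* GEMM */
                            B, B, B, -1.0, T[i][k], B, T[j][k], B,
                            1.0, T[i][j], B);
            }
        }
    }   /* tasks complete at the implicit barrier of the parallel region */
}
```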

Dependencies
[diagram, shown step by step: for the iteration-1 tasks (CHOL, 3 TRSM, 3 SYRK, 3 GEMM), arrows indicate which tasks consume each task's output]

DAG of dependencies, 4 x 4-tile matrix
[figure: the complete task DAG: CHOL feeding 3 TRSMs, feeding the SYRKs/GEMMs; then the second-iteration CHOL, 2 TRSMs, SYRKs/GEMMs; then CHOL, TRSM, SYRK; and the final CHOL]

Task execution, 4 x 4-tile matrix

Stage   Scheduled tasks
 1      CHOL
 2      TRSM TRSM TRSM TRSM
 3      SYRK GEMM SYRK GEMM
 4      GEMM SYRK GEMM GEMM
 5      GEMM SYRK CHOL
 6      TRSM TRSM TRSM
 7      SYRK GEMM SYRK GEMM
 8      GEMM SYRK CHOL
 9      TRSM TRSM
10      SYRK GEMM SYRK
11      CHOL
12      TRSM
13      SYRK
14      CHOL

SPD Inverse: Chol + Inv + GEMM, 5 x 5-tile matrix

Stage   Scheduled tasks
 1      CHOL
 2      TRSM TRSM TRSM TRSM
 3      SYRK GEMM SYRK GEMM
 4      GEMM SYRK GEMM GEMM
 5      GEMM SYRK CHOL TRSM
 6      TRSM TRSM TRSM TRSM
 7      TRSM TRSM TRINV SYRK
 8      GEMM SYRK GEMM GEMM
 9      SYRK TTMM CHOL TRSM
10      TRSM TRSM TRSM TRSM
11      GEMM GEMM GEMM SYRK
12      GEMM SYRK TRSM CHOL
13      TRSM TRSM TRINV SYRK
14      TRSM GEMM GEMM GEMM
15      GEMM TRMM SYRK TRSM
16      TRSM TTMM CHOL TRSM
17      SYRK TRINV GEMM SYRK
18      GEMM GEMM GEMM TRMM
19      TRMM TRSM TRSM TRSM
20      TRSM TRSM TRSM TRSM
21      TTMM SYRK GEMM SYRK
22      TRINV GEMM GEMM TRINV
23      SYRK SYRK GEMM SYRK
24      TRMM GEMM TRMM GEMM
25      TRMM SYRK GEMM GEMM
26      TTMM GEMM TRMM TRMM
27      SYRK TRMM
28      TRMM
29      TTMM

Blocked vs. by-blocks: crossover
[figure: performance of Cholesky (LL^T = A), GFLOPS vs. matrix dimension, comparing SuperMatrix (algorithms-by-blocks) against blocked FLAME]

Multithreaded BLAS vs. algorithms-by-blocks

- No absolute winner: crossover!
- Ease of use
- Synchronization
- Out-of-order execution
- Parallelism dictated by data dependencies
- Plateaus
- libflame, MKL, PLASMA, ...
- Heterogeneous systems: schedulers

Part 4: Streaming

The entire problem does not fit even in main memory.

Example: b := (X^T M^{-1} X)^{-1} X^T M^{-1} y
Linear regression with non-independent outcomes.

Inputs:  M ∈ R^{n×n}, SPD, n ∈ [10^3, ..., 10^4]
         X ∈ R^{n×p}, full rank, p ∈ [1, ..., 20]
         y ∈ R^n
Output:  b ∈ R^p
A sequence of thousands of such problems.

Algorithm (where is the bottleneck?)

LL^T = M
X := L^{-1} X     ← bottleneck; offload to the accelerator (TRSM)
S := X^T X
GG^T = S
y := L^{-1} y
b := X^T y
b := G^{-1} b
b := G^{-T} b
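Spelled out in BLAS/LAPACK calls, the algorithm is a short sequence of kernels; a hedged C sketch (column-major, error handling omitted, function name illustrative; the fixed S buffer relies on the slide's bound p ≤ 20):

```c
#include <cblas.h>
#include <lapacke.h>

void gls_solve(int n, int p, double *M, double *X, double *y, double *b) {
    /* LL^T = M (L overwrites the lower triangle of M) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, M, n);
    /* X := L^{-1} X  -- the bottleneck; this is the TRSM to offload */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, p, 1.0, M, n, X, n);
    /* S := X^T X (SYRK), p x p */
    double S[20 * 20];                 /* p <= 20 per the slides */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans, p, n,
                1.0, X, n, 0.0, S, p);
    /* GG^T = S */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', p, S, p);
    /* y := L^{-1} y */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, M, n, y, 1);
    /* b := X^T y */
    cblas_dgemv(CblasColMajor, CblasTrans, n, p, 1.0, X, n, y, 1, 0.0, b, 1);
    /* b := G^{-1} b, then b := G^{-T} b */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                p, S, p, b, 1);
    cblas_dtrsv(CblasColMajor, CblasLower, CblasTrans, CblasNonUnit,
                p, S, p, b, 1);
}
```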

Double + triple buffering

[diagram: pipeline across HDD, CPU, and GPU. While the GPU runs the TRSM for problem b, problem b+1 is being sent to the GPU; meanwhile the CPU receives the result of problem b-1 and finishes its post-processing, and the HDD prefetches problem b+2 while writing back result b-1.]
Data dependencies; asynchronous dispatch.
[diagram: the GPU double-buffers (α, β); the CPU/RAM triple-buffers (A, B, C) between the data X_b streamed from the HDD and the results r_b written back]
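A minimal sketch of the CPU-GPU double-buffering level, in C with the CUDA runtime (the problem size SZ, count NP, and the kernel compute_on_gpu are hypothetical; for real overlap the host buffers would also need to be pinned, and the HDD stage would add the third buffering level):

```c
#include <cuda_runtime.h>

enum { SZ = 1 << 20, NP = 1000 };   /* illustrative problem size / count */

/* Hypothetical GPU kernel launcher (e.g., wrapping cublasDtrsm). */
extern void compute_on_gpu(double *dev_buf, int size, cudaStream_t s);

void stream_problems(double *host_data /* NP problems of SZ doubles */) {
    double *dev[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&dev[i], SZ * sizeof(double));
        cudaStreamCreate(&s[i]);
    }
    for (int b = 0; b < NP; b++) {
        int cur = b & 1;            /* alternate between the two buffers */
        /* Asynchronous H2D copy of problem b on its own stream ... */
        cudaMemcpyAsync(dev[cur], host_data + (size_t)b * SZ,
                        SZ * sizeof(double), cudaMemcpyHostToDevice, s[cur]);
        /* ... compute enqueued behind it on the same stream, while the
         * other stream is still transferring/computing problem b-1. */
        compute_on_gpu(dev[cur], SZ, s[cur]);
        cudaMemcpyAsync(host_data + (size_t)b * SZ, dev[cur],
                        SZ * sizeof(double), cudaMemcpyDeviceToHost, s[cur]);
    }
    cudaDeviceSynchronize();        /* drain both pipelines (cleanup omitted) */
}
```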

Summary

- Dense linear algebra: it's a matter of layers.
- GEMM = the foundation, THE building block.
- Irrespective of the architecture, there is a need for a highly optimized BLAS.
- How to use threads/cores? Multithreaded BLAS vs. algorithms-by-blocks.
- Large/many problems but limited memory: streaming.

References
MKL, cuBLAS, CULA, libflame, BLIS, OpenBLAS, MAGMA, PLASMA
