Task based parallelization of recursive linear algebra routines using Kaapi


1 Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée Runtime, Paris. Supported by OpenDreamKit Horizon 2020 European Research Infrastructures project (# ) 1 / 28

2 High performance algebraic computing
Domain of computation: Z, Q (variable size, multi-precision); Z_p, GF(p^k) (fixed size, specific arithmetic).
Common belief: slow, with terrible complexities, no need for all the precision.
Example (linear system solving over Q):
Naive Gaussian elimination over Q: O(2^n)
Gauss mod det: O(n^5)
Gauss mod p + CRT: O(n^4), O(n^(ω+1))
p-adic lifting: O(n^3), O(n^ω)
And fast software: LU over Z/65521Z in 3.8s (21.8 Gfops on one Haswell core). 2 / 28

3 Gaussian elimination in computer algebra
Applications:
Algebraic cryptanalysis (RSA, DLP): LinSys, Krylov over F_q
Computational number theory (modular forms databases): echelon forms over F_q
Exact mixed-integer linear programming: LinSys over Q
Formal proof (sums of squares): Cholesky over Q
HPC building block: dense linear algebra over Z/pZ with log₂ p-bit primes; MatMul (fgemm) and Gaussian elimination (PLUQ); triangular decomposition PLUQ (for LinSys, Det); linear dependencies (Krylov, Gröbner basis). 3 / 28

4 FFLAS-FFPACK library
FFLAS-FFPACK features: high-performance implementation of basic linear algebra routines over word-size prime fields; exact alternative to the numerical BLAS library; exact triangularization, system solving, Det, Inv, CharPoly.
[Figure: matrix multiplication mod a word-size prime on Intel Sandy Bridge 2.6GHz with OpenBLAS; speed n³/10⁹/t vs. n; curves: fgemm+strassen, dgemm, fgemm.] 4 / 28
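Not from the slides: a minimal usage sketch of the fgemm routine advertised above, multiplying two matrices over Z/65521Z. The header path, allocator names and exact signature are reproduced from memory of the FFLAS-FFPACK interface and should be treated as assumptions.

// Sketch: C <- A*B over Z/65521Z with FFLAS::fgemm, the exact
// analogue of BLAS dgemm (interface details assumed, not verified).
#include <fflas-ffpack/fflas-ffpack.h>

int main() {
    Givaro::Modular<double> F(65521);            // the prime field Z/65521Z
    const size_t n = 512;
    double* A = FFLAS::fflas_new(F, n, n);       // field-aware allocation
    double* B = FFLAS::fflas_new(F, n, n);
    double* C = FFLAS::fflas_new(F, n, n);
    for (size_t i = 0; i < n * n; ++i) { F.init(A[i], i); F.init(B[i], i + 1); }
    FFLAS::fgemm(F, FFLAS::FflasNoTrans, FFLAS::FflasNoTrans,
                 n, n, n, F.one, A, n, B, n, F.zero, C, n);
    FFLAS::fflas_delete(A); FFLAS::fflas_delete(B); FFLAS::fflas_delete(C);
    return 0;
}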

5 Exact vs numerical Gaussian elimination
Similarities:
Reduction to the gemm kernel (matrix multiplication).
Blocking: slab/tiled, iterative/recursive.
Parallel blocking is constrained by pivoting; numeric: ensuring numerical stability; exact: able to reveal the rank profile.
Specificities:
Recursive tasks (vs. block iterative in numeric).
Modular reduction efficiency increases with the granularity.
Strassen's algorithm: trade-off between total work and fine granularity.
Pivoting strategies: no stability constraints, but rank profiles.
Rank deficiencies: blocks have unpredictable sizes (and positions), hence unbalanced task loads. 5 / 28

10 Block algorithms
Three blockings: Tiled Iterative, Slab Recursive, Tiled Recursive.
Each step of the elimination applies three kernels:
getrf: A → L, U
trsm: B ← BU⁻¹, B ← L⁻¹B
gemm: C ← C − A·B
(Animation: the successive getrf/trsm/gemm updates for each of the three blockings.) 6 / 28

27 Need for a high level parallel programming environment
Features required: portability, performance and scalability. More precisely:
Runtime system with good performance for recursive tasks.
Dataflow task synchronization.
[Figure: task DAGs of a tiled LU, comparing explicit synchronization (waiting for all tasks between panels) with dataflow dependencies between the LU, ApplyP, FTRSM and FGEMM tasks over time.]
Handle unbalanced workloads efficiently.
Efficient range cutting for parallel for.
Wish: design the code independently of the runtime system, using runtime systems as plugins. 7 / 28

29 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 8 / 28

30 Runtime systems to be supported
OpenMP 3.x and 4.0 supported directives (using libgomp):
Data-sharing attributes: shared (OMP3): data visible and accessible by all threads; firstprivate (OMP3): local copy of the original value; depend (OMP4): set data dependencies.
Synchronization clauses: #pragma omp taskwait.
XKaapi, via the libkomp [BDG12] library: OpenMP directives mapped to XKaapi tasks; re-implementation of task handling and management; better execution of recursive tasks.
TBB: designed for nested and recursive parallelism; parallel_for; tbb::task_group, wait(), run() using C++11 lambda functions. 9 / 28
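Not from the slides: a minimal, self-contained sketch of the OMP4 depend clause and the taskwait listed above, chaining two tasks on a small hypothetical array (the data and the doubling kernel are illustrative only).

#include <cstdio>

int main() {
    double a[4] = {1, 2, 3, 4};
    #pragma omp parallel
    #pragma omp single
    {
        // producer: writes a, declared with depend(out: ...)
        #pragma omp task shared(a) depend(out: a)
        for (int i = 0; i < 4; ++i) a[i] *= 2.0;
        // consumer: depend(in: a) forces it to run after the producer
        #pragma omp task shared(a) depend(in: a)
        printf("a[0] after doubling: %g\n", a[0]);
        #pragma omp taskwait   // explicit synchronization point
    }
    return 0;
}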

31 PALADIn: Parallel Algebraic Linear Algebra Dedicated Interface
Mainly macro-based keywords: no function-call runtime overhead when using macros; no important modifications to be made to the original program; macros can also be used from C-based libraries.
Complementary C++ template functions: implement the different cutting strategies; store the iterators. 10 / 28

32 PALADIn description: task parallelism
Task parallelization: fork-join and dataflow models. Example keywords:
PAR_BLOCK: opens a parallel region.
SYNCH_GROUP: group of tasks synchronized upon exit.
TASK: creates a task.
REFERENCE(args...): specify variables captured by reference (by default all variables are captured by value).
READ(args...): variables that are read only.
WRITE(args...): variables that are written only.
READWRITE(args...): variables that are read then written.

void axpy(const Element a, const Element x, Element &y) { y += a * x; }

SYNCH_GROUP(
  TASK(MODE(READ(a, x) READWRITE(y)),
       axpy(a, x, y));
);
11 / 28
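For comparison with the TBB backend of slide 30, here is a minimal sketch (an assumption about the mapping, not PALADIn's actual macro expansion) of the same axpy task written directly with tbb::task_group and a C++11 lambda:

#include <tbb/task_group.h>
#include <cstdio>

using Element = double;

void axpy(const Element a, const Element x, Element& y) { y += a * x; }

int main() {
    Element a = 2.0, x = 3.0, y = 1.0;
    tbb::task_group g;
    g.run([&] { axpy(a, x, y); });  // spawn the task; y captured by reference
    g.wait();                       // synchronize on exit, like SYNCH_GROUP
    printf("y = %g\n", y);          // prints 7
    return 0;
}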

33 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 12 / 28

34 Parallel matrix multiplication
Iterative variants:
Fixed block size (FIXED, GRAIN): better control of data mapping in memory; complexity O(n³).
Fixed number of tasks (THREADS): less control of data mapping in memory; complexity O(n^ω).
Recursive variants: almost no control of data mapping in memory; complexity O(n^ω) or O(n³).
[Figure: iterative splitting of A into panels A₁, A₂ against B and C, vs. recursive 2×2 splitting into blocks A₁₁..A₂₂, B₁₁..B₂₂, C₁₁..C₂₂.] 13 / 28
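Not from the slides: a minimal sketch of the recursive two-d splitting with OpenMP tasks, on plain row-major doubles (the cutoff and the power-of-two assumption are illustrative). The four quadrants of C are updated in two rounds so that concurrent tasks never write the same block; call it from inside a parallel/single region.

#include <cstddef>

// Sequential base case: C += A*B, all n x n with leading dimension ld.
void gemm_base(const double* A, const double* B, double* C,
               std::size_t n, std::size_t ld) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
}

void gemm_rec(const double* A, const double* B, double* C,
              std::size_t n, std::size_t ld, std::size_t cutoff) {
    if (n <= cutoff) { gemm_base(A, B, C, n, ld); return; }
    const std::size_t h = n / 2;               // assume n is a power of two
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    // Round 1: four products writing to pairwise distinct C quadrants.
    #pragma omp task
    gemm_rec(A11, B11, C11, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A11, B12, C12, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A21, B11, C21, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A21, B12, C22, h, ld, cutoff);
    #pragma omp taskwait
    // Round 2: accumulate the remaining four products into the same quadrants.
    #pragma omp task
    gemm_rec(A12, B21, C11, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A12, B22, C12, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A22, B21, C21, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A22, B22, C22, h, ld, cutoff);
    #pragma omp taskwait
}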

35 Performance of pfgemm
[Figure: Speed of MatMul variants using OpenMP tasks; pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; curves: iter(block-threads), rec(two-d), rec(two-d-adapt), rec(three-d), rec(three-d-adapt).] 14 / 28

36 Performance of pfgemm
[Figure: Speed of MatMul variants using Intel TBB tasks; pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; same curves as above.] 14 / 28

37 Performance of pfgemm
[Figure: Speed of MatMul variants using XKaapi tasks (libkomp); pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; same curves as above.] 14 / 28

38 Parallel Matrix Multiplication: state of the art
HPAC server: 32 cores Xeon E GHz (4 NUMA sockets).
Effective Gfops = (# of field ops using the classic matrix product) / time.
Comparison of our best implementations with state-of-the-art numerical libraries.
[Figure: effective Gfops vs. matrix dimension; n³ peak performance on 32 cores; curves: WinogradPar->ClassicPar<double>, ClassicPar->WinogradSeq<double>, MKL dgemm, OpenBLAS dgemm, PLASMA-QUARK dgemm, BensonBallard (Strassen).] 15 / 28

40 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 16 / 28

41 Parallel triangular solving with matrix right-hand side
Iterative variant: [X₁ ... X_k] ← L⁻¹[B₁ ... B_k]; the computation of each Xᵢ ← L⁻¹Bᵢ is independent: k sequential tasks, with k set as the number of available threads.
Recursive variant:
1: Split [X₁; X₂] = [L₁ 0; L₂ L₃]⁻¹ [B₁; B₂]
2: X₁ ← L₁⁻¹B₁
3: X₂ ← B₂ − L₂X₁ // parallel MatMul
4: X₂ ← L₃⁻¹X₂
Hybrid PFTRSM: when the column dimension of B is small, use the iterative splitting in priority; when #cols(X) < #proc, save some threads for the recursive calls. 17 / 28
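Not from the slides: a sequential sketch of the recursive variant just described, over plain doubles, overwriting B with X = L⁻¹B (step 3 is the update that the slide parallelizes as a MatMul; the names and cutoff are illustrative).

#include <cstddef>

// Base case: forward substitution solving L*X = B, L lower triangular
// (n x n, leading dimension ldl), B overwritten by X (n x m, ld ldb).
void trsm_base(const double* L, double* B, std::size_t n, std::size_t m,
               std::size_t ldl, std::size_t ldb) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < i; ++k)
            for (std::size_t j = 0; j < m; ++j)
                B[i*ldb + j] -= L[i*ldl + k] * B[k*ldb + j];
        for (std::size_t j = 0; j < m; ++j)
            B[i*ldb + j] /= L[i*ldl + i];
    }
}

void trsm_rec(const double* L, double* B, std::size_t n, std::size_t m,
              std::size_t ldl, std::size_t ldb, std::size_t cutoff) {
    if (n <= cutoff) { trsm_base(L, B, n, m, ldl, ldb); return; }
    const std::size_t h = n / 2;
    const double* L1 = L;               // top-left triangle
    const double* L2 = L + h*ldl;       // bottom-left rectangle
    const double* L3 = L + h*ldl + h;   // bottom-right triangle
    double* B1 = B;
    double* B2 = B + h*ldb;
    trsm_rec(L1, B1, h, m, ldl, ldb, cutoff);       // 2: X1 <- L1^{-1} B1
    for (std::size_t i = 0; i < n - h; ++i)         // 3: X2 <- B2 - L2*X1,
        for (std::size_t k = 0; k < h; ++k)         //    the parallel MatMul
            for (std::size_t j = 0; j < m; ++j)
                B2[i*ldb + j] -= L2[i*ldl + k] * B1[k*ldb + j];
    trsm_rec(L3, B2, n - h, m, ldl, ldb, cutoff);   // 4: X2 <- L3^{-1} X2
}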

43 Parallel triangular solving with matrix right-hand side: experiments
pftrsm on 32 cores Xeon E GHz, solving LX = B.
[Figure: Comparing the Iterative and the Hybrid (threshold = 256) variants for parallel FTRSM using libkomp and libgomp, against Intel MKL dtrsm and PLASMA-QUARK dtrsm; effective Gfops vs. n, the column dimension of B. Only the outer dimension varies.] 18 / 28

44 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 19 / 28

45 Gaussian elimination design
Reducing to MatMul: block versions are asymptotically faster (O(n^ω)) and have better cache efficiency.
Variants of block versions:
Split on one dimension (row or column slab cutting): slab iterative, slab recursive.
Split on two dimensions (tile cutting): tile iterative, tile recursive.
Iterative: static, hence better data-mapping control; dataflow parallel model, hence less synchronization.
Recursive: adaptive, with sub-cubic complexity; no dataflow, hence more synchronization. 20 / 28

48 Parallel tile recursive PLUQ algorithm: 2×2 block splitting. 21 / 28

49 Parallel tile recursive PLUQ algorithm Recursive call 21 / 28

50 Parallel tile recursive PLUQ algorithm
ptrsm: B ← BU⁻¹
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

51 Parallel tile recursive PLUQ algorithm
ptrsm: B ← L⁻¹B
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

52 Parallel tile recursive PLUQ algorithm
pfgemm: C ← C − A·B (the same update task is issued for each of the three remaining blocks)
TASK(MODE(READ(A, B) READWRITE(C)), pfgemm(..., A, lda, B, ldb)); 21 / 28

55 Parallel tile recursive PLUQ algorithm
2 independent recursive calls (concurrent tasks)
TASK(MODE(READWRITE(A)), ppluq(..., A, lda)); 21 / 28

56 Parallel tile recursive PLUQ algorithm
ptrsm: B ← BU⁻¹
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

57 Parallel tile recursive PLUQ algorithm
ptrsm: B ← L⁻¹B
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

58 Parallel tile recursive PLUQ algorithm
pfgemm: C ← C − A·B (again issued for each of the three remaining blocks)
TASK(MODE(READ(A, B) READWRITE(C)), pfgemm(..., A, lda, B, ldb)); 21 / 28

61 Parallel tile recursive PLUQ algorithm Recursive call 21 / 28

62 Parallel tile recursive PLUQ algorithm
Puzzle game (block permutations). Tile recursive: better data locality and more square blocks for matrix multiplication. 21 / 28
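As a summary of the walk-through above, here is one recursion level written out for the full-rank case, with the permutations omitted; this is a simplifying assumption, since in the actual rank-revealing PLUQ the block sizes are data-dependent and the rank-deficient remainders are what gives rise to the two independent recursive calls of slide 55.

% One level of the 2x2 tile recursive elimination (full-rank sketch)
\begin{align*}
A &= \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}
   = \begin{pmatrix} L_{1} & 0\\ L_{21} & L_{2} \end{pmatrix}
     \begin{pmatrix} U_{1} & U_{12}\\ 0 & U_{2} \end{pmatrix},\\
A_{11} &= L_{1}U_{1} \quad\text{(recursive call)},\\
U_{12} &= L_{1}^{-1}A_{12}, \qquad L_{21} = A_{21}U_{1}^{-1} \quad\text{(ptrsm)},\\
A_{22} &\leftarrow A_{22} - L_{21}U_{12} \quad\text{(pfgemm)},\\
A_{22} &= L_{2}U_{2} \quad\text{(recursive call)}.
\end{align*}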

63 State of the art: exact vs numerical linear algebra
State-of-the-art comparison: exact PLUQ using the PALADIn language (best performance obtained with XKaapi) vs. the numerical LU (dgetrf) of PLASMA-Quark and MKL.
[Figure: parallel dgetrf vs. parallel PLUQ on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch pluq<double>, MKL dgetrf, PLASMA-Quark dgetrf tiled storage (k=212), PLASMA-Quark dgetrf (k=212).] 22 / 28

64 Performance of parallel PLUQ decomposition
Low impact of the modular reductions in parallel; efficient SIMD implementation.
[Figure: tile recursive PLUQ on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch pluq rec<double>, explicit synch pluq rec<131071>.] 23 / 28

65 Performance of task parallelism: dataflow model
[Figure: tile PLUQ, recursive vs. iterative, on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch PLUQ rec<131071>, dataflow synch PLUQ rec<131071>, dataflow synch PLUQ iter<131071>, explicit synch PLUQ iter<131071>.]
Possible improvement: implementing the delegation of recursive-task dependencies (a Postpone access mode in the parallel programming environments). 24 / 28

68 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 25 / 28

69 Conclusion
Lessons learnt for the parallelization of LU over Z/pZ:
Blocking impacts the arithmetic cost: fine granularity hurts.
Rank deficiency can offer more parallelism (cf. separators).
Sub-cubic performance in parallel requires a runtime that is efficient for recursive tasks (XKaapi). 26 / 28

70 Perspectives
Dataflow task dependencies [figure: task DAG of a tiled LU, with LU, ApplyP, FTRSM and FGEMM tasks ordered over time]: already in use in tiled iterative algorithms (XKaapi).
New challenges for recursive tasks: recursive inclusion of sub-matrices; postponed modes (removing fake dependencies); distributed execution on small-sized clusters. 27 / 28

71 Thank you 28 / 28
