Adaptive Parallel Exact dense LU factorization
1 Adaptive Parallel Exact dense LU factorization. Ziad SULTAN, 15 May 2013, JNCF 2013, Université de Grenoble
2 Outline
1 Introduction
2 Exact Gaussian elimination: Gaussian elimination in numerical computation; exact Gaussian elimination; rank profile
3 Dense linear algebra: optimized building blocks
4 Block generic full rank matrices: tiled LU factorization; parallelization of block LU; speedup
5 Any rank profile: tiled CUP decomposition; parallelization with OpenMP; parallelization of block CUP with KAAPI
6 Perspectives
7 Conclusion
4 Gaussian elimination in dense computer algebra
Dense: benchmarking of supercomputers; the basis of linear algebra.
Sparse: large sparse matrix problems reduce to smaller (still large!) dense problems.
Sparse iterative: induces dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form).
Sparse direct: switch to dense after fill-in [FGB].
6 Gaussian elimination in numerical computation
Pivoting strategies: search for the best pivot, for good numerical stability and good data locality.
Reduce the fill-in: reduce additional memory needs and induced computation costs.
7 Applications of exact Gaussian elimination
Exact rank: algebraic topology (Smith normal form).
Rank profile: Gröbner basis computation [FGB]; computational number theory [Stein].
Characteristic polynomial: graph theory [G. Royle]; coding theory; semi-fields.
8 Rank profile
Definition (row/column rank profile): the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent.
A matrix has generic rank profile if its first r leading principal minors are nonzero; the row rank profile of a generic rank profile matrix is the sequence {1,...,r}.
10 Optimized building block in dense linear algebra: matrix multiplication
Algorithmic complexity: Strassen O(n^2.81), ..., O(n^ω).
Optimized hardware implementation: pipelining, SSE, AVX, ...
Implementation: block versions with cache optimization, reducing dependency on bus speed (faster computation for blocks loaded in cache); recursive, iterative, or cascading.
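As a concrete illustration of the sub-cubic complexity mentioned above, one level of Strassen's scheme multiplies 2x2 operands with 7 multiplications instead of 8, which recursively yields the O(n^2.81) bound. The scalar sketch below (hypothetical helper strassen2x2, not from any library) shows the classical formulas; real implementations apply them to matrix blocks and cascade to a classical base case below a threshold.

```cpp
#include <array>
#include <cassert>

// One level of Strassen's 2x2 scheme: 7 products p1..p7 instead of 8.
// Operands and result are row-major {a, b, c, d} for [a b; c d].
std::array<double, 4> strassen2x2(const std::array<double, 4>& A,
                                  const std::array<double, 4>& B) {
    double a = A[0], b = A[1], c = A[2], d = A[3];
    double e = B[0], f = B[1], g = B[2], h = B[3];
    double p1 = a * (f - h);
    double p2 = (a + b) * h;
    double p3 = (c + d) * e;
    double p4 = d * (g - e);
    double p5 = (a + d) * (e + h);
    double p6 = (b - d) * (g + h);
    double p7 = (a - c) * (e + f);
    return { p5 + p4 - p2 + p6,   // C11 = ae + bg
             p1 + p2,             // C12 = af + bh
             p3 + p4,             // C21 = ce + dg
             p1 + p5 - p3 - p7 }; // C22 = cf + dh
}
```

Applied recursively to half-size blocks, the 7 recursive products dominate, giving cost T(n) = 7 T(n/2) + O(n^2) = O(n^log2(7)).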
11 Gaussian elimination concerns
Same concerns as matrix multiplication, hence block versions: the implementation benefits from optimized matrix multiplication, reduces dependency on bus speed (cache optimization), and the best versions can be adapted for parallel computing.
Candidates: tiled iterative implementation; block recursive implementation.
12 Exact Gaussian elimination adapted for parallel computing: the block-version trade-off
Common point: fewer memory accesses when the block fits in cache, about N^3/B memory accesses (N the matrix dimension, B the block size).
Trade-off: block recursive is more adaptive; tiled iterative needs fewer synchronizations.
Historically, recursive implementations are harder to parallelize with existing parallel programming models (OpenMP, ...).
13 State of the art
Sequential exact: FFLAS-FFPACK, M4RI, within FGB.
Parallel numeric: ScaLAPACK, PLASMA-Quark.
Parallel exact: ?? (this work).
15 LU factorization of generic rank profile matrices
A = L.U.P
Applications of the LU decomposition:
Solving a system A.x = b: write L.(U.P.x) = b, solve L.y = b then U.P.x = y.
Rank: rank(A) is the number of rows of U.
Inverse of A: A^{-1} = P^{-1}.U^{-1}.L^{-1}.
Determinant: det(A) = ± det(U).
Row or column rank profile: given by the positions of the row or column permutations.
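The solving application above can be sketched in a few lines. The hypothetical helper lu_solve below (an illustration, not a library routine) factors A in place without pivoting, which is valid exactly for generic rank profile matrices since their leading principal minors are nonzero, then solves L.y = b followed by U.x = y. It uses doubles for simplicity; the exact routines do the same over Z/pZ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place LU without pivoting (A must have generic rank profile),
// then solve A x = b: forward substitution with unit-diagonal L,
// backward substitution with U. Result overwrites b.
void lu_solve(std::vector<std::vector<double>>& A, std::vector<double>& b) {
    size_t n = A.size();
    for (size_t k = 0; k < n; ++k)              // factor: A <- {L\U}
        for (size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];                 // multiplier, stored in L part
            for (size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];   // Schur complement update
        }
    for (size_t i = 0; i < n; ++i)              // forward: L y = b (unit L)
        for (size_t j = 0; j < i; ++j) b[i] -= A[i][j] * b[j];
    for (size_t i = n; i-- > 0; ) {             // backward: U x = y
        for (size_t j = i + 1; j < n; ++j) b[i] -= A[i][j] * b[j];
        b[i] /= A[i][i];
    }
}
```

The same factorization also delivers the other applications on the slide: the diagonal of U gives the determinant, and its number of nonzero rows the rank.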
16 Tiled iterative LU decomposition
LU decomposition of the first block: A11 = L1.U1.
Updates: A21 = A21.U1^{-1}; A31 = A31.U1^{-1}; A12 = L1^{-1}.A12; A13 = L1^{-1}.A13; A22 = A22 - A21.A12; ...
17 Tiled iterative LU decomposition
[Figure: 3x3 tiling A11 A12 A13 / A21 A22 A23 / A31 A32 A33]
Routines of the FFLAS-FFPACK library over a finite field Z/pZ:
FTRSM (update blocks in the same column and same row): Aik = Aik.Ukk^{-1}; Aki = Lkk^{-1}.Aki.
FGEMM (matrix multiplication, update remaining blocks): Aij = Aij - Aik.Akj.
applyP (apply the permutation matrix): Aik = Aik.Pkk^{-1}.
21 OpenMP parallel loop synchronizations
[Task timeline: LU(A11); then in parallel ApplyP+FTRSM(A12), ApplyP+FTRSM(A21), ApplyP+FTRSM(A13), ApplyP+FTRSM(A31), FGEMM(A32), FGEMM(A22), FGEMM(A23), FGEMM(A33); wait for all tasks (synchronization); LU(A22); ApplyP+FTRSM(A32), ApplyP+FTRSM(A23), FGEMM(A33); wait for all tasks (synchronization); LU(A33).]
22
for (k = 0; k < nblocks; k++) {
    R = FFPACK::LUdivine(...);
    #pragma omp parallel shared(A, P)
    {
        #pragma omp for nowait
        for (i = k+1; i < nblocks; i++)
            FFLAS::ftrsm(...);
    }
    #pragma omp parallel for shared(A, P)
    for (i = k+1; i < nblocks; i++) {
        FFPACK::applyP(...);
        FFLAS::ftrsm(...);
    }
    #pragma omp parallel for shared(A, P, T)
    for (i = k+1; i < nblocks; i++) {
        #pragma omp parallel for shared(A)
        for (j = k+1; j < nblocks; j++)
            FFLAS::fgemm(...);
    }
}
23 KAAPI dataflow scheduling for tiled LUP
[Task dependency graph, no global barriers: LU(A11) feeds ApplyP+FTRSM(A12), ApplyP+FTRSM(A13), ApplyP+FTRSM(A21), ApplyP+FTRSM(A31); these feed FGEMM(A22), FGEMM(A23), FGEMM(A32), FGEMM(A33); LU(A22) starts as soon as A22 is updated, and so on until LU(A33).]
24
for (int k = 0; k < nblocks; k++) {
    #pragma kaapi task readwrite(&A) write(&P, &Q)
    R = FFPACK::LUdivine(...);
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);
    }
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&P)
        FFPACK::applyP(...);
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);
    }
    for (int i = k+1; i < nblocks; i++) {
        for (int j = k+1; j < nblocks; j++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::fgemm(...);
        }
    }
}
25 KAAPI vs OpenMP
Machine HPAC: Intel SandyBridge E, 32 cores, L3 cache (16384 KB); field Z/1009Z.
[Plot: parallel overhead vs sequential for matrix dimension 10000x10000; timings (seconds) vs number of cores, for LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]
26 KAAPI version speed-up
[Plot: speed-up of KAAPI and OpenMP vs number of cores, matrix dimension 10000x10000; curves for KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, and the ideal speed-up.]
27 Parallelization overhead of the LU algorithm
[Plot: timings (seconds) for OpenMP and KAAPI, and gain factor 1 - KAAPI/OpenMP, on dense full rank matrices (32 cores), matrix dimensions 2K to 20K; gain factor scale from -30% to 100%.]
29 CUP decomposition (rank-deficient matrices)
A = C.U.P [figure: block shapes of the factors C, U and the permutation P]
30 Block CUP decomposition
Block CUP offers less parallelism: some independent tasks are removed, and there is one big, costly sequential task.
31 Parallelization of block CUP with OpenMP
[Plot: speed-up vs number of cores for CUP (n=10000, R=5000, block size 212) over Z/1009Z; curves for the OpenMP CUP speed-up and the ideal.]
32 Parallelization of block CUP with KAAPI: dynamic scheduling
The task dependency graph is computed at runtime. Dependencies between tasks are determined by the referent of each task; in this implementation, the referent is the pointer to the block, i.e. the pointer to the upper-left corner of each block.
33 Parallelization of block CUP with KAAPI: static scheduling
The task dependency graph is precomputed before execution (faster). X is a task parameter, set in CW mode; CW mode for static scheduling is not yet available in the current KAAPI version.
35 Quad-recursive PLUQ algorithm
Recursive cutting along both rows and columns. The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time, together with the rank profile of all leading sub-matrices [Dumas, Pernet, Sultan, ISSAC 2013].
44 Conclusion
Parallelization in exact computation. Trade-off: (tiled, block) <=> (adaptive, fewer synchronizations). Specificity of exact vs numeric: rank and rank profile.
New issues and trade-offs vs numeric and parallel numeric: dataflow synchronization; LUP: better adaptivity, more parallelism; PLUQ: dynamic scheduling; CUP: dynamic block size, parallelism?
New algorithms to parallelize: recursive, tiled?
45 Thank you for your attention!
More informationChallenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,
More informationOpenMP Doacross Loops Case Study
National Aeronautics and Space Administration OpenMP Doacross Loops Case Study November 14, 2017 Gabriele Jost and Henry Jin www.nasa.gov Background Outline - The OpenMP doacross concept LU-OMP implementations
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationCase Study: Matrix Multiplication. 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017
Case Study: Matrix Multiplication 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017 1 4k-by-4k Matrix Multiplication Version Implementation Running time (s) GFLOPS Absolute
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationGPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com
GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX
More informationParallelism paradigms
Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationGetting the most out of your CPUs Parallel computing strategies in R
Getting the most out of your CPUs Parallel computing strategies in R Stefan Theussl Department of Statistics and Mathematics Wirtschaftsuniversität Wien July 2, 2008 Outline Introduction Parallel Computing
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More informationAccelerating the Iterative Linear Solver for Reservoir Simulation
Accelerating the Iterative Linear Solver for Reservoir Simulation Wei Wu 1, Xiang Li 2, Lei He 1, Dongxiao Zhang 2 1 Electrical Engineering Department, UCLA 2 Department of Energy and Resources Engineering,
More informationGRAPH CENTERS USED FOR STABILIZATION OF MATRIX FACTORIZATIONS
Discussiones Mathematicae Graph Theory 30 (2010 ) 349 359 GRAPH CENTERS USED FOR STABILIZATION OF MATRIX FACTORIZATIONS Pavla Kabelíková Department of Applied Mathematics FEI, VSB Technical University
More informationA Static Cut-off for Task Parallel Programs
A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We
More informationOpenMP * Past, Present and Future
OpenMP * Past, Present and Future Tim Mattson Intel Corporation Microprocessor Technology Labs timothy.g.mattson@intel.com * The name OpenMP is the property of the OpenMP Architecture Review Board. 1 OpenMP
More informationBOOLEAN MATRIX FACTORISATIONS & DATA MINING. Pauli Miettinen 6 February 2013
BOOLEAN MATRIX FACTORISATIONS & DATA MINING Pauli Miettinen 6 February 2013 In the sleepy days when the provinces of France were still quietly provincial, matrices with Boolean entries were a favored occupation
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationswsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationCommunication-efficient parallel generic pairwise elimination
Communication-efficient parallel generic pairwise elimination Alexander Tiskin Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom Abstract The model of bulk-synchronous
More informationParallelization of Graph Isomorphism using OpenMP
Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to
More informationLARP / 2018 ACK : 1. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates
Triangular Factors and Row Exchanges LARP / 28 ACK :. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates Then there were three
More informationNUMERICAL PARALLEL COMPUTING
Lecture 4: More on OpenMP http://people.inf.ethz.ch/iyves/pnc11/ Peter Arbenz, Andreas Adelmann Computer Science Dept, ETH Zürich, E-mail: arbenz@inf.ethz.ch Paul Scherrer Institut, Villigen E-mail: andreas.adelmann@psi.ch
More informationShared Memory Programming Model
Shared Memory Programming Model Ahmed El-Mahdy and Waleed Lotfy What is a shared memory system? Activity! Consider the board as a shared memory Consider a sheet of paper in front of you as a local cache
More informationBrief notes on setting up semi-high performance computing environments. July 25, 2014
Brief notes on setting up semi-high performance computing environments July 25, 2014 1 We have two different computing environments for fitting demanding models to large space and/or time data sets. 1
More informationAutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming
AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,
More informationCommunication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
ommunication-voiding Parallel Sparse-Dense Matrix-Matrix Multiplication Penporn Koanantakool 1,2, riful zad 2, ydin Buluç 2,Dmitriy Morozov 2, Sang-Yun Oh 2,3, Leonid Oliker 2, Katherine Yelick 1,2 1 omputer
More informationSparse Matrices and Graphs: There and Back Again
Sparse Matrices and Graphs: There and Back Again John R. Gilbert University of California, Santa Barbara Simons Institute Workshop on Parallel and Distributed Algorithms for Inference and Optimization
More informationMATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.
MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. Row echelon form A matrix is said to be in the row echelon form if the leading entries shift to the
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More information