Scheduling Strategies for Parallel Sparse Backward/Forward Substitution
1 Scheduling Strategies for Parallel Sparse Backward/Forward Substitution. J.I. Aliaga, M. Bollhöfer, A.F. Martín, E.S. Quintana-Ortí. Department of Computer Science and Engineering, Univ. Jaume I (Spain), {aliaga,martina,quintana}@icc.uji.es. Institute of Computational Mathematics, TU Braunschweig (Germany), m.bollhoefer@tu-braunschweig.de. May 2008. J. I. Aliaga et al., PARA'08 @ Trondheim.
2 Motivation and Introduction. Many numerical applications require the solution of LARGE and SPARSE linear systems, hence preconditioned iterative solvers. ILUPACK is a (serial) numerical package to solve Ax = b: incomplete LU decompositions (ILU), A ≈ M = LU, and preconditioned Krylov solvers, which solve M^-1 A x = M^-1 b.
3 Motivation and Introduction. Mid-term goal: develop a parallel package to solve Ax = b on shared-memory multiprocessors using ILUPACK techniques. Already available: parallel ILU preconditioners for s.p.d. systems. Focus: the parallel forward substitution (PFS) and parallel backward substitution (PBS) stages of the iterative solution of the linear system. Preconditioned Krylov solver: for j = 1, 2, ..., until convergence do ... Solve L y_j = b_j; Solve U x_j = y_j ... end for.
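Each iteration of the Krylov loop above applies the preconditioner through one forward solve (L y_j = b_j) and one backward solve (U x_j = y_j). A minimal dense sketch of those two kernels (illustration only, not ILUPACK code, which works on sparse factors):

```python
def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L, one row at a time."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for upper-triangular U, last row first."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x
```

The data dependencies visible in the loops (row i needs all earlier, respectively later, entries of the solution) are exactly what makes these stages hard to parallelize for sparse factors.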
4 Outline. 1 Motivation and Introduction. 2 Parallel ILU Preconditioners: Data Decomposition; Parallel ILU Computations; Parallel ILU Execution. 3 Parallel Forward/Backward Substitution: PFS Computations; PBS Computations; PFS and PBS Task Mapping; PFS and PBS Task Scheduling. 4 Experimental Results. 5 Conclusions.
5 Parallel ILU Preconditioners: Data Decomposition. Natural ordering vs. MLND ordering. [Figure: task tree with leaf tasks (1,1)-(1,4), inner tasks (2,1), (2,2) and root (3,1), together with the corresponding elimination tree.]
6 Parallel ILU Preconditioners: Data Decomposition. The task tree yields a block partitioning of A. [Figure: sparsity pattern of the reordered matrix A and the block partitioning induced by the task tree.] How does our approach decompose A?
7 Parallel ILU Preconditioners: Data Decomposition. The path from task (1,i) to the root maps the local block A^(1,i) into A:
A = M^(1,1) A^(1,1) (M^(1,1))^T + M^(1,2) A^(1,2) (M^(1,2))^T + M^(1,3) A^(1,3) (M^(1,3))^T + M^(1,4) A^(1,4) (M^(1,4))^T.
[Figure: mapping of A^(1,1) into A via M^(1,1).]
8 Parallel ILU Preconditioners: Data Decomposition. The same decomposition, illustrated for task (1,4). [Figure: mapping of A^(1,4) into A via M^(1,4).]
9 Parallel ILU Preconditioners: Parallel ILU Computations. First-level tasks compute ILUPACK partial ILUs in parallel:
[ A_11 A_12 A_13 ; A_21 A_22 A_23 ; A_31 A_32 A_33 ]^(1,i) ≈ [ L_11 0 0 ; L_21 I 0 ; L_31 0 I ]^(1,i) [ U_11 U_12 U_13 ; 0 S_22 S_23 ; 0 S_32 S_33 ]^(1,i), where i = 1, ..., 4.
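The partial factorization above eliminates only the leading diagonal block and leaves a Schur complement to be passed up the tree. A dense miniature of that step, under the assumption of an exact in-place LU without pivoting (ILUPACK's actual kernel is an incomplete LU with dropping):

```python
def partial_factor(A, k):
    """LU-factor the leading k x k block of the dense matrix A in place
    (no pivoting) and return the Schur complement of the trailing block."""
    n = len(A)
    for p in range(k):                        # eliminate the first k columns
        for i in range(p + 1, n):
            A[i][p] /= A[p][p]                # multiplier, stored in the L part
            for j in range(p + 1, n):
                A[i][j] -= A[i][p] * A[p][j]
    # the trailing (n-k) x (n-k) block now holds S = A22 - A21 * inv(A11) * A12
    return [row[k:] for row in A[k:]]
```

The returned block plays the role of S^(1,i): it is the only data a leaf task must hand to its parent.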
10 Parallel ILU Preconditioners: Parallel ILU Computations. Second-level tasks merge the Schur complements of their children:
[ A_11 A_12 ; A_21 A_22 ]^(2,i) = [ S_22 S_23 ; S_32 S_33 ]^(1,2i-1) + [ S_22 S_23 ; S_32 S_33 ]^(1,2i), where i = 1, 2.
11 Parallel ILU Preconditioners: Parallel ILU Computations. Second-level tasks compute ILUPACK partial ILUs in parallel:
[ A_11 A_12 ; A_21 A_22 ]^(2,i) ≈ [ L_11 0 ; L_21 I ]^(2,i) [ U_11 U_12 ; 0 S_22 ]^(2,i), where now i = 1, 2.
12 Parallel ILU Preconditioners: Parallel ILU Computations. The root task merges the Schur complements of its children: A^(3,1) = S^(2,1) + S^(2,2).
13 Parallel ILU Preconditioners: Parallel ILU Computations. Finally, the root task completes the parallel ILU: A^(3,1) ≈ L^(3,1) U^(3,1).
14 Parallel ILU Preconditioners: Parallel ILU Execution. The task trees are constructed before the parallel ILU commences. The execution is scheduled via a dynamic load-balancing strategy: it always prioritizes leaves over inner tasks and, among leaves, prioritizes those with higher estimated cost. The parallel execution results in a mapping of tasks to processors. [Figure: example execution of a task tree (tasks T1, T2, ...) and the resulting task-to-processor mapping for p = 4 processors.] Remark: excellent results on shared-memory multiprocessors.
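The priority rule stated above can be sketched with a heap; the task names and cost estimates below are hypothetical, and this is only an illustration of the ordering, not ILUPACK's actual scheduler:

```python
import heapq

def schedule_order(tasks):
    """tasks: list of (name, is_leaf, est_cost) tuples.
    Return the order in which a single worker would pick them:
    leaves always outrank inner tasks; among leaves, higher
    estimated cost wins."""
    # heapq pops the smallest tuple, so encode the rules by negation:
    # inner tasks sort after leaves, and costs sort descending.
    heap = [(not is_leaf, -cost, name) for name, is_leaf, cost in tasks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In the real scheduler an inner task only becomes eligible once both of its children have finished; the heap shown here captures just the priority among currently ready tasks.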
15 Parallel Forward/Backward Substitution. Solve Ly = b and Ux = y, respectively, where L and U are the sparse triangular factors obtained from the parallel multilevel ILU of A. Assume we split b, y, x according to A: b = M^(1,1) b^(1,1) + M^(1,2) b^(1,2) + M^(1,3) b^(1,3) + M^(1,4) b^(1,4).
16 Parallel Forward Substitution: PFS Computations. First-level tasks perform partial forward substitutions in parallel:
[ L_11 0 0 ; L_21 I 0 ; L_31 0 I ]^(1,i) [ y_1 ; b̂_2 ; b̂_3 ]^(1,i) = [ b_1 ; b_2 ; b_3 ]^(1,i), i = 1, ..., 4:
1) Solve L_11^(1,i) y_1^(1,i) = b_1^(1,i) -- SpTR (forward substitution)
2) Update [ b̂_2 ; b̂_3 ]^(1,i) = [ b_2 ; b_3 ]^(1,i) - [ L_21 ; L_31 ]^(1,i) y_1^(1,i) -- SpMxV
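The two kernels of a leaf task, SpTR followed by SpMxV, in dense miniature (an illustrative sketch with dense blocks standing in for the sparse blocks of the slide; `leaf_forward_task` is a name introduced here):

```python
def leaf_forward_task(L11, L21, b1, b2):
    """Solve L11 y1 = b1 (the SpTR kernel), then compute the update
    b2_hat = b2 - L21 y1 (the SpMxV kernel) that the parent task
    will accumulate."""
    n = len(b1)
    y1 = [0.0] * n
    for i in range(n):                                   # forward solve
        s = sum(L11[i][j] * y1[j] for j in range(i))
        y1[i] = (b1[i] - s) / L11[i][i]
    b2_hat = [b2[i] - sum(L21[i][j] * y1[j] for j in range(n))
              for i in range(len(b2))]                   # update for parent
    return y1, b2_hat
```

Only `b2_hat` crosses the tree edge; `y1` stays local until the backward sweep returns to this task.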
17 Parallel Forward Substitution: PFS Computations. Second-level tasks merge the updates resulting from their children:
[ b_1 ; b_2 ]^(2,i) = [ b̂_2 ; b̂_3 ]^(1,2i-1) + [ b̂_2 ; b̂_3 ]^(1,2i), where i = 1, 2.
18 Parallel Forward Substitution: PFS Computations. Second-level tasks perform partial forward substitutions in parallel:
[ L_11 0 ; L_21 I ]^(2,i) [ y_1 ; b̂_2 ]^(2,i) = [ b_1 ; b_2 ]^(2,i), i = 1, 2:
1) Solve L_11^(2,i) y_1^(2,i) = b_1^(2,i) -- SpTR (forward substitution)
2) Update b̂_2^(2,i) = b_2^(2,i) - L_21^(2,i) y_1^(2,i) -- SpMxV
19 Parallel Forward Substitution: PFS Computations. The root task merges the updates resulting from its children: b^(3,1) = b̂^(2,1) + b̂^(2,2).
20 Parallel Forward Substitution: PFS Computations. The root task completes the parallel forward substitution: Solve L^(3,1) y^(3,1) = b^(3,1) -- SpTR (forward substitution).
21 Parallel Backward Substitution: PBS Computations. The root task starts the parallel backward substitution: Solve U^(3,1) x^(3,1) = y^(3,1) -- SpTR (backward substitution).
22 Parallel Backward Substitution: PBS Computations. The root task provides copies of x^(3,1) to its children (2,1) and (2,2).
23 Parallel Backward Substitution: PBS Computations. Second-level tasks perform partial backward substitutions in parallel:
[ U_11 U_12 ; 0 I ]^(2,i) [ x_1 ; x_2 ]^(2,i) = [ y_1 ; y_2 ]^(2,i), i = 1, 2:
1) Update ŷ_1^(2,i) = y_1^(2,i) - U_12^(2,i) x_2^(2,i) -- SpMxV
2) Solve U_11^(2,i) x_1^(2,i) = ŷ_1^(2,i) -- SpTR (backward substitution)
24 Parallel Backward Substitution: PBS Computations. Second-level tasks provide copies to their children: task (2,1) passes (x^(2,1), x^(3,1)) to tasks (1,1) and (1,2), and task (2,2) passes (x^(2,2), x^(3,1)) to tasks (1,3) and (1,4).
25 Parallel Backward Substitution: PBS Computations. First-level tasks compute partial backward substitutions in parallel:
[ U_11 U_12 U_13 ; 0 I 0 ; 0 0 I ]^(1,i) [ x_1 ; x_2 ; x_3 ]^(1,i) = [ y_1 ; y_2 ; y_3 ]^(1,i), i = 1, ..., 4:
1) Update ŷ_1^(1,i) = y_1^(1,i) - [ U_12 U_13 ]^(1,i) [ x_2 ; x_3 ]^(1,i) -- SpMxV
2) Solve U_11^(1,i) x_1^(1,i) = ŷ_1^(1,i) -- SpTR (backward substitution)
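A leaf's backward step mirrors the forward one with the kernel order reversed: first the SpMxV update with the ancestor solution blocks, then the SpTR solve. A dense miniature for the two-block case (`leaf_backward_task` is a name introduced here, and dense blocks again stand in for sparse ones):

```python
def leaf_backward_task(U11, U12, y1, x2):
    """Update y1_hat = y1 - U12 x2 (SpMxV), then solve
    U11 x1 = y1_hat (SpTR, backward substitution)."""
    n = len(y1)
    y1_hat = [y1[i] - sum(U12[i][j] * x2[j] for j in range(len(x2)))
              for i in range(n)]                         # update with parent data
    x1 = [0.0] * n
    for i in range(n - 1, -1, -1):                       # backward solve
        s = sum(U11[i][j] * x1[j] for j in range(i + 1, n))
        x1[i] = (y1_hat[i] - s) / U11[i][i]
    return x1
```

Note the data-flow asymmetry with the PFS: here the task must wait for `x2` to arrive from its ancestors before it can start, which is why the PBS traverses the tree from the root down.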
26 PFS and PBS Task Mapping. How should the tasks be distributed among the processors? There is a wide range of solutions: redistribute the tasks for each PFS and PBS execution (dynamic load balancing, ...), or maintain the mapping resulting from the parallel ILU for the whole solution process.
27 PFS and PBS Task Mapping. Redistribute the tasks for each PFS and PBS execution: this may entail successively moving the data structures, on cc-NUMA and even on cc-UMA or multicore processors. Our experimental analysis reveals that this data movement outweighs the other advantages of redistributing, since the SpMxV and SpTR kernels exhibit very restricted temporal locality.
28 PFS and PBS Task Mapping. Maintain the mapping resulting from the parallel ILU: it can provide acceptable solutions if there are only moderate variations between the relative costs of the tasks of the parallel ILU and those of the tasks of the PFS and the PBS. [Figure: relative cost (%) of each task of the parallel ILU.] We consider the mapping problem for s.p.d. matrices, for which the costs of the PFS and the PBS are closely similar.
29 PFS and PBS Task Mapping. [Figure: relative task costs (%) of the parallel ILU vs. those of the PFS.] Moderate variations?
30 PFS and PBS Task Mapping. Maintain the mapping resulting from the parallel ILU. For each task T_i we define the relative cost ratio as: r_i^PF = (rel. cost of T_i in the ILU) / (rel. cost of T_i in the FS). Values of r_i^PF close to 1 imply moderate variations.
31 PFS and PBS Task Mapping. What do we get? [Figure: relative cost ratio r_i^PF against task identifier i for G_circuit, distinguishing leaf tasks from inner tasks.] Task identifiers are assigned in descending order of relative cost.
32 PFS and PBS Task Scheduling. For the task scheduling of the PFS: a thread can only execute tasks mapped to it; threads always prioritize leaves over inner tasks; among leaves, threads prioritize those with higher nnz(L^(1,i)). Initially, we provide the leaves to their corresponding threads. When a thread completes a task, it checks the dependencies of the parent task and, if they are resolved, provides the parent task to the corresponding thread.
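The dependency-driven hand-off just described can be sketched as follows. This is a single-threaded simulation of the rule (tasks pinned by a given mapping, a finished task decrementing its parent's dependency counter); the tree and mapping structures are assumptions for illustration, not ILUPACK data structures:

```python
from collections import deque

def run_pfs(tree, mapping, nthreads):
    """tree: {task: parent or None}; mapping: {task: thread id}.
    Simulate the PFS scheduling rule by sweeping the per-thread
    ready queues round-robin, and return the execution order."""
    children = {}
    for t, p in tree.items():
        children.setdefault(p, []).append(t)
    pending = {t: len(children.get(t, [])) for t in tree}
    queues = [deque() for _ in range(nthreads)]
    for t, deps in pending.items():
        if deps == 0:                       # leaves are ready from the start
            queues[mapping[t]].append(t)
    order = []
    while any(queues):
        for q in queues:                    # one task per thread per sweep
            if q:
                t = q.popleft()
                order.append(t)
                parent = tree[t]
                if parent is not None:
                    pending[parent] -= 1    # a child of the parent finished
                    if pending[parent] == 0:
                        queues[mapping[parent]].append(t := parent) if False else queues[mapping[parent]].append(parent)
    return order
```

In the real implementation each queue is owned by one thread and the dependency counters are updated atomically; the round-robin sweep here merely serializes that behaviour deterministically.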
33 PFS and PBS Task Scheduling. For the PBS, threads always prioritize inner tasks over leaves. [Figure: task tree with the task-to-thread mapping inherited from the parallel ILU.]
34 PFS and PBS Task Scheduling. The PBS execution uncovers some pitfalls. [Figure: task tree illustrating a pitfall of the inherited mapping.]
35 PFS and PBS Task Scheduling. The thread resolving the inner task becomes responsible for it. [Figure: reassignment of an inner task to the thread that resolved its dependencies.]
36 PFS and PBS Task Scheduling. We allow some flexibility in the mapping of inner tasks. [Figure: two alternative task-to-thread mappings of the same task tree.]
37 Experimental Results: Experimental Framework. SGI Altix 350 CC-NUMA shared-memory multiprocessor: 8 nodes, 2 processors per node (Intel Itanium 2), with main memory shared via an SGI NUMAlink interconnect. Intel compiler, OpenMP 2.5 compliance, with optimization enabled. One thread was bound per physical processor; whenever possible, one thread per node. IEEE double precision.
38 Experimental Results: Benchmark Matrices. Benchmark matrices from the UF sparse matrix collection:
Code  Group/Name
M1    GHS_psdef/bmwcra_1
M2    Wissgott/parabolic_fem
M3    Schmid/thermal
M4    AMD/G_circuit
39 Experimental Results. p = 1, 2, 4, 8, 16 processors, and f = p, 2p, 4p. Average parallel time in ms for executions with the mapping resulting from the same parallel ILU. Speed-up measured with respect to the parallel algorithm executing the same task tree on a single processor. Different values of f lead to different task trees. p = 1 / f = 1 refers to the ILUPACK serial routines.
40 Experimental Results. [Table: execution time T (ms) and speed-up Sp of the PFS and the PBS for matrices M1 and M2, for the tested values of f and p.]
41 Experimental Results. [Table: execution time T (ms) and speed-up Sp of the PFS and the PBS for matrices M3 and M4, for the tested values of f and p.]
42 Conclusions. We have presented two parallel algorithms to compute the forward and backward substitutions for the iterative solution of sparse linear systems on shared-memory multiprocessors. The mapping resulting from the parallel ILU provides acceptable solutions for the PFS and the PBS. The task scheduling strategies take care of some pitfalls which could otherwise significantly hurt the performance attained by the PBS. Remarkable performance is reported on a CC-NUMA platform with 16 processors.
43 Conclusions. Questions?
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationHigh-Performance Out-of-Core Sparse LU Factorization
High-Performance Out-of-Core Sparse LU Factorization John R. Gilbert Sivan Toledo Abstract We present an out-of-core sparse nonsymmetric LU-factorization algorithm with partial pivoting. We have implemented
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationAll routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs.
technologies for multi-core numeric computation In order to compare ConcRT, OpenMP and TBB technologies, we implemented a few algorithms from different areas of numeric computation and compared their performance
More informationA comparison of parallel rank-structured solvers
A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.
More informationHarnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems
Harnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems José ALIAGA, a,1 Davor DAVIDOVIĆ b, Joaquín PÉREZ a, and Enrique S. QUINTANA-ORTÍ a, a Dpto. Ingeniería Ciencia de Computadores,
More informationParallel Threshold-based ILU Factorization
A short version of this paper appears in Supercomputing 997 Parallel Threshold-based ILU Factorization George Karypis and Vipin Kumar University of Minnesota, Department of Computer Science / Army HPC
More informationPARALUTION - a Library for Iterative Sparse Methods on CPU and GPU
- a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center
More informationSparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009
Sparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009 Dan l Pierce, PhD, MBA, CEO & President AAI Joint with: Yukai Hung, Chia-Chi Liu, Yao-Hung Tsai, Weichung Wang, and David Yu Access
More informationConstruction and application of hierarchical matrix preconditioners
University of Iowa Iowa Research Online Theses and Dissertations 2008 Construction and application of hierarchical matrix preconditioners Fang Yang University of Iowa Copyright 2008 Fang Yang This dissertation
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The
More informationIntel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN
More informationHartwig Anzt, Edmond Chow, Daniel Szyld, and Jack Dongarra. Report Novermber 2015
Domain Overlap for Iterative Sparse Triangular Solves on GPUs Hartwig Anzt, Edmond Chow, Daniel Szyld, and Jack Dongarra Report 15-11-24 Novermber 2015 Department of Mathematics Temple University Philadelphia,
More informationswsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture
More informationSparse Matrices Direct methods
Sparse Matrices Direct methods Iain Duff STFC Rutherford Appleton Laboratory and CERFACS Summer School The 6th de Brùn Workshop. Linear Algebra and Matrix Theory: connections, applications and computations.
More informationApplying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems
V Brazilian Symposium on Computing Systems Engineering Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems Alessandro Trindade, Hussama Ismail, and Lucas Cordeiro Foz
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationMultigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX10
Multigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX0 Kengo Nakajima Information Technology enter, The University of Tokyo, Japan November 4 th, 0 Fujitsu Booth S Salt Lake
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,
More informationMathematics and Computer Science
Technical Report TR-2006-010 Revisiting hypergraph models for sparse matrix decomposition by Cevdet Aykanat, Bora Ucar Mathematics and Computer Science EMORY UNIVERSITY REVISITING HYPERGRAPH MODELS FOR
More informationGPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA
GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA ANSYS Fluent 2 Fluent control flow Accelerate this first Non-linear iterations Assemble
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationIterative Sparse Triangular Solves for Preconditioning
Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt 1(B), Edmond Chow 2, and Jack Dongarra 1 1 University of Tennessee, Knoxville, TN, USA hanzt@icl.utk.edu, dongarra@eecs.utk.edu 2 Georgia
More informationToward robust hybrid parallel sparse solvers for large scale applications
Toward robust hybrid parallel sparse solvers for large scale applications Luc Giraud (INPT/INRIA) joint work with Azzam Haidar (CERFACS-INPT/IRIT) and Jean Roman (ENSEIRB, LaBRI and INRIA) 1st workshop
More informationThe Fast Multipole Method on NVIDIA GPUs and Multicore Processors
The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied
More informationLecture 17: More Fun With Sparse Matrices
Lecture 17: More Fun With Sparse Matrices David Bindel 26 Oct 2011 Logistics Thanks for info on final project ideas. HW 2 due Monday! Life lessons from HW 2? Where an error occurs may not be where you
More informationCSCE 411 Design and Analysis of Algorithms
CSCE 411 Design and Analysis of Algorithms Set 4: Transform and Conquer Slides by Prof. Jennifer Welch Spring 2014 CSCE 411, Spring 2014: Set 4 1 General Idea of Transform & Conquer 1. Transform the original
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationMatrix-free IPM with GPU acceleration
Matrix-free IPM with GPU acceleration Julian Hall, Edmund Smith and Jacek Gondzio School of Mathematics University of Edinburgh jajhall@ed.ac.uk 29th June 2011 Linear programming theory Primal-dual pair
More information