Efficient Clustering and Scheduling for Task-Graph based Parallelization

Size: px

Start display at page:

Download "Efficient Clustering and Scheduling for Task-Graph based Parallelization"

Milton Osborne Bridges
5 years ago
Views:

1 Center for Information Services and High Performance Computing TU Dresden Efficient Clustering and Scheduling for Task-Graph based Parallelization Marc Hartung 02. February

2 HPC-OM 2/23

3 Content 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 3/23

4 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 4/23

5 Motivation Achieve speed-up through parallel execution of the ODE-system s tasks OM-Simulation Start Values ODE-System Time Integration 5/23

6 Motivation Achieve speed-up through parallel execution of the ODE-system s tasks OM-Simulation Start Values ODE-System Time Integration Assigning tasks to more than one CPU reduces simulation time Improvement depends on model 5/23

7 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 70 T2 T T4 T5 0 TX strongly connected component computation costs dependency with communication costs 6/23

8 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 Core 1 Core 2 T2 T3 T T2 T3 50 T4 T T4 T5 TX strongly connected component 1 0 computation costs dependency with communication costs 6/23

9 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 Core 1 Core 2 T2 T3 T T2 T3 50 T4 T T4 T5 TX strongly connected component 1 0 computation costs dependency with communication costs Algorithms for task-to-core mapping and ordering are needed! 6/23

10 Motivation Task Graph based Parallelization + Heterogeneous data dependencies + Allows nested parallelism + Numerical stable + Universal parallel solution (in theory) 7/23

11 Motivation Task Graph based Parallelization + Heterogeneous data dependencies + Allows nested parallelism + Numerical stable + Universal parallel solution (in theory) Obstacles Compile time Parallel efficiency Model dependent 7/23

12 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 8/23

13 Scheduling Scheduling is a NP-complete decision problem Many greedy algorithms available Complexity between O(n) and O(n 4 ) (n... number of tasks) 9/23

14 Scheduling Scheduling is a NP-complete decision problem Many greedy algorithms available Complexity between O(n) and O(n 4 ) (n... number of tasks) Low cost algorithms achieve usable solutions But: No speed up guaranty 9/23

15 Scheduling - Taxonomy /23

16 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

17 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

18 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

19 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

20 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

21 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

22 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

23 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23

24 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) Other list scheduler: LVL... Level Scheduler DLS... Dynamical Level Scheduling MCP... Modified Critical Path 11/23

25 Scheduling - Status Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel i7-3930k 6x 3.20 GHz, Linux Theoretical-Speed-Up Cpp-Runtime_Level_OpenMP speed-up 7,00 6,00 5,00 4,00 3,00 2,00 1,00 0,00 2,38 2,80 1,59 1, number of cores 12/23

26 Scheduling - Status Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel i7-3930k 6x 3.20 GHz, Linux Theoretical-Speed-Up Cpp-Runtime_Level_OpenMP speed-up 7,00 6,00 5,00 4,00 3,00 2,00 1,00 0,00 2,38 2,80 1,59 1, number of cores Approach: Analyse scheduler and parallelization methods to close gaps 12/23

27 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 13/23

28 TGSim-Framework Task Graph Simulation Framework Analyse and evaluate scheduling and clustering algorithms Benchmark different parallelization methods Parallel runtime prediction for OM simulations and other traceable programs 14/23

29 TGSim-Framework Task Graph Simulation Framework Analyse and evaluate scheduling and clustering algorithms Benchmark different parallelization methods Parallel runtime prediction for OM simulations and other traceable programs Implementation: Written in C++ using OOP Easy to expand and user-friendly Creates OM-simulation alike programs with low overhead tracing mechanisms ODE-tasks replaced by wait tasks to reduce unintended influences 14/23

30 TGSim workflow MOS Profiled CppRuntime-simulation creates GraphML-file TGSim uses GraphML-File as input OMC GraphML TGSim Scheduling Runtime Analysis Analytical evaluation of scheduled task graphs Execution of scheduled simulations to benchmark parallel methods Statistic 15/23

31 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 16/23

32 Results - Outline 1 TGSim runtime simulation vs. OM-Cpp-Runtime simulation Comparable? 2 Compare parallelization methods Dynamic scheduling and static scheduling 3 Compare TGSim scheduler Scheduling algorithms: MCP, DLS, ETF, LVL 17/23

33 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores 18/23

34 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores TGSim simulates OM-simulation work flow very well 18/23

35 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores TGSim simulates OM-simulation work flow very well To simplify comparing, the OpenMP-OM-Cpp-Runtime results will be in every diagram 18/23

36 Results - Benchmark Dynamic Methods Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_IntelTBB TGSim_OpenMP Cpp-Runtime_OpenMP speed-up 3,50 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores IntelTBB is comparable to OpenMP Small disadvantage: Initialization time of IntelTBB ist 3700µs and of OpenMP 4µs 19/23

37 Results - Benchmark Static Methods Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux 3,00 2,50 TGSim_Pthread TGSim_MPI Cpp-Runtime_OpenMP speed-up 2,00 1,50 1,00 0,50 0, number of cores PThread Performance in the first test cases better Static parallelization depends on scheduling 20/23

38 Results - Benchmark Scheduling Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux MCP DLS ETF LVL Cpp-Runtime speed-up number of cores 21/23

39 Results - Benchmark Scheduling Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux speed-up MCP DLS ETF LVL Cpp-Runtime Theoretical_Speed-Up number of cores Proper scheduling leads to high improvements 21/23

40 Summary & Future Work Summary With increasing number of cores static scheduling performs better than dynamic Scheduler which consider communication costs comparable in performance and much better than other PThreads fastest parallelization method, OpenMP and IntelTBB comparable 22/23

41 Summary & Future Work Summary With increasing number of cores static scheduling performs better than dynamic Scheduler which consider communication costs comparable in performance and much better than other PThreads fastest parallelization method, OpenMP and IntelTBB comparable Future Work Extend HPCOM OpenModelica library including TGSim optimizations 22/23

42 Thank you for your attention. 23/23

43 Test Framework TGSim II MPI example 1 // core // task 1 4 wait ( costs1 ); 5 MPI \ _Isend ( data1,..., core2,...); 6 7 // task 2 8 wait ( costs2 ); 9 // task4 11 MPI \ _Irecv ( data3,..., core2,...); 12 wait ( costs4 ); 1 // core // task MPIIrecv ( data1,..., core1,...); wait ( costs3 ); 9 11 MPIsend ( data3,..., core1,...); 23/23

Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations

Center for Information Services and High Performance Computing TU Dresden Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations Martin Flehmig, Marc Hartung, Marcus Walther