Task-Graph-Based Parallelization of Modelica-Simulations. Tutorial on the Usage of the HPCOM-Module



Introduction

Prerequisites
- a multi-core CPU
- the compilation stages can be retraced using:
  - a text editor to display debug output
  - a browser to display HTML files (Internet Explorer copes well with big models)
  - a graph editor to display GraphML files (we recommend yEd)

Technical Overview

Outline
- Modelica Transformation Process
- Task-Graph Generation
- Parallelization Approaches
- Clustering and Scheduling
- Usage
OpenModelica flags that can be used to retrace the compilation stages are marked on each slide.

Modelica Transformation Process
Example model: Modelica.Electrical.Spice3.Examples.CoupledInductors.mo
Flag: +d=dumpdaelow
Flattening: the model is parsed and instantiated in order to obtain a flat model.

Modelica Transformation Process
Flag: +d=graphml
Dependencies among variables and equations are detected, and a bipartite graph of equations and variables is set up.
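The bipartite graph can be pictured with a small sketch. The toy equations, their names and the regex-based variable extraction below are assumptions made for illustration only; a real compiler works on symbolic expression trees rather than strings.

```python
import re

# Hypothetical toy system: names and equations are illustrative only.
equations = {
    "e1": "b = 2 * u",
    "e2": "a = b",
    "e3": "der_x = -a * x + u",
    "e4": "y = x + b",
}
known = {"u", "x"}  # inputs and states are treated as known in this sketch


def variables_of(text):
    """Return all identifiers that occur in an equation string."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def bipartite_graph(equations, known):
    """Map each equation node to the unknown variable nodes it touches."""
    return {
        name: sorted(variables_of(text) - known)
        for name, text in equations.items()
    }


graph = bipartite_graph(equations, known)
for eq, unknowns in graph.items():
    print(eq, "--", unknowns)
```

Each entry is one equation node together with its edges to variable nodes; matching and sorting operate on this structure.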

Modelica Transformation Process
Flags: +d=graphml, +d=dumprepl
ReplaceSimpleEquations reduces the system size: alias variables, i.e. simple assignments like a=b;, are replaced.
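A minimal sketch of alias replacement, assuming equations are given as plain `lhs = rhs` strings and that alias chains are not cyclic. The function name `remove_aliases` is hypothetical and is not the OpenModelica implementation.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")


def remove_aliases(equations):
    """Drop equations of the form 'a = b' and substitute a by b elsewhere.

    Assumes alias chains are acyclic (illustrative sketch only).
    """
    aliases = {}
    rest = []
    for eq in equations:
        lhs, rhs = (side.strip() for side in eq.split("=", 1))
        if IDENT.fullmatch(lhs) and IDENT.fullmatch(rhs):
            aliases[lhs] = rhs          # simple assignment: remember the alias
        else:
            rest.append(eq)

    def resolve(name):
        # Follow chains such as a -> b -> c to the final variable.
        while name in aliases:
            name = aliases[name]
        return name

    reduced = [IDENT.sub(lambda m: resolve(m.group(0)), eq) for eq in rest]
    return reduced, aliases


reduced, aliases = remove_aliases(
    ["b = 2 * u", "a = b", "der_x = -a * x + u", "y = x + b"]
)
print(reduced)
print(aliases)
```

The system shrinks from four equations to three, and `a` no longer appears as an unknown.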

Modelica Transformation Process
Flag: +d=bltmatrixdump
Causalization: Matching / Index Reduction / Tarjan's Algorithm
- each variable is assigned to an equation
- if necessary, the index is reduced (Pantelides)
- strongly connected components are identified (BLT matrix)
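The identification of strongly connected components can be sketched with a textbook version of Tarjan's algorithm. The input format (each equation mapped to the equations it depends on after matching) and the toy graph are assumptions for illustration.

```python
def tarjan_scc(depends_on):
    """Return the strongly connected components of a dependency graph.

    depends_on maps each equation to the equations it depends on.
    Components are emitted with dependencies first, i.e. in the order
    in which the diagonal blocks of the BLT matrix would be solved.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    blocks = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in depends_on.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            block = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                block.append(w)
                if w == v:
                    break
            blocks.append(sorted(block))

    for v in depends_on:
        if v not in index:
            visit(v)
    return blocks


# Toy graph: e2 and e3 depend on each other and form an algebraic loop.
deps = {"e1": [], "e2": ["e1", "e3"], "e3": ["e2"], "e4": ["e3"]}
blocks = tarjan_scc(deps)
print(blocks)
```

A block with more than one equation corresponds to a system of equations that has to be solved simultaneously. The recursive formulation is fine for small examples; very large models would need an iterative variant.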


Modelica Transformation Process
Diagram labels: Start Values, States, Evaluate Right-Hand Side, State Derivatives, Time Integration
$\dot{x}(t) = f(x(t), u(t))$
$y(t) = g(x(t), u(t))$
Simulation:
- the main diagonal of the BLT matrix is traversed top down; blocks correspond to systems of equations
- the computed state derivatives are used by the time integration scheme
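The interplay of right-hand-side evaluation and time integration can be sketched as follows. Explicit Euler, the test system x' = -x + u and all function names are assumptions chosen for brevity; they are not taken from the slides.

```python
import math


def f(x, u):
    """Right-hand side: returns the state derivatives (toy system x' = -x + u)."""
    return [-x[0] + u]


def g(x, u):
    """Output equation of the toy system."""
    return x[0]


def simulate(x0, u_of_t, t_end, dt):
    """Integrate the states from start values with explicit Euler steps."""
    x = list(x0)
    steps = round(t_end / dt)
    for k in range(steps):
        u = u_of_t(k * dt)
        dx = f(x, u)                                   # evaluate right-hand side
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]  # time integration
    return x, g(x, u_of_t(t_end))


x_end, y_end = simulate([1.0], lambda t: 0.0, t_end=1.0, dt=1e-3)
print(x_end[0], math.exp(-1.0))
```

The call to `f` is the part that the task graph parallelizes: it is executed in every integration step, so its cost dominates the simulation time.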

Task-Graph Generation
Flag: +d=graphml
- BLT matrix: 1-dimensional computation sequence
- Task graph: 2-dimensional sequence with task dependencies
Task-graph generation: traverse the BLT matrix and assign dependencies between tasks (each task is a strongly connected component).
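A minimal sketch of this step, assuming each BLT block is described by the set of variables it solves for and the set of variables it reads. The data layout and the name `build_task_graph` are hypothetical.

```python
def build_task_graph(blocks):
    """Derive task dependencies from an ordered list of BLT blocks.

    blocks: list of (solved_vars, read_vars) tuples; the list index is the task id.
    Returns a dict mapping each task to the sorted list of its successor tasks.
    """
    producer = {}
    for task, (solved, _) in enumerate(blocks):
        for var in solved:
            producer[var] = task

    edges = {task: set() for task in range(len(blocks))}
    for task, (_, reads) in enumerate(blocks):
        for var in reads:
            src = producer.get(var)      # None for states and inputs
            if src is not None and src != task:
                edges[src].add(task)
    return {task: sorted(succ) for task, succ in edges.items()}


# Toy BLT sequence: x and u are known (state and input).
blocks = [
    ({"b"}, {"u"}),
    ({"c"}, {"u"}),
    ({"der_x"}, {"b", "x"}),
    ({"y"}, {"b", "c"}),
]
task_graph = build_task_graph(blocks)
print(task_graph)
```

Tasks 0 and 1 have no dependency on each other, which is exactly the information the 1-dimensional BLT sequence hides.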

Task-Graph Generation
Flag: +d=hpcom
Task graph: used for the parallelization of the state-derivative computation
- remove the algebraic branches
Scheduling: assign tasks to threads to distribute the workload among all threads. This needs information about execution costs and communication costs:
- determine execution costs (estimation or measurements)
- benchmark communication costs
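One possible reading of "remove the algebraic branches" is that tasks which do not contribute to any state derivative are dropped from the graph. The sketch below implements that reading as a backward reachability search; both the interpretation and the function name are assumptions.

```python
def prune_algebraic_branches(successors, derivative_tasks):
    """Keep only tasks from which a state-derivative task can be reached."""
    predecessors = {t: [] for t in successors}
    for t, succ in successors.items():
        for s in succ:
            predecessors[s].append(t)

    keep = set()
    todo = list(derivative_tasks)
    while todo:                          # walk backwards from the derivatives
        t = todo.pop()
        if t in keep:
            continue
        keep.add(t)
        todo.extend(predecessors[t])

    return {
        t: [s for s in succ if s in keep]
        for t, succ in successors.items()
        if t in keep
    }


# Task 2 computes a state derivative; tasks 1 and 3 only feed an output.
pruned = prune_algebraic_branches({0: [2, 3], 1: [3], 2: [], 3: []}, [2])
print(pruned)
```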


Parallelization Approaches
Modelling:
- Transmission Line Modeling (TLM)
- multirate submodels / co-simulation
- ParModelica
Solver:
- parallel steps/iterations
- parallel solving of equation systems in the integrator
- QSS
Compiler:
- BLT parallelization
- parallel solving of equation systems in the system equations

Clustering and Scheduling
Clustering:
- merge linear task sequences
- merge parent nodes
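Merging linear task sequences can be sketched like this: a task is merged with its successor whenever it has exactly one successor and that successor has exactly one predecessor. The graph representation and the function name are assumptions; merging of parent nodes is not covered.

```python
def merge_linear_chains(successors):
    """Merge chains a -> b where a has one successor and b one predecessor.

    Returns (members, merged_graph): members maps each remaining task to the
    original tasks it contains, in execution order.
    """
    succ = {t: list(s) for t, s in successors.items()}
    pred_count = {t: 0 for t in succ}
    for targets in succ.values():
        for s in targets:
            pred_count[s] += 1

    members = {t: [t] for t in succ}
    changed = True
    while changed:
        changed = False
        for t in list(succ):
            if t in succ and len(succ[t]) == 1:
                s = succ[t][0]
                if pred_count[s] == 1:
                    members[t].extend(members.pop(s))  # s is absorbed into t
                    succ[t] = succ.pop(s)              # t inherits s's edges
                    del pred_count[s]
                    changed = True
    return members, succ


members, merged = merge_linear_chains(
    {0: [1], 1: [2], 2: [3, 4], 3: [5], 4: [5], 5: []}
)
print(members)
print(merged)
```

Merging removes synchronization points between tasks that could never run in parallel anyway, which lowers the scheduling and locking overhead.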

Clustering and Scheduling
Level Scheduling


Clustering and Scheduling
Level Scheduling and OpenMP Code

    static void solveode(data)
    {
      // Level 1: all tasks of one level run in parallel
      #pragma omp parallel sections
      {
        #pragma omp section
        {
          eqfunction_1(data);
        }
        #pragma omp section
        {
          eqfunction_2(data);
        }
      }
      // Level 2
      #pragma omp parallel sections
      {
      }
    }
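How the levels themselves can be determined is sketched below: every task is placed on the level given by its longest distance from a task without predecessors, so all tasks of one level are independent of each other. The graph format and the function name are assumptions.

```python
from collections import deque


def level_schedule(successors):
    """Group tasks into levels; tasks within one level have no mutual dependencies."""
    indegree = {t: 0 for t in successors}
    for succ in successors.values():
        for s in succ:
            indegree[s] += 1

    level = {t: 0 for t in successors}
    ready = deque(t for t, d in indegree.items() if d == 0)
    while ready:
        t = ready.popleft()
        for s in successors[t]:
            level[s] = max(level[s], level[t] + 1)   # longest path so far
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    levels = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for t in sorted(level):
        levels[level[t]].append(t)
    return levels


levels = level_schedule({1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []})
print(levels)
```

Each inner list would map to one `#pragma omp parallel sections` region, with one section per task.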

Clustering and Scheduling
Thread Scheduling (MCP, Modified Critical Path)
Example model: Modelica.Electrical.Machines.Examples.SynchronousInductionMachines.SMEE_LoadDump

Clustering and Scheduling
Thread Scheduling and pthreads Code

    static void thread1ode(data)
    {
      // Function of thread 1
      while(1)
      {
        pthread_mutex_lock(&th_lock_0);     // wait for the start signal
        eqfunction_1(data);
        SET_SPIN_LOCK(l23);
        eqfunction_3(data);
        pthread_mutex_unlock(&th_lock1_0);  // signal completion
      }
    }

    static void solveode(data)
    {
      INIT_SPIN_LOCK(l23,true);  // pthread_spinlock_t
      INIT_LOCKS();
      if(firstrun)
        CREATE_THREADS( );
      // Start threads
      pthread_mutex_unlock(&th_lock_0);
      pthread_mutex_unlock(&th_lock_1);
      // "join"
      pthread_mutex_lock(&th_lock1_0);
      pthread_mutex_lock(&th_lock1_1);
    }
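The assignment of tasks to threads can be sketched with a simple list scheduler. Tasks are prioritized by the length of the longest remaining path to a sink and placed on the thread that can start them earliest. This is a generic illustration under stated assumptions (positive execution costs, no communication costs); it is not the MCP scheduler as implemented in HPCOM.

```python
def list_schedule(successors, cost, n_threads):
    """Assign tasks to threads; returns (placement, makespan).

    Assumes strictly positive costs and ignores communication costs.
    """
    preds = {t: [] for t in successors}
    for t, succ in successors.items():
        for s in succ:
            preds[s].append(t)

    bottom = {}

    def bottom_level(t):
        # Cost of the longest path from t to a sink, including t itself.
        if t not in bottom:
            bottom[t] = cost[t] + max(
                (bottom_level(s) for s in successors[t]), default=0
            )
        return bottom[t]

    # With positive costs, descending bottom level is a valid topological order.
    order = sorted(successors, key=lambda t: (-bottom_level(t), t))

    thread_free = [0.0] * n_threads
    finish, placement = {}, {}
    for t in order:
        earliest = max((finish[p] for p in preds[t]), default=0.0)
        thread = min(range(n_threads), key=lambda i: max(thread_free[i], earliest))
        start = max(thread_free[thread], earliest)
        finish[t] = start + cost[t]
        thread_free[thread] = finish[t]
        placement[t] = thread
    return placement, max(finish.values())


graph = {1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
costs = {1: 1, 2: 2, 3: 1, 4: 3, 5: 1}
placement, makespan = list_schedule(graph, costs, n_threads=2)
print(placement, makespan)
```

In generated code, every dependency that crosses a thread boundary (for example task 3 waiting for task 2 here) needs a lock such as the spin lock `l23` in the slide above.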

Influencing Factors
Domain specifics:
- Mechanics: one big linear system is the bottleneck
- Hydraulics: even distribution of tasks

Usage of HPCOM-Parallelization

HPCOM Portfolio
Task-Graph Parallelization in HPC-OM:
- Symbolic Task-Graph Conditioning
- Cost Benchmarking & Estimation
- Task Merging & Clustering
- Scheduling & Parallel Code Generation
- Memory Optimization
- Profiling & Tracing

Usage of HPCOM-Parallelization
Example: Modelica.Fluid.Examples.BranchingDynamicPipes.mo from the Modelica Standard Library
Modelica scripting file (*.mos):

    loadModel(Modelica, {"3.2.1"});
    setDebugFlags("hpcom,hpcomDump"); getErrorString();
    setCommandLineOptions("+n=4 +hpcomScheduler=list +hpcomCode=openmp"); getErrorString();
    simulate(Modelica.Fluid.Examples.BranchingDynamicPipes, stopTime=10.0); getErrorString();

Preparation Results

    Critical Path successfully calculated
    Filter successfully applied. Merged 446 tasks.
    Using list Scheduler for the DAE system
    Using list Scheduler for the ODE system
    Using list Scheduler for the ZeroFunc system
    the number of locks: 577
    the serialcosts:
    the parallelcosts:
    the cpcosts:
    The predicted SpeedUp with 4 processors is: 3.57
    With a theoretical maximmum speedup of:
    Schedule created
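The log reports serial costs, parallel costs and critical-path costs. A plausible relation, assumed here because the slides do not state the formulas, is that the predicted speedup is the ratio of serial to parallel costs and the theoretical maximum is the ratio of serial to critical-path costs. The toy graph and cost values are illustrative and unrelated to the numbers in the log.

```python
def critical_path_cost(successors, cost):
    """Cost of the most expensive dependency chain in the task graph."""
    longest = {}

    def path(t):
        if t not in longest:
            longest[t] = cost[t] + max((path(s) for s in successors[t]), default=0)
        return longest[t]

    return max(path(t) for t in successors)


def speedup_estimates(successors, cost, parallel_cost):
    """Return (predicted, theoretical_max) under the assumptions named above."""
    serial = sum(cost.values())
    cp = critical_path_cost(successors, cost)
    return serial / parallel_cost, serial / cp


graph = {1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
costs = {1: 1, 2: 2, 3: 1, 4: 3, 5: 1}
# parallel_cost would be the makespan of the schedule; 6 is assumed here.
predicted, maximum = speedup_estimates(graph, costs, parallel_cost=6)
print(predicted, maximum)
```

No schedule can beat the critical path, so the second value is an upper bound regardless of the number of threads.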
