Task-Graph-Based Parallelization of Modelica-Simulations. Tutorial on the Usage of the HPCOM-Module

Introduction

Prerequisites
- OpenModelica (nightly build): https://openmodelica.org/download/nightlybuildsdownload
- a multi-core CPU

Compilation stages can be retraced using:
- a text editor to display debug output
- a browser to display HTML files (for big models, IE is good)
- a graph editor to display GraphML files (we recommend yEd: https://www.yworks.com/downloads#yed)

Technical Overview

Outline
- Modelica Transformation Process
- Task-Graph Generation
- Parallelization Approaches
- Clustering and Scheduling
- Usage

OpenModelica flags to retrace compilation stages are marked.

Modelica Transformation Process
Example model: Modelica.Electrical.Spice3.Examples.CoupledInductors.mo
Flag: +d=dumpdaelow
Flattening: the model is parsed and instantiated to obtain a flat model.

Modelica Transformation Process
Flag: +d=graphml
Dependencies among variables and equations are detected, and a bipartite graph is set up.

Modelica Transformation Process
Flags: +d=graphml +d=dumprepl
ReplaceSimpleEquations reduces the system size: alias variables, i.e. variables defined by simple assignments like a=b;, are replaced.

Modelica Transformation Process
Flag: +d=bltmatrixdump
Causalization: Matching / Index Reduction / Tarjan's Algorithm
- each variable is assigned to an equation
- if necessary, the index is reduced (Pantelides)
- strongly connected components are identified (BLT matrix)

Modelica Transformation Process
[Figure: simulation loop with the elements Start Values, States, Evaluate Right-Hand Side, State-Derivatives, Time Integration]
$\dot{x}(t) = f(x(t), u(t))$
$y(t) = g(x(t), u(t))$
Simulation:
- the main diagonal is traversed top down; blocks correspond to systems of equations
- the computed state derivatives are used for the time integration scheme
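The loop of evaluating the right-hand side and integrating in time can be sketched as follows. This is a minimal illustration using explicit Euler, not the integrator OpenModelica actually uses; the names rhs_fn, decay_rhs and euler_integrate are hypothetical, and the input u is assumed to be a constant scalar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical right-hand-side signature: writes dx = f(x, u) for n states. */
typedef void (*rhs_fn)(const double *x, double u, double *dx, size_t n);

/* Example right-hand side (assumption): linear decay dx_i/dt = -x_i + u. */
void decay_rhs(const double *x, double u, double *dx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dx[i] = -x[i] + u;
}

/* Advance the states x in place by `steps` explicit Euler steps of size h.
 * dx is a caller-provided scratch buffer for the state derivatives. */
void euler_integrate(rhs_fn f, double *x, double *dx, size_t n,
                     double u, double h, unsigned steps)
{
    assert(f != NULL && x != NULL && dx != NULL && h > 0.0);
    for (unsigned k = 0; k < steps; ++k) {
        f(x, u, dx, n);               /* evaluate right-hand side */
        for (size_t i = 0; i < n; ++i)
            x[i] += h * dx[i];        /* time integration */
    }
}
```

The call to f is the part that the task-graph approach parallelizes; the integration step itself stays serial in this sketch.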

Task-Graph Generation
Flag: +d=graphml
[Figure: 1-dimensional computation sequence vs. 2-dimensional sequence with task dependencies]
Task-graph generation: traverse the BLT matrix and assign dependencies between tasks (i.e. strongly connected components).
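A rough sketch of how dependencies between tasks could be derived is given below. It assumes a simplified data layout that is not taken from OpenModelica: solved_by[v] names the task that computes variable v, and each (reader[k], read_var[k]) pair says that a task's equations contain a variable. The function name build_task_graph and the flat dep matrix are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Fill the n_tasks x n_tasks matrix dep (row-major) so that
 * dep[i * n_tasks + j] != 0 means: task i needs a variable computed by task j.
 * Tasks are assumed to be numbered in BLT order, so a producer never has a
 * larger index than its consumer. */
void build_task_graph(unsigned char *dep, int n_tasks,
                      const int *solved_by, int n_vars,
                      const int *reader, const int *read_var, int n_reads)
{
    assert(dep != NULL && n_tasks > 0);
    memset(dep, 0, (size_t)n_tasks * (size_t)n_tasks);
    for (int k = 0; k < n_reads; ++k) {
        assert(read_var[k] >= 0 && read_var[k] < n_vars);
        int producer = solved_by[read_var[k]];
        int consumer = reader[k];
        assert(producer <= consumer);      /* lower-triangular structure */
        if (producer != consumer)          /* no self-dependency */
            dep[consumer * n_tasks + producer] = 1;
    }
}
```

A dense matrix keeps the sketch short; for large models an adjacency list would be the more natural choice.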

Task-Graph Generation
Flag: +d=hpcom
Task graph: used to parallelize the state-derivative computation
- remove the algebraic branches
Scheduling: assign tasks to threads to distribute the workload among all threads; this requires information about execution costs and communication costs
- determine execution costs (estimation or measurements)
- benchmark communication costs

Parallelization Approaches
Modelling:
- Transmission Line Modeling (TLM)
- multirate submodels / co-simulation
- ParModelica
Solver:
- parallel steps/iterations
- parallel solving of equation systems in the integrator
- QSS
Compiler:
- BLT parallelization
- parallel solving of equation systems in the system equations

Clustering and Scheduling
Clustering:
- merge linear task sequences
- merge parent nodes

Clustering and Scheduling
Level Scheduling

Clustering and Scheduling
Level Scheduling and OpenMP-Code
[Figure: task graph with tasks 1, 2, 3 in Level 1 and task 4 in Level 2]

    static void solveode(data)
    {
        //Level 1
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                eqfunction_1(data);
            }
            #pragma omp section
            {
                eqfunction_2(data);
            }
        }
        //Level 2
        #pragma omp parallel sections
        {
        }
    }

Clustering and Scheduling
Thread-Scheduling (MCP)
Example model: Modelica.Electrical.Machines.Examples.SynchronousInductionMachines.SMEE_LoadDump

Clustering and Scheduling
Thread-Scheduling and pthreads-code
[Figure: task graph with tasks 1, 2, 3, 4 distributed over Thread 1 and Thread 2]

    static void thread1ode(data)
    {
        //Function of thread1
        while(1)
        {
            pthread_mutex_lock(&th_lock_0);
            eqfunction_1(data);
            SET_SPIN_LOCK(l23);
            eqfunction_3(data);
            pthread_mutex_unlock(&th_lock1_0);
        }
    }

    static void solveode(data)
    {
        INIT_SPIN_LOCK(l23,true); //pthread_spinlock_t
        INIT_LOCKS();
        if(firstrun)
            CREATE_THREADS( );
        //Start threads
        pthread_mutex_unlock(&th_lock_0);
        pthread_mutex_unlock(&th_lock_1);
        //"join"
        pthread_mutex_lock(&th_lock1_0);
        pthread_mutex_lock(&th_lock1_1);
    }
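How tasks end up on particular threads can be illustrated with a simple greedy list scheduler. Note that this is a plain earliest-start heuristic, not the MCP algorithm itself: tasks are visited in topological order rather than by MCP priorities, and communication costs are ignored. The name list_schedule and all parameters are hypothetical.

```c
#include <assert.h>

/* Assign each task to the thread on which it can start earliest.
 * dep[i * n + j] != 0 means task i depends on task j (with j < i).
 * thread_of, finish (n entries) and thread_ready (n_threads entries) are
 * caller-provided buffers. Returns the makespan of the schedule. */
double list_schedule(int n, const unsigned char *dep, const double *cost,
                     int n_threads, int *thread_of, double *finish,
                     double *thread_ready)
{
    double makespan = 0.0;
    assert(n > 0 && n_threads > 0);
    for (int t = 0; t < n_threads; ++t)
        thread_ready[t] = 0.0;

    for (int i = 0; i < n; ++i) {
        double data_ready = 0.0;           /* all predecessors finished */
        for (int j = 0; j < i; ++j)
            if (dep[i * n + j] && finish[j] > data_ready)
                data_ready = finish[j];

        int best = 0;
        double best_start = 0.0;
        for (int t = 0; t < n_threads; ++t) {
            double start = thread_ready[t] > data_ready ? thread_ready[t]
                                                        : data_ready;
            if (t == 0 || start < best_start) { best = t; best_start = start; }
        }
        thread_of[i] = best;
        finish[i] = best_start + cost[i];
        thread_ready[best] = finish[i];
        if (finish[i] > makespan)
            makespan = finish[i];
    }
    return makespan;
}
```

Every dependency that crosses threads in the resulting assignment is a place where the generated code needs a lock, comparable to the spin lock between two equation functions in the pthreads listing.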

Influencing Factors
Domain specifics:
- Mechanics: one big linear system is the bottleneck
- Hydraulics: even distribution of tasks

Usage of HPCOM-Parallelization

HPCOM - Portfolio
Task-Graph-Parallelization in HPC-OM:
- Symbolic Task-Graph Conditioning
- Cost-Benchmarking & Estimation
- Task-Merging & Clustering
- Scheduling & Parallel Code Generation
- Memory Optimization
- Profiling & Tracing

Usage of HPCOM-Parallelization
Example: Modelica.Fluid.Examples.BranchingDynamicPipes.mo from Modelica Standard Library 3.2.1.
Modelica Scripting File: *.mos

    loadModel(Modelica,{"3.2.1"});
    setDebugFlags("hpcom,hpcomDump"); getErrorString();
    setCommandLineOptions("+n=4 +hpcomScheduler=list +hpcomCode=openmp"); getErrorString();
    simulate(Modelica.Fluid.Examples.BranchingDynamicPipes, stopTime=10.0); getErrorString();

Preparation Results:

    Critical Path successfully calculated
    Filter successfully applied. Merged 446 tasks.
    Using list Scheduler for the DAE system
    Using list Scheduler for the ODE system
    Using list Scheduler for the ZeroFunc system
    the number of locks: 577
    the serialcosts: 709266.3000000001
    the parallelcosts: 198678.37
    the cpcosts: 36994.58
    The predicted SpeedUp with 4 processors is: 3.57 With a theoretical maximmum speedup of: 19.17
    Schedule created
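The two speedup figures in this output are consistent with simple cost ratios: serial costs divided by parallel costs for the predicted speedup, and serial costs divided by critical-path costs for the theoretical maximum. The sketch below only reproduces that arithmetic; that the compiler computes the figures exactly this way is an assumption, and cost_ratio is a hypothetical helper.

```c
#include <assert.h>

/* Ratio of the serial cost to a reference cost (parallel cost or
 * critical-path cost). The reference cost must be positive. */
double cost_ratio(double serial_cost, double reference_cost)
{
    assert(reference_cost > 0.0);
    return serial_cost / reference_cost;
}
```

With the logged values, cost_ratio(709266.3000000001, 198678.37) is roughly 3.57 and cost_ratio(709266.3000000001, 36994.58) is roughly 19.17. The gap between the two suggests that with 4 threads the schedule is limited by the thread count rather than by the critical path.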