Task-Graph-Based Parallelization of Modelica-Simulations. Tutorial on the Usage of the HPCOM-Module


Introduction

Prerequisites
- an OpenModelica nightly build: https://openmodelica.org/download/nightlybuildsdownload
- a multi-core CPU
- the compilation stages can be retraced using:
  - a text editor to display debug output
  - a browser to display HTML files (for big models, IE works well)
  - a graph editor to display GraphML files (we recommend yEd: https://www.yworks.com/downloads#yed)

Technical Overview

Outline
- Modelica Transformation Process
- Task-Graph Generation
- Parallelization Approaches
- Clustering and Scheduling
- Usage
OpenModelica flags to retrace the compilation stages are marked on the slides.

Modelica Transformation Process
Example model: Modelica.Electrical.Spice3.Examples.CoupledInductors.mo
Flag: +d=dumpdaelow
Flattening: the model is parsed and instantiated in order to obtain a flat model.

Modelica Transformation Process
Flag: +d=graphml
Dependencies among variables and equations are detected, and a bipartite graph is set up.
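The bipartite graph mentioned on this slide can be illustrated with a small sketch. It is not OpenModelica code; the equation names e1..e3 and the variables x, y, z are hypothetical and only show how equation nodes and variable nodes are connected.

```python
# Hypothetical toy system; e1..e3 and x, y, z are made up for illustration.
equations = {
    "e1": {"x", "y"},   # e.g. x + y = 1
    "e2": {"y", "z"},   # e.g. y = 2 * z
    "e3": {"z"},        # e.g. z = 3
}

def bipartite_edges(eqs):
    """Return the sorted (equation, variable) edges of the incidence graph."""
    return sorted((eq, var) for eq, variables in eqs.items() for var in variables)

def variable_to_equations(eqs):
    """Invert the mapping: for each variable, the equations it occurs in."""
    result = {}
    for eq, variables in eqs.items():
        for var in variables:
            result.setdefault(var, set()).add(eq)
    return result

edges = bipartite_edges(equations)
occurrences = variable_to_equations(equations)
```

Every edge joins one equation node with one variable node, which is what makes the graph bipartite.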

Modelica Transformation Process
Flags: +d=graphml, +d=dumprepl
ReplaceSimpleEquations reduces the system size: alias variables, i.e. variables defined by simple assignments like a=b;, are replaced.
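A rough sketch of the idea behind alias replacement follows. It is an illustration only, not the ReplaceSimpleEquations implementation: the token-list representation and the function names are assumptions, and alias chains are assumed to be acyclic.

```python
def resolve(var, alias):
    """Follow alias chains such as a -> b -> e to the final representative."""
    while var in alias:
        var = alias[var]
    return var

def replace_simple_equations(equations):
    """Split off alias equations (lhs = single variable) and substitute them.

    equations: list of (lhs, rhs_tokens) pairs. Assumes no cyclic aliases.
    Returns the reduced equation list and the alias table.
    """
    alias = {}
    remaining = []
    for lhs, rhs in equations:
        if len(rhs) == 1 and rhs[0].isidentifier():
            alias[lhs] = rhs[0]          # simple assignment like a = b
        else:
            remaining.append((lhs, rhs))
    reduced = [
        (resolve(lhs, alias), [resolve(tok, alias) for tok in rhs])
        for lhs, rhs in remaining
    ]
    return reduced, alias

# Hypothetical toy system with two alias equations (a = b, b = e).
system = [
    ("a", ["b"]),
    ("c", ["a", "+", "d"]),
    ("b", ["e"]),
    ("d", ["2", "*", "a"]),
]
reduced, alias = replace_simple_equations(system)
```

In this toy system four equations shrink to two, and both remaining equations refer to the representative e instead of the aliases a and b.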

Modelica Transformation Process
Flag: +d=bltmatrixdump
Causalization (matching, index reduction, Tarjan's algorithm):
- each variable is assigned to an equation
- if necessary, the index is reduced (Pantelides)
- strongly connected components are identified (BLT matrix)
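The last causalization step can be sketched with a textbook version of Tarjan's algorithm. This is a generic illustration under stated assumptions, not the compiler's code: the graph and the node names e1..e4 are hypothetical, and an edge a -> b is taken to mean "a needs a result computed by b".

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph maps each node to the list of nodes it points to. With edges
    pointing at dependencies, components come out in evaluation order.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return components

# Hypothetical dependencies: e2 and e3 depend on each other (algebraic loop).
graph = {"e1": ["e2"], "e2": ["e3"], "e3": ["e2", "e4"], "e4": []}
blocks = tarjan_scc(graph)
```

Here e2 and e3 form one block that has to be solved as a system of equations, while e4 and e1 are single-equation blocks before and after it.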

Modelica Transformation Process
[Figure: simulation loop with the elements Start Values, States, Evaluate Right-Hand-Side, State-Derivatives, Time Integration]
$\dot{x}(t) = f(x(t), u(t))$
$y(t) = g(x(t), u(t))$
Simulation:
- the main diagonal is traversed top down; blocks correspond to systems of equations
- the computed state derivatives are used by the time integration scheme
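The loop between right-hand-side evaluation and time integration can be sketched as follows. Explicit Euler is used here purely as the simplest integration scheme; the slide does not say which solver is used, and the function name, step size, and test problem are assumptions.

```python
import math

def simulate_euler(f, x0, u, t_end, h):
    """Integrate dx/dt = f(x, u(t)) from t = 0 to t_end with explicit Euler."""
    steps = round(t_end / h)
    x = list(x0)                    # start values of the states
    t = 0.0
    for k in range(steps):
        dx = f(x, u(t))             # evaluate right-hand side: state derivatives
        x = [xi + h * dxi for xi, dxi in zip(x, dx)]   # time integration step
        t = (k + 1) * h
    return x

# Hypothetical test problem: dx/dt = -x with x(0) = 1, exact solution exp(-t).
def decay(x, u):
    return [-x[0]]

final_state = simulate_euler(decay, [1.0], lambda t: 0.0, t_end=1.0, h=0.001)
error = abs(final_state[0] - math.exp(-1.0))
```

The call to f inside the loop is the part that task-graph parallelization targets, because it is repeated in every integration step.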

Task-Graph Generation
Flag: +d=graphml
[Figure: the 1-dimensional computation sequence becomes a 2-dimensional sequence with task dependencies]
Task-graph generation: traverse the BLT matrix and assign dependencies between tasks (each task is a strongly connected component).
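As a minimal sketch of this step, the following code walks over blocks in BLT order and adds an edge whenever a block uses a variable solved by an earlier block. The block representation as (solved variables, used variables) pairs and all names are assumptions made for illustration.

```python
def build_task_graph(blocks):
    """Derive task dependencies from blocks given in BLT order.

    blocks: list of (solved_vars, used_vars) pairs.
    Returns sorted edges (i, j): task i must finish before task j starts.
    """
    producer = {}          # variable -> index of the block that solves it
    edges = set()
    for j, (solved, used) in enumerate(blocks):
        for var in used:
            if var in producer:
                edges.add((producer[var], j))
        for var in solved:
            producer[var] = j
    return sorted(edges)

# Hypothetical BLT sequence of four blocks.
blocks = [
    ({"a"}, set()),            # task 0
    ({"b"}, set()),            # task 1
    ({"c", "d"}, {"a", "b"}),  # task 2: a system of two equations
    ({"e"}, {"a"}),            # task 3
]
edges = build_task_graph(blocks)
```

Tasks 0 and 1 have no dependency between them, and neither do tasks 2 and 3, so the strictly sequential BLT order hides parallelism that the task graph exposes.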

Task-Graph Generation
Flag: +d=hpcom
Task graph: used for the parallelization of the state-derivative computation
- remove the algebraic branches
Scheduling: assign tasks to threads to distribute the workload among all threads. This needs information about execution costs and communication costs:
- determine execution costs (estimation or measurements)
- benchmark communication costs

Parallelization approaches
Modelling:
- Transmission Line Modeling (TLM)
- multirate submodels / cosimulation
- ParModelica
Solver:
- parallel steps/iterations
- parallel solving of equation systems in the integrator
- QSS
Compiler:
- BLT parallelization
- parallel solving of equation systems in the system equations

Clustering and Scheduling
Clustering:
- merge linear task sequences
- merge parent nodes
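The first clustering rule can be sketched as follows: a task is merged with its only successor when that successor has no other predecessor. This is a simplified illustration, not the HPCOM implementation; the second rule (merging parent nodes) is not covered, and the graph and cost values are hypothetical.

```python
def merge_linear_chains(succ, cost):
    """Merge every task with its single child if that child has a single parent.

    succ maps task -> list of successor tasks, cost maps task -> execution cost.
    Returns the reduced successor map, the summed costs, and the cluster members.
    """
    succ = {t: list(s) for t, s in succ.items()}
    cost = dict(cost)
    members = {t: [t] for t in succ}
    changed = True
    while changed:
        changed = False
        preds = {t: [] for t in succ}
        for t, children in succ.items():
            for c in children:
                preds[c].append(t)
        for a in list(succ):
            if len(succ[a]) == 1:
                b = succ[a][0]
                if len(preds[b]) == 1:
                    cost[a] += cost.pop(b)         # cluster cost is the sum
                    members[a] += members.pop(b)
                    succ[a] = succ.pop(b)          # cluster inherits b's children
                    changed = True
                    break                          # recompute predecessors
    return succ, cost, members

# Hypothetical graph: 1 -> 2 -> 3, then 3 forks to 4 and 5, which join in 6.
succ = {1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [6], 6: []}
cost = {t: 1 for t in succ}
new_succ, new_cost, members = merge_linear_chains(succ, cost)
```

Merging the chain 1, 2, 3 removes two dependency edges that would otherwise need synchronization, without losing any parallelism.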

Clustering and Scheduling
Level Scheduling

Clustering and Scheduling
Level Scheduling and OpenMP-Code
[Figure: task graph with tasks 1, 2, 3 in Level 1 and task 4 in Level 2]

    static void solveode(data)
    {
      //Level 1
      #pragma omp parallel sections
      {
        #pragma omp section
        {
          eqfunction_1(data);
        }
        #pragma omp section
        {
          eqfunction_2(data);
        }
      }
      //Level 2
      #pragma omp parallel sections
      {
      }
    }
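How tasks end up in levels can be sketched with a few lines of code. The rule assumed here is the usual one: a task's level is one more than the highest level among its predecessors. The dependencies of task 4 on tasks 1, 2, and 3 are an assumption chosen to reproduce the two levels of the slide.

```python
def level_schedule(preds):
    """Group tasks into levels; all tasks within one level are independent.

    preds maps task -> list of predecessor tasks (assumed acyclic).
    Returns a list of levels, each a sorted list of tasks.
    """
    level = {}

    def level_of(task):
        if task not in level:
            level[task] = 1 + max((level_of(p) for p in preds[task]), default=0)
        return level[task]

    for task in preds:
        level_of(task)

    grouped = {}
    for task, lvl in level.items():
        grouped.setdefault(lvl, []).append(task)
    return [sorted(grouped[lvl]) for lvl in sorted(grouped)]

# Assumed dependencies: task 4 needs the results of tasks 1, 2 and 3.
preds = {1: [], 2: [], 3: [], 4: [1, 2, 3]}
levels = level_schedule(preds)
```

Each level then maps to one parallel region with one section per task, and the end of the region acts as the barrier between levels.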

Clustering and Scheduling
Thread-Scheduling (MCP)
Example model: Modelica.Electrical.Machines.Examples.SynchronousInductionMachines.SMEE_LoadDump

Clustering and Scheduling
Thread-Scheduling and pthreads-code
[Figure: tasks 1, 2, 3, 4 distributed over Thread 1 and Thread 2]

    static void thread1ode(data)
    {
      //Function of thread1
      while(1)
      {
        pthread_mutex_lock(&th_lock_0);
        eqfunction_1(data);
        SET_SPIN_LOCK(l23);
        eqfunction_3(data);
        pthread_mutex_unlock(&th_lock1_0);
      }
    }

    static void solveode(data)
    {
      INIT_SPIN_LOCK(l23,true); //pthread_spinlock_t
      INIT_LOCKS();
      if(firstrun)
        CREATE_THREADS( );
      //Start threads
      pthread_mutex_unlock(&th_lock_0);
      pthread_mutex_unlock(&th_lock_1);
      //"join"
      pthread_mutex_lock(&th_lock1_0);
      pthread_mutex_lock(&th_lock1_1);
    }
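Assigning tasks to threads based on execution costs can be sketched with a simple greedy list scheduler. This is neither the MCP scheduler nor HPCOM's list scheduler: the priority rule (most expensive ready task first), the neglect of communication costs, and the example costs are all assumptions.

```python
def list_schedule(preds, cost, n_threads):
    """Greedy list scheduling without communication costs.

    preds maps task -> predecessors, cost maps task -> execution cost.
    Returns (assignment task -> thread index, makespan).
    """
    finish = {}                       # task -> finish time
    thread_free = [0] * n_threads     # time at which each thread becomes idle
    assignment = {}
    remaining = set(preds)
    while remaining:
        ready = [t for t in remaining if all(p in finish for p in preds[t])]
        # Priority: most expensive ready task first, ties broken by task id.
        task = min(ready, key=lambda t: (-cost[t], t))
        data_ready = max((finish[p] for p in preds[task]), default=0)
        # Pick the thread on which the task can start earliest.
        thread = min(range(n_threads),
                     key=lambda k: max(thread_free[k], data_ready))
        start = max(thread_free[thread], data_ready)
        finish[task] = start + cost[task]
        thread_free[thread] = finish[task]
        assignment[task] = thread
        remaining.remove(task)
    return assignment, max(finish.values())

# Hypothetical task graph and costs: task 4 depends on tasks 1, 2 and 3.
preds = {1: [], 2: [], 3: [], 4: [1, 2, 3]}
cost = {1: 3, 2: 2, 3: 2, 4: 1}
assignment, makespan = list_schedule(preds, cost, n_threads=2)
```

In generated code, a dependency whose two tasks land on different threads is what needs a lock, such as the spin lock l23 in the excerpt above.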

Influencing Factors
Domain specifics:
- Mechanics: one big linear system is the bottleneck
- Hydraulics: even distribution of tasks

Usage of HPCOM-Parallelization

HPCOM - portfolio
Task-Graph-Parallelization in HPC-OM:
- Symbolic Task-Graph Conditioning
- Cost-Benchmarking & Estimation
- Task-Merging & Clustering
- Scheduling & Parallel Code Generation
- Memory Optimization
- Profiling & Tracing

Usage of HPCOM-Parallelization
Example: Modelica.Fluid.Examples.BranchingDynamicPipes.mo from the Modelica Standard Library 3.2.1.
Modelica scripting file (*.mos):

    loadModel(Modelica,{"3.2.1"});
    setDebugFlags("hpcom,hpcomDump"); getErrorString();
    setCommandLineOptions("+n=4 +hpcomScheduler=list +hpcomCode=openmp"); getErrorString();
    simulate(Modelica.Fluid.Examples.BranchingDynamicPipes, stopTime=10.0); getErrorString();

Preparation
Results:

    Critical Path successfully calculated
    Filter successfully applied. Merged 446 tasks.
    Using list Scheduler for the DAE system
    Using list Scheduler for the ODE system
    Using list Scheduler for the ZeroFunc system
    the number of locks: 577
    the serialcosts: 709266.3000000001
    the parallelcosts: 198678.37
    the cpcosts: 36994.58
    The predicted SpeedUp with 4 processors is: 3.57
    With a theoretical maximmum speedup of: 19.17
    Schedule created
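The two speedup figures in this output are consistent with simple cost ratios. The slides do not state the formulas, so the relations below are an inference from the printed numbers: serial costs divided by parallel costs for the prediction, and serial costs divided by critical-path costs for the upper bound.

```python
# Values copied from the compiler output above.
serial_costs = 709266.3000000001
parallel_costs = 198678.37
cp_costs = 36994.58        # costs along the critical path

# Assumed relations, inferred from the numbers rather than stated in the source.
predicted_speedup = serial_costs / parallel_costs
max_speedup = serial_costs / cp_costs

print(round(predicted_speedup, 2))
print(round(max_speedup, 2))
```

Read this way, the critical path limits the speedup to about 19 regardless of the thread count, while the schedule for 4 threads is predicted to reach about 89 % of the ideal factor of 4.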