Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations


Center for Information Services and High Performance Computing, TU Dresden
Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations
Martin Flehmig, Marc Hartung, Marcus Walther
Linköping, 2 February 2015
E-Mail: martin.flehmig@tu-dresden.de

Outline
1 Introduction and Motivation
2 Coupled Simulations Using FMI
3 Design Approach
4 Summary and Outlook

HPC-OM www.hpc-om.de

1 Introduction and Motivation

Tasks within HPC-OM
The three main tasks/goals within the HPC-OM project are:
1. Domain-independent parallelization of Modelica simulations:
   - Equation-based parallelization using task graphs.
   - Algorithms for task merging and clustering.
   - Implementation of various schedulers.
   - Code generation for OpenMP, PThreads and Intel TBB.
   - Exploiting repeated structures and vectorization in Modelica.
2. Parallel time integration.
3. Coupling of interactive simulations and an HPC system.
4. High speedups.

Task Graph Parallelization
... is promising, but current restrictions are:
- Model-dependent benefit: tasks are too lightweight, and a single large equation system thwarts the whole parallel computation.
- OpenModelica has some issues regarding large models.

How to Achieve High Speedups?
Idea: Build the simulation from several FMUs to obtain multiple levels of parallelism:
- Task graph parallelization with FMUs as nodes.
- Use the task graph parallelization generated by OMC inside each FMU (see the sketch below).
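
The two levels could be combined along the following lines. This is a minimal sketch, assuming a hypothetical Fmu handle and a coarse task graph pre-sorted into levels of independent FMUs; executing level by level is just one simple scheduling choice among the schedulers mentioned earlier.

#include <cstddef>
#include <vector>

// Hypothetical FMU handle; step() would execute the task-graph-parallel
// code that OMC generated inside the FMU (the inner level of parallelism).
struct Fmu { void step(double /*t*/, double /*h*/) {} };

// Outer level: the coarse task graph over whole FMUs, pre-sorted into
// levels so that all FMUs within one level are independent.
void stepTaskGraph(std::vector<std::vector<Fmu*>>& levels, double t, double h) {
    for (auto& level : levels) {
        #pragma omp parallel for
        for (std::size_t i = 0; i < level.size(); ++i)
            level[i]->step(t, h);
    }
}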

2 Coupled Simulations Using FMI

FMI/FMU - Short Recap
- Brings together sub-models from distinct authoring tools (languages, libraries, solvers, etc.).
- Models can be used in already existing production codes and frameworks.
- Protects intellectual property.
Single-threaded simulations using coupled FMUs have limitations:
- Suitable for real-time applications?
- Large and complex models have long simulation times and high memory demands.
Therefore, FMU simulations exploiting today's multi-core hardware are needed.

Synchronisation Step Approach
- Synchronisation of the FMUs happens after predefined time intervals.
- At every synchronisation point, the numerical error is calculated.
- In between, inputs are extrapolated for the single solver steps.
(Figure: error control and extrapolation between FMU 1 and FMU 2 across synchronisation points T0, T1, T2, T3.)
+ No communication between synchronisation steps.
+ Clear and straightforward implementation possible.
- FMUs must be reverted to the last synchronisation point, or the simulation even rerun.
- Communication leads to delays at the synchronisation points.
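
A minimal sketch of this scheme in C++; the Fmu type and its methods are illustrative stand-ins for the actual FMI calls, not the framework's code.

#include <cstddef>
#include <vector>

// Hypothetical stand-in for a model-exchange FMU with its own solver.
struct Fmu {
    std::vector<double> state, saved, inputs;
    void saveState() { saved = state; }                 // remember last sync point
    void rollback()  { state = saved; }                 // revert to it
    bool step(double, double) { return true; }          // integrate one interval (stub)
    const std::vector<double>& outputs() const { return state; }
    void setInputs(const std::vector<double>& u) { inputs = u; }
};

// Synchronisation-step scheme: FMUs integrate [t, t+H] independently with
// extrapolated inputs and exchange data only at the synchronisation points.
void simulate(std::vector<Fmu>& fmus, double tEnd, double H) {
    double t = 0.0;
    while (t < tEnd) {
        for (auto& f : fmus) f.saveState();
        bool ok = true;
        for (auto& f : fmus) ok = f.step(t, H) && ok;   // trivially parallel over FMUs
        if (!ok) {                                      // error detected at sync point:
            for (auto& f : fmus) f.rollback();          // revert all FMUs and retry
            H *= 0.5;                                   // the interval, e.g. halved
            continue;
        }
        for (std::size_t i = 0; i < fmus.size(); ++i)   // outputs -> inputs, only here
            for (std::size_t j = 0; j < fmus.size(); ++j)
                if (i != j) fmus[j].setInputs(fmus[i].outputs());
        t += H;
    }
}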

Generic Approach - Idea
- The values of every valid solver step are communicated.
- Dependent FMUs take the most recent values as inputs.
- Solvers can base their error estimation on well-founded data.
- Less interference with the solver step size.
(Figure: error control and extra-/interpolation between FMU 1 and FMU 2.)
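
Because every valid step is communicated, a consumer usually finds the requested time bracketed by buffered samples and can interpolate; only at the very front does it still extrapolate. A minimal sketch of this input reconstruction (all names are illustrative):

#include <cstddef>
#include <utility>
#include <vector>

// Buffered (time, value) samples of one output of a producing FMU,
// ordered by time and appended after every valid solver step.
using Samples = std::vector<std::pair<double, double>>;

// Reconstruct an input value at time t: interpolate linearly when t is
// bracketed by buffered samples, extrapolate from the last two otherwise.
double inputAt(const Samples& s, double t) {
    if (s.size() < 2) return s.empty() ? 0.0 : s.front().second;
    for (std::size_t i = 1; i < s.size(); ++i)
        if (t <= s[i].first) {                          // bracketed: interpolate
            double w = (t - s[i - 1].first) / (s[i].first - s[i - 1].first);
            return (1.0 - w) * s[i - 1].second + w * s[i].second;
        }
    const auto& a = s[s.size() - 2];                    // beyond the front:
    const auto& b = s.back();                           // extrapolate linearly
    return b.second + (t - b.first) * (b.second - a.second) / (b.first - a.first);
}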

Generic Approach - Aspects and Challenges
Aspects:
+ Only single solver steps need to be rerun.
+ Replaces unsafe extrapolation with interpolation.
+ Direct error handling, i.e., a change in numerical behaviour is treated on occurrence; the solver step size depends only on the input values, not on synchronisation points.
- Increased communication effort.
- Sophisticated implementation with complex data structures.
Challenges:
- Asynchronous communication is needed; the field of parallel computing provides several solutions (see the sketch below).
- FMUs need to be smartly distributed across the system for high efficiency; knowledge can be transferred from task scheduling.
- Adaptable to different simulation setups.
! This requires an interchangeable, modular structure of the system and its components.
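
One standard building block for such asynchronous communication is non-blocking MPI point-to-point messaging. A minimal sketch, assuming one FMU per rank and fixed-size value vectors; the rank layout and the function name are illustrative:

#include <mpi.h>
#include <vector>

// After each valid solver step, send the new output values of the local FMU
// to the consumer rank without blocking, and poll for fresh inputs.
void exchangeStep(std::vector<double>& out, std::vector<double>& in,
                  int consumer, int producer) {
    MPI_Request sendReq;
    MPI_Isend(out.data(), (int)out.size(), MPI_DOUBLE,
              consumer, /*tag=*/0, MPI_COMM_WORLD, &sendReq);

    int ready = 0;
    MPI_Status st;
    MPI_Iprobe(producer, 0, MPI_COMM_WORLD, &ready, &st);
    if (ready)   // new inputs available: receive them into the local buffer
        MPI_Recv(in.data(), (int)in.size(), MPI_DOUBLE,
                 producer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Wait(&sendReq, MPI_STATUS_IGNORE);  // complete the send before reusing 'out'
}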

FMU Simulation Framework
The final goal is a generic and scalable framework for coupled FMU simulations.
(Figure: framework components - Simulation, Data Management, Error control/handling, Initialization, I/O, Communication, Solver - built on C++, OpenMP, MPI and FMI, with the goals: generic, scalable, user-friendly, adaptive step sizes.)

3 Design Approach

DataHistory
Challenge: How to provide input/output data for an asynchronous simulation?
Answer: Use buffers!
- Saves the relevant state values of the FMUs.
- Input values are written remotely.
- Interface for accessing input values in local storage.
(Figure: FMU 1 and FMU 2 connected through a buffer.)
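
A minimal sketch of such a buffer; the name DataHistory is from the slides, but the container layout, locking, and method names are assumptions - a scalable implementation would use lock-free structures or one-sided MPI instead of a mutex.

#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Per-FMU history of state/output values; remote writers append, the
// local solver reads.
class DataHistory {
public:
    // Called (possibly from another thread/rank) after a valid solver step.
    void write(double time, std::vector<double> values) {
        std::lock_guard<std::mutex> lock(m_);
        samples_.emplace_back(time, std::move(values));
    }
    // Local read access for input reconstruction; assumes at least one sample.
    std::pair<double, std::vector<double>> latest() const {
        std::lock_guard<std::mutex> lock(m_);
        return samples_.back();
    }
    // Flush values older than t that no consumer needs any more.
    void discardBefore(double t) {
        std::lock_guard<std::mutex> lock(m_);
        while (samples_.size() > 1 && samples_.front().first < t)
            samples_.pop_front();
    }
private:
    mutable std::mutex m_;
    std::deque<std::pair<double, std::vector<double>>> samples_;
};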

DataManager
Challenge: How to achieve efficient communication and handling of a parallel FMU simulation?
Answer: Use a manager!
- Initiates shared- and distributed-memory writes for input values.
- Writes result data to file.
- Controls the DataHistory, e.g., flushes unnecessary data.
- Interpolation/extrapolation of input data.
- Hides communication and data flow from the solvers.
(Figure: FMU 1 and FMU 2 connected through a buffer, coordinated by the manager.)
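
These responsibilities could map onto an interface like the following single-process sketch; only the name DataManager is from the slides, everything else is assumed, and the remote writes and result-file I/O are left as comments.

#include <map>
#include <utility>
#include <vector>

// Central coordinator: hides communication and data flow from the solvers.
class DataManager {
    using History = std::vector<std::pair<double, std::vector<double>>>;
public:
    // New outputs of an FMU after a valid step: store locally and initiate
    // remote writes to the buffers of dependent FMUs.
    void publish(int fmuId, double time, std::vector<double> values) {
        histories_[fmuId].emplace_back(time, std::move(values));
        // ...initiate shared-memory / MPI writes to remote consumers...
    }
    // Inter-/extrapolated input values for an FMU at time t (cf. inputAt above).
    const std::vector<double>& inputsFor(int fmuId, double /*t*/) const {
        return histories_.at(fmuId).back().second;  // sketch: most recent values
    }
    // Flush history entries no one needs any more (and write results to file).
    void flush(double oldestNeededTime) {
        for (auto& entry : histories_) {
            auto& h = entry.second;
            while (h.size() > 1 && h.front().first < oldestNeededTime)
                h.erase(h.begin());
        }
    }
private:
    std::map<int, History> histories_;
};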

4 Summary and Outlook

Summary
Presented an approach for the efficient asynchronous parallel simulation of coupled FMUs.
Key features and main challenges have been identified:
- Use task-graph-parallelized FMUs generated by OpenModelica.
- Use model-exchange FMUs in order to obtain an asynchronous simulation.
! Need scalable data structures and communication.
! Need localized buffers to provide input values for the FMUs.

Outlook
What has to be done?
- Finish the implementation.
- Show scalability by performing large simulations with numerous FMUs.
- Find a fancy name for this piece of software - suggestions are welcome.
- Make a release available.

HPC-OM Thank you for your attention. www.hpc-om.de


FMU Input Extrapolation
(Figure: error control and extrapolation between FMU 1 and FMU 2 across synchronisation points T0, T1, T2, T3.)
- Error control of the input values only at synchronisation points.
- In between, the extrapolated inputs are based only on the previous interval.
- In several cases, FMUs need to be set back to the last synchronisation point.

FMU Input Extrapolation
(Figure: error control and extra-/interpolation between FMU 1 and FMU 2.)
- Error control of the input values based on the most recent values.
- The solver can directly check the error after every step, leading to higher numerical stability.
- Less interference with the solver step size.
- More dynamic error handling possible.