Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations

1 Center for Information Services and High Performance Computing, TU Dresden
Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations
Martin Flehmig, Marc Hartung, Marcus Walther
Linköping, 02. February
martin.flehmig@tu-dresden.de

2 Outline
1 Introduction and Motivation
2 Coupled Simulations Using FMI
3 Design Approach
4 Summary and Outlook

3 HPC-OM

6 Tasks within HPC-OM
The three main tasks/goals within the HPC-OM project are:
1. Domain-independent parallelization of Modelica simulations.
   Equation-based parallelization using a task graph.
   Algorithms for task merging and clustering.
   Implementation of various schedulers.
   Code generation for OpenMP, PThreads and Intel TBB.
   Exploiting repeated structures and vectorization in Modelica.
2. Parallel time integration.
3. Coupling of interactive simulations and an HPC system.
4. High speedups.

7 Task Graph Parallelization...
is promising, but current restrictions are:
Model-dependent benefit, because
  tasks are too lightweight,
  one large equation system thwarts the whole parallel computation.
OpenModelica has some issues with large models.

8 How to Achieve High Speedups?
Idea: Build the simulation from several FMUs to obtain multiple levels of parallelism:
Task graph parallelization with FMUs as nodes.
Use the task graph parallelization generated by OMC in each FMU.
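The outer level of this idea, a task graph with FMUs as nodes, can be illustrated with a small C++ sketch. It is not the framework's scheduler: the function name `schedule_levels` and the adjacency-list representation are assumptions made for illustration, and the graph is assumed to be acyclic. The sketch groups FMUs into levels so that all FMUs within one level have no mutual dependencies and could be run concurrently, while each FMU could still use its own OMC-generated task graph internally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of any existing library).
// Node i stands for one FMU; edges[i] lists the FMUs that consume outputs of FMU i.
// Returns the nodes grouped into levels; FMUs in the same level are independent.
std::vector<std::vector<std::size_t>> schedule_levels(
    const std::vector<std::vector<std::size_t>>& edges) {
  const std::size_t n = edges.size();

  // Count how many producers each FMU is still waiting for.
  std::vector<std::size_t> indegree(n, 0);
  for (const auto& successors : edges)
    for (std::size_t s : successors) ++indegree[s];

  // The first level consists of all FMUs without any producer.
  std::vector<std::size_t> current;
  for (std::size_t i = 0; i < n; ++i)
    if (indegree[i] == 0) current.push_back(i);

  std::vector<std::vector<std::size_t>> levels;
  while (!current.empty()) {
    std::vector<std::size_t> next;
    for (std::size_t node : current)
      for (std::size_t s : edges[node])
        if (--indegree[s] == 0) next.push_back(s);  // all producers of s are scheduled
    levels.push_back(current);
    current = std::move(next);
  }
  // If the levels contain fewer than n nodes in total, the graph has a cycle.
  return levels;
}
```

For four FMUs where FMU 0 and FMU 1 both feed FMU 2, and FMU 2 feeds FMU 3, the sketch yields three levels: {0, 1}, {2} and {3}. Coupled FMUs that depend on each other in a cycle do not fit this simple scheme, which is one motivation for the buffered, asynchronous data exchange discussed later in the talk.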

12 FMI/FMU - Short Recap
Can bring together sub-models from different authoring tools (language, libraries, solvers, etc.).
Models can be used in existing production codes and frameworks.
Protects intellectual property.
Single-threaded simulations using coupled FMUs have limitations:
  Suitable for real-time applications?
  Large and complex models have long simulation execution times and high memory demand.
Therefore, FMU simulations that exploit today's multi-core hardware are needed.

14 Synchronisation Step Approach
Synchronisation of FMUs happens after predefined time intervals.
At every synchronisation point, the numerical error is calculated.
In between, inputs are extrapolated for single solver steps.
[Figure: timelines of FMU 1 and FMU 2 with synchronisation points T0, T1, T2, T3, labelled "error control" and "extrapolation"]
+ No communication between synchronisation steps.
+ Clear and straightforward implementation possible.
- May require reverting FMUs to the last synchronisation point or even rerunning the simulation.
- Communication leads to delays at synchronisation points.
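The interplay of extrapolation and error control in the synchronisation step approach can be sketched in C++ as follows. The slides do not state how the numerical error is calculated, so the linear extrapolation, the absolute-difference error measure and the names `Sample`, `extrapolate` and `sync_accepts` are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>

// One value of a coupling variable, exchanged at a synchronisation point.
struct Sample {
  double t;      // time of the synchronisation point
  double value;  // output of the partner FMU at that time
};

// Assumed scheme: linear extrapolation of an input beyond the last two
// synchronisation points, used for the solver steps in between.
double extrapolate(const Sample& prev, const Sample& last, double t) {
  const double slope = (last.value - prev.value) / (last.t - prev.t);
  return last.value + slope * (t - last.t);
}

// At the next synchronisation point the actual partner output is known.
// Assumed error measure: absolute difference to the extrapolated value.
// A return value of false means the interval is rejected, i.e. the FMU
// would have to be reverted to the last synchronisation point.
bool sync_accepts(const Sample& prev, const Sample& last,
                  const Sample& actual, double tol) {
  return std::fabs(extrapolate(prev, last, actual.t) - actual.value) <= tol;
}
```

With samples (0, 0) and (1, 2), the input is extrapolated to 3 at t = 1.5. If the partner FMU then reports 4.05 at t = 2, the interval passes a tolerance of 0.1, whereas a reported value of 5 would trigger a rollback.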

15 Generic Approach - Idea
Values of every valid solver step are communicated.
Dependent FMUs take the most recent values as inputs.
Solvers can base the error estimation on sound data.
Less interference with the solver step size.
[Figure: timelines of FMU 1 and FMU 2, labelled "error control and extra-/interpolation"]

20 Generic Approach - Aspects and Challenges
Aspects
+ Only single solver steps need to be rerun.
+ Replaces unsafe extrapolation with interpolation.
+ Direct error handling, i.e., a change of numerical behaviour is treated on occurrence; the solver step size depends only on input values, not on synchronisation points.
- Increasing communication effort.
- Sophisticated implementation with complex data structures.
Challenges
Asynchronous communication is needed.
  The field of parallel computing provides several solutions.
FMUs need to be smartly distributed on the system for high efficiency.
  Knowledge transfer from task scheduling.
Adaptable to different simulation setups.
  ! Requires an interchangeable, modular structure of the system and components.

21 FMU Simulation Framework
Final goal is a generic and scalable framework for coupled FMU simulations.
[Figure: "Framework for Parallel FMU Simulation" with the labels OpenMP, C++, Error control/handling, Simulation Data Management, MPI, FMI, Initialization, I/O, Communication, Solver, Generic, User friendliness, Adaptive step sizes, Scalable]

24 DataHistory
Challenge: How to provide input/output data for asynchronous simulation?
Answer: Use buffers!
Saves relevant state values of FMUs.
Input values are written remotely.
Interface for accessing input values on local storage.
[Figure: FMU 1, Buffer, FMU 2]
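A minimal C++ sketch of such a buffer is given below. It is not the DataHistory implementation from the talk: the interface (`push`, `at`, `flush_before`) is assumed for illustration, the sketch holds a single scalar coupling variable, and the remote write (for example via MPI) is reduced to an ordinary local call.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdexcept>

// Hypothetical DataHistory sketch: a time-ordered buffer for one coupling variable.
class DataHistory {
 public:
  // Called for every valid solver step of the producing FMU; times must increase.
  void push(double t, double value) {
    if (!samples_.empty() && t <= samples_.back().t)
      throw std::invalid_argument("samples must be pushed in increasing time order");
    samples_.push_back({t, value});
  }

  // Input value at time t: linear interpolation inside the stored time range,
  // linear extrapolation from the two nearest samples outside of it.
  double at(double t) const {
    if (samples_.empty()) throw std::logic_error("history is empty");
    if (samples_.size() == 1) return samples_.front().value;
    std::size_t hi = 1;
    while (hi + 1 < samples_.size() && samples_[hi].t < t) ++hi;
    const Sample& a = samples_[hi - 1];
    const Sample& b = samples_[hi];
    return a.value + (b.value - a.value) * (t - a.t) / (b.t - a.t);
  }

  // Drop old samples that are no longer needed for times >= t_min.
  // At least two samples are kept so that at() can still interpolate.
  void flush_before(double t_min) {
    while (samples_.size() > 2 && samples_[1].t <= t_min) samples_.pop_front();
  }

  std::size_t size() const { return samples_.size(); }

 private:
  struct Sample {
    double t;
    double value;
  };
  std::deque<Sample> samples_;
};
```

A consuming FMU would query `at(t)` for the time of its current solver step. As long as the producer is ahead in time the query interpolates, otherwise it falls back to extrapolation, which matches the "extra-/interpolation" wording of the generic approach.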

26 DataManager
Challenge: How to achieve efficient communication and handling of parallel FMU simulation?
Answer: Use a manager!
Initiates shared and distributed memory writes for input values.
Writes result data to file.
Controls DataHistory, e.g., flushes unnecessary data.
Interpolation/extrapolation of input data.
Hides communication and data flow from solvers.
[Figure: FMU 1, Buffer, FMU 2]
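How a manager can hide the data flow from the solvers is sketched below in C++. The class and its methods (`connect`, `publish`, `latest`, `flush`, `buffered`) are hypothetical and only illustrate the routing idea: a solver publishes an output value and never needs to know which inputs receive it. The shared and distributed memory writes, file output and interpolation named on the slide are left out; all buffers live in one process.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical DataManager sketch; names and interface are illustrative assumptions.
class DataManager {
 public:
  // Declare that the output named `from` feeds the input named `to`.
  void connect(const std::string& from, const std::string& to) {
    routes_[from].push_back(to);
  }

  // Called by a solver after a valid step. The manager forwards the value
  // to the buffer of every connected input; unconnected outputs are ignored.
  void publish(const std::string& output, double t, double value) {
    auto it = routes_.find(output);
    if (it == routes_.end()) return;
    for (const std::string& input : it->second)
      buffers_[input].emplace_back(t, value);
  }

  // Most recent value that arrived for an input, or `fallback` if none did.
  double latest(const std::string& input, double fallback) const {
    auto it = buffers_.find(input);
    if (it == buffers_.end() || it->second.empty()) return fallback;
    return it->second.back().second;
  }

  // Flush unnecessary data: keep only the newest `keep` samples per input.
  void flush(std::size_t keep) {
    for (auto& entry : buffers_) {
      auto& samples = entry.second;
      if (samples.size() > keep)
        samples.erase(samples.begin(),
                      samples.end() - static_cast<std::ptrdiff_t>(keep));
    }
  }

  // Number of samples currently buffered for an input.
  std::size_t buffered(const std::string& input) const {
    auto it = buffers_.find(input);
    return it == buffers_.end() ? 0 : it->second.size();
  }

 private:
  std::map<std::string, std::vector<std::string>> routes_;
  std::map<std::string, std::vector<std::pair<double, double>>> buffers_;  // (time, value)
};
```

In this sketch a connection such as `connect("fmu1.y", "fmu2.u")` is set up once during initialization. Afterwards the solver of FMU 1 only calls `publish`, and the solver of FMU 2 only reads its own input, so swapping the in-process buffers for remote writes would not change the solver code.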

28 Summary
Presented an approach for efficient asynchronous parallel simulation of coupled FMUs.
Key features and main challenges have been identified:
Use task-graph-parallelized FMUs generated from OpenModelica.
Use model exchange FMUs in order to obtain asynchronous simulation.
! Need scalable data structures and communication.
! Need localized buffers to provide input values for FMUs.

29 Outlook
What has to be done?
Finish implementation.
Show scalability by performing a large simulation with numerous FMUs.
Find a fancy name for this piece of software - suggestions are welcome.
Make a release available.

30 HPC-OM
Thank you for your attention.

32 FMU Input Extrapolation
[Figure: timelines of FMU 1 and FMU 2 with synchronisation points T0, T1, T2, T3, labelled "error control" and "extrapolation"]
Error control of input values only at synchronization points.
In between, extrapolated inputs are based only on the previous interval.
In several cases, FMUs need to be set back to the last synchronization point.

33 FMU Input Extrapolation
[Figure: timelines of FMU 1 and FMU 2, labelled "error control and extra-/interpolation"]
Error control of input values based on the most recent values.
The solver can directly check the error after every step, leading to higher numerical stability.
Less interference with the solver step size.
More dynamic error handling possible.


More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #15 3/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 From last class Outline

More information

Exploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization

Exploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 Part 3 Parallel Programming Parallel Programming Concepts Amdahl s Law Parallel Programming Models Tools Compiler (Intel) Math Libraries (Intel)

More information

Bring your application to a new era:

Bring your application to a new era: Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.

More information

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Gregor von Laszewski, Lizhe Wang, Andrew J. Younge, Xi He Service Oriented Cyberinfrastructure Lab Rochester Institute of Technology,

More information

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

Reconstruction of Trees from Laser Scan Data and further Simulation Topics Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair

More information

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 1 [10 points] 1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task

More information

A Characterization of Shared Data Access Patterns in UPC Programs

A Characterization of Shared Data Access Patterns in UPC Programs IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview

More information

Model-Based Development of Multi-Disciplinary Systems Challenges and Opportunities

Model-Based Development of Multi-Disciplinary Systems Challenges and Opportunities White Paper Model-Based Development of Multi-Disciplinary Systems Challenges and Opportunities Model-Based Development In the early days, multi-disciplinary systems, such as products involving mechatronics,

More information

Parallelism paradigms

Parallelism paradigms Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization

More information

EZTrace upcoming features

EZTrace upcoming features EZTrace 1.0 + upcoming features François Trahay francois.trahay@telecom-sudparis.eu 2015-01-08 Context Hardware is more and more complex NUMA, hierarchical caches, GPU,... Software is more and more complex

More information

Performance Analysis with Vampir

Performance Analysis with Vampir Performance Analysis with Vampir Johannes Ziegenbalg Technische Universität Dresden Outline Part I: Welcome to the Vampir Tool Suite Event Trace Visualization The Vampir Displays Vampir & VampirServer

More information

5/5/2012. Message Passing Programming Model Blocking communication. Non-Blocking communication Introducing MPI. Non-Buffered Buffered

5/5/2012. Message Passing Programming Model Blocking communication. Non-Blocking communication Introducing MPI. Non-Buffered Buffered Lecture 7: Programming Using the Message-Passing Paradigm 1 Message Passing Programming Model Blocking communication Non-Buffered Buffered Non-Blocking communication Introducing MPI 2 1 Programming models

More information

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc. HPC Runtime Software Rishi Khan SC 11 Current Programming Models Shared Memory Multiprocessing OpenMP fork/join model Pthreads Arbitrary SMP parallelism (but hard to program/ debug) Cilk Work Stealing

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin

More information

Parallel programming models. Main weapons

Parallel programming models. Main weapons Parallel programming models Von Neumann machine model: A processor and it s memory program = list of stored instructions Processor loads program (reads from memory), decodes, executes instructions (basic

More information

Middleware and Interprocess Communication

Middleware and Interprocess Communication Middleware and Interprocess Communication Reading Coulouris (5 th Edition): 41 4.1, 42 4.2, 46 4.6 Tanenbaum (2 nd Edition): 4.3 Spring 2015 CS432: Distributed Systems 2 Middleware Outline Introduction

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW

VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW 8th VI-HPS Tuning Workshop at RWTH Aachen September, 2011 Tobias Hilbrich and Joachim Protze Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität

More information

Scalable, Hybrid-Parallel Multiscale Methods using DUNE

Scalable, Hybrid-Parallel Multiscale Methods using DUNE MÜNSTER Scalable Hybrid-Parallel Multiscale Methods using DUNE R. Milk S. Kaulmann M. Ohlberger December 1st 2014 Outline MÜNSTER Scalable Hybrid-Parallel Multiscale Methods using DUNE 2 /28 Abstraction

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information
