Parallel Computing Using Modelica


Parallel Computing Using Modelica
Martin Sjölund, Mahder Gebremedhin, Kristian Stavåker, Peter Fritzson
PELAB, Linköping University
ModProd, Feb 2012, Linköping University, Sweden

What is Modelica?
- An equation-based, object-oriented language
- Modelling of physical systems, and more
- Explained by example

Example - RC Circuit (Diagram)

Example - RC Circuit (Code)

model RC
  Modelica.Electrical.Analog.Basic.Ground ground1;
  Modelica.Electrical.Analog.Basic.Resistor resistor1(R = 100);
  Modelica.Electrical.Analog.Basic.Capacitor capacitor1(C = 0.01);
  Modelica.Electrical.Analog.Sources.SineVoltage sinevoltage1(V = 240, freqHz = 50);
equation
  connect(capacitor1.n, ground1.p);
  connect(sinevoltage1.n, ground1.p);
  connect(resistor1.n, sinevoltage1.p);
  connect(resistor1.p, capacitor1.p);
end RC;

Example - RC Circuit (Flat Code)

class RC // 24 equations and variables
equation
  ground1.p.v = 0.0;
  0.0 = resistor1.p.i + resistor1.n.i;
  resistor1.i = resistor1.p.i;
  resistor1.T_heatPort = resistor1.T;
  capacitor1.i = capacitor1.C * der(capacitor1.v);
  capacitor1.v = capacitor1.p.v - capacitor1.n.v;
  0.0 = capacitor1.p.i + capacitor1.n.i;
  capacitor1.i = capacitor1.p.i;
  ...
end RC;
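To make concrete what the compiler does with such a flat equation system, here is a minimal Python sketch (not the OpenModelica runtime) that causalizes the RC equations by hand and integrates the single state capacitor1.v with an explicit Euler step. The component values come from the model above; the solver choice and step size are illustrative only.

```python
import math

R = 100.0          # resistor1.R
C_cap = 0.01       # capacitor1.C
V_amp = 240.0      # sinevoltage1.V
freq = 50.0        # sinevoltage1.freqHz

def simulate(t_end=0.1, h=1e-5):
    """Explicit-Euler integration of the causalized flat RC equations."""
    v_c = 0.0      # state: capacitor1.v
    t = 0.0
    while t < t_end:
        v_src = V_amp * math.sin(2 * math.pi * freq * t)  # source voltage
        i = (v_src - v_c) / R        # current through resistor1
        der_v_c = i / C_cap          # from capacitor1.i = C * der(capacitor1.v)
        v_c += h * der_v_c
        t += h
    return v_c
```

The "magic" on the next slide is essentially this: sorting the 24 equations into an executable order and handing the resulting state derivatives to a numerical solver.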

Magic Happens (Next talk has details)

Example - RC Circuit (Output)

Symptom: Simulation is slow. Why?

Simple Model (10 years ago)

Simple Computer (10 years ago)

Complex Model (Today)

Computers Today

The Problem
- Algorithms for numerical simulation are mostly designed for single CPUs
- They scaled well until we got multi-core CPUs
- Not much research on parallelizing simulations

Computer We Want Today

Idea: Map Submodels to CPUs/GPUs

Solutions

Strategies for Utilizing Parallelism
- Parallelize the numeric solver
  - Not covered here (complementary)
- Automatic parallelization
  - No model manipulation
- Distributed simulation
  - Model manipulation
- Explicit parallel programming
  - New language constructs
- Parallel optimization algorithms

Automatic Parallelization
- Parallelizes over the equation system
- Task graph
- Scheduling: join and split, merging tasks
[Figure: example task graph with nodes a-g]
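As a rough illustration of task-graph scheduling, the sketch below groups a small dependency graph (edge list and node names assumed, loosely matching the figure's nodes a-g) into levels; tasks within one level have no mutual dependencies and could be executed in parallel. Real schedulers also weigh communication costs and merge small tasks, which is omitted here.

```python
from collections import defaultdict

# Hypothetical task graph: each edge points from producer to consumer.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("c", "f"), ("c", "g")]

def schedule_levels(edges):
    """Group tasks into levels: a task is ready once all its
    predecessors are done, so each level can run in parallel."""
    preds = defaultdict(set)
    nodes = set()
    for u, v in edges:
        preds[v].add(u)
        nodes.update((u, v))
    levels, done = [], set()
    while done != nodes:
        ready = sorted(n for n in nodes - done if preds[n] <= done)
        levels.append(ready)
        done.update(ready)
    return levels
```

For the graph above this yields three levels, the last of which contains four independent tasks.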

Automatic Parallelization: Applications and Performance
- Applications: clusters, GPUs (NVIDIA)
- Models need to be highly parallel
- Models need to synchronize a lot
- Aronsson (2006), Lundvall (2008), Östlund (2009), Stavåker (2011)

Distributed Systems
- Decouple systems to make them more parallel
- Delay lines (physically motivated)
- Trivial to parallelize
- Parallelizes over time
  - Synchronize between time steps
  - May use different step sizes and numerical solvers
- Nyström (2006), Sjölund (2010)
[Figure: task graph partitioned into decoupled subsystems, nodes a-g]

Transmission Line Modeling (TLM)
- Numerically stable co-simulation
- Physically motivated time delays are inserted between components
- Originally used in hydraulics, with propagation delays along pipes
- Generalized to other engineering domains
- c1, c2 are the TLM parameters, Ttlm is the information propagation time, Zf is the implicit impedance
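A minimal sketch of the TLM idea, under the common formulation in which each end of a connection sees v = Zf*i + c and the wave variable c carries the other end's information delayed by Ttlm (here modelled as a fixed number of buffered steps; Zf and the buffer depth are illustrative, not from the slide):

```python
from collections import deque

class TLMLine:
    """One TLM connection between two subsystems. Because the incoming
    wave variables c1, c2 are delayed, each side can be stepped
    independently within one delay interval."""
    def __init__(self, Zf, delay_steps):
        self.Zf = Zf
        self.to1 = deque([0.0] * delay_steps)  # waves travelling towards end 1
        self.to2 = deque([0.0] * delay_steps)  # waves travelling towards end 2
    def exchange(self, v1, i1, v2, i2):
        """Store each end's outgoing wave, return the delayed incoming
        wave variables (c1, c2)."""
        self.to2.append(v1 + self.Zf * i1)
        self.to1.append(v2 + self.Zf * i2)
        return self.to1.popleft(), self.to2.popleft()
```

With a delay of two steps, a disturbance applied at end 1 only reaches end 2 on the third exchange, which is exactly the slack the decoupled solvers exploit.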

Distributed Model
- SubSystem 1: solver Dassl, step size 0.1
- SubSystem 2: solver Lsode2, step size 0.01
- SubSystem 3: solver Euler, step size 0.001
- SubSystem 4: solver LAPACK, step size 1.0
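A sketch of the master loop such a distributed model implies: each subsystem advances with its own internal step size and solver, and coupling data is only exchanged at common macro-step boundaries. The step functions here are stand-ins for the Euler/Dassl/Lsode2 solvers named above; step sizes in the test are chosen to be exact binary fractions.

```python
def cosimulate(subsystems, t_end, macro_h):
    """Advance each (step_fn, h) subsystem independently inside a macro
    step; coupling data (e.g. TLM wave variables) would be exchanged
    at the macro-step boundary."""
    t = 0.0
    while t < t_end:
        for step_fn, h in subsystems:
            local_t = t
            while local_t < t + macro_h:        # inner steps, own step size
                step_fn(min(h, t + macro_h - local_t))
                local_t += h
        t += macro_h                            # synchronization point
    return t
```

Because the subsystems do not interact inside a macro step, the inner loop over subsystems is trivially parallel.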

Introducing Parallel Constructs
- Explicit parallel programming
  - NestStepModelica
  - ParModelica extension (this presentation)
- OpenCL target
- Parallel variables, parallel for loops, parallel functions, kernel functions
- Moghadam (2011), Gebremedhin (2011)

ParModelica/OpenCL
Limitations
- Only for the algorithmic parts of Modelica
- Requires general knowledge of parallel programming paradigms
Advantages
- Easy to use
- Eliminates the need to write external C functions for parallel computations
- The same Modelica code can be targeted to different frameworks, e.g. OpenCL and CUDA
- Can achieve good speedups for computationally heavy simulations

ParModelica Global and Shared Variables

function parvar
  Integer m = 1024;
  Integer n;
  Integer A[m];
  Integer B[m];
  parglobal Integer pm;
  parglobal Integer pn;
  parglobal Integer pa[m];
  parglobal Integer pb[m];
  parshared Integer ps;
  parshared Integer pss[10];
algorithm
  B := A;
  pa := A;    // copy to device
  B := pa;    // copy from device
  pb := pa;   // copy device to device
  pm := m;    // copy scalar to device
  n := pm;    // copy scalar from device
  pn := pm;   // copy scalar device to device
end parvar;

ParModelica Parallel For Loops

pa := A;
pb := B;
parfor i in 1:m loop
  for j in 1:pm loop
    ptemp := 0;
    for h in 1:pm loop
      ptemp := multiply(pa[i,h], pb[h,j]) + ptemp;
    end for;
    pc[i,j] := ptemp;
  end for;
end parfor;
C := pc;

parallel function multiply
  parglobal input Integer a;
  parglobal input Integer b;
  output Integer c;
algorithm
  c := a * b;
end multiply;

- Parallel for loops in other languages: MATLAB parfor, Visual C++ parallel_for, Mathematica ParallelDo, OpenMP omp for (dynamic scheduling), ...
- Parallel functions compile to OpenCL kernel-file functions or CUDA device functions
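The key property the parfor exploits is that each iteration i (one row of the result matrix) is independent of the others. A hedged Python analogue using a thread pool in place of OpenCL work-items (names and pool size are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_row(args):
    """Body of one parfor iteration: compute row i of C = A * B
    for square matrices stored as lists of lists."""
    i, A, B = args
    m = len(B)
    return [sum(A[i][h] * B[h][j] for h in range(m)) for j in range(m)]

def par_matmul(A, B, workers=4):
    """Map the independent row computations onto parallel workers,
    mirroring the ParModelica parfor over i."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(matmul_row, ((i, A, B) for i in range(len(A)))))
```

The inner j and h loops stay sequential within each worker, just as they stay sequential within each work-item in the ParModelica version.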

ParModelica Kernel Functions
Kernel functions correspond to OpenCL kernel functions or CUDA global functions.

oclsetnumthreads(globalsizes, localsizes);
pc := arrayelemwisemultiply(pm, pa, pb);

parkernel function arrayelemwisemultiply
  parglobal input Integer m;
  parglobal input Integer A[:];
  parglobal input Integer B[:];
  parglobal output Integer C[m];
  parprivate Integer id;
  parshared Integer portionid;
algorithm
  id := oclgetglobalid(1);
  if oclgetlocalid(1) == 1 then
    portionid := oclgetgroupid(1);
  end if;
  ocllocalbarrier();
  C[id] := multiply(A[id], B[id], portionid);
end arrayelemwisemultiply;

oclsetnumthreads(0);

- Full (up to 3D) work-group and work-item arrangement
- OpenCL work-item functions supported
- OpenCL synchronizations supported

Gained Speedup, Examples

Matrix multiplication:
- Intel Xeon E5520 CPU (16 cores): speedup 26
- NVIDIA Fermi-Tesla M2050 GPU (448 cores): speedup 115

Heat conduction:
- Intel Xeon E5520 CPU (16 cores): speedup 7
- NVIDIA Fermi-Tesla M2050 GPU (448 cores): speedup 22

[Charts: speedup vs. parameter M (matrix size MxM), M = 64-512 for matrix multiplication and M = 128-2048 for heat conduction]

Parallel Optimization Algorithms
- Compile once, run many times
  - Parameters may be changed after compilation
- Example: parameter sweeps
  - Run n processes at a time
  - Find the optimal solution
- A good solution for certain problems
- Not suitable for real-time/embedded systems
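A sketch of the compile-once/run-many pattern: the model is compiled to one simulation executable, and only a parameter changes between runs, so candidate values can be evaluated in parallel and the best kept. The objective function and parameter names below are invented for illustration, and threads stand in for the n independent simulation processes.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(resistance):
    """Stand-in for one run of the compiled simulation executable;
    a toy cost with its optimum at resistance = 330."""
    return (resistance - 330.0) ** 2

def sweep(values, workers=4):
    """Evaluate all candidate parameter values in parallel and return
    the one with the lowest simulated cost."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        costs = list(pool.map(simulate, values))
    return min(zip(costs, values))[1]
```

No recompilation happens inside the loop, which is why this scheme parallelizes so cheaply compared with approaches that manipulate the model itself.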

Conclusions
- No silver bullet
- Possible to achieve speedup
  - For certain models
  - By changing the model
  - For certain applications
- Parallelize as large parts as possible to avoid overhead
- Don't do fine-grained parallelization if you only want to perform a simple parameter sweep