Parallel Computing Using Modelica Martin Sjölund, Mahder Gebremedhin, Kristian Stavåker, Peter Fritzson PELAB, Linköping University ModProd Feb 2012, Linköping University, Sweden
What is Modelica?
- An equation-based, object-oriented language
- For modelling of physical systems, and more
- Explained by example
Example - RC Circuit (Diagram)
Example - RC Circuit (Code)

model RC
  Modelica.Electrical.Analog.Basic.Ground ground1;
  Modelica.Electrical.Analog.Basic.Resistor resistor1(R = 100);
  Modelica.Electrical.Analog.Basic.Capacitor capacitor1(C = 0.01);
  Modelica.Electrical.Analog.Sources.SineVoltage sinevoltage1(V = 240, freqHz = 50);
equation
  connect(capacitor1.n, ground1.p);
  connect(sinevoltage1.n, ground1.p);
  connect(resistor1.n, sinevoltage1.p);
  connect(resistor1.p, capacitor1.p);
end RC;
Example - RC Circuit (Flat Code)

class RC // 24 equations and variables
equation
  ground1.p.v = 0.0;
  0.0 = resistor1.p.i + resistor1.n.i;
  resistor1.i = resistor1.p.i;
  resistor1.T_heatPort = resistor1.T;
  capacitor1.i = capacitor1.c * der(capacitor1.v);
  capacitor1.v = capacitor1.p.v - capacitor1.n.v;
  0.0 = capacitor1.p.i + capacitor1.n.i;
  capacitor1.i = capacitor1.p.i;
  ...
end RC;
Magic Happens (Next talk has details)
Example - RC Circuit (Output)
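As a rough sketch of what the generated simulation code for this model does, the circuit reduces to a single state (the capacitor voltage), which a solver integrates over time. The sketch below uses forward Euler for simplicity (real Modelica tools default to more robust solvers); the constants come from the model above, everything else is illustrative:

```python
import math

# Hedged sketch of simulating the RC circuit above: one state variable
# (the capacitor voltage v), integrated with forward Euler. Real tools
# use variable-step, possibly implicit solvers instead.
R, C = 100.0, 0.01           # resistor1.R, capacitor1.C from the model
V, f = 240.0, 50.0           # sine source amplitude and frequency

def simulate(t_end=0.1, h=1e-5):
    v, t = 0.0, 0.0          # capacitor voltage starts at 0
    while t < t_end:
        vs = V * math.sin(2 * math.pi * f * t)  # source voltage
        i = (vs - v) / R                        # current through the resistor
        v += h * i / C                          # der(v) = i / C
        t += h
    return v

print(simulate())
```

Because the capacitor's impedance at 50 Hz is small compared to R, the capacitor voltage stays far below the 240 V source amplitude.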
Symptom: Simulation is slow. Why?
Simple Model (10 years ago)
Simple Computer (10 years ago)
Complex Model (Today)
Computers Today
The Problem
- Algorithms for numerical simulation are mostly designed for single CPUs
- They scaled well until multi-core CPUs arrived
- Not much research on parallelizing simulations
Computer We Want Today
Idea: Map Submodels to CPUs/GPUs
Solutions
Strategies for Utilizing Parallelism
- Parallelize the numeric solver: not covered here (complementary)
- Automatic parallelization: no model manipulation
- Distributed simulation: model manipulation
- Explicit parallel programming: new language constructs
- Parallel optimization algorithms
Automatic Parallelization
- Parallelizes over the equation system
- Builds a task graph: join and split
- Scheduling, merging tasks
[Figure: task graph with nodes a-g]
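The scheduling step above can be illustrated with a small sketch (not OpenModelica's actual scheduler): group the task-graph nodes into levels such that all tasks within a level are mutually independent and could run in parallel. The dependency graph here is hypothetical, mimicking the a-g nodes of the figure:

```python
# Illustrative level scheduling of an equation task graph.
# deps maps each task to the tasks it depends on (hypothetical DAG).
deps = {
    "a": [], "b": ["a"], "c": ["a"],
    "d": ["b"], "e": ["b"], "f": ["c"], "g": ["d", "e", "f"],
}

def schedule(deps):
    levels, done = [], set()
    while len(done) < len(deps):
        # all tasks whose dependencies have already been computed
        ready = [n for n in deps if n not in done
                 and all(p in done for p in deps[n])]
        levels.append(sorted(ready))
        done.update(ready)
    return levels

print(schedule(deps))  # -> [['a'], ['b', 'c'], ['d', 'e', 'f'], ['g']]
```

Merging then trades parallelism for lower communication overhead by fusing small tasks within or across levels.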
Automatic Parallelization: Applications and Performance
- Applications: clusters, GPUs (NVIDIA)
- Models need to be highly parallel
- Need to synchronize a lot (limits speedup)
- Aronsson (2006), Lundvall (2008), Östlund (2009), Stavåker (2011)
Distributed Systems
- Decouple subsystems to make them more parallel
- Delay lines (physically motivated) are trivial to parallelize
- Parallelizes over time: synchronize between time steps
- Subsystems may use different step sizes and numerical solvers
- Nyström (2006), Sjölund (2010)
[Figure: decoupled task graph with nodes a-g]
Transmission Line Modeling (TLM)
- Numerically stable co-simulation
- Physically motivated time delays are inserted between components
- Originally used in hydraulics, with propagation delays along pipes
- Generalized to other engineering domains
- c1, c2 are the TLM parameters (wave variables)
- Ttlm is the information propagation time
- Zf is the impedance
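The TLM element relations behind c1, c2, Ttlm and Zf can be written out explicitly. A common formulation (sign conventions vary between tools) couples each side's voltage-like and flow-like variables through the delayed state of the opposite side:

```latex
% Common TLM element relations (sign conventions vary between tools):
v_1(t) = Z_f \, i_1(t) + c_1(t), \qquad
c_1(t) = v_2(t - T_{tlm}) + Z_f \, i_2(t - T_{tlm})
v_2(t) = Z_f \, i_2(t) + c_2(t), \qquad
c_2(t) = v_1(t - T_{tlm}) + Z_f \, i_1(t - T_{tlm})
```

Because each side only needs values of the other side that are at least Ttlm old, the two sides can be integrated independently during that interval.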
Distributed Model
- SubSystem 1: solver Dassl, step size 0.1
- SubSystem 2: solver Lsode2, step size 0.01
- SubSystem 3: solver Euler, step size 0.001
- SubSystem 4: solver LAPACK, step size 1.0
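The idea of the slide above can be sketched in a few lines: two subsystems are coupled only through values that are one communication interval old, so each can take its macro step with its own solver and step size, independently (and, in a real tool, in parallel). The model and all constants here are illustrative, not from the slides:

```python
# Hedged sketch of delay-based co-simulation with different step sizes.
H = 0.01                      # communication interval (= coupling delay)

def euler(x, u, h, steps, tau):
    # integrate dx/dt = (u - x)/tau with fixed step h (forward Euler)
    for _ in range(steps):
        x += h * (u - x) / tau
    return x

x1, x2 = 1.0, 0.0
for _ in range(100):                      # 100 macro steps
    u1, u2 = x2, x1                       # exchange delayed coupling values
    x1 = euler(x1, u1, H / 10, 10, 0.5)   # subsystem 1: fine step size
    x2 = euler(x2, u2, H / 2, 2, 0.2)     # subsystem 2: coarser step size
print(x1, x2)
```

After enough macro steps the two states converge toward a common equilibrium; the coupling error introduced by the delay shrinks with the communication interval.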
Introducing Parallel Constructs
- Explicit parallel programming
- NestStepModelica
- ParModelica extension (this presentation)
  - OpenCL target
  - Parallel variables
  - Parallel for loops
  - Parallel functions
  - Kernel functions
- Moghadam (2011), Gebremedhin (2011)
ParModelica/OpenCL
Limitations
- Only for the algorithmic parts of Modelica
- Requires general knowledge of parallel programming paradigms
Advantages
- Easy to use
- Eliminates the need to write external C functions for parallel computations
- The same Modelica code can be targeted to different frameworks, e.g. OpenCL and CUDA
- Can achieve good speedups for computationally heavy simulations
ParModelica Global and Shared Variables

function parvar
  Integer m = 1024;
  Integer n;
  Integer A[m];
  Integer B[m];
  parglobal Integer pm;
  parglobal Integer pn;
  parglobal Integer pa[m];
  parglobal Integer pb[m];
  parshared Integer ps;
  parshared Integer pss[10];
algorithm
  B := A;
  pa := A;  // copy host to device
  B := pa;  // copy device to host
  pb := pa; // copy device to device
  pm := m;  // copy scalar to device
  n := pm;  // copy scalar from device
  pn := pm; // copy scalar device to device
end parvar;
ParModelica Parallel For Loops

pa := A;
pb := B;
parfor i in 1:m loop
  for j in 1:pm loop
    ptemp := 0;
    for h in 1:pm loop
      ptemp := multiply(pa[i,h], pb[h,j]) + ptemp;
    end for;
    pc[i,j] := ptemp;
  end for;
end parfor;
C := pc;

parallel function multiply
  parglobal input Integer a;
  parglobal input Integer b;
  output Integer c;
algorithm
  c := a * b;
end multiply;

- Parallel for loops in other languages: MATLAB parfor, Visual C++ parallel_for, Mathematica ParallelDo, OpenMP omp for (dynamic scheduling), ...
- Parallel functions are compiled to OpenCL kernel-file functions or CUDA device functions
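For comparison with the parfor loop above, here is a rough CPU analogue (a sketch, not ParModelica's generated code): each task computes one row of C = A * B, mirroring one iteration of the outer parfor loop. Threads are used here only for illustration; the GPU version runs one work-item per element instead:

```python
from concurrent.futures import ThreadPoolExecutor

# One task per row of the result, analogous to one parfor iteration.
def row_times_matrix(i, A, B):
    cols = len(B[0])
    return [sum(A[i][h] * B[h][j] for h in range(len(B)))
            for j in range(cols)]

def parallel_matmul(A, B, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(row_times_matrix, i, A, B)
                   for i in range(len(A))]
        return [f.result() for f in futures]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))  # -> [[19, 22], [43, 50]]
```

As in parfor, correctness requires that iterations be independent: each task writes only its own row of the result.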
ParModelica Kernel Functions

oclSetNumThreads(globalsizes, localsizes);
pc := arrayElemWiseMultiply(pm, pa, pb);
oclSetNumThreads(0);

parkernel function arrayElemWiseMultiply
  parglobal input Integer m;
  parglobal input Integer A[:];
  parglobal input Integer B[:];
  parglobal output Integer C[m];
  parprivate Integer id;
  parshared Integer portionid;
algorithm
  id := oclGetGlobalId(1);
  if oclGetLocalId(1) == 1 then
    portionid := oclGetGroupId(1);
  end if;
  oclLocalBarrier();
  C[id] := multiply(A[id], B[id], portionid);
end arrayElemWiseMultiply;

- Kernel functions correspond to OpenCL kernel functions or CUDA global functions
- Full (up to 3D) work-group and work-item arrangement
- OpenCL work-item functions supported
- OpenCL synchronizations supported
Gained Speedup, Examples

Matrix Multiplication:
- Intel Xeon E5520 CPU (16 cores): up to 26x
- NVIDIA Fermi-Tesla M2050 GPU (448 cores): up to 115x

Heat Conduction:
- Intel Xeon E5520 CPU (16 cores): up to 7x
- NVIDIA Fermi-Tesla M2050 GPU (448 cores): up to 22x

[Charts: speedup vs. parameter M (matrix size MxM), M = 64-512 for matrix multiplication and M = 128-2048 for heat conduction]
Parallel Optimization Algorithms
- Compile once, run many times: parameters may be changed after compilation
- Example: parameter sweeps
  - Run n processes at a time
  - Find the optimal solution
- Good solution for certain problems
- Not suitable for real-time/embedded systems
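The parameter-sweep idea above can be sketched as follows. The runs are independent, so they scale almost perfectly; `simulate` here is a hypothetical stand-in for invoking the compiled simulation executable with a changed parameter value (threads are used for simplicity; a real sweep would launch separate processes):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cost of one simulation run with parameter value r;
# a real sweep would run the compiled model and evaluate its output.
def simulate(r):
    return (r - 3.7) ** 2

def sweep(values, workers=4):
    # run all simulations, n at a time, and return the best parameter
    with ThreadPoolExecutor(max_workers=workers) as pool:
        costs = list(pool.map(simulate, values))
    best = min(range(len(values)), key=costs.__getitem__)
    return values[best], costs[best]

print(sweep([v / 10 for v in range(100)]))  # best value near 3.7
```

Because the model is compiled once and only parameter values change between runs, the per-run overhead is just process startup and I/O.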
Conclusions
- No silver bullet
- Possible to achieve speedup:
  - For certain models
  - By changing the model
  - For certain applications
- Parallelize as large parts as possible to avoid overhead
- Don't do fine-grained parallelization if you only want to perform a simple parameter sweep