Performance and Software-Engineering Considerations for Massively Parallel Simulations


Ulrich Rüde
Ben Bergen, Frank Hülsemann, Christoph Freundl
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
SIAM CSE - February

Outline
- Multigrid on Supercomputers
- Expression Templates: ParExPDE
- Hierarchical Hybrid Grids: HHG
- Lattice Boltzmann Methods
- Current and Future Challenges

Hitachi SR 8000 at Bavarian Leibniz Supercomputer Center
- No. 5 in TOP-500 at time of installation in 2000
- replacement with 60 Tflop SGI scheduled for Proc and 8 GB per node
- Performance: 1344 CPUs (168*8), 12 GFlop/node, 2016 GFlop total
- Linpack: 1645 GFlop (82% of theoretical peak)
- Very sensitive to data structures

Part I: Large Scale Elliptic PDE


General architecture of ParExPDE (figure slide)

Performance of Expression Templates
- Results in MFlops for the execution on vectors of length 1,000,000 on a single processor
- Expressions:
  - daxpy: c = a + k * b
  - complex: c = k * a + l * b + m * a * b
- Intel Pentium GHz, gcc
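The daxpy and "complex" expressions above can be illustrated with a minimal expression-template sketch. This is not the ParExPDE API: the names `Expr`, `Vec`, `Sum`, `Prod` and `Scaled` are hypothetical and chosen only for illustration. The idea shown is that the operators build a lightweight expression tree, and the assignment evaluates it in a single fused loop without temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: every expression type E derives from Expr<E>.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Concrete vector that owns its data (hypothetical name).
class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    std::size_t size() const { return data_.size(); }
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }

    // Assigning an expression runs one fused loop over all entries.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

// Expression nodes hold references only, so they must be consumed
// within the full expression that created them.
template <typename L, typename R>
class Sum : public Expr<Sum<L, R>> {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) {}
    std::size_t size() const { return l_.size(); }
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
private:
    const L& l_;
    const R& r_;
};

template <typename L, typename R>
class Prod : public Expr<Prod<L, R>> {  // element-wise product
public:
    Prod(const L& l, const R& r) : l_(l), r_(r) {}
    std::size_t size() const { return l_.size(); }
    double operator[](std::size_t i) const { return l_[i] * r_[i]; }
private:
    const L& l_;
    const R& r_;
};

template <typename R>
class Scaled : public Expr<Scaled<R>> {  // scalar times expression
public:
    Scaled(double k, const R& r) : k_(k), r_(r) {}
    std::size_t size() const { return r_.size(); }
    double operator[](std::size_t i) const { return k_ * r_[i]; }
private:
    double k_;
    const R& r_;
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

template <typename L, typename R>
Prod<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return Prod<L, R>(l.self(), r.self());
}

template <typename R>
Scaled<R> operator*(double k, const Expr<R>& r) {
    return Scaled<R>(k, r.self());
}
```

With these definitions, `c = a + k * b;` and `c = k * a + l * b + m * a * b;` compile to one loop each. How well that loop is optimized depends on the compiler, which is the sensitivity the measurements on this slide are about.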


Implementation of differential operators on Hitachi
- Performance problems on the Hitachi
- Compiler (optimizer) quality limits performance
- Expression type: Simple, Differential operator
- Hitachi SR MFlops 25 MFlops
- Pentium GHz 407 MFlops 918 MFlops

Structured vs. Unstructured Grids
- gridlib/hhg MFlops rates for matrix-vector multiplication on one node on the Hitachi, compared with highly tuned JDS results for sparse matrices (courtesy of G. Wellein, RRZE Erlangen)
- Architecture very dependent on uniform data structures

What are hierarchical hybrid grids? (Ben Bergen)
- Standard geometric multigrid approach:
  - Purely unstructured input grid resolves geometry of problem domain
  - Patch-wise regular refinement, applied repeatedly to every cell of the coarse grid, generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
- New: Modify storage formats and operations on the grid to reflect the generated regular substructures

Common misconceptions
Hierarchical hybrid grids (HHG)
- are not yet another block structured grid: HHG are more flexible (unstructured, hybrid input grids)
- are not yet another unstructured geometric multigrid package: HHG achieve better performance; unstructured treatment of regular regions does not improve performance
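The difference between an unstructured treatment and one that exploits regular substructures can be sketched as follows. This is a deliberately simplified 1D illustration, not HHG code: it uses a three-point Laplacian with zero Dirichlet values instead of the 3D stencils used in HHG, a plain CSR format instead of JDS, and all names (`CsrMatrix`, `build_laplacian_csr`, `apply_csr`, `apply_stencil`) are hypothetical. Both routines compute the same product; the CSR version needs an index lookup per entry, while the stencil version uses fixed offsets and constant coefficients.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row storage: the generic "unstructured" format.
struct CsrMatrix {
    std::vector<std::size_t> row_ptr;  // start of each row in col_idx/val
    std::vector<std::size_t> col_idx;  // column index of each entry
    std::vector<double> val;           // value of each entry
};

// Assemble the 1D Laplacian stencil [-1 2 -1] as an explicit sparse matrix.
CsrMatrix build_laplacian_csr(std::size_t n) {
    CsrMatrix A;
    A.row_ptr.push_back(0);
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0) { A.col_idx.push_back(i - 1); A.val.push_back(-1.0); }
        A.col_idx.push_back(i);
        A.val.push_back(2.0);
        if (i + 1 < n) { A.col_idx.push_back(i + 1); A.val.push_back(-1.0); }
        A.row_ptr.push_back(A.col_idx.size());
    }
    return A;
}

// Unstructured treatment: indirect addressing through col_idx.
std::vector<double> apply_csr(const CsrMatrix& A, const std::vector<double>& x) {
    const std::size_t n = A.row_ptr.size() - 1;
    assert(x.size() == n);
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.val[k] * x[A.col_idx[k]];
    return y;
}

// Structured treatment: fixed neighbour offsets, no stored matrix at all.
std::vector<double> apply_stencil(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? x[i - 1] : 0.0;       // boundary value 0
        const double right = (i + 1 < n) ? x[i + 1] : 0.0;  // boundary value 0
        y[i] = -left + 2.0 * x[i] - right;
    }
    return y;
}
```

The stencil loop has unit-stride access and no index arrays to load, which is the kind of uniform data structure the slides identify as important for the Hitachi.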

Refinement example (figure slides): Input Grid; Refinement Level One; Refinement Level Two; Structured Interior; Edge Interior

Results, Scaling, Efficiency (results by F. Hülsemann)
- Table columns: #CPU, Dof x, Time (s)
- 49 Time (s) 44 44 44 45 48
- Poisson equation
- Dirichlet boundary conditions
- Multigrid FMG(2,2) cycle
- 27 point stencil
- 9 cubes/process
- refinement level 7 (h=1/128)

Speedup for the same problem (6 times regularly refined)

Part II: Lattice Boltzmann Methods

Towards Simulating Metal Foams
- In collaboration with Carolin Körner, Dept. of Material Sciences, University Erlangen
- Bubble growth, coalescence, collapse, drainage, rheology, etc. are still poorly understood
- Simulation as a tool to better understand, control and optimize the process


The Stream Step
- Move particle distribution functions along corresponding velocity vector
- Normalized time step, cell size and particle speed

The Collide Step
- Collisions of particles during movement
- Weigh equilibrium velocities and velocities from streaming depending on fluid viscosity
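The two steps can be sketched on a small periodic 2D lattice. This is an illustrative sketch, not the code from the talk: it assumes the common D2Q9 velocity set and a single-relaxation-time (BGK) collision, neither of which is stated on the slides, and the names `Lattice`, `stream`, `collide` and `total_mass` are hypothetical. The relaxation parameter `omega` carries the viscosity dependence; in this model the lattice viscosity is (1/omega - 0.5)/3.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int Q = 9;  // assumed D2Q9 model: 9 velocities per cell
constexpr int CX[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr int CY[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
constexpr double W[Q] = {4.0 / 9,  1.0 / 9,  1.0 / 9,  1.0 / 9, 1.0 / 9,
                         1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

struct Lattice {
    int nx, ny;
    std::vector<double> f;  // distribution functions, Q values per cell

    // Start from a fluid at rest with density 1 in every cell.
    Lattice(int nx_, int ny_)
        : nx(nx_), ny(ny_), f(static_cast<std::size_t>(nx_) * ny_ * Q) {
        assert(nx > 0 && ny > 0);
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                for (int q = 0; q < Q; ++q) at(x, y, q) = W[q];
    }
    std::size_t idx(int x, int y, int q) const {
        return (static_cast<std::size_t>(y) * nx + x) * Q + q;
    }
    double& at(int x, int y, int q) { return f[idx(x, y, q)]; }
    double at(int x, int y, int q) const { return f[idx(x, y, q)]; }
};

// Stream: move each distribution one cell along its velocity (periodic).
void stream(Lattice& lat) {
    std::vector<double> next(lat.f.size());
    for (int y = 0; y < lat.ny; ++y)
        for (int x = 0; x < lat.nx; ++x)
            for (int q = 0; q < Q; ++q) {
                const int xd = (x + CX[q] + lat.nx) % lat.nx;
                const int yd = (y + CY[q] + lat.ny) % lat.ny;
                next[lat.idx(xd, yd, q)] = lat.at(x, y, q);
            }
    lat.f.swap(next);
}

// Collide: blend streamed values with the local equilibrium using omega.
void collide(Lattice& lat, double omega) {
    for (int y = 0; y < lat.ny; ++y)
        for (int x = 0; x < lat.nx; ++x) {
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int q = 0; q < Q; ++q) {
                rho += lat.at(x, y, q);
                ux += CX[q] * lat.at(x, y, q);
                uy += CY[q] * lat.at(x, y, q);
            }
            ux /= rho;
            uy /= rho;
            const double usq = ux * ux + uy * uy;
            for (int q = 0; q < Q; ++q) {
                const double cu = 3.0 * (CX[q] * ux + CY[q] * uy);
                const double feq =
                    W[q] * rho * (1.0 + cu + 0.5 * cu * cu - 1.5 * usq);
                lat.at(x, y, q) = (1.0 - omega) * lat.at(x, y, q) + omega * feq;
            }
        }
}

// Sum of all distributions; stream and collide both conserve it.
double total_mass(const Lattice& lat) {
    double m = 0.0;
    for (double v : lat.f) m += v;
    return m;
}
```

One time step is `stream(lat); collide(lat, omega);`, a single sweep of each kind over the grid with purely local or nearest-neighbour access.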


True Foams with Disjoining Pressure (visualization by Nils Thürey)

Parallel Implementation (by Thomas Pohl)
- Standard LBM-Code in C
  - excellent performance on single SR-8000 node
  - almost linear speed-up
  - larger partitions better
- Performance on the SR-8000: ca. 30% of peak performance

Standard LBM-Code: Scalability
- Parallel Implementation
- Largest simulation: 1.08*10^9 cells, 370 GByte memory
- 64 MByte to communicate in each step => efficiency ~ 75%

Free surface LBM-Code: Parallelizing the code
- Standard LBM: 1 sweep through grid
- Free surface LBM: 5 sweeps through grid
  - Cell type change, creating closed boundary, initializing changed cells, mass-rebalance

Free surface LBM-Code: Parallelizing the code
- Standard LBM: 1 sweep through grid, 1 column of ghost nodes
- Free surface LBM: 5 sweeps through grid, 4 columns of ghost nodes

Performance: Standard LBM-Code vs. Free surface LBM-Code
- Performance very bad on a single node
- If-statements: 2,9 (SLBM) -> 51 (free surface LBM)
- Pentium 4: performance loss ~ 10%
- SR8000: high loss (pseudo-vector architecture, predictable statements)

Part III: Challenges and Problems

Current Challenge: Parallelism on all levels and The Memory Wall
- Parallel computing is easy, good (single) processor performance is difficult (B. Gropp, Argonne)
- There has been no significant progress in High Performance Computing over the past 5 years (H. Simon, NERSC)
- Instruction level (on-chip) parallelism
- Memory bandwidth and latency are the limiting factors
- Cache-aware algorithms
- Conventional complexity measures (based on operation count) are becoming increasingly unrealistic
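The point about operation counts can be made concrete with a small sketch. The two routines below perform exactly the same loads and stores, so a complexity measure based on operation count cannot tell them apart, yet they traverse memory in a different order. The example is not from the talk: matrix transposition, the function names and the `block` parameter are assumptions chosen for illustration, and no timings are claimed here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n transpose. Reads of `a` are contiguous, but writes to
// `b` jump by n elements, so each write may touch a different cache line.
void transpose_naive(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

// Same operations, reordered into block x block tiles so that the parts
// of `a` and `b` touched by the inner loops can stay in cache together.
void transpose_blocked(const std::vector<double>& a, std::vector<double>& b,
                       std::size_t n, std::size_t block) {
    assert(block > 0);
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t jj = 0; jj < n; jj += block)
            for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                    b[j * n + i] = a[i * n + j];
}
```

A suitable `block` depends on the cache sizes of the target machine and would have to be determined by measurement.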

Figure: Moore's Law in Semiconductor Technology (F. Hossfeld). Transistors/Die over Year for DRAM and Microprocessor (Intel), with annotated growth rates of 52% per year and 42% per year.

Figure: Information Density & Energy Dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998). Atoms/Bit and Energy/logic Operation [pico-joules] over Year.

Conclusions (1)
- High performance computing still requires heroic programming
- but we are on the way to make supercomputers more generally usable
- Which architecture?
  - ASCI-type: custom CPU, massively parallel cluster of SMPs
  - Earth-simulator-type: Vector CPU, as many CPUs as affordable
  - Hitachi Class: modified custom CPU, cluster of SMPs
  - Others: BlueGene, Cray X1, Multithreading, PIM, reconfigurable, quantum computing
- What will come next?

Conclusions (2)
- Which grid data structures?
  - structured (inflexible)
  - unstructured (slow)
  - HHG (high development effort, even prototype 50 K lines of code)
- Where are we going?
  - the end of Moore's law
  - (almost) nobody builds CPUs for HPC specific requirements
  - petaflops: 100,000 processors needed and we can hardly handle 1000
  - the memory wall: latency, bandwidth
- It's the locality, stupid

Acknowledgements
- 13 Student projects (Bachelor Thesis): C. Freundl, A. Hausner, N. Thürey, I. Christadler, V. Daum, F. Fleißner, M. Sonntag, J. Wilke, M. Zetlmeisl, S. Donath, K. Iglberger, J. Thies, S. Weigand
- 5 Master Thesis (Diplomarbeit): H. Pfänder, N. Thürey, E. Lang, G. Radzom, C. Freundl
- 11 PhD Research Projects: M. Kowarschik, M. Mohr, B. Bergen, C. Freundl, T. Pohl, U. Fabricius, N. Thürey, S. Meinlschmidt, P. Kipfer, J. Treibig, H. Köstler
- Additional thanks to C. Pflaum and J. Härtlein
- Funded by: KONWIHR, DFG, BMBF
