Performance and Software-Engineering Considerations for Massively Parallel Simulations

Ulrich Rüde (ruede@cs.fau.de)
Ben Bergen, Frank Hülsemann, Christoph Freundl
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
SIAM CSE - February 2005

Outline
- Multigrid on Supercomputers
- Expression Templates: ParExPDE
- Hierarchical Hybrid Grids: HHG
- Lattice Boltzmann Methods
- Current and Future Challenges
Hitachi SR 8000 at the Bavarian Leibniz Supercomputer Center
- No. 5 in the TOP-500 at the time of installation in 2000
- replacement with a 60 TFlop SGI scheduled for 2006
- 8 processors and 8 GB per node
- performance: 1344 CPUs (168 nodes x 8), 12 GFlop/node, 2016 GFlop total
- Linpack: 1645 GFlop (82% of theoretical peak)
- very sensitive to data structures

Part I: Large Scale Elliptic PDEs
General architecture of ParExPDE
[figure: ParExPDE architecture diagram]

Performance of Expression Templates
- results in MFlops for the execution on vectors of length 1,000,000 on a single processor
- expressions:
  daxpy: c = a + k * b
  complex: c = k * a + l * b + m * a * b
- Intel Pentium 4, 2.4 GHz, gcc 3.3.3
- (a minimal sketch of the expression-template mechanism follows below)
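The ParExPDE sources are not shown in the talk, so the following is only a minimal, self-contained C++ sketch of the general expression-template technique (all names are illustrative): the right-hand side of c = a + k * b is assembled into a tiny object tree at compile time and evaluated in a single fused loop, without the temporary vectors that naive operator overloading would allocate.

#include <cstddef>
#include <vector>

// Node representing the element-wise sum l[i] + r[i].
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

// Node representing the scaled operand k * r[i].
template <class R>
struct Scale {
    double k;
    const R& r;
    double operator[](std::size_t i) const { return k * r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assigning any expression runs one fused loop -- no temporaries.
    template <class E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <class R>
Scale<R> operator*(double k, const R& r) { return {k, r}; }

int main() {
    Vec a(1000000, 1.0), b(1000000, 2.0), c(1000000);
    double k = 3.0;
    c = a + k * b;  // the daxpy benchmark expression from the slide
}

Performance hinges entirely on the compiler inlining the nested operator[] calls, which is why a weak optimizer (as on the Hitachi, next slide) can fall off badly on exactly the same source.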
Implementation of differential operators on the Hitachi
- performance problems on the Hitachi: compiler (optimizer) quality limits performance

  Expression type          Hitachi SR-8000   Pentium 4, 2.4 GHz
  simple                   513 MFlops        407 MFlops
  differential operator     25 MFlops        918 MFlops

Structured vs. Unstructured Grids
- gridlib/hhg MFlops rates for matrix-vector multiplication on one node of the Hitachi, compared with highly tuned JDS results for sparse matrices (courtesy of G. Wellein, RRZE Erlangen)
- the architecture depends heavily on uniform data structures
What are hierarchical hybrid grids? (Ben Bergen)
- standard geometric multigrid approach:
  - a purely unstructured input grid resolves the geometry of the problem domain
  - patch-wise regular refinement, applied repeatedly to every cell of the coarse grid, generates nested grid hierarchies that are naturally suitable for geometric multigrid algorithms
- new: modify the storage formats and the operations on the grid to reflect the generated regular substructures (see the data-layout sketch below)

Common misconceptions
- HHG are not yet another block-structured grid: HHG are more flexible (unstructured, hybrid input grids)
- HHG are not yet another unstructured geometric multigrid package: HHG achieve better performance -- treating regular regions as unstructured does not improve performance
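The actual HHG storage format is not given in the talk; the sketch below illustrates the idea with hypothetical names (and a 7-point stencil for brevity, where the results later use a 27-point one). After regular refinement, each coarse element owns a structured (i,j,k)-indexed block of unknowns, so interior operations run as plain stencil loops with no indirect addressing, while only the coarse-grid connectivity remains unstructured.

#include <cstddef>
#include <vector>

// Hypothetical HHG-style element block: the coarse grid stays unstructured,
// but after 'level' regular refinements each coarse (hexahedral) element
// owns an n^3 structured block of unknowns, n = 2^level + 1.
struct ElementBlock {
    std::size_t n;
    std::vector<double> u, f;  // solution and right-hand side, (i,j,k)-indexed

    explicit ElementBlock(int level)
        : n((std::size_t{1} << level) + 1), u(n * n * n), f(n * n * n) {}

    std::size_t idx(std::size_t i, std::size_t j, std::size_t k) const {
        return (k * n + j) * n + i;
    }

    // One Jacobi sweep with a 7-point Laplacian over the structured interior:
    // no indirect addressing, so the loop runs at regular-grid speed.
    void smooth_interior(double h) {
        std::vector<double> un(u);
        for (std::size_t k = 1; k + 1 < n; ++k)
            for (std::size_t j = 1; j + 1 < n; ++j)
                for (std::size_t i = 1; i + 1 < n; ++i)
                    un[idx(i, j, k)] =
                        (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                         u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                         u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)] +
                         h * h * f[idx(i, j, k)]) / 6.0;
        u.swap(un);
    }
};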
Refinement example
[figure sequence: input grid; refinement level one; refinement level two; structured interior; edge interior]

Results, Scaling, Efficiency (results by F. Hülsemann)
- Poisson equation, Dirichlet boundary conditions
- multigrid, FMG(2,2) cycle, 27-point stencil
- 9 cubes per process, refinement level 7 (h = 1/128)
- speedup for the same problem (6 times regularly refined):

  #CPUs         64        128       256       512       550
  DOF x 10^6    1179.48   2359.74   4719.47   9438.94   10139.49
  time (s)      44        44        44        45        48

- run time stays nearly constant while the number of unknowns grows with the machine size, i.e. close to ideal scale-up
Part II: Lattice Boltzmann Methods

Towards Simulating Metal Foams
- in collaboration with Carolin Körner, Dept. of Material Sciences, University of Erlangen
- bubble growth, coalescence, collapse, drainage, rheology, etc. are still poorly understood
- simulation as a tool to better understand, control and optimize the process
The Stream Step
- move the particle distribution functions along their corresponding velocity vectors
- normalized time step, cell size and particle speed

The Collide Step
- models the collisions of the particles during their movement
- weighs the equilibrium velocities against the velocities coming from streaming, depending on the fluid viscosity (see the update rule below)
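The talk does not name the collision model; assuming the widely used single-relaxation-time (BGK) scheme, the stream and collide steps combine into one update rule:

\[
  f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t)
  \;=\; f_i(\mathbf{x}, t)
  \;-\; \frac{1}{\tau}\bigl(f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\bigr)
\]

The relaxation time \tau encodes the fluid viscosity and controls the weighting between the streamed values f_i and the equilibrium distributions f_i^eq; with time step, cell size and particle speeds normalized to 1, streaming reduces to copying each f_i to the neighboring cell in direction e_i.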
True Foams with Disjoining Pressure
[figure: visualization by Nils Thürey]

Parallel Implementation (by Thomas Pohl)
- standard LBM code in C
  - excellent performance on a single SR-8000 node
  - almost linear speed-up
  - larger partitions perform better
- performance on the SR-8000: ca. 30% of peak (a structural sketch of such a kernel follows below)
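Pohl's code itself is not reproduced in the talk; the compact D2Q9 sketch below (the production code is 3D, and all names here are illustrative) shows the structure a standard LBM kernel has: a single branch-free sweep with regular memory accesses, which is the property behind the good node performance reported above.

#include <vector>

// D2Q9 lattice: discrete velocities and their weights.
const int Q = 9;
const int ex[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
const int ey[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
const double w[Q] = { 4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                      1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36 };

// One time step on an nx * ny periodic grid: pull-streaming from the
// neighbors, then a BGK collision with relaxation time tau.
void lbm_step(const std::vector<double>& src, std::vector<double>& dst,
              int nx, int ny, double tau) {
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            double f[Q], rho = 0.0, ux = 0.0, uy = 0.0;
            for (int i = 0; i < Q; ++i) {  // stream: gather from neighbors
                int xs = (x - ex[i] + nx) % nx;
                int ys = (y - ey[i] + ny) % ny;
                f[i] = src[(ys * nx + xs) * Q + i];
                rho += f[i];
                ux += f[i] * ex[i];
                uy += f[i] * ey[i];
            }
            ux /= rho;
            uy /= rho;
            double usq = ux * ux + uy * uy;
            for (int i = 0; i < Q; ++i) {  // collide: relax to equilibrium
                double eu = ex[i] * ux + ey[i] * uy;
                double feq = w[i] * rho *
                             (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq);
                dst[(y * nx + x) * Q + i] = f[i] - (f[i] - feq) / tau;
            }
        }
}

Before stepping, src would be initialized to a valid population, e.g. f_i = w_i everywhere for a fluid at rest with density 1.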
Standard LBM Code: Scalability
- largest simulation: 1.08 x 10^9 cells, 370 GByte memory
- 64 MByte to communicate in each step -> efficiency ~ 75%

Free Surface LBM Code: Parallelizing the Code

  Standard LBM              Free surface LBM
  1 sweep through grid      5 sweeps through grid (cell type changes,
                            creating closed boundaries, initializing
                            changed cells, mass rebalancing)
  1 column of ghost nodes   4 columns of ghost nodes

Performance: Standard vs. Free Surface LBM Code
- performance is very poor on a single node
- if statements: 2.9 in the standard LBM -> 51 in the free surface LBM
- Pentium 4: performance loss ~ 10%
- SR8000: high loss (the pseudo-vector architecture depends on predictable statements; see the sketch below)
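A minimal sketch of why the free-surface sweep suffers so much on the SR8000 (cell types and handlers are hypothetical names, not the actual code): the free-surface physics forces a data-dependent branch into the innermost loop, which breaks the long predictable instruction streams that the pseudo-vector pipeline needs, whereas the Pentium 4's branch predictor hides most of the cost.

// Hypothetical cell types and handlers, not the actual code.
enum CellType { FLUID, INTERFACE, GAS };

void free_surface_sweep(const CellType* type, int num_cells) {
    for (int c = 0; c < num_cells; ++c) {
        switch (type[c]) {  // data-dependent branch in the innermost loop
        case FLUID:
            // standard stream + collide only
            break;
        case INTERFACE:
            // additionally: track the cell's mass, reconstruct distribution
            // functions missing from the gas side, flag cell type changes
            break;
        case GAS:
            // nothing to compute
            break;
        }
    }
}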
Part III: Challenges and Problems

Current Challenge: Parallelism on All Levels and the Memory Wall
- "Parallel computing is easy, good (single) processor performance is difficult." (B. Gropp, Argonne)
- "There has been no significant progress in High Performance Computing over the past 5 years." (H. Simon, NERSC)
- instruction-level (on-chip) parallelism
- memory bandwidth and latency are the limiting factors
- cache-aware algorithms (a standard device, loop blocking, is sketched below)
- conventional complexity measures (based on operation counts) are becoming increasingly unrealistic
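As a concrete instance of "cache-aware", here is the standard loop-blocking device, shown on a matrix transpose (a generic textbook example, not code from the talk):

#include <vector>

// A naive transpose strides through memory for b, evicting each cache line
// after a single use once n is large. The blocked version works on B x B
// tiles that fit in cache, so every loaded line is fully used. B would be
// tuned to the cache of the target machine.
void transpose_blocked(const std::vector<double>& a, std::vector<double>& b,
                       int n, int B = 64) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    b[j * n + i] = a[i * n + j];  // tile stays cache-resident
}

The same tiling idea carries over to stencil and smoother sweeps, where it increases data reuse before operands are evicted from the cache.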
[figure: Moore's Law in semiconductor technology (F. Hossfeld) -- transistors per die for DRAM (growth: 52% per year) and Intel microprocessors (growth: 42% per year), 1970-2005]

[figure: information density & energy dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998) -- atoms per bit and energy per logic operation in picojoules, 1950-2020, approaching the kT limit around 2017]
Conclusions (1)
- high performance computing still requires heroic programming, but we are on the way to making supercomputers more generally usable
- which architecture?
  - ASCI-type: custom CPUs, massively parallel cluster of SMPs
  - Earth-Simulator-type: vector CPUs, as many CPUs as affordable
  - Hitachi class: modified custom CPUs, cluster of SMPs
  - others: BlueGene, Cray X1, multithreading, PIM, reconfigurable, quantum computing, ...
- what will come next?

Conclusions (2)
- which grid data structures?
  - structured (inflexible)
  - unstructured (slow)
  - HHG (high development effort: even the prototype is 50K lines of code)
- where are we going?
  - the end of Moore's law
  - (almost) nobody builds CPUs for the specific requirements of HPC
  - petaflops: 100,000 processors needed, and we can hardly handle 1,000
  - the memory wall: latency and bandwidth
- it's the locality, stupid!
Acknowledgements
- 13 student projects (bachelor's theses): C. Freundl, A. Hausner, N. Thürey, I. Christadler, V. Daum, F. Fleißner, M. Sonntag, J. Wilke, M. Zetlmeisl, S. Donath, K. Iglberger, J. Thies, S. Weigand
- 5 master's theses (Diplomarbeiten): H. Pfänder, N. Thürey, E. Lang, G. Radzom, C. Freundl
- 11 PhD research projects: M. Kowarschik, M. Mohr, B. Bergen, C. Freundl, T. Pohl, U. Fabricius, N. Thürey, S. Meinlschmidt, P. Kipfer, J. Treibig, H. Köstler
- additional thanks to C. Pflaum and J. Härtlein
- funded by: KONWIHR, DFG, BMBF