Performance and Software-Engineering Considerations for Massively Parallel Simulations

Ulrich Rüde (ruede@cs.fau.de), Ben Bergen, Frank Hülsemann, Christoph Freundl
Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de
SIAM CSE, February 2005

Outline
- Multigrid on Supercomputers
- Expression Templates: ParExPDE
- Hierarchical Hybrid Grids: HHG
- Lattice Boltzmann Methods
- Current and Future Challenges

Hitachi SR 8000 at the Bavarian Leibniz Supercomputing Centre
- No. 5 in the TOP500 at the time of installation in 2000; replacement with a 60 Tflop/s SGI scheduled for 2006
- 8 processors and 8 GB per node
- Performance: 1344 CPUs (168 nodes x 8), 12 GFlop/s per node, 2016 GFlop/s total
- Linpack: 1645 GFlop/s (82% of theoretical peak)
- Very sensitive to data structures

Part I: Large Scale Elliptic PDEs

General Architecture of ParExPDE (figure)

Performance of Expression Templates
Results in MFlop/s for execution on vectors of length 1,000,000 on a single processor (Intel Pentium 4, 2.4 GHz, gcc 3.3.3).
Expressions:
- daxpy: c = a + k * b
- complex: c = k * a + l * b + m * a * b
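The following is a minimal sketch of the expression-template technique that ParExPDE builds on, assuming nothing about the actual ParExPDE class hierarchy: the expression c = a + k * b is assembled into a lightweight expression object and evaluated in a single fused loop without temporaries. All names (Vec, Scalar, BinExpr) are illustrative.

```cpp
// Minimal expression-template sketch (illustrative names only, not the
// ParExPDE API): the expression c = a + k * b becomes a lightweight
// expression object and is evaluated in one fused loop without temporaries.
#include <cstddef>
#include <vector>

// Expression node: elementwise binary operation on two sub-expressions.
template <class L, class R, class Op>
struct BinExpr {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
};

struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

// A scalar participates as an expression returning the same value everywhere.
struct Scalar {
    double v;
    double operator[](std::size_t) const { return v; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double x = 0.0) : data(n, x) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assigning an arbitrary expression triggers the single evaluation loop.
    template <class E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R>
BinExpr<L, R, Add> operator+(const L& l, const R& r) { return {l, r}; }
template <class L, class R>
BinExpr<L, R, Mul> operator*(const L& l, const R& r) { return {l, r}; }

int main() {
    const std::size_t n = 1000000;
    Vec a(n, 1.0), b(n, 2.0), c(n);
    Scalar k{3.0};
    c = a + k * b;   // daxpy-like expression, evaluated in a single loop
    return 0;
}
```

Because operator[] is inlined through the whole expression tree, the compiler can in principle generate a loop equivalent to a hand-written daxpy; how well this works in practice depends on the optimizer, which is exactly the issue shown for the Hitachi below.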

Implementation of Differential Operators on the Hitachi
Performance problems on the Hitachi: compiler (optimizer) quality limits performance.

Expression type       Simple        Differential operator
Hitachi SR-8000       513 MFlop/s   25 MFlop/s
Pentium 4, 2.4 GHz    407 MFlop/s   918 MFlop/s

Structured vs. Unstructured Grids (gridlib/HHG)
MFlop/s rates for matrix-vector multiplication on one Hitachi node, compared with highly tuned JDS results for sparse matrices (courtesy of G. Wellein, RRZE Erlangen). (figure)
The architecture is very dependent on uniform data structures.

What are Hierarchical Hybrid Grids? (Ben Bergen)
Standard geometric multigrid approach:
- A purely unstructured input grid resolves the geometry of the problem domain.
- Patch-wise regular refinement, applied repeatedly to every cell of the coarse grid, generates nested grid hierarchies that are naturally suitable for geometric multigrid algorithms.
New: modify the storage formats and operations on the grid to reflect the generated regular substructures (a minimal sketch follows after the next slide).

Common Misconceptions
Hierarchical hybrid grids (HHG)
- are not yet another block-structured grid: HHG are more flexible (unstructured, hybrid input grids);
- are not yet another unstructured geometric multigrid package: HHG achieve better performance, because unstructured treatment of regular regions does not improve performance.
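To illustrate the idea of exploiting the regular substructures, here is a hedged sketch: on the structured interior of a regularly refined patch every unknown sees the same discretization, so a single constant stencil can be applied in nested loops instead of storing one sparse-matrix row per unknown. A 7-point stencil and lexicographic indexing are used for brevity; these details are not taken from the HHG code, which works on hybrid (tetrahedral) input grids.

```cpp
// Sketch of the HHG idea for the structured interior of a regularly refined
// patch: every interior unknown sees the same discretization, so a single
// constant stencil is applied in nested loops instead of storing one sparse
// matrix row per unknown. A 7-point stencil and lexicographic indexing are
// placeholders for illustration.
#include <vector>

void apply_interior_stencil(const std::vector<double>& u,
                            std::vector<double>& r,
                            int n, const double s[7]) {
    auto idx = [n](int i, int j, int k) { return (k * n + j) * n + i; };
    for (int k = 1; k < n - 1; ++k)
        for (int j = 1; j < n - 1; ++j)
            for (int i = 1; i < n - 1; ++i)
                r[idx(i, j, k)] =
                      s[0] * u[idx(i, j, k)]
                    + s[1] * u[idx(i - 1, j, k)] + s[2] * u[idx(i + 1, j, k)]
                    + s[3] * u[idx(i, j - 1, k)] + s[4] * u[idx(i, j + 1, k)]
                    + s[5] * u[idx(i, j, k - 1)] + s[6] * u[idx(i, j, k + 1)];
}
```

Loops of this form are what let the structured interiors run at near-structured-grid speed, while the unstructured coarse grid and the lower-dimensional interfaces (faces, edges, vertices) are treated separately.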

Refinement example (figures): input grid; refinement level one; refinement level two; structured interior; edge interior.

Results, Scaling, Efficiency (results by F. Hülsemann)

#CPUs         64         128        256        512        550
DoF x 10^6    1179.48    2359.74    4719.47    9438.94    10139.49
Time (s)      44         44         44         45         48

Poisson equation, Dirichlet boundary conditions
Multigrid FMG(2,2) cycle, 27-point stencil (a sketch of the FMG control flow follows below)
9 cubes per process, refinement level 7 (h = 1/128)
Speedup for the same problem (6 times regularly refined): (figure)
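As a reminder of what FMG(2,2) refers to, the following is a hedged sketch of full multigrid with V(2,2)-cycles, written for the 1D Poisson problem with homogeneous Dirichlet boundaries and weighted Jacobi smoothing. Grid sizes are assumed nested (2^k + 1 points, with f_levels ordered coarsest to finest); the HHG solver itself is 3D with a 27-point stencil, so this only illustrates the cycle structure.

```cpp
// Hedged sketch of FMG(2,2): full multigrid with V(2,2)-cycles for -u'' = f
// in 1D, homogeneous Dirichlet boundaries, weighted Jacobi smoothing.
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One weighted Jacobi sweep (omega = 2/3).
void smooth(Vec& u, const Vec& f, double h) {
    Vec old = u;
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u[i] = old[i] / 3.0
             + (2.0 / 3.0) * 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i]);
}

Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

Vec restrict_fw(const Vec& fine) {          // full weighting
    Vec coarse(fine.size() / 2 + 1, 0.0);
    for (std::size_t i = 1; i + 1 < coarse.size(); ++i)
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i]
                  + 0.25 * fine[2 * i + 1];
    return coarse;
}

Vec prolongate(const Vec& coarse) {         // linear interpolation
    Vec fine(2 * (coarse.size() - 1) + 1, 0.0);
    for (std::size_t i = 0; i < coarse.size(); ++i) fine[2 * i] = coarse[i];
    for (std::size_t i = 1; i < fine.size(); i += 2)
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1]);
    return fine;
}

// V(nu1,nu2)-cycle: nu1 pre- and nu2 post-smoothing steps on every level.
void vcycle(Vec& u, const Vec& f, double h, int nu1, int nu2) {
    for (int s = 0; s < nu1; ++s) smooth(u, f, h);
    if (u.size() > 3) {
        Vec rc = restrict_fw(residual(u, f, h));
        Vec ec(rc.size(), 0.0);
        vcycle(ec, rc, 2.0 * h, nu1, nu2);   // coarse-grid correction
        Vec e = prolongate(ec);
        for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];
    }
    for (int s = 0; s < nu2; ++s) smooth(u, f, h);
}

// Full multigrid: solve on the coarsest level first, interpolate the solution
// up one level at a time, and run one V(2,2)-cycle per level.
Vec fmg(const std::vector<Vec>& f_levels, double h_coarse) {
    Vec u(f_levels.front().size(), 0.0);
    double h = h_coarse;
    vcycle(u, f_levels.front(), h, 2, 2);
    for (std::size_t l = 1; l < f_levels.size(); ++l) {
        u = prolongate(u);
        h *= 0.5;
        vcycle(u, f_levels[l], h, 2, 2);
    }
    return u;
}
```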

Part II: Lattice Boltzmann Methods

Towards Simulating Metal Foams
In collaboration with Carolin Körner, Dept. of Materials Science, University of Erlangen.
Bubble growth, coalescence, collapse, drainage, rheology, etc. are still poorly understood.
Simulation serves as a tool to better understand, control, and optimize the process.

The Stream Step
Move the particle distribution functions along their corresponding velocity vectors.
Time step, cell size, and particle speed are normalized.

The Collide Step
Collisions of the particles during movement: the distributions obtained from streaming are weighted against the equilibrium distributions, depending on the fluid viscosity.
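A minimal sketch of one stream-and-collide step, using the BGK collision operator on a D2Q9 lattice. The code discussed in the talk is a 3D free-surface solver; lattice, grid size, and the relaxation parameter omega below are placeholders for illustration.

```cpp
// Minimal stream-and-collide sketch (BGK collision, D2Q9 lattice);
// lattice, grid size, and omega are placeholders, not the talk's code.
#include <array>
#include <vector>

constexpr int Q = 9;                    // discrete velocities
constexpr int NX = 64, NY = 64;         // grid size (assumed)
constexpr int cx[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
constexpr int cy[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };
constexpr double w[Q] = { 4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                          1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36 };

using Field = std::vector<std::array<double, Q>>;   // f[y * NX + x][q]

// Stream step: move each distribution function along its velocity vector
// (periodic boundaries, for simplicity).
void stream(const Field& src, Field& dst) {
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x)
            for (int q = 0; q < Q; ++q) {
                int xn = (x + cx[q] + NX) % NX;
                int yn = (y + cy[q] + NY) % NY;
                dst[yn * NX + xn][q] = src[y * NX + x][q];
            }
}

// Collide step: relax towards the local equilibrium distribution;
// omega = 1/tau encodes the fluid viscosity.
void collide(Field& f, double omega) {
    for (auto& cell : f) {
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int q = 0; q < Q; ++q) {
            rho += cell[q];
            ux += cell[q] * cx[q];
            uy += cell[q] * cy[q];
        }
        ux /= rho;
        uy /= rho;
        for (int q = 0; q < Q; ++q) {
            double cu = cx[q] * ux + cy[q] * uy;
            double feq = w[q] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu
                                       - 1.5 * (ux * ux + uy * uy));
            cell[q] += omega * (feq - cell[q]);
        }
    }
}

int main() {
    Field f(NX * NY), tmp(NX * NY);
    for (auto& cell : f)
        for (int q = 0; q < Q; ++q) cell[q] = w[q];  // fluid at rest, rho = 1
    const double omega = 1.8;                        // assumed relaxation rate
    for (int step = 0; step < 10; ++step) {
        stream(f, tmp);
        f.swap(tmp);
        collide(f, omega);
    }
    return 0;
}
```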

True Foams with Disjoining Pressure (visualization by Nils Thürey) (figure)

Parallel Implementation (by Thomas Pohl)
Standard LBM code in C:
- excellent performance on a single SR-8000 node
- almost linear speed-up
- larger partitions perform better
Performance on the SR-8000: ca. 30% of peak performance.

Standard LBM Code: Scalability
Largest simulation: 1.08 * 10^9 cells, 370 GByte of memory.
64 MByte to communicate in each time step: efficiency ~ 75%.

Free Surface LBM Code: Parallelizing the Code
Standard LBM: 1 sweep through the grid.
Free surface LBM: 5 sweeps through the grid (cell type changes, creating a closed boundary, initializing changed cells, mass rebalancing).

Free Surface LBM Code: Parallelizing the Code (continued)

                         Standard LBM     Free surface LBM
Sweeps through grid      1                5
Ghost node layers        1 column         4 columns
(a minimal ghost-layer exchange sketch follows after this slide)

Performance
Free surface LBM performance is very poor on a single node.
If-statements: 2.9 (standard LBM) vs. 51 (free surface LBM).
Pentium 4: performance loss ~ 10%.
SR 8000: high loss (the pseudo-vector architecture relies on predictable statements).
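A minimal sketch of the ghost-layer exchange for a slab (1D) domain decomposition, assuming one MPI message per neighbour and time step; the grid dimensions, the number of ghost layers, and the one-double-per-cell payload are placeholders, not the data layout of the actual code.

```cpp
// Ghost-layer exchange sketch for a 1D slab decomposition of an LBM grid.
// Sizes, ghost-layer count, and payload are placeholders for illustration.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nx = 64, ny = 64;        // cells per plane (assumed)
    const int nz_local = 32;           // planes owned by this process
    const int ghosts = 4;              // 1 for standard LBM, 4 for free surface
    const int plane = nx * ny;

    // Local slab including ghost planes at the bottom and the top.
    std::vector<double> cells((nz_local + 2 * ghosts) * plane, double(rank));

    int lower = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int upper = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send the lowest owned planes down and receive the upper ghost planes.
    MPI_Sendrecv(cells.data() + ghosts * plane, ghosts * plane, MPI_DOUBLE,
                 lower, 0,
                 cells.data() + (ghosts + nz_local) * plane, ghosts * plane,
                 MPI_DOUBLE, upper, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send the highest owned planes up and receive the lower ghost planes.
    MPI_Sendrecv(cells.data() + nz_local * plane, ghosts * plane, MPI_DOUBLE,
                 upper, 1,
                 cells.data(), ghosts * plane, MPI_DOUBLE,
                 lower, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```

Using 4 ghost layers instead of 1 multiplies the per-step message volume correspondingly, which adds to the communication cost of the free-surface code.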

Part III: Challenges and Problems

Current Challenge: Parallelism on All Levels and the Memory Wall
- "Parallel computing is easy, good (single) processor performance is difficult" (B. Gropp, Argonne)
- "There has been no significant progress in High Performance Computing over the past 5 years" (H. Simon, NERSC)
- Instruction-level (on-chip) parallelism
- Memory bandwidth and latency are the limiting factors
- Cache-aware algorithms (a minimal blocking sketch follows below)
- Conventional complexity measures (based on operation counts) are becoming increasingly unrealistic
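As a small illustration of what "cache-aware" means here, the following sketch tiles a 5-point stencil sweep so that each tile's working set stays in cache; the tile size and grid layout are assumptions, and real cache-aware multigrid and LBM codes use considerably more elaborate spatial and temporal blocking.

```cpp
// Cache-aware (blocked) stencil sweep sketch: the same 5-point update as a
// naive double loop, but tiled so that neighbouring rows within a tile are
// reused before eviction. Tile size and grid layout are assumptions.
#include <algorithm>
#include <vector>

void blocked_sweep(const std::vector<double>& u, std::vector<double>& v,
                   int nx, int ny, int tile = 64) {
    for (int jb = 1; jb < ny - 1; jb += tile)
        for (int ib = 1; ib < nx - 1; ib += tile)
            // Finish one cache-sized tile completely before moving on.
            for (int j = jb; j < std::min(jb + tile, ny - 1); ++j)
                for (int i = ib; i < std::min(ib + tile, nx - 1); ++i)
                    v[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                            + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
}
```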

Moore's Law in Semiconductor Technology (F. Hossfeld): transistors per die vs. year, 1970-2005; DRAM growth ~52% per year, microprocessor (Intel) growth ~42% per year. (figure)

Information Density and Energy Dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998): atoms per bit and energy per logic operation [pico-joules] vs. year, 1950-2020. (figure)

Conclusions (1)
High performance computing still requires heroic programming, but we are on the way to making supercomputers more generally usable.
Which architecture?
- ASCI type: custom CPU, massively parallel cluster of SMPs
- Earth Simulator type: vector CPU, as many CPUs as affordable
- Hitachi class: modified custom CPU, cluster of SMPs
- Others: BlueGene, Cray X1, multithreading, PIM, reconfigurable, quantum computing, ...
What will come next?

Conclusions (2)
Which grid data structures?
- structured (inflexible)
- unstructured (slow)
- HHG (high development effort: even the prototype is 50 K lines of code)
Where are we going?
- the end of Moore's law
- (almost) nobody builds CPUs for HPC-specific requirements
- petaflops: 100,000 processors needed, and we can hardly handle 1,000
- the memory wall: latency and bandwidth
- "It's the locality, stupid"

Acknowledgements
13 student projects (Bachelor theses): C. Freundl, A. Hausner, N. Thürey, I. Christadler, V. Daum, F. Fleißner, M. Sonntag, J. Wilke, M. Zetlmeisl, S. Donath, K. Iglberger, J. Thies, S. Weigand
5 Master theses (Diplomarbeit): H. Pfänder, N. Thürey, E. Lang, G. Radzom, C. Freundl
11 PhD research projects: M. Kowarschik, M. Mohr, B. Bergen, C. Freundl, T. Pohl, U. Fabricius, N. Thürey, S. Meinlschmidt, P. Kipfer, J. Treibig, H. Köstler
Additional thanks to C. Pflaum and J. Härtlein
Funded by: KONWIHR, DFG, BMBF