Towards PetaScale Computational Science

Size: px

Start display at page:

Download "Towards PetaScale Computational Science"

Sara Hancock
5 years ago
Views:

1 Towards PetaScale Computational Science U. Rüde (LSS Erlangen, joint work with many Lehrstuhl für Informatik 10 (Systemsimulation) Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de 3rd Erlangen HEC-Symposium July 2,

2 Motivation Overview Towards PetaScale and Beyond Scalable Finite Element Solvers Scalable Algorithms: Multigrid Scalable Software Architecture: Hierarchical Hybrid Grids (HHG) Parallel Performance Results on 9170 Core SGI Altix (HLRB-II) Complex Flow Simulations with Lattice Boltzmann Methods Metal Foams as an Example Application Free Surfaces, Moving Objects Parallel Performance Results Conclusions 2

3 Part I Towards PetaScale and Beyond 3

4 Technology as Driver 4

Intel Pentium III 1GHz (~2000) If every person on earth does one

performance (less than a current laptop from Aldi) 1012= 1

, ~ 2000 1015= 1 PetaFlops Green Germ Q? >250 000 Proc. Cores?

5 How much is a PetaFlops? 106 = 1 MegaFlops: Intel MHz PC (~1989) 109 = 1 GigaFlops: Intel Pentium III 1GHz (~2000) If every person on earth does one operation every 6 seconds, all humans together have 1 GigaFlops performance (less than a current laptop from Aldi) 1012= 1 TeraFlops: HLRB-I 1344 Proc., ~ = 1 PetaFlops Green Germ Q? > Proc. Cores?, ~2008? HLRB-I: 2 TFlops If every person on earth runs a 486 PC, we all together have an aggregate Performance of 6 PetaFlops. HLRB-II: 63 TFlops 5

6 The Two Principles of Science Three Theory Mathematical Models, Differential Equations, Newton Experiments Observation and prototypes empirical Sciences Computational Science Simulation, Optimization (quantitative) virtual Reality 6

7 Part II Scalable Algorithms: Multigrid Scalable Data Strcutures Hierarchical Hybrid Grids Scalable Architectures 7

8 What is Multigrid? Has nothing to do with grid computing A general methodology multi - scale (actually it is the original ) many different applications developped in the 1970s -... Useful e.g. for solving elliptic PDEs large sparse systems of equations iterative convergence rate independent of problem size asymptotically optimal complexity -> algorithmic scalability! can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best (maybe the only?) basis for fully scalable FE solvers 8

9 Multigrid: V-Cycle Goal: solve A h u h = f h using a hierarchy of grids Relax on Correct Residual Restrict Interpolate Solve by recursion 9

10 Parallel High Performance FE Multigrid Parallelize plain vanilla multigrid partition domain parallelize all operations on all grids use clever data structures Do not worry (so much) about Coarse Grids idle processors? short messages? sequential dependency in grid hierarchy? Why we do not use Domain Decomposition DD without coarse grid does not scale (algorithmically) and is inefficient for large problems/ many procssors DD with coarse grids is still less efficient than multigrid and is as difficult to parallelize 10

11 Part II Towards Scalable FE Software Performance Results 11

12 #Proc #unkn. x , ,147.5 Ph.1: sec Ph. 2: sec 6.38* 6.67* 6.75* 6.80* 4.92 Time to sol Parallel scalability of scalar elliptic problem discretized by tetrahedral finite elements , , , , , , , , , * 7.39* * 7.75* B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006, also: Is unknowns the largest finite element system that can be solved today?, SuperComputing, Nov Times for 12 V(2,2) cycles on SGI Altix: Itanium GHz. Largest problem solved to date: 3.07 x DOFS on 9170 Procs: 7.8 s per V(2,2) cycle 13

13 So what? With scalable algorithms, well implemented, we can do now (scalar) PDEs with > 10 million unknows on a desktop > 300 billion unknowns on a TOP-10 class machine (HLRB-II, 63 Tflop peak, 40 TeraByte Mem) In the future, we will be able to do around : 5 trillion unknowns on a peak PetaScale machine (assuming 1 PByte memory) around : 50 trillion unknowns on a machine delivering a petaflop for real applications (assuming 10 Pbyte memory) This is e.g. sufficient to resolve all of earth s atmosphere with 10 km grid resolution (current desktop) 250 m mesh (current supercomputer) 100 m mesh (Peak-Peta-Scale system in 2009?) 50 m mesh (Application-Peta-Scale system in 2012?) This is a buidling block for many other applications 14

14 Why Multigrid? A direct elimination banded solver has complexity O(N 2.3 ) for a 3-D problem (N = 3*10 11 ) This becomes = O(10 26 ) operations for our problem size At 100 TeraFlops this would result in a runtime of years which could be reduced to 8 years on a ExaFlops (10 18 Flops) system. We can alternatively use multigrid and compute a solution with a 60 TeraFlops machine in 1.5 minutes. 15

15 An HPC Tutorial! Getting Supercomputer Performance is Easy! If parallel efficiency is bad, choose a slower serial algorithm it is probably easier to parallelize and will make your speedups look much more impressive Introduce the CrunchMe variable for getting high Flops rates advanced method: disguise CrunchMe by using an inefficient (but computeintensive) algorithm from the start Introduce the HitMe variable to get good cache hit rates advanced version: disguise HitMe within clever data structures that introduce a lot of overhead Never cite time-to-solution who cares whether you solve a real life problem anyway it is the MachoFlops that interest the people who pay for your research Never waste your time by trying to use a complicated algorithm in parallel (such as multigrid) the more primitive the algorithm the easier to maximize your MachoFlops. 16

16 Part III HEC Application: Complex and Coupled Flow Simulations with the Lattice Boltzmann Method 17

17 Process Simulation of Foam Production KONWIHR project FreeWIHR joint work with WTM, Dr. C. Körner poorly understood: coalescence collapse, drying, solidification etc. Simulation as tool to understand and control the process 18

18 First Test Breaking Dam 19

19 Rising Bubbles 20

20 Foaming Simulation 1 22

21 Moving Nano Particles in a Liquid K. Iglberger - Master Thesis C. Feichtinger - Diplomarbeit 23

22 Datensatz Pulsating Blood Flow in an Aneurysma Master Thesis Jan Götz Collaboration with Neuro-Radiology (Prof. Dörfler, Dr. Richter) Image Processing Simulation Fluid Mechanics (Prof. Durst) 24

23 Parallelization Standard LBM-Code: Scalability on SR 8000-F1 Largest Simulation: 1,08*10 9 cells 370 GByte memory Communication Cost because of large data volume (64 MByte) Efficiency ~ 75% Dissertation T. Pohl (2007) 25

24 Parallel Performance LSSCluster FujitsuSiemens 26

25 Example application: So What? Flow with moving objects on the nano-scale (suspensions) 1µm lattice resolution, object size 5-10 µm (example: blood cells) We can simulate (with this resolution) on a laptop: 2 Million lattice cells = mm 3 of liquid. on a TOP-10 supercomputer: 80 Billion lattice cells = 80 mm 3 = 0.08 ml of liquid for blood additionally required: 400 Million blood cells on a (peak) peta scale system: 2 Trillion lattice cells = 2 cm 3 = 2 ml of liquid ~ 10 Billion blood cells on an (application) peta scale system: 20 Trillion lattice cells = 20 ml of liquid Is this good for anything? 27

26 Part IV Conclusions 28

27 Acknowledgements Collaborators In Erlangen: WTM, LSE, LSTM, LGDV, LME, RRZE, Neurozentrum, Radiologie, etc. Especially for foams: C. Körner (WTM) International: Utah, Technion, Constanta, Ghent, Boulder, München, Zürich,... Dissertation Projects C. Freundl (Parallel Expression Templates for PDE-solver) K. Iglberger (Flow with moving objects) J. Götz (Blood Flow with LBM) M. Stürmer (MultiCore) T. Gradl (Hierarchical Hybrid Grids) S. Donath (Multiphase flow in microporous media) C. Feichtinger (Flow with nano particles) finished and almost finished: Kowarschik, Mohr, Bergen, Thürey, Fabricius, Köstler, Treibig 20 Diplom- /Master- Thesis Studien- /Bachelor- Thesis Especially for Performance-Analysis/ Optimization for LBM J. Wilke, K. Iglberger, S. Donath, M. Stürmer,... and 23 more KONWIHR, DFG, NATO, BMBF,, (EU - hopefully) Elitenetzwerk Bayern Bavarian Graduate School in Computational Engineering (with TUM, since 2004) Special International PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg) since Jan

28 Fancy Physics User control of flow to generate unphysical fluid behavior 30

29 Talk is Over Questions? 31

Architecture Aware Multigrid

Architecture Aware Multigrid U. Rüde (LSS Erlangen, ruede@cs.fau.de) joint work with D. Ritter, T. Gradl, M. Stürmer, H. Köstler, J. Treibig and many more students Lehrstuhl für Informatik 10 (Systemsimulation)