Simulieren geht über Probieren
1 Simulieren geht über Probieren Ulrich Rüde Lehrstuhl für Informatik 10 (Systemsimulation) Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de Ulm, 17. Mai
2 Overview Motivation Three examples Material science and process technology: Metal Foams Nano Technology Biomedical Technology: The inverse EEG problem High Performance Computing Conclusions 2
3 The Two (now Three) Principles of Science: Experiment (observation and prototypes: the empirical sciences), Theory (mathematical models, differential equations: Newton), and Computational Science (simulation, optimization: a quantitative virtual reality) 3
4 Part IIa Metal Foams In collaboration with the Institut für Werkstoffwissenschaften Lehrstuhl Werkstoffkunde und Technologie der Metalle WTM (R.F. Singer, C. Körner) 4
5 Examples of Foams: glass, ceramics, metals, polymers. Structural properties: stiffness, energy absorption, damping. Functional properties: burners, shock absorbers, heat exchangers, batteries; large, dynamic surface expansion 5
6 Towards Simulating Metal Foams Bubble growth, coalescence, collapse, drainage, rheology, etc. are still poorly understood Simulation as a tool to better understand, control and optimize the process 6
7 The Lattice-Boltzmann Method Real valued representation of particles Discrete velocities and positions Algorithm consists of two steps: Stream Collide 7
8 The Stream Step Move particle distribution functions along corresponding velocity vector Normalized time step, cell size and particle speed 8
9 The Collide Step Accounts for collisions of particles during movement Weigh equilibrium velocities and velocities from streaming depending on fluid viscosity 9
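The two steps above can be sketched as a minimal D2Q9 toy kernel in C. This is an illustration only: the grid size, data layout, periodic boundaries, and the BGK collision with relaxation parameter omega are assumptions made here, not the code discussed in the talk.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define NX 8
#define NY 8
#define Q 9

/* D2Q9 discrete velocities and equilibrium weights */
static const int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
static const int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
static const double w[Q] = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                            1.0/36, 1.0/36, 1.0/36, 1.0/36};

/* Collide: relax every distribution toward the local equilibrium.
   omega = 1/tau is determined by the fluid viscosity. */
void collide(double f[NY][NX][Q], double omega) {
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x) {
            double rho = 0.0, ux = 0.0, uy = 0.0;
            for (int i = 0; i < Q; ++i) {
                rho += f[y][x][i];
                ux  += cx[i] * f[y][x][i];
                uy  += cy[i] * f[y][x][i];
            }
            ux /= rho; uy /= rho;
            double usq = ux * ux + uy * uy;
            for (int i = 0; i < Q; ++i) {
                double cu  = cx[i] * ux + cy[i] * uy;
                double feq = w[i] * rho * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
                f[y][x][i] += omega * (feq - f[y][x][i]);
            }
        }
}

/* Stream: move each distribution one cell along its velocity vector
   (normalized time step and cell size; periodic boundaries for brevity). */
void stream(double f[NY][NX][Q]) {
    static double tmp[NY][NX][Q];
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x)
            for (int i = 0; i < Q; ++i)
                tmp[(y + cy[i] + NY) % NY][(x + cx[i] + NX) % NX][i] = f[y][x][i];
    memcpy(f, tmp, sizeof tmp);
}
```

A fluid initialized at the resting equilibrium (f_i = w_i everywhere) is a fixed point of both steps, which makes a convenient sanity check.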
10 Free surfaces with LBM Metal foams contain huge gas volumes Only simulate and track the fluid motion Compute boundary conditions at the free surface 10
11 Boundary Conditions Problem: Missing distribution functions at interface cells after streaming! Liquid Gas Reconstruction such that macroscopic boundary conditions are satisfied. Körner et al. Lattice Boltzmann Model for Free Surface Flow, to be published in Journal of Computational Physics 11
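A sketch of the reconstruction idea: the missing incoming distribution is rebuilt from equilibria evaluated at the gas density and the interface velocity. The concrete formula below follows the spirit of the Körner et al. scheme but is an illustrative assumption, not a transcription of their code.

```c
#include <assert.h>
#include <math.h>

/* D2Q9 equilibrium distribution for direction (cx_i, cy_i) with weight w_i */
static double feq(double w_i, double cx_i, double cy_i,
                  double rho, double ux, double uy) {
    double cu  = cx_i * ux + cy_i * uy;
    double usq = ux * ux + uy * uy;
    return w_i * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
}

/* After streaming, the distribution arriving from a gas cell along opp(i)
   is unknown. Reconstruct it so the interface sees the gas density rho_g
   and the interface velocity (ux, uy); f_i is the known outgoing value.
   Note opp(i) has the same weight w_i and velocity (-cx_i, -cy_i). */
double reconstruct_missing(double f_i, double w_i,
                           double cx_i, double cy_i,
                           double rho_g, double ux, double uy) {
    return feq(w_i,  cx_i,  cy_i, rho_g, ux, uy)
         + feq(w_i, -cx_i, -cy_i, rho_g, ux, uy)
         - f_i;
}
```

At rest (u = 0, rho_g = 1) the reconstruction returns the equilibrium value w_i, as it should.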
12 Curvature calculation (version I) Alternative approaches: Integrate normals over surface (weighted triangles) Level set methods (track surface as implicit function) 12
13 Free surface flow: Breaking Dam [Video] 13
14 Visualization Ray-tracing Refraction Reflection Caustics About 15 Min per frame = 1 day for 4 secs About same compute time as flow simulation 14
15 Rising Bubbles [Video] 15
16 More Rising Bubbles [Video] 16
17 Simulation Verification by Experiment [Video] Simulation and Experiment: Diplomarbeit N. Thürey 17
18 Verification for bubble dynamics (C. Körner). Stokes law: climbing rate of a bubble exposed to gravity (ideal bubble, no boundaries, equilibrium state). [Video and plot of climb rate vs. distance in l.u.; velocity axis 0.000 to 0.005] Parameters: R = 8, τ = 0.74, g = 10⁻⁴, σ = 2·10⁻². The relative error is a function of the system size. 18
19 True Foams with Disjoining Pressure [Videos] 19
20 Data Set: Pulsating Blood Flow at an Aneurysm. CE Elite Master Thesis: Jan Götz [Video] 20
21 Data Set: Pulsating Blood Flow at an Aneurysm. Master Thesis: Jan Götz. In collaboration with Neuroradiology (Prof. Dörfler): image processing, simulation, fluid mechanics [Video] 21
22 Part III High Performance Computing 22
23 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? What is the speed of the fastest computer currently available (and where is it located)? What was the speed of the fastest computer in 1995? 2000? 2005? 23
24 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? Probably between 1 and 6.5 GFlops 24
25 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? What is the speed of the fastest computer currently available (and where is it located)? What was the speed of the fastest computer in 1995? 2000? 2005? 25
26 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? What is the speed of the fastest computer currently available (and where is it located)? 367 Tflops; it is a Blue Gene/L in Livermore, California, with > processors 26
27 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? What is the speed of the fastest computer currently available (and where is it located)? What was the speed of the fastest computer in 1995? 2000? 2005? 27
28 A little quiz... 1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹², 1 Pflops = 10¹⁵ floating point operations per second. What is the speed of your PC? What is the speed of the fastest computer currently available (and where is it located)? What was the speed of the fastest computer in 1995? 2000? 2005? … TFlops, 12.3 TFlops, 367 TFlops. ... and how much has the speed of cars/airplanes/... improved in the same time? Additional question: when do you expect computers to exceed 1 PFlops? 28
29 LSS-Cluster (Fujitsu-Siemens): Compute nodes (8×4 CPUs); CPU: AMD Opteron GHz, max. 4.4 GFlops; RAM: 16 GByte. Interactive nodes (9×2 CPUs); CPU: AMD Opteron 248. High-speed network: InfiniBand, 10 GBit/s 29
30 Architecture example: Our Pet Dinosaur. Hitachi SR 8000 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften; 8 processors and 8 GB per node. Performance: 1344 CPUs (168×8), 12 GFlop/node, 2016 GFlop total; Linpack: 1645 GFlop (82% of theoretical peak). Very sensitive to data structures. (No. 5 at time of installation in 2000.) To be replaced by a 6000-processor SGI in 1H 2006; upgrade to >70 TFlop in 2007 30
33 Supercomputer Performance TOP 500 [Chart] 33
34 Moore's Law in Semiconductor Technology (F. Hossfeld). [Chart: transistors per die vs. year, from 1K to 1G, for DRAM and Intel microprocessors (Pentium, Pentium Pro, Merced); growth rates of 52% and 42% per year]
35 Semiconductor Technology. [Chart: information density (atoms per bit) and energy dissipation (energy per logic operation, in pico-joules, down to kT) vs. year] Information Density & Energy Dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998) 35
36 Parallelization of the LBM Code. Standard LBM code in C (1-D partitioning): excellent performance on a single SR8000 node; almost linear speed-up; large partitions favorable. Performance on the SR8000: ca. 30% of peak performance 36
37 Parallelization, Standard LBM code: Scalability. Largest simulation: 1.08×10⁹ cells, 370 GByte memory. Communication cost is high because of the large data volume (64 MByte). Efficiency ~75%. Dissertation T. Pohl (2006) 37
38 Parallelization Free surface LBM-Code Standard LBM 1 sweep through grid Free surface LBM 5 sweeps through grid Cell type changes, Closed boundary for bubbles, Initialization of modified cells, Mass balance correction 38
39 Parallelization Free surface LBM-Code: Standard LBM 1 sweep through grid 1 row of ghost nodes Free surface LBM 5 sweeps through grid 4 rows of ghost nodes 39
40 Performance: Standard LBM code vs. free-surface LBM code. Performance is lousy on a single node! Conditionals: 2.9 (standard LBM) vs. 51 (free-surface LBM). Pentium 4: almost no degradation (~10%). SR 8000: enormous degradation (pseudo-vectorization requires predictable jumps) 40
41 Structured vs. Unstructured Grids (on Hitachi SR 8000). [Chart: MFlops rates for matrix-vector multiplication vs. number of unknowns (up to 2,146,689), comparing stencil-based gridlib/hhg on one node with highly tuned JDS results for sparse matrices] (courtesy of G. Wellein, RRZE Erlangen) 41
42 Refinement example Input Grid 42
43 Refinement example Refinement Level one 43
44 Refinement example Refinement Level Two 44
45 Refinement example Structured Interior 45
46 Refinement example Structured Interior 46
47 Refinement example Edge Interior 47
48 Refinement example Edge Interior 48
49 HHG: Parallel Scalability. [Table: #Procs (from 64), #DOFs ×10⁶, #elements ×10⁶, #input elements, GFLOP/s, and time in seconds; numeric columns garbled in transcription] Parallel scalability of a Poisson problem discretized by tetrahedral finite elements. Machine: SGI Altix (Itanium). B. Bergen, F. Hülsemann, U. Rüde: Is 1.7·10¹⁰ unknowns the largest finite element system that can be solved today? In SuperComputing, Nov
50 Conclusions (1): High performance simulation still requires heroic programming, but we are on the way to making supercomputers more generally usable. Parallel programming is easy, node performance is difficult (B. Gropp). Which architecture? ASCI-type: custom CPU, massively parallel cluster of SMPs; nobody has been able to show that these machines scale efficiently, except on a few very special applications and with enormous human effort. Earth-Simulator-type: vector CPU, as many CPUs as affordable; impressive performance on vectorizable code, but this needs checking with more demanding data and algorithm structures. Hitachi class: modified custom CPU, cluster of SMPs; excellent performance on some codes, but unexpected slowdowns on others; too exotic to have a sufficiently large software base. Others: BlueGene, Cray X1, multithreading, PIM, reconfigurable, quantum computing, ... 50
51 Conclusions (2): Which data structures? Structured (inflexible), unstructured (slow), HHG (high development effort: even the prototype is 50 K lines of code), meshless (useful in niches). Where are we going? The end of Moore's law; nobody builds CPUs with HPC-specific requirements high on the list of priorities; petaflops means 100,000 processors, and we can hardly handle 1000. It's the locality, stupid! The memory wall: latency and bandwidth. Distinguish between algorithms where the control flow is data independent (latency-hiding techniques such as pipelining and prefetching can help) and those where it is data dependent. 51
52 In the Future? What's beyond Moore's Law? 52
53 Part VI Outlook: Other applications 3D-Animation Computational Steering Real-Time Simulation 53
54 Near-Real-Time Free-Surface LBM (N. Thürey) [Video] 54
55 Free-Surface LBM with Adaptive Refinement (N. Thürey) [Video] High-resolution animations. Adaptive refinement/coarsening. Visualization with a raytracer. Fluid simulation in Blender 2.4 ( ). Blender: 3D modeling program, freely available: 55
56 Acknowledgements. Collaborators in Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc.; especially for foams: C. Körner (WTM). International: Utah, Technion, Constanta, Ghent, Boulder, ... Dissertation projects: U. Fabricius (AMG methods and software engineering for parallelization), C. Freundl (parallel expression templates for PDE solvers), J. Härtlein (expression templates for FE applications), N. Thürey (LBM, free surfaces), T. Pohl (parallel LBM), ... and 6 more; 19 Diplom/Master theses; Studien/Bachelor theses. Especially for performance analysis and optimization of the LBM: J. Wilke, K. Iglberger, S. Donath ... and 23 more. Funding: KONWIHR, DFG, NATO, BMBF, Elitenetzwerk Bayern. Bavarian Graduate School in Computational Engineering (with TUM, since 2004). Special international PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), to start Jan 56
57 Talk is Over [Video] Please wake up! 57
58 The Lattice-Boltzmann Method Based on cellular automata Introduced by von Neumann around 1940 Famous: Conway s Game of Life Complex system with simple rules Regular grid Local rules specifying time evolution Intrinsically parallel for model & simulation, similar to elliptic PDE solvers 58
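Conway's Game of Life, named above, makes the cellular-automaton pattern concrete: a regular grid, simple local rules for the time evolution, and intrinsic parallelism. A minimal C sketch (the grid size and periodic boundaries are choices made here for brevity):

```c
#include <assert.h>
#include <string.h>

#define N 8   /* small periodic grid for illustration */

/* One time step of Conway's Game of Life: a cell is alive in the next
   generation if it has exactly 3 live neighbors, or if it is alive and
   has exactly 2. Every cell is updated from purely local information. */
void life_step(int g[N][N]) {
    int next[N][N];
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x) {
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx || dy)
                        n += g[(y + dy + N) % N][(x + dx + N) % N];
            next[y][x] = (n == 3) || (g[y][x] && n == 2);
        }
    memcpy(g, next, sizeof next);
}
```

As with the LBM, the whole grid can be updated in parallel, since each cell reads only its neighbors' old states.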
59 The Lattice-Boltzmann Method Weakly compressible approximation of the Navier-Stokes equations Easy implementation Applicable for small Mach numbers (< 0.1) Easy to adapt, e.g. for Complicated or time-varying geometries Free surfaces Additional physical and chemical effects 59
60 LBM Demonstration (Java applet) file:///users/ruede/doc/lehr/vorles/ws03/hppt/lbm/jlb-comp/start.html 60
61 Free surface implementation Before stream step, compute mass exchange across cell boundaries for interface cells Calculate bubble volumes and pressure Surface curvature for surface tension Change topology if interface cells become full or empty keep layer of interface cells closed 61
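The mass-exchange computation above can be illustrated for a single link between an interface cell and one neighbor. The fill-level weighting shown is a plausible sketch of the idea (partially filled cells exchange proportionally less mass), not necessarily the exact scheme used in the talk's code.

```c
#include <assert.h>

/* Per-link mass exchange for an interface cell (hedged sketch):
   f_in_from_neighbor : neighbor's distribution streaming toward this cell
   f_out_to_neighbor  : this cell's distribution streaming toward the neighbor
   fill_here, fill_neighbor : fill levels in [0, 1] of the two cells
   Returns the signed mass gained by this cell along this link. */
double mass_exchange(double f_in_from_neighbor, double f_out_to_neighbor,
                     double fill_here, double fill_neighbor) {
    double k = 0.5 * (fill_here + fill_neighbor);
    return k * (f_in_from_neighbor - f_out_to_neighbor);
}
```

Summing this quantity over all links of an interface cell before the stream step tracks its mass; when the accumulated fill level reaches 1 or 0, the cell converts to fluid or gas, which is exactly the topology change described above.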
62 Surface Tension (Vers. 2). Marching-cube surface triangulation; compute a curvature for each triangle. [Equation and figure garbled in transcription: curvature obtained from the triangle normals n₁, n₂, n₃ and the area change δA/δV over the triangulated surface] Associate with each LBM cell the average curvature of its triangles. Complicated, but beats level sets for our applications (mass conservation). 62
63 Nano Technology. Curved boundaries: particles approximated with spheres. Improve the accuracy of LBM simulations by using curved boundary conditions. Standard no-slip: reflect DFs at the cell boundary. More accurate: take the distance to the boundary surface into account, then interpolate DFs accordingly 63
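The "take the distance into account" idea can be sketched as a linear interpolated bounce-back in the spirit of Bouzidi et al.; the slide does not name a concrete scheme, so the formula below is an assumption for illustration.

```c
#include <assert.h>
#include <math.h>

/* Linear interpolated bounce-back (Bouzidi-style sketch).
   q in (0, 1] is the fraction of the link between the cell center and the
   wall along direction i.
   f_c   : post-collision distribution at the boundary cell, direction i
   f_up  : direction i at the upstream neighbor (x - c_i)
   f_opp : post-collision distribution at the boundary cell, direction opp(i)
   Returns the reflected distribution f_opp at the boundary cell. */
double bounce_back_q(double q, double f_c, double f_up, double f_opp) {
    if (q < 0.5)
        return 2.0 * q * f_c + (1.0 - 2.0 * q) * f_up;
    return f_c / (2.0 * q) + (2.0 * q - 1.0) / (2.0 * q) * f_opp;
}
```

Both branches agree at q = 0.5, where the scheme reduces to the standard mid-link bounce-back that reflects the distribution unchanged.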
64 What are hierarchical hybrid grids? Standard geometric multigrid approach: Purely unstructured input grid resolves geometry of problem domain Patch-wise regular refinement applied repeatedly to every cell of the coarse grid generates nested grid hierarchies naturally suitable for geometric multigrid algorithms New: Modify storage formats and operations on the grid to reflect the generated regular substructures 64
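The patch-wise regular refinement has a simple consequence for problem sizes: each level splits every tetrahedron into 8 children, so element counts grow by a factor of 8 per level over the unstructured input grid. A trivial C helper makes this concrete (the function name is ours):

```c
#include <assert.h>

/* Number of tetrahedra after `levels` rounds of patch-wise regular
   refinement of an input grid: each level multiplies the count by 8. */
long long elements_after_refinement(long long input_elements, int levels) {
    long long e = input_elements;
    for (int k = 0; k < levels; ++k)
        e *= 8;   /* regular refinement: 1 tet -> 8 tets */
    return e;
}
```

This is why a modest unstructured input grid reaches billions of elements after only a handful of levels, while the regular substructure inside each coarse cell keeps the storage and operations structured.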
65 What's the new thing here? Hierarchical hybrid grids (HHG) are not yet another block-structured grid: HHG are more flexible (unstructured, hybrid input grids). They are not yet another unstructured geometric multigrid package: HHG achieve better performance; an unstructured treatment of regular regions does not improve performance. 65
66 Simulation is performance-hungry and memory-intensive: parallel supercomputing is required 66
67 Current Challenge: Parallelism on all levels and The Memory Wall Parallel computing is easy, good (single) processor performance is difficult (B. Gropp, Argonne) There has been no significant progress in High Performance Computing over the past 5 years (H. Simon, NERSC) Instruction level parallelism Memory bandwidth and latency are the limiting factors Cache-aware algorithms Conventional complexity measures (based on operation count) are becoming increasingly unrealistic 67
68 LSS Cluster-Computer: Fujitsu-Siemens HPC Line. Programming methods: cache optimization, C++ expression templates, (parallel) algorithms. Cooperations in material sciences; mechanical, electrical, and chemical engineering; medical technology ... 68
SIAM PP 2014 High Scalability of Lattice Boltzmann Simulations with Turbulence Models using Heterogeneous Clusters C. Riesinger, A. Bakhtiari, M. Schreiber Technische Universität München February 20, 2014
More informationOptimization of HOM Couplers using Time Domain Schemes
Optimization of HOM Couplers using Time Domain Schemes Workshop on HOM Damping in Superconducting RF Cavities Carsten Potratz Universität Rostock October 11, 2010 10/11/2010 2009 UNIVERSITÄT ROSTOCK FAKULTÄT
More informationGPU Cluster Computing for FEM
GPU Cluster Computing for FEM Dominik Göddeke Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek Angewandte Mathematik und Numerik TU Dortmund, Germany dominik.goeddeke@math.tu-dortmund.de GPU Computing
More informationFree Surface Flow Simulations
Free Surface Flow Simulations Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd. United Kingdom 11/Jan/2005 Free Surface Flow Simulations p.1/26 Outline Objective Present two numerical modelling approaches for
More informationShape of Things to Come: Next-Gen Physics Deep Dive
Shape of Things to Come: Next-Gen Physics Deep Dive Jean Pierre Bordes NVIDIA Corporation Free PhysX on CUDA PhysX by NVIDIA since March 2008 PhysX on CUDA available: August 2008 GPU PhysX in Games Physical
More informationNVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film
NVIDIA Interacting with Particle Simulation in Maya using CUDA & Maximus Wil Braithwaite NVIDIA Applied Engineering Digital Film Some particle milestones FX Rendering Physics 1982 - First CG particle FX
More informationHigh Performance Computing for PDE Towards Petascale Computing
High Performance Computing for PDE Towards Petascale Computing S. Turek, D. Göddeke with support by: Chr. Becker, S. Buijssen, M. Grajewski, H. Wobker Institut für Angewandte Mathematik, Univ. Dortmund
More informationNetwork Bandwidth & Minimum Efficient Problem Size
Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel
More informationEfficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike
More informationPerformances and Tuning for Designing a Fast Parallel Hemodynamic Simulator. Bilel Hadri
Performances and Tuning for Designing a Fast Parallel Hemodynamic Simulator Bilel Hadri University of Tennessee Innovative Computing Laboratory Collaboration: Dr Marc Garbey, University of Houston, Department
More informationHigh Performance Computing
High Performance Computing ADVANCED SCIENTIFIC COMPUTING Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich
More informationFRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG. Lehrstuhl für Informatik 10 (Systemsimulation)
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl für Informatik 10 (Systemsimulation) walberla: Visualization of Fluid
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationTowards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers
Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationPrinciples and Goals
Simulations of foam rheology Simon Cox Principles and Goals Use knowledge of foam structure to predict response when experiments are not easy to isolate causes of certain effects to save time in designing
More informationWebinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos)
Webinar series A Centre of Excellence in Computational Biomedicine Webinar #3 Lattice Boltzmann method for CompBioMed (incl. Palabos) 19 March 2018 The webinar will start at 12pm CET / 11am GMT Dr Jonas
More informationcomputational Fluid Dynamics - Prof. V. Esfahanian
Three boards categories: Experimental Theoretical Computational Crucial to know all three: Each has their advantages and disadvantages. Require validation and verification. School of Mechanical Engineering
More informationReal Application Performance and Beyond
Real Application Performance and Beyond Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400 Fax: 408-970-3403 http://www.mellanox.com Scientists, engineers and analysts
More information1.2 Numerical Solutions of Flow Problems
1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationAdarsh Krishnamurthy (cs184-bb) Bela Stepanova (cs184-bs)
OBJECTIVE FLUID SIMULATIONS Adarsh Krishnamurthy (cs184-bb) Bela Stepanova (cs184-bs) The basic objective of the project is the implementation of the paper Stable Fluids (Jos Stam, SIGGRAPH 99). The final
More informationKommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen
Kommunikations- und Optimierungsaspekte paralleler Programmiermodelle auf hybriden HPC-Plattformen Rolf Rabenseifner rabenseifner@hlrs.de Universität Stuttgart, Höchstleistungsrechenzentrum Stuttgart (HLRS)
More informationHPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)
HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access
More informationLattice Boltzmann with CUDA
Lattice Boltzmann with CUDA Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Overview of LBM An usage of LBM Algorithm Implementation in CUDA and Optimization
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationHigh-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers July 14, 1997 J Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov) Jet Propulsion Laboratory California Institute of Technology
More informationA Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids
A Scalable GPU-Based Compressible Fluid Flow Solver for Unstructured Grids Patrice Castonguay and Antony Jameson Aerospace Computing Lab, Stanford University GTC Asia, Beijing, China December 15 th, 2011
More informationAnimation of Fluids. Animating Fluid is Hard
Animation of Fluids Animating Fluid is Hard Too complex to animate by hand Surface is changing very quickly Lots of small details In short, a nightmare! Need automatic simulations AdHoc Methods Some simple
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationUnstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications
Unstructured Mesh Generation for Implicit Moving Geometries and Level Set Applications Per-Olof Persson (persson@mit.edu) Department of Mathematics Massachusetts Institute of Technology http://www.mit.edu/
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationPhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.
Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences
More information