Framework Middleware Bridging Large-apps and Hardware


1 Framework Middleware Bridging Large-apps and Hardware Zeyao Mo Institute of Applied Physics and Computational Mathematics CAEP Software Center for Numerical Simulation CoDesign 2014, Guangzhou

2 CoDesign: Bottleneck - Programming. [Chart: peak performance growth from 1.0 TF to 100 PF across the XX-V, XX-VI, TH-1, and TH-2 systems.] Application software faces a Cooperation Wall (real applications: numbers of researchers, groups, and LOC seriously increase) and an Efficiency Wall (performance, programming, power, reliability).

3 CoDesign: Bottleneck - Programming. Cooperation Wall, Efficiency Wall: are general parallel programming models sufficient? NOT enough, now or in the future.

4 General Models: Nested/Hybrid Programming Stacks. Heterogeneous models: GPU-CUDA, MIC offload, OpenCL, etc.

5 General Models: Parallel Programming Challenges. Reconstruction of 18 years of MPI legacy codes; new generation codes for multi-physics complex systems. Big and increasing gaps for realization. Nested/hybrid parallel programming models: MPI O(10K), NUMA-aware O(10), OpenMP O(100), ILP O(1); heterogeneous models: GPU-CUDA, MIC offload, etc.

6 Big and increasing gaps for realization: nested/hybrid parallel programming models and their evolution; complex management of data structures and irregular communication; multilevel/hybrid load imbalance arising from physics, stencils, communication, run-time, etc.; parallel implementation of fast numerical algorithms; component-based extensibility required by complex applications where LOC scales up to 100K; large scale data management and visualization.

7 Great Challenges of nested/hybrid parallel programming: load imbalance, fast numerical algorithms, code complexity, preprocessing/visualization APIs. Good solution: domain-specific frameworks capturing what scientific computing and engineering applications have in common (Application A → parallel code A, ..., Application N → parallel code N).

8 Framework accelerates application software: encapsulates and separates parallel computing from applications; significantly simplifies the introduction of emerging parallel programming models; applies component-based software engineering to high performance numerical simulation; preserves rewriting effort across multiple years even as the machine changes rapidly.

9 Make applications Think Parallel, Write Sequential. "Once framework in place, ratio of science experts vs. parallel experts: 10:1; physics added as serial code, now and in the future." -- M. Heroux, LANL, July 2011, DOE Workshop on Exascale Programming Challenges

10 2. Framework-based Parallel Programming Models for Numerical Simulation. Think Parallel, Write Sequential

11 Numerical Simulation Application Programming: Think Parallel, Write Sequential. Framework: what are the fundamentals? Nested/hybrid emerging parallel programming models: MPI O(10K), NUMA-aware O(10), OpenMP O(100), ILP O(1).

12 2.1 Framework: what are the fundamentals? [Diagram: physical models, discrete stencils, and numerical algorithms in application codes extract data dependencies; these form data structures, computational patterns, communications, and load balancing, which promote and are supported by the computers.]

13 2.1 Framework: what are the fundamentals? [Diagram: special models, stencils, and algorithms are separated from a library of common models, stencils, and algorithms; the common part extracts data dependencies that form data structures, computational patterns, communications, and load balancing, promoting and supporting the computers.]

14 2.1 Framework: what are the fundamentals? [Diagram: application codes separate special models/stencils/algorithms, parallelized automatically, from a library of common models, stencils, and algorithms exposed through parallel programming interfaces over data dependencies, data structures, computational patterns, communications, and load balancing.] Framework: middleware for large scale scientific and engineering computing on meshes.

15 2.2 Framework: Separate Parallel Programming. Computational patterns: graph based (halo exchanges, collectives); digraph based (data driven, task scheduling).

16 Halo Exchange Pattern: Separate Parallel Programming

17 Halo Exchange Pattern: Separate Parallel Programming

18 Halo Exchange Pattern: Separate Parallel Programming. Do MPI message passing to fill ghost cells for uo;
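A minimal sketch of what that step does, assuming a 1-D decomposition with one ghost cell per side and reusing the variable name uo from the slide (none of this is JASMIN's actual API): each rank exchanges its boundary cells with its neighbors so the user's serial stencil can read valid ghost values.

```cpp
// Hedged sketch: 1-D halo exchange filling ghost cells for uo.
// Assumed layout: uo = [left ghost | interior cells | right ghost].
#include <mpi.h>
#include <vector>

void fill_ghost_cells(std::vector<double>& uo, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int n = static_cast<int>(uo.size());
    const int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    const int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send first interior cell left; receive right ghost from the right.
    MPI_Sendrecv(&uo[1],     1, MPI_DOUBLE, left,  0,
                 &uo[n - 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    // Send last interior cell right; receive left ghost from the left.
    MPI_Sendrecv(&uo[n - 2], 1, MPI_DOUBLE, right, 1,
                 &uo[0],     1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```

In the framework approach this call disappears from user code: the user only declares which variables carry ghost cells, and the framework issues the exchange.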

19 Halo Exchange Pattern: Separate Parallel Programming User Part

20 Halo Exchange Pattern: Separate Parallel Programming User Part

21 Halo Exchange Pattern: Separate Parallel Programming. Framework part: a component-based C++ strategy class implementation calls back into the user part.
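One way to read "component-based C++ strategy class": the framework owns the loop over patches and the halo exchange, and calls back into a user-supplied serial kernel. A hedged sketch under that assumption (class and method names are invented for illustration, not JASMIN's interface):

```cpp
// Hypothetical strategy interface: the user writes serial per-patch numerics;
// the framework part drives communication and the loop over local patches.
#include <vector>

struct Patch { std::vector<double> data; };  // framework-owned mesh patch

class PatchStrategy {                        // user part
public:
    virtual ~PatchStrategy() = default;
    // Serial kernel on one patch; ghost cells are already filled.
    virtual void computeOnPatch(Patch& patch, double dt) = 0;
};

class Integrator {                           // framework part
public:
    explicit Integrator(PatchStrategy& user) : user_(user) {}
    void advance(std::vector<Patch>& local_patches, double dt) {
        fillGhostCells(local_patches);       // hidden MPI halo exchange
        for (Patch& p : local_patches)
            user_.computeOnPatch(p, dt);     // user's sequential code
    }
private:
    void fillGhostCells(std::vector<Patch>&) { /* MPI exchange lives here */ }
    PatchStrategy& user_;
};
```

The separation means a change of communication strategy, or of machine, touches only the framework part.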

22 Halo Exchange Pattern: Separate Parallel Programming. Serial performance is also significantly improved: molecular simulation, from 3.5% to 32.3% of peak; radiation and neutron transport: doubled.

23 3. JASMIN framework and its applications. Think Parallel, Write Sequential

24 3.1 JASMIN: J parallel Adaptive Structured Mesh INfrastructure (cn/jasmin, 2010SR). Application areas: inertial confinement fusion, global climate modeling, particle simulation, structured grid, unstructured grid.

25 3.1 JASMIN. Motivations: supports the development of parallel codes for large scale numerical simulation on personal computers; hides parallel programming using millions of CPU cores; integrates efficient implementations of parallel fast algorithms; provides efficient data structures and solver libraries; supports component-based software engineering for code extensibility.

26 3.2 Key factors of JASMIN: data structures, fast algorithms, parallelization, component interfaces.

27 3.2.1 Data Structures: minimize data movement via domain-specific mesh partitioning. [Figure: a mesh partitioned into 2 domains, 5 regions, and 16 patches, mapped onto MPI, NUMA-aware, OpenMP, and SIMD levels down to the points per patch.]
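A rough sketch of that partitioning hierarchy as data structures (names invented for illustration; JASMIN's real classes are richer): domains as the unit of MPI distribution, regions for NUMA-aware placement, and patches as the unit the user's serial kernels see.

```cpp
// Hypothetical nested partitioning: domain -> region -> patch -> points.
#include <vector>

struct Box { int lo[3], hi[3]; };   // index range of cells in a patch

struct Patch {                      // unit of user computation (OpenMP/SIMD inside)
    Box box;                        // cells owned by this patch
    int owner_rank;                 // MPI rank holding the data
    int ghost_width;                // ghost-cell width for halo exchange
};

struct Region {                     // unit for NUMA-aware placement
    std::vector<Patch> patches;
};

struct Domain {                     // unit for MPI distribution
    std::vector<Region> regions;
};
```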

28 Mesh-supported Data Structures

29 3.2.1 Data Structures. [Figure: supported mesh types: multiblock with physical variables, sliding multiblock, particles, SAMR, 3-D contact.] Various applications: fusion, CFD, climate, electromagnetism, material science, etc.

30 3.2.2 Communications. Distributed undirected graph (2 processors). Graph based patterns: halo exchanges, collectives.

31 3.2.2 Communications. Distributed directed graph (digraph): 8 angles, parallel sweeping for Sn transport. Digraph based patterns: data driven, task scheduling.
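A toy sketch of the data-driven pattern behind such sweeps (the task granularity and names are assumptions, not JASMIN's scheduler): each patch-angle task fires once its upwind dependencies in the digraph are complete, so execution is a topological traversal of the task graph.

```cpp
// Hedged sketch: data-driven scheduling over a task digraph.
#include <cstddef>
#include <queue>
#include <vector>

struct TaskGraph {
    std::vector<std::vector<int>> downstream;  // edge u -> v: v depends on u
    std::vector<int> indegree;                 // unmet upstream dependencies
};

void sweep(TaskGraph& g, void (*execute)(int task)) {
    std::queue<int> ready;                     // tasks with no unmet deps
    for (std::size_t t = 0; t < g.indegree.size(); ++t)
        if (g.indegree[t] == 0) ready.push(static_cast<int>(t));

    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        execute(t);                            // e.g., sweep one patch/angle
        for (int v : g.downstream[t])          // release downstream tasks
            if (--g.indegree[v] == 0) ready.push(v);
    }
}
```

In a distributed setting the ready queue is per rank and completions arrive as messages, which is what makes this pattern communication middleware rather than a simple loop.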

32 3.2.2 Communications (Hierarchical scheduling)

33 3.2.3 Dynamic load balancing: global atmosphere model scaling across CPU cores; regional rainstorm model with extreme load imbalance; radiation and neutron transport.
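For flavor, a minimal weighted-repartitioning heuristic of the kind dynamic load balancing needs (a greedy largest-processing-time sketch; the cost model and names are assumptions, not JASMIN's algorithm):

```cpp
// Hedged sketch: assign weighted patches to ranks, heaviest first,
// each to the currently least-loaded rank (LPT heuristic).
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> balance(const std::vector<double>& patch_cost, int nranks) {
    std::vector<int> order(patch_cost.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return patch_cost[a] > patch_cost[b]; });

    using Load = std::pair<double, int>;       // (current load, rank)
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
    for (int r = 0; r < nranks; ++r) ranks.push({0.0, r});

    std::vector<int> owner(patch_cost.size());
    for (int p : order) {
        auto [load, r] = ranks.top(); ranks.pop();
        owner[p] = r;                          // heaviest patch -> lightest rank
        ranks.push({load + patch_cost[p], r});
    }
    return owner;
}
```

Real cases like the rainstorm model also have to weigh migration cost and locality, not just load.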

34 3.2.4 Solvers for linear systems. Radiation hydrodynamics: usual algorithms O(N ); PAMG+DDM O(N log N). Data structures and solver interfaces: JXP, Kinsol, Hypre.
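The slide implies a thin, backend-neutral layer between the framework's data structures and external solver libraries. A hedged sketch of what such an interface could look like (illustrative only, not JASMIN's classes):

```cpp
// Hypothetical solver interface; concrete adapters would wrap external
// libraries (e.g., a Hypre AMG-preconditioned Krylov solver) so the
// application picks a backend without touching its numerics code.
#include <vector>

class LinearSolver {
public:
    virtual ~LinearSolver() = default;
    // Solve A x = b on framework-distributed data; true on convergence.
    virtual bool solve(const std::vector<double>& b,
                       std::vector<double>& x,
                       double rel_tol) = 0;
};
```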

35 3.2.5 SAMR h-adaptivity: advance-estimate-tag-refine-synchronize. Coarser level integrates (user provides stencil); error estimation (user provides); tag cells for refinement; cluster tagged cells into boxes; generate patches for the finer level; finer level is initialized and integrates; synchronization.
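The same cycle in sketch form (all names illustrative; the user-supplied pieces are the integrate and estimate callbacks, everything else is the framework's job):

```cpp
// Hedged sketch of one advance-estimate-tag-refine-synchronize cycle.
#include <vector>

struct Box   { int lo[3], hi[3]; };
struct Level { std::vector<Box> patches; };

// Assumed user-supplied callbacks (declarations only in this sketch).
void integrate(Level& level, double dt);               // advance one level
std::vector<int> estimateAndTag(const Level& level);   // flagged cell ids

// Assumed framework-supplied pieces (declarations only in this sketch).
std::vector<Box> clusterTaggedCells(const std::vector<int>& tags);
Level makeFinerLevel(const std::vector<Box>& boxes);   // init from coarse data
void synchronize(Level& coarse, Level& fine);          // coarse-fine transfer

void samrCycle(Level& coarse, double dt) {
    integrate(coarse, dt);                    // coarser level integrates
    auto tags  = estimateAndTag(coarse);      // error estimation (user)
    auto boxes = clusterTaggedCells(tags);    // cluster tagged cells into boxes
    Level fine = makeFinerLevel(boxes);       // generate finer-level patches
    integrate(fine, dt / 2);                  // finer level is initialized
    integrate(fine, dt / 2);                  //   and integrates (2 substeps)
    synchronize(coarse, fine);                // synchronization
}
```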

36 3.3 Current Version (JASMIN V. 3.0). User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc. User interfaces: component-based parallel programming models (C++ classes). Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc. HPC implementations (tens of thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc. Architecture: multilayered, modularized, object-oriented. Codes: C++/C/F90/F77, MPI + OpenMP, 660K LOC. Installation: personal computers, clusters, MPP.

37 Three frameworks for scientific and engineering computing: JASMIN (structured grid), JAUMIN (unstructured grid), JCOGIN (composite geometry).

38 3.3 Current: 40 application codes, 1,500 KLOC. Domains: HEDP, nuclear/energy materials, laser fusion, electromagnetism, hydrodynamics, climate forecasting, earth environment, mechanics, fission energy.

39 3.4 Some applications on the petaflops system TianHe-1A (simulations run several hours to tens of hours):

Code      # CPU cores    Code               # CPU cores
LARED-S   32,768         RH2D               1,024
LARED-P   72,000         HIME3D             3,600
LAP3D     16,384         PDD3D              4,096
MEPH3D    38,400         LARED-R            512
MD3D      80,000         LARED Integration  128
RT3D      1,000          JMES-FDTD          60,000
NEPTUNE   1,024

40 Applications-1: Inertial Confinement Fusion. Codes, researchers, and groups concurrently develop ICF application codes from different combinations of numerical methods, physical parameters, and expert experience across the simulation cycle. The framework hides parallel computing and adaptive implementations using tens of thousands of CPU cores; provides efficient data structures, algorithms, and solvers; supports software engineering for code extensibility.

41 Applications-1: Inertial Confinement Fusion codes. Scale up by a factor of 1000 within 6 years:

Code     Description                                 Year 2004                                   Year 2010
LARED-H  2-D radiation hydrodynamics Lagrange code   serial, single block, without capsule       parallel, multiblock, NIF ignition target
LARED-R  2-D radiation transport code                serial, single level                        parallel (2,048 cores), SAMR, multi-group diffusion
LARED-S  3-D radiation hydrodynamics Euler code      serial, 2-D single group, 3-D no radiation  parallel (32,768 cores), 3-D radiation multigroup diffusion
LARED-P  3-D laser plasma interaction code           serial                                      parallel (36,000 cores), terascale of particles

42 ICF-1: Integration simulations for ignition targets. Multi-physics modeling for radiation hydrodynamics simulations: three-temperature conduction modeling, radiation thermal conduction modeling, transport modeling, multigroup diffusion modeling, laser transfer modeling.

43 ICF-1: Integration simulations for ignition targets. [Figures: radiation and electron temperatures in the hohlraum; Lagrange meshes; LLNL NIF base target.]

44 ICF-2: 3-D simulations for laser plasma interactions. LARED-P: super-strong lasers transfer and focus over the plasma cone and generate high energy electrons. #mesh: 768M; #particles: 40G; CPU cores: 72,000; parallel efficiency: 44%; physical time: 320 fs; simulation time: 4.5 hours; output data: 640 GB / 20.

45 ICF-3: 3-D simulations for laser plasma hydrodynamics. LAP3D: lasers filament and self-focus while transferring over a long distance in lower density plasma environments; solves the hydrodynamics equations coupled with laser paraxial transfer equations. #mesh: 2.1G; CPU cores: 16,384; parallel efficiency: 50%; physical time: 56.5 ps; simulation time: 12.8 hours; #steps: 41,200; output data: 1.74 TB / 104. Parallel rendering using 72 cores. Simulation results coincide with the ALPS code.

46 ICF-4: 3-D simulations for radiation hydrodynamics instabilities. LARED-S: radiation hydrodynamics instabilities occur over the capsule interfaces while the capsule shell rapidly slows down. #mesh: 160M (256x256x256, 3-D perturbation); #cores: 32,768; parallel efficiency: 52%; physical time: 0.1 ps; simulation time: 39 hours; #steps: 778,580; output data: 4.4 TB / 670. [Figure: gray: shell materials; color: ion temperature of hot spots; ideal compression.]

47 ICF-4: 3-D simulations for radiation hydrodynamics instabilities. Three-level AMR: 1300M cells, resolution 2000x2000x2000 (8000M cells). Uniform run: 1600M cells (2560x256x ), 6,505 steps, 9 hours. SAMR improves: resolution by 50, scale by cores; 77K steps, 39 hours.

48 Applications-2: Material simulations. PDD3D: 3-D discrete dislocation simulations for the locally plastic deformations of metal materials under enforced shock. The details of dislocation structures are discovered. #dislocations: 3M; CPU cores: 4,096; parallel efficiency: 47%; physical time: 5.5 ns; simulation time: 12 hours; #steps: 1,800; output data: 196 GB / 196. [Figure: stretch simulations of Cu crystal (0.12 mm^3, strain rate 10^7 /s); left: dislocation density; right: local structures.]

49 Applications-2: Material simulations. MD3D: 3-D molecular dynamics simulations for the structures and dynamic shock responses of metal materials with nano-scale defects. Dislocation holes release, collapse, and interact with each other; hot spots form, generating various dynamic responses. #molecules: 50G; #cores: 80,000; parallel efficiency: 62%; physical time: 5.5 ps; simulation time: 3.0 hours; #steps: 30,000; output data: 162 GB / 150. [Figure: shock front and hot spots; left: local spots; right: LLNL's results in 2005.]

50 Applications-3: Electromagnetic Simulations. JMES3D: 3-D FDTD simulations of the damage mechanism of electromagnetic waves with different frequencies and directions. #mesh: 1.2G; #cores: 60,000; parallel efficiency: 70%; physical time: 250 ns; simulation time: 8.1 hours; #steps: 254,000; output data: 736 GB / 172. [Figures: deposited energy; electric field snapshot.]

51 Applications-3: Electromagnetic Simulations. JMES3D: JAS93 L-wave RCS analysis, 3,000 CPU cores, 7.5 billion grid cells.

52 Applications-4: Climate Forecasting. Atmosphere: GAMIL-JASMIN (1.0 degree / 100 km resolution); land: CLM-JASMIN (1.0 degree / 100 km resolution); ocean: LICOM-JASMIN (0.25 degree resolution); ocean ice: CSIM4-JASMIN; coupled via the JASMIN coupler. 5,800 cores, parallel efficiency 33%.

53 Applications-5: CFD. Automatic block joining and patch-based decomposition for large scale parallelism. [Figure: mesh blocks distribution.] #cells: 9,574,784; #blocks: 194; #patches: 9,600; #CPU cores: 2,048; 10K steps in 800 sec; speedup 580.

54 4. Comments. [Diagram: layered stack from applications software/simulation and proxy apps, through stencil templates, numerical libraries, and domain-specific frameworks, down to the general parallel programming stack and the hardware.]

55 [Timeline: JASMIN evolving from teraflops systems, through JASMIN 3.0 / JAUMIN 1.0 / JCOGIN 1.0 product releases on petaflops systems, toward exa-scale computing.]

56 Application codes are heritable: 40 codes, 1,500 KLOC; domain-specific APIs need no changes while the frameworks are improved toward new machines. Two months: JASMIN (teraflops) to JASMIN 3.0 / JAUMIN 1.0 / JCOGIN 1.0 (petaflops). Three months: to JASMIN / JAUMIN 2.0 / JCOGIN 2.0 (petaflops).

57 Thank you for your attention.
