Exascale Challenges and Applications Initiatives for Earth System Modeling

1 Exascale Challenges and Applications Initiatives for Earth System Modeling Workshop on Weather and Climate Prediction on Next Generation Supercomputers October 2012 Tom Edwards

2 An Introduction
HECToR: the UK's National Academic Supercomputing Service
- Cray XE6: AMD Interlagos + Gemini network
- Service based in Edinburgh, managed by EPCC
- 1600+ users, a mixture of academics and industry
- Part of the PRACE consortium
Cray Centre of Excellence
- Two Cray staff working on scaling user applications
- Addressing high-end scaling and exploitation of Cray systems
- Covers a wide variety of applications: climate, CFD, turbulence, chemistry
Cray Exascale Research Initiative, Europe
- Another two Cray staff working in partnership with EPCC
- Addressing software and co-design challenges for Exascale

3 Topics
The Challenge of Exascale
- From Petascale towards Exascale
- Key challenges
- Macro level trends
- Impacts on architectures and applications
Investing in Applications
- Impact and objectives of Blue Waters and Titan (pre-exascale)
- Community R&D efforts: CRESTA
UPSCALE: A Case study in Petascale climate
- The role of decomposition in performance
- Optimizing task placement

4 The Challenge of Exascale

5 Concurrency is growing
- Not just from the number of cores
- Longer vector lengths (probably fixed length)
- Systems made from different types of processors: hybrids of scalar and vector
- Requires new methods of programming to fully exploit the hardware
[Figure: Average Number of Processor Cores per Supercomputer (Top20 of Top500). Source: www.top500.]
October 2012 AMMW03

6 Energy is now the limiting factor
Which requires more energy?
- Performing a 64-bit floating-point FMA: 893,500.[...] x [...]78026233027699 + 2.02789331400154 = 39,226,724.80815564
- Or moving the three 64-bit operands 20 mm across the die: this one takes over 3x the energy!
And loading the data from off chip takes > 10x more yet.
Flops are cheap, communication is expensive.
Exploiting data locality is critical for energy efficiency.
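
The ratios on this slide can be turned into a rough energy budget. The sketch below is only an illustration: it uses the slide's lower bounds as if they were exact, and it reads "> 10x more yet" as ten times the cost of the on-die move, which is an interpretation rather than something the slide states. The function names are hypothetical.

```python
# Relative energy costs, in units of one 64-bit FMA.
# Assumption: the slide's lower bounds are used as exact values.
FMA = 1.0
MOVE_ON_DIE = 3.0 * FMA             # "over 3x the energy" of the FMA
LOAD_OFF_CHIP = 10.0 * MOVE_ON_DIE  # "> 10x more yet", read as 10x the on-die move


def relative_energy(n_fma, n_on_die_moves, n_off_chip_loads):
    """Total energy of a kernel, in FMA-equivalents."""
    return (n_fma * FMA
            + n_on_die_moves * MOVE_ON_DIE
            + n_off_chip_loads * LOAD_OFF_CHIP)


def compute_fraction(n_fma, n_on_die_moves, n_off_chip_loads):
    """Share of the total energy that is spent on arithmetic."""
    total = relative_energy(n_fma, n_on_die_moves, n_off_chip_loads)
    return n_fma * FMA / total


if __name__ == "__main__":
    # One FMA per off-chip load versus sixteen FMAs per off-chip load.
    for reuse in (1, 16):
        share = compute_fraction(reuse, 0, 1)
        print(f"{reuse:2d} FMA per off-chip load: {share:.1%} of energy is arithmetic")
```

Under these assumptions, raising the number of FMAs performed per off-chip load is the only way to make arithmetic a significant share of the energy budget, which is the slide's point about data locality.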

7 Key Challenges to Get to Exascale
Power
- Traditional voltage scaling is over
- Power is now a major design constraint
- Cost of ownership
- Driving significant changes in architecture
Concurrency
- A billion operations per clock
- Billions of refs in flight at all times
- Will require huge problems
- Need to exploit all available parallelism
Programming Difficulty
- Concurrency and new microarchitectures will significantly complicate software
- Need to hide this complexity from the users
Resiliency
- Many more components
- Components getting less reliable
- Checkpoint bandwidth not scaling
- Impacts both systems and storage

8 Macro Level Trends (1)
Major changes to applications and algorithms to address and extend scalability
- O(10K) increase in concurrency for exascale
Heterogeneous computing
- Defined as accelerators today
- Processor architecture in 5+ years remains undefined
- Software is the barrier to successful and efficient usage
Architectural imbalances
- Memory and I/O speeds are not keeping up
- Expectation of deeper memory hierarchies
- Communication will continue to be expensive
- Exploiting data locality is critical for computational efficiency
Very much defined in terms of Exascale, but equally relevant today

9 Macro Level Trends (2)
Application Resiliency
- The ability of an application to ride through system faults needs to be solved at the system software and application level
Energy aware computing
- Architectural perspective: driving heterogeneous approaches; driving the discussion of low power, low frequency processors
- Application perspective: energy aware optimization of applications, although today there is little or no control by the system admins and programmers
Exploiting data locality is critical for energy efficiency

10 From Petascale towards Exascale: Adapting codes to petascale and beyond

11 Petascale towards Exascale
Current Petascale systems have been key:
- In enabling transformational capabilities across a broad range of science and engineering, including industry
- As development platforms for preparing earth system models for extreme scale capabilities
Systems such as NCSA Blue Waters and ORNL Titan will further push the boundary of Petascale computing.
Examples shown:
- Fully coupled biogeochemistry-physical climate simulation
- Kinetic simulations of the magnetosphere
- HIV immature virion simulations
- Climate Specialists Workshop
- Earthquake simulation of the San Andreas fault

12 Blue Waters System: 11.5 PF, Sustained Petascale Performance
  Cray system and storage cabinets: >300
  Compute nodes: >25,000
  Usable storage bandwidth: >1 TB/s
  System memory: >1.5 Petabytes
  Memory per core: 4 GB
  Gemini interconnect topology: 3D Torus
  Usable storage: >25 Petabytes
  Peak performance: >11.5 Petaflops
  Number of AMD processors: >49,000
  Number of AMD x86 cores: >380,000
  Number of NVIDIA GPUs: >3,000
Storage: Cray Sonexion

13 Blue Waters Science
- More than 25 science teams
- 25 distinct research fields selected to run on the new Blue Waters:
  - Nanotechnology
  - Astronomy
  - Earthquakes and the damage they cause
  - Viruses entering cells
  - Severe storms
  - Climate change

14 ORNL Titan: 10-20 PF Hybrid System
- Upgrading Jaguar (XT5) to Titan (XK6): $97M contract
- Peak performance 10 to 20 petaflops
- Upgrade will be completed by the end of 2012
- Efficiency and scalability of unprecedented scale
- Hybrid of NVIDIA GPUs, AMD CPUs and Gemini interconnect
- ORNL's step towards a pre-exascale architecture
Focus on programming environment:
- To exploit all levels of parallelism
- Compiler directives hold the promise of allowing developers to exploit the GPUs efficiently while preserving portability to other platforms
- Partnership with Cray

15 Titan Configuration
  Name: Titan
  Architecture: XK6
  Processor: AMD Interlagos
  Cabinets: 200
  Nodes: 18,688
  CPU Memory/Node: 32 GB
  GPU Memory/Node: 6 GB
  Interconnect: Gemini
  GPUs: Nvidia Kepler

16 Investing in Applications

17 Titan: Early Science Applications

21 Applications R&D for the Extreme Scale
A number of application-focused petascale and exascale activities involve the earth system modeling community:
- Swiss Platform for High-Performance and High-Productivity Computing (HP2C)
- UK Met Office / NCAS Project UPSCALE
- ICOsahedral-grid Models for EXascale Earth system simulations (ICOMEX)
- Titan Early Science Applications
- Blue Waters Petascale Computing Resource Allocations (PRAC) Teams
- CRESTA Project: EU-funded exascale co-design

22 The CRESTA Project
Collaborative Research into Exascale Systemware, Tools and Applications
- Three-year EU-funded collaborative project: 13 partners, €12 million costs, €8.5 million funding
- Project coordinator: EPCC at The University of Edinburgh
- CRESTA has a very strong focus on exascale software challenges
- Uses a co-design model in which applications with exascale potential interact with systemware and tools activities
- The hardware partner is Cray
- Applications represent a broad spectrum from science and engineering
- CRESTA will compare and contrast incremental and disruptive solutions to Exascale challenges

23 Consortium & Applications
- Leading European HPC centres: EPCC, HLRS, CSC, PDC
- A world leading vendor: Cray
- World leading tools providers: TUD (Vampir), Allinea (DDT)
- Exascale application owners and specialists: ABO, JYU, UCL, ECMWF, ECP, DLR

Application | Grand challenge              | Partner responsible
GROMACS     | Biomolecular systems         | KTH (Sweden)
ELMFIRE     | Fusion energy                | ABO (Finland)
HemeLB      | Virtual Physiological Human  | UCL (UK)
IFS         | Numerical weather prediction | ECMWF (International)
OpenFOAM    | Engineering                  | EPCC / HLRS / ECP
Nek5000     | Engineering                  | KTH (Sweden)

24 Courtesy: George Mozdzynski, ECMWF

25 UPSCALE: A Case study in Petascale Climate

26 The UPSCALE project
- The aim is to increase the fidelity of global climate simulations and the understanding of weather and climate risk, by representing fundamental weather and climate processes more completely.
- Hermit XE6 is being used to conduct a series of 25-year simulations under both current climate conditions and a climate change scenario, using HadGEM3 at 25 km horizontal resolution.
  - Essentially the same resolution as used for the Met Office's global weather forecasts.
  - The data set produced will be widely studied over the next decade.
- A key component of the project is to extend the scalability of HadGEM (and UM) to ~25K cores.

27 Project UPSCALE: UK Climate Research Team Wins Major Computing Award on the HLRS Cray XE6
- UPSCALE project: UK on PRACE: weather-resolving Simulations of Climate for global Environmental risk
- The team comprises climate scientists at the UK Met Office and the National Centre for Atmospheric Science (NCAS)
- Led by Pier Luigi Vidale, Willis Professor of Climate System Science and Climate Hazards at the University of Reading's Meteorology department and Director of the Weather and Climate Hazards Laboratory
- Granted substantial computing resources on the HLRS Hermit XE6 by the Partnership for Advanced Computing in Europe (PRACE) Access Committee

28 HERMIT @ HLRS
Configuration
  Number of Cray XE6 LC Cabinets: 38
  Number of Internal Chassis: 114
  Number of XDPs: 8
Compute Partition
  Number of Compute Nodes (incl. 480 "fat" nodes and 16 XK6 nodes): 3,552
  Number of Compute Processors: 7,104 (*)
  Number of Compute Cores: 113,664 (*)
  Compute Processor Type: AMD 2.3 GHz
  Compute peak: 1,046 TF
  Memory Configuration per Node (3072 nodes): 32 GB, DDR3
  Memory Configuration per "fat" Node (480 nodes): 64 GB, DDR[...]
  Compute Partition Memory Capacity: [...] TB
(*) Currently there are 16 Tesla based XK6 nodes; these will be upgraded to 32 Kepler based nodes at the end of the year.

29 Optimising UPSCALE for Petascale (and beyond)
- Running production climate simulations at relatively untested resolutions (for climate)
  - Potential stability issues
- CPU grant for 1 year only => need to be productive quickly
  - Time to solution is important (without wasting resources)
  - No time for in-depth code modifications; look for optimisation in existing application and environmental parameters
- Large volumes of output data (~1 TB per model per wallclock day)
  - Potentially 5 models running consecutively (with favourable stability and queues)
  - Tuning IO and the new IOS

30 Two Lessons from UPSCALE

31 Choices of Parallel Decomposition are growing
- More parallelism offers greater choices of decomposition
- Each parallel paradigm (distributed memory, shared memory, SIMT, vectorisation) offers even greater choices
- Applications need to make optimal choices to get the best performance
UPSCALE
- Choice of how to perform the 2D MPI decomposition, plus the number of OpenMP threads
- Investigated a large number of decompositions over a limited range of total cores
- Found a wide range in performance

32 Maximum # of 2D Decompositions
[Figure: Max # of 2D Decompositions vs Processor Count; x-axis: Number of Ranks]
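
The quantity plotted on this slide can be reproduced with a few lines of code. The sketch below assumes that a "2D decomposition" means an ordered pair (E-W ranks, N-S ranks) whose product equals the total rank count; it ignores OpenMP threads and any model-specific limits on subdomain size, and the function name is hypothetical.

```python
def decompositions_2d(n_ranks):
    """Return every ordered (n_ew, n_ns) pair with n_ew * n_ns == n_ranks."""
    pairs = []
    for n_ew in range(1, n_ranks + 1):
        if n_ranks % n_ew == 0:
            pairs.append((n_ew, n_ranks // n_ew))
    return pairs


if __name__ == "__main__":
    # Rank counts with many small prime factors offer the most choices.
    for n in (1024, 1536, 2048, 3072):
        print(n, "ranks:", len(decompositions_2d(n)), "candidate decompositions")
```

Each candidate would still have to be timed on the target machine; the count only shows how quickly the search space grows with the number of ranks.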

33 The best decompositions are rare
- Only a small proportion of decompositions give the better performance
- 45% between best and worst
[Figure: histogram of the number of decompositions against % slower than the fastest decomposition, in 5% bins up to 40-45%]

34 Specialising cores to improve performance: Using MPMD models to improve overall application performance

35 MPMD and Exascale
- It is probable that MPMD programs will become more prevalent at Exascale.
  - Applications may be designed to use specialised subsets of processors for specific tasks (e.g. IO Servers, novel solvers, accelerated nodes)
  - Coupling of multiple models is already happening (e.g. Atmosphere and Ocean), and is likely to be extended to other fields (e.g. whole body/brain simulations, multi-phase flows, combustion/chemistry)
- We will not be able to rely on there being corresponding symmetry in hardware and software layouts.
- Instead, for optimal performance, application placement must in future be guided by the software architecture.
- UPSCALE's UM8.0 includes IO Servers for optimising data transfer to disk.

36 UM Standard MPI Decomposition
[Figure: domain decomposition mapped onto Node 0 to Node 3]
- Domain decomposition distributed over the cores on each processor
- Deep East-West halos favour rows using intra-node comms (e.g. shared memory)
- Best performance achieved when the processor E-W decomposition is a factor of the domain E-W decomposition
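
The alignment argument can be checked numerically. The sketch below is not taken from the UM: it assumes row-major rank numbering, ranks packed onto nodes in rank order, and a periodic East-West boundary, and it counts how many East-West neighbour pairs end up on the same node. The function name and parameters are hypothetical.

```python
def ew_on_node_fraction(n_ew, n_ns, ranks_per_node):
    """Fraction of East-West neighbour pairs that share a node.

    Assumptions: rank = row * n_ew + column, ranks fill nodes in rank
    order, and the East-West direction wraps around (periodic).
    """
    on_node = 0
    total = 0
    for row in range(n_ns):
        for col in range(n_ew):
            rank = row * n_ew + col
            east = row * n_ew + (col + 1) % n_ew  # wrap at the row end
            total += 1
            if rank // ranks_per_node == east // ranks_per_node:
                on_node += 1
    return on_node / total


if __name__ == "__main__":
    # 16 ranks per node: a row of 16 fits a node exactly, a row of 12 does not.
    for n_ew in (16, 12):
        share = ew_on_node_fraction(n_ew, 4, 16)
        print(f"E-W = {n_ew}: {share:.1%} of E-W exchanges stay on node")
```

When the row length and the ranks per node divide evenly, every East-West exchange in this model stays within a node; otherwise some rows straddle a node boundary and part of the halo traffic crosses the network.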

37 IO Servers: MPMD providing net performance gains
- Originally, codes treated compute and IO as serial tasks to be performed by all nodes
  [Figure: timeline of alternating Compute and IO phases]
- IO costs have grown, so codes (e.g. UM) have been extended to include IO Server ranks
- These ranks are dedicated to performing the IO operations asynchronously of compute
  [Figure: timeline of continuous Compute phases, with IO handled by IO Servers]
- Typically, adding an additional 1% of nodes to act as IO servers can eliminate almost all IO from runtime
- Requires data to be double buffered, so can increase overall memory overhead
- Essentially a form of MPMD
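
The hand-off between compute and an IO server can be illustrated on a single machine. The sketch below is a thread-based analogue, not the UM's MPI implementation: one worker thread stands in for an IO Server rank, and a bounded queue of two slots stands in for the double buffer. All names are hypothetical.

```python
import queue
import threading


def run_with_io_server(n_steps, compute, write, n_buffers=2):
    """Overlap compute with output using one dedicated IO worker.

    compute(step) returns the data for a step; write(data) stores it.
    The bounded queue plays the role of the double buffer: compute only
    blocks when every buffer is still waiting to be written.
    """
    buffers = queue.Queue(maxsize=n_buffers)

    def io_server():
        while True:
            data = buffers.get()
            if data is None:  # sentinel: no more output to write
                break
            write(data)

    worker = threading.Thread(target=io_server)
    worker.start()
    for step in range(n_steps):
        buffers.put(compute(step))  # hand off, then carry on computing
    buffers.put(None)
    worker.join()


if __name__ == "__main__":
    written = []
    run_with_io_server(5, lambda step: step * step, written.append)
    print(written)
```

The extra memory cost mentioned on the slide shows up here as the queue itself: up to `n_buffers` completed results are held in memory while the next step is being computed.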

38 Basic distribution of IO servers (1 per node)
[Figure: rank layout across Node 0 to Node 4]

39 Basic distribution of IO servers (pt 2)
Advantages
- Easier implementation
- Efficient when number of nodes [...] number of IO servers
- Do not have to change the distribution of ranks across nodes, e.g. keep 16 ranks per node and just add to the total number of ranks
- Allows for much larger buffers on IO Server tasks
- Distributes IO traffic across the network
Disadvantages
- Disrupts the nice alignment between decomposition and nodes
- IO Servers restricted to the same memory limits as compute ranks
- IO Servers are likely to require more memory and far less compute

40 Rank Reordered Decomposition (IO Nodes)
[Figure: rank layout across Node 0 to Node 4]

41 Rank Reordered Decomposition (pt 2)
Advantages
- Keeps the nice alignment between processor decomposition and nodes
- Can change the distribution of ranks across nodes: keep large numbers of ranks per node for compute nodes, use fewer ranks per node on IO nodes
- Can be implemented at runtime on Cray architectures
- Efficient when number of compute nodes >> number of IO servers
Disadvantages
- Concentrates IO traffic on a few nodes of the system
  - Network bandwidth > IO bandwidth
  - IO Servers should hide any IO delays anyway
UPSCALE recorded a 6.5% improvement in runtime performance from rank reordering compared to the default IO server distribution.
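
A rank-reordered placement of this kind can be sketched as a simple node-to-ranks table. The numbering used below (compute ranks first, IO Server ranks last) is an assumption made for illustration, not the UM's actual convention, and the sketch does not show how the table would be passed to the job launcher. The function name is hypothetical.

```python
def place_ranks(n_compute, n_io, compute_per_node, io_per_node):
    """Return a list of nodes, each holding the ranks placed on it.

    Assumption: compute ranks are 0 .. n_compute - 1 and IO Server ranks
    follow them. Compute nodes are packed densely; IO nodes carry fewer
    ranks so that each IO Server has more memory for buffers.
    """
    compute = list(range(n_compute))
    io = list(range(n_compute, n_compute + n_io))
    nodes = [compute[i:i + compute_per_node]
             for i in range(0, n_compute, compute_per_node)]
    nodes += [io[i:i + io_per_node]
              for i in range(0, n_io, io_per_node)]
    return nodes


if __name__ == "__main__":
    # 64 compute ranks at 16 per node, plus 4 IO Servers at 2 per node.
    for node, ranks in enumerate(place_ranks(64, 4, 16, 2)):
        print(f"Node {node}: ranks {ranks[0]}-{ranks[-1]} ({len(ranks)} ranks)")
```

Because the compute nodes keep a full complement of compute ranks, the alignment between the decomposition rows and the nodes is unchanged by the presence of the IO Servers.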

42 Summary
- Extreme scale computing will require the successful solution of a collection of interrelated science and technology challenges.
- A number of application-focused petascale and exascale activities involve the earth system modeling community; several co-design initiatives are progressing well.
- Those that can scale have benefited from a focused, iterative, multi-year algorithmic optimization effort: optimization strategies do not remain static and must take advantage of evolving hardware and software technologies.
- The successful groups have had ongoing access to scalable systems and support.

43 For More Information
