Opportunities & Challenges for Piz Daint's Cray XC50 with ~5000 P100 GPUs. Thomas C. Schulthess

1 Opportunities & Challenges for Piz Daint's Cray XC50 with ~5000 P100 GPUs. Thomas C. Schulthess

2 Piz Daint 2017 fact sheet
~5 000 NVIDIA P100 GPU accelerated nodes; ~1 400 dual multi-core socket nodes
Model: Cray XC40/Cray XC50
Number of Hybrid Compute Nodes: 5 320
Number of Multicore Compute Nodes: [...]
Theoretical Peak Floating-point Performance per Hybrid Node: [...] Teraflops, Intel Xeon E5-2690 v3/NVIDIA Tesla P100
Theoretical Peak Floating-point Performance per Multicore Node: [...] Teraflops, Intel Xeon E[...] v[...]
Theoretical Hybrid Peak Performance: [...] Petaflops
Theoretical Multicore Peak Performance: [...] Petaflops
Hybrid Memory Capacity per Node: 64 GB; 16 GB CoWoS HBM2
Multicore Memory Capacity per Node: 64 GB, 128 GB
Total System Memory: [...] TB; 83.1 TB
System Interconnect: Cray Aries routing and communications ASIC, and Dragonfly network topology
Sonexion 3000 Storage Capacity: 6.2 PB
Sonexion 3000 Parallel File System Theoretical Peak Performance: 112 GB/s
Sonexion 1600 Storage Capacity: 2.5 PB
Sonexion 1600 Parallel File System Theoretical Peak Performance: 138 GB/s

3 Euclid Flagship Simulation 2016
Full sky map of the dark matter structure at ½ the age of the Universe. This structure will distort the shapes of more distant galaxies due to weak gravitational lensing.
2 trillion particles, using all of the available memory on Piz Daint and observing about 25 billion virtual galaxies (*)
(*) This catalogue is being used to calibrate the experiments on board the Euclid satellite, which will be launched in 2020 with the objective of investigating the nature of dark matter and dark energy.
Source: Joachim Stadel & Doug Potter (see: Potter et al. Comp. Astro. & Cosmol. (2017) DOI /s

4 Imaging the Earth
General concept:
Collect recordings from a large number of earthquakes.
Simulate recordings for a simple model of the Earth.
Compare observed and simulated recordings.
Improve the Earth model to match observations and simulations.
Figure: data coverage; min. period [s] from 5.0 to 55.0
Source: Andreas Fichtner (andreas.fichtner@erdw.ethz.ch)
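
The general concept on this slide (collect, simulate, compare, improve) can be read as an iterative misfit-minimisation loop. The following is only a minimal sketch under strong assumptions: the "Earth model" is a single hypothetical slowness value, the forward model is a toy linear travel-time formula rather than a seismic wave simulation, and all function names and numbers are made up for illustration.

```python
# Toy version of the loop: observe, simulate, compare, improve.
# Assumption: one model parameter (uniform slowness in s/km) and a linear
# forward model; real seismic imaging solves a far larger wave-propagation problem.

def simulate(slowness, distances_km):
    """Simulated recordings: one travel time per source-receiver distance."""
    return [slowness * d for d in distances_km]

def misfit(observed, simulated):
    """Sum of squared differences between observed and simulated recordings."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))

def improve(slowness, observed, distances_km, step):
    """One gradient-descent update of the model parameter."""
    simulated = simulate(slowness, distances_km)
    # d(misfit)/d(slowness) = sum over recordings of -2 * d * (o - s)
    grad = sum(-2.0 * d * (o - s)
               for d, o, s in zip(distances_km, observed, simulated))
    return slowness - step * grad

distances_km = [100.0, 250.0, 400.0]
true_slowness = 0.125                    # hypothetical value standing in for the real Earth
observed = simulate(true_slowness, distances_km)

model = 0.2                              # simple starting model
for _ in range(50):
    model = improve(model, observed, distances_km, step=1e-6)

final_misfit = misfit(observed, simulate(model, distances_km))
print(model, final_misfit)
```

The step size is chosen by hand for this toy problem; it has no meaning beyond making the loop converge.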

5 The Collaborative Earth Model
First community effort to successively evolve a model of the Earth. Harness distributed resources and manpower of many researchers.
Figure: overview of current subregions and collaborators
Source: Andreas Fichtner (andreas.fichtner@erdw.ethz.ch)

6 source: Towards Green Aviation with Python at Petascale. P. E. Vincent et al. Supercomputing

7 Website: Github: Paper: Witherden et al. Comp. Phys. Comm. (2014)
Governing Equations: Compressible Euler and Navier-Stokes
Spatial Discretisation: Arbitrary order Flux Reconstruction on mixed unstructured grids (tris, quads, hexes, tets, prisms)
Temporal Discretisation: Explicit Runge-Kutta schemes
Precision: single, double
Input, Output: .pyfrm, .msh, .cgns, .pyfrs, .vtu, .pvtu
Platforms: CPU clusters (via C/OpenMP-MPI), MIC clusters (via C/OpenMP-MPI), Nvidia GPU clusters (via CUDA-MPI), AMD GPU clusters (via OpenCL-MPI)
Source: Peter Vincent
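
The slide lists explicit Runge-Kutta schemes for the temporal discretisation. As an illustration of what one explicit step looks like, here is a minimal sketch of the classical fourth-order Runge-Kutta method applied to a scalar test equation. It is not taken from the solver's code base; the function names and the test problem are assumptions chosen for brevity.

```python
import math

def rk4_step(f, t, u, dt):
    """Advance du/dt = f(t, u) by one classical RK4 step of size dt."""
    k1 = f(t, u)
    k2 = f(t + 0.5 * dt, u + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, u + 0.5 * dt * k2)
    k4 = f(t + dt, u + dt * k3)
    return u + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def integrate(f, u0, t_end, n_steps):
    """Integrate from t = 0 to t_end with n_steps equal explicit steps."""
    dt = t_end / n_steps
    t, u = 0.0, u0
    for _ in range(n_steps):
        u = rk4_step(f, t, u, dt)
        t += dt
    return u

def decay(t, u):
    """Test problem du/dt = -u, with exact solution u(t) = exp(-t)."""
    return -u

u_end = integrate(decay, 1.0, 1.0, 20)
error = abs(u_end - math.exp(-1.0))
print(u_end, error)
```

Every stage only uses values that are already known, which is what makes the scheme explicit and straightforward to run in parallel across grid elements.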

8 PyFR: ~9k lines of Python code. Source: Peter Vincent

9 Science-driven exascale computing

10 Leadership in weather and climate
European world leadership, but far away from sufficient accuracy and reliability!
Peter Bauer, ECMWF

11 The impact of resolution: simulated tropical cyclones
Panels: 130 km, 60 km, 25 km, Observations
HADGEM3 PRACE UPSCALE, P.L. Vidale (NCAS) and M. Roberts (MO/HC)

12 What resolution is needed? Bjorn Stevens, MPI-M
There are threshold scales in the atmosphere and ocean: going from 100 km to 10 km is incremental, 10 km to 1 km is a leap. At 1 km it is no longer necessary to parametrise precipitating convection, ocean eddies, or orographic wave drag and its effect on extratropical storms; ocean bathymetry, overflows and mixing, as well as regional orographic circulation in the atmosphere, become resolved; the connections between the remaining parametrisations are now on a physical footing.
We spent the last five decades in a paradigm of incremental advances. Here we incrementally improved the resolution of models from 200 to 20 km.
Exascale allows us to make the leap to 1 km. This fundamentally changes the structure of our models. We move from crude parametric representations to an explicit, physics-based description of essential processes.
The last such step change was fifty years ago. This was when, in the late 1960s, climate scientists first introduced global climate models, which were distinguished by their ability to explicitly represent extra-tropical storms, ocean gyres and boundary currents.

13 The importance of ensembles. Peter Bauer, ECMWF

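14 The relevant metric: Simulated Years Per Day (SYPD)
NWP: simulation 10 d; desired wall clock time 0.1 d; ratio 100
Climate in production: simulation 100 y; desired wall clock time 0.1 y; ratio 1'000
Climate spinup: simulation [...] y; desired wall clock time 0.5 y; ratio 10'000
SYPD: [...]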
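
The ratio on this slide is simulated time divided by desired wall clock time, and SYPD is the same quantity expressed in simulated years per wall-clock day. The sketch below works through that arithmetic. Two things are assumptions rather than slide content: a 365-day year, and the spin-up length of 5'000 years, which is not legible in the slide text and is only inferred here from the ratio of 10'000 and the 0.5 y wall clock time.

```python
DAYS_PER_YEAR = 365.0   # assumption: the slide does not state the calendar used

def time_compression(simulated_days, wall_clock_days):
    """Ratio of simulated time to wall clock time (dimensionless)."""
    return simulated_days / wall_clock_days

def sypd(simulated_days, wall_clock_days):
    """Simulated years per wall-clock day."""
    return (simulated_days / DAYS_PER_YEAR) / wall_clock_days

# NWP: 10 simulated days in 0.1 wall-clock days.
nwp_ratio = time_compression(10.0, 0.1)
nwp_sypd = sypd(10.0, 0.1)

# Climate in production: 100 simulated years in 0.1 wall-clock years.
climate_ratio = time_compression(100.0 * DAYS_PER_YEAR, 0.1 * DAYS_PER_YEAR)
climate_sypd = sypd(100.0 * DAYS_PER_YEAR, 0.1 * DAYS_PER_YEAR)

# Climate spin-up: length inferred from ratio * wall clock time (assumption).
spinup_years = 10_000 * 0.5
spinup_sypd = sypd(spinup_years * DAYS_PER_YEAR, 0.5 * DAYS_PER_YEAR)

print(nwp_ratio, round(nwp_sypd, 2))
print(climate_ratio, round(climate_sypd, 2))
print(spinup_years, round(spinup_sypd, 1))
```

Because SYPD is simply the ratio divided by the number of days in a year, each tenfold increase in the ratio raises the SYPD requirement tenfold as well.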

15 Running COSMO 5.0 at global scale on Piz Daint
Scaling to full system size: ~5300 GPU-accelerated nodes available
Running a near-global (±80° covering 97% of Earth's surface) COSMO 5.0 simulation
> Either on the host processors: Intel Xeon E5-2690 v3 (Haswell 12c).
> Or on the GPU accelerator: PCIe version of NVIDIA GP100 (Pascal) GPU

16 Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0
Fuhrer et al., Geosci. Model Dev. Discuss., in review, 2017, https://doi.org/10.
Metric: simulated years per wall-clock day
Plot: SYPD (0.01 to 1) versus #nodes (10 to 1000) for Δx = 19 km, P100; Δx = 19 km, Haswell; Δx = 3.7 km, P100; Δx = 3.7 km, Haswell; Δx = 1.9 km, P100; Δx = 930 m, P100
⟨Δx⟩ = 930 m: 4,888 nodes; Δt = 6 s; 0.043 SYPD; 596 MWh/SY; 3.46 × 10^10 gridpoints
⟨Δx⟩ = 1.9 km: 4,888 nodes; Δt = 12 s; 0.23 SYPD; 97.8 MWh/SY; 8.64 × 10^9 gridpoints
⟨Δx⟩ = 47 km: 18 nodes; Δt = 300 s; 9.6 SYPD; 0.099 MWh/SY; 1.[...] gridpoints
(c) Time compression (SYPD) and energy cost (MWh/SY) for three moist simulations. At 930 m grid spacing obtained with a full 10 d simulation, at 1.9 km from 1,000 steps, and at 47 km from 100 steps.
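
The SYPD, time step and MWh/SY figures reported on this slide can be combined into two derived quantities: the wall-clock time spent per model time step and the average power draw of a run. The sketch below does this for the 930 m and 1.9 km runs on 4,888 nodes; the 47 km run is left out because its entry is truncated in the source. The 365-day year and all function names are assumptions, and the derived values are back-of-the-envelope estimates, not figures from the paper.

```python
DAYS_PER_YEAR = 365.0   # assumption: the slide does not state the calendar used
HOURS_PER_DAY = 24.0

# (grid spacing, nodes, time step in s, SYPD, MWh per simulated year) from the slide.
RUNS = [
    ("930 m", 4888, 6.0, 0.043, 596.0),
    ("1.9 km", 4888, 12.0, 0.23, 97.8),
]

def seconds_per_step(dt_s, sypd):
    """Wall-clock seconds spent on one model time step."""
    # SYPD * 365 = simulated days per wall-clock day
    #            = simulated seconds per wall-clock second.
    simulated_seconds_per_wall_second = sypd * DAYS_PER_YEAR
    return dt_s / simulated_seconds_per_wall_second

def average_power_mw(mwh_per_sy, sypd):
    """Mean power draw implied by the energy cost and the throughput."""
    mwh_per_wall_day = mwh_per_sy * sypd
    return mwh_per_wall_day / HOURS_PER_DAY

derived = {
    name: {
        "seconds_per_step": seconds_per_step(dt, sypd),
        "average_power_mw": average_power_mw(energy, sypd),
        "watts_per_node": average_power_mw(energy, sypd) * 1e6 / nodes,
    }
    for name, nodes, dt, sypd, energy in RUNS
}

for name, values in derived.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

Under these assumptions both runs come out close to 1 MW of average power, which is a useful consistency check on the two rows: the energy cost per simulated year scales inversely with the throughput.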

17 Plot: SYPD versus #nodes for Δx = 19 km, P100; Δx = 19 km, Haswell; Δx = 3.7 km, P100; Δx = 3.7 km, Haswell; Δx = 1.9 km, P100; Δx = 930 m, P100. Annotations: 1 SYPD, 100x, 4888 nodes.
And reduce the footprint of the calculation by at least 10x
Fuhrer et al., Geosci. Model Dev. Discuss., in review,

18 Deep Learning toolkits on Cray XC
Installing a DL toolkit on Cray XC is similar to installing any HPC application: a few extra libraries are needed to satisfy dependencies.
Staging a toolkit can be done with SLURM (our resource manager at CSCS); some toolkits (like Spark) require SSH to be available on compute nodes.
DL Toolkit: C++ & GPU backend; MPI; Working on Cray XC; Fully [...] CSCS
TensorFlow: yes; no; yes; yes
TensorFlow+MPI: yes; yes; yes; in progress
MXNet: yes; no, ext. to use MPI; yes; in progress
Caffe-MPI: yes; yes; yes; in progress
CNTK: yes; yes; yes; in progress
Spark: no (Java + ext. to use GPUs); no; yes; yes
Theano: yes; no; yes; yes

19 Moving TensorFlow to Piz Daint
Test-case setting: simple neural network learning
Standard model: LeNet-5-like convolutional MNIST model
Written with TensorFlow/Python
Testbed environment:
Standard desktop with Intel Broadwell (4c)
Piz Daint multi-core node with Intel Broadwell (2x18c)
Piz Daint hybrid node with Intel Haswell (12c) and NVIDIA Pascal (P100)
Remark: this is a simple standard example; with complex models even more speedup is expected
Chart: time to solution in sec. for Desktop, Daint MC node, Daint hybrid node; annotations: 3x speedup, 18x speedup
Source: Marcel Schöngens (schoengens@cscs.ch)

20 XC50 supercomputer plus Microsoft's Cognitive Toolkit was used to scale up training

23 Scaling CNTK with MPI
Ranks: rank i-1, rank i, rank i+1, rank i+2, rank i+3
Each rank reads its own samples (Samples i-1 to Samples i+3)
Each rank updates its gradient (Gradient i-1 to Gradient i+3)
Sum gradients using MPI_Iallreduce, so that every rank holds the summed gradient
Each rank updates its weights
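
The scheme on this slide is data-parallel training: each rank computes a gradient on its own samples, the gradients are summed across ranks, and every rank applies the same weight update. The sketch below imitates that pattern in a single Python process. It does not call MPI or CNTK; `allreduce_sum` is a hypothetical stand-in for the blocking form of the reduction (the slide names the non-blocking MPI_Iallreduce), and the one-weight linear model is an assumption made to keep the example short.

```python
# Data-parallel gradient descent with the allreduce simulated in-process.
# Assumption: a one-weight model y = w * x with a squared-error loss.

def local_gradient(w, samples):
    """Gradient of 0.5 * (w*x - y)^2 summed over this rank's samples."""
    return sum((w * x - y) * x for x, y in samples)

def allreduce_sum(values):
    """Stand-in for an MPI allreduce: every rank receives the global sum."""
    total = sum(values)
    return [total for _ in values]

def training_step(weights, shards, lr):
    """One synchronous step: local gradients, global sum, identical update."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    summed = allreduce_sum(grads)
    n_samples = sum(len(shard) for shard in shards)
    return [w - lr * g / n_samples for w, g in zip(weights, summed)]

data = [(float(x), 3.0 * x) for x in range(1, 9)]     # samples of y = 3x
n_ranks = 4
shards = [data[rank::n_ranks] for rank in range(n_ranks)]  # Samples i per rank
weights = [0.0] * n_ranks                              # one model replica per rank

for _ in range(200):
    weights = training_step(weights, shards, lr=0.02)

print(weights)
```

Because every replica starts from the same weight and receives the same summed gradient, the replicas stay identical without ever exchanging the weights themselves; only gradients cross the network.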

24 "We develop algorithms, we don't have time to deal with C/C++ or MPI" (a well-known computer science colleague working in machine learning)

25 echoed by many scientists working with data. Nishant Shukla (2017)

26 Architectural Developments: Traditional Architecture
Diagram elements: Research Community; CSCS User; Data Flow; CSCS; External Login Access (ELA); Piz Daint Login & Mgmt; /store; Piz Daint Compute

27 Architectural Developments: Improved Architecture Based on External Portal
Diagram elements: Research Community; CSCS User; Data Flow; CSCS; Domain Specific Portal; Repository access; Workflow Manager; External Login Access (ELA); Piz Daint Login & Mgmt; /store; Piz Daint
Annotation: Does Not Scale

28 Architectural developments: Service Oriented Architecture (SOA)
Diagram elements: Research Community; Domain Specific Portal; CSCS User; Repository access; Workflow Manager
CSCS Infrastructure Services: Authentication & authorization; User Management; Data Management; Workflow Automation; Capacity Management; IT Infrastructure; DWH; Networking & security; OpenStack Services; Archival Storage; Active Storage; HPC Services

29 and the service should be up most of the time (like 99+ %)
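
To put the "99+ %" figure in perspective, an availability target translates directly into a yearly downtime budget. The sketch below performs that conversion; the 365-day year and the function name are assumptions, and the 99.9 % case is included only as a comparison point that is not mentioned on the slide.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year

def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down while meeting the target."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)

downtime_99 = allowed_downtime_hours(99.0)     # about 87.6 hours per year
downtime_999 = allowed_downtime_hours(99.9)    # about 8.76 hours per year

print(round(downtime_99, 2), round(downtime_999, 2))
```

At 99 % the budget is roughly three and a half days per year, which covers both planned maintenance and unplanned outages.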

30 Supporting Federation using SOA
Diagram elements: Research Community; Domain Specific Portal; CSCS User; Repository access; Workflow Manager; Research Community; Domain Specific Portal; Software services; Repository access; Workflow Manager; Platform services; Infrastructure provider; Infrastructure services; Infrastructure provider; Infrastructure services

31 Fenix Sites

32 Thank you to the engineers at Cray, CSCS and NVIDIA for the incredibly efficient development/upgrade of Piz Daint!
Tim Palmer (U. of Oxford), Peter Bauer (ECMWF), Christoph Schär (ETH Zurich), Bjorn Stevens (MPI-M), Oliver Fuhrer (MeteoSwiss), Sadaf Alam (CSCS), Dirk Pleiter (FZ Jülich), Colin McMurtrie (CSCS), Torsten Hoefler (ETH Zurich)
