NVIDIA Update and Directions on GPU Acceleration for Earth System Models

1 NVIDIA Update and Directions on GPU Acceleration for Earth System Models
Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA
Carl Ponder, PhD, Applications Software Engineer, NVIDIA, Austin, TX, USA

2 TOPICS OF DISCUSSION
o NVIDIA GPU UPDATE
o ESM GPU PROGRESS
o WRF DEVELOPMENTS

3 NVIDIA GPU: Introduction and Hardware Features
GPU Introduction
o Tesla P100 GPU connected to the CPU over PCIe or NVLink
o Unified Memory
o Co-processor to the CPU
o Threaded Parallel (SIMT)
o CPUs: x86, Power, ARM
o ORNL Titan: #3 on Top500.org, 18,688 GPUs
HPC Motivation:
o Performance
o Efficiency
o Cost Savings
Figure labels: 1x, 3x, 10x

4 NVIDIA GPU: Introduction and Hardware Features
GPU comparison: next GPU (Tesla P100, Q4 2016) vs. current GPUs since 2014 (Tesla K80, Tesla K40)

GPU Feature             Tesla P100               Tesla K80                Tesla K40
Stream Processors       (values not preserved)
Core Clock              1328MHz                  562MHz                   745MHz
Boost Clock(s)          1480MHz                  875MHz                   810MHz, 875MHz
Memory Clock            1.4Gbps HBM2             5Gbps GDDR5              6Gbps GDDR5
Memory Bus Width        4096-bit                 2 x 384-bit              384-bit
Memory Bandwidth        720GB/sec                2 x 240GB/sec            288GB/sec
VRAM                    16GB                     2 x 12GB                 12GB
Half Precision          21.2 TFLOPS              8.74 TFLOPS              4.29 TFLOPS
Single Precision        10.6 TFLOPS              8.74 TFLOPS              4.29 TFLOPS
Double Precision        5.3 TFLOPS (1/2 rate)    2.91 TFLOPS (1/3 rate)   1.43 TFLOPS (1/3 rate)
GPU                     GP100 (610mm²)           GK210                    GK110B
Transistor Count        15.3B                    2 x 7.1B(?)              7.1B
Power Rating            300W                     300W                     235W
Cooling                 N/A                      Passive                  Active/Passive
Manufacturing Process   TSMC 16nm FinFET         TSMC 28nm                TSMC 28nm
Architecture            Pascal                   Kepler                   Kepler

Callouts on slide: 2.5x, 3.7x
NOTE: P100 nodes available for community remote access on NVIDIA PSG cluster
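The memory rows of the comparison table can be cross-checked with simple arithmetic. The sketch below assumes the usual estimate of peak bandwidth as bus width times per-pin data rate, and assumes that the 2.5x and 3.7x callouts compare the P100 with the K40; the slide does not state either point explicitly. The function and variable names are illustrative only.

```python
# Figures are copied from the slide's comparison table.
# ASSUMPTION: peak bandwidth = bus width (bits) x per-pin data rate (Gbit/s) / 8.

def peak_bandwidth_gb_per_s(bus_width_bits, data_rate_gbps):
    """Estimate peak memory bandwidth in GB/s."""
    return bus_width_bits * data_rate_gbps / 8.0  # 8 bits per byte


# name -> (bus width in bits, data rate in Gbps, bandwidth quoted on the slide)
MEMORY_SPECS = {
    "Tesla P100": (4096, 1.4, 720.0),
    "Tesla K80 (per GPU)": (384, 5.0, 240.0),
    "Tesla K40": (384, 6.0, 288.0),
}

# Double-precision peaks quoted on the slide, in TFLOPS.
DOUBLE_PRECISION_TFLOPS = {"Tesla P100": 5.3, "Tesla K40": 1.43}

if __name__ == "__main__":
    for name, (width, rate, quoted) in MEMORY_SPECS.items():
        computed = peak_bandwidth_gb_per_s(width, rate)
        print(f"{name}: computed {computed:.1f} GB/s, quoted {quoted:.0f} GB/s")

    # ASSUMPTION: the slide callouts compare P100 against K40.
    bandwidth_ratio = MEMORY_SPECS["Tesla P100"][2] / MEMORY_SPECS["Tesla K40"][2]
    dp_ratio = DOUBLE_PRECISION_TFLOPS["Tesla P100"] / DOUBLE_PRECISION_TFLOPS["Tesla K40"]
    print(f"P100 vs K40 memory bandwidth: {bandwidth_ratio:.1f}x")
    print(f"P100 vs K40 double precision: {dp_ratio:.1f}x")
```

Under these assumptions the computed P100 bandwidth is slightly below the quoted 720GB/sec, which suggests the quoted figure is rounded, and the P100 to K80 per-GPU bandwidth ratio is 3x.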

5 COSMO Dycore Speedups on P100 GPU
MeteoSwiss GPU Branch of COSMO Model, Dycore Only
[Bar chart: speed-up vs. dual-socket Haswell, axis from 0x to 50x]

Configuration        Speed-up
COSMO 2x HSW CPU     baseline
2x K80 (4 x GPU)     3x
2x P100              7x
4x P100              14x
8x P100              27x

Socket-to-socket: P100 vs. HSW = 3.5x
Test setup:
o Results from NVIDIA Internal Cluster (US) (Preliminary Mar 2016)
o COSMO 5.3 MCH branch
o 128x128, 80 x vertical
o Time steps: 10
o CPU: x86 Xeon Haswell (core count and clock speed not preserved)
o GPU: Tesla P100, use of 8-GPU single node
o CUDA 8
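The speed-up figures can be turned into a rough scaling check. The sketch below assumes the bar labels 7x, 14x and 27x belong to the 2, 4 and 8 P100 configurations respectively, which is an inference from the label order rather than something the slide states. The helper names are illustrative only.

```python
# Speed-up vs. the dual-socket Haswell baseline, keyed by number of P100 GPUs.
# ASSUMPTION: labels 7x / 14x / 27x map to 2 / 4 / 8 GPUs in that order.
P100_SPEEDUP = {2: 7.0, 4: 14.0, 8: 27.0}


def speedup_per_gpu(speedups):
    """Speed-up contributed by each GPU, relative to the same CPU baseline."""
    return {gpus: value / gpus for gpus, value in speedups.items()}


def scaling_efficiency(speedups, reference_gpus):
    """Parallel efficiency relative to the smallest GPU configuration."""
    reference = speedups[reference_gpus]
    return {
        gpus: (value / reference) / (gpus / reference_gpus)
        for gpus, value in speedups.items()
    }


if __name__ == "__main__":
    for gpus, value in speedup_per_gpu(P100_SPEEDUP).items():
        print(f"{gpus} x P100: {value:.3f}x per GPU")
    for gpus, value in scaling_efficiency(P100_SPEEDUP, reference_gpus=2).items():
        print(f"{gpus} x P100: {value:.1%} scaling efficiency vs. 2 GPUs")
```

Dividing the 2x P100 figure by the GPU count reproduces the 3.5x quoted on the slide, and under the same assumption the 8-GPU run keeps roughly 96% of ideal scaling relative to the 2-GPU run.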

6 Select NVIDIA ESM Highlights Since MultiCore 5
o Growth in GPU funded-development; large GPU system deployments
o GPUs deployed for operational NWP by MeteoSwiss with COSMO model
o OpenACC (PGI) developments for ACME Atmosphere in production release
o New NCAR collaboration launched with OpenACC Hackathon Workshop
o KISTI 2-week GPU and OpenACC Workshop, with focus on MPAS and WRF
o NVIDIA selected for ECMWF ESCAPE program; ESCAPE GPU Workshop
o US DOE ORNL-led GPU Hackathons included several ES model teams: ACME, COAMPS, ECHAM6, FVCOM, NOAA GFDL models

7 MeteoSwiss and Operational COSMO NWP on GPUs
MeteoSwiss COSMO NWP configurations since 2008 (before GPUs):
o IFS from ECMWF: 2 per day, 10 day forecast
o COSMO-7 (6.6 km): 3 per day, 3 day forecast
o COSMO-2 (2.2 km): 8 per day, 24 hr forecast
MeteoSwiss COSMO NWP configurations during 2016 (with GPUs):
o IFS from ECMWF: 2 per day, 10 day forecast
o COSMO-E (2.2 km): 2 per day, 5 day forecast
o COSMO-1 (1.1 km): 8 per day, 24 hr forecast
New configurations of higher resolution and ensemble predictions are possible owing to the performance-per-energy gains from GPUs.
Source: X. Lapillonne, MeteoSwiss; EGU Assembly, Apr
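To give a feel for why the 2016 configurations need more compute, the sketch below estimates the relative cost of refining the horizontal grid and tallies the forecast hours produced per day. The cost model is a common rule of thumb and an assumption here, not something stated on the slide: horizontal points grow with the square of the refinement ratio, and the time step shrinks in proportion to the grid spacing. The function names are illustrative, and the COSMO-E ensemble size is not given on the slide, so it is left out.

```python
def refinement_cost_factor(dx_old_km, dx_new_km, refine_time_step=True):
    """Rough relative cost of moving from dx_old_km to dx_new_km grid spacing.

    ASSUMPTION: same domain and vertical levels; cost scales with the number
    of horizontal points and, optionally, with the number of time steps.
    """
    ratio = dx_old_km / dx_new_km
    factor = ratio ** 2          # points in both horizontal directions
    if refine_time_step:
        factor *= ratio          # proportionally shorter time step
    return factor


def forecast_hours_per_day(runs_per_day, forecast_length_hours):
    """Simulated forecast hours produced per day by one deterministic run type."""
    return runs_per_day * forecast_length_hours


if __name__ == "__main__":
    # COSMO-2 (2.2 km) to COSMO-1 (1.1 km), figures from the slide.
    print(f"2.2 km -> 1.1 km: about {refinement_cost_factor(2.2, 1.1):.0f}x cost")
    print("COSMO-7:", forecast_hours_per_day(3, 3 * 24), "forecast hours/day")
    print("COSMO-2:", forecast_hours_per_day(8, 24), "forecast hours/day")
    print("COSMO-1:", forecast_hours_per_day(8, 24), "forecast hours/day")
    print("COSMO-E:", forecast_hours_per_day(2, 5 * 24), "forecast hours/day per member")
```

Under this rule of thumb, COSMO-1 costs on the order of eight times as much per forecast as COSMO-2 at the same run frequency, before any ensemble members for COSMO-E are counted.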

8 MeteoSwiss Weather Prediction Based on GPUs
World's First GPU-Accelerated NWP
o Piz Kesch (Cray CS Storm), installed at CSCS in July (year not preserved)
o Racks (count not preserved) with 48 total CPUs and 192 total Tesla K80 GPUs
o High GPU density nodes: 2 x CPU + 8 x GPU
o > 90% of FLOPS from GPUs
o Operational NWP Mar 16
Image by NVIDIA/MeteoSwiss
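The node count and the "> 90% of FLOPS from GPUs" claim can be sanity-checked from the totals on this slide and the K80 double-precision peak quoted in the hardware comparison (2.91 TFLOPS). The Haswell per-socket peak is not given anywhere in the deck, so the value used below is a hypothetical placeholder, and the helper names are illustrative only.

```python
K80_PEAK_DP_TFLOPS = 2.91          # from the GPU comparison table
GPUS_PER_NODE = 8                  # from this slide
CPUS_PER_NODE = 2                  # from this slide
TOTAL_GPUS = 192                   # from this slide
TOTAL_CPUS = 48                    # from this slide
ASSUMED_HSW_PEAK_DP_TFLOPS = 0.5   # ASSUMPTION: hypothetical per-socket peak


def node_count(total_devices, devices_per_node):
    """Number of nodes implied by a device total, checking it divides evenly."""
    nodes, remainder = divmod(total_devices, devices_per_node)
    if remainder:
        raise ValueError("device total is not a whole number of nodes")
    return nodes


def gpu_flops_fraction(gpu_peak, gpus, cpu_peak, cpus):
    """Share of a node's peak FLOPS that comes from its GPUs."""
    gpu_total = gpu_peak * gpus
    return gpu_total / (gpu_total + cpu_peak * cpus)


def max_cpu_peak_for_fraction(gpu_peak, gpus, cpus, fraction):
    """Largest per-socket CPU peak for which GPUs still supply `fraction`."""
    gpu_total = gpu_peak * gpus
    return gpu_total * (1.0 - fraction) / fraction / cpus


if __name__ == "__main__":
    print("Nodes from GPU total:", node_count(TOTAL_GPUS, GPUS_PER_NODE))
    print("Nodes from CPU total:", node_count(TOTAL_CPUS, CPUS_PER_NODE))
    share = gpu_flops_fraction(
        K80_PEAK_DP_TFLOPS, GPUS_PER_NODE, ASSUMED_HSW_PEAK_DP_TFLOPS, CPUS_PER_NODE
    )
    print(f"GPU share of node peak (assumed CPU peak): {share:.1%}")
    limit = max_cpu_peak_for_fraction(K80_PEAK_DP_TFLOPS, GPUS_PER_NODE, CPUS_PER_NODE, 0.9)
    print(f"CPU peak per socket must stay below {limit:.2f} TFLOPS for a 90% GPU share")
```

Both device totals imply the same number of nodes, and the last figure does not depend on the assumed CPU peak: it is the threshold below which the slide's claim holds for peak double-precision FLOPS.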

9 MeteoSwiss Operational COSMO-E Benchmark
[Chart: speedup vs. original code; values not preserved]
o Cray XC40, original code, node = 2 x HSW
o Cray CS Storm, refactored code, node = 2 x HSW + 8 x K80

10 MeteoSwiss Operational COSMO-E Benchmark
[Chart: speedup vs. original code and speedup vs. refactored code; values not preserved]
o Cray XC40, original code, node = 2 x HSW
o Cray XC40, refactored code, node = 2 x HSW
o Cray CS Storm, refactored code, node = 2 x HSW + 8 x K80

11 COSMO Dycore Speedups on P100 GPU
MeteoSwiss GPU Branch of COSMO Model, Dycore Only
[Bar chart: speed-up vs. dual-socket Haswell, axis from 0x to 50x]

Configuration        Speed-up
COSMO 2x HSW CPU     baseline
2x K80 (4 x GPU)     3x
2x P100              7x
4x P100              14x
8x P100              27x

Socket-to-socket: P100 vs. HSW = 3.5x
Test setup: results from NVIDIA Internal Cluster (US) (Preliminary Mar 2016); COSMO 5.3 MCH branch; 128x128, 80 x vertical; 10 time steps; CPU: x86 Xeon Haswell; GPU: Tesla P100, use of 8-GPU single node; CUDA 8
Link on slide (truncated): http://www.cosmo-model.

12 Update on DOE Pre-Exascale CORAL Systems
US DOE CORAL Systems
o ORNL Summit at 200 PF, early 2018
o LLNL Sierra at 150 PF, mid-2018
o Nodes of POWER9 + Tesla Volta GPUs
o NVLink interconnect for CPUs + GPUs
ORNL Summit System (based on original 150 PF plan)
o Approximately 3,400 total nodes
o Each node 40+ TF peak performance
o About 1/5 of total #2 Titan nodes (18K+)
o Same energy used as #2 Titan (27 PF)
CORAL Summit System: 5-10x faster than Titan, with 1/5th the nodes and the same energy use as Titan (based on original 150 PF)
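The Summit figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers from the deck (the Titan GPU count of 18,688 comes from the hardware introduction slide and is taken here as the Titan node count, which is an assumption); the variable and function names are illustrative.

```python
SUMMIT_NODES = 3400            # "approximately 3,400 total nodes"
SUMMIT_PLAN_PF = 150.0         # original plan
SUMMIT_UPDATED_PF = 200.0      # updated figure on the slide
TITAN_PF = 27.0
TITAN_NODES = 18688            # ASSUMPTION: one GPU per Titan node


def per_node_tflops(system_pf, nodes):
    """Average peak per node in TFLOPS (1 PF = 1000 TF)."""
    return system_pf * 1000.0 / nodes


def ratio(a, b):
    """Plain ratio, kept as a function so the intent reads clearly below."""
    return a / b


if __name__ == "__main__":
    print(f"Per-node peak at 150 PF: {per_node_tflops(SUMMIT_PLAN_PF, SUMMIT_NODES):.1f} TF")
    print(f"Summit/Titan node ratio: {ratio(SUMMIT_NODES, TITAN_NODES):.2f}")
    print(f"Speed-up at 150 PF: {ratio(SUMMIT_PLAN_PF, TITAN_PF):.1f}x")
    print(f"Speed-up at 200 PF: {ratio(SUMMIT_UPDATED_PF, TITAN_PF):.1f}x")
```

The per-node result matches "40+ TF", the node ratio is a little under one fifth, and both peak-performance ratios fall inside the quoted 5-10x range.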

13 Programming Strategies for GPU Acceleration
Three ways to accelerate applications, in order of increasing development effort:
o GPU Libraries: provides fast drop-in acceleration
o OpenACC Directives: GPU acceleration in standard language (Fortran, C, C++)
o Programming in CUDA: maximum flexibility with GPU architecture and software features
NOTE: Many application developments include a combination of these strategies

14 IndeX: Scalable Rendering for Volume Visualization
o Leverages GPU clusters for large-scale (volume) data visualization and interactive visual computing
o Example dataset: 1.8 billion cells (time step count not preserved)
o Commercial software solution available and deployed for in-situ visualization of large-scale data
o Plugin for ParaView under development and available soon
Dataset courtesy of Prof. Leigh Orf, UW-Madison and Rob Sisneros, NCSA

15 GPUs at Convergence of Data and HPC in ESM
Fusion of Observations from Machine Learning with the Model
o Yandex developments of ML + model for hyperlocal NWP with WRF: "Yandex Introduces Hyperlocal Weather Forecasting Service Based on Machine Learning Technology"
o DL dominant topic at NCAR workshop Climate Informatics 2015
o IBM acquisition of The Weather Company: applied data analytics
Data Assimilation: Next Phase Following Model Development
o 4DVAR GPU development success with MeteoSwiss and others
o RIKEN study of 10,240 member ensemble with NICAM (Miyoshi, et al.): largest ensemble simulation of global weather using real-world data

16 WRF GPU UPDATE
CUDA: TQI/SSEC Commercial WRF
o TempoQuest plans for CUDA WRF-based software product
o NVIDIA providing standard engineering guidance
OpenACC: NVIDIA Open WRF Project
o Migrating routines to 3.8
o Initial projections of P100 GPU very good
o Working towards unified memory capability
o PGI compiler continues to improve/mature
Potential for Full Model WRF on GPUs
o Several months away, hybrid in near term
o P100 GPU will improve hybrid approach
o UM + NVLink will improve data transfer times
o P100 memory bandwidth 3x vs. Kepler-series

17 Questions?
Stan Posey, Carl Ponder
