RECENT UPDATES ON ACCELERATING COMPUTING PLATFORM PRADEEP GUPTA SENIOR SOLUTIONS ARCHITECT, NVIDIA

1 RECENT UPDATES ON ACCELERATING COMPUTING PLATFORM PRADEEP GUPTA SENIOR SOLUTIONS ARCHITECT, NVIDIA

2 THE WORLD LEADER IN VISUAL COMPUTING
Gaming, Auto, Enterprise, HPC & Cloud, OEM & IP

3 GPU Accelerator Redefined Parallel Computing in HPC
[Chart: # of GPU Developers, annotated with milestones]
  Summit & Sierra: U.S. Announces Two Pre-Exascale Supercomputers Powered by GPU & NVLink
  Breakthrough in HIV Research: World's Largest Simulation of Virus Uncovers New Discovery
  Deep Learning: Univ. of Toronto Team Uses GPUs to Win ImageNet Competition, Google Acquires Team
  Oak Ridge TITAN: World's Fastest Supercomputer
  Top500: 3 of Top 5 Supercomputers with Tesla GPUs
  Tsubame: World's First GPU Supercomputer
  NVIDIA Launches CUDA

4 Vision: Mainstream Parallel Programming
Enable more programmers to write portable parallel software in their language of choice
Embrace and evolve standards in key languages
CUDA continues to evolve as the target low-level platform for GPU acceleration

5 Three Ways to Accelerate Your Application
Applications
  Libraries: Drop-in Acceleration
  Directives: Annotate code with compiler hints
  Languages: Modern language features (unified memory, for_each, lambda)

com/sites/default/files/resources/openacc_213462.

com/off-the-wire/first-round-of-2015-hackathons-gets-underway

6 OpenACC: Simple, Powerful, Portable
Fueling the Next Wave of Scientific Discoveries in HPC

    main() {
      <serial code>
      #pragma acc kernels  // automatically runs on GPU
      {
        <parallel code>
      }
    }

RIKEN Japan, NICAM Climate Modeling: 7-8x Speed-Up, 5% of Code Modified
University of Illinois, PowerGrid MRI Reconstruction: 70x Speed-Up, 2 Days of Effort
Developers using OpenACC
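The `#pragma acc kernels` pattern on this slide can be sketched as a complete function. This is a minimal illustration under stated assumptions, not code from the talk: the function name `saxpy_acc` and the data clauses are chosen for the example. With an OpenACC-capable compiler the loop may be offloaded to the GPU; a compiler without OpenACC support ignores the pragmas and runs the loop serially, so the result should be the same either way.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: y = a * x + y, with the loop marked for OpenACC.
void saxpy_acc(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();  // raw pointers so the data clauses can name array sections
    float* yp = y.data();

    // Serial code runs above; the region below is what the compiler may offload.
    #pragma acc kernels copyin(xp[0:n]) copy(yp[0:n])
    {
        #pragma acc loop independent  // iterations do not depend on each other
        for (int i = 0; i < n; ++i) {
            yp[i] = a * xp[i] + yp[i];
        }
    }
}
```

Only the two pragma lines are OpenACC-specific, which is the point the slide makes about the small share of code that needs to change.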

7 Introducing the NVIDIA OpenACC Toolkit
Free Toolkit Offers Simple & Powerful Path to Accelerated Computing
  PGI Compiler: Free OpenACC compiler for academia
  NVProf Profiler: Easily find where to add compiler directives
  Code Samples: Learn from examples of real-world algorithms
  Documentation: Quick start guide, Best practices, Forums
Download at

8 Three Ways to Accelerate Your Application
Applications
  Libraries: Drop-in Acceleration
  Directives: Annotate code with compiler hints
  Languages: Modern language features (unified memory, for_each, lambda)

9 Unified Memory: Simpler & Faster with NVLink
  Traditional Developer View: System Memory, GPU Memory
  Developer View With Unified Memory: Unified Memory
  Developer View With Pascal & NVLink: Unified Memory, NVLink 80 GB/s
Share Data Structures at CPU Memory Speeds, not PCIe speeds
Oversubscribe GPU Memory

10 FAMILIAR CODING STYLE, SINGLE CODE PATH
Build parallel algorithms with C++ Parallel for *

CPU: Sequential for_each()

    void saxpy(int N, float a, float *x, float *y) {
        using namespace std;
        auto r = range(0, N);
        for_each(begin(r), end(r), [=] (int i) {
            y[i] = a * x[i] + y[i];
        });
    }

CPU/GPU: Thrust Parallel for_each()

    void saxpy(int N, float a, float *x, float *y) {
        using namespace thrust;
        auto r = counting_iterator<int>(0);
        for_each(device, r, r+N, [=] __device__ (int i) {
            y[i] = a * x[i] + y[i];
        });
    }

* Available today as an experimental feature in CUDA
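The `range(0, N)` helper in the sequential variant is not part of the C++ standard library, so that listing reads as slide pseudocode. A sketch of the same lambda-based pattern in standard C++ follows, assuming a materialized index vector in place of `range`; the name `saxpy_for_each` is hypothetical and not from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical CPU-only sketch of the slide's for_each pattern: y = a * x + y.
void saxpy_for_each(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());

    // Stand-in for the slide's range(0, N): an explicit list of indices 0..N-1.
    std::vector<int> idx(x.size());
    std::iota(idx.begin(), idx.end(), 0);

    const float* xp = x.data();
    float* yp = y.data();

    // Same lambda body as on the slide; the pointers are captured by value.
    std::for_each(idx.begin(), idx.end(), [=](int i) {
        yp[i] = a * xp[i] + yp[i];
    });
}
```

The index vector costs extra memory that a lazy range or a counting iterator avoids, which is one reason the Thrust variant uses `counting_iterator<int>`.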

11 Portable, High-level Parallel Code TODAY
Thrust library allows the same C++ code to target both:
  NVIDIA GPUs
  x86, ARM and POWER CPUs
Thrust was the inspiration for a proposal to the ISO C++ Committee
Committee voted unanimously to accept as official tech. specification working draft N3960
Technical Specification Working Draft:
Prototype:

12 Tesla Platform
Tesla GPU, NVLink, IBM Power, ARM

13 Tesla Accelerated Computing Platform
Data Center Infrastructure: System Solutions, Communication, Infrastructure Management
Development: Programming Languages, Development Tools, Software Solutions
  GPU Accelerators: GPU Boost
  Interconnect: GPU Direct, NVLink
  System Management: NVML / NVIDIA-SMI
  Compiler Solutions: LLVM
  Profile and Debug: CUDA Debugging API
  Libraries: cuBLAS

14 TESLA K80: WORLD'S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING
Dual-GPU Accelerator for Max Throughput: 2x Faster, 2.9 TF, 4992 Cores, 480 GB/s
Double the Memory, Designed for Big Data Apps: 24GB (K40: 12GB)
Maximum Performance: Dynamically Maximize Perf for Every Application
[Chart: speed-up from 0x to 25x for CPU, Tesla K40 and Tesla K80 across Deep Learning (Caffe), Oil & Gas, HPC, Viz, Data Analytics]
Caffe Benchmark: AlexNet training throughput based on 20 iterations, CPU: 2.70GHz, 64GB System Memory, CentOS 6.2. Peak Perf with GPU Boost on

15 330+ GPU-Accelerated Applications

16 NVLink High-speed GPU Interconnect
[Diagram: 2014, KEPLER GPU attached over PCIe to X86, ARM64 and POWER CPUs; PASCAL GPU with NVLink between GPUs and to a POWER CPU, and PCIe to X86 and ARM64 CPUs]

17 Major Data Center OEMs Support NVLink

18 US to Build Two Flagship Supercomputers Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale

19 Accelerated Computing: 5x Higher Energy Efficiency
GB/s
IBM POWER CPU: Most Powerful Serial Processor
NVIDIA NVLink: Fastest CPU-GPU Interconnect
NVIDIA Volta GPU: Most Powerful Parallel Processor

20 IBM HPC Application Update
  184 Total applications planned for port
  108 Total apps ported to POWER
  34 POWER + GPU port complete
  24 POWER + GPU port in process
  13 Libraries and benchmarks complete
330+ GPU-Accelerated Applications

21 NAMD on POWER + GPU
[Figure: two bar charts of performance ratio, "NAMD Relative performance (STMV)" and "NAMD Relative performance (apoa01)"]
Configurations: 1-Haswell, 2-Power8 42L, 3-Power8 42A, 4-Haswell & 2x K40, 5-Power8 42L & 2x K40
Bar labels as they appear for STMV: 100%, 132%, 151%, 371%, 395%
Bar labels as they appear for APOA01: 100%, 141%, 169%, 414%, 441%

22 PGI FOR OPENPOWER + TESLA
Feature parity with PGI Compilers on Linux/x86+Tesla
CUDA Fortran, OpenACC, OpenMP, CUDA C/C++ host compiler
Integrated with IBM's optimized LLVM / Power code generator
Limited access in 2015, Beta 1H 2016, Production in 2016
x86 Recompile

23 Enterprise Services for Premier Tesla Support
Maximize Uptime & Efficiency for GPU Deployments in the Data Center
  Rapid response & timely issue resolution
  Long-term support & maintenance
  Direct communication w/ tech. experts
  On-Site consultation, training and more for subscribers
[Chart: Rapid Response to Critical Issues; Avg. Days for Public Release, Maintenance Release, Hot-Fix Release]

24 CUDA 7.5
16-bit Floating-Point Storage
  2x larger datasets in GPU memory
  Great for Deep Learning
cuSPARSE Dense Matrix * Sparse Vector
  Speeds up Natural Language Processing
Instruction-Level Profiling
  Pinpoint performance bottlenecks
  Easier to apply advanced optimizations
*Experimental* GPU Lambdas
Release Schedule: 7/6: Release Candidate, ~Sept: Production Release
NVIDIA Confidential. For use under NDA
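The "2x larger datasets" claim for 16-bit storage follows from each value taking 2 bytes instead of 4. The host-side sketch below illustrates that idea only; it is not CUDA's half-precision API, and the names `float_to_half_bits`, `half_bits_to_float` and `pack_half` are hypothetical. It also makes simplifying assumptions that are labeled in the comments: the mantissa is truncated rather than rounded, and values too small for a normal half are flushed to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Convert a 32-bit float to IEEE 754 binary16 bits.
// Assumptions of this sketch: truncation instead of rounding, subnormals flushed to zero.
std::uint16_t float_to_half_bits(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t exp32 = (x >> 23) & 0xFFu;
    const std::uint32_t mant = x & 0x7FFFFFu;
    if (exp32 == 0xFFu)  // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mant ? 0x0200u : 0u));
    const int exp16 = static_cast<int>(exp32) - 127 + 15;  // re-bias the exponent
    if (exp16 >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);  // too large: infinity
    if (exp16 <= 0) return static_cast<std::uint16_t>(sign);             // too small: signed zero
    return static_cast<std::uint16_t>(
        sign | (static_cast<std::uint32_t>(exp16) << 10) | (mant >> 13));
}

// Convert binary16 bits back to a 32-bit float (subnormal inputs are treated as zero).
float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = (static_cast<std::uint32_t>(h) & 0x8000u) << 16;
    const std::uint32_t exp16 = (h >> 10) & 0x1Fu;
    const std::uint32_t mant = h & 0x3FFu;
    std::uint32_t x;
    if (exp16 == 0) x = sign;
    else if (exp16 == 31) x = sign | 0x7F800000u | (mant << 13);
    else x = sign | ((exp16 - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &x, sizeof f);
    return f;
}

// Pack a float array into 16-bit storage: same element count, half the bytes.
std::vector<std::uint16_t> pack_half(const std::vector<float>& in) {
    std::vector<std::uint16_t> out;
    out.reserve(in.size());
    for (float f : in) out.push_back(float_to_half_bits(f));
    assert(out.size() == in.size());
    return out;
}
```

The trade-off is precision and range: a half keeps 10 explicit mantissa bits and overflows above roughly 65504, which is why the slide frames it as a storage format.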

25 AN AWESOME DEVELOPER PLATFORM
GeForce GTX TITAN, TITAN Black, TITAN Z and TITAN X
GeForce TITAN series GPUs now support:
  TCC Mode
  Multi-Process Service (MPS)
  CUDA Stream Priorities
  All relevant nvidia-smi commands
* Most of these features will work across the entire GeForce product line

26 WINDOWS REMOTE DESKTOP
CUDA will work with Remote Desktop starting r352
CUDA apps will be able to run as a service on Windows
Will work across all GPU products supported on Windows

27 ADDITIONAL IMPROVEMENTS
See release notes and documentation for more details
  64-bit API for cuFFT
  n-dimensional Euclidean norm floating-point math functions
  Bayer CFA to RGB conversion functions in NPP
  Faster double-precision square-roots (sqrt)
  CUDA Samples for the cuSOLVER library
  Nsight Eclipse Edition supported on POWER platform
  Nsight Eclipse Edition supports multiple CUDA Toolkit versions
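For readers unfamiliar with the n-dimensional Euclidean norm mentioned above, it is the square root of the sum of squared components. The host-side sketch below shows one common way to compute it; it is not the CUDA math function itself, the name `euclidean_norm` is hypothetical, and the scaling by the largest magnitude is an assumption of this example, used to avoid overflow in the intermediate squares.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: sqrt(v[0]^2 + ... + v[n-1]^2) for an n-dimensional vector.
double euclidean_norm(const std::vector<double>& v) {
    // Find the largest magnitude so every component can be scaled into [-1, 1].
    double scale = 0.0;
    for (double e : v) {
        assert(!std::isnan(e));  // NaN inputs are outside the scope of this sketch
        scale = std::max(scale, std::fabs(e));
    }
    if (scale == 0.0) return 0.0;          // empty or all-zero vector
    if (std::isinf(scale)) return scale;   // any infinite component gives an infinite norm

    // Sum the squares of the scaled components, then undo the scaling.
    double sum = 0.0;
    for (double e : v) {
        const double r = e / scale;
        sum += r * r;
    }
    return scale * std::sqrt(sum);
}
```

Squaring the raw components directly would overflow for inputs near 1e200, whereas the scaled form stays finite for such values.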

28 x86_64 Platform Support
Linux: RHEL & CentOS 6, 7; Fedora 21 Workstation; SLES 11 SP3, 12; OpenSUSE 13.2; Ubuntu LTS,
Windows: 7, 8.1, 10, Server 2008 R2, 2012 R2; Visual Studio 2010, 2012, 2013 [CE]
Mac: OSX 10.9, 10.10, (~Sept)
Alternative Linux host compilers: Clang 3.5, 3.6; Intel icc; PGI pgc (+)
CUDA 7.5 drops support for:
  Ubuntu LTS on x86
  cuda-gdb native debugging on Mac
CUDA 7.5 announces deprecation of:
  Legacy profiler: Use nvprof instead
  Microsoft Visual Studio 2010 support
These will be dropped in a future release

29 Power8 Platform Support
Linux: RHEL 7.2, Ubuntu
Alternative Linux POWER8 compilers: IBM xlc/xlc 13.1.x

30 THANK YOU

31 Backup

32 Industry Momentum

33 Widespread Use of GPUs in Climate & Weather
Model categories: Climate, Weather, Ocean

Model                    GPU Approach             Collaboration
NICAM                    OpenACC                  RIKEN, TiTech
CAM-SE (ACME, CESM)      OpenACC, CUDA            DOE (ORNL, SNL), PGI
WRF                      OpenACC (1), CUDA (2)    (1) NCAR-MMM, (2) SSEC
COSMO                    OpenACC, CUDA            CSCS, MeteoSwiss, PGI
NIM                      OpenACC, F2C-ACC         NOAA-ESRL, PGI
ICON                     OpenACC                  CSCS, MPI-M, PGI
IFS                      OpenACC                  ECMWF, CSC-FI
MPAS-A                   OpenACC                  NCAR, NOAA-ESRL
JMA-GSM, 4DVAR, ASUCA    OpenACC, CUDA, H-F       JMA, Hitachi, TiTech
NEMO                     OpenACC                  STFC

Additional Evaluations
  USA: GEOS-5, HiRAM, HYCOM, MOM6, COAMPS, MPAS-O, CICE, ROMS, OLAM
  Europe: DYNAMICO, HARMONIE, UM/GungHo/GOcean, ECHAM6
  Asia-Pacific: GRAPES, KWRF, CFSv2 (IN)

34 MeteoSwiss GPU-Driven Weather Prediction
MeteoSwiss COSMO NWP Configurations Since 2008 (Before GPUs)
  IFS from ECMWF: 2 per day, 10 day forecast
  COSMO 7 (6.6 KM): 3 per day, 3 day forecast
  COSMO 2 (2.2 KM): 8 per day, 24 hr forecast
MeteoSwiss COSMO NWP Configurations During 2016 (With GPUs)
  IFS from ECMWF: 2 per day, 10 day forecast
  COSMO E (2.2 KM): 2 per day, 5 day forecast
  COSMO 1 (1.1 KM): 8 per day, 24 hr forecast
New configurations of higher resolution and ensemble predictions possible owing to the performance-per-energy gains from GPUs
X. Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015

35 "The hardware needed to emulate the human brain may be ready even sooner than he predicted, in around 2020, using technologies such as graphics processing units (GPUs), which are ideal for brain-software algorithms."
Interview with Ray Kurzweil, Director of Engineering at Google & Renowned Futurist
Washington Post, April 23,

36 GPUs Turbocharging Data Science
"We love GPU cards. We just use a lot of them." Jeff Dean, Google
"In five years, we think 50% of queries will be speech or images." Andrew Ng, Baidu

37 Competitive Update

38 Phi Struggles with Real World Apps
  Multigrid Solver: ~66% slower on Phi (Relative to Sandy Bridge)
  BerkeleyGW MS code: Phi less than 20% faster (Relative to Sandy Bridge)
  FLASH code: 65% slower on Phi (Relative to Sandy Bridge)

39 EXISTING WORKLOADS DON'T USE PHI
TACC Stampede Utilization
  6400 Nodes
  Phi Provides >75% of FLOPs
  Less than 3% of node hours executed in Phi queues
Sources:

40 330+ GPU-Accelerated Applications

41 Three Ways to Accelerate Your Application
Applications
  Libraries: Drop-in Acceleration
  Directives: Annotate code with compiler hints
  Languages: Modern language features (unified memory, for_each, lambda)

42 5X-10X SPEEDUP USING NVIDIA LIBRARIES
BLAS, LAPACK, SPARSE, FFT, Math, Deep Learning, Graphs, Image & Signal Processing

43 NVLink Unleashes Multi-GPU Performance
Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe
GPUs Interconnected with NVLink: 5x Faster than PCIe Gen3 x
[Diagram: CPU, PCIe Switch, two TESLA GPUs]
[Chart: Speedup vs PCIe based Server, 1.25x to 2.25x, for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, 3D FFT]
To learn more:
3D FFT, ANSYS: 2 GPU configuration. All other apps comparing 4 GPU configuration. AMBER Cellulose (256x128x128), FFT problem size (256^3)
