NVIDIA GPU TECHNOLOGY UPDATE

1 NVIDIA GPU TECHNOLOGY UPDATE May 2015 Axel Koehler Senior Solutions Architect, NVIDIA

2 NVIDIA: The VISUAL Computing Company
GAMING, DESIGN, ENTERPRISE VIRTUALIZATION, HPC & CLOUD SERVICE PROVIDERS, AUTONOMOUS MACHINES
PC, DATA CENTER, MOBILE

3 Tesla Accelerated Computing Platform

4 Tesla GPU Accelerators for 2015
Tesla K40: Best Single GPU Performance; Server, Workstation, Liquid Cooled; Higher Ed, Data Analytics, HPC Labs, Defense; Double Precision Workloads
Tesla K80: Maximize Throughput within a Server; Server; Seismic, Data Analytics, HPC Labs, Defense; Multi-GPU Accelerated Apps; Single and Double Precision Workloads

5 Tesla K40 / K80
                               K40                                        K80
GPU                            GK110B                                     GK210
Peak SP (base clock)           4.29 TFLOPS                                ~5.6 TFLOPS
Peak DP (per board)            1.43 TFLOPS (Base), 1.68 TFLOPS (Boost)    ~1.87 TFLOPS (Base), ~2.7 TFLOPS (Boost)
# of GPUs                      1                                          2
# of CUDA Cores/board          (missing)                                  (missing)
PCIe Gen                       Gen 3                                      Gen 3
GDDR5 Memory Size (per board)  12 GB                                      24 GB
Memory Bandwidth               288 GB/s                                   ~480 GB/s
GPUBoost                       2 Levels                                   >10 Levels
Power                          235W                                       300W
Form Factors                   PCIe Active, PCIe Passive                  PCIe Passive

6 Avg GPU Power in Watts for Real Applications on K20X
[Chart: board power (Watts) and average GPU power per application: AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D]

7 GPU Boost
GPU Boost K40: Base Clock 745 MHz, Boost Clock #1 810 MHz, Boost Clock #2 875 MHz. All workloads stay within 235W: Workload #1 (worst-case reference app), Workload #2 (e.g. AMBER), Workload #3 (e.g. ANSYS Fluent).
Dynamic GPU Boost K80: Boost 875 MHz; most CUDA apps run at boost clocks; 40-50% more flops with Boost. Base 560 MHz, 1.87 Teraflops; DGEMM-heavy apps run at base clocks. Zero idle GPU clock.
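The K40 clock levels on this slide can be tied back to the peak double-precision figures on the Tesla K40 / K80 slide with a short calculation. This is a minimal sketch: the count of 960 FP64 units per K40 (15 SMX with 64 units each) is an assumption that does not appear on the slides, and the names `FP64_UNITS_K40` and `peak_dp_tflops` are purely illustrative.

```python
# Theoretical peak FP64 throughput of a Tesla K40 at each GPU Boost clock.
# ASSUMPTION (not on the slides): 960 FP64 units per K40 (15 SMX x 64 units),
# each retiring one fused multiply-add, i.e. 2 floating-point ops, per cycle.
FP64_UNITS_K40 = 960
FLOPS_PER_FMA = 2

# Clock levels quoted on the GPU Boost slide, in MHz.
K40_CLOCKS_MHZ = {"base": 745, "boost1": 810, "boost2": 875}


def peak_dp_tflops(clock_mhz, fp64_units=FP64_UNITS_K40):
    """Return the theoretical peak FP64 rate in TFLOPS at the given clock."""
    return fp64_units * FLOPS_PER_FMA * clock_mhz * 1e6 / 1e12


for name, mhz in K40_CLOCKS_MHZ.items():
    print(f"{name:7s} {mhz} MHz -> {peak_dp_tflops(mhz):.2f} TFLOPS")
```

Under that assumption, the base and top boost clocks give roughly 1.43 and 1.68 TFLOPS, which is consistent with the K40 column of the comparison slide.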

8 GPU Roadmap

9 Pascal GPU Features: NVLINK and Stacked Memory
NVLINK: GPU high speed interconnect; 5x PCIe bandwidth; move data at CPU memory speed; 3x lower energy/bit
3D Stacked Memory: 4x higher bandwidth (~1 TB/s); 3x larger capacity; 4x more energy efficient per bit
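To put the bandwidth multipliers above in perspective, the sketch below estimates idealized transfer times for a fixed payload. The PCIe Gen3 x16 rate of about 15.75 GB/s per direction is an assumption that is not stated on the slides; the 12 GB payload is simply the K40 memory size from the comparison slide; all names are illustrative.

```python
# Rough transfer-time comparison for the interconnect claims on the slide.
PCIE_GEN3_X16_GB_S = 15.75   # ASSUMPTION: nominal PCIe Gen3 x16 rate, one direction
NVLINK_FACTOR = 5            # "5x PCIe bandwidth" (from the slide)
STACKED_MEMORY_GB_S = 1000   # "~1 TB/s" (from the slide)


def transfer_seconds(size_gb, bandwidth_gb_s):
    """Idealized time to move size_gb gigabytes; ignores latency and overhead."""
    return size_gb / bandwidth_gb_s


payload_gb = 12  # memory size of one Tesla K40 board
links = {
    "PCIe Gen3 x16": PCIE_GEN3_X16_GB_S,
    "NVLink (5x PCIe)": PCIE_GEN3_X16_GB_S * NVLINK_FACTOR,
    "3D stacked memory": STACKED_MEMORY_GB_S,
}
for name, bandwidth in links.items():
    millis = transfer_seconds(payload_gb, bandwidth) * 1000
    print(f"{name:18s} {bandwidth:7.2f} GB/s  {millis:7.1f} ms")
```

These are upper bounds on achievable speed, since real transfers also pay for latency and protocol overhead.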

10 Unified Memory: Dramatically Lower Developer Effort
Developer View Without Unified Memory: System Memory, GPU Memory
Developer View With Unified Memory: Unified Memory

11 NVLink and Unified Memory Enable Data Transfer At Speed of CPU Memory

12 Move Data Where It Is Needed, Fast: Accelerated Communication
GPUDirect P2P: Multi-GPU scaling; fast GPU communication; fast GPU memory access
GPUDirect RDMA: Fast access to other nodes; eliminate CPU latency; eliminate CPU bottleneck
NVLINK: 2x app performance; 5x faster than PCIe; fast access to system memory

13 Improving GPUDirect RDMA
GPUDIRECT RDMA (GPU, HCA, IOH, CPU): CPU synchronizes with GPU tasks; CPU prepares and queues communication tasks on HCA; HCA directly accesses GPU memory
GPUDIRECT ASYNC (GPU, HCA, IOH, CPU): CPU prepares and queues communication tasks on GPU; GPU triggers communication on HCA; HCA directly accesses GPU memory
SC14 TALK AT MELLANOX BOOTH

14 Developer Platform With Open Ecosystem: Accelerate Applications Across Multiple CPUs
Libraries (AmgX, cuBLAS); Compiler Directives; Programming Languages
/ x86

15 Drop-in Acceleration with GPU Libraries
Speedups out of the box: AmgX, cuRAND, cuSPARSE, cuBLAS, cuFFT, NPP, MATH
Linear performance scaling with XT libraries
cuBLAS-XT: Machine Learning, O&G, Material Science, Defense, Supercomputing
cuFFT-XT: O&G, Molecular Dynamics, Defense
AmgX: CFD, Supercomputing, O&G Reservoir Sim

16 CUDA 7 New Features
C++11 feature support: auto, lambda, std::initializer_list, variadic templates, static_assert, constexpr, rvalue references, range-based for loops
Runtime Compilation (RTC)
cuSOLVER library: routines for solving sparse and dense linear systems and eigenproblems; three APIs: Dense, Sparse, Refactorization
Thrust improvements: device-side Thrust, API support for CUDA streams, performance
HyperQ/MPI (MPS): Multiple GPUs per Node

17 CUDA 7: Supported C++11 Features
C++11 language features enabled, including: auto, lambda, std::initializer_list, variadic templates, static_assert, constexpr, rvalue references, range-based for loops
Not supported: thread_local; standard libraries (std::thread, etc.)

18 CUDA 7.0 Runtime Compilation
Compile CUDA kernel source at run time; compiled kernels can be cached on disk
Runtime C++ code specialization: optimize code based on run-time data; reduce compile time and compiled code size
Enables runtime code generation, C++ template-based DSLs
Diagram labels: Application; __global__ foo(..) { .. }; Runtime Compilation Library (libnvrtc); Compiled Kernel; // launch foo()

19 cuSOLVER
cusolverDN: Dense Cholesky, LU, SVD, QR. Optimization, Computer vision, CFD
cusolverSP: Sparse direct solvers & eigensolvers. Newton's method, Chemical kinetics
cusolverRF: Sparse refactorization solver. Chemistry, ODEs, Circuit simulation
[Chart: cusolverDN speedup over CPU, M=N=4096: SPOTRF, DPOTRF, CPOTRF, ZPOTRF]
[Chart: cusolverSP speedup over CPU: mhd4800b, ex33, Muu, gyro_m]
Configuration: cuSOLVER 7.0, MKL, SuiteSparse; K40, i7-3930K 3.20GHz
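The POTRF routines benchmarked above compute a Cholesky factorization. As a reminder of what that operation does, here is a small pure-Python sketch that runs on the CPU. It is not the cuSOLVER API and makes no claim about how cuSOLVER implements the factorization; the function names `cholesky` and `solve_spd` are illustrative only.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L * L^T for a symmetric positive-definite A."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns already computed.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def solve_spd(a, b):
    """Solve A x = b via Cholesky: forward substitution, then back substitution."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
print(cholesky(A))
print(solve_spd(A, [-20.0, -43.0, 192.0]))
```

A dense factorization like this costs on the order of n^3 operations, which is why the M=N=4096 case is a natural candidate for GPU acceleration.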

20 CUDA 7: HyperQ/MPI (MPS): Multiple GPUs per Node
Diagram: CUDA MPI Rank 0, CUDA MPI Rank 1, CUDA MPI Rank 2, CUDA MPI Rank 3; MPS Server; GPU 0, GPU 1
MPS Server efficiently overlaps work from multiple ranks to each GPU

    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
    case ${lrank} in
    [0]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
    [1]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
    [2]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
    [3]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
    esac
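The case statement above assigns ranks to GPUs round-robin. The same mapping can be sketched as a small function, assuming (as the script does) that GPU i is attached to NUMA node i. The helper names `placement` and `launch_command` are hypothetical, and nothing is actually launched here.

```python
import os


def placement(local_rank, num_gpus=2):
    """Map an MPI local rank to (GPU index, NUMA node), round-robin.

    ASSUMPTION: GPU i sits on NUMA node i, as in the slide's script.
    """
    gpu = local_rank % num_gpus
    return gpu, gpu


def launch_command(local_rank, executable="./executable", num_gpus=2):
    """Build the environment and argv for one rank without running anything."""
    gpu, numa_node = placement(local_rank, num_gpus)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    argv = ["numactl", f"--cpunodebind={numa_node}", executable]
    return env, argv


# Reproduce the four cases from the slide.
for rank in range(4):
    env, argv = launch_command(rank)
    print(rank, "CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Writing the mapping as `local_rank % num_gpus` avoids editing a case statement each time the number of ranks per node changes.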

21 OpenACC Timeline
2008: PGI Accelerator Model (targeting NVIDIA GPUs)
2011: OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs): data regions, compute regions, gang/worker/vector
2013: OpenACC 2.0: procedures, dynamic data lifetimes
2015: OpenACC 2.5: minor fixes, additions
2015/16: OpenACC 3.0: deep copy

22 Vision: Mainstream Parallel Programming
Enable more programmers to write parallel software; give programmers the choice of language to use; embrace and evolve key programming standards
C

25 Mixed Precision Computation

26 Mixed Precision Computation
Half precision (fp16) data type in addition to single (fp32), double (fp64)
fp16: half the bandwidth, twice the throughput
Format: s1e5m10; range ~6*10^-8 to 6*10^4, as it includes denormals
Limitations: limited precision (11-bit mantissa); vector operations only (32-bit register holds 2 fp16 values)
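The format and range figures on this slide can be checked on the CPU with Python's `struct` module, whose `e` format code is IEEE 754 binary16. This is only a host-side sketch of the number format, not CUDA's fp16 support; the helper names `half_from_bits` and `to_half` are illustrative.

```python
import struct


def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE 754 binary16 (s1e5m10) value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_half(x):
    """Round a Python float to the nearest fp16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


largest = half_from_bits(0x7BFF)             # largest finite value
smallest_subnormal = half_from_bits(0x0001)  # smallest positive denormal

print(largest)             # 65504.0, i.e. about 6*10^4
print(smallest_subnormal)  # 2**-24, i.e. about 6*10^-8

# 11 significant bits: every integer up to 2**11 = 2048 is exact, 2049 is not.
print(to_half(2048.0), to_half(2049.0))

# Two fp16 values fit in one 32-bit word.
packed = struct.pack("<2e", 1.0, -2.0)
print(len(packed), hex(struct.unpack("<I", packed)[0]))
```

The 11-bit figure counts the 10 stored mantissa bits plus the implicit leading one, which is why 2049 rounds back to 2048.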

27 FP16 Support in CUDA

28 Deep Learning using Deep Neural Networks: NVIDIA cuDNN Library
Low-level library of GPU-accelerated routines; out-of-the-box speedup of neural networks; developed and maintained by NVIDIA
Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 Exaflops, ~30 GPU days
First release focused on Convolutional Neural Networks; already part of major open-source frameworks: Caffe, Torch, Theano
cudnn@nvidia.com

29 DIGITS: Interactive Deep Learning GPU Training System
Data Scientists & Researchers: quickly design the best deep neural network (DNN) for your data; visually monitor DNN training quality in real time; manage training of many DNNs in parallel on multi-GPU systems
developer.nvidia.com/digits

30 Use Cases
Image Classification, Object Detection, Localization; Face Recognition; Speech & Natural Language Processing; Medical Imaging & Interpretation; Seismic Imaging & Interpretation; Recommendation

31 NVIDIA DRIVE PX Auto-Pilot Platform
Complex scenes require deep-learning-based object identification and classification. Two Tegra X1 processors. Up to twelve camera inputs can be processed by one Tegra X1 in real time.

32 Cars that See Better and Learn

33 US TO BUILD WORLD'S TWO FASTEST SUPERCOMPUTERS: SUMMIT, SIERRA
PFLOPS Peak Performance; IBM POWER CPU + NVIDIA Volta GPU; NVLink High Speed Interconnect; 40 TFLOPS per Node, >3,400 Nodes; 2017
Major Step Forward on the Path to Exascale

34 nvidia.qwiklabs.com
Self-paced hands-on sessions that run on real GPUs in the cloud. Using IPython Notebook technology, lab instructions, editing and execution of code, and even interaction with visual tools are all woven together into a single web application.

35 NVIDIA GPU TECHNOLOGY UPDATE Axel Koehler
