SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS (Synergy of HPC and Deep Learning with NVIDIA GPUs)

1 SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS
Axel Koehler, Principal Solution Architect
HPCN Workshop Goettingen, 14 May 2018

2 NVIDIA - AI COMPUTING COMPANY
Computer Graphics, GPU Computing, Artificial Intelligence

3 ACCELERATED COMPUTING
Performance & Energy Efficiency
High Performance Compute, AI / Deep Learning, Data Analytics, Accelerated VDI

4 FACTORS DRIVING CHANGES IN HPC
- End of Dennard Scaling places a cap on single-threaded performance
  - Increasing application performance will require fine-grain parallel code with significant computational intensity
- Cloud-based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity
  - Tight coupling of interactive simulation, visualization, data analysis/AI
  - Service Oriented Architectures (SOA)
- AI and Data Science emerging as important new components of scientific discovery
  - Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data

5 Multiple Experiments Coming or Upgrading In the Next 10 Years
[Figure labels: Exabyte/Day; 15 TB/Day; 10X Increase in Data Volume; Cryo EM; 30X Increase in power; Personal Genomics]

6 TESLA PLATFORM
ONE Data Center Platform for Accelerating HPC and AI
- APPLICATIONS: Internet Services, Enterprise Applications, HPC (Automotive, Retail, Healthcare, Manufacturing, Finance, Defense, +450 Applications)
- INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools
- NVIDIA SDK (Deep Learning SDK, ComputeWorks): cuDNN, TensorRT, NCCL, cuBLAS, cuSPARSE, DeepStream SDK, CUDA C/C++, FORTRAN
- TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, System OEM, Cloud

7 GPUS FOR HPC AND DEEP LEARNING
Huge demands on compute power (FLOPS)
- NVIDIA Tesla V100: energy-efficient cores + Tensor Cores
- 7.8 TF Double Precision (fp64), 15.6 TF Single Precision (fp32), 125 Tensor TFLOP/s mixed-precision
Huge demands on communication and memory bandwidth
- CoWoS with HBM2: 900 GB/s Memory Bandwidth, Unifying Compute & Memory in Single Package
- NVLink: 6 links per GPU at 50 GB/s bidirectional for maximum scalability between GPUs
- NCCL: High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
- GPUDirect / GPUDirect RDMA: Direct communication between GPUs by eliminating the CPU from the critical path

8 DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success
- HPC and Deep Learning both use inherently parallel algorithms
- HPC and Deep Learning both require a huge amount of compute power (FLOPS)
  - Mainly double-precision arithmetic for HPC
  - Single, half or 8-bit precision for Deep Learning training/inference
- HPC needs less memory per FLOPS than Deep Learning
- HPC is more demanding on network bandwidth than Deep Learning
- Data scientists like dense systems (as many GPUs as possible per node)
- HPC has had more demand for scalability than Deep Learning up to now
  - Distributed training frameworks like Horovod (Uber) are now available
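
As an illustration of the distributed-training point: Horovod is built around ring-allreduce for combining gradients across workers. The sketch below simulates that collective in plain Python on in-memory lists. It is not the Horovod or NCCL API; the function name `ring_allreduce` and the list-of-lists layout are assumptions made for this example.

```python
def ring_allreduce(buffers):
    """Sum equal-length vectors held by n simulated workers arranged in a ring.

    Returns one result list per worker; every worker ends with the full sum.
    """
    n = len(buffers)
    length = len(buffers[0])
    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(b) for b in buffers]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: in step s worker r sends chunk (r - s) mod n
    # to worker r + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            c, values = sends[(r - 1) % n]
            for i, v in zip(chunk(c), values):
                data[r][i] += v

    # Worker r now holds the complete sum of chunk (r + 1) mod n.
    # Phase 2, allgather: pass the completed chunks once around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r in range(n):
            c, values = sends[(r - 1) % n]
            for i, v in zip(chunk(c), values):
                data[r][i] = v
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, result in enumerate(ring_allreduce(grads)):
        print(rank, result)
```

Each worker sends and receives only one chunk per step, so the traffic per worker stays roughly constant as workers are added. That is one reason this pattern suits bandwidth-bound gradient exchange.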

9 CONTINUED DEMAND FOR COMPUTE POWER
Neural network complexity is exploding
- 2015 Microsoft ResNet, Superhuman Image Recognition: 7 ExaFLOPS, 60 Million Parameters
- 2016 Baidu Deep Speech 2, Superhuman Voice Recognition: 20 ExaFLOPS, 300 Million Parameters
- 2017 Google Neural Machine Translation, Near Human Language Translation: 100 ExaFLOPS, 8700 Million Parameters

10 TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
- New CUDA TensorOp instructions & data formats
- 4x4x4 matrix processing array
- D[FP32] = A[FP16] * B[FP16] + C[FP32]
- Using Tensor Cores via
  - Volta optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
  - CUDA C++ Warp Level Matrix Operations
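
The operation D = A * B + C on this slide can be emulated in plain Python to see what FP16 inputs with FP32 accumulation mean numerically. This is a sketch only: the helper names (`to_fp16`, `to_fp32`, `mma_4x4`) are made up for this example, and real Tensor Core hardware may order and round its internal sums differently.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D[FP32] = A[FP16] * B[FP16] + C[FP32] for 4x4 matrices."""
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exact in a Python float;
                # the running sum is rounded to FP32 after every addition.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


if __name__ == "__main__":
    a = [[2.0 ** -7] * 4 for _ in range(4)]
    ones = [[1.0] * 4 for _ in range(4)]
    d = mma_4x4(a, a, ones)
    print("FP32 accumulator:", d[0][0])
    print("same value rounded to FP16:", to_fp16(d[0][0]))
```

In the demo, four products of 2^-14 are added to 1.0. The FP32 accumulator keeps the resulting 1 + 2^-12, whereas rounding that value to FP16 gives 1.0 again. This shows why accumulating in FP32 matters even when the inputs are FP16.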

11 cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Figure: two charts of relative performance vs. matrix size (M=N=K). cuBLAS Single Precision (FP32): V100 (CUDA 9) is 1.8x faster than P100 (CUDA 8). cuBLAS Mixed Precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) are 9.3x faster than P100 (CUDA 8).]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.

12 LINEAR ALGEBRA + TENSOR CORES
[Figure: Tflop/s vs. matrix size (up to 34k) for four LU variants: FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU, FP64 dgetrf LU]
Data courtesy of: Azzam Haidar, Stan. Tomov & Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
- Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers, A. Haidar, P. Wu, S. Tomov, J. Dongarra, SC 17
- GTC 2018 Poster P8237: Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solves
Double Precision LU Decomposition
- Compute initial solution in FP16
- Iteratively refine to FP64
Achieved FP64 Tflops: 26
Device FP64 Tflops:
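
A minimal sketch of the idea on this slide (factor in low precision, then refine to FP64), written in plain Python. Assumptions: FP16 is emulated with the half-precision format of `struct`, only the LU factorization is rounded to FP16 while the triangular solves and residuals run in FP64, and all function names are hypothetical. It is not the code behind the measurements cited above.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def lu_factor_fp16(a):
    """LU with partial pivoting; every stored value is rounded to FP16."""
    n = len(a)
    lu = [[to_fp16(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular at FP16 precision")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_fp16(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_fp16(lu[i][j] - to_fp16(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve(lu, perm, b):
    """Forward and back substitution in FP64 using the FP16 factors."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def refine_solve(a, b, iterations=10):
    """Solve a x = b: FP16 factorization plus FP64 iterative refinement."""
    lu, perm = lu_factor_fp16(a)
    x = lu_solve(lu, perm, b)          # initial low-accuracy solution
    residual_history = []
    for _ in range(iterations):
        # The residual is computed in FP64 against the original matrix.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        residual_history.append(max(abs(v) for v in r))
        correction = lu_solve(lu, perm, r)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x, residual_history


if __name__ == "__main__":
    a = [[4.1, 1.3, 0.7], [1.2, 5.3, 0.9], [0.6, 1.1, 3.7]]
    x_true = [1.0, -2.0, 3.0]
    b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]
    x, history = refine_solve(a, b)
    print("solution:", x)
    print("residuals:", history)
```

The expensive O(n^3) factorization is the part done in low precision. Each refinement step costs only O(n^2), which is why the approach can reach FP64 accuracy at close to low-precision speed for well-conditioned systems. The sketch makes no attempt to handle ill-conditioned matrices, where refinement may fail to converge.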

13 VOLTA NVLINK
- 6 NVLINKS @ 50 GB/s bidirectional
- Reduce number of lanes for lightly loaded link (power savings)
- Coherence features for NVLINK enabled CPUs
[Figure: hybrid cube mesh (e.g. DGX-1V); POWER9 based node]

14 NVIDIA DGX-1
AI supercomputer-appliance-in-a-box
- 8x Tesla V100 connected via NVLINK (125 TFLOPS FP32, 1 PFLOPS Tensor Core performance)
- Dual Xeon CPU, 512 GB Memory
- 7 TB SSD Deep Learning Cache
- Dual 10GbE, Quad IB 100Gb
- 3RU, 3200W
- Optimized Deep Learning Software across the entire stack
  - Containerized frameworks
  - Always up-to-date via the cloud

15 NVIDIA DGX-2
- NVIDIA Tesla V100 32GB
- Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
- Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
- Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
- PCIe Switch Complex
- 30 TB NVME SSDs Internal Storage
- Two Intel Xeon Platinum CPUs
- 1.5 TB System Memory
- Dual 10/25 Gb/sec Ethernet

16 NVSWITCH
Announced at GTC US in March 2018
Will be available in DGX-2 systems later this year
- 18 NVLINK ports @ 50 GB/s per port bi-directional
- 900 GB/s total bi-directional
- Fully connected crossbar
- X4 PCIe Gen2 Management port
- GPIO
- I2C
- 2 billion transistors
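
The headline numbers on the DGX-2 and NVSwitch slides can be cross-checked with simple arithmetic. The sketch below uses only figures quoted on the slides. The constant and function names are illustrative, and the bisection formula (half of the GPUs each driving all six of their links across the cut) is an assumption about how the 2.4 TB/sec figure is derived.

```python
# Figures quoted on the Volta NVLink, DGX-2 and NVSwitch slides.
NVLINK_GB_S = 50              # GB/s per NVLink port, bidirectional
PORTS_PER_NVSWITCH = 18
LINKS_PER_GPU = 6
GPUS = 2 * 8                  # two boards, 8 V100 GPUs per board
HBM2_PER_GPU_GB = 32
IB_PORTS = 8
IB_GBIT_S_PER_DIRECTION = 100


def nvswitch_total_gb_s():
    """Aggregate bidirectional bandwidth of one NVSwitch."""
    return PORTS_PER_NVSWITCH * NVLINK_GB_S


def total_hbm2_gb():
    """HBM2 capacity summed over all GPUs in a DGX-2."""
    return GPUS * HBM2_PER_GPU_GB


def ib_total_gbit_s():
    """Network bandwidth of all EDR ports, counting both directions."""
    return IB_PORTS * IB_GBIT_S_PER_DIRECTION * 2


def bisection_gb_s():
    """Assumed derivation: 8 of the 16 GPUs use all their links across the cut."""
    return (GPUS // 2) * LINKS_PER_GPU * NVLINK_GB_S


if __name__ == "__main__":
    print("NVSwitch total:", nvswitch_total_gb_s(), "GB/s")
    print("DGX-2 HBM2:", total_hbm2_gb(), "GB")
    print("DGX-2 network:", ib_total_gbit_s(), "Gb/s")
    print("DGX-2 bisection:", bisection_gb_s() / 1000, "TB/s")
```

Under these assumptions the results are 900 GB/s, 512 GB, 1600 Gb/s and 2.4 TB/s, which agree with the values on the slides.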

17 FULL NON-BLOCKING BANDWIDTH

18 FULL 6-WAY POINT-TO-POINT
[Figure: NVSwitch Fabric]

19 INDEPENDENT COMMUNICATION
[Figure: NVSwitch Fabric]

20 LOAD & STORE TO ANY GPU
[Figure: NVSwitch Fabric]

21 NVSWITCH
NVLINK PROVIDES
- All-to-all high-bandwidth peer mapping between GPUs
- Full inter-GPU memory interconnect (incl. Atomics)
UNIFIED MEMORY PROVIDES
- Single memory view shared by all GPUs
- Automatic migration of data between GPUs
- User control of data locality

22 SOFTWARE CHALLENGES
- Current DIY deep learning environments are complex and time consuming to build, test and maintain
- Same issues affect HPC and other accelerated applications
- Need multiple jobs from different users to co-exist on the same servers
[Figure: software stack, top to bottom: Open Source Frameworks, NVIDIA Libraries, NVIDIA Docker, NVIDIA Driver, NVIDIA GPU]

23 NVIDIA GPU CLOUD REGISTRY
Common software stack across NVIDIA GPUs
- Deep Learning: Caffe, Caffe2, CNTK, mxnet, PyTorch, Tensorflow, Theano, Torch
  - All major frameworks with multi-GPU optimizations
  - Uses NCCL for NVLINK data exchange
  - Multi-threaded I/O to feed the GPUs
- HPC: NAMD, Gromacs, LAMMPS, GAMESS, Relion, Chroma, MILC
- HPC Visualization: Paraview with Optix, Index and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD
- Single NGC Account: for use on GPUs everywhere
NVIDIA GPU Cloud containerizes optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge

24 NVIDIA SATURN V
AI supercomputer with 660 x DGX-1V
- Primarily research focused
- Used internally for Deep Learning applied research
- Many users testing algorithms, networks, new approaches
- Embedded, robotic, auto, hyperscale, HPC
- Partner with university research and industry collaborations
- Study convergence of data science and HPC
- All jobs are containerized
40 PF Peak FP64 Performance, 660 PF DL Tensor Performance
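
The system-level peak figures follow from the per-GPU numbers quoted earlier in the deck (7.8 TF FP64, 15.6 TF FP32 and 125 Tensor TFLOP/s per Tesla V100, 8 GPUs per DGX-1). The short check below is a sketch: the names `V100_PEAK_TFLOPS` and `peak_tflops` are made up for this example, and it computes theoretical peaks only, not measured performance.

```python
GPUS_PER_DGX1 = 8
# Per-GPU peak rates for Tesla V100 as quoted on the slides, in TFLOP/s.
V100_PEAK_TFLOPS = {"fp64": 7.8, "fp32": 15.6, "tensor": 125.0}


def peak_tflops(nodes, precision):
    """Theoretical peak of `nodes` DGX-1 systems at the given precision."""
    return nodes * GPUS_PER_DGX1 * V100_PEAK_TFLOPS[precision]


if __name__ == "__main__":
    print("DGX-1 FP32 TFLOPS:", round(peak_tflops(1, "fp32"), 1))
    print("DGX-1 Tensor PFLOPS:", peak_tflops(1, "tensor") / 1000)
    print("SATURN V FP64 PF:", round(peak_tflops(660, "fp64") / 1000, 1))
    print("SATURN V Tensor PF:", peak_tflops(660, "tensor") / 1000)
```

One DGX-1 comes out at 124.8 TFLOPS FP32 and 1 PFLOPS Tensor, matching the rounded values on the DGX-1 slide. 660 nodes give about 41 PF FP64 and 660 PF Tensor, in line with the 40 PF and 660 PF quoted here.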

25 DEEP LEARNING DATA CENTER
Reference Architecture

26 NVIDIA DRIVE SIM AND CONSTELLATION
AV VALIDATION SYSTEM
- Virtual Reality AV Simulator
- Same Architecture as DRIVE Computer
- Simulate Rare and Difficult Conditions, Recreate Scenarios, Run Regression Tests, Drive Billions of Virtual Miles
- 10,000 Constellations Drive 3B Miles per Year

27 NVIDIA ISAAC ROBOTICS PLATFORM
SIMULATION, TRAINING, DEPLOYMENT, SDK
- Simulation environment for developing, testing and training autonomous machines in the virtual world.
- Once a simulation is complete, the trained system (brain) can be transferred to physical robots.

28 COMBINING THE STRENGTHS OF HPC AND AI
HPC, AI
- Proven algorithms based on first-principles theory
- Proven statistical models for accurate results in multiple science domains
- New methods to improve predictive accuracy, insight into new phenomena and response time
- Develop training data sets using first-principles models
- Incorporate AI models in semi-empirical style applications to improve throughput
- Validate new findings from AI
- Implement inference models with real-time interactivity
- Train inference models to improve accuracy and comprehend more of the physical parameter space
- Analyze data sets that are simply intractable with classic statistical models
- Control and manage complex scientific experiments

31 SUMMARY
- The same technology can be used for HPC and machine learning / deep learning
- Deep learning is enabling many uses in science (e.g. image recognition, classification, ...)
- Applications can use DL to train neural networks with already simulated data, and the DL network can then predict the output
- GPU is the right technology for HPC and DL

32 SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS
Axel Koehler (akoehler@nvidia.com)
