ENDURING DIFFERENTIATION Timothy Lanfear


WHERE ARE WE?

LIFE AFTER DENNARD SCALING
[Figure: 40 Years of Microprocessor Trend Data. Series: transistors (thousands) and single-threaded perf, with trend lines of 1.5X per year and 1.1X per year. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected by K. Rupp.]

GPU-ACCELERATED PERFORMANCE
[Figure: AMBER performance (ns/day). Bars: AMBER 12, CUDA 4, K20; AMBER 14, CUDA 5, K40; AMBER 14, CUDA 6, K80; AMBER 16, CUDA 8, P.]
[Figure: GoogleNet performance (i/s). Bars: cuDNN 2, CUDA 6, 8x K80; cuDNN 4, CUDA 7, 8x Maxwell; cuDNN 6, CUDA 8, NCCL 1.6, DGX-1; cuDNN 7, CUDA 9, NCCL 2, DGX-1V.]

TESLA PLATFORM ADVANTAGE

TESLA VALUE: $15-20M COST SAVINGS DELIVERED
Delivered value grows over time
Life Sciences (AMBER)
Oil & Gas (RTM)
10X Perf in 8 Years
6X Perf in 6 Years
AMBER performance: nanoseconds per day delivered on 1x server with GPUs and CPUs

GPU-ACCELERATED EFFICIENCY
13/13 Greenest Supercomputers Powered by Tesla P100
TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX Saturn V, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE
On Track To Meet Exascale Goal
[Figure: GFLOPS per watt (GF/W) of the top GPU systems in Green500 lists and NVIDIA projections for V100, with the exascale goal marked. Systems plotted include Eurotech Aurora, Tsubame-KFC, TiTech, SaturnV, and TSUBAME 3.0.]

HOW ARE WE DOING THIS?
And, is our differentiation sustainable?
What are the most important dimensions of our differentiation?
Why are GPUs so much more efficient than CPUs?
How can we continue scaling performance/efficiency as Moore's Law fades?
Why can't competitors replicate GPU efficiency, performance, scaling, etc., with lots of weak CPU cores? (e.g., Intel KNC/KNL/KNM)
How is optimizing GPUs for AI affecting their suitability for HPC?

ENERGY EFFICIENCY


COMPUTATION VERSUS COMMUNICATIONS
[Figure: energy costs on a 20 mm, 28 nm IC. Labels: 64-bit DP, 20 pJ; 26 pJ; 256 pJ; 256-bit access to 8 KB SRAM, 50 pJ; 256 bits; DRAM Rd/Wr; efficient off-chip link, 500 pJ; 1000 pJ.]
CPU: 126 pJ/flop (SP). Optimized for latency. Deep cache hierarchy. Broadwell E5 v4, 14 nm.
GPU: 28 pJ/flop (SP). Optimized for throughput. Explicit management of on-chip memory. Pascal, 16 nm.

HOW IS POWER SPENT IN A CPU?
In Order, Embedded. Dally [2008] (embedded in-order CPU):
  Instruction Supply 42%
  Clock + Control Logic 24%
  Data Supply 17%
  Register File 11%
  ALU 6%
Out of Order, High Performance. Natarajan [2003] (Alpha 21264):
  Clock + Pins 45%
  RF 14%
  Issue 11%
  Fetch 11%
  Rename 10%
  Data Supply 5%
  ALU 4%

Payload arithmetic: 15 pJ
Overhead: 985 pJ
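The per-flop energies quoted above for CPU and GPU map directly onto the GFLOPS-per-watt metric used in the efficiency charts. The following is a minimal sketch of that conversion, assuming the 126 pJ and 28 pJ figures account for all energy spent per single-precision flop; the function name is illustrative and does not come from the slides.

```python
def gflops_per_watt(pj_per_flop: float) -> float:
    """Convert energy per flop (in picojoules) to GFLOPS per watt.

    One watt is one joule per second, so flops per joule is
    1 / (pj_per_flop * 1e-12). Dividing by 1e9 to express the result
    in GFLOPS/W leaves 1000 / pj_per_flop.
    """
    if pj_per_flop <= 0:
        raise ValueError("energy per flop must be positive")
    return 1000.0 / pj_per_flop


# Values taken from the slide text.
CPU_PJ_PER_FLOP = 126.0  # Broadwell E5 v4, single precision
GPU_PJ_PER_FLOP = 28.0   # Pascal, single precision

cpu_eff = gflops_per_watt(CPU_PJ_PER_FLOP)
gpu_eff = gflops_per_watt(GPU_PJ_PER_FLOP)
advantage = gpu_eff / cpu_eff

print(f"CPU: {cpu_eff:.1f} GFLOPS/W")
print(f"GPU: {gpu_eff:.1f} GFLOPS/W")
print(f"GPU advantage: {advantage:.1f}x")
```

Under that assumption the GPU figure corresponds to roughly 4.5 times the energy efficiency of the CPU figure.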

SIMPLER CORES = ENERGY EFFICIENCY
Payload arithmetic: 15 pJ
Overhead: 15 pJ
Source: Azizi [PhD 2010]

RISE OF LEAKAGE

THROUGHPUT PROCESSORS
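The two payload/overhead pairs in the slides (15 pJ against 985 pJ, then 15 pJ against 15 pJ) are easiest to compare as the share of energy that goes into useful arithmetic. The sketch below assumes the first pair describes a conventional CPU core and the second a simpler throughput-oriented core, as the slide titles suggest; the function and variable names are illustrative.

```python
def payload_fraction(payload_pj: float, overhead_pj: float) -> float:
    """Fraction of per-operation energy spent on the arithmetic itself."""
    total = payload_pj + overhead_pj
    if total <= 0:
        raise ValueError("total energy must be positive")
    return payload_pj / total


# Figures from the slide text (picojoules per operation).
complex_core = {"payload": 15.0, "overhead": 985.0}
simple_core = {"payload": 15.0, "overhead": 15.0}

complex_share = payload_fraction(complex_core["payload"], complex_core["overhead"])
simple_share = payload_fraction(simple_core["payload"], simple_core["overhead"])

# Total energy per operation, and how much the simpler core saves.
complex_total = sum(complex_core.values())
simple_total = sum(simple_core.values())
energy_ratio = complex_total / simple_total

print(f"Complex core: {complex_share:.1%} of energy is payload")
print(f"Simple core:  {simple_share:.1%} of energy is payload")
print(f"Energy per operation ratio: {energy_ratio:.1f}x")
```

With these numbers the payload share rises from 1.5% to 50%, and the energy per operation falls by a factor of about 33.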

FREQUENCY VS. LEAKAGE
Source: Gordon Moore, Intel; IEEE

OPTIMIZED FOR DATACENTER EFFICIENCY
80% Perf at Half the Power
40% More Performance in a Rack
V100 Max Performance (computer vision): 13 kW rack, 4 nodes of 8xV, ResNet-50 networks trained per day
V100 Max Efficiency (computer vision): 13 kW rack, 7 nodes of 8xV, ResNet-50 networks trained per day
ResNet-50 training, Max Efficiency run with V100@160W. V100 performance measured on pre-production hardware.

SP ENERGY 28 NM
[Figure: GFLOPS / watt for Fermi, Kepler, and Maxwell.]

HETEROGENEOUS COMPUTING
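The "40% More Performance in a Rack" claim follows from the other numbers on the same slide: a 13 kW rack holds 4 nodes at maximum performance or 7 nodes at maximum efficiency, and each GPU in the efficient mode delivers 80% of its full performance. A minimal sketch of that arithmetic, assuming all nodes are identical and that the 80% figure applies uniformly to a whole node (names are illustrative):

```python
def rack_throughput(nodes: int, relative_perf_per_node: float) -> float:
    """Aggregate rack throughput in units of one full-speed node."""
    return nodes * relative_perf_per_node


# Figures from the slide text.
max_performance = rack_throughput(nodes=4, relative_perf_per_node=1.0)
max_efficiency = rack_throughput(nodes=7, relative_perf_per_node=0.8)

gain = max_efficiency / max_performance - 1.0
print(f"Max Performance rack: {max_performance:.1f} node-equivalents")
print(f"Max Efficiency rack:  {max_efficiency:.1f} node-equivalents")
print(f"Gain from running at max efficiency: {gain:.0%}")
```

Seven nodes at 80% amount to 5.6 node-equivalents against 4, which is the 40% gain quoted on the slide.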

OPTIMIZING SERIAL/PARALLEL EXECUTION
Application code
GPU: parallel work, majority of ops
CPU: serial work, system and sequential ops

TWO TYPES OF ACCELERATORS
Many-Weak-Cores (MWC) Model: single CPU core for both serial and parallel work. Xeon Phi (and others). Many weak serial cores.
Heterogeneous Computing Model: complementary processors work together. CPU optimized for serial tasks. GPU accelerator optimized for parallel tasks.

NVLINK: A MEMORY FABRIC, NOT A NETWORK
DGX-1: 8 NVLink-connected GPUs

EXTENSIBILITY


LATENCY HIDING FOR LOAD/STORE/ATOMICS
Where are the NICs? There are no NICs.

STRONG SCALING
NVIDIA DGX-1 AlexNet training: DGX-1 faster than 128 Knights Landing servers
[Figure: speed-up vs 1 KNL server node, GPU-accelerated server (1x DGX) against Knights Landing servers.]

STRONG SCALING
LAMMPS: Molecular Dynamics. 8x Tesla P100 PCIe server faster than 32 KNL servers.
[Figure: speed-up vs 1 KNL server node against # of KNL nodes. 8x P100: 35x faster; 4x P100: 21x faster; 2x P100: 13x faster.]
GTC-P: Plasma Turbulence. 8x Tesla P100 PCIe server faster than 64 KNL servers.
[Figure: speed-up vs 1 KNL server node against # of KNL nodes. 8x P100: 8x faster; 4x P100; 2x P100.]

STRONG SCALING
LQCD - Higher Energy Physics: SATURNV DGX servers vs SuperMUC supercomputer
[Figure: Gflop/s against # of CPU nodes (in SuperMUC supercomputer). 4x DGX-1: 20K Gflop/s; 2x DGX-1: 15K Gflop/s; 1x DGX-1: 8K Gflop/s. CPU points: 2K, 3K, 5K, and 7K Gflop/s.]
QUDA version 0.9beta, using double-half mixed precision. DDalphaAMG using double-single.

# of CPU Servers to Match Performance of SATURNV
S3D: Discovering New Fuel for Engines. 2,300 CPU servers.
SPECFEM3D: Simulating Earthquakes. 3,800 CPU servers.

NEW TENSOR CORE
New CUDA TensorOp instructions and data formats
4x4 matrix processing array
D (FP32) = A (FP16) x B (FP16) + C (FP32)
Optimized for deep learning
[Figure: activation inputs, weights inputs, output results.]

TESLA IS A PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI

MULTIPLE GROWTH MARKETS
Automotive, Retail, Defense, Healthcare, Manufacturing, Finance

TESLA PLATFORM
APPLICATIONS: Internet Services, Enterprise Applications, HPC. +450 applications. 10M users. 40 years of video/day.
GROWTH MARKETS: HPC, AI Training, AI Inference, Desktop Virtualization, Video Transcoding
INDUSTRY FRAMEWORKS & TOOLS: frameworks, ecosystem tools
NVIDIA SDK: cuDNN, TensorRT, NCCL, cuBLAS, cuSPARSE. Deep Learning SDK, DeepStream SDK, ComputeWorks.
TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, System OEM, Cloud
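The tensor core operation on the slide, D = A x B + C with FP16 inputs for A and B and FP32 for C and D, can be emulated numerically to show what the mixed data formats mean. This is only a sketch using the Python standard library: it rounds values to half and single precision with the struct module and accumulates in a fixed order, which is an assumption made here, not something the slides state. It does not use or represent the CUDA TensorOp instructions themselves, and all function names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mixed_precision_fma(a, b, c):
    """Return D = A x B + C for square matrices given as lists of lists.

    A and B are rounded to FP16 first; C and the running sum are kept
    in FP32, mirroring the data formats named on the slide.
    """
    n = len(a)
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Multiply FP16 inputs, then round the running sum to FP32.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            row.append(acc)
        d.append(row)
    return d


N = 4  # the slide describes a 4x4 matrix processing array
A = [[0.5] * N for _ in range(N)]
B = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
C = [[1.0] * N for _ in range(N)]
D = mixed_precision_fma(A, B, C)
print(D[0])
```

Storing A and B in FP16 halves their footprint relative to FP32, while the FP32 accumulator limits the rounding error that builds up across the sum.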


PASCAL TO VOLTA
Architecture with Technology
Area: ~600 mm2 to ~800 mm2 (~33% more area)
Process: small Pascal to Volta improvement (a few percent)
Clocks: similar dynamic range, power limited
Memory BW (sustained): 50% improvement
Communications (NVLink): 160 GB/s to 300 GB/s (almost double!)
AI (Tensor Cores): ~20 TFLOPS to 120 TFLOPS (~6x!)

REVOLUTIONARY PERFORMANCE FOR HPC AND AI
Single Platform For Data Science and Computation Science
1.5X HPC Performance in 1 Year
[Figure: V100 performance normalized to P100 for Molecular Dynamics (VMD), Physics (QUDA), Seismic (RTM), and STREAM.]
3X AI Performance in 1 Year
[Figure: LSTM (Neural Machine Translation) training time. 2X CPU: 15 Days; 1X P Hours; 1X V100: 6 Hours.]
NMT training for 13 epochs, German -> English, WMT15 subset. System config info: CPU = 2x Xeon E V4 w/ P100s or V100s.
QUDA, RTM, STREAM system config info: 2X Xeon E v4, 2.6GHz, w/ 1X Tesla P100 or V100.
VMD system config info: Xeon E5-2698v3 w/ 1x Tesla P100 and E5-2697Av4 w/ 1x Tesla V100.
V100 measured on pre-production hardware.

GPU PERFORMANCE COMPARISON
                          P100           V100            Ratio
Training acceleration     10 TOPS        120 TOPS        12x
Inference acceleration    21 TFLOPS      120 TOPS        6x
FP64/FP32                 5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth            720 GB/s       900 GB/s        1.2x
NVLink Bandwidth          160 GB/s       300 GB/s        1.9x
L2 Cache                  4 MB           6 MB            1.5x
L1 Caches                 1.3 MB         10 MB           7.7x

CONCLUSION

GPU TRAJECTORY
[Figure: GPU-computing perf growing at 1.5X per year, reaching 1000X by 2025, against single-threaded perf, which grew at 1.5X per year and now grows at 1.1X per year. Contributing layers: applications, algorithms, systems, CUDA, architecture. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected by K. Rupp.]
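The growth rates on the trajectory slide compound, so the gap between 1.5X and 1.1X per year widens quickly. The sketch below works out how long a 1.5X annual rate takes to reach 1000X and what a 1.1X rate yields over the same span. It assumes constant rates; the slide text here does not state the starting year, so none is assumed, and the function names are illustrative.

```python
import math


def years_to_reach(target_gain: float, rate_per_year: float) -> float:
    """Years of compound growth at rate_per_year needed to reach target_gain."""
    if target_gain <= 0 or rate_per_year <= 1:
        raise ValueError("need target_gain > 0 and rate_per_year > 1")
    return math.log(target_gain) / math.log(rate_per_year)


def gain_after(years: float, rate_per_year: float) -> float:
    """Cumulative gain after the given number of years of compound growth."""
    return rate_per_year ** years


gpu_years = years_to_reach(1000.0, 1.5)   # GPU-computing trend
cpu_gain = gain_after(gpu_years, 1.1)     # single-threaded trend, same span

print(f"Years for 1.5X per year to reach 1000X: {gpu_years:.1f}")
print(f"Gain at 1.1X per year over the same span: {cpu_gain:.1f}X")
```

At 1.5X per year, 1000X takes about 17 years; at 1.1X per year the same period yields only about 5X.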
