ENDURING DIFFERENTIATION. Timothy Lanfear

Size: px

Start display at page:

Download "ENDURING DIFFERENTIATION. Timothy Lanfear"

Molly Dean
5 years ago
Views:

1 ENDURING DIFFERENTIATION Timothy Lanfear

2 WHERE ARE WE? 2

LIFE AFTER DENNARD SCALING 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 Transistors (thousands) 1.1X per year 10 3 10 2 Single-threaded perf 1.

3 LIFE AFTER DENNARD SCALING Years of Microprocessor Trend Data Transistors (thousands) 1.1X per year Single-threaded perf 1.5X per year Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for by K. Rupp 3

4 GPU-ACCELERATED PERFORMANCE AMBER Performance (ns/day) GoogleNet Performance (i/s) AMBER 16 CUDA cudnn 7 CUDA 9 NCCL AMBER 14 CUDA 5 AMBER 14 CUDA cudnn 6 CUDA 8 NCCL AMBER 12 CUDA cudnn 2 CUDA 6 cudnn 4 CUDA 7 0 K20 K40 K80 P x K80 8x Maxwell DGX-1 DGX-1V

5 TESLA VALUE: $15-20M COST SAVINGS DELIVERED TESLA PLATFORM ADVANTAGE Delivered value grows over time Life Sciences (AMBER) Oil & Gas (RTM) 10X Perf in 8 Years 6X Perf in 6 Years Amber performance: Nano Seconds Per Day delivered on 1xServer with GPUs and CPUS 5

6 GFLOPS per Watt GPU-ACCELERATED EFFICIENCY 13/13 Greenest Supercomputers Powered by Tesla P100 On Track To Meet Exascale Goal 35 TSUBAME 3.0 Kukai AIST AI Cloud RAIDEN GPU subsystem Piz Daint Wilkes-2 GOSAT-2 (RCF2) DGX Saturn V Reedbush-H JADE Facebook Cluster Cedar DAVIDE Eurotech Aurora K Tsubame- KFC K20X 5.3 TiTech W780I K SaturnV P100 V TSUBAME 3.0 P GF/W Exascale Goal Top GPU Systems in Green500 Lists and NVIDIA Projections for V100 6

7 HOW ARE WE DOING THIS? And, is our differentiation sustainable? What are the most important dimensions of our differentiation? Why are GPUs so much more efficient than CPUs? How can we continue scaling performance/efficiency as Moore s Law fades? Why can t competitors replicate GPU efficiency, performance, scaling, etc., with lots of weak CPU cores? (e.g., Intel KNC/KNL/KNM) How is optimizing GPUs for AI affecting their suitability for HPC? 7

8 ENERGY EFFICIENCY 8

9 COMPUTATION VERSUS COMMUNICATIONS 64-bit DP 20 pj 26 pj 256 pj 256-bit access 8 kb SRAM 50 pj 256 bits pj DRAM Rd/Wr 500 pj Efficient off-chip link 1000 pj 20mm 28nm IC 9

10 CPU 126 pj/flop (SP) Optimized for Latency Deep Cache Hierarchy GPU 28 pj/flop (SP) Optimized for Throughput Explicit Management of On-chip Memory Broadwell E5 v4 14 nm Pascal 16 nm 10

Logic 24% Instruction Supply 42% Out of Order, High Performance Issue 11% RF

11 HOW IS POWER SPENT IN A CPU? ALU 6% Register File 11% In Order, Embedded Data Supply 17% Clock + Control Logic 24% Instruction Supply 42% Out of Order, High Performance Issue 11% RF 14% ALU 4% Rename 10% Data Supply 5% Fetch 11% Clock + Pins 45% Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264) 11

12 Payload Arithmetic 15pJ Overhead 985pJ 12

13 SIMPLER CORES = ENERGY EFFICIENCY Source: Azizi [PhD 2010] 13

14 Payload Arithmetic 15pJ Overhead 15pJ 14

15 THROUGHPUT PROCESSORS 15

16 RISE OF LEAKAGE 16

17 FREQUENCY VS. LEAKAGE Source: Gordon Moore, Intel; IEEE 17

18 OPTIMIZED FOR DATACENTER EFFICIENCY 40% More Performance in a Rack V100 Max Performance V100 Max Efficiency Computer Vision Computer Vision 80% Perf at Half the Power 13 KW Rack 4 Nodes of 8xV ResNet-50 Networks Trained Per Day 13 KW Rack 7 Nodes of 8xV ResNet-50 Networks Trained Per Day ResNet-50 Training, Max Efficiency run with V100@160W V100 performance measured on pre-production hardware. 19

19 GFLOPS / watt 25 SP ENERGY 28 NM Fermi Kepler Maxwell 20

20 HETEROGENEOUS COMPUTING 21

21 OPTIMIZING SERIAL/PARALLEL EXECUTION Application Code Parallel Work Serial Work GPU Majority of Ops System and Sequential Ops CPU + 22

Processors Work Together Xeon Phi (And Others) Many Weak Serial Cores CPU

22 TWO TYPES OF ACCELERATORS Many-Weak-Cores (MWC) Model Single CPU Core for Both Serial & Parallel Work Heterogeneous Computing Model Complementary Processors Work Together Xeon Phi (And Others) Many Weak Serial Cores CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 23

23 EXTENSIBILITY 24

24 NVLINK: A MEMORY FABRIC, NOT A NETWORK DGX-1: 8 NVLink-Connected GPUs 25

25 LATENCY HIDING FOR LOAD/STORE/ATOMICS Where are the NICs? There are no NICs. 26

128 Knights Landing Servers 30x 20x 10x GPU-Accelerated

26 Speed-up vs 1x KNL Server STRONG SCALING GPU-Accelerated Server NVIDIA DGX-1 40x AlexNet Training DGX-1 Faster than 128 Knights Landing Servers 30x 20x 10x GPU-Accelerated Server 0x Knights Landing Servers 1x DGX1 27

27 Speed-up vs 1 KNL Server Node Speed-up vs 1 KNL Server Node STRONG SCALING 40x LAMMPS: Molecular Dynamics 8x Tesla P100 PCIe Server Faster than 32 KNL Servers 10x GTC-P: Plasma Turbulence 8x Tesla P100 PCIe Server Faster than 64 KNL Servers 35x 8x P100: 35x Faster 8x 8x P100: 8x Faster 30x 25x 20x 15x 10x 4x P100: 21x Faster 2x P100: 13x Faster 6x 4x 2x 4x P100 2x P100 5x 0x # of KNL Nodes 0x # of KNL Nodes 28

STRONG SCALING Gflop/s 25,000 LQCD- Higher Energy Physics SATURNV DGX Servers vs SuperMUC

4x DGX-1: 20K Gflop/s 2x DGX-1: 15K Gflop/s 1x DGX-1: 8K Gflop/s 2K Gflop/s 3K Gflop/s 5K

2,300 CPU Servers S3D: Discovering New Fuel for Engines 3,800 CPU Servers SPECFEM3D:

28 STRONG SCALING Gflop/s 25,000 LQCD- Higher Energy Physics SATURNV DGX Servers vs SuperMUC Supercomputer # of CPU Servers to Match Performance of SATURNV 20,000 15,000 10,000 5, x DGX-1: 20K Gflop/s 2x DGX-1: 15K Gflop/s 1x DGX-1: 8K Gflop/s 2K Gflop/s 3K Gflop/s 5K Gflop/s # of CPU Nodes (in SuperMUC Supercomputer) 7K Gflop/s 2,300 CPU Servers S3D: Discovering New Fuel for Engines 3,800 CPU Servers SPECFEM3D: Simulating Earthquakes QUDA version 0.9beta, using double-half mixed precision DDalphaAMG using double-single 29

29 NEW TENSOR CORE New CUDA TensorOp instructions and data formats 4 4 matrix processing array D FP32 = A FP16 B FP16 + C FP32 Optimized for deep learning Activation Inputs Weights Inputs Output Results 30

30 TESLA PLATFORM 31

TESLA IS A PLATFORM World s Leading Data

APPLICATIONS Automotive Retail Healthcare

Applications INTERNET SERVICES ENTERPRISE

FRAMEWORKS ECOSYSTEM TOOLS NVIDIA SDK cudnn

DEEP LEARNING SDK COMPUTEWORKS TESLA GPU &

31 TESLA IS A PLATFORM World s Leading Data Center Platform for Accelerating HPC and AI APPLICATIONS Automotive Retail Healthcare Manufacturing Finance Defense +450 Applications INTERNET SERVICES ENTERPRISE APPLICATIONS HPC INDUSTRY FRAMEWORKS & TOOLS FRAMEWORKS ECOSYSTEM TOOLS NVIDIA SDK cudnn TensorRT NCCL cublas cusparse DeepStream SDK DEEP LEARNING SDK COMPUTEWORKS TESLA GPU & SYSTEMS TESLA GPU NVIDIA DGX-1 NVIDIA HGX-1 SYSTEM OEM CLOUD 32

32 MULTIPLE GROWTH MARKETS 10M Users 40 years of video/day GROWTH MARKETS HPC AI Training AI Inference Desktop Virtualization Video Transcoding TESLA PLATFORM 33

33 CONCLUSION 34

34 PASCAL TO VOLTA Architecture with Technology Area: ~600 mm 2 ~800 mm 2 (~33% more area) Process: ~ small Pascal Volta improvement (a few percent) Clocks: similar dynamic range, power limited Memory BW (sustained): 50% improvement Communications (NVLink): 160 GB/s 300 GB/s (almost double!) AI (Tensor Cores): ~20 TFLOPS 120 TFLOPS (~6x!) 35

35 V100 Performance Normalized to P100 REVOLUTIONARY PERFORMANCE FOR HPC AND AI Single Platform For Data Science and Computation Science 1.5X HPC Performance In 1 Year 3X AI Performance in 1 Year 2X CPU 15 Days X P Hours Molecular Dynamics (VMD) Physics (QUDA) Seismic (RTM) STREAM 1X V100 6 Hours LSTM ( Neural Machine Translation) NMT Training for 13 Epochs German ->English, WMT15 subset System Config Info: CPU = 2x Xeon E V4 w/ P100s or V100s QUDA, RTM, STREAM System Config Info: 2X Xeon E v4, 2.6GHz, w/ 1X Tesla P100 or V100. VMD System Config Info: Xeon E5-2698v3 w/ 1x Tesla P100 and E5-2697Av4 w/ 1x Tesla V100 V100 measured on pre-production hardware. 36

36 GPU PERFORMANCE COMPARISON P100 V100 Ratio Training acceleration 10 TOPS 120 TOPS 12x Inference acceleration 21 TFLOPS 120 TOPS 6x FP64/FP32 5/10 TFLOPS 7.5/15 TFLOPS 1.5x HBM2 Bandwidth 720 GB/s 900 GB/s 1.2x NVLink Bandwidth 160 GB/s 300 GB/s 1.9x L2 Cache 4 MB 6 MB 1.5x L1 Caches 1.3 MB 10 MB 7.7x 37

37 GPU TRAJECTORY APPLICATIONS ALGORITHMS SYSTEMS GPU-Computing perf 1.5X per year 1.1X per year 1000X by 2025 CUDA Single-threaded perf 1.5X per year ARCHITECTURE Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for by K. Rupp 38

ENDURING DIFFERENTIATION Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING GPU-ACCELERATED PERFORMANCE 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 10 3 10 2 Single-threaded perf