April 4-7, 2016 Silicon Valley INSIDE PASCAL. Mark Harris, October 27,

1 April 4-7, 2016 Silicon Valley INSIDE PASCAL Mark Harris, October 27,

2 INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node
Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory
[Diagram labels: CPU, PCIe Switch, Tesla P100, Unified Memory]

3 GIANT LEAPS IN EVERYTHING
3x Compute, 5x GPU-GPU BW, 3x GPU Mem BW
[Charts comparing K40, M40, and P100: Teraflops (FP32/FP16), GPU-GPU Bandwidth (GB/sec), GPU Memory Bandwidth]

4 TESLA P100 PERFORMANCE DELIVERED
NVLink for Max Scalability, Over 45x Faster with 8x P100
[Chart: Speed-up vs Dual Socket Broadwell (0x to 50x) for Caffe/Alexnet, VASP, HOOMD-Blue, COSMO, MILC, Amber, HACC; series: 2x K80 (M40 for Alexnet), 2x P100, 4x P100, 8x P100; baseline: 2x Broadwell CPU]

5 TESLA PASCAL ACCELERATORS
Tesla P100 (SXM2 + NVLink, PCIe)
    Compute: 5.3 TF DP, 10.6 TF SP, 21.2 TF HP
    Memory: HBM2, 720 GB/s, 16 GB
    Interconnect: NVLink + PCIe Gen3
    Programmability: Page Migration Engine, Unified Memory
    Power: 300W (SXM2), 250W (PCIe)
Tesla P40
    Compute: 11.8 TF SP, 47 TOP/s INT8
    Memory: GDDR5, 346 GB/s, 24 GB
    Interconnect: PCIe Gen3
    Programmability: Page Migration Engine, Unified Memory
    Power: 250W
Tesla P4
    Compute: 5.4 TF SP, 21.8 TOP/s INT8
    Memory: GDDR5, 192 GB/s, 8 GB
    Interconnect: PCIe Gen3
    Programmability: Page Migration Engine, Unified Memory
    Power: 50W & 75W
(Compute numbers based on boost clocks; P100 compute numbers given for SXM2 version)

6 PASCAL ARCHITECTURE

7 TESLA P100 GPU: GP100
56 SMs
3584 CUDA Cores
5.3 TF Double Precision
10.6 TF Single Precision
21.2 TF Half Precision
16 GB HBM2
720 GB/s Bandwidth

8 GP100 SM
CUDA Cores: 64
Register File: 256 KB
Shared Memory: 64 KB
Active Threads: 2048
Active Blocks: 32

9 P100 SM
More resources per core
2x Registers
1.33x Shared Memory Capacity
2x Shared Memory Bandwidth
2x Warps
Higher Instruction Throughput
[Diagram: Maxwell SM and P100 SM side by side, showing Cores, FP64, LD/ST, SFU, Warps, Registers, and Shared Mem]

10 IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast
Feature             Half precision      Single precision    Double precision
Layout              s5.10               s8.23               s11.52
Issue rate          pair every clock    1 every clock       1 every 2 clocks
Subnormal support   Yes                 Yes                 Yes
Atomic Addition     Yes                 Yes                 Yes
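The "pair every clock" issue rate for half precision is reached by operating on two packed FP16 values at once. A minimal sketch of that idea, assuming a build for compute capability 5.3 or higher (for example `nvcc -arch=sm_60`); the kernel name `haxpy2` and the host helper `haxpy2_first` are hypothetical names chosen for this illustration:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// y = a * x + y, computed on packed pairs of FP16 values.
// Inputs and outputs stay in float2 so the host never handles __half2 directly.
__global__ void haxpy2(int n2, float a, const float2 *x, float2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 ah = __float2half2_rn(a);                 // broadcast a to both lanes
        __half2 xh = __floats2half2_rn(x[i].x, x[i].y);   // pack two floats as FP16
        __half2 yh = __floats2half2_rn(y[i].x, y[i].y);
        __half2 r  = __hfma2(ah, xh, yh);                 // two fused multiply-adds
        y[i] = __half22float2(r);                         // unpack back to float2
    }
}

// Hypothetical helper: fills n2 pairs with (xv, xv) and (yv, yv),
// runs the kernel, and returns the first lane of the first result.
float haxpy2_first(int n2, float a, float xv, float yv)
{
    float2 *x = nullptr, *y = nullptr;
    cudaMallocManaged(&x, n2 * sizeof(float2));
    cudaMallocManaged(&y, n2 * sizeof(float2));
    for (int i = 0; i < n2; ++i) {
        x[i] = make_float2(xv, xv);
        y[i] = make_float2(yv, yv);
    }
    int threads = 128;
    int blocks = (n2 + threads - 1) / threads;
    haxpy2<<<blocks, threads>>>(n2, a, x, y);
    cudaDeviceSynchronize();
    float result = y[0].x;
    cudaFree(x);
    cudaFree(y);
    return result;
}
```

Calling `haxpy2_first(1024, 2.0f, 1.5f, 0.25f)` computes 2 * 1.5 + 0.25 in each lane; all three values and the result are exactly representable in FP16.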

11 HIGH THROUGHPUT FOR SMALL INTEGERS
Tesla P4 and P40: 8-bit Integer Vector Operations
4x 8-bit vector dot product with 32-bit accumulate
CUDA intrinsics: int __dp4a(char4 a, char4 b, int c)
[Diagram: A (4x int8) and B (4x int8) multiplied element-wise, summed with C (int32) to give D (int32)]
CUDA Library Support
    cuBLAS: GemmEx()
    cuDNN v6: INT8 inference convolutions
    TensorRT
Powerful for deep learning inference (Tesla P4 results)
40x Efficiency vs CPU, 8x Efficiency vs FPGA
[Chart: AlexNet Images/Sec/Watt for CPU, FPGA, 1x M4 (FP32), 1x P4 (INT8)]
AlexNet, batch size = 128, CPU: Intel E5-2690v4 using Intel MKL 2017, FPGA is Arria x M4/P4 in node, P4 board power at 56W, P4 GPU power at 36W, M4 board power at 57W, M4 GPU power at 39W, Perf/W chart using GPU power
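The 4-way dot product with 32-bit accumulate can be sketched as an int8 dot-product kernel. This is an illustration under stated assumptions, not code from the talk: `__dp4a` is only available when compiling for compute capability 6.1 or higher (for example `nvcc -arch=sm_61`), so the sketch falls back to plain arithmetic on other targets. The names `dot_int8_kernel` and `dot_int8` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Each thread multiplies one pair of char4 vectors and adds the
// 32-bit partial sum into a single global accumulator.
__global__ void dot_int8_kernel(const char4 *a, const char4 *b, int n4, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
#if __CUDA_ARCH__ >= 610
        // Hardware path: four int8 products plus a 32-bit accumulate.
        int partial = __dp4a(a[i], b[i], 0);
#else
        // Reference path with the same arithmetic, for older targets.
        int partial = a[i].x * b[i].x + a[i].y * b[i].y
                    + a[i].z * b[i].z + a[i].w * b[i].w;
#endif
        atomicAdd(result, partial);
    }
}

// Hypothetical helper: builds n4 copies of (1,2,3,4) . (1,-1,1,-1)
// and returns the accumulated dot product.
int dot_int8(int n4)
{
    char4 *a = nullptr, *b = nullptr;
    int *result = nullptr;
    cudaMallocManaged(&a, n4 * sizeof(char4));
    cudaMallocManaged(&b, n4 * sizeof(char4));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < n4; ++i) {
        a[i] = make_char4(1, 2, 3, 4);
        b[i] = make_char4(1, -1, 1, -1);
    }
    *result = 0;
    int threads = 128;
    int blocks = (n4 + threads - 1) / threads;
    dot_int8_kernel<<<blocks, threads>>>(a, b, n4, result);
    cudaDeviceSynchronize();
    int value = *result;
    cudaFree(a);
    cudaFree(b);
    cudaFree(result);
    return value;
}
```

Each char4 pair contributes 1 - 2 + 3 - 4 = -2, so `dot_int8(256)` accumulates 256 such terms.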

12 NVLink

13 NVLINK
P100 supports 4 NVLinks
Up to 94% bandwidth efficiency
Supports read/writes/atomics to peer GPU
Supports read/write access to NVLink-enabled CPU
Links can be ganged for higher bandwidth
[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]
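From the programmer's side, reads, writes, and atomics to a peer GPU go through the CUDA peer-access API. A minimal sketch, assuming nothing about the machine topology: the runtime reports which device pairs support peer access, and the same calls apply whether the link is NVLink or PCIe. The helper name `enable_all_peer_access` is hypothetical.

```cuda
#include <cuda_runtime.h>

// Enable peer access for every ordered pair of GPUs that reports support.
// Returns the number of directed peer links enabled (0 on a single-GPU system).
int enable_all_peer_access()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    int enabled = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        for (int peer = 0; peer < count; ++peer) {
            if (peer == dev) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, dev, peer);
            if (!can_access) continue;
            // After this call, kernels on dev may dereference pointers
            // that were allocated on peer.
            cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
            if (err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled) {
                ++enabled;
            }
            cudaGetLastError();  // reset the last-error state before the next pair
        }
    }
    return enabled;
}
```

On a fully connected quad of four GPUs this would be expected to enable 12 directed links, though the actual count depends on what the driver reports.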

14 NVLINK - GPU CLUSTER
Two fully connected quads, connected at corners
160 GB/s per GPU bidirectional to Peers
Load/store access to Peer Memory
Full atomics to Peer GPUs
High speed copy engines for bulk data copy
PCIe to/from CPU

15 NVLINK TO CPU
IBM Power Systems Server S822LC (codename "Minsky")
2x IBM Power8+ CPUs and 4x P100 GPUs
80 GB/s per GPU bidirectional for peer traffic
80 GB/s per GPU bidirectional to CPU
115 GB/s CPU Memory Bandwidth
Direct Load/store access to CPU Memory
High Speed Copy Engines for bulk data movement
[Diagram: two P8+ CPUs, each with DDR4 at 115 GB/s and IB, each connected to two P100 GPUs]

16 UNIFIED MEMORY

17 PAGE MIGRATION ENGINE
Support Virtual Memory Demand Paging
49-bit Virtual Addresses
    Sufficient to cover 48-bit CPU address + all GPU memory
GPU page faulting capability
    Can handle thousands of simultaneous page faults
Up to 2 MB page size
    Better TLB coverage of GPU memory

18 UNIFIED MEMORY
Dramatically Lower Developer Effort
Simpler Programming & Memory Model
    Single allocation, single pointer, accessible anywhere
    Eliminate need for explicit copy
    Greatly simplifies code porting
Performance Through Data Locality
    Migrate data to accessing processor
    Guarantee global coherence
    Still allows explicit hand tuning
[Diagram: CUDA 6+, Kepler GPU and CPU sharing Unified Memory; Allocate Up To GPU Memory Size]

19 CUDA 8: UNIFIED MEMORY
Large datasets, simple programming, High Performance
Enable Large Data Models
    Oversubscribe GPU memory
    Allocate up to system memory size
Simpler Data Access
    CPU/GPU Data coherence
    Unified memory atomic operations
Tune Unified Memory Performance
    Usage hints via cudaMemAdvise API
    Explicit prefetching API
[Diagram: CUDA 8, Pascal GPU and CPU sharing Unified Memory; Allocate Beyond GPU Memory Size]

20 OUT-OF-CORE AMR COMPUTATIONS WITH UNIFIED MEMORY ON P100
[Chart: Application Throughput (MDOF/s, millions) vs Application working set (GB); series: P100 (x86 PCI-E), P100 + user hints (x86 PCI-E), P100 (P8 NVLINK), P100 + user hints (P8 NVLINK); annotations: All 5 levels fit in GPU memory, P100 memory size (16GB), Only 2 levels fit, Only 1 level fits]
x86 CPU: Intel E v3, 2 sockets of 10 cores each with HT on (40 threads)

21 INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node
Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory
More P100 Features: compute preemption, new instructions, larger L2 cache, more
Find out more about P100: P4 and P40: Contact:

22 BACKUP

23 END-TO-END PRODUCT FAMILY
Segments: HYPERSCALE HPC, STRONG-SCALE HPC, MIXED-APPS HPC, FULLY INTEGRATED DL SUPERCOMPUTER
Products: Training: Tesla P100; Inference: Tesla P40 & P4; Tesla P100 with NVLink; Tesla P100 with PCI-E; Tesla K80; DGX-1
Descriptions, in segment order:
    Deep learning training & inference
    HPC and DL data centers scaling to multiple GPUs
    HPC data centers with mix of CPU and GPU workloads
    Fully integrated deep learning solution

24 HALF-PRECISION FLOATING POINT (FP16)
16 bits: 1 sign bit, 5 exponent bits, 10 fraction bits
2^40 Dynamic range
    Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
    Subnormals at full speed: 1024 values from 2^-24 to 2^-15
Special values
    +- Infinity, Not-a-number
USE CASES
    Deep Learning Training
    Radio Astronomy
    Sensor Data
    Image Processing
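The FP16 layout can be made concrete with a small decoder that turns a 16-bit pattern into its value. This is host-side code written for illustration only (it needs no GPU and does not use the CUDA FP16 types); the function name `half_bits_to_float` is hypothetical.

```cuda
#include <cmath>
#include <cstdint>

// Decode an IEEE 754 half-precision bit pattern (1 sign, 5 exponent,
// 10 fraction bits) into a float. Every FP16 value fits exactly in FP32.
float half_bits_to_float(uint16_t bits)
{
    int sign = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;
    int fraction = bits & 0x3FF;
    float value;
    if (exponent == 0) {
        // Subnormal (or zero): fraction * 2^-24, no implicit leading 1.
        value = std::ldexp(static_cast<float>(fraction), -24);
    } else if (exponent == 31) {
        // Special values: infinity when the fraction is zero, NaN otherwise.
        value = (fraction == 0) ? INFINITY : NAN;
    } else {
        // Normalized: (1 + fraction/1024) * 2^(exponent - 15),
        // written as (1024 + fraction) * 2^(exponent - 25).
        value = std::ldexp(static_cast<float>(1024 + fraction), exponent - 25);
    }
    return sign ? -value : value;
}
```

The smallest subnormal pattern `0x0001` decodes to 2^-24 and the largest finite pattern `0x7BFF` decodes to 65504, which together span the roughly 2^40 dynamic range.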

25 RADIO ASTRONOMY AND PASCAL
Radio Telescopes:
    Often remote: power efficiency crucial
    Low-precision data: FP32 not needed
    GPUs (FP32) already in use
Pascal INT8 arithmetic: x higher efficiency than FP
[Chart: Cross correlation benchmark, GOPS / watt, for M2090 Fermi (fp32), K40 Kepler (fp32), M40 Maxwell (fp32), P40 Pascal (fp32), P40 Pascal (int8), P40 Pascal (int8 + clock capping)]

26 HBM2 STACKED MEMORY

27 HBM2: 720 GB/SEC BANDWIDTH
And ECC is free
[Diagram labels: Spacer, 4-high HBM2 Stack, Bumps, GPU, Silicon Carrier, Substrate]

28 UNIFIED MEMORY EXAMPLE
On-Demand Paging
    __global__ void setvalue(char *ptr, int index, char val)
    {
        ptr[index] = val;
    }
    void foo(int size) {
        char *data;
        cudaMallocManaged(&data, size);        // Unified Memory allocation
        memset(data, 0, size);                 // Access all values on CPU
        setvalue<<<...>>>(data, size/2, 5);    // Access one value on GPU
        cudaDeviceSynchronize();
        usedata(data);
        cudaFree(data);
    }

29 HOW UNIFIED MEMORY WORKS IN CUDA 6
Servicing CPU page faults
GPU Code:
    __global__ void setvalue(char *ptr, int index, char val)
    {
        ptr[index] = val;
    }
CPU Code:
    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setvalue<<<...>>>(array, size/2, 5);
[Diagram: GPU Memory Mapping and CPU Memory Mapping of array, joined by the Interconnect; Page Fault on the CPU side]

30 HOW UNIFIED MEMORY WORKS ON PASCAL
Servicing CPU and GPU Page Faults
GPU Code:
    __global__ void setvalue(char *ptr, int index, char val)
    {
        ptr[index] = val;
    }
CPU Code:
    cudaMallocManaged(&array, size);
    memset(array, 0, size);
    setvalue<<<...>>>(array, size/2, 5);
[Diagram: GPU Memory Mapping and CPU Memory Mapping of array, joined by the Interconnect; Page Fault on both the GPU side and the CPU side]

31 UNIFIED MEMORY ON PASCAL
GPU memory oversubscription
    void foo() {
        // Assume GPU has 16 GB memory
        // Allocate 32 GB
        char *data;
        size_t size = 32ULL*1024*1024*1024;   // 64-bit constant; an int product would overflow
        cudaMallocManaged(&data, size);        // 32 GB allocation
    }
Pascal supports allocations where only a subset of pages reside on GPU. Pages can be migrated to the GPU when hot.
Fails on Kepler/Maxwell

32 GPU OVERSUBSCRIPTION
Now possible with Pascal
Many domains can benefit from GPU memory oversubscription:
    Combustion: many species to solve for
    Quantum chemistry: larger systems
    Ray tracing: larger scenes to render
    Data / Graph Analysis: large graphs, data sets
11/16/

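33 GPU PERFORMANCE COMPARISON
Columns: P100, P40, M40, K40
Rows: Double Precision TFlop/s, Single Precision TFlop/s, Half Precision TFlop/s, Memory Bandwidth (GB/s), Memory Size
Memory Size: 16GB (P100), 24GB (P40), 12GB, 24GB (M40), 12GB (K40)
[Table: numeric cells for the other rows are missing apart from two entries marked NA]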

34 NVLINK TO CPU
Fully connected quad
120 GB/s per GPU bidirectional for peer traffic
40 GB/s per GPU bidirectional to CPU
Direct Load/store access to CPU Memory
High Speed Copy Engines for bulk data movement

35 SIMPLIFIED MEMORY MANAGEMENT CODE
CPU Code:
    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }
CUDA 6 Code with Unified Memory:
    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }

36 GREAT PERFORMANCE WITH UNIFIED MEMORY
RAJA: Portable C++ Framework for parallel-for style programming
RAJA uses Unified Memory for heterogeneous array allocations
Parallel forall loops run on device
"Excellent performance considering this is a "generic" version of LULESH with no architecture-specific tuning." -Jeff Keasler, LLNL
[Chart: LULESH Throughput (Million elements per second) vs Mesh size (45^3, 100^3, 150^3); CPU: 10-core Haswell, GPU: Tesla K40; GPU speedups 1.5x, 1.9x, 2.0x]
GPU: NVIDIA Tesla K40, CPU: Intel Haswell E GHz, single socket 10-core

37 UNIFIED MEMORY ON PASCAL
Concurrent CPU/GPU access to managed memory
    __global__ void mykernel(char *data) {
        data[1] = 'g';
    }
    void foo() {
        char *data;
        cudaMallocManaged(&data, 2);
        mykernel<<<...>>>(data);
        // no synchronize here
        data[0] = 'c';
        cudaFree(data);
    }
OK on Pascal: just a page fault
Concurrent CPU access to data on previous GPUs caused a fatal segmentation fault
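Code that has to run on both Pascal and earlier GPUs can ask the runtime whether concurrent CPU/GPU access to managed memory is supported before relying on it. A minimal sketch, assuming CUDA 8 or later for the `cudaDevAttrConcurrentManagedAccess` attribute; the helper names `supports_concurrent_managed_access` and `concurrent_touch` and the kernel `write_second` are hypothetical:

```cuda
#include <cuda_runtime.h>

__global__ void write_second(char *data)
{
    data[1] = 'g';
}

// Returns 1 if the device reports that CPU and GPU may touch managed
// memory at the same time, 0 otherwise (including on query failure).
int supports_concurrent_managed_access(int device)
{
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    return (err == cudaSuccess) ? value : 0;
}

// Writes data[1] on the GPU and data[0] on the CPU. On devices without
// concurrent access, the CPU write waits for the kernel to finish.
// Returns 1 if both writes landed, 0 if not, -1 on allocation failure.
int concurrent_touch(int device)
{
    char *data = nullptr;
    if (cudaMallocManaged(&data, 2) != cudaSuccess) {
        return -1;
    }
    data[0] = 0;
    data[1] = 0;
    write_second<<<1, 1>>>(data);
    if (!supports_concurrent_managed_access(device)) {
        cudaDeviceSynchronize();  // pre-Pascal path: CPU must not touch data yet
    }
    data[0] = 'c';                // on Pascal this is at most a page fault
    cudaDeviceSynchronize();
    int ok = (data[0] == 'c' && data[1] == 'g') ? 1 : 0;
    cudaFree(data);
    return ok;
}
```

The attribute can also be 0 on a Pascal GPU in some system configurations, which is why the sketch queries it rather than checking the compute capability.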

38 UNIFIED MEMORY ON PASCAL
System-Wide Atomics
    __global__ void mykernel(int *addr) {
        atomicAdd_system(addr, 10);
    }
    void foo() {
        int *addr;
        cudaMallocManaged(&addr, 4);
        *addr = 0;
        mykernel<<<...>>>(addr);
        __sync_fetch_and_add(addr, 10);
    }
Pascal enables system-wide atomics
    Direct support of atomics over NVLink
    Software-assisted over PCIe
System-wide atomics not available on Kepler / Maxwell

39 PERFORMANCE TUNING ON PASCAL
Explicit Memory Hints and Prefetching
Advise runtime on known memory access behaviors with cudaMemAdvise()
    cudaMemAdviseSetReadMostly: Specify read duplication
    cudaMemAdviseSetPreferredLocation: suggest best location
    cudaMemAdviseSetAccessedBy: initialize a mapping
Explicit prefetching with cudaMemPrefetchAsync(ptr, length, destDevice, stream)
    Unified Memory alternative to cudaMemcpyAsync
    Asynchronous operation that follows CUDA stream semantics
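The hints and the prefetch call listed above can be combined as in the following sketch. It assumes the `cudaMemAdvise` and `cudaMemPrefetchAsync` signatures that take an integer device id, as in CUDA 8 through CUDA 12; later toolkits changed these signatures. The kernel `scale` and the helper `advise_and_prefetch` are hypothetical names for this illustration.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *v, const float *coeff, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] *= coeff[0];
    }
}

// Fills v with 1.0f, scales it by 3.0f on the GPU, returns the last element.
float advise_and_prefetch(size_t n)
{
    int device = 0;
    cudaGetDevice(&device);
    size_t bytes = n * sizeof(float);
    float *v = nullptr, *coeff = nullptr;
    cudaMallocManaged(&v, bytes);
    cudaMallocManaged(&coeff, sizeof(float));
    for (size_t i = 0; i < n; ++i) v[i] = 1.0f;
    coeff[0] = 3.0f;

    // Hints and prefetches need a device with on-demand page migration.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (concurrent) {
        // coeff is only read by the kernel: allow read duplication.
        cudaMemAdvise(coeff, sizeof(float), cudaMemAdviseSetReadMostly, device);
        // v is mostly used on the GPU: suggest that as its home.
        cudaMemAdvise(v, bytes, cudaMemAdviseSetPreferredLocation, device);
        // Move v to the GPU ahead of the kernel instead of faulting it in.
        cudaMemPrefetchAsync(v, bytes, device, 0);
    }

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(v, coeff, n);

    if (concurrent) {
        // Bring the result back before the CPU reads it.
        cudaMemPrefetchAsync(v, bytes, cudaCpuDeviceId, 0);
    }
    cudaDeviceSynchronize();
    float result = v[n - 1];
    cudaFree(v);
    cudaFree(coeff);
    return result;
}
```

The prefetches run in the same stream as the kernel, so stream ordering keeps them correctly sequenced without extra synchronization.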

40 USE CASE: ON-DEMAND PAGING
Graph Algorithms
[Chart: Performance over GPU directly accessing host memory (zero-copy), Large Data Set; series: Baseline: migrate on first touch; Optimized: best placement in memory]
11/16/
