LECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14th, 2016

ACCELERATED COMPUTING
CPU: optimized for serial tasks. GPU accelerator: optimized for parallel tasks.

ACCELERATED COMPUTING
CPU Strengths
- Very large main memory
- Very fast clock speeds
- Latency optimized via large caches
- Small number of threads can run very quickly

CPU Weaknesses
- Relatively low memory bandwidth
- Cache misses very costly
- Low performance per watt

ACCELERATED COMPUTING
GPU Strengths
- High-bandwidth main memory
- Latency tolerant via parallelism
- Significantly more compute resources
- High throughput
- High performance per watt

GPU Weaknesses
- Relatively low memory capacity
- Low per-thread performance

SPEED V. THROUGHPUT
Speed and throughput are different goals; which is better depends on your needs. (Images from Wikimedia Commons via Creative Commons.)

HOW GPU ACCELERATION WORKS
The application code is split: compute-intensive functions run on the GPU, while the rest of the sequential code stays on the CPU.

3 WAYS TO ACCELERATE APPLICATIONS
- Libraries: drop-in acceleration, highest performance
- Directives: high productivity
- Programming Languages: maximum flexibility

OPENACC DIRECTIVES
Manage data movement, initiate parallel execution, and optimize loop mappings:

    #pragma acc data copyin(x,y) copyout(z)
    {
        #pragma acc parallel
        {
            #pragma acc loop gang vector
            for (i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
                ...
            }
        }
        ...
    }

Incremental, single source, interoperable, and performance portable (CPU, GPU, MIC).
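For reference, the fragment above extends to a minimal compilable program along the following lines; the array size and initialization are illustrative assumptions, and it builds with an OpenACC compiler such as pgcc -acc:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));
        float *z = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* copy x and y to the device, copy z back when the region ends */
        #pragma acc data copyin(x[0:n],y[0:n]) copyout(z[0:n])
        {
            #pragma acc parallel
            {
                #pragma acc loop gang vector
                for (int i = 0; i < n; ++i) {
                    z[i] = x[i] + y[i];
                }
            }
        }

        printf("z[0] = %f\n", z[0]);
        free(x); free(y); free(z);
        return 0;
    }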

OPENACC DIRECTIVES
Data management:

    #pragma acc enter data create(A[0:nx*ny],Aref[0:nx*ny],Anew[0:nx*ny],rhs[0:nx*ny])
    ...
    #pragma acc update device(A[(iy_start-1)*nx:((iy_end-iy_start)+2)*nx])
    #pragma acc update device(rhs[iy_start*nx:(iy_end-iy_start)*nx])
    while ( error > tol && iter < iter_max ) {
        ...
        iter++;
    }
    #pragma acc update self(A[(iy_start-1)*nx:((iy_end-iy_start)+2)*nx])
    ...
    #pragma acc exit data delete(A,Aref,Anew,rhs)

OPENACC DIRECTIVES
Parallel execution:

    real* restrict const A    = (real*) malloc(nx*ny*sizeof(real));
    real* restrict const Aref = (real*) malloc(nx*ny*sizeof(real));
    ...
    #pragma acc parallel loop present(A,Anew)
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
        }
    }
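The copy loop above is one half of a Jacobi iteration; a sketch of the matching stencil update, assuming the 2D Poisson setup suggested by the variable names (nx, rhs, Anew), might look like:

    #pragma acc parallel loop present(A,Anew,rhs)
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            /* new value from the four neighbors and the right-hand side */
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix]
                           - (A[iy*nx+ix+1] + A[iy*nx+ix-1]
                            + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        }
    }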

INTRODUCING TESLA P100
New GPU architecture to enable the world's fastest compute node:
- Pascal Architecture: highest compute performance
- NVLink: GPU interconnect for maximum scalability
- HBM2 Stacked Memory: unifying compute and memory in a single package
- Page Migration Engine: simple parallel programming with 512 TB of virtual memory

PASCAL ARCHITECTURE

TESLA P100 GPU: GP100
- 56 SMs, 3584 CUDA cores
- 5.3 TF double precision, 10.6 TF single precision, 21.2 TF half precision
- 16 GB HBM2, 732 GB/s bandwidth
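These figures can be verified at runtime with the CUDA device query API; a minimal sketch:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        /* peak bandwidth = 2 (DDR) * memory clock (kHz) * bus width (bytes) */
        printf("%s: %d SMs, %.1f GB memory, %.0f GB/s peak bandwidth\n",
               prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / 1e9,
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1e6);
        return 0;
    }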

GPU PERFORMANCE COMPARISON

                                P100 (SXM2)     M40            K40
    Double/Single/Half TFlop/s  5.3/10.6/21.2   0.2/7.0/NA     1.4/4.3/NA
    Memory Bandwidth (GB/s)     732             288            288
    Memory Size                 16 GB           12 GB, 24 GB   12 GB
    L2 Cache Size               4096 KB         3072 KB        1536 KB
    Base/Boost Clock (MHz)      1328/1480       948/1114       745/875
    TDP (Watts)                 300             250            235

GP100 SM
- CUDA cores: 64
- Register file: 256 KB
- Shared memory: 64 KB
- Active threads: 2048
- Active blocks: 32
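How close a particular kernel comes to the 2048-thread/32-block per-SM limits can be checked with the CUDA occupancy API; a minimal sketch using a hypothetical kernel mykernel:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void mykernel(float *x) { /* ... */ }

    int main(void)
    {
        int blockSize = 256, numBlocks = 0;
        /* max resident blocks per SM for this kernel at this block size */
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, mykernel,
                                                      blockSize, 0);
        printf("%d blocks/SM = %d threads/SM\n", numBlocks, numBlocks * blockSize);
        return 0;
    }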

IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast.

    Feature            Half precision    Single precision   Double precision
    Layout             s5.10             s8.23              s11.52
    Issue rate         pair every clock  1 every clock      1 every 2 clocks
    Subnormal support  Yes               Yes                Yes
    Atomic addition    Yes               Yes                Yes

HALF-PRECISION FLOATING POINT (FP16)
- 16 bits: 1 sign bit, 5 exponent bits, 10 fraction bits
- Dynamic range of roughly 2^40
- Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
- Subnormals at full speed: 1024 values from 2^-24 to 2^-15
- Special values: +/- Infinity, Not-a-Number

Use cases: deep learning training, radio astronomy, sensor data, image processing.

HALF-PRECISION FLOATING POINT (FP16)
Example:

    #include <cuda_fp16.h>

    __global__ void haxpy(float alpha, const half2 *__restrict__ x,
                          half2 *__restrict__ y, int n)
    {
        half  alphah = __float2half(alpha);
        half2 alpha2 = __halves2half2(alphah, alphah);
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x) {
            // y[i] = alpha * x[i] + y[i]
            y[i] = __hfma2(alpha2, x[i], y[i]);
        }
    }
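A host-side launch for this kernel might look like the following sketch; the sizes and launch configuration are illustrative assumptions, and n2 counts half2 elements, i.e. pairs of FP16 values:

    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int n  = 1 << 20;                  /* total FP16 elements (illustrative) */
        int n2 = n / 2;                    /* haxpy works on half2 pairs */
        half2 *x, *y;
        cudaMalloc(&x, n2 * sizeof(half2));
        cudaMalloc(&y, n2 * sizeof(half2));
        /* ... fill x and y, e.g. cudaMemcpy from host buffers ... */
        haxpy<<<256, 128>>>(2.0f, x, y, n2);
        cudaDeviceSynchronize();
        cudaFree(x); cudaFree(y);
        return 0;
    }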

NVLINK

NVLINK
- P100 supports 4 NVLinks, each 40 GB/s bidirectional
- Up to 94% bandwidth efficiency
- Supports reads/writes/atomics to peer GPUs
- Supports read/write access to an NVLink-enabled CPU
- Links can be ganged for higher bandwidth
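In CUDA, peer loads/stores/atomics between GPUs go through the standard peer-access API, independent of whether the link is NVLink or PCIe; a minimal sketch for two devices:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);   /* can GPU 0 access GPU 1? */
        if (canAccess) {
            cudaSetDevice(0);
            /* kernels on GPU 0 may now load/store/atomic directly into GPU 1 memory */
            cudaDeviceEnablePeerAccess(1, 0);
        }
        printf("peer access 0 -> 1: %s\n", canAccess ? "yes" : "no");
        return 0;
    }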

NVLINK - GPU CLUSTER
- Two fully connected quads, connected at corners
- 160 GB/s per GPU bidirectional to peers
- Load/store access to peer memory
- Full atomics to peer GPUs
- High-speed copy engines for bulk data copy
- PCIe to/from CPU

NVLINK TO CPU
- Fully connected quad
- 120 GB/s per GPU bidirectional for peer traffic
- 40 GB/s per GPU bidirectional to CPU
- Direct load/store access to CPU memory
- High-speed copy engines for bulk data movement

UNIFIED MEMORY

PAGE MIGRATION ENGINE
- Supports virtual memory demand paging
- 49-bit virtual addresses: sufficient to cover the 48-bit CPU address space plus all GPU memory
- GPU page faulting capability: can handle thousands of simultaneous page faults
- Up to 2 MB page size: better TLB coverage of GPU memory
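Because pages are faulted in on demand, a managed allocation may exceed physical GPU memory. A sketch of such oversubscription; the 32 GB size is an illustrative assumption chosen to exceed the P100's 16 GB:

    #include <cuda_runtime.h>

    __global__ void touch(char *p, size_t n)
    {
        size_t stride = (size_t)blockDim.x * gridDim.x;
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            p[i] = 1;   /* first touch of each page triggers a GPU page fault */
    }

    int main(void)
    {
        size_t n = 32ULL << 30;            /* 32 GB, larger than P100 memory */
        char *p;
        cudaMallocManaged(&p, n);
        touch<<<1024, 256>>>(p, n);
        cudaDeviceSynchronize();
        cudaFree(p);
        return 0;
    }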

UNIFIED MEMORY
Single pointer for CPU and GPU.

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

GPU code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        cudaFree(data);
    }

UNIFIED MEMORY ON PRE-PASCAL
A kernel launch triggers bulk page migrations: after cudaMallocManaged, the managed pages are migrated over PCIe to GPU memory (~0.3 TB/s) when the kernel launches, and CPU accesses afterwards page-fault them back to system memory (~0.1 TB/s).

UNIFIED MEMORY ON PASCAL
On-demand page migrations: GPU page faults trigger migration of individual pages over the interconnect into GPU memory (~0.7 TB/s), and a virtual address can instead be mapped to system memory (~0.1 TB/s) when migration is not worthwhile.
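When the access pattern is known in advance, fault-driven migration can be avoided by prefetching; cudaMemPrefetchAsync arrived with CUDA 8 alongside Pascal. A sketch, with the kernel and buffer size as illustrative assumptions:

    #include <cuda_runtime.h>

    __global__ void inc(char *p, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;
    }

    int main(void)
    {
        size_t n = 256UL << 20;            /* 256 MB, illustrative */
        int device;
        cudaGetDevice(&device);
        char *data;
        cudaMallocManaged(&data, n);
        /* migrate pages to the GPU before the kernel runs, avoiding faults */
        cudaMemPrefetchAsync(data, n, device, 0);
        inc<<<(int)((n + 255) / 256), 256>>>(data, n);
        /* migrate results back ahead of host access */
        cudaMemPrefetchAsync(data, n, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }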

LINUX AND UNIFIED MEMORY
ANY memory will be available for the GPU*.

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }

GPU code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();
        use_data(data);
        free(data);
    }

*on supported operating systems

INTRODUCING TESLA P100
New GPU architecture to enable the world's fastest compute node:
- Pascal Architecture: highest compute performance
- NVLink: GPU interconnect for maximum scalability
- HBM2 Stacked Memory: unifying compute and memory in a single package
- Page Migration Engine: simple parallel programming with 512 TB of virtual memory

Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal