GPU ARCHITECTURE Chris Schultz, June 2017

MISC
All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA.

OUTLINE
- CPU versus GPU: why are they different?
- CUDA terminology
- GPU: Pascal
- Opportunities: what skills you should know

CPU Problems Solved Over Time
- Complex code with all sorts of different demands and use cases.

CPU Priorities
- Latency: event-driven programming, managing complex control flow (branch prediction & speculative execution)
- Reduction of memory access: large caches
- Small collection of active threads
- Results in less space devoted to math

GPU Problems Over Time
- Real-time graphics and simulations
- Recent boom of AI

GPU Priorities
- Parallelization: work-loads related to games are massively parallel
  - 4K = 3840x2160 at ~60 fps ~= 500M threads per sec! (quick arithmetic check below)
  - SOL of VR for 2018 -> 4K @ 120 Hz ~= 1B threads!
- Latency: trade-off; advanced batching and scheduling of data
- Computation: less complex control flow, more math; most space devoted to math, with a smart thread scheduler to keep it busy

FUTURE
- CPU: simpler, more power efficient (IoT); integration of components == cheaper
- GPU: perf/W; low latency (VR requirement); HPC co-processors: power efficiency! (Phi, Tesla, FirePro); fast interconnects (NVLink, future PCI-E); heterogeneous computing
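A quick sanity check of the thread-count arithmetic above, assuming one thread per pixel per frame (my own sketch, not from the slides):

#include <stdio.h>
int main(void) {
    long long px_per_4k_frame = 3840LL * 2160;                       /* ~8.3M pixels */
    printf("4K @ 60 Hz : ~%lld threads/s\n", px_per_4k_frame * 60);  /* ~0.5B */
    printf("4K @ 120 Hz: ~%lld threads/s\n", px_per_4k_frame * 120); /* ~1B */
    return 0;
}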

WHY CARE?
- Understand what architectures are best for the problem you need to solve:
  - FEM/graphics -> co-processor/GPU
  - GUI/OS -> CPU
  - DNN -> GPU/co-processor/TPU?
- It's a spectrum; don't force the problem onto a poorly suited architecture
- Example: deep neural networks

CUDA
- Compute Unified Device Architecture: NVIDIA's parallel language (OpenCL is similar)
- API/language focuses on compute features: LD/ST to generic buffers, support for advanced scheduling features
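For readers new to CUDA, a minimal kernel and launch look roughly like this (my own illustrative sketch; the kernel name, sizes, and scaling factor are arbitrary):

#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) x[i] *= a;                           /* guard against running past the end */
}

int main(void) {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));            /* device buffer (initialization omitted) */
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);  /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}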

CUDA Constructs
- Thread ~ work item
- Block/CTA ~ work group
- Grid ~ index space

CUDA Constructs
- Thread: per-thread local memory
- Warp = 32 threads: share through local memory
- Block/CTA = group of warps: shared memory per CTA for intra-CTA sync/communication
- Grid = group of CTAs that share the same kernel: global memory for cross-CTA communication
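To make the construct mapping concrete, here is a sketch (mine, not from the slides) of a kernel in which each CTA cooperates through shared memory and __syncthreads(), then writes one result per CTA to global memory; the 256-thread block size is an assumption:

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];                      /* shared memory: visible to one CTA */
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];   /* each thread loads one element */
    __syncthreads();                                /* intra-CTA synchronization */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* tree reduction within the CTA */
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];         /* per-CTA result to global memory */
}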

CUDA Warp

CUDA Compute Integration
- Utilize existing libraries: cuFFT (10x faster), cuBLAS (6-17x faster), cuSPARSE (8x faster)
- Application integration: MATLAB, Mathematica
- DIY: CUDA, OpenCL, DirectX Compute
Source: http://developer.nvidia.com/gpu-accelerated-libraries
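As one example of the "use existing libraries" point, a minimal sketch (mine, not from the slides) of calling cuBLAS SAXPY (y = a*x + y) on device arrays; data initialization and error checking are omitted:

#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1 << 20;
    const float alpha = 2.0f;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));            /* device arrays left uninitialized here */

    cublasHandle_t handle;
    cublasCreate(&handle);                          /* create the library context */
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1); /* y = alpha * x + y */
    cublasDestroy(handle);

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}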

OpenACC

#include <stdio.h>
#define N 1000000
int main(void) {
    double pi = 0.0f;
    long i;
    #pragma acc region for
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi=%16.15f\n", pi / N);
    return 0;
}

CUDA Compute Integration: cuFFT

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex) * NX * NY * NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex) * NX * NY * NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);

WHY CARE?
Constructs reflect:
- Warps: communication boundaries, read/write coherency, the low-level primitive the HW operates on
- Blocks/CTAs are made up of an integer # of warps (48 threads = 2 warps, not 1.5)
- AMD/Intel HPC chips have similar constructs
Use existing libraries: don't reinvent the wheel, better use of your time
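The integer-warp rounding above can be written as a one-line helper (my own illustration, not from the slides):

/* Warps per CTA always round up to whole 32-thread warps. */
int warpsPerCTA(int threadsPerCTA) {
    return (threadsPerCTA + 31) / 32;   /* e.g. 48 threads -> 2 warps */
}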

GPU ARCHITECTURE: Evolution
- Fixed function (GeForce 256, Radeon R100): hardware T&L, Render Output Units / Raster Operations Pipeline (ROPs)
- Programmability (GeForce 2 -> 7): vertex and pixel shaders
- Unification (GeForce 8): generic compute units
- Generalization (Fermi, Maxwell, AMD GCN): queuing, batching, advanced scheduling; compute focused

GPU ARCHITECTURE: GPU Last Week (GeForce 3)
Source: http://hexus.net/tech/reviews/graphics/251-hercules-geforce-3-ti500/?page=2

GPU ARCHITECTURE: GPU Yesterday, GTX 980 (GM204)

GPU ARCHITECTURE: GTX 1080 (GP104)

PASCAL GPC
- Similar to a CPU core: stamp out for scaling; contains multiple SMs
- Graphics features: PolyMorph Engine, Raster Engine
- Compute features: ~CTA (work group)

PASCAL SM
- Schedule threads efficiently, maximize utilization
- 4 warp schedulers, 2 dispatch units per scheduler
- Warp == 32 threads
- 1 SM can handle 64 warps
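These per-SM limits can be queried at runtime; a small sketch (mine, not from the slides) using the CUDA runtime API, assuming device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                      /* query device 0 */
    printf("%s\n", p.name);
    printf("SMs: %d, warp size: %d\n", p.multiProcessorCount, p.warpSize);
    printf("max resident threads/SM: %d (= %d warps)\n",
           p.maxThreadsPerMultiProcessor,
           p.maxThreadsPerMultiProcessor / p.warpSize);  /* 2048 / 32 = 64 on Pascal */
    printf("registers/SM: %d\n", p.regsPerMultiprocessor);
    return 0;
}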

PASCAL Warp Scheduler

PASCAL Utilization: Avoid Divergence!
- Shuffle your program to avoid divergence within a warp (see the sketch below)
- Size of CTA matters! Programmer thinks in CTAs, HW operates on warps
  - CTA size of 48 threads: 48/32 = 1.5 warps, so the second warp is partially occupied
- Scheduler: the 2 dispatch units need two independent instructions per warp
  - Series of dependent instructions = poor utilization
  - Poor instruction mix (heavy on a certain type) == poor utilization
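An illustrative pair of kernels (mine, not from the slides): in the first, the branch condition varies within each warp, so both paths execute serially; in the second, the condition is uniform across each warp and there is no intra-warp divergence:

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                         /* odd/even lanes diverge within every warp */
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}

__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)  /* constant across each warp: no divergence */
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}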

PASCAL Utilization
- Latency: memory access ~10 cycles -> 800 cycles; CUDA cores take a few cycles to complete
- # of warps on an SM is tied to usage of the register file
  - A complex program with a high # of registers fills up the RF
  - Decreases the # of warps available on the SM for issue (an occupancy sketch follows the table below)
- Balance the work-load: Tex versus FP versus SFU versus LD/ST; avoid peak utilization of any one unit

PERF #'S
                      PRICE ($)                       TFLOPS (SP)   TFLOPS (DP)   GFLOPS/W (SP)   GFLOPS/W (DP)
GTX 1080              599.00                          8.23          0.257         45.6            1.43
NVIDIA GV100          ??? (Tesla P100 was 5599.00)    15            7.5           50              25
Intel Xeon Phi 7290   6254.00                         6             3.5           24.5            14.3
Proj. Scorpio         ???                             6             ???           ???             ???
PS4 Pro               399.99                          4.12          ???           ???             ???
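To relate register pressure to resident warps in practice, the CUDA occupancy API can report how many CTAs of a given kernel fit on one SM; a sketch (mine; the kernel body and block size are placeholders):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void heavyKernel(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 2.0f;                  /* stand-in for a register-hungry body */
}

int main(void) {
    int blockSize = 256, maxBlocksPerSM = 0;
    /* How many CTAs of this kernel fit on one SM, given its register/shared-memory usage? */
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, heavyKernel,
                                                  blockSize, 0 /* dynamic smem */);
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int residentWarps = maxBlocksPerSM * blockSize / p.warpSize;
    printf("resident warps/SM: %d of %d\n",
           residentWarps, p.maxThreadsPerMultiProcessor / p.warpSize);
    return 0;
}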

WHAT'S GOOD ENOUGH?
- GPU demands are never-ending: physically based materials for lighting, indirect illumination, physically based materials for interaction; 4K? 5K? 8K?
- VR doubles the required work and needs to hit 120 fps+
- API demands are never-ending: new low-level APIs (Vulkan, DX12) create their own problems

OPPORTUNITIES
- NVIDIA/AMD/Intel/Google/Qualcomm/Apple:
  - Very strong C/C++ fundamentals
  - Very strong OS/algorithms fundamentals
  - Very strong graphics fundamentals
  - Parallel programming with an emphasis on performance
  - Compiler experience a plus
- Other industries: games industry, biochemistry simulations, weather/climate modeling, CAD

QUESTIONS?
cschultz@nvidia.com
http://www.nvidia.com/object/careers.html