
GPU Architectures Patrick Neill May 30, 2014

Outline: CPU versus GPU (why are they different?); CUDA (terminology); GPU (Kepler/Maxwell); Graphics (tiled deferred rendering); Opportunities (what skills you should know).

CPU - Evolution Problem to solve over time

CPU - Priorities: Latency. Event-driven programming and complex control flow: branch prediction, speculative execution. Reduction of memory/HDD access: large caches. Small collection of active threads; anything greater than dual core is generally overkill. Less space devoted to math.

GPU - Evolution Problems over time

GPU - Priorities: Parallelization. Workloads related to games are massively parallel: 4K = 3840x2160 at ~60 fps ~= 500M pixel threads per second! Latency trade-off: sacrifices immediate access to data to allow advanced batching and scheduling of work. Computation: less complex control flow, more math; most space is devoted to math, with a smart thread scheduler to keep it busy.

Evolution of the Modern GPU: Fixed function (GeForce 256, Radeon R100): hardware T&L, Render Output Units / Raster Operations Pipeline (ROPs). Programmability (GeForce 2 -> 7): vertex and pixel shaders, evolved with DirectX Shader Model versions and Cg. Unification (GeForce 8): generic compute units, geometry shader. Generalization (Fermi, Kepler, AMD GCN): queuing, batching, advanced scheduling; compute focused.

Future (CPU, GPU, HPC): simpler and more power-efficient designs; integration of components; software/language needs; generalization; power efficiency; fast interconnects; heterogeneous computing.

Why do you care? Companies try to leverage their IP and are obviously biased (except me, of course - you can trust me). Understand what architectures are best for the problem you need to solve: N-body = GPU, GUI = CPU.

CUDA: Compute Unified Device Architecture. An NVIDIA-invented parallel language; OpenCL is similar. Turns off graphics-oriented features - disables the rasterizer, input assembler, and output merger. Turns on compute-oriented features: LD/ST to generic buffers, support for advanced scheduling features.

CUDA Constructs (OpenCL equivalents): Thread ~ work item; Block/CTA ~ work group; Grid ~ index space.

CUDA Constructs: Thread - per-thread local memory. Warp = 32 threads. Block/CTA = group of warps; threads in a CTA share through per-CTA shared memory, which provides intra-CTA sync/communication. Grid = group of CTAs that share the same kernel; global memory is used for cross-CTA communication. Block and grid shapes are defined by the user (compute) and should reflect the communication pattern (a minimal code sketch follows).
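
To make these constructs concrete, here is a minimal CUDA sketch (not from the slides; the kernel, sizes, and names are illustrative): each thread computes its position from threadIdx and blockIdx, a block/CTA cooperates through __shared__ memory and __syncthreads(), and the grid is sized to cover the whole problem.

#include <cuda_runtime.h>

// Illustrative kernel: each 256-thread CTA reverses its own tile of the input,
// staging the tile in per-CTA shared memory.
__global__ void reverseTiles(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // shared memory: one copy per CTA
    int local  = threadIdx.x;                       // index within the block/CTA
    int global = blockIdx.x * blockDim.x + local;   // index within the grid

    if (global < n)
        tile[local] = in[global];
    __syncthreads();                                // intra-CTA synchronization

    int reversed = blockIdx.x * blockDim.x + (blockDim.x - 1 - local);
    if (reversed < n)
        out[reversed] = tile[local];
}

int main(void)
{
    const int n = 1 << 20;                          // multiple of the CTA size
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    // 256 threads per CTA = 8 full warps; the grid covers all n elements.
    reverseTiles<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}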

Warp

Compute - Integration: Utilize existing libraries: cuFFT (10x faster), cuBLAS (6-17x faster), cuSPARSE (8x faster). Application integration: DIY, MATLAB, Mathematica, CUDA, OpenCL, DirectX Compute. Source: http://developer.nvidia.com/gpu-accelerated-libraries

Compute - Integration: OpenACC and cuFFT examples.

OpenACC:

#include <stdio.h>
#define N 1000000

int main(void)
{
    double pi = 0.0;
    long i;

    #pragma acc region for
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }

    printf("pi=%16.15f\n", pi / N);
    return 0;
}

cuFFT:

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex) * NX * NY * NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex) * NX * NY * NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);

Dynamic Parallelism: Allow the GPU to create work for itself; remove CPU -> GPU sync points.

Dynamic Parallelism
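
A minimal sketch of what this looks like in code (hypothetical kernels, not from the slides; dynamic parallelism needs compute capability 3.5+ and nvcc's -rdc=true): the parent kernel launches the child directly from the device, so no CPU round trip is needed to issue the follow-up work.

// Build (assumption): nvcc -arch=sm_35 -rdc=true -lcudadevrt

// Child kernel: hypothetical follow-up work.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Parent kernel: decides on the GPU how much work is needed and launches it
// itself, removing a CPU -> GPU synchronization point.
__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}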

Why do you care? The constructs reflect: warps (the low-level primitive the HW operates on), communication boundaries, and read/write coherency. Blocks/CTAs are made up of an integer number of warps. AMD/Intel HPC chips are similar. Use existing libraries - don't reinvent the wheel.

GPU Yesterday: GeForce 3. Source: http://hexus.net/tech/reviews/graphics/251-hercules-geforce-3-ti500/?page=2

Kepler GK110

Kepler GPC: Similar to a CPU core - stamped out for scaling. Contains multiple SMXs. Graphics features: PolyMorph Engine, Raster Engine, independent execution. Compute features: independent kernels, ~CTA (work group).

Kepler SMX: Schedule threads efficiently, maximize utilization. 192 CUDA cores, 64 DP units, 32 SFUs, 32 LD/ST units. Perfect utilization? 320 threads available per clock.

Kepler Scheduler

Maxwell GM107

Maxwell GPC: 5 SMMs per GPC (Kepler had 2 SMXs). ~90% of the performance of an SMX, but *much* smaller and more power efficient. Scheduling changes.

Maxwell GPC

Maxwell SMM: Schedule threads efficiently, maximize utilization. 128 CUDA cores, 32 SFUs, 32 LD/ST units. Warp scheduling: each scheduler owns its cores and has its own instruction buffer; still a 2x dispatch unit.

Maxwell Utilization: Avoid divergence! Restructure your program to avoid divergence within a warp. Size of CTA matters! The programmer thinks in CTAs, but the HW operates on warps: a CTA of 48 threads on a 32-thread-warp architecture gives 48/32 = 1.5 warps, so the second warp is only partially occupied. Scheduler: the 2 dispatch units need two independent instructions per warp; a series of dependent instructions = poor utilization. (A sketch of the divergence point follows.)
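
An illustrative CUDA sketch of the divergence point (not from the slides; the two kernels do different work and only illustrate the branching pattern): in the first, even and odd lanes of the same warp take different paths, so the warp executes both; in the second, the branch depends only on the warp index, so each warp stays coherent. Launching with a CTA size that is a multiple of 32 (e.g. 64 or 128 rather than 48) keeps every warp fully populated.

__device__ float process(float x) { return x * x; }   // placeholder work

// Divergent: lanes within one warp disagree on the branch, so both paths run.
__global__ void divergentKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)          // even/odd lanes split within a warp
        out[i] = process(in[i]);
    else
        out[i] = -process(in[i]);
}

// Coherent: the branch depends on the warp index, so all 32 lanes agree.
__global__ void coherentKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpId = threadIdx.x / 32;
    if (warpId % 2 == 0)               // whole warps take the same path
        out[i] = process(in[i]);
    else
        out[i] = -process(in[i]);
}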

Maxwell Utilization: Latency. Memory access ranges from ~10 cycles up to ~800 cycles; even CUDA cores take a few cycles to complete an instruction. Contention for limited resources: 64K registers per SM / 64 warps / 32 threads per warp = 32 registers per thread before the number of resident warps per SM decreases. A high number of warps is vital for high utilization. Balance the workload across Tex, FP, SFU, and LD/ST; avoid peak utilization of any one unit. (A back-of-the-envelope check follows.)
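
The register arithmetic can be checked with a quick back-of-the-envelope calculation (a sketch assuming 64K 32-bit registers per SM and a 64-warp residency limit, ignoring allocation granularity; the 48-registers-per-thread kernel is hypothetical):

#include <stdio.h>

int main(void)
{
    const int regsPerSM      = 64 * 1024;   // 64K registers per SM
    const int maxWarpsPerSM  = 64;          // residency limit
    const int threadsPerWarp = 32;

    // Registers each thread may use if all 64 warps are to stay resident.
    int regsPerThread = regsPerSM / (maxWarpsPerSM * threadsPerWarp);   // = 32

    // A hypothetical kernel using more registers keeps fewer warps resident.
    int regsUsed      = 48;
    int residentWarps = regsPerSM / (regsUsed * threadsPerWarp);        // = 42

    printf("%d regs/thread -> %d warps; %d regs/thread -> %d warps\n",
           regsPerThread, maxWarpsPerSM, regsUsed, residentWarps);
    return 0;
}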

Kepler SMX: Each CUDA core can do an FMA = 2 FLOP. 192 CUDA cores per SMX, 15x2 SMXs (Titan Z) = 11520 FLOP/clock. At 705 MHz (876 MHz boost) that is 8.1 TFLOPs (10 TFLOPs boost). For comparison: Intel Xeon E5-2690 = 371 GFLOPs (AVX); Xbox One = 1.32 TFLOPs; PS4 = 1.84 TFLOPs.

Infiltrator Demo: Tiled deferred shading, temporal anti-aliasing, tessellation and displacement, millions of particles colliding with and lit by the environment, physically based materials/lighting (static GI).

Deferred shading: Render to a G-Buffer (material, normal, depth). Defer shading: wait until all objects are rendered, then use the pixel position plus the G-Buffer. Shading is done exactly once per pixel.

Deferred shading

Tiled Deferred Shading (DICE 2011 GDC presentation): Render to a G-Buffer (materials, normal, depth) per pixel. Segment the screen into tiles (think parallel). Compute light-tile intersections and output the culled light list to a buffer. Perform the per-pixel light pass: (materials, normal, depth) + pixel position; fetch the list of lights hitting the tile containing the pixel; iterate and light. (A sketch follows.)
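
A CUDA-style sketch of that flow (purely illustrative; the structures, the simplistic screen-space light test, and the flat shading are placeholders, not the DICE implementation): one 16x16 CTA per screen tile culls the global light list into shared memory, then each thread shades its pixel from the culled list.

#define TILE 16
#define MAX_LIGHTS_PER_TILE 256

struct GBufferTexel { float3 normal, albedo; float depth; };
struct Light        { float2 screenPos; float radius; float3 color; };

// Launch (assumption): dim3 block(TILE, TILE), grid over the screen in tiles.
__global__ void tiledDeferredShade(const GBufferTexel *gbuffer,
                                   const Light *lights, int numLights,
                                   float3 *outColor, int width, int height)
{
    __shared__ int tileLights[MAX_LIGHTS_PER_TILE];
    __shared__ int tileLightCount;

    int x   = blockIdx.x * TILE + threadIdx.x;
    int y   = blockIdx.y * TILE + threadIdx.y;
    int tid = threadIdx.y * TILE + threadIdx.x;

    if (tid == 0) tileLightCount = 0;
    __syncthreads();

    // Tile bounds in screen space.
    float2 tmin = make_float2((float)(blockIdx.x * TILE), (float)(blockIdx.y * TILE));
    float2 tmax = make_float2(tmin.x + TILE, tmin.y + TILE);

    // Each thread tests a strided slice of the light list against this tile.
    for (int i = tid; i < numLights; i += TILE * TILE) {
        Light l = lights[i];
        float cx = fmaxf(tmin.x, fminf(l.screenPos.x, tmax.x));
        float cy = fmaxf(tmin.y, fminf(l.screenPos.y, tmax.y));
        float dx = l.screenPos.x - cx, dy = l.screenPos.y - cy;
        if (dx * dx + dy * dy <= l.radius * l.radius) {
            int slot = atomicAdd(&tileLightCount, 1);
            if (slot < MAX_LIGHTS_PER_TILE)
                tileLights[slot] = i;
        }
    }
    __syncthreads();

    // Shade this pixel using only the lights that touch its tile.
    if (x < width && y < height) {
        GBufferTexel g = gbuffer[y * width + x];
        float3 c = make_float3(0.f, 0.f, 0.f);
        int count = min(tileLightCount, MAX_LIGHTS_PER_TILE);
        for (int i = 0; i < count; ++i) {
            Light l = lights[tileLights[i]];
            // Placeholder shading: modulate albedo by the light color.
            c.x += g.albedo.x * l.color.x;
            c.y += g.albedo.y * l.color.y;
            c.z += g.albedo.z * l.color.z;
        }
        outColor[y * width + x] = c;
    }
}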

Tiled Deferred Shading

Opportunities: NVIDIA/AMD/Intel (Samsung/Qualcomm/Apple): strong C/C++, strong OS/algorithms fundamentals, parallel programming with an emphasis on performance; compiler experience a plus, graphics knowledge a plus. Other industries: games, seismic processing, biochemistry simulations, weather/climate modeling, CAD.

Questions? pneill@nvidia.com