Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU)


Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU)
Aaron Lefohn, Intel / University of Washington
Mike Houston, AMD / Stanford

What's In This Talk?
Overview of parallel programming models used in real-time graphics products and research
- Abstraction, execution, synchronization
- Shaders, task systems, conventional threads, graphics pipeline, GPU compute languages
Parallel programming models covered:
- Vertex shaders
- Conventional thread programming
- Task systems
- Graphics pipeline
- GPU compute languages (ComputeShader, OpenCL, CUDA)
Discussion of the strengths/weaknesses of the different models and how each model maps to the architectures

What Goes into a Game Frame? (2 years ago)

Computation graph for Battlefield: Bad Company, provided by DICE

Data Parallelism

Task Parallelism

Graphics Pipelines
Pipeline flow: Input Assembly -> Vertex Shading -> Primitive Setup -> Geometry Shading -> Rasterization -> Pixel Shading -> Output Merging

Remember: Our Enthusiast Chip
Figure by Kayvon Fatahalian

Hardware Resources (from Kayvon's Talk)
- Core
- Execution context
- SIMD functional units
- On-chip memory
Figure by Kayvon Fatahalian

Abstraction
Abstraction enables portability and system optimization
- E.g., dynamic load balancing, producer-consumer, SIMD utilization
Lack of abstraction enables architecture-specific user optimization
- E.g., multiple execution contexts jointly building an on-chip data structure
Remember: when a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource

Definitions: Execution
Task: a logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task)
Concurrent execution: multiple tasks that may execute simultaneously (because they are logically independent)
Parallel execution: multiple tasks whose execution contexts are guaranteed to be live simultaneously (because you want them to be, for locality, synchronization, etc.)

Synchronization
Synchronization: restricting when tasks are permitted to execute
The granularity of permitted synchronization determines at which granularity the system allows the user to control scheduling

Vertex Shaders: Pure Data Parallelism
Execution: concurrent execution of identical per-vertex tasks
What is abstracted? Cores, execution contexts, SIMD functional units, memory hierarchy
What synchronization is allowed? Between draw calls

Shader (Data-Parallel) Pseudocode

    concurrent_for( i = 0 to numVertices - 1 ) {
        execute vertex shader
    }

This type of programming is sometimes called SPMD (Single Program Multiple Data): instance the same program multiple times and run it on different data
Many names: shader-style, kernel-style, SPMD

Conventional Thread Parallelism (e.g., pthreads)
Execution: parallel execution of N tasks with N execution contexts
What is abstracted? Nothing (ignoring preemption)
Where is synchronization allowed? Between any execution contexts, at various granularities

Conventional Thread Parallelism
Directly program:
- N execution contexts
- N SIMD ALUs / execution context
To use SIMD ALUs:

    __m128 a_line, b_line, r_line;
    r_line = _mm_mul_ps(a_line, b_line);

Powerful, but dangerous...
Figure by Kayvon Fatahalian
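To make the "directly program everything" point concrete, here is a minimal pthreads + SSE sketch (not from the talk; the thread count, array size, and all names are illustrative). The programmer explicitly creates the execution contexts and explicitly drives the SIMD lanes:

    #include <pthread.h>
    #include <xmmintrin.h>              // SSE intrinsics: __m128, _mm_mul_ps

    #define NUM_THREADS 4
    #define N (1 << 20)                 // assumed to be a multiple of 4 * NUM_THREADS

    static float a[N], b[N], r[N];
    typedef struct { int begin, end; } Range;

    static void* worker(void* arg) {    // one execution context per pthread
        Range* range = (Range*)arg;
        for (int i = range->begin; i < range->end; i += 4) {
            __m128 a_line = _mm_loadu_ps(&a[i]);      // SIMD lanes used explicitly,
            __m128 b_line = _mm_loadu_ps(&b[i]);      // four floats at a time
            __m128 r_line = _mm_mul_ps(a_line, b_line);
            _mm_storeu_ps(&r[i], r_line);
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        Range ranges[NUM_THREADS];
        int chunk = N / NUM_THREADS;                  // static partition: no load balancing
        for (int t = 0; t < NUM_THREADS; ++t) {
            ranges[t].begin = t * chunk;
            ranges[t].end = (t + 1) * chunk;
            pthread_create(&threads[t], NULL, worker, &ranges[t]);
        }
        for (int t = 0; t < NUM_THREADS; ++t)
            pthread_join(threads[t], NULL);
        return 0;
    }

Nothing is abstracted here: the code hard-codes the core count and the SIMD width, which is exactly why it does not scale across architectures without rewriting.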

Game Workload Example

Typical Game Workload
Subsystems are given a percentage of the overall time budget:
- Input, Miscellaneous: 5%
- Physics: 30%
- AI, Game Logic: 10%
- Graphics: 50%
- Audio: 5%
GPU workload: rendering
Slide by Tim Foley

Parallelism Anti-Pattern #1
Assign each subsystem to a SW thread (figure: one thread each for Input, Physics, AI, and Graphics across frame N)
Problems:
- Communication/synchronization
- Load imbalance
- Preemption leads to thrashing
Don't do this!
Slide by Tim Foley

Parallelism Anti-Pattern #2
Group subsystems into HW threads (figure: thread 0 runs Input/Physics/AI/Audio, thread 1 runs Graphics, per frame)
Problems:
- Communication/synchronization
- Load imbalance
- Poor scalability (4, 8, ... HW threads)
Don't do this either!
Slide by Tim Foley

Better Solution: Find Concurrency
Identify where ordering constraints are needed and run concurrently between constraints
Visualize as a graph (figure: Input, Physics, AI, Graphics, and Audio tasks connected by dependency edges)
Slide by Tim Foley

And Distribute Work to Threads
Dynamically distribute medium-grained concurrent tasks to hardware threads (virtualize/abstract the threads); a minimal sketch of the idea follows
(Figure: AI, Physics, and Graphics tasks spread dynamically across threads 0-3)
Slide by Tim Foley
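As a rough illustration (not from the talk), the sketch below hand-rolls a tiny version of what a task system does: a fixed pool of hardware threads dynamically pulls medium-grained tasks from a shared queue. Names and task bodies are made up; real engines use the task systems described on the next slides:

    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::function<void()>> tasks;    // medium-grained concurrent tasks
        for (int i = 0; i < 1000; ++i)
            tasks.push_back([i] { /* physics, AI, or graphics work for item i */ });

        std::atomic<size_t> next{0};                 // shared work counter
        unsigned hwThreads = std::thread::hardware_concurrency();
        if (hwThreads == 0) hwThreads = 4;

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < hwThreads; ++t) {
            pool.emplace_back([&] {
                // Each hardware thread grabs the next ready task: load balancing
                // happens dynamically instead of by static subsystem assignment.
                for (size_t j = next.fetch_add(1); j < tasks.size(); j = next.fetch_add(1))
                    tasks[j]();
            });
        }
        for (auto& th : pool) th.join();
        return 0;
    }

The thread count matches the hardware rather than the number of subsystems, which avoids both anti-patterns above.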

Task Systems (Cilk, TBB, ConcRT, GCD, ...)
Execution: concurrent execution of many (likely different) tasks
What is abstracted? Cores and execution contexts. Does not abstract SIMD functional units or the memory hierarchy
Where is synchronization allowed? Between tasks

Mental Model: Task Systems
Think of a task as an asynchronous function call: "do F() at some point in the future, optionally after G() is done"
Can be implemented in HW or SW; launching/spawning a task is nearly as fast as a function call
Usage model: spawn 1000s of tasks and let the scheduler map tasks to execution contexts
Usually cooperative, not preemptive: a task runs to completion with no blocking; it can create new tasks, but not wait on them
Slide by Tim Foley

Task Parallel Code (Cilk)

    void myTask( some arguments ) { }

    void main() {
        for( i = 0 to NumTasks - 1 ) {
            cilk_spawn myTask( );
        }
        cilk_sync;
    }

Task Parallel Code (Cilk)

    void myTask( some arguments ) { }

    void main() {
        cilk_for( i = 0 to NumTasks - 1 ) {
            myTask( );
        }
        cilk_sync;
    }

Nested Task Parallel Code (Cilk)

    void barTask( some parameters ) { }

    void fooTask( some parameters ) {
        // More code ...
        if (someCondition) {
            cilk_spawn barTask( );
        } else {
            cilk_spawn fooTask( );
        }
        // Implicit cilk_sync at end of function
    }

    void main() {
        cilk_for( i = 0 to NumTasks - 1 ) {
            fooTask( );
        }
        cilk_sync;
    }

Task Systems Review
Execution: concurrent execution of many (likely different) tasks
What is abstracted? Cores and execution contexts. Does not abstract SIMD functional units or the memory hierarchy
Where is synchronization allowed? Between tasks
Figure by Kayvon Fatahalian

DirectX/OpenGL Rendering Pipeline (a combination of multiple models)
Execution:
- Data-parallel concurrent execution of identical tasks within each shading stage
- Task-parallel concurrent execution of the different shading stages
- No parallelism exposed to the user
What is abstracted? Just about everything: cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs, etc.)
Where is synchronization allowed? Between draw calls

GPU Compute Languages (a combination of multiple models)
- DX11 DirectCompute
- OpenCL
- CUDA
There are multiple possible usage models. We'll start with the textbook hierarchical data-parallel usage model

GPU Compute Languages
Execution: a hierarchical model
- Lower level: parallel execution of identical tasks (work-items) within a work-group
- Upper level: concurrent execution of identical work-groups
What is abstracted?
- A work-group abstracts a core's execution contexts and SIMD functional units
- The set of work-groups abstracts cores
- Does not abstract core-local memory
Where is synchronization allowed?
- Between work-items in a work-group
- Between passes (sets of work-groups)

GPU Compute Pseudocode

    void myWorkGroup() {
        parallel_for( i = 0 to NumWorkItems - 1 ) {
            // GPU kernel code (this is where you write GPU compute code)
        }
    }

    void main() {
        concurrent_for( i = 0 to NumWorkGroups - 1 ) {
            myWorkGroup();
        }
        sync;
    }

DX CS/OCL/CUDA Execution Model
The fundamental unit is the work-item
- A single instance of the kernel program (i.e., a task, using the definitions in this talk)
- Each work-item executes in a single SIMD lane

    void f(...) {
        int x = ...;
        ...;
        if (...) { ... }
    }

Work-items are collected into work-groups
- A work-group is scheduled on a single core
- Work-items in a work-group execute in parallel, can share R/W on-chip scratchpad memory, and can wait for other work-items in the work-group
Users launch a grid of work-groups
- Spawn many concurrent work-groups
Figure by Tim Foley

GPU Compute Models
(Figure: work-items within each work-group synchronizing at work-group barriers)
Slide by Tim Foley
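A minimal compute kernel, written here in CUDA (the equivalent DirectCompute or OpenCL code has the same shape; all names and sizes are illustrative), showing work-items in one work-group sharing on-chip scratchpad memory and meeting at a work-group barrier:

    // Assumes a launch with 256 work-items per work-group.
    __global__ void reverseWithinGroup(const float* in, float* out, int n) {
        __shared__ float tile[256];          // R/W on-chip scratchpad for this work-group
        int local = threadIdx.x;             // work-item index within the work-group
        int gid   = blockIdx.x * blockDim.x + local;

        // Each work-item stages one element in the group's scratchpad.
        tile[local] = (gid < n) ? in[gid] : 0.0f;

        __syncthreads();                     // work-group barrier: all writes now visible

        // Read an element that a different work-item wrote (reverse within the group).
        if (gid < n)
            out[gid] = tile[blockDim.x - 1 - local];
    }

    // Host side: launch a grid of work-groups, 256 work-items each.
    // reverseWithinGroup<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

The barrier is what lets one work-item safely consume data another work-item produced; within a pass it is also the only synchronization available.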

GPU Compute Use Cases (increasing sophistication)
- 1:1 mapping
- Simple fork/join
- Switching axes of parallelism

1:1 Mapping
One work-item per ray / per pixel / per matrix element; every work-item executes the same kernel
Often the first, most obvious solution to a problem: pure data parallelism

    void saxpy( int i, float a, const float* x, const float* y, float* result ) {
        result[i] = a * x[i] + y[i];
    }

Slide by Tim Foley

Simple Fork/Join
Some code must run at work-group granularity
- Example: work-items cooperate to compute an output structure size; the atomic operation that allocates the output must execute once
Idiomatic solution: barrier, then make work-item #0 do the group-wide operation

    void subdividePolygon(...) {
        shared int numOutputPolygons = 0;

        // in parallel, every work item does
        atomic_add( numOutputPolygons, 1 );
        barrier();

        Polygon* output = NULL;
        if( workItemID == 0 ) {
            output = allocateMemory( numOutputPolygons );
        }
        barrier();
        ...
    }

Slide by Tim Foley

Multiple Axes of Parallelism
Deferred rendering with the DX11 Compute Shader (example from Johan Andersson, DICE): 1000+ dynamic lights
Multiple phases of execution: a work-group is responsible for a screen-space tile, and each phase exploits the work-items differently:
- Phase 1: pixel-parallel computation of tile depth bounds
- Phase 2: light-parallel test for intersection with the tile
- Phase 3: pixel-parallel accumulation of lighting
Exploits producer-consumer locality between phases
Picture by Johan Andersson; slide by Tim Foley
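A rough CUDA-style sketch of the three phases (an assumption-laden illustration, not DICE's actual compute shader; Light, lightIntersectsTile, accumulateLight, and all sizes are placeholders):

    #include <cuda_runtime.h>

    #define MAX_LIGHTS_PER_TILE 256

    // Placeholder types and functions: a real renderer does real culling and shading.
    struct Light { float x, y, z, radius, r, g, b; };

    __device__ bool lightIntersectsTile(const Light& L, float minZ, float maxZ,
                                        int tileX, int tileY) { return true; }

    __device__ void accumulateLight(float4* c, int px, int py, const Light& L) {
        c->x += L.r; c->y += L.g; c->z += L.b;    // stand-in for real BRDF evaluation
    }

    // One 16x16 work-group per screen tile; assumes width and height are multiples of 16.
    __global__ void tiledDeferredShading(const float* depth, const Light* lights,
                                         int numLights, float4* output, int width) {
        __shared__ float tileDepth[256];
        __shared__ float tileMinZ, tileMaxZ;
        __shared__ int   visibleLights[MAX_LIGHTS_PER_TILE];
        __shared__ int   numVisible;

        int lx = threadIdx.x, ly = threadIdx.y;
        int px = blockIdx.x * 16 + lx, py = blockIdx.y * 16 + ly;
        int item = ly * 16 + lx;                        // work-item index, 0..255

        // Phase 1: pixel-parallel read of depth; reduction to tile bounds
        // (serialized on work-item 0 here to keep the sketch short).
        tileDepth[item] = depth[py * width + px];
        __syncthreads();
        if (item == 0) {
            float mn = tileDepth[0], mx = tileDepth[0];
            for (int i = 1; i < 256; ++i) {
                mn = fminf(mn, tileDepth[i]);
                mx = fmaxf(mx, tileDepth[i]);
            }
            tileMinZ = mn; tileMaxZ = mx; numVisible = 0;
        }
        __syncthreads();

        // Phase 2: light-parallel culling; each work-item tests different lights.
        for (int l = item; l < numLights; l += 256) {
            if (lightIntersectsTile(lights[l], tileMinZ, tileMaxZ, blockIdx.x, blockIdx.y)) {
                int slot = atomicAdd(&numVisible, 1);
                if (slot < MAX_LIGHTS_PER_TILE) visibleLights[slot] = l;
            }
        }
        __syncthreads();

        // Phase 3: pixel-parallel accumulation over the surviving lights.
        float4 color = make_float4(0.f, 0.f, 0.f, 1.f);
        int count = min(numVisible, MAX_LIGHTS_PER_TILE);
        for (int i = 0; i < count; ++i)
            accumulateLight(&color, px, py, lights[visibleLights[i]]);
        output[py * width + px] = color;
    }

    // Host side (illustrative):
    // tiledDeferredShading<<<dim3(width / 16, height / 16), dim3(16, 16)>>>(
    //     d_depth, d_lights, numLights, d_output, width);

Because the tile's depth bounds and visible-light list live in on-chip memory, each phase consumes what the previous phase produced without a round trip to DRAM; that is the producer-consumer locality mentioned above.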

Terminology Decoder Ring

    DirectCompute | CUDA                     | OpenCL       | Pthreads + SSE | This talk
    thread        | thread                   | work-item    | SIMD lane      | work-item
    -             | warp                     | -            | thread         | execution context
    threadgroup   | threadblock              | work-group   | -              | work-group
    -             | streaming multiprocessor | compute unit | core           | core
    grid          | grid                     | N-D range    | -              | set of work-groups

Slide by Tim Foley

When to Use GPU Compute vs. Pixel Shader?
Use a GPU compute language if your algorithm needs on-chip memory
- Reduce bandwidth by building local data structures
Otherwise, use a pixel shader
- All mapping, decomposition, and scheduling decisions are automatic (easier to reach peak performance)

GPU Compute Languages Review
Write code from within two nested concurrent/parallel loops
Abstracts: cores, execution contexts, and SIMD ALUs
Exposes:
- Parallel execution contexts on the same core
- Fast R/W on-core memory shared by the execution contexts on the same core
Synchronization:
- Fine-grain: between execution contexts on the same core
- Very coarse: between large sets of concurrent work
- No medium-grain synchronization between function calls like task systems provide

Conventional Thread Parallelism on GPUs
Also called "persistent threads": an expert usage model for GPU compute
- Defeats the abstractions over cores, execution contexts, and SIMD functional units
- Defeats the system scheduler, load balancing, etc.
- Code is not portable between architectures

Conventional Thread Parallelism on GPUs
Execution: a two-level parallel execution model
- Lower level: parallel execution of M identical tasks on an M-wide SIMD functional unit
- Higher level: parallel execution of N different tasks on N execution contexts
What is abstracted? Nothing (other than the automatic mapping to SIMD lanes)
Where is synchronization allowed?
- Lower level: between any tasks running on the same SIMD functional unit
- Higher level: between any execution contexts

Why Persistent Threads?
Enables alternate programming models that require different scheduling and synchronization rules than the default model provides
Example alternate programming models:
- Task systems (esp. nested task parallelism)
- Producer-consumer rendering pipelines
(See the references at the end of this slide deck for more details)
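A minimal sketch of the persistent-threads pattern, written here in CUDA (not code from the talk; the queue layout and the work payload are illustrative): launch only as many work-groups as the machine can keep resident, and have them loop, pulling work from a global queue until it is empty:

    #include <cuda_runtime.h>

    __device__ int g_nextWork;                    // global work counter (the "queue head")

    __global__ void persistentKernel(const int* workItems, int numWork, float* results) {
        __shared__ int base;
        while (true) {
            // One work-item per group grabs a batch of work for the whole group.
            if (threadIdx.x == 0)
                base = atomicAdd(&g_nextWork, blockDim.x);
            __syncthreads();

            if (base >= numWork) return;          // queue drained: this "thread" exits

            int myWork = base + threadIdx.x;
            if (myWork < numWork) {
                // Process one element; stands in for a task, pipeline stage, etc.
                results[myWork] = workItems[myWork] * 2.0f;
            }
            __syncthreads();   // keep the group together before grabbing the next batch
        }
    }

    // Host side (illustrative): reset the counter, then launch roughly one work-group
    // per hardware core instead of one work-group per data element.
    //   int zero = 0;
    //   cudaMemcpyToSymbol(g_nextWork, &zero, sizeof(int));
    //   persistentKernel<<<numResidentGroups, 256>>>(d_work, numWork, d_results);

Because the kernel, not the hardware scheduler, decides when a work-group is finished, the programmer can layer custom scheduling (task queues, producer-consumer pipelines) on top; the cost is losing the portability that the default abstraction provides.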

Summary of Concepts
Abstraction: when a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Execution: concurrency versus parallelism
Synchronization: where is the user allowed to control scheduling?

Ideal Parallel Programming Model
Combine the best of the CPU and GPU programming models
Task systems are great for scheduling (from CPUs)
- Asynchronous function calls are easy to understand and use
- Great load balancing and scalability (with cores, execution contexts)
SPMD programming is great for utilizing SIMD (from GPUs)
- Write sequential code that is instanced N times across N-wide SIMD
- Intuitive: only slightly different from sequential programming
Why not just launch tasks that run fine-grain SPMD code? The future on CPU and GPU?

Conclusions
Task-, data-, and pipeline-parallelism
- Three proven approaches to scalability
- Plenty of concurrency with little exposed parallelism
- Applicable to many problems in visual computing
Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
Current GPU compute models are designed for data-parallelism but can be abused to implement all of these other models

References
GPU-inspired compute languages
- DX11 DirectCompute, OpenCL (CPU+GPU+...), CUDA
Task systems (CPU and CPU+GPU+...)
- Cilk, Thread Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library, OpenCL (limited in 1.0)
Conventional CPU thread programming
- Pthreads
GPU task systems and persistent threads (i.e., conventional thread programming on the GPU)
- Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009
- Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010
- Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010
Additional input (concepts, terminology, patterns, etc.)
- Foley, Parallel Programming for Graphics, Beyond Programmable Shading, SIGGRAPH 2009
- Beyond Programmable Shading CS448s Stanford course
- Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading, SIGGRAPH 2009-2010
- Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPLoP 2010

Questions?
http://www.cs.washington.edu/education/courses/cse558/11wi/