Parallel Programming for Graphics

Beyond Programmable Shading Course, ACM SIGGRAPH 2010
Parallel Programming for Graphics
Aaron Lefohn, Advanced Rendering Technology (ART), Intel

What's in This Talk?
- Overview of parallel programming models used in real-time graphics products and research
  - Abstraction, execution, synchronization
  - Shaders, task systems, conventional threads, graphics pipeline, GPU compute languages
- Discussion of the strengths and weaknesses of the models

What Goes into a Game Frame? (2 years ago)

This. Computation graph for Battlefield: Bad Company, provided by DICE.

Data Parallelism

Task Parallelism
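The two decomposition styles named on these slides can be contrasted in a few lines. This is a minimal sketch, not code from the talk: it uses Python's standard `concurrent.futures` thread pool purely to show structure (Python threads will not give real speedups for CPU-bound work), and `shade`, `animate`, and `simulate_physics` are hypothetical stand-ins for real workloads.

```python
from concurrent.futures import ThreadPoolExecutor

def shade(pixel):
    # Hypothetical per-element kernel: the same function runs on every element.
    return pixel * 2

def animate():
    # Hypothetical stand-in for one kind of per-frame work.
    return "animation done"

def simulate_physics():
    # Hypothetical stand-in for a different kind of per-frame work.
    return "physics done"

def data_parallel(pixels):
    # Data parallelism: one operation applied independently to many elements.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(shade, pixels))

def task_parallel():
    # Task parallelism: different, logically independent operations in flight at once.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(animate), pool.submit(simulate_physics)]
        return [f.result() for f in futures]
```

A real game frame mixes both styles: many different tasks, several of which are internally data-parallel.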

Graphics Pipelines
Pipeline flow: Input Assembly -> Vertex Shading -> Primitive Setup -> Geometry Shading -> Rasterization -> Pixel Shading -> Output Merging

Hardware Resources (from Kayvon's Talk)
- Core
- Execution context
- SIMD functional units
- On-chip memory

Abstraction
- Abstraction enables portability and system optimization
  - E.g., dynamic load balancing, producer-consumer, SIMD utilization
- Lack of abstraction enables architecture-specific user optimization
  - E.g., multiple execution contexts jointly building an on-chip data structure
- When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource

Execution
- Task: a logically related set of instructions executed in a single execution context (aka shader, instance of a kernel, task)
- Concurrent execution: multiple tasks that may execute simultaneously (because they are logically independent)
- Parallel execution: multiple tasks whose execution contexts are guaranteed to be live simultaneously (because you want them to be, for locality, synchronization, etc.)

Synchronization
- Synchronization: restricting when tasks are permitted to execute
- The granularity of permitted synchronization determines the granularity at which the system allows the user to control scheduling

Pixel Shaders
- Execution
  - Concurrent execution of identical per-pixel tasks
  - Parallelism between four pixels in a 2x2 quad
- What is abstracted?
  - Cores, execution contexts, SIMD functional units, memory hierarchy
- What synchronization is allowed?
  - Between draw calls

Task Systems (Cilk, TBB, ConcRT, GCD, ...)
- Execution
  - Concurrent execution of many (likely different) tasks
- What is abstracted?
  - Cores and execution contexts
  - Does not abstract: SIMD functional units or memory hierarchy
- Where is synchronization allowed?
  - Between tasks
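As a rough illustration of the task-system model, the sketch below uses Python's standard `ThreadPoolExecutor` in place of Cilk, TBB, ConcRT, or GCD. The frame stages `cull`, `build_lights`, and `render` are hypothetical. The point is only that the user submits tasks and synchronizes at task boundaries, while the pool decides which worker runs what.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def cull(objects):
    # Hypothetical task: keep only "visible" objects (non-negative values here).
    return [o for o in objects if o >= 0]

def build_lights(count):
    # Hypothetical task: produce a list of light indices.
    return list(range(count))

def render(visible, lights):
    # Hypothetical task that depends on the two tasks above.
    return len(visible) * len(lights)

def run_frame(objects, light_count):
    # The pool abstracts cores and execution contexts: the caller never
    # chooses which worker thread runs a given task.
    with ThreadPoolExecutor() as pool:
        cull_task = pool.submit(cull, objects)
        light_task = pool.submit(build_lights, light_count)
        # Synchronization is only between tasks: render starts once
        # both of its inputs have finished.
        wait([cull_task, light_task])
        render_task = pool.submit(render, cull_task.result(), light_task.result())
        return render_task.result()
```

Nothing in this model says how a task uses SIMD units or the memory hierarchy; that remains the task author's problem.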

Conventional Thread Parallelism (pthreads)
- Execution
  - Parallel execution of N tasks with N execution contexts
- What is abstracted?
  - Nothing (ignoring preemption)
- Where is synchronization allowed?
  - Between any execution context at various granularities
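The conventional-thread model can be sketched with Python's `threading` module standing in for pthreads. The function name `run_threads` and the strided decomposition are assumptions made for illustration. The user creates exactly N contexts, decides what each one does, and places the synchronization points by hand.

```python
import threading

def run_threads(n, data):
    partial = [0] * n      # one result slot per thread
    total = [0]
    barrier = threading.Barrier(n)

    def worker(tid):
        # User-chosen decomposition: thread tid takes every n-th element.
        partial[tid] = sum(data[tid::n])
        # The barrier only releases when all n threads have arrived, so it
        # relies on all n execution contexts being live at the same time.
        barrier.wait()
        if tid == 0:
            total[0] = sum(partial)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The barrier is what separates this from the task-system model: it is correct only because the N threads are guaranteed to exist simultaneously, which a task scheduler does not promise.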

DirectX/OpenGL Rendering Pipeline
- Execution
  - Data-parallel concurrent execution of identical task within each shading stage
  - Task-parallel concurrent execution of different shading stages
  - No parallelism exposed to user
- What is abstracted?
  - Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs, etc.)
- Where is synchronization allowed?
  - Between draw calls

GPU Compute Languages
- DX11 DirectCompute
- OpenCL
- CUDA
- There are multiple possible usage models. We'll start with the textbook hierarchical data-parallel usage model.

Terminology Decoder Ring

DirectCompute | CUDA                     | OpenCL       | Pthreads+SSE | This talk
thread        | thread                   | work-item    | SIMD lane    | work-item
-             | warp                     | -            | thread       | execution context
threadgroup   | threadblock              | work-group   | -            | work-group
-             | streaming multiprocessor | compute unit | core         | core
-             | grid                     | N-D range    | -            | set of work-groups

GPU Compute Languages
- Execution
  - Hierarchical model
  - Lower level is parallel execution of identical tasks (work-items) within a work-group
  - Upper level is concurrent execution of identical work-groups
- What is abstracted?
  - Work-group abstracts a core's execution contexts and SIMD functional units
  - Set of work-groups abstracts cores
  - Does not abstract core-local memory
- Where is synchronization allowed?
  - Between work-items in a work-group
  - Between passes (set of work-groups)

GPU Compute Models
[Figure: two work-groups, each labeled with a "Work-group barrier"]
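The hierarchical model can be imitated on a CPU to make the two levels concrete. The following is a sketch under stated assumptions, not how any GPU runtime is implemented. `GROUP_SIZE`, `run_work_group`, and `group_sums` are hypothetical names, Python threads play the role of work-items, a per-group list plays the role of core-local memory, and `threading.Barrier` plays the role of the work-group barrier.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

GROUP_SIZE = 4  # work-items per work-group (assumed; must be a power of two here)

def run_work_group(group_id, data, out):
    local = [0] * GROUP_SIZE                  # stands in for core-local memory
    barrier = threading.Barrier(GROUP_SIZE)   # stands in for the work-group barrier

    def work_item(local_id):
        global_id = group_id * GROUP_SIZE + local_id
        # Load one element per work-item; pad with 0 past the end of the input.
        local[local_id] = data[global_id] if global_id < len(data) else 0
        barrier.wait()
        # Tree reduction in local memory, with a barrier after every step.
        stride = GROUP_SIZE // 2
        while stride > 0:
            if local_id < stride:
                local[local_id] += local[local_id + stride]
            barrier.wait()
            stride //= 2
        if local_id == 0:
            out[group_id] = local[0]

    # Lower level: all work-items of a group are live at once (parallel).
    items = [threading.Thread(target=work_item, args=(i,)) for i in range(GROUP_SIZE)]
    for t in items:
        t.start()
    for t in items:
        t.join()

def group_sums(data):
    num_groups = (len(data) + GROUP_SIZE - 1) // GROUP_SIZE
    out = [0] * num_groups
    # Upper level: work-groups are independent and may run in any order (concurrent).
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda g: run_work_group(g, data, out), range(num_groups)))
    # Leaving the pool is the only point where all groups are known to be done,
    # which mirrors synchronization between passes.
    return out
```

Combining the per-group results (for example `sum(group_sums(data))`) has to happen in a later pass, because work-groups cannot synchronize with each other inside one pass.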

User Responsibilities: GPU Compute
- User manually maps work-item index to problem domain
- User controls size and number of work-groups
- Selecting these parameters is a complex combination of architecture- and task-specific considerations
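For a 2D image, the mapping and sizing work listed above usually comes down to a little integer arithmetic. The sketch below assumes an 8x8 work-group and a row-major image layout; both choices, and the helper names `launch_config` and `work_item_to_pixel`, are illustrative rather than taken from the talk.

```python
def launch_config(width, height, group_w=8, group_h=8):
    # Round up so that partial tiles at the right and bottom edges are covered.
    groups_x = (width + group_w - 1) // group_w
    groups_y = (height + group_h - 1) // group_h
    return groups_x, groups_y

def work_item_to_pixel(group_id, local_id, width, height, group_w=8, group_h=8):
    gx, gy = group_id
    lx, ly = local_id
    x = gx * group_w + lx
    y = gy * group_h + ly
    # Work-items in edge groups can fall outside the image and must do nothing.
    if x >= width or y >= height:
        return None
    # Row-major linear index of the pixel this work-item owns.
    return y * width + x
```

For a 100x60 image this gives 13x8 work-groups, and the out-of-range check is what keeps the extra work-items in the last column and row of groups from touching memory outside the image. Which group size performs best still depends on the architecture and the task.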

When to Use GPU Compute vs. Pixel Shader?
- Use a GPU compute language if your algorithm needs on-chip memory
  - Reduce bandwidth by building local data structures
- Otherwise, use a pixel shader
  - All mapping, decomposition, and scheduling decisions are automatic
  - (Easier to reach peak performance)

Conventional Thread Parallelism on GPUs
- Also called persistent threads
- Expert usage model for GPU compute
  - Defeat abstractions over cores, execution contexts, and SIMD functional units
  - Defeat system scheduler, load balancing, etc.
  - Code not portable between architectures

Conventional Thread Parallelism on GPUs
- Execution
  - Two-level parallel execution model
  - Lower level: parallel execution of M identical tasks on M-wide SIMD functional unit
  - Higher level: parallel execution of N different tasks on N execution contexts
- What is abstracted?
  - Nothing (other than automatic mapping to SIMD lanes)
- Where is synchronization allowed?
  - Lower level: between any task running on same SIMD functional unit
  - Higher level: between any execution context

Why Persistent Threads?
- Enable alternate programming models that require different scheduling and synchronization rules than the default model provides
- Example alternate programming models
  - Task systems (esp. nested task parallelism)
  - Producer-consumer rendering pipelines
  - (See references at end of this slide deck for more details)
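The persistent-threads pattern can be sketched on a CPU as a fixed set of long-lived workers that pull from a shared queue and may push new work back onto it. This is only an analogy for the GPU technique, written with Python's standard `queue` and `threading` modules; `persistent_workers`, the range-splitting workload, and the parameter defaults are hypothetical.

```python
import queue
import threading

def persistent_workers(root_ranges, num_workers=4, leaf_size=4):
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        # Each worker lives for the whole computation and schedules itself,
        # instead of being launched once per task by the system.
        while True:
            item = work.get()
            if item is None:          # shutdown signal
                work.task_done()
                return
            lo, hi = item
            if hi - lo > leaf_size:
                # Nested task parallelism: a running task spawns two subtasks.
                mid = (lo + hi) // 2
                work.put((lo, mid))
                work.put((mid, hi))
            else:
                with results_lock:
                    results.append(sum(range(lo, hi)))
            # Children are enqueued before the parent is marked done, so the
            # queue never looks finished while work is still pending.
            work.task_done()

    for r in root_ranges:
        work.put(r)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                       # wait for all tasks, including spawned ones
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return sum(results)
```

On a GPU the worker count would be tied to the number of hardware execution contexts, which is the reason code written this way is not portable between architectures.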

Summary of Concepts
- Abstraction
  - When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
- Execution
  - Concurrency versus parallelism
- Synchronization
  - Where is user allowed to control scheduling?

Conclusions
- Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
- Current GPU compute models are designed for data parallelism but can be abused to implement all of these other models
- Look for uses of these different models throughout the rest of the course

Acknowledgements
- Tim Foley, Intel
- Kayvon Fatahalian, Stanford
- Mike Houston, AMD
- Tim Mattson and Andrew Lauritzen, Intel
- Craig Kolb and Matt Pharr

References
- GPU-inspired compute languages
  - DX11 DirectCompute, OpenCL (CPU+GPU+...), CUDA
- Task systems (CPU and CPU+GPU+...)
  - Cilk, Threading Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library, OpenCL (limited in 1.0)
- Conventional CPU thread programming
  - Pthreads
- GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
  - Aila et al., "Understanding the Efficiency of Ray Traversal on GPUs," High Performance Graphics 2009
  - Tzeng et al., "Task Management for Irregular-Parallel Workloads on the GPU," High Performance Graphics 2010
  - Parker et al., "OptiX: A General Purpose Ray Tracing Engine," SIGGRAPH 2010
- Additional input (concepts, terminology, patterns, etc.)
  - Foley, "Parallel Programming for Graphics," Beyond Programmable Shading, SIGGRAPH 2009
  - Beyond Programmable Shading, CS448s Stanford course
  - Fatahalian, "Running Code at a Teraflop: How a GPU Shader Core Works," Beyond Programmable Shading, SIGGRAPH 2009-2010
  - Keutzer et al., "A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects," ParaPLoP 2010

Questions?
Course web page and slides: http://bps10.idav.ucdavis.edu