The GPGPU Programming Model

Transcription:

The Programming Model
Institute for Data Analysis and Visualization, University of California, Davis

Overview
- Data-parallel programming basics
- The GPU as a data-parallel computer
- Hello World example
- Programming model detail
- Conclusions
C2

Data-Parallel Programming Basics
What is a data-parallel program?
- Explicitly expresses data dependencies
- Exposes parallelism
Stream programming is a data-parallel model
- Stream programs are dependency graphs
- Kernels are graph nodes
- Streams are edges flowing between kernels
C3

Stream Program Basics
The stream programming model:
- Split the algorithm into kernels based on dependencies
- Example: image processing (a sketch follows below)
[Figure: kernels connected by streams]
Dally et al., Stream Processors: Programmability with Efficiency, ACM Queue, March 2004, pp. 52-62, ftp://cva.stanford.edu/pub/publications/spqueue.pdf
C4
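As a rough illustration only (not from the slides), the C sketch below expresses a two-kernel image-processing stream program on the CPU. kernelBlur and kernelThreshold are hypothetical per-element functions; the arrays image, tmp, and out are the streams that connect them.

    /* Hypothetical stream-program sketch: two kernels (graph nodes) connected
       by streams (edges). Each kernel reads only its input stream and writes
       only its output stream. */
    float image[WIDTH * HEIGHT], tmp[WIDTH * HEIGHT], out[WIDTH * HEIGHT];
    for (int p = 0; p < WIDTH * HEIGHT; p++)
        tmp[p] = kernelBlur( image, p );          /* stream: image -> tmp */
    for (int p = 0; p < WIDTH * HEIGHT; p++)
        out[p] = kernelThreshold( tmp, p );       /* stream: tmp -> out   */

On a GPU, each of these foreach loops becomes one rendering pass, as the following slides describe.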

GPU as a Stream Processor
- Draw a screen-sized quad
- Run a kernel over each fragment (fragment program)
- Read stream data from textures
- Write results to the frame buffer
[Figure courtesy of John Owens]
C5

Modern Graphics Pipeline (VS 3.0 GPUs)
[Figure: Vertex Buffer -> Vertex Processor -> Rasterizer -> Fragment Processor -> Frame Buffer(s), with a Texture Buffer readable by the vertex and fragment processors]
C6

GPU as a Stream Processor
- Streams: textures
- Kernels: fragment programs
- foreach execution: draw a single large quad
Steps:
1) Write data to texture
2) Load fragment program
3) Configure OpenGL for 1:1 rendering
4) Bind textures
5) Bind fragment program
6) Draw large quad
7) Write results to texture
C7

Hello World Example
3 x 3 image-processing convolution, CPU version (boundary handling omitted):

    image     = loadImage( WIDTH, HEIGHT );
    blurImage = allocZeros( WIDTH, HEIGHT );
    for (x = 0; x < WIDTH; x++)
      for (y = 0; y < HEIGHT; y++)
        for (i = -1; i <= 1; i++)
          for (j = -1; j <= 1; j++) {
            float w = computeWeight( i, j );
            blurImage[x][y] += w * image[x+i][y+j];
          }
C8

Hello World Example, GPU Version
1) Load image into texture
2) Create blurImage texture to hold the result (a sketch follows below)
[Figure courtesy of Mark Harris]
C9
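Step 2 is not shown as code on the slide. Below is a hedged sketch of one way to do it with ARB_texture_rectangle, ARB_texture_float, and the EXT_framebuffer_object extension; the names blurTex and fbo are illustrative, and the original course demos may have used pbuffers instead.

    /* Hedged sketch: create a floating-point result texture and attach it to
       a framebuffer object so the kernel can render into it. */
    GLuint blurTex, fbo;

    glGenTextures( 1, &blurTex );
    glBindTexture( GL_TEXTURE_RECTANGLE_ARB, blurTex );
    glTexParameteri( GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
    glTexParameteri( GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
    glTexImage2D( GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB,
                  WIDTH, HEIGHT, 0, GL_RGBA, GL_FLOAT, NULL );   /* no initial data */

    glGenFramebuffersEXT( 1, &fbo );
    glBindFramebufferEXT( GL_FRAMEBUFFER_EXT, fbo );
    glFramebufferTexture2DEXT( GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                               GL_TEXTURE_RECTANGLE_ARB, blurTex, 0 );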

Hello World Example, GPU Version
3) Load the fragment program (kernel); example shown in Cg:

    void blurKernel( uniform samplerRECT image,
                     float2     winPos    : WPOS,
                     out float4 blurImage : COLOR )
    {
        blurImage = float4(0, 0, 0, 0);
        for (int i = -1; i <= 1; i++) {
            for (int j = -1; j <= 1; j++) {
                float2 texCoord = winPos + float2(i, j);
                float  w        = computeWeight( i, j );
                blurImage += w * texRECT( image, texCoord );
            }
        }
    }
C10

Hello World Example, GPU Version
4) Configure OpenGL to draw 1:1 (no projection or rescaling):

    glMatrixMode( GL_PROJECTION );
    glLoadIdentity();
    gluOrtho2D( 0, 1, 0, 1 );
    glViewport( 0, 0, WIDTH, HEIGHT );
    glMatrixMode( GL_MODELVIEW );
    glLoadIdentity();

5) Bind image and blurKernel (texture and fragment program)
6) Bind blurImage as the render target (a code sketch for steps 5 and 6 follows below)
C11
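Steps 5 and 6 are listed but not shown as code. The sketch below is one possible way to do them with the Cg runtime and the framebuffer object from the earlier sketch; blurProgram, imageTex, and fbo are assumed names, not from the slides.

    /* Hedged sketch: bind the render target (step 6) and the input texture
       plus fragment program (step 5). */
    glBindFramebufferEXT( GL_FRAMEBUFFER_EXT, fbo );      /* render into blurTex */
    glDrawBuffer( GL_COLOR_ATTACHMENT0_EXT );

    cgGLEnableProfile( CG_PROFILE_FP30 );                 /* fragment profile    */
    cgGLBindProgram( blurProgram );                       /* the blurKernel code */

    CGparameter imageParam = cgGetNamedParameter( blurProgram, "image" );
    cgGLSetTextureParameter( imageParam, imageTex );      /* bind input stream   */
    cgGLEnableTextureParameter( imageParam );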

Hello World Example, GPU Version
7) Execute the kernel on each stream element by drawing a quad covering [WIDTH x HEIGHT] fragments (here, a single oversized triangle that covers the whole viewport):

    glBegin( GL_TRIANGLES );
    glVertex2f( 0, 0 );
    glVertex2f( 2, 0 );
    glVertex2f( 0, 2 );
    glEnd();
C12

Hello World Example
What happened?
- blurKernel executed on each element of image
- Rendering replaced the outer two loops of the CPU version
- blurKernel performed a gather operation at each element
- The result (blurImage) was written to the framebuffer / texture
C13

Hello World Example
Get the source code for examples:
- http://www.gpgpu.org/developer/
- http://download.nvidia.com/developer/sdk/individual_samples/samples.html (gpgpu_fluid, gpgpu_disease, gpu_particles)
- http://www.ati.com/developer/sdk/radeonsdk/html/samples/opengl/HW_Image_Processing.html
C14

Questions? GPU as stream processor? Stream programming model? Hello World example? C15

Moving On
- Programming model detail
- Intro to handling missing features
C16

Why Are Stream Programs Fast?
Explicit parallelism:
- No communication between stream elements
- Kernels cannot write to their input streams
- Kernels cannot have writable static/global variables
- Stream elements are therefore independent
Explicit memory locality:
- Temporary values in a kernel are local
- The stream program expresses producer-consumer relationships
- The cost of memory access is hidden with parallelism
C17

GPU Computational Primitives
Operations available in a kernel, and the GPU feature that provides them:
- Read-only memory (input streams): texture sampler
- Random-access read-only memory (gather): texture sampler
- Per-data-element interpolants: varying registers
- Temporary storage (no saved state): local registers
- Read-only constants: constant registers
- Write-only memory (result streams): render-to-texture
- Floating-point ALU ops
C18

GPU Computational Primitives
What's missing?
- No stack
- No heap
- No integer or bitwise operations
- No scatter (a[i] = b)
- No reduction operations (max, min, sum)
- Data-dependent conditionals
- Limited number of outputs
Why missing?
- Lack of demand from games
- Early in the GPU's evolution as a general data-parallel processor
- Stream programming model constraints
C19

GPU Computational Primitives
Handling missing features: we'll explain how to emulate
- Scatter
- Global (reduction) operations
- Conditionals
C20

Emulating Scatter
Scatter: i = foo(); a[i] = bar();
Solution 1: transform the scatter algorithm into a gather algorithm (see Ian's Tips and Tricks talk for details).
Solution 2: do an actual scatter
- The vertex processor can scatter points: render points instead of a large triangle
- Use render-to-texture with vertex texture reads (PS 3.0), or render-to-vertex-array
Problem: drawing a point for each data element is slow. A sketch of the point-scatter approach follows below.
C21
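A hedged sketch of Solution 2 (not from the slides): scatter by drawing one point per data element into the bound render target. scatterAddressX/Y and scatterValue are hypothetical helpers, and the projection is assumed to map texel coordinates to pixels (e.g., gluOrtho2D(0, WIDTH, 0, HEIGHT)).

    /* Hypothetical point-scatter sketch: one GL_POINTS vertex per element,
       positioned at the texel the element should write to. */
    glBegin( GL_POINTS );
    for (int n = 0; n < N; n++) {
        float dstX = scatterAddressX( n );       /* hypothetical: target texel x */
        float dstY = scatterAddressY( n );       /* hypothetical: target texel y */
        float r, g, b, a;
        scatterValue( n, &r, &g, &b, &a );       /* hypothetical: value to write */
        glColor4f( r, g, b, a );
        glVertex2f( dstX + 0.5f, dstY + 0.5f );  /* center of the target texel   */
    }
    glEnd();

This is why the slide notes that scatter is slow: every element costs its own vertex, instead of one large triangle covering all elements.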

Emulating Reduction Operations
Reductions are operations that require all data elements: max, min, sum, norm, etc.
Solution: perform repeated gathers until only a single data value is left, which takes log2( WIDTH ) gather passes (assuming WIDTH == HEIGHT).
Problem: the extra passes can be costly.
A sketch of the repeated-halving approach follows below.
[Cloud figure courtesy of Mark Harris]
C22
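The sketch below illustrates the repeated-gather idea on the CPU side. It assumes a fragment program (maxReduceProgram) that returns the max of a 2x2 block of its input, plus hypothetical helpers bindFragmentProgram, drawRect, and swapInputAndOutputTextures that are not part of the original slides.

    /* Hypothetical reduction driver: halve the domain each pass until a
       single value remains, ping-ponging between two textures. */
    int size = WIDTH;                           /* assumes WIDTH == HEIGHT, power of two */
    bindFragmentProgram( maxReduceProgram );    /* kernel gathers a 2x2 block, keeps max */
    while (size > 1) {
        size /= 2;
        drawRect( 0, 0, size, size );           /* output domain shrinks each pass       */
        swapInputAndOutputTextures();           /* result of this pass feeds the next    */
    }
    /* the final 1x1 texture holds the maximum; read it back with glReadPixels */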

Conditionals in a SIMD World
Current fragment processors are SIMD; vertex processors are MIMD (PS 3.0).
SIMD execution and conditionals are at odds:
- The SIMD model assumes all data elements are processed identically
- Conditional execution breaks this assumption
Solutions:
1) Statically resolve the conditional outside of the kernel
2) Disable false stream elements
3) Improved fragment hardware?
C23

Conditionals in a SIMD World
Solution 1) Can the decision be made before the fragment processor runs?
Static branch resolution with substreams: split the domain into substreams and run a specialized kernel on each. A sketch follows below.
[Figure courtesy of Mark Harris]
Trick discovered simultaneously by Lefohn, Harris, and Goodnight in 2003.
C24
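A hedged sketch of the substream trick (not from the slides), assuming a grid computation whose boundary cells need different handling than interior cells. bindFragmentProgram and drawRect(x0, y0, x1, y1) are hypothetical helpers that bind a kernel and draw a quad covering the fragments in [x0, x1) x [y0, y1).

    /* Hypothetical static-branch-resolution sketch: instead of branching
       inside one kernel, bind a specialized kernel per substream and draw
       only the geometry covering that substream. */
    bindFragmentProgram( interiorProgram );            /* no boundary tests needed    */
    drawRect( 1, 1, WIDTH - 1, HEIGHT - 1 );           /* interior fragments          */

    bindFragmentProgram( boundaryProgram );            /* handles boundary conditions */
    drawRect( 0,         0,          WIDTH, 1          );   /* bottom row   */
    drawRect( 0,         HEIGHT - 1, WIDTH, HEIGHT     );   /* top row      */
    drawRect( 0,         1,          1,     HEIGHT - 1 );   /* left column  */
    drawRect( WIDTH - 1, 1,          WIDTH, HEIGHT - 1 );   /* right column */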

Conditionals in a SIMD World
Solution 2) Disable false stream elements: occlusion query
Idea:
- An occlusion query reports the number of fragments that passed the depth test
- Useful when the number of loop iterations is data-dependent
- The kernel kills fragments that do not need further processing
- The CPU continues to issue render passes until all fragments are killed
A sketch of this loop follows below.
C25
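A hedged sketch of the occlusion-query loop (not from the slides), using the ARB_occlusion_query extension. drawFullScreenQuad and swapInputAndOutputTextures are hypothetical helpers, and the kernel is assumed to kill (discard) fragments that are finished.

    /* Hypothetical driver loop: keep iterating until the kernel has killed
       every fragment. */
    GLuint query, liveCount = 1;
    glGenQueriesARB( 1, &query );
    while (liveCount > 0) {
        glBeginQueryARB( GL_SAMPLES_PASSED_ARB, query );
        drawFullScreenQuad();                    /* kernel discards finished elements   */
        glEndQueryARB( GL_SAMPLES_PASSED_ARB );
        glGetQueryObjectuivARB( query, GL_QUERY_RESULT_ARB, &liveCount );
        swapInputAndOutputTextures();            /* feed this pass's output to the next */
    }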

Conditionals in a SIMD World
Solution 3) Disable false stream elements: early depth cull
Idea:
- Modern GPUs can kill fragments before kernel execution
- The kernel sets the z-value to control whether or not execution occurs
Problem: conditionals must be block-coherent (more about this in Ian's Tips and Tricks talk).
A sketch of one two-pass variant follows below.
C26
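A hedged sketch of one way to use early depth culling (an assumption, not necessarily the slide's exact scheme). It assumes a mask kernel (maskProgram) that outputs depth 1.0 for elements that still need work and 0.0 for finished ones, plus the same hypothetical helpers as above.

    /* Pass 1: write the activity mask into the depth buffer. */
    glEnable( GL_DEPTH_TEST );
    glDepthFunc( GL_ALWAYS );
    glDepthMask( GL_TRUE );
    bindFragmentProgram( maskProgram );        /* outputs activity flag as depth */
    drawFullScreenQuad();

    /* Pass 2: run the real kernel at z = 0.5. Fragments over finished pixels
       (stored depth 0.0) fail the GL_LESS test and are culled before the
       kernel runs, ideally by the early-z hardware. */
    glDepthFunc( GL_LESS );
    glDepthMask( GL_FALSE );                   /* keep the mask intact */
    bindFragmentProgram( computeProgram );
    drawFullScreenQuadAtDepth( 0.5f );         /* hypothetical helper  */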

Conditionals in a SIMD World
Solution 4) Improved fragment hardware?
- MIMD hardware: NVIDIA GeForce 6-series (NV4x) vertex processors
- SIMD with conditional support: NVIDIA GeForce 6-series (NV4x) fragment processors
  - Uniform branches are a win if they skip more than ~5 instructions
  - Varying branches must be coherent across hundreds of pixels (blocks on the order of tens by tens of pixels)
  - Very useful in many cases
Problem: do we want more SIMD processors or fewer MIMD processors?
C27

Conditionals in a SIMD World
Conclusions:
- Conditionals are tough with today's GPUs
- The best case is when the conditional can be statically resolved and removed from the computational kernel
- Future GPUs will most likely fully support conditionals, but the solution must not interfere with parallelism: MIMD? SIMD with conditional streams? (Kapasi et al., Micro 33, 2000)
C28

Conclusions: Computational Basics
- Textures: storage for data streams
- Fragment program: computational kernel
- Render pass: foreach loop over a data stream
Coming next (Ian Buck): data-parallel languages for programming, which express data-parallel programs more elegantly than glBegin()...glEnd().
C29

References
- ATI Developer web site, http://www.ati.com/developer/
- GPGPU Developer web site, http://www.gpgpu.org/developer
- N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, "A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware," Graphics Hardware 2003
- M. Harris, W. Baxter, T. Scheuermann, A. Lastra, "Simulation of Cloud Dynamics on Graphics Hardware," Graphics Hardware 2003
- Hillis et al., "Data Parallel Algorithms," Comm. ACM 29(12), December 1986
- Kapasi et al., "Efficient Conditional Operations for Data-parallel Architectures," in Proc. of the 33rd Annual International Symposium on Microarchitecture, pages 159-170, 2000
- A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, "A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets," IEEE Transactions on Visualization and Computer Graphics, 2004
- NVIDIA Developer web site, http://developer.nvidia.com/page/home
C30