GPGPU: Parallel Reduction and Scan


Administrivia

GPGPU: Parallel Reduction and Scan. Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011.

Assignment 3 due Wednesday 11:59pm on Blackboard. Assignment 4 handed out Monday, 02/14. Final Wednesday 05/04, 12:00-2:00pm. Review session probably Friday 04/29.

Agenda

Activity: given a set of numbers, design a GPU algorithm for:

Team 1: Sum of values
Team 2: Maximum value
Team 3: Product of values
Team 4: Average value

Consider: bottlenecks, arithmetic intensity (the compute-to-memory-access ratio), optimizations, and limitations.

Reduction: an operation that computes a single result from a set of data. Examples: minimum/maximum value (for tone mapping), average, sum, product, etc.

Approach: do it in parallel, obviously. Store the data in a 2D texture. Render a viewport-aligned quad 1/4 of the texture size (half in each dimension), with texture coordinates addressing every other texel. The fragment shader computes the operation with four texture reads over the surrounding 2x2 block of data. Use the output as input to the next pass, and repeat until done.

uniform sampler2D u_data;
in vec2 fs_texCoords;
out float out_maxValue;

void main(void)
{
    float v0 = texture(u_data, fs_texCoords).r;
    float v1 = textureOffset(u_data, fs_texCoords, ivec2(0, 1)).r;
    float v2 = textureOffset(u_data, fs_texCoords, ivec2(1, 0)).r;
    float v3 = textureOffset(u_data, fs_texCoords, ivec2(1, 1)).r;
    out_maxValue = max(max(v0, v1), max(v2, v3));
}

Image from http://http.developer.nvidia.com/gpugems/gpugems_ch37.html

Walking through the same shader: the texture() call plus the three textureOffset() calls read four values in a 2x2 area of the texture. Note we are only using the red channel. How efficient is this? The shader outputs the maximum of the four values. This reduces an n x n texture in log2(n) passes: 1024x1024 is only 10 passes.

Bottlenecks: reading the result back to the CPU (recall glReadPixels), and each pass depends on the previous one. How does this affect pipeline utilization? Low arithmetic intensity.

Optimizations: use just the red channel or all of RGBA? Read 2x2 areas or n x n? What is the trade-off? When do you read back, at 1x1? How many textures are needed?

Ping-ponging: use two textures, X and Y. First pass: X is input, Y is output. Additional passes swap input and output. Implement with FBOs.

Pass   Input   Output
1      X       Y
2      Y       X
3      X       Y
4      Y       X

Limitations: maximum texture size, and a power of two is required in each dimension. How do you work around these?
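The "red channel vs. RGBA" optimization above only changes how the shader handles components. A minimal sketch, not from the original slides, assuming four independent values are packed per texel (the names follow the earlier example):

uniform sampler2D u_data;
in vec2 fs_texCoords;
out vec4 out_maxValue;

void main(void)
{
    // Each texel carries four values; reduce the 2x2 neighborhood per channel,
    // quadrupling the work done per texture fetch.
    vec4 v0 = texture(u_data, fs_texCoords);
    vec4 v1 = textureOffset(u_data, fs_texCoords, ivec2(0, 1));
    vec4 v2 = textureOffset(u_data, fs_texCoords, ivec2(1, 0));
    vec4 v3 = textureOffset(u_data, fs_texCoords, ivec2(1, 1));
    out_maxValue = max(max(v0, v1), max(v2, v3));
}

A final pass (or a small CPU step after read-back) still has to reduce the four channels of the last texel down to one value.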

All-Prefix-Sums

Input: an array of n elements [a0, a1, ..., a(n-1)], a binary associative operator (+), and an identity element I. Output: the array [I, a0, (a0 (+) a1), ..., (a0 (+) a1 (+) ... (+) a(n-2))].

Example: if (+) is addition, the array [3 1 7 0 4 1 6 3] is transformed to [0 3 4 11 11 15 16 22]. This seems sequential, but there is an efficient parallel solution.

Images from http://http.developer.nvidia.com/gpugems3/gpugems3_ch39.html

Scan: the all-prefix-sums operation on an array of data.

Exclusive scan (also called a prescan): element j of the result does not include element j of the input. In: [3 1 7 0 4 1 6 3]. Out: [0 3 4 11 11 15 16 22].

Inclusive scan: all elements up to and including j are summed. In: [3 1 7 0 4 1 6 3]. Out: [3 4 11 11 15 16 22 25].

How do you generate an exclusive scan from an inclusive scan? Shift right by one element and insert the identity. In: [3 1 7 0 4 1 6 3]. Inclusive: [3 4 11 11 15 16 22 25]. Exclusive: [0 3 4 11 11 15 16 22]. How do you go in the opposite direction? (Combine each input element back in: inclusive[j] = exclusive[j] (+) input[j].)
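In the multipass fragment-shader setting used above, the shift-right step could look like the following minimal sketch. This is not from the original slides; u_inclusiveScan is an assumed name, and the scan values are assumed to live in one row of a texture's red channel:

uniform sampler2D u_inclusiveScan; // assumed name: inclusive scan results, one row, red channel
out float out_exclusive;

void main(void)
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    // Shift right by one texel; element 0 receives the identity (0.0 for addition).
    out_exclusive = (coord.x == 0)
        ? 0.0
        : texelFetch(u_inclusiveScan, coord - ivec2(1, 0), 0).r;
}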

Use cases: summed-area tables for variable-width image processing, stream compaction, radix sort.

Scan application: stream compaction. Given an array of elements, create a new array containing only the elements that meet a certain criteria (e.g., non-null), preserving their order. Used in collision detection, sparse matrix compression, etc. Can reduce bandwidth from the GPU to the CPU: only the compacted result [a c d g] is transferred instead of the full array.

Stream Compaction, Step 1: create a temporary array containing 1 if the corresponding element meets the criteria and 0 if it does not. For example, with an eight-element input in which only the elements a, c, d, and g (at positions 0, 2, 3, and 6) meet the criteria, the temporary array is:

[1 0 1 1 0 0 1 0]
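As a fragment shader, Step 1 is a one-line predicate pass. A minimal sketch, assuming the elements live in one row of a texture's red channel and "meets the criteria" means non-zero (u_input is an assumed name):

uniform sampler2D u_input; // assumed name: input elements, one row, red channel
out float out_flag;

void main(void)
{
    float v = texelFetch(u_input, ivec2(gl_FragCoord.xy), 0).r;
    // Write 1 where the element meets the criteria (here: non-zero), 0 otherwise.
    out_flag = (v != 0.0) ? 1.0 : 0.0;
}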

Step 2: run an exclusive scan on the temporary array. It runs in parallel!

Scan result: [0 1 1 2 3 3 3 4]

Step 3: scatter. The result of the scan is the index into the final array. Only write an element if the temporary array has a 1 there. The scatter runs in parallel. What can we do with the results?
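A classic fragment shader cannot scatter; it can only write to its own output location. One known workaround (in the spirit of Horn's GPU stream compaction, not shown in the original slides) is to turn the scatter into a gather: each output index j binary-searches the monotonically non-decreasing scan array for the input element whose output index is j. All u_* names below are assumptions, the scan values are assumed to be exact integer-valued floats, and the viewport is assumed to be one row whose width equals the number of surviving elements:

uniform sampler2D u_input; // assumed: original elements, one row, red channel
uniform sampler2D u_scan;  // assumed: exclusive scan of the flags
uniform int u_n;           // assumed: input array length

out float out_compacted;

void main(void)
{
    int j = int(gl_FragCoord.x); // output index
    // Find the largest i with scan[i] <= j. The scan increments exactly at
    // surviving elements, so that i is the element whose output index is j.
    int lo = 0;
    int hi = u_n - 1;
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2; // upper mid so lo always advances
        if (int(texelFetch(u_scan, ivec2(mid, 0), 0).r) <= j)
            lo = mid;
        else
            hi = mid - 1;
    }
    out_compacted = texelFetch(u_input, ivec2(lo, 0), 0).r;
}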

Walking through the scatter with scan result [0 1 1 2 3 3 3 4]: the final array fills in as [a], then [a c], then [a c d], and finally [a c d g] at output indices 0-3. Scatter runs in parallel!

Scan is used to convert certain sequential computations into equivalent parallel computations. Image from http://http.developer.nvidia.com/gpugems3/gpugems3_ch39.html

Activity: design a parallel algorithm for exclusive scan. In: [3 1 7 0 4 1 6 3]. Out: [0 3 4 11 11 15 16 22]. Consider the total number of additions; ignore GLSL constraints.

Sequential scan: a single thread, trivial. n adds for an array of length n; work complexity O(n). How many adds will our parallel version have?

Naive Parallel Scan

Input: [0 1 2 3 4 5 6 7]. Is this scan exclusive or inclusive? Each thread writes one sum and reads two values. Image from http://developer.download.nvidia.com/compute/cuda/1_1/website/projects/scan/doc/scan.pdf

For each pass d = 1 to log2(n), every element k with k >= 2^(d-1) is replaced by x[k] + x[k - 2^(d-1)]; elements with k < 2^(d-1) are copied through. Recall, it runs in parallel!

d = 1, 2^(d-1) = 1: [0 1 3 5 7 9 11 13]
d = 2, 2^(d-1) = 2: [0 1 3 6 10 14 18 22]   (consider only k = 7: 13 + 9 = 22)
d = 3, 2^(d-1) = 4: [0 1 3 6 10 15 21 28]   (again for k = 7: 22 + 6 = 28)

Final result: [0 1 3 6 10 15 21 28], the inclusive scan of the input.
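One pass of this naive scan maps naturally onto the multipass fragment-shader approach from the reduction section. A minimal sketch, not from the original slides; u_data and u_offset are assumed names, with u_offset holding 2^(d-1) for the current pass d:

uniform sampler2D u_data; // assumed: scan state after the previous pass, one row, red channel
uniform int u_offset;     // assumed: 2^(d-1) for the current pass d

out float out_sum;

void main(void)
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    float x = texelFetch(u_data, coord, 0).r;
    // Elements with k >= 2^(d-1) add the neighbor 2^(d-1) texels to the left;
    // the rest are copied through unchanged.
    if (coord.x >= u_offset)
        x += texelFetch(u_data, coord - ivec2(u_offset, 0), 0).r;
    out_sum = x;
}

The application would ping-pong between two textures, as in the reduction example, running log2(n) passes.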

What is naive about this algorithm? What was the work complexity for sequential scan? O(n). What is the work complexity for this? Each of the log2(n) passes performs O(n) adds, so O(n log n) in total.

Summary: parallel reductions and scan are building blocks for many algorithms. An understanding of parallel programming and GPU architecture yields efficient GPU implementations.