Compute Shaders. Christian Hafner. Institute of Computer Graphics and Algorithms Vienna University of Technology

Similar documents
Get the most out of the new OpenGL ES 3.1 API. Hans-Kristian Arntzen Software Engineer

A Review of OpenGL Compute Shaders

OpenGL Compute Shaders

Information Coding / Computer Graphics, ISY, LiTH. Compute shaders!! The future of GPU computing or a late rip-off of Direct Compute?

CS195V Week 6. Image Samplers and Atomic Operations

OpenGL Compute Shaders

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU

GDC 2014 Barthold Lichtenbelt OpenGL ARB chair

Shaders. Slide credit to Prof. Zwicker

Getting the Most Out of OpenGL ES. Daniele Di Donato, Tom Olson, and Dave Shreiner ARM

The Application Stage. The Game Loop, Resource Management and Renderer Design

OpenGL Status - November 2013 G-Truc Creation

Programming shaders & GPUs Christian Miller CS Fall 2011

Working with Metal Overview

OpenGL BOF Siggraph 2011

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Antonio R. Miele Marco D. Santambrogio

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Particle Systems with Compute & Geometry Shader. Institute of Computer Graphics and Algorithms Vienna University of Technology

DEVELOPER DAY. Vulkan Subgroup Explained Daniel Koch NVIDIA MONTRÉAL APRIL Copyright Khronos Group Page 1

Shared Memory and Synchronizations

CUDA (Compute Unified Device Architecture)

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CS4621/5621 Fall Particle Systems and Compute Shaders

CS427 Multicore Architecture and Parallel Computing

GPU Fundamentals Jeff Larkin November 14, 2016

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

EE 4702 GPU Programming

Introduction to Multicore Programming

Tuning CUDA Applications for Fermi. Version 1.2

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

B. Tech. Project Second Stage Report on

Lecture 13: OpenGL Shading Language (GLSL)

Introduction to CUDA

Unleash the Benefits of OpenGL ES 3.1 and the Android Extension Pack. Nathan Li SGL & Staff Engineer, ARM

Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University

TOPICS IN GPU-BASED VIDEO PROCESSING. Thomas J. True

ECE 574 Cluster Computing Lecture 15

Blis: Better Language for Image Stuff Project Proposal Programming Languages and Translators, Spring 2017

Mention driver developers in the room. Because of time this will be fairly high level, feel free to come talk to us afterwards

OpenGL on Android. Lecture 7. Android and Low-level Optimizations Summer School. 27 July 2015

Introduction to CUDA (1 of n*)

ECE 574 Cluster Computing Lecture 17

Porting Roblox to Vulkan. Arseny

Vulkan (including Vulkan Fast Paths)

CS 179: GPU Programming

Starting out with OpenGL ES 3.0. Jon Kirkham, Senior Software Engineer, ARM

DirectCompute Performance on DX11 Hardware. Nicolas Thibieroz, AMD Cem Cebenoyan, NVIDIA

Compute-mode GPU Programming Interfaces

Martin Kruliš, v

CS475/CS675 - Computer Graphics. OpenGL Drawing

Efficient and Scalable Shading for Many Lights

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

Brook+ Data Types. Basic Data Types

GPGPU on Mobile Devices

Copyright Khronos Group 2012 Page 1. Teaching GL. Dave Shreiner Director, Graphics and GPU Computing, ARM 1 December 2012

Exotic Methods in Parallel Computing [GPU Computing]

CS179: GPU Programming

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay

Achieving High-performance Graphics on Mobile With the Vulkan API

Mali & OpenGL ES 3.0. Dave Shreiner Jon Kirkham ARM. Game Developers Conference 27 March 2013

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

GPU-accelerated data expansion for the Marching Cubes algorithm

BOF Siggraph 2012 Barthold Lichtenbelt OpenGL ARB chair

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

The Bifrost GPU architecture and the ARM Mali-G71 GPU

CME 213 S PRING Eric Darve

EECS 487: Interactive Computer Graphics

GLSL Introduction. Fu-Chung Huang. Thanks for materials from many other people

last time put back pipeline figure today will be very codey OpenGL API library of routines to control graphics calls to compile and load shaders

GPU Memory Model. Adapted from:

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

AMD CodeXL 1.3 GA Release Notes

WebGL and GLSL Basics. CS559 Fall 2015 Lecture 10 October 6, 2015

OpenGL Shading Language Hints/Kinks

Computer Architecture

CMPSCI 691AD General Purpose Computation on the GPU

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Programming with OpenGL Part 3: Shaders. Ed Angel Professor of Emeritus of Computer Science University of New Mexico

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Programming with OpenGL Shaders I. Adapted From: Ed Angel Professor of Emeritus of Computer Science University of New Mexico

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

Octree-Based Sparse Voxelization for Real-Time Global Illumination. Cyril Crassin NVIDIA Research

Spring 2009 Prof. Hyesoon Kim

Martin Kruliš, v

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose

GRAPHICS PROCESSING UNITS

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

CUDA Architecture & Programming Model

12.2 Programmable Graphics Hardware

Spring 2011 Prof. Hyesoon Kim

Real-Time Rendering Architectures

Introduction to GPU hardware and to CUDA

Lecture 5 Vertex and Fragment Shaders-1. CITS3003 Graphics & Animation

NVIDIA GPU CODING & COMPUTING

Transcription:

Compute Shaders Christian Hafner Institute of Computer Graphics and Algorithms Vienna University of Technology

Overview Introduction Thread Hierarchy Memory Resources Shared Memory & Synchronization Christian Hafner 5

Motivation I Use parallel processing power of GPU for General Purpose (GP) computations Great for image processing, particles, simulations, etc. Implement any parallel SPMD algorithm! Single Program, Multiple Data Christian Hafner 6

Motivation II Why not OpenCL or CUDA? One API for graphics and GP processing Avoid interop Avoid context switches You already know GLSL Christian Hafner 7

Availability Core since OpenGL 4.3 (Aug 2012) Part of OpenGL ES 3.1 Supported on Nvidia GeForce 400+ Nvidia Quadro x000, Kxxx AMD Radeon HD 5000+ Intel HD Graphics 4600 Christian Hafner 8

Pipeline Compute Shaders Old Pipeline Christian Hafner 9

How to use it? Write compute shader in GLSL Define memory resources Write main() function Initialization Run it Allocate GPU memory (buffers, textures) Compile shader, link program Bind buffers, textures, images, uniforms Call gldispatchcompute(...) Christian Hafner 10

Overview Introduction Thread Hierarchy Memory Resources Shared Memory & Synchronization Christian Hafner 11

Thread Hierarchy I Smallest execution unit: Thread = Invocation 1x Christian Hafner 12

Thread Hierarchy II Grid of Threads: Work Group 1D, 2D or 3D Size specified in shader code 2D example: layout( local_size_x = 5, local_size_y = 3, local_size_z = 1 ) in; Christian Hafner 13

Thread Hierarchy III Grid of work groups: Dispatch 1D, 2D or 3D Size specified in OpenGL call 2D example: gldispatchcompute( 4 /* x */, 8 /* y */, 1 /* z */ ); Christian Hafner 14

Thread Location GLSL built-in variables Have type uvec3 2D example: gl_workgroupid is (2,4,0) gl_localinvocationid is (3,0,0) gl_globalinvocationid is (13,12,0) (0,0) Christian Hafner 15

Work Group Size Limits Limits on work group size per dimension: int dim = 0; /* 0=x, 1=y, 2=z */ int maxsizex; glgetintegeri_v( GL_MAX_COMPUTE_WORK_GROUP_SIZE, dim, &maxsizex); Limit on total number of threads per work group: int maxinvoc; glgetintegeri( GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &maxinvoc); Christian Hafner 16

Overview Introduction Thread Hierarchy Memory Resources Shared Memory & Synchronization Christian Hafner 17

Memory Resources I No input from generic vertex attributes layout(location=0) in vec4 position; No output to draw buffers / depth buffer layout(location=0) out vec4 outcolor; Christian Hafner 18

Memory Resources II Inputs Outputs Random access to Images SSBOs Uniforms Samplers Random access to Images SSBOs Christian Hafner 19

Images I An image is a single mipmap layer of a texture No mipmapping No advanced sampling Can have array layers Access in shader with integer pixel coordinates No support for 3-component images (e.g. rgb8; use rgba8 instead) Christian Hafner 20

Images II In shader code Define image variable Access with imageload, imagestore, or atomic operations Initialization in C++ Allocate texture with corresponding format Binding Bind a layer of the texture with glbindimagetexture(...) Christian Hafner 21

Images III (Example) layout(binding=0, r8) uniform readonly image2d colimage; layout(binding=1, r8ui) uniform writeonly uimage2d bwimage;... float val = imageload(colimage, ivec2(gl_globalinvocationid.xy)).r; imagestore(bwimage, ivec2(gl_globalinvocationid.xy), uvec4(mix(0,1,val>0.5), 0,0,1)); Christian Hafner 22

SSBOs I SSBO = Shader Storage Buffer Object Continuous, large chunk of GPU memory Definition in shader similar to struct in C++ Supports unsized arrays Random access + atomic operations Christian Hafner 23

SSBOs II In shader code Define buffer block with data layout Access like local variable Initialization in C++ Buffer target is GL_SHADER_STORAGE_BUFFER Upload initial contents in glbufferdata (optional) Binding Bind with glbindbufferbase(...) Christian Hafner 24

SSBOs (Example I) GLSL struct PStruct { vec3 position; vec3 velocity; vec2 lifespan; }; layout(std430, binding=0) buffer Particles { PStruct particles[]; }; Christian Hafner 25

SSBOs (Example II) Example: Let s upload initial data into the buffer For a GL_ARRAY_BUFFER with vertex positions Provide pointer to glm::vec3 array For our particle buffer Declare struct PStruct in C++ Fill array of structs with data and upload in glbufferdata Christian Hafner 26

SSBOs (Example III) C++ struct PStruct { glm::vec3 position; glm::vec3 velocity; glm::vec2 lifespan; }; PStruct* particles = new PStruct[100]; Christian Hafner 27

SSBOs (Example IV) GLSL struct PStruct { vec3 position; vec3 velocity; vec2 lifespan; }; layout(std430, binding=0) buffer Particles { PStruct particles[]; }; Christian Hafner 28

SSBOs (Example V) GLSL C++ struct PStruct { vec3 position; vec3 velocity; vec2 lifespan; }; layout(std430, binding=0) buffer Particles { PStruct particles[]; }; struct PStruct { glm::vec3 position; float padding0; glm::vec3 velocity; float padding1; glm::vec2 lifespan; glm::vec2 padding2; }; Christian Hafner 29

SSBOs (Data Upload) When you match a C++ struct to an SSBO Look up the data alignment rules! Remember to add padding! Christian Hafner 30

Overview Introduction Thread Hierarchy Memory Resources Shared Memory & Synchronization Christian Hafner 31

Memory Local memory Per thread Variables declared in main() Shared memory (SM) Per work group Declared globally with shared Compute shaders only Global memory Textures, buffers, etc. This is what makes compute shaders so special Christian Hafner 32

Shared Memory I Use it why? SM has less delay than global memory Use it when? If many threads require the same data from global memory Read data from global memory once Share it with other threads in SM Christian Hafner 33

Shared Memory II Example: 1D average filter with 7-pixel kernel 8 Threads per work group Christian Hafner 34

Synchronization I Thread 31 Thread 33 Execution order between threads undefined What if Thread 33 is dependent on value written by Thread 31? Christian Hafner 35

Synchronization II Synchronization in GLSL is twofold Invocation Control Memory Control Christian Hafner 36

Invocation Control Invocation Control Control relative execution order of threads in the same work group (No mechanism to control execution order across work groups) GLSL function barrier() barrier() stalls execution until all threads in the work group have reached it However, this is not enough! Christian Hafner 37

Memory Control I What happens when a thread calls imagestore or writes to shared memory? Momentarily: nothing At some undefined point in the future: the value is written An OpenGL implementation has the freedom to cache and delay writes to SM random access writes to images, buffers Christian Hafner 38

Memory Control II Memory Control Flush all writes Make new values visible to other threads memorybarrier*() New values visible to all threads in dispatch groupmemorybarrier() New values visible to all threads in work group Christian Hafner 39

Memory Control III memorybarrierm() m { Shared, Buffer, Image, ε } Works only on buffers / images defined with keyword coherent Christian Hafner 40

Memory Control IV It seems we need barrier() for execution order memorybarriershared() to make SM writes visible 1. barrier(); memorybarriershared(); OR? 2. memorybarriershared(); barrier(); Christian Hafner 41

Memory Control V memorybarrier() Values visible to threads in dispatch What about the next dispatch? Visibility not guaranteed Call API function between dispatches glmemorybarrier(glbitfield mask); Various bits for different operations like buffer access, image access, etc. Christian Hafner 42

Memory Control VI Example: image processing // grayscale filter // image0 -> image1 gldispatchcompute(16,16,1); glmemorybarrier( GL_SHADER_IMAGE_ACCESS_BARRIER_BIT); // gauss filter // image1 -> image2 gldispatchcompute(16,16,1); Christian Hafner 43

List of Stuff Buffer textures / Buffer images Atomic operations On images, SSBOs, and SM GL Core: only on integer variables Atomic counters Warps and memory bank conflicts Further material on the last slides! Christian Hafner 44

Further Material I Atomic Counters http://www.lighthouse3d.com/tutorials/openglshort-tutorials/opengl-atomic-counters/ Parallel Reduction on GPU (E.g. parallel scalar product) http://developer.download.nvidia.com/assets/ cuda/files/reduction.pdf Warps and Memory Bank Conflicts https://www.youtube.com/watch?v=czgm3de BplE Christian Hafner 45

Further Material II Data alignment rules for std140 / std430 OpenGL Specification 4.5, Section 7.6.2.2 (Standard Uniform Block Layout) Synchronization with glmemorybarrier OpenGL Specification 4.5, Section 7.12.2 (Shader Memory Access Synchronization) NVIDIA presentation about OpenGL 4.3 with Gauß Filter Compute Shader Code http://de.slideshare.net/mark_kilgard/siggraph -2012-nvidia-opengl-for-2012 Christian Hafner 46

Further Material III OpenGL Timer Queries Measure your compute shader performance http://www.lighthouse3d.com/tutorials/openglshort-tutorials/opengl-timer-query/ Everything about images (formats, atomics) https://www.opengl.org/wiki/image_load_store Buffer Textures https://www.opengl.org/wiki/buffer_texture Christian Hafner 47