Real-time Graphics 9. GPGPU

GPGPU
- GPU (Graphics Processing Unit): a flexible and powerful processor; programmability, precision, power; parallel processing
- CPU: increasing number of cores; parallel processing
- GPGPU: general-purpose computation using the power of the GPU as a parallel processor

GPGPU related to rendering
- Deformations
- Sorting: dynamic alpha-blended particles
- Collision detection
- Global illumination
- Can be done via shaders, but requires hacking

Serial vs parallel processing

void process_in_serial() {
    // one thread computes c[0] = a[0]+b[0], c[1] = a[1]+b[1], ... in sequence
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void process_in_parallel() {
    start_threads(n, thread);   // launch n threads, one per element
}

void thread() {
    int i = thread_id;          // each thread handles one index
    c[i] = a[i] + b[i];         // c[0], c[1], ..., c[n-1] are computed simultaneously
}

SIMD: Single Instruction, Multiple Data
- Performing the same operation on multiple data points simultaneously
- Usage: data processing and compression, cryptography, image and video processing, 3D graphics
- CPU support: special instructions and registers
  - Intel: since 1997, MMX, SSE
  - AMD: since 1998, 3DNow!
  - IBM: PowerPC, Cell processor (PS3)
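
CPU SIMD is exposed through intrinsics. As an illustration, a minimal sketch using SSE (not from the slides): a single _mm_add_ps instruction adds four floats at once.

#include <xmmintrin.h>

// Adds four packed floats in one SSE instruction.
void add4(const float* a, const float* b, float* c) {
    __m128 va = _mm_loadu_ps(a);           // load a[0..3]
    __m128 vb = _mm_loadu_ps(b);           // load b[0..3]
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  // c[0..3] = a[0..3] + b[0..3], one instruction
}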

SIMD (illustration: www.wikipedia.org)

GPU history
- Primarily for transformation, rasterization and texturing
- Data types, SIMD processing:
  - vertex: 4 x 32-bit FP
  - fragment (pixel): 4 x 8-bit FP/integer
  - texel: 4 x 8-bit integer
- Increasing number of processing units (cores, pipelines)
- Separate pipelines for vertices and fragments
- Unified shaders (2006), 60+ cores
- Increasing programmability of pipelines (shader units): conditional instructions, loops
- General-purpose programming via new technologies
- Support for 32-bit FP texels and fragments

GPU architecture
- GPU: a multicore processor
- Basic computation unit: core, stream core, stream (thread) processor
- Array of cores: streaming multiprocessor, SIMD engine, compute unit
- Several levels of hierarchy:
  - shared memory for the cores in one array
  - special function units
  - private registers, caches
  - scheduler

GPU architecture (diagram: NVIDIA GeForce 8800)

GPU architecture (diagram: NVIDIA GeForce 8800, with work-items and work-groups marked)

GPU devices
- AMD Radeon R9 290X, 4 GB: 44 compute units, 2816 stream processors
- NVIDIA GeForce GTX Titan, 6 GB: 2688 stream processors

GPU computation model
- Thread, work-item (runs on an SP)
  - One running computation, executed on a core
  - Local memory for one thread
  - Kernel: the program running on the device
- Work-group, thread block (runs on an SM)
  - Group of several threads
  - Shared local per-block memory
  - Organized in 1-, 2- or 3-dimensional grids
  - Controlled by a streaming multiprocessor
- Grid of blocks (work-groups) per device
  - Organized in 1-, 2-, 3- or N-dimensional arrays

GPU memory model
- Global memory
  - Accessible by all threads
  - Slow; used with restrictions, bank conflicts
  - Per device: 512 MB and more
- Constant memory
  - Read-only section of global memory
  - Per device: 64 KB and more; cache per SM: 8 KB
- Local, shared memory
  - Sharing data between the work-items of one block
  - Per SM: 16 KB and more
  - Fast; used with restrictions, synchronization needed
- Registers, private memory
  - Per SM: 16384 and more
  - Allocated for each thread separately; fast
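
In CUDA (introduced below), these memory spaces map directly onto declarations. A minimal sketch of where each space appears in code; the kernel and all names are illustrative, not from the lecture:

__constant__ float coeffs[64];   // constant memory: read-only on the device, cached per SM

__global__ void memory_spaces(const float* g_in, float* g_out)   // g_in, g_out point to global memory
{
    __shared__ float tile[256];                        // shared memory: one copy per thread block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float x = g_in[idx];                               // x lives in a register, private to this thread
    tile[threadIdx.x] = x * coeffs[threadIdx.x % 64];
    __syncthreads();                                   // synchronize before reading other threads' writes
    g_out[idx] = tile[threadIdx.x];
}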

GPU programming
- Graphics API
  - OpenGL (universal): GLSL, Cg, ASM, compute shaders
  - DirectX (Win32): HLSL, Cg, ASM, DirectCompute
- Computing languages
  - NVIDIA CUDA (universal): CUDA, PTX ISA
  - ATI Stream (universal): Brook+, AMD IL, ISA
  - OpenCL (universal): all devices

CPU/GPU programming (diagram)

CUDA
- Compute Unified Device Architecture
- Parallel computing platform and application programming interface
- Restricted to NVIDIA hardware
- Exploits GPU capabilities beyond the graphics API:
  - unlimited scatter write
  - direct access to memory
  - use of shared memory
- Kernel code based on C
- API for loading kernels, working with arrays, managing devices
- http://code.msdn.microsoft.com/windowsdesktop/nvidia-GPU-Architecture-45c11e6d

CUDA (diagram)

CUDA C/C++ extension
- Device vs host code
- Function qualifiers: __device__, __host__, __global__
- Variable qualifiers: __device__, __constant__, __shared__
- Built-in variables:
  - threadIdx: coordinates of a thread within its block
  - blockDim: number of threads in a block
  - blockIdx: coordinates of a thread's block within the grid of blocks
  - gridDim: number of blocks in the grid
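
Put together, the parallel addition from the serial-vs-parallel slide becomes a one-line kernel. A minimal sketch; the kernel name and the block size of 256 are illustrative:

__global__ void add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last, partially filled block
        c[i] = a[i] + b[i];
}

// launch: enough 256-thread blocks to cover n elements
// add<<<1 + (n - 1) / 256, 256>>>(d_a, d_b, d_c, n);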

Example - Host part: computing gradients in a one-component 2D image

float* h_idata; int width, height;
loadimg("input.tga", &width, &height, &h_idata);   // fill source host data on the CPU

float* d_idata;   // device (GPU) input data
cudaMalloc((void**) &d_idata, width * height * sizeof(float));
float2* d_odata;  // device (GPU) output data
cudaMalloc((void**) &d_odata, width * height * sizeof(float2));

cudaMemcpy(d_idata, h_idata, width * height * sizeof(float), cudaMemcpyHostToDevice);

dim3 threads( 8, 8, 1);                                 // threads per block
dim3 grid( 1 + (width-1) / 8, 1 + (height-1) / 8, 1);   // blocks in grid
compute_gradient<<< grid, threads >>>(d_idata, d_odata, width, height);
cudaThreadSynchronize();   // wait for the result (deprecated; current CUDA uses cudaDeviceSynchronize)

float* h_odata = new float[2 * width * height];
cudaMemcpy(h_odata, d_odata, width * height * sizeof(float2), cudaMemcpyDeviceToHost);
saveimg("output.tga", h_odata);

CUDA

width = 24, height = 24

// threads per block
dim3 threads( 8, 8, 1);
// blocks in grid
dim3 grid( 1 + (width-1) / 8, 1 + (height-1) / 8, 1);

Substituting width = height = 24: 1 + (24-1) / 8 = 1 + 23 / 8 = 1 + 2 = 3 (integer division truncates 2.875 to 2), so the launch uses grid( 3, 3, 1): a 3 x 3 grid of 8 x 8 blocks exactly covers the 24 x 24 image.

Example - Device part

__global__ void compute_gradient(float* g_idata, float2* g_odata, const int width, const int height)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= width || j >= height) return;

    float value = g_idata[i + j * width];
    float prev_x = value;
    if (i > 0) prev_x = g_idata[i - 1 + j * width];
    float next_x = value;
    if (i < width - 1) next_x = g_idata[i + 1 + j * width];
    float prev_y = value;
    if (j > 0) prev_y = g_idata[i + (j - 1) * width];
    float next_y = value;
    if (j < height - 1) next_y = g_idata[i + (j + 1) * width];

    float z_dx = next_x - prev_x;
    float z_dy = next_y - prev_y;
    g_odata[i + j * width] = make_float2(z_dx, z_dy);
}

CUDA

width = 20, height = 20

// threads per block
dim3 threads( 8, 8, 1);
// blocks in grid
dim3 grid(?, ?, 1);

With the same formula: 1 + (20-1) / 8 = 1 + 19 / 8 = 1 + 2 = 3 (integer division), so again grid( 3, 3, 1). Now the 3 x 3 grid of 8 x 8 blocks spans 24 x 24 threads, more than the 20 x 20 image; the kernel's bounds test if (i >= width || j >= height) return; discards the extra threads, so the device code above applies unchanged.

Example - Device part: indexing. The image is stored row by row, so the thread with coordinates i, j reads g_idata[i + j * width]. For width = 20, the thread with i = 3, j = 2 accesses element 2 * 20 + 3 = 43.

Example - Shared memory

__global__ void compute_gradient(float* g_idata, float2* g_odata, const int width, const int height)
{
    __shared__ float s_data[10][10];
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= width || j >= height) return;

    float value = g_idata[i + j * width];
    s_data[threadIdx.x + 1][threadIdx.y + 1] = value;
    __syncthreads();

    float prev_x = s_data[threadIdx.x + 1 - 1][threadIdx.y + 1];
    float next_x = s_data[threadIdx.x + 1 + 1][threadIdx.y + 1];
    float prev_y = s_data[threadIdx.x + 1][threadIdx.y + 1 - 1];
    float next_y = s_data[threadIdx.x + 1][threadIdx.y + 1 + 1];
    float z_dx = next_x - prev_x;
    float z_dy = next_y - prev_y;
    g_odata[i + j * width] = make_float2(z_dx, z_dy);
}

Example - Shared memory (notes)
- s_data is 10 x 10 for an 8 x 8 block: block size plus a one-cell border on each side (8 + 1 + 1 = 10), so indices such as 0 + 1 - 1 = 0 stay inside the array and no access goes out of bounds.
- Problem: threads outside the image return before __syncthreads(), so some threads of the block will never reach the barrier.

Example - Shared memory: a fix is to let the out-of-range threads hit the barrier before returning, so that every thread of the block reaches __syncthreads() exactly once:

__global__ void compute_gradient(float* g_idata, float2* g_odata, const int width, const int height)
{
    __shared__ float s_data[10][10];
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= width || j >= height) {
        __syncthreads();   // out-of-range threads participate in the barrier, then leave
        return;
    }

    float value = g_idata[i + j * width];
    s_data[threadIdx.x + 1][threadIdx.y + 1] = value;
    __syncthreads();

    float prev_x = s_data[threadIdx.x + 1 - 1][threadIdx.y + 1];
    float next_x = s_data[threadIdx.x + 1 + 1][threadIdx.y + 1];
    float prev_y = s_data[threadIdx.x + 1][threadIdx.y + 1 - 1];
    float next_y = s_data[threadIdx.x + 1][threadIdx.y + 1 + 1];
    float z_dx = next_x - prev_x;
    float z_dy = next_y - prev_y;
    g_odata[i + j * width] = make_float2(z_dx, z_dy);
}

Example - Shared memory: what about the boundary between Block 0 and Block 1? Each block fills only the interior 8 x 8 cells of its own s_data; the border cells read by threads at the block edge are never loaded with the neighboring block's pixels.
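
A hedged sketch of one way to resolve both issues (not from the slides): clamp coordinates at the image border, let the edge threads of each block also load the one-cell halo from the neighboring pixels, and keep the barrier outside any divergent branch. Assumes the 8 x 8 block and 10 x 10 s_data from above.

__global__ void compute_gradient(float* g_idata, float2* g_odata, const int width, const int height)
{
    __shared__ float s_data[10][10];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    const int ci = min(i, width - 1);    // clamp to the image (replicates the border pixel)
    const int cj = min(j, height - 1);

    // every thread loads its own cell; edge threads of the block also fill the halo
    s_data[threadIdx.x + 1][threadIdx.y + 1] = g_idata[ci + cj * width];
    if (threadIdx.x == 0)
        s_data[0][threadIdx.y + 1] = g_idata[max(ci - 1, 0) + cj * width];
    if (threadIdx.x == blockDim.x - 1)
        s_data[threadIdx.x + 2][threadIdx.y + 1] = g_idata[min(ci + 1, width - 1) + cj * width];
    if (threadIdx.y == 0)
        s_data[threadIdx.x + 1][0] = g_idata[ci + max(cj - 1, 0) * width];
    if (threadIdx.y == blockDim.y - 1)
        s_data[threadIdx.x + 1][threadIdx.y + 2] = g_idata[ci + min(cj + 1, height - 1) * width];
    __syncthreads();                     // reached by every thread of the block

    if (i >= width || j >= height) return;   // only now drop the out-of-range threads

    float z_dx = s_data[threadIdx.x + 2][threadIdx.y + 1] - s_data[threadIdx.x][threadIdx.y + 1];
    float z_dy = s_data[threadIdx.x + 1][threadIdx.y + 2] - s_data[threadIdx.x + 1][threadIdx.y];
    g_odata[i + j * width] = make_float2(z_dx, z_dy);
}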

OpenCL
- Open standardized library for parallel computation
- https://www.khronos.org/opencl/
- For use, drivers and SDKs must be installed:
  - http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/
  - https://developer.nvidia.com/opencl
- Language based on C/C++
- Many similarities with CUDA

OpenCL, host part (opencl.codeplex.com/wikipage?title=OpenCL Tutorials - 1)
1. Get the platforms: a host plus a collection of devices
2. Get the devices of a given platform
3. Create a context for the devices; it defines the entire OpenCL environment
4. Create a command queue, to be filled with the kernels that will be executed on a device
5. Create memory buffers and fill them with data
6. Prepare, compile and build the program
7. Create a kernel from the program, set its arguments and launch it
8. Wait for completion and read back the data
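
Condensed into host code, the eight steps look roughly as follows. A sketch in plain OpenCL C API calls, with error checking omitted; it assumes width, height, h_idata and h_odata as in the CUDA example and the kernel source string kernelSrc from the next slide.

#include <CL/cl.h>

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);                                            // 1. platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);                  // 2. device
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);            // 3. context
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);             // 4. command queue
cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              width * height * sizeof(float), h_idata, NULL);    // 5. buffers
cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                              width * height * sizeof(cl_float2), NULL, NULL);
cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSrc, NULL, NULL);     // 6. program
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(prog, "compute_gradient", NULL);               // 7. kernel
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
clSetKernelArg(kernel, 2, sizeof(int), &width);
clSetKernelArg(kernel, 3, sizeof(int), &height);
size_t local[2]  = { 8, 8 };                              // work-group size, as in the CUDA example
size_t global[2] = { 8 * (size_t)(1 + (width - 1) / 8),   // global size: a multiple of the local size
                     8 * (size_t)(1 + (height - 1) / 8) };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, width * height * sizeof(cl_float2),
                    h_odata, 0, NULL, NULL);              // 8. blocking read = wait for completion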

OpenCL kernel

__kernel void compute_gradient(__global const float* g_idata, __global float2* g_odata,
                               const int width, const int height)
{
    __local float s_data[10][10];
    const unsigned int i = get_global_id(0);  // = get_group_id(0) * get_local_size(0) + get_local_id(0)
    const unsigned int j = get_global_id(1);  // = get_group_id(1) * get_local_size(1) + get_local_id(1)
    if (i >= width || j >= height) return;

    float value = g_idata[i + j * width];
    s_data[get_local_id(0) + 1][get_local_id(1) + 1] = value;
    barrier(CLK_LOCAL_MEM_FENCE);

    float prev_x = s_data[get_local_id(0) + 1 - 1][get_local_id(1) + 1];
    float next_x = s_data[get_local_id(0) + 1 + 1][get_local_id(1) + 1];
    float prev_y = s_data[get_local_id(0) + 1][get_local_id(1) + 1 - 1];
    float next_y = s_data[get_local_id(0) + 1][get_local_id(1) + 1 + 1];
    float z_dx = next_x - prev_x;
    float z_dy = next_y - prev_y;
    g_odata[i + j * width] = (float2)(z_dx, z_dy);
}

OpenCL & OpenGL
- Possibility to share resources (buffers and textures) between OpenCL and OpenGL contexts
- Ping-pong between the contexts is a problem! Performance depends on optimizations in the drivers
- Alternatively, use compute shaders if possible

Compute shaders
- General-purpose computation inside the OpenGL context
- Direct access to OpenGL resources
- Not part of the rendering pipeline: one function call executes the shader
- Access to work-item and work-group indices
- Access to shared local memory and synchronization
- Uses the well-known API for loading shaders and managing buffers and textures
- Written in GLSL

Compute shader example (http://wili.cc/blog/opengl-cs.html)

#version 430
uniform float roll;
writeonly uniform image2D destTex;
layout (local_size_x = 16, local_size_y = 16) in;
void main() {
    ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);
    float localCoef = length(vec2(ivec2(gl_LocalInvocationID.xy) - 8) / 8.0);
    float globalCoef = sin(float(gl_WorkGroupID.x + gl_WorkGroupID.y) * 0.1 + roll) * 0.5;
    imageStore(destTex, storePos, vec4(1.0 - globalCoef * localCoef, 0.0, 0.0, 0.0));
}
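
On the host side, the "one function to execute the shader" is glDispatchCompute. A minimal sketch; computeProg, destTex and the 512 x 512 texture size are illustrative:

glUseProgram(computeProg);
glUniform1f(glGetUniformLocation(computeProg, "roll"), roll);
glBindImageTexture(0, destTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R32F);  // bind output image to unit 0
glDispatchCompute(512 / 16, 512 / 16, 1);              // one 16 x 16 work-group per texture tile
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);   // make the image writes visible to later reads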

Questions?