Real-time Graphics 9. GPGPU

Similar documents
Real-time Graphics 9. GPGPU

CUDA. Sathish Vadhiyar High Performance Computing

Introduction to CUDA

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

GPU programming. Dr. Bernhard Kainz

High Performance Linear Algebra on Data Parallel Co-Processors I

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

ECE 574 Cluster Computing Lecture 15

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

Scientific discovery, analysis and prediction made possible through high performance computing.

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

Tesla Architecture, CUDA and Optimization Strategies

ECE 574 Cluster Computing Lecture 17

Today s Content. Lecture 7. Trends. Factors contributed to the growth of Beowulf class computers. Introduction. CUDA Programming CUDA (I)

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

CUDA Architecture & Programming Model

Martin Kruliš, v

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

GPU Programming Using CUDA

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

CS516 Programming Languages and Compilers II

General-purpose computing on graphics processing units (GPGPU)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CUDA Programming Model

Lecture 2: CUDA Programming

Lecture 1: an introduction to CUDA

Information Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)

Introduction to CUDA Programming

Practical Introduction to CUDA and GPU

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Introduction to Parallel Computing with CUDA. Oswald Haan

ECE 408 / CS 483 Final Exam, Fall 2014

Martin Kruliš, v

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Shaders. Slide credit to Prof. Zwicker

GPU COMPUTING. Ana Lucia Varbanescu (UvA)

Parallel Numerical Algorithms

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

CUDA (Compute Unified Device Architecture)

Lecture 10!! Introduction to CUDA!

Compute Shaders. Christian Hafner. Institute of Computer Graphics and Algorithms Vienna University of Technology

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Computer Architecture

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

OpenMP and GPU Programming

CS427 Multicore Architecture and Parallel Computing

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Parallel Systems Course: Chapter IV. GPU Programming. Jan Lemeire Dept. ETRO November 6th 2008

EEM528 GPU COMPUTING

Introduction to GPGPUs and to CUDA programming model

GPGPU/CUDA/C Workshop 2012

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

GPGPU in Film Production. Laurence Emms Pixar Animation Studios

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

GPU Computing: A Quick Start

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

Accelerating image registration on GPUs

High Performance Computing and GPU Programming

GPU Programming. Rupesh Nasre.

University of Bielefeld

Introduction to CUDA

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

CUDA C Programming Mark Harris NVIDIA Corporation

Introduction to GPU programming. Introduction to GPU programming p. 1/17

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Massively Parallel Architectures

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Module 2: Introduction to CUDA C

Lecture 3: Introduction to CUDA

CS377P Programming for Performance GPU Programming - I

Module 2: Introduction to CUDA C. Objective

Module 3: CUDA Execution Model -I. Objective

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

COSC 6374 Parallel Computations Introduction to CUDA

GPU programming: CUDA. Sylvain Collange Inria Rennes Bretagne Atlantique

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Lecture 11: GPU programming

General Purpose Computing on Graphical Processing Units (GPGPU(

Multi-Processors and GPU

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA

Transcription:

9. GPGPU

GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose computation using power of GPU, parallel processor 2

RTG motivation Some rendering algorithms require special features Various deformations Correct order of dynamic geometry Sorting - dynamic alpha-blended particles Physics simulations Collision detection Cloth Can be done via shaders, but requires hacking 3

Serial vs Parallel processing void process_in_serial() { for (int i=0;i<n;i++) calculate(i); } void process_in_parallel() { start_threads(n, thread); } void thread() { calculate(thread_id); } 4

SIMD Single Instruction Multiple Data Performing the same operation on multiple data points simultaneously Usage - data processing & compression, cryptography, image & video processing, 3D graphics CPU - special instructions + registers Intel - since 1997, MMX, SSE AMD since 1998, 3DNow! IBM PowerPC, Cell Processor (PS3) 5

SIMD www.wikipedia.org 6

GPU history Primarily for transformation, rasterization and texturing Data types SIMD processing vertex 4 x 32bit FP fragment (pixel) 4 x 8bit FP/integer texel 4 x 8bit integer Increasing number of processing units (cores, pipelines) Separate pipelines for vertices and fragmets Unified shaders (2006), 60+ cores Increasing programmability of pipelines (shader units) Conditional instructions, loops General-purpose programming via new technologies Support for 32bit FP texels and fragments 7

GPU architecture GPU multicore processor Basic computation unit Core, Stream Core, Stream (Thread) Processor Array of cores - Streaming Multiprocessor, SIMD Engine, Compute Unit Several levels of hierarchy Shared memory for cores in this array Special Function Units Private registers, cahces Scheduler 8

GPU architecture NVIDIA GeForce 8800 9

GPU architecture NVIDIA GeForce GTX 780 10

GPU architecture AMD Radeon R9 290X 11

GPU devices Radeon R9 290X 4GB Compute units: 44 Stream Processors: 2816 Nvidia GeForce GTX Titan 6GB Stream Processors: 2688 12

GPU computation model Thread, Work item One running computation Executed on cores Local memory for one thread Kernel program running on device Work-group, Thread block Group of several threads Shared local per-block memory Organized in 1,2,3 dimensional grids Controlled by streaming multiprocessor Grid of blocks, work-groups per device Organized in 1,2,3, N dimensional arrays 13

GPU memory model Global memory Accessible by all threads Slow, used with restrictions, bank conflicts Per device - 512- MB Constant memory Read-only section of global memory Per device = 64 KB -, cache per SM = 8 KB Local, Shared memory Sharing data between work items Per SM = 16 KB - Fast, used with restriction, synchronization needed Registers, Private memory per SM = 16384 - Allocated for each thread separately, Fast 14

GPU programming Graphics API OpenGL (universal) GLSL Cg ASM Compute shaders DirectX (win32) HLSL Cg ASM DirectCompute Computing languages Nvidia CUDA (universal) CUDA PTX ISA ATI Stream (universal) Brook+ AMD IL ISA OpenCL (universal) all devices 15

CPU/GPU programming 16

CUDA Compute Unified Device Architecture www.nvidia.com/object/cuda_home_new.html Restricted to NVIDIA hardware Exploiting GPU capabilities more than graphics API unlimited scatter write direct access to memory utilize shared memory Kernel code based on C,C++ API for loading kernel, work with arrays, managing devices http://code.msdn.microsoft.com/windowsdesktop/nvidia- GPU-Architecture-45c11e6d 17

CUDA 18

CUDA C/C++ extension Device vs Host code Function qualifiers: device, host, global Variable qualifiers: device, constant, shared Built-in variables threadidx coordinates of thread within a block blockdim number of threads in a block blockidx coordinates of thread s block within grid of blocks griddim number of blocks in grid 19

Example - Host part Computing gradients in 2D one component image float* d_idata; // device (GPU) input data cudamalloc( (void**) &d_idata, mem_size); float* d_odata; // device (GPU) output data cudamalloc( (void**) &d_odata, mem_size); float* h_idata; int width, height; loadimg( input.tga, &width, &height, h_idata); // fill source host data by CPU cudamemcpy(d_idata, h_idata, width * height * 4, cudamemcpyhosttodevice); dim3 threads( 8, 8, 1); // threads per block dim3 grid( 1 + width / 8, 1 + height / 8, 1); // blocks in grid compute_gradient<<< grid, threads >>>(d_idata, d_odata, width, height); cudathreadsynchronize(); // wait for result float* h_odata = new float[2 * width * height]; cudamemcpy(h_odata, d_odata, 2 * width * height * 4, cudamemcpydevicetohost); saveimg( output.tga, h_odata); 20

Example - Device part global void compute_gradient(float* g_idata, float2* g_odata, const int width, const int height) { const unsigned int i = blockidx.x * blockdim.x + threadidx.x; const unsigned int j = blockidx.y * blockdim.y + threadidx.y; if (i >= width j >= height) return; float value = g_idata[i + j * width]; float prev_x = value; if (i > 0) prev_x = g_idata[i - 1 + j * width]; float next_x = value; if (i < width-1) next_x = g_idata[i + 1 + j * width]; float prev_y = value; if (i > 0) prev_y = g_idata[i + (j 1) * width]; float next_y = value; if (i < height -1) next_y = g_idata[i + (j + 1) * width]; float z_dx = next_x prev_x; float z_dy = next_y prev_y; g_odata [i + j * width] = float2(z_dx, z_dy); } 21

Example Shared memory kernel void compute_gradient( global const float* g_idata, global float2* g_odata, const int width, const int height) { shared float s_data[10][10]; const unsigned int i = blockidx.x * blockdim.x + threadidx.x; const unsigned int j = blockidx.y * blockdim.y + threadidx.y; if (i >= width j >= height) return; float value = g_idata[i + j * width]; s_data[threadidx.x+1] [threadidx.y+1] = value; syncthreads(); } float prev_x = s_idata[threadidx.x+1-1] [threadidx.y+1]; float next_x = s_idata[threadidx.x+1 + 1] [threadidx.y+1]; float prev_y = s_idata[threadidx.x+1] [threadidx.y+1-1]; float next_y = s_idata[threadidx.x+1] [threadidx.y+1 + 1]; float z_dx = next_x prev_x; float z_dy = next_y prev_y; g_odata [i + j * width] = float2(z_dx, z_dy); 22

OpenCL Open standardized library for parallel computation https://www.khronos.org/opencl/ For use, drivers and SDKs must be installed http://developer.amd.com/tools-andsdks/opencl-zone/opencl-tools-sdks/ https://developer.nvidia.com/opencl Language based on C,C++ Many similarities with CUDA 23

OpenCL Host part - opencl.codeplex.com/wikipage?title=opencl Tutorials - 1 Getting platforms = host+collection of devices Getting devices for given platform Create context for devices, defines entire OpenCL environment Fill command queue with kernels that will be executed on device Create memory buffers and fill it with data Prepare, compile and build programs Create kernel from program, set its arguments and launch it Wait for completition and read data 24

OpenCL kernel kernel void compute_gradient( global const float* g_idata, global float2* g_odata, const int width, const int height) { local float s_data[10][10]; const unsigned int i = get_global_id(0); //get_group_id(0) * get_local_size(0) + get_local_id(0); const unsigned int j = get_global_id(1); // get_group_id(1) * get_local_size(1) + get_local_id(1); if (i >= width j >= height) return; float value = g_idata[i + j * width]; s_data[get_local_id(0)+1] [get_local_id(1)+1] = value; barrier(clk_local_mem_fence); } float prev_x = s_idata[get_local_id(0)+1-1] [get_local_id(1)+1]; float next_x = s_idata[get_local_id(0)+1 + 1] [get_local_id(1)+1]; float prev_y = s_idata[get_local_id(0)+1] [get_local_id(1)+1-1]; float next_y = s_idata[get_local_id(0)+1] [get_local_id(1)+1 + 1]; float z_dx = next_x prev_x; float z_dy = next_y prev_y; g_odata [i + j * width] = float2(z_dx, z_dy); 25

OpenCL & OpenGL Possibility to share resources, buffers and textures between OpenCL and OpenGL contexts Ping-pong between contexts problem! Based on optimalization in drivers Or use compute shaders if possible 26

Compute shaders General purpose computation inside OpenGL context Direct access to OpenGL resources Not part of rendering pipeline one function to execute shader Access to work-items and groups indices Access to shared, local memory and synchronization Using well-known API for loading shaders, managing buffers and textures Using GLSL 27

Compute shader example http://wili.cc/blog/opengl-cs.html #version 430 uniform float roll; uniform image2d desttex; layout (local_size_x = 16, local_size_y = 16) in; void main() { ivec2 storepos = ivec2(gl_globalinvocationid.xy); float localcoef = length(vec2(ivec2(gl_localinvocationid.xy)-8)/8.0); float globalcoef = sin(float(gl_workgroupid.x+gl_workgroupid.y)*0.1 + roll)*0.5; imagestore(desttex, storepos, vec4(1.0-globalcoef*localcoef, 0.0, 0.0, 0.0)); } 28

Questions? 29