GPGPU IGAD 2014/2015. Lecture 1. Jacco Bikker

1 GPGPU IGAD 2014/2015 Lecture 1 Jacco Bikker

2 Today: Course introduction GPGPU background Getting started Assignment

3 Introduction GPU History

4 History 3DO FZ-1 console 1993

5 History NVidia NV-1 (Diamond Edge 3D) 1995

6 History 3Dfx Diamond Monster 3D 1996

7 History Quake vs GLQuake 1997

8 History Fixed function pipeline vs Programmable pipeline 2007

9 History

10 History Source: Naffziger, AMD

11 History

12 History
GPU - conveyor belt:
  input = vertices + connectivity
  step 1: transform
  step 2: rasterize
  step 3: shade
  step 4: z-test
  output = pixels
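The conveyor belt can be sketched on the CPU for a single scanline. This is only an illustration under simplifying assumptions: the "primitives" are horizontal spans with a constant depth, and the names `Span`, `WIDTH` and `draw_spans` are made up for this sketch and are not part of any graphics API.

```c
#include <assert.h>
#include <float.h>

#define WIDTH 16

/* Hypothetical 1D "primitive": a horizontal span with constant depth. */
typedef struct { float x0, x1, z; int color; } Span;

/* Runs the four conveyor-belt steps for a list of spans on one scanline. */
void draw_spans(const Span *spans, int count, float offset,
                int pixels[WIDTH], float depth[WIDTH])
{
    assert(spans != 0 && count >= 0);
    /* clear the output: background color 0, depth "infinitely far" */
    for (int x = 0; x < WIDTH; x++) { pixels[x] = 0; depth[x] = FLT_MAX; }
    for (int i = 0; i < count; i++) {
        /* step 1: transform (here just a translation along x) */
        float x0 = spans[i].x0 + offset, x1 = spans[i].x1 + offset;
        /* step 2: rasterize, i.e. visit every pixel covered by the span */
        int first = x0 < 0 ? 0 : (int)x0;
        int last = x1 > WIDTH ? WIDTH : (int)x1;
        for (int x = first; x < last; x++) {
            /* step 3: shade (here a constant color per span) */
            int color = spans[i].color;
            /* step 4: z-test, keep the fragment only if it is nearer */
            if (spans[i].z < depth[x]) { depth[x] = spans[i].z; pixels[x] = color; }
        }
    }
}
```

Because of the z-test, the order in which the spans are submitted does not change the final pixels; the nearest span wins wherever two spans overlap.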

13 Introduction
void main(void)
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length(uv), a = atan(uv.y,uv.x);
    float i = floor(r*10);
    a *= floor(pow(128,i/10));
    a += 20.*sin(0.5*t) *i-100.*(r*i/10)*cos(0.5*t);
    r += ( *cos(a)) / 10;
    r = floor(n*r)/10;
    gl_FragColor = (1-r)*vec4(0.5,1,1.5,1);
}

14 Introduction Historically, the GPU is a co-processor. GPUs perform well because they have a constrained execution model, which is based on parallelism. GPU programming requires a very different way of expressing algorithms.

15 Introduction This course Teacher background Your role Learning objectives ECTS / lectures / homework / assessment

16 This course AGT6: 7 lectures We start at 10.00am Demo time Break half-way

17 Lecturer Me : dr. Jacco Bikker - CUDA Ray tracing Rendering

18 Your role You: Maybe a GPGPU / shader expert Use AGT6 to get further Or just pass with a 6

19 Objectives Objectives: Get feet wet Generic GPGPU concepts *not*: Detailed API knowledge

20 Details AGT6: 3 ECTS = ~80 hours Weekly homework, unverified Final assignment: free form

21 Background GPU architecture

22 GPU architecture
CPU: Designed to run one thread as fast as possible.
  Use large caches to minimize memory latency
  Maximize cache usage using pipeline & branch prediction
  Multi-core processing
  Task parallelism
Interesting tricks:
  SIMD
  Hyperthreading

23 GPU architecture
GPU: Designed to combat latency using many threads.
  Hide latency by computation
  Maximize parallelism
  Streaming processing
  Data parallelism
Interesting tricks:
  SIMT
  Use typical GPU hardware (filtering etc.)
  Cache anyway

24 GPU architecture
CPU:
  Multiple tasks = multiple threads
  Tasks run different instructions
  10s of complex threads execute on a few cores
  Thread execution managed explicitly
GPU:
  SIMD: same instructions on multiple data
  Large numbers of light-weight threads on 100s of cores
  Threads are managed and scheduled by hardware

25 GPU architecture

26 GPU architecture
SIMT thread execution:
  Group 32 threads (vertices, pixels, primitives) into warps
  All threads in a warp execute the same instruction
  In case of latency, switch to a different warp (thus: switch out 32 threads for 32 different threads)
Flow control:
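What flow control means for a warp can be sketched with a small CPU simulation. It assumes the commonly described behaviour that a warp runs both sides of a branch one after the other when its threads disagree, with inactive lanes masked out. The function `run_warp` and the branch it executes are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Executes "if (data[i] is odd) data[i] *= 3; else data[i] += 1;" for one
   warp in lockstep. Returns the number of passes the warp needed:
   1 if all lanes agree on the branch, 2 if the warp diverges. */
int run_warp(int data[WARP_SIZE])
{
    assert(data != 0);
    /* every lane evaluates the condition; collect the results in a mask */
    uint32_t mask_then = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (data[lane] & 1) mask_then |= (uint32_t)1 << lane;
    uint32_t mask_else = ~mask_then;
    int passes = 0;
    if (mask_then) {            /* pass for the 'then' side, other lanes idle */
        passes++;
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((mask_then >> lane) & 1u) data[lane] *= 3;
    }
    if (mask_else) {            /* pass for the 'else' side, other lanes idle */
        passes++;
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((mask_else >> lane) & 1u) data[lane] += 1;
    }
    return passes;
}
```

With only even inputs the warp needs a single pass; with mixed inputs it needs two, and in each pass part of the lanes does no useful work. This is why branches that split threads within one warp tend to be expensive.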

27 GPU architecture
void main(void) // for each pixel
{
    float t = iGlobalTime;
    vec2 uv = gl_FragCoord.xy / iResolution.y;
    float r = length(uv), a = atan(uv.y,uv.x);
    float i = floor(r*10);
    a *= floor(pow(128,i/10));
    a += 20.*sin(0.5*t) *i-100.*(r*i/10)*cos(0.5*t);
    r += ( *cos(a)) / 10;
    r = floor(n*r)/10;
    gl_FragColor = (1-r)*vec4(0.5,1,1.5,1);
}

28 GPU architecture
Easy to port to GPU:
  Image postprocessing
  Particle effects
  Ray tracing
Actually, a lot of algorithms are not easy to port at all.
Decades of legacy, or a fundamental problem?

29 Background Why GPGPU OpenCL vs Shaders vs CUDA

30 Why GPGPU
  Some tasks are more efficient on the GPU
  GPU has high theoretical peak performance
  Prevent wasting processing power

31 OpenCL vs shaders
  No mapping to graphics context needed
  Avoid thinking about various transformations of coordinates (world / screen / texture)
  Access to memory levels that are implicit in OpenGL
  OpenCL also runs on CPUs

32 OpenCL vs CUDA (but if you must: A Comprehensive Performance Comparison of CUDA and OpenCL, Fang et al., 2011)

33 Getting Started Tools of the trade Template

34 Tools Get your development tools here: NVidia: AMD: Intel:

35 Template Template available from

36 Template
kernel void main( write_only image2d_t outimg )
{
    int column = get_global_id( 0 );
    int line = get_global_id( 1 );
    // calculate checkerboard pattern
    int tilex = column / 40;
    int tiley = line / 40;
    float color = (float)((tilex + tiley) & 1); // 0 or 1
    float4 white = (float4)( 1, 1, 1, 1 );
    write_imagef( outimg, (int2)(column, line), color * white );
}
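For comparison, the same checkerboard can be written as an ordinary CPU loop. In this sketch the two nested loops stand in for the work items that `get_global_id( 0 )` and `get_global_id( 1 )` identify in the kernel; the function name `checkerboard` and the flat `float` buffer (one value per pixel instead of a `float4` image) are assumptions made for the illustration.

```c
#include <assert.h>

/* CPU reference for the checkerboard kernel: one loop iteration does the
   work of one work item. Writes 0.0f or 1.0f per pixel into 'out'. */
void checkerboard(float *out, int width, int height)
{
    assert(out != 0 && width > 0 && height > 0);
    for (int line = 0; line < height; line++) {
        for (int column = 0; column < width; column++) {
            /* same tile calculation as in the kernel: 40x40 pixel tiles */
            int tilex = column / 40;
            int tiley = line / 40;
            out[line * width + column] = (float)((tilex + tiley) & 1);
        }
    }
}
```

No iteration depends on any other iteration, which is what makes this loop nest a natural fit for one work item per pixel.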

37 Template
#version 330
uniform sampler2D color;
in vec2 P;
in vec2 uv;
out vec3 pixel;
void main()
{
    // retrieve input pixel
    pixel = texture( color, uv ).rgb;
    // darken towards edges
    float dx = P.x - 0.5, dy = P.y - 0.5;
    float distance = sqrt( dx * dx + dy * dy );
    float scale = 1 - max( 0, distance * );
    pixel *= scale;
}

38 Template
bool Game::Init()
{
    // load shader and texture
    cloutput = new Texture( SCRWIDTH, SCRHEIGHT, Texture::FLOAT );
    shader = new Shader( "shaders/checker.vert", "shaders/checker.frag" );
    // load OpenCL code
    kernel = new Kernel( "programs/program.cl", "main" );
    // link cl output texture as an OpenCL buffer
    outputbuffer = clCreateFromGLTexture2D( kernel->getcontext(), CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, cloutput->getid(), 0 );
    kernel->setargument( 0, &outputbuffer );
    // done
    return true;
}

39 Template
void Game::Tick()
{
    // run cl code to fill texture
    kernel->run( &outputbuffer );
    // run shader on cl-generated texture
    shader->bind();
    shader->setinputtexture( GL_TEXTURE0, "color", cloutput );
    shader->setinputmatrix( "view", mat4( 1 ) );
    DrawQuad();
}

40 Getting Started MyFirst OpenCL app OpenCL terminology

41 Terminology
A few words you need to know the meaning of:
  1. Device
  2. Host
  3. Context
  4. Kernel
  5. Program
  6. Compute unit (CUDA: streaming multiprocessor)
  7. Work item (CUDA: thread)
  8. Command queue (synchronous, asynchronous)

42 MyFirst
To execute an OpenCL program:
  1. Query the host system for OpenCL devices
  2. Create a context to associate the OpenCL devices
  3. Create programs that will run on one or more associated devices
  4. From the programs, select kernels to execute
  5. Create memory objects on the host or on the device
  6. Copy memory data to the device as needed
  7. Provide arguments for the kernels
  8. Submit the kernels to the command queue for execution
  9. Copy the results from the device to the host.
API calls, in the order they are used:
  clGetPlatformIDs( )
  clGetDeviceIDs( )
  clCreateContext( )
  clCreateCommandQueue( )
  clCreateProgramWithSource( )
  clBuildProgram( )
  clCreateKernel( )
  clCreateBuffer( )
  clEnqueueWriteBuffer( )
  clSetKernelArg( )
  clEnqueueNDRangeKernel( )
  clFinish( )
  clEnqueueReadBuffer( )

43 MyFirst
#include <stdio.h>
#include "CL/cl.h"

#define ITEMS 10

// kernel source: squares every element of the input array
const char *KernelSource =
    "kernel void hello( global float *input, global float *output )\n"
    "{\n"
    "    size_t id = get_global_id(0);\n"
    "    output[id] = input[id] * input[id];\n"
    "}";

int main()
{
    cl_int err;
    cl_uint num_of_platforms = 0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices = 0;
    size_t global = ITEMS;
    float inputData[ITEMS] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }, results[ITEMS] = { 0 };
    // find a platform and a GPU device
    clGetPlatformIDs( 1, &platform_id, &num_of_platforms );
    clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &num_of_devices );
    // create context and command queue
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
    cl_context context = clCreateContext( props, 1, &device_id, 0, 0, &err );
    cl_command_queue queue = clCreateCommandQueue( context, device_id, 0, &err );
    // build the program and select the kernel
    cl_program program = clCreateProgramWithSource( context, 1, (const char**)&KernelSource, 0, &err );
    clBuildProgram( program, 0, NULL, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "hello", &err );
    // create device buffers and copy the input data to the device
    cl_mem input = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    cl_mem output = clCreateBuffer( context, CL_MEM_WRITE_ONLY, sizeof( float ) * ITEMS, 0, 0 );
    clEnqueueWriteBuffer( queue, input, CL_TRUE, 0, sizeof( float ) * ITEMS, inputData, 0, 0, 0 );
    // set arguments, run the kernel, wait for it to finish
    clSetKernelArg( kernel, 0, sizeof( cl_mem ), &input );
    clSetKernelArg( kernel, 1, sizeof( cl_mem ), &output );
    clEnqueueNDRangeKernel( queue, kernel, 1, 0, &global, 0, 0, 0, 0 );
    clFinish( queue );
    // copy the results back to the host and print them
    clEnqueueReadBuffer( queue, output, CL_TRUE, 0, sizeof( float ) * ITEMS, results, 0, 0, 0 );
    for( int i = 0; i < ITEMS; i++ ) printf( "%f ", results[i] );
    // clean up
    clReleaseMemObject( input );
    clReleaseMemObject( output );
    clReleaseProgram( program );
    clReleaseKernel( kernel );
    clReleaseCommandQueue( queue );
    clReleaseContext( context );
    return 0;
}

44 MyFirst
bool Kernel::InitCL()
{
    cl_platform_id platform;
    cl_device_id* devices;
    cl_uint devcount;
    cl_int error;
    Like I said, I don't care much for API details.
    Just start with the template, and modify / replace it when the need arises.
    ...
}

45 Assignment Create an OpenCL program that calculates Voronoi noise for a 512x512 buffer and make it available to the CPU. Measure the performance gain compared to CPU-only. Reference:
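A CPU-only baseline for the performance comparison could look roughly like the sketch below. It is one possible reading of "Voronoi noise", not the reference implementation: it places one pseudo-random feature point in every grid cell and stores the squared distance to the nearest point (take the square root if the true distance is wanted). The hash function, the `cells` parameter and all function names are assumptions of this sketch, and `clock()` is only a coarse timer.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical integer hash giving a pseudo-random value in [0,1). */
static float hash01(uint32_t x, uint32_t y, uint32_t salt)
{
    uint32_t h = x * 374761393u + y * 668265263u + salt * 2246822519u;
    h = (h ^ (h >> 13)) * 1274126177u;
    h ^= h >> 16;
    return (float)(h >> 8) / 16777216.0f;   /* 24 bits, so always below 1 */
}

/* floor() for values that fit in an int, without needing the math library */
static int floor_to_int(float v)
{
    int i = (int)v;
    return (v < (float)i) ? i - 1 : i;
}

/* Squared distance from (px,py) to the nearest feature point. */
float voronoi_sq(float px, float py)
{
    int cx = floor_to_int(px), cy = floor_to_int(py);
    float best = 1e30f;
    /* the nearest point lies in the current cell or one of its 8 neighbours */
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int gx = cx + dx, gy = cy + dy;
            float fx = (float)gx + hash01((uint32_t)gx, (uint32_t)gy, 0u);
            float fy = (float)gy + hash01((uint32_t)gx, (uint32_t)gy, 1u);
            float ddx = fx - px, ddy = fy - py;
            float d = ddx * ddx + ddy * ddy;
            if (d < best) best = d;
        }
    }
    return best;
}

/* Fills a size x size buffer; 'cells' is the number of grid cells per side. */
void fill_voronoi(float *buffer, int size, float cells)
{
    assert(buffer != 0 && size > 0 && cells > 0.0f);
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            buffer[y * size + x] = voronoi_sq(x * cells / size, y * cells / size);
}

/* Returns the CPU time in milliseconds used to fill the buffer once. */
double time_fill_ms(float *buffer, int size, float cells)
{
    clock_t start = clock();
    fill_voronoi(buffer, size, cells);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For a fair comparison, the OpenCL version would have to produce the same buffer from the same hash, and its timing should include copying the result back to the host, since the assignment asks for the data to be available to the CPU.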

46 Words of Advice
WebGL != OpenCL
  Can't pass by reference, use pointers instead
  float3 parameter: (float3)(1, 1, 1)
  fract requires a second parameter
  sinf doesn't exist, use sin
Also, see this helpful chart:
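The pointer advice and the two-parameter `fract` can be illustrated together in plain C, which also lacks references. The helper `fract_like` below is a hypothetical stand-in that mimics the shape of OpenCL's `fract( x, &whole )`: it returns the fractional part and writes the floor of `x` through the pointer. It is a simplified sketch that assumes `x` fits in an `int` and leaves out the clamping that the OpenCL built-in applies to keep the result below 1.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's two-parameter fract: returns
   x - floor(x) and stores floor(x) in *whole. Because there are no
   references, the second result is returned through a pointer. */
float fract_like(float x, float *whole)
{
    assert(whole != 0);
    float f = (float)(int)x;   /* truncates toward zero */
    if (f > x) f -= 1.0f;      /* correct the truncation for negative x */
    *whole = f;
    return x - f;
}
```

A call looks like `float w; float f = fract_like( 2.75f, &w );`, after which `f` holds the fractional part and `w` the whole part. In a GLSL or WebGL shader the one-parameter `fract( x )` would be used instead, which is the kind of difference this slide warns about.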

47 The End (for now)
