CS 677: Parallel Programming for Many-core Processors Lecture 12


1 CS 677: Parallel Programming for Many-core Processors Lecture 12
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu

2 CS Department Project Poster Day
May 5, 12-2pm (99% confirmed)
Lieb third floor conference room and corridors
5% of total grade as bonus
Suggestion: 9-12 printed pages
Demos would be cool

3 CS Department Project Poster Day
Your name, course number, project title
Project objective: what are you trying to accomplish? What makes this computation worthy of a project?
Method: general description of the method (not of the implementation); suitability for GPU acceleration
Design choices for GPU implementation: workload allocation, use of resources, bottlenecks (avoided and not avoided)
Experimental results, including timings

4 Final Project Presentations
April 29
Send me PPT/PDF file by 5pm
12 min presentation + 2 min Q&A
Counts for 15% of total grade

5 Final Project Presentations
Target audience: fellow classmates
Content:
Problem description: what is the computation and why is it important?
Suitability for GPU acceleration
Amdahl's Law: describe the inherent parallelism. Argue that it is close to 100% of the computation.
Compare with CPU version

6 Final Project Presentations
Content (cont.):
GPU Implementation
Which steps of the algorithm were ported to the GPU?
Workload allocation to threads
Use of resources (registers, shared memory, constant memory, etc.)
Occupancy achieved
Results
Experiments performed
Timings and comparisons against CPU version

7 Final Report
Due May 7 (11:59pm)
6-10 pages including figures, tables and references
Content: see presentation instructions; do not repeat course material
Counts for 20% of total grade
NO LATE SUBMISSIONS

8 Outline
OpenCL Convolution Example
Parallel Min() Example

9 Image Convolution Using OpenCL
Udeepta Bordoloi, ATI Stream Application Engineer, 10/13/2009
Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.

10 Step 1 - The Algorithm
Ignore boundaries
Output size: (input_image_width - filter_width + 1) by (input_image_height - filter_width + 1)

11 C Version

void Convolve(float *pInput, float *pFilter, float *pOutput,
              const int nInWidth, const int nWidth, const int nHeight,
              const int nFilterWidth, const int nNumThreads)
{
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;
            float sum = 0;

12 C Version (2)

            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;
                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF = idxFtmp + c;
                    const int idxIn = idxIntmp + c;
                    sum += pFilter[idxF] * pInput[idxIn];
                }
            } //for (int r = 0...

13 C Version (3)

            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        } //for (int xOut = 0...
    } //for (int yOut = 0...
}

14 Parameters

struct paramStruct {
    int nWidth;            // Output image width
    int nHeight;           // Output image height
    int nInWidth;          // Input image width
    int nInHeight;         // Input image height
    int nFilterWidth;      // Filter size is nFilterWidth x nFilterWidth
    int nIterations;       // Run timing loop for nIterations

    // Test CPU performance with 1, 4, 8 etc. OpenMP threads
    std::vector<int> ompThreads;
    int nOmpRuns;          // ompThreads.size()
    bool bCPUTiming;       // Time CPU performance
} params;

15 OpenMP for Comparison

// This #pragma splits the work between multiple threads
#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
...

void InitParams(int argc, char *argv[])
{
    // time the OpenMP convolution performance with
    // different numbers of threads
    params.ompThreads.push_back(4);
    params.ompThreads.push_back(1);
    params.ompThreads.push_back(8);
    params.nOmpRuns = params.ompThreads.size();
}

16 First Kernel

__kernel void Convolve(const __global float *pInput,
                       __constant float *pFilter,
                       __global float *pOutput,
                       const int nInWidth,
                       const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;

17 First Kernel (2)

    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

18 Initialize OpenCL

cl_context context = clCreateContextFromType(…, CL_DEVICE_TYPE_CPU, …);

// get list of devices - quad core counts as one device
size_t listSize;

/* First, get the size of device list */
clGetContextInfo(context, CL_CONTEXT_DEVICES, …, &listSize);

/* Now, allocate the device list */
cl_device_id *devices = (cl_device_id *)malloc(listSize);

/* Next, get the device list data */
clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, …);

19 Initialize OpenCL (2)

cl_command_queue queue = clCreateCommandQueue(context, devices[0], …);

cl_program program = clCreateProgramWithSource(context, 1, &source, …);
clBuildProgram(program, 1, devices, …);

cl_kernel kernel = clCreateKernel(program, "Convolve", …);

// get error messages
clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, …);

20 Initialize Buffers

cl_mem inputCL = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                host_buffer_size, host_buffer_ptr, …);

// If the device is a GPU (CL_DEVICE_TYPE_GPU), we can
// explicitly copy data to the input image buffer on the device:
clEnqueueWriteBuffer(queue, inputCL, …, host_buffer_ptr, …);

// And copy back from the output image buffer after the
// convolution kernel execution.
clEnqueueReadBuffer(queue, outputCL, …, host_buffer_ptr, …);

21 Execute Kernel

/* input buffer, arg 0 */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
/* filter buffer, arg 1 */
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
/* output buffer, arg 2 */
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&outputCL);
/* input image width, arg 3 */
clSetKernelArg(kernel, 3, sizeof(int), (void *)&nInWidth);
/* filter width, arg 4 */
clSetKernelArg(kernel, 4, sizeof(int), (void *)&nFilterWidth);

22 Execute Kernel

clEnqueueNDRangeKernel(queue, kernel, data_dimensionality, …,
                       total_work_size, work_group_size, …);

// release all buffers (buffers are cl_mem objects)
clReleaseMemObject(inputCL);
...

// release all resources
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);

23 Timing

clFinish(queue);
//Timer Started here();
for (int i = 0; i < nIterations; i++)
    clEnqueueNDRangeKernel(…);
clFinish(queue);
//Timer Stopped here();
//Average Time = ElapsedTime()/nIterations;

The clFinish() call before both starting and stopping the timer ensures that we time the kernel execution activity to its completion, and nothing else.
Run on a 4-core AMD Phenom, treated as a single device by OpenCL.

24 C++ Bindings

// C API:
cl_context context = clCreateContextFromType(…, CL_DEVICE_TYPE_CPU, …);
// C++ bindings:
cl::Context context = cl::Context(CL_DEVICE_TYPE_CPU);

// C API: get list of devices - quad core counts as one device
size_t listSize;
/* First, get the size of device list */
clGetContextInfo(context, CL_CONTEXT_DEVICES, …, &listSize);
/* Now, allocate the device list */
cl_device_id *devices = (cl_device_id *)malloc(listSize);
/* Next, get the device list data */
clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, …);
// C++ bindings:
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

See …

25 C++ Bindings (2)

cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);

cl::Program program = cl::Program(context, …);
program.build(devices);

cl::Kernel kernel = cl::Kernel(program, "Convolve");

string str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);

// Buffer init is similar to C version
// using methods of queue

26 Execute Kernel

/* input buffer, arg 0 */
// C: clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
kernel.setArg(0, inputCL);

/* filter buffer, arg 1 */
// C: clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
kernel.setArg(1, filterCL);

// etc.

queue.enqueueNDRangeKernel(kernel, …, total_work_size, work_group_size, …);

27 Loop Unrolling

__kernel void Convolve_Unroll(const __global float *pInput,
                              __constant float *pFilter,
                              __global float *pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

28 Loop Unrolling (2)

        int c = 0;
        while (c <= nFilterWidth - 4)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            c += 4;
        }

29 Loop Unrolling (3)

        for (int c1 = c; c1 < nFilterWidth; c1++)  // what does this do?
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

30 Performance

31 Unrolled Kernel 2 (if Kernel)

        // last loop
        int cMod = nFilterWidth - c;
        if (cMod == 1)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
        else if (cMod == 2)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            sum += pFilter[idxF + 1] * pInput[idxIn + 1];
        }

32 Unrolled Kernel 2 (2)

        else if (cMod == 3)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            sum += pFilter[idxF + 1] * pInput[idxIn + 1];
            sum += pFilter[idxF + 2] * pInput[idxIn + 2];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

33 Performance
Yet another way to achieve similar results is to write four different versions of the Convolve_Unroll kernel, corresponding to (filterWidth % 4) equalling 0, 1, 2, or 3. The particular version to call can then be decided at run time, depending on the value of filterWidth.

34 Kernel with Invariants
Loop unrolling did not help when the filter width is small.
So far, the kernels have been written generically, so that they work for all filter sizes.
What if we can focus on a particular filter size, e.g. a 5x5 filter?
We can then unroll the inner loop five times and get rid of the loop condition.
If we use the invariant in the loop condition, a good compiler will unroll the loop itself.

35 Kernel with Invariants

__kernel void Convolve_Def(const __global float *pInput,
                           __constant float *pFilter,
                           __global float *pOutput,
                           const int nInWidth,
                           const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
    for (int r = 0; r < FILTER_WIDTH; r++)
    {
        const int idxFtmp = r * FILTER_WIDTH;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

36 Kernel with Invariants (2)

        for (int c = 0; c < FILTER_WIDTH; c++)
        {
            const int idxF = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}

37 Setting Filter Width

// this can be done online and offline

/* create a cl source string */
std::string sourceStr = Convert-File-To-String(File-Name);
cl::Program::Sources sources(1, std::make_pair(sourceStr.c_str(),
                                               sourceStr.length()));

/* create a cl program object */
program = cl::Program(context, sources);

/* build a cl program executable with some #defines */
char options[128];
sprintf(options, "-DFILTER_WIDTH=%d", filter_width);
program.build(devices, options);

/* create a kernel object for a kernel with the given name */
cl::Kernel kernel = cl::Kernel(program, "Convolve_Def");

38 Performance

39 Performance

40 Performance - Unroll + if on remainder

41 Vectorization

__kernel void Convolve_Unroll(const __global float *pInput,
                              __constant float *pFilter,
                              __global float *pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;

42 Vectorization (2)

        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
        int c = 0;
        while (c <= nFilterWidth - 4)
        {
            float mul0, mul1, mul2, mul3;
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            mul0 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul1 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul2 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul3 = pFilter[idxF] * pInput[idxIn];

43 Vectorization (3)

            sum0 += mul0;
            sum1 += mul1;
            sum2 += mul2;
            sum3 += mul3;
            c += 4;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum0 += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum0 + sum1 + sum2 + sum3;
}

44 Vectorized Kernel

__kernel void Convolve_Float4(const __global float *pInput,
                              __constant float *pFilter,
                              __global float *pOutput,
                              const int nInWidth,
                              const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float4 sum4 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;

45 Vectorized Kernel (2)

        int c = 0;
        int c4 = 0;
        while (c <= nFilterWidth - 4)
        {
            float4 filter4 = vload4(c4, pFilter + idxFtmp);
            float4 in4 = vload4(c4, pInput + idxIntmp);
            sum4 += in4 * filter4;
            c += 4;
            c4++;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum4.x += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0...
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum4.x + sum4.y + sum4.z + sum4.w;
}

46 Performance

47 Performance - if Kernel

48 Performance - Kernel with Invariants

49 OpenMP Comparison

50 OpenMP Comparison

51 Parallel Min()
Programming Guide: AMD Accelerated Parallel Processing - OpenCL (November 2013)

52

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU
const char *kernel_source =
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable   \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable  \n"
"                                                                        \n"
"// 9. The source buffer is accessed as 4-vectors.                       \n"
"                                                                        \n"

53

"__kernel void minp( __global uint4 *src,                  \n"
"                    __global uint  *gmin,                 \n"
"                    __local  uint  *lmin,                 \n"
"                    __global uint  *dbg,                  \n"
"                    int  nitems,                          \n"
"                    uint dev )                            \n"
"{                                                         \n"
"    // 10. Set up global memory access pattern.           \n"
"                                                          \n"
"    uint count  = ( nitems / 4 ) / get_global_size(0);    \n"
"    uint idx    = (dev == 0) ? get_global_id(0) * count   \n"
"                             : get_global_id(0);          \n"
"    uint stride = (dev == 0) ? 1 : get_global_size(0);    \n"
"    uint pmin   = (uint) -1;                              \n"
"                                                          \n"

54

"    // 11. First, compute private min, for this work-item.  \n"
"                                                            \n"
"    for( int n = 0; n < count; n++, idx += stride )         \n"
"    {                                                       \n"
"        pmin = min( pmin, src[idx].x );                     \n"
"        pmin = min( pmin, src[idx].y );                     \n"
"        pmin = min( pmin, src[idx].z );                     \n"
"        pmin = min( pmin, src[idx].w );                     \n"
"    }                                                       \n"
"                                                            \n"
"    // 12. Reduce min values inside work-group.             \n"
"                                                            \n"
"    if( get_local_id(0) == 0 )                              \n"
"        lmin[0] = (uint) -1;                                \n"
"                                                            \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                         \n"
"                                                            \n"
"    (void) atom_min( lmin, pmin );                          \n"
"                                                            \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                         \n"
"                                                            \n"

55

"    // Write out to global.                   \n"
"                                              \n"
"    if( get_local_id(0) == 0 )                \n"
"        gmin[ get_group_id(0) ] = lmin[0];    \n"
"                                              \n"
"    // Dump some debug information.           \n"
"                                              \n"
"    if( get_global_id(0) == 0 )               \n"
"    {                                         \n"
"        dbg[0] = get_num_groups(0);           \n"
"        dbg[1] = get_global_size(0);          \n"
"        dbg[2] = count;                       \n"
"        dbg[3] = stride;                      \n"
"    }                                         \n"
"}                                             \n"
"                                              \n"

56

"// 13. Reduce work-group min values from __global to __global.  \n"
"                                                                \n"
"__kernel void reduce( __global uint4 *src,                      \n"
"                      __global uint  *gmin )                    \n"
"{                                                               \n"
"    (void) atom_min( gmin, gmin[get_global_id(0)] );            \n"
"}                                                               \n";

57

int main(int argc, char **argv)
{
    cl_platform_id platform;
    int dev, nw;
    cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,
                                   CL_DEVICE_TYPE_GPU };

    cl_uint *src_ptr;
    unsigned int num_src_items = 4096 * 4096;

    // 1. quick & dirty MWC random init of source buffer.
    // Random seed (portable).
    time_t ltime;
    time(&ltime);

    src_ptr = (cl_uint *)malloc(num_src_items * sizeof(cl_uint));

    cl_uint a = (cl_uint)ltime,
            b = (cl_uint)ltime;
    cl_uint min = (cl_uint)-1;

58

    // Do serial computation of min() for result verification.
    for (int i = 0; i < num_src_items; i++)
    {
        src_ptr[i] = (cl_uint)(b = (a * (b & …)) + (b >> 16));
        min = src_ptr[i] < min ? src_ptr[i] : min;
    }

    // Get a platform.
    clGetPlatformIDs(1, &platform, NULL);

    // 3. Iterate over devices.
    for (dev = 0; dev < NDEVS; dev++)
    {
        cl_device_id device;
        cl_context context;
        cl_command_queue queue;

59

        cl_program program;
        cl_kernel minp;
        cl_kernel reduce;
        cl_mem src_buf;
        cl_mem dst_buf;
        cl_mem dbg_buf;
        cl_uint *dst_ptr, *dbg_ptr;

        printf("\n%s: ", dev == 0 ? "CPU" : "GPU");

        // Find the device.
        clGetDeviceIDs(platform, devs[dev], 1, &device, NULL);

60

        // 4. Compute work sizes.
        cl_uint compute_units;
        size_t global_work_size;
        size_t local_work_size;
        size_t num_groups;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cl_uint), &compute_units, NULL);

        if (devs[dev] == CL_DEVICE_TYPE_CPU)
        {
            global_work_size = compute_units * 1;  // 1 thread per core
            local_work_size = 1;
        }

61

        // Wavefront = CUDA warp; currently has 64 work-items
        else
        {
            cl_uint ws = 64;
            global_work_size = compute_units * 7 * ws;  // 7 wavefronts per SIMD
            while ((num_src_items / 4) % global_work_size != 0)
                global_work_size += ws;
            local_work_size = ws;
        }

        num_groups = global_work_size / local_work_size;

        // Create a context and command queue on that device.
        context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        queue = clCreateCommandQueue(context, device, 0, NULL);

62

        // Minimal error check.
        if (queue == NULL)
        {
            printf("compute device setup failed\n");
            return (-1);
        }

        // Perform runtime source compilation, and obtain kernel entry point.
        program = clCreateProgramWithSource(context, 1, &kernel_source,
                                            NULL, NULL);

        // 5. Print compiler error messages (SKIPPED)

63

        minp   = clCreateKernel(program, "minp", NULL);
        reduce = clCreateKernel(program, "reduce", NULL);

        // Create input, output and debug buffers.
        src_buf = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 num_src_items * sizeof(cl_uint),
                                 src_ptr, NULL);

        dst_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 num_groups * sizeof(cl_uint),
                                 NULL, NULL);

        dbg_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                 global_work_size * sizeof(cl_uint),
                                 NULL, NULL);

64

        clSetKernelArg(minp, 0, sizeof(void *),        (void *)&src_buf);
        clSetKernelArg(minp, 1, sizeof(void *),        (void *)&dst_buf);
        clSetKernelArg(minp, 2, 1 * sizeof(cl_uint),   (void *)NULL);
        clSetKernelArg(minp, 3, sizeof(void *),        (void *)&dbg_buf);
        clSetKernelArg(minp, 4, sizeof(num_src_items), (void *)&num_src_items);
        clSetKernelArg(minp, 5, sizeof(dev),           (void *)&dev);

        clSetKernelArg(reduce, 0, sizeof(void *), (void *)&src_buf);
        clSetKernelArg(reduce, 1, sizeof(void *), (void *)&dst_buf);

        CPerfCounter t;
        t.Reset();
        t.Start();

65

        // 6. Main timing loop.
        #define NLOOPS 500

        cl_event ev;
        int nloops = NLOOPS;

        while (nloops--)
        {
            clEnqueueNDRangeKernel(queue, minp, 1, NULL,
                                   &global_work_size, &local_work_size,
                                   0, NULL, &ev);
            clEnqueueNDRangeKernel(queue, reduce, 1, NULL,
                                   &num_groups, NULL,
                                   1, &ev, NULL);
        }

66

        clFinish(queue);
        t.Stop();

        printf("B/W %.2f GB/sec, ",
               ((float)num_src_items * sizeof(cl_uint) * NLOOPS) /
               t.GetElapsedTime() / 1e9);

        // 7. Look at the results via synchronous buffer map.
        dst_ptr = (cl_uint *)clEnqueueMapBuffer(queue, dst_buf,
                                                CL_TRUE, CL_MAP_READ, 0,
                                                num_groups * sizeof(cl_uint),
                                                0, NULL, NULL, NULL);

        dbg_ptr = (cl_uint *)clEnqueueMapBuffer(queue, dbg_buf,
                                                CL_TRUE, CL_MAP_READ, 0,
                                                global_work_size * sizeof(cl_uint),
                                                0, NULL, NULL, NULL);

67

        // 8. Print some debug info.
        printf("%d groups, %d threads, count %d, stride %d\n",
               dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3]);

        if (dst_ptr[0] == min)
            printf("result correct\n");
        else
            printf("result INcorrect\n");

    } // iterate over devices

    printf("\n");
    return 0;
}

68 Binary Search
Design a GPU-friendly binary search algorithm.
Assume the input array is sorted and enormous.


GPU acceleration on IB clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake GPU acceleration on IB clusters Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake HPC Advisory Council European Workshop 2011 Why it matters? (Single node GPU acceleration) Control

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop

More information

Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite

Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite Introduction to OpenCL (utilisation des transparents de Cliff Woolley, NVIDIA) intégré depuis à l initiative Kite riveill@unice.fr http://www.i3s.unice.fr/~riveill What is OpenCL good for Anything that

More information

OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing

OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing OPENCL C++ Lee Howes AMD Senior Member of Technical Staff, Stream Computing Benedict Gaster AMD Principle Member of Technical Staff, AMD Research (now at Qualcomm) OPENCL TODAY WHAT WORKS, WHAT DOESN T

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example (and its Optimization) Alternative Frameworks Most Recent Innovations 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator

More information

OpenCL API. OpenCL Tutorial, PPAM Dominik Behr September 13 th, 2009

OpenCL API. OpenCL Tutorial, PPAM Dominik Behr September 13 th, 2009 OpenCL API OpenCL Tutorial, PPAM 2009 Dominik Behr September 13 th, 2009 Host and Compute Device The OpenCL specification describes the API and the language. The OpenCL API, is the programming API available

More information

SYCL: An Abstraction Layer for Leveraging C++ and OpenCL

SYCL: An Abstraction Layer for Leveraging C++ and OpenCL SYCL: An Abstraction Layer for Leveraging C++ and OpenCL Alastair Murray Compiler Research Engineer, Codeplay Visit us at www.codeplay.com 45 York Place Edinburgh EH1 3HP United Kingdom Overview SYCL for

More information

OpenCL Training Course

OpenCL Training Course OpenCL Training Course Intermediate Level Class http://www.ksc.re.kr http://webedu.ksc.re.kr INDEX 1. Class introduction 2. Multi-platform and multi-device 3. OpenCL APIs in detail 4. OpenCL C language

More information

Scientific Computing WS 2017/2018. Lecture 27. Jürgen Fuhrmann Lecture 27 Slide 1

Scientific Computing WS 2017/2018. Lecture 27. Jürgen Fuhrmann Lecture 27 Slide 1 Scientific Computing WS 2017/2018 Lecture 27 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 27 Slide 1 Lecture 27 Slide 2 Why parallelization? Computers became faster and faster without that...

More information

Programming paradigms for hybrid architecture. Piero Lanucara, SCAI

Programming paradigms for hybrid architecture. Piero Lanucara, SCAI Programming paradigms for hybrid architecture Piero Lanucara, SCAI p.lanucara@cineca.it From CUDA to OpenCL Let s start from a simple CUDA code (matrixmul from NVIDIA CUDA samples). Now, you perfectly

More information

OpenCL Overview Benedict R. Gaster, AMD

OpenCL Overview Benedict R. Gaster, AMD Copyright Khronos Group, 2011 - Page 1 OpenCL Overview Benedict R. Gaster, AMD March 2010 The BIG Idea behind OpenCL OpenCL execution model - Define N-dimensional computation domain - Execute a kernel

More information

NVIDIA OpenCL JumpStart Guide. Technical Brief

NVIDIA OpenCL JumpStart Guide. Technical Brief NVIDIA OpenCL JumpStart Guide Technical Brief Version 1.0 February 19, 2010 Introduction The purposes of this guide are to assist developers who are familiar with CUDA C/C++ development and want to port

More information

Advanced OpenCL Event Model Usage

Advanced OpenCL Event Model Usage Advanced OpenCL Event Model Usage Derek Gerstmann University of Western Australia http://local.wasp.uwa.edu.au/~derek OpenCL Event Model Usage Outline Execution Model Usage Patterns Synchronisation Event

More information

APARAPI Java platform s Write Once Run Anywhere now includes the GPU. Gary Frost AMD PMTS Java Runtime Team

APARAPI Java platform s Write Once Run Anywhere now includes the GPU. Gary Frost AMD PMTS Java Runtime Team APARAPI Java platform s Write Once Run Anywhere now includes the GPU Gary Frost AMD PMTS Java Runtime Team AGENDA The age of heterogeneous computing is here The supercomputer in your desktop/laptop Why

More information

Scientific Computing WS 2018/2019. Lecture 25. Jürgen Fuhrmann Lecture 25 Slide 1

Scientific Computing WS 2018/2019. Lecture 25. Jürgen Fuhrmann Lecture 25 Slide 1 Scientific Computing WS 2018/2019 Lecture 25 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 25 Slide 1 Lecture 25 Slide 2 SIMD Hardware: Graphics Processing Units ( GPU) [Source: computing.llnl.gov/tutorials]

More information

OpenCL Introduction. Acknowledgements. Frédéric Desprez. Simon Mc Intosh-Smith (Univ. of Bristol) Tom Deakin (Univ. of Bristol)

OpenCL Introduction. Acknowledgements. Frédéric Desprez. Simon Mc Intosh-Smith (Univ. of Bristol) Tom Deakin (Univ. of Bristol) OpenCL Introduction Frédéric Desprez INRIA Grenoble Rhône-Alpes/LIG Corse team Simulation par ordinateur des ondes gravitationnelles produites lors de la fusion de deux trous noirs. Werner Benger, CC BY-SA

More information

Sistemi Operativi e Reti

Sistemi Operativi e Reti Sistemi Operativi e Reti GPGPU Computing: the multi/many core computing era Dipartimento di Matematica e Informatica Corso di Laurea Magistrale in Informatica Osvaldo Gervasi ogervasi@computer.org 1 2

More information

A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs

A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral University of Alberta Sept 7 FSP 2017 1 University

More information

OpenCL Events. Mike Bailey. Oregon State University. OpenCL Events

OpenCL Events. Mike Bailey. Oregon State University. OpenCL Events 1 OpenCL Events Mike Bailey mjb@cs.oregonstate.edu opencl.events.pptx OpenCL Events 2 An event is an object that communicates the status of OpenCL commands Event Read Buffer dc Execute Kernel Write Buffer

More information

OpenCL Events. Mike Bailey. Computer Graphics opencl.events.pptx

OpenCL Events. Mike Bailey. Computer Graphics opencl.events.pptx 1 OpenCL Events This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License Mike Bailey mjb@cs.oregonstate.edu opencl.events.pptx OpenCL Events 2 An

More information

OpenCL / OpenGL Texture Interoperability: An Image Blurring Case Study

OpenCL / OpenGL Texture Interoperability: An Image Blurring Case Study 1 OpenCL / OpenGL Texture Interoperability: An Image Blurring Case Study Mike Bailey mjb@cs.oregonstate.edu opencl.opengl.rendertexture.pptx OpenCL / OpenGL Texture Interoperability: The Basic Idea 2 Application

More information

Copyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL

Copyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL OpenCL OpenCL What is OpenCL? Ø Cross-platform parallel computing API and C-like language for heterogeneous computing devices Ø Code is portable across various target devices: Ø Correctness is guaranteed

More information

The resurgence of parallel programming languages

The resurgence of parallel programming languages The resurgence of parallel programming languages Jamie Hanlon & Simon McIntosh-Smith University of Bristol Microelectronics Research Group hanlon@cs.bris.ac.uk 1 The Microelectronics Research Group at

More information

Making OpenCL Simple with Haskell. Benedict R. Gaster January, 2011

Making OpenCL Simple with Haskell. Benedict R. Gaster January, 2011 Making OpenCL Simple with Haskell Benedict R. Gaster January, 2011 Attribution and WARNING The ideas and work presented here are in collaboration with: Garrett Morris (AMD intern 2010 & PhD student Portland

More information

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running

More information

Many-core Processors Lecture 11. Instructor: Philippos Mordohai Webpage:

Many-core Processors Lecture 11. Instructor: Philippos Mordohai Webpage: 1 CS 677: Parallel Programming for Many-core Processors Lecture 11 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Outline More CUDA Libraries

More information

Scientific Computing WS 2017/2018. Lecture 28. Jürgen Fuhrmann Lecture 28 Slide 1

Scientific Computing WS 2017/2018. Lecture 28. Jürgen Fuhrmann Lecture 28 Slide 1 Scientific Computing WS 2017/2018 Lecture 28 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 28 Slide 1 SIMD Hardware: Graphics Processing Units ( GPU) [Source: computing.llnl.gov/tutorials] Principle

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures Feng Zhang, Jidong Zhai, Wenguang Chen Tsinghua University, Beijing, 100084, China Bingsheng He and Shuhao Zhang Nanyang Technological

More information

GPU COMPUTING RESEARCH WITH OPENCL

GPU COMPUTING RESEARCH WITH OPENCL GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and Devices Perhaad Mistry, Dana Schaa, Enqiang Sun, Rafael Ubal, Yash Ukidave, David Kaeli Dept of Electrical and Computer Engineering Northeastern

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

The Open Computing Language (OpenCL)

The Open Computing Language (OpenCL) 1 OpenCL The Open Computing Language (OpenCL) OpenCL consists of two parts: a C/C++-callable API and a C-ish programming language. Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu

More information

Lecture Topic: An Overview of OpenCL on Xeon Phi

Lecture Topic: An Overview of OpenCL on Xeon Phi C-DAC Four Days Technology Workshop ON Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels hypack-2013 (Mode-4 : GPUs) Lecture Topic: on Xeon Phi Venue

More information

The Open Computing Language (OpenCL)

The Open Computing Language (OpenCL) 1 The Open Computing Language (OpenCL) Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu opencl.pptx OpenCL 2 OpenCL consists of two parts: a C/C++-callable API and a

More information

The Open Computing Language (OpenCL)

The Open Computing Language (OpenCL) 1 The Open Computing Language (OpenCL) Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu opencl.pptx OpenCL 2 OpenCL consists of two parts: a C/C++-callable API and a

More information

Mali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)

Mali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914) ARM Mali -T600 Series GPU OpenCL Version 2.0 Developer Guide Copyright 2012-2013 ARM. All rights reserved. DUI0538F () ARM Mali-T600 Series GPU OpenCL Developer Guide Copyright 2012-2013 ARM. All rights

More information

Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications

Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications Ben Ashbaugh IWOCL 2018 https://github.com/intel/opencl-intercept-layer Why am I here? Intercept Layer for OpenCL Applications

More information

Accelerate with GPUs Harnessing GPGPUs with trending technologies

Accelerate with GPUs Harnessing GPGPUs with trending technologies Accelerate with GPUs Harnessing GPGPUs with trending technologies Anubhav Jain and Amit Kalele Parallelization and Optimization CoE Tata Consultancy Services Ltd. Copyright 2016 Tata Consultancy Services

More information

Introduction à OpenCL

Introduction à OpenCL 1 1 UDS/IRMA Journée GPU Strasbourg, février 2010 Sommaire 1 OpenCL 2 3 GPU architecture A modern Graphics Processing Unit (GPU) is made of: Global memory (typically 1 Gb) Compute units (typically 27)

More information

What does Fusion mean for HPC?

What does Fusion mean for HPC? What does Fusion mean for HPC? Casey Battaglino Aparna Chandramowlishwaran Jee Choi Kent Czechowski Cong Hou Chris McClanahan Dave S. Noble, Jr. Richard (Rich) Vuduc AMD Fusion Developers Summit Bellevue,

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU

More information

Linux Clusters Institute: Getting the Most from Your Linux Cluster

Linux Clusters Institute: Getting the Most from Your Linux Cluster Linux Clusters Institute: Getting the Most from Your Linux Cluster Advanced Topics: GPU Clusters Mike Showerman mshow@ncsa.illinois.edu Our background Cluster models for HPC in 1996 I believe first compute

More information

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different

More information

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

GPU Architecture and Programming with OpenCL

GPU Architecture and Programming with OpenCL GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 Today s s Topic GPU architecture What and why The good The bad Compute Models

More information

Design and implementation of a highperformance. platform on multigenerational GPUs.

Design and implementation of a highperformance. platform on multigenerational GPUs. Design and implementation of a highperformance stream-based computing platform on multigenerational GPUs. By Pablo Lamilla Álvarez September 27, 2010 Supervised by: Professor Shinichi Yamagiwa Kochi University

More information

GPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast

GPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast Today s s Topic GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 GPU architecture What and why The good The bad Compute Models

More information

OpenCL Device Fission Benedict R. Gaster, AMD

OpenCL Device Fission Benedict R. Gaster, AMD Copyright Khronos Group, 2011 - Page 1 Fission Benedict R. Gaster, AMD March 2011 Fission (cl_ext_device_fission) Provides an interface for sub-dividing an device into multiple sub-devices Typically used

More information

Using OpenMP to Program. Systems

Using OpenMP to Program. Systems Using OpenMP to Program Embedded Heterogeneous Systems Eric Stotzer, PhD Senior Member Technical Staff Software Development Organization, Compiler Team Texas Instruments February 16, 2012 Presented at

More information

A hands-on Introduction to OpenCL

A hands-on Introduction to OpenCL A hands-on Introduction to OpenCL Tim Mattson Acknowledgements: Alice Koniges of Berkeley Lab/NERSC and Simon McIntosh-Smith, James Price, and Tom Deakin of the University of Bristol OpenCL Learning progression

More information

Motion Estimation Extension for OpenCL

Motion Estimation Extension for OpenCL Motion Estimation Extension for OpenCL Authors: Nico Galoppo, Craig Hansen-Sturm Reviewers: Ben Ashbaugh, David Blythe, Hong Jiang, Stephen Junkins, Raun Krisch, Matt McClellan, Teresa Morrison, Dillon

More information

CS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4

CS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4 CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you

More information

SimpleOpenCL: desenvolupament i documentació d'una llibreria que facilita la programació paral lela en OpenCL

SimpleOpenCL: desenvolupament i documentació d'una llibreria que facilita la programació paral lela en OpenCL Treball de fi de Carrera ENGINYERIA TÈCNICA EN INFORMÀTICA DE SISTEMES Facultat de Matemàtiques Universitat de Barcelona SimpleOpenCL: desenvolupament i documentació d'una llibreria que facilita la programació

More information

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Is used to write kernels when working with OpenCL Used to code the part that runs on the device Based on C99 with some extensions

More information

OPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015

OPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015 OPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015 Introducing AMD FirePro W9100 HW COMPARISON W9100(HAWAII) VS W9000(TAHITI) FirePro W9100 FirePro W9000 Improvement Notes Compute Units 44 32

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

Introduction to OpenCL!

Introduction to OpenCL! Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

From CUDA to OpenCL. Piero Lanucara, SCAI

From CUDA to OpenCL. Piero Lanucara, SCAI From CUDA to OpenCL Piero Lanucara, SCAI p.lanucara@cineca.it Let s start from a simple CUDA code (matrixmul from NVIDIA CUDA samples). Now, you perfectly know how to compile and run on NVIDIA hardware

More information

Modern C++ Parallelism from CPU to GPU

Modern C++ Parallelism from CPU to GPU Modern C++ Parallelism from CPU to GPU Simon Brand @TartanLlama Senior Software Engineer, GPGPU Toolchains, Codeplay C++ Russia 2018 2018-04-21 Agenda About me and Codeplay C++17 CPU Parallelism Third-party

More information

Parallel Programming Recipes

Parallel Programming Recipes San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2010 Parallel Programming Recipes Thuy C. Nguyenphuc San Jose State University Follow this and additional

More information

INTRODUCTION TO OPENCL. HAIBO XIE, PH.D.

INTRODUCTION TO OPENCL. HAIBO XIE, PH.D. INTRODUCTION TO OPENCL HAIBO XIE, PH.D. haibo.xie@amd.com AGENDA What s OpenCL Fundamentals for OpenCL programming OpenCL programming basics OpenCL programming tools Examples & demos 2 WHAT IS OPENCL Open

More information