CS 677: Parallel Programming for Many-core Processors Lecture 12
1 CS 677: Parallel Programming for Many-core Processors Lecture 12
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
2 CS Department Project Poster Day
May 5, 12-2pm (99% confirmed)
Lieb third floor conference room and corridors
5% of total grade as bonus
Suggestion: 9-12 printed pages
Demos would be cool
3 CS Department Project Poster Day
Your name, course number, project title
Project objective: what are you trying to accomplish? What makes this computation worthy of a project?
Method: general description of the method (not of the implementation)
Suitability for GPU acceleration
Design choices for GPU implementation: workload allocation, use of resources, bottlenecks (avoided and not avoided)
Experimental results, including timings
4 Final Project Presentations
April 29; send me PPT/PDF file by 5pm
12 min presentation + 2 min Q&A
Counts for 15% of total grade
5 Final Project Presentations
Target audience: fellow classmates
Content:
Problem description: what is the computation and why is it important?
Suitability for GPU acceleration
Amdahl's Law: describe the inherent parallelism; argue that it is close to 100% of the computation.
Compare with CPU version
6 Final Project Presentations
Content (cont.):
GPU implementation
Which steps of the algorithm were ported to the GPU?
Workload allocation to threads
Use of resources (registers, shared memory, constant memory, etc.)
Occupancy achieved
Results
Experiments performed
Timings and comparisons against the CPU version
7 Final Report
Due May 7 (11:59pm)
6-10 pages including figures, tables and references
Content: see presentation instructions; do not repeat course material
Counts for 20% of total grade
NO LATE SUBMISSIONS
8 Outline
OpenCL Convolution Example
Parallel Min() Example
9 Image Convolution Using OpenCL
Udeepta Bordoloi, ATI Stream Application Engineer, 10/13/2009
Note: ATI Stream Technology is now called AMD Accelerated Parallel Processing (APP) Technology.
10 Step 1: The Algorithm
Ignore boundaries
Output size: (input_image_width - filter_width + 1) by (input_image_height - filter_width + 1)
11 C Version
void Convolve(float * pInput, float * pFilter, float * pOutput,
              const int nInWidth, const int nWidth, const int nHeight,
              const int nFilterWidth, const int nNumThreads)
{
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;
        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;
            float sum = 0;
12 C Version (2)
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;
                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF = idxFtmp + c;
                    const int idxIn = idxIntmp + c;
                    sum += pFilter[idxF] * pInput[idxIn];
                }
            } //for (int r = 0
13 C Version (3)
            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        } //for (int xOut = 0
    } //for (int yOut = 0
}
14 Parameters
struct paramStruct {
    int nWidth;        // Output image width
    int nHeight;       // Output image height
    int nInWidth;      // Input image width
    int nInHeight;     // Input image height
    int nFilterWidth;  // Filter size is nFilterWidth x nFilterWidth
    int nIterations;   // Run timing loop for nIterations
    // Test CPU performance with 1, 4, 8 etc. OpenMP threads
    std::vector<int> ompThreads;
    int nOmpRuns;      // ompThreads.size()
    bool bCPUTiming;   // Time CPU performance
} params;
15 OpenMP for Comparison
// This #pragma splits the work between multiple threads
#pragma omp parallel for num_threads(nNumThreads)
for (int yOut = 0; yOut < nHeight; yOut++)
...
void InitParams(int argc, char* argv[])
{
    // time the OpenMP convolution performance with
    // different numbers of threads
    params.ompThreads.push_back(4);
    params.ompThreads.push_back(1);
    params.ompThreads.push_back(8);
    params.nOmpRuns = params.ompThreads.size();
}
16 First Kernel
kernel void Convolve(const global float * pInput,
                     constant float * pFilter,
                     global float * pOutput,
                     const int nInWidth,
                     const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
17 First Kernel (2)
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
        for (int c = 0; c < nFilterWidth; c++)
        {
            const int idxF = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
18 Initialize OpenCL
cl_context context = clCreateContextFromType(..., CL_DEVICE_TYPE_CPU, ...);
// get list of devices - quad core counts as one device
size_t listSize;
/* First, get the size of device list */
clGetContextInfo(context, CL_CONTEXT_DEVICES, ..., &listSize);
/* Now, allocate the device list */
cl_device_id *devices = (cl_device_id *)malloc(listSize);
/* Next, get the device list data */
clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, ...);
19 Initialize OpenCL (2)
cl_command_queue queue = clCreateCommandQueue(context, devices[0], ...);
cl_program program = clCreateProgramWithSource(context, 1, &source, ...);
clBuildProgram(program, 1, devices, ...);
cl_kernel kernel = clCreateKernel(program, "Convolve", ...);
// get error messages
clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, ...);
20 Initialize Buffers
cl_mem inputCL = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                host_buffer_size, host_buffer_ptr, ...);
// If the device is a GPU (CL_DEVICE_TYPE_GPU), we can
// explicitly copy data to the input image buffer on the device:
clEnqueueWriteBuffer(queue, inputCL, ..., host_buffer_ptr, ...);
// And copy back from the output image buffer after the
// convolution kernel execution.
clEnqueueReadBuffer(queue, outputCL, ..., host_buffer_ptr, ...);
21 Execute Kernel
/* input buffer, arg 0 */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
/* filter buffer, arg 1 */
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
/* output buffer, arg 2 */
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&outputCL);
/* input image width, arg 3 */
clSetKernelArg(kernel, 3, sizeof(int), (void *)&nInWidth);
/* filter width, arg 4 */
clSetKernelArg(kernel, 4, sizeof(int), (void *)&nFilterWidth);
22 Execute Kernel (2)
clEnqueueNDRangeKernel(queue, kernel, data_dimensionality, ...,
                       total_work_size, work_group_size, ...);
// release all buffers
clReleaseMemObject(inputCL);
...
// release all resources
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
23 Timing
clFinish(queue);
// Timer started here
for (int i = 0; i < nIterations; i++)
    clEnqueueNDRangeKernel( ... );
clFinish(queue);
// Timer stopped here
// Average Time = ElapsedTime() / nIterations;
The clFinish() call before both starting and stopping the timer ensures that we time the kernel execution activity to its completion and nothing else.
Timings were taken on a 4-core AMD Phenom, treated as a single device by OpenCL.
24 C++ Bindings
// C API:
cl_context context = clCreateContextFromType(..., CL_DEVICE_TYPE_CPU, ...);
// C++ API:
cl::Context context = cl::Context(CL_DEVICE_TYPE_CPU);
// get list of devices - quad core counts as one device
// C API:
size_t listSize;
/* First, get the size of device list */
clGetContextInfo(context, CL_CONTEXT_DEVICES, ..., &listSize);
/* Now, allocate the device list */
cl_device_id *devices = (cl_device_id *)malloc(listSize);
/* Next, get the device list data */
clGetContextInfo(context, CL_CONTEXT_DEVICES, listSize, devices, ...);
// C++ API:
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
25 C++ Bindings (2)
cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
cl::Program program = cl::Program(context, ...);
program.build(devices);
cl::Kernel kernel = cl::Kernel(program, "Convolve");
std::string str = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
// Buffer init is similar to the C version, using methods of queue
26 Execute Kernel (C++)
/* input buffer, arg 0 */
// C: clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputCL);
kernel.setArg(0, inputCL);
/* filter buffer, arg 1 */
// C: clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filterCL);
kernel.setArg(1, filterCL);
// etc.
queue.enqueueNDRangeKernel(kernel, ..., total_work_size, work_group_size, ...);
27 Loop Unrolling
kernel void Convolve_Unroll(const global float * pInput,
                            constant float * pFilter,
                            global float * pOutput,
                            const int nInWidth,
                            const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
28 Loop Unrolling (2)
        int c = 0;
        while (c <= nFilterWidth - 4)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            sum += pFilter[idxF] * pInput[idxIn];
            c += 4;
        }
29 Loop Unrolling (3)
        for (int c1 = c; c1 < nFilterWidth; c1++)  // what does this do?
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
}
30 Performance
31 Unrolled Kernel 2 (if Kernel)
        // last loop
        int cMod = nFilterWidth - c;
        if (cMod == 1)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
        else if (cMod == 2)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            sum += pFilter[idxF+1] * pInput[idxIn+1];
        }
32 Unrolled Kernel 2 (2)
        else if (cMod == 3)
        {
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
            sum += pFilter[idxF+1] * pInput[idxIn+1];
            sum += pFilter[idxF+2] * pInput[idxIn+2];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
33 Performance
Yet another way to achieve similar results is to write four different versions of the Convolve_Unroll kernel, corresponding to (nFilterWidth % 4) equaling 0, 1, 2, or 3. The particular version called can be decided at run time depending on the value of nFilterWidth.
34 Kernel with Invariants
Loop unrolling did not help when the filter width is low.
So far, kernels have been written in a generic way so that they work for all filter sizes.
What if we can focus on a particular filter size, e.g., 5?
We can then unroll the inner loop five times and get rid of the loop condition.
If we use the invariant in the loop condition, a good compiler will unroll the loop itself.
35 Kernel with Invariants
kernel void Convolve_Def(const global float * pInput,
                         constant float * pFilter,
                         global float * pOutput,
                         const int nInWidth,
                         const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum = 0;
    for (int r = 0; r < FILTER_WIDTH; r++)
    {
        const int idxFtmp = r * FILTER_WIDTH;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
36 Kernel with Invariants (2)
        for (int c = 0; c < FILTER_WIDTH; c++)
        {
            const int idxF = idxFtmp + c;
            const int idxIn = idxIntmp + c;
            sum += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum;
37 Setting Filter Width
// this can be done online and offline
/* create a cl source string */
std::string sourceStr = Convert-File-To-String(File-Name);
cl::Program::Sources sources(1,
    std::make_pair(sourceStr.c_str(), sourceStr.length()));
/* create a cl program object */
program = cl::Program(context, sources);
/* build a cl program executable with some #defines */
char options[128];
sprintf(options, "-DFILTER_WIDTH=%d", filter_width);
program.build(devices, options);
/* create a kernel object for a kernel with the given name */
cl::Kernel kernel = cl::Kernel(program, "Convolve_Def");
38 Performance
39 Performance
40 Performance: Unroll + if on remainder
41 Vectorization
kernel void Convolve_Unroll(const global float * pInput,
                            constant float * pFilter,
                            global float * pOutput,
                            const int nInWidth,
                            const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
42 Vectorization (2)
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
        int c = 0;
        while (c <= nFilterWidth - 4)
        {
            float mul0, mul1, mul2, mul3;
            int idxF = idxFtmp + c;
            int idxIn = idxIntmp + c;
            mul0 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul1 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul2 = pFilter[idxF] * pInput[idxIn];
            idxF++; idxIn++;
            mul3 = pFilter[idxF] * pInput[idxIn];
43 Vectorization (3)
            sum0 += mul0;
            sum1 += mul1;
            sum2 += mul2;
            sum3 += mul3;
            c += 4;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum0 += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum0 + sum1 + sum2 + sum3;
44 Vectorized Kernel
kernel void Convolve_Float4(const global float * pInput,
                            constant float * pFilter,
                            global float * pOutput,
                            const int nInWidth,
                            const int nFilterWidth)
{
    const int nWidth = get_global_size(0);
    const int xOut = get_global_id(0);
    const int yOut = get_global_id(1);
    const int xInTopLeft = xOut;
    const int yInTopLeft = yOut;
    float4 sum4 = 0;
    for (int r = 0; r < nFilterWidth; r++)
    {
        const int idxFtmp = r * nFilterWidth;
        const int yIn = yInTopLeft + r;
        const int idxIntmp = yIn * nInWidth + xInTopLeft;
45 Vectorized Kernel (2)
        int c = 0;
        int c4 = 0;
        while (c <= nFilterWidth - 4)
        {
            float4 filter4 = vload4(c4, pFilter + idxFtmp);
            float4 in4 = vload4(c4, pInput + idxIntmp);
            sum4 += in4 * filter4;
            c += 4;
            c4++;
        }
        for (int c1 = c; c1 < nFilterWidth; c1++)
        {
            const int idxF = idxFtmp + c1;
            const int idxIn = idxIntmp + c1;
            sum4.x += pFilter[idxF] * pInput[idxIn];
        }
    } //for (int r = 0
    const int idxOut = yOut * nWidth + xOut;
    pOutput[idxOut] = sum4.x + sum4.y + sum4.z + sum4.w;
}
46 Performance
47 Performance: if Kernel
48 Performance: Kernel with Invariants
49 OpenMP Comparison
50 OpenMP Comparison
51 Parallel Min()
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide (November 2013)
52
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU
const char *kernel_source =
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable  \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable \n"
"                                                \n"
"// 9. The source buffer is accessed as 4-vectors. \n"
"                                                \n"
53
" kernel void minp( global uint4 *src,           \n"
"                   global uint *gmin,           \n"
"                   local  uint *lmin,           \n"
"                   global uint *dbg,            \n"
"                   int nitems,                  \n"
"                   uint dev )                   \n"
"{                                               \n"
"  // 10. Set up global memory access pattern.   \n"
"                                                \n"
"  uint count  = ( nitems / 4 ) / get_global_size(0);  \n"
"  uint idx    = (dev == 0) ? get_global_id(0) * count \n"
"                           : get_global_id(0);  \n"
"  uint stride = (dev == 0) ? 1 : get_global_size(0);  \n"
"  uint pmin   = (uint) -1;                      \n"
54
"  // 11. First, compute private min, for this work-item. \n"
"                                                \n"
"  for( int n=0; n < count; n++, idx += stride ) \n"
"  {                                             \n"
"    pmin = min( pmin, src[idx].x );             \n"
"    pmin = min( pmin, src[idx].y );             \n"
"    pmin = min( pmin, src[idx].z );             \n"
"    pmin = min( pmin, src[idx].w );             \n"
"  }                                             \n"
"                                                \n"
"  // 12. Reduce min values inside work-group.   \n"
"                                                \n"
"  if( get_local_id(0) == 0 )                    \n"
"    lmin[0] = (uint) -1;                        \n"
"                                                \n"
"  barrier( CLK_LOCAL_MEM_FENCE );               \n"
"                                                \n"
"  (void) atom_min( lmin, pmin );                \n"
"                                                \n"
"  barrier( CLK_LOCAL_MEM_FENCE );               \n"
55
"  // Write out to global.                       \n"
"                                                \n"
"  if( get_local_id(0) == 0 )                    \n"
"    gmin[ get_group_id(0) ] = lmin[0];          \n"
"                                                \n"
"  // Dump some debug information.               \n"
"                                                \n"
"  if( get_global_id(0) == 0 )                   \n"
"  {                                             \n"
"    dbg[0] = get_num_groups(0);                 \n"
"    dbg[1] = get_global_size(0);                \n"
"    dbg[2] = count;                             \n"
"    dbg[3] = stride;                            \n"
"  }                                             \n"
"}                                               \n"
56
"// 13. Reduce work-group min values from global to global. \n"
"                                                \n"
" kernel void reduce( global uint4 *src,         \n"
"                     global uint *gmin )        \n"
"{                                               \n"
"  (void) atom_min( gmin, gmin[get_global_id(0)] ); \n"
"}                                               \n";
57
int main(int argc, char **argv)
{
    cl_platform_id platform;
    int dev, nw;
    cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,
                                   CL_DEVICE_TYPE_GPU };
    cl_uint *src_ptr;
    unsigned int num_src_items = 4096*4096;

    // 1. quick & dirty MWC random init of source buffer.
    // Random seed (portable).
    time_t ltime;
    time(&ltime);

    src_ptr = (cl_uint *) malloc( num_src_items * sizeof(cl_uint) );
    cl_uint a = (cl_uint) ltime,
            b = (cl_uint) ltime;
    cl_uint min = (cl_uint) -1;
58
    // Do serial computation of min() for result verification.
    for( int i=0; i < num_src_items; i++ )
    {
        src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
        min = src_ptr[i] < min ? src_ptr[i] : min;
    }

    // Get a platform.
    clGetPlatformIDs( 1, &platform, NULL );

    // 3. Iterate over devices.
    for(dev=0; dev < NDEVS; dev++)
    {
        cl_device_id device;
        cl_context context;
        cl_command_queue queue;
59
        cl_program program;
        cl_kernel minp;
        cl_kernel reduce;
        cl_mem src_buf;
        cl_mem dst_buf;
        cl_mem dbg_buf;
        cl_uint *dst_ptr, *dbg_ptr;

        printf("\n%s: ", dev == 0 ? "CPU" : "GPU");

        // Find the device.
        clGetDeviceIDs( platform, devs[dev], 1, &device, NULL);
60
        // 4. Compute work sizes.
        cl_uint compute_units;
        size_t global_work_size;
        size_t local_work_size;
        size_t num_groups;

        clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS,
                         sizeof(cl_uint), &compute_units, NULL);

        if( devs[dev] == CL_DEVICE_TYPE_CPU )
        {
            global_work_size = compute_units * 1; // 1 thread per core
            local_work_size = 1;
        }
61
Note: a wavefront (AMD's equivalent of a CUDA warp) currently has 64 work-items.
        else
        {
            cl_uint ws = 64;
            global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD
            while( (num_src_items / 4) % global_work_size != 0 )
                global_work_size += ws;
            local_work_size = ws;
        }
        num_groups = global_work_size / local_work_size;

        // Create a context and command queue on that device.
        context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL);
        queue = clCreateCommandQueue(context, device, 0, NULL);
62
        // Minimal error check.
        if( queue == NULL )
        {
            printf("Compute device setup failed\n");
            return(-1);
        }

        // Perform runtime source compilation, and obtain kernel entry point.
        program = clCreateProgramWithSource( context, 1,
                                             &kernel_source, NULL, NULL );

        // 5. Print compiler error messages - SKIPPED
63
        minp   = clCreateKernel( program, "minp",   NULL );
        reduce = clCreateKernel( program, "reduce", NULL );

        // Create input, output and debug buffers.
        src_buf = clCreateBuffer( context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  num_src_items * sizeof(cl_uint),
                                  src_ptr, NULL );
        dst_buf = clCreateBuffer( context, CL_MEM_READ_WRITE,
                                  num_groups * sizeof(cl_uint),
                                  NULL, NULL );
        dbg_buf = clCreateBuffer( context, CL_MEM_WRITE_ONLY,
                                  global_work_size * sizeof(cl_uint),
                                  NULL, NULL );
64
        clSetKernelArg(minp, 0, sizeof(void *),        (void*) &src_buf);
        clSetKernelArg(minp, 1, sizeof(void *),        (void*) &dst_buf);
        clSetKernelArg(minp, 2, 1*sizeof(cl_uint),     (void*) NULL);
        clSetKernelArg(minp, 3, sizeof(void *),        (void*) &dbg_buf);
        clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
        clSetKernelArg(minp, 5, sizeof(dev),           (void*) &dev);

        clSetKernelArg(reduce, 0, sizeof(void *), (void*) &src_buf);
        clSetKernelArg(reduce, 1, sizeof(void *), (void*) &dst_buf);

        CPerfCounter t;
        t.Reset();
        t.Start();
65
        // 6. Main timing loop.
        #define NLOOPS 500
        cl_event ev;
        int nloops = NLOOPS;
        while(nloops--)
        {
            clEnqueueNDRangeKernel( queue, minp, 1, NULL,
                                    &global_work_size, &local_work_size,
                                    0, NULL, &ev);
            clEnqueueNDRangeKernel( queue, reduce, 1, NULL,
                                    &num_groups, NULL, 1, &ev, NULL);
        }
66
        clFinish( queue );
        t.Stop();
        printf("B/W %.2f GB/sec, ",
               ((float) num_src_items * sizeof(cl_uint) * NLOOPS) /
               t.GetElapsedTime() / 1e9 );

        // 7. Look at the results via synchronous buffer map.
        dst_ptr = (cl_uint *) clEnqueueMapBuffer( queue, dst_buf,
                                  CL_TRUE, CL_MAP_READ, 0,
                                  num_groups * sizeof(cl_uint),
                                  0, NULL, NULL, NULL );
        dbg_ptr = (cl_uint *) clEnqueueMapBuffer( queue, dbg_buf,
                                  CL_TRUE, CL_MAP_READ, 0,
                                  global_work_size * sizeof(cl_uint),
                                  0, NULL, NULL, NULL );
67
        // 8. Print some debug info.
        printf("%d groups, %d threads, count %d, stride %d\n",
               dbg_ptr[0], dbg_ptr[1], dbg_ptr[2], dbg_ptr[3]);

        if( dst_ptr[0] == min )
            printf("result correct\n");
        else
            printf("result INcorrect\n");
    } // iterate over devices

    printf("\n");
    return 0;
}
68 Binary Search
Design a GPU-friendly binary search algorithm.
Assume the input array is sorted and enormous.
Programming paradigms for hybrid architecture Piero Lanucara, SCAI p.lanucara@cineca.it From CUDA to OpenCL Let s start from a simple CUDA code (matrixmul from NVIDIA CUDA samples). Now, you perfectly
More informationOpenCL Overview Benedict R. Gaster, AMD
Copyright Khronos Group, 2011 - Page 1 OpenCL Overview Benedict R. Gaster, AMD March 2010 The BIG Idea behind OpenCL OpenCL execution model - Define N-dimensional computation domain - Execute a kernel
More informationNVIDIA OpenCL JumpStart Guide. Technical Brief
NVIDIA OpenCL JumpStart Guide Technical Brief Version 1.0 February 19, 2010 Introduction The purposes of this guide are to assist developers who are familiar with CUDA C/C++ development and want to port
More informationAdvanced OpenCL Event Model Usage
Advanced OpenCL Event Model Usage Derek Gerstmann University of Western Australia http://local.wasp.uwa.edu.au/~derek OpenCL Event Model Usage Outline Execution Model Usage Patterns Synchronisation Event
More informationAPARAPI Java platform s Write Once Run Anywhere now includes the GPU. Gary Frost AMD PMTS Java Runtime Team
APARAPI Java platform s Write Once Run Anywhere now includes the GPU Gary Frost AMD PMTS Java Runtime Team AGENDA The age of heterogeneous computing is here The supercomputer in your desktop/laptop Why
More informationScientific Computing WS 2018/2019. Lecture 25. Jürgen Fuhrmann Lecture 25 Slide 1
Scientific Computing WS 2018/2019 Lecture 25 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 25 Slide 1 Lecture 25 Slide 2 SIMD Hardware: Graphics Processing Units ( GPU) [Source: computing.llnl.gov/tutorials]
More informationOpenCL Introduction. Acknowledgements. Frédéric Desprez. Simon Mc Intosh-Smith (Univ. of Bristol) Tom Deakin (Univ. of Bristol)
OpenCL Introduction Frédéric Desprez INRIA Grenoble Rhône-Alpes/LIG Corse team Simulation par ordinateur des ondes gravitationnelles produites lors de la fusion de deux trous noirs. Werner Benger, CC BY-SA
More informationSistemi Operativi e Reti
Sistemi Operativi e Reti GPGPU Computing: the multi/many core computing era Dipartimento di Matematica e Informatica Corso di Laurea Magistrale in Informatica Osvaldo Gervasi ogervasi@computer.org 1 2
More informationA Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs
A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral University of Alberta Sept 7 FSP 2017 1 University
More informationOpenCL Events. Mike Bailey. Oregon State University. OpenCL Events
1 OpenCL Events Mike Bailey mjb@cs.oregonstate.edu opencl.events.pptx OpenCL Events 2 An event is an object that communicates the status of OpenCL commands Event Read Buffer dc Execute Kernel Write Buffer
More informationOpenCL Events. Mike Bailey. Computer Graphics opencl.events.pptx
1 OpenCL Events This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License Mike Bailey mjb@cs.oregonstate.edu opencl.events.pptx OpenCL Events 2 An
More informationOpenCL / OpenGL Texture Interoperability: An Image Blurring Case Study
1 OpenCL / OpenGL Texture Interoperability: An Image Blurring Case Study Mike Bailey mjb@cs.oregonstate.edu opencl.opengl.rendertexture.pptx OpenCL / OpenGL Texture Interoperability: The Basic Idea 2 Application
More informationCopyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL
OpenCL OpenCL What is OpenCL? Ø Cross-platform parallel computing API and C-like language for heterogeneous computing devices Ø Code is portable across various target devices: Ø Correctness is guaranteed
More informationThe resurgence of parallel programming languages
The resurgence of parallel programming languages Jamie Hanlon & Simon McIntosh-Smith University of Bristol Microelectronics Research Group hanlon@cs.bris.ac.uk 1 The Microelectronics Research Group at
More informationMaking OpenCL Simple with Haskell. Benedict R. Gaster January, 2011
Making OpenCL Simple with Haskell Benedict R. Gaster January, 2011 Attribution and WARNING The ideas and work presented here are in collaboration with: Garrett Morris (AMD intern 2010 & PhD student Portland
More informationOPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015
OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running
More informationMany-core Processors Lecture 11. Instructor: Philippos Mordohai Webpage:
1 CS 677: Parallel Programming for Many-core Processors Lecture 11 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Outline More CUDA Libraries
More informationScientific Computing WS 2017/2018. Lecture 28. Jürgen Fuhrmann Lecture 28 Slide 1
Scientific Computing WS 2017/2018 Lecture 28 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 28 Slide 1 SIMD Hardware: Graphics Processing Units ( GPU) [Source: computing.llnl.gov/tutorials] Principle
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationTo Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures
To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures Feng Zhang, Jidong Zhai, Wenguang Chen Tsinghua University, Beijing, 100084, China Bingsheng He and Shuhao Zhang Nanyang Technological
More informationGPU COMPUTING RESEARCH WITH OPENCL
GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and Devices Perhaad Mistry, Dana Schaa, Enqiang Sun, Rafael Ubal, Yash Ukidave, David Kaeli Dept of Electrical and Computer Engineering Northeastern
More informationOpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania
OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming
More informationThe Open Computing Language (OpenCL)
1 OpenCL The Open Computing Language (OpenCL) OpenCL consists of two parts: a C/C++-callable API and a C-ish programming language. Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu
More informationLecture Topic: An Overview of OpenCL on Xeon Phi
C-DAC Four Days Technology Workshop ON Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels hypack-2013 (Mode-4 : GPUs) Lecture Topic: on Xeon Phi Venue
More informationThe Open Computing Language (OpenCL)
1 The Open Computing Language (OpenCL) Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu opencl.pptx OpenCL 2 OpenCL consists of two parts: a C/C++-callable API and a
More informationThe Open Computing Language (OpenCL)
1 The Open Computing Language (OpenCL) Also go look at the files first.cpp and first.cl! Mike Bailey mjb@cs.oregonstate.edu opencl.pptx OpenCL 2 OpenCL consists of two parts: a C/C++-callable API and a
More informationMali -T600 Series GPU OpenCL ARM. Developer Guide. Version 2.0. Copyright ARM. All rights reserved. DUI0538F (ID012914)
ARM Mali -T600 Series GPU OpenCL Version 2.0 Developer Guide Copyright 2012-2013 ARM. All rights reserved. DUI0538F () ARM Mali-T600 Series GPU OpenCL Developer Guide Copyright 2012-2013 ARM. All rights
More informationDebugging and Analyzing Programs using the Intercept Layer for OpenCL Applications
Debugging and Analyzing Programs using the Intercept Layer for OpenCL Applications Ben Ashbaugh IWOCL 2018 https://github.com/intel/opencl-intercept-layer Why am I here? Intercept Layer for OpenCL Applications
More informationAccelerate with GPUs Harnessing GPGPUs with trending technologies
Accelerate with GPUs Harnessing GPGPUs with trending technologies Anubhav Jain and Amit Kalele Parallelization and Optimization CoE Tata Consultancy Services Ltd. Copyright 2016 Tata Consultancy Services
More informationIntroduction à OpenCL
1 1 UDS/IRMA Journée GPU Strasbourg, février 2010 Sommaire 1 OpenCL 2 3 GPU architecture A modern Graphics Processing Unit (GPU) is made of: Global memory (typically 1 Gb) Compute units (typically 27)
More informationWhat does Fusion mean for HPC?
What does Fusion mean for HPC? Casey Battaglino Aparna Chandramowlishwaran Jee Choi Kent Czechowski Cong Hou Chris McClanahan Dave S. Noble, Jr. Richard (Rich) Vuduc AMD Fusion Developers Summit Bellevue,
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU
More informationLinux Clusters Institute: Getting the Most from Your Linux Cluster
Linux Clusters Institute: Getting the Most from Your Linux Cluster Advanced Topics: GPU Clusters Mike Showerman mshow@ncsa.illinois.edu Our background Cluster models for HPC in 1996 I believe first compute
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationPragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationGPU Architecture and Programming with OpenCL
GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 Today s s Topic GPU architecture What and why The good The bad Compute Models
More informationDesign and implementation of a highperformance. platform on multigenerational GPUs.
Design and implementation of a highperformance stream-based computing platform on multigenerational GPUs. By Pablo Lamilla Álvarez September 27, 2010 Supervised by: Professor Shinichi Yamagiwa Kochi University
More informationGPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast
Today s s Topic GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 GPU architecture What and why The good The bad Compute Models
More informationOpenCL Device Fission Benedict R. Gaster, AMD
Copyright Khronos Group, 2011 - Page 1 Fission Benedict R. Gaster, AMD March 2011 Fission (cl_ext_device_fission) Provides an interface for sub-dividing an device into multiple sub-devices Typically used
More informationUsing OpenMP to Program. Systems
Using OpenMP to Program Embedded Heterogeneous Systems Eric Stotzer, PhD Senior Member Technical Staff Software Development Organization, Compiler Team Texas Instruments February 16, 2012 Presented at
More informationA hands-on Introduction to OpenCL
A hands-on Introduction to OpenCL Tim Mattson Acknowledgements: Alice Koniges of Berkeley Lab/NERSC and Simon McIntosh-Smith, James Price, and Tom Deakin of the University of Bristol OpenCL Learning progression
More informationMotion Estimation Extension for OpenCL
Motion Estimation Extension for OpenCL Authors: Nico Galoppo, Craig Hansen-Sturm Reviewers: Ben Ashbaugh, David Blythe, Hong Jiang, Stephen Junkins, Raun Krisch, Matt McClellan, Teresa Morrison, Dillon
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationSimpleOpenCL: desenvolupament i documentació d'una llibreria que facilita la programació paral lela en OpenCL
Treball de fi de Carrera ENGINYERIA TÈCNICA EN INFORMÀTICA DE SISTEMES Facultat de Matemàtiques Universitat de Barcelona SimpleOpenCL: desenvolupament i documentació d'una llibreria que facilita la programació
More informationOpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR
OpenCL C Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Is used to write kernels when working with OpenCL Used to code the part that runs on the device Based on C99 with some extensions
More informationOPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015
OPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015 Introducing AMD FirePro W9100 HW COMPARISON W9100(HAWAII) VS W9000(TAHITI) FirePro W9100 FirePro W9000 Improvement Notes Compute Units 44 32
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationIntroduction to OpenCL!
Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationFrom CUDA to OpenCL. Piero Lanucara, SCAI
From CUDA to OpenCL Piero Lanucara, SCAI p.lanucara@cineca.it Let s start from a simple CUDA code (matrixmul from NVIDIA CUDA samples). Now, you perfectly know how to compile and run on NVIDIA hardware
More informationModern C++ Parallelism from CPU to GPU
Modern C++ Parallelism from CPU to GPU Simon Brand @TartanLlama Senior Software Engineer, GPGPU Toolchains, Codeplay C++ Russia 2018 2018-04-21 Agenda About me and Codeplay C++17 CPU Parallelism Third-party
More informationParallel Programming Recipes
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2010 Parallel Programming Recipes Thuy C. Nguyenphuc San Jose State University Follow this and additional
More informationINTRODUCTION TO OPENCL. HAIBO XIE, PH.D.
INTRODUCTION TO OPENCL HAIBO XIE, PH.D. haibo.xie@amd.com AGENDA What s OpenCL Fundamentals for OpenCL programming OpenCL programming basics OpenCL programming tools Examples & demos 2 WHAT IS OPENCL Open
More information