Easy to adapt C code to kernel code
- Jayson Francis
- 6 years ago
Transcription
1
2 The language of OpenCL kernels
- A simplified version of C
- No recursion
- No pointers to functions
- Kernels have no return values
- Easy to adapt C code to kernel code
Tal Ben-Nun, HUJI. All rights reserved.
3 The following kernel performs vector addition:

    kernel void VecAdd(const global float *veca,
                       const global float *vecb,
                       global float *result)
    {
        // Each work-item adds one pair of elements
        int id = get_global_id(0);
        result[id] = veca[id] + vecb[id];
    }
4 Each work-item performs operations on one index
- All work-items are assumed to compute in parallel
[Figure: Vector]
5 Work-items are distinguished by their IDs
- Identified globally using the global ID
- Identified as part of a work-group using the local ID
Use the built-in functions:
- get_global_id(uint dim): returns the global ID
- get_local_id(uint dim): returns the local ID
- get_group_id(uint dim): returns the work-group's ID
dim specifies the requested dimension
6 Other helpful functions:
- get_work_dim(): number of dimensions
- get_global_size(uint dim): number of work-items
- get_local_size(uint dim): size of work-groups
- get_num_groups(uint dim): number of work-groups
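The ID and size functions are related by simple arithmetic. The following plain-C sketch (host-side C, not kernel code) shows that relation for one dimension, assuming a zero global offset and a global size that is a multiple of the work-group size. The helper names are hypothetical and are not OpenCL built-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL built-in): the value that
   get_global_id(dim) would return for a work-item, assuming a zero
   global offset. */
size_t global_id_from(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size); /* local IDs run from 0 to local_size - 1 */
    return group_id * local_size + local_id;
}

/* Hypothetical helper: number of work-groups in one dimension, assuming
   the global size is a multiple of the work-group size. */
size_t num_groups_from(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with work-groups of 32 items, the work-item with local ID 5 in work-group 2 has global ID 2 * 32 + 5 = 69.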
7 All variables have annotations that identify their address space:
- global: Global compute device memory
- constant: Specialized constant global memory
- local: Shared work-group memory
- private (default): Work-item memory
- read_only/write_only: Access qualifiers, used for images only
8 OpenCL-C provides two kinds of data types: scalars and vectors
- Scalar data types operate just like C
  - Examples: char, int, float, half (16-bit FP)
  - Unsigned counterparts: uchar, uint
- Vector data types are new and use the vectorization capabilities of compute devices
9 Vectors are defined as typeN
- type is one of the scalar data types
- N is one of the following: 2, 3, 4, 8, 16
- Example: float4 vf = (float4)(1.0f, 6.5f, 0.9f, -1.0f);
- Using vectors, one action replaces N actions
Miscellaneous data types include:
- image2d_t and image3d_t, sampler_t
- event_t
10 Using vector data types:

    kernel void VectorizedCopy(const global float16 *src,
                               global float16 *dest)
    {
        // Kernel dimension is (length / 16)
        int id = get_global_id(0);

        // Note: Operates on 16 values concurrently
        dest[id] = src[id];
    }
11 This can also be written for a regular array:

    kernel void ComplicatedCopy(const global float *src,
                                global float *dest)
    {
        // Kernel dimension is still (length / 16)
        int id = get_global_id(0);

        // The following automatically loads the correct position
        float16 val = vload16(id, src);
        vstore16(val, id, dest);
    }
12 Vectors can be accessed via three notations:
- XYZW: f.x is the first dimension, etc.
- High/low: f.hi, f.lo, f.even, f.odd
- S-notation: f.s0 is the first, f.s3 is the fourth, etc.
Vectors can be reshaped
- Into sub-vectors: float4 a; float2 b = a.xz;
- By shuffling: f.xyzw = f.wzyx; f.s0123 = f.s1320;
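To make the component selectors concrete, here is a minimal plain-C sketch of what .lo, .hi, .even, .odd and the .wzyx shuffle select on a 4-component vector. The vec4 and vec2 structs and all function names are illustrative stand-ins, not OpenCL types or built-ins.

```c
#include <assert.h>

/* Plain-C stand-ins for OpenCL's float4 and float2 (illustrative only). */
typedef struct { float s[4]; } vec4;
typedef struct { float s[2]; } vec2;

/* Hypothetical helper: pick two components by index, like a
   two-letter swizzle. */
static vec2 pick2(vec4 v, int i, int j)
{
    assert(i >= 0 && i < 4 && j >= 0 && j < 4);
    vec2 r = { { v.s[i], v.s[j] } };
    return r;
}

vec2 lo(vec4 v)   { return pick2(v, 0, 1); } /* v.lo   == v.xy */
vec2 hi(vec4 v)   { return pick2(v, 2, 3); } /* v.hi   == v.zw */
vec2 even(vec4 v) { return pick2(v, 0, 2); } /* v.even == v.xz */
vec2 odd(vec4 v)  { return pick2(v, 1, 3); } /* v.odd  == v.yw */

/* v.wzyx: the same components in reverse order. */
vec4 reversed(vec4 v)
{
    vec4 r = { { v.s[3], v.s[2], v.s[1], v.s[0] } };
    return r;
}
```

In kernel code none of these helpers are needed, since the selectors are part of the language; the sketch only spells out which components each one names.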
13 In kernel code, the most common functions from <math.h> are provided automatically
- Examples: sin(), cos(), asin(), pow()
- Most functions operate on vectors too
- Floating-point comparison: isequal()
- More functions: min(), max(), clamp(), clz() (count leading zeros)
14

    kernel void VectorSine(global float4 *vec)
    {
        int id = get_global_id(0);

        // Note: Operates on 4 values concurrently
        vec[id] = sin(vec[id]);
    }
15 Working with work-groups is very important
- Sometimes information must be shared
- Redundant computations can be avoided
- To avoid memory conflicts, work-items in the same work-group can wait for each other
- Work-groups are completely independent from one another
16 Local memory can either be statically allocated
- Example: local int data[20];
Or dynamically allocated, its size specified in host code
- Use clSetKernelArg(kernel, index, size, NULL);
- The local array is then one of the kernel arguments
17 Synchronization is achieved with barriers and memory fences
- Barrier: Blocks work-items until the entire work-group reaches the barrier function
  - Use barrier(CLK_LOCAL_MEM_FENCE);
- Memory Fence: Ensures correct ordering of memory read/write operations (advanced)
  - Use mem_fence() with the same argument as barrier()
18

    // Assumes a work-group size of 32
    kernel void Reverse32(const global int *vector,
                          global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32];

        // Load values to shared memory
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Write the work-group's values in reverse order
        // (valid indices are 0..31, so the mirror of lid is 31 - lid)
        result[id] = share[31 - lid];
    }

Without the barrier, a work-item could read a shared-memory slot before another work-item has written it.
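The load, barrier, read pattern of a Reverse32-style kernel can be traced sequentially. The plain-C sketch below is an illustration under stated assumptions, not the kernel itself: the work-group size is fixed at 32, the input length is a multiple of 32, and the names reverse_blocks and GROUP_SIZE are hypothetical. The boundary between the two inner loops plays the role of the barrier.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 32 /* assumed work-group size */

/* Sequential sketch: reverse each block of GROUP_SIZE elements.
   n is assumed to be a multiple of GROUP_SIZE. */
void reverse_blocks(const int *vector, int *result, size_t n)
{
    assert(n % GROUP_SIZE == 0);
    for (size_t group = 0; group < n / GROUP_SIZE; group++) {
        int share[GROUP_SIZE]; /* stands in for local memory */

        /* Phase 1: every "work-item" stores its element. */
        for (size_t lid = 0; lid < GROUP_SIZE; lid++)
            share[lid] = vector[group * GROUP_SIZE + lid];

        /* Phase 2 starts only after phase 1 has finished for the whole
           group, which is what the barrier guarantees in a kernel.
           The mirror of lid is GROUP_SIZE - 1 - lid. */
        for (size_t lid = 0; lid < GROUP_SIZE; lid++)
            result[group * GROUP_SIZE + lid] = share[GROUP_SIZE - 1 - lid];
    }
}
```

If the two phases were interleaved per work-item instead, the read in phase 2 could hit a slot that has not been filled yet, which is the failure the barrier prevents.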
19

    // Assumes a work-group size of 32
    kernel void Sum32(const global int *vector,
                      global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32], sum;

        // Load values to shared memory
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Only the first group item performs summation
        if (lid == 0)
        {
            sum = 0;
            for (int i = 0; i < 32; i++)
                sum += share[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // All group items use the same value
        // (written to result, since vector is const)
        result[id] = sum;
    }
20 Work-items can copy global memory to local memory (or vice versa) asynchronously:

    event_t async_work_group_copy(local gentype *dst,
                                  const global gentype *src,
                                  size_t num_elements,
                                  event_t event)

The event_t returned is used in wait_group_events to wait for the copy
21

    kernel void CopyWhileComputing(const global int *vector,
                                   global int *err)
    {
        // Assumes each work-group handles its own block of 20 elements
        int group = get_group_id(0);
        local int localvec[20], computedblock[20];
        event_t ev;

        // Start the copy, must be called by all work-items
        // in the work-group with the same arguments
        ev = async_work_group_copy(localvec, vector + group * 20, 20, 0);

        dosomethingcomplicated(computedblock); // May take a while

        wait_group_events(1, &ev); // Waits for copy to finish

        for (int i = 0; i < 20; i++)
            if (localvec[i] != computedblock[i])
                *err = 1;
    }
22 OpenCL provides means to perform image processing
- The presented example performs image blurring using a 3x3 Gaussian filter
23 Images are a type of memory object, optimized for pixel-wise access
- In kernels: image2d_t and image3d_t
- Images can only be read-only or write-only
  - Specified with access qualifiers (read_only, write_only)
24 Images can only be of pre-specified formats
- The cl_image_format structure specifies image types
- Defined by Data Type and Channel Order
25 Data type specifies the size and structure of pixel data
- Examples: CL_FLOAT, CL_UNSIGNED_INT32
Channel order specifies the number and ordering of channels in a pixel
- Examples: CL_INTENSITY, CL_RGBA, CL_ARGB
26 Not all image formats are supported on all devices
- To obtain the supported image formats, use clGetSupportedImageFormats()
27 Images are opaque handles that can only be accessed using samplers
- Samplers define the way images are accessed and what happens when accessing beyond their borders
- For instance, a normalized coordinate sampler takes an (x,y,z) coordinate and accesses the image in the range [0,0,0] to [1,1,1]
28 Samplers contain 3 distinct properties:
- Coordinate Normalization
  - CLK_NORMALIZED_COORDS_TRUE/FALSE
- Filtering (accessing data between pixels)
  - CLK_FILTER_NEAREST: nearest pixel value
  - CLK_FILTER_LINEAR: linear interpolation
29 Addressing (accessing out-of-range coordinates)
- CLK_ADDRESS_NONE: no access
- CLK_ADDRESS_CLAMP: default value (0,0,0,0)
- CLK_ADDRESS_CLAMP_TO_EDGE: same as edge
- CLK_ADDRESS_REPEAT: wraps around image
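The difference between clamping to the edge and repeating can be sketched in plain C for a single dimension. This is an illustration only: the helper names are hypothetical, and the wrap is applied to integer pixel indices for simplicity, whereas in OpenCL the repeat mode is defined for normalized coordinates.

```c
#include <assert.h>

/* Hypothetical helper: CLAMP_TO_EDGE-style addressing in one dimension.
   Out-of-range coordinates map to the nearest valid pixel. */
int clamp_to_edge(int coord, int size)
{
    assert(size > 0);
    if (coord < 0) return 0;
    if (coord >= size) return size - 1;
    return coord;
}

/* Hypothetical helper: REPEAT-style addressing in one dimension.
   Coordinates wrap around; the extra step keeps negative inputs in range,
   because the C remainder of a negative number is negative or zero. */
int wrap_repeat(int coord, int size)
{
    assert(size > 0);
    int r = coord % size;
    return r < 0 ? r + size : r;
}
```

On an 8-pixel row, coordinate -1 resolves to pixel 0 when clamped to the edge and to pixel 7 when repeated.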
30 Samplers are declared in kernels as sampler_t
- Note: Samplers may also be initialized in the host using clCreateSampler
- Example:

    sampler_t sampler = CLK_NORMALIZED_COORDS_TRUE |
                        CLK_ADDRESS_REPEAT |
                        CLK_FILTER_NEAREST;
31

    kernel void GaussianFilter(read_only image2d_t image,
                               write_only image2d_t result)
    {
        int x = get_global_id(0), y = get_global_id(1);
        float4 color = (float4)0.0f;

        // Since we want to blur the image correctly on the edges,
        // we use CLAMP_TO_EDGE
        sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                            CLK_ADDRESS_CLAMP_TO_EDGE |
                            CLK_FILTER_NEAREST;

        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y - 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x - 1, y    ));
        color += 4.0f * read_imagef(image, sampler, (int2)(x,     y    ));
        color += 2.0f * read_imagef(image, sampler, (int2)(x + 1, y    ));
        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y + 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y + 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y + 1));
        color /= 16.0f;

        write_imagef(result, (int2)(x, y), color);
    }
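For checking a kernel like this against a reference, a sequential version is handy. The plain-C sketch below applies the same 3x3 weights with edge clamping to a single-channel float image stored row by row. The function names and the single-channel layout are assumptions made for illustration; the kernel itself works on float4 pixels through the image functions.

```c
#include <assert.h>

/* Hypothetical helper: clamp a pixel coordinate into [0, size - 1]. */
static int clamp_coord(int c, int size)
{
    if (c < 0) return 0;
    if (c >= size) return size - 1;
    return c;
}

/* Sequential sketch of a 3x3 Gaussian blur on a single-channel image
   stored row by row. src and dst must not overlap. */
void gaussian3x3(const float *src, float *dst, int width, int height)
{
    /* The 1-2-1 / 2-4-2 / 1-2-1 weights sum to 16. */
    static const float weights[3][3] = {
        { 1.0f, 2.0f, 1.0f },
        { 2.0f, 4.0f, 2.0f },
        { 1.0f, 2.0f, 1.0f }
    };

    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float color = 0.0f;
            /* Visit the 9 neighbors, target pixel included. */
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = clamp_coord(x + dx, width);
                    int sy = clamp_coord(y + dy, height);
                    color += weights[dy + 1][dx + 1] * src[sy * width + sx];
                }
            }
            dst[y * width + x] = color / 16.0f;
        }
    }
}
```

Because the weights sum to 16 and the result is divided by 16, a constant image passes through unchanged, which makes a quick sanity test for both the sketch and a real kernel.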
32 To read from images, use read_image{f,i,ui}
- Examples:

    float4 read_imagef(image2d_t image, sampler_t sampler, int2/float2 coord)
    uint4 read_imageui(image3d_t image, sampler_t sampler, float4 coord)
33 In the GaussianFilter kernel, the nine read_imagef calls read all 9 neighboring pixels (target pixel included).
34 To write to images, use write_image{f,i,ui}
- Examples:

    void write_imagef(image2d_t image, int2 coord, float4 color)
    void write_imagei(image2d_t image, int2 coord, int4 color)

- Notice there are no samplers
  - Actual pixels have to be written
- 3D images cannot be written to, except with the cl_khr_3d_image_writes extension
36 Other image functions:
- get_image_width(image2d_t/image3d_t)
- get_image_height(image2d_t/image3d_t)
- get_image_depth(image3d_t)
38 Exercise 1 contains both OpenCL host and kernel programming
- It is strongly recommended to test the code on an actual GPU
- Optimized code is considered a bonus and graded accordingly
Good luck!
The COPRTHR Primer rev. 1.6 Copyright 2011-2014 Brown Deer Technology, LLC Verbatim copying and distribution of this entire document is permitted in any medium, provided this notice is preserved. Contents
More informationGPU Architecture and Programming with OpenCL. OpenCL. GPU Architecture: Why? Today s s Topic. GPUs: : Architectures for Drawing Triangles Fast
Today s s Topic GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 GPU architecture What and why The good The bad Compute Models
More informationOutline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun
Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory
More informationGPU Architecture and Programming with OpenCL
GPU Architecture and Programming with OpenCL David Black-Schaffer david.black-schaffer@it black-schaffer@it.uu.se Room 1221 Today s s Topic GPU architecture What and why The good The bad Compute Models
More informationExercise Session 2 Simon Gerber
Exercise Session 2 Simon Gerber CASP 2014 Exercise 2: Binary search tree Implement and test a binary search tree in C: Implement key insert() and lookup() functions Implement as C module: bst.c, bst.h
More informationStructured Data. CIS 15 : Spring 2007
Structured Data CIS 15 : Spring 2007 Functionalia HW4 Part A due this SUNDAY April 1st: 11:59pm Reminder: I do NOT accept LATE HOMEWORK. Today: Dynamic Memory Allocation Allocating Arrays Returning Pointers
More informationMassively Parallel Algorithms
Massively Parallel Algorithms Introduction to CUDA & Many Fundamental Concepts of Parallel Programming G. Zachmann University of Bremen, Germany cgvr.cs.uni-bremen.de Hybrid/Heterogeneous Computation/Architecture
More informationCSE 333 Autumn 2013 Midterm
CSE 333 Autumn 2013 Midterm Please do not read beyond this cover page until told to start. A question involving what could be either C or C++ is about C, unless it explicitly states that it is about C++.
More informationclarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018
clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 ANECDOTE DISCOVERING A BUFFER OVERFLOW CPU GPU MEMORY MEMORY Data Data Data Data Data 2 clarmor: A
More informationOPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015
OPENCL WITH AMD FIREPRO W9100 GERMAN ANDRYEYEV MAY 20, 2015 Introducing AMD FirePro W9100 HW COMPARISON W9100(HAWAII) VS W9000(TAHITI) FirePro W9100 FirePro W9000 Improvement Notes Compute Units 44 32
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationCS 432 Interactive Computer Graphics
CS 432 Interactive Computer Graphics Lecture 7 Part 2 Texture Mapping in OpenGL Matt Burlick - Drexel University - CS 432 1 Topics Texture Mapping in OpenGL Matt Burlick - Drexel University - CS 432 2
More informationSlide Set 3. for ENCM 339 Fall 2017 Section 01. Steve Norman, PhD, PEng
Slide Set 3 for ENCM 339 Fall 2017 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary September 2017 ENCM 339 Fall 2017 Section 01
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationby Pearson Education, Inc. All Rights Reserved.
Let s improve the bubble sort program of Fig. 6.15 to use two functions bubblesort and swap. Function bubblesort sorts the array. It calls function swap (line 51) to exchange the array elements array[j]
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU
More informationConcurrent Programming with the Cell Processor. Dietmar Kühl Bloomberg L.P.
Concurrent Programming with the Cell Processor Dietmar Kühl Bloomberg L.P. dietmar.kuehl@gmail.com Copyright Notice 2009 Bloomberg L.P. Permission is granted to copy, distribute, and display this material,
More informationComputer Science & Engineering 150A Problem Solving Using Computers
Computer Science & Engineering 150A Problem Solving Using Computers Lecture 06 - Stephen Scott Adapted from Christopher M. Bourke 1 / 30 Fall 2009 Chapter 8 8.1 Declaring and 8.2 Array Subscripts 8.3 Using
More information04. CUDA Data Transfer
04. CUDA Data Transfer Fall Semester, 2015 COMP427 Parallel Programming School of Computer Sci. and Eng. Kyungpook National University 2013-5 N Baek 1 CUDA Compute Unified Device Architecture General purpose
More informationMichael Kinsner, Dirk Seynhaeve IWOCL 2018
Michael Kinsner, Dirk Seynhaeve IWOCL 2018 Topics 1. FPGA overview 2. Motivating application classes 3. Host pipes 4. Some data 2 FPGA: Fine-grained Massive Parallelism Intel Stratix 10 FPGA: Over 5 Million
More informationSpiNNaker Application Programming Interface (API)
SpiNNaker Application Programming Interface (API) Version 2.0.0 10 March 2016 Application programming interface (API) Event-driven programming model The SpiNNaker API programming model is a simple, event-driven
More informationEE109 Lab Need for Speed
EE109 Lab Need for Speed 1 Introduction In this lab you will parallelize two pieces of code. The first is an image processing program to blur (smooth) or sharpen an image. We will guide you through this
More informationGPU Programming Using CUDA
GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa
More informationEric Schenk, EA August 2009
Game Developer s Perspective on OpenCL Eric Schenk, EA August 2009 Copyright Electronic Arts, 2009 - Page 1 Motivation Copyright Electronic Arts, 2009 - Page 2 Motivation Supports a variety of compute
More informationMotion Estimation Extension for OpenCL
Motion Estimation Extension for OpenCL Authors: Nico Galoppo, Craig Hansen-Sturm Reviewers: Ben Ashbaugh, David Blythe, Hong Jiang, Stephen Junkins, Raun Krisch, Matt McClellan, Teresa Morrison, Dillon
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More information