Easy to adapt C code to kernel code


2 The language of OpenCL kernels
A simplified version of C
  No recursion
  No pointers to functions
Kernels have no return values
Easy to adapt C code to kernel code
Tal Ben-Nun, HUJI. All rights reserved.

3 The following kernel performs vector addition:

    kernel void VecAdd(const global float *veca,
                       const global float *vecb,
                       global float *result)
    {
        int id = get_global_id(0);
        result[id] = veca[id] + vecb[id];
    }

4 Each work-item performs operations on one index
All work-items are assumed to compute in parallel

5 Work-items are distinguished by their IDs
  Identified globally using the global ID
  Identified within a work-group using the local ID
Use the built-in functions:
  get_global_id(uint dim): Returns the global ID
  get_local_id(uint dim): Returns the local ID
  get_group_id(uint dim): Returns the ID of the work-group
dim specifies the requested dimension

6 Other helpful functions:
  get_work_dim(): Number of dimensions
  get_global_size(uint dim): Total number of work-items
  get_local_size(uint dim): Number of work-items in a work-group
  get_num_groups(uint dim): Number of work-groups
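
The relationship between these built-ins can be sketched in plain C for a single dimension. This is host-side arithmetic rather than kernel code; the struct and function names are hypothetical, and the sketch assumes a zero global offset and a global size that is a multiple of the work-group size.

```c
#include <stddef.h>

/* Hypothetical record of the values one work-item would see in one dimension. */
typedef struct {
    size_t global_id;  /* what get_global_id(0) would return */
    size_t local_id;   /* what get_local_id(0) would return  */
    size_t group_id;   /* what get_group_id(0) would return  */
} WorkItemIds;

/* Splits a global ID into group and local parts.
   Assumes a global offset of 0 and local_size > 0. */
WorkItemIds ids_for(size_t global_id, size_t local_size)
{
    WorkItemIds ids;
    ids.global_id = global_id;
    ids.group_id  = global_id / local_size;
    ids.local_id  = global_id % local_size;
    return ids;
}

/* Counterpart of get_num_groups(0), assuming global_size is a
   multiple of local_size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

Under these assumptions, global_id == group_id * local_size + local_id holds for every work-item.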

7 All variables have annotations that identify their address space:
  global: Global compute device memory
  constant: Specialized constant global memory
  local: Shared work-group memory
  private (default): Work-item memory
Images additionally take the access qualifiers read_only/write_only (used for images only)

8 OpenCL C provides two kinds of data types: scalars and vectors
Scalar data types operate just like in C
  Examples: char, int, float, half (16-bit FP)
  Unsigned counterparts: uchar, uint
Vector data types are new and use the vectorization capabilities of compute devices

9 Vectors are defined as typeN
  type is one of the scalar data types
  N is one of the following: 2, 3, 4, 8, 16
  Example: float4 vf = (float4)(1.0f, 6.5f, 0.9f, -1.0f);
Using vectors, one operation replaces N operations
Miscellaneous data types include:
  image2d_t and image3d_t, sampler_t
  event_t

10 Using vector data types:

    kernel void VectorizedCopy(const global float16 *src,
                               global float16 *dest)
    {
        // Kernel dimension is (length / 16)
        int id = get_global_id(0);

        // Note: Operates on 16 values concurrently
        dest[id] = src[id];
    }

11 This can also be written for a regular array:

    kernel void ComplicatedCopy(const global float *src,
                                global float *dest)
    {
        // Kernel dimension is still (length / 16)
        int id = get_global_id(0);

        // The following automatically loads the correct position
        float16 val = vload16(id, src);
        vstore16(val, id, dest);
    }
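
What "the correct position" means can be shown with a plain C emulation: vload16(offset, p) reads the 16 values starting at p + offset * 16, and vstore16 writes to the same position. The function names below are hypothetical stand-ins, each loop iteration plays the role of one work-item, and the length is assumed to be a multiple of 16.

```c
#include <stddef.h>

#define VEC_WIDTH 16

/* Emulates vload16(offset, p): reads the 16 floats at p + offset * 16. */
void emulated_vload16(size_t offset, const float *p, float out[VEC_WIDTH])
{
    for (size_t i = 0; i < VEC_WIDTH; i++)
        out[i] = p[offset * VEC_WIDTH + i];
}

/* Emulates vstore16(data, offset, p): writes 16 floats at p + offset * 16. */
void emulated_vstore16(const float data[VEC_WIDTH], size_t offset, float *p)
{
    for (size_t i = 0; i < VEC_WIDTH; i++)
        p[offset * VEC_WIDTH + i] = data[i];
}

/* Sequential stand-in for the kernel: one iteration per work-item.
   Assumes length is a multiple of 16. */
void complicated_copy(const float *src, float *dest, size_t length)
{
    for (size_t id = 0; id < length / VEC_WIDTH; id++) {
        float val[VEC_WIDTH];
        emulated_vload16(id, src, val);
        emulated_vstore16(val, id, dest);
    }
}
```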

12 Vectors can be accessed via three notations:
  XYZW: f.x is the first component, etc.
  High/low: f.hi, f.lo, f.even, f.odd
  S-notation: f.s0 is the first, f.s3 is the fourth, etc.
Vectors can be reshaped
  Into sub-vectors: float4 a; float2 b = a.xz;
  By shuffling: f.xyzw = f.wzyx; f.s0123 = f.s1320;
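
The effect of a shuffle can be traced with a small plain C stand-in for float4. The Float4 struct and shuffle4 function are hypothetical; the sketch only illustrates which source component ends up in which destination slot.

```c
/* Plain C stand-in for float4; s[0]..s[3] correspond to x, y, z, w. */
typedef struct {
    float s[4];
} Float4;

/* Emulates r = v.s<i0><i1><i2><i3>. The argument is passed by value,
   so all sources are read before any destination is written. */
Float4 shuffle4(Float4 v, int i0, int i1, int i2, int i3)
{
    Float4 r;
    r.s[0] = v.s[i0];
    r.s[1] = v.s[i1];
    r.s[2] = v.s[i2];
    r.s[3] = v.s[i3];
    return r;
}
```

With f = (1, 2, 3, 4), shuffle4(f, 3, 2, 1, 0) corresponds to f.wzyx and shuffle4(f, 1, 3, 2, 0) corresponds to f.s1320.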

13 In kernel code, the most common functions from <math.h> are provided automatically
  Examples: sin(), cos(), asin(), pow()
  Most functions operate on vectors too
Floating-point comparison: isequal()
More functions: min(), max(), clamp(), clz() (count leading zeros)

14

    kernel void VectorSine(global float4 *vec)
    {
        int id = get_global_id(0);

        // Note: Operates on 4 values concurrently
        vec[id] = sin(vec[id]);
    }

15 Working with work-groups is very important
  Sometimes information must be shared
  Redundant computations can be avoided
To avoid memory conflicts, work-items in the same work-group can wait for each other
Work-groups are completely independent of one another

16 Local memory can be statically allocated
  Example: local int data[20];
Or dynamically allocated, with its size specified in host code
  Use clSetKernelArg(kernel, index, size, NULL);
  The local array is then one of the kernel arguments

17 Synchronization is achieved with barriers and memory fences
Barrier: Blocks work-items until the entire work-group reaches the barrier function
  Use barrier(CLK_LOCAL_MEM_FENCE);
Memory fence: Ensures correct ordering of memory read/write operations (advanced)
  Use mem_fence() with the same argument as barrier()

18

    kernel void Reverse32(const global int *vector, global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32];

        // Load values to shared memory (work-group size is 32)
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Write the mirrored element; valid indices are 0..31
        result[id] = share[31 - lid];
    }

Without the barrier, a work-item could read a shared element before the work-item responsible for it has written it.
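
The role of the barrier can be pictured with a sequential C emulation of one work-group, in which the code before and after the barrier becomes two separate loops. The function name and the GROUP_SIZE constant are hypothetical; note that with 32 elements the mirror of index lid is 31 - lid.

```c
#define GROUP_SIZE 32

/* Emulates one 32-item work-group reversing its slice of the input. */
void reverse_group(const int *vector, int *result, int group_id)
{
    int share[GROUP_SIZE];            /* stands in for local memory */
    int base = group_id * GROUP_SIZE; /* first global ID of this group */

    /* Phase 1 (code before the barrier): every work-item stores its element. */
    for (int lid = 0; lid < GROUP_SIZE; lid++)
        share[lid] = vector[base + lid];

    /* The barrier sits here: phase 2 may only start once phase 1 has
       finished for all work-items of the group. */

    /* Phase 2 (code after the barrier): every work-item reads the
       mirrored element. */
    for (int lid = 0; lid < GROUP_SIZE; lid++)
        result[base + lid] = share[GROUP_SIZE - 1 - lid];
}
```

On a real device the work-items run concurrently, so the separation that the two loops give for free has to be requested explicitly with barrier().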

19

    kernel void Sum32(const global int *vector, global int *result)
    {
        int id = get_global_id(0), lid = get_local_id(0);
        local int share[32], sum;

        // Load values to shared memory
        share[lid] = vector[id];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Only the first group item performs summation
        if (lid == 0) {
            sum = 0;
            for (int i = 0; i < 32; i++)
                sum += share[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // All group items use the same value
        result[id] = sum;
    }

20 Work-items can copy global memory to local memory (or vice versa) asynchronously:

    event_t async_work_group_copy(local gentype *dst,
                                  const global gentype *src,
                                  size_t num_elements,
                                  event_t event)

The returned event_t is passed to wait_group_events to wait for the copy

21

    kernel void CopyWhileComputing(const global int *vector, global int *err)
    {
        int group = get_group_id(0);
        local int localvec[20], computedblock[20];
        event_t ev;

        // Start the copy; must be called by all work-items of the
        // work-group with the same arguments
        ev = async_work_group_copy(localvec, vector + group * 20, 20, 0);

        dosomethingcomplicated(computedblock); // May take a while

        wait_group_events(1, &ev); // Waits for copy to finish

        for (int i = 0; i < 20; i++)
            if (localvec[i] != computedblock[i])
                *err = 1;
    }

22 OpenCL provides means to perform image processing
The presented example performs image blurring using a 3x3 Gaussian filter
[Figure: the filter applied to an image; the GaussianFilter kernel uses the weights 1 2 1 / 2 4 2 / 1 2 1, divided by 16]

23 A type of memory object, optimized for pixel-wise access
In kernels: image2d_t and image3d_t
Within a kernel, an image can only be read-only or write-only
  Specified with the access qualifiers read_only and write_only

24 Images can only be of pre-specified formats
The cl_image_format structure specifies image types
  Defined by data type and channel order

25 Data type specifies the size and structure of pixel data
  Examples: CL_FLOAT, CL_UNSIGNED_INT32
Channel order specifies the number and ordering of channels in a pixel
  Examples: CL_INTENSITY, CL_RGBA, CL_ARGB

26 Not all image formats are supported on all devices
To obtain the supported image formats, use clGetSupportedImageFormats()

27 Images are opaque handles that can only be read using samplers
Samplers define how images are accessed and what happens when accessing beyond their borders
For instance, a normalized-coordinate sampler takes an (x,y,z) coordinate and accesses the image in the range [0,0,0] to [1,1,1]

28 Samplers contain 3 distinct properties:
Coordinate normalization
  CLK_NORMALIZED_COORDS_TRUE/FALSE
Filtering (accessing data between pixels)
  CLK_FILTER_NEAREST: Nearest pixel value
  CLK_FILTER_LINEAR: Linear interpolation

29 Addressing (accessing out-of-range coordinates)
  CLK_ADDRESS_NONE: No handling; out-of-range access is undefined
  CLK_ADDRESS_CLAMP: Returns a default border value, e.g. (0,0,0,0)
  CLK_ADDRESS_CLAMP_TO_EDGE: Returns the nearest edge pixel
  CLK_ADDRESS_REPEAT: Wraps around the image
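
The three handled addressing modes can be sketched in plain C for a single row of pixels. The function names are hypothetical, and integer pixel indices are used only for illustration; in OpenCL itself CLK_ADDRESS_REPEAT is only valid together with normalized coordinates.

```c
/* CLK_ADDRESS_CLAMP_TO_EDGE-like: out-of-range coordinates snap to
   the nearest valid pixel index. Assumes size > 0. */
int clamp_to_edge(int coord, int size)
{
    if (coord < 0)
        return 0;
    if (coord >= size)
        return size - 1;
    return coord;
}

/* CLK_ADDRESS_REPEAT-like: coordinates wrap around the image.
   Assumes size > 0; handles negative coordinates as well. */
int wrap_repeat(int coord, int size)
{
    int r = coord % size;
    return r < 0 ? r + size : r;
}

/* CLK_ADDRESS_CLAMP-like: reads outside the row return a border value. */
float read_clamp(const float *row, int size, int coord, float border)
{
    if (coord < 0 || coord >= size)
        return border;
    return row[coord];
}
```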

30 Samplers are declared in kernels as sampler_t
Note: Samplers may also be created in the host using clCreateSampler
Example:

    const sampler_t sampler = CLK_NORMALIZED_COORDS_TRUE |
                              CLK_ADDRESS_REPEAT |
                              CLK_FILTER_NEAREST;

31

    kernel void GaussianFilter(read_only image2d_t image,
                               write_only image2d_t result)
    {
        int x = get_global_id(0), y = get_global_id(1);
        float4 color = (float4)0.0f;

        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                  CLK_ADDRESS_CLAMP_TO_EDGE |
                                  CLK_FILTER_NEAREST;

        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y - 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y - 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x - 1, y    ));
        color += 4.0f * read_imagef(image, sampler, (int2)(x,     y    ));
        color += 2.0f * read_imagef(image, sampler, (int2)(x + 1, y    ));
        color += 1.0f * read_imagef(image, sampler, (int2)(x - 1, y + 1));
        color += 2.0f * read_imagef(image, sampler, (int2)(x,     y + 1));
        color += 1.0f * read_imagef(image, sampler, (int2)(x + 1, y + 1));
        color /= 16.0f;

        write_imagef(result, (int2)(x, y), color);
    }

Since we want to blur the image correctly on the edges, we use CLAMP_TO_EDGE
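
For checking kernel output on the host, a sequential reference can be handy. The following is a minimal plain C sketch, assuming a single-channel float image stored in row-major order; the function names are hypothetical and the clamping mimics CLAMP_TO_EDGE.

```c
/* Snaps an index into [0, size - 1], like CLAMP_TO_EDGE addressing. */
static int clamp_index(int v, int size)
{
    if (v < 0)
        return 0;
    if (v >= size)
        return size - 1;
    return v;
}

/* Reference 3x3 Gaussian blur for a single-channel, row-major image.
   image and result must not overlap. */
void gaussian_blur_3x3(const float *image, float *result,
                       int width, int height)
{
    static const float weights[3][3] = {
        {1.0f, 2.0f, 1.0f},
        {2.0f, 4.0f, 2.0f},
        {1.0f, 2.0f, 1.0f}
    };

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float color = 0.0f;
            /* Visit all 9 neighbors, target pixel included. */
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = clamp_index(x + dx, width);
                    int sy = clamp_index(y + dy, height);
                    color += weights[dy + 1][dx + 1] * image[sy * width + sx];
                }
            }
            result[y * width + x] = color / 16.0f;
        }
    }
}
```

Each pair of loop indices (x, y) corresponds to one work-item of the two-dimensional kernel.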

32 To read from images, use read_image{f,i,ui}
Examples:

    float4 read_imagef(image2d_t image, sampler_t sampler, int2/float2 coord)
    uint4 read_imageui(image3d_t image, sampler_t sampler, float4 coord)

33 In the GaussianFilter kernel, the nine read_imagef calls read all 9 neighboring pixels (target pixel included)

34 To write to images, use write_image{f,i,ui}
Examples:

    void write_imagef(image2d_t image, int2 coord, float4 color)
    void write_imagei(image2d_t image, int2 coord, int4 color)

Notice there are no samplers
  Actual pixels have to be written
3D images cannot be write-only, except with the cl_khr_3d_image_writes extension

36 Other image functions:
  get_image_width(image2d_t/image3d_t)
  get_image_height(image2d_t/image3d_t)
  get_image_depth(image3d_t)

38 Exercise 1 contains both OpenCL host and kernel programming
It is strongly recommended to test the code on an actual GPU
Optimized code is considered a bonus and graded accordingly
Good luck!

