OpenCL Parallel Processing using General Purpose Graphical Processing Units - TiViPE Software Development


TiViPE Visual Programming

OpenCL Parallel Processing using General Purpose Graphical Processing Units
TiViPE Software Development

Technical Report

Copyright (c) TiViPE. All rights reserved.

Tino Lourens
TiViPE
Kanaaldijk ZW
LD Helmond
The Netherlands
tino@tivipe.com

November 6, 2012

Contents

1 Introduction
2 GPUadd using CUDA
  2.1 CUDA implementation of GPUadd
3 CLadd using OpenCL
4 Programming a TiViPE module using OpenCL
5 OpenCL components
  5.1 Device information and selection
  5.2 Data transportation
  5.3 Computational modules

Abstract

The aim of this report is to describe TiViPE modules that make use of Open Computing Language (OpenCL) programming. OpenCL is available in TiViPE from version

Figure 1: OpenCL example: addcl is a merged module with two inputs and one output. On the right the atomic modules are given in green.

1 Introduction

The aim of TiViPE is to integrate different technologies in a seamless way using graphical icons [1]. Due to these icons the user does not need in-depth knowledge of the underlying hardware architecture [2]. In this report the focus is on how to integrate the Open Computing Language (OpenCL) into TiViPE. OpenCL is an open standard for parallel programming of heterogeneous systems; it was initiated by Apple and is coordinated by the Khronos Group. The aim of OpenCL is to support parallel programming on many different computational devices, including Central Processing Units (CPUs) and Graphical Processing Units (GPUs). The terminology of CPU and GPU is becoming somewhat outdated, because GPUs can also be used as generic processing units, known as GP-GPUs. This is the reason that AMD has renamed its processing units to Accelerated Processing Units (APUs), since both CPU and GPU can be used in the same way through the OpenCL SDK (Software Development Kit). Objectively speaking, OpenCL is most beneficial when the hardware is massively parallel, which for the moment implies that GPUs are the most suitable devices. The NVIDIA CUDA (Compute Unified Device Architecture) SDK is a competitor of OpenCL. The main drawback of CUDA compared to OpenCL is that only NVIDIA (GPU) hardware is supported. OpenCL in turn has a number of restrictions compared to CUDA, and TiViPE has made an effort to bridge the gap between the two. Our aim is to make OpenCL and CUDA programming very similar, and to make modular integration in TiViPE the same. In this document we make a step-by-step comparison between the two technologies using a simple example.

2 GPUadd using CUDA

The addcl TiViPE module illustrated in Figure 1 performs an addition for every element.
In formula:

    d_i = a_i + b_i   ∀i                                   (1)

A very simple module to program, but several steps need to be taken before such a module can be integrated in an environment like TiViPE. Figure 1 already illustrates that memory allocation and transfer need to be taken into consideration. The reason to do so is that

1. allocation of memory on a GPU takes time,
2. so does freeing memory,
3. transfer from CPU to GPU memory is time consuming,
4. and vice versa, so is the transfer from GPU to CPU memory.

In the design of an efficient algorithm it is therefore important to allocate memory in the initialization phase and reuse the allocated memory during the iterative process. When the process has finished, memory is freed. Modules in TiViPE need to be constructed and merged into a single module where IO to CPU memory is included, as illustrated in Figure 1. The construction of

    e_i = a_i + b_i + c_i   ∀i                             (2)

could be programmed from scratch, but in this case it is constructed using the basic building block twice:

    d_i = a_i + b_i   ∀i                                   (3)
    e_i = d_i + c_i   ∀i                                   (4)

The CLinit module initializes all available modules for all detected OpenCL capable devices. For instance, an ASUS EEE notebook PC can contain two different OpenCL devices: an AMD E-350 Processor (a dual core CPU) and a Loveland (a Radeon 6310 GPU), while a desktop may contain a single NVIDIA GPU device such as a GeForce GTX 560. The CLinit module uses a platform number and a device number to give the user control over these devices. The module, when executed, compiles all modules to binaries when they do not yet exist for a given device. In some cases this is not necessary, because the binaries used by different devices are the same. For this reason the module has two additional parameters that map devices by providing string commands. For instance, when the devices are GeForce GTX 580, GeForce GTX 570, they can be mapped to the GTX 560 binaries by providing GeForce GTX 560,GeForce GTX 560 as mapped devices. The CLinit module is required as the first module in every TiViPE network that uses OpenCL. Likewise, the CLresetDevices module is required as the last module in every TiViPE network. The CLadd module performs the addition as described by (1) using the generic Data5D datatype.
The inputs and outputs are colored yellow to indicate that memory resides on an OpenCL device rather than in CPU memory. This holds, for instance, for a GPU device with its own GPU memory. The module has three inputs: the first input is used to pre-allocate output memory, while the contents of the second and third input are used to perform the addition itself. This construction avoids the overhead of memory allocation and freeing in the CLadd module itself. For the three inputs, memory is allocated the first time only. The second and third input copy data from CPU memory to the OpenCL device. The output in turn is copied from device memory to CPU memory.

2.1 CUDA implementation of GPUadd

Figure 1 illustrates a TiViPE network using OpenCL commands. The green CLadd module in this figure is in essence the same as the GPUadd module. The code information file GPUadd.ci in TiViPE reveals that the following interface parameters have been used:

    Input ("Input pre-allocated output data",
           "-preo", DATA5D, DATA5D_ANYTYPE, IGNORED) [1];
    Input ("Input 1",

           "-input1", DATA5D, DATA5D_ANYTYPE, IGNORED) [2];
    Input ("Input 2",
           "-input2", DATA5D, DATA5D_ANYTYPE, IGNORED) [3];
    Output ("Output",
           "-output", DATA5D, DATA5D_ANYTYPE, IGNORED) [F];

The term IGNORED denotes that memory does not reside on the CPU (but on the GPU). The class and routine are TVPgpuArithmetic and TVPgpuAdd, respectively. This routine is part of the TVPgpu library that is found in the $(TVPHOME)/lib directory. The source code, given by files TVPgpuarithmetic.h and TVPgpuarithmetic.cpp, is found in directory $(TVPHOME)/Libraries/TVPgpu/src. In the header file the routine is given as

    Data5D TVPgpuAdd (Data5D gpu_ret, Data5D gpu_d1, Data5D gpu_d2);

and we find the routine call itself in file TVPgpuarithmetic.cpp. The routine checks if the parameters are set properly; if so, it performs the following:

    TVPcudaAdd3 (gpu_ret, gpu_d1, gpu_d2);
    return gpu_ret;

The TVPcudaAdd3 routine is found in the TVPcuda library. In file TVPcudaarithmetic.h it is declared as

    extern "C" TVPCUDA_EXPORT
    void TVPcudaAdd3 (Data5D ret, Data5D d1, Data5D d2);

The CUDA code itself is not revealed by TiViPE, but it is implemented as follows:

    void TVPcudaAdd3 (Data5D ret, Data5D d1, Data5D d2)
    {
    #if defined (TVPCUDA_DEBUG) || defined (ALL_DEBUG)
      printf ("TVPcudaAdd3\n");
    #endif
      /*******************************************************
       * Setup block and grid sizes
       *******************************************************/
      int w, h, len;
      TVPcudaImageSize (d1, w, h, len);
      int wh = w * h;
      int Bx, By, Bz, Gx, Gy, Gz;
      TVPcudaBlockGridSize (Bx, By, Bz, Gx, Gy, Gz,
                            w, h, len, CUDA_NO_CHECK);
      dim3 block (Bx, By, Bz);
      dim3 grid (Gx, Gy, Gz);

      /*******************************************************

       * Add
       * indata and outdata are coupled to d2 and d1,
       * respectively.
       *******************************************************/
      switch (d1.datatype)
      {
        case DATA5D_FLOAT:
        {
          float *indata1  = (float *) d1.data;
          float *indata2  = (float *) d2.data;
          float *lindata1 = indata1;
          float *lindata2 = indata2;
          float *outdata  = (float *) ret.data;
          float *loutdata = outdata;

          /*************************************************
           * iterate over the different images
           *************************************************/
          for (int i = 0; i < len; i++)
          {
            TVPcudaAddKernel3<float><<<grid, block>>>
              (loutdata, lindata1, lindata2, w);
            lindata1 += wh;
            lindata2 += wh;
            loutdata += wh;
          }
          break;
        }
        case DATA5D_CHAR:
          /*
           * Copy all data from the case above
           * Replace all float by char
           */
        case DATA5D_UCHAR:
        case DATA5D_SHORT:
        case DATA5D_USHORT:
        case DATA5D_INT:
        case DATA5D_UINT:
        case DATA5D_LONG:
        case DATA5D_ULONG:
        case DATA5D_DOUBLE:
        case DATA5D_COMPLEXFLOAT:
        case DATA5D_COMPLEXDOUBLE:
          printf ("Float datatype is implemented only.\n");
          break;
        case DATA5D_NOTYPE:
          break;
      }
    }

The TVPcudaAddKernel3 routine is the so-called CUDA kernel, which is implemented as follows:

    template <typename DType>
    __global__ void TVPcudaAddKernel3 (DType *ret, DType *d1,
                                       DType *d2, int width)
    {
      /******************************************************
       * Add
       ******************************************************/
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      int i = y * width + x;
      ret[i] = (DType) (d1[i] + d2[i]);
    }

The code above is compiled and made available in the TVPcuda library.

3 CLadd using OpenCL

Let us try to achieve the same in OpenCL. In header TVPclArithmetic.h the routine name has changed from TVPgpuAdd to TVPclAdd:

    Data5D TVPclAdd (Data5D cl_ret, Data5D cl_d1, Data5D cl_d2);

but the parameters are the same. The implementation of this routine, given in file TVPclarithmetic.cpp, is also very similar to the CUDA version:

    TVPclAddPriv (cl_ret, cl_d1, cl_d2);
    return cl_ret;

The TVPclAddPriv routine is found in the same file and not in a separate library. The routine is implemented as follows:

    cl_event TVPclArithmetic::TVPclAddPriv (Data5D ret, Data5D d1,
                                            Data5D d2, bool useevent)
    {
    #if defined (TVPCL_DEBUG) || defined (ALL_DEBUG)
      qDebug () << "TVPclArithmetic::TVPclAddPriv" << flush;
    #endif
      /*******************************************************
       * Setup block and grid sizes
       *******************************************************/
      int w, h, len;

      TVPclImageSize (d1, w, h, len);
      int wh = w * h;
      int Bx, By, Bz, Gx, Gy, Gz;
      TVPclBlockGridSize (Bx, By, Bz, Gx, Gy, Gz,
                          w, h, len, CL_NO_CHECK);
      const int dims = 3;
      size_t block[dims] = {Bx, By, Bz};
      size_t grid[dims]  = {Gx, Gy, Gz};

      /*******************************************************
       * get first kernel index of TVPclAddKernel
       *******************************************************/
      int index = TVPclKernelIndex ("TVPclAddKernel");
      switch (d1.datatype)
      {
        case DATA5D_CHAR:          index +=  0; break;
        case DATA5D_UCHAR:         index +=  1; break;
        case DATA5D_SHORT:         index +=  2; break;
        case DATA5D_USHORT:        index +=  3; break;
        case DATA5D_INT:           index +=  4; break;
        case DATA5D_UINT:          index +=  5; break;
        case DATA5D_LONG:          index +=  6; break;
        case DATA5D_ULONG:         index +=  7; break;
        case DATA5D_FLOAT:         index +=  8; break;
        case DATA5D_DOUBLE:        index +=  9; break;
        case DATA5D_COMPLEXFLOAT:  index += 10; break;
        case DATA5D_COMPLEXDOUBLE:

                                   index += 11; break;
        case DATA5D_NOTYPE:
        default:
          error ("TVPclArithmetic", "TVPclAddPriv",
                 QObject::tr ("No type has been set."));
      }

      /*******************************************************
       * Fetch appropriate kernel
       *******************************************************/
      cl_kernel kernel = TVPclKernel (index);

      /*******************************************************
       * Set arguments
       *******************************************************/
      cl_int ok;
      ok = clSetKernelArg (kernel, 0, sizeof (cl_mem), (void *) &(ret.data));
      ok = clSetKernelArg (kernel, 1, sizeof (cl_mem), (void *) &(d1.data));
      ok = clSetKernelArg (kernel, 2, sizeof (cl_mem), (void *) &(d2.data));
      ok = clSetKernelArg (kernel, 3, sizeof (int), (void *) &w);
      ok = clSetKernelArg (kernel, 4, sizeof (int), (void *) &len);
      ok = clSetKernelArg (kernel, 5, sizeof (int), (void *) &wh);
      if (ok != CL_SUCCESS)
        error ("TVPclArithmetic", "TVPclAddPriv",
               QObject::tr ("Unable to set kernel arguments."));

      /*******************************************************
       * Launch kernel
       *******************************************************/
      cl_event event = TVPclLaunchKernel (kernel, dims, block, grid, useevent);
      return event;
    }

The code above fetches a kernel, sets parameters as kernel arguments, and launches the kernel, and appears as easy to use as the CUDA example. In fact it was designed to be very similar to the CUDA code, with the exception that it is not possible to iterate over kernel launches, since the kernel is selected by a given index. The whole loop needs to be performed in the OpenCL code itself, which is found in file TVPclAddKernel.cl:

    __kernel void TVPclAddKernel<DType> (__global DType *ret,
                                         __global DType *d1,
                                         __global DType *d2,
                                         const int width,

                                         const int len,
                                         const int offset)
    {
      int x = get_global_id (0);
      int y = get_global_id (1);
      int i = y * width + x;

      /******************************************************
       * add
       ******************************************************/
      for (int j = 0; j < len; j++)
      {
        ret[i] = d1[i] + d2[i];
        i += offset;
      }
    }

This loop is the only difference compared to the TVPcudaAddKernel3 CUDA routine. It is a structural difference with the CUDA kernels, but it can be alleviated easily, as the code above shows. OpenCL does not support templates by default, hence the code above does not conform to the specifications given in OpenCL version 1.1 or 1.2. Templating is thus a TiViPE extension to OpenCL. TiViPE only allows very strict templating rules: a templated routine name always looks like

    ...ClKernelName<DType>...
    ...ClKernelName<DType,DType1>...
    ...ClKernelName<DType,DType1,DType2>...

and there are no spaces between the routine name, the angle brackets, and the commas. Up to three different templates are supported. In case of one template only DType is used; in case of two templates DType and DType1 are used. Note that the routine name includes the template parameters.

Figure 2 illustrates the normal situation for calling a routine from a library. In the standard case (depicted in blue in the figure) a header file and a library are required. The header file contains a description of the routine that is being called and that is available in the library. For NVIDIA CUDA there is a compiler available which compiles to a library, so the way a routine is called from a library is the same, as illustrated in green. The OpenCL setup, given in yellow, is different because OpenCL code does not compile to a library; instead it is assumed that the OpenCL kernels are represented as text strings, and these kernels are internally compiled before they are launched. The concept therefore is that OpenCL assumes that the source code is available as a text string in a C or C++ program.
We would like to change this to the concept of having a file with OpenCL source code, compiling this to a binary object file (for NVIDIA this "binary" still is text, the so-called PTX code, which in turn is compiled by a just-in-time (JIT) compiler to a real binary), and storing these into a library. There is no way to create a library without any tools, rules, or administration. In order to get the concept working properly we need to make some rules:

1. The code above couples a kernel name TVPclAddKernel to a file /src/cl/TVPclAddKernel.cl.

2. The kernel module is administrated in the file /src/cl/TVPclKernels.txt. This is done as: TVPclAddKernel DT12;

Figure 2: Programmers' infrastructure for calling a routine from a library. In blue the normal situation. In green the situation for NVIDIA CUDA. In yellow the way it is done by TiViPE for OpenCL.

3. Template DT12, which stands for 12 basic datatypes, is coupled to TVPclAddKernel. This template is administrated in /src/cl/TVPclTemplates.txt, where it is given as

    DT12 char 0 uchar 0 short 0 ushort 0 int 0 uint 0
         long 0 ulong 0 float 0 double 1 float2 0 double2 1;

Not all OpenCL hardware devices support double precision floating point arithmetic, and thus for every datatype in the template a flag indicates whether or not the datatype requires double precision floating point arithmetic.

4. For open source development it might be preferred to use the OpenCL source code only and compile the required modules on the fly. For companies like ours it is not always possible to provide source code, so we prefer to be able to use solely binaries too. Binaries are not generic, so for every OpenCL capable platform (that we intend to support) we need to generate binaries. TiViPE compiles all kernels given in TVPclKernels.txt to object files /obj/[device]/TVPclAddKernel[DType][DType1][DType2].o. Here device is the hardware device supported, and DType, DType1, DType2 are the (optional) template datatype substitutions. For instance, the TVPclAdd routine with integer datatype for a GeForce GTX 560 compiles to and uses the binary code obtained from /obj/GeForce GTX 560/TVPclAddKernelint.o.
4 Programming a TiViPE module using OpenCL

Programming a module in TiViPE takes three steps:

1. Registration of the kernel name in file /src/cl/TVPclKernels.txt. This is done by adding a line of text, for instance TVPclMyKernel double 1;, TVPclMyKernel Datatype1 Datatype2;, or the add kernel TVPclAddKernel DT12;. In case the datatypes have not been created yet, they need to be added to /src/cl/TVPclTemplates.txt.

Figure 3: All available OpenCL modules in TiViPE.

2. Write a kernel module /src/cl/TVPclMyKernel.cl.

3. Create the interface routines. The interface routine contains a call to select the appropriate kernel:

    index = TVPclKernelIndex ("TVPclAddKernel");
    kernel = TVPclKernel (index + offset);

Here the offset determines the datatype used; for DT12 it is indexed from 0 to 11. For every kernel the arguments used, that is the input, output, and internal parameters, need to be set on the kernel. As a last step the kernel is launched:

    TVPclLaunchKernel (kernel, dims, block, grid, useevent);

The rest of the OpenCL specific coding and complexity is hidden within the core TiViPE OpenCL routines.

5 OpenCL components

In this version of TiViPE, OpenCL has been incorporated and 23 new modules have been developed. This number will grow in the following versions. The number of available modules for CUDA is currently 68, and it is likely that most of them will be made available for OpenCL as well. The current release involves 5 generic basic modules, 4 IO transfer modules, and 7 computational blocks. The 7 merged modules allow the user to use these computational blocks. So far most effort has gone into an appropriate framework for OpenCL, with the aim to easily embed new OpenCL functionality in the future.

5.1 Device information and selection

Information from all OpenCL devices is obtained by using CLinfo; for a graphical representation see the left side of Figure 3. This module gives, depending on the hardware used, the following output:

    ==============================
    Available kernels on Loveland
      TVPclAddKernel : char uchar short ushort int uint long ulong float float2
      ...
    ==============================
    Available kernels on AMD E-350 Processor
      TVPclAddKernel : char uchar short ushort int uint long ulong float double float2 double2
      ...
    ==============================
    Number of Platforms: 1
    ==============================
    Platform Number:     0
    Platform Name:       AMD Accelerated Parallel Processing
    Platform Profile:    FULL_PROFILE
    Platform Version:    OpenCL 1.1 AMD-APP (851.4)
    Platform Vendor:     Advanced Micro Devices, Inc.
    Platform Extensions: cl_khr_icd cl_amd_event_callback ...
    Number of Devices:   2

    Device Number:       0
    Name:                Loveland
    Type:                CL_DEVICE_TYPE_GPU
    Profile:             FULL_PROFILE
    Version:             OpenCL 1.1 AMD-APP (851.4)
    Vendor:              Advanced Micro Devices, Inc.
    Vendor Id:           4098
    Extensions:          cl_khr_global_int32_base_atomics
                         cl_khr_global_int32_extended_atomics ...
    Driver Version:      CAL (VM)
    Address Bits:        32
    Available:           Yes
    Compiler Available:  Yes
    Double Precision Floating Point Capability
      * Denorms:                   No
      * Quiet NaNs:                No
      * Round to nearest even:     No
      * Round to zero:             No
      * Round to +ve and infinity: No
      * IEEE fused multiply-add:   No

    Device Number:       1
    Name:                AMD E-350 Processor
    Type:                CL_DEVICE_TYPE_CPU
    Profile:             FULL_PROFILE
    Version:             OpenCL 1.1 AMD-APP (851.4)

5.2 Data transportation

The basic components for data handling on OpenCL devices are data allocation and freeing, data transfer on the device, and transfer from device to CPU and vice versa. The available modules are illustrated in the second column of Figure 3.

5.3 Computational modules

The available computational modules, illustrated in the rightmost two columns of Figure 3, are divided into arithmetic (add, sum, matrix vector multiplication, and 2 polynomial functions) and filters (clipping and Sobel).

References

[1] T. Lourens. TiViPE - Tino's Visual Programming Environment. In Proceedings of the 28th Annual International Computer Software & Applications Conference (COMPSAC 2004), IEEE, pages 10-15, 2004.

[2] T. Lourens. The art of parallel processing using general purpose graphical processing units: hardware, CUDA introduction, and software architecture. Technical report, TiViPE, 2009.


More information

Using SYCL as an Implementation Framework for HPX.Compute

Using SYCL as an Implementation Framework for HPX.Compute Using SYCL as an Implementation Framework for HPX.Compute Marcin Copik 1 Hartmut Kaiser 2 1 RWTH Aachen University mcopik@gmail.com 2 Louisiana State University Center for Computation and Technology The

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) General-purpose computing on graphics processing units (GPGPU) Thomas Ægidiussen Jensen Henrik Anker Rasmussen François Rosé November 1, 2010 Table of Contents Introduction CUDA CUDA Programming Kernels

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

NVIDIA GPU CODING & COMPUTING

NVIDIA GPU CODING & COMPUTING NVIDIA GPU CODING & COMPUTING WHY GPU S? ARCHITECTURE & PROGRAM MODEL CPU v. GPU Multiprocessor Model Memory Model Memory Model: Thread Level Programing Model: Logical Mapping of Threads Programing Model:

More information

Module 3: CUDA Execution Model -I. Objective

Module 3: CUDA Execution Model -I. Objective ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks. Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)

More information

High Performance Linear Algebra on Data Parallel Co-Processors I

High Performance Linear Algebra on Data Parallel Co-Processors I 926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)

Lecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I) Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

The Art of Parallel Processing using General Purpose Graphical Processing units Hardware, CUDA introduction, and Software architecture

The Art of Parallel Processing using General Purpose Graphical Processing units Hardware, CUDA introduction, and Software architecture TiViPE Visual Programming The Art of Parallel Processing using General Purpose Graphical Processing units Hardware, CUDA introduction, and Software architecture Technical Report Copyright c TiViPE 2009.

More information

Lecture 10!! Introduction to CUDA!

Lecture 10!! Introduction to CUDA! 1(50) Lecture 10 Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY 1(50) Laborations Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL CS/EE 217 GPU Architecture and Parallel Programming Lecture 22: Introduction to OpenCL Objective To Understand the OpenCL programming model basic concepts and data types OpenCL application programming

More information

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 29 GPU Programming II Rutgers University Review: Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing

More information

Image convolution with CUDA

Image convolution with CUDA Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany

More information

How to use OpenMP within TiViPE

How to use OpenMP within TiViPE TiViPE Visual Programming How to use OpenMP within TiViPE Technical Report: Version 1.0.0 Copyright c TiViPE 2011. All rights reserved. Tino Lourens TiViPE Kanaaldijk ZW 11 5706 LD Helmond The Netherlands

More information

Review. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API

Review. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

NVIDIA CUDA Compute Unified Device Architecture

NVIDIA CUDA Compute Unified Device Architecture NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0 6/7/2008 ii CUDA Programming Guide Version 2.0 Table of Contents Chapter 1. Introduction...1 1.1 CUDA: A Scalable Parallel

More information

Copyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL

Copyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL OpenCL OpenCL What is OpenCL? Ø Cross-platform parallel computing API and C-like language for heterogeneous computing devices Ø Code is portable across various target devices: Ø Correctness is guaranteed

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of

More information

CUDA Basics. July 6, 2016

CUDA Basics. July 6, 2016 Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching

More information

GPU Computing: A Quick Start

GPU Computing: A Quick Start GPU Computing: A Quick Start Orest Shardt Department of Chemical and Materials Engineering University of Alberta August 25, 2011 Session Goals Get you started with highly parallel LBM Take a practical

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Introduction to GPGPUs and to CUDA programming model

Introduction to GPGPUs and to CUDA programming model Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

More information

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Is used to write kernels when working with OpenCL Used to code the part that runs on the device Based on C99 with some extensions

More information

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different

More information

simcuda: A C++ based CUDA Simulation Framework

simcuda: A C++ based CUDA Simulation Framework Technical Report simcuda: A C++ based CUDA Simulation Framework Abhishek Das and Andreas Gerstlauer UT-CERC-16-01 May 20, 2016 Computer Engineering Research Center Department of Electrical & Computer Engineering

More information

Module 2: Introduction to CUDA C

Module 2: Introduction to CUDA C ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Overview: Graphics Processing Units

Overview: Graphics Processing Units advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

Lecture Topic: An Overview of OpenCL on Xeon Phi

Lecture Topic: An Overview of OpenCL on Xeon Phi C-DAC Four Days Technology Workshop ON Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels hypack-2013 (Mode-4 : GPUs) Lecture Topic: on Xeon Phi Venue

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics 1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations

More information

Lab 1 Part 1: Introduction to CUDA

Lab 1 Part 1: Introduction to CUDA Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using

More information

Module 2: Introduction to CUDA C. Objective

Module 2: Introduction to CUDA C. Objective ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline

More information

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing

More information