OpenCL parallel Processing using General Purpose Graphical Processing units TiViPE software development
TiViPE Visual Programming

OpenCL parallel Processing using General Purpose Graphical Processing units
TiViPE software development

Technical Report
Copyright © TiViPE. All rights reserved.

Tino Lourens
TiViPE
Kanaaldijk ZW
LD Helmond
The Netherlands
tino@tivipe.com

November 6, 2012
Contents

1 Introduction
2 GPUadd using CUDA
  2.1 GPUadd implementation
3 CLadd using OpenCL
4 Programming a TiViPE module using OpenCL
5 OpenCL components
  5.1 Device information and selection
  5.2 Data transportation
  5.3 Computational modules
Abstract

The aim of this report is to elaborate on TiViPE modules that make use of Open Computing Language (OpenCL) programming. OpenCL is available in TiViPE from this version on.
Figure 1: OpenCL example: addcl is a merged module with two inputs and one output. On the right the atomic modules are given in green.

1 Introduction

The aim of TiViPE is to integrate different technologies in a seamless way using graphical icons [1]. Due to these icons the user does not need to have in-depth knowledge of the underlying hardware architecture [2]. In this report the focus is on how to integrate the Open Computing Language (OpenCL) into TiViPE. OpenCL is an open standard for parallel programming of heterogeneous systems; it was initiated by Apple and is coordinated by the Khronos group. The aim of OpenCL is to support parallel programming on many different computational devices, including Central Processing Units (CPUs) and Graphical Processing Units (GPUs). The terminology of CPU and GPU is becoming somewhat outdated, because GPUs can be used as generic processing units, also known as GP-GPUs. This is the reason that AMD has redefined its processing units as Accelerated Processing Units (APUs), since both CPU and GPU can be used in the same way when using the OpenCL SDK (Software Development Kit). Objectively speaking, OpenCL benefits most if the hardware is massively parallel, and for the moment this implies that GPU processing units are the most suitable. The NVIDIA CUDA (Compute Unified Device Architecture) SDK is a competitor of OpenCL. The only drawback of CUDA compared to OpenCL is that only NVIDIA (GPU) hardware is supported. OpenCL in turn has a number of restrictions compared to CUDA, and TiViPE has put in an effort to bridge the gap between CUDA and OpenCL. Our aim is to make OpenCL and CUDA programming very similar, and to make modular integration in TiViPE the same. In this document we make a step by step comparison between the two technologies using a simple example.

2 GPUadd using CUDA

The addcl TiViPE module illustrated in Figure 1 performs an addition for every element.
In formula:

  d_i = a_i + b_i, for all i    (1)

A very simple module to program, but several steps need to be taken to integrate such a module into an environment like TiViPE. Figure 1 already illustrates that memory allocation and transfer need to be taken into consideration. The reason to do so is that
1. allocation of memory on a GPU takes time,
2. freeing memory takes time as well,
3. transfer from CPU to GPU memory is time consuming,
4. and vice versa, so is the transfer from GPU to CPU memory.

In the design of an efficient algorithm it is therefore important to allocate memory in the initialization phase and use the allocated memory during the iterative process. When the process has finished, memory is freed. Modules in TiViPE need to be constructed and merged into a single module where IO to CPU memory is included, as illustrated in Figure 1. The construction of

  e_i = a_i + b_i + c_i, for all i    (2)

could be programmed from scratch, but in this case it is constructed using the basic building block twice:

  d_i = a_i + b_i, for all i    (3)
  e_i = d_i + c_i, for all i    (4)

The CLinit module initializes all available modules, for all detected OpenCL capable devices. For instance, an ASUS EEE notebook PC can contain 2 different OpenCL devices: an AMD E-350 Processor, a dual core E-350 CPU, and Loveland, which is a Radeon 6310 GPU. On a desktop this can be a single NVIDIA GPU device like a GeForce GTX 560. The CLinit module uses a platform number and a device number to give the user control over these devices. When executed, the module compiles all modules to binaries when they do not yet exist for a given device. On some occasions this is not necessary, because the binaries used for different devices are the same. The module for this reason has 2 additional parameters to map devices by providing string commands. For instance, when the devices are GeForce GTX 580 and GeForce GTX 570, they can be mapped to the GTX 560 binaries by providing GeForce GTX 560,GeForce GTX 560 as mapped devices. The CLinit module is required as the first module of every TiViPE network that uses OpenCL. Likewise, CLresetDevices is required as the last module of every TiViPE network. The CLadd module performs the addition described by (1) using the generic Data5D datatype.
The inputs and outputs are made yellow to indicate that memory resides on an OpenCL device rather than in CPU memory. This holds for instance for a GPU device with its own GPU memory. The module has three inputs: the first input is used to pre-allocate output memory, while the contents of the second and third input are used to perform the addition itself. This construction avoids the overhead of memory allocation and freeing in the CLadd module itself. For the three inputs memory is allocated the first time only. The second and third input copy data from CPU memory to the OpenCL device. The output in turn is copied from device memory to CPU memory.

2.1 GPUadd implementation

Figure 1 illustrates a TiViPE network using OpenCL commands. The green CLadd module in this figure is in essence the same as the GPUadd module. The code information file GPUadd.ci in TiViPE reveals that the following interface parameters have been used:

Input("Input pre-allocated output data", "-preo", DATA5D, DATA5D_ANYTYPE, IGNORED) [1];
Input("Input 1", "-input1", DATA5D, DATA5D_ANYTYPE, IGNORED) [2];
Input("Input 2", "-input2", DATA5D, DATA5D_ANYTYPE, IGNORED) [3];
Output("Output", "-output", DATA5D, DATA5D_ANYTYPE, IGNORED) [F];

The term IGNORED denotes that memory does not reside on the CPU (but on the GPU). The class and routine given by the code are TVPgpuArithmetic and TVPgpuAdd. This routine is part of the library TVPgpu that is found in the $(TVPHOME)/lib directory. The source code, given by the files TVPgpuarithmetic.h and TVPgpuarithmetic.cpp, is found in directory $(TVPHOME)/Libraries/TVPgpu/src. In the header file the routine is given by

Data5D TVPgpuAdd (Data5D gpu_ret, Data5D gpu_d1, Data5D gpu_d2);

and we find the routine call itself back in file TVPgpuarithmetic.cpp. The routine checks if the parameters are set properly; if so, it performs the following:

TVPcudaAdd3 (gpu_ret, gpu_d1, gpu_d2);
return gpu_ret;

The TVPcudaAdd3 routine is found in the TVPcuda library. In the file TVPcudaarithmetic.h it is found as

extern "C" TVPCUDA_EXPORT void TVPcudaAdd3 (Data5D ret, Data5D d1, Data5D d2);

The CUDA code itself is not revealed by TiViPE, but it is implemented as follows:

void TVPcudaAdd3 (Data5D ret, Data5D d1, Data5D d2)
{
#if defined (TVPCUDA_DEBUG) || defined (ALL_DEBUG)
  printf ("TVPcudaAdd3\n");
#endif
  /*******************************************************
   * Setup block and grid sizes
   *******************************************************/
  int w, h, len;
  TVPcudaImageSize (d1, w, h, len);
  int wh = w * h;
  int Bx, By, Bz, Gx, Gy, Gz;
  TVPcudaBlockGridSize (Bx, By, Bz, Gx, Gy, Gz, w, h, len, CUDA_NO_CHECK);
  dim3 block (Bx, By, Bz);
  dim3 grid (Gx, Gy, Gz);

  /*******************************************************
   * Add
   * indata and outdata are coupled to d2 and d1,
   * respectively.
   *******************************************************/
  switch (d1.datatype)
  {
    case DATA5D_FLOAT:
    {
      float *indata1  = (float *) d1.data;
      float *indata2  = (float *) d2.data;
      float *lindata1 = indata1;
      float *lindata2 = indata2;
      float *outdata  = (float *) ret.data;
      float *loutdata = outdata;
      /*************************************************
       * iterate over the different images
       *************************************************/
      for (int i = 0; i < len; i++)
      {
        TVPcudaAddKernel3<float><<<grid, block>>> (loutdata, lindata1, lindata2, w);
        lindata1 += wh;
        lindata2 += wh;
        loutdata += wh;
      }
      break;
    }
    case DATA5D_CHAR:
      /*
       * Copy all data from the case above
       * Replace all float by char
       */
    case DATA5D_UCHAR:
    case DATA5D_SHORT:
    case DATA5D_USHORT:
    case DATA5D_INT:
    case DATA5D_UINT:
    case DATA5D_LONG:
    case DATA5D_ULONG:
    case DATA5D_DOUBLE:
    case DATA5D_COMPLEXFLOAT:
    case DATA5D_COMPLEXDOUBLE:
      printf ("Float datatype is implemented only.\n");
      break;
    case DATA5D_NOTYPE:
      break;
  }
}
The TVPcudaAddKernel3 routine is the so-called CUDA kernel, which is implemented as follows:

template <typename DType>
__global__ void TVPcudaAddKernel3 (DType *ret, DType *d1, DType *d2, int width)
{
  /******************************************************
   * Add
   ******************************************************/
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  int i = y * width + x;
  ret[i] = (DType) (d1[i] + d2[i]);
}

The code above is compiled and made available in the TVPcuda library.

3 CLadd using OpenCL

Let us try to achieve the same in OpenCL. In the header TVPclArithmetic.h the routine name has changed from TVPgpuAdd to TVPclAdd:

Data5D TVPclAdd (Data5D cl_ret, Data5D cl_d1, Data5D cl_d2);

but the parameters are the same. The implementation of this routine, given in file TVPclarithmetic.cpp, is also very similar to the CUDA version:

TVPclAddPriv (cl_ret, cl_d1, cl_d2);
return cl_ret;

The TVPclAddPriv routine is found in the same file and not in a separate library. The routine is implemented as follows:

cl_event TVPclArithmetic::TVPclAddPriv (Data5D ret, Data5D d1, Data5D d2, bool useevent)
{
#if defined (TVPCL_DEBUG) || defined (ALL_DEBUG)
  qDebug () << "TVPclArithmetic::TVPclAddPriv" << flush;
#endif
  /*******************************************************
   * Setup block and grid sizes
   *******************************************************/
  int w, h, len;
  TVPclImageSize (d1, w, h, len);
  int wh = w * h;
  int Bx, By, Bz, Gx, Gy, Gz;
  TVPclBlockGridSize (Bx, By, Bz, Gx, Gy, Gz, w, h, len, CL_NO_CHECK);
  const int dims = 3;
  size_t block[dims] = {Bx, By, Bz};
  size_t grid[dims]  = {Gx, Gy, Gz};

  /*******************************************************
   * get first kernel index of TVPclAddKernel
   *******************************************************/
  int index = TVPclKernelIndex ("TVPclAddKernel");
  switch (d1.datatype)
  {
    case DATA5D_CHAR:          index += 0;  break;
    case DATA5D_UCHAR:         index += 1;  break;
    case DATA5D_SHORT:         index += 2;  break;
    case DATA5D_USHORT:        index += 3;  break;
    case DATA5D_INT:           index += 4;  break;
    case DATA5D_UINT:          index += 5;  break;
    case DATA5D_LONG:          index += 6;  break;
    case DATA5D_ULONG:         index += 7;  break;
    case DATA5D_FLOAT:         index += 8;  break;
    case DATA5D_DOUBLE:        index += 9;  break;
    case DATA5D_COMPLEXFLOAT:  index += 10; break;
    case DATA5D_COMPLEXDOUBLE: index += 11; break;
    case DATA5D_NOTYPE:
    default:
      error ("TVPclArithmetic", "TVPclAddPriv",
             QObject::tr ("No type has been set."));
  }

  /*******************************************************
   * Fetch appropriate kernel
   *******************************************************/
  cl_kernel kernel = TVPclKernel (index);

  /*******************************************************
   * Set arguments
   *******************************************************/
  cl_int ok;
  ok  = clSetKernelArg (kernel, 0, sizeof (cl_mem), (void*) &(ret.data));
  ok |= clSetKernelArg (kernel, 1, sizeof (cl_mem), (void*) &(d1.data));
  ok |= clSetKernelArg (kernel, 2, sizeof (cl_mem), (void*) &(d2.data));
  ok |= clSetKernelArg (kernel, 3, sizeof (int), (void*) &w);
  ok |= clSetKernelArg (kernel, 4, sizeof (int), (void*) &len);
  ok |= clSetKernelArg (kernel, 5, sizeof (int), (void*) &wh);
  if (ok != CL_SUCCESS)
    error ("TVPclArithmetic", "TVPclAddPriv",
           QObject::tr ("Unable to set kernel arguments."));

  /*******************************************************
   * Launch kernel
   *******************************************************/
  cl_event event = TVPclLaunchKernel (kernel, dims, block, grid, useevent);
  return event;
}

The code above fetches a kernel, sets parameters as kernel arguments, and launches the kernel; it appears as easy to use as the CUDA example. In fact it was aimed to be very similar to the CUDA code, with the exception that it is not possible to iterate over the kernel, since the kernel is given by an index. The whole loop needs to be performed in the OpenCL code instead, which is found in the file TVPclAddKernel.cl:

__kernel void TVPclAddKernel<DType> (
    __global DType *ret,
    __global DType *d1,
    __global DType *d2,
    const int width,
    const int len,
    const int offset)
{
  int x = get_global_id (0);
  int y = get_global_id (1);
  int i = y * width + x;
  /******************************************************
   * add
   ******************************************************/
  for (int j = 0; j < len; j++)
  {
    ret[i] = d1[i] + d2[i];
    i += offset;
  }
}

This loop is the only difference compared to the TVPcudaAddKernel3 CUDA routine. It is a structural difference with the CUDA kernels, but it can be alleviated easily as given in the code above. OpenCL does not support templates by default; hence the code above is not according to the specifications given in OpenCL version 1.1 or 1.2. Templating is thus a TiViPE extension to OpenCL. TiViPE only allows very strict templating rules. That is, a templated routine name always looks like

...ClKernelName<DType>...
...ClKernelName<DType,DType1>...
...ClKernelName<DType,DType1,DType2>...

and there are no spaces between routine name and commas. Up to three different templates are supported. In case of one template only DType is used; in case of two templates DType and DType1 are used. Note that the routine name includes the templates.

Figure 2 illustrates the normal situation for using a routine call from a library. In the standard case (depicted in blue in the figure) a header file and a library are required. The header file contains a description of the routine that is being called, which is available in the library. For NVIDIA CUDA there is a compiler available which compiles to a library, and the way a routine is called from a library is thus the same, as illustrated in green. The OpenCL setup, given in yellow, is different, because OpenCL code does not compile to a library; instead it is assumed that there are OpenCL kernels represented as text strings, and these kernels are internally compiled before they are launched. The concept therefore is that OpenCL assumes that the source code is available as a text string in a C or C++ program.
We would like to change this to the concept of having a file with OpenCL source code, compiling this to a binary object file (for NVIDIA this binary is still text, the so-called ptx code, which a just in time (JIT) compiler compiles to a real binary), and storing these into a library. There is no way to create a library without any tools, rules, or administration. In order to get the concept working properly we need to make some rules:

1. The code above couples a kernel name TVPclAddKernel to a file /src/cl/tvpcladdkernel.cl.
2. The kernel module is administrated in the file /src/cl/tvpclkernels.txt. This is done as: TVPclAddKernel DT12;
Figure 2: Programmer's infrastructure for calling a routine from a library. In blue the normal situation. In green the situation for NVIDIA CUDA. In yellow the way it is done by TiViPE for OpenCL.

3. Template DT12, which stands for 12 basic datatypes, is coupled to TVPclAddKernel, and this template is administrated in /src/cl/tvpcltemplates.txt. In this file the template is given as

DT12 char 0 uchar 0 short 0 ushort 0 int 0 uint 0 long 0 ulong 0 float 0 double 1 float2 0 double2 1;

Not all OpenCL hardware devices support double precision floating point arithmetic, and thus for every datatype in the template a flag indicates whether or not the datatype requires double precision floating point arithmetic.

4. For open source development it might be preferred to use the OpenCL source code only and compile the required modules on the fly. For companies like ours it is not always possible to provide source code only; we would prefer to be able to use solely binaries too. Binaries are not generic, so for every OpenCL capable platform (that we intend to support) binaries need to be generated. TiViPE compiles all kernels given in TVPclKernels.txt to /obj/[device]/TVPclAddKernel[DType][DType1][DType2].o object files. Here device is the hardware device supported, and DType, DType1, DType2 are the (optional) template datatype substitutions. For instance, the TVPclAdd routine with integer datatype for a GeForce GTX 560 compiles to and uses the binary code obtained from /obj/GeForce GTX 560/TVPclAddKernelint.o.
4 Programming a TiViPE module using OpenCL

Programming a module in TiViPE takes 3 steps:

1. Registration of the kernel name in file /src/cl/tvpclkernels.txt. This is done by adding a line of text, for instance TVPclMyKernelDouble 1;, TVPclMyKernel Datatype1 Datatype2;, or the add kernel TVPclAddKernel DT12;. In case the datatypes have not been created yet, they need to be added to /src/cl/tvpcltemplates.txt.
Figure 3: All available OpenCL modules in this TiViPE version.

2. Write a kernel module /src/cl/tvpclmykernel.cl.

3. Create the interface routines. The interface routine contains a routine to select the appropriate kernel:

index = TVPclKernelIndex ("TVPclAddKernel");
kernel = TVPclKernel (index + offset);

Here the offset determines the datatype used; for DT12 it is indexed from 0 to 11. For every kernel the arguments used, that is the input, output, and internal parameters, need to be set on the kernel. As a last step the kernel is launched:

TVPclLaunchKernel (kernel, dims, block, grid, useevent);

The rest of the OpenCL specific coding and complexity is hidden within the core TiViPE OpenCL routines.

5 OpenCL components

In this TiViPE version OpenCL has been incorporated and 23 new modules have been developed. This number will grow in the following versions. The number of available modules for CUDA is currently 68, and it is likely that most of them will be made available for OpenCL as well. The current release involves 5 generic basic modules, 4 IO transfer modules, and 7 computational blocks. The 7 merged modules allow the user to use these computational blocks. So far most effort has gone into an appropriate framework for OpenCL, with the aim to easily embed new OpenCL functionality in the future.

5.1 Device information and selection

Information on all OpenCL devices is obtained by using CLinfo; for a graphical representation see the left side of Figure 3. This module gives, depending on the hardware used, the following output:
==============================
Available kernels on Loveland
  TVPclAddKernel : char uchar short ushort int uint long ulong float float2
  ...
==============================
Available kernels on AMD E-350 Processor
  TVPclAddKernel : char uchar short ushort int uint long ulong float double float2 double2
  ...
==============================
Number of Platforms: 1
==============================
Platform Number:     0
Platform Name:       AMD Accelerated Parallel Processing
Platform Profile:    FULL_PROFILE
Platform Version:    OpenCL 1.1 AMD-APP (851.4)
Platform Vendor:     Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback ...
Number of Devices:   2

Device Number: 0
  Name:               Loveland
  Type:               CL_DEVICE_TYPE_GPU
  Profile:            FULL_PROFILE
  Version:            OpenCL 1.1 AMD-APP (851.4)
  Vendor:             Advanced Micro Devices, Inc.
  Vendor Id:          4098
  Extensions:         cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics ...
  Driver Version:     CAL (VM)
  Address Bits:       32
  Available:          Yes
  Compiler Available: Yes
  Double Precision Floating Point Capability
    * Denorms:                   No
    * Quiet NaNs:                No
    * Round to nearest even:     No
    * Round to zero:             No
    * Round to +ve and infinity: No
    * IEEE fused multiply-add:   No

Device Number: 1
  Name:    AMD E-350 Processor
  Type:    CL_DEVICE_TYPE_CPU
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 AMD-APP (851.4)
5.2 Data transportation

The basic components for data handling on OpenCL devices are data allocation and freeing, data transfer on the device, and transfer from device to CPU and vice versa. The available modules are illustrated in the second column of Figure 3.

5.3 Computational modules

The available computational modules, illustrated in the rightmost 2 columns of Figure 3, are divided into arithmetic (add, sum, matrix vector multiplication, and 2 polynomial functions) and filters (clipping and Sobel).

References

[1] T. Lourens. TiViPE, Tino's visual programming environment. In The 28th Annual International Computer Software & Applications Conference, IEEE COMPSAC 2004, pages 10-15, 2004.

[2] T. Lourens. The art of parallel processing using general purpose graphical processing units: hardware, CUDA introduction, and software architecture. Technical report, TiViPE, 2009.
More informationGraph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.
Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationRegister file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.
Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)
More informationHigh Performance Linear Algebra on Data Parallel Co-Processors I
926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationLecture 9. Outline. CUDA : a General-Purpose Parallel Computing Architecture. CUDA Device and Threads CUDA. CUDA Architecture CUDA (I)
Lecture 9 CUDA CUDA (I) Compute Unified Device Architecture 1 2 Outline CUDA Architecture CUDA Architecture CUDA programming model CUDA-C 3 4 CUDA : a General-Purpose Parallel Computing Architecture CUDA
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationScientific discovery, analysis and prediction made possible through high performance computing.
Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013
More informationThe Art of Parallel Processing using General Purpose Graphical Processing units Hardware, CUDA introduction, and Software architecture
TiViPE Visual Programming The Art of Parallel Processing using General Purpose Graphical Processing units Hardware, CUDA introduction, and Software architecture Technical Report Copyright c TiViPE 2009.
More informationLecture 10!! Introduction to CUDA!
1(50) Lecture 10 Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY 1(50) Laborations Some revisions may happen while making final adjustments for Linux Mint. Last minute changes may occur.
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationCS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL
CS/EE 217 GPU Architecture and Parallel Programming Lecture 22: Introduction to OpenCL Objective To Understand the OpenCL programming model basic concepts and data types OpenCL application programming
More informationCSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA
CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model
More informationCS516 Programming Languages and Compilers II
CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 29 GPU Programming II Rutgers University Review: Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationHow to use OpenMP within TiViPE
TiViPE Visual Programming How to use OpenMP within TiViPE Technical Report: Version 1.0.0 Copyright c TiViPE 2011. All rights reserved. Tino Lourens TiViPE Kanaaldijk ZW 11 5706 LD Helmond The Netherlands
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationNVIDIA CUDA Compute Unified Device Architecture
NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0 6/7/2008 ii CUDA Programming Guide Version 2.0 Table of Contents Chapter 1. Introduction...1 1.1 CUDA: A Scalable Parallel
More informationCopyright 2013 by Yong Cao, Referencing UIUC ECE408/498AL Course Notes. OpenCL. OpenCL
OpenCL OpenCL What is OpenCL? Ø Cross-platform parallel computing API and C-like language for heterogeneous computing devices Ø Code is portable across various target devices: Ø Correctness is guaranteed
More informationBasic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Basic Elements of CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of
More informationCUDA Basics. July 6, 2016
Mitglied der Helmholtz-Gemeinschaft CUDA Basics July 6, 2016 CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching
More informationGPU Computing: A Quick Start
GPU Computing: A Quick Start Orest Shardt Department of Chemical and Materials Engineering University of Alberta August 25, 2011 Session Goals Get you started with highly parallel LBM Take a practical
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationOpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania
OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationBasic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationOpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR
OpenCL C Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Is used to write kernels when working with OpenCL Used to code the part that runs on the device Based on C99 with some extensions
More informationAutomatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo
Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015 Goals Running CUDA code on CPUs. Why? Performance portability! A major challenge faced
More informationHigh-Performance Computing Using GPUs
High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy
More informationINTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD
INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different
More informationsimcuda: A C++ based CUDA Simulation Framework
Technical Report simcuda: A C++ based CUDA Simulation Framework Abhishek Das and Andreas Gerstlauer UT-CERC-16-01 May 20, 2016 Computer Engineering Research Center Department of Electrical & Computer Engineering
More informationModule 2: Introduction to CUDA C
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationOverview: Graphics Processing Units
advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationCUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list
CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationLecture Topic: An Overview of OpenCL on Xeon Phi
C-DAC Four Days Technology Workshop ON Hybrid Computing Coprocessors/Accelerators Power-Aware Computing Performance of Applications Kernels hypack-2013 (Mode-4 : GPUs) Lecture Topic: on Xeon Phi Venue
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationMassively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN
Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce
More informationGPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics
1 GPU Computing Workshop CSU 2013 Getting Started Garland Durham Quantos Analytics nvidia-smi 2 At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query -L list attached
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationGPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More information