Programming with CUDA

Jens K. Mueller
Department of Mathematics and Computer Science, Friedrich-Schiller-University Jena
Tuesday 19th April, 2011

Today's lecture: Synchronization and Texture Memory

Synchronization within Blocks

- Threads within a block can be synchronized using __syncthreads().
- It acts as a barrier: all threads within the block have to reach this barrier before any thread can proceed.
- It is used to avoid data hazards for memory accesses, for example one thread reading a shared memory location before another thread has written it.
- In conditional code it is only allowed if the condition evaluates identically for the entire thread block.
- It is expected to be lightweight, but it can degrade device utilization.
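A minimal sketch of the barrier, assuming a single block of 256 threads that reverses a tile through shared memory. The names `reverseTile`, `reverseOnDevice`, and `BLOCK_SIZE` are illustrative and not from the lecture.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // illustrative block size (assumption)

// Reverses one block-sized tile of `data` in place via shared memory.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread writes one element
    __syncthreads();                     // barrier: all writes to tile are complete
    data[t] = tile[BLOCK_SIZE - 1 - t];  // now safe to read another thread's element
}

// Hypothetical host helper: reverses BLOCK_SIZE ints on the device.
// Returns true if every CUDA call succeeded.
bool reverseOnDevice(int *host)
{
    int *dev = NULL;
    size_t bytes = BLOCK_SIZE * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseTile<<<1, BLOCK_SIZE>>>(dev);
        // The blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Without the barrier, a thread could read `tile[BLOCK_SIZE - 1 - t]` before the thread responsible for that element has written it.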

Synchronization within Blocks (cont.)

Additional functions for compute capability 2.x:

- int __syncthreads_count(int predicate): same as __syncthreads(), but also evaluates predicate for all threads within the block and returns the number of threads for which it evaluates to non-zero.
- int __syncthreads_and(int predicate): same as __syncthreads(), but also evaluates predicate for all threads within the block and returns non-zero iff it evaluates to non-zero for all threads within the block.
- int __syncthreads_or(int predicate): same as __syncthreads(), but also evaluates predicate for all threads within the block and returns non-zero iff it evaluates to non-zero for any thread within the block.
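As a hedged illustration of `__syncthreads_count`, the sketch below counts the positive elements handled by each block. The kernel name `countPositive`, the helper `countPositiveOnDevice`, and the block size of 128 are assumptions made for the example; it requires a device of compute capability 2.0 or higher.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Writes, for each block, how many of its elements are positive.
__global__ void countPositive(const int *in, int *perBlock, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int predicate = (i < n) && (in[i] > 0);
    // Every thread of the block must reach this call, including those with i >= n.
    int count = __syncthreads_count(predicate);
    if (threadIdx.x == 0) perBlock[blockIdx.x] = count;
}

// Hypothetical host helper: returns the number of positive values, or -1 on error.
// Assumes n > 0.
int countPositiveOnDevice(const int *host, int n)
{
    const int threads = 128;
    int blocks = (n + threads - 1) / threads;
    int *dIn = NULL, *dCounts = NULL;
    int *hCounts = (int *)malloc(blocks * sizeof(int));
    int total = -1;

    if (hCounts != NULL &&
        cudaMalloc((void **)&dIn, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc((void **)&dCounts, blocks * sizeof(int)) == cudaSuccess &&
        cudaMemcpy(dIn, host, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess) {
        countPositive<<<blocks, threads>>>(dIn, dCounts, n);
        if (cudaMemcpy(hCounts, dCounts, blocks * sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            total = 0;
            for (int b = 0; b < blocks; ++b) total += hCounts[b];  // sum per-block counts
        }
    }
    cudaFree(dIn);
    cudaFree(dCounts);
    free(hCounts);
    return total;
}
```

Note that the bounds check is folded into the predicate rather than placed around the call, so the barrier is reached by the whole block.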

Synchronization for Memory Access

- __threadfence_block(): the calling thread waits until all global/shared memory accesses it made before the call are visible to all threads within the block.
- __threadfence(): the calling thread waits until all its prior shared memory accesses are visible to all threads within the block and all its prior global memory accesses are visible to all threads in the device.
- __threadfence_system() (2.x only): the calling thread waits until its prior
  - shared memory accesses are visible to the thread block,
  - global memory accesses are visible to all threads within the device,
  - page-locked host memory accesses are visible to host threads.
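One common use of `__threadfence()` is a single-kernel reduction in which the block that finishes last combines the partial results of all blocks. The sketch below is a simplified version of that pattern, assuming only thread 0 of each block does work. The names `blocksDone`, `sumWithFence`, and `sumOnDevice` are illustrative.

```cuda
#include <cuda_runtime.h>

// Number of blocks that have published their partial sum so far.
__device__ unsigned int blocksDone = 0;

__global__ void sumWithFence(const float *in, int n,
                             volatile float *partial, float *total)
{
    if (threadIdx.x != 0) return;  // simplification: one worker thread per block

    int blocks = (int)gridDim.x;
    int chunk = (n + blocks - 1) / blocks;
    int begin = (int)blockIdx.x * chunk;
    int end = min(begin + chunk, n);

    float s = 0.0f;
    for (int i = begin; i < end; ++i) s += in[i];
    partial[blockIdx.x] = s;

    // Make the partial sum visible device-wide before announcing completion.
    __threadfence();

    // atomicInc returns the old value; the last block to arrive sees blocks - 1.
    unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
    if (ticket == gridDim.x - 1) {
        float t = 0.0f;
        for (int b = 0; b < blocks; ++b) t += partial[b];
        *total = t;
        blocksDone = 0;  // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: sums n floats on the device using 8 blocks.
bool sumOnDevice(const float *host, int n, float *result)
{
    const int blocks = 8;
    float *dIn = NULL, *dPartial = NULL, *dTotal = NULL;
    bool ok =
        cudaMalloc((void **)&dIn, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&dPartial, blocks * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&dTotal, sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dIn, host, n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sumWithFence<<<blocks, 32>>>(dIn, n, dPartial, dTotal);
        ok = cudaMemcpy(result, dTotal, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return ok;
}
```

The fence orders the write to `partial` before the atomic increment, so the last block cannot observe the counter update without also observing every partial sum. `partial` is declared `volatile` so that the last block reads it from memory rather than from a stale cached copy.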

Atomic Operations

- Make a read-modify-write on global/shared memory an atomic operation.
- Atomic: no other thread can interfere with this operation.
- Available since compute capability 1.1.
- Since 1.2 also for shared memory and for 64-bit words in global memory.
- 2.x: 64-bit words in shared memory.
- Not atomic on page-locked memory as seen by the host thread or other devices.
- Mainly signed/unsigned integer operations are supported.

Atomic Operations (cont.)

Atomic functions:

- Arithmetic: atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, and atomicCAS
- Bitwise: atomicAnd, atomicOr, and atomicXor
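A histogram is a typical case where many threads update the same global memory locations. The following sketch uses `atomicAdd` for that; the bin count `NUM_BINS` and the names `histogram` and `histogramOnDevice` are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 16  // illustrative number of bins (assumption)

// Each thread classifies one input byte and increments the matching bin.
__global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Several threads may hit the same bin, so the increment must be atomic.
        atomicAdd(&bins[in[i] % NUM_BINS], 1u);
    }
}

// Hypothetical host helper: fills bins[0..NUM_BINS-1] with the counts.
bool histogramOnDevice(const unsigned char *host, int n, unsigned int *bins)
{
    const int threads = 256;
    unsigned char *dIn = NULL;
    unsigned int *dBins = NULL;
    size_t binBytes = NUM_BINS * sizeof(unsigned int);
    bool ok =
        cudaMalloc((void **)&dIn, n) == cudaSuccess &&
        cudaMalloc((void **)&dBins, binBytes) == cudaSuccess &&
        cudaMemcpy(dIn, host, n, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(dBins, 0, binBytes) == cudaSuccess;
    if (ok) {
        histogram<<<(n + threads - 1) / threads, threads>>>(dIn, n, dBins);
        ok = cudaMemcpy(bins, dBins, binBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dBins);
    return ok;
}
```

With a plain `bins[...]++`, concurrent updates to the same bin could be lost, because the read, the add, and the write of different threads may interleave.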

Built-In Vector Types

- {type}{1,2,3,4} where type is char, uchar, short, ushort, int, uint, long, or ulong
- longlong1, ulonglong1, longlong2, ulonglong2
- float1, float2, float3, float4, double1, double2
- Construct with make_<typename>(...)
- Components are accessible through x, y, z, and w
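A short sketch of the vector types in use, assuming an array of `float4` points whose x, y, and z components are scaled while w is left unchanged. The names `scalePoints` and `scaleOnDevice` are illustrative.

```cuda
#include <cuda_runtime.h>

// Scales the x, y, z components of each point by s and keeps w.
__global__ void scalePoints(float4 *p, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = p[i];                                      // one 16-byte load
        p[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w);   // construct with make_float4
    }
}

// Hypothetical host helper: scales n points in place.
bool scaleOnDevice(float4 *host, int n, float s)
{
    const int threads = 128;
    float4 *dev = NULL;
    size_t bytes = n * sizeof(float4);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scalePoints<<<(n + threads - 1) / threads, threads>>>(dev, n, s);
        ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The `make_float4` constructor is usable in both host and device code, so the same call can be used to initialize the input on the host.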

CUDA Arrays

- Opaque memory layout
- Optimized for textures
- 1-, 2-, or 3-dimensional
- Elements have 1, 2, or 4 components, which may be signed/unsigned integers or floats
- Only readable by kernels through texture fetches

CUDA Arrays (cont.)

    cudaError_t cudaMallocArray(struct cudaArray** array,
                                const struct cudaChannelFormatDesc* desc,
                                size_t width, size_t height = 0,
                                unsigned int flags = 0)

    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };

    enum cudaChannelFormatKind {
        cudaChannelFormatKindSigned,
        cudaChannelFormatKindUnsigned,
        cudaChannelFormatKindFloat
    };

CUDA Arrays (cont.)

    cudaError_t cudaMemcpy2DToArray(...)
    cudaError_t cudaMemcpy2DFromArray(...)
    cudaError_t cudaFreeArray(struct cudaArray* array)
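The calls above can be combined into a round trip: allocate a 2D CUDA array, copy a host image into it, copy it back, and free it. The sketch assumes a single-channel `float` image with tightly packed rows; the function name `roundTripThroughArray` is illustrative.

```cuda
#include <cuda_runtime.h>

// Copies a width x height float image into a CUDA array and back out again.
bool roundTripThroughArray(const float *src, float *dst, size_t width, size_t height)
{
    // One 32-bit float channel per element.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *array = NULL;
    if (cudaMallocArray(&array, &desc, width, height) != cudaSuccess) return false;

    // Host rows are tightly packed, so the pitch equals the row width in bytes.
    size_t rowBytes = width * sizeof(float);
    bool ok =
        cudaMemcpy2DToArray(array, 0, 0, src, rowBytes, rowBytes, height,
                            cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy2DFromArray(dst, rowBytes, array, 0, 0, rowBytes, height,
                              cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFreeArray(array);
    return ok;
}
```

The width passed to the copy functions is in bytes, while the width passed to `cudaMallocArray` is in elements.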

Texture

- Texture: a region of linear memory or a CUDA array
- Texture reference: declared at compile time and bound at runtime to a texture
- Texture fetch: accessing the texture within kernels
- Read-only within kernels
- Optimized for 2D spatial locality
- Addressing modes allow simpler code
- Interpolation

Texture Reference

Declared at compile time as a static global variable:

    texture<Type, Dim, ReadMode> textureRef;

- Type is the type returned when fetching the texture. It is restricted to integer types, single-precision floats, and the built-in 1-, 2-, and 4-component vector types.
- Dim is the dimensionality: either 1, 2, or 3. Defaults to 1.
- ReadMode is either cudaReadModeNormalizedFloat or cudaReadModeElementType. Defaults to cudaReadModeElementType.

Texture Reference (cont.)

Defined at runtime:

- Texture coordinates (textureRef.normalized)
  - Not normalized: coordinates in [0, maxDim - 1].
  - Normalized: coordinates in [0, 1).
- Addressing mode (textureRef.addressMode[])
  - cudaAddressModeClamp
  - cudaAddressModeWrap
- Linear filtering for interpolation, only if floats are returned (textureRef.filterMode)
  - cudaFilterModeLinear
  - cudaFilterModePoint

Binding a Texture

Linear memory:

    texture<float, 2, cudaReadModeElementType> textureRef;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(0, textureRef, devPtr, &channelDesc, width, height, pitch);

CUDA array:

    texture<float, 2, cudaReadModeElementType> textureRef;
    cudaBindTextureToArray(textureRef, cuArray);

Unbinding a Texture

    cudaUnbindTexture(textureRef);

Texture Fetching

To fetch the texture within kernels:

Linear memory:

    tex1Dfetch(textureRef, int x)

CUDA arrays:

    tex1D(textureRef, float x)
    tex2D(textureRef, float x, float y)
    tex3D(textureRef, float x, float y, float z)

Limitations for Texture References

- 1D texture reference bound to a CUDA array: 8192 for 1.x and [value missing] for 2.x
- 1D texture reference bound to linear memory: [value missing]
- 2D texture reference bound to a CUDA array/linear memory: [value missing] x [value missing] for 1.x and [value missing] x [value missing] for 2.x
- 3D texture reference bound to a CUDA array/linear memory: 2048 x 2048 x 2048
- The maximum number of textures bound to a kernel is 128

Example for using Texture Memory

1. Declare a texture reference
2. Allocate memory
3. Set runtime properties of the texture reference
4. Bind the texture reference to a texture
5. Launch a kernel that fetches the texture
6. Unbind the texture reference
7. Free memory
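The seven steps above could look roughly like the following sketch, which copies a 2D image through a texture bound to a CUDA array. It uses the texture reference API described in this lecture, so it assumes a CUDA toolkit older than 12.0, where that API was removed. The names `texRef`, `copyFromTexture`, and `copyViaTexture` are illustrative.

```cuda
#include <cuda_runtime.h>

// 1. Declare a texture reference (file scope; needs a pre-12.0 toolkit).
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void copyFromTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Texture fetch; the 0.5f offset addresses the texel centre.
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }
}

// Hypothetical host helper: copies src to dst by way of the texture.
bool copyViaTexture(const float *src, float *dst, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *array = NULL;
    float *dOut = NULL;
    size_t rowBytes = width * sizeof(float);

    // 2. Allocate memory and upload the image into the CUDA array.
    bool ok =
        cudaMallocArray(&array, &desc, width, height) == cudaSuccess &&
        cudaMalloc((void **)&dOut, rowBytes * height) == cudaSuccess &&
        cudaMemcpy2DToArray(array, 0, 0, src, rowBytes, rowBytes, height,
                            cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // 3. Set runtime properties of the texture reference.
        texRef.normalized = false;
        texRef.addressMode[0] = cudaAddressModeClamp;
        texRef.addressMode[1] = cudaAddressModeClamp;
        texRef.filterMode = cudaFilterModePoint;
        // 4. Bind the texture reference to the CUDA array.
        ok = cudaBindTextureToArray(texRef, array, desc) == cudaSuccess;
    }
    if (ok) {
        // 5. Launch a kernel that fetches the texture.
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        copyFromTexture<<<grid, block>>>(dOut, width, height);
        ok = cudaMemcpy(dst, dOut, rowBytes * height, cudaMemcpyDeviceToHost) == cudaSuccess;
        // 6. Unbind the texture reference.
        cudaUnbindTexture(texRef);
    }
    // 7. Free memory.
    cudaFree(dOut);
    if (array != NULL) cudaFreeArray(array);
    return ok;
}
```

Point filtering with unnormalized coordinates is chosen here so that each fetch returns exactly one stored element; switching `filterMode` to `cudaFilterModeLinear` would interpolate between neighbouring texels instead.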

Read-Write Coherency

- The texture is cached, but the cache is not kept coherent within a kernel.
- Writes to the underlying memory within the same kernel call result in undefined behavior for texture fetches from that memory.
- Writing to the memory is only safe using another kernel call or a memory operation such as a memory copy.


More information

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many

More information

Memory concept. Grid concept, Synchronization. GPU Programming. Szénási Sándor.

Memory concept. Grid concept, Synchronization. GPU Programming.   Szénási Sándor. Memory concept Grid concept, Synchronization GPU Programming http://cuda.nik.uni-obuda.hu Szénási Sándor szenasi.sandor@nik.uni-obuda.hu GPU Education Center of Óbuda University MEMORY CONCEPT Off-chip

More information

Final Exam. 11 May 2018, 120 minutes, 26 questions, 100 points

Final Exam. 11 May 2018, 120 minutes, 26 questions, 100 points Name: CS520 Final Exam 11 May 2018, 120 minutes, 26 questions, 100 points The exam is closed book and notes. Please keep all electronic devices turned off and out of reach. Note that a question may require

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

and Parallel Algorithms Programming with CUDA, WS09 Waqar Saleem, Jens Müller

and Parallel Algorithms Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Organization People Waqar Saleem, waqar.saleem@uni-jena.de Jens Mueller, jkm@informatik.uni-jena.de Room 3335, Ernst-Abbe-Platz 2

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

CUDA Advanced Techniques 2 Mohamed Zahran (aka Z)

CUDA Advanced Techniques 2 Mohamed Zahran (aka Z) CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming CUDA Advanced Techniques 2 Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Alignment Memory Alignment Memory

More information

A FRAMEWORK FOR DATA STRUCTURES IN A TYPED FORTH

A FRAMEWORK FOR DATA STRUCTURES IN A TYPED FORTH A FRAMEWORK FOR DATA STRUCTURES IN A TYPED FORTH Federico de Ceballos Universidad de Cantabria federico.ceballos@unican.es September, 2007 Strong Forth as a typed Forth In Strong Forth the interpreter

More information

Variables in C. Variables in C. What Are Variables in C? CMSC 104, Fall 2012 John Y. Park

Variables in C. Variables in C. What Are Variables in C? CMSC 104, Fall 2012 John Y. Park Variables in C CMSC 104, Fall 2012 John Y. Park 1 Variables in C Topics Naming Variables Declaring Variables Using Variables The Assignment Statement 2 What Are Variables in C? Variables in C have the

More information

NVIDIA CUDA. NVIDIA CUDA C Programming Guide. Version 3.2

NVIDIA CUDA. NVIDIA CUDA C Programming Guide. Version 3.2 NVIDIA CUDA NVIDIA CUDA C Programming Guide Version 3.2 10/15/2010 Changes from Version 3.1.1 Simplified all the code samples that use cuparamsetv() to set a kernel parameter of type CUdeviceptr since

More information

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL C. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL C Is used to write kernels when working with OpenCL Used to code the part that runs on the device Based on C99 with some extensions

More information

Introduction to C# Applications

Introduction to C# Applications 1 2 3 Introduction to C# Applications OBJECTIVES To write simple C# applications To write statements that input and output data to the screen. To declare and use data of various types. To write decision-making

More information

COSC 6339 Accelerators in Big Data

COSC 6339 Accelerators in Big Data COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Non-Blocking Inter-Partition Communication with Wait-Free Pair Transactions

Non-Blocking Inter-Partition Communication with Wait-Free Pair Transactions Non-Blocking Inter-Partition Communication with Wait-Free Pair Transactions Ethan Blanton and Lukasz Ziarek Fiji Systems, Inc. October 10 th, 2013 WFPT Overview Wait-Free Pair Transactions A communication

More information

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME 1 Last time... GPU Memory System Different kinds of memory pools, caches, etc Different optimization techniques 2 Warp Schedulers

More information

NVJPEG. DA _v0.1.4 August nvjpeg Libary Guide

NVJPEG. DA _v0.1.4 August nvjpeg Libary Guide NVJPEG DA-06762-001_v0.1.4 August 2018 Libary Guide TABLE OF CONTENTS Chapter 1. Introduction...1 Chapter 2. Using the Library... 3 2.1. Single Image Decoding... 3 2.3. Batched Image Decoding... 6 2.4.

More information