Programming with CUDA, WS09
1 Programming with CUDA and Parallel Algorithms
Waqar Saleem, Jens Müller
Lecture 7.5, Thursday, 19 November 2009
2 Recap
- CUDA texture memory commands
3 Today
- CUDA driver API
4 Runtime and Driver APIs
- Two interfaces for writing CUDA programs: C for CUDA and the CUDA driver API
- C for CUDA
  - allows kernels to be written in C
  - provides the runtime API, which builds on the driver API
  - needs to be compiled with nvcc
- Driver API
  - provides functions to load cubin/PTX kernels, e.g. as compiled from runtime kernels
5 Runtime and Driver APIs
- The runtime API is provided by the cudart library
  - functions prefixed with cuda
  - implicit device initialization with the first call to the runtime
- The driver API is provided by the cuda library
  - functions/objects prefixed with cu
  - explicit device initialization with cuInit()
6 Driver API
- Initialize the driver API with cuInit()
- Create a CUDA context
- Attach the context to a device
- Make the context current to the calling host thread
7 Driver API Object Handles
8 A runtime API sample
    // Kernel
    __global__ void VecAdd(float *A, float *B, float *C)
    {
        //...
    }

    // Host code
    int main()
    {
        //...
        VecAdd<<<1, N>>>(A, B, C);
        //...
    }
9 Driver API equivalent
    int main()
    {
        // Initialize
        if (cuInit(0) != CUDA_SUCCESS)
            exit(0);

        // Get number of devices supporting CUDA
        int deviceCount = 0;
        cuDeviceGetCount(&deviceCount);
        if (deviceCount == 0) {
            printf("there is no device supporting CUDA.\n");
            exit(0);
        }

        // Get handle for device 0
        CUdevice cuDevice = 0;
        cuDeviceGet(&cuDevice, 0);

        // Create context
        CUcontext cuContext;
        cuCtxCreate(&cuContext, 0, cuDevice);

        // Create module from binary file
        CUmodule cuModule;
        cuModuleLoad(&cuModule, "VecAdd.ptx");

        // Get function handle from module
        CUfunction vecAdd;
        cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");

        // Invoke kernel
        // ALIGN_UP rounds offset up to the next multiple of alignment (a power of two)
        #define ALIGN_UP(offset, alignment) \
            (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
        int offset = 0;
        void* ptr;

        // Add each pointer argument at an aligned offset
        ptr = (void*)(size_t)a;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        ptr = (void*)(size_t)b;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        ptr = (void*)(size_t)c;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);

        // Total size of the argument list
        cuParamSetSize(vecAdd, offset);

        // Enough blocks to cover all N elements
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        cuFuncSetBlockShape(vecAdd, threadsPerBlock, 1, 1);
        cuLaunchGrid(vecAdd, blocksPerGrid, 1);
        //...
    }
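The two rounding steps in the listing (aligning a parameter offset and sizing the grid) can be tried on the host without a GPU. The following is a minimal sketch in plain C. The function names `align_up` and `blocks_for` are made up for illustration and are not part of the driver API, and `align_up` assumes the alignment is a power of two, as the `ALIGN_UP` macro does.

```c
#include <stddef.h>

/* Round offset up to the next multiple of alignment.
   Assumes alignment is a power of two, like the ALIGN_UP macro. */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Smallest number of blocks such that
   blocks * threads_per_block >= n (integer ceiling division). */
int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, `blocks_for(1000, 256)` gives 4 blocks, enough to cover 1000 elements with 256 threads per block, and `align_up(9, 4)` gives 12.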
10 CUDA Context
- A CUDA context loads cubin/PTX kernels
  - C kernels must be compiled down using nvcc
  - cubin kernels are not forward compatible; PTX kernels are
- All driver API resources and actions are encapsulated in contexts
  - These are automatically cleaned up when the context is destroyed
  - CUDA functions called outside a context return an error
- Each context has its own 32-bit address space
11 Working with contexts
- Create a context using cuCtxCreate()
- The created context, C, is automatically made current to the calling host thread
  - C has a usage count of 1
  - C is pushed on top of the current host thread's stack of current contexts
  - the host thread should call cuCtxDestroy() or cuCtxDetach() on C when done with it
  - C replaces the previously current context, if any
12 Working with contexts
- Pop C from the stack using cuCtxPopCurrent(), and make it current again using cuCtxPushCurrent()
- Use a context in other threads using cuCtxAttach() and cuCtxDetach()
- Other context functions: cuCtxSynchronize(), cuCtxGetDevice()
- Each context has a usage count, which is 1 at creation
  - Incremented/decremented by cuCtxAttach()/cuCtxDetach() respectively
- A context and its resources are automatically destroyed when its usage count becomes 0
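The usage-count rule can be pictured with a small host-side model. This is only a toy sketch in plain C: `ToyContext` and the `toy_ctx_*` functions are hypothetical names invented here, they do not call the driver, and they only mimic the counting behaviour described on the slides (count of 1 at creation, +1 on attach, -1 on detach, destruction at 0).

```c
#include <stdbool.h>

/* Toy stand-in for a context; NOT the driver's CUcontext. */
typedef struct {
    int  usage_count;
    bool alive;
} ToyContext;

/* Models creation: the usage count starts at 1. */
ToyContext toy_ctx_create(void)
{
    ToyContext c = { 1, true };
    return c;
}

/* Models attach: one more user of the context. */
void toy_ctx_attach(ToyContext *c)
{
    if (c->alive)
        c->usage_count++;
}

/* Models detach: one user fewer; the context is destroyed at zero. */
void toy_ctx_detach(ToyContext *c)
{
    if (c->alive && --c->usage_count == 0)
        c->alive = false;  /* a real driver would release the context's resources here */
}
```

In this model, a context that is created, attached once and detached twice ends with a usage count of 0 and is no longer alive.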
13 Modules
- Modules are packages of previously compiled device functions
- Function names, texture references and global variables are available at module scope
- A context may incorporate external modules as well
- cuModuleLoad(), cuModuleGetFunction()
14 Data alignment
- An alignment requirement for a type specifies the memory addresses at which variables of that type should be stored
- Aligned data can be read more efficiently
- In C/C++, a type's alignment requirement can be obtained using __alignof()
- Alignment conditions depend on the hardware architecture
- A memory address, a, is n-aligned if a is a multiple of n
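The definition of an n-aligned address can be checked directly on the host. The sketch below is plain C11 and uses the standard `alignof` from `<stdalign.h>` rather than a compiler-specific spelling. The function names are illustrative, and treating a pointer as an integer via `uintptr_t` assumes a flat address space, which holds on common platforms.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is a multiple of n (n > 0), else 0.
   Assumes a flat address space, so the pointer converts to a plain integer. */
int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p % n) == 0;
}

/* Alignment requirement of double on the compiling platform. */
size_t align_of_double(void)
{
    return alignof(double);
}
```

Any `double` variable the compiler lays out should satisfy `is_aligned(&x, align_of_double())`, while the byte one past an aligned address generally does not satisfy a 16-byte alignment.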
15 CUDA alignment requirements
16 Kernel execution
- cuFuncSetBlockShape() sets the arrangement of threads and their IDs
- cuFuncSetSharedSize() sets the size of shared memory the function will use
- cuParamSeti(), cuParamSetf(), cuParamSetTexRef() and cuParamSetv() add integer, float, texture reference and arbitrary variables to a function's argument list
  - added variables have to be aligned
- cuParamSetSize() sets the total size of the arguments
- cuLaunch(), cuLaunchGrid(), cuLaunchGridAsync() launch a kernel
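Why added variables have to be aligned becomes visible when arguments of different sizes are mixed. The following host-side sketch in plain C packs values into a byte buffer the way an argument list is laid out: each value goes to the next offset that satisfies its alignment, and the final offset is the total size. `ArgBuffer` and `arg_push` are hypothetical helpers invented for this illustration, and the sketch makes no driver calls.

```c
#include <stddef.h>
#include <string.h>

#define ARG_BUFFER_CAPACITY 256

/* Hypothetical host-side stand-in for a kernel's argument list. */
typedef struct {
    unsigned char bytes[ARG_BUFFER_CAPACITY];
    size_t size;  /* running offset; its final value plays the role of the total argument size */
} ArgBuffer;

/* Append one argument at the next offset that satisfies its alignment.
   alignment must be a power of two.
   Returns the offset used, or (size_t)-1 if the value does not fit. */
size_t arg_push(ArgBuffer *buf, const void *value, size_t size, size_t alignment)
{
    size_t offset = (buf->size + alignment - 1) & ~(alignment - 1);
    if (offset > ARG_BUFFER_CAPACITY || size > ARG_BUFFER_CAPACITY - offset)
        return (size_t)-1;
    memcpy(buf->bytes + offset, value, size);
    buf->size = offset + size;
    return offset;
}
```

Pushing a 4-byte value with alignment 4, then an 8-byte value with alignment 8, then another 4-byte value places them at offsets 0, 8 and 16, for a total size of 20 bytes. The 4 bytes of padding after the first value are what the alignment rule requires.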
17 Device Memory
- Linear memory: cuMemAlloc(), cuMemAllocPitch(), cuMemFree(), cuMemcpyHtoD(), cuMemcpyDtoH(), cuMemcpyHtoDAsync(), cuMemcpyDtoHAsync()
- CUDA array:
    CUDA_ARRAY_DESCRIPTOR desc;
    desc.Format = CU_AD_FORMAT_FLOAT;
    desc.NumChannels = 1;
    desc.Width = desc.Height = n;
    CUarray cuArray;
    cuArrayCreate(&cuArray, &desc);
    cuArrayDestroy(cuArray);
- memory copy functions...
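Pitched linear memory, as returned by cuMemAllocPitch(), is addressed by row stride in bytes rather than by element count, because each row may be padded. A minimal sketch of that address arithmetic in plain C follows. The helper names `pitched_offset` and `pitched_row` are illustrative assumptions, and the sketch works on an ordinary host buffer rather than on device memory.

```c
#include <stddef.h>

/* Byte offset of element (row, col) in a pitched 2D allocation.
   pitch is the row stride in bytes and may exceed width * elem_size. */
size_t pitched_offset(size_t row, size_t col, size_t pitch, size_t elem_size)
{
    return row * pitch + col * elem_size;
}

/* Pointer to the first float of the given row.
   Assumes pitch is a multiple of sizeof(float) so the result is aligned. */
float *pitched_row(void *base, size_t row, size_t pitch)
{
    return (float *)((char *)base + row * pitch);
}
```

With a pitch of 32 bytes, element (1, 3) of a float array sits at byte offset 44, and `pitched_row(base, 1, 32)[3]` refers to the same element.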
18 Pinned Memory
- cuMemHostAlloc(), cuMemFreeHost()
- flags at allocation for portable, write-combined and/or mapped memory
- check CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY with cuDeviceGetAttribute()
- enable mapping of pinned memory for a context by passing the CU_CTX_MAP_HOST flag to cuCtxCreate()
19 Textures
- texture<float, 2, cudaReadModeElementType> texRef in cuModule is retrieved in the driver API as
    CUtexref cuTexRef;
    cuModuleGetTexRef(&cuTexRef, cuModule, "texRef");
- Bind texRef
  - to linear memory:
      CUDA_ARRAY_DESCRIPTOR desc;
      cuTexRefSetAddress2D(cuTexRef, &desc, devPtr, pitch);
  - to a CUDA array:
      cuTexRefSetArray(cuTexRef, cuArray, CU_TRSA_OVERRIDE_FORMAT);
- cuTexRefSetAddressMode(), cuTexRefSetFilterMode()
- cuTexRefSetFlags(): normalize texels, normalize coordinates
- cuTexRefSetFormat(): analogous to the CUDA array descriptor
20 Asynchronous Execution
- check CU_DEVICE_ATTRIBUTE_GPU_OVERLAP with cuDeviceGetAttribute()
- Stream: cuStreamCreate(), cuStreamDestroy()
- Event: cuEventCreate(), cuEventRecord(), cuEventSynchronize(), cuEventElapsedTime(), cuEventDestroy()
21
- Shared memory: set the size for a function using cuFuncSetSharedSize()
- Multiple devices: cuDeviceGetCount(), cuDeviceGet()
- Error handling: same variable update as in the runtime API; get error codes from asynchronous functions by synchronizing
22 Next time
- Performance optimizations
23 See you next time!