Module 3: CUDA Execution Model - I
ECE 8823A GPU Architectures
Module 3: CUDA Execution Model - I

Objective
- A more detailed look at kernel execution
  - Data to thread assignment
- To understand the organization and scheduling of threads
  - Resource assignment at the block level
  - Scheduling at the warp level
  - Basics of SIMT execution

Reading Assignment
- Kirk and Hwu, "Programming Massively Parallel Processors: A Hands on Approach," Chapter 4
- Kirk and Hwu, "Programming Massively Parallel Processors: A Hands on Approach," Chapter 6.3
- Reference: CUDA Programming Guide

A Multi-dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1, which holds Block (0,0), Block (1,0), Block (0,1) and Block (1,1), and Kernel 2 on Grid 2. Block (1,1) is expanded to show its threads, labeled Thread (0,0,0) through Thread (1,0,3).]

Built-In Variables
- 1D-3D grid of thread blocks
  - Built-in: gridDim (gridDim.x, gridDim.y, gridDim.z)
  - Built-in: blockDim (blockDim.x, blockDim.y, blockDim.z)

Example
    dim3 dimGrid(32,2,2)              - 3D grid of thread blocks
    dim3 dimGrid(2,2,1)               - 2D grid of thread blocks
    dim3 dimGrid(32,1,1)              - 1D grid of thread blocks
    dim3 dimGrid(ceil(n/256.0),1,1)   - 1D grid of thread blocks
    my_kernel<<<dimGrid, dimBlock>>>(..)

Built-In Variables (2)
- 1D-3D grid of threads in a thread block
  - Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
  - Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- All blocks have the same thread configuration

Example
    dim3 dimBlock(4,2,2)    - 3D thread block
    dim3 dimBlock(2,2,1)    - 2D thread block
    dim3 dimBlock(32,1,1)   - 1D thread block
    my_kernel<<<dimGrid, dimBlock>>>(..)
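
The 1D grid size in the example is derived from the element count n and a block size of 256, and it has to be rounded up so that the last, partially filled block is still launched. A minimal sketch of that rounding step in integer arithmetic follows. It is plain C++ so it can be compiled without a GPU, and the helper name blocksNeeded is an assumption of this sketch, not something from the slides or the CUDA API. It also assumes n + threadsPerBlock - 1 does not overflow an unsigned int.

```cpp
#include <cassert>

// Hypothetical helper (not from the slides): the smallest number of blocks of
// `threadsPerBlock` threads that covers `n` elements.
unsigned int blocksNeeded(unsigned int n, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    // Integer ceiling division: rounds up whenever n is not an exact
    // multiple of threadsPerBlock.
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For n = 1000 and 256 threads per block this gives 4 blocks, so 1024 threads are launched and the extra 24 must be masked off by a range check inside the kernel.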

Built-In Variables (3)
- 1D-3D grid of threads in a thread block
  - Built-in: blockIdx (blockIdx.x, blockIdx.y, blockIdx.z)
  - Built-in: threadIdx (threadIdx.x, threadIdx.y, threadIdx.z)
- Initialized by the runtime through a kernel call
- Range fixed by the compute capability and target devices
  - You can query the device (later)

2D Examples
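
Processing a Picture with a 2D Grid
[Figure: a picture of 72x62 pixels covered by a 2D grid of blocks.]

Row-Major Layout in C/C++

2D view:
    M0,0  M0,1  M0,2  M0,3
    M1,0  M1,1  M1,2  M1,3
    M2,0  M2,1  M2,2  M2,3
    M3,0  M3,1  M3,2  M3,3

Linearized in memory:
    M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

The linear offset of element M2,1 is Row*Width+Col = 2*4+1 = 9.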

Source Code of the Picture Kernel

    __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
        // Calculate the row # of the d_Pin and d_Pout element to process
        int Row = blockIdx.y*blockDim.y + threadIdx.y;
        // Calculate the column # of the d_Pin and d_Pout element to process
        int Col = blockIdx.x*blockDim.x + threadIdx.x;
        // each thread computes one element of d_Pout if in range
        if ((Row < m) && (Col < n)) {
            d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
        }
    }

Approach Summary
- Storage layout of data
- Assign unique ID
- Map IDs to Data (access)

Figure 4.5 Covering a picture with 16 blocks.
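
The range check in the picture kernel matters whenever the grid overshoots the picture. One way to see this without a GPU is to replay the launch on the host, as in the sketch below. It is plain C++ rather than CUDA: the parameters bx, by, tx, ty, bdx and bdy stand in for blockIdx, threadIdx and blockDim, and the function names pictureThread and emulatePictureLaunch are assumptions of this sketch, not CUDA API.

```cpp
#include <cassert>
#include <vector>

// Host-side stand-in for one thread of PictureKernel. bx/by replace blockIdx,
// tx/ty replace threadIdx, and bdx/bdy replace blockDim (sketch assumptions).
void pictureThread(const float* pin, float* pout, int n, int m,
                   int bx, int by, int tx, int ty, int bdx, int bdy) {
    int Row = by * bdy + ty;   // row of the element this thread handles
    int Col = bx * bdx + tx;   // column of the element this thread handles
    if (Row < m && Col < n) {  // threads outside the picture do nothing
        pout[Row * n + Col] = 2 * pin[Row * n + Col];
    }
}

// Visits every (block, thread) pair of a grid that covers an m x n picture
// (m rows, n columns) and returns how many threads were "launched".
int emulatePictureLaunch(const std::vector<float>& pin, std::vector<float>& pout,
                         int n, int m, int bdx, int bdy) {
    assert(bdx > 0 && bdy > 0 && n > 0 && m > 0);
    assert(static_cast<int>(pin.size()) == n * m && pout.size() == pin.size());
    int gridX = (n + bdx - 1) / bdx;  // blocks needed along x, rounded up
    int gridY = (m + bdy - 1) / bdy;  // blocks needed along y, rounded up
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < bdy; ++ty)
                for (int tx = 0; tx < bdx; ++tx)
                    pictureThread(pin.data(), pout.data(), n, m,
                                  bx, by, tx, ty, bdx, bdy);
    return gridX * gridY * bdx * bdy;
}
```

For a 3x5 picture and 4x2 blocks the sketch launches 32 threads for 15 pixels. The other 17 fail the range check and write nothing.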

A Simple Running Example: Matrix Multiplication
- A simple illustration of the basic features of memory and thread management in CUDA programs
  - Index usage
  - Memory layout
  - Register usage
- Assume square matrix for simplicity
- Leave shared memory usage until later

Square Matrix-Matrix Multiplication
- P = M * N of size WIDTH x WIDTH
- Each thread calculates one element of P
- Each row of M is loaded WIDTH times from global memory
- Each column of N is loaded WIDTH times from global memory
[Figure: the matrices M, N and P.]

Matrix Multiplication: A Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

[Figure: row i of M and column j of N, both traversed by index k, produce element (i, j) of P.]

Kernel Version: Functional Description

    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];

- Which thread is at coordinate (row, col)?
- Threads self allocate
[Figure: matrices M, N and P, with row i, column j and index k marked, and blockDim.x and blockDim.y marked on P.]

Kernel Function - A Small Example
- Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  - Each block has (TILE_WIDTH)^2 threads
- Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks

WIDTH = 4; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 2
Use 2*2 = 4 blocks

      Block(0,0)       Block(0,1)
    P0,0  P0,1   |   P0,2  P0,3
    P1,0  P1,1   |   P1,2  P1,3
    -------------+-------------
    P2,0  P2,1   |   P2,2  P2,3
    P3,0  P3,1   |   P3,2  P3,3
      Block(1,0)       Block(1,1)

David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July

A Slightly Bigger Example
WIDTH = 8; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 4
Use 4*4 = 16 blocks
[Figure: the 8x8 result matrix P, elements P0,0 through P7,7, partitioned into 2x2 tiles.]

David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13

A Slightly Bigger Example (cont.)
WIDTH = 8; TILE_WIDTH = 4
Each block has 4*4 = 16 threads
WIDTH/TILE_WIDTH = 2
Use 2*2 = 4 blocks
[Figure: the 8x8 result matrix P, elements P0,0 through P7,7, partitioned into 4x4 tiles.]
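
The tiling arithmetic in these examples can be checked mechanically. The sketch below is plain C++ and its names (Owner, ownerOf, blocksInGrid) are assumptions of this sketch, not from the slides. It computes the number of blocks for a given WIDTH and TILE_WIDTH, and inverts the kernel's index calculation to find which block and which thread within that block own a given element of P. It assumes WIDTH is an exact multiple of TILE_WIDTH, as in the examples.

```cpp
#include <cassert>

// Block and thread coordinates that own one element of P (sketch-only type).
struct Owner {
    int blockY, blockX;    // play the role of blockIdx.y, blockIdx.x
    int threadY, threadX;  // play the role of threadIdx.y, threadIdx.x
};

// Inverts Row = blockIdx.y*TILE_WIDTH + threadIdx.y (and likewise for Col).
Owner ownerOf(int row, int col, int tileWidth) {
    assert(tileWidth > 0 && row >= 0 && col >= 0);
    return Owner{row / tileWidth, col / tileWidth,
                 row % tileWidth, col % tileWidth};
}

// Number of blocks in the 2D grid: (WIDTH/TILE_WIDTH)^2.
int blocksInGrid(int width, int tileWidth) {
    assert(tileWidth > 0 && width > 0 && width % tileWidth == 0);
    int perSide = width / tileWidth;
    return perSide * perSide;
}
```

With WIDTH = 8 and TILE_WIDTH = 2, element P5,6 belongs to block (2,3) and to thread (1,0) inside it, in (y, x) order. With TILE_WIDTH = 4 the same element belongs to block (1,1) and thread (1,2).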

Kernel Invocation (Host-side Code)

    // Setup the execution configuration
    // TILE_WIDTH is a #define constant
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Kernel Function

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        // ...
    }

Work for Block (0,0) in a TILE_WIDTH = 2 Configuration

    Col = 0 * 2 + threadIdx.x    (blockIdx.x * blockDim.x + threadIdx.x)
    Row = 0 * 2 + threadIdx.y    (blockIdx.y * blockDim.y + threadIdx.y)

The threads of this block cover Row = 0, 1 and Col = 0, 1.
[Figure: the 4x4 matrices M, N and P, annotated with the Row and Col values above.]

Work for Block (0,1)

    Col = 1 * 2 + threadIdx.x    (blockIdx.x * blockDim.x + threadIdx.x)
    Row = 0 * 2 + threadIdx.y    (blockIdx.y * blockDim.y + threadIdx.y)

The threads of this block cover Row = 0, 1 and Col = 2, 3.
[Figure: the 4x4 matrices M, N and P, annotated with the Row and Col values above.]

A Simple Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
        // Calculate the row index of the d_P element and d_M
        int Row = blockIdx.y*blockDim.y+threadIdx.y;
        // Calculate the column index of d_P and d_N
        int Col = blockIdx.x*blockDim.x+threadIdx.x;
        if ((Row < Width) && (Col < Width)) {
            float Pvalue = 0;
            // each thread computes one element of the block sub-matrix
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
            d_P[Row*Width+Col] = Pvalue;
        }
    }

QUESTIONS?

David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign
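
To see that the per-thread code above really produces the full product, the kernel body can be replayed on the host for every (block, thread) pair. The following is a rough sketch in plain C++, not CUDA. The names matMulThread and emulateMatMul are assumptions of this sketch, as is the choice to round the grid size up so that widths that are not a multiple of the tile width are also covered. That rounding is what makes the Row < Width and Col < Width test necessary.

```cpp
#include <cassert>
#include <vector>

// Host-side stand-in for one thread of MatrixMulKernel. bx/by replace
// blockIdx, tx/ty replace threadIdx, and tile replaces blockDim.x and
// blockDim.y (all sketch assumptions).
void matMulThread(const float* M, const float* N, float* P, int Width,
                  int bx, int by, int tx, int ty, int tile) {
    int Row = by * tile + ty;
    int Col = bx * tile + tx;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        // Dot product of row Row of M with column Col of N.
        for (int k = 0; k < Width; ++k)
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        P[Row * Width + Col] = Pvalue;
    }
}

// Replays a tile x tile launch over a grid that covers a Width x Width result.
std::vector<float> emulateMatMul(const std::vector<float>& M,
                                 const std::vector<float>& N,
                                 int Width, int tile) {
    assert(Width > 0 && tile > 0);
    assert(static_cast<int>(M.size()) == Width * Width && N.size() == M.size());
    std::vector<float> P(M.size(), 0.0f);
    int grid = (Width + tile - 1) / tile;  // blocks per side, rounded up
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < tile; ++ty)
                for (int tx = 0; tx < tile; ++tx)
                    matMulThread(M.data(), N.data(), P.data(), Width,
                                 bx, by, tx, ty, tile);
    return P;
}
```

On a real device the four loops disappear: the hardware supplies the block and thread indices, and the threads run in no guaranteed order. That is safe here because each thread writes a distinct element of P.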
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More informationCOSC 462. CUDA Basics: Blocks, Grids, and Threads. Piotr Luszczek. November 1, /10
COSC 462 CUDA Basics: Blocks, Grids, and Threads Piotr Luszczek November 1, 2017 1/10 Minimal CUDA Code Example global void sum(double x, double y, double *z) { *z = x + y; } int main(void) { double *device_z,
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationGPU Programming. Rupesh Nasre.
GPU Programming Rupesh Nasre. http://www.cse.iitm.ac.in/~rupesh IIT Madras July 2017 Hello World. #include int main() { printf("hello World.\n"); return 0; Compile: nvcc hello.cu Run: a.out GPU
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationReduction of a Symmetrical Matrix. to Tridiagonal Form on GPUs
Reduction of a Symmetrical Matrix to Tridiagonal Form on GPUs By Shuotian Chen Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign Adviser: Professor Volodymyr
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More information