L3: Data Partitioning and Memory Organization, Synchronization


Outline (January 21)
- More on CUDA
- First assignment, due Jan. 30 (later in lecture)
- Error checking mechanisms
- Synchronization
- More on data partitioning
- Reading: GPU Gems 2, Ch. 31; CUDA 2.0 Manual, particularly Chapters 4 and 5
- This lecture includes slides provided by Wen-mei Hwu (UIUC), David Kirk (NVIDIA), and Austin Robison (NVIDIA)

Today's Lecture
- Establish background to write your first CUDA code
- Testing for correctness (a little)
- Debugging (a little)
- Core concepts to be reinforced
  - Data partitioning
  - Synchronization

Questions from Last Time
- Considering the GPU gets a single copy of the array, can two threads access the shared array at the same time?
  - Yes, if there are no bank conflicts. But watch out!
- Can we have different thread functions for different subsets of data?
  - Not efficiently.
- Why do I get the same results every time using a random number generator?
  - The sequence is deterministic if it starts with the default (or same) seed.
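To make the last point concrete, here is a tiny host-side sketch (plain C, nothing CUDA-specific, not from the lecture code) of why runs repeat: rand() produces the same sequence for the same seed, so input arrays initialized this way are identical from run to run unless the seed is changed.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main() {
      /* Fixed (or default) seed: the same "random" input on every run. */
      srand(0);
      for (int i = 0; i < 4; i++) printf("%d ", rand() % 10);
      printf("\n");

      /* Time-based seed: a different sequence on each run. */
      srand((unsigned)time(NULL));
      for (int i = 0; i < 4; i++) printf("%d ", rand() % 10);
      printf("\n");
      return 0;
    }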

My Approach to Working with New Systems
- Approach with caution
  - May not behave as expected.
  - Learn about behavior through observation and measurement.
  - Realize that documentation is not always clear or accurate.
- Crawl, then walk, then run
  - Start with something known to work.
  - Only change one variable at a time.

Debugging Using Device Emulation Mode
- An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  - No need for any device or CUDA driver
  - Each device thread is emulated with a host thread
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code and vice versa
  - Call any host function from device code (e.g., printf) and vice versa
  - Detect deadlock situations caused by improper usage of __syncthreads
ECE 498AL, University of Illinois, Urbana-Champaign

Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous accesses to the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
- Results of floating-point computations will differ slightly because of:
  - Different compiler outputs and instruction sets
  - Use of extended precision for intermediate results
  - There are various options to force strict single precision on the host
ECE 498AL, University of Illinois, Urbana-Champaign

Run-time Functions and Macros for Error Checking
- In CUDA run-time services:
    cudaGetDeviceProperties(deviceProp &dp, d);
  checks the number and type of devices and whether a device is present
- In libcutil.a of the Software Developer's Kit:
    cutComparef(float *ref, float *data, unsigned len);
  compares output with a reference from the CPU implementation
- In cutil.h of the Software Developer's Kit (with #define _DEBUG or the -D_DEBUG compile flag):
    CUDA_SAFE_CALL(f(<args>)), CUT_SAFE_CALL(f(<args>))
  check for an error in a run-time call and exit if an error is detected
    CUT_SAFE_MALLOC(cudaMalloc(<args>));
  similar to the above, but for malloc calls
    CUT_CHECK_ERROR("error message goes here");
  checks for an error immediately following kernel execution and, if one is detected, exits with an error message
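The cutil macros above come from the SDK, not from the CUDA runtime itself. As a rough, minimal sketch of the kind of checking they perform, using only standard CUDA runtime calls (this is my own illustration, not the cutil implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Roughly what CUDA_SAFE_CALL does: check the return code and exit on failure. */
    #define CHECK(call)                                                      \
      do {                                                                   \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
          fprintf(stderr, "CUDA error: %s at %s:%d\n",                       \
                  cudaGetErrorString(err), __FILE__, __LINE__);              \
          exit(EXIT_FAILURE);                                                \
        }                                                                    \
      } while (0)

    int main() {
      int count = 0;
      CHECK(cudaGetDeviceCount(&count));        /* is any device present? */
      cudaDeviceProp prop;
      CHECK(cudaGetDeviceProperties(&prop, 0)); /* number, type, capabilities of device 0 */
      printf("Found %d device(s); device 0 is %s\n", count, prop.name);

      /* Roughly what CUT_CHECK_ERROR does after a kernel launch: */
      CHECK(cudaGetLastError());
      return 0;
    }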

Simple Working Code Example from Last Time
- Goal for this example:
  - Really simple, but illustrative of key concepts
  - Fits in one file with a simple compile command
  - Can absorb during lecture
- What does it do?
  - Scan elements of an array of numbers (any of 0 to 9): how many times does 6 appear?
  - Array of 16 elements, each thread examines 4 elements, 1 block in the grid, 1 grid
- Known as a cyclic data distribution:
  - threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
  - threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
  - threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
  - threadIdx.x = 3 examines in_array elements 3, 7, 11, 15

Reductions (from Last Time)
- This type of computation is called a parallel reduction
  - An operation is applied to a large data structure
  - The computed result represents the aggregate solution across the large data structure
  - Large data structure -> computed result (perhaps a single number) [dimensionality reduced]
- Why might parallel reductions be well-suited to GPUs?
- What if we tried to compute the final sum on the GPU? (next class and assignment)

Reminder: Gathering and Reporting Results
Global and device functions, and excerpts from the host and main. Individual results are computed for each thread on the device; the final gathering of results is serialized on the host.

    __device__ int compare(int a, int b) {
      if (a == b) return 1;
      return 0;
    }

    __global__ void compute(int *d_in, int *d_out) {
      d_out[threadIdx.x] = 0;
      for (i = 0; i < SIZE/BLOCKSIZE; i++) {
        int val = d_in[i*BLOCKSIZE + threadIdx.x];
        d_out[threadIdx.x] += compare(val, 6);
      }
    }

    __host__ void outer_compute(int *h_in_array, int *h_out_array) {
      ...
      compute<<<1, BLOCKSIZE, msize>>>(d_in_array, d_out_array);
      cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int),
                 cudaMemcpyDeviceToHost);
    }

    main(int argc, char **argv) {
      ...
      for (int i = 0; i < BLOCKSIZE; i++) {
        sum += out_array[i];
      }
      printf("Result = %d\n", sum);
    }

Gathering Results on GPU: Synchronization
- void __syncthreads();
- Functionality: synchronizes all threads in a block
  - Each thread waits at the point of this call until all other threads have reached it
  - Once all threads have reached this point, execution resumes normally
- Why is this needed?
  - A thread can freely read the shared memory of its thread block or the global memory of either its block or grid.
  - The barrier allows the program to guarantee a partial ordering of these accesses, to prevent incorrect orderings.
- Watch out!
  - Potential for deadlock when it appears in conditionals
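To illustrate that last warning, here is a minimal sketch (hypothetical kernels, not from the lecture code) of the conditional-barrier pitfall: __syncthreads() must be reached by every thread in the block, so divergent work should happen around the barrier, not enclose it.

    /* BAD: only threads with threadIdx.x < 16 reach the barrier; the rest of the
       block never arrives, so the kernel can hang on the device. */
    __global__ void bad_kernel(int *d_data) {
      if (threadIdx.x < 16) {
        __syncthreads();
      }
      d_data[threadIdx.x] = threadIdx.x;
    }

    /* OK: the divergent part is confined to an expression before the barrier, and
       every thread in the block executes the same __syncthreads() call. */
    __global__ void ok_kernel(int *d_data) {
      int v = (threadIdx.x < 16) ? 1 : 0;
      __syncthreads();
      d_data[threadIdx.x] = v;
    }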

Gathering Results on GPU for Count 6
Original version (results gathered on the host, repeated from above):

    __global__ void compute(int *d_in, int *d_out) {
      d_out[threadIdx.x] = 0;
      for (i = 0; i < SIZE/BLOCKSIZE; i++) {
        int val = d_in[i*BLOCKSIZE + threadIdx.x];
        d_out[threadIdx.x] += compare(val, 6);
      }
    }

Version that gathers the result on the GPU using synchronization:

    __global__ void compute(int *d_in, int *d_out, int *d_sum) {
      d_out[threadIdx.x] = 0;
      for (i = 0; i < SIZE/BLOCKSIZE; i++) {
        int val = d_in[i*BLOCKSIZE + threadIdx.x];
        d_out[threadIdx.x] += compare(val, 6);
      }
      __syncthreads();
      if (threadIdx.x == 0) {
        for (i = 0; i < BLOCKSIZE; i++)
          *d_sum += d_out[i];
      }
    }

Also: Atomic Update to Sum Variable
- int atomicAdd(int* address, int val);
- Increments the integer at address by val. Atomic means that once initiated, the operation executes to completion without interruption by other threads.
- Example: atomicAdd(d_sum, d_out_array[threadIdx.x]);

Programming Assignment #1
Problem 1: In the count 6 example using synchronization, we accumulated all results into out_array[0], thus serializing the accumulation computation on the GPU. Suppose we want to exploit some parallelism in this part, which will likely be particularly important to performance as we scale the number of threads. A common idiom for reduction computations is to use a tree-structured results-gathering phase, where independent threads collect their results in parallel. The tree below illustrates this for our example, with SIZE = 16 and BLOCKSIZE = 4:

    out[0]    out[1]    out[2]    out[3]
       out[0] += out[1]    out[2] += out[3]
              out[0] += out[2]

Your job is to write this version of the reduction in CUDA. You can start with the sample code, but adjust the problem size to be larger:

    #define SIZE 256
    #define BLOCKSIZE 32

Another Example: Adding Two Matrices
CPU C program:

    void add_matrix_cpu(float *a, float *b, float *c, int N) {
      int i, j, index;
      for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
          index = i + j*N;
          c[index] = a[index] + b[index];
        }
      }
    }

    void main() {
      ...
      add_matrix(a, b, c, N);
    }

CUDA C program:

    __global__ void add_matrix_gpu(float *a, float *b, float *c, int N) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      int j = blockIdx.y*blockDim.y + threadIdx.y;
      int index = i + j*N;
      if (i < N && j < N)
        c[index] = a[index] + b[index];
    }

    void main() {
      dim3 dimBlock(blocksize, blocksize);
      dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
      add_matrix_gpu<<<dimGrid, dimBlock>>>(a, b, c, N);
    }

Example source: Austin Robison, NVIDIA
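The CUDA version above omits the host-side allocation and transfers. As a self-contained sketch of the full program, under the assumption that the matrices are initialized on the host and copied explicitly (the names h_a, d_a, etc. are mine, not from the slide):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 4
    #define blocksize 2

    __global__ void add_matrix_gpu(float *a, float *b, float *c, int n) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      int j = blockIdx.y*blockDim.y + threadIdx.y;
      int index = i + j*n;
      if (i < n && j < n)
        c[index] = a[index] + b[index];
    }

    int main() {
      float h_a[N*N], h_b[N*N], h_c[N*N];
      for (int k = 0; k < N*N; k++) { h_a[k] = k; h_b[k] = 2*k; }   /* host-side init */

      float *d_a, *d_b, *d_c;
      size_t bytes = N*N*sizeof(float);
      cudaMalloc((void**)&d_a, bytes);
      cudaMalloc((void**)&d_b, bytes);
      cudaMalloc((void**)&d_c, bytes);
      cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

      dim3 dimBlock(blocksize, blocksize);
      dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
      add_matrix_gpu<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, N);

      cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
      printf("c[5] = %f\n", h_c[5]);   /* expect 5 + 2*5 = 15 */

      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
    }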

Closer Inspection of Computation and Data Partitioning
- Define a 2-D set of blocks, and a 2-D set of threads per block:

    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);

- Each thread identifies what single element of the matrix it operates on:

    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i + j*N;
    if (i < N && j < N)
      c[index] = a[index] + b[index];

- Let blocksize = 2 and N = 4.
  [Figure: the 4x4 matrix divided into four 2x2 blocks, BLOCK (0,0), BLOCK (1,0), BLOCK (0,1), and BLOCK (1,1), with each element labeled by the (threadIdx.x, threadIdx.y) of the thread that computes it.]

Data Distribution to Threads
- Many data parallel programming languages have mechanisms for expressing how a distributed data structure is mapped to threads
  - That is, the portion of the data a thread accesses (and usually stores locally)
  - Examples: HPF, Chapel, UPC
- Reasons for selecting a particular distribution
  - Sharing patterns: group together according to data reuse and communication needs
  - Affinity with other data
  - Locality: transfer size and capacity limit on shared memory
  - Access patterns relative to computation (avoid bank conflicts)
  - Minimizing overhead: cost of copying to the host and between global and shared memory

Concept of Data Distributions in CUDA
- This concept is not completely natural for CUDA
  - Implicit from computation partitioning and access expressions within threads
  - Within a block of threads, most memory is shared, so not really distributed
- Nevertheless, I think it is still useful
  - Spatial understanding of computation and data organization (helps conceptualize)
  - Distribution of data needed for objects too large to fit in shared memory (i.e., partitioning across blocks in a grid)
  - Helpful in putting CUDA in context of other languages
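As a worked example of the index arithmetic above (with blocksize = 2 and N = 4, as on the slide): the thread with threadIdx = (1, 0) in block blockIdx = (1, 1) computes i = 1*2 + 1 = 3 and j = 1*2 + 0 = 2, so it writes index = i + j*N = 3 + 2*4 = 11, i.e., element (j, i) = (2, 3) of the matrix stored row by row.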

Common Data Distributions
Consider a 1-dimensional array.

    CYCLIC:
      for (i = 0; i < BLOCKSIZE; i++)
        ... d_in[i*BLOCKSIZE + threadIdx.x] ...

    BLOCK:
      for (i = threadIdx.x*BLOCKSIZE; i < (threadIdx.x+1)*BLOCKSIZE; i++)
        ... d_in[i] ...

    BLOCK-CYCLIC:
      (left blank on the slide; one common formulation is sketched after the memory-spaces slide below)

Programming Assignment #1, cont.
Problem 2: Computation Partitioning & Data Distribution
Using the matrix addition example from the Lecture 2 class notes, develop a complete CUDA program for integer matrix addition. Initialize the two input matrices using random integers from 0 to 9 (as in the count 6 example). Initialize the ith entry in both a[i] and b[i], so that everyone's matrix is initialized the same way. Your CUDA code should use a data distribution for the input and output matrices that exploits several features of the architecture to improve performance (although the advantages of these will be greater on objects allocated to shared memory and with reuse of data within a block):
- a grid of 16 square blocks of threads (>= # multiprocessors)
- each block executing 64 threads (> warpsize)
- each thread accessing 32 elements (better if larger)
Within a grid block, you should use a BLOCK data distribution for the columns, so that each thread operates on a column. This should help reduce bank conflicts. Return the output array.

Hardware Implementation: Memory Architecture
- The local, global, constant, and texture spaces are regions of device memory
- Each multiprocessor has:
  - A set of 32-bit registers per processor
  - On-chip shared memory, where the shared memory space resides
  - A read-only constant cache, to speed up access to the constant memory space
  - A read-only texture cache, to speed up access to the texture memory space
  [Figure: a device with multiprocessors 1..N, each containing processors 1..M, an instruction unit, shared memory, a constant cache, and a texture cache, all connected to device memory (global, constant, and texture memories).]
ECE 498AL, University of Illinois, Urbana-Champaign

Programming Model: Memory Spaces
- Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can read/write global, constant, and texture memory
  [Figure: a grid of blocks; each block has its own shared memory, and each thread has registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the host.]
ECE 498AL, University of Illinois, Urbana-Champaign
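Returning to the Common Data Distributions slide above: as promised, here is a sketch of all three distributions in one kernel, including a common formulation of BLOCK-CYCLIC. CHUNK is a chunk size I introduce for illustration; it is not defined in the lecture. Launch with one block of BLOCKSIZE threads.

    #define SIZE 16
    #define BLOCKSIZE 4
    #define CHUNK 2        /* hypothetical chunk size for the block-cyclic case */

    __global__ void distributions(int *d_in, int *d_out) {
      int acc = 0;

      /* CYCLIC: thread t touches elements t, t+BLOCKSIZE, t+2*BLOCKSIZE, ... */
      for (int i = threadIdx.x; i < SIZE; i += BLOCKSIZE)
        acc += d_in[i];

      /* BLOCK: thread t touches one contiguous slab of SIZE/BLOCKSIZE elements. */
      int per_thread = SIZE / BLOCKSIZE;
      for (int i = threadIdx.x * per_thread; i < (threadIdx.x + 1) * per_thread; i++)
        acc += d_in[i];

      /* BLOCK-CYCLIC: contiguous chunks of CHUNK elements, dealt out round-robin,
         so thread 0 gets elements 0-1, 8-9; thread 1 gets 2-3, 10-11; and so on. */
      for (int base = threadIdx.x * CHUNK; base < SIZE; base += BLOCKSIZE * CHUNK)
        for (int i = base; i < base + CHUNK && i < SIZE; i++)
          acc += d_in[i];

      d_out[threadIdx.x] = acc;
    }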

Threads, Warps, Blocks
- There are (up to) 32 threads in a warp
  - Only fewer than 32 when there are fewer than 32 total threads
- There are (up to) 16 warps in a block
- Each block (and thus, each warp) executes on a single SM
- G80 has 16 SMs
- At least 16 blocks are required to fill the device
- More is better: if resources (registers, thread space, shared memory) allow, more than 1 block can occupy each SM
ECE 498AL, University of Illinois, Urbana-Champaign

Hardware Implementation: A Set of SIMD Multiprocessors
- A device has a set of multiprocessors
- Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  - Shared instruction unit
- At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
- The number of threads in a warp is the warp size
  [Figure: a device with multiprocessors 1..N, each containing processors 1..M and a shared instruction unit.]
ECE 498AL, University of Illinois, Urbana-Champaign

Hardware Implementation: Execution Model
- Each thread block of a grid is split into warps, each of which is executed by one multiprocessor (SM)
  - The device processes only one grid at a time
- Each thread block is executed by one multiprocessor
  - So that the shared memory space resides in the on-chip shared memory
- A multiprocessor can execute multiple blocks concurrently
  - Shared memory and registers are partitioned among the threads of all concurrent blocks
  - So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently
ECE 498AL, University of Illinois, Urbana-Champaign

Terminology Review
- device = GPU = set of multiprocessors
- multiprocessor = set of processors & shared memory
- kernel = GPU program
- grid = array of thread blocks that execute a kernel
- thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

    Memory    Location  Cached           Access      Who
    Local     Off-chip  No               Read/write  One thread
    Shared    On-chip   N/A (resident)   Read/write  All threads in a block
    Global    Off-chip  No               Read/write  All threads + host
    Constant  Off-chip  Yes              Read        All threads + host
    Texture   Off-chip  Yes              Read        All threads + host

ECE 498AL, University of Illinois, Urbana-Champaign
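As a quick worked example of these limits: a block of 256 threads is split into 256 / 32 = 8 warps (threads 0-31 form warp 0, threads 32-63 form warp 1, and so on), and with 16 SMs on G80 a grid needs at least 16 blocks just to place one block on every SM, and more than that if each SM is to hold several blocks concurrently.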

Memory Access Times
- Register: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Texture memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
ECE 498AL, University of Illinois, Urbana-Champaign

Language Extensions: Variable Type Qualifiers

    Declaration                                 Memory    Scope   Lifetime
    __device__ __local__    int LocalVar;       local     thread  thread
    __device__ __shared__   int SharedVar;      shared    block   block
    __device__              int GlobalVar;      global    grid    application
    __device__ __constant__ int ConstantVar;    constant  grid    application

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register
  - Except arrays, which reside in local memory
ECE 498AL, University of Illinois, Urbana-Champaign

Variable Type Restrictions
- Pointers can only point to memory allocated or declared in global memory:
  - Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  - Obtained as the address of a global variable: float* ptr = &GlobalVar;
ECE 498AL, University of Illinois, Urbana-Champaign

Relating this Back to the Count 6 Example
- A few variables (i, val) are going in device registers
- We are copying the input array into device memory and the output array back to host memory
  - These device copies of the data all reside in global memory
- Suppose we want to access only shared memory on the device
  - Cannot copy into shared memory from the host!
  - Requires another copy from/to global memory to/from shared memory
- Other options
  - Use constant memory for the input array (not helpful for the current implementation, but possibly useful)
  - Use shared memory for the output array and just return a single result (today's version with synchronization)
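As a rough sketch of that last option (my own version under the SIZE/BLOCKSIZE settings of the count 6 example, not the code handed out): the output array lives in __shared__ memory, the kernel fills it from global memory, and only a single result is written back for the host to copy.

    #define SIZE 16
    #define BLOCKSIZE 4

    /* Launch as: compute_shared<<<1, BLOCKSIZE>>>(d_in, d_sum); */
    __global__ void compute_shared(int *d_in, int *d_sum) {
      __shared__ int s_out[BLOCKSIZE];   /* per-block shared memory; not visible to the host */
      s_out[threadIdx.x] = 0;
      for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
        int val = d_in[i * BLOCKSIZE + threadIdx.x];   /* staged in from global memory */
        s_out[threadIdx.x] += (val == 6);
      }
      __syncthreads();                   /* all partial counts are now in shared memory */
      if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < BLOCKSIZE; i++)
          sum += s_out[i];
        *d_sum = sum;                    /* single result written back to global memory */
      }
    }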

Summary of Lecture
- Some error checking techniques to determine if a program is correct
- Synchronization constructs
- Discussion of data partitioning
- First assignment, due FRIDAY, JANUARY 30, 5PM

Next Week
- Reasoning about parallelization
  - When is it safe?
