CUDA Memory Model. N. Cardoso & P. Bicudo. Física Computacional (FC5)


1 CUDA Memory Model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA Memory Model 1/32

2 Outline
1 Memory System
  Global Memory
  Shared Memory
  L2/L1 Cache
  Constant Memory
  Texture Memory
  Registers
  Read-Only Data Cache Load Function
2 Error Management
N. Cardoso & P. Bicudo CUDA Memory Model 2/32

3 Memory System
Stage 2: GPGPU
  Trick the GPU into doing general-purpose computing.
  Programmable, but requires knowledge of computer graphics.
Stream Processing Platforms
  High-level GPU programming interface.
  No knowledge of computer graphics is required.
  Examples: NVIDIA's CUDA, OpenCL.
CPU and GPU have different memory spaces
  Data is transferred between the CPU and the GPU through the PCIe bus.
  CUDA has functions to allocate/initialize/copy/release memory on the GPU.
Pointers are only addresses
  It is not possible to distinguish a CPU pointer from a GPU pointer by looking at its value. Be very careful.
  There is a CUDA function that can be used to check:
    cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, void *ptr);
  however, when this function is used with a plain CPU pointer, cuda-memcheck flags the call as an error.
[Figure: how does it work? Data is copied back and forth between PC memory (CPU side) and GPU memory over PCIe.]
N. Cardoso & P. Bicudo CUDA Memory Model 3/32
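
A minimal usage sketch for cudaPointerGetAttributes(), under the caveat above (only the device-pointer case is safe to query; the allocation size is an illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float *d_ptr;
        cudaMalloc((void**)&d_ptr, 100 * sizeof(float));

        cudaPointerAttributes attr;
        cudaError_t err = cudaPointerGetAttributes(&attr, d_ptr);
        if (cudaSuccess == err)
            printf("pointer known to CUDA, resides on device %d\n", attr.device);
        else
            printf("pointer not known to CUDA: %s\n", cudaGetErrorString(err));

        cudaFree(d_ptr);
        return 0;
    }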

4 Memory System
Local memory: private to each thread.
Shared memory: per block; shared by the threads within the same block; used for communication between threads.
Global memory: per application; shared among all threads; inter-grid communication.
Constant memory: read/write by the CPU; read-only by any thread.
L1 and L2 cache: only available from the Fermi architecture onwards.
Texture memory: connected to the global memory; can be used as a cache; read-only by threads; updated between kernel calls.
[Figure: memory spaces per thread/block/grid: registers and local memory per thread (read/write), shared memory per block (read/write), global memory per grid (read/write), constant memory per grid (read-only), texture memory per grid (read-only); sequential grids in time share the global memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 4/32

5 Global Memory
CPU: malloc()

    float *array_h = (float*) malloc(N * sizeof(float));

GPU: cudaMalloc(), allocates memory in the global memory

    cudaError_t cudaMalloc(void **devPtr, size_t size);

    float *array_d;
    cudaMalloc((void**)&array_d, N * sizeof(float));

cudaFree(), releases memory allocated with cudaMalloc()

    cudaFree(array_d);

cudaMemset(), memory initialization

    cudaError_t cudaMemset(void *devPtr, int value, size_t count);

[Figure: a grid of blocks, each block with its own shared memory and per-thread registers; the host accesses the global memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 5/32

6 Memory transfers
cudaMemcpy()

    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);

needs 4 parameters:
  dst - destination memory address
  src - source memory address
  count - size in bytes to copy
  kind - type of transfer:
    Host to Host      cudaMemcpyHostToHost
    Host to Device    cudaMemcpyHostToDevice
    Device to Host    cudaMemcpyDeviceToHost
    Device to Device  cudaMemcpyDeviceToDevice
N. Cardoso & P. Bicudo CUDA Memory Model 6/32
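
A minimal sketch putting the last two slides together: allocate on host and device, copy in, launch, copy out, release. The array size N and the doubling kernel are illustrative assumptions, not from the slides:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void doubleElements(float *a, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;   // guard against out-of-range threads
    }

    int main() {
        const int N = 1024;
        float *array_h = (float*) malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) array_h[i] = (float)i;

        float *array_d;
        cudaMalloc((void**)&array_d, N * sizeof(float));
        cudaMemcpy(array_d, array_h, N * sizeof(float), cudaMemcpyHostToDevice);

        doubleElements<<<(N + 255) / 256, 256>>>(array_d, N);

        cudaMemcpy(array_h, array_d, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("array_h[10] = %f\n", array_h[10]);   // expect 20.0

        cudaFree(array_d);
        free(array_h);
        return 0;
    }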

7 Global Memory Coalescing
Example: Offset Access
Tested on: Tesla C2075 (Fermi, CC 2.0) and GeForce GTX TITAN (Kepler, CC 3.5)

    template <typename T>
    __global__ void offset(T *a, int s) {
        int i = blockDim.x * blockIdx.x + threadIdx.x + s;
        a[i] = a[i] + 1;
    }

[Plot: effective bandwidth (GB/s) vs. offset, for float and double, on the GTX TITAN and the Tesla C2075.]
N. Cardoso & P. Bicudo CUDA Memory Model 7/32

9 Global Memory Coalescing
Example: Stride Access
Tested on: Tesla C2075 (Fermi, CC 2.0) and GeForce GTX TITAN (Kepler, CC 3.5)

    template <typename T>
    __global__ void stride(T *a, int s) {
        int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
        a[i] = a[i] + 1;
    }

[Plot: effective bandwidth (GB/s) vs. stride, for float and double, on the GTX TITAN and the Tesla C2075.]
N. Cardoso & P. Bicudo CUDA Memory Model 8/32

11 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0), using 256 threads per block; types: int, float, double.
[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, for int, float, and double.]
N. Cardoso & P. Bicudo CUDA Memory Model 9/32

13 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0); consider that each array element is a structure with 2/4 members (AOS):

    template <class T> struct elem2 { T x, y; };
    template <class T> struct elem4 { T x, y, z, w; };

[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, for float, elem2<float>, elem4<float>, double, elem2<double>, and elem4<double>.]
Why the big difference in performance between elem2<double> and elem4<float>? Same size, 16 bytes!
N. Cardoso & P. Bicudo CUDA Memory Model 10/32

15 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0); use the CUDA built-in vector types: float2, float4, double2, double4.
[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, comparing float, elem2<float>, float2, elem4<float>, float4, double, elem2<double>, double2, elem4<double>, and double4.]
Performance for double2 and float4 is of the same order!
Why the difference in performance between float2 and elem2<float>, float4 and elem4<float>, double2 and elem2<double>, double4 and elem4<double>?
N. Cardoso & P. Bicudo CUDA Memory Model 11/32

17 Global Memory Coalescing
structs in CUDA and alignment
  Done as in standard C, but the alignment must be specified with the directive __align__(x):
    __align__(4): 4 bytes
    __align__(8): 8 bytes, e.g. 2 floats
    __align__(16): 16 bytes, e.g. 4 floats or 2 doubles
  The maximum CUDA alignment is 128 bits.
Example: elem2<float>

    template <>
    struct __align__(8) elem2<float> {
        float x, y;
    };

therefore elem2<float> = float2.
Example: elem4<float>

    template <>
    struct __align__(16) elem4<float> {
        float x, y, z, w;
    };

therefore elem4<float> = float4.
For structures that don't fit in this range, use SOA or SOAOS (see the sketch below):
  AOS: array of structures
  SOA: structure of arrays
  SOAOS: structure of arrays of structures
  e.g. double4 exceeds 16 bytes; the solution is an SOAOS of double2.
N. Cardoso & P. Bicudo CUDA Memory Model 12/32
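
A minimal sketch of the AOS vs. SOAOS idea for the double4 case mentioned above; the struct and kernel names are illustrative assumptions, not from the slides:

    #include <cuda_runtime.h>

    // AOS: one array of 32-byte structures; a warp reading member-by-member strides through memory.
    struct Elem4d { double x, y, z, w; };

    // SOAOS variant from the slide: split double4 into two 16-byte-aligned double2 arrays.
    struct Soaos4d { double2 *xy, *zw; };

    __global__ void sumAos(const Elem4d *a, double *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) out[i] = a[i].x + a[i].y + a[i].z + a[i].w;  // strided, poorly coalesced
    }

    __global__ void sumSoaos(Soaos4d s, double *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            double2 xy = s.xy[i];   // 16-byte aligned, coalesced loads
            double2 zw = s.zw[i];
            out[i] = xy.x + xy.y + zw.x + zw.y;
        }
    }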

18 Shared Memory: __shared__
Each SM has 16KB/48KB of shared memory on the Fermi architecture (the exact amount depends on the architecture).
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
  resides in the shared memory space of a thread block;
  can be read and written by all threads in the same block; synchronization between threads uses __syncthreads();
  has the lifetime of the block and is only accessible from the threads within that block.
Key to increasing performance: move data into shared memory, then access it several times and/or share it between threads.
[Figure: SM diagram: instruction buffer, register file (RF), I$/C$ L1 caches, operand select, MAD/SFU units, and the shared memory (banks of 32-bit words); shared memory is read/write per block.]
N. Cardoso & P. Bicudo CUDA Memory Model 13/32

19 Shared Memory: __shared__
Shared memory banks
The shared memory is divided into banks of 32-bit words that can be accessed simultaneously (16 banks on pre-Fermi devices; 32 banks on Fermi).
Each architecture has some variations; please check the manual.
N. Cardoso & P. Bicudo CUDA Memory Model 14/32

20 Shared Memory: __shared__
Shared memory bank conflicts
When several threads access the same bank, the GPU serializes the requests, causing a loss in performance (see the padding sketch below).
N. Cardoso & P. Bicudo CUDA Memory Model 15/32
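
A minimal sketch of the classic padding trick for avoiding bank conflicts, assuming 32 banks and a 32x32 tile (the transpose kernel is illustrative, not from the slides; it assumes a square matrix whose width is a multiple of TILE, launched with blockDim = (TILE, TILE)):

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out, int width) {
        // +1 column of padding: element (r, c) lands in bank (c + r) % 32,
        // so reading a column no longer maps every thread to the same bank.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free thanks to the padding
    }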

21 Shared Memory: __shared__
Example 1: static (compile-time constant) memory size

    __global__ void kernel(...) {
        __shared__ float array[BLOCK_Y][BLOCK_X];
        (...) // write to shared memory
        __syncthreads(); // ensure that the threads of the same block executed all previous instructions
        (...)
    }
    kernel<<<nblocks, nthreadsperblock>>>(...);

Example 2: "dynamic" memory size
  The size of the array is determined at launch time.
  All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets.

    __global__ void kernel(...) {
        extern __shared__ float array[];
        (...)
    }
    size_t sharedSize = BLOCK_Y * BLOCK_X * sizeof(float);
    kernel<<<nblocks, nthreadsperblock, sharedSize>>>(...);

N. Cardoso & P. Bicudo CUDA Memory Model 16/32

22 Shared Memory: __shared__
Reduction Example CPU

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    inline double seconds() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1.e-6;
    }

    int main(int argc, char **argv) {
        if (argc < 2) { printf("missing argument.\n"); return 0; }
        int n = atoi(argv[1]);
        int *in = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) in[i] = 1;
        double t0 = seconds();
        int sum = 0;
        for (int i = 0; i < n; i++) sum += in[i];
        double t1 = seconds();
        printf("cpu time: %f\tsum: %d\n", t1 - t0, sum);
        free(in);
        return 0;
    }

N. Cardoso & P. Bicudo CUDA Memory Model 17/32

23 Shared Memory: __shared__
Reduction Example GPU: Host part

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    __global__ void reduce(int *in, int *out, int n); // defined on the next slide
    // seconds() as defined on the CPU slide

    int main(int argc, char **argv) {
        if (argc < 2) { printf("missing argument.\n"); return 0; }
        int n = atoi(argv[1]);
        int *in = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) in[i] = 1;
        int *d_in, *d_sum;
        cudaMalloc(&d_in, n * sizeof(int));
        cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMalloc(&d_sum, sizeof(int));
        const uint bthreads = 256;
        uint nblocks = (n + bthreads - 1) / bthreads;
        size_t smem = bthreads * sizeof(int);
        cudaMemset(d_sum, 0, sizeof(int));
        double t0 = seconds();
        reduce<<<nblocks, bthreads, smem>>>(d_in, d_sum, n);
        cudaDeviceSynchronize();
        double t1 = seconds();
        int sum = 0;
        cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
        printf("gpu time: %f\tsum: %d\n", t1 - t0, sum);
        free(in);
        cudaFree(d_in);
        cudaFree(d_sum);
        cudaDeviceReset();
        return 0;
    }

N. Cardoso & P. Bicudo CUDA Memory Model 18/32

24 Shared Memory: __shared__
Reduction Example GPU: Device part

    __global__ void reduce(int *in, int *out, int n) {
        uint id = threadIdx.x + blockIdx.x * blockDim.x;
        extern __shared__ int smem[];
        smem[threadIdx.x] = (id < n) ? in[id] : 0;
        __syncthreads();
        // compute the sum for the thread block
        for (int dist = blockDim.x / 2; dist > 0; dist /= 2) {
            if (threadIdx.x < dist)
                smem[threadIdx.x] += smem[threadIdx.x + dist];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(out, smem[0]);
    }

[Figure: parallel reduction with sequential addressing: the stride halves at each step (8, 4, 2, 1) while the active thread IDs stay contiguous; sequential addressing is conflict free.]
atomicAdd(): avoids race conditions (all blocks read/write the same global memory address); no other thread can access this address until the operation is complete.
CUDA atomic operations: atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), ...
__syncthreads() can be expensive.
./a.out (CPU: Intel Core i-series; GPU: GeForce GTX TITAN, CC 3.5): speedup (GPU vs. CPU): 2.1x.
N. Cardoso & P. Bicudo CUDA Memory Model 19/32

25 Shared Memory: __shared__
Reduction Example GPU: Device part (fewer sync barriers)

    #define WARP_SIZE 32
    __global__ void reduce(int *in, int *out, int n) {
        uint id = threadIdx.x + blockIdx.x * blockDim.x;
        extern __shared__ int smem[];
        smem[threadIdx.x] = (id < n) ? in[id] : 0;
        __syncthreads();
        // Only one active warp does the reduction; blocks are always a multiple of the warp size!
        if (blockDim.x > WARP_SIZE && threadIdx.x < WARP_SIZE) {
            for (uint s = 1; s < blockDim.x / WARP_SIZE; s++)
                smem[threadIdx.x] += smem[threadIdx.x + WARP_SIZE * s];
        }
        // __syncthreads(); // no need to synchronize inside a warp!
        // One thread does the final warp reduction
        if (threadIdx.x == 0) {
            int sum = 0;
            for (uint s = 0; s < WARP_SIZE; s++) sum += smem[s];
            atomicAdd(out, sum);
        }
    }

Using only one sync barrier: speedup (GPU vs. CPU): 4.6x; speedup (GPU vs. GPU): 2.2x.
N. Cardoso & P. Bicudo CUDA Memory Model 20/32

26 L2/L1 Cache
Only available on Fermi and newer architectures.
The user cannot control the L2/L1 caches directly; however, the split between L1 cache and shared memory can be configured (one configuration option per kernel).
Configuration of L1 cache vs. shared memory:
  cudaFuncSetCacheConfig(kernel_name, Mode);
  the initial configuration is 48KB of shared memory and 16KB of L1 cache (Fermi architecture).
Mode:
  cudaFuncCachePreferShared: 48KB of shared memory and 16KB of L1 cache
  cudaFuncCachePreferL1: 16KB of shared memory and 48KB of L1 cache
  cudaFuncCachePreferNone: default
N. Cardoso & P. Bicudo CUDA Memory Model 21/32
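
A minimal sketch, assuming a hypothetical kernel myKernel: request the larger-L1 split before launching a kernel that makes little use of shared memory:

    __global__ void myKernel(float *data, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;   // no shared memory used: prefer a larger L1
    }

    void launch(float *d_data, int n) {
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1); // 16KB shared / 48KB L1 on Fermi
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }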

27 Constant Memory: __constant__
  Has the lifetime of an application.
  Is accessible from all the threads within the grid, and from the host through the runtime library.
  The host can read and write it; the device can only read from it.
  Limited size: on the Tesla C2075, only 64KB.
  As fast as register access, but only if all threads in the same warp access the same position.
Host control:
  cudaGetSymbolAddress()
  cudaGetSymbolSize()
  cudaMemcpyToSymbol()
  cudaMemcpyFromSymbol()
Example, host side:

    float pi = 3.14159265f;
    cudaMemcpyToSymbol(Pi, &pi, sizeof(float));

    float *array_h;
    (...)
    cudaMemcpyToSymbol(array_d, array_h, N * sizeof(float));

Device declaration (defined as a global variable/array, as in standard C/C++):

    __constant__ float Pi;
    __constant__ float array_d[N]; // only static sizes allowed

N. Cardoso & P. Bicudo CUDA Memory Model 22/32
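
A minimal sketch tying the two halves together, assuming the Pi symbol declared above (the kernel is an illustrative assumption): the host sets the constant with cudaMemcpyToSymbol() and every thread reads the same position, which is the fast path.

    __constant__ float Pi;

    __global__ void circleArea(const float *radius, float *area, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            area[i] = Pi * radius[i] * radius[i]; // all threads in a warp read the same constant
    }

    void setup() {
        float pi = 3.14159265f;
        cudaMemcpyToSymbol(Pi, &pi, sizeof(float)); // host writes; the device only reads
    }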

28 Texture Memory: Reference API
A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime.
Texture memory procedure:
  Allocation: linear memory or cudaArrays
  Texture reference: "binding/unbinding"
  Texture memory access: texture fetch

    texture<Type, Dim, ReadMode> tex_name;

Binds the host side and the device side of the texture access.
Type: float, int, uchar4, etc.
  Does not support double precision explicitly; solution: use int2 as the Type in the device code and convert the int2 to a double with __hiloint2double(X.y, X.x).
Dim: 1, 2, or 3.
ReadMode:
  cudaReadModeNormalizedFloat: [0.0f, 1.0f] for unsigned types and [-1.0f, 1.0f] for signed types
  cudaReadModeElementType: no conversion is done
N. Cardoso & P. Bicudo CUDA Memory Model 23/32

29 Texture Memory: Reference API
Linear memory case
  Allocation: simple, with cudaMalloc().
  Textures bound to linear memory can only be one-dimensional and support up to 2^27 elements.
Texture reference:
  texture<Type, 1, cudaReadModeElementType> tex_name;
Texture bind:
  cudaBindTexture(&offset, tex_name, cuda_array);
Texture unbind:
  cudaUnbindTexture(tex_name);
Texture access in the device code:
  value = tex1Dfetch(tex_name, index);
N. Cardoso & P. Bicudo CUDA Memory Model 24/32
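
A minimal end-to-end sketch of the reference API with linear memory, assuming a device array d_data of n floats already filled (the kernel and names are illustrative assumptions):

    texture<float, 1, cudaReadModeElementType> tex_in;  // static global, as required

    __global__ void shiftByOne(float *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex_in, i) + 1.0f;  // read through the texture cache
    }

    void run(float *d_data, float *d_out, int n) {
        cudaBindTexture(0, tex_in, d_data, n * sizeof(float)); // 0: offset not needed here
        shiftByOne<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(tex_in);
    }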

30 Texture Memory: Object API
A texture object is created with cudaCreateTextureObject().
It can be passed as a kernel argument.
Only available on Kepler and newer architectures.
When used with linear memory:
  template<class T> T tex1Dfetch(cudaTextureObject_t texObj, int x);
N. Cardoso & P. Bicudo CUDA Memory Model 25/32
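
A minimal sketch (a host-side fragment, assuming an existing device allocation d_data of N floats; error checking omitted) of creating a texture object over linear memory:

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;                       // existing device allocation
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = N * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    // ... launch kernels taking tex as an argument; fetch with tex1Dfetch(tex, i) ...
    cudaDestroyTextureObject(tex);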

31 Registers
Register File (RF), per SM:
  Fermi: 32K 32-bit registers
  Kepler: 64K 32-bit registers
  Maxwell: 64K 32-bit registers
The registers are dynamically divided among the thread blocks:
  once assigned to a block, a register is not accessible by threads in other blocks;
  each thread can only access the registers assigned to it.
Exercise: SM occupancy (Fermi architecture)
  Assume that the number of threads in a block is 256 and that the total number of registers per SM is 16384.
  If each thread uses 10 registers, how many blocks can run on each SM?
    registers per block = 256 x 10 = 2560
    blocks per SM = 16384 / 2560 = 6
    threads per SM = 6 x 256 = 1536: maximum SM occupancy
  Now assume that each thread uses 20 registers:
    registers per block = 256 x 20 = 5120
    blocks per SM = 16384 / 5120 = 3
    threads per SM = 3 x 256 = 768 < 1536: the SM is not fully occupied
[Figure: SM diagram: instruction buffer, register file (RF), I$/C$ L1 caches, operand select, MAD/SFU units, shared memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 26/32
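
The same occupancy arithmetic can be checked programmatically; a minimal sketch using the occupancy API (available since CUDA 6.5; the kernel is an illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *a) { a[threadIdx.x + blockIdx.x * blockDim.x] *= 2.0f; }

    int main() {
        int numBlocks = 0;
        const int blockSize = 256;
        // Maximum resident blocks per SM for this kernel, given its actual register usage
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, work, blockSize, 0);
        printf("blocks per SM: %d -> threads per SM: %d\n", numBlocks, numBlocks * blockSize);
        return 0;
    }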

32 Read-Only Data Cache Load Function
The read-only data cache load function is only supported by devices of compute capability 3.5 and higher.
Using
    T __ldg(const T* address);
returns the data of type T located at address, where T is char, short, int, unsigned short, unsigned int, unsigned long long, int2, int4, uint2, uint4, float, float2, float4, double, or double2.
Alternatively, mark the pointers used for loading such data with both the const and __restrict__ qualifiers.
The operation is cached in the read-only data cache.
Much easier to use than textures.
N. Cardoso & P. Bicudo CUDA Memory Model 27/32
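
A minimal sketch of both forms (the kernel is an illustrative assumption; requires compilation for CC 3.5 or higher):

    __global__ void scale(const float* __restrict__ in, float *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * __ldg(&in[i]); // explicit __ldg; with const + __restrict__ alone, the compiler may emit it on its own
    }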

33

Memory   | Location | Cached | Access     | Who?
Local    | Off-chip | No     | Read/Write | One thread
Shared   | On-chip  | N/A    | Read/Write | All threads within the same block
Global   | Off-chip | No     | Read/Write | All threads
Constant | Off-chip | Yes    | Read       | All threads
Texture  | Off-chip | Yes    | Read       | All threads

Each thread can:
  read/write per-thread registers;
  read/write per-block shared memory;
  read/write per-grid global memory;
  read/write per-thread local memory;
  read only per-grid constant memory;
  read only per-grid texture memory.
CUDA tool to detect errors in memory accesses: cuda-memcheck.
N. Cardoso & P. Bicudo CUDA Memory Model 28/32

34 Error Management
All CUDA calls return an error code (of cudaError_t type), except kernel launches.

    #define CUDA_SAFE_CALL(call) {                                       \
        cudaError_t err = call;                                          \
        if (cudaSuccess != err) {                                        \
            fprintf(stderr, "Cuda error in file %s in line %i: %s.\n",   \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    }

    const char* cudaGetErrorString(cudaError_t code);

returns a null-terminated string describing the error.
N. Cardoso & P. Bicudo CUDA Memory Model 29/32

35 Error Management

    cudaError_t cudaGetLastError(void);

returns the error code (of cudaError_t type) of the last error; the only way to check for errors in kernel launches.

    #define CHECK_ERROR(errorMessage) {                                         \
        cudaError_t err = cudaGetLastError();                                   \
        if (cudaSuccess != err) {                                               \
            fprintf(stderr, "Cuda error: %s in file %s in line %i: %s.\n",      \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                                 \
        }                                                                       \
    }

N. Cardoso & P. Bicudo CUDA Memory Model 30/32
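
A minimal usage sketch of the two macros above (d_in, in, n, and the kernel launch are illustrative assumptions):

    CUDA_SAFE_CALL(cudaMalloc(&d_in, n * sizeof(int)));     // wrap every runtime call
    CUDA_SAFE_CALL(cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice));

    kernel<<<nblocks, nthreads>>>(d_in, n);   // a launch returns nothing...
    CHECK_ERROR("kernel launch");             // ...so query cudaGetLastError() right after it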

36 Conclusion
NVIDIA GPUs have several different memory spaces.
In terms of speed, the device memory ranking is:
  1 Register file
  2 Shared memory
  3 Constant memory
  4 Texture memory
  5 Last place: local memory and global memory
Accessing data in global memory is critical to the performance of a CUDA application:
  data accesses must be coalesced;
  find a data layout that satisfies this criterion, or that minimizes the impact.
Use atomic memory operations to avoid race conditions.
All CUDA calls return an error code, except kernel launches.
N. Cardoso & P. Bicudo CUDA Memory Model 31/32

37 Thanks N. Cardoso & P. Bicudo CUDA Memory Model 32/32
