CUDA Memory Model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
Outline

1. Memory System: Global Memory; Shared Memory; L2/L1 Cache; Constant Memory; Texture Memory; Registers; Read-Only Data Cache Load Function
2. Error Management
Memory System

Stage 2: GPGPU — trick the GPU into doing general-purpose computing. Programmable, but requires knowledge of computer graphics. Stream-processing platforms provide a high-level GPU programming interface; no computer-graphics knowledge is required. Examples: NVIDIA's CUDA, OpenCL.

The CPU and the GPU have different memory spaces; data between the GPU and the CPU is transferred through the PCIe bus. CUDA has functions to allocate/initialize/copy/release memory on the GPU.

Pointers are only addresses: it is not possible to distinguish a CPU pointer from a GPU pointer by looking at its value, so be very careful. There is a CUDA function that can be used to check:

cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, const void *ptr);

however, when this function is used with a CPU pointer, cuda-memcheck flags it as an error.
Memory System

- Local memory: private to each thread.
- Shared memory: per block; shared by threads within the same block; used for communication between threads.
- Global memory: per application; shared among all threads; inter-grid communication.
- Constant memory: read/write by the CPU; read-only by any thread.
- L1 and L2 cache: only available from the Fermi architecture onwards.
- Texture memory: connected to the global memory; can be used as a cache; read-only by threads; updated between kernel calls.

Summary:
- Registers: per thread, read/write
- Shared memory: per block, read/write
- Global memory: per grid, read/write
- Constant memory: per grid, read-only
- Texture memory: per grid, read-only

Sequential grids in time share the same global memory.
Global Memory

CPU: malloc

float *array_h = (float*) malloc(N * sizeof(float));

GPU: cudaMalloc() allocates memory in global memory:

cudaError_t cudaMalloc(void **devPtr, size_t size);

float *array_d;
cudaMalloc((void**)&array_d, N * sizeof(float));

cudaFree() releases memory allocated with cudaMalloc():

cudaFree(array_d);

cudaMemset() initializes memory:

cudaError_t cudaMemset(void *devPtr, int value, size_t count);
Memory transfers

cudaMemcpy():

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);

It needs 4 parameters:
- dst: destination memory address
- src: source memory address
- count: size in bytes to copy
- kind: type of transfer:
  - Host to Host: cudaMemcpyHostToHost
  - Host to Device: cudaMemcpyHostToDevice
  - Device to Host: cudaMemcpyDeviceToHost
  - Device to Device: cudaMemcpyDeviceToDevice
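A minimal sketch of the full allocate/copy/compute/copy-back cycle built from the calls above (the array name and size are illustrative):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;
    // Host allocation and initialization
    float *array_h = (float*) malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) array_h[i] = (float) i;

    // Device allocation and host-to-device transfer
    float *array_d;
    cudaMalloc((void**)&array_d, N * sizeof(float));
    cudaMemcpy(array_d, array_h, N * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels operating on array_d here ...

    // Device-to-host transfer and cleanup
    cudaMemcpy(array_h, array_d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(array_d);
    free(array_h);
    return 0;
}
```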
Global Memory Coalescing — Example: Offset Access

Offset access, tested on a Tesla C2075 (Fermi, CC 2.0) and a GeForce GTX TITAN (Kepler, CC 3.5):

template <typename T>
__global__ void offset(T *a, int s){
    int i = blockDim.x * blockIdx.x + threadIdx.x + s;
    a[i] = a[i] + 1;
}

[Figure: effective bandwidth (GB/s) versus offset, for float and double, on the TITAN and the Tesla C2075.]
Global Memory Coalescing — Example: Stride Access

Stride access, tested on a Tesla C2075 (Fermi, CC 2.0) and a GeForce GTX TITAN (Kepler, CC 3.5):

template <typename T>
__global__ void stride(T *a, int s){
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
    a[i] = a[i] + 1;
}

[Figure: effective bandwidth (GB/s) versus stride, for float and double, on the TITAN and the Tesla C2075.]
Global Memory Coalescing — Example: Even/Odd Access

Even/odd access:
1. Access the array with stride 2.
2. Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites. This recovers stride-1 access.

Test: Tesla C2075 (CC 2.0), 256 threads per block (the array size N was lost in transcription); types: int, float, double.

[Table: time (ms), effective bandwidth (MB/s) and speedup for layouts 1 and 2, for int, float and double; the numeric values were lost in transcription.]
Global Memory Coalescing — Example: Even/Odd Access (AOS)

Same even/odd test on the Tesla C2075 (CC 2.0), but now each array element is a structure with 2 or 4 members (AOS — array of structures):

template <class T> struct elem2 { T x, y; };
template <class T> struct elem4 { T x, y, z, w; };

[Table: time (ms), effective bandwidth (MB/s) and speedup for float, elem2<float>, elem4<float>, double, elem2<double> and elem4<double>; the numeric values were lost in transcription.]

Why the big difference in performance between elem2<double> and elem4<float>? Same size, 16 bytes!
Global Memory Coalescing — Example: Even/Odd Access (built-in vector types)

Same even/odd test on the Tesla C2075 (CC 2.0), now using the CUDA built-in vector types: float2, float4, double2, double4.

[Table: time (ms), effective bandwidth (MB/s) and speedup comparing float, elem2<float>, float2, elem4<float>, float4, double, elem2<double>, double2, elem4<double> and double4; the numeric values were lost in transcription.]

The performances for double2 and float4 are of the same order! Why the difference in performance between float2 and elem2<float>, float4 and elem4<float>, double2 and elem2<double>, double4 and elem4<double>?
Global Memory Coalescing — structs in CUDA and alignment

Structures are declared as in standard C, but the alignment must be requested with the directive __align__(x):
- __align__(4): 4 bytes
- __align__(8): 8 bytes, for example 2 floats
- __align__(16): 16 bytes, for example 4 floats or 2 doubles
The maximum CUDA alignment is 128 bits (16 bytes).

Example: elem2<float>

template <> struct __align__(8) elem2<float> { float x, y; };

therefore elem2<float> behaves like float2.

Example: elem4<float>

template <> struct __align__(16) elem4<float> { float x, y, z, w; };

therefore elem4<float> behaves like float4.

For structures that don't fit in this range, use SOA or SOAOS. For example, double4 exceeds 16 bytes; a solution is an SOAOS of double2.
- AOS: array of structures
- SOA: structure of arrays
- SOAOS: structure of arrays of structures
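The specializations above can be checked at compile time against the built-in vector types; a sketch (static_assert requires C++11, i.e. nvcc with -std=c++11 or newer):

```cuda
#include <cuda_runtime.h>

template <class T> struct elem2 { T x, y; };
template <class T> struct elem4 { T x, y, z, w; };

// Specializations with explicit alignment, as on the slide
template <> struct __align__(8)  elem2<float> { float x, y; };
template <> struct __align__(16) elem4<float> { float x, y, z, w; };

// With the alignment directive the structs match float2/float4, so the
// compiler can use a single 8-byte or 16-byte load/store instruction.
static_assert(sizeof(elem2<float>) == sizeof(float2) &&
              alignof(elem2<float>) == alignof(float2),
              "elem2<float> does not match float2");
static_assert(sizeof(elem4<float>) == sizeof(float4) &&
              alignof(elem4<float>) == alignof(float4),
              "elem4<float> does not match float4");
```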
Shared Memory: __shared__

Each SM has 16 KB/48 KB of shared memory in the Fermi architecture (the amount depends on the architecture), organized in banks of 32-bit words.

The __shared__ qualifier, optionally used together with __device__, declares a variable that:
- resides in the shared-memory space of a thread block;
- allows read/write by all threads in the same block, with synchronization between threads via __syncthreads();
- has the lifetime of the block and is only accessible from the threads within that block.

Key to increasing performance: move data into shared memory, then access it several times and/or share it between threads.
Shared Memory: __shared__ — shared-memory banks

The shared memory is divided into 16 banks (Fermi) of 32-bit words, which can be accessed simultaneously. Each architecture has some variations; please check the manual.
Shared Memory: __shared__ — shared-memory banks

When several threads access the same bank, the GPU serializes the requests, with a corresponding loss in performance.
Shared Memory: __shared__

Example 1: constant (compile-time) memory size

__global__ void kernel(...) {
    __shared__ float array[BLOCK_Y][BLOCK_X];
    (...)  // write to shared memory
    __syncthreads();  // ensure that threads of the same block executed all previous instructions
    (...)
}
kernel<<<nblocks, nthreadsperblock>>>(...);

Example 2: "dynamic" memory size — the size of the array is determined at launch time. All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets.

__global__ void kernel(...) {
    extern __shared__ float array[];
    (...)
}
size_t sharedSize = BLOCK_Y * BLOCK_X * sizeof(float);
kernel<<<nblocks, nthreadsperblock, sharedSize>>>(...);
Shared Memory: __shared__ — Reduction Example, CPU

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

inline double seconds(){
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.e-6;
}

int main(int argc, char **argv){
    if (argc < 2){ printf("Missing argument.\n"); return 0; }
    int n = atoi(argv[1]);
    int *in = (int*)malloc(n * sizeof(int));
    for(int i = 0; i < n; i++) in[i] = 1;
    double t0 = seconds();
    int sum = 0;
    for(int i = 0; i < n; i++) sum += in[i];
    double t1 = seconds();
    printf("CPU time: %f\tsum: %d\n", t1 - t0, sum);
    free(in);
    return 0;
}
Shared Memory: __shared__ — Reduction Example, GPU: Host part

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// seconds() as defined in the CPU version

int main(int argc, char **argv){
    if (argc < 2){ printf("Missing argument.\n"); return 0; }
    int n = atoi(argv[1]);
    int *in = (int*)malloc(n * sizeof(int));
    for(int i = 0; i < n; i++) in[i] = 1;
    int *d_in, *d_sum;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc(&d_sum, sizeof(int));
    const uint bthreads = 256;
    uint nblocks = (n + bthreads - 1) / bthreads;
    size_t smem = bthreads * sizeof(int);
    cudaMemset(d_sum, 0, sizeof(int));
    double t0 = seconds();
    reduce<<<nblocks, bthreads, smem>>>(d_in, d_sum, n);
    cudaDeviceSynchronize();
    double t1 = seconds();
    int sum = 0;
    cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("GPU time: %f\tsum: %d\n", t1 - t0, sum);
    free(in);
    cudaFree(d_in);
    cudaFree(d_sum);
    cudaDeviceReset();
    return 0;
}
Shared Memory: __shared__ — Reduction Example, GPU: Device part

__global__ void reduce(int *in, int *out, int n){
    uint id = threadIdx.x + blockIdx.x * blockDim.x;
    extern __shared__ int smem[];
    smem[threadIdx.x] = (id < n) ? in[id] : 0;
    __syncthreads();
    // compute the sum for the thread block
    for (int dist = blockDim.x / 2; dist > 0; dist /= 2){
        if (threadIdx.x < dist)
            smem[threadIdx.x] += smem[threadIdx.x + dist];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(out, smem[0]);
}

[Figure: parallel reduction with sequential addressing — at each step the stride halves (8, 4, 2, 1) and the active threads add the upper half of the values into the lower half. Sequential addressing is conflict free.]

atomicAdd() avoids race conditions (all blocks are reading/writing the same global-memory address): no other thread can access this address until the operation is complete. Other CUDA atomic operations: atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), ...

__syncthreads() can be expensive.

Results of ./a.out (CPU: Intel Core i-series, clock lost in transcription; GPU: GeForce GTX TITAN, CC 3.5; times and sums lost in transcription). Speedup (GPU vs. CPU): 2.1x.
Shared Memory: __shared__ — Reduction Example, GPU: Device part (fewer sync barriers)

#define WARP_SIZE 32
__global__ void reduce(int *in, int *out, int n){
    uint id = threadIdx.x + blockIdx.x * blockDim.x;
    extern __shared__ int smem[];
    smem[threadIdx.x] = (id < n) ? in[id] : 0;
    __syncthreads();
    // Only one active warp does the reduction; blocks are always a multiple of the warp size!
    if (blockDim.x > WARP_SIZE && threadIdx.x < WARP_SIZE){
        for (uint s = 1; s < blockDim.x / WARP_SIZE; s++)
            smem[threadIdx.x] += smem[threadIdx.x + WARP_SIZE * s];
    }
    // __syncthreads();  // no need to synchronize inside a warp!
    // One thread does the warp reduction
    if (threadIdx.x == 0){
        int sum = 0;
        for (uint s = 0; s < WARP_SIZE; s++)
            sum += smem[s];
        atomicAdd(out, sum);
    }
}

Using only one sync barrier (timings lost in transcription): speedup (GPU vs. CPU): 4.6x; speedup (GPU vs. GPU): 2.2x.
L2/L1 Cache

Only available on Fermi and newer architectures. The user cannot control the L2/L1 cache directly; however, the L1 cache size is configurable (one configuration option per kernel).

Configuration of L1 cache and shared memory:

cudaFuncSetCacheConfig(kernel_name, Mode);

The initial configuration is 48 KB for shared memory and 16 KB for L1 cache (Fermi architecture). Mode:
- cudaFuncCachePreferShared: 48 KB for shared memory and 16 KB for L1 cache
- cudaFuncCachePreferL1: 16 KB for shared memory and 48 KB for L1 cache
- cudaFuncCachePreferNone: default
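For example, a kernel that makes heavy use of shared memory (the kernel name here is illustrative) could be configured as:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that stages data through shared memory
__global__ void stencil_kernel(float *out, const float *in, int n);

void configure_caches(void) {
    // Ask for 48 KB of shared memory / 16 KB of L1 for this kernel (Fermi)
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // A kernel that barely uses shared memory could instead prefer L1:
    // cudaFuncSetCacheConfig(other_kernel, cudaFuncCachePreferL1);
}
```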
Constant Memory: __constant__

- has the lifetime of an application;
- is accessible from all threads within the grid and from the host through the runtime library;
- the host can read and write; the device can only read from it;
- limited size: on a Tesla C2075, only 64 KB;
- as fast as register access, but only if all threads in the same warp access the same position;
- host control: cudaGetSymbolAddress(), cudaGetSymbolSize(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol().

Example — device declaration (defined as a global variable/array as in standard C/C++):

__constant__ float Pi;
__constant__ float array_d[N];  // only static size allowed

Host side:

float pi = ...;  // literal lost in transcription
cudaMemcpyToSymbol(Pi, &pi, sizeof(float));

float *array_h;
(...)
cudaMemcpyToSymbol(array_d, array_h, N * sizeof(float));
Texture Memory: Reference API

A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function. The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime.

Texture-memory procedure:
1. Allocation: linear memory or cudaArrays
2. Texture reference
3. Binding/unbinding
4. Texture-memory access (texture fetch)

texture<Type, Dim, ReadMode> tex_name;

binds host-side and device-side texture access.

- Type: float, int, uchar4, etc. Double precision is not supported explicitly; solution: use int2 as Type in the device code and convert an int2 to double with __hiloint2double(X.y, X.x).
- Dim: 1, 2, or 3
- ReadMode:
  - cudaReadModeNormalizedFloat: [0.0f, 1.0f] for unsigned types and [-1.0f, 1.0f] for signed types
  - cudaReadModeElementType: no conversion is done
Texture Memory: Reference API — linear-memory case

- Allocation: simple, with cudaMalloc(). Textures bound to linear memory can only be one-dimensional and support up to 2^27 elements.
- Texture reference: texture<Type, 1, cudaReadModeElementType> tex_name;
- Bind: cudaBindTexture(&offset, tex_name, cuda_array);
- Unbind: cudaUnbindTexture(tex_name);
- Access in device code: value = tex1Dfetch(tex_name, index);
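A minimal sketch of the bind/fetch/unbind sequence above (kernel and pointer names are illustrative; note that the reference API was deprecated in later CUDA releases in favor of the object API):

```cuda
#include <cuda_runtime.h>

// Static global texture reference, as required by the reference API
texture<float, 1, cudaReadModeElementType> tex_in;

__global__ void scale(float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(tex_in, i);  // read through the texture cache
}

void run(float *d_in, float *d_out, int n) {
    size_t offset = 0;
    cudaBindTexture(&offset, tex_in, d_in, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(tex_in);
}
```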
Texture Memory: Object API

A texture object is created using cudaCreateTextureObject() and can be passed as a kernel argument. Only available on Kepler and newer architectures. When used with linear memory:

template<class T>
T tex1Dfetch(cudaTextureObject_t texObj, int x);
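A sketch of creating a texture object over linear memory and passing it to a kernel (the descriptor fields shown are the ones needed for this case; names are illustrative):

```cuda
#include <string.h>
#include <cuda_runtime.h>

__global__ void scale(float *out, cudaTextureObject_t tex, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch<float>(tex, i);
}

void run(float *d_in, float *d_out, int n) {
    // Describe the resource: a 1D region of linear device memory holding floats
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    // Describe how it is read: raw elements, no normalization
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    scale<<<(n + 255) / 256, 256>>>(d_out, tex, n);
    cudaDestroyTextureObject(tex);
}
```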
Registers

Register file (RF): 8192 registers (32 bits each) in the G80; the Fermi, Kepler and Maxwell counts were lost in transcription. The RF provides 4 operands per clock cycle.

The registers are dynamically divided among the thread blocks. Once assigned to a block, a register is not accessible by threads in other blocks; each thread can only access the registers assigned to it.

Exercise: SM occupation (Fermi architecture)
Assume that the number of threads in a block is 256 and that the total number of registers per SM is 16384.

If each thread uses 10 registers, how many blocks can run on each SM?
- registers per block = 256 × 10 = 2560
- blocks per SM = 16384 / 2560 = 6
- threads per SM = 6 × 256 = 1536 threads per SM: maximum SM occupation.

Now assume that each thread uses 20 registers:
- registers per block = 256 × 20 = 5120
- blocks per SM = 16384 / 5120 = 3
- threads per SM = 3 × 256 = 768 < 1536 threads per SM: the SM is not fully occupied.
Read-Only Data Cache Load Function

The read-only data cache load function is only supported by devices of compute capability 3.5 and higher. Using

T __ldg(const T* address);

returns the data of type T located at address, where T is char, short, int, unsigned short, unsigned int, unsigned long long, int2, int4, uint2, uint4, float, float2, float4, double or double2. Alternatively, data can be loaded through the read-only data cache by marking the pointers used for loading with both the const and __restrict__ qualifiers.

The operation is cached in the read-only data cache. Much easier to use than textures.
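A sketch of the two ways to route loads through the read-only data cache (CC ≥ 3.5); both kernels are illustrative:

```cuda
#include <cuda_runtime.h>

// Explicit: force the load through the read-only data cache with __ldg()
__global__ void copy_ldg(float *out, const float *in, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = __ldg(&in[i]);
}

// Implicit: const + __restrict__ tells the compiler the data is read-only
// and not aliased, so it may use the read-only cache on its own
__global__ void copy_restrict(float * __restrict__ out,
                              const float * __restrict__ in, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```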
Memory — Location — Cached — Access — Who?
- Local — off-chip — no — read/write — one thread
- Shared — on-chip — n/a — read/write — all threads within the same block
- Global — off-chip — no — read/write — all threads
- Constant — off-chip — yes — read — all threads
- Texture — off-chip — yes — read — all threads

Each thread can:
- read/write per-thread registers
- read/write per-block shared memory
- read/write per-grid global memory
- read/write per-thread local memory
- read only per-grid constant memory
- read only per-grid texture memory

CUDA tool to detect errors with memory accesses: cuda-memcheck
Error Management

Every CUDA call returns an error code (cudaError_t type), except kernel launches:

#define CUDA_SAFE_CALL(call) {                                       \
    cudaError_t err = call;                                          \
    if (cudaSuccess != err) {                                        \
        fprintf(stderr, "Cuda error in file %s in line %i: %s.\n",   \
                __FILE__, __LINE__, cudaGetErrorString(err));        \
        exit(EXIT_FAILURE);                                          \
    }                                                                \
}

const char* cudaGetErrorString(cudaError_t code) returns a NULL-terminated sequence of characters describing the error.
Error Management

cudaError_t cudaGetLastError(void) returns the error code of the last error (cudaError_t type); this is the only way to check for errors in kernels:

#define CHECK_ERROR(errorMessage) {                                      \
    cudaError_t err = cudaGetLastError();                                \
    if (cudaSuccess != err) {                                            \
        fprintf(stderr, "Cuda error: %s in file %s in line %i: %s.\n",   \
                errorMessage, __FILE__, __LINE__,                        \
                cudaGetErrorString(err));                                \
        exit(EXIT_FAILURE);                                              \
    }                                                                    \
}
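A typical usage of the two macros above (the kernel, launch configuration and array names are hypothetical):

```cuda
// Runtime calls return cudaError_t directly, so wrap them in CUDA_SAFE_CALL:
CUDA_SAFE_CALL(cudaMalloc((void**)&array_d, N * sizeof(float)));
CUDA_SAFE_CALL(cudaMemcpy(array_d, array_h, N * sizeof(float),
                          cudaMemcpyHostToDevice));

kernel<<<nblocks, nthreads>>>(array_d, N);  // launches return no error code...
CHECK_ERROR("kernel launch");               // ...so check with cudaGetLastError()

// Synchronize to also catch errors raised during asynchronous execution
CUDA_SAFE_CALL(cudaDeviceSynchronize());
```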
Conclusion

NVIDIA GPUs have different memory spaces. In terms of speed, the device-memory ranking is:
1. register file
2. shared memory
3. constant memory
4. texture memory
5. last place: local memory and global memory

Accessing data in global memory is critical to the performance of a CUDA application: data accesses must be coalesced, so find a data layout that satisfies this criterion or minimizes the impact. Use atomic memory operations to avoid race conditions. Every CUDA call returns an error, except kernel launches.
Thanks

N. Cardoso & P. Bicudo — CUDA Memory Model
More informationHigh Performance Linear Algebra on Data Parallel Co-Processors I
926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018
More informationGPU Programming Using CUDA
GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs CUDA Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University CUDA - Compute Unified Device
More informationEEM528 GPU COMPUTING
EEM528 CS 193G GPU COMPUTING Lecture 2: GPU History & CUDA Programming Basics Slides Credit: Jared Hoberock & David Tarjan CS 193G History of GPUs Graphics in a Nutshell Make great images intricate shapes
More informationGPU Programming Using CUDA. Samuli Laine NVIDIA Research
GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationIntroduction to CUDA C
Introduction to CUDA C What will you learn today? Start from Hello, World! Write and launch CUDA C kernels Manage GPU memory Run parallel kernels in CUDA C Parallel communication and synchronization Race
More information04. CUDA Data Transfer
04. CUDA Data Transfer Fall Semester, 2015 COMP427 Parallel Programming School of Computer Sci. and Eng. Kyungpook National University 2013-5 N Baek 1 CUDA Compute Unified Device Architecture General purpose
More informationProgramming with CUDA
Programming with CUDA Jens K. Mueller jkm@informatik.uni-jena.de Department of Mathematics and Computer Science Friedrich-Schiller-University Jena Tuesday 19 th April, 2011 Today s lecture: Synchronization
More informationAn Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center
An Introduction to GPU Architecture and CUDA C/C++ Programming Bin Chen April 4, 2018 Research Computing Center Outline Introduction to GPU architecture Introduction to CUDA programming model Using the
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationShared Memory and Synchronizations
and Synchronizations Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics Technology SM can be accessed by all threads within a block (but not across blocks) Threads within a block can
More informationCUDA C Programming Mark Harris NVIDIA Corporation
CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment
More informationCS377P Programming for Performance GPU Programming - I
CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationAn Introduction to GPU Computing and CUDA Architecture
An Introduction to GPU Computing and CUDA Architecture Sarah Tariq, NVIDIA Corporation GPU Computing GPU: Graphics Processing Unit Traditionally used for real-time rendering High computational density
More informationHands-on CUDA Optimization. CUDA Workshop
Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationTutorial: Parallel programming technologies on hybrid architectures HybriLIT Team
Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice
More informationIntroduction to CUDA C
NVIDIA GPU Technology Introduction to CUDA C Samuel Gateau Seoul December 16, 2010 Who should you thank for this talk? Jason Sanders Senior Software Engineer, NVIDIA Co-author of CUDA by Example What is
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationBasic CUDA workshop. Outlines. Setting Up Your Machine Architecture Getting Started Programming CUDA. Fine-Tuning. Penporn Koanantakool
Basic CUDA workshop Penporn Koanantakool twitter: @kaewgb e-mail: kaewgb@gmail.com Outlines Setting Up Your Machine Architecture Getting Started Programming CUDA Debugging Fine-Tuning Setting Up Your Machine
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationCUDA C/C++ BASICS. NVIDIA Corporation
CUDA C/C++ BASICS NVIDIA Corporation What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on industry-standard C/C++ Small set of extensions
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationGPU COMPUTING. Ana Lucia Varbanescu (UvA)
GPU COMPUTING Ana Lucia Varbanescu (UvA) 2 Graphics in 1980 3 Graphics in 2000 4 Graphics in 2015 GPUs in movies 5 From Ariel in Little Mermaid to Brave So 6 GPUs are a steady market Gaming CAD-like activities
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationCUDA C/C++ Basics GTC 2012 Justin Luitjens, NVIDIA Corporation
CUDA C/C++ Basics GTC 2012 Justin Luitjens, NVIDIA Corporation What is CUDA? CUDA Platform Expose GPU computing for general purpose Retain performance CUDA C/C++ Based on industry-standard C/C++ Small
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationGlobal Memory Access Pattern and Control Flow
Optimization Strategies Global Memory Access Pattern and Control Flow Objectives Optimization Strategies Global Memory Access Pattern (Coalescing) Control Flow (Divergent branch) Global l Memory Access
More informationLecture 6. Programming with Message Passing Message Passing Interface (MPI)
Lecture 6 Programming with Message Passing Message Passing Interface (MPI) Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Finish CUDA Today s lecture Programming with message passing 2011
More informationOptimization Strategies Global Memory Access Pattern and Control Flow
Global Memory Access Pattern and Control Flow Optimization Strategies Global Memory Access Pattern (Coalescing) Control Flow (Divergent branch) Highest latency instructions: 400-600 clock cycles Likely
More informationCUDA Programming. Aiichiro Nakano
CUDA Programming Aiichiro Nakano Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science
More informationCUDA Memory Hierarchy
CUDA Memory Hierarchy Piotr Danilewski October 2012 Saarland University Memory GTX 690 GTX 690 Memory host memory main GPU memory (global memory) shared memory caches registers Memory host memory GPU global
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationLecture 6b Introduction of CUDA programming
CS075 1896 1920 1987 2006 Lecture 6b Introduction of CUDA programming 0 1 0, What is CUDA? CUDA Architecture Expose GPU parallelism for general-purpose computing Retain performance CUDA C/C++ Based on
More informationHPCSE II. GPU programming and CUDA
HPCSE II GPU programming and CUDA What is a GPU? Specialized for compute-intensive, highly-parallel computation, i.e. graphic output Evolution pushed by gaming industry CPU: large die area for control
More informationGPU Computing: A Quick Start
GPU Computing: A Quick Start Orest Shardt Department of Chemical and Materials Engineering University of Alberta August 25, 2011 Session Goals Get you started with highly parallel LBM Take a practical
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationBasic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationGPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018
GPU 1 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationReal-time Graphics 9. GPGPU
Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationReal-time Graphics 9. GPGPU
9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationCOSC 6385 Computer Architecture. - Data Level Parallelism (II)
COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length
More informationIntroduc)on to GPU Programming
Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationNVIDIA CUDA Compute Unified Device Architecture
NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0 6/7/2008 ii CUDA Programming Guide Version 2.0 Table of Contents Chapter 1. Introduction...1 1.1 CUDA: A Scalable Parallel
More informationCUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA
CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes
More informationCOSC 6374 Parallel Computations Introduction to CUDA
COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More information