CUDA Memory Model. N. Cardoso & P. Bicudo. Física Computacional (FC5)


1 CUDA Memory Model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA Memory Model 1/32

2 Outline
1 Memory System
  Global Memory
  Shared Memory
  L2/L1 Cache
  Constant Memory
  Texture Memory
  Registers
  Read-Only Data Cache Load Function
2 Error Management
N. Cardoso & P. Bicudo CUDA Memory Model 2/32

3 Memory System
Stage 2: GPGPU
  Trick the GPU into doing general-purpose computing.
  Programmable, but requires knowledge of computer graphics.
Stream Processing Platforms
  High-level GPU programming interface.
  No knowledge of computer graphics is required.
  Examples: NVIDIA's CUDA, OpenCL.
CPU and GPU have different memory spaces
  Data is transferred between the CPU and the GPU through the PCIe bus.
  CUDA has functions to allocate/initialize/copy/release memory on the GPU.
Pointers are only addresses
  It is not possible to distinguish a CPU pointer from a GPU pointer by looking at its value. Be very careful.
  There is a CUDA function that can be used to check:
    cudaError_t cudaPointerGetAttributes(struct cudaPointerAttributes *attributes, void *ptr);
  however, when this function is used with a plain CPU pointer, cuda-memcheck flags the call as an error.
[Figure: how does it work? Data is copied back and forth between PC memory (CPU side) and GPU memory over PCIe.]
N. Cardoso & P. Bicudo CUDA Memory Model 3/32
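
A minimal usage sketch for cudaPointerGetAttributes(), under the caveat above (only the device-pointer case is safe to query; the allocation size is an illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float *d_ptr;
        cudaMalloc((void**)&d_ptr, 100 * sizeof(float));

        cudaPointerAttributes attr;
        cudaError_t err = cudaPointerGetAttributes(&attr, d_ptr);
        if (cudaSuccess == err)
            printf("pointer known to CUDA, resides on device %d\n", attr.device);
        else
            printf("pointer not known to CUDA: %s\n", cudaGetErrorString(err));

        cudaFree(d_ptr);
        return 0;
    }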

4 Memory System
Local memory: private to each thread.
Shared memory: per block; shared by the threads within the same block; used for communication between threads.
Global memory: per application; shared among all threads; inter-grid communication.
Constant memory: read/write by the CPU; read-only by any thread.
L1 and L2 cache: only available from the Fermi architecture onwards.
Texture memory: connected to the global memory; can be used as a cache; read-only by threads; updated between kernel calls.
[Figure: memory spaces per thread/block/grid: registers and local memory per thread (read/write), shared memory per block (read/write), global memory per grid (read/write), constant memory per grid (read-only), texture memory per grid (read-only); sequential grids in time share the global memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 4/32

5 Global Memory
CPU: malloc()

    float *array_h = (float*) malloc(N * sizeof(float));

GPU: cudaMalloc(), allocates memory in the global memory

    cudaError_t cudaMalloc(void **devPtr, size_t size);

    float *array_d;
    cudaMalloc((void**)&array_d, N * sizeof(float));

cudaFree(), releases memory allocated with cudaMalloc()

    cudaFree(array_d);

cudaMemset(), memory initialization

    cudaError_t cudaMemset(void *devPtr, int value, size_t count);

[Figure: a grid of blocks, each block with its own shared memory and per-thread registers; the host accesses the global memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 5/32

6 Memory transfers
cudaMemcpy()

    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);

needs 4 parameters:
  dst - destination memory address
  src - source memory address
  count - size in bytes to copy
  kind - type of transfer:
    Host to Host      cudaMemcpyHostToHost
    Host to Device    cudaMemcpyHostToDevice
    Device to Host    cudaMemcpyDeviceToHost
    Device to Device  cudaMemcpyDeviceToDevice
N. Cardoso & P. Bicudo CUDA Memory Model 6/32
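
A minimal sketch putting the last two slides together: allocate on host and device, copy in, launch, copy out, release. The array size N and the doubling kernel are illustrative assumptions, not from the slides:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void doubleElements(float *a, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;   // guard against out-of-range threads
    }

    int main() {
        const int N = 1024;
        float *array_h = (float*) malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) array_h[i] = (float)i;

        float *array_d;
        cudaMalloc((void**)&array_d, N * sizeof(float));
        cudaMemcpy(array_d, array_h, N * sizeof(float), cudaMemcpyHostToDevice);

        doubleElements<<<(N + 255) / 256, 256>>>(array_d, N);

        cudaMemcpy(array_h, array_d, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("array_h[10] = %f\n", array_h[10]);   // expect 20.0

        cudaFree(array_d);
        free(array_h);
        return 0;
    }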

7 Global Memory Coalescing
Example: Offset Access
Tested on: Tesla C2075 (Fermi, CC 2.0) and GeForce GTX TITAN (Kepler, CC 3.5)

    template <typename T>
    __global__ void offset(T *a, int s) {
        int i = blockDim.x * blockIdx.x + threadIdx.x + s;
        a[i] = a[i] + 1;
    }

[Plot: effective bandwidth (GB/s) vs. offset, for float and double, on the GTX TITAN and the Tesla C2075.]
N. Cardoso & P. Bicudo CUDA Memory Model 7/32

9 Global Memory Coalescing
Example: Stride Access
Tested on: Tesla C2075 (Fermi, CC 2.0) and GeForce GTX TITAN (Kepler, CC 3.5)

    template <typename T>
    __global__ void stride(T *a, int s) {
        int i = (blockDim.x * blockIdx.x + threadIdx.x) * s;
        a[i] = a[i] + 1;
    }

[Plot: effective bandwidth (GB/s) vs. stride, for float and double, on the GTX TITAN and the Tesla C2075.]
N. Cardoso & P. Bicudo CUDA Memory Model 8/32

11 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0), using 256 threads per block; types: int, float, double.
[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, for int, float, and double.]
N. Cardoso & P. Bicudo CUDA Memory Model 9/32

13 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0); consider that each array element is a structure with 2/4 members (AOS):

    template <class T> struct elem2 { T x, y; };
    template <class T> struct elem4 { T x, y, z, w; };

[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, for float, elem2<float>, elem4<float>, double, elem2<double>, and elem4<double>.]
Why the big difference in performance between elem2<double> and elem4<float>? Same size, 16 bytes!
N. Cardoso & P. Bicudo CUDA Memory Model 10/32

15 Global Memory Coalescing
Example: Even/Odd Access
Even/odd access:
  1 Access the array with stride 2.
  2 Change the data layout: the first half of the array is assigned to the even sites and the second half to the odd sites.
    Recovers stride-1 access.
Test: Tesla C2075 (CC 2.0); use the CUDA built-in vector types: float2, float4, double2, double4.
[Table: times (ms), effective bandwidths (MB/s), and speedup in bandwidth of layout 2 over layout 1, comparing float, elem2<float>, float2, elem4<float>, float4, double, elem2<double>, double2, elem4<double>, and double4.]
Performance for double2 and float4 is of the same order!
Why the difference in performance between float2 and elem2<float>, float4 and elem4<float>, double2 and elem2<double>, double4 and elem4<double>?
N. Cardoso & P. Bicudo CUDA Memory Model 11/32

17 Global Memory Coalescing
structs in CUDA and alignment
  Done as in standard C, but the alignment must be specified with the directive __align__(x):
    __align__(4): 4 bytes
    __align__(8): 8 bytes, e.g. 2 floats
    __align__(16): 16 bytes, e.g. 4 floats or 2 doubles
  The maximum CUDA alignment is 128 bits.
Example: elem2<float>

    template <>
    struct __align__(8) elem2<float> {
        float x, y;
    };

therefore elem2<float> = float2.
Example: elem4<float>

    template <>
    struct __align__(16) elem4<float> {
        float x, y, z, w;
    };

therefore elem4<float> = float4.
For structures that don't fit in this range, use SOA or SOAOS (see the sketch below):
  AOS: array of structures
  SOA: structure of arrays
  SOAOS: structure of arrays of structures
  e.g. double4 exceeds 16 bytes; the solution is an SOAOS of double2.
N. Cardoso & P. Bicudo CUDA Memory Model 12/32
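
A minimal sketch of the AOS vs. SOAOS idea for the double4 case mentioned above; the struct and kernel names are illustrative assumptions, not from the slides:

    #include <cuda_runtime.h>

    // AOS: one array of 32-byte structures; a warp reading member-by-member strides through memory.
    struct Elem4d { double x, y, z, w; };

    // SOAOS variant from the slide: split double4 into two 16-byte-aligned double2 arrays.
    struct Soaos4d { double2 *xy, *zw; };

    __global__ void sumAos(const Elem4d *a, double *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) out[i] = a[i].x + a[i].y + a[i].z + a[i].w;  // strided, poorly coalesced
    }

    __global__ void sumSoaos(Soaos4d s, double *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            double2 xy = s.xy[i];   // 16-byte aligned, coalesced loads
            double2 zw = s.zw[i];
            out[i] = xy.x + xy.y + zw.x + zw.y;
        }
    }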

18 Shared Memory: __shared__
Each SM has 16KB/48KB of shared memory on the Fermi architecture (the exact amount depends on the architecture).
The __shared__ qualifier, optionally used together with __device__, declares a variable that:
  resides in the shared memory space of a thread block;
  can be read and written by all threads in the same block; synchronization between threads uses __syncthreads();
  has the lifetime of the block and is only accessible from the threads within that block.
Key to increasing performance: move data into shared memory, then access it several times and/or share it between threads.
[Figure: SM diagram: instruction buffer, register file (RF), I$/C$ L1 caches, operand select, MAD/SFU units, and the shared memory (banks of 32-bit words); shared memory is read/write per block.]
N. Cardoso & P. Bicudo CUDA Memory Model 13/32

19 Shared Memory: __shared__
Shared memory banks
The shared memory is divided into banks of 32-bit words that can be accessed simultaneously (16 banks on pre-Fermi devices; 32 banks on Fermi).
Each architecture has some variations; please check the manual.
N. Cardoso & P. Bicudo CUDA Memory Model 14/32

20 Shared Memory: __shared__
Shared memory bank conflicts
When several threads access the same bank, the GPU serializes the requests, causing a loss in performance (see the padding sketch below).
N. Cardoso & P. Bicudo CUDA Memory Model 15/32
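
A minimal sketch of the classic padding trick for avoiding bank conflicts, assuming 32 banks and a 32x32 tile (the transpose kernel is illustrative, not from the slides; it assumes a square matrix whose width is a multiple of TILE, launched with blockDim = (TILE, TILE)):

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out, int width) {
        // +1 column of padding: element (r, c) lands in bank (c + r) % 32,
        // so reading a column no longer maps every thread to the same bank.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free thanks to the padding
    }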

21 Shared Memory: __shared__
Example 1: static (compile-time constant) memory size

    __global__ void kernel(...) {
        __shared__ float array[BLOCK_Y][BLOCK_X];
        (...) // write to shared memory
        __syncthreads(); // ensure that the threads of the same block executed all previous instructions
        (...)
    }
    kernel<<<nblocks, nthreadsperblock>>>(...);

Example 2: "dynamic" memory size
  The size of the array is determined at launch time.
  All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets.

    __global__ void kernel(...) {
        extern __shared__ float array[];
        (...)
    }
    size_t sharedSize = BLOCK_Y * BLOCK_X * sizeof(float);
    kernel<<<nblocks, nthreadsperblock, sharedSize>>>(...);

N. Cardoso & P. Bicudo CUDA Memory Model 16/32

22 Shared Memory: __shared__
Reduction Example CPU

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    inline double seconds() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1.e-6;
    }

    int main(int argc, char **argv) {
        if (argc < 2) { printf("missing argument.\n"); return 0; }
        int n = atoi(argv[1]);
        int *in = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) in[i] = 1;
        double t0 = seconds();
        int sum = 0;
        for (int i = 0; i < n; i++) sum += in[i];
        double t1 = seconds();
        printf("cpu time: %f\tsum: %d\n", t1 - t0, sum);
        free(in);
        return 0;
    }

N. Cardoso & P. Bicudo CUDA Memory Model 17/32

23 Shared Memory: __shared__
Reduction Example GPU: Host part

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    __global__ void reduce(int *in, int *out, int n); // defined on the next slide
    // seconds() as defined on the CPU slide

    int main(int argc, char **argv) {
        if (argc < 2) { printf("missing argument.\n"); return 0; }
        int n = atoi(argv[1]);
        int *in = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) in[i] = 1;
        int *d_in, *d_sum;
        cudaMalloc(&d_in, n * sizeof(int));
        cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMalloc(&d_sum, sizeof(int));
        const uint bthreads = 256;
        uint nblocks = (n + bthreads - 1) / bthreads;
        size_t smem = bthreads * sizeof(int);
        cudaMemset(d_sum, 0, sizeof(int));
        double t0 = seconds();
        reduce<<<nblocks, bthreads, smem>>>(d_in, d_sum, n);
        cudaDeviceSynchronize();
        double t1 = seconds();
        int sum = 0;
        cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
        printf("gpu time: %f\tsum: %d\n", t1 - t0, sum);
        free(in);
        cudaFree(d_in);
        cudaFree(d_sum);
        cudaDeviceReset();
        return 0;
    }

N. Cardoso & P. Bicudo CUDA Memory Model 18/32

24 Shared Memory: __shared__
Reduction Example GPU: Device part

    __global__ void reduce(int *in, int *out, int n) {
        uint id = threadIdx.x + blockIdx.x * blockDim.x;
        extern __shared__ int smem[];
        smem[threadIdx.x] = (id < n) ? in[id] : 0;
        __syncthreads();
        // compute the sum for the thread block
        for (int dist = blockDim.x / 2; dist > 0; dist /= 2) {
            if (threadIdx.x < dist)
                smem[threadIdx.x] += smem[threadIdx.x + dist];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(out, smem[0]);
    }

[Figure: parallel reduction with sequential addressing: the stride halves at each step (8, 4, 2, 1) while the active thread IDs stay contiguous; sequential addressing is conflict free.]
atomicAdd(): avoids race conditions (all blocks read/write the same global memory address); no other thread can access this address until the operation is complete.
CUDA atomic operations: atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), ...
__syncthreads() can be expensive.
./a.out (CPU: Intel Core i-series; GPU: GeForce GTX TITAN, CC 3.5): speedup (GPU vs. CPU): 2.1x.
N. Cardoso & P. Bicudo CUDA Memory Model 19/32

25 Shared Memory: __shared__
Reduction Example GPU: Device part (fewer sync barriers)

    #define WARP_SIZE 32
    __global__ void reduce(int *in, int *out, int n) {
        uint id = threadIdx.x + blockIdx.x * blockDim.x;
        extern __shared__ int smem[];
        smem[threadIdx.x] = (id < n) ? in[id] : 0;
        __syncthreads();
        // Only one active warp does the reduction; blocks are always a multiple of the warp size!
        if (blockDim.x > WARP_SIZE && threadIdx.x < WARP_SIZE) {
            for (uint s = 1; s < blockDim.x / WARP_SIZE; s++)
                smem[threadIdx.x] += smem[threadIdx.x + WARP_SIZE * s];
        }
        // __syncthreads(); // no need to synchronize inside a warp!
        // One thread does the final warp reduction
        if (threadIdx.x == 0) {
            int sum = 0;
            for (uint s = 0; s < WARP_SIZE; s++) sum += smem[s];
            atomicAdd(out, sum);
        }
    }

Using only one sync barrier: speedup (GPU vs. CPU): 4.6x; speedup (GPU vs. GPU): 2.2x.
N. Cardoso & P. Bicudo CUDA Memory Model 20/32

26 L2/L1 Cache
Only available on Fermi and newer architectures.
The user cannot control the L2/L1 caches directly; however, the split between L1 cache and shared memory can be configured (one configuration option per kernel).
Configuration of L1 cache vs. shared memory:
  cudaFuncSetCacheConfig(kernel_name, Mode);
  the initial configuration is 48KB of shared memory and 16KB of L1 cache (Fermi architecture).
Mode:
  cudaFuncCachePreferShared: 48KB of shared memory and 16KB of L1 cache
  cudaFuncCachePreferL1: 16KB of shared memory and 48KB of L1 cache
  cudaFuncCachePreferNone: default
N. Cardoso & P. Bicudo CUDA Memory Model 21/32
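
A minimal sketch, assuming a hypothetical kernel myKernel: request the larger-L1 split before launching a kernel that makes little use of shared memory:

    __global__ void myKernel(float *data, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;   // no shared memory used: prefer a larger L1
    }

    void launch(float *d_data, int n) {
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1); // 16KB shared / 48KB L1 on Fermi
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }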

27 Constant Memory: __constant__
  Has the lifetime of an application.
  Is accessible from all the threads within the grid, and from the host through the runtime library.
  The host can read and write it; the device can only read from it.
  Limited size: on the Tesla C2075, only 64KB.
  As fast as register access, but only if all threads in the same warp access the same position.
Host control:
  cudaGetSymbolAddress()
  cudaGetSymbolSize()
  cudaMemcpyToSymbol()
  cudaMemcpyFromSymbol()
Example, host side:

    float pi = 3.14159265f;
    cudaMemcpyToSymbol(Pi, &pi, sizeof(float));

    float *array_h;
    (...)
    cudaMemcpyToSymbol(array_d, array_h, N * sizeof(float));

Device declaration (defined as a global variable/array, as in standard C/C++):

    __constant__ float Pi;
    __constant__ float array_d[N]; // only static sizes allowed

N. Cardoso & P. Bicudo CUDA Memory Model 22/32
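
A minimal sketch tying the two halves together, assuming the Pi symbol declared above (the kernel is an illustrative assumption): the host sets the constant with cudaMemcpyToSymbol() and every thread reads the same position, which is the fast path.

    __constant__ float Pi;

    __global__ void circleArea(const float *radius, float *area, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            area[i] = Pi * radius[i] * radius[i]; // all threads in a warp read the same constant
    }

    void setup() {
        float pi = 3.14159265f;
        cudaMemcpyToSymbol(Pi, &pi, sizeof(float)); // host writes; the device only reads
    }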

28 Texture Memory: Reference API
A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime.
Texture memory procedure:
  Allocation: linear memory or cudaArrays
  Texture reference: "binding/unbinding"
  Texture memory access: texture fetch

    texture<Type, Dim, ReadMode> tex_name;

Binds the host side and the device side of the texture access.
Type: float, int, uchar4, etc.
  Does not support double precision explicitly; solution: use int2 as the Type in the device code and convert the int2 to a double with __hiloint2double(X.y, X.x).
Dim: 1, 2, or 3.
ReadMode:
  cudaReadModeNormalizedFloat: [0.0f, 1.0f] for unsigned types and [-1.0f, 1.0f] for signed types
  cudaReadModeElementType: no conversion is done
N. Cardoso & P. Bicudo CUDA Memory Model 23/32

29 Texture Memory: Reference API
Linear memory case
  Allocation: simple, with cudaMalloc().
  Textures bound to linear memory can only be one-dimensional and support up to 2^27 elements.
Texture reference:
  texture<Type, 1, cudaReadModeElementType> tex_name;
Texture bind:
  cudaBindTexture(&offset, tex_name, cuda_array);
Texture unbind:
  cudaUnbindTexture(tex_name);
Texture access in the device code:
  value = tex1Dfetch(tex_name, index);
N. Cardoso & P. Bicudo CUDA Memory Model 24/32
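
A minimal end-to-end sketch of the reference API with linear memory, assuming a device array d_data of n floats already filled (the kernel and names are illustrative assumptions):

    texture<float, 1, cudaReadModeElementType> tex_in;  // static global, as required

    __global__ void shiftByOne(float *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex_in, i) + 1.0f;  // read through the texture cache
    }

    void run(float *d_data, float *d_out, int n) {
        cudaBindTexture(0, tex_in, d_data, n * sizeof(float)); // 0: offset not needed here
        shiftByOne<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(tex_in);
    }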

30 Texture Memory: Object API
A texture object is created with cudaCreateTextureObject().
It can be passed as a kernel argument.
Only available on Kepler and newer architectures.
When used with linear memory:
  template<class T> T tex1Dfetch(cudaTextureObject_t texObj, int x);
N. Cardoso & P. Bicudo CUDA Memory Model 25/32
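
A minimal sketch (a host-side fragment, assuming an existing device allocation d_data of N floats; error checking omitted) of creating a texture object over linear memory:

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;                       // existing device allocation
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = N * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    // ... launch kernels taking tex as an argument; fetch with tex1Dfetch(tex, i) ...
    cudaDestroyTextureObject(tex);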

31 Registers
Register File (RF), per SM:
  Fermi: 32K 32-bit registers
  Kepler: 64K 32-bit registers
  Maxwell: 64K 32-bit registers
The registers are dynamically divided among the thread blocks:
  once assigned to a block, a register is not accessible by threads in other blocks;
  each thread can only access the registers assigned to it.
Exercise: SM occupancy (Fermi architecture)
  Assume that the number of threads in a block is 256 and that the total number of registers per SM is 16384.
  If each thread uses 10 registers, how many blocks can run on each SM?
    registers per block = 256 x 10 = 2560
    blocks per SM = 16384 / 2560 = 6
    threads per SM = 6 x 256 = 1536: maximum SM occupancy
  Now assume that each thread uses 20 registers:
    registers per block = 256 x 20 = 5120
    blocks per SM = 16384 / 5120 = 3
    threads per SM = 3 x 256 = 768 < 1536: the SM is not fully occupied
[Figure: SM diagram: instruction buffer, register file (RF), I$/C$ L1 caches, operand select, MAD/SFU units, shared memory.]
N. Cardoso & P. Bicudo CUDA Memory Model 26/32
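
The same occupancy arithmetic can be checked programmatically; a minimal sketch using the occupancy API (available since CUDA 6.5; the kernel is an illustrative assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float *a) { a[threadIdx.x + blockIdx.x * blockDim.x] *= 2.0f; }

    int main() {
        int numBlocks = 0;
        const int blockSize = 256;
        // Maximum resident blocks per SM for this kernel, given its actual register usage
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, work, blockSize, 0);
        printf("blocks per SM: %d -> threads per SM: %d\n", numBlocks, numBlocks * blockSize);
        return 0;
    }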

32 Read-Only Data Cache Load Function
The read-only data cache load function is only supported by devices of compute capability 3.5 and higher.
Using
    T __ldg(const T* address);
returns the data of type T located at address, where T is char, short, int, unsigned short, unsigned int, unsigned long long, int2, int4, uint2, uint4, float, float2, float4, double, or double2.
Alternatively, mark the pointers used for loading such data with both the const and __restrict__ qualifiers.
The operation is cached in the read-only data cache.
Much easier to use than textures.
N. Cardoso & P. Bicudo CUDA Memory Model 27/32
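
A minimal sketch of both forms (the kernel is an illustrative assumption; requires compilation for CC 3.5 or higher):

    __global__ void scale(const float* __restrict__ in, float *out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * __ldg(&in[i]); // explicit __ldg; with const + __restrict__ alone, the compiler may emit it on its own
    }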

33

Memory   | Location | Cached | Access     | Who?
Local    | Off-chip | No     | Read/Write | One thread
Shared   | On-chip  | N/A    | Read/Write | All threads within the same block
Global   | Off-chip | No     | Read/Write | All threads
Constant | Off-chip | Yes    | Read       | All threads
Texture  | Off-chip | Yes    | Read       | All threads

Each thread can:
  read/write per-thread registers;
  read/write per-block shared memory;
  read/write per-grid global memory;
  read/write per-thread local memory;
  read only per-grid constant memory;
  read only per-grid texture memory.
CUDA tool to detect errors in memory accesses: cuda-memcheck.
N. Cardoso & P. Bicudo CUDA Memory Model 28/32

34 Error Management
All CUDA calls return an error code (of cudaError_t type), except kernel launches.

    #define CUDA_SAFE_CALL(call) {                                       \
        cudaError_t err = call;                                          \
        if (cudaSuccess != err) {                                        \
            fprintf(stderr, "Cuda error in file %s in line %i: %s.\n",   \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    }

    const char* cudaGetErrorString(cudaError_t code);

returns a null-terminated string describing the error.
N. Cardoso & P. Bicudo CUDA Memory Model 29/32

35 Error Management

    cudaError_t cudaGetLastError(void);

returns the error code (of cudaError_t type) of the last error; the only way to check for errors in kernel launches.

    #define CHECK_ERROR(errorMessage) {                                         \
        cudaError_t err = cudaGetLastError();                                   \
        if (cudaSuccess != err) {                                               \
            fprintf(stderr, "Cuda error: %s in file %s in line %i: %s.\n",      \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                                 \
        }                                                                       \
    }

N. Cardoso & P. Bicudo CUDA Memory Model 30/32
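
A minimal usage sketch of the two macros above (d_in, in, n, and the kernel launch are illustrative assumptions):

    CUDA_SAFE_CALL(cudaMalloc(&d_in, n * sizeof(int)));     // wrap every runtime call
    CUDA_SAFE_CALL(cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice));

    kernel<<<nblocks, nthreads>>>(d_in, n);   // a launch returns nothing...
    CHECK_ERROR("kernel launch");             // ...so query cudaGetLastError() right after it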

36 Conclusion
NVIDIA GPUs have several different memory spaces.
In terms of speed, the device memory ranking is:
  1 Register file
  2 Shared memory
  3 Constant memory
  4 Texture memory
  5 Last place: local memory and global memory
Accessing data in global memory is critical to the performance of a CUDA application:
  data accesses must be coalesced;
  find a data layout that satisfies this criterion, or that minimizes the impact.
Use atomic memory operations to avoid race conditions.
All CUDA calls return an error code, except kernel launches.
N. Cardoso & P. Bicudo CUDA Memory Model 31/32

37 Thanks N. Cardoso & P. Bicudo CUDA Memory Model 32/32
