Memory Management. Memory Access Bandwidth. Memory Spaces.


Memory Management
Bedrich Benes, Ph.D.
Purdue University, Department of Computer Graphics Technology
High Performance Computer Graphics Lab

Memory Access Bandwidth
Host and device have different memory spaces. How fast is the access? (2009: Intel Core i7 CPU, GT200 GPU)
- CPU <-> main memory: approx. 20 GB/s
- GPU <-> main memory: 2x 4 GB/s (read/write)
- GPU <-> GDRAM: approx. 150 GB/s

Memory Spaces
- The device manages its own memory.
- The host manages its own memory and some device memory.
- The host manages data copies between host and device, and device to device (d2d).

Memory Hierarchy Speed
The hierarchy runs from main memory through the L3 cache and the L1/L2 caches to the registers.
- Main memory to L3 cache: 200 cycles
- L1/L2 cache to registers: 5-12 cycles

Device Memory
- Linear memory: the most commonly used; cudaFree(), cudaMalloc(), cudaMallocPitch(), cudaMalloc3D(), etc.
- CUDA arrays: texture memory, surface memory

1) Global Memory (R/W)
- Slow; much slower than shared memory (SM).
- Accessible to all threads.
- Accessible from device and host.
- Lives with the application.
- Up to 6 GB.

2) Constant Memory (R)
- Fast read when all threads access the same location.
- Accessed by all threads.
- Accessible from device and host.
- Lives with the application.
- Limited to 64 KB.

3) Shared Memory (R/W)
- On-chip. Very fast.
- As fast as registers when there are no bank conflicts, or when all threads read the same address.
- Allocated to thread blocks.
- Accessible by any thread within the block.
- Dies with the block.
- Accessible from device.

4) Registers (R/W)
- On-chip. Very fast.
- Allocated to a thread.
- Accessible by one thread.
- Dies with a thread.
- Accessible from device.

5) Local Memory (R/W)
- Can be much slower than SM.
- Allocated to a thread.
- Accessible by one thread.
- Dies with a thread.
- Accessible from device.

6) Texture Memory (R) (CUDA Array)
- Can be 150x slower than SM. If cached, it can be faster.
- Accessible to all threads.
- Accessible from device and host.
- Lives with the application.
- Read through special look-up functions.

SM vs. Global/Local Memory
- Issuing a memory access instruction on the GPU: 4 clock cycles.
- SM access: 4 clock cycles.
- A local/global memory access takes far more cycles on top of that, so SM access is much faster.

Where Does My Variable Live?
Local variables are placed in registers by default. If a kernel uses too many local resources, the compiler can place a variable in local memory instead. To find out where a variable lives, compile with the -ptx or -keep option and inspect the generated assembly code.

In the generated assembly code, the state space of each declaration shows where it lives:

    .reg .u16 %rh<6>;   // registers, unsigned 16-bit integers
    .reg .u32 %r<29>;   // registers, unsigned 32-bit integers
    .reg .f32 %f<24>;   // registers, 32-bit floats
    .local              // local variable

Device and Host Pointers
Coding suggestion: start a variable name with d if it points to device memory and with h if it points to host memory.

    float *dptr; // pointer to device memory
    float *hptr; // pointer to host memory

- Both pointers live in host memory, but they point to different memory spaces.
- Host pointers are accessed and manipulated with standard C/C++ constructs: malloc, free, new, delete.
- Device pointers cannot be used in the same way. They need special functions.

GPU Linear Memory 1D

    cudaMalloc(void **ptr, size_t n)
    cudaMemset(void *ptr, int val, size_t n)
    cudaFree(void *ptr)

Example:

    int n = 128;
    size_t size = n * sizeof(float);
    float *da;
    cudaMalloc((void **)&da, size);
    cudaMemset(da, 0, size);
    cudaFree(da);

Data Copy Linear Memory 1D

    cudaMemcpy(void *dst, const void *src, size_t n, enum cudaMemcpyKind direction)

    enum cudaMemcpyKind:
        cudaMemcpyHostToDevice
        cudaMemcpyDeviceToHost
        cudaMemcpyDeviceToDevice

- The copy does NOT start until all previous CUDA calls complete (it is synchronous).
- It does NOT let the CPU work while copying (it blocks the CPU thread).
- It is a safe call.
- Note: asynchronous calls also exist in CUDA.

Data Copy 1D Example

    float *h_a, *h_b, *h_c; // host pointers
    float *d_a, *d_b, *d_c; // device pointers
    int N = 50000;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_a and h_b and the result h_c in host memory
    h_a = (float *)malloc(size);
    h_b = (float *)malloc(size);
    h_c = (float *)malloc(size);
    for (int i = 0; i < N; i++) {
        h_a[i] = (float)i / N;
        h_b[i] = 1 - (float)i / N;
    }

Data Copy 1D Example (continued)

    // Allocate vectors in device memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    // the kernel would be executed here
    cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice);
    // Copy the result from device memory to host memory
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

GPU Linear Memory 2D

    cudaMallocPitch(void **ptr, size_t *pitch, size_t width, size_t height)

- Used for 2D arrays of width x height, where width is given in bytes.
- The GPU performs better when the data is correctly aligned (on multiples of a power of two).
- pitch reports how many bytes are actually used per row; it can be larger than the requested width.
- The call pads the allocation for good performance.
- Example: for an array whose rows hold 12 floats, CUDA may pad each row to pitch = 16 floats.
- The array is then laid out as rows of pitch bytes, of which only the first width bytes hold columns. You have to account for this whenever you access the array.
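The padding in the 12-float example can be sketched on the host in plain C. The helper name `padded_pitch` and the 64-byte alignment are assumptions chosen only to reproduce the 12 to 16 float case; the real pitch is whatever cudaMallocPitch() returns and depends on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a CUDA call): round a row width in bytes up to
   the next multiple of `alignment` bytes. `alignment` must be a power of two. */
size_t padded_pitch(size_t width_bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (width_bytes + alignment - 1) & ~(alignment - 1);
}
```

With 4-byte floats, a row of 12 floats is 48 bytes, and padded_pitch(48, 64) gives 64 bytes, which is 16 floats per row.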

GPU Linear Memory 2D: Allocation

    const int w = 500, h = 500; // width and height are the same
    float *dptr, a[w][h];
    size_t pitch; // size_t is important
    // the width is passed in bytes
    error = cudaMallocPitch((void **)&dptr, &pitch, w * sizeof(float), h);
    // check the error
    Kernel<<<100, 512>>>(dptr, pitch, w, h);
    Kernel2<<<100, 512>>>(dptr, pitch);

GPU Linear Memory 2D: Access
The row and column are known (from the kernel indices), the pitch is known from the allocation, and the element is of type T. Its location is:

    T *pElement = (T *)((char *)baseAddress + row * pitch) + column;

GPU Linear Memory 2D: Kernel with Nested Loops

    __global__ void Kernel(float *dptr, size_t pitch, int w, int h)
    {
        for (int r = 0; r < h; r++) {
            // start of row r: the pitch is in bytes, so offset a char pointer
            float *row = (float *)((char *)dptr + r * pitch);
            for (int c = 0; c < w; c++) {
                float element = row[c];
                // do some operation
            } // of for c
        } // of for r
    } // of Kernel

GPU Linear Memory 2D: Kernel with Implicit Indexing

    // n is the array dimension, assumed to be a constant defined elsewhere
    __global__ void Kernel2(float *dptr, size_t pitch)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j = blockDim.y * blockIdx.y + threadIdx.y;
        if ((i >= n) || (j >= n)) return;
        float *elm = (float *)((char *)dptr + j * pitch) + i;
        *elm = 0.5f; // sets the value in the 2D array
    } // of Kernel2
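The address formula can be tried on the host, where an ordinary padded 2D array stands in for the pitched device allocation. This is only a sketch: `pitched_element` is a hypothetical helper name, and the element type is fixed to float.

```c
#include <assert.h>
#include <stddef.h>

/* Address of element (row, column) in a pitched array of floats.
   `pitch` is the row stride in bytes, so the row offset is applied to a
   char pointer before the result is indexed by column. */
float *pitched_element(void *base, size_t pitch, size_t row, size_t column)
{
    assert(base != NULL);
    return (float *)((char *)base + row * pitch) + column;
}
```

For a host array declared as float storage[4][16], the pitch is 16 * sizeof(float), and pitched_element(storage, pitch, 2, 5) points at storage[2][5]. Forgetting the cast to char * would scale the row offset by sizeof(float) and land in the wrong row.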

Data Copy Linear Memory 2D

    cudaMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch,
                 size_t w, size_t h, enum cudaMemcpyKind direction)

    enum cudaMemcpyKind:
        cudaMemcpyHostToDevice
        cudaMemcpyDeviceToHost
        cudaMemcpyDeviceToDevice

- The call uses two pitch values, one for the source and one for the destination.
- In host memory, the pitch is usually the size of a row (in bytes).

Data Copy 2D Example

    const int MAX = 500; // will need to be pitched
    float a[MAX][MAX];
    float *dptr;
    size_t pitch;
    int maxbytes = MAX * sizeof(float);
    error = cudaMallocPitch((void **)&dptr, &pitch, maxbytes, MAX);
    error = cudaMemcpy2D(dptr, pitch, a, maxbytes,
                         maxbytes, MAX, cudaMemcpyHostToDevice);
    Kernel2<<<100, 200>>>(dptr, pitch);
    error = cudaMemcpy2D(a, maxbytes, dptr, pitch,
                         maxbytes, MAX, cudaMemcpyDeviceToHost);

GPU Linear Memory 3D

    cudaMalloc3D(struct cudaPitchedPtr *pitchedDevPtr, struct cudaExtent extent)
    cudaMemcpy3D(const struct cudaMemcpy3DParms *p)
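The role of the two pitch values is easier to see in a host-only stand-in for the copy. The function `copy_2d` below is a hypothetical illustration written with memcpy, not the CUDA implementation; it assumes the same argument meaning as above, with the width and both pitches in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only stand-in for a pitched 2D copy: copies `height` rows of
   `width_bytes` bytes each. Rows start `spitch` bytes apart in the source
   and `dpitch` bytes apart in the destination. */
void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
             size_t width_bytes, size_t height)
{
    /* a row must fit inside both strides */
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    for (size_t r = 0; r < height; r++) {
        memcpy((char *)dst + r * dpitch,
               (const char *)src + r * spitch,
               width_bytes);
    }
}
```

Copying a tightly packed array with 12 floats per row into a buffer padded to 16 floats per row uses spitch = 12 * sizeof(float) and dpitch = 16 * sizeof(float); the four padding floats at the end of each destination row are left untouched.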

Constant Memory
- Similar to global variables, but read only.
- Only 64 KB, but very useful.
- Defined with global scope within the kernel file, using the __constant__ qualifier.
- Initialized by the host: cudaMemcpyToSymbol, cudaMemcpyFromSymbol.

Constant Variables

    const float PI = ...;

- Will be in registers as long as there is enough space, otherwise in global memory.
- Will not be in the constant memory.

Page-locked (Pinned) Memory
- Resides on the host.
- Can be read by the GPU directly and processed concurrently with the kernel execution.
- Useful for data that is read only once.

Page-locked (Pinned) Memory

    cudaHostAlloc(void **ph, size_t n, unsigned int attribs)
    cudaFreeHost(void *ph)

attribs can be:
- cudaHostAllocWriteCombined
- cudaHostAllocMapped
- cudaHostAllocPortable

Portable Memory
- By default, page-locked memory has its benefits only for the host thread that created it.
- Making it portable makes the memory available to all host threads.

Write-Combining Memory
- By default, page-locked memory uses the L1 and L2 caches.
- cudaHostAllocWriteCombined makes it write-combined.
- It does not use the cache (leaving more cache for other things).
- It is not snooped during PCI Express transfers (40% faster).
- Reading it from the host is slow, so it should be used for host writes only.

Mapped Memory
- Some devices can map page-locked memory into the device address space.
- No explicit reads/writes between host and device are needed.
- The same page has two pointers, one for the host and one for the device.
- Multiple GPUs can access the same page.

Mapped Memory

    cudaHostGetDevicePointer(void **pd, void *ph, unsigned int flags)

- Takes the host pointer ph (obtained from cudaHostAlloc() with cudaHostAllocMapped) and maps it to the device-space pointer pd.
- flags is unused for now.

Mapped Memory Example

    #if CUDART_VERSION < 2020
    #error No support for mapped memory!
    #endif
    // Check if device 0 supports mapped memory
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);
    if (!devProp.canMapHostMemory) {
        printf("device cannot map host memory!\n");
        exit(EXIT_FAILURE);
    }

    size_t size = 1024 * sizeof(float);
    float *ah, *ad;
    cudaHostAlloc((void **)&ah, size, cudaHostAllocMapped);
    // Get the device pointer to the mapped memory
    cudaHostGetDevicePointer((void **)&ad, (void *)ah, 0);

Page-locked (Pinned) Memory: Speedup?
Test case: vector addition on a Quadro FX 770M, repeated 10,000x, reading and writing the results.

Texture Memory (CUDA Array)
- Can be faster than the global memory.
- Is global memory with cached access.
- The cache is optimized for 2D spatial locality.
- Designed for streaming fetches.
- Read by the kernel using texture fetches.

Texture Memory
- A texture reference is an object.
- A texture must be bound, and it has attributes.
- A texture can be backed by linear memory or by a CUDA array.
- A texture can be shared with OpenGL.

Texture Declaration

    texture<type, dim, readmode> texref;

- type: float or a basic integer type
- dim: 1, 2, 3
- readmode:
  - cudaReadModeNormalizedFloat: values are in the range [0,1] or [-1,1]
  - cudaReadModeElementType: values are in the range of the element type, e.g. 0 to 0xFF

Texture Declaration
- NTC (normalized texture coordinates) are in the range [0,1].
- Floating-point textures allow wrapping and filtering.
- With integer textures, values outside the range are clamped.

Texture Binding

    cudaBindTexture(size_t *offset,
                    const struct textureReference *texref,
                    const void *devPtr,
                    const struct cudaChannelFormatDesc *desc,
                    size_t size)

- offset: returned because of alignment
- texref: the texture to bind
- devPtr: memory address on the device
- desc: channel format
- size: size of the memory

    struct textureReference {
        int normalized;
        enum cudaTextureFilterMode filterMode;
        enum cudaTextureAddressMode addressMode[3];
        struct cudaChannelFormatDesc channelDesc;
    }

normalized:
- if 0, coordinates are in [0, width-1] x [0, height-1] x [0, depth-1]
- if 1, coordinates are in [0,1]^3

filter mode: specifies the filtering mode
- cudaFilterModePoint: nearest-neighbor sampling
- cudaFilterModeLinear: (bi/tri)linear interpolation (valid only for floating-point types)
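The difference between the two filter modes can be sketched on the host for a 1D texture addressed with normalized coordinates. This is a simplified model under stated assumptions, not the hardware behavior: the function names are hypothetical, out-of-range texels are clamped to the edge, texel centers are assumed to sit at (i + 0.5) / width, and the reduced precision of hardware interpolation weights is ignored.

```c
#include <assert.h>

/* Largest integer not greater than x (avoids linking the math library). */
int floor_to_int(float x)
{
    int i = (int)x;
    return ((float)i > x) ? i - 1 : i;
}

/* Keep a texel index inside [0, width-1]. */
int clamp_index(int i, int width)
{
    return i < 0 ? 0 : (i > width - 1 ? width - 1 : i);
}

/* Point filtering: return the texel that contains coordinate u. */
float sample_point(const float *texels, int width, float u)
{
    assert(texels != 0 && width > 0);
    return texels[clamp_index(floor_to_int(u * (float)width), width)];
}

/* Linear filtering: blend the two texels whose centers bracket u. */
float sample_linear(const float *texels, int width, float u)
{
    assert(texels != 0 && width > 0);
    float xb = u * (float)width - 0.5f;  /* position relative to texel centers */
    int i0 = floor_to_int(xb);
    float a = xb - (float)i0;            /* blend weight in [0,1) */
    float t0 = texels[clamp_index(i0, width)];
    float t1 = texels[clamp_index(i0 + 1, width)];
    return (1.0f - a) * t0 + a * t1;
}
```

For the texels {0, 10, 20, 30}, the coordinate u = 0.5 falls on the boundary between the second and third texel: point sampling returns 20, while linear sampling returns their average, 15.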

Texture Binding
address mode: defines what happens with coordinates that are out of range
- cudaAddressModeClamp: clamped to the valid range
- cudaAddressModeWrap: wrapped to the valid range (valid only for floating-point types)

channel description:

    struct cudaChannelFormatDesc {
        int x, y, z, w; // number of bits per component
        enum cudaChannelFormatKind f;
    }

    enum cudaChannelFormatKind:
        cudaChannelFormatKindSigned
        cudaChannelFormatKindUnsigned
        cudaChannelFormatKindFloat

Texture Binding Example: Host Code

    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cu_array; // CUDA array
    cudaMallocArray(&cu_array, &channelDesc, width, height);
    cudaMemcpyToArray(cu_array, 0, 0, h_data, size, cudaMemcpyHostToDevice);
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.addressMode[1] = cudaAddressModeWrap;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = true;
    cudaBindTextureToArray(tex, cu_array, channelDesc);
    // the kernel has no input array; its input is the texture
    BlurKernel<<<dimGrid, dimBlock>>>(d_data, width, height);

Texture Binding Example: Kernel

    texture<float, 2, cudaReadModeElementType> tex;

    __global__ void BlurKernel(float *d_data, int w, int h)
    {
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        float u = x / (float)w; float v = y / (float)h; // the texel itself
        // plus and minus one texel in both u and v
        // (x and y are converted to float first so that x-1 cannot underflow)
        float up = ((float)x + 1) / (float)w; float um = ((float)x - 1) / (float)w;
        float vp = ((float)y + 1) / (float)h; float vm = ((float)y - 1) / (float)h;
        // read from the texture, sum all nine neighbors, divide by nine,
        // and write to global memory
        d_data[y * w + x] = (tex2D(tex, u, v) + tex2D(tex, up, v) +
            tex2D(tex, um, v) + tex2D(tex, u, vp) + tex2D(tex, u, vm) +
            tex2D(tex, up, vp) + tex2D(tex, up, vm) + tex2D(tex, um, vp) +
            tex2D(tex, um, vm)) / 9.f;
    }
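A plain CPU version of the same nine-neighbor average can serve as a reference when checking the kernel. This is a sketch under assumptions: `wrap_index` and `box_blur_wrap` are hypothetical names, wrap addressing is applied to integer texel indices, and no filtering is modeled, so the values are not expected to match the linearly filtered texture fetches exactly.

```c
#include <assert.h>

/* Wrap an integer index into [0, n), mirroring wrap addressing. */
int wrap_index(int i, int n)
{
    int m = i % n;
    return m < 0 ? m + n : m;
}

/* CPU reference: 3x3 box blur of a w x h row-major image with
   wrap-around at the borders. `in` and `out` must not overlap. */
void box_blur_wrap(const float *in, float *out, int w, int h)
{
    assert(in != out && w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    sum += in[wrap_index(y + dy, h) * w + wrap_index(x + dx, w)];
                }
            }
            out[y * w + x] = sum / 9.0f;
        }
    }
}
```

On a 3 x 3 image holding the values 0 to 8, wrapping makes every pixel see all nine values once, so each output pixel is 36 / 9 = 4.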


Texture Memory
- Rather complicated setup.
- Can be faster than the global memory (it is cached).
- Good for large read-only inputs.

Surface Memory (CUDA Array)
- Cubemap.
- Textures with read and write access.
- surf2Dread(), surf2Dwrite()

Reading
- CUDA Programming Guide
- Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, NVIDIA, Morgan Kaufmann, 2010
- Sanders, J., Kandrot, E., CUDA by Example, Addison-Wesley
