CS179 GPU Programming Recitation 4: CUDA Particles

Bathsheba Doyle

1 Recitation 4: CUDA Particles

2 Lab 4
CUDA particle systems, in two parts:
- Simple: a repeat of Lab 3
- Interacting: a flocking simulation

3 Setup
- Two folders are given: particles_simple and particles_interact.
- You must install NVIDIA_CUDA_SDK to compile.
- CUDA is already set up in the lab.

4 particles_simple
- The CPU code should be set up for you, including initial conditions.
- Still, make sure you understand the code, because we give you less for particles_interact.
- Just write the kernel to implement a basic particle system.
- Don't spend too much time on this part; just get something reasonable working.

5 particles_interact
- Less CPU code is given; you need to handle more of the buffers, initial conditions, etc. manually.
- Implement a kernel with some n^2 particle-particle interaction (flocking).
- Use device functions and shared memory!

6 CUDA Programming Basics
Host (CPU) code, for the Runtime API. Global memory management:
    cudaMalloc(void** dev_ptr, size_t count);
    cudaFree(void* dev_ptr);
    cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);
kind is one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, cudaMemcpyDeviceToHost, cudaMemcpyHostToHost.
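As a minimal sketch of the three calls above, the following copies a host array to the device and straight back. The helper name `roundTrip` is made up for this illustration and is not part of the lab code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies `host_in` to the device and back again.
// Returns true only if every runtime call succeeded.
bool roundTrip(const std::vector<float>& host_in, std::vector<float>& host_out)
{
    const size_t bytes = host_in.size() * sizeof(float);
    float* dev_ptr = nullptr;

    // Allocate global memory on the device.
    if (cudaMalloc((void**)&dev_ptr, bytes) != cudaSuccess) return false;

    host_out.assign(host_in.size(), 0.0f);

    // Host -> device, then device -> host.
    bool ok =
        cudaMemcpy(dev_ptr, host_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(host_out.data(), dev_ptr, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dev_ptr);
    return ok;
}
```

Calling `roundTrip(in, out)` with any non-empty `in` should leave `out` equal to `in`. Checking each return value against `cudaSuccess` is one simple way to notice a failed call.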

7 Host code
- More memory management: cudaMemset, plus more specialized functions used in some optimization cases.
- See the CUDA reference manual for more details.
- Remember that these functions cannot be called in device code. You cannot malloc from a thread!

8 Kernel Launches
Now that data can be set up, we need to launch threads to do our computations. The kernel is the program that each thread executes.
- Specify the grid dimension (number of blocks): 1D or 2D grids, no 3D grids; as many blocks as you want.
- Specify the block dimension: 1D, 2D, or 3D blocks of threads, with a maximum of 512 threads per block.
- Optional: shared memory size, stream.
These limits describe compute capability 1.x hardware; compute capability 2.0 and later allow 3D grids and 1024 threads per block.

9 Kernel launches
Host code:
    Kernel_function<<<Dg, Db, (Ns), (S)>>>(param1, ...);
- Dg (dim3): grid dimension.
- Db (dim3): block dimension.
- Ns (size_t): number of bytes of shared memory to allocate dynamically per block. Optional, default 0.
- S (cudaStream_t): associated stream. Optional, default 0.
- Params: whatever parameters the kernel specifies, as in a normal function call. Think of these as uniforms!

10 Kernel functions
Must be declared as:
    __global__ void Func(...) { /* code */ }
Must return void. You would then launch this as:
    Func<<<Dg, Db, Ns>>>(...);
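To make the launch syntax concrete, here is a small sketch that declares a kernel and launches it over a 1D grid of 1D blocks. The names `scaleKernel` and `scaleOnGpu` and the block size of 256 are assumptions chosen for the example; it also assumes a non-empty input.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: each thread scales one element. Declared __global__, returns void.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // guard the partially filled last block
}

// Hypothetical host wrapper: upload, launch, download. Assumes v is non-empty.
std::vector<float> scaleOnGpu(std::vector<float> v, float factor)
{
    const int n = (int)v.size();
    const size_t bytes = n * sizeof(float);

    float* d = nullptr;
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    dim3 Db(256);                           // block dimension: threads per block
    dim3 Dg((n + Db.x - 1) / Db.x);         // grid dimension: enough blocks to cover n
    scaleKernel<<<Dg, Db>>>(d, factor, n);  // Ns and S left at their defaults (0)

    // The launch is asynchronous; this blocking copy waits for the kernel to finish.
    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

The rounding-up in `Dg` launches a few spare threads when `n` is not a multiple of the block size, which is why the kernel checks `i < n`.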

11 Function type qualifiers
- __device__ void func(...): a function executed on the device and callable only from the device.
- __global__ void func(...): a function (kernel) executed on the device and callable only from the host.
- __host__ void func(...): a normal function on the host; cannot be called from the device. The qualifier is optional, since functions default to __host__.

12 Restrictions on functions
__device__ and __global__ functions:
- Do not support recursion.
- Cannot declare static variables inside the function body.
- Cannot have a variable number of arguments.
- Cannot have a function pointer taken to a __device__ function (but you can for __global__).
- The __global__ and __host__ qualifiers cannot be used together.
__global__ functions:
- Must have a void return type.
- Function parameters are limited to 256 bytes in total.
- Launch asynchronously!

13 Variable qualifiers
__device__, __constant__:
- Stored in global and constant memory, respectively.
- Lifetime of the application.
- Accessible from all threads, and from the host through the runtime library.
__shared__:
- Resides in the shared memory space of a thread block.
- Lifetime of the block; only accessible to threads in that block.
- Modifications by one thread aren't guaranteed to be seen by other threads until __syncthreads().

14 Built-In Vector Types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2
- Vector structures with components accessible through .x, .y, .z, .w.
- Construct with make_<type_name>(...), e.g. int2 make_int2(int x, int y);

15 Built-In Vector Types
- dim3: a type based on uint3, used to specify grid dimensions and block dimensions.
- Remember that for grid dimensions, the z component must be 1!
- These types are all available in both host and device code.
- Make a float4 array on the host, then cudaMemcpy it to have that array on the GPU.
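The last point can be sketched as follows: build a float4 array on the host with make_float4, copy it over, and touch one component per thread. The kernel `shiftX`, the helper `makeAndShift`, and the choice of storing 1.0 in w are all assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: shift every particle along x. The other components are left untouched.
__global__ void shiftX(float4* pos, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i].x += dx;
}

// Hypothetical helper: build n particles on the host, move them on the GPU.
// Assumes n > 0.
std::vector<float4> makeAndShift(int n, float dx)
{
    std::vector<float4> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = make_float4((float)i, 0.0f, 0.0f, 1.0f);   // same constructor works on host

    const size_t bytes = n * sizeof(float4);
    float4* d = nullptr;
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x, 1, 1);          // z component stays 1
    shiftX<<<grid, block>>>(d, dx, n);

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Because float4 is the same struct on both sides, no conversion is needed between the host array and the device array.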

16 Built-In Variables (device)
- dim3 gridDim: contains the dimensions of the grid.
- uint3 blockIdx: contains the block index within the grid, from 0 to gridDim.<>-1 in each dimension. Unique within the grid!
- dim3 blockDim: contains the dimensions of the block.
- uint3 threadIdx: thread index within the block, unique within the block, etc.

17 Grids, Blocks, and Threads

18 Shared Memory
- Use shared memory to reduce the number of global memory accesses in the N-body simulation!
- Ns from the kernel launch (<<<Dg, Db, Ns>>>) determines how much shared memory is allocated per block.
- Declare an array that will be mapped to this memory:
    extern __shared__ float4 array[];

19 Shared Memory
- For N-body, you probably need Ns to cover one float4 (4 floats) per body processed by a single block. Ns is measured in bytes, so that is sizeof(float4) * number of bodies.
- You must use __syncthreads() to be certain of the state of shared memory.
- For example, to load a block of memory from global to shared, each thread loads a float4. You must call __syncthreads() before reading that memory back from another thread; otherwise there is no guarantee that it will be read correctly.

20 Memory accesses
- Read memory based on blockIdx and threadIdx:
    Global_array[blockDim.x * blockIdx.x + threadIdx.x];
- Avoid having multiple threads read from the same location at the same time (i.e., both before or both after a __syncthreads()).
- If multiple threads try to write to the same location at the same time, it's guaranteed that one of the writes will occur, but not which one.
- Lots of optimizations here! We'll come back to this later; don't worry too much about it this week.

21 Example: load shared from global
Example for a 1D grid and 1D blocks.
    __global__ void func(float* global) {
        extern __shared__ float array[];
        array[threadIdx.x] = global[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();
        // go do something with it, now that it is in shared mem!
        float value = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            value += array[(threadIdx.x + i) % blockDim.x];
        // Still indexing by threadIdx.x, so we avoid bank conflicts
    }
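A runnable variant of this pattern might look like the sketch below, where each thread writes the sum of its block's tile back to global memory so the result can be checked. The names `tileSum` and `tileSumOnGpu` are hypothetical, and the wrapper assumes the input length is a multiple of the block size.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread loads one element into shared memory, then sums the whole tile,
// starting at its own slot so that neighbouring threads read neighbouring banks.
__global__ void tileSum(const float* in, float* out)
{
    extern __shared__ float tile[];              // sized by Ns at launch
    unsigned int g = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[g];
    __syncthreads();                             // the tile is complete past this point

    float value = 0.0f;
    for (unsigned int i = 0; i < blockDim.x; ++i)
        value += tile[(threadIdx.x + i) % blockDim.x];
    out[g] = value;
}

// Hypothetical host wrapper; assumes in.size() is a non-zero multiple of `threads`.
std::vector<float> tileSumOnGpu(const std::vector<float>& in, int threads)
{
    const int n = (int)in.size();
    const size_t bytes = n * sizeof(float);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const size_t Ns = threads * sizeof(float);   // shared memory bytes per block
    tileSum<<<n / threads, threads, Ns>>>(d_in, d_out);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With two blocks of four threads, every output in the first block holds the sum of the first four inputs, and every output in the second block holds the sum of the last four.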

22 Optimization note
    array[(threadIdx.x + i) % blockDim.x];
% is slow! Use a macro, or something else, instead:
    #define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))
This works for 0 <= x < 2m, which is the relevant case in our for loop example.
    array[WRAP(threadIdx.x + i, blockDim.x)];
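As a quick sanity check on the wrap trick, the sketch below compares a WRAP macro against % over the range the loop actually uses. Note that the wrapped branch has to evaluate to x - m for the two to agree. The checking function `wrapMatchesModulo` is a made-up name for this illustration and runs on the host.

```cuda
// Wrap x back into [0, m). Only valid for 0 <= x < 2*m.
#define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))

// Hypothetical host-side check: does WRAP agree with % for every x in [0, 2m)?
// Assumes m > 0.
bool wrapMatchesModulo(int m)
{
    for (int x = 0; x < 2 * m; ++x)
        if (WRAP(x, m) != x % m) return false;
    return true;
}
```

Since threadIdx.x and i are both below blockDim.x in the loop, their sum stays below 2 * blockDim.x, which is exactly the range checked here.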

23 Device code
- Basically C code, aside from the things we've covered.
- Most standard functions and operators are available: sin, cos, *, /, +, -, bitwise operations, etc.
- There are also some lower-precision, higher-speed intrinsic versions of sin, cos, sqrt, 24-bit multiply, etc. (for example __sinf, __cosf, __mul24).

24 Notes on files
- You can mix C/C++ code with CUDA code freely.
- The convention is to label files containing any CUDA code as .cu instead of .c/.cpp.
- Declarations (e.g., in header files) still need the qualifiers:
    __global__ void kernel_func();
- The SDK has a convention of keeping kernels in a separate .cu file and using #include "kernel_file.cu". If you do this, make sure your build targets notice when that file changes.
- The code given doesn't use includes.

25 CUDA_SAFE_CALL( )
- Macro included with the CUDA utilities.
- Reports error messages when compiled in debug mode; compiled out for release builds.
- Place any host CUDA call inside it:
    CUDA_SAFE_CALL(cudaFree(d_ptr));
- You might as well use it wherever you can, so wrap your host CUDA calls with it for this lab.

26 particles_simple
- Just implement the kernel function, and modify the initial conditions if you want.
- Make something that looks like a reasonable, repeating particle system.
- You've already done most of the legwork here in the previous lab.

27 Flocking simulation (particles_interact)
- Implement the insect flocking model; you can do the bird flocking model for extra credit.
- Buffer particle positions in shared memory, per block.
- For the bird flocking case, you need to buffer particle velocities as well, for reasons to be seen soon.

28 Flocking
A very simple computer model built on three concepts:
- Separation: steer to avoid crowding local flockmates.
- Alignment: steer toward the average heading of local flockmates.
- Cohesion: steer to move toward the average position of local flockmates.
Alignment is not present for insect flocking!

29 Local flockmates?
- Characterized by a distance.
- You can do something fancy, like figuring out what each bird can see.
- Or you could just forget about angles. They're more important for the bird flocking case anyway.

30 Generalized N-body simulation
- 1 thread per particle.
- In that thread, compute the acceleration on its particle due to every other particle.
- A device function that computes the acceleration from each of the steering forces, given two positions, may help!
- Some number of particles per block.
- In each block, repeatedly pull in from global memory the positions of the particles belonging to each block (including its own), sync, then update the net acceleration based on all the positions currently in shared memory.

31 How to actually implement it
- Treat the particles as moving at a constant speed.
- The acceleration term changes the velocity by changing its direction, but you should normalize the speed back to the constant value afterwards.
- Otherwise the birds/locusts will reach equilibrium.

32 Separation
- Treat as a force heading away from neighbors.
- Inverse square works very well here.
- The closer the neighbors, the more steering force is applied.

33 Cohesion
- Steer toward the average position of neighbors.
- The average position is just all of the offset vectors added together, divided by the number of neighbors that are visible.

34 Alignment (bird flocking only)
- Steer toward the average heading of flockmates.
- Requires your force function to depend on both position and velocity, so it is more complicated.
- If you have time to do this, more power to you.
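For the insect model (separation and cohesion only, no alignment), one possible shape for the per-pair work is sketched below. The struct `FlockAccum`, the functions `accumulate` and `steering`, the neighbourhood `radius`, and the weights `wSep` and `wCoh` are all assumptions for this illustration, not the lab's interface. The functions are marked `__host__ __device__` so the arithmetic can be checked on the CPU as well.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical accumulator for one particle's neighbourhood.
struct FlockAccum {
    float3 separation;   // sum of inverse-square pushes away from neighbours
    float3 offsetSum;    // sum of offset vectors toward neighbours (for cohesion)
    int    count;        // number of neighbours inside the radius
};

// Fold one other particle into the accumulator for `self`.
__host__ __device__ inline void accumulate(FlockAccum& a, float4 self, float4 other, float radius)
{
    float3 d = make_float3(other.x - self.x, other.y - self.y, other.z - self.z);
    float dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
    if (dist2 == 0.0f || dist2 > radius * radius) return;   // skip self and distant particles

    // Unit direction (d / r) scaled by 1 / r^2, pointing away from the neighbour.
    float inv = 1.0f / (dist2 * sqrtf(dist2));
    a.separation.x -= d.x * inv;
    a.separation.y -= d.y * inv;
    a.separation.z -= d.z * inv;

    a.offsetSum.x += d.x;
    a.offsetSum.y += d.y;
    a.offsetSum.z += d.z;
    a.count += 1;
}

// Combine the two steering terms; cohesion uses the average offset.
__host__ __device__ inline float3 steering(const FlockAccum& a, float wSep, float wCoh)
{
    if (a.count == 0) return make_float3(0.0f, 0.0f, 0.0f);
    float invN = 1.0f / (float)a.count;
    return make_float3(wSep * a.separation.x + wCoh * a.offsetSum.x * invN,
                       wSep * a.separation.y + wCoh * a.offsetSum.y * invN,
                       wSep * a.separation.z + wCoh * a.offsetSum.z * invN);
}
```

Keeping the sum of offsets and the neighbour count separate means the division for the average happens once per particle rather than once per pair. The two weights are the knobs to tune when the flock collapses or scatters.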

35 Grid of Thread Blocks for N-body

36 N-body simulation
- You probably want a device function to compute the net acceleration on the particle for the current thread.
- Have this function repeatedly pull in blocks of positions from global to shared memory, then update the acceleration based on those positions.
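The tiling loop described here could be sketched roughly as follows. The pairwise term `pairAccn` is a deliberately trivial placeholder (a pull toward the other particle) standing in for the real steering forces, and the names `netAccn` and `netAccnOnGpu` are hypothetical. The sketch assumes the particle count is a multiple of the block size.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder pairwise term (an assumption of this sketch): pull toward the other particle.
__device__ float3 pairAccn(float4 self, float4 other)
{
    return make_float3(other.x - self.x, other.y - self.y, other.z - self.z);
}

// One thread per particle; positions stream through shared memory one tile at a time.
// Assumes n is a multiple of blockDim.x.
__global__ void netAccn(const float4* pos, float4* accn, int n)
{
    extern __shared__ float4 tile[];                  // blockDim.x float4s, sized by Ns
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 self = pos[i];
    float3 sum = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += blockDim.x) {
        tile[threadIdx.x] = pos[base + threadIdx.x];  // each thread loads one float4
        __syncthreads();                              // tile fully loaded

        for (unsigned int j = 0; j < blockDim.x; ++j) {
            float3 a = pairAccn(self, tile[j]);
            sum.x += a.x;
            sum.y += a.y;
            sum.z += a.z;
        }
        __syncthreads();                              // finish reading before the next load
    }
    accn[i] = make_float4(sum.x, sum.y, sum.z, 0.0f);
}

// Hypothetical host wrapper; assumes pos.size() is a non-zero multiple of `threads`.
std::vector<float4> netAccnOnGpu(const std::vector<float4>& pos, int threads)
{
    const int n = (int)pos.size();
    const size_t bytes = n * sizeof(float4);

    float4* d_pos = nullptr;
    float4* d_acc = nullptr;
    cudaMalloc((void**)&d_pos, bytes);
    cudaMalloc((void**)&d_acc, bytes);
    cudaMemcpy(d_pos, pos.data(), bytes, cudaMemcpyHostToDevice);

    netAccn<<<n / threads, threads, threads * sizeof(float4)>>>(d_pos, d_acc, n);

    std::vector<float4> acc(n);
    cudaMemcpy(acc.data(), d_acc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_pos);
    cudaFree(d_acc);
    return acc;
}
```

The second `__syncthreads()` matters as much as the first: without it, a fast thread could overwrite its tile slot with the next block of positions while slower threads are still reading the current one.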

37 N-body simulation
- Once net_accn is computed, you need to update positions and velocities.
- Symplectic Euler integration: for our case this is extremely simple.
- Apply the steps in this order (note that this syntax doesn't work for float4s; you must do the operations per component):
    Vel = cur_vels[index];
    Vel += accn * dt;
    (normalize velocity for constant speed)
    Pos += Vel * dt;
    (write back to newpos/newvel)
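Written out per component, the update order above might look like this sketch. The function name `eulerStep` and the `speed` parameter (the constant speed the velocity is rescaled to) are assumptions for the example. It is marked `__host__ __device__` so it can be called from a kernel or checked on the CPU.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One symplectic Euler step for a single particle, component by component.
// `speed` is the constant speed the velocity is renormalized to.
__host__ __device__ inline void eulerStep(float4& pos, float4& vel, float3 accn,
                                          float dt, float speed)
{
    // 1. Update the velocity with the acceleration first.
    vel.x += accn.x * dt;
    vel.y += accn.y * dt;
    vel.z += accn.z * dt;

    // 2. Rescale so only the direction changed, not the speed.
    float len = sqrtf(vel.x * vel.x + vel.y * vel.y + vel.z * vel.z);
    if (len > 0.0f) {
        float s = speed / len;
        vel.x *= s;
        vel.y *= s;
        vel.z *= s;
    }

    // 3. Update the position with the already updated velocity.
    pos.x += vel.x * dt;
    pos.y += vel.y * dt;
    pos.z += vel.z * dt;
}
```

In a kernel, `pos` and `vel` would be local copies read from the old buffers, and the results would be written to the new buffers afterwards.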

38 Symplectic Euler: state update
- Basic idea: update the velocity with the acceleration first, then use the updated velocity to update the position.
- You need separate oldpos/newpos and oldvel/newvel regions of memory. Why? Block 0 could update its positions before Block 50 starts. Block 50 would then read in the new positions of Block 0's particles, and incorrect results would occur.
- The buffers are made for you, but you have to ping-pong them.

39 State update
- This should be your main kernel for the N-body simulation.
- It shouldn't be a continuous loop; it should take a single step of time dt.
- Bind/unbind the appropriate buffers around the kernel launch so as to ping-pong old/new.
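The ping-pong idea can be sketched with plain global memory, leaving out the VBO binding that the lab needs. The kernel `stepKernel` is a stand-in that only adds dt to each value, and `runSteps` is a hypothetical driver. The point is that each launch reads only the old buffer, writes only the new one, and the two pointers swap roles afterwards.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Stand-in for the real update kernel: reads only oldPos, writes only newPos.
__global__ void stepKernel(const float* oldPos, float* newPos, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) newPos[i] = oldPos[i] + dt;
}

// Hypothetical driver: take `steps` single steps, swapping buffer roles after each.
// Assumes pos is non-empty.
std::vector<float> runSteps(std::vector<float> pos, int steps, float dt)
{
    const int n = (int)pos.size();
    const size_t bytes = n * sizeof(float);

    float* d_old = nullptr;
    float* d_new = nullptr;
    cudaMalloc((void**)&d_old, bytes);
    cudaMalloc((void**)&d_new, bytes);
    cudaMemcpy(d_old, pos.data(), bytes, cudaMemcpyHostToDevice);

    for (int s = 0; s < steps; ++s) {
        stepKernel<<<(n + 127) / 128, 128>>>(d_old, d_new, dt, n);  // one step of dt
        std::swap(d_old, d_new);        // the newest state is now behind d_old
    }

    cudaMemcpy(pos.data(), d_old, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_old);
    cudaFree(d_new);
    return pos;
}
```

Swapping pointers on the host costs nothing and avoids copying the new state back into the old buffer between steps.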

40 Initial conditions
- Pretty simple: make a few clusters of points that are fairly close to each other but not within the neighborhood threshold.
- There are many adjustable parameters, so it is easy to have all of your insects/birds disappear off the screen immediately.
- Start with slow speeds!
- Perhaps weight the separation/alignment/cohesion steering forces differently.

41 A note on the buffers
- The two velocity buffers are simply CUDA global memory, because OpenGL doesn't care about them.
- The position buffers are both VBOs, so OpenGL can render them each frame.
- You need to explicitly tell CUDA about the VBOs.

42 CUDA, OpenGL Interoperability
Interaction happens through buffer objects (PBO, VBO).
- You must initially register the buffer with CUDA:
    cudaGLRegisterBufferObject(bufferObj);
- Map the buffer to a pointer usable in a kernel function (you need this for N-body/ping-ponging):
    cudaGLMapBufferObject((void**)&devPtr, bufferObj);
- Unmap before OpenGL can use it again:
    cudaGLUnmapBufferObject(bufferObj);
- Once done, unregister the buffer:
    cudaGLUnregisterBufferObject(bufferObj);
