CS267 Report Particle Simulation on a GPU with PyCUDA
Min Ragan-Kelley (minrk@berkeley.edu), May 12

1 Introduction

This report covers a small test problem within the context of a larger long-term research project. GPUs are increasingly popular for particle methods, due to the readily apparent parallelism inherent in N-body problems. Particle-In-Cell (PIC) is a popular scheme for exploring systems in plasma physics. We hope to explore a small sample problem in order to gauge the viability of using PyCUDA to run interactive-scale particle simulations on a GPU for use in our research.

1.1 Project Context

The Plasma Theory and Simulation Group (PTSG) at UC Berkeley maintains a large Object-Oriented Particle-In-Cell (OOPIC) code, largely built in the mid-1990s [1]. It is primarily a 2D system, and it is a serial code. Applications of OOPIC range from small interactive simulations, running many timesteps per second on a laptop, to large offline simulations taking days or weeks to run. The goal of the object-oriented design is to facilitate continued development of new physics behaviors over the lifetime of the code. This design has been quite successful, and OOPIC continues to enjoy active use and development.

OOPIC's shortcomings in the present landscape are twofold. First, the User Interface (UI) is mouse-only, button-based input. This makes programmatic interaction with the simulation impossible. The available diagnostics are also static: if a user would like to add a derivative diagnostic (for example, given diagnostics J and B, add J × B), their only option is to add it to the C++ source code and recompile the program. The second major shortcoming is performance. OOPIC is a fast serial code, but parallelism was not considered in the original design. Basic parallel functionality has been implemented with MPI, but it does not scale beyond a few processors.
The goal of this project is to build a system with extensibility similar to the existing OOPIC, but one that solves its parallel performance and interface problems. Our plan is to write the application logic in Python, with the performance components as native C or CUDA kernels. Adding new physics capabilities should be at least as well facilitated as in current OOPIC by
a modular design. Because the interface itself is Python, we get a very powerful programmatic interface and user environment. Having NumPy arrays as the native data structure for diagnostics allows us to use existing plotting libraries such as matplotlib and Chaco, without having to write our own plotting system [5] [7] [8]. NumPy arrays are also the native data structure of most data analysis tools in Python, so our users gain access to all the powerful facilities of tools such as SciPy for free, just by our decision to present diagnostics as NumPy arrays [6].

Hierarchical parallelism for multi-device simulations will be part of the design from the beginning. Spatial decomposition occurs at the device level, and each spatial region communicates with the others via MPI or shared memory in host code. Each device has its own Python process, issuing control and diagnostic commands; the Python processes communicate via the IPython parallel computing kernel, and the native code backends will also be able to communicate with each other directly, via MPI. A parent Controller process is the user's interface to the whole simulation, merging the diagnostic information from the various components, as seen in Figure 1. Our ultimate goal is to be able to utilize a system of one or more machines on a network, where each machine has one or more CPU cores and one or more CUDA-capable GPUs. Our short-term plan, however, is just to explore the use of a single GPU.

Figure 1: Schematic of the connections in the system (IPython Controller, IPython Kernel, and C simulation backends). Python processes communicate control commands with each other and the native backends, while the backends communicate directly with each other using high-performance systems, such as MPI.

1.2 PyCUDA

PyCUDA is a module for Python that exposes the CUDA drivers to Python via the Boost library [4]. It has a variety of sophisticated functionalities.
At its most basic, it works as a translation layer, allowing Python code to launch CUDA kernels and transfer data to and from CUDA-enabled devices. In our case, we wrote a kernel fully in C+CUDA, set up the simulation with Python, and launched the CUDA kernel on the Python-defined data via PyCUDA.
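As a minimal sketch of that workflow (the kernel below is a hypothetical drift kernel for illustration, not the report's interaction kernel; a CUDA-capable GPU and PyCUDA are assumed at call time):

```python
# Hypothetical kernel for illustration only: each thread advances one
# particle's 2D position by vel * dt.
KERNEL_SRC = r"""
__global__ void advance(float2 *pos, const float2 *vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
    }
}
"""

def launch_advance(pos, vel, dt, tpb=256):
    """Compile KERNEL_SRC with PyCUDA and launch it on NumPy-defined data.
    pos and vel are contiguous float32 arrays of shape (n, 2)."""
    import numpy as np
    import pycuda.autoinit          # noqa: F401 -- creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    n = len(pos)
    advance = SourceModule(KERNEL_SRC).get_function("advance")
    grid = ((n + tpb - 1) // tpb, 1)
    # cuda.InOut / cuda.In copy the arrays to the device and back for us.
    advance(cuda.InOut(pos), cuda.In(vel), np.float32(dt), np.int32(n),
            block=(tpb, 1, 1), grid=grid)
    return pos
```

Setting up initial conditions and diagnostics stays in ordinary NumPy on the host; only the inner loop lives in CUDA.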
1.3 Test Problem

The scope of this project is a small sample simulation, to test the feasibility of using PyCUDA as a backend for particle simulations. The model to be simulated is a combination of the one explored in CS267 HW2 and the N-body simulation found in NVIDIA's GPU Gems 3 and the CUDA SDK [2] [3]. The Gems simulation is a 3D all-pairs N-body gravitational (inverse-square attractive) simulation with free boundary conditions. We adapted this to be more like the simulation in HW2. Our simulation is a 2D inverse-square repulsive (Coulomb) simulation, where interactions beyond a certain distance are approximated as zero (short-range force approximation). The boundary conditions are specular (elastic) reflection off the x boundaries, and periodic boundary conditions in y. Particles are given a random (uniformly distributed) initial velocity in the +y direction, and a random (uniformly distributed) velocity in ±x, one order of magnitude smaller than the +y velocity:

    v_0 = (v/10 · rand(−1, 1)) x̂ + (v · rand(0, 1)) ŷ    (1)

Particles are randomly distributed in the region, centered in x, filling half the width of the container and the full height:

    r_0 = (X/4 · rand(1, 3)) x̂ + (Y · rand(0, 1)) ŷ    (2)

where X ranges from Y/10 to Y/2.

Figure 2: Schematic of the simulation. As particles leave the top, they wrap around the bottom.

The resulting simulation approximates a sheet beam of electrons under free expansion, infinite in y and z; see Figure 2.

2 Implementation

Starting with the Gems N-body kernel, our major adaptations were to account for the short-range approximation, reduce the dimension to 2D, and apply boundary conditions. The short-range interaction means that most interactions are zero. To deal with this, the model, as in HW2, was to
break up the simulation into subregions. All interactions within a subregion are computed. Each subregion also contains ghost particles from neighboring subregions within the cutoff distance. In 2D, it takes 15 FLOPS to compute a non-zero interaction, and 6 FLOPS to compute a zero interaction, as seen in the interaction routine below:

    // r_ij  [2 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;

    // distSqr = dot(r_ij, r_ij) + epsilon^2  [4 FLOPS]
    T distSqr = r.x * r.x + r.y * r.y;
    distSqr += getSofteningSquared<T>();

    T s = 0;
    if (distSqr < cutoffSqr) {  // branch
        // invDistCube = 1 / distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
        T invDist = rsqrt_T(distSqr);
        T invDistCube = invDist * invDist * invDist;

        // s = m_j * invDistCube  [1 FLOP]
        s = bj.w * invDistCube;
    }

    // a_i = a_i + s * r_ij  [4 FLOPS]
    ai.x += r.x * s;
    ai.y += r.y * s;

2.1 Subregions

For simplicity, we only implemented 1D subdivisions, so each subregion fills the width of the system for a slice in y. The model computes a nonzero interaction (15 FLOPS) for every particle within the cutoff range, and a zero interaction (6 FLOPS) for every particle in the subdomain that is not within the cutoff range. By comparing the area within the cutoff radius to the area of the subdomain, we can see approximately how much work we waste:

    A_sub = X (Y/n_subs + 2 r_c)    (3)
    A_nonzero = π r_c²    (4)
    real work ∝ 15 A_nonzero    (5)
    wasted effort ∝ 6 (A_sub − A_nonzero)    (6)

It would make sense to have square subdomains only slightly larger than the cutoff radius; this would cut down on the wasted effort a great deal. However, the bookkeeping involved in such a scheme requires a great many if-tests and memory accesses, which are much more expensive relative to computation on a GPU than on a CPU. So we decided to waste FLOPS rather than spend extra time and memory on bookkeeping, as is common practice when programming GPUs.
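Equations (3)-(6) are easy to evaluate numerically. A small sketch (the function name and sample values are ours, using Y = 1, X = Y/10, and r_c = Y/256 as in the plots of Section 3):

```python
import math

def work_estimate(X, Y, r_c, n_subs):
    """Approximate useful vs. wasted work per particle for a 1D slice
    decomposition in y with ghost regions of width r_c (Eqs. 3-6)."""
    A_sub = X * (Y / n_subs + 2 * r_c)    # area searched per particle   (3)
    A_nonzero = math.pi * r_c ** 2        # area with a nonzero force    (4)
    real = 15 * A_nonzero                 # 15 FLOPS per nonzero pair    (5)
    wasted = 6 * (A_sub - A_nonzero)      # 6 FLOPS per zero pair        (6)
    return real, wasted

# More subregions shrink the searched area, so less effort is wasted,
# while the useful work is unchanged.
for n_subs in (4, 16, 64, 256):
    real, wasted = work_estimate(0.1, 1.0, 1.0 / 256, n_subs)
    print(n_subs, wasted / real)
```

The useful work depends only on r_c, so the ratio of wasted to real work falls as the slices get thinner, down to the floor set by the 2 r_c ghost band.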
We intended to quantify the performance difference, but did not succeed in implementing the latter method in time.

Figure 3: Diagram showing the cutoff radius of the red particle near the edge of a subdomain, including the ghost particles. The ghost region is bounded by the dotted line; particles outside the dotted line reside in a different subregion and are not considered for interactions, since they are guaranteed to be outside r_c.

Figure 4: Total work computing interactions (including zeros) for 4, 16, 64, and 256 subregions, versus the size of the cutoff distance relative to the simulated system (Y/r_c). The dotted line is the amount of work spent calculating non-zero interactions.
The scheme can be summarized as follows:

    for each subdomain S:                                   # parallel (rows in grid)
        for each particle p in S excluding ghost particles: # parallel (threads/thread blocks)
            for each particle q in S including ghost particles:
                interact(p, q)
            advance(p)
            if p will leave:
                add p to send queues
    send/recv new particles
    send/recv ghost particles

2.2 Threads and Blocks

The kernel is run with a number of CUDA thread blocks that is evenly divisible by the number of subdivisions, so that each subdivision gets the same number of thread blocks. It is a single-particle kernel, and the number of blocks is arranged such that there is a maximum number of threads (and thus particles) in a given subdomain. Given the relative uniformity of this simulation, we simply use a safety factor of 1.5, allowing for 1.5x the number of particles expected from the average density. This scheme of thread allocation is not appropriate for nonuniform workloads, but is adequate for the test problem.

3 Performance

We ran our simulation in single precision on NVIDIA GT200 GPUs (GTX 260 and Tesla C1060). We found that choosing the subregion height to be twice the cutoff distance minimizes wasted computation; our plots all use a cutoff distance r_c = Y/256 and X = Y/10.
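The block-allocation arithmetic above can be sketched as follows (the function name is ours):

```python
import math

def num_thread_blocks(n_particles, n_subs, tpb, safety=1.5):
    """Reserve safety * the average particle count per subdomain, round up
    to whole blocks, and give every subdomain the same number of blocks so
    the total is evenly divisible by the number of subdivisions."""
    per_sub = safety * n_particles / n_subs
    blocks_per_sub = max(1, math.ceil(per_sub / tpb))
    return blocks_per_sub * n_subs

# e.g. 100,000 particles in 128 subregions at 256 threads per block:
# 1.5 * 100000 / 128 = 1171.875 slots -> 5 blocks per subregion -> 640 blocks.
```

Because every subregion gets the same block count, a subregion that temporarily exceeds the 1.5x allowance would overflow its threads; the uniform test problem stays safely below that bound.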
Figure 5: Performance (GFLOPS versus N) for various arrangements of threads per block (tpb) on the C1060 and GTX 260. Note that 32 tpb is slower than larger numbers, but on average 64 tpb and 256 tpb are very similar.

Figure 6: Runtime (normalized) versus N for the same configurations. Note the step function for 256 threads per block.

3.1 Periodicity

In the plot of performance in Figure 5, there is an interesting periodic structure. Looking at a closeup of time versus N in Figures 6 and 7, we see that the runtime is close to a step function for larger thread blocks. The length of the steps appears to be proportional to (and slightly greater than) the number of threads per block times the number of SMs on the chip. This makes sense: if the GPU is fully utilized, it is running one thread block on each SM at a time. If the number of blocks is evenly divisible by the number of SMs, then the GPU will be near full load for the whole simulation. However, if N_blocks % N_SMs = 1, then the GPU will spend an entire round with only one SM occupied. Further, blocks added after that will be able to run concurrently with
the first new block, and will not significantly affect the time until the round is again full, as illustrated in Figure 8. This leads to a timing function describing the run time of the simulation:

    t = s · t_0 · ((B + SM − 1) / SM)    (7)

where t_0 is the typical time it takes to evaluate a block, s is the number of steps to evolve, B is the number of thread blocks, SM is the number of SMs on the GPU, and the / represents integer division. Further, B = 1.5 N / tpb, rounded up for even divisibility among subregions. Noting that the wasted time is a function of the size of a thread block, there is an incentive to keep the number of threads per block low. However, the overhead of switching puts a lower limit on that number. Running with 64 threads per block gets approximately the same average performance as 256, but with a much smoother curve.

Figure 7: Closeup of the step function in time for 256 threads per block on the C1060 and GTX 260. Note that the period is 10% shorter for the 260, corresponding to its having 10% fewer SMs.
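Equation (7) can be written out directly as code, showing where the steps in Figures 6 and 7 come from (helper name ours, t_0 normalized to 1):

```python
def model_time(n, tpb, n_sms, steps=1, t0=1.0):
    """Eq. (7): runtime is set by the number of rounds of thread blocks,
    one block per SM per round, so partial rounds cost as much as full ones."""
    blocks = -(-int(1.5 * n) // tpb)        # B = 1.5 N / tpb, rounded up
    rounds = (blocks + n_sms - 1) // n_sms  # integer ceiling of B / SM
    return steps * t0 * rounds

# On a 30-SM C1060 at 256 tpb, a few extra particles past a step boundary
# add a whole round: N = 5120 fills 30 blocks exactly, N = 5200 needs 31.
```

The model omits the divisibility rounding among subregions, but it reproduces the step length of roughly tpb × SM / 1.5 particles.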
Figure 8: Illustration of timing (t) for various numbers of thread blocks. B is the number of thread blocks, SM is the number of SMs on the GPU. SM = 27 on the GTX 260, and SM = 30 on the Tesla board.

3.2 Overhead

There is overhead associated with reducing the number of nonzero interactions. Each timestep, particles must be communicated between subregions, and it takes significant work to identify and move those particles. However, since this amounts simply to moving a particle in device memory, the penalty is small. If the workload were not so evenly distributed, there would potentially be a large penalty for having a significant number of idle threads.

3.3 Divergence

The if statement in our kernel causes some threads to execute the zero-interaction path, and some to evaluate the full interaction. In CUDA, branch divergence causes a serious performance hit. Branch divergence is per-warp (a set of 32 threads), so as long as every thread within a given warp takes the same branch, there should be no penalty. However, we were unable to fully avoid branch divergence, which cost a great deal of performance. To illustrate this, we set the cutoff to be larger than a subregion, as in Figure 9, such that the if statement is always true. We see a performance increase of approximately 5% in Figure 10. Of course, this simulation may be faster,
but it no longer makes sense, because the model requires that all particles within the cutoff range be included in the force calculation, while only particles within the subregion are actually used. This results in an error where the force calculated on particles closer to the subregion edge than the cutoff is more strongly directed towards the outside of the subregion than if the model were properly calculated.

Figure 9: Diagram showing the cutoff radius larger than the subregion, used to eliminate divergence. Particles outside the box should be included in the model, but are outside the search space, which includes only the subdomain.

Figure 10: Performance (GFLOPS versus N) for the divergent and the invalid convergent codes, at 64 and 256 tpb. The divergence penalty appears to be about 5%. The divergence also appears to introduce some variability into the performance, as the case without divergence seems smoother.
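The per-warp granularity described in Section 3.3 can be modeled in a few lines (our own illustration, not the report's measurement): a warp only pays the divergence penalty when its 32 threads disagree on the branch.

```python
def divergent_warps(in_cutoff, warp_size=32):
    """Count warps containing both branch outcomes; uniform warps run the
    if statement without serializing the two paths."""
    warps = [in_cutoff[i:i + warp_size]
             for i in range(0, len(in_cutoff), warp_size)]
    return sum(1 for w in warps if any(w) and not all(w))

# 128 threads whose in-cutoff flags are spatially sorted diverge in only
# the warp spanning the boundary; the same flags interleaved diverge everywhere.
sorted_flags = [True] * 40 + [False] * 88
mixed_flags = [bool(i % 2) for i in range(128)]
print(divergent_warps(sorted_flags), divergent_warps(mixed_flags))
```

This is why sorting particles so that warp-mates see similar neighborhoods would reduce the penalty, at the cost of exactly the kind of bookkeeping we chose to avoid.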
4 Conclusions and Future Plans

We managed to utilize 23% of the theoretical 1 TFLOP performance of the Tesla GPU.¹ The original GPU Gems implementation, which does a complete all-pairs simulation, gets closer to 50% of peak. Clearly there is room for improvement, but since our scheme only computes N²/128 interactions, a factor-of-two performance shortfall versus a full N² model is quite acceptable.

In our implementation, we made the assumption that the overhead of a more thorough bookkeeping method would overwhelm the wasted computation of evaluating many zero interactions. It would be both interesting and prudent to explore such a scheme in order to examine the validity of this assumption. Even keeping the same 1D scheme, we could make the slices across x instead of across y. This would dramatically reduce the amount of particle communication, since particles travel predominantly in the y direction, but it could negatively impact load balancing, since the particles are not uniformly distributed in x.

PyCUDA was not integral to the performance of the code, nor did it hinder it noticeably. The kernel was handwritten in C+CUDA, adapted from the Gems source available from NVIDIA, and PyCUDA was simply used to set up the simulation and launch the kernel. In this way, PyCUDA allowed the simple part of the code to be trivial, by letting it be written in Python, and the hard code to be fast, by running the raw CUDA kernel. Our original goal was to compare this simple method to one or two more sophisticated methods, but this implementation took longer than anticipated. We hope to explore various schemes further, particularly ones more suitable to plasma simulations [9] [10]. Over the coming months (and longer), we plan to build a more complete PIC and tree-code simulation on the GPU with PyCUDA, to form the starting backend for our broader Python PIC project.
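The x-versus-y slicing tradeoff can be illustrated with a quick count (our own sketch; the particle distribution follows Eq. (2), with illustrative X = Y = 1):

```python
import random

def slice_counts(positions, n_slices, extent, axis):
    """Count particles per 1D slice along the given axis (0 = x, 1 = y)."""
    counts = [0] * n_slices
    for p in positions:
        i = min(int(p[axis] / extent * n_slices), n_slices - 1)
        counts[i] += 1
    return counts

# Particles span only x in [X/4, 3X/4] but all of y, so x slices leave the
# outer quarters idle while y slices stay roughly balanced.
rng = random.Random(0)
X = Y = 1.0
pts = [((X / 4) * rng.uniform(1, 3), Y * rng.uniform(0, 1))
       for _ in range(4000)]
x_counts = slice_counts(pts, 4, X, axis=0)
y_counts = slice_counts(pts, 4, Y, axis=1)
print(x_counts, y_counts)
```

With four x slices, the outer two receive no particles at all, so half the thread blocks would sit idle; the y slices share the load evenly but must exchange particles every step.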
There were also plenty of iterations of this code, not discussed here, which, by running afoul of coalesced and aligned memory access, constants defaulting to double precision, and improper synchronization, managed to be far slower than a naive serial N² implementation. The general conclusion of the experience is that it is manageable to make a problem reasonably fast on a GPU, but it is very easy to make it very slow.

¹ In case you remember a different number from my poster: I had made a favorable miscalculation there, giving closer to 30% of peak.

References

[1] J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, An Object-Oriented Electromagnetic PIC Code, Comp. Phys. Comm. 87.
[2] L. Nyland, M. Harris, and J. Prins, Fast N-Body Simulation with CUDA, GPU Gems 3, Chapter 31.
[3] NVIDIA CUDA SDK.
[4] A. Klöckner, PyCUDA.
[5] NumPy.
[6] SciPy.
[7] matplotlib.
[8] Chaco.
[9] J. Barnes and P. Hut, A Hierarchical O(N log N) Force Calculation Algorithm, Nature 324.
[10] L. Greengard et al., A New Version of the Fast Multipole Method for Screened Coulomb Interactions in Three Dimensions, JCP 180.
More informationMolecular Dynamics Simulations with Julia
Emily Crabb 6.338/18.337 Final Project Molecular Dynamics Simulations with Julia I. Project Overview This project consists of one serial and several parallel versions of a molecular dynamics simulation
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationInterval arithmetic on graphics processing units
Interval arithmetic on graphics processing units Sylvain Collange*, Jorge Flórez** and David Defour* RNC'8 July 7 9, 2008 * ELIAUS, Université de Perpignan Via Domitia ** GILab, Universitat de Girona How
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationOverview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size
Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationarxiv: v1 [cs.dc] 24 Feb 2010
Deterministic Sample Sort For GPUs arxiv:1002.4464v1 [cs.dc] 24 Feb 2010 Frank Dehne School of Computer Science Carleton University Ottawa, Canada K1S 5B6 frank@dehne.net http://www.dehne.net February
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationHighly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs
Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs Kaixi Hou, Hao Wang, Wu chun Feng {kaixihou, hwang121, wfeng}@vt.edu Jeffrey S. Vetter, Seyong Lee vetter@computer.org, lees2@ornl.gov
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu
More informationDynamic Load Balancing on Single- and Multi-GPU Systems
Dynamic Load Balancing on Single- and Multi-GPU Systems Long Chen, Oreste Villa, Sriram Krishnamoorthy, Guang R. Gao Department of Electrical & Computer Engineering High Performance Computing University
More informationCUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance
More informationBulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model
Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized
More informationCUDA Particles. Simon Green
CUDA Particles Simon Green sdkfeedback@nvidia.com Document Change History Version Date Responsible Reason for Change 1.0 Sept 19 2007 Simon Green Initial draft 1.1 Nov 3 2007 Simon Green Fixed some mistakes,
More informationDisk Scheduling COMPSCI 386
Disk Scheduling COMPSCI 386 Topics Disk Structure (9.1 9.2) Disk Scheduling (9.4) Allocation Methods (11.4) Free Space Management (11.5) Hard Disk Platter diameter ranges from 1.8 to 3.5 inches. Both sides
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationModelling Multi-GPU Systems 1
Modelling Multi-GPU Systems 1 Daniele G. SPAMPINATO a, Anne C. ELSTER a and Thorvald NATVIG a a Norwegian University of Science and Technology (NTNU), Trondheim, Norway Abstract. Due to the power and frequency
More informationOperating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings
Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationOptimizing CUDA for GPU Architecture. CSInParallel Project
Optimizing CUDA for GPU Architecture CSInParallel Project August 13, 2014 CONTENTS 1 CUDA Architecture 2 1.1 Physical Architecture........................................... 2 1.2 Virtual Architecture...........................................
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationBack-Projection on GPU: Improving the Performance
UNIVERSITY OF MICHIGAN Back-Projection on GPU: Improving the Performance EECS 499 Independent Study Wenlay Esther Wei 4/29/2010 The purpose of this project is to accelerate the processing speed of the
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationPeak Performance for an Application in CUDA
Peak Performance for an Application in CUDA Nicolás Wolovick 1 Fa.M.A.F., Universidad Nacional de Córdoba, Argentina May 4, 2010 Fa.M.A.F. - U.N.C. Outline The Numeric Problem Serial Codes Our aim CUDA
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationCUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)
CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationRT 3D FDTD Simulation of LF and MF Room Acoustics
RT 3D FDTD Simulation of LF and MF Room Acoustics ANDREA EMANUELE GRECO Id. 749612 andreaemanuele.greco@mail.polimi.it ADVANCED COMPUTER ARCHITECTURES (A.A. 2010/11) Prof.Ing. Cristina Silvano Dr.Ing.
More informationA Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems
A Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems Yuanxun Bill Bao May 31, 2015 1 Introduction The non-uniform fast Fourier transform (NUFFT) algorithm was originally introduced by
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More information