Parallel Monte Carlo Simulation of Colloidal Crystallization
Jeff Boghosian
Advisor: Dr. Talid Sinno


Abstract

The Monte Carlo algorithm is frequently used in particle simulations, but as the number of particles grows, simulations may take days to run. Efforts have been made to parallelize Monte Carlo, but the sequential nature of the algorithm makes parallel versions inherently inefficient. Simple methods of parallelizing the algorithm could result in data being sent across the network at every step, killing performance. I created a modified parallel Monte Carlo simulation that significantly reduces network traffic and eliminates the synchronization issues. I also developed an interactive three-dimensional graphical user interface (GUI) to facilitate visualization of the system as it grows. With this interface, users can quickly see how the system is developing and understand the growth mechanisms. The GUI reads the text files that the simulation writes at a fixed interval of steps. The user has many visualization controls, including changing the color and size of particles and choosing different appearances for specific particle types and phases.

Related work

The de facto algorithm for Monte Carlo simulation is the Metropolis algorithm (Metropolis et al., 1953). At each step, a particle is given a random displacement. If the energy of the system decreases, the move is accepted. Otherwise, the move is accepted with probability $\exp(-\Delta E / kT)$, where $\Delta E$ is the change in energy, $k$ is Boltzmann's constant, and $T$ is the temperature. While the algorithm has changed little since then, computing power has increased exponentially, allowing us to simulate larger and larger numbers of particles. The actual simulation in the Metropolis paper consisted of just 224 particles. By comparison, a Molecular Dynamics simulation with over 12 billion atoms was recently reported running on a Cray supercomputer (Rapaport, 2006). The simulation used hundreds of processors and terabytes of memory in order to accomplish this.

The high cost of supercomputers has led to great interest in running parallel simulations on cheaper, commodity hardware. Significant improvements in running time have been achieved using particle subdivision and the Message Passing Interface in a Monte Carlo simulation (Carvalho et al., 2000). The group used a master-slave pattern to parallelize the code. At each step, the master process would choose a particle and ask every other processor for the energy between the chosen particle and the subset of particles in that processor's particle list. During trials, four processors were connected via a local network, and for 2048 particles, speedups near 2.5x were realized. The group also observed larger speedups with larger numbers of particles.

There exist a few general-purpose molecular dynamics programs. MOLDY is a popular modeling program written in C, but it uses molecular dynamics instead of Monte Carlo. Also, since there are so many possible combinations of particles, system types, and interactions, no single general-purpose program is likely to serve one's exact needs. Commercial modeling software exists, but it is very general and not specialized for this type of modeling. Accelrys sells a product called Materials Studio that does graphical modeling and simulations. The cost of full-featured commercial modeling software is very high, though, and Materials Studio doesn't seem to support the specific features we need. Finally, NAMD is an open source simulator and viewer for biomolecular systems. It has a polished interface and good graphics using OpenGL, but it is focused on biology and lacks the capability of viewing large particle systems.

Technical Approach

Monte Carlo is a randomized algorithm, so no two runs are the same. Each iteration, the algorithm applies a random move to each particle in sequence. After each individual move, the program calculates the change in energy for the system. 
Depending on the change in energy, the program can decide to accept the particle's new position, or reject the move and keep the particle in its old position. This step is repeated for every particle in the system. An iteration through the entire list of particles is commonly known as a sweep. Simulations may last thousands or millions of sweeps.
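
The accept/reject step and the sweep described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: the function names, the `energy_change` callback, and the uniform displacement in a cube of half-width `max_disp` are all assumptions made for the example.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def sweep(positions, energy_change, max_disp, kT, rng):
    """Attempt one random displacement per particle, in sequence.

    energy_change(positions, i, trial) is a hypothetical callback that returns
    the change in system energy if particle i were moved to `trial`.
    Returns the number of accepted moves.
    """
    accepted = 0
    for i, old in enumerate(positions):
        # Propose a random displacement of at most max_disp along each axis.
        trial = tuple(c + rng.uniform(-max_disp, max_disp) for c in old)
        if metropolis_accept(energy_change(positions, i, trial), kT, rng):
            positions[i] = trial  # accept: keep the new position
            accepted += 1
        # reject: the particle simply stays where it was
    return accepted
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when debugging even though production runs would normally use different seeds.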

In the serial Monte Carlo simulation, nearest neighbor lists are kept in order to minimize unnecessary calculations (Verlet, 1967). When the program calculates energies for a given particle, it also keeps track of all of the particle's close neighbors and stores them in an array. The next time the program calculates energies for that particle, it only has to check the particles in the array, since these are the only ones close enough to interact.

The serial Monte Carlo algorithm is fairly straightforward, but there were many choices I had to make in parallelizing the code. First of all, there are several general techniques for parallelizing particle simulations. One such method is spatial subdivision, in which each machine simulates the particles in a different chunk of space. Since the colloidal potential function is very small at relatively large distances, we define a cutoff distance $R_c$ and take the potential between a particle and any other particle farther away than $R_c$ to be zero. Using this approximation, we only have to calculate energies for a particle using its nearest neighbors (Prasad, 2005).

Fig. 1. Spatial subdivision of particles. Separate machines process regions A, B, and C in parallel.

Figure 1 shows spatial subdivision of a system. At any time, each particle is in region A, B, or C. Spatial subdivision assigns a machine to each region; in this case, there are three machines. Each machine simulates the motions of the particles in its region. When particles are near a border, the machines need to communicate through the network in order to ensure that all neighboring interactions are considered.
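
As a rough illustration of spatial subdivision, the sketch below splits a box into equal slabs along one axis and flags particles that sit within the cutoff distance of a slab boundary, since those are the ones that would require communication. The one-dimensional slab layout, the non-periodic outer walls, and the names `region_of`, `near_border`, and `r_cut` are assumptions for the example, not details taken from the simulation.

```python
def region_of(x, box_length, n_regions):
    """Index of the slab (0 .. n_regions - 1) that contains coordinate x."""
    width = box_length / n_regions
    # Clamp so that x == box_length falls in the last slab.
    return min(int(x / width), n_regions - 1)


def near_border(x, box_length, n_regions, r_cut):
    """True if x lies within r_cut of a boundary shared with another slab.

    Assumes non-periodic outer walls, so the two ends of the box do not count.
    """
    width = box_length / n_regions
    r = region_of(x, box_length, n_regions)
    lower, upper = r * width, (r + 1) * width
    close_to_lower = r > 0 and x - lower < r_cut
    close_to_upper = r < n_regions - 1 and upper - x < r_cut
    return close_to_lower or close_to_upper
```

With three slabs this corresponds to regions A, B, and C: a particle deep inside a slab can be handled entirely by its own machine, while one flagged by `near_border` may interact with particles owned by a neighbor.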

Another way to accomplish parallelism is through particle subdivision (Prasad, 2005). In this algorithm, each particle is assigned to a machine, and this machine controls the particle for the duration of the simulation. In Figure 1, the blue particles could be assigned to one machine and the green particles to another. The master process chooses a random particle to move and informs the other processes of the proposed move. Then, each slave process returns the energy between the moved particle and the subset of particles that process owns. The master then decides whether or not to accept the move and informs the others. The issue with this algorithm is that several messages need to be passed for each individual movement. If the cost of calculating energies is large compared to the network delay, this is not a problem. In our simulation, however, the interaction distance is very small and processors are fast, so the network latency would far outweigh the computation time.

I chose to use spatial subdivision for my simulation. The simulations I am working with contain thousands of particles, so I wanted to reduce network traffic as much as possible. The colloidal particles being simulated have relatively short interaction distances. Since the cell length is much greater than the interaction distance, it is very likely that all of a given particle's neighbors will be on the same processor. This way, the number of queries made to neighboring cells is minimized.

Parallel Monte Carlo algorithms are notoriously slow. The problem arises because every individual particle movement is done sequentially. In each step of the serial Monte Carlo algorithm, the position of a particle may change. This move may influence the energy of subsequent moves, so every other particle must be aware of the new position. In the serial algorithm this isn't an issue, as all particles are stored in local memory on the same machine. 
However, in a parallel algorithm, after a particle's position is updated, a message might need to be sent across the network to notify the other cells. Network traffic at every step can slow the algorithm to a crawl.

Synchronization is also an issue with parallel Monte Carlo. If processors A and B are simultaneously choosing particles within their own space and making random moves, issues can arise.

If a particle in cell A is moved, cell B might be calculating the energy of a move based on A's old state. Since messages take a relatively long time to send, this problem could arise very frequently. If this happens, the simulation no longer has any physical significance, since it did not run as intended. It's quite possible that two particles could end up overlapping, a state that energetically should never exist. Synchronizing individual particle movements would take a lot of network communication at each step, which is infeasible if we are trying to speed up the simulation.

For comparison, Molecular Dynamics (MD) is one type of particle simulation that is frequently parallelized. The nature of the algorithm only requires communication between processors after every complete sweep of the particles. In MD, the motion of each particle is not randomly assigned but is calculated from the forces applied by its neighboring particles. The forces acting on each particle are all calculated at once, and then the incremental movements of the particles can be made independently of each other, using the previously calculated forces. Finally, cells exchange information about their particles' movements and repeat the process (Prasad, 2005). The difference between these algorithms is that in MD you only need to update particle positions every sweep, but MC movements are made individually, so positions may have to be updated every step.

I utilized the Message Passing Interface (MPI) for my simulation. MPI has become the industry standard for message-passing parallelization. While MPI itself is just a specification, there are many implementations of it, such as Open MPI, MPICH, and LAM/MPI. MPI comprises over a hundred functions, but only a very small subset is actually necessary for full parallelism. The most important functions are MPI_Send and MPI_Recv, which send messages to and receive messages from other processes. 
There are blocking and non-blocking versions of these, and they must be used carefully in order to avoid deadlock. It is quite easy to cause deadlock inadvertently. Suppose you want every cell to send data to its right neighbor and receive from its left. Simply calling MPI_Send followed by MPI_Recv in each process can cause deadlock: if each process blocks at MPI_Send until the sent message is received, no process ever reaches MPI_Recv, since they are all blocked. A smarter approach is to have even cells receive data first and odd cells send data first (Gropp, Lusk, & Skjellum, 1994).
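
The difference between the two orderings can be checked without any MPI installation. The sketch below does not call MPI at all: it is a toy model in which a send completes only when the destination is currently waiting on the matching receive (an assumption that corresponds to fully synchronous, unbuffered sends), and it simply reports whether every rank in a ring finishes. The function names are hypothetical.

```python
def exchange_plan(rank, size, naive=False):
    """Ordered list of (operation, peer) pairs for one shift around a ring."""
    right = (rank + 1) % size
    left = (rank - 1) % size
    send, recv = ("send", right), ("recv", left)
    if naive or rank % 2 == 1:
        return [send, recv]   # naive ranks and odd ranks send first
    return [recv, send]       # even ranks receive first


def completes(size, naive=False):
    """Simulate synchronous sends; return True if every rank finishes."""
    plans = [exchange_plan(r, size, naive) for r in range(size)]
    progress = True
    while progress:
        progress = False
        for r, plan in enumerate(plans):
            if plan and plan[0][0] == "send":
                peer = plan[0][1]
                # A send completes only if the peer is waiting to receive from r.
                if plans[peer] and plans[peer][0] == ("recv", r):
                    plan.pop(0)
                    plans[peer].pop(0)
                    progress = True
    return all(not plan for plan in plans)
```

In this model `completes(4, naive=True)` is false because all four ranks sit in their send forever, while `completes(4)` is true. Real MPI implementations may buffer small messages, so the naive ordering can appear to work in testing and then hang with larger messages.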

The simulation space is a cube. For my simulation, I divided that cube into cells and then further divided the cells into subcells. To run the simulation, I planned on assigning one cell to each processor for maximum efficiency. Dividing the cells even further mainly had the benefit that I didn't have to keep neighbor lists for each cell: the neighbors can be found in the subcells directly around the current subcell. The side length of the subcells must be greater than the interaction cutoff, $R_c$. This confines the particles that can potentially interact with those in a particular subcell to that subcell and its 26 neighbors (a 3 by 3 by 3 cube). This is a substantial speedup, since the program won't needlessly check energies of particles that are guaranteed to be outside of $R_c$.

Fig. 2. The subdivision of the cells. Particles in the yellow subcells are controlled by that cell, while particles in the white subcells are copied from neighboring cells. For instance, subcells 10, 15, and 20 in Cell 1 are copies of subcells 7, 12, and 17 in Cell 2. Local copies of subcells are stored to reduce network traffic.

An extra outer layer of subcells was added to each cell in order to keep track of nearby particles owned by neighboring cells. In Figure 2, the white subcells comprise this extra layer. When calculating energies for particles in a border subcell such as number 14, we need to look at the particles stored in subcells 10, 15, and 20, among others. The particles in these subcells aren't actually owned by that cell, but having them stored locally saves expensive network traffic.
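
One way to picture the subcell bookkeeping is with integer (i, j, k) indices. The sketch below is illustrative only: the three-component indexing, the `origin` parameter, and the convention that owned subcells run from 0 to `n_owned - 1` along each axis (so the copied outer layer sits at -1 and `n_owned`) are assumptions, and they differ from the flat numbering used in Figure 2.

```python
from itertools import product


def subcell_index(pos, origin, subcell_len):
    """Integer (i, j, k) index of the subcell containing pos, relative to origin."""
    return tuple(int((p - o) // subcell_len) for p, o in zip(pos, origin))


def neighborhood(index):
    """The subcell itself plus its 26 neighbors (a 3 x 3 x 3 block)."""
    i, j, k = index
    return [(i + di, j + dj, k + dk)
            for di, dj, dk in product((-1, 0, 1), repeat=3)]


def is_ghost(index, n_owned):
    """True if the subcell lies in the extra outer layer copied from a neighbor.

    Assumes owned subcells are indexed 0 .. n_owned - 1 along each axis.
    """
    return any(c < 0 or c >= n_owned for c in index)
```

Because the subcell side is larger than the cutoff, an energy calculation for a particle only needs to loop over the particles stored in `neighborhood(subcell_index(...))`, some of which may be ghost subcells.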

When an MPI program is initialized, each process is assigned a unique rank. Every process runs exactly the same code, so the rank is necessary to differentiate between the cells. By convention, the master process (if there is one) is the one with rank 0. In my parallel simulation, the master cell creates the particles, either reading them from an input file or assigning them random positions. It then distributes each particle to the appropriate slave cell or to itself, since the master cell acts like any other cell after distributing the particles. Once the particles are distributed, the simulation can begin.

I actually had a significant portion of my code implemented before I realized why Monte Carlo simulations generally aren't parallelized. The synchronization issue was quite difficult to circumvent. If I moved a particle near the border of a cell, it might interact with particles in a neighboring cell, but that cell might already be in the middle of calculating energies. I thought about ways to send messages back and forth to synchronize this, but it seemed infeasible. There could be any number of particles near the border, so any number of messages would have to be synchronized. Next, I wondered if I could predetermine the order in which particles were processed. The simulation would proceed as normal in each cell, but if a particle was near the border, the cell would only continue processing if it had precedence; otherwise it would wait until it received an updated particle position from the neighboring cell and it was this particle's turn. I also discarded this as too slow and too complicated.

My idea was to run the simulation one subcell at a time. Every cell would start off simultaneously processing its own version of the same subcell. After finishing that subcell, each cell would update its neighbors with the new positions, then move on to the next subcell, and repeat this process until every subcell was processed. 
The order in which the subcells are processed is randomly chosen each iteration in order to prevent any regional bias. Since any given subcell is in the same relative position within each cell, none of the subcells being run in parallel border each other. This eliminated any border synchronization issues. It also significantly reduced the amount of data sent across the network. Data only needs to be sent after a surface subcell is processed; interior subcells don't have any effect on particles in neighboring cells. The number of surface subcells increases as $n^{2/3}$, where $n$ is the number of subcells. Now the program only had to transmit data after entire subcells were processed, as opposed to after every particle.
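
The subcell-at-a-time schedule can be sketched as below. How the cells agreed on the random order is not stated above; this example assumes that every process derives the order from a shared seed and the sweep number, so no messages are needed to agree on it. The function names and the seeding formula are hypothetical, and here `n` is the number of subcells along one side of a cell.

```python
import random
from itertools import product


def subcell_order(n, sweep_number, seed=0):
    """Random visiting order for the n**3 owned subcells of a cell.

    Every process calls this with the same arguments, so all cells visit the
    same relative subcell at the same time.
    """
    order = list(product(range(n), repeat=3))
    random.Random(seed * 1_000_003 + sweep_number).shuffle(order)
    return order


def is_surface(index, n):
    """True if an owned subcell touches the cell boundary."""
    return any(c == 0 or c == n - 1 for c in index)


def surface_count(n):
    """Number of surface subcells in an n x n x n cell."""
    return n ** 3 - max(n - 2, 0) ** 3
```

A cell would walk through `subcell_order(...)`, run the Monte Carlo moves for the particles in each subcell, and exchange updated positions with its neighbors only when `is_surface` is true. For a cell of 4 x 4 x 4 subcells, 56 of the 64 subcells are on the surface; the interior fraction grows as the cell is divided more finely.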

Now that the particles were distributed and the simulation could proceed, the last thing to take care of was particles wandering into another cell. The particle lists need to be updated when it is possible that two particles from non-neighboring subcells have moved far enough that they can interact. The calculation of the interval is straightforward. Initially, two particles with a subcell between them have a separation of at least the size of the subcell, $d_s$. The particles will start interacting when they are within the interaction cutoff, $R_c$. The other constant is $rd_{max}$, the maximum displacement a particle can have each step. After $n$ steps, the two particles can have approached each other by at most $2 n \, rd_{max}$, so it is necessary that $2 n \, rd_{max} + R_c < d_s$, or equivalently $n < (d_s - R_c) / (2 \, rd_{max})$. Depending on the constants used, the particle lists only need to be updated every $n$ steps. Once the lists are updated, there is no chance for two particles in non-neighboring subcells to interact for at least another $n$ steps.

Graphical User Interface

The GUI was developed as a component completely separate from the simulation. Modularizing the code eliminated dependencies and gave me more flexibility to write the GUI in any language I liked. The only link between the two parts is the files that the simulation outputs, which are read by the GUI and turned into a 3-D representation. One of the important features of this program is the ability to recognize that a simulation has gone awry while it is still running. Since the simulation can run for days at a time, it is important to know how the system is progressing during the simulation. The Java Swing toolkit was used for the GUI. The Java platform was chosen for portability across multiple platforms and because creating interfaces is very easy with just a little knowledge of the Swing libraries. Inserting an OpenGL canvas into the interface was very simple using the third-party JOGL libraries. 
Rendering thousands of particles per frame can be very costly, so I decided to render the solid particles as low-resolution spheres and the liquid particles as GL dots. Dots are rendered much faster than spheres, which are constructed of multiple polygons. Zoom, rotation, and translation controls were implemented so the user can have virtually any view of the system, and a slider bar was added to change frames to watch the growth of the system.

Fig. 3. An image of a colloidal system from the simulation. The spheres are solid-phase particles and the small dots are liquid-phase particles.

The GUI has already turned out to be very important, as it allowed me to see what was going on in the system during my simulation's test runs. Before I ironed out the bugs, I would see very odd things: sometimes the crystal would disband completely within a few steps, or a whole section of the crystal would disappear right around the cell boundaries. This gave me an idea of how useful the GUI could be in conjunction with the particle simulations.

Conclusion

Overall I'm satisfied with my project. The parallel Monte Carlo algorithm became the main thrust of my project even though it was the last thing I added. Originally I was planning on just doing the GUI and some energy calculations, but when I realized that wouldn't be enough, I added the parallel simulation. Writing a parallel simulation was a learning experience for me, and one that I'm glad to have had. I'd never written parallel code before except for multi-threaded programs. I had to learn a lot, starting right at the basics with MPI (which I didn't know existed) and moving on to parallel coding techniques and optimization. I had to learn the inner workings of the serial Monte Carlo simulation, with countless arrays, potential matrices, and neighbor lists, and then figure out a way to parallelize it. The challenge of parallelizing the code was not obvious at first, as I figured it couldn't be much different from parallelizing MD. A lot of thought went into it before I found a unique way to accomplish it.

I'm disappointed I didn't have the time and resources to test the speedup of my parallel simulation on multiple machines. During development, I had an Open MPI implementation installed on my computer and tested the parallelism by having it spawn 8 or 27 processes locally. This worked, and was actually pretty fast, considering all communication was still done using sockets and all 8 processes were running on the same CPU. In 6-7 minutes, 1000 sweeps of 5000 particles each were completed. By the time I was completely finished and had the bugs ironed out, I only had a week or so left. I wasn't able to secure enough workstations in the SEAS clusters in that time, so I instead tried to run the simulation on a few of my friends' MacBooks. I had all sorts of issues with running MPI on these computers. Open MPI got the closest: the master process started running, and I could see on the other machines that it had spawned processes through SSH. But after that, nothing happened. The main function of the program was never reached. MPICH never ran in parallel; no matter what I did, only one process ran at a time. The LAM/MPI daemons didn't help either. My problems may have been because the Mac is not the usual platform for MPI, but the OS is Unix-based and I was able to compile every implementation without problems, so I don't see the issue there. 
The GUI turned out not to be a huge endeavor, and I didn't really expect it to be, but it is a really cool piece of software that will save a lot of time and effort when it is used by Dr. Sinno's group. It was a lot of fun to develop and play around with. I've had a good deal of experience building interfaces for websites and applications, and I was able to apply this knowledge to make an easy-to-use piece of software that is portable and pretty fun.

References

Carvalho, A., Gomes, J., & Cordeiro, M. (2000). Parallel Implementation of a Monte Carlo Molecular Simulation Program. J. Chem. Inf. Comput. Sci., 40(3).
A parallel Monte Carlo implementation was created using particle subdivision by a doctoral student. This research claimed speedups of 2 using four processors. It was encouraging to see other attempts at parallel Monte Carlo simulation, but this work took place in 1999 with 266 MHz machines connected across 100 Mb Ethernet. Since then, processing power has increased exponentially, but 100 Mb Ethernet is generally still the standard. Now more than ever, network latency is the limiting factor in parallel computing, and the speedups seen in this research could be hard to reproduce due to network limitations.

Gropp, W., Lusk, E., & Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: The MIT Press.
The authors of this reference manual were all part of the team that created the Message Passing Interface. At the time of printing, Gropp and Lusk were both computer scientists at Argonne National Laboratory, and Skjellum was an Assistant Professor of Computer Science at Mississippi State University. This guide on using MPI was written in 1994, just a couple of years after MPI was standardized. Since the interface doesn't change, the text is just as relevant today as it was when it was written. Since then, MPI-2 has been specified, adding lots of new functionality. However, it is backwards compatible with MPI-1, and the new version has not seen widespread adoption.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21(6).
This paper introduced the Metropolis Monte Carlo algorithm, which has become one of the most popular algorithms used today. The paper was published over fifty years ago, yet the algorithm is still widely used. Nicholas Metropolis was a physicist who worked at Los Alamos National Laboratory during the Manhattan Project, and the coauthors were also esteemed physicists. This paper was very interesting to read, since computing methods were not nearly as advanced as today: the largest simulation consisted of only 224 particles. Nonetheless, this is one of the seminal papers in the field of particle simulations.

Prasad, M. (2005). Multiscale modeling and simulation of aggregation in crystalline semiconductor materials. Unpublished doctoral dissertation, University of Pennsylvania.
This dissertation from a Ph.D. student in Dr. Sinno's group was not focused on colloidal particles, but it utilized a parallel Molecular Dynamics simulation. I applied some of the ideas found in this dissertation, especially the transfer of bordering particles between neighboring cells. It also includes good descriptions of the different decomposition methods for parallelization, including atom, force, and spatial subdivision.

Rapaport, D. (2006). Multibillion-atom molecular dynamics simulation: Design considerations for vector-parallel processing. Computer Physics Communications, 174(7).
The author is a professor of physics who has seemingly made a career out of molecular dynamics, writing dozens of papers on the subject. He seems to be an authority in cutting-edge simulations, so I didn't doubt his research. I included this to give an idea of the future of particle simulation. This paper was written in 2006, so it is one of the more recent attempts to increase the number of particles in simulations. Using 2016 processors on a Cray X1 supercomputer, over 12 billion atoms were simulated, requiring almost 2 TB of memory. Parallelism is clearly the future of particle simulation, as processor speeds aren't increasing by much any more and massive memory requirements demand it.

Verlet, L. (1967). Computer Experiments on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules. Physical Review, 159(1).
This is the paper in which Loup Verlet introduced his algorithm for keeping nearest neighbor lists. It is a very respected and important paper, and the lists are now named after him. In the paper, the technique is only briefly mentioned, but it has become a staple in Monte Carlo simulations. When calculating the energy of a particle, you only need to look at the particles in its neighbor list instead of looking at every other particle. This insight significantly reduced the running time of the simulations from $n^2$ to $n \cdot p$, where $p$ is the average number of particles in the nearest neighbor list. For my simulation, I didn't actually use nearest neighbor lists, since I already had a similar infrastructure in place. The subcells I created have the same effect as neighbor lists. They may contain more particles than necessary for any particular particle, but time is saved by not constructing the lists at all.


UCLA UCLA Previously Published Works Title Parallel Markov chain Monte Carlo simulations Permalink https://escholarship.org/uc/item/4vh518kv Authors Ren, Ruichao Orkoulas, G. Publication Date 2007-06-01