UCLA Previously Published Works

Title: Parallel Markov chain Monte Carlo simulations
Permalink: https://escholarship.org/uc/item/4vh518kv
Authors: Ren, Ruichao; Orkoulas, G.
Publication Date: 2007-06-01
Peer reviewed

eScholarship.org, powered by the California Digital Library, University of California

THE JOURNAL OF CHEMICAL PHYSICS 126, 211102 (2007)

Parallel Markov chain Monte Carlo simulations

Ruichao Ren (a) and G. Orkoulas (b)
Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California 90095

(a) Electronic mail: ruichao@ucla.edu
(b) Electronic mail: makis@seas.ucla.edu

(Received 2 April 2007; accepted 1 May 2007; published online 5 June 2007)

With strict detailed balance, parallel Monte Carlo simulation through domain decomposition cannot be validated with conventional Markov chain theory, which describes an intrinsically serial stochastic process. In this work, the parallel version of Markov chain theory and its role in accelerating Monte Carlo simulations via cluster computing is explored. It is shown that sequential updating is the key to improving efficiency in parallel simulations through domain decomposition. A parallel scheme is proposed to reduce interprocessor communication or synchronization, which slows down parallel simulation as the number of processors increases. Parallel simulation results for the two-dimensional lattice gas model show a substantial reduction of simulation time for systems of moderate and large size. © 2007 American Institute of Physics. DOI: 10.1063/1.2743003

I. INTRODUCTION

The increasing availability of computer clusters with distributed memory has made parallel computing an inevitable trend in simulating more realistic system sizes and longer time scales in computational science. Recently, new advances have been made in both serial and parallel molecular dynamics simulations of biomolecules. [1-3] The parallelization of Monte Carlo simulations, on the other hand, is not just a technical implementation challenge but also an unsettled theoretical problem, owing to the serial nature of Markov chain theory. Any massively parallel Monte Carlo simulation based on spatial decomposition inevitably involves simultaneous updates on multiple CPUs. Several attempts have been made to parallelize Monte Carlo algorithms by decomposing the system into a number of noninteracting domains. [4,5] Updating a molecule whose interaction range extends beyond its own domain requires the most up-to-date information on its neighbor molecules from logically adjacent computers. In parallel simulations on distributed computers or clusters, communication between processors is a much slower process than ultrafast local cache or memory operations because of limits on network speed and data traffic. As the number of processors grows, the speed of the simulation may drop dramatically due to the extra communication overhead. It is therefore crucial to synchronize information between processors at a relatively low frequency, which inevitably involves switching active regions periodically.

The most straightforward parallel Metropolis algorithm is to divide the simulation box into stripe or square domains. Each processor then attempts to update the particles or spins within one domain. Once an update on the border of a domain is accepted, the processor in charge sends the updated information to its logical neighboring processors. A smarter variation of this method is to use the same random number sequence on all processors to avoid conflicts when two adjacent sites are selected at the same time by two neighboring processors. [4] However, both methods are practical only on supercomputers with shared-memory architecture. In cluster computing, they both suffer from frequent communication, which usually leads to worse-than-serial performance. A more efficient example of a domain decomposition algorithm on four processors [5] is shown in Fig. 1.
This case corresponds to a two-dimensional system with short-range intermolecular forces. Each of the four processors is in charge of a domain that comprises four regions A, B, C, and D; see Fig. 1. At any given time, only molecules in the active (shaded) region are updated, and the active region changes periodically in the order A→B→C→D→A. Since the molecules inside the active regions are separated by distances longer than the interaction range, the synchronization frequency is reduced dramatically. In this example, strict detailed balance [6,7] is not obeyed because only moves in the active regions are immediately reversible. The violation of strict detailed balance raises concerns about the precision of this type of parallel Monte Carlo simulation. [5]

FIG. 1. An example of domain decomposition that combines random updating with periodic switching of the active regions (shaded areas). Active regions are separated by distances longer than the interaction range of the particles or spins.

Most Monte Carlo simulations ensure strict detailed balance via random updating. Recently, we proposed a Monte Carlo algorithm based on sequential updating moves with partial randomness, which satisfies only the weaker balance condition. [8] The new random skipping sequential (RSS) Monte Carlo algorithm identifies the correct equilibrium distribution of states and converges much faster than the conventional Metropolis algorithm. [8] The improved efficiency of the sequential updating algorithm is attributed to the so-called neighbor effect: in the language of lattice models, successful moves trigger a higher probability of nearest-neighbor moves. The cascade behavior of the neighbor effect results in fast decorrelation and enhanced statistical quality of the generated samples.

In this work, we explore the parallel version of the Markov process for the nearest-neighbor lattice gas on the square lattice. We show that sequential updating is the key to reducing communication among processors and is ideally suited to parallel implementation through domain decomposition without compromising precision.
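To make the scheme of Fig. 1 concrete, the following minimal Python sketch emulates the active-region schedule serially for a nearest-neighbor lattice gas. The 2 x 2 domain layout, lattice size, and thermodynamic parameters are our own illustrative assumptions, not values or code from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    L = 16                      # L x L lattice split into 2 x 2 processor domains
    q = L // 4                  # size of the active quadrant within each domain
    beta, mu = 1.0 / 0.6, -2.0  # assumed reduced temperature and chemical potential
    lattice = rng.integers(0, 2, size=(L, L))

    def attempt_toggle(s, i, j):
        # Metropolis toggle of the occupancy at (i, j); H = -sum n_i n_j - mu sum n_i
        nn = s[(i - 1) % L, j] + s[(i + 1) % L, j] + s[i, (j - 1) % L] + s[i, (j + 1) % L]
        dn = 1 - 2 * s[i, j]    # +1 inserts a particle, -1 removes it
        dE = -dn * (nn + mu)    # energy change for coupling eps = 1
        if rng.random() < np.exp(-beta * dE):
            s[i, j] += dn

    def sweep(s):
        # Cycle the active region A -> B -> C -> D. Within each phase the four
        # domains could run concurrently: their active quadrants are separated
        # by more than the nearest-neighbor interaction range.
        for ri, rj in [(0, 0), (0, 1), (1, 1), (1, 0)]:
            for di in range(2):
                for dj in range(2):         # emulates the four processors
                    i0 = di * (L // 2) + ri * q
                    j0 = dj * (L // 2) + rj * q
                    for _ in range(q * q):  # random updating inside the active block
                        attempt_toggle(s, i0 + rng.integers(q), j0 + rng.integers(q))
            # a distributed run would synchronize domain boundaries here

    for _ in range(100):
        sweep(lattice)
    print("mean occupancy:", lattice.mean())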

II. PARALLEL MARKOV CHAIN

We assume that it is possible to classify all elementary moves by independence: moves of the same type are independent of each other, but moves of different types may be dependent. For example, in Fig. 2 we consider eight possible moves that can be categorized into two types (type 1: A-D; type 2: E-H). In serial Monte Carlo simulations, moves are usually selected at random and are thus mixed up. In parallel Monte Carlo simulations, however, the different types of moves must be isolated so that moves of the same type can be updated concurrently. To render an ergodic transition matrix, the different types of moves/updates are executed periodically. Referring to Fig. 2, at time t_1 only type 1 moves are executed, at time t_2 only type 2 moves are executed, and so on.

FIG. 2. In serial Monte Carlo simulations (top), all types of moves (A-H) are usually randomly selected and mixed up. In parallel Monte Carlo simulations (bottom), the different types of moves (A-D and E-H) are isolated so that moves of the same type can be updated concurrently.

Following this assumption, we must identify the form of the transition matrix of a parallel Markov chain in order to understand its convergence behavior. If S^{(m)} is the transition matrix [8] associated with updating site/cell m, the transition matrices at times t_1 and t_2 are defined as

    T_1 = \prod_{m \in \text{type1}} S^{(m)},    (1)

    T_2 = \prod_{m \in \text{type2}} S^{(m)}.    (2)

The parallel transition kernel R can be related [8] to the type transition matrices T_n as

    R = \prod_{n=1}^{k} T_n = T_1 T_2 \cdots T_k,    (3)

where k is the number of types of independent moves. Note that a different ordering of the T_n generates a different parallel kernel because of the dependence between different types of moves. Since each elementary move is accepted with the Metropolis acceptance probability, [6,7,9] detailed balance is satisfied while updating a single component,

    \pi_i S^{(m)}_{ij} = \pi_j S^{(m)}_{ji},    (4)

which implies that the balance condition [8]

    \pi S^{(m)} = \pi    (5)

is also satisfied for updating an individual component, where \pi = (\pi_1, \pi_2, \ldots) is the row vector of the equilibrium distribution of states. Following the analysis of Ref. 8, one can see that

    \pi T_n = \pi,    (6)

and consequently that

    \pi R = \pi \prod_{n=1}^{k} T_n = \pi T_2 \cdots T_k = \cdots = \pi.    (7)

Equation (7) indicates that the balance condition is satisfied for the parallel transition kernel. Since immediate reversal of a move is not possible within a sweep, detailed balance is not satisfied for the sweep transition kernel R, i.e.,

    \pi_i R_{ij} \neq \pi_j R_{ji}.    (8)

Despite the absence of strict detailed balance, it has been shown [8] that sequential algorithms converge to the correct equilibrium distribution under the weaker balance condition, Eq. (7). Ergodicity is ensured by adding occasional random skipping.
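The chain of balance conditions in Eqs. (4)-(8) can be checked numerically on a toy system. The Python sketch below (our own construction, not the authors' code) builds the exact single-site Metropolis kernels S^{(m)} for a four-site periodic lattice gas, groups the two pairs of non-adjacent sites into two move types, and confirms that \pi T_n = \pi and \pi R = \pi hold even though strict detailed balance fails for the sweep kernel R.

    import itertools
    import numpy as np

    # Toy system: 1D periodic lattice gas with N = 4 sites (assumed parameters).
    N, beta, mu = 4, 1.0, 0.5
    states = list(itertools.product((0, 1), repeat=N))
    index = {s: k for k, s in enumerate(states)}
    energy = lambda s: -sum(s[i] * s[(i + 1) % N] for i in range(N)) - mu * sum(s)
    w = np.exp([-beta * energy(s) for s in states])
    pi = w / w.sum()                    # equilibrium distribution (row vector)

    def S(m):
        # Exact transition matrix for a Metropolis toggle of site m; obeys Eq. (4).
        P = np.zeros((len(states), len(states)))
        for x, s in enumerate(states):
            t = list(s); t[m] ^= 1
            y = index[tuple(t)]
            a = min(1.0, pi[y] / pi[x])  # Metropolis acceptance probability
            P[x, y] = a
            P[x, x] = 1.0 - a
        return P

    T1 = S(0) @ S(2)   # type 1: sites 0 and 2 are not neighbors, hence independent
    T2 = S(1) @ S(3)   # type 2: sites 1 and 3
    R = T1 @ T2        # sweep kernel, Eq. (3)

    print(np.allclose(pi @ T1, pi), np.allclose(pi @ R, pi))  # True True: Eqs. (6), (7)
    D = np.diag(pi)
    print(np.abs(D @ R - (D @ R).T).max())  # nonzero in general: Eq. (8)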
The parallel version of the random skipping sequential (RSS) Monte Carlo algorithm is shown in Fig. 3 for a two-dimensional system. The simulation box is divided into stripes, and all sites in each row are updated sequentially before moving on to the next row; see Fig. 3. Parallelization does not affect the serial efficiency of the RSS algorithm, so the parallel computing improvement always comes on top of the statistical efficiency improvement [8] from sequential updating.

FIG. 3. Illustration of the parallel version of the random skipping sequential (RSS) Monte Carlo algorithm of Ref. 8. The simulation box is divided into four stripes, and each stripe is updated by one processor via sequential updating. No communication is necessary until all CPUs finish updating the top and bottom lines of their domains.

III. OPTIMAL NUMBER OF PROCESSORS

In parallel Monte Carlo simulations, increasing the number of processors does not necessarily reduce the simulation time, because of the extra communication overhead introduced by additional processors. The competition between computation and communication results in an optimal number of processors for a given system size. Above the optimal number of processors, the parallel efficiency decreases as more processors are added.

An example of a parallel RSS Monte Carlo simulation is shown in Fig. 4. This case corresponds to the nearest-neighbor lattice gas on the square lattice. [10] As the number of processors increases, the simulation time decreases at first; beyond a certain point (six CPUs in this case), it starts to increase.

FIG. 4. Simulation time vs number of CPUs for a 120 x 120 lattice gas model on the square lattice. The simulation time corresponds to 10^5 lattice sweeps. The reduced temperature is T* = k_B T / \epsilon = 0.6, and the value of the chemical potential corresponds to the symmetry (zero-field) axis of this model. The simulation time decreases at first and eventually increases with the number of processors.

To determine the optimal number of processors, we formulate the simulation time as an empirical function of the number of processors n and the system size L,

    t = \frac{cL^d}{n} + \sigma_o L^{d-1} n + \sigma_1 n.    (9)

In Eq. (9), d is the dimensionality of the system, c is the unit computing time per cell, \sigma_o is the data transfer overhead, and \sigma_1 is the communication initialization/finalization overhead. In systems of moderate size L, the initialization and finalization overhead \sigma_1 is significantly larger than the data transfer overhead \sigma_o. The simulation time can then be simplified as

    t = \frac{cL^d}{n} + \sigma_1 n.    (10)

From Eq. (10) it follows that there is an optimal number of CPUs that renders the shortest simulation time. Taking the first derivative of the simulation time with respect to the number of CPUs,

    \frac{\partial t}{\partial n} = -\frac{cL^d}{n^2} + \sigma_1,    (11)

one finds that the optimal number of processors for system size L is

    n_{\text{opt}} = \sqrt{\frac{cL^d}{\sigma_1}}.    (12)

As the system size increases, the data transfer overhead \sigma_o becomes larger and more significant than the initialization/finalization overhead \sigma_1. The simulation time is then

    t = \frac{cL^d}{n} + \sigma_o L^{d-1} n.    (13)

In this case, the optimal number of processors,

    n_{\text{opt}} = \sqrt{\frac{cL}{\sigma_o}},    (14)

is independent of the dimensionality d. For very large systems, the communication overhead becomes negligible compared with the lengthy computing time. In this case, the simulation time is

    t = \frac{cL^d}{n}.    (15)

Thus, there is no optimal number of CPUs for very large systems, and the computation time decreases monotonically with the number of CPUs, as shown in Fig. 5.

FIG. 5. Simulation time vs number of CPUs for a 480 x 480 lattice gas system. In this case, the simulation time decreases monotonically with the number of processors.
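As a quick numerical illustration of Eqs. (9)-(14), the Python sketch below evaluates the timing model for assumed (not fitted) values of c, \sigma_o, and \sigma_1 and compares the brute-force optimum of the full model with the two closed-form estimates.

    import numpy as np

    # Empirical timing model of Eq. (9); c, sigma_o, sigma_1 are assumed values,
    # not the overheads measured in the paper.
    def sim_time(n, L, d=2, c=1.0, sigma_o=0.01, sigma_1=5.0):
        return c * L**d / n + sigma_o * L**(d - 1) * n + sigma_1 * n

    L, d, c, sigma_o, sigma_1 = 120, 2, 1.0, 0.01, 5.0
    n = np.arange(1, 129)
    # The exact optimum of the full model is sqrt(cL^d / (sigma_o L^(d-1) + sigma_1)),
    # about 48 for these parameters.
    print("numerical optimum:   ", n[np.argmin(sim_time(n, L, d, c, sigma_o, sigma_1))])
    print("Eq. (12), moderate L:", np.sqrt(c * L**d / sigma_1))
    print("Eq. (14), large L:   ", np.sqrt(c * L / sigma_o))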

IV. CONCLUSIONS

Because it does not rely on strict detailed balance, the sequential algorithm is ideal for parallel computing through domain decomposition. We introduce a parallel scheme that reduces the interprocessor communication that slows down parallel simulation as the number of processors increases. Parallel simulation results show a substantial reduction of simulation time for systems of moderate and large size. The efficiency improvement always comes on top of the statistical efficiency improvement from sequential updating. Future work in this direction involves devising sequential-type algorithms, both serial and parallel, for continuum fluid models.

ACKNOWLEDGMENTS

The authors are grateful to the Intel Higher Education Program, which provided the Linux cluster used in the simulations, and to the developers of MPICH at Argonne National Laboratory, whose software made the parallel programming of this work convenient and possible. Financial support from NSF (CBET-0652131) is gratefully acknowledged.

[1] X. Zhao, A. Striolo, and P. T. Cummings, Biophys. J. 89, 3856 (2005).
[2] D. E. Shaw, J. Comput. Chem. 26, 1318 (2005).
[3] K. J. Bowers, R. O. Dror, and D. E. Shaw, J. Chem. Phys. 124, 184109 (2006).
[4] G. Barkema and T. MacFarland, Phys. Rev. E 50, 1623 (1994).
[5] G. S. Heffelfinger, Comput. Phys. Commun. 128, 219 (2000).
[6] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids (Clarendon, Oxford, 1987).
[7] D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd ed. (Academic, New York, 2002).
[8] R. Ren and G. Orkoulas, J. Chem. Phys. 124, 064109 (2006).
[9] D. P. Landau and K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, 2000).
[10] R. Ren, C. J. O'Keeffe, and G. Orkoulas, Mol. Phys. 105, 231 (2007).