Optimizing Molecular Dynamics

This chapter discusses performance tuning of parallel and distributed molecular dynamics (MD) simulations, which involves both: (1) intranode optimization within each node of a parallel computer; and (2) internode optimization involving communication.

Intranode Optimization: Memory Access Pattern in Molecular Dynamics

A molecular dynamics (MD) program using the linked-list cell method is characterized by a random memory access pattern due to excessive indirections, which in turn can cause a significant degradation of MFlops performance.

Fig. 1: Linked-list access to atoms in a cell.

In the program pmd.c, interatomic forces are computed in nested for loops over central and neighbor cells. Atoms in each cell are accessed by following the linked list implemented with the array lscl, and this access pattern is unchanged for the entire force-computing routine. To take advantage of this fixed memory access pattern, we can copy the entire coordinate array, r, to another array, r1, rearranging the atoms so that memory access becomes consecutive [1]. We also prepare an array, cell_end, which holds the starting and ending atomic indices in r1 for all the cells (a sketch of this reordering follows Fig. 2). For cell c, the atoms are then accessed as:

    for i = cell_end[c]+1 to cell_end[c+1]
      access r1[i]
    endfor

Fig. 2: Modification of array layout for atomic coordinates.
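The reordering described above can be written down in a few lines of C. The following is a minimal illustration, not the actual pmd.c code; it assumes the pmd.c linked-list conventions (head[c] is the first atom of cell c, lscl[i] the atom following atom i, EMPTY terminates a list) and adopts the convention that r1[cell_end[c]+1 .. cell_end[c+1]] are the atoms of cell c, so cell_end has nc+1 entries.

    #define EMPTY -1

    /* Copy r into r1 so that atoms of the same cell become contiguous, and
       record the cell boundaries in cell_end: cell_end[0] = -1, and atoms of
       cell c occupy r1[cell_end[c]+1 .. cell_end[c+1]] (an assumed convention,
       chosen to match the pseudocode above). */
    void reorder_atoms(int nc, double r[][3], int head[], int lscl[],
                       double r1[][3], int cell_end[]) {
      int c, i, k = 0;
      cell_end[0] = -1;
      for (c = 0; c < nc; c++) {                  /* cells in a fixed traversal order */
        for (i = head[c]; i != EMPTY; i = lscl[i]) {
          r1[k][0] = r[i][0];                     /* consecutive writes; subsequent   */
          r1[k][1] = r[i][1];                     /* force loops read r1 sequentially */
          r1[k][2] = r[i][2];
          k++;
        }
        cell_end[c+1] = k - 1;                    /* last r1 index belonging to cell c */
      }
    }

With this layout, the force loop for cell c streams through r1[cell_end[c]+1] .. r1[cell_end[c+1]] with unit stride, improving cache-line reuse compared with chasing the linked list through r.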

Spacefilling (Hilbert-Peano) Curve

The above data layout achieves data locality. The next step is to achieve computation locality. Note that the pmd.c program loops over the pairs of atoms residing in the nearest-neighbor cells. The locality of this computation can be achieved by ordering the cells such that the spatial proximity of consecutive cells is preserved. A spacefilling curve may be used for this purpose. Below, we review one type of spacefilling curve called the Hilbert-Peano curve.

Gray code: a sequence of numbers such that successive numbers have Hamming distance 1. (The Hamming distance is the total number of bit positions at which two binary numbers differ.) The k-bit Gray code G(k) is defined recursively (a code sketch of this construction is given below):
(1) G(1) is the sequence: 0 1.
(2) G(k+1) is constructed from G(k) as follows.
  a. Construct a new sequence by appending a 0 to the left of all members of G(k).
  b. Construct a new sequence by reversing G(k) and then appending a 1 to the left of all members of the reversed sequence.
  c. G(k+1) is the concatenation of the sequences defined in steps a and b.

Example:
> Two-bit Gray code: 00 01 11 10
> Three-bit Gray code: 000 001 011 010 110 111 101 100

EMBEDDING A LINE TOPOLOGY INTO A HYPERCUBE

Map processor i of the line topology (size 2^d) onto the hypercube node labeled by the i-th entry of the d-bit Gray code G(d). (3D example)

SPACEFILLING CURVE

Spacefilling curve: a mapping from [0,1] to [0,1]^d, i.e., a one-dimensional curve that fills a d-dimensional cube. It has many applications in graph partitioning, image compression, optimization (the traveling salesman problem), etc. Partitioning and ordering many points in a d-dimensional space (many such problems are NP-complete) thereby approximately reduces to a one-dimensional sorting problem, whose complexity is O(N log N).

Hilbert curve: a special spacefilling curve, which is based on the Gray sequence. The Hilbert curve in d-dimensional space uses the d-dimensional (d-bit) Gray code. Note that in 3 dimensions there are 24 possible Gray sequences: 8 starting nodes, each having 3 possible terminating nodes. All of them are used to construct a Hilbert curve.
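As a concrete illustration of the recursive construction, here is a minimal C sketch (the function gray_code and its array argument are illustrative, not part of pmd.c). It builds G(k) by keeping G(k-1) as the 0-prefixed first half and appending the reversed sequence with the leading 1 bit set. The closed form i ^ (i >> 1) generates the same sequence and can be used directly for the line-to-hypercube embedding above.

    #include <stdio.h>

    /* Fill gray[0 .. 2^k - 1] with the k-bit Gray code G(k), k >= 1. */
    void gray_code(int k, int gray[]) {
      int b, len, i;
      gray[0] = 0;
      gray[1] = 1;                                /* G(1) = 0 1 */
      for (b = 1, len = 2; b < k; b++, len *= 2)  /* build G(b+1) from G(b):          */
        for (i = 0; i < len; i++)                 /* the 0-prefixed half is already   */
          gray[2*len - 1 - i] = gray[i] | len;    /* in place; append reversed half|1 */
    }

    int main(void) {
      int g[8], i;
      gray_code(3, g);
      for (i = 0; i < 8; i++)     /* print each 3-bit pattern; the output is      */
        printf("%d%d%d ",         /* 000 001 011 010 110 111 101 100, matching    */
               (g[i] >> 2) & 1,   /* the three-bit example above                  */
               (g[i] >> 1) & 1,
                g[i]       & 1);
      printf("\n");
      return 0;
    }

Each of the 24 three-bit sequences listed next is such a Gray sequence, distinguished by its starting and terminating node.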

The 24 rotated three-bit Gray seeds, as listed by the seed_rotate program:

    $ seed_rotate
    start = 0 end = 1: 0 2 6 4 5 7 3 1
    start = 0 end = 2: 0 4 5 1 3 7 6 2
    start = 0 end = 4: 0 1 3 2 6 7 5 4
    start = 1 end = 0: 1 3 2 6 7 5 4 0
    start = 1 end = 3: 1 5 7 6 4 0 2 3
    start = 1 end = 5: 1 0 4 6 2 3 7 5
    start = 2 end = 0: 2 3 1 5 7 6 4 0
    start = 2 end = 3: 2 6 7 5 4 0 1 3
    start = 2 end = 6: 2 0 4 5 1 3 7 6
    start = 3 end = 1: 3 2 6 7 5 4 0 1
    start = 3 end = 2: 3 7 5 1 0 4 6 2
    start = 3 end = 7: 3 1 0 2 6 4 5 7
    start = 4 end = 0: 4 6 2 3 7 5 1 0
    start = 4 end = 5: 4 0 1 3 2 6 7 5
    start = 4 end = 6: 4 5 7 3 1 0 2 6
    start = 5 end = 1: 5 7 6 4 0 2 3 1
    start = 5 end = 4: 5 1 3 7 6 2 0 4
    start = 5 end = 7: 5 4 0 1 3 2 6 7
    start = 6 end = 2: 6 7 5 4 0 1 3 2
    start = 6 end = 4: 6 2 3 7 5 1 0 4
    start = 6 end = 7: 6 4 0 2 3 1 5 7
    start = 7 end = 3: 7 6 2 0 4 5 1 3
    start = 7 end = 5: 7 3 1 0 2 6 4 5
    start = 7 end = 6: 7 5 4 0 1 3 2 6

The Hilbert curve is obtained as the limit of a recursive procedure: prepare a Gray code as a seed, and recursively replace its nodes by (rotated) Gray seeds.

EXAMPLE: 2-DIMENSIONAL HILBERT CURVES
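As a companion to this example, the 2D Hilbert ordering can also be computed non-recursively with bit manipulation. The following is a minimal C sketch (the names xy2d and rot are illustrative, and one common orientation convention is assumed, which may differ from the figures by a rotation or reflection). It maps a cell coordinate (x, y) on an n-by-n grid, with n a power of two, to its index along the curve; sorting cells, or the TSP cities of the next section, by this key yields the locality-preserving one-dimensional ordering.

    /* Rotate/reflect so that the lower-order bits of (x, y) refer to the
       local orientation of the sub-quadrant's curve. */
    void rot(int n, int *x, int *y, int rx, int ry) {
      int t;
      if (ry == 0) {
        if (rx == 1) {              /* reflect through the grid center */
          *x = n - 1 - *x;
          *y = n - 1 - *y;
        }
        t = *x; *x = *y; *y = t;    /* transpose x and y */
      }
    }

    /* Map (x, y), 0 <= x, y < n (n a power of two), to its index d along the
       2D Hilbert curve, 0 <= d < n*n. */
    int xy2d(int n, int x, int y) {
      int rx, ry, s, d = 0;
      for (s = n / 2; s > 0; s /= 2) {
        rx = (x & s) > 0;
        ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);   /* quadrants (rx,ry) are visited in the */
        rot(n, &x, &y, rx, ry);         /* 2-bit Gray order 00, 01, 11, 10      */
      }
      return d;
    }

On a 2x2 grid, for example, (0,0), (0,1), (1,1), (1,0) map to 0, 1, 2, 3, i.e., the quadrants follow the two-bit Gray code 00 01 11 10.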

APPLICATION OF HILBERT CURVE: TRAVELLING SALESMAN PROBLEM

Traveling salesman problem: given N cities on a map, find the shortest path that visits all the cities. This is known as an NP-complete problem; i.e., no efficient exact algorithm is known, and naively all orderings must be tested, so the cost grows exponentially with N. A heuristic solution to the traveling salesman problem is obtained by using the Hilbert curve. First, divide a square containing all the cities into 2^m x 2^m cells so that each cell contains at most one city. Second, draw the Hilbert curve that traverses all the cells. Finally, visit the cities according to their one-dimensional sequence on the Hilbert curve.

Internode Optimization: Metacomputerized Molecular Dynamics

Metacomputing: using geographically distributed computing resources as a single computing platform.

Metacomputing applications
> Distributed supercomputing: large-scale computation that is beyond the power of a single parallel supercomputer.
> Collaborative computing: collaborative, hybrid computation that integrates distributed, multiple expertise [2].

Metacomputing tools
> MPICH-G2: Grid-enabled version of MPI. It facilitates multi-protocol communication, cross-platform authentication, etc., in a heterogeneous metacomputing environment [3].
> MPICH-GQ: MPICH-G2 with quality-of-service support [4].
> Grid remote procedure call (GridRPC): hybrid GridRPC (e.g., Ninf-G, see http://ninf.apgrid.org) + MPI programs run on a Grid of distributed parallel computers, in which the number of processors changes dynamically on demand, and resources are allocated and tasks are migrated adaptively in response to unexpected faults [5].

METACOMPUTERIZED MD: OVERLAPPING COMPUTATION AND COMMUNICATION

Using MPICH-G2, parallel MD codes such as pmd.c can be run in a metacomputing environment, e.g., by constructing a virtual machine consisting of hosts at USC and at the Grid Technology Research Center in Japan. We only need to prepare a processor-group file that contains host names at both institutions. The problem with such a brute-force approach is latency.

Since the force computations cannot start until all the communications for caching are complete, the larger latency associated with the wide-area network between the U.S. and Japan will cause processors to be idle most of the time, waiting for the messages (see Fig. 3, center).

One possible solution to this latency problem is the use of asynchronous messages to overlap computation and communication. To do so, we first classify the linked-list cells into inner and boundary cells, see Fig. 4. The inner cells do not have any face that coincides with one of the processor boundaries, and therefore the forces on the atoms in an inner cell can be computed without any cached information. Inner-cell computation can thus be overlapped with communication, see Fig. 3, right. Boundary cells, on the other hand, have processor boundaries as one or more of their faces, and their force computation requires cached information. Therefore, we need to wait for the asynchronous messages to complete before we start the force computation for the boundary cells.

Fig. 4: Inner and boundary cells in a processor for the linked-list cell method are shown along with cached cells from other processors.

The following is the metacomputerized MD algorithm (an illustrative MPI sketch of this overlap is given after Fig. 3 below):
1. asynchronous receive of cells to be cached
2. send atomic coordinates in the boundary cells
3. compute forces for atoms in the inner cells
4. wait for the completion of the asynchronous receive
5. compute forces for atoms in the boundary cells

The actual implementation of the above idea is slightly more complex. Since the message passing is done in a 3-step loop (over the x, y and z directions), we need to specify which groups of cells are allowed to compute forces after each step of message passing is completed. Specifically, let us define the following 4 groups: 1) inner cells; 2) boundary cells without any y or z processor-boundary faces; 3) boundary cells without any z processor-boundary faces; and 4) boundary cells with z processor-boundary faces.

(Question) Modify the above metacomputerized MD algorithm, taking account of the stepwise communication scheme.

Fig. 3: Gantt charts for parallel MD algorithms, where the arrows, thin lines and boxes indicate time progress, messages and computation activity, respectively. (Left) Regular spatial decomposition as in pmd.c on tightly coupled computers. (Center) The same in a metacomputing environment involving computers at USC and the Grid Technology Research Center in Japan. (Right) Metacomputerized MD at USC-Japan.
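The following is a minimal MPI sketch of steps 1-5 above, for a single communication direction. The helper routines pack_boundary, unpack_cache, compute_inner_forces and compute_boundary_forces, as well as the buffer sizes and neighbor ranks, are illustrative assumptions and not the actual pmd.c interface.

    #include <mpi.h>

    void pack_boundary(double *buf);         /* illustrative helpers, assumed */
    void unpack_cache(double *buf);          /* to be defined elsewhere       */
    void compute_inner_forces(void);
    void compute_boundary_forces(void);

    void force_step(MPI_Comm comm, int nbr_recv, int nbr_send,
                    double *recvbuf, int nrecv, double *sendbuf, int nsend) {
      MPI_Request req[2];

      /* 1. asynchronous receive of cells to be cached */
      MPI_Irecv(recvbuf, nrecv, MPI_DOUBLE, nbr_recv, 10, comm, &req[0]);

      /* 2. send atomic coordinates in the boundary cells */
      pack_boundary(sendbuf);
      MPI_Isend(sendbuf, nsend, MPI_DOUBLE, nbr_send, 10, comm, &req[1]);

      /* 3. compute forces for atoms in the inner cells; this work overlaps
            the wide-area message transfer */
      compute_inner_forces();

      /* 4. wait for the completion of the asynchronous receive (and send) */
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
      unpack_cache(recvbuf);

      /* 5. compute forces for atoms in the boundary cells using cached atoms */
      compute_boundary_forces();
    }

Since the atom caching in pmd.c proceeds in three steps over the x, y and z directions, such a sketch would be repeated per direction, with the progressively constrained cell groups 2)-4) defined above becoming eligible for force computation after the corresponding step completes.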

METACOMPUTERIZED MD: RENORMALIZED MESSAGES

To reduce the latency, it is desirable to minimize the number of messages. For metacomputing involving multiple processors at one geographical site, latency can be reduced significantly by composing a single large cross-site message instead of sending all the processor-to-processor messages across the site boundary, see Fig. 5. Such operations are facilitated using the communicator construct in MPI.

Fig. 5: (Top) Processor-to-processor messages. (Bottom) A renormalized message.

References

1. J. Mellor-Crummey, D. Whalley, and K. Kennedy, Improving memory hierarchy performance for irregular applications using data and computation reorderings, International Journal of Parallel Programming 29, 217 (2001).
2. H. Kikuchi, R. K. Kalia, A. Nakano, P. Vashishta, H. Iyetomi, S. Ogata, T. Kouno, F. Shimojo, K. Tsuruta, and S. Saini, Collaborative simulation Grid: multiscale quantum-mechanical/classical atomistic simulations on distributed PC clusters in the US and Japan, in Proceedings of Supercomputing 2002 (IEEE Computer Society, Los Alamitos, CA, 2002).
3. I. Foster, J. Geisler, W. D. Gropp, N. T. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke, Wide-area implementation of the Message Passing Interface, Parallel Computing 24, 1735 (1998); see also the MPICH-G2 (Grid/Globus-enabled MPI) homepage, http://www.hpclab.niu.edu/mpi.
4. A. Roy, I. Foster, W. D. Gropp, N. T. Karonis, V. Sander, and B. Toonen, MPICH-GQ: quality-of-service for message passing programs, in Proceedings of Supercomputing 2000 (IEEE Computer Society, Los Alamitos, CA, 2000).
5. H. Takemiya, Y. Tanaka, S. Sekiguchi, S. Ogata, R. K. Kalia, A. Nakano, and P. Vashishta, Sustainable adaptive Grid supercomputing: multiscale simulation of semiconductor processing across the Pacific, in Proceedings of Supercomputing 2006 (IEEE Computer Society, Los Alamitos, CA, 2006).