Optimizing Molecular Dynamics

This chapter discusses performance tuning of parallel and distributed molecular dynamics (MD) simulations, which involves both: (1) intranode optimization within each node of a parallel computer; and (2) internode optimization involving communication.

Intranode Optimization: Memory Access Pattern in Molecular Dynamics

An MD program using the linked-list cell method is characterized by a random memory-access pattern due to excessive indirection. This can in turn cause a significant degradation of MFlops performance.

Fig. 1: Linked-list access to atoms in a cell.

In the program pmd.c, interatomic forces are computed in nested for loops over central and neighbor cells. Atoms in each cell are accessed by following the linked list implemented with the array lscl, and this access pattern is unchanged for the entire force-computing routine. To take advantage of this fixed memory-access pattern, we can copy the entire coordinate array, r, to another array, r1, rearranging the atoms so that memory access becomes consecutive [1]. We also prepare an array, cell_end, which holds the starting and ending atomic indices in r1 for all the cells. The atoms in cell c are then accessed as:

for i = cell_end[c]+1 to cell_end[c+1]
  access r1[i]
endfor

Fig. 2: Modification of array layout for atomic coordinates.
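This reordering can be sketched in C as follows. The arrays head, lscl, r, r1 and cell_end follow the naming in the text; the one-dimensional coordinates, the toy cell assignment (atom i in cell i%3), and the function names are illustrative choices, not code from pmd.c (where r is a 2D coordinate array).

```c
#include <assert.h>

#define NCELL 3   /* number of cells in this toy example */
#define NATOM 6   /* number of atoms */
#define EMPTY (-1)

int head[NCELL];        /* head[c]: first atom in cell c (as in pmd.c) */
int lscl[NATOM];        /* lscl[i]: next atom after i in the same cell */
double r[NATOM];        /* atomic coordinates (1D here for simplicity) */
double r1[NATOM];       /* reordered coordinates, consecutive per cell */
int cell_end[NCELL+1];  /* cell c occupies r1[cell_end[c]+1 .. cell_end[c+1]] */

/* Build a toy linked-list cell structure: atom i belongs to cell i%3. */
void build_example_lists(void) {
  for (int c = 0; c < NCELL; c++) head[c] = EMPTY;
  for (int i = 0; i < NATOM; i++) {
    int c = i % 3;        /* hypothetical cell assignment */
    r[i] = (double)i;
    lscl[i] = head[c];    /* prepend atom i to cell c's list */
    head[c] = i;
  }
}

/* Copy r into r1 cell-by-cell so that atoms in the same cell become
   contiguous, recording each cell's last r1 index in cell_end. */
void reorder(void) {
  int n = 0;
  cell_end[0] = -1;       /* first cell starts at r1[0] */
  for (int c = 0; c < NCELL; c++) {
    for (int i = head[c]; i != EMPTY; i = lscl[i])
      r1[n++] = r[i];
    cell_end[c+1] = n - 1;
  }
}
```

After reorder(), the force loops can traverse r1 with unit stride using the pseudocode above, instead of chasing lscl pointers through r.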
Spacefilling (Hilbert-Peano) Curve

The above data layout achieves data locality. The next step is to achieve computation locality. Note that the pmd.c program loops over pairs of atoms residing in nearest-neighbor cells. The locality of this computation can be achieved by ordering the cells such that the spatial proximity of consecutive cells is preserved. A spacefilling curve may be used for this purpose. Below, we review one type of spacefilling curve called the Hilbert-Peano curve.

Gray code: a sequence of numbers in which each pair of successive numbers has Hamming distance 1. (The Hamming distance is the total number of bit positions at which two binary numbers differ.) The k-bit Gray code G(k) is defined recursively.
(1) G(1) is the sequence: 0 1.
(2) G(k+1) is constructed from G(k) as follows.
  a. Construct a new sequence by appending a 0 to the left of all members of G(k).
  b. Construct a new sequence by reversing G(k) and then appending a 1 to the left of all members of the reversed sequence.
  c. G(k+1) is the concatenation of the sequences defined in steps a and b.

Example:
> Two-bit Gray code: 00 01 11 10
> Three-bit Gray code: 000 001 011 010 110 111 101 100

EMBEDDING A LINE TOPOLOGY INTO A HYPERCUBE

Map processor i of the line topology (size 2^d) onto the node of the d-dimensional hypercube labeled by the i-th entry of the d-bit Gray code. (3D example)

SPACEFILLING CURVE

Spacefilling curve: a mapping from [0,1] to [0,1]^d, i.e., a one-dimensional curve that fills a d-dimensional cube. It has many applications in graph partitioning, image compression, optimization (e.g., the traveling salesman problem), etc. Partitioning and ordering many points in a d-dimensional space (many such problems are NP-complete) approximately reduces to a one-dimensional sorting problem, whose complexity is O(N log N).

Hilbert curve: a special spacefilling curve, which is based on the Gray sequence. The Hilbert curve in d-dimensional space uses the d-bit Gray code.
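The recursive construction of G(k) above can be sketched in C. The function name gray and the choice of storing each bit string as an integer are illustrative:

```c
#include <assert.h>

/* Build the k-bit Gray code G(k) by the recursive rule in the text:
   (a) prefix G(k-1) with 0 (values unchanged);
   (b) prefix the reversal of G(k-1) with 1 (set the top bit);
   (c) concatenate.  g must have room for 2^k entries. */
void gray(int k, int g[]) {
  if (k == 1) {                 /* G(1) is the sequence 0 1 */
    g[0] = 0;
    g[1] = 1;
    return;
  }
  int half = 1 << (k - 1);
  gray(k - 1, g);               /* step a: G(k-1) occupies g[0..half-1] */
  for (int i = 0; i < half; i++)
    g[half + i] = g[half - 1 - i] | half;   /* step b: reversed, top bit set */
}
```

For k = 3 this reproduces the three-bit example in the text (000 001 011 010 110 111 101 100); equivalently, the i-th Gray codeword is i ^ (i >> 1).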
Note that in 3 dimensions there are 24 possible Gray sequences: 8 starting nodes, each with 3 possible terminating nodes. All of them are used to construct the Hilbert curve.
$ seed_rotate
start = 0 end = 1: 0 2 6 4 5 7 3 1
start = 0 end = 2: 0 4 5 1 3 7 6 2
start = 0 end = 4: 0 1 3 2 6 7 5 4
start = 1 end = 0: 1 3 2 6 7 5 4 0
start = 1 end = 3: 1 5 7 6 4 0 2 3
start = 1 end = 5: 1 0 4 6 2 3 7 5
start = 2 end = 0: 2 3 1 5 7 6 4 0
start = 2 end = 3: 2 6 7 5 4 0 1 3
start = 2 end = 6: 2 0 4 5 1 3 7 6
start = 3 end = 1: 3 2 6 7 5 4 0 1
start = 3 end = 2: 3 7 5 1 0 4 6 2
start = 3 end = 7: 3 1 0 2 6 4 5 7
start = 4 end = 0: 4 6 2 3 7 5 1 0
start = 4 end = 5: 4 0 1 3 2 6 7 5
start = 4 end = 6: 4 5 7 3 1 0 2 6
start = 5 end = 1: 5 7 6 4 0 2 3 1
start = 5 end = 4: 5 1 3 7 6 2 0 4
start = 5 end = 7: 5 4 0 1 3 2 6 7
start = 6 end = 2: 6 7 5 4 0 1 3 2
start = 6 end = 4: 6 2 3 7 5 1 0 4
start = 6 end = 7: 6 4 0 2 3 1 5 7
start = 7 end = 3: 7 6 2 0 4 5 1 3
start = 7 end = 5: 7 3 1 0 2 6 4 5
start = 7 end = 6: 7 5 4 0 1 3 2 6

The Hilbert curve is obtained as the limit of a recursive procedure: prepare a Gray code as a seed, and recursively replace its nodes by (rotated) Gray seeds.

EXAMPLE: 2-DIMENSIONAL HILBERT CURVES
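In two dimensions, the limit of this recursive procedure can also be computed directly. The routines below follow the widely used bit-manipulation formulation of the 2D Hilbert curve (a standard alternative to the seed-rotation program above, not code from this chapter): d2xy maps a distance d along the curve to cell coordinates (x, y) on an n-by-n grid, and xy2d is its inverse, which is the key one sorts by when ordering cells or points.

```c
#include <stdlib.h>

/* Rotate/reflect a quadrant so that sub-curves are oriented correctly. */
void rot(int n, int *x, int *y, int rx, int ry) {
  if (ry == 0) {
    if (rx == 1) {                 /* reflect */
      *x = n - 1 - *x;
      *y = n - 1 - *y;
    }
    int t = *x; *x = *y; *y = t;   /* transpose */
  }
}

/* Distance d along the Hilbert curve -> cell (x,y) on an n-by-n grid
   (n a power of 2); the curve is built bottom-up from 2x2 seeds. */
void d2xy(int n, int d, int *x, int *y) {
  int t = d;
  *x = *y = 0;
  for (int s = 1; s < n; s *= 2) {
    int rx = 1 & (t / 2);
    int ry = 1 & (t ^ rx);
    rot(s, x, y, rx, ry);
    *x += s * rx;
    *y += s * ry;
    t /= 4;
  }
}

/* Inverse map: cell (x,y) -> distance d along the Hilbert curve.
   Sorting cells (or atoms, or cities) by this key orders them so that
   consecutive items are spatially close. */
int xy2d(int n, int x, int y) {
  int d = 0;
  for (int s = n / 2; s > 0; s /= 2) {
    int rx = (x & s) > 0;
    int ry = (y & s) > 0;
    d += s * s * ((3 * rx) ^ ry);
    rot(n, &x, &y, rx, ry);
  }
  return d;
}
```

The defining property, that consecutive values of d land in nearest-neighbor cells, is exactly the computation locality sought for the linked-list cells.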
APPLICATION OF HILBERT CURVE: TRAVELING SALESMAN PROBLEM

Traveling salesman problem: given N cities on a map, find the shortest path that visits all the cities. This is known as an NP-complete problem (no polynomial-time algorithm is known, and the cost of exhaustive search grows exponentially with N). A heuristic solution to the traveling salesman problem is obtained using the Hilbert curve. First, divide a square containing all the cities into 2^m x 2^m cells so that each cell contains at most one city. Second, draw the Hilbert curve, which traverses all the cells. Finally, visit the cities according to their one-dimensional sequence along the Hilbert curve.

Internode Optimization: Metacomputerized Molecular Dynamics

Metacomputing: using geographically distributed computing resources as a single computing platform.

Metacomputing applications
> Distributed supercomputing: large-scale computation that is beyond the power of a single parallel supercomputer.
> Collaborative computing: collaborative, hybrid computation that integrates distributed, multiple expertise [2].

Metacomputing tools
> MPICH-G2: Grid-enabled version of MPI. It facilitates multi-protocol communication, cross-platform authentication, etc., in a heterogeneous metacomputing environment [3].
> MPICH-GQ: MPICH-G2 with quality-of-service support [4].
> Grid remote procedure call (GridRPC): hybrid GridRPC (e.g., NinfG, see http://ninf.apgrid.org) + MPI programs run on a Grid of distributed parallel computers, in which the number of processors changes dynamically on demand, and resources are allocated and tasks migrated adaptively in response to unexpected faults [5].

METACOMPUTERIZED MD: OVERLAPPING COMPUTATION AND COMMUNICATION

Using MPICH-G2, parallel MD codes such as pmd.c can be run in a metacomputing environment, e.g., by constructing a virtual machine consisting of hosts at USC and at the Grid Technology Research Center in Japan. We only need to prepare a processor-group file that contains host names at both institutions.
The problem with such a brute-force approach is latency. Since the force computation cannot start until all the communications for caching complete, the larger latency associated with wide-area networks
between the U.S. and Japan will cause processors to be idle most of the time, waiting for messages (see Fig. 3, center). One possible solution to this latency problem is the use of asynchronous messages to overlap computation and communication. To do so, we first classify the linked-list cells into inner and boundary cells, see Fig. 4. An inner cell does not have any face that coincides with a processor boundary, and therefore the forces on its atoms can be computed without any cached information. Inner-cell computation can thus be overlapped with communication, see Fig. 3, right. Boundary cells, on the other hand, have processor boundaries as one or more of their faces, and their force computation requires cached information. Therefore, we need to wait for the asynchronous messages to complete before we start force computation for the boundary cells. The following is the metacomputerized MD algorithm:

1. asynchronous receive of cells to be cached
2. send atomic coordinates in the boundary cells
3. compute forces for atoms in the inner cells
4. wait for the completion of the asynchronous receive
5. compute forces for atoms in the boundary cells

The actual implementation of the above idea is slightly more complex. Since the message passing is done in a 3-step loop (x, y and z directions), we need to specify which groups of cells are allowed to compute forces after each step of message passing completes. Specifically, let us define the following 4 groups: 1) inner cells; 2) boundary cells without any y or z processor-boundary faces; 3) boundary cells without any z processor-boundary faces; and 4) boundary cells with z processor-boundary faces.

(Question) Modify the above metacomputerized MD algorithm, taking account of the stepwise communication scheme.

METACOMPUTERIZED MD: RENORMALIZED MESSAGES

Fig.
3: Gantt charts for parallel MD algorithms, where the arrows, thin lines and boxes indicate time progress, messages and computation activity, respectively. (Left) Regular spatial decomposition as in pmd.c on tightly-coupled computers. (Center) The same in a metacomputing environment involving computers at USC and the Grid Technology Research Center in Japan. (Right) Metacomputerized MD in the same USC-Japan environment.

Fig. 4: Inner and boundary cells in a processor for the linked-list cell method are shown along with cached cells from other processors.

To reduce the latency, it is desirable to minimize the number of messages. For metacomputing involving multiple processors at each geographical site, latency can be reduced significantly by
composing a single large cross-site message instead of sending all processor-to-processor messages across the site boundary, see Fig. 5. Such operations are facilitated using the communicator construct in MPI.

Fig. 5: (Top) Processor-to-processor messages. (Bottom) A renormalized message.

References

1. J. Mellor-Crummey, D. Whalley, and K. Kennedy, Improving memory hierarchy performance for irregular applications using data and computation reorderings, International Journal of Parallel Programming 29, 217 (2001).
2. H. Kikuchi, R. K. Kalia, A. Nakano, P. Vashishta, H. Iyetomi, S. Ogata, T. Kouno, F. Shimojo, K. Tsuruta, and S. Saini, Collaborative simulation Grid: multiscale quantum-mechanical/classical atomistic simulations on distributed PC clusters in the US and Japan, in Proceedings of Supercomputing 2002 (IEEE Computer Society, Los Alamitos, CA, 2002).
3. I. Foster, J. Geisler, W. D. Gropp, N. T. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke, Wide-area implementation of the Message Passing Interface, Parallel Computing 24, 1735 (1998); see also the MPICH-G2 (Grid/Globus-enabled MPI) homepage, http://www.hpclab.niu.edu/mpi.
4. A. Roy, I. Foster, W. D. Gropp, N. T. Karonis, V. Sander, and B. Toonen, MPICH-GQ: quality-of-service for message passing programs, in Proceedings of Supercomputing 2000 (IEEE Computer Society, Los Alamitos, CA, 2000).
5. H. Takemiya, Y. Tanaka, S. Sekiguchi, S. Ogata, R. K. Kalia, A. Nakano, and P. Vashishta, Sustainable adaptive Grid supercomputing: multiscale simulation of semiconductor processing across the Pacific, in Proceedings of Supercomputing 2006 (IEEE Computer Society, Los Alamitos, CA, 2006).