BUILDING A HIGHLY SCALABLE MPI RUNTIME LIBRARY ON GRID USING HIERARCHICAL VIRTUAL CLUSTER APPROACH


Theewara Vorakosit and Putchong Uthayopas
High Performance Computing and Networking Center, Faculty of Engineering, Kasetsart University,
50 Phaholyotin Rd, Chatuchak, Bangkok 10900, Thailand
thvo@hpcnc.cpe.ku.ac.th, pu@ku.ac.th

ABSTRACT

For large computational grid systems, the Message Passing Interface (MPI) is one of the most widely used programming models for distributed computing. However, current MPI implementations cannot utilize all computing nodes on a grid, because network configurations effectively hide nodes behind a front-end node or gateway. This work presents an approach called Hierarchical Virtual Clustering (HVC) that overcomes this problem in an efficient manner. The idea is to separate a grid into physical and hierarchically organized logical clusters, and to incorporate routing between these clusters at the MPI runtime library level. This approach has been implemented in an experimental grid-enabled runtime system called MPITH. Experiments comparing the performance of MPITH with MPICH show that comparable speed can be obtained with very little loss of performance. Hence, the approach presented here can be used to vastly increase the grid computing resources available to MPI applications without impairing efficient utilization of those resources.

KEY WORDS
MPI, grid, cluster, NAT transparency, routing algorithm.

1. Introduction

MPI [1],[2] is one of the most established parallel programming models for distributed computing on a grid [3]. However, most of the nodes on a grid are hidden behind cluster front-end nodes; in this situation a grid-based MPI runtime can only utilize the front-end node of each cluster for computation. This substantially reduces the amount of computing power available to grid users. In this paper, a new and systematic approach to enable the MPI runtime to use all nodes on the grid is proposed. This Hierarchical Virtual Cluster, or HVC, approach provides a uniform way to make all nodes addressable by a grid-based MPI runtime environment. Hence, it practically eliminates the problems caused by NAT and gateways. The HVC approach has been incorporated into MPITH [4], our grid-based MPI implementation. Experiments show that this system can utilize more nodes in a scalable and efficient way.

The rest of this paper is organized as follows. Section 2 describes previous work in this area. Section 3 describes the system model. Section 4 gives an overview of the model's implementation in MPITH. Experimental results are presented in Section 5. Finally, conclusions and future work are given in Section 6.

2. Related Work

Many programming models have been employed to harvest the enormous computing power of a grid system; an extensive survey of various approaches can be found in [5]. MPI is one of the most established parallel programming models that can be used on a grid. Many MPI implementations have been developed; of these, two of the most widely used are LAM [6] and MPICH [7]. MPICH, developed at Argonne National Laboratory, provides a portable, high-performance MPI implementation for clusters and grid systems; MPICH supports both the MPI 1.2 and 2.0 standards. LAM is another MPI implementation that provides many additional features such as OpenPBS integration, beta grid support, high-speed interconnect support, checkpoint and restart, and fast start-up.
Open MPI [8], a recent collaboration between several universities and Los Alamos National Laboratory, aims to integrate the technologies of several MPI projects into a fast, efficient, MPI-2 compliant implementation. Open MPI currently focuses on fault tolerance, heterogeneity, and performance rather than providing seamless grid support. There are also special-purpose MPI implementations that focus on specific issues. MPI/FT [9] focuses on making the MPI runtime fault-tolerant in a semi-transparent way. MPICH-G2 [10] is an extension of MPICH to grid systems, supporting the authentication, scheduling, and resource management requirements of a grid environment. However, MPICH-G2 does not yet provide a solution for running tasks across NAT gateways. MagPIe [11], another MPI library that extends MPICH, attempts to optimize the MPI communication algorithms for wide area networks using a two-layer hierarchy: cluster level (LAN) and grid level (WAN). Optimizing MPI communications is also the focus of other projects such as GridMPI [12], which incorporates an algorithm to build a latency-aware communication library.

The MPITH project has also proposed topology-aware communications using a genetic algorithm and smart message scheduling.

Another challenge presented by the grid is how to build an efficient communication library that is aware of the presence of NAT gateways. This is crucial since, as mentioned above, most current MPI implementations cannot utilize the majority of computing resources hidden behind NAT gateways. There are two approaches to this problem. The first is to use an MPI routing daemon. PACX-MPI [13] utilizes this approach: intra-cluster communication takes place directly among tasks, while inter-cluster communication is done through communication nodes called MPI-servers. An MPI-server compresses data and transfers it via TCP/IP to a destination, called a PACX-server, which decompresses the data and sends it to the target node. Although this is a viable solution, it has some drawbacks. First, the MPI-server and PACX-server have to be installed and maintained on each cluster, which complicates administration. Second, the MPI-server runs and consumes resources continuously even if no MPI programs are being used. Third, the presence of separate MPI routing processes complicates the runtime system and makes it harder to enforce the grid security model. The second approach is to implement an MPI message forwarder in the MPI runtime itself. This eliminates the need for a separate MPI routing daemon and promotes strict enforcement of the grid security model. MPICH/MADIII [14] uses this approach. MPICH/MADIII is designed to be a complementary tool for MPICH-G2, implementing MPICH device names on top of the Madeleine III communication library. However, interfacing MPICH/MADIII and MPICH-G2 complicates the runtime. Our approach integrates both grid-level and cluster-level message routing into a single MPI implementation. Hence, it provides full transparency and global optimization across the grid and cluster levels.

3. Hierarchical Virtual Clusters

In order to support NAT transparency and provide a highly scalable MPI runtime environment, a new approach called the Hierarchical Virtual Cluster (HVC) model is proposed. This section describes the model and the MPI process routing mechanism.

3.1. The Proposed Model

A cluster is a group of computers or nodes connected together through an interconnection network. A designated front-end node acts as a management point for the cluster. In this paper, we use the term physical cluster to refer to this form of cluster. There are two types of physical clusters: closed and open. In a closed physical cluster, all compute nodes are placed behind the front-end node. The front-end or gateway node provides the NAT facility for the other nodes. In an open physical cluster, all compute nodes are directly addressable from the outside network. A grid is an interconnected set of these clusters. Figure 1 shows a grid consisting of three clusters: Mercury, Venus, and Jupiter. Venus and Jupiter are closed physical clusters while Mercury is an open physical cluster. Venus1, Jupiter1, and Mercury1 are the front-end nodes of their respective clusters.

Figure 1: Clusters environment

Connection initiation between nodes in a physical cluster can be represented by a graph termed a Connectivity Graph (CG), as defined in Definition 1.

Definition 1. A Connectivity Graph (CG) is a directed graph G = (V, E) where V is the set of nodes in a grid and E is the set of edges. An edge (u, v) ∈ E if and only if node u can initiate a connection to node v.

Figure 2 shows the CG derived from Figure 1.

Figure 2: Connectivity Graph of Figure 1

In a NAT environment, a node behind a gateway can initiate a connection to outside nodes directly, but an outside node cannot initiate a connection to a node behind the gateway since it cannot identify the destination address. For example, Mercury2 cannot initiate a connection to Venus2 if Venus2 is in a NAT environment behind Venus1. Although a connectivity graph provides a lot of useful information, this representation of network topology is too complex for our purposes. A connection between two nodes is useful only if initiation can be done bi-directionally. Based on this observation, a connectivity graph can be reduced to a simpler Direct Connectivity Graph (DCG), as defined in Definition 2.

Definition 2. A Direct Connectivity Graph (DCG) is an undirected graph G' = (V', E') where V' = V and E' is the set of edges such that an edge (u, v) ∈ E' if and only if (u, v) ∈ E and (v, u) ∈ E.

Figure 3: DCG derived from Figure 2

Figure 3 shows the DCG resulting from Figure 2. With this concept, the task we need to pursue is finding a systematic way to map MPI tasks onto a grid and provide routing that ultimately organizes the tasks into a DCG. Hence, all tasks will have a way to communicate with each other regardless of any NAT encountered. The mapping proposed in this paper is based on the concept of building a multi-level set of virtual clusters on a grid of physical clusters. A virtual cluster is a set of nodes such that every node can initiate a bi-directional connection to every other node in the cluster. Let G be the DCG of a grid. We define virtual clusters and vertices of virtual clusters as follows.

Definition 3. A Level 1 Virtual Cluster, denoted by VC[1], is a maximal subgraph of G such that the shortest path length from u to v is 1 for all nodes u, v in the subgraph.

Definition 4. A Level i Virtual Cluster, denoted by VC[i], is a maximal subgraph of G such that the shortest path length from u to v is at most i for all nodes u, v in the subgraph.

Definition 5. A gateway vertex of a VC is a vertex v ∈ V whose removal, i.e., taking the reduced vertex set V − {v}, causes the resulting subgraph of G to be divided into two connected components.

Figure 4 illustrates three VC[1]s in G. The group of VC[1]s in G is called a VC[1] set. Note that a VC[1] is a strongly connected component in G, and an edge (u, v) ∈ E if and only if u and v are in the same VC[1]. As with VC[1], there can be more than one VC at each level in G; these VC[i] can be grouped into a VC[i] set. Figures 5 and 6 show how VC[1]s can be merged into a higher level. The gateway vertex is important because in MPI process mapping we always map one process onto this node. This process not only performs computation, but also helps route messages to and from nodes inside the cluster.

Figure 4: VC[1]s in G
Figure 5: Intermediate step of VC merging
Figure 6: Final VC merging

The maximum level of VC in G is equal to the longest path length in G. Using this approach, a mapping of MPI tasks onto physical clusters can be performed by iteratively merging lower level VCs into a higher level VC. Hence, routing can easily be derived by sending messages from a top-level VC in the HVC down the hierarchy of VCs. Another important property is that a VC[i] is always a connected component.
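To make the CG-to-DCG reduction of Definitions 1 and 2 and the level-i condition of Definitions 3 and 4 concrete, the following Python sketch (not taken from the paper; the sample connectivity graph and node names are hypothetical) builds a DCG from a CG and checks whether a given set of nodes satisfies the pairwise shortest-path condition of a VC[i].

    from itertools import combinations
    from collections import deque

    # Hypothetical directed Connectivity Graph (Definition 1): u -> nodes u can reach.
    # Venus2 sits behind the Venus1 gateway, so no outside node can initiate a connection to it.
    cg = {
        "Venus1":   {"Venus2", "Mercury1", "Jupiter1"},
        "Venus2":   {"Venus1", "Mercury1", "Jupiter1"},
        "Mercury1": {"Venus1", "Jupiter1"},
        "Jupiter1": {"Venus1", "Mercury1"},
    }

    def to_dcg(cg):
        # Definition 2: keep an undirected edge (u, v) only if both (u, v) and (v, u) are in the CG.
        nodes = set(cg) | {v for vs in cg.values() for v in vs}
        dcg = {n: set() for n in nodes}
        for u, vs in cg.items():
            for v in vs:
                if u in cg.get(v, set()):
                    dcg[u].add(v)
                    dcg[v].add(u)
        return dcg

    def satisfies_vc_level(dcg, nodes, level):
        # Definitions 3 and 4: every pair in 'nodes' must be within 'level' hops in the DCG.
        def dist(src, dst):
            seen, queue = {src}, deque([(src, 0)])
            while queue:
                n, d = queue.popleft()
                if n == dst:
                    return d
                for m in dcg[n]:
                    if m not in seen:
                        seen.add(m)
                        queue.append((m, d + 1))
            return float("inf")
        return all(dist(u, v) <= level for u, v in combinations(nodes, 2))

    dcg = to_dcg(cg)
    print(satisfies_vc_level(dcg, {"Venus1", "Mercury1", "Jupiter1"}, 1))  # True: a VC[1]
    print(satisfies_vc_level(dcg, {"Venus2", "Mercury1"}, 1))              # False: only reachable via Venus1
    print(satisfies_vc_level(dcg, {"Venus2", "Mercury1"}, 2))              # True: within a VC[2]

A full VC, as required by Definitions 3 and 4, is in addition maximal: no further node can be added without violating the distance condition.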

3.2. Routing Discovery Algorithm

Having established that a VC is always a connected graph, a route can be considered as a shortest path from u to v. The Floyd-Warshall algorithm [15] can be used to find all-pairs shortest paths. First, process mappings are defined based on the HVC model. Once this information has been obtained, the Floyd-Warshall algorithm is used to compute the routing table. This routing table is then passed to all nodes. The HVC concept always ensures that one process is placed on the gateway node or gateway vertex of each VC and provides the necessary routing capability.

To provide routing between each pair of nodes, we define a route from u to v in G as a simple path <u_0, u_1, ..., u_k> such that u = u_0 and v = u_k. Vertex u_1 in the path is called the next hop of the route from u to v. The source vertex only needs to know the next hop to a given destination. Routes are maintained in a routing table, which stores the next-hop vertex for each source and destination pair. Definition 6 defines the routing table formally (the term routing matrix is used interchangeably).

Definition 6. A routing table T is an n × n matrix, where n is the number of vertices in V, such that t_uv = u_1, the next hop on the route from u to v.

The Floyd-Warshall algorithm needs a weight for each edge. We associate a weight with each edge as w_uv = 1 if (u, v) ∈ E; otherwise w_uv = ∞. After the routing table is created, a routing function r can be defined as r(u, v) = t_uv.

3.3. Process-Level Routing

An MPI application consists of a group of processes running cooperatively. Normally, the mapping of processes onto physical cluster nodes is known prior to the startup of the MPI tasks. After the application is started, MPI processes are identified by MPI rank. Given a DCG G = (V, E) that represents the grid, process mapping can be defined as follows.

Definition 7. A process mapping is a function m: P → V where P is the set of processes and V is the set of vertices in G.

Definition 8. The reverse of the process mapping function is a relation μ: V → P where V is the set of vertices in G and P is the set of processes.

Both m and μ are used in the routing routine for MPI message forwarding. Let n(v) be the number of processes that are mapped onto vertex v. At run time, an MPI process can also act as a forwarder, so n(v) ≥ 1 for each v that is a gateway vertex. Routing and message forwarding are implemented as follows. Suppose process p1 sends a message to p2. Let u = m(p1) and v = m(p2). The next-hop vertex is v' = t_uv. Then, find the process that is mapped onto v'; suppose that p3 = μ(v'). At this point, p1 creates an MPI message and sends it to p3, specifying that the destination is p2. After p3 receives the message, it uses the same method to find the next-hop process. This forwarding routine stops when p2 receives the message.
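As a minimal sketch of Sections 3.2 and 3.3 (this is not the MPITH code; the DCG and the process mapping used here are hypothetical), the routing table can be computed with Floyd-Warshall using w_uv = 1 for edges and infinity otherwise, and the forwarding step then reduces to a next-hop lookup followed by the reverse mapping μ:

    INF = float("inf")

    def build_routing_table(dcg):
        # Floyd-Warshall over the DCG; nxt[u][v] is t_uv, the next hop on a shortest u -> v path.
        nodes = list(dcg)
        dist = {u: {v: 0 if u == v else (1 if v in dcg[u] else INF) for v in nodes} for u in nodes}
        nxt = {u: {v: (v if v in dcg[u] else None) for v in nodes} for u in nodes}
        for k in nodes:
            for u in nodes:
                for v in nodes:
                    if dist[u][k] + dist[k][v] < dist[u][v]:
                        dist[u][v] = dist[u][k] + dist[k][v]
                        nxt[u][v] = nxt[u][k]
        return nxt

    def next_process(nxt, m, mu, src_proc, dst_proc):
        # Process-level routing: which process should src_proc hand the message to next?
        u, v = m[src_proc], m[dst_proc]           # Definition 7: process -> vertex
        if u == v:
            return dst_proc                       # destination is on the same node
        return mu[nxt[u][v]]                      # Definition 8, simplified to one resident process per vertex

    # Hypothetical grid: rank 0 on Mercury1 must reach rank 3 hidden behind the Venus1 gateway.
    dcg = {"Mercury1": {"Venus1"}, "Venus1": {"Mercury1", "Venus2"}, "Venus2": {"Venus1"}}
    m = {0: "Mercury1", 1: "Venus1", 3: "Venus2"}
    mu = {"Mercury1": 0, "Venus1": 1, "Venus2": 3}
    nxt = build_routing_table(dcg)
    print(next_process(nxt, m, mu, 0, 3))  # 1: forward through the gateway process on Venus1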
4. MPITH Implementation

In order to evaluate the proposed model, we implemented HVC in MPITH, an experimental library that has been developed as a framework for testing various message passing-based research projects. MPITH supports both intra-cluster and inter-cluster environments. MPITH uses a remote execution command such as rsh or ssh to start processes within the same cluster and GRAM/DUROC for inter-cluster execution. It also provides a utility to create an RSL file for use with the globusrun command.

The implementation of HVC is as follows. In MPITH, processes or tasks are categorized as primary master, secondary master, and slave. A master-type process is responsible for spawning the processes within its physical cluster. There is one primary master; the remaining master-type processes are secondary masters. The primary master differs from the secondaries in that it is the source of the configuration files. A process that is not a master-type process is a slave process. All process types run the same executable code; the process type is specified in the configuration file. There are two configuration files: a grid description file and a process mapping file. The grid description file describes the HVC; it identifies the clusters, the number of nodes in each cluster, and the interconnections among the clusters. The process mapping file describes the process mapping function, including the number of processes mapped onto each cluster, and specifies which process is the primary master.

On startup, master-type processes are started in each cluster using the globusrun command with a generated RSL script. The primary master process takes the two configuration files as additional arguments. After these processes are started, the MPI_Init() function is executed. In MPI_Init(), the master processes synchronize using DUROC and learn the IP address and port of all other master processes. Then, the primary master broadcasts the grid description and process mapping files to the other masters. After the secondary master processes receive the configuration files, they spawn child processes in their clusters according to the requirements in the process mapping file. Child processes then execute the MPI_Init() function to register themselves with their master. Finally, all processes return from MPI_Init() and execute the user code.
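Purely as an illustration (this is not MPITH's actual code or configuration format), the sketch below shows one way the per-cluster process counts from a process mapping file could be turned into the rank lists that each master-type process spawns; the cluster names and counts are hypothetical.

    def assign_ranks(mapping):
        """mapping: dict of cluster name -> number of MPI processes on that cluster (insertion order kept)."""
        ranks, next_rank = {}, 0
        for cluster, count in mapping.items():
            ranks[cluster] = list(range(next_rank, next_rank + count))
            next_rank += count
        return ranks

    # Hypothetical process mapping: cluster -> number of processes.
    mapping = {"Maeka": 16, "Magi": 8, "Gass": 8}
    for cluster, rank_list in assign_ranks(mapping).items():
        # One process per cluster runs on the gateway node and also acts as the message forwarder.
        print(cluster, rank_list)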

5. Experimental Results

The proposed algorithm is tested on the Kasetsart University Grid (KUGrid), which is part of ThaiGrid [16]. Four clusters are included in the evaluation. All clusters are linked through KU's Gigabit campus network. Table 1 shows the clusters' configuration.

Table 1: Test bed configuration
Cluster  Type    Nodes  Brief configuration             Network
Maeka    Closed  32     Opteron, 3 GB RAM               Gigabit
Gass     Open    6      dual Athlon MP 1800+, 1 GB RAM  Fast Ethernet
Magi     Closed         Athlon XP, 512 MB RAM           Fast Ethernet
Amata    Open    5      Athlon, 512 MB RAM              Fast Ethernet

MPITH is evaluated in three areas: point-to-point, broadcast, and application performance, in comparison with MPICH2. All tests presented here were run ten times and the average values are reported.

First, point-to-point performance is evaluated. We compare the time spent in send/receive operations between MPITH and MPICH2. The results are shown in Figure 7.

Figure 7: Point-to-point performance (transmission time in milliseconds versus message size in bytes for the MPICH2 cluster, MPITH cluster, MPITH grid, and MPITH NAT cases)

Figure 7 shows the point-to-point performance of MPITH and MPICH2 for message sizes from 1 byte to 2 MB. The series labeled MPICH2 cluster and MPITH cluster show the transmission time of MPI_Send/MPI_Recv within the Magi cluster. MPITH grid is the transmission time of MPI_Send/MPI_Recv between the head nodes of the Magi and Maeka clusters. Finally, MPITH NAT is the transmission time between the head node of Magi and a compute node of Maeka. In the cluster environment, MPITH performance is better than that of MPICH2, especially for smaller message sizes; for larger message sizes, both MPITH and MPICH2 exhibit similar levels of performance. In the grid environment, the transmission time is larger than in the cluster case. The transmission times of the direct-connect and NAT cases are essentially the same, especially for larger message sizes. This shows that MPITH's message forwarding routine performs very well.

Figure 8 shows MPI_Bcast performance in two cluster configurations and two grid configurations. Table 2 lists the four configurations used in this test.

Figure 8: MPI_Bcast performance (total transmission time in milliseconds versus message size in bytes)

Table 2: Test bed for MPI_Bcast (nodes used from the Maeka, Magi, and Gass clusters for the configurations MPICH2 6 nodes, MPITH 6 nodes, MPITH/grid #1, and MPITH/grid #2)

For small message sizes, the total transmission time depends mainly on latency arising from the Linux buffering policy: when a message fits in the Linux kernel buffer, Linux waits until the buffer is full, whereas a message larger than the buffer is flushed immediately. Figure 8 shows that MPITH is slightly better than MPICH2 in the cluster environment. In the grid environment, MPITH uses the LPBF [17] algorithm for the MPI_Bcast operation. For small messages, the time varies greatly due to the buffer management policy; for larger messages, the two grid configurations show a constant difference in total transmission time.
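For reference, the kind of MPI_Bcast measurement behind Figure 8 can be sketched as follows (this is not the paper's benchmark code; it uses the mpi4py binding, and the message sizes are illustrative):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    size = 1
    while size <= 2 * 1024 * 1024:
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()                                    # align all ranks before timing
        t0 = MPI.Wtime()
        comm.Bcast(buf, root=0)
        elapsed = MPI.Wtime() - t0
        total = comm.reduce(elapsed, op=MPI.MAX, root=0)  # the slowest rank defines the broadcast cost
        if rank == 0:
            print(size, total * 1000.0)                   # total transmission time in milliseconds
        size *= 2

Averaging several such runs, as done in the paper, smooths out the buffering effects discussed above.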

Figure 9: Gaussian elimination on the grid environment (running time in seconds versus problem size for the MPICH2 32-node cluster, MPITH 32-node cluster, and MPITH/grid #1 through #4 configurations)

Table 3: Grid configuration in the test environment (nodes used from the Maeka, Amata, Magi, and Gass clusters for the configurations MPITH/grid #1 through MPITH/grid #4)

Figure 9 shows the running times of a Gaussian elimination application in the grid environments compared with the cluster environment. Table 3 shows the configuration of each grid environment. The results show that the application performs better in a cluster than in a grid environment. This is because of three factors. The first is the implementation of the MPITH algorithm itself: Gaussian elimination uses MPI_Scatter, but the current version of MPITH is not optimized for the grid version of this operation. Second, Gaussian elimination is a fine-grained application, so it performs better on a network with a faster interconnect. The third factor is the processor speed of the compute nodes: the nodes in the Maeka cluster are faster than those of the other clusters.
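To illustrate the communication pattern that makes this application sensitive to the interconnect (a schematic sketch, not the paper's benchmark code; it uses mpi4py, an illustrative problem size, and assumes the number of ranks divides the matrix dimension), a row-wise parallel Gaussian elimination scatters row blocks once and then broadcasts a pivot row at every elimination step:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    n = 1024                                   # illustrative problem size (assumes nprocs divides n)
    rows_per_rank = n // nprocs
    a = np.random.rand(n, n) if rank == 0 else None

    local = np.empty((rows_per_rank, n))
    comm.Scatter(a, local, root=0)             # one MPI_Scatter distributes contiguous row blocks

    pivot = np.empty(n)
    for k in range(n):                         # one broadcast per elimination step: fine-grained communication
        owner = k // rows_per_rank
        if rank == owner:
            pivot[:] = local[k % rows_per_rank]
        comm.Bcast(pivot, root=owner)
        # ... local row updates against the pivot row omitted ...

The single scatter is the operation the text notes is not yet grid-optimized in MPITH, while the per-step broadcasts explain the sensitivity to network latency.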
6. Conclusion

This paper presented a model of MPI process creation and routing based on the Hierarchical Virtual Cluster concept, together with its implementation as a component of a grid-enabled MPI runtime library. The model supports the proposed routing algorithm and ensures that tasks on a grid can always communicate, even when the tasks are located on nodes hidden behind NAT gateways or other closed cluster configurations. A master process running on the gateway node of each cluster performs the required message forwarding. For grid computing applications, this means that all nodes on a grid can be fully utilized, substantially increasing the computing power available to grid applications. Tests conducted using the MPITH experimental MPI runtime show that MPI routing imposes very little overhead and provides good performance. This approach requires neither kernel modification nor a separate routing process, both desirable features for administration and for enforcement of the grid security model.

7. References

[1] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference, Volume 1 - The MPI Core, 2nd edition (MIT Press, 1998).
[2] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, MPI: The Complete Reference, Volume 2 - The MPI-2 Extensions (MIT Press, 1998).
[3] G. Fox, D. Gannon, and M. Thomas, Special Issue: Grid Computing Environments, Concurrency and Computation: Practice and Experience, 14(13-15), 2002.
[4] T. Vorakosit and P. Uthayopas, Developing a Thin and High Performance Implementation of Message Passing Interface, Proceedings of the Sixth Annual National Symposium on Computational Science and Engineering, Nakhonsithammarat, Thailand.
[5] C. Lee and D. Talia, Grid Programming Models: Current Tools, Issues and Directions, in Grid Computing: Making the Global Infrastructure a Reality (John Wiley and Sons, Chichester, UK, 2003).
[6] G. Burns, R. Daoud, and J. Vaigl, LAM: An Open Cluster Environment for MPI, Proceedings of the Supercomputing Symposium, 1994.
[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, Parallel Computing, 22(6), 1996.
[8] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, Proceedings of the 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.
[9] R. Batchu, Y. S. Dandass, A. Skjellum, and M. Beddhu, MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware, Cluster Computing, 7, 2004.
[10] N. T. Karonis, B. Toonen, and I. Foster, MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface, Journal of Parallel and Distributed Computing, 63(5), 2003.
[11] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang, MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems, Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Atlanta, GA, 1999, 131-140.
[12] Y. Ishikawa, M. Matsuda, T. Kudoh, H. Tezuka, and S. Sekiguchi, The Design of a Latency-Aware MPI Communication Library, Proceedings of SWOPP03, 2003.
[13] R. Keller, B. Krammer, M. S. Mueller, M. M. Resch, and E. Gabriel, MPI Development Tools and Applications for the Grid, Workshop on Grid Applications and Programming Tools, held in conjunction with the GGF8 meeting, Seattle, WA, USA, June 2003.
[14] O. Aumage and G. Mercier, MPICH/MADIII: A Cluster of Clusters Enabled MPI Implementation, Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2003.
[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd edition (MIT Press, 2001).
[16] V. Varavidthaya and P. Uthayopas, ThaiGrid: Architecture and Overview, NECTEC Technical Journal, 2(9), 2000.
[17] T. Vorakosit and P. Uthayopas, Improving MPI Multicast Performance over Grid Environment using Intelligent Message Scheduling, Proceedings of the International Conference on Scientific and Engineering Computation, Singapore, 2004.
