Developing a Thin and High Performance Implementation of Message Passing Interface 1


Theewara Vorakosit and Putchong Uthayopas
Parallel Research Group, Computer and Network System Research Laboratory
Department of Computer Engineering, Faculty of Engineering, Kasetsart University,
50 Phaholyotin Rd, Chatuchak, Bangkok 900, Thailand
E-mail: g @ku.ac.th and pu@ku.ac.th

Abstract

The communication library is a critically important component in the development of parallel applications on PC clusters. MPI is currently the most widely used message passing standard. Although powerful, MPI is very complex and requires a considerable effort to learn, even though only a basic set of MPI functions is enough to develop a large class of parallel applications. There is also a need to explore aspects of message passing programming that remain inadequately studied, such as fault tolerant programming, debugging, performance optimization, and the usability of the programming environment. This paper presents work on MPITH, a compact but high performance MPI implementation. MPITH is a communication library for PC clusters that conforms to a subset of the most frequently used functions in the MPI 1.1 standard. The paper discusses the architecture, design, and implementation of MPITH, along with a comparison of its performance against MPICH and LAM on a PC cluster. The experimental results show that MPITH delivers performance comparable to both MPICH and LAM.

Introduction and Related Works

Commodity PC clusters have proven to be a viable way to provide very high computing power, and cluster systems are used in many fields such as scientific computing, high performance web serving, and high availability systems. To run compute intensive workloads on a cluster, users need parallel applications designed specifically for it. These applications usually communicate by passing messages through a communication library, so the communication library plays an important role in the development of parallel applications for clusters.

1 This research is supported in part by KURDI grant SRU and Advanced Micro Devices Far East Inc.

The performance of a communication library depends mostly on its internal algorithms, such as data buffering, the communication protocol, and the communication algorithms. Most communication related algorithms in a library involve collective operations. A communication model such as LogP [5] can be used to create a communication schedule, and the optimization of communication schedules based on LogP can be found in [8]; a rough illustration of this kind of cost estimate is sketched after Table 1. A communication library not only provides efficient communication but also offers other useful features such as data manipulation and group communication. Data manipulation helps the programmer compute values such as sums, maxima, and minima. Group communication is a mechanism for communicating among more than two processes at a time. To make parallel programs portable, several standard library interfaces have been defined, such as PVM [6, 12] and MPI [10, 11]. MPI is currently the most important standard; it defines both the syntax and the semantics of MPI functions. The most widely used communication libraries that conform to the MPI standard are MPICH [7] and LAM [3]. MPICH is developed by Argonne National Laboratory; the current version supports the MPI 1.2 standard and part of MPI 2.0 and is freely distributed under an open source license. The core of MPICH is written in C, but it also includes a C++ interface implemented by the University of Notre Dame. LAM is developed by the University of Notre Dame; it is written in C++ but provides a C interface to the programmer.

The MPI standard consists of a large and complex set of functions. For example, MPICH supports as many as 280 functions. This complexity makes MPI difficult to learn. In addition, an MPI implementation that supports all of these functions is large, complex, slow, and potentially less reliable. An important question is whether programmers of parallel applications actually need this complexity. Therefore, a study has been conducted to count how many MPI functions are used by several popular packages: PETSc [2], MPI Blacs [4], MPI Povray, the HPL benchmark, and PGAPack [9]. Table 1 and Figure 1 show the number of functions used in each package.

Table 1: MPI functions used by popular packages

Application    MPI function count
PETSc          52
MPI Blacs      38
MPI Povray     11
HPL            21
PGAPack        14
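To illustrate how a LogP style model can guide such scheduling decisions, the following self-contained sketch estimates the completion time of a simple binomial tree broadcast from the latency L, the per-message overhead o, and the number of processes P. The per-round cost model and the parameter values are assumptions made for this illustration only; the optimal schedules studied in [8] are derived with considerably more care, and the gap parameter g is ignored here because each process issues at most one send per round.

#include <cstdio>

// Estimate the completion time of a binomial-tree broadcast: in each
// round, every process that already holds the message forwards it to
// one process that does not, so the number of holders doubles.  One
// transfer is charged o (send overhead) + L (wire latency) + o
// (receive overhead).
static double binomial_broadcast_time(double L, double o, int P)
{
    double t = 0.0;
    for (int holders = 1; holders < P; holders *= 2)
        t += 2.0 * o + L;          // cost of one doubling round
    return t;
}

int main()
{
    // Hypothetical Fast Ethernet style parameters, in microseconds.
    std::printf("16-process broadcast estimate: %.1f microseconds\n",
                binomial_broadcast_time(50.0, 10.0, 16));
    return 0;
}

Comparing such estimates for different tree shapes is the kind of reasoning a LogP based schedule optimizer performs.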

Figure 1: MPI functions used by popular software packages and libraries

One can observe that the number of functions used by practical software packages and libraries is in fact very low: most parallel programs and libraries rely on only a small set of MPI functions, which indicates that most of the functionality in MPI goes unused. Instead of trying to include every function, the effort should focus on building robust implementations of the necessary functions and on exploring other directions such as fault tolerant programming and automatic load balancing; a complete parallel program that needs only a handful of calls is sketched at the end of this section. Current MPI implementations seem to carry too many functions to optimize or to explore these new features.

Fault tolerance is one of the most interesting features of a communication library that needs to be explored. As the size of a cluster increases, the probability of node failure caused by hardware, the operating system, or memory increases substantially. With current MPI implementations, if a node fails, all processes have to stop and roll back to the beginning. Thus, there is a need for a library with such a feature.

These problems motivate the development of MPITH, a communication library for clusters. MPITH is designed to conform to only the most frequently used subset of the MPI standard; the current version supports basic point-to-point and collective communication. The goal of MPITH is to explore mostly ignored but important aspects of parallel programming, such as load balancing and fault tolerant programming.

The rest of this paper is organized as follows. First, the architecture of MPITH is described, followed by a discussion of its implementation. The performance of MPITH is then compared with that of MPICH and LAM. Finally, conclusions and future work are given.
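To make the observation above concrete, the following small program is a complete parallel computation that relies only on the basic subset discussed here: initialization and finalization, rank and size queries, and blocking send and receive. It is an illustration written for this discussion, not code taken from any of the packages in Table 1.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Rank 0 collects one partial result from every other rank.
        for (int src = 1; src < size; ++src) {
            int value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
            std::printf("received %d from rank %d\n", value, src);
        }
    } else {
        int value = rank * rank;   // stands in for a per-process computation
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}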

MPITH Architecture

The MPITH architecture is illustrated in Figure 2. MPITH is divided into four layers: the communication device layer, the device handler layer, the communication engine layer, and the API layer.

Figure 2: MPITH architecture. The API layer sits on top of the communication engine layer, which uses the device handler layer to manage devices (such as TCP, UDP, and VIA) in the device layer.

Device Layer

MPITH is designed to support multiple underlying network protocols through the concept of a device. The communication device layer is responsible for interfacing with the various abstract devices. MPITH requires that each device conform to the following requirements:

- Reliable transfer between the two endpoints of a communication.
- Support for the select() system call.
- Operation in synchronous mode. In the next version, devices must be able to operate in both synchronous and asynchronous mode.
- A device may or may not buffer send/receive data.

Each device in MPITH is also separated into a server device and a client device. A server device waits for connections from client devices. When a server device accepts an incoming connection, it creates a new client device to communicate with its peer. The client device is responsible for the subsequent data transmission; it is also used to initiate communication with the server devices of other processes.
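The device interface itself is not listed in this paper, so the following C++ sketch only illustrates, with invented names, the kind of abstraction that the requirements above suggest: a reliable, blocking endpoint that exposes a descriptor usable with select(), and a server device whose accept() produces a new client device for each peer.

#include <cstddef>

// Invented interface names; a sketch of the abstract device suggested above.
class Device {
public:
    virtual ~Device() {}
    virtual int descriptor() const = 0;                      // usable with select()
    virtual int send(const void *buf, std::size_t len) = 0;  // reliable, blocking
    virtual int recv(void *buf, std::size_t len) = 0;        // reliable, blocking
};

// A server device only listens; accepting a connection yields a new
// client device that carries all further traffic with that peer.
class ServerDevice {
public:
    virtual ~ServerDevice() {}
    virtual int descriptor() const = 0;
    virtual Device *accept() = 0;
};

A concrete TCP device, for example, would implement these operations over a connected socket, while the same interface could wrap UDP (with added reliability) or VIA.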

Device Handler Layer

The device handler is responsible for runtime device management. The upper layer modules must send and receive data through this layer, except at the beginning of program execution, when upper layer code may contact the communication devices directly. This layer manages proper device opening and closing and guarantees that the engine layer can send and receive data at any time, regardless of device status. The strategy used is to keep devices open (connected) most of the time in order to avoid the latency of opening new devices. This layer also buffers incoming data in order to reduce the waiting time of the sender.

Communication Engine Layer

The communication engine layer is the part that executes all of the communication algorithms and the collective operations such as MPI_BCAST and MPI_REDUCE. This layer also manages the processes and services requests from the API layer. The engine layer is separated from the API layer so that additional features can be added in the API layer; for example, message queue logging or debugging support can be inserted in the API layer before the engine layer is called.

API Layer

The API layer implements the interface presented to the parallel application programmer. The syntax of the API functions conforms to the MPI standard. Currently, MPITH supports 13 functions, divided into four groups:

- Point-to-point communication functions: MPI_SEND and MPI_RECV.
- Collective communication functions: MPI_BCAST, MPI_REDUCE, MPI_GATHER, and MPI_SCATTER.
- Information query functions, which provide information about the runtime environment: MPI_WTIME, MPI_WTICK, and MPI_GET_PROCESSOR_NAME.
- Administrative functions, which cover process creation and data manipulation and may involve collective operations to broadcast control messages among the processes: MPI_INIT and MPI_FINALIZE.
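The following self-contained C++ sketch illustrates the layering just described; all class names are invented for this example and do not correspond to actual MPITH classes. The API layer entry point adds a feature (here, simple message logging) and then delegates to the engine layer, which in a real implementation would buffer the data and hand it to the device handler.

#include <cstdio>
#include <cstddef>

class Engine {                         // stands in for the MPITH engine layer
public:
    int send(const void *buf, std::size_t bytes, int dest) {
        (void)buf;                     // a real engine would copy into its buffer
        std::printf("engine: sending %zu bytes to rank %d\n", bytes, dest);
        return 0;                      // and pass the buffer to the device handler
    }
};

class ApiLayer {                       // stands in for the MPI_* entry points
public:
    explicit ApiLayer(Engine &e) : engine_(e) {}
    int Send(const void *buf, std::size_t bytes, int dest) {
        std::printf("log: queueing %zu bytes for rank %d\n", bytes, dest);
        return engine_.send(buf, bytes, dest);   // delegate to the engine
    }
private:
    Engine &engine_;
};

int main() {
    Engine engine;
    ApiLayer api(engine);
    int payload = 42;
    return api.Send(&payload, sizeof payload, 1);
}

Because the API layer is this thin, such features can be added or removed without touching the communication engine.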

Implementation

MPITH is developed in C++ so that object-oriented techniques can be fully utilized; the object-oriented paradigm decreases coupling and increases cohesion between modules. This section discusses some of the implementation concepts.

Process Creation

A process in MPITH runs in one of two modes: master mode or slave mode. The user starts an MPI program via the "mpirun" utility, which accepts the number of processes, a machine file, and the program name. Using this information, mpirun builds the command line arguments and executes the given program, which then runs in master mode. When a process is created in master mode, a server device is also created to listen for incoming connections from other processes. In addition, the master process spawns the slave processes via a system executor; in this version the executor is based on the fork/exec mechanism, and remote process creation is done using RSH or, when operating under the SCE environment, KSIX middleware calls. After spawning the slaves, the master waits for connection requests from them. The received slave addresses and port numbers are used to build a global process table, and the master assigns an MPI process identifier (MPID) to each slave. After receiving the connection requests from every slave, the master broadcasts the process table to its children.

A slave process starts by sending a connection request to the master; the address and port number of the master are taken from the command line arguments. After connecting, the slave sends its own address and server port number to the master. Finally, the slave receives the process table from the master and broadcasts the table to its own children.
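The following self-contained simulation only illustrates the ordering of the start-up steps described above: spawn the slaves, collect a connect request from each one, assign MPIDs, and broadcast the resulting table. All names are invented and the spawning and connections are faked with stubs; it is not the MPITH start-up code.

#include <cstdio>
#include <string>
#include <vector>

struct ProcessEntry { std::string host; int port; int mpid; };

// Stand-ins for the real fork/exec + RSH/KSIX spawn and the socket accept.
static void spawn_slaves(int n) { std::printf("spawning %d slaves\n", n); }
static ProcessEntry accept_connect_request(int i) {
    return { "node" + std::to_string(i), 40000 + i, -1 };
}
static void broadcast_table(const std::vector<ProcessEntry> &table) {
    for (const ProcessEntry &e : table)
        std::printf("table entry: mpid=%d %s:%d\n", e.mpid, e.host.c_str(), e.port);
}

int main() {
    const int nslaves = 3;
    std::vector<ProcessEntry> table;
    table.push_back({ "master", 39999, 0 });     // the master takes MPID 0
    spawn_slaves(nslaves);                       // fork/exec locally, RSH/KSIX remotely
    for (int i = 1; i <= nslaves; ++i) {
        ProcessEntry e = accept_connect_request(i);  // slave reports address + port
        e.mpid = i;                                  // master assigns the MPID
        table.push_back(e);
    }
    broadcast_table(table);                      // children then receive the full table
    return 0;
}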

Send/Receive Operation

An MPITH process always starts with two threads: the first thread executes the application program, while the second thread controls I/O. When a process wants to send data to another process, the MPI_SEND function is used. The MPI_SEND call in the API layer is passed to the engine layer, which copies the data from the user into its own buffer and then calls the device handler to send that buffer to the destination process.

On the receiving side, the I/O thread is responsible for polling the devices that are ready to read. If a server device becomes readable, the I/O thread accepts the incoming connection. If a client device becomes readable, the I/O thread reads the data from the device and adds it to an internal queue. When the application calls MPI_RECV, the engine simply passes data from the incoming queue to the application, and control returns once all of the requested data has been taken from the queue.

Results

The performance of MPITH was evaluated on a 16-node Beowulf cluster. Each node is a PC with a 1 GHz Athlon processor and 512 Mbytes of RAM, and the nodes are connected by a 100 Mbps Fast Ethernet switch. Most of the tests compare the performance of MPITH with the popular MPICH and LAM implementations.

For the first test, the send/receive time between two nodes was measured; the results are shown in Figure 3. The speed of MPITH is comparable to that of MPICH, and both are slightly slower than LAM. MPICH seems to have a slight problem with buffer handling over a certain range of message sizes, as can be seen in Figure 3.

Figure 3: Send/receive time (microseconds) versus message size (Kbytes) for MPITH, MPICH, and LAM
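Send/receive times of this kind are typically obtained with a simple ping-pong loop timed with MPI_WTIME; the sketch below, which must be run with at least two processes, is an illustration written for this discussion and not the benchmark code actually used in these experiments. The message size and repetition count are arbitrary choices.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Ping-pong between rank 0 and rank 1: rank 0 sends a message and waits
// for the echo, so half of the average round-trip time approximates the
// one-way send/receive time for that message size.
int main(int argc, char **argv)
{
    const int reps = 100;
    const int bytes = 1024;            // one example message size
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<char> buf(bytes);
    MPI_Status status;

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        std::printf("one-way time: %.1f microseconds\n",
                    elapsed / reps / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}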

Next, the performance of the broadcast, reduce, scatter, and gather operations was measured; the results are illustrated in Figures 4 through 7.

Figure 4: Broadcast time (microseconds) versus message size (Kbytes) for MPITH, MPICH, and LAM

Figure 5: Reduce time (microseconds) versus message size (Kbytes) for MPITH, MPICH, and LAM

Figure 6: Scatter time (microseconds) versus message size (Kbytes) for MPITH, MPICH, and LAM

In summary, these results show that MPITH performs comparably to MPICH. For small messages MPITH is slightly faster than MPICH, while for large messages MPICH becomes faster. This behavior is caused by the MPITH buffering policy: MPITH buffers all incoming messages regardless of their size, whereas MPICH buffers only the small messages.

Figure 7: Gather time (microseconds) versus message size (Kbytes) for MPITH, MPICH, and LAM

When MPITH is compared with LAM, LAM performs better for small message sizes, but the performance of MPITH approaches that of LAM as the message size increases. Note that in the graphs the operations are much faster for messages of about 1 Kbyte than for smaller messages. This peculiar result is caused by the implementation of the Linux TCP/IP stack.

Normally, the Linux kernel buffers incoming data and delivers it to user level only when the amount of data reaches a certain threshold. For small messages the threshold is not exceeded, so the kernel waits for more incoming data before returning to the user process, which slows down TCP transmission for those message sizes.

Finally, the running time of a parallel matrix multiplication program was measured on 15 nodes; the results are reported in Table 2.

Table 2: Matrix multiplication running time (s) for MPITH, MPICH, and LAM at several matrix sizes

It can be seen that the application-level performance of the matrix multiplication program is almost the same for all of the MPI implementations used. These results demonstrate that MPITH performance is comparable with that of MPICH and LAM.

Conclusions and Future Work

In this paper, the design and implementation of MPITH, a communication library for clusters, have been presented. MPITH complies with a subset of the MPI standard, and the goal of the project is to build a software infrastructure for exploring other aspects of MPI based parallel programming. The performance comparison with MPICH and LAM has shown very promising results: MPITH delivers performance rivaling both popular implementations. In the future, more MPI functions will be added to make it more convenient to write parallel applications, and the buffering policy will be improved. The next generation of MPITH will run on the KSIX [1] middleware, which provides features such as process management and fault tolerance; these features are essential for developing fault tolerant MPI programs.

References

[1] Angskun, T., Uthayopas, P. & Rungsawang, A. Dynamic Process Management in KSIX Cluster Middleware. In Proceedings of Euro PVM/MPI 2001, Santorini (Thera) Island, Greece, September 2001.
[2] Balay, S., McInnes, L. C., Gropp, W. D. & Smith, B. F. PETSc 2.0 Users Manual. ANL Report ANL-95/11, Argonne National Laboratory, Argonne, Ill.
[3] Burns, G., Daoud, R. & Vaigl, J. LAM: An Open Cluster Environment for MPI. Technical report, Ohio Supercomputer Center, Columbus, Ohio.
[4] Whaley, R. C. Outstanding Issues in the MPIBLACS. Available on netlib from the blacs/ directory.
[5] Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K., Santos, E., Subramonian, R. & von Eicken, T. LogP: Towards a Realistic Model for Parallel Computation. In Proceedings of the 5th Symposium on Parallel Algorithms and Architectures.
[6] Geist, G. A., Beguelin, A., Dongarra, J. J., Jiang, W., Manchek, R. & Sunderam, V. S. PVM 3 User's Guide and Reference Manual. Technical Report ORNL/TM-12187, Oak Ridge National Laboratory.
[7] Gropp, W., Lusk, E., Doss, N. & Skjellum, A. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6).
[8] Karp, R., Sahay, A. & Santos, E. Optimal Broadcast and Summation in the LogP Model. Technical Report CSD, University of California, Berkeley.
[9] Levine, D. PGAPack Parallel Genetic Algorithm Library. ANL-95/18, Argonne National Laboratory, Argonne, Ill.
[10] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard.
[11] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Draft.
[12] Zhou, H. & Geist, A. LPVM: A Step Towards Multithread PVM. Technical report, Oak Ridge National Laboratory.
