Tracing Internal Communication in MPI and MPI-I/O
Julian M. Kunkel, Yuichi Tsujita, Olga Mordvinova, Thomas Ludwig

Abstract

MPI implementations can realize MPI operations with any algorithm that fulfills the specified semantics. To provide optimal efficiency, the MPI implementation might choose the algorithm dynamically, depending on the parameters given to the function call. However, the selection is not transparent to the user. While this is appropriate for common users, achieving best performance with fixed parameter sets requires knowledge of the internal processing. Also, for developers of collective operations it might be useful to understand timing issues inside the communication or I/O call. In this paper we extend the PIOviz environment to trace MPI-internal communication. This allows the user to see PVFS server behavior together with the behavior of the MPI application and of MPI itself. We present analysis results demonstrating these capabilities for MPICH2 on a Beowulf cluster.

1 Introduction

The Message Passing Interface (MPI) [1] is state-of-the-art in programming distributed memory architectures. This interface offers the programmer an abstraction of the underlying communication infrastructure. The MPI specification defines a wide range of operations with their corresponding semantics. Collective operations, for instance, allow a group of processes to exchange information among themselves. While there are many possible implementations that provide the semantics as defined in MPI, a fast execution on the given hardware is favorable. Therefore, the MPI definition allows vendors of a supercomputer to tune processing by adapting specific algorithms to their architecture. Depending on the operation definition and the parameters given by the participating processes, the implementation might choose the appropriate algorithm dynamically to provide the best performance on the architecture. Internally, the selected algorithm can induce complex communication patterns.
Ruprecht-Karls-Universität Heidelberg, INF 348, Heidelberg, Germany; Kinki University, 1 Umenobe, Takaya, Higashi-Hiroshima, Hiroshima, Japan; Universität Hamburg, c/o DKRZ, Bundesstrasse 55, Hamburg, Germany
As an example, broadcasting data among all processes could be done by sending the data sequentially from a root process to all peer processes. While this might be acceptable for small messages and a small number of processes, the single sender is likely to become the bottleneck of the communication. With a switched interconnect, the information exchange can instead be performed in a binary-tree fashion to address this issue: the root process sends to another process, and then each of them acts as a root for half of the processes.

An extension to the MPI standard offers routines for input/output operations. Similar to the communication definitions, MPI-I/O provides independent and collective functions. Collective I/O operations are a candidate for optimization, because I/O is typically at least an order of magnitude slower than communication. Optimization for hard-disk-based subsystems could be done, for example, by bundling small non-contiguous I/O operations together into large contiguous accesses. Depending on the underlying file system, the interplay between MPI and the file system can be very simple, or, in the case of a parallel file system, quite extensive.

In general the user is interested in reducing the runtime of his application, not in MPI-internal processing. However, in some cases such internal information might be useful, on the one hand for tuning an application, on the other hand for optimizing the MPI library itself. For instance, by knowing the algorithm used for a particular communication call, the programmer might know in advance which processes need more time to execute the call. The user could then realize static load balancing by assigning less work to these processes. While in most cases it is not favorable to tune the application to the MPI implementation, it might be necessary to get maximum performance for a well-known parameter set. By better understanding the interaction between MPI and MPI-I/O, the provided I/O infrastructure could be optimized.
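The binary-tree broadcast mentioned above can be written down as a communication schedule. The following sketch is an illustration of the principle, not MPICH2's actual code; it computes which process sends to which in each round:

```python
def bcast_schedule(nprocs, root=0):
    """Binary-tree broadcast: in every round, each process that already
    holds the data sends it to one process that does not yet have it."""
    have = {root}
    rounds = []
    while len(have) < nprocs:
        need = [r for r in range(nprocs) if r not in have]
        sends = list(zip(sorted(have), need))  # pair holders with receivers
        rounds.append(sends)
        have.update(dst for _, dst in sends)
    return rounds

# With 8 processes the data reaches everyone after 3 rounds,
# instead of 7 sequential sends from a single root.
for i, sends in enumerate(bcast_schedule(8)):
    print("round", i, sends)
```

The number of rounds grows only logarithmically with the number of processes, which is why the single-sender bottleneck disappears on a switched network.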
In the following we present an approach to visualize internal communication in MPI and the interaction between MPI and a parallel file system. This paper is structured as follows: In section 2 the state-of-the-art in visualizing MPI internals and parallel file systems is presented. Then we briefly discuss the modifications that enable us to visualize MPI processing in section 3. In section 4 we evaluate our work by presenting results obtained for collective communication and for MPI-I/O interactions.

2 State-of-the-art and Related Work

Optimization of collective operations in MPI is a hot research topic and is important for implementing HPC applications on various HPC systems [2, 3, 4]. Visualization of application behavior assists in optimizing MPI programs. Two analysis approaches are used for this purpose: on-line analysis tools and tracing tools for postmortem analysis. On-line tools, e.g. Paradyn [5], use a monitoring system or instrumentation to gather data from the running application. This data is immediately available for several purposes, for example to display run-time performance, and it is typically not stored for further analysis. Integration of monitoring tools is state-of-the-art
in parallel file systems. In GPFS, monitoring of client I/O performance is possible with a separate command-line tool (mmpmon) [6]. Lustre offers monitoring via the /proc interface (LustreProc [7]). [8] presents performance data visualization in Lustre based on debugging output and the /proc interface with the help of Ganglia. The parallel file system PVFS [9] embeds a performance monitor in the server process, which counts the number of metadata and I/O operations and the amount of data accessed. Its data can be fetched directly from the server process with the command-line tool pvfs2-perf-mon-example.

In contrast, off-line tools are typically applied after program completion. To this end, the monitoring system and instrumentation write event traces or profiles to files that are available afterwards. Representatives of this group are Intel Trace Analyzer & Collector [10], Paraver [11], Jumpshot/MPE [12] and Vampir/VampirTrace. Jumpshot/MPE works in cooperation with the MPICH2 library. It supports analyzing MPI functions for data communication and parallel I/O by tracing every MPI call. Vampir visualizes MPI calls and records performance data according to the Open Trace Format [13]. Although VampirTrace supports tracing of internal operations such as remote memory accesses and I/O performance, it supports them only on the client side [14]. Another framework, SCALASCA [15], supports runtime summarization of measurements during execution and event trace collection for postmortem trace analysis. TAU [16] provides a robust, flexible, and portable approach for tracing and visualization of applications, including CPU-internal hardware counters. However, these tools do not support tracing activities of parallel file systems such as PVFS in conjunction with MPI-I/O calls. PIOviz [17, 18] is a trace-based environment which traces MPI calls and PVFS server internals, such as network communication and I/O subsystem activity, in conjunction with MPI-I/O calls.
In addition, it also collects statistics of CPU usage and PVFS-internal statistics [18]. As PIOviz uses MPICH's SLOG2 format, a user can analyze the trace information with Jumpshot. In comparison to the tools described above, PIOviz combines explicit tracing and visualization of the I/O system's behavior with a correlation of program events and induced system events. The latest PIOviz version described here enables tracing inside collective operations on the server side as well as on the client side. To allow this, ROMIO [19] and MPICH2 were extended as shown in the following section.

3 Tracing MPI Internals

The modifications to enable tracing of MPI internals are split into two parts. First, instrumentation of MPICH2 is required to trace communication inside collective operations. Second, ROMIO is extended to allow tracing of PVFS calls and to show the MPI functions used inside MPI-I/O. As a consequence, the linked MPI program will depend on the tracing library from MPE. Note that PIOviz uses version 1.0.5p4 of MPICH2 and version of PVFS. In MPICH2 each collective function calls only a set of internal functions to perform blocking or non-blocking point-to-point communication, that is,
either (i)send, (i)receive, or sendreceive operations. These functions are bundled in a single file and can easily be instrumented. Once modified, the processing inside the collective operation becomes visible. PIOviz uses a PMPI wrapper provided by MPE to trace the function calls. Once a user executes an MPI program, trace files are created on the client and server sides.

With the normal build procedure, ROMIO changes MPI calls to their PMPI counterparts; therefore, collective function calls inside ROMIO are hidden. However, the developers provide a preprocessor macro to remove this redirection. Furthermore, we put MPE instrumentation around interesting parts of ROMIO, for instance around PVFS calls. While the latter piece of instrumentation is similar to the one shipped with ROMIO, the modified instrumentation works together with the existing PIOviz environment. Consequently, the modifications allow us to visualize I/O calls, MPI calls made inside these I/O calls, internal communication in MPI calls, and the corresponding operations in the PVFS servers.

4 Evaluation

Two PC clusters consisting of 9 nodes each were used to evaluate PIOviz. Both clusters use COTS components and Gigabit Ethernet for interconnection. The PVS cluster is an older 32-bit Ubuntu 8.04 cluster; in contrast, the Kindai cluster uses 64-bit CentOS 4.4. We decided to perform experiments with PIOviz on both clusters to rule out the influence of software versions and hardware. It turned out that in some cases the observed behavior differs. For a qualitative evaluation precise hardware details are unnecessary and thus omitted. In section 4.1 we analyze some collective operations performed on the PVS cluster. The provided results are comparable to the ones measured on the Kindai cluster; however, the computational part during the collective calls can be observed better on the PVS cluster.
The I/O-intensive HPIO benchmark [20] is run on the Kindai cluster in section 4.2 to assess a performance degradation we found on this cluster. This anomaly does not manifest on the PVS cluster.

4.1 Collective Communication

In the following, several collective MPI operations are assessed with PIOviz. A more detailed analysis of MPI_Allreduce will show potential for improvement in MPICH2. The internal processing in MPI_Scatter and MPI_Gather is briefly discussed and might be of interest from the application programmer's perspective.

In the first experiment a single MPI_Allreduce call is performed to sum an array of 10 million double values (80 MByte of data). Figure 1 shows the average, minimum and maximum time of the operation over 10 runs of the program. It is noticeable that three processes need more time than four; similar behavior can be observed by comparing the execution times for 5, 6, 7 and 8 processes. One might expect that a collective allreduce with a lower number of processes takes at most as much time as with more processes. The reason is that MPICH2 uses a normal binary tree for process numbers equal to a power of 2 (see [2] for details). If the number of processes in the communicator is not a power of 2, then the algorithm first exchanges data between processes to merge the additional processes, so that the original binary-tree algorithm can be applied. That matches the observations: two processes take about one second, four processes two seconds and 8 processes three seconds. Screenshots of the internal communication for three and four processes are given in figures 2 and 3. These screenshots show the internal activities of each process in a separate (time) line. However, not all optimizations mentioned in [2] are incorporated in MPICH2, thus the performance is suboptimal. Compare the observable behavior for 13 processes (figure 4) with the schema provided in [2].

Next we briefly look at MPI_Scatter and MPI_Gather. For a configuration with 9 clients and 8 MByte of data, screenshots of the internal processing are provided in figures 5 and 6. Looking at the internal processing, one might ask why both algorithms work similarly to a binary tree. It suggests that the root process sends data for multiple processes to another process. However, in contrast to broadcast, each process gets individual data; therefore, the intermediate node just forwards the data. Also, the nodes forwarding data are blocked while sending and receiving the additional data. There are cases in which the forwarding might be useful, for instance for small messages where the setup takes longer than sending the message. In this case work could be shared by forwarding messages from the root node. However, in general the algorithm should be optimized for larger messages. From the user's perspective the gather algorithm provides potential for static load balancing (in case it is called frequently).
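The tree structure also shows where this slack comes from. The following sketch of a binomial-tree gather schedule (our illustrative model, not the exact MPICH2 algorithm) counts how many messages each rank receives and forwards on the way to root 0:

```python
def gather_recv_counts(nprocs):
    """For a binomial-tree gather to root 0, count how many messages each
    rank receives (and must forward upward) before it is done."""
    recv = [0] * nprocs
    step = 1
    while step < nprocs:
        for r in range(0, nprocs, 2 * step):
            src = r + step
            if src < nprocs:
                recv[r] += 1   # r receives the accumulated data of src
        step *= 2
    return recv

# With 9 processes, odd ranks receive nothing: they send once and are idle.
print(gather_recv_counts(9))
```

In this model the odd ranks are idle after their single send, which is exactly the slack that static load balancing could exploit by assigning them more computation.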
Of course, the parameters fed into gather must be known in advance; then the internal processing and dependencies are known. In the example in figure 6 the user could assign more work to the odd processes. On the PVS cluster the computation takes about 0.5 seconds. The application may run frequently with a specific parameter set and not perform dynamic load balancing. However, figuring out the internal dependencies of the algorithm between the processes is hard without visualization.

Figure 1: Time for allreduce with a variable number of processes
Figure 2: Allreduce for three processes
Figure 3: Allreduce for four processes
Figure 4: Allreduce for 13 processes
Figure 5: Scatter for 9 clients and 8 MByte of data
Figure 6: Gather for 9 clients and 8 MByte of data

4.2 Tracing of MPI-I/O

To examine internal collective communication in collective MPI-I/O operations, the MPI-I/O HPIO benchmark was used. It supports contiguous and non-contiguous data access patterns in both collective and independent operations. For non-contiguous data accesses, derived datatypes are created from an ensemble of region size, region count, and region space, where a region stands for a data area. Figure 7 illustrates an example of a data pattern for two processes. In this figure, we assume that the data is stored contiguously in memory and non-contiguously in the data file. The gaps between data regions are specified by the region space of the benchmark. According to a file view created by a derived datatype, each client process accesses the data file as shown in this figure.

Figure 7: Example of a derived data type in the HPIO benchmark

In collective I/O with derived datatypes, two-phase I/O is used in ROMIO [21] to make the accessed data regions as contiguous as possible. In this paper, we ran collective write operations. An example PIOviz screenshot of two-phase I/O with four client processes is shown in figure 8 with text explanations. This figure shows the internal MPI communication and PVFS I/O calls with the help of the modified PIOviz. Before starting I/O operations, the client processes exchange information about offsets and data lengths to calculate the sizes of the associated memory and file domains. After this operation, the client processes read data from their assigned file domains using PVFS_sys_read and copy it into their collective buffers. Then, the data to be written is exchanged among them using non-blocking MPI calls (MPI_Isend and MPI_Irecv) and written into the buffers according to the file view described by the derived datatype. Finally, the modified data in the buffers is written back using PVFS_sys_write. If the collective buffer size is not sufficient to hold all of the assigned data, this sequence is repeated until all data has been processed.
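The first step of this two-phase scheme, partitioning the aggregate access range into one contiguous file domain per process, can be sketched as follows. This is a simplified, even-split illustration; the function name and the split policy are ours, not ROMIO's exact code:

```python
def file_domains(start, end, nprocs):
    """Split the aggregate byte range [start, end) into one contiguous
    file domain per client process, as in the first phase of two-phase I/O."""
    total = end - start
    base, rem = divmod(total, nprocs)
    domains, offset = [], start
    for rank in range(nprocs):
        size = base + (1 if rank < rem else 0)  # spread the remainder
        domains.append((offset, offset + size))
        offset += size
    return domains

# Four clients covering bytes 0..99 each get a contiguous quarter; data
# destined for other domains is exchanged (MPI_Isend/MPI_Irecv) before writing.
print(file_domains(0, 100, 4))
```

Each process then issues a single contiguous read and write on its own domain, which is what turns many small non-contiguous accesses into few large ones.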
Figure 8: Screenshot of two-phase I/O in MPI_File_write_all

In this evaluation, we could see the effectiveness of client-side tracing in conjunction with tracing PVFS server internals. Figures 9 (a) and (b) show screenshots of a typical inefficient access pattern obtained by the previous and the current PIOviz, respectively, with some text explanations. The upper four time lines stand for the client processes from rank 0 to 3, and the lower five time lines consist of one metadata server and the data servers for PVFS, in downward order. Note that every PVFS server was waiting for requests from the clients for a long time after the first pair of read and write operations on the PVFS file system. The previous PIOviz does not show any internal MPI communication or PVFS I/O calls, while the current PIOviz does. With the previous PIOviz, we cannot determine where the delay occurs. The current release, on the other hand, helps to pinpoint problematic spots. In this case, we can see that one of the client processes takes a long time in PVFS_sys_read. As there are no PVFS operations on the PVFS servers while the client process of rank 1 is issuing PVFS_sys_read, there might be inefficient operations between the MPI and PVFS layers on the client process.

5 Conclusions and Future Work

This paper showed that insight into MPI and the attached file systems assists in identifying inefficient processing. This knowledge could be used by developers to tune the internal layers. Also, knowledge of the internal processing and dependencies allows the application programmer to optimize the application towards the MPI implementation. While in general application users are discouraged from the latter optimization, for fixed parameter sets it improves insight for static load balancing. For complicated MPI-I/O accesses, the internal behavior shown in Jumpshot screenshots reveals bottlenecks in internal operations. The evaluated examples of several MPI calls indicate room for optimization of MPICH2 in cluster environments.
However, the modifications made are experimental and not yet suitable for every application. In the future we will improve the stability of the environment and update it to the current versions of PVFS and MPICH.
Figure 9: Screenshots of inefficient I/O patterns: (a) obtained by the previous PIOviz, (b) obtained by the current PIOviz

References

[1] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.1, June.
[2] R. Thakur, R. Rabenseifner, and W. Gropp, Optimization of collective communication operations in MPICH, The International Journal of High Performance Computing Applications, vol. 19, Spring.
[3] M. Kühnemann, T. Rauber, and G. Rünger, Optimizing MPI collective communication by orthogonal structures, Cluster Computing, vol. 9, no. 3.
[4] M. K. Velamati, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, R. Sharma, S. Kapoor, and A. Srinivasan, Optimization of collective communication in intra-cell MPI, in HiPC.
[5] Paradyn Parallel Performance Tools.
[6] IBM, General Parallel File System - Advanced Administration Guide V
[7] Sun Microsystems Inc., Lustre 1.6 Manual.
[8] Sun Microsystems Inc., Profiling tools for IO. index.php?title=profiling Tools for IO.
[9] W. Ligon and R. Ross, PVFS: Parallel Virtual File System, in Beowulf Cluster Computing with Linux (T. Sterling, ed.), Scientific and Engineering Computation, ch. 17, Cambridge, Massachusetts: The MIT Press, Nov.
[10] Intel Trace Analyzer & Collector. products/asmo-na/eng/cluster/tanalyzer/
[11] J. Labarta, J. Giménez, E. Martínez, P. González, H. Servat, G. Llort, and X. Aguilar, Scalability of visualization and tracing tools, in Proc. of ParCo 2005, vol. 33 of NIC Series, John von Neumann Institute for Computing, Jülich.
[12] A. Chan, W. Gropp, and E. Lusk, An efficient format for nearly constant-time access to arbitrary time intervals in large trace files, Scientific Programming, vol. 16, no. 2-3.
[13] A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, Introducing the Open Trace Format (OTF), LNCS, Springer.
[14] H. Mickler, A. Knüpfer, M. Kluge, M. S. Müller, and W. E. Nagel, Trace-Based Analysis and Optimization for the Semtex CFD Application: Hidden Remote Memory Accesses and I/O Performance.
[15] Z. Szebenyi, B. J. N. Wylie, and F. Wolf, SCALASCA parallel performance analyses of SPEC MPI2007 applications, LNCS, Springer.
[16] S. S. Shende and A. D. Malony, The TAU parallel performance system, The International Journal of High Performance Computing Applications, vol. 20, Summer.
[17] T. Ludwig, S. Krempel, M. Kuhn, J. M. Kunkel, and C. Lohse, Analysis of the MPI-IO optimization levels with the PIOViz Jumpshot enhancement, LNCS, Springer.
[18] J. M. Kunkel and T. Ludwig, Bottleneck detection in parallel file systems with trace-based performance monitoring, LNCS, Springer.
[19] R. Thakur, E. Lusk, and W. Gropp, Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation, Technical Memorandum ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, USA.
[20] A. Ching, A. Choudhary, W.-k. Liao, L. Ward, and N. Pundit, Evaluating I/O characteristics and methods for storing structured scientific data, in 20th IEEE International Parallel and Distributed Processing Symposium, IEEE Computer Society, April.
[21] R. Thakur, W. Gropp, and E. Lusk, Optimizing noncontiguous accesses in MPI-IO, Parallel Computing, vol. 28, no. 1.
More informationBuilding Library Components That Can Use Any MPI Implementation
Building Library Components That Can Use Any MPI Implementation William Gropp Mathematics and Computer Science Division Argonne National Laboratory Argonne, IL gropp@mcs.anl.gov http://www.mcs.anl.gov/~gropp
More informationAn In-place Algorithm for Irregular All-to-All Communication with Limited Memory
An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de
More informationParallel Programming
Parallel Programming for Multicore and Cluster Systems von Thomas Rauber, Gudula Rünger 1. Auflage Parallel Programming Rauber / Rünger schnell und portofrei erhältlich bei beck-shop.de DIE FACHBUCHHANDLUNG
More informationGroup Management Schemes for Implementing MPI Collective Communication over IP Multicast
Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,
More informationThe Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
More informationIntroduction to Parallel Computing
Institute for Advanced Simulation Introduction to Parallel Computing Bernd Mohr published in Multiscale Simulation Methods in Molecular Sciences, J. Grotendorst, N. Attig, S. Blügel, D. Marx (Eds.), Institute
More informationMPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh
MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance
More informationHDF5 I/O Performance. HDF and HDF-EOS Workshop VI December 5, 2002
HDF5 I/O Performance HDF and HDF-EOS Workshop VI December 5, 2002 1 Goal of this talk Give an overview of the HDF5 Library tuning knobs for sequential and parallel performance 2 Challenging task HDF5 Library
More informationImplementing Byte-Range Locks Using MPI One-Sided Communication
Implementing Byte-Range Locks Using MPI One-Sided Communication Rajeev Thakur, Robert Ross, and Robert Latham Mathematics and Computer Science Division Argonne National Laboratory Argonne, IL 60439, USA
More informationFrom Cluster Monitoring to Grid Monitoring Based on GRM *
From Cluster Monitoring to Grid Monitoring Based on GRM * Zoltán Balaton, Péter Kacsuk, Norbert Podhorszki and Ferenc Vajda MTA SZTAKI H-1518 Budapest, P.O.Box 63. Hungary {balaton, kacsuk, pnorbert, vajda}@sztaki.hu
More informationRevealing Applications Access Pattern in Collective I/O for Cache Management
Revealing Applications Access Pattern in for Yin Lu 1, Yong Chen 1, Rob Latham 2 and Yu Zhuang 1 Presented by Philip Roth 3 1 Department of Computer Science Texas Tech University 2 Mathematics and Computer
More informationInteractive Analysis of Large Distributed Systems with Scalable Topology-based Visualization
Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Lucas M. Schnorr, Arnaud Legrand, and Jean-Marc Vincent e-mail : Firstname.Lastname@imag.fr Laboratoire d Informatique
More informationI/O Analysis and Optimization for an AMR Cosmology Application
I/O Analysis and Optimization for an AMR Cosmology Application Jianwei Li Wei-keng Liao Alok Choudhary Valerie Taylor ECE Department, Northwestern University {jianwei, wkliao, choudhar, taylor}@ece.northwestern.edu
More informationTowards a Portable Cluster Computing Environment Supporting Single System Image
Towards a Portable Cluster Computing Environment Supporting Single System Image Tatsuya Asazu y Bernady O. Apduhan z Itsujiro Arita z Department of Artificial Intelligence Kyushu Institute of Technology
More informationIntegrating Parallel Application Development with Performance Analysis in Periscope
Technische Universität München Integrating Parallel Application Development with Performance Analysis in Periscope V. Petkov, M. Gerndt Technische Universität München 19 April 2010 Atlanta, GA, USA Motivation
More informationExploiting Shared Memory to Improve Parallel I/O Performance
Exploiting Shared Memory to Improve Parallel I/O Performance Andrew B. Hastings 1 and Alok Choudhary 2 1 Sun Microsystems, Inc. andrew.hastings@sun.com 2 Northwestern University choudhar@ece.northwestern.edu
More informationParallel Programming with MPI on Clusters
Parallel Programming with MPI on Clusters Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory (The rest of our group: Bill Gropp, Rob Ross, David Ashton, Brian Toonen, Anthony
More informationParallel I/O Libraries and Techniques
Parallel I/O Libraries and Techniques Mark Howison User Services & Support I/O for scientifc data I/O is commonly used by scientific applications to: Store numerical output from simulations Load initial
More informationOptimization of non-contiguous MPI-I/O operations
Optimization of non-contiguous MPI-I/O operations Enno Zickler Arbeitsbereich Wissenschaftliches Rechnen Fachbereich Informatik Fakultät für Mathematik, Informatik und Naturwissenschaften Universität Hamburg
More informationImproving the Scalability of Performance Evaluation Tools
Improving the Scalability of Performance Evaluation Tools Sameer Suresh Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory Department of Computer and Information Science University
More informationBenefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous IO
Benefits of Quadrics Scatter/Gather to PVFS2 Noncontiguous IO Weikuan Yu Dhabaleswar K. Panda Network-Based Computing Lab Dept. of Computer Science & Engineering The Ohio State University {yuw,panda}@cse.ohio-state.edu
More informationOptimization of Collective Communication in Intra- Cell MPI
Optimization of Collective Communication in Intra- Cell MPI M. K. Velamati 1, A. Kumar 1, N. Jayam 1, G. Senthilkumar 1, P.K. Baruah 1, R. Sharma 1, S. Kapoor 2, and A. Srinivasan 3 1 Dept. of Mathematics
More informationEvaluating I/O Characteristics and Methods for Storing Structured Scientific Data
Evaluating I/O Characteristics and Methods for Storing Structured Scientific Data Avery Ching 1, Alok Choudhary 1, Wei-keng Liao 1,LeeWard, and Neil Pundit 1 Northwestern University Sandia National Laboratories
More informationEvent-based Measurement and Analysis of One-sided Communication
Event-based Measurement and Analysis of One-sided Communication Marc-André Hermanns 1, Bernd Mohr 1, and Felix Wolf 2 1 Forschungszentrum Jülich, Zentralinstitut für Angewandte Mathematik, 52425 Jülich,
More informationAutomated Tracing of I/O Stack
Automated Tracing of I/O Stack Seong Jo Kim 1, Yuanrui Zhang 1, Seung Woo Son 2, Ramya Prabhakar 1, Mahmut Kandemir 1, Christina Patrick 1, Wei-keng Liao 3, and Alok Choudhary 3 1 Department of Computer
More informationIME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning
IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application
More informationMeta-data Management System for High-Performance Large-Scale Scientific Data Access
Meta-data Management System for High-Performance Large-Scale Scientific Data Access Wei-keng Liao, Xaiohui Shen, and Alok Choudhary Department of Electrical and Computer Engineering Northwestern University
More information[Scalasca] Tool Integrations
Mitglied der Helmholtz-Gemeinschaft [Scalasca] Tool Integrations Aug 2011 Bernd Mohr CScADS Performance Tools Workshop Lake Tahoe Contents Current integration of various direct measurement tools Paraver
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More informationIteration Based Collective I/O Strategy for Parallel I/O Systems
Iteration Based Collective I/O Strategy for Parallel I/O Systems Zhixiang Wang, Xuanhua Shi, Hai Jin, Song Wu Services Computing Technology and System Lab Cluster and Grid Computing Lab Huazhong University
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationMPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters
MPIBlib: Benchmarking MPI Communications for Parallel Computing on Homogeneous and Heterogeneous Clusters Alexey Lastovetsky Vladimir Rychkov Maureen O Flynn {Alexey.Lastovetsky, Vladimir.Rychkov, Maureen.OFlynn}@ucd.ie
More informationDesign and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications
Design and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications Wei-keng Liao Alok Choudhary ECE Department Northwestern University Evanston, IL Donald Weiner Pramod Varshney EECS Department
More informationEarly Experiments with the OpenMP/MPI Hybrid Programming Model
Early Experiments with the OpenMP/MPI Hybrid Programming Model Ewing Lusk 1 and Anthony Chan 2 1 Mathematics and Computer Science Division Argonne National Laboratory 2 ASCI FLASH Center University of
More informationOptimizing Assignment of Threads to SPEs on the Cell BE Processor
Optimizing Assignment of Threads to SPEs on the Cell BE Processor T. Nagaraju P.K. Baruah Ashok Srinivasan Abstract The Cell is a heterogeneous multicore processor that has attracted much attention in
More informationAccelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures
Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering
More informationEnabling Active Storage on Parallel I/O Software Stacks. Seung Woo Son Mathematics and Computer Science Division
Enabling Active Storage on Parallel I/O Software Stacks Seung Woo Son sson@mcs.anl.gov Mathematics and Computer Science Division MSST 2010, Incline Village, NV May 7, 2010 Performing analysis on large
More informationAuto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt
More informationDistribution of Periscope Analysis Agents on ALTIX 4700
John von Neumann Institute for Computing Distribution of Periscope Analysis Agents on ALTIX 4700 Michael Gerndt, Sebastian Strohhäcker published in Parallel Computing: Architectures, Algorithms and Applications,
More informationMulticast can be implemented here
MPI Collective Operations over IP Multicast? Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon Computer Science and Computer Engineering University of Arkansas Fayetteville, Arkansas, U.S.A fhachen,yochoa,aapong@comp.uark.edu
More informationA First Implementation of Parallel IO in Chapel for Block Data Distribution 1
A First Implementation of Parallel IO in Chapel for Block Data Distribution 1 Rafael LARROSA a, Rafael ASENJO a Angeles NAVARRO a and Bradford L. CHAMBERLAIN b a Dept. of Compt. Architect. Univ. of Malaga,
More informationImplementing MPI-IO Shared File Pointers without File System Support
Implementing MPI-IO Shared File Pointers without File System Support Robert Latham, Robert Ross, Rajeev Thakur, Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,
More informationRAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System
RAIDIX Data Storage Solution Clustered Data Storage Based on the RAIDIX Software and GPFS File System 2017 Contents Synopsis... 2 Introduction... 3 Challenges and the Solution... 4 Solution Architecture...
More informationHPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:
HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com
More informationOrthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters
Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters Xuechen Zhang 1 Jianqiang Ou 2 Kei Davis 3 Song Jiang 2 1 Georgia Institute of Technology, 2 Wayne State University,
More informationParallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein
Parallel & Cluster Computing cs 6260 professor: elise de doncker by: lina hussein 1 Topics Covered : Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationAnalyzing the High Performance Parallel I/O on LRZ HPC systems. Sandra Méndez. HPC Group, LRZ. June 23, 2016
Analyzing the High Performance Parallel I/O on LRZ HPC systems Sandra Méndez. HPC Group, LRZ. June 23, 2016 Outline SuperMUC supercomputer User Projects Monitoring Tool I/O Software Stack I/O Analysis
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationEarly Experiences with KTAU on the IBM BG/L
Early Experiences with KTAU on the IBM BG/L Aroon Nataraj, Allen D. Malony, Alan Morris, and Sameer Shende Performance Research Laboratory, Department of Computer and Information Science University of
More informationA Visual Network Analysis Method for Large Scale Parallel I/O Systems
A Visual Network Analysis Method for Large Scale Parallel I/O Systems Carmen Sigovan, Chris Muelder, Kwan-Liu Ma University of California Davis {cmsigovan, cwmuelder, klma}@ucdavis.edu Jason Cope, Kamil
More informationA Buffered-Mode MPI Implementation for the Cell BE Processor
A Buffered-Mode MPI Implementation for the Cell BE Processor Arun Kumar 1, Ganapathy Senthilkumar 1, Murali Krishna 1, Naresh Jayam 1, Pallav K Baruah 1, Raghunath Sharma 1, Ashok Srinivasan 2, Shakti
More informationAdvanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS)
Advanced Data Placement via Ad-hoc File Systems at Extreme Scales (ADA-FS) Understanding I/O Performance Behavior (UIOP) 2017 Sebastian Oeste, Mehmet Soysal, Marc-André Vef, Michael Kluge, Wolfgang E.
More informationFakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur. BenchIT. Project Overview
Fakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur BenchIT Project Overview Nöthnitzer Straße 46 Raum INF 1041 Tel. +49 351-463 - 38458 (stefan.pflueger@tu-dresden.de)
More informationEfficiency Evaluation of the Input/Output System on Computer Clusters
Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de
More informationOnline Remote Trace Analysis of Parallel Applications on High-Performance Clusters
Online Remote Trace Analysis of Parallel Applications on High-Performance Clusters Holger Brunst, Allen D. Malony, Sameer S. Shende, and Robert Bell Department for Computer and Information Science University
More informationHigh Performance MPI-2 One-Sided Communication over InfiniBand
High Performance MPI-2 One-Sided Communication over InfiniBand Weihang Jiang Jiuxing Liu Hyun-Wook Jin Dhabaleswar K. Panda William Gropp Rajeev Thakur Computer and Information Science The Ohio State University
More information