Hint Controlled Distribution with Parallel File Systems

Hipolito Vasquez Lucas and Thomas Ludwig

Parallele und Verteilte Systeme, Institut für Informatik, Ruprecht-Karls-Universität Heidelberg, 69120 Heidelberg, Germany
{hipolito.vasquez,

Abstract. The performance of scientific parallel programs with high file I/O activity running on top of cluster computers strongly depends on the qualitative and quantitative characteristics of the requested I/O accesses. It also depends on the corresponding mechanisms and policies being used at the parallel file system level. This paper presents the motivation and design of a set of MPI-IO hints. These hints are used to select the distribution function with which a parallel file system manipulates an opened file. The implementation of a new physical distribution function called varstrip_dist is also presented in this article. This function is proposed based upon spatial characteristics presented by I/O access patterns observed at the application level.

1 Introduction

Hard disks offer a cost-effective solution for secondary storage, but mainly due to mechanical reasons their access time has not kept pace with the speed development of processors. Disk and microprocessor performance have evolved at different rates [1]. This difference of development at the hardware level is one of the main causes of the so-called I/O bottleneck problem [3] in disk-based computing systems. The performance of I/O intensive scientific applications, which convey huge amounts of data between primary and secondary storage, suffers heavily due to this bottleneck. The performance of such an application depends on the I/O subsystem architecture and on the corresponding usage of it, which is inherent to the application's nature.

In order to design computing systems with the cost-effective advantages of hard disks and at the same time favor the I/O intensive scientific applications that run on top of such systems, the parallel I/O approach [4] has been adopted.
This consists in arranging a set of disks over which files are striped or declustered [2]. By applying this mechanism, applications take advantage of the resulting aggregated throughput. A Beowulf cluster computer [5] in which many nodes have their own hard disk inherently constitutes an appropriate hardware testbed for supporting parallel I/O. In order to make this hardware-level parallelism visible to the applications, corresponding parallel I/O operations must be supported at the file system and middleware levels. Two implementations which fulfill these tasks are the PVFS2 [6] parallel file system and the ROMIO [7] library, an implementation of the MPI-IO part of MPI-2 [16].

B. Di Martino et al. (Eds.): EuroPVM/MPI 2005, LNCS 3666, 2005. © Springer-Verlag Berlin Heidelberg 2005

ROMIO accepts so-called hints that are communicated via the info argument of the functions MPI_File_open, MPI_File_set_view, and MPI_File_set_info. Their purpose is mainly to communicate information which may improve the performance of the I/O subsystem. A hint is represented by a key-value pair, mainly concerning parameters for striping, collective I/O, and access patterns.

In this work we propose a set of hints, which we call distribution hints. This set gives the user the opportunity to choose the type of physical distribution function [2] to be applied by the PVFS2 parallel file system for the manipulation of an opened file. After choosing the type of distribution function, the user can set its corresponding parameters. Assigning a value to the strip size, for example, requires information on the type of distribution function to which this parameter belongs.

To augment the set of distribution functions which can be manipulated via distribution hints, we also propose the new varstrip_dist distribution for PVFS2. We propose this distribution function taking into consideration the characteristics of spatial I/O access patterns generated by scientific parallel applications. Through the usage of the varstrip distribution, programmers can control the throughput or the load balancing degree in a PVFS2-ROMIO-based I/O subsystem, thus influencing the performance of their MPI-IO-based application.

2 Parallel I/O Access Patterns

2.1 Introduction

Our objective in this section is to present an abstract set of spatial I/O access patterns at the application level and their parameters. These patterns represent the assignation of storage areas of a logical file to monothreaded processes of a parallel program.
This assignation is known as logical file partitioning [8]. The logical file can be interpreted as a one-dimensional array of data blocks, whose smallest granularity is one byte. We use this set of patterns as a reference model in order to propose distribution functions for the PVFS2 parallel file system. We have summarized these characteristics based upon studies of I/O intensive parallel scientific applications running mainly on multiprocessor systems [10], [12], [14]. These patterns depend on the application's nature [15], but they are also conditioned by the kind of application programming interface being used and, furthermore, by the way this interface is used. ROMIO's interface, for example, offers four different levels to communicate a request pattern to the I/O subsystem. Each of these levels might have different performance implications for the application [13].

2.2 Parameters

We use the following parameters to characterize a spatial I/O access pattern: request size, type of operation, and sequentiality.

Table 1. Relative Sizes of R

    Condition                      Relative Size
    R < 0.5 M_size                 Small
    0.5 M_size < R < M_size        Medium
    R > M_size                     Big

We differentiate between absolute and relative request sizes. An absolute request size, R, is the number of bytes requested from the perspective of each involved process within a parallel program. R can be uniform or variable across processes. In order to express the relative size of R, we define M_size as the main memory size of the compute node on which the accessing process runs. Taking M_size as reference, we distinguish the types of relative sizes shown in Table 1.

Requests are also characterized by the type of operations they make. In this work we consider basically read and write operations.

The main criterion that we use to characterize the set of spatial access patterns used in this work is the sequentiality from the program's perspective. We consider especially two types: partitioned and interleaved [11]. A partitioned sequentiality appears when the processes collectively access the entire file in disjoint sequential segments. There is no common area in the file being used by two processes. The interleaved sequentiality appears when the accesses of every process are strided, or noncontiguous, to form a global sequential pattern.

2.3 Spatial Patterns

Figure 1 shows snapshots of five spatial patterns. The circles represent processes running within a common program that are accessing a common logical file, and the arrows mean any type of operation.

Pattern 0 represents non-MPI parallel I/O to multiple files, where every process is sequential with respect to I/O. This pattern has drawbacks such as the lack of a single logical view of the entire data set, the difficulty of managing the number of files, and a dependency on the original number of processes. Since it can be generated using language I/O [17], it will often be applied.

Patterns 1 through 4 are MPI parallel I/O variants.
Their main advantage consists in offering the user a single logical view of the file. Patterns 1 and 3 fall into the category of global partitioned sequentiality, whereas 2 and 4 are variants of interleaved global sequentiality. Pattern 4 appears when each process accesses the file in a noncontiguous manner. This happens when parallel scientific applications access multidimensional data structures. It can be generated by calling the darray or the subarray function of the MPI-2 interface. We call Pattern 4 irregular because it is the result of irregularly distributed arrays. In such a pattern each process has a data array and a map array, which indicates the position in the file of the corresponding data in the data array. Such a pattern can be expressed using the MPI-2 interface through MPI_Type_create_indexed_block. It can also unknowingly be generated by using darray in the cases where the size of the array in any dimension is not evenly divisible by the number of processes in that dimension. For this kind of access, load balancing is an issue.

Fig. 1. Parallel I/O Application Patterns

3 Distribution Functions in PVFS2

A file distribution, physical distribution, or simply distribution, is a set of methods describing a mapping from a logical sequence of bytes to a physical layout of bytes on PVFS2 I/O servers, which we here simply call I/O nodes. These functions are similar to the declustering or striping methods used to scatter data across many disks, such as in RAID systems [18]. One of these functions is the round robin scheme, which is implemented in PVFS2. In the context of PVFS2, the logical file consists of a set of strips of size ss, which are stored in a contiguous manner on I/O servers [9]. These strips are stored in datafiles [19] on I/O nodes through a distribution function.

4 A Set of Distribution Hints

To ease our discussion in this section, we define an I/O cluster as a Beowulf cluster computer where every physical node has a secondary storage device. The default distribution function in PVFS2 is the so-called simple_stripe, a round robin mechanism that uses a fixed value of 64 KB for ss. Suppose that PVFS2 is configured on an I/O cluster such that each node is a compute and I/O node at the same time, and that on top of this configuration an application generates pattern 1. Under these circumstances simple_stripe might penalize some strips by sending them over the network, thus slowing down I/O operations.

In this work we propose the varstrip distribution. Our approach consists in reproducing pattern 1 at each level of the software stack down to the raw hardware. Thus the varstrip distribution does not scatter strips over I/O nodes in a RAID manner; instead it guarantees that each compute node accesses only its own local hard disk. Furthermore, the strip size to be stored or retrieved on an I/O node can be defined. The varstrip distribution allows the definition of flexible strip sizes that can be assigned to a defined datafile number, thus influencing the load balancing degree among the different I/O servers.

Fig. 2. Software Stack Environment for Distribution Hints (layers shown: Parallel I/O Intensive Applications, MPI, MPI-IO, PVFS2, I/O Hardware)

In order to control the parameters of any distribution function from an MPI program running on a software stack similar to that shown in figure 2, we introduce distribution hints. The purpose of such a hint is to select not only a type of distribution function, but also its parameters. The hint key must have the following format: <distribution name>:<parameter type>:<parameter name>.

At the moment the user can choose, using this format, the following functions: basic_dist, simple_stripe, and varstrip_dist. By choosing the first one, the user saves the data on one single I/O node. The second applies the round robin mechanism with a strip size of 64 KB. These functions are already part of the standard set of distributions in PVFS2. By selecting our proposed varstrip_dist function, the user can influence the throughput or the amount of data to be assigned to the I/O nodes when manipulating an opened file.

In the hint key the parameter name must be given with its type, in order for ROMIO and PVFS2 to manipulate it. Currently the parameter strip_size, of type int64, is supported for simple_stripe. The parameter strips is supported for varstrip_dist. This parameter represents the assignation between datafile numbers and strip sizes. The following piece of code shows the usage of varstrip_dist.

    MPI_Info_set(theinfo, "distribution_name", "varstrip_dist");
    /* Throughput */
    MPI_Info_set(theinfo, "varstrip_dist:string:strips", ":1;1:1");
    /* Load Balancing */
    MPI_Info_set(theinfo, "varstrip_dist:string:strips", ":8;1:1");
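To make the difference between the round robin mechanism and the varstrip assignation concrete, the following is a minimal sketch in plain C, not the PVFS2 implementation. It assumes that a strips value is a semicolon-separated list of <datafile number>:<strip size> pairs and that this list repeats cyclically over the logical file; both points are assumptions made for illustration. The names strip_t, parse_strips, varstrip_datafile, and round_robin_datafile are hypothetical, as is the example value "0:8000;1:1000".

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_STRIPS 16

/* One entry of a strips pattern: which datafile receives how many bytes. */
typedef struct {
    int datafile;
    long long size;
} strip_t;

/* Parse "d0:s0;d1:s1;..." into entries (assumed format, see lead-in).
 * Returns the number of entries, or -1 if the text is malformed. */
int parse_strips(const char *spec, strip_t *out, int max)
{
    int n = 0;
    const char *p = spec;
    while (*p != '\0') {
        char *end;
        if (n == max) return -1;
        long d = strtol(p, &end, 10);            /* datafile number */
        if (end == p || *end != ':' || d < 0) return -1;
        p = end + 1;
        long long s = strtoll(p, &end, 10);      /* strip size in bytes */
        if (end == p || s <= 0) return -1;
        out[n].datafile = (int)d;
        out[n].size = s;
        n++;
        p = end;
        if (*p == ';') p++;
        else if (*p != '\0') return -1;
    }
    return n;
}

/* Map a logical byte offset to a datafile number, assuming the list of
 * strips repeats cyclically over the logical file. */
int varstrip_datafile(const strip_t *strips, int n, long long offset)
{
    long long cycle = 0;
    assert(n > 0 && offset >= 0);
    for (int i = 0; i < n; i++) cycle += strips[i].size;
    long long pos = offset % cycle;
    for (int i = 0; i < n; i++) {
        if (pos < strips[i].size) return strips[i].datafile;
        pos -= strips[i].size;
    }
    return -1; /* not reached when all sizes are positive */
}

/* Round robin mapping with one fixed strip size for all datafiles. */
int round_robin_datafile(long long strip_size, int num_datafiles,
                         long long offset)
{
    assert(strip_size > 0 && num_datafiles > 0 && offset >= 0);
    return (int)((offset / strip_size) % num_datafiles);
}
```

Under these assumptions, a pattern such as "0:8000;1:1000" sends eight times as many bytes to datafile 0 as to datafile 1, which is the kind of control over load balancing that the text describes, whereas the round robin function always alternates datafiles after a fixed number of bytes.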

5 Experiments

5.1 Testbed

The hardware testbed used for the implementation and tests was an I/O cluster consisting of 5 SMP nodes (,..). Each node had two Xeon hyper-threaded processors running at 2 GHz, a main memory size of 1 GB, and an 8 GB hard disk. These nodes were networked using a store-and-forward Gigabit Ethernet switch. The operating system was Linux with kernel. On top of this operating system we installed version 1.0.1 of PVFS2 and MPICH2. PVFS2 was running on top of an ext3 file system and every node was configured both as client and server. The node called was configured as the metadata server.

5.2 Objective

The purpose of the measurements was to compare the bandwidth observed at the nodes when using the varstrip distribution with the bandwidth observed when using pattern 0 or two variants of the round robin PVFS2 distribution: the default distribution function with a strip size of 64 KB, and a variant which we called simple_stripe with fitted strip size. This variant resulted from setting the same value for R, ss, and datafile. When using the fitted simple_stripe, a compute node did not necessarily access its own secondary storage device.

5.3 Measurements

Figures 3, 4, and 5 show the bandwidths (y-axes) calculated from the times measured before and after MPI_File_write or MPI_File_read operations. One single process was started per node. Each process issued small and medium read and write requests of size R following pattern 1. The request sizes (R < 1 GB) are shown on the x-axes.

Fig. 3. Measured Bandwidth: Pattern 1, varstrip_dist (y-axis: MB/s; x-axis: MBytes; series: Reads, Writes)
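The bandwidth calculation described in Section 5.3 can be sketched as follows. This is a minimal illustration rather than the measurement code used for the experiments: the function names are hypothetical, one MB is assumed to be 2^20 bytes, and in an MPI program the two timestamps would typically be obtained with MPI_Wtime() immediately before and after the MPI_File_write or MPI_File_read call.

```c
#include <assert.h>

/* Bandwidth in MB/s for `bytes` transferred between two timestamps
 * given in seconds. Assumes 1 MB = 2^20 bytes. */
double bandwidth_mb_per_s(long long bytes, double t_start, double t_end)
{
    assert(bytes >= 0 && t_end > t_start);
    return ((double)bytes / (1024.0 * 1024.0)) / (t_end - t_start);
}

/* Measured bandwidth expressed as a percentage of a reference bandwidth,
 * for example the one obtained with plain local file system access. */
double percent_of_reference(double measured, double reference)
{
    assert(reference > 0.0);
    return 100.0 * measured / reference;
}
```

For instance, a process that writes 100 MB in 2 seconds observes 50 MB/s; if the reference for the same request were 50 MB/s as well, percent_of_reference would report 100.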

Fig. 4. Measured Bandwidth: Pattern 1, simple_stripe, fitted strip size (y-axis: MB/s; x-axis: MBytes; series: Reads, Writes)

Fig. 5. Measured Bandwidth: Pattern 1, simple_stripe (y-axis: MB/s; x-axis: MBytes; series: Reads, Writes)

For comparison purposes, the same type of operations and values of R were requested at the application level using the UNIX write and read functions. The data was saved to or retrieved from the local ext3 file system directly on the involved nodes, following pattern 0. The corresponding values are presented in Figure 6. For pattern 0 the bandwidth measured at the nodes was approximately 5 MB/s and 4 MB/s for read and write operations respectively. The bandwidth for write operations of s hard disk was 3 MB/s. These results correlate with similar tests made with the bonnie++ benchmarking program.

Fig. 6. Bandwidth obtained using the UNIX interface (y-axis: MB/s; x-axis: MBytes; series: Reads, Writes)

Using the values obtained for pattern 0 as reference, we obtained only 55% and 40% of that performance for write and read accesses respectively when using the default function simple_stripe, as presented in figure 5. It was the only case where the bandwidth of write operations was better than that of read operations.

With the fitted strip size for the simple_stripe function, performances of approximately 75% and 80% were measured for write and read operations respectively. Since the compute nodes were not necessarily using their own local hard disks, accessed the hard disk of during reading operations as shown in figure 4. The node, also the metadata server, used its own local disk during read operations.

Figure 3 presents the performance observed when using our proposed varstrip distribution. The bandwidth reached 80% and 100% of the reference bandwidth for write and read operations respectively.

6 Conclusion and Future Work

In this paper we have described a set of MPI-IO hints with which the user can select a certain distribution function of the PVFS2 parallel file system and its corresponding parameters. We have also described the varstrip distribution function. This function is proposed taking into consideration pattern 1, a parallel I/O spatial pattern which appears at the application level. For this type of workload the varstrip distribution performs better than the other distribution functions, as shown through the experiments. Furthermore, by selecting varstrip the user can manipulate the load balancing degree among the I/O servers.

Our future work consists in implementing other distribution functions and constructing a matrix of pattern-distribution pairs, which will provide information about the functions best suited for particular application patterns. During this process we shall find out for which pattern and configuration simple_stripe performs best and how well varstrip_dist performs with other patterns as workload.

Acknowledgment

We thank Tobias Eberle and Frederik Grüll for the implementations, and Sven Marnach, our cluster administrator.

Additionally, we would like to acknowledge the Department of Education of Baden-Württemberg, Germany, for supporting this work.

References

1. Patterson, David A., Chen, Peter M.: Storage Performance - Metrics and Benchmarks. (1998)
2. Patterson, David A., Chen, Peter M.: Maximizing Performance in a Striped Disk Array. Proc. 17th Annual Symposium on Computer Architecture (17th ISCA '90), Computer Architecture News. (1990)
3. Hsu, W. W., Smith, A. J.: Characteristics of I/O traffic in personal computer and server workloads. IBM Syst. J. 42 (2003)
4. Hsu, W. W., Smith, A. J.: The performance impact of I/O optimizations and disk improvements. IBM Journal of Research and Development. 48 (2004)
5. Sterling, T.: An Overview of Cluster Computing. Beowulf Cluster Computing with Linux. (2002)
6. PVFS2 URL:
7. ROMIO URL:
8. Ligon, W.B., Ross, R.B.: Implementation and Performance of a Parallel File System for High Performance Distributed Applications. Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing. (1996)
9. Ross, Robert B., Carns, Philip H., Ligon III, Walter B., Latham, Robert: Using the Parallel Virtual File System. (2002)
10. Madhyastha, Tara M.: Automatic Classification of Input/Output Access Patterns. PhD Thesis. (1997)
11. Madhyastha, Tara M., Reed, Daniel A.: Exploiting Global Input/Output Access Pattern Classification. Proceedings of SC97: High Performance Networking and Computing. (1997)
12. Thakur, Rajeev, Gropp, William, Lusk, Ewing: On implementing MPI-IO portably and with high performance. Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems (IOPADS-99). (1999)
13. Thakur, Rajeev S., Gropp, William, Lusk, Ewing: A Case for Using MPI's derived datatypes to improve I/O Performance. Proceedings of Supercomputing 98 (CD-ROM). (1998)
14. Rabenseifner, Rolf, Koniges, Alice E., Prost, Jean-Pierre, Hedges, Richard: The Parallel Effective I/O Bandwidth Benchmark: b_eff_io. Parallel I/O for Cluster Computing. (2004)
15. Miller, Ethan L., Katz, Randy H.: Input/output behavior of supercomputing applications. SC. (1991)
16. MPI-2 URL:
17. Gropp, William, Lusk, Ewing, Thakur, Rajeev: Using MPI-2: Advanced Features of the Message-Passing Interface. (1999)
18. Patterson, David, Gibson, Garth, Katz, Randy: A case for redundant arrays of inexpensive disks (RAID). Proceedings of the ACM SIGMOD International Conference on Management of Data. (1988)
19. PVFS Development Team: PVFS 2 Concepts: the new guy's guide to PVFS. PVFS 2 Documentation. (2004)
20. PVFS Development Team: PVFS 2 Distribution Design Notes. PVFS 2 Documentation. (2004)
