Collective I/O on a SGI Cray Origin 2000: Strategy and Performance

Y. Cho, M. Winslett, J. Lee, Y. Chen, S. Kuo, K. Motukuri
Department of Computer Science, University of Illinois, Urbana, IL, U.S.A.
Current affiliation: IBM Almaden Research Center.

Abstract

Panda is a library for collective I/O of multidimensional arrays, designed for SPMD-style applications running on a distributed memory system. In this paper, we describe our experience of porting Panda to the SGI Cray Origin 2000, which utilizes shared memory and a shared file system. On the Origin 2000 we used, RAIDs and a file system inappropriately configured for scientific applications are the limiting factor for I/O performance, to such a degree that a single I/O node can nearly saturate the file system, limiting scalability. We determined that Panda would scale up nicely with faster RAIDs and a file system configuration highly tuned for large scientific applications, as each of Panda's I/O nodes can deliver data to the Origin 2000's file system at a minimum sustained rate of 8 MB/sec with 1-4 I/O nodes. We also determined that nodes do not need to be dedicated to I/O to sustain high utilization of the Origin 2000's file system.

Keywords: Collective I/O, Distributed shared memory system, Shared file system

1 Introduction

The unprecedented scale of scientific simulations on multi-teraflop platforms raises new research issues in I/O and data management. Scientific applications on parallel platforms frequently have a single-program multiple-data (SPMD) structure, in which each processor executes more or less the same program. Multidimensional arrays are a fundamental data type in these applications. Efficiently transferring these arrays between main memory and secondary storage poses a significant challenge.

To overcome the I/O bottleneck problem, the Panda research group is developing new algorithms and approaches for collective I/O, targeting SPMD applications using HPF-style or Adaptive Mesh Refinement (AMR)-style data distributions for large multidimensional arrays on massively parallel platforms and workstation clusters. Our focus is on both I/O library ease of use and performance. The Panda parallel array I/O library has been extensively evaluated on the IBM SP and on networks of workstations and has demonstrated a high level of performance there, e.g., utilizing over 85% of the peak file system throughput on the SP for a wide range of typical array sizes, shapes, processor configurations and I/O requests. However, there is a wide range of massively parallel platforms, and they differ vastly in their interconnects and I/O subsystems, which are the key factors for parallel I/O performance.

We are participants in the Center for Simulation of Advanced Rockets (CSAR), a collaboration of computer scientists and domain scientists funded by the Accelerated Strategic Computing Initiative (ASCI), whose simulations are expected to run on all ASCI platforms (IBM SP, SGI Cray Origin 2000, Intel PentiumPro-based machines and Cray T3E). As part of this work, in this paper we describe our experience of porting and tuning the Panda parallel I/O library on the Origin 2000.

Unlike our prior platforms, the Origin 2000 is a distributed shared memory system whose nodes also share a file system and disks. So, Panda's I/O strategies for previous platforms are not necessarily appropriate here. For instance, on the SP, doubling the number of I/O nodes usually doubles Panda's aggregate throughput. With the Origin 2000's shared file system, increasing the number of I/O nodes may not speed up I/O. Section 2 gives an overview of Panda and the Origin 2000. Section 3 summarizes other potential issues for Panda, and performance results are presented in Section 4. Section 5 surveys related work, and Section 6 concludes the paper.

2 Background

2.1 Collective I/O in the Panda I/O library

Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed memory architectures, with arrays distributed across multiple processors, and that are fairly closely synchronized at I/O time. Panda supports HPF-style BLOCK, CYCLIC, and AMR-style data distributions across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance I/O in this environment is called server-directed I/O [1].

Figure 1: Different array data distributions in memory and on disk provided by Panda (clients on compute nodes, servers on I/O nodes).

Figure 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute nodes arranged in a mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute node. The I/O nodes are also arranged in a mesh, and the data can be distributed across them similarly. The array distribution on disk can be radically different from that in memory. For instance, the array in Figure 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single array in row-major or column-major order, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case.

Each I/O chunk resulting from the distribution chosen for disk will be buffered and sent to (or read from) disk by one I/O node, and that I/O node is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Figure 1, during a Panda write operation an I/O node gathers compute chunks from the compute nodes it serves, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, the other I/O nodes are gathering, reorganizing and writing their own I/O chunks. For a read operation, the reverse process is used. During I/O Panda divides large I/O chunks into smaller pieces, called subchunks, in order to obtain better file system performance and keep I/O node buffer space requirements low. For a write operation, a Panda server will repeatedly gather and write subchunks, one by one.

Panda also supports an alternative I/O strategy, part-time I/O, where there are no dedicated I/O nodes. Instead some of the compute nodes become I/O nodes at I/O time, and return to computation after finishing the I/O operation. When a client reaches a collective I/O call, it may temporarily become a server. (We use the term I/O node to refer to a single processor performing I/O.)
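To make the gather step concrete, the short C sketch below computes, for each I/O node, which compute chunks overlap its I/O chunk when a 2D array is distributed (BLOCK, BLOCK) over a P x Q compute mesh in memory and (BLOCK, *) over K I/O nodes on disk. The array and mesh sizes, the block_lo helper, and the printed output are illustrative assumptions, not Panda's actual interface.

```c
/* Sketch of the index arithmetic behind a server-directed gather:
 * a 2D array distributed (BLOCK, BLOCK) over a P x Q compute mesh
 * in memory and (BLOCK, *) over K I/O nodes on disk.  Names and
 * sizes are illustrative, not Panda's actual interface. */
#include <stdio.h>

/* first row owned by block b when n rows are split into nb blocks */
static int block_lo(int n, int nb, int b) { return (int)((long long)n * b / nb); }

int main(void)
{
    const int N = 8, M = 8;      /* array is N x M            */
    const int P = 2, Q = 2;      /* compute mesh is P x Q     */
    const int K = 2;             /* number of I/O nodes       */

    for (int io = 0; io < K; io++) {
        int lo = block_lo(N, K, io), hi = block_lo(N, K, io + 1);
        printf("I/O node %d owns rows [%d,%d) of the (BLOCK,*) file layout\n",
               io, lo, hi);
        /* find every compute chunk whose row range overlaps [lo,hi);
         * those are the clients this server gathers from */
        for (int p = 0; p < P; p++) {
            int clo = block_lo(N, P, p), chi = block_lo(N, P, p + 1);
            if (chi <= lo || clo >= hi) continue;   /* no overlap */
            int olo = clo > lo ? clo : lo;
            int ohi = chi < hi ? chi : hi;
            for (int q = 0; q < Q; q++)
                printf("  gather rows [%d,%d), cols [%d,%d) from compute node (%d,%d)\n",
                       olo, ohi, block_lo(M, Q, q), block_lo(M, Q, q + 1), p, q);
        }
    }
    return 0;
}
```

In an actual library, each overlapping (row, column) region would become one message from the corresponding client to the server that owns that I/O chunk, and large I/O chunks would be gathered and written subchunk by subchunk as described above.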

2.2 Architecture of the Origin 2000

As shown in Figure 2(a), each node of the Origin 2000 consists of two processors, caches, a portion of the shared memory, a directory for cache coherence, and interfaces to I/O devices and to the interconnection fabric. Memory is organized in 128-byte cache lines, and each cache line is associated with data bits for a directory-based cache coherence scheme [2]. Whenever a cache line is updated, the corresponding data bits are looked up to invalidate other copies of the same cache line.

Figure 2: Architecture of the Origin 2000: (a) each node (two processors with 4 MB caches, Hub, XIO interface to I/O devices, router, and local memory and directory) and (b) interconnection fabric for 16 nodes.

Figure 2(b) shows the interconnection fabric for 32 processors, the configuration used for our experiments. Each pair of nodes is connected to a router, and the routers form a hypercube [2]. Table 1 gives details of the Origin 2000 we used at NCSA. 32 processors (16 nodes) share a total of 16 disks connected to two SCSI-2 RAID level 5 adapters. The two RAIDs are striped via XLV (the XFS volume manager) to create a single logical volume. Users do not have any control over XLV or the RAIDs.

Table 1: Characteristics of the SGI Cray Origin 2000 at NCSA that we used for experiments.
  Total processors: 32
  Processor: 195 MHz MIPS R10000
  Total memory: 8 GB
  OS: IRIX 6.4
  Interconnect topology: binary n-cube
  Interconnect bandwidth: 1.56 GB/sec (full duplex)
  I/O subsystem interface: Fast Wide SCSI-2
  Number of RAIDs: 2
  RAID level: 5
  Disks per RAID: 8 9-GB 7200 RPM disks
  File system: XFS
  Logical volume: 7 GB (5% occupied)

3 Experimental Issues

To understand the execution environment, we ran a microbenchmark to measure the performance of the file system and the message passing system. In the file system benchmark (Figure 3(a)), we varied the number of processors each writing/reading to/from a separate 8 MB file on the logical volume and measured the aggregate throughput. For write performance, we issue fsync() at the end to ensure data are written to disk. For read measurements, we flush the file cache by writing a dummy file of size twice the entire memory size, so that data are actually read from disk, not from the file cache.
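As a rough illustration of this microbenchmark, the sketch below has each process write its own file in 1 MB requests and call fsync() before stopping its timer, with the aggregate throughput computed from the slowest writer. The file path, the sizes, and the use of MPI purely for synchronization and timing are our assumptions; the original benchmark's details are not specified beyond the description above.

```c
/* Sketch of a per-process file system write benchmark: each process
 * writes a separate file in 1 MB requests and issues fsync() before
 * stopping its timer.  Path, sizes, and the use of MPI only for
 * synchronization and timing are assumptions, not the paper's code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define FILE_MB   8          /* per-process file size in MB (assumed) */
#define REQ_BYTES (1 << 20)  /* 1 MB write requests                   */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(REQ_BYTES);
    memset(buf, rank, REQ_BYTES);

    char path[256];
    snprintf(path, sizeof(path), "bench.%d", rank);   /* separate file per process */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    MPI_Barrier(MPI_COMM_WORLD);                      /* start all writers together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < FILE_MB; i++)
        (void)write(fd, buf, REQ_BYTES);
    fsync(fd);                                        /* ensure data reach disk */
    double t = MPI_Wtime() - t0;
    close(fd);

    double tmax;                                      /* slowest writer bounds the run */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d writers: aggregate %.1f MB/sec\n",
               nprocs, (double)FILE_MB * nprocs / tmax);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

For the read side, the text's approach of first writing a dummy file about twice the size of memory would be applied before timing, so that reads are served from disk rather than from the file cache.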

Figure 3: Performance of (a) the file system (aggregate throughput in MB/sec vs. number of I/O processors; file size = 8 MB, request size = 1 MB, separate read and write curves) and (b) message passing (aggregate throughput vs. message size in KB, for varying numbers of sender-receiver pairs).

Results show that read throughput is higher than write throughput, as expected. Surprisingly, a single I/O node gives the best aggregate throughput, around 3 MB/sec, and additional I/O nodes degrade performance for a 1 MB request size. Read requests generated by multiple I/O nodes require immediate disk activity, so the overhead of disk seeks increases with the number of readers. In contrast, the write-behind policy of the file cache allows delayed scheduling of writes to disk, which can reduce the cost of seeks. We found that if only a few processors write, the file size does not have any impact on performance. However, as the number of writers increases, a performance boost can be observed for small files, but the throughput varied among I/O nodes significantly, with at least one I/O node remaining in the 4-5 MB/sec range for writes. Also, once the total amount of data written by all processors exceeds a few GB, performance starts to degrade. Figure 3(a) uses 1 MB I/O requests; write experiments with other sizes showed that the file system gave the same performance for request sizes from 3-4 KB.

From Figure 3(a) and the fact that Panda's performance is inherently limited by the performance of the slowest I/O node, we conclude that aggregate throughput for write operations will be limited by the file system to 5 MB/sec for large arrays in this I/O configuration. The poor performance is due to the RAID and XLV configuration used in this installation, rather than to inherent XFS limitations. Later I/O benchmarks done by SGI and NCSA found that the default RAID and XLV stripe unit sizes used in our installation are appropriate for database applications making random small requests; by simply increasing the XLV stripe unit size, the throughput is significantly increased. Also, tests using Ultra SCSI controllers with 3 non-RAID drives provide 5 MB/sec per controller for writes. Thus in the remainder of the paper, we are careful to consider the possibility of faster file system performance.

On many parallel platforms, message passing can be the bottleneck for I/O. To test this on the Origin 2000, we measured the aggregate throughput of MPI by passing messages between one or more sender-receiver pairs. Figure 3(b) shows that message passing performance scales well for a variety of message sizes. To attain this bandwidth when there are multiple processes, users need to tune SGI's environment variables for MPI. For instance, the variables MPI_BUFS_PER_HOST and MPI_BUFS_PER_PROC need to be set to minimize contention for the message buffer [3]. Small messages (8-16 KB) tend to perform better as the number of sender-receiver pairs increases. This suggests that Panda's buffer size on the I/O nodes should be set so that most data messages are 8-16 KB, if the interconnect ever becomes a limiting factor in performance. However, clearly the file system will be the main bottleneck for I/O on this Origin 2000.
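The following C sketch shows the shape of such a sender-receiver pair measurement: half the ranks send fixed-size messages to partner ranks and the aggregate rate is computed over all pairs. The message size, iteration count, and use of blocking sends are assumptions; the SGI-specific environment variables mentioned above (MPI_BUFS_PER_HOST, MPI_BUFS_PER_PROC) would be set outside the program.

```c
/* Sketch of an aggregate MPI bandwidth test with sender-receiver pairs:
 * rank r < P/2 sends to rank r + P/2.  Message size, iteration count,
 * and blocking point-to-point calls are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (64 * 1024)   /* one message                */
#define ITERS     200           /* messages sent by each pair */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int pairs = nprocs / 2;
    char *buf = malloc(MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < pairs)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, rank + pairs, 0, MPI_COMM_WORLD);
        else if (rank < 2 * pairs)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, rank - pairs, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest pair */
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d pairs, %d KB messages: %.1f MB/sec aggregate\n",
               pairs, MSG_BYTES / 1024,
               (double)MSG_BYTES * ITERS * pairs / (t * 1024 * 1024));

    free(buf);
    MPI_Finalize();
    return 0;
}
```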

The experiments in the next section address the question of how many I/O nodes are needed for peak performance, and whether nodes need to be dedicated to I/O or can be shared between computation and I/O. In previous work [4], we found that part-time I/O gave good performance on a small network of workstations while maximizing the resources available for computation, but only if the I/O nodes are very carefully chosen to minimize the data transfer over the interconnect. However, on the Origin 2000, where the processors are connected by a very fast interconnect, the exact choice of I/O nodes may no longer be so important. To help extrapolate to other Origins, we must also evaluate Panda's performance with a (simulated) faster file system.

We focus on write operations because (i) the CSAR applications are simulations, which are write intensive, and (ii) in general, write operations are slower than read operations on RAID level 5 because of the parity update. Also, we focus on arrays that have different distributions in memory and on disk. If the output format is not important, each compute node can write its own data to a contiguous region of a file, since the logical volume is shared by all nodes; however, performance will still be limited to the file system's aggregate 5 MB/sec and 3 MB/sec for write and read operations respectively.

4 Performance

The results in this section were obtained using processors dedicated to our job. In all experiments, we distribute a 3D array across 16 compute nodes logically connected in a 4 x 4 mesh and vary the number of I/O nodes from 1 to 16. The error bars show a 95% confidence interval for the mean.

In the first experiments, we measure the time to output a {64, 128, 256, 512, 1024} x 512 x 512 floating point array using a set of dedicated I/O nodes. These large arrays are appropriate given the Origin 2000's large memory and CSAR application characteristics. We changed the number of elements along the slowest varying dimension so that all arrays would have the same amount of logically contiguous data in memory, which can have a significant impact on message passing performance. In Figure 4(a), for a write operation, as the number of I/O nodes increases, the aggregate throughput does not increase accordingly but is limited to about 4 MB/sec, 95% of the attainable peak file system throughput. The percentages shown in Figure 4 represent the fraction of the peak file system performance achieved on the slowest I/O node. With only a few I/O nodes, we can fully saturate the I/O system and achieve throughput very close to the attainable peak. This means that dedicating a large number of nodes to I/O is not an effective use of resources on this platform. We repeated the same experiment for array read operations (Figure 4(b)). Like the file system benchmark in Figure 3, Panda achieves the highest throughput with a single I/O node, averaging around 95% of the peak file system performance.

As shown in Figure 4(a), Panda's write performance is slightly lower if only 1-2 I/O nodes are used. This is due to an unbalanced scalability between message passing and the underlying file system as the number of I/O nodes increases. For instance, if we increase the number of I/O nodes from 1 to 4, aggregate message passing throughput scales almost linearly but the aggregate file system write throughput increases only slightly. Thus, as the number of I/O nodes increases, the time each I/O node devotes to message passing becomes a smaller and smaller fraction of its total run time, providing near-peak file system utilization. This trend does not occur for read operations, because XFS performs aggressive prefetching while Panda servers scatter the current I/O buffer to clients; the large error bars in Figure 4(b) reflect this prefetching/buffer management variation.

To determine scalability to a faster file system, we simulated infinitely fast disks by commenting out the file system write operations in Panda (Figure 5).
The results show how well Panda is utilizing the available MPI bandwidth. Up to 8 I/O nodes, the message passing in Panda scales very well, reaching around 90% of the corresponding attainable MPI bandwidth shown in Figure 3. We tested read operations simulating infinitely fast disks, and the results are similar to Figure 5, with 3-5% higher throughput. We conclude that on an Origin 2000 that can read or write data at x MB/sec, roughly ⌈x/8⌉ I/O nodes will be needed to saturate the file system, and the use of additional I/O nodes brings no performance advantages for Panda.
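As a worked instance of this rule of thumb (the 45 MB/sec input below is purely an illustrative file system rate, not a measurement from this paper):

```c
/* I/O nodes needed to saturate a file system sustaining fs_rate MB/sec,
 * assuming each Panda I/O node delivers about 8 MB/sec as stated above.
 * The 45 MB/sec example rate is illustrative only. */
#include <stdio.h>

static int io_nodes_needed(int fs_rate, int per_node_rate)
{
    return (fs_rate + per_node_rate - 1) / per_node_rate;   /* integer ceiling */
}

int main(void)
{
    printf("%d I/O nodes\n", io_nodes_needed(45, 8));        /* prints: 6 I/O nodes */
    return 0;
}
```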

Figure 4: Performance when an array of shape n x 512 x 512 is distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, where n is the array size in MB: (a) write and (b) read operations using dedicated I/O nodes, array sizes 64 MB to 1 GB, request size = 1 MB.

In Figure 5, performance is insensitive to array size. However, this won't be true for all arrays, because MPI exhibits a noticeable performance penalty if the data being transferred are not contiguous in memory. To explore this phenomenon, we repeated the experiment of Figure 4 using a different array shape, 512 x 512 x {64, 128, 256, 512, 1024}. As can be seen in Figure 6, larger arrays now perform better than small arrays, because smaller arrays have fewer elements in a contiguous region of memory. This causes the I/O nodes to perform more strided memory accesses than with larger arrays, and the extra cost of strided memory access is noticeable in the parallel I/O performance when there are few I/O nodes. However, the extra cost drops as the number of I/O nodes increases, because MPI costs drop linearly while file system costs hold steady. On Origins with faster file system configurations, the extra cost of strided accesses will be even more apparent. The use of separate threads for communication and disk access, as is done in implementations of disk-directed I/O [5], would alleviate this problem.
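One way to reduce the cost of such strided transfers is to describe the non-contiguous piece with an MPI derived datatype and let the MPI implementation handle the packing, as argued in [10]. The sketch below builds an MPI_Type_vector for a column block of a row-major 2D array; the array and block sizes are illustrative assumptions and this is not Panda's code.

```c
/* Illustrative use of an MPI derived datatype to describe a strided
 * piece of a 2D array (one column block of a row-major array), so a
 * single MPI_Send can move non-contiguous data without hand packing.
 * Run with at least two MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 8, M = 8, W = 4 };      /* N x M array, send a width-W column block */
    float a[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = rank == 0 ? i * M + j : -1.0f;

    /* N blocks of W floats, separated by a stride of M floats */
    MPI_Datatype colblock;
    MPI_Type_vector(N, W, M, MPI_FLOAT, &colblock);
    MPI_Type_commit(&colblock);

    if (rank == 0)
        MPI_Send(&a[0][0], 1, colblock, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&a[0][0], 1, colblock, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("a[2][0..3] on rank 1: %g %g %g %g\n",
               a[2][0], a[2][1], a[2][2], a[2][3]);
    }

    MPI_Type_free(&colblock);
    MPI_Finalize();
    return 0;
}
```

The smaller the contiguous pieces relative to the stride, the higher the per-block overhead, which is consistent with the penalty observed above for small arrays when there are few I/O nodes.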

Figure 5: Performance of simulated write when an array of shape n x 512 x 512 is distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, where n is the array size in MB. Dedicated I/O nodes, array sizes 64 MB to 1 GB, request size = 1 MB.

We also tested part-time I/O using 16 compute nodes, where the first k compute nodes become I/O nodes at I/O time. As shown in the second graph in Figure 6, we obtain almost the same aggregate throughput as with dedicated I/O nodes. As hypothesized earlier, the Origin 2000's very fast interconnect makes a careful choice of I/O nodes unnecessary. We conclude that there is no advantage to dedicating nodes to I/O on the Origin 2000.

5 Related work

Numerous runtime libraries and file systems provide collective I/O [6, 7, 8, 5, 9], but performance studies of these libraries on the Origin 2000 are just beginning. ROMIO is an implementation of the MPI-IO standard interface which uses two-phase I/O for collective I/O operations [10]. [10] presents collective read performance on a variety of parallel platforms including the Origin 2000, for a 3D array distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, using all the compute processors as I/O processors at I/O time, as with our part-time I/O nodes. On an Origin 2000 with a faster RAID than the one we used, ROMIO's collective read performance on 32 processors, 75 MB/sec, is about 3 times faster than naive sequential reads (i.e., each compute node issuing many small read/write requests). However, the throughput for collective read is only around 5% of the available file system bandwidth. The rest of the run time is used to analyze the requests of the different processes and to redistribute data after reading large blocks. Our results in Figure 4 suggest that fewer I/O nodes might perform better. Alternatively, our results using a simulated faster file system (Figure 5), coupled with the performance results in [10], suggest that server-directed I/O may have a significant performance advantage over two-phase I/O for the types of array I/O operations discussed in this paper.

6 Summary and Conclusion

We have ported the Panda parallel I/O library, designed for collective I/O of SPMD-style applications running on a distributed memory system, to a SGI Cray Origin 2000 at NCSA. Our experiments show that the extremely low and non-scalable aggregate file system bandwidth is the bottleneck for parallel I/O on this installation. The low throughput is due to the small stripe unit size used for the RAIDs and XLV in this installation. By using more advanced RAIDs and XLV parameters tuned for large I/O requests, the file system throughput can be significantly increased. MPI on the Origin 2000 we used can sustain an aggregate throughput of roughly 80-100 MB/sec between a sender and receiver, and scales almost linearly as the number of communicating processes increases. Thus message passing will not be the typical bottleneck for I/O on this or most other Origin 2000 configurations.

We found that only a few I/O nodes would be needed to provide collective I/O performance close to the peak file system performance, even if the file system were much faster, and that the nodes need not be dedicated to I/O. However, with 1-2 I/O nodes and small arrays, certain array shapes and distributions can incur a significant performance penalty, because strided memory access using MPI on this platform can be very expensive. Considering the theoretical peak interconnect bandwidth, 780 MB/sec, we believe there is plenty of room for the MPI bandwidth between a pair of processors to be increased, removing potential performance pitfalls in parallel I/O performance for small numbers of I/O nodes.

7 Acknowledgements

This work was supported in part by NASA and DOE under NAGW 444, NCC5 6, and B34494, and utilized a SGI Cray Origin 2000 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. We would like to thank Eric Salo at SGI for his help regarding SGI's implementation of MPI. We would also like to thank Luc Chouinard at SGI and the NCSA staff for help with the I/O benchmarks.

Figure 6: Write performance for an array of shape 512 x 512 x {64, 128, 256, 512, 1024} when (a) dedicated I/O nodes and (b) part-time I/O nodes are used; request size = 1 MB.

References

[1] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, November 1995.
[2] Origin Servers Technical Overview. Technical report, Silicon Graphics Inc., April 1997.
[3] Message Passing Toolkit: MPI Programmer's Manual. Technical report, Silicon Graphics Inc., January 1998.
[4] Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, and K. E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, November 1997.
[5] D. Kotz. Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
[6] R. Bordawekar, J. Rosario, and A. Choudhary. Design and Evaluation of Primitives for Parallel I/O. In Proceedings of Supercomputing '93, pages 452-461, 1993.
[7] R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz. Jovian: A Framework for Optimizing Parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, October 1994.
[8] P. F. Corbett, D. G. Feitelson, J. Prost, and S. J. Baylor. Parallel Access to Files in the Vesta File System. In Proceedings of Supercomputing '93, pages 472-481, November 1993.
[9] Rajesh Bordawekar. Implementation of Collective I/O in the Intel Paragon Parallel File System: Initial Experiences. In Proceedings of the International Conference on Supercomputing, July 1997.
[10] R. Thakur, W. Gropp, and E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Technical Report Preprint ANL/MCS-P717-0598, Argonne National Laboratory, May 1998.
