Parallel I/O on Networks of Workstations: Performance Improvement by Careful Placement of I/O Servers

Yong Cho 1, Marianne Winslett 1, Szu-wen Kuo 1, Ying Chen 2, Jonghyun Lee 1, Krishna Motukuri 1

1 Department of Computer Science, University of Illinois, 1304 W. Springfield, Urbana, IL 61801, U.S.A.
2 IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, U.S.A.
{ycho,winslett}@cs.uiuc.edu, phone: (+1 217) 44 73, fax: (+1 217)

Abstract

Thanks to powerful processors, fast interconnects, and portable message passing libraries like PVM and MPI, networks of inexpensive workstations are becoming popular as an economical way to run high-performance parallel scientific applications. On traditional massively parallel processors, the performance of parallel I/O is most often limited by disk bandwidth, though the performance of other system components, especially the interconnect, can at times be a limiting factor. In this paper, we show that the performance of parallel I/O on commodity clusters is often significantly affected not only by disk speed but also by interconnect throughput, I/O bus capacity, and load imbalance caused by heterogeneity of nodes. Specifically, we present our experimental results from reading and writing large multidimensional arrays with the Panda I/O library on two significantly different clusters: HP workstations connected by FDDI and HP PCs connected by Myrinet. We also discuss the approaches we use in Panda to maximize the I/O throughput available to the application on these two platforms, particularly careful placement of I/O servers.

1 Introduction

Due to the wide availability of powerful processors, fast interconnects, and portable message passing libraries like PVM [15] and MPI [8], networks of commodity workstations are gaining popularity as an economical way to run very large parallel applications. In this environment, scientific applications typically distribute their data across multiple processes that are kept closely synchronized for computation and communication. Often these applications output large intermediate results and read them back in later for a subsequent step of computation. Large final results may also be output for subsequent visualization, especially with time-dependent simulation applications. Long-running applications also typically save periodic snapshots of their major arrays in a file (a checkpoint), so that they can restart from the checkpoint in the event of failure. Due to the relatively low performance of the I/O subsystem and a lack of efficient software, I/O can be a major bottleneck in all these applications.

The Panda parallel I/O library 1 is designed for SPMD-style parallel applications and provides I/O portability, an easy-to-use high-level interface, and high-performance collective I/O of multidimensional arrays. On traditional massively parallel processors like the IBM SP, Panda's performance is mainly limited by disk speed. In this paper, we show that Panda's performance on a commodity cluster can be affected significantly by almost every system component: disk speed, interconnect throughput, main memory size, I/O bus speed, and the presence of heterogeneous nodes. We focus on our experimental results from two very different clusters: HP workstations connected by FDDI and HP PCs connected by Myrinet. With workstations connected by FDDI that have reasonably fast disks, the I/O bottleneck is likely to lie in message passing, because FDDI is a 100 Mb/s shared-media network. Thus, it is crucial to minimize simultaneous access to the network by multiple processes, as well as the amount of data transferred over the network. However, when a more contemporary network like Myrinet is used together with high-performance PCs (each a 2- or 4-processor symmetric multiprocessor (SMP) sharing memory and I/O busses), the interconnect is no longer the bottleneck, but contention among processors for the shared resources can be a limiting factor in parallel I/O performance.

1 More information can be found online at

For both types of clusters, careful placement of I/O servers can have a strong positive impact on performance.

In the next section, we introduce the Panda parallel I/O library and describe the two clusters used in our experiments. In section 3, we discuss problems related to parallel I/O on workstations connected by FDDI. In section 4, we show that resource contention among processors in the same SMP can be a limiting factor for I/O performance if a more contemporary, switch-based network is used as the interconnect. Sections 5 and 6 discuss heterogeneity related to parallel I/O performance and related work, respectively. Finally, we conclude the paper in section 7.

2 Background

2.1 The Panda parallel I/O library

Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed memory systems, with arrays distributed across multiple processes that are closely synchronized at I/O time. Panda supports HPF-style [5] BLOCK, CYCLIC, and Adaptive Mesh Refinement-style data distributions [12] across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance collective I/O, in which all clients and servers cooperate to perform I/O, is called server-directed I/O [14].

Fig. 1: Different array data distributions in memory (clients, on the compute nodes) and on disk (servers, on the I/O nodes) provided by Panda.

Fig. 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute processors arranged in a 2 x 2 logical mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute processor. The I/O processors are also arranged in a logical mesh, and the data can be distributed across them, and implicitly across their disks, using a variety of distribution directives. The array distribution on disk can be radically different from that in memory. For instance, the array in Fig. 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single row-major or column-major array, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case.

With server-directed I/O, each I/O chunk resulting from the distribution chosen for disk will be buffered and sent to (or read from) disk by one Panda server, and that server is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Fig. 1, during a Panda write operation server 0 gathers compute chunks from clients 0 and 1, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, server 1 gathers, reorganizes, and writes its own I/O chunk. For a read operation, the reverse process is used. During I/O, Panda divides large I/O chunks into a series of smaller pieces, called subchunks, in order to obtain better file system performance and keep server buffer space requirements low. For a write operation, a Panda server repeatedly gathers and writes subchunks, one by one. When reading or writing an entire array, each Panda server reads or writes its file sequentially.

Usually, Panda clients and servers reside on physically different processors. However, on a network of workstations, where a limited number of processors are available, it would be wasteful to dedicate processors to I/O and leave them idle during computation. Panda therefore supports an alternative I/O strategy, part-time I/O, in which there are no dedicated I/O processors. Instead, some of the Panda clients run servers at I/O time and return to computation after finishing the I/O operation.
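To make the chunk mapping concrete, the following sketch (illustrative Python, not Panda's implementation; the block/chunks/overlap helpers and the 2048 x 2048 array size are our own assumptions) computes which compute chunks each I/O server must gather when a (BLOCK, BLOCK) in-memory distribution over a 2 x 2 client mesh is written with a (BLOCK, *) disk distribution over 2 servers, as in Fig. 1.

from itertools import product

def block(extent, parts, idx):
    """Half-open [lo, hi) index range of block `idx` when `extent` is split into `parts` blocks."""
    size = -(-extent // parts)              # ceiling division
    lo = idx * size
    return lo, min(lo + size, extent)

def chunks(shape, mesh):
    """Yield (mesh coordinate, per-dimension ranges) for a BLOCK,...,BLOCK distribution."""
    for coord in product(*map(range, mesh)):
        yield coord, [block(n, p, i) for n, p, i in zip(shape, mesh, coord)]

def overlap(a, b):
    """Number of array elements in the intersection of two chunk extents."""
    count = 1
    for (alo, ahi), (blo, bhi) in zip(a, b):
        count *= max(0, min(ahi, bhi) - max(alo, blo))
    return count

shape = (2048, 2048)                        # one of the array sizes used in Section 3
clients = dict(chunks(shape, (2, 2)))       # (BLOCK, BLOCK) in memory: 4 compute chunks
servers = dict(chunks(shape, (2, 1)))       # (BLOCK, *) on disk: one I/O chunk per server

for s, s_extent in servers.items():
    gathers = {c: overlap(s_extent, c_extent) for c, c_extent in clients.items()}
    print("I/O server", s, "gathers", {c: n for c, n in gathers.items() if n})

Every nonzero overlap is a compute chunk that the server gathers (locally or over the network) and reorganizes into its I/O chunk before writing it to disk in subchunks; for this example, server (0, 0) gathers the chunks of clients (0, 0) and (0, 1), and server (1, 0) gathers the other two, matching the Fig. 1 description.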

2.2 Cluster systems

The first platform on which our experiments were conducted is an 8-node HP 9000/735 workstation cluster connected by FDDI, with each node running HP-UX 9.07 (see Tab. 1). Each node has two local disks, and each Panda I/O node uses a 4 GB local disk. At the time of our experiments, the disks were 45-90% full, depending on the node. We measured the average file system throughput using 128 sequential 1 MB application requests to the file system: 5.96 MB/sec for the least occupied disk and 5.63 MB/sec for the fullest disk. For the message passing layer, we used MPICH and obtained an average message passing bandwidth per node of 3.2, 2.9, or 2.3 MB/sec with 1, 2, or 4 pairs of senders and receivers, respectively, for Panda's common message sizes (32-512 KB). So message passing, which is slower than the underlying file system, is the clear bottleneck for parallel I/O on this cluster.

Our second platform is a High Performance Virtual Machine (HPVM) [2], a collection of HP Kayak XU dual-processor PCs with a more advanced, switch-based Myrinet interconnect. Each node consists of two symmetric 300 MHz Pentium IIs and a 4 GB 10,000 RPM disk, and runs Windows NT Server 4.0. We measured the average file system (NTFS) throughput using the Win32 API as 10-12 MB/sec, depending on the request size, the file caching option, and the total amount of data read or written. With file caching turned on, the peak file system throughput was obtained with requests of size 8-128 KB, whereas without file caching, larger request sizes (320-1024 KB) performed better on average and also gave much more consistent performance for write operations. These results are consistent with the analysis of NTFS performance presented in [13]. For reads, throughput is best with file caching turned on, because of the performance advantage offered by prefetching into the file cache; request sizes of 8-64 KB lead to peak read performance. So in our experiments with Panda on the PC cluster, we use 64 KB application read and write requests and turn file caching on only for read operations.

Tab. 1 summarizes the file system and message passing performance in this configuration. The file system throughputs were measured using 2048 sequential 64 KB application requests to the file system (a total of 128 MB read or written). The message passing throughput per node available from MPI-FM [7] is 40-70 MB/sec, again depending on the message size and the total amount of data transferred. Sharing an SMP between clients and servers hurts performance when multiple processors in a node try to send or receive messages. Even though the theoretical peak bandwidth of PCI is 133 MB/s, the achieved bandwidth is much lower [13]. All processes on an SMP share one PCI bus, so message passing can be bottlenecked by the PCI bus that connects to a fast network like Myrinet. In the system we used for our experiments, contention reduces the message passing throughput between two processors in the same node to approximately 40 MB/sec, a little over half of the peak throughput obtainable by a pair of processors in different nodes.

Tab. 1: Comparison of clusters.
  HP 9000/735 workstation cluster: 8 nodes; PA-RISC processors; FDDI interconnect; 144 MB memory per node; MPI (MPICH): 3.2 MB/s throughput, 57 us latency; file system (HP-UX): 4.1-5.9 MB/s write, 4.3-5.7 MB/s read.
  SMP cluster (HPVM): 64 nodes; dual 300 MHz Pentium II processors; Myrinet interconnect; 512 MB memory per node; MPI (MPI-FM): 70 MB/s throughput, 17 us latency; file system (NTFS): 11.7 MB/s write, 12.5 MB/s read.

MPI latency is measured by sending 1000 0-byte messages between two processes, and 32 KB messages are used on both clusters to measure the MPI throughput. For file system throughput, we used a 1 MB request size on the workstation cluster and a 64 KB request size on the SMP cluster for both read and write operations. (In our experiments on the HP workstation cluster, Panda servers also read or write one 1 MB subchunk at a time.)
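The file system throughputs above come from timing a fixed number of equal-sized sequential requests; a minimal sketch of such a microbenchmark is shown below (our illustration, not the authors' harness; the file name and request parameters are arbitrary, and unlike the NTFS write measurements it does not bypass the operating system's file cache).

import os, time

def sequential_write_mb_per_s(path, request_bytes, num_requests):
    """Time num_requests sequential writes of request_bytes each and report MB/s."""
    buf = os.urandom(request_bytes)              # one request worth of data
    start = time.time()
    with open(path, "wb", buffering=0) as f:     # unbuffered at the Python level only
        for _ in range(num_requests):
            f.write(buf)
        os.fsync(f.fileno())                     # force data to disk before stopping the clock
    elapsed = time.time() - start
    return request_bytes * num_requests / (1024 * 1024) / elapsed

if __name__ == "__main__":
    # e.g. 128 x 1 MB requests, as on the HP-UX nodes; 2048 x 64 KB would mimic the NTFS runs
    print(sequential_write_mb_per_s("bench.dat", 1024 * 1024, 128), "MB/s")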

3 Parallel I/O on workstations connected by FDDI

As implemented in Panda 2.1, part-time I/O chooses the first m of the compute processors to run Panda servers, regardless of the array distribution or the compute processor mesh. A similar strategy has been taken in other libraries that provide part-time I/O [1]. This provides acceptable performance on a high-speed, switch-based interconnect whose processors are homogeneous with respect to I/O ability, as on many SPs, but on a shared-media interconnect like FDDI or Ethernet, we found that performance is generally unsatisfactory and tends to vary widely with the exact choice of compute and I/O processor meshes. To see the source of the problem, consider the example in Fig. 1. With the naive selection of compute processors 0 and 1 as I/O servers, compute processor 1 needs to send its local chunk to compute processor 0 and gather a subchunk from compute processors 2 and 3. This incurs extra message passing that is unnecessary if compute processor 2 acts as an I/O server instead of compute processor 1.

In an environment where the interconnect will clearly be the bottleneck for I/O, as is the case for the HP workstation cluster with an FDDI interconnect, the single most important optimization we can make is to minimize the amount of remote data transfer. Our previous work [3] describes how to place I/O servers in a manner that minimizes the number of array elements that must be shipped across the network during I/O. More precisely, suppose we are given a target number m of part-time I/O servers, the current distribution of data across processor memories, and a desired distribution of data across I/O servers. We show how to choose the I/O servers from among the set of n >= m compute processors so that remote data transfer is minimized. We begin by forming an m x n array called the I/O matrix M, where each row represents one of the I/O servers and each column represents one of the n compute processors. The (i, j) entry of the I/O matrix, M(i, j), is the total number of array elements that the ith I/O server will have to gather from the jth compute processor, which can be computed from the array size, the in-memory distribution, and the target disk distribution. In Panda, every processor involved in an array I/O operation has access to the array size and distribution information, so M can be generated at run time. Given the I/O matrix, the goal of choosing the m I/O servers that minimize remote data transfer can be formalized as the problem of choosing m matrix entries M(i_1, j_1), ..., M(i_m, j_m) such that no two entries lie in the same row or column (i_k != i_l and j_k != j_l for 1 <= k < l <= m) and the sum M(i_1, j_1) + ... + M(i_m, j_m) is maximal. To solve this problem, we can view M as the representation of a bipartite graph, where every row (I/O server) and every column (compute processor) represents a vertex, and each entry M(i, j) is the weight of the edge connecting vertices i and j. [3] shows that the problem of assigning I/O servers is equivalent to finding the matching 3 of M with the largest possible sum of weights. The optimal solution can be obtained using the Hungarian Method [11] in O(m^3) time, where m is the number of part-time I/O servers.
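As an illustration of this formulation (not Panda's code), the sketch below builds a small I/O matrix with made-up values for the Fig. 1 scenario and solves the resulting assignment problem with scipy.optimize.linear_sum_assignment, which computes the same maximum-weight matching that the Hungarian Method produces.

import numpy as np
from scipy.optimize import linear_sum_assignment

# M[i, j] = array elements that I/O server i would gather from compute processor j.
# 2 servers, 4 compute processors, as in the Fig. 1 example (values are illustrative).
M = np.array([
    [1048576, 1048576, 0,       0      ],   # server 0's I/O chunk overlaps clients 0 and 1
    [0,       0,       1048576, 1048576],   # server 1's I/O chunk overlaps clients 2 and 3
])

servers, hosts = linear_sum_assignment(M, maximize=True)
for i, j in zip(servers, hosts):
    print(f"run I/O server {i} on compute processor {j} "
          f"({M[i, j]} of its elements are already local)")
# No two servers share a compute processor, and the summed local data is maximal,
# so remote data transfer over the shared FDDI ring is minimized.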
We compared the performance of Panda using the first m processors as I/O servers ("fixed" I/O servers) against optimally placed (in terms of minimal data transfer) I/O servers. We used all 8 nodes as compute processors in our experiments, as that configuration is probably most representative of scientists' needs. The in-memory distribution was (BLOCK, BLOCK) and we tested performance using a 2 x 4 compute processor mesh. We used 2, 4, or 8 part-time I/O servers, while increasing the array size from 4 MB (1024 x 1024) to 16 MB (2048 x 2048) and 64 MB (4096 x 4096). For the disk distribution, we used either (BLOCK, *) or (*, BLOCK) to show the effect of a radically different distribution. We present results for writes; reads are similar. Since the cluster is not fully isolated from other networks, we ran our experiments when no other user job was executing on the cluster. All the experimental results shown are the average of 3 or more trials, and error bars show a 95% confidence interval for the average.

Fig. 2 and Fig. 3 compare the time to write an array using different placements of I/O servers. A group of 6 bars is shown for each number of I/O servers. Each pair of bars within a group shows the response time to write an array of the given size using fixed and optimal placement of I/O servers, respectively. Optimal placement of I/O servers reduces array output time by at least 19% across all combinations of array sizes and meshes, except for the cases where the fixed and optimal I/O server placements are identical.

We found that even with optimal I/O server placement, performance depends strongly not only on the amount of local data transfer, but also on the compute processor mesh chosen, the array distribution on disk, and the number of I/O servers. For instance, in Fig. 3, moving from 2 to 4 I/O servers gives a superlinear speedup with optimal placement, but that does not happen if the (BLOCK, *) disk distribution is used instead (Fig. 2). To obtain good I/O performance, the user needs help from Panda in determining the effect on I/O performance of seemingly irrelevant decisions such as the choice between a 2 x 4 or a 4 x 2 compute processor mesh. In [3], we presented a performance model for Panda running on an FDDI cluster, to be used in predicting Panda's message passing performance, and showed its accuracy. The performance model can guide a user to select array distributions and compute processor meshes that give the best performance on this cluster.

3 A matching in a graph is a subset of the edges such that no two edges share an endpoint.

Fig. 2: Panda response time (sec) for writing an array using fixed or optimal I/O server placement, as a function of the number of I/O servers. Memory mesh: 2 x 4. Memory distribution: (BLOCK, BLOCK). Disk mesh: n x 1, where n is the number of I/O servers. Disk distribution: (BLOCK, *).

Fig. 3: Panda response time (sec) for writing an array using fixed or optimal I/O server placement, as a function of the number of I/O servers. Memory mesh: 2 x 4. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1 x n, where n is the number of I/O servers. Disk distribution: (*, BLOCK).

4 Parallel I/O on PCs connected by Myrinet

As summarized in Tab. 1, each node in our SMP cluster consists of dual processors sharing memory, an I/O bus, and a file system. One 4 GB Ultra Wide SCSI disk (10,000 RPM) and a 160 MB/sec (full-duplex) Myrinet board are connected to each SMP; the details are shown in Fig. 4. When a parallel application is running on both processors, contention for shared resources like the I/O bus (the PCI bus in Fig. 4) or the disk can be a serious bottleneck. For example, if both processors perform I/O at the same time, we find that each processor obtains less than half of the file system throughput obtained using only one processor, because the disk and the I/O bus connecting the disk controller are shared, and the I/O requests coming separately from each processor cause extra disk seeks and rotational delays. So in this configuration, it is crucial to avoid using multiple processors in the same SMP node as Panda I/O servers if possible.

Fig. 5 compares Panda output performance when 8 dedicated I/O servers are used with different placements. Each group of 3 bars compares performance using different configurations; the white bars show performance when both processors in the same SMP are used as I/O servers (fixed I/O servers). With fixed I/O servers, each server provides a throughput of only about 3 MB/sec. If a 4-processor SMP were used with all 4 processors as I/O servers, the throughput would be even lower. However, if the I/O servers are carefully placed to avoid multiple servers in the same SMP (black bars in Fig. 5), Panda throughput per server increases by more than 100%. The gray bars in Fig. 5 show the positive impact of placing each client and server in a separate SMP; the resulting performance is close to the peak file system performance reported in Tab. 1. For the 16 MB array, Panda does not perform as well as for larger arrays because the amount that each I/O node writes is so small that throughput does not scale, due to Panda's constant startup/shutdown overhead.

We repeated the tests shown in Fig. 5 for read operations. In all cases, throughput at each I/O server is higher than for write operations, with the same performance trends as for writes. For instance, we obtained 12.4 MB/sec throughput at each I/O server for the 16 MB gray bar and 11.2-11.6 MB/sec for the rest of the gray bars.

Experiments not included in Fig. 5 show that if we place a Panda server or client on only one processor per SMP, the throughput that each Panda server delivers to the underlying file system averages 50 MB/sec for read and write operations. In other words, Panda can keep the underlying file system busy as long as the underlying file system has a peak throughput of at most that amount times the number of I/O servers sharing the file system. With careful placement, each I/O server delivers data to the file system at a rate of about 25 MB/sec (20 MB/sec for read operations), which is just half of the throughput obtained when a Panda server or client is placed on only one processor per SMP. If fixed placement is used, throughput drops to 20 MB/sec for both reads and writes, which means that 30-50 MB/sec of message passing bandwidth is wasted by contention between servers on the same SMP for the PCI bus.
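The placement rule argued for here, at most one Panda I/O server per SMP node, can be sketched as follows (illustrative code, not Panda's; the node_of mapping and the greedy scan over candidate processors are our own assumptions).

def place_servers(processors, node_of, num_servers):
    """Pick num_servers processors, placing at most one I/O server per SMP node.

    processors: iterable of candidate processor ids; node_of: processor id -> SMP node id.
    """
    used_nodes, servers = set(), []
    for p in processors:
        if len(servers) == num_servers:
            break
        if node_of[p] not in used_nodes:      # never put two servers on one SMP
            used_nodes.add(node_of[p])
            servers.append(p)
    if len(servers) < num_servers:
        raise ValueError("not enough distinct SMP nodes for the requested servers")
    return servers

# 16 compute processors on 8 dual-processor nodes, 8 I/O servers, as in Fig. 5:
node_of = {p: p // 2 for p in range(16)}
print(place_servers(range(16), node_of, 8))   # one processor from each of the 8 nodes

A production placement would combine this constraint with the remote-data-transfer criterion of Section 3 rather than simply taking the first eligible processors.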

Fig. 4: System architecture of each PC workstation, with 2 processors sharing the memory and I/O subsystem: a 528 MB/s system bus, 512 MB of system memory, a 133 MB/s PCI bus, a 40 MB/s Ultra Wide SCSI controller with a 10,000 RPM disk, and the Myrinet network interface. The bandwidths shown are theoretical peaks.

Fig. 5: Throughput per I/O server (MB/s) as a function of array size (MB), for fixed I/O servers, carefully placed I/O servers, and placements using only one processor per SMP. Array distribution in memory and on disk: (BLOCK, BLOCK, BLOCK). 16 compute processors, 8 dedicated I/O servers.

5 Heterogeneity in clusters

The clusters used in the experiments in this paper had homogeneous system software and hardware, but many clusters will be heterogeneous. Heterogeneity can have a big impact on I/O performance and needs to be taken into consideration when choosing the placement of I/O servers. Since on a large cluster users often will not know in advance which nodes will be assigned to their job, server placement will need to be done at run time, preferably by the I/O library.

Heterogeneity can hurt parallel I/O performance by causing load imbalance. For example, on the cluster used for the experiments in this paper, file system performance varied from node to node due to different amounts of free space on the local disks. If work is assigned to I/O servers without considering their different capabilities, performance will be limited by the slowest I/O server; a single server with a very full disk could significantly delay completion of an entire I/O operation. Sources of heterogeneity other than disk free space can also cause load imbalance and reduce I/O performance. Some other examples:

Data placement. To balance computational load, data may not be spread evenly across all compute processors. The processors with the most data may become a bottleneck for I/O (e.g., on an FDDI cluster). The algorithm for I/O server placement in Section 3 takes this type of heterogeneity into account.

Disk and file system performance. Each node may have a different disk capacity and speed, a different file system, or a differently partitioned file system. In this case both the placement of I/O servers and the distribution of data on disk must take the differing abilities into account. Further, the I/O strategy should be tailored to the file system of each server for best results (e.g., do not use file caching for write operations with NTFS).

Processor characteristics. Main memory size can significantly impact I/O performance, because larger memories often allow larger file caches, which can help performance. Processor speed can also impact I/O performance, because the cost of copying data to and from message and file system buffers is significant. As shown in Section 4, processors that must share resources such as I/O busses or a file system can have very different I/O performance characteristics from stand-alone processors.

Thus optimal I/O server placement and workload distribution in a heterogeneous environment is an extremely complex problem. With so many potential variables to consider, a general portable solution would probably need a heuristic search through the space of possibilities, rather than relying entirely on exact algorithms. In general, for top performance, I/O servers need to be placed in such a way that all servers' I/O capabilities are as similar as possible. For instance, suppose a cluster consists of older PCs with a single processor and a 5400 RPM disk, plus a few new SMPs with a 10,000 RPM disk. On such a system, it might be advantageous to place multiple I/O servers on the same SMP node, directly contradicting our advice for a homogeneous system! Further, given that the I/O servers have different capabilities, work should be divided among them according to their abilities; a simple proportional split is sketched below. We have taken some preliminary steps in this direction in [6], which examined several ways of dividing a workload among heterogeneous servers.
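For example, one simple ability-proportional division of work looks like the following sketch (an illustration only; the throughput numbers are invented, and this is not necessarily one of the schemes evaluated in [6]).

def divide_work(total_bytes, throughputs_mb_s):
    """Split total_bytes across servers in proportion to their measured throughput."""
    total_rate = sum(throughputs_mb_s)
    shares = [int(total_bytes * r / total_rate) for r in throughputs_mb_s]
    shares[-1] += total_bytes - sum(shares)          # absorb rounding error in the last share
    return shares

# Hypothetical mix of two slow and two fast I/O servers (MB/s figures are made up):
rates = [3.0, 3.0, 10.0, 10.0]
shares = divide_work(64 * 1024 * 1024, rates)        # bytes of a 64 MB array
times = [s / (r * 1024 * 1024) for s, r in zip(shares, rates)]
print(shares)
print([round(t, 2) for t in times])                  # roughly equal completion times

With this split, each server's estimated completion time is about the same, instead of the whole operation being limited by the slowest server as it would be under an even split.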

6 Related work

A number of researchers have examined the problem of parallel I/O on workstation clusters; we believe we are the first group to address problems related to resource sharing on SMPs and heterogeneity across nodes in collective I/O.

PIOUS [9] is pioneering work in parallel I/O on workstation clusters. PIOUS is a parallel file system with a Unix-style file interface; coordinated access to a file is guaranteed using transactions. Heterogeneity also raises performance issues for parallel file systems. If files are automatically striped across all servers, performance can suffer if some servers are slower than others. If the file system allows dynamic allocation of files to servers, our approaches to placing data to minimize contention for the network, I/O bus, and disk may be helpful.

VIP-FS [4] provides a collective I/O interface for scientific applications running in parallel and distributed environments. Its assumed-request strategy is designed for distributed systems where the network is a potentially congested shared medium: it reduces the number of I/O requests made by the compute nodes involved in a collective I/O operation, in order to reduce congestion. In such an environment, careful placement of I/O servers can also reduce the total data traffic.

VIPIOS [1] is a design for a parallel I/O system to be used in conjunction with Vienna Fortran. VIPIOS exploits logical data locality in the mapping between servers and application processes, and physical data locality between servers and disks, which is similar to our approach of exploiting local data on workstations connected by FDDI. Our approach adds an algorithm for server placement that guarantees minimal remote data access during I/O. In [3], we also quantify the savings obtained by careful placement of servers, and use an analytical model to explain other performance trends.

Our work is also related to I/O resource sharing in multiprocessor systems. [16] studies contention for a single I/O bus caused by accesses to different devices such as video, network, and disk, and the correlation among these devices; it characterizes how multiple device types interact when one or more Unix utilities are running on a multiprocessor workstation. Panda could probably benefit from this type of study when heuristic search through the space of all possible placements is used to help place I/O servers.

7 Conclusion

Compared to traditional supercomputers, commodity clusters are an economically attractive platform for running parallel scientific codes. While a few vendors dominate the marketplace for traditional supercomputers, it is relatively easy for any vendor to create a high-performance cluster product. The result is a dizzying array of possible cluster configurations, each with its own capabilities for computation, networking, and I/O, and each with different potential bottlenecks for I/O performance. Thus, customization of I/O strategies will be needed for high-performance I/O on many clusters. Making customization more difficult is the ease with which heterogeneous clusters can be constructed and operated; heterogeneous clusters will often require particularly sophisticated approaches to I/O optimization.

This paper discusses our experiments with the Panda parallel I/O library on two different cluster systems. Unlike traditional massively parallel processors, where the main bottleneck for parallel I/O is usually disk speed, we have found that on commodity clusters the bottleneck can be almost anywhere in the system. We presented a way to improve overall I/O performance on each platform by placing I/O servers carefully. On workstation clusters connected by FDDI, the bottleneck is message passing, and we place I/O servers to minimize the amount of data transferred over the network. On a cluster of SMPs connected by Myrinet, parallel I/O can be bottlenecked by the sharing of disks and I/O busses among processors in the same SMP node, so the I/O servers are placed to minimize contention for shared resources. We expect 2-processor and 4-processor SMPs to become more popular in the future. Unfortunately, resource sharing among processors in the same SMP node introduces a new potential cause of parallel I/O performance degradation.

Acknowledgements. This research was supported in part by NASA under NAGW 444 and NCC5 16, and by the U.S. Department of Energy through the University of California under subcontract B. Experiments were conducted using an HP workstation cluster at HP Labs in Palo Alto, and a High Performance Virtual Machine (HPVM) at the National Center for Supercomputing Applications and the Concurrent Systems Architecture Group of the Department of Computer Science at the University of Illinois.

References

1. P. Brezany, T. A. Mueck, and E. Schikuta. A Software Architecture for Massively Parallel Input-Output. In Proceedings of the Third International Workshop PARA'96, Lyngby, Denmark, August 1996. Springer Verlag.
2. A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.
3. Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, and K. E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, pages 1-13, San Jose, CA, November 1997.
4. M. Harry, J. Rosario, and A. Choudhary. VIP-FS: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing. In Proceedings of the Ninth International Parallel Processing Symposium, April 1995.
5. High Performance Fortran Forum. High Performance Fortran Language Specification, November 1994.
6. S. Kuo, M. Winslett, Y. Chen, Y. Cho, M. Subramaniam, and K. E. Seamons. Parallel Input/Output with Heterogeneous Disks. In Proceedings of the 9th International Working Conference on Scientific and Statistical Database Management, pages 79-90, Olympia, Washington, August 1997.
7. M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. Journal of Parallel and Distributed Computing, 40(1):4-18, January 1997.
8. Message Passing Interface Forum. MPI: Message-Passing Interface Standard, June 1995.
9. S. Moyer and V. S. Sunderam. Parallel I/O as a parallel application. International Journal of Supercomputer Applications, 9(2):95-107, Summer 1995.
10. J. Nieplocha and I. Foster. Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computation. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204, October 1996.
11. C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.
12. M. Parashar and J. Browne. Distributed Dynamic Data-Structures for Parallel Adaptive Mesh-Refinement. In Proceedings of the International Conference for High Performance Computing.
13. E. Riedel, C. van Ingen, and J. Gray. A Performance Study of Sequential I/O on Windows NT 4. In Proceedings of the Second USENIX Windows NT Symposium, Seattle, WA, August 1998.
14. K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, San Diego, CA, November 1995.
15. V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, 1990.
16. S. VanderLeest and R. Iyer. Measurement of I/O Bus Contention and Correlation among Heterogeneous Device Types in a Single-bus Multiprocessor System. Computer Architecture News.


More information

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p Performance of TCP/IP Using ATM ABR and UBR Services over Satellite Networks 1 Shiv Kalyanaraman, Raj Jain, Rohit Goyal, Sonia Fahmy Department of Computer and Information Science The Ohio State University

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O?

How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O? bs_bs_banner Short Technical Note Transactions in GIS, 2014, 18(6): 950 957 How to Apply the Geospatial Data Abstraction Library (GDAL) Properly to Parallel Geospatial Raster I/O? Cheng-Zhi Qin,* Li-Jun

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Operating Systems. Operating Systems Professor Sina Meraji U of T

Operating Systems. Operating Systems Professor Sina Meraji U of T Operating Systems Operating Systems Professor Sina Meraji U of T How are file systems implemented? File system implementation Files and directories live on secondary storage Anything outside of primary

More information

Recommendations for Aligning VMFS Partitions

Recommendations for Aligning VMFS Partitions VMWARE PERFORMANCE STUDY VMware ESX Server 3.0 Recommendations for Aligning VMFS Partitions Partition alignment is a known issue in physical file systems, and its remedy is well-documented. The goal of

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations Data Transport Needs

Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations Data Transport Needs Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations Data Transport Needs Jonghyun Lee Xiaosong Ma Marianne Winslett Shengke Yu University of Illinois at Urbana-Champaign

More information

Comparison of Storage Protocol Performance ESX Server 3.5

Comparison of Storage Protocol Performance ESX Server 3.5 Performance Study Comparison of Storage Protocol Performance ESX Server 3.5 This study provides performance comparisons of various storage connection options available to VMware ESX Server. We used the

More information

Data Sieving and Collective I/O in ROMIO

Data Sieving and Collective I/O in ROMIO Appeared in Proc. of the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999, pp. 182 189. c 1999 IEEE. Data Sieving and Collective I/O in ROMIO Rajeev Thakur William Gropp

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

An Oracle White Paper April 2010

An Oracle White Paper April 2010 An Oracle White Paper April 2010 In October 2009, NEC Corporation ( NEC ) established development guidelines and a roadmap for IT platform products to realize a next-generation IT infrastructures suited

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Clustering and Reclustering HEP Data in Object Databases

Clustering and Reclustering HEP Data in Object Databases Clustering and Reclustering HEP Data in Object Databases Koen Holtman CERN EP division CH - Geneva 3, Switzerland We formulate principles for the clustering of data, applicable to both sequential HEP applications

More information

Enhancing Data Migration Performance via Parallel Data Compression

Enhancing Data Migration Performance via Parallel Data Compression Enhancing Data Migration Performance via Parallel Data Compression Jonghyun Lee, Marianne Winslett, Xiaosong Ma, Shengke Yu Department of Computer Science, University of Illinois, Urbana, IL 6181 USA fjlee17,

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Concepts Introduced. I/O Cannot Be Ignored. Typical Collection of I/O Devices. I/O Issues

Concepts Introduced. I/O Cannot Be Ignored. Typical Collection of I/O Devices. I/O Issues Concepts Introduced I/O Cannot Be Ignored Assume a program requires 100 seconds, 90 seconds for accessing main memory and 10 seconds for I/O. I/O introduction magnetic disks ash memory communication with

More information

Ministry of Education and Science of Ukraine Odessa I.I.Mechnikov National University

Ministry of Education and Science of Ukraine Odessa I.I.Mechnikov National University Ministry of Education and Science of Ukraine Odessa I.I.Mechnikov National University 1 Modern microprocessors have one or more levels inside the crystal cache. This arrangement allows to reach high system

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS TECHNOLOGY BRIEF March 1999 Compaq Computer Corporation ISSD Technology Communications CONTENTS Executive Overview1 Notice2 Introduction 3 8-Way Architecture Overview 3 Processor and I/O Bus Design 4 Processor

More information

BİL 542 Parallel Computing

BİL 542 Parallel Computing BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

Assessing performance in HP LeftHand SANs

Assessing performance in HP LeftHand SANs Assessing performance in HP LeftHand SANs HP LeftHand Starter, Virtualization, and Multi-Site SANs deliver reliable, scalable, and predictable performance White paper Introduction... 2 The advantages of

More information

Parallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein

Parallel & Cluster Computing. cs 6260 professor: elise de doncker by: lina hussein Parallel & Cluster Computing cs 6260 professor: elise de doncker by: lina hussein 1 Topics Covered : Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster

More information

A Performance Study of Sequential IO on WindowsNT 4.0

A Performance Study of Sequential IO on WindowsNT 4.0 A Performance Study of Sequential IO on WindowsNT 4. Erik Riedel (CMU) Catharine Van Ingen Jim Gray September 1997 Technical Report MSR-TR-97-34 Microsoft Research Microsoft Corporation One Microsoft Way

More information