Parallel I/O on Networks of Workstations: Performance Improvement by Careful Placement of I/O Servers

Yong Cho¹, Marianne Winslett¹, Szu-wen Kuo¹, Ying Chen², Jonghyun Lee¹, Krishna Motukuri¹

¹Department of Computer Science, University of Illinois, 1304 W. Springfield, Urbana, IL 61801, U.S.A.
²IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, U.S.A.
{ycho,winslett}@cs.uiuc.edu

Abstract

Thanks to powerful processors, fast interconnects, and portable message passing libraries like PVM and MPI, networks of inexpensive workstations are getting more popular as an economical way to run high-performance parallel scientific applications. On traditional massively parallel processors, the performance of parallel I/O is most often limited by disk bandwidth, though the performance of other system components, especially the interconnect, can at times be a limiting factor. In this paper, we show that the performance of parallel I/O on commodity clusters is often significantly affected not only by disk speed but also by interconnect throughput, I/O bus capacity, and load imbalance caused by heterogeneity of nodes. Specifically, we present our experimental results from reading and writing large multidimensional arrays with the Panda I/O library on two significantly different clusters: HP workstations connected by FDDI, and HP PCs connected by Myrinet. We also discuss the approaches we use in Panda to maximize the I/O throughput available to the application on these two platforms, particularly careful placement of I/O servers.

1 Introduction

Due to the wide availability of powerful processors, fast interconnects, and portable message passing libraries like PVM [15] and MPI [8], networks of commodity workstations are gaining popularity as an economical way to run very large parallel applications.
In this environment, scientific applications typically distribute their data across multiple processes which are kept closely synchronized for computation and communication. Often these applications output large intermediate results and read them back in later for a subsequent step of computation. Large final results may also be output for subsequent visualization, especially with time-dependent simulation applications. Long-running applications also typically save a periodic snapshot of their major arrays in a file (a checkpoint), so that they can restart from the checkpoint in the event of failure. Due to the relatively low performance of the I/O subsystem and the lack of efficient software, I/O can be a major bottleneck in all these applications. The Panda parallel I/O library¹ is designed for SPMD-style parallel applications, and provides I/O portability, an easy-to-use high-level interface and high-performance collective I/O of multidimensional arrays. On traditional massively parallel processors like the IBM SP, Panda's performance is mainly limited by disk speed. In this paper, we show that Panda's performance on a commodity cluster can be affected significantly by almost every system component: disk speed, interconnect throughput, main memory size, I/O bus speed and the presence of heterogeneous nodes. We focus on our experimental results from two very different clusters: HP workstations connected by FDDI, and HP PCs connected by Myrinet. With workstations connected by FDDI that have reasonably fast disks, the I/O bottleneck is likely to lie in message passing, because FDDI is a 100 Mb/s shared-media network. Thus, it is crucial to minimize simultaneous access to the network by multiple processes, as well as the amount of data transferred over the network.
However, when a more contemporary network like Myrinet is used together with high-performance PCs (each a 2- or 4-processor symmetric multiprocessor (SMP) sharing memory and I/O busses), the interconnect is no longer a bottleneck, but contention among processors for the shared resources can be a limiting factor in parallel

¹More information can be found online at
I/O performance. For both types of clusters, careful placement of I/O servers can have a strong positive impact on performance. In the next section, we introduce the Panda parallel I/O library and describe the two clusters used in our experiments. In section 3, we discuss problems related to parallel I/O on workstations connected by FDDI. In section 4, we show that resource contention among processors in the same SMP can be a limiting factor for I/O performance if a more contemporary, switch-based network is used as the interconnect. Sections 5 and 6 discuss heterogeneity related to parallel I/O performance and related work, respectively. Finally, we conclude the paper in section 7.

2 Background

2.1 The Panda parallel I/O library

Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed-memory systems, with arrays distributed across multiple processes that are closely synchronized at I/O time. Panda supports HPF-style [5] BLOCK, CYCLIC, and Adaptive Mesh Refinement-style data distributions [12] across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance collective I/O, in which all clients and servers cooperate to perform I/O, is called server-directed I/O [14].

Fig. 1: Different array data distributions in memory (clients, on compute nodes) and on disk (servers, on I/O nodes) provided by Panda.

Fig. 1 shows a 2-D array distributed (BLOCK, BLOCK) across 4 compute processors arranged in a 2x2 logical mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute processor. The I/O processors are also arranged in a logical mesh, and the data can be distributed across them, and implicitly across their disks, using a variety of distribution directives. The array distribution on disk can be radically different from that in memory. For instance, the array in Fig. 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated to form a single row-major or column-major array, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case. With server-directed I/O, each I/O chunk resulting from the distribution chosen for disk is buffered and sent to (or read from) disk by one Panda server, and that server is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Fig. 1, during a Panda write operation server 0 gathers compute chunks from clients 0 and 1, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, server 1 gathers, reorganizes and writes its own I/O chunk. For a read operation, the reverse process is used. During I/O, Panda divides large I/O chunks into a series of smaller pieces, called subchunks, in order to obtain better file system performance and keep server buffer space requirements low. For a write operation, a Panda server repeatedly gathers and writes subchunks, one by one. When reading or writing an entire array, each Panda server reads or writes its file sequentially. Usually, Panda clients and servers reside on physically different processors. However, on a network of workstations where a limited number of processors are available, it would be wasteful to dedicate processors to I/O and leave them idle during computation. Panda therefore supports an alternative I/O strategy, part-time I/O, in which there are no dedicated I/O processors. Instead, some of the Panda clients run servers at I/O time, and return to computation after finishing the I/O operation.
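To make the gather-and-subchunk write path concrete, the following Python sketch models one server's role for the Fig. 1 setup. It is an illustration only: the function names are invented, plain lists stand in for message passing and the file system, and the tiny 4x4 array replaces a real multi-megabyte dataset.

```python
# Illustrative sketch (not Panda's real API): a 4x4 array is
# (BLOCK, BLOCK)-distributed over 4 clients in memory and
# (BLOCK, *)-distributed over 2 servers on disk.

def compute_chunk(array, client, mesh=(2, 2)):
    """Sub-block of `array` held by `client` under (BLOCK, BLOCK)."""
    rows, cols = len(array), len(array[0])
    r, c = divmod(client, mesh[1])
    rh, cw = rows // mesh[0], cols // mesh[1]
    return [row[c * cw:(c + 1) * cw] for row in array[r * rh:(r + 1) * rh]]

def server_write(array, server, num_servers, subchunk_rows=1):
    """Gather this server's (BLOCK, *) I/O chunk (rows of the array that
    clients hold in memory) and 'write' it in small subchunks, returning
    the sequence of sequential writes issued to the file system."""
    share = len(array) // num_servers
    io_chunk = array[server * share:(server + 1) * share]
    return [io_chunk[i:i + subchunk_rows]
            for i in range(0, len(io_chunk), subchunk_rows)]

a = [[r * 4 + c for c in range(4)] for r in range(4)]
# Client 0 holds the top-left 2x2 compute chunk.
assert compute_chunk(a, 0) == [[0, 1], [4, 5]]
# Server 0 owns rows 0-1 (gathered from clients 0 and 1) and writes
# them sequentially, one subchunk at a time.
assert server_write(a, server=0, num_servers=2) == [[a[0]], [a[1]]]
```

Writing in bounded-size subchunks is what keeps each server's buffer space constant regardless of array size, while still producing a purely sequential access pattern per file.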
2.2 Cluster systems

The first platform on which our experiments were conducted is an 8-node HP 9000/735 workstation cluster running HP-UX 9.07 on each node, connected by FDDI (see Tab. 1). Each node has two local disks, and each Panda I/O node uses a 4 GB local disk. At the time of our experiments, the disks were 45-90% full, depending on the node. We measured the average file system throughput using 128 sequential 1 MB application requests to the file system: 5.96 MB/sec for the least occupied disk and 5.63 MB/sec for the fullest disk. For the message passing layer, we used MPICH and obtained an average message passing bandwidth per node of 3.2, 2.9 or 2.3 MB/sec when there are 1, 2 or 4 pairs of senders and receivers, respectively, for Panda's common message sizes (32-512 KB). Message passing slower than the underlying file system is thus a clear bottleneck for parallel I/O on this cluster.

Our second platform is a High Performance Virtual Machine (HPVM) [2], a collection of HP Kayak XU dual-processor PCs with a more advanced switch-based Myrinet interconnect. Each node consists of symmetric dual 300 MHz Pentium IIs and a 4 GB 10,000 RPM disk, and runs Windows NT Server 4.0. We measured the average file system (NTFS) throughput using the Win32 API as 10-12 MB/sec, depending on the request size, file caching option and total amount of data read or written. With file caching turned on, the peak file system throughput was obtained from requests of size 8-128 KB, whereas without file caching, bigger request sizes (32-1024 KB) performed better on average and also gave much more consistent performance for write operations. These results are consistent with the analysis of NTFS performance presented in [13]. For reads, throughput is best with file caching turned on, because of the performance advantage offered by prefetching into the file cache; request sizes of 8-64 KB lead to peak read performance.
So in our experiments with Panda on the PC cluster, we use 64 KB application read and write requests and turn file caching on only for read operations. Tab. 1 summarizes the file system and message passing performance in this configuration. The file system throughputs were measured using 2048 sequential 64 KB application requests to the file system (a total of 128 MB read or written). The message passing throughput per node available from MPI-FM [7] is 40-70 MB/sec, again depending on the message size and the total amount of data transferred. Sharing an SMP between clients and servers hurts performance when multiple processors in a node try to send or receive messages. Although the theoretical peak bandwidth of PCI is 133 MB/s, the achieved bandwidth is much lower [13], and since all processes on an SMP share the PCI bus, message passing can be bottlenecked by the PCI bus connecting to a fast network like Myrinet. In the system that we used for our experiments, this contention reduces the message passing throughput between two processors in the same node to approximately 40 MB/sec, which is a little over half of the peak throughput obtainable from a pair of processors in different nodes.

System name                     | Nodes | Processors              | Interconnect | Memory per node | MPI throughput, latency   | File system write, read throughput
HP 9000/735 workstation cluster | 8     | 99 MHz PA-RISC          | FDDI         | 144 MB          | MPICH: 3.2 MB/s, 570 us   | HP-UX: 4.1-5.9 MB/s, 4.3-5.7 MB/s
SMP cluster (HPVM)              | 64    | dual 300 MHz Pentium II | Myrinet      | 512 MB          | MPI-FM: 70 MB/s, 17 us    | NTFS: 11.7 MB/s, 12.5 MB/s

Tab. 1: Comparison of clusters. MPI latency is measured by sending 100 2-byte messages between two processes, and 320 KB messages are used on both clusters to measure MPI throughput. For file system throughput, we used a 1 MB request size on the workstation cluster and a 64 KB request size on the SMP cluster, for both read and write operations.
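The file-system measurements above follow a simple pattern: issue a run of fixed-size sequential requests and divide the data volume by the elapsed time. A minimal Python sketch of such a microbenchmark is below; the path and sizes are placeholders, not the paper's exact configuration, and a faithful reproduction on NT would use the Win32 API with explicit file-caching flags.

```python
# Minimal sketch of a sequential-write throughput measurement:
# n fixed-size requests, then MB written divided by elapsed seconds.
import os
import tempfile
import time

def sequential_write_mbps(path, request_kb=64, total_mb=16):
    block = b"\0" * (request_kb * 1024)
    n = (total_mb * 1024) // request_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time for data to reach the device
    return total_mb / (time.perf_counter() - start)

with tempfile.TemporaryDirectory() as d:
    mbps = sequential_write_mbps(os.path.join(d, "bench.dat"), total_mb=1)
    assert mbps > 0
```

Note the `fsync`: without it, a buffered file system can report apparent throughput far above the disk's, which is exactly the caching effect the text describes for NTFS reads.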
3 Parallel I/O on workstations connected by FDDI

As implemented in Panda 2.1, part-time I/O chooses the first m of the compute processors to run Panda servers, regardless of the array distribution or compute processor mesh. A similar strategy has been taken in other libraries that provide part-time I/O [10]. This provides acceptable performance on a high-speed, switch-based interconnect whose processors are homogeneous with respect to I/O ability, as on many SPs, but on a shared-media interconnect like FDDI or Ethernet, we found that performance is generally unsatisfactory and tends to vary widely with the exact choice of compute and I/O processor meshes.² To see the source of the problem, consider the example in Fig. 1. With the naive selection of compute processors 0 and 1 as I/O servers, compute processor 1 needs to send its local chunk to compute processor 0 and gather a subchunk from compute processors 2 and 3. This incurs extra message passing which is unnecessary if compute processor 2 acts as an I/O server instead of compute processor 1. In an environment where the interconnect is clearly the bottleneck for I/O, as is the case for the HP workstation cluster with its FDDI interconnect, the single most important optimization we can make is to minimize the amount of remote data transfer. Our previous work [3] describes how to place I/O servers in a manner that minimizes the number of array elements that must be shipped across the network during I/O. More precisely, suppose we are given a target number m of part-time I/O servers, the current distribution of data across processor memories, and a desired distribution of data across I/O servers. We show how to choose the I/O servers from among the n >= m compute processors so that remote data transfer is minimized. We begin by forming an m x n array called the I/O matrix M, where each row represents one of the I/O servers and each column represents one of the n compute processors. The (i, j) entry of the I/O matrix, M(i, j), is the total number of array elements that the ith I/O server will have to gather from the jth compute processor, which can be computed from the array size, the in-memory distribution and the target disk distribution.

²In our experiments on the HP workstation cluster, Panda servers also read or write one 1 MB subchunk at a time.
In Panda, every processor involved in an array I/O operation has access to array size and distribution information, so M can be generated at run time. Given the I/O matrix, the goal of choosing the m I/O servers that minimize remote data transfer can be formalized as the problem of choosing m matrix entries M(i_1, j_1), ..., M(i_m, j_m) such that no two entries lie in the same row or column (i_k != i_l and j_k != j_l for 1 <= k < l <= m) and the sum M(i_1, j_1) + ... + M(i_m, j_m) is maximal. To solve this problem, we can view M as the representation of a bipartite graph, where every row (I/O server) and every column (compute processor) represents a vertex and each entry M(i, j) is the weight of the edge connecting vertices i and j. [3] shows that the problem of assigning I/O servers is equivalent to finding the matching³ of M with the largest possible sum of weights. The optimal solution can be obtained using the Hungarian Method [11] in O(m³) time, where m is the number of part-time I/O servers.

We compared the performance of Panda using the first m processors as I/O servers ("fixed" I/O servers) and optimally placed (in terms of minimal data transfer) I/O servers. We used all 8 nodes as compute processors in our experiments, as that configuration is probably most representative of scientists' needs. The in-memory distribution was (BLOCK, BLOCK) and we tested performance using a 2x4 compute processor mesh. We used 2, 4, or 8 part-time I/O servers, while increasing the array size from 4 MB (1024x1024) to 16 MB (2048x2048) and 64 MB (4096x4096). For the disk distribution, we used either (BLOCK, *) or (*, BLOCK) to show the effect of a radically different distribution. We present results for writes; reads are similar. Since the cluster is not fully isolated from other networks, we ran our experiments when no other user job was executing on the cluster. All the experimental results shown are the average of 3 or more trials, and error bars show a 95% confidence interval for the average.

Fig. 2 and Fig. 3 compare the time to write an array using different placements of I/O servers. A group of 6 bars is shown for each number of I/O servers; each pair of bars within the group shows the response time to write an array of the given size using fixed and optimal placement of I/O servers, respectively. Optimal placement of I/O servers reduces array output time by at least 19% across all combinations of array sizes and meshes, except in the cases where the fixed and optimal I/O server placements are identical. We found that even with optimal I/O server placement, performance depends strongly not only on the amount of local data transfer, but also on the compute processor mesh chosen, the array distribution on disk and the number of I/O servers. For instance, in Fig. 3, moving from 2 to 4 I/O servers gives a superlinear speedup with optimal placement, but that does not happen if the (BLOCK, *) disk distribution is used (Fig. 2) instead. To obtain good I/O performance, the user needs help from Panda in determining the effect on I/O performance of seemingly irrelevant decisions, such as the choice between a 2x4 or 4x2 compute processor mesh. In [3], we presented a performance model for Panda running on an FDDI cluster, to be used in predicting message passing performance in Panda, and showed its accuracy. The performance model can guide a user to select array distributions and compute processor meshes that give the best performance on this cluster.

³A matching in a graph is a maximal subset of the edges of the graph such that no two edges share the same endpoint.
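The placement problem above can be sketched in a few lines of Python. For clarity this uses exhaustive search over assignments, which is only feasible for small meshes; the Hungarian Method [11] that the text describes solves the same maximum-weight matching in polynomial time and is what a real implementation would use.

```python
# Choose one compute processor to host each part-time I/O server so that
# the selected I/O-matrix entries share no row or column and their sum
# (data that stays local) is maximal. Exhaustive search stands in for the
# Hungarian Method here, purely for illustration.
import itertools

def place_servers(M):
    """Return (placement, local_elements), where placement[i] is the
    compute processor chosen to host I/O server i."""
    m, n = len(M), len(M[0])
    best = max(itertools.permutations(range(n), m),
               key=lambda cols: sum(M[i][cols[i]] for i in range(m)))
    return list(best), sum(M[i][best[i]] for i in range(m))

# I/O matrix for the Fig. 1 example: server 0's disk rows are held by
# compute processors 0 and 1, server 1's by processors 2 and 3.
M = [[4, 4, 0, 0],
     [0, 0, 4, 4]]
placement, local = place_servers(M)
# Server 0 on processor 0 and server 1 on processor 2 keeps 8 of the 16
# elements local, matching the Fig. 1 discussion (processor 2, not 1,
# should act as the second server).
assert placement == [0, 2] and local == 8
```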
Fig. 2: Panda response time (sec) versus number of I/O servers for writing an array using fixed or optimal I/O server placement. Write operation, 2x4 compute processor mesh, (BLOCK, BLOCK) in memory. Disk mesh: n x 1, where n is the number of I/O servers; disk distribution: (BLOCK, *).

Fig. 3: Panda response time (sec) versus number of I/O servers for writing an array using fixed or optimal I/O server placement. Write operation, 2x4 compute processor mesh, (BLOCK, BLOCK) in memory. Disk mesh: 1 x n, where n is the number of I/O servers; disk distribution: (*, BLOCK).

4 Parallel I/O on PCs connected by Myrinet

As summarized in Tab. 1, each node in our SMP cluster consists of dual processors sharing memory, an I/O bus and a file system. Each SMP has one 4 GB Ultra Wide SCSI disk (10,000 RPM) and a 160 MB/sec (full-duplex) Myrinet board; the details are shown in Fig. 4. When a parallel application is running on both processors, contention for shared resources like the I/O bus (the PCI bus in Fig. 4) or the disk can be a serious bottleneck. For example, if both processors perform I/O at the same time, we find that each processor obtains less than half of the file system throughput obtained using only one processor, because the disk and the I/O bus connecting the disk controller are shared, and the I/O requests coming separately from each processor cause extra disk seeks and rotational delays. So in this configuration, it is crucial to avoid using multiple processors in the same SMP node as Panda I/O servers if possible. Fig. 5 compares the Panda output performance when 8 dedicated I/O servers are used with different placements.
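One way to realize this placement rule is to select at most one I/O server per physical node. The sketch below assumes the runtime can report each process's host (e.g., via MPI's processor name); the rank numbers and host names are invented placeholders.

```python
# Hedged sketch: choose I/O servers so that no two land on the same SMP
# node, avoiding contention for the shared PCI bus and disk. The
# rank-to-host map would come from the message passing runtime.
def one_server_per_smp(rank_to_host, num_servers):
    chosen, used_hosts = [], set()
    for rank in sorted(rank_to_host):
        host = rank_to_host[rank]
        if host not in used_hosts:       # skip ranks on already-used SMPs
            chosen.append(rank)
            used_hosts.add(host)
            if len(chosen) == num_servers:
                break
    return chosen

# Three dual-processor SMPs, two ranks each: pick one rank per node.
ranks = {0: "smp0", 1: "smp0", 2: "smp1", 3: "smp1", 4: "smp2", 5: "smp2"}
assert one_server_per_smp(ranks, 3) == [0, 2, 4]
```

A fuller version would combine this per-node constraint with the I/O-matrix optimization of Section 3, restricting the matching to at most one server per host.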
Each group of 3 bars compares performance using different configurations; the white bars show performance when both processors in the same SMP are used as I/O servers (fixed I/O servers). With fixed I/O servers, each server provides a throughput of only about 3 MB/sec. If a 4-processor SMP were used with all 4 processors as I/O servers, the throughput would be even lower. However, if the I/O servers are carefully placed to avoid multiple servers in the same SMP (black bars in Fig. 5), Panda throughput per server increases by more than 100%. The gray bars in Fig. 5 show the positive impact of placing each client and server in a separate SMP; the resulting performance is close to the peak file system performance reported in Tab. 1. For the 16 MB array, Panda does not perform as well as for larger arrays because the amount that each I/O node writes is so small that throughput is dominated by Panda's constant startup/shutdown overhead.

We repeated the tests shown in Fig. 5 for read operations. In all cases, throughput at each I/O server is higher than for write operations, with the same performance trends as for writes. For instance, we obtained 10.4 MB/sec throughput at each I/O server for the 16 MB gray bar and 11.2-11.6 MB/sec for the rest of the gray bars. Experiments not included in Fig. 5 show that if we place a Panda server or client on only one processor per SMP, the throughput that each Panda server delivers to the underlying file system averages 50 MB/sec for read and write operations. In other words, Panda can keep the underlying file system busy as long as the file system has a peak throughput of at most that amount times the number of I/O servers sharing it. With careful placement, each I/O server delivers data to the file system at about 25 MB/sec (20 MB/sec for read operations), which is just half of the throughput obtained when a Panda server or client is placed on only one processor per SMP. However, if fixed placement is used, throughput drops to 20 MB/sec for both reads and writes, which means that 30-50 MB/sec of message passing bandwidth is wasted by contention between servers on the same SMP for the PCI bus.

Fig. 4: System architecture of each PC workstation, with two processors sharing memory and the I/O subsystem: a 528 MB/s system bus, 512 MB of system memory, and a 133 MB/s PCI bus connecting the Myrinet network interface and a 40 MB/s Ultra Wide SCSI controller with a 10,000 RPM disk. The bandwidths shown are theoretical peaks.

Fig. 5: Throughput per I/O server (MB/s) versus array size (MB) for fixed I/O servers, carefully placed I/O servers, and one processor per SMP. Memory mesh: 2x2x4; disk mesh: 2x2x2. Array distribution in memory and on disk: (BLOCK, BLOCK, BLOCK). 16 compute processors and 8 dedicated I/O servers.

5 Heterogeneity in clusters

The clusters used in the experiments in this paper had homogeneous system software and hardware, but many clusters will be heterogeneous. Heterogeneity can have a big impact on I/O performance and needs to be taken into consideration when choosing the placement of I/O servers. Since on a large cluster users often will not know in advance which nodes will be assigned to their job, server placement will need to be done at runtime, preferably by the I/O library. Heterogeneity can hurt parallel I/O performance by causing load imbalance. For example, on the cluster used for the experiments in this paper, file system performance varied from node to node, due to different amounts of free space on local disks. If work is assigned to I/O servers without considering their different capabilities, performance will be limited to that of the slowest I/O server. In other words, a single server with a very full disk could significantly delay completion of an entire I/O operation.
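A simple heterogeneity-aware division, in the spirit of the workload-division strategies examined in [6], assigns each server a share of the work proportional to its measured throughput, so a slow disk no longer gates the whole collective operation. The throughput numbers below are illustrative, not measurements from the paper.

```python
# Sketch: split an array's rows across I/O servers in proportion to each
# server's measured file-system bandwidth (MB/s), so all servers finish
# their writes at roughly the same time.
def proportional_split(total_rows, throughputs):
    total_bw = sum(throughputs)
    shares = [int(total_rows * bw / total_bw) for bw in throughputs]
    shares[-1] += total_rows - sum(shares)  # hand rounding remainder to last
    return shares

# A node with a slow (e.g., nearly full or 5400 RPM) disk gets the
# smallest share; two faster disks split the rest evenly.
assert proportional_split(1000, [3.0, 6.0, 6.0]) == [200, 400, 400]
```

Equal division would instead take time proportional to the slowest server (here, 333 rows at 3 MB/s), which is exactly the load-imbalance effect described above.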
Sources of heterogeneity other than disk free space can also cause load imbalance and reduce I/O performance. Some other examples:

Data placement. To improve computational load balance, data may not be spread evenly across all compute processors. The processors with the most data may become a bottleneck for I/O (e.g., on an FDDI cluster). The algorithm for I/O server placement in Section 3 took this type of heterogeneity into account.

Disk and file system performance. Each node may have a different disk capacity and speed, a different file system, or a differently partitioned file system. In this case, both the placement of I/O servers and the distribution of data on disk must take the differing abilities into account. Further, the I/O strategy should be tailored to the file system of each server for best results (e.g., do not use file caching for write operations with NTFS).

Processor characteristics. Main memory size can significantly impact I/O performance, because larger memories often allow larger file caches, which can help performance. Processor speed can also impact I/O performance, because the cost of copying data to and from message and file system buffers is significant. As shown in Section 4, processors that must share resources such as I/O busses or a file system can have very different I/O performance characteristics from stand-alone processors.

Thus optimal I/O server placement and workload distribution in a heterogeneous environment is an extremely complex problem. With so many potential variables to consider, a general portable solution to this problem would probably need to use heuristic search through the space of possibilities, rather than relying entirely on exact algorithms. In general, for top performance, I/O servers need to be placed in such a way that all servers' I/O capabilities are as similar as possible. For instance, suppose a cluster consists of older PCs, each with a single processor and a 5400 RPM disk, and a few new SMPs with 10,000 RPM disks. On such a system, it might be advantageous to place multiple I/O servers on the same SMP node, directly contradicting our advice for a homogeneous system! Further, given that the I/O servers have different capabilities, work should be divided among them according to their abilities. We have taken some preliminary steps in this direction in [6], which examined several ways of dividing a workload among heterogeneous servers.

6 Related work

A number of researchers have examined the problem of parallel I/O on workstation clusters; we believe we are the first group to address problems related to resource sharing on SMPs and heterogeneity across nodes in collective I/O. PIOUS [9] is pioneering work in parallel I/O on a workstation cluster: a parallel file system with a Unix-style file interface, in which coordinated access to a file is guaranteed using transactions. Heterogeneity also raises performance issues for parallel file systems. If files are automatically striped across all servers, performance can suffer if some servers are slower than others. If the file system allows dynamic allocation of files to servers, our approaches to placing data to minimize contention for network, I/O bus, and disk may be helpful. VIP-FS [4] provides a collective I/O interface for scientific applications running in parallel and distributed environments. Its assumed-request strategy is designed for distributed systems where the network is a potentially congested shared medium: it reduces the number of I/O requests made by the compute nodes involved in a collective I/O operation, and thereby reduces congestion.
In such an environment, careful placement of I/O servers can also reduce the total data traffic. VIPIOS [1] is a design for a parallel I/O system to be used in conjunction with Vienna Fortran. VIPIOS exploits logical data locality in the mapping between servers and application processes, and physical data locality between servers and disks, which is similar to our approach of exploiting local data on workstations connected by FDDI. Our approach adds an algorithm for server placement that guarantees minimal remote data access during I/O. In [3], we also quantify the savings obtained by careful placement of servers, and use an analytical model to explain other performance trends. Our work is also related to I/O resource sharing in multiprocessor systems. [16] studies contention for a single I/O bus from accesses to different devices like video, network and disk, and studies the correlation among these devices. It characterizes how multiple device types interact when one or more Unix utilities are running on a multiprocessor workstation. Panda could benefit from this type of study when heuristic search through the space of all possible placements is used to help place I/O servers.

7 Conclusion

Compared to traditional supercomputers, commodity clusters are an economically attractive platform for running parallel scientific codes. While a few vendors dominate the marketplace for traditional supercomputers, it is relatively easy for any vendor to create a high-performance cluster product. The result is a dizzying array of possible cluster configurations, each with its own capabilities for computation, networking, and I/O, and each with different potential bottlenecks for I/O performance. Thus customization of I/O strategies will be needed for high-performance I/O on many clusters.
Making the customization strategy more difficult is the ease with which heterogeneous clusters can be constructed and operated; heterogeneous clusters will often require particularly sophisticated approaches to I/O optimization. This paper discussed our experiments with the Panda parallel I/O library on two different cluster systems. Unlike traditional massively parallel processors, where the main bottleneck for parallel I/O is usually disk speed, we have found that on commodity clusters the bottleneck can be almost anywhere in the system. We presented a way to improve overall I/O performance on each platform by placing I/O servers carefully. On workstation clusters connected by FDDI, the bottleneck is in message passing, so we place I/O servers to minimize the amount of data transferred over the network. On a cluster of SMPs connected by Myrinet, parallel I/O can be bottlenecked by the sharing of disks and I/O busses by multiple processors in the same SMP node, so the I/O servers are placed to minimize contention for shared resources. We expect 2-processor and 4-processor SMPs to become more popular in the future. Unfortunately, resource sharing among processors in the same SMP node introduces a new potential cause of parallel I/O performance degradation.

Acknowledgements. This research was supported in part by NASA under NAGW 444 and NCC5 16, and by the U.S. Department of Energy through the University of California under subcontract B. Experiments were conducted using an HP workstation cluster at HP Labs in Palo Alto and a High Performance Virtual Machine (HPVM) at the National Center for Supercomputing Applications and the Concurrent Systems Architecture Group of the Department of Computer Science at the University of Illinois.

References

1. P. Brezany, T. A. Mueck, and E. Schikuta. A Software Architecture for Massively Parallel Input-Output. In Proceedings of the Third International Workshop PARA '96, Lyngby, Denmark, August 1996. Springer-Verlag.
2. A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.
3. Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, and K. E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, pages 1-13, San Jose, CA, November 1997.
4. M. Harry, J. Rosario, and A. Choudhary. VIP-FS: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing. In Proceedings of the Ninth International Parallel Processing Symposium, April 1995.
5. High Performance Fortran Forum. High Performance Fortran Language Specification, November 1994.
6. S. Kuo, M. Winslett, Y. Chen, Y. Cho, M. Subramaniam, and K. E. Seamons. Parallel Input/Output with Heterogeneous Disks. In Proceedings of the Ninth International Working Conference on Scientific and Statistical Database Management, pages 79-90, Olympia, WA, August 1997.
7. M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. Journal of Parallel and Distributed Computing, 40(1):4-18, January 1997.
8. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, June 1995.
9. S. Moyer and V. S. Sunderam. Parallel I/O as a Parallel Application. International Journal of Supercomputer Applications, 9(2):95-107, Summer 1995.
10. J. Nieplocha and I. Foster. Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computation. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204, October 1996.
11. C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.
12. M. Parashar and J. Browne. Distributed Dynamic Data-Structures for Parallel Adaptive Mesh-Refinement. In Proceedings of the International Conference on High Performance Computing, 1997.
13. E. Riedel, C. van Ingen, and J. Gray. A Performance Study of Sequential I/O on Windows NT 4. In Proceedings of the Second USENIX Windows NT Symposium, pages 1-10, Seattle, WA, August 1998.
14. K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, San Diego, CA, November 1995.
15. V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, 1990.
16. S. VanderLeest and R. Iyer. Measurement of I/O Bus Contention and Correlation among Heterogeneous Device Types in a Single-Bus Multiprocessor System. Computer Architecture News, 22(4), 1994.