Collective I/O on a SGI Cray Origin 2000: Strategy and Performance

Y. Cho, M. Winslett, J. Lee, Y. Chen, S. Kuo, K. Motukuri
Department of Computer Science, University of Illinois, Urbana, IL, U.S.A.
Current affiliation: IBM Almaden Research Center.

Abstract

Panda is a library for collective I/O of multidimensional arrays, designed for SPMD-style applications running on a distributed memory system. In this paper, we describe our experience of porting Panda to the SGI Cray Origin 2000, which utilizes shared memory and a shared file system. On the Origin 2000 we used, RAIDs and a file system inappropriately configured for scientific applications are the limiting factor for I/O performance, to such a degree that a single I/O node can nearly saturate the file system, limiting scalability. We determined that Panda would scale up nicely with faster RAIDs and a file system configuration highly tuned for large scientific applications, as each of Panda's I/O nodes can deliver data to the Origin 2000's file system at a minimum sustained rate of 8 MB/sec with 1-4 I/O nodes. We also determined that nodes do not need to be dedicated to I/O to sustain high utilization of the Origin 2000's file system.

Keywords: Collective I/O, Distributed shared memory system, Shared file system

1 Introduction

The unprecedented scale of scientific simulations on multi-teraflop platforms raises new research issues in I/O and data management. Scientific applications on parallel platforms frequently have a single-program multiple-data (SPMD) structure, in which each processor executes more or less the same program. Multidimensional arrays are a fundamental data type in these applications. Efficiently transferring these arrays between main memory and secondary storage poses a significant challenge.

To overcome the I/O bottleneck problem, the Panda research group is developing new algorithms and approaches for collective I/O, targeting SPMD applications using HPF-style or Adaptive Mesh Refinement (AMR)-style data distributions for large multidimensional arrays on massively parallel platforms and workstation clusters. Our focus is on both I/O library ease of use and performance. The Panda parallel array I/O library has been extensively evaluated on the IBM SP and on networks of workstations and has demonstrated a high level of performance there, e.g., utilizing over 85% of the peak file system throughput on the SP for a wide range of typical array sizes, shapes, processor configurations and I/O requests. However, there is a wide range of massively parallel platforms, and they differ vastly in their interconnects and I/O subsystems, which are the key factors for parallel I/O performance.

We are participants in the Center for Simulation of Advanced Rockets (CSAR), a collaboration of computer scientists and domain scientists funded by the Accelerated Strategic Computing Initiative (ASCI), whose simulations are expected to run on all ASCI platforms (IBM SP, SGI Cray Origin 2000, Intel PentiumPro-based machines and Cray T3E). As part of this work, in this paper we describe our experience of porting and tuning the Panda parallel I/O library on the Origin 2000.

Unlike our prior platforms, the Origin 2000 is a distributed shared memory system whose nodes also share a file system and disks. So, Panda's I/O strategies for previous platforms are not necessarily appropriate here. For instance, on the SP, doubling the number of I/O nodes usually doubles Panda's aggregate throughput. With the Origin 2000's shared file system, increasing the number of I/O nodes may not speed up I/O. Section 2 gives an overview of Panda and the Origin 2000. Section 3 summarizes other potential issues for Panda, and performance results are presented in Section 4. Section 5 surveys related work, and Section 6 concludes the paper.

2 Background

2.1 Collective I/O in the Panda I/O library

Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed memory architectures, with arrays distributed across multiple processors, and that are fairly closely synchronized at I/O time. Panda supports HPF-style BLOCK, CYCLIC, and AMR-style data distributions across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance I/O in this environment is called server-directed I/O [1].

Figure 1: Different array data distributions in memory and on disk provided by Panda (clients on compute nodes, servers on I/O nodes).

Figure 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute nodes arranged in a mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute node. The I/O nodes are also arranged in a mesh, and the data can be distributed across them similarly. The array distribution on disk can be radically different from that in memory. For instance, the array in Figure 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single array in row-major or column-major order, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case.

Each I/O chunk resulting from the distribution chosen for disk will be buffered and sent to (or read from) disk by one I/O node, and that I/O node is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Figure 1, during a Panda write operation an I/O node gathers compute chunks from the compute nodes it serves, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, the other I/O nodes are gathering, reorganizing and writing their own I/O chunks. For a read operation, the reverse process is used. During I/O Panda divides large I/O chunks into smaller pieces, called subchunks, in order to obtain better file system performance and keep I/O node buffer space requirements low. For a write operation, a Panda server will repeatedly gather and write subchunks, one by one.

Panda also supports an alternative I/O strategy, part-time I/O, where there are no dedicated I/O nodes. Instead some of the compute nodes become I/O nodes at I/O time, and return to computation after finishing the I/O operation. When a client reaches a collective I/O call, it may temporarily become a server. (We use the term I/O node to refer to a single processor performing I/O.)
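To make the gather step concrete, the short C sketch below computes, for each I/O node, which compute chunks overlap its I/O chunk when a 2D array is distributed (BLOCK, BLOCK) over a P x Q compute mesh in memory and (BLOCK, *) over K I/O nodes on disk. The array and mesh sizes, the block_lo helper, and the printed output are illustrative assumptions, not Panda's actual interface.

```c
/* Sketch of the index arithmetic behind a server-directed gather:
 * a 2D array distributed (BLOCK, BLOCK) over a P x Q compute mesh
 * in memory and (BLOCK, *) over K I/O nodes on disk.  Names and
 * sizes are illustrative, not Panda's actual interface. */
#include <stdio.h>

/* first row owned by block b when n rows are split into nb blocks */
static int block_lo(int n, int nb, int b) { return (int)((long long)n * b / nb); }

int main(void)
{
    const int N = 8, M = 8;      /* array is N x M            */
    const int P = 2, Q = 2;      /* compute mesh is P x Q     */
    const int K = 2;             /* number of I/O nodes       */

    for (int io = 0; io < K; io++) {
        int lo = block_lo(N, K, io), hi = block_lo(N, K, io + 1);
        printf("I/O node %d owns rows [%d,%d) of the (BLOCK,*) file layout\n",
               io, lo, hi);
        /* find every compute chunk whose row range overlaps [lo,hi);
         * those are the clients this server gathers from */
        for (int p = 0; p < P; p++) {
            int clo = block_lo(N, P, p), chi = block_lo(N, P, p + 1);
            if (chi <= lo || clo >= hi) continue;   /* no overlap */
            int olo = clo > lo ? clo : lo;
            int ohi = chi < hi ? chi : hi;
            for (int q = 0; q < Q; q++)
                printf("  gather rows [%d,%d), cols [%d,%d) from compute node (%d,%d)\n",
                       olo, ohi, block_lo(M, Q, q), block_lo(M, Q, q + 1), p, q);
        }
    }
    return 0;
}
```

In an actual library, each overlapping (row, column) region would become one message from the corresponding client to the server that owns that I/O chunk, and large I/O chunks would be gathered and written subchunk by subchunk as described above.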

2.2 Architecture of the Origin 2000

As shown in Figure 2(a), each node of the Origin 2000 consists of two processors, caches, a portion of the shared memory, a directory for cache coherence, and interfaces to I/O devices and to the interconnection fabric. Memory is organized in 128-byte cache lines, and each cache line is associated with data bits for a directory-based cache coherence scheme [2]. Whenever a cache line is updated, the corresponding data bits are looked up to invalidate other copies of the same cache line.

Figure 2: Architecture of the Origin 2000: (a) each node (two processors with 4 MB caches, Hub, XIO interface to I/O devices, router, and local memory and directory) and (b) interconnection fabric for 16 nodes.

Figure 2(b) shows the interconnection fabric for 32 processors, the configuration used for our experiments. Each pair of nodes is connected to a router, and the routers form a hypercube [2]. Table 1 gives details of the Origin 2000 we used at NCSA. 32 processors (16 nodes) share a total of 16 disks connected to two SCSI-2 RAID level 5 adapters. The two RAIDs are striped via XLV (the XFS volume manager) to create a single logical volume. Users do not have any control over XLV or the RAIDs.

Table 1: Characteristics of the SGI Cray Origin 2000 at NCSA that we used for experiments.
  Total processors: 32
  Processor: 195 MHz MIPS R10000
  Total memory: 8 GB
  OS: IRIX 6.4
  Interconnect topology: binary n-cube
  Interconnect bandwidth: 1.56 GB/sec (full duplex)
  I/O subsystem interface: Fast Wide SCSI-2
  Number of RAIDs: 2
  RAID level: 5
  Disks per RAID: 8 9-GB 7200 RPM disks
  File system: XFS
  Logical volume: 7 GB (5% occupied)

3 Experimental Issues

To understand the execution environment, we ran a microbenchmark to measure the performance of the file system and the message passing system. In the file system benchmark (Figure 3(a)), we varied the number of processors each writing/reading to/from a separate 8 MB file on the logical volume and measured the aggregate throughput. For write performance, we issue fsync() at the end to ensure data are written to disk. For read measurements, we flush the file cache by writing a dummy file of size twice the entire memory size, so that data are actually read from disk, not from the file cache.
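As a rough illustration of this microbenchmark, the sketch below has each process write its own file in 1 MB requests and call fsync() before stopping its timer, with the aggregate throughput computed from the slowest writer. The file path, the sizes, and the use of MPI purely for synchronization and timing are our assumptions; the original benchmark's details are not specified beyond the description above.

```c
/* Sketch of a per-process file system write benchmark: each process
 * writes a separate file in 1 MB requests and issues fsync() before
 * stopping its timer.  Path, sizes, and the use of MPI only for
 * synchronization and timing are assumptions, not the paper's code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define FILE_MB   8          /* per-process file size in MB (assumed) */
#define REQ_BYTES (1 << 20)  /* 1 MB write requests                   */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(REQ_BYTES);
    memset(buf, rank, REQ_BYTES);

    char path[256];
    snprintf(path, sizeof(path), "bench.%d", rank);   /* separate file per process */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    MPI_Barrier(MPI_COMM_WORLD);                      /* start all writers together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < FILE_MB; i++)
        (void)write(fd, buf, REQ_BYTES);
    fsync(fd);                                        /* ensure data reach disk */
    double t = MPI_Wtime() - t0;
    close(fd);

    double tmax;                                      /* slowest writer bounds the run */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d writers: aggregate %.1f MB/sec\n",
               nprocs, (double)FILE_MB * nprocs / tmax);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

For the read side, the text's approach of first writing a dummy file about twice the size of memory would be applied before timing, so that reads are served from disk rather than from the file cache.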

Figure 3: Performance of (a) the file system (aggregate throughput in MB/sec vs. number of I/O processors; file size = 8 MB, request size = 1 MB, separate read and write curves) and (b) message passing (aggregate throughput vs. message size in KB, for varying numbers of sender-receiver pairs).

Results show that read throughput is higher than write throughput, as expected. Surprisingly, a single I/O node gives the best aggregate throughput, around 3 MB/sec, and additional I/O nodes degrade performance for a 1 MB request size. Read requests generated by multiple I/O nodes require immediate disk activity, so the overhead of disk seeks increases with the number of readers. In contrast, the write-behind policy of the file cache allows delayed scheduling of writes to disk, which can reduce the cost of seeks. We found that if only a few processors write, the file size does not have any impact on performance. However, as the number of writers increases, a performance boost can be observed for small files, but the throughput varied among I/O nodes significantly, with at least one I/O node remaining in the 4-5 MB/sec range for writes. Also, once the total amount of data written by all processors exceeds a few GB, performance starts to degrade. Figure 3(a) uses 1 MB I/O requests; write experiments with other sizes showed that the file system gave the same performance for request sizes from 3-4 KB.

From Figure 3(a) and the fact that Panda's performance is inherently limited by the performance of the slowest I/O node, we conclude that aggregate throughput for write operations will be limited by the file system to 5 MB/sec for large arrays in this I/O configuration. The poor performance is due to the RAID and XLV configuration used in this installation, rather than to inherent XFS limitations. Later I/O benchmarks done by SGI and NCSA found that the default RAID and XLV stripe unit sizes used in our installation are appropriate for database applications making random small requests; by simply increasing the XLV stripe unit size, the throughput is significantly increased. Also, tests using Ultra SCSI controllers with 3 non-RAID drives provide 5 MB/sec per controller for writes. Thus in the remainder of the paper, we are careful to consider the possibility of faster file system performance.

On many parallel platforms, message passing can be the bottleneck for I/O. To test this on the Origin 2000, we measured the aggregate throughput of MPI by passing messages between one or more sender-receiver pairs. Figure 3(b) shows that message passing performance scales well for a variety of message sizes. To attain this bandwidth when there are multiple processes, users need to tune SGI's environment variables for MPI. For instance, the variables MPI_BUFS_PER_HOST and MPI_BUFS_PER_PROC need to be set to minimize contention for the message buffer [3]. Small messages (8-16 KB) tend to perform better as the number of sender-receiver pairs increases. This suggests that Panda's buffer size on the I/O nodes should be set so that most data messages are 8-16 KB, if the interconnect ever becomes a limiting factor in performance. However, clearly the file system will be the main bottleneck for I/O on this Origin 2000.
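The following C sketch shows the shape of such a sender-receiver pair measurement: half the ranks send fixed-size messages to partner ranks and the aggregate rate is computed over all pairs. The message size, iteration count, and use of blocking sends are assumptions; the SGI-specific environment variables mentioned above (MPI_BUFS_PER_HOST, MPI_BUFS_PER_PROC) would be set outside the program.

```c
/* Sketch of an aggregate MPI bandwidth test with sender-receiver pairs:
 * rank r < P/2 sends to rank r + P/2.  Message size, iteration count,
 * and blocking point-to-point calls are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (64 * 1024)   /* one message                */
#define ITERS     200           /* messages sent by each pair */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int pairs = nprocs / 2;
    char *buf = malloc(MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < pairs)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, rank + pairs, 0, MPI_COMM_WORLD);
        else if (rank < 2 * pairs)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, rank - pairs, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest pair */
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d pairs, %d KB messages: %.1f MB/sec aggregate\n",
               pairs, MSG_BYTES / 1024,
               (double)MSG_BYTES * ITERS * pairs / (t * 1024 * 1024));

    free(buf);
    MPI_Finalize();
    return 0;
}
```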

The experiments in the next section address the question of how many I/O nodes are needed for peak performance, and whether nodes need to be dedicated to I/O or can be shared between computation and I/O. In previous work [4], we found that part-time I/O gave good performance on a small network of workstations while maximizing the resources available for computation, but only if the I/O nodes are very carefully chosen to minimize the data transfer over the interconnect. However, on the Origin 2000, where the processors are connected by a very fast interconnect, the exact choice of I/O nodes may no longer be so important. To help extrapolate to other Origins, we must also evaluate Panda's performance with a (simulated) faster file system.

We focus on write operations because (i) the CSAR applications are simulations, which are write intensive, and (ii) in general, write operations are slower than read operations on RAID level 5 because of the parity update. Also, we focus on arrays that have different distributions in memory and on disk. If the output format is not important, each compute node can write its own data to a contiguous region of a file, since the logical volume is shared by all nodes; however, performance will still be limited to the file system's aggregate 5 MB/sec and 3 MB/sec for write and read operations respectively.

4 Performance

The results in this section were obtained using processors dedicated to our job. In all experiments, we distribute a 3D array across 16 compute nodes logically connected in a 4 x 4 mesh and vary the number of I/O nodes from 1 to 16. The error bars show a 95% confidence interval for the mean.

In the first experiments, we measure the time to output a {64, 128, 256, 512, 1024} x 512 x 512 floating point array using a set of dedicated I/O nodes. These large arrays are appropriate given the Origin 2000's large memory and CSAR application characteristics. We changed the number of elements along the slowest varying dimension so that all arrays would have the same amount of logically contiguous data in memory, which can have a significant impact on message passing performance. In Figure 4(a), for a write operation, as the number of I/O nodes increases, the aggregate throughput does not increase accordingly but is limited to about 4 MB/sec, 95% of the attainable peak file system throughput. The percentages shown in Figure 4 represent the fraction of the peak file system performance achieved on the slowest I/O node. With only a few I/O nodes, we can fully saturate the I/O system and achieve throughput very close to the attainable peak. This means that dedicating a large number of nodes to I/O is not an effective use of resources on this platform. We repeated the same experiment for array read operations (Figure 4(b)). Like the file system benchmark in Figure 3, Panda achieves the highest throughput with a single I/O node, averaging around 95% of the peak file system performance.

As shown in Figure 4(a), Panda's write performance is slightly lower if only 1-2 I/O nodes are used. This is due to an unbalanced scalability between message passing and the underlying file system as the number of I/O nodes increases. For instance, if we increase the number of I/O nodes from 1 to 4, aggregate message passing throughput scales almost linearly but the aggregate file system write throughput increases only slightly. Thus, as the number of I/O nodes increases, the time each I/O node devotes to message passing becomes a smaller and smaller fraction of its total run time, providing near-peak file system utilization. This trend does not occur for read operations, because XFS performs aggressive prefetching while Panda servers scatter the current I/O buffer to clients; the large error bars in Figure 4(b) reflect this prefetching/buffer management variation.

To determine scalability to a faster file system, we simulated infinitely fast disks by commenting out the file system write operations in Panda (Figure 5).
The results show how well Panda is utilizing the available MPI bandwidth. Up to 8 I/O nodes, the message passing in Panda scales very well, reaching around 90% of the corresponding attainable MPI bandwidth shown in Figure 3. We tested read operations simulating infinitely fast disks, and the results are similar to Figure 5, with 3-5% higher throughput. We conclude that on an Origin 2000 that can read or write data at x MB/sec, roughly ⌈x/8⌉ I/O nodes will be needed to saturate the file system, and the use of additional I/O nodes brings no performance advantages for Panda.
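As a worked instance of this rule of thumb (the 45 MB/sec input below is purely an illustrative file system rate, not a measurement from this paper):

```c
/* I/O nodes needed to saturate a file system sustaining fs_rate MB/sec,
 * assuming each Panda I/O node delivers about 8 MB/sec as stated above.
 * The 45 MB/sec example rate is illustrative only. */
#include <stdio.h>

static int io_nodes_needed(int fs_rate, int per_node_rate)
{
    return (fs_rate + per_node_rate - 1) / per_node_rate;   /* integer ceiling */
}

int main(void)
{
    printf("%d I/O nodes\n", io_nodes_needed(45, 8));        /* prints: 6 I/O nodes */
    return 0;
}
```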

Figure 4: Performance when an array of shape n x 512 x 512 is distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, where n is the array size in MB: (a) write and (b) read operations using dedicated I/O nodes, array sizes 64 MB to 1 GB, request size = 1 MB.

In Figure 5, performance is insensitive to array size. However, this won't be true for all arrays, because MPI exhibits a noticeable performance penalty if the data being transferred are not contiguous in memory. To explore this phenomenon, we repeated the experiment of Figure 4 using a different array shape, 512 x 512 x {64, 128, 256, 512, 1024}. As can be seen in Figure 6, larger arrays now perform better than small arrays, because smaller arrays have fewer elements in a contiguous region of memory. This causes the I/O nodes to perform more strided memory accesses than with larger arrays, and the extra cost of strided memory access is noticeable in the parallel I/O performance when there are few I/O nodes. However, the extra cost drops as the number of I/O nodes increases, because MPI costs drop linearly while file system costs hold steady. On Origins with faster file system configurations, the extra cost of strided accesses will be even more apparent. The use of separate threads for communication and disk access, as is done in implementations of disk-directed I/O [5], would alleviate this problem.
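One way to reduce the cost of such strided transfers is to describe the non-contiguous piece with an MPI derived datatype and let the MPI implementation handle the packing, as argued in [10]. The sketch below builds an MPI_Type_vector for a column block of a row-major 2D array; the array and block sizes are illustrative assumptions and this is not Panda's code.

```c
/* Illustrative use of an MPI derived datatype to describe a strided
 * piece of a 2D array (one column block of a row-major array), so a
 * single MPI_Send can move non-contiguous data without hand packing.
 * Run with at least two MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 8, M = 8, W = 4 };      /* N x M array, send a width-W column block */
    float a[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = rank == 0 ? i * M + j : -1.0f;

    /* N blocks of W floats, separated by a stride of M floats */
    MPI_Datatype colblock;
    MPI_Type_vector(N, W, M, MPI_FLOAT, &colblock);
    MPI_Type_commit(&colblock);

    if (rank == 0)
        MPI_Send(&a[0][0], 1, colblock, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&a[0][0], 1, colblock, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("a[2][0..3] on rank 1: %g %g %g %g\n",
               a[2][0], a[2][1], a[2][2], a[2][3]);
    }

    MPI_Type_free(&colblock);
    MPI_Finalize();
    return 0;
}
```

The smaller the contiguous pieces relative to the stride, the higher the per-block overhead, which is consistent with the penalty observed above for small arrays when there are few I/O nodes.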

Figure 5: Performance of simulated write when an array of shape n x 512 x 512 is distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, where n is the array size in MB. Dedicated I/O nodes, array sizes 64 MB to 1 GB, request size = 1 MB.

We also tested part-time I/O using 16 compute nodes, where the first k compute nodes become I/O nodes at I/O time. As shown in the second graph in Figure 6, we obtain almost the same aggregate throughput as with dedicated I/O nodes. As hypothesized earlier, the Origin 2000's very fast interconnect makes a careful choice of I/O nodes unnecessary. We conclude that there is no advantage to dedicating nodes to I/O on the Origin 2000.

5 Related work

Numerous runtime libraries and file systems provide collective I/O [6, 7, 8, 5, 9], but performance studies of these libraries on the Origin 2000 are just beginning. ROMIO is an implementation of the MPI-IO standard interface which uses two-phase I/O for collective I/O operations [10]. [10] presents collective read performance on a variety of parallel platforms including the Origin 2000, for a 3D array distributed (BLOCK, BLOCK, BLOCK) in memory and (BLOCK, *, *) on disk, using all the compute processors as I/O processors at I/O time, as with our part-time I/O nodes. On an Origin 2000 with a faster RAID than the one we used, ROMIO's collective read performance on 32 processors, 75 MB/sec, is about 3 times faster than naive sequential reads (i.e., each compute node issuing many small read/write requests). However, the throughput for collective read is only around 5% of the available file system bandwidth. The rest of the run time is used to analyze the requests of the different processes and to redistribute data after reading large blocks. Our results in Figure 4 suggest that fewer I/O nodes might perform better. Alternatively, our results using a simulated faster file system (Figure 5), coupled with the performance results in [10], suggest that server-directed I/O may have a significant performance advantage over two-phase I/O for the types of array I/O operations discussed in this paper.

6 Summary and Conclusion

We have ported the Panda parallel I/O library, designed for collective I/O of SPMD-style applications running on a distributed memory system, to a SGI Cray Origin 2000 at NCSA. Our experiments show that the extremely low and non-scalable aggregate file system bandwidth is the bottleneck for parallel I/O on this installation. The low throughput is due to the small stripe unit size used for the RAIDs and XLV in this installation. By using more advanced RAIDs and XLV parameters tuned for large I/O requests, the file system throughput can be significantly increased. MPI on the Origin 2000 we used can sustain an aggregate throughput of roughly 80-100 MB/sec between a sender and receiver, and scales almost linearly as the number of communicating processes increases. Thus message passing will not be the typical bottleneck for I/O on this or most other Origin 2000 configurations.

We found that only a few I/O nodes would be needed to provide collective I/O performance close to the peak file system performance, even if the file system were much faster, and that the nodes need not be dedicated to I/O. However, with 1-2 I/O nodes and small arrays, certain array shapes and distributions can incur a significant performance penalty, because strided memory access using MPI on this platform can be very expensive. Considering the theoretical peak interconnect bandwidth, 780 MB/sec, we believe there is plenty of room for the MPI bandwidth between a pair of processors to be increased, removing potential performance pitfalls in parallel I/O performance for small numbers of I/O nodes.

7 Acknowledgements

This work was supported in part by NASA and DOE under NAGW 444, NCC5 6, and B34494, and utilized a SGI Cray Origin 2000 at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. We would like to thank Eric Salo at SGI for his help regarding SGI's implementation of MPI. We would also like to thank Luc Chouinard at SGI and the NCSA staff for help with the I/O benchmarks.

Figure 6: Write performance for an array of shape 512 x 512 x {64, 128, 256, 512, 1024} when (a) dedicated I/O nodes and (b) part-time I/O nodes are used; request size = 1 MB.

References

[1] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, November 1995.
[2] Origin Servers Technical Overview. Technical report, Silicon Graphics Inc., April 1997.
[3] Message Passing Toolkit: MPI Programmer's Manual. Technical report, Silicon Graphics Inc., January 1998.
[4] Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, and K. E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, November 1997.
[5] D. Kotz. Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
[6] R. Bordawekar, J. Rosario, and A. Choudhary. Design and Evaluation of Primitives for Parallel I/O. In Proceedings of Supercomputing '93, pages 452-461, 1993.
[7] R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz. Jovian: A Framework for Optimizing Parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, October 1994.
[8] P. F. Corbett, D. G. Feitelson, J. Prost, and S. J. Baylor. Parallel Access to Files in the Vesta File System. In Proceedings of Supercomputing '93, pages 472-481, November 1993.
[9] Rajesh Bordawekar. Implementation of Collective I/O in the Intel Paragon Parallel File System: Initial Experiences. In Proceedings of the International Conference on Supercomputing, July 1997.
[10] R. Thakur, W. Gropp, and E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Technical Report Preprint ANL/MCS-P717-0598, Argonne National Laboratory, May 1998.
