Performance Modeling of a Parallel I/O System: An Application Driven Approach

Evgenia Smirni, Christopher L. Elford, Daniel A. Reed, Andrew A. Chien
Department of Computer Science, University of Illinois, Urbana, Illinois 61801

This work was supported in part by the Advanced Research Projects Agency under DARPA contracts DABT63-94-C49 (SIO Initiative), DAVT63-9-C-29, and DABT63-93-C-4, by the National Science Foundation under grant NSF ASC 92-2369, and by the National Aeronautics and Space Administration under NASA contracts NAG--63, NGT-523, and USRA 5555-22.

Abstract

The broadening disparity between the performance of I/O devices and the performance of processors and communication links on parallel platforms is a major obstacle to achieving high performance in many parallel application domains. We believe that understanding the interactions among application I/O access patterns, parallel file systems, and I/O hardware configurations is a prerequisite to identifying levels of I/O parallelism (i.e., the number of disks across which files should be distributed) that maximize application performance. To validate this conjecture, we constructed a series of I/O benchmarks that encapsulate the behavior of a class of I/O intensive access patterns. Performance measurements on the Intel Paragon XP/S demonstrated that the ideal distribution of data across storage devices is a strong function of the I/O access pattern. Based on this experience, we propose a simple, product form queuing network model that effectively predicts the performance of both I/O benchmarks and I/O intensive scientific applications as a function of I/O hardware configuration.

1 Introduction

The I/O demands of large-scale parallel applications continue to increase, while the performance disparity between individual processors and disks continues to widen. Given these trends, effectively distributing data across multiple storage devices is key to achieving desired I/O performance levels. In turn, we believe that identifying effective operating points requires an understanding of the interplay among application I/O access patterns, data partitioning alternatives, and hardware and software I/O configurations. Given the plethora of possible optimizations, determining preferred policy parameters by exhaustive exploration of the I/O parameter space is prohibitively expensive. Moreover, application developers need simple, qualitative models for choosing I/O parallelization strategies. Such models should encapsulate the performance implications of using either a smaller or a larger number of disks, the effects of file access size, the granularity of data distribution across the disks, and the file access pattern. The goal of this paper is the creation of such a model for parallel I/O.

Modeling disk arrays and parallel I/O systems has been an active research area for several years. Approximate analytical models of disk arrays [1, 2] driven by synthetic workloads have assisted the development of simple rules for preferred striping configurations in disk arrays. Our work complements these efforts by capturing the interaction of the I/O requirements of scientific applications with both file system software and hardware.

We construct a simple, product form queuing network model that accurately models the basic performance trends of interleaved access patterns on the Intel Paragon XP/S Parallel File System (PFS). This model is appropriate for use by both application and file system developers.

The remainder of this paper is organized as follows. In Section 2, we describe QCRD, a large scientific code for solving quantum chemical reaction dynamics problems. This is followed in Section 3 by a description of the synthetic benchmarks that drive a simple, product form queuing network model. In Section 4, we characterize the performance of the QCRD code and validate the model for several disk striping configurations. Finally, Section 5 summarizes our findings.

2 Quantum Chemical Reaction Dynamics (QCRD)

Understanding the interactions among application I/O access patterns, parallel file systems, and I/O hardware configurations is a prerequisite to identifying levels of I/O parallelism that maximize application performance. Thus, a major objective of the multi-agency Scalable I/O Initiative (SIO) is to assemble a suite of I/O intensive, national challenge applications, to collect detailed performance data on application characteristics and access patterns, and to use this information to design and evaluate parallel file system policies. Below, we characterize the I/O performance of QCRD, one application from the SIO code suite, using an extended version of the Pablo performance analysis environment [3].

The QCRD application [5] uses the method of symmetrical hyperspherical coordinates and local hyperspherical surface functions to solve the Schrödinger equation for the differential and integral cross-sections of the scattering of an atom by a diatomic molecule. Code parallelism is achieved by data decomposition (i.e., all processors execute the same code on different portions of the global matrices). Via this data decomposition, the load is equally balanced across the processors, and code execution progresses in five logical phases (programs) that operate as a logical pipeline.

All our experiments were conducted on the Intel Paragon XP/S at the Caltech Center for Advanced Computing Research. As a platform for I/O research, this system supports multiple I/O configurations, including a set of older RAID-3 disk arrays and a group of newer Seagate disks. On all these configurations, the Paragon XP/S parallel file system (PFS) stripes files across multiple disks in default units of 64 KB.
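To make the striping arithmetic concrete, the fragment below is an illustrative sketch (not the PFS implementation) of how a round-robin striped file maps a byte offset to a stripe-group member, assuming the default 64 KB stripe unit; the function and parameter names are ours.

    #include <stdint.h>

    #define STRIPE_UNIT (64 * 1024)            /* default PFS stripe unit */

    /* Round-robin striping: stripe i of the file resides on I/O node
     * i mod ndisks.  Given a byte offset, compute the node that holds it
     * and the offset within that node's portion of the file.            */
    static void stripe_map(uint64_t file_offset, unsigned ndisks,
                           unsigned *node, uint64_t *node_offset)
    {
        uint64_t stripe = file_offset / STRIPE_UNIT;
        *node           = (unsigned)(stripe % ndisks);
        *node_offset    = (stripe / ndisks) * STRIPE_UNIT
                          + file_offset % STRIPE_UNIT;
    }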
As a baseline for our I/O performance analysis of the QCRD code, we considered a representative, though modest, data set and measured the performance and behavior of QCRD on the Caltech system when using all of the newer Seagate disks. The first four phases of QCRD were executed on 64 processors, while the fifth phase was executed on 16 processors. To compare the application I/O timings with the older configuration of 16 RAID-3 disk arrays, we then striped the data files across only 16 of the Seagate disks and repeated the experiment. Not surprisingly, the newer 16 disks were faster than the older RAID-3 configuration. However, they also reduced application execution time by roughly ten percent compared to the use of all of the Seagate disks. Finally, observing that increased I/O parallelism need not increase performance, we restricted the files to a single Seagate disk and repeated the experiment once more.

Figure 1 shows the sum of the time spent on I/O by all processors for the five QCRD phases. With the exception of phase one, which achieves its best performance with only one disk, the cumulative I/O time is minimized when the data are striped across more than one, but fewer than all, of the disks.

Fig. 1. Cumulative QCRD I/O times (1 disk, 16 disks, and all Seagate disks).

Table 1 presents aggregate performance summaries for phases one and two of QCRD (the performance of the other phases is qualitatively similar to that of phase two and is not reported here for brevity).

Table 1. QCRD I/O operation frequencies and overheads: per-operation counts and total I/O times for the 1-disk, 16-disk, and all-disk configurations.

The table shows clearly that the I/O times are dominated by seeks: the application developers chose to use the PFS UNIX file access mode because it is the most direct and portable analog to sequential UNIX I/O. Using this file mode, each processor repeatedly seeks to its designated part of the shared file before performing any read or write operation. In fact, the total time spent on seeks, usually negligible on sequential machines, dominates the total code execution time. Seeks represent roughly 10 percent of phase one's execution time, 50 to 60 percent of the execution time for phases two, three, and four, and almost 90 percent of phase five's execution time. The following section explores the reasons for this behavior in greater detail.
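The following C fragment is a schematic of this access pattern, not QCRD source code: each of nprocs processes owns every nprocs-th record of the shared file, so every synchronous request is preceded by an explicit lseek on the shared file pointer. The same loop, parameterized by request size, processor count, and operation type, is essentially the kernel of the microbenchmarks introduced in the next section.

    #include <sys/types.h>
    #include <unistd.h>

    /* Schematic interleaved access loop for one of nprocs processes.
     * Process `rank` owns records rank, rank + nprocs, rank + 2*nprocs, ...
     * each of size reqsize, so every request requires a seek to a
     * non-contiguous offset in the shared file before the read is issued. */
    static void interleaved_reads(int fd, int rank, int nprocs,
                                  size_t reqsize, long nrequests, char *buf)
    {
        for (long i = 0; i < nrequests; i++) {
            off_t offset = ((off_t)i * nprocs + rank) * (off_t)reqsize;
            lseek(fd, offset, SEEK_SET);   /* explicit seek before each request */
            read(fd, buf, reqsize);        /* synchronous read of one record    */
        }
    }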

3 Microbenchmarks and Performance Models

In Section 2, we saw that understanding the interactions among application request patterns, file system semantics, and disk hardware configurations is critical to identifying effective operating points. Determining such points by exhaustively running the applications across the entire range of file system configurations is prohibitively expensive. An alternative, cost-effective method is to create an analytic model of parallel I/O performance that can predict efficient disk striping parameters for given request access characteristics. To identify the key elements of an effective I/O model and to parameterize it accordingly, we first constructed a series of microbenchmarks. These benchmarks were designed to highlight system bottlenecks and to reflect the I/O behavior of actual applications.

As we noted earlier, application developers tend to use the UNIX I/O API because it is portable and because it is the most familiar. Unfortunately, this choice does not exploit all available parallel file system features [4]. However, given the frequent use of the UNIX I/O API, we focus on modeling the Intel Paragon XP/S PFS performance characteristics under the default UNIX file access mode and the default 64 KB PFS stripe size.

As a first step, we constructed a synthetic workload that mimics the global interleaved access patterns found in the QCRD code. Each processor sequentially accesses its interleaved portion of the file, issuing a predefined number of synchronous I/O requests of the same size. We then parameterized this synthetic workload to control the load imposed on the I/O system. These parameters include the number of processors that simultaneously perform interleaved operations on the file, the request access size, the stripe group size, and the type of I/O operation (reads or writes). By varying different parameters, we incrementally increase the stress on the I/O system and identify performance bottlenecks.

Figure 2 shows the results of two experiments, one with interleaved reads (workload A) and one with interleaved writes (workload B), with files striped across 1, 16, and all of the Seagate disks. For workload A, there is a clear performance benefit if the file is striped across 16 disks; striping across the maximum number of disks is slower by twenty percent. For workload B, using a single disk is the most desirable alternative.

Fig. 2. Microbenchmark execution times.

Figure 3 illustrates the average seek, read, and write durations for the two workloads as a function of the number of disks. (The same qualitative behavior was observed for larger requests of 32 KB, half the disk striping unit, and 128 KB, twice the disk striping unit.) For interleaved reads, there is a clear reduction in the average seek time from one to eight disks; beyond this point, however, the mean seek time increases for all processor counts. The seek operations for interleaved writes, shown in the lower portion of Figure 3, are much more expensive than those for reads, and their cost increases rapidly with the number of disks. In turn, this suggests an optimal operating point for this interleaved write workload with the file striped across only 16 disks.

Fig. 3. Average microbenchmark operation durations (approximately 2 KB requests).

By construction, the interleaved read and write behavior of these two benchmarks is similar to that found in QCRD (see Figure 1). Moreover, the reason for the convexity of both performance curves as a function of the number of disks is the same.

For both the benchmarks and QCRD, the average seek times are at least an order of magnitude more expensive than the associated read or write operations. The primary reason for these high seek costs lies in the Intel PFS implementation of UNIX file system semantics: PFS maintains sequential consistency for shared file pointers even when the file is opened read-only.

Using the microbenchmark data from Figures 2 and 3, we focused on modeling the effects of the PFS open, seek, read, and write primitives under UNIX access semantics. However, the models could be extended to include other PFS I/O access modes (e.g., M_RECORD, M_ASYNC) that relax consistency constraints. To simplify analysis, we assume that access times for all I/O operations are exponentially distributed, that requests are served by the I/O system in first-come-first-served (FCFS) order, and that all read and write requests are of the same size. Because files can be striped across a variable number of disks, a natural way to capture the effects of disk striping is via a fork-join system. Unfortunately, the complexity of fork-join systems prohibits exact models that can be solved easily using analytical methods. Thus, we opted for an approximate, single-class, closed queuing network that models N tasks that all execute the same sequence of I/O operations: they first open a common file and then perform a series of synchronous, interleaved read or write requests. At each moment, we assume that N customers (i.e., tasks) circulate in the network. To model the interleaved disk access pattern we used a closed queuing network with three devices; see Figure 4. The time consumed by open, seek, and read/write operations is modeled by servers A, B, and C, respectively. The resource demand on each server is a function of the server rate and of the branching probability shown in Figure 4. The server rates used as input to the model were taken from the microbenchmarks described earlier.

Fig. 4. Closed queuing network model (server A: open, server B: seek, server C: read/write) and model predictions for interleaved accesses (approximately 2 KB operations).
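A closed, product form network of this kind can be evaluated with standard exact mean value analysis (MVA); the sketch below is our illustration of that style of calculation, not code from the original study. Each task's total service demand at a station is its visit count times the mean service time per visit, which in our setting would be derived from the measured operation frequencies, the branching probability, and the microbenchmark service rates; the demand values in main() are placeholders rather than measured data.

    #include <stdio.h>

    #define STATIONS 3   /* A: open, B: seek, C: read/write */

    /* Exact single-class MVA for a closed network of FCFS queueing
     * stations.  demand[k] is the total service demand of one customer
     * at station k (visit count times mean service time).  Returns the
     * predicted throughput with n customers; *resp receives the
     * predicted time for one customer to complete all of its I/O.     */
    static double mva(const double demand[STATIONS], int n, double *resp)
    {
        double qlen[STATIONS] = { 0.0, 0.0, 0.0 };  /* mean queue lengths */
        double x = 0.0;                             /* throughput         */

        for (int c = 1; c <= n; c++) {
            double r[STATIONS], rtot = 0.0;
            for (int k = 0; k < STATIONS; k++) {
                /* residence time: own demand plus backlog left by others */
                r[k] = demand[k] * (1.0 + qlen[k]);
                rtot += r[k];
            }
            x = (double)c / rtot;                   /* Little's law        */
            for (int k = 0; k < STATIONS; k++)
                qlen[k] = x * r[k];
        }
        if (resp) *resp = (double)n / x;
        return x;
    }

    int main(void)
    {
        /* Placeholder demands (seconds per task): open, seek, read/write. */
        double demand[STATIONS] = { 0.05, 2.00, 0.40 };
        for (int n = 1; n <= 64; n *= 2) {
            double t;
            mva(demand, n, &t);
            printf("%2d tasks: predicted completion time %.1f s\n", n, t);
        }
        return 0;
    }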

Figure 4 illustrates the model predictions for interleaved read and write accesses. The model accurately captures the relative ranking of the experiments' execution times with respect to the number of disks. By analyzing the queue lengths at the various devices, we see that as the number of tasks in the network increases, a larger percentage of the workload's execution time is attributable to queueing delay at the seek server, just as shown by the microbenchmarks. For both reads and writes, the model accuracy is within ten percent (with the exception of interleaved reads with a stripe group of one disk).

4 I/O Characterization and Model Prediction of QCRD

Because the five QCRD code phases are structured similarly, we concentrate on the analysis of phase two, which is representative of all but phase one. Except for phase one, which performs a set of interleaved writes, the remaining QCRD phases contain both interleaved reads and writes. Phase two executes the following sequence of steps. First, all processors synchronize and then open the two basis files created by phase one. The two files are accessed in sequence, with each processor seeking to its designated portion and performing 38 interleaved reads of approximately 2 KB each. After all processors have finished a two-dimensional quadrature, they open the same overlap file. Each processor then seeks to its designated portion and performs five interleaved writes of approximately 2 KB each. All of the steps above are then repeated twelve times.
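Read schematically, one process's I/O in a phase-two cycle looks like the sketch below; this is a reconstruction from the trace summary rather than QCRD source code, and the file names and the roughly 2 KB record size are placeholders.

    #include <fcntl.h>
    #include <unistd.h>

    #define CYCLES          12     /* alternating I/O and compute cycles        */
    #define READS_PER_FILE  38     /* interleaved reads per basis file          */
    #define WRITES_PER_FILE  5     /* interleaved writes to the overlap file    */
    #define RECSIZE       2048     /* placeholder for the ~2 KB record size     */

    /* Interleaved access to one shared file: seek to this process's slice
     * before every synchronous request.                                      */
    static void access_interleaved(int fd, int rank, int nprocs,
                                   int count, int writing, char *buf)
    {
        for (int i = 0; i < count; i++) {
            lseek(fd, ((off_t)i * nprocs + rank) * RECSIZE, SEEK_SET);
            if (writing) write(fd, buf, RECSIZE);
            else         read(fd, buf, RECSIZE);
        }
    }

    /* One process's phase-two I/O: read both basis files, compute, then
     * write the shared overlap file, repeated for every cycle.              */
    static void phase_two_io(int rank, int nprocs, char *buf)
    {
        const char *basis[2] = { "basis1.dat", "basis2.dat" };  /* placeholder names */

        for (int c = 0; c < CYCLES; c++) {
            for (int b = 0; b < 2; b++) {
                int fd = open(basis[b], O_RDONLY);
                access_interleaved(fd, rank, nprocs, READS_PER_FILE, 0, buf);
                close(fd);
            }
            /* ... two-dimensional quadrature (computation) ... */
            int fd = open("overlap.dat", O_WRONLY);              /* placeholder name */
            access_interleaved(fd, rank, nprocs, WRITES_PER_FILE, 1, buf);
            close(fd);
        }
    }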

Using our Pablo I/O instrumentation software, we captured a timestamped event trace of all I/O operations in QCRD phase two. Figure 5 illustrates the temporal spacing and duration of the seek, read, and write operations when the files are striped across all of the Seagate disks; the twelve alternating bursts of I/O and computation activity are clearly visible. As previously discussed, file seeks are the largest source of I/O overhead.

Fig. 5. QCRD operation durations (phase two, all Seagate disks).

Figure 6 shows a detailed view of seek durations for the first of the twelve I/O-compute cycles for the three disk configurations. At the beginning of each I/O interval, seek durations increase rapidly until the system reaches a steady state. At the end of each I/O interval, the seek durations decline as the number of competing processors declines.

Fig. 6. QCRD phase two seek durations (in seconds) for the 1-disk, 16-disk, and all-disk configurations.

Using the simple queueing network model of Section 3, we predicted the I/O scalability of QCRD as a function of disk configuration. We parameterized the model's transition probability using the operation I/O frequencies from the measured data and the service rates for servers B and C suggested by the experimental measurements of Section 3. Figure 7 illustrates the experimental and predicted I/O execution times of each interleaved operation portion for the first cycle of phase two. The model effectively captures the performance trends across the three disk configurations.

Fig. 7. Model predictions for QCRD: measured and predicted I/O execution times for the 1-disk, 16-disk, and all-disk configurations.

5 Conclusions

We demonstrated that a single distribution of data across I/O devices is unlikely to perform optimally for all file access patterns. Using a series of simple I/O benchmarks that encapsulate common access patterns, we measured the cost of I/O primitives with respect to request size, interaccess time, and operation interaction across various disk striping configurations. We then constructed and parameterized a single-class queueing network model that predicts benchmark and application behavior as a function of disk striping configuration. The major advantage of the model is its simplicity.

References

[1] Chen, P. M., and Patterson, D. A. Maximizing Performance in a Striped Disk Array. In Proceedings of the 17th Annual International Symposium on Computer Architecture (1990), pp. 322-331.
[2] Lee, E., and Katz, R. An Analytic Performance Model of Disk Arrays. In ACM SIGMETRICS (May 1993), pp. 98-109.
[3] Reed, D. A., Elford, C. L., Madhyastha, T., Scullin, W. H., Aydt, R. A., and Smirni, E. I/O, Performance Analysis, and Performance Data Immersion. In MASCOTS '96 (Feb. 1996).
[4] Smirni, E., Aydt, R. A., Chien, A. A., and Reed, D. A. I/O Requirements of Scientific Applications: An Evolutionary View. In High Performance Distributed Computing (1996).
[5] Wu, Y.-S. M., Cuccaro, S. A., Hipes, P. G., and Kuppermann, A. Quantum Chemical Reaction Dynamics on a Highly Parallel Supercomputer. Theoretica Chimica Acta 79 (1991), 225-239.