Benchmarking the Performance of Scientific Applications with Irregular I/O at the Extreme Scale

S. Herbein, University of Delaware, Newark, DE 19716
S. Klasky, Oak Ridge National Laboratory, Oak Ridge, TN 37831
M. Taufer, University of Delaware, Newark, DE 19716

Abstract—In this paper we hypothesize that irregularities in I/O patterns (i.e., an irregular amount of data written per process at each I/O step) in scientific simulations can cause increasing I/O times and a substantial loss in scalability. To study whether our hypothesis is true, we quantify the impact of irregular I/O patterns on the I/O performance of scientific applications at the extreme scale. Specifically, we statistically model the irregular I/O behavior of two scientific applications: the Monte Carlo application QMCPack and the adaptive mesh refinement application ENZO. For our testing, we feed our models into an I/O skeleton tool to measure the performance of the two applications' I/O under different I/O settings. Empirically, we show how the growing data sizes and the irregular I/O patterns in these applications are both relevant factors impacting performance.

I. INTRODUCTION

When dealing with applications at the extreme scale, their I/O times play a key role in the overall application performance. Specifically, as the number of processes grows, the I/O fraction of the whole execution time grows, causing a substantial loss in overall scalability. In our past work with the QMCPack application and the integration of the ADIOS I/O library into its code, we experienced this problem first-hand. Figure 1 shows the average percentage of time spent by QMCPack simulations of a large graphite system in execution (i.e., for computation and communication) versus the time spent for I/O. The values are obtained over a set of six QMCPack runs, each performing the same number of Monte Carlo steps and printing the output traces at a fixed step interval. We considered two popular libraries, HDF5 [1] and ADIOS [2]; for the latter we considered different levels of aggregation for the writing processes (i.e., one aggregator for two processes, one aggregator for four processes, and one aggregator for eight processes). Independently of the I/O library and level of aggregation, the I/O grows to become 30% of the total execution time when half the nodes of the Titan supercomputer at the Oak Ridge National Laboratory (ORNL) are used.

Identifying the reasons for increasing I/O times is not always intuitive or trivial. At first we thought that the irregularity of the I/O pattern (i.e., the irregular amount of data written per process at each I/O step) in QMCPack was among the main causes of the increasing I/O times. QMCPack uses Quantum Monte Carlo methods to find solutions to the many-body Schrödinger equation by stochastic sampling [3] and calculates total energies of condensed systems. The core of the QMCPack algorithm works on many trial solutions called walkers. A process runs multiple walkers per step; at the end of each step, walkers with low energy are duplicated, while walkers with high energy are terminated. Thus, within each simulation step, processes deal with different numbers of walkers and write different amounts of data to disk.

Fig. 1: Percentage of QMCPack time spent in execution vs. I/O for different I/O libraries and levels of aggregation using different numbers of nodes on Titan.
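As a toy illustration of why this branching produces irregular writes, the following Python sketch evolves a per-process walker population over a few steps and reports the resulting per-process write sizes. This is not QMCPack's actual branching rule: the duplication and termination probabilities, process counts, and initial populations are invented for illustration; only the 65 KB-per-walker write size is taken from Section II.

    import random

    WALKER_BYTES = 65 * 1024   # per-walker write size reported in Section II (65 KB)

    def step(walkers):
        """Toy birth/death step: some walkers duplicate, some are terminated."""
        survivors = 0
        for _ in range(walkers):
            r = random.random()
            if r < 0.15:        # hypothetical duplication probability
                survivors += 2
            elif r < 0.30:      # hypothetical termination probability
                continue
            else:
                survivors += 1
        return survivors

    def simulate(processes=8, walkers_per_process=200, steps=15):
        counts = [walkers_per_process] * processes
        for s in range(steps):
            counts = [step(c) for c in counts]
            sizes_mb = [c * WALKER_BYTES / 2**20 for c in counts]
            print(f"step {s+1:2d}: per-process write sizes (MB) = "
                  + ", ".join(f"{x:.2f}" for x in sizes_mb))

    if __name__ == "__main__":
        simulate()

Even with identical starting populations, the per-process write sizes drift apart within a few steps, which is exactly the irregularity the rest of the paper quantifies.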
To study whether our hypothesis on the impact of I/O irregularities on performance is true, we consider different I/O patterns (both irregular and regular) and study their performance under different I/O settings. We investigate the problem with the ADIOS library only, as ADIOS comes with Skel, a powerful tool for decoupling I/O times from the other execution times, and we test the I/O performance under different settings. Since the results in [4] (also summarized in Figure 1) do not indicate major differences between ADIOS and HDF5 in terms of I/O performance, we do not expect major discrepancies with HDF5 in terms of conclusions.

The spectrum of applications with irregular I/O is quite broad. To cover such a spectrum and at the same time consider meaningful scientific applications at the extreme scale, we consider a Monte Carlo application, QMCPack [3], and an adaptive mesh refinement (AMR) application, ENZO [5]. For each application we ran several simulations and built a statistical model of its I/O pattern that we present in this paper. In our study, we feed the models into Skel for our testing and we consider different parameter values and settings. For the sake of completeness, we also model the regular I/O of scientific applications such as S3D [6] and GTC [7], whose processes write the same amount of data at each I/O step.

Note how the use of Skel allows us to reproduce the I/O of applications even if they do not support ADIOS yet, as is the case for ENZO: from any real execution, we can extract the application's I/O profile and feed it to Skel for testing purposes.

The contributions of this paper are as follows: (1) We statistically model the I/O behavior of two scientific applications with irregular I/O as well as the regular I/O behavior of other applications. (2) We feed the models into the I/O skeleton tool to measure the performance of the two applications' I/O under different I/O parameter values and settings. (3) We critically compare and contrast the results for a large set of I/O scenarios and show how the growing I/O size (i.e., from a medium I/O size of a few MBytes per writing process to a large I/O size one order of magnitude bigger) and the irregular I/O patterns in applications with irregular I/O are relevant factors when tuning performance.

The rest of this paper is organized as follows: Section II provides an overview of the applications and their I/O patterns; Section III describes the I/O methods studied in this paper and the I/O skeleton tool used for the testing; Section IV presents the case studies; Section V discusses relevant related work; and Section VI concludes this paper.

II. PROFILING THE I/O OF REAL APPLICATIONS

We model the irregular I/O behavior of the Monte Carlo application QMCPack and the adaptive mesh refinement (AMR) application ENZO, as well as the regular I/O behavior of applications such as S3D and GTC.

A. QMCPack

The core of the QMCPack algorithm works on many trial solutions called walkers. Walkers evolve over many steps during which they refine the system of particles using the Diffusion Monte Carlo (DMC) method. At the end of each step, walkers with low energy are duplicated, and walkers with high energy are terminated. Consequently, at each step, each process has a different number of walkers whose particle positions and energies it must write out to disk.

Figure 2.a shows the box plot of the number of walkers written to output for real QMCPack simulations performing 15 DMC steps of a large graphite system with 128 carbons and 512 electrons and using 256 processes (two per node) on Titan. The box plot consists of seven different pieces of information. The whiskers extend from the 10th percentile (bottom decile) to the 90th percentile (top decile). Outliers are placed at the ends of the bottom and top decile whiskers (outlier caps). The top, bottom, and line through the middle of the box correspond to the 75th percentile (top box line), the 25th percentile (bottom box line), and the 50th percentile (middle box line). A square is used to indicate the arithmetic mean. The height of the box indicates a large variability in the number of walkers across the 256 processes at each step. For this QMCPack simulation, for each step, the mean (square) is equal to the median (middle box line), suggesting that a normal distribution of the number of walkers across processes per step can statistically model the number of writing walkers per process at a given simulation step. Note that each walker contains the same amount of data per write (i.e., 65 KB), and thus the total size of the I/O data per process can be obtained by multiplying the number of walkers by 65 KB. Figure 2.b shows an example of the normal distribution used to generate the I/O profile of QMCPack for one single step.
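The box-plot summary just described can be reproduced with a few lines of NumPy. The sketch below uses made-up walker counts standing in for the measured QMCPack data; it computes the 10th/25th/50th/75th/90th percentiles and the mean used in Figure 2.a, and checks whether the mean and median are close enough to justify the normal-distribution assumption.

    import numpy as np

    def boxplot_summary(walkers_per_process):
        """Summary statistics of the kind plotted in Figure 2.a (one I/O step)."""
        w = np.asarray(walkers_per_process)
        p10, p25, p50, p75, p90 = np.percentile(w, [10, 25, 50, 75, 90])
        return {"p10": p10, "p25": p25, "median": p50, "p75": p75, "p90": p90,
                "mean": w.mean()}

    # Synthetic stand-in for one step of a 256-process run (not measured data).
    rng = np.random.default_rng(1)
    step_counts = rng.normal(loc=40, scale=8, size=256).round().astype(int)

    stats = boxplot_summary(step_counts)
    print(stats)
    # Mean close to median is the cue that a normal model of walker counts is reasonable.
    print("mean ~ median:", abs(stats["mean"] - stats["median"]) < 0.05 * stats["median"])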
While the number of walkers per process changes across steps, the overall distribution remains normal. Figure 2.c shows the box plot of the number of walkers written to output for the synthetically generated I/O of the QMCPack simulations used in our tests when performing 15 DMC steps of the large graphite system.

B. ENZO

ENZO is a parallel adaptive mesh refinement (AMR) application for computational astrophysics and cosmology. The ENZO algorithm uses high spatial and temporal resolution for modeling astrophysical fluid flows. Data is organized in Cartesian meshes in 1, 2, and 3 dimensions, and meshes of different sizes are written by different processes at runtime. An ENZO simulation changes the accuracy of a solution in certain regions as the simulation evolves. In ENZO, the processes write a variable amount of data per step as the simulation expands and contracts its meshes.

Figure 3.a shows the box plot of an ENZO simulation for 15 steps of the NFW Cool-Core Cluster test on 256 nodes of Titan. The simulation of the cooling flow is performed in an idealized cool-core cluster that resembles the Perseus cluster. The test considers a root mesh of 64^3, a maximum refinement level of 12, and a minimum mass for refinement level exponent of -0.2. In this simulation, as the cooling catastrophe happens, the temperature drops to the bottom of the cooling function. We observe that very few outliers (processes) deal with very large meshes while many processes deal with smaller meshes. As for QMCPack, we observe a large I/O variability (i.e., a wide box plot). Contrary to QMCPack, the variability is not normally distributed but more closely resembles an exponential distribution. This can be deduced from the mean, which is biased by the large data records written by a few outliers, and from the median, which overlaps with the bottom box line. Figure 3.b shows an example of the exponential distribution used to generate the I/O profile of ENZO for one single step. Across steps, as the number of blocks per process changes, the overall distribution remains exponential. Figure 3.c shows the box plot of the number of blocks written to output for the synthetically generated I/O of the ENZO simulations that we use for our testing when performing 15 steps.

C. Applications with Regular I/O Patterns

To prove our hypothesis, we compare and contrast the I/O performance of the two applications described above with the I/O performance of a hypothetical application with a regular I/O pattern. Several applications like S3D and GTC exhibit such a regular I/O pattern since in their simulations each process writes the same amount of data at each I/O step. These applications are easy to mimic synthetically. In an ideal case, the box plot for this type of application presents the 10th, 25th, 50th, 75th, and 90th percentiles collapsed into a single horizontal line, while no outliers are detected. The statistical model used here to mimic the I/O of regular applications assigns a constant amount of data to each writing process at each I/O step.

III. I/O LIBRARY AND SKELETON

We consider the ADIOS I/O library and its Skel tool because Skel allows us to decouple the I/O from the simulation.

Fig. 2: Profile of the number of walkers per process for 15 steps of a QMCPack simulation studying a 4x4x2 graphite system on 256 nodes of Titan (a); statistical model of the I/O pattern mimicking a normal distribution for one simulation step (b); and example of synthetic I/O patterns generated with the statistical model (c).

Fig. 3: Profile of the data written per process for 15 steps of an ENZO simulation studying the NFW Cool-Core Cluster test on 256 nodes of Titan (a); statistical model of the I/O pattern mimicking an exponential distribution for one simulation step (b); and example of synthetic I/O patterns generated with the statistical model (c).

A. ADIOS

ADIOS consists of a suite of I/O methods, an easy-to-use read-and-write API, a set of utilities, and metadata stored in an external XML file [2]. Different I/O methods, such as POSIX and MPI AGGREGATE, can be specified in the XML file, which can be changed at runtime rather than at compile time. ADIOS POSIX (or POSIX) is the simplest of the ADIOS methods; it uses the standard POSIX file API. Each process writes to one file, and an extra global metadata file is created to reference the data output. POSIX obtains high performance when using few processes since it has low overhead; for large process counts, the metadata server of a distributed file system such as Lustre can become a bottleneck. ADIOS MPI AGGREGATE is a hybrid method that first aggregates data to a small subset of processes and then uses MPI-I/O to write to disk. By accumulating data, ADIOS MPI AGGREGATE keeps the load on the Lustre metadata server low and thus continues to perform well even with very large numbers of processes. The simple API allows for minimal changes to the existing code, while the XML file enables I/O method switching without recompiling the code. The XML file is parsed on ADIOS initialization; its metadata contains a description of the data generated and information on how the data should be written out to disk.

B. Skel

Skel is a tool for building the skeleton of an application's I/O by decoupling the I/O component of a complex code from its computation and communication components [8]. Skel takes an ADIOS XML metadata descriptor and a set of parameters and generates C or FORTRAN source code with the appropriate ADIOS I/O calls, data generation, and timers. The execution of the benchmarks on real platforms is faster because Skel does not include computation and communication but rather just I/O; users can easily and efficiently collect a large set of I/O performance information. The original implementation of Skel had a major limitation that reduced its applicability: it only provided a way to specify a fixed I/O size per time step.
We addressed this limitation and extended Skel to enable a variable I/O size per process and per I/O step using three different techniques: an inline definition of functions modeling the I/O of each process; the generation of the processes' I/O based on a statistical distribution; and the use of I/O traces from a real simulation. Each approach is selected by adding additional tags to Skel's XML description of the I/O. When these tags are encountered, Skel uses the XML-specified approach to generate the variable I/O across processes at runtime.
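To make the statistical-distribution technique concrete, here is a minimal Python sketch of how a per-process write size could be drawn for one I/O step under each of the three patterns used in Section IV. It is not Skel's actual code or XML schema, and the function name io_sizes is hypothetical; the 2.4 MB mean, 0.8 MB standard deviation, and lambda of 0.5 (rescaled to the same mean) are the medium-I/O-size parameters given in Section IV.

    import numpy as np

    MEAN_MB = 2.4   # medium I/O size per writer used in Section IV
    STD_MB = 0.8    # standard deviation of the normal (QMCPack-like) pattern
    LAMBDA = 0.5    # rate of the exponential (ENZO-like) pattern, rescaled below

    def io_sizes(pattern, writers, rng):
        """Return one I/O step's write size (in MB) for each writer."""
        if pattern == "constant":      # regular pattern (S3D/GTC-like)
            return np.full(writers, MEAN_MB)
        if pattern == "normal":        # QMCPack-like pattern
            return np.clip(rng.normal(MEAN_MB, STD_MB, writers), 0.0, None)
        if pattern == "exponential":   # ENZO-like pattern
            samples = rng.exponential(1.0 / LAMBDA, writers)
            return samples * (MEAN_MB / samples.mean())  # rescale to the same mean
        raise ValueError(pattern)

    rng = np.random.default_rng(0)
    for pattern in ("constant", "normal", "exponential"):
        sizes = io_sizes(pattern, writers=2048, rng=rng)
        print(f"{pattern:12s} mean={sizes.mean():.2f} MB  max={sizes.max():.2f} MB")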

IV. CASE STUDY

We compare and contrast the I/O times when threading is on and off; when POSIX or MPI AGGREGATE is used; when the aggregation is based on all-gather or brigade collections; when the number of OSTs and aggregators changes; and when the I/O size and I/O pattern change.

A. Time Components

In previous work, we showed how POSIX and MPI AGGREGATE are the most effective ADIOS methods for small and large numbers of writers, respectively [4]. Thus, here we focus our analysis on these two methods. For both methods, the I/O time can be broken down into several phases. In an initial phase, ADIOS reads in the specified XML file and parses it for the I/O parameters to be used for the application run. In this phase, the ADIOS buffer is also allocated. If MPI AGGREGATE is selected as the I/O method, then all of the writers in the application are split into Process Groups (PGs). Each PG is a collection of writers; one of the writers in each PG is designated as the PG aggregator. The aggregator is responsible for gathering the data from the other writers in the PG and saving the data to disk. If the number of aggregators, which is specified in the XML file, evenly divides the number of writers, then each PG will be exactly the same size. If not, one PG will have slightly more or slightly fewer writers than the average. Each aggregator creates a group metadata file on the file system using MPI_File_open and writes out the PG header to it. The process with rank zero is responsible for creating the global metadata file. We call the I/O time to create the groups and define the aggregators ADIOS Group. If POSIX is selected as the I/O method for the run, then each writer creates its own metadata file on the file system. This requires each writer to contact the metadata server; the connection to the metadata server can quickly become a bottleneck as the number of writers grows. The process with rank zero is again responsible for creating the global metadata file, which contains a duplicate copy of all of the metadata in the subfiles created by the other writers. We call the I/O time to create the files and contact the metadata server ADIOS Open. Independently of the I/O method used, each writer copies its data from its application buffers into its ADIOS buffer. This is a simple write-to-memory operation; we call the associated I/O time ADIOS Write. In the last I/O phase, the data is saved to disk. In the case of POSIX, each writer sends its data to disk. In the case of MPI AGGREGATE, each writer sends its data to its PG's aggregator, and then the aggregator writes out the data to disk. Once all of the data is on disk, each writer is then responsible for its metadata. In the case of POSIX, each process is a writer and writes out its own metadata. Each piece of metadata is saved in two places. The first place is in the subfile with the metadata's associated data. The second place is together with all of the other metadata in the global metadata file. The process with rank zero is responsible for this global file and must gather all of the metadata from each writer and write it to the global file. In the case of MPI AGGREGATE, the writers' tasks are performed by the aggregators. We call this time component ADIOS Close.
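As a rough illustration of the process-group split just described, the following Python sketch divides writer ranks into PGs, one aggregator per PG. It assumes ADIOS simply splits writers as evenly as possible when the aggregator count does not divide the writer count; the real ADIOS assignment logic may differ, and the helper split_into_pgs is hypothetical.

    def split_into_pgs(num_writers, num_aggregators):
        """Split writer ranks into process groups (PGs), one aggregator per PG."""
        base, extra = divmod(num_writers, num_aggregators)
        groups, start = [], 0
        for g in range(num_aggregators):
            size = base + (1 if g < extra else 0)   # assumed even-as-possible split
            ranks = list(range(start, start + size))
            groups.append({"aggregator": ranks[0], "writers": ranks})
            start += size
        return groups

    # Example: 10 writers and 4 aggregators cannot split evenly,
    # so two PGs end up with 3 writers and two with 2 writers.
    for pg in split_into_pgs(num_writers=10, num_aggregators=4):
        print(pg)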
B. Threading vs. Non-threading

POSIX has no threading capabilities as an ADIOS method. For MPI AGGREGATE, the main difference when turning threading on or off is in how the open operation is performed. When threading is enabled, the MPI_File_open is performed in a separate thread. The thread is joined back into the main thread during the ADIOS Close, right before the data is written to disk. This means that while all of the ADIOS Writes (and possibly other application computation) are being performed, the open is happening asynchronously. This helps to mitigate the large overhead associated with the Lustre metadata server on systems like Titan. It is also worth mentioning that threading is also used within the ADIOS Close when using the MPI AGGREGATE method. Once all of the data has been written to the subfiles, the associated metadata must also be written to the subfiles. The metadata must also be sent to the process with rank zero and written out to the global metadata file. These two operations are performed simultaneously when using threading.

To study the impact of threading, we compare the performance in terms of total time for the regular I/O pattern (i.e., the Constant pattern) and the irregular I/O patterns (i.e., Normal for the QMCPack-like I/O pattern and Exponential for the ENZO-like I/O pattern) without (w/o) and with threading. In the constant pattern, each process writes 2.4 MBytes of data. In the normal pattern, samples are taken from a normal distribution with a mean of 2.4 MBytes and a standard deviation of 0.8 MBytes. In the exponential pattern, samples are taken from an exponential distribution with a lambda of 0.5; the samples are then scaled so that the average is 2.4 MBytes. This size is considered a medium I/O size for our tests. In Figure 4 we compare the total time with and without threading for 15 steps of the three I/O patterns using 2048 writers (two processes per node and 1024 nodes of Titan). We use a fixed aggregation level of two writers per aggregator. We observe that threading clearly provides better performance by reducing the ADIOS Group time to a negligible fraction of the total time. Once threading is turned on, the main time component is the ADIOS Close time. We observed that the total times and the ADIOS Close times match for all three I/O patterns, indicating how, for medium I/O size, this time component is the driving performance factor. An increase in the number of writers and associated nodes does not substantially change this observation. Figure 5 shows how the ADIOS Close times match for all three I/O patterns when using 2048 and 4096 writers (with two writers per node) on Titan with threading.

The general conclusion that threading should be on as the default configuration for ADIOS is not surprising. What is surprising is that the performance is dictated by the ADIOS Close times and that the performance values are insensitive to the type of pattern at this level of I/O size, system size, and aggregation level.

C. POSIX vs. MPI AGGREGATE

Intuitively, POSIX should be outperformed by MPI AGGREGATE as the number of writers grows. To assess this hypothesis, we compare POSIX versus MPI AGGREGATE for the three I/O patterns when using 2048 and 4096 writers (with two writers per node) on Titan. Figure 6 shows the total I/O times with 2048 writers on Titan with MPI AGGREGATE and POSIX.

Fig. 4: Total times for the regular (i.e., Constant) and irregular (i.e., Normal and Exponential) patterns when using medium I/O size and 2048 writers on Titan without (a-c) and with (d-f) threading.

Fig. 5: ADIOS Close times for the regular (i.e., Constant) and irregular (i.e., Normal and Exponential) patterns when using medium I/O size with 2048 writers (a-c) and 4096 writers (d-f) on Titan.

Similar performance was measured for 4096 writers (not shown in this paper). The results in Figure 6 are again not surprising, as turning on threading removes the impact of the ADIOS Group time on the total time. On the other hand, the opening phase for POSIX does not benefit from threading. In results not shown in this paper, we also observe that turning off threading causes the MPI AGGREGATE method to become more costly than POSIX, even for 4096 writers, confirming the need for threading in I/O operations. The relevance of these tests is that they confirm the insensitivity of the I/O times to the I/O patterns for both POSIX and MPI AGGREGATE at medium I/O size: the total times exhibit the same behavior for the three I/O patterns with POSIX as they do with MPI AGGREGATE.

Fig. 6: Total I/O times for all three I/O patterns (i.e., Constant, Normal, and Exponential) when using 2048 writers on Titan with MPI AGGREGATE with threading (a-c) and with POSIX (d-f).

D. Aggregation: All-gather vs. Brigade

Our tests point out the critical role of the ADIOS Close times in the overall performance. Specifically, we observe how the ADIOS Close times match the total times in size and behavior across the different I/O patterns. The closing operation used as the default in MPI AGGREGATE is called brigade aggregation. ADIOS also supports a simple aggregation called all-gather. To better understand the dynamics of the closing operation, we compare the ADIOS Close times for the simple aggregation with those for the brigade aggregation.

The simple all-gather aggregation (called here AGG=1) is the naive way of gathering data. Initially, each aggregator allocates a block of memory large enough to store all of the data from all of the writers in the same PG. This operation is followed by an MPI_Gatherv. Once all of the data is in the aggregator's memory, the data is saved out to disk. The obvious disadvantage is that each aggregator must have enough physical memory available to store all of the data written by its PG; this is most unlikely given the data sizes of today's scientific applications. The brigade aggregation (called here AGG=2) is a more sophisticated way of gathering the data, which avoids the memory limitations present in the simple aggregation. All of the writers within a PG perform an MPI_Allgather communication with the amount of data they plan to write. Each writer then allocates a block of memory equal to the size of the largest block of data to be written by a process in the PG. A line of processes is created, ordered from the largest rank down to the smallest rank (which should be the aggregator). A cascade of data begins to flow from the highest-ranked process down to the aggregator. Each process prepares to receive data from its higher-rank neighbor while simultaneously sending data to its lower-rank neighbor. The aggregator begins saving data out to disk, one writer block at a time. As each iteration completes, the data moves down the line of processes, ending at the aggregator. Although more data movement is involved in the brigade aggregation than in the simple aggregation, the memory requirement of the brigade aggregation is only that of the largest block to be written.

To study the performance dynamics of the two closing algorithms, we compare the ADIOS Close times for each I/O pattern when using 2048 and 4096 writers on Titan with simple aggregation (AGG=1) and brigade aggregation (AGG=2). Once again, for the medium I/O size we do not observe any difference in performance behavior across the three I/O patterns, and thus we display here only the results for the normal distribution. Figure 7 shows these results. The general conclusion is that there is no difference between the two closing approaches in terms of the maximum I/O times of the writers for each I/O step. On the other hand, within each single step the writers exhibit a different performance profile for the two closing approaches.
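To make the memory trade-off concrete, here is a small Python sketch comparing the aggregator-side buffer an all-gather style aggregation needs (the sum of all PG members' blocks) with what a brigade-style pipeline needs (only the largest single block). This is an idealized model, not ADIOS code, and the block sizes are invented ENZO-like values.

    def aggregator_memory(block_sizes_mb):
        """Idealized per-aggregator buffer requirement for one process group (PG)."""
        all_gather = sum(block_sizes_mb)   # must hold every writer's block at once
        brigade = max(block_sizes_mb)      # holds only one block at a time
        return all_gather, brigade

    # Hypothetical PG of 16 writers with irregular (ENZO-like) block sizes in MB.
    blocks = [0.3, 0.5, 1.1, 2.4, 2.4, 3.0, 4.8, 0.2,
              0.9, 2.4, 6.5, 1.7, 2.2, 0.4, 3.9, 2.6]
    gather_mb, brigade_mb = aggregator_memory(blocks)
    print(f"all-gather buffer: {gather_mb:.1f} MB, brigade buffer: {brigade_mb:.1f} MB")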
Note that both sets of tests were run with two writers per node and an aggregation ratio of 2-to-1. In the case of simple aggregation, a thread is spawned and both the data and the metadata are simultaneously gathered at each aggregator. Due to the parameters of these tests, this aggregation can be done through shared memory rather than over Titan's Gemini network. This means that the aggregation happens very quickly, so any non-aggregator writers (half the writers in these test cases) rapidly finish their portion of the ADIOS Close and exit the function.

Fig. 7: ADIOS Close times for the irregular normal pattern when using 2048 and 4096 writers on Titan with simple aggregation, i.e., AGG=1 (a-b), and brigade aggregation, i.e., AGG=2 (c-d).

The remaining writers, the aggregators, must write all of the data and metadata out to disk. This operation is much slower in comparison to the aggregation and causes the large gaps we see in the figures. In the case of brigade aggregation, each aggregator immediately begins writing its data out to disk while simultaneously receiving data from the other writer in the PG. This causes the non-aggregators to block inside the ADIOS Close while waiting for the aggregators to finish writing out the first block of data. Once the aggregators finish writing the first block, the non-aggregators stop blocking on their MPI communication and can exit the ADIOS Close function. The aggregators then write out the second block of data and exit. This explains why half of the writers exit halfway through the ADIOS Close. It is worthwhile to notice that, from the point of view of the memory use on each single node, the brigade aggregation with its lower memory requirements seems to better fit the growing data sizes of applications; thus this is the approach used in the rest of this paper.

E. Number of OSTs and Aggregators

One critical question driving I/O research is the selection of the optimal number of OSTs and aggregators. While the search for such values has been studied in other work [9] and is beyond the scope of this paper, we include here a study of their impact on performance for the sets of values that are normally considered when running applications on Titan. Figure 8 shows the ADIOS Close times for the normal distribution pattern when using 2048 writers (two writers per node) on Titan with different levels of aggregation (i.e., one-to-eight (a), one-to-four (b), and one-to-two (c)). The figure also shows a case in which the number of aggregators and OSTs do not match (i.e., the ratio of aggregators to OSTs is two (d)). Figure 9 shows the ADIOS Close times for the normal pattern when using 4096 writers (two writers per node) on Titan with different parameter values. In both figures we can observe that, as the number of writers grows, the optimal aggregation ratio also changes. For 2048 writers, the optimal ratio is one-to-eight, while for 4096 writers the ratio increases to one-to-sixteen. On the other hand, a different number of OSTs does not seem to impact performance in either case study. Our observations confirm the importance of selecting the proper number of aggregators for the sake of performance. The observations also confirm that the optimal number of aggregators may vary based on the number of writers and the data size (results not shown in this section). On the other hand, the type of I/O pattern (regular or irregular) again seems not to impact performance for the type of tests considered in this paper. Here we present results for the normal distribution, but similar results were observed for the other two patterns.

F. Data Size and I/O Patterns

In all the cases studied so far, which consider a medium I/O size (i.e., around 2.4 MBytes per writer), the time component driving the performance is ADIOS Close and the I/O pattern does not impact the I/O performance.
Moreover, the ADIOS Write times of the processes in each I/O step are almost two orders of magnitude smaller than the ADIOS Close times and closely match the behavior of the data sizes written by the processes themselves. When the data size grows, ADIOS Write should eventually compete with the ADIOS Close time, and the type of I/O pattern should eventually impact performance, as we hypothesized for Figure 1. To prove this claim, we run the same tests with the three different distributions but with I/O data one order of magnitude larger (i.e., around 24 MBytes per writer).

Fig. 8: ADIOS Close times for the irregular normal pattern when using 2048 writers on Titan with different levels of aggregation and OST counts: (a) aggregation 1:8, aggregators:OSTs 1:1; (b) aggregation 1:4, aggregators:OSTs 1:1; (c) aggregation 1:2, aggregators:OSTs 1:1; (d) aggregation 1:4, aggregators:OSTs 2:1.

Fig. 9: ADIOS Close times for the irregular normal pattern when using 4096 writers on Titan with different levels of aggregation and OST counts: (a) aggregation 1:8, aggregators:OSTs 1:1; (b) aggregation 1:4, aggregators:OSTs 1:1; (c) aggregation 1:2, aggregators:OSTs 1:1; (d) aggregation 1:4, aggregators:OSTs 2:1.

Fig. 10: ADIOS Write, ADIOS Close, and total times for the regular I/O pattern using large I/O size with 1-to-16 (a-c) and 1-to-4 (d-f) levels of aggregation.

Fig. 11: ADIOS Write, ADIOS Close, and total times for the exponential (irregular) pattern using large data size with 1-to-16 (a-c) and 1-to-4 (d-f) levels of aggregation.

We also consider different numbers of aggregators. Figure 10.a-c shows the ADIOS Write, ADIOS Close, and total times for the regular I/O pattern with a level of aggregation of one aggregator per 16 writers, and Figure 10.d-f shows the same times but with a level of aggregation of one aggregator per 4 writers. Figure 11.a-f shows the same time components for the exponential distribution of the irregular I/O. For space constraints, we omit the results for the other irregular I/O pattern (i.e., the one with the normal distribution). Two important observations emerge from the comparison of these two figures. First, we observe that the optimal level of aggregation for the regular pattern (1-to-16) is different from the optimal level of aggregation for the irregular pattern (1-to-4). Second, as the data per writer grows in size, the irregular I/O pattern has an increasing impact on the overall performance (i.e., the total time), especially when the optimal level of aggregation is selected (see Figure 11.d-f).

G. Discussion

Our initial hypothesis was that when the simulation processes write an irregular amount of data per process at each I/O step, the overall simulation exhibits increasing I/O times and a substantial loss in scalability. The results in this paper seem to indicate that our hypothesis is true for large data sizes (i.e., when each process writes data in the range of tens of MBytes) but not for smaller I/O sizes (i.e., when each process writes data in the range of a few MBytes or less). Unfortunately, the version of ADIOS used for our tests does not support irregular I/O with very large data sizes (i.e., in the range of hundreds of MBytes). This prevents us from further studying the problem at the very large data range at this point. We also observe how different I/O patterns require different I/O parameter values for optimal performance (e.g., different numbers of OSTs and aggregation levels). The automatic identification of optimal I/O parameter values for each I/O pattern is work in progress.

V. RELATED WORK

The study of I/O performance is still an open challenge. To the best of our knowledge, none of the existing work systematically studies the I/O performance of simulations exhibiting irregular I/O patterns. Recent efforts targeting the performance profiling and tuning of I/O parameters include [9], [10], [11], [12]. As for the work presented in our paper, the work in [9], [10], [11] studies I/O performance at extreme scale for a specific file system and specific I/O libraries. Specifically, in [9], the authors use an evolutionary method to explore the large space of I/O parameters for HDF5-based applications. In [10], the authors present an extensive characterization, tuning, and optimization of parallel I/O on the Jaguar supercomputer. In [11], the authors use a mathematical model to reproduce the file system behavior; simulation results are used to validate the model values, and an auto-tuning tool searches for optimal parameters, starting from the validated model values. In [12], the authors model disk I/O time for a specific type of technology (i.e., SSD) and a specific platform (i.e., Dash, a prototype for the larger 1024-node Gordon system at SDSC). Other I/O efforts study the overall I/O performance for one or multiple applications [13], [14], [15], [16].

VI. CONCLUSIONS

In this paper we benchmarked the I/O of two relevant scientific applications with irregular I/O patterns, QMCPack and ENZO.
Our initial hypothesis that the loss in I/O performance for these applications is related to their irregular I/O patterns was shown to hold for large data sizes but not for medium I/O sizes. As the I/O of applications grows in size at the extreme scale, we expect that I/O patterns will play a larger role in performance tuning. Future work of the group includes defining strategies that integrate the I/O patterns as first-class citizens in I/O tuning.

ACKNOWLEDGMENTS

This work is supported in part by NSF grant CCF 1318445. The authors also want to thank Dr. Norbert Podhorszki for his advice on using ADIOS and Dr. Jeremy Logan for his advice on using Skel.

REFERENCES

[1] The HDF Group. Hierarchical Data Format, Version 5. http://www.hdfgroup.org/hdf5, 2000-2010.
[2] J. F. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin. Flexible I/O and integration for scientific codes through the adaptable I/O system (ADIOS). In Proc. of CLADE 2008, 2008.
[3] J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley. Hybrid algorithms in Quantum Monte Carlo. J. of Physics: Conference Series, 402(1), 2012.
[4] S. Herbein, M. Matheny, M. Wezowicz, J. Kroger, J. Logan, J. Kim, S. Klasky, and M. Taufer. Performance impact of I/O on QMCPack simulations at the petascale and beyond. In Proc. of CSE 2013, 2013.
[5] G. L. Bryan et al. ENZO: An adaptive mesh refinement code for astrophysics. The Astrophysical Journal Supplement Series, 211(19), 2014.
[6] J. H. Chen et al. Terascale direct numerical simulations of turbulent combustion using S3D. Comput. Sci. Disc., 2(015001), 2009.
[7] J. C. Bennett et al. Combining in-situ and in-transit processing to enable extreme-scale scientific analysis. In Proc. of SC12, 2012.
[8] J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian, and M. Wolf. Understanding I/O performance using I/O skeletal applications. In Proc. of Euro-Par 2012, 2012.
[9] B. Behzad, S. Byna, Q. Koziol, H. V. T. Luu, A. Prabhat, J. Huchette, R. Aydt, and M. Snir. Taming parallel I/O complexity with auto-tuning. In Proc. of SC13, 2013.
[10] W. Yu, J. S. Vetter, and H. S. Oral. Performance characterization and optimization of parallel I/O on the Cray XT. In Proc. of IPDPS, 2008.
[11] H. You, Q. Liu, Z. Li, and S. Moore. The design of an auto-tuning I/O framework on Cray XT5 system. In Cray User Group Conference (CUG 11), 2011.
[12] M. R. Meswani, M. A. Laurenzano, L. Carrington, and A. Snavely. Modeling and predicting disk I/O time of HPC applications. In 2010 DoD HPC Modernization Program Users Group Conference, 2010.
[13] M. Gamell, I. Rodero, M. Parashar, J. Bennett, H. Kolla, J. Chen, P.-T. Bremer, A. G. Landge, A. Gyulassy, P. McCormick, S. Pakin, V. Pascucci, and S. Klasky. Exploring power behaviors and trade-offs of in-situ data analytics. In Proc. of SC13, 2013.
[14] T. Jin, F. Zhang, Q. Sun, H. Bui, M. Parashar, H. Yu, S. Klasky, N. Podhorszki, and H. Abbasi. Using cross-layer adaptations for dynamic data management in large scale coupled scientific workflows. In Proc. of SC13, 2013.
[15] N. Podhorszki, S. Klasky, Q. Liu, C. Docan, M. Parashar, H. Abbasi, J. F. Lofstead, K. Schwan, M. Wolf, F. Zheng, and J. Cummings. Plasma fusion code coupling using scalable I/O services and scientific workflow. In Proc. of SC-WORKSHOPS, 2013.
[16] M. Slawinska, M. Clark, M. Wolf, T. Bode, H. Zou, P. Laguna, J. Logan, M. Kinsey, and S. Klasky. A Maya use case: adaptable scientific workflows with ADIOS for general relativistic astrophysics. In Proc. of XSEDE 2013, 2013.