Structured Streams: Data Services for Petascale Science Environments


Patrick Widener, Matthew Wolf, Hasan Abbasi, Matthew Barrick, Jay Lofstead, Jack Pulikottil, Greg Eisenhauer, Ada Gavrilovska, Scott Klasky, Ron Oldfield, Patrick G. Bridges, Arthur B. Maccabe, Karsten Schwan

Abstract

The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that requires annotation, reorganization, or even conversion of their data as it moves between HPC computations and data end users or producers. For instance, data visualization can present data at different levels of detail. Further, actions on data are often dynamic, as new end-user requirements introduce data manipulations unforeseen by original application developers. These factors are driving a rich set of requirements for future petascale I/O systems: (1) high levels of performance and, therefore, flexibility in how data is extracted from petascale codes; (2) the need to support on-demand data annotation and metadata creation and management outside application codes; (3) support for concurrent use of data by multiple applications, like visualization and storage, including associated consistency management and scheduling; and (4) the ability to flexibly access and reorganize physical data storage. We introduce an end-to-end approach to meeting these requirements: Structured Streams, streams of structured data with which methods for data management can be associated whenever and wherever needed. These methods can execute synchronously or asynchronously with data extraction and streaming, they can run on the petascale machine or on associated machines (such as storage or visualization engines), and they can implement arbitrary data annotations, reorganization, or conversions. The Structured Streaming Data System (SSDS) enables high-performance data movement and manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines; it enables the metadata creation and management associated with these movements through specification instead of application coding; and it ensures data consistency in the presence of anticipated or unanticipated data consumers. Two key abstractions implemented in SSDS, I/O graphs and Metabots, provide developers with high-level tools for structuring data movement as dynamically composed topologies. A lightweight storage system avoids traditional sources of I/O overhead while enforcing protected access to data. This paper describes the SSDS architecture, motivating its design decisions and intended application uses. The utility of the I/O graph and Metabot abstractions is illustrated with examples from existing HPC codes executing on Linux Infiniband clusters and Cray XT3 supercomputers. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.

1 Introduction

Large-scale HPC applications face daunting I/O challenges. This is especially true for complex coupled HPC codes like those in climate or seismic modeling, and also for emerging classes of data-intensive HPC applications. Problems arise not only from large data volumes but also from the need to perform activities such as data staging, reorganization, or transformation [24]. Coupled simulations, for instance, may require data staging and conversion, as in multi-scale materials modeling [7], or data remeshing or changes in data layout [17, 14].
Emerging data-intensive applications have additional requirements, such as those derived from their continuous monitoring [23]. Their online monitoring and the visualization of monitoring data require data filtering and conversion, in addition to meeting the basic constraints of low-overhead, flexible extraction of that data [35]. Similar needs exist for data-intensive applications in the sensor domain, where sensor data interpretation requires actions like data cleaning or remeshing [22]. Addressing these challenges presents technical difficulties including:

- scaling to large data volumes and large numbers of I/O clients (i.e., compute nodes), given limited I/O resources (i.e., a limited number of nodes in I/O partitions),
- avoiding excessive overheads on compute nodes (e.g., I/O buffers and compute cycles used for I/O),
- balancing bandwidth utilization across the system, as mismatches will slow down the computational engines, either through blocking or through over-provisioning in the I/O subsystem, and
- offering additional functionality in I/O, including on-demand data annotation, filtering, or similar metadata-centric I/O actions.

Structured Streams, and the Structured Streaming Data System (SSDS) that implements them, are a new approach to petascale I/O that encompasses a number of new I/O techniques aimed at addressing the technical issues listed above:

- Data taps are flexible mechanisms for extracting data from or injecting data into HPC computations; efficiency is gained by making it easy to vary I/O overheads and costs in terms of buffer usage and CPU cycles spent on I/O, and by controlling I/O volumes and frequency.
- Structured data exchanges between all stream participants make it possible to enrich I/O by annotating or changing data, either synchronously or asynchronously with data movement.
- I/O graphs explicitly represent an application's I/O tasks as configurable topologies of the nodes and links used for moving and operating on data. I/O graphs start with lightweight data taps on computational nodes, traverse arbitrary additional task nodes on the petascale machine (including compute and I/O nodes, as desired), and end on storage or visualization engines. Using I/O graphs, developers can flexibly and dynamically partition I/O tasks and concurrently execute them across petascale machines and the ancillary engines supporting their use. Enhanced techniques dynamically manage I/O graph execution, including their scheduling and the I/O costs imposed on petascale applications.
- Metabots are tools for specifying and implementing operations on the data moved by I/O graphs. Metabot specifications include the nature of operations (annotation, organization, modification of data to meet dynamic end-user needs) as well as implementation and interaction details (such as application synchrony, data consistency requirements, or metabot scheduling).
- Lightweight storage separates fast-path data movements from machine to disk from metadata-based operations like file consistency, while preserving access control. Metabots operating on disk-resident data are one method for asynchronously (outside the data fast path) determining the file properties of data extracted from high performance machines.

SSDS is being implemented for leadership-class machines residing at sites like Oak Ridge National Laboratory. Its realization for Cray XT3 and XT4 machines runs data taps on the compute nodes, using Cray's Catamount kernel, and it executes full-featured I/O graphs utilizing nodes, metabots, and lightweight storage both on the Cray XT3/XT4 I/O nodes and on secondary service machines. SSDS also runs on Linux-based clusters using Infiniband RDMA transports in place of the Sandia Portals [3] communication construct. Enhanced techniques for automatically managing I/O graph costs vs. performance have not yet been realized, but measurements shown in this paper demonstrate the basic performance properties of SSDS mechanisms and the cost/performance tradeoffs made possible by their use.
In the remainder of this paper, Section 2 describes the basic I/O structure of the HPC systems targeted by SSDS. Section 3 follows with a description of the Structured Streams abstraction, how it addresses the emerging I/O challenges in the systems described in Section 2, and the implementation of structured streams in SSDS. Section 4 presents experimental results from our prototype SSDS implementation, illustrating how the basic SSDS abstractions that implement structured streams provide a powerful and flexible I/O system for addressing emerging HPC I/O challenges. Section 5 then describes related work, and Section 6 presents conclusions and directions for future research.

2 Background

Figure 1 depicts a representative machine structure for handling the data-intensive applications targeted by our work, derived from our interactions with HPC vendors and with scientists at Sandia, Los Alamos, and Oak Ridge National Laboratories. These architectures have four principal components: (1) a dedicated storage engine with limited attached computational facilities for data mining; (2) a large-scale MPP with compute nodes for application computation and I/O staging nodes for buffering and data reorganization; (3) miscellaneous local compute facilities such as visualization systems; and (4) remote data sources and sinks such as high-bandwidth sensor arrays and remote collaboration sites. The storage engine provides long-term storage, its primary client being the local MPP system. Nodes in the MPP are connected by a high-performance interconnect such as the Cray Seastar or 4x Infiniband with effective data rates of 4GB/sec or more, while this MPP is connected to other local machines, including the data archive, using a high-speed commodity interconnect such as 1GigE. Because these interconnects have lower peak and average bandwidths than the MPP's high-performance interconnect, some application nodes inside the MPP (i.e., service or I/O nodes) are typically dedicated to impedance matching by buffering, staging, and reordering data as it flows into and out of the MPP.

Systems with remote access to the storage engine have lower bandwidths, even when using high-end networks like TeraGrid or DOE's UltraScience network, but they can still produce and consume large amounts of data. Consider, for example, data display or visualization clusters [16], sensor networks, satellites or other telemetry sources, or ground-based sites such as the Long Wavelength Array of radiotelescopes. For all such cases, the supercomputer may be either the source (checkpoints or processed simulation results) or the sink (analysis of data across data sets) for the large data sets resident in the pool of storage.

[Figure 1. System Hardware Architecture: remote sensors and remote clients, a storage engine, visualization facilities, and the local MPP (compute and I/O nodes).]

3 Structured Streams

3.1 Using Structured Streams

To address the I/O challenges in systems and applications such as those described in Section 2, we propose to manage all data movement as structured streams. Conceptually, a structured stream is a sequence of annotations, augmentations, and transformations of typed, structured data; the stream describes how data is conceptually transformed as it moves from data source to sink, along with desired performance, consistency, and quality-of-service requirements for the stream. An SSDS-based application describes its data flows in terms of structured streams. Data generation on an MPP cluster or retrieval from a storage system, for example, may be expressed as one or more structured streams, each of which performs data manipulation according to downstream application requirements. A structured streaming data system (SSDS), then, maps these streams onto runtime resources. In addition to meeting the application's I/O needs, this mapping can be constructed with MPP utilization in mind or to meet overhead constraints. A structured stream is a dynamic, extensible entity; incorporation of new application components into a structured stream is implemented as their attachment to any node in the graph. Such new components can attach as pure sinks (as simple graph endpoints, termed data taps) or include specific additional functionality in more complex subgraphs. A resulting strength of the structured stream model is that data generators need not be modified to accommodate new consumers.

3.2 Example Structured Streams

Structured streams carry data of known structure, but they also offer rich facilities for the runtime creation of additional metadata. For example, metadata may be created to support direct naming, reference by relation, and reference by arbitrary content. Application-centric metadata like the structure or layout of data events can be stated explicitly as part of the extended I/O graph interface offered by SSDS (i.e., data taps), or it may be derived from application programs, compilers, or I/O marshaling interfaces. Consider the I/O tasks associated with the online visualization of data in multi-scale modeling codes. Here, even within a single domain like materials design, different simulation components will use different data representations (due to differences in length or time scales); they will have multiple ways of storing and archiving data; and they will use different programs for analyzing, storing, and displaying data. For example, computational chemistry aspects can be approached either from the viewpoint of chemical physics or from that of physical chemistry.
Although researchers from both sides may agree on the techniques and base representations, there can be fundamental differences in the types of statistics which these two groups may gather. The physicists may be interested in wavefunction isosurfaces and the HOMO/LUMO gap, while the chemists may be more interested in types of bonds and estimated relative energies. More simply, some techniques require the data structures to be reported in wave space, while others require real space. Similar issues arise for many other HPC applications. A brief example demonstrates how application-desired structures and representations of data can be specified within the I/O graphs used to realize structured streams. Consider, for example, the output of the Warp molecular dynamics code [26], a tool developed by Steve Plimpton. It and its descendants have been used by numerous scientists for exploring materials physics problems in chemistry, physics, mechanical, and materials engineering. The output is composed of arrays of three real coordinates, representing the x, y, and z values for an individual atom's location, coupled with additional attributes for the run, including the type of atom, velocities of the atoms and/or temperatures of the ensemble, total energies, and so on.
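A minimal sketch of how such a layout might be recorded as structured-data metadata follows; the field-descriptor form is modeled loosely on PBIO-style layout descriptions, and the structure and field names shown here are illustrative, not taken from Warp or SSDS.

    /* Illustrative sketch: recording the in-memory layout of a Warp-style
     * atom record so stream operators can locate and translate its fields.
     * The descriptor form is PBIO-like but simplified and hypothetical. */
    #include <stddef.h>

    struct atom {
        double pos[3];   /* x, y, z coordinates of the atom          */
        double vel[3];   /* per-atom velocities                      */
        int    type;     /* species index, e.g. distinguishing Fe/Ni */
    };

    struct field_desc {
        const char *name;    /* field name visible to stream operators */
        const char *type;    /* element type                           */
        size_t size, offset; /* layout in memory (or in file blocks)   */
    };

    static const struct field_desc atom_layout[] = {
        { "pos",  "double[3]", sizeof(double) * 3, offsetof(struct atom, pos)  },
        { "vel",  "double[3]", sizeof(double) * 3, offsetof(struct atom, vel)  },
        { "type", "integer",   sizeof(int),        offsetof(struct atom, type) },
    };

A receiving code's expected layout would be described in the same way, and the translation between the two descriptions is what SSDS records, maintains, and applies.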

Part of the identification of the atomic type is coupled to the representation of the force interaction between atoms within the code; Iron atoms interact differently with other Iron atoms than they do with Nickel. Using output data from one code to serve as input for another involves not only capturing the simple positions, but also the appropriate changes in classifications and indexing that the new code may require. I/O graphs use structured data representations to simply record the ways in which such data is laid out in memory (and/or in file blocks) and then maintain and use the translations to other representations. The current implementation of SSDS uses translations that are created manually, although at a higher level than the Perl or Python scripting that is standard practice in the field. In ongoing work related (but not directly linked) to the SSDS effort, we are producing automated tools for creating and maintaining these translations.

3.3 Realizing Structured Streams

The concept of a structured stream only describes how data is transformed. To actually realize structured streams, we have decomposed their specification and implementation into two concrete abstractions, implemented in the Structured Streaming Data System (SSDS), that describe when and where data is transformed: I/O graphs and metabots. Based on the application, the performance and consistency needs of a structured stream, and available system resources, a structured stream is specified as a series of in-band synchronous data movements and transformations across a graph of hosts (an I/O graph) and some number of asynchronous, out-of-band data annotations and transformations (metabots), resulting in a complete description of how, where, and when data will move through a system. The mapping of a structured stream to an I/O graph and a set of metabots in SSDS can change as application developers desire or as run-time conditions dictate. For instance, data format conversion may be performed strictly in an I/O graph for a limited number of consumers, but broken out (according to data availability deadlines) into a combination of I/O graph and metabot actions as the number of consumers grows. This ability to customize structured streams, and their potential for coordination and scheduling of both in-band and out-of-band data activity, makes them a powerful tool for application composition.

3.3.1 I/O graphs

I/O graphs are the data-movement and lightweight transformation engines of structured streams. The nodes of an I/O graph exchange data with each other over the network, receiving, routing, and forwarding as appropriate. I/O graphs are also responsible for annotating data and/or for executing the data manipulations required to filter data or, more generally, make data right for recipients, and to carry out data staging, buffering, or similar actions performed on the structured data events traversing them. Stated more precisely, in an I/O graph, each operation uses one or more specific input data form(s) to determine onward routing and produce potentially different output data form(s).
We identify three types of such operations: inspection, in which an input data form and its contents are read-only and used as input to a yes/no routing, forwarding, or buffering decision for the data; annotation, in which the input data form is not changed, but the data itself might be modified before the routing determination is made; and morphing, in which the input data form (and its contents) is changed to a different data form before the routing determination is made. Regardless of which operations are being performed, I/O graph functionality is dynamically mapped onto MPP compute and service nodes, and onto the nodes of the storage engine and of ancillary machines. The SSDS design also anticipates the development of QoS- and resource-aware methods for data movement [5] and for SSDS graph provisioning [18, 3] to assist in deployment or evolution of I/O graphs. While the I/O graphs shown and evaluated in this paper were explicitly created by developers, higher-level methods can be used to construct I/O graphs. For example, consider a data conversion used in the interface of a high performance molecular modeling code to a visualization infrastructure. This data conversion might involve a 50% reduction in the height and width of a 2-dimensional array holding, say, a bitmap image (yielding 25% of the original data size). Precise descriptions of these conversions can be a basis for runtime generation of binary codes with proper looping and other control constructs to apply the operation to the range of elements specified. Our previous work has used XML specifications from which automated tools can generate the binary codes that implement the necessary data extraction and transport actions across I/O graph nodes.
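As an illustration of what such a generated morphing operation amounts to, the following sketch halves the height and width of a 2-dimensional image event. It is written in the style of the small, C-like language in which SSDS filter and transform functions are expressed (see the implementation discussion below), but the event field names (input, output, pixels, and so on) are hypothetical rather than taken from SSDS.

    {
        /* Hypothetical morphing operator: halve height and width of a 2-D
         * image event, leaving 25% of the original data volume. */
        int r; int c;
        output.height = input.height / 2;
        output.width  = input.width / 2;
        for (r = 0; r < output.height; r = r + 1) {
            for (c = 0; c < output.width; c = c + 1) {
                /* simple decimation: keep every other pixel in each dimension */
                output.pixels[r * output.width + c] =
                    input.pixels[(2 * r) * input.width + (2 * c)];
            }
        }
        return 1;   /* non-zero: forward the morphed event downstream */
    }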

3.3.2 Metabots

Metadata in MPP Applications. The particulars of stream structure may not be readily available from MPP applications. One problem with current applications' use of traditional file systems is that these systems force metadata to be generated in-line with computational results, reducing effective computational I/O bandwidth. As a result, minimal metadata is available on storage devices for use by other applications, and the metadata that is present is frequently only useful to the application that generated it. Additionally, such applications are extremely sensitive to the integrity of their metadata, ruling out modifications which might benefit other clients. Recovering metadata at a later time for use by other applications can reduce to inspection of source codes in order to determine data structure. As noted earlier, this situation is caused by the difference between MPP internal bandwidth and I/O bandwidth. One approach is to recover I/O bandwidth by removing metadata operations and other semantic operations from parallel file systems. However, semantics such as file abstractions, name mapping, and namespaces are commonly relied upon by MPP applications, so removing such metadata permanently is not an option. In fact, we consider the generation, management, and use of these and other types of metadata to be of increasing importance in the development of future high performance applications. Our approach, therefore, is not to require all metadata management to be in the fast data path of these applications, but instead to move as much metadata-related processing as appropriate out of this path. The metabots described below realize this goal.

Metabots and I/O graphs. Metabots provide a specification-based means of introducing asynchronous data manipulation into structured streams. A basic I/O graph emphasizes metadata generation as data is captured and moved (e.g., data source identification). By default, generation is performed in-band, a process analogous to current HPC codes that implicitly create metadata during data storage. Metabots make it possible to move these tasks out-of-band, that is, outside the fast path of data transfer. Specifically, Metabots can coordinate and execute these tasks at times less likely to affect applications, such as after a data set has been written to disk. In particular, metabots provide additional functionality for metadata creation and data organizations unanticipated by application developers but required by particular end-user needs. Colloquially, metabots are metadata agents whose run-time representations crawl storage containers, generating metadata in user-defined ways. For example, in many scientific applications, data may need to be analyzed for both spatial and temporal features. In examining the flux of ions through a particular region of space in a fusion simulation, the raw data (organized by time slice) may need to be transformed so that it is organized as a time series of data within a specific spatial bounding box. This can involve both computationally intensive (e.g., bounding box detection) and I/O-bound phases (e.g., appending data fragments to the time series). The metabot run-time abstraction allows these to occur out-of-band or in-band, as appropriate.
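As a concrete, if simplified, illustration of the data-reorganization work such a metabot performs, the following self-contained sketch shows the bounding-box selection step. In SSDS this selection would be applied while crawling storage objects rather than over a flat in-memory array, and the function shown is illustrative, not part of the SSDS code base.

    /* Illustrative sketch of a bounding-box metabot's selection kernel. */
    #include <stddef.h>

    struct bbox { double lo[3]; double hi[3]; };

    /* Record the indices of particles inside 'region' in 'selected' and
     * return how many were found.  The caller would then append the
     * selected fragments to the per-region time series (the I/O-bound
     * phase), while this detection loop is the CPU-intensive phase. */
    size_t select_in_bbox(const double (*pos)[3], size_t nparticles,
                          const struct bbox *region, size_t *selected)
    {
        size_t count = 0;
        for (size_t i = 0; i < nparticles; i++) {
            int inside = 1;
            for (int d = 0; d < 3; d++)
                if (pos[i][d] < region->lo[d] || pos[i][d] > region->hi[d])
                    inside = 0;
            if (inside)
                selected[count++] = i;
        }
        return count;
    }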
Metabots differ from traditional workflow concepts in two ways: (1) they are tightly coupled to the synchronous data transmission, and need to be flexibly reconfigurable based on what the run-time has or has not completed, and (2) they are confined to specific metadata and data-organizational tasks. As such, the streaming data and metabot run-times could be integrated in future work to serve as a self-tuning actor within a more general workflow system like Kepler [2].

3.4 A Software Architecture for Petascale Data Movement

The data manipulation and transmission mechanism of SSDS leverages extensive prior work with high performance data movement. Key technologies realized in that research and leveraged for this effort include: (1) efficient representations of meta-information about data structure and layout, enabling (2) high-performance and structure-aware manipulations of SSDS data in flight, carried out by dynamically deployed binary codes and specified using higher-level tools, termed XChange [1]; (3) a dynamic overlay optimized for efficient data movement, where data fast-path actions are strongly separated from the control actions necessary to build, configure, and maintain the overlay [3]; and (4) a lightweight object storage facility (LWFS [21]) that provides flexible, high-performance data storage while preserving access controls on data. LWFS implements back-end metadata and data storage. PBIO [1] is an efficient binary runtime representation of metadata. High-performance data paths are realized with the EVPath data movement and manipulation infrastructure [9], and the XChange tool's purpose is to provide I/O graph mapping and management support [3]. Knowledge about the structure and layout of data is integrated into the base layer of I/O graphs. Selected data manipulations can then be directly integrated into I/O graph actions, in order to move only the data that is currently required and (if necessary and possible) to manipulate data to avoid unnecessary data copying due to send/receive data mismatches. The role of CM (Connection Manager) is to manage the communication interfaces of I/O graphs. Also part of EVPath and CM are the control methods and interfaces needed to configure I/O graphs, including deploying the operations that operate on data, linking graph nodes, deleting them, and so on. Potential overheads from high-level control semantics will not affect the data fast-path performance, and alternative realizations of such semantics become possible.

These will be important enablers for integrating actions like file system consistency or conflict management with I/O graphs, for instance. Finally, the metadata used in I/O graphs can be provided by applications, but it can also be derived automatically, by the metabots described in Section 3.3.2. I/O graph nodes use daemon processes to run on the MPP's service nodes, on the storage engine, and on secondary machines like visualization servers or remote sensor machines. In addition, selected I/O graph nodes may run on the MPP's compute engines, to provide to such applications the extended I/O interfaces offered by SSDS. For all I/O graph nodes, operator deployment can use runtime binary code generation techniques to optimize data manipulation for current application needs and platform conditions. Additional control processes not shown in the figure run tasks like optimization of code generation across multiple I/O graph nodes and machines, and I/O graph mapping and management actions.

[Figure 2. SSDS software architecture: an ECho pub/sub manager and the XChange I/O graph manager, Metabots, and EVPath with CM, PBIO, and the CM transports, layered over LWFS.]

Structured streams do not replace standard back-end storage (i.e., file systems) or transports (i.e., network subsystems). Instead, they extend and enhance the I/O functionality seen by high performance applications. The structured stream model of I/O inherently utilizes existing (and future) high performance storage systems and data storage models, but offers a data-driven rather than connection- or file-driven interface for the HPC developer. In particular, developers are provided with enhanced I/O system interfaces and tools to define data formats and the operations to be performed on data in flight between simulation components and I/O actions. As an example, the current implementation of SSDS leverages existing file systems (ext3 and Lustre) and protocols (Portals and IB RDMA). Since the abstraction presented to the programmer is inherently asynchronous and data-driven, however, the run-time can perform data object optimizations (like message aggregation or data validation) in a more efficient way than the corresponding operation on a file object. In contrast, the successful paradigm of MPI I/O [32], particularly when coupled with a parallel file system, heavily leverages the file nature of the data target and utilizes the transport infrastructure as efficiently as possible within that model. However, that inherently means the underlying file system concepts of consistency, global naming, and access patterns will be enforced at the higher level as well. By adopting a model that allows for the embedding of computations within the transport overlay, it is possible to delay execution of, or entirely eliminate, those elements of the file object which the application does not immediately require. If a particular algorithm does not require consistency (as is true of some highly fault-tolerant algorithms), then it is not necessary to enforce it from the application perspective. Similarly, if there is an application-specific concept of consistency (such as validating a checkpoint file before allowing it to overwrite the previous checkpoint), that could be enforced as well, in addition to the more application-driven specifications mentioned earlier.
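As a small, purely illustrative sketch of such an application-specific consistency check (not code from SSDS), an inspection-style operator could verify a checksum carried in the checkpoint event before the event is allowed to overwrite the previous checkpoint; the event structure and checksum scheme below are hypothetical.

    /* Hypothetical inspection operator for checkpoint validation. */
    #include <stdint.h>
    #include <stddef.h>

    struct ckpt_event {
        uint32_t declared_sum;          /* checksum recorded by the writer */
        size_t   nbytes;                /* payload length                  */
        const unsigned char *payload;
    };

    /* Returns non-zero if the checkpoint may be forwarded to storage;
     * zero means drop the event and keep the previous checkpoint. */
    int ckpt_is_consistent(const struct ckpt_event *ev)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < ev->nbytes; i++)   /* simple additive checksum */
            sum += ev->payload[i];
        return sum == ev->declared_sum;
    }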
Implementation

Datatap. The datatap is implemented as a request-read service that is designed for the multiple orders of magnitude difference between the available memory on the I/O and service nodes compared to the compute partition. We assume the existence of a large number of compute nodes producing data (we refer to them as datatap clients) and a smaller number of I/O nodes receiving the data (we refer to them as datatap servers). The datatap client issues a data-available request to the datatap server, encodes the data for transmission, and registers this buffer with the transport for remote read. For very large data sizes, the cost of encoding data can be significant, but it will be dwarfed by the actual cost of the data transfer [12, 11, 4]. On receipt of the request, the datatap server issues a read call. Due to the limited amount of memory available on the datatap server, the server only issues a read call if there is memory available to complete it. The datatap server issues multiple read requests to reduce the request latency as much as possible. The datatap server is performance-bound by the available memory, which restricts the number of concurrent requests and the request service latency. The datatap server acts as a data feed into the I/O graph overlay. The I/O graph can replicate the functionality of writing the output to a file (see Section 4.4), or it can be used to perform in-flight data transformations (see Section 4.4).
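A rough sketch of this request-read exchange follows. The tap_* calls and types are hypothetical placeholders for the Portals or InfiniBand verbs operations actually used, the sketch assumes a single server, and error handling is omitted.

    /* Hypothetical sketch of the datatap request-read protocol. */
    #include <stddef.h>

    typedef struct { void *buf; size_t len; } tap_handle;   /* registered buffer */
    typedef struct { tap_handle handle; size_t length; } tap_request;

    /* Placeholder transport and helper operations (assumed, not the SSDS API). */
    extern void      *encode_event(const void *data, size_t len, size_t *elen);
    extern tap_handle tap_rdma_register(void *buf, size_t len);
    extern void       tap_send_request(int server, tap_handle h, size_t len);
    extern tap_request tap_next_request(void);
    extern void       tap_defer(tap_request req);
    extern void       tap_rdma_read(tap_handle h, void *dst, size_t len);
    extern int        memory_available(size_t len);
    extern void      *alloc_from_pool(size_t len);
    extern void       io_graph_submit(void *data, size_t len);

    /* Client (compute node): announce data, then return to computing. */
    void datatap_publish(int server, const void *data, size_t len)
    {
        size_t elen;
        void *buf = encode_event(data, len, &elen);   /* PBIO-style encoding    */
        tap_handle h = tap_rdma_register(buf, elen);  /* expose for remote read */
        tap_send_request(server, h, elen);            /* "data available"       */
        /* The transfer itself overlaps with computation; the client blocks
           again only when it must reuse or release the registered buffer. */
    }

    /* Server (I/O node): issue reads only when buffer memory is available. */
    void datatap_service_loop(void)
    {
        for (;;) {
            tap_request req = tap_next_request();
            if (!memory_available(req.length)) { tap_defer(req); continue; }
            void *dst = alloc_from_pool(req.length);
            tap_rdma_read(req.handle, dst, req.length); /* pull from the client */
            io_graph_submit(dst, req.length);           /* feed the I/O graph   */
        }
    }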

We currently have two implementations of the datatap, using Infiniband user-level verbs and the Sandia Portals interface. We needed the multiple implementations in order to support both our local Linux clusters and the Cray XT3 platform. The two implementations have a common design and hence common performance, except in one regard: the Infiniband user-level verbs do not provide a reliable datagram (RD) transport, increasing the time spent in issuing a data-available request (see Figure 8).

I/O graph implementation. Actual implementation of I/O graph data transport and processing is accomplished via a middleware package designed to facilitate the dynamic composition of overlay networks for message passing. The principal abstraction in this infrastructure is stones (as in "stepping stones"), which are linked together to compose a data path. Message flows between stones can be both intra- and inter-process, with inter-process flows being managed by special output stones. The taxonomy of types of stones is relatively broad, but includes: terminal stones, which implement data sinks; filter stones, which can optionally discard data; transform stones, which modify data; and split stones, which implement data-based routing decisions and may redirect incoming data to one or more other stones for further processing. I/O graphs are highly efficient because the underlying transport mechanism performs only minimal encoding on the sending side and uses dynamic code generation to perform highly efficient decoding on the receiving side. The functions that filter and transform data are represented in a small C-like language. These functions can be transported between nodes in source form, but when placed in the overlay we use dynamic code generation to create a native version of these functions. This dynamic code generation capability is based on a lower-level package that provides for dynamic generation of a virtual RISC instruction set. Above that level, we provide a lexer, parser, semanticizer, and code generator, making the equivalent of a just-in-time compiler for the small language. As such, the system generates native machine code directly into the application's memory without reference to an external compiler. Because we do not rely upon a virtual machine or other sand-boxing technique, these filters can run at roughly the speed of unoptimized native code and can be generated considerably faster than forking an external compiler.
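To make the stone vocabulary concrete, the following sketch composes a small path in which a filter stone discards uninteresting events, a split stone fans the survivors out to a local storage sink and a remote visualization node, and the filter body is carried as source in the small C-like language. The stone_* calls are simplified, hypothetical stand-ins for the middleware's actual interface, and the event field used in the filter (timestep) is likewise illustrative.

    /* Sketch of composing an I/O graph segment from stones (hypothetical API). */
    typedef int stone_id;

    extern stone_id stone_alloc(void);
    extern void     stone_set_filter(stone_id s, const char *filter_source);
    extern void     stone_set_split(stone_id s, stone_id *targets, int ntargets);
    extern void     stone_set_terminal(stone_id s, void (*handler)(void *event));
    extern void     stone_set_bridge(stone_id s, const char *remote_contact,
                                     stone_id remote_stone);
    extern void     stone_link(stone_id from, stone_id to);

    /* Filter in the small C-like language, shipped in source form and
       compiled to native code on whichever node the stone is deployed. */
    static const char *every_tenth_step =
        "{ if (input.timestep % 10 != 0) return 0;  /* discard */ "
        "  return 1;                                /* forward */ }";

    void build_segment(const char *viz_contact, stone_id viz_remote,
                       void (*store_handler)(void *event))
    {
        stone_id filter = stone_alloc();
        stone_id split  = stone_alloc();
        stone_id store  = stone_alloc();
        stone_id viz    = stone_alloc();

        stone_set_filter(filter, every_tenth_step);
        stone_set_terminal(store, store_handler);        /* local data sink      */
        stone_set_bridge(viz, viz_contact, viz_remote);  /* inter-process output */

        stone_id targets[2] = { store, viz };
        stone_set_split(split, targets, 2);              /* fan out to both      */
        stone_link(filter, split);                       /* filter -> split path */
    }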
4 Experimental Evaluation

4.1 Overview

To evaluate the effectiveness of our prototype SSDS implementation, we conducted a variety of experiments to understand its performance characteristics. In particular, we conducted experiments using a combination of HPC-oriented I/O benchmarks that test individual portions of SSDS, including I/O graphs, the datatap, and Metabots, and a prototype full-system SSDS application benchmark using a modified version of the GTC [23] HPC code. As an experimental testbed, we utilized a cluster of 53 dual-processor 3.2 GHz Intel EM64T nodes, each with 6 GB of memory, running Red Hat Enterprise Linux AS release 4 with an ELsmp kernel. Nodes were connected by a non-blocking 4x Infiniband interconnect using the IB TCP/IP implementation. I/O services in the cluster are provided by dedicated cluster nodes containing 73 GB 10k RPM Seagate ST37327LC Ultra320 SCSI disks. Underlying SSDS I/O service was provided by a prototype implementation of the Sandia Lightweight File Systems (LWFS) [21], communicating using the user-level Portals TCP/IP reference implementation [3]. Note that the Portals reference implementation, unlike the native Portals implementation on Cray Seastar-based systems, is a largely unoptimized communication subsystem. Because this communication infrastructure currently constrains the absolute performance of SSDS on this platform, our experiments focus on relative comparisons instead of absolute performance numbers.

4.2 Metabot Evaluation

Overview. To understand the potential performance benefits of Metabots in SSDS, we ran several metadata manipulation benchmarks with various SSDS configurations, using different setups of a benchmark based on the LLNL fdtree benchmark [13], which we shall call fdtree-prime. The benchmark essentially creates a file hierarchy parametrized on the depth of the hierarchy, the number of directories at each depth, and the number and size of files in each directory. In particular, we focused on comparing the benchmark performance with metadata manipulation inline and with metadata manipulation moved out-of-band to a Metabot. In situations where applications create large file hierarchies which are accessed at a later point in time, in-band file creation can incur significant metadata manipulation overhead. Since the application programmer generally knows the structure of this hierarchy, including file names, numbers, and sizes, it may frequently be possible to move namespace metadata manipulations out-of-band.

[Figure 3. Out-of-band metadata reconstruction: (a) scaling with the number of files, (b) scaling with directory depth. Time in seconds for the naming, raw, reconstruct, and LWFS configurations.]

We have implemented this optimization using SSDS Metabots, where the application can write directly to the resulting storage targets without incurring in-band metadata manipulation costs during application execution. Subsequently, a Metabot creates the needed filename-to-object mappings out-of-band.

Setup. To understand the potential performance benefits of Metabots, we built a Metabot that performs specified file hierarchy metadata manipulations according to a specification, after the actual raw I/O writes done by fdtree-prime. We then compared the time taken to run fdtree-prime setups with metadata manipulation in-band and metadata manipulation out-of-band, as well as the amount of time needed for out-of-band metadata construction. For these tests, we used two different fdtree-prime setups: one which created an increasing number of files of size 4KB in a single directory, and one that created the same-sized files in an increasingly deep directory hierarchy in which each directory contained 1 files and 2 subdirectories. Experiments were run on 4 nodes of the cluster described in Section 4.1. One node executed the benchmark itself, while three nodes ran the LWFS authorization, naming, and storage servers.

Results. In our first experiment, we see that our raw write performance gets increasingly better with an increase in the number of files. Even with a flat directory structure, at 65,536 files, the write performance with inline-metadata creation is 7% slower than a raw write. In the second experiment, the performance gain is even more apparent. With a depth of 5 levels, and 2 directories per level, the write performance with inline-metadata creation is 9.7 times slower than a raw write. In both cases, the metadata construction Metabot takes about the same time as the inline-metadata benchmark.

Analysis. While the above results demonstrate how moving metadata creation out-of-band can significantly increase the performance of in-band activity, the sum total of raw-write time and construction Metabot time is still greater than that of inline-metadata creation. This is so because the construction Metabot has to read from a raw object stored on the LWFS storage server and carry out the same operations as the inline-metadata benchmark. In the current LWFS API, the creation of a file is accompanied by the creation of a new LWFS object containing the data for the new file. Hence, the construction Metabot has to repeat the workload of the inline-metadata benchmark. A more efficient implementation of the API would allow the construction Metabot to avoid file data copies and new object creations and simply create filesystem metadata for the object that was created during the raw-write benchmark. Performance could also have been better had the construction Metabot been deployed on the storage server, as opposed to on a remote node as was done for this series of experiments. Another issue to be addressed is that we do not have metrics for comparing write performance with other parallel file system implementations; this was primarily due to constraints of platform availability. However, LWFS performance characteristics are comparable to Lustre [21].
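For illustration, the core of such a namespace-construction Metabot might look like the following loop over a write manifest; the ns_* calls are hypothetical placeholders and do not correspond to the actual LWFS naming interface.

    /* Illustrative sketch: rebuild namespace entries after raw object writes. */
    #include <stddef.h>

    struct manifest_entry {
        const char   *path;       /* file name the application intends   */
        unsigned long object_id;  /* storage object already written raw  */
    };

    extern void ns_mkdirs(const char *path);                  /* create parent dirs */
    extern void ns_link(const char *path, unsigned long oid); /* map name -> object */

    /* Runs out-of-band, after the raw writes have completed, so none of
     * this metadata work sits on the application's I/O fast path. */
    void namespace_metabot(const struct manifest_entry *m, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            ns_mkdirs(m[i].path);
            ns_link(m[i].path, m[i].object_id);
        }
    }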
4.3 DataTap Evaluation

Overview. The datatap serves as a lightweight, low-overhead extraction system. As such, the datatap replaces the remote filesystem services offered by the I/O nodes of large MPP machines. The datatap is designed to require a minimum level of synchronicity in extracting the data, thus allowing a large overlap between the application's kernel and the data movement. The adverse performance impact of extracting data from an application can be broken down into two parts: the non-asynchronous parts of the data extraction protocol (i.e., the time for the remote node to accept the data transfer request) and the blocking wait for the transfer to complete (e.g., if the transfer time is not fully overlapped by computation) both have an impact on the total computation time of the application.

To reduce this overhead, we designed the datatap in SSDS to have a minimum blocking overhead.

Setup. We have implemented two versions of the datatap: (1) using the low-level Infiniband verbs layer and (2) using the Sandia Portals interface. The Portals interface is optimized for the Cray XT3 environment, but it does offer a TCP/IP module as well. However, the performance of the Portals-over-TCP/IP datatap is orders of magnitude worse than either Infiniband or Portals on the Cray XT3.

Results. We tested the Infiniband datatap on our local Linux cluster described above. The Portals datatap was tested offsite on a Cray XT3 at Oak Ridge National Laboratory. The results demonstrate the feasibility, scalability, and limitations of inserting datataps in the I/O fast path.

[Figure 4. Bandwidth during data consumer read (server read data bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3. This bandwidth is less than the maximum available for higher numbers of processors because of multiple overlapping reads.]

First we consider the bandwidth observed for an RDMA read (or Portals get). The results are shown in Figure 4. The maximum bandwidth is available when the data size is large and the number of processors is small. This is due to the increasing number of concurrent read requests as the number of processors increases. For both the Infiniband and the Portals versions, the read bandwidth reaches a minimum value for a specific data size. This occurs when the maximum amount of allocatable memory is reached, forcing the datatap data consumer to schedule requests. The most significant aspect of this metric is that as the number of processors increases, the maximum number of outstanding requests reaches a plateau. Increasing the amount of memory available to the consumer (or server) will result in a higher level of concurrency.

Evaluation. First we look at the time taken to complete a data transfer. The bandwidth graph is shown in Figure 6. The Portals version shows a much higher degree of variability, but the overall pattern is the same for both Infiniband and Portals. For larger numbers of processors, the time to complete a transfer increases proportionally. This increase is caused by the increasing number of outstanding requests on the consumer as the total transferred data increases beyond the maximum amount of memory allocated to the consumer. For small data sizes (the smallest transfer, for example), the time to completion stays almost constant. This is because all the nodes can be serviced simultaneously. The higher latency in the Infiniband version is due to the lack of a connectionless reliable transport. We had to implement a lightweight retransmission layer to address this issue, which invariably limits the performance and scalability observed in our Infiniband-based experiments. Another notable feature is that for the Portals datatap, the time to complete for large data sizes is almost the same. This is because the Cray SeaStar is a high-bandwidth but also high-latency interface. Once the total data size to be transferred increases beyond the total available memory, the performance becomes bottlenecked by the latency. The design of the datatap is such that only a limited number of data producers can be serviced immediately. This is due to the large imbalance between the combined memory of the compute partition and the combined memory of the I/O partition.
Hence, an important performance metric is the average service latency, i.e., the time taken before a transfer request is serviced by the datatap server.

[Figure 5. Bandwidth observed by a single client (client-observed egress bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

[Figure 6. Total time to complete a single data transfer: (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

The request service latency also determines how long the computing application will wait on the blocking send. Figure 7 shows the latency with an increasing number of clients. The latency increase is almost directly proportional to the number of nodes, as the datatap server becomes memory-bound once the number of processors increases beyond a specific limit. The shape of the graphs is different for Infiniband and Portals, but the conclusion is the same: request service latency can be improved by allocating more memory to the datatap servers (for example, by increasing the number of servers). The impact of the memory bottleneck on the aggregate bandwidth observed by the datatap server is shown in Figure 9. The results demonstrate that the aggregate bandwidth increases with an increasing number of nodes, but reaches a maximum when the server becomes memory-limited. Our ongoing work focuses on understanding how to best schedule outstanding service requests so as to minimize the impact of these memory mismatches on the ongoing computation. Time spent in issuing a data transfer request will be the cause of the most significant overhead on the performance of the application. This is because the application only blocks when issuing the send and when waiting for the completion of the data transfer. The actual data transfer is overlapped with the computation of the application. Unfortunately, the Infiniband version of the datatap blocks for a significant period of time (up to 2 seconds for 64 nodes and a transfer of 256 MB/node; see Figure 8(a)). This performance bottleneck is caused by the lack of a reliable connectionless transport layer in the current OpenFabrics distribution.

[Table 1. Comparison of GTC run times on the ORNL Cray XT3 development machine for two input sizes using different data output mechanisms: GTC with no output, GTC writing per-process files to Lustre, and GTC using the datatap.]

Thus, as the request service latency increases, the time taken to complete a send also increases. We are currently looking at ways to bypass this bottleneck. In contrast, the Portals datatap has very low latency, and the latency stays almost constant for an increasing number of nodes. The bulk of the time is spent in marshaling the data, and we believe that this can also be optimized further. This demonstrates the feasibility of the datatap approach for environments with efficiently implemented transport layers.

4.4 Application-level Structured Stream Demonstration

Overview. The power of the structured stream abstraction is to provide a ready interface for programmers to overlap computation with I/O in the high performance environment. In order to demonstrate both the capability of the interface and its performance, we have chosen to implement a structured stream to replace the bulk of the output data from the Gyrokinetic Turbulence Code (GTC) [23]. GTC is a particle-in-cell code for simulating fusion within tokamaks, and it is able to scale to multiple thousands of processors. In its default I/O pattern, the dominant cost is from each processor writing out the local array of particles into a separate file. This corresponds to writing out something close to 1% of the memory footprint of the code, with the write frequency chosen so as to keep the average overhead of I/O to within a reasonable percentage of total execution. As part of the standard process of accumulating and interpreting this data, these individual files are aggregated and parsed into time series, spatially-bounded regions, etc., as per the needs of the following annotation pipeline.

Run-time comparisons. For this experiment, we replace the aforementioned bulk write with a structured stream publish event. We ran GTC with two sets of input parameters, with 528,41 ions and 1,164,82 ions, and compared the run-time for three different configurations. In Table 1, GTC/No output is the GTC configuration with no data output, GTC/Lustre outputs data to a per-process file on the Lustre filesystem, and GTC/Datatap uses SSDS's lightweight datatap functionality for data output. We compare the application run-time on the Cray XT3 development cluster at Oak Ridge National Laboratory. We observe a significant reduction in the overhead caused by the data output (from about 8% on Lustre to about 3% using the datatap). This decrease in overhead is observed when we double the data size (by increasing the number of ions).

[Figure 7. Average latency in request servicing (average observed latency for request completion): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]
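To make this replacement concrete, a sketch of what the per-timestep output change amounts to is shown below. The variable names and both output calls are illustrative placeholders (datatap_publish reuses the hypothetical client-side call from the earlier datatap sketch), not GTC's actual code.

    /* Illustrative sketch: swapping GTC's per-process particle dump for a
     * datatap publish.  Names are placeholders, not GTC's real variables. */
    #include <stddef.h>
    #include <stdio.h>

    extern void write_particle_file(const char *path, const double *particles,
                                    size_t n);               /* original I/O path */
    extern void datatap_publish(int server, const void *data, size_t len);

    void output_timestep(int rank, int step, int datatap_server,
                         const double *particles, size_t nparticles)
    {
    #ifdef USE_DATATAP
        /* Structured-stream path: hand the particle array to the local data
         * tap and return; the transfer overlaps the next compute phase. */
        datatap_publish(datatap_server, particles, nparticles * sizeof(double));
    #else
        /* Default GTC pattern: synchronous per-process file write. */
        char path[64];
        snprintf(path, sizeof path, "particles.%d.%d", rank, step);
        write_particle_file(path, particles, nparticles);
    #endif
    }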

[Figure 8. Time to issue a data transfer request: (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

I/O graph evaluation. The structured stream is configured with a simple I/O graph: datataps are placed in each of the GTC processes, feeding out asynchronously to an I/O node. From the I/O node, each of the messages is forwarded to a graph node where the data is partitioned into different bounding boxes, and then copies of both the whole data and the multiple small partitioned data sets are forwarded on to the storage nodes. In the first implementation, once the data is received by the datatap server, we filter based on the bounding box and then transfer the data for visualization. The time taken to perform the bounding box computation is 2.29s and the time to transfer the filtered data is 0.37s. In the second implementation, we transfer the data first and run the bounding box filter after the data transfer. The time taken for the bounding box filter is the same (2.29s), but the time taken to transfer the data increases to 0.297s. In the first implementation, the total time taken to transfer the data and run the bounding box filter is lower, but the computation is performed on the datatap server, with a higher impact on the performance of the datatap and thus higher request service latency. In the second implementation, the computation is performed on a remote node, thereby reducing the impact on the datatap.

5 Related Work

A number of different systems (among them NASD [15], Panasas [25], PVFS [19], and Lustre [6]) provide high-performance parallel file systems. Unlike these systems, SSDS provides a more general framework for manipulating data moving to and from storage. In particular, the higher-level semantic metadata information available in Structured Streams allows it to make more informed scheduling, staging, and buffering decisions than these systems. Each of these systems could, however, be used as an underlying storage system for SSDS, in a way similar to how SSDS currently uses LWFS.

[Figure 9. Aggregate bandwidth observed by the data consumer (server-observed ingress bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]


More information

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments LCI HPC Revolution 2005 26 April 2005 Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments Matthew Woitaszek matthew.woitaszek@colorado.edu Collaborators Organizations National

More information

Guidelines for Efficient Parallel I/O on the Cray XT3/XT4

Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 Jeff Larkin, Cray Inc. and Mark Fahey, Oak Ridge National Laboratory ABSTRACT: This paper will present an overview of I/O methods on Cray XT3/XT4

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan PHX: Memory Speed HPC I/O with NVM Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan Node Local Persistent I/O? Node local checkpoint/ restart - Recover from transient failures ( node restart)

More information

EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING

EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING A Thesis Presented to The Academic Faculty by Gerald F. Lofstead II In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

ECE7995 (7) Parallel I/O

ECE7995 (7) Parallel I/O ECE7995 (7) Parallel I/O 1 Parallel I/O From user s perspective: Multiple processes or threads of a parallel program accessing data concurrently from a common file From system perspective: - Files striped

More information

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions Roger Goff Senior Product Manager DataDirect Networks, Inc. What is Lustre? Parallel/shared file system for

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY Table of Contents Introduction 3 Performance on Hosted Server 3 Figure 1: Real World Performance 3 Benchmarks 3 System configuration used for benchmarks 3

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

Integrating Analysis and Computation with Trios Services

Integrating Analysis and Computation with Trios Services October 31, 2012 Integrating Analysis and Computation with Trios Services Approved for Public Release: SAND2012-9323P Ron A. Oldfield Scalable System Software Sandia National Laboratories Albuquerque,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

pnfs and Linux: Working Towards a Heterogeneous Future

pnfs and Linux: Working Towards a Heterogeneous Future CITI Technical Report 06-06 pnfs and Linux: Working Towards a Heterogeneous Future Dean Hildebrand dhildebz@umich.edu Peter Honeyman honey@umich.edu ABSTRACT Anticipating terascale and petascale HPC demands,

More information

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation

More information

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System RAIDIX Data Storage Solution Clustered Data Storage Based on the RAIDIX Software and GPFS File System 2017 Contents Synopsis... 2 Introduction... 3 Challenges and the Solution... 4 Solution Architecture...

More information

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas Institute of Computer Science (ICS) Foundation for Research and

More information

PreDatA - Preparatory Data Analytics on Peta-Scale Machines

PreDatA - Preparatory Data Analytics on Peta-Scale Machines PreDatA - Preparatory Data Analytics on Peta-Scale Machines Fang Zheng 1, Hasan Abbasi 1, Ciprian Docan 2, Jay Lofstead 1, Scott Klasky 3, Qing Liu 3, Manish Parashar 2, Norbert Podhorszki 3, Karsten Schwan

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience Vijay Velusamy, Anthony Skjellum MPI Software Technology, Inc. Email: {vijay, tony}@mpi-softtech.com Arkady Kanevsky *,

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC

More information

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim,Christian Engelmann, and Galen Shipman

More information

Maximizing NFS Scalability

Maximizing NFS Scalability Maximizing NFS Scalability on Dell Servers and Storage in High-Performance Computing Environments Popular because of its maturity and ease of use, the Network File System (NFS) can be used in high-performance

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution CERN-ACC-2013-0237 Wojciech.Sliwinski@cern.ch Report Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution W. Sliwinski, I. Yastrebov, A. Dworak CERN, Geneva, Switzerland

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL Fourth Workshop on Ultrascale Visualization 10/28/2009 Scott A. Klasky klasky@ornl.gov Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL H. Abbasi, J. Cummings, C. Docan,, S. Ethier, A Kahn,

More information

Optimizing LS-DYNA Productivity in Cluster Environments

Optimizing LS-DYNA Productivity in Cluster Environments 10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Extreme I/O Scaling with HDF5

Extreme I/O Scaling with HDF5 Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

WORKFLOW ENGINE FOR CLOUDS

WORKFLOW ENGINE FOR CLOUDS WORKFLOW ENGINE FOR CLOUDS By SURAJ PANDEY, DILEBAN KARUNAMOORTHY, and RAJKUMAR BUYYA Prepared by: Dr. Faramarz Safi Islamic Azad University, Najafabad Branch, Esfahan, Iran. Task Computing Task computing

More information

Hedvig as backup target for Veeam

Hedvig as backup target for Veeam Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...

More information

A GPFS Primer October 2005

A GPFS Primer October 2005 A Primer October 2005 Overview This paper describes (General Parallel File System) Version 2, Release 3 for AIX 5L and Linux. It provides an overview of key concepts which should be understood by those

More information

Assessing performance in HP LeftHand SANs

Assessing performance in HP LeftHand SANs Assessing performance in HP LeftHand SANs HP LeftHand Starter, Virtualization, and Multi-Site SANs deliver reliable, scalable, and predictable performance White paper Introduction... 2 The advantages of

More information

MPI History. MPI versions MPI-2 MPICH2

MPI History. MPI versions MPI-2 MPICH2 MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John

More information

NetApp SolidFire and Pure Storage Architectural Comparison A SOLIDFIRE COMPETITIVE COMPARISON

NetApp SolidFire and Pure Storage Architectural Comparison A SOLIDFIRE COMPETITIVE COMPARISON A SOLIDFIRE COMPETITIVE COMPARISON NetApp SolidFire and Pure Storage Architectural Comparison This document includes general information about Pure Storage architecture as it compares to NetApp SolidFire.

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

Scibox: Online Sharing of Scientific Data via the Cloud

Scibox: Online Sharing of Scientific Data via the Cloud Scibox: Online Sharing of Scientific Data via the Cloud Jian Huang, Xuechen Zhang, Greg Eisenhauer, Karsten Schwan Matthew Wolf,, Stephane Ethier ǂ, Scott Klasky CERCS Research Center, Georgia Tech ǂ Princeton

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Structuring PLFS for Extensibility

Structuring PLFS for Extensibility Structuring PLFS for Extensibility Chuck Cranor, Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University What is PLFS? Parallel Log Structured File System Interposed filesystem b/w

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Network Working Group Y. Rekhter Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Status of this Memo Routing in a Multi-provider Internet This memo

More information

Netronome NFP: Theory of Operation

Netronome NFP: Theory of Operation WHITE PAPER Netronome NFP: Theory of Operation TO ACHIEVE PERFORMANCE GOALS, A MULTI-CORE PROCESSOR NEEDS AN EFFICIENT DATA MOVEMENT ARCHITECTURE. CONTENTS 1. INTRODUCTION...1 2. ARCHITECTURE OVERVIEW...2

More information

On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows

On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman 12 th Workflows in Support of Large-Scale Science (WORKS) SuperComputing

More information

AcuSolve Performance Benchmark and Profiling. October 2011

AcuSolve Performance Benchmark and Profiling. October 2011 AcuSolve Performance Benchmark and Profiling October 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox, Altair Compute

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION 5.1 INTRODUCTION Generally, deployment of Wireless Sensor Network (WSN) is based on a many

More information

Table 9. ASCI Data Storage Requirements

Table 9. ASCI Data Storage Requirements Table 9. ASCI Data Storage Requirements 1998 1999 2000 2001 2002 2003 2004 ASCI memory (TB) Storage Growth / Year (PB) Total Storage Capacity (PB) Single File Xfr Rate (GB/sec).44 4 1.5 4.5 8.9 15. 8 28

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

VARIABILITY IN OPERATING SYSTEMS

VARIABILITY IN OPERATING SYSTEMS VARIABILITY IN OPERATING SYSTEMS Brian Kocoloski Assistant Professor in CSE Dept. October 8, 2018 1 CLOUD COMPUTING Current estimate is that 94% of all computation will be performed in the cloud by 2021

More information

Pivot3 Acuity with Microsoft SQL Server Reference Architecture

Pivot3 Acuity with Microsoft SQL Server Reference Architecture Pivot3 Acuity with Microsoft SQL Server 2014 Reference Architecture How to Contact Pivot3 Pivot3, Inc. General Information: info@pivot3.com 221 West 6 th St., Suite 750 Sales: sales@pivot3.com Austin,

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Chapter 4. Fundamental Concepts and Models

Chapter 4. Fundamental Concepts and Models Chapter 4. Fundamental Concepts and Models 4.1 Roles and Boundaries 4.2 Cloud Characteristics 4.3 Cloud Delivery Models 4.4 Cloud Deployment Models The upcoming sections cover introductory topic areas

More information

An Introduction to GPFS

An Introduction to GPFS IBM High Performance Computing July 2006 An Introduction to GPFS gpfsintro072506.doc Page 2 Contents Overview 2 What is GPFS? 3 The file system 3 Application interfaces 4 Performance and scalability 4

More information

LINUX CONTAINERS. Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER

LINUX CONTAINERS. Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Flexible and connected platforms are core components in leading computing fields, including

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information