Structured Streams: Data Services for Petascale Science Environments


Patrick Widener, Matthew Wolf, Hasan Abbasi, Matthew Barrick, Jay Lofstead, Jack Pulikottil, Greg Eisenhauer, Ada Gavrilovska, Scott Klasky, Ron Oldfield, Patrick G. Bridges, Arthur B. Maccabe, Karsten Schwan

Abstract

The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that requires annotation, reorganization, or even conversion of their data as it moves between HPC computations and data end users or producers. For instance, data visualization can present data at different levels of detail. Further, actions on data are often dynamic, as new end-user requirements introduce data manipulations unforeseen by original application developers. These factors are driving a rich set of requirements for future petascale I/O systems: (1) high levels of performance and, therefore, flexibility in how data is extracted from petascale codes; (2) the need to support on-demand data annotation and metadata creation and management outside application codes; (3) support for concurrent use of data by multiple applications, like visualization and storage, including associated consistency management and scheduling; and (4) the ability to flexibly access and reorganize physical data storage. We introduce an end-to-end approach to meeting these requirements: Structured Streams, streams of structured data with which methods for data management can be associated whenever and wherever needed. These methods can execute synchronously or asynchronously with data extraction and streaming, they can run on the petascale machine or on associated machines (such as storage or visualization engines), and they can implement arbitrary data annotations, reorganization, or conversions. The Structured Streaming Data System (SSDS) enables high-performance data movement and manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines; it enables the metadata creation and management associated with these movements through specification instead of application coding; and it ensures data consistency in the presence of anticipated or unanticipated data consumers. Two key abstractions implemented in SSDS, I/O graphs and Metabots, provide developers with high-level tools for structuring data movement as dynamically composed topologies. A lightweight storage system avoids traditional sources of I/O overhead while enforcing protected access to data. This paper describes the SSDS architecture, motivating its design decisions and intended application uses. The utility of the I/O graph and Metabot abstractions is illustrated with examples from existing HPC codes executing on Linux Infiniband clusters and Cray XT3 supercomputers. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.

1 Introduction

Large-scale HPC applications face daunting I/O challenges. This is especially true for complex coupled HPC codes like those in climate or seismic modeling, and also for emerging classes of data-intensive HPC applications. Problems arise not only from large data volumes but also from the need to perform activities such as data staging, reorganization, or transformation [24]. Coupled simulations, for instance, may require data staging and conversion, as in multi-scale materials modeling [7], or data remeshing or changes in data layout [17, 14].
Emerging data-intensive applications have additional requirements, such as those derived from their continuous monitoring [23]. Their online monitoring and the visualization of monitoring data require data filtering and conversion, in addition to meeting the basic constraints of low-overhead, flexible extraction of that data [35]. Similar needs exist for data-intensive applications in the sensor domain, where sensor data interpretation requires actions like data cleaning or remeshing [22]. Addressing these challenges presents technical difficulties including:

- scaling to large data volumes and large numbers of I/O clients (i.e., compute nodes), given limited I/O resources (i.e., a limited number of nodes in I/O partitions),
- avoiding excessive overheads on compute nodes (e.g., I/O buffers and compute cycles used for I/O),
- balancing bandwidth utilization across the system, as mismatches will slow down the computational engines, either through blocking or through over-provisioning in the I/O subsystem, and
- offering additional functionality in I/O, including on-demand data annotation, filtering, or similar metadata-centric I/O actions.

Structured Streams, and the Structured Streaming Data System (SSDS) that implements them, are a new approach to petascale I/O that encompasses a number of new I/O techniques aimed at addressing the technical issues listed above:

- Data taps are flexible mechanisms for extracting data from or injecting data into HPC computations; efficiency is gained by making it easy to vary I/O overheads and costs in terms of buffer usage and CPU cycles spent on I/O, and by controlling I/O volumes and frequency.
- Structured data exchanges between all stream participants make it possible to enrich I/O by annotating or changing data, either synchronously or asynchronously with data movement.
- I/O graphs explicitly represent an application's I/O tasks as configurable topologies of the nodes and links used for moving and operating on data. I/O graphs start with lightweight data taps on computational nodes, traverse arbitrary additional task nodes on the petascale machine (including compute and I/O nodes, as desired), and end on storage or visualization engines. Using I/O graphs, developers can flexibly and dynamically partition I/O tasks and concurrently execute them across petascale machines and the ancillary engines supporting their use. Enhanced techniques dynamically manage I/O graph execution, including their scheduling and the I/O costs imposed on petascale applications.
- Metabots are tools for specifying and implementing operations on the data moved by I/O graphs. Metabot specifications include the nature of operations (annotation, organization, modification of data to meet dynamic end-user needs) as well as implementation and interaction details (such as application synchrony, data consistency requirements, or metabot scheduling).
- Lightweight storage separates fast-path data movements from machine to disk from metadata-based operations like file consistency, while preserving access control. Metabots operating on disk-resident data are one method for asynchronously (outside the data fast path) determining the file properties of data extracted from high performance machines.

SSDS is being implemented for leadership-class machines residing at sites like Oak Ridge National Laboratory. Its realization for Cray XT3 and XT4 machines runs data taps on the compute nodes, using Cray's Catamount kernel, and it executes full-featured I/O graphs utilizing nodes, metabots, and lightweight storage both on the Cray XT3/XT4 I/O nodes and on secondary service machines. SSDS also runs on Linux-based clusters using Infiniband RDMA transports in place of the Sandia Portals [3] communication construct. Enhanced techniques for automatically managing I/O graph costs vs. performance have not yet been realized, but measurements shown in this paper demonstrate the basic performance properties of SSDS mechanisms and the cost/performance tradeoffs made possible by their use.
In the remainder of this paper, Section 2 describes the basic I/O structure of the HPC systems targeted by SSDS. Section 3 follows with a description of the Structured Streams abstraction, how it addresses the emerging I/O challenges in the systems described in Section 2, and the implementation of structured streams in SSDS. Section 4 presents experimental results from our prototype SSDS implementation, illustrating how the basic SSDS abstractions that implement structured streams provide a powerful and flexible I/O system for addressing emerging HPC I/O challenges. Section 5 then describes related work, and Section 6 presents conclusions and directions for future research.

2 Background

Figure 1 depicts a representative machine structure for handling the data-intensive applications targeted by our work, derived from our interactions with HPC vendors and with scientists at Sandia, Los Alamos, and Oak Ridge National Laboratories. These architectures have four principal components: (1) a dedicated storage engine with limited attached computational facilities for data mining; (2) a large-scale MPP with compute nodes for application computation and I/O staging nodes for buffering and data reorganization; (3) miscellaneous local compute facilities such as visualization systems; and (4) remote data sources and sinks such as high-bandwidth sensor arrays and remote collaboration sites. The storage engine provides long-term storage, its primary client being the local MPP system. Nodes in the MPP are connected by a high-performance interconnect such as the Cray Seastar or 4x Infiniband with effective data rates of 4GB/sec or more, while this MPP is connected to other local machines, including the data archive, using a high-speed commodity interconnect such as 1GigE. Because these interconnects have lower peak and average bandwidths than the MPP's high-performance interconnect, some application nodes inside the MPP (i.e., service or I/O nodes) are typically dedicated to impedance matching by buffering, staging, and reordering data as it flows into and out of the MPP.

Systems with remote access to the storage engine have lower bandwidths, even when using high-end networks like TeraGrid or DOE's UltraScience network, but they can still produce and consume large amounts of data. Consider, for example, data display or visualization clusters [16], sensor networks, satellites or other telemetry sources, or ground-based sites such as the Long Wavelength Array of radiotelescopes. For all such cases, the supercomputer may be either the source (checkpoints or processed simulation results) or the sink (analysis of data across data sets) for the large data sets resident in the pool of storage.

[Figure 1. System Hardware Architecture: remote sensors and remote clients, a storage engine, visualization facilities, and the local MPP (compute and I/O nodes).]

3 Structured Streams

3.1 Using Structured Streams

To address the I/O challenges in systems and applications such as those described in Section 2, we propose to manage all data movement as structured streams. Conceptually, a structured stream is a sequence of annotations, augmentations, and transformations of typed, structured data; the stream describes how data is conceptually transformed as it moves from data source to sink, along with desired performance, consistency, and quality-of-service requirements for the stream. An SSDS-based application describes its data flows in terms of structured streams. Data generation on an MPP cluster or retrieval from a storage system, for example, may be expressed as one or more structured streams, each of which performs data manipulation according to downstream application requirements. A structured streaming data system (SSDS), then, maps these streams onto runtime resources. In addition to meeting the application's I/O needs, this mapping can be constructed with MPP utilization in mind or to meet overhead constraints. A structured stream is a dynamic, extensible entity; incorporation of new application components into a structured stream is implemented as their attachment to any node in the graph. Such new components can attach as pure sinks (as simple graph endpoints, termed data taps) or include specific additional functionality in more complex subgraphs. A resulting strength of the structured stream model is that data generators need not be modified to accommodate new consumers.

3.2 Example Structured Streams

Structured streams carry data of known structure, but they also offer rich facilities for the runtime creation of additional metadata. For example, metadata may be created to support direct naming, reference by relation, and reference by arbitrary content. Application-centric metadata like the structure or layout of data events can be stated explicitly as part of the extended I/O graph interface offered by SSDS (i.e., data taps), or it may be derived from application programs, compilers, or I/O marshaling interfaces. Consider the I/O tasks associated with the online visualization of data in multi-scale modeling codes. Here, even within a single domain like materials design, different simulation components will use different data representations (due to differences in length or time scales); they will have multiple ways of storing and archiving data; and they will use different programs for analyzing, storing, and displaying data. For example, computational chemistry aspects can be approached either from the viewpoint of chemical physics or from that of physical chemistry.
Although researchers from both sides may agree on the techniques and base representations, there can be fundamental differences in the types of statistics which these two groups may gather. The physicists may be interested in wavefunction isosurfaces and the HOMO/LUMO gap, while the chemists may be more interested in types of bonds and estimated relative energies. More simply, some techniques require the data structures to be reported in wave space, while others require real space. Similar issues arise for many other HPC applications. A brief example demonstrates how application-desired structures and representations of data can be specified within the I/O graphs used to realize structured streams. Consider, for example, the output of the Warp molecular dynamics code [26], a tool developed by Steve Plimpton. It and its descendants have been used by numerous scientists for exploring materials physics problems in chemistry, physics, mechanical, and materials engineering. The output is composed of arrays of three real coordinates, representing the x, y, and z values for an individual atom's location, coupled with additional attributes for the run, including the type of atom, velocities of the atoms and/or temperatures of the ensemble, total energies, and so on.
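A minimal sketch of how such a layout might be recorded as structured-data metadata follows; the field-descriptor form is modeled loosely on PBIO-style layout descriptions, and the structure and field names shown here are illustrative, not taken from Warp or SSDS.

    /* Illustrative sketch: recording the in-memory layout of a Warp-style
     * atom record so stream operators can locate and translate its fields.
     * The descriptor form is PBIO-like but simplified and hypothetical. */
    #include <stddef.h>

    struct atom {
        double pos[3];   /* x, y, z coordinates of the atom          */
        double vel[3];   /* per-atom velocities                      */
        int    type;     /* species index, e.g. distinguishing Fe/Ni */
    };

    struct field_desc {
        const char *name;    /* field name visible to stream operators */
        const char *type;    /* element type                           */
        size_t size, offset; /* layout in memory (or in file blocks)   */
    };

    static const struct field_desc atom_layout[] = {
        { "pos",  "double[3]", sizeof(double) * 3, offsetof(struct atom, pos)  },
        { "vel",  "double[3]", sizeof(double) * 3, offsetof(struct atom, vel)  },
        { "type", "integer",   sizeof(int),        offsetof(struct atom, type) },
    };

A receiving code's expected layout would be described in the same way, and the translation between the two descriptions is what SSDS records, maintains, and applies.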

Part of the identification of the atomic type is coupled to the representation of the force interaction between atoms within the code; Iron atoms interact differently with other Iron atoms than they do with Nickel. Using output data from one code to serve as input for another involves not only capturing the simple positions, but also the appropriate changes in classifications and indexing that the new code may require. I/O graphs use structured data representations to simply record the ways in which such data is laid out in memory (and/or in file blocks) and then maintain and use the translations to other representations. The current implementation of SSDS uses translations that are created manually, although at a higher level than the Perl or Python scripting that is standard practice in the field. In ongoing work related (but not directly linked) to the SSDS effort, we are producing automated tools for creating and maintaining these translations.

3.3 Realizing Structured Streams

The concept of a structured stream only describes how data is transformed. To actually realize structured streams, we have decomposed their specification and implementation into two concrete abstractions, implemented in the Structured Streaming Data System (SSDS), that describe when and where data is transformed: I/O graphs and metabots. Based on the application, the performance and consistency needs of a structured stream, and available system resources, a structured stream is specified as a series of in-band synchronous data movements and transformations across a graph of hosts (an I/O graph) and some number of asynchronous, out-of-band data annotations and transformations (metabots), resulting in a complete description of how, where, and when data will move through a system. The mapping of a structured stream to an I/O graph and a set of metabots in SSDS can change as application developers desire or as run-time conditions dictate. For instance, data format conversion may be performed strictly in an I/O graph for a limited number of consumers, but broken out (according to data availability deadlines) into a combination of I/O graph and metabot actions as the number of consumers grows. This ability to customize structured streams, and their potential for coordination and scheduling of both in-band and out-of-band data activity, makes them a powerful tool for application composition.

3.3.1 I/O graphs

I/O graphs are the data-movement and lightweight transformation engines of structured streams. The nodes of an I/O graph exchange data with each other over the network, receiving, routing, and forwarding as appropriate. I/O graphs are also responsible for annotating data and/or for executing the data manipulations required to filter data or, more generally, make data right for recipients, and to carry out data staging, buffering, or similar actions performed on the structured data events traversing them. Stated more precisely, in an I/O graph, each operation uses one or more specific input data form(s) to determine onward routing and produce potentially different output data form(s).
We identify three types of such operations: inspection, in which an input data form and its contents are read-only and used as input to a yes/no routing, forwarding, or buffering decision for the data; annotation, in which the input data form is not changed, but the data itself might be modified before the routing determination is made; and morphing, in which the input data form (and its contents) is changed to a different data form before the routing determination is made. Regardless of which operations are being performed, I/O graph functionality is dynamically mapped onto MPP compute and service nodes, and onto the nodes of the storage engine and of ancillary machines. The SSDS design also anticipates the development of QoS- and resource-aware methods for data movement [5] and for SSDS graph provisioning [18, 3] to assist in deployment or evolution of I/O graphs. While the I/O graphs shown and evaluated in this paper were explicitly created by developers, higher-level methods can be used to construct I/O graphs. For example, consider a data conversion used in the interface of a high performance molecular modeling code to a visualization infrastructure. This data conversion might involve a 50% reduction in the height and width of a 2-dimensional array holding, say, a bitmap image (yielding 25% of the original data size). Precise descriptions of these conversions can be a basis for runtime generation of binary codes with proper looping and other control constructs to apply the operation to the range of elements specified. Our previous work has used XML specifications from which automated tools can generate the binary codes that implement the necessary data extraction and transport actions across I/O graph nodes.
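As an illustration of what such a generated morphing operation amounts to, the following sketch halves the height and width of a 2-dimensional image event. It is written in the style of the small, C-like language in which SSDS filter and transform functions are expressed (see the implementation discussion below), but the event field names (input, output, pixels, and so on) are hypothetical rather than taken from SSDS.

    {
        /* Hypothetical morphing operator: halve height and width of a 2-D
         * image event, leaving 25% of the original data volume. */
        int r; int c;
        output.height = input.height / 2;
        output.width  = input.width / 2;
        for (r = 0; r < output.height; r = r + 1) {
            for (c = 0; c < output.width; c = c + 1) {
                /* simple decimation: keep every other pixel in each dimension */
                output.pixels[r * output.width + c] =
                    input.pixels[(2 * r) * input.width + (2 * c)];
            }
        }
        return 1;   /* non-zero: forward the morphed event downstream */
    }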

3.3.2 Metabots

Metadata in MPP Applications. The particulars of stream structure may not be readily available from MPP applications. One problem with current applications' use of traditional file systems is that these systems force metadata to be generated in-line with computational results, reducing effective computational I/O bandwidth. As a result, minimal metadata is available on storage devices for use by other applications, and the metadata that is present is frequently only useful to the application that generated it. Additionally, such applications are extremely sensitive to the integrity of their metadata, ruling out modifications which might benefit other clients. Recovering metadata at a later time for use by other applications can reduce to inspection of source codes in order to determine data structure. As noted earlier, this situation is caused by the difference between MPP internal bandwidth and I/O bandwidth. One approach is to recover I/O bandwidth by removing metadata operations and other semantic operations from parallel file systems. However, semantics such as file abstractions, name mapping, and namespaces are commonly relied upon by MPP applications, so removing such metadata permanently is not an option. In fact, we consider the generation, management, and use of these and other types of metadata to be of increasing importance in the development of future high performance applications. Our approach, therefore, is not to require all metadata management to be in the fast data path of these applications, but instead to move as much metadata-related processing as appropriate out of this path. The metabots described below realize this goal.

Metabots and I/O graphs. Metabots provide a specification-based means of introducing asynchronous data manipulation into structured streams. A basic I/O graph emphasizes metadata generation as data is captured and moved (e.g., data source identification). By default, generation is performed in-band, a process analogous to current HPC codes that implicitly create metadata during data storage. Metabots make it possible to move these tasks out-of-band, that is, outside the fast path of data transfer. Specifically, Metabots can coordinate and execute these tasks at times less likely to affect applications, such as after a data set has been written to disk. In particular, metabots provide additional functionality for metadata creation and data organizations unanticipated by application developers but required by particular end-user needs. Colloquially, metabots are metadata agents whose run-time representations crawl storage containers, generating metadata in user-defined ways. For example, in many scientific applications, data may need to be analyzed for both spatial and temporal features. In examining the flux of ions through a particular region of space in a fusion simulation, the raw data (organized by time slice) may need to be transformed so that it is organized as a time series of data within a specific spatial bounding box. This can involve both computationally intensive (e.g., bounding box detection) and I/O-bound phases (e.g., appending data fragments to the time series). The metabot run-time abstraction allows these to occur out-of-band or in-band, as appropriate.
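As a concrete, if simplified, illustration of the data-reorganization work such a metabot performs, the following self-contained sketch shows the bounding-box selection step. In SSDS this selection would be applied while crawling storage objects rather than over a flat in-memory array, and the function shown is illustrative, not part of the SSDS code base.

    /* Illustrative sketch of a bounding-box metabot's selection kernel. */
    #include <stddef.h>

    struct bbox { double lo[3]; double hi[3]; };

    /* Record the indices of particles inside 'region' in 'selected' and
     * return how many were found.  The caller would then append the
     * selected fragments to the per-region time series (the I/O-bound
     * phase), while this detection loop is the CPU-intensive phase. */
    size_t select_in_bbox(const double (*pos)[3], size_t nparticles,
                          const struct bbox *region, size_t *selected)
    {
        size_t count = 0;
        for (size_t i = 0; i < nparticles; i++) {
            int inside = 1;
            for (int d = 0; d < 3; d++)
                if (pos[i][d] < region->lo[d] || pos[i][d] > region->hi[d])
                    inside = 0;
            if (inside)
                selected[count++] = i;
        }
        return count;
    }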
Metabots differ from traditional workflow concepts in two ways: (1) they are tightly coupled to the synchronous data transmission, and need to be flexibly reconfigurable based on what the run-time has or has not completed, and (2) they are confined to specific metadata and data-organizational tasks. As such, the streaming data and metabot run-times could be integrated in future work to serve as a self-tuning actor within a more general workflow system like Kepler [2].

3.4 A Software Architecture for Petascale Data Movement

The data manipulation and transmission mechanism of SSDS leverages extensive prior work with high performance data movement. Key technologies realized in that research and leveraged for this effort include: (1) efficient representations of meta-information about data structure and layout, enabling (2) high-performance and structure-aware manipulations of SSDS data in flight, carried out by dynamically deployed binary codes and specified using higher-level tools, termed XChange [1]; (3) a dynamic overlay optimized for efficient data movement, where data fast-path actions are strongly separated from the control actions necessary to build, configure, and maintain the overlay [3]; and (4) a lightweight object storage facility (LWFS [21]) that provides flexible, high-performance data storage while preserving access controls on data. LWFS implements back-end metadata and data storage. PBIO [1] is an efficient binary runtime representation of metadata. High-performance data paths are realized with the EVPath data movement and manipulation infrastructure [9], and the XChange tool's purpose is to provide I/O graph mapping and management support [3]. Knowledge about the structure and layout of data is integrated into the base layer of I/O graphs. Selected data manipulations can then be directly integrated into I/O graph actions, in order to move only the data that is currently required and (if necessary and possible) to manipulate data to avoid unnecessary data copying due to send/receive data mismatches. The role of CM (Connection Manager) is to manage the communication interfaces of I/O graphs. Also part of EVPath and CM are the control methods and interfaces needed to configure I/O graphs, including deploying the operations that operate on data, linking graph nodes, deleting them, and so on. Potential overheads from high-level control semantics will not affect the data fast-path performance, and alternative realizations of such semantics become possible.

These will be important enablers for integrating actions like file system consistency or conflict management with I/O graphs, for instance. Finally, the metadata used in I/O graphs can be provided by applications, but it can also be derived automatically, by the metabots described in Section 3.3.2. I/O graph nodes use daemon processes to run on the MPP's service nodes, on the storage engine, and on secondary machines like visualization servers or remote sensor machines. In addition, selected I/O graph nodes may run on the MPP's compute engines, to provide to such applications the extended I/O interfaces offered by SSDS. For all I/O graph nodes, operator deployment can use runtime binary code generation techniques to optimize data manipulation for current application needs and platform conditions. Additional control processes not shown in the figure run tasks like optimization of code generation across multiple I/O graph nodes and machines, and I/O graph mapping and management actions.

[Figure 2. SSDS software architecture: an ECho pub/sub manager and the XChange I/O graph manager, Metabots, and EVPath with CM, PBIO, and the CM transports, layered over LWFS.]

Structured streams do not replace standard back-end storage (i.e., file systems) or transports (i.e., network subsystems). Instead, they extend and enhance the I/O functionality seen by high performance applications. The structured stream model of I/O inherently utilizes existing (and future) high performance storage systems and data storage models, but offers a data-driven rather than connection- or file-driven interface for the HPC developer. In particular, developers are provided with enhanced I/O system interfaces and tools to define data formats and the operations to be performed on data in flight between simulation components and I/O actions. As an example, the current implementation of SSDS leverages existing file systems (ext3 and Lustre) and protocols (Portals and IB RDMA). Since the abstraction presented to the programmer is inherently asynchronous and data-driven, however, the run-time can perform data object optimizations (like message aggregation or data validation) in a more efficient way than the corresponding operation on a file object. In contrast, the successful paradigm of MPI I/O [32], particularly when coupled with a parallel file system, heavily leverages the file nature of the data target and utilizes the transport infrastructure as efficiently as possible within that model. However, that inherently means the underlying file system concepts of consistency, global naming, and access patterns will be enforced at the higher level as well. By adopting a model that allows for the embedding of computations within the transport overlay, it is possible to delay execution of, or entirely eliminate, those elements of the file object which the application does not immediately require. If a particular algorithm does not require consistency (as is true of some highly fault-tolerant algorithms), then it is not necessary to enforce it from the application perspective. Similarly, if there is an application-specific concept of consistency (such as validating a checkpoint file before allowing it to overwrite the previous checkpoint), that could be enforced as well, in addition to the more application-driven specifications mentioned earlier.
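As a small, purely illustrative sketch of such an application-specific consistency check (not code from SSDS), an inspection-style operator could verify a checksum carried in the checkpoint event before the event is allowed to overwrite the previous checkpoint; the event structure and checksum scheme below are hypothetical.

    /* Hypothetical inspection operator for checkpoint validation. */
    #include <stdint.h>
    #include <stddef.h>

    struct ckpt_event {
        uint32_t declared_sum;          /* checksum recorded by the writer */
        size_t   nbytes;                /* payload length                  */
        const unsigned char *payload;
    };

    /* Returns non-zero if the checkpoint may be forwarded to storage;
     * zero means drop the event and keep the previous checkpoint. */
    int ckpt_is_consistent(const struct ckpt_event *ev)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < ev->nbytes; i++)   /* simple additive checksum */
            sum += ev->payload[i];
        return sum == ev->declared_sum;
    }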
Implementation

Datatap. The datatap is implemented as a request-read service that is designed for the multiple orders of magnitude difference between the available memory on the I/O and service nodes compared to the compute partition. We assume the existence of a large number of compute nodes producing data (we refer to them as datatap clients) and a smaller number of I/O nodes receiving the data (we refer to them as datatap servers). The datatap client issues a data-available request to the datatap server, encodes the data for transmission, and registers this buffer with the transport for remote read. For very large data sizes, the cost of encoding data can be significant, but it will be dwarfed by the actual cost of the data transfer [12, 11, 4]. On receipt of the request, the datatap server issues a read call. Due to the limited amount of memory available on the datatap server, the server only issues a read call if there is memory available to complete it. The datatap server issues multiple read requests to reduce the request latency as much as possible. The datatap server is performance-bound by the available memory, which restricts the number of concurrent requests and the request service latency. The datatap server acts as a data feed into the I/O graph overlay. The I/O graph can replicate the functionality of writing the output to a file (see Section 4.4), or it can be used to perform in-flight data transformations (see Section 4.4).
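A rough sketch of this request-read exchange follows. The tap_* calls and types are hypothetical placeholders for the Portals or InfiniBand verbs operations actually used, the sketch assumes a single server, and error handling is omitted.

    /* Hypothetical sketch of the datatap request-read protocol. */
    #include <stddef.h>

    typedef struct { void *buf; size_t len; } tap_handle;   /* registered buffer */
    typedef struct { tap_handle handle; size_t length; } tap_request;

    /* Placeholder transport and helper operations (assumed, not the SSDS API). */
    extern void      *encode_event(const void *data, size_t len, size_t *elen);
    extern tap_handle tap_rdma_register(void *buf, size_t len);
    extern void       tap_send_request(int server, tap_handle h, size_t len);
    extern tap_request tap_next_request(void);
    extern void       tap_defer(tap_request req);
    extern void       tap_rdma_read(tap_handle h, void *dst, size_t len);
    extern int        memory_available(size_t len);
    extern void      *alloc_from_pool(size_t len);
    extern void       io_graph_submit(void *data, size_t len);

    /* Client (compute node): announce data, then return to computing. */
    void datatap_publish(int server, const void *data, size_t len)
    {
        size_t elen;
        void *buf = encode_event(data, len, &elen);   /* PBIO-style encoding    */
        tap_handle h = tap_rdma_register(buf, elen);  /* expose for remote read */
        tap_send_request(server, h, elen);            /* "data available"       */
        /* The transfer itself overlaps with computation; the client blocks
           again only when it must reuse or release the registered buffer. */
    }

    /* Server (I/O node): issue reads only when buffer memory is available. */
    void datatap_service_loop(void)
    {
        for (;;) {
            tap_request req = tap_next_request();
            if (!memory_available(req.length)) { tap_defer(req); continue; }
            void *dst = alloc_from_pool(req.length);
            tap_rdma_read(req.handle, dst, req.length); /* pull from the client */
            io_graph_submit(dst, req.length);           /* feed the I/O graph   */
        }
    }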

We currently have two implementations of the datatap, using Infiniband user-level verbs and the Sandia Portals interface. We needed the multiple implementations in order to support both our local Linux clusters and the Cray XT3 platform. The two implementations have a common design and hence common performance, except in one regard: the Infiniband user-level verbs do not provide a reliable datagram (RD) transport, increasing the time spent in issuing a data-available request (see Figure 8).

I/O graph implementation. Actual implementation of I/O graph data transport and processing is accomplished via a middleware package designed to facilitate the dynamic composition of overlay networks for message passing. The principal abstraction in this infrastructure is stones (as in "stepping stones"), which are linked together to compose a data path. Message flows between stones can be both intra- and inter-process, with inter-process flows being managed by special output stones. The taxonomy of types of stones is relatively broad, but includes: terminal stones, which implement data sinks; filter stones, which can optionally discard data; transform stones, which modify data; and split stones, which implement data-based routing decisions and may redirect incoming data to one or more other stones for further processing. I/O graphs are highly efficient because the underlying transport mechanism performs only minimal encoding on the sending side and uses dynamic code generation to perform highly efficient decoding on the receiving side. The functions that filter and transform data are represented in a small C-like language. These functions can be transported between nodes in source form, but when placed in the overlay we use dynamic code generation to create a native version of these functions. This dynamic code generation capability is based on a lower-level package that provides for dynamic generation of a virtual RISC instruction set. Above that level, we provide a lexer, parser, semanticizer, and code generator, making the equivalent of a just-in-time compiler for the small language. As such, the system generates native machine code directly into the application's memory without reference to an external compiler. Because we do not rely upon a virtual machine or other sand-boxing technique, these filters can run at roughly the speed of unoptimized native code and can be generated considerably faster than forking an external compiler.
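To make the stone vocabulary concrete, the following sketch composes a small path in which a filter stone discards uninteresting events, a split stone fans the survivors out to a local storage sink and a remote visualization node, and the filter body is carried as source in the small C-like language. The stone_* calls are simplified, hypothetical stand-ins for the middleware's actual interface, and the event field used in the filter (timestep) is likewise illustrative.

    /* Sketch of composing an I/O graph segment from stones (hypothetical API). */
    typedef int stone_id;

    extern stone_id stone_alloc(void);
    extern void     stone_set_filter(stone_id s, const char *filter_source);
    extern void     stone_set_split(stone_id s, stone_id *targets, int ntargets);
    extern void     stone_set_terminal(stone_id s, void (*handler)(void *event));
    extern void     stone_set_bridge(stone_id s, const char *remote_contact,
                                     stone_id remote_stone);
    extern void     stone_link(stone_id from, stone_id to);

    /* Filter in the small C-like language, shipped in source form and
       compiled to native code on whichever node the stone is deployed. */
    static const char *every_tenth_step =
        "{ if (input.timestep % 10 != 0) return 0;  /* discard */ "
        "  return 1;                                /* forward */ }";

    void build_segment(const char *viz_contact, stone_id viz_remote,
                       void (*store_handler)(void *event))
    {
        stone_id filter = stone_alloc();
        stone_id split  = stone_alloc();
        stone_id store  = stone_alloc();
        stone_id viz    = stone_alloc();

        stone_set_filter(filter, every_tenth_step);
        stone_set_terminal(store, store_handler);        /* local data sink      */
        stone_set_bridge(viz, viz_contact, viz_remote);  /* inter-process output */

        stone_id targets[2] = { store, viz };
        stone_set_split(split, targets, 2);              /* fan out to both      */
        stone_link(filter, split);                       /* filter -> split path */
    }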
4 Experimental Evaluation

4.1 Overview

To evaluate the effectiveness of our prototype SSDS implementation, we conducted a variety of experiments to understand its performance characteristics. In particular, we conducted experiments using a combination of HPC-oriented I/O benchmarks that test individual portions of SSDS, including I/O graphs, the datatap, and Metabots, and a prototype full-system SSDS application benchmark using a modified version of the GTC [23] HPC code. As an experimental testbed, we utilized a cluster of 53 dual-processor 3.2 GHz Intel EM64T nodes, each with 6 GB of memory, running Red Hat Enterprise Linux AS release 4 with an ELsmp kernel. Nodes were connected by a non-blocking 4x Infiniband interconnect using the IB TCP/IP implementation. I/O services in the cluster are provided by dedicated cluster nodes containing 73 GB 10k RPM Seagate ST37327LC Ultra320 SCSI disks. Underlying SSDS I/O service was provided by a prototype implementation of the Sandia Lightweight File Systems (LWFS) [21], communicating using the user-level Portals TCP/IP reference implementation [3]. Note that the Portals reference implementation, unlike the native Portals implementation on Cray Seastar-based systems, is a largely unoptimized communication subsystem. Because this communication infrastructure currently constrains the absolute performance of SSDS on this platform, our experiments focus on relative comparisons instead of absolute performance numbers.

4.2 Metabot Evaluation

Overview. To understand the potential performance benefits of Metabots in SSDS, we ran several metadata manipulation benchmarks with various SSDS configurations, using different setups of a benchmark based on the LLNL fdtree benchmark [13], which we shall call fdtree-prime. The benchmark essentially creates a file hierarchy parametrized on the depth of the hierarchy, the number of directories at each depth, and the number and size of files in each directory. In particular, we focused on comparing the benchmark performance with metadata manipulation inline and with metadata manipulation moved out-of-band to a Metabot. In situations where applications create large file hierarchies which are accessed at a later point in time, in-band file creation can incur significant metadata manipulation overhead. Since the application programmer generally knows the structure of this hierarchy, including file names, numbers, and sizes, it may frequently be possible to move namespace metadata manipulations out-of-band.

[Figure 3. Out-of-band metadata reconstruction: (a) scaling with the number of files, (b) scaling with directory depth. Time in seconds for the naming, raw, reconstruct, and LWFS configurations.]

We have implemented this optimization using SSDS Metabots, where the application can write directly to the resulting storage targets without incurring in-band metadata manipulation costs during application execution. Subsequently, a Metabot creates the needed filename-to-object mappings out-of-band.

Setup. To understand the potential performance benefits of Metabots, we built a Metabot that performs specified file hierarchy metadata manipulations according to a specification, after the actual raw I/O writes done by fdtree-prime. We then compared the time taken to run fdtree-prime setups with metadata manipulation in-band and metadata manipulation out-of-band, as well as the amount of time needed for out-of-band metadata construction. For these tests, we used two different fdtree-prime setups: one which created an increasing number of files of size 4KB in a single directory, and one that created the same-sized files in an increasingly deep directory hierarchy in which each directory contained 1 files and 2 subdirectories. Experiments were run on 4 nodes of the cluster described in Section 4.1. One node executed the benchmark itself, while three nodes ran the LWFS authorization, naming, and storage servers.

Results. In our first experiment, we see that our raw write performance gets increasingly better with an increase in the number of files. Even with a flat directory structure, at 65,536 files, the write performance with inline-metadata creation is 7% slower than a raw write. In the second experiment, the performance gain is even more apparent. With a depth of 5 levels, and 2 directories per level, the write performance with inline-metadata creation is 9.7 times slower than a raw write. In both cases, the metadata construction Metabot takes about the same time as the inline-metadata benchmark.

Analysis. While the above results demonstrate how moving metadata creation out-of-band can significantly increase the performance of in-band activity, the sum total of raw-write time and construction Metabot time is still greater than that of inline-metadata creation. This is so because the construction Metabot has to read from a raw object stored on the LWFS storage server and carry out the same operations as the inline-metadata benchmark. In the current LWFS API, the creation of a file is accompanied by the creation of a new LWFS object containing the data for the new file. Hence, the construction Metabot has to repeat the workload of the inline-metadata benchmark. A more efficient implementation of the API would allow the construction Metabot to avoid file data copies and new object creations and simply create filesystem metadata for the object that was created during the raw-write benchmark. Performance could also have been better had the construction Metabot been deployed on the storage server, as opposed to on a remote node as was done for this series of experiments. Another issue to be addressed is that we do not have metrics for comparing write performance with other parallel file system implementations; this was primarily due to constraints of platform availability. However, LWFS performance characteristics are comparable to Lustre [21].
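For illustration, the core of such a namespace-construction Metabot might look like the following loop over a write manifest; the ns_* calls are hypothetical placeholders and do not correspond to the actual LWFS naming interface.

    /* Illustrative sketch: rebuild namespace entries after raw object writes. */
    #include <stddef.h>

    struct manifest_entry {
        const char   *path;       /* file name the application intends   */
        unsigned long object_id;  /* storage object already written raw  */
    };

    extern void ns_mkdirs(const char *path);                  /* create parent dirs */
    extern void ns_link(const char *path, unsigned long oid); /* map name -> object */

    /* Runs out-of-band, after the raw writes have completed, so none of
     * this metadata work sits on the application's I/O fast path. */
    void namespace_metabot(const struct manifest_entry *m, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            ns_mkdirs(m[i].path);
            ns_link(m[i].path, m[i].object_id);
        }
    }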
4.3 DataTap Evaluation

Overview. The datatap serves as a lightweight, low-overhead extraction system. As such, the datatap replaces the remote filesystem services offered by the I/O nodes of large MPP machines. The datatap is designed to require a minimum level of synchronicity in extracting the data, thus allowing a large overlap between the application's kernel and the data movement. The adverse performance impact of extracting data from an application can be broken down into two parts: the non-asynchronous parts of the data extraction protocol (i.e., the time for the remote node to accept the data transfer request) and the blocking wait for the transfer to complete (e.g., if the transfer time is not fully overlapped by computation) both have an impact on the total computation time of the application.

To reduce this overhead, we designed the datatap in SSDS to have a minimum blocking overhead.

Setup. We have implemented two versions of the datatap: (1) using the low-level Infiniband verbs layer and (2) using the Sandia Portals interface. The Portals interface is optimized for the Cray XT3 environment, but it does offer a TCP/IP module as well. However, the performance of the Portals-over-TCP/IP datatap is orders of magnitude worse than either Infiniband or Portals on the Cray XT3.

Results. We tested the Infiniband datatap on our local Linux cluster described above. The Portals datatap was tested offsite on a Cray XT3 at Oak Ridge National Laboratory. The results demonstrate the feasibility, scalability, and limitations of inserting datataps in the I/O fast path.

[Figure 4. Bandwidth during data consumer read (server read data bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3. This bandwidth is less than the maximum available for higher numbers of processors because of multiple overlapping reads.]

First we consider the bandwidth observed for an RDMA read (or Portals get). The results are shown in Figure 4. The maximum bandwidth is available when the data size is large and the number of processors is small. This is due to the increasing number of concurrent read requests as the number of processors increases. For both the Infiniband and the Portals versions, the read bandwidth reaches a minimum value for a specific data size. This occurs when the maximum amount of allocatable memory is reached, forcing the datatap data consumer to schedule requests. The most significant aspect of this metric is that as the number of processors increases, the maximum number of outstanding requests reaches a plateau. Increasing the amount of memory available to the consumer (or server) will result in a higher level of concurrency.

Evaluation. First we look at the time taken to complete a data transfer. The bandwidth graph is shown in Figure 6. The Portals version shows a much higher degree of variability, but the overall pattern is the same for both Infiniband and Portals. For larger numbers of processors, the time to complete a transfer increases proportionally. This increase is caused by the increasing number of outstanding requests on the consumer as the total transferred data increases beyond the maximum amount of memory allocated to the consumer. For small data sizes (the smallest transfer, for example), the time to completion stays almost constant. This is because all the nodes can be serviced simultaneously. The higher latency in the Infiniband version is due to the lack of a connectionless reliable transport. We had to implement a lightweight retransmission layer to address this issue, which invariably limits the performance and scalability observed in our Infiniband-based experiments. Another notable feature is that for the Portals datatap, the time to complete for large data sizes is almost the same. This is because the Cray SeaStar is a high-bandwidth but also high-latency interface. Once the total data size to be transferred increases beyond the total available memory, the performance becomes bottlenecked by the latency. The design of the datatap is such that only a limited number of data producers can be serviced immediately. This is due to the large imbalance between the combined memory of the compute partition and the combined memory of the I/O partition.
Hence, an important performance metric is the average service latency, i.e., the time taken before a transfer request is serviced by the datatap server.

[Figure 5. Bandwidth observed by a single client (client-observed egress bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

[Figure 6. Total time to complete a single data transfer: (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

The request service latency also determines how long the computing application will wait on the blocking send. Figure 7 shows the latency with an increasing number of clients. The latency increase is almost directly proportional to the number of nodes, as the datatap server becomes memory-bound once the number of processors increases beyond a specific limit. The shape of the graphs is different for Infiniband and Portals, but the conclusion is the same: request service latency can be improved by allocating more memory to the datatap servers (for example, by increasing the number of servers). The impact of the memory bottleneck on the aggregate bandwidth observed by the datatap server is shown in Figure 9. The results demonstrate that the aggregate bandwidth increases with an increasing number of nodes, but reaches a maximum when the server becomes memory-limited. Our ongoing work focuses on understanding how to best schedule outstanding service requests so as to minimize the impact of these memory mismatches on the ongoing computation. Time spent in issuing a data transfer request will be the cause of the most significant overhead on the performance of the application. This is because the application only blocks when issuing the send and when waiting for the completion of the data transfer. The actual data transfer is overlapped with the computation of the application. Unfortunately, the Infiniband version of the datatap blocks for a significant period of time (up to 2 seconds for 64 nodes and a transfer of 256 MB/node; see Figure 8(a)). This performance bottleneck is caused by the lack of a reliable connectionless transport layer in the current OpenFabrics distribution.

[Table 1. Comparison of GTC run times on the ORNL Cray XT3 development machine for two input sizes using different data output mechanisms: GTC with no output, GTC writing per-process files to Lustre, and GTC using the datatap.]

Thus, as the request service latency increases, the time taken to complete a send also increases. We are currently looking at ways to bypass this bottleneck. In contrast, the Portals datatap has very low latency, and the latency stays almost constant for an increasing number of nodes. The bulk of the time is spent in marshaling the data, and we believe that this can also be optimized further. This demonstrates the feasibility of the datatap approach for environments with efficiently implemented transport layers.

4.4 Application-level Structured Stream Demonstration

Overview. The power of the structured stream abstraction is to provide a ready interface for programmers to overlap computation with I/O in the high performance environment. In order to demonstrate both the capability of the interface and its performance, we have chosen to implement a structured stream to replace the bulk of the output data from the Gyrokinetic Turbulence Code (GTC) [23]. GTC is a particle-in-cell code for simulating fusion within tokamaks, and it is able to scale to multiple thousands of processors. In its default I/O pattern, the dominant cost is from each processor writing out the local array of particles into a separate file. This corresponds to writing out something close to 1% of the memory footprint of the code, with the write frequency chosen so as to keep the average overhead of I/O to within a reasonable percentage of total execution. As part of the standard process of accumulating and interpreting this data, these individual files are aggregated and parsed into time series, spatially-bounded regions, etc., as per the needs of the following annotation pipeline.

Run-time comparisons. For this experiment, we replace the aforementioned bulk write with a structured stream publish event. We ran GTC with two sets of input parameters, with 528,41 ions and 1,164,82 ions, and compared the run-time for three different configurations. In Table 1, GTC/No output is the GTC configuration with no data output, GTC/Lustre outputs data to a per-process file on the Lustre filesystem, and GTC/Datatap uses SSDS's lightweight datatap functionality for data output. We compare the application run-time on the Cray XT3 development cluster at Oak Ridge National Laboratory. We observe a significant reduction in the overhead caused by the data output (from about 8% on Lustre to about 3% using the datatap). This decrease in overhead is observed when we double the data size (by increasing the number of ions).

[Figure 7. Average latency in request servicing (average observed latency for request completion): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]
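To make this replacement concrete, a sketch of what the per-timestep output change amounts to is shown below. The variable names and both output calls are illustrative placeholders (datatap_publish reuses the hypothetical client-side call from the earlier datatap sketch), not GTC's actual code.

    /* Illustrative sketch: swapping GTC's per-process particle dump for a
     * datatap publish.  Names are placeholders, not GTC's real variables. */
    #include <stddef.h>
    #include <stdio.h>

    extern void write_particle_file(const char *path, const double *particles,
                                    size_t n);               /* original I/O path */
    extern void datatap_publish(int server, const void *data, size_t len);

    void output_timestep(int rank, int step, int datatap_server,
                         const double *particles, size_t nparticles)
    {
    #ifdef USE_DATATAP
        /* Structured-stream path: hand the particle array to the local data
         * tap and return; the transfer overlaps the next compute phase. */
        datatap_publish(datatap_server, particles, nparticles * sizeof(double));
    #else
        /* Default GTC pattern: synchronous per-process file write. */
        char path[64];
        snprintf(path, sizeof path, "particles.%d.%d", rank, step);
        write_particle_file(path, particles, nparticles);
    #endif
    }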

[Figure 8. Time to issue a data transfer request: (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]

I/O graph evaluation. The structured stream is configured with a simple I/O graph: datataps are placed in each of the GTC processes, feeding out asynchronously to an I/O node. From the I/O node, each of the messages is forwarded to a graph node where the data is partitioned into different bounding boxes, and then copies of both the whole data and the multiple small partitioned data sets are forwarded on to the storage nodes. In the first implementation, once the data is received by the datatap server, we filter based on the bounding box and then transfer the data for visualization. The time taken to perform the bounding box computation is 2.29s and the time to transfer the filtered data is 0.37s. In the second implementation, we transfer the data first and run the bounding box filter after the data transfer. The time taken for the bounding box filter is the same (2.29s), but the time taken to transfer the data increases to 0.297s. In the first implementation, the total time taken to transfer the data and run the bounding box filter is lower, but the computation is performed on the datatap server, with a higher impact on the performance of the datatap and thus higher request service latency. In the second implementation, the computation is performed on a remote node, thereby reducing the impact on the datatap.

5 Related Work

A number of different systems (among them NASD [15], Panasas [25], PVFS [19], and Lustre [6]) provide high-performance parallel file systems. Unlike these systems, SSDS provides a more general framework for manipulating data moving to and from storage. In particular, the higher-level semantic metadata information available in Structured Streams allows it to make more informed scheduling, staging, and buffering decisions than these systems. Each of these systems could, however, be used as an underlying storage system for SSDS, in a way similar to how SSDS currently uses LWFS.

[Figure 9. Aggregate bandwidth observed by the data consumer (server-observed ingress bandwidth): (a) IB-RDMA data tap on Linux cluster, (b) Portals data tap on Cray XT3.]


More information

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments LCI HPC Revolution 2005 26 April 2005 Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments Matthew Woitaszek matthew.woitaszek@colorado.edu Collaborators Organizations National

More information

Guidelines for Efficient Parallel I/O on the Cray XT3/XT4

Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 Jeff Larkin, Cray Inc. and Mark Fahey, Oak Ridge National Laboratory ABSTRACT: This paper will present an overview of I/O methods on Cray XT3/XT4

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan PHX: Memory Speed HPC I/O with NVM Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan Node Local Persistent I/O? Node local checkpoint/ restart - Recover from transient failures ( node restart)

More information

EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING

EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING EXTREME SCALE DATA MANAGEMENT IN HIGH PERFORMANCE COMPUTING A Thesis Presented to The Academic Faculty by Gerald F. Lofstead II In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

ECE7995 (7) Parallel I/O

ECE7995 (7) Parallel I/O ECE7995 (7) Parallel I/O 1 Parallel I/O From user s perspective: Multiple processes or threads of a parallel program accessing data concurrently from a common file From system perspective: - Files striped

More information

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning

IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning IME (Infinite Memory Engine) Extreme Application Acceleration & Highly Efficient I/O Provisioning September 22 nd 2015 Tommaso Cecchi 2 What is IME? This breakthrough, software defined storage application

More information

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT:

HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms. Author: Correspondence: ABSTRACT: HPC Considerations for Scalable Multidiscipline CAE Applications on Conventional Linux Platforms Author: Stan Posey Panasas, Inc. Correspondence: Stan Posey Panasas, Inc. Phone +510 608 4383 Email sposey@panasas.com

More information

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging Catalogic DPX TM 4.3 ECX 2.0 Best Practices for Deployment and Cataloging 1 Catalogic Software, Inc TM, 2015. All rights reserved. This publication contains proprietary and confidential material, and is

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions

LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions Roger Goff Senior Product Manager DataDirect Networks, Inc. What is Lustre? Parallel/shared file system for

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY Table of Contents Introduction 3 Performance on Hosted Server 3 Figure 1: Real World Performance 3 Benchmarks 3 System configuration used for benchmarks 3

More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

Integrating Analysis and Computation with Trios Services

Integrating Analysis and Computation with Trios Services October 31, 2012 Integrating Analysis and Computation with Trios Services Approved for Public Release: SAND2012-9323P Ron A. Oldfield Scalable System Software Sandia National Laboratories Albuquerque,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

pnfs and Linux: Working Towards a Heterogeneous Future

pnfs and Linux: Working Towards a Heterogeneous Future CITI Technical Report 06-06 pnfs and Linux: Working Towards a Heterogeneous Future Dean Hildebrand dhildebz@umich.edu Peter Honeyman honey@umich.edu ABSTRACT Anticipating terascale and petascale HPC demands,

More information

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation

More information

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System RAIDIX Data Storage Solution Clustered Data Storage Based on the RAIDIX Software and GPFS File System 2017 Contents Synopsis... 2 Introduction... 3 Challenges and the Solution... 4 Solution Architecture...

More information

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing

A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas Institute of Computer Science (ICS) Foundation for Research and

More information

PreDatA - Preparatory Data Analytics on Peta-Scale Machines

PreDatA - Preparatory Data Analytics on Peta-Scale Machines PreDatA - Preparatory Data Analytics on Peta-Scale Machines Fang Zheng 1, Hasan Abbasi 1, Ciprian Docan 2, Jay Lofstead 1, Scott Klasky 3, Qing Liu 3, Manish Parashar 2, Norbert Podhorszki 3, Karsten Schwan

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience

DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience Vijay Velusamy, Anthony Skjellum MPI Software Technology, Inc. Email: {vijay, tony}@mpi-softtech.com Arkady Kanevsky *,

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms Sayantan Sur, Matt Koop, Lei Chai Dhabaleswar K. Panda Network Based Computing Lab, The Ohio State

More information

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC

More information

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim,Christian Engelmann, and Galen Shipman

More information

Maximizing NFS Scalability

Maximizing NFS Scalability Maximizing NFS Scalability on Dell Servers and Storage in High-Performance Computing Environments Popular because of its maturity and ease of use, the Network File System (NFS) can be used in high-performance

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution CERN-ACC-2013-0237 Wojciech.Sliwinski@cern.ch Report Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution W. Sliwinski, I. Yastrebov, A. Dworak CERN, Geneva, Switzerland

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL

Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL Fourth Workshop on Ultrascale Visualization 10/28/2009 Scott A. Klasky klasky@ornl.gov Collaborators from SDM Center, CPES, GPSC, GSEP, Sandia, ORNL H. Abbasi, J. Cummings, C. Docan,, S. Ethier, A Kahn,

More information

Optimizing LS-DYNA Productivity in Cluster Environments

Optimizing LS-DYNA Productivity in Cluster Environments 10 th International LS-DYNA Users Conference Computing Technology Optimizing LS-DYNA Productivity in Cluster Environments Gilad Shainer and Swati Kher Mellanox Technologies Abstract Increasing demand for

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Extreme I/O Scaling with HDF5

Extreme I/O Scaling with HDF5 Extreme I/O Scaling with HDF5 Quincey Koziol Director of Core Software Development and HPC The HDF Group koziol@hdfgroup.org July 15, 2012 XSEDE 12 - Extreme Scaling Workshop 1 Outline Brief overview of

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

WORKFLOW ENGINE FOR CLOUDS

WORKFLOW ENGINE FOR CLOUDS WORKFLOW ENGINE FOR CLOUDS By SURAJ PANDEY, DILEBAN KARUNAMOORTHY, and RAJKUMAR BUYYA Prepared by: Dr. Faramarz Safi Islamic Azad University, Najafabad Branch, Esfahan, Iran. Task Computing Task computing

More information

Hedvig as backup target for Veeam

Hedvig as backup target for Veeam Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...

More information

A GPFS Primer October 2005

A GPFS Primer October 2005 A Primer October 2005 Overview This paper describes (General Parallel File System) Version 2, Release 3 for AIX 5L and Linux. It provides an overview of key concepts which should be understood by those

More information

Assessing performance in HP LeftHand SANs

Assessing performance in HP LeftHand SANs Assessing performance in HP LeftHand SANs HP LeftHand Starter, Virtualization, and Multi-Site SANs deliver reliable, scalable, and predictable performance White paper Introduction... 2 The advantages of

More information

MPI History. MPI versions MPI-2 MPICH2

MPI History. MPI versions MPI-2 MPICH2 MPI versions MPI History Standardization started (1992) MPI-1 completed (1.0) (May 1994) Clarifications (1.1) (June 1995) MPI-2 (started: 1995, finished: 1997) MPI-2 book 1999 MPICH 1.2.4 partial implemention

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage Evaluation of Lustre File System software enhancements for improved Metadata performance Wojciech Turek, Paul Calleja,John

More information

NetApp SolidFire and Pure Storage Architectural Comparison A SOLIDFIRE COMPETITIVE COMPARISON

NetApp SolidFire and Pure Storage Architectural Comparison A SOLIDFIRE COMPETITIVE COMPARISON A SOLIDFIRE COMPETITIVE COMPARISON NetApp SolidFire and Pure Storage Architectural Comparison This document includes general information about Pure Storage architecture as it compares to NetApp SolidFire.

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Leveraging Flash in HPC Systems

Leveraging Flash in HPC Systems Leveraging Flash in HPC Systems IEEE MSST June 3, 2015 This work was performed under the auspices of the U.S. Department of Energy by under Contract DE-AC52-07NA27344. Lawrence Livermore National Security,

More information

Scibox: Online Sharing of Scientific Data via the Cloud

Scibox: Online Sharing of Scientific Data via the Cloud Scibox: Online Sharing of Scientific Data via the Cloud Jian Huang, Xuechen Zhang, Greg Eisenhauer, Karsten Schwan Matthew Wolf,, Stephane Ethier ǂ, Scott Klasky CERCS Research Center, Georgia Tech ǂ Princeton

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Structuring PLFS for Extensibility

Structuring PLFS for Extensibility Structuring PLFS for Extensibility Chuck Cranor, Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University What is PLFS? Parallel Log Structured File System Interposed filesystem b/w

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995

Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Network Working Group Y. Rekhter Request for Comments: 1787 T.J. Watson Research Center, IBM Corp. Category: Informational April 1995 Status of this Memo Routing in a Multi-provider Internet This memo

More information

Netronome NFP: Theory of Operation

Netronome NFP: Theory of Operation WHITE PAPER Netronome NFP: Theory of Operation TO ACHIEVE PERFORMANCE GOALS, A MULTI-CORE PROCESSOR NEEDS AN EFFICIENT DATA MOVEMENT ARCHITECTURE. CONTENTS 1. INTRODUCTION...1 2. ARCHITECTURE OVERVIEW...2

More information

On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows

On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman 12 th Workflows in Support of Large-Scale Science (WORKS) SuperComputing

More information

AcuSolve Performance Benchmark and Profiling. October 2011

AcuSolve Performance Benchmark and Profiling. October 2011 AcuSolve Performance Benchmark and Profiling October 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox, Altair Compute

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION 5.1 INTRODUCTION Generally, deployment of Wireless Sensor Network (WSN) is based on a many

More information

Table 9. ASCI Data Storage Requirements

Table 9. ASCI Data Storage Requirements Table 9. ASCI Data Storage Requirements 1998 1999 2000 2001 2002 2003 2004 ASCI memory (TB) Storage Growth / Year (PB) Total Storage Capacity (PB) Single File Xfr Rate (GB/sec).44 4 1.5 4.5 8.9 15. 8 28

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

VARIABILITY IN OPERATING SYSTEMS

VARIABILITY IN OPERATING SYSTEMS VARIABILITY IN OPERATING SYSTEMS Brian Kocoloski Assistant Professor in CSE Dept. October 8, 2018 1 CLOUD COMPUTING Current estimate is that 94% of all computation will be performed in the cloud by 2021

More information

Pivot3 Acuity with Microsoft SQL Server Reference Architecture

Pivot3 Acuity with Microsoft SQL Server Reference Architecture Pivot3 Acuity with Microsoft SQL Server 2014 Reference Architecture How to Contact Pivot3 Pivot3, Inc. General Information: info@pivot3.com 221 West 6 th St., Suite 750 Sales: sales@pivot3.com Austin,

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Chapter 4. Fundamental Concepts and Models

Chapter 4. Fundamental Concepts and Models Chapter 4. Fundamental Concepts and Models 4.1 Roles and Boundaries 4.2 Cloud Characteristics 4.3 Cloud Delivery Models 4.4 Cloud Deployment Models The upcoming sections cover introductory topic areas

More information

An Introduction to GPFS

An Introduction to GPFS IBM High Performance Computing July 2006 An Introduction to GPFS gpfsintro072506.doc Page 2 Contents Overview 2 What is GPFS? 3 The file system 3 Application interfaces 4 Performance and scalability 4

More information

LINUX CONTAINERS. Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER

LINUX CONTAINERS. Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER Where Enterprise Meets Embedded Operating Environments WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Flexible and connected platforms are core components in leading computing fields, including

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information