Adaptable IO System (ADIOS)


Adaptable IO System (ADIOS)
http://www.cc.gatech.edu/~lofstead/adios
Cray User Group 2008, May 8, 2008
Chen Jin, Scott Klasky, Stephen Hodson, James B. White III, Weikuan Yu (Oak Ridge National Laboratory); Jay Lofstead, Hasan Abbasi, Karsten Schwan, Matthew Wolf (Georgia Tech); Wei-keng Liao, Alok Choudhary (Northwestern University); Manish Parashar, Ciprian Docan (Rutgers University); Ron Oldfield (Sandia Labs)

Outline
- ADIOS overview and design goals
- ADIOS files (.bp)
- ADIOS APIs
- ADIOS XML file description
- ADIOS transport methods: POSIX, MPI-AIO, MPI-IO, DataTap, MPI-CIO, DART, NULL, PHDF5
- Initial ADIOS performance
- Future work
- Conclusions

ADIOS Overview: Design Goals
Combine:
- Fast I/O routines
- Easy to use
- Scalable architecture (100s of cores to millions of processors)
- QoS
- Metadata-rich output
- Visualization applied during simulations
- Analysis and compression techniques applied during simulations
- Provenance tracking
- Methods to swap controlling apps (steering) vs. fast I/O
Use the largest data-producing codes at ORNL as test cases: S3D, GTC, GTS, Chimera, XGC.
Support 90% of the codes. We haven't found a code that doesn't work, but we might.

Why Yet Another API?
No special-purpose API is suitable for all purposes:
- complexities of the programming interface
- differences in file content support
- performance differences depending on the configuration of the run
No support for non-I/O tasks as part of I/O activity:
- to add viz support, must add code
- to integrate with steering or other feedback mechanisms, must add code

A programmer's perspective: I/O options until ADIOS came along (pick 1, please!)
- POSIX: F90 writes, p processors/files
- MPI-IO: n writers, p procs, f files
- NetCDF
- HDF5
- Pnetcdf
- Phdf5
- Asynchronous I/O
- Data streaming
- VisIt APIs for steering
- GT APIs for A/IO steering
- Rutgers APIs for ADIOS/steering
We tried them all, but mainstream GTC used POSIX F90 with p processors/files! Why?

ADaptable IO System (ADIOS) combines high-performance I/O, in-situ visualization, and real-time analytics. Collaborating with many institutions.
[Slide table, row/column mapping lost in transcription: transport/platform combinations (MPI-IO on ORNL Jaguar, async MPI-IO on Jaguar, DART on Jaguar, DataTap/Jaguar, Maviz/Jaguar, VisIt/Jaguar, ParaView/Jaguar, Phdf5/Jaguar, Pnetcdf/Jaguar, BGP/IB/GPFS), application codes (GTC, GTC_s, Flash, XGC1, Chimera, S3D, M3D, XGC0), and reported rates (25 GB/s, 22 GB/s, 15 GB/s, 20 GB/s, 1.2 TB, <1).]

ADIOS Overview
- Allows plug-ins for different I/O implementations.
- Abstracts the API from the method used for I/O.
- Simple API, almost as easy as an F90 write statement.
- Both synchronous and asynchronous transports supported without code changes.
- Componentization: don't worry about the I/O implementation. Components for I/O transport methods, buffering, scheduling, and eventually feedback mechanisms.
- Change the I/O method by changing the XML file only!
[Architecture diagram: scientific codes call the ADIOS API (buffering, scheduling, feedback), which is driven by external metadata (XML file) and dispatches to pluggable transports: POSIX IO, MPI-IO, LIVE/DataTap, DART, HDF-5, pnetcdf, viz engines, others (plug-in).]

ADIOS Overview
- Middleware between applications and transport methods.
- Abstracts the data information (type, dimension, description, etc.) into an XML file.
- Clean and easy interface for application developers.
- Webpages: http://www.cc.gatech.edu/~lofstead/adios and http://hecura.cs.unm.edu/doku.php?id=asynchronous_i_o_api

Benefits of ADIOS
- Simple API, as close to standard Fortran POSIX I/O calls as possible.
- External metadata in an XML file; change I/O transport/processing by changing the XML file.
- Best-practice/optimized I/O routines for all supported transports, for free.
- Supports both synchronous (MPI, MPI collective, netcdf, pnetcdf, HDF-5) and asynchronous (GT: EVPath, Rutgers: DART) transports.
- New transports for things like VisIt and Kepler are in the planning/development stages.

ADIOS file format
- .bp format (binary packed). Blocks of data are dumped with tags before each basic element.
- Will support a header in the released version of ADIOS; the header will eventually be an index table.
- No rearrangement of the data when it touches disk.
- Utilities to dump .bp files to standard output (like h5dump, ncdump).
- Converters from .bp to HDF5, NetCDF, and ASCII (column data).

ADIOS Setup Code Example (XGC, main.F90)
! MPI initialization
call my_mpi_init
! ADIOS initialization
call adios_init ('config.xml', MPI_COMM_WORLD, MPI_COMM_SELF, MPI_INFO_NULL)
! ADIOS resource release
call adios_finalize (sml_mype)
! MPI resource release
call my_mpi_finalize

Example for asynchronous I/O scheduling (setup/iteration procedure)
call adios_init ('config.xml')
...
! main loop
   call adios_begin_calculation ()
   ! do non-communication work
   call adios_end_calculation ()
   ...
   ! perform restart write, etc.
   ...
   ! do communication work
   call adios_end_iteration ()
! end loop
...
call adios_finalize ()
For asynchronous operations, ADIOS lets programmers mark where no communication will occur in the code; we do this for the computational kernels. The XML file contains information on how quickly the data must be written out (how many iterations).

ADIOS APIs Example (GTC, restart.f90)
call adios_get_group (grp_id, 'restart')
call adios_open (buf_id, grp_id, 'restart.bp')
ADIOS_WRITE (buf_id, mpi_comm_world)
ADIOS_WRITE (buf_id, nparam)
ADIOS_WRITE (buf_id, mimax)
ADIOS_WRITE (buf_id, zion)
! or write the whole group with one call:
! ADIOS_GROUP_WRITE (buf_id, grp_id)
call adios_close (buf_id)

ADIOS XML Example
<adios-config host-language="Fortran">
  <adios-group name="restart" coordination-communicator="mpi_comm_world">
    <var name="mpi_comm_world" type="integer*8"/>
    <var name="nparam" type="integer" write="no"/>
    <var name="mimax" type="integer" write="no"/>
    <var name="zion" type="double" dimensions="nparam,mimax"/>
    <attribute name="description" path="/zion" value="ion particle"/>
  </adios-group>
  <method priority="1" method="MPI" group="restart"/>
  <buffer size-mb="100" allocate-time="now"/>
</adios-config>

ADIOS Features
Dataset/array support
- Local/global dimensions: specify the global space, the local dimension (per MPI process), and the offsets for this dataset. Ghost zones can be specified too. (A decomposition sketch follows below.)
Support for specifying visualization meshes
- VTK-like format used in the XML.
- Structured/unstructured data; 1 mesh per ADIOS group.
- No AMR support in ADIOS 1.0.
Language support
- Fortran is the default; C/C++ supported and tested.
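The local dimensions and offsets come from the application's own domain decomposition. A minimal, hypothetical sketch (the routine and variable names are illustrative, not part of the ADIOS API) of how a code might compute the per-process local size and offset for a 1-D block decomposition of a global array; the resulting values would then be referenced by the dimension/offset entries in the XML description:

! hypothetical helper: block-decompose nglobal elements over nprocs ranks
subroutine block_decompose(nglobal, rank, nprocs, nlocal, offset)
  implicit none
  integer, intent(in)  :: nglobal, rank, nprocs
  integer, intent(out) :: nlocal, offset
  integer :: base, extra

  base  = nglobal / nprocs        ! minimum elements per process
  extra = mod(nglobal, nprocs)    ! the first 'extra' ranks take one more

  if (rank < extra) then
     nlocal = base + 1
     offset = rank * (base + 1)
  else
     nlocal = base
     offset = extra * (base + 1) + (rank - extra) * base
  end if
end subroutine block_decompose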

ADIOS Methods: POSIX
ADIOS buffers the data (buffer size is user-definable) and writes out large blocks. The POSIX method writes out 1 file per MPI process. (A minimal sketch of the effect follows below.)
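A minimal sketch of the effect, not ADIOS's actual implementation (the file-naming scheme and all names are illustrative): once the group's variables have been gathered into one buffer, each MPI process writes that buffer as a single large block to its own file:

! illustrative only: one buffered block, one file per MPI process
subroutine posix_write_per_process(buffer, nbytes, rank)
  implicit none
  integer, intent(in)          :: nbytes, rank
  character(len=1), intent(in) :: buffer(nbytes)
  character(len=32) :: fname
  integer :: unit

  write(fname, '(a,i0)') 'restart.bp.', rank   ! hypothetical per-rank file name
  open(newunit=unit, file=trim(fname), access='stream', &
       form='unformatted', status='replace')
  write(unit) buffer                           ! one large buffered block
  close(unit)
end subroutine posix_write_per_process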

ADIOS MPI-IO Method
Simple chained: open requests dispatched sequentially, but with unknown time offset
1. Process receives its starting file offset from the previous rank
2. Process calculates the next file offset and sends it to the next rank
3. Process opens the file
Robust: chained opens processed sequentially
1. Process receives its file offset from the previous rank
2. Process opens the file
3. Process calculates the next file offset and sends it to the next rank
Robust Timed: chained opens processed sequentially; attempts to maintain a constant minimum offset between opens
1. Process starts the elapsed-time clock
2. Process receives its file offset from the previous rank
3. Process opens the file
4. Process waits a specified interval minus the elapsed time
5. Process calculates the next file offset and sends it to the next rank
Each method also staggers the actual I/O data requests to a different degree. Similar methods could be used to control the data flow to OSTs. (A sketch of the simple chained pattern follows below.)
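For illustration only (this is not the ADIOS source), a minimal MPI sketch of the "simple chained" variant, assuming each rank writes nbytes of already-buffered data; the file name 'restart.bp' and all routine/variable names are placeholders:

subroutine chained_write(my_data, nbytes, comm)
  use mpi
  implicit none
  integer, intent(in)          :: nbytes, comm
  character(len=1), intent(in) :: my_data(nbytes)
  integer(kind=MPI_OFFSET_KIND) :: my_offset, next_offset
  integer :: rank, nprocs, fh, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  ! 1. receive the starting file offset from the previous rank (rank 0 starts at 0)
  my_offset = 0
  if (rank > 0) then
     call MPI_Recv(my_offset, 1, MPI_INTEGER8, rank-1, 0, comm, status, ierr)
  end if

  ! 2. calculate the next rank's offset and pass it along the chain
  next_offset = my_offset + nbytes
  if (rank < nprocs-1) then
     call MPI_Send(next_offset, 1, MPI_INTEGER8, rank+1, 0, comm, ierr)
  end if

  ! 3. open the shared file and write this rank's block at its offset
  call MPI_File_open(comm, 'restart.bp', &
       MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
  call MPI_File_write_at(fh, my_offset, my_data, nbytes, MPI_BYTE, status, ierr)
  call MPI_File_close(fh, ierr)
end subroutine chained_write

The "robust" variant simply opens the file before forwarding the offset, and the timed variant inserts the wait before the forwarding step.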

MPI-CIO (Collective MPI-IO Method)
Collective I/O enables processes to collaborate in rearranging I/O requests for better performance. The collective I/O method in ADIOS first defines MPI fileviews for all processes based on the data partitioning information provided in the XML configuration file. ADIOS also generates MPI-IO hints, such as data sieving and I/O aggregators, based on the access pattern and the underlying file system configuration; the hints are supplied to the MPI-IO library for further performance enhancement. The data partitioning pattern is described in the XML file with the <global-bounds dimensions offsets> tag, which defines the global array size and the offsets of the local subarrays in the global space. (A sketch of the fileview and hint setup follows below.)
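For context, a hedged sketch in plain MPI-IO of the kind of setup this method performs on the application's behalf: a subarray fileview built from the global sizes, local sizes, and offsets (the global-bounds information), plus hints for data sieving and collective-buffering aggregators. The ROMIO hint names/values, the file name, and the assumption of a 2-D double-precision array are illustrative, not necessarily what ADIOS sets:

subroutine collective_write(local, gsizes, lsizes, offsets, comm)
  use mpi
  implicit none
  integer, intent(in) :: gsizes(2), lsizes(2), offsets(2), comm
  double precision, intent(in) :: local(lsizes(1), lsizes(2))
  integer :: filetype, fh, info, ierr
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp

  ! filetype describing where this rank's subarray lives in the global array
  ! (offsets are 0-based global starting indices, as MPI requires)
  call MPI_Type_create_subarray(2, gsizes, lsizes, offsets, &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, filetype, ierr)
  call MPI_Type_commit(filetype, ierr)

  ! hints corresponding to data sieving and I/O aggregators (ROMIO names,
  ! illustrative values)
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, 'romio_ds_write', 'disable', ierr)
  call MPI_Info_set(info, 'romio_cb_write', 'enable', ierr)
  call MPI_Info_set(info, 'cb_nodes', '8', ierr)

  call MPI_File_open(comm, 'field.bp', &
       MPI_MODE_WRONLY + MPI_MODE_CREATE, info, fh, ierr)
  disp = 0
  call MPI_File_set_view(fh, disp, MPI_DOUBLE_PRECISION, filetype, &
       'native', info, ierr)
  call MPI_File_write_all(fh, local, lsizes(1)*lsizes(2), &
       MPI_DOUBLE_PRECISION, status, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Type_free(filetype, ierr)
  call MPI_Info_free(info, ierr)
end subroutine collective_write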

MPI-AIO (Asynchronous MPI-IO Method)
- Currently needs Open MPI. Modifies the existing adios_mpi.c (adios_mpi_do_write, adios_mpi_do_read).
- Uses asynchronous I/O if buffer space is available in ADIOS (<buffer-mb=xxxx>) for the current I/O request; otherwise uses a synchronous call.
- If asynchronous I/O is not available in the MPI-IO implementation, the request is handled by MPI-IO synchronously; the unneeded async buffer allocation in ADIOS is consumed only for the duration of the synchronous I/O operation, so it is a wash.
- Currently only 1 outstanding async request is allowed. Fine for large (>>1 MB) I/O, but small-I/O performance will benefit from queuing multiple requests.
- Issues a deferred close.
(A sketch of this decision logic follows below.)
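A minimal sketch of that decision logic, not the actual adios_mpi_do_write; have_buffer_space and pending_request are illustrative stand-ins for ADIOS's internal buffer accounting and request tracking:

subroutine aio_write(fh, offset, buf, nbytes, have_buffer_space, &
                     pending_request)
  use mpi
  implicit none
  integer, intent(in)                       :: fh, nbytes
  integer(kind=MPI_OFFSET_KIND), intent(in) :: offset
  character(len=1), intent(in)              :: buf(nbytes)
  logical, intent(in)                       :: have_buffer_space
  integer, intent(inout)                    :: pending_request
  integer :: ierr, status(MPI_STATUS_SIZE)

  if (have_buffer_space .and. pending_request == MPI_REQUEST_NULL) then
     ! asynchronous path: only one outstanding request is kept, matching
     ! the "1 outstanding async request" limitation on this slide
     call MPI_File_iwrite_at(fh, offset, buf, nbytes, MPI_BYTE, &
          pending_request, ierr)
  else
     ! synchronous fallback (no buffer space, or a request still in flight)
     call MPI_File_write_at(fh, offset, buf, nbytes, MPI_BYTE, status, ierr)
  end if
end subroutine aio_write

! the deferred completion would then happen later, e.g. at close time:
!   call MPI_Wait(pending_request, status, ierr)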

ADIOS Method: NULL
Switching 1 line in the XML file (the method entry) tells the code not to write any data to disk. Used for testing the speed of the I/O routine.

Asynchronous Method with DataTap
DataTap, built by Georgia Tech, is an asynchronous data transport that ensures very high levels of scalability through server-directed I/O. Data moves via RDMA from the compute nodes to the DataTap servers on login nodes.

ADIOS DART Method (DART: Asynchronous I/O Substrate)
Objectives
- Minimize I/O time and overheads
- Maximize data throughput
- Minimize latency of data transfers
Approach
- Asynchronously schedule I/O operations to overlap with computation
- Dynamically manage buffering and mapping of I/O to service nodes
- Build on RDMA and Portals
Evaluation using CPES simulations
- XGC-1, 1k nodes, stream to login node: transferred 380 GB in 3000 s with 5 s spent in I/O (~0.2% overhead)
- XGC-1, 1k nodes, save to local store: transferred 380 GB in 2900 s with 4.86 s spent in I/O (~0.16% overhead)
- GTC simulation on 1k and 2k nodes: transferred 1 TB in 3000 s with 12 s spent on I/O (~0.6% overhead)
(A worked example of the first overhead figure follows below.)
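For reference, the first overhead figure is simply the fraction of the total run time spent in I/O:

\[
\text{overhead} = \frac{t_{\text{I/O}}}{t_{\text{total}}} = \frac{5\,\text{s}}{3000\,\text{s}} \approx 0.17\% \;(\text{reported as } \sim 0.2\%)
\]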

ADIOS methods not yet incorporated: Parallel HDF5, Parallel NetCDF, VisIt, data multiplexing methods.

Initial ADIOS Performance
- MPI-IO method: the GTC and GTS codes have achieved over 20 GB/s on the Cray XT at ORNL; the Chimera code sped up by 6.5% (overall time).
- DART: <2% overhead for writing 2 TB/hour with the GTC code.
- DataTap vs. POSIX: 5 s for GTC computation; ~25 s for POSIX I/O vs. ~4 s with DataTap.

Future Work: ADIOS 2.0
- Dynamic XML generator within one simulation; reconfigure the simulation on the fly using the dynamic XML file.
- Index files built for fast reading.
- Metadata catalogue: one file produced per simulation that hyperlinks to all of the other files produced.
- Additional readers developed.
- Additional transport methods: Pnetcdf, VisIt (you tell us).
- Hooks into workflow automation: metadata/EOF signals sent to the Kepler workflow.

Conclusions & Future Work
- ADIOS 1.0 will be released in 2008; it is integrated for all I/O in XGC1, GTC, GTS, Chimera, S3D, and Flash, and has HDF5, NetCDF, and ASCII converters.
- ADIOS is meant to be easy to use, FAST, and highly annotated.
- ADIOS 2.0 will include: parallel HDF5 methods, parallel NetCDF methods, optimized asynchronous schedulers, fast asynchronous methods, data multiplexing, and faster methods to index and read files.
- Ongoing work: harden routines on Crays, BlueGene, and InfiniBand; bug fixing; more feedback from the file system.