Extreme I/O Scaling with HDF5


Extreme I/O Scaling with HDF5. Quincey Koziol, Director of Core Software Development and HPC, The HDF Group. koziol@hdfgroup.org. July 15, 2012, XSEDE 12 Extreme Scaling Workshop.

Outline: a brief overview of HDF5 and The HDF Group; extreme I/O scaling with HDF5, covering application and algorithm efficiency, application-based fault-tolerant methods and algorithms, and application-based topology awareness; and what else is on the horizon for HDF5.

What is HDF5? A versatile data model that can represent very complex data objects and a wide variety of metadata; a completely portable file format with no limit on the number or size of data objects stored; an open-source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces; a rich set of integrated performance features that allow for access-time and storage-space optimizations; and tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

The HDF5 Data Model. Datasets: multi-dimensional arrays of elements, together with supporting metadata. Groups: directory-like structures containing links to datasets, groups, and other objects.

Example HDF5 File. [Diagram: a root group "/" containing a 3-D array, a group /A, a palette, a table with lat/lon/temp columns (e.g. 12/23/3.1, 15/24/4.2, 17/21/3.6), two raster images, and a 2-D array.] A minimal code sketch of creating such a structure follows below.
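To make the data model concrete, here is a minimal C sketch that creates a file containing a group and a small 2-D dataset. It is illustrative only: the file name, group name, and dataset name ("example.h5", "/A", "temperature") are assumptions, not taken from the slide.

```c
#include "hdf5.h"

int main(void)
{
    /* Create a new HDF5 file (truncating any existing one). */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Create a group, analogous to a directory. */
    hid_t group = H5Gcreate2(file, "/A", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Create a 2-D dataset of doubles inside the group and write to it. */
    hsize_t dims[2] = {3, 2};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(group, "temperature", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double data[3][2] = {{3.1, 4.2}, {3.6, 2.9}, {1.5, 0.7}};
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}
```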

Why HDF5? Challenging data: application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. Software solutions: for very large datasets, very fast access requirements, or very complex datasets; to easily share data across a wide variety of computational platforms using applications written in different programming languages; and to take advantage of the many open-source and commercial tools that understand HDF5.

Who uses HDF5? Some examples of HDF5 users: astrophysicists and astronomers, the NASA Earth Science Enterprise, Department of Energy labs, supercomputing centers in the US, Europe, and Asia, financial institutions, NOAA, manufacturing industries, and many others. For a more detailed list, visit http://www.hdfgroup.org/hdf5/users5.html.

What is The HDF Group, and why does it exist?

The HDF Group: 18 years at the University of Illinois' National Center for Supercomputing Applications (NCSA), spun out from NCSA in July 2006 as a non-profit with roughly 40 scientific, technology, and professional staff. Intellectual property: The HDF Group owns HDF4 and HDF5; the HDF formats and libraries remain open, under a BSD-style license.

The HDF Group mission: to ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

Goals: maintain and evolve HDF for the sponsors and communities that depend on it; provide consulting, training, tuning, development, and research; and sustain the group for the long term to assure data access over time.

Extreme I/O with HDF5. Application and algorithm efficiency: scaling use of HDF5 from current petascale to future exascale systems. Application-based fault-tolerant methods and algorithms: file- and system-based fault tolerance with HDF5. Application-based topology awareness: I/O autotuning research.

Application & Algorithm Efficiency. Success story at NERSC: "Sifting Through a Trillion Electrons" (http://crd.lbl.gov/news-and-publications/news/2012/sifting-through-a-trillion-electrons/). A 3-D magnetic reconnection dataset of a trillion particles was produced with the VPIC application, running on 120,000 cores of NERSC's Cray XE6 system Hopper and writing each snapshot into a single 32 TB HDF5 file at a sustained I/O rate of 27 GB/s, with peaks of 31 GB/s: roughly 90% of maximum system I/O performance.

Application & Algorithm Efficiency. Aspects of HDF5 efficiency: general issues, parallel issues, and upcoming parallel improvements, namely file truncation with MPI, collective HDF5 metadata modifications, and the exascale FastForward work.

General HDF5 Efficiency. Faster HDF5 metadata performance: use the latest file format features (H5Pset_libver_bounds()); increase the size of metadata data structures (H5Pset_istore_k(), H5Pset_sym_k(), etc.); aggregate metadata into larger blocks (H5Pset_meta_block_size()); align objects in the file (H5Pset_alignment()); and "cork the cache" to prevent a trickle of metadata writes. [Code example in slide notes; a hedged sketch follows below.]
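Since the original code example lives only in the slide notes, the following is a hedged sketch of what such metadata tuning might look like in C. The specific property values (block sizes, alignment, B-tree parameters) are illustrative assumptions, not recommendations from the talk.

```c
#include "hdf5.h"

int main(void)
{
    /* File access properties: newest file format, larger metadata blocks, alignment. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5Pset_meta_block_size(fapl, 1048576);   /* aggregate metadata into 1 MiB blocks */
    H5Pset_alignment(fapl, 4096, 1048576);   /* align objects >= 4 KiB on 1 MiB boundaries */

    /* File creation properties: larger B-tree / symbol-table parameters. */
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_istore_k(fcpl, 64);               /* chunked-dataset B-tree rank */
    H5Pset_sym_k(fcpl, 32, 8);               /* group B-tree rank and heap size hint */

    hid_t file = H5Fcreate("tuned.h5", H5F_ACC_TRUNC, fcpl, fapl);

    /* "Cork the cache": disable metadata evictions so changes are not trickled
     * out to the file; automatic cache resizing must also be switched off. */
    H5AC_cache_config_t mdc;
    mdc.version = H5AC__CURR_CACHE_CONFIG_VERSION;
    H5Fget_mdc_config(file, &mdc);
    mdc.evictions_enabled = 0;
    mdc.incr_mode         = H5C_incr__off;
    mdc.flash_incr_mode   = H5C_flash_incr__off;
    mdc.decr_mode         = H5C_decr__off;
    H5Fset_mdc_config(file, &mdc);

    /* ... create groups and datasets here; flushing or closing the file
     *     writes the corked metadata out in one burst ... */

    H5Fclose(file);
    H5Pclose(fcpl);
    H5Pclose(fapl);
    return 0;
}
```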

General HDF5 Efficiency. Faster HDF5 raw data performance: avoid converting datatypes; if conversion is necessary, increase the datatype conversion buffer size (default 1 MB) with H5Pset_buffer(). Possibly use chunked datasets (H5Pset_chunk()), make your chunks the right size (Goldilocks principle: not too big, nor too small), and increase the size of the chunk cache from its 1 MB default with H5Pset_chunk_cache(). A sketch follows below.
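A hedged sketch of chunked storage with an enlarged chunk cache; the dataset shape, chunk shape, and 64 MiB cache size are illustrative assumptions, not tuning advice from the talk.

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Chunked layout: each chunk is 512 x 256 doubles (1 MiB). */
    hsize_t dims[2]  = {8192, 8192};
    hsize_t chunk[2] = {512, 256};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);

    /* Enlarge the per-dataset chunk cache from its 1 MB default to 64 MiB,
     * with more hash-table slots and the default preemption policy. */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 12421, 64 * 1024 * 1024, H5D_CHUNK_CACHE_W0_DEFAULT);

    hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, dapl);

    /* ... write hyperslabs that line up with whole chunks where possible ... */

    H5Dclose(dset);
    H5Pclose(dapl);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```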

Upcoming General Improvements. A sampling of new features: single-writer/multiple-reader (SWMR) access, which allows concurrent, lock-free access from a single writing process and multiple reading processes; asynchronous I/O, adding support for the POSIX aio_*() routines; scalable chunk indices, giving constant-time lookup and append operations for nearly all datasets; and persistent free-space tracking, with options for tracking free space in the file when it is closed. All are slated for the 1.10.0 release; a hedged SWMR sketch follows below.
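SWMR was still under development at the time of this talk; the sketch below assumes the interface that eventually shipped in the HDF5 1.10 series (H5Fstart_swmr_write(), H5F_ACC_SWMR_READ, H5Drefresh()), so treat it as a forward-looking illustration rather than the 2012 API.

```c
#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file format. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    hid_t file = H5Fcreate("swmr.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Create an extendible, chunked 1-D dataset before entering SWMR mode. */
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 1024;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    hid_t dset  = H5Dcreate2(file, "stream", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Switch the file into single-writer/multiple-reader mode. */
    H5Fstart_swmr_write(file);

    /* From here on, the writer extends and writes the dataset, calling
     * H5Dflush(dset) after each append; readers open the file with
     * H5F_ACC_RDONLY | H5F_ACC_SWMR_READ and call H5Drefresh() to see new data. */

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}
```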

Parallel HDF5 Efficiency. Create the skeleton file in serial: current parallel HDF5 requires collective modification of metadata, so if the structure of the file is (mostly) known, create the skeleton in serial, then close it and re-open it in parallel. Use collective I/O for raw data (H5Pset_dxpl_mpio()), possibly together with the collective chunk optimization routines (H5Pset_dxpl_mpio_chunk_opt(), H5Pset_dxpl_mpio_chunk_opt_num(), H5Pset_dxpl_mpio_chunk_opt_ratio(), H5Pset_dxpl_mpio_collective_opt()), and query whether collective I/O actually occurred with H5Pget_mpio_actual_io_mode() and H5Pget_mpio_actual_chunk_opt_mode(). A sketch follows below.
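As a hedged illustration of these recommendations, the sketch below has each MPI rank write one row of a shared 2-D dataset collectively and then checks which I/O mode was actually used; file and dataset names are placeholders.

```c
#include <stdio.h>
#include <mpi.h>
#include "hdf5.h"

#define NCOLS 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file for parallel access through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One shared 2-D dataset; each rank owns one row. */
    hsize_t dims[2] = {(hsize_t)nprocs, NCOLS};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, NCOLS};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Request collective I/O for the raw data transfer. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double row[NCOLS];
    for (int i = 0; i < NCOLS; i++) row[i] = rank + 0.001 * i;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, row);

    /* Check whether collective I/O actually happened. */
    H5D_mpio_actual_io_mode_t mode;
    H5Pget_mpio_actual_io_mode(dxpl, &mode);
    if (rank == 0 && mode == H5D_MPIO_NO_COLLECTIVE)
        fprintf(stderr, "warning: I/O fell back to independent mode\n");

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```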

Upcoming Parallel Improvements. File truncation. Problem: the current HDF5 file format requires the file to be truncated to its allocated size whenever it is closed, and file truncation with MPI is very slow. Solution: modify the HDF5 library and file format to store the allocated size as metadata within the HDF5 file, which gives a large speed benefit for parallel applications. Available in the 1.10.0 release.

Upcoming Parallel Improvements. Collective HDF5 metadata modification. Problem: the current HDF5 library requires all metadata modifications to be performed collectively. Solution: set aside one process (or, eventually, more) as a metadata server and have the other processes communicate metadata changes to that server. Possibly available in the 1.10.0 release.

Upcoming Parallel Improvements. Exascale FastForward work with Whamcloud and EMC: Whamcloud, EMC, and The HDF Group were recently awarded a contract for exascale storage research and prototyping (http://www.hpcwire.com/hpcwire/2012-07-12/doe_primes_pump_for_exascale_supercomputers.html). The HDF5 data model and interface will serve as the top layer of a next-generation storage system for future exascale systems, bringing a laundry list of new HDF5 features: parallel compression, asynchronous I/O, query/index features, transactions, Python wrappers, etc.

Extreme I/O with HDF5. Application and algorithm efficiency: scaling use of HDF5 from current petascale to future exascale systems. Application-based fault-tolerant methods and algorithms: file- and system-based fault tolerance with HDF5. Application-based topology awareness: I/O autotuning research.

Fault Tolerance and HDF5. Metadata journaling: protect the file by writing metadata changes to a journal file first, then writing them to the HDF5 file; a performance trade-off for good fault tolerance, and it works with non-POSIX file systems. Ordered updates: protect the file through special data-structure update algorithms, essentially copy-on-write and lock-free/wait-free algorithms; good performance and good fault tolerance, but it only works on POSIX file systems.

Fault Tolerance and HDF5. MPI fault tolerance: The HDF Group has participated in the MPI Forum since 2008, contributing to and helping advance features that will enable HDF5 to meet future needs. The Forum's Fault Tolerance Working Group has proposed features that would allow MPI file handles to be rebuilt if participating processes fail. If this feature is voted into the MPI standard and becomes available in implementations, we will adopt it, so that applications can be certain that their HDF5 files are secure.

Extreme I/O with HDF5. Application and algorithm efficiency: scaling use of HDF5 from current petascale to future exascale systems. Application-based fault-tolerant methods and algorithms: file- and system-based fault tolerance with HDF5. Application-based topology awareness: I/O autotuning research.

Autotuning Background. Software autotuning: employ empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance. Autotuning categories: self-tuning library generators, such as ATLAS, PhiPAC, and OSKI for linear algebra; compiler-based autotuners that automatically generate and search a set of alternative implementations of a computation; and application-level autotuners that automate empirical search across a set of parameter values proposed by the application programmer.

HDF5 Autotuning. Why? Because the dominant I/O support request at NERSC is poor I/O performance, and many (perhaps most) of those requests can be resolved by enabling Lustre striping or tuning another I/O parameter; scientists shouldn't have to figure this stuff out! Two areas of focus: evaluating techniques for autotuning HPC application I/O across the file system, MPI, and HDF5 layers; and recording and replaying HDF5 I/O operations.

Autotuning HPC I/O. Goal: avoid tuning each application to each machine and file system by creating an I/O autotuner library that can inject optimal parameters for I/O operations on a given system. The Darshan* tool is used to create wrappers for HDF5 calls, and the application is simply dynamically linked with the I/O autotuning library; no changes to the application or the HDF5 library are required. Several HPC applications are currently being used: VPIC, GCRM, and Vorpal. A sketch of the interposition idea follows below. (* http://www.mcs.anl.gov/research/projects/darshan/)
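The real autotuner builds its wrappers with Darshan; purely as an illustration of the interposition idea, the sketch below hand-writes an LD_PRELOAD shim around H5Fcreate() that injects tuned file-access properties. The shim, its injected values, and the build/run commands in the comments are assumptions, not the project's implementation.

```c
/* autotune_shim.c: illustrative only.
 * Build:  cc -shared -fPIC autotune_shim.c -o libautotune.so -ldl
 * Run:    LD_PRELOAD=./libautotune.so ./application            */
#define _GNU_SOURCE
#include <dlfcn.h>
#include "hdf5.h"

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Look up the real H5Fcreate() the first time through. */
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t) = NULL;
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                         dlsym(RTLD_NEXT, "H5Fcreate");

    /* Inject tuned file-access properties; in the real tool these values
     * would come from the configuration file produced by exploration runs. */
    hid_t fapl = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                          : H5Pcopy(fapl_id);
    H5Pset_alignment(fapl, 0, 1048576);        /* hypothetical tuned value */
    H5Pset_sieve_buf_size(fapl, 4 * 1048576);  /* hypothetical tuned value */

    return real_H5Fcreate(name, flags, fcpl_id, fapl);
}
```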

Autotuning HPC I/O. Initial parameters of interest: at the file system (Lustre) layer, stripe count and stripe unit; at the MPI-I/O layer, collective buffer size and number of collective-buffering nodes; at the HDF5 layer, alignment and sieve buffer size. A sketch of setting these explicitly follows below.
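For reference, the same parameters can also be set explicitly from application code: the Lustre striping and collective-buffering settings travel as MPI_Info hints (hint names and support depend on the MPI-IO implementation), while alignment and sieve buffer size are HDF5 file-access properties. The sketch below uses placeholder values.

```c
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File system and MPI-I/O layer: pass hints through MPI_Info.
     * Hint names and support vary by MPI-IO implementation and file system. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");       /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");    /* Lustre stripe size  */
    MPI_Info_set(info, "cb_nodes", "16");              /* collective-buffering nodes */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* collective buffer size */

    /* HDF5 layer: alignment and sieve buffer size on the file access list. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
    H5Pset_alignment(fapl, 65536, 1048576);  /* align large objects to the stripe size */
    H5Pset_sieve_buf_size(fapl, 4194304);    /* 4 MiB sieve buffer */

    hid_t file = H5Fcreate("tuned_parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... collective dataset creation and I/O as in the earlier sketch ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```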

Autotuning HPC I/O. [Figure: a table of candidate values swept for each tuning parameter (stripe count, stripe size, cb_nodes, cb_buffer_size, sieve buffer size, alignment); the table did not survive transcription.]

Autotuning HPC I/O. Autotuning exploration/generation process: iterate over running the application many times, each time intercepting the application's I/O calls, injecting autotuning parameters, and measuring the resulting performance; then analyze the performance information from the many application runs to create a configuration file holding the best parameters found for that application, machine, and file system.

Autotuning HPC I/O. Using the I/O autotuning library: dynamically link the application with the I/O autotuner library; the library automatically reads parameters from the configuration file created during the exploration process and injects the autotuning parameters as the application operates.

Autotuning HPC I/O. [Figure: an example of the autotuning parameter values injected by the library; the figure did not survive transcription.]

Autotuning HPC I/O. [Chart: measured run time for each of the explored parameter configurations of a benchmark run on Ranger, with the best configuration and its time called out; the chart did not survive transcription.]

Autotuning HPC I/O. [Figure: a slide graphic that did not survive transcription.]

Autotuning HPC I/O. Remaining research: determine the "speed of light" for I/O on a system and use that to define "good enough" performance; since the entire parameter space is too large to explore fully, evaluate genetic-algorithm techniques to help find good-enough parameters; work out how to factor out "unlucky" exploration runs; and develop methods for avoiding overriding application-set parameters with autotuned parameters.

Recording and Replaying HDF5. Goal: extract an I/O kernel from an application without examining the application code. Method: using Darshan again, record all HDF5 calls and their parameters in a replay file, then create a parallel replay tool that uses the recording to replay the HDF5 operations. A sketch of the recording idea follows below.
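As an illustration of the recording side only (not the project's Darshan-based tooling), a hand-written LD_PRELOAD wrapper around H5Dwrite() could log enough information to drive a simple replay; the log format here is invented.

```c
/* record_shim.c: illustrative H5Dwrite() recorder, loaded via LD_PRELOAD.
 * Not the project's Darshan-based implementation; the log format is made up. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include "hdf5.h"

herr_t H5Dwrite(hid_t dset, hid_t mem_type, hid_t mem_space,
                hid_t file_space, hid_t dxpl, const void *buf)
{
    static herr_t (*real_write)(hid_t, hid_t, hid_t, hid_t, hid_t, const void *) = NULL;
    if (!real_write)
        real_write = (herr_t (*)(hid_t, hid_t, hid_t, hid_t, hid_t, const void *))
                     dlsym(RTLD_NEXT, "H5Dwrite");

    /* Record the dataset path, element size, and number of elements written
     * (the full dataset extent when the file selection is H5S_ALL). */
    char name[256];
    H5Iget_name(dset, name, sizeof(name));
    hssize_t nelem;
    if (file_space == H5S_ALL) {
        hid_t dspace = H5Dget_space(dset);
        nelem = H5Sget_simple_extent_npoints(dspace);
        H5Sclose(dspace);
    } else {
        nelem = H5Sget_select_npoints(file_space);
    }
    FILE *log = fopen("replay.log", "a");
    if (log) {
        fprintf(log, "H5Dwrite %s elem_size=%zu nelem=%lld\n",
                name, H5Tget_size(mem_type), (long long)nelem);
        fclose(log);
    }

    return real_write(dset, mem_type, mem_space, file_space, dxpl, buf);
}
```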

Recording and Replaying HDF5. Benefits: it is easy to create an I/O kernel from any application, even one with no source code available; the replay file can serve as an accurate application I/O benchmark; the replay file can be moved to another system for comparison; and autotuning can be driven from the replay instead of the application. Challenges: serializing complex parameters to HDF5 calls, replay files that can become very large, and the difficulty of accurate parallel replay.

HDF5 Road Map: performance, ease of use, robustness, and innovation.

HDF5 in the Future. "It's tough to make predictions, especially about the future." (Yogi Berra)

Plans, Guesses & Speculations. Focus on asynchronous I/O instead of improving the multi-threaded concurrency of the HDF5 library: the library is currently thread-safe but not concurrent, so rather than improving concurrency we will focus on heavily leveraging asynchronous I/O and on using internal multi-threading where possible. Extend single-writer/multiple-reader (SWMR) access to parallel applications, allowing an MPI application to be the writer.

Plans, Guesses & Speculations. Improve parallel I/O performance: continue to improve our use of MPI and file system features, and reduce the number of I/O accesses needed for metadata. Improve journaled HDF5 file access: make raw data operations journaled, allow "super transactions" to be created by applications, and enable journaling for parallel HDF5.

Plans, Guesses & Speculations. Improve the raw data chunk cache implementation; provide more efficient storage and I/O of variable-length data, including compression; and work with the HPC community to serve its needs, focusing on high-profile applications or I/O kernels and removing the HDF5 bottlenecks discovered along the way. You tell us!

Questions?