CSE6230 Fall 2012 Parallel I/O. Fang Zheng


Credits: Some materials are taken from Rob Latham's "Parallel I/O in Practice" talk, http://www.spscicomp.org/scicomp14/talks/latham.pdf

Outline: I/O requirements of HPC applications; the parallel I/O stack, from storage hardware to I/O libraries; high-level I/O libraries and middleware; case study: ADIOS; in situ I/O processing; interesting research topics.

Parallel I/O. I/O requirements for HPC applications: checkpoint/restart, for defending against failures; analysis data, for later analysis and visualization; other data, such as diagnostics, logs, and snapshots. Applications view data with domain-specific semantics: variables, meshes, particles, attributes, etc. Challenges: high concurrency (e.g., 100,000 processes), huge data volume (e.g., 200 MB per process), and ease of use (scientists are not I/O experts).

Supporting Application I/O. Provide a mapping of application-domain data abstractions: an API that uses language meaningful to application programmers. Coordinate access by many processes: collective I/O and consistency semantics. Organize I/O devices into a single space: convenient utilities and a file model. And also: insulate applications from I/O system changes, and maintain performance!

What about Parallel I/O? The focus of parallel I/O is on using parallelism to increase bandwidth: use multiple data sources/sinks in concert, both multiple storage devices and multiple/wide paths to them. But applications don't want to deal with block devices and network protocols, so we add software layers.

Parallel I/O Stack. The I/O subsystem in supercomputers, illustrated with Oak Ridge National Lab's Jaguar (Cray XT4/5).

Parallel I/O Stack. Another example: the I/O subsystem of the IBM Blue Gene/P.

Parallel File Systems (PFSs). Organize I/O devices into a single logical space, striping files across devices for performance. Export a well-defined API, usually POSIX: access data in contiguous regions of bytes. Very general.
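
A minimal sketch of that byte-oriented POSIX view (not from the slides; the file name and block size are illustrative): each MPI rank writes a disjoint contiguous region of one shared file, and the parallel file system stripes those bytes across its storage targets.

```c
/* Minimal sketch: each MPI rank writes its own contiguous byte region of one
 * shared file through the plain POSIX interface that a parallel file system
 * such as Lustre or GPFS exports. File name and block size are illustrative. */
#include <fcntl.h>
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t block = 4UL << 20;           /* 4 MB per rank (illustrative) */
    char *buf = malloc(block);
    memset(buf, rank, block);

    /* Every rank opens the same file and writes to a disjoint byte range. */
    int fd = open("checkpoint.dat", O_CREAT | O_WRONLY, 0644);
    pwrite(fd, buf, block, (off_t)rank * (off_t)block);
    close(fd);

    free(buf);
    MPI_Finalize();
    return 0;
}
```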

Parallel I/O Stack. Idea: add additional software components to address the remaining issues, coordination of access and mapping from the application model to the I/O model. These components become increasingly specialized as we add layers, bridging the gap between existing I/O systems and application needs.

Parallel I/O for HPC. Break up support into multiple layers: a high-level I/O library maps application abstractions to a structured, portable file format (e.g., HDF5, Parallel netCDF, ADIOS); a middleware layer deals with organizing access by many processes (e.g., MPI-IO, UPC-IO); the parallel file system maintains the logical space and provides efficient access to data (e.g., PVFS, GPFS, Lustre).

High Level Libraries. Provide an appropriate abstraction for the domain: multidimensional datasets, typed variables, attributes, and a self-describing, structured file format. Map to the middleware interface, encourage collective I/O, and provide optimizations that middleware cannot.

Parallel I/O Stack. High-level I/O libraries provide richer semantics than the file abstraction and match applications' data models: variables, attributes, data types, domain decomposition, etc. They optimize I/O performance on top of MPI-IO and can leverage more application-level knowledge: file format and layout, and orchestration/coordination of I/O requests. Examples: HDF5, NetCDF, ADIOS, SILO, etc.
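
A minimal parallel HDF5 sketch of this layering (assuming HDF5 built with MPI support; the file name "output.h5" and dataset name "temperature" are illustrative, not from the slides): the library exposes a typed, multidimensional dataset and maps the collective write onto MPI-IO underneath.

```c
/* Each rank writes one row of a 2-D dataset through HDF5's MPI-IO driver. */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* File access property list: route HDF5 through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataset: nprocs rows x 1024 columns of doubles. */
    hsize_t dims[2] = {(hsize_t)nprocs, 1024};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "temperature", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own row (hyperslab) of the file space. */
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 1024};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    double row[1024];
    for (int i = 0; i < 1024; i++) row[i] = rank + 0.001 * i;

    /* Collective transfer property: let MPI-IO coordinate the write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```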

I/O Middleware. Facilitates concurrent access by groups of processes: collective I/O and atomicity rules. Exposes a generic interface, a good building block for high-level libraries, and matches the underlying programming model (e.g., MPI). Efficiently maps middleware operations onto PFS ones and leverages any rich PFS access constructs.

Parallel I/O. Parallel I/O supported by MPI-IO comes in three forms: individual files (each MPI process writes/reads a separate file); a shared file with individual file pointers (each process writes/reads its own portion of a single shared file); and a shared file with collective I/O (processes write/read a single shared file with collective semantics), as in the sketch below.
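
A minimal sketch of the third mode (a shared file written collectively; the file name and per-rank size are illustrative, not from the slides): each rank targets a disjoint offset, and the collective call lets the MPI-IO layer aggregate and reorder the requests.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                /* 1M doubles per rank */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a disjoint region; the collective write lets MPI-IO
     * aggregate and reorder requests (e.g., two-phase I/O). */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```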

Parallel File System. Manages the storage hardware and presents a single view. Focuses on concurrent, independent access; knowledge of collective I/O is usually very limited. Publishes an interface that middleware can use effectively: a rich I/O language with relaxed but sufficient semantics.

Parallel I/O Stack. Parallel file system example: the Lustre file system. Reference: http://wiki.lustre.org/images/3/38/shipman_feb_lustre_scalability.pdf

High Level I/O Library Case Study: ADIOS. ADIOS (Adaptable I/O System) was developed by Georgia Tech and Oak Ridge National Lab, works on Lustre and IBM's GPFS, and is in production use by several major DOE applications. Features: a simple, high-level API for reading/writing data in parallel, support for several popular file formats, high I/O performance at large scales, and an extensible framework.

High Level I/O Library Case Study: ADIOS. ADIOS architecture: a layered design with a higher-level ADIOS API (adios_open/read/write/close), support for multiple underlying file formats and I/O methods, and built-in optimizations such as scheduling and buffering. The architecture diagram shows scientific codes calling the ADIOS API (with buffering, scheduling, and feedback), driven by external metadata in an XML file, on top of pluggable transports: POSIX I/O, MPI-IO, pnetcdf, HDF-5, DART, LIVE/DataTap, visualization engines, and other plug-ins.
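
A minimal write sketch against the ADIOS 1.x C API (a sketch only: it assumes an I/O group named "restart" whose variables and I/O method are declared in config.xml, the names and sizes are illustrative, and exact signatures vary slightly between 1.x releases):

```c
#include <adios.h>
#include <mpi.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    adios_init("config.xml", MPI_COMM_WORLD);   /* parse XML, select I/O method */

    int NX = 1024;
    double t[1024];
    for (int i = 0; i < NX; i++) t[i] = rank * NX + i;

    int64_t fd;
    uint64_t total;
    adios_open(&fd, "restart", "restart.bp", "w", MPI_COMM_WORLD);
    adios_group_size(fd, sizeof(int) + NX * sizeof(double), &total);
    adios_write(fd, "NX", &NX);
    adios_write(fd, "temperature", t);
    adios_close(fd);                            /* data flushed by the chosen method */

    adios_finalize(rank);
    MPI_Finalize();
    return 0;
}
```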

High Level I/O Library Case Study: ADIOS. Optimizing write performance under contention: write performance is critical for checkpointing, and the parallel file system is shared both by processes within an MPI program and by different MPI programs running concurrently. How can we attain high write performance on a busy, shared supercomputer? (The figure shows two applications contending for the same set of I/O servers.)

In Situ I/O Processing: An Alternative Approach to Parallel I/O. Negative interference due to contention on the shared file system: internal contention between processes within the same MPI program. As the ratio of application processes to I/O servers passes a certain point, write bandwidth starts to drop.

In Situ I/O Processing: An Alternative Approach to Parallel I/O. Negative interference due to contention on the shared file system: external contention between different MPI programs causes huge variations in I/O performance on supercomputers. Reference: Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, Matthew Wolf. "Managing Variability in the IO Performance of Petascale Storage Systems". In Proceedings of SC'10, New Orleans, LA, November 2010.

High Level I/O Library Case Study: ADIOS. How to obtain high write performance on a busy, shared supercomputer? The basic trick is to find slow (overloaded) I/O servers and avoid writing data to them.

High Level I/O Library Case Study: ADIOS. ADIOS solution: coordination. Divide the writing processes into groups, with a sub-coordinator per group to monitor writing progress. Near the end of a collective I/O, the coordinator has a global view of the storage targets' performance and informs stragglers to write to fast targets.
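
A much-simplified sketch of that coordination idea (not ADIOS's actual implementation; timed_write_to_target is a hypothetical helper): each writer reports the time of its last write to its group's sub-coordinator, which identifies the fastest target and broadcasts it so stragglers can redirect their remaining data.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: time one write to storage target 'ost', return seconds. */
double timed_write_to_target(int ost, const void *buf, size_t len);

int pick_fast_target(MPI_Comm group_comm, int my_target, double my_write_time)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    double *times = NULL;
    int *targets = NULL;
    if (grank == 0) {                 /* rank 0 acts as the sub-coordinator */
        times = malloc(gsize * sizeof(double));
        targets = malloc(gsize * sizeof(int));
    }
    /* Sub-coordinator collects observed write times and the targets used. */
    MPI_Gather(&my_write_time, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, group_comm);
    MPI_Gather(&my_target, 1, MPI_INT, targets, 1, MPI_INT, 0, group_comm);

    int fast = my_target;
    if (grank == 0) {
        double best = times[0];
        fast = targets[0];
        for (int i = 1; i < gsize; i++)
            if (times[i] < best) { best = times[i]; fast = targets[i]; }
        free(times);
        free(targets);
    }
    /* Everyone learns which target is currently fast; stragglers switch to it. */
    MPI_Bcast(&fast, 1, MPI_INT, 0, group_comm);
    return fast;
}
```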

High Level I/O Library Case Study: ADIOS. Results with a parallel application, Pixie3D: higher I/O bandwidth and less variation.

In Situ I/O Processing. An alternative approach to existing parallel I/O techniques. Motivation: I/O is becoming the bottleneck for large-scale simulation and analysis.

I/O Is a Major Bottleneck Now! The I/O and storage subsystems of supercomputers are under-provisioned: there is a huge disparity between I/O and computation capacity, and I/O resources are shared and contended by concurrent jobs.

Machine (as of Nov. 2011)   Peak flops       Peak I/O bandwidth   Flop/byte
Jaguar (Cray XT5)           2.3 petaflops    120 GB/sec           19,166
Franklin (Cray XT4)         352 teraflops    17 GB/sec            20,705
Hopper (Cray XE6)           1.28 petaflops   35 GB/sec            36,571
Intrepid (BG/P)             557 teraflops    78 GB/sec            7,141
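
The flop/byte column is simply peak compute rate divided by peak I/O bandwidth; for Jaguar, for example:

```latex
\frac{\text{peak flops}}{\text{peak I/O bandwidth}}
  = \frac{2.3 \times 10^{15}\ \text{flop/s}}{120 \times 10^{9}\ \text{bytes/s}}
  \approx 1.9 \times 10^{4}\ \text{flops per byte moved}
```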

I/O Is a Major Bottleneck Now! Scientific simulations produce huge output volumes. Example: the GTS fusion simulation writes 200 MB per MPI process across 100,000 processes, roughly 20 TB per checkpoint. Increasing scale means increasing failure frequency, hence increasing checkpoint I/O frequency, hence increasing I/O time. A prediction by Sandia National Lab shows that checkpoint I/O will account for more than 50% of total simulation runtime under current machines' failure frequencies. Reference: Ron Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami R. Seelam, Maria Ruiz Varela, Rolf Riesen, Philip C. Roth: Modeling the Impact of Checkpoints on Next-Generation Systems. MSST 2007: 30-46.
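
One classic back-of-the-envelope model (Young's approximation; not shown on the slide) makes the trend explicit: if writing a checkpoint takes delta seconds and the system's mean time between failures is M, the near-optimal checkpoint interval and the resulting I/O overhead are roughly

```latex
\tau_{\text{opt}} \approx \sqrt{2\,\delta\,M},
\qquad
\text{overhead} \approx \frac{\delta}{\tau_{\text{opt}}} = \sqrt{\frac{\delta}{2M}}
```

so a fixed 20 TB checkpoint becomes relatively more expensive as M shrinks with machine scale.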

I/O Is a Major Bottleneck Now! Analysis and visualization need to read data back to gain useful insights from the raw bits, and file read time can account for 90% of the total runtime of visualization tasks. Reference: Tom Peterka, Hongfeng Yu, Robert B. Ross, Kwan-Liu Ma, Robert Latham: End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P. ICPP 2009: 566-573.

In Situ I/O Processing. Eliminate the I/O bottleneck by tightly coupling simulation and analysis: instead of Simulation -> PFS -> Analysis, remove the parallel file system from the critical path and go Simulation -> Analysis.

In Situ I/O Processing. Process simulation output data online, while the data is being generated. Many useful analyses can be done this way: data reduction (filtering, feature extraction), data preparation (layout re-organization), and data inspection (validation, monitoring). Reducing disk I/O activity reduces time and power consumption, and shortens the time from data to insight.
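
A toy sketch of the data-reduction/inspection flavor (not from the slides; the field, summary file, and sizes are illustrative): each timestep computes a global min/max/mean with MPI_Reduce and writes only a few bytes of summary instead of the full field.

```c
#include <mpi.h>
#include <stdio.h>

void inspect_step(MPI_Comm comm, const double *field, int n, int step)
{
    double local[3] = {field[0], field[0], 0.0};   /* min, max, sum */
    for (int i = 0; i < n; i++) {
        if (field[i] < local[0]) local[0] = field[i];
        if (field[i] > local[1]) local[1] = field[i];
        local[2] += field[i];
    }

    double gmin, gmax, gsum;
    MPI_Reduce(&local[0], &gmin, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&local[1], &gmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&local[2], &gsum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) {
        FILE *f = fopen("summary.txt", "a");   /* a few bytes instead of GBs */
        fprintf(f, "step %d min %g max %g mean %g\n",
                step, gmin, gmax, gsum / ((double)size * n));
        fclose(f);
    }
}
```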

Placing In Situ Analytics. There are multiple options for placing analytics alongside the simulation: inline, helper cores, staging nodes, I/O nodes, or offline.

In Situ I/O Processing. PreDatA (Preparatory Data Analytics): allocate a set of compute nodes (the staging area) to host analytics, move simulation output data into the staging area, and process the data there using MapReduce.

In Situ I/O Processing. PreDatA data flow (diagram): the application on the compute nodes emits an output data stream of packed partial data chunks plus local metadata; the staging nodes perform metadata calculation to build global metadata, serve data requests, and run stream processing over the chunks.

In Situ I/O Processing. MapReduce in the staging area (diagram): each fetched batch of data passes through initialize, map, shuffle, reduce, and finalize phases on the staging nodes.

In Situ I/O Processing. FlexIO: in situ I/O processing middleware used by simulation and analytics to exchange data. Analytics can be arbitrary MPI codes, and FlexIO enables flexible placement of analytics on simulation nodes, staging nodes, or offline nodes, with no code changes when the placement changes. The architecture diagram shows simulation/analytics codes on top of the FlexIO API and runtime (file I/O, parallel data movement, DC plug-ins, performance monitoring, buffer management), built on the EVPath messaging library over RDMA (InfiniBand, SeaStar/Portals, Gemini) and shared memory (SysV, mmap, Xpmem).

FlexIO: In Situ I/O Processing. High-performance data movement between simulation and analytics: multi-dimensional arrays are automatically re-distributed between the two parallel programs (the diagram sketches the protocol steps between the simulation processes, a directory server, and the analytics processes).
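
The core bookkeeping of any such M-to-N redistribution is computing the overlap between each writer block and each reader block; a generic 1-D sketch of that computation (not FlexIO's actual code; the sizes are illustrative) is shown below, and higher dimensions intersect each dimension the same way.

```c
#include <stdio.h>

typedef struct { long lo, hi; } range_t;   /* inclusive global index range */

/* Block owned by rank r when n elements are split across p ranks. */
static range_t block(long n, int p, int r)
{
    long base = n / p, extra = n % p;
    long lo = r * base + (r < extra ? r : extra);
    long len = base + (r < extra ? 1 : 0);
    range_t b = { lo, lo + len - 1 };
    return b;
}

int main(void)
{
    const long N = 1000;        /* global array length (illustrative) */
    const int M = 4, Nr = 3;    /* 4 writer ranks, 3 reader ranks */

    for (int w = 0; w < M; w++) {
        for (int r = 0; r < Nr; r++) {
            range_t a = block(N, M, w), b = block(N, Nr, r);
            long lo = a.lo > b.lo ? a.lo : b.lo;
            long hi = a.hi < b.hi ? a.hi : b.hi;
            if (lo <= hi)       /* non-empty overlap: writer w sends [lo,hi] to reader r */
                printf("writer %d -> reader %d : [%ld, %ld]\n", w, r, lo, hi);
        }
    }
    return 0;
}
```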

In Situ I/O Processing. Placement algorithms decide where to run analytics. Data-aware mapping: take the inter-program data movement volume as input, use graph partitioning to group processes, and map process groups to nodes (see the sketch below). Holistic placement: conservative resource allocation; take intra- and inter-program data movement volumes as input and use graph mapping to map processes to cores. Node-topology-aware placement: model each node based on its cache structure.
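
A minimal sketch of the graph-partitioning step in data-aware mapping (an assumption: it uses a METIS 5.x-style partitioner, which the slides do not name; the toy graph and weights are invented for illustration): vertices are processes, edge weights are inter-process data volumes, and the partitioner groups heavily communicating processes onto the same node.

```c
#include <metis.h>
#include <stdio.h>

int main(void)
{
    /* Toy communication graph in CSR form: 4 processes on a ring 0-1-2-3-0. */
    idx_t nvtxs = 4, ncon = 1, nparts = 2;
    idx_t xadj[5]   = {0, 2, 4, 6, 8};
    idx_t adjncy[8] = {1, 3, 0, 2, 1, 3, 2, 0};
    idx_t adjwgt[8] = {100, 1, 100, 1, 1, 100, 100, 1};  /* bytes moved (toy) */

    idx_t objval, part[4];
    int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                 NULL /*vwgt*/, NULL /*vsize*/, adjwgt,
                                 &nparts, NULL, NULL, NULL, &objval, part);
    if (rc == METIS_OK)
        for (idx_t i = 0; i < nvtxs; i++)
            printf("process %d -> node %d\n", (int)i, (int)part[i]);
    return 0;
}
```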

In Situ I/O Processing. Placement of analytics (diagram): the legend distinguishes inter-program, intra-simulation, and intra-analytics data movement, and contrasts data-aware mapping with holistic placement of simulation and analytics processes.

In Situ I/O Processing. Improving application performance by smart placement of analytics: for the GTS fusion simulation coupled with statistical analysis, placement leads to a 20% improvement in total runtime (chart: total execution time in seconds versus GTS cores from 512 to 4096, comparing Inline, Helper Core with data-aware mapping, Helper Core with holistic placement, Helper Core with node-topology-aware placement, Staging, and a lower bound).

Parallel FS vs. Distributed FS. Distributed file systems (DFSs) such as Google FS or HDFS are used in data center environments. Similarities: a client/server (metadata, data) architecture, basic file semantics, and support for parallel and distributed workloads. Differences: deployment model (a DFS co-locates computation and data and aggregates local disks, whereas a PFS assumes diskless clients); interface (a PFS provides collective I/O semantics and most POSIX semantics, while a DFS like HDFS supports key-value store semantics, assumes write-once semantics, disallows concurrent writes to one file, and exposes data locality information to the job scheduler); implementation (data distribution, failure/consistency handling, etc.). Reference: On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson, Seung Woo Son, Samuel J. Lang, Robert B. Ross. SC'11, November 12-18, 2011.

Interesting Research Topics. Integrating new hardware into the parallel I/O stack (e.g., SSD, NVRAM). Moving computation close to data: using MapReduce/NoSQL systems to process massive scientific data, online real-time stream processing, and moving analytics into file systems. Analytics-driven simulation: re-computing data may be cheaper and faster than storing and loading it!

References
MPI-IO: http://www.mcs.anl.gov/research/projects/romio/
HDF5: http://www.hdfgroup.org/hdf5/
NetCDF: http://www.unidata.ucar.edu/software/netcdf/
ADIOS: http://www.olcf.ornl.gov/center-projects/adios/
Lustre: http://wiki.lustre.org/index.php/main_page
PVFS: http://www.pvfs.org/
GPFS: http://www-03.ibm.com/systems/software/gpfs/