CSE6230 Fall 2012: Parallel I/O
Fang Zheng
Credits
- Some materials are taken from Rob Latham's "Parallel I/O in Practice" talk: http://www.spscicomp.org/scicomp14/talks/latham.pdf
Outline
- I/O Requirements of HPC Applications
- Parallel I/O Stack: from storage hardware to I/O libraries; high-level I/O and middleware
- Case Study: ADIOS
- In Situ I/O Processing
- Interesting Research Topics
Parallel I/O: I/O Requirements for HPC Applications
- Checkpoint/restart: for defending against failures
- Analysis data: for later analysis and visualization
- Other data: diagnostics, logs, snapshots, etc.
- Applications view data with domain-specific semantics: variables, meshes, particles, attributes, etc.
- Challenges:
  - High concurrency: e.g., 100,000 processes
  - Huge data volume: e.g., 200MB per process
  - Ease of use: scientists are not I/O experts
Supporting Application I/O
- Provide a mapping of application-domain data abstractions: an API that uses language meaningful to application programmers
- Coordinate access by many processes: collective I/O, consistency semantics
- Organize I/O devices into a single space: convenient utilities and file model
- And also:
  - Insulate applications from I/O system changes
  - Maintain performance!
What about Parallel I/O?
- The focus of parallel I/O is on using parallelism to increase bandwidth
- Use multiple data sources/sinks in concert: both multiple storage devices and multiple/wide paths to them
- But applications don't want to deal with block devices and network protocols, so we add software layers
Parallel I/O Stack
- I/O subsystems in supercomputers
- Example: Oak Ridge National Lab's Jaguar Cray XT4/5 (figure cited from external source)
Parallel I/O Stack
- Another example: IBM Blue Gene/P (figure cited from external source)
Parallel File Systems (PFSs)
- Organize I/O devices into a single logical space
- Stripe files across devices for performance
- Export a well-defined API, usually POSIX: access data in contiguous regions of bytes; very general
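The striping idea above can be sketched concretely: under round-robin striping (as Lustre does across its storage targets), any byte offset in a file maps deterministically to one server and an offset within that server's object. The stripe size and server count below are illustrative values, not real defaults.

```python
# Sketch: how a parallel file system might stripe a file across I/O servers.
# Round-robin striping with a fixed stripe size; values are illustrative.

def locate(offset, stripe_size=1 << 20, n_servers=4):
    """Map a byte offset in the file to (server index, offset within that
    server's object) under round-robin striping."""
    stripe_index = offset // stripe_size       # which stripe holds the byte
    server = stripe_index % n_servers          # stripes assigned round-robin
    local_stripe = stripe_index // n_servers   # nth stripe stored on that server
    local_offset = local_stripe * stripe_size + offset % stripe_size
    return server, local_offset

# Bytes 0..1MB-1 land on server 0; the next stripe goes to server 1, etc.
print(locate(0))        # (0, 0)
print(locate(1 << 20))  # (1, 0)
print(locate(5 << 20))  # (1, 1048576): the second stripe stored on server 1
```

Because consecutive stripes land on different servers, a large sequential read or write engages all servers at once, which is where the bandwidth gain comes from.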
Parallel I/O Stack
- Idea: add software components to address the remaining issues:
  - Coordination of access
  - Mapping from the application model to the I/O model
- These components become increasingly specialized as we add layers
- Bridge the gap between existing I/O systems and application needs
Parallel I/O for HPC
Break up support into multiple layers:
- High-level I/O library: maps application abstractions to a structured, portable file format (e.g., HDF5, Parallel netCDF, ADIOS)
- Middleware layer: deals with organizing access by many processes (e.g., MPI-IO, UPC-IO)
- Parallel file system: maintains the logical space and provides efficient access to data (e.g., PVFS, GPFS, Lustre)
High-Level Libraries
- Provide an appropriate abstraction for the domain: multidimensional datasets, typed variables, attributes
- Self-describing, structured file format
- Map to the middleware interface
- Encourage collective I/O
- Provide optimizations that middleware cannot
Parallel I/O Stack: High-Level I/O Libraries
- Provide richer semantics than the file abstraction: match applications' data models (variables, attributes, data types, domain decomposition, etc.)
- Optimize I/O performance on top of MPI-IO; can leverage more application-level knowledge: file format and layout, orchestration/coordination of I/O requests
- Examples: HDF5, NetCDF, ADIOS, SILO, etc.
I/O Middleware
- Facilitate concurrent access by groups of processes: collective I/O, atomicity rules
- Expose a generic interface: a good building block for high-level libraries; match the underlying programming model (e.g., MPI)
- Efficiently map middleware operations into PFS ones; leverage any rich PFS access constructs
Parallel I/O Supported by MPI-IO
- Individual files: each MPI process writes/reads a separate file
- Shared file, individual file pointers: MPI processes write/read a single shared file with individual file pointers
- Shared file, collective I/O: MPI processes write/read a single shared file with collective semantics
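The three access patterns above differ mainly in where each rank's data goes. A minimal sketch of the file-and-offset arithmetic, assuming every rank writes the same `chunk` of bytes (the function and file names are illustrative, not MPI calls):

```python
# Sketch of the three MPI-IO access patterns for nprocs ranks, each
# writing `chunk` bytes. Names are illustrative, not real MPI-IO APIs.

def file_and_offset(rank, nprocs, chunk, pattern):
    if pattern == "individual-files":
        # One file per process; every rank starts at offset 0.
        return ("out.%d" % rank, 0)
    elif pattern in ("shared-individual", "shared-collective"):
        # One shared file; rank r owns bytes [r*chunk, (r+1)*chunk).
        # Under collective I/O the library sees all requests at once and
        # can merge them into large contiguous accesses.
        return ("out.shared", rank * chunk)
    raise ValueError(pattern)

print(file_and_offset(0, 4, 1024, "individual-files"))   # ('out.0', 0)
print(file_and_offset(3, 4, 1024, "shared-individual"))  # ('out.shared', 3072)
```

The offsets are identical for the two shared-file patterns; the difference is purely whether the library is told about all ranks' requests together (collective) or sees them one at a time (individual pointers).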
Parallel File System
- Manage storage hardware: present a single view; focus on concurrent, independent access
- Knowledge of collective I/O is usually very limited
- Publish an interface that middleware can use effectively: a rich I/O language; relaxed but sufficient semantics
Parallel I/O Stack: Parallel File System
- Example: Lustre file system
- Reference: http://wiki.lustre.org/images/3/38/shipman_feb_lustre_scalability.pdf
High-Level I/O Library Case Study: ADIOS
ADIOS (Adaptable I/O System):
- Developed by Georgia Tech and Oak Ridge National Lab
- Works on Lustre and IBM's GPFS
- In production use by several major DOE applications
Features:
- Simple, high-level API for reading/writing data in parallel
- Supports several popular file formats
- High I/O performance at large scales
- Extensible framework
High-Level I/O Library Case Study: ADIOS
ADIOS architecture:
- Layered design
- High-level ADIOS API: adios_open/read/write/close
- Supports multiple underlying file formats and I/O methods
- Built-in optimizations: scheduling, buffering, etc.
(Architecture diagram: scientific codes call the ADIOS API, which applies buffering, scheduling, and feedback, then dispatches to pluggable transports such as pnetcdf, HDF-5, DART, LIVE/DataTap, MPI-IO, POSIX IO, viz engines, and others, driven by external metadata in an XML file.)
High-Level I/O Library Case Study: ADIOS
Optimizing write performance under contention:
- Write performance is critical for checkpointing
- The parallel file system is shared both by processes within an MPI program and by different MPI programs running concurrently
- How can we attain high write performance on a busy, shared supercomputer?
(Figure: two applications writing to a shared set of I/O servers.)
In Situ I/O Processing: An Alternative Approach to Parallel I/O
Negative interference due to contention on the shared file system:
- Internal contention between processes within the same MPI program
- As the ratio of application processes to I/O servers passes a certain point, write bandwidth starts to drop
In Situ I/O Processing: An Alternative Approach to Parallel I/O
Negative interference due to contention on the shared file system:
- External contention between different MPI programs
- Huge variations in I/O performance on supercomputers
Reference: Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, Matthew Wolf. "Managing Variability in the IO Performance of Petascale Storage Systems". In Proceedings of SC'10, New Orleans, LA, November 2010.
High-Level I/O Library Case Study: ADIOS
- How to obtain high write performance on a busy, shared supercomputer?
- The basic trick is to find slow (overloaded) I/O servers and avoid writing data to them
(Figure: two applications steering writes away from overloaded I/O servers.)
High-Level I/O Library Case Study: ADIOS
ADIOS solution: coordination
- Divide the writing processes into groups; each group has a sub-coordinator to monitor writing progress
- Near the end of collective I/O, the coordinator has a global view of storage-target performance and redirects stragglers to write to fast targets
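The coordination idea can be sketched as follows: once per-target bandwidths have been observed, writers assigned to slow targets are redirected to fast ones. This is a toy illustration of the objective, with a made-up threshold rule and data layout; it is not the actual ADIOS protocol.

```python
# Hedged sketch: redirect writers away from overloaded storage targets.
# The slow_factor threshold and greedy choice are illustrative assumptions.

def reassign(assignment, bandwidth, slow_factor=0.5):
    """assignment: writer -> storage target; bandwidth: target -> MB/s.
    Writers on targets slower than slow_factor * (max bandwidth) are
    moved to the fastest target."""
    fastest = max(bandwidth, key=bandwidth.get)
    threshold = slow_factor * bandwidth[fastest]
    return {w: (fastest if bandwidth[t] < threshold else t)
            for w, t in assignment.items()}

bw = {"ost0": 100.0, "ost1": 20.0, "ost2": 90.0}   # ost1 is overloaded
plan = {"w0": "ost0", "w1": "ost1", "w2": "ost2"}
print(reassign(plan, bw))   # w1 is redirected from ost1 to ost0
```

In the real system the redirection decision is made near the end of the collective write, when the coordinator's global view makes stragglers easy to identify.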
High-Level I/O Library Case Study: ADIOS
Results with a parallel application (Pixie3D):
- Higher I/O bandwidth
- Less variation
In Situ I/O Processing
- An alternative approach to existing parallel I/O techniques
- Motivation: I/O is becoming the bottleneck for large-scale simulation AND analysis
I/O Is a Major Bottleneck Now!
- Under-provisioned I/O and storage subsystems in supercomputers
- Huge disparity between I/O and computation capacity
- I/O resources are shared and contended by concurrent jobs

Machine (as of Nov. 2011) | Peak Flops     | Peak I/O bandwidth | Flops/byte
Jaguar Cray XT5           | 2.3 Petaflops  | 120 GB/sec         | 19167
Franklin Cray XT4         | 352 Teraflops  | 17 GB/sec          | 20706
Hopper Cray XE6           | 1.28 Petaflops | 35 GB/sec          | 36571
Intrepid BG/P             | 557 Teraflops  | 78 GB/sec          | 7141
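The flops-per-byte ratios in the table are just peak flops divided by peak I/O bandwidth in bytes per second, and can be reproduced directly:

```python
# Reproduce the flops-per-byte-of-I/O ratios from the machine table.
machines = {
    "Jaguar Cray XT5":   (2.3e15, 120e9),
    "Franklin Cray XT4": (352e12, 17e9),
    "Hopper Cray XE6":   (1.28e15, 35e9),
    "Intrepid BG/P":     (557e12, 78e9),
}
for name, (flops, bw) in machines.items():
    print("%-18s %8.0f flops per byte of I/O" % (name, flops / bw))
```

The larger the ratio, the more computation a machine performs for every byte it can move to storage; across these machines, tens of thousands of floating-point operations per byte of I/O.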
I/O Is a Major Bottleneck Now!
- Huge output volume for scientific simulations. Example: the GTS fusion simulation writes 200MB per MPI process x 100,000 processes = 20TB per checkpoint
- Increasing scale means increasing failure frequency, hence increasing I/O frequency, hence increasing I/O time
- A prediction by Sandia National Lab shows that checkpoint I/O will account for more than 50% of total simulation runtime at current machines' failure frequency
Reference: Ron Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami R. Seelam, Maria Ruiz Varela, Rolf Riesen, Philip C. Roth. "Modeling the Impact of Checkpoints on Next-Generation Systems". MSST 2007: 30-46.
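A quick back-of-the-envelope check of those numbers: 200 MB per process across 100,000 processes gives roughly 20 TB, and writing that even at Jaguar's 120 GB/s *peak* bandwidth (an optimistic assumption, since real shared-system bandwidth is lower) already takes minutes.

```python
# Back-of-the-envelope check of the checkpoint numbers above.
per_proc = 200 * 2**20          # 200 MiB in bytes
nprocs = 100_000
checkpoint = per_proc * nprocs  # total checkpoint size in bytes

print(checkpoint / 2**40, "TiB per checkpoint")        # ~19 TiB (~20 TB)
print(checkpoint / 120e9, "seconds at 120 GB/s peak")  # ~175 s
```

At realistic (contended) bandwidths, a single checkpoint can take far longer, which is why checkpoint frequency drives so much of total I/O time.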
I/O Is a Major Bottleneck Now!
- Analysis and visualization need to read data back to gain useful insights from the raw bits
- File read time can account for 90% of the total runtime of visualization tasks
Reference: Tom Peterka, Hongfeng Yu, Robert B. Ross, Kwan-Liu Ma, Robert Latham. "End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P". ICPP 2009: 566-573.
In Situ I/O Processing
- Eliminate the I/O bottleneck by tightly coupling simulation and analysis
(Figure: the traditional pipeline Simulation -> PFS -> Analysis, versus the coupled Simulation -> Analysis path that removes the PFS bottleneck.)
In Situ I/O Processing
- Process simulation output data online, while the data is being generated
- Many useful analyses can be done this way:
  - Data reduction: filtering, feature extraction
  - Data preparation: layout re-organization
  - Data inspection: validation, monitoring
- Reducing disk I/O activity reduces both time and power consumption
- Reduces the time from data to insight
Placing In Situ Analytics
There are multiple options for placing analytics alongside the simulation:
- Inline
- Helper core
- Staging nodes
- I/O nodes
- Offline
In Situ I/O Processing
PreDatA (Preparatory Data Analytics):
- Allocate a set of compute nodes (the Staging Area) to host analytics
- Move simulation output data into the Staging Area
- Process data in the Staging Area using MapReduce
In Situ I/O Processing
PreDatA data flow:
(Figure: applications on compute nodes stream output data to staging nodes as packed partial data chunks with local metadata; the staging nodes compute global metadata and run stream processing to answer data requests.)
In Situ I/O Processing
MapReduce in the Staging Area:
(Figure: a fetch, initialize, map, shuffle, reduce, finalize pipeline running across the staging nodes.)
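The map -> shuffle -> reduce flow pictured above can be sketched over data "chunks" as a staging area would receive them. This is a toy word-count stand-in, not PreDatA's actual API or processing code:

```python
# Minimal map -> shuffle -> reduce sketch over streamed data chunks.
from collections import defaultdict

def map_phase(chunk):
    # Emit (key, 1) pairs from one fetched data chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs_lists):
    # Group intermediate pairs by key across all mappers.
    groups = defaultdict(list)
    for pairs in pairs_lists:
        for key, val in pairs:
            groups[key].append(val)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(vals) for key, vals in groups.items()}

chunks = ["ion density ion", "density temperature"]
result = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(result)   # {'ion': 2, 'density': 2, 'temperature': 1}
```

The appeal in the staging setting is that map tasks can start as soon as each chunk arrives from the simulation, overlapping analysis with data movement.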
In Situ I/O Processing
FlexIO: in situ I/O processing middleware
- Used by simulation and analytics to exchange data; analytics can be arbitrary MPI codes
- Enables flexible placement of analytics: on simulation nodes, staging nodes, or offline nodes, with no code changes when the placement changes
(Architecture diagram: simulation/analytics codes call the FlexIO API; the FlexIO runtime provides file I/O, parallel data movement, DC plug-ins, and performance monitoring on top of the EVPath messaging library, with buffer management over RDMA (InfiniBand, SeaStar/Portals, Gemini) and shared memory (SysV, mmap, Xpmem).)
FlexIO: In Situ I/O Processing
- High-performance data movement between simulation and analytics
- Automatically re-distributes multi-dimensional arrays between two parallel programs
(Figure: an array decomposed across simulation processes is re-partitioned for the analytics processes, with a directory server coordinating the transfer steps.)
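The core of such M-to-N redistribution is computing, for each reader, which ranges it must fetch from which writers. A sketch for a 1-D array block-decomposed over M writers and N readers follows; this illustrates the general technique, not FlexIO's actual scheme:

```python
# Sketch: M-to-N redistribution plan for a 1-D block-decomposed array.

def blocks(total, nparts):
    """Even 1-D block decomposition: list of (start, end) per rank."""
    base, rem = divmod(total, nparts)
    bounds, start = [], 0
    for r in range(nparts):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def transfer_plan(total, m_writers, n_readers):
    """List of (reader, writer, start, end) for every overlapping range."""
    plan = []
    for ri, (rs, re) in enumerate(blocks(total, n_readers)):
        for wi, (ws, we) in enumerate(blocks(total, m_writers)):
            lo, hi = max(rs, ws), min(re, we)
            if lo < hi:   # the reader's block overlaps this writer's block
                plan.append((ri, wi, lo, hi))
    return plan

# 9 elements written by 3 writers, read back by 2 readers:
for entry in transfer_plan(9, 3, 2):
    print(entry)
```

Multi-dimensional arrays work the same way, with the intersection computed per dimension; the middleware then issues one transfer per non-empty intersection.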
In Situ I/O Processing
Placement algorithms: decide where to run analytics
- Data-aware mapping: takes inter-program data movement volume as input; uses graph partitioning to group processes; maps process groups to nodes
- Holistic placement: conservative resource allocation; takes intra- and inter-program data movement volume as input; uses graph mapping to map processes to cores
- Node-topology-aware placement: models each node based on its cache structure
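A toy version of the data-aware objective: co-locate each analytics process with the node it exchanges the most data with, subject to node capacity. The real algorithms use proper graph partitioning and mapping; this greedy sketch only illustrates the goal of minimizing inter-node data movement.

```python
# Greedy sketch of data-aware analytics placement (illustrative only;
# the real system uses graph partitioning, not this heuristic).

def data_aware_mapping(volume, node_of_sim, slots_per_node):
    """volume[(a, s)] = bytes analytics process a exchanges with
    simulation process s; node_of_sim maps sim processes to nodes.
    Co-locate each analytics process with the node it talks to most,
    subject to a per-node slot capacity."""
    totals = {}   # (analytics proc, node) -> total exchanged volume
    for (a, s), v in volume.items():
        node = node_of_sim[s]
        totals[(a, node)] = totals.get((a, node), 0) + v
    placement, used = {}, {}
    # Heaviest flows first, so large transfers get co-located preferentially.
    for (a, node), v in sorted(totals.items(), key=lambda kv: -kv[1]):
        if a not in placement and used.get(node, 0) < slots_per_node:
            placement[a] = node
            used[node] = used.get(node, 0) + 1
    return placement

vol = {("a0", "s0"): 100, ("a0", "s1"): 10, ("a1", "s1"): 80}
print(data_aware_mapping(vol, {"s0": 0, "s1": 1}, slots_per_node=1))
# a0 lands on node 0 (its heaviest partner), a1 on node 1
```

Holistic and node-topology-aware placement extend the same idea by also weighing intra-program traffic and per-node cache structure when assigning processes to cores.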
In Situ I/O Processing
Placement of analytics:
(Figure: simulation and analytics processes with inter-program, intra-simulation, and intra-analytics data movement, under data-aware mapping and holistic placement.)
In Situ I/O Processing
Improving application performance by smart placement of analytics:
- GTS fusion simulation + statistical analysis
- Placement leads to a 20% improvement in total runtime
(Figure: total execution time in seconds versus GTS cores, 512 to 4096, for Inline, Helper Core with data-aware mapping, Helper Core with holistic placement, Helper Core with node-topology-aware placement, Staging, and a lower bound.)
Parallel FS vs. Distributed FS
Distributed file systems (DFSs) are used in data-center environments (e.g., Google FS, HDFS).
Similarities:
- Client/server (metadata, data) architecture
- Basic file semantics
- Support for parallel and distributed workloads
Differences:
- Deployment model: a DFS co-locates computation and data and aggregates local disks; a PFS assumes diskless clients
- Interface: a PFS provides collective I/O semantics and (most) POSIX semantics; a DFS like HDFS supports key-value store semantics, assumes write-once semantics, disallows concurrent writes to one file, and exposes data-locality information to the job scheduler
- Implementation: data distribution, failure/consistency handling, etc.
Reference: Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson, Seung Woo Son, Samuel J. Lang, Robert B. Ross. "On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS". SC'11, November 12-18, 2011.
Interesting Research Topics
- Integrate new hardware into the parallel I/O stack (e.g., SSD, NVRAM)
- Move computation close to data: use MapReduce/NoSQL systems to process massive scientific data
- Online real-time stream processing
- Move analytics into file systems
- Analytics-driven simulation: re-computing data may be cheaper and faster than storing and loading it!
References
- MPI-IO: http://www.mcs.anl.gov/research/projects/romio/
- HDF5: http://www.hdfgroup.org/hdf5/
- NetCDF: http://www.unidata.ucar.edu/software/netcdf/
- ADIOS: http://www.olcf.ornl.gov/center-projects/adios/
- Lustre: http://wiki.lustre.org/index.php/main_page
- PVFS: http://www.pvfs.org/
- GPFS: http://www-03.ibm.com/systems/software/gpfs/