Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O
Richard Gerber, Deputy Group Lead, NERSC User Services
August 30, 2013
Some slides from Katie Antypas

I/O Needs Getting Bigger All the Time
- I/O needs growing each year in the scientific community
- I/O is a bottleneck for many
- Data sizes growing with system memory
- For large users, I/O parallelism is mandatory
- Titan has 10 PB of disk and Blue Waters 22 PB!
Images from David Randall, Paola Cessi, John Bell, T. Scheibe

Outline
- Storage Architectures
- File Systems
- I/O Strategies
- MPI-IO
- Parallel I/O Libraries

Why is Parallel I/O for science applications difficult?
- Scientists think about data in terms of how a system is represented in the code: as grid cells, particles, ...
- Ultimately, data is stored on a physical device
- Layers in between the application and the device are complex and varied
- I/O interfaces are often arcane and complicated
Images from David Randall, Paola Cessi, John Bell, T. Scheibe

System I/O Architecture
- Should an application scientist/programmer care about what the I/O system looks like? Yes! It would be nice not to have to, but performance and perhaps functionality depend on it.
- You may be able to make simple changes to the code or runtime environment that make a big difference.
- Inconvenient truth: scientists need to understand their I/O in order to get good performance. This may be mitigated by using I/O libraries.

Storage Architectures

Simplified I/O Hierarchy (top to bottom):
Application → High-Level I/O Library → Intermediate Layer (may be MPI-IO) → Parallel File System → Storage Device

Storage Devices
- Usually we'll be talking about arrays of hard disks
- Flash drives are being used as fast disks, but are expensive
- Burst buffers are coming soon
- Magnetic tapes are cheap, but slow, and usually don't appear as standard file systems

Some Definitions
- Capacity (in MB, GB, TB, PB): depends on the area available on the storage device and the density at which data can be written.
- Transfer rate (bandwidth), in MB/sec or GB/sec: the rate at which a device reads or writes data. Depends on many factors: network interfaces, disk speed, etc. Be careful with parallel bandwidth numbers: aggregate? per what?
- Access time (latency): the delay before the first byte is read.
- Metadata: a description of where and how a file or directory is stored on physical media. It is itself data that has to be read and written, and excessive metadata access can limit performance.
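As a rough worked example of how latency and bandwidth combine (the numbers below are illustrative assumptions, not figures from the talk), the time to move a chunk of data is approximately

$$ T \approx T_{\text{latency}} + \frac{\text{size}}{\text{bandwidth}} $$

With a 10 ms access time and 10 GB/s of bandwidth, a 1 MB read takes about 0.010 s + 0.0001 s ≈ 10.1 ms (effective bandwidth ≈ 0.1 GB/s), while a 1 GB read takes about 0.010 s + 0.1 s = 0.11 s (effective bandwidth ≈ 9.1 GB/s). Large transfers are what let you approach the quoted bandwidth.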

Latencies (figure): approximate access latencies span roughly 1 ns for CPU registers and the L1 data/instruction caches, 10 ns for L2 cache, 100 ns for the memory interface, 1,000 ns for the node interconnect, and 1,000,000 ns for disk storage.

Bandwidths
- How fast can you stream data from your application to/from disk?
- Once you pay the latency penalty, HPC system bandwidths are large: ~10s to now 100s of GB/sec
- To take advantage of this, read/write large chunks (MBs) of data
- Serial bandwidths < 1 GB/sec, limited by interfaces and/or physical media
- You need parallelism to achieve high aggregate bandwidth
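A minimal sketch of the "large chunks" advice in plain C with POSIX I/O; the file name, total size, and chunk size below are assumptions for illustration, not values from the talk.

```c
/* Write a large buffer in multi-megabyte chunks rather than many tiny writes. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES (256UL * 1024 * 1024)   /* 256 MB of data (assumed size)   */
#define CHUNK  (8UL * 1024 * 1024)     /* 8 MB per write() call (assumed) */

int main(void)
{
    char *buf = malloc(NBYTES);
    memset(buf, 0, NBYTES);

    int fd = open("bigfile.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* placeholder name */
    for (size_t off = 0; off < NBYTES; off += CHUNK) {
        size_t n = (NBYTES - off < CHUNK) ? NBYTES - off : CHUNK;
        /* One large write per call: far fewer system calls and better
           streaming to the file system than many KB-sized writes. */
        if (write(fd, buf + off, n) < 0) { /* handle error */ }
    }
    close(fd);
    free(buf);
    return 0;
}
```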

File Buffering and Caching
- Buffering: used to improve performance. The file system collects full blocks of data before transferring data to disk; for large writes, it can transfer many blocks at once.
- Caching: the file system retrieves an entire block of data, even if all of the data was not requested, and the data remains in the cache.
- Buffering and caching can happen in many places: compute node, I/O server, disk.
- Behavior is not the same on all platforms, so it is important to study your own application's performance rather than look at peak numbers.
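As a minimal sketch of the compute-node side of this layering (standard C/POSIX calls; the file name is a placeholder), data written by an application may sit in library and operating-system buffers until it is explicitly flushed:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("out.dat", "w");   /* placeholder file name */
    double x = 3.14;

    fwrite(&x, sizeof x, 1, fp);        /* lands in the stdio buffer first      */
    fflush(fp);                         /* push buffered bytes to the OS        */
    fsync(fileno(fp));                  /* ask the OS to commit them to storage */

    fclose(fp);
    return 0;
}
```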

File Systems

Local vs. Global File Systems
- On-board (the old "local"): disks directly attached to the motherboard via some interface. Few HPC systems have disks directly attached to a node.
- "Local" in HPC: accessed from one system. Network-attached, PB+ file systems, reached via a high-speed internal network (e.g. IB, Gemini, Aries), directly from the node via a high-speed custom network (e.g. IB, FibreChannel), or over Ethernet. Contention among jobs on the system affects performance.
- "Global": accessed from multiple systems. A networked file system; activity on other systems can impact performance. Useful for avoiding data replication and movement among systems.

Top Parallel File Systems Used in HPC (slide shows file system logos, including GPFS)

Generic Parallel File System Architecture (figure): compute nodes connect through an internal network to a set of I/O servers, which connect through an external network to the parallel file system's storage hardware (disks).

I/O Strategies

Application I/O
All significant I/O performed by your job should use the file system designed for HPC applications: the work/scratch directory, not your home directory. Home directories are not optimized for large I/O performance. Consult your center's documentation.

Parallel I/O: A User Wish List
- Easy to program
- Acceptable performance (users tell us that I/O should be on the order of 10% of run time or less)
- Data files portable among systems
- Write data from many processors into a single file
- Read data from any number of tasks (i.e., you want to see the logical data layout, not the physical layout)
- Be able to easily use M tasks to read a data file written using N tasks

High Level I/O Strategies
- A single task does all I/O
- Each task writes to its own file
- All tasks write to a single shared file
- n < N tasks write to a single file
- n1 < N tasks write to n2 < N files

Serial I/O (figure: tasks 0-5 send their data to task 0, which writes a single file)
Each task sends its data to a master task that writes the data.
Advantages:
- Simple
Disadvantages:
- Scales poorly
- Data may not fit into memory on task 0
- Bandwidth from one task is very limited
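A minimal sketch of this single-writer pattern; the per-task element count and the output file name are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* doubles per task (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *local = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) local[i] = rank + i;

    double *all = NULL;
    if (rank == 0) all = malloc((size_t)size * N * sizeof(double));

    /* Everyone ships data to task 0: simple, but the full data set must
       fit in task 0's memory and only one task ever touches the disk. */
    MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("serial_out.dat", "wb");   /* placeholder name */
        fwrite(all, sizeof(double), (size_t)size * N, fp);
        fclose(fp);
        free(all);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
```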

Parallel I/O, Multi-file: Each Task Writes Its Data to a Separate File (figure: tasks 0-5 write File0-File5)
Advantages:
- Easy to program
- Can be fast (up to a point)
Disadvantages:
- Many files can cause serious performance problems
- Hard for you to manage 10K, 100K, or 1M files
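A minimal sketch of the file-per-task approach; the file-naming scheme and data size are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[1000];
    for (int i = 0; i < 1000; i++) data[i] = rank;

    char fname[64];
    snprintf(fname, sizeof fname, "output.%06d", rank);  /* assumed naming scheme */

    FILE *fp = fopen(fname, "wb");   /* every task writes its own file, independently */
    fwrite(data, sizeof(double), 1000, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}
```

At scale this is exactly what produces the millions-of-files problem described on the next slide.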

Flash Center I/O Nightmare
- 32,000-processor run on LLNL BG/L; parallel I/O libraries not yet available, so every task wrote its own files
- Checkpoint files: 0.7 TB each, every 4 hours, 200 total
- Plot files: 20 GB each, 700 files
- Particle files: 470 MB each, 1,400 files
- Used 154 TB total and created 74 million files!
- UNIX utility problems (e.g., mv, ls, cp)
- It took 2 years to sift through the data and sew the files together

Parallel I/O Single-File All Tasks to Single File tasks 0 1 2 3 4 5 File Advantages Single file makes data manageable No system problems with excessive metadata Disadvantages Can be more difficult to program (use libs) Performance may be less

Hybrid Model I: Groups of Tasks Access Different Files (figure: tasks 0-5 split into groups, each group writing its own shared file)
Advantages:
- Fewer files than the one-file-per-task approach
- Better performance than the all-tasks-to-one-file approach
Disadvantages:
- Algorithmically complex
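A minimal sketch of this hybrid model, using MPI_Comm_split and the MPI-IO interface introduced in the next section; the group size, data size, and file-name scheme are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>

#define N          1000   /* doubles per task (assumed)       */
#define GROUP_SIZE 64     /* tasks per output file (assumed)  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank;

    /* Carve the job into groups; each group gets its own communicator. */
    int color = rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group_comm);

    int grank;
    MPI_Comm_rank(group_comm, &grank);

    char fname[64];
    snprintf(fname, sizeof fname, "output.group%04d", color);  /* assumed naming scheme */

    /* Each group writes collectively to its own shared file. */
    MPI_File fh;
    MPI_File_open(group_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)grank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}
```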

MPI-IO

What is MPI-IO?
- A parallel I/O interface for MPI programs
- Allows access to shared files using a standard API that is optimized and safe
Key concepts:
- MPI communicators: opens and closes are collective within a communicator, and only tasks in the communicator can access the file
- Derived data types: all operations (e.g. reads) have an associated MPI data type
- Collective I/O for optimizations

Basic MPI-IO Routines
- MPI_File_open(): associate a file with a file handle
- MPI_File_seek(): move the current file position to a given location in the file
- MPI_File_read(): read some fixed amount of data out of the file, beginning at the current file position
- MPI_File_write(): write some fixed amount of data into the file, beginning at the current file position
- MPI_File_sync(): flush any caches associated with the file handle
- MPI_File_close(): close the file handle
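A minimal sketch that strings these routines together, with each task writing its own contiguous block of one shared file; the file name and block size are assumptions for illustration.

```c
#include <mpi.h>

#define N 1000   /* doubles per task (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank;

    MPI_File fh;
    /* Collective open on the communicator; "shared.dat" is a placeholder name. */
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Move this task's file pointer to its own region of the file. */
    MPI_File_seek(fh, (MPI_Offset)rank * N * sizeof(double), MPI_SEEK_SET);

    /* Write N doubles starting at the current file position. */
    MPI_File_write(fh, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_sync(fh);      /* flush caches associated with the handle */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```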

Independent and Collective I/O (figure: processes P0-P5 accessing a shared file independently vs. collectively)
- Independent I/O operations specify only what a single process will do; independent I/O calls obscure relationships between the I/O on different processes
- Many applications have phases of computation and I/O; during the I/O phases, all processes read/write data
- Collective I/O is coordinated access to storage by a group of processes; collective I/O functions are called by all processes participating in the I/O
- This allows the I/O layers to know more about the access as a whole: more opportunities for optimization in lower software layers, and better performance
Slide from Rob Ross and Rob Latham at ANL
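A minimal sketch contrasting the two modes: the same write done independently (commented out) and collectively. The file name and block size are assumptions for illustration.

```c
#include <mpi.h>

#define N 1000   /* doubles per task (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);

    /* Independent version: each task writes on its own, so the I/O layer
       sees one request at a time and cannot merge them:
       MPI_File_write_at(fh, offset, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE); */

    /* Collective version: every task in the communicator calls this, so the
       MPI-IO layer can aggregate the accesses into large, contiguous
       requests before they reach the file system. */
    MPI_File_write_at_all(fh, offset, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```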

MPI-IO Summary
- Provides optimizations for typically low-performing I/O patterns (non-contiguous I/O and small-block I/O)
- You could use MPI-IO directly, but it is better to use a high-level I/O library
- MPI-IO works well in the middle of the I/O stack, letting high-level library authors write to the MPI-IO API

High Level Parallel I/O Libraries

Common Storage Formats
- ASCII: very slow, takes a lot of space, inaccurate
- Binary: non-portable (e.g. byte ordering and type sizes), not future-proof. Many users are at this level; we would like to encourage you to transition to a higher-level I/O library
- Self-describing formats: NetCDF/HDF4, HDF5, Parallel NetCDF
- Community file formats: FITS, HDF-EOS, SAF, PDB, Plot3D. Modern implementations are built on top of HDF, NetCDF, or another self-describing object-model API

What is a High Level Parallel I/O Library?
- An API which helps to express scientific simulation data in a more natural way: multi-dimensional data, labels and tags, non-contiguous data, typed data
- Typically sits on top of the MPI-IO layer and can use MPI-IO optimizations
A library offers:
- Portable formats: can write on one machine and read from another
- Longevity: output will last and be accessible with library tools, with no need to remember the version number of the code
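As a concrete illustration, here is a minimal parallel HDF5 write sketch in C, in which every task writes its slab of one self-describing dataset in a single shared file. The file name, dataset name, and sizes are assumptions for illustration.

```c
#include <mpi.h>
#include <hdf5.h>

#define N 1000   /* elements per task (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank;

    /* Open one shared HDF5 file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);  /* placeholder name */

    /* One global, self-describing dataset; each task owns a slab of it. */
    hsize_t gdims[1] = { (hsize_t)size * N };
    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "density", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);  /* placeholder dataset name */

    hsize_t start[1] = { (hsize_t)rank * N }, count[1] = { N };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* A collective write lets HDF5 hand the optimization down to MPI-IO. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```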

What about performance?
- Hand-tuned I/O for a particular application and architecture may perform better, but maybe not
- A library can be optimized for a given architecture by the library developers
- Performance is not just GB/s; more importantly, it is productivity
- If you use your own binary file format, it forces you to understand the layers below the application and to preserve your I/O routines if you want to read the data later
- Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, you have to make changes to maintain performance

I/O Library Overhead
HDF5 adds very little, if any, overhead for one-file-per-processor I/O compared to POSIX and MPI-IO. Data from Hongzhang Shan.

ADIOS
ADIOS provides a code API and an external XML descriptor file that let you process data in different ways by changing the XML file and rerunning your code. ADIOS can use different back-end file storage formats (e.g. HDF5, NetCDF).

Recommendations
- Do large I/O: write fewer, bigger chunks of data (1 MB+) rather than small, bursty I/O
- Do parallel I/O: serial I/O (a single writer) cannot take advantage of the system's parallel capabilities
- Use a single, shared file instead of one file per writer, especially at high parallel concurrency
- Avoid excessive metadata activity (e.g., avoid keeping millions of small files in a single directory)
- Use an I/O library API and write flexible, portable programs

National Energy Research Scientific Computing Center