Do You Know What Your I/O Is Doing? (and how to fix it?)
William Gropp
www.cs.illinois.edu/~wgropp

Messages
- Current I/O performance is often appallingly poor, even relative to what current systems can achieve.
- Part of the problem is the I/O interface semantics.
- Many applications need to rethink their approach to I/O; it is not sufficient to fix current I/O implementations.
- HPC centers have been complicit in causing this problem, by asking users the wrong question and by using their response as an excuse to keep doing the same thing.

Just How Bad Is Current I/O Performance?
- Much of the data (and some slides) is taken from "A Multiplatform Study of I/O Behavior on Petascale Supercomputers," Huong Luu, Marianne Winslett, William Gropp, Robert Ross, Philip Carns, Kevin Harms, Prabhat, Suren Byna, and Yushu Yao, presented at HPDC '15. The paper has much more data; consider this presentation a sampling.
  http://www.hpdc.org/2015/program/slides/luu.pdf
  http://dl.acm.org/citation.cfm?doid=2749246.2749269
- Thanks to Luu, Behzad, and the Blue Waters staff and project for the Blue Waters results. The analysis was part of the PAID program at Blue Waters.

I/O Logs Captured by Darshan, a Lightweight I/O Characterization Tool
- Instruments I/O functions at multiple levels.
- Reports key I/O characteristics.
- Does not capture text I/O functions.
- Low overhead, so it is automatically deployed on multiple platforms.
  http://www.mcs.anl.gov/research/projects/darshan/

Caveats on Darshan Data
- Users can opt out, so not all applications are recorded; typically about half are on DOE systems.
- Data is saved at MPI_Finalize. Applications that don't call MPI_Finalize (e.g., that run until time expires and then restart from the last checkpoint) aren't covered.
- About half of the Blue Waters Darshan data is not included in the analysis.

I/O Log Dataset: 4 Platforms, >1M Jobs, Almost 7 Years Combined

                         Intrepid    Mira        Edison      Blue Waters
  Architecture           BG/P        BG/Q        Cray XC30   Cray XE6/XK7
  Peak flops             0.557 PF    10 PF       2.57 PF     13.34 PF
  Cores                  160K        768K        130K        792K + 59K SMX
  Total storage          6 PB        24 PB       7.56 PB     26.4 PB
  Peak I/O throughput    88 GB/s     240 GB/s    168 GB/s    963 GB/s
  File system            GPFS        GPFS        Lustre      Lustre
  # of jobs              239K        137K        703K        300K
  Time period            4 years     18 months   9 months    6 months

Very Low I/O Throughput Is the Norm

Most Jobs Read/Write Little Data (Blue Waters data)

I/O Throughput vs. Relative Peak

I/O Time Usage Is Dominated by a Small Number of Jobs/Apps

Improving the Performance of the Top 15 Apps Can Save a Lot of I/O Time

  Platform      Top 15 apps' I/O time (% of platform)   I/O time saved if min throughput = 1 GB/s
  Mira          83%                                     32%
  Intrepid      73%                                     31%
  Edison        70%                                     60%
  Blue Waters   75%                                     63%

Top 15 Apps with the Largest I/O Time (Blue Waters)
- Together they consumed 1,500 hours of I/O time (75% of total system I/O time).

What Are Some of the Problems?
- POSIX I/O has a strong consistency model:
  - Hard to cache effectively.
  - Applications need to transfer block-aligned and block-sized data to achieve performance.
  - The complexity adds to the fragility of the file system, the major cause of failures on large-scale HPC systems.
- Files as I/O objects add metadata choke points:
  - They serialize operations, even with independent files.
  - Do you know about O_NOATIME? (See the sketch below.)
- Burst buffers will not fix these problems; the semantics of the operations must change.
- Big Data file systems have very different consistency models and metadata structures, designed for their applications' needs. Why doesn't HPC? There have been some efforts, such as PVFS, but the requirement for POSIX has held up progress.
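
One small mitigation for the metadata point above is the Linux-specific O_NOATIME flag, which asks the kernel not to update a file's access time on every read. A minimal sketch, not from the talk, with a hypothetical helper and a fallback (the flag can be refused with EPERM when the caller does not own the file):

```c
#define _GNU_SOURCE          /* O_NOATIME is a GNU/Linux extension */
#include <fcntl.h>

int open_grid_readonly(const char *path)   /* hypothetical helper */
{
    /* Ask the kernel not to update the access time on every read. */
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0) {
        /* O_NOATIME fails with EPERM unless the caller owns the file;
           fall back to a plain read-only open. */
        fd = open(path, O_RDONLY);
    }
    return fd;
}
```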

Remember
- POSIX is not just open, close, read, and write (and seek); that's (mostly) syntax.
- POSIX includes strong semantics for concurrent accesses, even if such accesses never occur.
- POSIX also requires consistent metadata: access and update times, size, ...

No Science Application Code Needs POSIX I/O
- Many are single reader or single writer: eventual consistency is fine.
- Some are disjoint readers or writers: eventual consistency is fine, but non-block-aligned writes must be handled.
- Some applications use the file system as a simple database: use a database; we know how to make those fast and reliable.
- Some applications use the file system to implement an interprocess mutex: use a mutex service, or even MPI point-to-point.
- A few use the file system as a bulletin board: they may be better off using RDMA, and they only need release or eventual consistency.
- Correct Fortran codes do not require POSIX: the standard requires unique open, enabling correct and aggressive client- and/or server-side caching.
- MPI-IO would be better off without POSIX.

Part 2: What Can We Do About It?
- Short run: what can we do now?
- Long run: how can we fix the problem?

Short Run
- Diagnose: case study, Code P.
- Avoid serialization (really!): this reflects experience with bugs in file systems, including systems that claim to be POSIX but do not provide correct POSIX semantics.
- Avoid cache problems: large block operations, aligned data (see the sketch below).
- Avoid metadata update problems: limit the number of processes updating information about files, even implicitly.
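
To make "large block operations, aligned data" concrete, here is a minimal, hypothetical sketch (not Code P) that writes an output file in whole, aligned blocks; the 1 MiB BLOCK constant and the file name are assumptions and should be matched to the real file system's block or stripe size:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK (1u << 20)   /* assumed 1 MiB block; tune to the file system */

int main(void)
{
    void *buf;
    /* Align the user buffer as well as the file offsets. */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return 1;
    memset(buf, 0, BLOCK);

    int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* hypothetical file */
    if (fd < 0) { free(buf); return 1; }

    /* Write whole, aligned blocks so the file system never has to
       read-modify-write a partial block. */
    for (int i = 0; i < 16; i++) {
        if (write(fd, buf, BLOCK) != (ssize_t)BLOCK)
            break;   /* short write or error; real code would retry or report */
    }

    close(fd);
    free(buf);
    return 0;
}
```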

Case Study, Code P
- Logically Cartesian mesh.
- Reads a ~1.2 GB grid file; this takes about 90 minutes!
- Writes similar-sized files for time steps; those only take a few minutes (each)!
- System I/O bandwidth is ~1 TB/s peak, ~5 GB/s per group of 125 nodes.

Serialized Reads
- Sometime in the past, only serialized reads worked: file systems were buggy (POSIX makes the system complex).
- Quick fix: allow 128 concurrent reads. A one-line fix, (if (mod(i,128) == 0)), in front of the barrier (see the sketch below).
- About a 10x improvement in performance; it now takes about 10 minutes to read the file.
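
The actual fix is one Fortran line, but the pattern is easier to see written out. Below is a minimal C/MPI sketch of the same idea, assuming a hypothetical read_my_block() routine that performs each rank's POSIX read; at most 128 ranks touch the file system at a time:

```c
#include <mpi.h>

void read_my_block(int rank);   /* hypothetical: each rank's POSIX read of its piece */

/* Throttled reads: ranks proceed in waves of 128 instead of fully serialized
   (the original code effectively had a barrier after every single rank's read). */
void throttled_read(void)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int waves = (size + 127) / 128;
    for (int w = 0; w < waves; w++) {
        if (rank / 128 == w)
            read_my_block(rank);
        /* Barrier only between waves, so up to 128 reads proceed concurrently. */
        MPI_Barrier(MPI_COMM_WORLD);
    }
}
```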

What's Really Wrong?
- A single grid file (in an easy-to-use, canonical order) requires each process to read multiple short sections from the file.
- The I/O system reads large blocks; only a small amount of each block is used when each process reads just its own part.
- For high performance, entire blocks must be read and used. This can be done by having different processes read whole blocks and then shuffle the data to the processes that need it.
- This is easy to accomplish with a few lines of MPI (MPI_File_set_view, MPI_File_read_all); see the sketch below.
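
A minimal sketch of what those few lines of MPI can look like for a 3-D block decomposition. The function name, the size/offset arguments, and the assumption that the grid is stored as doubles in canonical C order are illustrative, not Code P's actual layout; the point is that the collective read lets the MPI-IO layer do the "read whole blocks, then shuffle" step internally:

```c
#include <mpi.h>
#include <stdlib.h>

/* Collective read of one process's subarray of a 3-D grid file.
   gsizes: global mesh dimensions, lsizes: this rank's local dimensions,
   starts: this rank's offset in the global mesh (all illustrative inputs). */
double *read_subgrid(char *fname, int gsizes[3], int lsizes[3], int starts[3])
{
    MPI_Datatype filetype;
    MPI_File fh;
    int count = lsizes[0] * lsizes[1] * lsizes[2];
    double *local = malloc((size_t)count * sizeof(double));

    /* Describe which part of the file belongs to this process. */
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective read: the MPI-IO layer reads large contiguous blocks and
       redistributes the pieces to the ranks that need them. */
    MPI_File_read_all(fh, local, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    return local;
}
```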

Fixing Code P
- Developed a simple API for reading arbitrary blocks within an n-d mesh: 3-D tested (the expected use case); the beginning of the n-d mesh can be positioned anywhere in the file.
- Now ~3 seconds to read the file, 1800x faster than the original code. Sounds good, but it is still <1 GB/s. A similar test on BG/Q was 200x faster.
- Writes of the time steps are now the top problem: somewhat faster by default (caching by the file system is slightly easier), roughly 10 minutes per time step. MPI_File_write_all should give a benefit similar to the read (see the sketch below).
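
The write side is symmetric; a minimal sketch with illustrative names, assuming the file handle already carries a subarray view like the one in the read sketch above:

```c
#include <mpi.h>

/* Collective write of this rank's subarray for one time step.
   Assumes fh already has a subarray file view set, as in the read sketch;
   local/count describe this rank's data (illustrative names). */
void write_timestep(MPI_File fh, double *local, int count)
{
    /* MPI_File_write_all lets the MPI-IO layer aggregate the pieces into
       large, aligned writes, just as read_all does for reads. */
    MPI_File_write_all(fh, local, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
}
```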

Long Run
- Rethink the I/O API, especially the semantics.
- Open/read/write/close may be kept, but add an API to select more appropriate semantics: this maintains correctness for legacy codes, and improved APIs can be added for new codes.
- New architectures (e.g., burst buffers) are unlikely to implement POSIX semantics.

Final Thoughts
- Users are often unaware of how poor their I/O performance is; they've come to expect awful.
- Collective I/O can provide acceptable performance. The single-file approach is often most convenient for the workflow and works with an arbitrary process count. A single file per process can work, but at large scale metadata operations can limit performance.
- Antiquated HPC file system semantics make systems fragile and perform poorly. It is past time to reconsider these requirements; we should look at big data alternatives.

Thanks!
- Especially Huong Luu and Babak Behzad.
- Code P I/O: Ed Karrels.
- Funding from NSF (Blue Waters); partners at ANL and LBNL; DOE funding.