Parallel I/O from a User's Perspective

Parallel I/O from a User's Perspective. HPC Advisory Council, Stanford University, Dec. 6, 2011. Katie Antypas, Group Leader, NERSC User Services.

NERSC is the DOE production HPC computing facility. NERSC computing for science: 4,000 users and 500 projects, from 48 states, with 65% from universities; hundreds of users each day; 1,500 publications per year. Systems designed for science: the 1.3 PF Cray system Hopper, the 3rd fastest computer in the US and the fastest open Cray XE6 system, plus an additional 0.5 PF in the Franklin system and smaller clusters.

Ever-Increasing Data Sets. User I/O needs are growing each year in the scientific community. Data from sensors, detectors, telescopes, and genomes add a flood of data. For many users I/O parallelism is mandatory, and I/O remains a bottleneck for many groups. Images from David Randall, Paola Cessi, John Bell, T. Scheibe.

Why is parallel I/O for science applications so difficult? Scientists think about data in terms of their science problem: molecules, atoms, grid cells, particles. Ultimately, physical disks store bytes of data. The layers in between the application and the physical disks operate at various levels of sophistication. Images from David Randall, Paola Cessi, John Bell, T. Scheibe.

Parallel I/O, a user-perspective wish list: write data from multiple processors into a single file; read the file in the same manner regardless of the number of CPUs that read from or write to it; do so with the same performance as writing one file per processor; and make all of the above portable from one machine to the next. Inconvenient truth: scientists need to understand the I/O infrastructure in order to get good performance.

I/O Hierarchy (diagram): Application, High Level I/O Library, MPI-IO Layer, Parallel File System, Storage Device.

Storage Media: magnetic hard disk drives, magnetic tape.

Magnetic Hard Disk Drives. Source: Popular Science, Wikimedia Commons.

Some definitions. Capacity (in MB, GB, TB): determined by areal density, where areal density = track density * linear density. Transfer rate (bandwidth, MB/sec): the rate at which a device reads or writes data. Access time (milliseconds): the delay before the first byte is read.

Access Time. T(access) = T(seek) + T(latency), where T(seek) is the time to move the head to the correct track and T(latency) is the time to rotate to the correct sector. With T(seek) ~10 ms and T(latency) ~4.2 ms, T(access) is ~14 ms!! How does this compare to clock speed? A processor can do ~100 million flops in the time it takes to access disk. Image from Brenton Blawat.
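
A quick back-of-the-envelope check of that claim, assuming (hypothetically) a core that sustains roughly 7 GFLOP/s:

T_{\text{access}} = T_{\text{seek}} + T_{\text{latency}} \approx 10\,\text{ms} + 4.2\,\text{ms} \approx 14\,\text{ms}
1.4\times 10^{-2}\,\text{s} \times 7\times 10^{9}\,\text{flop/s} \approx 10^{8}\ \text{flops} \approx 100\ \text{million flops}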

Hard disk drives operate at roughly 30 to 300 MB/s. Areal density keeps getting higher thanks to improved magnetic coatings on the platters and more accurate control of the head position over the platter, and rotational rates are slowly increasing. So what's the problem? Why isn't a hard drive faster? The electronics supporting the head can't keep up with the increases in density, and access time is always a fixed cost.

Disk Transfer Rates over Time. Thanks to R. Freitas of the IBM Almaden Research Center for providing much of the data for this graph. Slide from Rob Ross and Rob Latham at ANL.

Individual disk drives are obviously not fast enough for supercomputers; we need parallelism. RAID: Redundant Array of Independent Disks. Different RAID configurations offer various levels of data redundancy and failover protection.

File Systems

What is a File System? A software layer between the operating system and the storage device which creates abstractions for files, directories, access permissions, file pointers, and file descriptors. It handles moving data between memory and storage devices, coordinating concurrent access to files, managing the allocation and deletion of data blocks on the storage devices, and data recovery. J.M. May, Parallel I/O for High Performance Computing.

From disk sectors to files. File systems allocate space in blocks, usually multiples of a disk sector. A file is created by the file system on one or more unused blocks. The file system must keep track of used and unused blocks, and it attempts to allocate nearby blocks to the same file.

Inodes store metadata: data about data. File systems store information about files externally to those files. Linux uses an inode, which stores information about files and directories (size in bytes, device ID, user ID, group ID, mode, timestamps, link info, pointers to disk blocks). Any time a file's attributes change or are queried (e.g., ls -l), the metadata has to be retrieved from the metadata server.

File systems. Your laptop has a file system, referred to as a local file system. A networked file system allows multiple clients to access files, and treats concurrent access to the same file as a rare event. A parallel file system (e.g., GPFS) builds on the concept of a networked file system: it efficiently manages hundreds to thousands of processors accessing the same file concurrently, coordinates locking, caching, buffering, and file-pointer challenges, and is scalable and high performing. J.M. May, Parallel I/O for High Performance Computing.

Generic Parallel File System Architecture (diagram): compute nodes connected by an internal network to the I/O servers and MDS (metadata server); an external network (likely FC) connects to the disk controllers, which manage failover, and to the storage hardware (disks).

Lustre file system on Hopper (diagram). Note: SCRATCH1 and SCRATCH2 have identical configurations.

File Buffering and Caching. Buffering is used to improve performance: the file system collects full blocks of data before transferring them to disk, so large writes can transfer many blocks at once. With caching, the file system retrieves an entire block of data even if not all of it was requested, and the data remains in the cache. Buffering and caching can happen in many places (compute node, I/O server, disk) and are not the same on all platforms, so it is important to study your own application's performance rather than look at peak numbers.

Data Distribution in Parallel File Systems. Slide from Rob Ross and Rob Latham at ANL.

Locking in Parallel File Systems. Most parallel file systems use locks to manage concurrent access to files. Files are broken up into lock units (also called blocks), and clients obtain locks on the units they will access before I/O occurs. This enables caching on clients as well: as long as a client holds a lock, it knows its cached data is valid. Locks are reclaimed from clients when others desire access. If an access touches any data in a lock unit, the lock for that region must be obtained before the access occurs. Slide from Rob Ross and Rob Latham at ANL.

Locking and Concurrent Access. Question: what would happen if writes were smaller than lock units? Slide from Rob Ross and Rob Latham at ANL.

Small Writes. How will the parallel file system perform with small writes (smaller than a lock unit)?

Access Patterns (diagram of memory-to-file mappings over time): contiguous; contiguous in memory, not in file; contiguous in file, not in memory; dis-contiguous; bursty; out-of-core.

Now from the user's point of view.

Serial I/O (diagram: processors 0-5 funnel their data to a single file). Each processor sends its data to the master, which then writes the data to a file. Advantages: simple; may perform OK for very small I/O sizes. Disadvantages: not scalable; not efficient, slow for any large number of processors or data sizes; may not be possible if memory constrained.
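
A minimal sketch of this gather-to-master pattern (the file name and array sizes are made up for illustration): rank 0 collects every rank's block with MPI_Gather and writes the whole array itself.

/* Serial I/O sketch: all data funnels through rank 0, which writes one file. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;                      /* elements owned by each rank (illustrative) */
    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = rank + 0.001 * i;

    double *global = NULL;
    if (rank == 0) global = malloc((size_t)n * nprocs * sizeof(double));

    /* Everything funnels through the master: simple, but rank 0's memory
     * and bandwidth become the bottleneck. */
    MPI_Gather(local, n, MPI_DOUBLE, global, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("serial_out.dat", "wb");   /* hypothetical output file */
        fwrite(global, sizeof(double), (size_t)n * nprocs, fp);
        fclose(fp);
        free(global);
    }
    free(local);
    MPI_Finalize();
    return 0;
}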

Parallel I/O, multi-file (diagram: processors 0-5 each write their own file). Each processor writes its own data to a separate file. Advantages: simple to program; can be fast (up to a point). Disadvantages: can quickly accumulate many files that are hard to manage; requires post-processing; difficult for storage systems such as HPSS to handle many small files; can overwhelm the file system with many writers.
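
A minimal file-per-process sketch (the file-name pattern and sizes are illustrative): each rank writes its own file with no coordination.

/* File-per-process sketch: no coordination, but one file per rank piles up fast. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[1024];
    for (int i = 0; i < 1024; i++) local[i] = rank;

    char fname[64];
    snprintf(fname, sizeof(fname), "out.%05d.dat", rank);  /* one file per rank */
    FILE *fp = fopen(fname, "wb");
    fwrite(local, sizeof(double), 1024, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}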

Flash Center I/O Nightmare. A large 32,000-processor run on the LLNL BG/L, before parallel I/O libraries were available, for an I/O-intensive application: checkpoint files of 0.7 TB dumped every 4 hours, 200 dumps; plotfiles of 20 GB each, 700 dumps; 1,400 particle files of 470 MB each. In total, 154 TB of disk capacity and 74 million files! Unix tool problems; it took 2 years to sift through the data and sew the files together.

Parallel I/O, single file (diagram: processors 0-5 write into one shared file). Each processor writes its own data to the same file using an MPI-IO mapping. Advantages: a single file; manageable data. Disadvantages: shared files may not perform as well as one-file-per-processor models.

MPI-IO

What is MPI-IO? A parallel I/O interface for MPI programs that allows users to write shared files with a simple interface. Key concepts: MPI communicators; file views, which define which parts of a file are visible to a given processor and can be shared by all processors, distinct, or partially overlapped; and collective I/O for optimizations.
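
A minimal sketch of a file view plus a collective write, assuming each rank owns one contiguous slice of a shared file (the file name and slice size are made up):

/* File-view sketch: each rank's view starts at its own slice of the shared file. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double local[1024];
    for (int i = 0; i < n; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The view shifts this rank's "start of file" to its own slice. */
    MPI_Offset disp = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Collective write: all ranks in the communicator participate. */
    MPI_File_write_all(fh, local, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}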

Independent and Collective I/O (diagram: P0-P5 accessing the file independently vs. collectively). Independent I/O operations specify only what a single process will do, which obscures the relationships between the I/O on different processes. Collective I/O is coordinated access to storage by a group of processes: the functions are called by all processes participating in I/O, which lets the I/O layers know more about the access as a whole, giving more opportunities for optimization in the lower software layers and better performance. Slide from Rob Ross and Rob Latham at ANL.
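
A compact sketch contrasting the two call styles on a shared file (the file name, sizes, and use_collective toggle are illustrative): the data ends up in the same place either way; only the coordination differs.

/* Independent vs. collective writes to one shared file. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1024 };
    double local[N];
    for (int i = 0; i < N; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "contrast.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);

    int use_collective = 1;   /* same value on all ranks; flip to 0 to try the independent path */
    if (use_collective) {
        /* All ranks call together, so the MPI-IO layer can merge and reorder
         * the requests (e.g., two-phase I/O) before touching the file system. */
        MPI_File_write_at_all(fh, off, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    } else {
        /* Each rank acts alone; the library sees isolated requests and has
         * little opportunity to optimize across processes. */
        MPI_File_write_at(fh, off, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}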

MPI-IO Optimizations: Collective Buffering. Consolidates the I/O requests from all procs; only a subset of procs (called aggregators) writes to the file. The key point is to limit the number of writers so that tasks are not competing for the same I/O block of data. Various algorithms exist for aligning data to block boundaries. Collective buffering is controlled by MPI-IO hints: romio_cb_read, romio_cb_write, cb_buffer_size, cb_nodes, cb_config_list.
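
A sketch of how these hints are typically passed through an MPI_Info object at file-open time (the hint values shown are illustrative; useful settings are system dependent):

/* Passing collective-buffering hints to MPI-IO via an MPI_Info object. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");    /* force collective buffering on writes */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB aggregation buffer per aggregator */
    MPI_Info_set(info, "cb_nodes", "8");                /* number of aggregator tasks */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes (e.g., MPI_File_write_at_all) go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}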

When to use collective buffering. The smaller the write, the more likely it is to benefit from collective buffering. Large contiguous I/O will not benefit from collective buffering: if the write size is larger than the I/O block, there will not be contention from multiple procs for that block. Non-contiguous writes of any size will not see a benefit from collective buffering.

But how does MPI-IO perform? (Chart: I/O test using the IOR benchmark on 576 cores on Hopper with the Lustre file system, plotted against transfer size.) A hard sell to users: MPI-IO performance depends on the file system, the MPI-IO layer, and the data access pattern.

MPI-IO Summary. MPI-IO sits in the middle of the I/O stack. It provides optimizations for typically low-performing I/O patterns (non-contiguous I/O and small-block I/O). Performance depends heavily on the vendor's implementation of MPI-IO. You could use MPI-IO directly, but it is better to use a high-level I/O library.

High Level Parallel I/O Libraries (HDF5)

Common Storage Formats. ASCII: slow, takes more space, inaccurate. Binary: non-portable (e.g., byte ordering and type sizes), not future proof; parallel I/O via MPI-IO. Self-describing formats (NetCDF/HDF4, HDF5, Parallel NetCDF): for example, the HDF5 API implements an object-DB model in a portable file; parallel I/O via pHDF5/pNetCDF (which hide MPI-IO). Community file formats (FITS, HDF-EOS, SAF, PDB, Plot3D): modern implementations are built on top of HDF, NetCDF, or another self-describing object-model API. Many NERSC users are at this level; we would like to encourage users to transition to a higher-level I/O library.

What is a High Level Parallel I/O Library? An API which helps to express scientific simulation data in a more natural way: multi-dimensional data, labels and tags, non-contiguous data, typed data. It sits on top of MPI-IO, offers simplicity for visualization and analysis, and provides a portable format, so you can take input/output files to a different machine. Examples: HDF5 (http://www.hdfgroup.org/hdf5/) and Parallel NetCDF (http://trac.mcs.anl.gov/projects/parallel-netcdf).

HDF5 Data Model. Groups: arranged in a directory hierarchy; the root group is always "/". Datasets: described by a dataspace and a datatype. Attributes: bind to groups and datasets. (Diagram: the root group "/" contains Dataset0 with attributes time=0.2345 and validity=None, Dataset1 with attributes author=Jane Doe and date=10/24/2006, and a subgroup subgrp containing Dataset0.1 and Dataset0.2, each with its own type and space.)
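
A small serial HDF5 sketch of this model (the file and object names are illustrative): it creates a dataset in the root group, binds a scalar attribute to it, and adds a subgroup.

/* Serial HDF5 sketch: groups, a dataset (dataspace + datatype), and an attribute. */
#include "hdf5.h"

int main(void) {
    hsize_t dims[2] = {8, 4};
    double  data[8][4] = {{0}};      /* placeholder values */

    hid_t file  = H5Fcreate("model.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Dataset0 in the root group: dataspace (shape) + datatype (double). */
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/Dataset0", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* A scalar attribute bound to the dataset ("time" = 0.2345). */
    hid_t ascal = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE, ascal,
                             H5P_DEFAULT, H5P_DEFAULT);
    double t = 0.2345;
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

    /* Groups give the directory-like hierarchy under "/". */
    hid_t grp = H5Gcreate2(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Gclose(grp); H5Aclose(attr); H5Sclose(ascal);
    H5Dclose(dset); H5Sclose(space); H5Fclose(file);
    return 0;
}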

Example 1: Writing a dataset by rows (diagram: tasks P0-P3 each write a contiguous block of rows of an NX x NY file). Each task must calculate an offset and count for the data it stores, e.g., count[0]=2, count[1]=4, offset[0]=2, offset[1]=0 for the second task.
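
A parallel HDF5 sketch of this example, assuming 4 MPI tasks, an 8x4 dataset as in the h5dump output below, and an HDF5 build with MPI-IO (parallel) support; each rank selects its two rows with a hyperslab and all ranks write collectively. The data values here are placeholders.

/* Parallel HDF5: row-decomposed collective write into one shared file. */
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hsize_t dims[2]   = {8, 4};                 /* NX x NY, whole dataset   */
    hsize_t count[2]  = {2, 4};                 /* 2 rows x 4 cols per rank */
    hsize_t offset[2] = {2 * (hsize_t)rank, 0}; /* rank 1 starts at row 2   */
    double  local[2][4];
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++) local[i][j] = rank;

    /* Open one shared file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("grid_rows.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "Dataset1", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank picks out its own rows of the file dataspace. */
    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

    /* Collective transfer, so HDF5 can hand the combined request to MPI-IO. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}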

Writing by Rows: Output of h5dump

HDF5 "grid_rows.h5" {
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         3.2, 4.3, 5.6, 7.9,
         5.2, 2.1, 4.5, 8.9,
         5.7, 7.0, 3.4, 2.3,
         2.1, 1.9, 5.4, 4.2,
         3.3, 6.4, 4.7, 3.5,
         3.5, 2.9, 2.3, 6.8,
         4.8, 2.4, 8.1, 4.0,
         7.3, 8.1, 4.8, 5.1
      }
   }
}
}

Example 2: Writing a dataset by chunks (diagram: tasks P0-P3 each write a 2-D chunk of an NX x NY file). Here each task again calculates an offset and count, e.g., count[0]=4, count[1]=2, offset[0]=4, offset[1]=0 for one of the tasks.

Writing by Chunks: Output of h5dump

HDF5 "write_chunks.h5" {
GROUP "/" {
   DATASET "Dataset1" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         3.2, 4.3, 5.6, 7.9,
         5.2, 2.1, 4.5, 8.9,
         5.7, 7.0, 3.4, 2.3,
         2.1, 1.9, 5.4, 4.2,
         3.3, 6.4, 4.7, 3.5,
         3.5, 2.9, 2.3, 6.8,
         4.8, 2.4, 8.1, 4.0,
         7.3, 8.1, 4.8, 5.1
      }
   }
}
}

Basic Functions:
H5Fcreate (H5Fopen) - create (open) a file
H5Screate_simple / H5Screate - create a dataspace
H5Dcreate (H5Dopen) - create (open) a dataset
H5Sselect_hyperslab - select subsections of the data
H5Dread, H5Dwrite - access the dataset
H5Dclose - close the dataset
H5Sclose - close the dataspace
H5Fclose - close the file
NOTE: the order is not strictly specified.

But what about performance? Hand-tuned I/O for a particular architecture will likely perform better, but the purpose of I/O libraries is portability, longevity, simplicity, and productivity. High-level libraries follow the performance of the layers beneath them; with hand-tuned I/O, when code is ported to a new machine the user is required to make changes to improve I/O performance. Let other people do the work: HDF5 can be optimized for given platforms and file systems by the library developers.

Solid State Storage. More reliable than hard disk drives: no moving parts, withstands shock and vibration. Non-volatile: retains data without power. Slower than DRAM memory, faster than hard disk drives. Relatively cheap.

New Technologies. In HPC we like to move data around: fast interconnects, parallel file systems, diskless nodes. How does this contrast with other organizations that have large data problems (Google, Yahoo, Facebook)? They use slower interconnects and local disk on the compute nodes, and move the computation to the data. Not all HPC problems can be set up in a MapReduce structure, but with more datasets coming from sensors, detectors, and telescopes, what can HPC learn from the Hadoop model?

Recommendations. Think about the big picture: the run-time vs. post-processing trade-off, how much I/O overhead you can afford, data analysis, portability, longevity, and storability. Write and read large blocks of I/O wherever possible.

Recent Cover Stories from NERSC Research (montage of journal covers: H. Guo, S. Salahuddin, L. Sugiyama, T. Head-Gordon, Pennycook, V. Daggett, W. Dorland, A. Fedorov, E. Bylaska, U. Landman, P. Liu; all 2010). NERSC is enabling new high-quality science across disciplines, with over 1,600 refereed publications last year.