
Lecture 33: More on MPI I/O William Gropp www.cs.illinois.edu/~wgropp

Today's Topics
- High-level parallel I/O libraries
- Options for efficient I/O
- Example of I/O for a distributed array
- Understanding why collective I/O offers better performance
- Optimizing parallel I/O performance

Portable File Formats
- Ad-hoc file formats
  - Difficult to collaborate
  - Cannot leverage post-processing tools
- MPI provides the external32 data encoding
  - No one uses this
- High-level I/O libraries
  - netCDF and HDF5
  - Better solutions than external32
  - Define a container for data
    - Describes contents
    - May be queried (self-describing)
  - Standard format for metadata about the file
  - Wide range of post-processing tools available

Higher-Level I/O Libraries
- Scientific applications work with structured data and desire more self-describing file formats
- netCDF and HDF5 are two popular higher-level I/O libraries
  - Abstract away details of file layout
  - Provide standard, portable file formats
  - Include metadata describing contents
- For parallel machines, these should be built on top of MPI-IO
- HDF5 has an MPI-IO option (see the sketch below): http://www.hdfgroup.org/hdf5/
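As an illustration of HDF5's MPI-IO option, here is a minimal, hedged sketch (not part of the original slides) that opens a file through the MPI-IO file driver and writes one dataset row per process with a collective transfer. It assumes an HDF5 build with parallel support; the file name and dataset name are invented for the example.

/* Hypothetical sketch: parallel HDF5 over MPI-IO (assumes an HDF5 build with parallel support) */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Dataset: one row of 16 doubles per process */
    hsize_t dims[2] = { (hsize_t)nprocs, 16 };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each process selects its own row in the file */
    hsize_t start[2] = { (hsize_t)rank, 0 }, count[2] = { 1, 16 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* Collective write, analogous to MPI_File_write_all */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    double buf[16];
    for (int i = 0; i < 16; i++) buf[i] = rank + 0.01 * i;
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}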

netCDF Data Model
netCDF provides a means for storing multiple, multi-dimensional arrays in a single file, along with information about the arrays.

Parallel netcdf (PnetCDF) (Serial) netcdf API for accessing multidimensional data sets Portable file format Popular in both fusion and climate communities Parallel netcdf Very similar API to netcdf Tuned for better performance in today s computing environments Retains the file format so netcdf and PnetCDF applications can share files PnetCDF builds on top of any Cluster PnetCDF ROMIO PVFS2 IBM BG PnetCDF IBM MPI GPFS MPI-IO implementation 6 6

I/O in netCDF and PnetCDF
- (Serial) netCDF
  - Parallel read: all processes read the file independently; no possibility of collective optimizations
  - Sequential write: parallel writes are carried out by shipping data to a single process (just like our stdout checkpoint code)
- PnetCDF
  - Parallel read/write to a shared netCDF file
  - Built on top of MPI-IO, which utilizes the optimal I/O facilities of the parallel file system and MPI-IO implementation
  - Allows MPI-IO hints and datatypes for further optimization
[Figure: with serial netCDF, P0-P3 funnel their data through a single process to the parallel file system; with Parallel netCDF, P0-P3 access the parallel file system directly]

Optimizations
- Given complete access information, an implementation can perform optimizations such as:
  - Data sieving: read large chunks and extract what is really needed
  - Collective I/O: merge requests of different processes into larger requests
  - Improved prefetching and caching

Bandwidth Results
- 3D distributed array access written as levels 0, 2, 3
- Five different machines (dated but still relevant):
  - NCSA TeraGrid IA-64 cluster with GPFS and MPICH2
  - ASC Purple at LLNL with GPFS and IBM's MPI
  - Jazz cluster at Argonne with PVFS and MPICH2
  - Cray XT3 at ORNL with Lustre and Cray's MPI
  - SDSC DataStar with GPFS and IBM's MPI
- Since these are all different machines with different amounts of I/O hardware, we compare the performance of the different levels of access on a particular machine, not across machines

Distributed Array Access
[Charts: read bandwidth and write bandwidth; array size 512 x 512 x 512; note the log scaling!]
Thanks to Weikuan Yu, Wei-keng Liao, Bill Loewe, and Anthony Chan for these results.

Detailed Example: Distributed Array Access
- Global array size: m x n
- Array is distributed on a 2D grid of processes: nproc(1) x nproc(2)
- Coordinates of a process in the grid: icoord(1), icoord(2) (0-based indices)
- Local array stored in the file in a layout corresponding to the global array in column-major (Fortran) order
- Two cases:
  - local array is stored contiguously in memory
  - local array has a ghost area around it in memory

[Process-grid figure: the m x n global array is split over a 2 x 3 process grid (nproc(1) = 2, nproc(2) = 3); top row: P0 icoord = (0,0), P1 icoord = (0,1), P2 icoord = (0,2); bottom row: P3 icoord = (1,0), P4 icoord = (1,1), P5 icoord = (1,2)]

CASE I: LOCAL ARRAY CONTIGUOUS IN MEMORY

iarray_of_sizes(1)    = m
iarray_of_sizes(2)    = n
iarray_of_subsizes(1) = m/nproc(1)
iarray_of_subsizes(2) = n/nproc(2)
iarray_of_starts(1)   = icoord(1) * iarray_of_subsizes(1)   ! 0-based, even in Fortran
iarray_of_starts(2)   = icoord(2) * iarray_of_subsizes(2)   ! 0-based, even in Fortran
ndims      = 2
zerooffset = 0

call MPI_TYPE_CREATE_SUBARRAY(ndims, iarray_of_sizes, iarray_of_subsizes, &
     iarray_of_starts, MPI_ORDER_FORTRAN, MPI_REAL, ifiletype, ierr)
call MPI_TYPE_COMMIT(ifiletype, ierr)

call MPI_FILE_OPEN(MPI_COMM_WORLD, '/home/me/test', &
     MPI_MODE_CREATE + MPI_MODE_RDWR, MPI_INFO_NULL, ifh, ierr)
call MPI_FILE_SET_VIEW(ifh, zerooffset, MPI_REAL, ifiletype, 'native', &
     MPI_INFO_NULL, ierr)
ilocal_size = iarray_of_subsizes(1) * iarray_of_subsizes(2)
call MPI_FILE_WRITE_ALL(ifh, local_array, ilocal_size, MPI_REAL, istatus, ierr)
call MPI_FILE_CLOSE(ifh, ierr)
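For readers working in C, here is a hedged, roughly equivalent sketch of Case I (not part of the original slides). It writes a block-distributed 2D array of floats using a subarray filetype and a collective write; it assumes a row-major (C-order) global array of size m x n on an nproc1 x nproc2 process grid, and the file name is invented for the example.

/* Hypothetical C analogue of Case I: contiguous local array, subarray file view */
#include <mpi.h>

void write_distributed_array(const float *local_array, int m, int n,
                             int nproc1, int nproc2, int icoord1, int icoord2)
{
    int sizes[2]    = { m, n };
    int subsizes[2] = { m / nproc1, n / nproc2 };
    int starts[2]   = { icoord1 * subsizes[0], icoord2 * subsizes[1] };

    /* Filetype selects this process's block of the global array in the file */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/home/me/test",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

    /* Collective write of the contiguous local block */
    MPI_File_write_all(fh, local_array, subsizes[0] * subsizes[1],
                       MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}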

CASE II: LOCAL ARRAY HAS GHOST AREA OF SIZE 2 ON EACH SIDE

[Figure: local array surrounded by a ghost area of width 2]

! Create derived datatype describing layout of local array in memory
iarray_of_sizes(1)    = m/nproc(1) + 4
iarray_of_sizes(2)    = n/nproc(2) + 4
iarray_of_subsizes(1) = m/nproc(1)
iarray_of_subsizes(2) = n/nproc(2)
iarray_of_starts(1)   = 2   ! 0-based, even in Fortran
iarray_of_starts(2)   = 2   ! 0-based, even in Fortran
ndims = 2

call MPI_TYPE_CREATE_SUBARRAY(ndims, iarray_of_sizes, iarray_of_subsizes, &
     iarray_of_starts, MPI_ORDER_FORTRAN, MPI_REAL, imemtype, ierr)
call MPI_TYPE_COMMIT(imemtype, ierr)

Continued...

CASE II continued

! Create derived datatype to describe layout of local array in the file
iarray_of_sizes(1)    = m
iarray_of_sizes(2)    = n
iarray_of_subsizes(1) = m/nproc(1)
iarray_of_subsizes(2) = n/nproc(2)
iarray_of_starts(1)   = icoord(1) * iarray_of_subsizes(1)   ! 0-based, even in Fortran
iarray_of_starts(2)   = icoord(2) * iarray_of_subsizes(2)   ! 0-based, even in Fortran
ndims = 2
zerooffset = 0

call MPI_TYPE_CREATE_SUBARRAY(ndims, iarray_of_sizes, iarray_of_subsizes, &
     iarray_of_starts, MPI_ORDER_FORTRAN, MPI_REAL, ifiletype, ierr)
call MPI_TYPE_COMMIT(ifiletype, ierr)

! Open the file, set the view, and write
call MPI_FILE_OPEN(MPI_COMM_WORLD, '/home/thakur/test', &
     MPI_MODE_CREATE + MPI_MODE_RDWR, MPI_INFO_NULL, ifh, ierr)
call MPI_FILE_SET_VIEW(ifh, zerooffset, MPI_REAL, ifiletype, 'native', &
     MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifh, local_array, 1, imemtype, istatus, ierr)
call MPI_FILE_CLOSE(ifh, ierr)

Comments on MPI I/O
- Understand why collective I/O is better, and how it enables an efficient implementation of I/O
- More on tuning MPI-IO performance
- Common errors in using MPI-IO

Collective I/O
- The next slide shows the trace for the collective I/O case
- Note that the entire program runs for a little more than 1 second
- Each process does its entire I/O with a single write or read operation
- Data is exchanged with other processes so that everyone gets what they need
- Very efficient!

Collective I/O
[Trace image: collective I/O case]

Data Sieving
- The next slide shows the trace for the data sieving case
- Note that the program runs for about 5 seconds now
- Since the default data sieving buffer size happens to be large enough, each process can read with a single read operation, although more data is read than actually needed (because of holes)
- Since PVFS doesn't support file locking, data sieving cannot be used for writes, resulting in many small writes (1 KB per process)

Data Sieving
[Trace image: data sieving case]

POSIX I/O
- The next slide shows the trace for POSIX I/O
- Lots of small reads and writes (1 KB each per process)
- The reads take much longer than the writes in this case because of a TCP incast problem in the switch
- The total program takes about 80 seconds
- Very inefficient!

POSIX I/O
[Trace image: POSIX I/O case]

Passing Hints
- MPI defines MPI_Info
- Provides an extensible list of key=value pairs
- Used to package variable, optional types of arguments that may not be standard
- Used in I/O, dynamic process management, and RMA, as well as with communicators

Passing Hints to MPI-IO

MPI_Info info;
MPI_Info_create(&info);

/* no. of I/O devices to be used for file striping */
MPI_Info_set(info, "striping_factor", "4");

/* the striping unit in bytes */
MPI_Info_set(info, "striping_unit", "65536");

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

MPI_Info_free(&info);

Passing Hints to MPI-IO (Fortran)

integer info

call MPI_Info_create(info, ierr)

! no. of I/O devices to be used for file striping
call MPI_Info_set(info, "striping_factor", "4", ierr)

! the striping unit in bytes
call MPI_Info_set(info, "striping_unit", "65536", ierr)

call MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile", &
     MPI_MODE_CREATE + MPI_MODE_RDWR, info, &
     fh, ierr)

call MPI_Info_free(info, ierr)

Examples of Hints (used in ROMIO)
- MPI predefined hints: striping_unit, striping_factor, cb_buffer_size, cb_nodes
- New algorithm parameters: ind_rd_buffer_size, ind_wr_buffer_size
- Platform-specific hints: start_iodevice, pfs_svr_buf, direct_read, direct_write

Common MPI I/O Hints
- Controlling parallel file system striping
  - striping_factor: number of I/O devices (servers) to stripe across
  - striping_unit: size of each stripe on an I/O server, in bytes
  - start_iodevice: which I/O server to start with
- Controlling aggregation
  - cb_config_list: list of aggregators
  - cb_nodes: number of aggregators (upper bound)
- Tuning ROMIO (the most common MPI-IO implementation) optimizations (see the sketch below)
  - romio_cb_read, romio_cb_write: aggregation (collective buffering) on/off
  - romio_ds_read, romio_ds_write: data sieving on/off
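As a hedged illustration of toggling these ROMIO optimizations (not from the original slides), the sketch below enables collective buffering and disables data sieving for writes before opening a file. The specific values and the function name are assumptions for the example, and whether such settings help depends on the access pattern and file system.

/* Hypothetical sketch: controlling ROMIO collective buffering and data sieving via hints */
#include <mpi.h>

MPI_File open_with_romio_hints(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* Force collective buffering (aggregation) on for collective writes */
    MPI_Info_set(info, "romio_cb_write", "enable");
    /* Turn data sieving off for writes (e.g., if the file system lacks locking) */
    MPI_Info_set(info, "romio_ds_write", "disable");
    /* Cap the number of aggregator processes */
    MPI_Info_set(info, "cb_nodes", "8");

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
    MPI_Info_free(&info);
    return fh;  /* hints are advisory; the implementation may ignore them */
}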

Aggregation Example
- Cluster of SMPs
- One SMP box has a fast connection to the disks
- Data is aggregated to processes on the single box
- Processes on that box perform I/O on behalf of the others

Ensuring Parallel I/O Is Parallel
- On Blue Waters, by default, files are not striped
  - That is, there is no parallelism in distributing the file across multiple disks
- You can set the striping factor on a directory; it is inherited by all files created in that directory:
  lfs setstripe -c 4 <directory-name>
- You can set the striping factor on a file when it is created with MPI_File_open, using the hints, e.g.:
  MPI_Info_set(info, "striping_factor", "4");
  MPI_Info_set(info, "cb_nodes", "4");

Finding Hints on Blue Waters
- Setting the environment variable MPICH_MPIIO_HINTS_DISPLAY=1 causes the program to print out the available I/O hints and their values.

Example of Hints Display

PE 0: MPICH/MPIIO environment settings:
PE 0:   MPICH_MPIIO_HINTS_DISPLAY     = 1
PE 0:   MPICH_MPIIO_HINTS             = NULL
PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR = disable
PE 0:   MPICH_MPIIO_CB_ALIGN          = 2
PE 0: MPIIO hints for ioperf.out.tfargq:
        cb_buffer_size              = 16777216
        romio_cb_read               = automatic
        romio_cb_write              = automatic
        cb_nodes                    = 1
        cb_align                    = 2
        romio_no_indep_rw           = false
        romio_cb_pfr                = disable
        romio_cb_fr_types           = aar
        romio_cb_fr_alignment       = 1
        romio_cb_ds_threshold       = 0
        romio_cb_alltoall           = automatic
        ind_rd_buffer_size          = 4194304
        ind_wr_buffer_size          = 524288
        romio_ds_read               = disable
        romio_ds_write              = disable
        striping_factor             = 1
        striping_unit               = 1048576
        romio_lustre_start_iodevice = 0
        direct_io                   = false
        aggregator_placement_stride = -1
        abort_on_rw_error           = disable
        cb_config_list              = *:*
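In addition to the environment-variable display above, the hints actually in effect can be queried from inside the program. The following is a hedged sketch (not from the original slides) using only standard MPI-2 info calls; the function name is invented for the example.

/* Hypothetical sketch: print the hints actually applied to an open MPI file */
#include <mpi.h>
#include <stdio.h>

void print_file_hints(MPI_File fh)
{
    MPI_Info info;
    MPI_File_get_info(fh, &info);   /* returns the hints in use for this file */

    int nkeys;
    MPI_Info_get_nkeys(info, &nkeys);
    for (int i = 0; i < nkeys; i++) {
        char key[MPI_MAX_INFO_KEY + 1], value[MPI_MAX_INFO_VAL + 1];
        int flag;
        MPI_Info_get_nthkey(info, i, key);
        MPI_Info_get(info, key, MPI_MAX_INFO_VAL, value, &flag);
        if (flag)
            printf("%s = %s\n", key, value);
    }
    MPI_Info_free(&info);
}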

Summary of I/O Tuning
- MPI I/O has many features that can help users achieve high performance
- The most important of these features are the ability to specify noncontiguous accesses, the collective I/O functions, and the ability to pass hints to the implementation
- Users must use the above features!
  - In particular, when accesses are noncontiguous, users must create derived datatypes, define file views, and use the collective I/O functions

Common Errors in Using MPI-IO
- Not defining file offsets as MPI_Offset in C and integer (kind=MPI_OFFSET_KIND) in Fortran (or perhaps integer*8 in Fortran 77); see the sketch below
- In Fortran, passing the offset or displacement directly as a constant (e.g., 0) in the absence of function prototypes (F90 mpi module)
- Using the darray datatype for a block distribution other than the one defined in darray (e.g., floor division)
- Defining a filetype with offsets that are not monotonically nondecreasing, e.g., 0, 3, 8, 4, 6 (can occur in irregular applications)
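To illustrate the first error above, here is a hedged sketch (not from the original slides) of declaring offsets correctly in C; the function name and sizes are invented for the example. A plain int byte offset overflows for files larger than 2 GB, whereas MPI_Offset is a 64-bit-capable type.

/* Hypothetical sketch: file offsets must be MPI_Offset, not int */
#include <mpi.h>

void write_record(MPI_File fh, int rank, const double *buf, int count)
{
    /* CORRECT: MPI_Offset can hold byte offsets beyond 2 GB */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);

    /* WRONG for large files: int offset = rank * count * sizeof(double); */

    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
}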