Welcome! Virtual tutorial starts at 15:00 BST

Parallel IO and the ARCHER Filesystem. ARCHER Virtual Tutorial, Wed 8th Oct 2014. David Henty <d.henty@epcc.ed.ac.uk>

Reusing this material This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

Overview
- Why parallel IO is difficult
- The Lustre file system
- Standard parallel IO strategies
- MPI-IO
- Tuning Lustre

Why is Parallel IO Difficult?
- Difficult in principle: combining distributed data into a single location; data access patterns are surprisingly complicated
- Difficult in practice: individual disk IO speeds are not very fast; file systems are complicated; parallel file systems are even more complicated
- IO performance is achieved by using multiple disks at once

Programmer View vs Machine View (figure: each of the four processes sees its own local array indexed 1-4, while the machine/file sees a single global 4x4 array of 16 elements)

4x4 Array on a 2x2 Process Grid (figure: the parallel data distribution across the 2x2 process grid versus the ordering of the same elements in the file)

ARCHER's Lustre: Cray Sonexion Storage
- MMU (Metadata Management Unit): the Lustre Metadata Server; contains the server hardware and its storage
- SSU (Scalable Storage Unit): 2 x OSSs and 8 x OSTs (Object Storage Targets); contains the storage controller, Lustre server, disk controller and RAID engine
- Each unit is 2 OSSs, each with 4 OSTs of 10 (8+2) disks in a RAID6 array
- Multiple SSUs are combined to form storage racks

ARCHER's File Systems
Connected to the Cray XC30 via LNET router service nodes over an Infiniband network.
  /fs2: 6 SSUs, 12 OSSs, 48 OSTs, 480 HDDs, 4 TB per HDD, 1.4 PB total
  /fs3: 6 SSUs, 12 OSSs, 48 OSTs, 480 HDDs, 4 TB per HDD, 1.4 PB total
  /fs4: 7 SSUs, 14 OSSs, 56 OSTs, 560 HDDs, 4 TB per HDD, 1.6 PB total

Lustre Data Striping
- Lustre's performance comes from striping files over multiple OSTs
- A single logical user file, e.g. /work/y02/y02/ted, is automatically divided into stripes by the OS/file system
- Stripes are then read from / written to their assigned OST

Opening a File
- The client sends a request to the Metadata Server (MDS) to open the file and acquire information about it: name, permissions, attributes, location
- The MDS then passes back a list of OSTs: for an existing file, these hold the data stripes; for a new file, they are typically a randomly assigned list of OSTs where the data is to be stored
- Once a file has been opened, no further communication is required between the client and the MDS
- All transfer is directly between the assigned OSTs (via their Object Storage Servers) and the client

Summary
- Lustre achieves high bandwidth via multiple disks
- A single file can be striped across multiple disks, allowing simultaneous IO from multiple Object Storage Targets: think of each OST as a separate IO path to disk
- Optimised for large transactions: "please write 100 MB to disk"
- The Metadata Server can be a bottleneck: opening and closing files is serialised and can be slow; not optimised for large numbers of small files

I/O strategies: Spokesperson (master/serial IO)
- One process performs all I/O; data is aggregated to (or duplicated on) that process
- Limited by the single I/O process, but easy to program
- The pattern does not scale: time increases linearly with the amount of data and increases with the number of processes
- Care has to be taken when doing this all-to-one kind of communication at scale
- Can be used for a dedicated I/O server
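A minimal sketch of the spokesperson pattern, not taken from the tutorial itself: the buffer size and file name are assumptions. Every rank gathers its local data to rank 0, which writes the combined buffer with ordinary serial IO.

```c
/* Spokesperson (master) IO sketch: gather everything to rank 0, which writes it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1024                      /* doubles per process (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) local[i] = rank;   /* dummy data */

    double *global = NULL;
    if (rank == 0) global = malloc((size_t)size * NLOCAL * sizeof(double));

    /* The all-to-one communication the slide warns about */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               global, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("master_io.dat", "wb");        /* hypothetical file name */
        fwrite(global, sizeof(double), (size_t)size * NLOCAL, fp);
        fclose(fp);
        free(global);
    }

    MPI_Finalize();
    return 0;
}
```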

I/O strategies: Multiple Writers, Multiple Files
- All processes perform I/O to individual files
- Limited by the file system, but easy to program
- The pattern may not scale at large process counts: the number of files creates a bottleneck in metadata operations, and the number of simultaneous disk accesses creates contention for file system resources
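A sketch of the file-per-process pattern, with an assumed naming scheme and buffer size rather than anything prescribed by the tutorial:

```c
/* File-per-process sketch: every rank writes its own file, named by rank. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024                      /* doubles per process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local[NLOCAL];
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) local[i] = rank;

    snprintf(fname, sizeof(fname), "data%04d.dat", rank);  /* hypothetical name pattern */

    FILE *fp = fopen(fname, "wb");
    fwrite(local, sizeof(double), NLOCAL, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}
```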

2x2 to 1x4 Redistribution (figure: data from a 2x2 decomposition is written to data1.dat-data4.dat, reordered, and then read back into a 1x4 decomposition as newdata1.dat-newdata4.dat)

I/O strategies: Multiple Writers, Single File
- Each process performs I/O to a single file which is shared
- Performance: the data layout within the shared file is very important; at large process counts contention can build for file system resources
- Not all programming languages support it: C/C++ can work with fseek, but there is no real Fortran standard
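A sketch of the shared-file pattern using plain C fseek, as the slide mentions. File name, buffer size and the contiguous rank-ordered layout are assumptions; rank 0 creates the file first so that the later update-mode opens do not race on truncation.

```c
/* Multiple writers, single shared file, using C stdio and fseek. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024                      /* doubles per process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local[NLOCAL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NLOCAL; i++) local[i] = rank;

    if (rank == 0) {                     /* create/truncate the file exactly once */
        FILE *fp = fopen("shared.dat", "wb");
        fclose(fp);
    }
    MPI_Barrier(MPI_COMM_WORLD);         /* everyone waits until the file exists */

    FILE *fp = fopen("shared.dat", "rb+");
    /* seek to this rank's non-overlapping slot in the shared file */
    fseek(fp, (long)rank * NLOCAL * sizeof(double), SEEK_SET);
    fwrite(local, sizeof(double), NLOCAL, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}
```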

I/O strategies: Collective IO to single or multiple files
- Aggregation to one process per group, which then handles that group's data: serialises I/O within the group; each I/O process may access an independent file, which limits the number of files accessed
- A group of processes performs parallel I/O to a shared file: increasing the number of shared files increases file system usage, while decreasing the number of processes accessing each shared file decreases file system contention

Summary
- We need subgroups of IO processes doing IO simultaneously: too many and there is contention for file system resources; too few and we do not use all the OSTs
- Far too complicated to do this ourselves: we need a library to help us out
- For example, MPI-IO: part of the MPI standard since MPI-2.0

MPI-IO Approach
- Each process/rank tells MPI-IO what portion(s) of the file it wants to read/write, using MPI derived datatypes: this is called the file view
- Tell MPI-IO what data to write and it automatically goes to the position(s) selected by the file view; all the communication, buffering and aggregation is handled by MPI-IO
- Allows for collective IO: MPI-IO has a global view, so it can aggregate data into a small number of large IO transactions
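To make the file-view idea concrete, here is a hedged sketch (not taken from the tutorial) of a 2x2 decomposition of an 8x8 array: each rank describes its block of the global array with an MPI subarray datatype, sets that as its file view, and then a single collective write places everything correctly in one shared file. The array sizes, file name and the assumption of exactly 4 ranks are illustrative only.

```c
/* MPI-IO file view sketch: subarray datatype + collective write. */
#include <mpi.h>
#include <stdio.h>

#define NXGLOBAL 8
#define NYGLOBAL 8

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 4) MPI_Abort(MPI_COMM_WORLD, 1);   /* sketch assumes a 2x2 grid */

    int gsizes[2] = {NXGLOBAL, NYGLOBAL};
    int lsizes[2] = {NXGLOBAL/2, NYGLOBAL/2};
    int starts[2] = {(rank/2) * lsizes[0], (rank%2) * lsizes[1]};

    double local[NXGLOBAL/2][NYGLOBAL/2];
    for (int i = 0; i < lsizes[0]; i++)
        for (int j = 0; j < lsizes[1]; j++)
            local[i][j] = rank;

    /* Derived datatype describing where this rank's block lives in the file */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "view.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: MPI-IO aggregates the four blocks into large transactions */
    MPI_File_write_all(fh, local, lsizes[0]*lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```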

Combining File Views (figure: the file views of ranks 0-3, laid out on a 2x2 grid with coordinates (0,0) to (1,1), combine to cover the whole file)

Collective IO
- Combine ranks 0 and 1 for a single contiguous read/write to the file
- Combine ranks 2 and 3 for a single contiguous read/write to the file

MPI-IO on ARCHER
- Optimised for the Lustre file system
- Scales the number of IO processes appropriately to the number of OSTs and the total number of processes
- But... the striping of a file (the number of OSTs) is set by the user, so it is important to get this right

Lustre Striping
- It is essential to stripe large files across multiple disks, but striping small files across many disks is bad
- Default striping on ARCHER is across 4 OSTs
- You can set this yourself using: lfs setstripe -c <nstripe> <directory>
- To use all the OSTs: nstripe = -1; to enquire: lfs getstripe <directory>
- Test case: a large 3D dataset across a 3D process grid, with IO done using MPI-IO
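The lfs setstripe command above is the route the tutorial recommends. As a hedged aside, MPI-IO implementations can also be passed the reserved "striping_factor" and "striping_unit" info hints when a new file is created; whether they are honoured depends on the implementation (ROMIO-based libraries with a Lustre driver generally do), so setting the striping on the directory remains the sure way. The values below are illustrative only.

```c
/* Sketch: requesting Lustre striping via MPI_Info hints at file creation. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");     /* number of OSTs (assumed value) */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MiB stripe size (assumed value) */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... set a file view and do collective writes as in the earlier sketch ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);

    MPI_Finalize();
    return 0;
}
```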

128^3 per process: 16 MiB to 64 GiB

Summary
- Master IO is unaffected by striping: same bandwidth as parallel IO with no striping, around 400 MiB/s independent of process count
- With parallel IO and striping, bandwidth scales with process count (until all OSTs are used): around 2 GiB/s for default striping (4 OSTs), tens of GiB/s for full striping (all OSTs, nstripe = -1)

Collective vs Independent IO
Replace MPI_File_write_all with MPI_File_write: identical functionality, very different performance. Results with full striping:
  Processes   Independent   Collective
  1           49.5 MiB/s    441 MiB/s
  8           5.9 MiB/s     404 MiB/s
  64          2.4 MiB/s     1630 MiB/s

Conclusions
Good IO requires three things to be true:
- A sensible number of IO processes: not a single process, not all processes; MPI-IO does this for you
- The file striped across multiple disks: in Lustre, use multiple OSTs via lfs setstripe
- Collective IO: the IO processes aggregate data into a small number of large IO operations

How Good is My IO?
- Essential to quantify it in terms of GiB/s: look at the size of your files and time the IO operations
- Hundreds of MiB/s: bad; tens of GiB/s: good
- Performance tools may help here, e.g. the Cray performance tools can report IO rates
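One way to get such a number, sketched under assumed sizes and file name: time a collective write with MPI_Wtime and divide the total bytes by the elapsed time. Swapping the collective call for its independent counterpart (MPI_File_write_at) reproduces the kind of comparison shown in the table above.

```c
/* Timing sketch: measure achieved write bandwidth in GiB/s. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL (16*1024*1024)            /* 16 Mi doubles = 128 MiB per rank (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *local = malloc(NLOCAL * sizeof(double));
    for (long i = 0; i < NLOCAL; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bandwidth.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* simple contiguous layout: each rank writes at its own offset */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);

    MPI_Barrier(MPI_COMM_WORLD);         /* start everyone together */
    double t0 = MPI_Wtime();

    MPI_File_write_at_all(fh, offset, local, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);                 /* include the close so data is flushed */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double gib = (double)size * NLOCAL * sizeof(double) / (1024.0*1024.0*1024.0);
        printf("Wrote %.2f GiB in %.2f s: %.2f GiB/s\n", gib, t1 - t0, gib/(t1 - t0));
    }

    free(local);
    MPI_Finalize();
    return 0;
}
```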

Other Libraries
What if I use NetCDF / HDF5 / ...?
- Use a version that layers on top of MPI-IO
- Ensure that it is doing collective IO
- Set the Lustre striping for the files
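A hedged sketch of what "layering on top of MPI-IO" looks like for HDF5: the file is opened through the MPI-IO virtual file driver and collective transfers are requested for the dataset write. It needs an HDF5 build with parallel support; the dataset name, sizes and file name are assumptions, and the Lustre striping of the output directory would still be set with lfs setstripe as above.

```c
/* Parallel HDF5 sketch: MPI-IO driver + collective dataset write. */
#include <mpi.h>
#include <hdf5.h>

#define NLOCAL 1024                      /* doubles per process (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) local[i] = rank;

    /* open the file through the MPI-IO virtual file driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one global 1D dataset, each rank owning a contiguous slab */
    hsize_t gdims[1] = {(hsize_t)size * NLOCAL};
    hsize_t ldims[1] = {NLOCAL};
    hsize_t start[1] = {(hsize_t)rank * NLOCAL};

    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t memspace  = H5Screate_simple(1, ldims, NULL);
    hid_t dset = H5Dcreate(file, "field", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, ldims, NULL);

    /* ask HDF5 to do collective IO underneath */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
    H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```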

Help with IO
- Contact the CSE support team! Email: support@archer.ac.uk
- We are working on a white paper on parallel IO and plan to have a first version available in November

Goodbye! Thanks for attending