Lustre Parallel Filesystem Best Practices


Lustre Parallel Filesystem Best Practices George Markomanolis Computational Scientist KAUST Supercomputing Laboratory georgios.markomanolis@kaust.edu.sa 7 November 2017

Outline Introduction to Parallel I/O Understanding the I/O performance on Lustre Accelerating the performance

I/O Performance There is no magic solution: I/O performance depends on the access pattern. Of course, a bottleneck can occur in any part of an application. Increasing computation and decreasing I/O is a good solution, but not always possible.

Lustre best practices I Do not use "ls -l". The ownership and permission metadata are stored on the MDTs, but the file size metadata has to be gathered from the OSTs; use plain "ls" instead. Avoid having a large number of files in a single directory; split them across multiple subdirectories to minimize contention.

Lustre best practices II Avoid accessing a large number of small files on Lustre.
On Lustre, create a small 64 KB file:
dd if=/dev/urandom of=my_big_file count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.014081 s, 4.7 MB/s
On the /tmp/ of the node:
/tmp> dd if=/dev/urandom of=my_big_file count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.00678089 s, 9.7 MB/s

Various I/O modes

Lustre The Lustre file system is made up of: a set of I/O servers called Object Storage Servers (OSSs); disks called Object Storage Targets (OSTs), which store the file data (chunks of files), and we have 144 OSTs on Shaheen. The file metadata is controlled by a Metadata Server (MDS) and stored on a Metadata Target (MDT).

Lustre Operation

Lustre best practices III Use stripe count 1 for directories with many small files (the default on our system):
mkdir experiments
lfs setstripe -c 1 experiments
cd experiments
tar -zxvf code.tgz
Copy larger files to another directory with higher striping:
mkdir large_files
lfs setstripe -c 16 large_files
cp file large_files/

Lustre best practices IV Increase the stripe count for parallel access, especially on large files. The stripe count should be a factor of the number of processes performing the parallel I/O. To identify a minimum stripe count, use the square root of the file size in GB: if the file is 90 GB, the square root is about 9.5, so use at least 9 OSTs:
lfs setstripe -c 9 file
If you use 64 MPI processes for parallel I/O, the number of OSTs used should not exceed 64; you could try 8, 16, 32, or 64, depending on the file size.

Important factors: striping, aligned data. Software/techniques to optimize I/O:
IOBUF for serial I/O:
module load iobuf
Compile and link your application
export IOBUF_PARAMS='*.out:verbose:count=12:size=32m'
Huge pages for MPI applications:
module load craype-hugepages4m
Compile and link. Multiple huge page sizes are available; huge pages increase the maximum size of data and text in a program accessible by the high speed network.
Use parallel I/O libraries such as PHDF5, PNetCDF, ADIOS. But how parallel is your I/O?
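
As a hedged illustration of a parallel I/O library in use (a minimal PHDF5 sketch, not taken from the slides; the output file name is made up), every rank opens the same shared HDF5 file collectively through the MPI-IO driver:

program phdf5_minimal
  use mpi
  use hdf5
  implicit none
  integer :: mpierr, hdferr
  integer(HID_T) :: plist_id, file_id

  call MPI_Init(mpierr)
  call h5open_f(hdferr)

  ! File-access property list that routes HDF5 I/O through MPI-IO
  call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, hdferr)
  call h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)

  ! All ranks create/open the same shared file collectively
  call h5fcreate_f("output.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=plist_id)

  ! ... define datasets and write them with collective transfers ...

  call h5fclose_f(file_id, hdferr)
  call h5pclose_f(plist_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program phdf5_minimal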

Collective Buffering MPI I/O aggregators: during a collective write, the data is first gathered onto the aggregator nodes through MPI, and then these nodes write it to the I/O servers. Example: 8 MPI processes, 2 MPI I/O aggregators.
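
A minimal sketch of a collective write (not the WRF benchmark code; the file name and per-rank buffer size below are made up): each rank writes its own contiguous block with MPI_File_write_at_all, and the MPI-IO layer funnels the data through the aggregators.

program collective_write
  use mpi
  implicit none
  integer, parameter :: n = 1024              ! integers per rank (arbitrary)
  integer :: ierr, rank, fh
  integer :: buf(n)
  integer :: wstatus(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: offset

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = rank

  call MPI_File_open(MPI_COMM_WORLD, "shared_file.dat", &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

  ! Collective call: only the aggregator ranks touch the OSTs,
  ! the other ranks ship their buffers to the aggregators over MPI.
  offset = int(rank, MPI_OFFSET_KIND) * n * 4   ! byte offset of this rank's block
  call MPI_File_write_at_all(fh, offset, buf, n, MPI_INTEGER, wstatus, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program collective_write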

How many MPI processes are writing a shared file? With Cray MPICH, we execute an application with 1216 MPI processes; it performs parallel I/O with Parallel NetCDF and the file's size is 361 GB.
First case (no striping):
mkdir execution_folder
copy the necessary files into the folder
cd execution_folder
run the application
Timing for Writing restart for domain 1: 674.26 elapsed seconds
Answer: 1 MPI process

How many MPI processes are writing a shared file? With Cray MPICH, we execute the same application with 1216 MPI processes; it performs parallel I/O with Parallel NetCDF and the file's size is 361 GB.
Second case:
mkdir execution_folder
lfs setstripe -c 144 execution_folder
copy the necessary files into the folder
cd execution_folder
run the application
Timing for Writing restart for domain 1: 10.35 elapsed seconds
Answer: 144 MPI processes

I/O performance on Lustre while increasing OSTs: WRF 361 GB restart file. [Plot: write time in seconds versus the number of OSTs, from 5 to 144.] If the file is not big enough, increasing the number of OSTs can cause performance issues.

How to declare the number of MPI I/O aggregators By default with the current version of Lustre, the number of MPI I/O aggregators equals the stripe count (the number of OSTs used by the file). There are two ways to declare the striping (number of OSTs):
Execute the following command on an empty folder, where X is between 2 and 144 depending on the size of the files used:
lfs setstripe -c X empty_folder
Use the environment variable MPICH_MPIIO_HINTS to declare the striping per file:
export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144,wrfout*:striping_factor=144"
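
A third option (a hedged sketch, not shown in the slides; the file name and the value 64 are only examples): striping_factor is an MPI-IO hint, so it can also be set programmatically through an MPI_Info object when the shared file is created.

program create_with_striping
  use mpi
  implicit none
  integer :: ierr, fh, info

  call MPI_Init(ierr)

  ! Request a Lustre stripe count of 64 for the new file; the hint only
  ! takes effect at file creation, not on an already existing file.
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, "striping_factor", "64", ierr)

  call MPI_File_open(MPI_COMM_WORLD, "restart_file.dat", &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)

  ! ... collective writes as usual ...

  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program create_with_striping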

Handling Darshan data: https://kaust-ksl.github.io/harshad/ Using the Darshan tool to visualize I/O performance.

Lustre best practices V Experiment with different stripe counts/sizes for MPI collective writes. Stripe-align, i.e. make sure that individual I/O requests do not access multiple OSTs. Avoid repetitive stat operations, for example a loop that checks whether a file exists, executes a command, and deletes a file.
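
A minimal Fortran sketch of the stat issue (the file name is hypothetical): an INQUIRE inside the loop sends a metadata request to the MDS on every iteration, so check the file once and reuse the answer.

program stat_once
  implicit none
  logical :: exists
  integer :: i

  ! Poor pattern: inquire(file="results.dat", exist=exists) inside the loop
  ! would stat the file on the MDS 1000 times.
  ! Better: stat once, outside the loop.
  inquire(file="results.dat", exist=exists)
  do i = 1, 1000
     if (exists) then
        ! ... work that depends on the file ...
     end if
  end do
end program stat_once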

Lustre best practices VI Avoid multiple processes opening the same file at the same time. When opening a read-only file in Fortran, use ACTION='READ' and not the default ACTION='READWRITE', to reduce contention by not locking the file. Avoid repetitive open/close operations: if you only read a file, use ACTION='READ', open it once, read all the information, and save the results.
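
For illustration (a minimal sketch; the input file name is assumed), open the file once with ACTION='READ', read everything in a single pass, then close it:

program read_once
  implicit none
  integer :: ios
  character(len=256) :: line

  ! Read-only open: Lustre does not need to grant a write lock for this file.
  open(11, file="input.dat", status="old", action="read", iostat=ios)
  if (ios /= 0) stop "could not open input.dat"

  do
     read(11, '(A)', iostat=ios) line
     if (ios /= 0) exit        ! end of file
     ! ... store/use the contents of line ...
  end do

  close(11)
end program read_once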

Test case: 4 MPI processes
n=1000
do i=1,n
   open(11, file=ffile, position="append")
   write(11, *) 5
   close(11)
end do

Test case - Darshan

Test case - Darshan II

Test case II: 4 MPI processes
n=1000
open(11, file=ffile, position="append")
do i=1,n
   write(11, *) 5
end do
close(11)

Test case II - Darshan

Test case II - Darshan II

Conclusions It was explained how Lustre operates. Some parameters need to be investigated for optimum performance. Be patient; you will probably not achieve the best performance immediately.

Thank you! Questions? georgios.markomanolis@kaust.edu.sa