PERFORMANCE OF PARALLEL IO ON LUSTRE AND GPFS

David Henty and Adrian Jackson (EPCC, The University of Edinburgh); Charles Moulinec and Vendel Szeremi (STFC, Daresbury Laboratory)

Outline
- Parallel IO problem
- Common IO patterns
- Parallel filesystems
- MPI-IO
- Benchmark results
- Filesystem tuning
- MPI-IO application results
- HDF5 and NetCDF
- Conclusions

Parallel IO problem
[diagram: a 4x4 global dataset, numbered 1-16, decomposed across four processes; each process holds a local array numbered 1-4, so each process's data maps to non-contiguous locations in the global file]

Parallel Filesystems
(Figure based on Lustre diagram from Cray.) A single logical user file is automatically divided into stripes by the OS/filesystem.

Common IO patterns
- Multiple files, multiple writers: each process writes its own file; numerous usability and performance issues (see the sketch after this list)
- Single file, single writer (master IO): high usability but poor performance
- Single file, multiple writers: all processes write to a single file; poor performance
- Single file, collective writers: aggregate data onto a subset of IO processes; hard to program and may require tuning, but potential for scalable IO performance
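For comparison, the first pattern needs no MPI-IO at all: each process simply writes its own binary file. A minimal sketch, with the per-rank file-name scheme purely illustrative:

program filepernode
  use mpi
  implicit none

  integer :: rank, ierr
  integer, parameter :: iounit = 10
  character(len=32) :: fname
  double precision :: iodata(128)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  iodata = dble(rank)

  ! Each process writes its own unformatted stream file: simple and often fast,
  ! but the output depends on the process count and is awkward to post-process.
  write(fname, '("data_", i6.6, ".dat")') rank
  open(unit=iounit, file=trim(fname), form='unformatted', access='stream')
  write(iounit) iodata
  close(iounit)

  call MPI_Finalize(ierr)
end program filepernode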

Global description: MPI-IO
[diagram: four ranks in a 2x2 grid, each holding a 2x2 block of a 4x4 global array; the global file contains elements 1-16 in order, and each rank's MPI filetype selects only its own block, e.g. rank 1's view of the file is elements 3, 4, 7 and 8]

Collective IO
- enables numerous optimisations in principle
- requires a global description and the participation of all processes
- does this help in practice?
For example, ranks 0 and 1 can be combined for a single contiguous read/write to the file, and ranks 2 and 3 likewise.

Cellular Automaton Model
"Fortran coarray library for 3D cellular automata microstructure simulation", Anton Shterenlikht, Proceedings of the 7th International Conference on PGAS Programming Models, 3-4 October 2013, Edinburgh, UK.

Benchmark
Distributed regular 3D dataset across a 3D process grid
- local data has halos of depth 1; set up for weak scaling
- implemented in Fortran and MPI-IO

! Define datatype describing global location of local data
call MPI_Type_create_subarray(ndim, arraygsize, arraysubsize, arraystart, &
     MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, filetype, ierr)

! Define datatype describing where local data sits in local array
call MPI_Type_create_subarray(ndim, arraysize, arraysubsize, arraystart, &
     MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, mpi_subarray, ierr)

! After opening file fh, define what portions of file this process owns
call MPI_File_set_view(fh, disp, MPI_DOUBLE_PRECISION, filetype, 'native', &
     MPI_INFO_NULL, ierr)

! Write data collectively
call MPI_File_write_all(fh, iodata, 1, mpi_subarray, status, ierr)
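The fragments above omit the surrounding setup (process decomposition, type commits, file open and close), and the start index of the local subarray is abbreviated on the slide. A minimal self-contained sketch of how the pieces fit together is given below; the local array size, file name and decomposition are illustrative, not the benchmark's actual settings.

program benchio_sketch
  use mpi
  implicit none

  integer, parameter :: ndim = 3, nlocal = 128     ! local cube size (illustrative)
  integer :: comm, rank, nproc, fh, ierr
  integer :: filetype, mpi_subarray
  integer :: dims(ndim), coords(ndim)
  integer :: arraygsize(ndim), arraysize(ndim), arraysubsize(ndim)
  integer :: arraystart(ndim), substart(ndim)
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp = 0
  logical :: periods(ndim) = .false.
  double precision, allocatable :: iodata(:,:,:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  ! Regular 3D decomposition of the processes
  dims = 0
  call MPI_Dims_create(nproc, ndim, dims, ierr)
  call MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, .false., comm, ierr)
  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Cart_coords(comm, rank, ndim, coords, ierr)

  arraysubsize = nlocal                  ! data owned by this process
  arraysize    = nlocal + 2              ! local array including halos of depth 1
  arraygsize   = nlocal * dims           ! global array: weak scaling
  arraystart   = coords * nlocal         ! global offset of this process's block
  substart     = 1                       ! skip the depth-1 halo (starts are 0-based)

  allocate(iodata(arraysize(1), arraysize(2), arraysize(3)))
  iodata = dble(rank)

  ! Datatype describing the global location of the local data in the file
  call MPI_Type_create_subarray(ndim, arraygsize, arraysubsize, arraystart, &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, filetype, ierr)
  call MPI_Type_commit(filetype, ierr)

  ! Datatype describing where the owned data sits in the halo-padded local array
  call MPI_Type_create_subarray(ndim, arraysize, arraysubsize, substart, &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, mpi_subarray, ierr)
  call MPI_Type_commit(mpi_subarray, ierr)

  call MPI_File_open(comm, 'benchio.dat', MPI_MODE_CREATE + MPI_MODE_WRONLY, &
                     MPI_INFO_NULL, fh, ierr)
  call MPI_File_set_view(fh, disp, MPI_DOUBLE_PRECISION, filetype, 'native', &
                         MPI_INFO_NULL, ierr)
  call MPI_File_write_all(fh, iodata, 1, mpi_subarray, status, ierr)
  call MPI_File_close(fh, ierr)

  call MPI_Finalize(ierr)
end program benchio_sketch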

ARCHER XC30

Single file, multiple writers
Serial bandwidth on ARCHER is around 400 to 500 MiB/s.
Here we use MPI_File_write, not MPI_File_write_all: identical functionality, different performance.

Processes   Bandwidth
1           49.5 MiB/s
8           5.9 MiB/s
64          2.4 MiB/s

Single file, collective writers

Lustre striping
We've done a lot of work to enable (many) collective writers:
- learned MPI-IO and described the data layout to MPI
- enabled collective IO; MPI dynamically decided on the number of writers, and collected and aggregated the data before writing
... for almost no benefit!
We need many physical disks as well as many IO streams; in Lustre this is controlled by the number of stripes
- the default number of stripes is 4; ARCHER has around 50 IO servers
The user needs to set the striping count on a per-file or per-directory basis:
lfs setstripe -c -1 <directory>   # use maximal striping
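On systems where the MPI-IO layer passes them through, striping can also be requested at file-creation time via the reserved MPI-IO info keys "striping_factor" and "striping_unit"; whether and how these are honoured is implementation-dependent. A minimal sketch, with the file name and values purely illustrative:

program stripehints
  use mpi
  implicit none

  integer :: info, fh, ierr

  call MPI_Init(ierr)

  ! The hints only take effect when the file is created, not on an existing file.
  call MPI_Info_create(info, ierr)
  call MPI_Info_set(info, 'striping_factor', '48', ierr)     ! number of Lustre stripes
  call MPI_Info_set(info, 'striping_unit', '4194304', ierr)  ! 4 MiB stripe size

  call MPI_File_open(MPI_COMM_WORLD, 'striped.dat', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program stripehints

Setting the stripe count on the output directory with lfs setstripe, as above, achieves the same effect without touching the code.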

Cray XC30 with Lustre: 128³ per process

Cray XC30 with Lustre: 256³ per process

BG/Q: the number of IO servers scales with the number of CPUs

Code_Saturne (http://code-saturne.org)
CFD code developed by EDF (France): co-located finite volume, arbitrary unstructured meshes, predictor-corrector
350,000 lines of code: 50% C, 37% Fortran, 13% Python
MPI for distributed memory (some OpenMP for shared memory), including MPI-IO
Laminar and turbulent flows: k-eps, k-omega, SST, v2f, RSM, LES models, ...

Code_Saturne: default settings
Consistent with benchmark results: with default striping, Lustre is similar to GPFS.

Code_Saturne: Lustre striping
[plot: time (s) against number of cores (30,000 to 40,000) for MPI-IO with a 7.2 billion-element tetrahedral mesh; series compare no striping and full striping for reading the 814 MB input and writing the 742 GB mesh_output]
Consistent with benchmark results: an order of magnitude improvement from striping.

Simple HDF5 benchmark: Lustre

TPLS code
Two-Phase Level Set: CFD code that simulates the interface between two fluid phases; high-resolution direct numerical simulation
Applications: evaporative cooling, oil and gas hydrate transport, cleaning processes, distillation/absorption
Fortran90 + MPI
IO improved by orders of magnitude by moving from ASCII master IO to binary NetCDF; does striping help?

TPLS results

Further Work
Non-blocking parallel IO could hide much of the writing time
- or use the more restricted split-collective functions
- extend the benchmark to overlap comms with calculation
- I don't believe this is implemented in current MPI-IO libraries: blocking MPI collectives are used internally
A subset of user MPI processes will be used by MPI-IO
- it would be nice to exclude them from the calculation
- extend MPI_Comm_split_type() to include something like MPI_COMM_TYPE_IONODE as well as MPI_COMM_TYPE_SHARED?
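For reference, the split-collective functions mentioned above already exist in MPI-IO: a blocking MPI_File_write_all can be split into a begin/end pair, in principle allowing work between the two. A minimal sketch reusing the variable names from the benchmark slide; the compute routine is hypothetical and purely for illustration:

! Start the collective write; iodata must not be modified until the matching _end.
call MPI_File_write_all_begin(fh, iodata, 1, mpi_subarray, ierr)

! ... computation that does not touch iodata ...
call do_other_work()   ! hypothetical routine, for illustration only

! Complete the write and retrieve the status.
call MPI_File_write_all_end(fh, iodata, status, ierr)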

Conclusions
Efficient parallel IO requires all of the following:
- a global approach
- coordination of multiple IO streams to the same file (collective writers)
- filesystem tuning
The MPI-IO benchmark is useful to inform real applications
- NetCDF and HDF5 are layered on top of MPI-IO
- although real application IO behaviour is complicated
Try a library before implementing bespoke solutions! The higher-level view pays dividends.