Overview of High Performance Input/Output on LRZ HPC systems. Christoph Biardzki, Richard Patra, Reinhold Bader


Agenda
- Choosing the right file system
- Storage subsystems at LRZ
- Introduction to parallel file systems
- Optimizing I/O in your applications
- Big/little endian issues (Fortran)

File system types at LRZ
Home and project file systems:
- Typically lots of small files (< 1 MB)
- Available space limited by quota
- Very reliable; regular backup is performed by LRZ
- E.g., source code, binaries, configuration and (smaller) input files
Pseudo-temporary file systems:
- Huge local or (shared + parallel) file systems (> 100 TB), no quota
- Good I/O bandwidth with huge files (> 100 MB); not optimal for small files (transactions)
- Somewhat lower reliability due to new technology and size
- High-watermark deletion, no backup!
- E.g., large temporary files, large input or output files

Choosing the right file system
File systems are a shared resource, so please be considerate of other users.
Do:
- Put your really important data into a home/project file system
- Use the $OPT_TMP environment variable, which always references the optimal temporary file system (see the sketch below)
- Use snapshots where available if you need an older version of a file or if you have removed a file by mistake
- Contact LRZ HPC support if you feel you have an unusually I/O-intensive application, or if you need additional, reliable storage for your project
Do not:
- Use your home directory for temporary files
- Put small files into parallel file systems (better: avoid small files altogether!)
- Put any data you cannot recompute into a pseudo-temporary file system (no backup!)
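As a minimal sketch of the two points above (assuming a Fortran 2003 compiler; the file name scratch.dat and the fallback to the current directory are illustrative, not LRZ conventions), a program can resolve $OPT_TMP at run time and place its large scratch file there instead of in $HOME:

  program use_opt_tmp
    implicit none
    character(len=256) :: tmpdir
    integer :: stat
    ! Resolve the LRZ variable pointing to the optimal temporary file system
    call get_environment_variable('OPT_TMP', tmpdir, status=stat)
    if (stat /= 0) tmpdir = '.'          ! fallback: current directory (illustrative)
    ! Large temporary data goes to $OPT_TMP, not to the home directory
    open(unit=10, file=trim(tmpdir)//'/scratch.dat', form='unformatted', action='write')
    write(10) 42.0d0
    close(10)
  end program use_opt_tmp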

Storage configuration at LRZ
NFS:
- Home file systems in the Linux cluster + Altix (some TB) and HLRB-II (60 TB)
- Expect a total performance of ~100 MB/s with sequential access
- Snapshots are available as a backup measure
XFS / CXFS (cluster XFS):
- Used on the Altix (11 + 7 TB) and HLRB-II (300 + 300 TB) as scratch file systems
- Several 100 MB/s per process, up to 20 GB/s per file system on HLRB-II
Lustre:
- Pseudo-temporary file system on the Linux cluster (140 TB)
- Using the 1.6 release
- Up to 5000 MB/s aggregate I/O bandwidth

Current I/O subsystem setup on Linux Cluster systems
[Diagram: Lustre clients attached to the Lustre servers (OSTs 1-120); configuration details follow on the next slides]

Introduction to parallel file systems
What is a parallel file system?
- A single file server becomes a bottleneck when a parallel application running on a cluster writes/reads huge amounts of data
- In a parallel file system you can split a file among several file servers, parallelize the I/O and improve performance
- In the diagram the stripe size is 4 (letters); in reality it is ~2 MB
- The number of servers used is also configurable; you do not want to stripe every file over all your servers
- Exception: many clients access one file ("parallel I/O")

Example: Lustre at LRZ
Configurable parameters in Lustre:
- Stripe size (default: 2 MB)
- Stripe count = number of servers to stripe over (default: 1)
- Number of the first server (default: random)
Lustre configuration:
- 1 metadata server, 120 data servers (called OSTs: Object Storage Targets)
- ~1 TB of storage attached to each OST
- 10 Gigabit Ethernet connections to the network switches
- Client connection: Gigabit Ethernet: 90 MB/s; 10 GE nodes: ~600 MB/s

Performance (2006)
- Benchmark with up to 15 dual-Itanium clients using Gigabit Ethernet
- Every client writes a 15 GB file into Lustre

General rules for I/O
- Avoid unnecessary I/O
- Perform I/O in few and large chunks
- Write binary instead of formatted data (factor 3 performance improvement!); see the sketch after this list
- Use an appropriate file system
- Use I/O libraries whenever available
- Convert to the target/visualization format in memory if possible
- For parallel programs: output to separate files for each process gives the highest throughput, but usually needs postprocessing
- Use library/compiler support for conversion between little/big endian for files used on different architectures
- Avoid unnecessary open/close statements
- Avoid explicit flushes of data to disk, except when needed for consistency reasons
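A minimal sketch of the binary-versus-formatted point (array size and file names are illustrative; the factor-3 figure is the one quoted above):

  program binary_vs_formatted
    implicit none
    integer, parameter :: n = 1000000
    real :: a(n)
    call random_number(a)
    ! Formatted (ASCII) output: every value is converted to text, large and slow
    open(unit=10, file='data.txt', form='formatted', action='write')
    write(10, '(E16.8)') a
    close(10)
    ! Unformatted (binary) output: one large contiguous transfer, compact and fast
    open(unit=11, file='data.bin', form='unformatted', action='write')
    write(11) a
    close(11)
  end program binary_vs_formatted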

I/O in Fortran
Parameters of the OPEN statement:
- Specify what you want to do, read, write or both: ACTION='READ' / 'WRITE' / 'READWRITE'
- Perform direct access with a large record length (if possible a multiple of the disk block size): ACCESS='DIRECT', RECL=<record_length>; a sketch follows below
- Use binary (unformatted) I/O, the default for direct access: FORM='UNFORMATTED'
- If you need sequential formatted access, remember to at least access the data in large chunks
- Use buffering if possible / manually increase the buffer size (~100 MB). The Intel Fortran run-time system offers additional parameters of the OPEN statement: BUFFERED='YES', BUFFERCOUNT=10000. Such directives/keywords are usually proprietary.
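A minimal sketch of direct-access, unformatted output with a large record length (sizes and the file name are illustrative; note that the unit of RECL is processor dependent, e.g. 4-byte words for Intel Fortran unless -assume byterecl is used, which is why INQUIRE(IOLENGTH=) is used here):

  program direct_unformatted
    implicit none
    integer, parameter :: n = 1024*1024        ! 4 MiB of default reals per record
    real :: buf(n)
    integer :: rl, i
    buf = 1.0
    inquire(iolength=rl) buf                   ! record length in processor-dependent RECL units
    open(unit=20, file='big.dat', action='write', access='direct', &
         form='unformatted', recl=rl)
    do i = 1, 16
       write(20, rec=i) buf                    ! 16 large records, written in big chunks
    end do
    close(20)
  end program direct_unformatted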

I/O in C
- Increase the buffer size (~100 MB) with setvbuf (must be called before reading, writing or any other operation on the file)
- Perform unformatted instead of formatted I/O: fwrite/fread instead of fprintf/fscanf
- For repositioning within the file use fseek
Example (the buffer must be allocated before it is handed to setvbuf):

  #include <stdio.h>
  #include <stdlib.h>
  #define SIZE 1000000

  double data[SIZE];

  int main(void) {
      const char *filename = "data.bin";        /* illustrative file name */
      char *myvbuf = malloc(100000000);         /* buffer must exist before setvbuf() */
      FILE *fp = fopen(filename, "w");
      setvbuf(fp, myvbuf, _IOFBF, 100000000);   /* I/O fully buffered */
      fseek(fp, 0, SEEK_SET);
      fwrite(data, sizeof(double), SIZE, fp);
      fclose(fp);
      free(myvbuf);
      return 0;
  }

MPI-I/O
- Perform non-contiguous I/O with MPI derived datatypes
- Perform collective I/O
- Tell the MPI subsystem what you want to do (read, write, both, ...):
  call MPI_Info_set(info, 'access_style', <style>, ierr)
  where <style> can be 'write_once', 'read_once', 'write_mostly', 'read_mostly', 'sequential', ...
- Pass additional hints to the MPI subsystem (unknown hints will be ignored); many of these are implementation-dependent

Tuning I/O on Lustre: serial and MPI-parallel
Lustre striping factor:
- lfs getstripe <filename> shows the striping of a file
- lfs setstripe <directory> <stripe-size> <start-ost> <stripe-cnt> sets stripe size, first OST and stripe count for files created in that directory
- Example: lfs setstripe /lustre/a2832bf/bench 0 -1 12 (stripes with the default stripe size (2 MB) over 12 OSTs, starting at a random OST)
Hints for MPI parallel I/O (see the sketch below):
- call MPI_Info_set(info, 'striping_unit', '<stripe-size>', ierr)
- call MPI_Info_set(info, 'striping_factor', '<stripe-cnt>', ierr)
- call MPI_Info_set(info, 'num_io_nodes', '<stripe-cnt>', ierr)
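A minimal sketch combining the access_style hint with the striping hints in a collective write (file name, buffer size and hint values are illustrative; unknown hints are simply ignored by the MPI library):

  program mpiio_hints
    use mpi
    implicit none
    integer, parameter :: n = 1048576            ! 8 MB of doubles per process
    integer :: ierr, rank, info, fh
    integer(kind=MPI_OFFSET_KIND) :: offset
    double precision, allocatable :: chunk(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    allocate(chunk(n)); chunk = dble(rank)

    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, 'access_style',    'write_once', ierr)
    call MPI_Info_set(info, 'striping_unit',   '2097152',    ierr)   ! 2 MB stripes
    call MPI_Info_set(info, 'striping_factor', '12',         ierr)   ! stripe over 12 OSTs

    call MPI_File_open(MPI_COMM_WORLD, 'out.dat', &
         MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)
    ! default file view: offsets are in bytes
    offset = int(rank, MPI_OFFSET_KIND) * 8_MPI_OFFSET_KIND * int(n, MPI_OFFSET_KIND)
    call MPI_File_write_at_all(fh, offset, chunk, n, MPI_DOUBLE_PRECISION, &
         MPI_STATUS_IGNORE, ierr)
    call MPI_File_close(fh, ierr)
    call MPI_Info_free(info, ierr)
    call MPI_Finalize(ierr)
  end program mpiio_hints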

19 blades per partition: ~1.25 GB/s in aggregated mode ($OPT_TMP). An additional file system $PROJECT is available.

Tuning I/O on CXFS: FFIO
glibc calls can be diverted to an alternative I/O layer: FFIO (Fast and Flexible I/O)
Prerequisites:
- dynamic linkage, at least against glibc
- export LD_PRELOAD=/usr/lib/libFFIO.so
- optionally set the variables FF_IO_LOGFILE and FF_IO_OPEN_DIAGS
- set the variable FF_IO_OPTS (mandatory!) to select file patterns, the I/O layers to be used and performance-relevant parameters
Then run the program as usual; see man libffio for details.

Example of FFIO usage

  export FF_IO_OPTS='myfile.*(eie.direct.nodiag.mbytes:4096:64:6,event.mbytes.notrace)'

Affects all files with basename myfile.*
E(nhanced) I(ntelligence) E(ngineering) layer suboptions:
- direct: unbuffered I/O
- nodiag: no cache usage statistics reported
- mbytes: unit for logging
- 4096: page size, in units of 512-byte blocks; use this or an integer multiple of the striping unit of the LRZ system (TP9700: 2 MByte)
- 64: number of pages in the FFIO cache; a low value enforces flushing to disk, a high value provides effective buffering; choose according to the other memory requirements of the program
- 6: number of pages read ahead if sequential access is detected; can improve read performance if suitably increased
Event layer (statistics): monitors I/O between layers; effectively unused here

FFIO for MPI programs
- Separate FFIO settings can be used for each MPI task
- Must use SGI MPT on the Altix
- Replace FF_IO_OPTS by per-rank variables:
  export SGI_MPI=/usr/lib
  export FF_IO_OPTS_RANK0=
  export FF_IO_OPTS_RANK1=

Tuning MPI I/O (XFS)
DMA transfers:
- call MPI_Info_set(info, 'direct_read', 'true', ierr)
- call MPI_Info_set(info, 'direct_write', 'true', ierr)
- Bypasses the OS buffer cache; can improve performance in special cases, but usually leads to performance degradation (do not use, except when the memory used by the buffer cache is needed for computation)
- See the FFIO description on the previous slides

MPI I/O example
Writing a distributed 1024 x 1024 array of REAL*4 (6 processes) with an MPI derived datatype (darray):

  File system       Total size   MB/s (noncollective)   MB/s (collective)
  Lustre (6 OSTs)   12 GB        34                     102
  Lustre (6 OSTs)   120 GB       51                     100
  XFS               12 GB        270                    315
  XFS               120 GB       189                    160

Big/little endian issues: converting unformatted files
Environment variable specific to Intel-Fortran-generated binaries:

  export F_UFMTENDIAN=MODE | [MODE;]EXCEPTION

where:
  MODE      = big | little
  EXCEPTION = big:ULIST | little:ULIST | ULIST
  ULIST     = U | ULIST,U
  U         = decimal | decimal-decimal
Examples:
- F_UFMTENDIAN=big : file format is big-endian for all units
- F_UFMTENDIAN=big:9,12 : big-endian for units 9 and 12, little-endian for the others
- F_UFMTENDIAN="big;little:8" : big-endian for all units except unit 8
If F_UFMTENDIAN is unset, the default value is little.

Converting files: alternatives for Intel Fortran
- Use the convert switch at compilation: affects all units opened in the source file
- Use the CONVERT= keyword on the OPEN statement: affects only the opened I/O unit; this is a proprietary enhancement, so the code becomes non-portable! (see the sketch below)
- Both the option and the keyword can take various values: big_endian, little_endian, cray, ibm, ...
- See the compiler documentation / language reference for detailed information
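A minimal sketch of the OPEN-statement variant (Intel Fortran extension; the file name is illustrative):

  program write_big_endian
    implicit none
    double precision :: x(4)
    x = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)
    ! CONVERT= is a vendor extension: this unit writes big-endian unformatted
    ! data regardless of the native byte order of the machine
    open(unit=30, file='big_endian.dat', form='unformatted', action='write', &
         convert='big_endian')
    write(30) x
    close(30)
  end program write_big_endian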