Overview of High Performance Input/Output on LRZ HPC systems Christoph Biardzki Richard Patra Reinhold Bader
Agenda Choosing the right file system Storage subsystems at LRZ Introduction to parallel file systems Optimizing I/O in your applications Big/Little Endian issues (Fortran)
File system types at LRZ
- Home and Project file systems: typically many small files (<1 MB); available space limited by quota; very reliable; regular backup is performed by LRZ. E.g., source code, binaries, configuration and (smaller) input files.
- Pseudo-temporary file systems: huge local or (shared + parallel) file systems (>100 TB), no quota; good I/O bandwidth with huge files (>100 MB), not optimal for small files (transactions); somewhat lower reliability due to new technology and size; high-watermark deletion, no backup! E.g., large temporary files, large input or output files.
Choosing the right file system
File systems are a shared resource, so please be nice to other users.
Do:
- Put your really important data into a home/project file system.
- Use the $OPT_TMP environment variable, which always references the optimal temporary file system.
- Use snapshots where available if you need an older version of a file or if you have removed a file by mistake.
- Contact LRZ HPC support if you feel you have an unusually I/O-intensive application or if you need additional, reliable storage for your project.
Do not:
- Use your home directory for temporary files.
- Put small files into parallel file systems (don't use small files at all!).
- Put any data you can't recompute into a pseudo-temporary file system (no backup!).
Storage configuration at LRZ
- NFS: home file systems on the Linux cluster + Altix (some TB) and HLRB II (60 TB); expect a total performance of ~100 MB/s with sequential access; snapshots are available as a backup measure.
- XFS / Cluster-XFS: used on the Altix (11 + 7 TB) and HLRB II (300 + 300 TB) as scratch file systems; several 100 MB/s per process, up to 20 GB/s per file system on HLRB II.
- Lustre: pseudo-temporary file system on the Linux Cluster (140 TB), using the 1.6 release; up to 5000 MB/s aggregate I/O bandwidth.
Current I/O subsystem setup on Linux Cluster systems (diagram: Lustre file system with 120 OSTs)
Introduction to parallel file systems
What is a parallel file system? The file server becomes a bottleneck when a parallel application running on a cluster reads/writes huge amounts of data. In a parallel file system you can split a file among several file servers, parallelize the I/O and improve performance. In the diagram the stripe size is 4 (letters); in reality it is ~2 MB. The number of servers used is also configurable: you don't want to stripe every file over all your servers. Exception: many clients access one file ("parallel I/O").
Example: Lustre at LRZ
Configurable parameters in Lustre:
- stripe size (default: 2 MB)
- stripe count = number of servers to stripe over (default: 1)
- number of the first server (default: random)
Lustre configuration at LRZ:
- 1 metadata server, 120 data servers (called OSTs: Object Storage Targets)
- ~1 TB of storage attached to each OST
- 10 Gigabit Ethernet connections to the network switches
- client connection via Gigabit Ethernet: 90 MB/s; 10 GE nodes: ~600 MB/s
Performance (2006): benchmark with up to 15 dual-Itanium clients using Gigabit Ethernet; every client writes a 15 GB file into Lustre.
General rules for I/O
- Avoid unnecessary I/O.
- Perform I/O in few and large chunks.
- Use binary instead of formatted data (factor 3 performance improvement!).
- Use an appropriate file system.
- Use I/O libraries whenever available.
- Convert to the target/visualization format in memory if possible.
- For parallel programs: output to separate files for each process gives the highest throughput, but usually needs postprocessing.
- Use library/compiler support for conversion between little/big endian for files used on different architectures.
- Avoid unnecessary open/close statements.
- Avoid explicit flushes of data to disk, except when needed for consistency reasons.
I/O in Fortran
Parameters of the OPEN statement:
- Specify what you want to do (read, write or both): ACTION='READ' / 'WRITE' / 'READWRITE'
- Perform direct access with a large record length (if possible a multiple of the disk block size): ACCESS='DIRECT', RECL=<record_length>
- Use binary (unformatted) I/O (the default for direct access): FORM='UNFORMATTED'
- If you need sequential formatted access, remember at least to access data in large chunks.
- Use buffering if possible / manually increase the buffer size (~100 MB). The Intel Fortran run-time system accepts additional parameters on the OPEN statement: BUFFERED='YES', BUFFERCOUNT=10000. Such directives are usually proprietary.
I/O in C
- Increase the buffer size (~100 MB) with setvbuf; call it after opening the file, but before reading, writing or any other operation on it.
- Perform unformatted instead of formatted I/O: fwrite/fread instead of fprintf/fscanf.
- For repositioning within the file use fseek.
Example:
  double data[SIZE];
  FILE* fp = fopen(filename, "w");
  setvbuf(fp, NULL, _IOFBF, 100000000);   /* _IOFBF: I/O fully buffered */
  fseek(fp, 0, SEEK_SET);
  fwrite(data, sizeof(double), SIZE, fp);
(Passing NULL as the buffer argument lets setvbuf allocate the buffer itself.)
MPI-I/O
- Perform non-contiguous I/O with MPI derived datatypes.
- Perform collective I/O.
- Tell the MPI subsystem what you want to do (read, write, both, ...): call MPI_Info_set(info, 'access_style', <style>, ierr), where <style> can be 'write_once', 'read_once', 'write_mostly', 'read_mostly', 'sequential', ...
- Pass additional hints to the MPI subsystem (unknown hints will be ignored); many of these are implementation-dependent.
Tuning I/O on Lustre: serial and MPI-parallel
Lustre striping factor:
- lfs getstripe <filename> shows the striping of a file
- lfs setstripe <directory> <stripe-size> <start-ost> <stripe-cnt> sets the stripe size, first OST and stripe count for files created in that directory
- Example: lfs setstripe /lustre/a2832bf/bench 0 -1 12 (stripes with the default stripe size (2 MB) and a random first OST over 12 OSTs)
Hints for MPI-parallel I/O:
  call MPI_Info_set(info, 'striping_unit', '<stripe-size>', ierr)
  call MPI_Info_set(info, 'striping_factor', '<stripe-cnt>', ierr)
  call MPI_Info_set(info, 'num_io_nodes', '<stripe-cnt>', ierr)
19 blades per partition: ~1.25 GB/s in aggregated mode ($OPT_TMP). A further file system, $PROJECT, is available.
Tuning I/O on CXFS: FFIO
glibc calls can be diverted to an alternative I/O layer: FFIO (Fast and Flexible I/O).
Prerequisites:
- dynamic linkage, at least against glibc
- export LD_PRELOAD=/usr/lib/libFFIO.so
- optionally set the variables FF_IO_LOGFILE and FF_IO_OPEN_DIAGS
- set the variable FF_IO_OPTS (mandatory!) to select the file patterns, the I/O layers to be used and the performance-relevant parameters
Then run the program as usual. See man libffio for details.
Example of FFIO usage
  export FF_IO_OPTS='myfile.*(eie.direct.nodiag.mbytes:4096:64:6,event.mbytes.notrace)'
Affects all files with base name myfile.*
eie = E(nhanced) I(ntelligence) E(ngineering) layer; suboptions:
- direct: unbuffered I/O
- nodiag: no cache usage statistics reported
- mbytes: unit for logging
- 4096: page size; units are 512-byte blocks; use this or an integer multiple of the LRZ system striping unit (TP9700: 2 MB)
- 64: number of pages in the FFIO cache; a low value enforces flushing to disk, a high value provides effective buffering; choose according to the other memory requirements of the program
- 6: number of pages to read ahead if sequential access is detected; can improve read performance if suitably increased
Event layer (statistics): monitors I/O between layers; effectively unused here.
FFIO for MPI programs
Separate FFIO settings are possible for each MPI task (requires SGI MPT on the Altix). Replace FF_IO_OPTS by per-rank variables:
  export SGI_MPI=/usr/lib
  export FF_IO_OPTS_RANK0=
  export FF_IO_OPTS_RANK1=
Tuning MPI I/O (XFS)
DMA transfers:
  call MPI_Info_set(info, 'direct_read', 'true', ierr)
  call MPI_Info_set(info, 'direct_write', 'true', ierr)
This bypasses the OS buffer cache; it can improve performance in special cases, but usually leads to performance degradation. Do not use it, except when the memory occupied by the buffer cache is needed for computation. See the FFIO description on the previous slides.
MPI I/O example
Writing a distributed 1024 x 1024 array of REAL*4 (6 processes) with an MPI derived datatype (darray):

  File system      total size   MB/s (noncollective)   MB/s (collective)
  Lustre (6 OSTs)  12 GB        34                     102
  Lustre (6 OSTs)  120 GB       51                     100
  XFS              12 GB        270                    315
  XFS              120 GB       189                    160
Big/Little endian issues: converting unformatted files
Environment variable specific to Intel-Fortran-generated binaries:
  export F_UFMTENDIAN=MODE | [MODE;]EXCEPTION
where:
  MODE      = big | little
  EXCEPTION = big:ULIST | little:ULIST | ULIST
  ULIST     = U | ULIST,U
  U         = decimal | decimal-decimal
Examples:
  F_UFMTENDIAN=big             file format is big-endian for all units
  F_UFMTENDIAN=big:9,12        big-endian for units 9 and 12, little-endian for all others
  F_UFMTENDIAN="big;little:8"  big-endian for all units except unit 8
If F_UFMTENDIAN is unset, the default value is little.
Converting files: alternatives for Intel Fortran
- Use the convert switch at compilation: affects all units opened in the source file.
- Use the CONVERT= keyword on the OPEN statement: affects only the opened I/O unit; this is a proprietary enhancement, so the code becomes non-portable!
Both the option and the keyword can take various values: big_endian, little_endian, cray, ibm. See the compiler documentation / language reference for detailed information.