Structuring PLFS for Extensibility
Chuck Cranor, Milo Polte, Garth Gibson
Parallel Data Laboratory, Carnegie Mellon University
What is PLFS?
- Parallel Log-Structured File System
- An interposed filesystem between applications and backing storage
- A collaboration of Los Alamos National Laboratory, CMU, and EMC
- Target: HPC checkpoint files
- PLFS transparently transforms a highly concurrent write access pattern into a pattern more efficient for distributed filesystems
- First paper: Bent et al., Supercomputing (SC) 2009
- http://github.com/plfs, http://institute.lanl.gov/plfs/
Checkpoint Write Patterns
The two main checkpoint write patterns:
- N-1: all N processes write to one shared file
  - Concurrent I/O to a single file often does not scale
  - Small, unaligned, clustered traffic is problematic
- N-N: each process writes to its own file
  - Overhead of creating many files in a single directory
  - Easier for the DFS once the files are created
  - Archival and management are more difficult
Initial PLFS focus: improve the N-1 case (a sketch of the N-1 pattern follows)
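A minimal sketch (not from the slides; rank count, block size, and stride count are illustrative) of what the problematic N-1 strided pattern looks like at the POSIX level:

```cpp
// Illustrative N-1 strided checkpoint writer: every rank pwrite()s its
// blocks into ONE shared file at rank-dependent offsets, so many small,
// unaligned writes from different nodes interleave in the same file.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv) {
    const int    nranks  = 64;                     // assumed writer count
    const int    rank    = (argc > 1) ? atoi(argv[1]) : 0;
    const size_t block   = 47001;                  // small, unaligned block
    const int    strides = 4;
    std::vector<char> buf(block, 'x');

    int fd = open("checkpoint.n1", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    for (int s = 0; s < strides; s++) {
        // stride s of rank r lands at offset (s*nranks + r) * block
        off_t off = (off_t)(s * nranks + rank) * block;
        if (pwrite(fd, buf.data(), block, off) != (ssize_t)block) return 1;
    }
    return close(fd) ? 1 : 0;
}
```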
PLFS Transforms Workloads
- PLFS improves N-1 performance by transforming it into an N-N workload
- FUSE/MPI: a transparent solution, no application changes required
PLFS Converts N-1 to N-N
[Figure: processes 131 and 132 on host1, 279 and 281 on host2, and 152 and 148 on host3 all write to the logical file /foo through the PLFS virtual layer. On the physical underlying parallel filesystem, /foo is a container with per-host subdirectories hostdir.1/2/3, each holding per-process log pairs: data.131/indx.131, data.132/indx.132, data.279/indx.279, data.281/indx.281, data.152/indx.152, data.148/indx.148. A sketch of this write path follows.]
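A minimal sketch of the container write path, with hypothetical helper names (the real PLFS code differs): each process appends its bytes to a private data log and records where they logically belong in a matching index log.

```cpp
// Minimal sketch (hypothetical names) of a container write: append the
// raw bytes to this process's data log, then record the mapping from
// logical file offset to data-log offset in the index log.
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

struct IndexRecord {        // one entry per logical write
    off_t  logical_off;     // offset in the logical file (/foo)
    size_t length;          // bytes written
    off_t  physical_off;    // offset in this process's data log
};

// Append-only writes never seek in the backing store, so a highly
// concurrent N-1 pattern becomes N independent sequential logs.
ssize_t container_write(int data_fd, int index_fd,
                        const void *buf, size_t len, off_t logical_off) {
    off_t phys = lseek(data_fd, 0, SEEK_END);      // current end of data log
    ssize_t n = write(data_fd, buf, len);          // append the data
    if (n < 0) return -1;
    IndexRecord rec = { logical_off, (size_t)n, phys };
    if (write(index_fd, &rec, sizeof(rec)) != sizeof(rec)) return -1;
    return n;
}
```

On a later read, PLFS merges the index logs in the container to map any logical offset of /foo back to the right data log and physical offset.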
PLFS N-1 Bandwidth Speedups
[Figure: measured N-1 bandwidth speedups from PLFS, spanning roughly 10X to 100X.]
The Price of Success
- The original PLFS was limited to one workload: N-1 checkpoint on a mounted POSIX filesystem
  - All data stored in PLFS container logs
- Ported first to MPI-IO/ROMIO so it could feasibly deploy on leadership-class machines
- Success with LANL apps, but actual adoption requires maintainability and roadmap evolution
  - Develop a team: LANL, EMC, CMU
  - Revisit the code with maintainability in mind
PLFS Extensibility Architecture
[Figure: the layered PLFS stack, top to bottom]
- HPC application → PLFS high-level API (libplfs)
- Logical FS interface: flat file, small file, container
- Container index API: byte-range, pattern, distributed (MDHIM w/leveldb)
- I/O Store interface: posix, pvfs, iofsl, hdfs (via libhdfs/jvm and hdfs.jar)
Case Study: HPC in the Cloud
- Emergence of Hadoop: converged storage
- HDFS: the Hadoop Distributed Filesystem
- Key attributes:
  - Single sequential writer (not POSIX; no pwrite)
  - Not VFS-mounted; accessed through a Java API
  - Local storage on the nodes (converged)
  - Data replicated ~3 times (local + remote1 + remote2)
- HPC in the cloud: N-1 checkpoint on HDFS?
- Observation: PLFS log I/O fits HDFS semantics (see the sketch below)
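Why the logs fit: each PLFS data/index log is opened once and only ever appended, which matches the single-sequential-writer model HDFS enforces. A minimal sketch of such an append-only log write against the libhdfs C API (path and connection parameters are illustrative):

```cpp
// Append-only PLFS-style log write over libhdfs. HDFS has no pwrite()
// and allows one sequential writer per file, which is all a PLFS log
// needs: open once, append, close.
#include <hdfs.h>       // libhdfs: C bindings over the Java HDFS client
#include <fcntl.h>      // O_WRONLY

int write_log(const char *path, const void *buf, size_t len) {
    hdfsFS fs = hdfsConnect("default", 0);          // default configured fs
    if (!fs) return -1;
    // bufferSize/replication/blocksize of 0 = use the server defaults
    hdfsFile f = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    if (!f) { hdfsDisconnect(fs); return -1; }
    tSize n = hdfsWrite(fs, f, buf, (tSize)len);    // sequential append only
    hdfsFlush(fs, f);                               // push to the datanodes
    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return (n == (tSize)len) ? 0 : -1;
}
```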
PLFS Backend Limitations
- PLFS was hardwired to the POSIX API:
  - Needs a kernel-mounted filesystem
  - Uses integer file descriptors
  - Memory-maps index files to read them
- HDFS does not fit these assumptions
- Solution: I/O Store
  - Insert a layer of indirection above the PLFS backend
  - Model it after the POSIX API to minimize code changes (a sketch of such an interface follows)
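A minimal sketch of what such an indirection layer can look like; the class and method names here are illustrative, not PLFS's actual headers:

```cpp
// Illustrative I/O Store: an abstract backend interface modeled on
// POSIX so existing PLFS code needs few changes. Opaque handles
// replace the integer file descriptors PLFS used to assume.
#include <sys/types.h>
#include <sys/stat.h>
#include <cstddef>

class IOSHandle {                       // stands in for an integer fd
public:
    virtual ~IOSHandle() {}
    virtual ssize_t Read(void *buf, size_t len, off_t off) = 0;
    virtual ssize_t Write(const void *buf, size_t len, off_t off) = 0;
    virtual int     Close() = 0;
};

class IOStore {                         // one instance per backend
public:
    virtual ~IOStore() {}
    virtual IOSHandle *Open(const char *path, int flags, mode_t mode) = 0;
    virtual int Mkdir(const char *path, mode_t mode) = 0;
    virtual int Unlink(const char *path) = 0;
    virtual int Stat(const char *path, struct stat *st) = 0;
};
```

A POSIX store is then a thin wrapper over open/pread/pwrite; because PLFS container logs are append-only, even a backend with no pwrite() (HDFS) can implement Write.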
PLFS I/O Store Architecture
[Figure: PLFS FUSE and PLFS MPI I/O sit atop libplfs and the PLFS container; beneath them the I/O Store layer dispatches either to a posix I/O store (posix libc API, mounted fs) or to an HDFS I/O store (lib{hdfs,jvm} plus the hdfs.jar Java code).]
PLFS/HDFS Benchmark
- Testbed: PRObE (www.nmc-probe.org)
  - Each node: dual 1.6GHz AMD cores, 16GB RAM, 1TB drive, gigabit Ethernet
  - Ubuntu Linux, HDFS 0.21.0, PLFS, OpenMPI
- Benchmark: LANL FS Test Suite (fs_test), simulating a strided N-1 checkpoint
- Filesystems tested:
  - PVFS OrangeFS 2.8.4 w/64MB stripe size
  - PLFS/HDFS w/1 replica (local disk)
  - PLFS/HDFS w/3 replicas (local disk + remote1 + remote2)
- Block sizes: 47001 bytes, 48K, 1M
- Checkpoint size: 32GB written by 64 nodes
Benchmark Operation
[Figure: in the write phase, nodes 0 1 2 3 each write one block per stride, continuing the pattern for the remaining strides; in the read phase the node order is shifted (2 3 0 1) so each node reads back data another node wrote.]
- We unmount and cache-flush the data filesystem between the write and read phases (a sketch of the read shift follows)
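A tiny sketch of the read-phase shift, assuming the rotation shown in the figure (shift of 2 across 4 nodes); the function name is illustrative:

```cpp
// Which node's blocks does 'node' read back? With nnodes = 4 and
// shift = 2, nodes 0 1 2 3 read the data written by nodes 2 3 0 1,
// so no node can satisfy reads from its own locally written data.
int read_source(int node, int nnodes, int shift) {
    return (node + shift) % nnodes;
}
```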
PLFS Implementation Architecture
- Two entry points: a FUSE filesystem and a middleware library (MPI)
[Figure: application processes using PLFS FUSE issue I/O through the VFS/POSIX API; the kernel FUSE module upcalls into the user-level PLFS FUSE daemon (PLFS lib). PLFS MPI application processes link the PLFS/MPI libs directly. Both paths perform backing-store I/O to a local or distributed fs (to disk, to the network), and MPI sync calls cross the interconnect to other nodes. A sketch of the FUSE write path follows.]
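A minimal sketch of the FUSE upcall path: the daemon's write handler forwards each VFS write into libplfs. The plfs_write prototype below is an assumption for illustration; the real entry points are declared in plfs.h.

```cpp
// Sketch of the FUSE daemon's write handler: the kernel FUSE module
// upcalls here, and we forward the write into libplfs, which appends
// to the caller's per-process data/index logs inside the container.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/types.h>
#include <cerrno>

struct Plfs_fd;                                  // opaque libplfs handle
extern ssize_t plfs_write(Plfs_fd *fd, const char *buf, size_t size,
                          off_t offset, pid_t pid);  // assumed prototype

static int plfs_fuse_write(const char *path, const char *buf, size_t size,
                           off_t offset, struct fuse_file_info *fi) {
    // the open handler stashed the Plfs_fd in fi->fh; the caller's pid
    // selects which per-process log inside the container receives data
    Plfs_fd *pfd = reinterpret_cast<Plfs_fd *>(fi->fh);
    ssize_t n = plfs_write(pfd, buf, size, offset,
                           fuse_get_context()->pid);
    return (n < 0) ? -EIO : (int)n;
}

static struct fuse_operations plfs_ops = {};     // .write = plfs_fuse_write, etc.
```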
PLFS/HDFS Write Bandwidth
[Figure: write bandwidth (Mbytes/s, 0 to 2000) vs. access unit size (47001 bytes, 48K, 1M) for PVFS-write, PLFS/HDFS1-write, and PLFS/HDFS3-write.]
- PLFS/HDFS performs well (note HDFS1 writes one replica to local disk; HDFS3 writes 3 copies)
PLFS/HDFS Read Bandwidth
[Figure: read bandwidth (Mbytes/s, 0 to 1000) vs. access unit size (47001 bytes, 48K, 1M) for PVFS-read, PLFS/HDFS1-read, and PLFS/HDFS3-read.]
- HDFS with small access sizes benefits from PLFS log grouping
- HDFS3 with large access sizes suffers from read imbalance
HDFS 1 vs 3: I/O Scheduling
- Network counters show the HDFS3 read imbalance
[Figure: total size of data served (MB, 0 to 1000) per node (1 to 64) for PLFS/HDFS1 vs. PLFS/HDFS3; under HDFS3 some nodes serve much more data than others.]
I/O Store Status
- Rewrote the initial I/O Store prototype as production-level code
  - Supports multiple concurrent I/O Store instances
  - Re-plumbed the entire backend I/O path
- Prototyped POSIX, HDFS, and PVFS stores; an IOFSL store was done by EMC
- Regression tested at LANL
- I/O Store is now part of the released PLFS code: https://github.com/plfs
Conclusions
- PLFS extensions for workload transformation:
  - Logical FS interface: not just container logs; packing small files, burst buffer
  - I/O Store layer: non-POSIX backends (HDFS, IOFSL, PVFS); compression, write buffering, I/O forwarding
  - Container index extensions
- PLFS is open source, available on github: http://github.com/plfs
- Developer email: plfs-devel@lists.sourceforge.net