Let's Make Parallel File System More Parallel [LA-UR-15-25811] Qing Zheng 1, Kai Ren 1, Garth Gibson 1, Bradley W. Settlemyer 2 - 1 Carnegie Mellon University, 2 Los Alamos National Laboratory
HPC defined by: parallel scientific apps, a low-latency network for message passing, tiered cluster deployments, and a PFS for highly scalable storage I/O. [Diagram: App 1, App 2, App 3 on compute nodes (10,000+) over a Parallel File System [Lustre] on storage nodes (100+)] Parallel_Data_Lab - http://www.pdl.cmu.edu/ LANL/Summer_School
Failure Handling: nodes and the network will fail; apps use checkpoints to avoid complete re-execution; each proc dumps its memory to a file.
Failure Handling: when a failure happens, an app is simply re-scheduled and resumes execution from its latest checkpoint.
Checkpointing:

    if (proc_id == 0) {
        mkdir("/proj/a/chk/001", 0755);
    }
    sync();
    int fd = open("/proj/a/chk/001/<proc_id>",
                  O_CREAT | O_EXCL | O_WRONLY, 0644);
    write(fd, <..>);
    write(fd, <..>);
    close(fd);

This produces 640K open()/close() calls and N * 640K write() calls, assuming 20,000 compute nodes and 32 CPUs per node.
Will existing PFS deliver sufficient perf? YES? NO? [ DATA ] [ METADATA ]
Metadata: 1. Namespace Tree - the hierarchical directory structure; 2. File Attributes - file name, file size, last modification time, etc.; 3. Data Location - where to find file/directory data. [ METADATA ] open(), close(), unlink(), mkdir(), rmdir(), rename(), getattr(), chmod(), readdir(), etc.
Decoupled PFS: a metadata service [a single machine or a few, e.g. Lustre MDS] and a data service [a large collection of machines, e.g. Lustre OSS]. This allows data to scale without scaling metadata.
Isn't Metadata a Problem? NO - the FS only stores large files. NO - metadata is small in size. NO - 90% of ops are I/O.
Isn't Metadata a Problem? "NO - the FS only stores large files"? Median file size is actually tiny/small: < 64KB in cloud computing data centers, < 64MB in supercomputing environments (64MB is the default block size for the Google File System).
Isn't Metadata a Problem? With bigger and bigger clusters, the # of app processes grows, metadata size grows, and the # of metadata ops grows - so much for "NO - metadata is small in size" and "NO - 90% of ops are I/O".
HPC is growing fast: tomorrow we will have EXASCALE computing facilities and more intensive METADATA WORKLOADS. Metadata eventually becomes a huge problem!!
Will existing PFS deliver sufficient perf? NO!! [ METADATA ]
GOAL: PARALLEL DATA/METADATA
Middleware Design: [Diagram: Parallel Scientific Applications issue metadata ops; data storage and metadata storage sit on the Underlying Storage Infrastructure [Object Storage/Parallel File System]]
Middleware Design: [Diagram: in the Parallel Scientific Application, each Client Proc runs a Private Server handling metadata operations; a Primary Server handles data/metadata storage over a fast interconnect; metadata is stored on the Underlying Storage Infrastructure [Object Storage/Parallel File System]]
Middleware Design: enables metadata to be potentially served from compute nodes.
Agenda: 1. Metadata Representation, 2. Bulk Insertion - a Client-funded File System Metadata Architecture
1 Metadata Representation
Block-based Metadata (UNIX Model): superblock, data block map, inode map, inode blocks, data blocks. [Example: a directory inode id=157, size=4096, type=[directory], time=2015-07-27 with directory entry list [..] -> 132, [.] -> 157, zhengq -> 158, kair -> 159, garth -> 160, bws -> 161; a file inode id=161, size=64, type=[file], time=2015-07-27]
Block-based Metadata: file creates -> disk seeks; linear directory entry search cost; zero per-directory concurrency.
Table-based Metadata (ordered KV pairs): KEY = parent_id + hash(fname), VALUE = an embedded inode + fname. [Example: key (0, h(proj)) -> id=1, type=dir, fname=proj; (0, h(src)) -> id=2, type=dir, fname=src; (1, h(batchfs)) -> id=5, type=dir, fname=batchfs; (2, h(fs.h)) -> id=3, type=file, fname=fs.h; (2, h(fs.c)) -> id=4, type=file, fname=fs.c; readdir(/src) scans the keys prefixed by id 2]
Table-based Metadata: a large distributed sorted directory entry table with embedded inodes.
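As a hedged sketch of this key scheme (names and encoding choices are assumptions, not the actual BatchFS code), the point is that a big-endian parent-id prefix keeps every entry of one directory contiguous in the sorted table, so readdir becomes a range scan:

```python
import hashlib
import struct

# In-memory stand-in for the distributed sorted table (keys sort bytewise).
table = {}

def make_key(parent_id, fname):
    # 8-byte big-endian parent id, then an 8-byte name hash: big-endian
    # ordering keeps all entries of one directory adjacent in sort order.
    name_hash = hashlib.sha1(fname.encode()).digest()[:8]
    return struct.pack(">Q", parent_id) + name_hash

def create(parent_id, fname, inode):
    # VALUE = embedded inode + fname (fname disambiguates hash collisions)
    table[make_key(parent_id, fname)] = (inode, fname)

def readdir(parent_id):
    # readdir becomes a contiguous prefix scan over the sorted keys.
    prefix = struct.pack(">Q", parent_id)
    return sorted(fname for key, (_inode, fname) in table.items()
                  if key.startswith(prefix))

# Mirror the slide's example namespace:
create(0, "proj", {"id": 1, "type": "dir"})
create(0, "src", {"id": 2, "type": "dir"})
create(1, "batchfs", {"id": 5, "type": "dir"})
create(2, "fs.h", {"id": 3, "type": "file"})
create(2, "fs.c", {"id": 4, "type": "file"})
print(readdir(2))   # -> ['fs.c', 'fs.h']
```

Embedding the inode in the value means a create or stat touches one key/value pair rather than a separate inode block and directory block.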
Table Representation - Log-structured Merge Trees [LSM]: a collection of B-trees at different levels; level-0 is an in-memory B-tree and always sits in memory; file/directory creates are inserted into level-0.
Table Representation - LSM: when level-0 becomes FULL, it is merged into level-1.
Table Representation - LSM: when level-1 becomes FULL, part of level-1 is merged into level-2.
Table Representation - LSM is optimized for K/V insertion: it converts random disk I/O into sequential I/O and avoids disk seeks.
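The level-0/flush behavior can be sketched in a few lines (a toy illustration with hypothetical names, not BatchFS code; real LSM trees such as LevelDB add write-ahead logging, bloom filters, and multi-level compaction):

```python
# Minimal LSM-flavored sketch: inserts land in an in-memory level-0 table;
# when it fills, it is written out as a sorted run in one sequential pass,
# which is why random inserts never turn into random disk seeks.

MEMTABLE_LIMIT = 4   # tiny limit so flushes are easy to see

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # level-0: always sits in memory
        self.runs = []       # newest-first sorted runs (levels 1, 2, ...)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # One sequential write of a sorted run (stands in for a B-tree level).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # check level-0 first
            return self.memtable[key]
        for run in self.runs:             # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

Reads check newest data first, so a later insert of the same key shadows older runs without ever rewriting them.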
LSM - Updates: no write-in-place. chmod("/proj/batchfs", ...) inserts a new record (1, h(batchfs)) -> perm=yyy, fname=batchfs, seq=361, which shadows the older record with perm=xxx, seq=245 because seq 361 > 245. K/V updates become K/V insertion operations.
LSM - Deletions: no explicit deletion. rmdir("/proj/batchfs", ...) inserts a new record (1, h(batchfs)) -> live=false, fname=batchfs, seq=361, which shadows the older live=true record with seq=245 because seq 361 > 245. K/V deletions become K/V insertion operations.
LSM - Deletions: because K/V deletions become K/V insertions, 1. the data structure is immutable, and 2. snapshotting a file system image is trivial.
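The seq-number rule from the two slides above can be shown concretely (a hedged sketch with assumed record fields, not the actual on-disk format): the reader keeps whichever record has the highest seq, and a live=False record acts as a tombstone meaning "deleted".

```python
# chmod and rmdir both insert fresh records; nothing is overwritten.
def resolve(records):
    # records: all versions found for one key across the LSM levels
    newest = max(records, key=lambda r: r["seq"])
    return newest if newest["live"] else None   # tombstone hides the key

versions = [
    {"perm": "xxx", "fname": "batchfs", "seq": 245, "live": True},
    {"perm": "yyy", "fname": "batchfs", "seq": 361, "live": True},  # chmod
]
print(resolve(versions)["perm"])   # yyy: seq 361 > 245 wins

versions.append({"fname": "batchfs", "seq": 400, "live": False})    # rmdir
print(resolve(versions))           # None: the newest record is a tombstone
```

Since old versions are never destroyed, a snapshot is just "read as of seq N", which is what makes snapshotting trivial.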
LSM - Storage: the namespace is represented as an LSM-Tree and formatted into table files T1, T2, T3, T4 (e.g. 32MB each) on the Underlying Storage Infrastructure [Object Storage/Parallel File System].
LSM - Storage: pack metadata into large files; reuse the data path to deliver scalable metadata.
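A rough sketch of the "pack metadata into large files" step (the record format and function names here are assumptions for illustration): sorted k/v pairs are serialized into fixed-size immutable blobs, each of which becomes one file or object that can ride the PFS's scalable data path.

```python
import io
import struct

TABLE_SIZE = 32 << 20   # ~32MB per table file, matching the slide's example

def pack_tables(sorted_kv, table_size=TABLE_SIZE):
    # Serialize sorted (key, value) byte-string pairs into table blobs.
    tables, buf = [], io.BytesIO()
    for key, value in sorted_kv:
        # Simple length-prefixed record: 4-byte key len, 4-byte value len.
        rec = struct.pack(">II", len(key), len(value)) + key + value
        if buf.tell() > 0 and buf.tell() + len(rec) > table_size:
            tables.append(buf.getvalue())   # seal the current table file
            buf = io.BytesIO()
        buf.write(rec)
    if buf.tell() > 0:
        tables.append(buf.getvalue())
    return tables   # each blob would be written as one file/object (T1, T2, ...)
```

Because each blob is written once and never modified, it fits storage backends that favor large sequential writes over small random updates.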
Experiments
Experiments: each client process creates 1 private directory and inserts a set of empty files into that directory (CHECKPOINT WORKLOAD). Hadoop File System (HDFS) cluster: one Name Node [metadata node] plus eight Data Nodes; each node has two CPUs, 8GB RAM, one SATA HDD, and one 1Gb Ethernet port.
Experiments: the original Hadoop file system gives 600 op/s.
Experiment Settings: 1 million files inserted without bulk insertion. [Setup: HDFS Name Node plus 1 BatchFS Server; three nodes each running 1-8 BatchFS clients; three HDFS Data Nodes, one disk each]
Throughput (K op/s), HDFS Baseline vs. BatchFS: with 8/16/32/64 client processes, HDFS delivers 0.6 K op/s in every case, while BatchFS delivers 11/13/13/12 K op/s - roughly 20X in each case. Efficient Metadata Representation.
2 Bulk Insertion
Traditional Model: the Parallel Scientific Application calls mkdir(), create() on a Dedicated Metadata Server, which writes tree files T1, T2, T3, T4 to the Shared Underlying Storage Infrastructure (on-disk namespace storage).
Traditional Model: a synchronous interface with strong consistency.
Traditional Model: with 320K client processes, the Dedicated Metadata Server becomes a bottleneck. 1. A dedicated service doesn't work at exascale. 2. The traditional model is overkill for scientific applications.
Bulk Insertion (1): the Parallel Scientific Application performs mkdir(), create() via private servers, which write tree files T5, T6 (the client's metadata mutations) alongside the on-disk namespace storage T1, T2, T3, T4.
Bulk Insertion (2): bulk submit - the Dedicated Metadata Server finishes the operation simply by picking up all submitted tree files.
Bulk Insertion: similar to database pre-loading - data is inserted via a low-level protocol instead of SQL.
Bulk Insertion: 1. more efficient h/w utilization; 2. fewer calls to dedicated servers, hence more scalable metadata.
Concurrency Control: two clients may concurrently issue chmod("/proj") vs. chmod("/proj"), chmod("/proj") vs. rmdir("/proj"), mkdir("/proj") vs. mkdir("/proj"), or rename("/proj", "/a") vs. rename("/proj", "/b"). We need a total ordering of mutations from different clients.
Optimistic Locking: [Diagram: a BatchFS Client BOOTSTRAPs from a SNAPSHOT of the namespace (ROOT/proj/src, batchfs with fs.h, fs.c), writes checkpoint files (ck1) into its private branch, and SUBMITs them for CHECK/MERGE into the global tree]
Optimistic Locking: similar to source code control (github/svn), except there is no data copying (we do copy-by-ref).
Optimistic Locking - Fundamental Assumption: scientific applications rarely produce conflicts.
Phase 1: Branching - the client instantiates a private namespace from a global snapshot: snapshot(), then mkdir(), chmod(), ..., then bulk_insert(). [Global namespace tables T1-T5 form the global branch; the client's private branch accumulates new KV pairs]
Phase 2: Merging - the server picks up the client's metadata mutations and schedules a check on them; they are tentatively accepted, subject to future rejection (e.g. against a concurrent open() from Client2).
Phase 3: Verification - an SST Interpreter reads the metadata operation log and softly re-executes it against a view of the global namespace; since concurrent updates mostly don't produce conflicts, the client's metadata mutations COMMIT after conflict resolution.
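The three phases can be condensed into one hedged sketch of the verification check (field names and the conflict rule are assumptions for illustration): the server re-executes the client's log and rejects an operation only if some concurrent update touched the same key after the client's snapshot was taken.

```python
# Soft re-execution of a client's mutation log against the global namespace.
def verify(log, namespace, snapshot_seq):
    committed, conflicts = [], []
    for op in log:
        current = namespace.get(op["key"])
        if current is not None and current["seq"] > snapshot_seq:
            conflicts.append(op)   # concurrent update: needs resolution
        else:
            namespace[op["key"]] = {"value": op["value"], "seq": op["seq"]}
            committed.append(op)
    return committed, conflicts

namespace = {"/proj": {"value": "dir", "seq": 100}}
log = [{"key": "/proj/batchfs", "value": "dir", "seq": 245},
       {"key": "/proj", "value": "dir-chmod", "seq": 246}]
ok, bad = verify(log, namespace, snapshot_seq=200)
print(len(ok), len(bad))   # 2 0: nothing changed since the snapshot
```

Under the fundamental assumption that conflicts are rare, the conflicts list is almost always empty and the whole batch commits without any per-operation server round trip.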
Experiments
Previous Setting: 1 million files inserted without bulk insertion. [Setup: HDFS Name Node plus 1 BatchFS Server; three nodes each running 1-8 BatchFS clients; three HDFS Data Nodes, one disk each]
New Setting: 8 million files inserted with bulk insertion. [Same setup]
Throughput (K op/s), No vs. w/ Bulk Insertion: with 8/16/32/64 client processes, BatchFS without bulk insertion delivers 11/13/13/12 K op/s, and with bulk insertion 139/188/203/216 K op/s - an 8X/15X/15X/18X improvement. Bulk Insertion: 20X * 18X = 360X faster than HDFS (0.6 K op/s).
Agenda: 1. Metadata Representation, 2. Bulk Insertion - a Client-funded File System Metadata Architecture
Why is the FS slow? Inefficient metadata representation; at least one RPC per operation; a synchronous metadata interface; pessimistic concurrency control; a dedicated authorization service.
Client-funded HPC: an exascale PFS architecture that moves metadata computation from servers to apps - better h/w utilization, and the FS scales with the # of clients. Each app pre-executes metadata ops privately with per-batch synchronization; the Primary Metadata Server is not in the critical path.
Client-funded HPC: apps have long had rich h/w resources - now they can buy themselves scalable metadata.
Future Work
Implementation
Metadata Traces
Reference: Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14); Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)
QUESTIONS