Let's Make Parallel File System More Parallel [LA-UR-15-25811] Qing Zheng 1, Kai Ren 1, Garth Gibson 1, Bradley W. Settlemyer 2 - 1 Carnegie Mellon University, 2 Los Alamos National Laboratory
HPC defined by: parallel scientific apps, a low-latency network for message passing, tiered cluster deployments, and a PFS for highly scalable storage I/O. [Diagram: App 1, App 2, App 3 on compute nodes (10,000+) over a Parallel File System [Lustre] on storage nodes (100+)] Parallel_Data_Lab - http://www.pdl.cmu.edu/ LANL/Summer_School
Failure Handling: nodes and the network will fail; apps use checkpoints to avoid complete re-execution; each proc dumps its memory to a file.
Failure Handling: when a failure happens, an app is simply re-scheduled and resumes execution from its latest checkpoint.
Checkpointing:

    if (proc_id == 0) {
        mkdir("/proj/a/chk/001", 0755);
    }
    sync();
    int fd = open("/proj/a/chk/001/<proc_id>",
                  O_CREAT | O_EXCL | O_WRONLY, 0644);
    write(fd, <..>);
    write(fd, <..>);
    close(fd);

This produces 640K open()/close() calls and N * 640K write() calls, assuming 20,000 compute nodes and 32 CPUs per node.
Will existing PFS deliver sufficient perf? YES? NO? [ DATA ] [ METADATA ]
Metadata: 1. Namespace Tree - the hierarchical directory structure; 2. File Attributes - file name, file size, last modification time, etc.; 3. Data Location - where to find file/directory data. [ METADATA ] open(), close(), unlink(), mkdir(), rmdir(), rename(), getattr(), chmod(), readdir(), etc.
Decoupled PFS: a metadata service [a single machine or a few, e.g. Lustre MDS] and a data service [a large collection of machines, e.g. Lustre OSS]. This allows data to scale without scaling metadata.
Isn't Metadata a Problem? NO - the FS only stores large files. NO - metadata is small in size. NO - 90% of ops are I/O.
Isn't Metadata a Problem? "NO - the FS only stores large files"? Median file size is actually tiny/small: < 64KB in cloud computing data centers, < 64MB in supercomputing environments (64MB is the default block size for the Google File System).
Isn't Metadata a Problem? With bigger and bigger clusters, the # of app processes grows, metadata size grows, and the # of metadata ops grows - so much for "NO - metadata is small in size" and "NO - 90% of ops are I/O".
HPC is growing fast: tomorrow we will have EXASCALE computing facilities and more intensive METADATA WORKLOADS. Metadata eventually becomes a huge problem!!
Will existing PFS deliver sufficient perf? NO!! [ METADATA ]
GOAL: PARALLEL DATA/METADATA
Middleware Design: [Diagram: Parallel Scientific Applications issue metadata ops; data storage and metadata storage sit on the Underlying Storage Infrastructure [Object Storage/Parallel File System]]
Middleware Design: [Diagram: in the Parallel Scientific Application, each Client Proc runs a Private Server handling metadata operations; a Primary Server handles data/metadata storage over a fast interconnect; metadata is stored on the Underlying Storage Infrastructure [Object Storage/Parallel File System]]
Middleware Design: enables metadata to be potentially served from compute nodes.
Agenda: 1. Metadata Representation, 2. Bulk Insertion - a Client-funded File System Metadata Architecture
1 Metadata Representation
Block-based Metadata (UNIX Model): superblock, data block map, inode map, inode blocks, data blocks. [Example: a directory inode id=157, size=4096, type=[directory], time=2015-07-27 with directory entry list [..] -> 132, [.] -> 157, zhengq -> 158, kair -> 159, garth -> 160, bws -> 161; a file inode id=161, size=64, type=[file], time=2015-07-27]
Block-based Metadata: file creates -> disk seeks; linear directory entry search cost; zero per-directory concurrency.
Table-based Metadata (ordered KV pairs): KEY = parent_id + hash(fname), VALUE = an embedded inode + fname. [Example: key (0, h(proj)) -> id=1, type=dir, fname=proj; (0, h(src)) -> id=2, type=dir, fname=src; (1, h(batchfs)) -> id=5, type=dir, fname=batchfs; (2, h(fs.h)) -> id=3, type=file, fname=fs.h; (2, h(fs.c)) -> id=4, type=file, fname=fs.c; readdir(/src) scans the keys prefixed by id 2]
Table-based Metadata: a large distributed sorted directory entry table with embedded inodes.
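As a hedged sketch of this key scheme (names and encoding choices are assumptions, not the actual BatchFS code), the point is that a big-endian parent-id prefix keeps every entry of one directory contiguous in the sorted table, so readdir becomes a range scan:

```python
import hashlib
import struct

# In-memory stand-in for the distributed sorted table (keys sort bytewise).
table = {}

def make_key(parent_id, fname):
    # 8-byte big-endian parent id, then an 8-byte name hash: big-endian
    # ordering keeps all entries of one directory adjacent in sort order.
    name_hash = hashlib.sha1(fname.encode()).digest()[:8]
    return struct.pack(">Q", parent_id) + name_hash

def create(parent_id, fname, inode):
    # VALUE = embedded inode + fname (fname disambiguates hash collisions)
    table[make_key(parent_id, fname)] = (inode, fname)

def readdir(parent_id):
    # readdir becomes a contiguous prefix scan over the sorted keys.
    prefix = struct.pack(">Q", parent_id)
    return sorted(fname for key, (_inode, fname) in table.items()
                  if key.startswith(prefix))

# Mirror the slide's example namespace:
create(0, "proj", {"id": 1, "type": "dir"})
create(0, "src", {"id": 2, "type": "dir"})
create(1, "batchfs", {"id": 5, "type": "dir"})
create(2, "fs.h", {"id": 3, "type": "file"})
create(2, "fs.c", {"id": 4, "type": "file"})
print(readdir(2))   # -> ['fs.c', 'fs.h']
```

Embedding the inode in the value means a create or stat touches one key/value pair rather than a separate inode block and directory block.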
Table Representation - Log-structured Merge Trees [LSM]: a collection of B-trees at different levels; level-0 is an in-memory B-tree and always sits in memory; file/directory creates are inserted into level-0.
Table Representation - LSM: when level-0 becomes FULL, it is merged into level-1.
Table Representation - LSM: when level-1 becomes FULL, part of level-1 is merged into level-2.
Table Representation - LSM is optimized for K/V insertion: it converts random disk I/O into sequential I/O and avoids disk seeks.
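The level-0/flush behavior can be sketched in a few lines (a toy illustration with hypothetical names, not BatchFS code; real LSM trees such as LevelDB add write-ahead logging, bloom filters, and multi-level compaction):

```python
# Minimal LSM-flavored sketch: inserts land in an in-memory level-0 table;
# when it fills, it is written out as a sorted run in one sequential pass,
# which is why random inserts never turn into random disk seeks.

MEMTABLE_LIMIT = 4   # tiny limit so flushes are easy to see

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # level-0: always sits in memory
        self.runs = []       # newest-first sorted runs (levels 1, 2, ...)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # One sequential write of a sorted run (stands in for a B-tree level).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # check level-0 first
            return self.memtable[key]
        for run in self.runs:             # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

Reads check newest data first, so a later insert of the same key shadows older runs without ever rewriting them.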
LSM - Updates: no write-in-place. chmod("/proj/batchfs", ...) inserts a new record (1, h(batchfs)) -> perm=yyy, fname=batchfs, seq=361, which shadows the older record with perm=xxx, seq=245 because seq 361 > 245. K/V updates become K/V insertion operations.
LSM - Deletions: no explicit deletion. rmdir("/proj/batchfs", ...) inserts a new record (1, h(batchfs)) -> live=false, fname=batchfs, seq=361, which shadows the older live=true record with seq=245 because seq 361 > 245. K/V deletions become K/V insertion operations.
LSM - Deletions: because K/V deletions become K/V insertions, 1. the data structure is immutable, and 2. snapshotting a file system image is trivial.
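The seq-number rule from the two slides above can be shown concretely (a hedged sketch with assumed record fields, not the actual on-disk format): the reader keeps whichever record has the highest seq, and a live=False record acts as a tombstone meaning "deleted".

```python
# chmod and rmdir both insert fresh records; nothing is overwritten.
def resolve(records):
    # records: all versions found for one key across the LSM levels
    newest = max(records, key=lambda r: r["seq"])
    return newest if newest["live"] else None   # tombstone hides the key

versions = [
    {"perm": "xxx", "fname": "batchfs", "seq": 245, "live": True},
    {"perm": "yyy", "fname": "batchfs", "seq": 361, "live": True},  # chmod
]
print(resolve(versions)["perm"])   # yyy: seq 361 > 245 wins

versions.append({"fname": "batchfs", "seq": 400, "live": False})    # rmdir
print(resolve(versions))           # None: the newest record is a tombstone
```

Since old versions are never destroyed, a snapshot is just "read as of seq N", which is what makes snapshotting trivial.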
LSM - Storage: the namespace is represented as an LSM-Tree and formatted into table files T1, T2, T3, T4 (e.g. 32MB each) on the Underlying Storage Infrastructure [Object Storage/Parallel File System].
LSM - Storage: pack metadata into large files; reuse the data path to deliver scalable metadata.
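A rough sketch of the "pack metadata into large files" step (the record format and function names here are assumptions for illustration): sorted k/v pairs are serialized into fixed-size immutable blobs, each of which becomes one file or object that can ride the PFS's scalable data path.

```python
import io
import struct

TABLE_SIZE = 32 << 20   # ~32MB per table file, matching the slide's example

def pack_tables(sorted_kv, table_size=TABLE_SIZE):
    # Serialize sorted (key, value) byte-string pairs into table blobs.
    tables, buf = [], io.BytesIO()
    for key, value in sorted_kv:
        # Simple length-prefixed record: 4-byte key len, 4-byte value len.
        rec = struct.pack(">II", len(key), len(value)) + key + value
        if buf.tell() > 0 and buf.tell() + len(rec) > table_size:
            tables.append(buf.getvalue())   # seal the current table file
            buf = io.BytesIO()
        buf.write(rec)
    if buf.tell() > 0:
        tables.append(buf.getvalue())
    return tables   # each blob would be written as one file/object (T1, T2, ...)
```

Because each blob is written once and never modified, it fits storage backends that favor large sequential writes over small random updates.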
Experiments
Experiments: each client process creates 1 private directory and inserts a set of empty files into that directory (CHECKPOINT WORKLOAD). Hadoop File System (HDFS) cluster: one Name Node [metadata node] plus eight Data Nodes; each node has two CPUs, 8GB RAM, one SATA HDD, and one 1Gb Ethernet port.
Experiments: the original Hadoop file system gives 600 op/s.
Experiment Settings: 1 million files inserted without bulk insertion. [Setup: HDFS Name Node plus 1 BatchFS Server; three nodes each running 1-8 BatchFS clients; three HDFS Data Nodes, one disk each]
Throughput (K op/s), HDFS Baseline vs. BatchFS: with 8/16/32/64 client processes, HDFS delivers 0.6 K op/s in every case, while BatchFS delivers 11/13/13/12 K op/s - roughly 20X in each case. Efficient Metadata Representation.
2 Bulk Insertion
Traditional Model: the Parallel Scientific Application calls mkdir(), create() on a Dedicated Metadata Server, which writes tree files T1, T2, T3, T4 to the Shared Underlying Storage Infrastructure (on-disk namespace storage).
Traditional Model: a synchronous interface with strong consistency.
Traditional Model: with 320K client processes, the Dedicated Metadata Server becomes a bottleneck. 1. A dedicated service doesn't work at exascale. 2. The traditional model is overkill for scientific applications.
Bulk Insertion (1): the Parallel Scientific Application performs mkdir(), create() via private servers, which write tree files T5, T6 (the client's metadata mutations) alongside the on-disk namespace storage T1, T2, T3, T4.
Bulk Insertion (2): bulk submit - the Dedicated Metadata Server finishes the operation simply by picking up all submitted tree files.
Bulk Insertion: similar to database pre-loading - data is inserted via a low-level protocol instead of SQL.
Bulk Insertion: 1. more efficient h/w utilization; 2. fewer calls to dedicated servers, hence more scalable metadata.
Concurrency Control: two clients may concurrently issue chmod("/proj") vs. chmod("/proj"), chmod("/proj") vs. rmdir("/proj"), mkdir("/proj") vs. mkdir("/proj"), or rename("/proj", "/a") vs. rename("/proj", "/b"). We need a total ordering of mutations from different clients.
Optimistic Locking: [Diagram: a BatchFS Client BOOTSTRAPs from a SNAPSHOT of the namespace (ROOT/proj/src, batchfs with fs.h, fs.c), writes checkpoint files (ck1) into its private branch, and SUBMITs them for CHECK/MERGE into the global tree]
Optimistic Locking: similar to source code control (github/svn), except there is no data copying (we do copy-by-ref).
Optimistic Locking - Fundamental Assumption: scientific applications rarely produce conflicts.
Phase 1: Branching - the client instantiates a private namespace from a global snapshot: snapshot(), then mkdir(), chmod(), ..., then bulk_insert(). [Global namespace tables T1-T5 form the global branch; the client's private branch accumulates new KV pairs]
Phase 2: Merging - the server picks up the client's metadata mutations and schedules a check on them; they are tentatively accepted, subject to future rejection (e.g. against a concurrent open() from Client2).
Phase 3: Verification - an SST Interpreter reads the metadata operation log and softly re-executes it against a view of the global namespace; since concurrent updates mostly don't produce conflicts, the client's metadata mutations COMMIT after conflict resolution.
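The three phases can be condensed into one hedged sketch of the verification check (field names and the conflict rule are assumptions for illustration): the server re-executes the client's log and rejects an operation only if some concurrent update touched the same key after the client's snapshot was taken.

```python
# Soft re-execution of a client's mutation log against the global namespace.
def verify(log, namespace, snapshot_seq):
    committed, conflicts = [], []
    for op in log:
        current = namespace.get(op["key"])
        if current is not None and current["seq"] > snapshot_seq:
            conflicts.append(op)   # concurrent update: needs resolution
        else:
            namespace[op["key"]] = {"value": op["value"], "seq": op["seq"]}
            committed.append(op)
    return committed, conflicts

namespace = {"/proj": {"value": "dir", "seq": 100}}
log = [{"key": "/proj/batchfs", "value": "dir", "seq": 245},
       {"key": "/proj", "value": "dir-chmod", "seq": 246}]
ok, bad = verify(log, namespace, snapshot_seq=200)
print(len(ok), len(bad))   # 2 0: nothing changed since the snapshot
```

Under the fundamental assumption that conflicts are rare, the conflicts list is almost always empty and the whole batch commits without any per-operation server round trip.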
Experiments
Previous Setting: 1 million files inserted without bulk insertion. [Setup: HDFS Name Node plus 1 BatchFS Server; three nodes each running 1-8 BatchFS clients; three HDFS Data Nodes, one disk each]
New Setting: 8 million files inserted with bulk insertion. [Same setup]
Throughput (K op/s), No vs. w/ Bulk Insertion: with 8/16/32/64 client processes, BatchFS without bulk insertion delivers 11/13/13/12 K op/s, and with bulk insertion 139/188/203/216 K op/s - an 8X/15X/15X/18X improvement. Bulk Insertion: 20X * 18X = 360X faster than HDFS (0.6 K op/s).
Agenda: 1. Metadata Representation, 2. Bulk Insertion - a Client-funded File System Metadata Architecture
Why is the FS slow? Inefficient metadata representation; at least one RPC per operation; a synchronous metadata interface; pessimistic concurrency control; a dedicated authorization service.
Client-funded HPC: an exascale PFS architecture that moves metadata computation from servers to apps - better h/w utilization, and the FS scales with the # of clients. Each app pre-executes metadata ops privately with per-batch synchronization; the Primary Metadata Server is not in the critical path.
Client-funded HPC: apps have long had rich h/w resources - now they can buy themselves scalable metadata.
Future Work
Implementation
Metadata Traces
Reference: Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14); Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)
QUESTIONS