Qing Zheng Lin Xiao, Kai Ren, Garth Gibson

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Qing Zheng Lin Xiao, Kai Ren, Garth Gibson"

Transcription

1 Qing Zheng Lin Xiao, Kai Ren, Garth Gibson

2 File System Architecture APP APP APP APP APP APP APP APP metadata operations Metadata Service I/O operations [flat namespace] Low-level Storage Infrastructure Parallel data path with decoupled metadata path Parallel Data Lab - PDL Group Meeting 2

3 Metadata = File System App Library Namespace Indexing Small File Storage Large File Block Indexing Metadata Representation Parallel Data Lab - PDL Group Meeting 3

4 Decoupled!= Scalable Single metadata server HDFS, Lustre 1.x Statically partitioned metadata servers PVFS, Federated HDFS, NFS v4.1 Many existing metadata service don t scale Parallel Data Lab - PDL Group Meeting 4

5 Our Goal is To Have Really Scalable Metadata Parallel Data Lab - PDL Group Meeting 5

6 Our Goal is To Have Really Scalable Metadata Outline 1. Pathname lookup important limitation on scalability 2. Client-side caching represented by IndexFS 3. Replicated state represented by ShardFS 4. Experimental results Parallel Data Lab - PDL Group Meeting 6

7 Path Resolution Hierarchical permission checking In order to resolve /a/b/c/, need to test /a, a/b, b/c, 1) Permissions to lookup names under an intermediate directory 2) The existence of the name 3) The name represents a directory A set of recursive tests starting from the root Parallel Data Lab - PDL Group Meeting 7

8 2 Categories of Ops Dir lookup state mutation ops Non-conflicting ops mkdir rename on dirs rmdir chmod on dirs chown on dirs mknod/create unlink rename on files chmod/chown on files utime/utimes Parent Dir Id Obj Name ObjId ObjSize ObjMode UserId GroupId Times Embedded File Data Other Metadata Parallel Data Lab - PDL Group Meeting 8

9 Naive Implementation 1 lookup RPC to server for each intermediate dir Other Clients /a /a/b/c/ Server 0 Client /a/d/ Client b/c a/b Server 2 Server 1 BOTTLENECKS 1. Repeated RPCs 2. Hot spot servers holding names at the top of the tree Client Side Parallel Data Lab - Server Side PDL Group Meeting 9

10 Outline 1. Pathname lookup important limitation on scalability 2. Client-side caching represented by IndexFS 3. Replicated state represented by ShardFS 4. Experimental results PDL Group Meeting 10

11 Design Choice #1 Lease and cache dir lookup states at clients Block mutation ops until all leases have expired Cache-Entry Expiration-time client-side cache table Cache-Id Max-expiration-time server-side cache table Fewer repeated RPCs & simple server states Outline 1. Pathname lookup important limitation on scalability 2. Client-side caching represented by IndexFS 3. Replicated state represented by ShardFS 4. Experimental results Parallel Data Lab - PDL Group Meeting 11

12 IndexFS Design Distributes namespace on a per-dir partition basic Path resolution conducted by clients with an consistent lease-based lookup cache b d / a c client cache /a a/c ROOT a e c b 1 e c/2 d 2 3 Parallel Data Lab - PDL Group Meeting 12

13 Outline 1. Pathname lookup important limitation on scalability 2. Client-side caching represented by IndexFS 3. Replicated state represented by ShardFS 4. Experimental results Parallel Data Lab - PDL Group Meeting 13

14 Design Choice #2 Replicates dir lookup states to all servers & broadcasts mutation ops to all servers mkdir Server 1 Server 2 Server 3 Server n Principally a better decision if #client >> #server Outline 1. Pathname lookup important limitation on scalability 2. Client-side caching represented by IndexFS 3. Replicated state represented by ShardFS 4. Experimental results Parallel Data Lab - PDL Group Meeting 14

15 ShardFS Design Distributes namespace on a per-file basic (sharding) All metadata servers can accept new files and perform path resolution / a client /a/b/1 ROOT 1 ROOT 2 a c b d c ROOT ROOT e /a/c/3 b d 3 e Parallel Data Lab - PDL Group Meeting 15

16 RPC Amplification * dir lookup state mutation op IndexFS ShardFS path resolution mknod unlink/getattr mkdir* rmdir*/readdir chmod/chown on file chmod*/chown* on dir utime on file/dir #path_lookups + 0 ~ #path_depth #partitions #metadata_servers #metadata_servers 1 #metadata_server 1 Parallel Data Lab - PDL Group Meeting 16

17 Micro Benchmark YCSB++ [SoCC11 ] Balanced tree 1280 files per leaf dir Zipfian tree files distributed unevenly, creating static hot spots Uniform/zipfian stat s random getattrs on files, creating dynamic hot spots Leaf dirs 111K dirs Root dir 5 levels fan out=10 128M files Parallel Data Lab - PDL Group Meeting 17

18 Throughput op/s Creation Phase Uniform Stat s Zipfian Stat s 20,000 IndexFS ShardFS IndexFS ShardFS IndexFS ShardFS 16,000 metadata footprint 14,654 15,502 12,000 10,829 dir splitting load variance hot spots on files 8,000 8,819 8,507 path resolution 6,529 6,356 4,000 4,728 3,071 3,635 2,022 2,105 0 Balanced Tree Zipfian Tree Balanced Tree Zipfian Tree Balanced Tree Zipfian Tree Parallel Data Lab - PDL Group Meeting 18

19 Macro Benchmark YCSB++ [SoCC11 ] Strong scaling fixed namespace and fixed fs ops Weak scaling fixed dir structure and fixed dir lookup state mutation ops chmod /a open /a/ strong scaling s s s s s s s s s s s s with scaling number of files and scaling number of other fs ops LinkedIn trace replay 1-day HDFS trace with 1.9M dirs & 11.4M files chmod /a open /a/1 Parallel Data Lab - PDL Group Meeting 19 a a chmod /a open /a/ weak scaling 1 α 2 β 2 γ s s s s s s s s s s s s a a chmod /a open /a/1 open /a/α

20 Throughput Kop/s Strong Scaling Weak Scaling IndexFS-1m IndexFS-250ms IndexFS-50ms ShardFS IndexFS-1m IndexFS-250ms IndexFS-50ms ShardFS number of metadata servers Parallel Data Lab - PDL Group Meeting 20

21 References IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion. Kai Ren, Qing Zheng, Swapnil Patil and Garth Gibson. SC 2014 TableFS: Enhancing metadata efficiency in local file systems. Kai Ren and Garth Gibson. USENIX ATC 2013 Scale and Concurrency in GIGA+: File System Directories with Millions of Files. Swapnil Patil and Garth Gibson. FAST 2009 YCSB++: Benchmarking and Performance Debugging Advanced Features in Scalable Table Stores. Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs, Billie Rinaldi. SoCC 2011 Parallel Data Lab - PDL Group Meeting 21

22 BACKUP SLIDES

23 Target Namespace 90% dirs are small (less than 128 entries) Large dirs are really huge 90% dirs are of depth 16 or more Median file size smaller then 64KB for many fs es Parallel Data Lab - PDL Group Meeting 23

24 Distribution of FS Ops 100% 80% 60% 40% 20% Mknod Open Chmod Rename Mkdir Readdir Remove 0% LinkedIn OpenCloud Yahoo! PROD Yahoo! R&D Parallel Data Lab - PDL Group Meeting 24

25 ShardFS/IndexFS Overview namespace indexing small file storage metadata operations Application Metadata Client (ShardFS/IndexFS Client) Metadata Server (ShardFS/IndexFS Server) block operations metadata storage data storage DFS Metadata Server DFS Data Server DFS internal operations large file block indexing ShardFS or IndexFS Underlying Distributed File System Parallel Data Lab - PDL Group Meeting 25

26 Metadata Representation Log-structured and indexed data structure mkdir TableFS [ATC13] IndexFS [SC14] [(k,v), (k,v),..., (k,v)] In-mem buffer ShardFS/IndexFS Servers SSTable 1 SSTable 2 SSTable 3 Key-Value Store Parallel Data Lab - PDL Group Meeting 26

27 Sorted Namespace Metadata = dir index + object attributes + file data for small files Parent Dir Id Obj Name ObjId ObjSize ObjMode UserId GroupId Times Embedded File Data Other Metadata Parent Dir Id Obj Name ObjId ObjSize ObjMode UserId GroupId Times Embedded File Data Other Metadata Parent Dir Id Obj Name ObjId ObjSize ObjMode UserId GroupId Times Embedded File Data Other Metadata Parent Dir Id Obj Name ObjId ObjSize ObjMode UserId GroupId Times Embedded File Data Other Metadata Key Value Parallel Data Lab - PDL Group Meeting 27