System Namespace
Reza Nasirigerdeh, Carlos Maltzahn, Jeff LeFevre, Noah Watkins, Peter Alvaro, Margaret Lawson*, Jay Lofstead*, Jim Pivarski^
UC Santa Cruz, *Sandia National Laboratories, ^Princeton University
Overview: Transfer/Materialization of FS Lists
Problem: RPC operations between clients and the FS metadata server in distributed file systems.
Decouple: clients update the namespace without RPCs to the FS metadata server, but then reads must be transferred & materialized.
Domains: (1) High Perf. Computing, (2) High Energy Physics, (3) Large Simulations.
Schemas: namespaces are LARGE but predictable (bounded/balanced).
Tintenfisch: the client sends F(x) to the FS metadata server, e.g. F(x) = 2 x n or for(i=0;i<len;i++){, for faster transfer, faster modification, faster generation.
CROSS, UCSC
Outline
Primer on file system namespaces
Names map to data (file.txt); hierarchical structure (subtree under /dir).
Global semantics: strong consistency, durability.
Hierarchical semantics: inherit parent's ownership.
Problem! POSIX IO file system metadata access semantics are difficult to scale.
Primer on FS metadata access patterns
Small and frequent requests; target same resource [Sevilla et al., SC'15] [Mesnier et al., IEEE Comm.].
Diagram labels: clients issue metadata IO (permissions, size, atime, etc.) and data IO; Single Node vs. Distributed File System; many metadata reads/writes vs. fewer metadata reads/writes.
As a result, metadata IO does not scale like data IO.
File system metadata access patterns
Small and frequent requests; target same resource (metadata cluster).
Global semantics: strong consistency, durability.
Hierarchical semantics: inherit parent's ownership.
Related techniques: lock management, relaxing consistency, caching inodes, journal formats, journal safety, caching paths, metadata distribution, load balancing.
Decoupled Namespaces [Zheng et al., PDSW'14; Zheng et al., PDSW'15]
Diagram labels: Metadata Server; RPCs; Traditional Client; DeltaFS client.
How does this apply to today's namespaces?
Example 1: PLFS Namespace
Middleware used in HPC for checkpoint-restart [Bent et al., SC'09].
Figure labels: repeat pattern twice; pattern; PLFS-specific metadata.
Example 1: PLFS Namespace
Middleware used in HPC for checkpoint-restart [Bent et al., SC'09].
Figure labels: repeat pattern twice; pattern; PLFS-specific metadata; metadata write; metadata read (RD).
The namespace scales with # of clients.
RPCs probably not an option; decoupled writes x reads (RD).
Summary of File s
1. High Perf. Computing: scales with # of clients.
2. High Energy Physics: ~ billions of files (physics metadata, data).
Summary of File s
1. High Perf. Computing: scales with # of clients.
2. High Energy Physics: ~ billions of files (physics metadata, data).
3. Large Scale Simulations: ~ trillions of objects; Client (obj0 ... objn), Metadata Service, SQLite.
Summary of File s
1. High Perf. Computing: scales with # of clients.
2. High Energy Physics: ~ billions of files (physics metadata, data).
3. Large Scale Simulations: ~ trillions of objects; Client (obj0 ... objn), Metadata Service, SQLite.
Decoupling definitely improves performance (figure: lower is better), but then reads must be transferred & materialized.
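The trade-off on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Cudele or DeltaFS API: `MetadataServer`, `DecoupledClient`, and their methods are hypothetical names, and an in-memory set stands in for the namespace. A traditional client pays one RPC per create; a decoupled client appends to a local journal and pays one bulk transfer, after which the server still has to materialize every entry.

```python
class MetadataServer:
    """Hypothetical metadata server; a set of paths stands in for the namespace."""

    def __init__(self):
        self.namespace = set()
        self.rpcs = 0  # number of client-to-server round trips

    def create(self, path):
        """Traditional path: one RPC per metadata update."""
        self.rpcs += 1
        self.namespace.add(path)

    def merge(self, journal):
        """Decoupled path: one bulk transfer, then every entry is materialized."""
        self.rpcs += 1
        materialized = 0
        for path in journal:
            self.namespace.add(path)
            materialized += 1
        return materialized


class DecoupledClient:
    """Hypothetical client that records updates locally instead of issuing RPCs."""

    def __init__(self):
        self.journal = []

    def create(self, path):
        # Local append only; the metadata server is not contacted.
        self.journal.append(path)

    def transfer(self, server):
        # Ship the whole journal at once; cost grows with the journal length.
        shipped = list(self.journal)
        self.journal.clear()
        return server.merge(shipped)
```

With 1,000 creates, the traditional path issues 1,000 RPCs, while the decoupled path issues one, but that single call still carries and materializes 1,000 entries. That remaining per-entry cost is what the slide refers to.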
Summary of File s
1. High Perf. Computing: scales with # of clients.
2. High Energy Physics: ~ billions of files (physics metadata, data).
3. Large Scale Simulations: ~ trillions of objects; Client (obj0 ... objn), Metadata Service, SQLite.
Schemas: namespaces are LARGE but predictable (bounded/balanced).
Integrate w/ decoupled namespaces [Zheng et al., PDSW'15]
Diagram labels: Metadata Server; RPCs; Traditional Client; Client.
"Decoupled Namespace" policy implemented in Cudele [Sevilla et al., IPDPS'18].
Tintenfisch builds on decoupled namespaces
Diagram labels: Metadata Server; RPCs; Traditional Client; Generator; Client.
"Decoupled Namespace" policy implemented in Cudele [Sevilla et al., IPDPS'18].
Namespace generators: compact metadata
1. Formula Generator (High Perf. Computing): Files(n) = 2 x n, Dirs(m) = m
2. Pointer Generator (High Energy Physics): * * *
3. Code Generator (Fusion Simulation): import bbox if(t>30){ obj=bbox.split(4)
Diagram label: Obj Store.
Benefits: faster transfer (metadata compaction), faster modification (change generator), faster generation (no more list/filter).
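As a rough illustration of the formula generator on this slide, the sketch below stores only the parameters of Files(n) = 2 x n and Dirs(m) = m and expands them into explicit paths on demand. Everything beyond those two formulas is an assumption made for the example: the class name `FormulaGenerator`, the path layout (`dir.<d>`, `data.<i>`, `index.<i>`), the round-robin placement of files into directories, and JSON as the wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator


@dataclass(frozen=True)
class FormulaGenerator:
    """Compact description of a predictable namespace (hypothetical layout)."""

    root: str  # directory the namespace hangs off
    n: int     # argument of Files(n) = 2 x n
    m: int     # argument of Dirs(m) = m

    def __post_init__(self):
        if self.n < 0 or self.m < 1:
            raise ValueError("need n >= 0 and m >= 1")

    def num_files(self) -> int:
        return 2 * self.n

    def num_dirs(self) -> int:
        return self.m

    def materialize(self) -> Iterator[str]:
        """Expand the formula into explicit paths on the reader's side."""
        for d in range(self.m):
            yield f"{self.root}/dir.{d}/"
        for i in range(self.n):
            d = i % self.m  # assumed round-robin placement
            yield f"{self.root}/dir.{d}/data.{i}"
            yield f"{self.root}/dir.{d}/index.{i}"


def compaction_ratio(gen: FormulaGenerator) -> float:
    """Bytes of the explicit path list divided by bytes of the generator."""
    explicit = len(json.dumps(list(gen.materialize())))
    compact = len(json.dumps(asdict(gen)))
    return explicit / compact
```

For example, `FormulaGenerator(root="/ckpt", n=1000, m=10)` serializes to a few dozen bytes yet expands to 2,010 paths, which is the sense in which a generator gives "faster transfer (metadata compaction)". Changing `n` or `m` changes the whole namespace without editing any list.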
Prior work with programmable storage
Scalable File Systems:
1. Subtree Load Balancing: Mantle [SC '15, CCGrid '18]
2. Subtree Semantics (Consistency/Durability): Cudele [IPDPS '18]
3. Subtree Schemas: Tintenfisch []
Malacology [EuroSys '17]
Transferring and materializing reads
Related work: decoupled namespace overheads with large namespaces:
1. High Performance Computing: scales w/ # of clients
2. High Energy Physics: ~ billions of files
3. Large Scale Simulations: ~ trillions of files
Schemas: namespaces are LARGE but predictable (bounded/balanced).
Tintenfisch: metadata compaction, e.g. F(x) = 2 x n, for(i=0;i<len;i++){, *; faster transfer, faster modification, faster generation.
Future Work
- proper sandboxing for security/correctness
- more complex file system metadata: permissions (workflows), size of the file (Hadoop), timestamps and dates (GC)
- storage-system-agnostic metadata generation
Thanks!
More information:
- programmability.us