Let's Make Parallel File System More Parallel

Let's Make Parallel File System More Parallel [LA-UR-15-25811]
Qing Zheng 1, Kai Ren 1, Garth Gibson 1, Bradley W. Settlemyer 2
1 Carnegie Mellon University, 2 Los Alamos National Laboratory

HPC defined by:
- Parallel scientific apps
- Low-latency network for message passing
- Tiered cluster deployments
- PFS for highly scalable storage I/O
[Diagram: App 1, App 2, App 3 on compute nodes (10,000+) accessing a Parallel File System (Lustre) on storage nodes (100+).]

Failure Handling
Nodes/network will fail:
- Apps use checkpoints to avoid complete re-execution
- Each proc dumps its memory to a file
[Same diagram: App 1-3 on compute nodes (10,000+), Parallel File System (Lustre) on storage nodes (100+).]

Failure Handling
When failure happens, an app is simply re-scheduled and resumes execution from its latest checkpoint.
[Same diagram as above.]

Checkpointing

    if (proc_id == 0) {
        mkdir("/proj/a/chk/001", 0755);          /* rank 0 creates the checkpoint directory */
    }
    sync();                                      /* flush so the new directory reaches the shared file system */
    int fd = open("/proj/a/chk/001/<proc_id>",   /* each proc then writes its own file */
                  O_CREAT | O_EXCL | O_WRONLY, 0644);
    write(fd, <..>);
    write(fd, <..>);
    close(fd);

640K open()/close() and N * 640K write() calls, assuming 20,000 nodes and 32 CPUs per node.
[Diagram: App 1-3 on compute nodes (10,000+) writing checkpoints to the Parallel File System (Lustre) on storage nodes (100+).]

Will existing PFS deliver sufficient perf? YES? NO?
[DATA] [METADATA]

Metadata
1. Namespace tree: hierarchical directory structure
2. File attributes: file name, file size, last modification time, ...
3. Data location: where to find file/directory data
[METADATA] operations: open(), close(), unlink(), mkdir(), rmdir(), rename(), getattr(), chmod(), readdir(), ...

Decoupled PFS
A parallel file system splits into a metadata service [a single machine, or a few, e.g. Lustre MDS] and a data service [a large collection of machines, e.g. Lustre OSS].
This allows data to scale without scaling metadata.

Isn't Metadata a Problem?
NO: the FS only stores large files.
NO: metadata is small in size.
NO: 90% of ops are I/O.

Isn't Metadata a Problem?
NO: the FS only stores large files?
Median file sizes are actually tiny/small:
- < 64KB in cloud computing data centers
- < 64MB in supercomputing environments (64MB is the default block size for the Google File System)

Isn't Metadata a Problem?
Bigger and bigger clusters mean more app processes, larger metadata size, and more metadata ops.
NO: metadata is small in size? NO: 90% of ops are I/O?

HPC is Growing Fast
Tomorrow we will have EXASCALE computing facilities and more intensive METADATA WORKLOADS.
Metadata eventually becomes a huge problem!

Will existing PFS deliver sufficient perf? NO!! [METADATA]

GOAL: PARALLEL DATA/METADATA


Middleware Design
[Diagram: Parallel Scientific Applications; metadata ops and data storage flow down to the Underlying Storage Infrastructure (Object Storage/Parallel File System), with metadata storage handled by a middleware layer.]

Middleware Design
[Diagram: each Client Proc of the Parallel Scientific Application sends metadata operations to a Private Server; Private Servers talk to a Primary Server over the fast interconnect; data/metadata storage lands on the Underlying Storage Infrastructure (Object Storage/Parallel File System).]

Middleware Design
[Same diagram as above.]
Enables metadata to be potentially served from compute nodes.

Agenda
1. Metadata Representation
2. Bulk Insertion
Client-funded File System Metadata Architecture

1. Metadata Representation

Block-based Metadata (UNIX model)
[Disk layout: superblock | data block map | inode map | inode blocks | data blocks]
inode (id=161): size=64, type=[file], time=2015-07-27
inode (id=157): size=4096, type=[directory], time=2015-07-27
directory entry list: [..] -> 132, [.] -> 157, zhengq -> 158, kair -> 159, garth -> 160, bws -> 161

Block-based Metadata
[Same layout and example as above.]
- File creates -> disk seeks
- Linear directory entry search cost
- Zero per-directory concurrency
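As a rough illustration of the cost argument above, here is a minimal C sketch (not any real file system's on-disk format; all names and field widths are illustrative) of an inode plus a flat directory entry list, where every lookup is a linear scan and every create must rewrite the list and allocate inode blocks:

    /* A minimal sketch of the block-based metadata model: a fixed inode plus
     * a flat directory-entry list that must be scanned linearly. */
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    struct inode {               /* one per file or directory */
        uint32_t id;
        uint32_t size;
        uint8_t  type;           /* file or directory */
        time_t   mtime;
    };

    struct dirent_disk {         /* one entry in a directory's entry list */
        char     name[255];
        uint32_t inode_id;
    };

    /* Lookup = linear scan of the entry list: O(n) per path component, and
     * every create rewrites this list and seeks to an inode block on disk. */
    static int dir_lookup(const struct dirent_disk *entries, int n,
                          const char *name)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(entries[i].name, name) == 0)
                return (int)entries[i].inode_id;
        return -1;
    }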

Table-based Metadata (ordered KV pairs)
[Example namespace: ROOT (id=0) contains proj (id=1) and src (id=2); proj contains batchfs (id=5); src contains fs.h and fs.c.]
key 0,h(proj)    value id=1, type=dir, fname=proj
key 0,h(src)     value id=2, type=dir, fname=src
key 1,h(batchfs) value id=5, type=dir, fname=batchfs
key 2,h(fs.h)    value id=3, type=file, fname=fs.h
key 2,h(fs.c)    value id=4, type=file, fname=fs.c
readdir(ROOT) and readdir(/src) become range scans over this table.
KEY = parent_id + hash(fname), VALUE = an embedded inode + fname

Table-based Metadata (ordered KV pairs)
[Same table as above.]
A large distributed sorted directory entry table with embedded inodes.
KEY = parent_id + hash(fname), VALUE = an embedded inode + fname
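The key/value layout can be sketched as follows. This is a hedged illustration assuming a 64-bit parent id and an FNV-1a string hash; the actual hash function and record layout used by the real system may differ:

    /* A minimal sketch of KEY = parent_id + hash(fname),
     * VALUE = embedded inode + fname. Field layout is illustrative only. */
    #include <stdint.h>
    #include <string.h>

    struct md_key {
        uint64_t parent_id;      /* id of the parent directory */
        uint64_t name_hash;      /* hash(fname) */
    };

    struct md_value {
        uint64_t id;             /* this file/directory's own id */
        uint8_t  type;           /* 0 = file, 1 = dir */
        uint32_t perm;
        uint64_t size;
        uint64_t mtime;
        char     fname[256];     /* full name kept to resolve hash collisions */
    };

    static uint64_t hash_name(const char *s)   /* FNV-1a, one common choice */
    {
        uint64_t h = 14695981039346656037ULL;
        while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
        return h;
    }

    /* Keys sort by (parent_id, name_hash), so all entries of one directory
     * are contiguous: readdir("/src") is a range scan over parent_id = 2. */
    static struct md_key make_key(uint64_t parent_id, const char *fname)
    {
        struct md_key k = { parent_id, hash_name(fname) };
        return k;
    }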

Table Representation: Log-structured Merge Trees [LSM]
A collection of B-trees at different levels (Level-0, Level-1, Level-2). Level-0 is an in-memory B-tree that always sits in memory; file/directory creates are inserted there as K/V pairs and later merged down level by level.

Table Representation: Log-structured Merge Trees [LSM]
On a file/directory create, when level-0 is FULL it is merged into level-1.

Table Representation: Log-structured Merge Trees [LSM]
When level-1 is FULL, a part of level-1 is merged into level-2.

Table Representation: Log-structured Merge Trees [LSM]
LSM is optimized for K/V insertion: it converts random disk I/O into sequential I/O and avoids disk seeks.
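A minimal sketch of the LSM write path described above, assuming a simplified fixed-size record and an illustrative level-1 file name (the real design keeps a B-tree per level and merges more carefully, but the point is the same: creates fill memory and reach disk only as large sequential writes):

    /* Inserts go to an in-memory level-0 table; when it fills up, the whole
     * run is flushed to the next level as one sequential append. */
    #include <stdio.h>
    #include <stdint.h>

    #define L0_CAPACITY 4096                  /* records buffered in memory */

    struct kv {
        uint64_t key;                         /* e.g. (parent_id, hash(fname)) packed */
        char     value[64];                   /* e.g. an embedded inode */
    };

    struct level0 {
        struct kv records[L0_CAPACITY];       /* kept sorted (a B-tree in the real design) */
        int       n;
    };

    /* Flush: one large sequential write to the level-1 file, no seeks. */
    static void flush_to_level1(struct level0 *l0)
    {
        FILE *f = fopen("level1.sst", "ab");
        if (!f) return;
        fwrite(l0->records, sizeof(struct kv), (size_t)l0->n, f);
        fclose(f);
        l0->n = 0;
    }

    /* Insert: creates never touch the disk directly; they only fill level-0. */
    static void lsm_insert(struct level0 *l0, struct kv rec)
    {
        if (l0->n == L0_CAPACITY)
            flush_to_level1(l0);
        l0->records[l0->n++] = rec;           /* a real memtable inserts in sorted order */
    }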

LSM - Updates
chmod("/proj/batchfs", ...) does no write in place: it inserts a new record for key 1,h(batchfs) with perm=yyy and a higher sequence number (seq=361 > 245), shadowing the old record (perm=xxx, seq=245).
Convert K/V updates into K/V insertion operations.

LSM - Deletions
rmdir("/proj/batchfs", ...) does no explicit deletion: it inserts a new record for key 1,h(batchfs) with live=false and a higher sequence number (seq=361 > 245), shadowing the live record (live=true, seq=245).
Convert K/V deletions into K/V insertion operations.

LSM - Deletions
[Same example as above.]
1. Immutable data structures
2. Snapshotting a file system image is trivial
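The update/delete-as-insert idea can be sketched as below. The perm/live/seq fields and the 245/361 sequence numbers come from the example above; everything else is an illustrative layout, not the system's actual record format:

    /* Each mutation is a new record with a higher sequence number; deletions
     * set a tombstone (live = false). Reads keep the highest-seq version. */
    #include <stdbool.h>
    #include <stdint.h>

    struct md_record {
        uint64_t parent_id;
        uint64_t name_hash;
        uint32_t perm;
        bool     live;       /* false = tombstone left by unlink()/rmdir() */
        uint64_t seq;        /* monotonically increasing mutation number */
    };

    /* chmod: insert a new version rather than writing the old record in place. */
    static struct md_record record_chmod(struct md_record old, uint32_t new_perm,
                                         uint64_t new_seq)
    {
        old.perm = new_perm;
        old.seq  = new_seq;  /* e.g. 361, shadowing the older seq 245 */
        return old;
    }

    /* rmdir/unlink: insert a tombstone; the old record is never touched. */
    static struct md_record record_delete(struct md_record old, uint64_t new_seq)
    {
        old.live = false;
        old.seq  = new_seq;
        return old;
    }

    /* Read side: of two versions for the same key, the higher seq wins. */
    static struct md_record resolve(struct md_record a, struct md_record b)
    {
        return (a.seq > b.seq) ? a : b;
    }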

LSM - Storage
The namespace is represented as K/V pairs and formatted by the LSM-Tree into tables T1, T2, T3, T4 (32MB each) on the Underlying Storage Infrastructure [Object Storage/Parallel File System].

LSM - Storage
[Same diagram as above, e.g. 32MB per table.]
Pack metadata into large files. Reuse the data path to deliver scalable metadata.

Experiments

Experiments
Checkpoint workload: each client process creates 1 private directory and inserts a set of empty files into that directory.
Hadoop File System (HDFS) cluster: one Name Node [metadata node] and eight Data Nodes. Each node has two CPUs, 8GB RAM, one SATA HDD, and one 1Gb Ethernet port.

Experiments
[Same workload and cluster as above.]
The original Hadoop file system gives 600 op/s.

Experiment Settings
1 million files inserted, without bulk insertion.
[Cluster: the HDFS Name Node runs 1 BatchFS server; three HDFS Data Nodes, each with its own disk, run 1-8 BatchFS clients each.]

Throughput (K op/s): HDFS Baseline vs. BatchFS
- 8 client processes: HDFS 0.6, BatchFS 11 (~20X)
- 16 client processes: HDFS 0.6, BatchFS 13 (~20X)
- 32 client processes: HDFS 0.6, BatchFS 13 (~20X)
- 64 client processes: HDFS 0.6, BatchFS 12 (~20X)
Efficient metadata representation.

2. Bulk Insertion

Traditional Model
[Diagram: the Parallel Scientific Application sends mkdir()/create() to a Dedicated Metadata Server, which writes tree files (T1-T4, the on-disk namespace storage) to the Shared Underlying Storage Infrastructure.]

Traditional Model
[Same diagram as above.]
Synchronous interface, strongly consistent.

Traditional Model
With 320K client processes, the dedicated metadata server becomes the bottleneck.
1. A dedicated service doesn't work at exascale.

Traditional Model
With 320K client processes, the dedicated metadata server becomes the bottleneck.
1. A dedicated service doesn't work at exascale.
2. The traditional model is overkill for scientific applications.

Bulk Insertion
(1) mkdir()/create() go through private servers: the Parallel Scientific Application writes its own tree files (T5, T6) holding the client's metadata mutations, while the Dedicated Metadata Server continues to write tree files (T1-T4) to the on-disk namespace storage.

Bulk Insertion
(2) Bulk submit: the Dedicated Metadata Server finishes the execution simply by picking up all submitted tree files (the client's metadata mutations, T5 and T6) into the on-disk namespace storage.

Bulk Insertion
[Same diagram as above.]
Similar to database pre-loading: data is inserted via a low-level protocol instead of SQL.

Bulk Insertion
1. More efficient h/w utilization.
2. Fewer calls to dedicated servers: more scalable metadata.
Similar to database pre-loading: data is inserted via a low-level protocol instead of SQL.
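A hedged sketch of the client/server split, with invented names (client_log_mutation, server_bulk_submit) and an illustrative tree-file record layout; it only means to show that the dedicated server's cost becomes per batch rather than per operation:

    /* Client side batches mutations into a private tree file; the server
     * later adopts the whole file in one call instead of one RPC per op. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    struct mutation {
        char     op[8];          /* "mkdir" or "create" */
        char     path[256];
        uint64_t seq;            /* ordering within this client's batch */
    };

    /* Client side: append a mutation instead of calling the server synchronously. */
    static void client_log_mutation(FILE *tree_file, const char *op,
                                    const char *path, uint64_t seq)
    {
        struct mutation m;
        memset(&m, 0, sizeof(m));
        snprintf(m.op, sizeof(m.op), "%s", op);
        snprintf(m.path, sizeof(m.path), "%s", path);
        m.seq = seq;
        fwrite(&m, sizeof(m), 1, tree_file);
    }

    /* Server side: one bulk submit picks up the whole file of mutations. */
    static int server_bulk_submit(const char *tree_file_path)
    {
        FILE *f = fopen(tree_file_path, "rb");
        if (!f) return -1;
        struct mutation m;
        int n = 0;
        while (fread(&m, sizeof(m), 1, f) == 1)
            n++;                 /* a real server would merge m into the namespace */
        fclose(f);
        return n;                /* number of mutations adopted */
    }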

Concurrency Control
Examples of concurrent mutations from different clients:
- client1: chmod("/proj", ...)    client2: chmod("/proj", ...)
- client1: chmod("/proj", ...)    client2: rmdir("/proj", ...)
- client1: mkdir("/proj", ...)    client2: mkdir("/proj", ...)
- client1: rename("/proj", "/a")  client2: rename("/proj", "/b")
Requires a total ordering of mutations from different clients.

Optimistic Locking
[Diagram: the BatchFS client BOOTSTRAPs from a SNAPSHOT of the global namespace (ROOT, proj, src, batchfs, fs.h, fs.c), writes its checkpoint (ck1) in its private copy of the tree, SUBMITs the result, and the server CHECKs and MERGEs it back into the global namespace.]

Optimistic Locking
[Same diagram as above.]
Similar to source code control (GitHub/SVN), except there is no data copying (we do copy-by-reference).

Optimistic Locking
Fundamental assumption: scientific applications rarely produce conflicts.
Similar to source code control (GitHub/SVN), except there is no data copying (we do copy-by-reference).

Phase 1: Branching
The client instantiates a private namespace from a global snapshot.
[Diagram: the global namespace (global branch, tables T1-T5) is snapshotted; the client runs snapshot(), mkdir(), chmod(), ..., bulk_insert(), and its KV pairs accumulate in the client's private branch.]
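A minimal sketch of the copy-by-reference branching above, with illustrative types: the client's private branch is just a set of references to the immutable global tables it snapshotted, plus one new private table that receives all of its own mutations. No table data is copied.

    #include <stddef.h>

    struct table;                         /* an immutable on-disk LSM table */

    struct branch {
        const struct table *base[16];     /* references to global tables T1..Tn at snapshot time */
        size_t              nbase;
        struct table       *priv;         /* the client's own mutations (later bulk-submitted) */
    };

    /* snapshot(): record references to the current global tables;
     * O(#tables) bookkeeping, no data copying. */
    static struct branch snapshot(const struct table *const *global, size_t n,
                                  struct table *private_table)
    {
        struct branch b = { {0}, 0, private_table };
        for (size_t i = 0; i < n && i < 16; i++)
            b.base[b.nbase++] = global[i];
        return b;
    }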

Phase 2: Merging
The server picks up and schedules a check on the client's metadata mutations.
[Diagram: the client's private branch is merged back; the mutations are tentatively accepted, subject to future rejection, and another client (Client2) may already open() them from the global branch.]

Phase 3: Verification
[Diagram: an SST Interpreter reads the metadata operation log (a log view of the client's mutations, T5-T8) and softly re-executes it against the global namespace (T1-T7); concurrent updates mostly don't produce conflicts, so the mutations are COMMITted, with conflict resolution for the rest.]
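A minimal sketch of the verification loop, with invented helper names (still_applies, commit, resolve_conflict are stubs standing in for checks against the merged global tables), meant only to show that most operations commit cleanly and only conflicting ones take a slower path:

    #include <stdbool.h>
    #include <stddef.h>

    struct op { int kind; const char *path; };       /* one logged metadata operation */

    /* Stubs; a real implementation consults the merged global tables. */
    static bool still_applies(const struct op *o) { (void)o; return true; }
    static void commit(const struct op *o)        { (void)o; }
    static void resolve_conflict(const struct op *o) { (void)o; }

    /* Replay the client's log: most operations commit without conflict,
     * matching the assumption that scientific apps rarely conflict. */
    static size_t verify_and_commit(const struct op *log, size_t n)
    {
        size_t conflicts = 0;
        for (size_t i = 0; i < n; i++) {
            if (still_applies(&log[i]))
                commit(&log[i]);
            else {
                resolve_conflict(&log[i]);
                conflicts++;
            }
        }
        return conflicts;
    }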

Experiments

Previous Setting
1 million files inserted, without bulk insertion.
[Cluster: the HDFS Name Node runs 1 BatchFS server; three HDFS Data Nodes, each with its own disk, run 1-8 BatchFS clients each.]

New Setting
8 million files inserted, with bulk insertion.
[Same cluster layout as the previous setting.]

Throughput (K op/s): No vs. w/ Bulk Insertion
- 8 client processes: HDFS 0.6, no bulk 11, with bulk 139 (8X)
- 16 client processes: HDFS 0.6, no bulk 13, with bulk 188 (15X)
- 32 client processes: HDFS 0.6, no bulk 13, with bulk 203 (15X)
- 64 client processes: HDFS 0.6, no bulk 12, with bulk 216 (18X)
Bulk insertion: 20X * 18X = 360X faster than HDFS.

Agenda
1. Metadata Representation
2. Bulk Insertion
Client-funded File System Metadata Architecture

Why Is the FS Slow?
- Inefficient metadata representation
- At least one RPC per operation
- Synchronous metadata interface
- Pessimistic concurrency control
- Dedicated authorization service

Client-funded HPC Exascale PFS Architecture
Move metadata computation from servers to apps: better h/w utilization; the FS scales with the number of clients; each app pre-executes metadata ops privately, with per-batch synchronization.
[Diagram: App 1-3 on compute nodes; the Primary Metadata Server sits out of the critical path, above the Underlying Storage.]

Client-funded HPC Exascale PFS Architecture
[Same diagram as above.]
Apps have long had rich h/w resources; now they can buy themselves scalable metadata.

Future Work

Implementation

Metadata Traces

References
- Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW '14)
- Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC '14)

QUESTIONS