Dynamic Metadata Management for Petabyte-scale File Systems

Dynamic Metadata Management for Petabyte-scale File Systems
Sage Weil, Kristal T. Pollack, Scott A. Brandt, Ethan L. Miller
UC Santa Cruz
November 1, 2006
Presented by Jae Geuk Kim

System Overview
- Petabytes of storage (10^15 bytes, millions of gigabytes)
- Billions of files ranging from bytes to terabytes
- 10,000 to 100,000 clients
- File I/O decoupled from metadata and namespace operations
- ~1,000 Object-based Storage Devices (OSDs)
- ~10 Metadata Servers (MDSs)

Metadata Cluster
- The MDS cluster is responsible for
  - Managing the file system namespace (directory structure)
  - Facilitating client access to file data
- Hierarchical directory structure (POSIX semantics)
- Traditionally two types of metadata
  - Inodes: size, mtime, mode, where is the data?
  - Directory entries: file name, inode pointer
- The MDS cluster's collective cache is used for load balancing
- Goal #1: Mask I/O to underlying disk storage by leveraging the MDS cluster's collective cache

Metadata Workload
- As many as half of all file system operations access metadata
  - Open, close
  - Readdir, many stats (ls -F or -al)
- Large systems must address a variety of usage patterns
  - General purpose computing
  - Scientific, cluster computing
- Must efficiently handle hotspot development
  - Directories near the root of the hierarchy are necessarily popular
  - Popular files: many opens of the same file (e.g. /lib/*)
  - Popular directories: many creates in the same directory (e.g. /tmp)
- Goal #2: The MDS cluster should efficiently handle varied and changing file systems and usage patterns

MDS Cluster Design Issues
- Consistency: updates must be consistent within the MDS cluster; clients should have a consistent view of the file system
- Leveraging locality of reference: hierarchies are our friend
- Workload partitioning: balance MDS loads, maximize efficiency and throughput
- Traffic management: hot spots, flash mobs
- Metadata storage: fast commits and efficient reads

Simplifying Consistency
- When metadata is replicated, a simple locking and coherence strategy is best
- A single MDS node acts as the authority for any particular metadata object ("primary copy" replication)
  - Serializes updates
  - Commits updates to stable storage (e.g. disk, NVRAM)
  - Manages the cooperative cache
- The MDS cluster is collectively a large, consistent, and coherent cache
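
As a rough illustration of the primary-copy idea (a minimal Python sketch with invented class and method names, not the paper's actual protocol): one MDS owns each metadata object, serializes its updates, commits them, and pushes read-only copies to peers.

```python
class MetadataObject:
    def __init__(self, ino, attrs):
        self.ino = ino
        self.attrs = attrs            # e.g. {"size": 0, "mtime": 0, "mode": 0o644}
        self.version = 0

class MDSNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.primary = {}             # ino -> objects this node is authoritative for
        self.replicas = {}            # ino -> read-only cached copies

    def update(self, ino, **changes):
        """Only the primary may mutate; a KeyError here models 'not the authority'."""
        obj = self.primary[ino]
        obj.attrs.update(changes)     # serialized: one writer per object
        obj.version += 1
        self.commit_to_stable_storage(obj)
        return obj

    def commit_to_stable_storage(self, obj):
        pass                          # placeholder for the disk/NVRAM commit path

    def push_replica(self, obj, other):
        """Primary hands an up-to-date read-only copy to another MDS."""
        copy = MetadataObject(obj.ino, dict(obj.attrs))
        copy.version = obj.version
        other.replicas[obj.ino] = copy
```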

Leveraging Locality of Reference
- A hierarchical storage paradigm leads to high locality of reference within directories and subtrees
- Store inodes with directory entries
  - Streamlines typical usage patterns (lookup + open, or readdir + stats)
  - Allows prefetching of directory contents
- Inodes are no longer globally addressable (by inode number)
  - Avoids the management issues associated with a single large, distributed, and sparsely populated table
  - Hard links are easy to handle with a small auxiliary inode table for the rare doubly-linked inode
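
A minimal sketch, with hypothetical names, of what embedding inodes in directory entries looks like: a readdir-plus-stats pass never needs a second lookup, and the rare hard link falls back to a small auxiliary inode table.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Inode:
    ino: int
    size: int = 0
    mtime: float = 0.0
    mode: int = 0o644

@dataclass
class Dentry:
    name: str
    inode: Optional[Inode] = None     # embedded inode (the common case)
    remote_ino: Optional[int] = None  # hard link: index into the auxiliary table

@dataclass
class Directory:
    entries: dict = field(default_factory=dict)   # name -> Dentry

aux_inode_table = {}                  # holds only the rare multiply-linked inodes

def stat(directory: Directory, name: str) -> Inode:
    d = directory.entries[name]
    if d.inode is not None:
        return d.inode                # attributes came along with the dentry
    return aux_inode_table[d.remote_ino]

d = Directory()
d.entries["readme.txt"] = Dentry("readme.txt", Inode(ino=1, size=512))
print(stat(d, "readme.txt").size)     # 512, with no separate inode-table lookup
```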

Partitioning Approaches
- How to delegate authority?
- Manual subtree partitioning (coarse distribution)
  - Preserves locality
  - Leads to an imbalanced distribution as the file system and workload change
- Hash-based partitioning (finer distribution)
  - Provides a good probabilistic distribution of metadata for all workloads
  - Ignores hierarchical structure
  - Less vulnerable to hot spots
  - Destroys locality
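
A minimal Python sketch (illustrative names and a made-up delegation map, not the paper's mechanism) contrasting the two placements: hashing spreads every path independently, while subtree partitioning resolves a path to the deepest delegated prefix so whole branches stay together.

```python
import hashlib

NUM_MDS = 10

def mds_by_hash(path: str) -> int:
    """Hash placement: fine-grained and balanced, but /usr and /usr/bin may
    land on different servers, so a path traversal touches many nodes."""
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDS

SUBTREE_MAP = {"/": 0, "/usr": 3, "/home": 7}     # assumed static delegation

def mds_by_subtree(path: str) -> int:
    """Subtree placement: walk up the hierarchy to the deepest delegated prefix."""
    while path not in SUBTREE_MAP:
        path = path.rsplit("/", 1)[0] or "/"
    return SUBTREE_MAP[path]

# /usr/bin/ls and /usr/lib/libc.so stay together under subtree partitioning
# (both map to MDS 3) but are likely scattered under hashing.
print(mds_by_subtree("/usr/bin/ls"), mds_by_subtree("/usr/lib/libc.so"))
print(mds_by_hash("/usr/bin/ls"), mds_by_hash("/usr/lib/libc.so"))
```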

Dynamic Subtree Partitioning
- Distribute subtrees of the directory hierarchy
  - Somewhat coarse distribution of variably-sized subtrees
  - Preserves locality within entire branches of the directory hierarchy
- Must intelligently manage the distribution based on workload demands
  - Keep the MDS cluster load balanced
  - Requires active repartitioning as the workload and file system change, instead of relying on a (fixed) probabilistic distribution
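
One way to picture the active repartitioning (a deliberately simplified sketch, with an invented per-subtree popularity counter rather than the paper's actual load metric): periodically move the hottest subtree from the most loaded MDS to the least loaded one.

```python
subtree_load = {            # subtree -> (owning MDS, recent request count)
    "/home/alice": (1, 9000),
    "/home/bob":   (1, 400),
    "/usr":        (2, 300),
}

def mds_loads():
    """Sum each MDS's load over the subtrees it currently owns."""
    loads = {}
    for _, (mds, count) in subtree_load.items():
        loads[mds] = loads.get(mds, 0) + count
    return loads

def rebalance_once():
    loads = mds_loads()
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    # pick the hottest subtree on the busiest server and hand it over
    victim = max((s for s, (m, _) in subtree_load.items() if m == busiest),
                 key=lambda s: subtree_load[s][1])
    subtree_load[victim] = (idlest, subtree_load[victim][1])

rebalance_once()   # /home/alice is redelegated from MDS 1 to MDS 2
```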

A Sample Partition
- The system dynamically and intelligently redelegates arbitrary subtrees based on usage patterns
- A coarser, subtree-based partition means higher efficiency
  - Fewer prefixes need to be replicated for path traversal
- The granularity of distribution can range from large subtrees to individual directories

Distributing Directory Contents
- If a directory is large or busy, its contents can be selectively hashed across the cluster
- Directory entry/inode distribution is based on a hash of the parent directory id and file name
- Whether a directory is hashed is dynamically determined
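
A minimal sketch of that placement rule (the hash function and modulo placement are illustrative assumptions, not Ceph's actual scheme): the key combines the parent directory's id with the entry name, so the entries of one hot directory fan out across the cluster while other directories stay whole on their authoritative MDS.

```python
import hashlib

NUM_MDS = 10

def mds_for_dentry(parent_dir_ino: int, name: str) -> int:
    """Place one (dentry, inode) pair of a hashed directory on some MDS."""
    key = f"{parent_dir_ino}/{name}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_MDS

# Many creates in one hot directory (here, inode 42) spread over the cluster:
placements = {mds_for_dentry(42, f"file{i}") for i in range(1000)}
print(len(placements))   # almost surely 10: every MDS holds a share
```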

Flash Mobs and Hot Spots
- There is a tradeoff between fast access and load balancing
  - If clients know where to locate any metadata, they can access it directly, but at any time they may decide to simultaneously access any single item without warning
  - If clients always contact a random MDS or proxy, load can always be balanced, but all (or most) transactions require extra network hops
- Flash mobs are common in scientific computing and even general purpose workloads
  - Many opens of the same file
  - Many creates in one directory
- An MDS with a popular item will not operate as efficiently under load, or may crash
- Fine-grained distribution can help, but is an incomplete solution

Traffic Control
- Ideally
  - Unpopular items: clients directly contact the authoritative MDS
  - Popular items: replicated across the cluster
- Dynamic distribution can exploit client ignorance!
  - The MDS tracks the popularity of directories and files
  - Normally, the MDS tells clients where the items they request live (who the authority is)
  - If an item becomes popular, future clients are told the data is replicated
  - The number of clients thinking that any particular item is in any single place is bounded at all times
- Clients choose which MDS to contact based on their best guess
  - Direct queries based on the deepest known prefix
  - Coarse subtree partitioning makes this a good guess
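
A minimal sketch of the client's side of this (class and field names are hypothetical): the client remembers which MDS claimed authority for each prefix it has seen, routes a request to the deepest known prefix's authority, and picks a random MDS only for items it was told are replicated.

```python
import random

NUM_MDS = 10

class ClientView:
    def __init__(self):
        self.known_authority = {"/": 0}     # prefix -> MDS that claimed authority
        self.replicated = set()             # prefixes reported as popular

    def learn(self, prefix, mds, popular=False):
        """Record what an MDS told us in a previous reply."""
        self.known_authority[prefix] = mds
        if popular:
            self.replicated.add(prefix)

    def choose_mds(self, path):
        # walk up to the deepest prefix we have heard about
        p = path
        while p not in self.known_authority:
            p = p.rsplit("/", 1)[0] or "/"
        if p in self.replicated:
            return random.randrange(NUM_MDS)   # spread load for hot items
        return self.known_authority[p]

client = ClientView()
client.learn("/home", 4)
client.learn("/bin/ls", 2, popular=True)
print(client.choose_mds("/home/bar/foo.txt"))  # 4: deepest known prefix is /home
print(client.choose_mds("/bin/ls"))            # random MDS: item is replicated
```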

Traffic Control In Action
- Suppose clients 1 through 1000 are booted in sequence and immediately begin using portions of the file system. After some time, their caches might look like this:
  - Access to /*, /bin/*, /home/* is distributed across the cluster
  - Access to /home/$user/* goes directly to the authoritative MDS
- Suppose all clients suddenly try to open /bin/ls:
  - Client 1 contacts MDS 1; everybody else contacts a random MDS
- All clients open /home/bar/foo.txt:
  - Client 100 knows where /home/bar is, but
  - Other clients are only familiar with /home, and thus direct their query at an (effectively) random MDS

Tiered Metadata Storage
- Short-term storage
  - Immediate commits require high write bandwidth
  - Log structure for each MDS, with size similar to the MDS's cache
  - Items expired from the cache or falling off the end of the log are written to long-term storage
  - Absorbs short-lived metadata
- Long-term storage
  - Reads should dominate
  - Directory contents grouped together (directory entries and inodes)
  - Likely a collection of objects, or a log structure
- OSDs as the underlying storage mechanism
  - Random, shared access facilitates MDS failover and workload redistribution
  - Objects are suitable for storing logs and directory contents, but OSDs may need to be tuned for a small-object workload
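
A much-simplified, in-memory sketch of the two tiers (the class, its capacity parameter, and the tombstone convention are assumptions for illustration): updates are appended to a bounded journal, entries falling off its tail are flushed to long-term storage, and short-lived metadata that is superseded or deleted while still in the journal never reaches the long-term tier.

```python
from collections import deque

class TieredMetadataStore:
    def __init__(self, journal_capacity=4):
        self.journal = deque()        # short-term log of (path, attrs-or-None)
        self.capacity = journal_capacity
        self.long_term = {}           # path -> attrs (grouped by directory in practice)

    def commit(self, path, attrs):
        """Fast path: append-only write, cheap to make durable."""
        self.journal.append((path, attrs))
        while len(self.journal) > self.capacity:
            old_path, old_attrs = self.journal.popleft()
            self._expire(old_path, old_attrs)

    def delete(self, path):
        self.commit(path, None)       # tombstone for short-lived metadata

    def _expire(self, path, attrs):
        # skip the long-term write if a newer journal entry supersedes this one
        if any(p == path for p, _ in self.journal):
            return
        if attrs is None:
            self.long_term.pop(path, None)
        else:
            self.long_term[path] = attrs

store = TieredMetadataStore(journal_capacity=2)
store.commit("/tmp/a", {"size": 1})
store.delete("/tmp/a")                # short-lived: absorbed by the journal
store.commit("/home/x", {"size": 2})
store.commit("/home/y", {"size": 3})
# /tmp/a never reaches long-term storage; /home/x and /home/y eventually will
```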

Evaluation

| | Static subtree partitioning | Hashing | Lazy Hybrid | Dynamic subtree partitioning |
| Description | Static partitioning | Directory or file hashing | File hashing with dual-entry ACL | Dynamic partitioning |
| Directory locality | O | Directory hashing: O / File hashing: X | X | O (directory entries embedding inodes) |
| Directory hot spots | X | Load balancing | Load balancing | Load balancing (like directory hashing) |
| File hot spots (flash mobs) | X | X | X | Cooperative caching |
| ACL access overhead (needs parent inode) | X | O (parent replication caching overhead) | X (dual-entry ACL: metadata + full-path ACL; update cost from logging & lazy update) | X |
| Caching in MDS | Some prefixes | Directory hashing: some inodes / File hashing: parent replicated inodes | Parent replicated inodes | Some prefixes |

Preliminary Results
- Simulation used to evaluate partitioning and load balancing strategies
- Static workload (user home directories)
- Compares the relative ability of the system to scale
  - Identical MDS nodes (memory); scale the number of disks, clients, and file system size
  - Collective cache is always 20% of the file system metadata

Subtree Partitioning is Efficient
- Subtree partitioning incurs a lower overhead from prefix metadata
- More memory available for caching other metadata

Dynamic Workloads - 1
- Dynamic partitioning and workload evolution

Dynamic Workloads - 2
- Clients select a random MDS; requests are forwarded to the authoritative MDS
- Without traffic control: every request is forwarded to the authoritative MDS, yielding slow replies
- With traffic control: requests are forwarded and the item is replicated, yielding fast replies

Conclusion
- Static partitioning: more efficient in a static workload
- Dynamic partitioning: more efficient in a dynamic workload
- Directory entries embedding inodes
- Load balancing on hot spots: directory hashing
- Cooperative caching for load balancing: client selects a random MDS

Why is Metadata Hard?
- File data storage is trivially parallelizable
  - File I/O occurs independently of other files/objects
  - Scalability of the OSD array is limited only by the network architecture
- Metadata semantics are more complex
  - The hierarchical directory structure defines object interdependency
  - Metadata location and POSIX permissions depend on parent directories
- Consistency and state
  - The MDS must manage open files, client capabilities for data access, and locking
- Heavy workload
  - Metadata is small, but there are lots of objects and lots of transactions
- Good metadata performance is critical to overall system performance

Client Consistency
- Two basic strategies
  - Stateless (NFS): cached metadata times out at the client after some period
    - Easy, but weak (no client cache coherence)
    - More traffic as clients refresh previously cached metadata
  - Stateful (Sprite, Coda): the MDS cluster tracks which clients cache what, and invalidates as necessary
    - Strong consistency and coherence
    - Higher memory demands on the MDS cluster
- Choice of strategy may be site specific
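
A minimal sketch contrasting the two strategies (hypothetical classes, not NFS's or Sprite/Coda's real protocols): a stateless client simply expires cached metadata after a TTL, while a stateful MDS remembers which clients cache each item and invalidates them on update.

```python
import time

# Stateless (NFS-style): client-side cache entry with a timeout
class StatelessClientCache:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.cache = {}                       # path -> (attrs, fetch_time)

    def get(self, path, fetch_from_mds):
        entry = self.cache.get(path)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                   # possibly stale, but cheap
        attrs = fetch_from_mds(path)          # refresh generates extra traffic
        self.cache[path] = (attrs, time.time())
        return attrs

# Stateful (Sprite/Coda-style): server tracks cachers and invalidates them
class StatefulMDS:
    def __init__(self):
        self.cachers = {}                     # path -> set of client ids

    def register(self, path, client_id):
        self.cachers.setdefault(path, set()).add(client_id)

    def update(self, path, notify_client):
        # strong coherence, at the cost of per-client state on the MDS
        for client_id in self.cachers.pop(path, set()):
            notify_client(client_id, path)    # tell the client to drop its copy
```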