Changing Requirements for Distributed File Systems in Cloud Storage

Wesley Leggette, Cleversafe

Presentation Agenda
- About Cleversafe
- Scalability, our core driver
- Object storage as the basis for filesystem technology
- Namespace-based routing
- Distributed transactions
- Optimistic concurrency
- Designing an ultra-scalable filesystem
- Filesystem operations on the object layer
- Conclusions

About Cleversafe
- We offer scalable storage solutions
  - Target market is massive storage (>10 PiB)
- Information Dispersal Algorithms (erasure codes)
  - Reduce cost by avoiding replication overhead
  - Maximize reliability by tolerating many failures
- Object storage is the core product offering
- How do we translate this technology to the filesystem space?
  - Evolution from object storage concepts
  - Also influenced by distributed databases and P2P
  - The techniques we investigate are not unique to IDA

How Dispersed Storage Works
1. Digital assets are divided into slices using Information Dispersal Algorithms (IDA); total slices = width = N
2. Slices are distributed to separate disks, storage nodes, and geographic locations
3. A threshold number of slices are retrieved and used to regenerate the original content
(Slide diagram: example slices of the digital content dispersed across Sites 1-4.)
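
To make the width/threshold idea concrete, here is a minimal sketch, assuming a trivial XOR-parity code (width = threshold + 1, so only one slice may be lost) in place of a real Information Dispersal Algorithm; the function names are illustrative, not Cleversafe's API.

```python
# Minimal sketch of the width/threshold idea using XOR parity
# (width = threshold + 1). A real IDA (e.g. Reed-Solomon) tolerates
# width - threshold simultaneous losses; this toy code tolerates one.

def slice_data(data: bytes, threshold: int) -> list:
    """Split data into `threshold` data slices plus one XOR parity slice."""
    chunk = (len(data) + threshold - 1) // threshold
    slices = [data[i * chunk:(i + 1) * chunk].ljust(chunk, b"\0")
              for i in range(threshold)]
    parity = bytes(chunk)
    for s in slices:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return slices + [parity]                      # width = threshold + 1 slices

def rebuild(slices: list, missing: int, threshold: int, length: int) -> bytes:
    """Regenerate the original content with any one slice missing."""
    known = [s for i, s in enumerate(slices) if i != missing]
    recovered = bytes(len(known[0]))
    for s in known:
        recovered = bytes(a ^ b for a, b in zip(recovered, s))
    data_slices = list(slices[:threshold])
    if missing < threshold:
        data_slices[missing] = recovered          # XOR of the survivors
    return b"".join(data_slices)[:length]

content = b"digital asset stored on the dsnet"
slices = slice_data(content, threshold=2)         # 3 slices; any 2 recover the data
slices[1] = None                                  # lose one slice
assert rebuild(slices, missing=1, threshold=2, length=len(content)) == content
```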

Access Methods
We sell two deployment models; both are "clients" in the context of this presentation.
- Simple Object, HTTP: an Accesser appliance exposes an HTTP REST API and returns a unique 36-character Object ID, which the application server stores as metadata in its Object ID database. Multiple Accessers can be load balanced for increased throughput and availability; the Accesser speaks the dsnet protocol to the object vault.
- Simple Object, Client Library: the Accesser function is embedded into the client (a Java client library). Accesser functionality, including slicing and dispersal, is contained within the client library, which speaks the dsnet protocol to the object vault directly; the application server again stores the Object ID as metadata.

Scalability: A Primary Requirement
- Big Data customers are at petabyte to exabyte scale
- Scale-out architecture
  - Add storage capacity with commodity machines
  - Reduce costs: commodity hard drives
- Invariants
  - Reliability: keep data even as cheap disks fail
  - Availability: access data during node failures
  - Performance: linear performance growth

Scale Example: Shutterfly
- 10 PB Cleversafe dsnet storage system, all commodity hard drives
- Single storage container for all photos
- Tens of thousands of large photos stored per minute; max capacity is many times this level
- 14 access nodes for load-balanced read/write
  - No single point of failure
  - Linear performance growth with each new node
- This uses the object storage product

Investigating the Filesystem Space
- We have scalable object storage
  - Limitless capacity and performance growth
  - Fully concurrent read/write
- Some customers want the same with a filesystem
  - Is this technically possible?
  - What tradeoffs would have to be made?

Scale comes from homogeneity
(Slide diagram: clients talking directly to storage nodes are labeled "Scalable"; clients funneled through a central metadata node are labeled "Not Scalable".)
- To scale out, we need to do so at each layer
  - Eliminate the central chokepoint for data operations
  - Central point of failure, central point of ...
- We accomplish this today with object storage
- Consider the same concept in a filesystem

What approach can we take?
- Start with scalable transactional object storage (IDA + distributed transactions, namespace-based storage routing, session management / multi-path)
- Add a filesystem implementation on top
(Slide diagram: layer stack of Filesystem over Object over Reliability over Namespace over Remote Session.)
- Object: check-and-write transactions
- Reliability: ensures committed objects are reliable and consistent
- Namespace: routes actual data storage; no central I/O manager

Namespace

Traditional Centralized Routing
(Slide diagram: a routing master in front of the storage nodes handles 10,000 req/s while each storage node serves 640 req/s, capping the system at a maximum of 15 servers.)
- A central controller directs traffic
  - Easier to implement, allows simple search
  - Detect conflicts, control locking
- Does not scale out with the rest of the architecture
  - Today, a 10 PB system needs 90 45-disk nodes (3 TB drives, some IDA overhead)
  - Those nodes can service 57,600 2 MB req/s (10 Gbps NICs, nodes saturate wire speed)
- Central point of failure = less availability
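
A quick back-of-the-envelope check of the figures above, as a sketch; the 2 MB request size, 3 TB drives, and 10 Gbps NICs come from the slide's own footnotes, and the slide appears to round the per-node rate up to 640 req/s.

```python
# Back-of-the-envelope check of the slide's capacity and throughput figures.
nodes, drives_per_node, drive_tb = 90, 45, 3
raw_pb = nodes * drives_per_node * drive_tb / 1000       # 12.15 PB raw
# Roughly 10 PB usable once IDA overhead (the redundant slices) is subtracted.

# Each node has a 10 Gbps NIC and saturates wire speed on 2 MB requests.
per_node_req_s = 10e9 / 8 / 2e6                          # 625; the slide uses 640
cluster_req_s = nodes * 640                              # 57,600 req/s

# A single routing master that tops out at 10,000 req/s caps the cluster:
max_nodes_behind_master = 10_000 // 640                  # 15 servers

print(raw_pb, per_node_req_s, cluster_req_s, max_nodes_behind_master)
```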

Namespace-based Routing
(Slide diagram: 4-wide and 8-wide vaults drawn as rings, with namespace indexes mapped onto storage nodes A-H.)
- The namespace concept comes from P2P systems
  - Chord, CAN, Kademlia
  - MongoDB, CouchDB are production examples
- The physical mapping is determined by a storage map
  - Small data (<10 KiB) loaded at start-up
- P2P systems use a dynamic overlay protocol
  - We'll have tens of thousands of nodes, not millions

Storing Data in a Namespace
- No central lookup for data I/O (see the sketch below):
  1. Generate an object id
  2. Map it to storage (object id → source name → slice names → storage map)
- With object storage, the object id goes into the application's database
- How do we map a file name to an object id?
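
A minimal sketch of that routing step, assuming a small static storage map as described above; `slice_name`, `route`, and the map layout are illustrative, not the dsnet protocol's actual identifiers. Every client can compute the same placement locally, so no central I/O manager is consulted.

```python
import hashlib
import uuid

# Illustrative static storage map for a 4-wide vault: namespace index -> node.
# In the real system this map is small (<10 KiB) and loaded at start-up.
storage_map = {0: "node-A", 1: "node-B", 2: "node-C", 3: "node-D"}
width = len(storage_map)

def new_object_id() -> str:
    """Clients generate object ids locally; there is no central allocator."""
    return str(uuid.uuid4())                       # a 36-character object id

def slice_name(object_id: str, pillar: int) -> str:
    """Deterministic per-pillar slice name derived from the object id."""
    return f"{object_id}/{pillar}"

def route(object_id: str) -> dict:
    """Map each slice of an object to a distinct storage node via the namespace."""
    start = int(hashlib.sha256(object_id.encode()).hexdigest(), 16) % width
    return {slice_name(object_id, p): storage_map[(start + p) % width]
            for p in range(width)}

oid = new_object_id()
print(route(oid))     # every client computes the same placement; no lookups needed
```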

Reliability

Replication and Eventual Consistency
- Eventual consistency is often used with replication
  - The client writes new versions to the available nodes
  - Versions sync to the other replicas lazily
- The application is responsible for consistency
  - Already true in filesystems
- Allows partition-tolerant systems
(Slide diagram: Client A writes copies 1 and 2 now; copies 3 and 4 are repaired later, and in the meantime Client B's read sees the old version.)
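
A toy illustration of the stale-read window in the diagram above, assuming full replicas rather than dispersed slices; all names are made up for the sketch.

```python
# Toy model of eventually consistent replication: a write lands only on the
# replicas that are reachable, and the rest are repaired lazily.
replicas = [{"rev": 1, "data": "old"} for _ in range(4)]

def write(new_rev, data, reachable):
    for i in reachable:                            # only the available nodes see it now
        replicas[i] = {"rev": new_rev, "data": data}

def read(i):
    return replicas[i]                             # a single-replica read may be stale

def repair():
    newest = max(replicas, key=lambda r: r["rev"])
    for i in range(len(replicas)):                 # lazy anti-entropy pass
        replicas[i] = dict(newest)

write(2, "new", reachable=[0, 1])                  # Client A's write lands on copies 1-2
print(read(3))                                     # Client B still reads the old version
repair()                                           # versions sync to the other replicas
print(read(3))                                     # now the new version is visible
```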

Dispersal Requires Consistency
- Dispersal doesn't store replicas: a threshold of slices is required to recover the data
- A crash during an unsafe period can cause loss
(Slide diagram: width 4, threshold 3; the write timeline passes through safe and UNSAFE windows.)
- Methods to prevent loss
  - Three-phase distributed transaction
    - Commit: all revisions stay visible during the unsafe period
    - Finalize: clean up once the new version's commit is safe
  - Quorum-based voting: writes fail if fewer than T slices succeed

Three-Phase Commit Protocol
(Slide diagram, width 4, threshold 3: under a 2-phase protocol (1 WRITE, 2 COMMIT), a commit failure on too many slices causes loss; under the 3-phase protocol (1 WRITE, 2 COMMIT, 3 FINALIZE/UNDO), the old revision is only cleaned up once the new one is committed on a safe quorum.)
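
A minimal sketch of the write/commit/finalize flow under the quorum rule just described, using in-memory stand-ins rather than the dsnet wire protocol; the class and method names are illustrative. The key property is that the old revision is only cleaned up in the finalize phase, after the new revision has been committed on at least a threshold of slice stores.

```python
# Minimal sketch of a threshold-quorum, three-phase write:
# phase 1 WRITEs the new revision alongside any old one, phase 2 COMMITs it on
# each store, and only if a threshold of commits succeed does phase 3 FINALIZE
# (drop older revisions); otherwise UNDO removes the staged write.

WIDTH, THRESHOLD = 4, 3

class SliceStore:
    def __init__(self):
        self.revisions = {}             # revision number -> slice bytes
        self.committed = set()

    def write(self, rev, data):         # phase 1: staged next to older revisions
        self.revisions[rev] = data

    def commit(self, rev):              # phase 2: may fail on a crashed node
        self.committed.add(rev)

    def finalize(self, rev):            # phase 3a: safe to drop older revisions
        self.revisions = {rev: self.revisions[rev]}
        self.committed &= {rev}

    def undo(self, rev):                # phase 3b: roll back the staged write;
        self.revisions.pop(rev, None)   # earlier revisions are untouched
        self.committed.discard(rev)

def three_phase_write(stores, rev, slices):
    for store, s in zip(stores, slices):
        store.write(rev, s)
    acks = 0
    for store in stores:
        try:
            store.commit(rev)
            acks += 1
        except ConnectionError:
            pass                        # node unreachable during commit
    if acks >= THRESHOLD:               # quorum reached: the new revision is safe
        for store in stores:
            store.finalize(rev)
        return True
    for store in stores:                # quorum missed: keep the old revision
        store.undo(rev)
    return False

stores = [SliceStore() for _ in range(WIDTH)]
print(three_phase_write(stores, rev=1, slices=[b"a", b"b", b"c", b"d"]))
```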

Consistent Transactional Interface
- The distributed transaction makes dispersal safe
  - It all happens in the client; no server coordination
- Write consistency
  - A side effect of distributed transactions
  - Writes either succeed or fail atomically
- Limitation: consistency = less partition tolerance
  - CAP theorem (we also choose availability)
  - Either the read or the write fails during a partition
  - Still shardable: affects availability, not scalability
- Is consistency useful for filesystem directories?

Object

Write-if-absent for WORM
- Object storage is WORM (write once, read many)
  - Enforced by the underlying storage
- The write-if-absent model is built on transactions
  - Distributed transactions emulate atomicity
  - A checked write fails if a previous revision exists
(Slide diagram: Client A's "write if previous = none" succeeds; Client B's identical checked write then fails.)

Optimistic Concurrency Control
(Slide diagram: Client A's "write if previous = 1" succeeds; Client B's concurrent "write if previous = 1" fails, so it reads the new revision, redoes its action, and retries with "write if previous = 2" to produce V3.)
- It is easy to extend this model to multiple revisions
  - A write succeeds iff the last revision matches the one given
  - This is the basis for optimistic concurrency (see the sketch below)
- How do concurrent writers update a directory?
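
A minimal in-memory sketch of the checked write described above (illustrative names, not the product API): the store accepts a write only if the caller's expected previous revision, `None` for write-if-absent, matches what is actually stored.

```python
import threading

class CheckedObjectStore:
    """In-memory stand-in for a transactional object store with checked writes."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the distributed transaction
        self._objects = {}              # object id -> (revision, data)

    def read(self, oid):
        return self._objects.get(oid)   # (revision, data) or None

    def checked_write(self, oid, data, expected_rev):
        """Write iff the current revision matches expected_rev (None = write-if-absent)."""
        with self._lock:
            current = self._objects.get(oid)
            current_rev = current[0] if current else None
            if current_rev != expected_rev:
                return False            # conflict: someone else wrote first
            new_rev = 1 if current is None else current_rev + 1
            self._objects[oid] = (new_rev, data)
            return True

store = CheckedObjectStore()
assert store.checked_write("obj-1", b"v1", expected_rev=None)       # write-if-absent
assert not store.checked_write("obj-1", b"v1'", expected_rev=None)  # second writer fails
assert store.checked_write("obj-1", b"v2", expected_rev=1)          # revision matched
```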

Filesystem

Ultra-Scalable Filesystem Technology
- A filesystem layer on top of object storage
  - Scalable, no-master storage
  - Inherits reliability, security, and performance
- How do we map a file name to an object id?
- Is consistency useful for filesystem directories?
- How do concurrent writers update a directory?

Object-based directory tree
How do we map a file name to an object id?
- Directories are stored as objects
  - The filesystem structure is as reliable as the data
- Directory content data is a map of file name to object ids
  - Each object id points to another object on the system
    - An id for the content data
    - An id for the metadata (xattr, etc.)
- Data objects are WORM
  - Zero-copy snapshot support
  - Reference counting
- A well-known object id for the root directory
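
A minimal sketch of the directory layout this describes; the field names and JSON encoding are hypothetical, since the slides do not specify the on-disk format. A directory is itself an object whose content data maps each file name to the object ids of that file's content and metadata.

```python
import json
import uuid

ROOT_DIR_ID = "00000000-0000-0000-0000-000000000000"   # hypothetical well-known root id

def new_id() -> str:
    return str(uuid.uuid4())

# A directory is just another object: its content data is a map from file name
# to the ids of the file's (WORM) content object and its metadata object.
def empty_directory() -> dict:
    return {"entries": {}}

def add_entry(directory: dict, name: str, content_id: str, metadata_id: str) -> dict:
    updated = {"entries": dict(directory["entries"])}
    updated["entries"][name] = {"content": content_id, "metadata": metadata_id}
    return updated

def encode(directory: dict) -> bytes:
    return json.dumps(directory).encode()     # hypothetical encoding for the sketch

root = empty_directory()
photo_content, photo_meta = new_id(), new_id()
root = add_entry(root, "photo.jpg", photo_content, photo_meta)
print(encode(root))
```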

Directory Internal Consistency
Is consistency useful for filesystem directories?
- The object layer allows atomic directory updates
  - This mimics the model used by traditional filesystems
- Content data is stored in separate immutable storage
  - Safe snapshot support
- With eventual consistency the effects are only temporary
  - Writes: orphaned data
  - Deletes: read errors
- Is consistency an absolute requirement? No.

Concurrency Requires Serialization
How do concurrent writers update a directory?
- Updates to directory entries are atomic (by definition)
  - More precisely, filesystem operations are serialized
  - Client A adds a file, Client B adds a file, Client C deletes a file
  - First to call wins; the application must impose a sane order
- Kernels use mutexes (locks) for serialization
  - A master controller (pNFS, GoogleFS) does this
  - We want a multiple-master / no-master model
- Distributed locking protocols exist (e.g., Paxos)
  - It's hard: the protocols are complex and have drawbacks
  - It's slow: overhead for every operation

Optimistic Concurrency
- We want to serialize without locking
- Observation: file writes have two steps
  - Write the data (long, no contention)*
  - Modify the directory (short, serialized)**
- Use checked writes for the directory
  - Always read the directory before writing
  - Write the new revision if-not-modified-since
  - On a write conflict: re-read, replay, repeat
* Consider a workload where files are > 1 MiB; we write the content data into WORM storage.
** Because directories are stored as objects themselves, modifying a directory is re-writing an object.

Lockless Directory Update
- Optimistic concurrency guarantees serialization (a retry-loop sketch follows)
  - The operation is simple ("add file"), so replay is trivial
  - On conflict, the replay semantics are clear
  - Content data (large) is not rewritten on conflict
  - Highly parallelizable
- Potentially unbounded contention latency
  - A back-off protocol can help
  - Not good for high-contention directory use cases
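
A minimal, self-contained sketch of the read/replay/retry loop from the last two slides, assuming a checked-write object store like the earlier sketch; all names are illustrative. The large content object is written once; only the small directory object is retried on conflict, with exponential back-off.

```python
import json
import random
import time
import uuid

# Toy checked-write store: object id -> (revision, bytes). A write is accepted
# only if the expected previous revision matches (None = write-if-absent).
objects = {}

def checked_write(oid, data, expected_rev):
    current_rev = objects.get(oid, (None, None))[0]
    if current_rev != expected_rev:
        return False
    objects[oid] = ((current_rev or 0) + 1, data)
    return True

def read(oid):
    return objects.get(oid, (None, json.dumps({"entries": {}}).encode()))

def create_file(directory_id, name, content):
    # Step 1: write the (large) content data once, into WORM storage.
    content_id = str(uuid.uuid4())
    assert checked_write(content_id, content, expected_rev=None)

    # Step 2: update the (small) directory object with optimistic concurrency.
    # (Metadata object ids are omitted for brevity.)
    delay = 0.01
    while True:
        rev, raw = read(directory_id)                 # always read before writing
        entries = json.loads(raw)["entries"]
        entries[name] = content_id                    # replay the simple operation
        new_raw = json.dumps({"entries": entries}).encode()
        if checked_write(directory_id, new_raw, expected_rev=rev):
            return content_id                         # serialized without locks
        time.sleep(random.uniform(0, delay))          # back off, then re-read and retry
        delay = min(delay * 2, 1.0)

root = "root-directory"
create_file(root, "a.jpg", b"...")
create_file(root, "b.jpg", b"...")
print(json.loads(read(root)[1]))
```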

Conclusions
- Advantages
- Limitations
- Final Thoughts

Advantages
- Scalability and performance
  - Content data I/O is quick and contention-free
  - No-master concurrent read and write
  - Linearly scalable performance
- Availability
  - Load balancing without complicated HA setups
- Reliability
  - Information dispersal
  - Data and metadata have the same reliability
  - No separate backup required for an index server

Limitations
- Optimistic concurrency is sensitive to high contention
- Cache requirements limit directory size
  - No intrinsic limit, but a 100 MiB directory object?
- Having no central master makes explicit file locking hard
  - The SMB and NFS protocols support such locks
- Not suitable for random-write workloads
- Not suitable for majority-small-file workloads
  - Directory write times eclipse file write times
- Requires a separate index service for search

Final Thoughts
- Significant advances come from the P2P and NoSQL space
- Three key techniques allow for an ultra-scalable FS
  - Namespace-based routing
  - Distributed transactions using quorum / 3-phase commit
  - Optimistic concurrency using checked writes
- The techniques are usable with IDA or replicated systems
- The filesystem would not be general purpose
  - The techniques have some trade-offs
  - Excellent for specific big data use cases

Questions?