Staggeringly Large Filesystems

Staggeringly Large Filesystems Evan Danaher CS 6410 - October 27, 2009

Outline 1 Large Filesystems 2 GFS 3 Pond

What is Large? Internet scale, Web 2.0. GFS: thousands of machines, hundreds of active jobs, 4-petabyte filesystems, 40 GB/s read/write load; future (Spanner): millions of machines, exabytes of storage [Jeff Dean, LADIS 2009]. Pond: 1000 machines, designed to scale to millions.

Different Approaches Local vs. wide area. Communication model: P2P, tiered, or centralized. Trusted vs. untrusted nodes.

Outline 1 Large Filesystems 2 GFS 3 Pond

Motivation Component failures are common, so durability and fast startup matter. Files are huge. Writes are appends, not random writes. Reads are large and streaming, or small and random.

Design Very pragmatic: fast startup after failures, co-designed with applications, favors bandwidth over latency, trusted nodes, centralized.

Key Points Chunkservers store data Centralized master controls everything Relaxed consistency model

Interface Create, Delete, Open, Close, Read, Write. Snapshot: cheap copy at a point in time. Record append: safe concurrent appends by multiple applications.
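
A minimal sketch of how such a client interface might look, with hypothetical Python signatures for illustration (GFS's real client library is internal and not public):

```python
from typing import Protocol

class FileHandle:
    """Opaque handle returned by open()."""

class GFSClient(Protocol):
    """Hypothetical client-side view of the operations listed above."""
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> FileHandle: ...
    def close(self, handle: FileHandle) -> None: ...
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...
    def snapshot(self, src: str, dst: str) -> None:
        """Cheap point-in-time copy."""
    def record_append(self, handle: FileHandle, record: bytes) -> int:
        """Append atomically at least once; returns the offset GFS chose."""
```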

Chunkservers Store fixed-size 64 MB chunks, indexed by chunk ID and checksummed. The fixed size simplifies the design. Chunks are stored as files on the local filesystem and replicated (3 times by default). Chunkservers are authoritative on which chunks they store.
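
Because chunks are a fixed size, a client can turn a byte offset into a chunk index with pure arithmetic before asking the master for that chunk's handle and locations; a minimal sketch (function names are mine):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks

def chunk_index(file_offset: int) -> int:
    """Which chunk of a file a byte offset falls in."""
    return file_offset // CHUNK_SIZE

def chunk_offset(file_offset: int) -> int:
    """Where that byte sits within its chunk."""
    return file_offset % CHUNK_SIZE

# Byte 200,000,000 of a file lives in chunk index 2.
assert chunk_index(200_000_000) == 2
assert chunk_offset(200_000_000) == 200_000_000 - 2 * CHUNK_SIZE
```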

Master A single master holds all metadata in memory: fast and simple, and global decisions are easy. Less than 64 bytes of metadata per 64 MB chunk. Polls chunkservers for their chunk locations at startup. Authoritative on the file-to-chunk mapping.
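
The "less than 64 bytes per 64 MB chunk" figure is what makes an all-in-memory master plausible. A back-of-the-envelope check against the 4-petabyte figure quoted earlier:

```python
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MiB
BYTES_PER_CHUNK_METADATA = 64          # upper bound quoted above

data = 4 * 1024**5                     # a 4 PiB filesystem
chunks = data // CHUNK_SIZE
metadata = chunks * BYTES_PER_CHUNK_METADATA

print(f"{chunks:,} chunks -> ~{metadata / 1024**3:.0f} GiB of master metadata")
# 67,108,864 chunks -> ~4 GiB: fits comfortably in one machine's RAM
```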

Master Durability The master is a single point of failure, so all operations are written to a replicated operation log, which is checkpointed for quick recovery. Read-only shadow master replicas can serve reads.
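
A toy sketch of the checkpoint-plus-log-replay idea (everything here is an in-memory stand-in; the real master writes its log to local disk and to remote replicas before acknowledging a mutation):

```python
import json

class ToyMaster:
    """Illustrative only: metadata survives a crash via checkpoint + log replay."""

    def __init__(self):
        self.namespace = {}     # path -> list of chunk ids (the in-memory state)
        self.oplog = []         # replicated operation log (stand-in: a list)
        self.checkpoint = None  # (snapshot of namespace, log position)

    def apply(self, op):
        kind, path, chunk = op
        if kind == "create":
            self.namespace[path] = []
        elif kind == "add_chunk":
            self.namespace[path].append(chunk)

    def mutate(self, op):
        self.oplog.append(op)   # log (and replicate) first, then apply
        self.apply(op)

    def take_checkpoint(self):
        self.checkpoint = (json.loads(json.dumps(self.namespace)), len(self.oplog))

    def recover(self):
        """Rebuild state: load the last checkpoint, replay the log tail."""
        state, pos = self.checkpoint if self.checkpoint else ({}, 0)
        self.namespace = state
        for op in self.oplog[pos:]:
            self.apply(op)

m = ToyMaster()
m.mutate(("create", "/a", None))
m.take_checkpoint()
m.mutate(("add_chunk", "/a", "chunk-1"))
m.namespace = {}                # simulate losing the in-memory state
m.recover()
assert m.namespace == {"/a": ["chunk-1"]}
```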

Garbage Collection On deletion, the file is renamed to a hidden name; hidden files are removed after 3 days. The master periodically scans for and erases orphaned chunks: each chunkserver HeartBeat reports a sample of its stored chunks, and the master's reply identifies the ones it no longer knows about, which the chunkserver may then delete. Simple cleanup for all cases.
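
The orphan detection itself is essentially a set difference between what a chunkserver reports and what the master still references; a sketch (names are illustrative):

```python
def orphan_chunks(reported_by_chunkserver: set[str],
                  referenced_by_master: set[str]) -> set[str]:
    """Chunks the chunkserver holds that no file references any more.
    The master returns these in its HeartBeat reply; the chunkserver
    is then free to delete them at its leisure."""
    return reported_by_chunkserver - referenced_by_master

# A HeartBeat carries only a sample of the chunkserver's chunks each round,
# so orphans are found gradually rather than all at once.
reported = {"c1", "c2", "c9"}          # sample reported in this HeartBeat
referenced = {"c1", "c2", "c3", "c4"}  # chunks still reachable from some file
assert orphan_chunks(reported, referenced) == {"c9"}
```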

Consistency Model Relaxed consistency model: namespace mutations are atomic; the story for data is less clear. Consistent: all clients see the same data. Defined: consistent, and the mutation appears in its entirety. Concurrent writes: consistent but not defined. Concurrent record appends: defined regions interspersed with inconsistent ones. Record append has weak guarantees: the record is appended atomically at least once, but there may also be junk that applications should ignore.

Locking Files are identified by full paths; there is no per-directory data structure. To lock a file: read locks on all ancestor directories, and a read or write lock on the requested file(s). Locks are acquired in a consistent total order, by depth and then lexicographically, which prevents deadlock.
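
A sketch of the lock set and acquisition order for a single operation, assuming the flat path-to-lock table described above (the helper below is mine, not from the paper):

```python
def locks_for(path: str, write: bool):
    """Read locks on every ancestor path, read or write lock on the path itself,
    acquired in one consistent order (by depth, then lexicographically)
    so that concurrent operations cannot deadlock."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    wanted = [(p, "read") for p in ancestors] + [(path, "write" if write else "read")]
    return sorted(wanted, key=lambda pl: (pl[0].count("/"), pl[0]))

# Creating /home/user/file needs read locks on /home and /home/user
# and a write lock on /home/user/file, in that order.
print(locks_for("/home/user/file", write=True))
# [('/home', 'read'), ('/home/user', 'read'), ('/home/user/file', 'write')]
```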

Mutation Control One replica, the primary, gets a chunk lease. The primary picks the ordering of mutations. Times out after 1 minute, usually renewed. Can be revoked (e.g., for snapshots)

Lifecycle of a Mutation The client asks the master which replica holds the chunk lease; a lease is granted if none exists.

Lifecycle of a Mutation The master replies with the identities of the primary and the secondary replicas.

Lifecycle of a Mutation Client pushes data to all replicas Linear forwarding pipeline

Lifecycle of a Mutation Client sends write request to primary

Lifecycle of a Mutation Primary forwards write request to secondaries

Lifecycle of a Mutation Secondaries report completion to primary

Lifecycle of a Mutation Primary reports completion to client
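
Putting the steps above together: the client learns the primary and secondaries from the master, pushes data to all replicas, then asks the primary to commit; the primary picks the order, forwards, collects acknowledgements, and replies. A toy in-process sketch of the control-flow half, with all class and method names my own:

```python
class Replica:
    """Toy chunk replica: staged data is applied only when told, in serial order."""
    def __init__(self, name):
        self.name, self.staged, self.chunk = name, {}, []

    def push(self, data_id, data):            # data flow: pushed along a pipeline
        self.staged[data_id] = data

    def apply(self, data_id, serial):         # control flow: apply in primary's order
        self.chunk.append((serial, self.staged.pop(data_id)))

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries, self.next_serial = secondaries, 0

    def write(self, data_id):
        """Assign a serial number, apply locally, forward to the secondaries,
        then (once all have acknowledged) report completion to the client."""
        serial, self.next_serial = self.next_serial, self.next_serial + 1
        self.apply(data_id, serial)
        for s in self.secondaries:
            s.apply(data_id, serial)          # stand-in for the RPC + acknowledgement
        return serial

secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
for r in (primary, *secondaries):
    r.push("d1", b"record bytes")             # client pushes data to every replica
primary.write("d1")                           # client sends the write to the primary
assert all(r.chunk == [(0, b"record bytes")] for r in (primary, *secondaries))
```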

Record Append Guarantees at least one atomic append. Uses the basic mutation control flow. A failure at any replica leads to a retry, which may leave inconsistent data or duplicates that the application filters out. A record can't span chunks: if it doesn't fit, the current chunk is padded and a new chunk is started. Records are limited to 1/4 of the chunk size to limit fragmentation.
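
The primary-side decision, "append here, or pad and move to the next chunk", follows directly from the limits above; a sketch (names and return values are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4   # cap record size to limit fragmentation

def record_append(chunk_used: int, record_len: int):
    """Return (action, offset). Illustrative primary-side logic only."""
    if record_len > MAX_RECORD:
        return ("reject", None)                 # client must split the record
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of this chunk and tell the client to retry on the
        # next chunk; the padding is part of the 'junk' readers skip.
        return ("pad_and_retry_on_next_chunk", None)
    return ("append", chunk_used)               # offset the record lands at

assert record_append(0, 1024) == ("append", 0)
assert record_append(CHUNK_SIZE - 100, 1024)[0] == "pad_and_retry_on_next_chunk"
```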

Detecting Failure Chunk version numbers detect stale replicas. Regular handshakes with the master detect failed chunkservers. Checksums detect corrupted data: they are kept in memory and persisted in a log separate from the data, and idle chunkservers scan and verify rarely-read chunks. Failed replicas are ignored by the master and their chunks are re-replicated (cloned) as soon as possible to avoid data loss.
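
A sketch of checksum verification on read. GFS keeps a 32-bit checksum per 64 KB block of a chunk; CRC32 below is my stand-in for the (unspecified) checksum function:

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity within a chunk

def checksums(chunk: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block (kept in memory, logged separately)."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify_read(chunk: bytes, sums: list[int], offset: int, length: int) -> bool:
    """Verify every block overlapping the requested range before returning data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    return all(zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) == sums[b]
               for b in range(first, last + 1))

data = bytes(3 * BLOCK)
sums = checksums(data)
assert verify_read(data, sums, 70_000, 10_000)            # clean read
corrupted = data[:100_000] + b"\x01" + data[100_001:]
assert not verify_read(corrupted, sums, 70_000, 100_000)  # corruption detected
```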

Replica Placement Prefer chunkservers with below-average disk utilization, to equalize load. Limit recent creations per chunkserver, to avoid flooding one with writes. Spread replicas across racks, for redundancy and read bandwidth. Periodically rebalance, which gradually fills up new chunkservers.
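
One way the criteria above might combine when choosing where to place a new replica; the scoring and field names below are mine, not GFS's actual policy:

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float   # fraction of disk in use
    recent_creations: int     # creations imply writes will follow

def place_replicas(servers, n_replicas=3, max_recent=5):
    """Prefer emptier, less write-loaded servers and spread across racks."""
    chosen, used_racks = [], set()
    # Favor below-average disk utilization and few recent creations.
    candidates = sorted(servers,
                        key=lambda s: (s.disk_utilization, s.recent_creations))
    for s in candidates:
        if s.recent_creations >= max_recent:
            continue                       # avoid flooding one server with writes
        if s.rack in used_racks and len(used_racks) < n_replicas:
            continue                       # spread replicas across racks first
        chosen.append(s)
        used_racks.add(s.rack)
        if len(chosen) == n_replicas:
            break
    for s in candidates:                   # relax the rack constraint if needed
        if len(chosen) == n_replicas:
            break
        if s not in chosen and s.recent_creations < max_recent:
            chosen.append(s)
    return chosen

servers = [Chunkserver("a", "r1", 0.20, 0), Chunkserver("b", "r1", 0.25, 1),
           Chunkserver("c", "r2", 0.60, 0), Chunkserver("d", "r3", 0.30, 9)]
print([s.name for s in place_replicas(servers)])  # ['a', 'c', 'b']
```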

High Availability Fast recovery, chunk replication, master replication, checksums.

Performance Test Clusters Cluster A: R&D for over 100 engineers Cluster B: Production data processing

Performance Test Clusters Metadata on chunkservers: mostly checksums Metadata on master: small

Performance Test Clusters Roughly 50-100 MB of metadata per server, enabling fast recovery.

Performance Throughput High throughput for reads and writes

Performance Usage Majority read Low master load

Key Points Chunkservers store 64 MB chunks, stored redundantly. A centralized master controls everything, keeps all metadata in memory, and replicates its log for durability. Relaxed consistency model: consistent/defined segments, weak record append guarantee.

Outline 1 Large Filesystems 2 GFS 3 Pond

Design Goals A more general storage model. Only trusts the infrastructure in aggregate: tolerates failures, malicious users, and churn. Data is universally available. A good consistency model.

Key Points Versioned data model Byzantine agreement for primary copy Erasure-coded archive for durability

Data Model Arbitrary writes are allowed via copy-on-write. A data object ("file") is a sequence of read-only versions, which simplifies caching and replication and allows time travel.

Data Object A B-tree-like structure. Each block is referenced by a Block GUID (BGUID), a cryptographically secure hash of its contents.
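
Content addressing makes block identity self-verifying; a sketch of the idea, with SHA-1 standing in for whatever secure hash the system uses:

```python
import hashlib

def bguid(block: bytes) -> str:
    """A block's GUID is a secure hash of its contents, so any holder of the
    block can verify it, and identical blocks (e.g. shared between versions)
    get the same GUID for free."""
    return hashlib.sha1(block).hexdigest()

a = bguid(b"some block of file data")
b = bguid(b"some block of file data")
c = bguid(b"some block of file data!")
assert a == b          # identical content, identical GUID: sharing across versions
assert a != c          # any change yields a different GUID
```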

Data Object The root BGUID of a version is its Version GUID (VGUID). An Active GUID (AGUID) names the object's set of versions. Shared copies come for free.

Tapestry Virtualizes resources: hosts and resources are identified by GUIDs. Hosts publish the GUIDs of their resources, messages are addressed to GUIDs, and Tapestry routes each message to a nearby host holding that GUID.

Primary Replica Each object has a primary replica that maintains the AGUID-to-VGUID mapping and serializes updates. It is really a collection of servers, the inner ring, which runs a Byzantine agreement protocol and assumes fewer than 1/3 of the group is faulty. MACs are used within the group, public-key cryptography outside. Proactive threshold signatures allow churn within the group without invalidating its public key.
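
The "fewer than 1/3" assumption fixes the ring size needed to tolerate a given number of Byzantine faults (n >= 3f + 1); a quick check:

```python
def max_byzantine_faults(ring_size: int) -> int:
    """Largest f with ring_size >= 3f + 1, i.e. faults strictly under 1/3."""
    return (ring_size - 1) // 3

for n in (4, 7, 10):
    print(n, "servers tolerate", max_byzantine_faults(n), "Byzantine failure(s)")
# 4 -> 1, 7 -> 2, 10 -> 3
```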

Replication/Consistency A BGUID verifies the contents of its block, so replicating blocks safely is easy; it is the AGUID-to-VGUID mapping that changes. Primary-copy replication: the primary replica applies all updates and creates a certificate mapping the AGUID to the most recent VGUID. A client can verify it has the most recent version using a nonce: the signed response contains the nonce.

Archiving Replication is space-inefficient. Erasure codes (Cauchy Reed-Solomon): split a block into m fragments, encode them into n > m fragments, and any m of the n can reconstruct the block. Fragments are distributed uniformly, placed deterministically based on the BGUID and fragment number.
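
Two consequences are easy to see in a sketch: the storage overhead relative to plain replication, and deterministic fragment placement (the hash-based placement below illustrates the idea, not Pond's exact scheme):

```python
import hashlib

def storage_overhead(m: int, n: int) -> float:
    """Encoding a block into n fragments, any m of which suffice, stores n/m of it."""
    return n / m

# e.g. m=16, n=32: 2x the raw data, yet the block survives losing any 16
# fragments; triplication costs 3x and survives only 2 losses.
print(storage_overhead(16, 32))   # 2.0

def fragment_home(bguid: str, fragment_no: int, hosts: list) -> str:
    """Deterministic, roughly uniform placement keyed on (BGUID, fragment number)."""
    h = hashlib.sha1(f"{bguid}:{fragment_no}".encode()).digest()
    return hosts[int.from_bytes(h[:8], "big") % len(hosts)]

hosts = [f"host{i:02d}" for i in range(100)]
example_bguid = hashlib.sha1(b"example block").hexdigest()
print([fragment_home(example_bguid, k, hosts) for k in range(4)])
```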

Caching To locate a block, a host looks up its BGUID. If that fails, it reconstructs the block from fragments and then publishes ownership, so future readers can use this copy. Copies are discarded whenever convenient and unpublished from Tapestry.

Performance Local Archive Performance Reconstruction from the archive is within a factor of 1.7 of reading a remote replica.

Performance Local Throughput Computationally limited by inner ring

Performance Wide Area Throughput (inner ring location / client location / throughput in MB/s): Cluster / Cluster: 2.59; Cluster / PlanetLab: 1.22; Bay Area / PlanetLab: 1.19.

Performance Archive Performance (local cluster) Reconstruction is within a factor of 1.7 of reading a remote replica.

Performance Andrew Benchmark Times in seconds, NFS loopback on OceanStore vs. native NFS. NFS is faster on the LAN but slower on the WAN: it does less computation but uses the network less efficiently.

Key Points Versioned data model: updates are by copy-on-write, which simplifies caching. Byzantine agreement for the primary copy: no single point of failure. Erasure-coded archive for durability: space-efficient storage.

Summary Tradeoffs in large filesystems. GFS: centralized master, trusted nodes, weak consistency. Pond: fully distributed, untrusted nodes, efficient long-term storage.