GFS. CS6450: Distributed Systems Lecture 5. Ryan Stutsman

Some material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University, licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Frans Kaashoek, and Nickolai Zeldovich.

From Last Time...
The problem with Primary/Backup? Under poor connectivity or high churn, the system may thrash if state synchronization to new backups is costly. Think of cross-DC/cross-WAN replication, or cloud deployments across AZs.

Compute: MapReduce. Storage: ???
MapReduce gives scalable, fault-tolerant data processing. How do we do the same thing for data storage? Cheap hardware is a competitive advantage, but we have to deal with faults as a result.

Filesystems
A map of (hierarchical) filenames to variable-length blobs:
- open(filename) -> fd
- read(fd) -> bytes
- write(fd, bytes)
- seek(fd, pos)
Multiple appenders in POSIX with O_APPEND?
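
As a point of comparison for GFS's record append, here is a minimal sketch of multiple POSIX processes appending to one shared local file via O_APPEND; the log path and record format are made up for illustration. With O_APPEND the kernel positions each write at the current end of file, so concurrent appenders on a local filesystem do not overwrite one another, though the interleaving order is still unspecified.

    import os

    # Hypothetical shared log; any local path works for this sketch.
    LOG_PATH = "/tmp/shared.log"

    def append_record(record: bytes) -> None:
        # O_APPEND: the kernel moves the offset to EOF atomically before each
        # write, so concurrent appenders do not clobber each other's records.
        fd = os.open(LOG_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, record)
        finally:
            os.close(fd)

    if __name__ == "__main__":
        append_record(b"crawler result\n")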

Assumptions
- High component failure rates: inexpensive commodity components fail all the time
- Modest number of HUGE files: millions of files, many GB or even TB each
- Files are write-once, mostly appended to (perhaps concurrently)
- Large streaming reads
- High sustained throughput favored over low latency

Overview

Multi-writer Appends Common
[Diagram: several Crawlers append to a shared URL Log (TBs of data); several Processors consume it]
Need parallel processing, but log all results to a common log stream for consumer processes. Want to spread read I/O over many disks.

Biggest Questions from Forms
- Concerns about the Master: fault-tolerance/availability; scaling, capacity, and load
- Defined/Consistent: Why not avoid this? How do apps deal with it? (Not asked, but what is the point/benefit?)

GFS Architecture
[Architecture figure: the Application/GFS client sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); the master holds the file namespace (e.g., /foo/bar -> chunk 2ef0) and sends instructions to chunkservers, which report chunkserver state back; the client then sends (chunk handle, byte range) to a GFS chunkserver, which stores chunks in its local Linux file system and returns chunk data. Legend: data messages vs. control messages.]

Reads
1. Get the server list from the master for (filename, offset); the response includes primary/secondaries
2. Contact any replica with the offset
3. Get data back, or an indication that the read is beyond EOF
[Diagram: Client, Master, Primary Replica, Secondary Replicas A and B]
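
To make the read path concrete, here is a rough client-side sketch. The master.lookup() and chunkserver.read() calls are hypothetical RPC stubs, not GFS's actual API; the 64 MB chunk size matches the paper.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the GFS paper

    def gfs_read(master, filename, offset, length):
        """Sketch of the client-side read path (hypothetical RPC stubs)."""
        chunk_index = offset // CHUNK_SIZE
        # Ask the master which chunk covers this offset and where its replicas are.
        handle, replicas = master.lookup(filename, chunk_index)
        # Contact any replica; a real client would prefer a nearby one and
        # cache the (handle, locations) result for later reads.
        chunkserver = replicas[0]
        # The chunkserver returns the data, or reports a read beyond EOF.
        return chunkserver.read(handle, offset % CHUNK_SIZE, length)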

Writes/Appends
1-2. Get the server list from the master for (filename, offset); the response includes primary/secondaries
3. Push data along a pipeline through the replicas
4. Send the write command to the primary
5-6. The primary orders it and conveys that order to the secondaries
7. Notify the client of the outcome/offset
[Diagram: Client, Master, Primary Replica, Secondary Replicas A and B; legend: control vs. data messages]
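
The write path can be sketched the same way; master.lookup_for_write(), push_data(), and apply_write() are hypothetical stubs standing in for the real RPCs, and the step numbers refer to the list above.

    import uuid

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

    def gfs_write(master, filename, offset, data):
        """Sketch of the client-side write path (hypothetical RPC stubs)."""
        chunk_index = offset // CHUNK_SIZE
        # Steps 1-2: learn the chunk handle, the current primary (lease holder),
        # and the secondaries from the master.
        handle, primary, secondaries = master.lookup_for_write(filename, chunk_index)
        # Step 3: push the data to every replica; it is buffered, not yet applied.
        data_id = uuid.uuid4().hex  # identifies this pushed data in the write request
        for replica in [primary] + secondaries:
            replica.push_data(handle, data_id, data)
        # Steps 4-6: ask the primary to apply the buffered data; the primary
        # picks a serial order and forwards that order to the secondaries.
        ok = primary.apply_write(handle, offset % CHUNK_SIZE, data_id, secondaries)
        # Step 7: the primary's reply tells the client whether the mutation
        # succeeded at all replicas.
        return ok

Note how data flow (push_data) and control flow (apply_write) are kept separate, which is the same control-path/data-path split the slide's legend highlights.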

Mutual Exclusion/Ownership
[Diagram: the Master sends grant(chunk 32) to one of several Servers]
Need mutual exclusion between servers for chunk mutations: only one chunkserver should order writes for a given chunk.

Mutual Exclusion/Ownership
[Diagram: the Master sends grant(chunk 32) to one of several Servers]
Heartbeats determine when a new primary is needed. But what about safe revocation?

Leases
[Diagram: the Master sends grant(chunk 32, until 16:31:24) to one of several Servers]
- Assume bounded clock drift
- Assume the network is only so asynchronous in delivery
- Need lease term >> propagation delay + clock skew:
  t_c = max(0, t_s - (m_prop + 2*m_proc) - ε)
- A very common approach to mutual exclusion
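
A small helper shows how a lease holder might compute a conservative local lease term from the slide's formula. The interpretation of the symbols (m_prop as the message propagation bound, m_proc as a per-hop processing bound, ε as the clock drift bound) follows the assumptions listed above; it is not spelled out in the original slide, so treat it as one reasonable reading.

    def safe_lease_term(t_s, m_prop, m_proc, epsilon):
        """t_c = max(0, t_s - (m_prop + 2*m_proc) - epsilon), all in seconds.
        The holder only trusts the lease for t_c, which is shorter than the
        term t_s the master granted, so bounded clock drift and message delay
        cannot let two servers believe they hold the lease at the same time."""
        return max(0.0, t_s - (m_prop + 2 * m_proc) - epsilon)

    # Example: a 60 s grant with a 0.5 s propagation bound, 0.1 s processing
    # per hop, and up to 1 s of clock drift leaves about 58.3 s of usable lease.
    print(safe_lease_term(60.0, 0.5, 0.1, 1.0))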

Questions
- How should append failures be handled?
- What is the effect of "write to all, read from any"?
  Assume two clients C1 and C2 and no active writers. If C1 reads record r at offset o and notifies C2, and then C2 reads at offset o, is C2 guaranteed to see r?

Master
A single master handles:
- Finding files and their chunks
- Access control
- Heartbeating and monitoring chunkservers
- Chunk distribution/rebalancing
- Snapshotting
Maintains:
- Filename -> chunklist map
- Chunklist -> chunkservers map
All in memory: can access/scan structures quickly. Advantages/disadvantages?

Master Recovery
- Filesystem metadata changes are logged locally and remotely
- Synchronously logged for client requests
- On crash, replay the log to reconstruct state
- To bound recovery time, checkpoint state; on restart, mmap the checkpoint state and replay the log tail (see the sketch below)
- Does checkpointing block normal requests? How often should checkpointing be done?
- What about chunk locations? They aren't logged.
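
A toy sketch of the checkpoint-plus-log-tail recovery idea, using an in-memory dict in place of the master's real namespace structures; the operation format and JSON checkpoint are assumptions for illustration only.

    import json

    class MasterMetadata:
        """Toy model of log-then-apply metadata updates with checkpoints."""

        def __init__(self):
            self.namespace = {}   # filename -> list of chunk handles (stand-in)
            self.log_tail = []    # operations since the last checkpoint

        def add_chunk(self, filename, handle):
            # Log the mutation (locally and remotely in real GFS) before applying.
            self.log_tail.append((filename, handle))
            self._apply(filename, handle)

        def _apply(self, filename, handle):
            self.namespace.setdefault(filename, []).append(handle)

        def checkpoint(self):
            # Snapshot the state so only the log tail must be replayed later.
            snapshot = json.dumps(self.namespace)
            self.log_tail = []
            return snapshot

        @classmethod
        def recover(cls, snapshot, log_tail):
            m = cls()
            m.namespace = json.loads(snapshot)
            for filename, handle in log_tail:  # replay ops written after the checkpoint
                m._apply(filename, handle)
            return m

Chunk locations are deliberately absent from this state: the master re-learns them from chunkserver heartbeats after a restart.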

Shadow Masters
- Improve scaling of metadata read operations
- Shadow masters consume remote copies of the replication log to provide a nearly up-to-date view of master metadata
- Since data ops go to chunkservers, clients see up-to-date data for the chunks the shadow is aware of

Consistency
- A consistency model defines expected behavior under concurrent mutation
- Should a read started after an acknowledged write see the effects of that write? Must it in GFS?
- Answers to these questions depend on how much complexity developers can tolerate and on performance trade-offs

Defined, Consistent?
- The outcome depends on the type of mutation, success/failure, and concurrent mutations
- Consistent: all clients will see the same data regardless of which replica they read
- Defined: consistent, and the data contains the value written by the client

No Failure, No Concurrency
[Diagram: C1 writes "abc"; replicas R1, R2, and R3 all hold "abc"]
Defined and consistent.

Failure, No Concurrency
[Diagram: C1 writes "abc"; R1 and R2 hold "abc", but the write to R3 fails]
Undefined and inconsistent. Why? R3 is inconsistent with the other replicas, and a read from R3 gives an undefined value back.

No Failure, Concurrency
[Diagram: C1 writes "aaaabbbbcccc" and C2 concurrently writes "ddddeeeeffff"; all three replicas end up with the same interleaving, e.g. "aaaa dddd bbbb eeee ..."]
Undefined and consistent. Why? What's the benefit of this approach?

Append Serial/Concurrent
- Must ensure all appends go to the end of the file; no append precedes another previously successful append. Easy?
- Additional constraints:
  - Must maintain floor(offset / 64 MB) = chunk number; otherwise the chunk size is needed at the master and reads/writes require a linear-time seek
  - Avoid cross-chunkserver coordination when an append crosses a chunk boundary; else we need distributed commit (see 2PC next week)

Problem: Straddling Appends
[Diagram: C1 and C2 issue concurrent Append(...) calls]
Without care, a reader may miss concurrent appends.

Solution: Padding
[Diagram: C1 and C2 issue concurrent Append(...) calls; the chunk is padded rather than letting a record straddle the boundary]
- Side effect: defined regions interspersed with padding
- Where do duplicates come from? Append retry
- Hence, padding may actually be inconsistent
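
A minimal sketch of the primary's padding decision for a record append; the function name and return convention are made up for illustration, but the rule follows the idea above: pad the rest of the chunk and have the client retry on a fresh chunk when the record would straddle the 64 MB boundary, with record appends capped at a fraction of the chunk size as in the GFS paper.

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
    MAX_RECORD = CHUNK_SIZE // 4    # record appends capped at 1/4 chunk size

    def place_record_append(bytes_used_in_chunk, record_len):
        """Hypothetical helper: where should the primary place this append?
        Returns ("append", offset) if the record fits in the current chunk, or
        ("pad_and_retry", offset) if the rest of the chunk should be padded and
        the client retried on a fresh chunk, so no record straddles a boundary."""
        assert record_len <= MAX_RECORD
        if bytes_used_in_chunk + record_len <= CHUNK_SIZE:
            return ("append", bytes_used_in_chunk)
        return ("pad_and_retry", bytes_used_in_chunk)

A failed or timed-out append is simply retried, which is where duplicates come from: the retry may append a second copy of a record even though an earlier attempt already reached some replicas.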

Ordering Summary
- How are operations on files ordered?
- First, by the master's read/write locks when metadata is mutated
- The master delegates ordering responsibilities within a chunk to a primary chunkserver

Interesting Points to Add/Discuss
- Rack layout, failure domains, network topology, IP addresses as distance
- Separation of control path and data path
- Master operations, fault-tolerance, and recovery; shadow masters
- Evals: are 3 copies enough? Are failures independent?
- Recovery, replica rebalancing, placement
- Loosely coupled replica garbage collection
  - The existence of chunks on chunkservers is effectively ground truth
  - Doesn't matter if the master says a CS has something it doesn't
  - Doesn't matter if the master says a CS doesn't have something it does
  - The system design needs to work correctly and converge on the truth