GFS Overview
- Design goals/priorities: design for big-data workloads (huge files, mostly appends, high concurrency, huge bandwidth) and design for failures.
- Interface: not POSIX. New operation: record append (atomicity matters, order doesn't).
- Architecture: one master and many chunk (data) servers. The master stores metadata and monitors the chunkservers; chunkservers store and serve chunks.
- Semantics: no guarantees for the traditional write operation; at-least-once, atomic record appends.

GFS Design Constraints
1. Machine failures are the norm
   - 1000s of components; bugs, human errors, and failures of memory, disks, connectors, networking, and power supplies.
   - Monitoring, error detection, fault tolerance, and automatic recovery must be integral parts of the design.
2. Design for big-data workloads
   - Search, ads, Web analytics, MapReduce, ...

Workload Characteristics
- Files are huge by traditional standards; multi-GB files are common.
- Most file updates are appends; random writes are practically nonexistent; many files are written once and then read sequentially.
- High bandwidth is more important than low latency.
- Lots of concurrent data access, e.g., multiple crawler workers updating the index file.
- The GFS design is geared toward these application characteristics, and Google applications have in turn been geared toward GFS.

Interface Design
- Not POSIX compliant: supports only the common file-system operations, and their semantics are different. That means you wouldn't be able to mount it.
- Additional operation: record append. Frequent patterns at Google:
  - Merging results from multiple machines into one file (MapReduce)
  - Using a file as a producer-consumer queue
  - Logging user activity and site traffic
- Order doesn't matter for appends, but atomicity and concurrency matter.

Architectural Design
- A GFS cluster: a single master (replicated later) and many chunkservers, accessed by many clients.
- A file is divided into fixed-size chunks (similar to FS blocks), each labeled with a 64-bit globally unique ID (called a handle), stored at chunkservers, and 3-way replicated across chunkservers.
- The master keeps track of metadata (e.g., which chunks belong to which files).
- (Diagram: clients (C) talking to a GFS cluster of chunkservers (S) and a master (M).)

GFS Basic Functioning
- The client retrieves the metadata for an operation from the master.
- Read/write data flows directly between the client and the chunkservers.
- Minimizing the master's involvement in read/write operations alleviates the single-master bottleneck.
- (Diagram: clients hold a metadata cache; chunkservers hold chunks from various files.)

Detailed Architecture (architecture diagram)

Chunks
- Analogous to FS blocks, except larger. Size: 64 MB! Normal FS block sizes are 512 B - 8 KB.
- Pros of big chunk sizes? Cons of big chunk sizes?

Chunks (continued)
- Pros of big chunk sizes: less load on the master (less metadata, hence it can be kept in the master's memory); suitable for big-data applications (e.g., search); sustains large bandwidth and reduces network overhead. (A back-of-the-envelope comparison follows below.)
- Cons of big chunk sizes: fragmentation if small files are more frequent than initially believed.
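To make the metadata argument concrete, here is a back-of-the-envelope comparison in Python; the 64-byte-per-chunk cost is an illustrative assumption, not a number from the slides.

CHUNK_SIZE = 64 * 1024 * 1024        # GFS chunk size: 64 MB
FS_BLOCK_SIZE = 8 * 1024             # a typical "large" FS block size from the slide
BYTES_PER_UNIT_METADATA = 64         # assumed metadata cost per chunk/block (illustrative)

file_size = 1 * 1024**4              # a 1 TB file

chunks = file_size // CHUNK_SIZE     # 16,384 chunks
blocks = file_size // FS_BLOCK_SIZE  # 134,217,728 blocks

print(chunks * BYTES_PER_UNIT_METADATA,   # ~1 MB of per-chunk metadata
      blocks * BYTES_PER_UNIT_METADATA)   # vs ~8 GB with 8 KB blocks

The three-orders-of-magnitude difference is what makes it feasible to keep all chunk metadata in the master's memory.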

The GFS Master
- A process running on a separate machine.
- Initially, GFS supported just a single master, but master replication was later added for fault tolerance in subsequent versions and distributed storage systems.
- The replicas use Paxos, an algorithm we'll talk about later, to coordinate, i.e., to act as one (think of the voting algorithm we looked at for distributed mutual exclusion).
- Stores all metadata (sketched below):
  1) File and chunk namespaces (hierarchical namespace for files, flat namespace for chunks)
  2) File-to-chunk mappings
  3) Locations of each chunk's replicas
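As an illustration only (not GFS's actual code), the master's in-memory tables might look roughly like this in Python; FileInfo, MasterMetadata, and the example paths are made up for the sketch.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileInfo:
    # A file is an ordered list of chunk handles (64-bit globally unique IDs).
    chunk_handles: List[int] = field(default_factory=list)

@dataclass
class MasterMetadata:
    # 1) File namespace (hierarchical paths); the flat chunk namespace is given
    #    by the handles themselves.
    # 2) File-to-chunk mappings: path -> FileInfo.
    files: Dict[str, FileInfo] = field(default_factory=dict)
    # 3) Chunk replica locations: handle -> chunkserver addresses.
    #    Not persisted; rebuilt by polling chunkservers at startup (next slides).
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)

meta = MasterMetadata()
meta.files["/logs/web/part-0001"] = FileInfo(chunk_handles=[0x1A2B, 0x1A2C])
meta.chunk_locations[0x1A2B] = ["cs-07:7777", "cs-12:7777", "cs-31:7777"]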

Chunk Locations
- Kept in memory; no persistent state. The master polls the chunkservers at startup.
- What does this imply? Upsides? Downsides?

Chunk Locations (continued)
- Upside: the master can restart and recover the chunk locations from the chunkservers. (Note that the hierarchical file namespace is kept on durable storage at the master.)
- Downside: restarting the master takes a long time.
- Why do you think they do it this way?

Chunk Locations (continued)
- Why do you think they do it this way? Design for failures, simplicity, and scalability: the less persistent state the master maintains, the better.

Master Controls/Coordinates Chunkservers
- The master and chunkservers communicate regularly (heartbeats, sketched below): Is the chunkserver down? Are there disk failures on the chunkserver? Are any replicas corrupted? Which chunks does the chunkserver store?
- The master sends instructions to chunkservers: delete a chunk, create a new chunk, replicate and start serving this chunk (chunk migration).
- Why do we need migration support?
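A hypothetical sketch of how the master might process one heartbeat: the chunkserver reports the chunks it stores, and the master answers with delete/replicate instructions. The helper methods (mark_alive, live_chunks, record_location, under_replicated_chunks) are placeholders, not real GFS APIs.

def handle_heartbeat(master, chunkserver_id, reported_chunks):
    """Master-side processing of one heartbeat (illustrative sketch).

    reported_chunks: set of chunk handles the chunkserver claims to store.
    Returns a list of instructions for the chunkserver.
    """
    instructions = []
    master.mark_alive(chunkserver_id)                 # the chunkserver is not down
    live = master.live_chunks()                       # handles still referenced by some file
    for handle in reported_chunks:
        if handle not in live:                        # e.g., the file was deleted
            instructions.append(("delete", handle))
        else:
            master.record_location(handle, chunkserver_id)
    for handle in master.under_replicated_chunks():   # e.g., a replica's disk failed
        instructions.append(("replicate", handle))    # chunk migration / re-replication
    return instructions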

File System Operations
- Reads
- Updates: writes and record appends

Read Protocol (slides 23-24: diagrams of the read path and the client's metadata cache; a code sketch follows below)
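Although the diagrams are not reproduced in this transcript, the read path described on the "GFS Basic Functioning" slide can be sketched roughly as follows; Client, lookup, and read_chunk are illustrative names, and the sketch assumes a read that stays within one chunk.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk_index) -> (chunk handle, replica locations)

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:
            # Ask the master only for metadata; data never flows through it.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        # Read the byte range directly from one of the chunkserver replicas.
        return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)

The metadata cache is what keeps repeated reads of the same chunk from touching the master at all.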

Updates
- Writes: cause data to be written at an application-specified file offset.
- Record appends: operations that append data to a file, causing the data to be appended atomically at least once, at an offset chosen by GFS, not by the client.
- Goals: clients can read, write, and append records at maximum throughput and in parallel, with some consistency that we can understand (kinda). Hence, GFS favors concurrency/performance over clean semantics. Compare this to AFS/NFS's design goals.

Update Order
- For consistency, updates to each chunk must be applied in the same order at the different chunk replicas. Consistency here means that the replicas end up with the same version of the data and do not diverge.
- For this reason, for each chunk one replica is designated as the primary; the other replicas are designated as secondaries.
- The primary defines the update order; all secondaries follow this order.

Primary Leases
- For correctness, at any time there must be a single primary for each chunk; otherwise, different replicas could order concurrent writes in different ways.

Primary Leases (continued)
- To ensure this, GFS uses leases:
  - The master selects a chunkserver and grants it a lease for a chunk.
  - The chunkserver holds the lease for a period T after receiving it and acts as primary during this period.
  - The chunkserver can refresh the lease indefinitely, but if it cannot successfully refresh the lease with the master, it stops being the primary.
  - If the master doesn't hear from the primary chunkserver for a period, it grants the lease to someone else.

Primary Leases (continued)
- So, at any time, at most one server is primary for each chunk, but different servers can be primaries for different chunks. (A sketch of the lease bookkeeping follows below.)
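A minimal sketch of the lease bookkeeping on both sides, assuming loosely synchronized clocks and ignoring message loss; the class names and the 60-second period are illustrative, not taken from the slides.

import time

LEASE_PERIOD = 60.0  # seconds (the period T); illustrative value

class MasterLeaseTable:
    """Master-side view: at most one unexpired lease per chunk."""
    def __init__(self):
        self.leases = {}  # chunk_handle -> (chunkserver_id, expiry_time)

    def grant_or_refresh(self, chunk_handle, chunkserver_id):
        holder = self.leases.get(chunk_handle)
        now = time.time()
        if holder is None or holder[0] == chunkserver_id or holder[1] < now:
            # No holder, the same holder refreshing, or the old lease expired:
            # safe to (re)grant the lease.
            self.leases[chunk_handle] = (chunkserver_id, now + LEASE_PERIOD)
            return True
        return False  # someone else still holds an unexpired lease

class ChunkserverLease:
    """Chunkserver-side view: act as primary only while the lease is valid."""
    def __init__(self):
        self.expiry = 0.0

    def on_refresh_ok(self):
        self.expiry = time.time() + LEASE_PERIOD

    def is_primary(self):
        return time.time() < self.expiry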

Write Algorithm (slides 30-33: step-by-step diagrams of the write protocol; a rough code sketch follows below)
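Since only the diagram titles survive in the transcript, here is a rough sketch of the write path consistent with the surrounding slides: the client gets the primary and secondaries from the master, pushes the data to the replicas, and then asks the primary to apply the write, which imposes one order and waits for all replicas. All function and method names (lookup_for_write, push_data, apply, and so on) are placeholders.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def push_data(data, replicas):
    # Placeholder: deliver the data bytes to each replica's buffer and return an
    # identifier that the subsequent write request refers to.
    return id(data)

def client_write(master, filename, offset, data):
    """Client side of a write (rough sketch)."""
    chunk_index = offset // CHUNK_SIZE          # assume the write stays within one chunk
    # Ask the master which replica holds the lease (the primary) and where the
    # secondaries are; the data itself never flows through the master.
    handle, primary, secondaries = master.lookup_for_write(filename, chunk_index)
    # Push the data to the replicas (refined into a chained push two slides below).
    data_id = push_data(data, [primary] + secondaries)
    # Send the write request to the primary, which orders and applies it.
    return primary.write(handle, offset % CHUNK_SIZE, data_id)

def primary_write(state, handle, chunk_offset, data_id):
    """Primary side: impose a single update order and wait for every replica."""
    serial = state.next_serial_number(handle)   # same order imposed at all replicas
    state.apply(handle, chunk_offset, data_id, serial)
    acks = [s.apply(handle, chunk_offset, data_id, serial)
            for s in state.secondaries(handle)]
    return all(acks)                            # reply only after all replicas finish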

Write Consistency
- The primary enforces one update order across all replicas for concurrent writes, and it waits until a write finishes at the other replicas before it replies.
- Therefore we end up with identical replicas, but a file region may end up containing mingled fragments from different clients (e.g., writes to different chunks may be ordered differently by their different primary chunkservers).
- Thus, writes are consistent but undefined in GFS.

One Little Correction
- Actually, the client doesn't send the data to every replica itself. It sends the data to one replica, and the replicas then send the data in a chain to all the other replicas. Why?

One Little Correction (continued)
- Why? To maximize bandwidth and throughput! (A sketch of the chained push follows below.)
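A tiny sketch of that chained data push; receive_and_forward is a hypothetical chunkserver method, and the replica list is assumed to be ordered by network distance.

def push_data_chained(data, replicas):
    """Send the data only to the nearest replica; each replica streams it on to
    the next one in the chain, so each machine's outgoing link carries the
    payload once instead of the client's uplink being split several ways."""
    nearest, rest = replicas[0], replicas[1:]   # replicas sorted by network distance
    # The chunkserver is assumed to buffer the data and forward it to 'rest'
    # while it is still receiving, pipelining the transfer.
    return nearest.receive_and_forward(data, forward_to=rest)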

Record Appends
- The client specifies only the data, not the file offset; the file offset is chosen by the primary.
- Why do they have this? It provides a meaningful semantic: at least once, atomically. Because the file system is not constrained regarding where to place the data, it can provide atomicity without sacrificing concurrency.

Record Appends (continued)
- Rough mechanism:
  - If the record fits in the chunk, the primary chooses the offset and communicates it to all replicas; the offset is arbitrary.
  - If the record doesn't fit in the chunk, the chunk is padded and the client gets a failure; the file may have blank spaces.
  - If a record append fails at any replica, the client retries the operation; the file may contain record duplicates.

Record Append Algorithm
1. Application originates a record append request.
2. GFS client translates the request and sends it to the master.
3. Master responds with the chunk handle and (primary + secondary) replica locations.
4. Client pushes the write data to all locations.
5. Primary checks if the record fits in the specified chunk.
6. If the record does not fit: the primary pads the chunk, tells the secondaries to do the same, and informs the client; the client then retries the append with the next chunk.
7. If the record fits, the primary appends the record at some offset in the chunk, tells the secondaries to do the same (specifying the offset), receives responses from the secondaries, and sends the final response to the client.
(A primary-side code sketch of steps 5-7 follows below.)
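A primary-side sketch of steps 5-7, under the simplifying assumption that apply/apply_locally stand in for the real replica writes; it is meant to show where padding, retries, and duplicates come from, not to mirror GFS's implementation.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class PrimaryChunk:
    """Primary-side handling of a record append for one chunk (sketch)."""
    def __init__(self, secondaries):
        self.used = 0                  # bytes already written in this chunk
        self.secondaries = secondaries # secondary replica stubs

    def apply_locally(self, data, offset):
        # Placeholder for writing 'data' at 'offset' into the local chunk file.
        pass

    def record_append(self, record):
        if self.used + len(record) > CHUNK_SIZE:
            # Step 6: record doesn't fit. Pad this chunk everywhere and tell the
            # client to retry on the next chunk (may leave blank space in the file).
            pad = CHUNK_SIZE - self.used
            self.apply_locally(b"\0" * pad, offset=self.used)
            for s in self.secondaries:
                s.apply(b"\0" * pad, offset=self.used)
            self.used = CHUNK_SIZE
            return "RETRY_NEXT_CHUNK"
        # Step 7: record fits. The primary picks the offset and imposes it on all replicas.
        offset = self.used
        self.apply_locally(record, offset)
        acks = [s.apply(record, offset) for s in self.secondaries]
        self.used += len(record)
        if not all(acks):
            # A failed replica means the client will retry, which can leave
            # duplicates at some replicas ("at least once" semantics).
            return "FAIL_RETRY"
        return offset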

Implications of Weak Consistency for Applications
- Rely on appends rather than on overwrites.
- Write self-validating records: checksums to detect and discard padding.
- Write self-identifying records: unique identifiers to identify and discard duplicates. (A sketch of both techniques follows below.)
- Hence, applications need to adapt to GFS and be aware of its inconsistent semantics.
- We'll talk soon about several consistency models, which make it easier or harder to build applications against a storage system.
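One way an application might implement such self-validating, self-identifying records on top of record append; the on-disk format (length, random 16-byte ID, payload, MD5 checksum) is entirely made up for this sketch.

import hashlib, os, struct

# Hypothetical record format: [8-byte length][16-byte unique ID][payload][16-byte MD5].
# The checksum lets a reader skip padding/garbage; the unique ID lets it discard
# duplicates introduced by append retries.

def encode_record(payload: bytes) -> bytes:
    record_id = os.urandom(16)
    body = struct.pack(">Q", len(payload)) + record_id + payload
    return body + hashlib.md5(body).digest()

def decode_records(data: bytes):
    seen, pos = set(), 0
    while pos + 8 + 16 <= len(data):
        (length,) = struct.unpack(">Q", data[pos:pos + 8])
        end = pos + 8 + 16 + length + 16
        if end > len(data):
            pos += 1            # padding or garbage: resynchronize byte by byte
            continue
        body, checksum = data[pos:end - 16], data[end - 16:end]
        if hashlib.md5(body).digest() != checksum:
            pos += 1            # corrupted or padded region: skip
            continue
        record_id, payload = body[8:24], body[24:]
        if record_id not in seen:   # drop duplicates from append retries
            seen.add(record_id)
            yield payload
        pos = end

rec = encode_record(b"click:42")
data = rec + b"\0" * 100 + rec          # a retry duplicated the record; padding in between
print(list(decode_records(data)))       # -> [b'click:42'] (padding and duplicate dropped)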

GFS Summary
- Optimized for large files and sequential appends; hence the large chunk size.
- File system API tailored to the stylized workload.
- Single-master design to simplify coordination, but the load on the master is minimized by keeping it out of large data transfers.
- Implemented on top of commodity hardware, unlike AFS/NFS, which require a pretty hefty server in order to scale.