The Google File System


The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (Google Research), ACM SOSP 2003. Presented by Vaibhav Bajpai, NDS Seminar, 2011.

Looking Back. Classics: Sun NFS (1985), CMU Andrew FS (1988). Fault tolerant: CMU Coda. Parallel [HPC]: Google FS, Oracle Lustre, IBM zFS. Serverless [peer-to-peer]: xFS, Frangipani, Global FS, GPFS. (Timeline annotations: stateless server, fully POSIX, stateful server, location independent.)

Outline: Design Overview, System Interactions, Performance, Conclusion.

Design Overview. Assumptions. Design decisions: files stored as chunks, single master, no data caching, familiar but customizable API. Example. Metadata: in-memory data structures, persistent data structures.

Design Overview: Assumptions. Component failures are the norm rather than the exception. Multi-GB files are the common case. Mutations to files: random writes are practically non-existent, while large concurrent sequential appends are common and worth optimizing for. Large streaming reads are common. High sustained bandwidth matters more than low latency.

Design Overview: Design Decisions. Files are stored as chunks, with reliability through replication across three or more chunkservers. A single master. No data caching. A familiar but customizable API.

Design Decisions: Files Stored as Chunks. Files are divided into fixed-size 64 MB chunks, each identified by a 64-bit chunk handle and replicated on multiple chunkservers (three by default); the replication level is customizable per namespace region.
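
To make the bookkeeping concrete, here is a minimal sketch (hypothetical names, not GFS's actual structures) of the per-file and per-chunk metadata this implies:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks
DEFAULT_REPLICATION = 3        # default number of chunkserver replicas

@dataclass
class ChunkInfo:
    handle: int                                          # globally unique 64-bit chunk handle
    version: int = 1                                     # chunk version (used for stale-replica detection)
    locations: list[str] = field(default_factory=list)   # chunkservers currently holding a replica

@dataclass
class FileInfo:
    path: str                                            # full pathname, e.g. "/home/user/foo"
    chunks: list[int] = field(default_factory=list)      # ordered chunk handles
    replication: int = DEFAULT_REPLICATION               # per-file/namespace replication level

def chunk_index(offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return offset // CHUNK_SIZE
```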

Design Decisions: Chunk Size (64 MB). Advantages: reduces interaction with the master; reduces network overhead, since a client is likely to perform many operations on a single chunk; reduces the size of the metadata stored on the master.

Design Decisions: Chunk Size (64 MB). Issues: internal fragmentation within a chunk, addressed with lazy space allocation; chunkservers holding a popular chunk may become hotspots, addressed with a higher replication factor, by staggering application start times, and (in the future) by enabling client-to-client communication.

Design Decisions: Single Master. Centralization keeps things simple, and global knowledge helps with sophisticated chunk placement and replication decisions. The master also maintains all file-system metadata. But doesn't a single master run counter to what we have learned about distributed systems?

Design Decisions: Single Master. Shadow masters provide read-only access, addressing the single point of failure. Minimizing master involvement addresses the scalability bottleneck: data never moves through the master; only metadata does, and clients cache it; the large chunk size and chunk leases delegate data mutations away from the master.

Design Decisions: No Data Caching. Clients do NOT cache file data: working sets are too large to cache locally, and this simplifies the client by avoiding cache-coherence issues. Chunkservers do NOT cache file data either: the Linux buffer cache already caches it.

Design Decisions: FS Interface. The usual operations are supported: create, delete, open, close, read, write. Beyond POSIX, a snapshot operation copies a file or directory tree at low cost, and a record append operation lets multiple clients append data to a file atomically and concurrently.

Design Overview: Example. The client caches key:value pairs with key = (filename, chunk index) and value = (chunk handle, chunk locations). To read, it sends (filename, chunk index) to the master and gets back (chunk handle, chunk locations); it then sends (chunk handle, byte range) to a chunkserver and receives the chunk data.
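
A hedged sketch of that client-side read path, assuming placeholder RPC stubs master_lookup and chunkserver_read and a read that stays within one chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024
_cache: dict[tuple[str, int], tuple[int, list[str]]] = {}  # (filename, chunk index) -> (handle, locations)

def read(filename: str, offset: int, length: int,
         master_lookup, chunkserver_read) -> bytes:
    """Single-chunk read for simplicity; multi-chunk reads would loop over indices."""
    index = offset // CHUNK_SIZE
    key = (filename, index)
    if key not in _cache:
        # One round trip to the master; the reply is cached for subsequent reads.
        _cache[key] = master_lookup(filename, index)   # -> (chunk handle, replica locations)
    handle, locations = _cache[key]
    # Read from a replica; a real client would pick the closest one, here we take the first.
    start = offset % CHUNK_SIZE
    return chunkserver_read(locations[0], handle, start, length)
```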

Design Overview: Metadata. The master stores three types of metadata: the file and chunk namespaces, the file-to-chunk mappings, and the chunk replica locations. All of it is kept in memory; the namespaces and mappings are also made persistent, while replica locations are not.

Design Overview: Metadata, In-Memory Data Structures. Keeping metadata in memory makes master operations FAST and allows efficient periodic scans of the entire state: for chunk garbage collection, for re-replication after chunkserver failures, and for load balancing by chunk migration. Is the system's capacity limited by the master's memory size? No: a 64 MB chunk needs less than 64 bytes of metadata, so 640 TB of data needs less than 640 MB of memory.
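
A quick back-of-the-envelope check of that claim, using only the numbers above:

```python
CHUNK_SIZE = 64 * 1024 ** 2          # 64 MB per chunk
METADATA_PER_CHUNK = 64              # < 64 bytes of master metadata per chunk

data = 640 * 1024 ** 4               # 640 TB of file data
chunks = data // CHUNK_SIZE          # 10,485,760 chunks
metadata = chunks * METADATA_PER_CHUNK
print(chunks, metadata / 1024 ** 2)  # -> 10485760 640.0 (MB of master metadata)
```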

Design Overview: Metadata, Chunk Locations. Chunk locations do NOT have a persistent record. To learn them, the master asks each chunkserver at startup, or whenever a chunkserver joins the cluster. This eliminates the problem of keeping the master and chunkservers in sync.

Design Overview: Metadata, Operation Log. The operation log defines the order of concurrent operations and is replicated on multiple machines. Recovery is fast because of checkpoints (stored in a B-tree form), which are created in a separate thread.

Design Overview: Metadata, Operation Log Example. For a set of mutations X, Y applied to the master, the log records START X Y END: you log the things you are going to write before you write them. If the single master's copy of the metadata is then corrupted, it can be rebuilt from the log.
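
A minimal write-ahead-log sketch along those lines; the JSON-lines record format and the OperationLog class are assumptions for illustration, not GFS's actual log encoding:

```python
import json
import os

class OperationLog:
    def __init__(self, path: str):
        self.path = path

    def append(self, record: dict) -> None:
        # Log the mutation durably BEFORE applying it to the in-memory metadata.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self, apply) -> None:
        # On recovery, re-apply every logged mutation in its original order.
        try:
            with open(self.path) as f:
                for line in f:
                    apply(json.loads(line))
        except FileNotFoundError:
            pass  # empty log, nothing to replay
```

Usage would be log.append({"op": "create", "path": "/home/user/foo"}) before mutating the namespace, and log.replay(apply_fn) on restart; a real master would additionally fold the log into periodic checkpoints.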

Consistency Model. A file region is consistent if all clients see the same data, regardless of which replica they read from. It is defined if it is consistent AND clients see what the mutation has written in its entirety.

Design Overview: Consistency Model, Guarantees by GFS. After a sequence of successful mutations, the mutated region is guaranteed to be defined. This is achieved by applying mutations in the same order on all replicas and by using chunk versioning to detect stale replicas. Data corruption is detected by checksums.

Design Overview: Consistency Model, Implications for Applications. Single-writer appends: the writer periodically checkpoints its writes, and readers ONLY process the region up to the last checkpoint. Multiple-writer appends: writers get append-at-least-once semantics, readers discard padding using checksums, and readers discard duplicates using unique record IDs.
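
One way an application library might implement that reader-side filtering; the record framing (length, CRC32 checksum, 16-byte record ID) is entirely an assumption of this sketch, since GFS leaves framing to the application:

```python
import struct
import zlib

def valid_records(blob: bytes):
    """Yield application records, skipping padding, corrupt fragments, and duplicates."""
    seen_ids = set()
    pos = 0
    while pos + 8 <= len(blob):
        length, checksum = struct.unpack_from(">II", blob, pos)
        payload = blob[pos + 8 : pos + 8 + length]
        if length == 0 or len(payload) < length or zlib.crc32(payload) != checksum:
            pos += 1          # padding or a partial record: resynchronize byte by byte
            continue
        record_id, data = payload[:16], payload[16:]   # first 16 bytes: unique record id
        pos += 8 + length
        if record_id in seen_ids:
            continue          # duplicate left behind by a retried record append
        seen_ids.add(record_id)
        yield data
```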

System Interactions: chunk leases and mutation order; data and control flow; special operations (atomic record append, snapshot). Master Operations: namespace management and locking; replica placement (creation, re-replication, rebalancing); garbage collection; stale replica detection.

System Interactions: Chunk Leases. Leases are used to minimize master involvement. The primary replica is the one granted the chunk lease, and it picks a serial order for mutations. A lease has an initial timeout of 60 seconds, can be extended, and can be revoked by the master.
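
A toy lease table illustrating the grant/extend/revoke lifecycle; the LeaseTable class and its policy of picking the first replica as primary are assumptions of the sketch:

```python
import time

LEASE_SECONDS = 60

class LeaseTable:
    def __init__(self):
        self._leases: dict[int, tuple[str, float]] = {}    # chunk handle -> (primary, expiry)

    def grant_or_get(self, handle: int, replicas: list[str]) -> str:
        """Return the current primary for a chunk, granting a new lease if none is live."""
        primary, expiry = self._leases.get(handle, (None, 0.0))
        if primary is None or time.monotonic() >= expiry:
            primary = replicas[0]                           # pick an up-to-date replica as primary
            self._leases[handle] = (primary, time.monotonic() + LEASE_SECONDS)
        return primary

    def extend(self, handle: int) -> None:
        primary, _ = self._leases[handle]
        self._leases[handle] = (primary, time.monotonic() + LEASE_SECONDS)

    def revoke(self, handle: int) -> None:
        self._leases.pop(handle, None)
```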

System Interactions: Mutation Order. The client asks the master who holds the chunk lease (caching key = chunk handle, value = (primary ID, secondaries)); the master replies with the identity of the primary replica and the locations of the secondaries. The client pushes the data to all replicas (primary, secondary replica A, secondary replica B), then sends the write request to the primary. The primary assigns serial numbers to all mutations, applies each mutation to its own local state, and forwards the write request to the secondaries; each replica reports back that the operation completed.
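
A simplified sketch of that split between data flow and control flow; the helper names (lease_holder, push_data, send_write, forward) are placeholders standing in for RPCs:

```python
def client_write(master, push_data, send_write, handle: int, data: bytes) -> None:
    """Client side: push data to every replica, then hand control to the primary."""
    primary, secondaries = master.lease_holder(handle)     # who holds the chunk lease?
    for replica in [primary] + secondaries:
        push_data(replica, handle, data)                    # data flow: buffered, not yet applied
    send_write(primary, handle)                             # control flow: request goes to the primary only

def primary_apply(state, forward, handle: int, pending: list[bytes]) -> int:
    """Primary side: serialize buffered mutations, apply locally, forward to secondaries."""
    for serial, data in enumerate(pending):
        state.apply(handle, serial, data)                   # apply in the chosen serial order
        forward(handle, serial, data)                       # secondaries apply in the same order
    return len(pending)
```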

System Interactions: Mutation Order Example. Client A mutates file foo, chunks {1, 3, 5}; client B mutates file foo, chunks {2, 3}. The primary replica picks a single change list, e.g. {A, foo, 1}, {A, foo, 3}, {A, foo, 5}, {B, foo, 2}, {B, foo, 3}, and the secondary replicas apply exactly the same list, so all replicas remain consistent.

System Interactions: Data and Control Flow. Data flow and control flow are decoupled. Control flow goes client, then primary, then secondaries. Data is pushed linearly along a carefully picked chain of chunkservers: each machine forwards to the closest one first (distances estimated from IP addresses), and pipelining is used to exploit full-duplex links.
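
A hedged sketch of both ideas: ordering the forwarding chain by closeness and pushing the data in small pipelined pieces. The distance function and send callback are stand-ins for the IP-address heuristic and the network layer:

```python
def forwarding_chain(replicas: list[str], distance) -> list[str]:
    """Order replicas so that each hop forwards to the nearest not-yet-visited one."""
    chain, remaining = [], set(replicas)
    current = "client"                                   # the chain starts at the client
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

def push(chain: list[str], send, data: bytes, piece: int = 1 << 16) -> None:
    """Pipeline the data down the chain in 64 KB pieces instead of store-and-forward."""
    for offset in range(0, len(data), piece):
        send(chain[0], data[offset:offset + piece])      # each replica forwards the piece onward
```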

System Interactions: Special Operations, Atomic Record Append. Traditional concurrent writes would require something like a distributed lock manager; record append makes concurrent appends serializable. The client specifies only the data; the primary (the lease holder) chooses the offset and the serial order of mutations. If a record append fails at any replica, the client retries, so some replicas may contain duplicates. GFS does NOT guarantee that replicas are bytewise identical; it ONLY guarantees that the data is written at least once atomically.

System Interactions: Atomic Record Append (continued). Concurrent appends to the same file land side by side (record A, record B) and the result is defined. GFS does NOT guarantee the order of concurrent appends (record B may precede record A), but it DOES guarantee that the bytes within one record are never interleaved with another (never record A part I, record B part I, record A part II, record B part II).
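
A hypothetical primary-side routine showing why records are never interleaved and how a chunk is padded when a record does not fit (the client then retries on the next chunk); this is a sketch, not the actual GFS implementation:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk: bytearray, record: bytes) -> int | None:
    """Append a whole record to this chunk; return its offset, or None if it will not fit."""
    offset = len(chunk)
    if offset + len(record) > CHUNK_SIZE:
        chunk.extend(b"\x00" * (CHUNK_SIZE - offset))   # pad the chunk; caller retries on the next chunk
        return None
    chunk.extend(record)                                 # the record is written contiguously, never interleaved
    return offset
```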

System Interactions: Special Operations, Snapshot. Makes a copy of a file or directory tree almost instantaneously, using a standard copy-on-write scheme inspired by AFS. Steps at the master: revoke outstanding leases on the affected files, log the operation to disk, and duplicate the in-memory metadata; the chunks themselves are duplicated ONLY on the first subsequent write request.
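
A copy-on-write sketch over a toy namespace (paths mapped to chunk-handle lists, with per-chunk reference counts); the structures and the copy_chunk callback are assumptions, not the real master state:

```python
namespace: dict[str, list[int]] = {}   # path -> ordered chunk handles
refcount: dict[int, int] = {}          # chunk handle -> number of files referencing it
_next_handle = 1000

def snapshot(src: str, dst: str) -> None:
    """Duplicate metadata only; both files share the same chunks for now."""
    chunks = namespace[src]
    namespace[dst] = list(chunks)
    for h in chunks:
        refcount[h] = refcount.get(h, 1) + 1

def before_write(path: str, index: int, copy_chunk) -> int:
    """On the first write to a shared chunk, make a private copy (the actual copy-on-write)."""
    global _next_handle
    h = namespace[path][index]
    if refcount.get(h, 1) > 1:
        new_h, _next_handle = _next_handle, _next_handle + 1
        copy_chunk(h, new_h)               # chunkservers copy the chunk data locally
        refcount[h] -= 1
        refcount[new_h] = 1
        namespace[path][index] = new_h
        h = new_h
    return h
```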

Master Operations: Namespace Management and Locking; Policies; Garbage Collection; Stale Replica Detection.

Master Operations: Namespace Management. Unlike a UNIX file system, there is no per-directory data structure and no file or directory aliasing. The namespace is represented as a lookup table mapping full pathnames to metadata, stored efficiently with prefix compression.

Master Operations: Namespace Locking. Locks can be taken over regions of the namespace, allowing multiple master operations to be active at once. Example: a snapshot of /home/user to /save/user takes write locks on /home/user and /save/user, while a concurrent create of /home/user/foo takes read locks on /home and /home/user. Read locks are compatible with each other, but a write lock conflicts with both reads and writes, so the two operations are serialized on /home/user. Locks are acquired in a consistent total order to prevent deadlocks.
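
A sketch of acquiring per-path locks in a consistent total order; for simplicity it uses one mutex per path rather than true read-write locks, which is a deliberate simplification of what the paper describes:

```python
import threading
from collections import defaultdict

_locks: dict[str, threading.RLock] = defaultdict(threading.RLock)   # one lock per path component

def path_prefixes(path: str) -> list[str]:
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

def lock_for_create(path: str) -> list[str]:
    """Lock every ancestor plus the leaf, always in sorted (total) order to avoid deadlock."""
    targets = sorted(set(path_prefixes(path)))
    for p in targets:
        _locks[p].acquire()      # a real implementation distinguishes read vs. write locks
    return targets

def unlock(acquired: list[str]) -> None:
    for p in reversed(acquired):
        _locks[p].release()
```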

Master Operations: Policies, Replica Creation. Place new replicas on chunkservers with below-average disk space utilization; limit the number of recent creations on each chunkserver; spread the replicas of a chunk across racks.

Master Operations: Policies, Re-replication. Re-replicate FIRST the chunks that are furthest from their replication goal, FIRST the chunks belonging to live files, and FIRST the chunks that are blocking client progress.

Master Operations: Garbage Collection. Reclamation is lazy: the master logs the deletion immediately, but the storage is NOT reclaimed immediately. The file is renamed to a hidden name carrying a deletion timestamp and is removed three days later (configurable); during that window it can still be read or undeleted.

Master Operations: Garbage Collection, Steps. Scan the file namespace: remove expired hidden files and drop their file-to-chunk mappings from the in-memory metadata. Scan the chunk namespace: identify orphaned chunks (no longer reachable from any file), remove their metadata, and send messages to the chunkservers to delete those chunk replicas.
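
An illustrative sketch of those two scans over a toy master state (namespace: path to chunk handles; chunk_meta: handle to a dict with a "locations" list); the names and the ask_delete callback are hypothetical:

```python
import time

HIDDEN_PREFIX = ".deleted."
GRACE_SECONDS = 3 * 24 * 3600      # removed 3 days later (configurable)

def gc_namespace(namespace: dict, deleted_at: dict) -> None:
    """Scan the file namespace: drop hidden files whose grace period has expired."""
    for path in list(namespace):
        name = path.rsplit("/", 1)[-1]
        if name.startswith(HIDDEN_PREFIX) and time.time() - deleted_at[path] > GRACE_SECONDS:
            del namespace[path]     # the file-to-chunk mapping disappears with the entry

def gc_chunks(namespace: dict, chunk_meta: dict, ask_delete) -> None:
    """Scan the chunk namespace: orphaned chunks are no longer reachable from any file."""
    live = {h for chunks in namespace.values() for h in chunks}
    for handle in list(chunk_meta):
        if handle not in live:
            info = chunk_meta.pop(handle)
            for server in info.get("locations", []):
                ask_delete(server, handle)   # the chunkserver frees the replica at its leisure
```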

Master Operations: Garbage Collection, Advantages. Simple and reliable: chunk creation may fail and replica-deletion messages may be lost, yet GC cleans up regardless. The work is done in batches, so the cost is amortized, and only when the master is relatively free. It also provides a safety net against accidental, irreversible deletion.

Master Operations: Garbage Collection, Disadvantages. The deletion delay hurts when storage is tight, since the space cannot be reused right away. Mitigations: deleting a file a second time expedites storage reclamation, and different parts of the namespace can use different reclamation policies.

Master Operations: Stale Replica Detection. A replica becomes stale when its chunkserver fails and misses mutations to the chunk while it is down. Chunk version numbers are used to detect stale replicas, which are then reclaimed during regular garbage collection.
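
A minimal sketch of version-based staleness checking, assuming the master bumps a chunk's version whenever it grants a new lease and chunkservers report their (handle, version) pairs:

```python
def grant_lease_bump_version(master_versions: dict[int, int], handle: int) -> int:
    """The master increments the chunk version whenever it grants a new lease."""
    master_versions[handle] = master_versions.get(handle, 0) + 1
    return master_versions[handle]

def stale_replicas(master_versions: dict[int, int],
                   reported: dict[str, dict[int, int]]) -> list[tuple[str, int]]:
    """Compare versions reported by chunkservers against the master's record."""
    out = []
    for server, chunks in reported.items():
        for handle, version in chunks.items():
            if version < master_versions.get(handle, 0):
                out.append((server, handle))   # stale: missed one or more mutations
    return out
```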

Fault Tolerance and Diagnosis: High Availability; Data Integrity; Diagnostic Tools.

Fault Tolerance and Diagnosis: High Availability. Fast Recovery; Chunk Replication; Master Replication.

Fault Tolerance and Diagnosis: High Availability, Fast Recovery. Servers are designed to restore their state and start in seconds; there is no distinction between normal and abnormal termination.

Fault Tolerance and Diagnosis: High Availability, Chunk Replication. Different parts of the namespace can have different replication levels. Each chunk is kept fully replicated at ALL times, with re-replication on chunkserver failure and on replica corruption.

Fault Tolerance and Diagnosis: High Availability, Master Replication. The operation log and checkpoints are replicated. On master failure, monitoring infrastructure outside GFS starts a new master process. Shadow masters provide read-only access to the file system and enhance read availability.

Fault Tolerance and Diagnosis: Data Integrity. Checksums detect data corruption: each 64 MB chunk is divided into 64 KB blocks, each with a 32-bit checksum. Verifying checksums before returning data prevents error propagation. The scheme is optimized for appends through incremental checksum updates. The performance impact is small: normal reads span multiple blocks anyway, reads can be aligned at checksum block boundaries, and checksum calculation can be overlapped with I/O.
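
A hedged sketch of block-level checksumming; zlib.crc32 is a stand-in, since the paper does not specify which 32-bit checksum is used:

```python
import zlib

BLOCK = 64 * 1024      # 64 KB checksum blocks inside a 64 MB chunk

def block_checksums(chunk: bytes) -> list[int]:
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, checksums: list[int], offset: int, length: int) -> bytes:
    """Verify every checksum block overlapping the requested range before returning data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt checksum block {b}: report to master, read another replica")
    return chunk[offset:offset + length]
```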

Fault Tolerance and Diagnosis: Diagnostic Tools. Diagnostic logs record chunkservers going up and down as well as RPC requests and replies. They help reconstruct history and diagnose problems, and serve as traces for load testing and performance analysis. The performance impact is minimal because the logs are written sequentially and asynchronously.

Performance. Master metadata fits in RAM even for clusters with a lot of occupied disk space, and chunkserver metadata is well under 1% of the data stored, showing efficient use of resources.

Conclusions. GFS demonstrates how to support large-scale processing workloads on commodity hardware: it is designed to tolerate frequent hardware failures, optimized for huge files that are mostly appended to and read sequentially, and built on a relaxed consistency model and an extended file-system interface. Simple solutions, such as a single master, are good enough.

References. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA, 29-43.