The Google File System

October 13, 2010

Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003.

Outline
1. Assumptions, Interface, Architecture, Single master, Chunk size, Metadata
2. Mutation mechanism, Additional operations
3 4 5

Frequent failures
Hundreds of machines built from inexpensive commodity parts.
Component failures are the norm rather than the exception.
Constant monitoring, error detection, fault tolerance, and prompt automatic recovery must be integral to the system.

Huge files
A modest number of large files; multi-GB files are common.
Small files are supported, but not optimized for.
Design assumptions and parameters such as I/O operation sizes and block sizes had to be revisited.

Writing
Mostly appending new data rather than overwriting existing data.
Large, sequential writes.
Once written, files are seldom modified again.
Appending is the focus of performance optimization and atomicity guarantees.

Reading
Once written, files are only read, often only sequentially.
Workloads consist mostly of large streaming reads and small random reads.
Applications batch and sort their small reads so they advance steadily through the file.

Concurrency
Files are often used as producer-consumer queues or for many-way merging.
Hundreds of producers concurrently append to a single file.
The file may be read later, or a consumer may be reading through the file simultaneously.
Atomicity with minimal synchronization overhead is essential.

Bandwidth vs. latency
High sustained bandwidth is more important than low latency.
Most applications place a premium on processing data in bulk at a high rate.
Few have stringent response-time requirements for an individual read or write.

Interface
GFS doesn't implement a standard API such as POSIX.
Files are organized hierarchically in directories and identified by pathnames.
Standard operations: create, delete, open, close, read, and write.
Additional operations: snapshot and record append.
Snapshot creates a copy of a file or a directory tree at low cost.
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append.

Architecture
A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.
Each of these is a commodity Linux machine running a user-level server process.

Files
Files are divided into fixed-size chunks.
Each chunk is identified by a 64-bit chunk handle.
Chunkservers store chunks on local disks as Linux files.
Each chunk is replicated on multiple chunkservers (default: 3).

Master
Maintains all file system metadata: the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
Controls system-wide activities: chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
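
As a rough illustration of these responsibilities, the master's in-memory tables can be sketched in Python; the names and fields below are illustrative, not the actual GFS implementation:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ChunkInfo:
        handle: int                    # globally unique 64-bit chunk handle
        version: int = 1               # chunk version number, used for stale-replica detection
        locations: List[str] = field(default_factory=list)  # chunkserver addresses, not persisted

    @dataclass
    class FileInfo:
        owner: str                     # access-control information (simplified)
        chunk_handles: List[int] = field(default_factory=list)  # file -> ordered list of chunks

    class MasterMetadata:
        """Illustrative in-memory metadata tables kept by the master."""
        def __init__(self):
            self.namespace: Dict[str, FileInfo] = {}   # full pathname -> file metadata
            self.chunks: Dict[int, ChunkInfo] = {}     # chunk handle -> chunk info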

Communication
The GFS client communicates with the master and chunkservers to read or write data on behalf of the application.
Clients interact with the master only for metadata operations.
All data-bearing communication goes directly to the chunkservers.

Cache
Clients cache only metadata, never file data.
Caching file data offers little benefit because most applications stream through huge files; not having client data caches simplifies the client and the overall system.
Chunkservers need not cache file data either, because chunks are stored as local files (Linux's buffer cache already keeps frequently accessed data in memory).

Single master
Having a single master simplifies the design.
Minimizing its involvement in reads and writes ensures that it does not become a bottleneck.
Clients only ask the master which chunkservers they should contact.
They cache this information for a limited time and interact with the chunkservers directly for many subsequent operations.

Read operation
1. The client translates the file name and byte offset into a chunk index within the file.
2. It sends the master a request containing the file name and chunk index.
3. The master replies with the corresponding chunk handle and the locations of the replicas.
4. The client caches this information.
5. The client then sends a request to one of the replicas.
6. Further reads of the same chunk require no more client-master interaction.
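
A minimal Python sketch of this read path, assuming hypothetical master and chunkserver interfaces (only the 64 MB chunk size comes from the design; everything else here is illustrative):

    CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB fixed chunk size

    class Client:
        """Toy GFS client: caches (path, chunk index) -> (chunk handle, replica locations)."""
        def __init__(self, master, chunkservers):
            self.master, self.chunkservers, self.cache = master, chunkservers, {}

        def read(self, path, offset, length):
            chunk_index = offset // CHUNK_SIZE                # 1. byte offset -> chunk index
            key = (path, chunk_index)
            if key not in self.cache:                         # 2-4. ask the master only on a cache miss
                self.cache[key] = self.master.lookup(path, chunk_index)
            handle, replicas = self.cache[key]
            server = self.chunkservers[replicas[0]]           # 5. contact one replica (e.g. the closest)
            return server.read(handle, offset % CHUNK_SIZE, length)
            # 6. further reads of this chunk hit the cache and skip the master entirely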

Chunk size
Chunks are 64 MB.
Lazy space allocation avoids wasting space due to internal fragmentation.
Advantages of a large chunk size: it reduces the clients' need to interact with the master, it reduces network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time, and it reduces the size of the metadata.

Metadata
Three types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas.
All metadata is kept in the master's memory.
The namespaces and the file-to-chunk mapping are also kept persistent in an operation log stored on the master's local disk and replicated on remote machines.
The master does not store chunk location information persistently; instead, it asks each chunkserver about its chunks.

In-memory data structures
Since metadata is stored in memory, master operations are fast.
The amount of memory the master has is not a serious limitation: the master maintains less than 64 bytes of metadata for each 64 MB chunk, and the per-file namespace data is similarly compact.
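
A back-of-the-envelope calculation makes this concrete (the 64-bytes-per-chunk figure is from the paper; the amount of stored data is an arbitrary example):

    CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB per chunk
    METADATA_PER_CHUNK = 64                # upper bound from the paper, in bytes

    stored_bytes = 1_000 * 1024**4         # e.g. 1000 TiB of file data (illustrative)
    num_chunks = stored_bytes // CHUNK_SIZE
    metadata_bytes = num_chunks * METADATA_PER_CHUNK
    print(f"{num_chunks} chunks -> about {metadata_bytes / 1024**2:.0f} MiB of chunk metadata")
    # roughly 1 GiB of chunk metadata per PiB of data, which easily fits in the master's memory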

Chunk locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk.
It polls chunkservers for that information at startup and periodically thereafter (with HeartBeat messages).
This eliminates the problem of keeping the master and chunkservers in sync.

Operation log
Contains a historical record of critical metadata changes.
Serves as a logical timeline that defines the order of concurrent operations.
It is replicated on multiple machines.
The master responds to a client operation only after flushing the corresponding log record to disk.

Operation log
The master recovers its file system state by replaying the operation log.
To minimize startup time, the log must be kept small: the master checkpoints its state whenever the log grows beyond a certain size.
The checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used for namespace lookup without extra parsing.
A failure during checkpointing does not affect correctness, because the recovery code detects and skips incomplete checkpoints.
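
The log-plus-checkpoint recovery idea can be sketched as follows. This is a toy Python version: the JSON files, record format, and size threshold are illustrative stand-ins, and the real checkpoint is a memory-mappable B-tree image rather than JSON:

    import json, os

    LOG = "oplog.jsonl"                      # illustrative file names
    CHECKPOINT = "checkpoint.json"
    CHECKPOINT_THRESHOLD = 10_000            # "log grows beyond a certain size" (bytes, illustrative)

    def log_and_apply(state, record):
        """Flush the log record to disk before applying it and replying to the client."""
        with open(LOG, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())
        state[record["path"]] = record["chunks"]

    def maybe_checkpoint(state):
        """Checkpoint the in-memory state and truncate the log once it grows too large."""
        if os.path.exists(LOG) and os.path.getsize(LOG) > CHECKPOINT_THRESHOLD:
            with open(CHECKPOINT, "w") as f:
                json.dump(state, f)
            open(LOG, "w").close()           # older log records are now redundant

    def recover():
        """Rebuild the master's state: load the latest checkpoint, then replay the remaining log."""
        state = {}
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                state = json.load(f)
        if os.path.exists(LOG):
            with open(LOG) as f:
                for line in f:
                    record = json.loads(line)
                    state[record["path"]] = record["chunks"]
        return state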

Leases and mutation order
A mutation is an operation that changes the contents or metadata of a chunk (e.g. a write).
Leases are used to maintain a consistent mutation order across replicas.
The master grants a chunk lease to one of the replicas, which we call the primary.
The primary picks a serial order for all mutations to the chunk, and all replicas follow this order when applying mutations.
A lease has an extendible 60-second timeout.
Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
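
A rough sketch of the lease bookkeeping this implies on the master (the 60-second timeout is from the design; the data structures and the choice of primary are illustrative):

    import time

    LEASE_TIMEOUT = 60.0                     # seconds; extendible while mutations continue

    class LeaseTable:
        """Illustrative lease bookkeeping on the master: at most one primary per chunk at a time."""
        def __init__(self):
            self.leases = {}                 # chunk handle -> (primary replica, expiry time)

        def grant(self, handle, replicas):
            primary, expiry = self.leases.get(handle, (None, 0.0))
            if primary in replicas and time.time() < expiry:
                return primary               # an unexpired lease is still valid
            primary = replicas[0]            # pick an up-to-date replica as the new primary
            self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)
            return primary

        def extend(self, handle, primary):
            current, _ = self.leases.get(handle, (None, 0.0))
            if current == primary:           # extension requests piggyback on HeartBeat messages
                self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)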

Data flow
The flow of data is decoupled from the flow of control to use the network efficiently.
Control flows from the client to the primary and then to all secondaries.
Data is pushed linearly along a carefully picked chain of chunkservers.
Once a chunkserver receives some data, it starts forwarding it immediately.
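
The linear push can be sketched with toy chunkserver objects (purely illustrative; the real system streams data over TCP connections and overlaps receiving with forwarding at each hop):

    class Chunkserver:
        """Toy chunkserver that buffers pushed data and forwards it down the chain."""
        def __init__(self, name, next_hop=None):
            self.name, self.next_hop, self.buffer = name, next_hop, []

        def push(self, piece):
            self.buffer.append(piece)        # buffer the piece until the write request arrives
            if self.next_hop is not None:
                self.next_hop.push(piece)    # forward immediately, before the whole transfer finishes

    # The client pushes along a chain picked by network distance, e.g. s1 -> s2 -> s3;
    # the control message (the write request) flows separately via the primary.
    s3 = Chunkserver("s3")
    s2 = Chunkserver("s2", s3)
    s1 = Chunkserver("s1", s2)
    for piece in (b"chunk data part 1", b"chunk data part 2"):
        s1.push(piece)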

Write control and data flow - scheme (figure).

Record append
The client specifies only the data; GFS appends it to the file at least once atomically, at an offset of GFS's choosing.
If appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB), the chunk is padded up to the maximum size and the next chunk is created.

Record append
For the operation to report success, the data must have been written at the same offset on all replicas of some chunk.
If a record append fails at any replica, the client retries the operation.
As a result, replicas of the same chunk may contain different data, possibly including duplicates of the same record.
GFS does not guarantee that all replicas are bytewise identical; it only guarantees that the data is written at least once as an atomic unit.
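
The primary's decision for a record append might look roughly like this (an illustrative sketch; replica coordination and client retries are omitted):

    CHUNK_SIZE = 64 * 1024 * 1024            # maximum chunk size

    def record_append(primary_state, record):
        """Illustrative primary-side logic: append at an offset of GFS's choosing,
        or pad the current chunk and tell the client to retry on the next chunk."""
        used = primary_state["used"]                       # bytes already in the current chunk
        if used + len(record) > CHUNK_SIZE:
            primary_state["used"] = CHUNK_SIZE             # pad the current chunk to its maximum size
            return ("RETRY_ON_NEXT_CHUNK", None)           # client retries; a new chunk is created
        offset = used
        primary_state["used"] = used + len(record)         # all replicas are told to write at this offset
        return ("OK", offset)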

Snapshot
The snapshot operation makes a copy of a file or a directory tree.
It uses standard copy-on-write techniques.

Snapshot
1. When the master receives a snapshot request, it first revokes any relevant leases.
2. Then, the master logs the operation to disk.
3. It then applies this log record to its in-memory state by duplicating the metadata.
4. The newly created snapshot files point to the same chunks as the source files.
5. The next time one of these chunks is to be written, the master notices that its reference count is greater than one.
6. It then asks each chunkserver that has a current replica of the original chunk to create a local copy of it.
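
A toy sketch of the copy-on-write bookkeeping behind these steps (reference counts and handle allocation are illustrative; lease revocation and logging are omitted):

    class SnapshotMaster:
        """Toy master metadata for copy-on-write snapshots: files map to chunk handles,
        and a per-chunk reference count tells us when a write must first copy the chunk."""
        def __init__(self):
            self.files = {}                          # path -> list of chunk handles
            self.refcount = {}                       # chunk handle -> number of referencing files
            self.next_handle = 1

        def create_file(self, path, num_chunks):
            handles = list(range(self.next_handle, self.next_handle + num_chunks))
            self.next_handle += num_chunks
            self.files[path] = handles
            for h in handles:
                self.refcount[h] = 1

        def snapshot(self, src, dst):
            # steps 1-4: after revoking leases and logging (omitted), duplicate only the metadata;
            # the snapshot file points at exactly the same chunks as the source file
            self.files[dst] = list(self.files[src])
            for h in self.files[dst]:
                self.refcount[h] += 1

        def prepare_write(self, path, index):
            # steps 5-6: on the next write, a shared chunk is cloned before being mutated
            h = self.files[path][index]
            if self.refcount[h] > 1:
                self.refcount[h] -= 1
                new_h = self.next_handle
                self.next_handle += 1
                self.refcount[new_h] = 1
                self.files[path][index] = new_h      # chunkservers clone h into new_h locally
            return self.files[path][index]

    m = SnapshotMaster()
    m.create_file("/home/user/data", num_chunks=2)
    m.snapshot("/home/user/data", "/save/user/data")
    print(m.prepare_write("/home/user/data", 0))     # a new handle: the shared chunk was copied
    print(m.prepare_write("/save/user/data", 1))     # the other chunk is copied too on its first write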

Master operation
The master executes all namespace operations.
It also manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage.

Namespace management and locking
GFS represents its namespace as a lookup table mapping full pathnames to metadata.
Each node in the namespace tree has an associated read-write lock, and each master operation acquires a set of locks before it runs: read locks on all superdirectory pathnames and a read or write lock on the full pathname itself.
Creating a file doesn't require a write lock on the parent directory, as there is no inode-like data structure.
Multiple file creations can therefore be executed concurrently in the same directory.
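
The lock set for an operation can be sketched as a small function (illustrative; the real master also acquires locks in a consistent total order to prevent deadlock):

    def locks_to_acquire(path, write):
        """Illustrative lock set for a master operation on `path`: read locks on every
        ancestor directory name, plus a read or write lock on the full pathname itself."""
        parts = path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        return [(p, "read") for p in ancestors] + [(path, "write" if write else "read")]

    # Creating /home/user/foo read-locks /home and /home/user and write-locks the new file, so it
    # is serialized against any operation that write-locks /home/user, while two creations in the
    # same directory only share read locks on it and can run concurrently.
    print(locks_to_acquire("/home/user/foo", write=True))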

Replica placement
The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization.
Replicas are therefore spread across different machines and across different racks.

Chunk creation
When the master creates a chunk, it chooses where to place the initially empty replicas.
It considers several factors: chunkservers with below-average disk space utilization are preferred, the number of recent creations on each chunkserver should be limited, and replicas of a chunk should be spread across racks.
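
These placement heuristics can be sketched as a simple selection function (the threshold, server records, and scoring below are invented for illustration):

    def choose_chunkservers(servers, num_replicas=3, creation_limit=5):
        """Illustrative placement: prefer servers with low disk utilization, skip servers
        with too many recent creations, and spread the chosen replicas across racks."""
        candidates = [s for s in servers if s["recent_creations"] < creation_limit]
        candidates.sort(key=lambda s: s["disk_utilization"])       # emptiest disks first
        chosen, racks_used = [], set()
        for s in candidates:                                       # first pass: one replica per rack
            if s["rack"] not in racks_used:
                chosen.append(s)
                racks_used.add(s["rack"])
            if len(chosen) == num_replicas:
                return chosen
        for s in candidates:                                       # fall back if racks run out
            if s not in chosen:
                chosen.append(s)
            if len(chosen) == num_replicas:
                break
        return chosen

    servers = [
        {"name": "cs1", "rack": "r1", "disk_utilization": 0.40, "recent_creations": 1},
        {"name": "cs2", "rack": "r1", "disk_utilization": 0.20, "recent_creations": 0},
        {"name": "cs3", "rack": "r2", "disk_utilization": 0.55, "recent_creations": 2},
        {"name": "cs4", "rack": "r3", "disk_utilization": 0.35, "recent_creations": 9},  # too busy
    ]
    print([s["name"] for s in choose_chunkservers(servers)])       # -> ['cs2', 'cs3', 'cs1']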

Re-replication
The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal.
The priority of re-replication is based on several factors: how far the chunk is from its replication goal, whether it belongs to a live file (chunks from live files are replicated before chunks that belong to recently deleted files), and whether it is blocking client progress (such chunks are prioritized).

Rebalancing
Rebalancing is performed periodically: the master examines the current replica distribution and moves replicas for better disk space and load balancing.
Through this process, the master gradually fills up new chunkservers.
Replicas are removed from the chunkservers with below-average free space.

Garbage collection
GFS does not immediately reclaim the available physical storage after a file is deleted.
The master logs the file's deletion immediately, and the file is renamed to a hidden name.
During the master's regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days.
In a similar scan, the master identifies orphaned chunks and erases the metadata for those chunks.
In a HeartBeat message, each chunkserver reports what chunks it has, and the master replies with the chunks that are no longer present in the master's metadata.

Garbage collection
Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful.
It merges storage reclamation into the regular background activities of the master.
The delay in reclaiming storage provides a safety net against accidental, irreversible deletion.
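
A toy sketch of the deletion-by-rename and lazy-scan idea (the hidden-name format and the namespace representation are illustrative; only the three-day window comes from the design):

    import time

    HIDDEN_RETENTION = 3 * 24 * 3600         # three days

    def delete_file(namespace, path):
        """Deletion is just a rename to a hidden name; storage is reclaimed lazily."""
        hidden = f"{path}.__deleted__.{int(time.time())}"   # illustrative hidden-name format
        namespace[hidden] = namespace.pop(path)
        return hidden

    def namespace_scan(namespace, now=None):
        """Regular scan: drop hidden files older than the retention window.
        Chunks they referenced become orphaned and are erased in a similar scan;
        chunkservers learn about removable replicas via HeartBeat replies."""
        now = now or time.time()
        for name in list(namespace):
            if "__deleted__" in name:
                deleted_at = int(name.rsplit(".", 1)[1])
                if now - deleted_at > HIDDEN_RETENTION:
                    del namespace[name]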

Stale replica detection
Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down.
For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas.
The master removes stale replicas in its regular garbage collection.
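
A minimal sketch of the version-number bookkeeping (illustrative data structures; in the real system the master and the replicas record the new version number persistently):

    def grant_lease_and_bump_version(chunk, up_to_date_replicas):
        """Illustrative version bump, done whenever the master grants a new lease."""
        chunk["version"] += 1
        for replica in up_to_date_replicas:
            replica["version"] = chunk["version"]    # up-to-date replicas record the new version

    def is_stale(chunk, replica):
        """A replica that missed the version bump (e.g. because it was down) is stale."""
        return replica["version"] < chunk["version"]

    chunk = {"handle": 42, "version": 3}
    alive = [{"server": "cs1", "version": 3}, {"server": "cs2", "version": 3}]
    down = {"server": "cs3", "version": 3}           # misses the mutation while its server is down
    grant_lease_and_bump_version(chunk, alive)
    print(is_stale(chunk, down))                     # True: cs3's replica is garbage-collected later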

High availability
Achieved through fast recovery of the master and chunkservers, chunk replication, and master replication.

Master replication
The operation log and checkpoints are replicated on multiple machines.
A mutation is considered committed only after its log record has been flushed locally and on all replicas.
If the master machine fails, monitoring infrastructure starts a new master process elsewhere.
Shadow masters provide read-only access to the file system even when the primary master is down.
A shadow master reads a replica of the operation log and applies the same changes to its data structures exactly as the primary does.
Like the primary, it polls chunkservers at startup and exchanges frequent handshake messages with them to monitor their status.

Data integrity
Chunkservers use checksumming to detect corruption of stored data.
A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum.
Checksums are kept in memory and stored persistently with logging.
For reads, the chunkserver verifies the checksum of the data blocks that overlap the read range before returning any data.
If a block doesn't match its checksum, the chunkserver returns an error and reports the mismatch to the master, which clones the chunk from another replica; the invalid replica is then removed.

Data integrity
Checksum computation is optimized for appends: the checksum for the last partial block is updated incrementally, and new checksums are computed for any brand-new blocks filled by the append.
For writes that overwrite existing data, the first and last blocks of the range being overwritten must be read and verified first.
During idle periods, chunkservers scan and verify the contents of inactive chunks.
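
A small sketch of per-block checksumming and read-time verification (CRC32 and the sample sizes below are illustrative; the design only specifies 64 KB blocks with 32-bit checksums):

    import zlib

    BLOCK_SIZE = 64 * 1024                   # 64 KB blocks, each with a 32-bit checksum

    def checksum_blocks(data):
        """Compute a 32-bit checksum per 64 KB block (CRC32 here stands in for the real function)."""
        return [zlib.crc32(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]

    def verify_read(data, checksums, offset, length):
        """Verify every block that overlaps the requested range before returning any data."""
        first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != checksums[b]:
                raise IOError(f"corrupt block {b}: report to master and re-clone from another replica")
        return data[offset:offset + length]

    chunk_data = bytes(200 * 1024)           # a 200 KB chunk for illustration
    sums = checksum_blocks(chunk_data)
    print(len(verify_read(chunk_data, sums, offset=70 * 1024, length=10 * 1024)))   # 10240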

Micro-benchmarks