The Google File System

The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. ACM Symposium on Operating Systems Principles (SOSP), October 2003. Publisher: ACM. Presented Nov. 26, 2008.

OUTLINE INTRODUCTION DESIGN OVERVIEW SYSTEM INTERACTIONS MASTER OPERATION FAULT TOLERANCE AND DIAGNOSIS MEASUREMENTS CONCLUSIONS

INTRODUCTION GFS shares many of the same goals as previous distributed file systems but departs from some earlier file system design assumptions. Multiple GFS clusters are currently deployed for different purposes.

DESIGN OVERVIEW Assumptions Interface Architecture Single Master Chunk Size Metadata Consistency Model

DESIGN OVERVIEW Assumptions The system is built from many inexpensive commodity components that often fail. The system stores a modest number of large files. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.

DESIGN OVERVIEW Assumptions The workloads also have many large, sequential writes that append data to files. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. High sustained bandwidth is more important than low latency.

DESIGN OVERVIEW Interface GFS provides a familiar file system interface but does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. The usual operations are supported: create, delete, open, close, read, and write, plus two special operations, snapshot and record append.

DESIGN OVERVIEW Architecture A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.

DESIGN OVERVIEW Architecture [Figure: GFS architecture. Applications link against the GFS client, which talks to the single GFS master and to GFS chunkservers, each storing chunks as files in a local Linux file system. A file consists of chunks identified by immutable, globally unique 64-bit chunk handles; chunk data is accessed by chunk handle and byte range, and each chunk is replicated on multiple chunkservers.]

DESIGN OVERVIEW Architecture The GFS master maintains all file system metadata: the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
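
To make this concrete, here is a minimal sketch of the master's in-memory metadata, assuming simplified, illustrative data structures (this is not GFS source code):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileMetadata:
    """Per-file state kept by the master (illustrative fields only)."""
    acl: str = ""                                            # access control information
    chunk_handles: List[int] = field(default_factory=list)   # mapping from the file to its chunks

@dataclass
class MasterState:
    """The master's in-memory metadata, heavily simplified."""
    namespace: Dict[str, FileMetadata] = field(default_factory=dict)     # full pathname -> file metadata
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)  # chunk handle -> chunkserver addresses
    next_handle: int = 0

    def create_file(self, path: str) -> None:
        self.namespace[path] = FileMetadata()

    def add_chunk(self, path: str) -> int:
        """Assign a new, globally unique chunk handle to the file's next chunk."""
        handle = self.next_handle
        self.next_handle += 1
        self.namespace[path].chunk_handles.append(handle)
        # Locations are not persisted; they are learned from chunkservers at
        # startup and kept fresh via HeartBeat messages.
        self.chunk_locations[handle] = []
        return handle
```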

DESIGN OVERVIEW Architecture The master also controls system-wide activities: chunk lease management, garbage collection, and chunk migration between chunkservers. It communicates with each chunkserver through periodic HeartBeat messages. Neither the client nor the chunkserver caches file data.

DESIGN OVERVIEW Single Master [Figure 1: read path. The client translates a file name and byte offset into a chunk index and sends the master a (file name, chunk index) request; the master replies with the chunk handle and chunk locations (e.g., /foo/bar, chunk handle 2ef0), which the client caches. The client then sends a (chunk handle, byte range) request directly to a chunkserver, which returns the chunk data. Control messages flow between client and master and between master and chunkservers (instructions and chunkserver state); data messages flow only between clients and chunkservers.]
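
A hedged sketch of the client-side read path in Figure 1; the RPC helpers `ask_master` and `read_from_chunkserver` are hypothetical placeholders passed in as callables:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

# Client-side cache: (file name, chunk index) -> (chunk handle, replica locations)
chunk_cache = {}

def read(filename, offset, length, ask_master, read_from_chunkserver):
    """Translate (file name, byte offset) into chunk reads, as in Figure 1."""
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE           # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE           # byte range within the chunk
        key = (filename, chunk_index)
        if key not in chunk_cache:
            # One request to the master; the reply is cached so further reads
            # of the same chunk need no master interaction.
            chunk_cache[key] = ask_master(filename, chunk_index)
        handle, locations = chunk_cache[key]
        n = min(length, CHUNK_SIZE - chunk_offset)   # do not read past the chunk boundary
        data += read_from_chunkserver(locations[0], handle, chunk_offset, n)
        offset += n
        length -= n
    return data
```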

DESIGN OVERVIEW Chunk Size The chunk size is 64 MB. Each chunk replica is stored as a plain Linux file on a chunkserver. A large chunk size reduces the clients' need to interact with the master, reduces network overhead, and reduces the size of the metadata stored on the master. The drawback is that small files consisting of a single chunk can become hot spots.

DESIGN OVERVIEW Metadata The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory.

DESIGN OVERVIEW Metadata In-Memory Data Structures Master operations are fast, and it is easy and efficient for the master to periodically scan its entire state in the background. Capacity is limited by how much memory the master has, but the cost of adding extra memory is far less than the benefits gained.
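
As a rough back-of-the-envelope check (the per-chunk figure is the one reported in the GFS paper): the master keeps less than 64 bytes of metadata per 64 MB chunk, so a petabyte of file data corresponds to about 16 million chunks and on the order of 1 GB of master memory.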

DESIGN OVERVIEW Metadata Chunk Locations The master does not keep a persistent record of chunk locations; it polls chunkservers for that information at startup and keeps it up to date with regular HeartBeat messages. This avoids having to keep the master and chunkservers in sync as chunkservers join and leave, change names, fail, and restart.

DESIGN OVERVIEW Metadata Operation Log The operation log contains a historical record of critical metadata changes. It is replicated on multiple remote machines, and the master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely.

DESIGN OVERVIEW Metadata Operation Log The master recovers its file system state by replaying the operation log. The master checkpoints its state whenever the log grows beyond a certain size, and a new checkpoint can be created without delaying incoming mutations.
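
A minimal sketch of this recovery path, assuming a toy log format of (operation, path) records; the helper names and the pickle-based checkpoint are illustrative, not the real GFS formats:

```python
import pickle

def save_checkpoint(state, path="checkpoint.bin"):
    """Dump the master's metadata once the log grows beyond a threshold."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def apply_mutation(state, record):
    """Apply one logged metadata mutation (illustrative operations only)."""
    op, path = record
    if op == "create":
        state.setdefault(path, [])    # new file with no chunks yet
    elif op == "delete":
        state.pop(path, None)

def recover(checkpoint_path, log_records):
    """Recover master state: load the latest checkpoint, then replay the
    (usually short) tail of the operation log recorded after it."""
    with open(checkpoint_path, "rb") as f:
        state = pickle.load(f)
    for record in log_records:        # e.g. ("create", "/foo"), ("delete", "/bar")
        apply_mutation(state, record)
    return state
```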

DESIGN OVERVIEW Consistency Model GFS has a relaxed consistency model that supports highly distributed applications well but remains relatively simple and efficient to implement. Two aspects: guarantees by GFS, and implications for applications.

DESIGN OVERVIEW Consistency Model Guarantees by GFS File namespace mutations are atomic. The state of a file region after a data mutation depends on the type of mutation and whether it succeeds: a region can be consistent or inconsistent, and defined or undefined (Table 1).

DESIGN OVERVIEW Consistency Model Guarantees by GFS

Table 1: File Region State after Mutation

                        Write                       Record Append
Serial success          defined                     defined interspersed with inconsistent
Concurrent successes    consistent but undefined    defined interspersed with inconsistent
Failure                 inconsistent                inconsistent

DESIGN OVERVIEW Consistency Model Implications for Applications Applications can accommodate the relaxed model with a few simple techniques: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
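
One way an application can produce self-validating, self-identifying records (a hedged sketch; GFS does not dictate a record format) is to frame each appended record with its length, a unique record ID, and a checksum, so a reader can skip padding, detect corruption, and discard duplicates from retried appends:

```python
import struct, zlib

def encode_record(record_id: int, payload: bytes) -> bytes:
    """Self-identifying, self-validating record: length + id + CRC-32 + payload."""
    header = struct.pack("!IQI", len(payload), record_id, zlib.crc32(payload))
    return header + payload

def decode_records(data: bytes):
    """Yield valid records, skipping padding/garbage and duplicated appends."""
    seen, pos = set(), 0
    while pos + 16 <= len(data):
        length, record_id, crc = struct.unpack_from("!IQI", data, pos)
        payload = data[pos + 16 : pos + 16 + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:          # duplicates come from retried appends
                seen.add(record_id)
                yield record_id, payload
            pos += 16 + length
        else:
            pos += 1                           # corrupt or padded region: resynchronize byte by byte
```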

SYSTEM INTERACTIONS Leases and Mutation Order Data Flow Atomic Record Appends Snapshot

SYSTEM INTERACTIONS Leases and Mutation Order Each mutation is performed at all of a chunk's replicas. Leases are used to maintain a consistent mutation order across replicas. The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds.

SYSTEM INTERACTIONS Leases and Mutation Order [Figure 2: Write Control and Data Flow. (1) The client asks the master which chunkserver holds the lease (the primary) and where the other replicas are; (2) the master replies; (3) the client pushes the data to all replicas; (4) the client sends the write request to the primary; (5) the primary forwards the request to the secondary replicas A and B; (6) the secondaries reply to the primary; (7) the primary replies to the client. Control messages flow between client, master, and replicas; data flows along a chain of chunkservers.]
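
A hedged sketch of how the primary uses its lease to impose a single mutation order across replicas; the apply and forward helpers are hypothetical placeholders:

```python
import time

class PrimaryReplica:
    """Holds the chunk lease and serializes concurrent mutations."""
    LEASE_TIMEOUT = 60.0   # seconds, matching the initial lease timeout above

    def __init__(self, lease_granted_at, secondaries, apply_locally, forward):
        self.lease_granted_at = lease_granted_at
        self.secondaries = secondaries        # addresses of the secondary replicas
        self.apply_locally = apply_locally    # callable: (serial, mutation) -> None
        self.forward = forward                # callable: (secondary, serial, mutation) -> bool
        self.next_serial = 0

    def mutate(self, mutation):
        if time.time() - self.lease_granted_at > self.LEASE_TIMEOUT:
            return "error: lease expired, client must re-ask the master"
        serial = self.next_serial             # the serial number *is* the mutation order
        self.next_serial += 1
        self.apply_locally(serial, mutation)
        # All secondaries apply mutations in the same serial-number order.
        ok = all(self.forward(s, serial, mutation) for s in self.secondaries)
        return "success" if ok else "error: client retries, region may be inconsistent"
```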

SYSTEM INTERACTIONS Data Flow The flow of data is decoupled from the flow of control to use the network efficiently. Each machine forwards the data to the closest machine in the network topology that has not yet received it. [Diagram: the client pushes data along a chain of chunkservers S1 through S4.]

SYSTEM INTERACTIONS Data Flow Latency is minimized by pipelining the data transfer over TCP connections on a switched network with full-duplex links. The ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput (100 Mbps) and L is the latency to transfer bytes between two machines (far below 1 ms).
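
A quick sanity check of the formula with the numbers above (a sketch; L is taken as a 1 ms upper bound):

```python
def ideal_transfer_time(num_bytes, num_replicas, throughput=100e6 / 8, latency=1e-3):
    """B/T + R*L, with T in bytes per second and L in seconds."""
    return num_bytes / throughput + num_replicas * latency

# 1 MB pushed to 3 replicas over a 100 Mbps (12.5 MB/s) link:
# 1 MB / 12.5 MB/s + 3 * 1 ms ~= 80 ms + 3 ms, so the pipeline is dominated by B/T.
print(round(ideal_transfer_time(1_000_000, 3) * 1000, 1), "ms")   # -> 83.0 ms
```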

SYSTEM INTERACTIONS Atomic Record Appends The client specifies only the data, not the offset. Many clients on different machines append to the same file concurrently; such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients. Record append follows the control flow of a regular write, with a little extra logic at the primary.
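
A hedged sketch of the "little extra logic at the primary" for record append: if the record does not fit in the chunk, the primary pads the chunk and tells the client to retry on the next chunk; otherwise it chooses the offset, so every replica writes the record at the same place (the data structures are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append_at_primary(chunk, record):
    """chunk: dict with a 'data' bytearray standing in for the replica's contents."""
    used = len(chunk["data"])
    if used + len(record) > CHUNK_SIZE:
        # Pad the rest of the chunk (secondaries do the same) and ask the
        # client to retry the append on the next chunk of the file.
        chunk["data"].extend(b"\x00" * (CHUNK_SIZE - used))
        return None, "retry on next chunk"
    offset = used                        # the primary chooses the offset...
    chunk["data"].extend(record)         # ...appends locally...
    return offset, "forward (offset, record) to secondaries"   # ...and secondaries write at the same offset
```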

SYSTEM INTERACTIONS Snapshot Makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions of ongoing mutations. It is implemented using standard copy-on-write techniques.

SYSTEM INTERACTIONS Snapshot [Diagram: handling a snapshot request. The GFS client sends the master a snapshot request for a source directory. The master first revokes any outstanding leases on the chunks of the files about to be snapshotted (e.g., chunk C), logs the operation to the operation log, and then duplicates the metadata for the source directory tree into the destination directory; the snapshot files initially point to the same chunks (C) as the source files.]

SYSTEM INTERACTIONS Snapshot [Diagram: copy-on-write after a snapshot. When a client first wants to write to chunk C, it asks the master for the current lease holder. The master notices that the reference count for chunk C is greater than 1, so it asks each chunkserver holding a replica of C to create a local copy, a new chunk C'. It then grants one replica of C' a lease and replies to the client, which writes to C' as it would to any chunk.]
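
A minimal sketch of the reference-count bookkeeping behind this copy-on-write step, assuming a toy in-memory master state (the helper `copy_chunk_on_chunkservers` and the handle allocation are illustrative):

```python
def snapshot(master, src_dir, dst_dir):
    """Duplicate metadata: snapshot files share chunks with the source, so bump refcounts."""
    for path, handles in list(master["files"].items()):
        if path == src_dir or path.startswith(src_dir + "/"):
            new_path = dst_dir + path[len(src_dir):]
            master["files"][new_path] = list(handles)
            for h in handles:
                master["refcount"][h] += 1

def handle_first_write(master, path, index, new_handle, copy_chunk_on_chunkservers):
    """First write to chunk C of `path` after a snapshot: refcount > 1 forces a copy C'."""
    handle = master["files"][path][index]
    if master["refcount"][handle] > 1:
        copy_chunk_on_chunkservers(handle, new_handle)   # each replica copies C locally
        master["refcount"][handle] -= 1
        master["refcount"][new_handle] = 1
        master["files"][path][index] = new_handle        # the file now points at C'
        return new_handle                                # a lease is granted on C'
    return handle
```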

MASTER OPERATIONS Namespace Management and Locking Replica Placement Creation, Re-replication, Rebalancing Garbage Collection Stale Replica Detection

MASTER OPERATIONS Namespace Mgt and Locking GFS allows multiple master operations to be active at once and uses locks over regions of the namespace to ensure proper serialization. The namespace is logically represented as a lookup table mapping full pathnames to metadata. Each node in the namespace tree has an associated read-write lock, and each master operation acquires a set of locks before it runs.

MASTER OPERATIONS Namespace Mgt and Locking For an operation involving /d1/d2/.../dn/leaf, the master acquires read locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf.

MASTER OPERATIONS Namespace Mgt and Locking Example: how this locking mechanism prevents a file /home/user/foo from being created while /home/user is being snapshotted to /home/save (see the sketch below):

              Snapshot operation        File creation operation
Read locks    /home, /save              /home, /home/user
Write locks   /home/user, /save/user    /home/user/foo

The two operations conflict on /home/user (write lock versus read lock), so they are properly serialized.
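
A small sketch of the locking rule and why the two operations above conflict; this only computes the lock sets rather than taking real locks, and it assumes the creating operation takes a write lock on the new file name:

```python
def lock_sets(pathname, write_leaf):
    """Return (read_locks, write_locks) a master operation must acquire for pathname."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    if write_leaf:
        return set(ancestors), {leaf}
    return set(ancestors) | {leaf}, set()

# Snapshotting /home/user into /save/user: read locks on /home and /save,
# write locks on /home/user and /save/user.
snap_r = lock_sets("/home/user", True)[0] | lock_sets("/save/user", True)[0]
snap_w = lock_sets("/home/user", True)[1] | lock_sets("/save/user", True)[1]

# Creating /home/user/foo: read locks on /home and /home/user, write lock on the leaf.
create_r, create_w = lock_sets("/home/user/foo", True)

# The operations conflict because one wants /home/user as a write lock and the
# other as a read lock, so the master serializes them.
print(snap_w & (create_r | create_w))   # {'/home/user'}
```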

MASTER OPERATIONS Replica Placement There are hundreds of chunkservers spread across many machine racks, and communication between two machines on different racks may cross one or more network switches. The placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization.

MASTER OPERATIONS Creation, Re-replication, Rebalancing Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing. When creating a chunk, the master considers several factors (see the sketch below): place new replicas on chunkservers with below-average disk space utilization; limit the number of recent creations on each chunkserver; spread replicas of a chunk across racks.
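
A hedged sketch of a placement policy honoring the three factors above; the threshold, the sort order, and the data layout are invented for illustration (the paper lists the factors but not exact formulas):

```python
def choose_chunkservers(servers, num_replicas=3, recent_creation_limit=5):
    """servers: list of dicts with 'name', 'rack', 'disk_util', 'recent_creations'."""
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s["disk_util"] <= avg_util                        # below-average disk utilization
                  and s["recent_creations"] < recent_creation_limit]   # avoid creation hot spots
    candidates.sort(key=lambda s: s["disk_util"])
    chosen, racks = [], set()
    for s in candidates:                    # prefer spreading replicas across racks
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    for s in candidates:                    # fall back to reusing racks if needed
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]
```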

MASTER OPERATIONS Creation, Re-replication, Rebalancing The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. Each chunk that needs to be re-replicated is prioritized based on several factors. The master picks the highest-priority chunk and clones it by instructing chunkservers.

MASTER OPERATIONS Creation, Re-replication, Rebalancing The master rebalances replicas periodically. It gradually fills up a new chunkserver rather than instantly swamping it with new chunks.

MASTER OPERATIONS Garbage Collection [Diagram: a client asks to delete the file /foo. The master logs the deletion and renames the file to a hidden name that includes the deletion timestamp (e.g., /.foo-20081126); the chunks themselves are not reclaimed immediately. The master later removes such hidden files and their orphaned chunks lazily during its regular background scans, and chunkservers learn which of their chunks can be deleted.]

MASTER OPERATIONS Stale Replica Detection For each chunk, the master maintains a chunk version number. The version number is increased whenever the master grants a new lease on the chunk. The master removes stale replicas in its regular garbage collection. The client and the chunkserver also verify the version number when performing an operation, so they always access up-to-date data.
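
A small sketch of the version-number comparison, with illustrative data structures:

```python
def detect_stale_replicas(master_versions, reported):
    """master_versions: chunk handle -> current version.
    reported: iterable of (chunkserver, handle, version) tuples from HeartBeats.
    Returns the replicas the master will later garbage-collect as stale."""
    stale = []
    for server, handle, version in reported:
        if version < master_versions[handle]:
            stale.append((server, handle))        # this replica missed mutations while down
        elif version > master_versions[handle]:
            # A higher version means the master failed after granting a lease;
            # it adopts the higher version as the up-to-date one.
            master_versions[handle] = version
    return stale
```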

FAULT TOLERANCE AND DIAGNOSIS High Availability Data Integrity Diagnostic Tools

FAULT TOLERANCE AND DIAGNOSIS High Availability Fast recovery. Chunk replication. Master replication: the operation log and checkpoints are replicated on multiple machines, and monitoring infrastructure outside GFS starts a new master process elsewhere if the master fails.

FAULT TOLERANCE AND DIAGNOSIS Data Integrity Each chunkserver uses checksumming to detect corruption of stored data. A chunk is broken up into 64 KB blocks, and each block has a 32-bit checksum. Checksumming has little effect on read performance.
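
A minimal sketch of block-level checksumming as described above, using CRC-32 as a stand-in 32-bit checksum (the paper does not name the exact checksum function):

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks, each with its own 32-bit checksum

def compute_checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data: bytes, checksums, offset: int, length: int) -> bool:
    """Before returning data, verify only the blocks overlapping the read range,
    which is why checksumming adds little I/O or computation to reads."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE : (b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False   # mismatch: report to the master, which re-replicates from another replica
    return True
```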

FAULT TOLERANCE AND DIAGNOSIS Diagnostic Tools GFS servers generate diagnostic logs that record many significant events and all RPC requests and replies. The performance impact of logging is minimal because these logs are written sequentially and asynchronously.

MEASUREMENTS Micro-benchmarks Real World Clusters

MEASUREMENTS Micro-benchmarks Each test machine has dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex link to an HP 2524 switch; the two switches are connected by a 1 Gbps link.

MEASUREMENTS Micro-benchmarks Reads N clients read simultaneously. Each client reads a randomly selected 4 MB region from a 320 GB file set, repeated 256 times so that each client ends up reading 1 GB of data. At most a 10% hit rate in the Linux buffer cache is expected.

MEASUREMENTS Micro-benchmarks Writes N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes.

MEASUREMENTS Micro-benchmarks Record Appends N clients append simultaneously to a single file. In practice, applications tend to append to multiple files concurrently: N clients append to M shared files simultaneously, where both N and M are in the dozens or hundreds.

MEASUREMENTS Real World Clusters Cluster A: Used regularly for research and development by over a hundred engineers. A typical task is initiated by a human user and runs up to several hours; it reads through a few MBs to a few TBs of data.

MEASUREMENTS Real World Clusters Cluster B: Used for production data processing. The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. In both cases, a single task consists of many processes on many machines.

MEASUREMENTS Real World Clusters Storage

Characteristics of two GFS clusters:

Cluster                     A        B
Chunkservers                342      227
Available disk space        72 TB    180 TB
Used disk space             55 TB    155 TB
Number of files             735 k    737 k
Number of dead files        22 k     232 k
Number of chunks            992 k    1550 k
Metadata at chunkservers    13 GB    21 GB
Metadata at master          48 MB    60 MB

MEASUREMENTS Real World Clusters Metadata The metadata at chunkservers consists of checksums for 64 KB blocks of user data and the chunk version number. The metadata at the master is much smaller, about 100 bytes per file on average. Each individual server, whether chunkserver or master, holds only 50 to 100 MB of metadata (see the table above).

MEASUREMENTS Real World Clusters Read and Write Rates Both clusters had been up for about one week when measured.

Performance metrics for two GFS clusters:

Cluster                       A          B
Read rate (last minute)       583 MB/s   380 MB/s
Read rate (last hour)         562 MB/s   384 MB/s
Read rate (since restart)     589 MB/s   49 MB/s
Write rate (last minute)      1 MB/s     101 MB/s
Write rate (last hour)        2 MB/s     117 MB/s
Write rate (since restart)    25 MB/s    13 MB/s
Master ops (last minute)      325 Ops/s  533 Ops/s
Master ops (last hour)        381 Ops/s  518 Ops/s
Master ops (since restart)    202 Ops/s  347 Ops/s

MEASUREMENTS Real World Clusters Master Load The master can support many thousands of file accesses per second (see the operation rates in the table above). It is possible to speed this up further by placing name lookup caches in front of the namespace data structure.

MEASUREMENTS Real World Clusters Recovery Time Experiment 1: Killed a single chunkserver in cluster B, which had about 15,000 chunks containing 600 GB of data. Re-replication was limited to 91 concurrent clone operations (40% of the number of chunkservers), and each clone operation was allowed to consume at most 6.25 MB/s (50 Mbps).

MEASUREMENTS Real World Clusters Recovery Time Result of experiment 1: All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s (roughly 600 GB / 23.2 minutes).

MEASUREMENTS Real World Clusters Recovery Time Experiment 2: Killed two chunkservers, each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to having a single replica; all of them were restored to at least 2x replication within 2 minutes.

CONCLUSIONS GFS supports large-scale data processing workloads on commodity hardware. Reexamining traditional file system assumptions led to radically different points in the design space. GFS provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery.

CONCLUSIONS GFS delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. It has successfully met Google's storage needs and is widely used within Google as the storage platform for research and development as well as production data processing.