The Google File System

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google)
SOSP '03, October 19-22, 2003, New York, USA
Hyeon-Gyu Lee and Yeong-Jae Woo
Memory & Storage Architecture Lab, School of Computer Science and Engineering, Seoul National University

Outline
- Design overview
  - Assumptions
  - Interface
  - Architecture
  - Single master
  - Chunk size
  - Metadata
  - Consistency model
- System interactions
- Master operation
- Fault tolerance
- Measurements
- Conclusion

Assumptions
- The system is built from many inexpensive commodity components
- The system stores a modest number of large files: a few million files, each typically 100 MB or larger
- The workloads primarily consist of
  - Large sequential reads and small random reads
  - Large sequential writes that append data to files
- The system must efficiently implement semantics for clients that concurrently append to the same file
- High sustained bandwidth is more important than low latency

Interface
- Files are organized in directories and identified by pathnames
- File operations: create, delete, open, close, read, and write
- Snapshot: creates a copy of a file or a directory tree
- Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each client's append
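
To make the interface concrete, here is a minimal sketch of what a GFS-style client API might look like in Python. The class and method names (GFSClient, GFSFile, record_append, and so on) are invented for illustration; the paper does not define a client-library API.

```python
# Hypothetical sketch of a GFS-style client API; names are illustrative,
# not the actual Google client library.
class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "GFSFile": ...
    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Copy a file or directory tree cheaply (copy-on-write, see the Snapshot slide)."""
        ...

class GFSFile:
    def read(self, offset: int, length: int) -> bytes: ...
    def write(self, offset: int, data: bytes) -> None: ...
    def record_append(self, data: bytes) -> int:
        """Append the record atomically at least once; return the offset GFS chose."""
        ...
    def close(self) -> None: ...
```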

Architecture
- A GFS cluster consists of a single master and multiple chunkservers
- Files are divided into fixed-size chunks
  - Each chunk is identified by an immutable and globally unique 64-bit chunk handle
- The master maintains all file system metadata
  - Namespace, access control information, the mapping from files to chunks, and the current locations of chunks
- The master periodically communicates with each chunkserver in HeartBeat messages
- Neither the client nor the chunkserver caches file data
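
As a rough illustration of how this state might be organized, here is a small Python sketch of the master's in-memory tables and its handling of a HeartBeat report. Every name and field here is an assumption made for the example; the real system is written in C++ and its data structures are not shown in the paper.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB fixed-size chunks

@dataclass
class ChunkInfo:
    handle: int                        # immutable, globally unique 64-bit id
    version: int = 0
    locations: list = field(default_factory=list)   # chunkserver addresses; not persisted

@dataclass
class FileInfo:
    chunk_handles: list = field(default_factory=list)  # i-th handle covers bytes [i*64MB, (i+1)*64MB)

class MasterState:
    def __init__(self):
        self.namespace = {}            # full pathname -> FileInfo
        self.chunks = {}               # chunk handle -> ChunkInfo

    def heartbeat(self, chunkserver: str, reported_handles: list) -> None:
        """Refresh in-memory chunk locations from a chunkserver's HeartBeat report."""
        for h in reported_handles:
            info = self.chunks.get(h)
            if info is not None and chunkserver not in info.locations:
                info.locations.append(chunkserver)
```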

Single master
- The master makes chunk placement and replication decisions
- A client asks the master which chunkserver it should contact
- The client caches this information and interacts with the chunkserver directly
[Figure: numbered steps 1-4 of a client request flowing between client, master, and chunkserver]
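
A sketch of that read path, assuming a hypothetical master RPC lookup_chunk and a placeholder chunkserver RPC read_chunk (both invented for this example):

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024

def read_chunk(chunkserver, handle, chunk_offset, length):
    """Placeholder for the direct client-to-chunkserver read RPC."""
    ...

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}     # (path, chunk_index) -> (chunk handle, replica locations)

    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE            # translate byte offset to chunk index
        key = (path, chunk_index)
        if key not in self.cache:
            # One metadata round trip to the master, then cache the answer;
            # the master stays off the data path.
            self.cache[key] = self.master.lookup_chunk(path, chunk_index)
        handle, locations = self.cache[key]
        replica = random.choice(locations)            # talk to a chunkserver directly
        return read_chunk(replica, handle, offset % CHUNK_SIZE, length)
```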

Chunk size
- Large chunk size: 64 MB
- Pros
  - Reduces the clients' need to interact with the master
  - Reduces network overhead, since a client is likely to perform many operations on a large chunk
  - Reduces the size of the metadata stored on the master
- Cons
  - If many clients are accessing the same small file, its chunkservers may become hot spots
    - Example: an executable stored as a single-chunk file
    - Solution: store such files with a higher replication factor
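
A quick back-of-the-envelope check on why the large chunk size keeps master metadata small (the under-64-bytes-per-chunk figure comes from the next slide); the helper below is only arithmetic:

```python
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
METADATA_PER_CHUNK = 64                # < 64 bytes of master metadata per chunk (next slide)

def master_metadata_bytes(file_size_bytes: int) -> int:
    """Approximate master-side chunk metadata for one file."""
    num_chunks = -(-file_size_bytes // CHUNK_SIZE)   # ceiling division
    return num_chunks * METADATA_PER_CHUNK

# Example: a 1 TB file spans 16,384 chunks, roughly 1 MB of chunk metadata;
# with 1 MB chunks the same file would need 64x more.
print(master_metadata_bytes(1 << 40))   # -> 1048576
```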

Metadata
- Metadata types (all kept in the master's memory)
  - File and chunk namespaces
  - Mapping from files to chunks
    - These two are kept persistent by logging mutations to an operation log
  - Locations of each chunk's replicas
    - Not persisted; rebuilt by asking each chunkserver about its chunks
- The master periodically scans its entire state in the background
  - For chunk garbage collection, re-replication, and chunk migration
- The size of the metadata is not a problem
  - The master maintains less than 64 bytes of metadata for each 64 MB chunk
  - File namespace data requires less than 64 bytes per file name
- The operation log contains a record of metadata changes
  - It also provides a logical timeline that defines the order of concurrent operations
  - Changes are not visible to clients until the metadata changes are made persistent
  - The master recovers its state by replaying the operation log from the last checkpoint
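
A minimal sketch of checkpoint-plus-log recovery, assuming a simple line-oriented JSON log; the record encoding and the apply_record cases are invented for illustration:

```python
import json, os

# Hypothetical append-only operation log: one JSON record per line, e.g.
#   {"op": "create", "path": "/foo"}
#   {"op": "add_chunk", "path": "/foo", "handle": 42}

def recover(checkpoint_path: str, log_path: str) -> dict:
    """Rebuild master metadata: load the last checkpoint, then replay the log."""
    state = {"namespace": {}, "chunks": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)                   # compact image of the metadata
    with open(log_path) as f:
        for line in f:                             # records appear in mutation order,
            apply_record(state, json.loads(line))  # which defines the logical timeline
    return state

def apply_record(state: dict, rec: dict) -> None:
    if rec["op"] == "create":
        state["namespace"][rec["path"]] = {"chunk_handles": []}
    elif rec["op"] == "add_chunk":
        state["namespace"][rec["path"]]["chunk_handles"].append(rec["handle"])
```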

Consistency model
- GFS has a relaxed consistency model
- Guarantees by GFS
  - File namespace mutations are atomic and are handled in the master with locking
  - The state of a file region after a mutation depends on the type of mutation and on whether it succeeds or fails
- Implications for applications
  - Applications deal with the relaxed consistency model by using appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records
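
One way applications can write self-validating, self-identifying records on top of an at-least-once append is sketched below. The record framing (length, random record id, CRC32) is an assumption for illustration, not the format Google applications use; the reader skips padding and drops duplicates.

```python
import struct, uuid, zlib

HEADER = struct.Struct("!I16sI")      # payload length, 16-byte record id, CRC32 of payload

def encode_record(payload: bytes) -> bytes:
    rid = uuid.uuid4().bytes          # unique id lets readers drop duplicate appends
    return HEADER.pack(len(payload), rid, zlib.crc32(payload)) + payload

def decode_records(data: bytes):
    """Yield valid records, skipping padding/garbage and duplicate appends."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(data):
        length, rid, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if 0 < length and len(payload) == length and zlib.crc32(payload) == crc:
            pos += HEADER.size + length
            if rid not in seen:       # duplicates can appear after client retries
                seen.add(rid)
                yield payload
        else:
            pos += 1                  # not a valid record boundary; resynchronize

blob = encode_record(b"hello") + b"\x00" * 32 + encode_record(b"world")
print(list(decode_records(blob)))     # -> [b'hello', b'world']
```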

Outline
- Design overview
- System interactions
  - Leases and mutation order
  - Data flow
  - Atomic record appends
  - Snapshot
- Master operation
- Fault tolerance
- Measurements
- Conclusion

Leases and mutation order
- A mutation is performed at all of a chunk's replicas; GFS uses leases to maintain a consistent mutation order across replicas
- The master grants a chunk lease to one of the replicas, called the primary
- The primary picks a serial order for all mutations to the chunk, and all replicas apply mutations in that order
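
The sketch below shows the primary assigning a serial order to mutations; the class and helper names are invented, and real leases also carry timeouts and chunk version checks that are omitted here.

```python
class Primary:
    """Replica currently holding the chunk lease for one chunk (illustrative only)."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        # The primary assigns a serial number; every replica applies mutations
        # in this serial order, so all replicas end up in the same state.
        serial = self.next_serial
        self.next_serial += 1
        apply_locally(mutation, serial)               # hypothetical local apply
        for s in self.secondaries:
            forward_mutation(s, mutation, serial)     # hypothetical RPC to a secondary
        return serial

def apply_locally(mutation, serial): ...
def forward_mutation(secondary, mutation, serial): ...
```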

Data flow
- GFS decouples the data flow from the control flow
- Data is pushed along a chain of chunkservers: each machine forwards the data to the closest machine in the network topology that has not yet received it
  - Distances can be estimated from IP addresses
- GFS minimizes latency by pipelining the data transfer over TCP connections
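
A toy sketch of building such a forwarding chain, assuming distance between machines can be approximated by comparing IP addresses bit by bit (a stand-in for whatever estimate the real system uses):

```python
import ipaddress

def ip_distance(a: str, b: str) -> int:
    """Crude topology estimate: fewer shared leading bits => farther apart."""
    xa, xb = int(ipaddress.ip_address(a)), int(ipaddress.ip_address(b))
    return (xa ^ xb).bit_length()           # 0 means identical addresses

def forwarding_chain(source: str, replicas: list) -> list:
    """Order replicas so each hop goes to the closest machine not yet visited."""
    chain, current, remaining = [], source, set(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: ip_distance(current, r))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt                       # data is pipelined hop by hop over TCP
    return chain

# Example: the sender pushes data to the nearest replica first, and so on down the chain.
print(forwarding_chain("10.0.0.5", ["10.0.3.7", "10.0.0.9", "10.2.1.4"]))
```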

Atomic record appends
- GFS appends the data to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client
- Algorithm of atomic record append (see the sketch below)
  - The client pushes the data to all replicas of the last chunk of the file
  - The client sends the append request to the primary
  - The primary checks whether the data would cause the chunk to exceed the maximum size
    - If so, it pads the chunk to the maximum size, tells the secondaries to do the same, and tells the client to retry on the next chunk
    - If the record fits, the primary appends the data to its replica, tells the secondaries to write at the exact offset where it appended the data, and returns that offset to the client
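
A minimal sketch of the primary-side decision, with invented helpers (pad_chunk, append_at, tell_secondaries) standing in for local chunk I/O and the RPCs to secondaries:

```python
CHUNK_SIZE = 64 * 1024 * 1024

class RetryOnNextChunk(Exception):
    """Signals the client to retry the append on the file's next chunk."""

def record_append(primary_chunk, data: bytes, secondaries) -> int:
    """Primary-side handling of one record append; returns the chosen offset."""
    if primary_chunk.used + len(data) > CHUNK_SIZE:
        pad = CHUNK_SIZE - primary_chunk.used
        pad_chunk(primary_chunk, pad)                         # pad our replica to 64 MB
        tell_secondaries(secondaries, "pad", pad)             # secondaries do the same
        raise RetryOnNextChunk()
    offset = primary_chunk.used                               # the offset GFS chooses
    append_at(primary_chunk, offset, data)                    # write locally
    tell_secondaries(secondaries, "write_at", offset, data)   # same offset on every replica
    return offset

# Placeholders for chunkserver-local I/O and replica RPCs.
def pad_chunk(chunk, nbytes): chunk.used += nbytes
def append_at(chunk, offset, data): chunk.used = offset + len(data)
def tell_secondaries(secondaries, op, *args): ...
```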

Snapshot
- The snapshot operation makes a copy of a file or a directory tree almost instantaneously
- GFS uses a copy-on-write technique to implement snapshots
  - The master revokes any outstanding leases on the chunks to be snapshotted
  - It logs the snapshot operation to disk and duplicates the metadata for the file or directory tree
  - When a client later wants to write to a snapshotted chunk, the chunkservers holding that chunk each create a new local copy, and the write is applied to the copy
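
The bookkeeping behind copy-on-write can be sketched with a reference count per chunk handle; class and method names are invented, and lease revocation plus the chunkserver-side data copy are elided:

```python
class CowMaster:
    """Toy copy-on-write bookkeeping behind snapshots (illustrative only)."""
    def __init__(self):
        self.files = {}        # path -> list of chunk handles
        self.refcount = {}     # chunk handle -> number of files referencing it
        self.next_handle = 0

    def create(self, path, num_chunks):
        handles = []
        for _ in range(num_chunks):
            self.next_handle += 1
            self.refcount[self.next_handle] = 1
            handles.append(self.next_handle)
        self.files[path] = handles

    def snapshot(self, src, dst):
        # Duplicate only the metadata; both files now share the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write_chunk(self, path, index):
        """Resolve which handle a write goes to, copying the chunk if it is shared."""
        h = self.files[path][index]
        if self.refcount[h] > 1:             # chunk is shared with a snapshot
            self.refcount[h] -= 1
            self.next_handle += 1
            new_h = self.next_handle
            self.refcount[new_h] = 1         # chunkservers would copy the data locally
            self.files[path][index] = new_h
            h = new_h
        return h

m = CowMaster()
m.create("/logs/a", 2)
m.snapshot("/logs/a", "/snap/a")
print(m.write_chunk("/logs/a", 0))   # -> 3 (a fresh handle; the snapshot keeps handle 1)
```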

Outline
- Design overview
- System interactions
- Master operation
  - Namespace management and locking
  - Replicas
  - Garbage collection
- Fault tolerance
- Measurements
- Conclusion

Namespace management and locking
- GFS allows multiple master operations to be active at once and uses locks to ensure proper serialization
- The namespace is represented as a lookup table mapping full pathnames to metadata, stored as a tree in memory with prefix compression
- Each node has an associated read-write lock, and each master operation acquires a set of locks before it runs (see the sketch below)
- Locks are acquired in a consistent total order to prevent deadlock
  - Ordered by level in the namespace tree, and lexicographically within the same level
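
A sketch of the lock-ordering rule; since Python's standard library has no read-write lock, a plain threading.Lock stands in for each node's lock and the read/write distinction is elided:

```python
import threading
from collections import defaultdict

# One lock per pathname; threading.Lock stands in for a true read-write lock.
path_locks = defaultdict(threading.Lock)

def lock_order_key(path: str):
    # Sort first by depth in the namespace tree, then lexicographically,
    # so every operation acquires locks in the same total order (no deadlock).
    return (path.count("/"), path)

def acquire_for_operation(paths):
    acquired = []
    for p in sorted(set(paths), key=lock_order_key):
        lk = path_locks[p]
        lk.acquire()
        acquired.append(lk)
    return acquired

def release(locks):
    for lk in reversed(locks):
        lk.release()

# Example: creating /home/user/foo takes locks on /home, /home/user, and /home/user/foo.
locks = acquire_for_operation(["/home", "/home/user", "/home/user/foo"])
release(locks)
```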

Replicas
- GFS spreads chunk replicas across racks
  - This ensures that some replicas will survive even if an entire rack is damaged
  - It also exploits the aggregate bandwidth of multiple racks
- Replicas are created for chunk creation, re-replication, and rebalancing
  - New replicas are placed on chunkservers with below-average disk space utilization
  - A chunk is re-replicated when the number of available replicas falls below a user-specified goal
  - Replicas are rebalanced periodically for better disk space and load balancing
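
A toy version of the placement heuristic (below-average disk utilization, spread across racks); the data model, scoring, and tie-breaking are assumptions, not the real policy:

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_used_fraction: float   # 0.0 .. 1.0

def place_replicas(servers, num_replicas=3):
    """Pick chunkservers with below-average utilization, preferring distinct racks."""
    avg = sum(s.disk_used_fraction for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers if s.disk_used_fraction <= avg),
        key=lambda s: s.disk_used_fraction,
    ) or sorted(servers, key=lambda s: s.disk_used_fraction)
    chosen, used_racks = [], set()
    for s in candidates:                    # first pass: at most one replica per rack
        if s.rack not in used_racks:
            chosen.append(s)
            used_racks.add(s.rack)
        if len(chosen) == num_replicas:
            return chosen
    for s in candidates:                    # second pass: fill up if racks ran out
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == num_replicas:
            break
    return chosen

servers = [Chunkserver("cs1", "rackA", 0.30), Chunkserver("cs2", "rackA", 0.50),
           Chunkserver("cs3", "rackB", 0.40), Chunkserver("cs4", "rackC", 0.80)]
print([s.name for s in place_replicas(servers)])   # -> ['cs1', 'cs3', 'cs2']
```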

Garbage collection
- GFS reclaims the storage for deleted files lazily, via garbage collection at both the file and chunk levels
- File level
  - When a file is deleted, the master logs the deletion and renames the file to a hidden name that includes the deletion timestamp
  - During the master's regular scan, hidden files are removed if they have existed for more than 3 days
- Chunk level
  - Each chunkserver reports a subset of the chunks it has, and the master identifies and replies with the chunks that are orphaned (not reachable from any file)
  - Each chunkserver then deletes its replicas of those orphaned chunks
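
A minimal sketch of both levels, with an invented hidden-name scheme that carries the deletion timestamp; the data model (plain dicts) is an assumption for illustration:

```python
import time

HIDDEN_PREFIX = ".deleted."
GRACE_SECONDS = 3 * 24 * 3600        # hidden files older than 3 days are reclaimed

def delete_file(namespace: dict, path: str) -> None:
    """Lazy delete: rename to a hidden name carrying the deletion timestamp."""
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path.strip('/').replace('/', '_')}"
    namespace[hidden] = namespace.pop(path)

def scan_namespace(namespace: dict, chunk_refs: dict, now: float) -> None:
    """Master's periodic scan: drop expired hidden files and their chunk references."""
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            deleted_at = int(name.split(".")[2])
            if now - deleted_at > GRACE_SECONDS:
                for handle in namespace.pop(name)["chunk_handles"]:
                    chunk_refs.pop(handle, None)     # chunk becomes orphaned

def orphaned(reported_handles: list, chunk_refs: dict) -> list:
    """Reply to a chunkserver's report: which of its chunks no file references."""
    return [h for h in reported_handles if h not in chunk_refs]
```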

Outline
- Design overview
- System interactions
- Master operation
- Fault tolerance
  - High availability
  - Data integrity
- Measurements
- Conclusion

High availability
- Fast recovery
  - The master and chunkservers are designed to restore their state and start in seconds
- Chunk replication
  - Each chunk is replicated on multiple chunkservers on different racks
  - Users can specify different replication levels; the default is 3
- Master replication
  - Operation logs and checkpoints are replicated on multiple machines
  - One master process is in charge of all mutations and background activities
  - Shadow masters provide read-only access to the file system when the primary master is down

Data integrity
- Each chunkserver uses checksumming to detect corruption of stored data
- A chunk is broken up into 64 KB blocks, and each block has a corresponding 32-bit checksum
- The chunkserver verifies the checksums of data blocks before returning data for reads
- During idle periods, chunkservers scan and verify inactive chunks
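
A minimal sketch of per-block checksumming; zlib.crc32 serves here as a convenient 32-bit checksum without claiming it is the exact function GFS used:

```python
import zlib

BLOCK_SIZE = 64 * 1024               # 64 KB blocks, one 32-bit checksum each

def block_checksums(chunk_data: bytes) -> list:
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data: bytes, checksums: list, offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}; report corruption to the master")
    return chunk_data[offset:offset + length]

data = b"x" * (3 * BLOCK_SIZE)
sums = block_checksums(data)
print(len(verify_read(data, sums, offset=70_000, length=1_000)))   # -> 1000
```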

Outline
- Design overview
- System interactions
- Master operation
- Fault tolerance
- Measurements
  - Micro-benchmarks
  - Real-world clusters
- Conclusion

Micro-benchmarks
- Test environment
  - The GFS cluster consists of 1 master (with 2 master replicas), 16 chunkservers, and 16 clients
  - Each machine: dual 1.4 GHz PIII processors, 2 GB of memory, 2 x 80 GB 5400 RPM HDDs, and 100 Mbps full-duplex Ethernet
[Figure: test cluster layout with the master and its replicas, chunkservers #0-#15, and clients #0-#15]

Micro-benchmarks
- Reads
  - Each client reads a randomly selected 4 MB region 256 times from a 320 GB file set
- Writes
  - Each client writes 1 GB of data to a new file in a series of 1 MB writes
  - Each write involves 3 different replicas
- Record appends
  - All clients append simultaneously to a single file

Real-world clusters
- Test environment
  - Cluster A: used for research and development; tasks read MBs to TBs of data and write the results back
  - Cluster B: used for production data processing; tasks generate and process TBs of data

Real-world clusters
- Performance metrics for the two GFS clusters
  - Cluster B was in the middle of a burst of write activity
  - The read rates were much higher than the write rates
- Recovery time
  - A single chunkserver of cluster B was killed; it had about 15,000 chunks containing 600 GB of data
  - All of its chunks were restored in 23.3 minutes

Outline
- Design overview
- System interactions
- Master operation
- Fault tolerance
- Measurements
- Conclusion

Conclusion
- GFS treats component failures as the norm rather than the exception
- GFS optimizes for huge files that are mostly appended to and then read
- GFS provides fault tolerance by replication and fast recovery
  - Replication to tolerate chunkserver failures
  - Checksumming to detect data corruption
- GFS delivers high aggregate throughput to many concurrent clients
  - By separating file system control (through the master) from data transfer (directly between clients and chunkservers)
  - Master involvement in common operations is minimized by the large chunk size and by chunk leases
- GFS is widely used within Google as the storage platform