The Google File System GFS

Common Goals of GFS and most Distributed File Systems: performance, reliability, scalability, availability.

Other GFS Concepts. Component failures are the norm rather than the exception: the file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts. Files are huge; multi-GB files are common, and each file typically contains many application objects such as web documents. Append, append, append: most files are mutated by appending new data rather than overwriting existing data. Co-designing applications and the file system API benefits the overall system by increasing flexibility. Slide from Michael Raines

GFS: Why assume hardware failure is the norm? It is cheaper to assume common failure on inexpensive hardware and account for it than to invest in expensive hardware and still experience occasional failure. The number of layers in a distributed system (network, disk, memory, physical connections, power, OS, application) means a failure in any of them could contribute to data corruption. Slide from Michael Raines

Assumptions: the system is built from inexpensive commodity components that fail. A modest number of files is expected: a few million, typically 100 MB or larger; smaller files are supported but not optimized for. Two kinds of reads dominate: large streaming reads (1 MB or more) and small random reads (e.g. batch and sort workloads). Well-defined semantics are needed for producer/consumer queues and many-way merges, with one producer per machine appending to a file, so append atomicity matters. High sustained bandwidth is chosen over low latency.

Interface: GFS provides a familiar file system interface. Files are organized hierarchically in directories and identified by path names. Operations: create, delete, open, close, read, write, plus snapshot and record append (which allows multiple clients to append simultaneously and atomically).

Master/Chunkservers: a single master and multiple chunkservers. Each file is divided into fixed-size chunks, each identified by an immutable, globally unique 64-bit chunk handle assigned at creation. Chunks are stored by chunkservers on local disks as Linux files; chunk data is read or written by specifying a chunk handle and byte range. Each chunk is replicated on multiple chunkservers (default 3).
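
A minimal sketch (in Python, not Google's code) of the master-side bookkeeping this slide describes: each file path maps to a list of chunks, each chunk carries an immutable 64-bit handle assigned at creation, and each chunk records the chunkservers holding a replica (default 3). Class and field names are illustrative.

import secrets
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                     # globally unique 64-bit chunk handle
    version: int = 1                                # used later to detect stale replicas
    locations: list = field(default_factory=list)   # chunkservers holding a replica

class MasterMetadata:
    REPLICATION = 3                                 # default replication factor

    def __init__(self):
        self.file_to_chunks = {}                    # path -> [ChunkInfo, ...]

    def create_chunk(self, path, chunkservers):
        handle = secrets.randbits(64)               # assigned once, never changes
        info = ChunkInfo(handle, locations=list(chunkservers[:self.REPLICATION]))
        self.file_to_chunks.setdefault(path, []).append(info)
        return info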

Master/Chunkservers: the master maintains all file system metadata: namespace, access control info, the mapping from files to chunks, and chunk locations. It controls garbage collection of chunks and communicates with each chunkserver through HeartBeat messages. Clients interact with the master only for metadata; chunkservers do the rest, e.g. reads and writes on behalf of applications. Neither side caches file data: client working sets are too large (and not caching simplifies coherence), and on chunkservers the chunks are already stored as local files, so the Linux buffer cache keeps frequently used data in memory.

Heartbeats: What do we gain from HeartBeats? Not only does the master get the current state of a remote system, it also learns of failures: any chunkserver that fails to respond to a HeartBeat message is assumed dead. This information allows the master to update its metadata accordingly, and it cues the master to create more replicas of the lost data. Slide from Michael Raines
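
A toy sketch of the heartbeat bookkeeping described above, assuming a simple timeout rule (the timeout value and class name are made up): the master records when it last heard from each chunkserver, and any server silent for too long is declared dead, which is what triggers re-replication of its chunks.

import time

class HeartbeatMonitor:
    def __init__(self, timeout_seconds=60.0):
        self.timeout = timeout_seconds
        self.last_seen = {}                      # chunkserver -> time of last heartbeat

    def heartbeat(self, server, chunk_report=None):
        self.last_seen[server] = time.time()     # the real message also carries chunk state

    def dead_servers(self):
        now = time.time()
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]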

Master: a single master simplifies the design; placement and replication decisions are made with global knowledge. The master does not handle data reads and writes, so it is not a bottleneck; a client only asks the master which chunkservers to contact.

Client: the client translates a file name and byte offset into a chunk index within the file and sends the master a request with the file name and chunk index. The master replies with the chunk handle and the locations of the replicas. The client caches this info using the file name and chunk index as the key, then sends its request to one of the replicas (typically the closest). Further reads of the same chunk require no master interaction, and a client can ask for multiple chunks in the same request.
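
The translation this slide describes is simple arithmetic plus a cache; here is a hedged Python sketch (the 64 MB chunk size comes from the next slides, and master.lookup is a hypothetical RPC stub): a byte offset maps to a chunk index by integer division, and the (file, index) -> (handle, replica locations) answer is cached so repeated reads of the same chunk skip the master.

CHUNK_SIZE = 64 * 1024 * 1024                    # 64 MB chunks

class ClientChunkCache:
    def __init__(self, master):
        self.master = master                     # hypothetical master RPC stub
        self.cache = {}                          # (path, chunk_index) -> (handle, replicas)

    def locate(self, path, offset):
        index = offset // CHUNK_SIZE             # e.g. offset 200 MB -> chunk index 3
        key = (path, index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(path, index)   # one request; could batch indices
        return self.cache[key]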

Chunk Size: 64 MB, larger than typical. Each replica is stored as a plain Linux file and extended only as needed (lazy space allocation).

Why a large chunk size? It reduces client interaction with the master: reads and writes on the same chunk need only one request to the master. Since workloads mostly read and write large sequential files, a client is likely to perform many operations on a given chunk (and can keep a persistent TCP connection to the chunkserver). It also reduces the size of the metadata stored on the master.

Chunk problems: if a small file occupies a single chunk, that chunk may become a hot spot. This can be mitigated with a higher replication factor and by staggering batch application start times.

Metadata: 3 types: file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas. All metadata is kept in memory; the first two types are also persisted in logs (on the master's local disk and replicated remotely).

Metadata: keeping it in memory makes master operations fast and lets the master periodically scan its entire state for garbage collection, re-replication after chunkserver failures, and migration for load balancing. The master keeps about 64 bytes of data per 64 MB chunk, and each file namespace entry takes less than 64 bytes.
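
A back-of-the-envelope check of the figure above (roughly 64 bytes of chunk metadata per 64 MB chunk): even a petabyte of file data needs only about a gigabyte of master memory, which is why keeping all metadata in RAM is feasible.

CHUNK_SIZE = 64 * 2**20            # 64 MiB per chunk
METADATA_PER_CHUNK = 64            # ~64 bytes of master metadata per chunk

file_data = 2**50                  # 1 PiB of stored file data
chunks = file_data // CHUNK_SIZE
print(chunks, "chunks ->", chunks * METADATA_PER_CHUNK / 2**30, "GiB of metadata")
# 16777216 chunks -> 1.0 GiB of metadata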

GFS: Why large files? Metadata! Every file in the system adds to the total metadata overhead the system must store; more individual files means more data about the data. Slide from Michael Raines

Metadata: instead of persistently recording chunk location info, the master polls chunkservers for which replicas they hold. Since the master controls all chunk placement anyway, and disks may go bad, chunkservers may fail, etc., polling avoids trying to keep a persistent view in sync with reality.

Operation Log: a historical record of critical metadata changes that also provides a logical timeline for concurrent operations. The log is replicated on remote machines, and each record is flushed to disk locally and remotely. The log is kept small by checkpointing when it exceeds a size threshold; the checkpoint is in a compact B-tree form, and a new checkpoint is built without delaying incoming mutations (it takes about 1 minute for 2 million files). Only the latest checkpoint and subsequent log files are kept.
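
A rough sketch of the logging discipline this slide describes, under simplifying assumptions (JSON records, a single log file that is truncated after each checkpoint, hypothetical remote-replica writers): a mutation is flushed to disk locally and remotely before the master responds, and the log is compacted into a checkpoint once it grows past a threshold.

import json, os

class OperationLog:
    def __init__(self, path, remote_replicas, checkpoint_threshold=1 << 20):
        self.path = path
        self.remote = remote_replicas            # hypothetical remote log writers
        self.threshold = checkpoint_threshold
        self.log = open(path, "ab")

    def record(self, mutation):
        line = (json.dumps(mutation) + "\n").encode()
        self.log.write(line)
        self.log.flush()
        os.fsync(self.log.fileno())              # flush to local disk...
        for replica in self.remote:              # ...and to remote replicas before replying
            replica.append(line)

    def maybe_checkpoint(self, dump_state):
        if os.path.getsize(self.path) > self.threshold:
            dump_state()                         # write in-memory state (B-tree form in GFS)
            self.log.truncate(0)                 # keep only the latest checkpoint + new log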

Consistency Model: file namespace mutations are atomic. A file region is consistent if all clients see the same data; it is defined after a file data mutation if, in addition, clients see what the mutation wrote in its entirety (no interference from concurrent writes). A region is undefined but consistent after concurrent successful mutations: all clients see the same data, but it may not reflect what any one mutation has written. A region is inconsistent after a failed mutation (clients then retry).

Consistency Model: a write puts data at an application-specified offset; a record append puts data atomically at least once at an offset of GFS's choosing (a regular append is a write at the offset the client believes is end-of-file). GFS applies mutations to a chunk in the same order on all replicas and uses chunk version numbers to detect stale replicas (those that missed mutations while their chunkserver was down); stale replicas are garbage collected, and clients pick up updated locations the next time they contact the master. Additional failures are caught by regular handshakes between master and chunkservers and by checksumming. Data is only lost if all replicas are lost before GFS can react.
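
The stale-replica check mentioned above can be illustrated with a few lines of Python (the function and its inputs are hypothetical): the master bumps a chunk's version number whenever it grants a new lease, so any replica reporting an older version missed mutations while its chunkserver was down and is treated as stale.

def classify_replicas(master_version, reported_versions):
    """reported_versions: dict of chunkserver -> version it holds for this chunk."""
    fresh, stale = [], []
    for server, version in reported_versions.items():
        (fresh if version == master_version else stale).append(server)
    return fresh, stale

fresh, stale = classify_replicas(7, {"cs1": 7, "cs2": 7, "cs3": 6})
# cs3 missed a mutation while down -> stale, left for garbage collection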

GFS: Why append only? Overwriting existing data in place is not safe under concurrent access: we cannot read data while it is being modified. A customized ("atomized") append is implemented by the system that allows for concurrent read/write, write/write, and read/write/write patterns.

Consistency: the relaxed consistency model can be accommodated by applications relying on appends instead of overwrites. Appending is more efficient and more resilient to failure than random writes, and checkpointing allows writers to restart incrementally and lets readers avoid processing data that was not written to completion.

Leases and Mutation Order: for each chunk, one replica is chosen as the primary and granted a chunk lease. The primary picks a serial order for all mutations to the chunk. A lease expires after 60 seconds.

Leases and order: 1. The client asks the master which chunkserver holds the lease. 2. The master replies with the primary and the secondaries (the client contacts the master again if the primary becomes unreachable). 3. The client pushes data to all replicas. 4. Once all replicas ACK, the client sends the write request to the primary. 5. The primary forwards the write to all secondaries. 6. The secondaries ACK when the write is done. 7. The primary replies to the client with any errors.
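
A condensed Python sketch of the client-side control flow for the seven steps above; every RPC stub (find_lease_holder, push_data, commit) is hypothetical. The point it illustrates is that data flows to all replicas first, and only then does the primary serialize and forward the mutation.

class RetryableWriteError(Exception):
    pass

def write_chunk(master, handle, data):
    primary, secondaries = master.find_lease_holder(handle)    # steps 1-2
    for replica in [primary] + secondaries:
        replica.push_data(handle, data)                        # step 3: data flow to all replicas
    result = primary.commit(handle, forward_to=secondaries)    # steps 4-6: control flow via primary
    if result.errors:                                          # step 7: any replica error
        raise RetryableWriteError(result.errors)               # client retries the mutation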

Data Flow: to fully utilize each machine's network bandwidth, data is pushed along a chain of chunkservers, avoiding bottlenecks and high-latency links; each machine forwards to the closest machine in the network topology. To minimize latency, data transfer is pipelined over TCP connections. The ideal elapsed time to push B bytes to R replicas is B/T + R*L, where T is the network throughput and L the per-hop latency; with T = 100 Mbps and L < 1 ms, 1 MB can be distributed in about 80 ms.
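
Plugging the slide's numbers into the estimate B/T + R*L makes the "about 80 ms" figure concrete (a small worked example, assuming 3 replicas in the chain):

B = 1 * 10**6 * 8          # 1 MB expressed in bits
T = 100 * 10**6            # 100 Mbps network throughput
R, L = 3, 0.001            # 3 replicas, under 1 ms latency per hop
print((B / T + R * L) * 1000, "ms")   # ~83 ms, i.e. "about 80 ms", dominated by bandwidth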

Atomic Record Appends: concurrent writes to the same region are not serializable. With record append, the client specifies only the data; GFS takes care of where it is written (similar to O_APPEND), so there is no need for a distributed lock manager.

Record append: the client pushes data to all replicas of the last chunk of the file and sends its request to the primary. The primary checks whether the append would exceed the 64 MB chunk boundary; if so, it tells the client to retry on the next chunk; otherwise it appends the data, tells the secondaries to write at the same offset, and replies to the client. If an append fails, the client retries. As a result, replicas of the same chunk may contain different data, including duplicates of the same record in part or in whole: GFS does not guarantee that all replicas are bytewise identical, only that the data is written at least once as an atomic unit. For success, the data must have been written at the same offset on all replicas of the same chunk, though not necessarily at the same offset in the file as earlier attempts.
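
A hedged sketch of the primary's decision described above (the chunk and secondary objects and their methods are illustrative, not the real server interface): if the record would spill past the 64 MB boundary the chunk is padded and the client retries on the next chunk; otherwise the primary picks the offset and has every replica write at that same offset.

CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk, record, secondaries):
    if chunk.length + len(record) > CHUNK_SIZE:
        chunk.pad_to(CHUNK_SIZE)                      # secondaries are told to pad as well
        return "RETRY_ON_NEXT_CHUNK", None
    offset = chunk.length                             # offset chosen by the primary
    chunk.write_at(offset, record)
    for s in secondaries:
        s.write_at(chunk.handle, offset, record)      # same offset on every replica
    return "OK", offset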

Snapshot: a snapshot makes a copy of a file, used to create branch copies of huge data sets or to checkpoint state, using copy-on-write techniques. The master first revokes any outstanding leases on the affected chunks, logs the operation to disk, then applies the log record to its in-memory state by duplicating the metadata; the newly created snapshot points to the same chunks as the source file.

Snapshot: after a snapshot, when a client wants to write it sends a request to the master to find the lease holder. The master sees that the chunk's reference count is greater than 1, so it picks a new chunk handle and asks each chunkserver holding the old chunk to create the new chunk locally (on the same server, avoiding a network copy). The master then grants one of the new replicas the lease.

Master Operations: the master executes all namespace operations and manages chunk replicas: it makes placement decisions, creates new chunks (and replicas), and coordinates various system-wide activities to keep chunks fully replicated, balance load, and reclaim unused storage.

Namespace Management and Locking: some master operations can take time, e.g. revoking leases, so multiple operations are allowed at the same time, with locks over namespace regions used for serialization. GFS does not have a per-directory data structure listing all files; instead it uses a lookup table mapping full pathnames to metadata, and each name in the tree has a read/write lock. To access /d1/d2/.../dn/leaf, an operation takes read locks on /d1, /d1/d2, ..., /d1/d2/.../dn and a read or write lock on /d1/d2/.../dn/leaf.

Example: /home/user/foo is created while /home/user is snapshotted to /save/user. The snapshot takes read locks on /home and /save and write locks on /home/user and /save/user; the file creation takes read locks on /home and /home/user and a write lock on /home/user/foo. The two operations are serialized because they acquire conflicting locks on /home/user.

Locking: this scheme allows concurrent mutations in the same directory. The read lock on the directory name prevents the directory from being deleted, renamed, or snapshotted, while write locks on file names serialize attempts to create a file with the same name twice. Lock objects are allocated lazily and deleted when not in use, and locks are acquired in a consistent total order (by level in the namespace tree), which prevents deadlocks.
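
A toy illustration of the prefix-locking rule from the last few slides (not GFS code; this sketch orders locks by sorted path name rather than strictly by tree level): an operation on a path takes read locks on every ancestor and a read or write lock on the full path, acquired in one consistent order so concurrent operations cannot deadlock.

def locks_needed(path, write):
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf_mode = "W" if write else "R"
    return sorted([(p, "R") for p in prefixes] + [(path, leaf_mode)])

print(locks_needed("/home/user/foo", write=True))
# [('/home', 'R'), ('/home/user', 'R'), ('/home/user/foo', 'W')]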

Replica Placement: a GFS cluster is distributed across many machine racks, so communication may cross several network switches, and distributing data well is a challenge. Chunk replica placement aims to maximize data reliability and to maximize network bandwidth utilization. Replicas are spread across racks so data survives even if an entire rack goes offline; reads can exploit the aggregate bandwidth of multiple racks, although write traffic then has to flow through multiple racks.

Creation, Re-replication, Rebalancing: when the master creates a chunk, it places replicas on chunkservers with below-average disk utilization, limits the number of recent creations per chunkserver (new chunks may be hot), and spreads replicas across racks. The master re-replicates when the number of replicas falls below the goal (chunkserver unavailable, replica corrupted, etc.), prioritizing chunks with the fewest remaining replicas, and it limits the number of active clone operations.
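
An illustrative placement sketch following the criteria above (the thresholds, field names, and strict one-replica-per-rack rule are assumptions of this sketch): prefer chunkservers with below-average disk utilization, skip servers with too many recent creations, and spread the chosen replicas across racks.

def place_replicas(servers, want=3, max_recent_creates=5):
    """servers: list of dicts with 'name', 'rack', 'utilization', 'recent_creates'."""
    avg = sum(s["utilization"] for s in servers) / len(servers)
    eligible = [s for s in servers
                if s["utilization"] <= avg and s["recent_creates"] < max_recent_creates]
    chosen, used_racks = [], set()
    for s in sorted(eligible, key=lambda s: s["utilization"]):
        if s["rack"] not in used_racks:            # spread replicas across racks
            chosen.append(s["name"])
            used_racks.add(s["rack"])
        if len(chosen) == want:
            break
    return chosen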

Rebalance: the master periodically moves replicas for better disk space and load balancing. It gradually fills up a new chunkserver and removes replicas from chunkservers with below-average free space.

Garbage Collection: lazy at both the file and chunk levels. When a file is deleted, it is renamed to a hidden name that includes the deletion timestamp. During the regular scan of the file namespace, hidden files are removed if they have existed for more than 3 days; until then they can be undeleted. When a hidden file is removed, its in-memory metadata is erased, and orphaned chunks are identified and erased. In HeartBeat messages, chunkserver and master exchange information about which chunks each holds; the master tells the chunkserver which ones it can delete, and the chunkserver is free to delete them.
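
A sketch of the lazy deletion scheme above (the hidden-name convention and the dict-based namespace are made up for illustration): deletion renames the file to a hidden name carrying the deletion timestamp, and the periodic namespace scan permanently removes hidden files older than the grace period, leaving their chunks to be reclaimed as orphans later.

import time

GRACE_SECONDS = 3 * 24 * 3600                        # the 3-day window

def delete_file(namespace, path):
    hidden = f"{path}.deleted.{int(time.time())}"
    namespace[hidden] = namespace.pop(path)          # still restorable under the hidden name

def scan_namespace(namespace, now=None):
    now = now or time.time()
    for name in list(namespace):
        if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_SECONDS:
            del namespace[name]                      # metadata erased; chunks become orphans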

Garbage Collection: identifying garbage is easy in GFS. All live chunks appear in the file-to-chunk mappings of the master, and all chunk replicas are Linux files under designated directories on each chunkserver; everything else is garbage.

Garbage Collection Advantages: it is simple and reliable in a large-scale distributed system. Chunk creation may succeed on some servers but not others, and replica deletion messages may be lost and need to be resent, so garbage collection gives a uniform, dependable way to clean up replicas. It merges storage reclamation with the master's regular background activities, so it is done in batches with the cost amortized, and only when the master is relatively free. The delay in reclaiming storage also provides a safety net against accidental deletion.

Garbage Collection Disadvantages: the delay hinders user efforts to fine-tune usage when storage is tight, and applications that frequently create and delete files may not be able to reuse space right away. GFS addresses this by expediting storage reclamation if a deleted file is explicitly deleted again, and by allowing users to apply different replication and reclamation policies.

Fault Tolerance: fast recovery: the master and chunkservers restore their state and start in seconds regardless of how they terminated, abnormally or normally; chunk replication, as described earlier, provides the second line of defense.

Master Replication: the master state is replicated for reliability, but one master remains in charge of all mutations and background activities. If its process fails, it can restart almost instantly; if its machine or disk fails, monitoring infrastructure outside GFS starts a new master elsewhere with the replicated log. Clients use only a canonical name for the master.

Shadow Masters: provide read-only access to the file system even when the primary master is down. They are not mirrors, so they may lag the primary slightly (fractions of a second); this still enhances read availability for files not being actively mutated, or where slightly stale results are acceptable, e.g. metadata such as directory contents or access control info. A shadow master reads a replica of the operation log and applies the same sequence of changes to its data structures as the primary does; it polls chunkservers at startup, monitors their status, etc., and depends on the primary only for replica location updates.

Data Integrity: checksumming is used to detect corruption of stored data. It is impractical to compare replicas across chunkservers to detect corruption, and divergent replicas may be legal. Each chunk is divided into 64 KB blocks, each with a 32-bit checksum; checksums are kept in memory and stored persistently with logging.
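
A minimal sketch of per-block checksumming as described above, using CRC32 as a stand-in for whatever 32-bit checksum GFS actually uses: each 64 KB block of a chunk gets its own checksum, kept separately from the data, and a read verifies just the blocks it covers before returning them.

import zlib

BLOCK = 64 * 1024                                    # 64 KB checksum blocks

def checksum_blocks(chunk_bytes):
    return [zlib.crc32(chunk_bytes[i:i + BLOCK]) for i in range(0, len(chunk_bytes), BLOCK)]

def verify_range(chunk_bytes, checksums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_bytes[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}: report to master, read another replica")
    return chunk_bytes[offset:offset + length]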

Data Integrity: before returning a read, the chunkserver verifies the checksums of the blocks covered. If there is a mismatch, it returns an error to the requestor and reports it to the master; the requestor reads from another replica, the master clones the chunk from a good replica, and the bad replica is deleted. Most reads span multiple blocks, so only a small amount of extra data needs to be checksummed, and checksum lookups can be done without I/O.

Data Integrity: checksum computation is optimized for appends: the checksum of the last partial block is updated incrementally, and new checksums are computed for any blocks filled by the append. If the partial block was already corrupted, the corruption will be detected on the next read. A write that overwrites an existing range (rather than appending) would require verifying the first and last blocks being overwritten before computing and recording the new checksums. During idle periods, chunkservers scan and verify inactive chunks.

Diagnostic tools: detailed diagnostic logs support problem isolation, debugging, and performance analysis at minimal cost, and help in understanding transient, non-repeatable interactions between machines. They record events such as chunkservers going up and down, and all RPC requests and replies, so the interaction history can be reconstructed to diagnose a problem. The performance impact is small because the logs are written sequentially and asynchronously.

Measurements: micro-benchmarks to identify bottlenecks. The test cluster had 1 master, 2 master replicas, 16 chunkservers, and 16 clients; each machine had a 1.4 GHz PIII processor, 2 GB memory, two 80 GB 5400 rpm disks, and 100 Mbps Ethernet, with a 1 Gbps link between the switches.

Reads: N clients read simultaneously from the file system. Each client reads a 4 MB region from a 320 GB file set, repeated 256 times, so each client reads 1 GB of data. The chunkservers together have 32 GB of memory, so at most about a 10% hit rate in the Linux buffer cache is expected, giving results close to a cold cache. The observed read rate is about 80% of the per-client limit with 1 reader and about 75% with 16 readers, as readers increasingly collide on the same chunkservers.

Write: N clients write simultaneously to N distinct files, each client writing 1 GB of data. The aggregate limit plateaus at 67 MB/s because each byte must be written to 3 of the 16 chunkservers. The write rate for 1 client is about half its limit (limited by the network stack's interaction with the pipelining scheme and by delays in propagating data from one replica to another). The write rate for 16 clients is 35 MB/s, about half the theoretical limit, due to more collisions; overall, writes were found to be slower than desired.

Append: N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients: 6.0 MB/s for one client, dropping to 4.8 MB/s for 16. In practice this is not an issue, since applications produce multiple files concurrently and can append to one file while the chunkservers for another file are busy.

Real-World Clusters: Cluster A is used for research and development by about 100 engineers; Cluster B is used for production data processing on multi-TB data sets with longer-running tasks. Cluster B has more dead files, more chunks, and larger files.

Workload: read/write rates were measured over time periods. The size of the master's memory does not limit capacity. The average write rate was under 30 MB/s (and each byte must be written 3 times); read rates were much higher than write rates. Operations sent to the master ran at roughly 200-500 ops/sec. The master was once a bottleneck; binary searches through the namespaces now allow thousands of file accesses per second, and a lookup cache in front of the namespace data structure could help further.

Recovery Time: after a chunkserver failure, its chunks must be cloned to restore their replication level. After one chunkserver was killed, all chunks were restored in 23.2 minutes at an effective replication rate of 440 MB/s. When 2 chunkservers were killed, all chunks reduced to a single replica were restored to at least 2x replication within 2 minutes.

Experiences: GFS was initially seen as the backend file system for a production system; it now includes permissions and quotas. The biggest problems were disk and Linux driver related: disks claimed to support a range of IDE protocol versions but only the most recent worked reliably, and mismatched drivers could silently corrupt data (motivating the checksums). In Linux 2.2 the cost of fsync() was proportional to the size of the file rather than the size of the modified portion, which required moving to Linux 2.4. A single reader-writer lock blocked the primary network thread, which uses mmap(), from mapping new data into memory; mmap() was replaced with pread() at the cost of an extra copy. Overall, GFS addresses day-to-day data processing needs for complicated distributed systems built from commodity components.

Conclusions: GFS demonstrates the qualities essential for large-scale data processing on commodity hardware: it treats component failures as the norm rather than the exception, optimizes for huge files that are mostly appended to, and achieves fault tolerance through constant monitoring, replication, and fast automatic recovery. High aggregate throughput comes from separating file system control (through the master) from data transfer and from the large chunk size.

GFS in the Wild: Google currently has multiple GFS clusters deployed for different purposes. The largest currently implemented systems have over 1000 storage nodes and over 300 TB of disk storage, and these clusters are heavily accessed by hundreds of clients on distinct machines. Slide from Michael Raines. Has Google made any adjustments?