L1: Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. ACM SOSP, 2003


Indian Institute of Science, Bangalore, India. Department of Computational and Data Sciences. DS256: Jan 2018 (3:1). L1: Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, ACM SOSP, 2003. Yogesh Simmhan, 9, 11, 16 Jan 2018. This work is licensed under a Creative Commons Attribution 4.0 International License; copyright for external content used with attribution is retained by the original authors.

Credits The following slides are based on: Distributed Filesystems, CSE 490h Introduction to Distributed Computing, Spring 2007, Aaron Kimball (https://courses.cs.washington.edu/courses/cse490h/07sp/); and Google File System, Alex Moshchuk.

File Systems Overview A system that permanently stores data. Usually layered on top of a lower-level physical storage medium. Divided into logical units called files, addressable by a filename (foo.txt). Usually supports hierarchical nesting (directories). A file path joins file & directory names into a relative or absolute address to identify a file (/home/aaron/foo.txt).

What Gets Stored User data itself is the bulk of the file system's contents. Also includes meta-data on a drive-wide and per-file basis. Drive-wide: available space, formatting info, character set, ... Per-file: name, owner, modification date, physical layout, ...

High-Level Organization Files are organized in a tree structure made of nested directories. One directory acts as the root. Links (symlinks, shortcuts, etc.) provide a simple means of creating multiple access paths to one file. Other file systems can be mounted and dropped in as sub-hierarchies (other drives, network shares).

Low-Level Organization (1/2) File data and meta-data stored separately File descriptors + meta-data stored in inodes Large tree or table at designated location on disk Tells how to look up file contents Meta-data may be replicated to increase system reliability

Low-Level Organization (2/2) The standard read-write medium is a hard drive (other media: CD-ROM, tape, ...), viewed as a sequential array of blocks; must address ~1 KB chunk at a time. The tree structure is flattened into blocks. Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout; inodes store all block numbers related to the file.

Fragmentation Example of how overlapping writes and deletes fragment a disk: start with [A][B][C][free]; appending to A gives [A][B][C][A][free]; deleting B leaves [A][free][C][A][free]; writing D into the gaps gives [A][D][C][A][D][free], so file D ends up split across non-contiguous blocks.

Design Considerations A smaller block size reduces the amount of wasted space; a larger block size increases the speed of sequential reads (may not help random access). Should the file system be faster or more reliable? But faster at what: Large files? Small files? Lots of reading? Frequent writers, occasional readers?

Distributed Filesystems Support access to files on remote servers Must support concurrency Make varying guarantees about locking, who wins with concurrent writes, etc... Must gracefully handle dropped connections Can offer support for replication and local caching Different implementations sit in different places on complexity/feature scale

GFS

Motivation Google needed a good distributed file system: redundant storage of massive amounts of data on cheap and unreliable computers. Why not use an existing file system? Google's problems are different from anyone else's: different workload and design priorities. GFS is designed for Google apps and workloads; Google apps are designed for GFS.

Assumptions High component failure rates: inexpensive commodity components fail all the time (the seeds for the Cloud on commodity clusters). Modest number of HUGE files: just a few million, each 100MB or larger; multi-GB files typical. Files are write-once, mostly appended to, perhaps concurrently (example?). Large streaming reads, sequential reads. High sustained throughput favored over low latency.

Web Search example Crawl the web: the crawler downloads URLs, stores them to the file system, extracts page content (words) and URL links, and does a BFS traversal. Index the text content: inverted index (word -> URL). Rank the pages: PageRank. (Pipeline: Crawler -> Store -> Index -> Web Graph -> PageRank.)

GFS Design Decisions Files stored as chunks Fixed size (64MB) Reliability through replication Each chunk replicated across 3+ chunkservers Single master to coordinate access, keep metadata Simple centralized management, placement decisions No data caching Little benefit due to large data sets, streaming reads Familiar interface, but customize the API Simplify the problem; focus on Google apps Add snapshot and record append operations

GFS Architecture Single master, multiple chunkservers. Can anyone see a potential weakness in this design?

Single master Downsides: single point of failure; scalability bottleneck. GFS solutions: shadow masters; minimize master involvement (never move data through it, use it only for metadata, and cache metadata at clients); large chunk size; master delegates authority to primary replicas in data mutations (chunk leases).

Chunks Stored as a Linux file on chunkservers. Default is 64MB, much larger than typical file system block sizes; lazy space allocation avoids fragmentation. Reduces master metadata. Reduces master interaction, allows metadata caching on the client, and a persistent TCP connection to the chunkserver. Con: hot-spots; may need to tune replication.
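
As a toy illustration of the fixed chunk size (this sketch and its names are mine, not part of GFS; Python is used only for illustration), a client library might locate data roughly like this:

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS default chunk size: 64 MB

    def chunk_index(offset):
        # Index of the chunk that contains this byte offset in the file.
        return offset // CHUNK_SIZE

    def chunk_byte_range(index):
        # Byte range [start, end) covered by chunk `index`.
        start = index * CHUNK_SIZE
        return start, start + CHUNK_SIZE

    # A read at file offset 200 MB falls in chunk 3; the client asks the master
    # for chunk 3's handle and replica locations, then reads from a chunkserver.
    assert chunk_index(200 * 1024 * 1024) == 3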

Metadata Global metadata is stored on the master File and chunk namespaces Mapping from files to chunks Locations of each chunk s replicas All kept in-memory (64 bytes / chunk) Fast Easily accessible Cheaper to add more memory
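
A rough Python sketch of the three kinds of in-memory state this slide lists; the class and field names are illustrative assumptions, not Google's data structures:

    from dataclasses import dataclass, field

    @dataclass
    class ChunkInfo:
        version: int = 0                                  # for stale-replica detection
        locations: list = field(default_factory=list)     # chunkserver addresses (not persisted)

    @dataclass
    class MasterMetadata:
        namespace: set = field(default_factory=set)       # file and directory paths
        file_chunks: dict = field(default_factory=dict)   # file path -> ordered chunk handles
        chunks: dict = field(default_factory=dict)        # chunk handle -> ChunkInfo

    # The namespace and file->chunk mapping are persisted via the operation log;
    # chunk locations are volatile and rebuilt from chunkserver heartbeats/reports.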

Metadata Master builds the chunk-to-server mapping at bootup from the chunkservers, and keeps it updated via heartbeats. This avoids syncing chunkservers & master: chunkservers can come and go, fail, and restart often. It gives flexibility to manage chunkservers independently, with lazy propagation to the master. The chunkserver is the final authority on whether it has a chunk or not.

Metadata Metadata should be consistent, and updates visible to clients only after they are stable; once metadata is visible, clients can use/change it. The master has an operation log for persistent logging of critical metadata updates: the namespace & file-chunk mapping are persisted on local disk as a log of mutations. Logical time gives the order of concurrent ops (single master!). The log is replicated, and the client is acked only after replication. Replay the log to recover metadata; checkpoints allow faster recovery.
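
A minimal sketch of the "log, replicate, then apply and ack" discipline described above; the JSON encoding and the append_line call on remote log replicas are hypothetical, just to show the ordering:

    import json, os

    class OperationLog:
        def __init__(self, path, log_replicas):
            self.local = open(path, "a")
            self.log_replicas = log_replicas    # remote replicas of the log (assumed interface)

        def commit(self, record):
            # Serialize the metadata mutation, persist it locally and remotely,
            # and only then let the caller apply it in memory and ack the client.
            line = json.dumps(record)
            self.local.write(line + "\n")
            self.local.flush()
            os.fsync(self.local.fileno())
            for replica in self.log_replicas:
                replica.append_line(line)       # hypothetical remote call

    # Recovery replays the log on top of the latest checkpoint to rebuild the
    # namespace and the file->chunk mapping.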

Chunk Servers Master controls chunk placement; chunkservers store the chunks and interact with clients for data operations. Heartbeat messages to the master report liveness and the list of chunks present.

Master's responsibilities Namespace management/locking. Metadata storage. Periodic scanning of in-memory metadata. Periodic communication with chunkservers: give instructions, collect state, track cluster health.

Chunk Management Chunk creation, re-replication, rebalancing: balance space utilization, bandwidth utilization, and access speed/performance; spread replicas across servers and racks to reduce correlated failures; re-replicate data if redundancy falls below a threshold; rebalance data to smooth out storage and request load.
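
A hedged sketch of a rack-aware placement policy in the spirit of the bullet above; the function and its arguments are invented for illustration (the real policy also weighs disk utilization and recent creation counts per chunkserver):

    def choose_replica_servers(candidates, rack_of, existing, needed):
        # candidates: chunkserver ids; rack_of: server -> rack; existing: servers
        # already holding a replica; needed: additional replicas to place.
        used_racks = {rack_of[s] for s in existing}
        chosen = []
        # Prefer servers on racks that do not yet hold a replica.
        for s in candidates:
            if len(chosen) == needed:
                break
            if s not in existing and rack_of[s] not in used_racks:
                chosen.append(s)
                used_racks.add(rack_of[s])
        # Fall back to any remaining server once rack diversity is exhausted.
        for s in candidates:
            if len(chosen) == needed:
                break
            if s not in existing and s not in chosen:
                chosen.append(s)
        return chosen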

Master's responsibilities Garbage collection: simpler, more reliable than traditional file delete; the master logs the deletion, renames the file to a hidden name, and lazily garbage collects hidden files. Stale replica deletion: detect stale replicas using chunk version numbers.

Fault Tolerance High availability: fast recovery (master and chunkservers restartable in a few seconds); chunk replication (default: 3 replicas; replication for fault tolerance vs. replication for performance); shadow masters. Data integrity: checksum every 64KB block in each chunk.
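
A small sketch of the per-block checksumming mentioned above; the 64 KB block size is from the paper, while the choice of CRC32 and the function names are assumptions for illustration:

    import zlib

    BLOCK_SIZE = 64 * 1024   # each chunk is checksummed in 64 KB blocks

    def block_checksums(chunk_data):
        # One checksum per 64 KB block of the chunk.
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verify_blocks(chunk_data, stored_checksums):
        # A chunkserver re-verifies the blocks covered by a read (or a scan of
        # an idle chunk) before returning data; a mismatch indicates corruption.
        return block_checksums(chunk_data) == stored_checksums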

Relaxed consistency model Namespace updates are atomic and serializable; namespace locking guarantees this. Consistent = all replicas have the same value, and any client reads the same value. Defined = consistent, and all replicas reflect the mutation performed; all clients read the updated value in its entirety.

Relaxed consistency model A single successful write leaves the region defined. Concurrent writes leave the region consistent, but possibly undefined: all clients see the same content, but writes may have happened in fragments/interleaved. Failed writes leave the region inconsistent: different clients see different content at different times.

Mutations Mutation = write or append. Write: to a specific offset. Append: data appended atomically at least once; GFS returns the start offset of the defined region, and may pad with inconsistent regions or create duplicates. Must be done for all replicas. A mutated file is guaranteed to be defined after successful mutations: mutations are applied to a chunk in the same order on all its replicas, and chunk version numbers detect any stale replica that was down during a mutation. Caching of metadata in the client can cause stale reads!

Mutations Goal: Minimize master involvement Lease mechanism for a mutation master picks one replica as primary; gives it a lease for mutations from any client (default 60 secs, extensible as part of heartbeat) primary defines a serial order of mutations all replicas follow this order
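
A sketch of how the master's lease bookkeeping might look; the 60-second default is from the paper, while the class, method names, and primary-selection rule here are simplifications:

    import time

    LEASE_SECONDS = 60

    class LeaseTable:
        def __init__(self):
            self.leases = {}  # chunk handle -> (primary server, expiry time)

        def grant(self, chunk, replicas):
            primary, expiry = self.leases.get(chunk, (None, 0.0))
            if primary is None or time.time() >= expiry:
                primary = replicas[0]      # pick some up-to-date replica as primary
                self.leases[chunk] = (primary, time.time() + LEASE_SECONDS)
            return primary

        def extend(self, chunk, server):
            # Extension requests are piggybacked on heartbeats while mutations
            # are still in flight.
            primary, _ = self.leases.get(chunk, (None, 0.0))
            if primary == server:
                self.leases[chunk] = (primary, time.time() + LEASE_SECONDS)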

Mutations Data flow is decoupled from control flow. Data is forwarded to the closest server and pipelined to fully utilize bandwidth: the ideal time to push B bytes to R replicas is B/T + R*L (B bytes, T throughput, R replicas, L per-hop latency). The client sends the data to all replicas, in any order; each chunkserver holds it in an internal buffer. The client then sends the write request to the primary, identifying the data by its IDs. The primary assigns serial numbers to the mutations, applies them in serial order, and forwards the request to the secondaries; they apply the mutations in the same order and ack the primary, which acks the client. Any failure at a replica leaves the region in an inconsistent state; the client retries. Mutations that span chunks are split by the client into multiple operations.
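
The B/T + R*L estimate above can be made concrete with a tiny calculation; the numbers are illustrative (the 1 MB / 100 Mbps / 1 ms case matches the ballpark quoted in the paper):

    def ideal_push_seconds(num_bytes, link_throughput_Bps, num_replicas, hop_latency_s):
        # Pipelined over a chain of replicas: elapsed time ~ B/T + R*L.
        return num_bytes / link_throughput_Bps + num_replicas * hop_latency_s

    # 1 MB pushed over 100 Mbps links to 3 replicas with 1 ms per hop:
    # 1e6 / 12.5e6 + 3 * 0.001 = 0.083 s, i.e. roughly 80 ms.
    print(ideal_push_seconds(1_000_000, 100e6 / 8, 3, 0.001))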

Mutations Clients need to distinguish between defined and undefined regions. Some work has moved into the applications: e.g., prefer appends; use self-validating, self-identifying records; checkpoint. Use checksums to identify and discard padding; use unique ids to identify duplicate records. Simple and efficient; Google apps can live with it, but what about other apps?

Atomic record append Concurrent writes to same region not serializable In Append, client specifies data. GFS appends it to the file atomically at least once GFS picks the offset. Returns offset to client. works for concurrent writers Client sends data to all replicas of last chunk of file. Primary applies to its end; tells secondaries to apply at that offset. E.g. concurrent appends from different clients Used heavily by Google apps e.g., for files that serve as multiple-producer/single-consumer queues
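
A sketch of the client-side record-append contract described above; gfs.try_append is a hypothetical client stub, and the 16 MB cap reflects the paper's limit of one quarter of the chunk size:

    MAX_RECORD = 16 * 1024 * 1024   # appends limited to 1/4 of the 64 MB chunk size

    def record_append(gfs, path, data):
        assert len(data) <= MAX_RECORD
        while True:
            ok, offset = gfs.try_append(path, data)   # the primary picks the offset
            if ok:
                return offset                         # record is defined at this offset
            # On failure, some replicas may hold partial or duplicate data in an
            # inconsistent region; retrying can append the record a second time,
            # which readers filter out via per-record unique ids (see above).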

Snapshot Full copies of a file, made rapidly. On a snapshot request, the master revokes all outstanding primary leases on the affected chunks, logs the operation to disk, and copies the metadata: the new file's chunks point to the old chunks, whose reference counts are incremented. On a later write, if the reference count is > 1, the old chunk is first copied to a new chunk (copy-on-write, done on the local chunkserver).

Performance

Deployment in Google 50+ GFS clusters Each with thousands of storage nodes Managing petabytes of data GFS is under BigTable, etc.

Conclusion GFS demonstrates how to support large-scale processing workloads on commodity hardware: design to tolerate frequent component failures; optimize for huge files that are mostly appended to and read; feel free to relax and extend the FS interface as required; go for simple solutions (e.g., single master). GFS has met Google's storage needs, so it must be good!

Disk failure rates, observed vs. expected failures per year: "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?", Bianca Schroeder and Garth A. Gibson, USENIX FAST 2007.

Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010

Block Report A DataNode identifies the block replicas in its possession to the NameNode by sending a block report: the block id, generation stamp, and length of each replica. The first block report is sent immediately after DataNode registration; subsequent block reports are sent every hour.

Heartbeat DataNodes send heartbeats to the NameNode to confirm that the DataNode is up and reachable; the default interval is 3 seconds. If no heartbeat is received for 10 minutes, the NameNode considers the DataNode out of service and its block replicas unavailable, and schedules re-replication. Heartbeats piggyback storage capacity, fraction of storage in use, and data transfers in progress. The NameNode responds to heartbeats with commands: replicate blocks, remove local replicas, shut down the node, send an immediate block report.
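
A sketch of the liveness rule on the NameNode side, using the 3-second and 10-minute figures from the slide; the class and method names are illustrative:

    import time

    HEARTBEAT_INTERVAL_S = 3
    DEAD_AFTER_S = 10 * 60

    class LivenessTracker:
        def __init__(self):
            self.last_seen = {}   # datanode id -> timestamp of last heartbeat

        def on_heartbeat(self, node):
            self.last_seen[node] = time.time()

        def dead_nodes(self):
            now = time.time()
            return [n for n, t in self.last_seen.items() if now - t > DEAD_AFTER_S]

    # For each dead node, the NameNode treats its replicas as unavailable and
    # queues the affected blocks for re-replication.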

Checkpoint Node It can take hours to recreate the NameNode state from 1 week of journal (ops log). The Checkpoint Node periodically combines the existing checkpoint & journal to create a new checkpoint & an empty journal: it downloads the current checkpoint & journal from the NameNode, merges them locally, and returns the new checkpoint to the NameNode. The new checkpoint lets the NameNode truncate the tail of the journal.

Backup Node The BackupNode can create periodic checkpoints; it is a read-only NameNode. It also maintains an in-memory image of the namespace, synchronized with the NameNode: it accepts the journal stream of namespace transactions from the active NameNode, saves them to its local store, and applies them to its own namespace image in memory. Creating checkpoints is done locally.

Block Creation Pipeline The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode. Data is pushed through the pipeline as 64 KB packets; packets are sent asynchronously, up to a limit on outstanding (unacknowledged) packets. The client generates checksums for the data, and the DataNode stores the checksums for each block; checksums are verified by the client while reading, to detect corruption.

Block Placement The distance from a node to its parent node is 1; the distance between two nodes is the sum of their distances to their closest common ancestor. Placement trades off minimizing write cost against maximizing reliability, availability & aggregate read bandwidth: the 1st replica goes on the writer's node, the 2nd & 3rd on two different nodes in a different rack. No DataNode holds more than one replica of a block; no rack holds more than two replicas of a block. The NameNode returns replica locations in order of closeness to the reader.
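
A sketch of the distance metric for the usual two-level (node/rack) topology; the rack map and node names are made up for the example:

    # Each hop to a parent counts 1, so two nodes in the same rack are at
    # distance 2 and nodes in different racks are at distance 4.
    def distance(node_a, node_b, rack_of):
        if node_a == node_b:
            return 0
        if rack_of[node_a] == rack_of[node_b]:
            return 2      # node -> rack switch -> node
        return 4          # node -> rack -> core -> rack -> node

    racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}
    assert distance("dn1", "dn2", racks) == 2
    assert distance("dn1", "dn3", racks) == 4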

Replication The NameNode detects under- or over-replication from block reports. To remove an excess replica, it avoids reducing the number of racks hosting replicas and prefers the DataNode with the least available disk space. Under-replicated blocks are put in a priority queue; a block with only 1 replica has the highest priority. A background thread scans the replication queue and decides where to place the new replicas.

Balancer Balances disk space usage on an HDFS cluster based on a threshold: the utilization of each node (used %) should differ from the utilization of the cluster by no more than the threshold value. It iteratively moves replicas from nodes with higher utilization to nodes with lower utilization, while maintaining data availability, minimizing inter-rack copying, and limiting the bandwidth consumed.
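
A sketch of the balancer's threshold rule; the 10% default and the example numbers are illustrative:

    def is_balanced(node_used_pct, cluster_used_pct, threshold_pct=10.0):
        # A node is balanced when its utilization is within `threshold_pct`
        # percentage points of the cluster-wide utilization.
        return abs(node_used_pct - cluster_used_pct) <= threshold_pct

    # e.g. cluster at 60% used with a 10% threshold: a node at 75% is a source
    # of moves, a node at 45% is a destination, and a node at 65% is left alone.
    assert not is_balanced(75, 60)
    assert not is_balanced(45, 60)
    assert is_balanced(65, 60)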

Discussion How many sys-admins does it take to run a system like this? Much of the management is built in. Google currently has ~450,000 machines; with GFS only 1/3 of these are effective, and that's a lot of extra equipment, extra cost, extra power, extra space! GFS has achieved availability/performance at a very low cost, but can you do it for even less? Is GFS useful as a general-purpose commercial product? Small-write performance may not be good enough; relaxed consistency model.