The Google File System (GFS)

Transcription:

Cloud background
- Warehouse-scale systems
  - 10K-100K nodes
  - 50 MW (1 MW = 1,000 houses)
  - Power efficient
    - Located near cheap power
    - Passive cooling
    - Power Usage Effectiveness = total facility power / IT equipment power: 1.2 or better
  - CPU, storage, network, power:
    - A 10K-100K-node data center is 3-5x cheaper than a 100-1K-node data center

Introduction: Design constraints
- Component failures are the norm
  - 1000s of components
  - Bugs, human errors, failures of memory, disks, connectors, networking, and power supplies
  - Require monitoring, error detection, fault tolerance, automatic recovery
- Files are huge by traditional standards
  - Multi-GB files are common
  - Billions of objects
- Most modifications are appends
  - Random writes are practically nonexistent
  - Many files are written once and read sequentially
- Two types of reads
  - Large streaming reads
  - Small random reads (in the forward direction)
- Sustained bandwidth is more important than latency
- File system APIs are open to changes

Interface Design
- POSIX-like
  - create, delete, open, close, read, write
- Additional operations
  - Snapshot
  - Record append
    - Supports concurrent appends
    - Guarantees atomicity of each append
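The interface just listed can be summarized as a small client API. Below is a hedged sketch in Python of what such an interface could look like; the class, method names, and signatures are assumptions made for illustration, not the actual GFS client library.

```python
# Illustrative sketch of a GFS-like client API (assumed names/signatures).
from abc import ABC, abstractmethod

class FileHandle:
    """Opaque handle returned by open(); its contents are an assumption here."""

class GFSClient(ABC):
    # POSIX-like operations
    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def delete(self, path: str) -> None: ...
    @abstractmethod
    def open(self, path: str) -> FileHandle: ...
    @abstractmethod
    def close(self, handle: FileHandle) -> None: ...
    @abstractmethod
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...

    # Additional GFS operations
    @abstractmethod
    def snapshot(self, source: str, destination: str) -> None: ...
    @abstractmethod
    def record_append(self, handle: FileHandle, data: bytes) -> int:
        """Appends atomically at least once; returns the offset GFS chose."""
```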

Architectural Design
- A GFS cluster
  - A single master
  - Multiple chunkservers per master
    - Accessed by multiple clients
  - Running on commodity Linux machines
- A file
  - Represented as fixed-size chunks
    - Labeled with 64-bit unique global IDs
    - Divided into chunks (3-way replicated), stored at chunkservers

Architectural Design (2)
- Single master, multiple chunkservers, multiple clients
- Files divided into fixed-size chunks
  - Each chunk identified by an immutable and globally unique chunk handle (how is it created?)
  - Stored by chunkservers locally as regular files
  - Each chunk is replicated
- Master maintains all metadata
  - Namespaces
  - Access control (what does this say about security?)
  - Heartbeat
- No client-side caching of file data, because access is streaming. What about the server side?

Single Master
- General disadvantages for distributed systems:
  - Single point of failure
  - Bottleneck (scalability)
- Solution?
  - Clients use the master only for metadata
    - Reading and writing go directly through the chunkservers

Architectural Design (3): reading a file (see the sketch below)
1. Client translates the file name and byte offset into a chunk index.
2. Sends a request to the master.
3. Master replies with the chunk handle and the locations of the replicas.
4. Client caches this info.
5. Sends a request to a close replica, specifying the chunk handle and byte range.
- Requests to the master are typically buffered
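To make the read path concrete, here is a minimal, self-contained Python sketch. The dictionaries standing in for the master's metadata table, the chunkservers, and the client cache, as well as the example file name and chunk handle, are assumptions for illustration only.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

# Master metadata: file name -> list of (chunk handle, replica locations) per chunk.
master_metadata = {"/logs/web.log": [("handle-0001", ["cs1", "cs2", "cs3"])]}
# Chunkserver storage: (server, chunk handle) -> chunk contents.
chunkservers = {("cs1", "handle-0001"): b"x" * 128}

client_cache = {}  # (file name, chunk index) -> (chunk handle, replicas)

def ask_master(filename, chunk_index):
    """Master lookup: returns the chunk handle and its replica locations."""
    return master_metadata[filename][chunk_index]

def gfs_read(filename, offset, length):
    chunk_index = offset // CHUNK_SIZE           # translate byte offset -> chunk index
    key = (filename, chunk_index)
    if key not in client_cache:                  # contact the master only on a cache miss
        client_cache[key] = ask_master(filename, chunk_index)
    handle, replicas = client_cache[key]
    replica = replicas[0]                        # "closest" replica, picked naively here
    chunk = chunkservers[(replica, handle)]      # data path goes straight to the chunkserver
    start = offset % CHUNK_SIZE
    return chunk[start:start + length]

print(gfs_read("/logs/web.log", 0, 16))          # b'xxxxxxxxxxxxxxxx'
```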

Chunk Size
- 64 MB: much larger than a normal file system block
  - Fewer chunk location requests to the master
  - Fewer metadata entries for a client (or for the master) to cache in memory
  - Potential issue with fragmentation
- Hotspots: some files (e.g., executables) may be accessed too heavily
  - Use a higher replication factor
  - Or ...

Metadata
- Three major types
  - File and chunk namespaces
  - File-to-chunk mappings
  - Locations of chunk replicas
- All metadata is in memory
  - System capacity is limited by the memory of the master
    - Memory is cheap
  - On a failure
    - Namespace and mappings are recovered through an operation log (kept on disk and replicated remotely)
    - Location info is rebuilt by querying chunkservers upon recovery

Metadata: Chunk Location
- Not kept persistent at the master
  - Chunkservers are polled at startup
  - Info is periodically refreshed using heartbeat messages to chunkservers
    - Possibly triggers re-replication
- Simplicity
  - No need to sync info at the master as nodes leave (voluntarily or because of a failure)
  - With many failures, it is hard to keep consistent anyway

Metadata: Operation Log
- Serves as a logical timeline for when files and chunks are created
- Keeps
  - File/chunk namespaces
  - File-to-chunk mappings
- Replicated on remote machines
- Checkpointed periodically to truncate the log and bound startup time
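A quick back-of-the-envelope calculation shows why the large chunk size keeps the master's in-memory metadata manageable. The 1 TB file size and the roughly 64 bytes of metadata per chunk are assumed figures used only for illustration.

```python
GFS_CHUNK = 64 * 1024 * 1024      # 64 MB GFS chunk
FS_BLOCK = 4 * 1024               # a typical local file system block, for contrast
FILE_SIZE = 1 * 1024**4           # a hypothetical 1 TB file

gfs_entries = FILE_SIZE // GFS_CHUNK   # 16,384 chunk entries
fs_entries = FILE_SIZE // FS_BLOCK     # 268,435,456 block entries
print(gfs_entries, fs_entries)

# At an assumed ~64 bytes of metadata per chunk, the whole 1 TB file costs
# only about 1 MB of master RAM.
print(round(gfs_entries * 64 / 1e6, 2), "MB of metadata")
```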

Consistency Model
- File namespace mutations occur atomically
  - Relatively simple, since there is just a single master
- A file region is
  - Consistent if all clients see the same data, independent of the replica they access
  - Defined if consistent and all clients see the modification in its entirety (as if the change were atomic)

                       Write                      Record Append
  Serial success       Defined                    Defined, but interspersed with inconsistent
  Concurrent success   Consistent but undefined   Defined, but interspersed with inconsistent
  Failure              Inconsistent               Inconsistent

Consistency Model (continued)
- Two types of updates
  - Writes (at an application-specified offset)
  - Record appends (at an offset of GFS's choosing)
- Relaxed consistency
  - Concurrent changes are consistent but undefined
  - A record append is atomically committed at least once
    - Even during concurrent updates
    - The offset returned to the client marks the beginning of the defined region
    - Occasional padding/duplicates
- Updates are applied in the same order at all replicas
  - Chunk version numbers are used to detect stale replicas
    - They are garbage collected ASAP
  - A client read may temporarily see stale values
    - Staleness is limited by a timeout on the cache entry

Implications for applications (see the record-framing sketch below)
- Use appends rather than overwriting
- Checkpoints to mark defined portions of files
- Self-validating/self-identifying records
  - Use checksums to throw away extra padding and record fragments
  - For non-idempotent operations, use a unique record identifier to handle duplicates

Leases and mutation order
- Master grants a chunk lease to the primary replica
- Primary determines the order of updates to all replicas
- Lease:
  - 60-second timeouts
  - Can be extended indefinitely
  - Extension requests are piggybacked on heartbeat messages
  - Can be revoked (to achieve copy-on-write for snapshots)
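The application-side advice above can be sketched as a simple record-framing scheme: each record carries a checksum (to discard padding and fragments) and a unique identifier (to drop duplicates from retried appends). The exact layout below (length + CRC header, 16-byte UUID record id) is an assumption for illustration, not a format used by real GFS applications.

```python
import struct, uuid, zlib

def encode_record(payload: bytes) -> bytes:
    rid = uuid.uuid4().bytes                       # unique record identifier
    body = rid + payload
    header = struct.pack("!II", len(body), zlib.crc32(body))
    return header + body

def decode_records(data: bytes):
    """Yield payloads, skipping padding/fragments and duplicate records."""
    seen, pos = set(), 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from("!II", data, pos)
        body = data[pos + 8 : pos + 8 + length]
        if length >= 16 and len(body) == length and zlib.crc32(body) == crc:
            rid, payload = body[:16], body[16:]
            if rid not in seen:                    # duplicate from a retried append?
                seen.add(rid)
                yield payload
            pos += 8 + length                      # valid record: jump past it
        else:
            pos += 1                               # padding or fragment: resynchronize

rec = encode_record(b"click,user=42")
stream = rec + b"\x00" * 32 + rec                  # duplicate record after padding
print(list(decode_records(stream)))                # [b'click,user=42']
```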

The leasing mechanism (control flow of a write)
1. Client asks the master for all replicas and the identity of the primary.
2. Master replies; the client caches the answer.
3. Client pre-pushes the data to all replicas.
4. After all replicas acknowledge, the client sends the write request to the primary.
5. Primary forwards the write request to all replicas.
6. Secondaries signal completion.
7. Primary replies to the client. Errors are handled by retrying.

Data Flow
- Separation of control and data flows
  - Control goes from the client to the primary to all replicas
  - Data is spread using chain replication
    - Push data to the closest chunkserver
    - Transfers are pipelined: a chunkserver forwards data as soon as it receives it
- If a write straddles chunks, it is broken down into multiple writes, which can cause undefined states.

Atomic Record Appends (see the sketch below)
1. Client pushes the data to all replicas.
2. Sends the request to the primary.
3. If the record does not fit in the chunk, the primary pads the current chunk and tells the client to retry with a new chunk.
4. Primary writes the data and tells the replicas to do the same at its chosen offset.
- If any replica fails, the client retries
  - Different replicas may end up storing different sequences of bits
  - On success, the data is written everywhere at the same offset
- Clients need to handle the resulting inconsistencies (see above)

Snapshot
- Used to
  - Create copies of large data sets
  - Have something to go back to quickly before applying tentative updates
- Handled using copy-on-write
  - Revoke outstanding leases
  - Log the operation to disk
  - Duplicate the metadata, but point to the same chunks
  - When a client requests a write, the master allocates a new chunk handle
    - To save bandwidth, the new chunk copy is located on the same chunkserver as the old one
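The primary's decision in a record append can be sketched as follows. The ChunkReplica class, the zero-byte padding, and the single in-memory chunk per replica are simplifying assumptions, not the chunkserver's actual behavior.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class ChunkReplica:
    def __init__(self):
        self.data = bytearray()
    def write_at(self, offset, record):
        self.data[offset:offset + len(record)] = record

def record_append(primary, secondaries, record):
    remaining = CHUNK_SIZE - len(primary.data)
    if len(record) > remaining:
        # Pad the current chunk on all replicas; the client must retry on a new chunk.
        for rep in [primary] + secondaries:
            rep.write_at(len(rep.data), b"\x00" * (CHUNK_SIZE - len(rep.data)))
        return None                      # signals "retry with the next chunk"
    offset = len(primary.data)           # the primary chooses the offset
    for rep in [primary] + secondaries:  # all replicas write at that same offset
        rep.write_at(offset, record)
    return offset                        # returned offset marks the defined region

primary, s1, s2 = ChunkReplica(), ChunkReplica(), ChunkReplica()
print(record_append(primary, [s1, s2], b"record-1"))   # 0
print(record_append(primary, [s1, s2], b"record-2"))   # 8
```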

Master Operation
- No directories
- No hard links or symbolic links
- Logically, the namespace is represented as a lookup table
  - Maps full path names to metadata

Locking Operations (see the sketch below)
- A lock per path
  - To access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf
- A directory can be modified concurrently
  - Each thread acquires
    - A read lock on the directory
    - A write lock on the file
- Totally ordered locking to prevent deadlocks

Replica Placement
- Goals:
  - Maximize data reliability and availability
    - Spread chunk replicas across machines and racks
  - Maximize network bandwidth utilization
- Creation
  - Prefer underutilized chunkservers
  - Limit new chunks per chunkserver (why?)
- Re-replication
  - Prioritize chunks with lower replication factors
- Rebalancing
  - Periodically move replicas around for better disk and load balancing

Garbage Collection
- Used to implement lazy file deletion
  - Simple
  - Works despite master or replica failures, message loss, etc.
- Deleted files are first hidden for three days
- After 3 days, the in-memory metadata of the hidden file is removed
- Orphaned chunks are detected and deleted during a regular scan
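Here is a small sketch of the per-path locking rule referenced above: read locks on every ancestor of the target path, a read or write lock on the full path itself, and a fixed acquisition order so operations cannot deadlock. The helper names and the simple (path, mode) representation are assumptions for illustration.

```python
def ancestor_paths(path):
    """All proper ancestors of a full path, e.g. /d1 and /d1/d2 for /d1/d2/leaf."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def locks_needed(path, write):
    """Return (path, mode) pairs in a total order; mode is 'r' or 'w'."""
    locks = [(p, "r") for p in ancestor_paths(path)]   # read-lock each ancestor
    locks.append((path, "w" if write else "r"))        # write-lock the leaf on a mutation
    return sorted(locks)                               # fixed order prevents deadlock

print(locks_needed("/d1/d2/leaf", write=True))
# [('/d1', 'r'), ('/d1/d2', 'r'), ('/d1/d2/leaf', 'w')]
```

Because two concurrent creations in the same directory each take only a read lock on that directory, they can proceed in parallel, which is what allows a directory to be modified concurrently.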

Stale replica detection
- Chunk version number
  - Increased on every new lease
  - Logged by the master and the replicas in persistent state
  - Relayed to the client when it contacts the master for a chunk handle
- Stale replicas are detected by the master when a chunkserver recovers

Fault Tolerance and Diagnosis
- Fast recovery
  - Master and chunkservers are designed to restore their state and start in seconds, regardless of termination conditions
- Chunk replication
  - Moving towards erasure coding
- Master replication
  - Shadow masters provide read-only access when the primary master is down

Fault Tolerance and Diagnosis (continued)
- Data integrity (see the sketch below)
  - A chunk is divided into 64-KB blocks
  - Each block has its own checksum (logged persistently)
  - Verified at read and write times
  - Also background scans for rarely used data
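A minimal sketch of the per-block checksumming described under data integrity: every 64 KB block of a chunk gets its own checksum, and a read verifies the checksums of every block it touches before returning data. CRC32, the in-memory lists, and the example sizes are assumptions for illustration, not the chunkserver's actual implementation.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks within a chunk

def checksum_blocks(chunk: bytes):
    """One checksum per 64 KB block (kept persistently in real GFS)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verified_read(chunk: bytes, checksums, offset: int, length: int) -> bytes:
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):                 # verify every block touched
        block = chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"corrupt block {b}: read from another replica instead")
    return chunk[offset:offset + length]

chunk = bytes(200 * 1024)                            # a 200 KB chunk fragment
sums = checksum_blocks(chunk)
print(len(verified_read(chunk, sums, 60 * 1024, 10 * 1024)))   # 10240
```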