Google File System (GFS) and Hadoop Distributed File System (HDFS)


Hadoop: Architectural Design Principles
- Linear scalability: more nodes can do more work within the same time; linear on data size, linear on compute resources.
- Move computation to data: minimize expensive data transfers; data is large, programs are small.
- Reliability and availability: failures are common.
- Persistent storage.
- Simple computational model (MapReduce): hides complexity in an efficient execution framework.
- Streaming data access (avoid random reads): more efficient than seek-based data access.
- Need for a suitable file system.

A Typical Cluster Architecture

Failures in the Literature
- LANL data (DSN 2006): data collected over 9 years on 4,750 machines with 24,101 CPUs.
  - Distribution of failures: hardware ~60%, software ~20%, network/environment/humans ~5%, unknown ("aliens") ~25%.
  - Depending on the system, failures occurred between once a day and once a month.
  - Most of the systems in the survey were the cream of the crop at their time.
- Disk drive failure analysis (FAST 2007):
  - Annualized failure rates vary from 1.7% for one-year-old drives to over 8.6% for three-year-old ones.
  - Utilization affects failure rates only in very old disk drive populations.
  - Temperature change can increase failure rates, but mostly for old drives.
- Memory also fails (DRAM error analysis, SIGMETRICS 2009).

GFS & HDFS
Distributed file systems manage storage across a network of machines.
GFS
- Implemented specifically to meet the rapidly growing demands of Google's data processing needs.
- "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03.
HDFS
- Hadoop has a general-purpose file system abstraction (i.e., it can integrate with several storage systems such as the local file system, HDFS, Amazon S3, etc.).
- HDFS is Hadoop's flagship file system, implemented for running Hadoop's MapReduce applications.
- Based on work done by Google in the early 2000s.
- "The Hadoop Distributed File System", Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, MSST 2010.

Roadmap
- Assumptions / requirements
- Design choices
- Architecture / implementation

Assumptions (Section 2.1 of "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP '03)
- Modest number of very large files.
- Data access follows a write-once, read-many-times pattern.
- Large reads: time to read the whole dataset is more important than the latency of any single read.
- Mostly, files are appended to, perhaps concurrently.
- High sustained throughput is favored over low latency.
- Commodity hardware, so fault tolerance is needed: high component failure rates, since inexpensive commodity components fail all the time.
Not a good fit for:
- low-latency data access
- lots of small files
- multiple writers, arbitrary file modifications

Design
- Files stored as chunks: fixed size (64 MB).
- Reliability through replication: each chunk replicated across 3+ chunkservers.
- Single master to coordinate access and keep metadata: simple, centralized management.
- No data caching: little benefit due to large data sets and streaming reads.
- Familiar interface, but a customized API: simplify the problem, focus on Google apps; add snapshot and record append operations.

Overview

Namenodes and Datanodes
Two types of nodes: one Namenode (master) and multiple Datanodes (chunkservers).
- The Namenode manages the filesystem namespace: the file system tree and metadata, stored persistently, and the block locations, stored transiently.
- Datanodes store and retrieve data blocks when they are told to by clients or the Namenode.
- Datanodes report back to the Namenode periodically with lists of the blocks they are storing.
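To make the client's side of this split concrete, here is a minimal sketch of reading a file through Hadoop's standard FileSystem API. The path /data/example.txt is a hypothetical example, and the sketch assumes a core-site.xml pointing at the Namenode is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the Namenode address) from the configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client contacts the Namenode only for metadata (block locations);
        // the block contents are streamed directly from the Datanodes.
        Path file = new Path("/data/example.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```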

HDFS/GFS and Hadoop/MapReduce Component Naming Conventions
MapReduce daemons:
- JobTracker: client communication, job scheduling, resource management, lifecycle coordination (~ master in Google MapReduce)
- TaskTrackers: task execution modules (~ workers in Google MapReduce)
HDFS daemons:
- NameNode: namespace and block management (~ master in GFS)
- DataNodes: block replica containers (~ chunkservers in GFS)

Blocks
Files are broken into block-sized chunks (64 MB by default). The large block abstraction means that:
- A file can be larger than any single disk in the network.
- The storage subsystem is simplified (e.g., metadata bookkeeping).
- Replication for fault tolerance and availability is facilitated.
- Clients need to interact with the master less often: reads and writes on the same chunk require only one initial request to the master for chunk location information, and applications mostly read and write large files sequentially.
- A client can reduce network overhead by keeping a persistent TCP connection to a chunkserver over an extended period of time.
- The metadata stored on the master shrinks, which allows the master to keep it in memory.
Potential disadvantage: a chunkserver holding a popular small file can become a hotspot.
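As a small worked illustration of the block abstraction (the 10 GB file size below is an assumption; only the 64 MB default comes from the slide):

```java
// Back-of-the-envelope sketch: how many 64 MB chunks a file occupies.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB default

    public static void main(String[] args) {
        long fileSize   = 10L * 1024 * 1024 * 1024;      // a hypothetical 10 GB file
        long fullChunks = fileSize / CHUNK_SIZE;
        long tailBytes  = fileSize % CHUNK_SIZE;         // bytes in a final, partly filled chunk
        long chunks     = fullChunks + (tailBytes > 0 ? 1 : 0);
        // A 10 GB file maps to 160 chunks, which can be spread over many disks.
        System.out.printf("10 GB -> %d chunks (final chunk: %d bytes)%n",
                          chunks, tailBytes == 0 ? CHUNK_SIZE : tailBytes);
    }
}
```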

Single Master
Advantages: simplified design, global knowledge of the system.
Problems: single point of failure, scalability bottleneck.
GFS solutions:
- Shadow masters.
- Minimize master involvement:
  - never move data through the master; use it only for metadata, and cache metadata at clients
  - large chunk size
  - the master delegates authority to primary replicas for data mutations (chunk leases)

Metadata
Global metadata is stored on the master:
- file and chunk namespaces
- mapping from files to chunks
- locations of each chunk's replicas
- access control information
All of it is kept in memory (about 64 bytes per chunk): fast, and easily accessible (e.g., fast scans for balancing).
The master keeps an operation log for persistent logging of critical metadata updates:
- persistent on local disk
- replicated
- checkpointed for faster recovery
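A rough, illustrative estimate of why keeping all chunk metadata in memory is feasible; the 1 PB total is an assumption, and the flat 64 bytes per chunk is the figure quoted on the slide, not a measurement.

```java
// Back-of-the-envelope sketch: master memory needed for chunk metadata,
// assuming ~64 bytes per 64 MB chunk and (hypothetically) 1 PB of file data.
public class MasterMemoryEstimate {
    public static void main(String[] args) {
        long chunkSize     = 64L * 1024 * 1024;                  // 64 MB
        long bytesPerChunk = 64;                                 // metadata per chunk
        long totalData     = 1024L * 1024 * 1024 * 1024 * 1024;  // 1 PB

        long chunks   = totalData / chunkSize;                   // 16,777,216 chunks
        long metadata = chunks * bytesPerChunk;                  // 1 GiB of metadata
        System.out.printf("1 PB of data -> %d chunks -> %d MiB of metadata%n",
                          chunks, metadata / (1024 * 1024));     // -> 1024 MiB
    }
}
```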

Master's Responsibilities
- Metadata storage.
- Namespace management/locking.
- Periodic communication with chunkservers: give instructions, collect state, track cluster health.
- Chunk management:
  - Creation: place new replicas on chunkservers with below-average disk space utilization; limit recent creations on each chunkserver (creation predicts imminent heavy write traffic); spread replicas of a chunk across racks (a minimal placement sketch follows this list).
  - Re-replication: triggered when the number of available replicas falls below a user-specified goal, e.g., a chunkserver becomes unavailable, a replica is corrupted, a disk is disabled because of errors, or the replication goal is increased.
  - Rebalancing: examine the current replica distribution and move replicas for better disk space usage and load balancing.
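Purely as an illustration of the "below-average disk utilization" placement rule above, here is a minimal sketch; the class and field names are hypothetical, and the real GFS policy additionally limits recent creations per chunkserver and enforces rack spread.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: pick creation targets among chunkservers with
// below-average disk utilization, least utilized first. Not GFS code.
public class ReplicaPlacement {
    record ChunkServer(String id, double diskUtilization) {}

    static List<ChunkServer> pickTargets(List<ChunkServer> servers, int replicas) {
        double avg = servers.stream()
                .mapToDouble(ChunkServer::diskUtilization)
                .average().orElse(0.0);
        List<ChunkServer> belowAvg = servers.stream()
                .filter(s -> s.diskUtilization() < avg)
                .sorted(Comparator.comparingDouble(ChunkServer::diskUtilization))
                .toList();
        // Fall back to the least utilized servers overall if too few are below average.
        List<ChunkServer> pool = belowAvg.size() >= replicas ? belowAvg
                : servers.stream()
                         .sorted(Comparator.comparingDouble(ChunkServer::diskUtilization))
                         .toList();
        return pool.subList(0, Math.min(replicas, pool.size()));
    }

    public static void main(String[] args) {
        List<ChunkServer> cluster = List.of(
                new ChunkServer("cs1", 0.82), new ChunkServer("cs2", 0.35),
                new ChunkServer("cs3", 0.58), new ChunkServer("cs4", 0.41));
        System.out.println(pickTargets(cluster, 3));   // cs2, cs4, cs3
    }
}
```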

Master's Responsibilities (cont.)
- Garbage collection: simpler and more reliable than traditional file deletion.
  - The master logs the deletion and renames the file to a hidden name.
  - Hidden files are lazily garbage collected.
  - The master periodically asks the chunkservers which chunks they hold; chunks that can no longer be referenced can be removed.
- Stale replica deletion: detect stale replicas using chunk version numbers.

Mutations
- Mutation = write or record append.
- Must be done at all replicas.
- Goal: minimize master involvement.
- Lease mechanism: the master picks one replica as primary and gives it a lease for mutations.
- Data flow is decoupled from control flow.
- Mutations are not random access.

Caches
- Data is not cached by clients (only metadata is).
  - Access pattern: stream through huge files; the working set is too large to be cached.
  - The system is highly simplified: no cache coherence problem!
- Chunkservers do cache data indirectly: chunks are stored as local files, so Linux's buffer cache already keeps frequently accessed data in memory.

GFS - Overview

Read
- Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file (see the sketch after these read steps).
- The client sends the master the file name and chunk index.
- The master replies with the corresponding chunk handle and the locations of the replicas.
- The client caches this information.

Read (cont.)
- The client sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk.
- Further reads of the same chunk require no more client-master interaction.
- The client typically asks for multiple chunks in the same request, and the master can also include the information for the chunks immediately following those requested. This sidesteps several future client-master interactions.
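A minimal sketch of the client-side translation mentioned in the first read step; the constant and method names are illustrative, not the actual GFS client library.

```java
// Illustrative sketch: an application-level (file, byteOffset) request becomes
// (chunkIndex, offsetInChunk); only the chunk index is sent to the master.
public class ReadTranslation {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // fixed 64 MB chunk size

    static long chunkIndex(long byteOffset)    { return byteOffset / CHUNK_SIZE; }
    static long offsetInChunk(long byteOffset) { return byteOffset % CHUNK_SIZE; }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;                // the app asks for byte offset 200 MB
        System.out.printf("chunk index = %d, offset within chunk = %d MB%n",
                chunkIndex(offset), offsetInChunk(offset) / (1024 * 1024));
        // -> chunk index = 3, offset within chunk = 8 MB
    }
}
```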

Write
(Figure: a primary replica and two secondary replicas.)
- A primary replica holds a lease and coordinates writing on a chunk.
- Data flow is separated from control flow.

Write: Leases
- The master grants a chunk lease to one of the replicas, which we call the primary.
- The primary picks a serial order for all mutations to the chunk; all replicas follow this order when applying mutations.
- A lease has an initial timeout of 60 seconds, but the primary can request, and typically receives, extensions via the HeartBeat messages regularly exchanged between the master and the chunkservers.
- If the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

Write (cont.)
- The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses.
- The master replies with the identity of the primary and the locations of the other (secondary) replicas.
- The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

Write (cont.)
- The client pushes the data to all the replicas. Each chunkserver stores the data in an internal LRU buffer.
- Decoupling the data flow from the control flow improves performance: the expensive data flow can be scheduled based on the network topology.

Write (cont.)
- Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas.
- The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization.

Write (cont.)
- The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
- The secondaries all reply to the primary indicating that they have completed the operation.
- The primary replies to the client. Any errors encountered at any of the replicas are reported to the client.
A minimal sketch of the primary's serialization step follows.
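The sketch below illustrates only the ordering idea from the last two slides: the primary assigns consecutive serial numbers to concurrent mutations and forwards them to the secondaries in that order. The class and method names are hypothetical; this is not GFS code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the primary replica's serialization role: concurrent write
// requests (possibly from multiple clients) get consecutive serial numbers,
// and secondaries apply them in exactly that order.
public class PrimaryReplica {
    record Mutation(long serial, String clientId, byte[] data) {}

    private long nextSerial = 1;
    private final List<Mutation> appliedInOrder = new ArrayList<>();

    // Called when a client's write request arrives (the data itself was pushed earlier).
    public synchronized Mutation accept(String clientId, byte[] data) {
        Mutation m = new Mutation(nextSerial++, clientId, data);
        appliedInOrder.add(m);      // apply locally in serial-number order
        forwardToSecondaries(m);    // secondaries apply in the same order
        return m;
    }

    private void forwardToSecondaries(Mutation m) {
        // In the real system this would be an RPC to each secondary replica.
        System.out.printf("forwarding mutation #%d from %s (%d bytes)%n",
                m.serial(), m.clientId(), m.data().length);
    }

    public static void main(String[] args) {
        PrimaryReplica primary = new PrimaryReplica();
        primary.accept("clientA", new byte[]{1, 2, 3});
        primary.accept("clientB", new byte[]{4, 5});
        // Output shows mutations #1 and #2, the order every replica will follow.
    }
}
```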

Write: Decoupling Data and Control Flow
- Control flows from the client to the primary and then to all secondaries.
- Data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion.
- To avoid network bottlenecks and high-latency links (e.g., inter-switch links), each machine forwards the data to the closest machine in the network topology.
- Once a chunkserver receives some data, it starts forwarding immediately; with a switched network and full-duplex links, sending the data on immediately does not reduce the receive rate.


Write: Time Estimation
Ignoring network congestion, the ideal time to transfer B bytes to R replicas is B/T + R*L, where T is the network throughput and L is the latency to transfer bytes between two machines. In 2003, network links were typically 100 Mbps (T) and L was far below 1 ms, so 1 MB could ideally be distributed in about 80 ms.
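A worked version of that estimate using the numbers quoted on the slide; the exact latency value below is an assumption chosen to stay "far below 1 ms".

```java
// Worked example of the B/T + R*L estimate with the 2003 numbers from the slide.
public class WriteTimeEstimate {
    public static void main(String[] args) {
        double bytes      = 1e6;         // B = 1 MB
        double throughput = 100e6 / 8;   // T = 100 Mbps = 12.5 MB/s
        int    replicas   = 3;           // R
        double latency    = 0.5e-3;      // L = 0.5 ms (assumed)

        double seconds = bytes / throughput + replicas * latency;
        System.out.printf("~%.0f ms to push 1 MB to %d replicas%n",
                seconds * 1000, replicas);   // ~82 ms, dominated by B/T
    }
}
```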

References
- "Web Search for a Planet: The Google Cluster Architecture", L. Barroso, J. Dean, U. Hoelzle, IEEE Micro 23(2), 2003.
- "The Google File System", S. Ghemawat, H. Gobioff, S. Leung, SOSP 2003.
- "The Hadoop Distributed File System", K. Shvachko et al., MSST 2010.
- "Hadoop: The Definitive Guide", T. White, O'Reilly, 3rd edition, 2012.

Questions?

Sources & References
- General introduction to big data: http://www.systems.ethz.ch/sites/default/files/file/bigdata_fall2012/bigdata-2012-m5-1%20updated.pdf
- Storage systems and big data: www.cs.colostate.edu/~cs655/lectures/cs655-l4-storagesystems.pdf
- GFS specifics: https://courses.cs.washington.edu/courses/cse490h/08wi/lecture/lec6.ppt
- GFS specifics: http://prof.ict.ac.cn/dcomputing/uploads/2013/dc_4_0_gfs_ict.pdf