This material is covered in the textbook in Chapter 21.

The Google File System paper, by S Ghemawat, H Gobioff, and S-T Leung, was published in the proceedings of the ACM Symposium on Operating Systems Principles in October 2003 and may be found at http://labs.google.com/papers/gfs-sosp2003.pdf.

This slide and the discussion below are from the aforementioned paper, The Google File System.

1) The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2) The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3) The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary.

4) Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.

5) The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6) The secondaries all reply to the primary indicating that they have completed the operation.

7) The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
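
To make the separation of control flow from data flow concrete, here is a minimal client-side sketch of steps (1) through (7) in Python. The RPC stubs used here (find_lease_holder, push_data, apply_write, forward_to_secondaries) are illustrative names assumed for this sketch, not the actual GFS interfaces.

    # Sketch of the client side of the GFS write path described above,
    # assuming hypothetical RPC stubs for the master and chunkservers
    # that raise MutationFailed on error.

    class MutationFailed(Exception):
        pass

    def gfs_write(master, chunk_handle, data, attempts=3):
        # Steps 1-2: learn which replica holds the lease (the primary) and
        # where the secondaries are; clients cache this reply.
        primary, secondaries = master.find_lease_holder(chunk_handle)

        for _ in range(attempts):
            try:
                # Step 3: push the data to every replica, in any order; each
                # chunkserver buffers it until it is used or aged out.
                for replica in [primary] + secondaries:
                    replica.push_data(chunk_handle, data)

                # Step 4: ask the primary to apply the write; it assigns a
                # serial number and mutates its local state in that order.
                serial = primary.apply_write(chunk_handle)

                # Steps 5-6: the primary forwards the request; secondaries
                # apply it in the same serial order and acknowledge.
                acks = primary.forward_to_secondaries(chunk_handle, serial)

                # Step 7: success only if every replica succeeded; otherwise
                # the region is inconsistent and the client retries.
                if all(acks):
                    return serial
            except MutationFailed:
                pass  # retry steps 3-7 a few times before starting over
        raise MutationFailed("write failed after retries")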

The discussion below is copied from the aforementioned paper, The Google File System. A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability. The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine's network bandwidth. We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.
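
The rack-spreading rule can be illustrated with a toy placement helper. The (server, rack) inventory format and the round-robin choice below are assumptions of this sketch, not the master's real placement algorithm.

    import itertools
    from collections import defaultdict

    def place_replicas(chunkservers, num_replicas=3):
        """Toy rack-aware placement: spread replicas across as many racks as
        possible, so the chunk survives the loss of a whole rack and reads
        can use the aggregate bandwidth of several racks.

        `chunkservers` is a list of (server_id, rack_id) pairs; this shape
        is an assumption for the sketch, not a real GFS data structure.
        """
        by_rack = defaultdict(list)
        for server, rack in chunkservers:
            by_rack[rack].append(server)

        # Round-robin over racks: take one server from each rack before
        # taking a second server from any rack.
        placement = []
        for servers in itertools.zip_longest(*by_rack.values()):
            for server in servers:
                if server is not None and len(placement) < num_replicas:
                    placement.append(server)
        return placement

    # Example: three replicas land on three different racks.
    servers = [("cs1", "rackA"), ("cs2", "rackA"), ("cs3", "rackB"), ("cs4", "rackC")]
    print(place_replicas(servers))   # ['cs1', 'cs3', 'cs4']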

Chubby is discussed in http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf. These bullets come from the textbook, page 940.

This is approximately what Chubby does. For exact details, see the textbook.

A paper discussing MapReduce can be found at http://labs.google.com/papers/mapreduce.html.

This example is from the aforementioned paper. It counts the number of occurrences of each word in a collection of documents.
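
For reference, here is a minimal word-count map/reduce pair in Python. The function shapes loosely follow the paper's pseudocode, and the tiny sequential driver stands in for the MapReduce library, which in reality partitions and schedules these functions across many machines.

    from collections import defaultdict

    def map_fn(doc_name, contents):
        """Map: emit (word, 1) for every word in the document."""
        for word in contents.split():
            yield word, 1

    def reduce_fn(word, counts):
        """Reduce: sum the partial counts for one word."""
        return word, sum(counts)

    def run_wordcount(documents):
        """Tiny sequential driver standing in for the MapReduce library."""
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for word, count in map_fn(name, contents):
                intermediate[word].append(count)     # shuffle/group by key
        return dict(reduce_fn(w, c) for w, c in intermediate.items())

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
    print(run_wordcount(docs))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}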

This slide is taken from the aforementioned paper on MapReduce.

This figure is from the aforementioned paper on MapReduce; it used 15,000 map tasks and 1 reduce task, running on 1800 computers.

This figure is also from the aforementioned paper. It shows the results of sorting 10^10 100-byte records, with 15,000 map tasks and 4,000 reduce tasks running on 1800 computers. The top row shows the rate at which input is read; the second row shows the rate at which data is transferred from map tasks to reduce tasks; and the last row shows the rate at which final output is produced. The Normal execution column includes the use of extra backup tasks, which are used to deal with stragglers: a few map or reduce tasks take far longer than others, perhaps because of hardware problems. When a MapReduce application is close to completion, the master schedules a few backup tasks to execute the remaining in-progress tasks redundantly. The output of whichever finishes first (the backup tasks or the original tasks) is used. The middle column shows how useful this is: without the backup tasks it took far longer to run. The last column shows how quickly the system dealt with the loss of 200 tasks.
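
A rough sketch of the backup-task idea, assuming a hypothetical master interface (a set of in-progress task ids plus launch and cancel callbacks); the real scheduler is considerably more involved.

    def schedule_backups(in_progress, total_tasks, launch, threshold=0.05):
        """When a job is close to completion, launch a redundant copy of each
        remaining in-progress task; whichever copy finishes first wins.

        `in_progress` is the set of unfinished task ids and `launch` starts a
        task on some idle worker; both are assumptions of this sketch, not
        the actual MapReduce master interface.
        """
        if len(in_progress) <= threshold * total_tasks:
            for task_id in in_progress:
                launch(task_id, backup=True)

    def on_task_finished(task_id, in_progress, cancel):
        """First completion (original or backup) marks the task done; the
        redundant copy of the same task is cancelled or ignored."""
        if task_id in in_progress:
            in_progress.discard(task_id)
            cancel(task_id)   # stop the duplicate, if one is still running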

A more recent version of these arguments, taking into account the response of the MapReduce people (next slide), can be found at http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf.

These points are from http://www.databasecolumn.com/2008/01/mapreducecontinued.html.