The Google File System


The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Motivation

What is GFS? The Google File System is a scalable distributed file system for large, distributed, data-intensive applications. It runs on inexpensive commodity hardware and provides fault tolerance and high performance to a large number of clients. GFS shares many of the same goals as previous distributed file systems: performance, scalability, reliability, and availability.

GFS Assumptions
- Hardware: the system is built from many inexpensive commodity components that often fail.
- Files: the system stores a modest number of large files.
- Workload characteristics: large streaming reads, small random reads, and many large, sequential writes that append data to files.
- Clients: the system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.
- Target: high sustained bandwidth is more important than low latency.

Interface of GFS GFS provides a familiar file system interface: it supports the usual operations to create, delete, open, close, read, and write files. GFS also supports snapshot and record append operations, which are used for producer-consumer queues and many-way merging.

Architecture of GFS GFS components: a single master, multiple clients, and multiple GFS chunkservers.

Chunk Size The chunk size is set to 64 MB.
Pros: fewer interactions between client and master; a client can keep a persistent TCP connection to a chunkserver, reducing network overhead; less metadata on the master.
Cons: small files may occupy only a single chunk, and if many clients access the same small file, its chunkservers can become hot spots.
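To make the "fewer client-master interactions" point concrete, here is a minimal sketch (the function name is invented for this example) of how a fixed 64 MB chunk size maps a byte range to chunk indexes, so a large sequential read needs only a handful of master lookups:

```python
# Illustrative sketch only: translating a byte range into chunk indexes,
# so one master lookup covers 64 MB of data.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def byte_range_to_chunk_indexes(offset: int, length: int) -> range:
    """Return the chunk indexes that a read of [offset, offset+length) touches."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)

# A 1 GB sequential read touches only 16 chunks, i.e. at most 16 master lookups
# (fewer in practice, since clients cache chunk locations).
print(list(byte_range_to_chunk_indexes(0, 1 << 30)))  # [0, 1, ..., 15]
```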

Metadata Three types of metadata: (1) file and chunk namespaces; (2) the mapping from files to chunks; (3) the locations of each chunk's replicas. All metadata is kept in the master's memory, so it is fast and easily accessible (performance). (1) and (2) are kept persistent by logging (reliability); (3) is not persisted but is refreshed periodically from the chunkservers.
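The three metadata types can be pictured as a few in-memory maps. The sketch below is illustrative only (the names and types are assumptions, not the real implementation); it simply mirrors which parts are logged and which are rebuilt from chunkserver reports:

```python
# Rough sketch of the three metadata types the master keeps in memory.
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # (1) File and chunk namespaces (persisted via the operation log)
    namespace: set[str] = field(default_factory=set)
    # (2) Mapping from file name to the ordered list of chunk handles (also logged)
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)
    # (3) Chunk handle -> current replica locations; NOT logged, rebuilt from
    #     chunkserver reports at startup and refreshed by HeartBeat messages.
    chunk_locations: dict[int, list[str]] = field(default_factory=dict)
```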

Master Node Responsibilities: metadata storage, namespace management, and periodic communication with chunkservers; chunk operations such as creation, re-replication, deletion, garbage collection, and load balancing.

System Interaction Two key mechanisms: (1) mutations and (2) leases. Both are designed to minimize management overhead at the master.

Mutation A mutation is a write or append that changes the contents or metadata of a chunk; it must be applied to all replicas (consistency). Lease: the master picks one replica as the primary and grants it a lease; the primary orders all mutations for all replicas. Purpose: decouple data flow from control flow and minimize master involvement.

Outline GFS Background, Concepts and Key words (Question) Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Question [1] "Its design has been driven by key observations of our application workloads and technological environment." What workload and technology characteristics did GFS assume in its design, and what are the corresponding design choices? > See the GFS design assumptions and target workload.

GFS Assumptions (recap): commodity hardware that often fails; a modest number of large files; workloads of large streaming reads, small random reads, and many large sequential appends; multiple clients concurrently appending to the same file; high sustained bandwidth valued over low latency.

Question [2] "... while caching data blocks in the client loses its appeal." GFS does not cache file data. Why does this design choice not lead to performance loss, and what benefit does it have? Client caches offer little benefit because (1) applications stream through huge files and (2) working sets are too large to cache. The benefits are that (a) it simplifies the design of the GFS client and (b) it eliminates cache coherence issues, which are challenging in a distributed system. Clients do still cache metadata for future accesses.

Question [3] "Small files must be supported, but we need not optimize for them." Why? (a) GFS is designed to store millions of large files, each typically 100 MB or larger, even though large and small files exist in almost every system. (b) The chunkservers storing the chunks of a small file may become hot spots if many clients access the same file; in practice, hot spots have not been a major issue because Google's applications mostly read large multi-chunk files sequentially. (c) Poor support for small files is one of the acknowledged disadvantages of GFS.

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Read in GFS
1. The application originates the read request (file name, byte range).
2. The GFS client translates the request into (file name, chunk index) and sends it to the master.
3. The master responds with the chunk handle and replica locations.
4. The client picks a replica location and sends it the request (chunk handle, byte range).
5. The chunkserver sends the requested data to the client.
6. The client forwards the data to the application.
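A hedged sketch of this read path, assuming hypothetical RPC stubs (`master.lookup`, `chunkserver.read`) and a read that fits within one chunk:

```python
# Minimal sketch of the client-side read path; names are illustrative.
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    chunk_index = offset // CHUNK_SIZE                       # step 2: translate byte range
    handle, replicas = master.lookup(filename, chunk_index)  # steps 2-3: one master lookup
    chunkserver = replicas[0]                                # step 4: pick a replica (e.g. the closest)
    start = offset % CHUNK_SIZE                              # assumes the read fits in this chunk
    data = chunkserver.read(handle, start, length)           # step 5: data flows directly
    return data                                              # step 6: hand back to the application
```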

Write on GFS
1. The application originates the write request (file name, byte range, data).
2. The GFS client translates the request and sends it to the master.
3. The master responds with the chunk handle and replica locations (primary and secondaries).
4. The client pushes the write data to all replica locations; the data is stored in the chunkservers' internal buffers.
5. The client sends the write command to the primary.
6. The primary determines a serial order for the data instances in its buffer, writes them to the chunk in that order, then sends the serial order to the secondaries and tells them to perform the write.
7. The secondaries respond back to the primary.
8. The primary responds back to the client.
9. The client responds to the application.
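The control flow at the primary (steps 6-8) can be sketched roughly as follows; the class and method names are invented, and real GFS uses RPCs rather than in-process calls:

```python
# Illustrative sketch of the primary assigning a serial order to buffered
# mutations and forwarding that order to the secondaries.
class PrimaryReplica:
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0

    def write_local(self, handle, offset, data, serial):
        pass  # apply the mutation to the local chunk file in serial order

    def apply_write(self, chunk_handle, buffered_data, offset):
        serial = self.next_serial                     # step 6: pick the serial number
        self.next_serial += 1
        self.write_local(chunk_handle, offset, buffered_data, serial)
        acks = [s.write(chunk_handle, offset, buffered_data, serial)  # forward order to secondaries
                for s in self.secondaries]
        return all(acks)                              # step 8: success only if every replica applied it
```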

Append on GFS In a traditional write, the client specifies the offset at which data is to be written. Record append is the same as a write except that the client specifies no offset: GFS picks the offset and returns it to the client, which makes it work correctly for concurrent writers. That is the key difference.
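A simplified sketch of record-append semantics, assuming a single chunk and omitting the padding/retry logic; the point is that GFS, not the client, chooses and returns the offset:

```python
# Sketch of record append (simplified; chunk-boundary padding and retries omitted).
class Chunk:
    def __init__(self):
        self.data = bytearray()

    def record_append(self, record: bytes) -> int:
        offset = len(self.data)   # GFS, not the client, chooses the offset
        self.data += record
        return offset             # the same offset is used on every replica
```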

Outline GFS Background, Concepts and Key words Example of GFS Operations (Question) Some optimizations in GFS Evaluation Conclusion

Question [4] "Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers." How does this design help improve the system's performance? The single master is a potential bottleneck, so GFS minimizes the master's involvement in reads and writes.

Question [5] A GFS cluster consists of a single master. What is the benefit of having only a single master? What is its potential performance risk, and how does GFS minimize that risk? 1. A single master simplifies the design and lets placement and replication decisions use global knowledge. 2. The risk is that the master becomes a bottleneck. 3. GFS minimizes the master's involvement in reads and writes (clients cache metadata, chunks are large, and data flows directly between clients and chunkservers).

Question [6] "Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed." How does GFS collaborate with the chunkserver's local file system to store file chunks? What is lazy space allocation and what is its benefit? GFS is composed of many servers; each is typically a commodity Linux machine running a user-level server process. A GFS chunk is ultimately stored on a local server as a regular Linux file.

Question [6] (continued) How does GFS collaborate with the chunkserver's local file system? > With the help of the local file system: each chunk is just a Linux file that the chunkserver reads and writes on GFS's behalf.

Question [6] (continued) What is lazy space allocation and what is its benefit? Lazy allocation simply means not allocating a resource until it is actually needed: the chunk file on disk grows only as data is written. Benefit: lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
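A small illustration of lazy space allocation, assuming the chunk is just a plain file on the chunkserver's local file system (the path and helper name are made up): the file starts empty and grows only as far as the bytes actually written, so a mostly-empty 64 MB chunk costs almost no disk space.

```python
import os

def write_to_chunk(path: str, offset: int, data: bytes) -> None:
    # O_CREAT creates an empty file the first time; no 64 MB is preallocated.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)  # the file grows only as far as the written bytes
    finally:
        os.close(fd)
```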

Question [7] "On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages." Give an example disadvantage. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients access the same file. In practice, hot spots did develop when GFS was first used by a batch-queue system: the few chunkservers storing an executable file were overloaded by hundreds of simultaneous requests. This was fixed by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times.

Question [7] (continued) [Example] A hot spot: many clients reading the same small, single-chunk file overload its few chunkservers.

Question [8] "One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has." Why is GFS's master able to keep the metadata in memory? Because each 64 MB chunk requires less than 64 bytes of metadata, the total metadata stays small enough to fit in memory.
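A back-of-the-envelope calculation (assumed numbers, consistent with the less-than-64-bytes-per-chunk bound above) shows why in-memory metadata scales comfortably:

```python
# Rough illustration: even a petabyte of file data needs on the order of
# a gigabyte of master memory for chunk metadata.
data_bytes = 10**15                   # 1 PB of file data
chunk_size = 64 * 1024 * 1024         # 64 MB chunks
metadata_per_chunk = 64               # < 64 bytes per chunk
chunks = data_bytes // chunk_size     # ~15 million chunks
print(chunks * metadata_per_chunk)    # ~1 GB of metadata
```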

Question [9] "We use leases to maintain a consistent mutation order across replicas." Show a scenario where an unexpected result may appear if the lease mechanism is not implemented, and explain how leases address the problem. Without a lease there is no single authority on ordering: one replica may apply concurrent mutations in the order A, B, C while others apply A, C, B or B, A, C, leaving the replicas inconsistent.

Question [9] (continued) With a lease, the primary picks the order A, B, C and the other replicas follow it, so every replica applies A, B, C.

Question [9] (continued) The lease establishes a single mutation order: secondary replicas follow the order chosen by the primary replica.

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Some Optimizations on GFS Snapshot Fault tolerance Relaxed Consistency Model

Snapshot A snapshot is a low-cost copy of a file or directory tree at a moment in time, implemented using standard copy-on-write techniques. Why snapshot? To quickly create branch copies of huge data sets (performance), to give end users quick access to a point-in-time view of the data (performance), and to let changes be committed or rolled back easily (reliability).
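A rough sketch of copy-on-write snapshots at the metadata level, with invented structures (`file_to_chunks`, `refcount`) and the lease-revocation step omitted: the snapshot itself only copies metadata and bumps reference counts, and a chunk is duplicated only when it is next written.

```python
class Master:
    def __init__(self):
        self.file_to_chunks = {}  # filename -> list of chunk handles
        self.refcount = {}        # chunk handle -> number of files referencing it

    def snapshot(self, src: str, dst: str):
        chunks = self.file_to_chunks[src]
        self.file_to_chunks[dst] = list(chunks)          # cheap: copy only metadata
        for h in chunks:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def before_write(self, filename: str, index: int, copy_chunk):
        h = self.file_to_chunks[filename][index]
        if self.refcount.get(h, 1) > 1:                  # still shared by a snapshot?
            new_h = copy_chunk(h)                        # copy-on-write happens now
            self.refcount[h] -= 1
            self.refcount[new_h] = 1
            self.file_to_chunks[filename][index] = new_h
```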

Fault Tolerance High availability: fast recovery (the master and chunkservers are designed so that, if they fail, they restart in a few seconds); chunk replication (each chunk is replicated on multiple chunkservers on different racks; users can specify different replication levels for different parts of the file namespace; the default is 3 replicas); shadow masters (read-only master replicas). Data integrity: a checksum for every 64 KB block in each chunk.
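A minimal sketch of the per-64 KB-block integrity check, using CRC32 as a stand-in checksum (the actual checksum function and on-disk layout are not specified here):

```python
import zlib

BLOCK = 64 * 1024  # 64 KB

def checksum_blocks(chunk_data: bytes) -> list[int]:
    # one checksum per 64 KB block of the chunk
    return [zlib.crc32(chunk_data[i:i + BLOCK]) for i in range(0, len(chunk_data), BLOCK)]

def verify_read(chunk_data: bytes, checksums: list[int], block_index: int) -> bool:
    block = chunk_data[block_index * BLOCK:(block_index + 1) * BLOCK]
    # mismatch => report corruption and read from another replica
    return zlib.crc32(block) == checksums[block_index]
```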

Relaxed Consistency Model Applications rely on appends rather than overwrites, on checkpointing, and on writing self-validating, self-identifying records, which is far more efficient and resilient for them. Many writers can concurrently append to a file for merged results or as a producer-consumer queue. The model is simple and efficient, and Google's applications live with it.

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS (Question) Evaluation Conclusion

Question [10] "When the master creates a chunk, it chooses where to place the initially empty replicas." What are the criteria for choosing where to place them? 1. Place new replicas on chunkservers with below-average disk-space utilization (balance). 2. Limit the number of recent creations on each chunkserver (a new chunk implies imminent heavy write traffic). 3. Spread replicas of a chunk across racks (reliability).
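One way to picture the three criteria is a toy placement function; the scoring, thresholds, and attribute names below are invented for illustration and are not taken from the paper:

```python
def choose_chunkservers(servers, racks_of, num_replicas=3, max_recent_creations=5):
    # criterion 1: prefer below-average disk-space utilization
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s.disk_utilization <= avg_util
                  # criterion 2: avoid servers with many recent creations
                  and s.recent_creations < max_recent_creations]
    # criterion 3: spread replicas across racks
    chosen, used_racks = [], set()
    for s in sorted(candidates, key=lambda c: c.disk_utilization):
        if racks_of[s.name] not in used_racks:
            chosen.append(s)
            used_racks.add(racks_of[s.name])
        if len(chosen) == num_replicas:
            break
    return chosen
```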

Question [11] "The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal." When a new chunkserver is added to the system, why does the master fill it up mostly by rebalancing existing chunks rather than by placing new chunks on it? Placing many new chunks on it would immediately direct heavy write traffic to the new server (every new chunk is about to receive heavy writes), and concentrating so many fresh replicas on one server puts too many eggs in one basket. Gradual rebalancing spreads both the load and the risk.

Question [12] "After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels." How are files and chunks deleted? What are the advantages of delayed space reclamation (garbage collection) over eager deletion? Files: when a file is deleted by the application, the master logs the deletion immediately, but the file is only renamed to a hidden name that includes the deletion timestamp. During the master's regular scan of the file system namespace, it removes any such hidden file that has existed for more than three days, and only then removes its namespace entry and metadata. Chunks: through HeartBeat message exchanges the master identifies chunks that are no longer reachable from any file and erases their metadata, after which chunkservers delete those replicas.
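A sketch of this lazy-deletion flow, assuming a hypothetical hidden-name convention; the three-day retention window comes from the slide, everything else is illustrative:

```python
import time

HIDDEN_PREFIX = ".deleted."
RETENTION = 3 * 24 * 3600  # three days

def delete_file(namespace: dict, name: str):
    # deletion is just a rename to a hidden name carrying the deletion timestamp
    namespace[f"{HIDDEN_PREFIX}{int(time.time())}.{name}"] = namespace.pop(name)

def namespace_scan(namespace: dict, now: float):
    # the master's regular scan reclaims hidden files older than the retention window
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            deleted_at = int(name.split(".")[2])
            if now - deleted_at > RETENTION:
                del namespace[name]  # metadata gone; orphaned chunks are GC'd later
```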

Question [12] (continued) Advantages: 1. It is simple and reliable in a large distributed system. 2. It merges storage reclamation into the master's regular background activities, reducing the overhead and burden on the master. 3. It guards against accidental, irreversible deletion.

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Evaluation Environment Cluster: 1 master, 16 chunkservers (1.4 GHz PIII CPU, 2 GB RAM, two 80 GB disks, 100 Mbps Ethernet), and 16 clients. Server machines are connected to a central switch by 100 Mbps Ethernet; the switches (HP 2524) are connected by a 1 Gbps link.
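A quick derivation of the "limits" quoted on the following slides, assuming the per-client limit is set by the 100 Mbps NIC and the aggregate limit by the 1 Gbps inter-switch link:

```python
# Back-of-the-envelope network limits for the cluster above.
CLIENT_LINK = 100e6 / 8   # 100 Mbps NIC   -> 12.5 MB/s per-client limit
SWITCH_LINK = 1e9 / 8     # 1 Gbps uplink  -> 125 MB/s aggregate limit
print(10 / (CLIENT_LINK / 1e6))       # one reader at 10 MB/s is ~80% of 12.5 MB/s
print(16 * 6 / (SWITCH_LINK / 1e6))   # 16 readers at ~6 MB/s each is ~75% of 125 MB/s
```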

Aggregate Throughputs (reads) N clients simultaneously read 4 MB regions from a 320 GB file set. The per-client read rate drops slightly as the number of clients grows because of the increasing probability that several clients read from the same chunkserver. 1 client: 10 MB/s, about 80% of the per-client limit. 16 clients: about 6 MB/s per client, roughly 75% of the aggregate network limit.

Aggregate Throughputs (writes) N clients write simultaneously to N distinct files. The write rate is low mainly because of the delay in propagating data among the replicas. Slow individual writes are not a major problem, since the aggregate write bandwidth delivered to a large number of clients remains adequate. 1 client: 6.3 MB/s, about 50% of the limit. 16 clients: about 2.2 MB/s per client.

Aggregate Throughputs (appends) N clients append simultaneously to a single file. The append rate drops slightly as the number of clients grows because of network congestion at the chunkservers storing the last chunk of the file. Chunkserver congestion is not a major issue in practice, because a large number of clients typically append to many large shared files concurrently. 1 client: about 6 MB/s. 16 clients: about 4.8 MB/s.

Real World Clusters A: research and development B: production data processing

GFS Deployment in Google Many GFS clusters, each with hundreds or thousands of storage nodes, managing petabytes of data. GFS sits underneath BigTable and other systems.

Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in GFS Evaluation Conclusion

Conclusion The Google File System is a scalable distributed file system for large, distributed, data-intensive applications; it runs on inexpensive commodity hardware and provides fault tolerance and high performance to a large number of clients. GFS shares many of the same goals as previous distributed file systems but has its own innovations and limitations (the single-master bottleneck, a design biased toward large files, hot spots for small files, etc.). GFS meets Google's storage needs and serves Google's applications and services.

One Comparison The Taobao File System from Alibaba stores hundreds of millions of products; product images, descriptions, comments, transactions, etc. are all small files.

Taobao File System Optimized for small files and open sourced. One chunk contains many small files, which are located through a hierarchy of indexes (from a 1st-level index down to an Nth-level index).

References
- cs.brown.edu/~debrabant/cis570-website/slides/gfs.ppt
- https://www.cs.umd.edu/class/spring2011/cmsc818k/lectures/gfs-hdfs.pdf
- http://www.slideshare.net/omerfarukinceedutr/google-file-system-gfs-presentation
- http://www.slideshare.net/romain_jacotin/thegoogle-file-system-gfs

Q & A