CSE 124: Networked Services Lecture-16


Fall 2010, CSE 124: Networked Services, Lecture-16 (11/23/2010)
Instructor: B. S. Manoj, Ph.D
http://cseweb.ucsd.edu/classes/fa10/cse124

Updates
- PlanetLab experiments have begun; the first batch has been given access information. Read through the PlanetLab documentation at www.planetlab.org.
- Project-2 idea final presentation: those who have questions on the Super-proxy should request an office appointment.
- Presentation/demo deadline: the last lecture class (December 2nd, 2010).
- Submission of the report (one page or more), documentation, and final source code: finals week. The report should contain a brief description of the project and instructions for building and using the code.

Google File System

Google File System
- A scalable distributed file system for large, distributed, data-intensive applications
- Widely deployed within Google
- Scalability: 100s of terabytes, 1000s of disks, 1000s of machines
- Main benefits: fault tolerance while running on commodity hardware, and high aggregate performance

Why GFS?
- Component failures are common: application bugs, OS bugs, human errors, and failures of disks, memory, connectors, networking, or power
- File sizes are huge: multi-GB files are common and even TBs are expected, so I/O operations and block sizes have to be reconsidered
- Most files are appended to most often: most operations append new data, overwriting is rarer, and random writes within files are mostly non-existent. Typical workloads: large repositories scanned by data-analysis programs, data streams generated by continuously running programs, and archival data
- Co-designing the file system with the applications is far more optimal: API design must consider the applications, and atomic append lets multiple clients concurrently append data, which is useful for clusters of 1000s of nodes

Design objectives of GFS
- Run on inexpensive commodity hardware
- Store a modest number of large files: a few million files, each 100 MB or even multi-GB
- Small files must be supported
- Support two kinds of reads: large streaming reads (1 MB or more) and small random reads (a few KB), which may be batched and sorted
- Support many large sequential writes, similar in size to the reads; written files are seldom modified
- Small writes must be supported (possibly with less efficiency)
- Support concurrent appends to the same file by multiple clients
- Provide high sustained throughput

GFS API
- Similar to the standard POSIX file API: supports the usual create, delete, open, close, read, and write operations
- Additional interfaces:
  - Snapshot: creates a copy of a file or a directory tree very efficiently
  - Record append: allows multiple clients to append to the same file simultaneously, with atomicity guaranteed
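As a rough illustration only (the real GFS client library is internal to Google), the interface described above might be sketched as follows, with hypothetical names and signatures:

# Hypothetical sketch of the GFS interface described above. These class and method
# names are assumptions for illustration, not the actual client library.

class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "GFSFile": ...
    def snapshot(self, src: str, dst: str) -> None:
        """Efficiently copy a file or directory tree (copy-on-write of its chunks)."""

class GFSFile:
    def read(self, offset: int, length: int) -> bytes: ...
    def write(self, offset: int, data: bytes) -> None: ...
    def close(self) -> None: ...
    def record_append(self, data: bytes) -> int:
        """Atomically append `data` at an offset chosen by GFS and return that offset;
        many clients may call this concurrently on the same file."""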

GFS architecture (architecture diagram shown in lecture; not reproduced in this transcription)

GFS Master
- Single-master design (later: shadow masters and master replicas), chosen to simplify the original design
- Maintains: the namespace for files as well as chunks, access control information, the mapping from files to chunks, and the current locations of chunks
- Does: mapping of files to chunks, chunk lease management, garbage collection, and chunk migration between chunkservers
- Scalability of the single-master design: several petabytes of data and the processing load for metadata pushed its limits; GFS evolved to multiple GFS masters over a collection of chunkservers, with up to 8 masters mapped onto one chunkserver
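A minimal sketch of the metadata the master keeps in memory, per the list above; the field and type names are assumptions for illustration:

# Minimal sketch of the master's in-memory metadata, as listed above.
# Field and type names are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ChunkInfo:
    handle: int                   # globally unique chunk handle
    version: int = 0              # used to detect stale replicas
    locations: List[str] = field(default_factory=list)   # chunkserver addresses; not persisted,
                                                          # rebuilt from chunkserver reports

@dataclass
class FileMeta:
    acl: str = ""                 # access control information
    chunks: List[int] = field(default_factory=list)       # chunk handles, in file order

@dataclass
class MasterState:
    namespace: Dict[str, FileMeta] = field(default_factory=dict)    # full path -> file metadata
    chunk_table: Dict[int, ChunkInfo] = field(default_factory=dict) # handle -> chunk info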

Master Design
- The master sends instructions to chunkservers: to delete a given chunk, or to create a new chunk
- Periodic communication between the master and each chunkserver keeps state information: Is the chunkserver down? Are there disk failures on the chunkserver? Are any replicas corrupted? Which chunk replicas does the chunkserver store?

Master bottleneck
- The master is typically fast, since the metadata is small: less than 64 bytes per file name (with prefix compression applied) and 64 bytes of metadata per 64 MB chunk
- So one master worked well in early designs
- When file sizes became smaller (e.g., Gmail files), a large number of files resulted; the metadata became too large for the master's limited memory, and the master became a bottleneck
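A back-of-the-envelope estimate using the figures above (under 64 bytes per file name, about 64 bytes per 64 MB chunk); the file counts and average sizes below are illustrative assumptions, but they show why many small files overwhelm the master's memory while a modest number of large files do not:

# Rough estimate of master memory from the per-file and per-chunk figures above.
# The workloads plugged in at the bottom are illustrative assumptions.

CHUNK_SIZE = 64 * 2**20      # 64 MB chunks
PER_CHUNK_META = 64          # bytes of metadata per chunk
PER_FILE_META = 64           # bytes per file name (prefix-compressed, upper bound)

def master_memory_bytes(num_files: int, avg_file_size: int) -> int:
    chunks_per_file = max(1, -(-avg_file_size // CHUNK_SIZE))   # ceiling division
    return num_files * (PER_FILE_META + chunks_per_file * PER_CHUNK_META)

# A modest number of large files: 1 million files averaging 1 GB each.
print(master_memory_bytes(1_000_000, 2**30) / 2**30)        # ~1 GiB: fits in memory
# Many small files (e.g., mail messages): 200 million files averaging 1 MB each.
print(master_memory_bytes(200_000_000, 2**20) / 2**30)      # ~24 GiB: master becomes a bottleneck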

Chunks and chunkservers
- A chunk is analogous to a block, but very large: 64 MB, stored as a file on a chunkserver
- A chunk handle (roughly, the chunk's file name) is used to locate it
- Chunks are replicated across multiple chunkservers, with a minimum of three replicas
- Chunkservers store chunks but do not cache them
- Large chunk size, pros: reduces the number of client-master interactions, allows a persistent TCP connection between client and chunkserver, and reduces the size of the metadata on the master
- Large chunk size, cons: can create hotspots, and is inefficient for storing smaller files (e.g., Gmail files)

Client-Chunkserver interactions (read)
- A read request originates at the application; the GFS client receives it and translates <file name, byte offset> into <file name, chunk index> (chunk sizes are fixed at 64 MB)
- The GFS client queries the master with the file name and chunk index
- The master identifies the chunk handle and the locations of the chunkservers holding that chunk
- The GFS client requests the chunk data from one of those chunkservers, usually the nearest one
- The chunkserver sends the requested data to the client, and the GFS client forwards it to the application
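A minimal sketch of this read path, assuming 64 MB chunks; the offset-to-chunk-index translation follows directly from the fixed chunk size, while lookup_chunk and read_from are hypothetical stand-ins for the master and chunkserver RPCs:

# Sketch of the read path described above. lookup_chunk and read_from are
# hypothetical stand-ins for the real master/chunkserver protocol.

CHUNK_SIZE = 64 * 2**20   # chunks are a fixed 64 MB

def to_chunk_coords(byte_offset: int):
    """Translate a byte offset within a file into (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    # For simplicity, assumes the read does not cross a chunk boundary.
    chunk_index, chunk_offset = to_chunk_coords(offset)
    # 1. Ask the master which chunk backs (filename, chunk_index) and where its replicas are.
    chunk_handle, replicas = master.lookup_chunk(filename, chunk_index)
    # 2. Read directly from one replica, usually the nearest chunkserver.
    return replicas[0].read_from(chunk_handle, chunk_offset, length)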

Example of client-master interaction
1. The application (a search indexer) asks the GFS client for file crawl_index_99 at offset 164 KB, size 2048 bytes.
2. The GFS client sends the master the file name crawl_index_99 and chunk index 3.
3. From its table for crawl_index_99 (Chunk_001 on R1, R5, R8; Chunk_002 on R8, R4, R6; Chunk_003 on R4, R3, R2), the master replies with Chunk_003 and its chunkservers R4, R3, R2.

Example of client-chunkserver interaction
4. The GFS client requests 2048 bytes of Chunk_003 from one of the chunkservers holding it.
5. The chunkserver returns the 2048 bytes to the GFS client.
6. The GFS client forwards the 2048 bytes to the application (the search indexer).

Example of a write operation
1. The application (a search indexer) hands the GFS client the file name crawl_index_99 and the DATA to write.
2. The GFS client sends the master the file name crawl_index_99 and the chunk index.
3. The master replies with the chunk handle and the primary and secondary replica information.

Example write (continued)
4. The GFS client pushes the DATA to all replicas; the primary chunkserver and the secondary chunkservers each buffer it.

Example write (continued)
5. The GFS client sends a Write request to the primary chunkserver.
6. The primary orders the buffered data (D1, D2, D3) and applies it to its chunk.
7. The primary forwards the Write request to the secondary chunkservers, which apply the buffered data to their chunks in the same order.

Example write (continued)
8. The secondary chunkservers send responses to the primary.
9. The primary sends a response back to the GFS client.

Write control and data flow in GFS (from the original GFS publication)
- The data flow may not be one-to-many: data is pushed from server to server for better bandwidth utilization
- The forwarding path depends on the location of the primary and the secondaries
- Data flow and control flow are separated
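A sketch of how the data and control flows separate under the chain-forwarding scheme the slide refers to; buffer, forward, and apply_buffered are hypothetical helpers, not the actual GFS RPCs:

# Sketch of the write flow above: data is pushed along a chain of chunkservers rather
# than one-to-many, then a small control message to the primary commits the mutation.
# buffer/forward/apply_buffered are hypothetical stand-ins for the real RPCs.

def push_data(chain: list, data: bytes) -> None:
    """Data flow: push linearly along the replica chain so each machine's full
    outbound bandwidth is used to relay to the next-nearest server."""
    chain[0].buffer(data)                      # client sends to the nearest chunkserver first
    for sender, receiver in zip(chain, chain[1:]):
        sender.forward(data, receiver)         # each server relays to the next one in the chain
        receiver.buffer(data)

def commit_write(primary, secondaries: list, chunk_handle: int) -> bool:
    """Control flow: a small write request goes to the primary, which picks the
    mutation order and forwards the request to the secondaries."""
    serial = primary.apply_buffered(chunk_handle)
    acks = [s.apply_buffered(chunk_handle, serial) for s in secondaries]
    return all(acks)                           # primary then replies to the client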

Master operations novelties: locking and replica placement
- Locking: read locks and write locks are separate, for efficiency when multiple activities proceed concurrently
- Replica placement must account for a large number of big chunks and the bandwidth limitations of racks: the combined bandwidth of all servers in a rack far exceeds the rack's uplink bandwidth
- Policies: maximize reliability and availability, and maximize network bandwidth utilization
- Solution: place chunk replicas across racks, preferring nearby racks so that network bandwidth can be better utilized
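As a rough illustration of the placement policy above, a rack-aware chooser might look like the following; the input format and the utilization-based ordering are assumptions, since the paper gives no code:

# Sketch of a rack-aware replica placement policy in the spirit of the slide above.
# The input format and scoring are assumptions for illustration.

def place_replicas(chunkservers, num_replicas=3):
    """chunkservers: list of (server_id, rack_id, disk_utilization).
    Pick servers on distinct racks, preferring lightly loaded ones."""
    by_util = sorted(chunkservers, key=lambda s: s[2])   # lightly loaded first
    chosen, used_racks = [], set()
    for server_id, rack_id, _ in by_util:
        if rack_id not in used_racks:
            chosen.append(server_id)
            used_racks.add(rack_id)
        if len(chosen) == num_replicas:
            break
    return chosen

print(place_replicas([("s1", "r1", 0.9), ("s2", "r1", 0.2),
                      ("s3", "r2", 0.4), ("s4", "r3", 0.5)]))   # ['s2', 's3', 's4']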

Master operations novelties: creation and re-replication
- Creation: where to place the initially empty (new) chunks
  - Place new chunks on chunkservers with low average utilization
  - Avoid chunkservers with a high number of recent creations, since heavy write traffic tends to follow creation
  - Spread new chunk replicas across different racks
- Re-replication refers to creating additional replicas when the existing number of replicas falls below the user's requirement, e.g. because a chunkserver is unavailable, a replica is corrupted or in error, or the replication goal has been increased
  - Strategy: the master picks the highest-priority chunk and replicates it on additional chunkservers; a chunk's priority is boosted based on its impact, e.g. one that has lost more replicas
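A small sketch of the re-replication priority described above, where the chunk furthest below its replication goal is repaired first; the data layout is assumed for illustration:

# Sketch of the re-replication priority above: the chunk missing the most replicas
# relative to its goal is cloned first. The tuple layout is assumed for illustration.

def next_chunk_to_rereplicate(chunks):
    """chunks: list of (chunk_handle, live_replicas, replication_goal).
    Returns the handle furthest below its goal, i.e. the highest-priority repair."""
    under_replicated = [c for c in chunks if c[1] < c[2]]
    if not under_replicated:
        return None
    return max(under_replicated, key=lambda c: c[2] - c[1])[0]

print(next_chunk_to_rereplicate([("c1", 2, 3), ("c2", 1, 3), ("c3", 3, 3)]))   # 'c2'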

Master operations novelties: rebalancing and garbage collection
- Rebalancing: the master examines the current replica distribution and moves replicas for better disk space and load balance, balancing the network as well as file system space; it decides which replicas to move and gradually fills up new chunkservers
- Garbage collection: deletion of a file is logged instantly, but the actual chunk deletion is not done immediately
  - The deleted file is renamed to a hidden name with a timestamp; during regular scans of the file system namespace, old hidden files are removed, and until then the file can be undeleted
  - Orphan chunks (not reachable from any file) are deleted as well
  - The master-chunkserver HeartBeat message is used for garbage collection: the chunkserver reports the chunks it has, the master responds with the chunks for which it holds no metadata, and the chunkserver deletes those unwanted chunks
  - Stale replicas are detected and deleted using chunk version numbers
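A minimal sketch of the HeartBeat-driven garbage collection exchange above, with assumed data structures:

# Sketch of the HeartBeat-based garbage collection described above.
# The set-based data structures are assumptions for illustration.

def master_orphan_reply(master_chunk_table: set, reported_chunks: set) -> set:
    """Master side: any chunk the chunkserver reports but the master has no
    metadata for is an orphan and may be deleted."""
    return reported_chunks - master_chunk_table

def chunkserver_heartbeat(local_chunks: set, master_chunk_table: set) -> set:
    orphans = master_orphan_reply(master_chunk_table, local_chunks)
    return local_chunks - orphans        # chunks the server keeps; orphans are freed

print(chunkserver_heartbeat({1, 2, 3}, {1, 3, 7}))   # {1, 3}; chunk 2 is garbage-collected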

High availability
- Fast recovery
- Chunk replication
- Master replication: the metadata, logs, and checkpoints are replicated; shadow masters replicate the master's read-only operations
- Data integrity: checksum-based; each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum
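A sketch of the per-block checksumming described above; the paper does not name the checksum function in this summary, so CRC32 is an assumed choice:

# Sketch of per-block data integrity as described above: each chunk is split into
# 64 KB blocks, each with a 32-bit checksum. CRC32 is an assumed checksum choice.
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks within a chunk

def block_checksums(chunk_data: bytes):
    """Compute a 32-bit checksum for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_block(chunk_data: bytes, block_index: int, stored) -> bool:
    """Check one block against its stored checksum before serving it to a reader."""
    block = chunk_data[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
    return zlib.crc32(block) == stored[block_index]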

Performance setup
- 1 master, two master replicas, 16 chunkservers, and 16 clients
- Each machine: 1.4 GHz Pentium III, 2 GB memory, two 80 GB 5400 RPM disks, 100 Mbps full-duplex link
- A 1 Gbps link connects the servers (19 machines) to the clients (16 machines)

Performance (Read)
- Read rate for N clients, up to 16 readers; each client randomly reads a 4 MB region from a 320 GB file set, repeated until the entire set is read
- Aggregate rates and theoretical limits: at 125 MB/s the 1 Gbps network link saturates, and at 12.5 MB/s per client the 100 Mbps client NIC saturates
- About 80% of the per-client limit is achieved for one reader, and about 75% of the aggregate limit (94 MB/s) for 16 readers
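The quoted limits follow directly from the link speeds; a quick check (treating 1 Gbps as 1000 Mb/s and 1 MB as 10^6 bytes, matching the paper's rounding):

# Quick check of the read limits quoted above.
link_limit = 1000e6 / 8 / 1e6    # 1 Gbps inter-switch link -> 125 MB/s aggregate limit
nic_limit = 100e6 / 8 / 1e6      # 100 Mbps client NIC      -> 12.5 MB/s per-client limit
print(link_limit, nic_limit)      # 125.0 12.5
print(0.80 * nic_limit)           # ~10 MB/s observed per client (80% of the per-client limit)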

Performance (Write)
- N clients write to N distinct files; each client writes 1 GB of data to a file in 1 MB writes
- Aggregate rate and theoretical limit: the limit is 67 MB/s, because each byte is written to multiple replicas, each chunkserver having a 12.5 MB/s input link
- Observed aggregate: 35 MB/s for 16 clients, i.e. about 2.2 MB/s per client
- The main culprit is the network software stack: each piece of data is written to multiple replicas, and the protocol stack does not pipeline these transfers well; real-world write performance is better
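The 67 MB/s limit follows from writing each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input link:

# Where the write limits quoted above come from.
chunkservers, replicas, per_server_in = 16, 3, 12.5
print(chunkservers * per_server_in / replicas)   # ~66.7 MB/s aggregate write limit
print(35 / 16)                                   # ~2.2 MB/s per client actually observed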

GFS performance: record append
- N clients append simultaneously to a single file
- Performance is limited by the network bandwidth of the chunkserver holding the last chunk of the file
- The rate drops from 6 MB/s per client for one client to 4.8 MB/s per client for 16 clients

Real-world measurements
- Cluster A: used for research and development by over a hundred engineers; a typical task is initiated by a user and runs for a few hours, reading MBs to TBs of data, transforming/analyzing the data, and writing the results back
- Cluster B: used for production data processing; a typical task runs much longer than a Cluster A task, continuously generating and processing multi-TB data sets, with human users rarely involved
- Both clusters had been running for about a week when the measurements were taken

Performance with clusters

Performance of GFS

Master Requests

Reading: Google File System documents