CS 555: DISTRIBUTED SYSTEMS [DYNAMO & GOOGLE FILE SYSTEM]
Shrideep Pallickara, Computer Science, Colorado State University

Frequently asked questions from the previous class survey
- What's the typical size of an inconsistency window in most production settings? In Dynamo?
- When adding a new node to Dynamo, are requests redirected to replicas until the new node is fully ready?

Topics covered in this lecture
- Dynamo: versioning
- Google File System

DYNAMO VERSIONING

Data versioning
- A put() may return before the update has been applied at all replicas
  - If there are no failures, there is an upper bound on update propagation times
  - If there are failures, things take much longer
- There are applications at Amazon that tolerate this
- Shopping carts: an "Add to Cart" can never be forgotten or rejected
  - If the most recent state of the cart is unavailable, changes are made to the older version
  - Divergent versions are reconciled later (a hypothetical merge is sketched below)
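Since the slides contain no code, here is a minimal Python sketch of the kind of application-level reconciliation hinted at above. The merge_carts function and its merge policy (keep the larger per-item quantity) are illustrative assumptions, not Dynamo or Amazon code.

def merge_carts(version_a: dict, version_b: dict) -> dict:
    """Merge two divergent cart versions, in the spirit of "Add to Cart is never lost".

    Additions from either branch survive the merge; the price of this simple
    policy is that a deletion made on only one branch can reappear.
    """
    merged = dict(version_a)
    for item, qty in version_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

cart_on_replica_1 = {"book": 1, "pen": 2}
cart_on_replica_2 = {"book": 1, "lamp": 1}   # diverged during a failure
print(merge_carts(cart_on_replica_1, cart_on_replica_2))
# {'book': 1, 'pen': 2, 'lamp': 1}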

Dynamo treats each modification as a new, immutable version of the data
- Multiple versions of the data may be present at the same time
- Often new versions subsume old data: syntactic reconciliation
- When an automatic reconciliation is not possible, clients have to do it
  - Collapse the branches into one, e.g., merge your shopping cart

Dynamo uses vector clocks to capture causality
- There is a vector clock for each version of the object
- When two versions of an object are compared:
  - If VC1 <= VC2 at every index of the vector clock, O1 occurred before O2
  - Otherwise, the changes are in conflict and need reconciliation
  (see the comparison sketch after this section)

A client must specify which version it is updating
- It passes the context from an earlier read operation; the context contains vector clock information
- Requests with branches that cannot be reconciled?
  - The read returns all conflicting objects, with their versioning info, in the context
  - An update done using this context reconciles and collapses all branches

Execution of get() and put() operations
- Read and write operations involve the first N healthy nodes
- During failures, nodes lower in the priority list are accessed

To maintain consistency, Dynamo uses a quorum protocol
- Configurable settings control how many replicas must participate in reads and writes
- Quorum-based protocols, with N replicas:
  - Read quorum N_R and write quorum N_W
  - N_R + N_W > N prevents read-write conflicts
  - N_W > N/2 prevents write-write conflicts
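A minimal Python sketch of the vector-clock comparison rule just described; the dict-of-counters representation and the compare function are assumptions for illustration, not Dynamo's actual data structures.

def compare(vc1: dict, vc2: dict) -> str:
    # VC1 <= VC2 at every index  =>  O1 happened before O2 (O2 subsumes O1);
    # neither dominates            =>  concurrent versions, reconciliation needed.
    nodes = set(vc1) | set(vc2)
    leq = all(vc1.get(n, 0) <= vc2.get(n, 0) for n in nodes)
    geq = all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)
    if leq and geq:
        return "identical"
    if leq:
        return "O1 happened before O2"   # syntactic reconciliation: keep O2
    if geq:
        return "O2 happened before O1"   # syntactic reconciliation: keep O1
    return "conflict"                    # client must reconcile

print(compare({"A": 1}, {"A": 2, "B": 1}))          # O1 happened before O2
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 3}))  # conflict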

Quorum-based protocols: Example
- Twelve replicas, at nodes A through L (N = 12)
- With N_R = 3 and N_W = 10, both quorum conditions are satisfied
- With N_R = 7 and N_W = 6, a write-write conflict is possible:
  concurrent writes to {A, B, C, E, F, G} and {D, H, I, J, K, L} will both be accepted

Upon receiving a put() request for a key
- The coordinator generates a vector clock for the new version
- It sends the new version to the N highest-ranked reachable nodes
- If at least N_W - 1 nodes respond, the write is successful

External discovery: during node adds
- When new nodes A and B join, it might be a while before they know of each other's existence (logical partitioning)
- Use seed nodes that are known to all nodes; all nodes reconcile membership with a seed

DYNAMO: EXPERIENCES

Popular reconciliation strategies
- Business-logic specific
- Timestamp based: last write wins

High performance read engine
- High read rates, small update rates
- N_R = 1 and N_W = N

Quorum-based protocols: Example 2
- Twelve replicas, at nodes A through L, with N_R = 1 and N_W = 12
  (the quorum conditions for these examples are checked in the sketch below)
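The two quorum conditions can be checked mechanically. The following Python sketch does exactly that for the three configurations mentioned on these slides; the function name and output format are illustrative.

def check_quorum(n: int, n_r: int, n_w: int) -> None:
    no_rw_conflict = n_r + n_w > n    # every read quorum overlaps every write quorum
    no_ww_conflict = n_w > n / 2      # any two write quorums overlap
    print(f"N={n}, N_R={n_r}, N_W={n_w}: "
          f"read-write conflicts prevented: {no_rw_conflict}, "
          f"write-write conflicts prevented: {no_ww_conflict}")

check_quorum(12, 3, 10)   # both conditions hold
check_quorum(12, 7, 6)    # N_W = N/2: two disjoint write quorums of 6 are both accepted
check_quorum(12, 1, 12)   # read-one/write-all: the high-read-rate configuration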

Common configuration of the quorum
- N = 3, N_R = 2, N_W = 2

Balancing performance and durability
- Some services are not happy with a 300 ms SLA, and writes tend to be slower than reads
- To cope with this, nodes maintain an object buffer in main memory that is periodically
  written to storage (a rough sketch of such a buffer follows below)
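A rough Python sketch of the write-behind object buffer described above, assuming a background thread that periodically flushes buffered writes to durable storage; the class and parameter names are illustrative, not Dynamo code.

import threading
import time

class WriteBuffer:
    def __init__(self, flush_interval_s: float, store: dict):
        self.pending = {}              # key -> latest buffered value (main memory)
        self.lock = threading.Lock()
        self.store = store             # stand-in for the durable storage engine
        t = threading.Thread(target=self._flusher, args=(flush_interval_s,), daemon=True)
        t.start()

    def put(self, key, value):
        with self.lock:
            self.pending[key] = value  # returns immediately: buffered, not yet durable

    def _flusher(self, interval: float):
        while True:
            time.sleep(interval)
            with self.lock:
                batch, self.pending = self.pending, {}
            self.store.update(batch)   # periodic write to "storage"

durable = {}
buf = WriteBuffer(flush_interval_s=1.0, store=durable)
buf.put("cart:42", ["book", "pen"])
time.sleep(1.5)
print(durable)                         # {'cart:42': ['book', 'pen']}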

Broad-brushstroke themes in current extreme-scale storage systems
- Voluminous data; commodity hardware; distributed data; expect failures
- Tune for access by applications; optimize for the dominant usage
- Tradeoff between consistency and availability

THE GOOGLE FILE SYSTEM
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. Proceedings of SOSP 2003: 29-43

Demand pulls in GFS: I
- Component failures are the norm
- Files are huge by traditional standards
- File mutations happen predominantly through appends, not overwrites
- Applications and the file system API are designed in lock-step

Demand pulls in GFS: II
- Hundreds of producers will concurrently append to a file (many-way merging)
- High sustained bandwidth is more important than low latency

The file system interface
- GFS does not implement standard APIs such as POSIX
- Supports create, delete, open, close, read, and write
- snapshot: create a fast copy of a file or directory tree
- record append: multiple clients can concurrently append records to the same file without additional locking

Architecture of GFS
[Figure: multiple clients talk to a single GFS master and to many GFS chunk servers, each of which stores chunks in its local Linux file system]

In GFS, a file is broken up into fixed-size chunks
- The obvious reason: the file is too big
- MapReduce: sets the stage for computations that operate on this data
- Parallel I/O: I/O seek times are roughly 14 x 10^6 times slower than CPU access times

In GFS, a file is broken up into fixed-size chunks
- Each chunk has a 64-bit globally unique ID, assigned by the master
- Chunks are stored by chunk servers on local disks as Linux files
- Each chunk is replicated; the default is 3 replicas

Master operations
- Manage system metadata
- Leasing of chunks
- Garbage collection of orphaned chunks
- Chunk migrations

ALL system metadata is managed by the master and stored in main memory
1. File and chunk namespaces
2. Mapping from files to chunks
3. Location of chunks
- Mutations are logged into a permanent operation log
(a sketch of these structures follows below)
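A small Python sketch of the three kinds of master metadata listed above, kept in plain in-memory structures; the class, its fields, and the chunk-handle allocation scheme are assumptions for illustration, not GFS code.

import itertools

CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB fixed-size chunks
REPLICATION = 3                    # default replication factor

class MasterMetadata:
    def __init__(self):
        self._next_handle = itertools.count(1)
        self.namespace = set()     # (1) file and chunk namespaces
        self.file_to_chunks = {}   # (2) file name -> ordered list of 64-bit chunk handles
        self.chunk_locations = {}  # (3) chunk handle -> chunk servers holding a replica
                                   #     (rebuilt from heartbeats, not persisted)

    def create(self, path: str):
        self.namespace.add(path)
        self.file_to_chunks[path] = []

    def allocate_chunk(self, path: str, servers: list) -> int:
        handle = next(self._next_handle)          # globally unique ID assigned by the master
        self.file_to_chunks[path].append(handle)
        self.chunk_locations[handle] = servers[:REPLICATION]
        return handle

md = MasterMetadata()
md.create("/logs/web-0001")
md.allocate_chunk("/logs/web-0001", ["cs-a", "cs-b", "cs-c"])
print(md.file_to_chunks, md.chunk_locations)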

Why have a single master?
- Vastly simplifies the design
- Easy to use global knowledge to reason about chunk placements and replication decisions
- Communications with the chunk servers: periodic heartbeats carry instructions to a chunk server
  and collect/retrieve state from it

Chunk size
- Fixed at 64 MB, much larger than typical file system block sizes (512 bytes)
- Lazy space allocation: a chunk is stored as a plain Linux file and extended only as needed
- But why this big?
  - Reduces client interaction with the master; a client can cache chunk information for a multi-TB working set
  - Reduces network overhead: with a large chunk, a client performs more operations over persistent connections
  - Reduces the size of the metadata stored in the master: 64 bytes of metadata per 64 MB chunk

Why keep the entire metadata in memory?
- Speed
- The master can scan its state in the background to
  - implement chunk garbage collection
  - re-replicate if there are failures
  - migrate chunks to balance load and space
- Add extra memory to increase the file system size

Size of the file system with 1 TB of RAM (assuming file sizes are exact multiples of the chunk size):
  Number of entries = 2^40 / 2^6 = 2^34
  Maximum size of the file system = number of entries x chunk size
                                  = (2^40 / 2^6) x (2^6 x 2^20) = 2^60 bytes = 1 EB
(the same calculation is worked through in the sketch below)
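The same back-of-the-envelope calculation, worked through in Python to confirm the 1 EB figure.

ram_bytes          = 2**40          # 1 TB of RAM on the master
metadata_per_chunk = 64             # 2^6 bytes of metadata per chunk
chunk_size         = 64 * 2**20     # 64 MB = 2^26 bytes

num_chunks  = ram_bytes // metadata_per_chunk   # 2^34 chunk entries
max_fs_size = num_chunks * chunk_size           # 2^60 bytes

print(num_chunks == 2**34, max_fs_size == 2**60)   # True True
print(max_fs_size / 2**60, "EB")                   # 1.0 EB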

Tracking the chunk servers
- The master does not keep a persistent copy of which chunk servers hold which chunks
- The list is maintained via heartbeats, which keeps it in sync with reality despite failures
- A chunk server has the final word on the chunks it holds

Caching at the clients/chunk servers
- Clients do not cache file data
  - At a client the working set may be too large
  - This simplifies the client and eliminates cache-coherence problems
- Chunk servers do not cache file data either
  - Chunks are stored as local files, and Linux's buffer cache already keeps frequently accessed data in memory

Mutations
- A mutation changes the content or metadata of a chunk: a write or an append
- Each mutation is performed at all of a chunk's replicas

MANAGING MUTATIONS
Handling writes and appends to a file

GFS uses leases to maintain a consistent mutation order across replicas
- The master grants a lease to one of the replicas: the PRIMARY
- The primary picks a serial order for all mutations to the chunk
- The other replicas follow this order when applying mutations

The lease mechanism is designed to minimize communications with the master
- A lease has an initial timeout of 60 seconds
- As long as the chunk is being mutated, the primary can request and receive extensions
- Extension requests/grants are piggybacked on heartbeat messages
  (a sketch of this bookkeeping follows below)
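A small Python sketch of the lease bookkeeping just described: a 60-second lease granted to a primary replica and extended when an extension request arrives on a heartbeat. The LeaseTable class and its methods are illustrative assumptions, not GFS code.

import time

LEASE_SECONDS = 60

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary server, lease expiry time)

    def grant(self, chunk: int, primary: str, now: float) -> None:
        self.leases[chunk] = (primary, now + LEASE_SECONDS)

    def heartbeat(self, chunk: int, server: str, wants_extension: bool, now: float) -> None:
        primary, expiry = self.leases.get(chunk, (None, 0.0))
        if server == primary and wants_extension and now < expiry:
            # Extend the lease while the chunk is being mutated;
            # the grant rides back on the heartbeat reply.
            self.leases[chunk] = (primary, now + LEASE_SECONDS)

table = LeaseTable()
start = time.time()
table.grant(chunk=17, primary="cs-a", now=start)
table.heartbeat(chunk=17, server="cs-a", wants_extension=True, now=start + 30)
print(table.leases[17][1] - start)   # ~90: the lease was extended on the heartbeat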

Revocation and transfer of leases
- The master may revoke a lease before it expires
- If communications with the primary are lost, the master can safely give the lease to another replica,
  but ONLY AFTER the lease period for the old primary has elapsed

How a write is actually performed
[Figure: the client, the master, the primary replica, and secondary replicas A and B]

Client pushes data to all the replicas (I)
- Each chunk server stores the data in an LRU buffer until the data is used or aged out

Client pushes data to all the replicas (II)
- When the chunk servers acknowledge receipt of the data, the client sends a write request to the primary
- The primary assigns consecutive serial numbers to the mutations and forwards the request to the other replicas

Data flow is decoupled from the control flow to use the network efficiently
- Utilize each machine's network bandwidth
- Avoid network bottlenecks and high-latency links
- Leverage the network topology: estimate distances from IP addresses

What if the secondary replicas could not finish the write operation?
- The client request is considered failed
- The modified region is inconsistent; there is no attempt to delete it from the chunk
- The client must handle this inconsistency: it retries the failed mutation
  (a control-flow sketch of the write path follows below)
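A Python control-flow sketch of the write path described above: data is pushed to every replica first, the primary assigns the serial order, and any secondary failure makes the request fail so the client retries. All class and method names are assumptions for illustration, not GFS code.

class Replica:
    def __init__(self, name: str, fail: bool = False):
        self.name, self.fail = name, fail
        self.buffer = {}      # data id -> bytes, held in an LRU-style buffer
        self.chunk = []       # applied mutations, in serial order

    def push_data(self, data_id: int, data: bytes) -> bool:
        self.buffer[data_id] = data
        return True

    def apply(self, serial: int, data_id: int) -> bool:
        if self.fail:
            return False      # leaves this replica's region inconsistent
        self.chunk.append((serial, self.buffer.pop(data_id)))
        return True

def write(primary: Replica, secondaries: list, data_id: int, data: bytes) -> bool:
    # 1. Client pushes the data to all replicas (data flow).
    for r in [primary] + secondaries:
        r.push_data(data_id, data)
    # 2. Client sends the write request to the primary (control flow);
    #    the primary picks the serial order and applies the mutation.
    serial = len(primary.chunk) + 1
    primary.apply(serial, data_id)
    # 3. Primary forwards the request, with the serial number, to all secondaries.
    ok = all(s.apply(serial, data_id) for s in secondaries)
    return ok                 # False => client treats the request as failed and retries

p = Replica("primary")
s1, s2 = Replica("sec-a"), Replica("sec-b", fail=True)
print(write(p, [s1, s2], data_id=1, data=b"record"))   # False: client must retry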

GFS client code implements the file system API
- Communications with the master and the chunk servers are done transparently,
  on behalf of applications that read or write data
- Clients interact with the master for metadata
- Data-bearing communications go directly to the chunk servers

The contents of this slide set are based on the following references
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels: Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007: 205-220
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. Proceedings of SOSP 2003: 29-43