EECS 498 Introduction to Distributed Systems
Fall 2017, Harsha V. Madhyastha
Lecture 19 (November 27, 2017): Windows Azure Storage

Windows Azure Storage
Focus on storage within a data center
Goals:
- Durability
- Strong consistency
- High availability
- Scalability
Primary-backup replication

Revisiting Project 2
Assume viewservice is a Paxos-based RSM
Why would a client's Append fail?
- Viewservice may not have detected the failure
Key idea in Azure: the client triggers the view change
Why didn't we do this in Project 2?
- View change was a heavyweight operation

Thought experiment
What if we supported only Append and Get operations in Project 2?
How to leverage this to make view change lightweight?
- Hint: all replicas don't fail when a view change is initiated
- Don't transfer any state upon view change
- Serve GETs based on state from all views

WAS: High-level idea
Treat state as an append-only log
- PUTs from clients applied as appends to the log
Log comprises a concatenation of extents
- Only the last extent can be modified; all others are read-only
View change has two steps:
- Seal the last extent
- Add a new extent
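
The append-only design above is small enough to sketch end to end. Below is a minimal in-memory version of a stream of extents in which only the last extent accepts appends, and a "view change" is just seal-plus-extend with no state transfer. The class and method names (Extent, Stream, seal_and_extend) are illustrative, not WAS's actual API.

# Minimal in-memory sketch of the append-only stream idea described above.
# Names (Extent, Stream, seal_and_extend) are illustrative, not the actual WAS API.

class Extent:
    def __init__(self):
        self.records = []      # appended records, in order
        self.sealed = False    # once sealed, contents never change

    def append(self, record):
        if self.sealed:
            raise RuntimeError("cannot append to a sealed extent")
        self.records.append(record)
        return len(self.records) - 1          # offset of the new record

class Stream:
    def __init__(self):
        self.extents = [Extent()]             # only the last extent is writable

    def append(self, record):
        offset = self.extents[-1].append(record)
        return (len(self.extents) - 1, offset)   # (extent index, offset)

    def seal_and_extend(self):
        # The lightweight "view change": seal the current last extent
        # and start a fresh one; no data is copied.
        self.extents[-1].sealed = True
        self.extents.append(Extent())

stream = Stream()
stream.append(("k1", "v1"))
stream.seal_and_extend()      # e.g., triggered after a replica failure
stream.append(("k1", "v2"))   # goes into the new extent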

Logical view of data in WAS
[Figure: a stream (e.g., //foo/myfile.dat) is an ordered list of pointers to its extents, Extent E1 through Extent E5]

Creating an Extent
[Figure: the client asks the Stream Master (a Paxos-based view service) to create a stream/extent; the SM allocates an extent replica set across extent nodes, with EN1 as primary and EN2, EN3 as secondaries]

Replication Flow
[Figure: the client sends an Append to the primary EN1, which forwards it to the secondaries EN2 and EN3 and acks the client once all replicas have it]
Upon failure, how to determine the size of the extent before sealing it?

Extent Sealing (Scenario 1)
[Figure: after a failure, the Stream Master asks the remaining replicas for their current length; both report 120, so the extent is sealed at 120]

Extent Sealing (Scenario 1, continued)
[Figure: the extent is sealed at 120; the replica that missed the seal later syncs with the Stream Master and learns the sealed length]

Extent Sealing (Scenario 2)
[Figure: the Stream Master again asks the reachable replicas for their current length while an append is in flight]
In what scenario would the servers differ in their responses?

Extent Sealing (Scenario 2, continued)
[Figure: the replicas report different lengths, 120 and 100, because the latest append reached only some of them; the Stream Master seals the extent at 100]

Extent Sealing (Scenario 2, continued)
[Figure: the extent is sealed at 100; the remaining replica syncs with the Stream Master and agrees on the sealed length of 100]
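
The two scenarios boil down to a simple rule that a minimal sketch can capture: the Stream Master asks every reachable replica for its current length and seals at the smallest value it hears back. An append beyond that length was never acknowledged to the client (an ack requires all replicas), so discarding it is safe. The function and parameter names below are illustrative, not the actual WAS RPCs.

# Sketch of the sealing decision illustrated in the two scenarios above.
# The Stream Master asks each reachable replica for its length and seals at
# the minimum; anything beyond that was never acked to a client, so it can be dropped.

def seal_extent(reachable_replicas, ask_length):
    """reachable_replicas: ids of replicas that still respond.
    ask_length: callable replica_id -> current length at that replica."""
    lengths = [ask_length(r) for r in reachable_replicas]
    return min(lengths)       # Scenario 2: responses 120 and 100 -> seal at 100

# Example matching Scenario 2 (EN2 is unreachable):
reported = {"EN1": 120, "EN3": 100}
print(seal_extent(list(reported), reported.get))   # -> 100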

Serving GETs
How to look up the latest value of any key?
- Unlike the thought experiment, WAS supports PUTs
Ask the Stream Master (aka the view service)?
- The Stream Master only has the definitive copy of metadata: number of extents, length of sealed extents
- It is unaware of what updates clients performed

Serving GETs (continued)
Need to maintain an index of the data in each stream:
- Which offset in which extent has the latest value for a key?
How to execute a GET on a key?
- Look up the index for (extent, offset), then read the stream
Implication for the execution of PUTs?
- Must not only append to the stream, but also update the index
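
A minimal sketch of this GET/PUT path, reusing the Stream/Extent sketch from the earlier slide: the key-value layer keeps an index from key to (extent, offset), so a GET is an index lookup plus a stream read, and a PUT both appends and updates the index. Class and method names (KVOnStream, put, get) are illustrative, not WAS's partition-layer API.

# Sketch of the GET/PUT path described above, built on the earlier Stream sketch:
# the index maps key -> (extent index, offset) into the append-only stream.

class KVOnStream:
    def __init__(self, stream):
        self.stream = stream
        self.index = {}                           # key -> (extent index, offset)

    def put(self, key, value):
        pos = self.stream.append((key, value))   # 1) append to the stream
        self.index[key] = pos                     # 2) update the index

    def get(self, key):
        if key not in self.index:
            return None
        ext, off = self.index[key]
        _, value = self.stream.extents[ext].records[off]
        return value

kv = KVOnStream(Stream())
kv.put("user:1", "alice")
kv.put("user:1", "bob")   # older value stays in the log; the index points at the new one
print(kv.get("user:1"))   # -> "bob"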

Layering in Azure Storage
Access blob storage via the URL: http://<account>.blob.core.windows.net/
[Figure: a Location Service directs accounts to storage stamps; within each stamp, a load balancer feeds the Front-Ends, which sit on the Partition Layer and the Stream Layer (intra-stamp replication); the Partition Layer also performs inter-stamp (geo) replication between stamps]

Partition Layer: Index Range Partitioning
[Figure: the blob index is keyed by (Account Name, Container Name, Blob Name), e.g., (harry, pictures, sunrise) and (richard, videos, tennis); the key space is range-partitioned across partition servers (A-H on PS1, H-R on PS2, R-Z on PS3); the Partition Master maintains the partition map, which Front-End servers consult to route requests]
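
A small sketch of the range-partitioning lookup in the figure, assuming the A-H / H-R / R-Z split shown there: the sorted lower bounds of each range are binary-searched to find the owning partition server. Which range a boundary key such as "h..." belongs to is a convention; here each range's lower bound is inclusive.

# Sketch of range-partition routing: find which partition server owns a key.
# The boundaries and server names come from the A-H / H-R / R-Z example above.

import bisect

lower_bounds = ["a", "h", "r"]        # inclusive lower bound of each range
servers      = ["PS1", "PS2", "PS3"]  # A-H -> PS1, H-R -> PS2, R-Z -> PS3

def lookup(account, container, blob):
    # The partition key is the full (account, container, blob) name, lowercased.
    key = "/".join((account, container, blob)).lower()
    i = bisect.bisect_right(lower_bounds, key) - 1
    return servers[max(i, 0)]

print(lookup("harry", "pictures", "sunrise"))   # -> PS2 under this boundary convention
print(lookup("richard", "videos", "tennis"))    # -> PS3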

Overall Architecture
[Figure: an incoming write request reaches the Front-End layer (FEs), is routed to a partition server in the Partition Layer (coordinated by a Partition Master and a Lock Service), which appends to the Stream Layer, where Paxos-replicated stream masters (M) manage the Extent Nodes (ENs); the ack then flows back to the client]

Beating the CAP Theorem
Stream layer:
- Availability: seal the extent and switch to a new replica set
- Consistency: replicas agree on the commit length
Partition layer:
- Consistency: disjoint partition ranges across servers
- Availability: partition servers only cache state
Key enabler: append-only design
Drawbacks? Garbage collection overhead

Enabling Scalability
Use a replicated Paxos-based service to make centralized decisions:
- Stream Master
- Partition Master
- Lock service at the partition layer
But:
- Ensure the central service is off the data path
- Cache state from the central service when possible

Replica Selection
How should the Stream Master choose 3 replicas for an extent?
- Different copies in different racks
- Ideally, racks with different sources of power
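
A sketch of fault-domain-aware placement under the constraints just listed: prefer extent nodes in distinct racks and, when possible, distinct power domains. The node inventory and the two-pass heuristic below are illustrative, not how the Stream Master actually scores candidates.

# Sketch of rack- and power-aware replica placement for an extent.
# The nodes, racks, and power domains below are made up for illustration.

nodes = [
    {"id": "EN1", "rack": "r1", "power": "p1"},
    {"id": "EN2", "rack": "r2", "power": "p1"},
    {"id": "EN3", "rack": "r3", "power": "p2"},
    {"id": "EN4", "rack": "r1", "power": "p2"},
    {"id": "EN5", "rack": "r4", "power": "p3"},
]

def choose_replicas(nodes, count=3):
    chosen, used_racks, used_power = [], set(), set()
    # Pass 1: require both a new rack and a new power domain.
    # Pass 2: relax the power-domain preference if we still need replicas.
    for require_new_power in (True, False):
        for n in nodes:
            if len(chosen) == count:
                return chosen
            if n["rack"] in used_racks:
                continue
            if require_new_power and n["power"] in used_power:
                continue
            chosen.append(n["id"])
            used_racks.add(n["rack"])
            used_power.add(n["power"])
    return chosen

print(choose_replicas(nodes))   # -> ['EN1', 'EN3', 'EN5']: 3 distinct racks and power sources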

Optimizing Read Latency
Can read from any replica of a sealed extent
What if the replica that a client chooses to read from is currently under high load? How to reduce the latency impact?
- Read from all 3 replicas; wait for the first response
  - 3x increase in load
- Read from any replica; give it a deadline to respond
  - Retry at another replica if the deadline is violated
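
A minimal sketch of the deadline-and-retry option, assuming a read_from callable standing in for the real replica RPC and an illustrative 50 ms deadline: try one replica, and move on to the next if it does not answer in time.

# Sketch of the deadline-and-retry read described above.

import concurrent.futures
import time

def read_with_deadline(replicas, read_from, deadline_s=0.05):
    """replicas: list of replica ids; read_from: callable replica_id -> bytes."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for replica in replicas:
            future = pool.submit(read_from, replica)
            try:
                # Wait up to the deadline for this replica before trying the next.
                return future.result(timeout=deadline_s)
            except concurrent.futures.TimeoutError:
                continue   # deadline violated: retry at another replica
        raise TimeoutError("no replica responded within the deadline")
    finally:
        pool.shutdown(wait=False)   # don't block on stragglers

# Example with a fake overloaded replica:
data = {"EN1": b"slow", "EN2": b"fast"}
def fake_read(r):
    if r == "EN1":
        time.sleep(0.2)             # simulate a replica under high load
    return data[r]

print(read_with_deadline(["EN1", "EN2"], fake_read))   # -> b"fast"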

Optimizing Storage Cost
Storing 3 replicas means 3x storage cost
How to reduce the amount of data stored?
Exploit the benefit of sealing an extent:
- The content of a sealed extent never changes in the future

Erasure Coding
Split data into M chunks
Use coding to generate N additional chunks
Data is recoverable using any M of the (M + N) chunks
How many chunks must be lost to lose data?
Better durability than 3x replication at 1.3-1.5x storage overhead
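
To make the any-M-of-(M + N) idea concrete, here is the simplest possible code: one XOR parity chunk (N = 1) over M data chunks, so any single missing chunk can be rebuilt from the rest. Production systems use more general codes (e.g., Reed-Solomon style) to survive N > 1 losses; this is only an illustration, and the data and chunk count are made up.

# Toy erasure code: M data chunks plus one XOR parity chunk (N = 1),
# so any single lost chunk is recoverable from the remaining M.

def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def encode(data, m):
    # Split into m equal-size chunks (pad the tail), plus one parity chunk.
    size = -(-len(data) // m)                   # ceiling division
    chunks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(m)]
    return chunks + [xor_chunks(chunks)]        # m data chunks + 1 parity

def recover(chunks, missing_index):
    # Rebuild the one missing chunk as the XOR of all the others.
    return xor_chunks([c for i, c in enumerate(chunks) if i != missing_index])

chunks = encode(b"hello azure storage!", m=4)   # 4 data + 1 parity = 1.25x overhead
lost = 2
assert recover(chunks, lost) == chunks[lost]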

Combining Optimizations
How to address the tradeoff?
- Erasure coding: lower storage cost
- Replication: lower read latency
For any extent:
- Store one original copy
- Store (M + N) erasure-coded chunks
If a read from the replica misses its deadline, reconstruct the data from the coded chunks
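
A sketch tying the last few slides together: serve the read from the single full copy with a deadline, and fall back to decoding the erasure-coded chunks only if that replica is too slow. read_replica and read_and_decode_chunks are hypothetical stand-ins for the real read paths; the deadline value is illustrative.

# Sketch of the combined read path: full replica on the fast path,
# reconstruction from coded chunks on the slow path.

import concurrent.futures

def read_extent(read_replica, read_and_decode_chunks, deadline_s=0.05):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(read_replica)      # fast path: the one full copy
        try:
            return future.result(timeout=deadline_s)
        except concurrent.futures.TimeoutError:
            # Slow path: gather any M of the (M + N) coded chunks and decode.
            return read_and_decode_chunks()
    finally:
        pool.shutdown(wait=False)

The design point: keeping one full copy makes the common-case read a single fetch, while reconstruction touches M chunk servers only when that replica misses its deadline.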