Distributed systems
Lecture 15: Replication, quorums, consistency, CAP, and Amazon/Google case studies
Dr Robert N. M. Watson

Last time
- General issue of consensus: how to get processes to agree on something
  - FLP says this is impossible in asynchronous networks with even one failure, but in practice we're OK!
  - The general idea is useful for leadership elections and distributed mutual exclusion; it relies on being able to detect failures
- Distributed transactions:
  - Need to commit a set of sub-transactions across multiple servers, with all-or-nothing semantics
  - Use an atomic commit protocol such as 2PC
- Replication:
  - Performance, load-balancing, and fault tolerance
  - Introduction to consistency

From last lecture: replication and consistency
- More challenging if clients can perform updates
- For example, imagine x has value 3 (in all replicas):
  - C1 requests write(x, 5) from S4
  - C2 requests read(x) from S3
  - What should occur?
- With strong consistency, the distributed system behaves as if there is no replication present:
  - i.e. in the above, C2 should get the value 5
  - Requires coordination between all servers
- With weak consistency, C2 may get 3 or 5 (or ...?)
  - Less satisfactory, but much easier to implement
  - Recall close-to-open consistency in NFS

Achieving strong consistency
Goal: impose a total order on updates to some state x, and ensure an update is propagated to replicas before later reads.
Simple lock-step solution for a replicated object:
1. When Si receives an update for x, it locks x at all other replicas
2. Make the change to x on Si
3. Propagate Si's change to x to all other replicas
4. Other servers send an ACK to Si
5. After the ACKs are received, instruct the replicas to unlock x
6. Once Cj has the ACK for its write to Si, any Ck will see the update
Need to handle failure (of a replica, or of the network):
- Add a step to tentatively apply the update, and only actually apply ("commit") the update if all replicas agree
- We've reinvented distributed transactions and 2PC!
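A minimal sketch of the lock-step update path above, assuming in-process Replica objects (real replicas would be contacted over RPC) and omitting the failure handling and tentative-apply ("commit") step:

```python
# Minimal sketch of the lock-step update protocol (steps 1-6 above).
# Assumes in-process Replica objects; failure handling is omitted.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}          # object id -> value
        self.locks = set()       # object ids currently locked

    def lock(self, x):
        assert x not in self.locks
        self.locks.add(x)

    def apply(self, x, value):
        self.store[x] = value
        return "ACK"

    def unlock(self, x):
        self.locks.discard(x)


def lock_step_write(origin, others, x, value):
    """Run the write at 'origin' and propagate it to all other replicas."""
    for r in others:             # 1. lock x at all other replicas
        r.lock(x)
    origin.store[x] = value      # 2. make the change locally
    acks = [r.apply(x, value) for r in others]   # 3+4. propagate, collect ACKs
    assert all(a == "ACK" for a in acks)
    for r in others:             # 5. unlock once all ACKs are in
        r.unlock(x)
    return "ACK"                 # 6. the write is now visible everywhere


replicas = [Replica(f"S{i}") for i in range(1, 5)]
lock_step_write(replicas[3], replicas[:3], "x", 5)   # C1's write(x, 5) via S4
```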

Quorum systems
Transactional consistency works, but:
- High overhead, and
- Poor availability during an update (worse if there is a crash!)
An alternative is a quorum system:
- Imagine there are N replicas, a write quorum Qw, and a read quorum Qr
- Constraint on writes: Qw > N/2
- Constraint on reads: (Qw + Qr) > N
- To perform a write, must update Qw replicas
  - Ensures that a majority of replicas have the new value
- To perform a read, must read Qr replicas
  - Ensures that we read at least one updated value

Example
[Figure: seven replicas S1..S7 over time; a write quorum of five holds X=0 with version (t2, s3)]
- Seven replicas (N=7), Qw = 5, Qr = 3
- All objects have an associated version (T, S):
  - T is a logical timestamp, initialised to zero
  - S is a server ID (used to break ties)
- Any write will update at least Qw replicas
- Performing a read is easy:
  - Choose replicas to read from until we get Qr responses
  - The correct value is the one with the highest version
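A minimal sketch of reads and writes under these quorum constraints, assuming in-process replicas and ignoring failures (and the commit protocol that real writes need, discussed next):

```python
# Minimal sketch of quorum reads/writes (N=7, Qw=5, Qr=3 as above).
# In-process replicas only; failures and the write commit protocol are ignored.

N, QW, QR = 7, 5, 3
assert QW > N / 2 and QW + QR > N       # quorum constraints from the slide

# Each replica maps object -> (version, value); version = (T, server_id).
replicas = [dict() for _ in range(N)]

def quorum_write(x, value, server_id, timestamp):
    version = (timestamp, server_id)
    for r in replicas[:QW]:             # update at least Qw replicas
        r[x] = (version, value)

def quorum_read(x):
    # Read Qr replicas (here: the last three). Because Qw + Qr > N, at least
    # one of them must hold the latest write; the highest version wins.
    responses = [r[x] for r in replicas[-QR:] if x in r]
    version, value = max(responses)
    return value

quorum_write("X", 0, server_id=3, timestamp=2)
print(quorum_read("X"))                 # -> 0
```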

Quorum systems: writes
Performing a write is trickier:
- Must ensure we get the entire quorum, or we cannot update
- Hence we need a commit protocol (as before)
- In fact, transactional consistency is a quorum protocol with Qw = N and Qr = 1!
- But when Qw < N, there is additional complexity, since we must bring replicas up to date before updating
Quorum systems are good when we expect failures:
- Additional work on update, additional work on reads
- ... but increased availability during failure
- How might client-server traffic scale with Qw/Qr?

Weak consistency
Maintaining strong consistency has costs:
- Need to coordinate updates to all (or Qw) replicas
- Slow, and will block other accesses for the duration
Weak consistency systems provide fewer guarantees:
- E.g. C1 updates (a replica of) object Y at S3
- S3 lazily propagates the changes to the other replicas
We can do this by reducing the quorum parameters:
- Qr: clients can potentially read a stale value from another Sx
- Qw: writes may conflict: more than one value of Y with the same timestamp
Considerably more efficient and more available:
- Less waiting for replicas on read and write
- Hence also more available (i.e. fault tolerant)
- But it can be harder to reason about the possible outcomes

FIFO consistency
- As with group communication primitives, various ordering guarantees are possible
- FIFO consistency: all updates originating at Si (on behalf of a client) occur in the same order at all replicas
  - As with FIFO multicast, we can buffer for as long as we like!
  - But it says nothing about how Si's updates are interleaved with Sj's at another replica (may put Sj first, or Si, or mix them)
- Still useful in some circumstances:
  - E.g. a single user accessing different replicas at disjoint times
  - I.e., the client will see its writes serialised
  - Essentially primary replication with primary = last accessed
  - E.g., sufficient for multiple mail clients interacting with the same mailbox independently (phone, tablet)

Eventual consistency
FIFO consistency doesn't provide very nice semantics:
- E.g. C1 writes V1 of file f to S1
- Later, C1 reads f from S2 and writes V2
- Much later, C1 reads f from S3 and gets V1: changes lost!
What happened?
- V1 arrived at S3 after V2, and thus overwrote it (stoooopid S3)
A desirable property in weakly consistent systems is that they converge to a more correct state:
- I.e. in the absence of further updates, every replica will eventually end up with the same latest version
- This is called eventual consistency

Implementing eventual consistency
- Each server Si keeps a version vector Vi(O) for each object O
  - For each update of O on Si, increment Vi(O)[i]
  - (essentially a vector clock used as a per-object version number)
- Servers synchronise pair-wise from time to time
- For each object O, compare Vi(O) to Vj(O):
  - If Vi(O) < Vj(O), Si gets an up-to-date copy from Sj; if Vj(O) < Vi(O), Sj gets an up-to-date copy from Si
  - If Vi(O) ~ Vj(O) (neither dominates), we have a write conflict:
    - Concurrent updates have occurred at two or more servers
    - Must apply some kind of reconciliation method (similar to revision control systems, and equally painful)
- The Coda filesystem (next lecture) uses this approach

Amazon's Dynamo [2007]
- Storage service used within Amazon's web services
- Designed to prioritise availability above consistency:
  - SLA to give a bounded response time 99.99% of the time
  - If a customer wants to add something to their shopping basket and there's a failure, we still want the addition to work
  - Even if we get a (temporarily) inconsistent view: fix it later!
- Built around the notion of a so-called sloppy quorum:
  - Have N, Qw, Qr as we saw earlier, but don't actually require that Qw > N/2, or that (Qw + Qr) > N
  - Instead make them tunable: lower Q values mean higher availability and higher read (or write) throughput
  - Also lets the system continue during failure
  - The application must handle (reconcile?) the inconsistency
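The pair-wise synchronisation above, and the reconciliation that Dynamo pushes up to the application, both hinge on comparing version vectors. A minimal sketch, with vectors as plain dicts (an illustrative model, not any particular system's API):

```python
# Minimal sketch of version-vector comparison for pair-wise sync.
# Vectors map server id -> counter; missing entries count as 0.

def compare(vi, vj):
    """Return 'less', 'greater', 'equal', or 'concurrent' (write conflict)."""
    servers = set(vi) | set(vj)
    i_le_j = all(vi.get(s, 0) <= vj.get(s, 0) for s in servers)
    j_le_i = all(vj.get(s, 0) <= vi.get(s, 0) for s in servers)
    if i_le_j and j_le_i:
        return "equal"
    if i_le_j:
        return "less"        # Si is stale: fetch an up-to-date copy from Sj
    if j_le_i:
        return "greater"     # Sj is stale: push our copy to Sj
    return "concurrent"      # neither dominates: reconciliation needed

# Example: S1 and S2 both updated object O independently -> conflict.
print(compare({"S1": 2, "S2": 0}, {"S1": 1, "S2": 1}))   # -> concurrent
print(compare({"S1": 1}, {"S1": 1, "S2": 3}))            # -> less
```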

Session guarantees
Eventual consistency seems great, but how can you program to it?
- Need to know something about the guarantees made to the client
- These are called session guarantees:
  - Not system-wide, just for one (identified) client
  - The client must be a more active participant
  - E.g. the client maintains version vectors of the objects it reads/writes
- Example: Read Your Writes (RYW):
  - If Ci writes a new value to x, a later read of x should see the update, even if Ci is now reading from another replica
  - Need Ci to remember the highest ID of any update it made, and only read from a server that has seen that update
- E.g., webmail: exchange stale message read/delete flags between sessions for greater scalability

Session guarantees + availability
There are many variations on session guarantees:
- All deal with the allowable state on a replica given the history of accesses by a specific client
Session guarantees are weaker than strong consistency, but stronger than pure weak consistency:
- But this means that they sacrifice availability
- I.e. choosing not to allow a read or write if it would break a session guarantee means not allowing that operation!
- Pure weak consistency would allow the operation
Can we get the best of both worlds?

Consistency, Availability & Partitions (CAP)
Short answer: no ;-)
- The CAP Theorem (Brewer 2000; Gilbert & Lynch 2002) says you can only guarantee two of consistent data, availability, and partition-tolerance in a single system
- In local-area systems, we can sometimes drop partition-tolerance by using redundant networks
- In the wide area, this is not an option:
  - Must choose between consistency and availability
  - Most Internet-scale systems ditch consistency
  - NB: this doesn't mean things are always inconsistent, just that they're not guaranteed to be consistent

A Google datacentre
- MapReduce: scalable distributed computation model
- BigTable: distributed storage with weak consistency
- Spanner: distributed storage with strong consistency
- Many spiffy distributed systems at Google, e.g. Dapper: trace RPCs and distributed events

Google: architecture overview
- Parallel data processing: MapReduce
- Fast data analytics: Dremel
- Web serving: GWS
- Cross-datacentre RDBMS: Spanner
- Distributed storage: Colossus
- Structured storage: BigTable
- Distributed locking: Chubby
- Cluster management and scheduling: Borg / Omega
(components communicate via RPCs)

Google's MapReduce [2004]
- Specialised programming framework for scale: run a program on 100s to 10,000s of machines
- The framework takes care of:
  - Parallelisation, distribution, load-balancing, scaling up (or down), and fault-tolerance
  - Locality: compute close to the (distributed) data
- The programmer implements two methods:
  - map(key, value) -> list of <key', value'> pairs
  - reduce(key', value') -> result
- Inspired by functional programming
- Reduces data movement by computing close to the data source
- E.g., for every word, count the documents using the word (see the sketch below):
  - Extract words from local documents in the map() phase
  - Aggregate and generate sums in the reduce() phase
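A minimal, single-process sketch of this word-count example, assuming plain Python stand-ins for map() and reduce() rather than the real MapReduce API:

```python
# Toy word count: the real framework shards the input, runs map/reduce on
# many machines, and shuffles intermediate pairs by key.
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit <word, 1> for every word found in this document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return sum(counts)

def run_mapreduce(inputs):
    intermediate = defaultdict(list)
    for doc_id, text in inputs:                  # map phase
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)      # "shuffle": group by key
    return {key: reduce_fn(key, values)          # reduce phase
            for key, values in intermediate.items()}

docs = [("d1", "x y x"), ("d2", "x y y y")]
print(run_mapreduce(docs))   # -> {'x': 3, 'y': 4}
```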

MapReduce: for each key, sum values
[Dataflow: Input -> Map -> Shuffle -> Reduce -> Output]
- Input: perform a Map() query against local data matching the input spec; write new keys/values (e.g., "5 instances of X found here")
- Map output: X: 5, X: 3, Y: 2, Y: 7
- Shuffle, then aggregate the gathered results for each intermediate key using Reduce() (e.g., Xsum = sum(Xi))
- Reduce output: Xsum: 8, Ysum: 9
- Output: the end user can query the results via a distributed key/value store (Results: Xsum: 8, Ysum: 9)

MapReduce example programs
- Sorting data is trivial (map and reduce are both the identity function)
  - Works since the shuffle step essentially sorts the data
- Distributed grep (search for words):
  - map: emit a line if it matches a given pattern
  - reduce: just copy the intermediate data to the output
- Count URL access frequency:
  - map: process logs of web-page accesses; output <URL, 1>
  - reduce: add all values for the same URL
- Reverse web-link graph (sketched below):
  - map: output <target, source> for each link to target in a page
  - reduce: concatenate the list of all source URLs associated with a target; output <target, list(source)>
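The last of these, the reverse web-link graph, also fits in a few lines using the same toy single-process shuffle as the word-count sketch above (illustrative stand-ins, not the real MapReduce API):

```python
# Reverse web-link graph as map()/reduce() functions (toy, single-process).
from collections import defaultdict

def map_fn(page_url, outgoing_links):
    # For each link on this page, emit <target, source>.
    return [(target, page_url) for target in outgoing_links]

def reduce_fn(target, sources):
    # Concatenate the list of all source URLs pointing at this target.
    return list(sources)

pages = [("a.html", ["b.html", "c.html"]),
         ("b.html", ["c.html"])]

grouped = defaultdict(list)                 # the "shuffle": group by key
for url, links in pages:
    for target, source in map_fn(url, links):
        grouped[target].append(source)

print({t: reduce_fn(t, s) for t, s in grouped.items()})
# -> {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```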

MapReduce: pros and cons
Extremely simple, and:
- Can auto-parallelise (since the operations on every element of the input are independent)
- Can auto-distribute (since it relies on the underlying Colossus/BigTable distributed storage)
- Gets fault-tolerance (since tasks are idempotent, i.e. we can just re-execute them if a machine crashes)
But:
- Doesn't really use any of the sophisticated algorithms we've seen (except storage replication)
- Limited to batch jobs, and to computations that are expressible as a map() followed by a reduce()

Google's BigTable [2006]
Three-dimensional structured key-value store:
- <row key, column key, timestamp> -> value
- Keys: row (string), column (string), timestamp (int64)
- Example cell: <"larry.page", "websearch", 133746428> -> "cat pictures"
- Effectively a distributed, sorted, sparse map
- Used for versioned web contents by URL, user activity history, web logs, ...
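A tiny in-memory model of this data model (an illustrative sketch only, not the BigTable API): a sorted map keyed by <row, column, timestamp>, where a read returns the most recent version at or before a given timestamp.

```python
# Toy model of the <row, column, timestamp> -> value data model.
# The real system stores this as distributed, sorted, replicated tablets.
import bisect

class TinyBigTable:
    def __init__(self):
        self.cells = {}    # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, at_timestamp=None):
        """Return the newest value at or before at_timestamp (default: latest)."""
        versions = self.cells.get((row, column), [])
        candidates = [(ts, v) for ts, v in versions
                      if at_timestamp is None or ts <= at_timestamp]
        return candidates[-1][1] if candidates else None

t = TinyBigTable()
t.put("larry.page", "websearch", 133746428, "cat pictures")
print(t.get("larry.page", "websearch"))   # -> cat pictures
```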

Google's BigTable [2006] (continued)
- Distributed tablets (~1 GB max) hold subsets of the map
- Adjacent rows have user-specifiable locality
  - E.g., store the pages for a particular website in the same tablet
- Built on top of Colossus, which handles replication and fault tolerance: only one (active) server per tablet!
- Reads and writes within a row are transactional
  - Independently of the number of columns touched
  - But: no cross-row transactions possible
- The META0 tablet is the root for name resolution
  - Filesystem metadata is stored in BigTable itself
  - Chubby is used to elect the master (the META0 tablet server), and to maintain the list of tablet servers and schemas
  - 5-way replicated Paxos consensus on the data in Chubby

Google's Spanner [2012]
- BigTable is insufficient for some consistency needs
  - Often have transactions across more than one datacenter
  - You may buy an app on the Play Store while travelling in the U.S.: you hit a U.S. server, but your customer billing data is in the U.K.
- Spanner offers transactional consistency: full RDBMS power, ACID properties, at global scale!
- Wide-area consistency is hard due to long delays and clock skew
- Secret sauce: hardware-assisted clock synchronisation
  - Using GPS and atomic clocks in the datacentres
  - Use global timestamps and Paxos to reach consensus
  - There is still a period of uncertainty for a write transaction: wait it out!
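A minimal sketch of that "wait it out" rule, assuming a hypothetical TrueTime-style call tt_now() that returns an (earliest, latest) uncertainty interval around real time; the names, bounds, and single-machine apply step are illustrative, not Spanner's actual API:

```python
# Sketch of commit-wait: pick a commit timestamp, then wait until the clock
# uncertainty interval has moved past it, so no later transaction anywhere
# can be assigned an earlier timestamp.
import time

CLOCK_UNCERTAINTY = 0.007   # assumed worst-case clock error (seconds)

def tt_now():
    """Hypothetical TrueTime-like call: bounds on the current real time."""
    t = time.time()
    return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)

def commit_write_transaction(apply_fn):
    earliest, latest = tt_now()
    commit_ts = latest                 # a timestamp no earlier than "now"
    apply_fn(commit_ts)                # (in reality: a Paxos-replicated apply)
    while tt_now()[0] <= commit_ts:    # commit wait: spin until commit_ts is
        time.sleep(0.001)              # definitely in the past everywhere
    return commit_ts                   # now safe to report success to clients

commit_write_transaction(lambda ts: print("applied at", ts))
```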

Comparison
- Consistency: Dynamo eventual; BigTable weak(ish); Spanner strong
- Availability: Dynamo high throughput, low latency; Spanner low throughput, high latency
- Expressivity: Dynamo simple key-value; BigTable row transactions; Spanner full transactions

Summary + next time
- Strong, weak, and eventual consistency
- Quorum replication
- Session guarantees
- CAP theorem
- Amazon/Google case studies
Next time:
- Distributed-system security: access control, capabilities, RBAC, single-system sign-on
- Distributed storage system case studies: NASD, AFS3, and Coda