Eventual Consistency


Readings
- Werner Vogels, ACM Queue paper: http://queue.acm.org/detail.cfm?id=1466448
- Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
- Apache Cassandra consistency docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dmlaboutdataconsistency.html

Eventual Consistency
Recall: data is replicated for performance and failover purposes.
- If lots of clients want to read the same data, they can read from different replicas.
- When a machine goes down, the data is still available somewhere else.

How to Implement R/W Operations
Ideally, we want accesses to a single object to be atomic/linearizable: if several processes are using the same object, it should "look like" they accessed it in sequence on a single-node system.

Linearizability
Linearizability requires:
- A total order on operations across the whole system
- Consistency with wall-clock ordering for non-overlapping operations
- If a read is ordered after a write, the read sees the value written by that write (or a later one)
Sometimes called "consistency", but not the same kind of consistency as ACID consistency.

The Problem
We want to enforce this without sacrificing availability: users must be able to perform reads and writes when they want to. Things get even worse if the system is partitioned by network failures: if you really want to keep the replicas in sync, you must block everyone until the network is back up.

Consistency vs. Availability
A fundamental tradeoff between:
- Consistency
- Availability
- Partition tolerance
Known as "Brewer's Conjecture" or the "CAP theorem".

Consistency
(As seen before): accesses to a single object must be atomic/linearizable.

Availability
All clients must be able to access some version of the data; every request must (eventually) receive a response. Example: every write must succeed; it cannot be forgotten or rejected by the system.

Partition Tolerance
The system must tolerate network partitions and still function correctly. This is a realistic and sensible requirement: network partitions occur all the time in the real world once a system is large enough.

If partitions are unavoidable, we must choose between C and A:
- Strong consistency: keep C, sacrifice A
- Eventual consistency: keep A, sacrifice C

The Tradeoff
Some modern systems choose A over C, many of them NoSQL. This makes sense: they were originally built for use cases where scalability and availability are paramount, and where strong consistency and other features traditionally provided by an RDBMS are less crucial.

Eventual Consistency
The system must always be able to take reads and writes, even when the network becomes partitioned. Consequence: we can no longer guarantee that all replicas are kept in sync, and the programmer must work with that! Replicas do (eventually) get back into sync, but there is a definite window of inconsistency.

EC: the System Perspective
Three important parameters:
- N: replication factor (how many copies of the data are stored)
- R: how many replicas are checked during a read
- W: how many replicas must acknowledge receipt of a write

Relationship Between N, R and W
If W + R > N, we can guarantee strong consistency: the read and write sets always overlap. Various tradeoffs can make sense:
- W = N, R = 1 achieves fastest reads
- W = 1, R = N achieves fastest writes
- W = Q, R = Q where Q = N / 2 + 1 ("quorum")
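The overlap argument can be sketched in a few lines of Python. This is a toy in-memory replica store (not from any real system) that writes to W random replicas and reads from R; with W + R > N, every read set must contain at least one replica holding the latest write:

```python
import random

class QuorumStore:
    """Toy replicated register: N replicas, write to W of them, read from R."""
    def __init__(self, n, w, r):
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n  # (version, value) held by each replica
        self.version = 0

    def write(self, value):
        self.version += 1
        # The write lands on W randomly chosen replicas; the rest stay stale.
        for i in random.sample(range(self.n), self.w):
            self.replicas[i] = (self.version, value)

    def read(self):
        # Read R random replicas and return the value with the highest version.
        sampled = [self.replicas[i] for i in random.sample(range(self.n), self.r)]
        return max(sampled)[1]

# With W + R > N (here 2 + 2 > 3), every read overlaps the last write set.
store = QuorumStore(n=3, w=2, r=2)
for i in range(100):
    store.write(i)
    assert store.read() == i  # never observes a stale value
```

With W = R = 1 and N = 3 the assertion would fail intermittently: the read set can miss the write set entirely, which is exactly the window of inconsistency described above.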

EC in Apache Cassandra
Cassandra is a NoSQL system: open source, initially developed by Facebook, with a column-family data model. It provides tunable consistency guarantees, so you can trade off consistency for performance (availability) as you want by setting the appropriate ConsistencyLevel.

Cassandra Writes
- ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
- ONE: Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
- TWO: Ensure that the write has been written to at least 2 replicas before responding to the client.
- THREE: Ensure that the write has been written to at least 3 replicas before responding to the client.
- QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
- LOCAL_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
- EACH_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
- ALL: Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.

Cassandra Reads
- ONE: Returns the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any inconsistencies, so subsequent calls will have correct data even if the initial read got an older value (this is called ReadRepair).
- TWO: Queries 2 replicas and returns the record with the most recent timestamp; the remaining replicas are checked in the background.
- THREE: Queries 3 replicas and returns the record with the most recent timestamp.
- QUORUM: Queries all replicas and returns the record with the most recent timestamp once at least a majority of replicas (N / 2 + 1) have reported; the remaining replicas are checked in the background.
- LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
- EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
- ALL: Queries all replicas and returns the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.
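The "most recent timestamp plus background check" behavior above can be sketched as a read coordinator that returns the freshest record and repairs stale replicas as it goes. This is a simplified illustration of the idea, not Cassandra's actual implementation (replicas here are plain dicts mapping key to a (timestamp, value) pair):

```python
def coordinator_read(replicas, key):
    """Return the value with the most recent timestamp across all replicas,
    writing it back to any replica holding an older version (read repair)."""
    freshest = max(r.get(key, (0, None)) for r in replicas)
    for r in replicas:
        if r.get(key, (0, None)) < freshest:
            r[key] = freshest  # repair the stale replica in place
    return freshest[1]

# Three replicas; the third missed the latest write (timestamp 2).
replicas = [{"k": (2, "new")}, {"k": (2, "new")}, {"k": (1, "old")}]
value = coordinator_read(replicas, "k")   # returns "new"
assert replicas[2]["k"] == (2, "new")     # stale replica has been repaired
```

After the repair, a subsequent read of the same key sees the fresh value from every replica, which is why even ConsistencyLevel.ONE converges over time.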

Amazon Dynamo
A key-value store created by Amazon in the mid-2000s; academic paper in SOSP 2007. One of the big names among early NoSQL and EC systems. Designed from the ground up to serve Amazon's unique needs. Notably, the system must always be able to take a write, e.g. adding an item to a shopping cart; otherwise sales are lost.

Dynamo Conflict Resolution
Dynamo is an EC system:
- A put() call may return to its caller before the update has been applied at all the replicas.
- A get() call may return many versions of the same object. These versions must be reconciled at some point: your shopping cart had better be consistent at checkout!

Dynamo Conflict Resolution
When to reconcile? At write time? No: that slows down writes. So versions must be reconciled at read time. Who should reconcile? The datastore? It doesn't know enough about your application's semantics and can only use simple policies like "last write wins". So conflict resolution is up to the application.
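As an illustration of application-level reconciliation (a hypothetical sketch, not Amazon's actual code): for a shopping cart, merging divergent versions by set union ensures no added item is ever lost, at the cost of occasionally resurrecting an item that was deleted in one branch:

```python
def merge_carts(versions):
    """Application-level conflict resolution for divergent cart versions:
    take the union, so no add is lost (a deleted item may reappear)."""
    merged = set()
    for cart in versions:
        merged |= cart
    return merged

# get() returned two conflicting versions written on different replicas.
v1 = {"book", "laptop"}      # one replica saw the laptop added
v2 = {"book", "headphones"}  # another saw the headphones added
merged = merge_carts([v1, v2])
assert merged == {"book", "laptop", "headphones"}
```

Only the application knows that "union" is the right merge for a cart; the datastore, lacking that knowledge, could at best keep one version and silently drop the other.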

Data Versioning
Challenge: an object may have distinct version sub-histories which the system will need to reconcile in the future. Solution: Dynamo uses vector clocks to capture causality between different versions of the same object.

Vector Clock
A vector clock is a list of (node, counter) pairs. Every version of every object is associated with one vector clock.

Vector Clock Example (figure not transcribed)

Vector Clock
Every version of every object is associated with one vector clock. If the counters on the first object's clock are less than or equal to the corresponding counters on the second object's clock, then the first version is an ancestor of the second and can be forgotten.
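The ancestry test above can be written out directly. A minimal sketch, with clocks as dicts from node name to counter (node names like "Sx" are hypothetical; Dynamo's real clocks also carry timestamps used for truncation):

```python
def descends(a, b):
    """True if clock a dominates clock b, i.e. b is an ancestor of a:
    every counter in b is <= the corresponding counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a descends from b"   # b can be forgotten
    if descends(b, a):
        return "b descends from a"   # a can be forgotten
    return "concurrent"              # conflict: application must reconcile

v1 = {"Sx": 1}
v2 = {"Sx": 2, "Sy": 1}  # extended v1 via node Sy
v3 = {"Sx": 2, "Sz": 1}  # extended the same history independently via Sz
assert compare(v2, v1) == "a descends from b"
assert compare(v2, v3) == "concurrent"
```

The "concurrent" case is exactly where a get() returns multiple versions and hands the conflict to the application.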

EC: the Client Perspective
What guarantees can your code expect?
- Strong consistency: after an update completes, any subsequent access returns the updated value.
- Weak consistency: the above is not guaranteed; there is a window of inconsistency during which older values may be seen.

Kinds of Consistency
Eventual consistency is a form of weak consistency: the system guarantees that, if no new updates are made, eventually all accesses will return the latest value. When is "eventually"? It depends on many system parameters:
- Number of replicas
- Load on the system
- Communication delays

Kinds of Eventual Consistency
- Read-your-writes consistency: if a process has written a data item and reads it later, it should see its own update.
- Session consistency: read-your-writes consistency in the (restricted) context of a session.

Kinds of Eventual Consistency
- Monotonic write consistency: writes by a single process are serialized in order.
- Monotonic read consistency: if a process has seen a value, it will never see an older value on subsequent reads.
These guarantees can be combined in various ways.
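Monotonic reads can be checked over a session's read log. A small sketch (the trace format, a list of observed version numbers, is a hypothetical choice) of the kind of check used in the SimpleDB experiments described below:

```python
def is_monotonic(reads):
    """Monotonic read consistency: each read in a session returns a version
    at least as fresh as every earlier read in the same session."""
    latest = 0
    for version in reads:
        if version < latest:
            return False  # an older value was seen after a newer one
        latest = version
    return True

assert is_monotonic([1, 1, 2, 3, 3])       # versions never go backwards
assert not is_monotonic([1, 3, 2])         # stale read after a fresh one
```

A system can provide this cheaply by pinning a session to one replica, or by having clients remember the last version seen and reject older ones.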

How Well Does This Work in Practice?
It depends on the specific system. Experimental investigation: "Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the Consumers' Perspective", H. Wada et al., CIDR 2011. Experiments on Amazon SimpleDB, which supports both consistent reads and eventually consistent reads: "Consistency across all copies of the data is usually reached within a second."

Frequency of Observing Stale Data
Experimental setup:
- A writer updates the data once every 3 seconds, for 5 minutes.
- A reader reads it 100 times/second.
- Check whether the data is stale by comparing the value seen to the most recent value written.
- Plot against time since the most recent write occurred.
Experiments carried out in Oct/Nov 2010.

Read and Write from a Single Thread
With eventually consistent reads, there is a 33% chance of reading the freshest value within 500 ms. Perhaps one master and two other replicas, with each read taking its value randomly from one of them?

Read and Write from a Single Thread
The time for eventually consistent reads to reach 99% freshness is stable at about 500 ms. There were outlier cases of stale reads after 500 ms, but no regular daily or weekly variation was observed.

Monotonic Reads
Definition: each read sees a value at least as fresh as that seen in any previous read from the same thread/session. Experiment: check for a fresh read followed by a stale one (within a period of <450 ms from the write). SimpleDB eventually consistent reads do not provide this guarantee! In terms of staleness, two successive eventually consistent reads are almost independent:

            2nd Stale   2nd Fresh
1st Stale   39.94%      21.08%
1st Fresh   23.36%      15.63%

Eventual Consistency Summary
- A tradeoff that privileges availability (the ability to execute reads and writes) over consistency.
- Not appropriate in all cases and for all applications: you need to decide whether temporary inconsistency is acceptable.
- Generally much harder to program against an EC system; unexpected effects can and do occur. See https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads for a recent article on MongoDB issues.