Self-healing Data Step by Step


Self-healing Data Step by Step Uwe Friedrichsen (codecentric AG) NoSQL matters Cologne, 29. April 2014

@ufried Uwe Friedrichsen uwe.friedrichsen@codecentric.de http://slideshare.net/ufried http://ufried.tumblr.com

Why NoSQL? Scalability Easier schema evolution Availability on unreliable OTS hardware It's more fun...

Challenges: giving up ACID transactions; (temporal) anomalies and inconsistencies, short-term due to replication or long-term due to partitioning. It might happen. Thus, it will happen!

[CAP triangle: Consistency (C), Availability (A), Partition Tolerance (P). Strict Consistency: ACID / 2PC. Strong Consistency: Quorum R&W / Paxos. Eventual Consistency: CRDT / Gossip / Hinted Handoff.]

Strict Consistency (CA) Great programming model: no anomalies or inconsistencies need to be considered. Does not scale well: best for single-node databases. We know ACID. It works! We know 2PC. It sucks! Use for moderate data amounts.

And what if I need more data? Distributed datastore: partition tolerance is a must, so we need to give up strict consistency (CP or AP).

Strong Consistency (CP) Majority-based consistency model: can tolerate up to N nodes failing out of 2N+1 nodes. Good programming model: single-copy consistency. Trades availability for consistency in case of partitioning. Paxos (for sequential consistency), quorum-based reads & writes.

Quorum-based reads & writes
N: number of nodes (replicas)
R: number of reads delivering the same result required
W: number of confirmed writes required
Rules: W > N / 2 and R + W > N

Example 1: 3 replica nodes, i.e. N = 3. W > N / 2 → W > 3 / 2 → W > 1.5. Let's pick W = 2 (can tolerate failure of 1 node). R + W > N → R + 2 > 3 → R > 1. Let's pick R = 2 (can tolerate failure of 1 node).

[Diagram slides: the client writes to Node 1, Node 2 and Node 3 and proceeds after W = 2 acknowledgements. A subsequent read of all three nodes with R = 2 returns the new value twice and the old value once (2x / 1x), so the quorum identifies the correct value. With R = 1, the read sees the new value once and the old value once (1x / 1x), which is not sufficient to determine the correct result.]

Example 2: 3 replica nodes, i.e. N = 3. W > N / 2 → W > 3 / 2 → W > 1.5. Let's pick W = 3 (strict consistency; does not tolerate any failure). R + W > N → R + 3 > 3 → R > 0. Let's pick R = 1 (a single read from any node is sufficient).

Example 3: 5 replica nodes, i.e. N = 5. W > N / 2 → W > 5 / 2 → W > 2.5. Let's pick W = 3 (can tolerate failure of 2 nodes). R + W > N → R + 3 > 5 → R > 2. Let's pick R = 3 (can tolerate failure of 2 nodes).
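The two rules can be checked mechanically. Below is a minimal Python sketch (not from the talk; the function names are illustrative assumptions) that validates an (N, R, W) configuration and replays the three examples.

```python
# Sketch: validating quorum configurations (W > N/2 and R + W > N).
# Illustrative only; names and structure are not from the talk.

def is_valid_quorum(n: int, r: int, w: int) -> bool:
    """True if the configuration yields client-perceived strong consistency."""
    return w > n / 2 and r + w > n

def max_tolerated_failures(n: int, r: int, w: int) -> int:
    """Nodes that may fail while reads and writes can still reach quorum."""
    return min(n - r, n - w)

# Example 1: N = 3, W = 2, R = 2 -> valid, tolerates 1 failed node
print(is_valid_quorum(3, 2, 2), max_tolerated_failures(3, 2, 2))   # True 1
# Example 2: N = 3, W = 3, R = 1 -> valid, but writes tolerate no failure
print(is_valid_quorum(3, 1, 3), max_tolerated_failures(3, 1, 3))   # True 0
# Example 3: N = 5, W = 3, R = 3 -> valid, tolerates 2 failed nodes
print(is_valid_quorum(5, 3, 3), max_tolerated_failures(5, 3, 3))   # True 2
```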

Limitations of quorum-based R&W: client-centric consistency (strong consistency is only perceived on the client, not on the nodes); requires additional measures on the nodes (nodes need to implement at least eventual consistency); requires a fixed set of replica nodes (dynamic adding and removal of nodes is not possible). Use if client-perceived strong consistency is sufficient and simplicity trumps formal precision.

And what if I need more availability? Need to give up strong consistency (CP) and relax the required consistency properties even more; this leads to eventual consistency (AP).

Eventual Consistency (AP) Gives up some consistency guarantees: no sequential consistency, anomalies become visible. Maximum availability possible: can tolerate up to N−1 nodes failing out of N nodes. Challenging programming model: anomalies usually need to be resolved explicitly. Gossip / Hinted Handoffs, CRDTs.

Conflict-free Replicated Data Types Eventually consistent, self-stabilizing data structures, designed for maximum availability; tolerate up to N−1 out of N nodes failing. State-based CRDT: Convergent Replicated Data Type (CvRDT). Operation-based CRDT: Commutative Replicated Data Type (CmRDT).

A bit of theory first...

Convergent Replicated Data Type State-based CRDT (CvRDT). All replicas are (usually) connected. Exchange state between replicas and calculate the new state on the target replica; state transfer at least once over eventually-reliable channels. The set of possible states forms a semilattice: a partially ordered set in which all subsets have a least upper bound (LUB). All state changes advance upwards with respect to the partial order.

Commutative Replicated Data Type Operation-based CRDT (CmRDT). All replicas are (usually) connected. Exchange update operations between replicas and apply them on the target replica; requires reliable broadcast with an ordering guarantee for non-concurrent updates. Concurrent updates must be commutative.

That's been enough theory...

Counter

Op-based Counter
Data: Integer i
Init: i ← 0
Query: return i
Operations: increment(): i ← i + 1; decrement(): i ← i − 1
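As a concrete illustration, here is a minimal Python sketch of the op-based counter (not code from the talk); it assumes some reliable broadcast delivers every operation exactly once to each replica, which is not shown.

```python
# Sketch of the op-based counter: operations are broadcast to all replicas
# and applied locally; increment/decrement commute, so the delivery order
# of concurrent operations does not matter.

class OpBasedCounter:
    def __init__(self) -> None:
        self.i = 0                      # Init: i <- 0

    def value(self) -> int:            # Query
        return self.i

    def apply(self, op: str) -> None:  # called for local and remote operations
        if op == "increment":
            self.i += 1
        elif op == "decrement":
            self.i -= 1
        else:
            raise ValueError(op)

# Two replicas receiving the same operations in different orders converge:
a, b = OpBasedCounter(), OpBasedCounter()
for op in ["increment", "increment", "decrement"]:
    a.apply(op)
for op in ["decrement", "increment", "increment"]:
    b.apply(op)
assert a.value() == b.value() == 1
```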

State-based G-Counter (grow only), naïve approach
Data: Integer i
Init: i ← 0
Query: return i
Update: increment(): i ← i + 1
Merge(j): i ← max(i, j)

[Timeline: R1 and R3 each start at i = 0 and increment to i = 1; R2 merges both states with max and ends up at i = 1, so one of the two increments is lost.]
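The lost increment is easy to reproduce; the following few lines of Python are an illustrative sketch, not code from the talk.

```python
# Naive state-based G-Counter: the state is a single integer, merge takes max.
def merge(i: int, j: int) -> int:
    return max(i, j)

r1 = r3 = 0
r1 += 1              # update on R1
r3 += 1              # concurrent update on R3

r2 = 0
r2 = merge(r2, r1)   # R2 merges R1's state -> 1
r2 = merge(r2, r3)   # R2 merges R3's state -> still 1
print(r2)            # 1, although two increments happened: one is lost
```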

State-based G-Counter (grow only), vector-based approach
Data: Integer V[]   // one element per replica set
Init: V ← [0, 0, ..., 0]
Query: return Σi V[i]
Update: increment(): V[i] ← V[i] + 1   // i is the replica set number
Merge(V′): ∀i ∈ [0, n−1]: V[i] ← max(V[i], V′[i])

[Timeline: R1 increments to V = [1, 0, 0], R3 concurrently increments to V = [0, 0, 1]; R2 merges R1's state to V = [1, 0, 0] and then R3's state to V = [1, 0, 1], so both increments are preserved and the query returns 2.]
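A minimal Python sketch of the vector-based G-Counter (illustrative; the class and method names are assumptions, not from the talk):

```python
# Sketch of the vector-based, state-based G-Counter. Each replica owns one
# slot of the vector; merge is an element-wise max, which is the least upper
# bound in the semilattice of vectors ordered element-wise.

class GCounter:
    def __init__(self, replica_id: int, n_replicas: int) -> None:
        self.replica_id = replica_id
        self.v = [0] * n_replicas       # Init: V <- [0, 0, ..., 0]

    def value(self) -> int:             # Query: sum over all slots
        return sum(self.v)

    def increment(self) -> None:        # Update: only the local slot grows
        self.v[self.replica_id] += 1

    def merge(self, other: "GCounter") -> None:
        self.v = [max(a, b) for a, b in zip(self.v, other.v)]

# Replays the timeline above: R1 and R3 increment, R2 merges both states.
r1, r2, r3 = GCounter(0, 3), GCounter(1, 3), GCounter(2, 3)
r1.increment()
r3.increment()
r2.merge(r1)
r2.merge(r3)
assert r2.v == [1, 0, 1] and r2.value() == 2
```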

State-based PN-Counter (pos./neg.) The simple vector approach of the G-Counter does not work here: it would violate the monotonicity requirement of the semilattice. Need to use two vectors: vector P to track increments, vector N to track decrements. The query result is Σi (P[i] − N[i]).

State-based PN-Counter (pos./neg.)
Data: Integer P[], N[]   // one element per replica set
Init: P ← [0, 0, ..., 0], N ← [0, 0, ..., 0]
Query: return Σi (P[i] − N[i])
Update: increment(): P[i] ← P[i] + 1; decrement(): N[i] ← N[i] + 1   // i is the replica set number
Merge(P′, N′): ∀i ∈ [0, n−1]: P[i] ← max(P[i], P′[i]); ∀i ∈ [0, n−1]: N[i] ← max(N[i], N′[i])
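Again purely as an illustration (not code from the talk), a Python sketch of the PN-Counter along the lines of the definition above:

```python
# Sketch of the state-based PN-Counter: two G-Counter vectors, P for
# increments and N for decrements; both only grow, so merge stays monotonic.

class PNCounter:
    def __init__(self, replica_id: int, n_replicas: int) -> None:
        self.replica_id = replica_id
        self.p = [0] * n_replicas
        self.n = [0] * n_replicas

    def value(self) -> int:                 # Query: sum(P) - sum(N)
        return sum(self.p) - sum(self.n)

    def increment(self) -> None:
        self.p[self.replica_id] += 1

    def decrement(self) -> None:
        self.n[self.replica_id] += 1

    def merge(self, other: "PNCounter") -> None:
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

# Two replicas, concurrent increments and a decrement, then a mutual merge:
a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment(); a.increment(); b.decrement()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 1
```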

Non-negative Counter
Problem: How to check a global invariant with local information only?
Approach 1: Only decrement if the local state is > 0. Concurrent decrements could still lead to a negative value.
Approach 2: Externalize negative values as 0. Then inc(negative value) == noop(), which violates counter semantics.
Approach 3: Local invariant: only allow a decrement if P[i] − N[i] > 0 (see the sketch below). Works, but may be too strong a limitation.
Approach 4: Synchronize. Works, but violates the assumptions and prerequisites of CRDTs.
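Approach 3 can be sketched on top of the hypothetical PNCounter class above; this is an illustration of the idea, not code from the talk.

```python
# Approach 3 as a local invariant: a replica may only decrement against its
# own increments, so P[i] - N[i] never drops below 0 on any replica and the
# global value stays non-negative, at the price of rejecting some decrements
# that would be globally safe.

class NonNegativeCounter(PNCounter):
    def decrement(self) -> None:
        i = self.replica_id
        if self.p[i] - self.n[i] <= 0:
            raise ValueError("local budget exhausted, decrement rejected")
        self.n[i] += 1
```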

Sets

Op-based Set (naïve approach)
Data: Set S
Init: S ← {}
Query(e): return e ∈ S
Operations: add(e): S ← S ∪ {e}; remove(e): S ← S \ {e}

[Timeline: R1 applies add(e), rmv(e), add(e) and ends with S = {e}; R2 receives the same operations in a different order (add(e), add(e), rmv(e)) and ends with S = {}; R3 ends with S = {e}. Concurrent add and remove of the same element do not commute, so the replicas diverge.]
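The divergence can be reproduced directly; the following Python lines are an illustrative sketch, not code from the talk.

```python
# Naive op-based set: add and remove of the same element do not commute,
# so replicas that receive concurrent operations in different orders diverge.

def apply(s: set, op: str, e: str) -> None:
    if op == "add":
        s.add(e)
    else:                  # "remove"
        s.discard(e)

r1, r2 = set(), set()
ops_r1 = [("add", "e"), ("remove", "e"), ("add", "e")]   # order seen by R1
ops_r2 = [("add", "e"), ("add", "e"), ("remove", "e")]   # order seen by R2

for op, e in ops_r1:
    apply(r1, op, e)
for op, e in ops_r2:
    apply(r2, op, e)

print(r1, r2)   # {'e'} vs. set() -- the replicas have diverged
```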

State-based G-Set (grow only)
Data: Set S
Init: S ← {}
Query(e): return e ∈ S
Update: add(e): S ← S ∪ {e}
Merge(S′): S ← S ∪ S′
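For completeness, a minimal Python sketch of the G-Set (illustrative, not from the talk):

```python
# Sketch of the state-based G-Set: an add-only set whose merge is set union,
# the least upper bound under subset ordering, so merges applied in any order
# and any number of times converge to the same state.

class GSet:
    def __init__(self) -> None:
        self.s: set = set()

    def add(self, e) -> None:           # Update
        self.s.add(e)

    def contains(self, e) -> bool:      # Query(e)
        return e in self.s

    def merge(self, other: "GSet") -> None:
        self.s |= other.s
```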

State-based 2P-Set (two-phase)
Data: Set A, R   // A: added, R: removed
Init: A ← {}, R ← {}
Query(e): return e ∈ A ∧ e ∉ R
Update: add(e): A ← A ∪ {e}; remove(e): (pre: query(e)) R ← R ∪ {e}
Merge(A′, R′): A ← A ∪ A′, R ← R ∪ R′
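A Python sketch of the 2P-Set along the lines of the definition above (illustrative; the names are assumptions):

```python
# Sketch of the state-based 2P-Set: a G-Set of added elements plus a G-Set
# of removed elements ("tombstones"). An element that has been removed once
# can never be added again.

class TwoPhaseSet:
    def __init__(self) -> None:
        self.a: set = set()     # A: added
        self.r: set = set()     # R: removed

    def contains(self, e) -> bool:          # Query(e): e in A and e not in R
        return e in self.a and e not in self.r

    def add(self, e) -> None:
        self.a.add(e)

    def remove(self, e) -> None:
        if self.contains(e):                # precondition: query(e)
            self.r.add(e)

    def merge(self, other: "TwoPhaseSet") -> None:
        self.a |= other.a
        self.r |= other.r
```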

Op-based OR-Set (observed-remove)
Data: Set S   // set of pairs { (element e, unique tag u), ... }
Init: S ← {}
Query(e): return ∃u : (e, u) ∈ S
Operations:
add(e): S ← S ∪ { (e, u) }   // u is a newly generated unique tag
remove(e): pre query(e); at source ("prepare"): R ← { (e, u) | (e, u) ∈ S }; downstream ("execute"): S ← S \ R

[Timeline: R1's add(e) creates the tagged element e_a; its rmv(e) removes exactly the observed tag, so R1 goes back to S = {}. R3's concurrent add(e) creates e_b. Once all operations have been delivered, every replica converges to S = {e_b}: the remove only affects the adds it has observed, and the concurrent add survives.]
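A Python sketch of the OR-Set with explicit prepare/effect phases, followed by a replay of the timeline above (illustrative, not code from the talk; using uuid values for the unique tags is an assumption):

```python
# Sketch of the op-based OR-Set: every add creates an (element, unique tag)
# pair; a remove only deletes the pairs it has observed locally, so an add
# that is concurrent with a remove survives ("add wins").

import uuid

class ORSet:
    def __init__(self) -> None:
        self.s: set = set()                      # pairs (element, tag)

    def contains(self, e) -> bool:               # Query(e)
        return any(elem == e for elem, _ in self.s)

    def prepare_add(self, e):                    # at source: fresh unique tag
        return (e, uuid.uuid4().hex)

    def effect_add(self, pair) -> None:          # downstream, on every replica
        self.s.add(pair)

    def prepare_remove(self, e):                 # at source: observed pairs
        return {(elem, tag) for elem, tag in self.s if elem == e}

    def effect_remove(self, observed) -> None:   # downstream
        self.s -= observed

# Mirrors the timeline above: concurrent add(e) on R3 and rmv(e) on R1.
r1, r3 = ORSet(), ORSet()
pair_a = r1.prepare_add("e"); r1.effect_add(pair_a)   # R1 adds e_a
pair_b = r3.prepare_add("e"); r3.effect_add(pair_b)   # R3 adds e_b concurrently
observed = r1.prepare_remove("e")                     # R1 has only observed e_a
r1.effect_remove(observed)
r3.effect_remove(observed)                            # no effect on e_b
r1.effect_add(pair_b)                                 # R3's add arrives at R1
assert r1.s == r3.s == {pair_b}                       # both converge to {e_b}
```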

More data types: Register, Dictionary (Map), Tree, Graph, Array, List, plus more representations for each data type.

Garbage collection Sets could grow infinitely in the worst case. Garbage collection is possible, but a bit tricky: only remove updates that can be deleted safely and that were received by all replicas; usually implemented using vector clocks. It can tolerate up to N−1 crashes, but is live only while no replicas are crashed, and can sometimes induce surprising behavior. Sometimes stronger consensus is needed (Paxos, ...).

Limitations of CRDTs Very weak consistency guarantees: strives for quiescent consistency, i.e. eventually consistent. Not suitable for a high-volume ID generator or the like. Not easy to understand and model. Not all data structures are representable. Use if availability is extremely important.

Further reading
1. Shapiro et al., "Conflict-free Replicated Data Types", Inria Research Report, 2011
2. Shapiro et al., "A comprehensive study of Convergent and Commutative Replicated Data Types", Inria Research Report, 2011
3. Basho Tag Archives: CRDT, https://basho.com/tag/crdt/
4. Leslie Lamport, "Time, clocks, and the ordering of events in a distributed system", Communications of the ACM, 1978

Wrap-up CAP requires rethinking consistency. Strict Consistency: ACID / 2PC. Strong Consistency: quorum-based R&W, Paxos. Eventual Consistency: CRDT, Gossip, Hinted Handoffs. Pick your consistency model based on your consistency and availability requirements.

The real world is not ACID. Thus, it is perfectly fine to go for a relaxed consistency model.

@ufried Uwe Friedrichsen uwe.friedrichsen@codecentric.de http://slideshare.net/ufried http://ufried.tumblr.com