What Came First? The Ordering of Events in Systems @kavya719

kavya

the design of concurrent systems

Slack architecture on AWS

concurrent actors: systems with multiple independent actors. threads in a multithreaded program; nodes in a distributed system.

threads user-space or system threads

// One thread processing a shared task queue.
var tasks []Task

func main() {
    for {
        if len(tasks) > 0 {            // R
            task := dequeue(tasks)     // R, W
            process(task)
        }
    }
}

multiple threads:

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {           // R
        task := dequeue(tasks)     // R, W
        process(task)
    }
}

func main() {
    // Spawn fixed-pool of worker threads.
    startworkers(3, worker)

    // Populate task queue.
    for _, t := range hellatasks {
        tasks = append(tasks, t)   // R, W
    }
}

data race: when two or more threads concurrently access a shared memory location, and at least one access is a write.

many threads provide concurrency, but may introduce data races.

nodes: processes, i.e. logical nodes (though the term can also refer to machines, i.e. physical nodes). They communicate by message-passing, i.e. are connected by an unreliable network and share no memory; each node is internally sequential; there is no global clock.

distributed key-value store: three nodes, one master and two replicas. Starting from cart: [], user X ADDs "apple crepe" and user Y ADDs "blueberry crepe"; both writes go through the master (M) to the replicas (R), so the cart converges to [apple crepe, blueberry crepe].

distributed key-value store: three nodes with three equal replicas, read_quorum = write_quorum = 1, eventually consistent. Starting from cart: [], user X's ADD "apple crepe" lands on N1 (cart: [apple crepe]) while user Y's ADD "blueberry crepe" lands on N3 (cart: [blueberry crepe]); the replicas diverge.

multiple nodes accepting writes provide availability, but may introduce conflicts.

given that we want concurrent systems, we need to deal with data races and conflict resolution.

riak: distributed key-value store
channels: Go concurrency primitive
stepping back: similarities, meta-lessons

riak a distributed datastore

riak: distributed key-value database.
// A data item = <key: blob>
{ "uuid1234": { "name": "ada" } }
v1.0 released in 2011. Based on Amazon's Dynamo.
Eventually consistent: uses optimistic replication, i.e. replicas can temporarily diverge but will eventually converge.
Highly available: data partitioned and replicated, decentralized, sloppy quorum.
An AP system, in CAP-theorem terms.

Two cases for the replicas:
conflict resolution: starting from cart: [], ADD "apple crepe" lands on N1 (cart: [apple crepe]) while ADD "blueberry crepe" lands on N3 (cart: [blueberry crepe]); the writes conflict and must be resolved.
causal updates: starting from cart: [apple crepe] on N1, UPDATE to "date crepe" produces cart: [date crepe]; the update causally follows the earlier write.

how do we determine causal vs. concurrent updates?

Four events across N1, N2, N3 (A: apple, B: blueberry, D: date):
A: user X PUTs { cart: [A] } on N1
B: user Y PUTs { cart: [B] } on N3
C: user X GETs the cart, reading { cart: [A] }, on N2
D: user X PUTs { cart: [D] } on N2
Which of these events are concurrent?

A, C: not concurrent; same sequential actor.

C, D: not concurrent; a fetch/update pair.

happens-before: orders events across actors (threads or nodes). Formulated in Lamport's "Time, Clocks, and the Ordering of Events" paper in 1978. Establishes causality and concurrency.
X → Y IF one of:
  same actor
  synchronization pair
  X → E → Y (transitivity)
IF X does not happen-before Y and Y does not happen-before X: concurrent!

causality and concurrency:
A → C (same actor)
C → D (synchronization pair)
So, A → D (transitivity)

but B does not happen-before D, and D does not happen-before B. So, B and D are concurrent!

A → D: D should update (supersede) A.
B, D concurrent: B and D need conflict resolution.

how do we implement happens-before?

vector clocks: a means to establish happens-before edges.
Each node keeps a vector of counters, one entry per node, all starting at [0, 0, 0].
On a local event, a node increments its own entry: n1's events take it to [1, 0, 0] and then [2, 0, 0]; n2's event takes it to [0, 1, 0]; n3's event takes it to [0, 0, 1].
When clocks meet (e.g. n1's clock reaches n2), they are merged element-wise with max: max([2, 0, 0], [0, 1, 0]) = [2, 1, 0].
happens-before comparison: X → Y iff VC(X) < VC(Y), compared element-wise.

Back to the cart example: VC at A = [1, 0, 0], VC at D = [2, 1, 0]. [1, 0, 0] < [2, 1, 0], so A → D.

VC at B = [0, 0, 1], VC at D = [2, 1, 0]. Neither vector is less than the other, so B and D are concurrent.
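A minimal sketch of these rules in Go; the type and function names (VectorClock, HappensBefore) are made up for illustration and this is not Riak's implementation. The merge mirrors the element-wise max above, and the comparison is the element-wise "less than" used to decide happens-before.

package main

import "fmt"

// Sketch only: an illustrative vector clock.
type VectorClock []int // one counter per node

// Tick records a local event on node i.
func (vc VectorClock) Tick(i int) { vc[i]++ }

// Merge folds another clock in by element-wise max.
func (vc VectorClock) Merge(other VectorClock) {
    for i, v := range other {
        if v > vc[i] {
            vc[i] = v
        }
    }
}

// HappensBefore reports x → y: x <= y element-wise, and x != y.
func HappensBefore(x, y VectorClock) bool {
    strictlyLess := false
    for i := range x {
        if x[i] > y[i] {
            return false
        }
        if x[i] < y[i] {
            strictlyLess = true
        }
    }
    return strictlyLess
}

func main() {
    a := VectorClock{0, 0, 0}
    a.Tick(0) // A: local event on n1 -> [1, 0, 0]

    b := VectorClock{0, 0, 1} // VC at B (event on n3)

    d := VectorClock{0, 1, 0}     // n2's clock
    d.Merge(VectorClock{2, 0, 0}) // merge in n1's clock -> [2, 1, 0], the VC at D

    fmt.Println(HappensBefore(a, d))                      // true:  A → D
    fmt.Println(HappensBefore(b, d), HappensBefore(d, b)) // false false: B, D concurrent
}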

causality tracking in riak: Riak stores a vector clock with each version of the data (in a more precise form, the dotted version vector). GET and PUT operations on a key pass around a causal context object that contains the vector clocks. Riak is therefore able to detect conflicts.

what about resolving those conflicts?

conflict resolution in riak: behavior is configurable. Assuming vector clock analysis is enabled:
> last-write-wins, i.e. the version with the higher timestamp is picked.
> merge, iff the underlying data type is a CRDT.
> return conflicting versions to the application: Riak stores "siblings", i.e. the conflicting versions, and returns them to the application for resolution.

return conflicting versions to application: Riak stores both versions,
B: { cart: [blueberry crepe] } with vector clock [0, 0, 1], and
D: { cart: [date crepe] } with vector clock [2, 1, 0].
The next operation returns both to the application; the application must resolve the conflict, e.g. to { cart: [blueberry crepe, date crepe] }, and write that back, which creates a causal update with vector clock [2, 1, 1].
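A hedged sketch of what that application-side resolution can look like in Go; the Cart type and mergeCarts function are made up for illustration and are not a real Riak client API. The siblings are merged by set union, and the merged value would then be written back under the causal context it was read with, producing the [2, 1, 1] version above.

package main

import "fmt"

// Sketch only: illustrative types, not a Riak client library.
type Cart []string

// mergeCarts resolves sibling carts by set union.
func mergeCarts(siblings []Cart) Cart {
    seen := map[string]bool{}
    var merged Cart
    for _, s := range siblings {
        for _, item := range s {
            if !seen[item] {
                seen[item] = true
                merged = append(merged, item)
            }
        }
    }
    return merged
}

func main() {
    b := Cart{"blueberry crepe"}
    d := Cart{"date crepe"}
    fmt.Println(mergeCarts([]Cart{b, d})) // [blueberry crepe date crepe]
}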

what about resolving those conflicts? Riak doesn't, by default. Instead, it exposes the happens-before graph to the application for conflict resolution.

riak: uses vector clocks to track causality and conflicts. exposes happens-before graph to the user for conflict resolution.

channels Go concurrency primitive

multiple threads (recap): the shared var tasks []Task, a fixed pool of worker threads dequeuing from it, and main appending to it; a data race: two or more threads concurrently access a shared memory location, and at least one access is a write.

memory model: specifies when one event happens before another.
e.g. X → Y for X: x = 1 and Y: print(x)
X → Y IF one of:
  same thread
  synchronization pair: unlock/lock on a mutex, send/recv on a channel, spawn/first event of a thread, etc.
  X → E → Y (transitivity)
IF X does not happen-before Y and Y does not happen-before X: concurrent!
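As a concrete illustration of a synchronization pair, here is essentially the mutex example from the Go memory model document (the names l, a, f come from that example): the Unlock in f happens-before the second Lock in main returns, so main is guaranteed to observe the write to a.

package main

import "sync"

var l sync.Mutex
var a string

func f() {
    a = "hello, world" // write
    l.Unlock()         // unlock: one half of the synchronization pair
}

func main() {
    l.Lock()
    go f()
    l.Lock() // blocks until f's Unlock; the unlock/lock pair orders the write before the read
    print(a) // guaranteed to print "hello, world"
}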

goroutines: the unit of concurrent execution in Go; user-space threads. Use them as you would threads:
> go handle_request(r)
The Go memory model is specified in terms of goroutines. Within a goroutine, reads and writes are ordered. With multiple goroutines, shared data must be synchronized, else data races!

synchronization: the synchronization primitives are:
mutexes, condition variables:
> import "sync"
> mu.Lock()
atomics:
> import "sync/atomic"
> atomic.AddUint64(&myint, 1)
channels

channels: "Do not communicate by sharing memory; instead, share memory by communicating."
A standard type in Go: chan, safe for concurrent use. A mechanism for goroutines to communicate, and to synchronize. Conceptually similar to Unix pipes:
> ch := make(chan int)     // Initialize.
> go func() { ch <- 1 }()  // Send.
> <-ch                     // Receive; blocks until a send occurs.

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {
        task := dequeue(tasks)
        process(task)
    }
}

func main() {
    // Spawn fixed-pool of workers.
    startworkers(3, worker)

    // Populate task queue.
    for _, t := range hellatasks {
        tasks = append(tasks, t)
    }
}

worker: get a task, process it, repeat.
main: want to give tasks to the workers.

var taskch = make(chan Task, n)
var resultch = make(chan Result)

func worker() {
    for {
        // Get a task.
        t := <-taskch
        r := process(t)
        // Send the result.
        resultch <- r
    }
}

func main() {
    // Spawn fixed-pool of workers.
    startworkers(3, worker)

    // Populate task queue.
    for _, t := range hellatasks {
        taskch <- t
    }

    // Wait for and amalgamate results.
    var results []Result
    for r := range resultch {
        results = append(results, r)
    }
}

mutex? Guard the shared tasks slice with a mutex mu: lock around the len/dequeue in worker, and around the append in main.

// Shared variable
var tasks []Task

func worker() {
    for len(tasks) > 0 {           // guarded by mu
        task := dequeue(tasks)     // guarded by mu
        process(task)
    }
}

func main() {
    // Spawn fixed-pool of workers.
    startworkers(3, worker)

    // Populate task queue.
    for _, t := range hellatasks {
        tasks = append(tasks, t)   // guarded by mu
    }
}

but workers can exit early.

want: channel semantics (as used):
main: send tasks.
worker: wait for a task, process it, repeat.
We want main's "send task" to happen before the worker receives and processes that task. Channels allow us to express happens-before constraints.

channels: allow, and force, the user to express happens-before constraints.

stepping back

similarities: both surface happens-before to the user.
riak: distributed key-value store.
channels: Go concurrency primitive.
first principle: happens-before.

meta-lessons

new technologies cleverly decompose into old ideas

the right boundaries for abstractions are flexible.

happens-before riak channels @kavya719 https://speakerdeck.com/kavya719/what-came-first

riak: a note (or two)
nodes in Riak:
> virtual nodes ("vnodes")
> key-space partitioning by consistent hashing, 1 vnode per partition.
> sequential, because they are Erlang processes that use message queues.
replicas:
> N, R, W, etc. configurable per key.
> on network partition, defaults to sloppy quorum with hinted handoff.
conflict resolution:
> by read-repair, active anti-entropy.

riak: dotted version vectors
Problem with standard vector clocks: false concurrency.
userX: PUT "cart": A, ctx {}           -> (1, 0); [A]
userY: PUT "cart": B, ctx {}           -> (2, 0); [A, B]
userX: PUT "cart": C, ctx {(1, 0); A}  -> (1, 0) !< (2, 0)  -> (3, 0); [A, B, C]
This is false concurrency; it leads to sibling explosion.
dotted version vectors: a fine-grained mechanism to detect causal updates. Decompose each vector clock into its set of discrete events, so:
userX: PUT "cart": A, ctx {}  -> (1, 0); [A]
userY: PUT "cart": B, ctx {}  -> (2, 0); [(1, 0) -> A, (2, 0) -> B]
userX: PUT "cart": C, ctx {}  -> (3, 0); [(2, 0) -> B, (3, 0) -> C]

riak: CRDTs
Conflict-free / Convergent / Commutative Replicated Data Types:
> data structures with the property that replicas can be updated concurrently without coordination, and it's mathematically possible to always resolve conflicts.
> two types: op-based (commutative) and state-based (convergent).
> examples: G-Set (Grow-Only Set), G-Counter, PN-Counter.
> Riak DT uses state-based CRDTs.
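A minimal sketch of a state-based CRDT in Go, a G-Counter; the GCounter type here is illustrative only, not Riak DT's implementation. Each replica increments only its own slot, and merge is an element-wise max (commutative, associative, idempotent), so concurrently updated replicas always converge without coordination.

package main

import "fmt"

// Sketch only: a state-based G-Counter (grow-only counter) CRDT.
type GCounter struct {
    id     int   // this replica's index
    counts []int // one slot per replica
}

func NewGCounter(id, replicas int) *GCounter {
    return &GCounter{id: id, counts: make([]int, replicas)}
}

// Inc records an increment at this replica.
func (g *GCounter) Inc() { g.counts[g.id]++ }

// Merge folds in another replica's state by element-wise max,
// so replicas converge regardless of merge order.
func (g *GCounter) Merge(other *GCounter) {
    for i, v := range other.counts {
        if v > g.counts[i] {
            g.counts[i] = v
        }
    }
}

// Value is the counter's total across replicas.
func (g *GCounter) Value() int {
    total := 0
    for _, v := range g.counts {
        total += v
    }
    return total
}

func main() {
    a, b := NewGCounter(0, 2), NewGCounter(1, 2)
    a.Inc()
    b.Inc()
    b.Inc()
    a.Merge(b)
    b.Merge(a)
    fmt.Println(a.Value(), b.Value()) // 3 3: both replicas converge
}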

channels: implementation
ch := make(chan int, 3)
A channel is an hchan struct containing: buf (a fixed-size ring buffer), sendq (waiting senders), recvq (waiting receivers), and lock (a mutex).

g1 sends t1, t2, t3: buf fills up (sendq and recvq are empty). A fourth send, ch <- t4, blocks: g1 is parked on sendq.
When g2 receives (<-ch), it takes the first element from buf; the blocked sender's element is enqueued, and g1 is removed from sendq and resumed. Conversely, a receive on an empty channel parks the receiver on recvq until a send occurs.
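To make the hchan picture concrete, here is a heavily simplified sketch of a buffered channel written in Go itself; the toyChan type is made up, and condition variables stand in for the runtime's sendq/recvq goroutine parking. This is not how runtime/chan.go is implemented, it only mirrors the blocking behavior described above.

package main

import (
    "fmt"
    "sync"
)

// Sketch only: a toy buffered channel of ints.
type toyChan struct {
    mu       sync.Mutex
    notFull  *sync.Cond // "sendq": senders wait here when buf is full
    notEmpty *sync.Cond // "recvq": receivers wait here when buf is empty
    buf      []int      // fixed-size ring buffer
    head, n  int        // start index and count of buffered elements
}

func newToyChan(capacity int) *toyChan {
    c := &toyChan{buf: make([]int, capacity)}
    c.notFull = sync.NewCond(&c.mu)
    c.notEmpty = sync.NewCond(&c.mu)
    return c
}

// send blocks while the buffer is full, then enqueues v.
func (c *toyChan) send(v int) {
    c.mu.Lock()
    for c.n == len(c.buf) {
        c.notFull.Wait()
    }
    c.buf[(c.head+c.n)%len(c.buf)] = v
    c.n++
    c.notEmpty.Signal() // wake a waiting receiver
    c.mu.Unlock()
}

// recv blocks while the buffer is empty, then dequeues the oldest element (FIFO).
func (c *toyChan) recv() int {
    c.mu.Lock()
    for c.n == 0 {
        c.notEmpty.Wait()
    }
    v := c.buf[c.head]
    c.head = (c.head + 1) % len(c.buf)
    c.n--
    c.notFull.Signal() // wake a waiting sender
    c.mu.Unlock()
    return v
}

func main() {
    ch := newToyChan(3)
    go func() {
        for i := 1; i <= 4; i++ {
            ch.send(i) // the 4th send blocks until a recv happens
        }
    }()
    for i := 0; i < 4; i++ {
        fmt.Println(ch.recv())
    }
}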

1. A send happens-before the corresponding receive.

// Shared variable
var count = 0
var ch = make(chan bool, 1)

func setcount() {
    count++     // A: write (g1)
    ch <- true  // B: send (g1)
}

func printcount() {
    <-ch          // C: receive (g2)
    print(count)  // D: read (g2)
}

go setcount()
go printcount()

A → B and C → D (same goroutine), B → C (send happens-before receive). So, A → D.

2. The nth receive on a channel of size C happens-before the (n+C)th send completes.

var maxoutstanding = 3
var taskch = make(chan int, maxoutstanding)

func worker() {
    for {
        t := <-taskch
        processandstore(t)
    }
}

func main() {
    go worker()

    tasks := generatehellatasks()
    for _, t := range tasks {
        taskch <- t
    }
}

1. A send happens-before the corresponding receive.
If the channel is empty: the receiver goroutine is paused, and resumed after a channel send occurs.
If the channel is not empty: the receiver gets the first unreceived element, i.e. the buffer is a FIFO queue; that send must already have completed, due to the channel mutex.

2. The nth receive on a channel of size C happens-before the (n+C)th send completes.
With C = 3: the 2nd receive happens-before the 5th send.
Sends #1 through #3 can occur right away; send #4 can occur only after receive #1; send #5 can occur only after receive #2.
Fixed-size, circular buffer.

2. The nth receive on a channel of size C happens-before the (n+C)th send completes.
If the channel is full: the sender goroutine is paused, and resumed after a channel receive occurs.
If the channel is not empty: the receiver gets the first unreceived element, i.e. the buffer is a FIFO queue; the send of that element must have completed, due to the channel mutex.