Consistency in Distributed Storage Systems. Mihir Nanavati, March 4th, 2016


Today Overview of distributed storage systems CAP Theorem

About Me Virtualization/Containers, CPU microarchitectures/caches, Network Stacks Last year and a half: high-speed storage systems (Last 24 hours: PowerPoint) Don't believe everything I say: I've built (at most) one more distributed system than you!

On Systems Point-in-time designs Balance tradeoffs What is old is new again?

Why Distributed Storage Systems? Performance Capacity Reliability

Architecture FS Frontend Object Backend Client Load Balancer FS Frontend Object Backend FS Frontend Object Backend

Architecture FS Frontend Object Backend Client NFS Write Load Balancer FS Frontend Object Backend FS Frontend Object Backend

Architecture What does this file map to? How many replicas does it have? Where are the replicas located? FS Frontend Object Backend Which replica is the leader/master? What sort of replication scheme? Client NFS Write Load Balancer FS Frontend Object Backend FS Frontend Object Backend

Architecture FS Frontend Object Backend Client NFS Write Load Balancer FS Frontend Object Backend FS Frontend Object Backend

Architecture Subtleties with in-parallel writes! FS Frontend Object Backend Replica Write Client NFS Write Load Balancer FS Frontend Object Backend Local Write Replica Write FS Frontend Object Backend

Architecture FS Frontend Object Backend Ack Client NFS Write Load Balancer FS Frontend Object Backend Ack Ack FS Frontend Object Backend

Architecture FS Frontend Object Backend Ack Client Ack Load Balancer FS Frontend Object Backend Ack Ack FS Frontend Object Backend
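
The write path in the slides above can be sketched in a few lines. This is a toy in-memory model, not any particular system's code; the dictionaries stand in for the metadata service and the per-server object backends. The frontend writes locally, fans the write out to the other replicas, and acknowledges the client only once every replica has acked.

REPLICAS = {"k1": ["A", "B", "C"]}        # metadata: which servers hold each key
STORES = {"A": {}, "B": {}, "C": {}}      # one in-memory store per server

def replica_write(server, key, value):
    # A replica applies the write locally and acknowledges.
    STORES[server][key] = value
    return "ACK"

def handle_client_write(frontend, key, value):
    # Frontend writes locally, fans the write out to the other replicas,
    # and acknowledges the client only once every replica has acked.
    acks = [replica_write(frontend, key, value)]
    acks += [replica_write(s, key, value) for s in REPLICAS[key] if s != frontend]
    return "ACK" if all(a == "ACK" for a in acks) else "FAIL"

print(handle_client_write("A", "k1", "hello"))   # -> ACK after all three copies are written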

Metadata What does this file map to? How many replicas does it have? Where are the replicas located? Which replica is the leader/master? What sort of replication scheme?

Location of Replicas Directory Service [Petal, GFS] Consistent Hashing [Chord, Redis] Clients vs Proxy?

Directory Services vs Hashing Fine grained placement Locality Lookup on reads/writes Update on data moves Statistical load balancing Lookups are free Membership list of nodes Update on node churn

Directory Services vs Hashing Fine grained placement Both systems require metadata Statistical load balancing Locality Lookup on reads/writes Update on data moves Lookups are free Membership list of nodes Update on churn
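
To make the right-hand column concrete, here is a minimal consistent-hashing lookup in the spirit of the Chord/Redis approach above (a sketch with illustrative names, not either system's implementation). Nodes and keys hash onto a ring; a key is owned by the first node clockwise from its hash, so the only metadata needed is the membership list.

import hashlib
from bisect import bisect_right

def ring_hash(s):
    # Hash a string onto a 32-bit ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

def lookup(key, nodes, n_replicas=3):
    # The key is owned by the first node clockwise from its hash; further
    # replicas follow around the ring. Only the membership list is needed.
    ring = sorted((ring_hash(n), n) for n in nodes)
    start = bisect_right([h for h, _ in ring], ring_hash(key))
    return [ring[(start + i) % len(ring)][1] for i in range(min(n_replicas, len(nodes)))]

nodes = ["server-a", "server-b", "server-c", "server-d"]
print(lookup("k1", nodes))    # placement is a pure function of the key and the membership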

Data and Metadata Can't avoid having metadata Metadata is also data! Different consistency and availability requirements!

Consistency and Availability Consistency: Is there a single true value for every object in the system? Does a read always return the result of the most recent write, i.e., is stale data ever visible? Availability: Is data accessible in the face of failures?

CAP Theorem [Brewer] Pick any two of: Sequential Consistency Availability Partition Tolerance

C in CAP vs C in ACID ACID vs BASE Different definitions of consistency! CAP's C(onsistency) = ACID's A(tomicity) = Visibility to all future operations ACID's C(onsistency) = Does data follow all schema constraints?

Sequential Consistency if client a executes operations {a1, a2, a3,...} and client b executes operations {b1, b2, b3,...}, then you could create some serialized version {a1, b1, b2, a2,...} (or whatever) executed by the clients

Sequential Consistency For a single piece of data [Lamport]: Appears like a single copy to an outside observer Strict ordering of operations from the same client Some arbitrary (single) ordering of operations across clients Sometimes referred to as linearizable [Herlihy and Wing]
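
One way to see the per-client requirement from the previous slide: a global order is admissible only if it preserves each client's program order. The sketch below checks just that requirement (a full sequential-consistency check would also verify that reads observe a single-copy ordering, which is omitted here).

def respects_program_order(global_order, per_client_ops):
    # Each client's operations must appear in the global order in the same
    # order that client issued them.
    for ops in per_client_ops.values():
        positions = [global_order.index(op) for op in ops]
        if positions != sorted(positions):
            return False
    return True

clients = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2", "b3"]}
print(respects_program_order(["a1", "b1", "b2", "a2", "a3", "b3"], clients))  # True
print(respects_program_order(["a2", "b1", "a1", "b2", "a3", "b3"], clients))  # False: a2 before a1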

Defeating CAP Setup 3 servers, 2 clients 3-way in-memory replication Each piece of data has a single master Clients send requests to the master if available, else to any other node

Defeating CAP master(k1) Server A Server B Server C write(k1, A ) Client A Client B

Defeating CAP master(k1) write(k1, A ) Server A Server B Server C write(k1, A ) write(k1) Client A Client B

Defeating CAP master(k1) Server A Server B Server C read(k1) Client A Client B

Defeating CAP master(k1) Server A Server B Server C A Client A Client B

Defeating CAP master(k1) Server A Server B Server C Client A Client B

Defeating CAP master(k1) Server A Server B Server C write(k1, B ) read(k1) Client A Client B

Defeating CAP master(k1) What to do next? Disallow write: CP system Allow write: AP system Server A Server B Server C write(k1, B ) read(k1) Client A Client B
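
The fork on this slide can be written down directly. Here is a toy master-side write handler, with illustrative names (not from any real system): in CP mode it refuses the write when a replica is unreachable, while in AP mode it accepts the write on the replicas it can reach, allowing divergence and stale reads until the partition heals.

class PartitionError(Exception):
    pass

def handle_write(key, value, reachable_replicas, total_replicas, mode="CP"):
    # Master-side write handler while some replicas are unreachable.
    if len(reachable_replicas) < total_replicas:
        if mode == "CP":
            raise PartitionError("replica unreachable: refusing the write to stay consistent")
        # AP: accept the write on the reachable replicas; stale reads become possible.
    for store in reachable_replicas:
        store[key] = value
    return "ACK"

a, b = {}, {}   # only two of three replicas are reachable during the partition
print(handle_write("k1", "B", [a, b], total_replicas=3, mode="AP"))  # ACK, but replicas diverge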

Defeating CAP Setup 3 servers, 2 clients 3-way in-memory replication Each piece of data has a single master Clients only send requests to masters

Defeating CAP master(k1) Server A Server B Server C write(k1, B ) read(k1) Client A Client B

Defeating CAP Client B gets no service! master(k1) Server A Server B Server C write(k1, B ) read(k1) Client A Client B

Defeating CAP Defeating CAP is rather hard! Formal proof of impossibility [Gilbert and Lynch]

Strong vs Eventual Consistency Extreme ends of the spectrum Active research: Causal, Causal+, ... NoSQL systems give up consistency: schema-less, no multi-key transactions, etc.

CAP: Pick any Two? Can't drop P: CP vs AP systems Like a prix fixe menu! Not all systems are strictly CP or AP!

Dissenters [Michael Stonebraker, VoltDB] In my experience, network partitions do not happen often. Specifically, they occur less frequently than the sum of bohrbugs [deterministic DB crashes], application errors, human errors and reprovisioning events. So it doesn't much matter what you do when confronted with network partitions. Surviving them will not move the needle on availability because higher frequency events will cause global outages. Hence, you are giving up something (consistency) and getting nothing in return.

Network Reliability [Gill, et al.] Intra-datacenter links: Median downtime of 10 minutes/year Inter-datacenter (WAN) links: 24-72 minutes per year Is a node failure a network partition?

CAP: What to Pick? No universally true guidelines Depends on application constraints Is the client within our control?

Consistency in Memory vs Non-volatile Storage Post-crash semantics

Consistency in Durable Storage Subtleties with in-parallel writes! FS Frontend Object Backend Replica Write Client NFS Write Load Balancer FS Frontend Object Backend Local Write Replica Write FS Frontend Object Backend

Consistency in Durable Storage In-place writes: Risk corruption Overwrite-allocate: Metadata updates Concurrency?
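
As a single-node illustration of the overwrite-allocate idea (hypothetical names, not the object backend's actual code): write the new version to a fresh location, make it durable, then atomically flip the pointer to it, so a crash exposes either the old value or the new one, never a torn, partially overwritten one. The price is exactly the extra metadata update mentioned above, and this sketch says nothing about the concurrency question.

import os, tempfile

def overwrite_allocate(path, data):
    # Write the new version to a fresh location, make it durable, then
    # atomically repoint the directory entry at it. A crash leaves either
    # the old or the new version, never a torn, half-overwritten one.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())       # the data is durable before the pointer flips
    os.replace(tmp, path)          # the atomic metadata update

overwrite_allocate("k1.obj", b"new value")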

Chain Replication [van Renesse and Schneider] Sequential writes No data corruption Writes go to the head of a chain; reads go to the tail

Chain Replication head(k1) tail(k1) Server A Server B Server C write(k1, A ) Client A

Chain Replication head(k1) tail(k1) write(k1, A ) Server A Server B Server C write(k1, A ) Client A

Chain Replication head(k1) tail(k1) Server A write(k1, A ) Server B write(k1, A ) Server C write(k1, A ) Client A

Chain Replication head(k1) tail(k1) Server A write(k1, A ) Server B write(k1, A ) Server C write(k1, A ) ACK Client A

Chain Replication head(k1) tail(k1) Server A Server B Server C read(k1) Client A

Chain Replication head(k1) tail(k1) Server A Server B Server C A Client A

Chain Replication head(k1) tail(k1) Server A Server B Server C write(k1, B ) read(k1) Client A

Chain Replication head(k1) tail(k1) write(k1, A ) Server A Server B Server C write(k1, B ) A Does this violate consistency? Client A

Chain Replication head(k1) tail(k1) write(k1, A ) Server A Server B Server C write(k1, B ) A Does this violate consistency? Client A Has the write been completed/acknowledged?

Chain Replication In-place writes without data corruption, i.e. durability CP or AP?

Chain Replication In-place writes without data corruption, i.e. durability CP or AP? The approach is intended for supporting large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees

What about head(k1) tail(k1) Server A Server B Server C write(k1, B ) read(k1) Client A Client B

Chain Replication In-place writes without data corruption, i.e. durability CP or AP? Our chain replication does not offer graceful handling of partitioned operation, trading that instead for supporting all three of: high performance, scalability, and strong consistency.
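
A minimal in-memory sketch of the chain protocol walked through above (illustrative only, not the paper's code): writes enter at the head and are forwarded node by node; the tail acknowledges and serves all reads, so a read can never return a value that is not yet present on every replica.

class ChainNode:
    def __init__(self, successor=None):
        self.store = {}
        self.successor = successor   # next node in the chain; None for the tail

    def write(self, key, value):
        self.store[key] = value      # apply locally, then forward down the chain
        if self.successor is not None:
            return self.successor.write(key, value)
        return "ACK"                 # tail: the write now exists on every replica

    def read(self, key):
        return self.store.get(key)   # reads are served only by the tail

tail = ChainNode()
head = ChainNode(ChainNode(tail))    # a three-node chain: head -> middle -> tail
print(head.write("k1", "A"))         # ACK comes from the tail
print(tail.read("k1"))               # 'A': committed, hence safe to expose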

Workarounds Consistency is hard to reason about! Send all requests via the master?

Pros and Cons In-place writes Reduce metadata updates Sequential writes Latency amplification

Consistency and Availability? [Escriva and Gün Sirer] What CAP is simplified to: You must give up consistency or availability at all times What the CAP theorem really says: If you cannot limit the number of faults and requests can be directed to any server and you insist on serving every request then you cannot possibly be consistent Limit failures or kind of partition?

Conclusions Distributed Storage needs to reason about consistency and availability CAP is an analysis tool, not a design principle! Within bounds of failures, all three are possible Implementation details matter too!