Spanner: Google's Globally-Distributed Database. James Sedgwick and Kayhan Dursun


Spanner
- A multi-version, globally-distributed, synchronously-replicated database
- First system to
  - Distribute data at global scale
  - Support externally-consistent distributed transactions

Introduction - What is Spanner?
- A system that shards data across many sets of Paxos state machines in data centers all around the world
- Designed to scale up to millions of machines and trillions of database rows

Features
- Dynamic, fine-grained replication configurations
- Applications can set constraints to manage:
  - Read latency (how far data is from its users)
  - Write latency (how far replicas are from each other)
  - Durability and availability (how many replicas are maintained)
  - Load balancing across datacenters

Features cont.
- Externally consistent reads and writes
- Globally consistent reads
Why does consistency matter?

Implementation
- A Spanner deployment is a set of zones: the set of locations across which data can be distributed
- There can be more than one zone in a datacenter

Spanserver Software Stack
- Tablet: a bag of mappings (key:string, timestamp:int64) -> string
- Paxos: replication support
  - Writes initiate the Paxos protocol at the leader
  - Reads are served directly from the tablet at any sufficiently up-to-date replica
- Lock table (at the leader, for concurrency control)
- Transaction manager (to support distributed transactions)
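As a rough illustration of that mapping (not Spanner's actual storage format), here is a toy multi-version map in Python:

```python
import bisect
from collections import defaultdict
from typing import Optional


class ToyTablet:
    """Toy multi-version map (key: str, timestamp: int) -> str, illustrating
    the tablet abstraction above; not Spanner's actual storage format."""

    def __init__(self):
        # Per key: parallel lists of timestamps (kept sorted) and values.
        self._versions = defaultdict(lambda: ([], []))

    def write(self, key: str, ts: int, value: str) -> None:
        ts_list, val_list = self._versions[key]
        i = bisect.bisect_left(ts_list, ts)
        ts_list.insert(i, ts)
        val_list.insert(i, value)

    def read(self, key: str, ts: int) -> Optional[str]:
        """Snapshot read: latest value written at or before `ts`."""
        ts_list, val_list = self._versions[key]
        i = bisect.bisect_right(ts_list, ts)
        return val_list[i - 1] if i > 0 else None


t = ToyTablet()
t.write("user/1", ts=10, value="alice")
t.write("user/1", ts=20, value="alice@new")
assert t.read("user/1", ts=15) == "alice"
assert t.read("user/1", ts=25) == "alice@new"
```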

Directories
- A bucketing abstraction: a set of contiguous keys that share a common prefix
- The unit of data placement (all data in a directory shares a replication configuration)
- The unit of movement between Paxos groups, e.g. for
  - Load balancing
  - Co-locating data with similar access patterns
  - Moving data closer to its accessors

Data model
- Schematized semi-relational tables, motivated by the popularity of Megastore
- A SQL-like query language, motivated by the popularity of Dremel
- General-purpose transactions, motivated by experiencing their absence in Bigtable

Data Model cont.
- Not purely relational: every table must have a primary key, so rows have names
- Applications must partition the database into hierarchies of tables, with child rows interleaved under their parent row
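For intuition, a hypothetical sketch of how interleaving keeps a parent row and its children contiguous (Python; the Users/Albums naming mirrors the paper's example, the key encoding is our own):

```python
# Hypothetical illustration of hierarchical interleaving via key prefixing.
# Keys are tuples; sorting them lexicographically keeps each user's albums
# contiguous with (and therefore co-located with) the user row itself.
rows = {
    ("Users", 1): {"email": "alice@example.com"},
    ("Users", 1, "Albums", 10): {"name": "vacation"},
    ("Users", 1, "Albums", 11): {"name": "pets"},
    ("Users", 2): {"email": "bob@example.com"},
    ("Users", 2, "Albums", 10): {"name": "food"},
}


def directory_for_user(uid: int):
    """All rows rooted at Users(uid): roughly the paper's notion of a directory."""
    return {k: v for k, v in sorted(rows.items()) if k[:2] == ("Users", uid)}


print(directory_for_user(1))
```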

TrueTime

Method         Returns
TT.now()       TTinterval: [earliest, latest]
TT.after(t)    true if t has definitely passed
TT.before(t)   true if t has definitely not arrived

- Represents time as intervals with bounded uncertainty
- Let the instantaneous error bound be ε (half of the interval width) and the average error bound be ε̄
- Formal guarantee: let t_abs(e) be the absolute time of event e. For tt = TT.now(), tt.earliest <= t_abs(e) <= tt.latest, where e is the invocation event
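A minimal sketch of this interface in Python (our own illustrative stand-in, not Google's implementation; it fakes the uncertainty bound ε with a fixed constant rather than deriving it from GPS and atomic-clock masters):

```python
import time
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: float  # seconds since the epoch
    latest: float


class TrueTime:
    """Illustrative stand-in for the TrueTime API. A real implementation
    derives epsilon from time-master uncertainty, local clock drift, and
    communication delay; here we just assume a fixed bound."""

    def __init__(self, epsilon_s: float = 0.004):  # assumed ~4 ms bound
        self.epsilon_s = epsilon_s

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(earliest=t - self.epsilon_s, latest=t + self.epsilon_s)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived."""
        return self.now().latest < t


tt = TrueTime()
iv = tt.now()
print(tt.after(iv.latest))   # False: iv.latest is not yet guaranteed to be in the past
time.sleep(2 * tt.epsilon_s)
print(tt.after(iv.latest))   # True: roughly 2*epsilon later it definitely is
```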

TrueTime implementation
- Two underlying time references, used together because they have disjoint failure modes
  - GPS: antenna/receiver failures, interference, GPS system outages
  - Atomic clocks: drift, etc.
- A set of time masters per datacenter (a mix of GPS and atomic), and a time daemon on every server
- Masters cross-check their time against other masters and against the rate of their local clock
- Masters advertise their uncertainty: near zero for GPS, growing with worst-case clock drift for atomic masters
- Masters can evict themselves if their uncertainty grows too high

TrueTime implementation, cont'd.
- Time daemons poll a variety of masters (local and remote GPS masters as well as atomic-clock masters)
- They use a variant of Marzullo's algorithm to detect and reject liars, and synchronize the local clock to the non-liars
- Between synchronizations, daemons advertise a slowly increasing uncertainty, derived from worst-case local drift, time-master uncertainty, and communication delay to the masters
- ε as seen by a TrueTime client therefore has a sawtooth shape, varying from about 1 to 7 ms over each poll interval
- Time-master unavailability and overloaded machines or networks can cause spikes in ε
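For intuition, here is a compact sketch of the interval-intersection idea behind Marzullo-style agreement (Python, illustrative only; the real time daemon also folds in drift and communication delay as described above):

```python
def best_overlap(intervals):
    """Return (votes, lo, hi): the sub-interval contained in the largest
    number of source intervals. Sources whose intervals miss it are the
    'liars' that get rejected. Illustrative sketch, not production code."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))  # interval opens
        events.append((hi, +1))  # interval closes
    events.sort()  # opens sort before closes at the same offset

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(events):
        count -= kind  # open: one more overlapping source, close: one fewer
        if count > best and i + 1 < len(events):
            best, best_lo, best_hi = count, offset, events[i + 1][0]
    return best, best_lo, best_hi


# Three sources agree on [11, 12]; the fourth (a liar) is outvoted.
print(best_overlap([(8, 12), (11, 13), (10, 12), (20, 21)]))  # (3, 11, 12)
```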

Spanner Operations
- Read-write transactions
  - Standalone writes are a subset
- Read-only transactions
  - Non-snapshot standalone reads are a subset
  - Executed at a system-chosen timestamp without locking, so incoming writes are not blocked
  - Executed on any replica that is sufficiently up to date w.r.t. the chosen timestamp
- Snapshot reads
  - Client provides a timestamp or an upper bound on timestamp staleness

Paxos Invariants
- Spanner's Paxos implementation uses timed (10-second) leader leases to make leadership long-lived
  - A candidate becomes leader after receiving a quorum of timed lease votes
  - Replicas extend lease votes implicitly on writes; the leader explicitly requests an extension from a replica whose vote is close to expiration
  - A lease interval starts when a quorum of lease votes is achieved and ends when a quorum is lost
- Spanner requires monotonically increasing Paxos write timestamps across leaders in a group, so leader lease intervals must be disjoint
  - To achieve disjointness, a leader could log its lease interval via Paxos and subsequent leaders could wait that interval out before taking over; Spanner avoids this extra Paxos communication and instead preserves disjointness with a TrueTime-based mechanism (described in Appendix A of the paper, because it's complicated)
  - Leaders can also abdicate, but to preserve disjointness they must first wait until TT.after(s_max) is true, where s_max is the maximum timestamp used by that leader

Proof of Externally Consistent RW Transactions
- External consistency: if the start of T_2 occurs after the commit of T_1, then the commit timestamp of T_2 is greater than the commit timestamp of T_1
- Let the start, commit-request-arrival, and commit events of transaction T_i be e_i^start, e_i^server, and e_i^commit
- Formally: if t_abs(e_1^commit) < t_abs(e_2^start), then s_1 < s_2
- Start rule: the coordinator leader assigns T_i a commit timestamp s_i no less than TT.now().latest, computed after e_i^server
- Commit wait rule: the coordinator leader ensures clients cannot see the effects of T_i before TT.after(s_i) is true; that is, s_i < t_abs(e_i^commit)
- Chaining these together:
  s_1 < t_abs(e_1^commit)                  (commit wait)
  t_abs(e_1^commit) < t_abs(e_2^start)     (assumption)
  t_abs(e_2^start) <= t_abs(e_2^server)    (causality)
  t_abs(e_2^server) <= s_2                 (start)
  s_1 < s_2                                (transitivity)

Serving Reads at a Timestamp
- Each replica tracks a safe time t_safe, the maximum timestamp at which it is up to date; a replica can serve a read at timestamp t if t <= t_safe
- t_safe = min(t_safe^Paxos, t_safe^TM)
- t_safe^Paxos is simply the timestamp of the highest applied Paxos write on the replica; Paxos write timestamps increase monotonically, so no write will occur at or below it
- t_safe^TM accounts for prepared but uncommitted transactions in the replica's group:
  - Every participant leader (of group g) for transaction T_i assigns a prepare timestamp s_i,g^prepare to its prepare record, which is propagated to g via Paxos
  - The coordinator leader ensures that the commit timestamp s_i of T_i satisfies s_i >= s_i,g^prepare for every participant group g
  - Thus t_safe^TM = min_i(s_i,g^prepare) - 1, which is guaranteed to be earlier than every prepared-but-uncommitted transaction in the replica's group
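A minimal sketch of that computation (Python, illustrative only; the function and parameter names are ours):

```python
def t_safe(highest_applied_paxos_write_ts: int,
           prepared_txn_prepare_ts: list) -> int:
    """Safe time for a replica: the max timestamp at which it may serve reads.

    t_safe^Paxos = highest applied Paxos write timestamp.
    t_safe^TM    = (min prepare timestamp of prepared-but-uncommitted txns) - 1,
                   or unbounded if there are none.
    """
    t_paxos_safe = highest_applied_paxos_write_ts
    if prepared_txn_prepare_ts:
        t_tm_safe = min(prepared_txn_prepare_ts) - 1
    else:
        t_tm_safe = float("inf")
    return min(t_paxos_safe, t_tm_safe)


# A transaction prepared at timestamp 95 holds safe time back to 94,
# even though Paxos writes have been applied through 100.
assert t_safe(100, [95, 120]) == 94
assert t_safe(100, []) == 100
```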

Assigning Timestamps to RO Transactions
- To execute a read-only transaction, pick a timestamp s_read, then execute it as snapshot reads at s_read on sufficiently up-to-date replicas
- Picking s_read = TT.now().latest after the transaction starts always preserves external consistency, but may block unnecessarily long while waiting for t_safe to advance
- Better: choose the oldest timestamp that still preserves external consistency, LastTS (possible when there are no prepared transactions)
  - If the read's scope is a single Paxos group, simply choose the timestamp of the last committed write at that group
  - If the read's scope spans multiple groups, the group leaders could negotiate to determine max_g(LastTS_g); the current implementation avoids this communication and simply uses TT.now().latest
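A sketch of that decision under the assumptions above (Python, illustrative only; per-group LastTS values are assumed to be tracked elsewhere):

```python
def choose_read_timestamp(tt_now_latest: int,
                          scope_groups: list,
                          last_committed_ts: dict) -> int:
    """Pick s_read for a read-only transaction.

    Single-group scope: the group's last committed write timestamp suffices.
    Multi-group scope: use TT.now().latest rather than negotiating
    max over the groups' LastTS values (mirroring the slide above).
    """
    if len(scope_groups) == 1:
        return last_committed_ts[scope_groups[0]]
    return tt_now_latest


print(choose_read_timestamp(1000, ["g1"], {"g1": 950}))                 # 950
print(choose_read_timestamp(1000, ["g1", "g2"], {"g1": 950, "g2": 980}))  # 1000
```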

Details of RW Transactions, pt. 1
- The client issues reads to the leader replicas of the appropriate groups, which acquire read locks and return the most recent data
- Once its reads are complete and its writes are buffered locally, the client chooses a coordinator group and sends a commit message to each participant leader, carrying the identity of the coordinator and the buffered writes
- Each non-coordinator participant leader:
  - acquires write locks
  - chooses a prepare timestamp larger than any timestamps it has assigned to previous transactions
  - logs a prepare record through Paxos
  - notifies the coordinator of its chosen timestamp

Details of RW Transactions, pt. 2
- The coordinator leader:
  - acquires write locks
  - picks a commit timestamp s greater than TT.now().latest, greater than or equal to all participant prepare timestamps, and greater than any timestamps it has assigned to previous transactions
  - logs a commit record through Paxos
- Commit wait: the coordinator waits until TT.after(s) is true before allowing any replica to apply T; since s > TT.now().latest, the expected wait is at least 2 * ε̄
- After commit wait, the commit timestamp is sent to the client and to all participant leaders
- Each participant leader logs the commit timestamp through Paxos; all participants then apply at the same timestamp and release their locks
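A toy sketch of the coordinator's timestamp choice and commit wait, reusing the illustrative TrueTime stand-in from the earlier sketch (not Spanner's actual code path; timestamps here are seconds-since-epoch floats for simplicity):

```python
import time


def coordinator_commit(tt, prepare_timestamps, last_assigned_ts):
    """Pick a commit timestamp and enforce commit wait (illustrative only).

    `tt` is the TrueTime stand-in defined earlier; `prepare_timestamps` are
    the participants' prepare timestamps, `last_assigned_ts` is the largest
    timestamp this leader has assigned so far.
    """
    # Rule: s >= every participant prepare timestamp, s > TT.now().latest,
    # and s > any timestamp previously assigned by this leader.
    s = max(max(prepare_timestamps), tt.now().latest, last_assigned_ts)
    s += 1e-6  # make the inequality strict

    # Commit wait: do not expose the commit until s is definitely in the past.
    while not tt.after(s):
        time.sleep(0.001)
    return s
```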

Schema-Change Transactions
- Spanner supports atomic schema changes
- A standard transaction won't do: the number of participants (the number of groups in the database) could be in the millions
- Instead, use a non-blocking variant:
  - Explicitly assign the transaction a timestamp t in the future during the prepare phase
  - Reads and writes synchronize around t (see the sketch below): operations with timestamps before t proceed; operations with timestamps after t block behind the schema change
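A minimal sketch of that synchronization rule (Python, illustrative only):

```python
from typing import Optional


def may_proceed(op_timestamp: int, schema_change_ts: Optional[int]) -> bool:
    """Gate an operation against a schema change registered at a future
    timestamp t: earlier operations proceed, later ones must wait behind
    the schema change (illustrative only)."""
    if schema_change_ts is None:
        return True  # no pending schema change
    return op_timestamp < schema_change_ts


assert may_proceed(100, schema_change_ts=200) is True    # precedes t: proceed
assert may_proceed(250, schema_change_ts=200) is False   # after t: block
```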

Refinements
- A single prepared transaction blocks t_safe^TM from advancing; what if the prepared transactions don't conflict with the read?
  - Augment t_safe^TM with fine-grained mappings from key ranges to prepare timestamps; when computing t_safe^TM as the minimum prepare timestamp in the group, consult these mappings and consider only the transactions that conflict with the read
- LastTS has a similar problem: when assigning a timestamp to a read-only transaction, we otherwise wait behind all previous commit timestamps, even for commits that don't conflict with the read
  - Similar solution: maintain mappings from key ranges to commit timestamps, and consider only conflicting commits when computing the maximum

Refinements cont.
- t_safe^Paxos cannot advance without Paxos writes, so snapshot reads at t cannot proceed at groups whose last Paxos write occurred before t
- Paxos leaders instead advance t_safe^Paxos by tracking the timestamp above which future Paxos writes will occur:
  - Maintain a mapping MinNextTS(n) from Paxos sequence number n to the minimum timestamp that may be assigned to Paxos write n + 1
  - Leaders advance MinNextTS(n) only so far that it does not extend past their lease
  - Advances occur every 8 seconds by default, so in the worst case an idle group's replicas can serve reads no fresher than 8 seconds old; advances can also occur at a replica's request
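A sketch of how a leader can advance the Paxos component of safe time on an idle group under this scheme (Python; MinNextTS comes from the slide above, the rest of the naming is our illustration):

```python
class PaxosLeaderClock:
    """Illustrative tracker for t_safe^Paxos advancement via MinNextTS."""

    def __init__(self):
        self.highest_applied_ts = 0  # timestamp of the highest applied Paxos write
        self.min_next_ts = 1         # minimum timestamp the next Paxos write may use

    def apply_write(self, ts: int) -> None:
        assert ts >= self.min_next_ts, "writes must respect MinNextTS"
        self.highest_applied_ts = ts
        self.min_next_ts = ts + 1

    def advance_min_next_ts(self, new_min: int, lease_end: int) -> None:
        # Only advance within the leader's lease, so lease disjointness keeps
        # the promise meaningful across leader changes.
        self.min_next_ts = max(self.min_next_ts, min(new_min, lease_end))

    def t_paxos_safe(self) -> int:
        # Safe to read anywhere below the next possible write timestamp,
        # even if no write has been applied recently.
        return max(self.highest_applied_ts, self.min_next_ts - 1)


clock = PaxosLeaderClock()
clock.apply_write(100)
clock.advance_min_next_ts(new_min=500, lease_end=10_000)
assert clock.t_paxos_safe() == 499
```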

Evaluation
- Availability
- TrueTime
- A running system: F1

Availability
- Results of three experiments in the presence of datacenter failures
- 5 zones, each with 25 spanservers
- Data sharded into 1250 Paxos groups

Availability cont.
- leader-hard kill: throughput recovers about 10 seconds after the kill, once leader leases expire and new leaders are elected

TrueTime

F1
- Google's advertising backend, originally based on a manually sharded MySQL deployment
- Spanner removes the need to manually reshard
- Provides synchronous replication and automatic failover
- F1 requires strong transactional semantics

Thank you!