Google Spanner - A Globally Distributed, Synchronously-Replicated Database System

Google Spanner - A Globally Distributed, Synchronously-Replicated Database System. James C. Corbett et al. Feb 14, 2013. Presented by Alexander Chow for CS 742

Motivation Eventual consistency sometimes isn't good enough. Applications want general-purpose transactions (ACID), complex and evolving schemas, schematized tables, and a SQL-like query language.

The Problem Store data across thousands of machines in hundreds of data centres. Replicate across data centres, even across continents.

Spanner Features Lock-free distributed read transactions from any sufficiently up-to-date replica. External consistency: commit order == timestamp order == global wall-clock time. The TrueTime API.

Lock-free Reads Example A user unfriends an untrustworthy person and then writes a dissenting post. On a single machine, a read that generates the friends page (Friend1's post, Friend2's post) can simply block writes while it runs, so the dissenting post never appears alongside the stale friends list.

Lock-free Reads When the generated page is assembled from posts spread across many machines (Friend1 ... Friend100, Friend101), blocking writes on every machine for every page read does not scale; Spanner instead serves such reads lock-free at a single snapshot timestamp.

TrueTime API TT.now() returns an interval [earliest, latest] of width 2ε that is guaranteed to contain the true absolute time t.
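
Below is a minimal Python sketch of the TrueTime interface as presented here (TT.now(), plus the TT.after() and TT.before() helpers from the paper); the local clock and the fixed 4 ms uncertainty are stand-in assumptions, since a real implementation derives ε from the timemaster synchronization described later.

    import time

    class TTInterval:
        # An interval [earliest, latest] guaranteed to contain true absolute time.
        def __init__(self, earliest, latest):
            self.earliest = earliest
            self.latest = latest

    class TrueTime:
        def __init__(self, epsilon_s=0.004):     # assumed uncertainty (about 4 ms)
            self.epsilon = epsilon_s

        def now(self):
            t = time.time()                      # local clock stands in for GPS/atomic sources
            return TTInterval(t - self.epsilon, t + self.epsilon)

        def after(self, t):
            # True once t has definitely passed.
            return self.now().earliest > t

        def before(self, t):
            # True while t has definitely not yet arrived.
            return self.now().latest < t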

Read-Write Transaction Two-phase locking: after acquiring all locks, take t = TT.now() and choose the commit timestamp s = t.latest(); commit wait until s has definitely passed before releasing the locks.

Overlapping with Commit Wait The network cost of achieving consensus far dominates the commit-wait time, so there is little extra waiting: acquire all locks, take s = TT.now().latest(), start consensus, let the commit wait overlap the consensus round, finish consensus, release all locks.
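
As a rough sketch of the commit-wait rule on the last two slides (using the hypothetical TrueTime class above; run_paxos_consensus and release_locks are placeholder callbacks, not real APIs): the timestamp is chosen while holding the locks, the consensus round proceeds, and the wait until s has passed typically finishes before consensus does.

    import time

    def commit_read_write_txn(tt, run_paxos_consensus, release_locks):
        # Two-phase locking: all locks are assumed to be held at this point.
        s = tt.now().latest              # commit timestamp
        run_paxos_consensus(s)           # consensus latency usually dominates the commit wait
        while not tt.after(s):           # commit wait: s must be in the past before the
            time.sleep(0.0001)           # commit is made visible
        release_locks()
        return s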

Integrating 2PC and TrueTime Each participant acquires its locks, prepares (computing its prepare timestamp s), and sends s to the coordinator; the coordinator acquires its own locks, logs the commit record, performs the commit wait, and then all participants release their locks.
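
A one-line sketch of the coordinator's timestamp choice implied by this slide ("each computes s" and sends it on prepare): the commit timestamp must be at least every participant's prepare timestamp and at least TT.now().latest at the coordinator.

    def choose_commit_timestamp(tt, prepare_timestamps):
        # Commit timestamp >= all participants' prepare timestamps and
        # >= the coordinator's own TT.now().latest.
        return max(max(prepare_timestamps), tt.now().latest)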

Implementing TrueTime Each datacenter runs timemaster machines backed by GPS receivers or atomic clocks; client machines poll the timemasters (diagram: Datacenters 1-3, each with GPS- or atomic-clock-backed timemasters polled by clients).

Implementing TrueTime At synchronization (clients poll timemasters every 30 seconds): time comes from the nearest available timemaster, and timemasters in nearby datacenters are also polled for redundancy and to detect rogue timemasters. A variation on Marzullo's algorithm detects liars and computes the time from the non-liars. ε resets to the uncertainty broadcast by the timemaster plus the communication time (about 1 ms). Between synchronizations: ε increases with the assumed local clock drift (200 µs/s).
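
The ε bookkeeping above can be summarized in a small sketch (numbers are the slide's: roughly 1 ms at synchronization, 200 µs/s drift between the 30-second polls):

    def current_epsilon(eps_at_sync, seconds_since_sync, drift_rate=200e-6):
        # epsilon resets at each poll of the timemasters, then grows with the
        # assumed worst-case local clock drift between polls.
        return eps_at_sync + drift_rate * seconds_since_sync

    # 15 s after a sync that left 1 ms of uncertainty: about 1 ms + 3 ms = 4 ms
    print(current_epsilon(0.001, 15))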

Time availability by design Commit wait uses the variable ε. If the local timemaster is not available, a client can use a remote timemaster in another datacenter (100+ ms away); ε grows and Spanner slows down automatically.

Easy Schema Change A non-blocking variant of a regular transaction: at the prepare stage, choose a timestamp t in the future. Reads and writes that implicitly depend on the schema proceed if their time is before t and block if their time is after t. Without TrueTime, defining a schema change to happen at time t would be meaningless.

Spanner Implementation Details Tablet: similar to Bigtable's tablet, but a bag of mappings (key:string, timestamp:int64) -> string, making it more like a multi-version database. Stored on Colossus (a distributed file system).
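
A toy illustration of that data model (not Spanner's on-disk format): a bag of (key, timestamp) -> value mappings, where a read at a timestamp returns the latest version at or before it.

    class Tablet:
        def __init__(self):
            self.cells = {}                       # (key: str, timestamp: int) -> str

        def write(self, key, timestamp, value):
            self.cells[(key, timestamp)] = value

        def read_at(self, key, timestamp):
            # Most recent version of `key` with a timestamp <= `timestamp`.
            versions = [ts for (k, ts) in self.cells if k == key and ts <= timestamp]
            return self.cells[(key, max(versions))] if versions else None

    tablet = Tablet()
    tablet.write("users/alice", 10, "v1")
    tablet.write("users/alice", 20, "v2")
    print(tablet.read_at("users/alice", 15))      # -> v1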

Spanner Implementation Details Tablets are replicated (between datacenters, possibly across continents), with concurrency coordinated by Paxos. A transaction needs consistency across a tablet's replicas, and Paxos provides it. Paxos group: a tablet and its replicas, together with the concurrency machinery across those replicas, one of which acts as the Paxos leader.

Spanner Implementation Details 2PC Coordination If a transaction involves multiple Paxos groups, transaction-management machinery layered atop the Paxos groups coordinates two-phase commit (each group contributes a participant leader and a transaction manager). If a transaction involves a single Paxos group, it can bypass the transaction manager and participant-leader machinery. Thus the system has two stages of concurrency control, 2PC and Paxos, where one stage (2PC) can be skipped.

Lock-free Reads at a Timestamp Each replica maintains t_safe = min(t_safe^Paxos, t_safe^TM). t_safe^Paxos is the timestamp of the highest-applied Paxos write. t_safe^TM is harder: it is infinite if there is no pending 2PC transaction, and otherwise min_i(s^prepare_{i,g}) over the transactions i prepared in group g. Thus t_safe is the maximum timestamp at which reads are safe at that replica.
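
That rule as a small sketch, assuming the replica tracks its highest-applied Paxos write and the prepare timestamps of pending 2PC transactions in its group:

    import math

    def t_safe(t_paxos_safe, pending_prepare_timestamps):
        # t_TM_safe is infinite with no pending 2PC transactions, otherwise the
        # minimum prepare timestamp among transactions prepared in this group.
        t_tm_safe = min(pending_prepare_timestamps) if pending_prepare_timestamps else math.inf
        return min(t_paxos_safe, t_tm_safe)

    # A read at timestamp t can be served by this replica iff t <= t_safe(...).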

Data Locality Application-level control of data locality: the prefix of the key defines the bucket. Example keys sharing the prefix 0PZX2N47HL5N: 0PZX2N47HL5N4MAE3Q, 0PZX2N47HL5N7U9OY2, 0PZX2N47HL5NQBDP73. Entries in the same bucket are always in the same Paxos group, and load can be balanced between Paxos groups by moving buckets.
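
A toy sketch of prefix-based bucketing (the 12-character prefix length is purely illustrative, chosen to match the example keys):

    def bucket_of(key, prefix_len=12):
        # Keys sharing a prefix land in the same bucket, and a bucket is always
        # served by a single Paxos group.
        return key[:prefix_len]

    keys = ["0PZX2N47HL5N4MAE3Q", "0PZX2N47HL5N7U9OY2", "0PZX2N47HL5NQBDP73"]
    print({bucket_of(k) for k in keys})           # all three map to '0PZX2N47HL5N'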

Benchmarks 50 Paxos groups, 2500 buckets, 4 KB reads or writes, datacenters 1 ms apart. Latency remains mostly constant as the number of replicas increases because Paxos executes in parallel at a group's replicas. Sensitivity to a slow replica decreases as the number of replicas increases (it is easier to achieve a quorum).

Benchmarks All leaders explicitly placed in zone Z1; all servers in one zone are killed at the 5-second mark. For the Z1 test, the completion rate drops to almost 0, then recovers quickly after new leaders are elected.

Critique No background on current global time-synchronization techniques. Lack of proofs of absolute error bounds in their TrueTime implementation. External consistency? One has to guess at the implied meaning (the referenced PhD dissertation is not available online). Pipelined Paxos? Not described. Is each replica governed by a replica-wide lock, so that one replica cannot undergo Paxos concurrently on disjoint rows?