Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech

What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database."

What is Spanner? A distributed, multiversion database. General-purpose transactions (ACID). SQL syntax. Schematized tables. Semi-relational data model. TrueTime API for versioning / ordering.

Organization. Universes: test, dev, production. Zones: the unit of administrative deployment. Spanservers serve the data; a zonemaster assigns data to the spanservers in its zone.

Organization - spanserver. Tablet: a mapping (key, timestamp) --> value. Colossus storage holds a tablet's state. A Paxos state machine per tablet provides consensus across replicas.
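To make the tablet abstraction concrete, here is a minimal Python sketch (not Spanner's implementation) of a multiversion map from (key, timestamp) to value; the class and method names are illustrative.

```python
# Minimal multiversion tablet sketch: a read at timestamp ts returns the
# newest version written at or before ts.

class Tablet:
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), kept sorted

    def write(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort(key=lambda tv: tv[0])

    def read(self, key, ts):
        # Latest version with timestamp <= ts, or None if nothing qualifies.
        older = [v for (t, v) in self.versions.get(key, []) if t <= ts]
        return older[-1] if older else None
```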

Organization - spanserver. The participant leader runs a transaction manager and a lock table; leader leases are long-lived (roughly 10 s).

spanserver leader selection. A potential leader sends requests for lease votes; if it receives a quorum, it becomes leader. A successful write extends the lease. Votes are timestamped; once a vote expires, it no longer counts towards the quorum. Invariant: each leader's lease interval is disjoint from every other leader's (see the sketch below).
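As a rough illustration of the vote-counting rule, here is a minimal Python sketch; the 10-second lease length comes from the slide, and the helper names are assumptions, not code from the paper. A successful write would simply refresh each voter's expiration, extending the lease.

```python
import time

LEASE_SECONDS = 10.0  # "long-lived" leader lease, roughly 10 s per the slide

def grant_vote(now=None):
    """A replica grants a timestamped lease vote that expires LEASE_SECONDS later."""
    now = time.time() if now is None else now
    return now + LEASE_SECONDS

def is_leader(vote_expirations, num_replicas, now=None):
    """Expired votes no longer count; leadership requires a quorum of live votes."""
    now = time.time() if now is None else now
    live = [exp for exp in vote_expirations if exp > now]
    return len(live) > num_replicas // 2
```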

Organization - directories. Buckets of contiguous keys, generally located in one zone. Spatial locality: frequently accessed directories are colocated. Replication (number and placement of replicas) is specified by the application per directory. The Movedir operation takes advantage of this sharding to move a directory between groups.

Data Model. SQL and a schematized table structure: a move away from BigTable / NoSQL. Semi-relational: every table must have a primary-key set -- the row's "name". Two-phase commit is run through the Paxos state machines. The schema allows locality and replication to be described.

Data Model - syntax. Mostly SQL, plus INTERLEAVE IN. Schematized tables allow for more complex lookups.

Synchronization Transactional reads/writes must acquire a lock in the lock table Other operations can read directly from the replica

How to solve ordering? We have seen: versioning, vector clocks, matrix clocks. Spanner's answer...

TrueTime. Driven by GPS and atomic clocks, with a known, bounded inaccuracy. Common worst-case uncertainty: about 10 ms.

TrueTime. GPS: more precise, but more likely to fail. Atomic clocks: predictable drift, less likely to fail. A hierarchy of time-master servers provides time to nodes. If the uncertainty grows too large, the system slows down to ensure ordering is preserved. Synchronization happens every 30 seconds, giving a sawtooth pattern of uncertainty with a maximum of about 7 ms.
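The TT interface can be pictured as returning an uncertainty interval. This Python sketch uses a fixed EPSILON purely for illustration; real TrueTime derives the bound from the GPS/atomic-clock masters and widens it between the 30-second synchronizations.

```python
import time

EPSILON = 0.007  # seconds; stand-in for the ~7 ms worst-case uncertainty above

class TTInterval:
    def __init__(self, earliest, latest):
        self.earliest = earliest
        self.latest = latest

class TrueTime:
    @staticmethod
    def now():
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)  # contains "real" now

    @staticmethod
    def after(t):
        # True once t is definitely in the past on every correct clock.
        return TrueTime.now().earliest > t

    @staticmethod
    def before(t):
        # True while t is definitely still in the future.
        return TrueTime.now().latest < t
```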

Concurrency - transactions. Spanner internally retries transactions (how?). Transactions are served by checking timestamps on replicas. Transactions are stored in a log and garbage collected (how?).

Concurrency - leader leases. Remember, leader leases must be disjoint! The leader maintains s_max, the maximum timestamp it has used; s_max increases with transactions. The leader must wait for TT.after(s_max) before it can abdicate.

Concurrency - RW. Two-phase locking; the commit timestamp may be assigned at any time while all locks are held. The Paxos write timestamp is used. Timestamps issued by a leader are monotonically increasing, and s_max is advanced accordingly.

Concurrency - RW. Event e_i^start starts transaction T_i. The leader for T_i assigns a commit timestamp s_i no less than TT.now().latest, evaluated at e_i^server (when the commit request reaches the leader). Commit wait: no client can see the data until TT.after(s_i^commit) holds, where s_i^commit is the commit time. Ordering is guaranteed.
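Putting the two rules together, here is a hedged sketch of commit wait that reuses the TrueTime sketch above; apply_writes is a hypothetical callback standing in for releasing locks and publishing the writes.

```python
import time

def commit(tt, apply_writes):
    s_i = tt.now().latest        # commit timestamp: at least TT.now().latest
    while not tt.after(s_i):     # commit wait: s_i must be safely in the past
        time.sleep(0.001)        # before any client may observe the writes
    apply_writes(s_i)
    return s_i
```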

Concurrency - RW

Concurrency - reads. Each replica maintains t_safe, the maximum timestamp at which it is up-to-date: t_safe = min(t_safe^Paxos, t_safe^TM). t_safe^Paxos is the highest applied write timestamp; t_safe^TM depends on the pending (prepared but uncommitted) transactions. If t_read <= t_safe, the read is allowed through.
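A minimal sketch of this safe-time rule, assuming paxos_safe is the highest applied Paxos write timestamp and prepared holds the prepare timestamps of pending transactions (both names are illustrative):

```python
def t_safe(paxos_safe, prepared):
    # With no pending transactions the transaction-manager bound is unbounded;
    # otherwise reads must stay below the earliest pending prepare timestamp.
    tm_safe = float("inf") if not prepared else min(prepared) - 1
    return min(paxos_safe, tm_safe)

def can_serve(t_read, paxos_safe, prepared):
    return t_read <= t_safe(paxos_safe, prepared)
```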

Concurrency - RO. Assign s_read = TT.now().latest to the transaction. It may block if t_safe has not advanced far enough. Reads are done at any sufficiently up-to-date replica, inferred from the transaction's "scope". s_max is increased.
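A sketch of a read-only transaction under these rules; replica is a hypothetical object exposing t_safe() and a timestamped read(), and the polling loop stands in for however a real replica would wait for its safe time to advance.

```python
import time

def read_only_txn(tt, replica, keys):
    s_read = tt.now().latest                 # snapshot timestamp for the whole txn
    while replica.t_safe() < s_read:         # may block until the replica catches up
        time.sleep(0.001)
    return {k: replica.read(k, s_read) for k in keys}
```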

Overview. Lock-free distributed read transactions. External consistency of distributed transactions. Integration of concurrency control, replication, and two-phase commit. TrueTime global time system.

Evaluation. Somewhat lacking... Throughput: increases with the number of replicas; the sweet spot is around 3-5 replicas. Latency: the number of replicas does not affect latency much; commit wait about 5 ms, Paxos latency about 9 ms.

Evaluation - scalability. 3 zones, 25 spanservers each. Two-phase commit latencies remain reasonable up to roughly 50 participants.

Evaluation - availability. 5 zones, 25 spanservers each, with all leaders placed in Z_1. A soft kill gives notice of the necessary re-election. Caveat: the results rely on spare capacity.

Evaluation - TrueTime. CPU failure is 6x more likely than clock failure. The improvement after March 30th was due to network improvements. Inaccuracy is low enough to guarantee the invariants Spanner depends on for ordering.

F1. Google's advertising backend: a sharded MySQL DB, with some sharded data stored in BigTable. Why switch? Dynamic resharding (remember directories?), synchronous replication and automatic failover, and strong transactional semantics.

Evaluation - F1. Replicas on both US coasts. Database maintenance and upkeep are greatly simplified compared with the previous model. Cluster failures are mitigated by schema re-definition.

Questions?