Spanner: Google's Globally-Distributed Database

Transcription:

Spanner: Google's Globally-Distributed Database. Google, Inc. OSDI 2012. Presented by: Karen Ouyang

Problem Statement: a distributed data system with high availability that supports external consistency.

Key Ideas: a distributed data system with high availability that supports external consistency; the enabling technology is the TrueTime API.

Server Organization: datacenters host one or more zones.

Server Organization: the placement driver handles moving data across zones; each zone's zonemaster assigns data to spanservers; location proxies are used by clients to locate the spanservers holding their data; spanservers serve data to clients.

Spanserver Stack: each spanserver manages between 100 and 1000 tablet instances; a tablet implements a multi-version mapping (key:string, timestamp:int64) → string.
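
To make the data model concrete, here is a minimal Python sketch (class and method names are mine, not Spanner's) of a tablet as a multi-version map from (key, timestamp) to string:

from typing import Optional

class Tablet:
    """Toy multi-version store: (key: string, timestamp: int64) -> string."""
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), kept sorted by timestamp

    def write(self, key: str, timestamp: int, value: str) -> None:
        self.versions.setdefault(key, []).append((timestamp, value))
        self.versions[key].sort()

    def read(self, key: str, timestamp: int) -> Optional[str]:
        # Return the latest value written at or before the requested timestamp.
        older = [v for t, v in self.versions.get(key, []) if t <= timestamp]
        return older[-1] if older else None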

Spanserver Stack: each tablet is replicated across a set of replicas that form a Paxos group; writes initiate the Paxos protocol at the leader, while reads can be served from any sufficiently up-to-date replica.

Spanserver Stack: at each Paxos leader, a lock table contains the state for two-phase locking, and a transaction manager supports distributed transactions.

Transactions that involve more than one Paxos group use two-phase commit; one of the participant groups' leaders is chosen as the coordinator leader.

TrueTime API: exposes clock uncertainty by expressing time as an interval. It is built on GPS and atomic clocks, with time master machines in each datacenter; a client polls multiple masters to compute its time interval.

TrueTime API (figure): TT.now() returns an interval [earliest, latest] whose width is 2ε and which is guaranteed to contain the current absolute time.
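
A hedged Python sketch of the TrueTime interface (TT.now(), TT.after(), and TT.before() are from the paper; the fixed epsilon and the use of the local clock are simplifications of mine):

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # absolute time is guaranteed to be >= earliest
    latest: float    # absolute time is guaranteed to be <= latest

class TrueTime:
    def __init__(self, epsilon: float = 0.004):  # ~4 ms bound, purely illustrative
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()  # stand-in for GPS/atomic-clock time masters
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        return self.now().earliest > t   # t has definitely passed

    def before(self, t: float) -> bool:
        return self.now().latest < t     # t has definitely not arrived yet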

Consistency: ensure external consistency by ensuring timestamp order. All transactions are assigned timestamps, and data written by transaction T is timestamped with s.

Two-phase locking: timestamps are assigned at any time while all locks are held. Timestamps are assigned to Paxos writes in monotonically increasing order, even across leaders. A leader only assigns timestamps within its leader lease, and leader leases are disjoint.
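
A small sketch of how a leader might enforce these rules, assuming the TrueTime sketch above; the lease bookkeeping and the tie-breaking increment are illustrative assumptions, not Spanner internals:

class LeaderTimestampAssigner:
    """Assign Paxos write timestamps monotonically, only within the leader lease."""
    def __init__(self, tt, lease_start: float, lease_end: float):
        self.tt = tt                      # TrueTime sketch from above
        self.lease_start = lease_start
        self.lease_end = lease_end
        self.last_assigned = lease_start

    def assign_write_timestamp(self) -> float:
        s = max(self.tt.now().latest, self.last_assigned + 1e-6)  # strictly increasing
        if not (self.lease_start <= s < self.lease_end):
            raise RuntimeError("outside leader lease; a new leader/lease must take over")
        self.last_assigned = s
        return s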

Transactions with two-phase commit (figure): two transactions, where T2 starts to commit after T1 has finished committing. Assign commit timestamps such that s1 < s2. How?

Start: the commit timestamp is no earlier than the absolute time of the commit-request event at the server, i.e. t_abs(e2_server) ≤ s2.

Commit wait: other readers cannot see data committed by T until T's commit timestamp s has definitely passed. Pick s = TT.now().latest (at or after the time the commit request reaches the server), then wait until s < TT.now().earliest before committing; that delay is the commit wait.
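
These two rules (start and commit wait) fit in a few lines; a sketch reusing the TrueTime class above, with busy-waiting as a simplification of mine:

import time

def assign_and_commit_wait(tt) -> float:
    """Start rule: pick s = TT.now().latest when the commit request is handled,
    so s is at least the absolute time of that event. Commit wait: block until s
    has definitely passed before making T's data visible."""
    s = tt.now().latest
    while not tt.after(s):       # i.e. until s < TT.now().earliest
        time.sleep(0.0001)
    return s                     # safe to commit at timestamp s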

Putting it together: s1 < t_abs(e1_commit) < t_abs(e2_start) < t_abs(e2_server) ≤ s2, and therefore s1 < s2.

Two-phase commit (animation; the client begins the protocol and drives it through the coordinator leader): each participant leader chooses a prepare timestamp and logs a prepare record in Paxos, then sends its prepare to the coordinator leader. The coordinator leader chooses a commit timestamp no earlier than any prepare timestamp, logs the commit record in Paxos, and performs commit wait. Once the commit wait is done, it notifies the client and the participant leaders, and each participant logs the transaction's outcome in Paxos.
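
A hedged sketch of the coordinator leader's side of this protocol; Paxos logging and messaging are stubbed out as callbacks, and all names are assumptions rather than Spanner internals:

import time

def coordinator_commit(tt, prepare_timestamps, paxos_log, notify_participants) -> float:
    """Choose a commit timestamp >= every participant prepare timestamp and
    >= TT.now().latest, log it through Paxos, obey commit wait, then notify."""
    s_commit = max([tt.now().latest] + list(prepare_timestamps))
    paxos_log("commit", s_commit)        # log the commit record in Paxos
    while not tt.after(s_commit):        # commit wait
        time.sleep(0.0001)
    notify_participants(s_commit)        # participants log the outcome in Paxos
    return s_commit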

Read-Only Transactions. Serving reads at a replica: each replica tracks a safe time t_safe and can serve a read at timestamp t whenever t ≤ t_safe, where t_safe = min(t_safe_Paxos, t_safe_TM). Assigning timestamps to read-only transactions: the simplest choice is s_read = TT.now().latest, which may block; ideally the system should assign the oldest timestamp that preserves external consistency.
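
A minimal sketch of the safe-time check on this slide (parameter names are mine):

def can_serve_read(t_read: float, t_safe_paxos: float, t_safe_tm: float) -> bool:
    """A replica may serve a read at t_read only if t_read <= t_safe,
    where t_safe = min(t_safe_paxos, t_safe_tm)."""
    return t_read <= min(t_safe_paxos, t_safe_tm)

# Read-only transactions: the simplest choice is s_read = TT.now().latest, which
# preserves external consistency but may block until the replica's t_safe advances.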

Microbenchmarks Two-phase commit scalability

Microbenchmarks Effect of killing servers on throughput

Performance: TrueTime; F1, Google's advertising backend; automatic failover; high standard deviation for latency?

Final Thoughts: implemented at a large scale (F1)! Commit wait is pretty clever. Very dependent on clocks. Security?

References: Corbett et al. Spanner: Google's Globally-Distributed Database. Proc. of OSDI, 2012. http://research.google.com/archive/spanner.html