
Corbett et al., Spanner: Google's Globally-Distributed Database
MIMUW, 2017-01-11

ACID transactions
SQL queries
Semi-relational data model
Lock-free distributed transactions
Global scale
Externally consistent

Consistency matters. Example: unfriend untrustworthy person X, then post "My government is repressive..." If the two operations are observed out of order, X can still see the post.

External consistency
Linearisability: if T1 commits before T2 starts, then T1's commit timestamp is smaller than T2's.
Spanner is the first system to provide this guarantee at global scale.
This allows:
consistent reads in the past
consistent backups
consistent MapReduce executions
atomic schema updates
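
To make the guarantee concrete, here is a small checker sketch (all names are invented for illustration): given transactions with observed real-time start/commit instants and their assigned commit timestamps, it tests the linearisability condition above.

    # Hypothetical checker for the external-consistency condition.
    from dataclasses import dataclass

    @dataclass
    class Txn:
        start_real: float   # real time at which the transaction started
        commit_real: float  # real time at which its commit completed
        commit_ts: float    # commit timestamp assigned by the system

    def externally_consistent(txns: list[Txn]) -> bool:
        # If T1's commit completes before T2 starts (in real time),
        # T1 must carry the smaller commit timestamp.
        for t1 in txns:
            for t2 in txns:
                if t1.commit_real < t2.start_real and t1.commit_ts >= t2.commit_ts:
                    return False
        return True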

Organisation
A universe consists of zones.
A zone has:
a zonemaster
spanservers that serve data to clients
location proxies
Global:
the universe master
the placement driver, responsible for data transfer across zones
Bucketing abstraction: directories
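
A minimal sketch of this hierarchy as data types, purely to fix the vocabulary (field names are invented; the paper only names the components):

    from dataclasses import dataclass, field

    @dataclass
    class Zone:
        zonemaster: str = ""  # assigns data to spanservers
        spanservers: list[str] = field(default_factory=list)       # serve data to clients
        location_proxies: list[str] = field(default_factory=list)  # used by clients to locate data

    @dataclass
    class Universe:
        zones: list[Zone] = field(default_factory=list)
        universe_master: str = ""   # status console for the zones
        placement_driver: str = ""  # moves data across zones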

Spanserver
tablets: (key, timestamp) → string
stored on Colossus as B-trees and a write-ahead log (WAL)
a Paxos state machine per tablet
Leader:
long-lived leader leases
lock table
transaction manager
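
A tablet is a multiversion map. A minimal in-memory sketch of that mapping (a dict stands in for the B-tree files on Colossus; this is illustration, not the real layout):

    from bisect import bisect_right
    from collections import defaultdict

    class Tablet:
        def __init__(self):
            # key -> sorted list of (timestamp, value) versions
            self._versions = defaultdict(list)

        def write(self, key: str, ts: int, value: str) -> None:
            self._versions[key].append((ts, value))
            self._versions[key].sort()

        def read(self, key: str, ts: int) -> str | None:
            # Latest version with timestamp <= ts, or None.
            versions = self._versions[key]
            i = bisect_right(versions, ts, key=lambda p: p[0])
            return versions[i - 1][1] if i else None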

TrueTime
Idea: expose clock uncertainty.
Time masters: GPS or atomic clocks (Armageddon masters).
A timeslave daemon on every machine polls a variety of masters.
Marzullo's algorithm is used to detect liars.
Malfunctioning masters and clients are evicted.
Assumed upper bound on clock drift: 200 µs/s.
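
The paper's TrueTime API exposes this uncertainty as an interval. A minimal sketch, with a fixed epsilon standing in for the bound the timeslave daemon actually derives from master uncertainty, polling interval, and drift:

    import time
    from dataclasses import dataclass

    EPSILON = 0.007  # illustrative uncertainty bound, in seconds

    @dataclass
    class TTInterval:
        earliest: float
        latest: float

    def tt_now() -> TTInterval:
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)

    def tt_after(t: float) -> bool:
        # True once t has definitely passed.
        return t < tt_now().earliest

    def tt_before(t: float) -> bool:
        # True while t has definitely not arrived.
        return t > tt_now().latest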

Transactions

Operation      | Concurrency control | Replica required
RW transaction | pessimistic         | leader
RO transaction | lock-free           | leader (for the timestamp), then any
Snapshot read  | lock-free           | any

RW transactions
Two-phase locking; timestamps are assigned while all locks are held.
Disjoint leader lease intervals.
Start: the coordinator leader assigns a commit timestamp s ≥ TT.now().latest after receiving the commit request, and greater than all prepare timestamps previously issued.
Commit wait: clients cannot see any data committed by the transaction until TT.after(s) is true.
Wound-wait deadlock avoidance.
The client drives two-phase commit using the identity of the coordinator.
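
A minimal sketch of the two timestamp rules and of commit wait, reusing tt_now() and tt_after() from the TrueTime sketch above (coordinator bookkeeping is reduced to a single max_prepare_ts value):

    import time

    TICK = 1e-6  # illustrative stand-in for "strictly greater than"

    def assign_commit_timestamp(max_prepare_ts: float) -> float:
        # s >= TT.now().latest, and s greater than every prepare
        # timestamp the coordinator leader has issued so far.
        return max(tt_now().latest, max_prepare_ts + TICK)

    def commit_wait(s: float) -> None:
        # Keep the commit invisible until s has definitely passed;
        # this is what makes commit timestamps externally consistent.
        while not tt_after(s):
            time.sleep(0.001)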

Snapshot reads
Safe time: the maximum timestamp at which a replica is up to date; the replica can serve a read at t whenever t ≤ safe time.
Safe time is the minimum of:
the timestamp of the highest-applied Paxos write
the prepare timestamps of prepared (but not committed) transactions
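
A sketch of that computation for one replica (the small offset below the minimum prepare timestamp mirrors the paper's "prepare timestamp minus one"):

    def safe_time(highest_applied_paxos_ts: float,
                  prepare_ts_of_prepared_txns: list[float]) -> float:
        t = highest_applied_paxos_ts
        if prepare_ts_of_prepared_txns:
            # A prepared-but-uncommitted transaction may still commit
            # at or after its prepare timestamp, so stay below it.
            t = min(t, min(prepare_ts_of_prepared_txns) - 1e-6)
        return t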

RO transactions
A read timestamp needs to be assigned.
A scope expression is required to negotiate the timestamp among all Paxos groups involved.
The timestamp is either TT.now().latest...
... or, when a single Paxos group serves the whole scope, the timestamp of the last committed write at that group.
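
A sketch of that choice (LastTS() is the paper's name for the timestamp of the last committed write at a group; tt_now() comes from the TrueTime sketch above):

    def ro_read_timestamp(group_ids: list[str],
                          last_commit_ts: float | None) -> float:
        # Single-group scope: read at the group's last committed
        # write (LastTS()); no negotiation or waiting needed.
        if len(group_ids) == 1 and last_commit_ts is not None:
            return last_commit_ts
        # Multi-group scope: TT.now().latest is always valid, though
        # the read may block until replicas' safe time catches up.
        return tt_now().latest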

Q&A