
Spanner: Google's Globally-Distributed Database*
Huu-Phuc Vo, August 03, 2013
*OSDI '12, James C. Corbett et al. (26 authors), Jay Lepreau Best Paper Award

Outline
- What is Spanner? Features & example
- Structure of Spanner's implementation
- Data model
- Version management
- External consistency & the novel TrueTime API
- Lock-free read-only transactions
- Performance

What is Spanner?
- A scalable, multi-version, globally-distributed, synchronously-replicated database
- Goal: make building rich applications easy at Google scale
- Key idea: a transactional storage system that is replicated globally, managing cross-datacenter replicated data and managing consistency globally

Features
- Externally consistent reads & writes
- Non-blocking reads in the past
- Lock-free read-only transactions
- Atomic schema changes
- SQL-like query language
Application support: Google's advertising backend, which replaced a sharded MySQL database

Key ideas
- Relational data model with SQL & general-purpose transactions
- External consistency: transactions can be ordered by their commit time, and commit times correspond to real-world notions of time
- Paxos-based replication, with the number of replicas and the distance to replicas controllable
- Data partitioned across thousands of servers

Example: Global Social Network
- Spanner maintains a single system image: it looks like a single database
- The data is sharded by region (US, Brazil, Spain, Russia)
- User and post data can be sharded in many ways across datacenters
- Shards are replicated and stored across datacenters, e.g. San Francisco, Seattle, and Arizona for the US; Sao Paulo and Santiago for Brazil; London, Paris, Berlin, and Madrid for Spain; Moscow and Berlin for Russia

Structure of Spanner's Implementation
- Spanserver software stack
- Directories are the unit of data movement between Paxos groups

Spanserver software stack
- Transaction manager (TM): supports distributed transactions
- Participant leader: implemented using the TM
- Lock table: implements concurrency control
- Leader: a replica selected among the replicas of a Paxos group
- Paxos state machine: supports data replication
- Tablet: an instance of a Bigtable-like data structure
- Colossus: the underlying distributed file system (Google File System's successor)
A sketch of how these pieces fit together follows.
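To make the layering concrete, here is a minimal Python sketch of how these pieces could fit together; every class, field, and name below is an illustrative assumption, not Spanner's actual code.

    # Hypothetical sketch of the spanserver layering described on this slide.
    # All class and field names are illustrative assumptions, not Spanner's code.
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional


    @dataclass
    class Tablet:
        """Bag of (key, timestamp) -> value mappings, persisted on Colossus."""
        data: Dict[tuple, bytes] = field(default_factory=dict)


    @dataclass
    class PaxosStateMachine:
        """One Paxos state machine per tablet; replicates the tablet's write log."""
        tablet: Tablet
        replica_addrs: List[str]                 # the other replicas in the group
        leader: Optional[str] = None             # the currently elected leader


    @dataclass
    class LockTable:
        """Kept at the Paxos leader; maps keys to lock states for 2PL."""
        locks: Dict[str, str] = field(default_factory=dict)


    @dataclass
    class TransactionManager:
        """Supports distributed (multi-group) transactions via two-phase commit."""
        participant_groups: List[str] = field(default_factory=list)


    @dataclass
    class Spanserver:
        """Each spanserver manages many tablets, each with its own Paxos group."""
        paxos_groups: List[PaxosStateMachine]
        lock_table: LockTable
        txn_manager: TransactionManager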

Structure of Spanner's Implementation
- A Spanner deployment is called a universe
- Zones: the set of locations across which data can be distributed; there can be more than one zone in a datacenter
- Servers in a Spanner universe:
  - universe master: a console that displays status information about all zones, for debugging
  - placement driver: handles automated movement of data across zones on a timescale of minutes, communicating with spanservers
  - location proxies: used by clients to locate the spanservers that serve their data

Directory: the unit of data placement
- Data is moved between Paxos groups one directory at a time
- Purposes of moving a directory:
  - put frequently co-accessed directories into the same group
  - move a directory into a group that is closer to its accessors
  - load balancing
- Directories can be moved while client operations are ongoing
- A 50 MB directory can be moved in a few seconds

Data Model
- Schematized semi-relational tables, motivated by the popularity of Megastore
- A query language, motivated by the popularity of Dremel
- General-purpose transactions, whose absence was a common complaint about Bigtable
- Not purely relational: rows must have names (every table requires a primary key)
- Databases must be partitioned by applications into hierarchies of interleaved tables (see the sketch below)
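As an illustration of that interleaving, the sketch below (a simplifying assumption, not Spanner's actual storage format; the Users/Albums schema is the example from the paper) encodes child rows under their parent's key, so a user row and that user's album rows are contiguous in key order; such a contiguous range is what Spanner calls a directory.

    # Illustrative sketch of hierarchical interleaving: Albums rows are keyed by
    # their parent Users key plus their own key, so sorting by key keeps a user
    # and that user's albums physically adjacent (forming a directory).
    users = {"u1": "Alice", "u2": "Bob"}
    albums = {("u1", "a1"): "Holiday", ("u1", "a2"): "Work", ("u2", "a1"): "Pets"}

    # Encode each row key as (user_id, table_tag, child_id); the parent row gets
    # tag 0 and child rows tag 1, so parents sort immediately before children.
    rows = {}
    for uid, name in users.items():
        rows[(uid, 0, "")] = ("Users", name)
    for (uid, aid), title in albums.items():
        rows[(uid, 1, aid)] = ("Albums", title)

    for key in sorted(rows):
        print(key, "->", rows[key])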

Version Management
- A timestamp is used to order all writes of a transaction
- If T2 starts after T1 commits, then T2's commit timestamp must be greater than T1's commit timestamp
- Read-write transactions use strict two-phase locking (strict 2PL)
  - strict 2PL: a transaction may request locks at any time before its actions, and releases its locks only when it commits (see the sketch below)
- Each transaction T is assigned a timestamp s; data written by T is timestamped with s
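The following is a minimal, hedged sketch of this strict-2PL-plus-commit-timestamp discipline: single-process, wall-clock timestamps, no TrueTime and no replication, with all names chosen purely for illustration.

    # Minimal single-process sketch of strict 2PL with one commit timestamp per
    # transaction. The lock manager, wall-clock timestamps, and storage layout
    # are simplifying assumptions (no TrueTime, no replication).
    import threading
    import time


    class Store:
        def __init__(self):
            self.versions = {}              # key -> [(commit_ts, value), ...]
            self._locks = {}

        def lock_for(self, key):
            return self._locks.setdefault(key, threading.Lock())


    class Transaction:
        def __init__(self, store):
            self.store = store
            self.held = []                  # locks acquired so far (growing phase)
            self.writes = {}                # key -> value, buffered until commit

        def write(self, key, value):
            lock = self.store.lock_for(key)
            if lock not in self.held:
                lock.acquire()              # take the lock before touching the data
                self.held.append(lock)
            self.writes[key] = value

        def commit(self):
            commit_ts = time.time()         # one timestamp stamps all of T's writes
            for key, value in self.writes.items():
                self.store.versions.setdefault(key, []).append((commit_ts, value))
            for lock in self.held:          # strict 2PL: release only at commit
                lock.release()
            return commit_ts


    store = Store()
    t = Transaction(store)
    t.write("user:u1", "Alice")
    print(t.commit(), store.versions)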

Concurrency Control
- Spanner uses Multi-Paxos
- Each spanserver runs one Paxos instance for each tablet of data it stores
- Each tablet's Paxos group agrees on a group leader
- All writes are issued by the leader to the replicas
- There are never concurrent outstanding writes to the same object within a replica group
- As long as the leader is stable, only one round of communication is needed between the leader and the replicas to confirm a write

External Consistency
- External consistency: transactions can be ordered by their commit time, and commit times correspond to real-world notions of time
- Synchronizing snapshots: timestamp order = commit order
- Ensuring consistent clock values:
  - GPS receivers and atomic clocks on some nodes
  - network time-synchronization protocols (nodes exchange clock readings and adjust accordingly)
  - small differences between clocks remain, so Spanner defines TrueTime to handle them

External Consistency
Example (node = spanserver):
- Suppose nodes N1/N2 run transaction T1 and nodes N3/N4 run transaction T2
- N1/N2's clocks run 30 seconds ahead of N3/N4's
- T1 commits at time T=0 according to N1/N2 (which is T=-30 according to N3/N4)
- T2 commits later, yet its commit timestamp is earlier than T1's from an external observer's point of view
- If T1 & T2 both wrote a data item D, a replica recovering by replaying writes in timestamp order would apply T2 before T1, leaving T1's older value as the final state of D

TrueTime
- The TrueTime API exposes the uncertainty of the node's clock: TT.now() returns an interval [earliest, latest] guaranteed to contain the true time
- Global clock time with bounded uncertainty
- ε: the instantaneous error bound (half of the interval width); see the sketch below
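A minimal sketch of such an interval clock follows. The method names mirror the paper's TT.now / TT.after / TT.before, but the implementation (a fixed ε of a few milliseconds around the local clock; the paper reports ε typically between about 1 ms and 7 ms) is purely illustrative.

    # Minimal sketch of a TrueTime-style interval clock. The method names mirror
    # the paper's TT.now / TT.after / TT.before, but the implementation (a fixed
    # epsilon around the local clock) is purely illustrative.
    import time
    from dataclasses import dataclass


    @dataclass
    class TTInterval:
        earliest: float
        latest: float


    class TrueTime:
        def __init__(self, epsilon_seconds=0.004):
            # epsilon is the instantaneous error bound: half of the interval width.
            self.epsilon = epsilon_seconds

        def now(self) -> TTInterval:
            t = time.time()
            return TTInterval(earliest=t - self.epsilon, latest=t + self.epsilon)

        def after(self, t: float) -> bool:
            """True only if t has definitely passed on every clock in the system."""
            return self.now().earliest > t

        def before(self, t: float) -> bool:
            """True only if t has definitely not yet arrived anywhere."""
            return self.now().latest < t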

TrueTime Example
- How do we ensure that T1's commit timestamp is earlier than T2's?
- Ensure that T2 does not commit until after t1.latest
- Requiring t1.latest < t2.earliest directly would mean the two transactions have to know about each other
- Instead, if T1 holds its locks until after t1.latest has passed, then T1 commits before T2 commits (see the commit-wait sketch below)
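A hedged sketch of that commit-wait rule, reusing the illustrative TrueTime class above (the helper name and polling interval are assumptions):

    # Hedged sketch of commit wait: pick a commit timestamp no smaller than
    # TT.now().latest, then hold locks until that timestamp has definitely
    # passed, so any transaction that starts afterwards gets a larger timestamp.
    # `tt` is expected to behave like the illustrative TrueTime class above.
    import time


    def commit_with_wait(tt, release_locks):
        s = tt.now().latest          # commit timestamp, chosen while locks are held
        while not tt.after(s):       # commit wait: until s is guaranteed to be past
            time.sleep(0.0005)
        release_locks()              # only now release locks / report the commit
        return s


    # Example (using the TrueTime sketch above): commit_with_wait(TrueTime(), lambda: None)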

TrueTime Architecture
- GPS and atomic-clock timemasters cross-check each other
- Timeslave daemons poll a variety of masters to reduce vulnerability to errors from any single master
- Each daemon computes a reference interval [earliest, latest] = now ± ε (see the sketch below)
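To illustrate the arithmetic (a sketch under assumptions drawn from the paper: a 30-second poll interval, an applied worst-case drift of 200 µs/s, and roughly 1 ms of uncertainty right after a sync), ε grows between polls and is reset on each sync, giving a sawtooth of about 1-7 ms:

    # Illustrative arithmetic only: how the uncertainty bound epsilon grows
    # between time-master polls. Assumed numbers: a 30-second poll interval,
    # an applied worst-case drift of 200 us/s, and ~1 ms right after a sync.
    DRIFT_RATE = 200e-6      # seconds of drift per second of elapsed time
    BASE_EPSILON = 0.001     # uncertainty right after syncing with the masters


    def epsilon_at(seconds_since_sync: float) -> float:
        return BASE_EPSILON + DRIFT_RATE * seconds_since_sync


    # Polling every 30 s yields a sawtooth between roughly 1 ms and 7 ms.
    for t in (0, 10, 20, 30):
        print(f"{t:2d}s after sync: epsilon = {epsilon_at(t) * 1000:.1f} ms")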

Commit & 2-Phase Commit (three diagram-only slides; a generic sketch of the protocol follows)
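Since the slides above were diagrams only, here is a generic, hedged sketch of two-phase commit run by a coordinator across the participant groups involved in a transaction; the class and method names are illustrative assumptions, not Spanner's actual interfaces.

    # Generic sketch of two-phase commit across participant groups.
    # Class and method names are illustrative assumptions.
    class Participant:
        def __init__(self, name):
            self.name = name

        def prepare(self, txn_id) -> bool:
            # A real participant leader would acquire locks, log a prepare record
            # through Paxos, and pick a prepare timestamp before voting yes.
            return True

        def commit(self, txn_id):
            print(f"{self.name}: commit {txn_id}")

        def abort(self, txn_id):
            print(f"{self.name}: abort {txn_id}")


    def two_phase_commit(txn_id, participants):
        # Phase 1: ask every participant to prepare and collect their votes.
        votes = [p.prepare(txn_id) for p in participants]
        # Phase 2: commit only if every participant voted yes; otherwise abort.
        if all(votes):
            for p in participants:
                p.commit(txn_id)
            return "committed"
        for p in participants:
            p.abort(txn_id)
        return "aborted"


    print(two_phase_commit("txn-1", [Participant("group-A"), Participant("group-B")]))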

Paxos-based replication
- Spanner's Paxos implementation uses timed (10-second) leader leases to make leadership long-lived
- It tolerates the failure or disconnection of nodes and datacenters
- Idea: ensure that all writes & reads go to a quorum of nodes
- Quorum: a simple majority; with N nodes, read from & write to floor(N/2) + 1 nodes (see the sketch below)
- In practice, send write & read requests to all nodes, but wait for only a quorum of them to complete
- This is the quorum-based replication covered last time
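A small runnable sketch of the quorum idea described above; the in-memory replicas and all names here are stand-ins chosen for illustration, not a real replication protocol.

    # Sketch of quorum-based replication: send each write to all replicas, but
    # treat it as durable once a simple majority has acknowledged it.
    import concurrent.futures


    def quorum_size(n_replicas: int) -> int:
        """A simple majority: floor(N/2) + 1."""
        return n_replicas // 2 + 1


    class Replica:
        """Stand-in replica that just stores writes in a local dict."""
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value
            return True


    def replicate_write(replicas, key, value):
        """Send the write to every replica, but return once a quorum has acked."""
        needed = quorum_size(len(replicas))
        acked = 0
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(r.apply, key, value) for r in replicas]
            for done in concurrent.futures.as_completed(futures):
                if done.result():
                    acked += 1
                    if acked >= needed:
                        return True      # durable: a majority holds the write
        return False


    replicas = [Replica() for _ in range(5)]
    print(quorum_size(5))                               # -> 3
    print(replicate_write(replicas, "row-1", "hello"))  # -> True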

Performance
Micro-benchmarks for Spanner:
- a single spanserver per zone
- enough load to saturate the spanserver's CPU
- all data served out of memory, to measure the overhead of the Spanner stack
- about 2,500 transactional writes/s and 15,000 transactional reads/s per CPU

Summary
- Spanner blends and extends a semi-relational interface, transactions, and an SQL-based query language with scalability, automatic sharding, fault tolerance, consistent replication, and wide-area distribution
- TrueTime is the other key piece of Spanner: it enables external consistency by synchronizing time accurately across the distributed system and by exposing the clock's uncertainty explicitly in the time API

Thank you