Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google's Globally-Distributed Database.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google's Globally-Distributed Database. Presented by Kewei Li

The Problem
db: complex, legacy, tuning, expensive
nosql: simple, clean, just enough
As apps become more complex and need to be more scalable: newsql

The Problem

          Guarantees                      Scalability
db        Supports transactions (ACID)    Hard to scale beyond a few machines
nosql     Eventual consistency            Highly scalable & available

The Problem

          Guarantees                      Scalability
db        Supports transactions (ACID)    Hard to scale beyond a few machines
newsql    (aims for both)
nosql     Eventual consistency            Highly scalable & available

MegaStore: Key Features
- Partition data into entity groups (chosen by the user)
- ACID semantics within each entity group
- Availability is achieved by replication

MegaStore: Key Features
- Partition data into entity groups (chosen by the user)
- ACID semantics within each entity group
- Availability is achieved by replication
- ACID semantics across groups are expensive (asynchronous queues are recommended)

MegaStore: Key Features
- Google Bigtable (NoSQL) is used as the backing data store within each data center
- Cost-transparent API (implement your own joins)
- Some control over physical layout

The write-ahead log for each entity group is stored here (with the entity group, in Bigtable).
- Committed: the write is in the log
- Applied: the write has been written to the underlying data

MegaStore: Read Transactions
- Current: apply all previously committed writes (in the log), then read; can only be done within an entity group
- Snapshot: look up the timestamp of the latest applied write and read from there
- Inconsistent: read the latest values from applied writes
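To make the three read modes concrete, here is a minimal in-memory sketch, not Megastore's API: EntityGroup, its log/data fields, and the read functions are illustrative names, and the real system reads from Bigtable and a replicated write-ahead log.

```python
# Minimal in-memory model of one entity group: a write-ahead log of committed
# (timestamp, mutations) entries plus the state produced by applying a prefix
# of that log. Illustrative only; Megastore stores both in Bigtable.

class EntityGroup:
    def __init__(self):
        self.log = []          # committed entries: [(timestamp, {key: value}), ...]
        self.data = {}         # state after applying the first `applied` entries
        self.applied = 0       # number of log entries applied so far
        self.applied_ts = 0    # timestamp of the latest applied entry

    def catch_up(self):
        # Apply every committed-but-unapplied log entry, in order.
        while self.applied < len(self.log):
            ts, mutations = self.log[self.applied]
            self.data.update(mutations)
            self.applied_ts = ts
            self.applied += 1

def current_read(group, key):
    # Current: apply all previously committed writes (in the log), then read.
    group.catch_up()
    return group.data.get(key)

def snapshot_read(group, key):
    # Snapshot: read as of the timestamp of the latest *applied* write,
    # without waiting for committed-but-unapplied entries.
    return group.applied_ts, group.data.get(key)

def inconsistent_read(group, key):
    # Inconsistent: just return the latest applied value, no guarantees.
    return group.data.get(key)
```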

MegaStore: Write within an Entity Group
1. Current read: obtain the timestamp and log position of the last committed transaction.
2. Application logic: read from Bigtable and gather writes into a log entry.
3. Commit: use Paxos to achieve consensus on appending that entry to the log.
4. Apply: write mutations to the entities and indexes in Bigtable.
5. Clean up: delete data that is no longer required.
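A sketch of this five-step write path under the same toy EntityGroup model from the read sketch above; paxos_append and ConflictError are illustrative stand-ins for a real Paxos round and conflict handling.

```python
class ConflictError(Exception):
    pass

def paxos_append(group, position, entry):
    # Stand-in for one Paxos instance on the entity group's log: the append
    # succeeds only if no other writer has already claimed this log position.
    if position != len(group.log):
        return False
    group.log.append(entry)
    return True

def write_transaction(group, build_mutations):
    # 1. Current read: timestamp and log position of the last committed txn.
    log_position = len(group.log)
    last_ts = group.log[-1][0] if group.log else 0

    # 2. Application logic: read the current state, gather writes into one entry.
    group.catch_up()
    mutations = build_mutations(group.data)

    # 3. Commit: use "Paxos" to agree on appending the entry to the log.
    if not paxos_append(group, log_position, (last_ts + 1, mutations)):
        raise ConflictError("lost the race for this log position; retry")

    # 4. Apply: write the mutations to the underlying data (and indexes).
    group.catch_up()

    # 5. Clean up: delete log/data that is no longer required (omitted here).

g = EntityGroup()
write_transaction(g, lambda data: {"photo:1:title": "Sunset"})
print(current_read(g, "photo:1:title"))   # Sunset
```

Because each log position is decided by an independent Paxos instance, concurrent writers to the same entity group race for the same position, which is why Megastore supports only a few writes per second per group (see the weaknesses slide).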

MegaStore: Write across Entity Groups
- Asynchronous queues (message sending)
- Two-phase commit

MegaStore: Replication and Paxos
- The coordinator keeps track of which entity groups (within a replica) have observed all writes; those entity groups can respond immediately to reads
- Writes are agreed upon by Paxos (one Paxos instance per write)
- Replicas can also be witnesses or read-only
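As a rough illustration of the coordinator's bookkeeping, here is a sketch; the Coordinator class and its method names are hypothetical, and in the real system the coordinator is a separate per-replica server whose validity is tied to Chubby locks (see the failures slide below).

```python
class Coordinator:
    """Per-replica bookkeeping of which entity groups have observed all writes."""

    def __init__(self):
        self.up_to_date = set()

    def may_read_locally(self, entity_group_id):
        # Fast path: this replica has seen every write for the group, so a
        # current read can be served without contacting other replicas.
        return entity_group_id in self.up_to_date

    def invalidate(self, entity_group_id):
        # A write this replica did not accept must invalidate the group
        # before the write is acknowledged.
        self.up_to_date.discard(entity_group_id)

    def validate(self, entity_group_id):
        # Once the replica has caught up on the group's log, mark it fresh.
        self.up_to_date.add(entity_group_id)

coord = Coordinator()
coord.validate("photos-123")
print(coord.may_read_locally("photos-123"))   # True
coord.invalidate("photos-123")
print(coord.may_read_locally("photos-123"))   # False
```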

MegaStore: Replication and Paxos

MegaStore: Read Algorithm

MegaStore: Write Algorithm

MegaStore: Failures & Debugging
- Coordinators are only valid while they hold their Chubby locks
- If a coordinator fails:
  - It (conservatively) loses its Chubby locks
  - Reads are processed by other replicas
  - Writes wait until it is certain that the coordinator has failed
- If a replica fails:
  - Reroute traffic away from it
  - Invalidate its unhealthy coordinator if it is still holding Chubby locks
  - Or disable the replica entirely (impacts availability)
- Synchronization issues are debugged through a pseudo-random test framework

MegaStore: Deployment

MegaStore: Weaknesses
- Can only support a few writes per second per entity group, due to Paxos conflicts
- Users need to carefully partition data into entity groups
- Global ACID semantics are expensive due to two-phase commit
- Slow

Spanner: Key Features
- Supports complicated transactions, but is less cost-transparent
- Global ACID semantics
- Availability through replication
- Rebalancing of data across spanservers

Spanner: Spanserver Components (diagram)
- Coordinates two-phase commit across Paxos groups (transaction manager)
- Provides synchronization within a Paxos group (lock table)
- The leader is chosen by Paxos
- Data stored as B-tree-like tablets on a distributed file system
- Replicas provide availability

Spanner: Directories
- Administrators can specify regions and the number & type of replicas
- Data is grouped into directories, which applications can assign to regions
- Movedir is a background process that can move directories across Paxos groups

Spanner: Data Model

Spanner: TrueTime
- An invocation of TT.now() returns an interval that is guaranteed to contain the absolute time at which the call executed
- Implemented using GPS-synchronized and atomic clocks (with algorithms to detect outliers)
- The time uncertainty typically grows from 1 ms to 7 ms between synchronizations (every 30 s), though overload or network conditions can cause spikes
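A minimal sketch of the TrueTime interface described above, assuming a fixed uncertainty epsilon for illustration; real TrueTime derives its bound from the GPS/atomic-clock infrastructure, and TTInterval/TrueTime here are illustrative Python stand-ins (TT.after and TT.before mirror the API named in the paper).

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float   # absolute time is guaranteed to be >= earliest
    latest: float     # ... and <= latest

class TrueTime:
    def __init__(self, epsilon_ms=7.0):
        # Assumed fixed uncertainty; the real epsilon varies (roughly 1-7 ms).
        self.epsilon = epsilon_ms / 1000.0

    def now(self):
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True once t has definitely passed, on every clock in the system.
        return self.now().earliest > t

    def before(self, t):
        # True while t has definitely not yet arrived.
        return self.now().latest < t

tt = TrueTime()
iv = tt.now()
print(iv.latest - iv.earliest)   # 2 * epsilon, in seconds
```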

Spanner: Timestamping Writes
Spanner always fulfills these timestamping invariants for writes:
- Start: the timestamp for a write is no earlier than the time the write arrived at the server
- Commit Wait: the commit timestamp is already in the past by the time the commit actually takes effect
- Within a Paxos group, the leader can use locks to ensure these invariants (leader leases last 10 s; there can only be one leader at a time)
- Across Paxos groups, all participants send prepare messages to the coordinator leader:
  - The coordinator leader chooses a commit timestamp later than all received prepare timestamps (to fulfill Start)
  - It then waits until that timestamp has passed before allowing participants to commit (to fulfill Commit Wait)
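A sketch of how a coordinator leader could enforce the two invariants, reusing the TrueTime class from the sketch above; choose_commit_timestamp and commit_wait are illustrative names, not Spanner's actual code.

```python
import time

def choose_commit_timestamp(tt, prepare_timestamps):
    # Start invariant: the commit timestamp must be no earlier than the time
    # the commit request arrived, and no earlier than any prepare timestamp.
    return max([tt.now().latest] + list(prepare_timestamps))

def commit_wait(tt, s, poll_interval=0.001):
    # Commit Wait invariant: do not make the commit visible until s is
    # guaranteed to be in the past on every clock, i.e. TT.after(s) holds.
    while not tt.after(s):
        time.sleep(poll_interval)

tt = TrueTime()   # from the TrueTime sketch above
s = choose_commit_timestamp(tt, prepare_timestamps=[])
commit_wait(tt, s)                # blocks for roughly 2 * epsilon
print("commit visible at", s)
```

Since the wait is roughly twice the clock uncertainty, keeping epsilon small directly reduces write latency.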

Spanner: Timestamping Reads
- Every replica tracks a safe time t_safe, the maximum timestamp at which it is up to date: t_safe = min(t_safe^Paxos, t_safe^TM)
- t_safe^Paxos is the timestamp of the last applied Paxos write
- t_safe^TM is infinity if there are no prepared-but-not-committed transactions (phase one of a two-phase write); otherwise it is the minimum of the prepare timestamps across all the transaction managers
- Reads can be served from any replica with a sufficiently high t_safe; you can request a timestamp (snapshot) or have one assigned
- If the read can be served by a single Paxos group, the assigned timestamp is that group's latest commit; if the read spans groups, it is TT.now().latest
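A sketch of the safe-time rule above; t_safe, last_applied_paxos_write, and prepared_timestamps are illustrative parameter names for this example.

```python
import math

def t_safe(last_applied_paxos_write, prepared_timestamps):
    # t_safe^Paxos: timestamp of the last applied Paxos write.
    t_paxos = last_applied_paxos_write
    # t_safe^TM: infinity if nothing is prepared-but-uncommitted; otherwise
    # the minimum prepare timestamp across the transaction managers.
    t_tm = min(prepared_timestamps) if prepared_timestamps else math.inf
    return min(t_paxos, t_tm)

def can_serve_read(read_ts, last_applied_paxos_write, prepared_timestamps):
    # A replica may serve a read at read_ts once its safe time has reached it.
    return t_safe(last_applied_paxos_write, prepared_timestamps) >= read_ts

print(t_safe(100.0, []))                      # 100.0
print(t_safe(100.0, [95.0]))                  # 95.0
print(can_serve_read(98.0, 100.0, [95.0]))    # False: a prepared txn at 95 blocks it
```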

Spanner: Other Timestamping Refinements
- t_safe^TM can be augmented with finer-grained locking so that reads can proceed while a two-phase write is in progress
- t_safe^Paxos is advanced every 8 seconds by the Paxos leader
- Schema changes can be timestamped and completed in the background

Spanner: Performance Testing
[Figure: two-phase commit scalability over 10 runs; latency (ms) vs. number of participants, mean and 99th percentile]

Spanner: Performance Testing Availability: killing a server

Spanner: Performance Testing Availability: true time error

Spanner: Performance Testing F1 deployment: distribution of fragments and latency measurements

Spanner: Performance Testing
F1 benefits over sharded MySQL:
- No need for manual resharding
- Better failover management (F1 needs transactions, so NoSQL won't work)
Future work:
- TrueTime improvements (better hardware)
- Scale-up within each server