Lecture 17 : Distributed Transactions 11/8/2017

Today: Two-phase commit.

Last time: Parallel query processing.

Recap -- main ways to get parallelism:

Across queries:
- run multiple queries simultaneously

Within a query:
- run groups of operators in separate threads ("pipelining")
- partition data across multiple processors / threads ("partitioning")

Partitioning is the main way databases parallelize a single query -- it is the only way to get high levels of speedup, because pipelines are short and blocking operators are common.

Three main types of partitioning: hash, range, and round robin.

We talked about join algorithms; there are several options:
0) If the tables are partitioned properly, nothing to do
1) Copy the smaller table to every node
2) Re-partition one or both tables (depending on the initial partitioning)

Ex: Join A.a = B.b, where A is partitioned on a and B is partitioned on b. No repartitioning work to do!
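Option 2 can be sketched in a few lines, assuming a toy hash function and in-memory "nodes" (all names and data here are hypothetical; a real system streams tuples over the network rather than building lists):

```python
# Sketch: repartition both tables on the join key, then join each
# partition locally. Node assignment uses hash(key) mod n, so matching
# keys from A and B always land on the same node.
from collections import defaultdict

def repartition(rows, key_index, n):
    """Route each row to a node by hashing its join key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key_index]) % n].append(row)
    return nodes

def local_join(a_rows, b_rows):
    """Hash join within one node: build on A, probe with B."""
    build = defaultdict(list)
    for a in a_rows:
        build[a[0]].append(a)          # A's join column a is index 0
    return [a + b for b in b_rows for a in build[b[0]]]  # B's join column b is index 0

n = 3
A = [(1, "x"), (2, "y"), (3, "z")]
B = [(2, "p"), (3, "q"), (4, "r")]
a_parts, b_parts = repartition(A, 0, n), repartition(B, 0, n)
result = [t for i in range(n) for t in local_join(a_parts[i], b_parts[i])]
# Keys 2 and 3 match; each join result is the A-tuple concatenated with the B-tuple.
```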

SELECT * FROM A, B WHERE A.a = B.b

n machines, hash function H:
  Hash A on a, H(a) -> 1..n
  Hash B on b, H(b) -> 1..n

Ex: Join A.a = B.b, where A is already partitioned on A.a:
  Hash B on b, H(b) -> 1..n

This generalizes to the case when both tables have to be repartitioned.

Aggregation: can run aggregates in parallel, then merge the groups.

Example: SELECT avg(f) FROM A

Standard way of expressing aggregates:
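Each aggregate is expressed as an INIT / MERGE / FINAL triple, where MERGE must be associative and commutative so partial states can be combined in any order. A minimal sketch for avg(f), with hypothetical function names (the lecture names the pattern but not this code):

```python
# Sketch of INIT / MERGE / FINAL for avg(f): each node aggregates its
# partition into a (sum, count) pair; merging just adds the pairs, so
# merges can happen in any order (associative & commutative).

def init():
    return (0.0, 0)                      # running (sum, count)

def accumulate(state, value):            # per-tuple update on one node
    s, c = state
    return (s + value, c + 1)

def merge(s1, s2):                       # combine two nodes' partial states
    return (s1[0] + s2[0], s1[1] + s2[1])

def final(state):                        # compute the actual average
    s, c = state
    return s / c if c else None

# Two "nodes" each aggregate their partition, then we merge.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]
partials = []
for part in partitions:
    st = init()
    for v in part:
        st = accumulate(st, v)
    partials.append(st)

combined = partials[0]
for p in partials[1:]:
    combined = merge(combined, p)
print(final(combined))  # avg of 1..5 = 3.0
```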

INIT
MERGE -- associative & commutative
FINAL

OK -- so now let's talk about distributed transaction processing. The primary way we achieve this is two-phase commit (2PC).

What's the purpose of two-phase commit? It handles distributed transactions, running on multiple machines, where the machines can fail independently.

Transaction:
  BEGIN
  INSERT a ----> S1
  INSERT b ----> S2
  INSERT c ----> S3
  COMMIT

Why doesn't the existing commit protocol work? Suppose S1 crashes, but S2 and S3 succeed. S2 and S3 shouldn't commit unless S1 commits.

What's the architecture here?

                     ---> SUBORD1
                    /
  USER ---> COORD ------> SUBORD2
                    \
                     ---> SUBORD3

- Query statements arrive at COORD.
- COORD sends the statements to the SUBORDs:

  C                 S1          S2          S3
  ------Insert----->
  <-------OK--------
  ------------------Insert----->
  <-------------------OK--------
  ------------------------------Insert----->
  <-------------------------------OK--------
  ------Prepare----->
  ------Prepare---------------->
  ------Prepare---------------------------->

- Once COORD gets a COMMIT statement from the USER, it initiates 2PC.

Why do we need such a complicated protocol? Why not just send COMMIT messages to all sites once they've finished their share of the work? One of them might fail during COMMIT processing, which would require us to be able to ABORT subordinates that have already committed.

So how does 2PC avoid this? Via PREPARE.

What does a site being PREPARED mean? If a subordinate site is prepared, it is ready and able to COMMIT the xaction, and it must retain that ability even if it crashes. If all sites are PREPARED, the coordinator can unilaterally decide to abort or commit. (Requires logging!)

OK -- let's look at the protocol. It requires careful interleaving of logging and messaging.

Preliminaries:
- Each site has a recovery process that keeps track of the fate of transactions running on the site, and contacts / responds to contacts from remote sites about running transactions.
- If a site is prepared and crashes, on recovery it needs to ask the coordinator about the outcome of the transaction.
- So when can the coordinator / worker forget about the state of the transaction? The coordinator needs to make sure all workers have learned the xaction's outcome, and workers need to make sure the coordinator has learned their "vote".

What is in the log records?
  all records: tid, coordid
  prepare records: list of locks held
  coord commit/abort records: list of subords
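A sketch of those log-record contents as data structures (field names are illustrative, not from any real system):

```python
# Sketch of the 2PC log records described above. Every record carries
# the transaction id and the coordinator's id; prepare records also
# list the locks held (needed to reacquire them on recovery); the
# coordinator's commit/abort record lists the subordinates so it knows
# whom to notify and whose ACKs to collect.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogRecord:
    tid: int                 # transaction id -- in all records
    coord_id: str            # coordinator id -- in all records

@dataclass
class PrepareRecord(LogRecord):
    locks_held: List[str] = field(default_factory=list)

@dataclass
class CoordOutcomeRecord(LogRecord):
    outcome: str = "commit"  # "commit" or "abort"
    subords: List[str] = field(default_factory=list)

rec = PrepareRecord(tid=42, coord_id="C", locks_held=["X(A)", "S(B)"])
```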

Protocol example (FW = force write, W = ordinary write; numbers mark failure points):

  COORD                                   SUBORD
  ->COMMIT (1)
  PREPARE ---------------------------->   (2)
                                          FW(abort/prepare) (3)
  (4) <------------------------ VOTE      [now in prepared state]
  all yes? FW(commit) (5)   <- commit point
    COMMIT ------------------------->
  else FW(abort)
    ABORT -------------------------->     (6)
                                          FW(abort/commit) (7)
  (8) <------------------------ ACK
  W(end) (9)

What's the deal with force writes? The log up to that point must be on disk before you can continue.

Let's look at what happens in the face of a failure at each numbered point:

(1) Transaction aborted.

(2) Abort.
    COORD crash: it will recover and abort the transaction just as in normal recovery (discarding all state). The SUBORD must eventually time out; once it contacts COORD, which has no record of the xaction, it will abort. (In the basic protocol, COORD replies "abort" in the no-information case.)
    SUBORD crash: COORD will never hear a reply and will abort. (How does it tell that the SUBORD failed?) The SUBORD will recover and roll back the xaction during recovery.

(3) The SUBORD is in the prepared state. It needs to contact COORD to determine its fate. A couple of options:
    - It already sent its vote, and COORD is waiting for the SUBORD to send an ACK -- thus, the SUBORD can learn its fate.

    - It didn't send its vote, in which case COORD may or may not have timed out. If COORD hasn't timed out, the SUBORD can vote. If COORD has timed out, it must have aborted, and will tell the SUBORD this.

(4) COORD crashed before receiving all votes. Abort. COORD aborts during recovery. Each SUBORD eventually times out; when it contacts COORD, it will learn the xaction aborted.

(5) COORD crashed after writing the commit record. Commit. COORD recovers into the committing state, and must send COMMIT messages and collect ACKs from all sites. A couple of possibilities:
    - A SUBORD may have already written its commit record, sent its ACK, and forgotten the xaction, in which case it should just ACK again and do nothing.
    - A SUBORD may not have written its commit record and is waiting, in which case this allows it to go forward.
    Note that a SUBORD cannot time out the xaction at this point -- it must wait to hear from COORD before committing/aborting.

(6) SUBORD crashed before receiving COMMIT/ABORT. Upon recovery, the SUBORD contacts COORD and asks about the fate of the xaction. COORD cannot forget its state, since it has not heard an ACK yet.

(7) SUBORD crashed after writing its COMMIT record, before ACKing. The SUBORD will recover with the transaction committed. COORD will periodically resend the COMMIT message, which the SUBORD will ACK without writing any additional state.

(8) COORD crashed after receiving some ACKs. COORD will resend COMMIT/ABORT to all SUBORDs, who will ACK.

(9) COORD crashed after writing END. Nothing needs to be done.
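The failure-free path of the protocol above can be sketched as a toy, single-process simulation, with force writes modeled as appends to in-memory logs (all class and function names here are hypothetical; a real implementation needs durable I/O, timeouts, and recovery from the log):

```python
# Toy sketch of 2PC: phase 1 collects votes after each subordinate
# force-writes prepare/abort; phase 2 distributes the outcome, which
# becomes final once the coordinator force-writes its commit record.

class Subord:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.log = []

    def prepare(self):
        # FW(abort/prepare) before voting -- step (3) above
        self.log.append("prepare" if self.will_commit else "abort")
        return "YES" if self.will_commit else "NO"

    def decide(self, outcome):
        self.log.append(outcome)       # FW(abort/commit), then ACK
        return "ACK"

def two_phase_commit(coord_log, subords):
    votes = [s.prepare() for s in subords]              # phase 1
    outcome = "commit" if all(v == "YES" for v in votes) else "abort"
    coord_log.append(outcome)                           # FW -- the commit point
    acks = [s.decide(outcome) for s in subords]         # phase 2
    if all(a == "ACK" for a in acks):
        coord_log.append("end")                         # W(end); state can be forgotten
    return outcome

log = []
result = two_phase_commit(log, [Subord("S1"), Subord("S2"), Subord("S3", will_commit=False)])
print(result)  # one NO vote forces "abort"
```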

What happens if we send a VOTE before writing the PREPARE record? Trouble! The SUBORD might recover and roll back when it should have committed.

What happens if a SUBORD sends an ACK before writing its COMMIT record? The SUBORD might recover and contact COORD, which will know nothing about the xaction (because it already wrote the END record), and reply "ABORTED" -- which would be wrong. What if COORD replied "committed" by default? That causes a problem at failure point (4).

Read-only sites:
If a SUBORD is read-only, it can send a "READ VOTE". It doesn't need to write any log records, and can forget the transaction after it votes; it will ACK any COMMIT sent by COORD. COORD doesn't need to send ABORT/COMMIT messages to read-only sites. If all sites are read-only, no ABORT/COMMIT messages need to be sent at all.

Presumed Abort:
Notice that in the absence of information, we abort. This means we don't need to:
- Force-write the abort record on any site.
- Send ABORT messages at all (though we still may want to for efficiency reasons).
- Send or wait for ACK messages for aborts.
- Write END records for aborted transactions on COORD.

Presumed Commit:
Change the protocol so that we reply "COMMIT" when there is no information about a transaction. Changes:
- SUBORDs don't acknowledge COMMIT messages.
- SUBORDs don't have to force-write their COMMIT records (COORD still has to force its).
- COORD doesn't write END records for COMMITs.

What does this break? Suppose COORD crashes after having received some but not all votes. It will then tell all prepared SUBORDs that the transaction COMMITted when they inquire -- but this may not be correct, since it's not safe to assume that all SUBORDs were able to commit.

Soln? Force-write a list of the SUBORDs at COORD before sending the PREPARE messages (a "COLLECTING" record). Then, on recovery, COORD can figure out who was involved in the transaction and tell them to ABORT.

Cost of committing a transaction (W = log writes, F = of which forced, M = messages):

              COORD                      SUBORD (update)   SUBORD (read-only)
              (update or read-only)
  Standard    2W, 1F, 1M(R/O), 2M(U)     2W, 2F, 2M        0W, 0F, 1M
  PA          2W, 1F, 1M(R/O), 2M(U)     2W, 2F, 2M        0W, 0F, 1M
  PC          2W, 2F, 1M(R/O), 2M(U)     2W, 1F, 1M        0W, 0F, 1M

Deadlock detection:

What is the problem with deadlocks in a parallel DB? Two sites can serialize transactions in different orders and end up waiting on each other to commit:

  T1@S1        T2@S1         T1@S2         T2@S2
  ---------    ----------    ----------    ---------
  locked: A    waiting: A    waiting: B    locked: B

What's the problem? S2 has ordered T1 after T2, while S1 has ordered T1 before T2. This is a deadlock.

What do we do to fix it? Build a global waits-for graph by sending local waits-for information to the other sites:

  T1 ------ WF ------> T2
     <----- WF -------

Then detect cycles. This is hard because the sites run asynchronously: we need to ensure that we don't "accidentally" detect a deadlock by using stale waits-for info.
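Cycle detection on the merged waits-for graph can be sketched as a DFS, assuming the edges have already been gathered from every site (illustrative only; the stale-information problem above is not addressed here):

```python
# Sketch: detect a deadlock as a cycle in the merged waits-for graph.
# A "gray" node is on the current DFS path, so reaching it again means
# we found a cycle (a back edge).

def has_cycle(waits_for):
    """waits_for: {txn: set of txns it waits for}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def dfs(t):
        color[t] = GRAY
        for u in waits_for.get(t, ()):
            c = color.get(u, WHITE)
            if c == GRAY or (c == WHITE and dfs(u)):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and dfs(t) for t in list(waits_for))

# S1 reports that T2 waits for T1; S2 reports that T1 waits for T2.
local_s1 = {"T2": {"T1"}}
local_s2 = {"T1": {"T2"}}
merged = {}
for local in (local_s1, local_s2):
    for t, ws in local.items():
        merged.setdefault(t, set()).update(ws)
print(has_cycle(merged))  # True: T1 and T2 wait for each other
```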

Rather than sending information to everyone else, we can limit the number of sites that waits-for information goes to. For example:

  S1: T1 <---- T2
  S2: T1 ----> T2

Here, S2 sends its info to S1 because T1 waits for T2 and T1 < T2; S1 does not send to S2. Then S1 detects the deadlock.

What victim to choose? Just pick locally.

What do real databases do? Apparently, they use presumed abort, and detect deadlocks via timeout, not deadlock detection.

Why presumed abort? Not clear -- not many transactions abort. PC does add an extra log record. Maybe it's just simpler and they haven't bothered to do PC.

Why timeout-based deadlock detection? Complexity.

What if we want two sites to commit the transaction at exactly the same time? Two Generals paradox! You can't guarantee consensus in bounded time in the face of a lossy network.

  Barack                        Hilary
  Attack at dawn! ------------->
  <-------------------------- OK
  Let's roll! ----------------->
  <--------------------- Got it.

H can't know that B heard her last message, so she can't act with certainty; any further acknowledgment B sends has the same problem in reverse, and so on...

If there is a non-zero probability of delivery, retries will eventually guarantee success, but not in bounded time.
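The last point can be illustrated with a quick simulation, assuming a channel that delivers each message independently with probability p (hypothetical code, not from the lecture): retries succeed with probability 1, but the number of rounds has no fixed bound.

```python
# Sketch: with a lossy channel, retrying until a message gets through
# always terminates eventually, but some runs take far longer than the
# average (~1/p attempts) -- there is no bound that holds for all runs.
import random

def send_until_delivered(p_deliver, rng):
    """Return the number of attempts until one message gets through."""
    attempts = 0
    while True:
        attempts += 1
        if rng.random() < p_deliver:
            return attempts

rng = random.Random(0)                      # seeded for reproducibility
trials = [send_until_delivered(0.5, rng) for _ in range(1000)]
print(max(trials))  # the worst run is well above the ~2-attempt average
```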