Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

Similar documents
Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Lecture 21: Logging Schemes /645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. General Overview NOTICE: Faloutsos CMU SCS

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Last Class. Faloutsos/Pavlo CMU /615

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #19: Logging and Recovery 1

Chapter 17: Recovery System

Failure Classification. Chapter 17: Recovery System. Recovery Algorithms. Storage Structure

CS122 Lecture 15 Winter Term,

Homework 6 (by Sivaprasad Sudhir) Solutions Due: Monday Nov 27, 11:59pm

Database Recovery. Lecture #21. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

Concurrency Control & Recovery

Recoverability. Kathleen Durant PhD CS3200

Chapter 17: Recovery System

ARIES (& Logging) April 2-4, 2018

Chapter 16: Recovery System. Chapter 16: Recovery System

CompSci 516: Database Systems

Concurrency Control & Recovery

Review: The ACID properties. Crash Recovery. Assumptions. Motivation. Preferred Policy: Steal/No-Force. Buffer Mgmt Plays a Key Role

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed McGraw-Hill by

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool.

Chapter 14: Recovery System

A tomicity: All actions in the Xact happen, or none happen. D urability: If a Xact commits, its effects persist.

Crash Recovery CMPSCI 645. Gerome Miklau. Slide content adapted from Ramakrishnan & Gehrke

Transaction Management Overview. Transactions. Concurrency in a DBMS. Chapter 16

UNIT 9 Crash Recovery. Based on: Text: Chapter 18 Skip: Section 18.7 and second half of 18.8

Transaction Management: Crash Recovery (Chap. 18), part 1

some sequential execution crash! Recovery Manager replacement MAIN MEMORY policy DISK

Crash Recovery Review: The ACID properties

Transaction Management Overview

The transaction. Defining properties of transactions. Failures in complex systems propagate. Concurrency Control, Locking, and Recovery

Introduction. Storage Failure Recovery Logging Undo Logging Redo Logging ARIES

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed

Outline. Purpose of this paper. Purpose of this paper. Transaction Review. Outline. Aries: A Transaction Recovery Method

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Transaction Management & Concurrency Control. CS 377: Database Systems

Log-Based Recovery Schemes

Distributed Systems

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Crash Recovery. The ACID properties. Motivation

COURSE 1. Database Management Systems

Crash Recovery. Chapter 18. Sina Meraji

Advanced Database Management System (CoSc3052) Database Recovery Techniques. Purpose of Database Recovery. Types of Failure.

Recovery from failures

Atomicity: All actions in the Xact happen, or none happen. Consistency: If each Xact is consistent, and the DB starts consistent, it ends up

Transactions and Recovery Study Question Solutions

Review: The ACID properties. Crash Recovery. Assumptions. Motivation. More on Steal and Force. Handling the Buffer Pool

Database Management System

Slides Courtesy of R. Ramakrishnan and J. Gehrke 2. v Concurrent execution of queries for improved performance.

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Database Systems ( 資料庫系統 )

CAS CS 460/660 Introduction to Database Systems. Recovery 1.1

Database Recovery Techniques. DBMS, 2007, CEng553 1

System Malfunctions. Implementing Atomicity and Durability. Failures: Crash. Failures: Abort. Log. Failures: Media

Introduction to Data Management. Lecture #18 (Transactions)

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

Database Applications (15-415)

CSE 444: Database Internals. Lectures 13 Transaction Schedules

Introduction to Database Systems CSE 444

Introduction to Data Management. Lecture #26 (Transactions, cont.)

Overview of Transaction Management

CS122 Lecture 18 Winter Term, All hail the WAL-Rule!

COURSE 4. Database Recovery 2

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Recovery Techniques. The System Failure Problem. Recovery Techniques and Assumptions. Failure Types

6.830 Lecture Recovery 10/30/2017

6.830 Lecture Recovery 10/30/2017

Recovery and Logging

CS5200 Database Management Systems Fall 2017 Derbinsky. Recovery. Lecture 15. Recovery

Final Review. May 9, 2017

Problems Caused by Failures

DATABASE DESIGN I - 1DL300

1/29/2009. Outline ARIES. Discussion ACID. Goals. What is ARIES good for?

Final Review. May 9, 2018 May 11, 2018

Transactions. Transaction. Execution of a user program in a DBMS.

Lecture 16: Transactions (Recovery) Wednesday, May 16, 2012

Transaction Management. Readings for Lectures The Usual Reminders. CSE 444: Database Internals. Recovery. System Crash 2/12/17

Motivating Example. Motivating Example. Transaction ROLLBACK. Transactions. CSE 444: Database Internals

Database Management Systems Reliability Management

What are Transactions? Transaction Management: Introduction (Chap. 16) Major Example: the web app. Concurrent Execution. Web app in execution (CS636)

Lectures 8 & 9. Lectures 7 & 8: Transactions

CS122 Lecture 19 Winter Term,

PART II. CS 245: Database System Principles. Notes 08: Failure Recovery. Integrity or consistency constraints. Integrity or correctness of data

CSE 444: Database Internals. Lectures Transactions

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Database Recovery. Dr. Bassam Hammo

Transaction Management: Introduction (Chap. 16)

CSE 544 Principles of Database Management Systems. Fall 2016 Lectures Transactions: recovery

File Systems: Consistency Issues

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Aries (Lecture 6, cs262a)

Chapter 22. Introduction to Database Recovery Protocols. Copyright 2012 Pearson Education, Inc.

Chapter 9. Recovery. Database Systems p. 368/557

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

CS 541 Database Systems. Recovery Managers

Database Management System

Databases: transaction processing

CSC 261/461 Database Systems Lecture 20. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

Announcements. Transaction. Motivating Example. Motivating Example. Transactions. CSE 444: Database Internals

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery

Transcription:

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#23: Crash Recovery Part 1 (R&G ch. 18) Last Class Basic Timestamp Ordering Optimistic Concurrency Control Multi-Version Concurrency Control Multi-Version+2PL Partition-based T/O Faloutsos/Pavlo 15-415/615 2 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 3 1

Motivation T1 BEGIN R(A) W(A) COMMIT Buffer Pool A=1 A=2 A=1 Page Disk Memory Faloutsos/Pavlo 15-415/615 4 Crash Recovery Recovery algorithms are techniques to ensure database consistency, transaction atomicity and durability despite failures. Recovery algorithms have two parts: Actions during normal txn processing to ensure that the DBMS can recover from a failure. Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability. Faloutsos/Pavlo 15-415/615 5 Crash Recovery DBMS is divided into different components based on the underlying storage device. Need to also classify the different types of failures that the DBMS needs to handle. Faloutsos/Pavlo 15-415/615 6 2

Storage Types Volatile Storage: Data does not persist after power is cut. Examples: DRAM, SRAM Non-volatile Storage: Data persists after losing power. Examples: HDD, SDD Use multiple storage Stable Storage: devices to approximate. A non-existent form of non-volatile storage that survives all possible failures scenarios. Faloutsos/Pavlo 15-415/615 7 Failure Classification Transaction Failures System Failures Storage Media Failures Faloutsos/Pavlo 15-415/615 8 Transaction Failures Logical Errors: Transaction cannot complete due to some internal error condition (e.g., integrity constraint violation). Internal State Errors: DBMS must terminate an active transaction due to an error condition (e.g., deadlock) Faloutsos/Pavlo 15-415/615 9 3

System Failures Software Failure: Problem with the DBMS implementation (e.g., uncaught divide-by-zero exception). Hardware Failure: The computer hosting the DBMS crashes (e.g., power plug gets pulled). Fail-stop Assumption: Non-volatile storage contents are assumed to not be corrupted by system crash. Faloutsos/Pavlo 15-415/615 10 Storage Media Failure Non-Repairable Hardware Failure: A head crash or similar disk failure destroys all or part of non-volatile storage. Destruction is assumed to be detectable (e.g., disk controller use checksums to detect failures). No DBMS can recover from this. Database must be restored from archived version. Faloutsos/Pavlo 15-415/615 11 Problem Definition Primary storage location of records is on non-volatile storage, but this is much slower than volatile storage. Use volatile memory for faster access: First copy target record into memory. Perform the writes in memory. Write dirty records back to disk. Faloutsos/Pavlo 15-415/615 12 4

Problem Definition Need to ensure: The changes for any txn are durable once the DBMS has told somebody that it committed. No changes are durable if the txn aborted. Faloutsos/Pavlo 15-415/615 13 Undo vs. Redo Undo: The process of removing the effects of an incomplete or aborted txn. Redo: The process of re-instating the effects of a committed txn for durability. How the DBMS supports this functionality depends on how it manages the buffer pool Faloutsos/Pavlo 15-415/615 14 Buffer Pool Management Is T1 allowed to Schedule overwrite A even Do we force T2 s changes to T1 though T2 it hasn t be written to disk? BEGIN committed? Buffer Pool R(A) W(A) BEGIN A=1 A=3 B=99 B=88 C=7 R(B) A=1 A=3 B=99 B=88 C=7 W(B) COMMIT ABORT Disk Memory What happens when we need to rollback T1? Page Faloutsos/Pavlo 15-415/615 15 5

Buffer Pool Steal Policy Whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage. STEAL: Is allowed. NO-STEAL: Is not allowed. Faloutsos/Pavlo 15-415/615 16 Buffer Pool Force Policy Whether the DBMS ensures that all updates made by a txn are reflected on non-volatile storage before the txn is allowed to commit: FORCE: Is enforced. NO-FORCE: Is not enforced. Force writes makes it easier to recover but results in poor runtime performance. Faloutsos/Pavlo 15-415/615 17 T1 BEGIN R(A) W(A) ABORT NO-STEAL + FORCE Schedule T2 BEGIN R(B) W(B) COMMIT NO-STEAL means that T1 changes cannot be written to disk yet. Buffer Pool A=1 A=3 B=99 B=88 C=7 FORCE means that T2 changes must Memory be written Now it s trivial to disk to at this point. rollback T1. A=1 B=99 B=88 C=7 Disk Page Faloutsos/Pavlo 15-415/615 18 6

NO-STEAL + FORCE This approach is the easiest to implement: Never have to undo changes of an aborted txn because the changes were not written to disk. Never have to redo changes of a committed txn because all the changes are guaranteed to be written to disk at commit time. But this will be slow Faloutsos/Pavlo 15-415/615 19 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 20 Shadow Paging Maintain two separate copies of the database (master, shadow) Updates are only made in the shadow copy. When a txn commits, atomically switch the shadow to become the new master. Buffer Pool: NO-STEAL + FORCE Faloutsos/Pavlo 15-415/615 21 7

Shadow Paging Database is a tree whose root is a single disk block. There are two copies of the tree, the master and shadow The root points to the master copy. Updates are applied to the shadow copy. Portions courtesy of the great Phil Bernstein Faloutsos/Pavlo 15-415/615 22 Shadow Paging Example Memory Non-Volatile Storage DB Root 1 2 3 4 Master Page Table Pages on Disk Faloutsos/Pavlo 15-415/615 23 Shadow Paging To install the updates, overwrite the root so it points to the shadow, thereby swapping the master and shadow: Before overwriting the root, none of the transaction s updates are part of the diskresident database After overwriting the root, all of the transaction s updates are part of the diskresident database. Portions courtesy of the great Phil Bernstein Faloutsos/Pavlo 15-415/615 24 8

Shadow Paging Example Read-only txns access the current Memory master. DB Root X 1 2 3 4 Master Page Table 1 2 3 4 Shadow Page Table Active modifying txn updates shadow pages. Non-Volatile Storage Pages on Disk Faloutsos/Pavlo 15-415/615 25 X X X Shadow Paging Undo/Redo Supporting rollbacks and recovery is easy. Undo: Simply remove the shadow pages. Leave the master and the DB root pointer alone. Redo: Not needed at all. Faloutsos/Pavlo 15-415/615 26 Shadow Paging Advantages No overhead of writing log records. Recovery is trivial. Faloutsos/Pavlo 15-415/615 27 9

Shadow Paging Disadvantages Copying the entire page table is expensive: Use a page table structured like a B+tree No need to copy entire tree, only need to copy paths in the tree that lead to updated leaf nodes Commit overhead is high: Flush every updated page, page table, & root. Data gets fragmented. Need garbage collection. Faloutsos/Pavlo 15-415/615 28 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 29 Write-Ahead Log Record the changes made to the database in a log before the change is made. Assume that the log is on stable storage. Log contains sufficient information to perform the necessary undo and redo actions to restore the database after a crash. Buffer Pool: STEAL + NO-FORCE Faloutsos/Pavlo 15-415/615 30 10

Write-Ahead Log Protocol All log records pertaining to an updated page are written to non-volatile storage before the page itself is allowed to be overwritten in non-volatile storage. A txn is not considered committed until all its log records have been written to stable storage. Faloutsos/Pavlo 15-415/615 31 Write-Ahead Log Protocol Log record format: <txnid, objectid, beforevalue, aftervalue> Each transaction writes a log record first, before doing the change. Write a <BEGIN> record to mark txn starting point. When a txn finishes, the DBMS will: Write a <COMMIT> record on the log Make sure that all log records are flushed before it returns an acknowledgement to application. Faloutsos/Pavlo 15-415/615 32 Write-Ahead Log Example T1 BEGIN W(A) W(B) COMMIT ObjectId TxnId WAL <T1 begin> <T1, A, 99, 88> <T1, B, 5, 10> <T1 commit> CRASH! Buffer Pool Before Value After Value A=99 B=5 The result is deemed safe to return to app. A=99 A=88 B=10 B=5 Non-Volatile Storage Volatile Storage Faloutsos/Pavlo 33 11

WAL Implementation Details When should we write log entries to disk? When the transaction commits. Can use group commit to batch multiple log flushes together to amortize overhead. When should we write dirty records to disk? Every time the txn executes an update? Once when the txn commits? Faloutsos/Pavlo 15-415/615 34 WAL Deferred Updates Observation: If we prevent the DBMS from writing dirty records to disk until the txn commits, then we don t need to store their original values. WAL <T1 begin> <T1, A, 99, X 88> <T1, B, X 5, 10> <T1 commit> Faloutsos/Pavlo 15-415/615 35 WAL Deferred Updates Observation: If we prevent the DBMS from writing dirty records to disk until the txn commits, then we don t need to store their original Replay the values. log and Simply ignore all of redo each update. WAL <T1 begin> <T1, A, 88> <T1, B, 10> <T1 commit> CRASH! T1 s updates. WAL <T1 begin> <T1, A, 88> <T1, B, 10> CRASH! Faloutsos/Pavlo 15-415/615 36 12

WAL Deferred Updates This won t work if the change set of a txn is larger than the amount of memory available. Example: Update all salaries by 5% The DBMS cannot undo changes for an aborted txn if it doesn t have the original values in the log. We need to use the STEAL policy. Faloutsos/Pavlo 15-415/615 37 WAL Buffer Pool Policies NO-STEAL STEAL NO-STEAL Undo + Redo STEAL NO-FORCE Fastest NO-FORCE Slowest FORCE Slowest Runtime Performance FORCE Fastest No Undo + No Redo Recovery Performance Almost every DBMS uses NO-FORCE + STEAL Faloutsos/Pavlo 15-415/615 38 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 39 13

Checkpoints The WAL will grow forever. After a crash, the DBMS has to replay the entire log which will take a long time. The DBMS periodically takes a checkpoint where it flushes all buffers out to disk. Faloutsos/Pavlo 15-415/615 40 Checkpoints Output onto stable storage all log records currently residing in main memory. Output to the disk all modified blocks. Write a <CHECKPOINT> entry to the log and flush to stable storage. Faloutsos/Pavlo 15-415/615 41 Checkpoints WAL <T1 begin> <T1, A, 1, 2> <T1 commit> <T2 begin> <T2, A, 2, 3> <T3 begin> <CHECKPOINT> <T2 commit> <T3, A, 3, 4> CRASH! Any txn that committed before the checkpoint is ignored (T1). T2 + T3 did not commit before the last checkpoint. Need to redo T2 because it committed after checkpoint. Need to undo T3 because it did not commit before the crash. Faloutsos/Pavlo 15-415/615 42 14

Checkpoints Challenges We have to stall all txns when take a checkpoint to ensure a consistent snapshot. Scanning the log to find uncommitted can take a long time. Not obvious how often the DBMS should take a checkpoint. Faloutsos/Pavlo 15-415/615 43 Checkpoints Frequency Checkpointing too often causes the runtime performance to degrade. System spends too much time flushing buffers. But waiting a long time is just as bad: The checkpoint will be large and slow. Makes recovery time much longer. Faloutsos/Pavlo 15-415/615 44 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 45 15

Logging Schemes Physical Logging: Record the changes made to a specific location in the database. Example: Position of a record in a page. Logical Logging: Record the high-level operations executed by txns. Example: The UPDATE, DELETE, and INSERT queries invoked by a txn. Faloutsos/Pavlo 15-415/615 46 Physical vs. Logical Logging Logical logging requires less data written in each log record than physical logging. Difficult to implement recovery with logical logging if you have concurrent txns. Hard to determine which parts of the database may have been modified by a query before crash. Also takes longer to recover because you must re-execute every txn all over again. Faloutsos/Pavlo 15-415/615 47 Physiological Logging Hybrid approach where log records target a single page but do not specify data organization of the page. This is the most popular approach. Faloutsos/Pavlo 15-415/615 48 16

Logging Schemes INSERT INTO X VALUES(1,2,3); Physical Logical Physiological <T1, Table=X, Page=99, Offset=4, Record=(1,2,3)> <T1, Index=X_PKEY, Page=45, Offset=9, Key=(1,Record1)> <T1, INSERT INTO X VALUES(1,2,3) > <T1, Table=X, Page=99, Record=(1,2,3)> <T1, Index=X_PKEY, IndexPage=45, Key=(1,Record1)> Faloutsos/Pavlo 15-415/615 49 Today s Class Overview Shadow Paging Write-Ahead Log Checkpoints Logging Schemes Examples Faloutsos/Pavlo 15-415/615 50 Observation #1 You can only safely write a single page to non-volatile storage at a time. Linux Default: 4KB How does a DBMS make sure that large updates are safely written? Faloutsos/Pavlo 15-415/615 51 17

MySQL Doublewrite Buffer When MySQL flushes dirty records from its buffer, it first writes them out sequentially to a doublewrite buffer and then fsyncs. If this is successful, then it can safely write records at their real location. On recovery, check whether the doublewrite buffer matches the record s real location. If not, then restore from doublewrite buffer. Faloutsos/Pavlo 15-415/615 52 Observation #2 With a WAL, the DBMS has to write each update to stable storage at least twice: Once in the log. And again in the primary storage. The total amount of data per update depends on implementation (e.g., physical vs. logical) Faloutsos/Pavlo 15-415/615 53 Storing BLOBs in the Database Every time you change a BLOB field you have to store the before/after image in WAL. Don t store large files in your database! Put the file on the filesystem and store a URI in the database. More information: To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem? Faloutsos/Pavlo 15-415/615 54 18

Log-Structured Merge Trees No primary storage. The log is the database. All writes create just one log entry (fast!). All reads must search the log backwards to find the last value written for the target key. DBMS still must periodically take a checkpoint: Log compaction instead of flushing buffers. Faloutsos/Pavlo 15-415/615 55 LevelDB LSM Google s fast storage library that provides an ordered mapping of key/value pairs. MemTable: In-memory index of log entries SSTable: Immutable MemTables on disk. MemTable <Key=A,Value=4> <Key=C,Value=12> <Key=D,Value=6> <Key=E,Value=99> SSTable Index Key Offset SSTable SSTable Index Key Offset SSTable Faloutsos/Pavlo 15-415/615 56 Observation #3 All of this has assumed that the database is stored on slow disks (HDD, SDD). What kind of logging should we use if the database is stored entirely in main memory? Faloutsos/Pavlo 15-415/615 57 19

VoltDB Command Logging Even more lightweight version of logical logging based on stored procedures. <txnid, ProcedureName, Parameters> Command <T1, Proc=UpdateAcct, Params=(123)> Faloutsos/Pavlo 15-415/615 58 Command vs. Physiological Runtime Performance (Higher is Better) Recovery Time (Lower is Better) Storage Speed (Relative to DRAM) Storage Speed (Relative to DRAM) Faloutsos/Pavlo 15-415/615 59 Summary Write-Ahead Log to handle loss of volatile storage. Use incremental updates (i.e., STEAL, NO- FORCE) with checkpoints. On recovery: undo uncommitted txns + redo committed txns. Faloutsos/Pavlo 15-415/615 60 20

Next Class ARIES Algorithms for Recovery and Isolation Exploiting Semantics Write-ahead Logging Repeating History during Redo Logging Changes during Undo Faloutsos/Pavlo 15-415/615 61 21