An Efficient Commit Protocol Exploiting Primary-Backup Placement in a Parallel Storage System
Haruo Yokota, Tokyo Institute of Technology (IFIP 10.4 Meeting)

My Research Interests
Data Engineering + Dependable Systems: dependable storage systems, distributed databases, dependable web service compositions, and XML databases.
This presentation focuses on dependable parallel storage systems, in particular a distributed commit protocol on the storage system.

Background
Parity-based parallel storage systems, such as RAID 3-6, are currently in wide use. However, they suffer drastic performance degradation during failures and recovery.
We choose the primary-backup approach to maintain quality of service under failures and recovery. It also allows continuous streaming content, such as video streams, to be stored in the system.
We proposed Autonomous Disks [Yokota, 1999]: each disk drive is interconnected in a network to form a storage cluster and is equipped with a disk-resident intelligent processor that handles data distribution, load balancing, and disk failures. Within the cluster, data are partitioned across all member disks.

Key Technologies of Autonomous Disks (AD)
Access methods: we proposed the Fat-Btree as a distributed directory for accessing data distributed across the disk cluster [Yokota, ICDE 1999].
Data placement (load/amount balance): we first adopted the Chained Declustering method [Hsiao and DeWitt, 1990] for locating the primary and backup copies, and then proposed the Adaptive Overlapped Declustering method to balance both the access load and the data amount of each disk [Watanabe and Yokota, ICDE 2005].

Configuration of an AD Cluster
[Diagram: clients connect over an IP network to disk-resident controllers; each disk holds a portion of the distributed directory, a primary partition (key ranges a-f, g-m, n-s, t-z), the backup of its neighbor's primary (chained declustering), and ECA rules.]
Target data can be accessed from any disk, providing high throughput.
High reliability while maintaining QoS.
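As a concrete illustration of the chained-declustering placement shown in the figure, the following Python fragment (my own sketch; the partition names match the slide, everything else is hypothetical) maps each primary partition to a disk and places its backup on the next disk in the chain.

```python
# Illustrative sketch (not from the original slides): chained declustering
# places the backup of each disk's primary partition on its logical neighbor,
# so a single disk failure leaves every partition available on one survivor.

def chained_declustering(num_disks, partitions):
    """Map partition i -> primary on disk i, backup on disk (i + 1) % N."""
    placement = {}
    for i, part in enumerate(partitions):
        placement[part] = {
            "primary": i % num_disks,
            "backup": (i + 1) % num_disks,
        }
    return placement

if __name__ == "__main__":
    # Four disks and the key ranges shown on the slide.
    layout = chained_declustering(4, ["a-f", "g-m", "n-s", "t-z"])
    for part, disks in layout.items():
        print(part, disks)
    # e.g. the backup of partition 'a-f' (primary on disk 0) sits on disk 1.
```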

AOD: Adaptive Overlapped Declustering
AOD aims to balance both access load and space utilization.
The backup data are distributed to both sides of the primary data, under the assumption that backups are rarely accessed: access load is balanced solely by the placement of the primary data, while space utilization is balanced by adjusting the backup distribution.
AOD is also effective in shortening recovery time, because both neighboring backups are used to restore the primary.
[Diagram: each primary's backup is divided across its two neighbors; on a failure, both neighbors restore the primary.]
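To make the bookkeeping concrete, here is a deliberately simplified sketch, not the paper's actual algorithm: each primary's backup is split between its two neighboring disks with a fixed 50/50 ratio, whereas real AOD adapts the split per partition so that total primary plus backup volume is equalized across disks.

```python
# Minimal toy sketch (my illustration, not the AOD algorithm itself): split
# each partition's backup between the two neighboring disks and report the
# resulting volume per disk. AOD adjusts the split point adaptively; a fixed
# 50/50 split is used here only to show the idea.

def overlapped_placement(primary_sizes, split=0.5):
    """primary_sizes[i] = bytes of primary data on disk i."""
    n = len(primary_sizes)
    usage = list(primary_sizes)                 # primary volume per disk
    for i, size in enumerate(primary_sizes):
        left, right = (i - 1) % n, (i + 1) % n  # disks on both sides of the primary
        usage[left] += size * split             # backup portion on the left
        usage[right] += size * (1.0 - split)    # backup portion on the right
    return usage

if __name__ == "__main__":
    print(overlapped_placement([100, 120, 80, 100]))
```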

Simulation Results
[Graphs comparing Chained Declustering and Adaptive Overlapped Declustering: unused space, recovery time, and response time.]
AOD gives roughly a 20% reduction in unused space, restores the primary in approximately half the time, and has less negative impact on response time.
The graphs illustrate the effect of balancing load and data amount, and of shortening the recovery time.

A Fat-Btree
The root, which is updated infrequently, is copied to every PE to enable parallel access. Leaves, which are updated frequently, have no copies, reducing synchronization costs.
[Diagram: index pages from the root down to the leaves, partitioned with the data across PE0-PE3; each PE holds copies of the index pages on the paths to its own leaves.]
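The following toy sketch (my own, with a hypothetical two-level index) shows the replication pattern this implies: the root ends up on every PE, intermediate pages on a few PEs, and each leaf page on exactly one PE.

```python
# Illustrative sketch (my own, simplified): in a Fat-Btree each PE stores the
# index pages on the paths from the root to the leaf pages it owns, so upper
# pages (updated rarely) are replicated on many PEs while leaf pages (updated
# often) exist on exactly one PE. The page names and shape are hypothetical.

tree = {
    "root": ["int0", "int1"],
    "int0": ["leaf0", "leaf1"],
    "int1": ["leaf2", "leaf3"],
}
leaf_owner = {"leaf0": 0, "leaf1": 1, "leaf2": 2, "leaf3": 3}

def pages_on_pe(pe):
    """Return the index pages PE `pe` must hold: paths from root to its leaves."""
    pages = set()
    def walk(page, path):
        children = tree.get(page)
        if children is None:                       # a leaf page
            if leaf_owner[page] == pe:
                pages.update(path + [page])
            return
        for child in children:
            walk(child, path + [page])
    walk("root", [])
    return pages

if __name__ == "__main__":
    for pe in range(4):
        print(pe, sorted(pages_on_pe(pe)))
    # The root appears on every PE; each leaf appears on exactly one PE.
```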

Transaction Control on the Fat-Btree
The partial-duplication strategy adds complexity to structure modification operations (SMOs), such as page splits and merges, since index pages are partially copied across PEs. When an SMO occurs at an index page, all PEs holding a copy of that page must be updated synchronously.
We developed concurrency control protocols suited to the Fat-Btree [Miyazaki et al. 2002, Yoshihara et al. 2006]. An atomic commit protocol (ACP) becomes imperative to guarantee the correctness of the data.
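A tiny sketch of why SMO cost depends on where the page sits (continuing the toy index above; names are hypothetical): the set of PEs that must be updated synchronously is simply the set of copy holders, which is large for upper pages and a single PE for leaves.

```python
# Sketch (hypothetical data): an SMO on an index page must be applied
# synchronously on every PE that holds a copy of that page.

def holders(page, pages_per_pe):
    """PEs that hold a copy of `page` and must be updated on an SMO."""
    return [pe for pe, pages in pages_per_pe.items() if page in pages]

if __name__ == "__main__":
    pages_per_pe = {
        0: {"root", "int0", "leaf0"}, 1: {"root", "int0", "leaf1"},
        2: {"root", "int1", "leaf2"}, 3: {"root", "int1", "leaf3"},
    }
    print(holders("root", pages_per_pe))   # every PE -> expensive, but rare
    print(holders("leaf0", pages_per_pe))  # one PE -> cheap, and frequent
```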

BA-1.5PC
We propose a new atomic commit protocol suited to distributed storage adopting primary-backup data placement under the fail-stop failure model.
It has a low-overhead log mechanism that eliminates blocking disk I/O (async-ACK).
It removes the voting phase from commit processing to obtain a faster commit.
We named it the Backup-Assist 1.5-Phase Commit, or BA-1.5PC, protocol.

Other Commit Protocols (1/3)
The Two-Phase Commit (2PC) protocol is the best known commit protocol for distributed environments. It can produce severe delays while synchronizing copies under the primary-backup scheme, and it is also prone to blocking cohorts if the master fails in the decision phase. Many ACP proposals attempt to ameliorate the drawbacks of 2PC.

Other Commit Protocols (2/3): low-overhead commit protocols
Presumed Abort (PA) and Presumed Commit (PC) protocols [Mohan et al. 1986]: reduce the number of message exchanges as well as forced log writes.
Early Prepare (EP) protocol [Stamos et al. 1990]: heavy delay during transaction processing.
Coordinator Log (CL) protocol [Stamos et al. 1993] and Implicit Yes-Vote (IYV) protocol [Al-Houmaily et al. 1995]: concentrate the cohorts' logs at the master and hamper the system's scalability.
OPTimistic (OPT) commit protocol [Gupta et al. 1997]: reduces lock waiting time by lending locks held by transactions in commit processing, but must be carefully designed to avoid cascading aborts.

Other Commit Protocols (3/3): non-blocking commit protocols
Three-Phase Commit (3PC) protocol [Skeen 1982]: has an extra buffer phase between the voting phase and the decision phase, and suffers severe delay.
ACP based on Uniform Timed Reliable Broadcast (ACP-UTRB) [Babaoglu et al. 1993] and Non-Blocking Single-Phase Atomic Commit (NB-SPAC) protocol [Abdallah et al. 1998]: impose a strong assumption (uniform broadcast) on the underlying communication model, and the cost of message delivery grows with the number of participants, compromising scalability.

Async-ACK
Write-Ahead Logging (WAL) [Härder et al. 1983] is widely accepted as the means of providing atomicity and durability in databases.
Neighbor WAL [Hvasshovd et al. 1996] reduces the logging overhead of WAL in parallel databases by writing the logs into the memory of a neighboring PE.
We propose Asynchronous-ACK (async-ACK): the primary does not wait for the acknowledgement from its neighbor at a log flush operation, deferring the synchronization until the final decision phase of the transaction.
[Diagram: with Neighbor WAL, PE(i) sends a log message to PE(j), which writes the log in memory and acknowledges before PE(i) proceeds; with async-ACK, the primary continues without waiting for the backup's acknowledgement.]
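A minimal sketch of the async-ACK idea as described above, assuming a simple in-memory neighbor and Python threads; the class and method names are mine, not the system's API.

```python
# Minimal sketch (my illustration of the slide's async-ACK description): the
# primary ships log records to a neighbor "backup" without waiting for
# acknowledgements; only at the transaction's decision phase does it block
# until all outstanding records are acknowledged.
import queue
import threading

class AsyncAckLog:
    def __init__(self):
        self.outbox = queue.Queue()      # log messages in flight to the backup
        self.pending = 0                 # records not yet acknowledged
        self.lock = threading.Condition()
        threading.Thread(target=self._backup_pe, daemon=True).start()

    def write(self, record):
        """Log write: ship to the neighbor, return immediately (no wait)."""
        with self.lock:
            self.pending += 1
        self.outbox.put(record)

    def wait_for_acks(self):
        """Called once, at the decision phase: block until the backup has all logs."""
        with self.lock:
            while self.pending:
                self.lock.wait()

    def _backup_pe(self):
        """Neighbor PE: write each record into its memory, then acknowledge."""
        backup_memory = []
        while True:
            backup_memory.append(self.outbox.get())   # log write (in memory)
            with self.lock:
                self.pending -= 1                      # the acknowledgement
                self.lock.notify_all()

if __name__ == "__main__":
    log = AsyncAckLog()
    for op in ["insert k1", "insert k2", "split page 7"]:
        log.write(op)        # none of these block on the neighbor
    log.wait_for_acks()      # synchronize only at the decision phase
    print("all log records acknowledged; safe to commit")
```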

BA-1.5-Phase Commit Protocol
It removes the voting phase by implicitly including each cohort's vote in the acknowledgement of each operation.
Pending phase: a primary cohort waits for its backup at the decision phase; this is the only point at which the primary and backup are synchronized. A pending phase whose result can be discarded is counted as 0.5 phase in the commit process, which is the origin of the name "1.5-phase".
[Diagram: the master and backup master exchange a membership log, per-operation logs (log 1 ... log p), the decision, and an end log; cohorts move Active -> Prepared -> Pending -> Committed; removing the voting phase and asynchronizing acknowledgements reduce messages.]
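Reading the phases off the slide, the master-side control flow might look roughly like the following sketch; all names are hypothetical and messaging is reduced to direct method calls, so this is a schematic rather than an implementation.

```python
# Schematic sketch of the BA-1.5PC flow as described on the slides; class and
# method names are hypothetical, and "messages" are just function calls.

class Cohort:
    def __init__(self, name):
        self.name, self.backup_log = name, []    # backup_log stands in for the
                                                 # neighbor backup's memory
    def execute(self, op):
        # The ACK of each operation carries the cohort's implicit "yes" vote;
        # the log record is shipped to the backup with async-ACK.
        self.backup_log.append(("log", op))
        return "ack"

    def decide(self, decision):
        # Pending phase: the only synchronization point with the backup.
        self.backup_log.append(("decision", decision))
        return "ack"                             # backup's ACK ends the 0.5 phase

def ba_15pc(master_backup_log, operations):
    master_backup_log.append("membership log")          # async-ACK, no forced I/O
    for op, cohort in operations:                        # operation phase
        if cohort.execute(op) != "ack":
            decision = "abort"
            break
    else:
        decision = "commit"                              # no separate voting phase
    master_backup_log.append(("decision", decision))
    for cohort in {c for _, c in operations}:            # decision + pending phase
        cohort.decide(decision)
    master_backup_log.append("end log")
    return decision

if __name__ == "__main__":
    c = [Cohort("c0"), Cohort("c1")]
    print(ba_15pc([], [("insert k1", c[0]), ("insert k2", c[1])]))
```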

Failure Recovery Strategy (1/2)
BA-1.5PC handles failures effectively by exploiting the primary-backup structure.
Cohort failures: the backup of the failed cohort assumes the primary cohort's role. The backup cohort maintains the log records for the operations until the failed cohort is fixed and restarts; the failed cohort can then easily restore its state by reading the complete list of log records at its backup and applying them to bring its data up to date.
[Diagram: the decision is re-issued to the backup cohort, which answers in place of the failed primary cohort.]

Failure Recovery Strategy (2/2)
Master failure: it is surmounted by the master's backup, located at its logical neighbor.
If the primary master fails before the decision phase, the backup master automatically aborts the transaction.
If the primary master fails during the decision phase, the backup master re-informs all cohorts of the result of the transaction, instead of forcing the cohorts to wait until the failed master restarts and recovers.
[Diagram: message sequences for the abort case and for re-announcing the decision.]
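The two master-failure cases can be summarized in a small decision sketch (hypothetical names, not the actual implementation): the backup master aborts when no decision was logged, and otherwise re-announces the logged decision to every cohort.

```python
# Toy sketch of the backup master's behaviour on a primary-master failure,
# following the two cases on the slide. Names and log format are hypothetical.

def backup_master_takeover(backup_log, cohorts):
    decision = next((d for kind, d in backup_log if kind == "decision"), None)
    if decision is None:
        # Failure before the decision phase: abort automatically.
        outcome = "abort"
    else:
        # Failure during the decision phase: re-announce the decision instead
        # of making cohorts wait for the failed master to restart and recover.
        outcome = decision
    for cohort in cohorts:
        cohort.decide(outcome)          # each cohort applies / acknowledges it
    return outcome

if __name__ == "__main__":
    class Cohort:                        # stub cohort for the example
        def decide(self, outcome):
            print("cohort received", outcome)
    backup_master_takeover([("membership", None), ("log", 1)], [Cohort()])
    backup_master_takeover([("membership", None), ("decision", "commit")], [Cohort()])
```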

Estimation of Protocol Overheads
We compare overheads in terms of message exchanges and forced disk writes, in both transaction processing and commit processing.
Comparison targets: the conventional 2PC and EP protocols.
Assumptions: there are n cohorts in a transaction besides the master, and a transaction has p operations that are delivered to the cohorts.

Overheads of 2PC (n: number of cohorts, p: number of operations)
Number of messages: 2p + 8n. The master sends messages to deliver the p operations to the cohorts; at the commit phase the master must exchange two rounds of messages with each cohort; the primary cohort must forward these messages to its backup cohort, and the backup cohort replies to them during commit processing.
Number of forced log writes: 4n + 1. Each cohort must force-write two log records at commit, the master must force-write a log record for the decision, and each backup cohort force-writes two log records during commit processing.
[Diagram: message flow among master, backup master, primary cohorts, and backup cohorts during the vote and decision rounds.]
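Grouping the contributions listed above (and reading the slide's diagram as p operation messages plus p replies), the 2PC totals can be written out as follows; the grouping is my reconstruction from the slide text.

$$
\begin{aligned}
M_{\mathrm{2PC}} &= \underbrace{2p}_{\text{operation delivery and replies}}
 + \underbrace{4n}_{\text{two master--cohort rounds at commit}}
 + \underbrace{4n}_{\text{primary--backup forwarding and replies}} = 2p + 8n,\\
F_{\mathrm{2PC}} &= \underbrace{2n}_{\text{cohort commit records}}
 + \underbrace{2n}_{\text{backup-cohort records}}
 + \underbrace{1}_{\text{master decision record}} = 4n + 1.
\end{aligned}
$$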

Overheads of EP (n: number of cohorts, p: number of operations)
Number of messages: 4p + 2n. A cohort must acknowledge every operation with a message to the master (an acknowledgement at commit is sent only for aborted transactions), and the primary cohort must transmit its operations and the final decision of the transaction to its backup cohort.
Number of forced log writes: 2p + 2. The master must force-write a membership log record at the beginning of the transaction and a decision log record in the commit phase, and the cohort must force-write its log record together with its updated data to stable storage on every operation.
[Diagram: message flow among master, backup master, primary cohorts, and backup cohorts.]

Overheads of BA-1.5PC (n: number of cohorts, p: number of operations)
Number of messages: 3p + 4n + 3. The master delivers the operations to the cohorts and waits for their replies; each cohort issues a log message, including the operation, to its backup using async-ACK; the master broadcasts the decision to all cohorts and waits for their ACKs; the primary cohort writes a decision log to its backup cohort, and the backup cohort returns an ACK to the primary cohort; the primary master sends a membership log record at the beginning of the transaction, a decision log record in the commit phase, and an end log record at the end of the transaction to the backup master with async-ACK.
Number of forced log writes: 0. With async-ACK, BA-1.5PC requires no forced disk writes of log records.
[Diagram: membership log, per-operation logs (log 1 ... log p), decision, and end log exchanged among master, backup master, cohorts, and backup cohorts.]

Summary of Protocol Overheads
BA-1.5PC outperforms the other two protocols: it requires no forced log writes during a transaction.
The relative ranking of 2PC and EP depends heavily on n and p. For Fat-Btree transactions in which SMOs occur on index pages, the number of operations is p = u * n, where u is the number of updates to index pages; u is usually larger than two in such an SMO. EP therefore incurs a larger overhead than 2PC, arising from its heavy delays for forced log writes during operation processing.

Protocol   | No. of messages | No. of forced log writes
2PC        | 2p + 8n         | 4n + 1
EP         | 4p + 2n         | 2p + 2
BA-1.5PC   | 3p + 4n + 3     | 0
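As a quick numeric check of the table, the snippet below evaluates the three formulas for a sample SMO-heavy transaction with p = u * n; the values n = 4 and u = 3 are arbitrary examples, not measurements from the paper.

```python
# Small check of the summary table (formulas copied from the slide); the
# sample values n = 4 cohorts and u = 3 index-page updates per cohort are
# arbitrary, chosen only to illustrate the p = u * n case discussed above.

def overheads(n, p):
    return {
        "2PC":      {"messages": 2 * p + 8 * n, "forced_log_writes": 4 * n + 1},
        "EP":       {"messages": 4 * p + 2 * n, "forced_log_writes": 2 * p + 2},
        "BA-1.5PC": {"messages": 3 * p + 4 * n + 3, "forced_log_writes": 0},
    }

if __name__ == "__main__":
    n, u = 4, 3
    p = u * n                  # operations in an SMO-heavy Fat-Btree transaction
    for proto, cost in overheads(n, p).items():
        print(proto, cost)
    # With n=4, p=12: 2PC -> 56 msgs / 17 forced writes,
    # EP -> 56 msgs / 26 forced writes, BA-1.5PC -> 55 msgs / 0 forced writes.
```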

Experimentation
The performance of the proposed protocol is evaluated using an experimental Autonomous Disk system.
Comparison targets: the conventional 2PC and EP protocols.
Workloads: synthetic. Each access request inserts a key with a fixed amount of data (400 bytes); access frequencies are uniform.

Experimentation Environment
Hardware:
No. of PEs: 64 (storage), 8 (clients)
PE: Sun Fire B100x
CPU: Athlon XP-M 1800+
Memory: PC2100 DDR, 1 GB
Disk: Toshiba MK3019GAX (30 GB, 5400 rpm, 2.5 inch)
Network: 8 clients and 64 storage nodes connected via a GbE switch
Software:
OS: Linux 2.4.20
Java VM: Sun J2SE 1.4.2_04 Server VM

Changing Data Set Sizes (64 Nodes)
BA-1.5PC outperforms 2PC and EP in transaction throughput at all data set sizes, because it greatly reduces the message complexity of the commit process and, with async-ACK, removes the need for forced disk log writes.
[Graph: insertion throughput in tuples per second versus total object number (x10^3), for BA-1.5PC, 2PC, and EP.]

Differing Numbers of Storage Nodes (5M Objects)
More storage nodes in the system allow higher insertion throughput, which is attributed to the good scalability and parallelism of the Fat-Btree index structure. The throughput of BA-1.5PC increases more than that of the other two protocols as the system size grows: BA-1.5PC is more scalable than 2PC and EP.
[Graph: insertion throughput in tuples per second versus number of storage nodes (8, 16, 32, 64), for BA-1.5PC, 2PC, and EP.]

Conclusions
We have proposed a new atomic commit protocol for distributed storage systems adopting the primary-backup approach, to reduce the overhead of transaction processing.
We analyzed the overhead of our protocol against the conventional protocols.
We implemented the proposed protocol on an experimental Autonomous Disk system and compared it with the conventional protocols: it improves throughput by approximately 60% for 64 storage nodes, and it maintains a performance advantage in throughput across varied dataset sizes and Autonomous Disk system scales.

Related Progress
It is interesting to exploit the processing power of the disk-resident processor for advanced storage management.
We proposed another method to balance both access load and space utilization under multi-version management; it combines the Fat-Btree with linked-list structures.
We also proposed revocation methods for encryption on parallel disks adopting the primary-backup configuration.

B (blade) type prototype

S (Small-size) type prototype


Thank you for your attention


The server blade system used for simulation