Distributed and Fault-Tolerant Execution Framework for Transaction Processing


Distributed and Fault-Tolerant Execution Framework for Transaction Processing
May 30, 2011
Toshio Suganuma, Akira Koseki, Kazuaki Ishizaki, Yohei Ueda, Ken Mizuno, Daniel Silva*, Hideaki Komatsu, Toshio Nakatani
IBM Research - Tokyo; * Amazon.com, Inc.
(c) 2011 IBM Corporation

Transaction-volume explosions are increasingly common in many commercial businesses: online shopping and online auction services, algorithmic trading, banking services, and more. It is difficult (if not impossible) to create systems that satisfy transactional consistency, scalable performance, and high availability all at once. Can we improve performance without a significant loss of availability?

We study the performance-availability trade-off in a distributed cluster environment by proposing a new replication protocol. Our replication protocol:
- allows continuous adjustment between performance and availability
- keeps global data consistency at transaction boundaries
- enables scalable performance with a slight compromise of availability

Outline
- Motivation and contributions
- Replication scheme: data replication model, existing replication strategies, our approach
- Replication protocol detail
- Failure recovery process
- Failover example
- Availability
- Experiment
- Summary

Data replication model
- Data tables are partitioned by a partitioning function and distributed over a cluster of nodes.
- Each partition is replicated on 3 different nodes (as Primary, Secondary, and Tertiary data), and each node serves 3 different partitions.
(Figure: placement of partitions 1..10 across Node 1 .. Node 10 plus spare nodes; X-P, X-S, and X-T denote the Primary, Secondary, and Tertiary copies of partition X.)
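
The figure suggests a simple rotating placement: partition i's Primary is on node i, its Secondary on node i+1, and its Tertiary on node i+2 (modulo the cluster size), so every node serves three different partitions. A minimal Python sketch of that placement, inferred from the figure rather than taken from the paper:

    def place_partitions(num_nodes):
        """Rotating placement inferred from the figure: node k holds partition k
        as Primary, partition k-1 as Secondary, and partition k-2 as Tertiary
        (indices modulo the cluster size, 0-based here instead of the slide's
        1-based numbering)."""
        placement = {}  # node -> {"P": partition, "S": partition, "T": partition}
        for node in range(num_nodes):
            placement[node] = {
                "P": node,
                "S": (node - 1) % num_nodes,
                "T": (node - 2) % num_nodes,
            }
        return placement

    # Node 0 then holds 0-P, 9-S, 8-T in a 10-node cluster, matching the figure.
    print(place_partitions(10)[0])   # {'P': 0, 'S': 9, 'T': 8}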

Existing replication strategies
1. Synchronous replication
- The Primary waits for its logs to be mirrored on the backup nodes
- Allows failover without data loss
- Limited performance: the "dangers of replication" problem [SIGMOD, 1996]
- Example: traditional RDB systems, e.g. DB2 Parallel Edition
2. Asynchronous replication
- The Primary proceeds without waiting for acknowledgement from the backups
- Risks data loss upon failover to a backup node
- Better performance, by shifting the synchronization delay onto read transactions
- Examples: Chain replication [OSDI, 2004], Ganymed [Middleware, 2004]
(Figure: a chain-replication example; updates enter at the HEAD, queries are answered by the TAIL, which sends the replies.)

Our approach
- We employ a different replication policy for each of the 2 backup nodes:
  Primary: active computation node
  Secondary: synchronous replication node
  Tertiary: asynchronous replication node
- This allows a performance improvement through relaxed synchronization, while the Tertiary still contributes to increased availability.
(Figure: the application issues update/commit and read requests against the Primary; log data flows to the Secondary on update/commit, and from the Secondary on to the Tertiary.)

Replication protocol (1): task start and log shipping
- The Master sends task-start messages to all Primary workers to start their local tasks.
- Each Primary accumulates all data updates from the application into logs and sends them to its Secondary.
- The Secondary passes the logs on to the Tertiary.
(Figure: message sequence across the Master node and the Primary, Secondary, and Tertiary workers during global UOW 1 and UOW 2.)

Replication protocol (2): local commit and task end
- The Secondary notifies the Primary when buffering of the logs is complete.
- The Primary commits its local transaction when the buffering-complete message arrives from the Secondary.
- The Primary then sends a task-end message to the Master.
(Figure: message sequence across the Master node and the Primary, Secondary, and Tertiary workers.)

Replication protocol (3): global commit
- The Master receives task-end messages from all Primary workers.
- The Master then tells all Secondary workers to commit (docommit).
- Each Primary starts its next local task after receiving the task-start message from the Master.
(Figure: message sequence across the Master node and the Primary, Secondary, and Tertiary workers.)

&! $ Tertiary notifies Primary when the s are committed Primary deletes the corresponding s Global UOW 1 Local Task update/commit read Global UOW 2 Local Task deletelog commitcomplete Master node Primary worker Secondary worker Tertiary worker

Failover example
- A spare node is activated upon the failure of any single node.
- The Secondary and Tertiary of the affected partition are promoted to Primary and Secondary.
- The spare node gets copies from the new Secondary and acts as the new Tertiary.
(Figure: partition placement across Node 1 .. Node 10 and the spare nodes before and after failover; X-P, X-S, and X-T denote the Primary, Secondary, and Tertiary copies of partition X.)
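
A sketch of the promotion rule for a single-node failure, with illustrative names. The Primary-failure branch follows the slide directly (Secondary to Primary, Tertiary to Secondary, spare as new Tertiary); the Secondary- and Tertiary-failure branches are inferred from the failover figure and the recovery chart in the backup slides.

    def fail_over_single_node(roles, failed_node, spare_node):
        """roles maps each partition to its {"P", "S", "T"} node assignment."""
        for partition, a in roles.items():
            if a["P"] == failed_node:
                a["P"] = a["S"]          # Secondary promoted to Primary
                a["S"] = a["T"]          # Tertiary promoted to Secondary
                a["T"] = spare_node      # spare copies from the new Secondary
            elif a["S"] == failed_node:
                a["S"] = a["T"]
                a["T"] = spare_node
            elif a["T"] == failed_node:
                a["T"] = spare_node
        return roles

    # With the rotating placement sketched earlier, the failure of node 2
    # promotes node 3 to Primary and node 4 to Secondary for partition 2.
    roles = {i: {"P": i, "S": (i + 1) % 10, "T": (i + 2) % 10} for i in range(10)}
    fail_over_single_node(roles, failed_node=2, spare_node="spare1")
    print(roles[2])   # {'P': 3, 'S': 4, 'T': 'spare1'}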

Double failure of Primary and Secondary: recoverable case
- Suppose both the Primary and the Secondary fail at the same time.
- If the Tertiary has the log records of the last committed transaction, the system can fail over to it and continue without data loss.
(Figure: the Primary holds Update1..Update3 and the Commit of the last committed transaction plus Update4 of the current one; the Secondary holds Update1..Update3 and the Commit; the Tertiary holds Update1..Update3, so failover succeeds and the system continues.)

Double failure of Primary and Secondary: unrecoverable case
- If both the Primary and the Secondary fail at the same time, and
- the Tertiary has not yet received all the logs of the last committed transaction (because of transfer delay),
- then some data is lost and the system is not automatically recoverable.
(Figure: the Primary holds Update1..Update3, the Commit, and Update4; the Secondary holds Update1..Update3 and the Commit; the delayed Tertiary holds only Update1 and Update2, so the logs are incomplete and failover is not possible.)
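
The difference between the two cases boils down to a simple check at failover time, sketched below with illustrative names: after a simultaneous Primary and Secondary failure, the Tertiary can take over only if its logs cover every update of the last committed transaction.

    def can_fail_over_to_tertiary(last_committed_updates, tertiary_logs):
        """True if the Tertiary holds every update record of the last committed
        transaction; otherwise the system is not automatically recoverable."""
        received = set(tertiary_logs)
        return all(record in received for record in last_committed_updates)

    # Recoverable: the Tertiary already received Update1..Update3.
    print(can_fail_over_to_tertiary(["Update1", "Update2", "Update3"],
                                    ["Update1", "Update2", "Update3"]))   # True
    # Unrecoverable: transfer delay, Update3 never reached the Tertiary.
    print(can_fail_over_to_tertiary(["Update1", "Update2", "Update3"],
                                    ["Update1", "Update2"]))              # False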

Availability
- The availability of our system is affected by the delay in transferring the logs to the Tertiary.
- The delay is significantly affected by the data-transfer efficiency from the Secondary to the Tertiary; disk accesses due to insufficient memory can be a bottleneck.
- By removing I/O bottlenecks on the nodes, we can minimize the delay and maximize P, the probability that the log records of the last committed transaction are available on the Tertiary.
(Figure: estimated availability for a cluster of 1,000 nodes, each with a 0.001 failure probability, i.e. a 3-year (= 1,000-day) MTBF and a 1-day MTTR: 99.9% with one synchronous backup, 99.99% with P=0.5, 99.999% with P=0.9, and 99.9999% with two synchronous backups.)

Experimental setup
- We created a batch job by combining three different TPC-C scenarios: NewOrder, Payment, and Delivery.
- We evaluate our replication protocol from the following aspects:
  - Scaling efficiency (strong scaling and weak scaling)
  - Replication overhead (with and without replication)
  - Effect of relaxed synchronization
- Hardware: 39 IBM BladeCenter JS21 blades (PowerPC 970MP), each running DB2.
(Figure: the input purchase-order list is partitioned and assigned to the workers; each input item goes through NewOrder (step 1), Payment (step 2), and Delivery (step 3) on the Master and the Primary/Secondary/Tertiary workers across Node 1 .. Node 39.)

Strong scaling: throughput
- The throughput increases almost linearly as nodes are added.
(Figure: throughput (number of inputs/sec, higher is better) versus number of blades, for total input data sizes of 40,000 to 400,000 records.)

Weak scaling: elapsed time
- The execution time stays almost flat as nodes are added if sufficient memory (e.g. DB buffer pools) is available on each node.
- Otherwise, the increase in disk accesses delays synchronization.
(Figure: elapsed time in seconds (lower is better) versus number of blades, for per-blade input sizes of 1K to 20K records.)

Replication overhead
- The replication overhead varies with the input data size per blade.
- With heavy disk access (point A in the figure) the overhead is fairly high; with sufficient memory (point B) the overhead is 20%.
(Figure: (a) strong scaling with 40,000 records; elapsed time in seconds versus number of blades, with and without data replication.)

Effect of relaxed synchronization
- We compared the total execution time of TPC-C NewOrder transactions between the conventional (full synchronization) model and our relaxed synchronization model.
- Our low-synchronization-overhead approach reduces the total execution time by 43%.
(Figure: total execution time in seconds of the TPC-C NewOrder transactions, conventional full synchronization versus our relaxed synchronization.)

Summary
- We proposed a new replication protocol that combines two different replication policies: synchronous replication for the Secondary and asynchronous replication for the Tertiary.
- Using our replication scheme:
  - We can achieve scalable performance.
  - The system tolerates up to 2 simultaneous node failures among the triple-redundant nodes most of the time.
  - The overhead of data replication is 20% with sufficient memory.
- We showed a performance-availability trade-off: we can obtain performance improvements by slightly compromising availability (e.g. 99.9999% -> 99.999%).

Backup slides

Execution framework overview
1. Data (both DB tables and input files) are partitioned and distributed over a cluster of nodes, as specified by users.
2. The Master partitions the job into tasks based on the data layout and assigns them to nodes following the owner-computes rule.
3. Each node executes a task, which requires only local data accesses.
(Figure: a client sends a job request to the Master node (primary and backup); the Master assigns tasks to nodes based on the data layout; each node in the cluster (node-1 .. node-n) runs an independent DB instance holding its partitions of tables A and B (A1/B1 .. An/Bn) and its partition of the input file.)
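
A sketch of the owner-computes assignment in step 2, with illustrative names and data shapes: the Master splits the job by partition and sends each task to the node that already holds that partition of the tables and the input file, so task execution needs only local data accesses.

    def assign_tasks(input_partitions, partition_owner):
        """Owner-computes rule: the task for partition p runs on the node that
        stores p's table partitions (e.g. Ap, Bp) and its chunk of the input file."""
        tasks = {}
        for partition, input_chunk in input_partitions.items():
            node = partition_owner[partition]
            tasks.setdefault(node, []).append({"partition": partition,
                                               "input": input_chunk})
        return tasks

    # Example with three partitions owned by node-1..node-3.
    layout = {1: "node-1", 2: "node-2", 3: "node-3"}
    orders = {1: ["order-9638"], 2: ["order-4738"], 3: []}
    print(assign_tasks(orders, layout))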

3. Optimistic replication
- Relies on an eventual consistency model; a conflict-resolution mechanism is necessary.
- Transactions cannot be supported (e.g. read-modify-write is not possible).
- Superior in performance and availability.
- Example: the gossip protocol in Cassandra.

Availability estimation
- Example: a cluster of 1,000 nodes, each with a failure probability of 0.001, e.g. a 3-year (= 1,000-day) MTBF (Mean Time Between Failures) and a 1-day MTTR (Mean Time To Repair).
- Conventional full-synchronization approach: the system becomes unavailable only when all nodes holding a copy of a particular partition fail at the same time.
  - 99.9999% availability for 2-backup-node replication
  - 99.9% availability for 1-backup-node replication
- Our relaxed synchronization approach: system availability also depends on the probability that the logs are available on the Tertiary when the Primary and Secondary fail simultaneously.
  - 99.999% if we assume this log-availability probability is 0.9
  - 99.99% if we assume this log-availability probability is 0.5
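
A small worked calculation for the full-synchronization figures above, assuming independent node failures and one partition per node. It reproduces the 99.9% and 99.9999% numbers; the relaxed-synchronization figures additionally depend on the log-availability probability on the Tertiary, whose exact model the slide does not spell out, so they are not recomputed here.

    p = 0.001           # per-node failure probability (about 1,000-day MTBF, 1-day MTTR)
    partitions = 1000   # one partition per node in the 1,000-node cluster

    def availability(copies):
        """The system is unavailable only if all nodes holding some partition's
        copies fail at the same time; failures are assumed independent."""
        per_partition_loss = p ** copies
        return (1 - per_partition_loss) ** partitions

    print(f"1 backup  (2 copies): {availability(2):.4%}")   # ~99.90%
    print(f"2 backups (3 copies): {availability(3):.4%}")   # ~99.9999%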

Sustainable State
- The Primary has committed all updates in the last UOW and sent their logs to the Secondary.
- The Secondary has received the logs for the last UOW.
- The Tertiary is alive and ready to receive logs.
Our protocol proceeds by keeping this Sustainable State among all triplets.
Example (UOW 3 is the current UOW, so the last completed UOW is UOW 2):
  Primary:   all updates of UOW 1 and UOW 2 committed; sending the logs of UOW 3
  Secondary: all logs of UOW 1 and UOW 2 received; receiving the logs of UOW 3
  Tertiary:  all logs of UOW 1 received; receiving the logs of UOW 2
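
A sketch of the Sustainable State as a predicate over one Primary/Secondary/Tertiary triplet; the field names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class TripletState:
        primary_committed_uow: int     # last UOW whose updates the Primary committed
        primary_sent_uow: int          # last UOW whose logs the Primary sent to the Secondary
        secondary_received_uow: int    # last UOW whose logs the Secondary fully received
        tertiary_alive: bool           # Tertiary is up and ready to receive logs

    def is_sustainable(s: TripletState, last_uow: int) -> bool:
        """Sustainable State: the last UOW is committed on the Primary, its logs
        are on the Secondary, and the Tertiary can keep receiving logs."""
        return (s.primary_committed_uow >= last_uow
                and s.primary_sent_uow >= last_uow
                and s.secondary_received_uow >= last_uow
                and s.tertiary_alive)

    # The example above: during UOW 3, the last completed UOW is 2.
    print(is_sustainable(TripletState(2, 2, 2, True), last_uow=2))   # True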

Failure recovery process
1. Find a transaction recovery point and determine the new Primary and Secondary.
2. Select a node to join the triplet as the new Tertiary.
3. Have the new Secondary send a snapshot and logs to the new Tertiary.
4. Resume the application on the new Primary.
(Figure: decision chart listing the failure cases (Secondary fails; Primary fails; Tertiary fails; both Primary and Secondary fail; both Secondary and Tertiary fail; both Primary and Tertiary fail) with the surviving members, mapped to the recovery actions: keeping the sustainable state, new Tertiary assignment, node promotion, node promotion and/or new Secondary assignment, or, when the logs are not available, not automatically recoverable.)
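
The chart reduces to a simple rule, sketched below with illustrative names: recovery is automatic whenever a synchronously replicated copy (Primary or Secondary) survives, or the Tertiary survives and already holds the logs of the last committed transaction; only then do recovery steps 1-4 run.

    def automatically_recoverable(primary_alive, secondary_alive, tertiary_alive,
                                  tertiary_has_last_committed_logs):
        """Condensed form of the decision chart: at least one synchronous copy
        survives, or the Tertiary survives with a complete set of logs."""
        if primary_alive or secondary_alive:
            return True
        return tertiary_alive and tertiary_has_last_committed_logs

    # Double failure of Primary and Secondary: the outcome depends on the Tertiary's logs.
    print(automatically_recoverable(False, False, True, True))    # True  -> run steps 1-4
    print(automatically_recoverable(False, False, True, False))   # False -> manual recovery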