Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie, TTU CS Department Seminar, 3/7/2017



Outline: General introduction; Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation); Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation); Conclusion.

Big Data Storage: Growing data-intensive (big data) applications involve large data volumes (hundreds of TBs, PBs, even EBs) and thousands of CPUs accessing the data, running on cluster computers (supercomputers, data centers, cloud infrastructure). 1 PB = 10^9 MB; 1 EB = 10^12 MB.

Big Data Examples: Science: the Large Hadron Collider (LHC) produces about 1 PB of data per second, 15 PB of filtered data per year, and uses 160 PB of disk. Search engines: Yahoo used 1,500 nodes for 5 PB of data (2008).

Scalability of Storage: To store large volumes of data, the scalability of the data store software is critical. Scalability means the performance improvement achieved by increasing the number of servers. Popular systems like the Hadoop Distributed File System (HDFS) scale to about 10,000 nodes, but performance hits a bottleneck at the metadata servers.

Metadata Server Bottleneck: With many data nodes (DNs), HDFS has a performance bottleneck at the name-node. It needs very large capacity to store the metadata, and querying/updating the name-node from many concurrent clients degrades performance.

Getting Rid of the Metadata Server: Consistent hashing uses a hash function to map data to DNs directly (e.g., data ID=1 → hash function → node ID=101) instead of looking up a table that maps data IDs to node IDs (e.g., data 1 → node 101, data 2 → node 304). There is no need to update a metadata server, the memory footprint is much smaller, and the achievable scale increases by roughly 10x (Ceph).

Consistent Hashing. [Figure: key partitions D1, D2, D3 and servers are hashed onto a ring spanning 0 to 2^160; each key is stored on the next server clockwise. With servers 1-3: 1 holds D1, 2 holds D2, 3 holds D3. After adding server 4: 4 holds D1, 2 holds D2, 3 holds D3, and 1 holds nothing, so only D1 moves.]
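To make the ring behavior concrete, here is a minimal consistent-hashing sketch in Python (not the implementation used in this work; SHA-1 and the one-position-per-server layout are illustrative): servers and keys are hashed onto the same ring, and each key is stored on the first server clockwise, so adding a server only moves the keys that fall into its new arc.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Map a string onto the hash ring (SHA-1 gives a 160-bit space: 0 .. 2^160 - 1).
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers=()):
        self._hashes = []   # sorted positions of servers on the ring
        self._owner = {}    # ring position -> server name
        for s in servers:
            self.add_server(s)

    def add_server(self, server):
        h = ring_hash(server)
        bisect.insort(self._hashes, h)
        self._owner[h] = server

    def remove_server(self, server):
        h = ring_hash(server)
        self._hashes.remove(h)
        del self._owner[h]

    def lookup(self, key):
        # The key is stored on the first server clockwise from its position
        # (wrapping around at the end of the ring).
        idx = bisect.bisect_right(self._hashes, ring_hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]

# Adding a server only remaps the keys that fall into its new arc of the ring.
ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
before = {k: ring.lookup(k) for k in ("D1", "D2", "D3")}
ring.add_server("server-4")
after = {k: ring.lookup(k) for k in ("D1", "D2", "D3")}
print(before)
print(after)
print("moved:", [k for k in before if before[k] != after[k]])
```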

Challenges with CH: Modern large-scale data stores face challenges in scalability, manageability, performance, power consumption, and fault tolerance. We observe and investigate two problems with CH, concerning power consumption and fault tolerance.

Outline: General introduction; Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation); Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation); Conclusion.

Background: Elastic Data Store for Power Saving. Elasticity is the ability to resize the storage cluster as the workload varies (more servers mean better performance but higher power consumption). Benefits: storage nodes can be re-used for other purposes, and machine hours (operating cost) are saved. Most distributed storage systems, such as GFS and HDFS, are not elastic: deactivating servers may make data unavailable.

Agility is Important: Agility determines how many machine hours can be saved.

Non-elastic Data Layout: A typical pseudo-random data layout, as seen in most CH-based distributed file systems. Almost all servers must be on to ensure 100% data availability, so there is no elastic resizing capability.

Elastic Data Layout: The general rule is to take advantage of replication: always keep the first (primary) replicas on, while the servers holding the other replicas can be activated on demand.

Primary Server Layout: Peak write performance is N/3 (same as non-elastic), but scaling is limited to N/3 only.

Equal-work Data Layout

Primary-server Layout with CH: We modify the data placement of the original CH so that one replica is always placed on a primary server. To achieve the equal-work layout, the cluster must be configured accordingly. [Figure: hash rings with primary servers (always active), active secondary servers, and inactive secondary servers; when placing the first replica of a data object, secondary servers are skipped, and when placing the other replicas, primary and inactive servers are skipped.]
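A minimal sketch of the skip rule in the figure above, under simplified assumptions of my own (one ring position per server, replication factor 3, and inactive secondaries skipped rather than deferred); the function and field names are illustrative and not Sheepdog's.

```python
import hashlib

RING_SIZE = 1 << 160  # SHA-1 hash space

def ring_hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def place_replicas(obj_id, servers, replicas=3):
    """servers: list of dicts such as
    {"name": "s1", "role": "primary" or "secondary", "active": True}."""
    start = ring_hash(obj_id)
    # Walk the ring clockwise starting from the object's position.
    ordered = sorted(servers, key=lambda s: (ring_hash(s["name"]) - start) % RING_SIZE)

    placement = []
    # First replica: skip secondary servers so the primary copy always
    # lands on an always-on (primary) server.
    for s in ordered:
        if s["role"] == "primary":
            placement.append(s["name"])
            break
    # Remaining replicas: skip primary and inactive servers.
    for s in ordered:
        if len(placement) >= replicas:
            break
        if s["role"] == "secondary" and s["active"] and s["name"] not in placement:
            placement.append(s["name"])
    return placement

servers = [
    {"name": f"s{i}", "role": "primary" if i <= 3 else "secondary", "active": i <= 8}
    for i in range(1, 11)
]
print(place_replicas("obj-10010", servers))
```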

Equal-work Data Layout: Number of data chunks on a primary server: v_primary = B/p. Number of data chunks on the secondary server ranked i: v_secondary,i = B/i. [Figure: data distribution (number of data blocks, x10^4) versus rank of server for Version 1 (10 active), Version 2 (8 active), and Version 3 (10 active), highlighting the data to migrate.]
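As a small illustration of the two formulas above (assuming i denotes a server's rank, with the p primary servers ranked first), the target chunk counts can be computed as follows; the names are illustrative only.

```python
def equal_work_layout(total_chunks, num_primaries, num_servers):
    """Target chunk count per server rank (1-based): each primary holds B/p,
    and the secondary ranked i holds B/i."""
    targets = {}
    for rank in range(1, num_servers + 1):
        if rank <= num_primaries:
            targets[rank] = total_chunks / num_primaries   # v_primary = B / p
        else:
            targets[rank] = total_chunks / rank            # v_secondary,i = B / i
    return targets

# Example: B = 100,000 chunks, p = 3 primaries, 10 servers in total.
print(equal_work_layout(100_000, 3, 10))
```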

Contribution Summary: A primary data placement/replication scheme with consistent hashing that achieves a primary-secondary data layout for elasticity, requires only a slight modification to existing consistent hashing, and preserves the properties of consistent hashing.

Data Re-integration: After a node is turned off, no data is written to it. When this node joins again, newly created or modified data may need to be re-integrated to it. However, the data store does not know which data was modified or newly created, so it has to transfer all data that should be placed on the newly joined node.

Data Re-integration: Data re-integration incurs many I/O operations and degrades performance when scaling back up. 3-phase workload: high load -> low load -> high load. No resizing: 10 servers always on. With resizing: 10 servers -> 2 servers -> 10 servers. [Figure: I/O throughput (MB/s) over 600 seconds for original consistent hashing with and without resizing; markers show where phase 1 and phase 2 end.]

Our Contribution: Selective background re-integration. A dirty table tracks all OIDs that are dirty; when re-integration of an OID finishes, it is removed from the table, and the rate of re-integration is controlled. [Figure: walk-through across layout versions 9-11 on a ring of primary (always active), active secondary, and inactive secondary servers. A membership table records each node's on/off state per version; a dirty table records (OID, version) entries such as OID 10010 at version 9. As nodes are turned back on at each resizing, dirty objects are re-integrated in order; all dirty data in the table up to OID 10010 is re-integrated to version 10, after which OID 10010 is marked clean at version 11.]
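To make the dirty-table mechanism concrete, here is a minimal in-memory sketch (illustrative data structures and names, not the actual implementation): writes that miss an inactive node are recorded as dirty OIDs together with the layout version at which they were written; when nodes rejoin, only those OIDs are transferred, in version order, and then removed from the table.

```python
class DirtyTracker:
    """Track objects written while some of their target nodes were inactive."""

    def __init__(self):
        self.version = 0   # current layout/membership version
        self.dirty = {}    # oid -> layout version at which it became dirty

    def resize(self):
        # Any membership change (nodes turned off or on) bumps the version.
        self.version += 1

    def record_write(self, oid, missed_inactive_node):
        # Mark the object dirty only if one of its target nodes was inactive.
        if missed_inactive_node:
            self.dirty[oid] = self.version

    def reintegrate(self, transfer, up_to_version=None):
        """Selectively re-integrate dirty objects in version order, calling
        transfer(oid) for each one and removing it from the dirty table."""
        limit = self.version if up_to_version is None else up_to_version
        for oid, ver in sorted(self.dirty.items(), key=lambda kv: kv[1]):
            if ver <= limit:
                transfer(oid)        # copy the object to the rejoined node(s)
                del self.dirty[oid]

tracker = DirtyTracker()
tracker.resize()                          # scale down: version 1
tracker.record_write("oid-10010", True)   # written while a target node was off
tracker.resize()                          # scale back up: version 2
tracker.reintegrate(lambda oid: print("re-integrating", oid))
```

Rate control (the deadline T and transfer rate used later in the evaluation) would pace how fast reintegrate() walks the table; that part is omitted here.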

Implementation: The primary-secondary data placement/replication is implemented in Sheepdog; dirty data tracking is implemented using Redis.
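Since the slide above mentions Redis, here is one plausible way the dirty table could be kept in Redis, using a sorted set scored by layout version; this is my own sketch with the redis-py client, not necessarily how the authors stored it.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def mark_dirty(oid: str, version: int):
    # A sorted set keeps dirty OIDs ordered by the layout version
    # in which they became dirty.
    r.zadd("dirty", {oid: version})

def reintegrate_up_to(version: int, transfer):
    # Re-integrate everything dirtied at or before `version`, oldest first.
    for oid in r.zrangebyscore("dirty", 0, version):
        transfer(oid.decode())
        r.zrem("dirty", oid)
```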

Evaluation: 3-phase workload test, where T is the deadline for background re-integration and Rate is its data transfer rate. Performance is significantly improved with selective background re-integration, although a high rate delays resizing. [Figure: I/O throughput (MB/s) over 600 seconds for Sel+backg (T=2, 4, 6; Rate=200), Selective, Original CH, and No-resizing; markers show where phase 1 and phase 2 end.]

Large-scale Trace Analysis: We use the Cloudera traces, apply our policy, and analyze the effect of resizing. [Figure: number of active servers over time (minutes) for the CC-a and CC-b traces, comparing Ideal, Original CH, Primary+aggressive, and Primary+background.]

Summary: We propose a primary-secondary data placement/replication scheme to provide better elasticity in consistent hashing based data stores. We use a selective background data re-integration technique to reduce the I/O footprint when re-integrating nodes into a cluster. This is the first work studying elasticity for saving power in consistent hashing based stores.

Outline: General introduction; Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation); Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation); Conclusion.

Fault-tolerance and Self-healing: Replication is used to tolerate failures. When a node fails, a self-healing system can recover the lost data by itself, without administrator intervention. [Figure: hash ring with key partitions D1-D3 and servers 1-6; when server 2 fails, D2's second replica is migrated to server 3 automatically.]

Motivation: Even though CH is able to self-heal from failures, the cost of recovery is large (data transfers). If self-healing is simply delayed, however, the risk of data loss can be large. We use a different data layout to delay healing as much as possible, and determine when it is OK to delay self-healing and when it is not.

Motivation: Pseudo-random replication has low tolerance for multiple concurrent failures; losing one server puts data in danger.

Primary Replication: The same scheme as the one used in Elastic Consistent Hashing. As long as the primary replicas are available, there is no worry about losing data.

Data Recovery Strategy: Aggressive recovery: as soon as a node fails, recovery starts transferring data. Lazy recovery: as long as a node failure does not incur much risk of losing data, the data transfer is delayed. We need a metric to quantify the risk of losing data.

Determine Recovery Strategy: The Minimum Replication Level (MRL) is the smallest number of replicas that any data object may have. A larger MRL means more failures can be tolerated. We set a threshold on MRL: when MRL drops below the threshold, aggressive recovery is used.

Measuring MRL in CH: MRL can be easily calculated in a consistent hashing based data store. [Figure: hash-ring examples with primary servers, secondary servers, data objects, and failed nodes marked as uncommitted (u) or committed (c): servers 5, 6 and 10 failed gives MRL=2, lazy; servers 4, 6 and 10 failed gives MRL=1, aggressive; servers 4, 6 and 10 failed but committed gives MRL=3, lazy; primary server 3 failed gives MRL=3, aggressive.]
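As a rough illustration of how the MRL threshold could drive the lazy-versus-aggressive decision, here is a simplified sketch; the placement map, the threshold value, and the omission of the committed/uncommitted and failed-primary special cases from the figure are my own simplifications.

```python
def minimum_replication_level(placement, failed):
    """placement: dict mapping object -> list of replica servers;
    failed: set of down servers. MRL is the smallest number of
    surviving replicas over all objects."""
    return min(sum(1 for s in replicas if s not in failed)
               for replicas in placement.values())

def choose_recovery(placement, failed, threshold=2):
    # Aggressive recovery only when the risk of data loss is high.
    mrl = minimum_replication_level(placement, failed)
    return "aggressive" if mrl < threshold else "lazy"

# Illustrative example with 3-way replication.
placement = {
    "D1": ["s1", "s5", "s8"],
    "D2": ["s2", "s6", "s10"],
    "D3": ["s3", "s7", "s9"],
}
# D2 keeps only one surviving replica (s2), so MRL=1 and recovery is aggressive.
print(choose_recovery(placement, failed={"s5", "s6", "s10"}))
```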

Analysis with MSR Trace: The MSR trace is a one-week I/O trace from Microsoft Research servers. We insert recovery periods into the trace under the two recovery strategies. [Figure: IOPS over 200 hours of the MSR trace with recovery periods marked, under aggressive recovery and under lazy recovery.]

Evaluation: We simulate primary-secondary replication and lazy recovery within libch-placement, a consistent hashing library. Failures are generated using a Weibull distribution. The simulated failure and recovery data are inserted into the MSR trace and replayed on a Sheepdog client. The primary + lazy recovery strategy improves I/O performance when a failure occurs. [Figure: I/O rate (MB/s) of the MSR trace around two failure events (hours 111-117 and 124-130), comparing primary-secondary replication with random replication.]
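The failures mentioned above are drawn from a Weibull distribution; a minimal way to sample failure times in Python is sketched below. The shape and scale parameters are placeholders, not the values used in this work.

```python
import random

def failure_times(n, scale_hours=72.0, shape=0.7, seed=42):
    """Sample n failure inter-arrival times (hours) from a Weibull distribution
    and return cumulative failure timestamps. Parameters are illustrative only."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.weibullvariate(scale_hours, shape)  # alpha = scale, beta = shape
        times.append(round(t, 2))
    return times

print(failure_times(5))
```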

Summary: We leverage the primary-secondary replication scheme to replace the random replication scheme and tolerate multiple concurrent failures. We use the MRL metric to determine the risk of data loss and choose the data recovery strategy. With our replication scheme and recovery strategy, the I/O footprint after a node failure is significantly reduced.

Conclusion: Consistent hashing based stores are promising but have limited functionality. We provide some initial insight into how to enhance consistent hashing to offer functionality that is important in modern data stores, such as fault tolerance and elasticity. There is much more to be explored.

Questions! You are welcome to visit our websites for more details. DISCL lab: http://discl.cs.ttu.edu/ Personal site: https://sites.google.com/site/harvesonxie/