Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Stores
Wei Xie
TTU CS Department Seminar, 3/7/2017
Outline
General introduction
Study 1: Elastic Consistent Hashing based Store
  Motivation and related work
  Design
  Evaluation
Study 2: Reducing Failure-recovery Cost in CH based Store
  Motivation and related work
  Design
  Evaluation
Conclusion
Big Data Storage
Growing data-intensive (big data) applications
Large data volumes (hundreds of TBs, PBs, even EBs); thousands of CPUs accessing the data
Cluster computers (supercomputers, data centers, cloud infrastructure)
1 PB = 1,000,000,000 MB; 1 EB = 1,000,000,000,000 MB (10^12)
[Figure: a cluster computer]
Big Data Examples
Science: the Large Hadron Collider (LHC) generates 1 PB of data per second and 15 PB of filtered data per year, using 160 PB of disk
Search engine: Yahoo used 1,500 nodes for 5 PB of data in 2008
Scalability of Storage
To store large volumes of data, the scalability of the data store software is critical
Scalability means: performance improves as the number of servers increases
Popular systems like the Hadoop Distributed File System (HDFS) scale to 10,000 nodes
Performance encounters a bottleneck at the metadata servers
Metadata Server Bottleneck
With many data nodes (DNs), HDFS has a performance bottleneck at the name-node
The name-node needs a very large capacity to store metadata
Querying/updating the name-node from many concurrent clients degrades performance
Getting Rid of the Metadata Server
Consistent hashing: use a hash function to map data directly to DNs
[Figure: instead of a lookup table mapping data IDs to node IDs, a hash function maps data ID=1 to node ID=101]
No need to update a metadata server
Much smaller memory footprint
10X increase in scale (Ceph)
Consistent Hashing
[Figure: keys D1, D2, D3 hashed onto a ring partitioned among servers 1-3; initially server 1 holds D1, 2 holds D2, 3 holds D3. After server 4 joins, 4 holds D1, 2 still holds D2, 3 still holds D3, and 1 holds nothing: only the data in the new server's arc moves]
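The consistent hashing idea above can be sketched in a few lines of Python. This is a minimal illustration (class and server names are hypothetical; real stores add virtual nodes and weighting): servers and keys hash onto the same circular space, and a key is stored on the first server clockwise from its hash.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent hashing ring (illustrative sketch)."""

    def __init__(self, servers):
        # (hash position, server) pairs, kept sorted around the ring
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def lookup(self, key):
        # first server at a position strictly after the key's hash,
        # wrapping around the end of the ring
        positions = [p for p, _ in self.ring]
        idx = bisect_right(positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def add_server(self, server):
        # Only keys falling between the new server and its predecessor
        # move; every other key-to-server assignment is unchanged.
        self.ring = sorted(self.ring + [(self._hash(server), server)])
```

Adding a server relocates only the keys in that server's arc, which is exactly the property that removes the need for a central metadata lookup.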
Challenges with CH
Modern large-scale data store challenges: scalability, manageability, performance, power consumption, fault tolerance
We observe and investigate two problems with CH, in terms of power consumption and fault tolerance
Study 1: Elastic Consistent Hashing based Store
Background: Elastic Data Store for Power Saving
Elasticity: the ability to resize the storage cluster as the workload varies (more servers mean better performance but higher power consumption)
Benefits: re-use storage nodes for other purposes; save machine hours (operating cost)
Most distributed storage systems, such as GFS and HDFS, are not elastic: deactivating servers may make data unavailable
Agility is Important
Agility determines how many machine hours can be saved
Non-elastic Data Layout
A typical pseudo-random data layout, as seen in most CH-based distributed file systems
Almost all servers must be on to ensure 100% availability
No elastic resizing capability
Elastic Data Layout
General rule: take advantage of replication
Always keep the first (primary) replica on
The other replicas can be activated on demand
Primary Server Layout
Peak write performance: N/3 (same as non-elastic)
Limited scaling: can only scale down to N/3 of the servers
Equal-work Data Layout
Primary-server Layout with CH
Modifies data placement in the original CH so that one replica is always placed on a primary server
To achieve the equal-work layout, the cluster must be configured accordingly
[Figure: a hash ring of 10 servers with primary servers (always active), active secondary servers, and inactive secondary servers; the primary replica of a data object skips secondary servers on its walk, while the other replicas skip primary and inactive servers]
Equal-work Data Layout
Number of data chunks on each primary server: v_primary = B/p (B: total number of data chunks, p: number of primary servers)
Number of data chunks on the secondary server with rank i: v_secondary(i) = B/i
[Figure: number of data blocks versus server rank (1-10) for three layout versions: version 1 (10 active), version 2 (8 active), version 3 (10 active), with the data to migrate between versions highlighted]
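Under the formulas above (as reconstructed here: v_primary = B/p and v_secondary(i) = B/i, which matches the decreasing-by-rank distribution in the figure), the per-server block counts can be computed with a short hypothetical helper:

```python
def equal_work_layout(B, p, n):
    """Expected block counts for servers ranked 1..n under the
    equal-work layout: primaries (ranks 1..p) hold B/p blocks each,
    and the secondary of rank i > p holds B/i blocks."""
    return [B / p if rank <= p else B / rank for rank in range(1, n + 1)]
```

For example, with B=120 blocks, p=2 primaries, and n=4 servers, the layout is [60, 60, 40, 30]: each active server contributes equal work when the cluster scales down to its rank.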
Contribution Summary
A primary data placement/replication scheme with consistent hashing
Achieves a primary-secondary data layout for elasticity
Requires only a slight modification to existing consistent hashing
Preserves the properties of consistent hashing
Data Re-integration
After a node is turned off, no data is written to it. When the node joins again, any newly created or modified data may need to be re-integrated to it.
However, the data store does not know which data was modified or newly created, so it has to transfer all data that should be placed on the re-joined node.
Data Re-integration
Data re-integration incurs many I/O operations and degrades performance when scaling up
3-phase workload: high load -> low load -> high load
No resizing: 10 servers always on; with resizing: 10 servers -> 2 servers -> 10 servers
[Figure: I/O throughput (MB/s) over 600 seconds for original consistent hashing, with and without resizing; with resizing, throughput drops sharply after phase 2 ends, when re-integration begins]
Our Contribution: Selective Background Re-integration
A dirty table tracks the OIDs of all objects that are dirty (written while some replica holder was off)
When re-integration of an object finishes, its OID is removed from the table
The rate of re-integration is controlled
[Figure: walkthrough of two resizing steps using a membership table (node on/off states, layout versions 9 through 11) and a dirty table of OIDs with versions; dirty objects such as obj 10010 are re-integrated in order and marked clean (all dirty entries up to OID 10010 re-integrated to version 10)]
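The dirty-table idea can be sketched as below. This is an illustrative in-memory version (the actual system tracks dirty OIDs in Redis; all names here are hypothetical): writes made while a node is off are recorded, and re-integration drains the table in OID order at a controlled rate.

```python
class DirtyTable:
    """Illustrative dirty-object tracker for selective re-integration."""

    def __init__(self):
        self.table = {}  # OID -> layout version at the time of the write

    def record_write(self, oid, version):
        # A write while some replica holder is off marks the object dirty.
        self.table[oid] = version

    def reintegrate(self, transfer, rate_limit):
        """Re-integrate dirty objects in OID order, at most `rate_limit`
        objects per call, removing each OID once it is transferred."""
        for oid in sorted(self.table)[:rate_limit]:
            transfer(oid)        # copy the object to the re-joined node
            del self.table[oid]  # clean: drop it from the dirty table
        return len(self.table)   # number of objects still dirty
```

Because only tracked objects are transferred, the re-joined node avoids the full-data transfer of plain consistent hashing, and `rate_limit` bounds the background I/O so foreground requests are not starved.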
Implementation
Primary-secondary data placement/replication implemented in Sheepdog
Dirty-data tracking implemented using Redis
Evaluation
3-phase workload test
T: deadline for background re-integration; Rate: data transfer rate for background re-integration
Performance is significantly improved with selective background re-integration; a high rate delays resizing
[Figure: I/O throughput (MB/s) over 600 seconds comparing selective+background re-integration (T=2, 4, 6; Rate=200), selective only, original CH, and no-resizing]
Large-scale Trace Analysis
Use the Cloudera traces; apply our policy and analyze the effect of resizing
[Figure: number of active servers over 250 minutes for the CC-a and CC-b traces, comparing ideal, original CH, primary+aggressive, and primary+background]
Summary
We propose a primary-secondary data placement/replication scheme to provide better elasticity in consistent hashing based data stores
We use a selective background data re-integration technique to reduce the I/O footprint when re-integrating nodes into a cluster
This is the first work studying elasticity for power saving in a consistent hashing based store
Study 2: Reducing Failure-recovery Cost in CH based Store
Fault-tolerance and Self-healing
Replication is used to tolerate failures
When a node fails, a self-healing system recovers the lost data by itself, without administrator intervention
[Figure: keys D1-D3 on a ring of servers 1-6; when server 2 fails, D2's second replica is migrated to server 3 automatically]
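The self-healing step illustrated above can be sketched as follows (an illustrative simplification with hypothetical names, ignoring ring order and placement constraints): for every object that lost a replica on the failed node, copy one surviving replica to a live server that does not already hold the object.

```python
def self_heal(placements, failed_server, live_servers, copy):
    """placements: dict of object id -> mutable list of replica servers.
    copy(oid, src, dst) is the caller-supplied data transfer."""
    for oid, servers in placements.items():
        if failed_server in servers:
            # any surviving replica can serve as the copy source
            source = next(s for s in servers if s != failed_server)
            # first live server not already holding this object
            target = next(s for s in live_servers if s not in servers)
            copy(oid, source, target)
            servers[servers.index(failed_server)] = target
```

Every lost replica triggers a full object transfer, which is why the talk argues that aggressive recovery is expensive and worth delaying when it is safe to do so.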
Motivation
Even though CH is able to self-heal from failures, the cost of recovery is large (data transfers)
But simply delaying self-healing can raise the risk of data loss
Our approach: use a different data layout to delay healing as much as possible, and determine when it is OK to delay self-healing and when it is not
Motivation
Pseudo-random replication has low tolerance for multiple concurrent failures
Losing one server can put data in danger
Primary Replication
Same scheme as the one used in Elastic Consistent Hashing
As long as the primary replicas are available, there is no risk of losing data
Data Recovery Strategy
Aggressive recovery: as soon as a node fails, recovery starts transferring data
Lazy recovery: as long as a node failure does not incur much risk of losing data, data transfer is delayed
We need a metric to quantify the risk of losing data
Determining the Recovery Strategy
Minimum Replication Level (MRL): the smallest number of replicas that any data object may have
A larger MRL means more failures can be tolerated
Set a threshold on MRL; when MRL drops below the threshold, aggressive recovery is used
Measuring MRL in CH
MRL can be easily calculated in a consistent hashing based data store
[Figure: four example cases on a 10-server ring with primary servers, secondary servers, and uncommitted (u) vs. committed (c) failed nodes: (1) servers 5, 6, and 10 fail, MRL=2, lazy recovery; (2) servers 4, 6, and 10 fail, MRL=1, aggressive recovery; (3) servers 4, 6, and 10 fail but two are committed, MRL=3, lazy recovery; (4) server 3 fails, MRL=3, but aggressive recovery is used]
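The MRL computation and the threshold rule can be sketched as follows. This is an illustrative reconstruction under stated assumptions (all names are hypothetical; based on the figure's last case, a lost primary is treated as triggering aggressive recovery regardless of MRL):

```python
def mrl(placements, failed):
    """Minimum Replication Level: the smallest number of surviving
    replicas over all objects. placements maps object id -> list of
    replica servers; failed is the set of failed servers."""
    return min(
        sum(1 for s in servers if s not in failed)
        for servers in placements.values()
    )

def choose_strategy(placements, failed, threshold=2, primaries=frozenset()):
    """Lazy recovery while MRL stays at or above the threshold and no
    primary replica holder has failed; aggressive otherwise."""
    if failed & primaries:
        return "aggressive"
    return "lazy" if mrl(placements, failed) >= threshold else "aggressive"
```

With the primary layout, MRL reduces to a simple count over the ring, so the system can cheaply decide after each failure whether data transfer can be deferred.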
Analysis with MSR Trace
MSR trace: a 1-week I/O trace from Microsoft Research servers
Insert recovery periods into the trace under the two recovery strategies
[Figure: IOPS over 200 hours for the MSR trace with recovery periods marked, under aggressive recovery vs. lazy recovery]
Evaluation
Simulate primary-secondary replication and lazy recovery within libch-placement, a consistent hashing library
Failures are generated using a Weibull distribution
The simulated failure and recovery data are inserted into the MSR trace and replayed on a Sheepdog client
The primary + lazy recovery strategy improves I/O performance when a failure occurs
[Figure: I/O rate (MB/s) around failures in the MSR trace (hours 111-117 and 124-130), comparing primary-secondary and random replication]
Summary
We leverage the primary-secondary replication scheme, replacing the random replication scheme, to tolerate multiple concurrent failures
We use the MRL metric to determine the risk of data loss and choose the data recovery strategy
With our replication scheme and recovery strategy, the I/O footprint after a node failure is significantly reduced
Conclusion
Consistent hashing based stores are promising but have limited functionality
We provide some initial insight into how to enhance consistent hashing to offer better functionality that matters in modern data stores, such as fault tolerance and elasticity
There is much more to be explored
Questions!
Welcome to visit our websites for more details.
DISCL lab: http://discl.cs.ttu.edu/
Personal site: https://sites.google.com/site/harvesonxie/