Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017



2 Outline. General introduction. Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation). Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation). Conclusion.

3 Big Data Storage. Growing number of data-intensive (big data) applications. Large data volumes (hundreds of TBs, PBs, even EBs), with thousands of CPUs accessing the data. Cluster computers (supercomputers, data centers, cloud infrastructure). 1 PB = 1,000,000,000 MB; 1 EB = 1,000,000,000,000 MB (10^12 MB). [Figure: a cluster computer.]

4 Big Data Examples. Science: Large Hadron Collider (LHC), 1 PB of data per second, 15 PB of filtered data per year, 160 PB of disk. Search engine: Yahoo uses 1,500 nodes for 5 PB.

5 Scalability of Storage. To store a large volume of data, the scalability of the data store software is critical. Scalability means the performance improvement achieved by increasing the number of servers. Popular systems like the Hadoop Distributed File System (HDFS) scale to 10,000 nodes, but performance hits a bottleneck at the metadata servers.

6 Metadata Server Bottleneck. With many data nodes (DNs), HDFS has a performance bottleneck at the name-node. The name-node needs very large capacity to store metadata. Querying and updating the name-node from many concurrent clients degrades performance.

7 Getting Rid of Metadata Server. [Diagram: data ID=1 -> data node, node ID=101.] Consistent hashing: use a hash function to map data to DNs. [Diagram: data ID=1 -> hash function -> node ID=101.] No need to update a metadata server. Much smaller memory footprint. 10X increase in scale (Ceph).

8 Consistent Hashing. [Figure: keys D1, D2 and D3 are hashed to partitions on a ring of servers. First panel: server 1 holds D1, server 2 holds D2, server 3 holds D3. Second panel, with server 4 added: server 4 holds D1, server 2 holds D2, server 3 holds D3, server 1 holds nothing.]
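
The mapping on this slide can be illustrated with a minimal sketch. It is not the implementation used in the talk: the `HashRing` class, the use of MD5, the virtual-node count and the server names are all assumptions made for illustration. It shows that when a server joins, the only keys that change owner are those that move to the new server.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent hashing ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server contributes `vnodes` points to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        """Return the owner of the first ring point clockwise from the key."""
        i = bisect.bisect_right(self._points, _hash(key))
        return self._owner[self._points[i % len(self._points)]]


ring = HashRing(["server1", "server2", "server3"])
keys = [f"D{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("server4")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every key that changed owner moved to the new server; nothing else moved.
print(len(moved), all(after[k] == "server4" for k in moved))
```

No metadata server is consulted in `lookup`: any client that knows the server list can compute the owner locally.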

9 Challenges with CH. Modern large-scale data store challenges: scalability, manageability, performance, power consumption, fault tolerance. We observe and investigate two problems with CH, in terms of power consumption and fault tolerance.

10 Outline. General introduction. Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation). Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation). Conclusion.

11 Background: Elastic Data Store for Power Saving. Elasticity: the ability to resize the storage cluster as the workload varies (more servers mean better performance but higher power consumption). Benefits: re-use storage nodes for other purposes; save machine hours (operating cost). Most distributed storage systems, such as GFS and HDFS, are not elastic: deactivating servers may make data unavailable.

12 Agility is Important. Agility determines how many machine hours can be saved.

13 Non-elastic Data Layout. A typical pseudo-random data layout, as seen in most CH-based distributed file systems. Almost all servers must be on to ensure 100% availability. No elastic resizing capability.

14 Elastic Data Layout. General rule: take advantage of replication. Always keep the first (primary) replicas on. The other replicas can be activated on demand.

15 Primary Server Layout. Peak write performance: N/3 (same as non-elastic). Limited scaling to N/3 only.

16 Equal-work Data Layout

17 Primary-server Layout with CH. Modifies data placement in the original CH so that one replica is always placed on a primary server. To achieve the equal-work layout, the cluster must be configured accordingly. [Figure: hash ring with primary servers (always active), secondary servers (active), secondary servers (inactive) and data objects; the placement walk for each object is annotated with "skip secondary", "skip inactive" and "skip primary".]
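
One way to read the skip annotations on this slide is sketched below. This is an assumption about the placement walk, not the Sheepdog implementation: the `place` function, the MD5 ring hash, the replica count of 3 and the server table are all hypothetical. The walk goes clockwise from the object's hash and keeps exactly one primary plus active secondaries, skipping everything else.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Position of a key or server on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def place(key, servers, replicas=3):
    """Pick replica holders for `key`.

    servers: dict name -> (role, active), role is 'primary' or 'secondary'.
    Returns up to `replicas` servers: exactly one primary and the rest
    active secondaries, in the order they are met on the clockwise walk.
    """
    ring = sorted((ring_hash(name), name) for name in servers)
    start = bisect.bisect_right(ring, (ring_hash(key), ""))
    chosen, have_primary = [], False
    for step in range(len(ring)):
        name = ring[(start + step) % len(ring)][1]
        role, active = servers[name]
        if role == "primary":
            if have_primary:
                continue            # skip primary: one is already chosen
            have_primary = True
            chosen.append(name)
        elif not active:
            continue                # skip inactive secondary
        elif len(chosen) - have_primary < replicas - 1:
            chosen.append(name)     # active secondary, still needed
        # otherwise: skip secondary, enough secondary replicas already
        if have_primary and len(chosen) == replicas:
            break
    return chosen


# Hypothetical cluster: 2 primaries, 4 active and 4 inactive secondaries.
servers = {f"s{i}": ("primary", True) for i in (1, 2)}
servers.update({f"s{i}": ("secondary", True) for i in (3, 4, 5, 6)})
servers.update({f"s{i}": ("secondary", False) for i in (7, 8, 9, 10)})

print(place("D1", servers))
```

Because every object keeps one replica on an always-active primary, the inactive secondaries can be powered down without making any object unavailable.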

18 Equal-work Data Layout. Number of data chunks on a primary server: $v_{\text{primary}} = B/p$. Number of data chunks on the secondary server of rank $i$: $v_{\text{secondary}_i} = B/i$. [Figure: Data distribution. x-axis: rank of server; y-axis: number of data blocks (×10^4). Series: Version 1 (10 active), Version 2 (8 active), Version 3 (10 active), data to migrate.]
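
As a worked example, the sketch below evaluates the two expressions on this slide, read as B/p for each primary and B/i for the secondary of rank i. It assumes B is the number of data blocks and p the number of primary servers; the function name and the sample values are hypothetical.

```python
def equal_work_layout(total_blocks, num_primary, num_servers):
    """Blocks per server by 1-based rank: B/p for primaries, B/i for rank i > p."""
    layout = {}
    for rank in range(1, num_servers + 1):
        if rank <= num_primary:
            layout[rank] = total_blocks / num_primary
        else:
            layout[rank] = total_blocks / rank
    return layout


# Hypothetical cluster: B = 30,000 blocks, p = 2 primaries, 10 servers.
layout = equal_work_layout(total_blocks=30000, num_primary=2, num_servers=10)
for rank, blocks in layout.items():
    print(rank, round(blocks))
```

Under this reading the primaries together hold one full copy of the data, and higher-ranked secondaries hold progressively less, so they are the cheapest to turn off.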

19 Contribution Summary. Primary data placement/replication scheme with consistent hashing. Achieves a primary-secondary data layout for elasticity. Requires only a slight modification to existing consistent hashing. Preserves the properties of consistent hashing.

20 Data Re-integration. After a node is turned off, no data is written to it. When the node joins again, any newly created or modified data may need to be re-integrated to it. However, the data store does not know which data was modified or newly created, so it has to transfer all data that should be placed on the newly joined node.

21 Data Re-integration. Data re-integration incurs many I/O operations and degrades performance when scaling up. 3-phase workload: high load -> low load -> high load. No resizing: 10 servers always on. With resizing: 10 servers -> 2 servers -> 10 servers. [Figure: Original consistent hashing. I/O throughput (MB/s) over time (seconds), with resizing and with no resizing; the ends of phase 1 and phase 2 are marked.]

22 Our Contribution. Selective background re-integration. A dirty table tracks all OIDs that are dirty. When re-integration of an object finishes, its OID is removed from the table. The rate of re-integration is controlled. [Figure: three snapshots (versions 9, 10 and 11) of the node membership table (per-node On/Off state), the dirty table (OID, version) and the data replica headers (OID, version, dirty flag) on primary (always active), secondary (active) and secondary (inactive) servers. As resizing turns nodes back on, dirty objects are re-integrated in dirty-table order and the dirty flag changes from Y to N.]
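
The dirty-table idea on this slide can be sketched as follows. This is an in-memory illustration under stated assumptions, not the Redis-backed implementation from the talk: the `DirtyTable` class, the `transfer` callback and the per-call `budget` (standing in for the controlled re-integration rate) are hypothetical names.

```python
from collections import OrderedDict


class DirtyTable:
    """In-memory stand-in for the dirty table (hypothetical helper)."""

    def __init__(self):
        self._entries = OrderedDict()   # oid -> cluster version at write time

    def mark_dirty(self, oid, version):
        # A rewritten object keeps its place in the re-integration order.
        self._entries[oid] = version

    def reintegrate(self, transfer, budget):
        """Re-integrate at most `budget` objects, oldest entry first.

        `transfer(oid, version)` is a caller-supplied callback that copies
        the object to the re-activated node. Finished OIDs are removed.
        """
        done = []
        while self._entries and len(done) < budget:
            oid, version = self._entries.popitem(last=False)
            transfer(oid, version)
            done.append(oid)
        return done

    def __len__(self):
        return len(self._entries)


table = DirtyTable()
for oid in ("obj1", "obj2", "obj3", "obj4", "obj5"):
    table.mark_dirty(oid, version=9)    # written while some nodes were off

copied = []
first_batch = table.reintegrate(lambda oid, v: copied.append(oid), budget=2)
print(first_batch, len(table))          # ['obj1', 'obj2'] 3
```

Only objects recorded in the table are transferred, and calling `reintegrate` with a small budget per time slice keeps the background traffic from competing with foreground I/O.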

23 Implementation. Primary-secondary data placement/replication is implemented in Sheepdog. Dirty data tracking is implemented using Redis.

24 Evaluation. 3-phase workload test. T: deadline for background re-integration. Rate: data transfer rate for background re-integration. Performance is significantly improved with selective background re-integration. [Figure: I/O throughput (MB/s) over time (seconds), with the ends of phase 1 and phase 2 marked. Series: Sel+backg (T=2, Rate=200), Sel+backg (T=4, Rate=200), Sel+backg (T=6, Rate=200), Selective, Original CH, No-resizing. Annotation: high rate delays resizing.]

25 Large-scale Trace Analysis. Use the Cloudera trace. Apply our policy and analyze the effect of resizing. [Figure: number of servers over time (minutes), one panel for the CC-a trace and one for the CC-b trace. Series in each panel: Ideal, Original CH, Primary+aggressive, Primary+background.]

26 Summary. We propose a primary-secondary data placement/replication scheme to provide better elasticity in consistent hashing based data stores. We use a selective background data re-integration technique to reduce the I/O footprint when re-integrating nodes into a cluster. This is the first work studying elasticity for saving power in consistent hashing based stores.

27 Outline. General introduction. Study 1: Elastic Consistent Hashing based Store (motivation and related work, design, evaluation). Study 2: Reducing Failure-recovery Cost in CH based Store (motivation and related work, design, evaluation). Conclusion.

28 Fault-tolerance and Self-healing. Replication for tolerating failures. When a node fails, a self-healing system can recover lost data by itself without administrator intervention. [Figure: consistent hashing ring with keys D1, D2 and D3, before and after a failure. Server 2 fails and D2's second replica is migrated to server 3 automatically.]

29 Motivation. Even though CH is able to self-heal from failures, the cost of recovery (data transfers) is large. Simply delaying self-healing can make the risk of data loss large. Use a different data layout to delay healing as much as possible. Determine when it is OK to delay self-healing and when it is not.

30 Motivation. Pseudo-random replication has low tolerance to multiple concurrent failures. Losing one server puts data in danger.

31 Primary Replication. Same as the one used in Elastic Consistent Hashing. As long as the primary replicas are available, there is no worry about losing data.

32 Data Recovery Strategy. Aggressive recovery: as soon as a node fails, recovery starts to transfer data. Lazy recovery: as long as a node failure does not incur much risk of losing data, data transfer is delayed. A metric is needed to quantify the risk of losing data.

33 Determine Recovery Strategy. Minimum Replication Level (MRL): the smallest number of replicas that any data object has. A larger MRL means more failures can be tolerated. Set a threshold on MRL; when MRL drops below the threshold, aggressive recovery is used.
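
The MRL rule on this slide can be sketched in a few lines. The placement table, the failed-server sets and the threshold of 2 below are made up for illustration and do not come from the talk's figures.

```python
def minimum_replication_level(placement, failed):
    """Smallest number of surviving replicas over all objects.

    placement: dict object id -> list of servers holding a replica.
    failed: iterable of servers that are currently down.
    """
    failed = set(failed)
    return min(
        sum(1 for server in servers if server not in failed)
        for servers in placement.values()
    )


def choose_recovery(placement, failed, threshold):
    """Aggressive recovery when MRL drops below the threshold, else lazy."""
    mrl = minimum_replication_level(placement, failed)
    return "aggressive" if mrl < threshold else "lazy"


# Hypothetical 3-way replicated placement.
placement = {"D1": [1, 5, 6], "D2": [2, 4, 10], "D3": [3, 6, 10]}

print(choose_recovery(placement, failed={5}, threshold=2))      # lazy (MRL=2)
print(choose_recovery(placement, failed={6, 10}, threshold=2))  # aggressive (MRL=1)
```

In the second call D3 is down to a single replica, so delaying recovery any further would risk losing it.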

34 Measuring MRL in CH. MRL can be easily calculated in a consistent hashing based data store. [Figure: ring examples showing primary servers, secondary servers, data objects (D1, D2, D3), failed primary and failed secondary servers, with failed nodes marked as uncommitted (u) or committed (c). Panel captions: servers 5, 6 and 10 failed, MRL=2, lazy (case 1); servers 4, 6 and 10 failed, MRL=1, aggressive (case 2); servers 4, 6 and 10 failed, MRL=3, lazy (case 3); server 3 failed, MRL=3, aggressive (case 3).]

35 Analysis with MSR Trace. MSR trace: a 1-week I/O trace from Microsoft Research servers. Insert recovery periods into the trace with the two recovery strategies. [Figure: MSR throughput (IOPS) over time (hours) with recovery periods marked, one panel for aggressive recovery and one for lazy recovery.]

36 Evaluation. Simulate primary-secondary replication and lazy recovery within libch-placement, a consistent hashing library. Failures are generated using a Weibull distribution. The simulated failure and recovery data is inserted into the MSR trace and replayed on a Sheepdog client. The primary + lazy recovery strategy improves I/O performance when a failure occurs. [Figure: two panels of I/O rate (MB/s) over time (hours) on the MSR trace with the failure marked, comparing primary-secondary and random replication.]
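
A failure schedule of the kind mentioned on this slide could be generated as sketched below. The slide does not say how the Weibull distribution was parameterized or whether it models inter-failure gaps or node lifetimes; this sketch assumes inter-failure gaps, and the scale and shape values are placeholders, not the ones used in the evaluation.

```python
import random


def failure_times(num_failures, scale_hours, shape, seed=0):
    """Cumulative failure times (hours) with Weibull-distributed gaps.

    scale_hours and shape map to the alpha and beta arguments of
    random.weibullvariate; both values are assumptions for illustration.
    """
    rng = random.Random(seed)           # seeded for a repeatable schedule
    elapsed, times = 0.0, []
    for _ in range(num_failures):
        elapsed += rng.weibullvariate(scale_hours, shape)
        times.append(elapsed)
    return times


# Hypothetical parameters: five failures, 24-hour scale, shape below 1.
schedule = failure_times(num_failures=5, scale_hours=24.0, shape=0.7)
for t in schedule:
    print(f"failure at hour {t:.1f}")
```

Each timestamp can then be used as the point in the trace where a recovery period is inserted.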

37 Summary. We leverage the primary-secondary replication scheme in place of the random replication scheme to tolerate multiple concurrent failures. We use the MRL metric to determine the risk of data loss and the data recovery strategy. With our replication scheme and recovery strategy, the I/O footprint after a node failure is significantly reduced.

38 Conclusion. Consistent hashing based stores are promising but have limited functionality. We provide some initial insight into how to enhance consistent hashing to offer functionality that is important in modern data stores, such as fault tolerance and elasticity. Much more remains to be explored.

39 Questions? You are welcome to visit our website for more details. DISCL lab: Personal site:


Elastic Consistent Hashing for Distributed Storage Systems Wei Xie and Yong Chen Department of Computer Science, Texas Tech University, Lubbock, TX Abstract Elastic distributed storage systems have been