C²: Adaptive Load Balancing for Metadata Server Cluster in Cloud-scale File Systems


Quanqing Xu¹, Rajesh Vellore Arumugam¹, Khai Leong Yong¹, Yonggang Wen², Yew-Soon Ong²
¹ Data Storage Institute, A*STAR, Singapore {Xu Quanqing, Rajesh VA, YONG Khai Leong}@dsi.a-star.edu.sg
² Nanyang Technological University {ygwen, asysong}@ntu.edu.sg

Abstract. In Cloud-scale file systems, balancing the request workload across a metadata server cluster is critical for avoiding performance bottlenecks and improving quality of service. Many good approaches have been proposed for load balancing in distributed file systems. Some of them focus on global namespace balancing, making the metadata distribution across metadata servers as uniform as possible. However, they do not work well under skewed request distributions, which impair load balancing but simultaneously increase the effectiveness of caching and replication. In this paper, we propose Cloud Cache (C²), an adaptive load balancing scheme for the metadata server cluster in Cloud-scale file systems. It combines an adaptive cache diffusion scheme with an adaptive replication scheme to cope with the request load balancing problem, and it can be integrated into existing distributed metadata management approaches to efficiently improve their load balancing performance. Experimental results from trace-driven simulations demonstrate the efficiency and scalability of C².

1 Introduction

Modern Cloud-scale file systems that store EB-scale (10^18 or 2^60 bytes) data [1, 2] separate file data access from metadata transactions to achieve high performance and scalability. EB-scale data is managed with a distributed file system in a data center to support many computations, e.g., the Large Synoptic Survey Telescope, in which there are more than 10^18 files. Data is stored on a storage cluster of numerous servers directly accessed by clients via the network, while metadata is managed separately by a metadata server (MDS) cluster consisting of a few dedicated servers. The dedicated MDS cluster manages the global namespace and the directory hierarchy of the file system, the mapping from files to objects, and the permissions of files and directories. The MDS cluster allows for concurrent data transfers between large numbers of clients and storage servers, and it must provide efficient metadata service performance under demanding workloads, e.g., thousands of clients updating the same directory or accessing the same file.

Compared to the overall data space, the size of metadata is relatively small, typically 0.1% to 1% of the data space⁴, but it is still large in absolute terms in EB-scale file systems, e.g., 1 PB to 10 PB for 1 EB of data. Besides, 50% to 80% of all file system accesses are to metadata [3]. Therefore, in order to achieve high performance and scalability, the MDS cluster architecture must be carefully designed and implemented to avoid potential bottlenecks caused by metadata requests. To efficiently handle the workload generated by a large number of clients, metadata should be properly partitioned so as to evenly distribute metadata traffic across the MDS cluster. At the same time, to deal with changing workloads, a scalable metadata management mechanism [4] is necessary to provide highly efficient metadata performance for mixed workloads generated by tens of thousands of concurrent clients.

Concurrent accesses from a large number of clients to large-scale distributed storage cause request load imbalance among metadata servers and inefficient use of the metadata cache. Distributed caching is a widely deployed technique to handle request load imbalance and reduce request latency, and it is both orthogonal and complementary to the load balancing technique proposed in [5]. Meanwhile, distributed replication is also able to decrease the retrieval latency of metadata items. Our experience yields two insights: 1) replicas of cached metadata items can balance the request workload, and 2) increasing the number of replicas does help handle bursts of workload. In our previous work [5], we considered the storage load problem by balancing metadata storage, but we did not take the request load problem into account. The goal of request load balancing is to assign tasks to nodes in a distributed system so that all available resources are utilized as uniformly as possible. In distributed systems, the typical solutions to rebalance the request workload are: 1) preemptively migrating a heavily requested item from an overloaded node to an underloaded one, or 2) aborting a request at the overloaded node and transferring it to a different node that holds a replica.

Like other distributed services, distributed metadata management faces the following two questions: 1) How to distribute the workload across metadata servers? and 2) How to reduce the retrieval latency of metadata items? The system performance relies upon the answers to these questions. We have to deal with potentially unpredictable shifts in the request workload, e.g., flash crowds [6] or adversarial access patterns such as a denial-of-service attack [7]. An imbalanced load causes long retrieval latencies of metadata items and impairs the system's overall performance. In EB-scale file systems, the performance of distributed metadata management depends critically on distributing metadata items across MDSs so as to balance the request workload. Unfortunately, the optimal metadata placement is likely to change over time because of workload changes and dynamic system membership. Therefore, it is common to periodically calculate a new assignment of metadata items to MDSs, either on demand or at regular intervals as MDS membership changes occur.

In this paper, we propose an adaptive load balancing approach named C² to solve the above problems. We consider how to find an efficient caching and replication scheme that automatically adapts to changing workloads in EB-scale file systems.
By analyzing a running workload of requests to metadata items, it calculates a new load-balancing plan and then migrates metadata items when their request rates exceed the request capacity of the nodes that maintain them.

⁴ dcslab.hanyang.ac.kr/nvramos8/ethanmiller.pdf

The input to our migration plan consists of an initial state of metadata items on virtual nodes, a given load balancing requirement, and the nodes' request capacities. Our goal is to find a migration plan that moves the metadata from the initial state to a final load-balanced state in the minimum number of rounds. Moreover, the overlay network topology and metadata access information are utilized for metadata replication decisions.

The rest of the paper is organized as follows. Section 2 describes the problem definition. The adaptive cache diffusion mechanism is presented in Section 3. Section 4 introduces the adaptive replication scheme in C². In Section 5 we present performance evaluation results of C². Section 6 describes related work. In Section 7 we conclude this paper.

2 Problem Definition

2.1 Traces Analyzed

We analyze three real traces, as shown in Table 1. Microsoft refers to the Microsoft Windows build server production traces [8] from BuildServer0 to BuildServer7 within 24 hours, and its data size is 223.7GB (including access pattern information). Harvard is a research and email NFS trace used by a large Harvard research group [9], and its data size is 158.6GB (including access pattern information). We implemented a metadata crawler that performs a recursive walk of the file system using stat() to extract file/directory metadata. Using the metadata crawler, the Linux trace was fetched from 22 Linux servers in our data center; it is different from and much bigger than the Linux trace in [5]. Its file system metadata size is 4.53GB, and its data size is 3.5TB.

Table 1: Traces

Trace      # of files   Path metadata   Max. length
Harvard    7,936,19     176M            18
Microsoft  7,725,       M               34
Linux      1,271,66     786M

2.2 Load Balancing

A distributed metadata server cluster must guarantee good load balancing so that it can meet its throughput and latency goals, and both partitioning and replication can be combined to make it scalable. It has to balance two kinds of load: 1) storage load and 2) request load. The storage load is static, since each metadata item requires constant storage capacity on its node. Capacity is typically load-balanced using a hashing-based approach [10]. The request load is dynamic, since it arises from handling user queries. Metadata should be distributed as uniformly as possible among nodes, and no node should have to cope with many more query requests than another node. Although some schemes can balance the utilization of storage space, they do not balance the request load, in which hot spots often occur, i.e., some items are requested more than others.

Many real-world workloads have uneven request distributions. Distributed systems typically balance the request load in the following ways. Some systems dynamically move data from overloaded servers to underloaded servers to make the request load uniform. Others rely upon replication, directing queries to underloaded nodes that hold replicas, which substantially improves load balancing [11].

2.3 DROP with Caching

DROP [5] leverages pathname-based locality-preserving hashing (LpH) for metadata distribution and location, avoiding the overhead of hierarchical directory traversal. To access data, a client hashes the pathname of the file with the same LpH function to locate the MDS that contains the metadata of the file, and then contacts that MDS. This makes metadata access extremely efficient, typically involving a single message to a single MDS. While sacrificing only negligible metadata locality, DROP uses an efficient histogram-based dynamic load balancing mechanism to balance the storage load. We can leverage the namespace locality in keys by caching metadata items within the same domain from lookup results, reducing total metadata lookup traffic. DROP maintains namespace locality in metadata placement, so clients do not need to request metadata from many nodes, and repetitive lookups are avoided because of the lookup cache mechanism. A large amount of locality exists in distributed systems, e.g., file access locality in P2P systems [12], and it is the basis for distributed caching techniques.

The metadata server architecture is shown in Figure 1. DROP is an SSD/NVM-based key-value store, where the key is a pathname and the value is its inode information. C² deals with request load balancing: the lookup cache stores metadata items from recent query results, so future query requests that access keys in cached key ranges entirely bypass the lookup step. Clients could also use a lookup cache with DirHash or FileHash, which randomly distribute directories or files across metadata servers according to their pathnames, but it would be less effective since future queries may not request keys in recently accessed key ranges. Cache entries may become stale because of dynamic system membership. DROP falls back to a normal lookup when a metadata item is not found. A stale cache entry does not affect correctness, but it impairs retrieval latency. When a file/directory is updated, DROP is responsible for inserting new versions of its metadata along the entire path to the root. This ensures that each read has a consistent view of the metadata, and it implies that each write must update all the metadata along the full path. When writing temporary files, DROP avoids this overhead with a t-second write-back cache, which also serves as a buffer. Due to this buffer, multiple reads of the same metadata occurring within a t-second window only require it to be retrieved once. Metadata items seen by clients may be stale by up to t seconds because of this cache, but incomplete writes will never be seen.

Fig. 1: Metadata Server Architecture: the system interface, Cloud Cache (C²) with its replication engine, lookup cache and failover policy for request load balancing, and the SSD/NVM-based key-value store (DROP) with a locality-preserving hashing ring and dynamic load balancing for storage load balancing.

2.4 Problem Formulation

Given a set of nodes S = {S_i, i = 1, ..., n}, each storing a subset of the metadata items D = {D_j, j = 1, ..., m}, and a specified set of move operations, each of which specifies which item needs to be moved from one node to another, the question we face and address is how to schedule these move operations. For each metadata item d, there is a subset of source MDSs S_d and target MDSs T_d. In the beginning, only the MDSs in S_d have metadata item d, and all the MDSs in T_d want to receive it. An MDS in T_d becomes a source of item d after it receives item d. Our goal is to find a metadata migration plan using the minimum number of rounds, under the constraint that in each round an MDS takes part in the transfer of only one item, either as a sender or a receiver. This is an NP-hard problem [13].

There are a set of nodes S and a set of metadata items D. Initially, each MDS stores a subset of items. A transfer graph G = (V, E) is built, in which each node represents a virtual node and an edge e = (u, v) represents a metadata item to be moved from node u to node v. Over time, metadata items may be moved to another MDS for load balancing. Note that the transfer graph can be a multi-graph, with multiple edges between two nodes, when multiple metadata items are moved from one node to another.

There are two situations: 1) the request load of an item is smaller than a given request load threshold l_t for a node, and 2) the request load of an item is larger than l_t for a node. The Microsoft trace shows that the hottest file accounts for over 2.5% of total requests, and the combined CDF of the hottest 125 files is close to 90% [8]. This tells us that the hottest file is much more popular than any file outside the hottest 125. Suppose that there are 20 metadata servers, each of which has five virtual nodes, so there are 100 virtual nodes in total. Any virtual node that maintains the hottest file will be overloaded. For the first situation, we can use the adaptive cache diffusion discussed in Section 3, while for the second one, we can use the adaptive replication scheme described in Section 4.
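
As a concrete illustration of the migration plan defined above, the sketch below greedily packs the edges of the transfer multi-graph into rounds so that each virtual node sends or receives at most one item per round. It is an illustrative heuristic under that single constraint, not the exact planner used by C² (which, as noted, faces an NP-hard problem); the node names and item labels are hypothetical.

```python
def schedule_migration_rounds(transfers):
    """Greedily pack item transfers into rounds so that each virtual node takes
    part in at most one transfer per round, either as sender or receiver.

    transfers: list of (source, target, item) edges of the transfer multi-graph.
    Returns a list of rounds; the transfers in one round run in parallel.
    Illustrative heuristic only; optimal round scheduling is NP-hard [13].
    """
    remaining = list(transfers)
    rounds = []
    while remaining:
        busy = set()                 # nodes already transferring in this round
        this_round, deferred = [], []
        for src, dst, item in remaining:
            if src not in busy and dst not in busy:
                this_round.append((src, dst, item))
                busy.update((src, dst))
            else:
                deferred.append((src, dst, item))
        rounds.append(this_round)
        remaining = deferred
    return rounds

# Hypothetical example: two overloaded nodes shedding hot items to two underloaded ones.
plan = schedule_migration_rounds([
    ("A1", "B1", "b"), ("A1", "B1", "d"), ("D1", "B1", "j"), ("D1", "C1", "l"),
])
for i, rnd in enumerate(plan, 1):
    print(f"round {i}: {rnd}")
```

In this toy example the four transfers need three rounds, because B1 can receive only one item per round; a plan that spreads targets across more underloaded nodes converges in fewer rounds.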

3 Adaptive Cache Diffusion

We first present an adaptive cache diffusion approach that leads to low migration overhead and fast convergence. Load-stealing and load-shedding are used to achieve this goal. Cache space is used for retrieval operations of DROP, in which a cached metadata item is placed at a virtual node to accelerate subsequent retrievals. It might be replaced via LRU soon after it is created.

3.1 System Model

A physical metadata server might host a set of virtual nodes N = {n_1, n_2, ..., n_d} with a set of loads L = {l_1, l_2, ..., l_d}. Load is applied to metadata servers via their virtual nodes, i.e., metadata server S has load L_S = Σ_{i=1}^{d} l_i. An MDS is said to be load-balanced when it satisfies Definition 1, i.e., the largest load is less than t² times the smallest load in the DROP system. According to Definition 1, an MDS has an upper target L_u = t·L̄ and a lower target L_l = L̄/t. If an MDS finds itself receiving more load than L_u, it considers itself overloaded. Conversely, it considers itself underloaded if it receives less load than L_l. MDSs may want to operate below their capacities to prevent variations in workload from causing temporary overload.

Definition 1 (MDS i is load-balanced). MDS i is load-balanced if its load satisfies 1/t ≤ L_i/L̄ ≤ t (t ≥ 2), where L̄ is the average load.

File popularity [9] follows Zipf request distributions. The Zipf property of file access patterns is a basic fact of nature: a small number of objects are greatly popular, but there is a long tail of unpopular requests. In a Zipf workload, destinations are ranked by popularity. Zipf's law states that the popularity of the i-th most popular object is proportional to i^(-α), in which α is the Zipf coefficient. Usually, Zipf distributions look linear when plotted on a log-log scale. Figure 2 shows the popularity distribution of file/directory metadata items in the Microsoft and Harvard traces. Like the Internet, the metadata request distribution observed in both traces also follows a Zipf distribution.

Fig. 2: Read and write distribution: (a) Microsoft Windows trace, (b) Harvard trace.
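
To make Definition 1 and the Zipf skew concrete, the sketch below generates a Zipf-distributed request stream, spreads it over 100 virtual nodes (e.g., 20 MDSs with five virtual nodes each) with a naive modulo placement, and flags nodes whose load ratio L_i/L̄ falls outside [1/t, t]. All parameters (10,000 items, 100,000 requests, α = 1.2) are illustrative; this is not the simulator used in Section 5.

```python
import numpy as np

def zipf_popularity(num_items, alpha=1.2):
    """Zipf's law: the i-th most popular item has weight proportional to i^(-alpha)."""
    ranks = np.arange(1, num_items + 1)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

def is_balanced(load, mean_load, t=2.0):
    """Definition 1: a node is load-balanced if 1/t <= L_i / L_bar <= t."""
    ratio = load / mean_load
    return (1.0 / t) <= ratio <= t

rng = np.random.default_rng(0)
num_items, num_nodes = 10_000, 100
popularity = zipf_popularity(num_items)
requests = rng.choice(num_items, size=100_000, p=popularity)

# Naive placement: item -> virtual node by modulo; the node holding the hottest
# item receives a large fraction of all requests and violates Definition 1.
loads = np.bincount(requests % num_nodes, minlength=num_nodes)
mean_load = loads.mean()
imbalanced = [n for n in range(num_nodes) if not is_balanced(loads[n], mean_load)]
print("mean load:", mean_load, "nodes violating Definition 1:", imbalanced)
```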

3.2 Load Shedding

Load-shedding means that an overloaded node attempts to offload requests to one or more underloaded ones, and it is well suited to the DROP MDS cluster. An overloaded node n_1 transfers an item x to another node n_2, and simultaneously creates a redirection pointer to n_2. The item x can also be replicated at n_2, increasing redundancy and allowing n_1 to control how much load is shed. In Section 4, we will explain how to effectively place multiple replicas using a multiple-choice scheme.

There are m metadata items on a node, with a tuple of loads l_1, l_2, ..., l_m and a tuple of probabilities p_1, p_2, ..., p_m. When this node has a cache of size c > 0, the c most frequently requested items all hit the cache of this node, with two tuples of positive numbers l_1, l_2, ..., l_c and p_1, p_2, ..., p_c respectively. Let L' = t·L̄; this node is overloaded if Σ_{i=1}^{c} l_i > L'. Deciding which items to keep and which to reassign to other nodes, so as to minimize metadata migration from this node, can therefore be formulated as a 0-1 Knapsack Problem, which is NP-hard:

maximize  z = Σ_{i=1}^{c} p_i x_i                        (1a)
s.t.      Σ_{i=1}^{c} l_i x_i ≤ L'                       (1b)
          x_i ∈ {0, 1},  i ∈ {1, 2, ..., c}              (1c)

Constraint (1b) ensures that the total load of the metadata items kept on this node does not exceed L'. Constraint (1c) states that each item i is either kept (x_i = 1) or not (x_i = 0).

3.3 Load Stealing

Load-stealing means that an underloaded node n_1 seeks out load to take from one or more overloaded nodes. The load-stealing node finds such a node n_2 and makes a replica of an item x held by n_2, which creates a redirection pointer to n_1 for the item x. A natural idea is to have n_1 attempt to steal metadata items for which n_1 already has a redirection pointer. A metadata item can be placed using multiple choices, and it is associated with one of its r hash locations, which is further explained in Section 4. There are a number of candidate metadata items from other nodes, with two tuples of positive numbers l'_1, l'_2, ..., l'_{c'} and p'_1, p'_2, ..., p'_{c'} respectively. If Σ_{i=1}^{c} l_i < L', this node is load-balanced and can take on some items using its spare cache space of L' − L_e. Determining which items to take from overloaded nodes so as to maximize the cache utilization of this node can likewise be formulated as a 0-1 Knapsack Problem:

maximize  z = Σ_{i=1}^{c'} p'_i x_i                      (2a)
s.t.      Σ_{i=1}^{c'} l'_i x_i ≤ L' − L_e               (2b)
          x_i ∈ {0, 1},  i ∈ {1, 2, ..., c'}             (2c)

where

L_e = Σ_{i=1}^{c} l_i                                    (3)

is the existing load of this node.
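
For concreteness, here is a standard dynamic-programming solution to the 0-1 knapsack that both load shedding and load stealing rely on. The item loads, probabilities and the capacity (a stand-in for L' or L' − L_e) are made up, and the loads are assumed to be integers, so this is a sketch of the decision step rather than C²'s actual implementation.

```python
def knapsack_01(loads, probs, capacity):
    """0-1 knapsack: pick the subset of items to keep (load shedding) or to take
    on (load stealing) so that their total load fits within `capacity` while
    their total request probability is maximized. Loads are non-negative ints."""
    n = len(loads)
    best = [0.0] * (capacity + 1)
    keep = [[False] * (capacity + 1) for _ in range(n)]
    for i in range(n):
        for c in range(capacity, loads[i] - 1, -1):
            cand = best[c - loads[i]] + probs[i]
            if cand > best[c]:
                best[c] = cand
                keep[i][c] = True
    kept, c = [], capacity
    for i in range(n - 1, -1, -1):      # back-track to recover the chosen set
        if keep[i][c]:
            kept.append(i)
            c -= loads[i]
    return sorted(kept)

# Load shedding on an overloaded node with capacity L' = 10 (hypothetical values):
item_loads = [6, 4, 3, 2]               # per-item request load
item_probs = [0.40, 0.25, 0.20, 0.15]   # per-item share of requests
kept = knapsack_01(item_loads, item_probs, capacity=10)
migrate = [i for i in range(len(item_loads)) if i not in kept]
print("keep items", kept, "-> migrate items", migrate)   # keep [0, 1], migrate [2, 3]
```

Load stealing reuses the same routine with the candidate items from overloaded nodes and the spare capacity L' − L_e as the knapsack capacity.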

Fig. 3: Cache Diffusion, (a) before and (b) after. Each metadata server has only one virtual node for illustration.

3.4 Traffic Control

During load balancing, a metadata item may be migrated multiple times. DROP uses metadata pointers to minimize the metadata migration overhead. For a metadata pointer, a node retrieves the metadata only when it has held the pointer for longer than the pointer's stabilization time. Using metadata pointers only temporarily hurts metadata locality while the load is being balanced. Besides reducing load balancing overhead, pointers can also make writes succeed even when the target node is at capacity, since they can be used to divert metadata items from heavy nodes to light nodes. The node at full capacity will eventually shed some load when balancing the load, causing only temporary additional indirection. Suppose that a node X is heavily loaded, and a node Y takes some items of X to reduce some of X's load. Now X must transfer some of its metadata items to Y. Instead of having X immediately shed some of its metadata items to Y, Y initially maintains metadata pointers to X. Later Y transfers the pointers to Z, and Z ultimately retrieves the actual metadata from X and deletes the pointers.

Figure 3 gives an example of cache diffusion. Before cache diffusion there is load imbalance, as shown in Figure 3(a): there are four nodes, where A1 and D1 are overloaded, while B1 and C1 are underloaded. After running our cache diffusion approach, we obtain good load balancing, as shown in Figure 3(b), where the loads of the four nodes all lie in [L̄/t, t·L̄] (t = 2). The migrated items are found via the routing tables of the nodes that are responsible for the items.

4 Adaptive Replication Scheme

We propose a novel metadata replication mechanism to further balance the request workload by placing multiple replicas of popular metadata items on different nodes. In DROP, a ZooKeeper-based linearizable consistency mechanism proposed in [14] keeps metadata consistent among MDSs.

4.1 Random Node Selection

We first present an effective random node selection strategy to achieve coarse load balancing.

Let h denote a hash function that maps virtual nodes onto the ring, and let H = {h_1, h_2, ..., h_k} denote a set of hash functions mapping metadata items onto the ring. The number of replicas r is calculated as r = f/θ, where f is the access frequency of a metadata item and θ is a given threshold. An item x is inserted as a primary replica using the hash function h, and its r − 1 replicas are placed on nodes selected using the k hash functions. Lookups are initiated to find the nodes associated with each of these k hash values by calculating h_1(x), h_2(x), ..., h_k(x). According to the mapping given by h, the k lookups can be executed in parallel to find the virtual nodes n_1, n_2, ..., n_k in charge of these hash values. After querying the loads of these nodes, the underloaded ones are chosen. To decrease the overhead of searching for additional nodes, redirection pointers are used. In addition to storing replicas at the r − 1 underloaded nodes N_u^{r−1}, the other candidate nodes {S − N_u^{r−1}} store a pointer x → N_u^{r−1}. To search for the item x, a single query is performed by choosing a hash function h_j at random in an effort to locate one of the nodes in N_u^{r−1}. If n_j does not have x, n_j forwards the query request using its pointer x → N_u^{r−1}. Query requests therefore take at most one extra step, which is needed with probability (k − r + 1)/k if h_j is chosen uniformly at random from the k choices. This incurs the overhead of maintaining the additional pointers, but the cost of storing actual items and any associated computation dominates that of the stored pointers. In addition, we need to decide how to select r − 1 nodes from N_u on which to place x's replicas.

4.2 Topology-aware Replica Placement

To select the r − 1 underloaded nodes N_u^{r−1}, we first consider the network topological characteristics of nodes so that we place the replicas of an item on nodes topologically adjacent to the node in charge of the item in DROP. In this way, we reduce network bandwidth consumption and query latency while achieving better load balancing. We employ an effective topology-aware replica placement scheme by introducing a technique that discovers the topological information of nodes. The key is how to represent and maintain the network topology information so that the topologically close nodes are easily discovered for a given node. The distributed binning scheme [15] is a simple approach for this purpose. For example, there is a topology table for node 7, as shown in Figure 4. The landmark ordering information is employed as part of the node identification information. Three landmark nodes L_1, L_2 and L_3 are used, and the link latencies from node 7 to the three landmark nodes fall within [0, 2), [2, 8), and greater than 8 ms, respectively. Nodes with the same or similar ordering information are topologically close, e.g., node 7:11 is topologically closer to node 26:12 than to node 124:212, meaning that the link latency to node 26:12 is much smaller than to node 124:212. For each entry in the table, the first item is the order information, and the second consists of several records, each of which includes a node ID and its workload. An entry with a tuple [o, (id, w), ] records the workloads that came from the nodes with the common order information o in the past given period.
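
A minimal sketch of this topology-aware selection is shown below. It bins measured landmark latencies into the order string described above, and then ranks candidate nodes by how many landmark bins they share with the node in charge of the item, breaking ties by lower load. The bin boundaries, node IDs, latencies and loads are all invented for illustration; the real scheme follows the distributed binning technique of [15] rather than this exact code.

```python
# Illustrative latency bins (ms), assuming three landmark nodes as in the example.
BINS = [2, 8]

def latency_bin(latency_ms):
    """Map a landmark latency to its bin index: 0 for [0, 2), 1 for [2, 8), 2 otherwise."""
    for i, bound in enumerate(BINS):
        if latency_ms < bound:
            return i
    return len(BINS)

def landmark_order(latencies):
    """Order string for a node, e.g. latencies (1, 5, 12) -> '012'."""
    return "".join(str(latency_bin(l)) for l in latencies)

def closeness(order_a, order_b):
    """Topological proximity: number of landmark positions whose bins agree."""
    return sum(a == b for a, b in zip(order_a, order_b))

def pick_replica_nodes(home_order, candidates, r_minus_1):
    """Pick r-1 nodes, preferring topologically close and lightly loaded candidates."""
    ranked = sorted(candidates,
                    key=lambda n: (-closeness(home_order, n["order"]), n["load"]))
    return [n["id"] for n in ranked[:r_minus_1]]

# Hypothetical candidate nodes (IDs, latencies and loads are made up).
candidates = [
    {"id": 125, "order": landmark_order((1, 4, 10)), "load": 120},
    {"id": 256, "order": landmark_order((1, 6, 15)), "load": 200},
    {"id": 124, "order": landmark_order((12, 1, 3)), "load": 60},
]
home = landmark_order((1, 5, 12))                     # node in charge of the hot item
print(pick_replica_nodes(home, candidates, r_minus_1=2))   # -> [125, 256]
```
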
To choose a node with adequate workload capacity to store a replica, the node n_1 in charge of a popular item contacts those candidate nodes by sending a message. The nodes selected to store the replicas reply with their order information and estimated workload. Meanwhile, direct links to the replicas are created on node n_1.

Fig. 4: A sample topology table on node 7:11, with entries of the form (MetadataId, Order, (NodeID, Workload)). When three replicas of MetadataId 635 with access count 45 are to be placed in DROP, the chosen nodes are 125, 256 and 558.

4.3 Directory-based Replica Diffusion

We present an efficient directory-based replica diffusion technique in DROP. A replica, as a copy of a cached metadata item, is placed in the DROP overlay by its insertion operation. DROP stores directories of pointers to metadata item replicas that are stored on virtual nodes, but their locations are not related to the structure of the locality-preserving hashing. When a node has a metadata item whose request count exceeds a given threshold, it creates a directory for the item, chooses r − 1 virtual nodes with the topology-aware replica placement scheme, stores the item replicas at those nodes, and records them in its directory. When the directory receives a request for the item, it returns directory entries pointing to the individual replicas of the item in a single response message. The directory node monitors the request rate for the item to determine whether a new replica should be created. When the request rate reaches a given threshold, the directory node creates a new replica and updates the list of pointers to replicas of the item. In chain-based replica diffusion, by contrast, the r replicas of an item are placed on its primary node and its r − 1 followers. In both replica diffusion techniques, a node has to serve a request if it holds a replica of the requested metadata item. In chain-based replica diffusion, a node pushes out a replica of the item one overlay hop closer to the source node of the last request if the request rate has exceeded its capacity; this also offloads some of the demand onto more nodes that serve requests. Compared to chain-based replica diffusion, the directory-based approach has three advantages: 1) faster replica transmission, 2) higher query parallelism, and 3) better load balancing, because the r − 1 nodes are chosen with k random hash functions.

5 Performance Evaluation

In this section, we evaluate the performance of C² using one synthetic workload-based simulation and two detailed trace-driven simulations.

We have developed a detailed event-driven simulator to validate and evaluate our design decisions and choices. First, we empirically evaluate the convergence rate of C². Second, we measure the metadata migration overhead of C². Lastly, we measure the scalability of our adaptive replication scheme. We define the Load Factor as Load Factor = Max. Load / Min. Load. Each MDS has five virtual nodes, and the Linux trace follows a Zipf distribution with α = 1.2. All simulation experiments are conducted on a Linux server with four dual-core AMD Opteron(TM) 2.6GHz processors and 8.0GB of RAM, running 64-bit Ubuntu. All experiments are repeated three times, and average results are reported.

5.1 Convergence Rate

In this section, we measure the convergence rate of the three approaches with C² on the three traces. Convergence rate is critically important in distributed systems. It includes two metrics: 1) the number of rounds, which measures how many rounds are needed to reach load balancing, and 2) the time cost, which measures how quickly load balancing is achieved. Figure 5 depicts the number of rounds on the three traces for the metadata server cluster when using all three methods with C². At most five rounds are needed to converge to load balancing on the Linux trace, and at most four rounds on both the Microsoft trace and the Harvard trace. This is because the Linux trace has more metadata items with high access frequencies than the Microsoft and Harvard traces, so the system takes longer to reach load balancing on the Linux trace than on the other two. As shown in Figure 5(b) and Figure 5(c), the system is already in a state of load balancing before running DirHash with C² or FileHash with C² in the cluster of ten MDSs.

Fig. 5: Number of Rounds with Varying the Number of Metadata Servers: (a) Linux trace, (b) Microsoft Windows trace, (c) Harvard trace.

Figure 6 shows how long it takes for the three approaches with C² to reach load balancing. Figure 6(b) illustrates that DROP with C² has a much longer time cost than the other two approaches because the Microsoft trace has only three first-level directories and more pronounced locality than the other two traces. Figure 6(c) shows that DROP with C² is close to the other two approaches in time cost because the Harvard trace has the most first-level directories among the three traces and the worst locality among them. Figure 6(a) demonstrates that DROP with C² has a somewhat longer time cost than the other two approaches. This is because the Linux trace has more first-level directories than the Microsoft trace and far fewer than the Harvard trace, so its locality is worse than that of the Microsoft trace and better than that of the Harvard trace.

Figures 5 and 6 illustrate that the deployed techniques are highly efficient.

Fig. 6: Time Cost (seconds) with Varying the Number of Metadata Servers: (a) Linux trace, (b) Microsoft Windows trace, (c) Harvard trace.

5.2 Migration Overhead

As file and directory metadata items are accessed more or less frequently, the request workload distribution in the system changes, and the system may have to migrate cached metadata to maintain request load balancing. Figure 7 shows that the metadata migration overhead scales well. We perform this experiment as follows. Due to skewed query requests, the MDSs in the DROP system are not in a satisfactory load-balancing state at the beginning. Metadata items in the Microsoft and Harvard traces are accessed according to their real-world historical access information, while those in the Linux trace are requested according to the Zipf-like distribution.

Fig. 7: Migration Overhead with Varying the Number of Metadata Servers: (a) Linux trace, (b) Microsoft Windows trace, (c) Harvard trace.

During this period, the system repeatedly falls out of load balance; whenever that happens, it runs C² to bring itself back to a good load-balancing state. Figure 8 demonstrates that all three methods with C² bring the system to good load balancing. We investigate how many metadata items are migrated from the beginning to the end. Figure 7(a) shows that the three methods cause 24.67%, 17.23% and 16.1% of items to be migrated on average, respectively, and Figure 7(b) illustrates that they cause 27.68%, 21.13% and 23.59% of items to be migrated on average, respectively. Note that Figure 7(c) demonstrates that DirHash with C² causes more items to be migrated than the other two approaches, because it achieves better load balancing than them on the Harvard trace, as shown in Figure 8(c).

As we showed earlier for the convergence rate, C² also tries to reduce the metadata migration overhead at each step by making decisions based on the 0-1 knapsack formulations used in load shedding and load stealing, as shown in Sections 3.2 and 3.3.

Fig. 8: Load Balancing (Load Factor) with Varying the Number of Metadata Servers: (a) Linux trace, (b) Microsoft Windows trace, (c) Harvard trace.

5.3 Replication Overhead

In this section, we examine how many replicas are necessary for heavily requested metadata items to maintain good load balancing. Note that we do not count the main replica. Figure 9 shows that our adaptive replication scheme scales well with different numbers of metadata servers. When running the scheme on the Microsoft trace, the number of replicas varies only slightly as the MDS cluster size increases, with a maximum value of 2.96 and a minimum value of 1.0. When running the scheme on the other two traces, the number of replicas rises somewhat more noticeably as the MDS cluster size increases, but the scalability is still excellent on both traces. The maximum numbers of replicas are 7.6 and 6.68, and the minimum numbers are 3.8 and 3.17, on the Harvard trace and the Linux trace respectively.

Fig. 9: Replicas for frequently accessed metadata items (Harvard, Microsoft and Linux traces).

6 Related Work

In recent years, many load balancing schemes have been proposed for distributed metadata organization and management.

Online Migration. The virtual node approach [10] was proposed to cope with the imbalance of the key distribution caused by the hash function. A number of virtual nodes with random IDs are generated within a physical server, thereby reducing the load imbalance. However, the use of virtual nodes greatly increases the amount of routing metadata in each server, causing more maintenance overhead and increasing the number of hops per lookup. In addition, it does not take item popularity into account. In contrast, the dynamic ID approach uses only a single ID per server [16]. The load of a server can be adjusted by assigning it a more suitable ID in the namespace. However, this solution requires IDs to be reassigned to maintain load balancing, resulting in high overhead due to transferring items and updating overlay links. Our motivation for studying the online migration problem lies in how to efficiently migrate metadata within an MDS cluster in large-scale storage systems.

Caching and Replication. Hot spots are handled with caches that store popular items in the network, and query requests are resolved whenever cache hits occur along the lookup path. Solutions addressing the uneven popularity of objects are based on caching and replication. Path replication replicates objects on all nodes along the full lookup path, e.g., DHash [17] replicates objects on k successors with caching along the lookup path. In the k-choice load balancing approach [11], multiple hashes are employed to generate a set of IDs for a node, and one of the IDs is chosen at join time to minimize the differences between capacity and load for itself and the other nodes affected by its join. Unfortunately, the last several hops of a lookup are precisely the ones that can least be optimized [18]. Furthermore, a fixed number of replicas does not work well since the request load is dynamic: resources may be wasted if the number is set too high, while the replicas may not be enough to support a high request load if it is set too low. Our replication-based solution is similar to the k-choice approach, but with a flexible number of replicas and a topology-aware replica placement strategy.

7 Conclusions

In this paper, we present an adaptive load balancing approach named C² that handles request load balancing for the metadata server cluster in Cloud-scale file systems. C² exploits the tension between load balancing on the one hand and caching and replication on the other, i.e., skewed request distributions impair load balancing but simultaneously raise the effectiveness of caching and replication. The cache serves the most popular items, ensuring that the nodes maintaining them do not become performance bottlenecks, and multiple hash functions are exploited to place multiple replicas, thereby balancing the load caused by the most frequently accessed items. Our approach enables the system to achieve good load balancing even when the query request workload is heavily skewed. Extensive simulation results show significant improvements in maintaining a more balanced distributed metadata management system, leading to excellent scalability and performance.

Acknowledgement

The authors would like to thank Garth Gibson from Carnegie Mellon University and Jun Wang from the University of Central Florida for their help. This work is supported by the A*STAR Thematic Strategic Research Programme (TSRP) Grant.

References

1. Raicu, I., Foster, I.T., Beckman, P.: Making a case for distributed file systems at Exascale. In: LSAP (2011)
2. Amer, A., Long, D., Schwarz, T.: Reliability Challenges for Storing Exabytes. In: International Conference on Computing, Networking and Communications (ICNC), CNC Workshop (2014)
3. Ousterhout, J.K., Costa, H.D., Harrison, D., Kunze, J.A., Kupfer, M.D., Thompson, J.G.: A Trace-Driven Analysis of the UNIX 4.2 BSD File System. In: SOSP (1985)
4. Hua, Y., Zhu, Y., Jiang, H., Feng, D., Tian, L.: Supporting Scalable and Adaptive Metadata Management in Ultralarge-Scale File Systems. IEEE Trans. Parallel Distrib. Syst. 22(4) (2011)
5. Xu, Q., Arumugam, R.V., Yong, K.L., Mahadevan, S.: DROP: Facilitating distributed metadata management in EB-scale storage systems. In: MSST (2013)
6. Wendell, P., Freedman, M.J.: Going viral: flash crowds in an open CDN. In: Internet Measurement Conference (2011)
7. Fan, B., Lim, H., Andersen, D.G., Kaminsky, M.: Small cache, big effect: provable load balancing for randomly partitioned cluster services. In: SoCC (2011)
8. Kavalanekar, S., Worthington, B.L., Zhang, Q., Sharda, V.: Characterization of storage workload traces from production Windows Servers. In: IISWC (2008)
9. Ellard, D., Ledlie, J., Malkani, P., Seltzer, M.I.: Passive NFS Tracing of Email and Research Workloads. In: FAST (2003)
10. Stoica, I., Morris, R., Karger, D.R., Kaashoek, M.F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. In: SIGCOMM (2001)
11. Ledlie, J., Seltzer, M.I.: Distributed, secure load balancing with skew, heterogeneity and churn. In: INFOCOM (2005)
12. Gummadi, P.K., Dunn, R.J., Saroiu, S., Gribble, S.D., Levy, H.M., Zahorjan, J.: Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In: SOSP (2003)
13. Khuller, S., Kim, Y.A., Wan, Y.C.J.: Algorithms for data migration with cloning. In: PODS (2003)
14. Xu, Q., Arumugam, R., Yong, K.L., Mahadevan, S.: Efficient and Scalable Metadata Management in EB-scale File Systems. IEEE Transactions on Parallel and Distributed Systems 99(PrePrints) (2013)
15. Ratnasamy, S., Handley, M., Karp, R.M., Shenker, S.: Topologically-Aware Overlay Construction and Server Selection. In: INFOCOM (2002)
16. Naor, M., Wieder, U.: Novel architectures for P2P applications: The continuous-discrete approach. ACM Transactions on Algorithms 3(3) (2007)
17. Dabek, F., Kaashoek, M.F., Karger, D.R., Morris, R., Stoica, I.: Wide-Area Cooperative Storage with CFS. In: SOSP (2001)
18. Gopalakrishnan, V., Silaghi, B.D., Bhattacharjee, B., Keleher, P.J.: Adaptive Replication in Peer-to-Peer Systems. In: ICDCS (2004)


Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Understanding Chord Performance

Understanding Chord Performance CS68 Course Project Understanding Chord Performance and Topology-aware Overlay Construction for Chord Li Zhuang(zl@cs), Feng Zhou(zf@cs) Abstract We studied performance of the Chord scalable lookup system

More information

SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD

SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD 1 SHAIK SHAHEENA, 2 SD. AFZAL AHMAD, 3 DR.PRAVEEN SHAM 1 PG SCHOLAR,CSE(CN), QUBA ENGINEERING COLLEGE & TECHNOLOGY, NELLORE 2 ASSOCIATE PROFESSOR, CSE, QUBA ENGINEERING COLLEGE & TECHNOLOGY, NELLORE 3

More information

Replication, Load Balancing and Efficient Range Query Processing in DHTs

Replication, Load Balancing and Efficient Range Query Processing in DHTs Replication, Load Balancing and Efficient Range Query Processing in DHTs Theoni Pitoura, Nikos Ntarmos, and Peter Triantafillou R.A. Computer Technology Institute and Computer Engineering & Informatics

More information

Exploiting Communities for Enhancing Lookup Performance in Structured P2P Systems

Exploiting Communities for Enhancing Lookup Performance in Structured P2P Systems Exploiting Communities for Enhancing Lookup Performance in Structured P2P Systems H. M. N. Dilum Bandara and Anura P. Jayasumana Colorado State University Anura.Jayasumana@ColoState.edu Contribution Community-aware

More information

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage Arijit Khan Nanyang Technological University (NTU), Singapore Gustavo Segovia ETH Zurich, Switzerland Donald Kossmann Microsoft

More information

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design A Peer-to-Peer Architecture to Enable Versatile Lookup System Design Vivek Sawant Jasleen Kaur University of North Carolina at Chapel Hill, Chapel Hill, NC, USA vivek, jasleen @cs.unc.edu Abstract The

More information

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE for for March 10, 2006 Agenda for Peer-to-Peer Sytems Initial approaches to Their Limitations CAN - Applications of CAN Design Details Benefits for Distributed and a decentralized architecture No centralized

More information

Content Overlays (continued) Nick Feamster CS 7260 March 26, 2007

Content Overlays (continued) Nick Feamster CS 7260 March 26, 2007 Content Overlays (continued) Nick Feamster CS 7260 March 26, 2007 Administrivia Quiz date Remaining lectures Interim report PS 3 Out Friday, 1-2 problems 2 Structured vs. Unstructured Overlays Structured

More information

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Zhi Li, Prasant Mohapatra, and Chen-Nee Chuah University of California, Davis, CA 95616, USA {lizhi, prasant}@cs.ucdavis.edu,

More information

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems 1 Motivation from the File Systems World The App needs to know the path /home/user/my pictures/ The Filesystem

More information

Proactive Caching for Better than Single-Hop Lookup Performance

Proactive Caching for Better than Single-Hop Lookup Performance Proactive Caching for Better than Single-Hop Lookup Performance Venugopalan Ramasubramanian and Emin Gün Sirer Cornell University, Ithaca NY 4853 ramasv, egs @cs.cornell.edu Abstract High lookup latencies

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

EAD: An Efficient and Adaptive Decentralized File Replication Algorithm in P2P File Sharing Systems

EAD: An Efficient and Adaptive Decentralized File Replication Algorithm in P2P File Sharing Systems EAD: An Efficient and Adaptive Decentralized File Replication Algorithm in P2P File Sharing Systems Haiying Shen Department of Computer Science and Computer Engineering University of Arkansas, Fayetteville,

More information

Web Caching and Content Delivery

Web Caching and Content Delivery Web Caching and Content Delivery Caching for a Better Web Performance is a major concern in the Web Proxy caching is the most widely used method to improve Web performance Duplicate requests to the same

More information

Small-World Overlay P2P Networks: Construction and Handling Dynamic Flash Crowd

Small-World Overlay P2P Networks: Construction and Handling Dynamic Flash Crowd Small-World Overlay P2P Networks: Construction and Handling Dynamic Flash Crowd Ken Y.K. Hui John C. S. Lui David K.Y. Yau Dept. of Computer Science & Engineering Computer Science Department The Chinese

More information

Athens University of Economics and Business. Dept. of Informatics

Athens University of Economics and Business. Dept. of Informatics Athens University of Economics and Business Athens University of Economics and Business Dept. of Informatics B.Sc. Thesis Project report: Implementation of the PASTRY Distributed Hash Table lookup service

More information

Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network

Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network Mouna Kacimi Max-Planck Institut fur Informatik 66123 Saarbrucken, Germany mkacimi@mpi-inf.mpg.de ABSTRACT Several caching

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4 Distributed Web Crawling over DHTs Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4 Search Today Search Index Crawl What s Wrong? Users have a limited search interface Today s web is dynamic and

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

08 Distributed Hash Tables

08 Distributed Hash Tables 08 Distributed Hash Tables 2/59 Chord Lookup Algorithm Properties Interface: lookup(key) IP address Efficient: O(log N) messages per lookup N is the total number of servers Scalable: O(log N) state per

More information

Content Overlays. Nick Feamster CS 7260 March 12, 2007

Content Overlays. Nick Feamster CS 7260 March 12, 2007 Content Overlays Nick Feamster CS 7260 March 12, 2007 Content Overlays Distributed content storage and retrieval Two primary approaches: Structured overlay Unstructured overlay Today s paper: Chord Not

More information

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline System Architectural Design Issues Centralized Architectures Application

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

Semester Thesis on Chord/CFS: Towards Compatibility with Firewalls and a Keyword Search

Semester Thesis on Chord/CFS: Towards Compatibility with Firewalls and a Keyword Search Semester Thesis on Chord/CFS: Towards Compatibility with Firewalls and a Keyword Search David Baer Student of Computer Science Dept. of Computer Science Swiss Federal Institute of Technology (ETH) ETH-Zentrum,

More information

ACONTENT discovery system (CDS) is a distributed

ACONTENT discovery system (CDS) is a distributed 54 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 22, NO. 1, JANUARY 2004 Design and Evaluation of a Distributed Scalable Content Discovery System Jun Gao and Peter Steenkiste, Senior Member, IEEE

More information

Back-Up Chord: Chord Ring Recovery Protocol for P2P File Sharing over MANETs

Back-Up Chord: Chord Ring Recovery Protocol for P2P File Sharing over MANETs Back-Up Chord: Chord Ring Recovery Protocol for P2P File Sharing over MANETs Hong-Jong Jeong, Dongkyun Kim, Jeomki Song, Byung-yeub Kim, and Jeong-Su Park Department of Computer Engineering, Kyungpook

More information

Load Balancing in Peer-to-Peer Systems

Load Balancing in Peer-to-Peer Systems Load Balancing in Peer-to-Peer Systems Haiying Shen Computer Science and Computer Engineering Department University of Arkansas Fayetteville, Arkansas, USA 2 Abstract Structured peer-to-peer (P2P) overlay

More information

Peer-to-Peer Systems and Distributed Hash Tables

Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables CS 240: Computing Systems and Concurrency Lecture 8 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Selected

More information

SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers

SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers 2011 31st International Conference on Distributed Computing Systems Workshops SHHC: A Scalable Hybrid Hash Cluster for Cloud Backup Services in Data Centers Lei Xu, Jian Hu, Stephen Mkandawire and Hong

More information

Three Layer Hierarchical Model for Chord

Three Layer Hierarchical Model for Chord Three Layer Hierarchical Model for Chord Waqas A. Imtiaz, Shimul Shil, A.K.M Mahfuzur Rahman Abstract Increasing popularity of decentralized Peer-to-Peer (P2P) architecture emphasizes on the need to come

More information

Effects of Churn on Structured P2P Overlay Networks

Effects of Churn on Structured P2P Overlay Networks International Conference on Automation, Control, Engineering and Computer Science (ACECS'14) Proceedings - Copyright IPCO-214, pp.164-17 ISSN 2356-568 Effects of Churn on Structured P2P Overlay Networks

More information

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables COS 418: Distributed Systems Lecture 7 Today 1. Peer-to-Peer Systems Napster, Gnutella, BitTorrent, challenges 2. Distributed Hash Tables 3. The Chord Lookup

More information

Routing Table Construction Method Solely Based on Query Flows for Structured Overlays

Routing Table Construction Method Solely Based on Query Flows for Structured Overlays Routing Table Construction Method Solely Based on Query Flows for Structured Overlays Yasuhiro Ando, Hiroya Nagao, Takehiro Miyao and Kazuyuki Shudo Tokyo Institute of Technology Abstract In structured

More information

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University)

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University) Background: Memory Caching Two orders of magnitude more reads than writes

More information

Towards Efficient Load Balancing in Structured P2P Systems

Towards Efficient Load Balancing in Structured P2P Systems Towards Efficient Load Balancing in Structured P2P Systems Yingwu Zhu Department of ECECS University of Cincinnati zhuy@ececs.uc.edu Yiming Hu Department of ECECS University of Cincinnati yhu@ececs.uc.edu

More information

VFS Interceptor: Dynamically Tracing File System Operations in real. environments

VFS Interceptor: Dynamically Tracing File System Operations in real. environments VFS Interceptor: Dynamically Tracing File System Operations in real environments Yang Wang, Jiwu Shu, Wei Xue, Mao Xue Department of Computer Science and Technology, Tsinghua University iodine01@mails.tsinghua.edu.cn,

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Today l Basic distributed file systems l Two classical examples Next time l Naming things xkdc Distributed File Systems " A DFS supports network-wide sharing of files and devices

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

Architectures for Distributed Systems

Architectures for Distributed Systems Distributed Systems and Middleware 2013 2: Architectures Architectures for Distributed Systems Components A distributed system consists of components Each component has well-defined interface, can be replaced

More information

Ceph: A Scalable, High-Performance Distributed File System

Ceph: A Scalable, High-Performance Distributed File System Ceph: A Scalable, High-Performance Distributed File System S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long Presented by Philip Snowberger Department of Computer Science and Engineering University

More information

Query Processing Over Peer-To-Peer Data Sharing Systems

Query Processing Over Peer-To-Peer Data Sharing Systems Query Processing Over Peer-To-Peer Data Sharing Systems O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California at Santa Barbara odsahin, abhishek, agrawal,

More information

TSP-Chord: An Improved Chord Model with Physical Topology Awareness

TSP-Chord: An Improved Chord Model with Physical Topology Awareness 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore TSP-Chord: An Improved Chord Model with Physical Topology Awareness

More information

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks Global Journal of Computer Science and Technology Network, Web & Security Volume 13 Issue 2 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017 Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store Wei Xie TTU CS Department Seminar, 3/7/2017 1 Outline General introduction Study 1: Elastic Consistent Hashing based Store

More information