A Low-Cost Multi-Failure Resilient Replication Scheme for High Data Availability in Cloud Storage
IEEE 23rd International Conference on High Performance Computing

A Low-Cost Multi-Failure Resilient Replication Scheme for High Data Availability in Cloud Storage

Jinwei Liu* and Haiying Shen†
*Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
†Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA
jinweil@clemson.edu, hs6ms@virginia.edu

Abstract: Replication is a common approach to enhance data availability in cloud storage systems. Previously proposed replication schemes cannot effectively handle both correlated and non-correlated machine failures while increasing data availability with limited resources. The schemes for correlated machine failures must create a constant number of replicas for each data object, which neglects diverse data popularities and cannot use the resources to maximize the expected data availability. Also, previous schemes neglect the consistency maintenance cost and the storage cost caused by replication. It is critical for cloud providers to maximize data availability (and hence minimize SLA violations) while minimizing the cost caused by replication, in order to maximize revenue. In this paper, we build a nonlinear integer programming model to maximize data availability under both types of failures and minimize the cost caused by replication. Based on the model's solution for the replication degree of each data object, we propose a low-cost multi-failure resilient replication scheme (MRR). MRR can effectively handle both correlated and non-correlated machine failures, considers data popularities to enhance data availability, and also tries to minimize consistency maintenance cost and storage cost. Extensive numerical results from trace parameters and experiments on real-world Amazon S3 show that MRR achieves high data availability, low data loss probability, and low consistency maintenance and storage cost compared to previous replication schemes.
I. INTRODUCTION

Datacenter storage systems (e.g., the Hadoop Distributed File System (HDFS) [1], RAMCloud [2], the Google File System (GFS) [3] and Windows Azure [4]) are an important component of cloud datacenters, especially for data-intensive services in this big data era. It is critical for cloud providers to reduce violations of Service Level Agreements (SLAs) for tenants, in order to provide high quality-of-service and avoid the associated penalties. For example, a typical SLA required from services that use Amazon's Dynamo storage system is that 99.9% of read and write requests execute within 300 ms [5]. Therefore, a storage system must guarantee data availability for different applications to comply with SLAs, which is, however, a non-trivial task. In many cloud services such as Amazon, Google App Engine and Windows Azure, the server failure probability is in the range of [1%, 10%] and the corresponding data availability is in the range of [99.9%, 99.99%] [6]. Data availability is usually influenced by data loss, which is typically caused by machine failures; correlated and non-correlated machine failures coexist in storage systems. The former means multiple nodes fail (nearly) simultaneously, while the latter means nodes fail individually. Correlated machine failures often occur in large-scale storage systems [7]-[9] due to common failure causes (e.g., cluster power outages, workload-triggered software bug manifestations, Denial-of-Service attacks). For example, in cluster power outages [10], [11], a non-negligible percentage (0.5%-1%) of nodes [10], [11] do not come back to life after power is restored [12]. Correlated machine failures cause significant data loss [12], which has been documented by Yahoo! [10], LinkedIn [11] and Facebook [13]. Non-correlated machine failures [8] are caused by reasons such as different hardware/software compositions and configurations, and varying network access ability. Replication is a common approach to reduce data loss and enhance data availability.
The consideration of data popularity in replication is critical to maximizing the expected data availability.¹ Due to highly skewed data popularity distributions [16], popular data with considerably higher request frequency generates heavier load on the nodes, which may lead to data unavailability at times. On the other hand, unpopular data with few requests wastes the resources for storing and maintaining replicas. Thus, an effective replication scheme must consider the diverse data popularities to use the limited resources (e.g., memory) [17], [18] to increase expected data availability. Though creating more replicas for data objects improves data availability, it comes with higher consistency maintenance cost [19]-[21] and storage cost [22], [23]. In consistency maintenance, when data is updated, an update is sent to its replica nodes. In addition to the number of replicas, the consistency maintenance cost and storage cost are also affected by the geographic distance and the storage media (e.g., disk, SSD, EBS), respectively. To reduce the cost, we need to minimize the number of replicas, limit the geographic distance between replica nodes, and replicate the additional replicas beyond the required storage of applications to less expensive storage media while still providing SLA guarantees. Therefore, effective replication methods must maximize expected data availability (by considering both correlated and non-correlated machine failures and data popularity) and minimize the cost caused by replication (i.e., consistency maintenance cost and storage cost) [2], [24].

Random replication has been widely used in datacenter storage systems including HDFS, RAMCloud, GFS and Windows Azure. These systems partition each data object into chunks (i.e., partitions), each of which is replicated to a constant number of randomly selected nodes on different racks. Though random replication can handle non-correlated machine failures, it cannot handle correlated machine failures well. The reason is that any combination of a certain number of nodes would form a set of nodes for storing all replicas of a data chunk, and data loss occurs if all nodes in the set experience failures simultaneously [12]. Previously proposed data replication methods cannot effectively handle both correlated and non-correlated machine failures and utilize the limited resources to increase data availability simultaneously [7], [8], [12], [25]. Although many methods have been proposed to improve data availability [7], [8], [19], [25]-[27], they do not concurrently consider data popularities and the cost caused by replication to increase expected data availability and decrease cost. As shown in Figure 1, a key problem here is how to achieve an optimal tradeoff between increasing data availability and reducing the cost caused by replication, with the ultimate goal of SLA compliance and revenue maximization for cloud service providers.

Figure 1: Achieving an optimal tradeoff among data availability, storage cost and consistency maintenance cost.

To address this problem, we propose a low-cost multi-failure resilient replication scheme (MRR), which targets distributed data-intensive services and storage systems. MRR is more advantageous than previous replication schemes (e.g., Random Replication, Copyset Replication [12]) in that it can handle both correlated and non-correlated machine failures, and also jointly considers data popularity and the cost caused by replication. We summarize the contributions of this work below. We build a nonlinear integer programming (NLIP) model that aims to maximize expected data availability (under both types of failures) with the consideration of popularities, and to reduce the cost caused by replication.

¹ We use a service-centric availability metric: the proportion of all successful service requests over all requests [8], [14], [15].
We then use Lagrange multipliers to derive a solution: the replication degree (i.e., the number of replicas) of each data object. Based on the solution, we propose MRR to handle both correlated and non-correlated machine failures. MRR partitions all nodes into different groups, with each group responsible for replicating all data objects with the same replication degree. Then, MRR partitions the nodes in a group into different sets [12]; each set consists of nodes from different datacenters within a certain geographic distance. The replicas of a chunk are stored on the nodes in one set, so data loss occurs only if these nodes experience failures simultaneously. Each data chunk is replicated to the corresponding storage medium based on its priority. MRR reduces the frequency of data loss by reducing the number of sets in a group, i.e., the probability that all nodes in a set fail. We have conducted extensive numerical analysis based on trace parameters and experiments on Amazon S3 [28] to compare MRR with other state-of-the-art replication schemes. Results show that MRR achieves high data availability, low data loss probability, and low storage and consistency maintenance cost.

The remainder of this paper is organized as follows. Section II reviews the related work. Section III presents our NLIP model. Section IV presents the design of MRR. Section V presents the numerical and experimental results. Section VI concludes this paper with remarks on our future work.

II. RELATED WORK

Many methods have been proposed to handle non-correlated or correlated machine failures. Zhong et al. [8] assumed independent machine failures, and proposed a model that achieves high expected service availability. However, this model does not consider correlated machine failures and hence cannot handle such failures. Nath et al. [7] identified a set of design principles that system builders can use to tolerate correlated failures. Cidon et al.
proposed Copyset Replication [12] and Tiered Replication [29] to reduce the frequency of data loss caused by correlated machine failures by limiting the replica nodes of any chunk to a single copyset. These methods for correlated machine failures do not consider data popularity to minimize the data loss probability under correlated and non-correlated machine failures. Unlike Copyset Replication and Tiered Replication, our proposed MRR reduces the probability of data loss caused by both correlated and non-correlated machine failures while considering data popularity; it derives diverse replication degrees for data objects with different popularities and thus increases the overall data object availability by creating more replicas for popular data objects and fewer for unpopular data objects. Moreover, MRR replicates data objects while also aiming to reduce the consistency maintenance cost and storage cost, which is critical for cloud providers to maximize revenue. Thomson et al. [25] presented CalvinFS, a scalable distributed file system, which horizontally partitions and replicates file system metadata across a shared-nothing cluster of servers. However, CalvinFS cannot handle both correlated and non-correlated machine failures while minimizing the cost (e.g., consistency maintenance cost and storage cost) caused by replication. The previously proposed methods above cannot handle both correlated and non-correlated machine failures while utilizing the limited resources to increase data availability. They neglect the data popularity existing in current cloud storage systems and thus cannot fully utilize the resources to increase data availability. Also, they do not concurrently consider minimizing the consistency maintenance cost and storage cost caused by replication.

There is a large body of work on enhancing data availability for distributed storage systems. Abu-Libdeh et al. [26] presented a replication protocol for datacenter services to provide strong consistency and high availability. Bonvin et al.
[19] proposed a self-managed key-value store that dynamically allocates the resources of a data cloud to several applications in a cost-efficient and fair manner. Zhang et al. [30] presented Mojim to provide reliability and availability in large-scale storage systems. Mojim uses a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains the secondary backup nodes with weakly consistent copies of data. Yu et al. [31] used a fixed number of replicas for every data object, and showed that the assignment of object replicas to machines plays a dramatic role in the availability of multi-object operations. However, most of these previous works neglect data object popularities when determining the number of replicas for each object, and thus cannot satisfy the demands of popular data objects or fully utilize the limited resources. Although the works in [8], [16] consider object popularities, they do not give a solution for handling correlated machine failures, which is a key issue in achieving high availability in today's cloud datacenters [7]. In addition, they do not consider different storage medium prices (for reducing storage cost) or the geographic distance in selecting nodes to store replicas, both of which affect the total cost of the storage system. Motivated by the problems in the existing works, we propose MRR, which can effectively handle both correlated and non-correlated machine failures and also considers the factors indicated in the introduction to maximize data availability and reduce the consistency maintenance cost and storage cost.

III. NONLINEAR INTEGER PROGRAMMING MODEL FOR MRR

A cloud storage system usually serves multiple applications simultaneously. Without loss of generality, we assume there are n applications in the cloud storage, and each application has m data objects [19]. Each data object belongs to only one application and is split into M partitions [19]; the data object is lost if any of its partitions is lost [12]. The replication degree of a data object is the number of replicas of the data object. We use D_ij to denote the j-th data object belonging to application i (denoted by a_i). Let d_ij be the replication degree (number of replicas) of D_ij. The replicas of a partition of a data object are placed on a set of d_ij different nodes. Suppose there are N servers in the cloud. For analytical tractability, we assume a physical node (i.e., a server) belongs to a rack, a room, a datacenter, a country and a continent.
To easily identify the geographic location of a physical node, each physical node has a label in the form continent-country-datacenter-room-rack-server [19].

Problem Statement: Given data object request probabilities, data object sizes, space constraints for different applications, and thresholds for request failure probability, consistency maintenance cost and storage cost, what is the optimal replication degree for each data object, so that the request failure probability, consistency maintenance cost and storage cost are minimized under both correlated and non-correlated machine failures? Then, how should the chunk replicas of data objects be assigned to the nodes to achieve these objectives?

A. Data Availability Maximization

We maximize data availability by considering both the machine failure probability and the data object request probability (i.e., popularity). We minimize the data loss probability under both correlated and non-correlated machine failures. Different data objects may have different popularities, and hence different demands on the number of replicas to ensure data availability.

1) Correlated Machine Failures: To reduce data loss under correlated machine failures, we adopt the fault-tolerant set (FTS) concept from [12]: a distinct set of servers holding all copies of a given partition. Each FTS is a single unit of failure; that is, when an FTS fails, at least one data object is lost. Under correlated machine failures, the probability of data loss increases as the number of FTSs increases, because the probability that the failed servers constitute at least one FTS increases (it is more likely that the failed servers include at least one FTS). So we minimize the probability of data loss by minimizing the number of FTSs. To this end, we could statically assign each server to a single FTS, and constrain the replicas of a partition to a randomly selected preassigned FTS.
That is, we first place the primary replica (i.e., the original copy) on a randomly selected server, and then place the secondary replicas on all the nodes in this server's FTS. However, this leads to a load imbalance problem in which some servers become overloaded while others are underloaded due to data storage. On the other hand, random replication, which randomly chooses a replica holder, has a higher probability of data loss because almost every newly replicated partition creates a distinct FTS. To achieve a tradeoff, we use the approach in Copyset Replication [12], which assigns a server to a limited number of FTSs rather than a single FTS. Due to the overlap of FTSs, after the primary replica is stored on a randomly selected server, the FTS options that can store the secondary replicas are those that hold the server. The servers in these sets are the options that can be used to store the secondary replicas. The number of these servers is called the scatter width (denoted by S). For example, if the FTSs that hold server 1 are {1,2,3} and {1,4,5}, then when the primary replica is stored at server 1, the secondary replicas can be randomly placed either on servers 2 and 3 or on servers 4 and 5. In this case the scatter width equals 4. The probability of correlated machine failures equals the ratio of the number of FTSs over the maximum number of sets:

#FTSs / max{#sets}    (1)

To reduce the probability of data loss, Copyset Replication makes the replication scheme satisfy the two conditions below:
Condition 1: The FTSs overlap with each other by at most one server.
Condition 2: The FTSs cover all the servers equally.
Copyset Replication uses the Balanced Incomplete Block Design (BIBD)-based method for any scatter width to create FTSs that satisfy both conditions 1 and 2 and minimize #FTSs.² We define a pair (X, A), where X is the set of servers in the system (i.e., X = {1,2,...,N}), and A is the collection of all FTSs in the system.
Let N, R and λ be positive integers such that N > R ≥ 2. A BIBD for (N, R, λ) satisfies the following properties: each FTS contains exactly R servers, and each pair of servers is contained in exactly λ FTSs. When λ = 1, the BIBD provides an optimal design for minimizing the number of FTSs for scatter width S = N − 1. In this case, condition 1 ensures that each FTS increases the scatter width of its servers by exactly R − 1. For a given scatter width S, Copyset Replication creates S/(R−1) · N/R FTSs. Then, the failure probability under correlated machine failures equals:

p_cor = (S/(R−1) · N/R) / C(N, R)    (2)

where C(N, R) denotes the binomial coefficient. In random replication, the number of FTSs created is [12]:

#FTSs = N · C(S, R−1), if S < N/2;    #FTSs ≈ C(N, R), if S ≥ N/2    (3)

Based on Equs. (1) and (3), we can calculate the probability of correlated machine failures under random replication. We give an example to illustrate the process of generating FTSs. Consider a storage system with N = 12 servers, FTS size R = 3, and scatter width S = 4. Using the BIBD-based method, the following FTSs are created to achieve a smaller number of FTSs:

B_1 = {0,1,2}, B_2 = {3,4,10}, B_3 = {6,7,8}, B_4 = {9,10,11}, B_5 = {0,3,8}, B_6 = {1,4,7}, B_7 = {2,5,11}, B_8 = {5,6,9}.

For random replication, #FTSs is 72. Hence, the probability of data loss caused by correlated machine failures is:

#FTSs / C(N, R) = 72 / C(12, 3) = 0.327

However, for the BIBD-based method from Copyset Replication, the number of FTSs is 8, and the probability of data loss caused by correlated machine failures is much smaller:

#FTSs / C(N, R) = 8 / C(12, 3) = 0.036

There are many methods that can be used for constructing BIBDs, but no single method can create optimal BIBDs for any combination of N and R [32], [33]. Copyset Replication combines BIBD and random permutations to generate a non-optimal design that can accommodate any scatter width. By relaxing the constraint on the number of overlapping nodes from an exact number to at most that number, the probability of successfully generating FTSs increases. Since the scatter width should be much smaller than the number of nodes, this method is likely to generate FTSs with at most one overlapping node [12]. Because MRR tries to minimize the replication degrees of data objects in order to limit the consistency maintenance cost and storage cost, the replication degrees are neither large nor widely varying. Smaller replication degrees generate smaller R values. Although it is sometimes not possible to generate the optimal BIBDs, BIBDs can be used as a useful benchmark to measure how close a scheme is to the optimal design for specific values of the scatter width [12].

² In implementation, Copyset Replication uses random permutations to create FTSs.
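The counts and probabilities in the example above can be checked numerically. The following is a minimal sketch of Equs. (1)-(3); the function names are our own, not the paper's:

```python
from math import comb

def copyset_ftss(n, s, r):
    """Number of FTSs created by the BIBD-based design: S/(R-1) * N/R
    (the numerator of Equ. (2))."""
    return s / (r - 1) * n / r

def random_ftss(n, s, r):
    """Number of FTSs created by random replication when S < N/2 (Equ. (3))."""
    return n * comb(s, r - 1)

def loss_probability(num_ftss, n, r):
    """Equ. (1): #FTSs divided by the maximum number of R-server sets C(N, R)."""
    return num_ftss / comb(n, r)

# The paper's example: N = 12 servers, FTS size R = 3, scatter width S = 4.
N, R, S = 12, 3, 4
print(random_ftss(N, S, R))                                     # 72 FTSs
print(round(loss_probability(random_ftss(N, S, R), N, R), 3))   # 0.327
print(copyset_ftss(N, S, R))                                    # 8.0 FTSs
print(round(loss_probability(copyset_ftss(N, S, R), N, R), 3))  # 0.036
```

This reproduces the 72-versus-8 FTS comparison and both data loss probabilities from the example.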
2) Non-correlated Machine Failures: Under non-correlated machine failures, the failures of machines are statistically independent of each other, and they can be categorized into uniform and nonuniform machine failures. In the scenario of uniform machine failures, each machine has the same probability of failing, denoted by p (0 < p < 1). A data object is lost if any of its chunks (partitions) is lost, and a chunk is lost only if all replicas of the chunk are lost. Hence, the expected probability of data loss under uniform machine failures is:

p_uni = (Σ_{i=1}^{n} Σ_{j=1}^{m} M · p^{d_ij}) / (m · n)    (4)

where M is the number of partitions (chunks) of each data object, n is the number of applications, and m is the number of data objects in each application. In the scenario of nonuniform machine failures, each machine fails with a different probability. Assume the replicas of each data object are placed on machines with no concern for individual machine failures. Let p_1, ..., p_N be the failure probabilities of the N servers in the cloud, respectively. Based on [8], the expected data object failure probability is the same as that on uniform-failure machines with per-machine failure probability equal to Σ_{i=1}^{N} p_i / N. We use p_non to denote the expected probability of data loss under nonuniform machine failures. Then,

p_non = (Σ_{i=1}^{n} Σ_{j=1}^{m} M · (Σ_{k=1}^{N} p_k / N)^{d_ij}) / (m · n)    (5)

We introduce a method to evaluate a data object's popularity below. Different types of applications are characterized by the popular time periods of their data objects. For example, the videos in social network applications are usually popular for 3 months [34], while the news on a news website is usually popular for several days. We rank the applications based on their types and use b_{a_i} to denote the rank; a higher b_{a_i} means that the application has longer popular time periods for its data objects. Assume time is split into epochs (a fixed period of time).
To avoid creating and deleting replicas for data objects that are popular only for a short time period, we consider both a data object's application rank (b_{a_i}) and its expected visit frequency, i.e., the number of visits in an epoch (v_ij) [16], [19], [35], [36], to determine the popularity function of a data object (φ(·)):

φ_ij(·) = α · b_{a_i} + β · v_ij    (6)

where α and β are the weights for the two factors. Let r_ij be the probability of requesting data object D_ij. For single-object requests, Σ_{i=1}^{n} Σ_{j=1}^{m} r_ij = 1 after normalization. The request probability of a data object is the same as that of each partition of this data object. The request probability is proportional to the popularity of the data object (φ(·)) [8], [37]:

r_ij = k_1 · φ_ij(·)    (7)

where k_1 is a coefficient.

3) Availability Maximization Problem Formulation: In a cloud system with the co-existence of correlated and non-correlated (uniform and nonuniform) machine failures, data is lost if any type of failure occurs. Recall that p_cor (Formula (2)), p_uni (Formula (4)) and p_non (Formula (5)) are the probabilities of a data object loss caused by correlated machine failures, uniform machine failures, and nonuniform machine failures (at epoch t), respectively. So the probability of data loss caused by correlated and non-correlated machine failures is

P_f = w_1 · p_cor + w_2 · p_uni + w_3 · p_non    (Σ_{i=1}^{3} w_i = 1)    (8)

where w_1, w_2 and w_3 are the probabilities of the occurrence of each type of machine failure, respectively. We maximize the data availability by minimizing the expected request failure probability, taking into account that different data objects have different popularities. To achieve this goal, we present an NLIP model with multiple constraints to determine the replication degree of each data object. Given the popularities and sizes of data objects, our goal is to find the optimal replication degree of each data object such that the expected request failure probability (denoted by P_r) is minimized.
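As a concrete illustration of Equs. (4), (5) and (8), the sketch below computes the expected loss probabilities for a toy set of replication degrees. The function names and the toy numbers are ours, not the paper's:

```python
def p_uniform(p, degrees, M):
    """Equ. (4): expected data-loss probability under uniform failures,
    averaged over all m*n objects. `degrees` lists the replication degrees
    d_ij; each machine fails independently with probability p."""
    return sum(M * p ** d for d in degrees) / len(degrees)

def p_nonuniform(p_list, degrees, M):
    """Equ. (5): nonuniform failures reduce to the uniform case with the
    per-machine probability replaced by the mean sum(p_k)/N [8]."""
    return p_uniform(sum(p_list) / len(p_list), degrees, M)

def p_failure(p_cor, p_uni, p_non, w):
    """Equ. (8): mixture over the three failure types; the weights sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9
    return w[0] * p_cor + w[1] * p_uni + w[2] * p_non

# Two objects with degrees d = 2 and d = 3, M = 2 chunks each, p = 0.1:
print(round(p_uniform(0.1, [2, 3], 2), 6))  # (2*0.01 + 2*0.001) / 2 = 0.011
```

Note how a higher replication degree drives an object's loss term down geometrically, which is why the model can trade replicas of unpopular objects for replicas of popular ones.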
Each application purchases a certain storage space in public clouds. Also, in private clouds, different applications may be assigned different storage spaces based on their priorities. Thus, the storage space constraint for each application is necessary and important due to the limited, precious storage media. Thus, we formalize our problem as the following constrained optimization problem:

min P_r = Σ_{i=1}^{n} Σ_{j=1}^{m} r_ij · M · (P_f)^{d_ij}    (9)
s.t. Σ_{j=1}^{m} s_ij · d_ij ≤ K_i  (i = 1, ..., n)

where K_i (i = 1, ..., n) is the space capacity of application i, and s_ij is the size of data object D_ij. Σ_{j=1}^{m} s_ij · d_ij is the total storage consumption of all the data objects belonging to application i. The optimization objective is to minimize the expected request failure probability, and the constraint ensures that the storage consumption of an application does not exceed its space capacity.

B. Consistency Maintenance Cost Minimization

In this section, we formulate the problem of minimizing the consistency maintenance cost caused by data replication. We use C_c to denote the total consistency maintenance cost of all replicas in the cloud storage system. We first introduce how to calculate C_c. Previous work [38] indicates that the data object write overhead is linear in the number of data object replicas. Then, the consistency maintenance cost of a partition can be approximated as the product of the number of replicas of the partition and the average communication cost parameter (denoted by δ_com) [19], i.e., d_ij · δ_com. Hence, the total consistency maintenance cost of all data objects is

C_c = Σ_{i=1}^{n} Σ_{j=1}^{m} (M · d_ij) · δ_com    (10)

The fixed average communication cost (δ_com) can be calculated as in [39]:

δ_com = E_{i,j}[ s_u · dis(S_i, S_j) · σ ]    (11)

where s_u is the average update message size, dis(S_i, S_j) is the geographic distance between the server holding the original copy, S_i, and a replica server S_j (an expectation over all possible distances between the primary server and replica servers, calculated from a probabilistic perspective), and σ is the average communication cost per unit of distance. We adopt the method in [19] to calculate the geographic distance between servers. Specifically, it uses a 6-bit number to represent the distance between servers.
Each bit corresponds to a location part of a server, i.e., continent, country, datacenter, room, rack and server. Starting with the most significant bit (i.e., the leftmost bit), each location part of the two servers is compared one by one to compute the geo-similarity between them. The corresponding bit is set to 1 if the location parts are equivalent; otherwise it is set to 0. Once a bit is set to 0, all less significant bits are automatically set to 0. For example, suppose S_i and S_j are two arbitrary servers, and the geo-similarity between them is represented as 111000 (the bits correspond, from left to right, to continent, country, datacenter, room, rack and server). This indicates that S_i and S_j are in the same datacenter but not in the same room. The geographic distance is obtained by applying a binary NOT operation to the geo-similarity. That is,

NOT 111000 = 000111 = 7 (decimal)

Thus, the optimization problem for consistency maintenance cost can be formulated as follows:

min C_c = Σ_{i=1}^{n} Σ_{j=1}^{m} (M · d_ij) · δ_com    (12)
s.t. Σ_{j=1}^{m} s_ij · d_ij ≤ K_i  (i = 1, ..., n)

C. Storage Cost Minimization

Different applications require different storage media (e.g., disk, SSD, EBS) to store data, and different media have different costs per unit size. For example, in the Amazon web cloud service, the prices for EBS and SSD are $0.44 and $0.7 per hour, respectively. After satisfying the applications' storage requirements on different storage media, we need to decide the storage media for their additional replicas for enhanced data availability. Different applications have different SLAs with different associated penalties. The applications with higher SLA violation penalties should have higher priority in meeting their SLA requirements in order to reduce the associated penalties. Therefore, the additional replicas of data objects of higher-priority applications should be stored on a faster storage medium. On the other hand, in order to save storage cost, the additional replicas of data objects of lower-priority applications should be stored on a slower and less expensive storage medium.
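Returning to the 6-bit geo-similarity encoding used for δ_com above, it can be sketched as follows; the server label tuples are hypothetical examples of the continent-country-datacenter-room-rack-server labels from Section III:

```python
PARTS = ("continent", "country", "datacenter", "room", "rack", "server")

def geo_similarity(label_a, label_b):
    """Compare the six location parts from most to least significant;
    after the first mismatch, all remaining bits are forced to 0."""
    bits = 0
    for x, y in zip(label_a, label_b):
        if x != y:
            break
        bits += 1
    return ("1" * bits).ljust(len(PARTS), "0")

def geo_distance(label_a, label_b):
    """Binary NOT of the geo-similarity string, read as a decimal number."""
    flipped = geo_similarity(label_a, label_b).translate(str.maketrans("01", "10"))
    return int(flipped, 2)

# Same datacenter but different rooms: similarity 111000, distance 7.
a = ("EU", "DE", "dc1", "room1", "rack4", "srv17")
b = ("EU", "DE", "dc1", "room2", "rack9", "srv03")
print(geo_similarity(a, b), geo_distance(a, b))  # 111000 7
```

The prefix structure of the labels is what makes the "once a bit is 0, all lower bits are 0" rule natural: a mismatch at a coarse level (e.g., datacenter) makes finer levels incomparable.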
We use b_{p_i} to denote application i's priority; a higher b_{p_i} means a higher priority. Since faster storage media are more costly, for data objects of high-priority applications, we hope to store more data partitions on faster storage media in order to satisfy more requests per unit time and hence increase data availability [19]. Thus, to determine the storage medium for storing additional replicas, we define a priority function ψ(·) of a data object based on its application priority and size [16]:

ψ_ij(·) = γ · b_{p_i} + η / s_ij    (13)

where γ and η are the weights for the application priority and the data object's size. The size of a data object can change due to write operations. The priority values are classified into a number of levels, and each level corresponds to a storage medium. Thus, the unit storage cost of a data object equals the unit cost of its mapped storage medium, denoted by c_{s_ij}, which is proportional to the priority value of the data object (ψ_ij(·)):

c_{s_ij} = k_2 · ψ_ij(·)    (14)

where k_2 is a coefficient. The storage cost of a data object is related to its storage medium, its size and the number of its replicas. Different storage media have different unit costs, and different data objects have different sizes and replication degrees. We minimize the storage cost by minimizing the expected cost of the storage media. We examine expected storage cost minimization by determining the replication degree for each data object. To achieve this goal, we present an NLIP approach with multiple constraints that can be used to obtain a policy. Given the applications of data objects and the data object sizes, our goal is to find the replication degree for each data object that minimizes the storage cost. Hence, we formalize our problem as the following constrained optimization problem:

min C_s = Σ_{i=1}^{n} Σ_{j=1}^{m} (s_ij · d_ij · c_{s_ij} · T)    (15)
s.t. Σ_{j=1}^{m} s_ij · d_ij ≤ K_i  (i = 1, ..., n)

where T is the time duration of an epoch.
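A minimal sketch of Equs. (13)-(15) follows. The weights γ, η and k_2 default to 1 here purely for illustration; the paper leaves them as tunable parameters:

```python
def priority(b_p, size, gamma=1.0, eta=1.0):
    """Equ. (13): psi = gamma * b_p + eta / size. A higher value maps the
    object's additional replicas to a faster (more expensive) medium."""
    return gamma * b_p + eta / size

def unit_cost(psi, k2=1.0):
    """Equ. (14): the unit storage cost is proportional to the priority value."""
    return k2 * psi

def storage_cost(objects, T):
    """Equ. (15) objective: sum over objects of s_ij * d_ij * c_ij * T.
    `objects` is a list of (size, replication_degree, unit_cost) tuples."""
    return sum(s * d * c * T for s, d, c in objects)

# One object of size 4 with degree 3 and unit cost 0.5, over an epoch T = 2:
print(storage_cost([(4, 3, 0.5)], 2))  # 4 * 3 * 0.5 * 2 = 12.0
```

Note the η / s_ij term: between two objects of equal application priority, the smaller one gets the higher priority value, i.e., the faster medium.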
6 D. Proble Forulation and Solution We consider additional three threshold constraints for request failure probability (Pr th ), consistency aintenance cost (Cc th ) and storage cost (Cs th ). The probability of expected request failure ust be no larger than a threshold Pr th. This constraint is used to ensure that the expected request failure probability is not beyond a threshold, which serves the goal of achieving high data availability. The constraint on the consistency aintenance cost is to ensure that the consistency aintenance cost in one epoch is no larger than a threshold, Cc th. The storage cost in one epoch is no ore than threshold Cs th. The constraints on both consistency aintenance cost and storage cost are to ensure that the cost caused by replication is at a low level, which akes the syste ore efficient and econoical. By cobining Forulas (9), (12) and (15) and the additional constraints, we can build a NLIP odel for the global optiization proble as follows: in { P r +C c +C s } = (r ij M (P f ) d ij ) (16) + (M d ij ) δ co + (s ij d ij c sij T ) s.t. s ij d ij K i (i = 1,...,n) (17) r ij M (P f ) d ij Pr th ( < r ij < 1) (18) (M d ij ) δ co Cc th (19) (s ij d ij c sij T ) Cs th (2) The decision variables are data objects s replication degrees, and they ust be positive integers in practice. The objective is to iniize the request failure probability, the consistency aintenance cost and storage cost. The optiization constraints are used to ensure that the space capacity of data objects belonging to each application is not exceeded, the probability of expected request failure is no ore than a threshold P th r, the consistency aintenance cost and storage cost in one epoch are no ore than thresholds Cc th and Cs th, respectively. Theore 3.1. The relaxed NLIP optiizatioodel is convex. Proof Inequs. (17), (19), (2) are linear inequalities, and they define convex regions. The exponential function P d ij f in Inequ. 
(18) is convex, and a sum of convex functions (i.e., Σ_i Σ_j r_ij M (P_f)^{d_ij}) is a convex function. Thus, constraint (18) defines a convex set. All the constraints define convex regions, and the intersection of convex sets is a convex set. Thus, the feasible region of the optimization problem is convex. Hence the relaxed NLIP optimization problem is convex.

For analytical tractability, we first relax the problem to a real-number optimization problem in which d_11,...,d_nm are real numbers, and derive the solution of the real-number optimization problem. Then, we use integer rounding to obtain the solution for practical use. Specifically, we adopt the approach from [8] and round each d_ij to its nearest integer, while all d_ij smaller than 1 are rounded to 1. After relaxing the problem to a real-number optimization problem, the optimal solution always uses up all the available storage space (i.e., K_i) of each application [8].^3 Thus, we have

Σ_j s_ij d_ij = K_i   (i = 1,...,n)   (21)

where K_i is the space capacity of application i. For the real-number optimization problem, we use Lagrange multipliers to derive the solution. Since there are n + 3 constraints, we use the multipliers λ_1, λ_2,...,λ_{n+3} to combine the constraints and the optimization goal into the Lagrangian function

Λ(d_11,...,d_1m,...,d_n1,...,d_nm, λ_1, λ_2,...,λ_{n+3})
= Σ_i Σ_j (r_ij M (P_f)^{d_ij} + (M d_ij) δ_co + s_ij d_ij c_sij T)
+ Σ_i λ_i (Σ_j s_ij d_ij - K_i)
+ λ_{n+1} (Σ_i Σ_j r_ij M (P_f)^{d_ij} - Pr_th)
+ λ_{n+2} (Σ_i Σ_j (M d_ij) δ_co - Cc_th)
+ λ_{n+3} (Σ_i Σ_j (s_ij d_ij c_sij T) - Cs_th)   (22)

where K_i = Σ_j s_ij d_ij. A critical value of Λ is achieved only if its gradient is zero. By Theorem 3.1, the relaxed NLIP optimization problem is convex, so the gap between the relaxed problem and its dual problem is zero [4]. Also, the objective function of the NLIP model is differentiable, so the gradients with respect to the multipliers exist. We can therefore obtain the solution of the optimization problem from the Lagrange dual solution.
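As a concrete illustration of this relax-and-round pipeline, the sketch below evaluates the objective of Formula (16) for a candidate solution and then applies the rounding rule adopted from [8]. All numeric defaults (M, P_f, δ_co, etc.) are made-up example values, not the paper's trace-derived settings.

```python
# Hedged sketch: objective of Formula (16) plus the integer-rounding step.
# All numeric defaults are illustrative assumptions, not the paper's values.

def objective(d, r, sizes, unit_costs, M=3, p_f=0.5, delta_co=0.25, T=1.0):
    """P_r + C_c + C_s for replication degrees d (one entry per data object)."""
    P_r = sum(ri * M * p_f ** di for ri, di in zip(r, d))                    # Formula (9)
    C_c = sum(M * di * delta_co for di in d)                                 # Formula (12)
    C_s = sum(si * di * ci * T for si, di, ci in zip(sizes, d, unit_costs))  # Formula (15)
    return P_r + C_c + C_s

def round_degrees(relaxed):
    """Round relaxed real-valued degrees to integers; degrees below 1
    become 1, following the rounding rule adopted from [8]."""
    return [max(1, round(di)) for di in relaxed]
```

Raising a degree d_ij lowers the failure term P_r exponentially but raises C_c and C_s linearly, which is exactly the trade-off the Lagrangian balances.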
Considering that the popularities and importance of data objects usually do not change much within a short period of time, we can choose a larger value for T to avoid computation overhead, and we also use IPOPT [41], [42] (a software library for large-scale nonlinear optimization) to solve the large-scale nonlinear optimization problem, which makes our scheme more practical.^4

IV. THE REPLICATION SCHEME
In Section III, we calculated the replication degree of each data object, which is the first step of the design of our scheme. The next problem is how to assign the replicas to nodes, which is important for increasing the data availability [9]. Recall that in Section III-A we introduced the BIBD-based file replication in Copyset Replication, which can reduce the data loss probability under correlated machine failures. Although the BIBD-based method can reduce the data loss probability under correlated machine failures, it requires a constant replication degree and cannot be used for replicating data with various replication degrees. Our proposed scheme deals with these problems; we present its details below. Algorithm 1 shows the pseudocode of the replication algorithm. For arbitrary i and j, the replication degree d_ij of data D_ij can be obtained from the NLIP optimization model in Section III. To reduce data loss under both correlated and non-correlated machine failures, our scheme first ranks these replication degrees in ascending order (i.e., d_1,...,d_l). For a given replication degree d_i (i = 1,...,l), it first groups the data objects with replication degree d_i together (denoted by

^3 The storage cost may increase as the storage space increases, but it also depends on the storage media used for storing the data; thus the claim that the space constraint holds with equality does not contradict the main hypothesis (cost-effectiveness) of the paper.
^4 Although some parameters (e.g., data popularity, machine failure rate) are not usually available a priori, we can obtain them from historical data [7], [8], [16].
D_i), and counts the number of data objects with replication degree d_i (denoted by N^r_{D_i}). To handle the problem of varying replication degrees, our scheme partitions all nodes into l groups and then conducts Copyset Replication [12] to assign chunks with replication degree d_i to the i-th node group (i = 1,...,l). For load balance, the number of nodes in each group is proportional to the number of replicas that will be stored in the group. Accordingly, our scheme assigns N · N^r_{D_i} d_i / (Σ_{i=1}^{l} N^r_{D_i} d_i) nodes to data group D_i. It then uses the BIBD-based method to generate FTSs for each group of nodes. Specifically, it determines λ in Section III-A1 based on the load balancing requirement and then uses the BIBD-based method for (N, R, λ), where N is the number of nodes in a node group and R is the replication degree of the group (d_i) (R relates the correlated machine failures to the replication degree).

Algorithm 1: Pseudocode of the replication algorithm.
1 Compute the replication degree for each data object using the NLIP model (Section III)
2 Rank the replication degrees in ascending order d_1,...,d_l
3 for i ← 1 to l do
4   Group data objects with d_i together (D_i)
5   Use the BIBD-based method to generate FTSs; each FTS has nodes from different datacenters but within a certain geographic distance
6   Store each chunk's replicas of data objects with d_i to all nodes in an FTS with d_i
7 return

Table I: Parameters from publicly available data [12]: system (Facebook, HDFS, RAMCloud), chunks per node, cluster size, and scatter width (N-1 for RAMCloud).
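The grouping step and the proportional node split in Algorithm 1 can be sketched as follows. The helper names are ours, and the BIBD-based FTS generation itself is elided.

```python
# Sketch of the grouping step and proportional node split in Algorithm 1.
# Helper names are hypothetical; the BIBD-based FTS generation is elided.
from collections import defaultdict

def group_by_degree(degrees):
    """Group data objects by replication degree: d_i -> list of objects (D_i).
    `degrees` maps each object to its degree from the NLIP model."""
    groups = defaultdict(list)
    for obj, d in degrees.items():
        groups[d].append(obj)
    return dict(groups)

def partition_nodes(N, groups):
    """Give each degree group a share of the N nodes proportional to the
    number of replicas it must store (N^r_{D_i} * d_i)."""
    replicas = {d: len(objs) * d for d, objs in groups.items()}
    total = sum(replicas.values())
    return {d: round(N * cnt / total) for d, cnt in replicas.items()}
```

For example, two objects of degree 2 and one of degree 3 need 4 and 3 replicas respectively, so with 10 nodes the two groups receive roughly 6 and 4 nodes.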
The nodes in each FTS are required to be from different datacenters and within a certain geographic distance of each other. Distributing replicas over different datacenters avoids data loss due to machine failures (e.g., caused by power outages), for data reliability. Limiting the distances between the replica nodes of a data chunk constrains the consistency maintenance cost. Our scheme replicates the chunks of each data object to all nodes in an FTS.^5 Figure 2 illustrates an example of the design. In the example, d_1 = 2, d_2 = 3, and d_3 = 4. In the leftmost block, each FTS contains two chunks of a data object. Each FTS overlaps with every other FTS in at most one server. This helps reduce the probability of data loss under correlated machine failures and balances the load. For data objects with replication degree 2, the chunks of a data object are stored on the nodes of an FTS in the leftmost block. In the middle block, each FTS contains three chunks of a data object. In the rightmost block, each FTS contains four chunks of a data object.

^5 Although putting all replicas of a chunk on the nodes of an FTS can incur the cost of inter-rack transfer (across oversubscribed switches), it can significantly reduce the probability of data loss caused by correlated machine failures using the BIBD-based method [12].

Figure 2: Architecture of our scheme for cloud storage with fault-tolerant sets (FTSs).

Table II: Parameter settings.
Parameter | Meaning                                 | Setting
N         | # of servers                            | 1-1
M         | # of chunks of a data object            | 3 [45]
R         | # of servers in each FTS                | 3
λ         | # of FTSs containing a pair of servers  | 1
S         | Scatter width                           | 4
p         | Prob. of a server failure               | 0.5
Pr_th     | Threshold for expected request failure  | 0.5
Cc_th     | Threshold for consistency maint. cost   | 1
Cs_th     | Threshold for storage cost              | 3
m         | # of data objects in each application   | 1
n         | # of data applications                  | 5

Recall that each data priority value corresponds to a storage medium.
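The FTS property illustrated in Figure 2 (any two FTSs overlap in at most one server) can be checked directly. The snippet below is an illustrative verifier of that property, not the BIBD-based construction itself.

```python
# Illustrative check of the Figure 2 property: any two fault-tolerant sets
# (FTSs) share at most one server. This is a verifier sketch, not the
# BIBD-based construction.
from itertools import combinations

def max_pairwise_overlap(ftss):
    """Largest number of servers shared by any pair of FTSs."""
    return max(len(set(a) & set(b)) for a, b in combinations(ftss, 2))
```

With overlap bounded by one server, data is lost only when all members of some single FTS fail together, which is what bounds the loss probability under correlated failures.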
We assume that all servers have all types of storage media, and we need to determine which medium to use for a given priority on a given server. When replicating a chunk to a node, our scheme chooses the storage medium for the chunk based on the data object's priority calculated by Formula (13). Data objects with higher priority values are stored on faster and more expensive storage media (e.g., memory, SSD), and vice versa. Constraint (17) ensures that the storage requirements of the data objects do not overfill the servers.

V. PERFORMANCE EVALUATION
We conducted a numerical analysis based on the parameters in [12] (Table I), derived from system statistics of Facebook, HDFS and RAMCloud [1], [2], [1], [12], [13], [43], and also conducted real-world experiments on Amazon S3. The distributions of file read and write rates in our analysis and tests follow those of the CTH trace [44] provided by Sandia National Laboratories, which records a 4-hour read/write log in a parallel file system.

A. Numerical Analysis Results
We compared our scheme with three other replication schemes: Random Replication (RR) [12], Replication Degree Customization (RDC) [8] and Copyset Replication [12] (Copyset). We compare with Copyset instead of Tiered Replication (TR) [29] because TR focuses on data durability [29], whereas we focus on data availability. Moreover, the key idea of Copyset and TR for reducing the data loss probability is the same: both use the BIBD-based method. RR places the primary replica on a random node (say node i) in the entire system, and places the secondary replicas on (R-1) nodes around the primary node (i.e., nodes i+1, i+2,...).^6 RDC derives the replication degree of each object considering data object popularity, to maximize the expected data availability under non-correlated machine failures.
Copyset splits the nodes into a number of copysets and constrains the replicas of every chunk to a single copyset, so that it can reduce the frequency of

^6 RR is based on Facebook's design, which chooses secondary replica holders from a window of nodes around the primary node.
data loss by minimizing the number of copysets. The number of nodes that fail concurrently in each system was set to 1% of the nodes in the system [6], [12]. Since this rate is the maximum percentage of concurrently failing nodes in real clouds (i.e., the worst case) [1], [11], [46], it is reasonable to see higher probabilities of data loss and lower expected data availability in our analytical results than in current real clouds. The distributions of file popularity and updates follow those of the CTH trace. We used a normal distribution with mean 1 and standard deviation 1 to generate the unit costs of the different storage mediums. Compared to uniform machine failures, nonuniform and correlated machine failures are more realistic due to different hardware/software compositions and configurations [8], [47]. Thus, we set w_1 = 0.4, w_2 = 0.2 and w_3 = 0.4 in Equ. (8), respectively. We randomly generated a 6-bit number from a reasonable range for each node to represent its location, as explained in Section III-B. Table II shows the parameter settings in our analysis unless otherwise specified.

Figure 3: Probability of data loss vs. number of nodes (R = 3 for RR and Copyset); (a) Facebook, (b) HDFS, (c) RAMCloud.
Figure 4: Availability of requested data objects vs. number of nodes (R = 3 for RR and Copyset); (a) Facebook, (b) HDFS, (c) RAMCloud.
Figure 5: Storage cost vs. number of nodes (R = 3 for RR and Copyset).
Figure 6: Consistency maintenance cost vs. number of nodes (R = 3 for RR and Copyset).
We first calculate the data loss probability of each method.^7 Specifically, we used Formula (8) for our scheme, Formula (2) for Copyset, Formula (3) for RR, and Equ. (8) with w_1 = 0, w_2 = 0.4 and w_3 = 0.6 for RDC.

^7 Many datacenter operators prefer a low probability of incurring any data loss at the expense of losing more data in each event, due to the high cost of each incident of data loss [12].

Figures 3(a)-3(c) show the relationship between the probability of data loss and the number of nodes in the Facebook, HDFS and RAMCloud environments, respectively. We find that the result approximately follows our scheme < Copyset < RDC < RR. The probability of data loss in Copyset is higher than in our scheme because our scheme considers non-correlated machine failures, which are not considered in Copyset. RDC generates a higher probability of data loss than our scheme and Copyset because it neglects reducing the probability of data loss caused by correlated machine failures. The probability of data loss in RR is much higher than in our scheme and Copyset, and also higher than in RDC, for two reasons. First, RR places the copies of a chunk on a certain number (i.e., R) of different nodes, so any combination of R nodes that fail simultaneously results in data loss under correlated machine failures, whereas our scheme and Copyset lose data only if all the nodes in an FTS fail simultaneously. Second, our scheme and RDC consider data popularity to increase the expected data availability, which is not considered in RR. We then calculate the availability of a requested data object as 1 - P_r, where P_r is calculated by Formula (9). Figures 4(a)-4(c) show the availability of requested data objects. We see that the result generally follows our scheme > Copyset > RDC > RR in all figures. The availability of requested data objects in Copyset is lower than in our scheme because our scheme considers
data popularity when determining the replication degree of each chunk, which is not considered in Copyset. Also, our scheme reduces data loss under both correlated and non-correlated machine failures, while Copyset only minimizes data loss under correlated machine failures. The availability of requested data objects in Copyset is higher than in RDC because RDC cannot reduce the probability of data loss caused by correlated machine failures, which decreases the availability of requested data objects. RR has the lowest availability because it places the copies of a chunk on a certain number (i.e., R) of nodes, and any combination of R nodes that fail simultaneously causes data loss; also, RR does not consider data popularity as RDC does.

Figure 7: Probability of data loss vs. number of nodes on Amazon S3 (R = 3 for RR and Copyset); (a) Scatter width=2, (b) Scatter width=4.

We then used Formula (15) to calculate the storage cost based on the sizes, replication degrees and storage medium unit costs of the data objects for our scheme. For the other three methods, we randomly chose storage media for the data objects, and we do not minimize the storage cost under the space capacity constraint (Inequ. (17)) for RDC. Figures 5(a)-5(c) show the storage cost. The result in all three figures generally follows RR ≈ Copyset > RDC > our scheme. This is because our scheme stores data objects on different storage mediums based on their applications' priorities and the sizes and replication degrees of the data objects, in order to minimize the total storage cost. The storage costs of Copyset and RR are higher than RDC's and much higher than our scheme's. RDC reduces the replicas of unpopular data objects, thus reducing storage cost. Copyset and RR neither consider the different storage mediums nor reduce the replicas of unpopular data objects to reduce storage cost.
We used Formula (12) to calculate the consistency maintenance cost of each method based on the geographic distances and replication degrees of the data objects. Figures 6(a)-6(c) show the consistency maintenance cost. In these figures, the consistency maintenance costs of Copyset and RR are higher than ours because our scheme limits the geographic distance between the replica nodes of a chunk as well as the number of replicas, thereby reducing the consistency maintenance cost, whereas Copyset and RR neglect geographic distances. RDC produces the highest consistency maintenance cost because it neglects geographic distance and also generates more replicas for popular data objects.

B. Real-world Experimental Results
We conducted experiments on the real-world Amazon S3. We simulated geo-distributed storage datacenters using three regions of Amazon S3 in the U.S. In each region, we created the same number of buckets, each of which simulates a data server. The number of buckets was varied from 5 to 25 with a step size of 5 in our test. We distributed 5 data objects to all the servers in the system. We used the distributions of read and write rates from the CTH trace data to generate reads and writes. The requests were generated from servers in the Windows Azure eastern region. We consider the requests targeting each region with a latency of more than 1 s as failed requests (unavailable data objects).^8 We used the parameters in Table II except p and N.

Figure 8: Availability of requested data objects vs. number of nodes on Amazon S3 (R = 3 for RR and Copyset); (a) Scatter width=2, (b) Scatter width=4.
Figure 9: Storage cost and consistency maintenance cost (relative to Copyset) vs. number of nodes on Amazon S3 (R = 3 for RR and Copyset); (a) Storage cost, (b) Consistency maintenance cost.
In this test, N is the total number of simulated data servers in the system, and p (with average value 0.89) follows the actual probability of a server failure in the real system. We used the actual price of data access on Amazon S3 to calculate the storage cost. Figures 7(a) and 7(b) show the probability of data loss on Amazon S3. We see that the result approximately follows our scheme < RDC < Copyset < RR, while our numerical result shows our scheme < Copyset < RDC < RR. Both results confirm that our scheme generates the lowest probability of data loss. RDC generates a higher probability of data loss than Copyset in the numerical analysis but a lower probability than Copyset in the experiments. This is because RDC cannot handle correlated machine failures, and the rate of correlated machine failures is 1% in the numerical analysis whereas our real-world experiment has fewer correlated machine failures. We also see that scatter width 2 produces a lower probability of data loss than scatter width 4, because a larger scatter width increases the number of FTSs and thus increases the probability of data loss. Figures 8(a) and 8(b) show the availability of requested data objects on Amazon S3. We see that the result follows our scheme > Copyset > RDC > RR, which is consistent with the result in Figure 4, for the same reasons. We also see that scatter width 2 produces higher availability than scatter width 4 because a larger scatter width increases the probability of data loss and reduces data availability. We then regard Copyset as a baseline and calculate the ratio of the storage/consistency maintenance cost of each of the other

^8 Since it is hard to generate permanent failures on Amazon S3 and the network latency is low based on [48], we believe the data request failures within the U.S. are mainly caused by machine failures, and we consider the requests targeting each region with a latency longer than 1 s as failed requests, which reflects the availability of data objects.
More informationInvestigation of The Time-Offset-Based QoS Support with Optical Burst Switching in WDM Networks
Investigation of The Tie-Offset-Based QoS Support with Optical Burst Switching in WDM Networks Pingyi Fan, Chongxi Feng,Yichao Wang, Ning Ge State Key Laboratory on Microwave and Digital Counications,
More informationAutomatic Graph Drawing Algorithms
Autoatic Graph Drawing Algoriths Susan Si sisuz@turing.utoronto.ca Deceber 7, 996. Ebeddings of graphs have been of interest to theoreticians for soe tie, in particular those of planar graphs and graphs
More informationImprove Peer Cooperation using Social Networks
Iprove Peer Cooperation using Social Networks Victor Ponce, Jie Wu, and Xiuqi Li Departent of Coputer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 Noveber 5, 2007 Corresponding
More informationRedundancy Level Impact of the Mean Time to Failure on Wireless Sensor Network
(IJACSA) International Journal of Advanced Coputer Science and Applications Vol. 8, No. 1, 217 Redundancy Level Ipact of the Mean Tie to Failure on Wireless Sensor Network Alaa E. S. Ahed 1 College of
More informationResearch Article A Game Theory Based Congestion Control Protocol for Wireless Personal Area Networks
Journal of Sensors Volue 216, Article ID 61683, 13 pages http://dx.doi.org/1.11/216/61683 Research Article A Gae Theory Based Congestion Control Protocol for Wireless Personal Area Networks Chuang Ma,
More informationEvaluation of a multi-frame blind deconvolution algorithm using Cramér-Rao bounds
Evaluation of a ulti-frae blind deconvolution algorith using Craér-Rao bounds Charles C. Beckner, Jr. Air Force Research Laboratory, 3550 Aberdeen Ave SE, Kirtland AFB, New Mexico, USA 87117-5776 Charles
More informationContent-Centric Multicast Beamforming in Cache-Enabled Cloud Radio Access Networks
Content-Centric Multicast Beaforing in Cache-Enabled Cloud Radio Access Networks Hao Zhou, Meixia Tao, Erkai Chen, Wei Yu *Dept. of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China
More informationDistributed Multicast Tree Construction in Wireless Sensor Networks
Distributed Multicast Tree Construction in Wireless Sensor Networks Hongyu Gong, Luoyi Fu, Xinzhe Fu, Lutian Zhao 3, Kainan Wang, and Xinbing Wang Dept. of Electronic Engineering, Shanghai Jiao Tong University,
More informationIntegrating fast mobility in the OLSR routing protocol
Integrating fast obility in the OLSR routing protocol Mounir BENZAID 1,2, Pascale MINET 1 and Khaldoun AL AGHA 1,2 1 INRIA, Doaine de Voluceau - B.P.105, 78153 Le Chesnay Cedex, FRANCE ounir.benzaid, pascale.inet@inria.fr
More informationTensorFlow and Keras-based Convolutional Neural Network in CAT Image Recognition Ang LI 1,*, Yi-xiang LI 2 and Xue-hui LI 3
2017 2nd International Conference on Coputational Modeling, Siulation and Applied Matheatics (CMSAM 2017) ISBN: 978-1-60595-499-8 TensorFlow and Keras-based Convolutional Neural Network in CAT Iage Recognition
More informationI ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making
I ve Seen Enough : Increentally Iproving Visualizations to Support Rapid Decision Making Sajjadur Rahan Marya Aliakbarpour 2 Ha Kyung Kong Eric Blais 3 Karrie Karahalios,4 Aditya Paraeswaran Ronitt Rubinfield
More informationTheoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem
Theoretical Analysis of Local Search and Siple Evolutionary Algoriths for the Generalized Travelling Salesperson Proble Mojgan Pourhassan ojgan.pourhassan@adelaide.edu.au Optiisation and Logistics, The
More informationResource Optimization for Web Service Composition
Resource Optiization for Web Coposition Xia Gao, Ravi Jain, Zulfikar Razan, Ulas Kozat Abstract coposition recently eerged as a costeffective way to quickly create new services within a network. Soe research
More informationAdaptive Holistic Scheduling for In-Network Sensor Query Processing
Adaptive Holistic Scheduling for In-Network Sensor Query Processing Hejun Wu and Qiong Luo Departent of Coputer Science and Engineering Hong Kong University of Science & Technology Clear Water Bay, Kowloon,
More informationSecure Wireless Multihop Transmissions by Intentional Collisions with Noise Wireless Signals
Int'l Conf. Wireless etworks ICW'16 51 Secure Wireless Multihop Transissions by Intentional Collisions with oise Wireless Signals Isau Shiada 1 and Hiroaki Higaki 1 1 Tokyo Denki University, Japan Abstract
More information1 Extended Boolean Model
1 EXTENDED BOOLEAN MODEL It has been well-known that the Boolean odel is too inflexible, requiring skilful use of Boolean operators to obtain good results. On the other hand, the vector space odel is flexible
More informationGeometry. The Method of the Center of Mass (mass points): Solving problems using the Law of Lever (mass points). Menelaus theorem. Pappus theorem.
Noveber 13, 2016 Geoetry. The Method of the enter of Mass (ass points): Solving probles using the Law of Lever (ass points). Menelaus theore. Pappus theore. M d Theore (Law of Lever). Masses (weights)
More informationPredicting x86 Program Runtime for Intel Processor
Predicting x86 Progra Runtie for Intel Processor Behra Mistree, Haidreza Haki Javadi, Oid Mashayekhi Stanford University bistree,hrhaki,oid@stanford.edu 1 Introduction Progras can be equivalent: given
More informationCOLOR HISTOGRAM AND DISCRETE COSINE TRANSFORM FOR COLOR IMAGE RETRIEVAL
COLOR HISTOGRAM AND DISCRETE COSINE TRANSFORM FOR COLOR IMAGE RETRIEVAL 1 Te-Wei Chiang ( 蔣德威 ), 2 Tienwei Tsai ( 蔡殿偉 ), 3 Jeng-Ping Lin ( 林正平 ) 1 Dept. of Accounting Inforation Systes, Chilee Institute
More informationCopysets: Reducing the Frequency of Data Loss in Cloud Storage
Copysets: Reducing the Frequency of Data Loss in Cloud Storage Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University cidon@stanford.edu, {rumble,stutsman,skatti,ouster,mendel}@cs.stanford.edu
More informationBP-MAC: A High Reliable Backoff Preamble MAC Protocol for Wireless Sensor Networks
BP-MAC: A High Reliable Backoff Preable MAC Protocol for Wireless Sensor Networks Alexander Klein* Jirka Klaue Josef Schalk EADS Innovation Works, Gerany *Eail: klein@inforatik.uni-wuerzburg.de ABSTRACT:
More informationDatabase Design on Mechanical Equipment Operation Management System Zheng Qiu1, Wu kaiyuan1, Wu Chengyan1, Liu Lei2
2nd International Conference on Advances in Mechanical Engineering and Industrial Inforatics (AMEII 206) Database Design on Mechanical Equipent Manageent Syste Zheng Qiu, Wu kaiyuan, Wu Chengyan, Liu Lei2
More informationEvaluation of the Timing Properties of Two Control Networks: CAN and PROFIBUS
Evaluation of the Tiing Properties of Two Control Networs: CAN and PROFIBUS Max Mauro Dias Santos 1, Marcelo Ricardo Steer 2 and Francisco Vasques 3 1 UnilesteMG, CEP 35170-056, Coronel Fabriciano MG Brasil.
More informationTALLINN UNIVERSITY OF TECHNOLOGY, INSTITUTE OF PHYSICS 17. FRESNEL DIFFRACTION ON A ROUND APERTURE
7. FRESNEL DIFFRACTION ON A ROUND APERTURE. Objective Exaining diffraction pattern on a round aperture, deterining wavelength of light source.. Equipent needed Optical workbench, light source, color filters,
More informationPrivacy-preserving String-Matching With PRAM Algorithms
Privacy-preserving String-Matching With PRAM Algoriths Report in MTAT.07.022 Research Seinar in Cryptography, Fall 2014 Author: Sander Sii Supervisor: Peeter Laud Deceber 14, 2014 Abstract In this report,
More informationA Trajectory Splitting Model for Efficient Spatio-Temporal Indexing
A Trajectory Splitting Model for Efficient Spatio-Teporal Indexing Slobodan Rasetic Jörg Sander Jaes Elding Mario A. Nasciento Departent of Coputing Science University of Alberta Edonton, Alberta, Canada
More informationHKBU Institutional Repository
Hong Kong Baptist University HKBU Institutional Repository HKBU Staff Publication 18 Towards why-not spatial keyword top-k ueries: A direction-aware approach Lei Chen Hong Kong Baptist University, psuanle@gail.co
More informationAn Integrated Processing Method for Multiple Large-scale Point-Clouds Captured from Different Viewpoints
519 An Integrated Processing Method for Multiple Large-scale Point-Clouds Captured fro Different Viewpoints Yousuke Kawauchi 1, Shin Usuki, Kenjiro T. Miura 3, Hiroshi Masuda 4 and Ichiro Tanaka 5 1 Shizuoka
More informationFeature Selection to Relate Words and Images
The Open Inforation Systes Journal, 2009, 3, 9-13 9 Feature Selection to Relate Words and Iages Wei-Chao Lin 1 and Chih-Fong Tsai*,2 Open Access 1 Departent of Coputing, Engineering and Technology, University
More informationOptimal Route Queries with Arbitrary Order Constraints
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.?, NO.?,? 20?? 1 Optial Route Queries with Arbitrary Order Constraints Jing Li, Yin Yang, Nikos Maoulis Abstract Given a set of spatial points DS,
More informationThe Application of Bandwidth Optimization Technique in SLA Negotiation Process
The Application of Bandwidth Optiization Technique in SLA egotiation Process Srećko Krile University of Dubrovnik Departent of Electrical Engineering and Coputing Cira Carica 4, 20000 Dubrovnik, Croatia
More informationReal-Time Detection of Invisible Spreaders
Real-Tie Detection of Invisible Spreaders MyungKeun Yoon Shigang Chen Departent of Coputer & Inforation Science & Engineering University of Florida, Gainesville, FL 3, USA {yoon, sgchen}@cise.ufl.edu Abstract
More informationInternational Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN
International Journal of Scientific & Engineering Research, Volue 4, Issue 0, October-203 483 Design an Encoding Technique Using Forbidden Transition free Algorith to Reduce Cross-Talk for On-Chip VLSI
More informationRECONFIGURABLE AND MODULAR BASED SYNTHESIS OF CYCLIC DSP DATA FLOW GRAPHS
RECONFIGURABLE AND MODULAR BASED SYNTHESIS OF CYCLIC DSP DATA FLOW GRAPHS AWNI ITRADAT Assistant Professor, Departent of Coputer Engineering, Faculty of Engineering, Hasheite University, P.O. Box 15459,
More informationA Broadband Spectrum Sensing Algorithm in TDCS Based on ICoSaMP Reconstruction
MATEC Web of Conferences 73, 03073(08) https://doi.org/0.05/atecconf/087303073 SMIMA 08 A roadband Spectru Sensing Algorith in TDCS ased on I Reconstruction Liu Yang, Ren Qinghua, Xu ingzheng and Li Xiazhao
More informationA Measurement-Based Model for Parallel Real-Time Tasks
A Measureent-Based Model for Parallel Real-Tie Tasks Kunal Agrawal 1 Washington University in St. Louis St. Louis, MO, USA kunal@wustl.edu https://orcid.org/0000-0001-5882-6647 Sanjoy Baruah 2 Washington
More informationEnhancing Real-Time CAN Communications by the Prioritization of Urgent Messages at the Outgoing Queue
Enhancing Real-Tie CAN Counications by the Prioritization of Urgent Messages at the Outgoing Queue ANTÓNIO J. PIRES (1), JOÃO P. SOUSA (), FRANCISCO VASQUES (3) 1,,3 Faculdade de Engenharia da Universidade
More informationA Discrete Spring Model to Generate Fair Curves and Surfaces
A Discrete Spring Model to Generate Fair Curves and Surfaces Atsushi Yaada Kenji Shiada 2 Tootake Furuhata and Ko-Hsiu Hou 2 Tokyo Research Laboratory IBM Japan Ltd. LAB-S73 623-4 Shiotsurua Yaato Kanagawa
More informationQoS and Sensible Routing Decisions
QoS and Sensible Routing Decisions Erol Gelenbe Dept. of Electrical & Electronic Engineering Iperial College London SW7 2BT e.gelenbe@iperial.ac.uk Abstract Network Quality of Service (QoS) criteria of
More informationHandling Indivisibilities. Notes for AGEC 622. Bruce McCarl Regents Professor of Agricultural Economics Texas A&M University
Notes for AGEC 6 Bruce McCarl Regents Professor of Agricultural Econoics Texas A&M University All or None -- Integer Prograing McCarl and Spreen Chapter 5 Many investent and probles involve cases where
More informationJoint Measurement- and Traffic Descriptor-based Admission Control at Real-Time Traffic Aggregation Points
Joint Measureent- and Traffic Descriptor-based Adission Control at Real-Tie Traffic Aggregation Points Stylianos Georgoulas, Panos Triintzios and George Pavlou Centre for Counication Systes Research, University
More informationDynamical Topology Analysis of VANET Based on Complex Networks Theory
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volue 14, Special Issue Sofia 014 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.478/cait-014-0053 Dynaical Topology Analysis
More informationCollection Selection Based on Historical Performance for Efficient Processing
Collection Selection Based on Historical Perforance for Efficient Processing Christopher T. Fallen and Gregory B. Newby Arctic Region Supercoputing Center University of Alaska Fairbanks Fairbanks, Alaska
More informationDeterministic Voting in Distributed Systems Using Error-Correcting Codes
IEEE TRASACTIOS O PARALLEL AD DISTRIBUTED SYSTEMS, VOL. 9, O. 8, AUGUST 1998 813 Deterinistic Voting in Distributed Systes Using Error-Correcting Codes Lihao Xu and Jehoshua Bruck, Senior Meber, IEEE Abstract
More informationLeveraging Relevance Cues for Improved Spoken Document Retrieval
Leveraging Relevance Cues for Iproved Spoken Docuent Retrieval Pei-Ning Chen 1, Kuan-Yu Chen 2 and Berlin Chen 1 National Taiwan Noral University, Taiwan 1 Institute of Inforation Science, Acadeia Sinica,
More informationAn Ad Hoc Adaptive Hashing Technique for Non-Uniformly Distributed IP Address Lookup in Computer Networks
An Ad Hoc Adaptive Hashing Technique for Non-Uniforly Distributed IP Address Lookup in Coputer Networks Christopher Martinez Departent of Electrical and Coputer Engineering The University of Texas at San
More informationBalancing Traffic Load for Devolved Controllers in Data Center Networks
Globeco - Syposiu on Selected Areas in Counications: GC SAC Cloud Networks Balancing Traffic Load for Devolved Controllers in Data Center Networks Wanchao Liang, Xiaofeng Gao, Fan Wu, Guihai Chen, Wei
More informationRule Extraction using Artificial Neural Networks
Rule Extraction using Artificial Neural Networks S. M. Karuzzaan 1 Ahed Ryadh Hasan 2 Abstract Artificial neural networks have been successfully applied to a variety of business application probles involving
More informationBoosted Detection of Objects and Attributes
L M M Boosted Detection of Objects and Attributes Abstract We present a new fraework for detection of object and attributes in iages based on boosted cobination of priitive classifiers. The fraework directly
More informationA Low-cost Memory Architecture with NAND XIP for Mobile Embedded Systems
A Low-cost Meory Architecture with XIP for Mobile Ebedded Systes Chanik Park, Jaeyu Seo, Sunghwan Bae, Hyojun Ki, Shinhan Ki and Busoo Ki Software Center, SAMSUNG Electronics, Co., Ltd. Seoul 135-893,
More information