Fault Tolerant Parallel Data-Intensive Algorithms


Mucahid Kutlu, Department of Computer Science and Engineering, Ohio State University, Columbus, OH
Gagan Agrawal, Department of Computer Science and Engineering, Ohio State University, Columbus, OH
Oguz Kurt, Department of Mathematics, Ohio State University, Columbus, OH

Abstract—Fault-tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as the increasing number of cores is decreasing the mean-time to failure of these systems. While checkpointing, including checkpointing of parallel programs like MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Thus, algorithm-based fault-tolerance provides a practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault-tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks so as to minimize the maximum data intersection between any two processors. This allows us to have minimum data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of re-processing of the lost data. We have evaluated our approach using two popular parallel data mining algorithms, k-means and apriori. We show that our approach has negligible overhead when there are no failures, and allows us to gracefully handle different numbers of failures, as well as failures at different points of processing. We also compare our approach with a MapReduce-based solution for fault tolerance, and show that we outperform Hadoop both in the absence and in the presence of failures.

I. INTRODUCTION

Growing computational and data processing needs are currently being met with an increasing number of cores, as there is almost no improvement in a single core's performance. However, with a growing number of cores, the Mean-Time To Failure (MTTF) of the systems is decreasing. As a result, fault-tolerance is rapidly becoming a major topic in high-end computing.

Several different approaches to fault-tolerance have been taken, depending upon the nature of the parallel application and the programming model used. Since a large number of high-end applications are developed using MPI, there is a large body of work on MPI fault-tolerance, which focuses on checkpointing [28], [5], [15], [34], [23], [21], [2]. However, the main issue with checkpointing is its high overhead, especially as systems are becoming larger while disk bandwidths are not improving. For future exascale systems, it is being argued that checkpointing and recovery time (with current methods) will even exceed the MTTF, leading to the need for alternative methods [9]. A promising alternative, which has resulted in lower overheads, is algorithm-based fault-tolerance [12], [16], [6], often based on disk-less checkpointing [36], [22]. These methods use specific properties of the algorithm to reduce the amount of information that needs to be cached. Most of the work in this area has been for scientific computations, like linear algebra routines [36], [16] and iterative computations [12], including conjugate gradient [13].
While large parallel systems were traditionally used for scientific computations, data-intensive computing has rapidly emerged as a major application class in recent years [8]. Though most recent work in this area has been in the context of MapReduce [17] and its variants, stand-alone implementations of parallel data-intensive algorithms are also very common [24], [39]. In this paper, we examine algorithm-level fault-tolerance for data-intensive algorithms. We show how the common properties of many similar data mining algorithms can be exploited to develop an approach for algorithm-level fault-tolerance. These algorithms involve an iterative structure, where the communication at the end of each iteration is limited to generalized reductions. We focus on fail-stop failures [38], in which the failed processors stop working and all their data is lost. We also assume that no additional processors are used for the recovery. Note that continuing the execution with the remaining nodes is more challenging than using backup nodes, but it is also more practical, since backup nodes may not always be available. Therefore, we have to read the lost data again and assign these data portions to the running processors. To avoid degrading performance, we also need to ensure good load balance over the remaining nodes.

Our new approach is as follows. In order to minimize the amount of data loss, we first divide the data that each processor will normally process into smaller data blocks. Then we replicate these data blocks and distribute them among processors such that the maximum intersection between any two processors is minimized. In case of failures, the master node assigns the data that would normally be processed by the failed processors to the slaves that already store replicas of it. Having smaller parts as the unit of processing and replication allows us to have better load balance after a failure, and decreases the amount of data loss when we have multiple failures. Moreover, we divide each data block into smaller data portions and augment the algorithm to perform a summary exchange after processing each data portion. This reduces the amount of work to be redone after a failure.

We have extensively evaluated our algorithms. First, we show that replication of data and summary exchanges add very little overhead. Starting from an execution on 16 nodes, we could recover from 1, 2, and 3 failures with an overall slowdown of 16.6%, 17.1%, and 26.3%, respectively. We also compared our approach with Hadoop's support for fault tolerance [41], using implementations of the same algorithms with the MapReduce API. Our approach had much lower slowdown while handling failures.

The rest of the paper is organized as follows. In Section 2, we explain the serial and parallel versions of the k-means algorithm. In Section 3, we explain the details of our replication approach and present the distribution algorithm. In Section 4, we describe recovery from different failure cases. In Section 5, we report a detailed evaluation of our approach, including a comparison against Hadoop. We compare our work with related research efforts in Section 6 and conclude in Section 7.

II. DATA MINING ALGORITHMS

In our study, we focus on two representative data-intensive algorithms, k-means clustering [29] and apriori association mining [1]. In this section, we explain the k-means algorithm, presenting both its sequential and parallel versions. The explanation of the apriori algorithm is not given because of space limits.

A. K-means Clustering

Clustering is one of the most commonly studied problems in machine learning and data mining. The goal in clustering is to divide a set of data records into k parts or clusters, maximizing similarity within each cluster and dissimilarity across clusters. K-means is an iterative clustering algorithm, and its pseudo-code is shown as Algorithm 1. In the k-means algorithm, we initially select k centroids randomly (Line 2), before the iterative step. We then assign each object to the nearest cluster (Lines 5-7); any distance calculation method can be used in this step. Once such assignments are completed, we calculate the new centroid of each cluster (Line 8); this can be done by averaging the coordinates of the objects assigned to the corresponding cluster. Then, we calculate delta in order to find how much the centroids of the clusters have changed in the current iteration (Line 9). If this change is not greater than a prespecified threshold, the algorithm has converged and we can finish the clustering process. To avoid the possibility of an infinite loop, we also put a bound on the number of iterations.

Algorithm 1 Serial K-means Clustering Algorithm
 1: input: D = {d_1, d_2, ..., d_n} (data records to be clustered), k (number of clusters), MaxIter (maximum number of iterations), Threshold
 2: Select k cluster centroids randomly
 3: iteration = 0
 4: repeat
 5:   for i = 1 to n do
 6:     Assign d_i to the nearest cluster
 7:   end for
 8:   Calculate new centroids of clusters
 9:   delta = sum_{j=1..k} |newcentroid_j - oldcentroid_j|
10:   Increment iteration by 1
11: until iteration >= MaxIter or delta <= Threshold

Now, we consider the parallel version of the k-means algorithm. The pseudo-code for the master and slave nodes is given in Algorithm 2. We distribute the data among slave nodes equally, so each slave node is responsible for n/p data records, where p is the number of processors and n is the number of data records. On the master node, we select the initial k cluster centroids (Line 1). At the beginning of each iteration, we first broadcast the current k cluster centroids (Line 4), so that each slave node gets the same cluster centroids (Line 3). The master node waits until it gets all the results from the slaves (Line 5). During this time, the slave nodes calculate local new centroids (Line 5) and delta (Line 6). Then, they send delta and the centroids, with the number of data records in each cluster, to the master node. Once the master node gets all the results from the slaves, it calculates the global new centroids and the total delta (Lines 6-7) and broadcasts the delta. If delta is not greater than the threshold, the clustering finishes for all nodes. Otherwise, a new iteration begins.

Algorithm 2 Parallel K-means Clustering Algorithm
Master Node:
 1: Select k cluster centroids randomly
 2: iteration = 0
 3: repeat
 4:   Broadcast the k cluster centroids
 5:   Wait for all new centroids from the slaves
 6:   Calculate new centroids of clusters
 7:   Calculate total delta
 8:   Broadcast delta
 9:   Increment iteration by 1
10: until iteration >= MAXITER or delta <= Threshold

Slave Node:
 1: iteration = 0
 2: repeat
 3:   Receive the k cluster centroids
 4:   Assign each data record to the nearest cluster
 5:   Calculate new cluster centroids
 6:   delta = sum_{j=1..k} |newcentroid_j - oldcentroid_j|
 7:   Send delta and the cluster centroids with the number of data records of each cluster
 8:   Increment iteration by 1
 9:   Receive new delta
10: until iteration >= MAXITER or delta <= Threshold
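As a concrete illustration of the exchange just described, the following is a minimal C/MPI sketch of one iteration on a worker. It is a sketch only, not the paper's implementation: it uses MPI collectives (MPI_Bcast, MPI_Allreduce) instead of the explicit master/slave messages of Algorithm 2, and the constants K and DIM as well as the flat record layout are illustrative assumptions.

    /* Sketch of one parallel k-means iteration (illustrative only).
     * The paper uses explicit master/slave messages (Algorithm 2);
     * collectives are used here for brevity. */
    #include <mpi.h>

    #define K    8      /* number of clusters (assumed)        */
    #define DIM  4      /* dimensionality of a record (assumed) */

    void kmeans_iteration(const double *records, int n_local,
                          double centroids[K][DIM], MPI_Comm comm)
    {
        double local_sum[K][DIM] = {{0}}, global_sum[K][DIM];
        long   local_cnt[K] = {0},        global_cnt[K];

        /* every node starts the iteration with the same centroids */
        MPI_Bcast(centroids, K * DIM, MPI_DOUBLE, 0, comm);

        for (int i = 0; i < n_local; i++) {
            const double *r = records + (long)i * DIM;
            int best = 0; double best_d = 1e300;
            for (int k = 0; k < K; k++) {          /* nearest centroid */
                double d = 0;
                for (int j = 0; j < DIM; j++) {
                    double diff = r[j] - centroids[k][j];
                    d += diff * diff;
                }
                if (d < best_d) { best_d = d; best = k; }
            }
            for (int j = 0; j < DIM; j++)          /* local reduction */
                local_sum[best][j] += r[j];
            local_cnt[best]++;
        }

        /* combine partial sums and counts; every node can then compute
         * the new centroids (and delta) locally */
        MPI_Allreduce(local_sum, global_sum, K * DIM, MPI_DOUBLE, MPI_SUM, comm);
        MPI_Allreduce(local_cnt, global_cnt, K, MPI_LONG, MPI_SUM, comm);

        for (int k = 0; k < K; k++)
            if (global_cnt[k] > 0)
                for (int j = 0; j < DIM; j++)
                    centroids[k][j] = global_sum[k][j] / global_cnt[k];
    }

The per-cluster sums and counts in this sketch play the role of the reduction object that is exchanged in Algorithm 2.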
B. Generalization to Other Algorithms

    {* Outer Sequential Loop *}
    While ( ) {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

Fig. 1. Generalized Reduction Processing Structure of Common Data Mining Algorithms

In our previous work [30], [31], we have made the observation that the parallel versions of several well-known data mining techniques share a relatively similar structure. Besides k-means clustering and apriori association mining, this structure applies to several other clustering and association mining algorithms, as well as to algorithms for Bayesian networks for classification [10], k-nearest neighbor classifiers [25], artificial neural networks [25], and decision tree classifiers [35]. The common structure behind these algorithms is summarized in Figure 1. The function op is an associative and commutative function; thus, the iterations of the foreach loop can be performed in any order. The data structure Reduc is referred to as the reduction object. The reduction performed is, however, irregular, in the sense that the specific elements of the reduction object that are updated depend upon the results of processing an element.

For algorithms following such a generalized reduction structure, parallelization can be done by dividing the data instances (or records or transactions) among the processing threads. The computation performed by each thread is iterative, and involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction. In a distributed-memory setting, the reduction object needs to be replicated, and a global reduction is performed after the local reductions. Again, the parallel algorithm for k-means clustering described earlier and the apriori algorithm that we used in our experiments are specific instances of this structure.
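The following small C sketch makes the structure of Figure 1 concrete. The reduction_object layout, the classify callback standing in for process(e), and the scalar element type are assumptions for illustration; they are not the paper's API.

    /* Illustrative mapping of the generalized reduction structure of
     * Fig. 1 onto a concrete reduction object. */
    #include <stddef.h>

    typedef struct {
        size_t  size;      /* number of reduction entries             */
        double *val;       /* Reduc(i): accumulated values            */
        long   *cnt;       /* e.g., per-cluster or per-itemset counts */
    } reduction_object;

    /* op must be associative and commutative, so elements may be
     * processed in any order and partial objects can be merged. */
    typedef void (*reduce_op)(reduction_object *r, size_t i, double val);

    static void sum_op(reduction_object *r, size_t i, double val)
    {
        r->val[i] += val;
        r->cnt[i] += 1;
    }

    /* The reduction loop of Fig. 1: each element contributes to the
     * entry (or entries) selected by processing it. */
    void reduction_loop(const double *elements, size_t n,
                        size_t (*classify)(const double *e),  /* "process(e)" */
                        reduction_object *reduc, reduce_op op)
    {
        for (size_t e = 0; e < n; e++) {
            size_t i = classify(&elements[e]);  /* which entry is updated     */
            op(reduc, i, elements[e]);          /* Reduc(i) = Reduc(i) op val */
        }
    }

    /* Merging two partial reduction objects (the global reduction step). */
    void merge(reduction_object *dst, const reduction_object *src)
    {
        for (size_t i = 0; i < dst->size; i++) {
            dst->val[i] += src->val[i];
            dst->cnt[i] += src->cnt[i];
        }
    }

Because op is associative and commutative, the merge step can be applied to partial reduction objects in any order, which is exactly what later allows partial summaries to be exchanged and combined.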

III. OUR APPROACH

We now describe the overall approach taken for fault-tolerant data-intensive algorithms, like k-means clustering and apriori association mining.

A. System Model Assumptions and Goals

In our study, we focus only on fail-stop failures. When a node fails, we lose everything on that node, including the data that had been processed by the failed node. In addition, we do not use any additional node to recover or replace the failed one; instead, we continue the execution with the remaining nodes, since using backup nodes is not always practical. Unless otherwise stated, the word failure implies the failure of a slave node, and not the failure of the master node.

In developing fault-tolerant algorithms, we have two main goals. First, we want to minimize the data loss, since the lost data needs to be re-read from the storage cluster, resulting in slowdown. Second, we want to minimize re-execution, or repeated work, when a failure occurs. To meet these two goals, we take the following approach. First, we intelligently replicate the data across slave nodes, such that the amount of data lost because of a certain number of failures is minimized. Second, we have developed a method for summarizing the computations, and sharing these summaries, in a way that decreases re-execution. However, in our approach, we also need to combine these two steps, which itself leads to several challenges.

B. Replication

We now describe the data replication approach we use. In order to decrease the number of accesses to the storage cluster, when we read the data initially, we store the same data on more than one processor. Therefore, each processor has two types of data: primary data and replicas. Note that the loading time of replicas is much less than that of the primary data, since the data already exists in the cluster and we do not need to access the storage cluster for it. Each processor normally processes only its own primary data, but it can process replicas in case of failures. If we replicate the data R times, i.e., have R - 1 replicas besides the primary data, we can handle R - 1 failures and continue the application with replicas, without requiring any new I/O operation from the storage cluster. But if the R processors that store the same data all fail, we lose those particular data elements, and have to access the storage cluster to read the lost data. Clearly, one way to reduce the possibility of having to read additional data is to increase R, i.e., create additional copies of the data. However, there is a practical limit on R, as local data storage resources are limited.
Thus, our goal is to replicate the data in a fashion that, for a given replication factor R, minimizes the execution time for any number of failures. To meet this goal, we need to ensure that the common data between any given set of processors is the minimum possible among all options for the given replication factor. We achieve this in the following way. First, we divide the primary data of each processor into S smaller, equal-sized parts, each of which is denoted as D and has a size of |D|. The primary data of processor i is denoted as D_i, and its size is |D_i| = |D| * S. In the replication step, we distribute these S data components that form the primary data of processor i among different processors, so as to have R copies of the data in all. This process is repeated for the primary data of all processors. At the end, each processor should hold the same amount of data. The data at processor i, including the primary data and the replicas, is denoted as P_i, and its size is |P_i| = |D_i| * R.

Returning to our goal in replicating the data and allocating the replicas, we would like the intersection between the data held by any two processors to be either the null set or only a single data block. So, our goal can be written as follows:

    for all i, j (i != j):  |P_i ∩ P_j| <= |D|

We developed an algorithm that meets this goal, provided that a sufficient number of processors is available. We first explain our idea with an example. In Figure 2, a sample data distribution is shown where the replication factor R is 3 and the number of data components (or blocks) per processor is 6, which also means that S is 2. In general, if we have n processors, we allocate the data onto n * R virtual processors. The first n virtual processors are allocated for the primary data, and the remaining n * (R - 1) processors are allocated for replicas. For this example, we first divide all the data into n * S * R parts. In our example, n = 7, so there are 42 data blocks. We first distribute these 42 data blocks among the first n virtual processors. This distribution is shown under columns P0-P6 in Figure 2. This is our first processor block, with index zero. Then, we need to distribute the replicas. We allocate an additional n virtual processors, which form the processor block with index 1. We first distribute all the data blocks among these virtual processors in the same fashion as we did for the initial block. This process is repeated until we have R processor blocks allocated. Now, to minimize the overlap, we use a simple trick, which can be seen in Figure 2: we shift each row of a virtual processor block to the left, block index * row index times. There is no shift for the rows of the initial block (primary data), since its block index is 0. But row 1 (the second row from the top) of the next block is shifted by 1, row 2 is shifted by 2, and so on. Similarly, row 1 of the processor block with index 2 is shifted by 2, row 2 is shifted by 4, and so on. The final allocation for our example can be seen in Figure 2, where each processor block is colored differently.

To generalize and formalize the method, we proceed as follows. Our goal is to compute a distribution matrix C of dimensions (S*R) x p:

        | c_{0,0}        c_{0,1}        ...   c_{0,p-1}        |
    C = | c_{1,0}        c_{1,1}        ...   c_{1,p-1}        |        (1)
        | ...            ...            ...   ...              |
        | c_{S*R-1,0}    c_{S*R-1,1}    ...   c_{S*R-1,p-1}    |

In this matrix, the column numbers (j) represent the virtual processor number and the row numbers (i) represent the ranking of the data block in the corresponding processor.

Fig. 2. Example Data Distribution (Replication Factor is 3, No. of Blocks Per Processor is 6)

Each c_{i,j} value represents the index of a data block. The overall method is generalized and summarized as Algorithm 3. Note that in the implementation, each processor is assigned S data blocks as primary data; simply, the primary data of the processors in the i-th processor block are the data blocks in rows [S*i, S*(i+1)).

Algorithm 3 Data Distribution Algorithm
 1: Divide the entire data into n * R * S parts
 2: Initialize block 0 (columns 0 through n-1) of the matrix C
 3: Distribute the data parts among the n processors
 4: for i = 1 to R - 1 do
 5:   Allocate a virtual processor block with n processors
 6:   Copy columns 0 through n-1 as block i
 7:   for j = 0 to (S * R - 1) do
 8:     Shift the j-th row of the i-th block j * i times to the left
 9:   end for
10: end for

Returning to our matrix, according to our algorithm each c_{i,j} value is given by

    c_{i,j} = i * n + ((j + floor(j/n) * i) mod n)

where 0 <= i < S * R and 0 <= j < R * n. We can define all the data block indexes of any processor by using the matrix C as in Eq. (2):

    P_j = { c_{i,j} : i = 0, ..., S * R - 1 }        (2)

We now focus on proving the correctness of the method. First, an obvious observation about the algorithm is as follows. When we calculate the processor index of a data block in the next processor block, we just perform a shift operation within its row. Therefore, the number of possible positions for that data block is 1; in other words, each data block index appears exactly once within a processor block.

Theorem 3.1: |P_i ∩ P_j| <= 1 for any two distinct processors i and j, if p = n * R, where p is the number of processors and n is a prime number larger than S * R.

Proof: Let A, B ∈ P_i ∩ P_j. This implies Eqs. (3) and (4):

    A = c_{k,i} = c_{k,j}        (3)
    B = c_{m,i} = c_{m,j}        (4)

where 0 <= k, m <= S * R - 1. Eqs. (3) and (4) imply Eqs. (5) and (6):

    i + floor(i/n) * k  ≡  j + floor(j/n) * k    (mod n)        (5)
    i + floor(i/n) * m  ≡  j + floor(j/n) * m    (mod n)        (6)

When we subtract Eq. (6) from Eq. (5), we get Eq. (7):

    floor(i/n) * (k - m)  ≡  floor(j/n) * (k - m)    (mod n)        (7)

If k = m, then A = B. So we may assume that k != m. Since |k - m| < S * R < n and n is prime, (k - m) is invertible modulo n, and we can derive Eq. (8) from Eq. (7):

    floor(i/n)  ≡  floor(j/n)    (mod n)        (8)

Since both floor(i/n) and floor(j/n) are smaller than R <= n, this gives floor(i/n) = floor(j/n); say floor(i/n) = floor(j/n) = a. This means that P_i and P_j are both in the a-th processor block. But this contradicts the observation mentioned above. Therefore, the intersection between any two processors cannot contain more than one data block.

For all R and S values, the algorithm above can be used, and we can achieve our minimum-intersection goal if we have n * R processors. When the intersection is minimum, in the best case the data loss will be zero for up to (R - 1) * n/R failures; in the worst case, we lose only one data block with R failures. On the other hand, if we distributed the data randomly, we could lose S * R data blocks with R failures in the worst case.
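As a sanity check on this construction, the following small C program builds the matrix C of Eq. (1) for the example of Figure 2 (n = 7, S = 2, R = 3) using the closed form above, and verifies that any two virtual processors share at most one data block, as Theorem 3.1 states. It is an illustrative sketch under the theorem's assumptions (n prime, n > S*R); the mapping of virtual processors to physical nodes is not shown.

    /* Build the distribution matrix of Eq. (1) and check the
     * pairwise-intersection property of Theorem 3.1. */
    #include <stdio.h>

    int main(void)
    {
        enum { n = 7, S = 2, R = 3, ROWS = S * R, COLS = n * R };
        int c[ROWS][COLS];

        /* c[i][j] = i*n + ((j + floor(j/n)*i) mod n) */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                c[i][j] = i * n + ((j + (j / n) * i) % n);

        /* any two virtual processors (columns) should share <= 1 block */
        int max_overlap = 0;
        for (int a = 0; a < COLS; a++)
            for (int b = a + 1; b < COLS; b++) {
                int overlap = 0;
                for (int i = 0; i < ROWS; i++)
                    for (int k = 0; k < ROWS; k++)
                        if (c[i][a] == c[k][b]) overlap++;
                if (overlap > max_overlap) max_overlap = overlap;
            }
        printf("maximum pairwise intersection = %d\n", max_overlap);
        return 0;   /* expected output: 1, per Theorem 3.1 */
    }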
C. Summarization

In a basic approach for iterative algorithms involving reductions, the slaves send their reduction objects after processing all of their data. However, this approach does not perform well when there are failures. Suppose that processors fail after processing 99% of their data. Since the master node does not receive any result from the failed nodes, all of that data has to be processed again. From the description of k-means given earlier, and from Figure 1, we can see that as data elements are processed, the state of the computation is captured in the reduction object. This fact can be used to improve the efficiency of these algorithms in the presence of failures. In particular, the slaves can send their reduction objects periodically within one iteration.

The advantage of this approach is that we do not need to re-process the data if we have a copy of the reduction object for that data. One question that arises is: how frequently should the summaries, or reduction objects, be exchanged? In our approach, each processor has S data blocks that it is responsible for processing and for which it must send summaries; therefore, each processor should send at least S summaries. However, we also divide each data block into smaller data portions, so a slave node sends a summary for each data portion. There are two advantages of dividing data blocks into smaller parts. First, in case of a failure, the amount of data to be re-processed decreases. Second, after a failure, we need to re-assign the data records of the failed nodes to the running ones. If the replication factor is greater than 2, having smaller data portions allows us to distribute the lost data in a more balanced way among the processors that store it as replicas. That is to say, we get better parallelization after a failure when we have more data portions.
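A possible slave-side sketch of this per-portion summary exchange is shown below. The message layout, tag, and the process_portion() helper are illustrative assumptions; the paper only specifies that a summary (the partial reduction object) is sent after each of the S*M data portions and that non-blocking messaging can be used.

    /* Sketch: one summary per data portion, sent with non-blocking
     * messages so the next portion can be processed while the send
     * completes.  K, DIM as in the earlier sketch; helpers assumed. */
    #include <mpi.h>
    #include <stdlib.h>

    #define K    8
    #define DIM  4
    #define SUMMARY_LEN  (K * DIM + K + 1)   /* centroids + counts + delta */
    #define TAG_SUMMARY  100                  /* assumed message tag */

    void process_portion(int portion, double *summary_out);  /* assumed */

    void send_portion_summaries(int S, int M,
                                double summaries[][SUMMARY_LEN],
                                int master_rank, MPI_Comm comm)
    {
        int total = S * M;
        MPI_Request *reqs = malloc(total * sizeof(MPI_Request));

        for (int p = 0; p < total; p++) {
            /* process the p-th data portion and fill its summary
             * (partial centroids, per-cluster counts, partial delta) */
            process_portion(p, summaries[p]);

            MPI_Isend(summaries[p], SUMMARY_LEN, MPI_DOUBLE,
                      master_rank, TAG_SUMMARY, comm, &reqs[p]);
        }
        /* ensure all S*M summaries of this iteration have left before
         * waiting for the summaries of the replica portions */
        MPI_Waitall(total, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }

One separate buffer per portion is kept alive until MPI_Waitall returns, since a non-blocking send buffer must not be reused before the send completes.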

The use of such summaries requires certain changes to the algorithm, even if there are no failures. The modification is similar for k-means and apriori, and is explained only for the k-means algorithm. The modified algorithms for the master node and the slave node are stated as Algorithm 4. The master node waits for all summaries, instead of waiting for the final results from the slaves (Lines 5-8). Once it receives a summary from a slave, it keeps the summary to calculate the new centroids, and also forwards it to the slave nodes that store the corresponding data as a replica (Line 7). The other parts are similar to the original parallel algorithm. On the slave node, there are only two changes. 1) We divide the iteration into S * M parts, where S is the number of data blocks and M is the number of data portions per data block, and send the summaries S * M times. 2) One question that arises is: what is the effect of these summaries on network traffic if we use a large number of nodes? With many nodes, sending many more summaries may become a problem if we use a single master node. We can handle this by using more master nodes to decrease the network traffic; however, this introduces a master node failure problem. In order to handle it, we wait for the summaries of the replicas after sending all of our results (Line 10), so that we do not process the replicas. These summaries of replicas can be used if the master node also fails, together with some slave nodes. We can skip this last summary exchange if the possibility of failure of the master nodes is negligible. Note that the sending and receiving of such summaries can be done with non-blocking messaging. Therefore, we can start waiting for the summaries of the replicas at the beginning of the iteration to save some time. But, of course, we still have to wait after sending all our summaries if we have not yet received all of them.

IV. RECOVERY

Replication and exchange of summaries (or reduction objects) are important steps towards enabling fault-tolerance in algorithms. We now discuss how our algorithms actually recover from failures. Let us assume that we have the system shown in Figure 3. In this example, the replication factor R is 3, the number of data blocks per processor (for primary data) S is 1, the number of data portions per data block is 2, and the number of slave nodes is 7. In the figure, the blocks P1 to P7 represent the slave processors, the dashed boxes inside the processors represent the data blocks, and the numbered boxes inside each block represent the data portions. Each processor has to process its primary data when there is no failure. For example, P1 has to process d_1 and d_2, and will receive the summaries for d_7-d_10, where d_i is the i-th data portion. In particular, we will explain how we recover the system in two different failure scenarios: a single node failure in the middle of an iteration, and a multiple node failure.

In our approach, some bookkeeping is needed to support failure recovery. In particular, the master node keeps track of the availability of each data portion; at the beginning, it is R for all of them. The master node also keeps track of the workload of each slave node. In case of a failure, it tries to balance the workload by using this information.
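The C sketch below illustrates one possible form of this bookkeeping on the master. The structure fields, the helper functions, and the least-loaded reassignment policy are illustrative assumptions; the paper only states that per-portion availability and per-slave workload are tracked and used to balance the load.

    /* Assumed helpers, provided elsewhere by the runtime. */
    int  has_replica(int portion, int slave);     /* does slave store portion?   */
    int  is_primary(int portion, int slave);      /* was portion primary there?  */
    void assign_portion(int slave, int portion);          /* notify a slave      */
    void assign_portion_with_reload(int slave, int portion); /* re-read + assign */

    typedef struct {
        int *availability;   /* live copies per data portion, initialized to R */
        int *workload;       /* portions currently assigned to each slave      */
        int  n_portions, n_slaves;
    } master_state;

    /* Called when slave `failed` is detected as dead. */
    void handle_failure(master_state *m, int failed)
    {
        for (int d = 0; d < m->n_portions; d++) {
            if (!has_replica(d, failed)) continue;
            m->availability[d]--;                   /* one copy lost            */

            if (!is_primary(d, failed)) continue;   /* only a replica was lost  */

            int best = -1;
            if (m->availability[d] > 0) {
                /* least-loaded surviving slave that still holds a replica */
                for (int s = 0; s < m->n_slaves; s++)
                    if (s != failed && has_replica(d, s) &&
                        (best < 0 || m->workload[s] < m->workload[best]))
                        best = s;
                assign_portion(best, d);
            } else {
                /* all copies lost: least-loaded slave re-reads from storage */
                for (int s = 0; s < m->n_slaves; s++)
                    if (s != failed &&
                        (best < 0 || m->workload[s] < m->workload[best]))
                        best = s;
                assign_portion_with_reload(best, d);
            }
            m->workload[best]++;
        }
        m->workload[failed] = 0;
    }

The two failure scenarios discussed next are simply concrete traces of this decrement-and-reassign logic.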
A. Single Slave Node Failure in the Middle of an Iteration

In this scenario, let us assume that P1 fails in the middle of an iteration, after sending the summary for d_1.

Algorithm 4 Fault Tolerant Parallel K-Means Clustering Algorithm
Master Node:
 1: Select k cluster centroids randomly
 2: iteration = 0
 3: repeat
 4:   Broadcast the k cluster centroids
 5:   repeat
 6:     Wait for a summary from the slaves
 7:     Once a summary is received, send it to the slave nodes that store the replica of the related data
 8:   until there are no unreceived summaries
 9:   Calculate the new centroids of the clusters
10:   Calculate the total delta and broadcast it
11:   Increment iteration by 1
12: until iteration >= MAXITER or delta <= Threshold

Slave Node:
 1: iteration = 0
 2: repeat
 3:   Receive the k cluster centroids
 4:   for i = 1 to S * M do
 5:     Assign each data record of the slave node's i-th data portion to the nearest cluster
 6:     Calculate new cluster centroids
 7:     delta = sum_{j=1..k} |newcentroid_j - oldcentroid_j|
 8:     Send delta and the cluster centroids with the number of data records of each cluster
 9:   end for
10:   Wait for the summaries of the replica data portions
11:   Increment iteration by 1
12:   Receive new delta
13: until iteration >= MAXITER or delta <= Threshold

Fig. 3. Sample System with 7 Slave Nodes

The recovery for this case is as follows. As soon as the master node notices that P1 has failed, it decreases the number of available copies of d_1, d_2, d_7, d_8, d_9, and d_10 by 1. Since R = 3, we did not lose any data and do not have to do any additional reads from the storage cluster. Now, P2 and P3 hold the data portions d_1 and d_2, since both store the same data block as a replica. So the master node notifies P2 to process d_1 and P3 to process d_2 after the failure. But P2 does not have to process d_1 in this failure iteration, since the master node had already received its summary before the failure. Overall, the application execution will be finished with 6 slave nodes. Because we have 2 data portions per block, we can balance the remaining workload to some extent, i.e., the work from P1 is shared by P2 and P3. A larger number of data blocks and a higher replication factor will allow us to balance the workload even better.

B. Multiple Slave Node Failures

In this scenario, let us assume that P1, P2, and P3 fail at the beginning of an iteration. For simplicity, let us further assume that the master node notices all three failures at the same time. Then, the master node decreases the availability of d_1 and d_2 by 3, and that of d_3-d_14 by 1. Among the blocks that are no longer available as primary data, we still have d_3-d_6 available on slave nodes that are still operational. So, the master node notifies P4 to process d_4 and d_6, P5 to process d_3, and P6 to process d_5. However, d_1 and d_2 are not available on any of the slave nodes, and therefore need to be read from the storage cluster. Since P7 has the least workload, the master node notifies P7 to read the data from the storage cluster and process d_1 and d_2. Note that the bookkeeping of the master node allows us to balance the workload of the slave nodes.

V. EXPERIMENTS

This section reports results from a number of experiments we conducted to evaluate our approach and to compare it against the approach implemented in Hadoop. More specifically, we had the following goals in our experiments:
- To quantify the overheads associated with our approach and to examine its ability to handle different numbers of failures.
- To understand the significance of the parameters used in our approach, i.e., how the replication factor and the number of summaries exchanged impact the performance both in the absence and in the presence of failures.
- To compare our approach against the MapReduce approach for fault-tolerance, by comparing performance with the Hadoop implementation.

A. Experimental Setup

In our experiments, we used a cluster where each node had an 8-core 2.53 GHz Intel Xeon processor and 12 GB of memory. We implemented the k-means and apriori algorithms in the C programming language with the MPI library. We assume a setup where data can be read at high speeds from within the cluster, but requires higher latencies if it needs to be loaded from outside the cluster. Thus, depending on whether the data is being read to be stored as a replica or as primary data, it can either be read from within the cluster or may need to be read from outside the cluster. The former is very fast, whereas the latter adds substantial slowdowns. Note that in case of failures where we lose all copies of some particular data, we have to read that data again from outside the cluster. The size of the datasets is 4.87 GB and 4.79 GB for k-means and apriori, respectively. The parameters used for describing our results are given in Table I. Note that both of the algorithms we are considering involve multiple iterations, and the failure iteration refers to the iteration in which a failure occurs.

B. Evaluation of Our Approach

We now evaluate the effectiveness of our approach, varying a number of parameters. Except for one experiment, we used 16 nodes and a single core per node, in order to have fully distributed memory for the execution of the algorithms.

Initial Evaluation: The first experiment focused on the effect of replication in our approach. We fixed I, S, M, and Per to certain values and changed the number of failures and the number of replicas. The results are shown in Figure 4.
TABLE I
THE PARAMETERS USED IN EXPERIMENTS

  R   : The number of replicas of the dataset (including the original)
  S   : The number of data blocks per primary data
  M   : The number of data portions per data block
  I   : The failure iteration
  Per : The percentage of the data that had been processed (within the failure iteration) before the failure occurs
  F   : The number of failures
  P   : The number of processors on which the code is executed

The top part of each bar gives the initial data loading time and the bottom part gives the data processing time. For k-means and apriori, the failure occurs after processing half of the data in the 10th (out of 25) and 3rd (out of 6) iterations, respectively.

Fig. 4. Overall Effectiveness of our Approach: Varying Replication Factor and Number of Failures; Top (Darker) Portion of Each Chart Reflects Initial Data Loading Times. (a) K-Means (I=10, Per=50%, S=1, M=2); (b) Apriori (I=3, Per=50%, S=1, M=2)

Several observations can be made from this figure. First, the average overhead of the additional replicas in both algorithms is only 0.2% and 0.5% for R = 2 and R = 3, respectively. However, as seen from the figure, the slowdown in the case of failure(s) is much higher. In addition, the relative overhead of the data loading time becomes smaller as the total number of iterations of the algorithm increases. As shown here, we can handle failures quite effectively with different replication factors. Clearly, when we do not have any additional copies, even one failure requires that additional data be read from the storage cluster, which has a high cost.

Having additional copies of the data alleviates this need, and failures (up to a certain number) can be handled with only modest slowdowns. Because the remaining processing (approximately half of the total work) is being done with fewer nodes, some slowdown is to be expected as the number of failures increases. However, as we have more replicas, we are able to get better parallelization after a failure by distributing the work of the failed node among the remaining running nodes. In our next set of experiments, we show this re-parallelization effect and how the slowdown can be lowered with a higher value of S.

Impact of Frequency of Summary Exchange: Besides the number of replicas, another factor in our scheme is the number of summaries, or the frequency of summary exchanges. The number of summary exchanges depends on the number of data blocks and the number of messages sent per data block. Therefore, this parameter impacts how the data is distributed, and how much work might have to be redone if a failure occurs in the middle of an iteration. We conducted two sets of experiments to evaluate this factor.

Fig. 5. Impact of Frequency of Summary Exchange: Varying Number of Failures. (a) K-Means (I=10, Per=0%, R=3, M=1); (b) Apriori (I=3, Per=0%, R=3, M=1)

In the first set of experiments, we fixed I, R, M, and Per and varied F and S, to see the impact of the size of the data blocks, and of the frequency of summary exchange, with different numbers of failures. When we have a larger S value, we have more summaries to be exchanged and smaller data blocks. All failures occur at the beginning of the iteration in this set of experiments. The results are shown in Figure 5. When there is no failure, the execution time is nearly the same for all S values for both algorithms, which shows that the overhead of summary exchange is negligible. When there is one failure, we get the best results in the S = 4 case and the worst results in the S = 1 case; in other words, increasing the value of S helps to improve performance in case of failures. There are two reasons for the better performance of the S = 4 case. First, as we divide the data into more parts, we are able to get better load balance in case of failures. Second, when there are 3 failures, we need to read one data block from the storage cluster; as we have more data blocks, the block size is smaller, so the I/O operation takes less time when we have a larger number of data blocks. Note that distributing the data blocks to different processors so as to keep the intersection among processors minimal decreases the number of lost data blocks. Without this kind of distribution, the amount of lost data would increase for larger S values, and the advantage of reading smaller blocks would be lost.

Fig. 6. Impact of Frequency of Summary Exchange: Varying Fractions of Work Completed at the Time of Failure. (a) K-Means (I=5, F=1, R=2, S=1); (b) Apriori (I=2, F=1, R=2, S=1)

In the next set of experiments, we evaluate the impact of the frequency of summary exchange as failures occur at different times during an iteration. The goal here is to understand the recovery time with different frequencies of summary exchange. We fixed I, S, R, and F, and varied Per and M.

The execution times of the failure iteration for the different cases are shown in Figure 6. We can see that we get better results as the frequency of summary exchange increases, for both algorithms. The reason is that the failed node is able to send some of the results for its data before the failure; therefore, we do not need to re-process that data in that iteration.

Fig. 7. Effect of Processor Numbers with Different Numbers of Failures. (a) K-Means (I=10, Per=50%, R=3, S=4); (b) Apriori (I=3, Per=50%, R=3, S=4)

Scalability of the Approach: In all of the previous experiments, we used only 16 slave nodes. Now, we show the scalability of the approach by considering larger configurations. Thus, we look at the execution times with different numbers of failures, as the original number of nodes varies. The results are shown in Figure 7. Several observations can be made about these results. First, the execution times without failures scale quite well. Second, our approach can handle failures effectively in all cases. In fact, the relative slowdown with two or three failures is lower when the original number of processors is higher, because the remaining computing capacity is then a larger fraction of the original capacity.

C. Comparison with Hadoop

We compared our approach with MapReduce, which is a popular solution for developing data-intensive applications. The reasons why we chose MapReduce for the comparison are as follows. First, MapReduce is a well-known fault-tolerant framework, besides its popularity. Second, unlike MPI-based solutions for fault-tolerance, which do not allow recovery using a different number of nodes, both MapReduce and our approach allow recovery using the remaining (fewer) nodes in case of failure. Thus, we have compared our solution extensively against Hadoop, the open-source implementation of MapReduce. It should be noted that there is a significant programmability difference between our approach and the one from MapReduce. However, we believe that our approach can also be implemented as part of a high-level solution in the future.

Fig. 8. Total Execution Time when a Single Failure Occurred at Different Percentages

We used Hadoop implementations of the k-means and apriori algorithms, which were used in an earlier study as well [4]. The configurations and parameters used were as follows. We did not use any backup nodes for either of the implementations, i.e., failure recovery occurs with fewer nodes. In order to detect failures faster in Hadoop, we decreased the tasktracker expiry interval from the default value of 10 minutes to 10 seconds. The replication factor is set to 3 in both implementations. Default chunk sizes are used in Hadoop, and the frequency of summary exchange is 12 (S = 4 and M = 3) in our implementation. The execution time of Hadoop had high variance for several failure cases; therefore, we ran each experiment 5 times and calculated the average after eliminating the maximum and minimum values. Because of differences in how Hadoop works and how our algorithm-based approach is implemented, we made some changes in the experiments and the metrics we use. First, because it is well known that Hadoop does not work well for iterative computations [19], we executed both applications for only a single iteration. Second, Hadoop and our approach are implemented with different programming languages (C vs. Java) and file systems, so comparing their execution times directly would not necessarily be fair; therefore, we compare their relative slowdowns in the presence of failures, instead of absolute execution times.

First, we wanted to see how the two systems behave when a failure occurs at different times during an iteration. The results are shown in Figure 8. Our approach has much better results than Hadoop. For apriori, our approach has similar results for all cases, since the execution lasts for only one iteration and the data processing time is only 9% of the overall execution time. In the Hadoop tests, the average slowdown of both algorithms is similar for failures at 0% and 25%, and it becomes higher as the failure percentage increases. The average slowdown of both algorithms in our approach's tests is 5% for all failure points, while it changes from 20% to 25% in the Hadoop tests.

Fig. 9. Total Execution Time that Changes with the Number of Failures. (a) K-Means; (b) Apriori

Next, we wanted to examine how the two systems behave when multiple failures occur. We simulated failures at the beginning of the iteration, since Hadoop's performance for failures at 0% and 25% is similar to, and better than, its performance at the other failure points, as seen from Figure 8. In the 3-failure scenario, we killed 3 nodes that share the same data block in our approach; thus, when there is such a failure, we have to read the data from the storage cluster. Note that with another choice of 3 nodes we might not lose any data, e.g., the failure of the first 3 nodes in Figure 2. That is to say, we killed the 3 nodes that cause the highest slowdown for our approach. The results are shown in Figure 9. Our approach has much better performance than Hadoop for all cases. Hadoop's average slowdown over both algorithms increases from 23% to 33% as the number of failures increases. In our approach, the average slowdown over both algorithms increases from 5% to 28% as the number of failures increases.

VI. RELATED WORK

Fault-tolerance is a widely studied topic. We restrict our discussion to three topics: 1) data distribution, 2) fail-stop failures [38] in the context of parallel and data-intensive applications, and 3) fault tolerance in MapReduce.

Data distribution approaches are mostly discussed for file systems and storage clusters. For example, CRUSH [40] is focused on minimizing data movement in case of addition or removal of disks. FARM [42] divides data into fixed-size blocks and stores each replica of a block on a different disk; it improves recovery time by distributing the lost data to be reconstructed over a number of drives in the disk array. The RUSH [26] algorithms also try to minimize the redistribution of data elements, and they guarantee that no two replicas of a particular object will be stored on the same server. However, none of these studies focused on minimizing the data intersection of servers in order to minimize the data loss in case of multiple failures. To the best of our knowledge, there is no previous study that focused on minimizing this intersection.

A common way of handling fail-stop failures is checkpointing, including synchronous checkpointing [18] and asynchronous checkpointing [37]. In order to decrease the amount of data to be saved, application-level checkpointing (ALC) has been proposed [7], [21]. In ALC, only certain specific data elements that can recover the application are saved, instead of the whole system state. To decrease the overhead by eliminating stable storage, Plank [36] proposed the diskless checkpointing approach, which checkpoints the state of each processor in memory and uses checkpointing processors that encode these in-memory checkpoints to reconstruct the state of the failed processors. Another way of avoiding time-consuming I/O operations is to use algorithm-based recovery (ABR) methods. Here, the recovery can be performed by using information that already exists on the running processors; if the algorithm itself does not contain such redundant information, it can be added by modifying the algorithm. Huang and Abraham [27] demonstrated that miscalculations can be detected and corrected by using a checksum relationship that is preserved in the final computation results.
Chen [11] extended this work to tolerate fail-stop failures in the outer-product version of matrix-matrix multiplication. Davies et al. extended the checksum method to the LU decomposition in the Linpack benchmark [16]. Liu et al. [32] force processors to send redundant information to their neighbors for use in recovery, and apply this technique to Newton's method. Chen [12] has developed an ABR scheme for iterative methods, based on the observation that many iterative algorithms can be recovered without checkpointing if they satisfy certain conditions. In Chen's study, one of the goals is to start the recovery from the iteration in which the failure occurred, without any roll-back. In our study, in the worst case we start the recovery from the failure iteration, and even some fraction of the failure iteration may not need to be reprocessed. In addition, in Chen's study, the failed processor needs to become available again and is used for recovery; in comparison, in our study, we continue the processing with the remaining (fewer) processors. None of the above efforts have considered data-intensive applications of the nature we have considered.

Almost all work on fault-tolerant data processing has been in the context of MapReduce [17]. The fault-tolerance approach in MapReduce is as follows. The work is divided into a number of tasks, as specified by the user. In the case of a slave node failure, any running tasks, as well as the completed map tasks, are re-executed by another slave; the completed map tasks are re-executed because they store their output on local disks. Most of the research on MapReduce has focused on improving its performance (e.g., [14]). The limited amount of work on improving failure recovery in MapReduce includes the work by Zheng [44], who proposes using passive replication on top of re-execution. Costa et al. [2] propose a Byzantine fault [3] tolerant MapReduce that re-executes each task more than once, but tries to minimize the number of these re-executions.

Twister [19] and iMapReduce [43] are both designed to improve the performance of MapReduce for iterative algorithms. For fault tolerance, both use checkpointing and roll back to the last checkpointed iteration in the presence of a failure. Martin et al. [33] introduce a fault-tolerant mechanism for a streaming version of MapReduce, where reducers that can process key/value pairs as the mappers emit them are used instead of stateless reducers that wait for all mappers to finish. They use a combination of uncoordinated checkpointing and in-memory logging for fault tolerance.

VII. CONCLUSION

This paper has developed an algorithm-based fault tolerance approach for handling fail-stop failures in a class of data-intensive algorithms. Our approach combines replication and summarization to decrease the latency of execution in the presence of failures. We use a novel replication and distribution algorithm, which allows us to distribute the data in a way that minimizes the maximum data intersection between processors. Our approach allows recovery using a smaller number of nodes, and achieves good load balance among the remaining nodes. The main observations from our detailed evaluation are as follows. First, the overhead of our approach when there are no failures is negligible. We show how different numbers of failures, and failures at different points of processing, can be gracefully handled by our approach. Finally, in comparing our approach with the MapReduce approach (as implemented in Hadoop), we show that our approach performs better both in the absence and in the presence of failures.

REFERENCES

[1] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.
[2] Pedro Costa, Marcelo Pasin, Alysson N. Bessani, and Miguel Correia. Byzantine fault-tolerant MapReduce: Faults are not just crashes. In 3rd IEEE International Conference on Cloud Computing Technology and Science, 2011.
[3] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004.
[4] T. Bicer, Wei Jiang, and G. Agrawal. Supporting fault tolerance in a data-intensive computing middleware. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2010), pages 1-12, April 2010.
[5] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Supercomputing, ACM/IEEE 2002 Conference, November 2002.
[6] George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. Algorithm-based fault tolerance applied to high performance computing. Journal of Parallel and Distributed Computing, 69(4):410-416, April 2009.
[7] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. C3: A system for automating application-level checkpointing of MPI programs. In 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), 2003.
[8] Randal E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University, 2007.
[9] Franck Cappello, Al Geist, Bill Gropp, Laxmikant V. Kalé, Bill Kramer, and Marc Snir. Toward exascale resilience. International Journal of High Performance Computing Applications, 23(4), 2009.
[10] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and practice. In Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, 1996.
[11] Zizhong Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments. In IPDPS, pages 1-8, 2008.
[12] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing (HPDC 2011), San Jose, CA, USA, June 2011, pages 73-84.
[13] Zizhong Chen and J. Dongarra. A scalable checkpoint encoding algorithm for diskless checkpointing. In IEEE High Assurance Systems Engineering Symposium (HASE 2008), pages 71-79, December 2008.
[14] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI 2010), Berkeley, CA, USA, 2010. USENIX Association.
[15] Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC 2006). ACM, 2006.
[16] Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High performance Linpack benchmark: A fault tolerant implementation without checkpointing. In Proceedings of the International Conference on Supercomputing (ICS 2011). ACM, 2011.
[17] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[18] G. Deconinck and R. Lauwereins. User-triggered checkpointing: System-independent and scalable application recovery. In Proceedings of the Second IEEE Symposium on Computers and Communications, July 1997.
[19] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010), New York, NY, USA, 2010. ACM.
[20] G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated application-level checkpointing of MPI programs. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2003), pages 84-94, October 2003.
[21] G. Bronevetsky, D. Marques, M. Schulz, P. Szwed, and K. Pingali. Application-level checkpointing for shared memory programs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2004), October 2004.
[22] Leonardo Arturo Bautista Gomez, Naoya Maruyama, Franck Cappello, and Satoshi Matsuoka. Distributed diskless checkpoint for large scale systems. In Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID 2010). IEEE Computer Society, 2010.
[23] Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, and Franck Cappello. Uncoordinated checkpointing without domino effect for send-deterministic MPI applications. In 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011), Anchorage, Alaska, USA, May 2011.
[24] E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3), May/June 2000.
[25] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[26] R. J. Honicky and Ethan L. Miller. Replication under scalable hashing: A family of algorithms for scalable decentralized data distribution. April 2004.
[27] Kuang-Hua Huang and Jacob A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6), 1984.
[28] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1-8, March 2007.
[29] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[30] Ruoming Jin and Gagan Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the First SIAM Conference on Data Mining, 2001.
[31] Ruoming Jin and Gagan Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. In Proceedings of the Second SIAM Conference on Data Mining, 2002.
[32] Hui Liu, T. Davies, Chong Ding, C. Karlsson, and Zizhong Chen. Algorithm-based recovery for Newton's method without checkpointing. In IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW 2011), May 2011.
[33] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer, and A. Brito. Low-overhead fault tolerance for high-throughput data processing systems. In 31st International Conference on Distributed Computing Systems (ICDCS 2011), June 2011.
[34] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pages 1-11, November 2010.
[35] S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 1998.
[36] J. S. Plank, Youngbae Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), June 1995.
[37] G. C. Richard III and M. Singhal. Using logging and asynchronous checkpointing to implement recoverable distributed shared memory. In Proceedings of the 12th Symposium on Reliable Distributed Systems, pages 58-67, October 1993.
[38] Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3), 1983.
[39] David B. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, October-December 1999.
[40] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC 2006), New York, NY, USA, 2006. ACM.
[41] Tom White. Hadoop: The Definitive Guide. O'Reilly, first edition, June 2009.
[42] Qin Xin, Ethan L. Miller, and Thomas J. E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC 2004), Washington, DC, USA, 2004. IEEE Computer Society.
[43] Yanfeng Zhang, Qinxin Gao, Lixin Gao, and Cuirong Wang. iMapReduce: A distributed computing framework for iterative computation. In IPDPS Workshops, 2011.
[44] Qin Zheng. Improving MapReduce fault tolerance in the cloud. In IPDPS Workshops, pages 1-6, 2010.
