Investigation of Techniques to Model and Reduce Latencies in Partial Quorum Systems

Wesley Chow and Ning Tan

December 17, 2012

Abstract

In this project, we investigate the possibility of using redundant requests and redundant replies to reduce read latency in Apache Cassandra. Our techniques are applicable to all distributed data storage systems. We test varying the threshold at which we send duplicate requests (fast-retries), both statically (in milliseconds) and dynamically (by percentile). We also experiment with sending duplicate replies with some small probability. We show that in most variations, fast-retry and duplicate-reply performance comes close to the baseline in the average case, and performs better than the baseline in the long tail. To give a systematic way to dynamically determine the optimal fast-retry threshold, we apply a graphical model to predict network latency in the system. We show that by capturing the correlation between nodes and across time, our model is more accurate than the previously proposed PBS model. We implement our redundant-request and redundant-reply techniques in Cassandra and stress test read performance using Berkeley's Psi Millennium cluster.

1 Introduction

1.1 Distributed Data Stores

Distributed data stores have attracted a lot of interest in recent years, mostly due to the scalability, availability, and efficiency they provide [20]. For example, Google developed BigTable [21] as the storage system for services including Gmail, Google Maps, Google Reader, YouTube, and Google Earth. Amazon developed Dynamo [23] as a primary-key-accessed data store for services such as its shopping cart. These systems typically replicate data across different machines and data centers [19, 20, 21, 23]. This allows the system to achieve high availability and partition tolerance when machines fail, as the data will still be available from other machines and other data centers [25]. It also provides a method of improving system performance: instead of waiting for a response from only one machine, the system sends messages to all replicas and waits for a fraction of them to respond, thus improving latency.

However, there is a price to pay for this beneficial latency decrease. In particular, the strong consistency constraint is replaced with eventual consistency in these systems, meaning there is no guarantee of returning the most recent version of the data [32]. The only assumption is that over a sufficiently long period of time and in the absence of writes, all replicas will eventually become consistent [19, 32]. This latency-consistency trade-off has had important implications in system design [19]. For systems with strict latency requirements, consistency is usually sacrificed. For instance, Amazon reported that 100ms of extra latency would result in a 1% loss in sales [28]. Therefore, they need to ensure that latency is low, even at the long tail (say, the 99.99th percentile). Google also reported that 500ms of extra latency would decrease their traffic by 20% [29], which results in a severe penalty to their revenue. On the other hand, this comes with a cost: contacting fewer replicas generally weakens the consistency guarantees on queried data. Therefore, in order to achieve optimal performance, one needs a better understanding of the latency vs. consistency trade-off.

1.2 Partial Quorums

Most distributed data stores other than Dynamo are open-sourced and thus highly customizable.
Users typically have the ability to choose the replication factor (N), the read quorum (R), and the write quorum (W). This gives them the ability to choose whatever consistency they desire. If R + W > N, then there is a strict quorum and any data read will have strong consistency. However, if R + W ≤ N, then there is only a partial quorum with eventual-consistency guarantees. Partial quorums provide latency benefits over strict quorums at the cost of consistency.
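
As a minimal illustration of the quorum condition above (our own sketch, not code from this project), the following Python function checks whether an (N, R, W) configuration forms a strict quorum; any read quorum of size R must overlap any write quorum of size W exactly when R + W > N. The example configurations are hypothetical.

def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r must overlap every write quorum of size w."""
    return r + w > n

if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    for n, r, w in [(3, 2, 2), (3, 1, 1), (4, 3, 2), (4, 2, 2)]:
        kind = "strict" if is_strict_quorum(n, r, w) else "partial"
        print(f"N={n}, R={r}, W={w}: {kind} quorum")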

Another benefit of the flexibility of setting R and W is that users can shift latencies around based on the application's needs. If the application is very read-heavy, for example, then a lower R and a higher W will provide better consistency and a lower aggregate latency for the same number of replicas.

1.3 Read Paths

Distributed data stores such as Dynamo, Cassandra [27], LinkedIn's Voldemort [11], and Basho's Riak [3] all process reads differently. When it comes to achieving read quorums, these systems vary between two extremes. Dynamo and Riak will always send read requests to all N replicas and wait for only R responses [15, 23]. Voldemort will only send to R replicas and wait for all of them to respond [18]. Cassandra, on the other hand, will send only R requests 90% of the time and N requests 10% of the time for consistency purposes (read-repair) [6, 7]. Sending more requests at the start is similar to what we are trying to achieve; both approaches increase the load on the system while attempting to decrease latency.

1.4 Probabilistic Graphical Model

The graphical model, also known as a Markov random field, is a well-studied model in statistics and machine learning. It brings together graph theory and probability theory in a powerful formalism [34]. It has proven to be very useful in various fields including bioinformatics, speech processing, image processing, and control theory [34]. For modeling network traffic, graphical models are very handy, as they encode the network topology naturally. One thing that separates graphical modeling from previous models like PBS (Probabilistically Bounded Staleness) [20] is that graphical modeling captures the correlation between the nodes. While the usual independent and identically distributed (i.i.d.) assumption makes analysis much easier, we remark that the correlations between nodes and across time usually play a crucial role in the performance of the system, especially when it comes to long-tail performance.

1.5 Previous Work

Sending redundant requests has proven to be a very successful technique in Google's BigTable, especially when it comes to long-tail performance. Sending out a redundant request within 10ms if the initial request had not completed improved BigTable's 99.9th percentile latency from 994ms to 50ms [22]. There has been intensive study of applying graphical models to traffic modeling, prediction, and classification [26, 30, 31, 36]. For example, [26] applies graphical modeling to model traffic in the Greater Seattle area, and [31] applies graphical modeling to model traffic in London. In terms of network traffic, [30] uses graphical modeling for semi-supervised traffic classification.

1.6 Contributions of this Project

We make the following contributions in this paper:

- We implement the fast-retry and duplicate-reply methods in Cassandra.
- We stress test many variations of fast-retry and duplicate reply on an eight-node Psi Millennium cluster.
- We develop a new way of modeling network latency based on an undirected graphical model. This model captures the correlations between replicas, and therefore models and predicts the network traffic more accurately.

The rest of the paper is organized as follows. Section 2 gives some background on graphical models. Section 3 describes the fast-retry and duplicate-reply techniques and their potential benefits. In Section 4, we describe a graphical model that can be used to predict network traffic. Section 5 discusses learning parameters and Section 6 discusses inference algorithms in the graphical model. Section 7 discusses our implementation of fast-retry and duplicate replies in Cassandra. Section 8 describes our evaluation methods and the variations we tried. Section 9 discusses the results from the evaluation. Finally, Sections 10 and 11 contain our conclusions and possible future directions.
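
To make the latency implications of the read strategies in Section 1.3 concrete, the following sketch (our own illustration, using an assumed heavy-tailed per-replica latency distribution) compares "send to R and wait for all R" with "send to all N and wait for the fastest R": the former pays the maximum of R latency samples, while the latter pays the R-th smallest of N samples, which trims the tail.

import random

def simulate(n=4, r=3, trials=100_000, seed=0):
    """Per-trial latency for 'send to R, wait for all R' vs 'send to all N, wait for R'."""
    rng = random.Random(seed)
    send_r, send_n = [], []
    for _ in range(trials):
        lat = [rng.lognormvariate(1.0, 0.8) for _ in range(n)]  # per-replica latency, ms
        send_r.append(max(lat[:r]))        # contact R replicas, wait for all of them
        send_n.append(sorted(lat)[r - 1])  # contact all N, wait for the fastest R
    return send_r, send_n

def percentile(xs, q):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]

if __name__ == "__main__":
    a, b = simulate()
    for q in (0.50, 0.99, 0.999):
        print(f"p{100 * q:g}: send-to-R {percentile(a, q):6.2f} ms   "
              f"send-to-N {percentile(b, q):6.2f} ms")

The gap between the two strategies is small at the median and grows in the tail, which is the same effect redundant requests exploit.
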
2 Preliminaries

2.1 Graphical Model

In a probabilistic undirected graphical model, we have a graph G = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges between the vertices. In the model, each vertex v ∈ V corresponds to a random variable X_v, which takes values in a domain D. For each maximal clique C in the graph, there is an associated potential function φ_C(x_C) : D^|C| → R_+, and the probability of a configuration is given by the following expression:

$$P(X = x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \phi_C(x_C) \qquad (2.1)$$

where the product ranges over the set of maximal cliques and Z is the normalization factor, also known as the partition function. While this definition can seem complicated and non-intuitive, the Hammersley-Clifford theorem shows that it is equivalent to a simple conditional-independence condition:

Theorem 2.1 (Hammersley-Clifford). The probability distribution on V can be written as in (2.1) if and only if for any sets S_1, S_2, S_3 ⊆ V such that S_3 separates S_1 and S_2, the variables X_{S_1} and X_{S_2} are independent conditioned on X_{S_3}.
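
As a small illustration of equation (2.1) (our own toy example, not part of this project), the following Python script brute-forces the partition function for a three-variable chain A - B - C with binary variables, whose maximal cliques are the two edges, and numerically checks the Hammersley-Clifford property that A and C are independent given B.

from itertools import product

def phi(u, v):
    """Edge potential: favor agreement between neighboring variables."""
    return 2.0 if u == v else 1.0

def unnormalized(a, b, c):
    # Maximal cliques of the chain A - B - C are the edges {A, B} and {B, C}.
    return phi(a, b) * phi(b, c)

Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

# Hammersley-Clifford check: P(a, c | b) == P(a | b) * P(c | b) for every b.
for b in (0, 1):
    pb = sum(p(a, b, c) for a, c in product((0, 1), repeat=2))
    for a, c in product((0, 1), repeat=2):
        joint = p(a, b, c) / pb
        factored = (sum(p(a, b, cc) for cc in (0, 1)) / pb) * \
                   (sum(p(aa, b, c) for aa in (0, 1)) / pb)
        assert abs(joint - factored) < 1e-12

print("partition function Z =", Z, "- A and C are conditionally independent given B")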

3 Fast-retry and Duplicate reply

The idea of sending duplicate requests is not new. However, to the best of our knowledge, there has not been any work on finding the optimal threshold at which to send a duplicate request. Send a request too quickly, and the response might already be on the way, so the extra load incurred is wasted. Wait too long, however, and the ability to reduce latencies is severely diminished. The ideal case would be to send one or more duplicate requests as soon as possible without inducing any latency penalty on the system. On the replica side, receiving a duplicate request incurs a disk operation in order to retrieve the value, and the replica has no way of knowing whether the duplicate it just received was sent before its original response arrived or because that response was never received. The disk operation cost might become irrelevant as more companies move their systems to solid state drives; Amazon built Dynamo using SSDs [33], but this is not the common case.

Duplicate replies are in a sense a pre-emptive move to counter a lossy network. If the network is constantly dropping some small percentage of packets, then in theory sending duplicate replies could help. The question is how to determine when to send a duplicate reply. We use the simplest method of sending a duplicate reply some percentage of the time. As there has been no prior work in this area, we set out to test this hypothesis and determine its usefulness.

4 Modeling

In this section we discuss the details of modeling and predicting network traffic using graphical modeling. We make the following assumptions about the network traffic distribution:

Condition 1. We assume that the connections between nodes satisfy a conditional-independence condition: given two sets of edges in the network topology that are separated by a third set, the traffic on the two sets is independent conditioned on the traffic on the third set. This is a reasonable assumption because all traffic between the two sets must go through the third set; once the third set is fixed, the two sets cannot interfere with each other, and therefore they should behave independently.

Condition 2. We assume that the traffic on the graph is affected only by the traffic in the K previous time steps. That is, we assume that traffic more than K steps in the past has negligible influence on current traffic.

Equipped with these assumptions, we are now ready to describe our graphical model. The graph in our model encodes both time and network topology. Specifically, each vertex in the graph is a tuple (V_i, V_j, T); that is, the vertex is a random variable representing the traffic condition between node i and node j at time T. Given two vertices (V_i, V_j, T_1) and (V_k, V_l, T_2), there is an edge between them if and only if either 1) T_1 = T_2 and (V_i, V_j) shares a node with (V_k, V_l), or 2) |T_2 - T_1| = 1 and (V_i, V_j) = (V_k, V_l).
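
The vertex and edge construction just described can be written down directly. The sketch below is our own illustration (variable and function names are ours), building the graph for a small cluster over a few time steps using exactly the two edge rules above.

from itertools import combinations

def build_traffic_graph(num_nodes: int, num_steps: int):
    """Vertices are (i, j, t) triples for each replica pair (i, j) and time step t."""
    pairs = list(combinations(range(num_nodes), 2))
    vertices = [(i, j, t) for (i, j) in pairs for t in range(num_steps)]
    edges = set()
    for (i, j, t1) in vertices:
        for (k, l, t2) in vertices:
            if (i, j, t1) >= (k, l, t2):
                continue  # consider each unordered pair of vertices once
            same_time_shared_node = (t1 == t2) and len({i, j} & {k, l}) > 0
            same_pair_adjacent_time = abs(t2 - t1) == 1 and (i, j) == (k, l)
            if same_time_shared_node or same_pair_adjacent_time:
                edges.add(((i, j, t1), (k, l, t2)))
    return vertices, edges

if __name__ == "__main__":
    v, e = build_traffic_graph(num_nodes=4, num_steps=3)
    print(len(v), "vertices,", len(e), "edges")
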
5 Learning Parameters in the Graphical Model

Given historical data, we use it as training data to build our graphical model; specifically, we learn the potential functions. Several algorithms have been introduced in this context. Before getting to the choice of learning algorithm, we first have to specify the metric by which we evaluate the quality of a set of parameters. In this work, we follow the classical framework of likelihood maximization, in which we try to find the set of parameters that maximizes the probability of observing the data. Maximum likelihood is an extremely well-studied framework in statistics, and several methods have been introduced to find the optimal parameters. In this work, we choose the EM algorithm introduced by Dempster, Laird, and Rubin [24], for the following reasons:

- The convergence of the EM algorithm is very well understood. Specifically, Wu showed that the EM algorithm converges under reasonable assumptions [37].
- The EM algorithm deals well with latent variables. This is very important in our application because the historical data we have may not be complete, due to server failures or packet drops.
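
To illustrate the E-step/M-step alternation, here is a generic EM sketch on a simple latent-variable model (a two-component one-dimensional Gaussian mixture, where the latent variable is which component generated each observation). This is our own illustration only; it is not the traffic model learned in this project, and the synthetic data is made up.

import math
import random
import statistics

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_component_mixture(data, iters=50):
    # Crude initialization (assumed, for illustration only).
    mu = [min(data), max(data)]
    var = [statistics.pvariance(data)] * 2
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixture weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return pi, mu, var

if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: a fast mode plus a slow "tail" mode (values in ms, made up).
    data = [rng.gauss(5, 1) for _ in range(300)] + [rng.gauss(40, 8) for _ in range(60)]
    weights, means, variances = em_two_component_mixture(data)
    print("weights:", weights, "means:", means, "variances:", variances)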

6 Inference Algorithms in our Graphical Model

There are several ways to do inference in graphical models. For example, one could use the sum-product algorithm, which is known to compute all the marginals in linear time on trees. However, on general graphs, no polynomial-time algorithm is known to compute the marginals exactly; any algorithm has to give up either efficiency or accuracy. Indeed, algorithms in both regimes have been proposed. For example, the junction tree algorithm can be used for exact inference, but its running time has an exponential dependence on the size of the cliques in the graph. On the other hand, one can efficiently compute the Bethe approximation, which gives only an approximation of the marginals. In this work, we use the junction tree algorithm, reasoning that the number of nodes we are dealing with is relatively small, so the junction tree algorithm can still satisfy our time-efficiency needs. However, once the number of nodes grows, we will have to look into more efficient approximation algorithms.

7 Implementation

Apache Cassandra is an open-source distributed NoSQL data store used by many companies including Netflix [4], Twitter [5], and Reddit [8]. It uses the distributed system model from Amazon's Dynamo and the data model from Google's BigTable. It takes the idea of dividing work into stages with separate per-stage thread pools from SEDA [6, 35]. We focus only on the read path, since that is where our optimizations come into play. When a read is initiated, the following steps occur [6]:

1. The StorageProxy queries for the nodes (endpoints) that are responsible for replicas of the specified key.
2. The currently alive endpoints are sorted by proximity. In our case, this is simply the round-trip latency, which is tracked by the LatencyTracker and will also be used by our fast-retries later.
3. The closest endpoint is then sent a request for the actual data. This is handled by the ReadCallback class, which times out after a user-specified timeout.
4. The remaining R - 1 nodes are sent a digest request. Digests incur the same CPU and disk I/O cost as a regular read, but lessen the load on the network.
5. If there are no digest mismatches, the data is returned. Otherwise, read-repair occurs and then the results are read again.
6. As this is happening, the remaining replicas may probabilistically be sent messages to compute digests of their responses (read-repair) for increased consistency.

We modify Cassandra 1.1.6, the latest stable version at the time [27]. In our implementation, we ignore read-repair, as we are not concerned with the consistency or staleness of the data we get back [20]. In terms of lines of code, fast-retry and duplicate replies are extremely simple. Depending on our configuration, the ReadCallback timeout is shortened from the default (two seconds) to our fast-retry timeout. Once the ReadCallback times out the first time, we send another request to the same endpoint and wait the remaining amount of time; the user-set remote procedure call timeout is never exceeded, even with fast-retry. To use dynamic fast-retry thresholds, we need access to running percentile data from past reads. Luckily, the LatencyTracker keeps a count of total operations and a rough latency histogram. The histogram consists of ninety-two buckets spanning from one microsecond to thirty-six seconds, each bucket 1.2 times larger than the last, giving us an inexact latency percentile measurement. However, this approximation is acceptable for our purposes. Implementation of duplicate replies was very simple: with some probability, the ReadVerbHandler sends two replies instead of one. We chose to send the duplicate request to the same node instead of the N - R nodes which were still unsolicited, mostly because in our tests the replication factor N was 4, the read quorum was 3, and the nodes never went down. Had our test setup been more complex, perhaps with a larger N or with some system churn, sending to the N - R nodes instead of the same R nodes on fast-retry would have been worth implementing.
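
The following is an illustrative sketch (our Python rendering, not Cassandra's Java code) of how a dynamic fast-retry threshold can be read off a geometric-bucket latency histogram in the spirit of the LatencyTracker's, and how much of the RPC timeout remains for the duplicate request. The bucket counts in the example are hypothetical.

BUCKET_GROWTH = 1.2
NUM_BUCKETS = 92

def bucket_upper_bounds_us():
    """Upper bounds (in microseconds) of a geometric histogram starting at 1 us."""
    bounds, b = [], 1.0
    for _ in range(NUM_BUCKETS):
        bounds.append(b)
        b *= BUCKET_GROWTH
    return bounds

def percentile_from_histogram(counts, q):
    """Approximate the q-th quantile (0 < q < 1) from per-bucket counts, in microseconds."""
    bounds = bucket_upper_bounds_us()
    target = q * sum(counts)
    seen = 0
    for upper, c in zip(bounds, counts):
        seen += c
        if seen >= target:
            return upper
    return bounds[-1]

def fast_retry_deadlines(counts, q=0.97, min_wait_ms=5.0, rpc_timeout_ms=2000.0):
    """How long to wait before duplicating the request, and how long remains afterwards."""
    threshold_ms = percentile_from_histogram(counts, q) / 1000.0
    threshold_ms = min(max(threshold_ms, min_wait_ms), rpc_timeout_ms)
    return threshold_ms, rpc_timeout_ms - threshold_ms

if __name__ == "__main__":
    # Hypothetical bucket counts: most reads land around a few ms, with a slow tail.
    counts = [0] * NUM_BUCKETS
    counts[40], counts[55], counts[65] = 9000, 900, 100
    first, remaining = fast_retry_deadlines(counts, q=0.97)
    print(f"wait {first:.1f} ms before the duplicate request; {remaining:.1f} ms remain")
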
8 Evaluation

8.1 Setup

We tested our algorithms on the Psi Millennium cluster at Berkeley [17]. The cluster consisted of eight nodes, each with 32 8-core Intel(R) Xeon(R) 2.60GHz chips. While the nodes had large SSDs, we chose to use NFS to introduce a larger I/O bottleneck. This was mostly done to simulate what would happen in a more heavily loaded system without having to go through the trouble of running an I/O-intensive task on each of the nodes. However, the nodes each had 128GB of RAM, which significantly limited the number of actual disk operations performed, especially as we only inserted and read about 8GB of data. Unfortunately, this cluster is most likely not representative of what companies are running their distributed data storage instances on.

When the cluster initializes, each node is manually assigned a token in order to ensure completely even partitioning of the key-space. This was done in order to remove partitioning randomness and ensure that each run of our tests had the exact same environment. To prepare our tests, we used the built-in Cassandra Stress Tool [2]. The setup involves first placing five million keys that are each replicated to four nodes. Since our cluster is all in one data center, we used SimpleStrategy, which simply places replicas on nodes clockwise on the key ring without considering rack or data center location. As our nodes were all in the same data center, and their rack configuration unknown, SimpleStrategy was the best option. All our tests attempt to read the five million keys we just inserted, with a read quorum of 3. The test script was run on one node, but the script initiates read requests to all nodes in our cluster. This shares the coordination workload among the entire cluster instead of placing it on just one node.

8.2 Variations

We started our tests simply and added variations as we went. First, we tried sending a fast-retry at various static thresholds. Fig. 1 shows the performance between the 90th and 99th percentiles and Fig. 2 shows the performance between the 99th and 99.9th percentiles. We show only three static thresholds to avoid clutter. Fig. 3 shows the same experiment, except with an artificial CPU load added to every node in the cluster. This was done by running a script that created ten threads that would endlessly refine an estimate of π using Leibniz's formula [10] and periodically print the output.

After this experiment, we changed our algorithm and made our fast-retries dynamic. Instead of sending a duplicate request after waiting some static amount of time, we send another request at the current 95th, 97th, 99th, and 99.9th percentiles. Fig. 4 shows the behavior of percentile-based dynamic retries from the 0th to the 99th percentile, and Fig. 5 shows the 99th to 99.9th percentiles. In the above experiment, our fast-retries would be sent off at whatever the current percentile setting dictated, with a minimum of 1ms, up to the 2-second timeout. However, as it is unlikely that fast-retries would make much of a difference at the low end of that scale, we introduced a minimum wait time before a fast-retry is sent. We then re-ran the above percentiles with a 5ms minimum wait time. This is shown in Fig. 6.

Since our nodes sit in the same data center, network conditions such as packet drops are unlikely to occur. If a distributed data store is spread across a wide-area network, however, they are much more likely. Therefore, we used the netem [12] tool to introduce artificial packet drops into the network. We started with a 3% packet drop rate with a 25% correlation. This means that 3% of packets would be dropped, with the likelihood of each successive packet being dropped depending 25% on whether the last one was dropped. Using the correlation value allows us to simulate packet burst losses, a common packet loss pattern in real-world networks. We varied the percentile at which we sent a fast-retry, keeping the same 5ms minimum wait time as before. The results of the 3% packet loss trial are shown in Fig. 7; 5% packet loss with 25% correlation is shown in Fig. 8.

The last variation we tried was to test the effects of randomly sending a duplicate reply. The system maintained a 5% packet loss with 25% correlation, and we sent a duplicate reply with 3%, 5%, 7%, and 10% probability. This variation always sent a duplicate request at the 97th percentile threshold. The results of this trial are shown in Fig. 9. Note that different variations should not be compared against each other: certain variations logged a lot more data than others, leading to different performance baselines.
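
To make the "25% correlation" concrete, here is a simple simulation of correlated (bursty) packet loss matching the description above: with probability corr, a packet repeats the fate of the previous packet; otherwise its fate is drawn fresh with probability loss. This keeps the average drop rate at loss while producing burst losses. It is our own sketch of the configured behavior; netem's internal algorithm differs in its details.

import random

def correlated_drops(n_packets, loss=0.05, corr=0.25, seed=0):
    rng = random.Random(seed)
    drops = []
    prev = rng.random() < loss
    for _ in range(n_packets):
        if rng.random() < corr:
            dropped = prev            # burst: copy the previous packet's fate
        else:
            dropped = rng.random() < loss
        drops.append(dropped)
        prev = dropped
    return drops

if __name__ == "__main__":
    d = correlated_drops(1_000_000)
    rate = sum(d) / len(d)
    back_to_back = sum(1 for i in range(1, len(d)) if d[i] and d[i - 1])
    print(f"observed drop rate: {rate:.3%}; back-to-back drops: {back_to_back}")
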
9 Results

9.1 Fast-retry and Duplicate reply

Static fast-retry did not provide any latency benefits over regular Cassandra until approximately the 99.9th percentile. Below this threshold, static fast-retry performs on par with or slightly worse than the baseline. This held true whether or not we applied a heavy CPU load to nodes in the system. The lack of difference between the light and heavy CPU-load cases can be explained by the hardware our tests ran on: having 32 8-core processors per node makes it very difficult to make the performance CPU-bound. In most real-world applications, the bottleneck is in disk I/O anyway, not the CPU.

Dynamic fast-retry, on the other hand, performed well at every percentile. Between the 0th and 99th percentiles, dynamic fast-retry performed on par with the baseline. In fact, we would be surprised if dynamic fast-retry affected the performance at lower percentiles, as no duplicate work is done unless a request's duration reaches the specified percentile. Between the 99th and 99.9th percentiles, dynamic fast-retry performs just slightly better than the baseline. This is true for both the 1ms and 5ms minimum wait times. As our system is very powerful and the network is very reliable, the minimal performance gains here are not discouraging.

Introducing packet loss into our cluster allows fast-retry to demonstrate its potential latency improvements more easily. After introducing a 3% packet drop into the network, we begin to see a more noticeable difference between the baseline and our dynamic fast-retry implementation. Increase that rate to 5%, and fast-retry suddenly performs much better than the baseline. Using a 95th-percentile dynamic threshold results in a 67ms 99.9th-percentile latency, while the baseline achieves 117ms, almost double the fast-retry result. We ran the duplicate reply tests with a 97th-percentile fast-retry and the same 5% packet loss settings as before.

Unfortunately, sending duplicate replies does not seem to produce a noticeable performance gain over just having 97th-percentile fast-retry.

9.2 Graphical Model

In order to verify the accuracy of our graphical model, we use the baseline data as training data, then use our model to predict the performance of the fast-retry algorithm. We also compare our performance with that of the PBS model's predictions. Things are slightly subtle in our case, as the LatencyTracker only records the minimum latency among all the replicas, and we do not have any information about the remaining latencies. We handle this by assigning a latent variable to each time step, indicating the id of the replica with the minimum latency. As mentioned before, the EM algorithm can deal with latent variables, so we are still able to compute the maximum-likelihood distribution. However, this approach comes with a penalty: the lack of information severely limits the power of the model. In particular, we will not be able to capture any correlation between replicas, since the data does not distinguish them. Indeed, in our final prediction, the replicas behave identically and independently. However, we are still able to capture the correlation between latencies and time, which plays a crucial role in the result.

9.3 Modeling Evaluation

We compare the performance of our modeling algorithm against the PBS prediction algorithm. We compare them in the simplest configuration: no heavy background workload and a fixed-threshold fast-retry. This is because neither model takes background workload or packet drops into consideration, and both would therefore yield poor predictions in the presence of these factors. However, we remark that both models can easily be modified to take these factors into account. As shown in Fig. 10 and Fig. 11, we compare the predictions in two settings: 7ms fast-retry and 15ms fast-retry. We can see that our graphical model yields more accurate predictions in both settings. The main reason is that the PBS prediction tends to be over-optimistic about latency. For example, under PBS it is nearly impossible for two consecutive requests to both end up in the long tail (above the 99th-percentile latency), while this scenario actually happens frequently in practice. The graphical model avoids these kinds of inaccuracies by exploiting the correlation between two consecutive requests, therefore yielding a more accurate prediction.

10 Conclusions

The static fast-retry technique should only be used very carefully by the owner of a system. From the 0th to the 99th percentile, fast-retry latencies are on par with baseline Cassandra. However, from the 99th percentile to around the 99.8th percentile, the baseline does better than fast-retry in most of the settings we tried. Only after this point does fast-retry improve upon the baseline performance. Instead, users of distributed storage systems should use percentile-based fast-retry. Percentile-based fast-retry improves performance over the baseline implementation across the board, in every percentile, for every configuration we tried. The latency improvements are especially noticeable when the network consistently drops some percentage of packets (5% in our tests). As for modeling network latencies, we show in this work that our graphical model provides more accurate modeling and prediction compared to PBS. However, we did not implement our prediction algorithm in Cassandra.
We hope this can be done in the future, so that the fast-retry threshold can be adjusted dynamically based on the model's predictions. Sending duplicate replies, at least in our implementation, does not appear to be all that beneficial. The results are fairly close to sending only one reply, but one set of runs did not generate enough data for us to draw a confident conclusion as to the efficacy of this method. That said, our guess would be that this method (as implemented now) is ineffective and should not be used.

11 Future Work

Our results were all gathered from a very powerful cluster living in a single data center. While the techniques we tried were effective, it is hard for the effects to be noticeable at such a small scale using such powerful computational resources. Using a testbed that is closer to real-world applications could potentially show that our techniques have a much larger effect. Also, while each trial was five million operations, each trial was only run once. Running the trials more times would reduce noise that may have been introduced through other environmental factors. One factor that comes to mind is whether other nodes in the Psi Millennium cluster ran a bandwidth-intensive application that introduced cross-traffic into our system and skewed our results.

In our fast-retry implementation, the coordinator sends the duplicate request to the same node it sent the original to. In a variation on fast-retry, the coordinator could instead send a request to a node that has not been solicited yet, provided the read quorum R is lower than the replication factor N. By sending to the remaining N - R nodes, we could potentially decrease latencies further by contacting different nodes that might not be down; perhaps the reason the initial node did not reply is that it suddenly became inaccessible.

This technique would most likely show the best results over the baseline in cases of system churn.

As the duplicate-reply method did not yield very positive or very negative results, the technique warrants further study. The ideal behavior would be for some oracle to know that a packet will be dropped and pre-emptively send a duplicate. As this is not possible, one could attempt to achieve the next best thing. Since packet losses tend to happen in bursts, a system could approximate the packet loss rate at any given time. Using this data, it could be possible to know when to scale up or down the percentage with which it sends a duplicate reply, achieving improved read latencies.

For the graphical model approach, we suspect that the performance of the modeling algorithm will improve if more detailed and accurate data is provided, that is, the detailed latencies between replicas. Once such data is available, our modeling algorithm will capture the correlation between replicas. We believe this will be particularly useful when the approach is applied to a distributed data store where the data is placed across a wide area, as correlation among replicas plays a more crucial role in that scenario.

Acknowledgements

We would like to thank Peter Bailis and Shivaram Venkataraman for their guidance throughout the project. We would like to thank the AMP Genomics project for lending us their systems, and Jon Kuroda for facilitating the process. We also thank Anthony D. Joseph, John D. Kubiatowicz, and Aaron Davidson for help at various points.

References

[1] Apache Cassandra 1.1 documentation - cassandra-stress. references/stress_java, December.

[2] Basho Riak. riak-overview/, December.

[3] Benchmarking Cassandra scalability on AWS - over a million writes per second. benchmarking-cassandra-scalability-on.html, November.

[4] Cassandra at Twitter today. http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html, July.

[5] Cassandra wiki: Architecture internals. apache.org/cassandra/architectureinternals, December.

[6] Cassandra wiki: Operations. org/cassandra/operations#repairing_missing_or_inconsistent_data, December.

[7] January state of the servers. reddit.com/search/label/cassandra, January.

[8] Leibniz formula for pi. wiki/leibniz_formula_for_pi, December.

[9] LinkedIn Voldemort. project-voldemort.com/voldemort/, December.

[10] netem. collaborate/workgroups/networking/netem, December.

[11] Riak read path - get_fsm. https://github.com/basho/riak_kv/blob/42eb6951b369e3fd9a42f7f54fb7618a40f1a9fb/src/riak_kv_get_fsm.erl#l153, June.

[12] UC Berkeley cluster computing. millennium.berkeley.edu/wiki/psi, December.

[13] Voldemort read path - PipelineRoutedStore. master/src/java/voldemort/store/routed/PipelineRoutedStore.java#L186, September.

[14] D. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2):37-42.

[15] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica. Probabilistically bounded staleness for practical partial quorums. Proceedings of the VLDB Endowment, 5(8).

[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4.
[17] J. Dean. Achieving rapid response times in large online services. googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/berkeley-latency-mar2012.pdf, May.

[18] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41. ACM.

[19] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38.

[20] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 2002.

[21] E. J. Horvitz, J. Apacible, R. Sarin, and L. Liao. Prediction, expectation, and surprise: Methods, designs, and study of a deployed traffic forecasting service. arXiv preprint.

[22] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. Operating Systems Review, 44(2):35.

[23] G. Linden. Make data useful. Presentation, Amazon, November.

[24] G. Linden. Marissa Mayer at Web 2.0. Online.

[25] C. Rotsos, J. Van Gael, A. W. Moore, and Z. Ghahramani. Probabilistic graphical models for semi-supervised traffic classification. In Proceedings of the 6th International Wireless Communications and Mobile Computing Conference. ACM.

[26] S. Sun, C. Zhang, and G. Yu. A Bayesian network approach to traffic flow forecasting. IEEE Transactions on Intelligent Transportation Systems, 7(1).

[27] W. Vogels. Eventually consistent. Communications of the ACM, 52(1):40-44.

[28] W. Vogels. Amazon DynamoDB - a fast and scalable NoSQL database service designed for internet scale applications. /01/amazon-dynamodb.html, January.

[29] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305.

[30] M. Welsh, D. Culler, and E. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In ACM SIGOPS Operating Systems Review, volume 35. ACM.

[31] J. Whittaker, S. Garside, and K. Lindveld. Tracking and predicting a network traffic process. International Journal of Forecasting, 13(1):51-61.

[32] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95-103.

Appendix

This appendix contains the graphs for this report. All graphs were generated with MATLAB. Unless otherwise specified, all latencies are measured in milliseconds.

Figures 1 through 11 (referenced in Sections 8 and 9) appear here.


More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases Key-Value Document Column Family Graph John Edgar 2 Relational databases are the prevalent solution

More information

CSE-E5430 Scalable Cloud Computing Lecture 10

CSE-E5430 Scalable Cloud Computing Lecture 10 CSE-E5430 Scalable Cloud Computing Lecture 10 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 23.11-2015 1/29 Exam Registering for the exam is obligatory,

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University Introduction to Computer Science William Hsu Department of Computer Science and Engineering National Taiwan Ocean University Chapter 9: Database Systems supplementary - nosql You can have data without

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

6.UAP Final Report: Replication in H-Store

6.UAP Final Report: Replication in H-Store 6.UAP Final Report: Replication in H-Store Kathryn Siegel May 14, 2015 This paper describes my research efforts implementing replication in H- Store. I first provide general background on the H-Store project,

More information

Dynamo: Key-Value Cloud Storage

Dynamo: Key-Value Cloud Storage Dynamo: Key-Value Cloud Storage Brad Karp UCL Computer Science CS M038 / GZ06 22 nd February 2016 Context: P2P vs. Data Center (key, value) Storage Chord and DHash intended for wide-area peer-to-peer systems

More information

Indexing Large-Scale Data

Indexing Large-Scale Data Indexing Large-Scale Data Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook November 16, 2010

More information

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014 Cassandra @ Spotify Scaling storage to million of users world wide! Jimmy Mårdell October 14, 2014 2 About me Jimmy Mårdell Tech Product Owner in the Cassandra team 4 years at Spotify

More information

Introduction to the Active Everywhere Database

Introduction to the Active Everywhere Database Introduction to the Active Everywhere Database INTRODUCTION For almost half a century, the relational database management system (RDBMS) has been the dominant model for database management. This more than

More information

arxiv: v1 [cs.db] 26 Apr 2012

arxiv: v1 [cs.db] 26 Apr 2012 Probabilistically Bounded Staleness for Practical Partial Quorums Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica University of California, Berkeley {pbailis,

More information

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage Horizontal or vertical scalability? Scaling Out Key-Value Storage COS 418: Distributed Systems Lecture 8 Kyle Jamieson Vertical Scaling Horizontal Scaling [Selected content adapted from M. Freedman, B.

More information

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University HyperDex A Distributed, Searchable Key-Value Store Robert Escriva Bernard Wong Emin Gün Sirer Department of Computer Science Cornell University School of Computer Science University of Waterloo ACM SIGCOMM

More information

Programming Project. Remember the Titans

Programming Project. Remember the Titans Programming Project Remember the Titans Due: Data and reports due 12/10 & 12/11 (code due 12/7) In the paper Measured Capacity of an Ethernet: Myths and Reality, David Boggs, Jeff Mogul and Chris Kent

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Block Storage Service: Status and Performance

Block Storage Service: Status and Performance Block Storage Service: Status and Performance Dan van der Ster, IT-DSS, 6 June 2014 Summary This memo summarizes the current status of the Ceph block storage service as it is used for OpenStack Cinder

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Migrating Oracle Databases To Cassandra

Migrating Oracle Databases To Cassandra BY UMAIR MANSOOB Why Cassandra Lower Cost of ownership makes it #1 choice for Big Data OLTP Applications. Unlike Oracle, Cassandra can store structured, semi-structured, and unstructured data. Cassandra

More information

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information

DataStax Enterprise 4.0 In-Memory Option A look at performance, use cases, and anti-patterns. White Paper

DataStax Enterprise 4.0 In-Memory Option A look at performance, use cases, and anti-patterns. White Paper DataStax Enterprise 4.0 In-Memory Option A look at performance, use cases, and anti-patterns White Paper Table of Contents Abstract... 3 Introduction... 3 Performance Implications of In-Memory Tables...

More information

Research. Eurex NTA Timings 06 June Dennis Lohfert.

Research. Eurex NTA Timings 06 June Dennis Lohfert. Research Eurex NTA Timings 06 June 2013 Dennis Lohfert www.ion.fm 1 Introduction Eurex introduced a new trading platform that represents a radical departure from its previous platform based on OpenVMS

More information

Weak Consistency as a Last Resort

Weak Consistency as a Last Resort Weak Consistency as a Last Resort Marco Serafini and Flavio Junqueira Yahoo! Research Barcelona, Spain { serafini, fpj }@yahoo-inc.com ABSTRACT It is well-known that using a replicated service requires

More information

Trade- Offs in Cloud Storage Architecture. Stefan Tai

Trade- Offs in Cloud Storage Architecture. Stefan Tai Trade- Offs in Cloud Storage Architecture Stefan Tai Cloud computing is about providing and consuming resources as services There are five essential characteristics of cloud services [NIST] [NIST]: http://csrc.nist.gov/groups/sns/cloud-

More information

Data Analytics on RAMCloud

Data Analytics on RAMCloud Data Analytics on RAMCloud Jonathan Ellithorpe jdellit@stanford.edu Abstract MapReduce [1] has already become the canonical method for doing large scale data processing. However, for many algorithms including

More information

CS6450: Distributed Systems Lecture 15. Ryan Stutsman

CS6450: Distributed Systems Lecture 15. Ryan Stutsman Strong Consistency CS6450: Distributed Systems Lecture 15 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed

More information