Trading latency for freshness in storage systems

James Cipar
September 27, 2012

Abstract

Many storage systems have to provide extremely high-throughput updates and low-latency read queries. In practice, system designs that provide those capabilities often face a tradeoff between query latency, efficiency, and result freshness. In my dissertation, I will argue that systems should be designed to allow a per-query configuration of this tradeoff. I will use two case studies to demonstrate the value of doing so. The first is LazyBase, a database designed for high-throughput ingest of observational data. The second is LazyTables, a shared data structure designed to support parallel machine learning applications. In both cases, the term ``Lazy'' refers to the systems' procrastination: waiting to apply updates until they can be executed as efficiently as possible. This design decision creates the potential for staleness in the data, hence the need to study the tradeoff between freshness and performance. Additionally, I will describe a number of other applications where this tradeoff is potentially useful in system design.

1 Introduction

There is a large and diverse class of parallel applications that require extremely high-throughput updates to shared data structures. Two important subclasses are analytical databases and the shared data structures used by parallel machine learning algorithms.

Many applications in modern data collection and analytics require analyzing large data sets that grow and change at a high rate. These applications include business and purchase analytics, hardware monitoring, click-stream and web user tracking, and analysis of tweets and other social media. In addition to high update throughput, these applications require efficient queries over a reasonably up-to-date version of the database.

Many parallel computations, such as those used in machine learning, maintain large shared data structures. Commonly these data structures are sparse matrices or vectors of numbers. Each thread of execution performs many updates at a very high rate -- hundreds of thousands per thread per second -- interspersed with infrequent read accesses. For instance, in Latent Dirichlet Allocation, each thread reads a portion of a data set, then issues many increment and decrement operations to elements of a shared vector. There is no locality to these operations: any thread may update any part of the vector.

In both cases, query efficiency, low query latency, and fresh results are all desirable properties of a storage system designed to support these applications. However, many techniques for achieving high update throughput hurt at least one of these properties. For instance, batching many updates into a single group can vastly improve update throughput. However, it introduces a latency between when an update is submitted and when the resulting batch is committed. Furthermore, a large batch may take longer to process, introducing additional latency. This technique, like other similar ones, means that the design of storage systems involves a tradeoff between the latency and efficiency of query results and the freshness (or, inversely, the staleness) of the data that is returned.

For many applications, different queries have different desires regarding this tradeoff, even when the queries access the same data set. In the example of an analytical database storing Twitter data, an application trying to report the ``hot news'' that people are currently talking about has a freshness requirement on the order of a few minutes; another application analyzing the structure of the social network graph might have a freshness requirement on the order of days. In a parallel computation system, queries at the beginning of a computation may be able to make progress even with highly stale results, but need very accurate and fresh results to make small refinements at the end of the computation. Such diverse applications can benefit from storage systems that allow the application to make decisions regarding this tradeoff at query time.

Existing solutions typically provide only a single point on this tradeoff. In the database example, there are three types of existing solutions, each of which has settled on a different tradeoff. Relational databases (RDBMSs) -- generally accessed via the SQL query language -- provide a well-defined consistency model and very up-to-date results. However, the cost of this is relatively inefficient update and query performance, and limited scalability. Data warehouses -- sometimes called analytical databases -- are designed for extremely high-performance queries, at the cost of very inefficient updates. To mitigate this, updates are performed in large batches, a technique that introduces staleness on the order of hours or even days. Lastly, NoSQL databases drop the strong consistency of RDBMSs in exchange for high performance and scalability. However, these systems can be difficult to program because they provide no guarantees about either the consistency or freshness of the data.

In this thesis, I will argue for system designs that allow applications to manage this tradeoff at query time. I will demonstrate that freshness requirements are often a property of the query, not the data set or even the application, and that making the tradeoff explicit enables greater overall efficiency.

Thesis Statement

Storage systems can and should provide for per-query configuration of the tradeoff between data freshness and query latency and efficiency.

To support this thesis, I will show that (a) techniques to improve performance often introduce staleness in data, (b) it is often possible to do work during a query's execution to avoid this staleness, and (c) there is value in creating systems that allow applications to do so. I will demonstrate these points in the context of two case studies in applying this tradeoff to real storage systems. The first case study (section 2) examines a database that is designed to support continuous, high-throughput updates and inserts while also allowing efficient analytical queries over the data. The second case study (section 3) is a data management system for large-scale, data-intensive machine learning. Additionally, I will describe a set of design principles (section 4) that apply to systems exploiting this latency/freshness tradeoff. I will discuss how these design principles could apply to other domains where the tradeoff is potentially useful (section 5), such as sensor networks, geographically distributed log collection, and data-intensive scientific computing.
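To make the idea of per-query configuration concrete before the case studies, the following is a minimal C++ sketch of what a freshness-aware query interface might look like from the application's side. The names and types here (FreshnessBound, Database::query) are hypothetical illustrations of the thesis statement, not the actual LazyBase or LazyTables API.

    #include <chrono>
    #include <string>
    #include <vector>

    // Hypothetical client-side interface: each query carries its own freshness
    // bound, so the storage system can decide how much extra work to do on the
    // read path for this particular query.
    struct FreshnessBound {
        std::chrono::seconds max_staleness;  // results may omit newer updates
    };

    struct Row {
        std::vector<std::string> columns;
    };

    class Database {
    public:
        // A query that tolerates stale data can be answered cheaply from
        // already-processed state; a query demanding very fresh data forces the
        // system to also consult not-yet-processed updates, at higher latency.
        std::vector<Row> query(const std::string& predicate, FreshnessBound bound) {
            (void)predicate;
            (void)bound;
            return {};  // stub: a real system would plan the read path here
        }
    };

    int main() {
        Database db;
        // "Hot news" needs results at most a few minutes stale ...
        db.query("topic = 'news'", {std::chrono::minutes(5)});
        // ... while social-graph analysis tolerates day-old data.
        db.query("type = 'follow'", {std::chrono::hours(24)});
        return 0;
    }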

2 First case study: LazyBase

The first case study examines LazyBase [7], a database designed to support applications that extract knowledge from large, rapidly changing data sets. These applications require a database that supports continuous and extremely high-throughput update and insert operations, as well as efficient queries over large parts of the data set. Furthermore, given the fast-paced nature of modern data analytics, these applications sometimes require queries to be answered with a very fresh view of the data, but often can use slightly older versions.

Existing solutions fall short on one or more of the requirements of these data analytics applications. Relational databases (often referred to as OLTP databases) provide a fresh view of the data and efficient queries for specific items. However, their performance is often insufficient for workloads with very high-throughput updates and large analytic queries. In response to these shortcomings, data warehouses (a.k.a. ``analytical databases'') provide very efficient support for queries over large parts of a data set. However, they are typically loaded by batch jobs that run infrequently, meaning that query results reflect a very stale version of the data. The new and growing class of NoSQL databases attempts to solve the problems of scaling relational databases to large ``web scale'' workloads. These databases focus on excellent scale-out performance. To achieve this scalability they typically provide weak consistency guarantees, a choice motivated by the CAP theorem. While these systems can handle the throughput requirements of many analytical applications, their consistency semantics can be difficult for application programmers to reason about.

2.1 Motivation and Design

To achieve the efficiency and scalability necessary for these applications, LazyBase makes extensive use of batching and pipelining for update and insert operations. However, these techniques introduce a long delay between when an update is submitted by a client and when that update is finally applied to the main database. In extreme cases, this delay may be minutes or even hours. This means that queries answered from the main database tables may return data that is seconds to minutes old. While such a delay may be acceptable for some queries, others require more up-to-date results.

Table 1 shows a variety of application families from a wide variety of application domains. Observe that different applications accessing the same data may have vastly different freshness requirements. Additionally, there is a general trend: applications that require fresher data often look at less of the data set. For instance, in the retail domain, the freshest data is required by real-time coupons and targeted ads. Both of these applications are only concerned with a single user's activity. On the other hand, the applications that can use stale data (product search, trending, earnings reports) need to look at data across a range of users.

Using this observation, LazyBase presents application programmers with a configurable per-query tradeoff between freshness and latency. This tradeoff is made possible by LazyBase's pipelined architecture, depicted in Figure 1. LazyBase clients connect to an ingest server and upload a batch of updates. This batch, possibly grouped with other batches, is used to create a Self-Consistent Update file (SCU), which contains instructions for inserting, updating, and deleting rows of data. Before being applied to the main database file (the Authority Tables), the SCU goes through a series of stages that are designed to make the data easier to query. These stages include ID-remapping (a technique for efficiently handling foreign keys), sorting, and finally a merge operation that eventually updates the Authority Tables.
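As a rough illustration of how this pipeline enables the per-query tradeoff, the sketch below models each SCU as carrying its current pipeline stage and shows how a query's freshness requirement determines which representations it must consult: the Authority Tables alone for relaxed queries, plus sorted but unmerged SCUs for fresher ones, plus a scan of unsorted SCUs for the freshest results (the query-side behavior is described in more detail below). The types and the planning function are invented for illustration and greatly simplify the actual design in [7].

    #include <cstdint>
    #include <vector>

    // Each batch of updates (an SCU) advances through the pipeline stages.
    enum class Stage { Ingested, IdRemapped, Sorted, MergedIntoAuthority };

    struct Scu {
        std::uint64_t id;  // monotonically increasing batch identifier
        Stage stage;       // how far this batch has progressed
    };

    // The data a query must consult, given how much staleness it tolerates.
    struct ReadPlan {
        bool read_authority_tables = true;         // always: sorted and indexed
        std::vector<std::uint64_t> sorted_scus;    // fresher; binary-searchable
        std::vector<std::uint64_t> unsorted_scus;  // freshest; must be scanned
    };

    // required_id is the newest batch whose updates the query must observe,
    // derived from the query's staleness bound. Batches newer than required_id
    // can be ignored; batches at or below it that have not yet been merged must
    // be read from intermediate pipeline output, making the query slower but
    // its answer fresher.
    ReadPlan plan_read(const std::vector<Scu>& pending, std::uint64_t required_id) {
        ReadPlan plan;
        for (const Scu& scu : pending) {
            if (scu.id > required_id) continue;                     // newer than needed
            if (scu.stage == Stage::MergedIntoAuthority) continue;  // already cheap
            if (scu.stage == Stage::Sorted) {
                plan.sorted_scus.push_back(scu.id);
            } else {
                plan.unsorted_scus.push_back(scu.id);               // full scan required
            }
        }
        return plan;
    }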

Table 1: Freshness requirements for application families from a variety of domains.

  Retail -- seconds: real-time coupons, targeted ads and suggestions; minutes: just-in-time inventory management; hours: product search, trending, earnings reports.
  Social networking -- seconds: message list updates, friend/follower list changes; minutes: wall posts, photo sharing, news updates and trending; hours: social graph analysis.
  Transportation -- seconds: emergency response, air traffic control; minutes: real-time traffic maps, bus/plane arrival prediction; hours: traffic engineering, bus route planning.
  Investment -- seconds: real-time micro-trades, stock tickers; minutes: web-delivered graphs; hours: trend analyses, growth reports.
  Enterprise information management -- seconds: infected machine identifications; minutes: file-based policy violations; hours: enterprise search results, e-discovery requests.
  Data center and network monitoring -- seconds: automated problem detection and diagnosis; minutes: human-driven diagnosis, online performance charting; hours: capacity planning, availability analyses, workload characterization.

Figure 1: LazyBase pipeline.

In this pipeline, there is a potentially long delay (e.g., seconds to minutes) between a client submitting an SCU and that SCU being merged into the Authority Tables. This delay depends on a number of factors, including the number of servers, the SCU size, and the complexity of the indexes and foreign keys. To allow applications to access fresher versions of the data, LazyBase allows queries to read the intermediate data produced by each pipeline stage. For instance, a query may read the Authority Tables as well as the sorted SCUs that have not yet been merged. This provides a fresher view of the data, but because the sorted files have not been merged yet, the query processing operation is slower and more expensive. Thus, when accessing intermediate data, queries are less efficient and have a longer latency. The extreme case is queries that require completely up-to-date results and therefore must access the unsorted SCUs. Because these SCUs are unsorted and unindexed, there is no efficient way to find a particular data value (or range of values). Hence, a query that accesses unsorted SCUs must scan all of the unsorted files to find all relevant data.

2.2 Evaluation

The primary goals of the LazyBase evaluation are to motivate the use of large batches (and thus the increase in update latency), and to show that LazyBase can provide fresher results at query time, at a cost of query latency. This section presents some key results from [7], with more details available in that paper.

Figure 2 shows the overall ingest performance of LazyBase with varying batch sizes. This experiment demonstrates the potential improvement in throughput that can be achieved by batching. It shows that very large batches can be used for significant performance improvements.

Figure 2: Inserts per second for various SCU sizes.

However, such large batches can take a long time to process from start to finish. Table 2 shows the performance of each stage of the LazyBase pipeline in terms of the SCU size. With an SCU size of 2k rows, the total latency from the start to the finish of the pipeline is over 10 minutes. This high latency may be unacceptable for some applications, necessitating the ability to query data that is still in the pipeline.

Figure 3 demonstrates the performance of queries that access intermediate data from the pipeline. For this experiment, an upload client loaded a large volume of data into LazyBase over the course of about 10 minutes. During that time, query clients submitted queries with different freshness requirements and measured the overall latency of each query. The results show that LazyBase is able to provide low-latency queries using the Authority Tables, while also providing fresher data for queries that need it -- at a cost of increased latency.

Table 2: Performance of individual pipeline stages.
  Stage    k Rows/s
  Ingest       39
  Sort        158
  Merge       120

Figure 3: Query latency over time for a steady-state workload. Sampled every 15 s and smoothed with a sliding-window average over the past minute.

3 Second case study: LazyTables

The second case study examines LazyTables, a distributed shared data structure for data-intensive machine learning applications. LazyTables provides the abstraction of a 2-dimensional table of values shared between all processes and threads of an application. In reality, the table may be spread across many servers, and the data may reside on disk instead of in main memory. This system is specifically designed for a class of machine learning applications (e.g., LDA, Shotgun) that require only loose synchronization between different threads. These applications typically proceed in iterations, where each iteration first reads a small number of rows from the table, then updates many individual values in the table. These updates are not local to a single row, but may be scattered across all rows of the table. Often these updates must be applied atomically, e.g., atomically incrementing one value while decrementing another. Furthermore, the operations are almost exclusively commutative and associative: increment and multiply are the most common.

Existing solutions

Previous work -- including Piccolo [13], GraphLab [11, 12], and Spark [17] -- attempts to provide high-performance distributed data structures for machine learning and other data analytics. Piccolo is the most similar in intent to LazyTables. In fact, the API for LazyTables was designed based on the Piccolo API, and is almost exactly the same. However, Piccolo -- like the other systems -- does not account for the extremely high write throughput required by many modern machine learning applications, and does not exploit their particular relaxations of consistency requirements.

GraphLab was specifically designed for large-scale machine learning, but focuses on graph algorithms. These algorithms model the problem using a large set of vertexes, each connected to a small number of other vertexes. When a vertex's value changes by more than a threshold, it triggers an update function for all neighboring vertexes. The computation continues until the result has stabilized. Spark provides an in-memory distributed data structure called a Resilient Distributed Dataset (RDD). Similar to LazyTables, RDDs are designed for high-performance data analytics. However, to achieve their fault-tolerance goals, RDDs allow only a restricted set of bulk operations on data, and specifically prohibit fine-grained updates. While some algorithms are naturally expressed as graph algorithms, many users report that there is a significant mismatch between their way of thinking about an algorithm and the model provided by GraphLab. This mismatch was the original reason that the machine learning research group approached us with this problem.

In all of these cases, there is still an assumption that writes must complete with low latency so that reads may return relatively up-to-date results. However, much recent work in machine learning, e.g. [6], demonstrates that algorithms can proceed with relatively little coupling between threads and a potentially stale view of the data. This suggests that there is room for a significant performance improvement by allowing for some staleness in the data.

3.1 Motivating experiments

To motivate the design of LazyTables, I have been experimenting with a parallel Latent Dirichlet Allocation [5] (LDA) implementation written by Qirong Ho. I provided Qirong with a simple in-memory table class that has compile-time options for thread safety and event tracing. By tracing the requests made to the table, I found that, out of approximately 30M requests, only 40 were row_read calls. The rest were all increment calls. Furthermore, in his algorithm, each thread only reads updated values for each row after many updates, meaning that the threads are already using quite out-of-date information. This provides a strong motivating example for LazyTables, as it shows the importance of update performance and the relative unimportance of read performance or freshness.

As further motivation, I decided to improve the performance of increment while causing row_read to return even staler data. This was based on the observation that the program ran substantially faster with a single thread in non-thread-safe mode. Based on the assumption that locking was slowing down the multi-threaded version, I modified the increment function to batch updates in thread-local storage. After 1024 updates have accumulated (or the client program calls flush), the updates are pushed to the shared table, locking only once to do so.
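As a rough illustration, the batching scheme just described can be sketched in C++ as follows: increments accumulate (and combine) in a per-thread buffer and are pushed to the shared table under a single lock once 1024 updates have accumulated or the application calls flush. This is a simplified reconstruction for illustration only, with one TableClient instance per thread standing in for thread-local storage; it is not the prototype's actual code.

    #include <cstddef>
    #include <map>
    #include <mutex>
    #include <utility>

    // A shared table of counts, e.g. topic/word counts in LDA. An element is
    // addressed by a (row, column) pair.
    class SharedTable {
    public:
        void apply_batch(const std::map<std::pair<int, int>, long>& deltas) {
            std::lock_guard<std::mutex> guard(mutex_);  // one lock per batch,
            for (const auto& entry : deltas) {          // not one per increment
                cells_[entry.first] += entry.second;
            }
        }

        long read(int row, int col) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = cells_.find(std::make_pair(row, col));
            return it == cells_.end() ? 0 : it->second;
        }

    private:
        std::mutex mutex_;
        std::map<std::pair<int, int>, long> cells_;
    };

    // Per-thread client handle (standing in for thread-local storage here):
    // increments are combined locally and flushed in batches, so other threads
    // see the writes late (staler reads), but the writer never takes a lock
    // for an individual increment.
    class TableClient {
    public:
        explicit TableClient(SharedTable& table, std::size_t batch_limit = 1024)
            : table_(table), batch_limit_(batch_limit) {}

        void increment(int row, int col, long delta) {
            pending_[std::make_pair(row, col)] += delta;  // combines same-cell updates
            if (++pending_count_ >= batch_limit_) {
                flush();
            }
        }

        void flush() {
            if (pending_.empty()) return;
            table_.apply_batch(pending_);
            pending_.clear();
            pending_count_ = 0;
        }

    private:
        SharedTable& table_;
        std::size_t batch_limit_;
        std::size_t pending_count_ = 0;
        std::map<std::pair<int, int>, long> pending_;
    };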
Table 3 shows the results of these experiments, run on a 6-core virtual machine with 12 GB of RAM. Increment operations are so frequent in this application that locking the table structure slows down performance by a factor of 2 or more. Note that, as the number of threads increases, the cost of full synchronization goes up considerably. On the other hand, the performance of the batching implementation improves with more processors.

The log-likelihood column gives a measure of the quality of the result, with a higher number (closer to 0) indicating better results. While the absolute value is not meaningful with only 10 iterations, the relative values can be compared. These results suggest that batching with a batch size of 1024 has little effect on the convergence of the algorithm. In other words, this algorithm can tolerate some staleness in its read results.

Table 3: Runtime for 10 iterations of LDA with different synchronization methods (single-threaded, locking, and batching with a batch size of 1024, at increasing thread counts), reporting thread count, runtime in seconds, and log-likelihood.

3.2 Design

These experiments motivate a number of design requirements for the system. This section describes those requirements, as well as some open questions in the design of LazyTables.

The experiments from subsection 3.1 suggest that extremely low-latency, high-throughput writes are the most important design consideration. Even an operation as simple as locking a mutex would be prohibitively slow for every write. The need for low latency suggests that the client must not wait for the update operation to complete. This is achieved by a combination of batching and lazy writes. When a client thread submits an update, it is batched in thread-local storage. This means that the client does not have to invoke any expensive atomic operations for typical writes. When the batch is large enough -- or the client explicitly requests a flush operation -- the batch is pushed into the shared LazyTables data structure.

The high-throughput requirement motivates two other design decisions that may further increase staleness: combining updates and logging. Updates can be combined by clients to reduce the total number of updates that must be sent over expensive network links or written to stable storage. For instance, two increment operations that modify the same element of the table can be combined into one. When writing updates to stable storage, logging allows the system to avoid an expensive read-modify-write. Instead, the update is logged and the main data structures can be updated later as needed. In a system built on top of spinning magnetic disks, this also allows LazyTables to perform almost exclusively sequential I/O.

A proposed update pipeline is depicted in Figure 4. In this scheme, updates are sent through a series of buffering stages where they may be batched and merged with other updates. Eventually they are written to an on-disk log. This log can be periodically scanned to produce a new snapshot of the updated rows, which is also written to disk.

Figure 4: LazyTables update pipeline.

Using this design, a client has a number of options for satisfying read requests, depending on the desired freshness or latency (a sketch of this read path appears at the end of this subsection):

1. Read a locally cached version of the snapshot.
2. Read only the latest snapshot.
3. Read the latest snapshot and scan part (or all) of the log to get a newer value.
4. Read the latest snapshot and all of the log, and request that clients flush recent writes.

Consistency requirements

An important design consideration in LazyTables is the notion of ``staleness'' and how it is measured. In LazyBase, staleness is based on elapsed wall-clock time, because LazyBase is intended to be used to collect data from the real world. In the case of LazyTables, elapsed real time is not as meaningful, because it gives no indication of how much computation has occurred in the interim. One possibility would be to use the total number of batches written in the interim. But even that may not be meaningful to application programmers, as some threads may proceed faster than others. Based on the iterative structure of many of these algorithms, another possible way of measuring staleness is the iteration the data came from. For instance, an application could submit a query requesting ``all data as of 3 iterations ago''. This would allow for consistency requirements similar to those of data stream warehouses [9]. In this case, the results for each iteration could be considered ``closed'' if all results for that iteration are available, and ``open'' if some results are still pending.
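To make the proposed read path concrete, here is a minimal C++ sketch of the snapshot-plus-log structure for a single row, covering read options 2 and 3 above and using iteration count as the staleness measure. The interface is hypothetical and greatly simplified relative to the proposed design; in particular, an in-memory vector stands in for the on-disk log, and log entries are assumed to arrive in iteration order.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // One logged update: which column of this row changed, by how much, and
    // during which iteration of the algorithm the writer produced it.
    struct LoggedUpdate {
        int col;
        long delta;
        std::uint64_t iteration;
    };

    // A snapshot of a row, materialized from the log up to some iteration.
    struct RowSnapshot {
        std::uint64_t up_to_iteration = 0;
        std::map<int, long> values;  // column -> value
    };

    class LazyRow {
    public:
        // Writers append to the log; in this sketch, entries arrive in
        // non-decreasing iteration order.
        void append(const LoggedUpdate& u) { log_.push_back(u); }

        // Cheapest read (option 2): whatever the latest snapshot contains,
        // however stale it may be.
        const RowSnapshot& latest_snapshot() const { return snapshot_; }

        // Fresher read (option 3): start from the latest snapshot and replay
        // just enough of the log to include every update from iterations up to
        // required_iteration. A smaller bound means less log to scan (lower
        // latency, staler data); a larger bound means more work, fresher data.
        RowSnapshot read_as_of(std::uint64_t required_iteration) const {
            RowSnapshot result = snapshot_;
            if (result.up_to_iteration >= required_iteration) return result;
            for (const LoggedUpdate& u : log_) {
                if (u.iteration <= result.up_to_iteration) continue;  // in snapshot
                if (u.iteration > required_iteration) break;  // fresher than needed
                result.values[u.col] += u.delta;
            }
            result.up_to_iteration = required_iteration;
            return result;
        }

        // Periodic background work: fold the log into a new snapshot so that
        // future reads are cheap again.
        void compact(std::uint64_t through_iteration) {
            snapshot_ = read_as_of(through_iteration);
            std::vector<LoggedUpdate> remaining;
            for (const LoggedUpdate& u : log_) {
                if (u.iteration > through_iteration) remaining.push_back(u);
            }
            log_ = std::move(remaining);
        }

    private:
        RowSnapshot snapshot_;
        std::vector<LoggedUpdate> log_;  // stands in for the on-disk log
    };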

Experiments

The goal in evaluating LazyTables is to clearly show the value of providing a tradeoff between latency and freshness in the implementation. I will demonstrate that the design of the table implementation fits the needs of the ML applications: extremely high write throughput and acceptable read performance and freshness. Secondly, I will demonstrate that the latency/freshness tradeoff can be made at query time. Lastly, I will show the utility of this for some example machine learning algorithms, showing good performance and scalability with added resources. I have divided the experiments into three phases based on the implementation effort required.

Phase 1

In the first phase, which I have already started, I will gather log data about how an important machine learning algorithm accesses a data table. I am particularly interested in the frequency of each operation, as well as the potential for combining updates before creating a new snapshot of a row.

Phase 2

In the second phase, I will explore the impact of staleness on the convergence behavior of some ML algorithms. As the staleness increases, how does the convergence rate change? Additionally, this may be an opportunity to explore some of the alternative consistency models discussed above under Consistency requirements. Note that this phase does not require a complete implementation of LazyTables, but simply a way of artificially injecting staleness into a prototype implementation. An additional goal of these experiments is to understand how dynamic changes in the freshness requirement can improve the performance and convergence of the algorithms. For instance, when an algorithm starts it may not require particularly fresh data. However, when it is nearing the end of its execution, it may require much fresher data when making small changes to refine the result.

Phase 3

The third phase of experiments will explore the performance impact of the techniques described in subsection 3.2. I will measure the performance differences between logging updates and update-in-place. Additionally, I will measure the staleness of snapshot results, and what the freshness/latency tradeoff looks like in practice.

4 Design principles

In addition to the case studies supporting my thesis, another contribution of this work will be a set of design principles extracted from the LazyBase and LazyTables implementations. I will discuss how these design principles applied to both LazyBase and LazyTables, and how they might be applied to other systems with similar goals. These design principles fall into two broad categories: a set of techniques that can be used to exploit a tradeoff between latency and freshness, and a general classification of applications where these techniques are relevant.

The latency/freshness tradeoff generally appears in systems that require very high update throughput. To increase throughput, and to lower the write latency for the client, we use techniques like batching and logging. These techniques speed up writes from the client's perspective, but they increase the time between writing a value and that value being available for queries (i.e., the staleness). To mitigate this problem, queries that need fresher results can do more work to access intermediate data.

In the strictest sense, any application that does not perform atomic transactions with both reads and writes can use stale data. However, the degree of freshness that the system provides -- or the application expects -- is generally not specified. NoSQL systems like Cassandra try to service reads and writes as quickly as possible, but there is never any guarantee regarding the age of the data that is read. Applications often expect to read a ``current'' copy of the data, but have no way of specifying that to the system. I believe that application programmers should be made aware of this staleness and given an option to specify freshness requirements.

I have started developing a list of attributes that indicate that an application is a candidate for techniques that potentially increase staleness in exchange for improved throughput or latency. For instance, LazyBase was designed for ``observational data'', where the processes that write data are making observations of the outside world and writing them to the database. In this case, they never need to read the data that already exists in order to make a change. Another class is cases where the ``details'' of the data change quickly, but the high-level structure changes very slowly. For instance, with the Twitter data, individual tweets are added at a very high rate, but the social network graph is relatively static. Similarly, in the ML example, the application may frequently make small changes to a high-dimensional vector, but because of the high dimension, the overall location of the vector changes slowly.

5 Other potential domains

The dissertation will also include a qualitative discussion of how a tradeoff between latency and freshness is currently applied, or could be applied, to other application domains. I will discuss how the design principles presented in section 4 are relevant to these other systems. Primarily I will do this in the context of two application domains: data collection from sensor networks, and geographically distributed log collection. I will go into a fair amount of detail about these domains and how it could be beneficial for freshness to be a first-class consideration in their system designs. However, I will not do an actual implementation and set of experiments for these.

5.1 Sensor networks

Sensor networks typically consist of many small devices collecting data at a high rate. (Consider a ``smart building'' with temperature, humidity, noise, and light sensors for HVAC, door openings, rooms, windows, etc.) The network for such a system is, at best, a WiFi connection. These systems frequently collect more data than they can actually propagate over the network. Thus, applications are unable to get data that is both up-to-date and very fine-grained. Instead, they have to use either stale data or summary data. A potential solution is to upload stale, coarse-grained data to a central location, but many applications need better than that. For these, it should be possible to push some queries closer to the sensors themselves (or at least to gateway collection nodes) to access fresher or more detailed data. Doing so will increase the latency of the query, but provide more up-to-date results.

I have begun talking to two researchers who are working in this space: Diogo Gomes and Anthony Rowe. In this part of the thesis, I plan to continue these discussions and explore the state of the art in these types of sensor networks. I will describe how the design principles from section 4 are currently applied in the field and how they might be further applied.

5.2 Geographically distributed logs

A similar problem occurs in collecting logs from geographically distributed data centers. This is a case where I really need to find and talk to some experts. The problem, as I have envisioned it, is that we would like to query logs aggregated from many data centers around the world. Within a single data center, logs are sent to a central location for querying (perhaps a LazyBase-like system). However, bandwidth between the data centers is limited, and it may not be possible to have a single location with an up-to-date view of all logs. Similar to the sensor network case, there is a tradeoff between looking at a local but stale view of the data and running a distributed query across all data centers.

5.3 Honorable mentions

In addition to the previous two examples, there are a number of other applications that could potentially exploit a tradeoff between freshness and latency. They are worth mentioning, but in the interest of time I do not plan on exploring them in any detail. These include the collection and analysis of scientific data, such as telescope data and genomic data. In these cases the data sets are too large to send over the Internet. Instead, it may be possible to distribute summary data that can satisfy some queries quickly, while shipping code to the original data source to execute more detailed queries. Beyond data bottlenecks, there may be other iterative algorithms that demonstrate convergence in the face of stale data. For instance, the conjugate gradient method for solving linear systems can converge when there are errors in the calculation. This suggests that it may also converge if reads provide slightly out-of-date results.

6 Additional related work

The closest related work has been discussed in the two case study sections. This section discusses additional related work. While current systems are forced to deal with freshness to a degree, they generally provide limited control over freshness to applications. With the recent popularity of NoSQL databases and their loose consistency guarantees, there has been much interest in developing stricter definitions for the consistency provided by these systems. Some of this work has focused on developing a descriptive framework for these consistency models [16, 2]. Other work focuses on developing systems that have performance competitive with NoSQL systems, but provide stronger consistency models such as causal+ consistency [15]. Other recent work [14, 3] describes systems for evaluating the freshness or consistency of data returned by a storage system.

``One can often make reasonable decisions in the absence of perfect answers.'' -- Agarwal et al., BlinkDB

BlinkDB [1] is a statistical database that allows a tradeoff between query accuracy and response time. While this is not the same as a freshness/latency tradeoff, a key motivating observation is that many applications can effectively use approximate results generated by sampling the data set. For instance, it may be possible to detect a misbehaving server by looking at only a small sample of the server logs. This is similar to the observation in LazyBase that many queries can be effectively satisfied with stale data. Other databases provide similar capabilities through approximate query processing [8] and online aggregation [10].

6.1 Parallel machine learning

Like many fields, the machine learning community has been grappling with the problem of scaling its algorithms to larger systems. Mahout [4] is a set of libraries for running machine learning algorithms on top of Hadoop MapReduce. Recognizing that MapReduce is not always the best programming model for iterative machine learning, systems like GraphLab [11, 12] and Spark [17] provide different programming models, and much greater efficiency, for machine learning computations. Piccolo [13] is one such system that provides the abstraction of a large shared table: a natural programming model for many machine learning algorithms.

As in the database community, there has been interest in achieving scalability by relaxing consistency requirements. Specifically, this has taken the form of developing algorithms that require little synchronization between threads and are resilient to errors and staleness in their temporary data. Bradley et al. [6] describe a parallel algorithm, ``Shotgun'' (an extension of the sequential algorithm ``Shooting''), for performing coordinate descent in parallel on a shared-memory machine. They show that even without explicitly synchronizing the iterations of different threads, their algorithm quickly converges to the correct answer.

7 Plan, status, and timeline

I have completed the first case study (section 2) and published the results at EuroSys 2012 [7]. This section will focus on the second case study (section 3) and the qualitative studies (section 5).

7.1 LazyTables

Like the evaluation plan for LazyTables (section 3), I have divided the implementation plan into three phases.

Phase 1

The goals of the first phase, which I have already started, are to determine the requirements of a system to support the target machine learning applications, develop an API for the system, and collect usage data from real algorithms. I have been meeting with machine learning researchers to learn about their requirements and how they currently build their applications. We have agreed on a first version of the API, and I have developed a simple prototype in-memory, single-machine version of it in C++. This prototype allows me to collect traces of the application's accesses to the table data structure.

Phase 2

The goal of the second phase is to begin experimenting with staleness in machine learning applications. To do so, I will need to modify the prototype table implementation so that I can inject staleness into the system. This will allow me to validate that the machine learning algorithms really do work when query results are stale, and to determine the effects of this on the convergence rate of the algorithms.

Phase 3

The goal of the third phase is to determine how the design of LazyTables can improve the performance of machine learning applications. While a full implementation of LazyTables is probably unnecessary for this dissertation, individual techniques will need to be implemented and tested to show the performance improvements they provide, as well as the staleness they introduce into the system. The specific techniques that I will consider are pre-aggregating updates at each of the batching stages, logging updates to disk vs. update-in-place, and forcing a flush of updates from other processes or threads.

7.2 Timeline

October: Present the proposal. Finish gathering data for Phase 1 of LazyTables. Begin Phase 2 of LazyTables: inject staleness and see how the application reacts. Begin implementing a partial Phase 3, to show that we can actually improve throughput in a realistic setting.

November: Complete Phase 2 of LazyTables. Phase 3 of LazyTables. Work more on writing.

December: Write up LazyTables.

Spring semester: Tie up loose ends for LazyTables. Polish the written document.

References

[1] S. Agarwal, A. Panda, B. Mozafari, S. Madden, and I. Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. CoRR, 2012.

[2] A. S. Aiyer, E. Anderson, X. Li, M. A. Shah, and J. J. Wylie. Consistability: Describing usually consistent systems. In Proceedings of the Fourth Workshop on Hot Topics in System Dependability (HotDep '08). USENIX Association, 2008.

[3] E. Anderson, X. Li, M. A. Shah, J. Tucek, and J. J. Wylie. What consistency does your key-value store actually provide? In Proceedings of the Sixth Workshop on Hot Topics in System Dependability (HotDep '10), pages 1--16. USENIX Association, 2010.

[4] Apache Mahout. http://mahout.apache.org/.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, March 2003.

[6] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning (ICML '11). ACM, June 2011.

[7] J. Cipar, G. Ganger, K. Keeton, C. B. Morrey, III, C. A. Soules, and A. Veitch. LazyBase: Trading freshness for performance in a scalable database. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, 2012.

[8] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01). Morgan Kaufmann, 2001.

[9] L. Golab and T. Johnson. Consistency in a stream warehouse. In Fifth Biennial Conference on Innovative Data Systems Research (CIDR 2011), Asilomar, CA, USA, January 2011.

[10] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD '97). ACM, 1997.

[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

[12] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. PVLDB, 2012.

[13] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), pages 1--14. USENIX Association, 2010.

[14] M. R. Rahman, W. Golab, A. AuYoung, K. Keeton, and J. J. Wylie. Toward a principled framework for benchmarking consistency. In Proceedings of the 8th Workshop on Hot Topics in System Dependability (HotDep '12). USENIX Association, 2012.

[15] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, 2011.

[16] D. Terry. Replicated data consistency explained through baseball. Technical report, Microsoft Research, October 2011.

[17] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12). USENIX Association, 2012.


More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

CS 6453: Parameter Server. Soumya Basu March 7, 2017

CS 6453: Parameter Server. Soumya Basu March 7, 2017 CS 6453: Parameter Server Soumya Basu March 7, 2017 What is a Parameter Server? Server for large scale machine learning problems Machine learning tasks in a nutshell: Feature Extraction (1, 1, 1) (2, -1,

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Apache Flink- A System for Batch and Realtime Stream Processing

Apache Flink- A System for Batch and Realtime Stream Processing Apache Flink- A System for Batch and Realtime Stream Processing Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich Prof Dr. Matthias Schubert 2016 Introduction to Apache Flink

More information

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,

More information

Scaling Distributed Machine Learning with the Parameter Server

Scaling Distributed Machine Learning with the Parameter Server Scaling Distributed Machine Learning with the Parameter Server Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su Presented

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts

Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts Michael Beckerle, ChiefTechnology Officer, Torrent Systems, Inc., Cambridge, MA ABSTRACT Many organizations

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

CS 655 Advanced Topics in Distributed Systems

CS 655 Advanced Topics in Distributed Systems Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3

More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud

Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud Sudipto Das, Divyakant Agrawal, Amr El Abbadi Report by: Basil Kohler January 4, 2013 Prerequisites This report elaborates and

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Epilogue. Thursday, December 09, 2004

Epilogue. Thursday, December 09, 2004 Epilogue Thursday, December 09, 2004 2:16 PM We have taken a rather long journey From the physical hardware, To the code that manages it, To the optimal structure of that code, To models that describe

More information

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Data Centers and Cloud Computing. Slides courtesy of Tim Wood Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013 PNUTS: Yahoo! s Hosted Data Serving Platform Reading Review by: Alex Degtiar (adegtiar) 15-799 9/30/2013 What is PNUTS? Yahoo s NoSQL database Motivated by web applications Massively parallel Geographically

More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

S-Store: Streaming Meets Transaction Processing

S-Store: Streaming Meets Transaction Processing S-Store: Streaming Meets Transaction Processing H-Store is an experimental database management system (DBMS) designed for online transaction processing applications Manasa Vallamkondu Motivation Reducing

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Important Lessons. Today's Lecture. Two Views of Distributed Systems

Important Lessons. Today's Lecture. Two Views of Distributed Systems Important Lessons Replication good for performance/ reliability Key challenge keeping replicas up-to-date Wide range of consistency models Will see more next lecture Range of correctness properties L-10

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

Managing IoT and Time Series Data with Amazon ElastiCache for Redis Managing IoT and Time Series Data with ElastiCache for Redis Darin Briskman, ElastiCache Developer Outreach Michael Labib, Specialist Solutions Architect 2016, Web Services, Inc. or its Affiliates. All

More information

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open

More information

Eventual Consistency Today: Limitations, Extensions and Beyond

Eventual Consistency Today: Limitations, Extensions and Beyond Eventual Consistency Today: Limitations, Extensions and Beyond Peter Bailis and Ali Ghodsi, UC Berkeley - Nomchin Banga Outline Eventual Consistency: History and Concepts How eventual is eventual consistency?

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Data Centers and Cloud Computing

Data Centers and Cloud Computing Data Centers and Cloud Computing CS677 Guest Lecture Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

VoltDB vs. Redis Benchmark

VoltDB vs. Redis Benchmark Volt vs. Redis Benchmark Motivation and Goals of this Evaluation Compare the performance of several distributed databases that can be used for state storage in some of our applications Low latency is expected

More information