Trading latency for freshness in storage systems

James Cipar
September 27, 2012

Abstract

Many storage systems have to provide extremely high-throughput updates and low-latency read queries. In practice, system designs that provide those capabilities often face a tradeoff between query latency, efficiency, and result freshness. In my dissertation, I will argue that systems should be designed to allow a per-query configuration of this tradeoff. I will use two case studies to demonstrate the value of doing so. The first is LazyBase, a database designed for high-throughput ingest of observational data. The second is LazyTables, a shared data structure designed to support parallel machine learning applications. In both cases, the term ``Lazy'' refers to the systems' procrastination: waiting to apply updates until they can be executed as efficiently as possible. This design decision creates the potential for staleness in the data, hence the need to study the tradeoff between freshness and performance. Additionally, I will describe a number of other applications where this tradeoff is potentially useful in system design.

1 Introduction

There is a large and diverse class of parallel applications that require extremely high-throughput updates to shared data structures. Two important subclasses are analytical databases and the shared data structures used by parallel machine learning algorithms.

Many applications in modern data collection and analytics require analyzing large data sets that grow and change at a high rate. These applications include business and purchase analytics, hardware monitoring, click-stream and web user tracking, and analysis of tweets and other social media. In addition to high update throughput, these applications require efficient queries over a reasonably up-to-date version of the database.

Many parallel computations, such as those used in machine learning, maintain large shared data structures. Commonly these data structures are sparse matrices or vectors of numbers. Each thread of execution performs many updates at a very high rate -- hundreds of thousands per thread per second -- interspersed with infrequent read accesses. For instance, in Latent Dirichlet Allocation, each thread reads a portion of a data set, then issues many increment and decrement operations to elements of a shared vector. There is no locality to these operations: any thread may update any part of the vector.

In both cases, query efficiency, low query latency, and fresh results are all desirable properties of a storage system designed to support these applications. However, many techniques for achieving high update throughput hurt at least one of these properties. For instance, batching many updates into a single group can vastly improve update throughput. However, it introduces a latency between when an update is submitted and when the resulting batch is committed. Furthermore, a large batch may take longer to process, introducing additional latency. This technique, like other similar ones, means that the design of storage systems involves a tradeoff between the latency and efficiency of query results and the freshness (or, inversely, the staleness) of the data that is returned.

For many applications, different queries have different desires regarding this tradeoff, even when the queries access the same data set. In the example of an analytical database storing Twitter data, an application trying to report the ``hot news'' that people are currently talking about has a freshness requirement on the order of a few minutes; another application analyzing the structure of the social network graph might have a freshness requirement on the order of days. In a parallel computation system, queries at the beginning of a computation may be able to make progress even with highly stale results, but need very accurate and fresh results to make small refinements at the end of the computation. Such diverse applications can benefit from storage systems that allow the application to make decisions regarding this tradeoff at query time.

Existing solutions typically provide only a single point on this tradeoff. In the database example, there are three types of existing solutions, each of which has settled on a different tradeoff. Relational databases (RDBMSs) -- generally accessed via the SQL query language -- provide a well-defined consistency model and very up-to-date results. However, the cost of this is relatively inefficient update and query performance, and limited scalability. Data warehouses -- sometimes called analytical databases -- are designed for extremely high-performance queries, at the cost of very inefficient updates. To mitigate this, updates are performed in large batches, a technique that introduces staleness on the order of hours or even days. Lastly, NoSQL databases drop the strong consistency of RDBMSs in exchange for high performance and scalability. However, these systems can be difficult to program because they provide no guarantees about either the consistency or freshness of the data.

In this thesis, I will argue for system designs that allow applications to manage this tradeoff at query time. I will demonstrate that freshness requirements are often a property of the query, not the data set or even the application, and that making the tradeoff explicit enables greater overall efficiency.

Thesis Statement

Storage systems can and should provide for per-query configuration of the tradeoff between data freshness and query latency and efficiency.

To support this thesis, I will show that (a) techniques to improve performance often introduce staleness in data, (b) it is often possible to do work during a query's execution to avoid this staleness, and (c) there is value in creating systems that allow applications to do so. I will demonstrate these points in the context of two case studies in applying this tradeoff to real storage systems. The first case study (section 2) examines a database that is designed to support continuous, high-throughput updates and inserts while also allowing efficient analytical queries over the data. The second case study (section 3) is a data management system for large-scale, data-intensive machine learning. Additionally, I will describe a set of design principles (section 4) that apply to systems exploiting this latency/freshness tradeoff. I will discuss how these design principles could apply to other domains where the tradeoff is potentially useful (section 5), such as sensor networks, geographically distributed log collection, and data-intensive scientific computing.
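To make the idea of per-query configuration concrete before the case studies, the following is a minimal C++ sketch of what a freshness-aware query interface might look like from the application's side. The names and types here (FreshnessBound, Database::query) are hypothetical illustrations of the thesis statement, not the actual LazyBase or LazyTables API.

    #include <chrono>
    #include <string>
    #include <vector>

    // Hypothetical client-side interface: each query carries its own freshness
    // bound, so the storage system can decide how much extra work to do on the
    // read path for this particular query.
    struct FreshnessBound {
        std::chrono::seconds max_staleness;  // results may omit newer updates
    };

    struct Row {
        std::vector<std::string> columns;
    };

    class Database {
    public:
        // A query that tolerates stale data can be answered cheaply from
        // already-processed state; a query demanding very fresh data forces the
        // system to also consult not-yet-processed updates, at higher latency.
        std::vector<Row> query(const std::string& predicate, FreshnessBound bound) {
            (void)predicate;
            (void)bound;
            return {};  // stub: a real system would plan the read path here
        }
    };

    int main() {
        Database db;
        // "Hot news" needs results at most a few minutes stale ...
        db.query("topic = 'news'", {std::chrono::minutes(5)});
        // ... while social-graph analysis tolerates day-old data.
        db.query("type = 'follow'", {std::chrono::hours(24)});
        return 0;
    }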

2 First case study: LazyBase

The first case study examines LazyBase [7], a database designed to support applications that extract knowledge from large, rapidly changing data sets. These applications require a database that supports continuous and extremely high-throughput update and insert operations, as well as efficient queries over large parts of the data set. Furthermore, given the fast-paced nature of modern data analytics, these applications sometimes require queries to be answered with a very fresh view of the data, but often can use slightly older versions.

Existing solutions fall short on one or more of the requirements of these data analytics applications. Relational databases (often referred to as OLTP databases) provide a fresh view of the data and efficient queries for specific items. However, their performance is often insufficient for workloads with very high-throughput updates and large analytic queries. In response to these shortcomings, data warehouses (a.k.a. ``analytical databases'') provide very efficient support for queries over large parts of a data set. However, they are typically loaded by batch jobs that run infrequently, meaning that query results reflect a very stale version of the data. The new and growing class of NoSQL databases attempts to solve the problems of scaling relational databases to large ``web scale'' workloads. These databases focus on excellent scale-out performance. To achieve this scalability they typically provide weak consistency guarantees, a choice motivated by the CAP theorem. While these systems can handle the throughput requirements of many analytical applications, their consistency semantics can be difficult for application programmers to reason about.

2.1 Motivation and Design

To achieve the efficiency and scalability necessary for these applications, LazyBase makes extensive use of batching and pipelining for update and insert operations. However, these techniques introduce a long delay between when an update is submitted by a client and when that update is finally applied to the main database. In extreme cases, this delay may be minutes or even hours. This means that queries answered from the main database tables may return data that is seconds to minutes old. While such a delay may be acceptable for some queries, others require more up-to-date results.

Table 1 shows a variety of application families from a wide variety of application domains. Observe that different applications accessing the same data may have vastly different freshness requirements. Additionally, there is a general trend: applications that require fresher data often look at less of the data set. For instance, in the retail domain, the freshest data is required by real-time coupons and targeted ads. Both of these applications are only concerned with a single user's activity. On the other hand, the applications that can use stale data (product search, trending, earnings reports) need to look at data across a range of users.

Using this observation, LazyBase presents application programmers with a configurable per-query tradeoff between freshness and latency. This tradeoff is made possible by LazyBase's pipelined architecture, depicted in Figure 1. LazyBase clients connect to an ingest server and upload a batch of updates. This batch, possibly grouped with other batches, is used to create a Self-Consistent Update file (SCU), which contains instructions for inserting, updating, and deleting rows of data. Before being applied to the main database file (the Authority Tables), the SCU goes through a series of stages that are designed to make the data easier to query. These stages include ID-remapping (a technique for efficiently handling foreign keys), sorting, and finally a merge operation that eventually updates the Authority Tables.
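As a rough illustration of how this pipeline enables the per-query tradeoff, the sketch below models each SCU as carrying its current pipeline stage and shows how a query's freshness requirement determines which representations it must consult: the Authority Tables alone for relaxed queries, plus sorted but unmerged SCUs for fresher ones, plus a scan of unsorted SCUs for the freshest results (the query-side behavior is described in more detail below). The types and the planning function are invented for illustration and greatly simplify the actual design in [7].

    #include <cstdint>
    #include <vector>

    // Each batch of updates (an SCU) advances through the pipeline stages.
    enum class Stage { Ingested, IdRemapped, Sorted, MergedIntoAuthority };

    struct Scu {
        std::uint64_t id;  // monotonically increasing batch identifier
        Stage stage;       // how far this batch has progressed
    };

    // The data a query must consult, given how much staleness it tolerates.
    struct ReadPlan {
        bool read_authority_tables = true;         // always: sorted and indexed
        std::vector<std::uint64_t> sorted_scus;    // fresher; binary-searchable
        std::vector<std::uint64_t> unsorted_scus;  // freshest; must be scanned
    };

    // required_id is the newest batch whose updates the query must observe,
    // derived from the query's staleness bound. Batches newer than required_id
    // can be ignored; batches at or below it that have not yet been merged must
    // be read from intermediate pipeline output, making the query slower but
    // its answer fresher.
    ReadPlan plan_read(const std::vector<Scu>& pending, std::uint64_t required_id) {
        ReadPlan plan;
        for (const Scu& scu : pending) {
            if (scu.id > required_id) continue;                     // newer than needed
            if (scu.stage == Stage::MergedIntoAuthority) continue;  // already cheap
            if (scu.stage == Stage::Sorted) {
                plan.sorted_scus.push_back(scu.id);
            } else {
                plan.unsorted_scus.push_back(scu.id);               // full scan required
            }
        }
        return plan;
    }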

Table 1: Freshness requirements for application families from a variety of domains.

  Retail -- seconds: real-time coupons, targeted ads and suggestions; minutes: just-in-time inventory management; hours: product search, trending, earnings reports.
  Social networking -- seconds: message list updates, friend/follower list changes; minutes: wall posts, photo sharing, news updates and trending; hours: social graph analysis.
  Transportation -- seconds: emergency response, air traffic control; minutes: real-time traffic maps, bus/plane arrival prediction; hours: traffic engineering, bus route planning.
  Investment -- seconds: real-time micro-trades, stock tickers; minutes: web-delivered graphs; hours: trend analyses, growth reports.
  Enterprise information management -- seconds: infected machine identifications; minutes: file-based policy violations; hours: enterprise search results, e-discovery requests.
  Data center and network monitoring -- seconds: automated problem detection and diagnosis; minutes: human-driven diagnosis, online performance charting; hours: capacity planning, availability analyses, workload characterization.

Figure 1: LazyBase pipeline.

In this pipeline, there is a potentially long delay (e.g., seconds to minutes) between a client submitting an SCU and that SCU being merged into the Authority Tables. This delay depends on a number of factors, including the number of servers, the SCU size, and the complexity of the indexes and foreign keys. To allow applications to access fresher versions of the data, LazyBase allows queries to read the intermediate data produced by each pipeline stage. For instance, a query may read the Authority Tables as well as the sorted SCUs that have not yet been merged. This provides a fresher view of the data, but because the sorted files have not been merged yet, the query processing operation is slower and more expensive. Thus, when accessing intermediate data, queries are less efficient and have a longer latency. The extreme case is queries that require completely up-to-date results and therefore must access the unsorted SCUs. Because these SCUs are unsorted and unindexed, there is no efficient way to find a particular data value (or range of values). Hence, a query that accesses unsorted SCUs must scan all of the unsorted files to find all relevant data.

2.2 Evaluation

The primary goals of the LazyBase evaluation are to motivate the use of large batches (and thus the increase in update latency), and to show that LazyBase can provide fresher results at query time, at a cost of query latency. This section presents some key results from [7], with more details available in that paper.

Figure 2 shows the overall ingest performance of LazyBase with varying batch sizes. This experiment demonstrates the potential improvement in throughput that can be achieved by batching. It shows that very large batches can be used for significant performance improvements.

Figure 2: Inserts per second for various SCU sizes.

However, such large batches can take a long time to process from start to finish. Table 2 shows the performance of each stage of the LazyBase pipeline in terms of the SCU size. With an SCU size of 2k rows, the total latency from the start to the finish of the pipeline is over 10 minutes. This high latency may be unacceptable for some applications, necessitating the ability to query data that is still in the pipeline.

Figure 3 demonstrates the performance of queries that access intermediate data from the pipeline. For this experiment, an upload client loaded a large volume of data into LazyBase over the course of about 10 minutes. During that time, query clients submitted queries with different freshness requirements and measured the overall latency of each query. The results show that LazyBase is able to provide low-latency queries using the Authority Tables, while also providing fresher data for queries that need it -- at a cost of increased latency.

Table 2: Performance of individual pipeline stages.
  Stage    k Rows/s
  Ingest       39
  Sort        158
  Merge       120

Figure 3: Query latency over time for a steady-state workload. Sampled every 15 s and smoothed with a sliding-window average over the past minute.

3 Second case study: LazyTables

The second case study examines LazyTables, a distributed shared data structure for data-intensive machine learning applications. LazyTables provides the abstraction of a 2-dimensional table of values shared between all processes and threads of an application. In reality, the table may be spread across many servers, and the data may reside on disk instead of in main memory. This system is specifically designed for a class of machine learning applications (e.g., LDA, Shotgun) that require only loose synchronization between different threads. These applications typically proceed in iterations, where each iteration first reads a small number of rows from the table, then updates many individual values in the table. These updates are not local to a single row, but may be scattered across all rows of the table. Often these updates must be applied atomically, e.g., atomically incrementing one value while decrementing another. Furthermore, the operations are almost exclusively commutative and associative: increment and multiply are the most common.

Existing solutions

Previous work -- including Piccolo [13], GraphLab [11, 12], and Spark [17] -- attempts to provide high-performance distributed data structures for machine learning and other data analytics. Piccolo is the most similar in intent to LazyTables. In fact, the API for LazyTables was designed based on the Piccolo API, and is almost exactly the same. However, Piccolo -- like the other systems -- does not account for the extremely high write throughput required by many modern machine learning applications, and does not exploit their particular relaxations of consistency requirements.

GraphLab was specifically designed for large-scale machine learning, but focuses on graph algorithms. These algorithms model the problem using a large set of vertexes, each connected to a small number of other vertexes. When a vertex's value changes by more than a threshold, it triggers an update function for all neighboring vertexes. The computation continues until the result has stabilized. Spark provides an in-memory distributed data structure called a Resilient Distributed Dataset (RDD). Similar to LazyTables, RDDs are designed for high-performance data analytics. However, to achieve their fault-tolerance goals, RDDs allow only a restricted set of bulk operations on data, and specifically prohibit fine-grained updates. While some algorithms are naturally expressed as graph algorithms, many users report that there is a significant mismatch between their way of thinking about an algorithm and the model provided by GraphLab. This mismatch was the original reason that the machine learning research group approached us with this problem.

In all of these cases, there is still an assumption that writes must complete with low latency so that reads may return relatively up-to-date results. However, much recent work in machine learning, e.g. [6], demonstrates that algorithms can proceed with relatively little coupling between threads and a potentially stale view of the data. This suggests that there is room for a significant performance improvement by allowing for some staleness in the data.

3.1 Motivating experiments

To motivate the design of LazyTables, I have been experimenting with a parallel Latent Dirichlet Allocation [5] (LDA) implementation written by Qirong Ho. I provided Qirong with a simple in-memory table class that has compile-time options for thread safety and event tracing. By tracing the requests made to the table, I found that, out of approximately 30M requests, only 40 were row_read calls. The rest were all increment calls. Furthermore, in his algorithm, each thread only reads updated values for each row after many updates, meaning that the threads are already using quite out-of-date information. This provides a strong motivating example for LazyTables, as it shows the importance of update performance and the relative unimportance of read performance or freshness.

As further motivation, I decided to improve the performance of increment while causing row_read to return even staler data. This was based on the observation that the program ran substantially faster with a single thread in non-thread-safe mode. Based on the assumption that locking was slowing down the multi-threaded version, I modified the increment function to batch updates in thread-local storage. After 1024 updates have accumulated (or the client program calls flush), the updates are pushed to the shared table, locking only once to do so.
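As a rough illustration, the batching scheme just described can be sketched in C++ as follows: increments accumulate (and combine) in a per-thread buffer and are pushed to the shared table under a single lock once 1024 updates have accumulated or the application calls flush. This is a simplified reconstruction for illustration only, with one TableClient instance per thread standing in for thread-local storage; it is not the prototype's actual code.

    #include <cstddef>
    #include <map>
    #include <mutex>
    #include <utility>

    // A shared table of counts, e.g. topic/word counts in LDA. An element is
    // addressed by a (row, column) pair.
    class SharedTable {
    public:
        void apply_batch(const std::map<std::pair<int, int>, long>& deltas) {
            std::lock_guard<std::mutex> guard(mutex_);  // one lock per batch,
            for (const auto& entry : deltas) {          // not one per increment
                cells_[entry.first] += entry.second;
            }
        }

        long read(int row, int col) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = cells_.find(std::make_pair(row, col));
            return it == cells_.end() ? 0 : it->second;
        }

    private:
        std::mutex mutex_;
        std::map<std::pair<int, int>, long> cells_;
    };

    // Per-thread client handle (standing in for thread-local storage here):
    // increments are combined locally and flushed in batches, so other threads
    // see the writes late (staler reads), but the writer never takes a lock
    // for an individual increment.
    class TableClient {
    public:
        explicit TableClient(SharedTable& table, std::size_t batch_limit = 1024)
            : table_(table), batch_limit_(batch_limit) {}

        void increment(int row, int col, long delta) {
            pending_[std::make_pair(row, col)] += delta;  // combines same-cell updates
            if (++pending_count_ >= batch_limit_) {
                flush();
            }
        }

        void flush() {
            if (pending_.empty()) return;
            table_.apply_batch(pending_);
            pending_.clear();
            pending_count_ = 0;
        }

    private:
        SharedTable& table_;
        std::size_t batch_limit_;
        std::size_t pending_count_ = 0;
        std::map<std::pair<int, int>, long> pending_;
    };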
Table 3 shows the results of these experiments, run on a 6-core virtual machine with 12 GB of RAM. Increment operations are so frequent in this application that locking the table structure slows down performance by a factor of 2 or more. Note that, as the number of threads increases, the cost of full synchronization goes up considerably. On the other hand, the performance of the batching implementation improves with more processors.

The log-likelihood column gives a measure of the quality of the result, with a higher number (closer to 0) indicating better results. While the absolute value is not meaningful with only 10 iterations, the relative values can be compared. These results suggest that batching with a batch size of 1024 has little effect on the convergence of the algorithm. In other words, this algorithm can tolerate some staleness in its read results.

Table 3: Runtime for 10 iterations of LDA with different synchronization methods (single-threaded, locking, and batching with a batch size of 1024, at increasing thread counts), reporting thread count, runtime in seconds, and log-likelihood.

3.2 Design

These experiments motivate a number of design requirements for the system. This section describes those requirements, as well as some open questions in the design of LazyTables.

The experiments from subsection 3.1 suggest that extremely low-latency, high-throughput writes are the most important design consideration. Even an operation as simple as locking a mutex would be prohibitively slow for every write. The need for low latency suggests that the client must not wait for the update operation to complete. This is achieved by a combination of batching and lazy writes. When a client thread submits an update, it is batched in thread-local storage. This means that the client does not have to invoke any expensive atomic operations for typical writes. When the batch is large enough -- or the client explicitly requests a flush operation -- the batch is pushed into the shared LazyTables data structure.

The high-throughput requirement motivates two other design decisions that may further increase staleness: combining updates and logging. Updates can be combined by clients to reduce the total number of updates that must be sent over expensive network links or written to stable storage. For instance, two increment operations that modify the same element of the table can be combined into one. When writing updates to stable storage, logging allows the system to avoid an expensive read-modify-write. Instead, the update is logged and the main data structures can be updated later as needed. In a system built on top of spinning magnetic disks, this also allows LazyTables to perform almost exclusively sequential I/O.

A proposed update pipeline is depicted in Figure 4. In this scheme, updates are sent through a series of buffering stages where they may be batched and merged with other updates. Eventually they are written to an on-disk log. This log can be periodically scanned to produce a new snapshot of the updated rows, which is also written to disk.

Figure 4: LazyTables update pipeline.

Using this design, a client has a number of options for satisfying read requests, depending on the desired freshness or latency (a sketch of this read path appears at the end of this subsection):

1. Read a locally cached version of the snapshot.
2. Read only the latest snapshot.
3. Read the latest snapshot and scan part (or all) of the log to get a newer value.
4. Read the latest snapshot and all of the log, and request that clients flush recent writes.

Consistency requirements

An important design consideration in LazyTables is the notion of ``staleness'' and how it is measured. In LazyBase, staleness is based on elapsed wall-clock time, because LazyBase is intended to be used to collect data from the real world. In the case of LazyTables, elapsed real time is not as meaningful, because it gives no indication of how much computation has occurred in the interim. One possibility would be to use the total number of batches written in the interim. But even that may not be meaningful to application programmers, as some threads may proceed faster than others. Based on the iterative structure of many of these algorithms, another possible way of measuring staleness is the iteration the data came from. For instance, an application could submit a query requesting ``all data as of 3 iterations ago''. This would allow for consistency requirements similar to those of data stream warehouses [9]. In this case, the results for each iteration could be considered ``closed'' if all results for that iteration are available, and ``open'' if some results are still pending.
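To make the proposed read path concrete, here is a minimal C++ sketch of the snapshot-plus-log structure for a single row, covering read options 2 and 3 above and using iteration count as the staleness measure. The interface is hypothetical and greatly simplified relative to the proposed design; in particular, an in-memory vector stands in for the on-disk log, and log entries are assumed to arrive in iteration order.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // One logged update: which column of this row changed, by how much, and
    // during which iteration of the algorithm the writer produced it.
    struct LoggedUpdate {
        int col;
        long delta;
        std::uint64_t iteration;
    };

    // A snapshot of a row, materialized from the log up to some iteration.
    struct RowSnapshot {
        std::uint64_t up_to_iteration = 0;
        std::map<int, long> values;  // column -> value
    };

    class LazyRow {
    public:
        // Writers append to the log; in this sketch, entries arrive in
        // non-decreasing iteration order.
        void append(const LoggedUpdate& u) { log_.push_back(u); }

        // Cheapest read (option 2): whatever the latest snapshot contains,
        // however stale it may be.
        const RowSnapshot& latest_snapshot() const { return snapshot_; }

        // Fresher read (option 3): start from the latest snapshot and replay
        // just enough of the log to include every update from iterations up to
        // required_iteration. A smaller bound means less log to scan (lower
        // latency, staler data); a larger bound means more work, fresher data.
        RowSnapshot read_as_of(std::uint64_t required_iteration) const {
            RowSnapshot result = snapshot_;
            if (result.up_to_iteration >= required_iteration) return result;
            for (const LoggedUpdate& u : log_) {
                if (u.iteration <= result.up_to_iteration) continue;  // in snapshot
                if (u.iteration > required_iteration) break;  // fresher than needed
                result.values[u.col] += u.delta;
            }
            result.up_to_iteration = required_iteration;
            return result;
        }

        // Periodic background work: fold the log into a new snapshot so that
        // future reads are cheap again.
        void compact(std::uint64_t through_iteration) {
            snapshot_ = read_as_of(through_iteration);
            std::vector<LoggedUpdate> remaining;
            for (const LoggedUpdate& u : log_) {
                if (u.iteration > through_iteration) remaining.push_back(u);
            }
            log_ = std::move(remaining);
        }

    private:
        RowSnapshot snapshot_;
        std::vector<LoggedUpdate> log_;  // stands in for the on-disk log
    };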

Experiments

The goal in evaluating LazyTables is to clearly show the value of providing a tradeoff between latency and freshness in the implementation. I will demonstrate that the design of the table implementation fits the needs of the ML applications: extremely high write throughput and acceptable read performance and freshness. Secondly, I will demonstrate that the latency/freshness tradeoff can be made at query time. Lastly, I will show the utility of this for some example machine learning algorithms, showing good performance and scalability with added resources. I have divided the experiments into three phases based on the implementation effort required.

Phase 1

In the first phase, which I have already started, I will gather log data about how an important machine learning algorithm accesses a data table. I am particularly interested in the frequency of each operation, as well as the potential for combining updates before creating a new snapshot of a row.

Phase 2

In the second phase, I will explore the impact of staleness on the convergence behavior of some ML algorithms. As the staleness increases, how does the convergence rate change? Additionally, this may be an opportunity to explore some of the alternative consistency models discussed above under Consistency requirements. Note that this phase does not require a complete implementation of LazyTables, but simply a way of artificially injecting staleness into a prototype implementation. An additional goal of these experiments is to understand how dynamic changes in the freshness requirement can improve the performance and convergence of the algorithms. For instance, when an algorithm starts it may not require particularly fresh data. However, when it is nearing the end of its execution, it may require much fresher data when making small changes to refine the result.

Phase 3

The third phase of experiments will explore the performance impact of the techniques described in subsection 3.2. I will measure the performance differences between logging updates and update-in-place. Additionally, I will measure the staleness of snapshot results, and what the freshness/latency tradeoff looks like in practice.

4 Design principles

In addition to the case studies supporting my thesis, another contribution of this work will be a set of design principles extracted from the LazyBase and LazyTables implementations. I will discuss how these design principles applied to both LazyBase and LazyTables, and how they might be applied to other systems with similar goals. These design principles fall into two broad categories: a set of techniques that can be used to exploit a tradeoff between latency and freshness, and a general classification of applications where these techniques are relevant.

The latency/freshness tradeoff generally appears in systems that require very high update throughput. To increase throughput, and to lower the write latency for the client, we use techniques like batching and logging. These techniques speed up writes from the client's perspective, but they increase the time between writing a value and that value being available for queries (i.e., the staleness). To mitigate this problem, queries that need fresher results can do more work to access intermediate data.

In the strictest sense, any application that does not perform atomic transactions with both reads and writes can use stale data. However, the degree of freshness that the system provides -- or the application expects -- is generally not specified. NoSQL systems like Cassandra try to service reads and writes as quickly as possible, but there is never any guarantee regarding the age of the data that is read. Applications often expect to read a ``current'' copy of the data, but have no way of specifying that to the system. I believe that application programmers should be made aware of this staleness and given an option to specify freshness requirements.

I have started developing a list of attributes that indicate that an application is a candidate for techniques that potentially increase staleness in exchange for improved throughput or latency. For instance, LazyBase was designed for ``observational data'', where the processes that write data are making observations of the outside world and writing them to the database. In this case, they never need to read the data that already exists in order to make a change. Another class is cases where the ``details'' of the data change quickly, but the high-level structure changes very slowly. For instance, with the Twitter data, individual tweets are added at a very high rate, but the social network graph is relatively static. Similarly, in the ML example, the application may frequently make small changes to a high-dimensional vector, but because of the high dimension, the overall location of the vector changes slowly.

5 Other potential domains

The dissertation will also include a qualitative discussion of how a tradeoff between latency and freshness is currently applied, or could be applied, to other application domains. I will discuss how the design principles presented in section 4 are relevant to these other systems. Primarily I will do this in the context of two application domains: data collection from sensor networks, and geographically distributed log collection. I will go into a fair amount of detail about these domains and how it could be beneficial for freshness to be a first-class consideration in their system designs. However, I will not do an actual implementation and set of experiments for these.

5.1 Sensor networks

Sensor networks typically consist of many small devices collecting data at a high rate. (Consider a ``smart building'' with temperature, humidity, noise, and light sensors for HVAC, door openings, rooms, windows, etc.) The network for such a system is, at best, a WiFi connection. These systems frequently collect more data than they can actually propagate over the network. Thus, applications are unable to get data that is both up-to-date and very fine-grained. Instead, they have to use either stale data or summary data. A potential solution is to upload stale, coarse-grained data to a central location, but many applications need better than that. For these, it should be possible to push some queries closer to the sensors themselves (or at least to gateway collection nodes) to access fresher or more detailed data. Doing so will increase the latency of the query, but provide more up-to-date results.

I have begun talking to two researchers who are working in this space: Diogo Gomes and Anthony Rowe. In this part of the thesis, I plan to continue these discussions and explore the state of the art in these types of sensor networks. I will describe how the design principles from section 4 are currently applied in the field and how they might be further applied.

5.2 Geographically distributed logs

A similar problem occurs in collecting logs from geographically distributed data centers. This is a case where I really need to find and talk to some experts. The problem, as I have envisioned it, is that we would like to query logs aggregated from many data centers around the world. Within a single data center, logs are sent to a central location for querying (perhaps a LazyBase-like system). However, bandwidth between the data centers is limited, and it may not be possible to have a single location with an up-to-date view of all logs. Similar to the sensor network case, there is a tradeoff between looking at a local but stale view of the data and running a distributed query across all data centers.

5.3 Honorable mentions

In addition to the previous two examples, there are a number of other applications that could potentially exploit a tradeoff between freshness and latency. They are worth mentioning, but in the interest of time I do not plan on exploring them in any detail. These include the collection and analysis of scientific data, such as telescope data and genomic data. In these cases the data sets are too large to send over the Internet. Instead, it may be possible to distribute summary data that can satisfy some queries quickly, while shipping code to the original data source to execute more detailed queries. Beyond data bottlenecks, there may be other iterative algorithms that demonstrate convergence in the face of stale data. For instance, the conjugate gradient method for solving linear systems can converge when there are errors in the calculation. This suggests that it may also converge if reads provide slightly out-of-date results.

6 Additional related work

The closest related work has been discussed in the two case study sections. This section discusses additional related work. While current systems are forced to deal with freshness to a degree, they generally provide limited control over freshness to applications. With the recent popularity of NoSQL databases and their loose consistency guarantees, there has been much interest in developing stricter definitions for the consistency provided by these systems. Some of this work has focused on developing a descriptive framework for these consistency models [16, 2]. Other work focuses on developing systems that have performance competitive with NoSQL systems, but provide stronger consistency models such as causal+ consistency [15]. Other recent work [14, 3] describes systems for evaluating the freshness or consistency of data returned by a storage system.

``One can often make reasonable decisions in the absence of perfect answers.'' -- Agarwal et al., BlinkDB

BlinkDB [1] is a statistical database that allows a tradeoff between query accuracy and response time. While this is not the same as a freshness/latency tradeoff, a key motivating observation is that many applications can effectively use approximate results generated by sampling the data set. For instance, it may be possible to detect a misbehaving server by looking at only a small sample of the server logs. This is similar to the observation in LazyBase that many queries can be effectively satisfied with stale data. Other databases provide similar capabilities through approximate query processing [8] and online aggregation [10].

6.1 Parallel machine learning

Like many fields, the machine learning community has been grappling with the problem of scaling its algorithms to larger systems. Mahout [4] is a set of libraries for running machine learning algorithms on top of Hadoop MapReduce. Recognizing that MapReduce is not always the best programming model for iterative machine learning, systems like GraphLab [11, 12] and Spark [17] provide different programming models, and much greater efficiency, for machine learning computations. Piccolo [13] is one such system that provides the abstraction of a large shared table: a natural programming model for many machine learning algorithms.

As in the database community, there has been interest in achieving scalability by relaxing consistency requirements. Specifically, this has taken the form of developing algorithms that require little synchronization between threads and are resilient to errors and staleness in their temporary data. Bradley et al. [6] describe a parallel algorithm, ``Shotgun'' (an extension of the sequential algorithm ``Shooting''), for performing coordinate descent in parallel on a shared-memory machine. They show that even without explicitly synchronizing the iterations of different threads, their algorithm quickly converges to the correct answer.

7 Plan, status, and timeline

I have completed the first case study (section 2) and published the results at EuroSys 2012 [7]. This section will focus on the second case study (section 3) and the qualitative studies (section 5).

7.1 LazyTables

Like the evaluation plan for LazyTables (section 3), I have divided the implementation plan into three phases.

Phase 1

The goals of the first phase, which I have already started, are to determine the requirements of a system to support the target machine learning applications, develop an API for the system, and collect usage data from real algorithms. I have been meeting with machine learning researchers to learn about their requirements and how they currently build their applications. We have agreed on a first version of the API, and I have developed a simple prototype in-memory, single-machine version of it in C++. This prototype allows me to collect traces of the application's accesses to the table data structure.

Phase 2

The goal of the second phase is to begin experimenting with staleness in machine learning applications. To do so, I will need to modify the prototype table implementation so that I can inject staleness into the system. This will allow me to validate that the machine learning algorithms really do work when query results are stale, and to determine the effects of this on the convergence rate of the algorithms.

Phase 3

The goal of the third phase is to determine how the design of LazyTables can improve the performance of machine learning applications. While a full implementation of LazyTables is probably unnecessary for this dissertation, individual techniques will need to be implemented and tested to show the performance improvements they provide, as well as the staleness they introduce into the system. The specific techniques that I will consider are pre-aggregating updates at each of the batching stages, logging updates to disk vs. update-in-place, and forcing a flush of updates from other processes or threads.

7.2 Timeline

October: Present the proposal. Finish gathering data for Phase 1 of LazyTables. Begin Phase 2 of LazyTables: inject staleness and see how the application reacts. Begin implementing a partial Phase 3, to show that we can actually improve throughput in a realistic setting.

November: Complete Phase 2 of LazyTables. Phase 3 of LazyTables. Work more on writing.

December: Write up LazyTables.

Spring semester: Tie up loose ends for LazyTables. Polish the written document.

References

[1] S. Agarwal, A. Panda, B. Mozafari, S. Madden, and I. Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. CoRR, 2012.

[2] A. S. Aiyer, E. Anderson, X. Li, M. A. Shah, and J. J. Wylie. Consistability: Describing usually consistent systems. In Proceedings of the Fourth Workshop on Hot Topics in System Dependability (HotDep '08). USENIX Association, 2008.

[3] E. Anderson, X. Li, M. A. Shah, J. Tucek, and J. J. Wylie. What consistency does your key-value store actually provide? In Proceedings of the Sixth Workshop on Hot Topics in System Dependability (HotDep '10), pages 1--16. USENIX Association, 2010.

[4] Apache Mahout. http://mahout.apache.org/.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, March 2003.

[6] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning (ICML '11). ACM, June 2011.

[7] J. Cipar, G. Ganger, K. Keeton, C. B. Morrey, III, C. A. Soules, and A. Veitch. LazyBase: Trading freshness for performance in a scalable database. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, 2012.

[8] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01). Morgan Kaufmann, 2001.

[9] L. Golab and T. Johnson. Consistency in a stream warehouse. In Fifth Biennial Conference on Innovative Data Systems Research (CIDR 2011), Asilomar, CA, USA, January 2011.

[10] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD '97). ACM, 1997.

[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

[12] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. PVLDB, 2012.

[13] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), pages 1--14. USENIX Association, 2010.

[14] M. R. Rahman, W. Golab, A. AuYoung, K. Keeton, and J. J. Wylie. Toward a principled framework for benchmarking consistency. In Proceedings of the 8th Workshop on Hot Topics in System Dependability (HotDep '12). USENIX Association, 2012.

[15] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, 2011.

[16] D. Terry. Replicated data consistency explained through baseball. Technical report, Microsoft Research, October 2011.

[17] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12). USENIX Association, 2012.


More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

CS 6453: Parameter Server. Soumya Basu March 7, 2017

CS 6453: Parameter Server. Soumya Basu March 7, 2017 CS 6453: Parameter Server Soumya Basu March 7, 2017 What is a Parameter Server? Server for large scale machine learning problems Machine learning tasks in a nutshell: Feature Extraction (1, 1, 1) (2, -1,

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Apache Flink- A System for Batch and Realtime Stream Processing

Apache Flink- A System for Batch and Realtime Stream Processing Apache Flink- A System for Batch and Realtime Stream Processing Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich Prof Dr. Matthias Schubert 2016 Introduction to Apache Flink

More information

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,

More information

Scaling Distributed Machine Learning with the Parameter Server

Scaling Distributed Machine Learning with the Parameter Server Scaling Distributed Machine Learning with the Parameter Server Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su Presented

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts

Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts Michael Beckerle, ChiefTechnology Officer, Torrent Systems, Inc., Cambridge, MA ABSTRACT Many organizations

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

CS 655 Advanced Topics in Distributed Systems

CS 655 Advanced Topics in Distributed Systems Presented by : Walid Budgaga CS 655 Advanced Topics in Distributed Systems Computer Science Department Colorado State University 1 Outline Problem Solution Approaches Comparison Conclusion 2 Problem 3

More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud

Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud Workshop Report: ElaStraS - An Elastic Transactional Datastore in the Cloud Sudipto Das, Divyakant Agrawal, Amr El Abbadi Report by: Basil Kohler January 4, 2013 Prerequisites This report elaborates and

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Epilogue. Thursday, December 09, 2004

Epilogue. Thursday, December 09, 2004 Epilogue Thursday, December 09, 2004 2:16 PM We have taken a rather long journey From the physical hardware, To the code that manages it, To the optimal structure of that code, To models that describe

More information

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Data Centers and Cloud Computing. Slides courtesy of Tim Wood Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013 PNUTS: Yahoo! s Hosted Data Serving Platform Reading Review by: Alex Degtiar (adegtiar) 15-799 9/30/2013 What is PNUTS? Yahoo s NoSQL database Motivated by web applications Massively parallel Geographically

More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

S-Store: Streaming Meets Transaction Processing

S-Store: Streaming Meets Transaction Processing S-Store: Streaming Meets Transaction Processing H-Store is an experimental database management system (DBMS) designed for online transaction processing applications Manasa Vallamkondu Motivation Reducing

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Important Lessons. Today's Lecture. Two Views of Distributed Systems

Important Lessons. Today's Lecture. Two Views of Distributed Systems Important Lessons Replication good for performance/ reliability Key challenge keeping replicas up-to-date Wide range of consistency models Will see more next lecture Range of correctness properties L-10

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

Managing IoT and Time Series Data with Amazon ElastiCache for Redis Managing IoT and Time Series Data with ElastiCache for Redis Darin Briskman, ElastiCache Developer Outreach Michael Labib, Specialist Solutions Architect 2016, Web Services, Inc. or its Affiliates. All

More information

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open

More information

Eventual Consistency Today: Limitations, Extensions and Beyond

Eventual Consistency Today: Limitations, Extensions and Beyond Eventual Consistency Today: Limitations, Extensions and Beyond Peter Bailis and Ali Ghodsi, UC Berkeley - Nomchin Banga Outline Eventual Consistency: History and Concepts How eventual is eventual consistency?

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Data Centers and Cloud Computing

Data Centers and Cloud Computing Data Centers and Cloud Computing CS677 Guest Lecture Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

VoltDB vs. Redis Benchmark

VoltDB vs. Redis Benchmark Volt vs. Redis Benchmark Motivation and Goals of this Evaluation Compare the performance of several distributed databases that can be used for state storage in some of our applications Low latency is expected

More information