Grand Challenge: Scalable Stateful Stream Processing for Smart Grids

Raul Castro Fernandez, Matthias Weidlich, Peter Pietzuch
Imperial College London
{rc311, m.weidlich, prp}@imperial.ac.uk

Avigdor Gal
Technion - Israel Institute of Technology
avigal@ie.technion.ac.il

ABSTRACT

We describe a solution to the ACM DEBS Grand Challenge 2014, which evaluates event-based systems for smart grid analytics. Our solution follows the paradigm of stateful data stream processing and is implemented on top of the SEEP stream processing platform. It achieves high scalability through massive data-parallel processing and the option of performing semantic load-shedding. In addition, our solution is fault-tolerant: the large processing state of stream operators is not lost after failure. Our experimental results show that our solution processes one month's worth of data for the houses in the challenge dataset within hours. When we scale out the system, the processing time reduces linearly until the system bottlenecks at the data source. We then apply semantic load-shedding, maintaining a low median prediction error and reducing the time further to 17 minutes. The system achieves these results with median latencies below 3 ms and a 90th percentile below 5 ms.

1. INTRODUCTION

The goal of the ACM DEBS Grand Challenge is to conduct a comparative evaluation of event-based systems by offering real-life event data and requirements for event queries. The 2014 edition of the challenge [15] focuses on smart grid analytics. The challenge is based on measurements of energy consumption, expressed as instantaneous load and cumulative work, at the level of individual electricity plugs. These measurements are synthesised from real-world profiles derived from a set of smart home installations. The event queries focus on two types of analytics: (i) short-term load forecasting and (ii) load statistics for real-time demand management.

We observe three main characteristics of the 2014 challenge:

High data volume. The data includes load and work events of individual plugs at a rate of approximately one measurement per second. For the houses and plugs considered, this yields a volume of more than a billion events for one month. Given predicted future growth, applications of smart grid analytics can be expected to process data of hundreds to thousands of houses, increasing the volume by one or two orders of magnitude.

Unbounded and global state. The event queries require sophisticated handling of processing state. Short-term load forecasting relies on historical data, which requires aggregates to be maintained over an unbounded time window. This is challenging due to the ever-growing state to be stored: the performance of computation on this state is likely to degrade over time. Moreover, load statistics for real-time demand management require a global aggregation over all houses and plugs, which limits the options to distribute processing. Unbounded and global state also affect fault-tolerance mechanisms, because it becomes expensive or even impossible to recreate all state by reprocessing events.

Figure 1: Number of load measurements per five-minute window that indicate a change of more than five watts.

Large measurement variability. For the load and work values in the dataset, we observe a large variability in the frequency with which the values change over time. Some plugs report close to constant load for long periods of time, and significant changes in load are correlated between plugs, following global energy consumption patterns.
This effect is shown in Figure 1, which depicts the number of load measurements per five-minute window when ignoring changes of less than five watts. Considering only the events that indicate larger changes in load, the number of events that need to be processed varies by up to a factor of seven over time.

Given these characteristics, we present a solution to the challenge that is based on the SEEP stream processing platform [7]. SEEP executes event processing queries as stateful distributed dataflow graphs. It supports the definition of stream processing operators as user-defined functions (UDFs). The main features of our solution using SEEP are as follows:

1. Data-parallel processing. To handle the high volume of events, our solution scales the processing of events in a data-parallel fashion on a cluster of nodes.

2. Optimised stateful operators. Given the complex state of the event queries, our solution exploits stream operators with efficient state handling specific to a given query, e.g. through indexed in-memory data structures.

3. Filtering and elasticity. We exploit the long periods of relatively constant load measurements in the dataset by performing semantic load-shedding, thus reducing the total number of events to process downstream. To support resource-efficient deployments when the input event rate varies over time, our solution can dynamically provision processing resources on demand.

4. Fault tolerance. Our solution supports fault-tolerant processing, which is crucial for any continuously running data analytics application on a cluster of nodes. Instead of reprocessing all events after a failure, operator state is recovered from periodic state checkpoints with low overhead.

Our experimental evaluation shows that our implementation processes the challenge dataset with a throughput of 3, events per second for the load forecasting and 1, events per second for the outlier detection, with median latencies of 17 ms and 136 ms, respectively. The resulting speed-up over real-time processing is larger for load forecasting than for outlier detection. Our solution also scales linearly in the number of cluster nodes used for computation. With 6 (load forecasting) and 7 (outlier detection) nodes, the speed-up over real time increases up to 13 and 9, respectively. Moreover, we show that semantic load-shedding leads to a modest median error in the query results, but increases the speed-up by two orders of magnitude. With this set-up, our solution processes one month's worth of data in 17 minutes.

The remainder of the paper is structured as follows. In Section 2, we give an overview of the SEEP stream processing platform used by our solution. Section 3 gives details on the implemented operators. Section 4 presents our evaluation results. We discuss related work in Section 5, before concluding in Section 6.

2. THE SEEP PLATFORM

Our solution follows the paradigm of stateful stream processing because it provides a natural way to implement the proposed queries: it permits the definition of custom state and its manipulation by UDFs. Event processing systems such as Esper [1] and SASE [3] provide high-level query languages but lack the possibility to fine-tune the data structures used to maintain the query state. In contrast, stateful stream processing achieves efficient state handling for each specific scenario. Recently, a new generation of data-parallel stream processing systems based on dataflow graphs has been proposed, including Twitter Storm [9] and Apache S4 [11]. Although these systems allow for massive parallelisation of stream processing operators, they assume that dataflow graphs are static and operators are stateless: they cannot react to varying input rates or efficiently recover operator state after a failure. In contrast, the SEEP platform [7] implements a stateful stream processing model and can (i) dynamically partition operator state to scale out a query in order to increase processing throughput; and (ii) recover operator state after a failure, while maintaining deterministic execution. As a result, SEEP provides the following three features:

(1) SEEP is highly scalable in order to handle high-volume data. For the given challenge, this is an important feature because a realistic set-up for smart grid analytics would require the processing of events emitted by a much larger number of houses. Prior work [7] has shown that SEEP scales to close to a hundred virtual machines (VMs) in the Amazon EC2 public cloud. It achieves this through data parallelism: stateless and stateful operators are scaled out, i.e. multiple instances of an operator are deployed in the cluster, and each instance operates on a subset of the event data. Events are dispatched to instances based on query semantics, e.g. in the challenge dataset the event streams may be hash-partitioned by houses. In addition, SEEP exploits pipeline parallelism: chains of operators are deployed on different nodes to reduce latency.

(2) SEEP is fault tolerant, which is critical when operator state depends on a large number of past events. The load forecasting in the challenge relies on a model learned from historical data, and outlier detection employs windows with up to a million events.
Even under the assumption of a reliable event source with access to the full event stream, losing operator state would require reprocessing all events. Instead, SEEP creates periodic checkpoints of operator state, which are backed up to remote nodes and used to quickly recover state after a failure; a minimal, SEEP-agnostic sketch of this idea is given at the end of this section.

Figure 2: The dataflow graph of our solution.

(3) SEEP is elastic: it dynamically scales out stateful operators at runtime by partitioning their state. This functionality is particularly useful for event queries with high variability in the input rate. When pre-filtering events in the challenge to ignore minor changes in load, the input rate varies. SEEP can adapt to such workload changes, using cluster resources more efficiently.
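To illustrate the checkpointing behaviour described for feature (2), the following is a minimal, SEEP-agnostic sketch. It is not SEEP's actual API: the class, method names, and the backup callback are hypothetical. It shows the essential pattern of periodically serialising keyed operator state and handing the snapshot to a backup target (e.g., a remote node), so that state can be restored after a failure instead of being rebuilt by reprocessing all events.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Hypothetical sketch of periodic operator-state checkpointing; SEEP's real
// mechanism differs in its APIs and in how checkpoints are shipped and acknowledged.
public class CheckpointingOperator<K extends Serializable, V extends Serializable> {

    private final ConcurrentHashMap<K, V> state = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Take a checkpoint every `intervalSeconds`, pushing the serialised snapshot
    // to `backupTarget` (e.g., a callback that ships it to a remote node).
    public void start(long intervalSeconds, Consumer<byte[]> backupTarget) {
        scheduler.scheduleAtFixedRate(() -> backupTarget.accept(snapshot()),
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    // Called from the event-processing path to update per-key operator state.
    public void update(K key, V value) {
        state.put(key, value);
    }

    // Serialise a copy of the state map, decoupled from ongoing updates.
    private byte[] snapshot() {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new HashMap<>(state));
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Restore state from the latest checkpoint after a failure.
    @SuppressWarnings("unchecked")
    public void restore(byte[] checkpoint) {
        try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(checkpoint))) {
            state.clear();
            state.putAll((Map<K, V>) ois.readObject());
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("failed to restore checkpoint", e);
        }
    }
}
```

Events processed after the last checkpoint would still have to be replayed from upstream buffers; this is the role of the buffering mechanism mentioned later in the evaluation.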
3. QUERY IMPLEMENTATION

This section describes how we implemented the event queries from the challenge on the SEEP platform. We first give an overview of the main ideas behind the queries (Section 3.1), before we give details of the operator implementations (Sections 3.2-3.4).

3.1 Overview

The structure of the logical dataflow graph of the queries is shown in Figure 2. An operator filter performs semantic load-shedding across all load and work measurements (denoted by <input>). This permits, for example, filtering of events that indicate only a minor change in load for a certain plug. The filter operator can be scaled out so that different instances realise data parallelism by partitioning the event stream <input> per house, household, or even plug.

The actual queries are implemented by three operators, namely Q1, Q2 Plug, and Q2 Global. Load forecasting and outlier detection are independent queries, so their parallelisation is handled separately.

Load forecasting is realised by operator Q1 at two levels of aggregation, i.e., plugs and houses. Hence, the operator can be scaled out by partitioning the respective event stream according to the most coarse-grained aggregation, i.e., per house. Data-parallel processing is of particular importance for this operator because the query requires the maintenance of an unbounded time window. At the same time, the query also requires frequent updates of the result stream, i.e. every 3 seconds as specified by the timestamps of the events. When events are streamed faster than real time and distributed over a large number of operator instances, however, it becomes impossible to identify the intervals for updating the result stream at a particular instance. To solve this issue, we implement a heartbeat mechanism in the filter operator, which emits a signal to operator Q1 whenever an update is due. Heartbeat generation is implemented in the filter operator because the pre-processing of data is less costly than the actual load prediction; hence, the number of instances of the filter operator can be expected to be much smaller than the number of instances of operator Q1. To cope with data quality issues such as missing values, e.g. as visible around day 8 in Figure 1, operator Q1 also features a correction mechanism that is based on the measurements of cumulative work per plug, which is detailed below.

Outlier detection is split into two operators. Here, the idea is to separate the part of the query that can be parallelised from the part that requires global state. Operator Q2 Plug takes the input stream and maintains the median of the load per plug for each of the time windows. This operator can be scaled out at the level of plugs.
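To make this kind of data-parallel dispatching concrete, the sketch below hashes a measurement's identifiers (house, household, plug) to one of n operator instances, so that all events of one plug reach the same Q2 Plug instance, while a coarser house-level key suits Q1. The class and method names are ours for illustration and are not part of the SEEP API or the challenge data format.

```java
import java.util.Objects;

// Dispatches measurements to one of `numInstances` parallel operator instances.
// The key fields mirror the identifiers in the challenge data (house, household, plug).
final class PlugPartitioner {

    private final int numInstances;

    PlugPartitioner(int numInstances) {
        this.numInstances = numInstances;
    }

    // Plug-level key: all events of one plug are routed to the same Q2 Plug instance.
    int instanceForPlug(int houseId, int householdId, int plugId) {
        int key = Objects.hash(houseId, householdId, plugId);
        return Math.floorMod(key, numInstances);
    }

    // House-level key: coarser partitioning, suitable for Q1, which aggregates per house.
    int instanceForHouse(int houseId) {
        return Math.floorMod(Integer.hashCode(houseId), numInstances);
    }
}
```

Partitioning on the house identifier keeps all plugs of a house on one instance, which is what the per-house aggregation of Q1 requires; the finer plug-level key spreads load more evenly across Q2 Plug instances.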

Operator Q2 Global, in turn, maintains the global median over all plugs. It also realises the outlier detection and emits the results. Due to its global state, this operator cannot be scaled out. To reduce the amount of computation done at the singleton instance of operator Q2 Global, it relies not only on the median values computed by Q2 Plug, but also on information about which measurements entered or left one of the investigated time windows (denoted by <plug update> in Figure 2). As a consequence, a large part of the effort to maintain the time windows per plug is performed at operator Q2 Plug, which can be scaled out.

3.2 Filter

The filter operator realises the following functionality:

Duplicate elimination. To filter duplicate measurements, the operator maintains the timestamps of the last load and work measurements for each plug. Only measurements with a timestamp larger than the last observed one (per plug) are forwarded.

Variability-based filtering. To exploit the large variability in the frequency with which load values change over time, the filter operator can perform semantic load-shedding, ignoring measurements that denote only a minor change in load with respect to the last non-filtered measurement. In this case, measurements of work are only forwarded if the load measurement with the same timestamp has not been removed by the filter. In Section 4, we evaluate the trade-off between this type of filtering and the correctness of the query results.

Heartbeat generation. The aforementioned heartbeats are generated based on the timestamps of the processed events. Whenever an event with a timestamp larger than the time of the last heartbeat plus the heartbeat interval is received, a new heartbeat is emitted.
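The following sketch illustrates, under simplifying assumptions, the load path of the filter operator: per-plug duplicate elimination, load-shedding of measurements whose change in load is below a configurable threshold, and heartbeat emission. Names are illustrative (the plug key is assumed to be a precomputed global plug identifier), and the sketch omits the handling of work measurements, which the actual operator forwards only when the corresponding load measurement passed the filter.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the filter operator's load-measurement logic (not the actual SEEP UDF).
final class LoadFilter {

    private final double minDeltaWatts;      // changes smaller than this are shed
    private final long heartbeatIntervalSec; // heartbeat interval in the timestamp unit (seconds)

    private final Map<Integer, Long> lastTimestamp = new HashMap<>();       // per plug
    private final Map<Integer, Double> lastForwardedLoad = new HashMap<>(); // per plug
    private long lastHeartbeat = Long.MIN_VALUE;

    LoadFilter(double minDeltaWatts, long heartbeatIntervalSec) {
        this.minDeltaWatts = minDeltaWatts;
        this.heartbeatIntervalSec = heartbeatIntervalSec;
    }

    // Returns true if the load measurement should be forwarded downstream.
    boolean onLoad(int plugKey, long timestamp, double load) {
        Long last = lastTimestamp.get(plugKey);
        if (last != null && timestamp <= last) {
            return false; // duplicate or out-of-date measurement
        }
        lastTimestamp.put(plugKey, timestamp);

        Double prev = lastForwardedLoad.get(plugKey);
        if (prev != null && Math.abs(load - prev) < minDeltaWatts) {
            return false; // semantic load-shedding: change too small
        }
        lastForwardedLoad.put(plugKey, load);
        return true;
    }

    // Returns true if a heartbeat should be emitted for this event timestamp.
    boolean onTimestamp(long timestamp) {
        if (timestamp > lastHeartbeat + heartbeatIntervalSec) {
            lastHeartbeat = timestamp;
            return true;
        }
        return false;
    }
}
```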
3.3 Query 1: Load Forecasting

Operator Q1 for load forecasting is implemented as follows:

Prediction model. As a baseline, we rely on the prediction model defined in the challenge description, which combines current load measurements with a model over historical data. More specifically, the load prediction for the time window following the next one is based on the average load of the current window and the median of the average loads of the windows covering the same time of day on all past days. The generation of prediction values is triggered by the heartbeats generated by the filter operator (a small worked sketch of this computation is given at the end of this subsection).

Work-based correction. To address the issues stemming from missing load measurements, our operator exploits measurements of cumulative work per plug. Correction is triggered when the operator receives a work measurement and the number of recorded load measurements for the preceding window is less than a threshold. Since work is measured at a coarse resolution (1 kWh), the work values enable us to derive only an approximation of the actual average load. Therefore, the threshold on the number of load measurements allows for tuning how many load values are at least required to avoid computing the window average from work values. A specific value for the threshold is chosen based on the expected rate of load measurements. If applied, the correction mechanism determines the maximal interval of adjacent windows with insufficient load measurements. The difference between the first and last work measurement for this period is used to derive the average load for all these windows.

State handling. Load forecasting relies on the average load per window per plug over the complete history. To cope with the unbounded state of the query, our implementation strives to reduce the size of the state as much as possible. First, we observe that although results have to be provided for five different window sizes, all of them can be expressed as multiples of the smallest window of one minute; therefore, our implementation only stores state for the smallest windows. Second, since prediction is based on the load average, our operator keeps only a sliding average for the current smallest window and the average load for all historic windows. Load averages are kept in a two-dimensional array (per plug, per window), and an index structure allows for quick access to a global identifier for a plug. The index is implemented as a three-dimensional array over the house, household, and plug identifiers. For the work-based correction mechanism, additional state needs to be maintained: for each plug and window, the number of load measurements and the first recorded work value are kept in two further two-dimensional arrays.
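As the worked sketch referenced above, the snippet below computes the forecast for window i+2 from the average load of the current window i and the median of the average loads of past windows that covered the same time of day. We assume the two terms are combined by a simple average, following the challenge description; variable and class names are ours.

```java
import java.util.Arrays;

// Illustrative computation of the load forecast for the window after the next one.
final class LoadPredictor {

    // avgCurrentWindow: average load of the current window s_i.
    // historicalAverages: average loads of past windows covering the same time of
    // day as the predicted window (one entry per past day).
    static double predict(double avgCurrentWindow, double[] historicalAverages) {
        if (historicalAverages.length == 0) {
            return avgCurrentWindow; // no history yet: use the current average alone
        }
        return (avgCurrentWindow + median(historicalAverages)) / 2.0;
    }

    private static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

In the operator, the historical averages come from the two-dimensional (plug, window) array described above, so only the small per-day array of averages is sorted, never the raw measurements.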
3.4 Query 2: Outliers

Outlier detection is realised by operators Q2 Plug and Q2 Global. The former focuses on the calculation of the windows and the median load per plug. Its results are then used by operator Q2 Global to conduct the actual outlier detection.

Plug windows and median. To maintain the time windows and calculate the median load per plug, operator Q2 Plug proceeds as follows. Upon the arrival of a load measurement, its value and timestamp are added to either window for the respective plug. The timestamp of the received event is used to remove old events from both windows. Then, the median of the load values for the plug is calculated. If neither the median nor the multiset of values of either window has changed, no event is forwarded to operator Q2 Global. If there has been a change, the new median for the plug as well as the load values added to or removed from either window are sent to Q2 Global (<plug update>).

Outlier detection. To detect outliers, operator Q2 Global compares the median values per plug, as computed by operator Q2 Plug, with the global median. To compute the latter, the operator maintains two time windows over all plugs. These windows are updated only based on the values carried by the events of the <plug update> stream generated by operator Q2 Plug. Receiving an event of the <plug update> stream leads to a recalculation of the global median for the respective window. If it has not changed, only the house related to the plug for which the update has been received is considered in the outlier detection. If the global median has changed, the plugs of all houses are checked. If the percentage of plugs with a median load higher than the global median changes, the result stream is updated.

State handling. To implement the time windows, operator Q2 Plug maintains, for each plug, two double-ended queues, one containing the timestamps and one containing the load values. Implemented as linked lists, these queues allow new measurements to be inserted in constant time; accessing and removing events from the other end of a queue is also done in constant time. The queue containing the timestamps is used to determine whether elements of the queue containing the load values should be removed.

To compute the median over the load values, operator Q2 Plug maintains an indexable skip-list [12] of these values per plug. Such a skip-list holds an ordered sequence of elements and maintains an additional index structure, a linked hierarchy of sub-sequences that skip certain elements of the original list. We use the probabilistic and indexable version of this data structure: skip paths are chosen randomly and, for each skip path, we also store its length in terms of the number of skipped elements. The indexable skip-list allows for inserting, deleting, and searching load values, as well as accessing the load value at a particular list index, in logarithmic time. The calculation of the median thus reduces to a list lookup. Since the query requires the lookup only for the median element, and not for an arbitrary index, we also keep a pointer to the current median element of the list, which is updated with every insertion or deletion. Hence, the median is derived in constant time.

Although bounded, handling the state of operator Q2 Global is challenging due to the sheer number of measurements that need to be kept (up to a million events) and the update frequency. For both windows, our implementation relies on an indexable skip-list and uses a pointer to the median elements of these lists.
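To make the constant-time median lookup concrete, the sketch below maintains a window's load values as an ordered multiset split into a "low" and a "high" half, so the median is always available at the boundary between them. It uses two balanced TreeMaps as a simpler stand-in for the indexable skip-list with a median pointer described above; updates remain logarithmic and median reads constant time. The class is illustrative and not the data structure used in our operators.

```java
import java.util.TreeMap;

// Stand-in for the per-plug median structure: a multiset of load values split into
// a low half and a high half, so the median sits at the boundary between the halves.
final class MedianMultiset {

    private final TreeMap<Double, Integer> low = new TreeMap<>();  // smallest values
    private final TreeMap<Double, Integer> high = new TreeMap<>(); // largest values
    private int lowSize = 0;
    private int highSize = 0;

    // Called when a measurement enters the window.
    void add(double v) {
        if (lowSize == 0 || v <= low.lastKey()) { bump(low, v, +1); lowSize++; }
        else                                    { bump(high, v, +1); highSize++; }
        rebalance();
    }

    // Called when a measurement leaves the window; v must have been added before.
    void remove(double v) {
        if (lowSize > 0 && v <= low.lastKey()) { bump(low, v, -1); lowSize--; }
        else                                   { bump(high, v, -1); highSize--; }
        rebalance();
    }

    // Lower median for windows with an even number of values; assumes a non-empty window.
    double median() {
        return low.lastKey();
    }

    int size() {
        return lowSize + highSize;
    }

    // Keep |low| equal to |high| or larger by exactly one element.
    private void rebalance() {
        while (lowSize > highSize + 1) {
            double v = low.lastKey();
            bump(low, v, -1); lowSize--;
            bump(high, v, +1); highSize++;
        }
        while (lowSize < highSize) {
            double v = high.firstKey();
            bump(high, v, -1); highSize--;
            bump(low, v, +1); lowSize++;
        }
    }

    private static void bump(TreeMap<Double, Integer> m, double v, int delta) {
        int count = m.getOrDefault(v, 0) + delta;
        if (count == 0) m.remove(v); else m.put(v, count);
    }
}
```

In this scheme, Q2 Plug would call add for each new load value, remove for each value that the timestamp queue evicts from the window, and then read median to decide whether an update must be sent to Q2 Global.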

4. EVALUATION

Our evaluation assesses the performance of our system along three dimensions: whether it scales (does the system support more houses?), whether it can cope with the current load with headroom (can the system process faster than real time?), and how fast predictions can be delivered (does the system achieve low latency, even when it is distributed?).

We deploy our solution on a private cluster of Intel Xeon nodes with 8 GB of RAM each, running SEEP on a Linux 3.x kernel with Java 7. We run SEEP with its fault-tolerance mechanism enabled.

Scalability. To measure scalability, we report relative throughput: we normalise the throughput of the system to the baseline case and show how it increases as we add cluster nodes. We also explain the bottlenecks observed when conducting the experiments.

Throughput. After analysing the available datasets, we found that we need a system capable of processing 377 events/s, 696 events/s, and 1,565 events/s on average to keep up with the incoming input rate of the respective house workloads over one month. SEEP processes three orders of magnitude faster than this. For this reason, we report speedup over RT (real time), i.e. the number of times the system processes faster than required to run the queries. As an example, consider a speedup over RT of 2, which would allow for processing one month's worth of data in 15 days.

Latency. We measure the end-to-end latency of those events that close windows in both queries. To measure latency accurately, we place both the source and the sink of our system on the same cluster node, so that both operators have access to a common clock.

For the given event queries, the processing cost per event is close to constant regardless of the dataset size. Dataset sizes, however, have an impact on the total memory required to run the queries. We exploit the stateful capabilities of our system to represent the state efficiently, avoiding any performance hit. Note that, under this scenario, larger datasets do not impact the throughput of our system, but only the speedup, as there are more events to process.

4.1 Query 1: Load Forecasting

Our implementation of query 1 consists of two operators, a filter and Q1 (see Figure 2). For the baseline system, each of the operators is deployed on a single node of the cluster. For our distributed deployment, we scale out from 2 to 6 nodes.

Baseline system. Figure 3 shows the 10th, 50th, and 90th percentiles of throughput, as requested in the challenge description. As expected, throughput is constant across the different workload sizes, but the speedup over RT decreases as there are simply more events to process.
With the achieved speedup over RT, the system can process one month's worth of data from the smallest house workload in about one hour, while it takes around four hours to do the same for the largest one.

Figure 3: Throughput and speedup as a function of the number of houses. With a constant throughput, growing the size of the dataset implies less speedup.

Table 1: Latencies in ms for both queries (Q1, Q2) with baseline and distributed deployments (10th, 50th, and 90th percentiles).

Distributed system. Ideally, we want the system to scale to support the data coming from more houses, which for our system is equivalent to keeping the speedup over RT constant. We exploit data parallelism to aggregate throughput, thus keeping the speedup over RT constant or even increasing it. Figure 4 shows on the x-axis the number of cluster nodes used during the experiment. The relative throughput increases linearly from 2 to 3 nodes, then sub-linearly until 5 nodes, and then we observe a jump when using 6 nodes. The reason for the sub-linear behaviour is the sink operator, which has to aggregate the results coming from the distributed nodes and becomes an I/O bottleneck. To confirm this, we scaled out the sink and ran the system with 6 nodes, which shows how the throughput increases again. The speedup over RT across this experiment is always increasing, which confirms that our system can scale to bigger datasets while keeping the throughput. We stopped at 6 nodes because the source became a bottleneck at this point. This would not be an issue in a real scenario with distributed sources, but we did not want to break the order semantics provided in the dataset.

Table 1 shows the latencies we measured for both the baseline system and the distributed one. The major sources of latency spikes in SEEP are the buffering mechanism used for fault tolerance and its interplay with the garbage collector under high memory utilisation; neither occurs in the context of query 1. Our latencies in the distributed case are slightly lower than in the non-scaled-out case. The reason is that the source cannot insert data at higher rates, so events go through the same number of queues and processing elements as in the non-scaled case, but with more headroom.

4.2 Query 2: Outliers

Our implementation of query 2 consists of three operators: filter, Q2 Plug, and Q2 Global (see Figure 2). Hence, the baseline deployment comprises 3 nodes of the cluster. For the distributed deployment, we scale out from 3 to 7 nodes.

Figure 4: Throughput and speedup over RT as a function of the number of machines. We increase the speedup by scaling out the system to aggregate throughput.

Figure 5: Throughput and speedup as a function of the number of houses, for query 2.

Figure 6: Distributed deployment of query 2. Scaling out the query aggregates throughput and increases the speedup.

Figure 7: Time series over four days showing a low prediction error for plugs over five-minute windows. Note that the 50th percentile overlaps the x-axis.

Baseline system. Figure 5 shows the expected behaviour of decreasing speedup as the dataset grows in size. This query is computationally more expensive than query 1. In our solution, this translates into a total processing time for one month of data ranging from about 3 hours for the smallest house workload to 13 hours for the largest one.

Distributed system. We follow the same strategy of scaling out the system to increase the speedup over the minimum required throughput, as reported in Figure 6. When adding more cluster nodes, the throughput increases, except between 5 and 6 nodes. The reason for this behaviour is that there were two simultaneous bottlenecks: a CPU bottleneck, which disappears after scaling from 5 to 6 nodes, and an I/O bottleneck that then takes its place. When scaling out past the I/O bottleneck, the system keeps increasing the speedup. We stop the experiment at this point, when the source becomes a bottleneck. Latencies for query 2 are reported in Table 1. They are higher than for query 1 because the bottleneck in this query is CPU-related, whereas in query 1 it is I/O-related (serialisation and deserialisation).

4.3 Impact of Semantic Load-Shedding

To investigate the inherent trade-off between result accuracy and computational efficiency implied by semantic load-shedding, we compared the load predictions derived by query 1 for a sample of four days. We focus on the predictions derived for the smallest time window (one minute). This window represents the most challenging case, since for larger windows the relative importance of filtered events is smaller and, thus, accuracy is less compromised. Figures 7 and 8 show the absolute prediction error for plugs and houses, respectively, aggregated over windows of five minutes. For individual plugs, although the 90th percentile shows spikes of up to 15 watts, the median error is zero in virtually all cases. For load predictions for houses, in turn, the median error is largely between one and three watts, and there is little variability in the results. Based on these results, we conclude that the error is small enough to justify the activation of the mechanism.

Turning to the benefits of load-shedding for processing performance, Figure 9 shows the difference in throughput and speedup over RT when enabling the mechanism. As discussed before, the throughput per node is mostly unaffected, since the processing cost per event is close to constant. However, we observe an improvement in speedup of two orders of magnitude, meaning that one month's worth of data is processed in 17 minutes. This drastic speedup, together with the low accuracy loss, justifies the usage of semantic load-shedding in this scenario.
We consider this an important outcome, as it allows more headroom to scale out the system to accommodate the load from more houses.

5. RELATED WORK

Our solution is based on distributed stream processing. Various engines for stream processing have been proposed in the literature, such as Discretized Streams [14], Naiad [10], Twitter Storm [9], Apache S4 [11], MOA [5], Apache Kafka [8], and Streams [6]. Discretized Streams support massive parallelisation and state in the form of RDDs (Resilient Distributed Datasets); however, their micro-batching approach is not optimal for reducing processing latency. Naiad [10] can scale out across many nodes while maintaining low latencies, but the system is not designed to be fault tolerant when the managed state is large.

Figure 8: Time series over four days showing a low prediction error for houses over five-minute windows.

Figure 9: When applying semantic load-shedding, the speedup grows by two orders of magnitude (note the logarithmic scale). The system processes one month's worth of data in 17 minutes.

The other systems support massive parallelisation of stream processing, but lack support for managing stateful operators and for reacting to varying input rates. Both issues are of particular relevance for the grand challenge: the queries feature a large and complex state, and the number of events indicating changes in load shows large variability. Therefore, we grounded our solution on SEEP [7], which supports stateful operators, dynamic scale-out, and fault-tolerant processing.

To improve processing performance, we apply semantic load-shedding, dropping input events in a structured way to achieve timely processing. Load-shedding is a common technique for optimising event processing applications, in particular for achieving high throughput [13, 4]. In some cases, load shedding for distributed stream processing may in itself become an optimisation problem [13]. Our approach exploits domain semantics, i.e., the size of a change of a measurement value, to decide which events to filter. Our evaluation showed that this approach results in only minor inaccuracies in the load forecast, while filtering leads to drastic performance improvements: the speedup realised by the system grows by two orders of magnitude.

6. CONCLUSIONS

In this work, we presented a highly scalable solution to the ACM DEBS Grand Challenge 2014. We based our solution on SEEP, a platform for stateful stream processing that supports dynamic scale-out of operators and recovery of operator state after a failure. To achieve efficient processing, we presented implementations of the stream processing operators that are geared towards parallelisation and effective state management, e.g., using queues and skip-lists. In addition, we exploited the fact that there are long time periods over which measurement values are relatively constant. Further details on our solution, including a screencast, are available at [2].

The experimental evaluation of our solution indicates that the system can indeed cope with high-volume data, processing it at 3, events per second for the load forecasting and 1, events per second for the outlier detection, with median latencies of 17 ms and 136 ms, respectively. We also conclude that the system is capable of handling larger workloads, since it scales linearly when adding cluster nodes. Further, we demonstrated that semantic load-shedding can lead to large performance gains. While filtering leads to approximate query results, in our case the resulting bias was modest, with a low median error for the prediction. The speedup over real time, however, increased by two orders of magnitude, which allowed us to process the whole dataset in 17 minutes.

Our approach of filtering events for more efficient stream processing opens several directions for future work. On the one hand, the generation of filter conditions from a set of queries would allow for automating the approach. On the other hand, filtering may be guided by the amount of tolerated inaccuracy in the query results.
Probing unfiltered results for certain time intervals would make it possible to assess the consequences of filtering and to adapt its selectivity dynamically.

7. REFERENCES

[1] Esper.
[2] Screencast and further details on our solution.
[3] J. Agrawal, Y. Diao, D. Gyllstrom, and N. Immerman. Efficient pattern matching over event streams. In SIGMOD. ACM, 2008.
[4] B. Babcock, M. Datar, and R. Motwani. Load shedding in data stream systems. In Data Streams, volume 31 of Advances in Database Systems. Springer, 2007.
[5] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. The Journal of Machine Learning Research, 2010.
[6] C. Bockermann and H. Blom. The streams framework. Technical Report 5, TU Dortmund University.
[7] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD. ACM, 2013.
[8] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB, 2011.
[9] N. Marz. Storm - distributed and fault-tolerant realtime computation, 2013.
[10] D. G. Murray, F. McSherry, et al. Naiad: A timely dataflow system. In SOSP, 2013.
[11] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW). IEEE, 2010.
[12] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668-676, 1990.
[13] N. Tatbul, U. Çetintemel, and S. Zdonik. Staying FIT: Efficient load shedding techniques for distributed stream processing. In VLDB, 2007.
[14] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant streaming computation at scale. In SOSP, 2013.
[15] H. Ziekow and Z. Jerzak. The DEBS 2014 Grand Challenge. In DEBS, Mumbai, India, 2014. ACM.


More information

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

An Efficient Execution Scheme for Designated Event-based Stream Processing

An Efficient Execution Scheme for Designated Event-based Stream Processing DEIM Forum 2014 D3-2 An Efficient Execution Scheme for Designated Event-based Stream Processing Yan Wang and Hiroyuki Kitagawa Graduate School of Systems and Information Engineering, University of Tsukuba

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Data Stream Processing Topics Model Issues System Issues Distributed Processing Web-Scale Streaming 3 System Issues Architecture

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Getafix: Workload-aware Distributed Interactive Analytics

Getafix: Workload-aware Distributed Interactive Analytics Getafix: Workload-aware Distributed Interactive Analytics Presenter: Mainak Ghosh Collaborators: Le Xu, Xiaoyao Qian, Thomas Kao, Indranil Gupta, Himanshu Gupta Data Analytics 2 Picture borrowed from https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51640

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach Tiziano De Matteis, Gabriele Mencagli University of Pisa Italy INTRODUCTION The recent years have

More information

DEBS Grand Challenge: Real Time Data Analysis of Taxi Rides using StreamMine3G

DEBS Grand Challenge: Real Time Data Analysis of Taxi Rides using StreamMine3G DEBS Grand Challenge: Real Time Data Analysis of Taxi Rides using StreamMine3G André Martin *, Andrey Brito #, Christof Fetzer * * Technische Universität Dresden - Dresden, Germany # Universidade Federal

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Massively Parallel Processing. Big Data Really Fast. A Proven In-Memory Analytical Processing Platform for Big Data

Massively Parallel Processing. Big Data Really Fast. A Proven In-Memory Analytical Processing Platform for Big Data Big Data Really Fast A Proven In-Memory Analytical Processing Platform for Big Data 2 Executive Summary / Overview: Big Data can be a big headache for organizations that have outgrown the practicality

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Drizzle: Fast and Adaptable Stream Processing at Scale

Drizzle: Fast and Adaptable Stream Processing at Scale Drizzle: Fast and Adaptable Stream Processing at Scale Shivaram Venkataraman * UC Berkeley Michael Armbrust Databricks Aurojit Panda UC Berkeley Ali Ghodsi Databricks, UC Berkeley Kay Ousterhout UC Berkeley

More information

Appendix B. Standards-Track TCP Evaluation

Appendix B. Standards-Track TCP Evaluation 215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error

More information

Enhanced Performance of Database by Automated Self-Tuned Systems

Enhanced Performance of Database by Automated Self-Tuned Systems 22 Enhanced Performance of Database by Automated Self-Tuned Systems Ankit Verma Department of Computer Science & Engineering, I.T.M. University, Gurgaon (122017) ankit.verma.aquarius@gmail.com Abstract

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Thwarting Traceback Attack on Freenet

Thwarting Traceback Attack on Freenet Thwarting Traceback Attack on Freenet Guanyu Tian, Zhenhai Duan Florida State University {tian, duan}@cs.fsu.edu Todd Baumeister, Yingfei Dong University of Hawaii {baumeist, yingfei}@hawaii.edu Abstract

More information

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland DATABASES AND THE CLOUD Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland AVALOQ Conference Zürich June 2011 Systems Group www.systems.ethz.ch Enterprise Computing Center

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

UDP Packet Monitoring with Stanford Data Stream Manager

UDP Packet Monitoring with Stanford Data Stream Manager UDP Packet Monitoring with Stanford Data Stream Manager Nadeem Akhtar #1, Faridul Haque Siddiqui #2 # Department of Computer Engineering, Aligarh Muslim University Aligarh, India 1 nadeemalakhtar@gmail.com

More information

Data Collection, Preprocessing and Implementation

Data Collection, Preprocessing and Implementation Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,

More information

Building an Internet-Scale Publish/Subscribe System

Building an Internet-Scale Publish/Subscribe System Building an Internet-Scale Publish/Subscribe System Ian Rose Mema Roussopoulos Peter Pietzuch Rohan Murty Matt Welsh Jonathan Ledlie Imperial College London Peter R. Pietzuch prp@doc.ic.ac.uk Harvard University

More information

Self Regulating Stream Processing in Heron

Self Regulating Stream Processing in Heron Self Regulating Stream Processing in Heron Huijun Wu 2017.12 Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute Heron Overview Recent Improvements Self Regulating Challenges Dhalion

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information