Grand Challenge: Scalable Stateful Stream Processing for Smart Grids
Raul Castro Fernandez, Matthias Weidlich, Peter Pietzuch (Imperial College London, {rc311, m.weidlich, prp}@imperial.ac.uk) and Avigdor Gal (Technion - Israel Institute of Technology, avigal@ie.technion.ac.il)

ABSTRACT

We describe a solution to the ACM DEBS Grand Challenge 2014, which evaluates event-based systems for smart grid analytics. Our solution follows the paradigm of stateful data stream processing and is implemented on top of the SEEP stream processing platform. It achieves high scalability through massive data-parallel processing and the option of performing semantic load-shedding. In addition, our solution is fault-tolerant: the large processing state of stream operators is not lost after failure. Our experimental results show that our solution processes one month worth of data for 40 houses in a matter of hours. When we scale out the system, the processing time reduces linearly before the system bottlenecks at the data source. We then apply semantic load-shedding, maintaining a low median prediction error and reducing the time further to 17 minutes.

1. INTRODUCTION

The goal of the ACM DEBS Grand Challenge is to conduct a comparative evaluation of event-based systems by offering real-life event data and requirements for event queries. The 2014 edition of the challenge [15] focuses on smart grid analytics. The challenge is based on measurements of energy consumption, expressed as instantaneous load and cumulative work, at the level of individual electricity plugs. These measurements are synthesised from real-world profiles derived from a set of smart home installations. The event queries focus on two types of analytics: (i) short-term load forecasting and (ii) load statistics for real-time demand management. We observe three main characteristics of the 2014 challenge:

High data volume.
The data includes load and work events of individual plugs at a rate of approximately one measurement per second. For the considered 40 houses with roughly 2,000 plugs, this yields a volume of more than 4 billion events for one month. Given predicted future growth, applications of smart grid analytics can be expected to process data of hundreds to thousands of houses, increasing the volume by one or two orders of magnitude.

Unbounded and global state. The event queries require sophisticated handling of processing state. Short-term load forecasting relies on historical data, which requires aggregates to be maintained for an unbounded time window. This is challenging due to the ever-growing state to be stored: the performance of computation on this state is likely to degrade over time. Moreover, the load statistics for real-time demand management require a global aggregation over all houses and plugs, which limits the options to distribute processing. Unbounded and global state also affects fault-tolerance mechanisms, because it becomes expensive or even impossible to recreate all state by reprocessing events.

[Figure 1: Load measurements (five-minute windows) indicating a change of more than five watts.]

Large measurement variability. For the load and work values in the dataset, we observe a large variability in the frequency with which the values change over time. Some plugs report close to constant load for long periods of time, and significant changes in load are correlated between plugs, following global energy consumption patterns. This effect is shown in Figure 1, which depicts the number of load measurements per five-minute window when ignoring changes of less than five watts. Considering only the events that indicate larger changes in load, the number of events that need to be processed varies by up to a factor of seven over time.
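The counting behind Figure 1 can be sketched as follows. This is an illustrative Python fragment, not part of the actual system; the function name, the five-watt threshold and the five-minute window size are taken from the description above, while the input format is our assumption:

```python
from collections import defaultdict

def significant_events(measurements, threshold=5.0, window_s=300):
    """Count, per five-minute window, the measurements whose load differs
    from the last unfiltered value of the same plug by at least
    `threshold` watts. `measurements` is an iterable of
    (timestamp_s, plug_id, load_w) tuples in timestamp order."""
    last_kept = {}             # plug_id -> last load value that passed the filter
    counts = defaultdict(int)  # window index -> number of significant events
    for ts, plug, load in measurements:
        prev = last_kept.get(plug)
        if prev is None or abs(load - prev) >= threshold:
            last_kept[plug] = load
            counts[ts // window_s] += 1
    return dict(counts)
```

Applied to the challenge stream, the resulting per-window counts vary strongly over time, which is exactly the variability that semantic load-shedding later exploits.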
Given these characteristics, we present a solution to the challenge that is based on the SEEP stream processing platform [7]. SEEP executes event processing queries as stateful distributed dataflow graphs. It supports the definition of stream processing operators as user-defined functions (UDFs). The main features of our solution using SEEP are as follows:

1. Data-parallel processing. To handle the high volume of events, our solution scales the processing of events in a data-parallel fashion on a cluster of nodes.
2. Optimised stateful operators. Given the complex state of the event queries, our solution exploits stream operators with efficient state handling specific to a given query, e.g. through indexed in-memory data structures.
3. Filtering and elasticity. We exploit the long periods of relatively constant load measurements in the dataset by performing semantic load-shedding, thus reducing the total number of events to process downstream. To support resource-efficient deployments when the input event rate varies over time, our solution can dynamically provision processing resources on demand.
4. Fault tolerance. Our solution supports fault-tolerant processing, which is crucial for any continuously running data analytics application on a cluster of nodes. Instead of reprocessing all events after a failure, operator state is recovered from periodic state checkpoints with low overhead.
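The checkpoint-based recovery named in the last point can be illustrated with a toy sketch. This is not SEEP's actual API; the class and method names are ours, and a per-key running sum stands in for real operator state:

```python
import copy

class StatefulOperator:
    """Toy stateful operator: keeps a per-key aggregate and supports
    checkpoint/restore, so recovery replays only the events that
    arrived after the last checkpoint rather than the full history."""
    def __init__(self):
        self.state = {}        # key -> aggregate value

    def on_event(self, key, value):
        self.state[key] = self.state.get(key, 0) + value

    def checkpoint(self):
        # in SEEP, the snapshot is backed up to a remote node
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)

op = StatefulOperator()
op.on_event('house1', 10)
snap = op.checkpoint()        # periodic checkpoint
op.on_event('house1', 5)      # this event is lost with the failed node...
fresh = StatefulOperator()    # ...so a replacement node restores the snapshot
fresh.restore(snap)
fresh.on_event('house1', 5)   # and replays only the post-checkpoint events
```

The point of the sketch is the contrast with source replay: without the snapshot, rebuilding `state` would require reprocessing every event since the start of the stream.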
Our experimental evaluation shows that our implementation processes the challenge dataset at high sustained throughput, with median latencies of 17 ms for the load forecasting and 136 ms for the outlier detection. Our solution also scales linearly in the number of cluster nodes used for computation: with 6 (load forecasting) and 7 (outlier detection) nodes, the speed-up over real-time processing increases further. Moreover, we show that semantic load-shedding leads to a modest median error in the query results, but increases the speed-up by two orders of magnitude. With this set-up, our solution processes one month worth of data for 40 houses in 17 minutes.

The remainder of the paper is structured as follows. In Section 2, we give an overview of the SEEP stream processing platform used by our solution. Section 3 gives details on the implemented operators. Section 4 presents our evaluation results. We discuss related work in Section 5, before concluding in Section 6.

2. THE SEEP PLATFORM

Our solution follows the paradigm of stateful stream processing because it provides a natural way to implement the proposed queries: it permits the definition of custom state and its manipulation by UDFs. Event processing systems such as Esper [1] and SASE [3] provide high-level query languages but lack the possibility to fine-tune the data structures used to maintain the query state. In contrast, stateful stream processing achieves efficient state handling for each specific scenario. Recently, a new generation of data-parallel stream processing systems based on dataflow graphs has been proposed, including Twitter Storm [9] and Apache S4 [11].
Although these systems allow for massive parallelisation of stream processing operators, they assume that dataflow graphs are static and operators are stateless: they cannot react to varying input rates or efficiently recover operator state after failure. In contrast, the SEEP platform [7] implements a stateful stream processing model and can (i) dynamically partition operator state to scale out a query in order to increase processing throughput; and (ii) recover operator state after a failure, while maintaining deterministic execution. As a result, SEEP offers the following three features:

(1) SEEP is highly scalable in order to handle high-volume data. For the given challenge, this is an important feature because a realistic set-up for smart grid analytics would require the processing of events emitted by many more houses. Prior work [7] has shown that SEEP scales to close to a hundred virtual machines (VMs) in the Amazon EC2 public cloud. It achieves this through data parallelism: stateless and stateful operators are scaled out, i.e. multiple instances of operators are deployed in the cluster. Each instance operates on a subset of the event data. Events are dispatched to instances based on query semantics, e.g. in the challenge dataset the event streams may be hash-partitioned by houses. In addition, SEEP exploits pipeline parallelism: chains of operators are deployed on different nodes to reduce latency.

(2) SEEP is fault tolerant, which is critical when operator state depends on a large number of past events. The load forecasting in the challenge relies on a model learned from historical data, and the outlier detection employs windows with up to 1,000,000 events. Even under the assumption of a reliable event source with access to the full event stream, losing operator state would require the reprocessing of all events. Instead, SEEP creates periodic checkpoints of operator state, which are backed up to remote nodes and used to quickly recover state after failure.
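Hash-partitioning by house, as mentioned above, can be sketched in a few lines. This is a simplified illustration, not SEEP's dispatcher; the event layout and function names are ours, and `crc32` merely stands in for any deterministic hash:

```python
import zlib

def dispatch(event, instances):
    """Route an event to one of several parallel operator instances by
    hashing its partitioning key (here: the house id). All events of a
    given house deterministically reach the same instance, so per-house
    state never needs to be shared between instances."""
    house_id = event['house']
    idx = zlib.crc32(str(house_id).encode()) % len(instances)
    instances[idx].append(event)

# three parallel instances, five events from three houses
instances = [[], [], []]
for e in [{'house': h, 'load': 1.0} for h in (0, 1, 2, 0, 1)]:
    dispatch(e, instances)
```

Because the routing depends only on the key, an instance can keep all state for its houses locally, which is what makes the scale-out of the stateful operators in Section 3 possible.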
[Figure 2: The dataflow graph of our solution: a source feeds <input> streams to the filter operator; the filter forwards <input> (and <heartbeat>) events to Q1 and Q2 Plug; Q2 Plug sends <plug update> events to Q2 Global; the <plug prediction>, <house prediction> and <house outliers> streams flow to the sink.]

(3) SEEP is elastic: it dynamically scales out stateful operators at runtime by partitioning their state. This functionality is particularly useful for event queries with high variability in the input rate. When pre-filtering events in the challenge to ignore minor changes in load, the input rate varies. SEEP can adapt to such workload changes, using cluster resources more efficiently.

3. QUERY IMPLEMENTATION

This section describes how we implemented the event queries from the challenge on the SEEP platform. We first give an overview of the main ideas behind the queries (Section 3.1), before we give details of the operator implementations (Sections 3.2-3.4).

3.1 Overview

The structure of the logical dataflow graph of the queries is shown in Figure 2. An operator filter performs semantic load-shedding across all load and work measurements (denoted by <input>). This permits, for example, the filtering of events that indicate only a minor change in load for a certain plug. The filter operator can be scaled out so that different instances realise data parallelism by partitioning the event stream <input> per house, household or even plug. The actual queries are implemented by three operators, namely Q1, Q2 Plug, and Q2 Global. Load forecasting and outlier detection are independent queries; their parallelisation is done separately. Load forecasting is realised by operator Q1 at two levels of aggregation, i.e., plugs and houses. Hence, the operator can be scaled out by partitioning the respective event stream by the most coarse-grained aggregation level, i.e., per house. Data-parallel processing is of particular importance for this operator because the query requires the maintenance of an unbounded time window.
At the same time, the query also requires frequent updates of the result stream, i.e. every 30 seconds as specified by the timestamps of the events. When events are streamed faster than real-time and distributed over a large number of operator instances, however, it becomes impossible to identify the intervals for updating the result stream at a particular instance. To solve this issue, we implement a heartbeat mechanism in the filter operator, which emits a signal to operator Q1 whenever an update is due. Heartbeat generation is implemented in the filter operator because the pre-processing of data is less costly than the actual load prediction. Hence, the number of instances of the filter operator can be expected to be much smaller than the number of instances of operator Q1. To cope with data quality issues such as missing values, e.g. as seen around day 8 in Figure 1, operator Q1 also features a correction mechanism that is based on the measurements of cumulative work per plug, which is detailed below.

Outlier detection is split up into two operators. Here, the idea is to separate the part of the query that can be parallelised from the part that requires global state. Operator Q2 Plug thus takes the input stream and maintains the median of the load per plug for each of the time windows. The operator can be scaled out at the level of plugs.
Operator Q2 Global, in turn, maintains the global median over all plugs. It also realises the outlier detection and emits the results. Due to its global state, this operator cannot be scaled out. To reduce the amount of computation done at the singleton instance of operator Q2 Global, it relies not only on the median values computed by Q2 Plug, but also on information about which measurements entered or left one of the investigated time windows (denoted by <plug update> in Figure 2). As a consequence, a large part of the effort to maintain the time windows per plug is performed at operator Q2 Plug, which can be scaled out.

3.2 Filter

The filter operator realises the following functionality:

Duplicate elimination. To filter duplicate measurements, the operator maintains the timestamps of the last load and work measurements for each plug. Only measurements with a timestamp larger than the last observed one (per plug) are forwarded.

Variability-based filtering. To exploit for optimisation the large variability in the frequency with which load values change over time, the filter operator can perform semantic load-shedding, ignoring measurements that denote a minor change in load with respect to the last non-filtered measurement. In this case, measurements of work are only forwarded if the load measurement with the same timestamp has not been removed by the filter procedure. In Section 4, we evaluate experimentally the trade-off between this type of filtering and the correctness of the query results.

Heartbeat generation. The aforementioned heartbeats are generated based on the timestamps of the processed events. Whenever an event with a timestamp larger than the time of the last heartbeat plus the heartbeat interval is received, a new heartbeat is emitted.

3.3 Query 1: Load Forecasting

Operator Q1 for load forecasting is implemented as follows:

Prediction model.
As a baseline, we rely on the prediction model defined in the challenge description, which combines current load measurements with a model over historical data. More specifically, the load prediction for the time window following the next one is based on the average load of the current window and the median of the average loads of the windows covering the same time of day on all past days. The generation of prediction values is triggered by the heartbeats emitted by the filter operator.

Work-based correction. To address issues stemming from missing load measurements, our operator exploits the measurements of cumulative work per plug. Correction is triggered when the operator receives a work measurement and the number of recorded load measurements for the preceding window is less than a threshold. Since work is measured at a coarse resolution (1 kWh), the work values enable us to derive only an approximation of the actual average load. The threshold on the number of load measurements therefore allows for tuning how many load values are at least required to avoid computing the window average based on work values. A specific value for the threshold is chosen based on the expected rate of load measurements. If applied, the correction mechanism determines the maximal interval of adjacent windows with insufficient load measurements. The difference between the first and last work measurements for this period is used to derive the average load for all of these windows.

State handling. Load forecasting relies on the average load per window per plug over the complete history. To cope with the unbounded state of the query, our implementation strives to reduce the size of the state as much as possible. First, we observe that although results have to be provided for five different window sizes, all of them can be expressed as multiples of the smallest window of one minute. Therefore, our implementation only stores the state for the smallest windows.
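The prediction rule described above (the current window's average load combined with the median of the averages of the corresponding windows on past days) can be sketched as follows. This is an illustrative fragment, not the SEEP operator; the function name is ours, and the averaging of the two terms follows the challenge's baseline model:

```python
from statistics import median

def predict_load(current_avg, historical_avgs):
    """Predict the average load for the window after the next one:
    the mean of the current window's average load and the median of
    the average loads of the same time-of-day window on past days."""
    if not historical_avgs:
        # no history yet: fall back to the current average
        return current_avg
    return (current_avg + median(historical_avgs)) / 2.0

# e.g. the current window averages 100 W and the same slot on past
# days averaged 90, 110 and 130 W: the median is 110 W, so the
# prediction is (100 + 110) / 2 = 105 W
```

In the operator itself, `current_avg` is a sliding average and `historical_avgs` is read from the per-plug, per-window array described under state handling.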
Second, since prediction is based on the load average, our operator keeps only a sliding average for the current smallest window and the average load for all historic windows. Load averages are kept in a two-dimensional array (per plug, per window), and an index structure allows for quick access to a global identifier for a plug. The index is implemented as a three-dimensional array over the house, household, and plug identifiers. For the work-based correction mechanism, additional state needs to be maintained: for each plug and window, the number of load measurements and the first recorded work value are kept in two further two-dimensional arrays.

3.4 Query 2: Outliers

Outlier detection is realised by operators Q2 Plug and Q2 Global. The former focuses on the calculation of the windows and the median load per plug. Its results are then used by operator Q2 Global to conduct the actual outlier detection.

Plug windows and median. To maintain the time windows and calculate the median load per plug, operator Q2 Plug proceeds as follows. Upon the arrival of a load measurement, its value and timestamp are added to both windows for the respective plug. The timestamp of the received event is used to remove old events from both windows. Then, the median of the load values for the plug is calculated. If neither the median nor the multiset of values in either window has changed, no event is forwarded to operator Q2 Global. If there has been a change, the new median for the plug as well as the load values added to or removed from either window are sent to Q2 Global (<plug update>).

Outlier detection. To detect outliers, operator Q2 Global compares the median values per plug, as computed by operator Q2 Plug, with the global median. To compute the latter, the operator maintains two time windows over all plugs. However, these windows are updated only based on the values provided by the events of the <plug update> stream generated by operator Q2 Plug.
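The per-plug window maintenance just described can be sketched as follows. This is an illustrative stand-in, not the actual operator: a plain sorted list (with linear-time insertion) replaces the indexable skip-list used in the implementation, and the class and method names are ours:

```python
from bisect import insort
from collections import deque

class PlugWindow:
    """Sliding time window over one plug's load values with an
    incrementally maintained median. A sorted list stands in for the
    indexable skip-list of the actual implementation."""
    def __init__(self, size_s):
        self.size_s = size_s
        self.events = deque()      # (timestamp, load) in arrival order
        self.sorted_loads = []     # the same loads, kept in order

    def add(self, ts, load):
        self.events.append((ts, load))
        insort(self.sorted_loads, load)
        # evict measurements that have fallen out of the window
        while self.events and self.events[0][0] <= ts - self.size_s:
            _, old = self.events.popleft()
            self.sorted_loads.remove(old)

    def median(self):
        n = len(self.sorted_loads)
        mid = self.sorted_loads[n // 2]
        return mid if n % 2 else (mid + self.sorted_loads[n // 2 - 1]) / 2.0
```

One such structure is kept per plug and window size; an update event is forwarded to the global operator only when the median or the window contents actually change.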
Receiving an event of the <plug update> stream leads to a recalculation of the global median for the respective window. If it has not changed, only the house related to the plug for which the update has been received is considered in the outlier detection. If the global median changed, the plugs of all houses are checked. If the percentage of plugs with a median load higher than the global median changes, the result stream is updated.

State handling. To implement the time windows, operator Q2 Plug maintains, for each plug, two double-ended queues, one containing the timestamps and one containing the load values. Implemented as linked lists, these queues allow new measurements to be inserted in constant time; accessing and removing events from the other end of the queue is also done in constant time. The queue containing the timestamps is used to determine whether elements of the queue containing the load values should be removed. To compute the median over the load values, operator Q2 Plug maintains an indexable skip-list [12] of these values per plug. Such a skip-list holds an ordered sequence of elements and maintains an additional index structure, a linked hierarchy of sub-sequences that skip certain elements of the original list. We use the probabilistic and indexable version of this data structure: the skip paths are chosen randomly and, for each skip path, we also store its length in terms of the number of skipped elements. The indexable skip-list allows for inserting, deleting and searching load values, as well as accessing the load value at a particular list index, in logarithmic time. The calculation of the median thus reduces to a list lookup. Since the query requires the lookup only for the median element, and not for an arbitrary index, we also keep a pointer to the current median element of the list, which is updated with every insertion or deletion. Hence, the median is derived in constant time.

Although bounded, handling the state of operator Q2 Global is challenging due to the sheer number of measurements that need to be kept (up to 1,000,000 events) and the update frequency. For both windows, our implementation relies on an indexable skip-list and uses a pointer to the median elements of these lists.

4. EVALUATION

Our evaluation assesses the performance of our system along three dimensions by investigating:
- whether it scales. Does the system support more houses?
- whether it can cope with the current load with headroom. Can the system process faster than real time?
- how fast we can incorporate predictions. Does the system achieve low latency, even when it is distributed?

We deploy our solution on a private cluster of Intel Xeon nodes with 8 GB of RAM each, running SEEP on a Linux 3.x kernel with Java 7. We run SEEP with the fault tolerance mechanism enabled.

Scalability. To measure scalability, we report relative throughput, where we normalise the throughput of the system to the baseline case and show how it increases as we add cluster nodes. We also explain the bottlenecks observed when conducting the experiments.

Throughput. After analysing the available datasets, we found that we need a system capable of processing 377 events/s, 696 events/s and 1565 events/s on average for the 10, 20 and 40 houses datasets, respectively, to keep up with the incoming input rate over a month. SEEP processes three orders of magnitude faster than this.
For this reason, we report the speedup over RT (real time) as the number of times the system processes faster than required to run the queries in real time. As an example, consider a speedup over RT of 2, which would allow for processing one month worth of data in 15 days.

Latency. We measure the end-to-end latency of those events that close windows in both queries. To measure latency accurately, we place both the source and the sink of our system on the same cluster node, so that both operators have access to a common clock.

For the given event queries, the processing cost per event is close to constant regardless of the dataset size. Dataset sizes, however, have an impact on the total memory required to run the queries. We exploit the stateful capabilities of our system to represent the state efficiently, avoiding a performance hit. Note that under this scenario, larger datasets do not impact the throughput of our system, but only the speedup, as there are more events to process.

4.1 Query 1: Load Forecasting

Our implementation of query 1 consists of two operators, filter and Q1 (see Figure 2). For the baseline system, each of the operators is deployed on a single node of the cluster. For our distributed deployment, we scale out from 2 to 6 nodes.

[Figure 3: Throughput and speedup as a function of the number of houses. With a constant throughput, growing the size of the dataset implies less speedup.]

[Table 1: Latencies in ms for both queries (10th/50th/90th percentiles) with baseline and distributed deployment.]

Baseline system. Figure 3 shows the 10th, 50th, and 90th percentiles of throughput, as requested in the challenge description. As expected, throughput is constant across the different workload sizes, but the speedup over RT decreases as there are simply more events to process. At this speedup over RT, the system can process one month
worth of data from 10 houses in about one hour, while it takes around four hours to do the same for 40 houses.

Distributed system. Ideally, we want the system to scale to support the data coming from more houses, which in our system is equivalent to keeping the speedup over RT constant. We exploit data parallelism to aggregate throughput, thus keeping constant or even increasing the speedup over RT. Figure 4 shows on the x-axis the number of cluster nodes used during the experiment. The relative throughput increases linearly from 2 to 3 nodes, then sub-linearly until 5 nodes, and then we find a spike when using 6 nodes. The reason for the sub-linear behaviour is the sink operator, which has to aggregate the results coming from the distributed nodes and becomes an IO bottleneck. To confirm this, we scaled out the sink and ran the system with 6 nodes, which shows how the throughput increases again. The speedup over RT across this experiment is always increasing, which confirms that our system can scale to bigger datasets while sustaining throughput. We stopped at 6 nodes, as the source became a bottleneck at this point. This would not be an issue in a real scenario with distributed sources, but we did not want to break the order semantics provided in the dataset.

Table 1 shows the latencies we measured for both the baseline system and the distributed one. The major sources of latency spikes in SEEP are the buffering mechanism used for fault tolerance and its interaction with the garbage collector under high memory utilisation; neither occurs in the context of query 1. Our latencies are slightly lower than in the non-scaled-out case. The reason for lower latencies in the distributed case is that the source cannot insert data at higher rates; thus, events go through the same number of queues and processing elements as in the non-scaled case, but with more headroom.
4.2 Query 2: Outliers

Our implementation of query 2 consists of three operators, filter, Q2 Plug and Q2 Global (see Figure 2). Hence, the baseline deployment comprises 3 nodes of the cluster. For the distributed deployment, we scale out from 3 to 7 nodes.
[Figure 4: Throughput and speedup over RT as a function of the number of machines. We increase the speedup by scaling out the system to aggregate throughput.]

[Figure 5: Throughput and speedup as a function of the number of houses, for query 2.]

[Figure 6: Distributed deployment of query 2. Scaling out the query aggregates throughput and increases the speedup.]

[Figure 7: Time series over four days showing a low prediction error for plugs over five-minute windows. Note that the 50th percentile overlaps the x-axis.]

Baseline system. Figure 5 shows the expected behaviour of decreasing speedup as the dataset grows in size. This query is computationally more expensive than query 1. In our solution, this translates into a total time of around three hours to process one month of data for 10 houses, and 13 hours to do the same in the 40 houses case.

Distributed system. We follow the same strategy of scaling out the system to increase the speedup over the minimum required throughput, as reported in Figure 6. When adding more cluster nodes, the throughput increases, except between 5 and 6 nodes. The reason for this behaviour is that there were two simultaneous bottlenecks: a CPU bottleneck that disappears after scaling from 5 to 6 nodes, giving rise to an IO bottleneck. Once the IO bottleneck is scaled out, the system keeps increasing the speedup. We stop scaling out at this point, when the source becomes a bottleneck. Latencies for query 2 are reported in Table 1.
Latencies are higher than in query 1 because this query was CPU-bound, while query 1 was IO-bound (serialisation and deserialisation).

4.3 Impact of Semantic Load-Shedding

To investigate the inherent trade-off between result accuracy and computation efficiency implied by semantic load-shedding, we compared the load predictions derived by query 1 for a sample of four days. We focus on the predictions derived for the smallest time window (one minute). This window represents the most challenging case, since for larger windows the relative importance of filtered events is smaller and, thus, accuracy is less compromised. Figures 7 and 8 show the absolute prediction error for plugs and houses, respectively, aggregated over windows of five minutes. For individual plugs, although the 90th percentile shows spikes of up to 15 watts, the median error is zero in virtually all cases. For load predictions for houses, in turn, the median error is largely between one and three watts, and there is little variability in the results. Based on these results, we conclude that the error is small enough to justify the activation of the mechanism.

Turning to the benefits of load-shedding for processing performance, Figure 9 shows the difference in throughput and speedup over RT when enabling the mechanism. As discussed before, the throughput per node is mostly unaffected, since the processing cost per event is close to constant. However, we observe an improvement in speedup of two orders of magnitude, meaning that one month worth of data for 40 houses is processed in 17 minutes. This drastic speedup, together with the low accuracy loss, justifies the usage of semantic load-shedding in this scenario. We consider this an important outcome, as it allows more headroom to scale out the system to accommodate the load from more houses.

5. RELATED WORK

Our solution is based on the use of distributed stream processing.
Various engines for stream processing have been proposed in the literature, such as Discretized Streams [14], Naiad [10], Twitter Storm [9], Apache S4 [11], MOA [5], Apache Kafka [8], and Streams [6]. Discretized Streams support massive parallelisation and state in the form of RDDs (Resilient Distributed Datasets); however, their micro-batching approach is not optimal for reducing processing latency. Naiad [10] can scale out across many nodes while maintaining low latencies; however, the system is not designed to
be fault tolerant when the managed state is large. The other systems support massive parallelisation of stream processing, but lack support for managing stateful operators and for reacting to varying input rates. Both issues are of particular relevance for the grand challenge: the queries feature a large and complex state, and the number of events indicating changes in load shows large variability. Therefore, we grounded our solution on SEEP [7], which supports stateful operators, dynamic scale-out, and fault-tolerant processing.

[Figure 8: Time series over four days showing a low prediction error for houses over five-minute windows.]

[Figure 9: When applying semantic load-shedding, the speedup grows by two orders of magnitude (note the logarithmic scale). The system processes one month worth of data for 40 houses in 17 minutes.]

To improve processing performance, we apply semantic load-shedding, dropping input events in a structured way to achieve timely processing. Load-shedding is a common technique for optimising event processing applications, in particular for achieving high throughput [13, 4]. In some cases, load shedding for distributed stream processing may in itself become an optimisation problem [13]. Our approach exploits domain semantics, i.e., the size of a change of a measurement value, to decide which events to filter. Our evaluation showed that this approach results in only minor inaccuracies in the load forecast, while filtering leads to drastic performance improvements: the speedup realised by the system grows by two orders of magnitude.

6. CONCLUSIONS

In this work, we presented a highly scalable solution to the ACM DEBS Grand Challenge 2014.
We based our solution on SEEP, a platform for stateful stream processing that supports dynamic scale-out of operators and recovery of operator state after a failure. To achieve efficient processing, we presented implementations of the stream processing operators that are geared towards parallelisation and effective state management, e.g., using queues and skip-lists. In addition, we exploited the fact that there are long time periods over which measurement values are relatively constant. Further details on our solution, including a screencast, are available at [2].

The experimental evaluation of our solution indicates that the system can indeed cope with high-volume data, processing it with median latencies of 17 ms for the load forecasting and 136 ms for the outlier detection. We also conclude that the system is capable of handling larger workloads, since it scales linearly when adding cluster nodes. Further, we demonstrated that semantic load-shedding can lead to huge performance gains. While filtering leads to approximate query results, in our case the resulting bias was rather modest, with a low median error for the prediction. The speedup over real time, however, increased by two orders of magnitude, which allowed us to process the whole dataset in 17 minutes.

Our approach of filtering events for more efficient stream processing opens several directions for future work. On the one hand, the generation of filter conditions from a set of queries would allow for automating the approach. On the other hand, filtering may be guided by the amount of tolerated inaccuracy in the query results. Probing unfiltered results for certain time intervals would enable assessing the consequences of filtering and adapting its selectivity dynamically.

7. REFERENCES

[1] Esper.
[2] Screencast of our solution.
[3] J. Agrawal, Y. Diao, D. Gyllstrom, and N. Immerman. Efficient pattern matching over event streams. In SIGMOD. ACM, 2008.
[4] B. Babcock, M. Datar, and R. Motwani. Load shedding in data stream systems. In Data Streams, volume 31 of Advances in Database Systems. Springer, 2007.
[5] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. The Journal of Machine Learning Research, 2010.
[6] C. Bockermann and H. Blom. The streams framework. Technical Report 5, TU Dortmund University, 2012.
[7] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD. ACM, 2013.
[8] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB, 2011.
[9] N. Marz. Storm - distributed and fault-tolerant realtime computation, 2013.
[10] D. G. Murray, F. McSherry, et al. Naiad: A timely dataflow system. In SOSP, 2013.
[11] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW). IEEE, 2010.
[12] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668-676, 1990.
[13] N. Tatbul, U. Çetintemel, and S. Zdonik. Staying fit: Efficient load shedding techniques for distributed stream processing. In VLDB. VLDB Endowment, 2007.
[14] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant streaming computation at scale. In SOSP, 2013.
[15] H. Ziekow and Z. Jerzak. The DEBS 2014 Grand Challenge. In DEBS, Mumbai, India. ACM, 2014.