Automating Characterization Deployment in Distributed Data Stream Management Systems

Chunkai Wang, Xiaofeng Meng, Member, IEEE, Qi Guo, Zujian Weng, and Chen Yang

Abstract: Distributed data stream management systems (DDSMS) are usually composed of an upper-layer relational query system (RQS) and a lower-layer stream processing system (SPS). When users submit new queries to the RQS, a query plan needs to be converted into a directed acyclic graph (DAG) of tasks running on the SPS. Based on different query requests and data stream properties, the SPS needs to adopt different deployment strategies. However, dynamically predicting the deployment configuration of the SPS that ensures high processing throughput and low resource usage is a great challenge. This article presents OrientStream, a framework for automating characterization deployment in DDSMS using incremental machine learning techniques. By introducing a four-level feature extraction mechanism (data-level, query plan-level, operator-level, and cluster-level), we first use different query workloads as training sets to predict the resource usage of the DDSMS, then select the optimal resource configuration from candidate settings based on the current query requests and stream properties, and finally migrate operator state through dynamic reconfiguration. We validate our approach on the open-source SPS Storm. For application scenarios with long monitoring cycles and infrequent data fluctuations, experiments show that OrientStream can reduce CPU usage by 8-15 percent, along with a corresponding reduction in memory usage.
Index Terms: Stream processing system, relational query system, incremental learning, modeling and prediction

1 INTRODUCTION

Nowadays, many modern applications require large-scale continuous queries and analysis, such as analytics over microblogs in social networks, monitoring of high-frequency trading in financial fields, real-time recommendation in electronic business, and so on. Therefore, there have been many open-source SPS, such as S4 [7], Storm [8], and Spark Streaming [1]. To enhance their ease of use and processing capability, some research institutions have developed RQS with abstract query languages, such as StreamingSQL [2] and Squall [3]. To complete user query requests within a specified latency, DDSMS need to guarantee high throughput and low latency of stream processing. This often requires users to set the relevant configurations in advance, such as the operator parallelism and the memory usage of processes. However, given the data stream properties (e.g., arrival rate and value distribution) and the differences between query tasks, setting these configurations so as to ensure both real-time processing and low resource usage is a very challenging problem.

C. Wang, X. Meng, Z. Weng, and C. Yang are with the School of Information, Renmin University of China, Beijing, China. chunkai_wang@163.com, xfmeng@ruc.edu.cn, hankwing@hotmail.com, yangchenwo@126.com. Q. Guo is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. guoqi@ict.ac.cn. Manuscript received 25 Feb. 2017; revised 30 Aug. 2017; accepted 3 Sept. 2017. Date of publication 15 Sept. 2017. (Corresponding author: Xiaofeng Meng.) Recommended for acceptance by Y. Zhang.

Motivating Example.
Consider an application that analyzes traffic GPS data collected from taxis and buses in real time. We take the Traffic-Monitoring (TM) system as a query example, and Storm as an SPS example. Traffic monitoring needs to process and analyze the traffic GPS data collected from taxis and buses in real time; the scenario is a typical streaming application. We use the GeoLife GPS Trajectories [25] for this workload. The system contains a map matching processing logic, which accepts trajectory data from GPS loggers or GPS phones, including altitude, latitude, and longitude, and determines the location of the object in terms of a road ID. The speed calculation processing logic accepts the road ID from the map matching logic and calculates the average speed for that road ID. Based on the execution tasks of the TM system, the workload in Storm can be expressed as a directed acyclic graph (called a topology), where nodes are data sources (called spouts) or processing logic (called bolts) which run in different processes (called workers), and edges indicate the flow of data between nodes. For real-time computation, the topology is created first. Then, before deploying the topology to the Storm cluster, the user must specify various topology parameters (e.g., the number of workers, spout parallelism, bolt parallelism, etc.) according to either experience or machine configurations. Finally, the Storm cluster receives the continuous data stream and processes it in a real-time fashion. However, the topology parameters are not aware of the dynamic data stream. For example, as shown in Fig. 1, due to the fixed configurations, some processing latencies exceed the user's threshold (e.g., 1 second) as the data stream rate changes. Meanwhile, it also leads to higher cluster resource usage. As shown in Fig. 2, the average

usage of CPU is about 40 percent, and the average usage of memory (the user specifies the maximum memory usage of each worker) is more than 90 percent.

Fig. 1. Latency.

As DDSMS are usually deployed in cloud environments, the problem of fixed configurations not only leads to high resource usage and reduces the user's return on investment (ROI), but can even cause system bottlenecks. Our earlier works show that many system resources are wasted when the topology configuration is fixed [33], [34]. Taking Amazon EC2 as an example, this resource waste results in an additional $300 per month per machine. Moreover, a fixed topology configuration might also constrain the processing throughput if an overwhelming data flow exceeds the system capability. However, Paper [33] needs to restart a new topology after selecting the optimal configuration, which incurs about 1 minute of adjustment delay and does not apply to frequently fluctuating data streams. Paper [34] uses offline methods to train the initial machine learning models with randomly generated samples; it does not use incremental learning methods and cannot update the training data-set in real time. Meanwhile, Paper [35] has proposed a resource-aware dynamic load scheduler. Compared to that scheduler, which increases the overall throughput by maximizing the resource usage, our work focuses on minimizing the resource usage while still satisfying the real-time constraint. So, in this paper, we present OrientStream, a resource-efficient system that implements dynamic characterization deployment for Storm configurations according to the current data flow rate. This is achieved by a two-phase mechanism. First, multiple resource usage models are trained incrementally with accumulated historical data features. Then, whenever necessary, these models are used to derive the most resource-efficient configuration to deploy.
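As a minimal illustration of the two-phase mechanism, the sketch below replaces the real incremental learners (MOA/Weka regressors in the actual system) with a hypothetical per-configuration averaging model; `IncrementalModel` and `pick_config` are illustrative names, not OrientStream's API:

```python
class IncrementalModel:
    """Stand-in for an incremental learner: it averages the observed
    resource usage per configuration. Purely a placeholder to show the
    two phases; the real system trains MOA/Weka models on features."""

    def __init__(self):
        self.sums, self.counts = {}, {}

    def learn(self, config, usage):
        # phase 1: update the model incrementally with one new sample
        self.sums[config] = self.sums.get(config, 0.0) + usage
        self.counts[config] = self.counts.get(config, 0) + 1

    def predict(self, config):
        # unseen configurations default to 0.0 (optimistic)
        if config not in self.counts:
            return 0.0
        return self.sums[config] / self.counts[config]


def pick_config(model, candidates):
    # phase 2: derive the most resource-efficient candidate configuration
    return min(candidates, key=model.predict)
```

The design point this sketch makes is that training (phase 1) never stops, while configuration selection (phase 2) is invoked only when the monitor decides a change is needed.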
With the advance of query processing, the amount of training data increases dynamically. To use regression models for predicting resource usage in real time, without repeatedly rebuilding models over all training data, and to save the time and resources needed for model building, we choose incremental learning techniques based on the MOA [4] and Weka [5] toolkits. We also introduce Tachyon [36] as an in-memory file server to store state information, and use the Operator States Migration (OSM) system [37] to achieve live adjustment across nodes and to adjust the operator parallelism dynamically.

Fig. 2. Resource usage.

The contributions of our work are summarized as follows:

1) We construct initial training sets using several workloads to cover the different runtime characteristics of DDSMS, and project queries onto four levels (data-level, query plan-level, operator-level, and cluster-level) for collecting the training data to predict the resource usage. In addition, we evaluate the worth of an attribute by repeatedly sampling instances and analyzing the features' influence.

2) As the collection of training samples continues to grow, the incremental learning models steadily improve their prediction accuracy. Moreover, compared to batch learning models, incremental learning models further reduce the runtime of training and prediction, and save CPU and memory usage.

3) To enhance the prediction accuracy, we give the algorithm Ensemble Different Kinds of Regressions (EDKRegression) to integrate different regression models (see Section 5.3 for details) based on the relative independence of the base models.

4) We introduce a mechanism of outlier execution detection to predict the rationality of configurations and monitor anomalous executions. Through normalization and nearest-neighbor methods, we find abnormal data to further improve the prediction accuracy of the regression models.
5) We use the Operator States Migration to achieve live adjustment across different nodes. It enables workloads to perform migrations automatically, without losing any data or incurring unnecessary restarting cost. This helps us to achieve adaptive resource management.

The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 presents the related concepts and the problem definition. Section 4 gives the architecture of OrientStream, which includes model construction, resource prediction, and dynamic tuning. Section 5 describes the critical steps of OrientStream, including feature extraction, model building, the abnormal detection mechanism, and live adjustment. Section 6 gives the results of our experimental evaluation. Finally, Section 7 concludes this paper.

WANG ET AL.: AUTOMATING CHARACTERIZATION DEPLOYMENT IN DISTRIBUTED DATA STREAM MANAGEMENT SYSTEMS

Fig. 3. Storm architecture.

2 RELATED WORK

To meet users' different query requests, a DDSMS needs to build a complete operator library in the RQS and dynamically set the parallelism of processing nodes in the SPS based on the different query requests. There are roughly three lines of related work, as below.

Dynamic Load Scheduling. Recently, several algorithms have been proposed for dynamic load scheduling strategies on Storm [9], [10], [11]. For example, Aeolus [9] is an optimizer that sets the parallelism degree and the batch size of intra-node dataflow programs. It defines a cost model based on four measurements of tuple processing time: shipping time, queuing time, pure processing time, and actual processing time. Using this model, Aeolus can derive the best configuration for the operator parallelism degree and the batch size. DRS [10] is a dynamic resource scheduler for cloud-based DDSMS. It designs an accurate performance model based on the theory of Jackson open queueing networks [12] that yields the optimal operator parallelism degree. Aniello et al. [11] proposed two scheduling algorithms, a topology-based static scheduling strategy and a traffic-based dynamic scheduling strategy, for reducing the tuple processing latency and the inter-node traffic between multiple topologies. However, Aeolus and DRS need to be aware of the specific processing time of each operator, and can only be used in fixed query application scenarios. Paper [11] focuses only on the transmission delay, not on the resource usage, and does not consider dynamically adjusting the operator parallelism within one topology according to changes in the data rate and query requests.

Machine Learning Techniques. Khoshkbarforoushha et al.
[13] proposed a model based on the Mixture Density Network [14] for resource consumption estimation of data stream processing workloads. This model can help users decide whether or not to register a new query. The ALOJA project [15] presented an open, vendor-neutral repository of Hadoop executions. It can predict query time and detect anomalies based on ALOJA-ML [16], a framework that uses machine learning techniques to interpret Hadoop benchmark performance data and to guide performance tuning. Jamshidi et al. [17] designed Bayesian Optimization for Configuration Optimization (BO4CO), an auto-tuning algorithm for finding optimal configurations of SPS. However, the approach of [13] cannot dynamically change the scheduling strategy or the operator parallelism, and cannot predict resource usage. ALOJA-ML can only be applied to Hadoop, not to data stream scenarios. BO4CO can only analyze the collected historical training data of SPS, and cannot perform incremental analysis of new data.

Resource Estimation for RQS. As we know, RQS often have a SQL-like query interface. Some research has been devoted to monitoring the resource consumption of SQL queries. Li et al. [18] set up two kinds of feature extraction mechanisms, coarse-grained global features and fine-grained operator-specific features, for different queries over Microsoft SQL Server. Akdere et al. [19] built three models for predicting query performance based on different query plans: a query plan-level model, an operator-level model, and a hybrid model for nested queries. However, these two approaches only consider a static selection process, cannot monitor dynamically, and do not consider the system characteristics beneath the RQS.
OrientStream is orthogonal to the above works, since we focus on constructing training sets from the data stream properties and the parameter configurations of different levels, predicting resource usage with online machine learning techniques, and dynamically tuning the related parameter configurations of DDSMS.

3 PRELIMINARIES

This section presents the related concepts of Storm and OrientStream, and gives the problem definition and description.

3.1 Storm Synopsis

In this paper, Storm is used as the experimental SPS platform. It is a distributed processing framework that can process data streams in real time. As shown in Fig. 3, Storm consists of a nimbus node and a number of supervisor nodes. The nimbus node is responsible for allocating resources, scheduling requests, and monitoring each supervisor node; each supervisor node is responsible for accepting requests assigned by the nimbus and managing the processes (called workers) under the node. Each worker contains multiple threads (called executors), and each executor can run one or more component instances (called tasks). The entire Storm cluster uses Zookeeper [28] to provide coordinated management of the distributed application.

3.2 Concept Description

OrientStream needs to monitor the operation of the DDSMS in real time, and collects training samples based on different data stream rates and parameter configurations. Formally, we define the related concepts as follows:

Definition 1. A window model divides the infinite data stream into a number of finite sub-streams. Each query processing step is aimed only at the sub-stream in the current window.

Generally, we can set the window size based on a time interval or a number of tuples. In OrientStream, we use the time-interval-based window model to get the training set. Each window with a different interval is used as a training sample. In this paper, we set the time interval to a random number from 30 to

120 seconds. First, we start a query request from a given configuration. When the window size reaches the interval bound, we collect the training sample. Then, we restart the query request with a new configuration and window size. Finally, we can collect a comprehensive training set through different windows.

Fig. 4. OrientStream architecture.

Definition 2. The processing latency of a DDSMS is defined as the sum of the processing delays of each tuple processed by the operators.

Each operator has a different parallelism, which is often implemented in a multi-threaded manner. The latency of data source i is then represented as the mean latency over its threads, formally defined as follows:

$$\mathrm{Latency}(source_i) = \sum_{j=1}^{m} \mathrm{Latency}(source_i)_j \, / \, m, \quad (1)$$

where m is the parallelism of the data source node. The processing latency of the DDSMS is the maximum latency over all data sources; it can be defined as below:

$$\mathrm{Latency} = \max_{i=1}^{n} \mathrm{Latency}(source_i), \quad (2)$$

where n is the number of data sources in a query request.

3.3 Problem Definition

Query requests always need different parameter configurations to be set by the user's experience. Whether the parameter settings are reasonable directly affects the throughput and the latency of the DDSMS, as well as the resource usage. As the data stream rate changes, we should dynamically adjust the parameter configurations to meet the user's latency and throughput thresholds. But SPS are often not allowed to adjust parameters arbitrarily. For example, the re-balance mechanism of Storm can only decrease the operator parallelism, not increase it. We need to predict the resource usage, the latency, and the throughput for any possible configuration. Based on the predicted results, the optimal configuration is defined as the one using minimal CPU and memory resources while the latency and throughput requirements are still guaranteed under the thresholds set by users.
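The latency definitions in Eqs. (1) and (2) translate directly into code; a minimal sketch, assuming per-source thread latencies are available as a nested list (this data layout is an illustration, not the system's actual representation):

```python
def source_latency(thread_latencies):
    # Eq. (1): a source's latency is the mean over its m threads
    return sum(thread_latencies) / len(thread_latencies)


def system_latency(per_source_thread_latencies):
    # Eq. (2): the DDSMS latency is the maximum over all n data sources
    return max(source_latency(t) for t in per_source_thread_latencies)
```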
Thus, the problem can be modeled as a resource usage optimization problem, as we discuss next. Let $N = (n_1, n_2, n_3, \ldots)$ be the set of all nodes in a cluster. The combined CPU usage $U_{CPU}$ and memory usage $U_{Memory}$ of each node $n_i$ can be defined as follows:

$$U(n_i) = \alpha \cdot U_{CPU}(n_i) + \beta \cdot U_{Memory}(n_i), \quad (\alpha + \beta = 1), \quad (3)$$

where $\alpha$ and $\beta$ are the weights of the CPU and memory usage respectively, both set to 50 percent by default. Then, let $C = (c_1, c_2, c_3, \ldots)$ be the candidate configurations supplied by users. For the entire cluster, we need to predict the optimal configuration $c_{opt}$ that achieves the following optimization goal:

$$\text{Minimize} \sum_{n_i \in N} U(n_i) \quad \text{s.t.} \quad R(latency) < T(latency) \ \text{and} \ R(throughput) > T(throughput), \quad (4)$$

where $R(latency)$ and $R(throughput)$ are the processing latency and the throughput of query requests, and $T(latency)$ and $T(throughput)$ are the thresholds set by users.

The constructed models fall into two categories, i.e., regression and classification models. The regression models are constructed for predicting the resource usage based on its numeric labels. To realize regression models with higher accuracy, we select four independent models that support incremental learning and have the highest regression accuracy (see Section 5.2). For predicting the processing latency and throughput, we build binary classification models that indicate whether the latency and throughput can meet the system specification. In this paper, we use the J48 algorithm in Weka for classification, which is an implementation of the C4.5 [38] algorithm. J48 is based on the decision tree, and the attribute with the maximum gain ratio is selected as the splitting attribute. The gain ratio of attribute A is defined as

$$GainRate(A) = \frac{Gain(A)}{SplitInfo_A(D)},$$

where $Gain(A)$ is the information gain of attribute A, and $SplitInfo_A(D)$ is the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A.
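The selection of $c_{opt}$ in Eqs. (3) and (4) amounts to a constrained minimization over the candidate set. A minimal sketch, assuming each candidate carries its predicted latency, throughput, and per-node (CPU, memory) usage; the dictionary layout is hypothetical:

```python
def node_usage(cpu, mem, alpha=0.5, beta=0.5):
    # Eq. (3): weighted CPU/memory usage of one node (alpha + beta = 1)
    return alpha * cpu + beta * mem


def optimal_config(candidates, t_latency, t_throughput):
    # Eq. (4): among candidates satisfying R(latency) < T(latency) and
    # R(throughput) > T(throughput), minimize the summed usage over nodes
    feasible = [c for c in candidates
                if c["latency"] < t_latency and c["throughput"] > t_throughput]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: sum(node_usage(cpu, mem) for cpu, mem in c["nodes"]))
```

Note that an infeasible candidate set yields no answer here; the real system would instead keep the current configuration or alert the user.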
4 ORIENTSTREAM ARCHITECTURE

This section gives an overview of OrientStream and the working principles of each step.

OrientStream Overview. Fig. 4 illustrates the overall architecture of OrientStream, which extends the original Storm and Kafka [20] cluster. It is mainly divided into two parts: the left part is the hierarchical feature extraction, and the right part is the query monitor that is responsible for

collecting the feature data and predicting the resource usage with incremental learning models.

In the left part, according to the characteristics of the DDSMS and the clusters, we divide feature extraction into four levels: data-level features of the data source, query plan-level features of the RQS, operator-level features of the SPS, and cluster-level features of the hardware platform (see Section 5.1 for details). Each data source collects and transmits its data stream using the Kafka message queue. Kafka is a distributed, high-throughput platform for handling real-time data feeds. In the right part, we analyze the feature data collected in real time, build the incremental learning models, and predict the resource usage of the DDSMS using model-based methods.

Model Construction. During incremental learning, we collect the feature data dynamically to generate the OrientStream data-set, use the updatable data-set to evaluate the prediction accuracy of each model, and use the prediction results to train the models incrementally. According to the characteristics of the data-set, we choose different incremental learning algorithms to build models based on the MOA and Weka toolkits.

Resource Prediction. As shown in Fig. 4, the Query Monitor is responsible for building the incremental learning models and generating the Resource Predictor from the collected feature data. Based on the data stream rate, the distribution, and the query requests submitted by users, we use the Resource Predictor to evaluate candidate configuration settings of the topology parameters, which include the spout parallelism, the bolt parallelism, etc. Then, we can predict the optimal configuration c_opt, and flag abnormal configurations, defined as those using too much resource or exceeding the latency or throughput thresholds.
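The abnormal-configuration test can be sketched as a simple predicate over the predicted metrics; field names and the strictness of the comparisons are assumptions for illustration:

```python
def is_abnormal_config(pred, max_usage, t_latency, t_throughput):
    # flag a configuration whose predicted resource usage is too high, or
    # whose predicted latency/throughput violates the user thresholds
    return (pred["usage"] > max_usage
            or pred["latency"] >= t_latency
            or pred["throughput"] <= t_throughput)
```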
In these cases, we can learn incrementally from traces of DDSMS executions, improve the robustness of the models via different workloads with different parallelism and query requests, and achieve higher prediction accuracy with learning techniques than with rule-of-thumb techniques.

Dynamic Tuning. During model deployment, the constructed models are used to determine the optimal configuration for the current data rate, so as to minimize the resource usage while still preserving the real-time property. To achieve this, two basic problems need to be addressed: (1) when to change the configuration, and (2) how to deploy the optimal configuration online.

TABLE 1
Data-Level Features

Name           Description
DATA_RATE      data rate of input RQS
DATA_PARA      data parallelism of input RQS
CBROKER        # Kafka brokers
CPARTITION     # topic partitions in Kafka
OPERATOR_RATE  avg rate of different operators

When to Change the Configuration. We make the decision based on a simple threshold solution over the streaming behaviors (i.e., the data producing and consuming rates) collected from the Storm topology and the Kafka producer. More specifically, the Kafka producer and the Storm cluster periodically send the data producing rate and the data consuming rate of the running topology, respectively, to the Query Monitor (see Fig. 4). Such data rates are then used in two conditions: (1) if the current data consuming rate significantly differs from the previous data consuming rate, the current configuration may be unsuitable; (2) if the data producing rate is higher than the consuming rate to a certain extent, data has to be stored in Kafka waiting to be processed, reducing the processing throughput. In either case, the topology configuration needs to be adjusted for resource efficiency, since the properties of the data stream have changed. Moreover, users can explicitly specify the parameters of these two conditions.
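The two conditions above can be sketched as a small predicate; the default relative-change and backlog thresholds below are hypothetical stand-ins for the user-specified parameters:

```python
def should_reconfigure(prev_consume, cur_consume, produce_rate,
                       rel_change=0.2, backlog_ratio=1.1):
    # condition (1): consuming rate differs significantly from the
    # previous checkpoint, so the current configuration may be unsuitable
    if prev_consume > 0 and abs(cur_consume - prev_consume) / prev_consume > rel_change:
        return True
    # condition (2): producing rate outpaces consuming rate, so tuples
    # back up in Kafka and throughput drops
    return produce_rate > backlog_ratio * cur_consume
```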
For example, for the first condition, users can specify not only the difference between the current and previous consuming rates, but also the distance between the two checkpoints.

Deploying the Configuration Online. Due to the limitations of the re-balance mechanism of Storm, we proposed in [33] and [34] to deactivate the old topology and then switch to a new topology with the optimal configuration. However, the process of switching to a new topology incurs about 1 minute of delay. So, the application scenarios need to meet two requirements: (1) the monitoring cycle of the data stream is long; (2) data fluctuations are infrequent relative to the switching time. In this paper, we use the Operator States Migration to achieve live adjustment across different nodes. It enables workloads to perform migrations automatically, without losing any data or incurring unnecessary restarting cost. According to the predicted optimal configuration, we can increase or decrease the number of workers dynamically and adjust the parallelism of each operator arbitrarily.

5 MODELING RESOURCE PREDICTION

This section focuses on the implementation details, including feature extraction, model building and integration, the abnormal detection mechanism and its applications, and live adjustment.

5.1 Data-Set Features

In this section, we describe the features used as the input OrientStream data-set to the learning models for predicting the resource usage. For this purpose, we need to specify the granularity of the input features. The feature extraction strategy not only includes the data features and query features of the RQS, but also takes into account the environment configurations beneath the RQS, which include the configurations of the SPS and the features of the machine clusters. Based on a series of experiments and specific domain knowledge, we choose features from the following four aspects:

Data-Level Features.
The data-level features are listed in Table 1. We control the rate of the input data stream, DATA_RATE, by changing the spout parallelism. The number of Kafka brokers and the number of topic partitions are also taken into account. In addition, because a change of DATA_RATE can influence the rate of subsequent operators, we

need to calculate the rates of the relevant operators dynamically (e.g., the map matching bolt rate and the speed calculation bolt rate in the traffic monitoring workload).

Query Plan-Level Features. The query plan-level features are defined in Table 2. A continuous query generates a DAG-style topology from its query plan. The amount of data waiting to be processed and the window sliding step can be controlled by WINDOW_SIZE and SLIDING_STEP. Moreover, we count the number of different types of operators.

TABLE 2
Query Plan-Level Features

Name          Description
WINDOW_SIZE   # tuples in a window
SLIDING_STEP  step of sliding window
COPERATOR     # different operators in a query

Operator-Level Features. The operator-level features are listed in Table 3. We can monitor resource usage by adjusting the operator parallelism in different windows. If the same operator appears more than once, we take the average value as the criterion of feature extraction. Meanwhile, we include the number of Storm workers as a feature that can influence the processing efficiency of the operators.

TABLE 3
Operator-Level Features

Name           Description
OPERATOR_PARA  avg parallelism of operators
WORKER         # workers in Storm

Cluster-Level Features. The cluster features are summarized in Table 4. We define cluster-level features including the number of worker nodes, the number of CPU cores, the memory capacity, and the network transmission rate.

TABLE 4
Cluster-Level Features

Name       Description
WORK_NODE  # worker nodes
CPU_CORE   # CPU cores
MEMORY     memory capacity (GB)
NETWORK    network transmission rate (Gbps)

5.2 Algorithms

At this point, the OrientStream data-set is generated in real time. Through experimental comparison, we choose the four incremental learning models in the MOA and Weka toolkits with the highest prediction accuracy.

Bayes model: we use Naive Bayes (NB) from the MOA toolkit. The Bayes [21] algorithm updates internal counters with each new instance and uses the counters to assign a class, in a probabilistic fashion, to a new item in the stream.
It can be defined as follows:

$$p(C = c \mid X = x) = \frac{P(c) \prod_i p(X_i = x_i \mid C = c)}{\prod_i p(X_i = x_i)}, \quad (5)$$

where i ranges over the attributes of X.

Hoeffding tree model: we use the HoeffdingTree (HT), also from the MOA toolkit. A Hoeffding tree, or VFDT [22], is an incremental decision tree that is capable of learning from massive data streams. It uses information entropy or the Gini index as the criterion for selecting split attributes, and uses the Hoeffding inequality as the condition for deciding to split a node. It exploits the fact that a small sample can often be enough to choose an optimal splitting attribute.

Online bagging model: we use OzaBagging [23] from the MOA toolkit with the Hoeffding tree as the base learner. Oza and Russell propose the online bagging algorithm, which gives each instance a weight for re-sampling. Each base learner's training set contains K copies of each of the original training examples, where

$$P(K = k) = \binom{N}{k} \left(\frac{1}{N}\right)^k \left(1 - \frac{1}{N}\right)^{N-k}, \quad (6)$$

which is the binomial distribution. As $N \to \infty$, the distribution of K tends to a Poisson(1) distribution.

Nearest neighbors model: we use IBk [24] from the Weka toolkit, which achieves incremental learning through its updateable classifier method. IBk is an instance-based learning algorithm. It is identical to the k nearest neighbors algorithm except that it normalizes its attribute ranges and processes instances incrementally. The similarity function is as follows:

$$Similarity(x, y) = \sqrt{\sum_{i=1}^{n} f(x_i, y_i)}, \quad (7)$$

where the instances are described by n attributes. We define $f(x_i, y_i) = (x_i - y_i)^2$ for numeric-valued attributes and $f(x_i, y_i) = (x_i \neq y_i)$ for Boolean and symbolic-valued attributes.
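Eq. (7) can be sketched as follows (a minimal version that omits IBk's attribute-range normalization for brevity):

```python
import math


def similarity(x, y):
    # Eq. (7): distance over n attributes — squared difference for numeric
    # attributes, 0/1 mismatch for Boolean and symbolic attributes
    def f(a, b):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return (a - b) ** 2
        return 0.0 if a == b else 1.0

    return math.sqrt(sum(f(a, b) for a, b in zip(x, y)))
```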
5.3 EDKRegression Model

Ensemble learning is one of the most effective methods for improving the accuracy of regression algorithms. Its contribution is to improve regression accuracy by aggregating multiple base models. Ensemble learning methods mainly include Bagging [26] and Boosting [27]. Although their mathematical foundations are not exactly the same, both attempt to produce efficient committees by combining the outputs of multiple base models to improve the performance of regression algorithms. However, Bagging and Boosting only ensemble the same kind of regression model (e.g., the Hoeffding tree). As discussed in Section 5.2, our four models do not interact and their predictions are independent of each other. Therefore, we combine these four models and propose a regression model, EDKRegression (Ensemble Different Kinds of Regression). Without loss of generality, as shown in Fig. 5, EDKRegression combines k base regression models R_1, R_2, ..., R_k, and creates an improved composite regression model R by scoring the base models. For a newly arriving tuple t_n, EDKRegression obtains the prediction result as a vote of the k base regression models.
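Once each base model's relative absolute error (RAE) is known, the weighted vote itself reduces to a few lines; a minimal sketch of the objective in Eq. (9) below, where lower-RAE models receive larger weights and the weights sum to 1:

```python
def edk_predict(predictions, raes):
    # Eq. (9): base model i gets weight (1 - RAE_i) / (k - sum_j RAE_j),
    # so more accurate models (lower RAE) vote with more weight
    k = len(predictions)
    denom = k - sum(raes)
    return sum((1.0 - r) / denom * p for p, r in zip(predictions, raes))
```

Because the weights sum to 1, the composite prediction always lies within the range spanned by the base predictions.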

Fig. 5. EDKRegression prediction process.

The detailed description of EDKRegression is listed in Algorithm 1.

Algorithm 1. EDKRegression Algorithm
Require: the prediction values of the base learning models for sample n
Ensure: the prediction value of EDKRegression for sample n
1: compute AVG_N
2: for (i = 1; i < N; i++) do
3:   compute RAE_n using Equation (8)
4: end for
5: for (j = 1; j < k; j++) do
6:   compute F_n using Equation (9)
7: end for

In the prediction process, first, EDKRegression lets each base regression model make its prediction; then, according to the relative absolute error (RAE) of each base regression model, EDKRegression calculates the voting weights; finally, it determines the regression value of the sample with a weighted vote. The RAE and the objective function of EDKRegression are defined as follows. Let N be the set of training samples in each regression model and n the nth sample to be predicted. P_n is the prediction value of the regression model for sample n, T_n is the true value of sample n, and AVG_N is the mean over the N samples. Then

$$RAE_n = \sum_{i=1}^{N} |P_{in} - T_i| \; \Big/ \; \sum_{j=1}^{N} |T_j - AVG_N|, \quad (8)$$

where RAE_n is the relative absolute error of the nth sample under a given base regression model. Based on the k base regression models, the objective function of EDKRegression for predicting sample n is

$$F_n = \sum_{i=1}^{k} \frac{1 - RAE_{in}}{k - \sum_{j=1}^{k} RAE_{jn}} \cdot P_{in}, \quad (9)$$

where F_n is the prediction value of the nth sample under the EDKRegression model, and P_in is the prediction value of the nth sample from the ith base regression model.

5.4 Abnormal Detection Mechanism

Fig. 6. Anomaly detection mechanism flow.

Due to abnormal execution of the cluster (e.g., the cluster load is not balanced, or network transmission is not stable), or
the observation does not fit into the model, it results in some abnormal training data. Based on the difference between the predicted value and the observed value (e.g., resource usage), we divide the training data into three types: Normal. Detected data can be predicted correctly by learning models. Warning. Detected data is mis-predicted, and more than a half of nearest neighbors to the data are mispredicted too. Outlier. Detected data is mis-predicted, but more than a half of nearest neighbors to the data are predicted correctly. Algorithm 2. Abnormal Detection Algorithm Require: Prediction Error Threshold T e ; The number of tuples to nearest tag tuple k NN ; The regression file F contains predicted value and true value; Ensure: The type of detected tuple t; 1: if (jt:predictedvalue t:truevaluej <T e ) then 2: t is Normal; 3: else 4: for (i ¼ 1; i < Number of TestDataSet; i þþ) do 5: Computing the k NN tuples nearest to t; 6: end for 7: for (j ¼ 1; j<k NN ; j þþ) do 8: if (jt k :PredictedValue t k :TrueValuej >T e ) then 9: N warning +1; 10: else 11: N outlier +1; 12: end if 13: end for 14: if (N warning >N outlier ) then 15: t is Warning; 16: else 17: t is Outlier; 18: end if 19: end if Fig. 6 shows the anomaly detection flow. We can improve the prediction accuracy by discarding abnormal training data and not getting it into the incremental learning model. Besides using this method to test new observations, we can re-train our models by subtracting observations marked as outlier. Algorithm 2 gives the detailed description of the detection procedure. For each detected data, first, we select the

8 Fig. 7. Architecture of OSM. normal data based on the error threshold T e (Algorithm 2 lines 1-2). Then, for each abnormal test data, we compute the k NN tuples that are nearest to the test data. Lines 4-6 in Algorithm 2 depicts this process. Next, we count the number of warning or outlier of these k NN tuples by the definition of warning or outlier category (Algorithm 2 lines 7-13). Finally, we judge whether the test data is warning or outlier (Algorithm 2 lines 14-18). 5.5 Live Adjustment Because the re-balance mechanism of Storm can only guarantee that the number of executors is less than or equal to the number of tasks, when adjusting the maximum of the number of tasks, we need to re-build a new topology based on the predicted optimal configuration [33], [34]. It causes about 1 minute delay to re-build a topology. In this section, we introduce the system of Operator States Migration [37] to implement the live adjustment of different configurations without building a new topology. It can dynamically change the operator parallelism and achieve accurate queries without missing operator states. The architecture of OSM is shown in Fig. 7 [37]. Each worker in Storm maintains its own thread pool (from task 1 to task n ). Meanwhile, it creates an Operator States Manager (OSMer) to access the application data when all executors are killed. Hence, the pointers in the OSMer ensures that no application data is marked as garbage, and remain in memory during the migration. The OSMer classifies operator states into three types, marked as to stay, to move out and to move in, based on the adjustment strategy. To stay tasks do not require extra processing; to move out tasks are to be moved out to another node, and their states need to serialize into a file server (as shown in Fig. 7); to move in tasks are moved to the current node from another node, and their states need to download from file server to a separate retriever thread illustrated in Fig. 7. 
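The three state categories can be illustrated by comparing the task-to-node assignments before and after an adjustment. This is a minimal sketch from the perspective of one node; the data structures and names are ours, not the actual OSM API of [37]:

```python
# Illustrative classification of operator states during a migration,
# seen from one node. old_assign / new_assign map task id -> node name.

def classify_states(node, old_assign, new_assign):
    stay, move_out, move_in = [], [], []
    for task in set(old_assign) | set(new_assign):
        before = old_assign.get(task)
        after = new_assign.get(task)
        if before == node and after == node:
            stay.append(task)      # to stay: no extra processing
        elif before == node and after != node:
            move_out.append(task)  # to move out: serialize state to file server
        elif before != node and after == node:
            move_in.append(task)   # to move in: retriever thread downloads state
    return stay, move_out, move_in
```

Only the move-out and move-in sets generate traffic, which is why the scheduling technique of [39] (discussed next) focuses on minimizing their transmission time.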
Moreover, OSM uses the scheduling technique in [39] to reduce the network traffic cost during migration and minimize the total transmission time.

6 EVALUATION

This section presents the three workloads of our test scenarios and gives the results of predicting resource usage with the different learning models. Taking the prediction of CPU usage as an example, we verify the effect of the anomaly detection mechanism, compare how much it reduces mis-prediction during incremental learning, and measure the resources saved by the dynamic tuning mechanism. The testbed is a cluster of ten nodes connected by a 1 Gbit Ethernet switch. Five nodes transmit the data sources through Kafka, one node serves as the nimbus of Storm, and the remaining four nodes act as supervisor nodes. Each data source node and the nimbus node have an Intel E GHz four-core CPU and 4 GB of DDR3 RAM. Each supervisor node has two Intel E GHz twelve-core CPUs and 64 GB of DDR3 RAM. We run comprehensive evaluations of our prototype on Storm and Ubuntu. To cover different query characteristics, we choose the following three workloads.

Traffic Monitoring (TM). This workload was introduced in detail in Section 1.

Word Count (WC). Word count performs word-frequency statistics over sentences and is often used as a representative per-sentence stream processing workload. It is composed of one bolt that splits sentences into words and another that counts the occurrences of each word in a hash-map. We use the word count data-set from HiBench [6], with about 3 million sentences and more than 30 million words for testing.

TPC-H (Q3). TPC-H is a decision support benchmark whose queries and data are chosen for broad industry-wide relevance. To verify query processing over multiple data streams, we choose Q3 as the workload, since it reads three data sources: customer, orders, and lineitem.
We use 15 million tuples for each data source of this workload. The topology contains three filter bolts to filter the data sources, two join bolts to equi-join the three data sources, a group bolt to group the join results, and an order bolt to sort the grouped results. The sink bolt merges the output; for simplicity, we only use the sink to count the received results.

The topologies of these workloads are shown in Fig. 8. Our workloads cover two aspects: (1) various topology complexities, from a single-chain topology with one data source to a complex DAG topology with multiple data sources; and (2) different runtime characteristics: the TM workload is memory intensive, the WC workload is communication intensive, and the Q3 workload is compute intensive.

6.1 Online Resource Prediction

To improve prediction accuracy, we use the mean CPU usage, memory usage, latency, and throughput as the prediction targets for each training sample. For each workload, we collect 3,000 samples by setting the data rate and window sizes randomly. According to the T(latency) and T(throughput) thresholds set by users, we build two binary classification models. The classification accuracy of latency and throughput on the three workloads is shown in Fig. 9: it stays at about 90 to 96 percent as the samples increase from 500 to 3,000, which provides a solid basis for the next step of predicting resource usage. For the regression prediction of resource usage, we use the learning models described in Sections 5.2 and 5.3. Tables 5 and 6 show the mean absolute error (MAE) and RAE of each method on the different workloads. The experimental results show that: 1) The average RAE of memory usage prediction is lower than that of CPU usage prediction, because of

Fig. 8. Topology of three different workloads.

the fluctuation of memory usage being lower than that of CPU usage. 2) For CPU usage prediction on the different workloads, EDKRegression beats the best of the four base models, reducing MAE by and RAE by percent; for memory usage prediction, EDKRegression again beats the best of the four base models, reducing MAE by and RAE by percent.

6.2 Features Influence

We have proposed various features in Section 5.1. Their influence differs across workloads and prediction targets, so we use ReliefFAttributeEval [30], [31], [32] from the Weka toolkit, which evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and of a different class. We take the data-set features extracted from the TPC-H (Q3) workload as an example. As shown in Fig. 10a, when the regression models predict CPU usage, the higher-ranked features include the spout rate, the number of topic partitions and Kafka brokers, every bolt rate, and memory usage, while the lower-ranked features include the number of workers, spout parallelism, every bolt parallelism, window size, and sliding step. In contrast, as shown in Fig. 10b, when the regression models predict memory usage, the higher-ranked features include the number of workers, spout parallelism, every bolt parallelism, sliding step, and window size, while the lower-ranked features include the number of topic partitions and Kafka brokers, spout rate, join bolt rate, and CPU usage. The results of the feature influence study show that: (1) the speed of the data stream and the processing rate of the operators greatly influence the prediction of CPU usage;
(2) the number of workers and the operator parallelism greatly influence the prediction of memory usage; and (3) the extracted data-set features are sufficient to predict workloads with different runtime characteristics.

6.3 Abnormal Filter Prediction

In this section, we take the 3,000 CPU-usage samples of the Traffic Monitoring workload as an example, use EDKRegression as the prediction model, set the prediction error threshold T_e to one standard deviation, and set the number of nearest neighbors to 5. As shown in Fig. 11, there are 55 outliers and 8 warnings in the data set. By discarding the outliers and re-training the prediction model, we reduce the MAE from to and the RAE from to .

6.4 Incremental Learning Prediction

In continuous query analysis, the training set grows as the data streams are updated. Incremental learning models keep reducing the prediction error rate as the training samples increase. We take the prediction of CPU usage on the Traffic Monitoring workload as an example, and compare the five incremental models and one batch model from 500 samples up to 3,000 samples.

6.4.1 Effect of Incremental Learning Prediction

For the five learning models proposed in this paper, Fig. 12 shows that their RAE decreases from 500 samples to 3,000 samples. The optimization effect of incremental learning is therefore significant: increasing the samples reduces RAE observably.

Fig. 9. Prediction accuracy.

6.4.2 Comparison of Incremental Learning and Batch Learning

In this section, we compare incremental learning and batch learning in terms of relative absolute error, learning runtime, and resource usage.
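The operational difference between the two regimes can be made concrete with a toy test-then-train (prequential) loop. The model below is an illustrative running-mean stub of our own, not one of the paper's MOA models:

```python
# Instance-incremental (test-then-train) evaluation: every tuple is
# first predicted, then used to update the model, so no separate
# train/test split is needed and the model is always up to date.

class RunningMean:
    """Toy incremental model: predicts the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.total = 0, 0.0
    def predict(self, x):
        return self.total / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.total += y

def prequential_mae(model, stream):
    """Mean absolute error accumulated over a test-then-train pass."""
    err = 0.0
    for x, y in stream:
        err += abs(model.predict(x) - y)  # test on the tuple first...
        model.learn(x, y)                 # ...then train on it
    return err / len(stream)
```

A batch learner, by contrast, would retrain from scratch on the full accumulated history at each evaluation point, which is why its modeling time grows with the sample count while the incremental update cost stays per-tuple.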

TABLE 5
Mean and Relative Absolute Error Per Method for CPU Usage Prediction
(MAE and RAE test values for NB, HT, OzaBag (HT), IBk, and EDKRegression on Traffic Monitoring, Word Count, and TPC-H (Q3))

TABLE 6
Mean and Relative Absolute Error Per Method for Memory Usage Prediction
(MAE and RAE test values for NB, HT, OzaBag (HT), IBk, and EDKRegression on Traffic Monitoring, Word Count, and TPC-H (Q3))

We choose the two models with the lowest relative absolute error: the incremental one is EDKRegression, and the batch one is random forest [29], run with a split of 75 percent training data and 25 percent testing data in the Weka toolkit to achieve its best regression results. As shown in Fig. 13a, as the volume of data increases, EDKRegression reaches a lower relative absolute error. Meanwhile, EDKRegression also saves time for modeling and prediction, and saves resource usage (see Figs. 13b, 13c, and 13d).

Fig. 10. Features influence.
Fig. 11. Anomaly filter.
Fig. 12. Incremental prediction of RAE.
Fig. 13. Incremental learning versus batch learning.

6.5 Dynamic Tuning

Because the stream rate varies randomly, we adjust the parameters of the SPS to ensure that OrientStream meets the user's thresholds. As shown in Figs. 14 and 15, when the three workloads run under OrientStream at different data rates, the latency stays below the threshold (e.g., 1 second) and the throughput stays above the threshold (e.g., 10,000 tuples/second). As shown in Figs. 16 and 17, compared with the fixed configuration, our proposed EDKRegression can save 8-15 percent of CPU usage and percent of memory usage over the whole running process of the different workloads. Among them, Traffic Monitoring saves 8 percent of the CPU usage and 48 percent of the memory usage due to its memory intensive characteristic; Word Count saves 12 percent of the CPU usage and 38 percent of the memory usage due to its communication intensive characteristic; and TPC-H (Q3) saves 15 percent of the CPU usage and 45 percent of the memory usage due to its compute intensive characteristic. At the same time, compared with the existing algorithms with the highest prediction accuracy, EDKRegression also saves more resources: for CPU usage, it saves 2-6 percent over OzaBag (HT); for memory usage, it saves 9-12 percent over IBk.

Fig. 14. Latency.
Fig. 15. Throughput.
Fig. 16. CPU usage.
Fig. 17. Memory usage.

6.6 Live Adjustment

When deploying configurations, we gain tremendous benefits from the live adjustment mechanism. As

shown in Fig. 18, the three workloads differ in how their configurations are adjusted. Because their stream rates and query requests differ, the tuning points of the three workloads change as well. The adjustment time using live adjustment is between 100 and 300 milliseconds, whereas, when the conditions of the re-balance mechanism are satisfied, the adjustment time using re-balance is between 5 and 7 seconds. If we need to increase the operator parallelism, re-building the topology takes about 1 minute.

Fig. 18. Configuration adjustment.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose OrientStream, which predicts the resource usage of DDSMS using incremental learning techniques, and we develop a learning tool that performs data acquisition automatically for different workloads. Our approach uses the MOA and Weka toolkits to build real-time prediction models over features extracted at four levels of granularity. In addition, we introduce the EDKRegression algorithm and the abnormal execution detection mechanism to improve prediction accuracy, and we verify the validity of our methods. Moreover, we adopt the Operator States Migration system to achieve live adjustment of operator parallelism across nodes, which significantly reduces the adjustment time and the processing latency. We evaluate OrientStream by running three workloads with different runtime characteristics. Our experimental results demonstrate that OrientStream can significantly reduce CPU usage by 8-15 percent and memory usage by percent, and can save further resources with incrementally updated models. Hence, it is promising to deploy OrientStream in production to significantly increase the user's return on investment. As future work, there are two directions to further improve our work.
(1) We plan to collect more training data to improve prediction accuracy through incremental learning, and to further optimize the regression model so that it achieves higher accuracy with fewer samples. (2) In this paper, the optimal configuration is predicted from manually supplied candidate configurations, so we plan to design a real-time recommendation mechanism that helps users obtain the optimal configuration.

ACKNOWLEDGMENTS

This research was partially supported by grants from the Natural Science Foundation of China (No. , , , , ); the National Key Research and Development Program of China (No. 2016YFB , 2016YFB ); the Research Funds of Renmin University (No. 11XNL010); and the Science and Technology Opening up Cooperation project of Henan Province (No. ).

REFERENCES

[1] (2017). [Online]. Available:
[2] (2017). [Online]. Available: spark-streamingsql/
[3] (2017). [Online]. Available:
[4] (2017). [Online]. Available:
[5] (2017). [Online]. Available: weka/
[6] (2017). [Online]. Available: HiBench/
[7] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proc. IEEE Int. Conf. Data Mining Workshops, 2010.
[8] A. Toshniwal et al., "Storm@twitter," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014.
[9] M. J. Sax, M. Castellanos, Q. Chen, and M. Hsu, "Aeolus: An optimizer for distributed intra-node-parallel streaming systems," in Proc. IEEE Int. Conf. Data Eng., 2013.
[10] T. Z. J. Fu, J. Ding, R. T. B. Ma, M. Winslett, Y. Yang, and Z. Zhang, "DRS: Dynamic resource scheduling for real-time analytics over fast streams," Comput. Sci., vol. 690, no. 1.
[11] L. Aniello, R. Baldoni, and L. Querzoni, "Adaptive online scheduling in Storm," in Proc. ACM Int. Conf. Distrib. Event-Based Syst., 2013.
[12] G. R. Bitran and R. Morabito, "State-of-the-art survey: Open queueing networks: Optimization and performance evaluation models for discrete manufacturing systems," Prod. Operations Manage., vol. 5, no. 2.
[13] A. Khoshkbarforoushha, R. Ranjan, R. Gaire, P. P. Jayaraman, J. Hosking, and E. Abbasnejad, "Resource usage estimation of data stream processing workloads in datacenter clouds," CoRR.
[14] C. M. Bishop, "Mixture density networks," Neural Netw., IEEE-INNS-ENNS Int. Joint Conf., IEEE Comput. Soc.
[15] N. Poggi, D. Carrera, A. Call, and S. Mendoza, "ALOJA: A systematic study of Hadoop deployment variables to enable automated characterization of cost-effectiveness," in Proc. IEEE Int. Conf. Big Data, 2015.
[16] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green, "ALOJA-ML: A framework for automating characterization and knowledge discovery in Hadoop deployments," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2015.
[17] P. Jamshidi and G. Casale, "An uncertainty-aware approach to optimal configuration of stream processing systems," in Proc. IEEE Int. Symp. Model. Anal. Simul. Comput. Telecommun. Syst., 2016.

[18] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri, "Robust estimation of resource consumption for SQL queries using statistical techniques," Proc. VLDB Endowment, vol. 5, no. 11.
[19] M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. B. Zdonik, "Learning-based query performance modeling and prediction," in Proc. IEEE 28th Int. Conf. Data Eng., 2012.
[20] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in Proc. Workshop Netw. Meets Databases, 2011.
[21] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proc. 11th Conf. Uncertainty Artif. Intell., 2013.
[22] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000.
[23] N. C. Oza and S. Russell, "Experimental comparisons of online and batch versions of bagging and boosting," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001.
[24] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Mach. Learn., vol. 6, no. 1.
[25] Y. Zheng, L. Zhang, X. Xie, and W. Y. Ma, "Mining interesting locations and travel sequences from GPS trajectories," in Proc. Int. Conf. World Wide Web, 2009.
[26] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2.
[27] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 7.
[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for internet-scale systems," in Proc. USENIX Annu. Tech. Conf., 2010.
[29] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5-32.
[30] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Workshop Mach. Learn., 1992.
[31] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," in Proc. Eur. Conf. Mach. Learn., 1994.
[32] M. Robnik-Sikonja and I. Kononenko, "An adaptation of Relief for attribute estimation in regression," in Proc. 14th Int. Conf. Mach. Learn., 1997.
[33] C. Wang, X. Meng, Q. Guo, Z. Weng, and C. Yang, "OrientStream: A framework for dynamic resource allocation in distributed data stream management systems," in Proc. 25th ACM Int. Conf. Inf. Knowl. Manage., 2016.
[34] Z. Weng, Q. Guo, C. Wang, X. Meng, and B. He, "AdaStorm: Resource efficient Storm with adaptive configuration," in Proc. 33rd IEEE Int. Conf. Data Eng., 2017.
[35] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, "R-Storm: Resource-aware scheduling in Storm," in Proc. 16th Annu. Middleware Conf., 2015.
[36] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, memory speed storage for cluster computing frameworks," in Proc. ACM Symp. Cloud Comput., 2014.
[37] J. Ding, T. Z. J. Fu, R. T. B. Ma, M. Winslett, Y. Yang, Z. Zhang, and H. Chao, "Optimal operator state migration for elastic data stream processing," CoRR.
[38] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann.
[39] W. Rödiger et al., "Locality-sensitive operators for parallel main-memory database clusters," in Proc. IEEE Int. Conf. Data Eng., 2014.

Chunkai Wang received the bachelor's and master's degrees from Zhengzhou University, both in computer science, in 2004 and 2007, respectively. He is currently working toward the PhD degree in the School of Information, Renmin University of China. His current research interests include data stream computing and distributed systems.
Xiaofeng Meng received the bachelor's degree from Hebei University, the master's degree from Renmin University of China, and the PhD degree from the Chinese Academy of Sciences, all in computer science, in 1987, 1993, and . He is currently a professor in the School of Information, Renmin University of China. His current research interests include cloud data management, web data management, flash-based databases, and privacy protection. He is a member of the IEEE.

Qi Guo received the BE degree in computer science from the Department of Computer Science and Technology, Tongji University, in 2007, and the PhD degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), in . He is currently an associate professor at ICT. His research interests include computer architecture and VLSI design and verification.

Zujian Weng received the bachelor's degree from the Nanjing University of Aeronautics and Astronautics, in . He is working toward the postgraduate degree in the School of Information, Renmin University of China. His research interests include data stream computing and energy-efficiency computing.

Chen Yang received the bachelor's and master's degrees from Zhengzhou University. He is working toward the PhD degree in the School of Information, Renmin University of China. He is also an assistant in the School of Software Engineering, Zhengzhou University of Light Industry, China. His research interests include big data systems and machine learning.


More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

7. Boosting and Bagging Bagging

7. Boosting and Bagging Bagging Group Prof. Daniel Cremers 7. Boosting and Bagging Bagging Bagging So far: Boosting as an ensemble learning method, i.e.: a combination of (weak) learners A different way to combine classifiers is known

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC 2018 Storage Developer Conference. Dell EMC. All Rights Reserved. 1 Data Center

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li Welcome to DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li Time: 6:00pm 8:50pm Wednesday Location: Fuller 320 Spring 2017 2 Team assignment Finalized. (Great!) Guest Speaker 2/22 A

More information

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors. About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Streaming & Apache Storm

Streaming & Apache Storm Streaming & Apache Storm Recommended Text: Storm Applied Sean T. Allen, Matthew Jankowski, Peter Pathirana Manning 2010 VMware Inc. All rights reserved Big Data! Volume! Velocity Data flowing into the

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

A Systematic Overview of Data Mining Algorithms

A Systematic Overview of Data Mining Algorithms A Systematic Overview of Data Mining Algorithms 1 Data Mining Algorithm A well-defined procedure that takes data as input and produces output as models or patterns well-defined: precisely encoded as a

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems

Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 215 1 Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems Abstract In cloud-based stream processing services, the maximum

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.9, September 2017 139 Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks MINA MAHDAVI

More information

An Optimization Algorithm of Selecting Initial Clustering Center in K means

An Optimization Algorithm of Selecting Initial Clustering Center in K means 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017) An Optimization Algorithm of Selecting Initial Clustering Center in K means Tianhan Gao1, a, Xue Kong2, b,* 1 School

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

REAL-TIME ANALYTICS WITH APACHE STORM

REAL-TIME ANALYTICS WITH APACHE STORM REAL-TIME ANALYTICS WITH APACHE STORM Mevlut Demir PhD Student IN TODAY S TALK 1- Problem Formulation 2- A Real-Time Framework and Its Components with an existing applications 3- Proposed Framework 4-

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Flying Faster with Heron

Flying Faster with Heron Flying Faster with Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END [! OVERVIEW TWITTER IS

More information

Analysis of offloading mechanisms in edge environment for an embedded application

Analysis of offloading mechanisms in edge environment for an embedded application Volume 118 No. 18 2018, 2329-2339 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Analysis of offloading mechanisms in edge environment for an embedded

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence 2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da

More information

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Comparative Analysis of Range Aggregate Queries In Big Data Environment Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.

More information

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation Yaozong LIU 1*, Hong ZHANG 1, Fawang HAN 2, Jun TAN 3 1 School of Computer Science and Engineering Nanjing University of Science

More information

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 2 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0032 Privacy-Preserving of Check-in

More information

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Functional Comparison and Performance Evaluation Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Overview Streaming Core MISC Performance Benchmark Choose your weapon! 2 Continuous Streaming Micro-Batch

More information

An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles

An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles 2016 2 nd International Conference on Energy, Materials and Manufacturing Engineering (EMME 2016) ISBN: 978-1-60595-441-7 An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d International Forum on Mechanical, Control and Automation IFMCA 2016 A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1,

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Contents. Part I Setting the Scene

Contents. Part I Setting the Scene Contents Part I Setting the Scene 1 Introduction... 3 1.1 About Mobility Data... 3 1.1.1 Global Positioning System (GPS)... 5 1.1.2 Format of GPS Data... 6 1.1.3 Examples of Trajectory Datasets... 8 1.2

More information

Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack

Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack J.Anbu selvan 1, P.Bharat 2, S.Mathiyalagan 3 J.Anand 4 1, 2, 3, 4 PG Scholar, BIT, Sathyamangalam ABSTRACT:

More information

Robot localization method based on visual features and their geometric relationship

Robot localization method based on visual features and their geometric relationship , pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Cisco Tetration Analytics

Cisco Tetration Analytics Cisco Tetration Analytics Enhanced security and operations with real time analytics Christopher Say (CCIE RS SP) Consulting System Engineer csaychoh@cisco.com Challenges in operating a hybrid data center

More information

Accelerating Analytical Workloads

Accelerating Analytical Workloads Accelerating Analytical Workloads Thomas Neumann Technische Universität München April 15, 2014 Scale Out in Big Data Analytics Big Data usually means data is distributed Scale out to process very large

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

STORM AND LOW-LATENCY PROCESSING.

STORM AND LOW-LATENCY PROCESSING. STORM AND LOW-LATENCY PROCESSING Low latency processing Similar to data stream processing, but with a twist Data is streaming into the system (from a database, or a netk stream, or an HDFS file, or ) We

More information

Detect tracking behavior among trajectory data

Detect tracking behavior among trajectory data Detect tracking behavior among trajectory data Jianqiu Xu, Jiangang Zhou Nanjing University of Aeronautics and Astronautics, China, jianqiu@nuaa.edu.cn, jiangangzhou@nuaa.edu.cn Abstract. Due to the continuing

More information

A Dynamic TDMA Protocol Utilizing Channel Sense

A Dynamic TDMA Protocol Utilizing Channel Sense International Conference on Electromechanical Control Technology and Transportation (ICECTT 2015) A Dynamic TDMA Protocol Utilizing Channel Sense ZHOU De-min 1, a, LIU Yun-jiang 2,b and LI Man 3,c 1 2

More information

A Network-aware Scheduler in Data-parallel Clusters for High Performance

A Network-aware Scheduler in Data-parallel Clusters for High Performance A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li, Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Data-parallel clusters

More information

Optimization on TEEN routing protocol in cognitive wireless sensor network

Optimization on TEEN routing protocol in cognitive wireless sensor network Ge et al. EURASIP Journal on Wireless Communications and Networking (2018) 2018:27 DOI 10.1186/s13638-018-1039-z RESEARCH Optimization on TEEN routing protocol in cognitive wireless sensor network Yanhong

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: KH 116 Fall 2017 First Grading for Reading Assignment Weka v 6 weeks v https://weka.waikato.ac.nz/dataminingwithweka/preview

More information

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS Topics AGENDA Challenges with Big Data Analytics How SAS can help you to minimize time to value with

More information

Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks

Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks Chang Su, Lili Zheng, Xiaohai Si, Fengjun Shang Institute of Computer Science & Technology Chongqing University of Posts and

More information

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

Multiple Query Optimization for Density-Based Clustering Queries over Streaming Windows

Multiple Query Optimization for Density-Based Clustering Queries over Streaming Windows Worcester Polytechnic Institute DigitalCommons@WPI Computer Science Faculty Publications Department of Computer Science 4-1-2009 Multiple Query Optimization for Density-Based Clustering Queries over Streaming

More information