Automating Characterization Deployment in Distributed Data Stream Management Systems

Chunkai Wang, Xiaofeng Meng, Member, IEEE, Qi Guo, Zujian Weng, and Chen Yang

Abstract: Distributed data stream management systems (DDSMS) are usually composed of an upper-layer relational query system (RQS) and a lower-layer stream processing system (SPS). When users submit new queries to the RQS, a query plan needs to be converted into a directed acyclic graph (DAG) of tasks running on the SPS. Based on different query requests and data stream properties, the SPS needs to adopt different deployment strategies. However, dynamically predicting the deployment configuration of the SPS that ensures high processing throughput and low resource usage is a great challenge. This article presents OrientStream, a framework for automating characterization deployment in DDSMS using incremental machine learning techniques. By introducing a four-level feature extraction mechanism (data-level, query plan-level, operator-level, and cluster-level), we first use different query workloads as training sets to predict the resource usage of the DDSMS, then select the optimal resource configuration from candidate settings based on the current query requests and stream properties, and finally migrate operator state through dynamic reconfiguration. We validate our approach on the open-source SPS Storm. For application scenarios with long monitoring cycles and infrequent data fluctuations, experiments show that OrientStream can reduce CPU usage by 8-15 percent, along with a corresponding reduction in memory usage.
Index Terms: Stream processing system, relational query system, incremental learning, modeling and prediction

1 INTRODUCTION

Nowadays, many modern applications require large-scale continuous queries and analysis, such as analytics over microblogs in social networks, monitoring of high-frequency trading in financial fields, real-time recommendation in electronic business, and so on. Therefore, there have been many open-source SPS, such as S4 [7], Storm [8], and Spark Streaming [1]. To enhance their ease of use and processing capability, some research institutions have developed RQS with abstract query languages, such as StreamingSQL [2] and Squall [3]. To complete user query requests within a specified latency, DDSMS need to guarantee high throughput and low latency of stream processing. This often requires users to set the relevant configurations in advance, such as the operator parallelism and the memory usage of processes. However, given the data stream properties (e.g., arrival rate and value distribution) and the differences between query tasks, setting these configurations so as to ensure both real-time processing and low resource usage is a very challenging problem.

C. Wang, X. Meng, Z. Weng, and C. Yang are with the School of Information, Renmin University of China, Beijing, China. chunkai_wang@163.com, xfmeng@ruc.edu.cn, hankwing@hotmail.com, yangchenwo@126.com. Q. Guo is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. guoqi@ict.ac.cn. Manuscript received 25 Feb. 2017; revised 30 Aug. 2017; accepted 3 Sept. 2017. Date of publication 15 Sept. 2017. (Corresponding author: Xiaofeng Meng.) Recommended for acceptance by Y. Zhang.

Motivating Example.
Consider an application that analyzes traffic GPS data collected from taxis and buses in real time. We take the Traffic-Monitoring (TM) system as a query example, and Storm as an SPS example. Traffic monitoring needs to process and analyze the traffic GPS data collected from taxis and buses in real time; the scenario is a typical streaming application. We use the GeoLife GPS Trajectories [25] for this workload. The system contains a map matching processing logic, which accepts trajectory data from GPS loggers or GPS phones, including altitude, latitude, and longitude, and determines the location of the object in terms of a road ID. The speed calculation processing logic accepts the road ID from the map matching logic and calculates the average speed for that road ID. Based on the execution tasks of the TM system, the workload in Storm can be expressed as a directed acyclic graph (called a topology), where nodes are data sources (called spouts) or processing logic (called bolts) which run in different processes (called workers), and edges indicate the flow of data between nodes. For real-time computation, the topology is created first. Then, before deploying the topology to the Storm cluster, the user must specify various topology parameters (e.g., the number of workers, spout parallelism, bolt parallelism, etc.) according to either experience or machine configurations. Finally, the Storm cluster receives the continuous data stream and processes it in a real-time fashion. However, the topology parameters are not aware of the dynamic data stream. For example, as shown in Fig. 1, due to the fixed configurations, some processing latencies exceed the user's threshold (e.g., 1 second) as the data stream rate changes. Meanwhile, it also leads to higher cluster resource usage. As shown in Fig. 2, the average

usage of CPU is about 40 percent, and the average usage of memory (the user specifies the maximum memory usage of each worker) is more than 90 percent.

Fig. 1. Latency.

As DDSMS are usually deployed in cloud environments, the problem of fixed configurations not only leads to high resource usage and reduces the user's return on investment (ROI), but can even cause system bottlenecks. Our earlier works show that many system resources are wasted when the topology configuration is fixed [33], [34]. Taking Amazon EC2 as an example, this resource waste results in an additional $300 per month per machine. Moreover, a fixed topology configuration might also constrain the processing throughput if an overwhelming data flow exceeds the system capability. However, Paper [33] needs to restart a new topology after selecting the optimal configuration, which incurs about 1 minute of adjustment delay and does not apply to frequently fluctuating data streams. Paper [34] uses offline methods to train the initial machine learning models with randomly generated samples; it does not use incremental learning methods and cannot update the training data-set in real time. Meanwhile, Paper [35] has proposed a resource-aware dynamic load scheduler. Compared to that scheduler, which increases the overall throughput by maximizing the resource usage, our work focuses on minimizing the resource usage while still satisfying the real-time constraint. So, in this paper, we present OrientStream, a resource-efficient system that implements dynamic characterization deployment for Storm configurations according to the current data flow rate. This is achieved by a two-phase mechanism. First, multiple resource usage models are trained incrementally with accumulated historical data features. Then, whenever necessary, these models are used to derive the most resource-efficient configuration to deploy.
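As a minimal illustration of the two-phase mechanism, the sketch below replaces the real incremental learners (MOA/Weka regressors in the actual system) with a hypothetical per-configuration averaging model; `IncrementalModel` and `pick_config` are illustrative names, not OrientStream's API:

```python
class IncrementalModel:
    """Stand-in for an incremental learner: it averages the observed
    resource usage per configuration. Purely a placeholder to show the
    two phases; the real system trains MOA/Weka models on features."""

    def __init__(self):
        self.sums, self.counts = {}, {}

    def learn(self, config, usage):
        # phase 1: update the model incrementally with one new sample
        self.sums[config] = self.sums.get(config, 0.0) + usage
        self.counts[config] = self.counts.get(config, 0) + 1

    def predict(self, config):
        # unseen configurations default to 0.0 (optimistic)
        if config not in self.counts:
            return 0.0
        return self.sums[config] / self.counts[config]


def pick_config(model, candidates):
    # phase 2: derive the most resource-efficient candidate configuration
    return min(candidates, key=model.predict)
```

The design point this sketch makes is that training (phase 1) never stops, while configuration selection (phase 2) is invoked only when the monitor decides a change is needed.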
With the advance of query processing, the amount of training data increases dynamically. To use regression models for predicting resource usage in real time, without repeatedly rebuilding models over all training data, and to save the time and resources needed for model building, we choose incremental learning techniques based on the MOA [4] and Weka [5] toolkits. We also introduce Tachyon [36] as an in-memory file server to store state information, and use the Operator States Migration (OSM) system [37] to achieve live adjustment across nodes and to adjust the operator parallelism dynamically.

Fig. 2. Resource usage.

The contributions of our work are summarized as follows:

1) We construct initial training sets using several workloads to cover the different runtime characteristics of DDSMS, and project queries onto four levels (data-level, query plan-level, operator-level, and cluster-level) for collecting the training data to predict the resource usage. In addition, we evaluate the worth of an attribute by repeatedly sampling instances and analyzing the features' influence.

2) As the collection of training samples continues to grow, the incremental learning models steadily improve their prediction accuracy. Moreover, compared to batch learning models, incremental learning models further reduce the runtime of training and prediction, and save CPU and memory usage.

3) To enhance the prediction accuracy, we give the algorithm Ensemble Different Kinds of Regressions (EDKRegression) to integrate different regression models (see Section 5.3 for details) based on the relative independence of the base models.

4) We introduce a mechanism of outlier execution detection to predict the rationality of configurations and monitor anomalous executions. Through normalization and nearest-neighbor methods, we find abnormal data to further improve the prediction accuracy of the regression models.
5) We use the Operator States Migration to achieve live adjustment across different nodes. It enables workloads to perform migrations automatically, without losing any data or incurring unnecessary restarting cost. This helps us to achieve adaptive resource management.

The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 presents the related concepts and the problem definition. Section 4 gives the architecture of OrientStream, which includes model construction, resource prediction, and dynamic tuning. Section 5 describes the critical steps of OrientStream, including feature extraction, model building, the abnormal detection mechanism, and live adjustment. Section 6 gives the results of our experimental evaluation. Finally, Section 7 concludes this paper.

WANG ET AL.: AUTOMATING CHARACTERIZATION DEPLOYMENT IN DISTRIBUTED DATA STREAM MANAGEMENT SYSTEMS

Fig. 3. Storm architecture.

2 RELATED WORK

To meet users' different query requests, a DDSMS needs to build a complete operator library in the RQS and dynamically set the parallelism of processing nodes in the SPS based on the different query requests. There are roughly three lines of related work, as below.

Dynamic Load Scheduling. Recently, several algorithms have been proposed for dynamic load scheduling strategies on Storm [9], [10], [11]. For example, Aeolus [9] is an optimizer that sets the parallelism degree and the batch size of intra-node dataflow programs. It defines a cost model based on four measurements of tuple processing time: shipping time, queuing time, pure processing time, and actual processing time. Using this model, Aeolus can derive the best configuration for the operator parallelism degree and the batch size. DRS [10] is a dynamic resource scheduler for cloud-based DDSMS. It designs an accurate performance model based on the theory of Jackson open queueing networks [12] that yields the optimal operator parallelism degree. Aniello et al. [11] proposed two scheduling algorithms, a topology-based static scheduling strategy and a traffic-based dynamic scheduling strategy, for reducing the tuple processing latency and the inter-node traffic between multiple topologies. However, Aeolus and DRS need to be aware of the specific processing time of each operator, and can only be used in fixed query application scenarios. Paper [11] focuses only on the transmission delay, not on the resource usage, and does not consider dynamically adjusting the operator parallelism within one topology according to changes in the data rate and query requests.

Machine Learning Techniques. Khoshkbarforoushha et al.
[13] proposed a model based on the Mixture Density Network [14] for resource consumption estimation of data stream processing workloads. This model can help users decide whether or not to register a new query. The ALOJA project [15] presented an open, vendor-neutral repository of Hadoop executions. It can predict query time and detect anomalies based on ALOJA-ML [16], a framework that uses machine learning techniques to interpret Hadoop benchmark performance data and to guide performance tuning. Jamshidi et al. [17] designed Bayesian Optimization for Configuration Optimization (BO4CO), an auto-tuning algorithm for finding optimal configurations of SPS. However, the approach of [13] cannot dynamically change the scheduling strategy or the operator parallelism, and cannot predict resource usage. ALOJA-ML can only be applied to Hadoop, not to data stream scenarios. BO4CO can only analyze the collected historical training data of SPS, and cannot perform incremental analysis of new data.

Resource Estimation for RQS. As we know, RQS often have a SQL-like query interface. Some research has been devoted to monitoring the resource consumption of SQL queries. Li et al. [18] set up two kinds of feature extraction mechanisms, coarse-grained global features and fine-grained operator-specific features, for different queries over Microsoft SQL Server. Akdere et al. [19] built three models for predicting query performance based on different query plans: a query plan-level model, an operator-level model, and a hybrid model for nested queries. However, these two approaches only consider a static selection process, cannot monitor dynamically, and do not consider the system characteristics beneath the RQS.
OrientStream is orthogonal to the above works, since we focus on constructing training sets from the data stream properties and the parameter configurations of different levels, predicting resource usage with online machine learning techniques, and dynamically tuning the related parameter configurations of DDSMS.

3 PRELIMINARIES

This section presents the related concepts of Storm and OrientStream, and gives the problem definition and description.

3.1 Storm Synopsis

In this paper, Storm is used as the experimental SPS platform. It is a distributed processing framework that can process data streams in real time. As shown in Fig. 3, Storm consists of a nimbus node and a number of supervisor nodes. The nimbus node is responsible for allocating resources, scheduling requests, and monitoring each supervisor node; each supervisor node is responsible for accepting requests assigned by the nimbus and managing the processes (called workers) under the node. Each worker contains multiple threads (called executors), and each executor can run one or more component instances (called tasks). The entire Storm cluster uses Zookeeper [28] to provide coordinated management of the distributed application.

3.2 Concept Description

OrientStream needs to monitor the operation of the DDSMS in real time, and collects training samples based on different data stream rates and parameter configurations. Formally, we define the related concepts as follows:

Definition 1. A window model divides the infinite data stream into a number of finite sub-streams. Each query processing step is aimed only at the sub-stream in the current window.

Generally, we can set the window size based on a time interval or a number of tuples. In OrientStream, we use the time-interval-based window model to get the training set. Each window with a different interval is used as a training sample. In this paper, we set the time interval to a random number from 30 to

120 seconds. First, we start a query request from a given configuration. When the window size reaches the interval bound, we collect the training sample. Then, we restart the query request with a new configuration and window size. Finally, we can collect a comprehensive training set through different windows.

Fig. 4. OrientStream architecture.

Definition 2. The processing latency of a DDSMS is defined as the sum of the processing delays of each tuple processed by the operators.

Each operator has a different parallelism, which is often implemented in a multi-threaded manner. The latency of data source i is then represented as the mean latency over its threads, formally defined as follows:

$$\mathrm{Latency}(source_i) = \sum_{j=1}^{m} \mathrm{Latency}(source_i)_j \, / \, m, \quad (1)$$

where m is the parallelism of the data source node. The processing latency of the DDSMS is the maximum latency over all data sources; it can be defined as below:

$$\mathrm{Latency} = \max_{i=1}^{n} \mathrm{Latency}(source_i), \quad (2)$$

where n is the number of data sources in a query request.

3.3 Problem Definition

Query requests always need different parameter configurations to be set by the user's experience. Whether the parameter settings are reasonable directly affects the throughput and the latency of the DDSMS, as well as the resource usage. As the data stream rate changes, we should dynamically adjust the parameter configurations to meet the user's latency and throughput thresholds. But SPS are often not allowed to adjust parameters arbitrarily. For example, the re-balance mechanism of Storm can only decrease the operator parallelism, not increase it. We need to predict the resource usage, the latency, and the throughput for any possible configuration. Based on the predicted results, the optimal configuration is defined as the one using minimal CPU and memory resources while the latency and throughput requirements are still guaranteed under the thresholds set by users.
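The latency definitions in Eqs. (1) and (2) translate directly into code; a minimal sketch, assuming per-source thread latencies are available as a nested list (this data layout is an illustration, not the system's actual representation):

```python
def source_latency(thread_latencies):
    # Eq. (1): a source's latency is the mean over its m threads
    return sum(thread_latencies) / len(thread_latencies)


def system_latency(per_source_thread_latencies):
    # Eq. (2): the DDSMS latency is the maximum over all n data sources
    return max(source_latency(t) for t in per_source_thread_latencies)
```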
Thus, the problem can be modeled as a resource usage optimization problem, as we discuss next. Let $N = (n_1, n_2, n_3, \ldots)$ be the set of all nodes in a cluster. The combined CPU usage $U_{CPU}$ and memory usage $U_{Memory}$ of each node $n_i$ can be defined as follows:

$$U(n_i) = \alpha \cdot U_{CPU}(n_i) + \beta \cdot U_{Memory}(n_i), \quad (\alpha + \beta = 1), \quad (3)$$

where $\alpha$ and $\beta$ are the weights of the CPU and memory usage respectively, both set to 50 percent by default. Then, let $C = (c_1, c_2, c_3, \ldots)$ be the candidate configurations supplied by users. For the entire cluster, we need to predict the optimal configuration $c_{opt}$ that achieves the following optimization goal:

$$\text{Minimize} \sum_{n_i \in N} U(n_i) \quad \text{s.t.} \quad R(latency) < T(latency) \ \text{and} \ R(throughput) > T(throughput), \quad (4)$$

where $R(latency)$ and $R(throughput)$ are the processing latency and the throughput of query requests, and $T(latency)$ and $T(throughput)$ are the thresholds set by users.

The constructed models fall into two categories, i.e., regression and classification models. The regression models are constructed for predicting the resource usage based on its numeric labels. To realize regression models with higher accuracy, we select four independent models that support incremental learning and have the highest regression accuracy (see Section 5.2). For predicting the processing latency and throughput, we build binary classification models that indicate whether the latency and throughput can meet the system specification. In this paper, we use the J48 algorithm in Weka for classification, which is an implementation of the C4.5 [38] algorithm. J48 is based on the decision tree, and the attribute with the maximum gain ratio is selected as the splitting attribute. The gain ratio of attribute A is defined as

$$GainRate(A) = \frac{Gain(A)}{SplitInfo_A(D)},$$

where $Gain(A)$ is the information gain of attribute A, and $SplitInfo_A(D)$ is the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A.
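The selection of $c_{opt}$ in Eqs. (3) and (4) amounts to a constrained minimization over the candidate set. A minimal sketch, assuming each candidate carries its predicted latency, throughput, and per-node (CPU, memory) usage; the dictionary layout is hypothetical:

```python
def node_usage(cpu, mem, alpha=0.5, beta=0.5):
    # Eq. (3): weighted CPU/memory usage of one node (alpha + beta = 1)
    return alpha * cpu + beta * mem


def optimal_config(candidates, t_latency, t_throughput):
    # Eq. (4): among candidates satisfying R(latency) < T(latency) and
    # R(throughput) > T(throughput), minimize the summed usage over nodes
    feasible = [c for c in candidates
                if c["latency"] < t_latency and c["throughput"] > t_throughput]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: sum(node_usage(cpu, mem) for cpu, mem in c["nodes"]))
```

Note that an infeasible candidate set yields no answer here; the real system would instead keep the current configuration or alert the user.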
4 ORIENTSTREAM ARCHITECTURE

This section gives an overview of OrientStream and the working principles of each step.

OrientStream Overview. Fig. 4 illustrates the overall architecture of OrientStream, which extends the original Storm and Kafka [20] cluster. It is mainly divided into two parts: the left part is the hierarchical feature extraction, and the right part is the query monitor that is responsible for

collecting the feature data and predicting the resource usage with incremental learning models.

In the left part, according to the characteristics of the DDSMS and the clusters, we divide feature extraction into four levels: data-level features of the data source, query plan-level features of the RQS, operator-level features of the SPS, and cluster-level features of the hardware platform (see Section 5.1 for details). Each data source collects and transmits its data stream using the Kafka message queue. Kafka is a distributed, high-throughput platform for handling real-time data feeds. In the right part, we analyze the feature data collected in real time, build the incremental learning models, and predict the resource usage of the DDSMS using model-based methods.

Model Construction. During incremental learning, we collect the feature data dynamically to generate the OrientStream data-set, use the updatable data-set to evaluate the prediction accuracy of each model, and use the prediction results to train the models incrementally. According to the characteristics of the data-set, we choose different incremental learning algorithms to build models based on the MOA and Weka toolkits.

Resource Prediction. As shown in Fig. 4, the Query Monitor is responsible for building the incremental learning models and generating the Resource Predictor from the collected feature data. Based on the data stream rate, the distribution, and the query requests submitted by users, we use the Resource Predictor to evaluate candidate configuration settings of the topology parameters, which include the spout parallelism, the bolt parallelism, etc. Then, we can predict the optimal configuration c_opt, and flag abnormal configurations, defined as those using too much resource or exceeding the latency or throughput thresholds.
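The abnormal-configuration test can be sketched as a simple predicate over the predicted metrics; field names and the strictness of the comparisons are assumptions for illustration:

```python
def is_abnormal_config(pred, max_usage, t_latency, t_throughput):
    # flag a configuration whose predicted resource usage is too high, or
    # whose predicted latency/throughput violates the user thresholds
    return (pred["usage"] > max_usage
            or pred["latency"] >= t_latency
            or pred["throughput"] <= t_throughput)
```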
In these cases, we can learn incrementally from traces of DDSMS executions, improve the robustness of the models via different workloads with different parallelism and query requests, and achieve higher prediction accuracy with learning techniques than with rule-of-thumb techniques.

Dynamic Tuning. During model deployment, the constructed models are used to determine the optimal configuration for the current data rate, so as to minimize the resource usage while still preserving the real-time property. To achieve this, two basic problems need to be addressed: (1) when to change the configuration, and (2) how to deploy the optimal configuration online.

TABLE 1
Data-Level Features

Name           Description
DATA_RATE      data rate of input RQS
DATA_PARA      data parallelism of input RQS
CBROKER        # Kafka brokers
CPARTITION     # topic partitions in Kafka
OPERATOR_RATE  avg rate of different operators

When to Change the Configuration. We make the decision based on a simple threshold solution over the streaming behaviors (i.e., the data producing and consuming rates) collected from the Storm topology and the Kafka producer. More specifically, the Kafka producer and the Storm cluster periodically send the data producing rate and the data consuming rate of the running topology, respectively, to the Query Monitor (see Fig. 4). Such data rates are then used in two conditions: (1) if the current data consuming rate significantly differs from the previous data consuming rate, the current configuration may be unsuitable; (2) if the data producing rate is higher than the consuming rate to a certain extent, data has to be stored in Kafka waiting to be processed, reducing the processing throughput. In either case, the topology configuration needs to be adjusted for resource efficiency, since the properties of the data stream have changed. Moreover, users can explicitly specify the parameters of these two conditions.
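The two conditions above can be sketched as a small predicate; the default relative-change and backlog thresholds below are hypothetical stand-ins for the user-specified parameters:

```python
def should_reconfigure(prev_consume, cur_consume, produce_rate,
                       rel_change=0.2, backlog_ratio=1.1):
    # condition (1): consuming rate differs significantly from the
    # previous checkpoint, so the current configuration may be unsuitable
    if prev_consume > 0 and abs(cur_consume - prev_consume) / prev_consume > rel_change:
        return True
    # condition (2): producing rate outpaces consuming rate, so tuples
    # back up in Kafka and throughput drops
    return produce_rate > backlog_ratio * cur_consume
```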
For example, for the first condition, users can specify not only the difference between the current and previous consuming rates, but also the distance between the two checkpoints.

Deploying the Configuration Online. Due to the limitations of the re-balance mechanism of Storm, we proposed in [33] and [34] to deactivate the old topology and then switch to a new topology with the optimal configuration. However, the process of switching to a new topology incurs about 1 minute of delay. So, the application scenarios need to meet two requirements: (1) the monitoring cycle of the data stream is long; (2) data fluctuations are infrequent relative to the switching time. In this paper, we use the Operator States Migration to achieve live adjustment across different nodes. It enables workloads to perform migrations automatically, without losing any data or incurring unnecessary restarting cost. According to the predicted optimal configuration, we can increase or decrease the number of workers dynamically and adjust the parallelism of each operator arbitrarily.

5 MODELING RESOURCE PREDICTION

This section focuses on the implementation details, including feature extraction, model building and integration, the abnormal detection mechanism and its applications, and live adjustment.

5.1 Data-Set Features

In this section, we describe the features used as the input OrientStream data-set to the learning models for predicting the resource usage. For this purpose, we need to specify the granularity of the input features. The feature extraction strategy not only includes the data features and query features of the RQS, but also takes into account the environment configurations beneath the RQS, which include the configurations of the SPS and the features of the machine clusters. Based on a series of experiments and specific domain knowledge, we choose features from the following four aspects:

Data-Level Features.
The data-level features are listed in Table 1. We control the rate of the input data stream, DATA_RATE, by changing the spout parallelism. The number of Kafka brokers and the number of topic partitions are also taken into account. In addition, because a change of DATA_RATE can influence the rate of subsequent operators, we

need to calculate the rates of the relevant operators dynamically (e.g., the map matching bolt rate and the speed calculation bolt rate in the traffic monitoring workload).

Query Plan-Level Features. The query plan-level features are defined in Table 2. A continuous query generates a DAG-style topology from its query plan. The amount of data waiting to be processed and the window sliding step can be controlled by WINDOW_SIZE and SLIDING_STEP. Moreover, we count the number of different types of operators.

TABLE 2
Query Plan-Level Features

Name          Description
WINDOW_SIZE   # tuples in a window
SLIDING_STEP  step of sliding window
COPERATOR     # different operators in a query

Operator-Level Features. The operator-level features are listed in Table 3. We can monitor resource usage by adjusting the operator parallelism in different windows. If the same operator appears more than once, we take the average value as the criterion of feature extraction. Meanwhile, we include the number of Storm workers as a feature that can influence the processing efficiency of the operators.

TABLE 3
Operator-Level Features

Name           Description
OPERATOR_PARA  avg parallelism of operators
WORKER         # workers in Storm

Cluster-Level Features. The cluster features are summarized in Table 4. We define cluster-level features including the number of worker nodes, the number of CPU cores, the memory capacity, and the network transmission rate.

TABLE 4
Cluster-Level Features

Name       Description
WORK_NODE  # worker nodes
CPU_CORE   # CPU cores
MEMORY     memory capacity (GB)
NETWORK    network transmission rate (Gbps)

5.2 Algorithms

At this point, the OrientStream data-set is generated in real time. Through experimental comparison, we choose the four incremental learning models in the MOA and Weka toolkits with the highest prediction accuracy.

Bayes model: we use Naive Bayes (NB) from the MOA toolkit. The Bayes [21] algorithm updates internal counters with each new instance and uses the counters to assign a class, in a probabilistic fashion, to a new item in the stream.
It can be defined as follows:

$$p(C = c \mid X = x) = \frac{P(c) \prod_i p(X_i = x_i \mid C = c)}{\prod_i p(X_i = x_i)}, \quad (5)$$

where i ranges over the attributes of X.

Hoeffding tree model: we use the HoeffdingTree (HT), also from the MOA toolkit. A Hoeffding tree, or VFDT [22], is an incremental decision tree that is capable of learning from massive data streams. It uses information entropy or the Gini index as the criterion for selecting split attributes, and uses the Hoeffding inequality as the condition for deciding to split a node. It exploits the fact that a small sample can often be enough to choose an optimal splitting attribute.

Online bagging model: we use OzaBagging [23] from the MOA toolkit with the Hoeffding tree as the base learner. Oza and Russell propose the online bagging algorithm, which gives each instance a weight for re-sampling. Each base learner's training set contains K copies of each of the original training examples, where

$$P(K = k) = \binom{N}{k} \left(\frac{1}{N}\right)^k \left(1 - \frac{1}{N}\right)^{N-k}, \quad (6)$$

which is the binomial distribution. As $N \to \infty$, the distribution of K tends to a Poisson(1) distribution.

Nearest neighbors model: we use IBk [24] from the Weka toolkit, which achieves incremental learning through its updateable classifier method. IBk is an instance-based learning algorithm. It is identical to the k nearest neighbors algorithm except that it normalizes its attribute ranges and processes instances incrementally. The similarity function is as follows:

$$Similarity(x, y) = \sqrt{\sum_{i=1}^{n} f(x_i, y_i)}, \quad (7)$$

where the instances are described by n attributes. We define $f(x_i, y_i) = (x_i - y_i)^2$ for numeric-valued attributes and $f(x_i, y_i) = (x_i \neq y_i)$ for Boolean and symbolic-valued attributes.
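Eq. (7) can be sketched as follows (a minimal version that omits IBk's attribute-range normalization for brevity):

```python
import math


def similarity(x, y):
    # Eq. (7): distance over n attributes — squared difference for numeric
    # attributes, 0/1 mismatch for Boolean and symbolic attributes
    def f(a, b):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return (a - b) ** 2
        return 0.0 if a == b else 1.0

    return math.sqrt(sum(f(a, b) for a, b in zip(x, y)))
```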
5.3 EDKRegression Model

Ensemble learning is one of the most effective methods for improving the accuracy of regression algorithms. Its contribution is to improve regression accuracy by aggregating multiple base models. Ensemble learning methods mainly include Bagging [26] and Boosting [27]. Although their mathematical foundations are not exactly the same, both attempt to produce efficient committees by combining the outputs of multiple base models to improve the performance of regression algorithms. However, Bagging and Boosting only ensemble the same kind of regression model (e.g., the Hoeffding tree). As discussed in Section 5.2, our four models do not interact and their predictions are independent of each other. Therefore, we combine these four models and propose a regression model, EDKRegression (Ensemble Different Kinds of Regression). Without loss of generality, as shown in Fig. 5, EDKRegression combines k base regression models R_1, R_2, ..., R_k, and creates an improved composite regression model R by scoring the base models. For a newly arriving tuple t_n, EDKRegression obtains the prediction result as a vote of the k base regression models.
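Once each base model's relative absolute error (RAE) is known, the weighted vote itself reduces to a few lines; a minimal sketch of the objective in Eq. (9) below, where lower-RAE models receive larger weights and the weights sum to 1:

```python
def edk_predict(predictions, raes):
    # Eq. (9): base model i gets weight (1 - RAE_i) / (k - sum_j RAE_j),
    # so more accurate models (lower RAE) vote with more weight
    k = len(predictions)
    denom = k - sum(raes)
    return sum((1.0 - r) / denom * p for p, r in zip(predictions, raes))
```

Because the weights sum to 1, the composite prediction always lies within the range spanned by the base predictions.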

Fig. 5. EDKRegression prediction process.

The detailed description of EDKRegression is listed in Algorithm 1.

Algorithm 1. EDKRegression Algorithm
Require: the prediction values of the base learning models for sample n
Ensure: the prediction value of EDKRegression for sample n
1: compute AVG_N
2: for (i = 1; i < N; i++) do
3:   compute RAE_n using Equation (8)
4: end for
5: for (j = 1; j < k; j++) do
6:   compute F_n using Equation (9)
7: end for

In the prediction process, first, EDKRegression lets each base regression model make its prediction; then, according to the relative absolute error (RAE) of each base regression model, EDKRegression calculates the voting weights; finally, it determines the regression value of the sample with a weighted vote. The RAE and the objective function of EDKRegression are defined as follows. Let N be the set of training samples in each regression model and n the nth sample to be predicted. P_n is the prediction value of the regression model for sample n, T_n is the true value of sample n, and AVG_N is the mean over the N samples. Then

$$RAE_n = \sum_{i=1}^{N} |P_{in} - T_i| \; \Big/ \; \sum_{j=1}^{N} |T_j - AVG_N|, \quad (8)$$

where RAE_n is the relative absolute error of the nth sample under a given base regression model. Based on the k base regression models, the objective function of EDKRegression for predicting sample n is

$$F_n = \sum_{i=1}^{k} \frac{1 - RAE_{in}}{k - \sum_{j=1}^{k} RAE_{jn}} \cdot P_{in}, \quad (9)$$

where F_n is the prediction value of the nth sample under the EDKRegression model, and P_in is the prediction value of the nth sample from the ith base regression model.

5.4 Abnormal Detection Mechanism

Fig. 6. Anomaly detection mechanism flow.

Due to abnormal execution of the cluster (e.g., the cluster load is not balanced, or network transmission is not stable), or
the observation does not fit into the model, it results in some abnormal training data. Based on the difference between the predicted value and the observed value (e.g., resource usage), we divide the training data into three types: Normal. Detected data can be predicted correctly by learning models. Warning. Detected data is mis-predicted, and more than a half of nearest neighbors to the data are mispredicted too. Outlier. Detected data is mis-predicted, but more than a half of nearest neighbors to the data are predicted correctly. Algorithm 2. Abnormal Detection Algorithm Require: Prediction Error Threshold T e ; The number of tuples to nearest tag tuple k NN ; The regression file F contains predicted value and true value; Ensure: The type of detected tuple t; 1: if (jt:predictedvalue t:truevaluej <T e ) then 2: t is Normal; 3: else 4: for (i ¼ 1; i < Number of TestDataSet; i þþ) do 5: Computing the k NN tuples nearest to t; 6: end for 7: for (j ¼ 1; j<k NN ; j þþ) do 8: if (jt k :PredictedValue t k :TrueValuej >T e ) then 9: N warning +1; 10: else 11: N outlier +1; 12: end if 13: end for 14: if (N warning >N outlier ) then 15: t is Warning; 16: else 17: t is Outlier; 18: end if 19: end if Fig. 6 shows the anomaly detection flow. We can improve the prediction accuracy by discarding abnormal training data and not getting it into the incremental learning model. Besides using this method to test new observations, we can re-train our models by subtracting observations marked as outlier. Algorithm 2 gives the detailed description of the detection procedure. For each detected data, first, we select the

8 Fig. 7. Architecture of OSM. normal data based on the error threshold T e (Algorithm 2 lines 1-2). Then, for each abnormal test data, we compute the k NN tuples that are nearest to the test data. Lines 4-6 in Algorithm 2 depicts this process. Next, we count the number of warning or outlier of these k NN tuples by the definition of warning or outlier category (Algorithm 2 lines 7-13). Finally, we judge whether the test data is warning or outlier (Algorithm 2 lines 14-18). 5.5 Live Adjustment Because the re-balance mechanism of Storm can only guarantee that the number of executors is less than or equal to the number of tasks, when adjusting the maximum of the number of tasks, we need to re-build a new topology based on the predicted optimal configuration [33], [34]. It causes about 1 minute delay to re-build a topology. In this section, we introduce the system of Operator States Migration [37] to implement the live adjustment of different configurations without building a new topology. It can dynamically change the operator parallelism and achieve accurate queries without missing operator states. The architecture of OSM is shown in Fig. 7 [37]. Each worker in Storm maintains its own thread pool (from task 1 to task n ). Meanwhile, it creates an Operator States Manager (OSMer) to access the application data when all executors are killed. Hence, the pointers in the OSMer ensures that no application data is marked as garbage, and remain in memory during the migration. The OSMer classifies operator states into three types, marked as to stay, to move out and to move in, based on the adjustment strategy. To stay tasks do not require extra processing; to move out tasks are to be moved out to another node, and their states need to serialize into a file server (as shown in Fig. 7); to move in tasks are moved to the current node from another node, and their states need to download from file server to a separate retriever thread illustrated in Fig. 7. 
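The three state categories can be illustrated by comparing the task-to-node assignments before and after an adjustment. This is a minimal sketch from the perspective of one node; the data structures and names are ours, not the actual OSM API of [37]:

```python
# Illustrative classification of operator states during a migration,
# seen from one node. old_assign / new_assign map task id -> node name.

def classify_states(node, old_assign, new_assign):
    stay, move_out, move_in = [], [], []
    for task in set(old_assign) | set(new_assign):
        before = old_assign.get(task)
        after = new_assign.get(task)
        if before == node and after == node:
            stay.append(task)      # to stay: no extra processing
        elif before == node and after != node:
            move_out.append(task)  # to move out: serialize state to file server
        elif before != node and after == node:
            move_in.append(task)   # to move in: retriever thread downloads state
    return stay, move_out, move_in
```

Only the move-out and move-in sets generate traffic, which is why the scheduling technique of [39] (discussed next) focuses on minimizing their transmission time.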
Moreover, OSM uses the scheduling technique in [39] to reduce the network traffic cost during migration and minimize the total transmission time.

6 EVALUATION

This section presents the three workloads of our test scenarios and gives the results of predicting resource usage with the different learning models. Taking the prediction of CPU usage as an example, we verify the effect of the anomaly detection mechanism, compare how much it reduces mis-prediction during incremental learning, and measure the resources saved by the dynamic tuning mechanism. The testbed is a cluster of ten nodes connected by a 1 Gbit Ethernet switch. Five nodes transmit the data sources through Kafka, one node serves as the nimbus of Storm, and the remaining four nodes act as supervisor nodes. Each data source node and the nimbus node have an Intel E GHz four-core CPU and 4 GB of DDR3 RAM. Each supervisor node has two Intel E GHz twelve-core CPUs and 64 GB of DDR3 RAM. We run comprehensive evaluations of our prototype on Storm and Ubuntu. To cover different query characteristics, we choose the following three workloads.

Traffic Monitoring (TM). This workload was introduced in detail in Section 1.

Word Count (WC). Word count performs word-frequency statistics over sentences and is often used as a representative per-sentence stream processing workload. It is composed of one bolt that splits sentences into words and another that counts the occurrences of each word in a hash-map. We use the word count data-set from HiBench [6], with about 3 million sentences and more than 30 million words for testing.

TPC-H (Q3). TPC-H is a decision support benchmark whose queries and data are chosen for broad industry-wide relevance. To verify query processing over multiple data streams, we choose Q3 as the workload, since it reads three data sources: customer, orders, and lineitem.
We use 15 million tuples for each data source of this workload. The topology contains three filter bolts to filter the data sources, two join bolts to equi-join the three data sources, a group bolt to group the join results, and an order bolt to sort the grouped results. The sink bolt merges the output; for simplicity, we only use the sink to count the received results.

The topologies of these workloads are shown in Fig. 8. Our workloads cover two aspects: (1) various topology complexities, from a single-chain topology with one data source to a complex DAG topology with multiple data sources; and (2) different runtime characteristics: the TM workload is memory intensive, the WC workload is communication intensive, and the Q3 workload is compute intensive.

6.1 Online Resource Prediction

To improve prediction accuracy, we use the mean CPU usage, memory usage, latency, and throughput as the prediction targets for each training sample. For each workload, we collect 3,000 samples by setting the data rate and window sizes randomly. According to the T(latency) and T(throughput) thresholds set by users, we build two binary classification models. The classification accuracy of latency and throughput on the three workloads is shown in Fig. 9: it stays at about 90 to 96 percent as the samples increase from 500 to 3,000, which provides a solid basis for the next step of predicting resource usage. For the regression prediction of resource usage, we use the learning models described in Sections 5.2 and 5.3. Tables 5 and 6 show the mean absolute error (MAE) and RAE of each method on the different workloads. The experimental results show that: 1) The average RAE of memory usage prediction is lower than that of CPU usage prediction, because of

Fig. 8. Topology of three different workloads.

the fluctuation of memory usage being lower than that of CPU usage. 2) For CPU usage prediction on the different workloads, EDKRegression beats the best of the four base models, reducing MAE by and RAE by percent; for memory usage prediction, EDKRegression again beats the best of the four base models, reducing MAE by and RAE by percent.

6.2 Features Influence

We have proposed various features in Section 5.1. Their influence differs across workloads and prediction targets, so we use ReliefFAttributeEval [30], [31], [32] from the Weka toolkit, which evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and of a different class. We take the data-set features extracted from the TPC-H (Q3) workload as an example. As shown in Fig. 10a, when the regression models predict CPU usage, the higher-ranked features include the spout rate, the number of topic partitions and Kafka brokers, every bolt rate, and memory usage, while the lower-ranked features include the number of workers, spout parallelism, every bolt parallelism, window size, and sliding step. In contrast, as shown in Fig. 10b, when the regression models predict memory usage, the higher-ranked features include the number of workers, spout parallelism, every bolt parallelism, sliding step, and window size, while the lower-ranked features include the number of topic partitions and Kafka brokers, spout rate, join bolt rate, and CPU usage. The results of the feature influence study show that: (1) the speed of the data stream and the processing rate of the operators greatly influence the prediction of CPU usage;
(2) the number of workers and the operator parallelism greatly influence the prediction of memory usage; and (3) the extracted data-set features are sufficient to predict workloads with different runtime characteristics.

6.3 Abnormal Filter Prediction

In this section, we take the 3,000 CPU-usage samples of the Traffic Monitoring workload as an example, use EDKRegression as the prediction model, set the prediction error threshold T_e to one standard deviation, and set the number of nearest neighbors to 5. As shown in Fig. 11, there are 55 outliers and 8 warnings in the data set. By discarding the outliers and re-training the prediction model, we reduce the MAE from to and the RAE from to .

6.4 Incremental Learning Prediction

In continuous query analysis, the training set grows as the data streams are updated. Incremental learning models keep reducing the prediction error rate as the training samples increase. We take the prediction of CPU usage on the Traffic Monitoring workload as an example, and compare the five incremental models and one batch model from 500 samples up to 3,000 samples.

6.4.1 Effect of Incremental Learning Prediction

For the five learning models proposed in this paper, Fig. 12 shows that their RAE decreases from 500 samples to 3,000 samples. The optimization effect of incremental learning is therefore significant: increasing the samples reduces RAE observably.

Fig. 9. Prediction accuracy.

6.4.2 Comparison of Incremental Learning and Batch Learning

In this section, we compare incremental learning and batch learning in terms of relative absolute error, learning runtime, and resource usage.
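The operational difference between the two regimes can be made concrete with a toy test-then-train (prequential) loop. The model below is an illustrative running-mean stub of our own, not one of the paper's MOA models:

```python
# Instance-incremental (test-then-train) evaluation: every tuple is
# first predicted, then used to update the model, so no separate
# train/test split is needed and the model is always up to date.

class RunningMean:
    """Toy incremental model: predicts the mean of the targets seen so far."""
    def __init__(self):
        self.n, self.total = 0, 0.0
    def predict(self, x):
        return self.total / self.n if self.n else 0.0
    def learn(self, x, y):
        self.n += 1
        self.total += y

def prequential_mae(model, stream):
    """Mean absolute error accumulated over a test-then-train pass."""
    err = 0.0
    for x, y in stream:
        err += abs(model.predict(x) - y)  # test on the tuple first...
        model.learn(x, y)                 # ...then train on it
    return err / len(stream)
```

A batch learner, by contrast, would retrain from scratch on the full accumulated history at each evaluation point, which is why its modeling time grows with the sample count while the incremental update cost stays per-tuple.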

TABLE 5
Mean and Relative Absolute Error Per Method for CPU Usage Prediction
(MAE and RAE test values for NB, HT, OzaBag (HT), IBk, and EDKRegression on Traffic Monitoring, Word Count, and TPC-H (Q3))

TABLE 6
Mean and Relative Absolute Error Per Method for Memory Usage Prediction
(MAE and RAE test values for NB, HT, OzaBag (HT), IBk, and EDKRegression on Traffic Monitoring, Word Count, and TPC-H (Q3))

We choose the two models with the lowest relative absolute error: the incremental one is EDKRegression, and the batch one is random forest [29], run with a split of 75 percent training data and 25 percent testing data in the Weka toolkit to achieve its best regression results. As shown in Fig. 13a, as the volume of data increases, EDKRegression reaches a lower relative absolute error. Meanwhile, EDKRegression also saves time for modeling and prediction, and saves resource usage (see Figs. 13b, 13c, and 13d).

Fig. 10. Features influence.
Fig. 11. Anomaly filter.
Fig. 12. Incremental prediction of RAE.
Fig. 13. Incremental learning versus batch learning.

6.5 Dynamic Tuning

Because the stream rate varies randomly, we adjust the parameters of the SPS to ensure that OrientStream meets the user's thresholds. As shown in Figs. 14 and 15, when the three workloads run under OrientStream at different data rates, the latency stays below the threshold (e.g., 1 second) and the throughput stays above the threshold (e.g., 10,000 tuples/second). As shown in Figs. 16 and 17, compared with the fixed configuration, our proposed EDKRegression can save 8-15 percent of CPU usage and percent of memory usage over the whole running process of the different workloads. Among them, Traffic Monitoring saves 8 percent of the CPU usage and 48 percent of the memory usage due to its memory intensive characteristic; Word Count saves 12 percent of the CPU usage and 38 percent of the memory usage due to its communication intensive characteristic; and TPC-H (Q3) saves 15 percent of the CPU usage and 45 percent of the memory usage due to its compute intensive characteristic. At the same time, compared with the existing algorithms with the highest prediction accuracy, EDKRegression also saves more resources: for CPU usage, it saves 2-6 percent over OzaBag (HT); for memory usage, it saves 9-12 percent over IBk.

Fig. 14. Latency.
Fig. 15. Throughput.
Fig. 16. CPU usage.
Fig. 17. Memory usage.

6.6 Live Adjustment

When deploying configurations, we gain tremendous benefits from the live adjustment mechanism. As

shown in Fig. 18, the three workloads differ in how their configurations are adjusted. Because their stream rates and query requests differ, the tuning points of the three workloads change as well. The adjustment time using live adjustment is between 100 and 300 milliseconds, whereas, when the conditions of the re-balance mechanism are satisfied, the adjustment time using re-balance is between 5 and 7 seconds. If we need to increase the operator parallelism, re-building the topology takes about 1 minute.

Fig. 18. Configuration adjustment.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose OrientStream, which predicts the resource usage of DDSMS using incremental learning techniques, and we develop a learning tool that performs data acquisition automatically for different workloads. Our approach uses the MOA and Weka toolkits to build real-time prediction models over features extracted at four levels of granularity. In addition, we introduce the EDKRegression algorithm and the abnormal execution detection mechanism to improve prediction accuracy, and we verify the validity of our methods. Moreover, we adopt the Operator States Migration system to achieve live adjustment of operator parallelism across nodes, which significantly reduces the adjustment time and the processing latency. We evaluate OrientStream by running three workloads with different runtime characteristics. Our experimental results demonstrate that OrientStream can significantly reduce CPU usage by 8-15 percent and memory usage by percent, and can save further resources with incrementally updated models. Hence, it is promising to deploy OrientStream in production to significantly increase the user's return on investment. As future work, there are two directions to further improve our work.
(1) We plan to collect more training data to improve prediction accuracy through incremental learning, and to further optimize the regression model so that it achieves higher accuracy with fewer samples. (2) In this paper, the optimal configuration is predicted from manually supplied candidate configurations, so we plan to design a real-time recommendation mechanism that helps users obtain the optimal configuration.

ACKNOWLEDGMENTS

This research was partially supported by grants from the Natural Science Foundation of China (No. , , , , ); the National Key Research and Development Program of China (No. 2016YFB , 2016YFB ); the Research Funds of Renmin University (No. 11XNL010); and the Science and Technology Opening up Cooperation project of Henan Province (No. ).

REFERENCES

[1] (2017). [Online]. Available:
[2] (2017). [Online]. Available: spark-streamingsql/
[3] (2017). [Online]. Available:
[4] (2017). [Online]. Available:
[5] (2017). [Online]. Available: weka/
[6] (2017). [Online]. Available: HiBench/
[7] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proc. IEEE Int. Conf. Data Mining Workshops, 2010.
[8] A. Toshniwal et al., "Storm@twitter," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2014.
[9] M. J. Sax, M. Castellanos, Q. Chen, and M. Hsu, "Aeolus: An optimizer for distributed intra-node-parallel streaming systems," in Proc. IEEE Int. Conf. Data Eng., 2013.
[10] T. Z. J. Fu, J. Ding, R. T. B. Ma, M. Winslett, Y. Yang, and Z. Zhang, "DRS: Dynamic resource scheduling for real-time analytics over fast streams," Comput. Sci., vol. 690, no. 1.
[11] L. Aniello, R. Baldoni, and L. Querzoni, "Adaptive online scheduling in Storm," in Proc. ACM Int. Conf. Distrib. Event-Based Syst., 2013.
[12] G. R. Bitran and R. Morabito, "State-of-the-art survey: Open queueing networks: Optimization and performance evaluation models for discrete manufacturing systems," Prod. Operations Manage., vol. 5, no. 2.
[13] A. Khoshkbarforoushha, R. Ranjan, R. Gaire, P. P. Jayaraman, J. Hosking, and E. Abbasnejad, "Resource usage estimation of data stream processing workloads in datacenter clouds," CoRR.
[14] C. M. Bishop, "Mixture density networks," Neural Netw., IEEE-INNS-ENNS Int. Joint Conf., IEEE Comput. Soc.
[15] N. Poggi, D. Carrera, A. Call, and S. Mendoza, "ALOJA: A systematic study of Hadoop deployment variables to enable automated characterization of cost-effectiveness," in Proc. IEEE Int. Conf. Big Data, 2015.
[16] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green, "ALOJA-ML: A framework for automating characterization and knowledge discovery in Hadoop deployments," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2015.
[17] P. Jamshidi and G. Casale, "An uncertainty-aware approach to optimal configuration of stream processing systems," in Proc. IEEE Int. Symp. Model. Anal. Simul. Comput. Telecommun. Syst., 2016.

[18] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri, "Robust estimation of resource consumption for SQL queries using statistical techniques," Proc. VLDB Endowment, vol. 5, no. 11.
[19] M. Akdere, U. Cetintemel, M. Riondato, E. Upfal, and S. B. Zdonik, "Learning-based query performance modeling and prediction," in Proc. IEEE 28th Int. Conf. Data Eng., 2012.
[20] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in Proc. Workshop Netw. Meets Databases, 2011.
[21] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proc. 11th Conf. Uncertainty Artif. Intell., 2013.
[22] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000.
[23] N. C. Oza and S. Russell, "Experimental comparisons of online and batch versions of bagging and boosting," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001.
[24] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Mach. Learn., vol. 6, no. 1.
[25] Y. Zheng, L. Zhang, X. Xie, and W. Y. Ma, "Mining interesting locations and travel sequences from GPS trajectories," in Proc. Int. Conf. World Wide Web, 2009.
[26] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2.
[27] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 7.
[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for internet-scale systems," in Proc. USENIX Annu. Tech. Conf., 2010.
[29] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5-32.
[30] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proc. 9th Int. Workshop Mach. Learn., 1992.
[31] I. Kononenko, "Estimating attributes: Analysis and extensions of RELIEF," in Proc. Eur. Conf. Mach. Learn., 1994.
[32] M. Robnik-Sikonja and I. Kononenko, "An adaptation of Relief for attribute estimation in regression," in Proc. 14th Int. Conf. Mach. Learn., 1997.
[33] C. Wang, X. Meng, Q. Guo, Z. Weng, and C. Yang, "OrientStream: A framework for dynamic resource allocation in distributed data stream management systems," in Proc. 25th ACM Int. Conf. Inf. Knowl. Manage., 2016.
[34] Z. Weng, Q. Guo, C. Wang, X. Meng, and B. He, "AdaStorm: Resource efficient Storm with adaptive configuration," in Proc. 33rd IEEE Int. Conf. Data Eng., 2017.
[35] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, "R-Storm: Resource-aware scheduling in Storm," in Proc. 16th Annu. Middleware Conf., 2015.
[36] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, "Tachyon: Reliable, memory speed storage for cluster computing frameworks," in Proc. ACM Symp. Cloud Comput., 2014.
[37] J. Ding, T. Z. J. Fu, R. T. B. Ma, M. Winslett, Y. Yang, Z. Zhang, and H. Chao, "Optimal operator state migration for elastic data stream processing," CoRR.
[38] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann.
[39] W. Rödiger et al., "Locality-sensitive operators for parallel main-memory database clusters," in Proc. IEEE Int. Conf. Data Eng., 2014.

Chunkai Wang received the bachelor's and master's degrees from Zhengzhou University, both in computer science, in 2004 and 2007, respectively. He is currently working toward the PhD degree in the School of Information, Renmin University of China. His current research interests include data stream computing and distributed systems.
Xiaofeng Meng received the bachelor's degree from Hebei University, the master's degree from Renmin University of China, and the PhD degree from the Chinese Academy of Sciences, all in computer science, in 1987, 1993, and . He is currently a professor in the School of Information, Renmin University of China. His current research interests include cloud data management, web data management, flash-based databases, and privacy protection. He is a member of the IEEE.

Qi Guo received the BE degree in computer science from the Department of Computer Science and Technology, Tongji University, in 2007, and the PhD degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), in . He is currently an associate professor at ICT. His research interests include computer architecture and VLSI design and verification.

Zujian Weng received the bachelor's degree from the Nanjing University of Aeronautics and Astronautics, in . He is working toward the postgraduate degree in the School of Information, Renmin University of China. His research interests include data stream computing and energy-efficiency computing.

Chen Yang received the bachelor's and master's degrees from Zhengzhou University. He is working toward the PhD degree in the School of Information, Renmin University of China. He is also an assistant in the School of Software Engineering, Zhengzhou University of Light Industry, China. His research interests include big data systems and machine learning.


More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

7. Boosting and Bagging Bagging

7. Boosting and Bagging Bagging Group Prof. Daniel Cremers 7. Boosting and Bagging Bagging Bagging So far: Boosting as an ensemble learning method, i.e.: a combination of (weak) learners A different way to combine classifiers is known

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC 2018 Storage Developer Conference. Dell EMC. All Rights Reserved. 1 Data Center

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li Welcome to DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li Time: 6:00pm 8:50pm Wednesday Location: Fuller 320 Spring 2017 2 Team assignment Finalized. (Great!) Guest Speaker 2/22 A

More information

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors. About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Streaming & Apache Storm

Streaming & Apache Storm Streaming & Apache Storm Recommended Text: Storm Applied Sean T. Allen, Matthew Jankowski, Peter Pathirana Manning 2010 VMware Inc. All rights reserved Big Data! Volume! Velocity Data flowing into the

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

A Systematic Overview of Data Mining Algorithms

A Systematic Overview of Data Mining Algorithms A Systematic Overview of Data Mining Algorithms 1 Data Mining Algorithm A well-defined procedure that takes data as input and produces output as models or patterns well-defined: precisely encoded as a

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems

Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 215 1 Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems Abstract In cloud-based stream processing services, the maximum

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.9, September 2017 139 Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks MINA MAHDAVI

More information

An Optimization Algorithm of Selecting Initial Clustering Center in K means

An Optimization Algorithm of Selecting Initial Clustering Center in K means 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017) An Optimization Algorithm of Selecting Initial Clustering Center in K means Tianhan Gao1, a, Xue Kong2, b,* 1 School

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

REAL-TIME ANALYTICS WITH APACHE STORM

REAL-TIME ANALYTICS WITH APACHE STORM REAL-TIME ANALYTICS WITH APACHE STORM Mevlut Demir PhD Student IN TODAY S TALK 1- Problem Formulation 2- A Real-Time Framework and Its Components with an existing applications 3- Proposed Framework 4-

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Flying Faster with Heron

Flying Faster with Heron Flying Faster with Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END [! OVERVIEW TWITTER IS

More information

Analysis of offloading mechanisms in edge environment for an embedded application

Analysis of offloading mechanisms in edge environment for an embedded application Volume 118 No. 18 2018, 2329-2339 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Analysis of offloading mechanisms in edge environment for an embedded

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence 2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da

More information

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Comparative Analysis of Range Aggregate Queries In Big Data Environment Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.

More information

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation

An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation An Efficient RFID Data Cleaning Method Based on Wavelet Density Estimation Yaozong LIU 1*, Hong ZHANG 1, Fawang HAN 2, Jun TAN 3 1 School of Computer Science and Engineering Nanjing University of Science

More information

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix

Privacy-Preserving of Check-in Services in MSNS Based on a Bit Matrix BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 2 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0032 Privacy-Preserving of Check-in

More information

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Functional Comparison and Performance Evaluation Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Overview Streaming Core MISC Performance Benchmark Choose your weapon! 2 Continuous Streaming Micro-Batch

More information

An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles

An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles 2016 2 nd International Conference on Energy, Materials and Manufacturing Engineering (EMME 2016) ISBN: 978-1-60595-441-7 An Improved DFSA Anti-collision Algorithm Based on the RFID-based Internet of Vehicles

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d International Forum on Mechanical, Control and Automation IFMCA 2016 A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1,

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Contents. Part I Setting the Scene

Contents. Part I Setting the Scene Contents Part I Setting the Scene 1 Introduction... 3 1.1 About Mobility Data... 3 1.1.1 Global Positioning System (GPS)... 5 1.1.2 Format of GPS Data... 6 1.1.3 Examples of Trajectory Datasets... 8 1.2

More information

Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack

Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack Computer Based Image Algorithm For Wireless Sensor Networks To Prevent Hotspot Locating Attack J.Anbu selvan 1, P.Bharat 2, S.Mathiyalagan 3 J.Anand 4 1, 2, 3, 4 PG Scholar, BIT, Sathyamangalam ABSTRACT:

More information

Robot localization method based on visual features and their geometric relationship

Robot localization method based on visual features and their geometric relationship , pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Cisco Tetration Analytics

Cisco Tetration Analytics Cisco Tetration Analytics Enhanced security and operations with real time analytics Christopher Say (CCIE RS SP) Consulting System Engineer csaychoh@cisco.com Challenges in operating a hybrid data center

More information

Accelerating Analytical Workloads

Accelerating Analytical Workloads Accelerating Analytical Workloads Thomas Neumann Technische Universität München April 15, 2014 Scale Out in Big Data Analytics Big Data usually means data is distributed Scale out to process very large

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

STORM AND LOW-LATENCY PROCESSING.

STORM AND LOW-LATENCY PROCESSING. STORM AND LOW-LATENCY PROCESSING Low latency processing Similar to data stream processing, but with a twist Data is streaming into the system (from a database, or a netk stream, or an HDFS file, or ) We

More information

Detect tracking behavior among trajectory data

Detect tracking behavior among trajectory data Detect tracking behavior among trajectory data Jianqiu Xu, Jiangang Zhou Nanjing University of Aeronautics and Astronautics, China, jianqiu@nuaa.edu.cn, jiangangzhou@nuaa.edu.cn Abstract. Due to the continuing

More information

A Dynamic TDMA Protocol Utilizing Channel Sense

A Dynamic TDMA Protocol Utilizing Channel Sense International Conference on Electromechanical Control Technology and Transportation (ICECTT 2015) A Dynamic TDMA Protocol Utilizing Channel Sense ZHOU De-min 1, a, LIU Yun-jiang 2,b and LI Man 3,c 1 2

More information

A Network-aware Scheduler in Data-parallel Clusters for High Performance

A Network-aware Scheduler in Data-parallel Clusters for High Performance A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li, Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Data-parallel clusters

More information

Optimization on TEEN routing protocol in cognitive wireless sensor network

Optimization on TEEN routing protocol in cognitive wireless sensor network Ge et al. EURASIP Journal on Wireless Communications and Networking (2018) 2018:27 DOI 10.1186/s13638-018-1039-z RESEARCH Optimization on TEEN routing protocol in cognitive wireless sensor network Yanhong

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: KH 116 Fall 2017 First Grading for Reading Assignment Weka v 6 weeks v https://weka.waikato.ac.nz/dataminingwithweka/preview

More information

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS Topics AGENDA Challenges with Big Data Analytics How SAS can help you to minimize time to value with

More information

Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks

Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks Low Overhead Geometric On-demand Routing Protocol for Mobile Ad Hoc Networks Chang Su, Lili Zheng, Xiaohai Si, Fengjun Shang Institute of Computer Science & Technology Chongqing University of Posts and

More information

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD

Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical application of VOD World Transactions on Engineering and Technology Education Vol.13, No.3, 2015 2015 WIETE Intelligent management of on-line video learning resources supported by Web-mining technology based on the practical

More information

Bagging and Boosting Algorithms for Support Vector Machine Classifiers

Bagging and Boosting Algorithms for Support Vector Machine Classifiers Bagging and Boosting Algorithms for Support Vector Machine Classifiers Noritaka SHIGEI and Hiromi MIYAJIMA Dept. of Electrical and Electronics Engineering, Kagoshima University 1-21-40, Korimoto, Kagoshima

More information

Multiple Query Optimization for Density-Based Clustering Queries over Streaming Windows

Multiple Query Optimization for Density-Based Clustering Queries over Streaming Windows Worcester Polytechnic Institute DigitalCommons@WPI Computer Science Faculty Publications Department of Computer Science 4-1-2009 Multiple Query Optimization for Density-Based Clustering Queries over Streaming

More information