Labeling Instances in Evolving Data Streams with MapReduce


2013 IEEE International Congress on Big Data

Labeling Instances in Evolving Data Streams with MapReduce

Ahsanul Haque, Brandon Parker, and Latifur Khan
Department of Computer Science, University of Texas at Dallas

Abstract: Unlike traditional data mining, where data is static, mining algorithms for data streams must process the data on the fly and update the class decision boundaries as the stream progresses in order to address the challenges of concept drift and feature evolution. In our current work, we proposed a multi-tiered, ensemble-based, fast and robust method that rapidly learns the concepts in a data stream, predicts labels for new data with strong accuracy, and agilely tracks the dynamic changes in the evolving concepts and feature space. The bottleneck of that work is that it must build an ADABOOST ensemble for each numeric feature, which raises a scalability issue because the number of features in a data stream can be very large. In this paper we propose a method to parallelize the independent parts of that work using a MapReduce framework. This increases scalability and achieves a significant speedup without compromising classification accuracy. We demonstrate the performance of our approach in terms of speedup, scaleup, and classification accuracy.

I. INTRODUCTION

Stream data mining has inherent challenges that are not present in traditional data mining. It attempts to discover concepts and label data instances arriving in near real time in a dynamic, continuous stream of data. In a data stream, the data is not static (as it is, for example, in a data warehouse); mining algorithms must therefore process the data in real time and update the class decision boundaries as the stream progresses. In today's connected digital world, streaming data is becoming abundant, and the need to mine these streams is increasingly important. For example, monitoring network traffic for intrusion detection or data leakage, monitoring social media traffic for trending topics, and processing continual data feeds from distributed sensor networks all require fast and dynamic mining algorithms. Since data continuously enters the system, a stream is effectively a data set of infinite size. Any classification method used in a streaming context should be able to handle drifting concepts and overall changes in the data and label space [2], [4]. Feature evolution occurs when the features or attributes in the data stream shift in meaning, range, or context, or when new features are added mid-stream. Concept drift happens when the target class or concept evolves within the feature space such that the class encroaches on or crosses previously defined decision boundaries of the classifier.

The approach we present in this paper to face these challenges is based on our current work [1] on labeling instances in evolving data streams. It uses a pipeline processing paradigm to quickly apply predicted labels to data instances in the stream and to maintain the classifier regularly as the stream evolves. The core concept behind our method is the use of a multi-tiered ensemble, depicted in Figure 1. Our approach builds a hierarchy of ensemble classifiers by decomposing the classification problem; as a result, the hierarchy can be incrementally updated. Furthermore, each independent section of the ensemble tree learns and updates the weights and inclusion of relevant features.
At the bottom of the hierarchy there are two distinct types of base learners for different features. Non-numeric features induce Naïve Bayes base classifiers, while numeric features induce ADABOOST sub-ensembles of linear classifiers. These sub-ensembles are maintained and updated after the arrival of each chunk of data. The base approach [1] for our proposed method is described briefly in Section III.

In the original design and implementation of that work, the base learners for different features are trained serially. As the number of features is typically large in data stream mining, the number of ADABOOST sub-ensembles needed for numeric features can also be large, so considerable effort is required to train and maintain those sub-ensembles after the arrival of each data chunk. Although the base approach is a robust technique for addressing the challenges of data stream mining, it can suffer from scalability issues due to the serialized training of a large number of per-feature ensembles after each data chunk, and each data chunk can itself be very large. We observed that training the ADABOOST sub-ensemble for one numeric feature is independent of training the sub-ensemble for any other feature, so there is scope to parallelize the training and maintenance of these sub-ensembles. In this paper we present a method that applies parallel training using the MapReduce framework to form the ADABOOST ensembles for numeric features.

MapReduce [15] is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. A MapReduce computation involves two steps. In the Map step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes; each worker node processes its sub-problem and passes the answer back to the master. In the Reduce step, the master node collects the answers to all the sub-problems and combines them to form the output, which is the answer to the problem it was originally trying to solve.
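To make the two steps concrete, the canonical word-count job is sketched below in Java against the standard Hadoop mapreduce API. It is a generic illustration of the programming model only, not the classifier-training job developed in this paper.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map step: each worker tokenizes its input split and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts for one word arrive at the same Reducer and are summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}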

To implement the parallel training of ADABOOST sub-ensembles for numeric features, we use Apache Hadoop, an open-source implementation of MapReduce that currently enjoys wide popularity. Hadoop presents MapReduce as an analytics engine and, under the hood, uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). The framework transparently provides both reliability and data motion to applications. We empirically show that, while maintaining test accuracy, our MapReduce implementation achieves a significant speedup compared to the baseline of training the per-feature ADABOOST sub-ensembles on a single machine. We experimented with different types of datasets and different chunk sizes to demonstrate the scalability issue of the original method and how our proposed method overcomes it through parallel processing with MapReduce.

The primary contributions of our work are as follows:
1) We address the scalability issue of the base approach [1], especially when the number of features or the size of a data chunk is large. To solve this problem, we identify the independent components of the base approach to which parallelism can be applied.
2) We design a MapReduce-based parallel, distributed solution that performs the independent parts of the base approach [1] in parallel to achieve significant speedup and scalability.
3) We implement our proposed method using Apache Hadoop, a popular open-source implementation of MapReduce that supports running distributed applications on large clusters of commodity hardware, and we make several design choices to keep our solution highly optimized.
4) We test the performance of our implementation against several established benchmark data sets, presenting results with different data chunk sizes and different numbers of Map tasks to demonstrate the benefit of MapReduce-based parallelism on these performance metrics.

The rest of the paper is organized as follows: Section II describes earlier work related to our problem. In Section III we briefly describe our current work [1], which is the base approach for the method presented in this paper. In Section IV we discuss the problems of the base approach [1] and our solutions to them. Section V presents the experimental results and compares our proposed approach with the base approach [1]. Finally, Section VI concludes our discussion along with some future research directions.

II. RELATED WORK

With the recent explosion in the interconnectivity of data sources and the constant accumulation of data from the expanding use and proliferation of Internet-connected devices, it is no surprise that gleaning useful information from data streams has become a research focus. Since our approach is hierarchical, it has several similarities to various sub-areas of stream mining research; however, very few approaches adequately address as many of the stream mining issues as ours. The closest work to our approach is DXMiner [5], whose authors address all three issues but use simple ensemble voting, handle only numerical (or ordinal) features, and retain a homogenized conglomeration of all features in all ensemble model evaluations. DXMiner [5] uses an ensemble of hyperspheres to capture the decision boundary for classes as the stream is processed. It does not address the issue of a changing feature space, as it merely creates a growing union of features; our method, instead, tracks the dynamic feature set within the ensemble hierarchy.
Although both DXMiner [5] and our method address concept drift, our method uses a model-based approach, comparing the confidence of the existing ensemble learners to that of a provisional outlier ensemble. It does not rely on an E-M (e.g., K-Means) algorithm, which greatly reduces the number of necessary computations. As shown in Section V, our method achieves better accuracy than DXMiner [5] on all of the benchmark datasets used in this paper.

Katakis et al. [4] used Naïve Bayes to handle dynamic features incrementally for textual data streams, pruning features not well suited to the current classification problem, but used Boolean bag-of-words attributes only. Naïve Bayes implicitly treats each feature separately, which maps to our approach of creating per-feature classifiers and pruning those base classifiers that are too error-prone. This approach, however, addresses neither concept drift nor heterogeneous attributes. In our approach, we leverage the ADABOOST algorithm [6] to create more robust non-linear separators for our numeric per-feature classifiers. Likewise, the error approximation outlined in [6] for ADABOOST helps to minimize error propagation up through our hierarchy of ensembles. If the weak classifiers represent individual attributes in the feature space, the sum of a particular attribute-decider's weights corresponds to the decisive contribution of that feature in the data set. Freund et al. [7] and Schapire et al. [8] discuss the performance and accuracy contributions of ADABOOST in this context, but focus on static data, not streaming data.

Al-Khateeb et al. [14] proposed a cloud-based solution to handle a large number of classes effectively. This class-based ensemble approach stores the information for each class separately: from each data chunk, a model is trained for each class of data using a clustering algorithm. If the number of dimensions is high, clustering quality degrades due to the curse of dimensionality; moreover, the literature shows that an ADABOOST ensemble model works far better than clustering-based semi-supervised methods in terms of classification accuracy.

There is also existing research on parallel boosting with MapReduce. Two parallel boosting algorithms, ADABOOST.PL and LOGITBOOST.PL, were proposed by Palit and Reddy [9]; they allow multiple computing nodes to participate simultaneously in constructing a boosted ensemble classifier, achieving a significant speedup while remaining competitive with the corresponding serial versions in terms of generalization performance. Escudero et al. [10] proposed LAZYBOOST, which utilizes several feature selection and ranking methods. Another fast boosting algorithm in this category was proposed by Busa-Fekete and Kegl [11], which utilizes multi-armed bandits (MAB). Unlike these approaches, we do not try to parallelize the boosting process itself; our approach distributes the task of forming ADABOOST ensembles for different numeric features among different Mappers. Our proposed framework is therefore not similar to any of the above approaches. Our method is based on our current work [1], which is discussed in Section III. It collects the list of features selected to have an ADABOOST sub-ensemble, then uses the Hadoop MapReduce architecture to form, train, and update these sub-ensembles in parallel after the arrival of each data chunk. Thus our framework, applied on top of a base method that is itself very robust, achieves significant speedup and scalability without sacrificing accuracy.

III. BACKGROUND

Our framework is based on the multi-tiered ensemble approach for labeling instances in evolving data streams [1]. In this section we review a few details of this approach. It uses a pipeline processing paradigm to quickly apply predicted labels to data instances in the data stream and to regularly maintain the classifier as the stream evolves.

Fig. 1: Multi-level ensemble architecture

Fig. 2: Processing Pipeline

The core concept behind this method is the use of a multi-tiered ensemble, depicted in Figure 1. We first break down the classification problem into two-class problems by creating a top-tier ensemble of per-class classifiers, following a one-against-all paradigm. Each per-class classifier is itself another ensemble of classifiers, and each class-based sub-ensemble is further decomposed into feature-based classifiers. Feature-based classifiers take one of two forms: non-numeric attributes are learned using the Naïve Bayes algorithm, while each numeric feature induces an ADABOOST ensemble of linear threshold classifiers.

This design provides two important advantages over other methods. First, since each class-based classifier maintains its own set of chunk-based classifiers, which in turn maintain their own sets of feature-based classifiers, the features can be pruned such that only the necessary features are retained for each class. This ensures that as features drift in the stream, each class independently adapts to the features best suited to its target label. Second, maintaining separate per-feature classifiers allows our method to use both numeric and non-numeric features without extra conversion or normalization processes.

Figure 2 depicts the workflow of the approach. As the data stream enters the system, it is segmented into discrete data chunks for processing. For each chunk, the ensemble classifier (described in detail later in this section) predicts a label for each data instance in the chunk; for testing purposes, accuracy metrics are gathered. A subset of the data chunk is then used as training data to update the ensemble in order to adapt to changes in the data stream. Upon completion of this routine, the next data chunk enters the system, and the process repeats.

When training data enters the system, it is used to update the classifier ensemble by creating new class-based sub-ensembles for each label present in the training data. Once the class-based sub-ensembles are created, each larger ensemble evaluates the existing class-based classifiers for those labels and prunes the single-class chunk-based classifier with the worst accuracy. This keeps the number of chunk-based classifiers constant and ensures that the overall classifier stays up to date with the current concept trends.

To demonstrate the training and classification process of the hierarchical ensemble, consider the example training set shown in Table I. This example depicts types of sports balls with four attributes: diameter in millimeters, mass in grams, predominant color, and shape (features f1-f4, respectively).
Without loss of generality, let us assume a chunk size of four and a maximum chunk retention of one.

TABLE I: Example Training Data Instances

Instance   f1 (diameter, mm)   f2 (mass, g)   f3 (color)   f4 (shape)      Label
x1         -                   -              white        sphere          Baseball
x2         -                   -              white        icosahedron     Soccer ball
x3         -                   -              yellow       sphere          Tennis ball
x4         -                   -              white        sphere          Baseball

When the ensemble is trained on this data set, features f1 and f2 induce ADABOOSTed threshold classifiers, while f3 and f4 induce Naïve Bayes classifiers, as shown in Figure 3. For brevity, the labels baseball, soccer ball, and tennis ball are abbreviated bb, sb, and tb, respectively, and only the baseball class is fully depicted. To predict the label of a test data instance, the ensemble finds the label with the maximum confidence according to the hierarchical votes of the underlying classifiers. The vote of each lower-level, feature-based classifier is multiplied by its respective weight; these weighted votes propagate up the hierarchy and are aggregated into a final vote tally. The label with the highest vote is the predicted class of the data instance.
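To make this vote aggregation concrete, the following minimal sketch shows how the weighted roll-up might look in Java. All names here (FeatureClassifier, vote, weight) are hypothetical illustrations; the paper does not prescribe this interface.

import java.util.List;
import java.util.Map;

class VoteAggregationSketch {
  // Hypothetical view of one feature-based classifier in a class's ensemble.
  interface FeatureClassifier {
    String featureId();            // which attribute this classifier reads
    double vote(Object value);     // confidence that the instance belongs to the class
    double weight();               // learned reliability of this classifier
  }

  // Weighted votes are summed per class; the label with the largest tally wins.
  static String predict(Map<String, List<FeatureClassifier>> classEnsembles,
                        Map<String, Object> instance) {
    String best = null;
    double bestTally = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, List<FeatureClassifier>> e : classEnsembles.entrySet()) {
      double tally = 0.0;
      for (FeatureClassifier c : e.getValue()) {
        // each lower-level vote is multiplied by its classifier's weight
        tally += c.weight() * c.vote(instance.get(c.featureId()));
      }
      if (tally > bestTally) { bestTally = tally; best = e.getKey(); }
    }
    return best;                   // predicted label: the highest hierarchical vote
  }
}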

Assuming we have trained the ensemble as presented in this section using the training data from Table I, suppose that we obtain the test data instance shown in Table II.

TABLE II: Example Test Instance

Instance   f1   f2   f3      f4       Label
x          -    -    white   sphere   ? (Baseball)

Figure 3 depicts a portion of the prediction process for this instance. Each feature-based classifier contributes a prediction: the continuous-valued features (f1, f2) contribute through ADABOOST ensembles of linear threshold classifiers, while the non-numeric features (f3, f4) contribute through Naïve Bayes classifiers. Each class-based ensemble (baseball, soccer ball, tennis ball) then takes the summation of the contributing feature-based classifiers. At the final tier of the ensemble, the class label with the highest vote becomes the predicted label; thus the test data instance in Table II is given the label baseball by the classifier ensemble.

Fig. 3: Example Data Instance Prediction by the Ensemble

Our base approach offers several key contributions to the area of stream data mining. First, it uses data attributes in their native format without any normalization preprocessing, treating numeric and non-numeric attributes equally. Second, it decomposes the classification problem into a tiered ensemble, giving it the ability to adapt continuously to dynamic feature and concept changes within the data stream. Finally, it shows better efficiency in terms of accuracy and execution time compared with other state-of-the-art methods that address the same problem.

IV. PROBLEMS AND SOLUTION

As discussed in Section III, our method needs a separate ADABOOST sub-ensemble for each numeric feature. ADABOOST [7] is an ensemble learning method that iteratively induces a strong classifier from a pool of weak hypotheses. During each iteration, it employs a simple learning algorithm (called the base classifier) to obtain a single learner for that iteration. The final ensemble classifier is a weighted linear combination of these base classifiers, H(x) = Σ_t α_t h^(t)(x), where each base classifier casts its weighted vote. The weights correspond to the correctness of the classifiers, i.e., a classifier with a lower error rate gets a higher weight. The base classifiers have to be at least slightly better than a random classifier, and hence they are also called weak classifiers.

In data streams, the number of features can at times be very large. For example, in a textual stream each distinct keyword is regarded as a feature, so in a large corpus the number of dimensions, i.e., the number of features, can be on the order of tens of thousands. To build an ADABOOST ensemble learner for one feature, the number of iterations needed equals the number of weak classifiers in the final ensemble, and each iteration must pass over all the data instances of the current chunk. The complexity of forming the ADABOOST sub-ensembles for all numeric features is therefore O(n * w * f), where n is the number of data instances in the chunk, w is the maximum number of weak classifiers in a sub-ensemble for a particular feature, and f is the number of numeric features that require a separate sub-ensemble. For every data chunk, we must rebuild the ADABOOST ensembles for this potentially huge number of features, which is a gigantic task in terms of time complexity. Moreover, since data continuously enters the system, the volume of data that must be processed per unit time can itself be very large.
For each data chunk, the system must therefore perform all of these calculations for forming the ADABOOST ensembles in a very short amount of time, so our base work [1] can face scalability issues when labeling test instances in real time, especially when the number of features or the size of the data chunk is large.

To solve this problem, we observed that the process of forming the ADABOOST ensemble classifier for one feature is totally independent of forming the ensemble for any other feature, so we can form and update the per-feature ADABOOST classifiers for each data chunk in parallel. To do this we use MapReduce, a distributed programming paradigm introduced by Dean and Ghemawat [12] that is capable of processing large data sets in a parallel, distributed manner across many nodes.

Fig. 4: Forming ADABOOST ensembles using MapReduce

MapReduce has two primary functions: the Map function and the Reduce function. The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes; in the Reduce step, the master node then collects the answers to all the sub-problems and combines them to form the output, the answer to the problem it was originally trying to solve. Provided each Map operation is independent of the others, all Map tasks can be performed in parallel.
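The independence is easy to see in a schematic of the base approach's serial training loop. In the sketch below, DataChunk, Ensemble, and trainAdaBoostForFeature are hypothetical stand-ins (stubbed only so the sketch is self-contained); the structural point is that no iteration of the per-feature loop reads or writes another iteration's state, which is exactly the property MapReduce exploits.

import java.util.HashMap;
import java.util.Map;

class SerialTraining {
  interface DataChunk {}
  interface Ensemble {}

  static Ensemble trainAdaBoostForFeature(DataChunk chunk, int f, int T) {
    throw new UnsupportedOperationException("stand-in for T rounds of boosting");
  }

  static Map<Integer, Ensemble> trainAll(DataChunk chunk, int[] numericFeatures, int T) {
    Map<Integer, Ensemble> subEnsembles = new HashMap<>();
    for (int f : numericFeatures) {                              // f independent sub-problems...
      subEnsembles.put(f, trainAdaBoostForFeature(chunk, f, T)); // ...of cost O(n * w) each
    }
    return subEnsembles;                                         // total serial cost: O(n * w * f)
  }
}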

To implement our approach we use Apache Hadoop, which consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS). A main strength of Apache Hadoop is that it does not require expensive hardware: it runs applications on large clusters of commodity machines while transparently providing both reliability and data motion to applications. In addition, its distributed file system (HDFS) stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework.

Our approach to applying MapReduce-based parallelism is depicted in Figure 4. The input to a Mapper is a feature index f_i for which it must form the ensemble, together with the number of weak classifiers T. Each Mapper can receive several feature indexes depending on the input split, so training for different features runs in parallel across Mappers. Each Mapper runs T iterations for feature f_i on the data chunk it receives, and each iteration forms a weak classifier for that feature. At iteration t, the Mapper emits the feature index f_i as key and a composite pair as value; the pair contains the weak classifier formed at that iteration, h_i^(t), and the weight for that weak classifier, α_i^(t).

The output of the Map tasks is fed into the Reducer procedure. All values for the same key are always reduced together, regardless of which Mapper produced them: the Reducer receives a key and the list of values associated with it. As discussed above, in our case the Map output key is the feature index and the value is the combination of a weak classifier and its weight, so the Reducer receives a feature index as key and all the weak classifiers associated with that feature, along with their weights, as the list of values. From this information the Reducer forms and emits the final ADABOOST ensemble for that feature as value, with the feature index as key.

In our initial approach, the system first collects the current data chunk and the list of features that need an ADABOOST ensemble classifier, and copies them to HDFS (the Hadoop distributed file system). The system then runs parallel Mappers, which form the weak classifiers of the per-feature ADABOOST sub-ensembles and emit them to the Reducer. The Reducer receives all these weak classifiers and forms the final ADABOOST ensemble for each feature. The output of the Reducer is then copied back to the system to update the current ensembles at the different levels of the hierarchy.

We later made some optimizations to this initial approach. First, after collecting the current data chunk, the system places it in Hadoop's DistributedCache, a facility provided by the MapReduce framework to cache files needed by applications. The framework copies the necessary files to each slave node before any tasks for the job are executed on that node; its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives that are un-archived on the slaves. The second optimization customizes the way Hadoop splits the input file. The number of Map tasks in a Hadoop job depends on the number of input file splits, and in our design the input file is simply the list of features for which we must build classifiers.
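As a concrete illustration, the job configuration might look like the following sketch (Hadoop 2.x mapreduce API). Using NLineInputFormat on the feature list is our assumption about how the input split is customized: it assigns a fixed number of feature-index lines to each Mapper, so lowering that number raises the number of Map tasks. The HDFS paths are illustrative, and the AdaBoostMapper and AdaBoostReducer classes are sketched after Algorithms 1 and 2 below.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AdaBoostDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "per-feature adaboost");
    job.setJarByClass(AdaBoostDriver.class);
    job.setMapperClass(AdaBoostMapper.class);     // per-feature boosting (Algorithm 1)
    job.setReducerClass(AdaBoostReducer.class);   // ensemble assembly (Algorithm 2)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // The data chunk is shipped once per job through the DistributedCache and
    // symlinked as "chunk.csv" in every task's working directory.
    job.addCacheFile(new URI("/user/stream/chunk.csv#chunk.csv"));

    // Input is the list of numeric feature indexes, one per line. Fewer lines
    // per split means more input splits, and therefore more parallel Mappers.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path("/user/stream/feature-list.txt"));
    NLineInputFormat.setNumLinesPerSplit(job, 100);

    FileOutputFormat.setOutputPath(job, new Path("/user/stream/ensembles"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}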
Increasing the number of input splits therefore increases the number of Mappers running for the job, and thus the level of parallelism.

Algorithm 1 Pseudocode for Mapper
1: procedure INITIALIZE(context)
2:   dataset ← loadFromDistributedCache()   ▷ load the training set of n samples from the distributed-cache file
3: end procedure
4: procedure MAP(f, T)
5:   w^1 ← (1/n, ..., 1/n)
6:   for t = 1 to T do
7:     h^(t) ← LearnWeakClassifier(w^t, f)
8:     ε ← Σ_{i=1}^{n} w_i^t · I{h^(t)(x_i) ≠ y_i}
9:     α_t ← (1/2) ln((1 − ε)/ε)
10:    for i = 1 to n do   ▷ for all the data instances loaded in INITIALIZE
11:      if h^(t)(x_i) ≠ y_i then
12:        w_i^{t+1} ← w_i^t / (2ε)
13:      else
14:        w_i^{t+1} ← w_i^t / (2(1 − ε))
15:      end if
16:    end for
17:    Emit(f, (α_t, h^(t)))
18:  end for
19: end procedure

Pseudocode for the Mapper is given in Algorithm 1. The INITIALIZE procedure is called before MAP begins; in it, the Mapper loads the data chunk from the DistributedCache into a corresponding data structure. To form the ADABOOST ensemble for a feature, the algorithm must iterate over the whole dataset multiple times, so loading the dataset into a data structure once makes execution more efficient. The MAP procedure starts with an initial uniform weight distribution over the data instances in the current chunk (line 5); T is the total number of boosting iterations. Note that for any iteration t, Σ_{i=1}^{n} w_i^t = 1. At each iteration, a weak learner function is applied to the weighted version of the data and returns an optimal weak hypothesis h^(t) (line 7); the weak learner always ensures that it finds an optimal h^(t) with ε < 1/2. At each iteration, a weight α_t is assigned to the weak classifier (line 9). The weights of the data instances are recalculated in the loop beginning at line 10: instances misclassified in the current iteration are given higher priority in the next one, so the next weak classifier focuses more on the samples that were previously misclassified. At the end of each iteration, the Mapper emits the feature index as key and a composite value consisting of the weak classifier and its weight.

Pseudocode for the Reducer is given in Algorithm 2. In Hadoop, the Reducer receives a key and the list of values associated with that key from the Mappers, and all values associated with a specific key are processed by the same Reducer. In our design, the Reducer thus receives a feature index as key and a list of all the weak classifiers for that feature as its associated values.
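For concreteness, a compact Java rendering of Algorithm 1 follows, together with the matching Reducer of Algorithm 2 (whose pseudocode is given below). This is a simplified sketch under stated assumptions (binary labels in {-1, +1}, one-dimensional decision stumps as the weak learners, the chunk cached as "chunk.csv", and a plain-text serialization of classifiers), not the authors' actual implementation; in practice each public class would live in its own file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdaBoostMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private double[][] x;               // instances of the current chunk
  private int[] y;                    // labels, assumed to be in {-1, +1}
  private static final int T = 10;    // boosting rounds (assumed constant)

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the chunk once per task; "chunk.csv" is a DistributedCache symlink
    // whose lines are "v1,v2,...,vd,label".
    List<double[]> rows = new ArrayList<>();
    List<Integer> labels = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new FileReader("chunk.csv"))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] p = line.split(",");
        double[] v = new double[p.length - 1];
        for (int j = 0; j < v.length; j++) v[j] = Double.parseDouble(p[j]);
        rows.add(v);
        labels.add(Integer.parseInt(p[p.length - 1].trim()));
      }
    }
    x = rows.toArray(new double[0][]);
    y = labels.stream().mapToInt(Integer::intValue).toArray();
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    int f = Integer.parseInt(value.toString().trim()); // one feature index per input line
    int n = x.length;
    double[] w = new double[n];
    Arrays.fill(w, 1.0 / n);                           // uniform initial weights
    for (int t = 0; t < T; t++) {
      // Weak learner: exhaustively pick the threshold/polarity stump on
      // feature f with minimum weighted error (O(n^2); fine for a sketch).
      double bestThr = 0, bestEps = Double.MAX_VALUE;
      int bestPol = 1;
      for (int c = 0; c < n; c++)
        for (int pol = -1; pol <= 1; pol += 2) {
          double eps = 0;
          for (int i = 0; i < n; i++)
            if ((x[i][f] >= x[c][f] ? pol : -pol) != y[i]) eps += w[i];
          if (eps < bestEps) { bestEps = eps; bestThr = x[c][f]; bestPol = pol; }
        }
      double eps = Math.max(bestEps, 1e-10);           // guard against eps == 0
      double alpha = 0.5 * Math.log((1 - eps) / eps);
      for (int i = 0; i < n; i++) {                    // re-weight the instances
        boolean wrong = (x[i][f] >= bestThr ? bestPol : -bestPol) != y[i];
        w[i] = wrong ? w[i] / (2 * eps) : w[i] / (2 * (1 - eps));
      }
      // Emit (feature index, (alpha, weak classifier)), as in Algorithm 1.
      ctx.write(new IntWritable(f), new Text(alpha + ":" + bestThr + "," + bestPol));
    }
  }
}

public class AdaBoostReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable f, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder ensemble = new StringBuilder();      // H(f) = sum_t alpha_t * h_t
    for (Text v : values) ensemble.append(v).append(';');
    ctx.write(f, new Text(ensemble.toString()));       // final ensemble for feature f
  }
}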

Algorithm 2 Pseudocode for Reducer
1: procedure REDUCE(f, List(α_t, h^(t)))
2:   H^(f) ← Σ_t α_t h^(t)
3:   Emit(f, H^(f))
4: end procedure

The Reducer then prepares the final ADABOOST ensemble by taking the weighted sum of all the weak classifiers for that feature (line 2), where H^(f) denotes the ADABOOST ensemble for feature index f. Finally, the Reducer emits the feature index and the ADABOOST ensemble for that feature as a <key, value> pair. The output of the Reducer is written to HDFS and then copied to the local file system; the per-feature ADABOOST ensembles are afterwards extracted from the file and used to update the whole hierarchical structure. In this way, using MapReduce, the ADABOOST ensembles for all numeric features can be formed with a very high level of parallelism. We tested our approach with several data sets having large numbers of features and used different chunk sizes to examine its performance; we observed that it is especially useful when the number of features or the size of the data chunk is large. We discuss the experimental results in detail in Section V.

V. EXPERIMENTAL RESULTS

We implemented our base approach (Heterogeneous Hierarchical Ensemble, or HHE) [1] in Java; for the MapReduce implementation we used Apache Hadoop. We tested the MapReduce implementation on a cluster of nine data nodes, each an Intel Core i7-2600K machine running Linux. We report performance comparisons in terms of execution time and classification accuracy, and we carried out experiments with different data chunk sizes and different numbers of Map tasks to evaluate the impact of MapReduce parallelism.

To evaluate our MapReduce implementation we used several established benchmark data sets; Table III summarizes their characteristics.

TABLE III: Data Set Characteristics

Name of DataSet   Instances   Classes   Numeric Features   Nominal Features
KDD               490,000     -         34                 7
PAMAP2            3,850,505   18        52                 0
ForestCover       581,012     7         54                 0

The KDD data set is from KDD Cup 1999. It contains information pertaining to network traffic metrics and attributes, with both normal traffic and 20 types of network attack traffic. For the second data set, we used the large Physical Activity Monitoring data set (PAMAP2) from the UCI repository [13] to test the performance of DXMiner [5] and of our methods, as this data set is extremely large, has numerous features and evolving concepts, and is stream oriented. In this data set, nine persons were equipped with sensors that gathered a total of 52 streaming metric attributes while they performed activities. Eighteen activities in total were identified as class labels, including a single Other category for miscellaneous or transient activities.

TABLE IV: Accuracy Results on Different Datasets

Method           ForestCover   KDD     PAMAP2
HHE      Error   7.9%          3.2%    2.0%
         AUC     -             -       -
DXMiner  Error   5.2%          11.9%   48.3%
         AUC     -             -       -

The third data set, Forest Cover, was obtained from the UCI repository as explained in [5]. It contains 54 numeric attributes, all of which we use as they arrive in the stream, parsing the data file without filtering and without normalization.

TABLE V: Average execution time (in seconds) per chunk using the original HHE and the MapReduce implementation, for the ForestCover, PAMAP2, and KDD datasets at several chunk sizes

Our MapReduce implementation exhibits the same classification accuracy as the base approach HHE without MapReduce.
We chose to compare the accuracy of our algorithm against DXMiner because DXMiner focuses on solving many of the same problems we do and has been shown to perform well for labeling streaming data [5]. Since we obtained access to the DXMiner executables, we are able to compare the performance of our method against DXMiner on new benchmark data sets. We also compare the results of the MapReduce implementation against our original approach without MapReduce to show the dramatic improvement in execution time.

The comparison between HHE and DXMiner [5] in terms of accuracy is shown in Table IV. Both methods are evaluated on each of the three benchmark data sets with regard to the average error and the Area Under the Curve (AUC) metric. AUC is computed by numerically integrating the Receiver Operating Characteristic (ROC) curve. The ROC curve plots an algorithm's true positive rate against its false positive rate and is thus a more robust depiction of the algorithm's overall performance; likewise, the AUC is a more robust single-value metric for comparing algorithms [16]. Note that the KDD AUC value for DXMiner is not listed, since the DXMiner application failed to return an AUC number and did not log enough information to compute the ROC or AUC manually. Our approach, HHE, shows significantly better accuracy on the KDD and PAMAP2 datasets and competitive accuracy on the ForestCover dataset.

HHE can use features in their native format (continuous or discrete, without normalization), so it does not need to reduce the feature space as is done in [5]. DXMiner handles a reduced feature set, discarding the discrete data, and must normalize all features to the range (0.0, 1.0), while HHE requires no such pre-processing. In addition, the ForestCover dataset does not truly have a dynamic feature space, and DXMiner retains the union of all features for its learners as it progresses; the lower error of DXMiner on the ForestCover dataset is therefore not surprising, given the fixed feature set in that data.

The execution times of the original HHE and the MapReduce implementation for different chunk sizes are listed in Table V. It is clear from the data that the MapReduce implementation outperforms the original HHE once the chunk size is sufficiently large. Figure 5 graphically compares the execution times on the ForestCover dataset: for chunk sizes below 8000, the original HHE shows better execution time than the MapReduce implementation, but for chunk sizes above 8000, HHE with MapReduce is significantly faster. This is not surprising, as Hadoop carries some overhead of its own, which dominates when chunks are small; as the chunk size grows, that overhead is amortized and the MapReduce implementation pulls ahead. Another observation from the empirical data is that the KDD dataset requires a comparatively larger chunk size before the MapReduce implementation wins. The reason is that KDD has far fewer attributes than the other two datasets, so a larger chunk is needed to overcome the internal overhead of running Hadoop. From these empirical data we conclude that the MapReduce implementation is especially useful when the number of features or the size of the data chunk is high.

Fig. 5: Comparison of execution time between original HHE and HHE with MapReduce (ten Map tasks) for the ForestCover dataset

We also experimented with different numbers of Map tasks; their influence on average execution time is shown in Table VI. We controlled the number of Mappers by changing the block size of the job's input file, which contains the list of features for which ADABOOST ensembles must be built. The results show that execution time drops significantly as the number of Mappers grows. Figure 6 depicts the power of MapReduce parallelism more clearly: as the number of Mappers increases, the number of features per Mapper decreases, so the ADABOOST ensembles for different features are formed with more parallelism. From Figure 6 we observe that with more Map tasks the execution time per data chunk falls for every chunk size, and especially for large chunks. Recall that building an ADABOOST ensemble classifier requires T iterations over the dataset, where T is the number of weak classifiers in the ensemble.

TABLE VI: Average execution time (in seconds) per chunk for different numbers of Map tasks (3, 5, and 10) on the ForestCover, PAMAP2, and KDD datasets

Fig. 6: Comparison of execution time using different numbers of Map tasks for the ForestCover dataset
Fewer features per Map task means fewer of these iterations over the large dataset for each Map; thus, with an increased number of Map tasks, the execution time per data chunk decreases. From the above analysis it is clear that our base approach, HHE, shows better accuracy than other approaches that aim to solve the same problem of labeling instances in evolving data streams.

Moreover, applying MapReduce-based parallelism to this robust technique makes it significantly faster and increases its scalability for larger data chunks and higher numbers of features.

VI. CONCLUSION

The work presented in this paper improves on our current effort to design an efficient method that rapidly learns the concepts in a data stream, predicts labels for new data with strong accuracy, and agilely tracks the dynamic changes in the evolving concepts and feature space. In this paper we have addressed the scalability issue in our base approach [1], showing through algorithmic and empirical analysis that our method is more efficient than the base approach in terms of execution time, especially for large numbers of features and large data chunks. Compared to alternative methods, our approach adapts well to data streams and tends to approximate the decision boundaries of the target classes more tightly. We demonstrated that training the ADABOOST sub-ensembles in parallel decreases execution time significantly when handling large numbers of features. We intend to continue improving our methodology and to add components that further adapt to changes in the data stream, including novel classes; we will continue to investigate variations that further optimize both speed and accuracy, and in the future we will apply further parallelism to the training of sub-ensemble types other than ADABOOST and to the testing phase.

VII. ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under Award No. CNS and the Air Force Office of Scientific Research under Award No. FA. We thank Dr. Robert Herklotz for his support.

REFERENCES

[1] B. Parker, A. M. Mustafa, and L. Khan, "Novel Class Detection and Feature via a Tiered Ensemble Approach for Stream Mining," in Proc. IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, pp. 1171-1178, Nov. 2012.
[2] A. Bifet, R. Kirkby, G. Holmes, R. Gavalda, and B. Pfahringer, "New Ensemble Methods for Evolving Data Streams," in Proc. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York, 2009.
[3] P. H. dos Santos Teixeira and R. L. Milidiu, "Data Stream Anomaly Detection through Principal Subspace Tracking," in Proc. 2010 ACM Symposium on Applied Computing, New York, 2010.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams," in ECML/PKDD 2006 International Workshop on Knowledge Discovery from Data Streams, Berlin, 2006.
[5] M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, 2011.
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer-Verlag, 2009.
[7] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[8] R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Machine Learning, 1999.
[9] I. Palit and C. K. Reddy, "Scalable and Parallel Boosting with MapReduce," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, 2012.
[10] G. Escudero, L. Marquez, and G. Rigau, "Boosting Applied to Word Sense Disambiguation," in Proc. European Conference on Machine Learning (ECML), 2000.
[11] R. Busa-Fekete and B. Kegl, "Bandit-Aided Boosting," in Proc. Second NIPS Workshop on Optimization for Machine Learning.
[12] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[13] A. Reiss and D. Stricker, "Introducing a New Benchmarked Dataset for Activity Monitoring," in Proc. 16th IEEE International Symposium on Wearable Computers (ISWC), Newcastle, UK, 2012.
[14] T. Al-Khateeb, M. M. Masud, L. Khan, and B. M. Thuraisingham, "Cloud Guided Stream Classification Using Class-Based Ensemble," in Proc. IEEE CLOUD, 2012.
[15] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. Sixth Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, CA, 2004.
[16] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers."


More information

An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network

An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network International Journal of Science and Engineering Investigations vol. 6, issue 62, March 2017 ISSN: 2251-8843 An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network Abisola Ayomide

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Cluster based boosting for high dimensional data

Cluster based boosting for high dimensional data Cluster based boosting for high dimensional data Rutuja Shirbhate, Dr. S. D. Babar Abstract -Data Dimensionality is crucial for learning and prediction systems. Term Curse of High Dimensionality means

More information

New ensemble methods for evolving data streams

New ensemble methods for evolving data streams New ensemble methods for evolving data streams A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà Laboratory for Relational Algorithmics, Complexity and Learning LARCA UPC-Barcelona Tech, Catalonia

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY , pp-01-05 FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY Ravin Ahuja 1, Anindya Lahiri 2, Nitesh Jain 3, Aditya Gabrani 4 1 Corresponding Author PhD scholar with the Department of Computer Engineering,

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory

More information

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

Available online at  ScienceDirect. Procedia Computer Science 89 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 (2016 ) 341 348 Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) Parallel Approach

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier

More information

ORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"

ORT EP R RCH A ESE R P A IDI!  #$$% &' (# $! R E S E A R C H R E P O R T IDIAP A Parallel Mixture of SVMs for Very Large Scale Problems Ronan Collobert a b Yoshua Bengio b IDIAP RR 01-12 April 26, 2002 Samy Bengio a published in Neural Computation,

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES

More information

Parallel learning of content recommendations using map- reduce

Parallel learning of content recommendations using map- reduce Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking

More information

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee Journal Homepage: NOVEL FRAMEWORK FOR DATA STREAMS CLASSIFICATION APPROACH BY DETECTING RECURRING FEATURE CHANGE IN FEATURE EVOLUTION AND FEATURE S CONTRIBUTION IN CONCEPT DRIFT Ms. Ritu Dr. Bhawna Suri

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS Vandita Jain 1, Prof. Tripti Saxena 2, Dr. Vineet Richhariya 3 1 M.Tech(CSE)*,LNCT, Bhopal(M.P.)(India) 2 Prof. Dept. of CSE, LNCT, Bhopal(M.P.)(India)

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Efficient Algorithm for Frequent Itemset Generation in Big Data

Efficient Algorithm for Frequent Itemset Generation in Big Data Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

SQL Query Optimization on Cross Nodes for Distributed System

SQL Query Optimization on Cross Nodes for Distributed System 2016 International Conference on Power, Energy Engineering and Management (PEEM 2016) ISBN: 978-1-60595-324-3 SQL Query Optimization on Cross Nodes for Distributed System Feng ZHAO 1, Qiao SUN 1, Yan-bin

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark PL.Marichamy 1, M.Phil Research Scholar, Department of Computer Application, Alagappa University, Karaikudi,

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

Subject-Oriented Image Classification based on Face Detection and Recognition

Subject-Oriented Image Classification based on Face Detection and Recognition 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information