Labeling Instances in Evolving Data Streams with MapReduce


2013 IEEE International Congress on Big Data

Labeling Instances in Evolving Data Streams with MapReduce

Ahsanul Haque, Brandon Parker, and Latifur Khan
Department of Computer Science, University of Texas at Dallas

Abstract: Unlike traditional data mining, where data is static, mining algorithms for data streams must process the data on the fly and update the class decision boundaries as the stream progresses in order to address the challenges of concept drift and feature evolution. In our current work, we proposed a multi-tiered, ensemble-based, fast and robust method that rapidly learns the concepts in a data stream, predicts labels for new data with strong accuracy, and agilely tracks the dynamic changes in the evolving concepts and feature space. The bottleneck of that work is that it must build an ADABOOST ensemble for each numeric feature, which raises a scalability issue because the number of features in a data stream can be very large. In this paper we propose a method to parallelize the independent parts of that work using a MapReduce framework. This increases scalability and achieves a significant speedup without compromising classification accuracy. We demonstrate the performance of our approach in terms of speedup, scaleup, and classification accuracy.

I. INTRODUCTION

Stream data mining has inherent challenges that are not present in traditional data mining. It attempts to discover concepts and label data instances arriving in near real time in a dynamic, continuous stream of data. In a data stream, the data is not static (as it is, for example, in a data warehouse); mining algorithms must therefore process the data in real time and update the class decision boundaries as the stream progresses. In today's connected digital world, streaming data is becoming abundant, and the need to mine these streams is increasingly important. For example, monitoring network traffic for intrusion detection or data leakage, monitoring social media traffic for trending topics, and processing continual data feeds from distributed sensor networks all require fast and dynamic mining algorithms. Since data continuously enters the system, a stream is effectively a data set of infinite size. Any classification method used in a streaming context should be able to handle drifting concepts and overall changes in the data and label space [2], [4]. Feature evolution occurs when the features or attributes in the data stream shift in meaning, range, or context, or when new features are added mid-stream. Concept drift happens when the target class or concept evolves within the feature space such that the class encroaches on or crosses previously defined decision boundaries of the classifier.

The approach we present in this paper to face these challenges is based on our current work [1] on labeling instances in evolving data streams. It uses a pipeline processing paradigm to quickly apply predicted labels to data instances in the stream and to maintain the classifier regularly as the stream evolves. The core concept behind our method is the use of a multi-tiered ensemble, depicted in Figure 1. Our approach builds a hierarchy of ensemble classifiers by decomposing the classification problem; as a result, the hierarchy can be incrementally updated. Furthermore, each independent section of the ensemble tree learns and updates the weights and inclusion of relevant features.
At the bottom of the hierarchy there are two distinct types of base learners for different features. Non-numeric features induce Naïve Bayes base classifiers, while numeric features induce ADABOOST sub-ensembles of linear classifiers. These sub-ensembles are maintained and updated after the arrival of each chunk of data. The base approach [1] for our proposed method is described briefly in Section III.

In the original design and implementation of that work, the base learners for different features are trained serially. As the number of features is typically large in data stream mining, the number of ADABOOST sub-ensembles needed for numeric features can also be large, so considerable effort is required to train and maintain those sub-ensembles after the arrival of each data chunk. Although the base approach is a robust technique for addressing the challenges of data stream mining, it can suffer from scalability issues due to the serialized training of a large number of per-feature ensembles after each data chunk, and each data chunk can itself be very large. We observed that training the ADABOOST sub-ensemble for one numeric feature is independent of training the sub-ensemble for any other feature, so there is scope to parallelize the training and maintenance of these sub-ensembles. In this paper we present a method that applies parallel training using the MapReduce framework to form the ADABOOST ensembles for numeric features.

MapReduce [15] is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. A MapReduce computation involves two steps. In the Map step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes; each worker node processes its sub-problem and passes the answer back to the master. In the Reduce step, the master node collects the answers to all the sub-problems and combines them to form the output, which is the answer to the problem it was originally trying to solve.
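To make the two steps concrete, the canonical word-count job is sketched below in Java against the standard Hadoop mapreduce API. It is a generic illustration of the programming model only, not the classifier-training job developed in this paper.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map step: each worker tokenizes its input split and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts for one word arrive at the same Reducer and are summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}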

To implement the parallel training of ADABOOST sub-ensembles for numeric features, we use Apache Hadoop, an open-source implementation of MapReduce that currently enjoys wide popularity. Hadoop presents MapReduce as an analytics engine and, under the hood, uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). The framework transparently provides both reliability and data motion to applications. We empirically show that, while maintaining test accuracy, our MapReduce implementation achieves a significant speedup compared to the baseline of training the per-feature ADABOOST sub-ensembles on a single machine. We experimented with different types of datasets and different chunk sizes to demonstrate the scalability issue of the original method and how our proposed method overcomes it through parallel processing with MapReduce.

The primary contributions of our work are as follows:
1) We address the scalability issue of the base approach [1], especially when the number of features or the size of a data chunk is large. To solve this problem, we identify the independent components of the base approach to which parallelism can be applied.
2) We design a MapReduce-based parallel, distributed solution that performs the independent parts of the base approach [1] in parallel to achieve significant speedup and scalability.
3) We implement our proposed method using Apache Hadoop, a popular open-source implementation of MapReduce that supports running distributed applications on large clusters of commodity hardware, and we make several design choices to keep our solution highly optimized.
4) We test the performance of our implementation against several established benchmark data sets, presenting results with different data chunk sizes and different numbers of Map tasks to demonstrate the benefit of MapReduce-based parallelism on these performance metrics.

The rest of the paper is organized as follows: Section II describes earlier work related to our problem. In Section III we briefly describe our current work [1], which is the base approach for the method presented in this paper. In Section IV we discuss the problems of the base approach [1] and our solutions to them. Section V presents the experimental results and compares our proposed approach with the base approach [1]. Finally, Section VI concludes our discussion along with some future research directions.

II. RELATED WORK

With the recent explosion in the interconnectivity of data sources and the constant accumulation of data from the expanding use and proliferation of Internet-connected devices, it is no surprise that gleaning useful information from data streams has become a research focus. Since our approach is hierarchical, it has several similarities to various sub-areas of stream mining research; however, very few approaches adequately address as many of the stream mining issues as ours. The closest work to our approach is DXMiner [5], whose authors address all three issues but use simple ensemble voting, handle only numerical (or ordinal) features, and retain a homogenized conglomeration of all features in all ensemble model evaluations. DXMiner [5] uses an ensemble of hyperspheres to capture the decision boundary for classes as the stream is processed. It does not address the issue of a changing feature space, as it merely creates a growing union of features; our method, instead, tracks the dynamic feature set within the ensemble hierarchy.
Although both DXMiner [5] and our method address concept drift, our method uses a model-based approach, comparing the confidence of the existing ensemble learners to that of a provisional outlier ensemble. It does not rely on an E-M (e.g., K-Means) algorithm, which greatly reduces the number of necessary computations. As shown in Section V, our method achieves better accuracy than DXMiner [5] on all of the benchmark datasets used in this paper.

Katakis et al. [4] used Naïve Bayes to handle dynamic features incrementally for textual data streams, pruning features not well suited to the current classification problem, but used Boolean bag-of-words attributes only. Naïve Bayes implicitly treats each feature separately, which maps to our approach of creating per-feature classifiers and pruning those base classifiers that are too error-prone. This approach, however, addresses neither concept drift nor heterogeneous attributes. In our approach, we leverage the ADABOOST algorithm [6] to create more robust non-linear separators for our numeric per-feature classifiers. Likewise, the error approximation outlined in [6] for ADABOOST helps to minimize error propagation up through our hierarchy of ensembles. If the weak classifiers represent individual attributes in the feature space, the sum of a particular attribute-decider's weights corresponds to the decisive contribution of that feature in the data set. Freund et al. [7] and Schapire et al. [8] discuss the performance and accuracy contributions of ADABOOST in this context, but focus on static data, not streaming data.

Al-Khateeb et al. [14] proposed a cloud-based solution to handle a large number of classes effectively. This class-based ensemble approach stores the information for each class separately: from each data chunk, a model is trained for each class of data using a clustering algorithm. If the number of dimensions is high, clustering quality degrades due to the curse of dimensionality; moreover, the literature shows that an ADABOOST ensemble model works far better than clustering-based semi-supervised methods in terms of classification accuracy.

There is also existing research on parallel boosting with MapReduce. Two parallel boosting algorithms, ADABOOST.PL and LOGITBOOST.PL, were proposed by Palit and Reddy [9]; they allow multiple computing nodes to participate simultaneously in constructing a boosted ensemble classifier, achieving a significant speedup while remaining competitive with the corresponding serial versions in terms of generalization performance. Escudero et al. [10] proposed LAZYBOOST, which utilizes several feature selection and ranking methods. Another fast boosting algorithm in this category was proposed by Busa-Fekete and Kegl [11], which utilizes multi-armed bandits (MAB). Unlike these approaches, we do not try to parallelize the boosting process itself; our approach distributes the task of forming ADABOOST ensembles for different numeric features among different Mappers. Our proposed framework is therefore not similar to any of the above approaches. Our method is based on our current work [1], which is discussed in Section III. It collects the list of features selected to have an ADABOOST sub-ensemble, then uses the Hadoop MapReduce architecture to form, train, and update these sub-ensembles in parallel after the arrival of each data chunk. Thus our framework, applied on top of a base method that is itself very robust, achieves significant speedup and scalability without sacrificing accuracy.

III. BACKGROUND

Our framework is based on the multi-tiered ensemble approach for labeling instances in evolving data streams [1]. In this section we review a few details of this approach. It uses a pipeline processing paradigm to quickly apply predicted labels to data instances in the data stream and to regularly maintain the classifier as the stream evolves.

Fig. 1: Multi-level ensemble architecture

Fig. 2: Processing Pipeline

The core concept behind this method is the use of a multi-tiered ensemble, depicted in Figure 1. We first break down the classification problem into two-class problems by creating a top-tier ensemble of per-class classifiers, following a one-against-all paradigm. Each per-class classifier is itself another ensemble of classifiers, and each class-based sub-ensemble is further decomposed into feature-based classifiers. Feature-based classifiers take one of two forms: non-numeric attributes are learned using the Naïve Bayes algorithm, while each numeric feature induces an ADABOOST ensemble of linear threshold classifiers.

This design provides two important advantages over other methods. First, since each class-based classifier maintains its own set of chunk-based classifiers, which in turn maintain their own sets of feature-based classifiers, the features can be pruned such that only the necessary features are retained for each class. This ensures that as features drift in the stream, each class independently adapts to the features best suited to its target label. Second, maintaining separate per-feature classifiers allows our method to use both numeric and non-numeric features without extra conversion or normalization processes.

Figure 2 depicts the workflow of the approach. As the data stream enters the system, it is segmented into discrete data chunks for processing. For each chunk, the ensemble classifier (described in detail later in this section) predicts a label for each data instance in the chunk; for testing purposes, accuracy metrics are gathered. A subset of the data chunk is then used as training data to update the ensemble in order to adapt to changes in the data stream. Upon completion of this routine, the next data chunk enters the system, and the process repeats.

When training data enters the system, it is used to update the classifier ensemble by creating new class-based sub-ensembles for each label present in the training data. Once the class-based sub-ensembles are created, each larger ensemble evaluates the existing class-based classifiers for those labels and prunes the single-class chunk-based classifier with the worst accuracy. This keeps the number of chunk-based classifiers constant and ensures that the overall classifier stays up to date with the current concept trends.

To demonstrate the training and classification process of the hierarchical ensemble, consider the example training set shown in Table I. This example depicts types of sports balls with four attributes: diameter in millimeters, mass in grams, predominant color, and shape (features f1-f4, respectively).
Without loss of generality, let us assume a chunk size of four and a maximum chunk retention of one.

TABLE I: Example Training Data Instances

Instance   f1 (diameter, mm)   f2 (mass, g)   f3 (color)   f4 (shape)      Label
x1         -                   -              white        sphere          Baseball
x2         -                   -              white        icosahedron     Soccer ball
x3         -                   -              yellow       sphere          Tennis ball
x4         -                   -              white        sphere          Baseball

When the ensemble is trained on this data set, features f1 and f2 induce ADABOOSTed threshold classifiers, while f3 and f4 induce Naïve Bayes classifiers, as shown in Figure 3. For brevity, the labels baseball, soccer ball, and tennis ball are abbreviated bb, sb, and tb, respectively, and only the baseball class is fully depicted. To predict the label of a test data instance, the ensemble finds the label with the maximum confidence according to the hierarchical votes of the underlying classifiers. The vote of each lower-level, feature-based classifier is multiplied by its respective weight; these weighted votes propagate up the hierarchy and are aggregated into a final vote tally. The label with the highest vote is the predicted class of the data instance.
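To make this vote aggregation concrete, the following minimal sketch shows how the weighted roll-up might look in Java. All names here (FeatureClassifier, vote, weight) are hypothetical illustrations; the paper does not prescribe this interface.

import java.util.List;
import java.util.Map;

class VoteAggregationSketch {
  // Hypothetical view of one feature-based classifier in a class's ensemble.
  interface FeatureClassifier {
    String featureId();            // which attribute this classifier reads
    double vote(Object value);     // confidence that the instance belongs to the class
    double weight();               // learned reliability of this classifier
  }

  // Weighted votes are summed per class; the label with the largest tally wins.
  static String predict(Map<String, List<FeatureClassifier>> classEnsembles,
                        Map<String, Object> instance) {
    String best = null;
    double bestTally = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, List<FeatureClassifier>> e : classEnsembles.entrySet()) {
      double tally = 0.0;
      for (FeatureClassifier c : e.getValue()) {
        // each lower-level vote is multiplied by its classifier's weight
        tally += c.weight() * c.vote(instance.get(c.featureId()));
      }
      if (tally > bestTally) { bestTally = tally; best = e.getKey(); }
    }
    return best;                   // predicted label: the highest hierarchical vote
  }
}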

Assuming we have trained the ensemble as presented in this section using the training data from Table I, suppose that we obtain the test data instance shown in Table II.

TABLE II: Example Test Instance

Instance   f1   f2   f3      f4       Label
x          -    -    white   sphere   ? (Baseball)

Figure 3 depicts a portion of the prediction process for this instance. Each feature-based classifier contributes a prediction: the continuous-valued features (f1, f2) contribute through ADABOOST ensembles of linear threshold classifiers, while the non-numeric features (f3, f4) contribute through Naïve Bayes classifiers. Each class-based ensemble (baseball, soccer ball, tennis ball) then takes the summation of the contributing feature-based classifiers. At the final tier of the ensemble, the class label with the highest vote becomes the predicted label; thus the test data instance in Table II is given the label baseball by the classifier ensemble.

Fig. 3: Example Data Instance Prediction by the Ensemble

Our base approach offers several key contributions to the area of stream data mining. First, it uses data attributes in their native format without any normalization preprocessing, treating numeric and non-numeric attributes equally. Second, it decomposes the classification problem into a tiered ensemble, giving it the ability to adapt continuously to dynamic feature and concept changes within the data stream. Finally, it shows better efficiency in terms of accuracy and execution time compared with other state-of-the-art methods that address the same problem.

IV. PROBLEMS AND SOLUTION

As discussed in Section III, our method needs a separate ADABOOST sub-ensemble for each numeric feature. ADABOOST [7] is an ensemble learning method that iteratively induces a strong classifier from a pool of weak hypotheses. During each iteration, it employs a simple learning algorithm (called the base classifier) to obtain a single learner for that iteration. The final ensemble classifier is a weighted linear combination of these base classifiers, H(x) = Σ_t α_t h^(t)(x), where each base classifier casts its weighted vote. The weights correspond to the correctness of the classifiers, i.e., a classifier with a lower error rate gets a higher weight. The base classifiers have to be at least slightly better than a random classifier, and hence they are also called weak classifiers.

In data streams, the number of features can at times be very large. For example, in a textual stream each distinct keyword is regarded as a feature, so in a large corpus the number of dimensions, i.e., the number of features, can be on the order of tens of thousands. To build an ADABOOST ensemble learner for one feature, the number of iterations needed equals the number of weak classifiers in the final ensemble, and each iteration must pass over all the data instances of the current chunk. The complexity of forming the ADABOOST sub-ensembles for all numeric features is therefore O(n * w * f), where n is the number of data instances in the chunk, w is the maximum number of weak classifiers in a sub-ensemble for a particular feature, and f is the number of numeric features that require a separate sub-ensemble. For every data chunk, we must rebuild the ADABOOST ensembles for this potentially huge number of features, which is a gigantic task in terms of time complexity. Moreover, since data continuously enters the system, the volume of data that must be processed per unit time can itself be very large.
For each data chunk, the system must therefore perform all of these calculations for forming the ADABOOST ensembles in a very short amount of time, so our base work [1] can face scalability issues when labeling test instances in real time, especially when the number of features or the size of the data chunk is large.

To solve this problem, we observed that the process of forming the ADABOOST ensemble classifier for one feature is totally independent of forming the ensemble for any other feature, so we can form and update the per-feature ADABOOST classifiers for each data chunk in parallel. To do this we use MapReduce, a distributed programming paradigm introduced by Dean and Ghemawat [12] that is capable of processing large data sets in a parallel, distributed manner across many nodes.

Fig. 4: Forming ADABOOST ensembles using MapReduce

MapReduce has two primary functions: the Map function and the Reduce function. The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes; in the Reduce step, the master node then collects the answers to all the sub-problems and combines them to form the output, the answer to the problem it was originally trying to solve. Provided each Map operation is independent of the others, all Map tasks can be performed in parallel.
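The independence is easy to see in a schematic of the base approach's serial training loop. In the sketch below, DataChunk, Ensemble, and trainAdaBoostForFeature are hypothetical stand-ins (stubbed only so the sketch is self-contained); the structural point is that no iteration of the per-feature loop reads or writes another iteration's state, which is exactly the property MapReduce exploits.

import java.util.HashMap;
import java.util.Map;

class SerialTraining {
  interface DataChunk {}
  interface Ensemble {}

  static Ensemble trainAdaBoostForFeature(DataChunk chunk, int f, int T) {
    throw new UnsupportedOperationException("stand-in for T rounds of boosting");
  }

  static Map<Integer, Ensemble> trainAll(DataChunk chunk, int[] numericFeatures, int T) {
    Map<Integer, Ensemble> subEnsembles = new HashMap<>();
    for (int f : numericFeatures) {                              // f independent sub-problems...
      subEnsembles.put(f, trainAdaBoostForFeature(chunk, f, T)); // ...of cost O(n * w) each
    }
    return subEnsembles;                                         // total serial cost: O(n * w * f)
  }
}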

To implement our approach we use Apache Hadoop, which consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS). A main strength of Apache Hadoop is that it does not require expensive hardware: it runs applications on large clusters of commodity machines while transparently providing both reliability and data motion to applications. In addition, its distributed file system (HDFS) stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework.

Our approach to applying MapReduce-based parallelism is depicted in Figure 4. The input to a Mapper is a feature index f_i for which it must form the ensemble, together with the number of weak classifiers T. Each Mapper can receive several feature indexes depending on the input split, so training for different features runs in parallel across Mappers. Each Mapper runs T iterations for feature f_i on the data chunk it receives, and each iteration forms a weak classifier for that feature. At iteration t, the Mapper emits the feature index f_i as key and a composite pair as value; the pair contains the weak classifier formed at that iteration, h_i^(t), and the weight for that weak classifier, α_i^(t).

The output of the Map tasks is fed into the Reducer procedure. All values for the same key are always reduced together, regardless of which Mapper produced them: the Reducer receives a key and the list of values associated with it. As discussed above, in our case the Map output key is the feature index and the value is the combination of a weak classifier and its weight, so the Reducer receives a feature index as key and all the weak classifiers associated with that feature, along with their weights, as the list of values. From this information the Reducer forms and emits the final ADABOOST ensemble for that feature as value, with the feature index as key.

In our initial approach, the system first collects the current data chunk and the list of features that need an ADABOOST ensemble classifier, and copies them to HDFS (the Hadoop distributed file system). The system then runs parallel Mappers, which form the weak classifiers of the per-feature ADABOOST sub-ensembles and emit them to the Reducer. The Reducer receives all these weak classifiers and forms the final ADABOOST ensemble for each feature. The output of the Reducer is then copied back to the system to update the current ensembles at the different levels of the hierarchy.

We later made some optimizations to this initial approach. First, after collecting the current data chunk, the system places it in Hadoop's DistributedCache, a facility provided by the MapReduce framework to cache files needed by applications. The framework copies the necessary files to each slave node before any tasks for the job are executed on that node; its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives that are un-archived on the slaves. The second optimization customizes the way Hadoop splits the input file. The number of Map tasks in a Hadoop job depends on the number of input file splits, and in our design the input file is simply the list of features for which we must build classifiers.
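As a concrete illustration, the job configuration might look like the following sketch (Hadoop 2.x mapreduce API). Using NLineInputFormat on the feature list is our assumption about how the input split is customized: it assigns a fixed number of feature-index lines to each Mapper, so lowering that number raises the number of Map tasks. The HDFS paths are illustrative, and the AdaBoostMapper and AdaBoostReducer classes are sketched after Algorithms 1 and 2 below.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AdaBoostDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "per-feature adaboost");
    job.setJarByClass(AdaBoostDriver.class);
    job.setMapperClass(AdaBoostMapper.class);     // per-feature boosting (Algorithm 1)
    job.setReducerClass(AdaBoostReducer.class);   // ensemble assembly (Algorithm 2)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // The data chunk is shipped once per job through the DistributedCache and
    // symlinked as "chunk.csv" in every task's working directory.
    job.addCacheFile(new URI("/user/stream/chunk.csv#chunk.csv"));

    // Input is the list of numeric feature indexes, one per line. Fewer lines
    // per split means more input splits, and therefore more parallel Mappers.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path("/user/stream/feature-list.txt"));
    NLineInputFormat.setNumLinesPerSplit(job, 100);

    FileOutputFormat.setOutputPath(job, new Path("/user/stream/ensembles"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}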
Increasing the number of input splits therefore increases the number of Mappers running for the job, and thus the level of parallelism.

Algorithm 1 Pseudocode for Mapper
1: procedure INITIALIZE(context)
2:   dataset ← loadFromDistributedCache()   ▷ load the training set of n samples from the distributed-cache file
3: end procedure
4: procedure MAP(f, T)
5:   w^1 ← (1/n, ..., 1/n)
6:   for t = 1 to T do
7:     h^(t) ← LearnWeakClassifier(w^t, f)
8:     ε ← Σ_{i=1}^{n} w_i^t · I{h^(t)(x_i) ≠ y_i}
9:     α_t ← (1/2) ln((1 − ε)/ε)
10:    for i = 1 to n do   ▷ for all the data instances loaded in INITIALIZE
11:      if h^(t)(x_i) ≠ y_i then
12:        w_i^{t+1} ← w_i^t / (2ε)
13:      else
14:        w_i^{t+1} ← w_i^t / (2(1 − ε))
15:      end if
16:    end for
17:    Emit(f, (α_t, h^(t)))
18:  end for
19: end procedure

Pseudocode for the Mapper is given in Algorithm 1. The INITIALIZE procedure is called before MAP begins; in it, the Mapper loads the data chunk from the DistributedCache into a corresponding data structure. To form the ADABOOST ensemble for a feature, the algorithm must iterate over the whole dataset multiple times, so loading the dataset into a data structure once makes execution more efficient. The MAP procedure starts with an initial uniform weight distribution over the data instances in the current chunk (line 5); T is the total number of boosting iterations. Note that for any iteration t, Σ_{i=1}^{n} w_i^t = 1. At each iteration, a weak learner function is applied to the weighted version of the data and returns an optimal weak hypothesis h^(t) (line 7); the weak learner always ensures that it finds an optimal h^(t) with ε < 1/2. At each iteration, a weight α_t is assigned to the weak classifier (line 9). The weights of the data instances are recalculated in the loop beginning at line 10: instances misclassified in the current iteration are given higher priority in the next one, so the next weak classifier focuses more on the samples that were previously misclassified. At the end of each iteration, the Mapper emits the feature index as key and a composite value consisting of the weak classifier and its weight.

Pseudocode for the Reducer is given in Algorithm 2. In Hadoop, the Reducer receives a key and the list of values associated with that key from the Mappers, and all values associated with a specific key are processed by the same Reducer. In our design, the Reducer thus receives a feature index as key and a list of all the weak classifiers for that feature as its associated values.
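For concreteness, a compact Java rendering of Algorithm 1 follows, together with the matching Reducer of Algorithm 2 (whose pseudocode is given below). This is a simplified sketch under stated assumptions (binary labels in {-1, +1}, one-dimensional decision stumps as the weak learners, the chunk cached as "chunk.csv", and a plain-text serialization of classifiers), not the authors' actual implementation; in practice each public class would live in its own file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdaBoostMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private double[][] x;               // instances of the current chunk
  private int[] y;                    // labels, assumed to be in {-1, +1}
  private static final int T = 10;    // boosting rounds (assumed constant)

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the chunk once per task; "chunk.csv" is a DistributedCache symlink
    // whose lines are "v1,v2,...,vd,label".
    List<double[]> rows = new ArrayList<>();
    List<Integer> labels = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new FileReader("chunk.csv"))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] p = line.split(",");
        double[] v = new double[p.length - 1];
        for (int j = 0; j < v.length; j++) v[j] = Double.parseDouble(p[j]);
        rows.add(v);
        labels.add(Integer.parseInt(p[p.length - 1].trim()));
      }
    }
    x = rows.toArray(new double[0][]);
    y = labels.stream().mapToInt(Integer::intValue).toArray();
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    int f = Integer.parseInt(value.toString().trim()); // one feature index per input line
    int n = x.length;
    double[] w = new double[n];
    Arrays.fill(w, 1.0 / n);                           // uniform initial weights
    for (int t = 0; t < T; t++) {
      // Weak learner: exhaustively pick the threshold/polarity stump on
      // feature f with minimum weighted error (O(n^2); fine for a sketch).
      double bestThr = 0, bestEps = Double.MAX_VALUE;
      int bestPol = 1;
      for (int c = 0; c < n; c++)
        for (int pol = -1; pol <= 1; pol += 2) {
          double eps = 0;
          for (int i = 0; i < n; i++)
            if ((x[i][f] >= x[c][f] ? pol : -pol) != y[i]) eps += w[i];
          if (eps < bestEps) { bestEps = eps; bestThr = x[c][f]; bestPol = pol; }
        }
      double eps = Math.max(bestEps, 1e-10);           // guard against eps == 0
      double alpha = 0.5 * Math.log((1 - eps) / eps);
      for (int i = 0; i < n; i++) {                    // re-weight the instances
        boolean wrong = (x[i][f] >= bestThr ? bestPol : -bestPol) != y[i];
        w[i] = wrong ? w[i] / (2 * eps) : w[i] / (2 * (1 - eps));
      }
      // Emit (feature index, (alpha, weak classifier)), as in Algorithm 1.
      ctx.write(new IntWritable(f), new Text(alpha + ":" + bestThr + "," + bestPol));
    }
  }
}

public class AdaBoostReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable f, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder ensemble = new StringBuilder();      // H(f) = sum_t alpha_t * h_t
    for (Text v : values) ensemble.append(v).append(';');
    ctx.write(f, new Text(ensemble.toString()));       // final ensemble for feature f
  }
}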

Algorithm 2 Pseudocode for Reducer
1: procedure REDUCE(f, List(α_t, h^(t)))
2:   H^(f) ← Σ_t α_t h^(t)
3:   Emit(f, H^(f))
4: end procedure

The Reducer then prepares the final ADABOOST ensemble by taking the weighted sum of all the weak classifiers for that feature (line 2), where H^(f) denotes the ADABOOST ensemble for feature index f. Finally, the Reducer emits the feature index and the ADABOOST ensemble for that feature as a <key, value> pair. The output of the Reducer is written to HDFS and then copied to the local file system; the per-feature ADABOOST ensembles are afterwards extracted from the file and used to update the whole hierarchical structure. In this way, using MapReduce, the ADABOOST ensembles for all numeric features can be formed with a very high level of parallelism. We tested our approach with several data sets having large numbers of features and used different chunk sizes to examine its performance; we observed that it is especially useful when the number of features or the size of the data chunk is large. We discuss the experimental results in detail in Section V.

V. EXPERIMENTAL RESULTS

We implemented our base approach (Heterogeneous Hierarchical Ensemble, or HHE) [1] in Java; for the MapReduce implementation we used Apache Hadoop. We tested the MapReduce implementation on a cluster of nine data nodes, each an Intel Core i7-2600K machine running Linux. We report performance comparisons in terms of execution time and classification accuracy, and we carried out experiments with different data chunk sizes and different numbers of Map tasks to evaluate the impact of MapReduce parallelism.

To evaluate our MapReduce implementation we used several established benchmark data sets; Table III summarizes their characteristics.

TABLE III: Data Set Characteristics

Name of DataSet   Instances   Classes   Numeric Features   Nominal Features
KDD               490,000     -         34                 7
PAMAP2            3,850,505   18        52                 0
ForestCover       581,012     7         54                 0

The KDD data set is from KDD Cup 1999. It contains information pertaining to network traffic metrics and attributes, with both normal traffic and 20 types of network attack traffic. For the second data set, we used the large Physical Activity Monitoring data set (PAMAP2) from the UCI repository [13] to test the performance of DXMiner [5] and of our methods, as this data set is extremely large, has numerous features and evolving concepts, and is stream oriented. In this data set, nine persons were equipped with sensors that gathered a total of 52 streaming metric attributes while they performed activities. Eighteen activities in total were identified as class labels, including a single Other category for miscellaneous or transient activities.

TABLE IV: Accuracy Results on Different Datasets

Method           ForestCover   KDD     PAMAP2
HHE      Error   7.9%          3.2%    2.0%
         AUC     -             -       -
DXMiner  Error   5.2%          11.9%   48.3%
         AUC     -             -       -

The third data set, Forest Cover, was obtained from the UCI repository as explained in [5]. It contains 54 numeric attributes, all of which we use as they arrive in the stream, parsing the data file without filtering and without normalization.

TABLE V: Average execution time (in seconds) per chunk using the original HHE and the MapReduce implementation, for the ForestCover, PAMAP2, and KDD datasets at several chunk sizes

Our MapReduce implementation exhibits the same classification accuracy as the base approach HHE without MapReduce.
We chose to compare the accuracy of our algorithm against DXMiner because DXMiner focuses on solving many of the same problems we do and has been shown to perform well for labeling streaming data [5]. Since we obtained access to the DXMiner executables, we are able to compare the performance of our method against DXMiner on new benchmark data sets. We also compare the results of the MapReduce implementation against our original approach without MapReduce to show the dramatic improvement in execution time.

The comparison between HHE and DXMiner [5] in terms of accuracy is shown in Table IV. Both methods are evaluated on each of the three benchmark data sets with regard to the average error and the Area Under the Curve (AUC) metric. AUC is computed by numerically integrating the Receiver Operating Characteristic (ROC) curve. The ROC curve plots an algorithm's true positive rate against its false positive rate and is thus a more robust depiction of the algorithm's overall performance; likewise, the AUC is a more robust single-value metric for comparing algorithms [16]. Note that the KDD AUC value for DXMiner is not listed, since the DXMiner application failed to return an AUC number and did not log enough information to compute the ROC or AUC manually. Our approach, HHE, shows significantly better accuracy on the KDD and PAMAP2 datasets and competitive accuracy on the ForestCover dataset.

HHE can use features in their native format (continuous or discrete, without normalization), so it does not need to reduce the feature space as is done in [5]. DXMiner handles a reduced feature set, discarding the discrete data, and must normalize all features to the range (0.0, 1.0), while HHE requires no such pre-processing. In addition, the ForestCover dataset does not truly have a dynamic feature space, and DXMiner retains the union of all features for its learners as it progresses; the lower error of DXMiner on the ForestCover dataset is therefore not surprising, given the fixed feature set in that data.

The execution times of the original HHE and the MapReduce implementation for different chunk sizes are listed in Table V. It is clear from the data that the MapReduce implementation outperforms the original HHE once the chunk size is sufficiently large. Figure 5 graphically compares the execution times on the ForestCover dataset: for chunk sizes below 8000, the original HHE shows better execution time than the MapReduce implementation, but for chunk sizes above 8000, HHE with MapReduce is significantly faster. This is not surprising, as Hadoop carries some overhead of its own, which dominates when chunks are small; as the chunk size grows, that overhead is amortized and the MapReduce implementation pulls ahead. Another observation from the empirical data is that the KDD dataset requires a comparatively larger chunk size before the MapReduce implementation wins. The reason is that KDD has far fewer attributes than the other two datasets, so a larger chunk is needed to overcome the internal overhead of running Hadoop. From these empirical data we conclude that the MapReduce implementation is especially useful when the number of features or the size of the data chunk is high.

Fig. 5: Comparison of execution time between original HHE and HHE with MapReduce (ten Map tasks) for the ForestCover dataset

We also experimented with different numbers of Map tasks; their influence on average execution time is shown in Table VI. We controlled the number of Mappers by changing the block size of the job's input file, which contains the list of features for which ADABOOST ensembles must be built. The results show that execution time drops significantly as the number of Mappers grows. Figure 6 depicts the power of MapReduce parallelism more clearly: as the number of Mappers increases, the number of features per Mapper decreases, so the ADABOOST ensembles for different features are formed with more parallelism. From Figure 6 we observe that with more Map tasks the execution time per data chunk falls for every chunk size, and especially for large chunks. Recall that building an ADABOOST ensemble classifier requires T iterations over the dataset, where T is the number of weak classifiers in the ensemble.

TABLE VI: Average execution time (in seconds) per chunk for different numbers of Map tasks (3, 5, and 10) on the ForestCover, PAMAP2, and KDD datasets

Fig. 6: Comparison of execution time using different numbers of Map tasks for the ForestCover dataset
Fewer features per Map task means fewer of these iterations over the large dataset for each Map; thus, with an increased number of Map tasks, the execution time per data chunk decreases. From the above analysis it is clear that our base approach, HHE, shows better accuracy than other approaches that aim to solve the same problem of labeling instances in evolving data streams.

Moreover, applying MapReduce-based parallelism to this robust technique makes it significantly faster and increases its scalability for larger data chunks and higher numbers of features.

VI. CONCLUSION

The work presented in this paper improves on our current effort to design an efficient method that rapidly learns the concepts in a data stream, predicts labels for new data with strong accuracy, and agilely tracks the dynamic changes in the evolving concepts and feature space. In this paper we have addressed the scalability issue in our base approach [1], showing through algorithmic and empirical analysis that our method is more efficient than the base approach in terms of execution time, especially for large numbers of features and large data chunks. Compared to alternative methods, our approach adapts well to data streams and tends to approximate the decision boundaries of the target classes more tightly. We demonstrated that training the ADABOOST sub-ensembles in parallel decreases execution time significantly when handling large numbers of features. We intend to continue improving our methodology and to add components that further adapt to changes in the data stream, including novel classes; we will continue to investigate variations that further optimize both speed and accuracy, and in the future we will apply further parallelism to the training of sub-ensemble types other than ADABOOST and to the testing phase.

VII. ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under Award No. CNS and the Air Force Office of Scientific Research under Award No. FA. We thank Dr. Robert Herklotz for his support.

REFERENCES

[1] B. Parker, A. M. Mustafa, and L. Khan, "Novel Class Detection and Feature via a Tiered Ensemble Approach for Stream Mining," in Proc. IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, pp. 1171-1178, Nov. 2012.
[2] A. Bifet, R. Kirkby, G. Holmes, R. Gavalda, and B. Pfahringer, "New Ensemble Methods for Evolving Data Streams," in Proc. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York, 2009.
[3] P. H. dos Santos Teixeira and R. L. Milidiu, "Data Stream Anomaly Detection through Principal Subspace Tracking," in Proc. 2010 ACM Symposium on Applied Computing, New York, 2010.
[4] I. Katakis, G. Tsoumakas, and I. Vlahavas, "Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams," in ECML/PKDD 2006 International Workshop on Knowledge Discovery from Data Streams, Berlin, 2006.
[5] M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, 2011.
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer-Verlag, 2009.
[7] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[8] R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Machine Learning, 1999.
[9] I. Palit and C. K. Reddy, "Scalable and Parallel Boosting with MapReduce," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, 2012.
[10] G. Escudero, L. Marquez, and G. Rigau, "Boosting Applied to Word Sense Disambiguation," in Proc. European Conference on Machine Learning (ECML), 2000.
[11] R. Busa-Fekete and B. Kegl, "Bandit-Aided Boosting," in Proc. Second NIPS Workshop on Optimization for Machine Learning.
[12] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[13] A. Reiss and D. Stricker, "Introducing a New Benchmarked Dataset for Activity Monitoring," in Proc. 16th IEEE International Symposium on Wearable Computers (ISWC), Newcastle, UK, 2012.
[14] T. Al-Khateeb, M. M. Masud, L. Khan, and B. M. Thuraisingham, "Cloud Guided Stream Classification Using Class-Based Ensemble," in Proc. IEEE CLOUD, 2012.
[15] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. Sixth Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, CA, 2004.
[16] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers."


More information

An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network

An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network International Journal of Science and Engineering Investigations vol. 6, issue 62, March 2017 ISSN: 2251-8843 An Ensemble Data Mining Approach for Intrusion Detection in a Computer Network Abisola Ayomide

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Cluster based boosting for high dimensional data

Cluster based boosting for high dimensional data Cluster based boosting for high dimensional data Rutuja Shirbhate, Dr. S. D. Babar Abstract -Data Dimensionality is crucial for learning and prediction systems. Term Curse of High Dimensionality means

More information

New ensemble methods for evolving data streams

New ensemble methods for evolving data streams New ensemble methods for evolving data streams A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà Laboratory for Relational Algorithmics, Complexity and Learning LARCA UPC-Barcelona Tech, Catalonia

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY , pp-01-05 FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY Ravin Ahuja 1, Anindya Lahiri 2, Nitesh Jain 3, Aditya Gabrani 4 1 Corresponding Author PhD scholar with the Department of Computer Engineering,

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Naïve Bayes for text classification

Naïve Bayes for text classification Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory

More information

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

Available online at  ScienceDirect. Procedia Computer Science 89 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 (2016 ) 341 348 Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) Parallel Approach

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures José Ramón Pasillas-Díaz, Sylvie Ratté Presenter: Christoforos Leventis 1 Basic concepts Outlier

More information

ORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"

ORT EP R RCH A ESE R P A IDI!  #$$% &' (# $! R E S E A R C H R E P O R T IDIAP A Parallel Mixture of SVMs for Very Large Scale Problems Ronan Collobert a b Yoshua Bengio b IDIAP RR 01-12 April 26, 2002 Samy Bengio a published in Neural Computation,

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES

More information

Parallel learning of content recommendations using map- reduce

Parallel learning of content recommendations using map- reduce Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking

More information

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee

Ms. Ritu Dr. Bhawna Suri Dr. P. S. Kulkarni (Assistant Prof.) (Associate Prof. ) (Assistant Prof.) BPIT, Delhi BPIT, Delhi COER, Roorkee Journal Homepage: NOVEL FRAMEWORK FOR DATA STREAMS CLASSIFICATION APPROACH BY DETECTING RECURRING FEATURE CHANGE IN FEATURE EVOLUTION AND FEATURE S CONTRIBUTION IN CONCEPT DRIFT Ms. Ritu Dr. Bhawna Suri

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS Vandita Jain 1, Prof. Tripti Saxena 2, Dr. Vineet Richhariya 3 1 M.Tech(CSE)*,LNCT, Bhopal(M.P.)(India) 2 Prof. Dept. of CSE, LNCT, Bhopal(M.P.)(India)

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Efficient Algorithm for Frequent Itemset Generation in Big Data

Efficient Algorithm for Frequent Itemset Generation in Big Data Efficient Algorithm for Frequent Itemset Generation in Big Data Anbumalar Smilin V, Siddique Ibrahim S.P, Dr.M.Sivabalakrishnan P.G. Student, Department of Computer Science and Engineering, Kumaraguru

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

SQL Query Optimization on Cross Nodes for Distributed System

SQL Query Optimization on Cross Nodes for Distributed System 2016 International Conference on Power, Energy Engineering and Management (PEEM 2016) ISBN: 978-1-60595-324-3 SQL Query Optimization on Cross Nodes for Distributed System Feng ZHAO 1, Qiao SUN 1, Yan-bin

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI 2017 International Conference on Electronic, Control, Automation and Mechanical Engineering (ECAME 2017) ISBN: 978-1-60595-523-0 The Establishment of Large Data Mining Platform Based on Cloud Computing

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Fast and Effective System for Name Entity Recognition on Big Data

Fast and Effective System for Name Entity Recognition on Big Data International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-3, Issue-2 E-ISSN: 2347-2693 Fast and Effective System for Name Entity Recognition on Big Data Jigyasa Nigam

More information

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark PL.Marichamy 1, M.Phil Research Scholar, Department of Computer Application, Alagappa University, Karaikudi,

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

Subject-Oriented Image Classification based on Face Detection and Recognition

Subject-Oriented Image Classification based on Face Detection and Recognition 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information