Incremental Classification of Nonstationary Data Streams

Size: px
Start display at page:

Download "Incremental Classification of Nonstationary Data Streams"

Transcription

1 Incremental Classification of Nonstationary Data Streams Lior Cohen, Gil Avrahami, Mark Last Ben-Gurion University of the Negev Department of Information Systems Engineering Beer-Sheva 84105, Israel bgu.ac.il Abraham Kandel Dept. of Computer Science and Engineering, University of South- Florida, Tampa, Florida 33620, USA Oscar Kipersztok Boeing Phantom Works Mathematics & Computing Technology, P.O. Box 3707 MC 7L- 44, Seattle, WA USA Abstract. Traditional data mining algorithms used in the process of knowledge discovery are based on the assumption that the training data is a random sample drawn from a stationary distribution. In contrast, when dealing with non-stationary data sets, most of the processes generating time-stamped data may change drastically over time. Therefore, the results of a data-mining algorithm, which was run in the past, will be hardly relevant to the future data. In this paper, we present two new incremental approaches to real-time classification of continuous non-stationary data. The proposed approaches apply the data-mining algorithm repeatedly to a sliding window of examples in order to update the existing model, replace it by a former model, or construct a new model if a major concept drift is detected. The algorithms are evaluated on multi-year real-world streams of traffic data vs. the CVFDT online learning algorithm. Keywords Real-time data mining, data streams, incremental learning, online learning, concept drift, information networks. 1. Introduction In recent years, there is an ongoing demand for systems, which will be capable to process massive and continuous streams of information. The use of such systems can be in the field of temperature monitoring, precision agriculture, urban traffic control, classification of stock data, etc. The most important characteristic of such system is to deal with noise, uncertainty, and asynchrony of the real-world data [2]. The complex nature of real world data has increased the difficulties and challenges of data mining in terms of data processing, data storage, and model storage requirements [7]. One of the main difficulties in mining non-stationary continuous data streams is to cope with the changing data concept. The fundamental processes generating most real-time data streams may change over years, months and even seconds, at times drastically. This change, also known as concept drift [3], causes the data-mining model generated from past data, to become less accurate in the classification of new data. Algorithms and methods, which extract patterns from continuous data streams, are known as online real time learning [10]. Real-time data mining of high-speed and non-stationary data streams has large potential in fields such as efficient operation of machinery and vehicles, exploring ecosystem dynamics, and intrusion detection in computer networks. This paper continues our previous work on real-time data mining [8], [9], where we have presented the Basic Incremental On Line Information Network (IOLIN) algorithm, leaving the issue of a better trade-off between accuracy and run time open for the future research. Here we suggest two alternative algorithms for online incremental learning of non-stationary data streams: the Multiple-Model IOLIN and the Advanced IOLIN, both aimed at maximizing the average classification rate (in records per second) while preserving the accuracy rate of the regenerative approach. The two proposed algorithms are evaluated in comparison to the On-Line Information Network (OLIN) algorithm presented by Last in [11] and the more recent Incremental On Line Information Network (IOLIN) algorithm [8], [9]. All these incremental methods are based on the batch information-theoretic algorithm, named Information Network (IN), which

2 was introduced by Last & Maimon [12], [13]. The general idea of the new incremental algorithms is to keep updating an existing model as long as no major concept drift is detected in the arriving data. While most of the existing online algorithms (including OLIN [11]), which deal with the problem of concept drift, build a new model from every new window of training examples, the proposed incremental techniques save the expensive CPU time by performing minimal changes in the structure of the current classification model. We also compare the performance of our incremental methods to the CVFDT algorithm [4], which is designed at mining non-stationary data streams as well. 2. Related work In the case of non-stationary data streams, the optimal situation is to have KDD systems that operate continuously, constantly processing the data received so that potentially valuable information is never lost. In order to achieve this goal, several methods for extracting patterns from non-stationary streams of data have been developed, all under the general title of online (incremental) learning methods. The pure incremental learning methods take into account every new instance that arrives. These algorithms may be irrelevant when dealing with high-speed data streams. Widmer & Kubat [5] have described a series of purely incremental learning algorithms that flexibly react to concept drift and can take advantage of situations where context repeats itself. Their series of algorithms is based on a framework called FLORA. FLORA maintains a dynamically adjustable window during the learning process and whenever a concept drift seems to occur (a drop in predictive accuracy) the window shrinks (forgets old instances), and when the concept seems to be stable the window is kept fixed. Otherwise, the window keeps growing until the concept seems to be stable. FLORA is a computationally expensive methodology, since it updates the classification model with every example added to or removed from the training window. Domingos & Hulten [14] have proposed the VFDT (Very Fast Decision Trees learner) system in order to overcome the longer training time issue of the pure incremental algorithms. The VFDT system is based on a decision tree learning method, which builds the trees based on sub-sampling of a stationary data stream. To deal with non-stationary data streams, Hulten et al. [4] have proposed an improvement to the VFDT algorithm which is the CVFDT (Conceptadapting Very Fast Decision Tree learner). CVFDT applies the VFDT algorithm to a sliding window of a fixed size and builds the model in an incremental manner instead of building it from scratch whenever a new set of examples arrives. CVFDT increases the computational overload vs. VFDT by growing alternate sub-trees at its internal nodes. The model is modified when the alternate sub-tree becomes more accurate than the original one. In [1] the authors claim that methods such as [4], [14] which use the one-pass approach on the data in order create mining models, are limited. The problem with the one-pass approach is that it disables the recognition of changes in the mining model during the data stream process. Aggarwal et al. [1], suggested a framework for the classification of non stationary data streams. The methodology proposed in their paper uses the notion of micro-clusters that are created only from the training data. Each micro-cluster is associated with a set of points, all of which belong to the same class. The maintenance procedure of the micro clusters derives from the nearest neighbor and k-means algorithms and if needed, operations of merging multiple clusters into one cluster are made. In the classification process, the test record is classified to a certain micro-cluster according to its similarity to the cluster s centroid. Last in [11] describes an online classification system that uses an information network (IN) as a classification model. The system called OLIN (On Line Information Network) gets a continuous stream of non-stationary data and builds a network based on a sliding window of the latest examples. OLIN detects a concept drift (an unexpected rise in the classification error rate) and dynamically adjusts the size of the training window and accordingly, the frequency of the model reconstruction. The calculations of the window size in OLIN are based on information-theoretic and statistical measures. The experimental results of [11] show that in non-stationary data streams, dynamic windowing generates more accurate models than the static (fixed size) windowing approach used by CVFDT. Cohen et al [8], [9] extended the ideas suggested in [11] in order to enhance the performance of the OLIN system. The new developed approach called IOLIN (Incremental Online Information Network) also deals with a continuous non-stationary data stream, but this time instead of creating a new model for each sliding window of examples, the system updates a current model according to the new data. In the experiments of [8] and [9], the IOLIN version significantly outperformed the OLIN with regarding to the run time at the expense of a minor drop in the accuracy rate. 3. The Info Fuzzy Algorithm Many learning methods use the information theory to induce classification models. One of the methods, developed by Last & Maimon [12] [13] is the Info-Fuzzy Network algorithm (also known as Information Network-IN). IN, is an 2

3 oblivious tree-like classification model, which is designed to minimize the total number of predicting attributes. The underlying principle of the IN-based methodology is to construct a multi-layered network in order to test the Mutual Information (MI) between the input and target attributes. Each hidden layer is related to a specific input attribute and represents the interaction between this input attribute and those associated with previous layers. The IN algorithm is using a pre-pruning strategy: a node is split if this procedure brings about a statistically significant decrease in the conditional entropy of the target attribute (equal to an increase in the mutual information). If none of the remaining input attributes provides a statistically significant increase in the mutual information, the network construction stops. The output of this algorithm is a network, which can be used as a decision tree to predict the values of the target attribute. Fig. 1 illustrates a sample structure of an information network Layer No. 0 (the root node) Layer No. 1 (First input attribute) Fig. 1. Info-fuzzy Network Two Layered Structure [13] 1 1,1 1,2 3,1 3,2 Layer No. 2 (Second input attribute) Connection Weights Target Layer 4. Incremental On-Line Information Network The OLIN online classification algorithm [11] deals with the potential concept drift in a non-stationary data stream simply by generating a new model for every new sliding window. On one hand, this regenerative approach ensures accurate and relevant models over time and therefore an increase in the classification accuracy. On the other hand, OLIN's major shortcoming is the high computational cost of generating new models. In this section, we present three approaches for incremental learning, which are aimed to overcome the high computational cost of generating new models. As shown in the evaluation section below, the incremental methods achieve almost the same and sometimes even higher accuracy rates than the regenerative algorithm and are significantly cheaper, since there is no need in producing a new model for every new window of examples Basic Incremental Approach We start this section with an overview of the Basic IOLIN algorithm introduced by us in [8]. In general, the basic incremental method updates a current model as long as the concept is stable. If a concept drift has been detected, the algorithm creates a totally new network. The operations for updating the network include checking split validity, examining the replacement of the last layer, and adding new layers as necessary. Each of the update operations is aimed at reducing the error rate of the already existing model, when using it for classifying new instances. The first procedure, check splits validity, suggests eliminating nodes, which are not relevant for the model according to the current training window. The split validity checks start from the root and go downwards the network. The general idea is to ensure that the current split of a specific node actually contributes to the mutual information calculated from the current training set. Elimination of non-relevant splits should decrease the error rate. From experiments conducted on the regenerative OLIN [11], we discovered that when there is no major concept drift between two adjacent windows, the main differences between the newly constructed model and the current one are in the last hidden layer of the information network. For this reason, the second procedure check the replacement of the last layer is made in order to determine which attribute is more appropriate to correspond to the last (final) hidden layer. The alternatives considered are: the attribute that is already associated with the last layer or the second best attribute for the last layer of the previous model. The attribute selected for the last layer of the new model will be the one with the highest conditional mutual information based on the current training window. The last procedure add new layers as necessary attempts to split the nodes of the last layer on attributes, which are not yet participating in the updated network. The basic intuition behind the incremental approach is to update the current classification model with the current training window concept as long as no major concept drift has been detected by a statistically significant drop in predictive accuracy and to build a new model in case of a major concept drift. Actually, the regenerative approach can be viewed as a special case of the incremental approach. If a major concept drift is detected at each arrival of new data, the incremental approach is identical to the regenerative one. So the incremental approach becomes useful only when the concept stays stable between at least one pair of adjacent windows. This condition can be expected to exist in most 3

4 data streams, especially those where the size of the training window is relatively small. Following is the pseudo-code outline of the Basic Incremental OLIN (Basic IOLIN) algorithm: Basic IOLIN Input: Training window, current network model Output: Updated or re-generated network model For each new training window If concept drift is detected Create new network using the Info-Fuzzy algorithm Else Update existing network as follows: 1. Eliminate non-relevant nodes. 2. Replace the last layer with a better one. 3. Add new layers if possible. Determining the size of the training window is extensively discussed in [11]. In general, the size of the window is expanded if the concept appears to be stable and reduced if a concept drift has been detected. The detection of a concept drift is made whenever there is a sudden fall in the accuracy rate. The assumption is that if the concept is stable, the examples in the training window and in the subsequent validation interval conform to the same distribution and there is no statistically significant difference between the model s training and the validation error rates. Another assumption is that the true classification of the incoming records becomes available after a short while. This true classification can then be used to start the model update or the construction process for creating a new predictive model. If however, the true classification is not available from the monitored system, we can keep the current network. Another alternative is to use unsupervised learning methods in order to detect changes in the distribution of unlabelled data and update the current network accordingly. Zeira et al. [6] present a methodology for change detection and segmentation based on a set of statistical estimators. This methodology detects statistically significant changes in nonstationary continuous data streams. The time complexity of the suggested basic incremental algorithm depends on the number of detected concept drifts. As long as no concept drift has been detected, the major computational cost is to add a new layer to the existing network. Checking the split validity of the nodes in the network is relatively cheap (O(n) where n is the number of nodes in the network). In case of concept drift, a new network should be constructed from scratch, and the cost of the network construction procedure is linear in the number of records in a training window, linear in the number of distinct attribute values, and quadratic in the number of candidate input attributes [12] Multiple-Model IOLIN The first new incremental approach introduced in this paper also updates a current model as long as the concept is stable. However, if a concept drift has been detected for the first time, the algorithm searches for the best model representing the current data from all the past networks. The idea is to utilize potential periodicity in the data by reusing previously created models rather than generating a new model with every concept drift as suggested by the Basic IOLIN approach. The search is made as follows: when a concept drift is being detected, the algorithm calculates the target attribute s entropy and compares it to the target s entropy values of each previous training window, where a completely new network was constructed. The previous network to be chosen is the one, which was built from the training window having the closest target entropy to the current target s entropy. The chosen network is used for the classification of the examples arriving in the next validation session. In the case of re-occurrence of a concept drift, a new network will be created like under the Basic IOLIN approach. Following is the pseudo-code outline of the Multiple-Model IOLIN algorithm: Multiple Model IOLIN Input: Training window T, current network model, past network models Output: Updated / Former / New network model For each new training window If concept drift is detected Search for the best network from the past by: 1. Comparing the value of the target s entropy (based on the current window s data) to entropy values in all windows where a totally new network was built. 2. Choose Network with Min E current (T)- E former (T). 4

5 3. Add new layers if possible. If a concept drift is detected again with the chosen network Create totally new network using the Info-Fuzzy algorithm Else update existing network (see below) Else Update existing network as follows: 1. Eliminate non-relevant nodes. 2. Replace the last layer with a better one. 3. Add new layers if possible. In addition to the use of former models in the presence of a concept drift, we also checked another variation for the multi model approach, which included the use of former models even in the case when the concept is stable. This means that when the concept is stable, we will use a former model instead of updating the current model. With this approach, we expect to save additional run time because an appropriate model will probably be found in the adjacent former windows. This approach will be referred to as the pure multiple model approach Advanced IOLIN In the second proposed approach, we enhance the update operations of the incremental algorithms by maintaining the information-theoretic quality of the current model disregarding its predictive accuracy on the new data. We recalculate the conditional mutual information of each layer using the top-down approach: we start with the first layer, and replace the input attributes in that layer and all subsequent layers in case of a significant decrease in the layer s conditional mutual information with the target attribute. A reduction in the mutual information of the first layer will trigger a complete re-construction of the model. The current model will be retained only if there is no significant decrease in mutual information at any layer. In that case, we will try to add new layers representing attributes, which are not yet participating in the network. This approach does not relate directly to the issue of the concept drift detection. The initial network is constantly updated and the presence of a concept drift can be concluded if all the network layers have been replaced. For each session, the conditional mutual information (MI) value of each layer is saved. In the model s update process, a comparison is made between the former MI value of the i th layer and the current MI value. If the current value is nearly as high as the former one (up to 5% difference), the i th layer is kept as is. If the current MI value is significantly lower, the i th layer is replaced with a new one. Following is the pseudo-code outline of the advanced IOLIN algorithm: Advanced IOLIN Input: Training window, current network model, conditional mutual information of each network layer i (Former_Conditional_MI(i)) Output: Updated network model For each new training window For each Layer i in existing network Calculate Cond_MI(i) If Cond_MI(i) Former_Conditional_MI(i)*95% Keep i th layer as is and move to the next layer Else Continue the network construction by adding new layers Save value: Former_Conditional_MI(i) = Conditional_MI(i) If reached the last layer Try adding new layers to the network The time complexity of the advanced approach depends on the number of layers that have been replaced. In the case of a layer replacement, the algorithm searches for the best input attribute from the input candidates. This search cost O(n) where n is the number of candidate input attributes. In the replacement of all layers, which is actually the construction of a totally new network, the cost will be O(n*l) where l is the number of the network layers. 5

6 5. Evaluation The two proposed incremental algorithms were applied to a real-world data stream from the traffic control domain. We have also applied the Basic IOLIN [8] [9], the pure multi-model approach (see above), and the CVFDT algorithm [15] to the same data. The experiments objective was to simulate real-time prediction of the traffic volume at all lanes of a signaled urban junction based on several input attributes such as day hour, weekday, previous volume on the same weekday last week, etc. We have checked two aspects: first, we have compared the predictive accuracy of the algorithms and secondly we have compared their processing rate (in records per second). From the simulation results and the classification speed (as presented in the results sub-section), we can safely assume that the incremental algorithms are capable to simultaneously process data from a large number of sources. Furthermore, the predicted traffic volumes at each direction can be used to improve the traffic plan for the monitored intersection and consequently reduce the average waiting time of drivers at the junction Traffic Data Streams The data acquisition and preprocessing of this dataset are extensively discussed in [8], [9] so here we describe it only in brief. The data streams we used include traffic flow information from under-road sensors at a signaled three-way junction where vehicles can cross the intersection in five different directions. The resulting five data streams related to these directions have included the hourly incoming traffic volumes for 24 hours a day, seven days a week during a period of more than 3 years. Each original data stream has been converted into a relational data set, where a record contains twelve candidate attributes representing the exact time (date, hour, day in week, etc.) when the traffic volume was measured and traffic volumes at earlier points of time (the previous hour, the same hour of the previous day, etc.). The target attribute represents the volume of traffic during a given hour. Due to the fact that the target attribute is continuous we have manually discretized it to four equal-frequency intervals of traffic volume. As mentioned, the traffic data was partitioned into five separate data tables for the five directions. Each table corresponding to a given direction contained about 30,000 hourly records Results The runs of the algorithms were carried out on a Pentium IV processor with 256 MB of RAM. In the experiments, the online learning of the traffic data starts after inducing the initial model from the first 500 records, which leaves the system to work with about 30,000 records for each direction in every year. Figures 2, 3 and 4 present the results of the error rates (training and testing) and the processing time after applying the IOLIN-based algorithms and the CVFDT algorithm to the traffic data sets. Figure 2 shows that the IOLIN-based algorithms outperform the training accuracy rate of the CVFDT algorithm. As expected, the Advanced IOLIN algorithm gets the most accurate results due to the network update process, which updates each layer in the network if necessary. CVFDT Basic Incremental Multi Modal Pure Multi Modal Error Rate Advanced Incremental D5 D4 D3 D2 D1 Data Source Fig. 2. Training Error Rate 6

7 Figure 3 shows the results on the testing data. Again, in most cases the IOLIN-based algorithms outperform the accuracy rate of the CVFDT algorithm with Advanced IOLIN being the most accurate algorithm on all five data streams. CVFDT Basic Incremental Multi Modal Pure Multi Modal Advanced Incremental D5 D4 D3 Data Source D2 D Error Rate Fig. 3. Testing Error Rate Figure 4 presents the run time results. The CVFDT algorithm consumes the least run time for processing the incoming examples. We assume that the reason for that is the one-pass on the data, which significantly reduces the run time. In all IOLIN-based algorithms, the training windows are overlapping so each example may be used several times. Figure 4 also shows that the fastest IOLIN-based algorithm is the multiple-models IOLIN. One can also notice that the Advanced IOLIN requires much longer run time than the other incremental approaches. The reason for that is that the update process checks if changes should be made in each layer. On one hand, this process makes the model more accurate with respect to the training window but on the other hand, this process consumes longer processing time because of the checking process in each layer CVFDT 1200 Basic Incremental Multi Modal Pure Multi Modal Advanced Incremental D5 D4 D3 Data Source D2 D Run Time (Sec) Fig. 4. Run Time 6. Conclusions and Future Work This paper has presented two new incremental approaches to real-time classification of continuous non-stationary data streams. The proposed approaches apply the classification algorithm repeatedly to a sliding window of examples, in order to update the existing model (in the case of the Advanced IOLIN approach) or to replace it by a former model (in the case of the Multiple Model approach). The efficiency of the proposed algorithms has been validated by experiments on real-world data streams. The experimental data in use included a set of five data streams recorded by traffic sensors for more than 3 years. We used the CVFDT algorithm of Domingos & Hulton (2001) for the purpose of performance comparison. From the experimental results, we can conclude that in this dataset, the CVFDT algorithm is preferable in terms of the run time 7

8 but its accuracy rates are significantly lower. In order to improve the run time we suggest as future work to consider the one pass approach on the data when the concept appears to be stable. We assume that this will significantly reduce the processing time. In addition, the future work includes evaluating incremental classifiers on additional real-world data streams from other domains. Acknowledgements. We would like to thank the Traffic Control Center of Jerusalem for granting us the permission to use their traffic database. This work was partially supported under a research contract from the Israel Ministry of Defense and by the National Institute for Systems Test and Productivity at University of South Florida under the USA Space and Naval Warfare Systems Command Grant No. N References [1] C. Aggarwal, J. Han, J. Wang, and P. S. Yu, On Demand Classification of Data Streams, Proc Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, Aug [2] D. E. Culler and W. Hong, Wireless sensor networks: Introduction, Communications of the ACM, Vol 47, No. 6, pp , [3] D.P Helmbold and P.M. Long, Tracking Drifting Concepts by Minimizing Disagreements, Machine Learning, No. 14, pp , [4] G. Hulten, L, Spencer, and P. Domingos, Mining Time-Changing Data Streams, Proc. of KDD 2001, pp , ACM Press, [5] G. Widmer and M. Kubat, Learning in the Presence of Concept Drift and Hidden Contexts, Machine Learning, Vol. 23, No. 1, pp , [6] G. Zeira, O. Maimon, M. Last, L. Rokach, Change Detection in Classification Models Induced from Time Series Data, in Data Mining in Time Series Databases, M. Last, A. Kandel, and H. Bunke (Editors), World Scientific, Series in Machine Perception and Artificial Intelligence, Vol. 57, pp , [7] H. Kargupta, Distributed Data Mining for Sensor Networks, Tutorial, ECML / PKDD [8] L. Cohen, M. Last, G. Avrahami, Incremental Info-Fuzzy Algorithm for Real Time Data Mining of Non-Stationary Data Streams, TDM Workshop, Brighton UK, [9] L. Cohen, G. Avrahami, M. Last, A. Kandel, and O. Kipersztok Real-Time Data Mining of Non-Stationary Data Streams from Sensor Networks, to appear in Information Fusion Journal, Special Issue on Information Fusion in Distributed Sensor Networks.. [10] M. Black and R. J. Hickey, Maintaining the Performance of a Learned Classifier under Concept Drift, Intelligent Data Analysis, No. 3, pp , [11] M. Last, Online Classification of Nonstationary Data Streams, Intelligent Data Analysis, Vol. 6, No. 2, pp , [12] M. Last and O. Maimon, A Compact and Accurate Model for Classification, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 2, pp , February [13] O. Maimon and M. Last, Knowledge Discovery and Data Mining - The Info-Fuzzy Network (IFN) Methodology, Kluwer Academic Publishers, December [14] P. Domingos and G. Hulten, Mining High-Speed Data Streams, Proc. of KDD 2000, pp , [15] G. Hulten, and P. Domingos, "VFML -- A toolkit for mining high-speed time-changing data streams"

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Nesnelerin İnternetinde Veri Analizi

Nesnelerin İnternetinde Veri Analizi Nesnelerin İnternetinde Veri Analizi Bölüm 3. Classification in Data Streams w3.gazi.edu.tr/~suatozdemir Supervised vs. Unsupervised Learning (1) Supervised learning (classification) Supervision: The training

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Lecture 7. Data Stream Mining. Building decision trees

Lecture 7. Data Stream Mining. Building decision trees 1 / 26 Lecture 7. Data Stream Mining. Building decision trees Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 26 1 Data Stream Mining 2 Decision Tree Learning Data Stream Mining 3

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents

An Apriori-like algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents An Apriori-lie algorithm for Extracting Fuzzy Association Rules between Keyphrases in Text Documents Guy Danon Department of Information Systems Engineering Ben-Gurion University of the Negev Beer-Sheva

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

Incremental Learning Algorithm for Dynamic Data Streams

Incremental Learning Algorithm for Dynamic Data Streams 338 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.9, September 2008 Incremental Learning Algorithm for Dynamic Data Streams Venu Madhav Kuthadi, Professor,Vardhaman College

More information

Extending the Growing Neural Gas Classifier for Context Recognition

Extending the Growing Neural Gas Classifier for Context Recognition Extending the Classifier for Context Recognition Eurocast 2007, Workshop on Heuristic Problem Solving, Paper #9.24 14. February 2007, 10:30 Rene Mayrhofer Lancaster University, UK Johannes Kepler University

More information

A Composite Graph Model for Web Document and the MCS Technique

A Composite Graph Model for Web Document and the MCS Technique A Composite Graph Model for Web Document and the MCS Technique Kaushik K. Phukon Department of Computer Science, Gauhati University, Guwahati-14,Assam, India kaushikphukon@gmail.com Abstract It has been

More information

Incremental Rule Learning based on Example Nearness from Numerical Data Streams

Incremental Rule Learning based on Example Nearness from Numerical Data Streams 2005 ACM Symposium on Applied Computing Incremental Rule Learning based on Example Nearness from Numerical Data Streams Francisco Ferrer Troyano ferrer@lsi.us.es Jesus S. Aguilar Ruiz aguilar@lsi.us.es

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Maintenance of the Prelarge Trees for Record Deletion

Maintenance of the Prelarge Trees for Record Deletion 12th WSEAS Int. Conf. on APPLIED MATHEMATICS, Cairo, Egypt, December 29-31, 2007 105 Maintenance of the Prelarge Trees for Record Deletion Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu Department of

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

Anomaly Detection on Data Streams with High Dimensional Data Environment

Anomaly Detection on Data Streams with High Dimensional Data Environment Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant

More information

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK

More information

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification Data Mining 3.3 Fall 2008 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rules With Exceptions Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

CLASSIFICATION OF WEB LOG DATA TO IDENTIFY INTERESTED USERS USING DECISION TREES

CLASSIFICATION OF WEB LOG DATA TO IDENTIFY INTERESTED USERS USING DECISION TREES CLASSIFICATION OF WEB LOG DATA TO IDENTIFY INTERESTED USERS USING DECISION TREES K. R. Suneetha, R. Krishnamoorthi Bharathidasan Institute of Technology, Anna University krs_mangalore@hotmail.com rkrish_26@hotmail.com

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Efficient Mining Algorithms for Large-scale Graphs

Efficient Mining Algorithms for Large-scale Graphs Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed

More information

Feature Based Data Stream Classification (FBDC) and Novel Class Detection

Feature Based Data Stream Classification (FBDC) and Novel Class Detection RESEARCH ARTICLE OPEN ACCESS Feature Based Data Stream Classification (FBDC) and Novel Class Detection Sminu N.R, Jemimah Simon 1 Currently pursuing M.E (Software Engineering) in Vins christian college

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Incremental K-means Clustering Algorithms: A Review

Incremental K-means Clustering Algorithms: A Review Incremental K-means Clustering Algorithms: A Review Amit Yadav Department of Computer Science Engineering Prof. Gambhir Singh H.R.Institute of Engineering and Technology, Ghaziabad Abstract: Clustering

More information

On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

More information

Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Data mining techniques for data streams mining

Data mining techniques for data streams mining REVIEW OF COMPUTER ENGINEERING STUDIES ISSN: 2369-0755 (Print), 2369-0763 (Online) Vol. 4, No. 1, March, 2017, pp. 31-35 DOI: 10.18280/rces.040106 Licensed under CC BY-NC 4.0 A publication of IIETA http://www.iieta.org/journals/rces

More information

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18,   ISSN International Journal of Computer Engineering and Applications, Volume XII, Special Issue, March 18, www.ijcea.com ISSN 2321-3469 COMBINING GENETIC ALGORITHM WITH OTHER MACHINE LEARNING ALGORITHM FOR CHARACTER

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden

Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden Data Stream Mining Tore Risch Dept. of information technology Uppsala University Sweden 2016-02-25 Enormous data growth Read landmark article in Economist 2010-02-27: http://www.economist.com/node/15557443/

More information

A semi-incremental recognition method for on-line handwritten Japanese text

A semi-incremental recognition method for on-line handwritten Japanese text 2013 12th International Conference on Document Analysis and Recognition A semi-incremental recognition method for on-line handwritten Japanese text Cuong Tuan Nguyen, Bilan Zhu and Masaki Nakagawa Department

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Clustering Large Dynamic Datasets Using Exemplar Points

Clustering Large Dynamic Datasets Using Exemplar Points Clustering Large Dynamic Datasets Using Exemplar Points William Sia, Mihai M. Lazarescu Department of Computer Science, Curtin University, GPO Box U1987, Perth 61, W.A. Email: {siaw, lazaresc}@cs.curtin.edu.au

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection Volume 3, Issue 9, September-2016, pp. 514-520 ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Classification of Concept Drifting

More information

Context-based Navigational Support in Hypermedia

Context-based Navigational Support in Hypermedia Context-based Navigational Support in Hypermedia Sebastian Stober and Andreas Nürnberger Institut für Wissens- und Sprachverarbeitung, Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REAL TIME DATA SEARCH OPTIMIZATION: AN OVERVIEW MS. DEEPASHRI S. KHAWASE 1, PROF.

More information

Detection of Anomalies using Online Oversampling PCA

Detection of Anomalies using Online Oversampling PCA Detection of Anomalies using Online Oversampling PCA Miss Supriya A. Bagane, Prof. Sonali Patil Abstract Anomaly detection is the process of identifying unexpected behavior and it is an important research

More information

Role of big data in classification and novel class detection in data streams

Role of big data in classification and novel class detection in data streams DOI 10.1186/s40537-016-0040-9 METHODOLOGY Open Access Role of big data in classification and novel class detection in data streams M. B. Chandak * *Correspondence: hodcs@rknec.edu; chandakmb@gmail.com

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Streaming Data Classification with the K-associated Graph

Streaming Data Classification with the K-associated Graph X Congresso Brasileiro de Inteligência Computacional (CBIC 2), 8 a de Novembro de 2, Fortaleza, Ceará Streaming Data Classification with the K-associated Graph João R. Bertini Jr.; Alneu Lopes and Liang

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

City, University of London Institutional Repository

City, University of London Institutional Repository City Research Online City, University of London Institutional Repository Citation: Andrienko, N., Andrienko, G., Fuchs, G., Rinzivillo, S. & Betz, H-D. (2015). Real Time Detection and Tracking of Spatial

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray

More information

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

A Naïve Soft Computing based Approach for Gene Expression Data Analysis Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Agglomerative clustering on vertically partitioned data

Agglomerative clustering on vertically partitioned data Agglomerative clustering on vertically partitioned data R.Senkamalavalli Research Scholar, Department of Computer Science and Engg., SCSVMV University, Enathur, Kanchipuram 631 561 sengu_cool@yahoo.com

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

DS-Means: Distributed Data Stream Clustering

DS-Means: Distributed Data Stream Clustering DS-Means: Distributed Data Stream Clustering Alessio Guerrieri and Alberto Montresor University of Trento, Italy Abstract. This paper proposes DS-means, a novel algorithm for clustering distributed data

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

More Efficient Classification of Web Content Using Graph Sampling

More Efficient Classification of Web Content Using Graph Sampling More Efficient Classification of Web Content Using Graph Sampling Chris Bennett Department of Computer Science University of Georgia Athens, Georgia, USA 30602 bennett@cs.uga.edu Abstract In mining information

More information

A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage

A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage Sebastian Leuoth, Wolfgang Benn Department of Computer Science Chemnitz University of Technology 09107 Chemnitz, Germany

More information

A Prototype-driven Framework for Change Detection in Data Stream Classification

A Prototype-driven Framework for Change Detection in Data Stream Classification Prototype-driven Framework for hange Detection in Data Stream lassification Hamed Valizadegan omputer Science and Engineering Michigan State University valizade@cse.msu.edu bstract- This paper presents

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. DATA STREAMS MINING Mining Data Streams From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. Hammad Haleem Xavier Plantaz APPLICATIONS Sensors

More information

On Reduct Construction Algorithms

On Reduct Construction Algorithms 1 On Reduct Construction Algorithms Yiyu Yao 1, Yan Zhao 1 and Jue Wang 2 1 Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {yyao, yanzhao}@cs.uregina.ca 2 Laboratory

More information

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Keywords hierarchic clustering, distance-determination, adaptation of quality threshold algorithm, depth-search, the best first search.

Keywords hierarchic clustering, distance-determination, adaptation of quality threshold algorithm, depth-search, the best first search. Volume 4, Issue 3, March 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Distance-based

More information

DOI:: /ijarcsse/V7I1/0111

DOI:: /ijarcsse/V7I1/0111 Volume 7, Issue 1, January 2017 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey on

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online):

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): 2321-0613 A Study on Handling Missing Values and Noisy Data using WEKA Tool R. Vinodhini 1 A. Rajalakshmi

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

Classification with Diffuse or Incomplete Information

Classification with Diffuse or Incomplete Information Classification with Diffuse or Incomplete Information AMAURY CABALLERO, KANG YEN Florida International University Abstract. In many different fields like finance, business, pattern recognition, communication

More information

K-means based data stream clustering algorithm extended with no. of cluster estimation method

K-means based data stream clustering algorithm extended with no. of cluster estimation method K-means based data stream clustering algorithm extended with no. of cluster estimation method Makadia Dipti 1, Prof. Tejal Patel 2 1 Information and Technology Department, G.H.Patel Engineering College,

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Noval Stream Data Mining Framework under the Background of Big Data

Noval Stream Data Mining Framework under the Background of Big Data BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 5 Special Issue on Application of Advanced Computing and Simulation in Information Systems Sofia 2016 Print ISSN: 1311-9702;

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information