Classification of Data Streams with Skewed Distribution


Classification of Data Streams with Skewed Distribution

Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering

by

Abhijeet B. Godase
Roll No:

under the guidance of

Prof. V. Z. Attar

Department of Computer Engineering and Information Technology
College of Engineering, Pune
June 2012

Dedicated to my mother Smt. Aruna B. Godase and my father Shri. Balasaheb J. Godase for their love, endless support and encouragement.

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY, COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation titled Classification of Data Streams with Skewed Distribution has been successfully completed by Abhijeet B. Godase ( ) and is approved for the degree of Master of Technology, Computer Engineering.

Prof. V. Z. Attar, Guide, Department of Computer Engineering and Information Technology, College of Engineering, Pune, Shivaji Nagar, Pune

Dr. Jibi Abraham, Head, Department of Computer Engineering and Information Technology, College of Engineering, Pune, Shivaji Nagar, Pune

Date:

Abstract

The emerging domain of data stream mining is one of the important areas of research for the data mining community. The data streams in various real life applications are characterized by concept drift. Such data streams may also be characterized by skewed or imbalanced class distributions, for example in financial fraud detection and network intrusion detection. In such cases the skewed class distribution of the stream compounds the problems associated with classifying stream instances. Learning from such skewed data streams results in a classifier which is biased towards the majority class. Thus the model or classifier built on such skewed data streams tends to misclassify the minority class examples. In some applications, for instance financial fraud detection, the identification of fraudulent transactions is the main focus, because misclassification of such minority class instances might result in heavy losses, in this case financial. Increasingly higher losses due to misclassification of such minority class instances cannot be ruled out in many other data stream applications as well. The challenge, therefore, is to pro-actively identify such minority class instances in order to avoid the losses associated with them. As an effort in this direction, we propose a method using a k nearest neighbours approach and an oversampling technique to classify such skewed data streams. Oversampling is achieved by making use of minority class examples which are retained from the stream as time progresses. Experimental results show that our approach achieves good classification performance on synthetic as well as real world datasets.

Acknowledgements

I express my sincere gratitude towards my guide Prof. V. Z. Attar for her constant help, encouragement and inspiration throughout the project work. Without her invaluable guidance, this work would never have been successful. She also guided me through the essence of time management, the need for efficient organization, and presentation skills. I would also like to thank the Head of the Department, Dr. Jibi Abraham, and all other faculty members who made my journey of post-graduation and technical learning such an enjoyable experience. I would also like to thank Dr. Albert Bifet, University of Waikato, New Zealand for his help and guidance during the implementation of the project. Last but not least, I would like to thank all my classmates for their support and help throughout the course of the project.

Abhijeet B. Godase
College of Engineering, Pune
May 30, 2012

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 Introduction to Data Mining and Techniques
      Classification
  1.2 An Overview of Data Streams
  1.3 Data Stream Classification
  1.4 An Overview of Skewed Data Sets in the Real World
  1.5 Issues in Learning from Skewed Data Streams
  1.6 Thesis Outline

2 Literature Survey
  2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches
      2.1.1 Oversampling
      2.1.2 Under-sampling
      2.1.3 Cost Sensitive Learning

3 Problem Description
  3.1 Motivation
  3.2 Problem Statement
      How did we reach our problem statement
      Why are Existing Classifiers Weak?
  3.3 Evaluation Metrics

4 Our Approach
  4.1 Approach to Deal with Skewed Data Streams
  4.2 Algorithm for Skewed Data Streams

5 Experimental Results and Discussions
  5.1 Experimental Setup
  5.2 Datasets
      5.2.1 Synthetically Generated Datasets
      5.2.2 Results of Our Approach on Synthetically Generated Datasets
      5.2.3 Real World Datasets
      5.2.4 Results of Our Approach on Real World Datasets
      5.2.5 Effect of Noise Level on the Performance of the Approach

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

A Publications
  A.1 IDSL: Imbalanced Data Stream Learner
  A.2 Classification of Data Streams with Skewed Distribution

List of Tables

5.1 Description of Synthetic Datasets used
5.2 Description of Electricity Pricing Dataset
5.3 Description of Real World Datasets

List of Figures

1.1 Classification as a task of mapping input attribute set x into its class label y
1.2 Classification model in data streams
1.3 Skewed Distributions, each data chunk has fewer positive examples than negative examples
4.1 Our Approach to Classify Data Streams with Skewed Distributions
5.1 Comparison of AUROC on SPH Datasets
5.2 Comparison of G-Mean on SPH Datasets
5.3 Comparison of F-Measure on SPH Datasets
5.4 Comparison of Overall Accuracies on SPH Datasets
5.5 Comparison of AUROC on SEA Datasets
5.6 Comparison of G-Mean on SEA Datasets
5.7 Comparison of F-Measure on SEA Datasets
5.8 Comparison of Overall Accuracies on SEA Datasets
5.9 Comparison of AUROC on Real World Datasets
5.10 Comparison of G-Mean on Real World Datasets
5.11 Comparison of F-Measure on Real World Datasets
5.12 Comparison of Overall Accuracies on Real World Datasets
5.13 Effect on AUROC when noise levels are varied
5.14 Effect on G-Mean when noise levels are varied
5.15 Effect on F-Measure when noise levels are varied
5.16 Effect on Overall Accuracy when noise levels are varied

Chapter 1

Introduction

This chapter introduces the basics of the project area and gives some insight into the problem domain under consideration.

1.1 Introduction to Data Mining and Techniques

In the internet age, the explosion of data and information has resulted in huge amounts of data. Fortunately, data mining techniques exist to gather knowledge from such abundant data. As per the definition by Jiawei Han in his book Data Mining: Concepts and Techniques [18], data mining is the extraction of interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data. Data mining has been used in various areas like health care, business intelligence, financial trade analysis, network intrusion detection etc. The general process of knowledge discovery from data involves data cleaning, data integration, data selection, data mining, pattern evaluation and knowledge presentation. Data cleaning and data integration constitute data preprocessing, in which data is processed so that it becomes appropriate for the data mining process. Data mining forms the core part of the knowledge discovery process. There exist various data mining techniques, viz. classification, clustering, association rule mining etc. Our work falls mainly under the classification technique.

Classification

Classification is one of the important techniques of data mining. It involves using a model, built by learning from historical data, to make predictions about the class labels of new data/observations. Formally, it is the task of learning a target function f that maps each attribute set x to one of a set of predefined class labels y. The classification model learned from historical data is nothing but this target function. It can serve as a tool to distinguish between objects of different classes as well as to predict the class labels of unknown records. Fig 1.1 shows the classification task, which maps an attribute set x to its class label y.

Figure 1.1: Classification as a task of mapping input attribute set x into its class label y

Classification is a pervasive problem that encompasses many diverse applications, right from static datasets to data streams. Classification tasks have been employed on static data over the years. In the last decade, more and more applications featuring data streams have been evolving, which are a challenge to traditional classification algorithms.

1.2 An Overview of Data Streams

Many real world applications, such as network traffic monitoring, credit card transactions, real time surveillance systems, electric power grids, remote sensors, web click streams etc., generate continuously arriving data known as data streams [4],[2]. Unlike traditional data sets, data streams arrive continuously at varying speeds. Data streams are fast changing, temporally ordered, potentially infinite and massive [18]. It may be impossible to store the entire data stream in memory or to go through it more than once due to its voluminous nature. Thus there is a need for single-scan, multidimensional, online stream analysis methods.

In today's world of data explosion, where data is increasing by terabytes and even petabytes, stream data has rightly captured our data mining needs of today. Even if the complete set of data can be collected and stored, it is quite expensive to go through such huge data multiple times.

1.3 Data Stream Classification

Since classification can help decision making by predicting class labels for given data based on past records, classification on stream data has been extensively studied in recent years, with many interesting algorithms developed; some of them are cited here: [4],[21]. Fig 1.2 depicts the classification model in data streams. As shown in Fig 1.2, data chunks C_1, C_2, C_3, ..., C_i arrive one by one.

Figure 1.2: Classification model in data streams

Each chunk contains positive instances P_i and negative instances Q_i. Suppose C_1, C_2, C_3, ..., C_i are labelled. At time stamp m+1, when an unlabelled chunk C_{m+1} arrives, the classification model predicts the labels of instances in C_{m+1} on the basis of the previously labelled data chunks. When experts give the true class labels of the instances in C_{m+1}, the chunk can join the training set, resulting in more and more labelled data chunks.
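To make this predict-then-train cycle concrete, the following is a minimal sketch of chunk-by-chunk processing, assuming the true labels become available after prediction. The Model interface, the double[][] chunk representation and retrain are illustrative placeholders of our own, not MOA classes.

import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sketch of the chunk-by-chunk predict-then-train cycle.
 *  Model, the double[][] chunk representation and retrain(...) are
 *  illustrative placeholders, not MOA classes. */
public class ChunkCycleSketch {
    interface Model { int predict(double[] instance); }

    static void run(Iterable<double[][]> stream, int maxStoredChunks) {
        Deque<double[][]> labelledChunks = new ArrayDeque<>();
        Model model = instance -> 0;   // trivial model before any training
        for (double[][] chunk : stream) {
            // Predict labels for the newly arrived, still unlabelled chunk.
            for (double[] instance : chunk) {
                int predicted = model.predict(instance);
            }
            // Once experts supply true labels, the chunk joins the training
            // set; storage constraints force the oldest chunk out.
            if (labelledChunks.size() == maxStoredChunks) {
                labelledChunks.removeFirst();
            }
            labelledChunks.addLast(chunk);
            model = retrain(labelledChunks);
        }
    }

    static Model retrain(Deque<double[][]> chunks) {
        return instance -> 0;   // stand-in for building a real classifier
    }
}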

Because of storage constraints, it is critical to judiciously select labelled examples that can represent the current distribution well. Most studies on stream mining assume relatively balanced and stable data streams. However, many applications can involve concept-drifting data streams with skewed distributions. In data with skewed distributions, each data chunk has many fewer positive instances. Fig 1.3 depicts this concept diagrammatically. At the same time, the loss functions associated with the positive and negative classes are also unbalanced. Misclassifying positive instances can have serious effects in some applications, such as a stream of financial transactions.

Figure 1.3: Skewed Distributions, each data chunk has fewer positive examples than negative examples

1.4 An Overview of Skewed Data Sets in the Real World

The rate at which science and technology have developed has resulted in a proliferation of data at an exponential pace. This unrestrained increase in data has intensified the need for various applications of data mining. This huge data is in no way necessarily equally distributed. Class skew or class imbalance refers to domains wherein instances of one class outnumber instances of the other class, i.e. some classes occupy the majority of the dataset and are known as majority classes, while the other classes are known as minority classes. The most vital issue in these kinds of data sets is that, compared to the majority classes, the minority classes are often of much more significance and interest to the user.

There are many real world applications in which datasets exhibit such a skewed nature. The following paragraphs give an overview of some real world problems that exhibit it.

Financial Fraud Detection: In financial fraud detection, the majority of financial transactions are genuine and legitimate, and a very small number of them may be fraudulent [5].

Network Intrusion Detection: Here, a small number of malicious activities are hidden among the voluminous routine network traffic [30],[35]. Usually there are thousands of access requests every day. Among all these requests, the number of malicious connections is, in most cases, very small compared to the number of normal connections. Obviously, building a good model that can effectively detect future attacks is crucial so that the system can respond promptly in case of network intrusions.

Medical Fraud Detection: In medical fraud detection, the percentage of bogus claims is small, but the total loss is significant.

Real Time Video Surveillance: Imbalance is seen in data that arrives as video sequences [34].

Oil Spillage: Oil spills are detected in satellite radar images [22].

Astronomy: Skewed data sets exist in the astronomical field also; only 0.001% of the objects in sky survey images are truly beyond the scope of current science and may lead to new discoveries [27].

Spam Image Detection: In spam image detection, near-duplicate spam images are difficult to discover among the large number of spam images [33].

Text Classification: In text classification, data is imbalanced in aspects such as text number, class size, subclass and class fold [24].

Health Care: The health care domain has a classic example of class imbalance: rare diseases affect a very negligible number of people, but the consequences involved are very severe. It is extremely vital to correctly detect and classify the rare diseases and the affected patients; any error in such cases might be fatal.

The above real world examples signify the importance of dealing with imbalanced data sets. Most of the above examples also fall into the category of skewed data streams. Most learning algorithms work well with balanced data streams, as their aim is to improve overall accuracy or a related measure. When such algorithms are applied to skewed data streams, their accuracy in classifying majority examples is good but their accuracy in classifying minority examples is poor. This happens because learning from such imbalanced/skewed streams causes the learner to become biased towards the majority class; thus the minority examples are likely to be misclassified. Thus the main issue the research community is working on with regard to skewed data streams is correctly classifying the minority data instances without affecting the accuracy on majority data instances. In recent years, learning from skewed data streams has been recognized as one of the crucial problems in machine learning and data mining. In essence, the principal problem is improving the accuracy on both the minority and the majority class instances of the stream.

1.5 Issues in Learning from Skewed Data Streams

In general, learning from skewed data streams is challenging due to the following issues.

1. Evaluation metric: The appropriate choice of evaluation metrics is important in this domain. Evaluation metrics play a vital role in data mining; they are used to guide algorithms towards the desired solution. Thus if an evaluation metric does not take the minority class into consideration, the learning algorithm will not be able to cope well with skewed data streams. Standard evaluation metrics like overall accuracy are not valid in this case, because even if minority class instances are misclassified the overall accuracy may remain high, the primary reason being the negligible number of minority class instances.

2. Lack of minority class data for training: In skewed data streams, the lack of minority class data makes it difficult to learn class boundaries, as very few instances are available. Thus training a classifier in such situations is very difficult.

3. Treatment of minority class data as noise: One of the major issues is that of noise. Noisy data in the streams affects the minority class more than the majority data.

Furthermore, standard stream mining algorithms tend to treat the minority class data as noise.

4. As stated earlier, data streams are usually massive and arrive at varying speeds, which makes it difficult to store them completely; multiple scans are nearly impossible, and hence learning from such streams is difficult.

5. Data streams also undergo considerable concept evolution over time.

From the above points we can see that classification of skewed data streams is a multi-fold problem. All the above issues need to be addressed in order to design an appropriate learning algorithm for skewed data streams.

1.6 Thesis Outline

The rest of the thesis is organized as follows: Chapter 2 gives a brief description of the literature survey we carried out. Chapter 3 describes our problem statement and how we arrived at it; it also focuses on the different evaluation metrics chosen by us throughout this thesis. Chapter 4 elaborates our approach to deal with data streams with skewed distribution of classes and explains our algorithm and its implementation details. Chapter 5 contains the results and discussions of the implemented algorithm and the evaluation of its performance. Chapter 6 provides the conclusion and the future enhancements that can be carried out.

Chapter 2

Literature Survey

This chapter gives brief details of our literature survey carried out in the area of imbalanced datasets and the skewed data streams problem.

2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches

We went through various methods available in the literature for dealing with imbalanced datasets, and here we portray some of the well known and most popular approaches, algorithms and methods that have been devised to deal with skewed data streams. Some of the books we referred to in order to gain an effective understanding of data mining concepts are Data Mining: Concepts and Techniques by Han and Kamber [18] and Introduction to Data Mining by Kumar et al [23]. In the literature there are a number of methods addressing the class imbalance problem, but the area of skewed data streams is relatively new to the research community. The sampling based and ensemble algorithms are the simplest yet effective ones. The following paragraphs provide a brief overview of them. The approaches for dealing with skewed data streams are categorised under the following methods:

Oversampling.

Under-sampling.

Cost Sensitive Learning.

Oversampling and under-sampling are sampling based preprocessing methods of data mining. The main idea in these methods is to manipulate the data distribution such that all the classes are represented well in the training or learning datasets. Recent studies in this domain have shown that sampling is an effective method for dealing with such problems. Cost sensitive learning basically associates costs with misclassifying examples in order to penalise the classifier.

2.1.1 Oversampling

Oversampling is one of the sampling based preprocessing techniques in data mining. In oversampling, the number of minority class instances is increased either by reusing instances from previous training/learning chunks or by creating synthetic examples. Oversampling tries to strike a balance between the majority and minority classes. One advantage of this method is that normal stream classification methods can then be used. The most commonly used method of oversampling is SMOTE (Synthetic Minority Oversampling Technique) [7], sketched below.
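To make the idea concrete, here is a minimal sketch of SMOTE-style synthetic example generation, which interpolates between a minority instance and one of its nearest minority neighbours. This is our own illustration under those assumptions, not the WEKA implementation of SMOTE.

import java.util.Random;

/** Minimal sketch of SMOTE-style oversampling: a synthetic minority example
 *  is interpolated between a minority instance and one of its nearest
 *  minority neighbours. Illustrative only, not the WEKA implementation. */
public class SmoteSketch {
    static double[] synthesize(double[] x, double[] neighbour, Random rnd) {
        double gap = rnd.nextDouble();            // random point on the segment
        double[] synthetic = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            synthetic[i] = x[i] + gap * (neighbour[i] - x[i]);
        }
        return synthetic;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0}, b = {2.0, 4.0};  // two minority instances
        double[] s = synthesize(a, b, new Random(7));
        System.out.println(s[0] + ", " + s[1]);   // lies on the segment a-b
    }
}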

Some of the oversampling based approaches in the literature are discussed below. Most of the stream classification algorithms available assume that streams have a balanced distribution of classes. In the last few years, a few attempts have been made to address the problem of dealing with skewed data streams. The first such attempt was by Gao et al [16], who proposed the SE (Sampling + Ensemble) approach, which processes the stream in batches. In the SE approach, each classifier in the ensemble is trained by drawing an uncorrelated sample of negative instances together with all the positive instances in the current training chunk as well as the positive instances of all previous training chunks. Thus in the SE approach, oversampling of positive instances is done by incorporating old positive examples, along with under-sampling by way of using disjoint subsets of negative examples.

The SERA (Selectively Recursive Approach) framework was proposed by Chen and He [9]. In this framework they selectively absorb minority examples from previous chunks into the current training chunk to balance it. The similarity measure used to select minority examples from previous chunks is the Mahalanobis distance. This differs from the approach of Gao et al [16], which takes in all previous minority examples. In the SERA framework, a single hypothesis based on the current training chunk is maintained for making predictions on the test data chunk.

The MuSeRA (Multiple Selectively Recursive Approach) algorithm by Chen and He [10] was their further work after SERA to deal with imbalanced data stream classification. In MuSeRA, balancing of the training chunk is done in a similar way, using the Mahalanobis distance as the similarity measure to accommodate minority samples accumulated from all previous training chunks. In MuSeRA a hypothesis is built on every training chunk; thus a set of hypotheses is built over time, as opposed to SERA, which maintains only a single hypothesis. The set of hypotheses at time-stamp i is used to predict the classes of instances in the test chunk at time-stamp i.

In their further work in this area, Chen and He [8] proposed an approach named REA (Recursive Ensemble Approach), in which, when the next training chunk arrives, it is balanced by adding those positive instances from previous chunks which are nearest neighbours of the positive instances in the current training chunk; the balanced chunk is then used to build a soft-typed hypothesis. In REA, a new soft-typed hypothesis is built for every training chunk. It then uses weighted majority voting to predict the posterior probabilities of test instances, where weights are assigned to the different hypotheses based on their performance on the current training chunk.

2.1.2 Under-sampling

Under-sampling is another sampling based method, which addresses the problem by reducing the number of majority class instances. This is generally done by filtering out majority class instances or by randomly selecting an appropriate number of majority class examples. Under-sampling is mostly carried out using clustering: the best representatives of the majority class are chosen and the training chunk is balanced accordingly. Some of the under-sampling based approaches in the literature are discussed below.

Zhang et al [32] proposed another algorithm to deal with skewed data streams, using clustering + sampling. Sampling was carried out by using the k-means algorithm to form clusters of the negative examples in the current training chunk, and the centroid of each cluster formed was then used to represent that cluster. The number of clusters formed was equal to the number of positive examples in the current training batch, and thus the current training batch was updated by taking all positive examples along with the centroids of the clusters of negative samples. A new classifier was created on these sampled instances. Further, the size of the ensemble was fixed, so from all the classifiers present in the ensemble along with the new classifier built on sampled instances, AUROC was used as the measure to select the best classifiers to be included in the ensemble. Weights were assigned to the classifiers on the basis of the calculated AUROC, and the ensemble thus formed was used to classify the instances in the test chunk. The work by Song et al [28], which addresses the issue of skewed data streams in cloud security, follows a similar approach of using k-means clustering to draw centroids of clusters of the negative class in order to under-sample it.

Recently, Zhang et al [36] proposed an approach called ECSDS (Ensemble Classifier for Skewed Data Streams), which aims at reducing the time needed for ensemble learning by updating the ensemble only when required. In this algorithm the ensemble is initially built in a similar way to the approach of Gao et al [16], but the mechanism to update the ensemble is a bit different. When the ensemble makes predictions for a test batch, the positive instances which are misclassified are retained as MI. After prediction on two batches, the difference of the F1 values on these two batches is calculated. If the difference crosses a threshold value, then the ensemble is updated using all misclassified positive instances MI and all previous positive instances AP, along with the negative instances in the current batch.
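A minimal sketch of the centroid-based under-sampling idea used by the clustering approaches above: run k-means on the negative (majority) instances of a chunk and keep only the k centroids, where k is the number of positive instances. The method name and the fixed iteration count are our own simplifications.

import java.util.Random;

/** Minimal sketch of centroid-based under-sampling: cluster the negative
 *  (majority) instances with k-means and keep only the centroids.
 *  Illustrative only; a fixed number of Lloyd iterations is assumed. */
public class CentroidUndersampling {
    static double[][] centroids(double[][] negatives, int k, int iters, Random rnd) {
        int dim = negatives[0].length;
        double[][] c = new double[k][dim];
        for (int j = 0; j < k; j++)                      // random initialisation
            c[j] = negatives[rnd.nextInt(negatives.length)].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (double[] x : negatives) {               // assignment step
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = 0.0;
                    for (int i = 0; i < dim; i++) {
                        double diff = x[i] - c[j][i];
                        d += diff * diff;
                    }
                    if (d < bestD) { bestD = d; best = j; }
                }
                count[best]++;
                for (int i = 0; i < dim; i++) sum[best][i] += x[i];
            }
            for (int j = 0; j < k; j++)                  // update step
                if (count[j] > 0)
                    for (int i = 0; i < dim; i++) c[j][i] = sum[j][i] / count[j];
        }
        return c;   // these centroids replace the majority instances
    }
}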

2.1.3 Cost Sensitive Learning

Cost sensitive learning is one of the important techniques of data mining. It assigns different misclassification penalties to each class. Cost sensitive learning has been incorporated into classification algorithms by taking the cost information into account and trying to optimize the overall cost during the learning process. In cost sensitive classification, the problem is dealt with by adjusting the learning: costs are associated with misclassification of the minority class, and the learner is adjusted based on a punishment-reward system. One advantage of this method is that the training dataset is unchanged, unlike oversampling and under-sampling, wherein the data distribution changes completely. A small illustration of cost-based prediction is sketched at the end of this subsection.

Polikar et al [11] proposed an algorithm named Learn++SMOTE, in which the algorithm updates the distribution of instance weights by evaluating the current training chunk D_n on the ensemble and accordingly adjusting the weights of misclassified instances to form a new instance weighting distribution D_m. A new classifier is trained on this data chunk D_m, and then a new data subset is created by calling SMOTE. All the classifiers in the ensemble are evaluated on this synthetic dataset, and the classifiers with lower error are chosen to form the ensemble. While doing so, if the new classifier has an error of more than 0.5 it is discarded and a new one is created, but if an older classifier has an error of more than 0.5 its error is set to 0.5, since it might be useful if the stream follows a cyclic nature.

Cooper et al [26] proposed an online approach to deal with skewed data streams: an incremental learning algorithm for skewed data stream classification. They argue that in batch learning algorithms the update of the model is delayed until the next training chunk is received, but in some cases there is a need to update the model with every new instance. They build the initial ensemble model in a similar way to the approach of Gao et al [16], the difference being that they draw a random subset of negative instances with replacement. Once the initial model is built, for every new incoming instance of the positive class all the base models are updated. If the new incoming instance is of the negative class, it is used for updating the models with a probability of n/p, where n and p are the numbers of negative and positive instances observed so far, respectively.

Liu et al [25] proposed RDFCSDS (Reusing Data for Classifying Skewed Data Streams). In this algorithm, before training the ensemble it is checked whether concept drift has occurred. If it has, the current ensemble is cleared and built from scratch by sampling data from the current training chunk as well as the previous training chunk, thus reusing previous data. If no concept drift has occurred, a new classifier is built on the current training chunk, AUROC is calculated for all the classifiers in the ensemble, and the top k classifiers by AUROC value are chosen to form the ensemble. Prediction of instances in the test chunk is done by weighted majority voting.
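Returning to the cost-sensitive idea introduced at the start of this subsection, the following tiny sketch shows prediction by minimum expected misclassification cost; the cost values are purely illustrative assumptions.

/** Tiny sketch of cost-sensitive prediction: pick the class with the lowest
 *  expected misclassification cost. Cost values are illustrative only. */
public class CostSensitiveSketch {
    // COST[actual][predicted], class 0 = minority, class 1 = majority:
    // missing a minority instance (row 0, column 1) is penalised most.
    static final double[][] COST = {
            {0.0, 10.0},
            {1.0,  0.0}
    };

    /** pMinority is the posterior probability from any base classifier. */
    static int predict(double pMinority) {
        double[] p = {pMinority, 1.0 - pMinority};
        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int predicted = 0; predicted < 2; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < 2; actual++) {
                expected += p[actual] * COST[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;   // favours the minority class whenever 10 * p[0] > p[1]
    }
}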

From the above discussion we can see that most of the approaches proposed so far have used oversampling and/or under-sampling to balance the training chunks while dealing with skewed data streams, while some have opted for the cost sensitive classification approach. Also, as we have seen previously, most applications do exhibit characteristics of skewed data streams, and there is still room for improvement in this area.

Chapter 3

Problem Description

This chapter gives the motivation behind choosing the problem under consideration and how we arrived at the problem statement.

3.1 Motivation

While working on identification of a project topic in the area of data mining, we found that a lot of work has already been done in different areas of data mining with respect to static datasets. Further, in the last decade the class imbalance problem in static datasets has drawn the attention of the data mining community, due to which various workshops at different conferences were dedicated specifically to the problem of class imbalance. The first such workshop was organized way back in 2000 at the AAAI 2000 conference; another workshop on Learning from Imbalanced Datasets was organized at ICML. Recently, another workshop named Data Mining when Classes are Imbalanced and Errors have Cost was organized at the PAKDD 2009 conference. These points drew our attention towards the problem of class imbalance. Since a considerable amount of work had already been done on the class imbalance problem for static datasets, we went on to look for another area wherein class imbalance is of more primary concern. We found that data streams are an area where class imbalance has not been thoroughly studied. We then concentrated on various real life applications in which data streams are prominent: areas like network intrusion detection and financial fraud detection are characterised by stream data.

Looking at the characteristics of some data streams, we could see that most real life data streams are characterised by class imbalance; that is, they are skewed data streams. As mentioned earlier in Section 1.4, many applications are characterised by skewed data streams. Thus our main focus was brought to skewed data streams. Having seen various real life examples of skewed data streams, let us consider one of them, say financial fraud detection. As we know, lakhs of transactions are in progress at any particular instant, and very few of them are fraudulent. Identifying such transactions is a difficult task. Further, identification of new patterns of fraudulent transactions can be tedious if we ask domain experts to label each of them. Thus, in such a case, identifying these minority class examples is the important task. In these scenarios, methods like random sampling of instances, where the randomly sampled examples are given to a domain expert for labelling, are not a good idea. If we miss out on identifying such examples, we are inviting financial losses. In the case of network intrusion detection, depending upon the network, there will be both normal connection requests and malicious connection requests, and either type of connection may be in the minority. If the malicious connection requests are in the minority, then identifying them becomes the primary goal. From the above discussion it is quite evident that there is a need to identify or classify minority class examples in skewed data streams in an effective manner.

3.2 Problem Statement

As seen in the previous section, after concentrating on the problem of dealing with skewed data streams, we formally framed our problem statement as: To design and develop a general classification framework that will correctly classify the minority class instances along with improvement in the evaluation metrics of data streams with skewed distribution of classes.

How did we reach our problem statement

We realised how important it is to identify the minority class examples in skewed data streams in applications such as those mentioned in Section 1.4. For instance, consider financial fraud detection, wherein the fraudulent transaction attempts, which are very few in number, are the minority instances that need early identification. Misidentification or misclassification of such minority examples may result in financial losses. Thus we can see that these minority class instances have more valued loss functions associated with them. Further, data streams are also characterised by concept drift, which complicates the problem. Thus, dealing with skewed data streams is a multi-fold problem. Identification of minority class examples can be achieved by the classification task of data mining. Thus we drilled down to our problem statement.

Why are Existing Classifiers Weak?

Most existing stream classifiers cannot be used to classify skewed data streams. These classifiers face the issues mentioned in Section 1.5. Further, these classifiers assume that examples in the stream have a fairly balanced distribution over the different classes. Hence standard stream classifiers are dominated by the majority class instances, and thus they tend to consider the minority class instances as noise.

3.3 Evaluation Metrics

Although most stream classifiers measure their performance based on overall accuracy, in the case of imbalanced datasets such a measure is not appropriate. Consider, for example, two classifiers being tested on an imbalanced dataset with a class distribution ratio of 0.99:0.01. If classifier one classifies all 99% of the majority class correctly but none of the minority class, whereas classifier two classifies 97% of the majority class correctly and 80% of the minority class correctly, then in terms of overall accuracy classifier one beats classifier two; but if the minority class is the main class under focus, then classifier two is the one that should be chosen.
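A quick worked calculation with these numbers makes the point. With a 0.99:0.01 class distribution:

Accuracy(classifier one) = 0.99 × 1.00 + 0.01 × 0.00 = 0.990
Accuracy(classifier two) = 0.99 × 0.97 + 0.01 × 0.80 = 0.968

So classifier one wins on overall accuracy while detecting no minority instances at all, which is exactly the wrong outcome for skewed streams.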

Thus, in such cases, various performance measures have been suggested in the literature. In this section we present the evaluation metrics that we have used.

Confusion Matrix: The columns of the confusion matrix represent the predictions, and the rows represent the actual class. Correct predictions always lie on the diagonal of the matrix. Equation 3.1 shows the general structure of the confusion matrix.

ConfusionMatrix = ( TP  FN
                    FP  TN )    (3.1)

wherein True Positives (TP) indicate the number of instances of the minority class that were correctly predicted, True Negatives (TN) indicate the number of instances of the majority class that were correctly predicted, False Positives (FP) indicate the number of instances of the majority class that were incorrectly predicted as minority class instances, and False Negatives (FN) indicate the number of instances of the minority class that were incorrectly predicted as majority class instances. Though the confusion matrix gives a better outlook on how the classifier performed than accuracy alone, a more detailed analysis is preferable, which is provided by the further metrics.

Recall: Recall is a metric that gives the percentage of the actual minority class members that the classifier correctly identified. (TP + FN) represents the total of all minority members. Recall is given by equation 3.2.

Recall = TP / (TP + FN)    (3.2)

Precision: It gives the percentage of the minority class instances, as determined by the model or classifier, that actually belong to the minority class. (TP + FP) represents the total of positive predictions by the classifier. Precision is given by equation 3.3.

Precision = TP / (TP + FP)    (3.3)

Thus, in general, it is said that Recall is a completeness measure and Precision is an exactness measure. The ideal classifier would give a value of 1 for both Recall and Precision, but if a classifier scores higher (closer to one) on one of these metrics and lower on the other, choosing between classifiers is a difficult task. For such cases, some other metrics, discussed below, are suggested in the literature. A few metrics based on the above ones have been suggested for use with imbalanced data sets or streams; performance measures like AUROC (Area Under the ROC Curve) [12],[20] and G-Mean [17] are well suited to such situations.

F-Measure: It is the harmonic mean of Precision and Recall. We can say that it is essentially an average of the two percentages, and it greatly simplifies the comparison between classifiers. It is given by equation 3.4.

F-Measure = 2 / (1/Recall + 1/Precision)    (3.4)

G-Mean: G-Mean is a metric that measures the balanced performance of a learning algorithm between the classes. It is given by equation 3.5, where the True Negative Rate (TNR) is given by equation 3.6.

G-Mean = sqrt(Recall × TNR)    (3.5)

TNR = TN / (TN + FP)    (3.6)

Area Under ROC Curve: The area under the ROC (Receiver Operating Characteristics) curve gives the probability that, when one draws one majority and one minority class example at random, the decision function assigns a higher value to the minority class example than to the majority class example. AUROC is not sensitive to the class distribution of the dataset. Generally the ROC curve is plotted as the True Positive Rate versus the False Positive Rate. AUROC was mainly used in signal detection theory and the medical domain, where it was described as the plot of Sensitivity versus (1 − Specificity); since Sensitivity is the same as the True Positive Rate and Specificity is (1 − False Positive Rate), (1 − Specificity) reduces to the False Positive Rate, and thus both definitions of AUROC are one and the same.
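As a minimal sketch, the metrics of this section can be computed directly from the four confusion matrix counts; the class below is our own illustration (the counts in main are made up), not code from MOA or WEKA.

/** Minimal sketch: evaluation metrics from a binary confusion matrix
 *  (minority class taken as positive). Names are illustrative only. */
public class SkewMetrics {
    private final double tp, fn, fp, tn;

    public SkewMetrics(double tp, double fn, double fp, double tn) {
        this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
    }

    public double recall()    { return tp / (tp + fn); }              // equation 3.2
    public double precision() { return tp / (tp + fp); }              // equation 3.3
    public double fMeasure()  {                                       // equation 3.4
        double r = recall(), p = precision();
        return 2.0 / (1.0 / r + 1.0 / p);
    }
    public double tnr()       { return tn / (tn + fp); }              // equation 3.6
    public double gMean()     { return Math.sqrt(recall() * tnr()); } // equation 3.5

    public static void main(String[] args) {
        // Example: 990 majority and 10 minority instances, 8 minority found.
        SkewMetrics m = new SkewMetrics(8, 2, 30, 960);
        System.out.printf("Recall=%.3f Precision=%.3f F=%.3f G-Mean=%.3f%n",
                m.recall(), m.precision(), m.fMeasure(), m.gMean());
    }
}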

Chapter 4

Our Approach

As mentioned in the previous chapter, after the literature survey we came to the conclusion that the issue of skewed data streams certainly needs more attention and that there is still room for improvement in the performance of existing classifiers. Thus we came up with our own approach to deal with skewed data streams.

4.1 Approach to Deal with Skewed Data Streams

This section gives a brief description of our approach to deal with the classification of data streams with skewed distribution. In general, instead of using a single model built from a single training set, we propose to use multiple models built from different sets: we use an ensemble of classifiers/models to classify data streams with skewed distributions. The use of an ensemble of classifiers has been proven effective in dealing with concept-drifting data streams [31]. In our approach we have also used oversampling so as to balance the training chunk contents. Fig 4.1 shows our approach in brief. We process the data stream in chunks. If the sets of instances used for training the classifiers are disjoint, then the classifiers will make uncorrelated errors, which can be eliminated by averaging [16]. On similar lines, we build a new classifier for the ensemble for every incoming batch, such that the instances of the training chunks are as uncorrelated as possible. The best way to learn from a data stream is to construct the model on the most recent training chunk. This works for instances of the majority class, because they are in abundance in every chunk, but not for the minority class, whose samples are very few. Some of the earlier approaches [7],[11] depend on the creation of synthetic examples to deal with this problem.

In our approach, we store the minority class examples from previous training chunks. To balance a training chunk's distribution, we oversample using the minority class instances stored over the period of time.

Figure 4.1: Our Approach to Classify Data Streams with Skewed Distributions

While balancing the training chunk, we follow the K nearest neighbour approach: we select examples from the minority examples accumulated over time. To balance the training chunk, we find the k nearest neighbours, within the current training chunk, of each minority example stored from previous chunks. Out of those, an appropriate number of minority samples are used to balance the training chunk. Using that training chunk, we build a new classifier, which is added to the ensemble. The size of the ensemble is maintained at 10; when the size of the ensemble goes beyond this specified limit, the classifiers with the best AUROC (Area Under ROC Curve) values are chosen.

The process of learning is continuous, as can be seen from Fig 4.1. As the stream keeps coming in, the ensemble is continuously updated. Meanwhile, the test examples are classified by taking predictions from the ensemble. The predictions are made by weighted majority voting among the classifiers. The weights of the classifiers in the ensemble are assigned as per their AUROC values, and while taking the majority vote the weights are normalized. Thus our approach combines oversampling with the k nearest neighbour algorithm and an ensemble based approach to deal with data streams with skewed distribution of classes. Algorithm 1 presents in detail the K nearest neighbours algorithm used by our approach; a small code sketch of this step follows the listing.

Algorithm 1: K Nearest Neighbours Algorithm
Input: The set of training examples, i.e. the set of examples from which nearest neighbours are to be determined.
Output: The K nearest neighbours.
begin
  Determine the parameter K = number of nearest neighbours.
  Calculate the distance between the query instance and all the training samples.
  Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
  Return the K nearest neighbours.
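A minimal sketch of Algorithm 1 with Euclidean distance; the helper names are our own, and a real implementation would plug in the stream's attribute vectors.

import java.util.Arrays;
import java.util.Comparator;

/** Minimal sketch of the k-nearest-neighbours step used to pick stored
 *  minority examples close to the current chunk (illustrative names). */
public class KnnHelper {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the k training samples nearest to the query. */
    static int[] kNearest(double[] query, double[][] train, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort all training samples by distance to the query instance.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(query, train[i])));
        int[] out = new int[Math.min(k, train.length)];
        for (int i = 0; i < out.length; i++) out[i] = idx[i];
        return out;
    }
}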

4.2 Algorithm for Skewed Data Streams

Algorithm 2 presents the algorithmic implementation of the approach we have come up with to deal with the classification of skewed data streams. The algorithm processes the data stream in batches or chunks.

Algorithm 2: CDSSDC: Classifier for Data Streams with Skewed Distribution of Classes
Input: Training chunk B_i and test chunk B_{i+1} arriving at the current time-stamp t_i;
       ensemble of classifiers Z_{i-1} built from t = 0 to i-1;
       AP, the set of all positive instances accumulated from the previous training chunks from t = 0 to i-1;
       balance ratio f, and ratio of positive to negative instances in the training chunk γ.
Output: Ensemble Z_i, predictions for the test chunk B_{i+1}.
begin
  foreach time-stamp t_i = 1, 2, ... do
    Split the current training chunk B_i into P and N, containing the positive and negative examples respectively.
    if f > (t_i - 1) γ then
      TrainSet = B_i ∪ AP
    else
      Find the K nearest neighbours, within the current training chunk, of each instance of AP; read the positive class instances among these into D_i and sort D_i in descending order.
      Read the first (f - γ)|B_i| instances from D_i into E_i.
      TrainSet = B_i ∪ E_i
    Build a new classifier C_i on TrainSet.
    if |Z_{i-1}| < 10 then
      Z_i = Z_{i-1} ∪ {C_i}
    else
      Find the AUROC of all classifiers in Z_{i-1} on B_i.
      C_j = the classifier with the lowest AUROC in Z_{i-1}.
      if AUROC(C_j) < AUROC(C_i) then
        Z_i = Z_{i-1} ∪ {C_i} - {C_j}
    Assign weights to the classifiers in Z_i according to AUROC.
    Normalize the weights of the classifiers in Z_i.
    AP = AP ∪ P

Thus the input to the system is a skewed data stream, in chunks or batches. We assume that at any given instant the training and test chunks are available: at time-stamp t_i, the training chunk B_i and the test chunk B_{i+1} are available as input to the algorithm. Further, the ensemble of classifiers built from time-stamp t_0 to t_{i-1} is also given as input, as are all the minority class samples accumulated from the start of the stream till the last time-stamp. A balance ratio of 0.5 is set initially. It is assumed that for each training chunk we know the ratio of positive to negative class instances. The training batch that arrives at the current time-stamp is split into positive and negative class instances. It is then checked whether an adequate number of minority (positive) class instances has been accumulated to balance the batch. If the number is not adequate, then all the retained examples are added, along with the positive and negative class instances, into the train set. If the retained examples are more than required, then for each of those minority class instances the K nearest neighbours within the current training batch are found. The nearest neighbours that are of the minority class are then sorted in descending order of nearness. Minority class examples corresponding to the first x examples of this ordered list are then selected from the retained minority samples set, where x is chosen such that the balance ratio of 0.5 is maintained. Thus the train set is created, and a new classifier is built on it. This newly built classifier is then added to the ensemble of classifiers. If the number of classifiers in the ensemble exceeds 10, then the AUROC of each classifier is computed on the original training data chunk and the 10 classifiers with the best AUROC values are chosen to be part of the ensemble. The classifiers are then assigned weights based on the calculated AUROC values. For making predictions on the test batch B_{i+1}, weighted majority voting is taken and the test instances are assigned predictions accordingly. A test-then-train pattern is followed for the data chunks.
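The ensemble maintenance step of Algorithm 2 (add the new classifier, evict the weakest by AUROC once the limit of 10 is reached, then weight and normalize) can be compressed into the following sketch; WeightedModel and the AUROC scorer are stand-ins of our own, not MOA APIs.

import java.util.List;
import java.util.function.ToDoubleFunction;

/** Compressed sketch of the CDSSDC ensemble update described above.
 *  WeightedModel and the AUROC scorer are illustrative stand-ins. */
final class WeightedModel {
    double weight;                 // normalized AUROC-based voting weight
    final Object model;            // placeholder for a trained base classifier
    WeightedModel(Object model) { this.model = model; }
}

public class EnsembleUpdateSketch {
    static final int MAX_SIZE = 10;            // ensemble size limit from the thesis

    /** Adds a freshly trained model; if full, swaps out the weakest by AUROC. */
    static void update(List<WeightedModel> ensemble, WeightedModel fresh,
                       ToDoubleFunction<WeightedModel> aurocOnChunk) {
        if (ensemble.size() < MAX_SIZE) {
            ensemble.add(fresh);
        } else {
            // Replace the weakest member only if the new model beats it on AUROC.
            WeightedModel weakest = ensemble.get(0);
            for (WeightedModel m : ensemble)
                if (aurocOnChunk.applyAsDouble(m)
                        < aurocOnChunk.applyAsDouble(weakest)) weakest = m;
            if (aurocOnChunk.applyAsDouble(weakest)
                    < aurocOnChunk.applyAsDouble(fresh)) {
                ensemble.remove(weakest);
                ensemble.add(fresh);
            }
        }
        // Assign AUROC-based weights and normalize them for weighted voting.
        double total = 0.0;
        for (WeightedModel m : ensemble) total += aurocOnChunk.applyAsDouble(m);
        for (WeightedModel m : ensemble) m.weight = aurocOnChunk.applyAsDouble(m) / total;
    }
}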

Chapter 5

Experimental Results and Discussions

This chapter presents the results obtained after carrying out experiments on the algorithm we developed.

5.1 Experimental Setup

This section gives an overview of the experimental setup used to perform our experiments and test the effectiveness of our algorithm. For the implementation of our algorithm we have used MOA [6], an open source framework for mining data streams. It is implemented in Java and has a collection of various data stream mining algorithms. MOA being an open source framework, we could easily add and integrate our algorithm into it. Details of the other hardware and software that we used are as follows: we used a release of MOA along with a version of WEKA on the Windows 7 Professional operating system, running on a dual core AMD Opteron processor with 2 GB RAM.

5.2 Datasets

To evaluate the performance of our algorithm, we have tested it on various synthetic as well as real world datasets. We have used datasets available at the UCI repository [13] and at the MOA dataset repository [1], along with various synthetically generated datasets. These datasets were chosen such that they model real world scenarios, have a variety of features, and vary largely in size and class distribution.

5.2.1 Synthetically Generated Datasets

This subsection presents the details of the synthetic datasets that we generated; they are briefly explained later in this section. The main reason for generating these datasets was that they focus on one of the main characteristics of data streams, that is, concept drift. Table 5.1 shows the characteristics of the synthetically generated datasets: the number of instances, the number of majority class instances, the number of minority class instances, the number of attributes, the chunk size, and the ratio of majority class to minority class.

Table 5.1: Description of Synthetic Datasets used

Dataset   Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
SPH 1%    1,00,000    -           -           -            -            0.99::0.01
SPH 3%    1,00,000    -           -           -            -            0.97::0.03
SPH 5%    1,00,000    -           -           -            -            0.95::0.05
SPH 10%   1,00,000    -           -           -            -            0.90::0.10
SEA 1%    80,000      -           -           3            1000         0.99::0.01
SEA 3%    80,000      -           -           3            1000         0.97::0.03
SEA 5%    80,000      -           -           3            1000         0.95::0.05
SEA 10%   80,000      -           -           3            1000         0.90::0.10

Description of Synthetically Generated Datasets

SEA Dataset: The SEA dataset is generated to model abrupt concept drift. It was proposed by Street and Kim [29]. The instances of the SEA dataset are generated in a 3-dimensional feature space with two classes.

Each of these 3 attributes or features has a value between 0 and 10. Of these attributes, only the first two are relevant; the third is added to act as noise. If the sum of the first two attributes crosses a certain threshold value, then the instance belongs to class 1, else it belongs to class 2. The threshold values generally used are 8, 9, 7 and 9.5 for the four data blocks. From each block, some of the instances are reserved as test instances. We generated 80,000 instances of the SEA dataset with different percentages of class imbalance. These were used in batches of 1000 each; that is, a batch of 1000 instances was used for training, and then the next 1000 instances were used for testing the model built. We generated 4 different SEA datasets, with 1%, 3%, 5% and 10% skew added to them. To each of these datasets we added 1% noise, so as to make the task of learning more difficult; noise was added by reversing the correct class label to an incorrect one in the training set.

Spinning Hyperplane Dataset: The SPH (Spinning Hyperplane) dataset is generated to model gradual concept drift. It was proposed by Wang et al [31]. The SPH dataset defines a class boundary in n dimensions by coefficients α_1, α_2, α_3, ..., α_n. An instance d = (d_1, d_2, d_3, ..., d_n) is generated by randomizing each attribute in the range 0 to 1; thus each attribute value is between 0 and 1. A constant bias is defined as

α_0 = (1/2)(α_1 + α_2 + ... + α_n)

and the class label l for an instance d is then 1 if α_1·d_1 + α_2·d_2 + ... + α_n·d_n ≥ α_0, and 0 otherwise. As opposed to the abrupt concept drift in the SEA dataset, the SPH dataset is characterised by gradual concept drift: some of the coefficients α_i are sampled at random to have a small increment δ added, where δ is defined as

δ = s · t / N

Here t is the magnitude of the change for every N examples, and s alternates between -1 and 1, thus specifying the direction of change; s has a 20% chance of being reversed every N examples. α_0 is also modified as per the equation mentioned earlier. In this way the class boundary acts like a spinning hyperplane during data generation.
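Under the equations above, a minimal SPH-style generator can be sketched as follows; the dimensionality, drift magnitude and seed are assumed values of our own.

import java.util.Random;

/** Minimal sketch of an SPH-style spinning hyperplane stream generator,
 *  following the equations above. n, t, N and the seed are assumed values. */
public class SpinningHyperplane {
    public static void main(String[] args) {
        int n = 10;          // dimensionality (assumed)
        int N = 1000;        // examples between direction checks
        double t = 0.1;      // total magnitude of change per N examples
        Random rnd = new Random(42);

        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) alpha[i] = rnd.nextDouble();
        int s = 1;           // direction of the drift

        for (int count = 1; count <= 5000; count++) {
            // Instance with each attribute randomized in [0, 1].
            double[] d = new double[n];
            double dot = 0.0, alpha0 = 0.0;
            for (int i = 0; i < n; i++) {
                d[i] = rnd.nextDouble();
                dot += alpha[i] * d[i];
                alpha0 += alpha[i];
            }
            alpha0 *= 0.5;                        // alpha_0 = (1/2) * sum(alpha_i)
            int label = (dot >= alpha0) ? 1 : 0;  // class from the hyperplane
            if (count <= 3) {
                System.out.println("label of instance " + count + " = " + label);
            }
            // Gradual drift: add delta = s*t/N to a randomly chosen coefficient.
            alpha[rnd.nextInt(n)] += s * t / N;
            // s has a 20% chance of reversing every N examples.
            if (count % N == 0 && rnd.nextDouble() < 0.2) s = -s;
        }
    }
}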

5.2.2 Results of Our Approach on Synthetically Generated Datasets

We present the graphical results of our approach in comparison with some other state of the art algorithms. Details of the algorithms used for comparison are as follows:

Our approach, CDSSDC, which uses the K nearest neighbour approach to balance the current training chunk; the ratio of positive to negative examples is kept at 0.5.

The SMOTE algorithm [7], which creates synthetic examples to balance the dataset. The implementation of SMOTE available in the WEKA toolkit [14] has been used with the default WEKA settings, except for the SPH and SEA datasets, where SMOTE was applied so as to maintain the ratio of positive to negative examples at 0.5.

FLC [3], which was designed to deal with concept drift in data streams efficiently.

From Fig 5.1 we can clearly see that our approach gives better performance than SMOTE and FLC in terms of AUROC on the SPH datasets at various imbalance levels. As the class skew eases from 99:1 towards 90:10, the FLC and SMOTE algorithms come close in performance to our approach. Fig 5.2 shows the comparative performance of our approach, SMOTE and FLC with respect to G-Mean on the SPH datasets. Here we can see that our approach performs better than SMOTE and FLC for all the class distributions of the SPH datasets shown. Here also we can observe that the other algorithms come closer to catching up as the class skew eases.

Figure 5.1: Comparison of AUROC on SPH Datasets

Figure 5.2: Comparison of G-Mean on SPH Datasets

Fig 5.3 gives the comparative details of the F-Measure values obtained by our approach, SMOTE and the FLC algorithm. Here we can observe that our approach achieves better values than the other algorithms. Fig 5.4 shows the overall accuracies of SMOTE, CDSSDC and FLC on the SPH datasets with varied class imbalance levels. Here we can see that our approach maintains almost the same or better accuracy than SMOTE and FLC.

Figure 5.3: Comparison of F-Measure on SPH Datasets

Figure 5.4: Comparison of Overall Accuracies on SPH Datasets

From Fig 5.5 we can clearly see that our approach gives better performance than SMOTE and FLC in terms of AUROC on the SEA datasets at various imbalance levels. As the class skew eases from 99:1 towards 90:10, the FLC and SMOTE algorithms come very close in performance to our approach. Fig 5.6 shows the comparative performance of our approach, SMOTE and FLC with respect to the G-Mean values obtained. Here we can see that our approach performs a bit better than SMOTE and FLC for all the class distributions of the SEA datasets shown. Here also we can observe that the other algorithms come closer as the class skew eases.


More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

Fraud Detection using Machine Learning

Fraud Detection using Machine Learning Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments

More information

NETWORK FAULT DETECTION - A CASE FOR DATA MINING

NETWORK FAULT DETECTION - A CASE FOR DATA MINING NETWORK FAULT DETECTION - A CASE FOR DATA MINING Poonam Chaudhary & Vikram Singh Department of Computer Science Ch. Devi Lal University, Sirsa ABSTRACT: Parts of the general network fault management problem,

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION

SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION TAMPERE UNIVERSITY OF TECHNOLOGY SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION Master of Science Thesis Examiner: Prof. Tapio Elomaa Examiners and topic approved

More information

K-means based data stream clustering algorithm extended with no. of cluster estimation method

K-means based data stream clustering algorithm extended with no. of cluster estimation method K-means based data stream clustering algorithm extended with no. of cluster estimation method Makadia Dipti 1, Prof. Tejal Patel 2 1 Information and Technology Department, G.H.Patel Engineering College,

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

CS 584 Data Mining. Classification 3

CS 584 Data Mining. Classification 3 CS 584 Data Mining Classification 3 Today Model evaluation & related concepts Additional classifiers Naïve Bayes classifier Support Vector Machine Ensemble methods 2 Model Evaluation Metrics for Performance

More information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering W10.B.0.0 CS435 Introduction to Big Data W10.B.1 FAQs Term project 5:00PM March 29, 2018 PA2 Recitation: Friday PART 1. LARGE SCALE DATA AALYTICS 4. RECOMMEDATIO SYSTEMS 5. EVALUATIO AD VALIDATIO TECHIQUES

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Further Thoughts on Precision

Further Thoughts on Precision Further Thoughts on Precision David Gray, David Bowes, Neil Davey, Yi Sun and Bruce Christianson Abstract Background: There has been much discussion amongst automated software defect prediction researchers

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 20: 10/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection Volume 3, Issue 9, September-2016, pp. 514-520 ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Classification of Concept Drifting

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2018 Admin Assignment 2 is due Friday. Assignment 1 grades available? Midterm rooms are now booked. October 18 th at 6:30pm (BUCH A102

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

CS4491/CS 7265 BIG DATA ANALYTICS

CS4491/CS 7265 BIG DATA ANALYTICS CS4491/CS 7265 BIG DATA ANALYTICS EVALUATION * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Dr. Mingon Kang Computer Science, Kennesaw State University Evaluation for

More information

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Role of big data in classification and novel class detection in data streams

Role of big data in classification and novel class detection in data streams DOI 10.1186/s40537-016-0040-9 METHODOLOGY Open Access Role of big data in classification and novel class detection in data streams M. B. Chandak * *Correspondence: hodcs@rknec.edu; chandakmb@gmail.com

More information

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Timothy Glennan, Christopher Leckie, Sarah M. Erfani Department of Computing and Information Systems,

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Classification. Instructor: Wei Ding

Classification. Instructor: Wei Ding Classification Part II Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1 Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification

More information

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process Vol.133 (Information Technology and Computer Science 2016), pp.79-84 http://dx.doi.org/10.14257/astl.2016. Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT Dr. G APPARAO 1*, Mr. A SRINIVAS 2* 1. Professor, Chairman-Board of Studies & Convener-IIIC, Department of Computer Science Engineering,

More information

1. Inroduction to Data Mininig

1. Inroduction to Data Mininig 1. Inroduction to Data Mininig 1.1 Introduction Universe of Data Information Technology has grown in various directions in the recent years. One natural evolutionary path has been the development of the

More information

Identification of the correct hard-scatter vertex at the Large Hadron Collider

Identification of the correct hard-scatter vertex at the Large Hadron Collider Identification of the correct hard-scatter vertex at the Large Hadron Collider Pratik Kumar, Neel Mani Singh pratikk@stanford.edu, neelmani@stanford.edu Under the guidance of Prof. Ariel Schwartzman( sch@slac.stanford.edu

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem

Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap Department of Mathematics,

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

COMP 465 Special Topics: Data Mining

COMP 465 Special Topics: Data Mining COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES K. P. M. L. P. Weerasinghe 149235H Faculty of Information Technology University of Moratuwa June 2017 AUTOMATED STUDENT S

More information

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Non-trivial extraction of implicit, previously unknown and potentially useful information from data CS 795/895 Applied Visual Analytics Spring 2013 Data Mining Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs795-s13/ What is Data Mining? Many Definitions Non-trivial extraction of implicit, previously

More information

Evaluating Machine Learning Methods: Part 1

Evaluating Machine Learning Methods: Part 1 Evaluating Machine Learning Methods: Part 1 CS 760@UW-Madison Goals for the lecture you should understand the following concepts bias of an estimator learning curves stratified sampling cross validation

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

A Cloud Based Intrusion Detection System Using BPN Classifier

A Cloud Based Intrusion Detection System Using BPN Classifier A Cloud Based Intrusion Detection System Using BPN Classifier Priyanka Alekar Department of Computer Science & Engineering SKSITS, Rajiv Gandhi Proudyogiki Vishwavidyalaya Indore, Madhya Pradesh, India

More information

A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering

A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering Gurpreet Kaur M-Tech Student, Department of Computer Engineering, Yadawindra College of Engineering, Talwandi Sabo,

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Spam Filtering Using Visual Features

Spam Filtering Using Visual Features Spam Filtering Using Visual Features Sirnam Swetha Computer Science Engineering sirnam.swetha@research.iiit.ac.in Sharvani Chandu Electronics and Communication Engineering sharvani.chandu@students.iiit.ac.in

More information