Classification of Data Streams with Skewed Distribution


Classification of Data Streams with Skewed Distribution

Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering

by

Abhijeet B. Godase
Roll No:

under the guidance of

Prof. V. Z. Attar

Department of Computer Engineering and Information Technology
College of Engineering, Pune
June 2012

Dedicated to my mother Smt. Aruna B. Godase and my father Shri. Balasaheb J. Godase for their love, endless support and encouragement.

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY, COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation titled Classification of Data Streams with Skewed Distribution has been successfully completed by Abhijeet B. Godase ( ) and is approved for the degree of Master of Technology, Computer Engineering.

Prof. V. Z. Attar, Guide, Department of Computer Engineering and Information Technology, College of Engineering, Pune, Shivaji Nagar, Pune

Dr. Jibi Abraham, Head, Department of Computer Engineering and Information Technology, College of Engineering, Pune, Shivaji Nagar, Pune

Date:

Abstract

The emerging domain of data stream mining is one of the important areas of research for the data mining community. The data streams in various real life applications are characterized by concept drift. Such data streams may also be characterized by skewed or imbalanced class distributions, for example in financial fraud detection and network intrusion detection. In such cases the skewed class distribution of the stream compounds the problems associated with classifying stream instances. Learning from such skewed data streams results in a classifier which is biased towards the majority class. Thus the model or classifier built on such skewed data streams tends to misclassify the minority class examples. In some applications, for instance financial fraud detection, the identification of fraudulent transactions is the main focus, because misclassification of such minority class instances might result in heavy losses, in this case financial. Increasingly higher losses due to misclassification of such minority class instances cannot be ruled out in many other data stream applications as well. The challenge, therefore, is to pro-actively identify such minority class instances in order to avoid the losses associated with them. As an effort in this direction, we propose a method using a k nearest neighbours approach and an oversampling technique to classify such skewed data streams. Oversampling is achieved by making use of minority class examples which are retained from the stream as time progresses. Experimental results show that our approach achieves good classification performance on synthetic as well as real world datasets.

Acknowledgements

I express my sincere gratitude towards my guide Prof. V. Z. Attar for her constant help, encouragement and inspiration throughout the project work. Without her invaluable guidance, this work would never have been successful. She also guided me through the essence of time management, the need for efficient organization, and presentation skills. I would also like to thank the Head of the Department, Dr. Jibi Abraham, and all other faculty members who made my journey of post-graduation and technical learning such an enjoyable experience. I would also like to thank Dr. Albert Bifet, University of Waikato, New Zealand for his help and guidance during the implementation of the project. Last but not least, I would like to thank all my classmates for their support and help throughout the course of the project.

Abhijeet B. Godase
College of Engineering, Pune
May 30, 2012

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 Introduction to Data Mining and Techniques
      Classification
  1.2 An Overview of Data Streams
  1.3 Data Stream Classification
  1.4 An Overview of Skewed Data Sets in the Real World
  1.5 Issues in Learning from Skewed Data Streams
  1.6 Thesis Outline

2 Literature Survey
  2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches
      2.1.1 Oversampling
      2.1.2 Under-sampling
      2.1.3 Cost Sensitive Learning

3 Problem Description
  3.1 Motivation
  3.2 Problem Statement
      How did we reach our problem statement
      Why are Existing Classifiers Weak?
  3.3 Evaluation Metrics

4 Our Approach
  4.1 Approach to Deal with Skewed Data Streams
  4.2 Algorithm for Skewed Data Streams

5 Experimental Results and Discussions
  5.1 Experimental Setup
  5.2 Datasets
      5.2.1 Synthetically Generated Datasets
      5.2.2 Results of Our Approach on Synthetically Generated Datasets
      5.2.3 Real World Datasets
      5.2.4 Results of Our Approach on Real World Datasets
      5.2.5 Effect of Noise Level on the Performance of the Approach

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

A Publications
  A.1 IDSL: Imbalanced Data Stream Learner
  A.2 Classification of Data Streams with Skewed Distribution

List of Tables

5.1 Description of Synthetic Datasets used
5.2 Description of Electricity Pricing Dataset
5.3 Description of Real World Datasets

List of Figures

1.1 Classification as a task of mapping input attribute set x into its class label y
1.2 Classification model in data streams
1.3 Skewed Distributions, each data chunk has fewer positive examples than negative examples
4.1 Our Approach to Classify Data Streams with Skewed Distributions
5.1 Comparison of AUROC on SPH Datasets
5.2 Comparison of G-Mean on SPH Datasets
5.3 Comparison of F-Measure on SPH Datasets
5.4 Comparison of Overall Accuracies on SPH Datasets
5.5 Comparison of AUROC on SEA Datasets
5.6 Comparison of G-Mean on SEA Datasets
5.7 Comparison of F-Measure on SEA Datasets
5.8 Comparison of Overall Accuracies on SEA Datasets
5.9 Comparison of AUROC on Real World Datasets
5.10 Comparison of G-Mean on Real World Datasets
5.11 Comparison of F-Measure on Real World Datasets
5.12 Comparison of Overall Accuracies on Real World Datasets
5.13 Effect on AUROC when noise levels are varied
5.14 Effect on G-Mean when noise levels are varied
5.15 Effect on F-Measure when noise levels are varied
5.16 Effect on Overall Accuracy when noise levels are varied

Chapter 1

Introduction

This chapter introduces the basics of the project area and gives some insight into the problem domain under consideration.

1.1 Introduction to Data Mining and Techniques

In the internet age, the explosion of data and information has resulted in huge amounts of data. Fortunately, data mining techniques exist to gather knowledge from such abundant data. As per the definition by Jiawei Han in his book Data Mining: Concepts and Techniques [18], data mining is the extraction of interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data. Data mining has been used in various areas like health care, business intelligence, financial trade analysis, network intrusion detection etc. The general process of knowledge discovery from data involves data cleaning, data integration, data selection, data mining, pattern evaluation and knowledge presentation. Data cleaning and data integration constitute data preprocessing, in which data is processed so that it becomes appropriate for the data mining process. Data mining forms the core part of the knowledge discovery process. There exist various data mining techniques, viz. classification, clustering, association rule mining etc. Our work falls mainly under the classification technique.

Classification

Classification is one of the important techniques of data mining. It involves using a model, built by learning from historical data, to make predictions about the class labels of new data/observations. Formally, it is the task of learning a target function f that maps each attribute set x to one of a set of predefined class labels y. The classification model learned from historical data is nothing but this target function. It can serve as a tool to distinguish between objects of different classes as well as to predict the class labels of unknown records. Fig 1.1 shows the classification task, which maps an attribute set x to its class label y.

Figure 1.1: Classification as a task of mapping input attribute set x into its class label y

Classification is a pervasive problem that encompasses many diverse applications, right from static datasets to data streams. Classification tasks have been employed on static data over the years. In the last decade, more and more applications featuring data streams have been evolving, which are a challenge to traditional classification algorithms.

1.2 An Overview of Data Streams

Many real world applications, such as network traffic monitoring, credit card transactions, real time surveillance systems, electric power grids, remote sensors, web click streams etc., generate continuously arriving data known as data streams [4],[2]. Unlike traditional data sets, data streams arrive continuously at varying speeds. Data streams are fast changing, temporally ordered, potentially infinite and massive [18]. It may be impossible to store the entire data stream in memory or to go through it more than once due to its voluminous nature. Thus there is a need for single-scan, multidimensional, online stream analysis methods.

In today's world of data explosion, where data is increasing by terabytes and even petabytes, stream data has rightly captured our data mining needs of today. Even if the complete set of data can be collected and stored, it is quite expensive to go through such huge data multiple times.

1.3 Data Stream Classification

Since classification can help decision making by predicting class labels for given data based on past records, classification on stream data has been extensively studied in recent years, with many interesting algorithms developed; some of them are cited here: [4],[21]. Fig 1.2 depicts the classification model in data streams. As shown in Fig 1.2, data chunks C_1, C_2, C_3, ..., C_i arrive one by one.

Figure 1.2: Classification model in data streams

Each chunk contains positive instances P_i and negative instances Q_i. Suppose C_1, C_2, C_3, ..., C_i are labelled. At time stamp m+1, when an unlabelled chunk C_{m+1} arrives, the classification model predicts the labels of instances in C_{m+1} on the basis of the previously labelled data chunks. When experts give the true class labels of the instances in C_{m+1}, the chunk can join the training set, resulting in more and more labelled data chunks.
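To make this predict-then-train cycle concrete, the following is a minimal sketch of chunk-by-chunk processing, assuming the true labels become available after prediction. The Model interface, the double[][] chunk representation and retrain are illustrative placeholders of our own, not MOA classes.

import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sketch of the chunk-by-chunk predict-then-train cycle.
 *  Model, the double[][] chunk representation and retrain(...) are
 *  illustrative placeholders, not MOA classes. */
public class ChunkCycleSketch {
    interface Model { int predict(double[] instance); }

    static void run(Iterable<double[][]> stream, int maxStoredChunks) {
        Deque<double[][]> labelledChunks = new ArrayDeque<>();
        Model model = instance -> 0;   // trivial model before any training
        for (double[][] chunk : stream) {
            // Predict labels for the newly arrived, still unlabelled chunk.
            for (double[] instance : chunk) {
                int predicted = model.predict(instance);
            }
            // Once experts supply true labels, the chunk joins the training
            // set; storage constraints force the oldest chunk out.
            if (labelledChunks.size() == maxStoredChunks) {
                labelledChunks.removeFirst();
            }
            labelledChunks.addLast(chunk);
            model = retrain(labelledChunks);
        }
    }

    static Model retrain(Deque<double[][]> chunks) {
        return instance -> 0;   // stand-in for building a real classifier
    }
}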

Because of storage constraints, it is critical to judiciously select labelled examples that can represent the current distribution well. Most studies on stream mining assume relatively balanced and stable data streams. However, many applications can involve concept-drifting data streams with skewed distributions. In data with skewed distributions, each data chunk has many fewer positive instances. Fig 1.3 depicts this concept diagrammatically. At the same time, the loss functions associated with the positive and negative classes are also unbalanced. Misclassifying positive instances can have serious effects in some applications, such as a stream of financial transactions.

Figure 1.3: Skewed Distributions, each data chunk has fewer positive examples than negative examples

1.4 An Overview of Skewed Data Sets in the Real World

The rate at which science and technology have developed has resulted in a proliferation of data at an exponential pace. This unrestrained increase in data has intensified the need for various applications of data mining. This huge data is in no way necessarily equally distributed. Class skew or class imbalance refers to domains wherein instances of one class outnumber instances of the other class, i.e. some classes occupy the majority of the dataset and are known as majority classes, while the other classes are known as minority classes. The most vital issue in these kinds of data sets is that, compared to the majority classes, the minority classes are often of much more significance and interest to the user.

There are many real world applications in which datasets exhibit such a skewed nature. The following paragraphs give an overview of some real world problems that exhibit it.

Financial Fraud Detection: In financial fraud detection, the majority of financial transactions are genuine and legitimate, and a very small number of them may be fraudulent [5].

Network Intrusion Detection: Here, a small number of malicious activities are hidden among the voluminous routine network traffic [30],[35]. Usually there are thousands of access requests every day. Among all these requests, the number of malicious connections is, in most cases, very small compared to the number of normal connections. Obviously, building a good model that can effectively detect future attacks is crucial so that the system can respond promptly in case of network intrusions.

Medical Fraud Detection: In medical fraud detection, the percentage of bogus claims is small, but the total loss is significant.

Real Time Video Surveillance: Imbalance is seen in data that arrives as video sequences [34].

Oil Spillage: Oil spills are detected in satellite radar images [22].

Astronomy: Skewed data sets exist in the astronomical field also; only 0.001% of the objects in sky survey images are truly beyond the scope of current science and may lead to new discoveries [27].

Spam Image Detection: In spam image detection, near-duplicate spam images are difficult to discover among the large number of spam images [33].

Text Classification: In text classification, data is imbalanced in aspects such as text number, class size, subclass and class fold [24].

Health Care: The health care domain has a classic example of class imbalance: rare diseases affect a very negligible number of people, but the consequences involved are very severe. It is extremely vital to correctly detect and classify the rare diseases and the affected patients; any error in such cases might be fatal.

The above real world examples signify the importance of dealing with imbalanced data sets. Most of the above examples also fall into the category of skewed data streams. Most learning algorithms work well with balanced data streams, as their aim is to improve overall accuracy or a related measure. When such algorithms are applied to skewed data streams, their accuracy in classifying majority examples is good but their accuracy in classifying minority examples is poor. This happens because learning from such imbalanced/skewed streams causes the learner to become biased towards the majority class; thus the minority examples are likely to be misclassified. Thus the main issue the research community is working on with regard to skewed data streams is correctly classifying the minority data instances without affecting the accuracy on majority data instances. In recent years, learning from skewed data streams has been recognized as one of the crucial problems in machine learning and data mining. In essence, the principal problem is improving the accuracy on both the minority and the majority class instances of the stream.

1.5 Issues in Learning from Skewed Data Streams

In general, learning from skewed data streams is challenging due to the following issues.

1. Evaluation metric: The appropriate choice of evaluation metrics is important in this domain. Evaluation metrics play a vital role in data mining; they are used to guide algorithms towards the desired solution. Thus if an evaluation metric does not take the minority class into consideration, the learning algorithm will not be able to cope well with skewed data streams. Standard evaluation metrics like overall accuracy are not valid in this case, because even if minority class instances are misclassified the overall accuracy may remain high, the primary reason being the negligible number of minority class instances.

2. Lack of minority class data for training: In skewed data streams, the lack of minority class data makes it difficult to learn class boundaries, as very few instances are available. Thus training a classifier in such situations is very difficult.

3. Treatment of minority class data as noise: One of the major issues is that of noise. Noisy data in the streams affects the minority class more than the majority data.

Furthermore, standard stream mining algorithms tend to treat the minority class data as noise.

4. As stated earlier, data streams are usually massive and arrive at varying speeds, which makes it difficult to store them completely; multiple scans are nearly impossible, and hence learning from such streams is difficult.

5. Data streams also undergo considerable concept evolution over time.

From the above points we can see that classification of skewed data streams is a multi-fold problem. All the above issues need to be addressed in order to design an appropriate learning algorithm for skewed data streams.

1.6 Thesis Outline

The rest of the thesis is organized as follows: Chapter 2 gives a brief description of the literature survey we carried out. Chapter 3 describes our problem statement and how we arrived at it; it also focuses on the different evaluation metrics chosen by us throughout this thesis. Chapter 4 elaborates our approach to deal with data streams with skewed distribution of classes and explains our algorithm and its implementation details. Chapter 5 contains the results and discussions of the implemented algorithm and the evaluation of its performance. Chapter 6 provides the conclusion and the future enhancements that can be carried out.

Chapter 2

Literature Survey

This chapter gives brief details of our literature survey carried out in the area of imbalanced datasets and the skewed data streams problem.

2.1 Overview of Methods for Dealing with Skewed Data Streams - Traditional Approaches

We went through various methods available in the literature for dealing with imbalanced datasets, and here we portray some of the well known and most popular approaches, algorithms and methods that have been devised to deal with skewed data streams. Some of the books we referred to in order to gain an effective understanding of data mining concepts are Data Mining: Concepts and Techniques by Han and Kamber [18] and Introduction to Data Mining by Kumar et al [23]. In the literature there are a number of methods addressing the class imbalance problem, but the area of skewed data streams is relatively new to the research community. The sampling based and ensemble algorithms are the simplest yet effective ones. The following paragraphs provide a brief overview of them. The approaches for dealing with skewed data streams are categorised under the following methods:

Oversampling.

Under-sampling.

Cost Sensitive Learning.

Oversampling and under-sampling are sampling based preprocessing methods of data mining. The main idea in these methods is to manipulate the data distribution such that all the classes are represented well in the training or learning datasets. Recent studies in this domain have shown that sampling is an effective method for dealing with such problems. Cost sensitive learning basically associates costs with misclassifying examples in order to penalise the classifier.

2.1.1 Oversampling

Oversampling is one of the sampling based preprocessing techniques in data mining. In oversampling, the number of minority class instances is increased either by reusing instances from previous training/learning chunks or by creating synthetic examples. Oversampling tries to strike a balance between the majority and minority classes. One advantage of this method is that normal stream classification methods can then be used. The most commonly used method of oversampling is SMOTE (Synthetic Minority Oversampling Technique) [7], sketched below.
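To make the idea concrete, here is a minimal sketch of SMOTE-style synthetic example generation, which interpolates between a minority instance and one of its nearest minority neighbours. This is our own illustration under those assumptions, not the WEKA implementation of SMOTE.

import java.util.Random;

/** Minimal sketch of SMOTE-style oversampling: a synthetic minority example
 *  is interpolated between a minority instance and one of its nearest
 *  minority neighbours. Illustrative only, not the WEKA implementation. */
public class SmoteSketch {
    static double[] synthesize(double[] x, double[] neighbour, Random rnd) {
        double gap = rnd.nextDouble();            // random point on the segment
        double[] synthetic = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            synthetic[i] = x[i] + gap * (neighbour[i] - x[i]);
        }
        return synthetic;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0}, b = {2.0, 4.0};  // two minority instances
        double[] s = synthesize(a, b, new Random(7));
        System.out.println(s[0] + ", " + s[1]);   // lies on the segment a-b
    }
}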

Some of the oversampling based approaches in the literature are discussed below. Most of the stream classification algorithms available assume that streams have a balanced distribution of classes. In the last few years, a few attempts have been made to address the problem of dealing with skewed data streams. The first such attempt was by Gao et al [16], who proposed the SE (Sampling + Ensemble) approach, which processes the stream in batches. In the SE approach, each classifier in the ensemble is trained by drawing an uncorrelated sample of negative instances together with all the positive instances in the current training chunk as well as the positive instances of all previous training chunks. Thus in the SE approach, oversampling of positive instances is done by incorporating old positive examples, along with under-sampling by way of using disjoint subsets of negative examples.

The SERA (Selectively Recursive Approach) framework was proposed by Chen and He [9]. In this framework they selectively absorb minority examples from previous chunks into the current training chunk to balance it. The similarity measure used to select minority examples from previous chunks is the Mahalanobis distance. This differs from the approach of Gao et al [16], which takes in all previous minority examples. In the SERA framework, a single hypothesis based on the current training chunk is maintained for making predictions on the test data chunk.

The MuSeRA (Multiple Selectively Recursive Approach) algorithm by Chen and He [10] was their further work after SERA to deal with imbalanced data stream classification. In MuSeRA, balancing of the training chunk is done in a similar way, using the Mahalanobis distance as the similarity measure to accommodate minority samples accumulated from all previous training chunks. In MuSeRA a hypothesis is built on every training chunk; thus a set of hypotheses is built over time, as opposed to SERA, which maintains only a single hypothesis. The set of hypotheses at time-stamp i is used to predict the classes of instances in the test chunk at time-stamp i.

In their further work in this area, Chen and He [8] proposed an approach named REA (Recursive Ensemble Approach), in which, when the next training chunk arrives, it is balanced by adding those positive instances from previous chunks which are nearest neighbours of the positive instances in the current training chunk; the balanced chunk is then used to build a soft-typed hypothesis. In REA, a new soft-typed hypothesis is built for every training chunk. It then uses weighted majority voting to predict the posterior probabilities of test instances, where weights are assigned to the different hypotheses based on their performance on the current training chunk.

2.1.2 Under-sampling

Under-sampling is another sampling based method, which addresses the problem by reducing the number of majority class instances. This is generally done by filtering out majority class instances or by randomly selecting an appropriate number of majority class examples. Under-sampling is mostly carried out using clustering: the best representatives of the majority class are chosen and the training chunk is balanced accordingly. Some of the under-sampling based approaches in the literature are discussed below.

Zhang et al [32] proposed another algorithm to deal with skewed data streams, using clustering + sampling. Sampling was carried out by using the k-means algorithm to form clusters of the negative examples in the current training chunk, and the centroid of each cluster formed was then used to represent that cluster. The number of clusters formed was equal to the number of positive examples in the current training batch, and thus the current training batch was updated by taking all positive examples along with the centroids of the clusters of negative samples. A new classifier was created on these sampled instances. Further, the size of the ensemble was fixed, so from all the classifiers present in the ensemble along with the new classifier built on sampled instances, AUROC was used as the measure to select the best classifiers to be included in the ensemble. Weights were assigned to the classifiers on the basis of the calculated AUROC, and the ensemble thus formed was used to classify the instances in the test chunk. The work by Song et al [28], which addresses the issue of skewed data streams in cloud security, follows a similar approach of using k-means clustering to draw centroids of clusters of the negative class in order to under-sample it.

Recently, Zhang et al [36] proposed an approach called ECSDS (Ensemble Classifier for Skewed Data Streams), which aims at reducing the time needed for ensemble learning by updating the ensemble only when required. In this algorithm the ensemble is initially built in a similar way to the approach of Gao et al [16], but the mechanism to update the ensemble is a bit different. When the ensemble makes predictions for a test batch, the positive instances which are misclassified are retained as MI. After prediction on two batches, the difference of the F1 values on these two batches is calculated. If the difference crosses a threshold value, then the ensemble is updated using all misclassified positive instances MI and all previous positive instances AP, along with the negative instances in the current batch.
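A minimal sketch of the centroid-based under-sampling idea used by the clustering approaches above: run k-means on the negative (majority) instances of a chunk and keep only the k centroids, where k is the number of positive instances. The method name and the fixed iteration count are our own simplifications.

import java.util.Random;

/** Minimal sketch of centroid-based under-sampling: cluster the negative
 *  (majority) instances with k-means and keep only the centroids.
 *  Illustrative only; a fixed number of Lloyd iterations is assumed. */
public class CentroidUndersampling {
    static double[][] centroids(double[][] negatives, int k, int iters, Random rnd) {
        int dim = negatives[0].length;
        double[][] c = new double[k][dim];
        for (int j = 0; j < k; j++)                      // random initialisation
            c[j] = negatives[rnd.nextInt(negatives.length)].clone();
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (double[] x : negatives) {               // assignment step
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = 0.0;
                    for (int i = 0; i < dim; i++) {
                        double diff = x[i] - c[j][i];
                        d += diff * diff;
                    }
                    if (d < bestD) { bestD = d; best = j; }
                }
                count[best]++;
                for (int i = 0; i < dim; i++) sum[best][i] += x[i];
            }
            for (int j = 0; j < k; j++)                  // update step
                if (count[j] > 0)
                    for (int i = 0; i < dim; i++) c[j][i] = sum[j][i] / count[j];
        }
        return c;   // these centroids replace the majority instances
    }
}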

2.1.3 Cost Sensitive Learning

Cost sensitive learning is one of the important techniques of data mining. It assigns different misclassification penalties to each class. Cost sensitive learning has been incorporated into classification algorithms by taking the cost information into account and trying to optimize the overall cost during the learning process. In cost sensitive classification, the problem is dealt with by adjusting the learning: costs are associated with misclassification of the minority class, and the learner is adjusted based on a punishment-reward system. One advantage of this method is that the training dataset is unchanged, unlike oversampling and under-sampling, wherein the data distribution changes completely. A small illustration of cost-based prediction is sketched at the end of this subsection.

Polikar et al [11] proposed an algorithm named Learn++SMOTE, in which the algorithm updates the distribution of instance weights by evaluating the current training chunk D_n on the ensemble and accordingly adjusting the weights of misclassified instances to form a new instance weighting distribution D_m. A new classifier is trained on this data chunk D_m, and then a new data subset is created by calling SMOTE. All the classifiers in the ensemble are evaluated on this synthetic dataset, and the classifiers with lower error are chosen to form the ensemble. While doing so, if the new classifier has an error of more than 0.5 it is discarded and a new one is created, but if an older classifier has an error of more than 0.5 its error is set to 0.5, since it might be useful if the stream follows a cyclic nature.

Cooper et al [26] proposed an online approach to deal with skewed data streams: an incremental learning algorithm for skewed data stream classification. They argue that in batch learning algorithms the update of the model is delayed until the next training chunk is received, but in some cases there is a need to update the model with every new instance. They build the initial ensemble model in a similar way to the approach of Gao et al [16], the difference being that they draw a random subset of negative instances with replacement. Once the initial model is built, for every new incoming instance of the positive class all the base models are updated. If the new incoming instance is of the negative class, it is used for updating the models with a probability of n/p, where n and p are the numbers of negative and positive instances observed so far, respectively.

Liu et al [25] proposed RDFCSDS (Reusing Data for Classifying Skewed Data Streams). In this algorithm, before training the ensemble it is checked whether concept drift has occurred. If it has, the current ensemble is cleared and built from scratch by sampling data from the current training chunk as well as the previous training chunk, thus reusing previous data. If no concept drift has occurred, a new classifier is built on the current training chunk, AUROC is calculated for all the classifiers in the ensemble, and the top k classifiers by AUROC value are chosen to form the ensemble. Prediction of instances in the test chunk is done by weighted majority voting.
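Returning to the cost-sensitive idea introduced at the start of this subsection, the following tiny sketch shows prediction by minimum expected misclassification cost; the cost values are purely illustrative assumptions.

/** Tiny sketch of cost-sensitive prediction: pick the class with the lowest
 *  expected misclassification cost. Cost values are illustrative only. */
public class CostSensitiveSketch {
    // COST[actual][predicted], class 0 = minority, class 1 = majority:
    // missing a minority instance (row 0, column 1) is penalised most.
    static final double[][] COST = {
            {0.0, 10.0},
            {1.0,  0.0}
    };

    /** pMinority is the posterior probability from any base classifier. */
    static int predict(double pMinority) {
        double[] p = {pMinority, 1.0 - pMinority};
        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int predicted = 0; predicted < 2; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < 2; actual++) {
                expected += p[actual] * COST[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;   // favours the minority class whenever 10 * p[0] > p[1]
    }
}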

From the above discussion we can see that most of the approaches proposed so far have used oversampling and/or under-sampling to balance the training chunks while dealing with skewed data streams, while some have opted for the cost sensitive classification approach. Also, as we have seen previously, most applications do exhibit characteristics of skewed data streams, and there is still room for improvement in this area.

Chapter 3

Problem Description

This chapter gives the motivation behind choosing the problem under consideration and how we arrived at the problem statement.

3.1 Motivation

While working on identification of a project topic in the area of data mining, we found that a lot of work has already been done in different areas of data mining with respect to static datasets. Further, in the last decade the class imbalance problem in static datasets has drawn the attention of the data mining community, due to which various workshops at different conferences were dedicated specifically to the problem of class imbalance. The first such workshop was organized way back in 2000 at the AAAI 2000 conference; another workshop on Learning from Imbalanced Datasets was organized at ICML. Recently, another workshop named Data Mining when Classes are Imbalanced and Errors have Cost was organized at the PAKDD 2009 conference. These points drew our attention towards the problem of class imbalance. Since a considerable amount of work had already been done on the class imbalance problem for static datasets, we went on to look for another area wherein class imbalance is of more primary concern. We found that data streams are an area where class imbalance has not been thoroughly studied. We then concentrated on various real life applications in which data streams are prominent: areas like network intrusion detection and financial fraud detection are characterised by stream data.

Looking at the characteristics of some data streams, we could see that most real life data streams are characterised by class imbalance; that is, they are skewed data streams. As mentioned earlier in Section 1.4, many applications are characterised by skewed data streams. Thus our main focus was brought to skewed data streams. Having seen various real life examples of skewed data streams, let us consider one of them, say financial fraud detection. As we know, lakhs of transactions are in progress at any particular instant, and very few of them are fraudulent. Identifying such transactions is a difficult task. Further, identification of new patterns of fraudulent transactions can be tedious if we ask domain experts to label each of them. Thus, in such a case, identifying these minority class examples is the important task. In these scenarios, methods like random sampling of instances, where the randomly sampled examples are given to a domain expert for labelling, are not a good idea. If we miss out on identifying such examples, we are inviting financial losses. In the case of network intrusion detection, depending upon the network, there will be both normal connection requests and malicious connection requests, and either type of connection may be in the minority. If the malicious connection requests are in the minority, then identifying them becomes the primary goal. From the above discussion it is quite evident that there is a need to identify or classify minority class examples in skewed data streams in an effective manner.

3.2 Problem Statement

As seen in the previous section, after concentrating on the problem of dealing with skewed data streams, we formally framed our problem statement as: To design and develop a general classification framework that will correctly classify the minority class instances along with improvement in the evaluation metrics of data streams with skewed distribution of classes.

How did we reach our problem statement

We realised how important it is to identify the minority class examples in skewed data streams in applications such as those mentioned in Section 1.4. For instance, consider financial fraud detection, wherein the fraudulent transaction attempts, which are very few in number, are the minority instances that need early identification. Misidentification or misclassification of such minority examples may result in financial losses. Thus we can see that these minority class instances have more valued loss functions associated with them. Further, data streams are also characterised by concept drift, which complicates the problem. Thus, dealing with skewed data streams is a multi-fold problem. Identification of minority class examples can be achieved by the classification task of data mining. Thus we drilled down to our problem statement.

Why are Existing Classifiers Weak?

Most existing stream classifiers cannot be used to classify skewed data streams. These classifiers face the issues mentioned in Section 1.5. Further, these classifiers assume that examples in the stream have a fairly balanced distribution over the different classes. Hence standard stream classifiers are dominated by the majority class instances, and thus they tend to consider the minority class instances as noise.

3.3 Evaluation Metrics

Although most stream classifiers measure their performance based on overall accuracy, in the case of imbalanced datasets such a measure is not appropriate. Consider, for example, two classifiers being tested on an imbalanced dataset with a class distribution ratio of 0.99:0.01. If classifier one classifies all 99% of the majority class correctly but none of the minority class, whereas classifier two classifies 97% of the majority class correctly and 80% of the minority class correctly, then in terms of overall accuracy classifier one beats classifier two; but if the minority class is the main class under focus, then classifier two is the one that should be chosen.
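A quick worked calculation with these numbers makes the point. With a 0.99:0.01 class distribution:

Accuracy(classifier one) = 0.99 × 1.00 + 0.01 × 0.00 = 0.990
Accuracy(classifier two) = 0.99 × 0.97 + 0.01 × 0.80 = 0.968

So classifier one wins on overall accuracy while detecting no minority instances at all, which is exactly the wrong outcome for skewed streams.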

Thus, in such cases, various performance measures have been suggested in the literature. In this section we present the evaluation metrics that we have used.

Confusion Matrix: The columns of the confusion matrix represent the predictions, and the rows represent the actual class. Correct predictions always lie on the diagonal of the matrix. Equation 3.1 shows the general structure of the confusion matrix.

ConfusionMatrix = ( TP  FN
                    FP  TN )    (3.1)

wherein True Positives (TP) indicate the number of instances of the minority class that were correctly predicted, True Negatives (TN) indicate the number of instances of the majority class that were correctly predicted, False Positives (FP) indicate the number of instances of the majority class that were incorrectly predicted as minority class instances, and False Negatives (FN) indicate the number of instances of the minority class that were incorrectly predicted as majority class instances. Though the confusion matrix gives a better outlook on how the classifier performed than accuracy alone, a more detailed analysis is preferable, which is provided by the further metrics.

Recall: Recall is a metric that gives the percentage of the actual minority class members that the classifier correctly identified. (TP + FN) represents the total of all minority members. Recall is given by equation 3.2.

Recall = TP / (TP + FN)    (3.2)

Precision: It gives the percentage of the minority class instances, as determined by the model or classifier, that actually belong to the minority class. (TP + FP) represents the total of positive predictions by the classifier. Precision is given by equation 3.3.

Precision = TP / (TP + FP)    (3.3)

Thus, in general, it is said that Recall is a completeness measure and Precision is an exactness measure. The ideal classifier would give a value of 1 for both Recall and Precision, but if a classifier scores higher (closer to one) on one of these metrics and lower on the other, choosing between classifiers is a difficult task. For such cases, some other metrics, discussed below, are suggested in the literature. A few metrics based on the above ones have been suggested for use with imbalanced data sets or streams; performance measures like AUROC (Area Under the ROC Curve) [12],[20] and G-Mean [17] are well suited to such situations.

F-Measure: It is the harmonic mean of Precision and Recall. We can say that it is essentially an average of the two percentages, and it greatly simplifies the comparison between classifiers. It is given by equation 3.4.

F-Measure = 2 / (1/Recall + 1/Precision)    (3.4)

G-Mean: G-Mean is a metric that measures the balanced performance of a learning algorithm between the classes. It is given by equation 3.5, where the True Negative Rate (TNR) is given by equation 3.6.

G-Mean = sqrt(Recall × TNR)    (3.5)

TNR = TN / (TN + FP)    (3.6)

Area Under ROC Curve: The area under the ROC (Receiver Operating Characteristics) curve gives the probability that, when one draws one majority and one minority class example at random, the decision function assigns a higher value to the minority class example than to the majority class example. AUROC is not sensitive to the class distribution of the dataset. Generally the ROC curve is plotted as the True Positive Rate versus the False Positive Rate. AUROC was mainly used in signal detection theory and the medical domain, where it was described as the plot of Sensitivity versus (1 − Specificity); since Sensitivity is the same as the True Positive Rate and Specificity is (1 − False Positive Rate), (1 − Specificity) reduces to the False Positive Rate, and thus both definitions of AUROC are one and the same.
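As a minimal sketch, the metrics of this section can be computed directly from the four confusion matrix counts; the class below is our own illustration (the counts in main are made up), not code from MOA or WEKA.

/** Minimal sketch: evaluation metrics from a binary confusion matrix
 *  (minority class taken as positive). Names are illustrative only. */
public class SkewMetrics {
    private final double tp, fn, fp, tn;

    public SkewMetrics(double tp, double fn, double fp, double tn) {
        this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
    }

    public double recall()    { return tp / (tp + fn); }              // equation 3.2
    public double precision() { return tp / (tp + fp); }              // equation 3.3
    public double fMeasure()  {                                       // equation 3.4
        double r = recall(), p = precision();
        return 2.0 / (1.0 / r + 1.0 / p);
    }
    public double tnr()       { return tn / (tn + fp); }              // equation 3.6
    public double gMean()     { return Math.sqrt(recall() * tnr()); } // equation 3.5

    public static void main(String[] args) {
        // Example: 990 majority and 10 minority instances, 8 minority found.
        SkewMetrics m = new SkewMetrics(8, 2, 30, 960);
        System.out.printf("Recall=%.3f Precision=%.3f F=%.3f G-Mean=%.3f%n",
                m.recall(), m.precision(), m.fMeasure(), m.gMean());
    }
}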

Chapter 4

Our Approach

As mentioned in the previous chapter, after the literature survey we came to the conclusion that the issue of skewed data streams certainly needs more attention and that there is still room for improvement in the performance of existing classifiers. Thus we came up with our own approach to deal with skewed data streams.

4.1 Approach to Deal with Skewed Data Streams

This section gives a brief description of our approach to deal with the classification of data streams with skewed distribution. In general, instead of using a single model built from a single training set, we propose to use multiple models built from different sets: we use an ensemble of classifiers/models to classify data streams with skewed distributions. The use of an ensemble of classifiers has been proven effective in dealing with concept-drifting data streams [31]. In our approach we have also used oversampling so as to balance the training chunk contents. Fig 4.1 shows our approach in brief. We process the data stream in chunks. If the sets of instances used for training the classifiers are disjoint, then the classifiers will make uncorrelated errors, which can be eliminated by averaging [16]. On similar lines, we build a new classifier for the ensemble for every incoming batch, such that the instances of the training chunks are as uncorrelated as possible. The best way to learn from a data stream is to construct the model on the most recent training chunk. This works for instances of the majority class, because they are in abundance in every chunk, but not for the minority class, whose samples are very few. Some of the earlier approaches [7],[11] depend on the creation of synthetic examples to deal with this problem.

In our approach, we store the minority class examples from previous training chunks. To balance a training chunk's distribution, we oversample using the minority class instances stored over the period of time.

Figure 4.1: Our Approach to Classify Data Streams with Skewed Distributions

While balancing the training chunk, we follow the K nearest neighbour approach: we select examples from the minority examples accumulated over time. To balance the training chunk, we find the k nearest neighbours, within the current training chunk, of each minority example stored from previous chunks. Out of those, an appropriate number of minority samples are used to balance the training chunk. Using that training chunk, we build a new classifier, which is added to the ensemble. The size of the ensemble is maintained at 10; when the size of the ensemble goes beyond this specified limit, the classifiers with the best AUROC (Area Under ROC Curve) values are chosen.

The process of learning is continuous, as can be seen from Fig 4.1. As the stream keeps coming in, the ensemble is continuously updated. Meanwhile, the test examples are classified by taking predictions from the ensemble. The predictions are made by weighted majority voting among the classifiers. The weights of the classifiers in the ensemble are assigned as per their AUROC values, and while taking the majority vote the weights are normalized. Thus our approach combines oversampling with the k nearest neighbour algorithm and an ensemble based approach to deal with data streams with skewed distribution of classes. Algorithm 1 presents in detail the K nearest neighbours algorithm used by our approach; a small code sketch of this step follows the listing.

Algorithm 1: K Nearest Neighbours Algorithm
Input: The set of training examples, i.e. the set of examples from which nearest neighbours are to be determined.
Output: The K nearest neighbours.
begin
  Determine the parameter K = number of nearest neighbours.
  Calculate the distance between the query instance and all the training samples.
  Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
  Return the K nearest neighbours.
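A minimal sketch of Algorithm 1 with Euclidean distance; the helper names are our own, and a real implementation would plug in the stream's attribute vectors.

import java.util.Arrays;
import java.util.Comparator;

/** Minimal sketch of the k-nearest-neighbours step used to pick stored
 *  minority examples close to the current chunk (illustrative names). */
public class KnnHelper {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the k training samples nearest to the query. */
    static int[] kNearest(double[] query, double[][] train, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort all training samples by distance to the query instance.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(query, train[i])));
        int[] out = new int[Math.min(k, train.length)];
        for (int i = 0; i < out.length; i++) out[i] = idx[i];
        return out;
    }
}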

4.2 Algorithm for Skewed Data Streams

Algorithm 2 presents the algorithmic implementation of the approach we have come up with to deal with the classification of skewed data streams. The algorithm processes the data stream in batches or chunks.

Algorithm 2: CDSSDC: Classifier for Data Streams with Skewed Distribution of Classes
Input: Training chunk B_i and test chunk B_{i+1} arriving at the current time-stamp t_i;
       ensemble of classifiers Z_{i-1} built from t = 0 to i-1;
       AP, the set of all positive instances accumulated from the previous training chunks from t = 0 to i-1;
       balance ratio f, and ratio of positive to negative instances in the training chunk γ.
Output: Ensemble Z_i, predictions for the test chunk B_{i+1}.
begin
  foreach time-stamp t_i = 1, 2, ... do
    Split the current training chunk B_i into P and N, containing the positive and negative examples respectively.
    if f > (t_i - 1) γ then
      TrainSet = B_i ∪ AP
    else
      Find the K nearest neighbours, within the current training chunk, of each instance of AP; read the positive class instances among these into D_i and sort D_i in descending order.
      Read the first (f - γ)|B_i| instances from D_i into E_i.
      TrainSet = B_i ∪ E_i
    Build a new classifier C_i on TrainSet.
    if |Z_{i-1}| < 10 then
      Z_i = Z_{i-1} ∪ {C_i}
    else
      Find the AUROC of all classifiers in Z_{i-1} on B_i.
      C_j = the classifier with the lowest AUROC in Z_{i-1}.
      if AUROC(C_j) < AUROC(C_i) then
        Z_i = Z_{i-1} ∪ {C_i} - {C_j}
    Assign weights to the classifiers in Z_i according to AUROC.
    Normalize the weights of the classifiers in Z_i.
    AP = AP ∪ P

Thus the input to the system is a skewed data stream, in chunks or batches. We assume that at any given instant the training and test chunks are available: at time-stamp t_i, the training chunk B_i and the test chunk B_{i+1} are available as input to the algorithm. Further, the ensemble of classifiers built from time-stamp t_0 to t_{i-1} is also given as input, as are all the minority class samples accumulated from the start of the stream till the last time-stamp. A balance ratio of 0.5 is set initially. It is assumed that for each training chunk we know the ratio of positive to negative class instances. The training batch that arrives at the current time-stamp is split into positive and negative class instances. It is then checked whether an adequate number of minority (positive) class instances has been accumulated to balance the batch. If the number is not adequate, then all the retained examples are added, along with the positive and negative class instances, into the train set. If the retained examples are more than required, then for each of those minority class instances the K nearest neighbours within the current training batch are found. The nearest neighbours that are of the minority class are then sorted in descending order of nearness. Minority class examples corresponding to the first x examples of this ordered list are then selected from the retained minority samples set, where x is chosen such that the balance ratio of 0.5 is maintained. Thus the train set is created, and a new classifier is built on it. This newly built classifier is then added to the ensemble of classifiers. If the number of classifiers in the ensemble exceeds 10, then the AUROC of each classifier is computed on the original training data chunk and the 10 classifiers with the best AUROC values are chosen to be part of the ensemble. The classifiers are then assigned weights based on the calculated AUROC values. For making predictions on the test batch B_{i+1}, weighted majority voting is taken and the test instances are assigned predictions accordingly. A test-then-train pattern is followed for the data chunks.
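The ensemble maintenance step of Algorithm 2 (add the new classifier, evict the weakest by AUROC once the limit of 10 is reached, then weight and normalize) can be compressed into the following sketch; WeightedModel and the AUROC scorer are stand-ins of our own, not MOA APIs.

import java.util.List;
import java.util.function.ToDoubleFunction;

/** Compressed sketch of the CDSSDC ensemble update described above.
 *  WeightedModel and the AUROC scorer are illustrative stand-ins. */
final class WeightedModel {
    double weight;                 // normalized AUROC-based voting weight
    final Object model;            // placeholder for a trained base classifier
    WeightedModel(Object model) { this.model = model; }
}

public class EnsembleUpdateSketch {
    static final int MAX_SIZE = 10;            // ensemble size limit from the thesis

    /** Adds a freshly trained model; if full, swaps out the weakest by AUROC. */
    static void update(List<WeightedModel> ensemble, WeightedModel fresh,
                       ToDoubleFunction<WeightedModel> aurocOnChunk) {
        if (ensemble.size() < MAX_SIZE) {
            ensemble.add(fresh);
        } else {
            // Replace the weakest member only if the new model beats it on AUROC.
            WeightedModel weakest = ensemble.get(0);
            for (WeightedModel m : ensemble)
                if (aurocOnChunk.applyAsDouble(m)
                        < aurocOnChunk.applyAsDouble(weakest)) weakest = m;
            if (aurocOnChunk.applyAsDouble(weakest)
                    < aurocOnChunk.applyAsDouble(fresh)) {
                ensemble.remove(weakest);
                ensemble.add(fresh);
            }
        }
        // Assign AUROC-based weights and normalize them for weighted voting.
        double total = 0.0;
        for (WeightedModel m : ensemble) total += aurocOnChunk.applyAsDouble(m);
        for (WeightedModel m : ensemble) m.weight = aurocOnChunk.applyAsDouble(m) / total;
    }
}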

Chapter 5

Experimental Results and Discussions

This chapter presents the results obtained after carrying out experiments on the algorithm we developed.

5.1 Experimental Setup

This section gives an overview of the experimental setup used to perform our experiments and test the effectiveness of our algorithm. For the implementation of our algorithm we have used MOA [6], an open source framework for mining data streams. It is implemented in Java and has a collection of various data stream mining algorithms. MOA being an open source framework, we could easily add and integrate our algorithm into it. Details of the other hardware and software that we used are as follows: we used a release of MOA along with a version of WEKA on the Windows 7 Professional operating system, running on a dual core AMD Opteron processor with 2 GB RAM.

5.2 Datasets

To evaluate the performance of our algorithm, we have tested it on various synthetic as well as real world datasets. We have used datasets available at the UCI repository [13] and at the MOA dataset repository [1], along with various synthetically generated datasets. These datasets were chosen such that they model real world scenarios, have a variety of features, and vary largely in size and class distribution.

5.2.1 Synthetically Generated Datasets

This subsection presents the details of the synthetic datasets that we generated; they are briefly explained later in this section. The main reason for generating these datasets was that they focus on one of the main characteristics of data streams, that is, concept drift. Table 5.1 shows the characteristics of the synthetically generated datasets: the number of instances, the number of majority class instances, the number of minority class instances, the number of attributes, the chunk size, and the ratio of majority class to minority class.

Table 5.1: Description of Synthetic Datasets used

Dataset   Instances   Max-Class   Min-Class   Attributes   Chunk-size   Ratio
SPH 1%    1,00,000    -           -           -            -            0.99::0.01
SPH 3%    1,00,000    -           -           -            -            0.97::0.03
SPH 5%    1,00,000    -           -           -            -            0.95::0.05
SPH 10%   1,00,000    -           -           -            -            0.90::0.10
SEA 1%    80,000      -           -           3            1000         0.99::0.01
SEA 3%    80,000      -           -           3            1000         0.97::0.03
SEA 5%    80,000      -           -           3            1000         0.95::0.05
SEA 10%   80,000      -           -           3            1000         0.90::0.10

Description of Synthetically Generated Datasets

SEA Dataset: The SEA dataset is generated to model abrupt concept drift. It was proposed by Street and Kim [29]. The instances of the SEA dataset are generated in a 3-dimensional feature space with two classes.

Each of these 3 attributes or features has a value between 0 and 10. Of these attributes, only the first two are relevant; the third is added to act as noise. If the sum of the first two attributes crosses a certain threshold value, then the instance belongs to class 1, else it belongs to class 2. The threshold values generally used are 8, 9, 7 and 9.5 for the four data blocks. From each block, some of the instances are reserved as test instances. We generated 80,000 instances of the SEA dataset with different percentages of class imbalance. These were used in batches of 1000 each; that is, a batch of 1000 instances was used for training, and then the next 1000 instances were used for testing the model built. We generated 4 different SEA datasets, with 1%, 3%, 5% and 10% skew added to them. To each of these datasets we added 1% noise, so as to make the task of learning more difficult; noise was added by reversing the correct class label to an incorrect one in the training set.

Spinning Hyperplane Dataset: The SPH (Spinning Hyperplane) dataset is generated to model gradual concept drift. It was proposed by Wang et al [31]. The SPH dataset defines a class boundary in n dimensions by coefficients α_1, α_2, α_3, ..., α_n. An instance d = (d_1, d_2, d_3, ..., d_n) is generated by randomizing each attribute in the range 0 to 1; thus each attribute value is between 0 and 1. A constant bias is defined as

α_0 = (1/2)(α_1 + α_2 + ... + α_n)

and the class label l for an instance d is then 1 if α_1·d_1 + α_2·d_2 + ... + α_n·d_n ≥ α_0, and 0 otherwise. As opposed to the abrupt concept drift in the SEA dataset, the SPH dataset is characterised by gradual concept drift: some of the coefficients α_i are sampled at random to have a small increment δ added, where δ is defined as

δ = s · t / N

Here t is the magnitude of the change for every N examples, and s alternates between -1 and 1, thus specifying the direction of change; s has a 20% chance of being reversed every N examples. α_0 is also modified as per the equation mentioned earlier. In this way the class boundary acts like a spinning hyperplane during data generation.
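Under the equations above, a minimal SPH-style generator can be sketched as follows; the dimensionality, drift magnitude and seed are assumed values of our own.

import java.util.Random;

/** Minimal sketch of an SPH-style spinning hyperplane stream generator,
 *  following the equations above. n, t, N and the seed are assumed values. */
public class SpinningHyperplane {
    public static void main(String[] args) {
        int n = 10;          // dimensionality (assumed)
        int N = 1000;        // examples between direction checks
        double t = 0.1;      // total magnitude of change per N examples
        Random rnd = new Random(42);

        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) alpha[i] = rnd.nextDouble();
        int s = 1;           // direction of the drift

        for (int count = 1; count <= 5000; count++) {
            // Instance with each attribute randomized in [0, 1].
            double[] d = new double[n];
            double dot = 0.0, alpha0 = 0.0;
            for (int i = 0; i < n; i++) {
                d[i] = rnd.nextDouble();
                dot += alpha[i] * d[i];
                alpha0 += alpha[i];
            }
            alpha0 *= 0.5;                        // alpha_0 = (1/2) * sum(alpha_i)
            int label = (dot >= alpha0) ? 1 : 0;  // class from the hyperplane
            if (count <= 3) {
                System.out.println("label of instance " + count + " = " + label);
            }
            // Gradual drift: add delta = s*t/N to a randomly chosen coefficient.
            alpha[rnd.nextInt(n)] += s * t / N;
            // s has a 20% chance of reversing every N examples.
            if (count % N == 0 && rnd.nextDouble() < 0.2) s = -s;
        }
    }
}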

5.2.2 Results of Our Approach on Synthetically Generated Datasets

We present the graphical results of our approach in comparison with some other state of the art algorithms. Details of the algorithms used for comparison are as follows:

Our approach, CDSSDC, which uses the K nearest neighbour approach to balance the current training chunk; the ratio of positive to negative examples is kept at 0.5.

The SMOTE algorithm [7], which creates synthetic examples to balance the dataset. The implementation of SMOTE available in the WEKA toolkit [14] has been used with the default WEKA settings, except for the SPH and SEA datasets, where SMOTE was applied so as to maintain the ratio of positive to negative examples at 0.5.

FLC [3], which was designed to deal with concept drift in data streams efficiently.

From Fig 5.1 we can clearly see that our approach gives better performance than SMOTE and FLC in terms of AUROC on the SPH datasets at various imbalance levels. As the class skew eases from 99:1 towards 90:10, the FLC and SMOTE algorithms come close in performance to our approach. Fig 5.2 shows the comparative performance of our approach, SMOTE and FLC with respect to G-Mean on the SPH datasets. Here we can see that our approach performs better than SMOTE and FLC for all the class distributions of the SPH datasets shown. Here also we can observe that the other algorithms come closer to catching up as the class skew eases.

Figure 5.1: Comparison of AUROC on SPH Datasets

Figure 5.2: Comparison of G-Mean on SPH Datasets

Fig 5.3 gives the comparative details of the F-Measure values obtained by our approach, SMOTE and the FLC algorithm. Here we can observe that our approach achieves better values than the other algorithms. Fig 5.4 shows the overall accuracies of SMOTE, CDSSDC and FLC on the SPH datasets with varied class imbalance levels. Here we can see that our approach maintains almost the same or better accuracy than SMOTE and FLC.

Figure 5.3: Comparison of F-Measure on SPH Datasets

Figure 5.4: Comparison of Overall Accuracies on SPH Datasets

From Fig 5.5 we can clearly see that our approach gives better performance than SMOTE and FLC in terms of AUROC on the SEA datasets at various imbalance levels. As the class skew eases from 99:1 towards 90:10, the FLC and SMOTE algorithms come very close in performance to our approach. Fig 5.6 shows the comparative performance of our approach, SMOTE and FLC with respect to the G-Mean values obtained. Here we can see that our approach performs a bit better than SMOTE and FLC for all the class distributions of the SEA datasets shown. Here also we can observe that the other algorithms come closer as the class skew eases.


More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

Fraud Detection using Machine Learning

Fraud Detection using Machine Learning Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments

More information

NETWORK FAULT DETECTION - A CASE FOR DATA MINING

NETWORK FAULT DETECTION - A CASE FOR DATA MINING NETWORK FAULT DETECTION - A CASE FOR DATA MINING Poonam Chaudhary & Vikram Singh Department of Computer Science Ch. Devi Lal University, Sirsa ABSTRACT: Parts of the general network fault management problem,

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION

SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION TAMPERE UNIVERSITY OF TECHNOLOGY SYED ABDUL SAMAD RANDOM WALK OVERSAMPLING TECHNIQUE FOR MI- NORITY CLASS CLASSIFICATION Master of Science Thesis Examiner: Prof. Tapio Elomaa Examiners and topic approved

More information

K-means based data stream clustering algorithm extended with no. of cluster estimation method

K-means based data stream clustering algorithm extended with no. of cluster estimation method K-means based data stream clustering algorithm extended with no. of cluster estimation method Makadia Dipti 1, Prof. Tejal Patel 2 1 Information and Technology Department, G.H.Patel Engineering College,

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

A Comparative Study of Selected Classification Algorithms of Data Mining

A Comparative Study of Selected Classification Algorithms of Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.220

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data

Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Batch-Incremental vs. Instance-Incremental Learning in Dynamic and Evolving Data Jesse Read 1, Albert Bifet 2, Bernhard Pfahringer 2, Geoff Holmes 2 1 Department of Signal Theory and Communications Universidad

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

CS 584 Data Mining. Classification 3

CS 584 Data Mining. Classification 3 CS 584 Data Mining Classification 3 Today Model evaluation & related concepts Additional classifiers Naïve Bayes classifier Support Vector Machine Ensemble methods 2 Model Evaluation Metrics for Performance

More information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering W10.B.0.0 CS435 Introduction to Big Data W10.B.1 FAQs Term project 5:00PM March 29, 2018 PA2 Recitation: Friday PART 1. LARGE SCALE DATA AALYTICS 4. RECOMMEDATIO SYSTEMS 5. EVALUATIO AD VALIDATIO TECHIQUES

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Further Thoughts on Precision

Further Thoughts on Precision Further Thoughts on Precision David Gray, David Bowes, Neil Davey, Yi Sun and Bruce Christianson Abstract Background: There has been much discussion amongst automated software defect prediction researchers

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 20: 10/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection

Classification of Concept Drifting Data Streams Using Adaptive Novel-Class Detection Volume 3, Issue 9, September-2016, pp. 514-520 ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Classification of Concept Drifting

More information

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018 CPSC 340: Machine Learning and Data Mining Outlier Detection Fall 2018 Admin Assignment 2 is due Friday. Assignment 1 grades available? Midterm rooms are now booked. October 18 th at 6:30pm (BUCH A102

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER P.Radhabai Mrs.M.Priya Packialatha Dr.G.Geetha PG Student Assistant Professor Professor Dept of Computer Science and Engg Dept

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

CS4491/CS 7265 BIG DATA ANALYTICS

CS4491/CS 7265 BIG DATA ANALYTICS CS4491/CS 7265 BIG DATA ANALYTICS EVALUATION * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Dr. Mingon Kang Computer Science, Kennesaw State University Evaluation for

More information

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Role of big data in classification and novel class detection in data streams

Role of big data in classification and novel class detection in data streams DOI 10.1186/s40537-016-0040-9 METHODOLOGY Open Access Role of big data in classification and novel class detection in data streams M. B. Chandak * *Correspondence: hodcs@rknec.edu; chandakmb@gmail.com

More information

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning

Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Improved Classification of Known and Unknown Network Traffic Flows using Semi-Supervised Machine Learning Timothy Glennan, Christopher Leckie, Sarah M. Erfani Department of Computing and Information Systems,

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Classification. Instructor: Wei Ding

Classification. Instructor: Wei Ding Classification Part II Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1 Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification

More information

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process Vol.133 (Information Technology and Computer Science 2016), pp.79-84 http://dx.doi.org/10.14257/astl.2016. Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT Dr. G APPARAO 1*, Mr. A SRINIVAS 2* 1. Professor, Chairman-Board of Studies & Convener-IIIC, Department of Computer Science Engineering,

More information

1. Inroduction to Data Mininig

1. Inroduction to Data Mininig 1. Inroduction to Data Mininig 1.1 Introduction Universe of Data Information Technology has grown in various directions in the recent years. One natural evolutionary path has been the development of the

More information

Identification of the correct hard-scatter vertex at the Large Hadron Collider

Identification of the correct hard-scatter vertex at the Large Hadron Collider Identification of the correct hard-scatter vertex at the Large Hadron Collider Pratik Kumar, Neel Mani Singh pratikk@stanford.edu, neelmani@stanford.edu Under the guidance of Prof. Ariel Schwartzman( sch@slac.stanford.edu

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment

A System for Managing Experiments in Data Mining. A Thesis. Presented to. The Graduate Faculty of The University of Akron. In Partial Fulfillment A System for Managing Experiments in Data Mining A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Greeshma

More information

Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem

Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap Department of Mathematics,

More information

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE Ms.S.Muthukakshmi 1, R. Surya 2, M. Umira Taj 3 Assistant Professor, Department of Information Technology, Sri Krishna College of Technology, Kovaipudur,

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

COMP 465 Special Topics: Data Mining

COMP 465 Special Topics: Data Mining COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES K. P. M. L. P. Weerasinghe 149235H Faculty of Information Technology University of Moratuwa June 2017 AUTOMATED STUDENT S

More information

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Non-trivial extraction of implicit, previously unknown and potentially useful information from data CS 795/895 Applied Visual Analytics Spring 2013 Data Mining Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs795-s13/ What is Data Mining? Many Definitions Non-trivial extraction of implicit, previously

More information

Evaluating Machine Learning Methods: Part 1

Evaluating Machine Learning Methods: Part 1 Evaluating Machine Learning Methods: Part 1 CS 760@UW-Madison Goals for the lecture you should understand the following concepts bias of an estimator learning curves stratified sampling cross validation

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

A Cloud Based Intrusion Detection System Using BPN Classifier

A Cloud Based Intrusion Detection System Using BPN Classifier A Cloud Based Intrusion Detection System Using BPN Classifier Priyanka Alekar Department of Computer Science & Engineering SKSITS, Rajiv Gandhi Proudyogiki Vishwavidyalaya Indore, Madhya Pradesh, India

More information

A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering

A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering Gurpreet Kaur M-Tech Student, Department of Computer Engineering, Yadawindra College of Engineering, Talwandi Sabo,

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Spam Filtering Using Visual Features

Spam Filtering Using Visual Features Spam Filtering Using Visual Features Sirnam Swetha Computer Science Engineering sirnam.swetha@research.iiit.ac.in Sharvani Chandu Electronics and Communication Engineering sharvani.chandu@students.iiit.ac.in

More information