Racing for unbalanced methods selection

The International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2013)
Racing for unbalanced methods selection
Andrea Dal Pozzolo, Olivier Caelen, Serge Waterschoot and Gianluca Bontempi
22/10/2013
Machine Learning Group, Université Libre de Bruxelles
Business Analytics Competence Center, Atos Worldline

Table of contents
1 Introduction
2 Unbalanced problem
3 Unbalanced techniques comparison
4 Racing
5 Conclusion and future work

Introduction
A common problem in data mining is dealing with unbalanced datasets, in which one class vastly outnumbers the other in the training data. State-of-the-art classification algorithms suffer when the data is skewed towards one class [8]. Several techniques have been proposed to cope with unbalanced data, but no technique appears to work consistently better in all conditions. We propose to use a racing method to adaptively select the most appropriate strategy for a given unbalanced task.

Unbalanced problem
A dataset is unbalanced when the class of interest (minority class) is much rarer than the other (majority class). The unbalanced nature of the data is typical of many applications such as medical diagnosis, text classification and credit card fraud detection. The cost of missing a minority class example is typically much higher than that of missing a majority class example. Proposed strategies essentially belong to the following categories: sampling, ensemble, cost-based and distance-based.

Existing methods for unbalanced data
Sampling methods: Undersampling [5], Oversampling [5], SMOTE [3]
Ensemble methods: BalanceCascade [11], EasyEnsemble [11]
Cost-based methods: Cost proportional sampling [6], Costing [19]
Distance-based methods: Tomek link [15], Condensed Nearest Neighbor (CNN) [7], One Side Selection (OSS) [9], Edited Nearest Neighbor (ENN) [17], Neighborhood Cleaning Rule (NCL) [10]

Unbalanced strategies
Sampling techniques up-sample or down-sample a class to rebalance the classes; SMOTE generates synthetic minority examples. Ensemble techniques combine an unbalanced method with a classifier to explore the majority and minority class distributions. Cost-based techniques use the misclassification costs to rebalance the dataset. Distance-based techniques use distances between input points to undersample or to remove noisy and borderline examples.

Sampling methods (figures adapted from Nitesh Chawla's SIAM 2009 tutorial)
Undersampling: randomly remove majority class examples. Risk of losing potentially important majority class examples that help establish the discriminating power.
Oversampling: replicate the minority class examples to increase their relevance. No new information is added, which hurts the generalization capacity.
SMOTE (Synthetic Minority Over-sampling Technique) [3]: instead of replicating, invent new synthetic minority instances. A further idea is to focus on the borderline and noisy examples.
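
As an illustration, a minimal R sketch of the two random sampling schemes (not the authors' implementation; the data frame `train` and its binary factor column `class`, with minority level "1", are hypothetical names):

```r
# Random under-/over-sampling sketch for a binary classification data frame.
rebalance <- function(train, method = c("under", "over")) {
  method <- match.arg(method)
  minority <- which(train$class == "1")
  majority <- which(train$class != "1")
  if (method == "under") {
    # keep all minority examples, draw an equal-sized random majority subset
    keep <- c(minority, sample(majority, length(minority)))
  } else {
    # replicate minority examples with replacement up to the majority size:
    # no new information is added, only duplicates
    keep <- c(majority, sample(minority, length(majority), replace = TRUE))
  }
  train[keep, ]
}
```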

Fraud detection problem
Credit card fraud detection [13, 4, 14] is a highly unbalanced problem. Fraudulent behaviour evolves over time, changing the distribution of the frauds, so a method that worked well in the past could become inaccurate later.

Datasets
1 real credit card fraud dataset provided by a payment service provider in Belgium, plus 9 datasets from UCI [1]. Some datasets are reduced to speed up computations.

ID  Dataset name   Size    Inputs  Prop. class 1  Class 1
1   fraud          527026  51      0.39%          Fraud = 1
2   breastcancer   698     10      34.52%         class = 4
3   car            1727    6       3.76%          class = Vgood
4   forest         38501   54      7.13%          class = Cottonwood/Willow
5   letter         19999   16      3.76%          letter = W
6   nursery        12959   8       2.53%          class = very recom
7   pima           768     8       34.89%         class = 1
8   satimage       6433    36      9.73%          class = 4
9   women          1472    9       22.62%         class = long-term
10  spam           4601    57      42.14%         class = 1

Fraud Data - F-measure
Figure: Comparison of strategies for unbalanced data (unbalanced, over, under, EasyEnsemble, balancecascade, SMOTE, SMOTEnsemble, EnnSmote, ENN, NCL, CNN, OSS, Tomek, TomekUnder, TomekEasyEnsemble, TomekSMOTE, costbalance, costing) in terms of F-measure for the Fraud dataset, using different supervised algorithms (Bayes, Neural Network, Tree, SVM, Random Forest), where F-measure = 2 · Precision · Recall / (Precision + Recall).
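
For reference, the metric itself is a one-liner; a small R helper, assuming factor vectors `pred` and `truth` with the minority class as the positive label "1":

```r
# F-measure as used above: the harmonic mean of precision and recall.
fmeasure <- function(pred, truth, positive = "1") {
  tp        <- sum(pred == positive & truth == positive)
  precision <- tp / sum(pred == positive)
  recall    <- tp / sum(truth == positive)
  2 * precision * recall / (precision + recall)
}
```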

Friedman test over all datasets using RF and F-measure
In the table a cell is marked as (+) if the rank difference between the method in the row and the method in the column is positive, (-) otherwise. The table shows the level of significance using *** (α = 0.001), ** (α = 0.01), * (α = 0.05) and . (α = 0.1).
Figure: Comparison of strategies (Cascade, CNN, costbal, costing, EasyEns, ENN, EnnSmt, NCL, OSS, over, SMT, SMTEns, Tomek, TomekE, TomekS, TomekU, unbal, under) using a post-hoc Friedman test in terms of F-measure for a RF classifier over multiple datasets.

Racing idea
With no prior information about the data distribution, it is difficult to decide which unbalanced strategy to use. No single strategy is consistently superior to all others in all conditions (i.e. algorithm, dataset and performance metric). Under different conditions, such as fraud evolution, the best method may change. Testing all unbalanced techniques is not an option because of the associated computational cost. We propose to use the Racing approach [12] to perform strategy selection.

Racing for strategy selection
Racing consists in testing a set of alternatives in parallel and using a statistical test to remove an alternative when it is significantly worse than the others. We adopted the F-race version [2] to search efficiently for the best strategy for unbalanced data. F-race combines the Friedman test with Hoeffding Races [12].

Racing for unbalanced technique selection
Automatically select the most adequate technique for a given dataset:
1 Test in parallel a set of alternative balancing strategies on a subset of the dataset.
2 Progressively remove the alternatives which are significantly worse.
3 Iterate the testing and removal steps until only one candidate is left or no more data is available.

Example race (F-measure per candidate on successive subsets; a blank cell means the candidate has been eliminated):
          Candidate 1  Candidate 2  Candidate 3
subset 1  0.50         0.47         0.48
subset 2  0.51         0.48         0.30
subset 3  0.51         0.47
subset 4  0.60         0.45
subset 5  0.55

F-race method
Use 10-fold cross validation to provide the data during the race. Every time new data is added to the race, the Friedman test is used to remove significantly bad candidates. We made a comparison of Cross Validation and F-race in terms of F-measure.
Figure: a race over the 10 folds of the original dataset; candidates are eliminated as folds accumulate over time.
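
A compact sketch of the race loop using R's built-in friedman.test. The `evaluate` function, returning one F-measure per surviving candidate on a given fold, is an assumed placeholder, and the elimination rule is a deliberate simplification of the paired post-hoc comparisons used by the original F-race:

```r
# Simplified F-race: feed CV folds one at a time, test with Friedman,
# eliminate candidates whose mean rank is far from the current best.
frace <- function(candidates, evaluate, n_folds = 10, alpha = 0.05) {
  scores <- NULL                        # folds x surviving-candidates matrix
  alive  <- candidates
  for (i in seq_len(n_folds)) {
    scores <- rbind(scores, evaluate(alive, fold = i))
    if (nrow(scores) >= 2 && length(alive) > 1) {
      # Friedman test: folds are blocks (rows), candidates are groups (columns)
      if (friedman.test(scores)$p.value < alpha) {
        mean_rank <- colMeans(t(apply(-scores, 1, rank)))   # rank 1 = best
        cd <- qnorm(1 - alpha) *
              sqrt(length(alive) * (length(alive) + 1) / (6 * nrow(scores)))
        keep   <- mean_rank <= min(mean_rank) + cd          # crude post-hoc
        scores <- scores[, keep, drop = FALSE]
        alive  <- alive[keep]
      }
    }
    if (length(alive) == 1) break       # race decided before using all folds
  }
  alive
}
```

Called as frace(c("under", "over", "SMOTE"), evaluate), the function returns the names of the strategies still alive when the race ends.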

F-race Vs Cross Validation
(CV always performs 180 tests; N test and Gain refer to F-race. Mean and Sd are F-measure statistics for the selected strategy; Pval is reported when F-race selects a different strategy than CV.)

Dataset        Algo  Best CV method              F-race method   N test  Gain  Mean   Sd     Pval
Fraud          RF    SMOTEnsemble                SMOTEnsemble    44      76%   0.100  0.016  -
Fraud          SVM   over                        over            46      74%   0.084  0.017  -
Breast Cancer  RF    balancecascade              balancecascade  180     0%    0.963  0.035  -
Breast Cancer  SVM   under                       under           180     0%    0.957  0.038  -
Car            RF    OSS                         OSS             108     40%   0.970  0.039  -
Car            SVM   over                        over            93      48%   0.944  0.052  -
Forest         RF    balancecascade              balancecascade  60      67%   0.911  0.012  -
Forest         SVM   ENN                         ENN             64      64%   0.809  0.011  -
Letter         RF    balancecascade              balancecascade  73      59%   0.981  0.010  -
Letter         SVM   over                        over            44      76%   0.953  0.022  -
Nursery        RF    SMOTE                       SMOTE           76      58%   0.809  0.047  -
Nursery        SVM   over                        over            58      68%   0.875  0.052  -
Pima           RF    under                       under           136     24%   0.691  0.045  -
Pima           SVM   EasyEnsemble (0.675/0.071)  costbalance     110     39%   0.674  0.060  0.107
Satimage       RF    balancecascade              balancecascade  132     27%   0.719  0.033  -
Satimage       SVM   balancecascade              balancecascade  90      50%   0.662  0.044  -
Spam           RF    SMOTE                       SMOTE           122     32%   0.942  0.015  -
Spam           SVM   SMOTEnsemble (0.917/0.018)  SMOTE           135     25%   0.918  0.020  0.266
Women          RF    TomekUnder                  TomekUnder      150     17%   0.488  0.051  -
Women          SVM   EnnSmote                    EnnSmote        102     43%   0.492  0.073  -

Conclusion
The class unbalanced problem is well known, and different techniques and metrics have been proposed. The best strategy is extremely dependent on the nature of the data, the algorithm adopted and the performance measure. F-race is able to automatise the selection of the best unbalanced strategy for a given unbalanced problem without exploring the whole dataset. For the fraud dataset, the unbalanced strategy chosen had a big impact on the accuracy of the results, and F-race is crucial in adapting the strategy to fraud evolution.

Future work
1 Release an R package for unbalanced datasets.
2 Adopt Racing for incremental learning / data streams.
Acknowledgment: The work of Andrea Dal Pozzolo was supported by the Doctiris programme of Innoviris (Brussels Institute for Research and Innovation), Belgium.

F-race Vs Cross Validation II
For almost all datasets F-race is able to return the best method according to the cross validation (CV) assessment. In the Pima and Spam datasets F-race returns a sub-optimal strategy that is not significantly worse than the best (p-value greater than 0.05). The Gain column shows the computational gain (as a percentage of the CV tests) obtained by using F-race. Apart from the Breast Cancer dataset, in all the other cases F-race allows a significant computational saving with no loss in performance.

UCI BreastCancer - F-measure
Figure: Comparison of techniques for unbalanced data (unbalanced, over, under, EasyEnsemble, balancecascade, misscascade, missremove, SMOTE, SMOTEnsemble, EnnSmote, ENN, NCL, CNN, OSS, Tomek, TomekUnder, TomekEasyEnsemble, TomekSMOTE, costbalance, costing) on the UCI Breast Cancer dataset with a Random Forest classifier, in terms of F-measure (range 0.90-1.00).

Tutorial SMOTE, R package [16] (figures from Nitesh Chawla's SIAM 2009 tutorial)
1. For each minority example k, compute its nearest minority class examples (i, j, l, n, m).
2. Randomly choose an example out of the 5 closest points, say i.
3. Synthetically generate a new event k1, such that k1 lies between k and i.
4. After applying SMOTE 3 times (SMOTE parameter = 300%), the dataset is augmented with the synthetic examples.
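
A bare-bones R sketch of steps 1-3 above, for numeric features only; the matrix `X` of minority-class examples and the function name are assumptions, not the package's API:

```r
# Generate n_new synthetic minority points by interpolating towards one of
# the k nearest minority neighbours of a randomly chosen minority example.
smote_sketch <- function(X, n_new, k = 5) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf                               # a point is not its own NN
  synth <- matrix(NA_real_, n_new, ncol(X))
  for (s in seq_len(n_new)) {
    i   <- sample(nrow(X), 1)                  # pick a minority example k
    nn  <- order(d[i, ])[seq_len(k)]           # its k nearest minority points
    j   <- sample(nn, 1)                       # pick one neighbour at random
    gap <- runif(1)                            # interpolate between the two
    synth[s, ] <- X[i, ] + gap * (X[j, ] - X[i, ])
  }
  synth
}
```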

Balance Cascade [11]
BalanceCascade explores the majority class in a supervised manner: starting from the unbalanced training set, build a balanced dataset (all minority examples plus a majority subset) and train a model; classify the majority observations left out; remove the correctly classified majority examples to form a new training set; keep removing majority class examples until none is misclassified.
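
A hedged R sketch of the cascade; `fit` and `predict_fn` stand in for an arbitrary learner and its prediction function (both hypothetical), and for simplicity all remaining majority examples are classified at each step:

```r
# BalanceCascade sketch: train on balanced subsets, progressively discard
# majority examples the current model already classifies correctly.
balance_cascade <- function(train, fit, predict_fn, max_iter = 5) {
  minr   <- train[train$class == "1", ]        # minority, kept throughout
  majr   <- train[train$class != "1", ]        # majority, shrunk each round
  models <- list()
  for (it in seq_len(max_iter)) {
    sub <- majr[sample(nrow(majr), min(nrow(minr), nrow(majr))), , drop = FALSE]
    models[[it]] <- fit(rbind(minr, sub))      # model on a balanced set
    correct <- predict_fn(models[[it]], majr) == "0"
    majr <- majr[!correct, , drop = FALSE]     # drop correctly classified ones
    if (nrow(majr) == 0) break                 # none misclassified: stop
  }
  models
}
```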

Easy Ensemble [11]
EasyEnsemble learns different aspects of the original majority class in an unsupervised manner: several independent balanced subsets are sampled from the majority class, a classifier is trained on each (together with all minority examples), and their outputs are combined.
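
A minimal R sketch under the same hypothetical `fit` placeholder; each model sees an independent balanced subset, and predictions would be aggregated across the returned models:

```r
# EasyEnsemble sketch: one model per independently sampled balanced subset.
easy_ensemble <- function(train, fit, n_models = 10) {
  minority <- which(train$class == "1")
  majority <- which(train$class != "1")
  lapply(seq_len(n_models), function(m) {
    sub <- c(minority, sample(majority, length(minority)))
    fit(train[sub, ])                  # aggregate predictions downstream
  })
}
```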

Cost proportional sampling [6]
Positive and negative examples are sampled by the ratio:
p(1) · FNcost : p(0) · FPcost
where p(1) and p(0) are the prior class probabilities. Proportional sampling with replacement produces duplicated cases, with the risk of overfitting.
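
A sketch of the scheme in R: drawing with replacement with per-example weights equal to the class misclassification costs yields the expected class ratio above (the `train$class` convention is the same hypothetical one as before):

```r
# Cost-proportionate sampling with replacement: positives weighted by the
# false-negative cost, negatives by the false-positive cost.
cost_sample <- function(train, fn_cost, fp_cost, n = nrow(train)) {
  w <- ifelse(train$class == "1", fn_cost, fp_cost)
  train[sample(nrow(train), n, replace = TRUE, prob = w), ]
}
```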

Costing [19]
Use rejection sampling to avoid duplication of instances:
1 Each instance in the original training set is drawn once.
2 An instance is accepted into the sample with probability C(i)/Z, where C(i) is the misclassification cost of class i and Z is an arbitrary constant such that Z ≥ max C(i).
If Z = max C(i), this is equivalent to keeping all examples of the rare class and sampling the majority class without replacement according to FPcost/FNcost.
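
The rejection-sampling step in R, under the same hypothetical `train$class` convention:

```r
# Costing sketch: each instance is drawn once and accepted with probability
# C(i)/Z, Z = max(C). With binary costs this keeps every rare-class example
# and thins the majority class without duplicating anything.
costing_sample <- function(train, fn_cost, fp_cost) {
  cost   <- ifelse(train$class == "1", fn_cost, fp_cost)
  accept <- runif(nrow(train)) < cost / max(cost)
  train[accept, ]
}
```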

Tomek link [15]
The goal is to remove both noise and borderline examples by undersampling with Tomek links: a pair of examples from opposite classes that are each other's nearest neighbour forms a Tomek link, and the majority example of the pair is removed. (Figure from Nitesh Chawla's SIAM 2009 tutorial.)
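
A small R sketch that detects Tomek links as mutual nearest-neighbour pairs from opposite classes (one common formalisation of the definition); `X` is a numeric feature matrix and `y` the class factor:

```r
# Return the indices of all points that take part in a Tomek link.
tomek_links <- function(X, y) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf
  nn  <- apply(d, 1, which.min)          # nearest neighbour of each point
  idx <- seq_len(nrow(X))
  idx[nn[nn] == idx & y != y[nn]]        # mutual NNs with different classes
}
```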

Condensed Nearest Neighbor (CNN) [7]
The goal is to eliminate from the majority class the instances that are distant from the decision border, considered less relevant for learning. (Figure from Nitesh Chawla's SIAM 2009 tutorial.)
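
A sketch of the classical condensing loop in R: majority points whose current 1-NN already classifies them correctly, typically those far from the border, are never added to the retained set. `X`, `y` and the minority level are assumptions:

```r
# CNN condensing: keep all minority points plus one majority seed, then add
# a majority point only when the retained set's 1-NN misclassifies it.
cnn_select <- function(X, y, minority = "1") {
  keep <- which(y == minority)
  keep <- c(keep, sample(which(y != minority), 1))
  repeat {
    added <- FALSE
    for (i in setdiff(seq_len(nrow(X)), keep)) {
      d <- colSums((t(X[keep, , drop = FALSE]) - X[i, ])^2)  # sq. distances
      if (y[keep[which.min(d)]] != y[i]) {                   # 1-NN is wrong
        keep  <- c(keep, i)
        added <- TRUE
      }
    }
    if (!added) break
  }
  keep                                    # indices of the condensed set
}
```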

One Side Selection [9]
Hybrid method obtained from Tomek link and CNN: apply first Tomek link and then CNN. The major drawback is the use of CNN, which is sensitive to noise [18]: noisy examples are likely to be misclassified, and many of them will be added to the training set.

Edited Nearest Neighbor [17]
If an instance belongs to the majority class and the classification given by its three nearest neighbours contradicts the original class, then it is removed.
Figure: 1. Before ENN; 2. Majority class examples removed with ENN.
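
The rule translates directly into R; a sketch for numeric features, with the majority level assumed to be "0":

```r
# ENN sketch: drop a majority example when the vote of its k nearest
# neighbours contradicts its own label.
enn <- function(X, y, k = 3, majority = "0") {
  d <- as.matrix(dist(X))
  diag(d) <- Inf
  drop <- sapply(seq_len(nrow(X)), function(i) {
    nn   <- order(d[i, ])[seq_len(k)]
    vote <- names(which.max(table(y[nn])))       # neighbours' majority vote
    y[i] == majority && vote != as.character(y[i])
  })
  list(X = X[!drop, , drop = FALSE], y = y[!drop])
}
```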

Neighborhood Cleaning Rule [10]
Apply ENN first. Then, if an instance belongs to the minority class and its three nearest neighbours misclassify it, the nearest neighbours that belong to the majority class are removed.
Figure: 1. Apply ENN; 2. Majority class examples removed after ENN.

References
[1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
[2] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 11-18, 2002.
[3] N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. arXiv preprint arXiv:1106.1813, 2011.
[4] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261-283, 1989.
[5] C. Drummond, R.C. Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II. Citeseer, 2003.
[6] C. Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973-978. Citeseer, 2001.
[7] P.E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 1968.
[8] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429-449, 2002.
[9] M. Kubat, S. Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. In Machine Learning: Proceedings of the International Workshop then Conference, pages 179-186. Morgan Kaufmann, 1997.
[10] J. Laurikkala. Improving identification of difficult small classes by balancing class distribution. Artificial Intelligence in Medicine, pages 63-66, 2001.
[11] X.Y. Liu, J. Wu, and Z.H. Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(2):539-550, 2009.
[12] O. Maron and A.W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Robotics Institute, page 263, 1993.
[13] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[14] J.R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[15] I. Tomek. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6:769-772, 1976.
[16] L. Torgo. Data Mining with R: Learning with Case Studies. Chapman and Hall/CRC, 2010.
[17] D.L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, (3):408-421, 1972.
[18] D.R. Wilson and T.R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257-286, 2000.
[19] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining (ICDM 2003), pages 435-442. IEEE, 2003.