The Effects of Outliers on Support Vector Machines


Josh Hoak
jrhoak@gmail.com
Portland State University

Abstract. Many techniques have been developed for mitigating the effects of outliers on the results of SVM classification, including fuzzy-SVMs and weighted-SVMs. This paper examines the effect that outliers have on weighted-SVMs and standard SVMs on artificially generated data, using linear, radial basis function, and polynomial kernels. We find that SVMs are robust in the presence of noise outliers caused by mislabeled data.

1 Introduction

In the field of data classification, data points are flagged as outliers either because they are the points of interest (known as anomaly detection) or because they are a hindrance to classification. This paper considers outliers of the latter category and how they affect support vector machines (SVMs), a machine learning model developed by Vapnik and others [8, 10]. SVMs have been quite successful in both their training time and their accuracy, and so are of particular interest to researchers doing practical pattern classification [1].

SVMs are not impervious to outliers [8], and so methods have been developed to mitigate the effects of outliers on SVMs. Some researchers have used schemes that identify possible outliers, assigning a confidence value that indicates how likely a point is believed to be an outlier. In implementing this, researchers have developed the ideas of the fuzzy-SVM and the weighted-SVM [4-7, 11]. However, some of these studies show only incremental improvements over standard SVM methods [5].

These studies raise some fundamental questions: To what degree do outliers affect SVM models? By how much can we improve our SVM models in ideal and real-world situations? What methods should we use to deal with outliers?

These are difficult questions to answer in general. SVMs have many tunable parameters, and it is unclear which parameter settings are most representative. Also, to test the effects of outliers, we must create outliers according to some rule; however, a common characteristic of outliers is that they are unpredictable, and so results may be particular to the generating scheme. Deciding which data to use is difficult as well: we can modify data already in existence, or we can make artificial data, but in the latter case we have to design reasonable data, and it is unclear what this should look like. Structurally, SVMs are only designed to classify between two classes, and so the results of a multi-class SVM might reflect our choice of ensemble classifier rather than the effects of the outliers on the SVMs themselves.

This paper takes some first steps at examining these questions by looking at a particular type of outlier called noise, in which the class has been incorrectly labeled. For data, we created artificial data using the statistical package R, and we also examined some standard data sets [3]. We created noise on the training set by flipping the class label from 0 to 1 or from 1 to 0. Then we observed how the introduction of noise affected the accuracy on the test set. Finally, we compared the performance of a standard SVM with a weighted-SVM, both of which are provided in the LIBSVM library [2].

2 Support Vector Machines

Support vector machines are a standard classification technique used in machine learning and are reviewed here briefly; for a longer exposition of the topic, see [8, 9]. We define the training data $S$ to be a set of tuples of feature vectors $x_i \in \mathbb{R}^N$ together with given labels $y_i \in \{-1, 1\}$ called the classes. In symbols,

$$S = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m)\}.$$

The task of the SVM is to find the optimal hyperplane separating a set of training points into two categories. Practically, many problems are not linearly separable, and so features are mapped to a higher-dimensional space; that is, we map $z = \phi(x)$, where $\phi : \mathbb{R}^N \to F$ and $F$ is some feature space. We then look for the optimal hyperplane $w \cdot z + b = 0$ such that

$$y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1, \ldots, m,$$

where the slack variables satisfy $\xi_i \ge 0$. Finding the optimal hyperplane is equivalent to minimizing

$$\frac{1}{2}\, w \cdot w + C \sum_{i=1}^{m} \xi_i$$

subject to

$$y_i (w \cdot z_i + b) \ge 1 - \xi_i, \quad \text{and} \quad \xi_i \ge 0 \quad \text{for } i = 1, \ldots, m.$$
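As a concrete illustration of the objective above (not part of the original paper), the following Python sketch evaluates the soft-margin cost for a candidate hyperplane, using the standard fact that for a fixed $(w, b)$ the smallest feasible slacks are the hinge losses $\xi_i = \max(0, 1 - y_i(w \cdot z_i + b))$. The data points and parameter values are made up purely for illustration.

```python
import numpy as np

def primal_objective(w, b, Z, y, C):
    """Soft-margin primal cost (1/2) w.w + C * sum(xi_i).

    For a fixed (w, b), the smallest feasible slacks are the hinge losses
    xi_i = max(0, 1 - y_i (w . z_i + b)).
    """
    margins = y * (Z @ w + b)                 # y_i (w . z_i + b) for each example
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_i
    return 0.5 * np.dot(w, w) + C * slacks.sum()

# Toy usage with made-up numbers: two points per class in R^2.
Z = np.array([[0.0, 0.0], [0.2, -0.1], [2.0, 2.0], [1.8, 2.2]])
y = np.array([-1, -1, 1, 1])
print(primal_objective(np.array([0.5, 0.5]), -1.0, Z, y, C=1.0))
```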

To solve the optimal hyperplane problem, we can construct a Lagrangian and transform to the dual, defining the kernel $K(\cdot, \cdot)$ by $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. Then we can equivalently maximize

$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$\sum_{i=1}^{m} y_i \alpha_i = 0, \quad \text{and} \quad 0 \le \alpha_i \le C, \quad \text{for } i = 1, \ldots, m.$$

For a test example $x$, we define the decision function $D$ for a hyperplane $H$ (established by training) as

$$D_H(x) = \operatorname{sign}\!\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right).$$

3 Methods

We generated the artificial data using the statistical package R. One hundred instances were created for each class $c \in \{1, 2\}$, with the features of each class normally distributed around $2(c-1)$ with standard deviation 1 (see Figure 1). This means that for the artificial data used, the center of class 1 is located at (0, 0) while the center of class 2 is located at (2, 2). For the real-world data, we used the Breast Cancer, Heart, and Ionosphere data sets, from UCI, Statlog, and UCI respectively [3].

We selected a fraction of the initial data set to be used for training and testing, and then we created outliers on the training set. For our experiments, 70% of the data went to the training set and 30% went to the testing set. Outliers were created by flipping the class label for a fraction, between 0 and 1/2, of the training examples.

To perform the SVM training and testing, we used LIBSVM with linear, polynomial (cubic), and radial basis function kernels [2]. To reduce variation, for each noise setting the SVM was run on multiple different sets of data and the accuracy over these sets was averaged. The performance of the standard SVM was then compared to weighted versions; for these we used LIBSVM's weighting tool, which allows one to assign a different weight to each training instance. We used only a simple weighting scheme in which every non-outlier training example was given a weight of 1 and every outlier training example a fixed weight between 0 and 1. A weight of 0.0 indicates that the outlier is excluded from training.
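To make the procedure above concrete, here is a minimal Python sketch of one experimental run. It is an approximation of the setup, not the code actually used: the paper generated data in R and used LIBSVM's tools directly, whereas this sketch uses scikit-learn's SVC (a LIBSVM wrapper) with sample_weight standing in for the per-instance weights. The class centers, 70/30 split, noise fraction, and example weight value follow the text; the random seed, C value, and variable names are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two Gaussian classes centered at (0, 0) and (2, 2), 100 points each,
# mirroring the artificial data described above.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 70/30 train/test split.
idx = rng.permutation(len(y))
train, test = idx[:140], idx[140:]

# Inject noise: flip the class labels of a fraction of the training examples.
noise_frac = 0.30
flipped = rng.choice(train, size=int(noise_frac * len(train)), replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Standard SVM (RBF kernel) trained on the noisy labels.
std = SVC(kernel="rbf", C=1.0).fit(X[train], y_noisy[train])

# Weighted SVM: weight 1 for non-outliers, a fixed value (here 0.6) for the
# flipped examples; a weight of 0.0 would exclude them from training entirely.
weights = np.ones(len(train))
weights[np.isin(train, flipped)] = 0.6
wtd = SVC(kernel="rbf", C=1.0).fit(X[train], y_noisy[train], sample_weight=weights)

print("standard:", std.score(X[test], y[test]),
      "weighted:", wtd.score(X[test], y[test]))
```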

Fig. 1. Data creation. (a) Initial data creation. (b) Training data with 30% noise. Data was generated in R using normal distributions centered around (0, 0) and (2, 2), each with standard deviation 1.

4 Results and Discussion

For the artificial data, both the linear and radial basis function kernels performed quite well in the presence of noise (see Figure 2). For these kernels, accuracy on the test sets was unaffected until outliers reached roughly 20% of the training examples. However, the polynomial kernel (of degree 3) showed worse performance: even with only 5% of the training data set as outliers, we observed better performance from the weighted-SVM. The sigmoid kernel was also tested, but its performance was quite poor; the maximum accuracy was only around 60%.

For the real-world data (Figures 3-5), there was no discernible pattern as to which kernels did better. The breast cancer data performed very well in the presence of noise: for both the RBF and polynomial kernels, noise had no effect on the performance of a standard SVM until thirty percent of the training examples were chosen as noise. Standard SVMs trained with the ionosphere and heart data were less resilient to noise. Still, noise in the training data only had a significant effect on the testing accuracies for these data sets when the percentage of noise examples in the training data rose above 10 percent.

From both the artificial and real-world data, this study indicates that significant gains are difficult to attain using outlier-mitigating methods such as fuzzy-SVMs over standard SVMs. Our experiments indicate that one must have a high level of noise in a training set (roughly 10% or more) before test accuracies are significantly affected. If noise is present, we found that noisy examples do not have to be weighted severely to achieve significant improvements: setting the weights to 0.6 for the noise examples produced much higher accuracies at high noise settings.
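To sketch the averaging protocol behind these accuracy curves, the short Python loop below repeats the experiment on freshly generated artificial data for a range of noise fractions and averages the test accuracy, again via scikit-learn's SVC rather than the LIBSVM tools actually used. The kernel, C value, number of repetitions, and the per-example Bernoulli label flipping (rather than flipping an exact fraction) are simplifying assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_split():
    """One artificial data set: two Gaussians at (0, 0) and (2, 2), 70/30 split."""
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    idx = rng.permutation(200)
    return X[idx[:140]], y[idx[:140]], X[idx[140:]], y[idx[140:]]

# Average test accuracy over several data sets for each noise fraction
# (the paper used 20 data sets and 51 settings for the artificial data).
for frac in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    accs = []
    for _ in range(5):
        Xtr, ytr, Xte, yte = make_split()
        flip = rng.random(len(ytr)) < frac              # labels to flip
        ytr_noisy = np.where(flip, 1 - ytr, ytr)
        model = SVC(kernel="linear", C=1.0).fit(Xtr, ytr_noisy)
        accs.append(model.score(Xte, yte))
    print(f"noise {frac:.0%}: mean test accuracy {np.mean(accs):.3f}")
```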

Generally, this gives evidence that SVMs are more resilient to outliers than is sometimes supposed.

5 Future Work

There are still several questions and areas for further research on this topic. We only looked at the binary-class SVM; multi-class SVMs are more complicated, and it would be valuable to see how they respond to noise and outliers. We also tested only a very limited set of parameters and types of data; a more exhaustive search would give more convincing evidence about whether SVMs with particular kernels are more or less susceptible to outliers.

To support the hypothesis that SVMs are relatively robust to outliers, better methods need to be developed to generate outliers. Incorrectly labeled classes are certainly one particular type of outlier, but probably the most common type is one in which a particular feature has unexpected or poorly sampled values. Also, we weighted every noise example as if it were a known outlier. It is quite possible that an outlier detection scheme would only be able to detect a fraction of the outliers, and so it might be valuable to study the effects on accuracy when only a fraction of the outliers are weighted.

References

1. Bartlett, P., and Shawe-Taylor, J. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning (1998), 43-54.
2. Chang, C.-C., and Lin, C.-J. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
3. Frank, A., and Asuncion, A. UCI machine learning repository, 2010.
4. Lin, C.-F., and Wang, S.-D. Training algorithms for fuzzy support vector machines with noisy data. Pattern Recognition Letters 25, 14 (2004), 1647-1656.
5. Lin, C.-F., and Wang, S.-D. Fuzzy support vector machines. IEEE Transactions on Neural Networks 13, 2 (March 2002).
6. Suykens, J. A. K., De Brabanter, J., Lukas, L., and Vandewalle, J. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48, 1-4 (2002), 85-105.
7. Tsujinishi, D., and Abe, S. Fuzzy least squares support vector machines for multiclass problems. Neural Networks 16, 5-6 (2003), 785-792. Advances in Neural Networks Research: IJCNN '03.
8. Vapnik, V. The Nature of Statistical Learning Theory. Springer, 1995.
9. Vapnik, V. Statistical Learning Theory. Wiley, 1998.
10. Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning 20, 3 (September 1995).
11. Wang, T.-Y., and Chiang, H.-M. Fuzzy support vector machine for multi-class text categorization. Information Processing and Management 43, 4 (2007), 914-929.

Fig. 2. Results, artificial data. (a) Results using the linear kernel. (b) Results using the RBF kernel. (c) Results using the polynomial kernel with degree 3. Displayed are the results from both the weighted-SVM and the standard SVM using linear, radial basis function, and polynomial kernels. The SVMs were run on 20 different data sets for each noise setting (51 different settings, i.e., 0%, 1%, ..., 50% of the training data) and the accuracies on the test sets were averaged.

Fig. 3. Results, UCI Breast Cancer data. (a) Results using the linear kernel. (b) Results using the RBF kernel. (c) Results using the polynomial kernel with degree 3. Displayed are the results from both the weighted-SVM and the standard SVM using linear, radial basis function, and polynomial kernels. The SVMs were run on 5 different data sets for each noise setting (21 different settings) and the accuracies on the test sets were averaged.

Fig. 4. Results, Statlog Heart data. (a) Results using the linear kernel. (b) Results using the RBF kernel. Displayed are the results from both a weighted-SVM and a standard SVM using linear, radial basis function, and polynomial kernels. The SVMs were run on 5 different data sets for each noise setting (21 different settings) and the accuracies on the test sets were averaged.

Fig. 5. Results, UCI Ionosphere data. (a) Results using the linear kernel. (b) Results using the RBF kernel. Displayed are the results from both a weighted-SVM and a standard SVM using linear, radial basis function, and polynomial kernels. The SVMs were run on 5 different data sets for each noise setting (21 different percentages of the training data) and the accuracies on the test sets were averaged.