Penalized Logistic Regression for Classification

Gennady G. Pekhimenko
Department of Computer Science
University of Toronto
Toronto, ON M5S 3L1
pgen@cs.toronto.edu

Abstract

We investigate the use of different penalty functions on the weights of logistic regression for classification: the L1 (absolute value) penalty or lasso, the L2 penalty (standard weight decay or ridge regression), weight elimination, and others. Five data sets from the UCI Machine Learning Repository were used.

1 Introduction

For supervised learning algorithms, over-fitting is a potential problem that usually has to be addressed. It is a well-known fact that for unregularized discriminative models fit by training-error minimization, the sample complexity (the training set size needed to learn well, in some sense) grows linearly with the VC dimension (Vapnik-Chervonenkis dimension). Moreover, the VC dimension of most models grows linearly in the number of parameters, which in turn usually grows at least linearly in the number of input features. So, if the training set is not large relative to the input dimension, we need a special mechanism such as regularization, which shrinks the fitted parameters to prevent over-fitting [4](p.1).

In this paper we mostly study the typical behavior of two well-known regularization methods: ridge regression, which uses the L2 penalty, and the lasso, which uses the L1 penalty. The L2 penalty is the sum of the squares of the parameters, and ridge regression encourages this sum to be small; the L1 penalty is the sum of the absolute values of the parameters, and the lasso encourages this sum to be small. We investigate these two regularization techniques for a classical classification algorithm, logistic regression, on several data sets.

2 Logistic Regression

Logistic regression is a popular linear classification method: its predictor is a transformed linear combination of the explanatory variables. Our model: y is a multinomial random variable, x is the feature vector, and θ is the weight vector. The posterior of y is the softmax of linear functions of the feature vector x:

$$ p(y = k \mid x, \theta) = \frac{\exp(\theta_k^\top x)}{\sum_j \exp(\theta_j^\top x)} \qquad (1) $$

To fit this model we maximize the conditional log-likelihood l(θ; D):

$$ l(\theta; D) = \sum_n \log p(y = y^n \mid x^n, \theta) = \sum_{n,k} y_k^n \log p_k^n \qquad (2) $$

where $y_k^n \equiv [y^n = k]$ and $p_k^n \equiv p(y = k \mid x^n, \theta)$.
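
The original report contains no code; purely as a reading aid, here is a minimal NumPy sketch of the softmax posterior (1) and the conditional log-likelihood (2). The function names, the (K, d) weight matrix Theta, and the one-hot label matrix Y are illustrative conventions, not taken from the paper.

    import numpy as np

    def softmax_probs(Theta, X):
        # Class posteriors p(y = k | x, theta) of equation (1) for every row of X.
        # Theta has shape (K, d); X has shape (n, d).
        scores = X @ Theta.T
        scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
        exps = np.exp(scores)
        return exps / exps.sum(axis=1, keepdims=True)

    def log_likelihood(Theta, X, Y):
        # Conditional log-likelihood l(theta; D) of equation (2);
        # Y is an (n, K) one-hot indicator matrix with Y[n, k] = [y^n == k].
        P = softmax_probs(Theta, X)
        return np.sum(Y * np.log(P + 1e-12))          # epsilon guards against log(0)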

To maximize the log-likelihood we set its derivatives with respect to θ to zero:

$$ \frac{\partial l(\theta)}{\partial \theta_i} = \sum_n (y_i^n - p_i^n)\, x^n \qquad (3) $$

A method such as conjugate gradients can then be used to maximize the log-likelihood; it needs (2) and (3), that is, the value of l and its derivatives.

3 Regularization

3.1 Penalty Functions

As mentioned before, using logistic regression without any additional regularization cannot guarantee good results, because of possible over-fitting. This usually happens when the number of observations or training examples (m) is not large enough compared with the number of feature variables (n) [3](p.4). So we need some kind of regularizer in order to obtain a stable classifier. One way is penalization. The general motivation behind penalizing the likelihood, the approach by which logistic regression can be regularized in a straightforward manner, is the following: avoid arbitrary coefficient estimates θ and a classification that appears to be perfect on the training set but is poor on the validation set. Penalization aims at improving prediction performance on new data by balancing the fit to the data against the stability of the estimates [1](p.4).

We now penalize the logistic regression log-likelihood from (2) as follows:

$$ \tilde{l}(\theta) = l(\theta) - \lambda J(\theta) \qquad (4) $$

$$ J(\theta) = \sum_k \alpha_k \, \psi(\theta_k), \qquad (5) $$

where α_k > 0. Usually the penalty function ψ is chosen to be symmetric and increasing on [0, +∞). Furthermore, ψ can be convex or non-convex, smooth or non-smooth. A good penalty function should result in unbiasedness, sparsity and stability [6](p.20). The most popular penalty functions are listed in Table 1; more examples can be found in [6](p.21). We now consider two special cases of J(θ) that are commonly used in practice: J(θ) = Σ_k |θ_k| (the l1 penalty) and J(θ) = Σ_k θ_k² (the l2 penalty).

3.2 l2-regularization

Over-fitting often occurs when the fitted model has many feature variables with relatively large weights in magnitude. To prevent this we can use the weight decay method (ridge regression): let J(θ) = Σ_k θ_k² (the l2 penalty). Using this J(θ) yields a classifier with smaller weights and often better generalization ability. We can also prune such a classifier: weights with magnitudes smaller than a certain threshold can be considered redundant and removed. This regularization technique is very easy to implement and often works well. It is still widely used in different applications: in [8] for detecting gene interactions, in [6] for classification of microarray data, and in [1] for gene expression analysis.

3.3 l1-regularization

However, l2 regularization usually leaves most of the weights θ_i non-zero, which can be a problem when we have many features and want a sparse θ-vector. Sparsity can also be a natural characteristic of the data: there may be a large number of features of which only a relatively small subset is sufficient to learn the target concept. So let J(θ) = Σ_k |θ_k| (the l1 penalty). In this case many weights can become exactly zero, giving a model with a relatively sparse vector θ. We can think of this property as a way of selecting important features [3](p.6). A logistic model with a sparse weight vector is simpler and more parsimonious, so it is not surprising that it can outperform l2, especially when the number of observations is smaller than the number of features [4](p.5). In this paper we compare both regularization methods on several common classification data sets from the UCI repository.
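
To make the penalized objective concrete, the following hedged sketch evaluates equation (4) and its gradient for the two special cases of J(θ) discussed above (l2, and l1 handled via its subgradient), and maximizes it by plain gradient ascent. The paper itself used conjugate gradients, so the optimization loop is an illustrative simplification, and softmax_probs is the hypothetical helper from the previous sketch.

    import numpy as np

    def penalized_loglik_and_grad(Theta, X, Y, lam, penalty="l2"):
        # Penalized log-likelihood of equation (4) and its gradient, for the two
        # special cases of J(theta): sum of squares (l2) and sum of absolute values (l1).
        P = softmax_probs(Theta, X)                    # helper from the previous sketch
        ll = np.sum(Y * np.log(P + 1e-12))
        grad = (Y - P).T @ X                           # equation (3), stacked over classes k
        if penalty == "l2":
            ll -= lam * np.sum(Theta ** 2)
            grad -= 2.0 * lam * Theta
        elif penalty == "l1":
            ll -= lam * np.sum(np.abs(Theta))
            grad -= lam * np.sign(Theta)               # subgradient; |theta| is non-smooth at 0
        return ll, grad

    def fit(X, Y, lam, penalty="l2", lr=1e-3, n_iter=2000):
        # Plain gradient ascent on the penalized log-likelihood (the paper used
        # conjugate gradients; this loop only illustrates the objective).
        Theta = np.zeros((Y.shape[1], X.shape[1]))
        for _ in range(n_iter):
            _, grad = penalized_loglik_and_grad(Theta, X, Y, lam, penalty)
            Theta += lr * grad
        return Theta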

Table 1: Examples of penalty functions

    Penalty function                     Convexity   Smoothness at 0     Authors
    ψ(θ) = |θ|                           yes         no, ψ′(0⁺) = 1      Rudin (1992)
    ψ(θ) = |θ|^α, α ∈ (0,1)              no          no, ψ′(0⁺) = ∞      Saquib (1998)
    ψ(θ) = α|θ| / (1 + α|θ|)             no          no, ψ′(0⁺) = α      Geman (1992, 1995)
    ψ(θ) = |θ|^α, α > 1                  yes         yes                 Bouman (1993)
    ψ(θ) = αθ² / (1 + αθ²)               no          yes                 McClure (1987)

4 Experiments

4.1 Data sets

[Figure 1: (A) Optical Digits, 0's vs. 1's, continuous; (B) Optical Digits, 1's vs. 2's, continuous.]

[Figure 2: (C) Optical Digits, 8's vs. 9's, continuous; (D) Artificial Characters, A's vs. C's, discrete.]

Our experiments used data sets from the UCI Machine Learning Repository. The first data set was Optical Digits, an extraction of handwritten digits from a preprinted form. It consists of 3823 training samples and 1797 test samples, with 64 attributes and 10 classes (all digits). The second data set was the Artificial Characters database, generated by a first-order theory describing 10 letters of the English alphabet (A, C, D, E, F, G, H, L, P, R). There are 1000 instances (100 per class) in the learning set and 5000 instances (500 per class) in the test set; every instance has 96 binary attributes (a 12x8 matrix).
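
Each of the digit and character experiments below reduces one of these multi-class data sets to a two-class problem trained on a small random subset. Purely as an illustration of that setup, here is a hedged sketch; the arrays X and y, the helper name, and the example class labels are placeholders, not part of the original experiments.

    import numpy as np

    def make_binary_task(X, y, pos_class, neg_class, n_train, rng):
        # Keep only the two requested classes, relabel them 1/0, draw a random
        # training subset of size n_train, and use the remaining instances as the test set.
        mask = (y == pos_class) | (y == neg_class)
        Xb = X[mask]
        yb = (y[mask] == pos_class).astype(int)
        order = rng.permutation(len(yb))
        train, test = order[:n_train], order[n_train:]
        return Xb[train], yb[train], Xb[test], yb[test]

    # Example: an 8-vs-9 task with 30 training samples.
    # rng = np.random.default_rng(0)
    # Xtr, ytr, Xte, yte = make_binary_task(X, y, 8, 9, n_train=30, rng=rng)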

[Figure 3: (E) Ionosphere, continuous; (F) SPECTF (heart data), continuous; (G) Zoo animals, discrete.]

Then, we used the Johns Hopkins University Ionosphere database of radar returns from the ionosphere. There are two classes in this database: good radar returns (showing evidence of some type of structure in the ionosphere) and bad returns (signals that passed through the ionosphere). It has 351 instances and 34 attributes (all continuous). We used 200 instances for training (carefully split into almost 50% positive and 50% negative) and the rest for testing. The fourth data set (SPECTF) describes the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images; each patient is classified into one of two categories, normal or abnormal. It has 267 instances described by 45 attributes. The last data set was the Zoo database (classification of zoo animals), with 101 instances (divided 80/21 into learning and test sets) and 17 attributes (15 boolean and 2 numeric).

4.2 Results

We used a simple, self-written logistic regression implementation with the two variants of regularization. The optimal λ parameter was chosen by cross-validation. The results of the 7 experiments are shown in Figures 1, 2 and 3; they show the difference in error rate between the l1 and l2 regularization methods. Because we expected to see a difference when the number of samples is small compared with the number of features, we ran experiments for different numbers of training instances (from 10 to 200, or fewer, in steps of 10). To avoid problems caused by an unlucky split, for every number of training samples we made 100 random splits of the training data and report the average error over all of them; our test error is therefore formally an estimate of the generalization error (averaged over the 100 random splits). The difference (l1 outperforming l2) was not as dramatic as in [4], but [4](p.5) considered cases with 1 and 3 relevant features out of a total of 100 to 1000 features, which was not the case in all of our experiments.
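
The averaging protocol just described can be summarized in a short sketch: for each training-set size, repeat a random split 100 times, fit with the chosen penalty, and average the test error. The fit and softmax_probs helpers are the hypothetical ones from the earlier sketches, and the cross-validation used in the paper to select λ is omitted here.

    import numpy as np

    def average_test_error(X, Y, n_train, lam, penalty, n_repeats=100, seed=0):
        # Average test error over repeated random train/test splits, mirroring the
        # 100-split protocol of Section 4.2.  Y is a one-hot label matrix.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_repeats):
            order = rng.permutation(len(X))
            tr, te = order[:n_train], order[n_train:]
            Theta = fit(X[tr], Y[tr], lam, penalty)                # hypothetical trainer
            pred = np.argmax(softmax_probs(Theta, X[te]), axis=1)  # predicted class index
            truth = np.argmax(Y[te], axis=1)
            errors.append(np.mean(pred != truth))
        return float(np.mean(errors))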

Figure 1 shows the results for classifying 0 vs. 1 and 1 vs. 2 digits. In these runs l2 was slightly better than l1 (by 0.2-0.4%). The reasons are, first, that these pairs of digits are not similar to each other (the situation is different for 8 vs. 9) and, second, that we used the conjugate gradient method for both penalties, whereas in [2](p.5-6) and [4](p.3-4) algorithms specialized for the l1 case were used. [3](p.6-8) gives a short overview of methods that can be used to improve performance for l1-regularization. In Figure 2 we see the results for 8 vs. 9 digits (C) and A vs. C letters (D). When the number of samples is small (10-30) relative to the number of features (64 and 96), l1 outperformed l2; after that the error rates are almost equal. Figure 3 shows the results for the last three data sets: Ionosphere, SPECTF and Zoo (E, F, G respectively). For (E) and (F) the results are similar to (C) and (D): when the number of samples is small (10-30), l1 outperformed l2. For (G) the results of both methods are almost the same.

5 Short Overview of Related Work

In Section 3.2 we mentioned three papers that use l2 regularization. In [4](p.7-8) Ng proved the interesting fact that for logistic regression with l1 regularization the sample complexity grows only logarithmically in the number of irrelevant features, while with l2 it grows at least linearly (in the worst case). [3] proposed a specialized method for solving l1-regularized logistic regression very efficiently for problems of all sizes. [8](p.24) compares logistic regression with naive Bayes on several of the data sets used in our experiments. [1](p.6-7) compares the efficiency of different methods (cross-validation, the Akaike Information Criterion, etc.) for choosing the λ parameter. [2](p.6-7) is a good natural example of a case where l1 is much quicker than l2 because of sparsity (a text categorization problem).

6 Concluding Remarks

In this paper we compared two well-known and widely used regularization methods. The goal was to understand which method is preferable for different types of data and different amounts of training samples, and which additional algorithms can improve these methods. Based on this study (and on results from the papers in the References), we conclude that l1 is a good choice when you do not have enough training samples and the input vector has relatively many features, or when you need a sparse weight vector to decrease the time the algorithm needs to produce a result. On the other hand, l2 is still a good choice for tasks where most of the learned weights are relevant. It would also be interesting to investigate the same questions for other classifiers and machine learning algorithms such as neural networks (NN) and support vector machines. Part of these experiments was also carried out for NN, and the results are very similar (on the digits and characters data sets).

References

[1] Schimek, M. G. (2003) Penalized Logistic Regression in Gene Expression Analysis. Proc. The Art of Semiparametrics Conference, October 2003.

[2] Genkin, A., Lewis, D. D. & Madigan, D. (2005) Sparse Logistic Regression for Text Categorization. DIMACS Working Group on Monitoring Message Streams, Project Report, April 2005.

[3] Koh, K., Kim, S.-J. & Boyd, S. (2006) An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression. Submitted to the Journal of Machine Learning Research, July 2006, pp. 1-38.

[4] Ng, A. Y. (2004) Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

[5] Lee, S.-I., Lee, H., Abbeel, P. & Ng, A. Y. (2006) Efficient L1 Regularized Logistic Regression. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 2006.

[6] Antoniadis, A. (2003) Penalized Logistic Regression and Classification of Microarray Data. Computational and Statistical Aspects of Microarray Analysis, Workshop, Milan, May 2003, pp. 1-32.

[7] Mitchell, T. M. (2005) Logistic Regression. Machine Learning Course 10-701, Center for Automated Learning and Discovery, Carnegie Mellon University, September 29, 2005.

[8] Park, M. Y. & Hastie, T. (2006) Penalized Logistic Regression for Detecting Gene Interactions. Technical Report, Dept. of Statistics, Stanford University, September 2006.