Feature Subset Selection for Logistic Regression via Mixed Integer Optimization


Feature Subset Selection for Logistic Regression via Mixed Integer Optimization Yuichi TAKANO (Senshu University, Japan) Toshiki SATO (University of Tsukuba) Ryuhei MIYASHIRO (Tokyo University of Agriculture and Technology) Akiko YOSHISE (University of Tsukuba) Workshop on Advances in Optimization (WAO2016) TKP Shinagawa Conference Center, Tokyo, JAPAN August 12-13, 2016

2/21 Outline Introduction Mixed Integer Optimization Formulation Computational Results Conclusions

3/21 Outline Introduction Mixed Integer Optimization Formulation Computational Results Conclusions

4/21 Binary Classification Binary classification aims at developing a model for separating two classes of samples characterized by numerical features. Example: corporate bankruptcy prediction, where the financial indicators of company i are input to the model to predict whether company i will stay in business or go bankrupt. Methods: classical discriminant analysis, logistic regression, SVM, and so on. We focus on the feature subset selection problem for logistic regression.

5/21 Feature Subset Selection Feature subset selection is the task of choosing a set of significant features for model construction, e.g., selecting a few of the candidate indicators #(employees), ROA, cashflow, growth, and debts. Potential benefits of feature subset selection are: Improving predictive performance by preventing overfitting Identifying a model that captures the essence of a system Providing a computationally efficient set of features It is of essential importance in statistics, and it has recently received considerable attention in data mining and machine learning as a result of the growing size of datasets.

6/21 Methods for Feature Subset Selection Stepwise Method (i.e., local search algorithm) Metaheuristics (e.g., tabu search, simulated annealing) L1-regularized Regression (a.k.a. LASSO) These algorithms do not necessarily find a best subset of features under a given goodness-of-fit measure (e.g., AIC, BIC, Cp). Branch-and-Bound Algorithm (e.g., Narendra & Fukunaga (1977)) This algorithm assumes monotonicity of the goodness-of-fit measure in its pruning process, an assumption that is not satisfied by commonly used goodness-of-fit measures. Our Approach: Mixed Integer Optimization (MIO) We formulate the problem as an MILO problem by making a piecewise linear approximation of the logistic loss function.

7/21 Outline Introduction Mixed Integer Optimization Formulation Computational Results Conclusions

8/21 Logistic Regression Model Logistic regression model: y: binary class label (y ∈ {−1, +1}) x: p-dimensional feature vector w: p-dimensional coefficient vector (to be estimated) b: intercept (to be estimated) The occurrence probability of y = +1 given the feature vector x is modeled as P(y = +1 | x) = f(w^T x + b), where f is the sigmoid function f(v) = 1 / (1 + exp(−v)), which maps (−∞, +∞) to (0, 1).
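A minimal NumPy sketch of the model's probability output (function and variable names are ours, not from the talk):

```python
import numpy as np

def sigmoid(v):
    """Sigmoid function f(v) = 1 / (1 + exp(-v)): maps (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def occurrence_probability(x, w, b):
    """P(y = +1 | x) under the logistic regression model with coefficients w and intercept b."""
    return sigmoid(np.dot(w, x) + b)
```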

9/21 Information Criterion The log likelihood function is the negative sum of the logistic losses: log L(w, b) = −Σ_{i=1}^{n} log(1 + exp(−y_i (w^T x_i + b))), where log(1 + exp(−y_i (w^T x_i + b))) is the logistic loss of sample i and f(y_i (w^T x_i + b)) is its occurrence probability. We will select a subset S ⊆ {1, 2, ..., p} of features so that the information criterion IC(S) = −2 (maximum log likelihood) + F |S| is minimized, where F is the penalty parameter and |S| the number of selected features: F = 2 gives the Akaike information criterion (AIC), and F = log n gives the Bayesian information criterion (BIC).
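In code, with the notation above (a sketch; names are ours, and, following the slide, the penalty counts only the |S| selected features, not the intercept):

```python
import numpy as np

def logistic_loss(y, X, w, b):
    """Per-sample logistic loss log(1 + exp(-y_i * (w^T x_i + b)));
    the log likelihood is the negative of its sum."""
    margins = y * (X @ w + b)
    return np.logaddexp(0.0, -margins)          # numerically stable log(1 + exp(-m))

def information_criterion(y, X_S, w_S, b, F):
    """IC(S) = -2 * (maximum log likelihood) + F * |S|.
    F = 2 gives AIC, F = log(n) gives BIC.
    X_S holds only the columns in S; (w_S, b) is assumed to be the ML fit on S."""
    neg_log_lik = logistic_loss(y, X_S, w_S, b).sum()   # = -log L
    return 2.0 * neg_log_lik + F * X_S.shape[1]
```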

10/21 Piecewise Linear Approximation Most MIO software cannot handle such a nonlinear function. Because the logistic loss function is convex, it can be approximated from below by the pointwise maximum of a family of its tangent lines.
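A minimal NumPy sketch of this construction (function names are ours): tangents are taken at given points, and their pointwise maximum underestimates the convex loss.

```python
import numpy as np

def logistic_loss(v):
    return np.logaddexp(0.0, -v)            # log(1 + exp(-v)), numerically stable

def loss_derivative(v):
    return -1.0 / (1.0 + np.exp(v))         # d/dv log(1 + exp(-v))

def tangent_lines(points):
    """(slope, intercept) of the tangent to the logistic loss at each given point."""
    points = np.asarray(points, dtype=float)
    slopes = loss_derivative(points)
    intercepts = logistic_loss(points) - slopes * points
    return list(zip(slopes, intercepts))

def approx_loss(v, tangents):
    """Pointwise maximum of the tangent lines: since the loss is convex,
    this is a piecewise linear underestimator of the true loss."""
    return max(a * v + c for a, c in tangents)
```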

11/21 Greedy Algorithm for Selecting Tangent Lines It is crucial to select a limited number of good tangent lines. Our greedy algorithm adds tangent lines one by one so that, at each step, the biggest triangle between the current approximation and the loss function (i.e., the largest remaining approximation error) is cut off; the original figure illustrates the first few cuts in Steps 0-3, and a sketch follows below.
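A self-contained sketch of one reading of this greedy rule, where each "triangle" is the one bounded by two adjacent tangents and the secant through their tangency points; the interval bounds and names are ours, and the exact rule in the paper may differ.

```python
import numpy as np

def greedy_tangent_points(v_lo, v_hi, num_points):
    """Greedily pick tangency points for the logistic loss g(v) = log(1 + exp(-v)).
    Each step finds the pair of adjacent tangents whose triangle (bounded by the two
    tangents and the secant through their tangency points) has the largest area, and
    cuts it off by adding a tangent at the abscissa where the two tangents cross."""
    def g(v):
        return np.logaddexp(0.0, -v)
    def dg(v):
        return -1.0 / (1.0 + np.exp(v))

    points = [float(v_lo), float(v_hi)]
    while len(points) < num_points:
        best_area, best_x = -1.0, None
        for c1, c2 in zip(points, points[1:]):
            a1, b1 = dg(c1), g(c1) - dg(c1) * c1      # tangent at c1: a1*v + b1
            a2, b2 = dg(c2), g(c2) - dg(c2) * c2      # tangent at c2: a2*v + b2
            x = (b2 - b1) / (a1 - a2)                 # abscissa where the tangents cross
            # area of the triangle with vertices (c1, g(c1)), (c2, g(c2)), (x, a1*x + b1)
            area = 0.5 * abs((c2 - c1) * (a1 * x + b1 - g(c1))
                             - (x - c1) * (g(c2) - g(c1)))
            if area > best_area:
                best_area, best_x = area, x
        points = sorted(points + [best_x])
    return points
```

For example, greedy_tangent_points(-10.0, 10.0, 5) returns a 5-point set analogous in size to V1 in the experiments; the actual interval used in the talk is not specified, so the bounds here are placeholders.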

12/21 Mixed Integer Linear Optimization Formulation Feature subset selection for logistic regression is formulated as a mixed integer linear optimization (MILO) problem: minimize the (approximated) information criterion, where the logistic loss is replaced by the pointwise maximum of the tangent lines, and binary variables z_j indicate whether the j-th feature is selected or not. Coefficients of unselected features are forced to zero either by the big-M method (−M z_j ≤ w_j ≤ M z_j) or by SOS type-1 constraints on {1 − z_j, w_j} (GRB.SOS_TYPE1).
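A sketch of this formulation with gurobipy (the slide's GRB.SOS_TYPE1 suggests Gurobi); the big-M linking is used here for brevity, the bound M = 50 and all helper names are illustrative, and details may differ from the formulation in the paper.

```python
import gurobipy as gp
from gurobipy import GRB

def build_milo(X, y, tangents, F=2.0, M=50.0):
    """MILO sketch for AIC/BIC-minimizing feature subset selection.
    X: (n, p) feature matrix; y: labels in {-1, +1};
    tangents: (slope, intercept) pairs approximating the logistic loss;
    F: penalty parameter (2 for AIC, log n for BIC);
    M: big-M bound on |w_j| -- an illustrative value, not from the talk."""
    n, p = X.shape
    m = gp.Model("fss_logreg")
    w = m.addVars(p, lb=-GRB.INFINITY, name="w")   # coefficients
    b = m.addVar(lb=-GRB.INFINITY, name="b")       # intercept
    z = m.addVars(p, vtype=GRB.BINARY, name="z")   # z_j = 1 iff feature j is selected
    t = m.addVars(n, lb=0.0, name="t")             # epigraph variables for the loss

    for i in range(n):
        yi = float(y[i])
        margin = yi * (gp.quicksum(float(X[i, j]) * w[j] for j in range(p)) + b)
        for a_k, c_k in tangents:                  # t_i >= every tangent line
            m.addConstr(t[i] >= a_k * margin + c_k)

    for j in range(p):                             # big-M linking; the talk also mentions
        m.addConstr(w[j] <= M * z[j])              # SOS type-1 constraints on {1 - z_j, w_j}
        m.addConstr(w[j] >= -M * z[j])             # as an alternative

    # approximated IC: 2 * (piecewise linear negative log likelihood) + F * |S|
    m.setObjective(2.0 * t.sum() + F * z.sum(), GRB.MINIMIZE)
    return m, w, z
```

After m.optimize(), m.ObjVal plays the role of obj_MILO (the LB reported in the tables) and {j : z[j].X > 0.5} is the selected subset S*.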

13/21 Optimality Guarantee By solving the MILO problem, we obtain obj_MILO (its optimal objective value) and S* (a subset of features). Let IC_opt be the minimum value of the information criterion. The objective function of the MILO problem is an underestimator of the information criterion, and S* is not necessarily a minimizer of the information criterion. Thus an obtained subset S* comes with an optimality guarantee: obj_MILO (lower bound) ≤ IC_opt ≤ IC(S*) (upper bound).

14/21 Outline Introduction Mixed Integer Optimization Formulation Computational Results Conclusions

15/21 Experimental Design for AIC Minimization The datasets for classification were downloaded from the UCI Machine Learning Repository. We compare the following methods for minimizing AIC: SW_const: stepwise method starting with S = ∅ (step in R) SW_all: stepwise method starting with S = {1, 2, ..., p} (step in R) L1-reg: L1-regularized logistic regression (glmnet in R) MILO(V): our MILO formulation (Gurobi Optimizer), with three sets of tangent lines V1, V2, V3 computed by our greedy algorithm (|V1| = 5, |V2| = 9, |V3| = 17).

16/21 Breast Cancer Dataset (n = 194, p = 33)

Method                AIC(S)   LB       Relgap   |S|   Time (sec.)
SW_const              162.94   ---      ---      12    1.06
SW_all                152.13   ---      ---      24    1.49
L1-reg                157.57   ---      ---      24    4.67
MILO(V1), |V1| = 5    147.04   137.96   6.58%    18    22.88
MILO(V2), |V2| = 9    147.04   144.56   1.72%    18    57.72
MILO(V3), |V3| = 17   147.04   146.41   0.43%    18    240.39

AIC(S): AIC of the selected subset S of features. LB: objective value of the MILO problem (i.e., a lower bound on the minimum AIC). Relgap: relative optimality gap, i.e., 100 × (AIC(S) − LB) / LB. |S|: the number of selected features. Time (sec.): computation time in seconds.
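As a quick consistency check (not part of the talk), the Relgap column follows directly from AIC(S) and LB:

```python
def relative_gap(aic_of_subset, lower_bound):
    """Relgap = 100 * (AIC(S) - LB) / LB, as defined above."""
    return 100.0 * (aic_of_subset - lower_bound) / lower_bound

print(round(relative_gap(147.04, 137.96), 2))   # 6.58, matching the MILO(V1) row
```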

17/21 Breast Cancer Dataset (n = 194, p = 33) — same results table as on the previous slide. Stepwise methods and L1-reg finished their computations within five seconds, but they provided low-quality solutions. The MILO formulations attained the smallest AIC value among these methods. As the number of tangent lines increased, Relgap became smaller, but the computation time also increased.

18/21 Libras Movement Dataset (n = 360, p = 90)

Method                AIC(S)   LB      Relgap    |S|   Time (sec.)
SW_const              22.00    ---     ---       10    6.53
SW_all                18.00    ---     ---       8     323.22
L1-reg                28.00    ---     ---       13    4.75
MILO(V1), |V1| = 5    14.00    8.00    75.00%    6     10000.00
MILO(V2), |V2| = 9    16.00    8.00    100.00%   7     10000.00
MILO(V3), |V3| = 17   16.00    6.00    166.67%   7     10000.00

The MILO computations took a very long time, so we terminated them after 10000 seconds. Stepwise methods and L1-reg finished their computations within six minutes but still provided low-quality solutions. The MILO formulations attained smaller AIC values than the other methods; however, Relgap was very large.

19/21 Outline Introduction Mixed Integer Optimization Formulation Computational Results Conclusions

20/21 Conclusions This talk considered the feature subset selection problem for logistic regression. We formulated its approximation as a mixed integer linear optimization (MILO) problem by applying a piecewise linearization technique to the logistic loss function. We also developed a greedy algorithm to select good tangent lines for the piecewise linear approximation. Our approach has the advantage of selecting a subset of features with an optimality guarantee. Our method often outperformed the stepwise methods and L1-regularized logistic regression in terms of solution quality. Future directions of study: a specialized branch-and-bound algorithm, discrete choice models (e.g., the multinomial logit model), and an exact algorithm for selecting a best set of tangent lines.

21/21 References T. Sato, Y. Takano, R. Miyashiro, and A. Yoshise: Feature Subset Selection for Logistic Regression via Mixed Integer Optimization. Computational Optimization and Applications, Vol. 64, No. 3, pp. 865-880 (2016).