Gradient LASSO algorithm


Gradient LASSO algorithm. Yongdai Kim, Seoul National University, Korea; jointly with Yuwon Kim, University of Minnesota, USA, and Jinseog Kim, Statistical Research Center for Complex Systems, Korea.

Contents
1. Introduction
2. LASSO: Review
3. Algorithms for LASSO: Review
4. Coordinate gradient descent algorithm
5. Gradient LASSO algorithm
6. Simulation
7. Data analysis
8. Conclusion

1. Introduction
LASSO (Least Absolute Shrinkage and Selection Operator, Tibshirani, 1996) achieves better prediction accuracy by shrinkage, and at the same time it gives a sparse solution (some coefficients are exactly 0). A problem with LASSO is that the objective function is not differentiable, and hence special optimization techniques (QP or non-linear programming) are necessary. In this talk, I propose a gradient descent algorithm for LASSO, called gradient LASSO, which is simpler and more stable than the existing algorithms and can easily be applied to very high-dimensional data.

2. LASSO: Review
Covariate: $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$. Response: $y \in \mathbb{R}$.
Model: $y = \beta^\top x + \epsilon$, or $\Pr(y = 1 \mid x) = \mathrm{logit}^{-1}(\beta^\top x)$, or ...
Data: $\{(y_i, x_i) : i = 1, \ldots, n\}$.
Objective: Estimate $\beta$ using the data.
Question: How do we estimate $\beta$ when $p$ is large compared to $n$? Two methods: (variable) selection and shrinkage.

Selection
Select a subset of $\{x_1, \ldots, x_p\}$ and use only the selected covariates for model fitting. Popular selection methods: all possible subsets, forward selection, backward elimination, stepwise. The predictive model is unstable because the selection process is discrete, which may result in inferior prediction error.

Shrinkage: Ridge regression
Estimate $\beta$ by minimizing $\sum_{i=1}^n l(y_i, \beta^\top x_i) + \lambda \sum_{j=1}^p \beta_j^2$, where $\lambda > 0$ and $l$ is a loss function.
Why shrinkage? It can be shown that $\|\hat\beta_{\mathrm{Ridge}}\|_2 \le \|\hat\beta\|_2$, where $\hat\beta_{\mathrm{Ridge}}$ is the Ridge estimator and $\hat\beta$ is the ordinary estimator (the Ridge estimator with $\lambda = 0$). The Ridge estimator is more stable than the selection estimator and so gives better accuracy. A disadvantage of Ridge regression is that the fitted model is hard to interpret (i.e., none of the estimated coefficients is exactly 0).
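For the squared-error loss, the Ridge estimator has the closed form $\hat\beta_{\mathrm{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. Below is a minimal Python/NumPy sketch (my own illustration with synthetic data, not code from the talk) showing how the estimate shrinks in L2 norm as claimed above:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge estimator for the squared-error loss (no intercept)."""
    p = X.shape[1]
    # Solve (X'X + lambda * I) beta = X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
beta_true = np.array([1.0, -0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_ls = ridge(X, y, 0.0)      # ordinary least squares (lambda = 0)
beta_ridge = ridge(X, y, 5.0)   # shrunk toward zero
assert np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ls)
```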

Comparison of Selection and Shrinkage

Term         LS       Selection   Ridge
Intercept     2.480    2.495       2.467
x1            0.680    0.740       0.389
x2            0.305    0.367       0.238
x3           -0.141               -0.029
x4            0.210                0.159
x5            0.305                0.217
x6           -0.288                0.026
x7           -0.021                0.042
x8            0.267                0.123
Test Error    0.586    0.574       0.540

Can we do selection and shrinkage simultaneously? Yes: that is exactly what LASSO does!

LASSO
LASSO estimates $\beta$ by minimizing $\sum_{i=1}^n l(y_i, \beta^\top x_i) + \lambda \sum_{j=1}^p |\beta_j|$.
LASSO was first proposed by Tibshirani (1996). One very interesting property of LASSO is that the predictive model is sparse (i.e., some coefficients are exactly 0).

Comparison of LASSO with the other methods

Term         LS       Selection   Ridge    LASSO
Intercept     2.480    2.495       2.467    2.477
x1            0.680    0.740       0.389    0.545
x2            0.305    0.367       0.238    0.237
x3           -0.141               -0.029
x4            0.210                0.159    0.098
x5            0.305                0.217    0.165
x6           -0.288                0.026
x7           -0.021                0.042
x8            0.267                0.123    0.059
Test Error    0.586    0.574       0.540    0.491

Why is LASSO sparse? [Figure: constraint regions of LASSO and Ridge in the $(\beta_1, \beta_2)$ plane.]

3. Algorithms: Review
As mentioned in the Introduction, LASSO is computationally demanding since the $L_1$ constraint is not differentiable. There are various special algorithms for LASSO: Osborne's algorithm, Grafting, and $\epsilon$-boosting.
Let $R(\beta) = \sum_{i=1}^n l(y_i, \beta^\top x_i)$ be the empirical risk. Assume $R$ is convex and differentiable (i.e., a smooth convex loss).
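To make the smooth convex loss concrete, the sketch below (Python/NumPy, my own illustration rather than the authors' code) implements the empirical risk $R(\beta)$ and its gradient for the logistic loss with labels coded as ±1; the later sketches reuse these two helpers.

```python
import numpy as np

def logistic_risk(beta, X, y):
    """Empirical risk R(beta) = sum_i log(1 + exp(-y_i * x_i'beta)), y_i in {-1, +1}."""
    margins = y * (X @ beta)
    # log(1 + exp(-m)) computed stably
    return np.sum(np.logaddexp(0.0, -margins))

def logistic_grad(beta, X, y):
    """Gradient of the logistic empirical risk with respect to beta."""
    margins = y * (X @ beta)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    w = -1.0 / (1.0 + np.exp(margins))
    return X.T @ (w * y)
```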

Osborne's algorithm (Osborne et al., 2000)
Implemented in R. Osborne's algorithm repeatedly updates the solution through (i) optimization, (ii) deletion, and (iii) optimality check and addition. For a current solution $\beta$, let $\sigma = \{j : \beta_j \ne 0\}$.

Optimization
Let $H_\sigma = \{h \in \mathbb{R}^p : h_k = 0 \text{ if } k \notin \sigma\}$. Solve
$\min_{h_\sigma \in H_\sigma} R(\beta + h_\sigma)$ subject to $\mathrm{sign}(\beta)^\top (\beta + h_\sigma) \le \lambda$.
If $\mathrm{sign}(\beta + h_\sigma) = \mathrm{sign}(\beta)$, then $\beta + h_\sigma$ is feasible (i.e., $\|\beta + h_\sigma\|_1 \le \lambda$); we update $\beta = \beta + h_\sigma$ and go to the optimality check step. Otherwise, we go to the deletion step.
Note: With the squared-error loss, $h_\sigma$ has a closed-form solution involving $(X_\sigma^\top X_\sigma)^{-1}$. For a general loss, $h_\sigma$ can be found by the IRLS method.
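To see where the $(X_\sigma^\top X_\sigma)^{-1}$ factor comes from, here is a sketch of the least-squares solve restricted to the active set (Python/NumPy, my own illustration; the sign constraint handled by the full algorithm is omitted here).

```python
import numpy as np

def restricted_ls_step(beta, X, y, sigma):
    """Least-squares update h_sigma restricted to the active set sigma.

    Illustrates the (X_sigma' X_sigma)^{-1} factor in Osborne's optimization
    step for the squared-error loss; the sign constraint of the full
    algorithm is not enforced in this sketch.
    """
    Xs = X[:, sigma]                       # columns of the active coordinates
    r = y - X @ beta                       # current residual
    h_sigma = np.linalg.solve(Xs.T @ Xs, Xs.T @ r)
    h = np.zeros_like(beta)
    h[sigma] = h_sigma
    return h
```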

Deletion
Find the coordinate (say $k$) in $\sigma$ that violates the constraint most, then update $\sigma = \sigma \setminus \{k\}$ and set $\beta_k = 0$. Recompute $h_\sigma$. Delete coordinates until $\beta + h_\sigma$ is feasible.

Optimality check and addition
If $\beta$ satisfies the optimality condition, the algorithm terminates. Otherwise, find the coefficient (say $\beta_k$) that violates the optimality condition most, update $\sigma = \sigma \cup \{k\}$, and go back to the optimization step.

Remark
Osborne's algorithm can fail to converge when $p$ is larger than $n$. In particular, when the cardinality of $\sigma$ exceeds $n$, $(X_\sigma^\top X_\sigma)^{-1}$ does not exist.

Grafting (Perkins et al., 2003)
It repeatedly modifies the set of nonzero coefficients (i.e., $\sigma$) as follows. Start with $\sigma = \emptyset$. Repeat until the solution is found: find the coefficient (say $\beta_k$) whose corresponding gradient has the largest absolute value, update $\sigma = \sigma \cup \{k\}$, and solve the LASSO problem using only the coefficients in $\sigma$.
The Grafting algorithm is computationally demanding since it solves the LASSO problem repeatedly. Also, similar to Osborne's algorithm, Grafting can fail when $p > n$.

Stagewise forward selection: $\epsilon$-boosting (Friedman, 2001)
Algorithm: Choose $\epsilon$ sufficiently small. Select the coordinate direction $\pm e_k$ along which the directional derivative of $R$ is smallest (most negative), where $e_k$ is the p-dimensional vector whose k-th entry is 1 and the others are zero. Update $\beta = \beta + \epsilon e$ for the selected direction $e$. Keep iterating these two steps until $\|\beta\|_1$ exceeds $\lambda$.
See Efron et al. (2004) for the squared-error loss (LARS) and Rosset et al. (2004) for general convex losses. The method is simple and gives a solution path as a by-product, but for general convex losses other than the squared-error loss it may not converge to the solution.
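A minimal sketch of $\epsilon$-boosting for the squared-error loss (Python/NumPy; my own illustration, with the step size and iteration cap chosen only for simplicity):

```python
import numpy as np

def epsilon_boosting(X, y, lam, eps=1e-3, max_iter=100000):
    """Stagewise forward selection (epsilon-boosting) for 0.5 * ||y - X beta||^2,
    stopped once the L1 norm of beta reaches lam."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = X.T @ (X @ beta - y)        # gradient of the squared-error risk
        k = np.argmax(np.abs(grad))        # steepest coordinate direction
        beta[k] -= eps * np.sign(grad[k])  # small step against the gradient
        if np.sum(np.abs(beta)) >= lam:
            break
    return beta
```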

4. Coordinatewise gradient descent (CGD) algorithm
Recall that for LASSO we need to solve: minimize $R(\beta)$ over $\beta$ subject to $\|\beta\|_1 \le \lambda$. For simplicity, we let $\lambda = 1$; for other $\lambda$, we rescale the inputs by multiplying them by $\lambda$. Let $e_k$ be the p-dimensional vector whose k-th entry is 1 and the others are zero (e.g., $e_1 = (1, 0, \ldots, 0)$); call it a coordinate vector. Let $E = \{e_k, -e_k : k = 1, \ldots, p\}$.

The LASSO problem can be restated as: minimize $R(\beta)$ over $\beta \in \mathrm{co}(E)$, where $\mathrm{co}(E)$ is the convex hull of $E$. The CGD algorithm updates $\beta$ as follows:
1. Compute the gradient vector $R^{(1)}(\beta) = \partial R(\beta) / \partial \beta$.
2. Choose the coordinate vector $e \in E$ which minimizes $e^\top R^{(1)}(\beta)$.
3. Find the convex combination of $\beta$ and $e$ which minimizes $R$; that is, find $\alpha \in [0, 1]$ which minimizes $R(\alpha\beta + (1 - \alpha)e)$.
4. Update $\beta = \alpha\beta + (1 - \alpha)e$.

That is, the LASSO problem is an optimization problem over the simplex of a given set of vertices. The CGD algorithm repeatedly finds the optimal vertex (i.e., the one with the smallest directional derivative) and takes the optimal convex combination (i.e., the one minimizing $R$).
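A sketch of the CGD iteration for the logistic loss (Python/NumPy; it assumes the logistic_risk and logistic_grad helpers from the earlier sketch are in scope, and replaces the exact one-dimensional optimization over $\alpha$ with a crude grid search):

```python
import numpy as np

def cgd_lasso(X, y, n_iter=200, n_grid=51):
    """Coordinatewise gradient descent over the L1 ball of radius 1 (lambda = 1)."""
    p = X.shape[1]
    beta = np.zeros(p)
    alphas = np.linspace(0.0, 1.0, n_grid)
    for _ in range(n_iter):
        g = logistic_grad(beta, X, y)
        k = np.argmax(np.abs(g))           # vertex e in E minimizing e'g
        e = np.zeros(p)
        e[k] = -np.sign(g[k])
        # one-dimensional search over convex combinations of beta and e
        cand = [a * beta + (1 - a) * e for a in alphas]
        risks = [logistic_risk(b, X, y) for b in cand]
        beta = cand[int(np.argmin(risks))]
    return beta
```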

Illustration: The number of vertices in E is 3.

Illustration: β is the current solution.

Illustration: Find the vertex.

Illustration: Take a convex combination.

Illustration: Move to the optimal convex combination.

Illustration: A typical path.

Remark
In each update, we only need a one-dimensional optimization to find the optimal convex combination. No matrix inversion is required. Hence, the algorithm can be used for very high-dimensional data.

Convergence
Let $\beta_m$ be the solution obtained after the m-th iteration of the CGD algorithm and let $\beta^*$ be the optimal solution. Then $R(\beta_m) - R(\beta^*) = O(m^{-1})$.
Note: the rate $O(m^{-1})$ appears to be optimal for optimization over a convex hull of this type (Jones, 1992; Barron, 1993; Zhang, 2003).

5. Gradient LASSO algorithm
A problem with the CGD algorithm: for a given $\beta$, recall that $\sigma = \{j : \beta_j \ne 0\}$. The CGD algorithm consists of only the addition step. If $\beta_k \ne 0$ at the current solution but the optimal value of $\beta_k$ is 0, the CGD algorithm can drive $\beta_k$ to zero only by repeatedly adding other coordinates, so it converges very slowly in this case. That is, when the optimal solution lies on the boundary of the simplex of E, the convergence is slow.

Illustration

Speeding up the CGD algorithm
Instead of moving toward one of the vertices, move directly toward the optimal solution inside the simplex by following the gradient direction, as follows.

Find the gradient direction

Take a convex combination

Find the optimal convex combination

In many cases the gradient may point outside the simplex; this happens when the current solution is on the boundary of the simplex. In this case, we project the gradient direction onto the simplex.

Illustration: The gradient points outside the simplex.

Illustration: Project the gradient onto the simplex.

Gradient LASSO algorithm
Let $A = \emptyset$ ($A$: the active set). The gradient LASSO algorithm consists of two steps, addition and deletion.
Addition: Move the solution using the CGD algorithm, which may add a new coordinate vector to $A$.
Deletion: Move toward the (projected) gradient direction on the simplex of $A$ (i.e., $\mathrm{co}(A)$), which may delete a coordinate vector from $A$.
The gradient LASSO algorithm repeats the addition and deletion steps until the solution converges.
Note: The gradient LASSO algorithm always converges faster than the CGD algorithm since the deletion step decreases the empirical risk.

Remark The gradient LASSO algorithm does not require matrix inversion. In each iteration, we need two one-dimensional optimizations - one for addition and the other for deletion, which can be done easily. Hence, the gradient LASSO algorithm has all of the advantages of the CGD algorithm, and at the same time, it improves the speed of convergence significantly.
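The following sketch (Python/NumPy) conveys the overall addition/deletion loop, but it is my own simplification, not the authors' implementation: the deletion step is approximated by a projected gradient step on the active coordinates, using the standard sort-based Euclidean projection onto the L1 ball instead of the simplex parameterization described in the talk. It again assumes the logistic_risk and logistic_grad helpers from the earlier sketch are in scope.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball of the given radius
    (sort-based algorithm; may set coordinates exactly to zero)."""
    if np.sum(np.abs(v)) <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = idx[u - (css - radius) / idx > 0][-1]
    theta = (css[rho - 1] - radius) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def gradient_lasso(X, y, lam=1.0, n_iter=200, n_grid=51, step=0.1):
    """Simplified gradient LASSO loop: CGD-style addition followed by a
    deletion-style projected-gradient move restricted to the active coordinates."""
    p = X.shape[1]
    beta = np.zeros(p)
    alphas = np.linspace(0.0, 1.0, n_grid)
    for _ in range(n_iter):
        # Addition: CGD update toward the best vertex of the radius-lam L1 ball
        g = logistic_grad(beta, X, y)
        k = np.argmax(np.abs(g))
        e = np.zeros(p)
        e[k] = -lam * np.sign(g[k])
        cand = [a * beta + (1 - a) * e for a in alphas]
        risks = [logistic_risk(b, X, y) for b in cand]
        beta = cand[int(np.argmin(risks))]
        # Deletion: gradient step on the active coordinates, projected back
        # onto the L1 ball; accepted only if the empirical risk does not increase
        active = beta != 0
        g = logistic_grad(beta, X, y)
        trial = beta.copy()
        trial[active] -= step * g[active]
        trial = project_l1_ball(trial, lam)
        if logistic_risk(trial, X, y) <= logistic_risk(beta, X, y):
            beta = trial
    return beta
```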

Comparison of the CGD and gradient LASSO algorithms
Starting from the null model (i.e., $\beta_k = 0$ for all $k$). [Figure: training error (left panel) and number of zero coefficients (right panel) versus the number of iterations.]

Starting from the full model (i.e., $\beta_k = 3/50$ for all $k$). [Figure: training error (left panel) and number of zero coefficients (right panel) versus the number of iterations.]

The CGD and gradient LASSO algorithms give similar results when we start from the null model. However, the CGD algorithm completely fails to delete unnecessary coefficients when we start from the full model, because it can only keep adding coefficients. It is interesting that, even though the numbers of zero coefficients are quite different in the full-model case, the empirical risks are very close. This observation suggests that finding an approximate solution (one which minimizes the empirical risk only approximately) is not enough for sparse learning, in particular for variable selection.

6. Simulation: Gradient LASSO vs. Osborne
Logistic regression. Sample size is 20. Only three coefficients are set to be nonzero. 100 repetitions.
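A sketch of a data-generating function matching this setup (Python/NumPy; the effect size, design distribution, and label coding are my own assumptions, not stated in the talk):

```python
import numpy as np

def simulate_logistic(n=20, p=50, n_nonzero=3, seed=0):
    """Synthetic data roughly matching the simulation setup: logistic model,
    small sample, only a few nonzero coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:n_nonzero] = 2.0                     # assumed effect size (not from the talk)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = np.where(rng.random(n) < prob, 1, -1)  # labels in {-1, +1}
    return X, y, beta
```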

Results

p    λ   Method    # NA  Time  Train error  # zero
50   1   Osborne   0     0.04  0.4240       45.38
         Gradient  0     0.13  0.4240       45.38
     10  Osborne   29    0.04  0.0242       37.63
         Gradient  0     0.48  0.0242       37.49
100  1   Osborne   0     0.06  0.4019       95.03
         Gradient  0     0.18  0.4019       95.02
     10  Osborne   63    0.07  0.0107       87.35
         Gradient  0     0.48  0.0108       86.92

Osborne's algorithm is faster, but it fails quite often when λ is large. In contrast, the gradient LASSO algorithm never fails and gives results similar to Osborne's, even though it needs more computing time.

7. Analysis of gene expression data
We analyzed the NCI60 data set: $n = 60$, $p = 7129$ (number of genes), 7 classes. The multiclass logistic regression model is used. The misclassification error is measured by repeatedly splitting the data at random into training (70%) and test (30%) sets 100 times. The regularization parameter λ is selected using five-fold cross-validation with the 0-1 loss (CV1) and the logistic loss (CV2).
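A sketch of this evaluation protocol (Python; scikit-learn's L1-penalized multiclass logistic regression is used here only as a stand-in for the gradient LASSO fit, and synthetic data stands in for NCI60; selecting the regularization strength by five-fold cross-validation, as in the talk, is omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))     # synthetic stand-in for the expression matrix
y = rng.integers(0, 7, size=60)    # 7 tissue classes (random labels, illustration only)

errors = []
for rep in range(100):             # 100 random 70/30 splits, as in the talk
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=rep)
    model = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
    model.fit(X_tr, y_tr)
    errors.append(np.mean(model.predict(X_te) != y_te))

print("mean misclassification rate:", np.mean(errors))
```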

Objective of the analysis
A popular approach for gene expression data is to select a small number of genes a priori using a prescreening measure such as the F-ratio, and then construct a predictive model. This saves computing time and makes it possible to use various advanced learning algorithms. But it is not clear how the prescreening affects accuracy. An objective of this analysis is to see the effect of prescreening on prediction accuracy.

Results [Figure: misclassification rate versus the number of inputs (10, 50, 500, 7129), with λ selected by CV1 (left panel) and CV2 (right panel).]

Remark
The optimal number of genes is 500, which is still large. Our results suggest that deleting too many genes a priori can degrade accuracy significantly. So we recommend using prescreening procedures not for selecting important genes but for deleting noisy genes. Hence, algorithms that can handle high-dimensional data are still necessary for better accuracy even after a prescreening step. The gradient LASSO algorithm is one such algorithm.

8. Concluding Remarks
There are various sparse learning methods; examples are the fused LASSO (Tibshirani et al., 2005), grouped LASSO (Yuan and Lin, 2004), blockwise sparse regression (Kim et al., 2006), SCAD (Fan and Li, 2001), and the elastic net (Zou and Hastie, 2004). We have seen that for sparse learning methods, crude approximate solutions may be significantly misleading, especially for variable selection. Hence, it is worth developing efficient and globally convergent computational algorithms for these methods, particularly for high-dimensional data.

The R library implementing the gradient LASSO algorithm can be downloaded at http://idea.snu.ac.kr/research/glassojskim/glasso.htm.