MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE
REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING
Francesca Odone and Lorenzo Rosasco
odone@disi.unige.it - lrosasco@mit.edu
June 6, 2011

ABOUT THIS CLASS GOAL To discuss the choice of the regularization parameter, giving a brief description of the theoretical results and an overview of a few heuristics used in practice.

PLAN The general problem: model selection. Error analysis (sketch): error decomposition and the bias-variance trade-off, sample error, approximation error. Heuristics used in practice.

REGULARIZATION PARAMETER We have seen that a learning algorithm can be seen as a map $S \mapsto f_S$ from the training set to the hypothesis space. Actually, most learning algorithms define a one-parameter family of solutions: given $\lambda > 0$, $S \mapsto f_S^\lambda$.

EXAMPLES Tikhonov regularization, spectral regularization, sparsity based regularization, manifold regularization, but also SVM, boosting... In all these algorithms one (or more) parameter has to be tuned to find the final solution. The parameter controls the regularity of the solution and the performance of the algorithm.

OPTIMAL CHOICE We can start by asking: (1) whether there exists an optimal parameter choice, (2) what it depends on, (3) and, most importantly, whether we can design a scheme to find it. Remember that our goal is to have good generalization properties...

ORACLE CHOICE Ideally we want to choose λ to make the generalization error small. Recall that
$$I[f] = \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy.$$
EXPECTED RISK: $\min_\lambda I[f_S^\lambda]$. EXCESS RISK: $\min_\lambda \{ I[f_S^\lambda] - \inf_f I[f] \}$. Both choices require knowledge of the probability distribution and can be considered as the choice of an oracle.
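
The oracle choice can be illustrated numerically on a synthetic problem where we are allowed to sample from $p(x,y)$: the expected risk of $f_S^\lambda$ is approximated by the error on a very large fresh sample, which plays the role of the oracle's knowledge of the distribution. This is a minimal sketch assuming Tikhonov regularization (kernel ridge regression) with the square loss and a Gaussian kernel; the data, kernel width and grid of λ values are illustrative.

```python
# "Oracle" parameter choice on a synthetic problem: the expected risk
# I[f_S^lambda] is approximated by the empirical error on a huge fresh sample.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(n)   # target + noise
    return x, y

def gauss_kernel(a, b, sigma=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(x, y, lam):
    # Tikhonov regularization: c = (K + lambda*n*I)^{-1} y
    n = len(y)
    K = gauss_kernel(x, x)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(x_tr, c, x_te):
    return gauss_kernel(x_te, x_tr) @ c

x_tr, y_tr = sample(100)            # training set S
x_big, y_big = sample(50_000)       # huge fresh sample, proxy for p(x, y)

for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    c = krr_fit(x_tr, y_tr, lam)
    risk = np.mean((krr_predict(x_tr, c, x_big) - y_big) ** 2)
    print(f"lambda = {lam:g}  approx. expected risk = {risk:.4f}")
```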

CONSISTENCY A minimal requirement on the parameter choice $\lambda = \lambda_n$ is that it should lead to consistency. CONSISTENCY: for all $\epsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}\{ I[f_S^{\lambda_n}] - \inf_f I[f] > \epsilon \} = 0.$$

PROBABILISTIC BOUND A possible approach is that of: 1) finding a suitable probabilistic bound for fixed λ, 2) minimizing the bound w.r.t. λ. There are two ways of writing bounds. For $\lambda > 0$ and all $\epsilon > 0$,
$$\mathbb{P}\{ I[f_S^\lambda] - \inf_f I[f] \le \epsilon \} \ge 1 - \eta(\epsilon, \lambda, n),$$
or, for $\lambda > 0$ and $0 < \eta \le 1$, with probability at least $1 - \eta$,
$$I[f_S^\lambda] - \inf_f I[f] \le \epsilon(\eta, \lambda, n).$$

A PRIORI PARAMETER CHOICE We can then define the parameter choice $\lambda = \lambda(\eta, n)$ minimizing the bound, i.e. $\min_\lambda \epsilon(\eta, \lambda, n)$. One can easily see that such a choice leads to consistency. We have yet to see how to find a bound...

ERROR DECOMPOSITION The first step is, often, to consider a suitable error decomposition
$$I[f_S^\lambda] - \inf_f I[f] = \big( I[f_S^\lambda] - I[f_\lambda] \big) + \big( I[f_\lambda] - \inf_f I[f] \big).$$
The function $f_\lambda$ is the infinite-sample regularized solution; for example, in Tikhonov regularization it is the solution of
$$\min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy + \lambda \|f\|^2.$$

ERROR DECOMPOSITION (CONT.) Consider
$$I[f_S^\lambda] - \inf_f I[f] = \underbrace{I[f_S^\lambda] - I[f_\lambda]}_{\text{sample error}} + \underbrace{I[f_\lambda] - \inf_f I[f]}_{\text{approximation error}}.$$
The sample error $I[f_S^\lambda] - I[f_\lambda]$ quantifies the error due to finite sampling. The approximation error $I[f_\lambda] - \inf_f I[f]$ quantifies the bias error due to the chosen regularization scheme.

TRADE-OFF The two terms typically have opposite behaviors as functions of λ: the sample error decreases as λ grows, while the approximation error increases. [Figure: the two error curves plotted against λ, crossing near the optimal value λ_n.] The parameter choice λ solves a bias-variance trade-off.

SAMPLE ERROR To study the sample error, we have to compare empirical quantities to expected quantities. The main mathematical tools we can use are quantitative versions of the law of large numbers.

CONCENTRATION INEQUALITIES If $\xi_1, \dots, \xi_n$ are i.i.d. zero-mean real random variables with $|\xi_i| \le C$, $i = 1, \dots, n$, then Hoeffding's inequality ensures that for all $\epsilon > 0$
$$\mathbb{P}\Big\{ \Big| \frac{1}{n} \sum_i \xi_i \Big| \ge \epsilon \Big\} \le 2 e^{-\frac{n \epsilon^2}{2 C^2}},$$
or equivalently, setting $\tau = \frac{n \epsilon^2}{2 C^2}$, we have with probability at least (with confidence) $1 - 2 e^{-\tau}$
$$\Big| \frac{1}{n} \sum_i \xi_i \Big| \le C \sqrt{\frac{2\tau}{n}}.$$
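
A quick numerical sanity check of the inequality; the choice of distribution (uniform on [-C, C]) and the constants are purely illustrative.

```python
# Compare the empirical tail probability of the sample mean with the
# Hoeffding bound 2*exp(-n*eps^2 / (2*C^2)) for bounded, zero-mean variables.
import numpy as np

rng = np.random.default_rng(0)
n, C, eps, trials = 200, 1.0, 0.15, 20_000

xi = rng.uniform(-C, C, size=(trials, n))     # zero mean, |xi_i| <= C
means = xi.mean(axis=1)

empirical = np.mean(np.abs(means) >= eps)
bound = 2 * np.exp(-n * eps ** 2 / (2 * C ** 2))
print(f"empirical P(|mean| >= {eps}) = {empirical:.4f}   Hoeffding bound = {bound:.4f}")
```

Running it shows the bound holds but is quite loose, which is in line with the remark below that such general bounds tend to be pessimistic.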

THE CASE OF SPECTRAL REGULARIZATION (CONT.) The explicit form of the sample error is typically
$$I[f_S^\lambda] - I[f_\lambda] \le \frac{C}{\lambda^\alpha n^\beta},$$
where the bound holds with high probability. If λ decreases sufficiently slowly, the sample error goes to zero as n increases.

APPROXIMATION ERROR Compare $f_\lambda$ and $f$ solving
$$\min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy + \lambda \|f\|^2 \quad\text{and}\quad \min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy.$$
The approximation error is purely deterministic. It is typically easy to prove that it decreases as λ decreases. Its explicit behavior, however, depends on the problem at hand: this is an instance of the so-called no free lunch theorem.

A FEW BASIC QUESTIONS Can we learn any problem consistently? YES! Can we always learn at some prescribed speed? NO! It depends on the problem! The latter statement is called the no free lunch theorem.

APPROXIMATION ERROR (CONT.) We have to restrict to a class of problems. Typical examples: the target function f minimizing I[f] belongs to an RKHS; the target function f belongs to some Sobolev space with smoothness s; the target function f depends only on a few variables... Usually the regularity of the target function is summarized in a regularity index r, and the approximation error depends on this index:
$$I[f_\lambda] - \inf_f I[f] \le C \lambda^{2r}.$$

PUTTING ALL TOGETHER Putting everything together we get, with high probability,
$$I[f_S^\lambda] - \inf_f I[f] \le \frac{C}{\lambda^\alpha n^\beta} + C \lambda^{2r},$$
where the constant depends on the confidence τ. We choose $\lambda_n \sim n^{-\frac{1}{2r+1}}$ to optimize the bound. If we set $f_S = f_S^{\lambda_n}$ we get, with high probability,
$$I[f_S] - \inf_f I[f] \le C\, n^{-\frac{2r}{2r+1}}.$$
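
To see the trade-off concretely, one can minimize the bound over a grid of λ values and check how the minimizer scales with n. The sketch below assumes, purely for illustration, $\alpha = \beta = 1$ in the sample-error term; with that choice the minimizer scales as $n^{-1/(2r+1)}$.

```python
# Minimize the bound lambda^{-alpha} n^{-beta} + lambda^{2r} over a grid of
# lambda values and compare the minimizer with n^{-1/(2r+1)}.
import numpy as np

r, alpha, beta = 0.5, 1.0, 1.0          # alpha = beta = 1 assumed for illustration
lams = np.logspace(-6, 1, 2000)

for n in [100, 1_000, 10_000, 100_000]:
    bound = lams ** (-alpha) * n ** (-beta) + lams ** (2 * r)
    lam_star = lams[np.argmin(bound)]
    print(f"n = {n:6d}  argmin lambda = {lam_star:.4e}   "
          f"n^(-1/(2r+1)) = {n ** (-1 / (2 * r + 1)):.4e}")
```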

ADAPTIVITY The parameter choice $\lambda_n = n^{-\frac{1}{2r+1}}$ depends on the regularity index r, which is typically unknown. In the last ten years a main trend in non-parametric statistics has been the design of parameter choices that do not depend on r and still achieve the rate $n^{-\frac{2r}{2r+1}}$. For this reason these kinds of parameter choices are called adaptive.

THE THEORY AND THE TRUTH The bounds are often asymptotically tight and in fact optimal in a suitable sense. Nonetheless, we often pay the price of the great generality under which they hold: they are often pessimistic. In practice we have to resort to heuristics to choose the regularization parameter. WHY CARE ABOUT BOUNDS? Hopefully along the way we learn something about the problems and the algorithms we use.

HOLD-OUT ESTIMATE One of the most common heuristics is probably the following. HOLD-OUT ESTIMATE: Split the training set S into S_1 and S_2. Compute estimators on S_1 for different values of λ. Choose the λ minimizing the empirical error on S_2. Repeat for different splits and average the answers.
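
A minimal sketch of the hold-out heuristic, again assuming kernel ridge regression with a Gaussian kernel as the learning algorithm; the toy data, split size and grid of λ values are illustrative. Repeating over several random splits and averaging the selected error curves is the variant mentioned above.

```python
# Hold-out selection of lambda: train on S1, pick the lambda with the
# smallest empirical error on S2.
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, sigma=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# toy data standing in for the training set S
x = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(200)

# split S into S1 (training) and S2 (validation)
perm = rng.permutation(len(y))
S1, S2 = perm[:140], perm[140:]

lams = np.logspace(-6, 0, 13)
errs = []
for lam in lams:
    # estimator on S1: Tikhonov regularization, c = (K + lambda*n*I)^{-1} y
    K1 = gauss_kernel(x[S1], x[S1])
    c = np.linalg.solve(K1 + lam * len(S1) * np.eye(len(S1)), y[S1])
    # empirical error on S2
    pred = gauss_kernel(x[S2], x[S1]) @ c
    errs.append(np.mean((pred - y[S2]) ** 2))

print("hold-out choice of lambda:", lams[int(np.argmin(errs))])
```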

CROSS VALIDATION When data are scarce, other strategies, which are variations of hold-out, are preferable. K-FOLD CROSS-VALIDATION: Split the training set S into k groups. For fixed λ, train on k − 1 groups, test on the group left out, and sum up the errors. Repeat for different values of λ and choose the one minimizing the cross-validation error. One can repeat for different splits and average the answers to decrease the variance. When data are really scarce we can take k = n; this strategy is called leave-one-out (LOO).
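
A sketch of k-fold cross-validation for λ, with kernel ridge regression as the (illustrative) learning algorithm. The folds are drawn once so that every value of λ is compared on the same splits.

```python
# 5-fold cross-validation over a grid of lambda values.
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, sigma=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

x = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(200)

k = 5
folds = np.array_split(rng.permutation(len(y)), k)   # same folds for every lambda

def cv_error(lam):
    err = 0.0
    for f in folds:
        tr = np.setdiff1d(np.arange(len(y)), f)      # train on the other k-1 groups
        K_tr = gauss_kernel(x[tr], x[tr])
        c = np.linalg.solve(K_tr + lam * len(tr) * np.eye(len(tr)), y[tr])
        pred = gauss_kernel(x[f], x[tr]) @ c         # test on the group left out
        err += np.sum((pred - y[f]) ** 2)            # sum up the errors
    return err / len(y)

lams = np.logspace(-6, 0, 13)
print("5-fold CV choice of lambda:", min(lams, key=cv_error))
```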

IMPLEMENTATION: REGULARIZATION PATH The real computational cost is often that of computing the solutions for several values of the regularization parameter (the so-called regularization path). From this point of view we can reconsider the different computational costs of spectral regularization schemes.
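
For Tikhonov regularization, for instance, a single eigendecomposition of the kernel matrix gives the whole path cheaply, since $(K + n\lambda I)^{-1} y = Q(\Sigma + n\lambda I)^{-1} Q^\top y$ for every λ. A minimal sketch with illustrative data:

```python
# The whole Tikhonov regularization path from one eigendecomposition of K:
# with K = Q diag(s) Q^T, the coefficients are c(lambda) = Q diag(1/(s + n*lambda)) Q^T y.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(n)

d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.5 ** 2))

s, Q = np.linalg.eigh(K)          # one O(n^3) factorization...
Qty = Q.T @ y

lams = np.logspace(-8, 0, 50)
path = [Q @ (Qty / (s + n * lam)) for lam in lams]   # ...then O(n^2) per lambda
print(len(path), "solutions computed from one eigendecomposition")
```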

K-FOLD CROSS-VALIDATION FOR ITERATIVE METHODS Iterative methods have the interesting property that the number of iterations of the underlying optimization is the regularization parameter. This implies that we can essentially compute k-fold cross-validation without increasing the computational complexity.
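
A sketch of this idea for gradient descent on the empirical squared loss (a Landweber-type iteration): every iterate is a candidate solution, so the validation error over the whole "parameter range" comes out of a single optimization run. The step size, kernel and data are illustrative, and only one train/validation split is shown; looping over k folds is analogous.

```python
# Early stopping for kernel gradient descent: the iteration index plays the
# role of the regularization parameter, and validating every iterate is free.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(n)

tr, va = np.arange(150), np.arange(150, n)           # one train/validation split
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.5 ** 2))
K_tr, K_va = K[np.ix_(tr, tr)], K[np.ix_(va, tr)]

step = 1.0 / np.linalg.eigvalsh(K_tr).max()          # safe step size for a PSD matrix
c = np.zeros(len(tr))
val_err = []
for t in range(500):
    c = c + step * (y[tr] - K_tr @ c)                # one gradient step on the training fold
    val_err.append(np.mean((K_va @ c - y[va]) ** 2)) # validation error of the t-th iterate

print("early-stopping iteration chosen by validation:", int(np.argmin(val_err)) + 1)
```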

LOO IMPLEMENTATION FOR RLS In the case of Tikhonov regularization there is a simple closed form for the LOO error,
$$\mathrm{LOO}(\lambda) = \sum_{i=1}^n \left( \frac{y_i - f_S^\lambda(x_i)}{\big( I - K (K + \lambda n I)^{-1} \big)_{ii}} \right)^2.$$
If we can compute $\big( I - K (K + \lambda n I)^{-1} \big)_{ii}$ easily, the price of LOO (for fixed λ) is that of calculating $f_S^\lambda$. It turns out that computing the eigendecomposition of the kernel matrix can actually be convenient in this case.
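
A sketch of the closed-form LOO computation, reusing one eigendecomposition of K for the whole grid of λ values (square loss assumed; data and grid are illustrative).

```python
# Closed-form leave-one-out error for RLS: residuals are
# (y_i - f(x_i)) / (I - K (K + lambda n I)^{-1})_{ii}, computed from the
# eigendecomposition K = Q diag(s) Q^T at O(n^2) cost per lambda.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(n)

d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.5 ** 2))

s, Q = np.linalg.eigh(K)                    # one eigendecomposition reused for every lambda
Qty = Q.T @ y
Q2 = Q ** 2

def loo(lam):
    h = s / (s + n * lam)                   # eigenvalues of the hat matrix K (K + lambda n I)^{-1}
    fitted = Q @ (h * Qty)                  # f_S^lambda at the training points
    diag_H = Q2 @ h                         # diagonal of the hat matrix
    resid = (y - fitted) / (1.0 - diag_H)   # closed-form leave-one-out residuals
    return np.sum(resid ** 2)

lams = np.logspace(-8, 0, 30)
print("lambda minimizing the closed-form LOO error:", min(lams, key=loo))
```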

HOW SHOULD WE EXPLORE THE PARAMETER RANGE? In practice we have to choose several things. Minimum and maximum value? We can look at the maximum and minimum eigenvalues of the kernel matrix to get a reasonable range (iterative methods don't need one, though). Step size? The answer depends on the algorithm: for iterative and projection methods the regularization is intrinsically discrete; for Tikhonov regularization we can take a geometric series $\lambda_i = \lambda_0 q^i$, for some $q > 1$.
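
A minimal sketch of building such a geometric grid, with the endpoints suggested by the eigenvalues of the kernel matrix (divided by n, since Tikhonov adds λnI). The number of grid points and the safeguard on numerically zero eigenvalues are illustrative choices.

```python
# Geometric grid lambda_i = lambda_0 * q^i between eigenvalue-based endpoints.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (200, 1))
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.5 ** 2))

n = K.shape[0]
eigs = np.linalg.eigvalsh(K)
lam_min = max(eigs.min(), 1e-12) / n       # guard against numerically zero eigenvalues
lam_max = eigs.max() / n

q = (lam_max / lam_min) ** (1.0 / 29)      # 30 geometrically spaced values
lams = lam_min * q ** np.arange(30)
print(f"grid from {lams[0]:.2e} to {lams[-1]:.2e} with ratio q = {q:.2f}")
```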

WHAT ABOUT KERNEL PARAMETERS? So far we have only talked about λ and not about kernel parameters. Both parameters control the complexity of the solution. Clearly we can minimize the cross-validation error w.r.t. both parameters. Often a rough choice of the kernel parameter is acceptable if we eventually fine tune λ. For the Gaussian kernel, a reasonable value of the width σ can often be chosen by looking at some statistics of the distances among input points.
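
One such statistic, used here purely as an illustration, is the median of the pairwise distances between input points; other quantiles or the mean are equally common choices.

```python
# Median-distance heuristic for the Gaussian kernel width sigma.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (300, 2))

d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
sigma = np.median(d[np.triu_indices_from(d, k=1)])     # median pairwise distance
print("median-heuristic sigma:", sigma)
```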

CONCENTRATION INEQUALITIES (CONT.) Similarly, if $\xi_1, \dots, \xi_n$ are zero-mean independent random variables with values in a Hilbert space and $\|\xi_i\| \le C$, $i = 1, \dots, n$, then the same inequality holds with the absolute value replaced by the norm in the Hilbert space. The resulting inequality,
$$\Big\| \frac{1}{n} \sum_i \xi_i \Big\| \le C \sqrt{\frac{2\tau}{n}},$$
holding with probability at least $1 - 2 e^{-\tau}$, is due to Pinelis.

THE CASE OF SPECTRAL REGULARIZATION We have to compare $f_S^\lambda = g_\lambda(K) Y$ with the infinite-sample regularized solution $f_\lambda$, which can be written as $f_\lambda = g_\lambda(L_K) L_K f$, where
$$L_K f(s) = \int_X K(x, s) f(x)\, p(x)\, dx \quad\text{and}\quad f(x) = \int_Y y\, p(y|x)\, dy.$$
For n large enough we expect $\frac{1}{n} K \approx L_K$.

OUT OF SAMPLE EXTENSION How can we compare the matrix K and the operator $L_K$? Consider the out-of-sample extension of K, i.e.
$$L_{K,n} f(s) = \frac{1}{n} \sum_{i=1}^n K(x_i, s) f(x_i).$$
Note that we can write $L_{K,n}$ and $L_K$ as
$$L_{K,n} = \frac{1}{n} \sum_{i=1}^n \langle K_{x_i}, \cdot \rangle K_{x_i}, \qquad L_K = \int_X \langle K_x, \cdot \rangle K_x\, p(x)\, dx,$$
where $K_x = K(x, \cdot)$.

THE CASE OF SPECTRAL REGULARIZATION (CONT.) The explicit forms of $L_{K,n}$ and $L_K$,
$$L_{K,n} = \frac{1}{n} \sum_{i=1}^n \langle K_{x_i}, \cdot \rangle K_{x_i}, \qquad L_K = \int_X \langle K_x, \cdot \rangle K_x\, p(x)\, dx,$$
suggest letting $\xi_i = \langle K_{x_i}, \cdot \rangle K_{x_i}$ and looking at $L_{K,n}$ as the empirical average of $L_K$. In fact, using the law of large numbers, we have with high probability
$$\| L_{K,n} - L_K \| \le C \sqrt{\frac{2\tau}{n}},$$
where C is a constant depending on the kernel.