MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE
REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING

Francesca Odone and Lorenzo Rosasco
odone@disi.unige.it - lrosasco@mit.edu
June 3, 2013

ABOUT THIS CLASS

GOAL: To discuss the choice of the regularization parameter, giving a brief description of the theoretical results and an overview of a few heuristics used in practice.

PLAN

The general problem: model selection
Error analysis (sketch):
  error decomposition: the bias-variance trade-off
  sample error
  approximation error
Heuristics

REGULARIZATION PARAMETER

We have seen that a learning algorithm can be seen as a map $S \mapsto f_S$ from the training set to the hypothesis space. In fact, most learning algorithms define a one-parameter family of solutions, i.e. given $\lambda > 0$,
$$S \mapsto f_S^\lambda.$$

EXAMPLES

Tikhonov regularization, spectral regularization, sparsity-based regularization, manifold regularization, but also SVM, boosting, ...

In all these algorithms one (or more) parameter has to be tuned to find the final solution. The parameter controls the regularity of the solution and the performance of the algorithm.

OPTIMAL CHOICE

We can start by asking:
1. whether there exists an optimal parameter choice,
2. what it depends on,
3. and, most importantly, whether we can design a scheme to find it.

Remember that our goal is to have good generalization properties...

ORACLE CHOICE

Ideally we want to choose $\lambda$ to make the generalization error small. Recall that
$$I[f] = \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy.$$

EXPECTED RISK: $\min_\lambda \{ I[f_S^\lambda] \}$

EXCESS RISK: $\min_\lambda \{ I[f_S^\lambda] - \inf_f \{ I[f] \} \}$

Both choices require knowledge of the probability distribution and can be considered as the choice of an oracle.

CONSISTENCY

A minimal requirement on the parameter choice $\lambda = \lambda_n$ is that it should lead to consistency.

CONSISTENCY: For all $\epsilon > 0$,
$$\lim_{n \to \infty} P\{ I[f_S^{\lambda_n}] - \inf_f \{ I[f] \} > \epsilon \} = 0.$$

PROBABILISTIC BOUND

A possible approach is that of: 1) finding a suitable probabilistic bound for fixed $\lambda$, 2) minimizing the bound w.r.t. $\lambda$.

There are two ways of writing bounds. For $\lambda > 0$ and all $\epsilon > 0$,
$$P\{ I[f_S^\lambda] - \inf_f \{ I[f] \} \le \epsilon \} \ge 1 - \eta(\epsilon, \lambda, n),$$
or, for $\lambda > 0$ and $0 < \eta \le 1$, with probability at least $1 - \eta$,
$$I[f_S^\lambda] - \inf_f \{ I[f] \} \le \epsilon(\eta, \lambda, n).$$

A PRIORI PARAMETER CHOICE

We can then define the parameter choice $\lambda = \lambda(\eta, n)$ minimizing the bound, i.e.
$$\min_\lambda \epsilon(\eta, \lambda, n).$$
One can easily see that such a choice leads to consistency. We have yet to see how to find a bound...

ERROR DECOMPOSITION

The first step is, often, to consider a suitable error decomposition
$$I[f_S^\lambda] - \inf_f \{ I[f] \} = \big( I[f_S^\lambda] - I[f^\lambda] \big) + \big( I[f^\lambda] - \inf_f \{ I[f] \} \big).$$
The function $f^\lambda$ is the infinite-sample regularized solution; for example, in Tikhonov regularization it is the solution of
$$\min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy + \lambda \|f\|^2.$$

ERROR DECOMPOSITION (CONT.)

Consider
$$I[f_S^\lambda] - \inf_f \{ I[f] \} = \underbrace{I[f_S^\lambda] - I[f^\lambda]}_{\text{sample error}} + \underbrace{I[f^\lambda] - \inf_f \{ I[f] \}}_{\text{approximation error}}.$$

The sample error $I[f_S^\lambda] - I[f^\lambda]$ quantifies the error due to finite sampling.
The approximation error $I[f^\lambda] - \inf_f \{ I[f] \}$ quantifies the bias error due to the chosen regularization scheme.

TRADE-OFF

The two terms typically have opposite behavior: as $\lambda$ grows, the sample error decreases while the approximation error increases. [Figure: sample error and approximation error as functions of $\lambda$; the curves cross near the optimal $\lambda_n$.] The parameter choice $\lambda_n$ solves a bias-variance trade-off.

SAMPLE ERROR

To study the sample error, we have to compare empirical quantities to expected quantities. The main mathematical tools we can use are quantitative versions of the law of large numbers.

CONCENTRATION INEQUALITIES

If $\xi_1, \ldots, \xi_n$ are i.i.d. zero-mean real random variables with $|\xi_i| \le C$, $i = 1, \ldots, n$, then Hoeffding's inequality ensures that for all $\epsilon > 0$,
$$P\Big\{ \Big| \frac{1}{n} \sum_i \xi_i \Big| \ge \epsilon \Big\} \le 2 e^{-\frac{n\epsilon^2}{2C^2}},$$
or equivalently, setting $\tau = \frac{n\epsilon^2}{2C^2}$, we have with probability at least (with confidence) $1 - 2e^{-\tau}$,
$$\Big| \frac{1}{n} \sum_i \xi_i \Big| \le C \sqrt{\frac{2\tau}{n}}.$$
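
As a quick sanity check (not part of the original slides), the bound can be verified numerically; the sketch below, in Python with NumPy, uses zero-mean variables uniform on $[-C, C]$ (an illustrative choice) and compares the empirical tail probability with $2e^{-n\epsilon^2/2C^2}$.

```python
import numpy as np

# Numerical sanity check of Hoeffding's inequality (illustrative, not from the slides).
# Assumption: xi_i ~ Uniform[-C, C], so E[xi_i] = 0 and |xi_i| <= C.
rng = np.random.default_rng(0)
n, C, eps, trials = 200, 1.0, 0.2, 20000

xi = rng.uniform(-C, C, size=(trials, n))       # 'trials' independent samples of size n
means = xi.mean(axis=1)                         # empirical means (1/n) sum_i xi_i
empirical_tail = np.mean(np.abs(means) >= eps)  # estimate of P{ |mean| >= eps }
hoeffding_bound = 2 * np.exp(-n * eps**2 / (2 * C**2))

# The bound always holds; as noted later in the slides, such bounds are often loose.
print(f"empirical tail probability: {empirical_tail:.4f}")
print(f"Hoeffding bound:            {hoeffding_bound:.4f}")
```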

THE CASE OF SPECTRAL REGULARIZATION (CONT.)

The explicit form of the sample error is typically
$$I[f_S^\lambda] - I[f^\lambda] \le \frac{C \log \frac{2}{\eta}}{\lambda n},$$
where the above bound holds with probability at least $1 - \eta$. If $\lambda$ decreases sufficiently slowly, the sample error goes to zero as $n$ increases.

APPROXIMATION ERROR

Compare $f^\lambda$ and $f$ solving
$$\min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy + \lambda \|f\|^2 \quad \text{and} \quad \min_{f \in \mathcal{H}} \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy.$$

The approximation error is purely deterministic. It is typically easy to prove that it decreases as $\lambda$ decreases, but its explicit behavior depends on the problem at hand. This last statement is an instance of the so-called no free lunch theorem.

A FEW BASIC QUESTIONS

Can we learn any problem consistently? YES!
Can we always learn at some prescribed speed? NO! It depends on the problem!
The latter statement is the so-called no free lunch theorem.

APPROXIMATION ERROR (CONT.)

We have to restrict to a class of problems. Typical examples:
the target function $f$ minimizing $I[f]$ belongs to an RKHS,
the target function $f$ belongs to some Sobolev space with smoothness $s$,
the target function $f$ depends only on a few variables, ...

Usually the regularity of the target function is summarized in a regularity index $r$, and the approximation error depends on such an index:
$$I[f^\lambda] - I[f] \le C \lambda^{2r}.$$

PUTTING IT ALL TOGETHER

Putting everything together, we get, with probability at least $1 - \eta$,
$$I[f_S^\lambda] - \inf_f \{ I[f] \} \le \frac{C \log \frac{2}{\eta}}{\lambda n} + C \lambda^{2r}.$$
We choose $\lambda_n = n^{-\frac{1}{2r+1}}$ to optimize the bound. If we set $f_S = f_S^{\lambda_n}$, we get, with high probability,
$$I[f_S] - \inf_f \{ I[f] \} \le C\, n^{-\frac{2r}{2r+1}} \log \frac{2}{\eta}.$$
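
For completeness, a short derivation (not spelled out on the slide) of why this choice balances the two terms, with constants and log factors absorbed into $C$:

```latex
\begin{aligned}
\frac{d}{d\lambda}\left(\frac{C}{\lambda n} + C\lambda^{2r}\right)
  = -\frac{C}{\lambda^{2} n} + 2rC\lambda^{2r-1} = 0
  &\;\Longrightarrow\; \lambda^{2r+1} = \frac{1}{2r\,n}
  \;\Longrightarrow\; \lambda_n \sim n^{-\frac{1}{2r+1}},\\
\text{so that}\qquad
\frac{1}{\lambda_n n} \;\sim\; \lambda_n^{2r} &\;\sim\; n^{-\frac{2r}{2r+1}}.
\end{aligned}
```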

ADAPTIVITY

The parameter choice $\lambda_n = n^{-\frac{1}{2r+1}}$ depends on the regularity index $r$, which is typically unknown. In the last ten years and more, a main trend in non-parametric statistics has been the design of parameter choices that do not depend on $r$ and still achieve the rate $n^{-\frac{2r}{2r+1}}$. For this reason these kinds of parameter choices are called adaptive.

THE THEORY AND THE TRUTH

The bounds are often asymptotically tight, and in fact the rates are optimal in a suitable sense.
Nonetheless, we often pay the price of the great generality under which they hold: they are often pessimistic.
Constants are likely to be non-optimal.
In practice we have to resort to heuristics to choose the regularization parameter.

WHY CARE ABOUT BOUNDS? Hopefully, along the way we learn something about the problems and the algorithms we use.

HOLD-OUT ESTIMATE

One of the most common heuristics is probably the following.

HOLD-OUT ESTIMATE
Split the training set $S$ into $S_1$ and $S_2$.
Find estimators on $S_1$ for different values of $\lambda$.
Choose the $\lambda$ minimizing the empirical error on $S_2$.
Repeat for different splits and average the answers.
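
A minimal sketch of the hold-out procedure for kernel ridge regression (Tikhonov regularization with the square loss) in Python/NumPy; the Gaussian kernel, the synthetic data, and the $\lambda$ grid are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def rls_coefficients(K, y, lam):
    """Tikhonov/RLS solution in the kernel expansion: c = (K + lambda*n*I)^{-1} y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

# Illustrative data: y = sin(3x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(120, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(120)

# Split S into S1 (train) and S2 (validation).
idx = rng.permutation(len(X))
tr, va = idx[:80], idx[80:]
sigma = 0.5
K_tr = gaussian_kernel(X[tr], X[tr], sigma)
K_va = gaussian_kernel(X[va], X[tr], sigma)

# Estimators on S1 for different lambda; pick the lambda with smallest error on S2.
lambdas = np.logspace(-6, 0, 13)
errors = [np.mean((K_va @ rls_coefficients(K_tr, y[tr], lam) - y[va]) ** 2)
          for lam in lambdas]
print("hold-out choice of lambda:", lambdas[int(np.argmin(errors))])
```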

CROSS-VALIDATION

When the data are few, other strategies, essentially variations of hold-out, are preferable.

K-FOLD CROSS-VALIDATION
Split the training set $S$ into $k$ groups.
For fixed $\lambda$, train on the other $k-1$ groups, test on the group left out, and sum up the errors.
Repeat for different values of $\lambda$ and choose the one minimizing the cross-validation error.
One can repeat for different splits and average the answers to decrease the variance.
When the data are really few we can take $k = n$; this strategy is called leave-one-out (LOO).
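
A corresponding k-fold sketch, here for plain linear ridge regression to keep it short (again an illustrative setup); with $k = n$ the same loop computes the leave-one-out error.

```python
import numpy as np

def ridge(X, y, lam):
    """Linear Tikhonov regularization: w = (X^T X + lambda*n*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

def kfold_error(X, y, lam, k, seed=1):
    """Sum of validation errors over the k folds for a fixed lambda (k = n gives LOO)."""
    n = len(X)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    err = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        w = ridge(X[train], y[train], lam)
        err += np.sum((X[fold] @ w - y[fold]) ** 2)
    return err

# Illustrative data and lambda grid.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(100)
lambdas = np.logspace(-6, 1, 15)

cv = [kfold_error(X, y, lam, k=5) for lam in lambdas]  # same folds for every lambda
print("5-fold CV choice of lambda:", lambdas[int(np.argmin(cv))])
```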

IMPLEMENTATION: REGULARIZATION PATH

The real computational price is often that of finding the solutions for several values of the regularization parameter (the so-called regularization path). In this view we can somewhat reconsider the different computational prices of spectral regularization schemes.

K-FOLD CROSS-VALIDATION FOR ITERATIVE METHODS

Iterative methods have the interesting property that the iteration number underlying the optimization is the regularization parameter. This implies that we can essentially compute the k-fold cross-validation error along the whole regularization path without increasing the computational complexity.
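
A sketch of this idea for the Landweber iteration (gradient descent in the RKHS) on kernel least squares: a single run over the iterations yields the validation error for every value of the regularization parameter, i.e. every iteration number, at once. The step size, the data, and the single hold-out split below are illustrative assumptions; the k-fold version simply repeats the run on each fold.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

# Illustrative data and a single hold-out split.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(120, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(120)
tr, va = np.arange(80), np.arange(80, 120)

K_tr = gaussian_kernel(X[tr], X[tr])
K_va = gaussian_kernel(X[va], X[tr])
n = len(tr)

# Landweber / gradient descent in the RKHS for kernel least squares
# (constant factors absorbed into the step size):
#   c_{t+1} = c_t - (step/n) * (K c_t - y).
# The iteration number t is the regularization parameter, so one run
# gives the validation error along the whole regularization path.
step = n / np.linalg.eigvalsh(K_tr).max()   # keeps the iteration stable
c = np.zeros(n)
val_err = []
for t in range(500):
    c = c - (step / n) * (K_tr @ c - y[tr])
    val_err.append(np.mean((K_va @ c - y[va]) ** 2))

print("early-stopping iteration chosen on the validation set:", int(np.argmin(val_err)) + 1)
```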

LOO IMPLEMENTATION FOR RLS

In the case of Tikhonov regularization there is a simple closed form for the LOO error:
$$\mathrm{LOO}(\lambda) = \sum_{i=1}^n \left( \frac{y_i - f_S^\lambda(x_i)}{\big(I - K(K + \lambda n I)^{-1}\big)_{ii}} \right)^2.$$
If we can compute $\big(I - K(K + \lambda n I)^{-1}\big)_{ii}$ easily, the price of LOO (for fixed $\lambda$) is that of calculating $f_S^\lambda$. It turns out that computing the eigendecomposition of the kernel matrix can actually be convenient in this case.
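
A sketch (Python/NumPy, illustrative data and grid) of how the eigendecomposition $K = Q\,\mathrm{diag}(s)\,Q^T$ makes both $f_S^\lambda(x_i)$ and the diagonal of $K(K + \lambda n I)^{-1}$ cheap to recompute for every $\lambda$:

```python
import numpy as np

def loo_errors(K, y, lambdas):
    """Closed-form LOO(lambda) for RLS, reusing one eigendecomposition of K."""
    n = len(y)
    s, Q = np.linalg.eigh(K)          # K = Q diag(s) Q^T
    Qty = Q.T @ y
    loo = []
    for lam in lambdas:
        # H = K (K + lam*n*I)^{-1} has the same eigenvectors, eigenvalues s_i/(s_i + lam*n).
        h = s / (s + lam * n)
        y_hat = Q @ (h * Qty)         # f_S^lambda evaluated at the training points
        diag_H = Q**2 @ h             # diag(H)_i = sum_j Q_ij^2 h_j
        loo.append(np.sum(((y - y_hat) / (1.0 - diag_H)) ** 2))
    return np.array(loo)

# Illustrative use with a linear kernel on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.3 * rng.standard_normal(100)
K = X @ X.T
lambdas = np.logspace(-6, 1, 15)
print("LOO choice of lambda:", lambdas[int(np.argmin(loo_errors(K, y, lambdas)))])
```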

HOW SHOULD WE EXPLORE THE PARAMETER RANGE?

In practice we have to choose several things.
Minimum and maximum value? We can look at the maximum and minimum eigenvalues of the kernel matrix to get a reasonable range (iterative methods don't need one, though).
Step size? The answer depends on the algorithm: for iterative and projection methods the regularization is intrinsically discrete; for Tikhonov regularization we can take a geometric series $\lambda_i = \lambda_0 q^i$, for some $q > 1$.
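
One possible way to turn this into code (the eigenvalue-based endpoints below are a heuristic assumption, not a rule from the slides):

```python
import numpy as np

def lambda_grid(K, num=25):
    """Geometric grid lambda_i = lambda_0 * q^i between eigenvalue-based endpoints.

    Heuristic endpoints: the extreme (positive) eigenvalues of K divided by n,
    so that lambda * n roughly spans the spectrum of the kernel matrix.
    """
    n = K.shape[0]
    s = np.linalg.eigvalsh(K)
    lam_max = s.max() / n
    lam_min = max(s[s > 1e-12].min() / n, 1e-8 * lam_max)  # guard against a huge range
    q = (lam_max / lam_min) ** (1.0 / (num - 1))
    return lam_min * q ** np.arange(num)   # increasing geometric series

# Example with a random positive semidefinite kernel matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
print(lambda_grid(A @ A.T, num=10))
```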

WHAT ABOUT KERNEL PARAMETERS?

So far we have only talked about $\lambda$ and not about kernel parameters.
Both parameters control the complexity of the solution.
Clearly we can minimize the cross-validation error w.r.t. both parameters.
Often a rough choice of the kernel parameter is acceptable if we eventually fine-tune $\lambda$.
For the Gaussian kernel, a reasonable value of the width $\sigma$ can often be chosen by looking at some statistics of the distances among the input points.
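
One commonly used statistic, assumed here for illustration since the slide does not prescribe a specific one, is the median of the pairwise distances (the "median heuristic"):

```python
import numpy as np

def median_heuristic_sigma(X, max_points=1000, seed=0):
    """Set the Gaussian kernel width to the median pairwise distance of the inputs.

    Subsamples the data first so the cost stays O(max_points^2).
    """
    rng = np.random.default_rng(seed)
    if len(X) > max_points:
        X = X[rng.choice(len(X), size=max_points, replace=False)]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dists = np.sqrt(d2[np.triu_indices(len(X), k=1)])  # distinct pairs only
    return np.median(dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
print("sigma from the median heuristic:", median_heuristic_sigma(X))
```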

WHAT ABOUT KERNEL PARAMETERS? (CONT.)

So far we have only talked about $\lambda$ and not about kernel parameters.
Optimal parameter choices can be defined in theory: they are defined in terms of finite-sample bounds and depend on prior assumptions on the problem.
In practice, heuristics are typically adopted.
In practice, many regularization parameters need to be chosen.
Parameter choice is a HUGE unsolved problem.