FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu


Machine Learning: Frequentist vs. Bayesian. In the frequentist setting, we seek a fixed parameter (vector), with value(s) determined by an estimator (ML, MAP, etc.). Analysis (e.g. the bias-variance decomposition) is carried out by computing expectations over all possible datasets. Empirically, uncertainty (error bars) for the estimate is obtained by approximating the distribution of possible datasets, e.g. by bootstrap sampling the original, observed data with replacement. In the Bayesian perspective, there is a single dataset, the one actually observed, and uncertainty is expressed through a probability distribution over all model parameters and variables of interest. In this lecture we analyze linear models from both a frequentist and a Bayesian perspective.

Linear Models. It is mathematically easy to fit linear models to data, and we can gain insight by analyzing this relatively simple case. We already studied the polynomial curve fitting problem in the previous lectures. There are many ways to make linear models more powerful while retaining their attractive mathematical properties. By using non-linear, non-adaptive basis functions, we obtain generalized linear models: these offer non-linear mappings from inputs to outputs but are linear in their parameters, and typically only the linear part of the model is learnt. By using kernel methods we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions. By using large-margin loss functions, we can avoid overfitting even when we use huge numbers of basis functions. (adapted from MLG@Toronto)

The Loss Function. Once a model class is chosen, fitting the model to data is typically done by finding the parameter values that minimize a loss function. There are many possible loss functions; what criterion should we use in selecting one? A loss that makes analytic calculations easy (squared error); a loss that makes the fitting correspond to maximizing the likelihood of the training data, given some noise model for the observed outputs; a loss that makes it easy to interpret the learned coefficients (easy if most are zero); or a loss that corresponds to the real loss in a practical application (real losses are often asymmetric). (adapted from MLG@Toronto)

Linear Basis Function Models (1). Consider polynomial curve fitting, $y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$. The model is linear in the parameters $\mathbf{w}$ and non-linear in the input $x$. (Slides adapted from the textbook, Bishop.)

Linear Basis Function Models (2). Generally, $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, where the $\phi_j(\mathbf{x})$ are known as basis functions. Typically $\phi_0(\mathbf{x}) = 1$ (dummy basis function), so that $w_0$ acts as a bias (it allows for any fixed offset in the data; do not confuse with statistical bias). In the simplest case, we use linear basis functions: $\phi_j(\mathbf{x}) = x_j$.

Linear Basis Function Models (3). Polynomial basis functions: $\phi_j(x) = x^j$. These are global; a small change in $x$ affects all basis functions. The issue can be resolved by dividing the input space into regions and fitting a different polynomial in each region, which leads to spline functions.

Linear Basis Function Models (4). Gaussian basis functions: $\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$. These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Linear Basis Function Models (5). Sigmoidal basis functions: $\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1+\exp(-a)}$. These too are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).
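As a concrete illustration of these three basis families, here is a minimal sketch (not part of the slides); the helper names, centre locations, and widths are illustrative choices:

```python
# Sketch: building design matrices for the three basis families discussed above.
import numpy as np

def polynomial_basis(x, M):
    """Columns x^0, x^1, ..., x^(M-1); global basis functions."""
    return np.vander(x, M, increasing=True)            # shape (N, M)

def gaussian_basis(x, centers, s):
    """exp(-(x - mu_j)^2 / (2 s^2)); local, with location mu_j and width s."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * s**2))

def sigmoidal_basis(x, centers, s):
    """sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1.0, 1.0, 50)
mu = np.linspace(-1.0, 1.0, 9)                          # 9 basis centres (illustrative)
Phi_poly = polynomial_basis(x, 4)
Phi_gauss = np.column_stack([np.ones_like(x), gaussian_basis(x, mu, s=0.2)])
Phi_sig = np.column_stack([np.ones_like(x), sigmoidal_basis(x, mu, s=0.2)])
print(Phi_poly.shape, Phi_gauss.shape, Phi_sig.shape)   # (50, 4) (50, 10) (50, 10)
```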

Other Basis Function Models (6). In a Fourier representation, each basis function represents a given frequency and has infinite spatial extent. Wavelets are localized in both space and frequency, and are by construction mutually orthogonal.

Maximum Likelihood and Least Squares (1). Assume observations from a deterministic function with added Gaussian noise, $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, where $p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})$. This is the same as saying $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}) = \left(\frac{\beta}{2\pi}\right)^{1/2} \exp\left(-\frac{\beta}{2}\left(t - y(\mathbf{x}, \mathbf{w})\right)^2\right)$. Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{t} = (t_1, \dots, t_N)^T$, we obtain the likelihood function $p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$.

Maximum Likelihood and Least Squares (2). Taking the logarithm, we get $\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$, where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$ is the sum-of-squares error.

Maximum Likelihood and Least Squares (3). Computing the gradient and setting it to zero yields $\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T = 0$. Solving for $\mathbf{w}$, we get the normal equations $\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T \mathbf{t} = \boldsymbol{\Phi}^{\dagger}\mathbf{t}$, where $\boldsymbol{\Phi}^{\dagger}$ is the Moore-Penrose pseudo-inverse of the design matrix $\boldsymbol{\Phi}$ ($N \times M$, with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$).
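A minimal numerical illustration of the normal equations, assuming a Gaussian basis as above and synthetic sinusoidal data (all settings are illustrative); np.linalg.pinv gives the Moore-Penrose pseudo-inverse, while np.linalg.lstsq is the numerically preferred route to the same solution:

```python
# Sketch: maximum-likelihood (least-squares) weights via the pseudo-inverse.
import numpy as np

rng = np.random.default_rng(0)
N, beta_true = 30, 25.0                        # noise precision 25 -> std 0.2
x = rng.uniform(-1.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta_true ** -0.5, N)

mu = np.linspace(-1.0, 1.0, 9)                 # 9 Gaussian basis functions + bias
Phi = np.column_stack([np.ones(N),
                       np.exp(-(x[:, None] - mu[None, :])**2 / (2 * 0.2**2))])

w_ml = np.linalg.pinv(Phi) @ t                 # w_ML = (Phi^T Phi)^-1 Phi^T t
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # same solution, more stable
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)           # 1/beta_ML = residual variance
print(np.allclose(w_ml, w_ml_lstsq), beta_ml)
```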

Maximum Likelihood and Least Squares (4). Making the bias parameter explicit, $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n)\right)^2$. Maximizing with respect to the bias $w_0$, we get $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, with $\bar{t} = \frac{1}{N}\sum_n t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_n \phi_j(\mathbf{x}_n)$: the bias compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values. We can also maximize with respect to $\beta$, giving $\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$: the inverse noise precision is given by the residual variance of the targets around the regression function.

Geometry of Least Squares. Consider a space with one axis for each training case. The vector of target values, $\mathbf{t}$, is a point in this space. Each vector collecting the values of one input component (basis function) across the training cases is also a point in this space. These vectors span a subspace $\mathcal{S}$, and any weighted sum of them must lie in $\mathcal{S}$. The optimal solution $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}_{ML}$ is the orthogonal projection of the vector of target values $\mathbf{t}$ onto $\mathcal{S}$.

Sequential Learning. Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\left(t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)$. This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose the learning rate $\eta$?
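A sketch of the LMS update, assuming a design matrix Phi and targets t such as those built in the previous sketch (the learning-rate and epoch settings are illustrative, not from the slides):

```python
# Sketch of the LMS / sequential update  w <- w + eta * (t_n - w^T phi_n) * phi_n.
import numpy as np

def lms(Phi, t, eta=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):       # one pattern at a time
            w += eta * (t[n] - w @ Phi[n]) * Phi[n]
    return w

# Usage (with Phi, t from the previous sketch):
# w_lms = lms(Phi, t)   # compare with the batch solution w_ml above
```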

Regularized Least Squares (1). Consider the error function $E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$ (data term + regularization term). With the sum-of-squares error function and a quadratic regularizer, we get $\frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$, which is minimized by $\mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$. Here $\lambda$ is called the regularization coefficient.
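A sketch of the closed-form regularized (ridge) solution, solving the linear system rather than forming an explicit inverse; the toy data and λ values are illustrative:

```python
# Sketch: quadratic-regularized least squares, w = (lambda*I + Phi^T Phi)^-1 Phi^T t.
import numpy as np

def ridge_fit(Phi, t, lam):
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)        # more stable than inverting A

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
Phi = np.vander(x, 10, increasing=True)         # degree-9 polynomial basis
for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(Phi, t, lam)
    print(lam, np.linalg.norm(w))               # larger lambda shrinks the weights
```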

Regularized Least Squares (2). With a more general regularizer, we have $\frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$. The case $q = 1$ is known as the lasso; $q = 2$ gives the quadratic regularizer.

Effect of a Quadratic Regularizer. The overall cost function is the sum of two quadratic functions, so the sum is also quadratic. The combined minimum lies on the line between the minimum of the squared error and the origin. The regularizer shrinks the weights, but does not make them exactly zero.

Lasso: penalizing the absolute values of the weights. Finding the minimum requires quadratic programming, but it is still unique because the cost function is convex: a sum of two convex functions, a bowl plus an inverted pyramid. As $\lambda$ is increased, many of the weights go exactly to zero. This is good for interpretation, e.g. identifying the factors causing a given disease when the input vectors contain many measurement components (temperature, blood pressure, sugar level, etc.). It is also good for preventing overfitting.
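The slides note that finding the lasso minimum requires quadratic programming. As an alternative illustration (not the method prescribed by the slides), the sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which also drives many weights exactly to zero; all data and settings are illustrative:

```python
# Sketch: lasso via proximal gradient (ISTA). Each step takes a gradient step on
# the squared error, then soft-thresholds the weights; many become exactly zero.
import numpy as np

def lasso_ista(Phi, t, lam, n_iter=5000):
    L = np.linalg.eigvalsh(Phi.T @ Phi).max()            # Lipschitz constant of the gradient
    step = 1.0 / L
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return w

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                            # only 3 relevant features
t = Phi @ w_true + rng.normal(0, 0.1, 50)
w_hat = lasso_ista(Phi, t, lam=5.0)
print(np.sum(np.abs(w_hat) > 1e-6), "non-zero weights")  # sparse solution
```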

Penalty over squared weights vs. Lasso. The lasso tends to generate sparser solutions than a quadratic regularizer. Notice that in the lasso case the optimum sits at a corner of the constraint region, where $w_1^{\star} = 0$.

1-d loss functions $|y - t|^q$ with different powers $q$: $q = 2$ corresponds to the negative log of a Gaussian, $q = 1$ to the negative log of a Laplacian.

Multiple Outputs. If there are multiple outputs we can often treat the learning problem as a set of independent problems, one per output. This approach is suboptimal if the output noise is correlated and changes across datapoints. Even though the problems are independent, we can save work: the pseudo-inverse of the design matrix only needs to be computed once and is shared by all outputs.

Multiple Outputs (1). Analogously to the single-output case we have $p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I})$. Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^T$, we obtain the log-likelihood function $\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$.

Multiple Outputs (2). Maximizing with respect to $\mathbf{W}$, we obtain $\mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T}$. If we consider a single target variable $t_k$, we see that $\mathbf{w}_k = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k$, which is identical to the single-output case.
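A small sketch of the point that the pseudo-inverse is shared across outputs (design matrix and dimensions are illustrative):

```python
# Sketch: with K outputs, the same pseudo-inverse of the design matrix is shared
# by all output columns, so the K regressions cost little more than one.
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 40, 6, 3
Phi = rng.normal(size=(N, M))                  # illustrative design matrix
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + 0.1 * rng.normal(size=(N, K))

Phi_pinv = np.linalg.pinv(Phi)                 # computed once
W_ml = Phi_pinv @ T                            # all K outputs solved together
w_k = Phi_pinv @ T[:, 0]                       # identical to the single-output case
print(np.allclose(W_ml[:, 0], w_k))            # True
```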

Decision Theory for Regression (L2). Inference step: determine $p(\mathbf{x}, t)$. Decision step: for a given $\mathbf{x}$, make an optimal prediction $y(\mathbf{x})$ for $t$. Loss function: $\mathbb{E}[L] = \iint L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.

The Squared Loss Function (L2). Substituting $L(t, y(\mathbf{x})) = (y(\mathbf{x}) - t)^2$ and minimizing with respect to $y(\mathbf{x})$ gives the regression function $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$: the optimal least-squares predictor is the conditional average. Expanding and integrating over $t$, $\mathbb{E}[L] = \int \left(y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\right)^2 p(\mathbf{x})\, d\mathbf{x} + \int \mathrm{var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x}$. The first term gives the prediction error relative to the conditional mean. The second term is independent of $y(\mathbf{x})$; it measures the intrinsic variability of the target data and hence represents the irreducible minimum value of the loss.

The Bias-Variance Decomposition (1). The expected squared loss is $\mathbb{E}[L] = \int \left(y(\mathbf{x}) - h(\mathbf{x})\right)^2 p(\mathbf{x})\, d\mathbf{x} + \iint \left(h(\mathbf{x}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$, where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt$. The second term corresponds to the noise inherent in the random variable $t$. What about the first term?

The Bias-Variance Decomposition (2). Suppose we were given multiple data sets, each of size $N$. Any particular data set $\mathcal{D}$ will give a particular function $y(\mathbf{x}; \mathcal{D})$. We then have $\left(y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\right)^2 = \left\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] + \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\right\}^2$.

The Bias-Variance Decomposition (3). Taking the expectation over $\mathcal{D}$ yields $\mathbb{E}_{\mathcal{D}}\left[\left(y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\right)^2\right] = \left(\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\right)^2 + \mathbb{E}_{\mathcal{D}}\left[\left(y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\right)^2\right]$.

The Bias-Variance Decomposition (4). Thus we can write expected loss $= (\text{bias})^2 + \text{variance} + \text{noise}$, where $(\text{bias})^2 = \int \left(\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\right)^2 p(\mathbf{x})\, d\mathbf{x}$, $\text{variance} = \int \mathbb{E}_{\mathcal{D}}\left[\left(y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\right)^2\right] p(\mathbf{x})\, d\mathbf{x}$, and $\text{noise} = \iint \left(h(\mathbf{x}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.
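An empirical version of this decomposition, in the spirit of the sinusoidal example on the following slides: fit regularized models to many independently drawn datasets and average. Dataset sizes, basis settings, and λ values are illustrative choices:

```python
# Sketch: empirical bias^2 / variance of regularized least squares, estimated
# from many independent datasets drawn from a noisy sinusoid.
import numpy as np

rng = np.random.default_rng(4)
x_test = np.linspace(0.0, 1.0, 100)
h = np.sin(2 * np.pi * x_test)                       # true function h(x)
mu, s = np.linspace(0.0, 1.0, 24), 0.1               # Gaussian basis centres / width

def design(x):
    return np.column_stack([np.ones_like(x),
                            np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))])

def fit_predict(lam, n_sets=100, N=25, noise=0.3):
    preds = np.empty((n_sets, x_test.size))
    for d in range(n_sets):                          # one model per dataset D
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[d] = design(x_test) @ w
    return preds

for lam in (10.0, 0.1, 1e-4):
    y = fit_predict(lam)
    bias2 = np.mean((y.mean(axis=0) - h) ** 2)       # (E_D[y] - h)^2 averaged over x
    var = np.mean(y.var(axis=0))                     # E_D[(y - E_D[y])^2] averaged over x
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}")
```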

The Bias-Variance Decomposition (5). Example: 25 data sets from the sinusoidal function, varying the degree of regularization $\lambda$ (figure: individual fits and their average).

The Bias-Variance Decomposition (6). Example: 25 data sets from the sinusoidal function, varying the degree of regularization $\lambda$ (figure).

The Bias-Variance Decomposition (7). Example: 25 data sets from the sinusoidal function, varying the degree of regularization $\lambda$ (figure).

The Bias-Variance Trade-off. From these plots, we note that an over-regularized model (large $\lambda$) will have a high bias, while an under-regularized model (small $\lambda$) will have a high variance.

Recall: The Gaussian Distribution (L2). $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\}$. We can also parameterize in terms of the precision matrix, $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$.

Linear Gaussian Models (L2). Given $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ and $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$, we have $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T)$ and $p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^T\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma}\right)$, where $\boldsymbol{\Sigma} = \left(\boldsymbol{\Lambda} + \mathbf{A}^T\mathbf{L}\mathbf{A}\right)^{-1}$. There are many applications of linear Gaussian models: time series models, linear dynamical systems (Kalman filtering), Bayesian modeling.

Predictive Distribution

Bayesian Regression (1). Place a Gaussian prior over the weights. The posterior mode $\mathbf{w}_{MAP}$ is then determined by minimizing the regularized sum-of-squares error, with regularization coefficient corresponding to $\lambda = \alpha / \beta$.

Bayesian Regression (2). In compact form, for the prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$, the posterior is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t}$ and $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\, \boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
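A sketch of this posterior computation for the straight-line example used on the following slides (the values of α and β and the data-generation settings are illustrative assumptions):

```python
# Sketch: posterior over the weights for the zero-mean isotropic Gaussian prior,
#   m_N = beta * S_N * Phi^T t,   S_N^{-1} = alpha*I + beta*Phi^T Phi.
import numpy as np

def posterior(Phi, t, alpha, beta):
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 20)      # straight-line data, as in the example
Phi = np.column_stack([np.ones_like(x), x])        # y(x, w) = w0 + w1*x
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0) # alpha value is an illustrative choice
print(m_N)                                         # close to the ground truth (-0.3, 0.5)
```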

Bayesian Linear Regression (3). 0 data points observed. Prior and data space shown; red lines are samples from the model prior. Setup: fit $y(x, \mathbf{w}) = w_0 + w_1 x$, with $\alpha$ and $\beta$ assumed known ($\beta = 25$). Ground truth: $t = a_0 + a_1 x + \epsilon$, with $a_0 = -0.3$, $a_1 = 0.5$, $x \sim U(-1, 1)$, $\epsilon \sim \mathcal{N}(0, 0.2^2)$.

Bayesian Linear Regression (4). 1 data point observed. Likelihood, posterior, and data space shown; red lines are samples from the model posterior. Same setup as above.

Bayesian Linear Regression (5). 2 data points observed (likelihood shown for the second point). Likelihood, posterior, and data space shown; red lines are samples from the model posterior. Same setup as above.

Bayesian Linear Regression (6). 20 data points observed (likelihood shown for the last point). Likelihood, posterior, and data space shown; red lines are samples from the model posterior. Same setup as above.

Predictive Distribution (1). Predict $t$ for new values of $\mathbf{x}$ by integrating over $\mathbf{w}$: $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\left(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})\right)$, where $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$. The first term is the noise in the data; the second is the uncertainty in the parameters. One can prove that $\sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x})$, and that the parameter-uncertainty term goes to $0$ as $N \to \infty$.
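A sketch of the predictive mean and variance under the formulas above, reusing the hypothetical posterior() helper from the previous sketch:

```python
# Sketch: Bayesian predictive mean and variance,
#   mean = m_N^T phi(x),   var = 1/beta + phi(x)^T S_N phi(x).
import numpy as np

def predictive(Phi_new, m_N, S_N, beta):
    mean = Phi_new @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
    return mean, var

# Usage (with m_N, S_N from the previous sketch):
# x_grid = np.linspace(-1, 1, 100)
# Phi_grid = np.column_stack([np.ones_like(x_grid), x_grid])
# mean, var = predictive(Phi_grid, m_N, S_N, beta=25.0)
# Error bars: mean +/- np.sqrt(var); the variance shrinks toward 1/beta as N grows.
```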

Predictive Distribution (2) Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point

Predictive Distribution (3) Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points

Predictive Distribution (4) Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points

Predictive Distribution (5) Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points

Equivalent Kernel (1). The predictive mean can be written $y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}) = \beta\, \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t} = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n$, where $k(\mathbf{x}, \mathbf{x}') = \beta\, \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}')$ is known as the equivalent kernel (the smoother matrix, when evaluated at the training inputs). The predictive mean is thus a weighted sum of the training-data target values $t_n$.
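A sketch that forms the equivalent kernel explicitly and checks that applying the smoother matrix to the training targets reproduces the predictive mean (basis, α, and β are illustrative):

```python
# Sketch: the equivalent kernel k(x, x_n) = beta * phi(x)^T S_N phi(x_n).
import numpy as np

rng = np.random.default_rng(6)
x_train = np.sort(rng.uniform(0.0, 1.0, 50))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 50)
mu, s, alpha, beta = np.linspace(0.0, 1.0, 12), 0.1, 2.0, 25.0

def design(x):
    return np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))

Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(len(mu)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

K = beta * design(x_train) @ S_N @ Phi.T        # smoother matrix, entries k(x_i, x_n)
print(np.allclose(K @ t_train, Phi @ m_N))      # predictive mean = weighted sum of targets
print(K.sum(axis=1)[:5])                        # row sums approximately 1 near the data
```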

Equivalent Kernel (2). The weight of $t_n$ depends on the distance between $\mathbf{x}$ and $\mathbf{x}_n$; nearby $\mathbf{x}_n$ carry more weight.

Bayesian Model Comparison (1). How do we choose the right model? Assume we want to compare models $\mathcal{M}_i$, $i = 1, \dots, L$, using data $\mathcal{D}$; this requires computing $p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)$ (posterior $\propto$ evidence $\times$ prior). $p(\mathcal{D} \mid \mathcal{M}_i)$ is the model evidence, or marginal likelihood. The Bayes factor is the ratio of evidences for two models, $p(\mathcal{D} \mid \mathcal{M}_i) / p(\mathcal{D} \mid \mathcal{M}_j)$.

Bayesian Model Comparison (2). Having computed $p(\mathcal{M}_i \mid \mathcal{D})$, we can compute the predictive (mixture) distribution $p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$. A simpler approximation, known as model selection, is to use only the model with the highest evidence.

Bayesian Model Comparison (3). For a model with parameters $\mathbf{w}$, we get the model evidence by marginalizing over $\mathbf{w}$: $p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}$. Note that the evidence is the normalizing constant of the parameter posterior, $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i) = \frac{p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)}{p(\mathcal{D} \mid \mathcal{M}_i)}$.

Bayesian Model Comparison (4). For a given model with a single parameter $w$, consider the approximation $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw \simeq p(\mathcal{D} \mid w_{MAP})\, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$, where the posterior is assumed to be sharply peaked with width $\Delta w_{\text{posterior}}$ and the prior is flat with width $\Delta w_{\text{prior}}$.

Bayesian Model Comparison (5). Taking logarithms, we obtain $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid w_{MAP}) + \ln\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)$; the second term is negative. With $M$ parameters, all assumed to have the same ratio $\Delta w_{\text{posterior}} / \Delta w_{\text{prior}}$, we get $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid \mathbf{w}_{MAP}) + M \ln\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)$; the penalty term is negative and linear in $M$.

Bayesian Model Comparison (6). Matching data and model complexity: a simple model (e.g. $\mathcal{M}_1$) has little variability, so it will generate datasets that are fairly similar to each other, with a distribution $p(\mathcal{D})$ confined to a narrow region of the data axis. Complex models (e.g. $\mathcal{M}_3$) can generate a large variety of datasets, and therefore assign a relatively low probability to any one of them.

The Evidence Approximation (1). The full Bayesian predictive distribution is given by $p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$, but this integral is intractable. Approximate it with $p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$, where $(\hat{\alpha}, \hat{\beta})$ is the mode of $p(\alpha, \beta \mid \mathbf{t})$, which is assumed to be sharply peaked. This is also known as empirical Bayes, type II or generalized maximum likelihood, or the evidence approximation.

The Evidence Approximation (2). From Bayes' theorem we have $p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)$, and if we assume the prior $p(\alpha, \beta)$ to be flat, we see that maximizing the posterior amounts to maximizing the marginal likelihood $p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}$. General results for Gaussian integrals give $\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln\left|\mathbf{S}_N^{-1}\right| - \frac{N}{2}\ln(2\pi)$, where $E(\mathbf{m}_N) = \frac{\beta}{2}\left\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$.
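A direct implementation of this evidence formula (as reconstructed above; α, β, and the data are illustrative) can be used to compare polynomial degrees by marginal likelihood, as in the example on the next slide:

```python
# Sketch: log marginal likelihood (evidence) for fixed alpha, beta,
#   ln p(t|a,b) = M/2 ln a + N/2 ln b - E(m_N) - 1/2 ln|S_N^-1| - N/2 ln(2*pi),
#   E(m_N) = beta/2 ||t - Phi m_N||^2 + alpha/2 m_N^T m_N.
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi           # A = S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    sign, logdet = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi))

# Compare polynomial degrees on noisy sinusoidal data (illustrative settings):
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)
for M in range(1, 10):
    Phi = np.vander(x, M + 1, increasing=True)           # degree-M polynomial
    print(M, round(log_evidence(Phi, t, alpha=5e-3, beta=25.0), 2))
```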

The Evidence Approximation (3). Example: sinusoidal data fit with polynomials of degree $M$; the plot shows the model evidence as a function of $M$ (figure).

Readings: Bishop, Ch. 3, Sections 3.1-3.4 (optional: 3.5).