STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning. Russ Salakhutdinov, Department of Statistics. rsalakhu@utstat.toronto.edu, http://www.cs.toronto.edu/~rsalakhu/ Lecture 3

Parametric Distributions We want to model the probability distribution of a random variable x given a finite set of observations. We need to determine the parameters of the distribution given the observed data. We will also assume that the data points are i.i.d. We will focus on maximum likelihood estimation. Remember the curve fitting example.

Linear Basis Function Models Remember, the simplest linear model for regression: y(x, w) = w_0 + w_1 x_1 + ... + w_d x_d, where x is a d-dimensional input vector (covariates). Key property: it is a linear function of the parameters. However, it is also a linear function of the input variables. Instead consider: y(x, w) = sum_j w_j phi_j(x), where the phi_j(x) are known as basis functions. Typically phi_0(x) = 1, so that w_0 acts as a bias (or intercept). In the simplest case, we use linear basis functions: phi_j(x) = x_j. Using nonlinear basis functions allows y(x, w) to be a nonlinear function of the input space.

Linear Basis Function Models Polynomial basis functions: phi_j(x) = x^j. These basis functions are global: small changes in x affect all basis functions. Gaussian basis functions: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)). These basis functions are local: small changes in x only affect nearby basis functions. mu_j and s control location and scale (width).
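
To make the two families concrete, here is a minimal NumPy sketch that builds the corresponding design matrices Phi with Phi[n, j] = phi_j(x_n). The one-dimensional inputs, the centers, the width s, and the function names are illustrative assumptions, not values from the slides.

```python
import numpy as np

def polynomial_design(x, degree):
    """Global polynomial basis: phi_j(x) = x**j for j = 0..degree (phi_0 = 1 is the bias)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

def gaussian_design(x, centers, s):
    """Local Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

x = np.linspace(0.0, 1.0, 25)                                 # illustrative 1-D inputs
Phi_poly = polynomial_design(x, degree=3)                     # shape (25, 4)
Phi_gauss = gaussian_design(x, np.linspace(0, 1, 9), s=0.1)   # shape (25, 10)
```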

Linear Basis Function Models Sigmoidal basis functions: phi_j(x) = sigma((x - mu_j)/s), where sigma(a) = 1/(1 + exp(-a)). These basis functions are local: small changes in x only affect nearby basis functions. mu_j and s control location and scale (slope). Decision boundaries will be linear in the feature space, but correspond to nonlinear boundaries in the original input space x. Classes that are linearly separable in the feature space need not be linearly separable in the original input space.

Linear Basis Function Models (Figure: original input space on the left, corresponding feature space using two Gaussian basis functions on the right.) We define two Gaussian basis functions with centers shown by the green crosses, and with contours shown by the green circles. The linear decision boundary (right) is obtained using logistic regression, and corresponds to a nonlinear decision boundary in the input space (left, black curve).

Maximum Likelihood As before, assume the observations come from a deterministic function with additive Gaussian noise: t = y(x, w) + eps. Given the observed input values and the corresponding targets, under the i.i.d. assumption we can write down the likelihood function as a product of Gaussian terms (see the formulas below).
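
The equations on this slide did not survive extraction; in Bishop's notation (with beta the noise precision and phi the basis-function vector), the standard forms of the noise model and the resulting likelihood are:

```latex
% Noise model: target = deterministic function plus zero-mean Gaussian noise
t = y(\mathbf{x}, \mathbf{w}) + \epsilon,
\qquad
p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\bigl(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\bigr)

% Likelihood of N i.i.d. observations
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
  = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr)
```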

Maximum Likelihood Taking the logarithm, we obtain the log-likelihood, whose data-dependent term is (up to scaling) the negative of the sum-of-squares error function. Differentiating with respect to w and setting the gradient to zero yields the maximum likelihood conditions.

Maximum Likelihood Differentiating and setting the gradient to zero, then solving for w, we get w_ML = (Phi^T Phi)^{-1} Phi^T t. Here (Phi^T Phi)^{-1} Phi^T is the Moore-Penrose pseudo-inverse, and Phi, with entries Phi_nj = phi_j(x_n), is known as the design matrix.
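
A minimal NumPy sketch of this solution, assuming a design matrix Phi and target vector t are already available (e.g. built as in the earlier sketch); the function names are illustrative:

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood weights via the Moore-Penrose pseudo-inverse:
    w_ML = pinv(Phi) @ t, equivalent to solving the normal equations."""
    return np.linalg.pinv(Phi) @ t

def predict(Phi_new, w):
    """Predictions y(x, w) = Phi(x) @ w at new inputs."""
    return Phi_new @ w
```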

Geometry of Least Squares Consider an N-dimensional space, so that the target vector t = (t_1, ..., t_N)^T is a vector in that space. Each basis function, evaluated at the N data points, can also be represented as a vector in the same space. If M is less than N, then the M basis functions span a linear subspace S of dimensionality M. Define y = Phi w. The sum-of-squares error is equal to the squared Euclidean distance between y and t (up to a factor of 1/2). The least-squares solution therefore corresponds to the orthogonal projection of t onto the subspace S.

Sequential Learning The training data examples are presented one at a time, and the model parameters are updated after each such presentation (online learning): w^(tau+1) = w^(tau) - eta * grad E_n, where w^(tau+1) are the weights after seeing training case tau+1, eta is the learning rate, and grad E_n is the vector of derivatives of the squared error w.r.t. the weights on the training case presented at time tau. For the sum-of-squares error function, we obtain w^(tau+1) = w^(tau) + eta (t_n - w^(tau)T phi(x_n)) phi(x_n). This is stochastic gradient descent when the training examples are picked at random (the dominant technique when learning with very large datasets). Care must be taken when choosing the learning rate to ensure convergence.
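
A small NumPy sketch of this sequential update for the sum-of-squares error; the learning rate, number of epochs, and function name are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sgd_least_squares(Phi, t, eta=0.01, epochs=50, seed=0):
    """Stochastic gradient descent on the sum-of-squares error.
    Phi: (N, M) design matrix, t: (N,) targets."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for n in rng.permutation(N):       # present training cases in random order
            error = t[n] - Phi[n] @ w      # residual on this training case
            w += eta * error * Phi[n]      # w <- w + eta * (t_n - w^T phi_n) * phi_n
    return w
```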

Regularized Least Squares Let us consider the following error function: data term + regularization term, where lambda is called the regularization coefficient. Using the sum-of-squares error function with a quadratic penalization term, we obtain E(w) = 1/2 sum_n (t_n - w^T phi(x_n))^2 + lambda/2 w^T w, which is minimized by setting w = (lambda I + Phi^T Phi)^{-1} Phi^T t (ridge regression). The solution adds a positive constant to the diagonal of Phi^T Phi. This makes the problem nonsingular, even if Phi^T Phi is not of full rank (e.g. when the number of training examples is less than the number of basis functions).
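
A minimal sketch of the ridge solution in NumPy (the function name is an illustrative choice):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge regression: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.
    The lam added to the diagonal keeps the system nonsingular even when
    Phi^T Phi is rank deficient."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```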

Effect of Regularization The overall error function is the sum of two parabolic bowls. The combined minimum lies on the line between the minimum of the squared error and the origin. The regularizer shrinks the model parameters toward zero.

Other Regularizers Using a more general regularizer of the form (lambda/2) sum_j |w_j|^q, we get a family of penalties: q = 1 gives the lasso, and q = 2 gives the quadratic regularizer.

The Lasso Penalize the absolute value of the weights: E(w) = 1/2 sum_n (t_n - w^T phi(x_n))^2 + lambda sum_j |w_j|. For sufficiently large lambda, some of the coefficients are driven to exactly zero, leading to a sparse model. The above formulation is equivalent to minimizing the unregularized sum-of-squares error subject to a constraint on sum_j |w_j|; the two approaches are related through Lagrange multipliers. The lasso solution is a quadratic programming problem and can be solved efficiently.
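
The slides note that the lasso can be solved as a quadratic program. Purely as an illustration, here is a proximal-gradient (ISTA) sketch of the same objective, a different but common solver; the function names, step size, and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Elementwise soft-thresholding; drives small coefficients to exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def fit_lasso_ista(Phi, t, lam, n_iter=500):
    """ISTA sketch for the lasso objective 1/2 ||t - Phi w||^2 + lam ||w||_1."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)           # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```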

Lasso vs. Quadratic Penalty The lasso tends to generate sparser solutions than a quadratic regularizer (the two are sometimes called the L1 and L2 regularizers).

Statistical Decision Theory We now develop a small amount of theory that provides a framework for many of the models we consider. Suppose we have a real-valued input vector x and a corresponding target (output) value t with joint probability distribution p(x, t). Our goal is to predict the target t given a new value for x: - for regression: t is a real-valued continuous target. - for classification: t is a categorical variable representing class labels. The joint probability distribution provides a complete summary of the uncertainties associated with these random variables. Determining p(x, t) from training data is known as the inference problem.

Example: Classification Medical diagnosis: based on an X-ray image, we would like to determine whether the patient has cancer or not. The input vector x is the set of pixel intensities, and the output variable t represents the presence of cancer (class C1) or the absence of cancer (class C2). Choose t to be binary: t = 0 corresponds to class C1, and t = 1 corresponds to class C2. Inference problem: determine the joint distribution p(x, Ck), or equivalently p(x, t). However, in the end, we must make a decision about whether or not to give treatment to the patient.

Example: Classification Informally: given a new X-ray image, our goal is to decide which of the two classes that image should be assigned to. We could compute the conditional probabilities of the two classes given the input image, using Bayes' rule: p(Ck | x) = p(x | Ck) p(Ck) / p(x), i.e. the posterior probability of Ck given the observed data is proportional to the probability of the observed data given Ck times the prior probability of class Ck. If our goal is to minimize the probability of assigning x to the wrong class, then we should choose the class having the highest posterior probability.
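
A tiny NumPy sketch of this rule; the prior and likelihood numbers below are hypothetical, purely for illustration.

```python
import numpy as np

# Hypothetical numbers: prior probabilities of the two classes and the likelihood
# of the observed image x under each class.
prior = np.array([0.01, 0.99])        # p(C1) = cancer present, p(C2) = cancer absent
likelihood = np.array([0.8, 0.05])    # p(x | C1), p(x | C2)

# Bayes' rule: posterior proportional to likelihood times prior
posterior = likelihood * prior / np.sum(likelihood * prior)
print(posterior)                                      # p(C1 | x), p(C2 | x)
print("assign to class", np.argmax(posterior) + 1)    # highest posterior probability
```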

Minimizing Misclassification Rate Goal: make as few misclassifications as possible. We need a rule that assigns each value of x to one of the available classes: divide the input space into regions Rk (decision regions), such that all points in Rk are assigned to class Ck. In the figure, the red and green regions correspond to inputs that belong to class C2 but are assigned to C1, and the blue region to inputs that belong to class C1 but are assigned to C2.

Minimizing Misclassification Rate Using the product rule p(x, Ck) = p(Ck | x) p(x): to minimize the probability of making a mistake, we assign each x to the class for which the posterior probability p(Ck | x) is largest.

Expected Loss Introduce a loss function: an overall measure of the loss incurred in taking any of the available decisions. Suppose that for x the true class is Ck, but we assign x to class Cj; we then incur a loss of Lkj (the (k, j) element of a loss matrix). Consider the medical diagnosis example, with a loss matrix whose rows correspond to the true class and whose columns to the decision. The expected loss is E[L] = sum_k sum_j integral over Rj of L_kj p(x, Ck) dx, and the goal is to choose the decision regions Rj so as to minimize this expected loss.
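
A short NumPy sketch of the resulting decision rule for a single input; the loss-matrix entries and the posterior values are hypothetical, not numbers from the lecture.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: loss for deciding class j when the truth is class k.
# Index 0 = cancer, 1 = normal; missing a cancer case is assumed far more costly.
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

posterior = np.array([0.3, 0.7])      # assumed p(C_k | x) for the current input

# Expected loss of decision j is sum_k L[k, j] * p(C_k | x); pick the minimizer.
expected_loss = posterior @ L
decision = np.argmin(expected_loss)
print(expected_loss, "-> decide class", decision)
```

With these illustrative numbers the rule decides "cancer" even though "normal" has the higher posterior, because the loss of missing a cancer case dominates.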

Reject Option

Regression Let x in R^d denote a real-valued input vector, and t in R a real-valued random target (output) variable, with joint distribution p(x, t). The decision step consists of finding an estimate y(x) of t for each input x. As in the classification case, to quantify what it means to do well or poorly on a task we need to define a loss (error) function L(t, y(x)). The average, or expected, loss is given by E[L] = integral of L(t, y(x)) p(x, t) dx dt.

Squared Loss Function If we use the squared loss, we obtain E[L] = integral of (y(x) - t)^2 p(x, t) dx dt. Our goal is to choose y(x) so as to minimize the expected squared loss. The optimal solution (if we assume a completely flexible function) is the conditional average y(x) = E[t | x]: the regression function y(x) that minimizes the expected squared loss is given by the mean of the conditional distribution p(t | x).

Squared Loss Function Plugging y(x) into the expected loss and expanding, the expected loss splits into a term that is minimized when y(x) = E[t | x] and a second term measuring the intrinsic variability of the target values. Because the latter is independent noise, it represents an irreducible minimum value of the expected loss.

Other Loss Functions A simple generalization of the squared loss is the Minkowski loss (written out below). Its minimum is given by: - the conditional mean for q = 2, - the conditional median for q = 1, and - the conditional mode for q -> 0.
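
The slide's formula was lost in extraction; the standard form of the expected Minkowski loss (Bishop's notation) is:

```latex
\mathbb{E}[L_q] = \iint \lvert y(\mathbf{x}) - t \rvert^{q}\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt
```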

Bias-Variance Decomposition Introducing a regularization term can control overfitting, but how can we determine a suitable value of the regularization coefficient? Let us examine the expected squared loss function. Remember that the optimal prediction is given by the conditional expectation h(x) = E[t | x], and that the intrinsic variability of the target values sets the minimum achievable value of the expected loss. If we model h(x) using a parametric function y(x, w), then from a Bayesian perspective the uncertainty in our model is expressed through the posterior distribution over the parameters w. We first look at the frequentist perspective.

Bias-Variance Decomposition From a frequentist perspective, we make a point estimate of w* based on the data set D. We then interpret the uncertainty of this estimate through the following thought experiment: - Suppose we had a large number of datasets, each of size N, where each dataset is drawn independently from p(t, x). - For each dataset D, we can obtain a prediction function y(x; D). - Different data sets will give different prediction functions. - The performance of a particular learning algorithm is then assessed by taking the average over the ensemble of data sets. Let us consider the squared error (y(x; D) - h(x))^2. Note that this quantity depends on the particular dataset D.

Bias-Variance Decomposition Consider the squared error (y(x; D) - h(x))^2. Adding and subtracting the term E_D[y(x; D)] and expanding, then taking the expectation over D, the cross term vanishes, and we obtain the decomposition below.
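
The slide's equations were lost in extraction; the standard decomposition that this step yields (Bishop, Sec. 3.2) is:

```latex
\mathbb{E}_{D}\!\left[\bigl(y(\mathbf{x}; D) - h(\mathbf{x})\bigr)^{2}\right]
 = \underbrace{\bigl(\mathbb{E}_{D}[y(\mathbf{x}; D)] - h(\mathbf{x})\bigr)^{2}}_{(\text{bias})^{2}}
 + \underbrace{\mathbb{E}_{D}\!\left[\bigl(y(\mathbf{x}; D) - \mathbb{E}_{D}[y(\mathbf{x}; D)]\bigr)^{2}\right]}_{\text{variance}}
```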

Bias-Variance Trade-off The (squared) bias measures how much the average prediction over all datasets differs from the optimal regression function. The variance measures how much the solutions for individual datasets vary around their average, i.e. how sensitive the learned function is to the particular choice of dataset. The noise term is the intrinsic variability of the target values. There is a trade-off between bias and variance, with very flexible models (high complexity) having low bias and high variance, and relatively rigid models (low complexity) having high bias and low variance. The model with the best predictive capability is the one that strikes the best balance between bias and variance.

Bias-Variance Trade-off Consider the sinusoidal dataset. We generate 100 datasets, each containing N = 25 points, drawn independently from the sinusoidal function. The datasets are indexed by l = 1, ..., L, where L = 100. For each dataset we fit a model with 24 Gaussian basis functions by minimizing the regularized error function. Once a model is fit, we can make predictions y^(l)(x) for each of the L datasets.

Bias-Variance Trade-off (Figure: individual fits to the 100 sinusoidal datasets and their average over 100 fits, ranging from high variance with low bias to low variance with high bias as the regularization is varied.)

Bias-Variance Trade-off Consider the sinusoidal dataset again (100 datasets of N = 25 points). Note that averaging many solutions of the complex model (high variance, low bias) gives a very good fit to the regression function, so averaging may be a beneficial procedure. Let us examine the bias-variance trade-off quantitatively.

Bias-Variance Trade-off For the sinusoidal dataset (100 datasets of N = 25 points), the average prediction is estimated as y_bar(x) = (1/L) sum_l y^(l)(x). The integrated squared bias and variance are given by (bias)^2 = (1/N) sum_n (y_bar(x_n) - h(x_n))^2 and variance = (1/N) sum_n (1/L) sum_l (y^(l)(x_n) - y_bar(x_n))^2, where the integral over x weighted by the distribution p(x) is approximated by a finite sum over data points drawn from that distribution.
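
A NumPy sketch of this finite-sum estimate, assuming the L fits have already been evaluated at the N test points (the array layout and function name are illustrative):

```python
import numpy as np

def bias_variance(preds, h):
    """Estimate integrated (bias)^2 and variance from an ensemble of fits.
    preds: (L, N) array, preds[l, n] = y^(l)(x_n) for dataset l at point x_n.
    h:     (N,) array of the true regression function h(x_n)."""
    y_bar = preds.mean(axis=0)                # average prediction over the L datasets
    bias_sq = np.mean((y_bar - h) ** 2)       # (1/N) sum_n (y_bar(x_n) - h(x_n))^2
    variance = np.mean((preds - y_bar) ** 2)  # (1/N) sum_n (1/L) sum_l (y^(l)(x_n) - y_bar(x_n))^2
    return bias_sq, variance
```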

Bias-Variance Trade-off From these plots, note that an over-regularized model (large lambda) has high bias, while an under-regularized model (small lambda) has high variance.

Beating the Bias-Variance Trade-off We can reduce the variance by averaging over many models trained on different datasets: - In practice, we only have a single observed dataset. If we had many independent training sets, we would be better off combining them into one large training dataset: with more data, we have less variance. - Given a standard training set D of size N, we can generate new training sets, also of size N, by sampling examples from D uniformly and with replacement. This is called bagging and works quite well in practice (see the sketch below). Given enough computation, we would be better off resorting to the Bayesian framework (which we will discuss next): - Combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
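
A minimal bagging sketch in NumPy, using ridge regression as the base model purely for illustration; the number of models, the regularization coefficient, and the function name are assumptions.

```python
import numpy as np

def bagged_predictions(Phi, t, Phi_test, n_models=50, lam=0.1, seed=0):
    """Bagging: train models on bootstrap resamples of the training set
    (sampled uniformly with replacement) and average their predictions."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, N, size=N)   # bootstrap resample of size N
        w = np.linalg.solve(lam * np.eye(M) + Phi[idx].T @ Phi[idx], Phi[idx].T @ t[idx])
        preds.append(Phi_test @ w)
    return np.mean(preds, axis=0)          # averaging over models reduces variance
```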