Machine Learning. Topic 4: Linear Regression Models


Machine Learning Topic 4: Linear Regression Models (contains ideas and a few images from Wikipedia and books by Alpaydin, Duda/Hart/Stork, and Bishop. Updated Fall 2015)

Regression Learning Task. There is a set of possible examples X = {x_1, ..., x_n}. Each example is a vector of k real-valued attributes: x_i = <x_{i,1}, ..., x_{i,k}>. There is a target function f : X → Y that maps each example onto some real value. The DATA is a set of tuples <example, response value>: {<x_1, y_1>, ..., <x_n, y_n>}. Find a hypothesis h such that, for all x, h(x) ≈ f(x). (Mark Cartwright and Bryan Pardo, EECS 349 Machine Learning, Fall 2012)

Why use a linear regression model? Easily understood. Interpretable. Well studied by statisticians, with many variations and diagnostic measures. Computationally efficient.

Linear Regression Model. Assumption: the observed response (dependent) variable, y, is the true function, f(x), plus additive Gaussian noise, ε, with zero mean. Observed response: y = f(x) + ε, where ε is drawn from a zero-mean Gaussian. Assumption: the expected value of the response variable y is a linear combination of the k independent attributes/features.

The Hypothesis Space. Given the assumptions on the previous slide, our hypothesis space is the set of linear functions (hyperplanes): h(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k (w_0 is the offset from the origin; you always need w_0). The goal is to learn a (k+1)-dimensional vector of weights w = <w_0, w_1, ..., w_k> that defines a hyperplane minimizing an error criterion.

Simple Linear Regression. x has one attribute (predictor variable), and the hypothesis function is a line: ŷ = h(x) = w_0 + w_1 x. Example (figure: data points and the fitted line, y versus x).

The Error Criterion. We typically estimate parameters by minimizing the residual sum of squares (RSS), also known as the sum of squared errors (SSE): RSS = Σ_{i=1}^{n} (y_i - h(x_i))^2, where n is the number of training examples and y_i - h(x_i) is the residual for example i.
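
A minimal sketch of this criterion in Python/NumPy; the toy data and the hand-picked weights here are illustrative assumptions, not values from the slides.

    import numpy as np

    # Toy one-dimensional data (made-up values for illustration).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # A candidate hypothesis h(x) = w0 + w1*x with hand-picked weights.
    w0, w1 = 1.0, 2.0
    predictions = w0 + w1 * x

    # Residual sum of squares: sum over examples of the squared residuals.
    residuals = y - predictions
    rss = np.sum(residuals ** 2)
    print(rss)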

Multiple (Multivariate*) Linear Regression. Many attributes x_1, ..., x_k; the hypothesis function h(x) is a hyperplane: h(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k. *NOTE: in the statistical literature, multivariate linear regression is regression with multiple outputs; the case of multiple input variables is simply multiple linear regression.

Formatting the data. Create a new 0th dimension with value 1 and append it to the beginning of every example vector; this placeholder corresponds to the offset w_0: x_i = <1, x_{i,1}, x_{i,2}, ..., x_{i,k}>. Format the data as a matrix of examples X and a vector of response values y: X = [[1, x_{1,1}, ..., x_{1,k}], [1, x_{2,1}, ..., x_{2,k}], ..., [1, x_{n,1}, ..., x_{n,k}]] (each row is one training example), and y = [y_1, y_2, ..., y_n]^T.
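
A sketch of this formatting step in NumPy; the array contents are illustrative assumptions.

    import numpy as np

    # n = 4 examples with k = 2 attributes each (made-up values).
    X_raw = np.array([[2.0, 3.0],
                      [1.0, 5.0],
                      [4.0, 2.0],
                      [3.0, 3.0]])
    y = np.array([10.0, 12.0, 11.0, 12.5])

    # Prepend a column of 1s so the first weight acts as the offset w_0.
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    print(X.shape)  # (4, 3): n rows, k+1 columns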

There is a closed-form solution! Our goal is to find the weights of the function h(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k that minimizes the sum of squared residuals RSS = Σ_{i=1}^{n} (y_i - h(x_i))^2. It turns out that there is a closed-form solution to this problem: w = (X^T X)^{-1} X^T y. Just plug your training data into the formula above and the best hyperplane comes out!
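
A minimal NumPy sketch of the closed-form fit on one attribute plus the offset column; the data values are illustrative assumptions, and the normal equations are solved with np.linalg.solve rather than forming an explicit inverse.

    import numpy as np

    # Illustrative data: one attribute x, plus a column of 1s for the offset.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    X = np.column_stack([np.ones_like(x), x])

    # w = (X^T X)^{-1} X^T y, computed by solving the normal equations.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)      # [w_0, w_1]
    print(X @ w)  # fitted values h(x_i)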

RSS in vector/matrix notation: RSS(w) = Σ_{i=1}^{n} (y_i - h(x_i))^2 = Σ_{i=1}^{n} (y_i - w_0 - Σ_{j=1}^{k} x_{i,j} w_j)^2 = (y - Xw)^T (y - Xw).

Deriving the formula to find w. Start from RSS(w) = (y - Xw)^T (y - Xw) and set the gradient with respect to w to zero: ∂RSS/∂w = -2 X^T (y - Xw); 0 = -2 X^T (y - Xw); 0 = X^T (y - Xw); 0 = X^T y - X^T X w; X^T X w = X^T y; w = (X^T X)^{-1} X^T y.

Making polynomial regression. You're familiar with linear regression, where the input has k dimensions: h(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k. We can use this same machinery to do polynomial regression on a one-dimensional input: h(x) = w_0 + w_1 x + w_2 x^2 + ... + w_k x^k.

Making polynomial regression. Given a scalar example z, we can make a (k+1)-dimensional example x = <z^0, z^1, z^2, ..., z^k>. The i-th element of x is the power z^i, so h(x) = w_0 + w_1 z + w_2 z^2 + ... + w_k z^k.

Making polynomial regression. Since x_j = z^j, we can interpret the output of the regression as a polynomial function of z: h(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_k x_k = w_0 + w_1 z + w_2 z^2 + ... + w_k z^k.

Polynomial Regression. Model the relationship between the response variable and the attributes/predictor variables as a k-th-order polynomial. While this can model non-linear functions, it is still linear with respect to the coefficients: h(x) = w_0 + w_1 z + w_2 z^2 + w_3 z^3. (figure: a cubic fit to data, t versus x)

Polynomial Regression. Parameter estimation (analytically minimizing the sum of squared residuals): X = [[1, z_1, z_1^2, ..., z_1^k], [1, z_2, z_2^2, ..., z_2^k], ..., [1, z_n, z_n^2, ..., z_n^k]] (each row is one training example), y = [y_1, y_2, ..., y_n]^T, and w = (X^T X)^{-1} X^T y. (Note: there is only one attribute z for each training example; the superscripts are powers, since we're doing polynomial regression.)
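
A sketch of this construction in NumPy, using np.vander to build the matrix of powers; the data values and the order k = 3 are assumptions for illustration.

    import numpy as np

    # One-dimensional inputs z and responses y (made-up values).
    z = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
    y = np.array([0.1, 0.8, 0.9, 0.2, -0.7, -0.9, -0.1])

    k = 3  # polynomial order
    # Columns are z^0, z^1, ..., z^k (one row per training example).
    X = np.vander(z, N=k + 1, increasing=True)

    # Same closed-form solution as before: w = (X^T X)^{-1} X^T y.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)  # coefficients w_0 ... w_k of the fitted cubic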

Tuning Model Complexity: Example. What is your hypothesis for this data? (figure: scatter of data points, t versus x)

Tuning Model Complexity: Example. What is your hypothesis for this data? k-th-order polynomial regression with k = 0. (figure: the fitted constant over the data, t versus x)

Tuning Model Complexity: Example. What is your hypothesis for this data? k-th-order polynomial regression with k = 1. (figure: the fitted line over the data, t versus x)

Tuning Model Complexity: Example. What is your hypothesis for this data? k-th-order polynomial regression with k = 3. (figure: the fitted cubic over the data, t versus x)

Tuning Model Complexity: Example. What is your hypothesis for this data? k-th-order polynomial regression with k = 9. (figure: the fitted 9th-order polynomial over the data, t versus x)

Tuning Model Complexity: Example. What is your hypothesis for this data? k-th-order polynomial regression with k = 9. (figure: the k = 9 fit shown again, t versus x)

Tuning Model Complexity: Example. What happens if we fit to more data? k-th-order polynomial regression with k = 9 and N = 15 training examples. (figure: the fit, t versus x)

Tuning Model Complexity: Example. What happens if we fit to more data? k-th-order polynomial regression with k = 9 and N = 100 training examples. (figure: the fit, t versus x)
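
A hedged sketch of the kind of experiment these slides describe: fit polynomials of increasing order to noisy data and compare training error to error on held-out data. The sinusoidal target, the noise level, and the sample sizes are assumptions for illustration, not the slides' exact data.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        # Noisy samples of an assumed target function sin(2*pi*x).
        x = rng.uniform(0.0, 1.0, n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
        return x, y

    def fit_poly(x, y, k):
        X = np.vander(x, N=k + 1, increasing=True)
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def rss(x, y, w):
        X = np.vander(x, N=len(w), increasing=True)
        return np.sum((y - X @ w) ** 2)

    x_train, y_train = make_data(15)
    x_test, y_test = make_data(100)

    for k in (0, 1, 3, 9):
        w = fit_poly(x_train, y_train, k)
        print(k, rss(x_train, y_train, w), rss(x_test, y_test, w))

Training RSS keeps falling as k grows, while the held-out RSS typically rises for the high-order fits; this is the overfitting behaviour the slides illustrate.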

Bias and Variance of an Estimator. Let X be a sample from a population specified by a true parameter θ, and let d = d(X) be an estimator for θ. The mean square error of the estimator decomposes into variance plus squared bias: E[(d - θ)^2] = E[(d - E[d])^2] + (E[d] - θ)^2, i.e., mean square error = variance + bias^2.
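
A small simulation sketch of this decomposition; the estimator (the sample mean of a Gaussian sample) and all numeric values are assumptions chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 2.0           # true parameter (assumed)
    n, trials = 10, 5000  # sample size and number of simulated samples

    # Estimator d(X): the sample mean of each simulated sample.
    estimates = np.array([rng.normal(theta, 1.0, n).mean() for _ in range(trials)])

    mse = np.mean((estimates - theta) ** 2)
    variance = np.var(estimates)
    bias_sq = (np.mean(estimates) - theta) ** 2
    print(mse, variance + bias_sq)  # the two quantities agree up to simulation noise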

Bias and Variance. (figure: bias and variance versus the order of a polynomial fit to some data.) As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).

Bias and Variance of the Hypothesis Function. Bias: measures how much h(x) is wrong, disregarding the effect of varying samples. (This is the statistical bias of an estimator; it is NOT the same as inductive bias, which is the set of assumptions that your learner makes.) Variance: measures how much h(x) fluctuates around its expected value as the sample varies. NOTE: these are general machine learning concepts, not specific to linear regression.

Coefficient of Determination. The coefficient of determination, R^2, indicates how well data points fit a line or curve; we'd like R^2 to be close to 1. R^2 = 1 - [Σ_{i=1}^{n} (y_i - h(x_i))^2] / [Σ_{i=1}^{n} (y_i - ȳ)^2], where ȳ is the sample mean of the responses.
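
A minimal sketch computing R^2 for a fitted line, reusing the illustrative values from the earlier closed-form example.

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    X = np.column_stack([np.ones_like(x), x])
    w = np.linalg.solve(X.T @ X, X.T @ y)

    rss = np.sum((y - X @ w) ** 2)     # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
    r_squared = 1.0 - rss / tss
    print(r_squared)                   # close to 1 for a good linear fit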

Don't just rely on numbers, visualize! For all 4 sets: the same mean and variance for x, the same mean and variance (almost) for y, and the same regression line and correlation between x and y (and therefore the same R-squared).

Summary of Linear Regression Models. Easily understood. Interpretable. Well studied by statisticians. Computationally efficient. Can handle non-linear situations if formulated properly. Bias/variance tradeoff (occurs in all of machine learning). Visualize!! GLMs.

Appendix (Stuff I couldn't cover in class)

Bias and Variance. (figure: high bias, low variance)

Bias and Variance. (figure: high bias, high variance)

Bias and Variance. (figure: low bias, high variance)

Bias and Variance. (figure: low bias, low variance)

Bias and Variance. Bias: measures how much h(x) is wrong, disregarding the effect of varying samples; high bias leads to underfitting. Variance: measures how much h(x) fluctuates around its expected value as the sample varies; high variance leads to overfitting. There's a trade-off between bias and variance.

Ways to Avoid Overfitting. Use a simpler model, e.g. one with fewer parameters. Regularization: penalize complexity in the objective function (one common instantiation is sketched below). Use fewer features. Apply dimensionality reduction to the features (e.g. PCA). Use more data.
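
The slides do not commit to a specific regularizer; ridge regression, which adds an L2 penalty on the weights to the least-squares objective, is one common instantiation of "penalize for complexity". A hedged NumPy sketch, with the penalty strength lam and the toy data chosen arbitrarily for illustration (for simplicity the offset weight is penalized too, which practical implementations usually avoid):

    import numpy as np

    def ridge_fit(X, y, lam):
        # Minimize ||y - Xw||^2 + lam * ||w||^2; closed form w = (X^T X + lam*I)^{-1} X^T y.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    # Illustrative 9th-order polynomial fit with and without the penalty.
    rng = np.random.default_rng(2)
    z = np.linspace(0.0, 1.0, 15)
    y = np.sin(2 * np.pi * z) + rng.normal(0.0, 0.3, z.size)
    X = np.vander(z, N=10, increasing=True)

    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    w_ridge = ridge_fit(X, y, lam=1e-3)
    print(np.abs(w_ols).max(), np.abs(w_ridge).max())  # ridge weights are typically much smaller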

Model Selection. Cross-validation: measure generalization accuracy by testing on data unused during training (a small sketch follows below). Regularization: penalize complex models, E = error on data + λ · model complexity. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Minimum description length (MDL): Kolmogorov complexity, the shortest description of the data. Structural risk minimization (SRM).
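
A hedged sketch of k-fold cross-validation used to choose a polynomial order; the fold count, the candidate orders, and the synthetic data are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    z = rng.uniform(0.0, 1.0, 30)
    y = np.sin(2 * np.pi * z) + rng.normal(0.0, 0.3, z.size)

    def cv_error(z, y, k, folds=5):
        # Average held-out squared error of a k-th order polynomial fit.
        idx = rng.permutation(z.size)
        errors = []
        for fold in np.array_split(idx, folds):
            train = np.setdiff1d(idx, fold)
            X_tr = np.vander(z[train], N=k + 1, increasing=True)
            X_te = np.vander(z[fold], N=k + 1, increasing=True)
            w = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]
            errors.append(np.mean((y[fold] - X_te @ w) ** 2))
        return np.mean(errors)

    # Pick the order with the lowest average held-out error.
    scores = {k: cv_error(z, y, k) for k in (0, 1, 3, 5, 9)}
    print(min(scores, key=scores.get), scores)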

Generalized Linear Models. The models shown so far have assumed that the response variable follows a Gaussian distribution around its mean. This can be generalized to response variables that follow any exponential-family distribution (Generalized Linear Models, GLMs).