Bayes Estimators & Ridge Regression

Bayes Estimators & Ridge Regression. Readings: ISLR Chapter 6. STA 521, Duke University. Merlise Clyde. October 27, 2017.

Model
Assume that we have centered (as before) and rescaled X^o (the original X) so that
X_j = (X_j^o − X̄_j^o) / √( Σ_i (X_ij^o − X̄_j^o)² ),
equivalent to using scale(x) in R divided by √(n − 1).
Model: Y = 1β₀ + Xβ + ϵ, so that X^T X = Cor(X) (the correlation matrix of X).
Eigenvalue decomposition: X^T X = UΛU^T. If the smallest eigenvalue is 0, then X has columns that are linearly dependent.
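
A minimal numerical check (my own sketch, not from the slides) that this standardization makes X^T X equal the correlation matrix; the simulated matrix Xo is purely illustrative:

set.seed(1)
Xo = matrix(rnorm(20 * 3), 20, 3)          # small illustrative design matrix
n  = nrow(Xo)
X  = scale(Xo) / sqrt(n - 1)               # center, then divide each column by its root sum of squares
all.equal(crossprod(X), cor(Xo), check.attributes = FALSE)   # TRUE: X^T X = Cor(Xo)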

How Good are Various Estimators
Quadratic loss for estimating β using estimator a: L(β, a) = (β − a)^T (β − a).
Consider our expected loss (before we see the data) of taking action a.
Under OLS or the reference prior, the expected mean squared error is
E_Y[(β − β̂)^T (β − β̂)] = σ² tr[(X^T X)⁻¹] = σ² Σ_{j=1}^p 1/λ_j.
If the smallest λ_j → 0, then the MSE → ∞.
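
As a hedged illustration (not from the lecture) of why small eigenvalues inflate the MSE, the identity tr[(X^T X)⁻¹] = Σ_j 1/λ_j can be checked numerically on a design with two nearly collinear columns:

set.seed(42)
n  = 50
z  = rnorm(n)
Xo = cbind(z + rnorm(n, sd = 0.1),         # two nearly collinear columns
           z + rnorm(n, sd = 0.1),
           rnorm(n))
X   = scale(Xo) / sqrt(n - 1)              # standardized as on the Model slide
lam = eigen(crossprod(X))$values           # eigenvalues of X^T X = Cor(Xo)
sum(1 / lam)                               # large, driven by the smallest eigenvalue
sum(diag(solve(crossprod(X))))             # tr[(X^T X)^{-1}]; matches the line above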

Problems
Estimates β̂ = (X^T X)⁻¹ X^T Y, or with the g-prior β̂ = [g/(1 + g)] (X^T X)⁻¹ X^T Y, may be unstable without variable selection.
Solutions:
- Remove redundant variables (model selection via AIC, BIC, or other approaches); with 2^p possible models this is a combinatorially hard problem, even with MCMC.
- Add a constant to X^T X, giving β = (X^T X + kI)⁻¹ X^T Y, to stabilize the eigenvalues; this is an alternative shrinkage estimator/prior.
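
A small sketch (the simulated data and function name are my own) of the shrinkage estimator above, computed directly as (X^T X + kI)⁻¹ X^T Y on standardized predictors and a centered response:

set.seed(2)
n  = 40
Xo = matrix(rnorm(n * 3), n, 3)
Xo[, 2] = Xo[, 1] + rnorm(n, sd = 0.05)    # make two columns nearly collinear
Y  = 1 + Xo %*% c(2, 0, -1) + rnorm(n)
X  = scale(Xo) / sqrt(n - 1)

ridge.coef = function(X, Y, k) {
  Yc = Y - mean(Y)                          # centered response
  solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, Yc))
}

round(cbind(ridge.coef(X, Y, 0),            # k = 0 reproduces OLS on the standardized scale
            ridge.coef(X, Y, 1)), 3)        # larger k shrinks the coefficients toward 0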

Independent Prior
Reference prior: p(β₀, ϕ) ∝ ϕ⁻¹.
Prior distribution: β | ϕ, β₀, k ~ N(0_p, (ϕk)⁻¹ I_p).
Log likelihood (integrated) for β plus prior ∝ −(ϕ/2) ( ‖Y − 1Ȳ − Xβ‖² + k ‖β‖² ).
Posterior mean: b_n = (X^T X + kI)⁻¹ X^T X β̂.
Note the importance of standardizing.
Choice of k in practice? k = 0 gives OLS; k → ∞ gives estimates of 0 (intercept only).

Alternative Motivation
If β̂ is unconstrained, expect high variance when X is nearly singular.
Control how large the coefficients may grow:
min_β (Y − 1Ȳ − Xβ)^T (Y − 1Ȳ − Xβ) subject to Σ_j β_j² ≤ t.
Equivalent quadratic programming problem:
min_β ‖Y_c − X_c β‖² + k ‖β‖²,
a penalized likelihood: Ridge Regression.

Geometry
[Figure: geometry of ridge regression; source: onlinecourses.science.psu.edu]

Longley Data
library(MASS); data(longley)
[Scatterplot matrix of the Longley variables: GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year, Employed]
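
The scatterplot matrix above can presumably be reproduced with a single call (my assumption; the plotting code is not in the transcript):

pairs(longley)    # pairwise scatterplots of all seven Longley variables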

OLS
> longley.lm = lm(Employed ~ ., data=longley)
> summary(longley.lm)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,    Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10
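
A hedged follow-up (not shown in the slides): the instability suggested by these large standard errors traces back to near-collinearity among the predictors, which shows up as near-zero eigenvalues of their correlation matrix, exactly the 1/λ_j terms that inflate the MSE:

R = cor(longley[, -7])        # correlation matrix of the six predictors (Employed excluded)
eigen(R)$values               # several eigenvalues are close to zero
sum(1 / eigen(R)$values)      # large tr[(X^T X)^{-1}], so OLS has high expected MSE
kappa(R, exact = TRUE)        # large condition number confirms ill-conditioning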

Ridge Regression
# from library MASS
longley.ridge = lm.ridge(Employed ~ ., data=longley,
                         lambda=seq(0, 0.1, 0.0001))  # lambda = k in notes
summary(longley.ridge)

##        Length Class  Mode
## coef   6006   -none- numeric
## scales    6   -none- numeric
## Inter     1   -none- numeric
## lambda 1001   -none- numeric
## ym        1   -none- numeric
## xm        6   -none- numeric
## GCV    1001   -none- numeric
## kHKB      1   -none- numeric
## kLW       1   -none- numeric

Ridge Trace Plot
[Plot of the ridge coefficient traces, t(x$coef), against x$lambda]
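
One way the trace plot could have been produced (an assumption on my part, since the plotting code is not in the transcript) is by plotting the coefficient paths stored in the lm.ridge fit against lambda:

x = lm.ridge(Employed ~ ., data = longley, lambda = seq(0, 0.1, 0.0001))
matplot(x$lambda, t(x$coef), type = "l", lty = 1,
        xlab = "x$lambda", ylab = "t(x$coef)")    # one trace per standardized coefficient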

Choice of k
k = seq(0, 0.1, 0.0001)
n.k = length(k); n = nrow(longley)
cv.lambda = matrix(NA, n, n.k)

rmse.ridge = function(data, i, j, k) {
  # fit the ridge regression leaving out observation i
  m.ridge = lm.ridge(Employed ~ ., data = data, lambda = k[j],
                     subset = -i)
  # predict the held-out observation using the stored centering and scaling
  yhat = scale(data[i, 1:6, drop=FALSE], center = m.ridge$xm,
               scale = m.ridge$scales) %*% m.ridge$coef + m.ridge$ym
  (yhat - data$Employed[i])^2
}

for (i in 1:n) {
  for (j in 1:n.k) {
    cv.lambda[i, j] = rmse.ridge(longley, i, j, k)
  }
}

Cross Validation Error
cv.error = apply(cv.lambda, 2, mean)
plot(k, cv.error, type="l")
[Plot of cv.error against k]
Best k = 0.0028

Generalized Cross-Validation
select(lm.ridge(Employed ~ ., data=longley,
                lambda=seq(0, 0.1, 0.0001)))

## modified HKB estimator is 0.004275357
## modified L-W estimator is 0.03229531
## smallest value of GCV at 0.0028

best.k = longley.ridge$lambda[which.min(longley.ridge$GCV)]
longley.rreg = lm.ridge(Employed ~ ., data=longley, lambda=best.k)
coef(longley.rreg)

##               GNP.deflator           GNP    Unemployed          Arme
## -2.950348e+03 -5.381450e-04 -1.822639e-02 -1.761107e-02         -9.60
##    Population          Year
## -1.185103e-01  1.557856e+00

Priors on k
X is centered and standardized; Y = 1β₀ + Xβ + ϵ.
Hierarchical prior:
p(β₀, ϕ | β, κ) ∝ ϕ⁻¹
β | ϕ, κ ~ N(0, (ϕκ)⁻¹ I)
Prior on κ? Take κ | ϕ ~ Gamma(1/2, 1/2).
What is the induced prior on β | ϕ?
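
A sketch of the answer (this derivation is mine, not shown on the slide): integrating κ out of the conditional normal prior gives a scale mixture of normals, and the Gamma(1/2, 1/2) mixing distribution yields a multivariate Cauchy (Student-t with one degree of freedom) prior:

p(β | ϕ) = ∫ N(β; 0_p, (ϕκ)⁻¹ I_p) Gamma(κ; 1/2, 1/2) dκ
         ∝ ∫ κ^((p−1)/2) exp(−κ(1 + ϕ‖β‖²)/2) dκ
         ∝ (1 + ϕ‖β‖²)^(−(p+1)/2),

so each β_j marginally has a Cauchy prior with location 0 and scale ϕ^(−1/2), with much heavier tails than the fixed-k normal ridge prior.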

Posterior Distributions
Joint distribution: β₀, β, ϕ | κ, Y is in the Normal-Gamma family given Y and κ, but κ | Y is not tractable.
Obtain the marginal for β via MCMC:
Pick initial values β₀^(0), β^(0), ϕ^(0), and set t = 1.
1. Sample κ^(t) ~ p(κ | β₀^(t−1), β^(t−1), ϕ^(t−1), Y)
2. Sample β₀^(t), β^(t), ϕ^(t) | κ^(t), Y
3. Set t = t + 1 and repeat until t > T
Use the samples β₀^(t), β^(t), ϕ^(t), κ^(t) for t = B, ..., T for inference (discarding the earlier draws as burn-in).

JAGS
JAGS = Just Another Gibbs Sampler
- a scripting language for expressing sampling models and priors
- derives the full conditional distributions for you
- integrates with R
- typically faster than interpreted R code
- accounts for uncertainty about k
How would you compare Bayes predictions with ridge regression tuned by cross-validation?
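
A minimal sketch of what such a JAGS model might look like for the ridge prior with the Gamma(1/2, 1/2) hyperprior on κ. The model string, data setup, and the vague proper priors standing in for the improper reference prior are my own assumptions, not taken from the lecture:

library(MASS)
library(rjags)    # requires a JAGS installation

ridge.model = "
model {
  for (i in 1:n) {
    mu[i] <- beta0 + inprod(X[i, ], beta[])
    Y[i]  ~ dnorm(mu[i], phi)               # JAGS parameterizes the normal by precision
  }
  for (j in 1:p) {
    beta[j] ~ dnorm(0, phi * kappa)         # ridge prior: beta_j | phi, kappa ~ N(0, 1/(phi*kappa))
  }
  beta0 ~ dnorm(0, 1.0E-6)                  # vague proper priors approximating p(beta0, phi) proportional to 1/phi
  phi   ~ dgamma(1.0E-3, 1.0E-3)
  kappa ~ dgamma(0.5, 0.5)                  # Gamma(1/2, 1/2) hyperprior on kappa
}"

data(longley)
X = scale(longley[, -7]) / sqrt(nrow(longley) - 1)    # standardized predictors, as in the notes
jags.data = list(Y = longley$Employed, X = X, n = nrow(X), p = ncol(X))

m = jags.model(textConnection(ridge.model), data = jags.data, n.chains = 2)
update(m, 1000)                                       # burn-in
samp = coda.samples(m, c("beta0", "beta", "kappa"), n.iter = 5000)
summary(samp)

Posterior summaries of beta from this fit could then be set against the ridge coefficients chosen by GCV or leave-one-out cross-validation, which is one way to approach the question above.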