Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines. Robert J. Hill and Michael Scholz


1 Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines
Robert J. Hill and Michael Scholz
Department of Economics, University of Graz, Austria
OeNB Workshop, Vienna, 9th-10th October 2014
Supported by funds of the Oesterreichische Nationalbank (Anniversary Fund, project number: 14947)
Robert J. Hill and Michael Scholz, OeNB Workshop, Oct 2014

2 Overview of the talk:
1. Hedonic House Price Indexes: Incorporating Geospatial Data; Taxonomy: Time Dummy, Average Characteristics, Hedonic Imputation
2. Models: GAM with Spline vs. Postcode/Region Dummies; Estimation of the Semiparametric Model
3. Data Set, Missing Characteristics, Results and Main Findings; Are Postcode/Region Based Indexes Downward Biased?
4. Conclusions

3 Introduction
Motivation:
- Houses differ both in their physical characteristics and their location.
- The exact longitude and latitude of each house are now increasingly available in housing data sets.
- How can we incorporate geospatial data (i.e., longitudes and latitudes) in a hedonic model of the housing market?
- How much difference does it make?

4 Methods for Incorporating Geospatial Data in a Hedonic Index:
1. Distance to amenities (including the city center, nearest train station, shopping center, etc.) as additional characteristics
2. Spatial autoregressive models
3. A spline function (or some other nonparametric function)

5 A Taxonomy of Methods for Computing Hedonic House Price Indexes I
Time dummy method:
y = Zβ + Dδ + ε,
where Z is a matrix of characteristics and D is a matrix of time dummy variables.
Index: P_t = exp(δ̂_t), where δ̂_t is the estimated coefficient obtained from the hedonic model.
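The mechanics of the time dummy index can be sketched in a few lines. The data below are simulated and the single characteristic is hypothetical; the authors' actual estimation is done in R, so this is only an illustration of P_t = exp(δ̂_t):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 sales over two periods, one characteristic (log land area).
n = 200
period = rng.integers(0, 2, n)           # 0 = base period t, 1 = period t+1
z = rng.normal(5.0, 0.5, n)              # log land area
# Simulated model: log price = 2 + 0.8*z + 0.1*(period dummy) + noise
y = 2.0 + 0.8 * z + 0.1 * period + rng.normal(0, 0.05, n)

# Design matrix [constant, characteristic, time dummy]; no dummy for the base period.
X = np.column_stack([np.ones(n), z, period])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Index for period t+1 relative to t: P = exp(delta_hat),
# which should be near exp(0.1) for this simulated data.
P = np.exp(beta[2])
```

The dummy coefficient measures the pure period effect on log price after controlling for characteristics, which is why exponentiating it yields the quality-adjusted index.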

6 A Taxonomy of Methods for Computing Hedonic House Price Indexes II
Average characteristics method:
Laspeyres: P^L_{t,t+1} = p̂_{t+1}(z̄_t) / p̂_t(z̄_t) = exp[ Σ_{c=1}^C (β̂_{c,t+1} − β̂_{c,t}) z̄_{c,t} ],
Paasche: P^P_{t,t+1} = p̂_{t+1}(z̄_{t+1}) / p̂_t(z̄_{t+1}) = exp[ Σ_{c=1}^C (β̂_{c,t+1} − β̂_{c,t}) z̄_{c,t+1} ],
where z̄_{c,t} = (1/H_t) Σ_{h=1}^{H_t} z_{c,t,h} and z̄_{c,t+1} = (1/H_{t+1}) Σ_{h=1}^{H_{t+1}} z_{c,t+1,h}.
Note: Average characteristics methods cannot use geospatial data, since averaging longitudes and latitudes makes no sense.
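With estimated shadow prices and mean characteristics in hand, the two formulas reduce to a dot product. The coefficient vectors and means below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical shadow prices from semilog regressions in periods t and t+1
# (characteristics: constant, bedrooms, bathrooms, log land area).
beta_t  = np.array([11.0, 0.10, 0.15, 0.30])
beta_t1 = np.array([11.1, 0.11, 0.14, 0.31])

# Mean characteristic vectors in each period (constant included).
zbar_t  = np.array([1.0, 3.2, 1.5, 6.3])
zbar_t1 = np.array([1.0, 3.3, 1.6, 6.4])

# Laspeyres holds base-period mean characteristics fixed;
# Paasche holds current-period means fixed.
P_L = np.exp((beta_t1 - beta_t) @ zbar_t)
P_P = np.exp((beta_t1 - beta_t) @ zbar_t1)
```

The only difference between the two indexes is which period's mean characteristics are priced; this is also where the method breaks down for geospatial data, since a "mean longitude" is meaningless.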

7 A Taxonomy of Methods for Computing Hedonic House Price Indexes III
Hedonic imputation method (SI = with Single Imputation):
Paasche: P^{PSI}_{t,t+1} = Π_{h=1}^{H_{t+1}} [ p_{t+1,h} / p̂_t(z_{t+1,h}) ]^{1/H_{t+1}},
Laspeyres: P^{LSI}_{t,t+1} = Π_{h=1}^{H_t} [ p̂_{t+1}(z_{t,h}) / p_{t,h} ]^{1/H_t},
Fisher: P^{FSI}_{t,t+1} = sqrt( P^{PSI}_{t,t+1} × P^{LSI}_{t,t+1} ).
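Each index is a geometric mean of ratios of actual to imputed prices. A minimal sketch with made-up prices and imputations (the real imputations would come from the fitted hedonic models):

```python
import numpy as np

def geo_mean(x):
    """Geometric mean via the log domain."""
    return float(np.exp(np.mean(np.log(x))))

# Hypothetical observed prices and model-imputed prices.
p_t1   = np.array([500e3, 620e3, 480e3])   # observed in period t+1
phat_t = np.array([460e3, 590e3, 450e3])   # period-t model priced at t+1 characteristics

p_t     = np.array([400e3, 510e3])         # observed in period t
phat_t1 = np.array([425e3, 540e3])         # period-(t+1) model priced at t characteristics

P_paasche   = geo_mean(p_t1 / phat_t)      # actual over imputed, t+1 sample
P_laspeyres = geo_mean(phat_t1 / p_t)      # imputed over actual, t sample
P_fisher    = float(np.sqrt(P_paasche * P_laspeyres))
```

Because each house is priced at its own characteristics, including its exact location, this method accommodates geospatial data directly, unlike the average characteristics method.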

8 Our Models I
(i) semilog with geospatial spline:
y = Zβ + g(z_lat, z_long) + ε    (1)
(ii)/(iii) semilog with postcode/region dummies:
y = Zβ + Dδ + ε    (2)
- y is an H × 1 vector of log prices
- Z is an H × C matrix of physical characteristics (including a constant and quarterly dummies)
- g(z_lat, z_long) is the geospatial spline function defined on the longitudes and latitudes
- D is a matrix of postcode or region dummies.

9 Our Models II
(i) semilog with geospatial spline: y = Zβ + g(z_lat, z_long) + ε (1)
(ii)/(iii) semilog with postcode/region dummies: y = Zβ + Dδ + ε (2)
- The parameters to be estimated in (i) are the C × 1 vector of characteristic shadow prices β and the geospatial spline surface g(z_lat, z_long).
- The parameters to be estimated in (ii) or (iii) are the C × 1 vector of characteristic shadow prices β and the B × 1 vector of postcode or region shadow prices δ.
- We consider 16 regions and 242 postcodes, so on average there are about 15 postcodes per region.

10 Estimation of the Semiparametric Model I
Estimation: The semiparametric model is estimated using the bam function (Big Additive Models) in the mgcv package in R. [details: GAM]
For the nonparametric part we use an approximation to a thin plate spline (TPS), the thin plate regression spline (TPRS) developed by S. Wood (2003), defined on the longitudes and latitudes. [details: TPS]
In our context, TPRSs have two advantages over other splines:
- It is not necessary to specify knot locations.
- The spline surface can be estimated as a function of two explanatory variables. [details: TPRS]
With a full TPS the computational burden rises rapidly as the number of data points increases.

11 Estimation of the Semiparametric Model II
- The approximation reduces the computational burden for any given level of fit.
- It is necessary to select a value for the basis dimension k, which determines how closely our spline approximates a full TPS: a lower value of k has a lower computational burden but leads to a worse fit. We choose a value of k beyond which the gains in fit from increasing it further are low. [details: parameter choice]
- The smoothing parameter λ determines the smoothness of the spline surface itself. The algorithm selects λ using restricted maximum likelihood (REML), where the likelihood function is maximized across all the parameters of the semiparametric model.

12 Our Data Set
- The data: Sydney, Australia, from 2001 to 2011.
- Our characteristics are:
  - Transaction price + exact date of sale
  - Physical: number of bedrooms, number of bathrooms, land area
  - Location: postcode, longitude, latitude
- Some (physical) characteristics are missing for some houses. There are more gaps in the data in the earlier years of our sample.
- We have a total of 454,567 transactions. All characteristics are available for only 240,142 of these transactions.

13 Dealing with Missing Characteristics
We impute the price of each house from the model below that has exactly the same mix of characteristics:
(HM1): y = f(d, z1, z2, z3, loc)
(HM2): y = f(d, z2, z3, loc)
(HM3): y = f(d, z1, z3, loc)
(HM4): y = f(d, z1, z2, loc)
(HM5): y = f(d, z3, loc)
(HM6): y = f(d, z2, loc)
(HM7): y = f(d, z1, loc)
(HM8): y = f(d, loc)
For all obs.: y = log(price), d = quarter dummy, loc = location
Not for all obs.: z1 = land area, z2 = number of bedrooms, z3 = number of bathrooms
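Routing each transaction to the model that matches its pattern of available characteristics amounts to a lookup over the eight cases. This dispatch function is a sketch of one way to do it (the authors' actual pipeline is not shown in the slides):

```python
# Map each availability pattern of (z1, z2, z3) to the hedonic model
# (HM1-HM8) that uses exactly those variables, per the list above.
# z1 = land area, z2 = bedrooms, z3 = bathrooms; d and loc are always observed.
def pick_model(has_z1, has_z2, has_z3):
    return {
        (True,  True,  True ): "HM1",   # f(d, z1, z2, z3, loc)
        (False, True,  True ): "HM2",   # f(d, z2, z3, loc)
        (True,  False, True ): "HM3",   # f(d, z1, z3, loc)
        (True,  True,  False): "HM4",   # f(d, z1, z2, loc)
        (False, False, True ): "HM5",   # f(d, z3, loc)
        (False, True,  False): "HM6",   # f(d, z2, loc)
        (True,  False, False): "HM7",   # f(d, z1, loc)
        (False, False, False): "HM8",   # f(d, loc)
    }[(has_z1, has_z2, has_z3)]

print(pick_model(True, False, True))   # house missing only bedrooms -> "HM3"
```

Since location is always observed, every house can be imputed from some model, which is what allows the full 454,567-transaction sample to be used rather than only the complete cases.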

14 Comparing the Performance of Our Models I
Different performance measures:
- Akaike information criterion (AIC): trade-off between the goodness of fit and the complexity of the model.
- Sum of squared log errors: SSLE_t = (1/H_t) Σ_{h=1}^{H_t} [ln(p̂_{t,h}/p_{t,h})]².
- Repeat sales as a benchmark:
  Z^SI_h = actual price relative / imputed price relative = (p_{t+k,h}/p_{t,h}) / (p̂_{t+k,h}/p̂_{t,h}),
  D^SI = (1/H) Σ_{h=1}^H [ln(Z^SI_h)]².
The spline model significantly outperforms its postcode counterpart.
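The repeat-sales benchmark D^SI compares each house's actual price relative to its imputed one. A sketch with hypothetical prices and imputations (the numbers are invented; a good model keeps each Z_h near 1, driving D^SI toward 0):

```python
import numpy as np

# Hypothetical repeat-sales pairs: observed prices at t and t+k,
# plus the model's imputed prices at the same two dates.
p_t   = np.array([400e3, 510e3, 330e3])   # observed at t
p_tk  = np.array([450e3, 560e3, 360e3])   # observed at t+k
ph_t  = np.array([410e3, 500e3, 340e3])   # imputed at t
ph_tk = np.array([452e3, 555e3, 352e3])   # imputed at t+k

# Z_h = actual price relative divided by imputed price relative.
Z = (p_tk / p_t) / (ph_tk / ph_t)
D_SI = float(np.mean(np.log(Z) ** 2))

# In-sample fit for period t: mean of squared log pricing errors.
SSLE_t = float(np.mean(np.log(ph_t / p_t) ** 2))
```

Using price relatives rather than price levels makes the benchmark robust to a constant level error in the model, since any multiplicative bias common to both dates cancels in the ratio.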

15 Comparing the Performance of Our Models II
Table: Akaike information criterion (restricted data set), for models (i), (ii), (iii).
Table: Sum of squared log errors (restricted data set), for models (i), (ii), (iii).
Table: Sum of squared log price relative errors D^SI, restricted and full data sets, for models (i), (ii), (iii).

16 Spline Surface Based on Restricted Data Set for 2007

17 Price Indexes Calculated on the Restricted Data Set
[Figure: indexes plotted by year: region LM, long/lat GAM, postcode LM, median, repeat sales]

18 Price Indexes Calculated on the Full Data Set
[Figure: indexes plotted by year: region LM, long/lat GAM, postcode LM, median, repeat sales]

19 Main Findings
- The price index rises more when locational effects are captured using a geospatial spline rather than postcode or region dummies.
- The gap between the spline and postcode based indexes is small (about 0.5 percent over 11 years); the gap between the spline and region based indexes is larger (about 4.2 percent over 11 years).
- The full-sample spline-based price index rises 6.5 percent more over 11 years than its restricted data set counterpart.
- The median index is dramatically different when the full data set is used.
- The gap between spline and postcode/region hedonic price indexes is smaller when the full data set is used.

20 Are Postcode/Region Based Indexes Downward Biased? I
A downward bias can arise when the locations of sold houses in a postcode or region get worse over time.
Our test procedure:
1. Choose a postcode.
2. Calculate the mean number of bedrooms, bathrooms, land area and quarter of sale over the 11 years for that postcode.
3. Impute the price of this average house in every location in which a house actually sold in 2001,...,2011 in that postcode (using the spline model of year 2001).
4. Take the geometric mean of these imputed prices for each year.
5. Repeat for another postcode.
6. Take the geometric mean across postcodes in each year.
7. Repeat steps 3-6 using the spline of year 2002, the spline of 2003, etc.
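The core of steps 3 and 4 can be sketched as follows; the imputed prices below are hypothetical stand-ins for what one reference-year spline would produce at the locations of actual sales:

```python
import numpy as np

def geo_mean(x):
    """Geometric mean via the log domain."""
    return float(np.exp(np.mean(np.log(x))))

# Hypothetical step-3 output for one postcode: prices of the postcode's
# "average house" imputed (from one fixed reference-year spline) at the
# locations where houses actually sold in each year.
imputed = {
    2001: np.array([520e3, 540e3, 510e3]),
    2006: np.array([505e3, 515e3, 498e3]),
    2011: np.array([495e3, 500e3, 490e3]),
}

# Step 4: geometric mean of the imputed prices for each year.
yearly = {year: geo_mean(prices) for year, prices in imputed.items()}
```

Because the house's characteristics and the spline are held fixed, any decline in these yearly means can only come from the changing locations of sales within the postcode, which is exactly the source of the suspected bias.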

21 Are Postcode Based Indexes Downward Biased? II
Findings:
- The geometric means from step 6 fall over time, irrespective of which year's spline is used as the reference.
- Most of the fall occurs in the first half of the sample.
- The fall is bigger for regions than postcodes.

22 Evidence of Bias in the Postcode-Based Price Indexes

23 Evidence of Bias in the Region-Based Price Indexes

24 Conclusions
- Splines (or some other nonparametric method), when combined with the hedonic imputation method, provide a flexible way of incorporating geospatial data into a house price index.
- In our data set, postcode/region based indexes seem to have a downward bias, since they fail to account for a general shift over time of houses sold toward worse locations in each postcode/region. The bias is negligible with postcode dummies but not with region dummies.
- For a city with postcodes as finely defined as Sydney's (about 14,300 residents and 7.39 square kilometers per postcode), postcode dummies do a good job of controlling for locational effects.
- It is important to use the full data set, and not just observations with no missing characteristics.

25 Thank you for your attention!

26 GAM
We estimate the generalized additive model
y = Zβ + g(z_lat, z_long) + ε
with a thin plate regression spline (TPRS), Wood (2003): an optimal low-rank approximation of a thin plate spline.
- A TPRS uses far fewer coefficients than a full spline: computationally efficient, while losing little statistical performance.
- Big Additive Models: the bam function from the R package mgcv.
- We can avoid certain problems in spline smoothing: the choice of knot locations and basis, and smooths of more than one predictor.

27 Thin Plate Spline I
Thin plate spline smoothing problem, Duchon (1977): Estimate the smooth function g of a d-vector x from n observations satisfying y_i = g(x_i) + ε_i by finding the function f̂ that minimizes the penalized problem
‖y − f‖² + λ J_md(f),    (3)
where y = (y_1, ..., y_n)', f = (f(x_1), ..., f(x_n))', λ is a smoothing parameter, and J_md(f) is a penalty function measuring the wiggliness of f:
J_md(f) = ∫ ... ∫ Σ_{ν_1+...+ν_d = m} [ m! / (ν_1! ... ν_d!) ] ( ∂^m f / ∂x_1^{ν_1} ... ∂x_d^{ν_d} )² dx_1 ... dx_d.

28 Thin Plate Spline II
It can be shown that the solution of (3) has the form
f̂(x) = Σ_{i=1}^n δ_i η_md(‖x − x_i‖) + Σ_{j=1}^M α_j φ_j(x),    (4)
where δ_i and α_j are coefficients to be estimated, with the δ_i such that T'δ = 0, where T_ij = φ_j(x_i).
The M = (m+d−1 choose d) functions φ_j are linearly independent polynomials spanning the space of polynomials in R^d of degree less than m (i.e., the null space of J_md).
η_md(r) = [ (−1)^{m+1+d/2} / (2^{2m−1} π^{d/2} (m−1)! (m−d/2)!) ] r^{2m−d} log(r)   for d even,
η_md(r) = [ Γ(d/2 − m) / (2^{2m} π^{d/2} (m−1)!) ] r^{2m−d}   for d odd.
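For the geospatial case relevant here (d = 2 with the default m = 2), the even-d formula reduces to η(r) = r² log(r) / (8π), with η(0) = 0 by continuity. A small sketch evaluating this radial basis over a few 2-D points (the coordinates are arbitrary placeholders for long/lat pairs):

```python
import numpy as np

def eta_2d(r):
    """Thin plate spline radial basis for d = 2, m = 2:
    eta(r) = r^2 * log(r) / (8*pi), with eta(0) = 0 by continuity."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz]) / (8 * np.pi)
    return out

# Matrix of basis evaluations E_ij = eta(||x_i - x_j||) over sample points.
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
E = eta_2d(r)
```

This E is the matrix that enters the fitting problem on the next slide; note that it is symmetric with a zero diagonal, and that η vanishes at r = 1 as well since log(1) = 0.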

29 Figure: Rank 15 Eigen Approx to 2D thin plate spline (Wood, 2006) Robert J. Hill and Michael Scholz OeNB Workshop, Oct / 10

30 Thin Plate Spline III
Defining E by E_ij = η_md(‖x_i − x_j‖), the thin plate spline fitting problem is now the minimization of
‖y − Eδ − Tα‖² + λ δ'Eδ   s.t. T'δ = 0,    (5)
with respect to δ and α.
- Ideal smoother: gives exact weight to the conflicting goals of matching the data and making f smooth.
- Disadvantage: as many parameters as there are data points, i.e., O(n³) calculations.
More on thin plate splines: Wahba (1990) or Green and Silverman (1994).

31 Thin Plate Regression Spline I
- The computational burden can be reduced with the use of a low-rank approximation, Wood (2003): a parameter-space basis that perturbs (5) as little as possible.
- The basic idea: truncate the space of the wiggly components of the spline (with parameter δ), while leaving the α components unchanged.
- Let E = UDU' be the eigen-decomposition of E. Choose an appropriate submatrix D_k of D and the corresponding U_k, restricting δ to the column space of U_k, i.e., δ = U_k δ_k. Then minimize
‖y − U_k D_k δ_k − Tα‖² + λ δ_k' D_k δ_k   s.t. T'U_k δ_k = 0,    (6)
with respect to δ_k and α.
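The truncation step itself is a rank-k eigen-approximation of E. A sketch using a random symmetric matrix as a stand-in for E (in practice E would come from the radial basis evaluations, and mgcv uses the Lanczos method rather than a full decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the n x n symmetric matrix E of radial-basis evaluations.
n, k = 50, 10
A = rng.normal(size=(n, n))
E = (A + A.T) / 2

# Eigen-decomposition E = U D U'; keep the k eigenvalues of largest magnitude
# (the "wiggly components" that matter most for the fit).
vals, U = np.linalg.eigh(E)
idx = np.argsort(np.abs(vals))[::-1][:k]
U_k, D_k = U[:, idx], np.diag(vals[idx])

# Rank-k approximation underlying the TPRS basis: E_k = U_k D_k U_k'.
E_k = U_k @ D_k @ U_k.T
```

Restricting δ to the column space of U_k replaces the n-dimensional wiggly part by a k-dimensional one, which is what cuts the cost of the fit from O(n³) to O(k³).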

32 Thin Plate Regression Spline II
- The computational cost is reduced from O(n³) to O(k³).
- Remaining problem: find U_k and D_k sufficiently cheaply. Wood (2003) proposes the use of the Lanczos method (Demmel, 1997), which allows the calculation at the substantially lower cost of O(n²k) operations.
- Smoothing parameter selection: a Laplace approximation yields an approximate REML criterion that is suitable for efficient direct optimization and is computationally stable, Wood (2011).

33 Thin Plate Regression Spline III
In practice:
1. Choose the degree of approximation k = 600 (tradeoff: oversmoothing vs. computational burden).
2. Construct the TPRS basis from a randomly chosen sample of 2500 data points. The locations of these observations depend on the locational distribution of house sales in that period.

34 Figure: Sum of squared errors D^SI of the price relatives and computational time for different basis dimensions.

35 Figure: AIC distribution and computational time for 100 repetitions of the year-2011 fit based on different numbers of randomly chosen observations for basis construction.


More information

Smoothing parameterselection forsmoothing splines: a simulation study

Smoothing parameterselection forsmoothing splines: a simulation study Computational Statistics & Data Analysis 42 (2003) 139 148 www.elsevier.com/locate/csda Smoothing parameterselection forsmoothing splines: a simulation study Thomas C.M. Lee Department of Statistics, Colorado

More information

arxiv: v1 [stat.me] 2 Jun 2017

arxiv: v1 [stat.me] 2 Jun 2017 Inference for Penalized Spline Regression: Improving Confidence Intervals by Reducing the Penalty arxiv:1706.00865v1 [stat.me] 2 Jun 2017 Ning Dai School of Statistics University of Minnesota daixx224@umn.edu

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Dimension Reduction Methods for Multivariate Time Series

Dimension Reduction Methods for Multivariate Time Series Dimension Reduction Methods for Multivariate Time Series BigVAR Will Nicholson PhD Candidate wbnicholson.com github.com/wbnicholson/bigvar Department of Statistical Science Cornell University May 28, 2015

More information

Last time... Bias-Variance decomposition. This week

Last time... Bias-Variance decomposition. This week Machine learning, pattern recognition and statistical data modelling Lecture 4. Going nonlinear: basis expansions and splines Last time... Coryn Bailer-Jones linear regression methods for high dimensional

More information

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester.

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester. AM205: lecture 2 Luna and Gary will hold a Python tutorial on Wednesday in 60 Oxford Street, Room 330 Assignment 1 will be posted this week Chris will hold office hours on Thursday (1:30pm 3:30pm, Pierce

More information

GAM: The Predictive Modeling Silver Bullet

GAM: The Predictive Modeling Silver Bullet GAM: The Predictive Modeling Silver Bullet Author: Kim Larsen Introduction Imagine that you step into a room of data scientists; the dress code is casual and the scent of strong coffee is hanging in the

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

Lecture 7: Splines and Generalized Additive Models

Lecture 7: Splines and Generalized Additive Models Lecture 7: and Generalized Additive Models Computational Statistics Thierry Denœux April, 2016 Introduction Overview Introduction Simple approaches Polynomials Step functions Regression splines Natural

More information

Penalizied Logistic Regression for Classification

Penalizied Logistic Regression for Classification Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Economics Nonparametric Econometrics

Economics Nonparametric Econometrics Economics 217 - Nonparametric Econometrics Topics covered in this lecture Introduction to the nonparametric model The role of bandwidth Choice of smoothing function R commands for nonparametric models

More information

Nonlinearity and Generalized Additive Models Lecture 2

Nonlinearity and Generalized Additive Models Lecture 2 University of Texas at Dallas, March 2007 Nonlinearity and Generalized Additive Models Lecture 2 Robert Andersen McMaster University http://socserv.mcmaster.ca/andersen Definition of a Smoother A smoother

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET

PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET KLAUS MOELTNER 1. Introduction In this script we show, for a single housing market, how mis-specification of the hedonic price function can lead

More information

Detection of Smoke in Satellite Images

Detection of Smoke in Satellite Images Detection of Smoke in Satellite Images Mark Wolters Charmaine Dean Shanghai Center for Mathematical Sciences Western University December 15, 2014 TIES 2014, Guangzhou Summary Application Smoke identification

More information

1D Regression. i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is...

1D Regression. i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is... 1D Regression i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is... 1 Non-linear problems What if the underlying function is

More information

Model selection and validation 1: Cross-validation

Model selection and validation 1: Cross-validation Model selection and validation 1: Cross-validation Ryan Tibshirani Data Mining: 36-462/36-662 March 26 2013 Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10 1 Reminder: modern regression techniques Over the

More information

Median and Extreme Ranked Set Sampling for penalized spline estimation

Median and Extreme Ranked Set Sampling for penalized spline estimation Appl Math Inf Sci 10, No 1, 43-50 (016) 43 Applied Mathematics & Information Sciences An International Journal http://dxdoiorg/1018576/amis/10014 Median and Extreme Ranked Set Sampling for penalized spline

More information

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100. Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records

More information

Diffusion Wavelets for Natural Image Analysis

Diffusion Wavelets for Natural Image Analysis Diffusion Wavelets for Natural Image Analysis Tyrus Berry December 16, 2011 Contents 1 Project Description 2 2 Introduction to Diffusion Wavelets 2 2.1 Diffusion Multiresolution............................

More information

Bernstein-Bezier Splines on the Unit Sphere. Victoria Baramidze. Department of Mathematics. Western Illinois University

Bernstein-Bezier Splines on the Unit Sphere. Victoria Baramidze. Department of Mathematics. Western Illinois University Bernstein-Bezier Splines on the Unit Sphere Victoria Baramidze Department of Mathematics Western Illinois University ABSTRACT I will introduce scattered data fitting problems on the sphere and discuss

More information

Lecture 7: Linear Regression (continued)

Lecture 7: Linear Regression (continued) Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions

More information

NONPARAMETRIC REGRESSION TECHNIQUES

NONPARAMETRIC REGRESSION TECHNIQUES NONPARAMETRIC REGRESSION TECHNIQUES C&PE 940, 28 November 2005 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and other resources available at: http://people.ku.edu/~gbohling/cpe940

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Introduction to floating point arithmetic

Introduction to floating point arithmetic Introduction to floating point arithmetic Matthias Petschow and Paolo Bientinesi AICES, RWTH Aachen petschow@aices.rwth-aachen.de October 24th, 2013 Aachen, Germany Matthias Petschow (AICES, RWTH Aachen)

More information

A comparison of spline methods in R for building explanatory models

A comparison of spline methods in R for building explanatory models A comparison of spline methods in R for building explanatory models Aris Perperoglou on behalf of TG2 STRATOS Initiative, University of Essex ISCB2017 Aris Perperoglou Spline Methods in R ISCB2017 1 /

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

Nonparametric Methods Recap

Nonparametric Methods Recap Nonparametric Methods Recap Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010 Nonparametric Methods Kernel Density estimate (also Histogram) Weighted frequency Classification - K-NN Classifier Majority

More information

Radial Basis Function Networks

Radial Basis Function Networks Radial Basis Function Networks As we have seen, one of the most common types of neural network is the multi-layer perceptron It does, however, have various disadvantages, including the slow speed in learning

More information

arxiv: v2 [stat.ap] 14 Nov 2016

arxiv: v2 [stat.ap] 14 Nov 2016 The Cave of Shadows Addressing the human factor with generalized additive mixed models Harald Baayen a Shravan Vasishth b Reinhold Kliegl b Douglas Bates c arxiv:1511.03120v2 [stat.ap] 14 Nov 2016 a University

More information

Recent Developments in Model-based Derivative-free Optimization

Recent Developments in Model-based Derivative-free Optimization Recent Developments in Model-based Derivative-free Optimization Seppo Pulkkinen April 23, 2010 Introduction Problem definition The problem we are considering is a nonlinear optimization problem with constraints:

More information

Support Vector Regression with ANOVA Decomposition Kernels

Support Vector Regression with ANOVA Decomposition Kernels Support Vector Regression with ANOVA Decomposition Kernels Mark O. Stitson, Alex Gammerman, Vladimir Vapnik, Volodya Vovk, Chris Watkins, Jason Weston Technical Report CSD-TR-97- November 7, 1997!()+,

More information

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 Answer #1 and any five of the remaining six problems! possible score 1. Multiple Choice 25 2. Traveling Salesman Problem 15 3.

More information

arxiv: v1 [stat.ap] 14 Nov 2017

arxiv: v1 [stat.ap] 14 Nov 2017 How to estimate time-varying Vector Autoregressive Models? A comparison of two methods. arxiv:1711.05204v1 [stat.ap] 14 Nov 2017 Jonas Haslbeck Psychological Methods Group University of Amsterdam Lourens

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Overfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff

Overfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff Overfitting Machine Learning CSE546 Carlos Guestrin University of Washington October 2, 2013 1 Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias More complex class less bias More

More information

Package ibr. R topics documented: May 1, Version Date Title Iterative Bias Reduction

Package ibr. R topics documented: May 1, Version Date Title Iterative Bias Reduction Version 2.0-3 Date 2017-04-28 Title Iterative Bias Reduction Package ibr May 1, 2017 Author Pierre-Andre Cornillon, Nicolas Hengartner, Eric Matzner-Lober Maintainer ``Pierre-Andre Cornillon''

More information

Predicting housing price

Predicting housing price Predicting housing price Shu Niu Introduction The goal of this project is to produce a model for predicting housing prices given detailed information. The model can be useful for many purpose. From estimating

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions

More information

A Trimmed Translation-Invariant Denoising Estimator

A Trimmed Translation-Invariant Denoising Estimator A Trimmed Translation-Invariant Denoising Estimator Eric Chicken and Jordan Cuevas Florida State University, Tallahassee, FL 32306 Abstract A popular wavelet method for estimating jumps in functions is

More information

MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Local Methods and Bias Variance Trade-Off Lorenzo Rosasco UNIGE-MIT-IIT About this class 1. Introduce a basic class of learning methods, namely local methods. 2. Discuss the fundamental concept

More information

CSE446: Linear Regression. Spring 2017

CSE446: Linear Regression. Spring 2017 CSE446: Linear Regression Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Prediction of continuous variables Billionaire says: Wait, that s not what I meant! You say: Chill

More information

Local spatial-predictor selection

Local spatial-predictor selection University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2013 Local spatial-predictor selection Jonathan

More information