Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines. Robert J. Hill and Michael Scholz


Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines
Robert J. Hill and Michael Scholz
Department of Economics, University of Graz, Austria
OeNB Workshop, Vienna, 9th-10th October 2014
Supported by funds of the Oesterreichische Nationalbank (Anniversary Fund, project number: 14947)
Robert J. Hill and Michael Scholz OeNB Workshop, Oct 2014 1 / 25

Overview of the talk:
1. Hedonic House Price Indexes: Incorporating Geospatial Data; Taxonomy: Time Dummy, Average Characteristics, Hedonic Imputation
2. Model: GAM with Spline vs. Postcode/Region Dummies; Estimation of the Semiparametric Model
3. Data Set, Missing Characteristics, Results and Main Findings; Are Postcode/Region Based Indexes Downward Biased?
4. Conclusions

Introduction
Motivation:
Houses differ both in their physical characteristics and location.
Exact longitude and latitude of each house are now increasingly available in housing data sets.
How can we incorporate geospatial data (i.e., longitudes and latitudes) in a hedonic model of the housing market?
How much difference does it make?

Methods for Incorporating Geospatial Data in a Hedonic Index:
1. Distance to amenities (including the city center, nearest train station, shopping center, etc.) as additional characteristics
2. Spatial autoregressive models
3. A spline function (or some other nonparametric function)

A Taxonomy of Methods for Computing Hedonic House Price Indexes I
Time dummy method:
$$y = Z\beta + D\delta + \varepsilon$$
where $Z$ is a matrix of characteristics and $D$ is a matrix of time dummy variables.
Index: $P_t = \exp(\hat\delta_t)$, where $\hat\delta_t$ is the estimated coefficient on the period-$t$ dummy obtained from the hedonic model.
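The time dummy method can be illustrated with a minimal sketch in plain Python. It assumes one characteristic (bedrooms), three periods and synthetic noise-free data; the helper names `ols` and `time_dummy_index` are illustrative and not taken from the talk.

```python
import math

def ols(X, y):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        # Partial pivoting, then normalise the pivot row.
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

def time_dummy_index(log_prices, characteristic, periods, n_periods):
    """Return P_t = exp(delta_hat_t), with period 0 as the base (P_0 = 1)."""
    X = []
    for z, t in zip(characteristic, periods):
        dummies = [1.0 if t == s else 0.0 for s in range(1, n_periods)]
        X.append([1.0, z] + dummies)  # constant, characteristic, time dummies
    coef = ols(X, log_prices)
    return [1.0] + [math.exp(d) for d in coef[2:]]

# Synthetic data (an assumption): log price = 12 + 0.3 * bedrooms + delta_t.
true_delta = [0.0, 0.05, 0.12]
bedrooms = [2, 3, 4, 2, 3, 5, 1, 3, 4]
periods = [0, 0, 0, 1, 1, 1, 2, 2, 2]
log_prices = [12 + 0.3 * z + true_delta[t] for z, t in zip(bedrooms, periods)]
index = time_dummy_index(log_prices, bedrooms, periods, n_periods=3)
```

Because the synthetic data contain no noise, `index` recovers `exp(true_delta[t])` up to rounding. With real data a statistical package would normally replace the hand-written solver.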

A Taxonomy of Methods for Computing Hedonic House Price Indexes II
Average characteristics method:
$$\text{Laspeyres}: \quad P^{L}_{t,t+1} = \frac{\hat p_{t+1}(\bar z_t)}{\hat p_t(\bar z_t)} = \exp\left[\sum_{c=1}^{C} (\hat\beta_{c,t+1} - \hat\beta_{c,t})\, \bar z_{c,t}\right]$$
$$\text{Paasche}: \quad P^{P}_{t,t+1} = \frac{\hat p_{t+1}(\bar z_{t+1})}{\hat p_t(\bar z_{t+1})} = \exp\left[\sum_{c=1}^{C} (\hat\beta_{c,t+1} - \hat\beta_{c,t})\, \bar z_{c,t+1}\right]$$
where
$$\bar z_{c,t} = \frac{1}{H_t} \sum_{h=1}^{H_t} z_{c,t,h} \quad \text{and} \quad \bar z_{c,t+1} = \frac{1}{H_{t+1}} \sum_{h=1}^{H_{t+1}} z_{c,t+1,h}.$$
Note: Average characteristics methods cannot use geospatial data, since averaging longitudes and latitudes makes no sense.
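A small sketch of the average characteristics formulas, assuming a semilog model so that the price ratio reduces to the exponential of the coefficient differences weighted by the average characteristics. The shadow prices and houses are hypothetical values chosen only for illustration.

```python
import math

def column_means(rows):
    """Average characteristics vector z_bar over the houses sold in a period."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def average_characteristics_index(beta_t, beta_t1, z_bar):
    """exp( sum_c (beta_{c,t+1} - beta_{c,t}) * z_bar_c )."""
    return math.exp(sum((b1 - b0) * z for b0, b1, z in zip(beta_t, beta_t1, z_bar)))

# Hypothetical shadow prices for (constant, bedrooms, bathrooms).
beta_t = [12.00, 0.30, 0.20]
beta_t1 = [12.05, 0.31, 0.19]
houses_t = [[1, 3, 1], [1, 4, 2], [1, 2, 1]]    # houses sold in period t
houses_t1 = [[1, 3, 2], [1, 5, 3], [1, 4, 1]]   # houses sold in period t+1

laspeyres = average_characteristics_index(beta_t, beta_t1, column_means(houses_t))
paasche = average_characteristics_index(beta_t, beta_t1, column_means(houses_t1))
```

The Laspeyres version evaluates both periods' models at the period-t average house, the Paasche version at the period-(t+1) average house.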

A Taxonomy of Methods for Computing Hedonic House Price Indexes III
Hedonic imputation method (with single imputation):
$$\text{Paasche}: \quad P^{PSI}_{t,t+1} = \prod_{h=1}^{H_{t+1}} \left( \frac{p_{t+1,h}}{\hat p_{t,h}(z_{t+1,h})} \right)^{1/H_{t+1}}$$
$$\text{Laspeyres}: \quad P^{LSI}_{t,t+1} = \prod_{h=1}^{H_t} \left( \frac{\hat p_{t+1,h}(z_{t,h})}{p_{t,h}} \right)^{1/H_t}$$
$$\text{Fisher}: \quad P^{FSI}_{t,t+1} = \sqrt{P^{PSI}_{t,t+1} \times P^{LSI}_{t,t+1}}$$
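The single imputation indexes are geometric means of price relatives, which can be sketched as follows. The prices are hypothetical, and the imputed prices would in practice come from the fitted hedonic model of the other period.

```python
import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def paasche_si(actual_t1, imputed_t):
    """Houses sold in t+1: actual price in t+1 over imputed price in t."""
    return geometric_mean([p / p_hat for p, p_hat in zip(actual_t1, imputed_t)])

def laspeyres_si(imputed_t1, actual_t):
    """Houses sold in t: imputed price in t+1 over actual price in t."""
    return geometric_mean([p_hat / p for p_hat, p in zip(imputed_t1, actual_t)])

def fisher_si(paasche, laspeyres):
    return math.sqrt(paasche * laspeyres)

# Hypothetical prices in which every price relative equals 1.1.
paasche = paasche_si(actual_t1=[660000.0, 440000.0], imputed_t=[600000.0, 400000.0])
laspeyres = laspeyres_si(imputed_t1=[550000.0, 880000.0], actual_t=[500000.0, 800000.0])
fisher = fisher_si(paasche, laspeyres)
```

Since only imputed prices enter, not averaged characteristics, this construction can use a model in which location enters through longitude and latitude.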

Our Models I
(i) semilog with geospatial spline:
$$y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon \quad (1)$$
(ii)/(iii) semilog with postcode/region dummies:
$$y = Z\beta + D\delta + \varepsilon \quad (2)$$
$y$ is an $H \times 1$ vector of log-prices.
$Z$ is an $H \times C$ matrix of physical characteristics (including a constant and quarterly dummies).
$g(z_{lat}, z_{long})$ is the geospatial spline function defined on the longitudes and latitudes.
$D$ is a matrix of postcode or region dummies.

Our Models II
(i) semilog with geospatial spline:
$$y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon \quad (1)$$
(ii)/(iii) semilog with postcode/region dummies:
$$y = Z\beta + D\delta + \varepsilon \quad (2)$$
The parameters to be estimated in (i) are the $C \times 1$ vector of characteristic shadow prices $\beta$ and the geospatial spline surface $g(z_{lat}, z_{long})$.
The parameters to be estimated in (ii) or (iii) are the $C \times 1$ vector of characteristic shadow prices $\beta$ and the $B \times 1$ vector of postcode or region shadow prices $\delta$.
We consider 16 regions and 242 postcodes, so on average there are 15 postcodes in a region.

Estimation of the Semiparametric Model I
Estimation:
The semiparametric model is estimated using the bam function (Big Additive Models) in the mgcv package in R.
For the nonparametric part we use an approximation to a thin plate spline (TPS), the thin plate regression spline (TPRS) developed by Wood (2003), defined on the longitudes and latitudes.
In our context, TPRSs have two advantages over other splines:
  It is not necessary to specify knot locations.
  The spline surface can be estimated as a function of two explanatory variables.
With a full TPS the computational burden rises rapidly as the number of data points increases.

Estimation of the Semiparametric Model II
The approximation reduces the computational burden for any given level of fit.
It is necessary to select a value for the parameter $k$, which determines how closely our spline approximates a TPS:
  A lower value of $k$ has a lower computational burden but leads to a worse fit.
  We choose a value of $k$ where the gains from increasing it further (in terms of better fit) are low.
The smoothing parameter $\lambda$ determines the smoothness of the spline surface itself.
The algorithm selects $\lambda$ using restricted maximum likelihood (REML), where the likelihood function is maximized across all the parameters of the semiparametric model.

Our Data Set
The data: Sydney, Australia, from 2001 to 2011.
Our characteristics are:
  Transaction price + exact date of sale
  Physical: Number of bedrooms, Number of bathrooms, Land area
  Location: Postcode, Longitude, Latitude
Some (physical) characteristics are missing for some houses.
There are more gaps in the data in the earlier years in our sample.
We have a total of 454,567 transactions.
All characteristics are available for only 240,142 of these transactions.

Dealing with Missing Characteristics
We impute the price of each house from the model below that has exactly the same mix of characteristics:
(HM1): $y = f(d, z_1, z_2, z_3, loc)$
(HM2): $y = f(d, z_2, z_3, loc)$
(HM3): $y = f(d, z_1, z_3, loc)$
(HM4): $y = f(d, z_1, z_2, loc)$
(HM5): $y = f(d, z_3, loc)$
(HM6): $y = f(d, z_2, loc)$
(HM7): $y = f(d, z_1, loc)$
(HM8): $y = f(d, loc)$
For all obs.: $y$ = log(price), $d$ = quarter dummy, $loc$ = location
Not for all obs.: $z_1$ = land area, $z_2$ = number of bedrooms, $z_3$ = number of bathrooms
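The matching of a house to the model with the same mix of characteristics can be sketched as a lookup on the set of observed characteristics. The dictionary keys `z1`, `z2`, `z3` and the function name `choose_model` are illustrative assumptions; the labels follow the order HM1 to HM8 listed on the slide.

```python
# z1 = land area, z2 = number of bedrooms, z3 = number of bathrooms.
MODEL_BY_PATTERN = {
    frozenset({"z1", "z2", "z3"}): "HM1",
    frozenset({"z2", "z3"}): "HM2",
    frozenset({"z1", "z3"}): "HM3",
    frozenset({"z1", "z2"}): "HM4",
    frozenset({"z3"}): "HM5",
    frozenset({"z2"}): "HM6",
    frozenset({"z1"}): "HM7",
    frozenset(): "HM8",
}

def choose_model(house):
    """Return the label of the model using exactly the observed characteristics."""
    observed = frozenset(k for k in ("z1", "z2", "z3") if house.get(k) is not None)
    return MODEL_BY_PATTERN[observed]

# A hypothetical house with land area and bathrooms but no bedroom count.
label = choose_model({"z1": 550.0, "z2": None, "z3": 2})
```

Quarter dummies and location are available for every observation, so they do not enter the lookup.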

Comparing the Performance of Our Models I
Different performance measures:
Akaike information criterion (AIC): trade-off between the goodness of fit and the complexity of the model.
Sum of squared log errors:
$$SSLE_t = \frac{1}{H_t} \sum_{h=1}^{H_t} \left[ \ln(\hat p_{th} / p_{th}) \right]^2$$
Repeat-sales as a benchmark:
$$Z^{SI}_h = \frac{\text{Actual Price Relative}}{\text{Imputed Price Relative}} = \frac{p_{t+k,h} / p_{th}}{\sqrt{\dfrac{p_{t+k,h}}{\hat p_{th}} \times \dfrac{\hat p_{t+k,h}}{p_{th}}}} = \sqrt{\frac{p_{t+k,h}}{\hat p_{t+k,h}} \times \frac{\hat p_{th}}{p_{th}}}$$
$$D^{SI} = \frac{1}{H} \sum_{h=1}^{H} \left[ \ln(Z^{SI}_h) \right]^2$$
The spline model significantly outperforms its postcode counterpart.
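The two error measures can be sketched as below. The sketch assumes that the imputed price relative of a repeat-sales house is the Fisher-type single imputation relative, i.e. the geometric mean of `p_tk / p_hat_t` and `p_hat_tk / p_t`; the function and argument names are illustrative.

```python
import math

def ssle(actual, imputed):
    """Mean squared log error of imputed against actual prices in one period."""
    return sum(math.log(p_hat / p) ** 2 for p, p_hat in zip(actual, imputed)) / len(actual)

def z_si(p_t, p_tk, p_hat_t, p_hat_tk):
    """Actual price relative divided by the (assumed) imputed price relative."""
    actual_relative = p_tk / p_t
    imputed_relative = math.sqrt((p_tk / p_hat_t) * (p_hat_tk / p_t))
    return actual_relative / imputed_relative

def d_si(repeat_sales):
    """repeat_sales: iterable of (p_t, p_tk, p_hat_t, p_hat_tk) tuples."""
    sales = list(repeat_sales)
    return sum(math.log(z_si(*s)) ** 2 for s in sales) / len(sales)

# Hypothetical repeat sale: the period-t price is over-imputed by 10 percent.
z = z_si(p_t=100.0, p_tk=121.0, p_hat_t=110.0, p_hat_tk=121.0)
```

A perfect imputation gives a ratio of one and hence a zero contribution to the sum, so smaller values indicate imputed price relatives closer to the observed repeat-sales relatives.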

Comparing the Performance of Our Models II

Table: Akaike information criterion (restricted data set)
        2001   2002   2003   2004   2005    2006    2007    2008    2009    2010    2011
(i)     -452   -203  -1448  -2471  -9638  -10644  -14052  -13980  -20436  -18857  -23659
(ii)    1321   1515    493    158  -4930   -4384   -6372   -8070  -12506  -11857  -16522
(iii)   3463   3807   3650   3634   4841    7970   11996    8583   14842   16980    5223

Table: Sum of squared log errors (restricted data set)
        2001   2002   2003   2004   2005   2006   2007   2008   2009   2010   2011
(i)     0.048  0.050  0.044  0.040  0.036  0.038  0.037  0.034  0.032  0.032  0.028
(ii)    0.068  0.070  0.059  0.057  0.046  0.049  0.048  0.043  0.041  0.040  0.035
(iii)   0.105  0.108  0.093  0.089  0.073  0.079  0.084  0.079  0.089  0.098  0.068

Table: Sum of squared log price relative errors $D^{SI}$
        Restricted   Full
(i)     0.016802     0.036523
(ii)    0.016857     0.038863
(iii)   0.029087     0.052078

Figure: Spline Surface Based on Restricted Data Set for 2007

Figure: Price Indexes Calculated on the Restricted Data Set. Series: region LM, long/lat GAM, postcode (pc) LM, median, repeat sales; x-axis: years.

Figure: Price Indexes Calculated on the Full Data Set. Series: region LM, long/lat GAM, postcode (pc) LM, median, repeat sales; x-axis: years.

Main Findings
The price index rises more when locational effects are captured using a geospatial spline, rather than postcode or region dummies.
The gap between the spline and postcode based indexes is small (about 0.5 percent over 11 years).
The gap between the spline and region based indexes is larger (about 4.2 percent over 11 years).
The full-sample spline-based price index rises 6.5 percent more over 11 years than its restricted data set counterpart.
The median index is dramatically different when the full data set is used.
The gap between spline and postcode/region hedonic price indexes is smaller when the full data set is used.

Are Postcode/Region Based Indexes Downward Biased? I
A downward bias can arise when the locations of sold houses in a postcode or region get worse over time.
Our test procedure:
1. Choose a postcode.
2. Calculate the mean number of bedrooms, bathrooms, land area and quarter of sale over the 11 years for that postcode.
3. Impute the price of this average house in every location in which a house actually sold in 2001,...,2011 in that postcode (using the spline model of year 2001).
4. Take the geometric mean of these imputed prices for each year.
5. Repeat for another postcode.
6. Take the geometric mean across postcodes in each year.
7. Repeat steps 3-6 using the spline of year 2002, the spline of 2003, etc.
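Steps 3 to 6 of the test procedure can be sketched as follows for one reference-year spline. The callable `impute_log_price` stands in for the fitted spline model evaluated at the postcode's average house (step 2); it, the toy price surface and the postcode labels "A" and "B" are hypothetical.

```python
import math

def location_quality_series(sales_by_postcode, impute_log_price):
    """sales_by_postcode: {postcode: {year: [(lat, lon), ...]}}.
    impute_log_price(postcode, lat, lon): log price of that postcode's
    average house at the given location under one reference-year spline."""
    years = sorted({y for by_year in sales_by_postcode.values() for y in by_year})
    series = {}
    for year in years:
        postcode_log_means = []
        for postcode, by_year in sales_by_postcode.items():
            locations = by_year.get(year, [])
            if not locations:
                continue
            # Steps 3-4: impute at every sold location, average in logs.
            logs = [impute_log_price(postcode, lat, lon) for lat, lon in locations]
            postcode_log_means.append(sum(logs) / len(logs))
        # Step 6: geometric mean across postcodes.
        series[year] = math.exp(sum(postcode_log_means) / len(postcode_log_means))
    return series

# Toy surface (an assumption): prices fall with distance from longitude 151.20.
def toy_surface(postcode, lat, lon):
    return 13.0 - 0.5 * abs(lon - 151.20)

sales = {
    "A": {2001: [(-33.87, 151.20), (-33.87, 151.22)],
          2002: [(-33.87, 151.24), (-33.87, 151.26)]},
    "B": {2001: [(-33.90, 151.20)], 2002: [(-33.90, 151.20)]},
}
series = location_quality_series(sales, toy_surface)
```

Step 7 would repeat the call with the imputation function of each other reference year. A series that falls over time indicates that sold houses have moved to worse locations within postcodes.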

Are Postcode Based Indexes Downward Biased? II
Findings:
The geometric means from step 6 fall over time irrespective of which year's spline is used as the reference.
Most of the fall occurs in the first half of the sample.
The fall is bigger for regions than postcodes.

Figure: Evidence of Bias in the Postcode-Based Price Indexes. Series: reference splines of 2001, 2006 and 2011; x-axis: years; y-axis range about 0.980 to 1.000.

Figure: Evidence of Bias in the Region-Based Price Indexes. Series: reference splines of 2001, 2006 and 2011; x-axis: years; y-axis range about 0.88 to 1.00.

Conclusions
Splines (or some other nonparametric method), when combined with the hedonic imputation method, provide a flexible way of incorporating geospatial data into a house price index.
In our data set, postcode/region based indexes seem to have a downward bias, since they fail to account for a general shift over time in houses sold to worse locations in each postcode/region.
The bias is negligible with postcode dummies but not so with region dummies.
For a city with postcodes as finely defined as Sydney (about 14,300 residents and 7.39 square kilometers per postcode), postcode dummies do a good job of controlling for locational effects.
It is important to use the full data set, and not just observations with no missing characteristics.

Thank you for your attention!

GAM
We estimate the Generalized Additive Model
$$y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon$$
with a thin plate regression spline (tprs), Wood (2003), an optimal low rank approximation of a thin plate spline.
A tprs uses far fewer coefficients than a full spline: it is computationally efficient while losing little statistical performance.
We use the bam function (Big Additive Models) from the R package mgcv.
We can avoid certain problems in spline smoothing:
  Choice of knot location and basis
  Smooths of more than one predictor

Thin plate spline I
Thin plate spline smoothing problem, Duchon (1977):
Estimate the smooth function $g$ of the $d$-vector $x$ from $n$ observations such that $y_i = g(x_i) + \varepsilon_i$, by finding the function $\hat f$ that minimizes the penalized problem
$$\|y - f\|^2 + \lambda J_{md}(f), \quad (3)$$
where $y = (y_1, \ldots, y_n)'$, $f = (f(x_1), \ldots, f(x_n))'$, $\lambda$ is a smoothing parameter, and $J_{md}(f)$ is a penalty function measuring the wiggliness of $f$:
$$J_{md} = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \cdots + \nu_d = m} \frac{m!}{\nu_1! \cdots \nu_d!} \left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2 dx_1 \cdots dx_d$$

Thin plate spline II
It can be shown that the solution of (3) has the form
$$\hat f(x) = \sum_{i=1}^{n} \delta_i \, \eta_{md}(\|x - x_i\|) + \sum_{j=1}^{M} \alpha_j \phi_j(x), \quad (4)$$
where $\delta_i$ and $\alpha_j$ are coefficients to be estimated, with $\delta$ such that $T'\delta = 0$ and $T_{ij} = \phi_j(x_i)$.
The $M = \binom{m+d-1}{d}$ functions $\phi_j$ are linearly independent polynomials spanning the space of polynomials in $\mathbb{R}^d$ of degree less than $m$ (i.e. the null space of $J_{md}$).
$$\eta_{md}(r) = \begin{cases} \dfrac{(-1)^{m+1+d/2}}{2^{2m-1} \pi^{d/2} (m-1)! \, (m-d/2)!} \, r^{2m-d} \log(r) & d \text{ even} \\[2ex] \dfrac{\Gamma(d/2 - m)}{2^{2m} \pi^{d/2} (m-1)!} \, r^{2m-d} & d \text{ odd} \end{cases}$$
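The basis function and the null space dimension can be evaluated directly, as in the sketch below, which follows the thin plate spline definitions in Wood (2003). It assumes $r > 0$ and $2m > d$; the function names are illustrative. For the geospatial case ($d = 2$, $m = 2$) the basis function reduces to $r^2 \log(r) / (8\pi)$.

```python
import math

def eta(m, d, r):
    """Thin plate spline basis function eta_md(r), for r > 0 and 2m > d."""
    if d % 2 == 0:
        const = ((-1) ** (m + 1 + d // 2)
                 / (2 ** (2 * m - 1) * math.pi ** (d / 2)
                    * math.factorial(m - 1) * math.factorial(m - d // 2)))
        return const * r ** (2 * m - d) * math.log(r)
    const = (math.gamma(d / 2 - m)
             / (2 ** (2 * m) * math.pi ** (d / 2) * math.factorial(m - 1)))
    return const * r ** (2 * m - d)

def null_space_dimension(m, d):
    """M = C(m + d - 1, d), the number of polynomial terms phi_j."""
    return math.comb(m + d - 1, d)

# Two covariates (latitude, longitude) with second-order derivatives penalised.
value = eta(m=2, d=2, r=1.5)
M = null_space_dimension(m=2, d=2)
```

With $d = 2$ and $m = 2$ the unpenalised part has $M = 3$ terms, namely a constant and the two linear terms in the coordinates.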

Figure: Rank 15 Eigen Approx to 2D thin plate spline (Wood, 2006)

Thin plate spline III
With the matrix $E$ defined by $E_{ij} = \eta_{md}(\|x_i - x_j\|)$, the thin plate spline fitting problem is now the minimization of
$$\|y - E\delta - T\alpha\|^2 + \lambda \delta' E \delta \quad \text{s.t.} \quad T'\delta = 0 \quad (5)$$
with respect to $\delta$ and $\alpha$.
Ideal smoother: Exact weight to the conflicting goals of matching the data and making $f$ smooth.
Disadvantage: As many parameters as there are data, i.e. $O(n^3)$ calculations.
More on thin plate splines: Wahba (1990) or Green and Silverman (1994).

Thin plate regression spline I
The computational burden can be reduced with the use of a low rank approximation, Wood (2003): a parameter space basis that perturbs (5) as little as possible.
The basic idea: Truncation of the space of the wiggly components of the spline (with parameter $\delta$), while leaving the $\alpha$-components unchanged.
$E = UDU'$ is the eigen-decomposition of $E$.
Take an appropriate submatrix $D_k$ of $D$ and the corresponding $U_k$, restricting $\delta$ to the column space of $U_k$, i.e. $\delta = U_k \delta_k$, and minimize
$$\|y - U_k D_k \delta_k - T\alpha\|^2 + \lambda \delta_k' D_k \delta_k \quad \text{s.t.} \quad T' U_k \delta_k = 0 \quad (6)$$
with respect to $\delta_k$ and $\alpha$.

Thin plate regression spline II
The computational cost is reduced from $O(n^3)$ to $O(k^3)$.
Remaining problem: Find $U_k$ and $D_k$ sufficiently cheaply. Wood (2003) proposes the use of the Lanczos method (Demmel (1997)), which allows the calculation at the substantially lower cost of $O(n^2 k)$ operations.
Smoothing parameter selection: Laplace approximation to obtain an approximate REML which is suitable for efficient direct optimization and computationally stable, Wood (2011).

Thin plate regression spline III
In practice:
1. Choose degree of approximation $k = 600$ (tradeoff: oversmoothing vs. computational burden).
2. Construct the tprs-basis from a randomly chosen sample of 2500 data points. The locations of these observations depend on the locational distribution of house sales in that period.
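As a rough back-of-the-envelope illustration of the orders of magnitude involved, the operation counts for the chosen values can be compared directly. This ignores all constants hidden in the big-O notation, so the numbers indicate relative scale only, not actual run times.

```python
n = 2500   # sample points used to build the basis (from the slide)
k = 600    # basis dimension (from the slide)

full_tps_cost = n ** 3      # order of fitting a full thin plate spline
lanczos_cost = n ** 2 * k   # order of finding U_k and D_k by the Lanczos method
tprs_fit_cost = k ** 3      # order of fitting with the truncated basis

saving_basis = full_tps_cost / lanczos_cost   # equals n / k
saving_fit = full_tps_cost / tprs_fit_cost    # equals (n / k) ** 3
```

Under these assumptions the truncated eigen-decomposition is about 4 times cheaper in order terms than the full problem on the 2500-point sample, and the fit with the truncated basis about 72 times cheaper.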

Figure: Sum of squared errors $D^{SI}$ of the price relatives and computational time for different basis dimensions. Axes: sse of the price relatives, computational time; x-axis: number of basis dimensions.

Figure: AIC distribution and computational time for 100 repetitions of the y2011 fit, based on different numbers of randomly chosen observations. Axes: range of AIC for 100 repetitions, computational time; x-axis: number of randomly chosen observations for basis construction.