Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines

Robert J. Hill and Michael Scholz
Department of Economics, University of Graz, Austria

OeNB Workshop, Vienna, 9th-10th October 2014

Supported by funds of the Oesterreichische Nationalbank (Anniversary Fund, project number: 14947)

Robert J. Hill and Michael Scholz OeNB Workshop, Oct 2014 1 / 25

Overview of the talk

1. Hedonic House Price Indexes
   - Incorporating Geospatial Data in a Hedonic Index
   - Taxonomy: Time Dummy, Average Characteristics, Hedonic Imputation
2. Models
   - GAM with Spline vs. Postcode/Region Dummies
   - Estimation of the Semiparametric Model
3. Data Set, Missing Characteristics, Results and Main Findings
   - Are Postcode/Region Based Indexes Downward Biased?
4. Conclusions

Introduction

Motivation:
- Houses differ both in their physical characteristics and in their location.
- The exact longitude and latitude of each house are increasingly available in housing data sets.
- How can we incorporate geospatial data (i.e., longitudes and latitudes) in a hedonic model of the housing market?
- How much difference does it make?

Methods for Incorporating Geospatial Data in a Hedonic Index

1. Distance to amenities (the city center, the nearest train station, the nearest shopping center, etc.) as additional characteristics
2. Spatial autoregressive models
3. A spline function (or some other nonparametric function)

A Taxonomy of Methods for Computing Hedonic House Price Indexes I

Time dummy method

  $y = Z\beta + D\delta + \varepsilon$

where $Z$ is a matrix of characteristics and $D$ is a matrix of time dummy variables.

Index: $P_t = \exp(\hat{\delta}_t)$, where $\hat{\delta}_t$ is the estimated coefficient on the period-$t$ dummy in the hedonic model.

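As a minimal sketch of the index formula above, the following Python snippet turns a list of estimated time-dummy coefficients into index levels. The coefficient values and the function name `time_dummy_index` are hypothetical and only illustrate the arithmetic; they are not results from the talk.

```python
import math

def time_dummy_index(delta_hat):
    """Convert estimated time-dummy coefficients into index levels.

    delta_hat holds the coefficients for periods 1..T. The base period
    has no dummy, so its index level is exp(0) = 1.
    """
    return [1.0] + [math.exp(d) for d in delta_hat]

# Hypothetical coefficients, for illustration only.
index = time_dummy_index([0.05, 0.12, 0.10])
```

The first entry is the base period; each later entry is the price level relative to that base.
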
A Taxonomy of Methods for Computing Hedonic House Price Indexes II

Average characteristics method

  Laspeyres: $P^{L}_{t,t+1} = \frac{\hat{p}_{t+1}(\bar{z}_t)}{\hat{p}_t(\bar{z}_t)} = \exp\left[\sum_{c=1}^{C} (\hat{\beta}_{c,t+1} - \hat{\beta}_{c,t})\, \bar{z}_{c,t}\right]$

  Paasche: $P^{P}_{t,t+1} = \frac{\hat{p}_{t+1}(\bar{z}_{t+1})}{\hat{p}_t(\bar{z}_{t+1})} = \exp\left[\sum_{c=1}^{C} (\hat{\beta}_{c,t+1} - \hat{\beta}_{c,t})\, \bar{z}_{c,t+1}\right]$

where

  $\bar{z}_{c,t} = \frac{1}{H_t} \sum_{h=1}^{H_t} z_{c,t,h}$  and  $\bar{z}_{c,t+1} = \frac{1}{H_{t+1}} \sum_{h=1}^{H_{t+1}} z_{c,t+1,h}$.

Note: Average characteristics methods cannot use geospatial data, since averaging longitudes and latitudes makes no sense.

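The Laspeyres and Paasche formulas above can be sketched in a few lines of Python. All numbers below (characteristic vectors and shadow prices) are hypothetical, and the helper names `mean_characteristics` and `avg_char_index` are introduced here only for illustration.

```python
import math

def mean_characteristics(houses):
    """Column means of a list of characteristic vectors (one vector per house)."""
    n = len(houses)
    return [sum(col) / n for col in zip(*houses)]

def avg_char_index(beta_t, beta_t1, z_bar):
    """exp of the sum over c of (beta_{c,t+1} - beta_{c,t}) * z_bar_c."""
    return math.exp(sum((b1 - b0) * z for b0, b1, z in zip(beta_t, beta_t1, z_bar)))

# Hypothetical data: each vector is [constant, bedrooms, bathrooms].
houses_t = [[1, 3, 1], [1, 4, 2]]
houses_t1 = [[1, 2, 1], [1, 3, 2], [1, 4, 3]]
beta_t = [12.00, 0.10, 0.08]   # hypothetical shadow prices in period t
beta_t1 = [12.05, 0.11, 0.07]  # hypothetical shadow prices in period t+1

# Laspeyres uses the period-t average house, Paasche the period-(t+1) one.
laspeyres = avg_char_index(beta_t, beta_t1, mean_characteristics(houses_t))
paasche = avg_char_index(beta_t, beta_t1, mean_characteristics(houses_t1))
```

The two indexes differ only in which period's average house is priced, which is why the method needs characteristics that can be meaningfully averaged.
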
A Taxonomy of Methods for Computing Hedonic House Price Indexes III

Hedonic imputation method

  Paasche: $P^{PSI}_{t,t+1} = \prod_{h=1}^{H_{t+1}} \left[\frac{p_{t+1,h}}{\hat{p}_{t,h}(z_{t+1,h})}\right]^{1/H_{t+1}}$

  Laspeyres: $P^{LSI}_{t,t+1} = \prod_{h=1}^{H_t} \left[\frac{\hat{p}_{t+1,h}(z_{t,h})}{p_{t,h}}\right]^{1/H_t}$

  Fisher: $P^{FSI}_{t,t+1} = \sqrt{P^{PSI}_{t,t+1} \times P^{LSI}_{t,t+1}}$

(SI = single imputation)

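A minimal Python sketch of the single-imputation indexes, assuming the imputed prices have already been produced by some fitted hedonic model. The price lists are hypothetical and the function names are illustrative, not part of the authors' code.

```python
import math

def geo_mean(values):
    """Geometric mean, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def laspeyres_si(p_t, p_hat_t1):
    """Houses sold in t: imputed period-(t+1) price over actual period-t price."""
    return geo_mean([ph / p for p, ph in zip(p_t, p_hat_t1)])

def paasche_si(p_t1, p_hat_t):
    """Houses sold in t+1: actual period-(t+1) price over imputed period-t price."""
    return geo_mean([p / ph for p, ph in zip(p_t1, p_hat_t)])

def fisher_si(laspeyres, paasche):
    """Geometric mean of the Laspeyres and Paasche indexes."""
    return math.sqrt(laspeyres * paasche)

# Hypothetical prices, for illustration only.
laspeyres = laspeyres_si(p_t=[100.0, 200.0], p_hat_t1=[110.0, 220.0])
paasche = paasche_si(p_t1=[150.0, 300.0], p_hat_t=[125.0, 250.0])
fisher = fisher_si(laspeyres, paasche)
```

Because each house is priced at its own characteristics and location, this method can use longitude and latitude directly.
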
Our Models I

(i) Semilog with geospatial spline:

  $y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon$   (1)

(ii)/(iii) Semilog with postcode/region dummies:

  $y = Z\beta + D\delta + \varepsilon$   (2)

- $y$ is an $H \times 1$ vector of log-prices.
- $Z$ is an $H \times C$ matrix of physical characteristics (including a constant and quarterly dummies).
- $g(z_{lat}, z_{long})$ is the geospatial spline function defined on the longitudes and latitudes.
- $D$ is a matrix of postcode or region dummies.

Our Models II

(i) Semilog with geospatial spline:

  $y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon$   (1)

(ii)/(iii) Semilog with postcode/region dummies:

  $y = Z\beta + D\delta + \varepsilon$   (2)

- The parameters to be estimated in (i) are the $C \times 1$ vector of characteristic shadow prices $\beta$ and the geospatial spline surface $g(z_{lat}, z_{long})$.
- The parameters to be estimated in (ii) or (iii) are the $C \times 1$ vector of characteristic shadow prices $\beta$ and the $B \times 1$ vector of postcode or region shadow prices $\delta$.
- We consider 16 regions and 242 postcodes, so on average there are about 15 postcodes in a region.

Estimation of the Semiparametric Model I

Estimation:
- The semiparametric model is estimated using the bam function (Big Additive Models) in the mgcv package in R.
- For the nonparametric part we use a thin plate regression spline (TPRS), an approximation to a thin plate spline (TPS) developed by Wood (2003), defined on the longitudes and latitudes.
- In our context, TPRSs have two advantages over other splines:
  - It is not necessary to specify knot locations.
  - The spline surface can be estimated as a function of two explanatory variables.
- With a full TPS the computational burden rises rapidly as the number of data points increases.

Estimation of the Semiparametric Model II

- The approximation reduces the computational burden for any given level of fit.
- It is necessary to select a value for the parameter $k$, which determines how closely our spline approximates a TPS:
  - A lower value of $k$ has a lower computational burden but leads to a worse fit.
  - We choose a value of $k$ beyond which the gains in fit from increasing it further are small.
- The smoothing parameter $\lambda$ determines the smoothness of the spline surface itself.
- The algorithm selects $\lambda$ using restricted maximum likelihood (REML), where the likelihood function is maximized across all the parameters of the semiparametric model.

Our Data Set

- The data cover Sydney, Australia, from 2001 to 2011.
- Our characteristics are:
  - Transaction price and exact date of sale
  - Physical: number of bedrooms, number of bathrooms, land area
  - Location: postcode, longitude, latitude
- Some (physical) characteristics are missing for some houses.
- There are more gaps in the data in the earlier years of our sample.
- We have a total of 454,567 transactions. All characteristics are available for only 240,142 of these transactions.

Dealing with Missing Characteristics

We impute the price of each house from the model below that has exactly the same mix of characteristics:

  (HM1): y = f(d, z_1, z_2, z_3, loc)
  (HM2): y = f(d, z_2, z_3, loc)
  (HM3): y = f(d, z_1, z_3, loc)
  (HM4): y = f(d, z_1, z_2, loc)
  (HM5): y = f(d, z_3, loc)
  (HM6): y = f(d, z_2, loc)
  (HM7): y = f(d, z_1, loc)
  (HM8): y = f(d, loc)

- Available for all observations: y = log(price), d = quarter dummy, loc = location
- Not available for all observations: z_1 = land area, z_2 = number of bedrooms, z_3 = number of bathrooms

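The model-selection rule above can be sketched as a lookup from the set of observed characteristics to the model label. This is an illustrative Python sketch: the dictionary keys `land_area`, `bedrooms` and `bathrooms`, the use of `None` for a missing value, and the function name `choose_model` are assumptions made here, not details from the talk.

```python
# Map each set of available physical characteristics to the model label
# HM1..HM8 listed on the slide (z1 = land area, z2 = bedrooms, z3 = bathrooms).
MODELS = {
    frozenset({"z1", "z2", "z3"}): "HM1",
    frozenset({"z2", "z3"}): "HM2",
    frozenset({"z1", "z3"}): "HM3",
    frozenset({"z1", "z2"}): "HM4",
    frozenset({"z3"}): "HM5",
    frozenset({"z2"}): "HM6",
    frozenset({"z1"}): "HM7",
    frozenset(): "HM8",
}

# Hypothetical field names for the three physical characteristics.
FIELDS = (("z1", "land_area"), ("z2", "bedrooms"), ("z3", "bathrooms"))

def choose_model(house):
    """Return the label of the model matching the house's observed characteristics.

    A characteristic counts as missing when its value is None or absent.
    """
    available = frozenset(z for z, key in FIELDS if house.get(key) is not None)
    return MODELS[available]

# A house with land area and bedrooms but no bathroom count uses HM4.
label = choose_model({"land_area": 550.0, "bedrooms": 3, "bathrooms": None})
```

With three characteristics that may each be missing there are 2^3 = 8 patterns, which is why eight models are needed.
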
Comparing the Performance of Our Models I

Different performance measures:

- Akaike information criterion (AIC): trade-off between the goodness of fit and the complexity of the model
- Sum of squared log errors:

  $SSLE_t = \frac{1}{H_t} \sum_{h=1}^{H_t} \left[\ln(\hat{p}_{th}/p_{th})\right]^2$

- Repeat sales as a benchmark:

  $Z^{SI}_h$ = Actual Price Relative / Imputed Price Relative

  $Z^{SI}_h = \frac{p_{t+k,h}}{p_{th}} \Big/ \sqrt{\frac{\hat{p}_{t+k,h}}{p_{th}} \times \frac{p_{t+k,h}}{\hat{p}_{th}}} = \sqrt{\frac{p_{t+k,h}}{\hat{p}_{t+k,h}} \times \frac{\hat{p}_{th}}{p_{th}}}$

  $D^{SI} = \frac{1}{H} \sum_{h=1}^{H} \left[\ln(Z^{SI}_h)\right]^2$

The spline model significantly outperforms its postcode counterpart.

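The two error measures can be sketched in Python as follows. The sketch takes the imputed price relative for a repeat sale to be the geometric mean of the two single-imputation relatives; that reading of the definition of Z^SI is an assumption, and all function names and numbers are hypothetical.

```python
import math

def ssle(p_hat, p):
    """Mean of squared log errors between imputed and actual prices in one period."""
    return sum(math.log(ph / a) ** 2 for ph, a in zip(p_hat, p)) / len(p)

def z_si(p_t, p_tk, p_hat_t, p_hat_tk):
    """Actual price relative divided by the imputed price relative.

    Assumption: the imputed relative is the geometric mean of
    p_hat_tk / p_t and p_tk / p_hat_t.
    """
    actual = p_tk / p_t
    imputed = math.sqrt((p_hat_tk / p_t) * (p_tk / p_hat_t))
    return actual / imputed

def d_si(repeat_sales):
    """Mean squared log of Z over repeat sales (p_t, p_tk, p_hat_t, p_hat_tk)."""
    zs = [z_si(*sale) for sale in repeat_sales]
    return sum(math.log(z) ** 2 for z in zs) / len(zs)

# Hypothetical repeat sale: sold at 100, resold at 121, imputed at 100 both times.
z_example = z_si(p_t=100.0, p_tk=121.0, p_hat_t=100.0, p_hat_tk=100.0)
```

A model whose imputations track both sale prices of a repeat-sale house gives Z close to 1, so a smaller D^SI indicates better agreement with the repeat-sales benchmark.
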
Comparing the Performance of Our Models II

Table: Akaike information criterion (restricted data set)

          2001    2002    2003    2004    2005    2006    2007    2008    2009    2010    2011
(i)       -452    -203   -1448   -2471   -9638  -10644  -14052  -13980  -20436  -18857  -23659
(ii)      1321    1515     493     158   -4930   -4384   -6372   -8070  -12506  -11857  -16522
(iii)     3463    3807    3650    3634    4841    7970   11996    8583   14842   16980    5223

Table: Sum of squared log errors (restricted data set)

          2001    2002    2003    2004    2005    2006    2007    2008    2009    2010    2011
(i)      0.048   0.050   0.044   0.040   0.036   0.038   0.037   0.034   0.032   0.032   0.028
(ii)     0.068   0.070   0.059   0.057   0.046   0.049   0.048   0.043   0.041   0.040   0.035
(iii)    0.105   0.108   0.093   0.089   0.073   0.079   0.084   0.079   0.089   0.098   0.068

Table: Sum of squared log price relative errors D^SI

        Restricted      Full
(i)       0.016802  0.036523
(ii)      0.016857  0.038863
(iii)     0.029087  0.052078

Figure: Spline Surface Based on Restricted Data Set for 2007

Figure: Price Indexes Calculated on the Restricted Data Set (series: region LM, long-lat GAM, postcode LM, median, repeat sales; x-axis: years)

Figure: Price Indexes Calculated on the Full Data Set (series: region LM, long-lat GAM, postcode LM, median, repeat sales; x-axis: years)

Main Findings

- The price index rises more when locational effects are captured using a geospatial spline rather than postcode or region dummies.
- The gap between the spline and postcode based indexes is small (about 0.5 percent over 11 years).
- The gap between the spline and region based indexes is larger (about 4.2 percent over 11 years).
- The full-sample spline-based price index rises 6.5 percent more over 11 years than its restricted data set counterpart.
- The median index is dramatically different when the full data set is used.
- The gap between spline and postcode/region hedonic price indexes is smaller when the full data set is used.

Are Postcode/Region Based Indexes Downward Biased? I

A downward bias can arise when the locations of sold houses in a postcode or region get worse over time.

Our test procedure:
1. Choose a postcode.
2. Calculate the mean number of bedrooms, bathrooms, land area and quarter of sale over the 11 years for that postcode.
3. Impute the price of this average house in every location in which a house actually sold in 2001,...,2011 in that postcode (using the spline model of year 2001).
4. Take the geometric mean of these imputed prices for each year.
5. Repeat for another postcode.
6. Take the geometric mean across postcodes in each year.
7. Repeat steps 3-6 using the spline of year 2002, the spline of 2003, etc.

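Steps 3 to 6 of the procedure above can be sketched in Python for a single reference-year spline. The callable `impute_price` stands in for the fitted spline model evaluated at the postcode's average house; it, the data layout, the postcode labels and the toy price surface are all hypothetical.

```python
import math

def geo_mean(values):
    """Geometric mean, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def location_quality_series(sales, impute_price):
    """Geometric mean of imputed prices per year, averaged across postcodes.

    sales: {postcode: {year: [(lat, long), ...]}} with the locations of sold houses.
    impute_price(postcode, lat, long): imputed price of that postcode's average
    house at the given location, from one fixed reference-year model.
    """
    per_year = {}
    for postcode, by_year in sales.items():
        for year, locations in by_year.items():
            # Steps 3-4: impute at every sold location, geometric mean within postcode.
            within = geo_mean([impute_price(postcode, lat, lon) for lat, lon in locations])
            per_year.setdefault(year, []).append(within)
    # Step 6: geometric mean across postcodes in each year.
    return {year: geo_mean(vals) for year, vals in sorted(per_year.items())}

# Toy example: the price rises with the first coordinate, and the houses
# sold in 2002 sit in cheaper locations than those sold in 2001.
toy_sales = {"A": {2001: [(1.0, 0.0), (3.0, 0.0)], 2002: [(1.0, 0.0)]}}
series = location_quality_series(toy_sales, lambda pc, lat, lon: 100.0 + 10.0 * lat)
```

Step 7 would repeat the call with an `impute_price` built from each other year's spline. Because the model and the average house are held fixed, any movement in the series reflects only the changing locations of the houses sold.
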
Are Postcode/Region Based Indexes Downward Biased? II

Findings:
- The geometric means from step 6 fall over time irrespective of which year's spline is used as the reference.
- Most of the fall occurs in the first half of the sample.
- The fall is bigger for regions than for postcodes.

Figure: Evidence of Bias in the Postcode-Based Price Indexes (legend: 2001, 2006, 2011; x-axis: years)

Figure: Evidence of Bias in the Region-Based Price Indexes (legend: 2001, 2006, 2011; x-axis: years)

Conclusions

- Splines (or some other nonparametric method), when combined with the hedonic imputation method, provide a flexible way of incorporating geospatial data into a house price index.
- In our data set, postcode/region based indexes seem to have a downward bias, since they fail to account for a general shift over time in houses sold to worse locations in each postcode/region.
- The bias is negligible with postcode dummies but not with region dummies.
- For a city with postcodes as finely defined as Sydney (about 14,300 residents and 7.39 square kilometers per postcode), postcode dummies do a good job of controlling for locational effects.
- It is important to use the full data set, and not just observations with no missing characteristics.

Thank you for your attention!

GAM

We estimate the Generalized Additive Model

  $y = Z\beta + g(z_{lat}, z_{long}) + \varepsilon$

with a thin plate regression spline (tprs), Wood (2003):
- A tprs is an optimal low rank approximation of a thin plate spline.
- A tprs uses far fewer coefficients than a full spline: it is computationally efficient, while losing little statistical performance.
- Big Additive Models: bam function from the R package mgcv.

We can avoid certain problems in spline smoothing:
- Choice of knot location and basis
- Smooths of more than one predictor

Thin plate spline I

Thin plate spline smoothing problem, Duchon (1977):

Estimate the smooth function $g$ of the $d$-vector $x$ from $n$ observations such that $y_i = g(x_i) + \varepsilon_i$, by finding the function $\hat{f}$ that minimizes the penalized problem

  $\|y - f\|^2 + \lambda J_{md}(f)$,   (3)

where
- $y = (y_1, \ldots, y_n)'$ and $f = (f(x_1), \ldots, f(x_n))'$,
- $\lambda$ is a smoothing parameter,
- $J_{md}(f)$ is a penalty function measuring the wiggliness of $f$:

  $J_{md} = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \ldots + \nu_d = m} \frac{m!}{\nu_1! \cdots \nu_d!} \left(\frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}}\right)^2 dx_1 \cdots dx_d$

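As a worked special case of the penalty, take $d = 2$ (for example latitude and longitude) and $m = 2$. The choice $m = 2$ is an assumption made here for illustration; the slides do not state which $m$ is used. The multi-indices with $\nu_1 + \nu_2 = 2$ are $(2,0)$, $(1,1)$ and $(0,2)$, with multinomial weights $1$, $2$ and $1$:

```latex
J_{22}(f) = \int\!\!\int_{\mathbb{R}^2}
  \left[
    \left(\frac{\partial^2 f}{\partial x_1^2}\right)^2
    + 2 \left(\frac{\partial^2 f}{\partial x_1 \partial x_2}\right)^2
    + \left(\frac{\partial^2 f}{\partial x_2^2}\right)^2
  \right] dx_1 \, dx_2
```

Any plane $f(x) = a + b x_1 + c x_2$ has zero second derivatives, so it is unpenalized; these functions form the null space of $J_{22}$.
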
Thin plate spline II

It can be shown that the solution of (3) has the form

  $\hat{f}(x) = \sum_{i=1}^{n} \delta_i \, \eta_{md}(\|x - x_i\|) + \sum_{j=1}^{M} \alpha_j \phi_j(x)$,   (4)

where
- $\delta_i$ and $\alpha_j$ are coefficients to be estimated, with $\delta$ such that $T'\delta = 0$ and $T_{ij} = \phi_j(x_i)$,
- the $M = \binom{m+d-1}{d}$ functions $\phi_j$ are linearly independent polynomials spanning the space of polynomials in $\mathbb{R}^d$ of degree less than $m$ (i.e. the null space of $J_{md}$),
- the basis functions are

  $\eta_{md}(r) = \frac{(-1)^{m+1+d/2}}{2^{2m-1} \pi^{d/2} (m-1)! \, (m-d/2)!} \, r^{2m-d} \log(r)$   for $d$ even,

  $\eta_{md}(r) = \frac{\Gamma(d/2 - m)}{2^{2m} \pi^{d/2} (m-1)!} \, r^{2m-d}$   for $d$ odd.

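The basis function $\eta_{md}$ and the null-space dimension $M$ can be evaluated directly from the formulas above. The Python sketch below is illustrative only; the function names and the default $m = 2$, $d = 2$ are assumptions, and the formula requires $2m > d$.

```python
import math

def eta(r, m=2, d=2):
    """Thin plate spline basis function eta_md(r), valid for 2m > d."""
    if r == 0:
        return 0.0  # limit of r^(2m-d) * log(r), and of r^(2m-d), as r -> 0
    if d % 2 == 0:
        const = (-1) ** (m + 1 + d // 2) / (
            2 ** (2 * m - 1) * math.pi ** (d / 2)
            * math.factorial(m - 1) * math.factorial(m - d // 2)
        )
        return const * r ** (2 * m - d) * math.log(r)
    const = math.gamma(d / 2 - m) / (
        2 ** (2 * m) * math.pi ** (d / 2) * math.factorial(m - 1)
    )
    return const * r ** (2 * m - d)

def null_space_dim(m=2, d=2):
    """Number M of polynomial terms phi_j spanning the null space of J_md."""
    return math.comb(m + d - 1, d)

# For m = 2, d = 2 the basis function reduces to r^2 * log(r) / (8 * pi).
value = eta(math.e)
```

For $m = 2$, $d = 2$ there are $M = 3$ unpenalized terms, corresponding to a constant and the two linear terms in the coordinates. For $m = 2$, $d = 1$ the function reduces to $r^3 / 12$, the familiar cubic spline kernel.
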
Figure: Rank 15 Eigen Approx to 2D thin plate spline (Wood, 2006)

Thin plate spline III

With the matrix $E$ defined by $E_{ij} = \eta_{md}(\|x_i - x_j\|)$, the thin plate spline fitting problem is now the minimization of

  $\|y - E\delta - T\alpha\|^2 + \lambda \delta' E \delta$   s.t.   $T'\delta = 0$   (5)

with respect to $\delta$ and $\alpha$.

- Ideal smoother: exact weight to the conflicting goals of matching the data and making $f$ smooth
- Disadvantage: as many parameters as there are data, i.e. $O(n^3)$ calculations
- More on thin plate splines: Wahba (1990) or Green and Silverman (1994)

Thin plate regression spline I

- The computational burden can be reduced with the use of a low rank approximation, Wood (2003): a parameter space basis that perturbs (5) as little as possible.
- The basic idea: truncation of the space of the wiggly components of the spline (with parameter $\delta$), while leaving the $\alpha$-components unchanged.
- $E = UDU'$ is the eigen-decomposition of $E$.
- Take an appropriate submatrix $D_k$ of $D$ and the corresponding $U_k$, restricting $\delta$ to the column space of $U_k$, i.e. $\delta = U_k \delta_k$, and minimize

  $\|y - U_k D_k \delta_k - T\alpha\|^2 + \lambda \delta_k' D_k \delta_k$   s.t.   $T' U_k \delta_k = 0$   (6)

  with respect to $\delta_k$ and $\alpha$.

Thin plate regression spline II

- The computational cost is reduced from $O(n^3)$ to $O(k^3)$.
- Remaining problem: find $U_k$ and $D_k$ sufficiently cheaply.
- Wood (2003) proposes the use of the Lanczos method (Demmel (1997)), which allows the calculation at the substantially lower cost of $O(n^2 k)$ operations.
- Smoothing parameter selection: a Laplace approximation is used to obtain an approximate REML criterion that is suitable for efficient direct optimization and is computationally stable, Wood (2011).

Thin plate regression spline III

In practice:
1. Choose the degree of approximation k = 600 (tradeoff: oversmoothing vs. computational burden).
2. Construct the tprs basis from a randomly chosen sample of 2500 data points. The locations of these observations depend on the locational distribution of house sales in that period.

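The random subsampling in step 2 can be sketched in Python as below. Only the sample size of 2500 comes from the slide; the function name, the fixed seed and the synthetic coordinates are hypothetical, and the actual basis construction is done inside mgcv rather than by code like this.

```python
import random

def sample_basis_points(locations, n_points=2500, seed=0):
    """Draw a random subset of sale locations to build the spline basis from.

    Sampling uniformly from observed sales means the chosen points follow
    the spatial distribution of house sales in that period.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    if len(locations) <= n_points:
        return list(locations)
    return rng.sample(locations, n_points)

# Synthetic (lat, long) pairs, for illustration only.
locations = [(-33.8 + 0.0001 * i, 151.0 + 0.0001 * i) for i in range(10000)]
basis_points = sample_basis_points(locations)
```

Sampling without replacement keeps the basis points distinct, and areas with more sales automatically receive more basis points.
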
Figure: Sum of squared errors D^SI of the price relatives and computational time for different basis dimensions (axes: SSE of the price relatives, computational time; x-axis: number of basis dimensions)

Figure: AIC distribution and computational time for 100 repetitions of the 2011 fit, based on different numbers of randomly chosen observations (axes: range of AIC for 100 repetitions, computational time; x-axis: number of randomly chosen observations for basis construction)