Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines. Robert J. Hill and Michael Scholz


1 Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines
Robert J. Hill and Michael Scholz
Department of Economics, University of Graz, Austria
OeNB Workshop, Vienna, 9th-10th October 2014
Supported by funds of the Oesterreichische Nationalbank (Anniversary Fund, project number: 14947)
Robert J. Hill and Michael Scholz, OeNB Workshop, Oct 2014

2 Overview of the talk:
1. Hedonic House Price Indexes: Incorporating Geospatial Data; Taxonomy: Time Dummy, Average Characteristics, Hedonic Imputation
2. Models: GAM with Spline vs. Postcode/Region Dummies; Estimation of the Semiparametric Model
3. Data Set, Missing Characteristics, Results and Main Findings; Are Postcode/Region Based Indexes Downward Biased?
4. Conclusions

3 Introduction
Motivation:
- Houses differ both in their physical characteristics and their location.
- The exact longitude and latitude of each house are now increasingly available in housing data sets.
- How can we incorporate geospatial data (i.e., longitudes and latitudes) in a hedonic model of the housing market?
- How much difference does it make?

4 Methods for Incorporating Geospatial Data in a Hedonic Index:
1. Distance to amenities (including the city center, nearest train station, shopping center, etc.) as additional characteristics
2. Spatial autoregressive models
3. A spline function (or some other nonparametric function)

5 A Taxonomy of Methods for Computing Hedonic House Price Indexes I
Time dummy method:
y = Zβ + Dδ + ε,
where Z is a matrix of characteristics and D is a matrix of time dummy variables.
Index: P_t = exp(δ̂_t), where δ̂_t is the estimated coefficient obtained from the hedonic model.
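The mechanics of the time dummy index can be sketched in a few lines. The data below are simulated and the single characteristic is hypothetical; the authors' actual estimation is done in R, so this is only an illustration of P_t = exp(δ̂_t):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 sales over two periods, one characteristic (log land area).
n = 200
period = rng.integers(0, 2, n)           # 0 = base period t, 1 = period t+1
z = rng.normal(5.0, 0.5, n)              # log land area
# Simulated model: log price = 2 + 0.8*z + 0.1*(period dummy) + noise
y = 2.0 + 0.8 * z + 0.1 * period + rng.normal(0, 0.05, n)

# Design matrix [constant, characteristic, time dummy]; no dummy for the base period.
X = np.column_stack([np.ones(n), z, period])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Index for period t+1 relative to t: P = exp(delta_hat),
# which should be near exp(0.1) for this simulated data.
P = np.exp(beta[2])
```

The dummy coefficient measures the pure period effect on log price after controlling for characteristics, which is why exponentiating it yields the quality-adjusted index.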

6 A Taxonomy of Methods for Computing Hedonic House Price Indexes II
Average characteristics method:
Laspeyres: P^L_{t,t+1} = p̂_{t+1}(z̄_t) / p̂_t(z̄_t) = exp[ Σ_{c=1}^C (β̂_{c,t+1} − β̂_{c,t}) z̄_{c,t} ],
Paasche: P^P_{t,t+1} = p̂_{t+1}(z̄_{t+1}) / p̂_t(z̄_{t+1}) = exp[ Σ_{c=1}^C (β̂_{c,t+1} − β̂_{c,t}) z̄_{c,t+1} ],
where z̄_{c,t} = (1/H_t) Σ_{h=1}^{H_t} z_{c,t,h} and z̄_{c,t+1} = (1/H_{t+1}) Σ_{h=1}^{H_{t+1}} z_{c,t+1,h}.
Note: Average characteristics methods cannot use geospatial data, since averaging longitudes and latitudes makes no sense.
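With estimated shadow prices and mean characteristics in hand, the two formulas reduce to a dot product. The coefficient vectors and means below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical shadow prices from semilog regressions in periods t and t+1
# (characteristics: constant, bedrooms, bathrooms, log land area).
beta_t  = np.array([11.0, 0.10, 0.15, 0.30])
beta_t1 = np.array([11.1, 0.11, 0.14, 0.31])

# Mean characteristic vectors in each period (constant included).
zbar_t  = np.array([1.0, 3.2, 1.5, 6.3])
zbar_t1 = np.array([1.0, 3.3, 1.6, 6.4])

# Laspeyres holds base-period mean characteristics fixed;
# Paasche holds current-period means fixed.
P_L = np.exp((beta_t1 - beta_t) @ zbar_t)
P_P = np.exp((beta_t1 - beta_t) @ zbar_t1)
```

The only difference between the two indexes is which period's mean characteristics are priced; this is also where the method breaks down for geospatial data, since a "mean longitude" is meaningless.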

7 A Taxonomy of Methods for Computing Hedonic House Price Indexes III
Hedonic imputation method (SI = with Single Imputation):
Paasche: P^{PSI}_{t,t+1} = Π_{h=1}^{H_{t+1}} [ p_{t+1,h} / p̂_t(z_{t+1,h}) ]^{1/H_{t+1}},
Laspeyres: P^{LSI}_{t,t+1} = Π_{h=1}^{H_t} [ p̂_{t+1}(z_{t,h}) / p_{t,h} ]^{1/H_t},
Fisher: P^{FSI}_{t,t+1} = sqrt( P^{PSI}_{t,t+1} × P^{LSI}_{t,t+1} ).
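Each index is a geometric mean of ratios of actual to imputed prices. A minimal sketch with made-up prices and imputations (the real imputations would come from the fitted hedonic models):

```python
import numpy as np

def geo_mean(x):
    """Geometric mean via the log domain."""
    return float(np.exp(np.mean(np.log(x))))

# Hypothetical observed prices and model-imputed prices.
p_t1   = np.array([500e3, 620e3, 480e3])   # observed in period t+1
phat_t = np.array([460e3, 590e3, 450e3])   # period-t model priced at t+1 characteristics

p_t     = np.array([400e3, 510e3])         # observed in period t
phat_t1 = np.array([425e3, 540e3])         # period-(t+1) model priced at t characteristics

P_paasche   = geo_mean(p_t1 / phat_t)      # actual over imputed, t+1 sample
P_laspeyres = geo_mean(phat_t1 / p_t)      # imputed over actual, t sample
P_fisher    = float(np.sqrt(P_paasche * P_laspeyres))
```

Because each house is priced at its own characteristics, including its exact location, this method accommodates geospatial data directly, unlike the average characteristics method.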

8 Our Models I
(i) semilog with geospatial spline:
y = Zβ + g(z_lat, z_long) + ε    (1)
(ii)/(iii) semilog with postcode/region dummies:
y = Zβ + Dδ + ε    (2)
- y is an H × 1 vector of log prices
- Z is an H × C matrix of physical characteristics (including a constant and quarterly dummies)
- g(z_lat, z_long) is the geospatial spline function defined on the longitudes and latitudes
- D is a matrix of postcode or region dummies.

9 Our Models II
(i) semilog with geospatial spline: y = Zβ + g(z_lat, z_long) + ε (1)
(ii)/(iii) semilog with postcode/region dummies: y = Zβ + Dδ + ε (2)
- The parameters to be estimated in (i) are the C × 1 vector of characteristic shadow prices β and the geospatial spline surface g(z_lat, z_long).
- The parameters to be estimated in (ii) or (iii) are the C × 1 vector of characteristic shadow prices β and the B × 1 vector of postcode or region shadow prices δ.
- We consider 16 regions and 242 postcodes, so on average there are about 15 postcodes per region.

10 Estimation of the Semiparametric Model I
Estimation: The semiparametric model is estimated using the bam function (Big Additive Models) in the mgcv package in R. [details: GAM]
For the nonparametric part we use an approximation to a thin plate spline (TPS), the thin plate regression spline (TPRS) developed by S. Wood (2003), defined on the longitudes and latitudes. [details: TPS]
In our context, TPRSs have two advantages over other splines:
- It is not necessary to specify knot locations.
- The spline surface can be estimated as a function of two explanatory variables. [details: TPRS]
With a full TPS the computational burden rises rapidly as the number of data points increases.

11 Estimation of the Semiparametric Model II
- The approximation reduces the computational burden for any given level of fit.
- It is necessary to select a value for the basis dimension k, which determines how closely our spline approximates a full TPS: a lower value of k has a lower computational burden but leads to a worse fit. We choose a value of k beyond which the gains in fit from increasing it further are low. [details: parameter choice]
- The smoothing parameter λ determines the smoothness of the spline surface itself. The algorithm selects λ using restricted maximum likelihood (REML), where the likelihood function is maximized across all the parameters of the semiparametric model.

12 Our Data Set
- The data: Sydney, Australia, from 2001 to 2011.
- Our characteristics are:
  - Transaction price + exact date of sale
  - Physical: number of bedrooms, number of bathrooms, land area
  - Location: postcode, longitude, latitude
- Some (physical) characteristics are missing for some houses. There are more gaps in the data in the earlier years of our sample.
- We have a total of 454,567 transactions. All characteristics are available for only 240,142 of these transactions.

13 Dealing with Missing Characteristics
We impute the price of each house from the model below that has exactly the same mix of characteristics:
(HM1): y = f(d, z1, z2, z3, loc)
(HM2): y = f(d, z2, z3, loc)
(HM3): y = f(d, z1, z3, loc)
(HM4): y = f(d, z1, z2, loc)
(HM5): y = f(d, z3, loc)
(HM6): y = f(d, z2, loc)
(HM7): y = f(d, z1, loc)
(HM8): y = f(d, loc)
For all obs.: y = log(price), d = quarter dummy, loc = location
Not for all obs.: z1 = land area, z2 = number of bedrooms, z3 = number of bathrooms
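Routing each transaction to the model that matches its pattern of available characteristics amounts to a lookup over the eight cases. This dispatch function is a sketch of one way to do it (the authors' actual pipeline is not shown in the slides):

```python
# Map each availability pattern of (z1, z2, z3) to the hedonic model
# (HM1-HM8) that uses exactly those variables, per the list above.
# z1 = land area, z2 = bedrooms, z3 = bathrooms; d and loc are always observed.
def pick_model(has_z1, has_z2, has_z3):
    return {
        (True,  True,  True ): "HM1",   # f(d, z1, z2, z3, loc)
        (False, True,  True ): "HM2",   # f(d, z2, z3, loc)
        (True,  False, True ): "HM3",   # f(d, z1, z3, loc)
        (True,  True,  False): "HM4",   # f(d, z1, z2, loc)
        (False, False, True ): "HM5",   # f(d, z3, loc)
        (False, True,  False): "HM6",   # f(d, z2, loc)
        (True,  False, False): "HM7",   # f(d, z1, loc)
        (False, False, False): "HM8",   # f(d, loc)
    }[(has_z1, has_z2, has_z3)]

print(pick_model(True, False, True))   # house missing only bedrooms -> "HM3"
```

Since location is always observed, every house can be imputed from some model, which is what allows the full 454,567-transaction sample to be used rather than only the complete cases.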

14 Comparing the Performance of Our Models I
Different performance measures:
- Akaike information criterion (AIC): trade-off between the goodness of fit and the complexity of the model.
- Sum of squared log errors: SSLE_t = (1/H_t) Σ_{h=1}^{H_t} [ln(p̂_{t,h}/p_{t,h})]².
- Repeat sales as a benchmark:
  Z^SI_h = actual price relative / imputed price relative = (p_{t+k,h}/p_{t,h}) / (p̂_{t+k,h}/p̂_{t,h}),
  D^SI = (1/H) Σ_{h=1}^H [ln(Z^SI_h)]².
The spline model significantly outperforms its postcode counterpart.
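The repeat-sales benchmark D^SI compares each house's actual price relative to its imputed one. A sketch with hypothetical prices and imputations (the numbers are invented; a good model keeps each Z_h near 1, driving D^SI toward 0):

```python
import numpy as np

# Hypothetical repeat-sales pairs: observed prices at t and t+k,
# plus the model's imputed prices at the same two dates.
p_t   = np.array([400e3, 510e3, 330e3])   # observed at t
p_tk  = np.array([450e3, 560e3, 360e3])   # observed at t+k
ph_t  = np.array([410e3, 500e3, 340e3])   # imputed at t
ph_tk = np.array([452e3, 555e3, 352e3])   # imputed at t+k

# Z_h = actual price relative divided by imputed price relative.
Z = (p_tk / p_t) / (ph_tk / ph_t)
D_SI = float(np.mean(np.log(Z) ** 2))

# In-sample fit for period t: mean of squared log pricing errors.
SSLE_t = float(np.mean(np.log(ph_t / p_t) ** 2))
```

Using price relatives rather than price levels makes the benchmark robust to a constant level error in the model, since any multiplicative bias common to both dates cancels in the ratio.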

15 Comparing the Performance of Our Models II
Table: Akaike information criterion (restricted data set), for models (i), (ii), (iii).
Table: Sum of squared log errors (restricted data set), for models (i), (ii), (iii).
Table: Sum of squared log price relative errors D^SI, restricted and full data sets, for models (i), (ii), (iii).

16 Spline Surface Based on Restricted Data Set for 2007

17 Price Indexes Calculated on the Restricted Data Set
[Figure: indexes plotted by year: region LM, long/lat GAM, postcode LM, median, repeat sales]

18 Price Indexes Calculated on the Full Data Set
[Figure: indexes plotted by year: region LM, long/lat GAM, postcode LM, median, repeat sales]

19 Main Findings
- The price index rises more when locational effects are captured using a geospatial spline rather than postcode or region dummies.
- The gap between the spline and postcode based indexes is small (about 0.5 percent over 11 years); the gap between the spline and region based indexes is larger (about 4.2 percent over 11 years).
- The full-sample spline-based price index rises 6.5 percent more over 11 years than its restricted data set counterpart.
- The median index is dramatically different when the full data set is used.
- The gap between spline and postcode/region hedonic price indexes is smaller when the full data set is used.

20 Are Postcode/Region Based Indexes Downward Biased? I
A downward bias can arise when the locations of sold houses in a postcode or region get worse over time.
Our test procedure:
1. Choose a postcode.
2. Calculate the mean number of bedrooms, bathrooms, land area and quarter of sale over the 11 years for that postcode.
3. Impute the price of this average house in every location in which a house actually sold in 2001,...,2011 in that postcode (using the spline model of year 2001).
4. Take the geometric mean of these imputed prices for each year.
5. Repeat for another postcode.
6. Take the geometric mean across postcodes in each year.
7. Repeat steps 3-6 using the spline of year 2002, the spline of 2003, etc.
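The core of steps 3 and 4 can be sketched as follows; the imputed prices below are hypothetical stand-ins for what one reference-year spline would produce at the locations of actual sales:

```python
import numpy as np

def geo_mean(x):
    """Geometric mean via the log domain."""
    return float(np.exp(np.mean(np.log(x))))

# Hypothetical step-3 output for one postcode: prices of the postcode's
# "average house" imputed (from one fixed reference-year spline) at the
# locations where houses actually sold in each year.
imputed = {
    2001: np.array([520e3, 540e3, 510e3]),
    2006: np.array([505e3, 515e3, 498e3]),
    2011: np.array([495e3, 500e3, 490e3]),
}

# Step 4: geometric mean of the imputed prices for each year.
yearly = {year: geo_mean(prices) for year, prices in imputed.items()}
```

Because the house's characteristics and the spline are held fixed, any decline in these yearly means can only come from the changing locations of sales within the postcode, which is exactly the source of the suspected bias.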

21 Are Postcode Based Indexes Downward Biased? II
Findings:
- The geometric means from step 6 fall over time, irrespective of which year's spline is used as the reference.
- Most of the fall occurs in the first half of the sample.
- The fall is bigger for regions than postcodes.

22 Evidence of Bias in the Postcode-Based Price Indexes

23 Evidence of Bias in the Region-Based Price Indexes

24 Conclusions
- Splines (or some other nonparametric method), when combined with the hedonic imputation method, provide a flexible way of incorporating geospatial data into a house price index.
- In our data set, postcode/region based indexes seem to have a downward bias, since they fail to account for a general shift over time of houses sold toward worse locations in each postcode/region. The bias is negligible with postcode dummies but not with region dummies.
- For a city with postcodes as finely defined as Sydney's (about 14,300 residents and 7.39 square kilometers per postcode), postcode dummies do a good job of controlling for locational effects.
- It is important to use the full data set, and not just observations with no missing characteristics.

25 Thank you for your attention!

26 GAM
We estimate the generalized additive model
y = Zβ + g(z_lat, z_long) + ε
with a thin plate regression spline (TPRS), Wood (2003): an optimal low-rank approximation of a thin plate spline.
- A TPRS uses far fewer coefficients than a full spline: computationally efficient, while losing little statistical performance.
- Big Additive Models: the bam function from the R package mgcv.
- We can avoid certain problems in spline smoothing: the choice of knot locations and basis, and smooths of more than one predictor.

27 Thin Plate Spline I
Thin plate spline smoothing problem, Duchon (1977): Estimate the smooth function g of a d-vector x from n observations satisfying y_i = g(x_i) + ε_i by finding the function f̂ that minimizes the penalized problem
‖y − f‖² + λ J_md(f),    (3)
where y = (y_1, ..., y_n)', f = (f(x_1), ..., f(x_n))', λ is a smoothing parameter, and J_md(f) is a penalty function measuring the wiggliness of f:
J_md(f) = ∫ ... ∫ Σ_{ν_1+...+ν_d = m} [ m! / (ν_1! ... ν_d!) ] ( ∂^m f / ∂x_1^{ν_1} ... ∂x_d^{ν_d} )² dx_1 ... dx_d.

28 Thin Plate Spline II
It can be shown that the solution of (3) has the form
f̂(x) = Σ_{i=1}^n δ_i η_md(‖x − x_i‖) + Σ_{j=1}^M α_j φ_j(x),    (4)
where δ_i and α_j are coefficients to be estimated, with the δ_i such that T'δ = 0, where T_ij = φ_j(x_i).
The M = (m+d−1 choose d) functions φ_j are linearly independent polynomials spanning the space of polynomials in R^d of degree less than m (i.e., the null space of J_md).
η_md(r) = [ (−1)^{m+1+d/2} / (2^{2m−1} π^{d/2} (m−1)! (m−d/2)!) ] r^{2m−d} log(r)   for d even,
η_md(r) = [ Γ(d/2 − m) / (2^{2m} π^{d/2} (m−1)!) ] r^{2m−d}   for d odd.
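For the geospatial case relevant here (d = 2 with the default m = 2), the even-d formula reduces to η(r) = r² log(r) / (8π), with η(0) = 0 by continuity. A small sketch evaluating this radial basis over a few 2-D points (the coordinates are arbitrary placeholders for long/lat pairs):

```python
import numpy as np

def eta_2d(r):
    """Thin plate spline radial basis for d = 2, m = 2:
    eta(r) = r^2 * log(r) / (8*pi), with eta(0) = 0 by continuity."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz]) / (8 * np.pi)
    return out

# Matrix of basis evaluations E_ij = eta(||x_i - x_j||) over sample points.
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
E = eta_2d(r)
```

This E is the matrix that enters the fitting problem on the next slide; note that it is symmetric with a zero diagonal, and that η vanishes at r = 1 as well since log(1) = 0.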

29 Figure: Rank 15 Eigen Approx to 2D thin plate spline (Wood, 2006) Robert J. Hill and Michael Scholz OeNB Workshop, Oct / 10

30 Thin Plate Spline III
Defining E by E_ij = η_md(‖x_i − x_j‖), the thin plate spline fitting problem is now the minimization of
‖y − Eδ − Tα‖² + λ δ'Eδ   s.t. T'δ = 0,    (5)
with respect to δ and α.
- Ideal smoother: gives exact weight to the conflicting goals of matching the data and making f smooth.
- Disadvantage: as many parameters as there are data points, i.e., O(n³) calculations.
More on thin plate splines: Wahba (1990) or Green and Silverman (1994).

31 Thin Plate Regression Spline I
- The computational burden can be reduced with the use of a low-rank approximation, Wood (2003): a parameter-space basis that perturbs (5) as little as possible.
- The basic idea: truncate the space of the wiggly components of the spline (with parameter δ), while leaving the α components unchanged.
- Let E = UDU' be the eigen-decomposition of E. Choose an appropriate submatrix D_k of D and the corresponding U_k, restricting δ to the column space of U_k, i.e., δ = U_k δ_k. Then minimize
‖y − U_k D_k δ_k − Tα‖² + λ δ_k' D_k δ_k   s.t. T'U_k δ_k = 0,    (6)
with respect to δ_k and α.
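The truncation step itself is a rank-k eigen-approximation of E. A sketch using a random symmetric matrix as a stand-in for E (in practice E would come from the radial basis evaluations, and mgcv uses the Lanczos method rather than a full decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the n x n symmetric matrix E of radial-basis evaluations.
n, k = 50, 10
A = rng.normal(size=(n, n))
E = (A + A.T) / 2

# Eigen-decomposition E = U D U'; keep the k eigenvalues of largest magnitude
# (the "wiggly components" that matter most for the fit).
vals, U = np.linalg.eigh(E)
idx = np.argsort(np.abs(vals))[::-1][:k]
U_k, D_k = U[:, idx], np.diag(vals[idx])

# Rank-k approximation underlying the TPRS basis: E_k = U_k D_k U_k'.
E_k = U_k @ D_k @ U_k.T
```

Restricting δ to the column space of U_k replaces the n-dimensional wiggly part by a k-dimensional one, which is what cuts the cost of the fit from O(n³) to O(k³).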

32 Thin Plate Regression Spline II
- The computational cost is reduced from O(n³) to O(k³).
- Remaining problem: find U_k and D_k sufficiently cheaply. Wood (2003) proposes the use of the Lanczos method (Demmel, 1997), which allows the calculation at the substantially lower cost of O(n²k) operations.
- Smoothing parameter selection: a Laplace approximation yields an approximate REML criterion that is suitable for efficient direct optimization and is computationally stable, Wood (2011).

33 Thin Plate Regression Spline III
In practice:
1. Choose the degree of approximation k = 600 (tradeoff: oversmoothing vs. computational burden).
2. Construct the TPRS basis from a randomly chosen sample of 2500 data points. The locations of these observations depend on the locational distribution of house sales in that period.

34 Figure: Sum of squared errors D^SI of the price relatives and computational time for different basis dimensions.

35 Figure: AIC distribution and computational time for 100 repetitions of the year-2011 fit based on different numbers of randomly chosen observations for basis construction.


More information

Smoothing parameterselection forsmoothing splines: a simulation study

Smoothing parameterselection forsmoothing splines: a simulation study Computational Statistics & Data Analysis 42 (2003) 139 148 www.elsevier.com/locate/csda Smoothing parameterselection forsmoothing splines: a simulation study Thomas C.M. Lee Department of Statistics, Colorado

More information

arxiv: v1 [stat.me] 2 Jun 2017

arxiv: v1 [stat.me] 2 Jun 2017 Inference for Penalized Spline Regression: Improving Confidence Intervals by Reducing the Penalty arxiv:1706.00865v1 [stat.me] 2 Jun 2017 Ning Dai School of Statistics University of Minnesota daixx224@umn.edu

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Dimension Reduction Methods for Multivariate Time Series

Dimension Reduction Methods for Multivariate Time Series Dimension Reduction Methods for Multivariate Time Series BigVAR Will Nicholson PhD Candidate wbnicholson.com github.com/wbnicholson/bigvar Department of Statistical Science Cornell University May 28, 2015

More information

Last time... Bias-Variance decomposition. This week

Last time... Bias-Variance decomposition. This week Machine learning, pattern recognition and statistical data modelling Lecture 4. Going nonlinear: basis expansions and splines Last time... Coryn Bailer-Jones linear regression methods for high dimensional

More information

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester.

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester. AM205: lecture 2 Luna and Gary will hold a Python tutorial on Wednesday in 60 Oxford Street, Room 330 Assignment 1 will be posted this week Chris will hold office hours on Thursday (1:30pm 3:30pm, Pierce

More information

GAM: The Predictive Modeling Silver Bullet

GAM: The Predictive Modeling Silver Bullet GAM: The Predictive Modeling Silver Bullet Author: Kim Larsen Introduction Imagine that you step into a room of data scientists; the dress code is casual and the scent of strong coffee is hanging in the

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

Lecture 7: Splines and Generalized Additive Models

Lecture 7: Splines and Generalized Additive Models Lecture 7: and Generalized Additive Models Computational Statistics Thierry Denœux April, 2016 Introduction Overview Introduction Simple approaches Polynomials Step functions Regression splines Natural

More information

Penalizied Logistic Regression for Classification

Penalizied Logistic Regression for Classification Penalizied Logistic Regression for Classification Gennady G. Pekhimenko Department of Computer Science University of Toronto Toronto, ON M5S3L1 pgen@cs.toronto.edu Abstract Investigation for using different

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

Economics Nonparametric Econometrics

Economics Nonparametric Econometrics Economics 217 - Nonparametric Econometrics Topics covered in this lecture Introduction to the nonparametric model The role of bandwidth Choice of smoothing function R commands for nonparametric models

More information

Nonlinearity and Generalized Additive Models Lecture 2

Nonlinearity and Generalized Additive Models Lecture 2 University of Texas at Dallas, March 2007 Nonlinearity and Generalized Additive Models Lecture 2 Robert Andersen McMaster University http://socserv.mcmaster.ca/andersen Definition of a Smoother A smoother

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET

PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET KLAUS MOELTNER 1. Introduction In this script we show, for a single housing market, how mis-specification of the hedonic price function can lead

More information

Detection of Smoke in Satellite Images

Detection of Smoke in Satellite Images Detection of Smoke in Satellite Images Mark Wolters Charmaine Dean Shanghai Center for Mathematical Sciences Western University December 15, 2014 TIES 2014, Guangzhou Summary Application Smoke identification

More information

1D Regression. i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is...

1D Regression. i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is... 1D Regression i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is... 1 Non-linear problems What if the underlying function is

More information

Model selection and validation 1: Cross-validation

Model selection and validation 1: Cross-validation Model selection and validation 1: Cross-validation Ryan Tibshirani Data Mining: 36-462/36-662 March 26 2013 Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10 1 Reminder: modern regression techniques Over the

More information

Median and Extreme Ranked Set Sampling for penalized spline estimation

Median and Extreme Ranked Set Sampling for penalized spline estimation Appl Math Inf Sci 10, No 1, 43-50 (016) 43 Applied Mathematics & Information Sciences An International Journal http://dxdoiorg/1018576/amis/10014 Median and Extreme Ranked Set Sampling for penalized spline

More information

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100. Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records

More information

Diffusion Wavelets for Natural Image Analysis

Diffusion Wavelets for Natural Image Analysis Diffusion Wavelets for Natural Image Analysis Tyrus Berry December 16, 2011 Contents 1 Project Description 2 2 Introduction to Diffusion Wavelets 2 2.1 Diffusion Multiresolution............................

More information

Bernstein-Bezier Splines on the Unit Sphere. Victoria Baramidze. Department of Mathematics. Western Illinois University

Bernstein-Bezier Splines on the Unit Sphere. Victoria Baramidze. Department of Mathematics. Western Illinois University Bernstein-Bezier Splines on the Unit Sphere Victoria Baramidze Department of Mathematics Western Illinois University ABSTRACT I will introduce scattered data fitting problems on the sphere and discuss

More information

Lecture 7: Linear Regression (continued)

Lecture 7: Linear Regression (continued) Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions

More information

NONPARAMETRIC REGRESSION TECHNIQUES

NONPARAMETRIC REGRESSION TECHNIQUES NONPARAMETRIC REGRESSION TECHNIQUES C&PE 940, 28 November 2005 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and other resources available at: http://people.ku.edu/~gbohling/cpe940

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Introduction to floating point arithmetic

Introduction to floating point arithmetic Introduction to floating point arithmetic Matthias Petschow and Paolo Bientinesi AICES, RWTH Aachen petschow@aices.rwth-aachen.de October 24th, 2013 Aachen, Germany Matthias Petschow (AICES, RWTH Aachen)

More information

A comparison of spline methods in R for building explanatory models

A comparison of spline methods in R for building explanatory models A comparison of spline methods in R for building explanatory models Aris Perperoglou on behalf of TG2 STRATOS Initiative, University of Essex ISCB2017 Aris Perperoglou Spline Methods in R ISCB2017 1 /

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

Nonparametric Methods Recap

Nonparametric Methods Recap Nonparametric Methods Recap Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010 Nonparametric Methods Kernel Density estimate (also Histogram) Weighted frequency Classification - K-NN Classifier Majority

More information

Radial Basis Function Networks

Radial Basis Function Networks Radial Basis Function Networks As we have seen, one of the most common types of neural network is the multi-layer perceptron It does, however, have various disadvantages, including the slow speed in learning

More information

arxiv: v2 [stat.ap] 14 Nov 2016

arxiv: v2 [stat.ap] 14 Nov 2016 The Cave of Shadows Addressing the human factor with generalized additive mixed models Harald Baayen a Shravan Vasishth b Reinhold Kliegl b Douglas Bates c arxiv:1511.03120v2 [stat.ap] 14 Nov 2016 a University

More information

Recent Developments in Model-based Derivative-free Optimization

Recent Developments in Model-based Derivative-free Optimization Recent Developments in Model-based Derivative-free Optimization Seppo Pulkkinen April 23, 2010 Introduction Problem definition The problem we are considering is a nonlinear optimization problem with constraints:

More information

Support Vector Regression with ANOVA Decomposition Kernels

Support Vector Regression with ANOVA Decomposition Kernels Support Vector Regression with ANOVA Decomposition Kernels Mark O. Stitson, Alex Gammerman, Vladimir Vapnik, Volodya Vovk, Chris Watkins, Jason Weston Technical Report CSD-TR-97- November 7, 1997!()+,

More information

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997 Answer #1 and any five of the remaining six problems! possible score 1. Multiple Choice 25 2. Traveling Salesman Problem 15 3.

More information

arxiv: v1 [stat.ap] 14 Nov 2017

arxiv: v1 [stat.ap] 14 Nov 2017 How to estimate time-varying Vector Autoregressive Models? A comparison of two methods. arxiv:1711.05204v1 [stat.ap] 14 Nov 2017 Jonas Haslbeck Psychological Methods Group University of Amsterdam Lourens

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Overfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff

Overfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff Overfitting Machine Learning CSE546 Carlos Guestrin University of Washington October 2, 2013 1 Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias More complex class less bias More

More information

Package ibr. R topics documented: May 1, Version Date Title Iterative Bias Reduction

Package ibr. R topics documented: May 1, Version Date Title Iterative Bias Reduction Version 2.0-3 Date 2017-04-28 Title Iterative Bias Reduction Package ibr May 1, 2017 Author Pierre-Andre Cornillon, Nicolas Hengartner, Eric Matzner-Lober Maintainer ``Pierre-Andre Cornillon''

More information

Predicting housing price

Predicting housing price Predicting housing price Shu Niu Introduction The goal of this project is to produce a model for predicting housing prices given detailed information. The model can be useful for many purpose. From estimating

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions

More information

A Trimmed Translation-Invariant Denoising Estimator

A Trimmed Translation-Invariant Denoising Estimator A Trimmed Translation-Invariant Denoising Estimator Eric Chicken and Jordan Cuevas Florida State University, Tallahassee, FL 32306 Abstract A popular wavelet method for estimating jumps in functions is

More information

MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Local Methods and Bias Variance Trade-Off. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Local Methods and Bias Variance Trade-Off Lorenzo Rosasco UNIGE-MIT-IIT About this class 1. Introduce a basic class of learning methods, namely local methods. 2. Discuss the fundamental concept

More information

CSE446: Linear Regression. Spring 2017

CSE446: Linear Regression. Spring 2017 CSE446: Linear Regression Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Prediction of continuous variables Billionaire says: Wait, that s not what I meant! You say: Chill

More information

Local spatial-predictor selection

Local spatial-predictor selection University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Information Sciences 2013 Local spatial-predictor selection Jonathan

More information