Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines

Similar documents
Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines. Robert J. Hill and Michael Scholz

Scholz, Hill and Rambaldi: Weekly Hedonic House Price Indexes Discussion

Predicting housing price

Lecture 26: Missing data

Lecture 13: Model selection and regularization

Lecture 7: Linear Regression (continued)

Machine Learning / Jan 27, 2010

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Economics Nonparametric Econometrics

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)

MAT 082. Final Exam Review

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Harmonic Spline Series Representation of Scaling Functions

Analysis of Panel Data. Third Edition. Cheng Hsiao University of Southern California CAMBRIDGE UNIVERSITY PRESS

PRE-PROCESSING HOUSING DATA VIA MATCHING: SINGLE MARKET

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Going nonparametric: Nearest neighbor methods for regression and classification

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

Math 112 Fall 2014 Midterm 1 Review Problems Page 1. (E) None of these

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values

DESIGNING ALGORITHMS FOR SEARCHING FOR OPTIMAL/TWIN POINTS OF SALE IN EXPANSION STRATEGIES FOR GEOMARKETING TOOLS

Lecture 17: Smoothing splines, Local Regression, and GAMs

Algorithms for LTS regression

Nonparametric Survey Regression Estimation in Two-Stage Spatial Sampling

Nonparametric regression using kernel and spline methods

3 Nonlinear Regression

Lasso Regression: Regularization for feature selection

Spatial Variation of Sea-Level Sea level reconstruction

Applying SnapVX to Real-World Problems

Inferring the Source of Encrypted HTTP Connections. Michael Lin CSE 544

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

Net profit remained at the same level as the last fiscal year mainly due to the sale of investment securities during the third quarter.

RECOMMENDATION ITU-R P DIGITAL TOPOGRAPHIC DATABASES FOR PROPAGATION STUDIES. (Question ITU-R 202/3)

Using Machine Learning to Optimize Storage Systems

RELEASED. Spring 2013 North Carolina Measures of Student Learning: NC s Common Exams. Advanced Functions and Modeling

Four equations are necessary to evaluate these coefficients. Eqn

3. Data Structures for Image Analysis L AK S H M O U. E D U

Radial Basis Function Networks

56:272 Integer Programming & Network Flows Final Exam -- December 16, 1997

addesc Add a variable description to the key file CCDmanual0.docx

TELCOM2125: Network Science and Analysis

Nearest Neighbor with KD Trees

Lecture on Modeling Tools for Clustering & Regression

Estimating survival from Gray s flexible model. Outline. I. Introduction. I. Introduction. I. Introduction

Summary Statistics. Closed Sales. Paid in Cash. New Pending Sales. New Listings. Median Sale Price. Average Sale Price. Median Days on Market

3 Nonlinear Regression

Image Stitching. Slides from Rick Szeliski, Steve Seitz, Derek Hoiem, Ira Kemelmacher, Ali Farhadi

Analysis of Housing Square Footage Estimates Reported by the American Housing Survey and the Survey of Construction

Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri

Middle School Math Course 3

Chapter 6: Examples 6.A Introduction

REPLACING MLE WITH BAYESIAN SHRINKAGE CAS ANNUAL MEETING NOVEMBER 2018 GARY G. VENTER

PASS Sample Size Software. Randomization Lists

Econ 172A - Slides from Lecture 8

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?

Further Simulation Results on Resampling Confidence Intervals for Empirical Variograms

Supplementary Figure 1. Decoding results broken down for different ROIs

Quality Control of Web- Scraped and Transaction Data (Scanner Data)

Section 4.1: Time Series I. Jared S. Murray The University of Texas at Austin McCombs School of Business

Influence Maximization in Location-Based Social Networks Ivan Suarez, Sudarshan Seshadri, Patrick Cho CS224W Final Project Report

Geometric Rectification of Remote Sensing Images

Introduction to Mixed Models: Multivariate Regression

Objective 1 : The student will demonstrate an understanding of numbers, operations, and quantitative reasoning.

Nearest Neighbor Predictors

CSC 411: Lecture 02: Linear Regression

CS490D: Introduction to Data Mining Prof. Chris Clifton

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Grade 4 ISTEP+ T1 #1-2 ISTEP+ T1 # Identify, describe and draw parallelograms, rhombuses, and ISTEP+ T1 #5-6

MATH 1101 Chapter 4 Review

Review: What is the definition of a parallelogram? What are the properties of a parallelogram? o o o o o o

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Missing Data Analysis for the Employee Dataset

An Optimal Regression Algorithm for Piecewise Functions Expressed as Object-Oriented Programs

Sensor Tasking and Control

Generalized Additive Model

XI Conference "Medical Informatics & Technologies" VALIDITY OF MRI BRAIN PERFUSION IMAGING METHOD

Solving Dynamic Single-Runway Aircraft Landing Problems With Extremal Optimisation

Ant Colony Optimization for dynamic Traveling Salesman Problems

Getting Started. January 2012 EAC Red Square Support

Real Estate Investment Advising Using Machine Learning

OLS Assumptions and Goodness of Fit

Approximating Square Roots

PRECISE GEOREFERENCING OF CARTOSAT IMAGERY VIA DIFFERENT ORIENTATION MODELS

Chapter 15 Mixed Models. Chapter Table of Contents. Introduction Split Plot Experiment Clustered Data References...

Model Clustering via Group Lasso

Random Walks and Cover Times. Project Report. Aravind Ranganathan. ECES 728 Internet Studies and Web Algorithms

Statistical Analysis of MRI Data

Programming Lecture 3

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Standard Wind DPR: Essential Contents

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Chapter 7: Dual Modeling in the Presence of Constant Variance

Why Do Computers Depreciate? Michael J. Geske University of Michigan. Valerie A. Ramey University of California, San Diego and NBER.

Using Crime Statistics to Recommend Police Precinct Locations

Support Vector Regression with ANOVA Decomposition Kernels

Lecture notes on Transportation and Assignment Problem (BBE (H) QTM paper of Delhi University)

Transcription:

Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines Robert J. Hill and Michael Scholz University of Graz Austria robert.hill@uni-graz.at michael-scholz@uni-graz.at 1 May 2013 Presentation to the Ottawa Group Hill and Scholz Ottawa Group 2013 1 / 24

Introduction Houses differ both in their physical characteristics and location Exact longitude and latitude of each house are now increasingly included as variables in housing data sets How can we incorporate geospatial data (i.e., longitudes and latitudes) in a hedonic model of the housing market? 1. Distance to amenities (including the city center, nearest train station and shopping center, etc.) as additional characteristics. 2. Spatial autoregressive models 3. A spline function (or some other nonparametric function) Hill and Scholz Ottawa Group 2013 2 / 24

A Taxonomy of Methods for Computing Hedonic House Price Indexes Time dummy method y = Zβ + Dδ + ε P t = exp(ˆδ t ) where Z is a matrix of characteristics and D is a matrix of dummy variables. Hill and Scholz Ottawa Group 2013 3 / 24

Average characteristics method [ C ] Laspeyres : Pt,t+1 L = ˆpt+1( zt) = exp ( ˆβ c,t+1 ˆβ c,t) z c,t, ˆp t( z t) Paasche : P P t,t+1 = ˆpt+1( zt+1) ˆp t( z t+1) where z c,t = 1 H t H t h=1 c=1 [ C ] = exp ( ˆβ c,t+1 ˆβ c,t) z c,t+1, c=1 z c,t,h and z c,t+1 = 1 H t+1 H t+1 h=1 z c,t+1,h. Average characteristics methods cannot use geospatial data, since averaging longitudes and latitudes makes no sense. Hill and Scholz Ottawa Group 2013 4 / 24

Imputation method Paasche Single Imputation : P PSI t,t+1 = [ ( pt+1,h ˆp t,h (z t+1,h ) ) ] 1/Ht+1 [ H t ( ) ] 1/Ht ˆpt+1,h (z t,h ) H t+1 Laspeyres Single Imputation : P LSI t,t+1 = h=1 h=1 Fisher Single Imputation : P FSI t,t+1 = p t,h P PSI t,t+1 PLSI t,t+1 Hill and Scholz Ottawa Group 2013 5 / 24

Distance to Amenities as Additional Characteristics Throws away a lot of potentially useful information Distance from an amenity may impact on price in a nonmonotonic way Direction may matter as well (e.g., do you live under the flight path of an airport)? Hill and Scholz Ottawa Group 2013 6 / 24

Spatial autoregressive models The SARAR(1,1) model takes the following form: y = ρsy + X β + u, u = λsu + ε, where y is the vector of log prices, (i.e., each element y h = ln p h ), and S is a spatial weights matrix that is calculated from the geospatial data. The impact of location on house prices is captured by the parameters ρ and λ. SARAR models can be combined with either the time-dummy or hedonic imputation methods. Hill and Scholz Ottawa Group 2013 7 / 24

Spatial autoregressive models (continued) The limitations of the SAR(1) model are endless. These include: (1) the implausible and unnecessary normality assumption, (2) the fact that if y i depends on spatially lagged ys, it may also depend on spatially lagged xs, which potentially generates reflection-problem endogeneity concerns..., (3) the fact that the relationship may not be linear, and (4) the rather likely possibility that u and X are dependent because of, e.g., endogeneity and/or heteroskedasticity. Even if one were to leave aside all of these concerns, there remains the laughable notion that one can somehow know the entire spatial dependence structure up to a single unknown multiplicative coefficient [two unknown coefficients in the case of SARAR(1,1)]. (Pinkse and Slade 2010, p. 106 - text in square brackets added by the authors) Hill and Scholz Ottawa Group 2013 8 / 24

Our Models (estimated separately for each year) (i) generalized additive model (GAM) with a geospatial spline C y = c 1 + Dδ 1 + f 1,c (z c ) + g 1 (z lat, z long ) + ε 1 c=1 (ii) GAM with postcode dummies C y = c 2 + Dδ 2 + f 2,c (z c ) + m 2 (z pc ) + ε 2 c=1 Hill and Scholz Ottawa Group 2013 9 / 24

Our Models (continued) (iii) semilog with geospatial spline y = c 3 + Dδ 3 + C z c β 3,c + g 3 (z lat, z long ) + ε 3 c=1 (iv) semilog with postcode dummies y = c 4 + Dδ 4 + C z c β 4,c + c=1 250 pc=1 z pc m 4,pc + ε 4 Hill and Scholz Ottawa Group 2013 10 / 24

Our Data Set Sydney, Australia from 2001 to 2011. Our characteristics are: Transaction price Exact date of sale Number of bedrooms Number of bathrooms Land area Postcode Longitude Latitude Hill and Scholz Ottawa Group 2013 11 / 24

Our Data Set (continued) Some characteristics are missing for some houses. There are more gaps in the data in the earlier years in our sample. We have a total of 454567 transactions. All characteristics are available for only 240142 of these transactions. Hill and Scholz Ottawa Group 2013 12 / 24

Dealing with Missing Characteristics We impute the price of each house from the model below that has exactly the same mix of characteristics. (HM1): ln price = f(quarter dummy, land area, num bedrooms, num bathrooms, postcode) (HM2): ln price = f(quarter dummy, num bedrooms, num bathrooms, postcode) (HM3): ln price = f(quarter dummy, land area, num bathrooms, postcode) (HM4): ln price = f(quarter dummy, land area, num bedrooms, postcode) (HM5): ln price = f(quarter dummy, num bathrooms, postcode) (HM6): ln price = f(quarter dummy, num bedrooms, postcode) (HM7): ln price = f(quarter dummy, land area, postcode) (HM8): ln price = f(quarter dummy, postcode) Hill and Scholz Ottawa Group 2013 13 / 24

Comparing the Performance of Our Models Table 1 : Akaike information criterion for models 1-4 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 1 416 89-778 -1599-7290 -6417-8544 -10271-14059 -14953-18493 2 4888 5456 5780 5598 8635 11678 16233 11652 12819 12313 8696 3-55 -85-1093 -1571-7192 -6199-8917 -10286-15529 -14649-18520 4 4730 5337 5677 5571 8630 11677 16009 11564 12086 12307 8662 Table 2 : Sum of squared log errors for models 1-4 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 1 0.061 0.057 0.051 0.047 0.041 0.046 0.045 0.040 0.039 0.037 0.034 2 0.133 0.140 0.123 0.111 0.087 0.091 0.096 0.089 0.084 0.085 0.076 3 0.056 0.056 0.049 0.048 0.042 0.046 0.044 0.040 0.038 0.037 0.034 4 0.130 0.138 0.121 0.111 0.087 0.091 0.095 0.088 0.082 0.085 0.075 The sum of squared log errors is calculated as follows: ( ) 1 Ht SSLE t = [ln(ˆp th /p th )] 2. H t h=1 Hill and Scholz Ottawa Group 2013 14 / 24

Results (continued) The spline models significantly outperform their postcode counterparts. The GAM outperforms its semilog counterpart Repeat-Sales as a Benchmark Z SI h Z SI h = Actual Price Relative /Imputed Price Relative = p t+k,h p th / p t+k,h ˆp / t+k,h p t+k,h ˆpt+k,h = ˆp th p th p th ˆp th Hill and Scholz Ottawa Group 2013 15 / 24

Results (continued) D SI = ( ) 1 H [ln(zh SI H )]2. h=1 Table 3 : Sum of squared log price relative errors for models 1-4 Model 1-GAM spline 0.017467 2-GAM postcode 0.020900 3-semilog spline 0.016927 4-semilog postcode 0.036040 D SI Spline outperforms postcodes. Surprisingly, semilog spline outperforms GAM spline. Hill and Scholz Ottawa Group 2013 16 / 24

Price Indexes Restricted data set with no missing characteristics: Figures 1 and 2 Full data set: Figures 3 and 4 Main Findings The mean and median indexes are dramatically different when the full data set is used. Prices rise more when geospatial data is used instead of postcodes The gap is slightly smaller when the full data set is used. It is also smaller for GAM than for semilog. Hill and Scholz Ottawa Group 2013 17 / 24

Figure 1 : GAM on restricted data set SIF for post code and long/lat SIF 0.8 1.0 1.2 1.4 1.6 post code long/lat median price mean price 2002 2004 2006 2008 2010 Hill and Scholz Ottawa Group 2013 18 / 24 years

Figure 2 : Semilog on restricted data set SIF for post code and long/lat partlin SIF 0.8 1.0 1.2 1.4 1.6 post code long/lat median price mean price 2002 2004 2006 2008 2010 Hill and Scholz Ottawa Group 2013 19 / 24 years

Figure 3 : GAM on full data set SIF for post code and long/lat SIF 1.0 1.2 1.4 1.6 1.8 post code long/lat median price mean price 2002 2004 2006 2008 2010 Hill and Scholz Ottawa Group 2013 20 / 24 years

Figure 4 : Semilog on full data set SIF for post code and long/lat SIF 1.0 1.2 1.4 1.6 1.8 post code long/lat median price mean price 2002 2004 2006 2008 2010 Hill and Scholz Ottawa Group 2013 21 / 24 years

Are Postcode Based Indexes Downward Biased? A downward bias can arise when the locations of sold houses in a postcode get worse over time. We test for this as follows: Choose a postcode Calculate the mean number of bedrooms, bathrooms, land area and quarter of sale over the 11 years for that postcode. Impute using the semilog model with spline of year t (where t could be 2001,...,2011) the price of this average house in every location in which a house actually sold in 2001,...,2011 in that postcode Take the geometric mean of these imputed prices for each year. Repeat for another postcode Take the geometric mean across postcodes in each year. Hill and Scholz Ottawa Group 2013 22 / 24

Are Postcode Based Indexes Downward Biased? (continued) Questions: Does this geometric mean rise or fall over time? How much difference does it make which year s semilog model is used to impute prices? Hill and Scholz Ottawa Group 2013 23 / 24

Conclusions Splines (or some other nonparametric method) provide a flexible way of incorporating geospatial data into a house price index Switching from postcodes to geospatial data can have a big impact. Between 2001 and 2011 hosue prices rose by 60 percent based on geospatial data as compared with only 40 percent based on postcodes In our data set postcode based indexes seem to have a downward bias since they fail to account for a general shift over time in houses sold to worse locations in each postcode. Hill and Scholz Ottawa Group 2013 24 / 24