How to Price a House

1 How to Price a House: An Interpretable Bayesian Approach
Dustin Lennon, dustin@inferentialist.com
Inferentialist Consulting, Seattle, WA
April 9, 2014

2 Introduction
A project to tie up loose ends; it came out of interview prep for Climate Corp.
Disclaimer: a two-week sprint, not a dissertation.
An easier version of a more involved spatio-temporal model for zipcode aggregation.

3 Outline
1 Size of Housing Market; Modeling/Technology Gap
2 Model Specification; General Model Formulation; Model Fitting
3 Data; Scalability and Sampling; Model Output; Model Validation
4 Scalability & Sparsity; Optimization
5

4 Housing Market
A few Wikipedia facts:
Outstanding U.S. residential mortgages: $10.6 trillion as of midyear 2008.
By August 2008, 9.2% of all U.S. mortgages outstanding were either delinquent or in foreclosure.

5 Housing Market
[figure]

6 A Valuation Problem?
Subprime loans, yes, but was there also a systemic failure in estimating home values?

7 Temporal Instability: Trulia
Seasonality, perhaps. But a sliding median approach breaks down as the window size goes to zero.
Page accessed on 6/4/2014.

8 Overfitting: Zestimates
The time series appears to chase the listing data, stays elevated for a time, then abruptly returns to baseline.
Page accessed on 6/4/2014.

9 Spatial Instability: Zestimates
The time series appears to adjust to the recently added zipcode-level information, perhaps indicating some spatial instability when adjusting to new data.
Page accessed on 6/4/2014.

10 Ad-hoc Analysis
Limiting case failures
Lack of regularization / prior information
Uninterpretable models

12 Model Specification I
Decompose home value into constituent parts:
$$Z_i = x_i^t \beta + a_i Y(s_i) + \delta_i$$
$Z_i$: price paid for the $i$th home
$x_i$: covariates associated with $\beta$ [e.g., square footage]
$\beta$: coefficients fixed across space [e.g., build cost per square foot]
$a_i$: lot size
$Y(s)$: unit cost of land
$s_i$: location
$\delta_i$: difference between the true value and the price paid

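13 Model Specification II
Data model:
$$[Z \mid \beta, Y] \sim N\left( \begin{bmatrix} X & A \end{bmatrix} \begin{bmatrix} \beta \\ Y \end{bmatrix},\ \Delta \right), \qquad \Delta = \mathrm{diag}\left[\sigma^2 z_1^2, \ldots, \sigma^2 z_n^2\right]$$
Process model:
$$[\beta, Y] = [\beta][Y]$$
$$[\beta] \sim N(\nu, \Phi), \qquad \Phi = \mathrm{diag}([\phi_1, \ldots, \phi_k])$$
$$[Y] \sim N(\tau \mathbf{1}, \Sigma), \qquad \Sigma = \Sigma(\theta)$$
($\Delta$ denotes the data-model covariance; the original symbol was lost in extraction, so $\Delta$ is a stand-in.)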

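14 Model Specification III
$\sigma^2$: interpretable as a (squared) coefficient of variation, since the standard deviation of $\delta_i$ is $\sigma z_i$
$\Sigma(\theta)$: defines the covariance structure of the land value term
In particular, $\Sigma(\theta)$ is specified through an isotropic Matérn covariance function:
$$\Sigma_{ij}(\theta) \equiv C(d_{ij}; \theta_1, \theta_2, \sigma_0^2, \sigma_1^2) = \sigma_0^2 I_0(d_{ij}) + \sigma_1^2 \frac{1}{2^{\theta_2 - 1}\,\Gamma(\theta_2)} \left(\frac{d_{ij}}{\theta_1}\right)^{\theta_2} K_{\theta_2}\!\left(\frac{d_{ij}}{\theta_1}\right)$$
where $I_0(d)$ is the indicator that $d = 0$, $K_{\theta_2}$ is the modified Bessel function of the second kind, and $d_{ij}$ is the Euclidean distance between $s_i$ and $s_j$.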
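The covariance function on this slide can be evaluated numerically. The sketch below is illustrative rather than the author's implementation: it uses only the standard library and computes the Bessel function from its integral representation with a trapezoid rule. In practice `scipy.special.kv` would be the usual choice. The function names and the `steps` parameter are assumptions made for this example.

```python
import math

def bessel_k(nu, x, steps=4000):
    """Modified Bessel function of the second kind, K_nu(x), for x > 0.

    Uses K_nu(x) = integral_0^inf exp(-x cosh t) cosh(nu t) dt,
    approximated by the trapezoid rule on a truncated range.
    """
    # Past t_max, exp(-x cosh t) is below exp(-700) and contributes nothing.
    t_max = math.asinh(700.0 / x) + 1.0
    h = t_max / steps
    total = 0.5 * math.exp(-x)  # t = 0 endpoint, where cosh(0) = 1
    for k in range(1, steps + 1):
        t = k * h
        weight = 0.5 if k == steps else 1.0
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return h * total

def matern(d, theta1, theta2, sigma0_sq, sigma1_sq):
    """Matern covariance at distance d: nugget sigma0_sq, variance sigma1_sq,
    spread theta1, shape theta2."""
    if d == 0.0:
        # Nugget plus the d -> 0 limit of the correlation term, which is 1.
        return sigma0_sq + sigma1_sq
    u = d / theta1
    scale = 1.0 / (2.0 ** (theta2 - 1.0) * math.gamma(theta2))
    return sigma1_sq * scale * u ** theta2 * bessel_k(theta2, u)
```

A quick sanity check on the parametrization: with shape $\theta_2 = 1/2$ the correlation term reduces to the exponential kernel $e^{-d/\theta_1}$, and with $\theta_2 = 3/2$ it reduces to $(1 + d/\theta_1)\,e^{-d/\theta_1}$.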

15 Model Specification III
[figure]

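16 General Model Formulation
Hierarchical formulation:
$$[Z \mid G] \sim N(MG, \Delta), \qquad [G] \sim N(\mu, \Omega)$$
Joint distribution:
$$[Z, G] \sim N\left( \begin{bmatrix} M\mu \\ \mu \end{bmatrix},\ \begin{bmatrix} \Delta + M\Omega M^t & M\Omega \\ \Omega M^t & \Omega \end{bmatrix} \right)$$
Posterior distribution:
$$[G \mid Z] \sim N(\tilde\mu, \tilde\Omega)$$
$$\tilde\mu \equiv \mu + \Omega M^t \left(\Delta + M\Omega M^t\right)^{-1} (Z - M\mu)$$
$$\tilde\Omega \equiv \Omega - \Omega M^t \left(\Delta + M\Omega M^t\right)^{-1} M\Omega$$
Here $\Delta$ is the covariance of $Z$ given $G$, and $\tilde\mu$, $\tilde\Omega$ are the posterior mean and covariance of $G$.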
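The posterior update on this slide is ordinary Gaussian conditioning, and a small dense-matrix sketch may make it concrete. It assumes NumPy is available; the function name `gaussian_posterior` and the toy numbers are made up for illustration, and `Delta` stands for the covariance of Z given G.

```python
import numpy as np

def gaussian_posterior(Z, M, Delta, mu, Omega):
    """Posterior mean and covariance of G given Z for
    [Z | G] ~ N(M G, Delta) and [G] ~ N(mu, Omega)."""
    S = Delta + M @ Omega @ M.T          # marginal covariance of Z
    # Omega M^t S^{-1}, obtained by a solve; relies on S and Omega being symmetric.
    K = np.linalg.solve(S, M @ Omega).T
    mu_post = mu + K @ (Z - M @ mu)
    Omega_post = Omega - K @ M @ Omega
    return mu_post, Omega_post

# Toy case: two noisy observations of a single scalar quantity.
Z = np.array([1.0, 2.0])
M = np.array([[1.0], [1.0]])
Delta = np.eye(2)
mu = np.array([0.0])
Omega = np.array([[1.0]])
mu_post, Omega_post = gaussian_posterior(Z, M, Delta, mu, Omega)
print(mu_post, Omega_post)
```

In the toy case the posterior precision is 1 + 2 = 3, so the posterior variance is 1/3 and the posterior mean is (1 + 2)/3 = 1. Using a linear solve rather than an explicit inverse is a design choice for numerical stability; it does not change the O(n^3) cost of the dense approach.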

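17 Fitting the Model
Inference is on the posterior distribution $[G \mid Z; \Theta]$.
Specialize the general case to the hedonic model.
EM algorithm to obtain $\hat\Theta$. Iterate until convergence:
  update $\tilde\mu$, $\tilde\Omega$
  minimize $-2E[\log [Z, G] \mid Z; \Theta]$
Up to an additive constant,
$$\begin{aligned}
-2E[\log [Z, G] \mid Z; \Theta] = {} & \log\det\Delta + \log\det\Omega + Z^t \Delta^{-1} Z + \mu^t \Omega^{-1} \mu \\
& - 2\left[ Z^t \Delta^{-1} M + \mu^t \Omega^{-1} \right] \tilde\mu \\
& + \tilde\mu^t \left[ M^t \Delta^{-1} M + \Omega^{-1} \right] \tilde\mu \\
& + \mathrm{tr}\left[ \left( M^t \Delta^{-1} M + \Omega^{-1} \right) \tilde\Omega \right]
\end{aligned}$$
where $\Delta$ is the covariance of $Z$ given $G$, and $\tilde\mu$, $\tilde\Omega$ are the posterior mean and covariance of $G$.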
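One plausible reading of "specialize the general case to the hedonic model" is to stack the coefficients and the land-value field into $G$. The mapping below is inferred from the data and process models earlier in the deck rather than stated on a slide. In particular, taking $A$ to be the diagonal matrix of lot sizes is an assumption, and $\Delta$ denotes the covariance of $Z$ given $G$.

```latex
G = \begin{bmatrix} \beta \\ Y \end{bmatrix}, \qquad
M = \begin{bmatrix} X & A \end{bmatrix}, \qquad
A = \mathrm{diag}(a_1, \ldots, a_n), \qquad
\mu = \begin{bmatrix} \nu \\ \tau \mathbf{1} \end{bmatrix}, \qquad
\Omega = \begin{bmatrix} \Phi & 0 \\ 0 & \Sigma(\theta) \end{bmatrix}, \qquad
\Delta = \mathrm{diag}\left[ \sigma^2 z_1^2, \ldots, \sigma^2 z_n^2 \right]
```

Under this reading, $\Theta$ collects $\sigma$, $\nu$, $\phi$, $\tau$, $\theta_1$, $\theta_2$, $\sigma_0^2$ and $\sigma_1^2$, and $MG = X\beta + AY$ reproduces the decomposition $Z_i = x_i^t \beta + a_i Y(s_i) + \delta_i$.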

19 Data: Maps
TIGER/Line Shapefile Data

20 Data: Home Sales
King County Department of Assessments
Table Joins:
  Real Property Sales (non-flagged 2012 records)
    Exempt From Excise Tax
    Related Party, Friend, or Neighbor
    Quit Claim Deed
    Multi-Parcel Sale
  Residential Buildings
  Parcel Information
Outlier Filtering:
  Sale Price: $100k to $5m
  Lot Size 1.03 acres
  No properties with multiple sale records in ,812 homes

21 Data: Geocoding (Yahoo)
2012: KC records have UID, street address, no lat/long
2014: Sporadic lat/long (Seattle, not Tacoma)
Yahoo geocoder: bash script, 500k lookups over two weeks
curl -s "

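22 Scalability and Sampling I
Recall the objective function to be optimized on each iteration of the EM algorithm (up to an additive constant):
$$\begin{aligned}
-2E[\log [Z, G] \mid Z; \Theta] = {} & \log\det\Delta + \log\det\Omega + Z^t \Delta^{-1} Z + \mu^t \Omega^{-1} \mu \\
& - 2\left[ Z^t \Delta^{-1} M + \mu^t \Omega^{-1} \right] \tilde\mu \\
& + \tilde\mu^t \left[ M^t \Delta^{-1} M + \Omega^{-1} \right] \tilde\mu \\
& + \mathrm{tr}\left[ \left( M^t \Delta^{-1} M + \Omega^{-1} \right) \tilde\Omega \right]
\end{aligned}$$
where $\Delta$ is the covariance of $Z$ given $G$, and $\tilde\mu$, $\tilde\Omega$ are the posterior mean and covariance of $G$.
Naive approach with dense matrices:
  extremely memory intensive
  $O(n^3)$ cost to compute inverse
Solution: sample, weighted by inverse local density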
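The slide does not say how local density was estimated or how the weighted sample was drawn. As one possible sketch, the code below approximates density by counting points per square grid cell and then draws a weighted sample without replacement using the Efraimidis-Spirakis key trick. The grid-cell density estimate, the function name, and the `cell_size` parameter are all assumptions made for illustration.

```python
import random
from collections import Counter

def inverse_density_sample(points, k, cell_size, seed=0):
    """Draw k distinct points, favoring sparsely populated grid cells.

    Local density is approximated by the number of points sharing a
    square grid cell of side `cell_size`; each point's sampling weight
    is the reciprocal of that count.
    """
    rng = random.Random(seed)
    cells = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    counts = Counter(cells)
    # Efraimidis-Spirakis weighted sampling without replacement:
    # keep the k largest keys u ** (1 / w) with u ~ Uniform(0, 1).
    # Here w = 1 / count, so the key is u ** count.
    keyed = [(rng.random() ** counts[c], i) for i, c in enumerate(cells)]
    keyed.sort(reverse=True)
    return [points[i] for _, i in keyed[:k]]
```

With these weights every occupied cell carries the same total weight, so a dense urban cell and a sparse rural cell are roughly equally likely to contribute a home to the subsample, which spreads the sample more evenly over space than uniform sampling would.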

23 Scalability and Sampling II
[figure]

24 Model Output: Coefficients
Parameter | Estimate | Description
$\sigma$ | | coefficient of variation [active constraint]
$\nu_1, \phi_1$ | (139.51, ) | build cost per square foot (living)
$\nu_2, \phi_2$ | (0.00, ) | build cost per square foot (basement)
$\nu_3, \phi_3$ | (0.00, ) | build cost per square foot (garage)
$\tau$ | 7.19 | lot size cost per square foot
$\theta_1$ | | Matérn spread parameter [active constraint]
$\theta_2$ | | Matérn shape parameter [active constraint]
$\sigma_0^2$ | | Matérn nugget effect [active constraint]
$\sigma_1^2$ | | Matérn variance

25 Model Output: Heatmaps
Need the predictive distribution $[y_0 \mid Z]$:
$$E[y_0 \mid Z] = E[E(y_0 \mid Y, Z) \mid Z] = E[E(y_0 \mid Y) \mid Z]$$
$$\begin{aligned}
\mathrm{Var}[y_0 \mid Z] &= \mathrm{Var}[E(y_0 \mid Y, Z) \mid Z] + E[\mathrm{Var}(y_0 \mid Y, Z) \mid Z] \\
&= \mathrm{Var}[E(y_0 \mid Y) \mid Z] + E[\mathrm{Var}(y_0 \mid Y) \mid Z]
\end{aligned}$$
$[y_0 \mid Y]$ is immediate: extend $\Sigma(\theta)$
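To make the two-stage decomposition concrete, here is a minimal NumPy sketch of the predictive mean and variance at one new location. It assumes that the posterior mean `m` and covariance `S` of Y given Z are already available, and that `c0` and `c00` come from extending the Matérn covariance to the new site. All names are illustrative, not taken from the talk.

```python
import numpy as np

def predictive_moments(c0, c00, Sigma, tau, m, S):
    """Mean and variance of y0 given Z via the tower property.

    c0    : prior covariances between y0 and the n observed-site land values
    c00   : prior variance of y0
    Sigma : prior covariance of Y (n x n); tau : prior mean of land value
    m, S  : posterior mean and covariance of Y given Z
    """
    w = np.linalg.solve(Sigma, c0)   # weights Sigma^{-1} c0
    mean = tau + w @ (m - tau)       # E[ E(y0 | Y) | Z ]
    var_of_mean = w @ S @ w          # Var[ E(y0 | Y) | Z ]
    mean_of_var = c00 - c0 @ w       # E[ Var(y0 | Y) | Z ], constant in Y
    return mean, var_of_mean + mean_of_var
```

Two limiting cases are useful checks: if the data carry no information (posterior equals prior), the function returns the prior mean `tau` and prior variance `c00`; if Y is known exactly and the new site coincides with an observed one, the variance collapses to zero.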

27 Model Comparison
[figure]

28 Model Validation
Not a predictive model; attempts to characterize variation.
Out-of-sample coverage of 95% confidence intervals:
  Process: 86.7%
  Process + Proxy: 92.0%
  Process + Data: 97.2%
Conclusion: the typical variability in a home's sale price is inherently large.
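Coverage figures like these can be computed as the fraction of held-out sale prices that fall inside their predicted intervals. The sketch below assumes symmetric normal intervals built from a predictive mean and standard deviation; the slides do not say how the intervals were constructed, and the function name is hypothetical.

```python
def interval_coverage(actual, pred_mean, pred_sd, z=1.96):
    """Fraction of held-out values inside mean +/- z * sd."""
    hits = sum(
        1
        for y, m, s in zip(actual, pred_mean, pred_sd)
        if m - z * s <= y <= m + z * s
    )
    return hits / len(actual)
```

Coverage below the nominal 95% suggests the variance components included are too small to explain the held-out prices, and coverage above it suggests they are too large.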

30 Scalability I
Goal: the linear algebra operations needed to evaluate the objective function and its gradient should involve only:
  sparse matrices
  low rank perturbations to sparse matrices
  matrices arbitrarily close to sparse matrices under reasonable parameter choices
Larger sample sizes require a sparse representation.
Specializing the general model:
  $M$ is sparse; $\Omega$ decomposes into a diagonal and the Matérn matrix, $\Sigma(\theta)$.
  For $\theta_1$ small and $\theta_2$ bounded, $\Sigma(\theta)$ is arbitrarily close to a sparse matrix.
  For $\theta_1$ and $\theta_2$ bounded, $\Sigma(\theta)$ is well conditioned, relative to the underlying Euclidean distances.
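The claim that a small $\theta_1$ makes $\Sigma(\theta)$ nearly sparse can be illustrated with the exponential kernel $e^{-d/\theta_1}$, which is the Matérn correlation at shape 1/2. The sketch counts how many off-diagonal entries fall below a tolerance on a hypothetical regular grid; the grid, its spacing, and the tolerance are invented for the example.

```python
import math

def fraction_negligible(coords, theta1, tol=1e-8):
    """Share of off-diagonal correlation entries with exp(-d / theta1) < tol."""
    n = len(coords)
    small = 0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            pairs += 1
            if math.exp(-d / theta1) < tol:
                small += 1
    return small / pairs

# Hypothetical 10 x 10 grid of homes spaced 1000 distance units apart.
grid = [(1000.0 * i, 1000.0 * j) for i in range(10) for j in range(10)]
for theta1 in (100.0, 500.0, 2000.0):
    print(theta1, round(fraction_negligible(grid, theta1), 3))
```

Entries below the tolerance could be dropped to obtain a sparse approximation. As $\theta_1$ grows, fewer entries qualify and the matrix becomes effectively dense.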

31 Scalability I
For $\theta_1 = 500$:
[figure]

32 Scalability II: More on $\theta_1$
$\hat\theta_1$ is an active constraint, at the upper bound
  reflects a desire to increase the spatial scale of correlation; smoother surface
Conclusion: the upper bound enforced on $\theta_1$ should be interpreted as a model complexity parameter
  keeping $\theta_1$ small increases the sparsity of $\Sigma(\theta)$ and decreases the scale of the spatial correlation effect
  choose the upper bound via cross validation

33 Inner Optimization
The EM algorithm requires an inner optimization.
Dynamically adjust the convergence tolerance (optim/factr) in early iterations for speed.
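In R's `optim`, the L-BFGS-B method stops when the reduction in the objective is within `factr` times machine epsilon, with a default `factr` of 1e7. The slide does not give the schedule actually used. One simple possibility, sketched below with made-up starting value and decay rate, is to begin with a loose tolerance and tighten it geometrically across EM iterations.

```python
def factr_schedule(iteration, loose=1e12, tight=1e7, decay=0.1):
    """Inner-solver tolerance for a given EM iteration (0-based).

    Starts at `loose` and shrinks by a factor of `decay` per iteration
    until it reaches `tight`.
    """
    return max(tight, loose * decay ** iteration)

# Tolerances that would be passed to the inner optimizer on successive EM steps.
tolerances = [factr_schedule(i) for i in range(8)]
```

Early EM iterates are far from the optimum, so solving the inner problem to high precision there mostly wastes time. The tight tolerance matters only near convergence.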

35
The Hedonic Bayesian model needs very few parameters to describe a complex spatial field.
The model does a good job describing the variability inherent in the data.
Future Work
  Experimentation with smaller $\sigma^2$; cross validation of the $\theta_1$ upper bound
  Increase scalability through a more thorough approach to sparsity
