Predicting housing price


Shu Niu

Introduction

The goal of this project is to produce a model for predicting housing prices from detailed information about a house. The model can be useful for many purposes: estimating the worth of a house that is not on the market, or figuring out which components factor most into a house's value, which can help decide whether a remodel makes sense financially.

Data collection

For this project I wrote a script to crawl recently sold listings on Zillow.com, collecting details about houses sold in the last three years. I focused the data set on the South Bay area, with the expectation that its listings adhere to a reasonably consistent pattern and trend. Here is a sample listing collected from Zillow:

Address: 2042 Calle Mesa Alta, Milpitas CA 95035
Sold Date: 2013/11/08
Sold Price: $685,000
Built Year: 1990
Interior Size: 1785 sqft
Lot Size: 1785 sqft
# Bedrooms: 3
# Bathrooms: 2.5

This information needs to be translated into numbers before it can be fed into a model. Most of the fields are straightforward:

Sold Date: represented as the number of months since 2010/01
Sold Price: represented in thousands of dollars
Interior Size, Lot Size: number of square feet
# Bedrooms: number of bedrooms
# Bathrooms: I tried to keep the inputs integral, so this field is converted to an integer equal to double the number of bathrooms

The address is hard to quantify directly. I considered using latitude and longitude as features, but there is no simple trend such as "as latitude increases, house value rises." I do expect some pattern in the intrinsic value of a location, but its contour would be too complex to model with the limited data set I have; I do, however, expect the contour to form around city boundaries and school districts.
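The conversion rules above can be sketched in code. This is a minimal sketch; the field names in `listing` are hypothetical and not the crawler's actual output format.

```python
# Hypothetical sketch of the encoding rules described above.
from datetime import date

def encode_listing(listing):
    """Convert a crawled listing (illustrative field names) into numeric features."""
    sold = listing["sold_date"]                          # a datetime.date
    months = (sold.year - 2010) * 12 + (sold.month - 1)  # months since 2010/01
    return {
        "sold_month": months,
        "price_k": listing["sold_price"] // 1000,        # thousands of dollars
        "interior_sqft": listing["interior_size"],
        "lot_sqft": listing["lot_size"],
        "built_year": listing["built_year"],
        "bedrooms": listing["bedrooms"],
        "bathrooms_x2": int(listing["bathrooms"] * 2),   # keep inputs integral
    }

# The sample listing above:
sample = {
    "sold_date": date(2013, 11, 8), "sold_price": 685_000, "built_year": 1990,
    "interior_size": 1785, "lot_size": 1785, "bedrooms": 3, "bathrooms": 2.5,
}
encoded = encode_listing(sample)   # sold_month = 46, price_k = 685, bathrooms_x2 = 5
```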
So I use the following two values to represent location:

Median price of the city: for each city I collected at least 400 recent sales, and the median price is used as a feature. This captures the base value of each city. I could have used {city, zipcode} as the unit instead, but I would not have hundreds of data points for each zipcode.

School API score: I also crawled schoolandhousing.com for each recently sold listing to collect a school score that ranges from 0 to 100. The score is computed from the API scores of the elementary, middle, and high school the house is assigned to, so it should be a fair estimate of the school-value portion of the house.

Overall I collected 5000 data points, each represented as <x_i, y_i>, where

x_i = <x_i,1, x_i,2, x_i,3, x_i,4, x_i,5, x_i,6, x_i,7, x_i,8>
x_i,1 = sale month since 2010/01
x_i,2 = interior size
x_i,3 = lot size
x_i,4 = built year
x_i,5 = number of bedrooms
x_i,6 = number of bathrooms times 2
x_i,7 = school API score
x_i,8 = median price of the city
y_i = price (in thousands of dollars)

For example, the sample listing above is represented as

x = <46, 1785, 1785, 1990, 3, 5, 88, 675>
y = 685

Data Analysis

Next I looked at the distribution of the data in each dimension, comparing it to the y value to see whether any immediate trend jumps out from the graph. Intuitively, I would expect that for each dimension j, the higher the x_j value, the higher the y value. Here is a visual representation of each x_j vs. y:

[Scatter plots of each feature vs. price: sale month, interior size, lot size, built year, # bedrooms, # bathrooms, school score, city median price]
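The trends read off the scatter plots can also be quantified numerically. The sketch below uses Pearson correlation on synthetic toy data (the real crawl is not reproduced here); the coefficients and noise levels are illustrative only.

```python
# Sketch: quantify the visual trends with Pearson correlation between each
# feature column and price. Toy data stands in for the crawled listings.
import numpy as np

rng = np.random.default_rng(0)
n = 200
interior = rng.uniform(800, 4000, n)      # interior sqft (toy values)
school = rng.uniform(0, 100, n)           # school API score (toy values)
price = 0.3 * interior + 4.0 * school + rng.normal(0, 150, n)  # price in $k

features = {"interior_size": interior, "school_score": school}
corr = {name: float(np.corrcoef(col, price)[0, 1])
        for name, col in features.items()}
for name, r in corr.items():
    print(f"{name}: r = {r:+.2f}")
```

A dimension with a strong upward trend in its scatter plot shows a correlation coefficient close to +1; a cloud with no trend stays near 0.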

From the graphs, interior size and school score show the strongest correlation with price. The other factors are less obvious, as there are also expensive houses with low x_j values.

Linear Regression

My first hypothesis is a linear model that predicts y = a^T x, where x = <1, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8>.

To estimate the error of this hypothesis, I divided the data set into 10 groups and performed 10-fold cross validation: each time, 9 groups train the linear regression model and the 10th group is used for testing. The error is the average error on the test group, and the split that produced the smallest error is used as my final model. The resulting model had an error of 322 on the testing group.

A surprising finding is that two coefficients turn out negative: a_4 and a_5, corresponding to built year and number of bedrooms, and they are consistently negative across all 10 cross-validation splits.

For built year, the intuitive expectation is that, for the same specification of house, a more recently built one costs more. However, the least-squares fit shows a negative slope between built year and price. It could be that among the data I collected, most of the expensive houses are older ones whose value is captured by a dimension that is not part of this data set. Another conjecture is that the linear model performs badly, at least for the year dimension: there is a significant difference in value between a brand-new house and a 10-year-

old house, but almost no difference between a 50-year-old and a 60-year-old house. A linear model cannot differentiate between those scenarios.

Number of bedrooms also trends negatively with price. This is surprising at first, but it can be explained: interior size is already a dimension in the data, and given the same interior size, more bedrooms means less footage per room, making the house more likely a budget house than a luxury one.

Normalization

Next, I applied a normalization to each feature j of every data point x_i:

x_i,j (normalized) = (x_i,j − mean of x_j) / standard deviation of x_j

Applying linear regression again on the normalized data set gives the following coefficients a:

<1209, 107, 282, 100, −34, −33, 50, 55, 371>

Since all the x values are normalized, we can compare the coefficients to get a sense of how much each dimension contributes to the price. On that basis, the dimensions rank from most to least important as:

x_8: median city price (371)
x_2: interior size (282)
x_1: sold month since 2010/1 (107)
x_3: lot size (100)
x_7: school score (55)
x_6: # bathrooms (50)
x_4: built year (−34)
x_5: # bedrooms (−33)

We all know that location is a very important factor in housing prices, and the median city price being the most prominent feature reflects that. One caveat: the median city price is derived from the price values of the same data set on which the model is evaluated, so it may already be adjusted to fit the data set well. A less biased indicator would be a median city price taken from census data rather than from the same data set.

Component Analysis

A different way of doing component analysis is a backward search: remove one dimension at a time, measure the error of the resulting model, and take away the dimension whose removal leaves the smallest error.
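The cross-validated error measure and the backward search can be sketched as below. This is a minimal sketch assuming numpy; the data is synthetic (with coefficients echoing the normalized fit) and stands in for the 5000 crawled listings.

```python
# Sketch of 10-fold cross-validated error and backward feature elimination.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8
X = rng.normal(size=(n, d))
true_a = np.array([107, 282, 100, -34, -33, 50, 55, 371], dtype=float)
y = X @ true_a + rng.normal(0, 50, n)

def cv_error(X, y, folds=10):
    """Average absolute test error of least-squares fits over k folds."""
    idx = np.array_split(np.arange(len(y)), folds)
    errs = []
    for k in range(folds):
        test = idx[k]
        train = np.concatenate([idx[j] for j in range(folds) if j != k])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])  # add intercept
        a, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ a
        errs.append(np.mean(np.abs(pred - y[test])))
    return float(np.mean(errs))

def backward_search(X, y):
    """Repeatedly remove the dimension whose removal leaves the smallest error."""
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        drop = min(remaining,
                   key=lambda j: cv_error(X[:, [c for c in remaining if c != j]], y))
        remaining.remove(drop)
        order.append(drop)
    return order + remaining  # last entry is the most important dimension

removal_order = backward_search(X, y)
```

On this synthetic data the column with the largest true coefficient (the analogue of median city price) survives to the end, mirroring the result reported below.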
By doing so, this is the order in which dimensions are taken away (with the model's error after removing that dimension in parentheses):

x_5: # bedrooms (323)
x_6: # bathrooms (324)
x_4: built year (325)
x_7: school score (327)

x_1: sold month since 2010/1 (335)
x_3: lot size (346)
x_2: interior size (448)
x_8: median city price (the last dimension remaining)

I also did a forward search for the most prominent dimensions: starting from zero dimensions, each step adds the dimension that leaves the smallest error in the model. It resulted in the same order of components as the backward search.

Comparing these with the order found via normalization, the ordering is almost identical; the only dimensions that swap places are those with very close coefficients under normalization and very close additional errors in the backward search. This further solidifies the conjecture that location is the most important factor, followed by the size of the house.

Quadratic Model

The final model I used is a normalized quadratic model. I converted each data point from <1, x_1, x_2, ..., x_8> to <1, x_1n, x_2n, ..., x_8n, x_1n·x_1n, x_1n·x_2n, ..., x_8n·x_8n>, where x_in is x_i normalized to the number of standard deviations from the mean. This expands the dimensions from 9 to 45, which should still be modelable with 5000 data points.

Using the same 10-fold cross validation as for the linear model, the error of the quadratic model is 271, better than the error of 322 produced by the linear model. This indicates that a quadratic function is a better-fitting model in this case.

Comparing the coefficients of all the quadratic terms, the following are the top three combinations with the highest coefficients (in parentheses):

x_1 · x_7: sold month · school score (42)
x_7 · x_8: school score · median city price (39)
x_2 · x_8: interior size · median city price (39)

Most of these involve the most valuable dimensions seen in the linear model, with school score having an increased effect here.
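The quadratic feature expansion described above can be sketched as follows (a minimal sketch assuming numpy; the toy input matrix is illustrative). Eight normalized features plus the constant give 9 columns; appending all 36 pairwise products x_in · x_jn with i ≤ j (28 cross terms plus 8 squares) brings the total to 45.

```python
# Sketch: z-score each column, then append all pairwise products.
import numpy as np

def quadratic_features(X):
    """Expand an n-by-d matrix to <1, z_1..z_d, z_i*z_j for i <= j>."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column
    n, d = Z.shape
    cols = [np.ones(n)] + [Z[:, j] for j in range(d)]
    for i in range(d):
        for j in range(i, d):                  # d*(d+1)/2 quadratic terms
            cols.append(Z[:, i] * Z[:, j])
    return np.column_stack(cols)

X = np.arange(24.0).reshape(3, 8) ** 1.5       # toy 3x8 data matrix
Phi = quadratic_features(X)                    # shape (3, 45)
```

The expanded matrix Phi can then be fed to the same least-squares fit and 10-fold cross validation used for the linear model.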