Zillow is a U.S. tech real-estate marketplace company with information on approximately 110 million homes in the United States. The company provides several features, including home value estimates, value changes for each home over a specified time period, aerial views of homes, and comparable home prices in the area. It also offers basic information about a specific home, such as square footage and the number of bedrooms and bathrooms. Users can also get estimates for homes that have undergone major changes, such as a kitchen remodel.
Zillow has recognized that its housing market forecasts are not as accurate as they could be due to a lack of regional intelligence, which is one of the company's core functions. This project was therefore established to diagnose the problem and build a better predictive model of home prices in Mecklenburg County, North Carolina for Zillow's use. To achieve this, we gathered and imported data as predictors of house prices, ranging from schools, crimes, and neighborhoods to attributes of the houses themselves, such as bedrooms and heated area; all data come from open-data websites for Mecklenburg County. Ordinary least squares (OLS) regression, a linear least squares method, is applied to build the model and test the accuracy of the predicted house prices. The dataset was also split into a training set and a testing set to train the model and assess its accuracy and generalizability, using spatial lag, Moran's I, and cross-validation.
Among all the parts of this project, collecting and selecting the different categories of data was the most difficult. We needed to gather different categories of data from different websites and decide what potential impact each might have on house prices. Since we collected both point and polygon data, we needed to choose appropriate methods to combine them into the same data set (a sketch of one such join follows). When testing the model, any change in the data can affect the accuracy of the results, so we also had to iteratively add and remove variables.
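As a minimal sketch of how polygon attributes can be attached to point data, assuming two sf layers in a shared CRS (the object names houses and tracts here are hypothetical, not the ones used later):
library(sf)
# Point-in-polygon join: each house point inherits the attributes of the
# census tract polygon that contains it.
houses_with_tracts <- st_join(houses, tracts, join = st_within)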
The results indicate that the model achieves reasonable accuracy: the mean absolute percentage error (MAPE) is approximately 0.265 and the R-squared is approximately 0.685. Although there are some limitations in generalization, the model established in this project can be used by Zillow for house price prediction with relatively high accuracy and generalizability.
We collected our data from the following sources:
From the City of Charlotte Open Data Portal, we collected point data such as crimes, schools, parks, and medical services. We extracted their geographic coordinates separately to calculate the average distance from each house to its nearest features. From the open data map, we collected the tract data and joined it to our base data set. We also collected census data using the tidycensus package.
# Aggravated assault points, projected to a feet-based state plane CRS
crimes.sf <-
  crime %>%
  filter(HIGHEST_NIBRS_DESCRIPTION == "Aggravated Assault",
         LATITUDE_PUBLIC > -1) %>%
  dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC) %>%
  na.omit() %>%
  st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = "EPSG:4326") %>%
  st_transform('ESRI:102719') %>%
  distinct()
# Count of crimes within a 660 ft (1/8 mile) buffer of each house sale
newdat$crimes.Buffer <- newdat %>%
  st_buffer(660) %>%
  aggregate(mutate(crimes.sf, counter = 1), ., sum) %>%
  pull(counter)
## Nearest neighbor features: average distance to the k nearest crimes
newdat <-
  newdat %>%
  mutate(
    crime_nn1 = nn_function(st_coordinates(newdat), st_coordinates(crimes.sf), k = 1),
    crime_nn2 = nn_function(st_coordinates(newdat), st_coordinates(crimes.sf), k = 2),
    crime_nn3 = nn_function(st_coordinates(newdat), st_coordinates(crimes.sf), k = 3),
    crime_nn4 = nn_function(st_coordinates(newdat), st_coordinates(crimes.sf), k = 4),
    crime_nn5 = nn_function(st_coordinates(newdat), st_coordinates(crimes.sf), k = 5))
st_c <- st_coordinates
# Average distance to the 3 nearest amenities of each type
newdat <-
  newdat %>%
  mutate(
    commercial_nn3 = nn_function(na.omit(st_c(newdat)), na.omit(st_c(Shoppingcenter))[, c(1, 2)], 3),
    park_nn3       = nn_function(na.omit(st_c(newdat)), na.omit(st_c(parks)), 3),
    schools_nn3    = nn_function(na.omit(st_c(newdat)), na.omit(st_c(School))[, c(1, 2)], 3),
    med_nn3        = nn_function(na.omit(st_c(newdat)), na.omit(st_c(medical))[, c(1, 2)], 3),
    stop_nn3       = nn_function(na.omit(st_c(newdat)), na.omit(st_c(bus))[, c(1, 2)], 3)
  )
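The nn_function helper used above is sourced externally (it is part of the MUSA course helper functions); a minimal equivalent, assuming the FNN package, computes the mean distance from each origin point to its k nearest destination points:
library(FNN)
# Mean Euclidean distance from each row of measureFrom to its k nearest
# rows of measureTo (both are coordinate matrices)
nn_function <- function(measureFrom, measureTo, k) {
  nn.dist <- get.knnx(measureTo, measureFrom, k = k)$nn.dist
  rowMeans(nn.dist)
}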
Before the analysis, we gathered three kinds of variables to help us predict housing prices more precisely.
a. Internal characteristics: The internal housing characteristics were given to us in StudentData.geojson. The following variables are valuable for our analysis: price, year built (age), number of rooms (bedrooms, bathrooms, fireplaces, units, etc.), area, etc.
b. Public amenities: We assess the impact of external factors on house prices by calculating each house's average distance to the three closest instances of several services. Here we calculated accessibility to schools, shopping centers, parks, medical services, jobs, and bus stops. All of these variables are named "xxx_nn3".
c. Census data: We also imported the 2020 census data for Mecklenburg County using the get_acs command; from it we derived total population, median household income, education percentage, white percentage, poverty, etc. A hedged example of this pull is sketched below.
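A minimal sketch of the tidycensus pull; the ACS variable codes and renames here are illustrative assumptions, not necessarily the exact set used:
library(tidycensus)
# Illustrative ACS 5-year variables: total population (B01003_001) and
# median household income (B19013_001). With output = "wide", tidycensus
# appends E (estimate) and M (margin of error) suffixes to each name.
# Education, race, and poverty variables follow the same pattern.
acsTractsBD20 <- get_acs(geography = "tract",
                         variables = c(total_pop = "B01003_001",
                                       med_hh_income = "B19013_001"),
                         state = "NC", county = "Mecklenburg",
                         year = 2020, geometry = TRUE, output = "wide")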
stargazer(st_drop_geometry(newdat),
          type = 'text',
          title = "Summary Statistics")
##
## Summary Statistics
## ========================================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------------------------------
## condo_town 46,183 0.000 0.000 0 0
## parcel_typ 46,183 0.000 0.000 0 0
## totalac 46,183 1.772 106.717 0.000 9,757.440
## price 46,183 465,686.800 602,409.800 0 75,500,000
## cardno 46,183 1.000 0.000 1 1
## yearbuilt 46,183 1,993.386 31.836 0 2,022
## heatedarea 46,183 2,359.428 1,060.335 0.000 14,710.000
## numfirepla 46,183 0.787 0.465 0 7
## fullbaths 46,183 2.282 0.825 0 9
## halfbaths 46,183 0.596 0.528 0 4
## bedrooms 46,183 3.510 0.943 0 65
## units 46,183 0.981 0.969 0 205
## ownerno 46,183 1.000 0.000 1 1
## landsequen 46,183 1.001 0.040 0 3
## shape_Leng 46,183 504.273 253.070 152.910 10,798.810
## shape_Area 46,183 15,875.080 35,033.910 1,139.637 3,486,865.000
## sale_year 46,183 2,020.884 0.765 2,020 2,028
## musaID 46,183 23,092.000 13,332.030 1 46,183
## h_price 46,172 359,076.800 195,210.200 101,660.000 1,980,920.000
## h_age 46,175 31.595 16.307 4.000 83.000
## h_size 46,181 2,305.622 744.765 914.330 5,569.507
## employ 46,063 1.843 1.289 0 5
## total_pop 46,180 4,066.102 1,310.649 790 7,703
## med_hh_income 46,180 86,175.660 37,415.660 17,685 238,750
## total_belpov100 46,180 357.862 355.277 0 2,241
## pctvacancy 46,180 0.058 0.043 0.000 0.293
## pctwhite 46,180 0.563 0.262 0.011 0.989
## pctunemployed 46,180 0.046 0.039 0.000 0.225
## pctedu 46,180 0.111 0.065 0.004 0.322
## crimes.Buffer 16,939 2.141 1.466 1 11
## crime_nn1 46,183 3,738.334 7,091.211 43.003 42,592.160
## crime_nn2 46,183 4,176.513 7,446.173 89.984 44,682.910
## crime_nn3 46,183 4,501.107 7,714.390 99.900 45,638.190
## crime_nn4 46,183 4,811.474 7,979.975 118.446 46,552.480
## crime_nn5 46,183 5,069.228 8,195.000 182.700 47,361.670
## commercial_nn3 46,183 7,340.499 4,311.997 484.963 32,116.810
## park_nn3 46,183 5,670.022 2,655.575 295.369 16,925.850
## schools_nn3 46,183 5,973.619 3,126.916 653.722 24,399.970
## med_nn3 46,183 6,744.319 4,478.215 149.571 26,588.100
## stop_nn3 46,183 4,916.262 4,638.132 145.797 32,479.790
## ------------------------------------------------------------------------
Below is the correlation matrix for the numeric variables. The deeper the color, the stronger the correlation between the variables.
numericVars <-
  select_if(st_drop_geometry(newdat), is.numeric) %>%
  na.omit() %>%
  select(-cardno, -ownerno, -parcel_typ, -condo_town,
         -crime_nn1, -crime_nn2, -crime_nn4, -crime_nn5)
ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("#001b48", "#d6e8ee", "#FA7800"),
  type = "lower",
  insig = "blank") +
  labs(title = "Correlation across numeric variables")
Below are eight correlation scatterplots of home price as a function of eight independent variables. crime_nn3 and stop_nn3 are the average distances to the three nearest crimes and bus stops, respectively; employ is a job-accessibility score collected from the open data map.
## Home features correlation scatterplots
st_drop_geometry(newdat) %>%
  dplyr::select(price, heatedarea, bedrooms, h_age, pctwhite, med_hh_income,
                crime_nn3, stop_nn3, employ) %>%
  filter(price <= 2000000, h_age < 500, bedrooms < 10) %>%
  gather(Variable, Value, -price) %>%
  ggplot(aes(Value, price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Price as a function of continuous variables") +
  plotTheme()
Here we present a map showing the distribution of houses and their sale prices in Mecklenburg County. The darker the color, the higher the sale price. From the map, it is easy to tell that high-priced houses are concentrated in the central and northern parts of the city, while low-priced houses are mainly distributed between the center and the border.
# Mapping sale prices
ggplot() +
  geom_sf(data = saleprice, fill = "grey90") +
  geom_sf(data = newdat, aes(colour = q5(price)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(newdat, "price"),
                      name = "Quintile\nBreaks") +
  labs(title = "Sale Price, Mecklenburg") +
  mapTheme()
Following are three maps showing the spatial distribution of features that may relate to housing prices.
a. Map of crime density and houses' exposure to crime
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  stat_density2d(data = data.frame(st_coordinates(crimes.sf)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#FAE8E8", high = "#ab3141", name = "Crime Density") +
  geom_sf(data = newdat, aes(colour = q5(crime_nn3)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(newdat, "crime_nn3"),
                      name = "Average distance to 3 nearest crimes") +
  labs(title = "Crimes and Houses, Mecklenburg") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
b. Map of school density and houses' exposure to schools
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  stat_density2d(data = data.frame(st_coordinates(School)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#FAE8E8", high = "#ab3141", name = "School Density") +
  geom_sf(data = newdat, aes(colour = q5(schools_nn3)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(newdat, "schools_nn3"),
                      name = "Average distance to 3 nearest schools") +
  labs(title = "Schools and Houses, Mecklenburg") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
c. Vacant houses and tract house prices
vacanthouse <- newdat %>%
  mutate(pctvacancy = pctvacancy * 100) %>%
  dplyr::select(pctvacancy)
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = saleprice, aes(fill = q5(X2021)), colour = NA) +
  scale_fill_manual(values = p52,
                    labels = qBr(saleprice, "X2021"),
                    name = "Tract house prices") +
  geom_sf(data = vacanthouse, aes(colour = q5(pctvacancy)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(vacanthouse, "pctvacancy"),
                      name = "pctvacancy") +
  labs(title = "Tract house prices and vacancy") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
1. These are correlation scatterplots between price and the average distance from each house to its k nearest crimes. By comparison, distance to crime does not appear to have a significant impact on housing prices.
## Crime correlation scatterplots
newdat %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("crime_")) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>%
  ggplot(aes(Value, price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#FA7800") +
  facet_wrap(~Variable, nrow = 2, scales = "free") +
  labs(title = "Price as a function of distance to crime") +
  plotTheme()
2. Below are histograms of all numeric variables, from which we can read their distributions.
ggplot(gather(numericVars), aes(value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~key, nrow = 6, scales = 'free_x')
3. Here we pick out heated area as an independent variable and test its correlation with price. The Pearson correlation coefficient is 0.68, suggesting that area has a relatively strong correlation with housing price.
# (Despite the name, this keeps sales up to $2M)
price_200k <- st_drop_geometry(newdat) %>%
  filter(price <= 2000000)
cor.test(price_200k$heatedarea,
         price_200k$price,
         method = "pearson")
##
## Pearson's product-moment correlation
##
## data: price_200k$heatedarea and price_200k$price
## t = 197.77, df = 45734, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6739937 0.6838739
## sample estimates:
## cor
## 0.6789645
ggscatter(price_200k,
          x = "heatedarea",
          y = "price",
          add = "reg.line") +
  stat_cor(label.y = 2100000)
We use machine learning to make house price predictions. The core logic is to start from existing house price information and, for each house, collect data that may be relevant to it; as mentioned above, this includes internal characteristics, public amenities, and census data, such as house size, surrounding services, and area demographics.
After collecting the information, we combine all the data into one dataset and split it randomly into a training set and a test set (0.8:0.2). On the training set, we train the model to recognize the association between the collected features and house prices; on the test set, we use the predict function to infer house prices from the features.
After deriving the predicted house prices, we calculate their deviation from the real house prices to assess the accuracy of our model. This process is iterative: we keep adding and removing features to test and improve the model's accuracy.
The dataset was split into a training set and a testing set at a ratio of 0.8:0.2. The training set was used to fit the ordinary least squares (OLS) regression model, and the testing set was used to test the accuracy of the model. To fairly assess accuracy, the test set was not included in training. Also, the y parameter of createDataPartition was set to a categorical variable so that every category of that variable is represented in the training set, yielding a model with higher accuracy.
Mk.modelling <-
  newdat[newdat$toPredict == "MODELLING", ] %>%
  filter(price < 10000000) %>%
  filter(price > 0)
# Stratify the 80/20 split on lot_num so every category is represented
Train <- createDataPartition(
  y = paste(Mk.modelling$lot_num), p = 0.8, list = FALSE)
# The partition indices refer to rows of Mk.modelling, so we index
# Mk.modelling here rather than newdat
Mk.train <- Mk.modelling[Train, ]
Mk.test  <- Mk.modelling[-Train, ]
Mk.challenge <-
  newdat[newdat$toPredict == "CHALLENGE", ]
Mk_sub_200k <- st_drop_geometry(Mk.train) %>%
  filter(price <= 2000000)
An OLS regression model was trained on the training set, with the results shown below.
all_reg <- lm(price ~ ., data = st_drop_geometry(Mk_sub_200k) %>%
                dplyr::select(price, heatedarea, h_age, h_size, bedrooms, units,
                              fullbaths, numfirepla, halfbaths, pctwhite, crime_nn3,
                              med_hh_income, pctedu, total_pop, shape_Area, pctvacancy,
                              park_nn3, schools_nn3, commercial_nn3, med_nn3,
                              total_belpov100, pctunemployed, employ, stop_nn3))
stargazer(all_reg, type = "text", title = "Regression Results",
          align = TRUE, no.space = TRUE)
##
## Regression Results
## =================================================
## Dependent variable:
## -----------------------------
## price
## -------------------------------------------------
## heatedarea 124.784***
## (1.602)
## h_age 4,816.571***
## (79.853)
## h_size 79.365***
## (2.384)
## bedrooms -22,389.420***
## (1,107.878)
## units -70,065.140***
## (4,019.245)
## fullbaths 53,875.930***
## (1,702.121)
## numfirepla 2,495.499
## (1,908.554)
## halfbaths 22,918.660***
## (1,744.391)
## pctwhite 53,868.530***
## (5,567.580)
## crime_nn3 0.657***
## (0.138)
## med_hh_income 0.387***
## (0.044)
## pctedu 537,972.900***
## (22,961.730)
## total_pop -5.727***
## (0.686)
## shape_Area 0.722***
## (0.033)
## pctvacancy 382,238.200***
## (21,276.150)
## park_nn3 2.091***
## (0.367)
## schools_nn3 3.040***
## (0.409)
## commercial_nn3 -1.859***
## (0.440)
## med_nn3 1.920***
## (0.372)
## total_belpov100 41.568***
## (3.296)
## pctunemployed 48,224.360**
## (22,739.680)
## employ 13,663.770***
## (740.432)
## stop_nn3 -6.838***
## (0.304)
## Constant -349,413.200***
## (7,870.541)
## -------------------------------------------------
## Observations 36,362
## R2 0.685
## Adjusted R2 0.685
## Residual Std. Error 148,588.700 (df = 36338)
## F Statistic 3,432.428*** (df = 23; 36338)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As shown above, the R-squared of the OLS model is 0.685, meaning that approximately 68.5% of the variation in price is explained by the model. There are 23 predictors included in the model. With the model's p-value below 2.2e-16, most of the selected predictors are significant; the exception is numfirepla, whose p-value exceeds 0.1.
The mean absolute error (MAE) and mean absolute percentage error (MAPE) are calculated in this section, as shown in the table below. The MAE is 112,593.6 and the MAPE is 0.2651, meaning the model's predictions deviate from the observed prices by about 26.5% on average.
Mk.test <-
  Mk.test %>%
  mutate(price.Predict = predict(all_reg, Mk.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         # note: APE here is taken relative to the predicted price
         price.APE = (abs(price.Predict - price)) / price.Predict) %>%
  filter(price < 5000000)
MAE <- mean(Mk.test$price.AbsError, na.rm = T)
MAPE <- mean(Mk.test$price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)
kable(acc) %>% kable_styling(full_width = F)
MAE      | MAPE
---------|----------
112593.6 | 0.2651099
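Note that price.APE above divides by the predicted price; the conventional MAPE divides by the observed price instead. A hypothetical alternative for comparison:
# Conventional MAPE, relative to observed rather than predicted price
MAPE_obs <- mean(abs(Mk.test$price.Predict - Mk.test$price) / Mk.test$price,
                 na.rm = TRUE)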
# Export predicted housing prices for the challenge set
Mk.challenge <-
  Mk.challenge %>%
  mutate(prediction = predict(all_reg, Mk.challenge)) %>%
  filter(price < 5000000)
priceprediction <- st_drop_geometry(Mk.challenge) %>%
  dplyr::select(musaID, prediction)
write.csv(priceprediction, "mean_prediction.csv", row.names = FALSE)
A k-fold cross-validation test (k = 100) is applied to the model in this section.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
all_reg.cv <-
  train(price ~ ., data = st_drop_geometry(Mk.train) %>%
          dplyr::select(price, heatedarea, h_age, h_size, bedrooms, units,
                        fullbaths, numfirepla, halfbaths, pctwhite, crime_nn3,
                        med_hh_income, pctedu, total_pop, shape_Area, pctvacancy,
                        park_nn3, schools_nn3, commercial_nn3, med_nn3,
                        total_belpov100, pctunemployed, employ, stop_nn3),
        method = "lm", trControl = fitControl, na.action = na.pass)
all_reg.cv
## Linear Regression
##
## 36814 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 36446, 36446, 36447, 36446, 36447, 36445, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 224922.4 0.6357926 126977.2
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
all_reg.cv$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 189622.9 0.6921531 122656.1 Fold001
## 2 183400.2 0.6427728 122982.9 Fold002
## 3 223989.4 0.6825975 129787.0 Fold003
## 4 152141.0 0.7324429 112929.5 Fold004
## 5 164541.2 0.7206764 116306.0 Fold005
sd(all_reg.cv$resample[,3])
## [1] 9764.068
mean(all_reg.cv$resample[,3])
## [1] 126977.2
hist(all_reg.cv$resample[,3],
     breaks = 50,
     main = 'Distribution of MAE \nK-fold cross-validation; k = 100',
     xlab = 'Mean Absolute Error',
     ylab = 'Count')
According to the statistics above, the mean MAE across folds is 126,977.2, with a standard deviation of 9,764.07. The histogram shows the distribution of the MAEs across the 100 folds of the cross-validation. The MAEs are fairly dispersed, which suggests the model may not generalize equally well to all new data.
A graph of predicted prices as a function of observed prices is shown below.
ggplot(Mk.test, aes(price.Predict, price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#018abe") +
  geom_abline(intercept = 0, slope = 1, color = 'red', size = 1) +
  labs(title = 'Predicted sale price as a function of observed price',
       subtitle = 'Red line represents a perfect prediction \nBlue line represents the model prediction') +
  theme(plot.title = element_text(hjust = 0.5))
In this graph, the red line represents a perfect prediction, and the blue line is the regression of observed prices on predicted prices. The model's predictions track the perfect-prediction line closely, with two small gaps at the low and high ends of the price range.
Spatial Lag
Mk.test <-
  Mk.test %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(all_reg, Mk.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict) %>%
  filter(price < 5000000)
mk.test_predict <-
  Mk.test[which(Mk.test$price.Error != 0), ]
# Build a spatial weights matrix from each point's 5 nearest neighbors, then
# compute the spatial lag (neighborhood average) of the price error
coords.test <- st_coordinates(mk.test_predict)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style = "W")
mk.test_predict$lagPriceError <- lag.listw(spatialWeights.test, mk.test_predict$price.Error)
ggplot(mk.test_predict, aes(x = lagPriceError, y = price.Error)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#FA7800") +
  labs(title = "Price error as a function of the spatial lag of error") +
  plotTheme()
The plot of price error against lagged price error shows a moderately positive relationship, which means spatial autocorrelation exists in the model's errors.
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = mk.test_predict, aes(colour = q5(lagPriceError)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(mk.test_predict, "lagPriceError"),
                      name = "Quintile\nBreaks") +
  labs(title = "Map of spatial lag in errors") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
moranTest <- moran.mc(mk.test_predict$price.Error,
                      spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800", size = 1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title = "Observed and permuted Moran's I",
       subtitle = "Observed Moran's I in orange",
       x = "Moran's I",
       y = "Count") +
  plotTheme()
The histogram shows the distribution of the permuted Moran's I values, with the observed Moran's I marked by the orange line. Because the observed value sits above the permuted distribution, some spatial autocorrelation remains in the residuals, though it is not strong. In a perfectly generalizable model, there would be no spatial autocorrelation in the errors.
test_map <- st_drop_geometry(Mk.test) %>%
  group_by(NAME) %>%
  summarize(MAPE = mean(price.APE, na.rm = T))
test_map
## # A tibble: 291 × 2
## NAME MAPE
## <chr> <dbl>
## 1 Census Tract 1.02, Mecklenburg County, North Carolina 0.547
## 2 Census Tract 10, Mecklenburg County, North Carolina 0.178
## 3 Census Tract 11, Mecklenburg County, North Carolina 0.197
## 4 Census Tract 12, Mecklenburg County, North Carolina 0.245
## 5 Census Tract 13, Mecklenburg County, North Carolina 0.213
## 6 Census Tract 14, Mecklenburg County, North Carolina 0.238
## 7 Census Tract 15.04, Mecklenburg County, North Carolina 0.339
## 8 Census Tract 15.05, Mecklenburg County, North Carolina 0.292
## 9 Census Tract 15.07, Mecklenburg County, North Carolina 0.285
## 10 Census Tract 15.08, Mecklenburg County, North Carolina 0.240
## # … with 281 more rows
The table above shows that MAPE differs across census tracts, mostly ranging between 0.067 and 0.56. The largest MAPE is 1.04, in Census Tract 56.17, which indicates the model fits poorly in that area.
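The extremes cited above can be checked directly from test_map; a quick sketch using dplyr:
# Tracts with the largest and smallest MAPE
test_map %>% arrange(desc(MAPE)) %>% slice(1)
test_map %>% arrange(MAPE) %>% slice(1)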
mk_data <-
  newdat %>%
  mutate(price.Predict = predict(all_reg, newdat)) %>%
  filter(price < 5000000)
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = mk_data, aes(colour = q5(price.Predict)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(mk_data, "price.Predict"),
                      name = "Quintile\nBreaks") +
  labs(title = "Predicted prices from the model") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()
According to the map, predicted prices cluster spatially in many areas. Most of the higher-priced houses are in the center of the city, with some in the suburbs, while the lower-priced houses are located in the area surrounding the airport in the west and in the central-east area.
Mean MAPE per census tract is only weakly related to mean house price.
mm_table <- left_join(
  st_drop_geometry(Mk.test) %>%
    group_by(NAME) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
  st_drop_geometry(Mk.test) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(price.APE, na.rm = T)))
# Add the geometry of each census tract
mm_table <- merge(acsTractsBD20, mm_table, by.x = 'NAME', by.y = 'NAME')
mm_table <- mm_table[, c('NAME', 'MAPE', 'meanPrice')]
# MAPE is mapped to fill, so the quintile scale must be a fill scale
ggplot() +
  geom_sf(data = mm_table, aes(fill = q5(MAPE)), colour = "grey40") +
  scale_fill_manual(values = palette5,
                    labels = qBr(mm_table, "MAPE"),
                    name = "Quintile\nBreaks") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "MAPE of Each Census Tract") +
  mapTheme()
According to the map of MAPE in each census tract above, the model performs well in most tracts in the county, especially in the north and south, where MAPE is below 0.25. It fits less well in the central-east tracts and a few tracts in the central west.
ggplot(data = st_drop_geometry(mm_table), aes(meanPrice, MAPE)) +
  geom_point(size = 1.0, colour = "#FA7800") +
  labs(title = 'MAPE as a function of mean price in each census tract') +
  theme(plot.title = element_text(hjust = 0.5)) +
  plotTheme()
Median household income was used to split the Mecklenburg County tracts into high-income and low-income groups. The MAPE in the high-income area is 23%, and in the low-income area it is 31%. The difference between the two groups is relatively small, which indicates that the model generalizes reasonably well across income groups.
median_hhincome <- median(acsTractsBD20$med_hh_income, na.rm = T)
acsTractsBD20 <- acsTractsBD20 %>%
  mutate(incomeContext = ifelse(med_hh_income > median_hhincome,
                                "High Income", "Low Income"))
ggplot() +
  geom_sf(data = na.omit(acsTractsBD20), aes(fill = incomeContext), colour = "grey80") +
  scale_fill_manual(values = c("#a83341", "#d6e8ee"), name = "Income Context") +
  labs(title = "Income Context in Mecklenburg County") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom") +
  mapTheme()
st_join(Mk.test, acsTractsBD20) %>%
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Percentage Error of Test Set Sales by Neighborhood Income Context") %>%
  kable_styling(full_width = F)
High Income | Low Income | <NA>
------------|------------|------
23%         | 31%        | NA
In conclusion, the model we established above is an effective model for predicting house prices. The derived R-squared is 0.685, which indicates that the majority of the variation in house prices can be described by our model. The MAE is 112,593.6 and the MAPE is approximately 26.5%, meaning that on average our predictions deviate from the observed prices by about 26.5%. Neighborhood is one of the most interesting variables, as it reflects the existence of spatial autocorrelation.
Features covering spatial structure, amenities, and internal characteristics are included in the model. Within the spatial structure, the spatial lag variable is significant: it summarizes average price behavior in the neighborhood of each house sale point, which helps the model capture spatial autocorrelation.
According to the MAPE map, the darker areas tend to have lower MAPE. The model predicts prices better in higher-priced areas and fits less well in areas with lower house prices. This is probably because the prices of relatively low-priced houses are affected by more complex factors than the ones we included in the model.
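For reference, a spatial lag of price feature of the kind described above could be engineered as follows (a sketch reusing the 5-nearest-neighbor weights approach from the error analysis; not the exact code used in this project):
# Average sale price among each house's 5 nearest neighboring sales
coords.all <- st_coordinates(mk.test_predict)
priceWeights <- nb2listw(knn2nb(knearneigh(coords.all, 5)), style = "W")
mk.test_predict$lagPrice <- lag.listw(priceWeights, mk.test_predict$price)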
We would be glad to recommend our model to Zillow for house price prediction with relatively small prediction error. Although the model performs less well in areas with lower house prices, it still performs well overall by MAE, MAPE, and R-squared. The model could be further improved by gathering additional variables.