I. Introduction

Zillow is a U.S. tech real-estate marketplace company with information on approximately 110 million homes in the United States. The company provides several features, including home value estimates, value changes for each home over a specified time period, aerial views of homes, and comparable home prices in the area. It also offers basic information about a specific home, such as square footage and the number of bedrooms and bathrooms, and users can get estimates for homes that have undergone major changes, such as a kitchen remodel. Zillow has recognized that its housing market forecasts, one of the company's core functions, are not as accurate as they could be due to a lack of regional intelligence. This project was therefore established to address that problem and create a better predictive model of home prices in Mecklenburg County, North Carolina, for Zillow's use.

To achieve this, we gathered data as predictors of house prices, ranging from schools, crimes, and neighborhoods to attributes of the houses themselves, such as bedrooms and heated area; all of the data come from Mecklenburg County open data websites. Ordinary least squares (OLS) regression, a linear least squares method, is applied to build the model and test the accuracy of the predicted house prices. The dataset was also split into a training set and a testing set to train the model and assess its accuracy and generalizability, using the spatial lag of errors, Moran's I, and cross-validation.

Among all the steps, collecting and selecting the different categories of data was the most difficult part of this project. We needed to gather data from different websites and decide what potential impact each variable might have on house prices. Since we collected both point and polygon data, we also needed to choose appropriate methods to join them into the same dataset. Because any added variable can affect the accuracy of the results, we repeatedly added and removed variables while testing the model.

The results indicate that the model was trained to a reasonable accuracy: the mean absolute percentage error (MAPE) on the test set is approximately 0.265 and the R-squared is approximately 0.685. Although there are some limitations in generalization, the model established in this project can be used by Zillow for house price prediction with relatively high accuracy and generalizability.

II. DATA

2.1 Collecting Data

We collected our data from the following sources:

From the City of Charlotte Open Data Portal, we collected point data such as crimes, schools, parks, and medical services, and extracted their geographic information to calculate average distances from each house to them. From the open data map, we collected the tract data and joined it to our base dataset. We also collected census data using the tidycensus package.

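# Filter reported aggravated assaults with valid coordinates, convert them to
# sf points (WGS84), and project to the local state plane CRS (units in feet)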
crimes.sf <-
  crime %>%
    filter(HIGHEST_NIBRS_DESCRIPTION == "Aggravated Assault",
           LATITUDE_PUBLIC > -1) %>%
    dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC) %>%
    na.omit() %>%
    st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = "EPSG:4326") %>%
    st_transform('ESRI:102719') %>%
    distinct()
# Counts of crime per buffer of house sale
newdat$crimes.Buffer <- newdat %>% 
    st_buffer(660) %>% 
    aggregate(mutate(crimes.sf, counter = 1),., sum) %>%
    pull(counter)

## Nearest Neighbor Feature

newdat <-
  newdat %>% 
    mutate(
      crime_nn1 = nn_function(st_coordinates(newdat), 
                              st_coordinates(crimes.sf), k = 1),
      
      crime_nn2 = nn_function(st_coordinates(newdat), 
                              st_coordinates(crimes.sf), k = 2), 
      
      crime_nn3 = nn_function(st_coordinates(newdat), 
                              st_coordinates(crimes.sf), k = 3), 
      
      crime_nn4 = nn_function(st_coordinates(newdat), 
                              st_coordinates(crimes.sf), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(newdat), 
                              st_coordinates(crimes.sf), k = 5)) 
st_c <- st_coordinates
newdat <-
  newdat %>%
  mutate(
 
      commercial_nn3 = nn_function(na.omit(st_c(newdat)),na.omit(st_c(Shoppingcenter))[,c(1,2)], 3),
      park_nn3 = nn_function(na.omit(st_c(newdat)),na.omit(st_c(parks)), 3),
      schools_nn3 = nn_function(na.omit(st_c(newdat)),na.omit(st_c(School))[,c(1,2)], 3),
      med_nn3 = nn_function(na.omit(st_c(newdat)),na.omit(st_c(medical))[,c(1,2)], 3),
      stop_nn3 = nn_function(na.omit(st_c(newdat)),na.omit(st_c(bus))[,c(1,2)], 3)
      )
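The nn_function() helper used above is defined elsewhere in our setup. For reference, a minimal sketch of a function consistent with how we call it (returning, for each point, the mean distance to its k nearest neighbors) could look like the following, assuming the FNN package:

library(FNN)

# Hypothetical minimal version of nn_function: mean distance from each row of
# measureFrom to its k nearest neighbors in measureTo (both coordinate matrices)
nn_function <- function(measureFrom, measureTo, k) {
  nn_dists <- FNN::get.knnx(measureTo, measureFrom, k = k)$nn.dist
  rowMeans(nn_dists)
}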

2.2 Summarizing Data

Before the analysis, we gathered three kinds of variables to help us predict housing prices more precisely.

a. Internal characteristics: The internal characteristics of the houses were provided in StudentData.geojson. The following variables are valuable for our analysis: price, year built (age), number of rooms (bedrooms, bathrooms, fireplaces, units, etc.), heated area, and more.

b. Public amenities: We assess the impact of external factors on house prices by calculating each house's average distance to the three closest instances of a public service. Here, we calculated accessibility to schools, shopping centers, parks, medical services, jobs, and bus stops; all of these variables are named "xxx_nn3".

c. Census data: We also imported the 2020 census data for Mecklenburg County using the get_acs() function; after processing, we have total population, median household income, education percentage, white percentage, poverty, and more. An illustrative pull is sketched below.
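As an illustration, the census pull might look like the following sketch (the exact variable codes and post-processing used are assumptions; B01003_001 is total population and B19013_001 is median household income):

library(tidycensus)

# Illustrative ACS pull for Mecklenburg County tracts (2020 five-year ACS)
acsTractsBD20 <- get_acs(geography = "tract",
                         variables = c(total_pop = "B01003_001",
                                       med_hh_income = "B19013_001"),
                         state = "NC", county = "Mecklenburg",
                         year = 2020, geometry = TRUE, output = "wide")
# With output = "wide", estimates get an "E" suffix (e.g. total_popE) and
# margins of error an "M" suffix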

stargazer(st_drop_geometry(newdat), 
          type = 'text', 
          title = "Summary Statistics")
## 
## Summary Statistics
## ========================================================================
## Statistic         N       Mean      St. Dev.       Min          Max     
## ------------------------------------------------------------------------
## condo_town      46,183    0.000       0.000         0            0      
## parcel_typ      46,183    0.000       0.000         0            0      
## totalac         46,183    1.772      106.717      0.000      9,757.440  
## price           46,183 465,686.800 602,409.800      0       75,500,000  
## cardno          46,183    1.000       0.000         1            1      
## yearbuilt       46,183  1,993.386    31.836         0          2,022    
## heatedarea      46,183  2,359.428   1,060.335     0.000     14,710.000  
## numfirepla      46,183    0.787       0.465         0            7      
## fullbaths       46,183    2.282       0.825         0            9      
## halfbaths       46,183    0.596       0.528         0            4      
## bedrooms        46,183    3.510       0.943         0           65      
## units           46,183    0.981       0.969         0           205     
## ownerno         46,183    1.000       0.000         1            1      
## landsequen      46,183    1.001       0.040         0            3      
## shape_Leng      46,183   504.273     253.070     152.910    10,798.810  
## shape_Area      46,183 15,875.080  35,033.910   1,139.637  3,486,865.000
## sale_year       46,183  2,020.884     0.765       2,020        2,028    
## musaID          46,183 23,092.000  13,332.030       1         46,183    
## h_price         46,172 359,076.800 195,210.200 101,660.000 1,980,920.000
## h_age           46,175   31.595      16.307       4.000       83.000    
## h_size          46,181  2,305.622    744.765     914.330     5,569.507  
## employ          46,063    1.843       1.289         0            5      
## total_pop       46,180  4,066.102   1,310.649      790         7,703    
## med_hh_income   46,180 86,175.660  37,415.660    17,685       238,750   
## total_belpov100 46,180   357.862     355.277        0          2,241    
## pctvacancy      46,180    0.058       0.043       0.000        0.293    
## pctwhite        46,180    0.563       0.262       0.011        0.989    
## pctunemployed   46,180    0.046       0.039       0.000        0.225    
## pctedu          46,180    0.111       0.065       0.004        0.322    
## crimes.Buffer   16,939    2.141       1.466         1           11      
## crime_nn1       46,183  3,738.334   7,091.211    43.003     42,592.160  
## crime_nn2       46,183  4,176.513   7,446.173    89.984     44,682.910  
## crime_nn3       46,183  4,501.107   7,714.390    99.900     45,638.190  
## crime_nn4       46,183  4,811.474   7,979.975    118.446    46,552.480  
## crime_nn5       46,183  5,069.228   8,195.000    182.700    47,361.670  
## commercial_nn3  46,183  7,340.499   4,311.997    484.963    32,116.810  
## park_nn3        46,183  5,670.022   2,655.575    295.369    16,925.850  
## schools_nn3     46,183  5,973.619   3,126.916    653.722    24,399.970  
## med_nn3         46,183  6,744.319   4,478.215    149.571    26,588.100  
## stop_nn3        46,183  4,916.262   4,638.132    145.797    32,479.790  
## ------------------------------------------------------------------------

2.3 Correlation Matrix

Below is the correlation matrix for the numeric variables. The deeper the color, the stronger the correlation between the variables.

numericVars <- 
  select_if(st_drop_geometry(newdat), is.numeric) %>% 
  na.omit() %>%
  select(-cardno, -ownerno, -parcel_typ, -condo_town,
         -crime_nn1, -crime_nn2, -crime_nn4, -crime_nn5)

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#001b48", "#d6e8ee", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

2.4 Home Price Correlation Scatterplots

Below are eight correlation scatterplots of home price as a function of eight independent variables. crime_nn3 and stop_nn3 are the average distances to the three nearest crimes and bus stops, respectively; employ is a job-accessibility score collected from the open data map.

## Home Features cor
st_drop_geometry(newdat) %>% 
  dplyr::select(price, heatedarea, bedrooms, h_age, pctwhite,med_hh_income,crime_nn3, stop_nn3, employ) %>%
  filter(price <= 2000000, h_age < 500,bedrooms<10) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

2.5 Sale Price map

Here we present a map showing the distribution of houses and their prices in Mecklenburg County. The darker the color, the higher the sale price. From the map, it is easy to tell that high-priced houses are concentrated in the central and northern parts of the city, while low-priced houses are mainly distributed between the center and the border.

# Mapping data
ggplot() +
  geom_sf(data = saleprice, fill = "grey90") +
  geom_sf(data = newdat, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(newdat,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Price Per Square Foot, Mecklenburg") +
  mapTheme()

2.6 Maps of Independent Variables

Below are three maps showing the spatial distribution of some features that may relate to housing prices.

a. Map of crime density and houses' exposure to crime

ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  stat_density2d(data = data.frame(st_coordinates(crimes.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#FAE8E8", high = "#ab3141", name = "Crime Density") +
  geom_sf(data = newdat, aes(colour = q5(crime_nn3)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(newdat,"crime_nn3"),    
                   name="average distance to 3 nearest crimes") +
  labs(title="Crimes and Houses, Mecklenburg") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

b. Map of school density and houses' exposure to schools

ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  stat_density2d(data = data.frame(st_coordinates(School)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#FAE8E8", high = "#ab3141", name = "School Density") +
  geom_sf(data = newdat, aes(colour = q5(schools_nn3)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(newdat,"schools_nn3"),    
                   name="average distance to 3 nearest schools") +
  labs(title="Schools and Houses, Mecklenburg") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

c. Vacant houses

vacanthouse<- newdat %>% 
  mutate(pctvacancy = pctvacancy*100) %>%
  dplyr::select(pctvacancy)
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = saleprice, aes(fill = q5(X2021)),colour=NA) +
  scale_fill_manual(values = p52,
                   labels=qBr(saleprice,"X2021"),    
                   name="tract house prices") +
  geom_sf(data = vacanthouse, aes(colour = q5(pctvacancy)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(vacanthouse,"pctvacancy"),    
                   name="pctvacancy") +
  labs(title="Tract house prices and vacancy") +
  theme(plot.title = element_text(hjust = 0.5))+
  mapTheme()

2.7 Other Interesting features

1. These are correlation scatterplots between price and the average distance from each house to its k nearest crimes. By comparison, the distance to crime does not appear to have a significant impact on housing prices.

## Crime cor
newdat%>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("crime_")) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, nrow = 2, scales = "free") +
     ##scale_y_log10()+
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

2. Below are histograms of all numeric variables, from which we can read the shape of each distribution.

ggplot(gather(numericVars), aes(value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~key,nrow=6, scales = 'free_x')

3. Here we single out heated area as an independent variable and test its correlation with price. The resulting correlation coefficient is 0.68, suggesting that area has a relatively strong correlation with housing price.

price_200k <- st_drop_geometry(newdat) %>% 
  filter(price <= 2000000) 

cor.test(price_200k$heatedarea,
         price_200k$price, 
         method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  price_200k$heatedarea and price_200k$price
## t = 197.77, df = 45734, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6739937 0.6838739
## sample estimates:
##       cor 
## 0.6789645
ggscatter(price_200k,
          x = "heatedarea",
          y = "price",
          add = "reg.line") +
  stat_cor(label.y = 2100000) 

III. METHODS

We use machine learning to make house price predictions. The core logic of this process is to first find existing house price information and, for each house, collect data that may be relevant to its price. As described above, this includes internal characteristics (house size, rooms), public amenities (surrounding services), and census data (area demographics).

After collecting the information, we combine all the data into one dataset and split it randomly into a training set and a test set (0.8:0.2). On the training set, we fit a model that learns the association between the collected features and house prices; on the test set, we use the predict function to infer house prices from the features alone.

After deriving the predicted house prices, we calculate their deviation from the real house prices to measure the accuracy of our model. This process is iterative: we keep adding and removing features to test and improve the model's accuracy.
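Concretely, the fitted model is a hedonic OLS regression of sale price on the collected features, and accuracy is summarized with MAE and MAPE, stated here for reference. (Note that, as implemented in Section IV, the percentage error is taken relative to the predicted price rather than the conventional observed price.)

\[
\text{price}_i \;=\; \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i
\]

\[
\text{MAE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,
\qquad
\text{MAPE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{\hat{y}_i}
\]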

IV. RESULTS

The dataset was split into a training set and a testing set at a ratio of 0.8:0.2. The training set was used to fit the ordinary least squares (OLS) regression model, and the testing set was used to evaluate the model's accuracy. To give an honest measure of accuracy, the test set was not included in training. The y argument of createDataPartition() was set to a categorical variable (lot_num) so that every category of that variable appears in the training set, which helps the model generalize.

Mk.modelling <-
  newdat[newdat$toPredict=="MODELLING",]%>%
  filter(price<10000000)%>%
  filter(price>0)

Train <- createDataPartition(
  y=paste(Mk.modelling$lot_num),p=0.8,list=F)

# Index into Mk.modelling (the data the partition was built from), not newdat,
# so that the partition's row indices line up with the right rows
Mk.train <- Mk.modelling[Train,]

Mk.test <- Mk.modelling[-Train,]

Mk.challenge <-
  newdat[newdat$toPredict=="CHALLENGE",]

Mk_sub_200k <- st_drop_geometry(Mk.train) %>% 
filter(price <= 2000000) 

An OLS regression model was trained using the defined training set, with the results shown below.

all_reg <- lm(price ~ ., data = st_drop_geometry(Mk_sub_200k) %>%
                dplyr::select(price, heatedarea, h_age, h_size, bedrooms, units,
                              fullbaths, numfirepla, halfbaths, pctwhite, crime_nn3,
                              med_hh_income, pctedu, total_pop, shape_Area, pctvacancy,
                              park_nn3, schools_nn3, commercial_nn3, med_nn3,
                              total_belpov100, pctunemployed, employ, stop_nn3))

stargazer(all_reg, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
## 
## Regression Results
## =================================================
##                          Dependent variable:     
##                     -----------------------------
##                                 price            
## -------------------------------------------------
## heatedarea                   124.784***          
##                                (1.602)           
## h_age                       4,816.571***         
##                               (79.853)           
## h_size                        79.365***          
##                                (2.384)           
## bedrooms                   -22,389.420***        
##                              (1,107.878)         
## units                      -70,065.140***        
##                              (4,019.245)         
## fullbaths                   53,875.930***        
##                              (1,702.121)         
## numfirepla                    2,495.499          
##                              (1,908.554)         
## halfbaths                   22,918.660***        
##                              (1,744.391)         
## pctwhite                    53,868.530***        
##                              (5,567.580)         
## crime_nn3                     0.657***           
##                                (0.138)           
## med_hh_income                 0.387***           
##                                (0.044)           
## pctedu                     537,972.900***        
##                             (22,961.730)         
## total_pop                     -5.727***          
##                                (0.686)           
## shape_Area                    0.722***           
##                                (0.033)           
## pctvacancy                 382,238.200***        
##                             (21,276.150)         
## park_nn3                      2.091***           
##                                (0.367)           
## schools_nn3                   3.040***           
##                                (0.409)           
## commercial_nn3                -1.859***          
##                                (0.440)           
## med_nn3                       1.920***           
##                                (0.372)           
## total_belpov100               41.568***          
##                                (3.296)           
## pctunemployed               48,224.360**         
##                             (22,739.680)         
## employ                      13,663.770***        
##                               (740.432)          
## stop_nn3                      -6.838***          
##                                (0.304)           
## Constant                   -349,413.200***       
##                              (7,870.541)         
## -------------------------------------------------
## Observations                   36,362            
## R2                              0.685            
## Adjusted R2                     0.685            
## Residual Std. Error   148,588.700 (df = 36338)   
## F Statistic         3,432.428*** (df = 23; 36338)
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01

The output shows that the R-squared of the OLS model is 0.685, meaning that approximately 68.5% of the variation in sale prices is explained by the model. There are 23 predictors in the model, all of them numeric. The overall F-test p-value is less than 2.2e-16, and most of the selected predictors are individually significant; the exceptions are numfirepla, which is not significant at the 0.1 level, and pctunemployed, which is significant only at the 0.05 level.

4.1 Value of MAE and MAPE

The mean absolute error (MAE) and mean absolute percentage error (MAPE) are calculated in this section, as shown in the table below. The MAE is 112,593.6 and the MAPE is 0.2651, meaning the model's predictions deviate from the observed prices by about 26.5% on average.

Mk.test <-
  Mk.test %>%
  mutate(price.Predict = predict(all_reg, Mk.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict) %>%
  filter(price < 5000000)


MAE <- mean(Mk.test$price.AbsError, na.rm = T)
MAPE <- mean(Mk.test$price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)
kable(acc) %>% kable_styling(full_width = F)
MAE MAPE
112593.6 0.2651099
#export housing price
Mk.challenge <-
  Mk.challenge %>%
  mutate(prediction = predict(all_reg, Mk.challenge))%>%
  filter(price < 5000000)

priceprediction <- st_drop_geometry(Mk.challenge) %>% dplyr::select(musaID,prediction)
write.csv(priceprediction,"mean_prediction.csv", row.names = FALSE)

4.2 Cross Validation

A k-fold cross-validation test (k = 100) is applied to the model in this section.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

all_reg.cv <- 
  train(price ~ ., data = st_drop_geometry(Mk.train) %>% 
          dplyr::select(price, heatedarea, h_age,  h_size, bedrooms, units,
                        fullbaths,numfirepla,halfbaths,pctwhite,crime_nn3,
                        med_hh_income,pctedu,total_pop,shape_Area,pctvacancy,
                        park_nn3,schools_nn3,commercial_nn3,med_nn3,total_belpov100,pctunemployed,employ,stop_nn3),
        method = "lm", trControl = fitControl, na.action = na.pass)

all_reg.cv
## Linear Regression 
## 
## 36814 samples
##    23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 36446, 36446, 36447, 36446, 36447, 36445, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   224922.4  0.6357926  126977.2
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
all_reg.cv$resample[1:5,]
##       RMSE  Rsquared      MAE Resample
## 1 189622.9 0.6921531 122656.1  Fold001
## 2 183400.2 0.6427728 122982.9  Fold002
## 3 223989.4 0.6825975 129787.0  Fold003
## 4 152141.0 0.7324429 112929.5  Fold004
## 5 164541.2 0.7206764 116306.0  Fold005
sd(all_reg.cv$resample[,3])
## [1] 9764.068
mean(all_reg.cv$resample[,3])
## [1] 126977.2
hist(all_reg.cv$resample[,3], 
     breaks = 50, 
     main = 'Distribution of MAE \nK fold cross validation; k = 100', 
     xlab = 'Mean Absolute Error', 
     ylab = 'count')

According to the statistics above, the mean MAE across the 100 folds is 126,977.2, and its standard deviation is 9,764.07. The histogram shows the distribution of the MAEs of the 100 folds of the cross-validation. The MAEs are somewhat dispersed across folds, which indicates that the model may not generalize equally well to all new data.

4.3 Predicted Prices as a Function of Observed Prices

A graph of predicted prices as a function of observed prices is shown below.

ggplot(
  Mk.test, aes(price.Predict, price)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "#018abe") +
  geom_abline(intercept = 0, slope = 1, color='red',size=1) +
  labs(title = 'Predicted sale price as a function of observed price', subtitle = 'Red line represents a perfect prediction \nBlue line represents prediction') +
  theme(plot.title = element_text(hjust = 0.5))

In this graph, the red line represents a perfect prediction (predicted price equal to observed price), and the blue line is the regression of observed prices on predicted prices. The two lines track each other closely, with two small gaps at the low and high ends of the price range.

4.4 A map of residuals in model for the test set

Spatial Lag

Mk.test <-
  Mk.test %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(all_reg, Mk.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) /price.Predict)%>%
  filter(price < 5000000) 
mk.test_predict <-
  Mk.test[which(Mk.test$price.Error != 0),]

coords.test <-  st_coordinates(mk.test_predict) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
mk.test_predict$lagPriceError <- lag.listw(spatialWeights.test, mk.test_predict$price.Error)



ggplot(mk.test_predict, aes(x = lagPriceError, y = price.Error)) +
     geom_point(size = .5) + 
     geom_smooth(method = "lm", se = F, colour = "#FA7800") +
     labs(title = "Price error as a function of the spatial lag of error") +
     plotTheme()

The plot of price error against the spatial lag of error shows a partly positive relationship, meaning spatial autocorrelation exists in the model's errors.
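For reference, the spatial lag computed by lag.listw above is, for each sale \(i\), the weighted average of the price errors of its five nearest neighbors under the row-standardized weights \(w_{ij}\):

\[
\text{lag}(e)_i \;=\; \sum_{j} w_{ij}\, e_j, \qquad \sum_{j} w_{ij} = 1,
\]

where \(w_{ij}\) is nonzero only for the five nearest neighbors of sale \(i\).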

ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = mk.test_predict, aes(colour = q5(lagPriceError)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(mk.test_predict,"lagPriceError"),
                   name="Quintile\nBreaks") +
  labs(title="a map of spatial lag in errors") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

4.5 Moran’s I test

moranTest <- moran.mc(mk.test_predict$price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

The histogram shows the distribution of Moran's I computed on 999 random permutations of the errors, with the observed Moran's I marked by the orange line. Because the observed value lies to the right of the permuted distribution, some spatial autocorrelation remains in the residuals, although it is not high. In a perfectly generalizable model, there would be no spatial autocorrelation in the errors.
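For reference, the Moran's I statistic computed on the price errors \(e_i\) under the same spatial weights is

\[
I \;=\; \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot
\frac{\sum_{i}\sum_{j} w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_{i}(e_i - \bar{e})^2},
\]

and moran.mc compares the observed value against its distribution under nsim = 999 random permutations of the errors.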

4.6 A list of mean MAPE of each census tract

test_map = st_drop_geometry(Mk.test) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(price.APE, na.rm = T))
test_map
## # A tibble: 291 × 2
##    NAME                                                    MAPE
##    <chr>                                                  <dbl>
##  1 Census Tract 1.02, Mecklenburg County, North Carolina  0.547
##  2 Census Tract 10, Mecklenburg County, North Carolina    0.178
##  3 Census Tract 11, Mecklenburg County, North Carolina    0.197
##  4 Census Tract 12, Mecklenburg County, North Carolina    0.245
##  5 Census Tract 13, Mecklenburg County, North Carolina    0.213
##  6 Census Tract 14, Mecklenburg County, North Carolina    0.238
##  7 Census Tract 15.04, Mecklenburg County, North Carolina 0.339
##  8 Census Tract 15.05, Mecklenburg County, North Carolina 0.292
##  9 Census Tract 15.07, Mecklenburg County, North Carolina 0.285
## 10 Census Tract 15.08, Mecklenburg County, North Carolina 0.240
## # … with 281 more rows

The table above shows that MAPE differs across census tracts, mostly ranging between 0.067 and 0.56. The largest MAPE is 1.04, in Census Tract 56.17, which indicates the model does not fit very well in that area.

4.7 A map of the predicted price of the regression model

mk_data <-
  newdat %>%
  mutate(price.Predict = predict(all_reg, newdat))%>%
  filter(price < 5000000)
ggplot() +
  geom_sf(data = acsTractsBD20, fill = "grey90") +
  geom_sf(data = mk_data, aes(colour = q5(price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(mk_data,"price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="predicted price with model") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

According to the map, predicted prices cluster spatially. Most of the houses with relatively high predicted prices are in the center of the city, with some in the suburbs, while houses with relatively low predicted prices are concentrated around the airport in the west and in the central-east area.

Mean MAPE per census tract is only weakly related to mean house price:

mm_table <- left_join(
  st_drop_geometry(Mk.test) %>%
    group_by(NAME) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
 st_drop_geometry(Mk.test) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(price.APE, na.rm = T)))
mm_table <- merge(acsTractsBD20, mm_table,by.x = 'NAME',by.y = 'NAME') # add geometry of each census tracts
mm_table <- mm_table[,c('NAME','MAPE','meanPrice')]
ggplot() +
  geom_sf(data = mm_table, aes(fill = q5(MAPE)), colour = "grey40") +
  scale_fill_manual(values = palette5,
                    labels = qBr(mm_table, "MAPE"),    
                    name = "Quintile\nBreaks") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "MAPE of Each Census Tract") +
  mapTheme()

According to the map of MAPE in each census tract above, the model works well in most tracts in the county, especially in the northern and southern parts, where MAPE is below 0.25. It fits the central-east tracts and a few central-west tracts less well.

ggplot(data=st_drop_geometry(mm_table), aes(meanPrice, MAPE)) +
  geom_point(size = 1.0, colour = "#FA7800") +
  labs(title = 'MAPE as a function of mean price in each census tract') +
  theme(plot.title = element_text(hjust = 0.5))+
  plotTheme()

The median household income was used to split the census tracts of Mecklenburg County into high- and low-income groups. The MAPE in the high-income area is 23%, and in the low-income area it is 31%. The difference between the two groups is relatively small, which indicates the model generalizes reasonably well across income groups.

median_hhincome <- median(acsTractsBD20$med_hh_income,na.rm=T)
acsTractsBD20 <- acsTractsBD20 %>%
  mutate(incomeContext = ifelse(med_hh_income > median_hhincome, "High Income", "Low Income"))


ggplot() + geom_sf(data = na.omit(acsTractsBD20), aes(fill = incomeContext),colour="grey80") +
    scale_fill_manual(values = c( "#a83341","#d6e8ee"), name="Income Context") +
    labs(title = "Income Context in MK County") +
    theme(plot.title = element_text(hjust = 0.5),legend.position="bottom") +
    mapTheme()

st_join(Mk.test, acsTractsBD20) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Percentyage Error of Test Set Sales by Neighborhood Income Cont") %>% kable_styling(full_width = F)
Mean Absolute Percentage Error of Test Set Sales by Neighborhood Income Context

High Income   Low Income   <NA>
23%           31%          NA

V. Discussion

In conclusion, the model established above is an effective model for predicting house prices. The derived R-squared value is 0.685, which indicates that the majority of the variation in house prices can be described by our model. The MAE is 112,593.6 and the MAPE is approximately 0.265, meaning that, on average, our predictions deviate from the observed prices by about 26.5%. The model includes features describing internal characteristics, public amenities, and census-based neighborhood context. The spatial structure of the errors is one of the most interesting aspects: the spatial lag of errors, computed as the weighted average error among each sale's five nearest neighbors, together with the Moran's I test, shows that some spatial autocorrelation remains in the residuals. According to the MAPE map, the darker areas tend to have lower MAPE; the model predicts prices in higher-priced areas better and fits less well in areas with lower house prices. This is probably because the prices of relatively low-priced houses are affected by factors more complex than the ones included in our model.

VI. Conclusion

We are glad to recommend our model to Zillow for house price prediction, as it predicts prices with relatively small errors. Although the model does not perform as well in areas with lower house prices, it works well overall, with solid MAE, MAPE, and R-squared values. The model could be further improved by gathering additional variables.