Introduction

  • The purpose of this project is to develop a home price forecasting model to predict home prices in Mecklenberg County, NC. The price of a home is a direct reflection of the value of a property, not only in terms of the living conditions and future development of an individual family. It is also a reflection of the development of the city, region and community. As an analyst with a background in architecture and urban planning, it is a challenge for this project to use our own domain knowledge to select appropriate forecasting projects that fully reflect the influence of environment, space, and people on housing prices and make correct forecasts.
  • This project uses OLS regression (least squares regression) method to predict house prices based on hedonic model (more bedrooms, lower crime rate leads to higher house prices, etc.).
  • The result is that, compared to other algorithms, we include parameters of greenness and building rate in each neighborhood range in our algorithm species. Our model uses xxxx as the training sample and presents an overall prediction accuracy of is xxxx. ### Set up

Data Wrangling

Raw Data Descriptions

  • Data sources:
    1. Census data from the American Community Survey (ACS)
    2. GIS Data Center,Mecklenberg County, NC. https://www.mecknc.gov/LUESA/GIS/Pages/GIS-Data-Center.aspx
    3. Open Mapping, Mecklenberg County, NC. http://maps.co.mecklenburg.nc.us/openmapping/index.html
  • Data Collection:
  • Through the visualization of the data, we projected the different house prices on the map of Mecklenberg County using different color points. From the distribution of the price points, we can visually see that prices are higher in the north and south of Mecklenberg County, and lower in the southwest, through the center, and all the way to the east, showing a band of the letter “n”. Although the geoJSON dataset containing house prices includes a lot of important basic pricing information such as the year of construction and the number of bedrooms, the environmental information around the house, such as the occurrence of crime and public facilities, also has an impact on the house prices. Therefore, we use the K-nearest-neighbor method to process some information within Mecklenberg County and add it to our machine learning.

Feature Engineering

st_c <- st_coordinates
Charlotte2 <- Charlotte %>%
  mutate(Grocery_nn2 = nn_function(st_c(Charlotte), st_c(Grocery), 2),
         Crime_nn4 = nn_function(st_c(Charlotte), st_c(Crime) %>% na.omit(), 4),
         School_nn1 = nn_function(st_c(Charlotte), st_c(School)%>% na.omit(), 1),
         Mall_nn2 = nn_function(st_c(Charlotte), st_c(Mall), 2),
         Park_nn3 = nn_function(st_c(Charlotte), st_c(Parks), 3),
         Bus_nn3 = nn_function(st_c(Charlotte), st_c(bus), 3))

Charlotte3 <- Charlotte2 %>% 
  st_join(zoning) %>% 
  st_join(ZIP) %>% 
  st_join(tracts2021) %>% 
  st_join(npa) %>% 
  mutate(builtTime = case_when(
    yearbuilt >= 2015 ~ "New",
    yearbuilt > 2000 & yearbuilt < 2015 ~ "Recent",
    yearbuilt <= 2000 ~ "Old"  )) %>% 
  mutate(numfpla = case_when(
    numfirepla >=3 ~ "many",
    numfirepla > 0 & numfirepla <3 ~ "less2",
    numfirepla == 0 ~ "No"  ))
  

Charlotte4 <- Charlotte3%>%
  filter(toPredict != 'CHALLENGE')%>%
  st_drop_geometry 
Charlotte4$fullbaths <- as.character(Charlotte4$fullbaths)
Charlotte4$halfbaths <- as.character(Charlotte4$halfbaths)
Charlotte4$units <- as.character(Charlotte4$units)
Charlotte4$bedrooms <- as.character(Charlotte4$bedrooms)
Charlotte4$yearbuilt <- as.character(Charlotte4$yearbuilt)
Charlotte4$npa <- as.character(Charlotte4$npa)


###### Feature Selection
Table <- Charlotte4%>%
   dplyr::select(Grocery_nn2,Park_nn3,Bus_nn3, School_nn1,Crime_nn4, Mall_nn2,bedrooms, fullbaths,heatedarea, halfbaths, shape_Area, units, price,  cdebuildin,heatedfuel,bldggrade, totalac, builtTime,extwall,foundation,numfpla,MedHHInc,pctWhite,pctPoverty,pctBlack,pctHispanic,ZIP)
  • The factors that influence the price of a house fall into three main categories:the properties of the house itself, the urban public services accessible around the house, and the specific spatial structure within the area. Therefore, the variables we use for modeling are also mainly these three categories.
  • 1.Internal Characteristics First we looked at the internal characteristics of the houses. We considered features associated with the number of bedrooms, number of bathrooms, living square feet, actual square feet, year built, story-height, units, building type, heated fuel, the structure quality, wall material, number of fire places and foundation type. We took most of them as categorical variables in our model.
internal <- Table%>%
   dplyr::select(bedrooms, fullbaths,heatedarea, halfbaths, shape_Area, units, cdebuildin,heatedfuel,bldggrade, totalac, builtTime, extwall, foundation, numfpla)

stargazer(st_drop_geometry(internal), 
          type = 'text', 
          title = "Numeric Internal Features",
          covariate.labels = c()) 
## 
## Numeric Internal Features
## ===============================================================
## Statistic    N       Mean     St. Dev.     Min         Max     
## ---------------------------------------------------------------
## heatedarea 46,083 2,359.402  1,060.681    0.000    14,710.000  
## shape_Area 46,083 15,872.850 35,058.410 1,139.637 3,486,865.000
## totalac    46,083   1.775     106.833     0.000     9,757.440  
## ---------------------------------------------------------------
  • 2.Amenities/Public services Accessibility to public services and amenities can add value to houses. We considered the following public services and amenities when building our model: grocery, school, hospital, shopping mall, park, bus station. Here we also consider crime as a special “facility” has negative affects to house value. We use k-nearest neighbor to calculate the average distance of several closest facilities to each house as the value of variables.
Amenities <- Table %>% 
   dplyr::select(Grocery_nn2,Park_nn3,Bus_nn3, School_nn1,Crime_nn4, Mall_nn2)

stargazer(st_drop_geometry(Amenities), 
          type = 'text', 
          title = "Amenities Features",
          covariate.labels = c()) 
## 
## Amenities Features
## =========================================================
## Statistic     N      Mean    St. Dev.    Min      Max    
## ---------------------------------------------------------
## Grocery_nn2 46,083 6,610.182 4,176.016 438.342 28,608.460
## Park_nn3    46,083 5,701.951 2,671.000 296.326 17,024.180
## Bus_nn3     46,083 4,943.569 4,663.943 144.973 32,668.800
## School_nn1  46,083 4,062.715 2,708.480 64.186  20,333.820
## Crime_nn4   46,083 2,301.561 3,996.468 50.171  24,620.840
## Mall_nn2    46,083 6,695.646 4,325.880 372.942 31,974.730
## ---------------------------------------------------------
  • 3.Spatial Process House prices are correlated with each other due to the influence of a particular spatial process, so we use spatial structure as a factor affecting house prices as well. The spatial process we consider includes: zip area, land use, income, racial segregation.
Spatial <- Table %>% 
   dplyr::select(MedHHInc,pctWhite,pctPoverty,pctBlack,pctHispanic,ZIP)

stargazer(st_drop_geometry(Spatial), 
          type = "text", 
          title = "Spatial Process Features",
          covariate.labels = c()) 
## 
## Spatial Process Features
## =======================================================
## Statistic     N       Mean     St. Dev.   Min     Max  
## -------------------------------------------------------
## MedHHInc    46,080 86,180.690 37,419.720 17,685 238,750
## pctWhite    46,080   0.571      0.270    0.011   1.710 
## pctPoverty  46,080   0.091      0.082    0.000   0.614 
## pctBlack    46,080   0.299      0.239    0.000   1.346 
## pctHispanic 46,080   0.114      0.099    0.000   0.655 
## -------------------------------------------------------

Exploratory Analysis

  • To explore the factors that influence house prices, we selected four factors to explore the relationship between each of them and house prices. 1.heatedarea First, the heating area, which undeniably directly reflects the usable area of the house, should rightly show a strong correlation with the price of the house, with each 2,500 square foot increase in heating area increasing the price of the house by approximately $250,000. While this is consistent with our common sense, small outlier points appear in the lower right corner, which may be due to special reasons (e.g., building form) and may have potential research value. 2.Mall_nn2 Regarding the section on shopping malls, the overall distribution the amount sample shows a trend of more near and less far. Although malls within closer distances correspond to a large portion of highe r priced houses. The overall distance and house prices still show a slow growth trend, which seems to be contrary to common sense. 3.shape_Area The projected area is not intended to have a strong positive correlation with house prices. However, the distribution of house prices still shows dispersion when the area is small, indicating that other factors besides the area of the house itself are also associated with house prices to a considerable extent. 4.School_nn1 Regarding the section on commercial schools, similar to malls, the overall distribution of the amount sample shows a trend of more near and less far. Although the malls within closer distance correspond to a large portion of the higher priced houses. The overall distance and house prices still show a slow growth trend, which seems to be contrary to common sense.
## home plots scatterplots (1 point)
     st_drop_geometry(Table) %>% 
     dplyr::select(price, shape_Area,heatedarea,Mall_nn2,School_nn1) %>%
     filter(price <= 1000000) %>%
     gather(Variable, Value, -price) %>% 
     ggplot(aes(Value, price)) +
     geom_point(size = .05,alpha=0.5) + geom_smooth(method = "lm", se=F, colour = "#FA7800",size=2) +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

  • In consideration of model training efficiency, only one set of data is selected to be added to the model training in the K-nearest-neighbor method. Since some rating indicators of houses, such as overall maintenance, are more likely to have a direct impact on house prices than some numerical parameters, such as distance to parks & stores, fewer parameters appear in the Correlation matrix, and except for the strong correlation between percent of White and percent of Black, the rest of the data do not show a strong correlation and can be be used as parameters for model training.
## Correlation Matrix(0.5 point)
numericVars <- 
  select_if(st_drop_geometry(Table), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#659FE3", "#EC9F6D"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation across numeric variables") 

Data Mapping

  • We first visualized the home price to have a rough idea about the spatial pattern.
## Map home prices (0.5 point)
ggplot()+
  geom_sf(data = tracts2021, color="gray50",fill = "transparent",size=0.5)+
  geom_sf(data = boundary,fill = "transparent", color="black",size=0.5)+
  geom_sf(data = Charlotte, aes(color = q5(price)), size = 0.5,alpha=0.2)+
  scale_color_manual(values = palette6, 
                    labels = qBr(Charlotte, "price"),
                    name = "Home sale prices")+
  labs(title = "Map of home sale prices in Mecklenberg County, NC.", 
       subtitle = "Real Dollars; In tract unit")+
  mapTheme()

  • 3 interesting maps Regarding the choice of three interesting variables, the variables “heatedarea”, “bedrooms” and “fullbaths” are used here. It is easy to see from the graph that these three variables, based on the quality of the house itself, have a considerable influence on the price of the house.
## 3 interesting maps
Tableinterest = Table %>%
  gather(key,value, -price)

# 1
Tableinterest1 <- Tableinterest %>% filter(key == "heatedarea")
ggplot(Tableinterest1, aes(x = value, y = price))+
  geom_point(size=3,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=5)+
  labs(title="Scatter plot of price to heatedarea") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

# 2
Tableinterest2 <- Tableinterest %>% filter(key == "bedrooms")
ggplot(Tableinterest2, aes(x = value, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Scatter plot of price to bedrooms") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

# 3
Tableinterest3 <- Tableinterest %>% filter(key == "fullbaths")
ggplot(Tableinterest3, aes(x = value, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "red",size=5)+
  labs(title="Scatter plot of price to fullbaths") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

  • Here we use buffers to try to find the link between parks, a public amenity, and the home of house prices, however the results do not show the presence of park buffers in both areas with higher and lower house prices, indicating that the effect of parks on house prices does not seem to be strong.
#Bonus for something extra engaging(0.5 point)

parksunion <- st_buffer(st_union(Parks),500)%>%st_sf()
parksunion2 <- st_buffer(st_union(Parks),2000)%>%st_sf()
parksunion3 <- st_buffer(st_union(Parks),4000)%>%st_sf()
ggplot()+
  geom_sf(data = parksunion3, color="#DD1144", size = 1,fill="white")+
  geom_sf(data = tracts2021,color="gray50",fill = "transparent",size=0.5,linetype = "dashed")+
  geom_sf(data = boundary,fill = "transparent",color="black",size=1)+
  geom_sf(data = Charlotte, aes(color = q5(price)), size = 0.01,alpha=0.5)+
  scale_color_manual(values = palette6, 
                    labels = qBr(Table, "price"),
                    name = "Home sale prices")+
  labs(
    title = "Home prices with distance to parks",
    subtitle = "Real Dollars; In tract unit") +
  mapTheme() +
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Method

This house price forecasting model uses an OLS regression model, also known as
least squares regression. The objective is to minimize the sum of squares of the
difference between the observed values and the fitted values provided by the model.
In the actual model, we use multivariate spatial data to forecast housing prices in
Mecklenberg County, NC.
The whole forecasting process is divided into three parts, 1. forecasting parameter
selection, 2. forecasting model training, and 3. model prediction.
1. Forecasting parameter selection.
Firstly, we use the background knowledge of architecture and urban planning to make
the initial selection of parameters. Secondly, the parameters are further selected
by using Correlation Matrix and Home prices scatterplots.
2. Forecasting model training.
Firstly, the model is trained using least squares regression. According to the R^2
of the model and the data of the test set, the parameters of the model are adjusted
to keep the error rate in a low range.
3. Model prediction.
Based on the trained model, input the geoJSON information of the house to be
predicted, and automatically generate the predicted price of the house.

Result

Machine Learning & OLS Regression Model

## Train Results
Table <- Table %>%
  dplyr::select(Grocery_nn2,Park_nn3,Bus_nn3, School_nn1,Crime_nn4, Mall_nn2,bedrooms, fullbaths,heatedarea, halfbaths, shape_Area, units, price,  cdebuildin,heatedfuel,bldggrade, totalac, builtTime,extwall,foundation,numfpla,MedHHInc,pctWhite,pctPoverty,pctHispanic,ZIP) 
  
inTrain <- createDataPartition(
              y = paste(Table$cdebuildin, Table$codemunici,Table$heatedfuel, Table$bldggrade, Table$builtTime, Table$fullbaths, Table$halfbaths, Table$units, Table$bedrooms, Table$actype, Table$extwall, Table$foundation, Table$fireplaces, Table$numfpla,Table$typeofdeed, Table$landusecod, Table$accounttyp, Table$landsequen,Table$ZIP), 
              p = .60, list = FALSE)
Table.training <- Table[inTrain,] 
Table.test <- Table[-inTrain,]  

reg.training <- lm(price ~ ., data = Table.training)

Table.test <-
  Table.test %>%
  mutate(SalePrice.Predict = predict(reg.training, Table.test),
         SalePrice.Error = SalePrice.Predict - price,
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         SalePrice.APE = (abs(SalePrice.Predict - price)) / SalePrice.Predict)%>%
  filter(price < 5000000)

MAE <- mean(Table.test$SalePrice.AbsError, na.rm = T)
MAPE <- mean(Table.test$SalePrice.APE, na.rm = T)
MAE = c(MAE) %>% format(., digits = 3)
MAPE = c(MAPE) %>% format(., digits = 3)
Model = c("Model with Tract ID", "Model with Spatial Lag")
summaryTable1 = cbind(Model, MAE, MAPE)
  • Table of results(training set) After selecting the completed training set parameters, the following are the results of the training set training.
##Table of results(0.5point)
stargazer(st_drop_geometry(reg.training), 
          type = 'text', 
          title = "OLS Regression",
          covariate.labels = c())
## 
## OLS Regression
## =========================================================
##                                  Dependent variable:     
##                              ----------------------------
##                                         price            
## ---------------------------------------------------------
## Grocery_nn2                            6.272***          
##                                        (1.682)           
##                                                          
## Park_nn3                               3.897***          
##                                        (1.030)           
##                                                          
## Bus_nn3                               -7.965***          
##                                        (0.836)           
##                                                          
## School_nn1                              1.148            
##                                        (0.922)           
##                                                          
## Crime_nn4                             12.936***          
##                                        (1.074)           
##                                                          
## Mall_nn2                              -8.201***          
##                                        (1.597)           
##                                                          
## bedrooms1                            108,164.900*        
##                                      (61,515.700)        
##                                                          
## bedrooms15                        72,812,376.000***      
##                                     (350,086.100)        
##                                                          
## bedrooms2                            57,410.750*         
##                                      (30,279.330)        
##                                                          
## bedrooms3                            53,688.510*         
##                                      (29,435.030)        
##                                                          
## bedrooms4                             21,441.770         
##                                      (29,494.920)        
##                                                          
## bedrooms5                             -5,999.548         
##                                      (30,119.030)        
##                                                          
## bedrooms50                            92,810.440         
##                                     (244,615.200)        
##                                                          
## bedrooms6                            -57,629.420*        
##                                      (33,319.730)        
##                                                          
## bedrooms65                           -92,368.050         
##                                     (342,195.400)        
##                                                          
## bedrooms7                            -25,690.820         
##                                      (60,079.640)        
##                                                          
## bedrooms8                           -253,826.800*        
##                                     (149,158.400)        
##                                                          
## bedrooms9                           824,988.300**        
##                                     (342,195.000)        
##                                                          
## fullbaths1                           -32,594.830         
##                                      (50,727.600)        
##                                                          
## fullbaths2                            -5,863.357         
##                                      (50,143.930)        
##                                                          
## fullbaths3                           -11,003.280         
##                                      (50,159.200)        
##                                                          
## fullbaths4                            26,037.900         
##                                      (50,691.980)        
##                                                          
## fullbaths5                          162,114.400***       
##                                      (52,709.110)        
##                                                          
## fullbaths6                          329,447.700***       
##                                      (60,864.560)        
##                                                          
## fullbaths7                          330,335.200***       
##                                      (83,704.630)        
##                                                          
## fullbaths8                         1,156,454.000***      
##                                     (141,792.800)        
##                                                          
## fullbaths9                         3,048,614.000***      
##                                     (393,800.300)        
##                                                          
## heatedarea                            133.282***         
##                                        (4.050)           
##                                                          
## halfbaths1                          -11,196.700**        
##                                      (4,569.134)         
##                                                          
## halfbaths2                          80,355.490***        
##                                      (15,386.570)        
##                                                          
## halfbaths3                          213,465.500***       
##                                      (46,688.550)        
##                                                          
## halfbaths4                           205,039.800*        
##                                     (122,330.600)        
##                                                          
## shape_Area                             2.483***          
##                                        (0.054)           
##                                                          
## units1                              -56,903.230***       
##                                      (11,444.110)        
##                                                          
## units2                              107,875.500**        
##                                      (53,207.910)        
##                                                          
## units205                                                 
##                                                          
##                                                          
## units3                              338,266.600***       
##                                      (90,169.860)        
##                                                          
## units4                                67,544.020         
##                                     (173,391.000)        
##                                                          
## units8                               511,809.500         
##                                     (344,655.100)        
##                                                          
## cdebuildin01T                        -52,918.440         
##                                     (242,351.900)        
##                                                          
## cdebuildin05                         -44,713.970         
##                                      (84,290.900)        
##                                                          
## cdebuildin08                         -109,249.500        
##                                      (91,115.570)        
##                                                          
## cdebuildin62                       -202,996.600***       
##                                      (52,160.480)        
##                                                          
## cdebuildin69                         -36,663.960         
##                                     (340,957.500)        
##                                                          
## heatedfuelGAS                       22,410.740***        
##                                      (5,825.424)         
##                                                          
## heatedfuelNONE                        60,603.970         
##                                      (42,298.190)        
##                                                          
## heatedfuelOIL/WD/COAL                 6,065.259          
##                                      (23,692.290)        
##                                                          
## heatedfuelSOLAR/GEOTHRM               42,128.200         
##                                     (115,948.500)        
##                                                          
## bldggradeCUSTOM                    1,104,735.000***      
##                                      (27,843.580)        
##                                                          
## bldggradeEXCELLENT                  526,599.200***       
##                                      (15,628.860)        
##                                                          
## bldggradeFAIR                         11,393.120         
##                                      (15,439.420)        
##                                                          
## bldggradeGOOD                       48,489.970***        
##                                      (5,783.309)         
##                                                          
## bldggradeMINIMUM                     -56,910.210         
##                                      (75,360.120)        
##                                                          
## bldggradeVERY GOOD                  198,486.700***       
##                                      (9,074.678)         
##                                                          
## totalac                                 -2.873           
##                                        (17.857)          
##                                                          
## builtTimeOld                         -12,082.470*        
##                                      (7,249.378)         
##                                                          
## builtTimeRecent                      -10,953.480*        
##                                      (6,557.517)         
##                                                          
## extwallASB SHNG/SDG                  -40,576.400         
##                                      (29,419.380)        
##                                                          
## extwallBOARD&BATTEN                   87,961.890         
##                                      (57,024.410)        
##                                                          
## extwallCEDAR,RDWD                    43,457.900**        
##                                      (18,136.840)        
##                                                          
## extwallCEM BR/SPL B                  -22,192.650         
##                                      (62,591.450)        
##                                                          
## extwallCOMP OR WLBD                   35,264.060         
##                                     (152,550.500)        
##                                                          
## extwallCONC BLOCK                     23,992.200         
##                                      (59,972.990)        
##                                                          
## extwallCORR MTL LGT                  -159,181.100        
##                                     (253,936.000)        
##                                                          
## extwallEXT PLYWOOD                   60,357.770**        
##                                      (28,960.680)        
##                                                          
## extwallFACE BRICK                     -2,585.066         
##                                      (6,302.627)         
##                                                          
## extwallHARDIPLK/DSGN VINYL          20,502.780***        
##                                      (6,818.515)         
##                                                          
## extwallIMITATION STONE               176,219.700         
##                                     (241,186.700)        
##                                                          
## extwallLOG                          -193,561.800**       
##                                      (88,671.230)        
##                                                          
## extwallMASONITE                       11,807.110         
##                                      (7,719.448)         
##                                                          
## extwallMODULAR MTL                   -104,214.900        
##                                     (201,344.900)        
##                                                          
## extwallPREFIN MTL                     63,102.960         
##                                     (196,813.400)        
##                                                          
## extwallRNFR CONC                  -1,604,332.000***      
##                                     (350,126.100)        
##                                                          
## extwallSDG MIN/NONE                   38,473.880         
##                                     (240,838.100)        
##                                                          
## extwallSIDG NO SHTG                   37,844.440         
##                                      (81,185.580)        
##                                                          
## extwallSTONE                        354,332.600***       
##                                      (64,093.400)        
##                                                          
## extwallSTUCCO HRDCT                 65,331.020***        
##                                      (15,562.080)        
##                                                          
## extwallSTUCCO SYNTH                   25,988.760         
##                                      (34,647.370)        
##                                                          
## extwallWOOD ON SHTG                   -1,246.000         
##                                      (10,007.550)        
##                                                          
## extwallWOOD SHINGLE                  63,689.030**        
##                                      (27,649.280)        
##                                                          
## foundationCRAWL SPACE               62,129.040***        
##                                      (12,842.560)        
##                                                          
## foundationPIER-NO FOUND WALL         268,588.400*        
##                                     (161,509.200)        
##                                                          
## foundationSLAB-ABV GRD              120,072.700**        
##                                      (52,459.190)        
##                                                          
## foundationSLAB-COM                    75,494.770         
##                                     (152,880.200)        
##                                                          
## foundationSLAB-RES                  74,030.640***        
##                                      (13,277.710)        
##                                                          
## numfplamany                           30,510.230         
##                                      (35,415.400)        
##                                                          
## numfplaNo                            11,166.230**        
##                                      (5,322.305)         
##                                                          
## MedHHInc                               1.179***          
##                                        (0.096)           
##                                                          
## pctWhite                             32,862.600**        
##                                      (14,593.500)        
##                                                          
## pctPoverty                          100,494.900***       
##                                      (36,545.300)        
##                                                          
## pctHispanic                           5,235.913          
##                                      (27,021.080)        
##                                                          
## ZIP28036                           -194,861.000***       
##                                      (17,367.400)        
##                                                          
## ZIP28078                           -181,024.700***       
##                                      (13,328.570)        
##                                                          
## ZIP28104                            -134,629.900**       
##                                      (56,211.600)        
##                                                          
## ZIP28105                            -92,775.060***       
##                                      (17,002.300)        
##                                                          
## ZIP28107                            -271,739.700**       
##                                     (105,815.900)        
##                                                          
## ZIP28134                            -69,543.400***       
##                                      (21,521.580)        
##                                                          
## ZIP28202                             160,145.300*        
##                                      (90,408.040)        
##                                                          
## ZIP28203                            264,208.000***       
##                                      (23,611.300)        
##                                                          
## ZIP28204                            242,930.000***       
##                                      (33,302.150)        
##                                                          
## ZIP28205                            119,489.900***       
##                                      (18,285.780)        
##                                                          
## ZIP28206                              37,541.850         
##                                      (28,422.800)        
##                                                          
## ZIP28207                            474,738.200***       
##                                      (27,388.840)        
##                                                          
## ZIP28208                              7,664.956          
##                                      (20,953.970)        
##                                                          
## ZIP28209                            220,518.000***       
##                                      (20,274.770)        
##                                                          
## ZIP28210                              14,374.820         
##                                      (18,583.140)        
##                                                          
## ZIP28211                            167,762.800***       
##                                      (19,056.930)        
##                                                          
## ZIP28212                            -44,057.800**        
##                                      (21,603.680)        
##                                                          
## ZIP28213                            -73,581.590***       
##                                      (20,783.400)        
##                                                          
## ZIP28214                            -84,067.020***       
##                                      (18,694.090)        
##                                                          
## ZIP28215                            -53,978.790***       
##                                      (18,745.480)        
##                                                          
## ZIP28216                             -29,161.260         
##                                      (19,686.920)        
##                                                          
## ZIP28217                            -58,763.520**        
##                                      (24,413.250)        
##                                                          
## ZIP28226                             -25,791.660         
##                                      (18,946.340)        
##                                                          
## ZIP28227                            -70,841.920***       
##                                      (16,877.260)        
##                                                          
## ZIP28262                             -39,882.390*        
##                                      (22,301.120)        
##                                                          
## ZIP28269                            -64,631.790***       
##                                      (18,569.900)        
##                                                          
## ZIP28270                            -53,494.070***       
##                                      (19,200.100)        
##                                                          
## ZIP28273                           -105,541.600***       
##                                      (21,299.220)        
##                                                          
## ZIP28277                            -43,456.350**        
##                                      (18,362.120)        
##                                                          
## ZIP28278                            -82,289.420***       
##                                      (19,592.490)        
##                                                          
## Constant                             -82,181.930         
##                                      (55,304.510)        
##                                                          
## ---------------------------------------------------------
## Observations                            32,980           
## R2                                      0.722            
## Adjusted R2                             0.721            
## Residual Std. Error            340,133.100 (df = 32859)  
## F Statistic                  712.695*** (df = 120; 32859)
## =========================================================
## Note:                         *p<0.1; **p<0.05; ***p<0.01
  • Table of goodness of fit(test set)(0.5points)
  • Here we show the Table of goodness of fit for the test set, focusing on the MAE and MAPE metrics. MAE is the mean absolute error, which only shows the predicted results, with an average difference of 90635 over the actual value.MAPE is the mean absolute percentage error, which illustrates the difference between the predicted and actual values in terms of range.
kable(summaryTable1, digits = 1, caption = "Table. Prediction precision of the first two models") %>%
  kable_classic(full_width = T)%>%
  footnote()
Table. Prediction precision of the first two models
Model MAE MAPE
Model with Tract ID 82799 0.222
Model with Spatial Lag 82799 0.222

Accuracy

  • The following are the results of Cross-Validation Cross-validation, sometimes called round-robin estimation, is a statistically useful method for cutting data samples into smaller subsets. The analysis can then be done on one subset first, while the other subsets are used for subsequent confirmation and validation of this analysis. The initial subset is called the training set. The other subsets are called the validation set or test set. Here we perform 100-fold cross-validation.It can be used to determine the average goodness of fit of the model.The results of cross-validation show that the MAE of the model is around 100,000, which, together with the compact distribution of MAE, indicates that the model has a good fit, which means it’s powerful.
###Cross-Validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = Table,
     method = "lm", trControl = fitControl, na.action = na.pass) 

data.frame(Test = c("Cross_Validation"), 
           Mean = mean(reg.cv$resample[,3]),
           Max = max(reg.cv$resample[,3]),
           Min = min(reg.cv$resample[,3]),
           Standard_Deviation = sd(reg.cv$resample[,3]))%>%
  kable() %>%
  kable_styling() %>%
  footnote(general_title = "Summary Statistics of Cross Validation, k = 100 folds",
           general = "Table")
Test Mean Max Min Standard_Deviation
Cross_Validation 105617.4 262458.7 82653.56 22714.22
Summary Statistics of Cross Validation, k = 100 folds
Table
reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(
    legend.position = "none"
  )

hist(reg.cv$resample[,3],xlab="MAE", col="#A2BEDF", breaks = 20, main = "Histogram of MAE")

  • The scatter plot of predicted prices as a function of observed prices also shows a compact distribution and a correlation coefficient close to 1, which is also a sign for a good fit of the model.
Table.test %>% 
ggplot(aes(SalePrice.Predict, price))+
geom_point() +
stat_smooth(aes(SalePrice.Predict, price), 
            method = "lm", se = FALSE, size = 1, colour="#25CB10")+
labs(title="Predicted prices as a function of observed prices") +
plotTheme()

Check Cluster & Neighborhood Effects

  • Although many spatial processes have been taken into account, there is still an error rate of about 20% that cannot be accounted for. To explore this point, we verify it by mapping MAE, spatial lag, and Moran’s I.
  • As can be seen from the graph of the results, although neither the error map nor the spatial lag presents very clearly that the error has spatial autocorrelation, the result of the Moran’s I value of 0.22 indicates that the spatial distribution of the error is not completely random. Therefore, it indicates that the model has a neglected spatial process.
###risdural map
errorToPlot <- Charlotte2 %>% 
  left_join(Table.test, by = "Grocery_nn2") %>% 
  filter(SalePrice.AbsError != 0) %>% 
  dplyr::select(SalePrice.AbsError,SalePrice.APE)


ggplot() +
    geom_sf(data=errorToPlot,  aes(color = q5(SalePrice.APE)), size=0.5)+
    geom_sf(data = tracts2021, color="gray96",fill = "transparent",size=0.5)+
    geom_sf(data = boundary,fill = "transparent", color="black")+
    scale_color_manual(values = palette6, 
                    labels = qBr(Charlotte, "price"),
                    name = "Sale Price Errors")+
    labs(title = "Sale price errors on test set", 
       subtitle = "Risdural Map")+
     mapTheme()

##Spatial Lag
Table.testshp <- dplyr::select(Charlotte2, Grocery_nn2) %>% 
  left_join(Table.test, by = "Grocery_nn2") %>% 
  na.omit() 
coords.test <-  st_coordinates(Table.testshp) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")

Lag <- Table.testshp%>% 
  mutate(nhood5 = nn_function(coords.test, coords.test, 6))

ggplot(Lag, aes(nhood5,SalePrice.Error))+
geom_point() +
stat_smooth(aes(nhood5,SalePrice.Error), 
            method = "lm", se = FALSE, size = 1, colour="#25CB10")+
labs(title="Erros as a function of the Spatial Lag of price") +
plotTheme()

###Moran's I

moranTest <- moran.mc(Table.testshp$SalePrice.Error, 
                      spatialWeights.test, nsim = 999)
moranTest
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Table.testshp$SalePrice.Error 
## weights: spatialWeights.test  
## number of simulations + 1: 1000 
## 
## statistic = 0.20307, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

ggplot() +
  geom_sf(data=Table.testshp,  aes(color = q5(SalePrice.Predict)), size=0.1)+
    geom_sf(data = tracts2021, color="gray50",fill = "transparent",size=0.5)+
  geom_sf(data = boundary,fill = "transparent", color="black")+
  scale_color_manual(values = palette6, 
                    labels = qBr(Charlotte, "price"),
                    name = "Predicted prices")+
  labs(title = "Predicted home prices in Mecklenberg County, NC.", 
       subtitle = "Real Dollars")+
  mapTheme()

Generalizability

  • To test the generalizability of the model, we confirm the performance of the model across communities by calculating the average APE of the communities. The results show that there are still individual communities with higher MAPE compared to others.
MAPE_nhood <- Table.test %>% 
  group_by(ZIP) %>% 
  summarise( MAPE= mean(SalePrice.APE),
              Price.mean = mean(price))
MAPE_bynhood <- ZIP %>% 
  left_join(MAPE_nhood, by="ZIP")

ggplot() +
    geom_sf(data=MAPE_bynhood,  aes(fill = MAPE))+
    scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "MAPE by neighborhood(Test set)") +
      mapTheme()

  • Here in the scatter plot of MAPE by neighborhood as a function of mean price by neighborhood, there is no clear relationship between the MAPE and home price.
MAPE_bynhood %>% 
ggplot(aes(Price.mean, MAPE))+
geom_point() +
stat_smooth(aes(Price.mean, MAPE), 
            method = "lm", se = FALSE, size = 1, colour="#25CB10")+
labs(title="MAPE by neighborhood as a function of mean price by neighborhood") +
plotTheme()

  • We also tested the generalizability of the model across income groups and across races. The results show that the accuracy of the model is similar between the white and non-white groups. In contrast, the accuracy of the models differed more between different income groups. This implies that the generalizability of the model still has some shortcomings.
IncomeRace <- tracts2021 %>% 
  mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc > 35000, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(IncomeRace), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(IncomeRace), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

st_join(Table.testshp, IncomeRace) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context") %>% 
  kable_styling() %>%
  footnote(general = "Table Income Content")
Test set MAPE by neighborhood racial context
Majority Non-White Majority White
24% 21%
Note:
Table Income Content
st_join(Table.testshp, IncomeRace) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")%>% 
  kable_styling() %>%
  footnote(general = "Table Race Content")
Test set MAPE by neighborhood income context
High Income Low Income
22% 39%
Note:
Table Race Content
  • In general, although our model exhibits some accuracy and generalizability, it has significant shortcomings. The lack of accuracy is not clear spatially distributted, but there might be smaller scale spatial process that is not taken into considered in this model.The lack of generalizability is mainly reflected in the poor generalizability within different communities and among different income groups. The median income feature in the model doesn’t affect the model that much. We may need to find more detailed income data or other features related to income to avoid this.

Conclusion & Recommendation

  • I would recommend this model to zillow. this model shows some ability while the direction of deficiency is very clear, which makes this model have the potential to be more optimized. I think it can be improved in two ways, one is to find smaller scale spatial process, and the other is to explore the impact of some unconventional thinking features on the model, for example, the per capita tree canopy cover can be considered as an urban amenity into the model.