Geospatial Risk Prediction

Introduction

How to reduce crime and protect public safety as much as possible with limited government budget and number of personnel has been the focus of administration. Geospatial risk model, as a regression model, is good for the investment of government resources and gives corresponding suggestions. Based on the assumption that there is a correlation between crime and corresponding certain geographical information, such as street facilities, building conditions, etc. We can identify those potential risks. Armed robbery being a violent crime, police presence tends to be free of preconceived bias. Therefore this may be a more generalizable predictive model.

#Input all the data
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)

## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS:  WGS 84

policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS:  WGS 84

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

##read socrata is a open data type.get the data and the type and do the filtering you need.
robberies <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "ROBBERY" & Description == "ARMED: HANDGUN") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271')

## Reading layer `chicagoBoundary' from data source 
##   `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

Robbery Map

With publicly available data, we can easily access the occurrence of armed robberies within the Chicago city limits in 2017. The density map allows us to see that armed robberies in Chicago show a clustering pattern in the western and southern parts of the city.

# 1.A map of your outcome of interest in point form, with some description of what, when, and why you think selection bias may be an issue.
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = robberies, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Robbery, Chicago - 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(robberies)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Robberies") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

## Count of Robberies for the fishnet However we should not consider armed robbery as a phenomenon that changes between administrative units, but rather as data that is constantly changing within a homogeneous grid within the Chicago city limits. Rasterizing the data better expresses this one spatial trend.

## add a value of 1 to each crime, sum them with aggregate
## 2.A map of your outcome joined to the fishnet.
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

crime_net <- 
  dplyr::select(robberies) %>% 
  mutate(countRobberies = 1) %>% 
  aggregate(., fishnet, sum) %>% ##this is what is mentioned above.
  mutate(countRobberies = replace_na(countRobberies, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), ##cvID 24 is the highest number.
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countRobberies), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Robberies for the fishnet") +
  mapTheme()

# For demo. requires updated mapview package
# xx <- mapview::mapview(crime_net, zcol = "countBurglaries")
# yy <- mapview::mapview(mutate(burglaries, ID = seq(1:n())))
# xx + yy

The next step was about the screening of risk factors. First, some obvious spatial elements were included, such as abandoned houses and vehicles, as well as damaged streetlights and graffiti. These elements are often associated with crime in most people’s view. Next are elements that are less visual, but have logical support. Examples include spirits outlets and or refuse collection points. Alcohol and crime tend to present a correlation, and dumpsters are also marketed and associated with crime. Again based on that, I chose grocery stores and affordable housing as two additional elements. Grocery store robberies often occur, while affordable housing reflects to some extent the economic conditions of the change of location.

## using Socrata again
## A small multiple map of your risk factors in the fishnet (counts, distance and/or other feature engineering approaches).
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")
graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")
streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")
sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")
liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")
GroceryStores <- 
  read.socrata("https://data.cityofchicago.org/resource/53t8-wyrc.json") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "GroceryStores")
AffordableRentalHousing <- 
  read.socrata("https://data.cityofchicago.org/resource/s6ha-ppgi.json") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "AffordableRentalHousing")


## Neighborhoods to use in LOOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

## Reading layer `chicago' from data source 
##   `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,liquorRetail, graffiti, sanitation,GroceryStores,AffordableRentalHousing) %>%
  st_join(fishnet, join=st_within) %>% ##take each cell which it belong. each cell have an id and a count of cars.the points in the polygon
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

## `summarise()` has grouped output by 'uniqueID'. You can override using the
## `.groups` argument.
## Joining, by = "uniqueID"

Risk Factors by Fishnet

This is a small multiple map of your risk factors in the fishnet.By importing the above put eight kinds of data, we can get a series of Fishnet plots. In some scatter plots, we can see some spatial aggregation, and some similar spatial distribution of armed robbery.

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol=4, top="Risk Factors by Fishnet"))

And in the k-nearest-neighbor analysis, this spatial aggregation seems to be amplified.

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      GroceryStores.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      AffordableRentalHousing.nn = 
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3)
      )

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 4, top = "Nearest Neighbor risk Factors by Fishnet"))

loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric()

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

## Joining, by = "uniqueID"

final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

Local Moran’s I

Here we use the Local Morans I index to measure the spatial agglomeration of armed robbery activities. In the calculation, we choose the queen neighbor of the raster, if P_Value is less than or equal to 0.001. This means that we can say that 99.9% of the certain elements are discrete, i.e., they do not show aggregation. On the contrary, it shows aggregation. From the figure, we can see that robbery in the west-central and south-central part of the city of Chicago does show spatial aggregation.

## this is the planb version
local_morans <- localmoran(final_net$countRobberies, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(count_Robberies = countRobberies, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)
  
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Robberies"))

final_net <- final_net %>% 
  mutate(Robberies.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(Robberies.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           Robberies.isSig == 1))), 
                       k = 1))

Robberies count as a function of risk factors

Next, we use multiple scatterplot to visualize the relationship between each spatial factor and armed robbery. In general, the number of these eight factors showed a positive correlation with the number of armed robberies. The highest correlation is for abandoned houses. According to the calculation results. For every 10 additional abandoned houses in the region, there are 4 additional armed robberies.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
    gather(Variable, Value, -countRobberies)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countRobberies, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countRobberies)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Robberies count as a function of risk factors") +
  plotTheme()

## `geom_smooth()` using formula 'y ~ x'

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", "GroceryStores.nn","AffordableRentalHousing.nn",
              "loopDistance")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
                 "Liquor_Retail.nn", "Street_Lights_Out.nn",         "Sanitation.nn","GroceryStores.nn","AffordableRentalHousing.nn", 
                 "loopDistance", "Robberies.isSig", "Robberies.isSig.dist")

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobberies",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countRobberies, Prediction, geometry)

## This hold out fold is 54

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(id)` instead of `id` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

## This hold out fold is 61 
## This hold out fold is 44 
## This hold out fold is 15 
## This hold out fold is 69 
## This hold out fold is 81 
## This hold out fold is 16 
## This hold out fold is 62 
## This hold out fold is 77 
## This hold out fold is 30 
## This hold out fold is 21 
## This hold out fold is 36 
## This hold out fold is 97 
## This hold out fold is 73 
## This hold out fold is 50 
## This hold out fold is 87 
## This hold out fold is 65 
## This hold out fold is 23 
## This hold out fold is 25 
## This hold out fold is 74 
## This hold out fold is 4 
## This hold out fold is 9 
## This hold out fold is 2 
## This hold out fold is 88 
## This hold out fold is 45 
## This hold out fold is 68 
## This hold out fold is 34 
## This hold out fold is 71 
## This hold out fold is 64 
## This hold out fold is 92 
## This hold out fold is 27 
## This hold out fold is 12 
## This hold out fold is 101 
## This hold out fold is 90 
## This hold out fold is 41 
## This hold out fold is 18 
## This hold out fold is 39 
## This hold out fold is 46 
## This hold out fold is 63 
## This hold out fold is 49 
## This hold out fold is 53 
## This hold out fold is 56 
## This hold out fold is 35 
## This hold out fold is 89 
## This hold out fold is 26 
## This hold out fold is 72 
## This hold out fold is 3 
## This hold out fold is 32 
## This hold out fold is 59 
## This hold out fold is 55 
## This hold out fold is 14 
## This hold out fold is 70 
## This hold out fold is 58 
## This hold out fold is 19 
## This hold out fold is 43 
## This hold out fold is 24 
## This hold out fold is 60 
## This hold out fold is 100 
## This hold out fold is 93 
## This hold out fold is 11 
## This hold out fold is 40 
## This hold out fold is 96 
## This hold out fold is 33 
## This hold out fold is 48 
## This hold out fold is 1 
## This hold out fold is 31 
## This hold out fold is 42 
## This hold out fold is 80 
## This hold out fold is 98 
## This hold out fold is 79 
## This hold out fold is 37 
## This hold out fold is 8 
## This hold out fold is 38 
## This hold out fold is 47 
## This hold out fold is 28 
## This hold out fold is 99 
## This hold out fold is 22 
## This hold out fold is 5 
## This hold out fold is 85 
## This hold out fold is 52 
## This hold out fold is 10 
## This hold out fold is 95 
## This hold out fold is 29 
## This hold out fold is 83 
## This hold out fold is 78 
## This hold out fold is 76 
## This hold out fold is 20 
## This hold out fold is 75 
## This hold out fold is 6 
## This hold out fold is 94 
## This hold out fold is 91 
## This hold out fold is 17 
## This hold out fold is 84 
## This hold out fold is 102 
## This hold out fold is 7 
## This hold out fold is 86 
## This hold out fold is 51 
## This hold out fold is 67 
## This hold out fold is 57 
## This hold out fold is 13 
## This hold out fold is 82 
## This hold out fold is 66

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobberies",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countRobberies, Prediction, geometry)

## This hold out fold is 54 
## This hold out fold is 61 
## This hold out fold is 44 
## This hold out fold is 15 
## This hold out fold is 69 
## This hold out fold is 81 
## This hold out fold is 16 
## This hold out fold is 62 
## This hold out fold is 77 
## This hold out fold is 30 
## This hold out fold is 21 
## This hold out fold is 36 
## This hold out fold is 97 
## This hold out fold is 73 
## This hold out fold is 50 
## This hold out fold is 87 
## This hold out fold is 65 
## This hold out fold is 23 
## This hold out fold is 25 
## This hold out fold is 74 
## This hold out fold is 4 
## This hold out fold is 9 
## This hold out fold is 2 
## This hold out fold is 88 
## This hold out fold is 45 
## This hold out fold is 68 
## This hold out fold is 34 
## This hold out fold is 71 
## This hold out fold is 64 
## This hold out fold is 92 
## This hold out fold is 27 
## This hold out fold is 12 
## This hold out fold is 101 
## This hold out fold is 90 
## This hold out fold is 41 
## This hold out fold is 18 
## This hold out fold is 39 
## This hold out fold is 46 
## This hold out fold is 63 
## This hold out fold is 49 
## This hold out fold is 53 
## This hold out fold is 56 
## This hold out fold is 35 
## This hold out fold is 89 
## This hold out fold is 26 
## This hold out fold is 72 
## This hold out fold is 3 
## This hold out fold is 32 
## This hold out fold is 59 
## This hold out fold is 55 
## This hold out fold is 14 
## This hold out fold is 70 
## This hold out fold is 58 
## This hold out fold is 19 
## This hold out fold is 43 
## This hold out fold is 24 
## This hold out fold is 60 
## This hold out fold is 100 
## This hold out fold is 93 
## This hold out fold is 11 
## This hold out fold is 40 
## This hold out fold is 96 
## This hold out fold is 33 
## This hold out fold is 48 
## This hold out fold is 1 
## This hold out fold is 31 
## This hold out fold is 42 
## This hold out fold is 80 
## This hold out fold is 98 
## This hold out fold is 79 
## This hold out fold is 37 
## This hold out fold is 8 
## This hold out fold is 38 
## This hold out fold is 47 
## This hold out fold is 28 
## This hold out fold is 99 
## This hold out fold is 22 
## This hold out fold is 5 
## This hold out fold is 85 
## This hold out fold is 52 
## This hold out fold is 10 
## This hold out fold is 95 
## This hold out fold is 29 
## This hold out fold is 83 
## This hold out fold is 78 
## This hold out fold is 76 
## This hold out fold is 20 
## This hold out fold is 75 
## This hold out fold is 6 
## This hold out fold is 94 
## This hold out fold is 91 
## This hold out fold is 17 
## This hold out fold is 84 
## This hold out fold is 102 
## This hold out fold is 7 
## This hold out fold is 86 
## This hold out fold is 51 
## This hold out fold is 67 
## This hold out fold is 57 
## This hold out fold is 13 
## This hold out fold is 82 
## This hold out fold is 66

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobberies",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countRobberies, Prediction, geometry)

## This hold out fold is Riverdale 
## This hold out fold is Hegewisch 
## This hold out fold is West Pullman 
## This hold out fold is South Deering 
## This hold out fold is Morgan Park 
## This hold out fold is Mount Greenwood 
## This hold out fold is Roseland 
## This hold out fold is Pullman 
## This hold out fold is East Side 
## This hold out fold is Beverly 
## This hold out fold is Washington Heights 
## This hold out fold is Burnside 
## This hold out fold is Calumet Heights 
## This hold out fold is Chatham 
## This hold out fold is South Chicago 
## This hold out fold is Auburn Gresham 
## This hold out fold is Ashburn 
## This hold out fold is Avalon Park 
## This hold out fold is West Lawn 
## This hold out fold is Grand Crossing 
## This hold out fold is South Shore 
## This hold out fold is Chicago Lawn 
## This hold out fold is Englewood 
## This hold out fold is Woodlawn 
## This hold out fold is Clearing 
## This hold out fold is Jackson Park 
## This hold out fold is Washington Park 
## This hold out fold is Garfield Ridge 
## This hold out fold is West Elsdon 
## This hold out fold is Gage Park 
## This hold out fold is Hyde Park 
## This hold out fold is New City 
## This hold out fold is Fuller Park 
## This hold out fold is Archer Heights 
## This hold out fold is Brighton Park 
## This hold out fold is Grand Boulevard 
## This hold out fold is Kenwood 
## This hold out fold is Oakland 
## This hold out fold is Little Village 
## This hold out fold is Mckinley Park 
## This hold out fold is Bridgeport 
## This hold out fold is Armour Square 
## This hold out fold is Douglas 
## This hold out fold is Lower West Side 
## This hold out fold is North Lawndale 
## This hold out fold is Chinatown 
## This hold out fold is Near South Side 
## This hold out fold is Museum Campus 
## This hold out fold is Little Italy, UIC 
## This hold out fold is West Loop 
## This hold out fold is Austin 
## This hold out fold is Printers Row 
## This hold out fold is Garfield Park 
## This hold out fold is Grant Park 
## This hold out fold is United Center 
## This hold out fold is Greektown 
## This hold out fold is Loop 
## This hold out fold is Millenium Park 
## This hold out fold is Humboldt Park 
## This hold out fold is West Town 
## This hold out fold is River North 
## This hold out fold is Streeterville 
## This hold out fold is Ukrainian Village 
## This hold out fold is East Village 
## This hold out fold is Rush & Division 
## This hold out fold is Wicker Park 
## This hold out fold is Gold Coast 
## This hold out fold is Galewood 
## This hold out fold is Old Town 
## This hold out fold is Lincoln Park 
## This hold out fold is Belmont Cragin 
## This hold out fold is Hermosa 
## This hold out fold is Logan Square 
## This hold out fold is Bucktown 
## This hold out fold is Montclare 
## This hold out fold is Sheffield & DePaul 
## This hold out fold is Dunning 
## This hold out fold is Avondale 
## This hold out fold is North Center 
## This hold out fold is Lake View 
## This hold out fold is Portage Park 
## This hold out fold is Irving Park 
## This hold out fold is Boystown 
## This hold out fold is Wrigleyville 
## This hold out fold is Uptown 
## This hold out fold is Albany Park 
## This hold out fold is Lincoln Square 
## This hold out fold is Norwood Park 
## This hold out fold is Jefferson Park 
## This hold out fold is Sauganash,Forest Glen 
## This hold out fold is North Park 
## This hold out fold is Andersonville 
## This hold out fold is Edgewater 
## This hold out fold is West Ridge 
## This hold out fold is Edison Park 
## This hold out fold is Rogers Park

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobberies",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countRobberies, Prediction, geometry)

## This hold out fold is Riverdale 
## This hold out fold is Hegewisch 
## This hold out fold is West Pullman 
## This hold out fold is South Deering 
## This hold out fold is Morgan Park 
## This hold out fold is Mount Greenwood 
## This hold out fold is Roseland 
## This hold out fold is Pullman 
## This hold out fold is East Side 
## This hold out fold is Beverly 
## This hold out fold is Washington Heights 
## This hold out fold is Burnside 
## This hold out fold is Calumet Heights 
## This hold out fold is Chatham 
## This hold out fold is South Chicago 
## This hold out fold is Auburn Gresham 
## This hold out fold is Ashburn 
## This hold out fold is Avalon Park 
## This hold out fold is West Lawn 
## This hold out fold is Grand Crossing 
## This hold out fold is South Shore 
## This hold out fold is Chicago Lawn 
## This hold out fold is Englewood 
## This hold out fold is Woodlawn 
## This hold out fold is Clearing 
## This hold out fold is Jackson Park 
## This hold out fold is Washington Park 
## This hold out fold is Garfield Ridge 
## This hold out fold is West Elsdon 
## This hold out fold is Gage Park 
## This hold out fold is Hyde Park 
## This hold out fold is New City 
## This hold out fold is Fuller Park 
## This hold out fold is Archer Heights 
## This hold out fold is Brighton Park 
## This hold out fold is Grand Boulevard 
## This hold out fold is Kenwood 
## This hold out fold is Oakland 
## This hold out fold is Little Village 
## This hold out fold is Mckinley Park 
## This hold out fold is Bridgeport 
## This hold out fold is Armour Square 
## This hold out fold is Douglas 
## This hold out fold is Lower West Side 
## This hold out fold is North Lawndale 
## This hold out fold is Chinatown 
## This hold out fold is Near South Side 
## This hold out fold is Museum Campus 
## This hold out fold is Little Italy, UIC 
## This hold out fold is West Loop 
## This hold out fold is Austin 
## This hold out fold is Printers Row 
## This hold out fold is Garfield Park 
## This hold out fold is Grant Park 
## This hold out fold is United Center 
## This hold out fold is Greektown 
## This hold out fold is Loop 
## This hold out fold is Millenium Park 
## This hold out fold is Humboldt Park 
## This hold out fold is West Town 
## This hold out fold is River North 
## This hold out fold is Streeterville 
## This hold out fold is Ukrainian Village 
## This hold out fold is East Village 
## This hold out fold is Rush & Division 
## This hold out fold is Wicker Park 
## This hold out fold is Gold Coast 
## This hold out fold is Galewood 
## This hold out fold is Old Town 
## This hold out fold is Lincoln Park 
## This hold out fold is Belmont Cragin 
## This hold out fold is Hermosa 
## This hold out fold is Logan Square 
## This hold out fold is Bucktown 
## This hold out fold is Montclare 
## This hold out fold is Sheffield & DePaul 
## This hold out fold is Dunning 
## This hold out fold is Avondale 
## This hold out fold is North Center 
## This hold out fold is Lake View 
## This hold out fold is Portage Park 
## This hold out fold is Irving Park 
## This hold out fold is Boystown 
## This hold out fold is Wrigleyville 
## This hold out fold is Uptown 
## This hold out fold is Albany Park 
## This hold out fold is Lincoln Square 
## This hold out fold is Norwood Park 
## This hold out fold is Jefferson Park 
## This hold out fold is Sauganash,Forest Glen 
## This hold out fold is North Park 
## This hold out fold is Andersonville 
## This hold out fold is Edgewater 
## This hold out fold is West Ridge 
## This hold out fold is Edison Park 
## This hold out fold is Rogers Park

Distribution of MAE

With the regression analysis we then use k fold cross validation and LOGO cross validation to validate the accuracy of the model respectively. the advantage of LOGO-CV is that it avoids overfitting and has a better fit at the same time.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countRobberies,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countRobberies,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countRobberies,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countRobberies,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 


error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRobberies, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
    plotTheme()

A table of MAE and standard deviation MAE by regression

From the table we can see that Spatial Process has a lower MAS than Just Risk Factors

## A table of MAE and standard deviation MAE by regression.
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF")

Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	0.43	0.31
Random k-fold CV: Spatial Process	0.36	0.28
Spatial LOGO-CV: Just Risk Factors	0.86	0.72
Spatial LOGO-CV: Spatial Process	0.77	0.57

## A table of raw errors by race context for a random k-fold vs. spatial cross validation regression.
error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Robbery errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

## A table of raw errors by race context for a random k-fold vs. spatial cross validation regression

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

  filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]])

## # A tibble: 2 × 3
##   Regression                         Morans_I p_value
##   <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors  0.108     0.051
## 2 Spatial LOGO-CV: Spatial Process    0.00525   0.393

A table of raw errors by race context for a random k-fold vs. spatial cross validation regression

The graph below reflects that: all models overpredict in low robbery areas and underpredict in hot areas. Overprediction in areas of lower armed robbery may highlight areas of potential risk. The underprediction in areas of higher armed robbery may reflect the difficulty in predicting hotspots.

st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
    mutate(Robberies_Decile = ntile(countRobberies, 10)) %>%
  group_by(Regression, Robberies_Decile) %>%
    summarize(meanObserved = mean(countRobberies, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -Robberies_Decile) %>%          
    ggplot(aes(Robberies_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = Robberies_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Predicted and observed burglary by observed burglary decile")  +
      plotTheme()

## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.

Comparison of Kernel Density and Risk Predictions

The map comparing kernel density to risk predictions for the next year’s crime.The comparison between the prediction results and the actual situation we can find that there is a problem of underfitting the high-risk areas. There is a spatial misalignment between the predicted and actual results.

burg_ppp <- as.ppp(st_coordinates(robberies), W = st_bbox(final_net))
burg_KD.1000 <- spatstat.core::density.ppp(burg_ppp, 1000)
burg_KD.1500 <- spatstat.core::density.ppp(burg_ppp, 1500)
burg_KD.2000 <- spatstat.core::density.ppp(burg_ppp, 2000)
burg_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(burg_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(burg_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(burg_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

burg_KD.df$Legend <- factor(burg_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

as.data.frame(burg_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(robberies, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 robberies") +
     mapTheme(title_size = 14)

robberies18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "ROBBERY" & 
         Description == "ARMED: HANDGUN") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

burg_KDE_sum <- as.data.frame(burg_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(burg_KDE_sum$value, 
                             n = 5, "fisher")
burg_KDE_sf <- burg_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robberies18) %>% mutate(burgCount = 1), ., sum) %>%
    mutate(burgCount = replace_na(burgCount, 0))) %>%
  dplyr::select(label, Risk_Category, burgCount)

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
burg_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robberies18) %>% mutate(burgCount = 1), ., sum) %>%
      mutate(burgCount = replace_na(burgCount, 0))) %>%
  dplyr::select(label,Risk_Category, burgCount)

rbind(burg_KDE_sf, burg_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(robberies18, 3000), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 robbery risk predictions; 2018 robberies") +
    mapTheme(title_size = 14)

## Risk prediction vs. Kernel density, 2018 robberies From the table we can see more clearly that the forecast results are too high for both low risk areas and very high risk, and too low for the middle part of the forecast.

rbind(burg_KDE_sf, burg_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobberies = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countRobberies / sum(countRobberies)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2018 robberies",
           y = "% of Test Set Robberies (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.

Summary

Overall I would be less likely to recommend my model. In terms of results, while the high risk areas are predicted to have more cases than they actually do, at the same time the low risk areas are predicted to have more cases than they actually do. And the middle part of the predictions are all less than the actual situation, which means that if the police force is allocated according to the actual results, most of the cases will be understaffed and a few low-risk areas will be overstaffed. There is still a lot of room for improvement, such as using more years of data or adding more predictor variables to solve the problem of over-prediction in low-risk areas, and the model will become better.