Prediction Parking Demand

1.Introduction

1.1. Research Background

SFMTA Since the Ford Model T was first introduced on October 1, 1908, the automobile has become an integral part of society. One of the problems that have plagued mankind at the same time is the problem of parking. Difficulty in parking first of all imposes additional time costs on individuals. Vehicles wandering on the streets in search of a parking space add to the congestion on the roads. At the same time, more exhaust emissions and environmental pollution are created. To address these problems, some local governments have attempted to ensure the availability of parking spaces by dynamically adjusting parking prices, such as the SFMTA in our study, which seeks to implement a dynamic parking pricing policy for on-street parkingmeters between 9 a.m. and 6 p.m., Monday through Saturday, every week. This is intended to use the increased parking prices to ensure that there are always one or two spaces available on each street for vehicles to park, thus making it difficult for vehicles to find a space.

1.2. User & Use Case

This study is intended to help the City of San Francisco better evaluate and utilize the existing dynamic pricing policy, i.e., to be able to predict and regulate the occupancy rate of parking spaces through dynamic pricing and various spatial and temporal factors to ensure the availability of parking spaces. To address this issue, first, we will download the relevant parking meter usage data from the SFMTA website. Secondly, we will also use US census data to filter some parameters related to parking occupancy. After completing the exploration and sorting of the above data, we will build a regression model to analyze and predict the occupancy rate of parking spaces. At the same time, we will evaluate the validity of our model using cross-validation methods.


2. Data Wrangling.

2.1. Setup

Let’s load relevant libraries and some graphic themes.

library(prettydoc)
library(rmdformats)
library(tidyverse)
library(ggplot2)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(osmdata)
library(tigris)
library(osmextract)
library(curl)
library(reshape2)
library(glue)
library(dismo)
library(spgwr)
library(MASS)
library(lme4)
library(data.table)
library(kableExtra)
library(stargazer)
library(RSocrata)
library(knitr)
library(gifski)
library(rjson)
library(riem)
library(gganimate)
library(viridis)
library(lubridate)
library(tigris)
library(geojsonio)
library(magrittr)

#-----Import external functions & a palette-----
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme1 <- function(base_size = 35, title_size = 45) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(hjust = 0.5, size = title_size, colour = "black",face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5,size=base_size,face="italic"),
    plot.caption = element_text(size=base_size,hjust=0),
    axis.ticks = element_blank(),
    panel.spacing = unit(6, 'lines'),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=base_size),
    axis.title = element_text(face = "bold",size=base_size),
    axis.text = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size=base_size,colour = "black", face = "italic"),
    legend.text = element_text(size=base_size,colour = "black", face = "italic"),
    strip.text.x = element_text(size = base_size,face = "bold")
  )
}

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
palette1_main <- "#174C4F"
palette1_assist <- '#F9B294'

2.2. Import Parking Data

Through the official website of San Francisco Municipal Transportation Agency (SFMTA), we can download the parking data since 2020. Due to the high demand of parking in San Francisco, the total data volume is 105 million rows. Therefore, when importing the api, we filtered the data to only get the data from 20220501 9am to 20220514 6pm. The two weeks of data are used to form the basic dataset of our model.

## This part variable has been saved locally as dat, location
dat <- read.socrata("https://data.sfgov.org/resource/imvp-dq3v.csv?$where=SESSION_START_DT%20between%20%272022-05-1T9:00:00%27%20and%20%272022-05-14T17:00:00%27")

location <- 
  read.socrata("https://data.sfgov.org/resource/8vzz-qzz9.csv") %>%
    st_as_sf(coords = c("latitude", "longitude"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()
glimpse(dat)

By calculating the total time each parking meter is used during the day, we can here calculate the parking occupancy of each parking meter based on the post_id. Here we just process the data and wait for the analysis to be done later.

## This part variable has been saved locally as dat2, parking_rate
dat2 <- dat %>% 
  left_join(location, by=('post_id'='post_id')) %>% 
  dplyr::select(post_id, street_block, session_start_dt, session_end_dt, meter_event_type, gross_paid_amt, on_offstreet_type, ms_space_num, old_rate_area, street_name,shape,geometry) %>% 
    mutate(end_interval15 = floor_date(ymd_hms(session_end_dt), unit = "15 mins"),
         start_interval15 = floor_date(ymd_hms(session_start_dt), unit = "15 mins"),
         ms_space_num = ifelse(ms_space_num == 0, 1, ms_space_num))

parking_rate <- dat2 %>% 
  mutate(length = end_interval15 - start_interval15,
         parking_hour = as.numeric(length)/3600,) %>% 
  dplyr::select(start_interval15, street_block, length,gross_paid_amt,parking_hour,post_id) %>% 
  mutate(rate=gross_paid_amt/parking_hour) %>% 
  filter(parking_hour!=0) %>% 
  dplyr::select(post_id, start_interval15, rate, street_block)

2.3. Import Census Data

In US Census Data, we have selected the following data: TotalPop, Whites, AfricanAmericans, Asians, MedHHINC, MedRent. First, we believe that there is a direct relationship between population size and parking occupancy, which is the basic logic of supply and demand in the market. Second, we also argue that resident income and rents also have an impact on parking occupancy.

CensusData <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E","B02001_002E","B19013_001E","B25058_001E",'B02001_003E','B02001_005E'), 
          year=2019, state="CA", county="SAN FRANCISCO", geometry=T, output="wide") %>%
  st_transform('ESRI:102243') %>%
  rename(Census_TotalPop = B01003_001E, 
         Census_Whites = B02001_002E,
         Census_AfricanAmericans = B02001_003E,
         Census_Asians = B02001_005E,
         Census_MedHHInc = B19013_001E, 
         Census_MedRent = B25058_001E) %>%
  dplyr::select(-NAME,-starts_with("B")) %>%
  mutate(Census_pctWhite = ifelse(Census_TotalPop > 0, 100*Census_Whites / Census_TotalPop,0),
         Census_pctAfricanAmericans = ifelse(Census_TotalPop > 0, 100*Census_AfricanAmericans / Census_TotalPop,0),
         Census_pctAsians = ifelse(Census_TotalPop > 0, 100*Census_Asians / Census_TotalPop,0),
         Census_blockgroupareasqm = as.numeric(st_area(.)),
         Census_areaperpeople = ifelse(Census_blockgroupareasqm > 0,Census_blockgroupareasqm/Census_TotalPop,0),
         year = "2019") %>%
  dplyr::select(-Census_Whites,-Census_AfricanAmericans,-Census_Asians, -GEOID)
# Geometries
CensusData2 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001E"), 
          year=2019, state="CA", county="SAN FRANCISCO", geometry=T, output="wide") %>%
  st_transform('ESRI:102243') %>%
  dplyr::select(-NAME,-starts_with("B"))

#-----Join Parking Meter data to census data-----
trainData <-st_join(location,dplyr::select(CensusData,-Census_blockgroupareasqm,-year,-geometry,-Census_TotalPop))
trainData <-st_join(location,dplyr::select(CensusData2, GEOID, -geometry))
## plot parking meters
boundary = st_read("D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/DATAEXPO/SF/trim.shp") %>%
  st_transform(crs) %>%
  filter(objectid == "32")
CensusData3 <- CensusData2 %>%  
  filter(GEOID!="06075980401")%>% 
  filter(GEOID!="06075017902")
ggplot()+
  geom_sf(data = CensusData3, color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boundary,fill = "transparent", color="black",size=100000000)+
  geom_sf(data = trainData, size = 1,color=palette1_main )+
  scale_color_manual(values = palette5,
                    name = "Home sale prices")+
  labs(title = "Map of Parking Meters in San Francisco", 
       subtitle = "In tract unit")+
  mapTheme()

ParkingMap

After importing the US Census Data of San Francisco, we can show the location information of the parking meters on the map. From the map, we can see that most of the Parking Meters are located in the northeast part of San Francisco.

2.4. Import Spatial Data

After we finish importing the censusdata, let’s import some spatial data. Using the data from openstreetmap, we can easily retrieve which facility locations we think are relevant to parking occupancy. And we use k-nearest-neighbor function and buffer to process these data. In the initial screening of the data, we selected a large number of data, such as the location of parks and restaurants. We use both of these algorithms to calculate the data information around each parking meter for subsequent analysis.

## Set the data obtain area from OSM
q0 <- opq(bbox = c(-122.55,37.70,-122.35,37.82))
crs = "ESRI:102243"

## Define the the function to obtain different data.
get_osm_1 <- function(bbox, key, value, geom, crs){
  object <- add_osm_feature(opq = bbox, key = key, value = value) %>%
    osmdata_sf(.)
  geom_name <- paste("osm_", geom, sep = "")
  object.sf <- st_geometry(object[[geom_name]]) %>%
    st_transform(crs) %>%
    st_sf() %>%
    cbind(., object[[geom_name]][[key]]) %>%
    rename(NAME = "object..geom_name....key..")
  return(object.sf)
}

# Buffer Count - Count the number of appearance within the buffer of Parking Meter
# Input: Parking Meter data (sf), object/amenity type, buffer radius, object data
buffer_count = function(trainData, type, radius, data){
  ParkingBuffer = st_buffer(trainData, radius)
  trainData[type] = st_intersects(ParkingBuffer, data) %>%
    sapply(., length)
  return(trainData)
}

# Buffer Area - for polygon amenities, calculate the aggregate area of all entries that intersect with the buffer of a Parking Meter
# Input: Parking Meter data (sf), object/amenity type, buffer radius, object data
buffer_area = function(trainData, type, radius, data){
  data <- mutate(data, area = st_area(data))
  trainData[type] = st_buffer(trainData, radius) %>%
    aggregate(data %>% dplyr:: select (geometry, area),., sum) %>%
    pull(area)
  return(trainData)
}

# K-near - Calculate the average distance of a Parking Meter to its nearest (1 to 5) amenity object(s)
# Input: Parking Meter data, object/amenity data, object/amenity type, geometry type (point or non-point)
knear <- function(trainData, data, type, geom){
  if (geom != "points") {
    data =  data %>% dplyr::select(geometry) %>%st_centroid(.) %>% st_coordinates
  }
  else {
    data = data %>% dplyr::select(geometry) %>% st_coordinates
  }
  trainDataCoords = st_coordinates(trainData)
  nn_1 = nn_function(trainDataCoords, data, 1)
  nn_2 = nn_function(trainDataCoords, data, 2)
  nn_3 = nn_function(trainDataCoords, data, 3)
  nn_4 = nn_function(trainDataCoords, data, 4)
  nn_5 = nn_function(trainDataCoords, data, 5)
  trainData[paste(type, "_nn1", sep = "")] = nn_1
  trainData[paste(type, "_nn2", sep = "")] = nn_2
  trainData[paste(type, "_nn3", sep = "")] = nn_3
  trainData[paste(type, "_nn4", sep = "")] = nn_4
  trainData[paste(type, "_nn5", sep = "")] = nn_5
  return(trainData)
}

# Import data(DO NOT FORGET TO ADD DATA HERE!!!!!!!,这儿做各种spatial操作的data,记得替换!!!!!)
trainData <- location
# Select the data from OSM using the functions above.
#-----Amenity: parks-----

# Get park data, same below
parks = get_osm_1(q0, "leisure", "park", "polygons", "ESRI:102243")

## How many parks within 1000 m buffer of each home, same below
trainData = buffer_count(trainData, "parkCount", 1000, parks)
## Aggregate park area of all intersected park, same below
trainData = buffer_area(trainData, "parkArea", 1000, parks)

#-----Amenity: restaurants-----

restaurants = get_osm_1(q0, "amenity", "restaurant", "points", crs)
trainData = buffer_count(trainData, "restaurantsCount", 1000, restaurants)

# Average distance to 1~5 nearest restaurant(s), same below
trainData = knear(trainData, restaurants, "restaurants", "points")

#-----Amenity: schools-----

schools = get_osm_1(q0, 'amenity', 'school', "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "schoolCount", 1000, schools)
trainData = knear(trainData, schools, "schools", "polygons")

# Assign each home to its nearest school
schoolDistrict = voronoi(st_coordinates(schools %>% st_centroid())) %>% 
  st_as_sf() %>% 
  st_set_crs("ESRI:102243") %>%
  rename(schoolNo = id) %>%
  mutate(schoolNo = as.character(schoolNo))
trainData = st_join(trainData, schoolDistrict)

#-----Amenity: universities-----

# Get university data
universities = get_osm_1(q0, 'amenity', 'university', "polygons", "ESRI:102243")

# Whether university is within 1000 m buffer of each home
trainData = buffer_count(trainData, "universitiesCount", 1000, universities)

#-----Amenity: parking-----

parking = get_osm_1(q0, "amenity", "parking", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "parkingCount", 1000, parking)
trainData = buffer_area(trainData, "parkingArea", 1000, parking)
trainData = knear(trainData, parking, "parking", "polygons")

#-----Amenity: clinics-----

clinics = get_osm_1(q0, "amenity", "clinic", "points", "ESRI:102243")
trainData = knear(trainData, clinics, "clinics", "points")

#-----Amenity: hospitals-----

hospitals = get_osm_1(q0, "amenity", "hospital", "polygons", "ESRI:102243")
trainData = knear(trainData, hospitals, "hospitals", "polygons")

#-----Amenity: cinemas-----
cinemas = get_osm_1(q0, "amenity", "cinema", "points", "ESRI:102243")
trainData = buffer_count(trainData, "cinemasCount", 1000, cinemas)
trainData = buffer_count(trainData, "cinemasCount2", 2000, cinemas)

#-----Amenity: stadiums-----
stadiums = get_osm_1(q0, "building", "stadium", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "stadiumsCount", 1000, stadiums)

#-----Amenity: commerce-----
commercial = get_osm_1(q0, "building", "commercial", "polygons", "ESRI:102243")
trainData = knear(trainData, commercial, "commerce", "polygons")
trainData = buffer_count(trainData, "commerceCount", 1000, commercial)

#-----Amenity: retail-----
retail = get_osm_1(q0, "building", "retail", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "retailCount", 1000, retail)
trainData = knear(trainData, retail, "retail", "polygons")

#-----Amenity: common leisure space-----
commonleisure = get_osm_1(q0, "leisure", "common", "polygons", "ESRI:102243")
trainData = knear(trainData, commonleisure, "commonLeisure", "polygons")

#-----Amenity: fitness centers-----
fitness_centre = get_osm_1(q0, "leisure", "fitness_centre", "points", "ESRI:102243")
trainData = buffer_count(trainData, "fitnessCenterCount", 1000, fitness_centre)
trainData = knear(trainData, fitness_centre, "fitnessCenter", "pooints")

#-----Amenity: public gardens-----
garden = get_osm_1(q0, "leisure", "garden", "polygons", "ESRI:102243")
trainData = knear(trainData, garden, "gardens", "polygons")
trainData = buffer_count(trainData, "gardensCount", 1000, garden)

#-----Jobs: companies-----
companies = get_osm_1(q0, "office", "company", "points", "ESRI:102243")
trainData = knear(trainData, companies, "companies", "points")

#-----Transport: public transport-----
public_transport = get_osm_1(q0, "public_transport", "station", "points", "ESRI:102243")
trainData <-
  trainData %>% 
  mutate(
    public_transport_nn1 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 1),
    public_transport_nn2 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 2))
#-----Join trainData to census data-----

trainData1 <-st_join(trainData,dplyr::select(CensusData,-Census_blockgroupareasqm,-year,-geometry,-Census_TotalPop))
trainData1 <-st_join(trainData1,dplyr::select(CensusData2, GEOID, -geometry))
# Possible Distribution Transformation
trainDataNumeric2 = trainData1[,numericColumns] %>% 
  st_drop_geometry()
i=1
par(mfrow=c(65,1))
for (column in trainDataNumeric2) {
  names = names(trainDataNumeric2)
  name = names[i:i]
  par(mfrow=c(1,2));hist(as.numeric(column),breaks=80,main=c(name,"Before"));hist(log(1+as.numeric(column)),breaks=80,main=c(name,"After"))
  i = i+1
}
dev.off()
#-----Select Numeric Variables-----
numericColumns = unlist(lapply(trainData, is.numeric))
trainDataNumeric = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>% 
  st_drop_geometry() %>% 
  gather(key,value, -price)

#-----Make scatter-plots of all numeric variables----

ggplot(trainDataNumeric, aes(x = value, y = price))+
  geom_point(size=.05,alpha=0.5)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=0.5)+
  facet_wrap(~key, scales = "free",ncol = 8)+
  plotTheme()
# Possible Distribution Transformation
trainDataNumeric2 = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>%
  st_drop_geometry()
i=1
par(mfrow=c(65,1))
for (column in trainDataNumeric2) {
  names = names(trainDataNumeric2)
  name = names[i:i]
  par(mfrow=c(1,2));hist(as.numeric(column),breaks=80,main=c(name,"Before"));hist(log(1+as.numeric(column)),breaks=80,main=c(name,"After"))
  i = i+1
}
dev.off()

2.5.Import time data

Time token

image:

Here we will analyze our algorithm for time, time token, the above graph illustrates our calculation rules, for a parking record on a parking meter, the time it occupies is rounded to 15min time period, for the same time period, different parking spaces under the same Geo_id, the above rounded time token is added up, we get the total time token for the first time period.

add_rate <- parking_rate %>% 
  group_by(street_block) %>% 
  summarize(rate= mean(rate))

panel_st_block <- panel_join %>%
  filter(occupancy<=1) %>% 
  mutate(count=1) %>% 
  group_by(street_block,start_interval15) %>% 
  summarize(occ_space = sum(occupancy),
            all_space = sum(count)) %>% 
  mutate(occupancy = occ_space/all_space) %>% 
  dplyr::select(-occ_space, -all_space) %>% 
  left_join(add_rate)


predictors_test <- panel_join %>% 
              count(post_id)

predictors <- trainData1 %>% 
  left_join(panel_join %>% 
              group_by(post_id, street_block)) %>%
  na.omit()

predictors <- predictors %>% 
  group_by(street_block) %>% 
  summarize(GEOID=first(GEOID.x),
            parkCount=mean(parkCount),
            parkArea=mean(parkArea),
            restaurantsCount=mean(restaurantsCount),
            restaurants_nn4=mean(restaurants_nn4),
            schoolCount=mean(schoolCount),
            schools_nn4=mean(schools_nn4),
            universitiesCount=mean(universitiesCount),
            parkingCount=mean(parkingCount),
            parkingArea=mean(parkingArea),
            parking_nn4=mean(parking_nn4),
            clinics_nn4=mean(clinics_nn4),
            hospitals_nn4=mean(hospitals_nn4),
            cinemasCount=mean(cinemasCount),
            stadiumsCount=mean(stadiumsCount),
            commerce_nn4=mean(commerce_nn4),
            commerceCount=mean(commerceCount),
            retailCount=mean(retailCount),
            retail_nn4=mean(retail_nn4),
            commonLeisure_nn4=mean(commonLeisure_nn4),
            fitnessCenterCount=mean(fitnessCenterCount),
            fitnessCenter_nn4=mean(fitnessCenter_nn4),
            gardens_nn4=mean(gardens_nn4),
            gardensCount=mean(gardensCount),
            companies_nn4=mean(companies_nn4),
            public_transport_nn1=mean(public_transport_nn1),
            public_transport_nn2=mean(public_transport_nn2),
            Census_MedHHInc=mean(Census_MedHHInc),
            Census_MedRent=mean(Census_MedRent),
            Census_pctWhite=mean(Census_pctWhite),
            Census_areaperpeople=mean(Census_areaperpeople),
            Census_pctAfricanAmericans=mean(Census_pctAfricanAmericans),
            Census_pctAsians=mean(Census_pctAsians))

train_data_1 <- panel_st_block %>% 
  left_join(predictors, by="street_block")

Space Lag & Time Lag

Here, we merge all the time-space predictors as the basis for our subsequent models.

space_lag <- train_data_1 %>% 
  dplyr::select(GEOID,start_interval15,occupancy) %>% 
  group_by(GEOID,start_interval15) %>% 
  summarize(lag_occ = mean(occupancy))

train_data_2 <- train_data_1 %>% 
  left_join(space_lag) 
train_data_3 <- train_data_2 %>% 
  mutate(hod = hour(start_interval15)) %>% 
  filter(hod>=9 & hod<18)

time_lag <- train_data_3 %>% 
  dplyr::select(street_block,start_interval15,occupancy,hod) %>% 
  arrange(start_interval15) %>% 
  mutate(lagHour = dplyr::lag(occupancy,4),
         lag2Hours = dplyr::lag(occupancy, 8),
         lag3Hours = dplyr::lag(occupancy,12),
         lag6Hours = dplyr::lag(occupancy, 24),
         lag1day = dplyr::lag(occupancy,36)) %>% 
  mutate(lagHour = replace(lagHour, hod>=9&hod<10, NA),
         lag2Hours = replace(lag2Hours, hod>=9&hod<11, NA),
         lag3Hours = replace(lag3Hours, hod>=9&hod<12, NA),
         lag6Hours = replace(lag6Hours, hod>=9&hod<15, NA),
         lag1day = replace(lag1day, is.na(lag1day), occupancy)) %>% 
  replace(is.na(.), 0) 

train_data_4 <- train_data_3 %>% 
  left_join(time_lag)

3. Data Exploratory

In this section we will analyze the data obtained in the previous section.

3.1. Parking Time Distribution

The first is the distribution of parking hours. In the SFMTA’s parking meter charging policy, only the hours from Monday to Saturday, 9am to 6pm, are charged. Therefore, outside of these hours, the number of parking meters recorded by the parking meter is greatly reduced. The most frequent parking is between 9am and 6pm, which is the main paid parking period, and the number of parking decreases as time goes by. This is in line with our perception of travel to work and parking situation.

## Distribution
distribution <- dat2 %>% 
  mutate(dtime = format(as.POSIXct(start_interval15), format="%H")) %>% 
  dplyr::select(dtime) %>%
  mutate(dtime=as.numeric(distribution$dtime)) %>% 
  filter(dtime>=5 & dtime<22)
barplot(table(distribution$dtime),
main = "Parking Time Distribution",
xlab = "Hour of a Day",
ylab = "Frequency")

image1

3.2.Parking Price Rate

Whether the cost of parking is related to the occupancy rate of parking is also a part we would like to know. The time period is divided into 9:00-10:00, 12:00-13:00, 15:00-16:00, 17:00-18:00 quadrants, and the pearson correlation is shown with the parking rate as the horizontal coordinate and the parking occupancy rate as the vertical coordinate. From the graph, we can see that most of the parking meters have a parking rate between 0.4 and 0.5. As the parking rate increases, the change of occupancy rate is quite limited.

## Parking price rate
price_rate <- train_data_4 %>% 
  filter(hod==9 | hod==12 | hod==15 | hod==17) %>% 
  select(occupancy, rate, hod)

image2 <- ggplot(price_rate)+             ##图表2 相关性散点图
  geom_point(aes(x = rate, 
                 y = occupancy))+ ###散点数据
  geom_smooth(aes(x = rate, 
                  y = occupancy), 
              method = "lm", se = FALSE)+ ###拟合直线
  facet_wrap(~hod, scales = "free",ncol=2)+
  labs(
    title = "Parking Price by Occupancy",
    subtitle = "On 9:00-10:00, 12:00-13:00, 15:00-16:00, 17:00-18:00 ",
    x="rate", 
    y="occupancy")

ggsave("Parking Price by Occupancy.jpg",width = 10,height = 20, image2)

image2

3.3. Occupancy by Day of Week

Which days of the week have higher occupancy rates is also one of our concerns. Because the SFMTA’s parking meter is free on weekdays, the fact that a lower occupancy rate is recorded on weekdays does not mean that the parking occupancy rate is actually lower. For the rest of the time period, the daily parking occupancy rate is guaranteed to be around 0.5.

## Occupancy by day of week
day_of_week <- train_data_4 %>% 
  select(occupancy,start_interval15) %>%
  mutate(dotw = wday(start_interval15, label=TRUE)) %>% 
  group_by(dotw) %>% 
  summarize(occupancy = mean(occupancy))
  
image3 <- ggplot(day_of_week, aes(x = dotw, y = occupancy)) + 
  geom_line(aes(group=1)) +
  labs(title="Parking Occupancy by day of week",
       x="Day of week", 
       y="Average Occupancy")+
     plotTheme
ggsave("Parking Price by day of week.png", image3)

image3

3.4. Parking Occupancy by day of week

Finally, the parking occupancy at different times of the day for different Geo_ids. The horizontal coordinate starts at 9 a.m. and ends at 6 p.m. For most Geo_ids, the parking occupancy always stays within the same interval.

## Rate by time
Rate_time <- train_data_4 %>% 
  mutate(dtime = format(as.POSIXct(start_interval15), format="%H")) %>% 
  group_by(dtime, GEOID) %>% 
  summarize(rate = mean(rate))

image4 <- ggplot(Rate_time, aes(x = dtime, y = rate)) + 
  geom_line(aes(group=1)) +
  facet_wrap(~GEOID, scales = "free",ncol=5)+
  labs(title="Parking Occupancy by day of week",
       x="Day of week", 
       y="Average Occupancy")+
     plotTheme

ggsave("Rate by time.jpg",width = 10, height = 40,image4)

image4


4. Regression Model

In this section, we analyze the parking occupancy in combination with time lag and spatial lag.

Our predictor includes the spatial elements (restaurantCount, restaurant_nn4, schoolCount, schools_nn4), time lag (lagHour, lag2Hours, lag3Hours, lag6Hours), and Censusdata (MedHHInc, etc.)

model_data_offcial <- train_data_4 %>% 
  mutate(dotw = wday(start_interval15, label=TRUE))


set.seed(3456)
trainIndex <- createDataPartition(y=paste(model_data_offcial$street_block,model_data_offcial$GEOID, model_data_offcial$dotw), p = .70,
                                  list = FALSE,
                                  times = 1)

parking.Train <- model_data_offcial[ trainIndex,]
parking.Test <- model_data_offcial[ -trainIndex,]

reg.training <- 
  lm(occupancy ~  rate + parkCount + parkArea + restaurantsCount + restaurants_nn4 + schoolCount + schools_nn4 + universitiesCount + parkingCount + parkingArea + parking_nn4 + clinics_nn4 + hospitals_nn4 + cinemasCount + stadiumsCount + commerce_nn4 + commerceCount + retailCount + retail_nn4 + commonLeisure_nn4 + fitnessCenterCount + fitnessCenter_nn4 + gardens_nn4 + gardensCount + companies_nn4 + public_transport_nn1 + public_transport_nn1 + Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_areaperpeople + Census_pctAfricanAmericans + Census_pctAsians + lag_occ + hod + dotw + lagHour + lag2Hours + lag3Hours + lag6Hours + lag1day,  data=parking.Train)

# 不用展示R方
#    stargazer(reg.training, 
#              type = 'text', 
#              title = "OLS Regression",
#              covariate.labels = c())

4.1. Goodness of fit

For the results of the model, the MAE of the model = 0.0395,MAPE of the model = 0.824

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

parking.Test$occupancy.Predict = predict(reg.training, parking.Test)

parking.Test <-
  parking.Test %>%
  mutate(occupancy.Error = occupancy.Predict - occupancy,
         occupancy.AbsError = abs(occupancy.Predict - occupancy),
         occupancy.APE = (abs(occupancy.Predict - occupancy)) / occupancy.Predict)

MAE <- mean(parking.Test$occupancy.AbsError, na.rm = T)
MAPE <- mean(parking.Test$occupancy.APE, na.rm = T)
MAE = c(MAE) %>% format(., digits = 3)
MAPE = c(MAPE) %>% format(., digits = 3)
Model = c("Model Fitness")
summaryTable1 = cbind(Model, MAE, MAPE)

kable(summaryTable1, digits = 1, caption = "Table. Prediction precision of the first two models") %>%
  kable_classic(full_width = T)%>%
  footnote()

image5

4.2. Examine Error Metrics for Accuracy

The following chart shows our predicted results compared to the actual results.The cyan line is the predicted result and the red line is the actual result. As can be seen in the same figure, the model presents a better prediction in terms of overall trend due to the inclusion of time lag. However, on the whole, the occupancy rate of the model prediction is lower compared to the actual situation. This leads to a situation where the predicted results appear in parallel to the actual results during the free hours of the day.

parking_test_plot <- parking.Test %>% 
    mutate(interval60 = floor_date(ymd_hms(start_interval15), unit = "60 mins"))%>% 
    dplyr::select(interval60, street_block, occupancy, occupancy.Predict) %>%
    unnest()%>%
    na.omit() %>%
    gather(Variable, Value, -interval60, -street_block) %>%
    group_by(Variable, interval60) %>%
    summarize(Value = mean(Value))

image6 <- parking_test_plot %>% 
    ggplot(aes(interval60, Value, color=Variable)) + 
      geom_line(size = 1.1) + 
      labs(title = "Predicted/Observed Occupancy", subtitle = "SF; A test set of 2 weeks",  x = "Hour", y= "Occupancy") +
      plotTheme
##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/Final_data_model/.RData')

ggsave("Predicted/Observed Occupancy.jpg",width = 10, height = 5,image6)

image6


5. Generalizability

5.1. Cross Validation

In the cross-validation analysis, our results are as follows. Similar to the results of the test set, the MAE of the model = 0.0757, which is slightly imprecise compared to a distribution that occupies the range between 0 and 1.

# R program to implement
# K-fold cross-validation
 
# setting seed to generate a
# reproducible random sampling
set.seed(125)
 
# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(occupancy ~ dotw+lagHour+lag2Hours+lag3Hours+lag6Hours+lag1day, data = park.cross,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)

image8


6. Application Development

Finally we will share our user-specific interface design. Our app is intended to manage the occupancy of parking spots by helping SFMTA officials with pricing.

image7

In this online interface, decision makers can view projections of future parking rates on certain streets by entering prices. image8

---
title: "Prediction Parking Demand"
author: "Yuhao Jia & Zile Wu"
date: "December 15, 2022"
output: 
  rmdformats::readthedown:
    code_folding: "hide"
    code_download: true

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = FALSE)
```

# 1.Introduction

## 1.1. Research Background
![SFMTA](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/PPT/DEMO for 1209/1.png)
Since the Ford Model T was first introduced on October 1, 1908, the automobile has become an integral part of society. One of the problems that have plagued mankind at the same time is the problem of parking.
Difficulty in parking first of all imposes additional time costs on individuals. Vehicles wandering on the streets in search of a parking space add to the congestion on the roads. At the same time, more exhaust emissions and environmental pollution are created.
To address these problems, some local governments have attempted to ensure the availability of parking spaces by dynamically adjusting parking prices, such as the SFMTA in our study, which seeks to implement a dynamic parking pricing policy for on-street parkingmeters between 9 a.m. and 6 p.m., Monday through Saturday, every week. This is intended to use the increased parking prices to ensure that there are always one or two spaces available on each street for vehicles to park, thus making it difficult for vehicles to find a space.

## 1.2. User & Use Case
This study is intended to help the City of San Francisco better evaluate and utilize the existing dynamic pricing policy, i.e., to be able to predict and regulate the occupancy rate of parking spaces through dynamic pricing and various spatial and temporal factors to ensure the availability of parking spaces.
To address this issue, first, we will download the relevant parking meter usage data from the SFMTA website. Secondly, we will also use US census data to filter some parameters related to parking occupancy. After completing the exploration and sorting of the above data, we will build a regression model to analyze and predict the occupancy rate of parking spaces. At the same time, we will evaluate the validity of our model using cross-validation methods.

---

# 2. Data Wrangling.

## 2.1. Setup

Let's load relevant libraries and some graphic themes.

```{r setup_13, message=FALSE, class.source = 'fold-hide'}
library(prettydoc)
library(rmdformats)
library(tidyverse)
library(ggplot2)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(osmdata)
library(tigris)
library(osmextract)
library(curl)
library(reshape2)
library(glue)
library(dismo)
library(spgwr)
library(MASS)
library(lme4)
library(data.table)
library(kableExtra)
library(stargazer)
library(RSocrata)
library(knitr)
library(gifski)
library(rjson)
library(riem)
library(gganimate)
library(viridis)
library(lubridate)
library(tigris)
library(geojsonio)
library(magrittr)

#-----Import external functions & a palette-----
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme1 <- function(base_size = 35, title_size = 45) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(hjust = 0.5, size = title_size, colour = "black",face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5,size=base_size,face="italic"),
    plot.caption = element_text(size=base_size,hjust=0),
    axis.ticks = element_blank(),
    panel.spacing = unit(6, 'lines'),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=base_size),
    axis.title = element_text(face = "bold",size=base_size),
    axis.text = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size=base_size,colour = "black", face = "italic"),
    legend.text = element_text(size=base_size,colour = "black", face = "italic"),
    strip.text.x = element_text(size = base_size,face = "bold")
  )
}

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
palette1_main <- "#174C4F"
palette1_assist <- '#F9B294'

```


## 2.2. Import Parking Data
Through the official website of San Francisco Municipal Transportation Agency (SFMTA), we can download the parking data since 2020. Due to the high demand of parking in San Francisco, the total data volume is 105 million rows. Therefore, when importing the api, we filtered the data to only get the data from 20220501 9am to 20220514 6pm. The two weeks of data are used to form the basic dataset of our model.

```{r read_dat}
## This part variable has been saved locally as dat, location
dat <- read.socrata("https://data.sfgov.org/resource/imvp-dq3v.csv?$where=SESSION_START_DT%20between%20%272022-05-1T9:00:00%27%20and%20%272022-05-14T17:00:00%27")

location <- 
  read.socrata("https://data.sfgov.org/resource/8vzz-qzz9.csv") %>%
    st_as_sf(coords = c("latitude", "longitude"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()
glimpse(dat)
```

By calculating the total time each parking meter is used during the day, we can here calculate the parking occupancy of each parking meter based on the post_id. Here we just process the data and wait for the analysis to be done later.

```{r glimpse_dat1}
## This part variable has been saved locally as dat2, parking_rate
dat2 <- dat %>% 
  left_join(location, by=('post_id'='post_id')) %>% 
  dplyr::select(post_id, street_block, session_start_dt, session_end_dt, meter_event_type, gross_paid_amt, on_offstreet_type, ms_space_num, old_rate_area, street_name,shape,geometry) %>% 
    mutate(end_interval15 = floor_date(ymd_hms(session_end_dt), unit = "15 mins"),
         start_interval15 = floor_date(ymd_hms(session_start_dt), unit = "15 mins"),
         ms_space_num = ifelse(ms_space_num == 0, 1, ms_space_num))

parking_rate <- dat2 %>% 
  mutate(length = end_interval15 - start_interval15,
         parking_hour = as.numeric(length)/3600,) %>% 
  dplyr::select(start_interval15, street_block, length,gross_paid_amt,parking_hour,post_id) %>% 
  mutate(rate=gross_paid_amt/parking_hour) %>% 
  filter(parking_hour!=0) %>% 
  dplyr::select(post_id, start_interval15, rate, street_block)
```

## 2.3. Import Census Data

In US Census Data, we have selected the following data: TotalPop, Whites, AfricanAmericans, Asians, MedHHINC, MedRent.
First, we believe that there is a direct relationship between population size and parking occupancy, which is the basic logic of supply and demand in the market. Second, we also argue that resident income and rents also have an impact on parking occupancy.

```{r install_census_API_key, warning = FALSE, include=FALSE, eval = TRUE}
# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
```

```{r get_census, message=FALSE, warning=FALSE, cache=TRUE, results = 'hide'}
CensusData <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E","B02001_002E","B19013_001E","B25058_001E",'B02001_003E','B02001_005E'), 
          year=2019, state="CA", county="SAN FRANCISCO", geometry=T, output="wide") %>%
  st_transform('ESRI:102243') %>%
  rename(Census_TotalPop = B01003_001E, 
         Census_Whites = B02001_002E,
         Census_AfricanAmericans = B02001_003E,
         Census_Asians = B02001_005E,
         Census_MedHHInc = B19013_001E, 
         Census_MedRent = B25058_001E) %>%
  dplyr::select(-NAME,-starts_with("B")) %>%
  mutate(Census_pctWhite = ifelse(Census_TotalPop > 0, 100*Census_Whites / Census_TotalPop,0),
         Census_pctAfricanAmericans = ifelse(Census_TotalPop > 0, 100*Census_AfricanAmericans / Census_TotalPop,0),
         Census_pctAsians = ifelse(Census_TotalPop > 0, 100*Census_Asians / Census_TotalPop,0),
         Census_blockgroupareasqm = as.numeric(st_area(.)),
         Census_areaperpeople = ifelse(Census_blockgroupareasqm > 0,Census_blockgroupareasqm/Census_TotalPop,0),
         year = "2019") %>%
  dplyr::select(-Census_Whites,-Census_AfricanAmericans,-Census_Asians, -GEOID)
```

```{r extract_geometries }
# Geometries
CensusData2 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001E"), 
          year=2019, state="CA", county="SAN FRANCISCO", geometry=T, output="wide") %>%
  st_transform('ESRI:102243') %>%
  dplyr::select(-NAME,-starts_with("B"))

#-----Join Parking Meter data to census data-----
trainData <-st_join(location,dplyr::select(CensusData,-Census_blockgroupareasqm,-year,-geometry,-Census_TotalPop))
trainData <-st_join(location,dplyr::select(CensusData2, GEOID, -geometry))
```

```{r, fig.height=40, fig.width=40}
## plot parking meters
boundary = st_read("D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/DATAEXPO/SF/trim.shp") %>%
  st_transform(crs) %>%
  filter(objectid == "32")
CensusData3 <- CensusData2 %>%  
  filter(GEOID!="06075980401")%>% 
  filter(GEOID!="06075017902")
ggplot()+
  geom_sf(data = CensusData3, color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boundary,fill = "transparent", color="black",size=100000000)+
  geom_sf(data = trainData, size = 1,color=palette1_main )+
  scale_color_manual(values = palette5,
                    name = "Home sale prices")+
  labs(title = "Map of Parking Meters in San Francisco", 
       subtitle = "In tract unit")+
  mapTheme()
```

![ParkingMap](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/ParkingMap.jpg)

After importing the US Census Data of San Francisco, we can show the location information of the parking meters on the map. From the map, we can see that most of the Parking Meters are located in the northeast part of San Francisco.

## 2.4. Import Spatial Data

After we finish importing the censusdata, let's import some spatial data. Using the data from openstreetmap, we can easily retrieve which facility locations we think are relevant to parking occupancy. And we use k-nearest-neighbor function and buffer to process these data.
In the initial screening of the data, we selected a large number of data, such as the location of parks and restaurants. We use both of these algorithms to calculate the data information around each parking meter for subsequent analysis.

```{r}
## Set the data obtain area from OSM
q0 <- opq(bbox = c(-122.55,37.70,-122.35,37.82))
crs = "ESRI:102243"

## Define the the function to obtain different data.
get_osm_1 <- function(bbox, key, value, geom, crs){
  object <- add_osm_feature(opq = bbox, key = key, value = value) %>%
    osmdata_sf(.)
  geom_name <- paste("osm_", geom, sep = "")
  object.sf <- st_geometry(object[[geom_name]]) %>%
    st_transform(crs) %>%
    st_sf() %>%
    cbind(., object[[geom_name]][[key]]) %>%
    rename(NAME = "object..geom_name....key..")
  return(object.sf)
}

# Buffer Count - Count the number of appearance within the buffer of Parking Meter
# Input: Parking Meter data (sf), object/amenity type, buffer radius, object data
buffer_count = function(trainData, type, radius, data){
  ParkingBuffer = st_buffer(trainData, radius)
  trainData[type] = st_intersects(ParkingBuffer, data) %>%
    sapply(., length)
  return(trainData)
}

# Buffer Area - for polygon amenities, calculate the aggregate area of all entries that intersect with the buffer of a Parking Meter
# Input: Parking Meter data (sf), object/amenity type, buffer radius, object data
buffer_area = function(trainData, type, radius, data){
  data <- mutate(data, area = st_area(data))
  trainData[type] = st_buffer(trainData, radius) %>%
    aggregate(data %>% dplyr:: select (geometry, area),., sum) %>%
    pull(area)
  return(trainData)
}

# K-near - Calculate the average distance of a Parking Meter to its nearest (1 to 5) amenity object(s)
# Input: Parking Meter data, object/amenity data, object/amenity type, geometry type (point or non-point)
knear <- function(trainData, data, type, geom){
  if (geom != "points") {
    data =  data %>% dplyr::select(geometry) %>%st_centroid(.) %>% st_coordinates
  }
  else {
    data = data %>% dplyr::select(geometry) %>% st_coordinates
  }
  trainDataCoords = st_coordinates(trainData)
  nn_1 = nn_function(trainDataCoords, data, 1)
  nn_2 = nn_function(trainDataCoords, data, 2)
  nn_3 = nn_function(trainDataCoords, data, 3)
  nn_4 = nn_function(trainDataCoords, data, 4)
  nn_5 = nn_function(trainDataCoords, data, 5)
  trainData[paste(type, "_nn1", sep = "")] = nn_1
  trainData[paste(type, "_nn2", sep = "")] = nn_2
  trainData[paste(type, "_nn3", sep = "")] = nn_3
  trainData[paste(type, "_nn4", sep = "")] = nn_4
  trainData[paste(type, "_nn5", sep = "")] = nn_5
  return(trainData)
}

# Import data(DO NOT FORGET TO ADD DATA HERE!!!!!!!,这儿做各种spatial操作的data，记得替换！！！！！)
trainData <- location
```

```{r}
# Select the data from OSM using the functions above.
#-----Amenity: parks-----

# Get park data, same below
parks = get_osm_1(q0, "leisure", "park", "polygons", "ESRI:102243")

## How many parks within 1000 m buffer of each home, same below
trainData = buffer_count(trainData, "parkCount", 1000, parks)
## Aggregate park area of all intersected park, same below
trainData = buffer_area(trainData, "parkArea", 1000, parks)

#-----Amenity: restaurants-----

restaurants = get_osm_1(q0, "amenity", "restaurant", "points", crs)
trainData = buffer_count(trainData, "restaurantsCount", 1000, restaurants)

# Average distance to 1~5 nearest restaurant(s), same below
trainData = knear(trainData, restaurants, "restaurants", "points")

#-----Amenity: schools-----

schools = get_osm_1(q0, 'amenity', 'school', "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "schoolCount", 1000, schools)
trainData = knear(trainData, schools, "schools", "polygons")

# Assign each home to its nearest school
schoolDistrict = voronoi(st_coordinates(schools %>% st_centroid())) %>% 
  st_as_sf() %>% 
  st_set_crs("ESRI:102243") %>%
  rename(schoolNo = id) %>%
  mutate(schoolNo = as.character(schoolNo))
trainData = st_join(trainData, schoolDistrict)

#-----Amenity: universities-----

# Get university data
universities = get_osm_1(q0, 'amenity', 'university', "polygons", "ESRI:102243")

# Whether university is within 1000 m buffer of each home
trainData = buffer_count(trainData, "universitiesCount", 1000, universities)

#-----Amenity: parking-----

parking = get_osm_1(q0, "amenity", "parking", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "parkingCount", 1000, parking)
trainData = buffer_area(trainData, "parkingArea", 1000, parking)
trainData = knear(trainData, parking, "parking", "polygons")

#-----Amenity: clinics-----

clinics = get_osm_1(q0, "amenity", "clinic", "points", "ESRI:102243")
trainData = knear(trainData, clinics, "clinics", "points")

#-----Amenity: hospitals-----

hospitals = get_osm_1(q0, "amenity", "hospital", "polygons", "ESRI:102243")
trainData = knear(trainData, hospitals, "hospitals", "polygons")

#-----Amenity: cinemas-----
cinemas = get_osm_1(q0, "amenity", "cinema", "points", "ESRI:102243")
trainData = buffer_count(trainData, "cinemasCount", 1000, cinemas)
trainData = buffer_count(trainData, "cinemasCount2", 2000, cinemas)

#-----Amenity: stadiums-----
stadiums = get_osm_1(q0, "building", "stadium", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "stadiumsCount", 1000, stadiums)

#-----Amenity: commerce-----
commercial = get_osm_1(q0, "building", "commercial", "polygons", "ESRI:102243")
trainData = knear(trainData, commercial, "commerce", "polygons")
trainData = buffer_count(trainData, "commerceCount", 1000, commercial)

#-----Amenity: retail-----
retail = get_osm_1(q0, "building", "retail", "polygons", "ESRI:102243")
trainData = buffer_count(trainData, "retailCount", 1000, retail)
trainData = knear(trainData, retail, "retail", "polygons")

#-----Amenity: common leisure space-----
commonleisure = get_osm_1(q0, "leisure", "common", "polygons", "ESRI:102243")
trainData = knear(trainData, commonleisure, "commonLeisure", "polygons")

#-----Amenity: fitness centers-----
fitness_centre = get_osm_1(q0, "leisure", "fitness_centre", "points", "ESRI:102243")
trainData = buffer_count(trainData, "fitnessCenterCount", 1000, fitness_centre)
trainData = knear(trainData, fitness_centre, "fitnessCenter", "pooints")

#-----Amenity: public gardens-----
garden = get_osm_1(q0, "leisure", "garden", "polygons", "ESRI:102243")
trainData = knear(trainData, garden, "gardens", "polygons")
trainData = buffer_count(trainData, "gardensCount", 1000, garden)

#-----Jobs: companies-----
companies = get_osm_1(q0, "office", "company", "points", "ESRI:102243")
trainData = knear(trainData, companies, "companies", "points")

#-----Transport: public transport-----
public_transport = get_osm_1(q0, "public_transport", "station", "points", "ESRI:102243")
trainData <-
  trainData %>% 
  mutate(
    public_transport_nn1 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 1),
    public_transport_nn2 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 2))

```

```{r}
#-----Join trainData to census data-----

trainData1 <-st_join(trainData,dplyr::select(CensusData,-Census_blockgroupareasqm,-year,-geometry,-Census_TotalPop))
trainData1 <-st_join(trainData1,dplyr::select(CensusData2, GEOID, -geometry))

```

```{r}
# Possible Distribution Transformation
trainDataNumeric2 = trainData1[,numericColumns] %>% 
  st_drop_geometry()
i=1
par(mfrow=c(65,1))
for (column in trainDataNumeric2) {
  names = names(trainDataNumeric2)
  name = names[i:i]
  par(mfrow=c(1,2));hist(as.numeric(column),breaks=80,main=c(name,"Before"));hist(log(1+as.numeric(column)),breaks=80,main=c(name,"After"))
  i = i+1
}
dev.off()
```

```{r glimpse_dat, echo=FALSE, eval= FALSE }
dat2 <- dat %>% 
  left_join(location, by=('post_id'='post_id')) %>% 
  dplyr::select(post_id, street_block, session_start_dt, session_end_dt, meter_event_type, gross_paid_amt, on_offstreet_type, meter_type, old_rate_area, street_name,shape,geometry) %>% 
    mutate(end_interval15 = floor_date(ymd_hms(session_end_dt), unit = "15 mins"),
         start_interval15 = floor_date(ymd_hms(session_start_dt), unit = "15 mins"))
glimpse(dat2)

distribution <- dat2 %>% 
  mutate(dtime = format(as.POSIXct(start_interval15), format="%H:%M:%S")) %>% 
  group_by(dtime) %>% 
  count()

length <- dat2 %>% 
  mutate(dhour = hour(start_interval15),
         length = end_interval15 - start_interval15,
         parking_hour = as.numeric(length)/3600,
         count =1) %>% 
  filter(dhour<=17 & dhour>=9) %>% 
  dplyr::select(dhour, parking_hour, count) %>% 
  group_by(dhour,parking_hour) %>% 
  summarize(chour = sum(count))
```

```{r}
#-----Select Numeric Variables-----
numericColumns = unlist(lapply(trainData, is.numeric))
trainDataNumeric = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>% 
  st_drop_geometry() %>% 
  gather(key,value, -price)

#-----Make scatter-plots of all numeric variables----

ggplot(trainDataNumeric, aes(x = value, y = price))+
  geom_point(size=.05,alpha=0.5)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=0.5)+
  facet_wrap(~key, scales = "free",ncol = 8)+
  plotTheme()
```

```{r}
# Possible Distribution Transformation
trainDataNumeric2 = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>%
  st_drop_geometry()
i=1
par(mfrow=c(65,1))
for (column in trainDataNumeric2) {
  names = names(trainDataNumeric2)
  name = names[i:i]
  par(mfrow=c(1,2));hist(as.numeric(column),breaks=80,main=c(name,"Before"));hist(log(1+as.numeric(column)),breaks=80,main=c(name,"After"))
  i = i+1
}
dev.off()
```

## 2.5.Import time data

### Time token

image: ![](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/PPT/DEMO for 1209/6.png)

Here we will analyze our algorithm for time, time token, the above graph illustrates our calculation rules, for a parking record on a parking meter, the time it occupies is rounded to 15min time period, for the same time period, different parking spaces under the same Geo_id, the above rounded time token is added up, we get the total time token for the first time period.

```{r occupancy, echo=FALSE}
## This part variable has been saved locally
dat0912 <- dat2 %>%
  mutate(st_hour = hour(start_interval15),
         end_hour = hour(end_interval15)) %>% 
  filter(st_hour>=9 & end_hour<12) %>% 
  mutate(length = end_interval15 - start_interval15,
         tokens = as.numeric(length )/ 900,
         tokens_00 = ifelse(tokens >= 1, 1, 0),
         tokens_01 = ifelse(tokens >= 2, 1, 0),
         tokens_02 = ifelse(tokens >= 3, 1, 0),
         tokens_03 = ifelse(tokens >= 4, 1, 0),
         tokens_04 = ifelse(tokens >= 5, 1, 0),
         tokens_05 = ifelse(tokens >= 6, 1, 0),
         tokens_06 = ifelse(tokens >= 7, 1, 0),
         tokens_07 = ifelse(tokens >= 8, 1, 0),
         tokens_08 = ifelse(tokens >= 9, 1, 0),
         tokens_09 = ifelse(tokens >= 10, 1, 0),
         tokens_10 = ifelse(tokens >= 11, 1, 0),
         tokens_11 = ifelse(tokens >= 12, 1, 0),
         tokens_12 = ifelse(tokens >= 13, 1, 0),
         tokens_13 = ifelse(tokens >= 14, 1, 0),
         tokens_14 = ifelse(tokens >= 15, 1, 0),
         tokens_15 = ifelse(tokens >= 16, 1, 0),
         tokens_16 = ifelse(tokens >= 17, 1, 0),
         tokens_17 = ifelse(tokens >= 18, 1, 0),
         tokens_18 = ifelse(tokens >= 19, 1, 0),
         tokens_19 = ifelse(tokens >= 20, 1, 0),
         tokens_20 = ifelse(tokens >= 21, 1, 0),
         tokens_21 = ifelse(tokens >= 22, 1, 0),
         tokens_22 = ifelse(tokens >= 23, 1, 0),
         tokens_23 = ifelse(tokens >= 24, 1, 0),
         tokens_24 = ifelse(tokens >= 25, 1, 0),
         tokens_25 = ifelse(tokens >= 26, 1, 0),
         tokens_26 = ifelse(tokens >= 27, 1, 0),
         tokens_27 = ifelse(tokens >= 28, 1, 0),
         tokens_28 = ifelse(tokens >= 29, 1, 0),
         tokens_29 = ifelse(tokens >= 30, 1, 0),
         tokens_30 = ifelse(tokens >= 31, 1, 0),
         tokens_31 = ifelse(tokens >= 32, 1, 0),        
         tokens_32 = ifelse(tokens >= 33, 1, 0),
         tokens_33 = ifelse(tokens >= 34, 1, 0),
         tokens_34 = ifelse(tokens >= 35, 1, 0),
         tokens_35 = ifelse(tokens >= 36, 1, 0)) %>%
  group_by(start_interval15, post_id) %>%
  summarize(tokens_00 = sum(tokens_00),
            tokens_01 = sum(tokens_01),
            tokens_02 = sum(tokens_02),
            tokens_03 = sum(tokens_03),
            tokens_04 = sum(tokens_04),
            tokens_05 = sum(tokens_05),
            tokens_06 = sum(tokens_06),
            tokens_07 = sum(tokens_07),
            tokens_08 = sum(tokens_08),
            tokens_09 = sum(tokens_09),
            tokens_10 = sum(tokens_10),
            tokens_11 = sum(tokens_11),
            tokens_12 = sum(tokens_12),
            tokens_13 = sum(tokens_13),
            tokens_14 = sum(tokens_14),
            tokens_15 = sum(tokens_15),
            tokens_16 = sum(tokens_16),
            tokens_17 = sum(tokens_17),
            tokens_18 = sum(tokens_18),
            tokens_19 = sum(tokens_19),
            tokens_20 = sum(tokens_20),
            tokens_21 = sum(tokens_21),
            tokens_22 = sum(tokens_22),
            tokens_23 = sum(tokens_23),
            tokens_24 = sum(tokens_24),
            tokens_25 = sum(tokens_25),
            tokens_26 = sum(tokens_26),
            tokens_27 = sum(tokens_27),
            tokens_28 = sum(tokens_28),
            tokens_29 = sum(tokens_29),
            tokens_30 = sum(tokens_30),
            tokens_31 = sum(tokens_31),
            tokens_32 = sum(tokens_32),
            tokens_33 = sum(tokens_33),
            tokens_34 = sum(tokens_34),
            tokens_35 = sum(tokens_35))

study.panel0912 <- 
  expand.grid(start_interval15=unique(dat2$start_interval15), 
              post_id = unique(dat2$post_id)) %>%
  mutate(doty = yday(start_interval15)) %>%
  left_join(., dat0912)

transaction_panel1_1 <- study.panel0912 %>%
  replace(is.na(.), 0) %>%
  arrange(start_interval15) %>%
  group_by(post_id, doty) %>%
  mutate(lag01 = ifelse(is.na(lag(tokens_01)) == FALSE, lag(tokens_01), 0),
         lag02 = ifelse(is.na(lag(tokens_02,2)) == FALSE, lag(tokens_02,2), 0),
         lag03 = ifelse(is.na(lag(tokens_03,3)) == FALSE, lag(tokens_03,3), 0),
         lag04 = ifelse(is.na(lag(tokens_04,4)) == FALSE, lag(tokens_04,4), 0),
         lag05 = ifelse(is.na(lag(tokens_05,5)) == FALSE, lag(tokens_05,5), 0),
         lag06 = ifelse(is.na(lag(tokens_06,6)) == FALSE, lag(tokens_06,6), 0),
         lag07 = ifelse(is.na(lag(tokens_07,7)) == FALSE, lag(tokens_07,7), 0),
         lag08 = ifelse(is.na(lag(tokens_08,8)) == FALSE, lag(tokens_08,8), 0),
         lag09 = ifelse(is.na(lag(tokens_08,9)) == FALSE, lag(tokens_09,9), 0),
         lag10 = ifelse(is.na(lag(tokens_10,10)) == FALSE, lag(tokens_10,10), 0),
         lag11 = ifelse(is.na(lag(tokens_11,11)) == FALSE, lag(tokens_11,11), 0),
         lag12 = ifelse(is.na(lag(tokens_12,12)) == FALSE, lag(tokens_12,12), 0),
         lag13 = ifelse(is.na(lag(tokens_13,13)) == FALSE, lag(tokens_13,13), 0),
         lag14 = ifelse(is.na(lag(tokens_14,14)) == FALSE, lag(tokens_14,14), 0),
         lag15 = ifelse(is.na(lag(tokens_15,15)) == FALSE, lag(tokens_15,15), 0),
         lag16 = ifelse(is.na(lag(tokens_16,16)) == FALSE, lag(tokens_16,16), 0),
         lag17 = ifelse(is.na(lag(tokens_17,17)) == FALSE, lag(tokens_17,17), 0),
         lag18 = ifelse(is.na(lag(tokens_18,18)) == FALSE, lag(tokens_18,18), 0),
         lag19 = ifelse(is.na(lag(tokens_19,19)) == FALSE, lag(tokens_19,19), 0),
         lag20 = ifelse(is.na(lag(tokens_20,20)) == FALSE, lag(tokens_20,20), 0),
         lag21 = ifelse(is.na(lag(tokens_21,21)) == FALSE, lag(tokens_21,21), 0),
         lag22 = ifelse(is.na(lag(tokens_22,22)) == FALSE, lag(tokens_22,22), 0),
         lag23 = ifelse(is.na(lag(tokens_23,23)) == FALSE, lag(tokens_23,23), 0),
         lag24 = ifelse(is.na(lag(tokens_24,34)) == FALSE, lag(tokens_24,24), 0),
         lag25 = ifelse(is.na(lag(tokens_25,25)) == FALSE, lag(tokens_25,25), 0),
         lag26 = ifelse(is.na(lag(tokens_26,26)) == FALSE, lag(tokens_26,26), 0),
         lag27 = ifelse(is.na(lag(tokens_27,27)) == FALSE, lag(tokens_27,27), 0),
         lag28 = ifelse(is.na(lag(tokens_28,28)) == FALSE, lag(tokens_28,28), 0),
         lag29 = ifelse(is.na(lag(tokens_29,29)) == FALSE, lag(tokens_29,29), 0),
         lag30 = ifelse(is.na(lag(tokens_30,30)) == FALSE, lag(tokens_30,30), 0),
         lag31 = ifelse(is.na(lag(tokens_31,31)) == FALSE, lag(tokens_31,31), 0),
         lag32 = ifelse(is.na(lag(tokens_32,32)) == FALSE, lag(tokens_32,32), 0),
         lag33 = ifelse(is.na(lag(tokens_33,33)) == FALSE, lag(tokens_33,33), 0),
         lag34 = ifelse(is.na(lag(tokens_34,34)) == FALSE, lag(tokens_34,34), 0),
         lag35 = ifelse(is.na(lag(tokens_35,35)) == FALSE, lag(tokens_35,35), 0)) %>%
  mutate(occupancy = tokens_00 + lag01 + lag02 + lag03+ lag04 + lag05 + 
           lag06 + lag07 + lag08+ lag09 + lag10 + lag11+ lag12+ 
           lag13+ lag16+ lag19+ lag21+ lag24+ lag26+ lag28+ lag30+ lag32+ 
           lag14+ lag17+ lag20+ lag22+ lag25+ lag27+ lag29+ lag31+ lag33+ 
           lag15+ lag18+ lag21+ lag23+ lag34+ lag35) %>%
  filter(is.na(occupancy) == FALSE) %>%
  select(start_interval15, post_id, occupancy)


transaction0912 <- rbind(transaction_panel1, transaction_panel1_1)
	##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/.RData')

##period2

dat1215 <- dat2 %>%
  mutate(st_hour = hour(start_interval15),
         end_hour = hour(end_interval15)) %>% 
  filter(st_hour>=12 & end_hour<15) %>% 
  mutate(length = end_interval15 - start_interval15,
         tokens = as.numeric(length )/ 900,
         tokens_00 = ifelse(tokens >= 1, 1, 0),
         tokens_01 = ifelse(tokens >= 2, 1, 0),
         tokens_02 = ifelse(tokens >= 3, 1, 0),
         tokens_03 = ifelse(tokens >= 4, 1, 0),
         tokens_04 = ifelse(tokens >= 5, 1, 0),
         tokens_05 = ifelse(tokens >= 6, 1, 0),
         tokens_06 = ifelse(tokens >= 7, 1, 0),
         tokens_07 = ifelse(tokens >= 8, 1, 0),
         tokens_08 = ifelse(tokens >= 9, 1, 0),
         tokens_09 = ifelse(tokens >= 10, 1, 0),
         tokens_10 = ifelse(tokens >= 11, 1, 0),
         tokens_11 = ifelse(tokens >= 12, 1, 0),
         tokens_12 = ifelse(tokens >= 13, 1, 0),
         tokens_13 = ifelse(tokens >= 14, 1, 0),
         tokens_14 = ifelse(tokens >= 15, 1, 0),
         tokens_15 = ifelse(tokens >= 16, 1, 0),
         tokens_16 = ifelse(tokens >= 17, 1, 0),
         tokens_17 = ifelse(tokens >= 18, 1, 0),
         tokens_18 = ifelse(tokens >= 19, 1, 0),
         tokens_19 = ifelse(tokens >= 20, 1, 0),
         tokens_20 = ifelse(tokens >= 21, 1, 0),
         tokens_21 = ifelse(tokens >= 22, 1, 0),
         tokens_22 = ifelse(tokens >= 23, 1, 0),
         tokens_23 = ifelse(tokens >= 24, 1, 0)) %>%
  group_by(start_interval15, post_id) %>%
  summarize(tokens_00 = sum(tokens_00),
            tokens_01 = sum(tokens_01),
            tokens_02 = sum(tokens_02),
            tokens_03 = sum(tokens_03),
            tokens_04 = sum(tokens_04),
            tokens_05 = sum(tokens_05),
            tokens_06 = sum(tokens_06),
            tokens_07 = sum(tokens_07),
            tokens_08 = sum(tokens_08),
            tokens_09 = sum(tokens_09),
            tokens_10 = sum(tokens_10),
            tokens_11 = sum(tokens_11),
            tokens_12 = sum(tokens_12),
            tokens_13 = sum(tokens_13),
            tokens_14 = sum(tokens_14),
            tokens_15 = sum(tokens_15),
            tokens_16 = sum(tokens_16),
            tokens_17 = sum(tokens_17),
            tokens_18 = sum(tokens_18),
            tokens_19 = sum(tokens_19),
            tokens_20 = sum(tokens_20),
            tokens_21 = sum(tokens_21),
            tokens_22 = sum(tokens_22),
            tokens_23 = sum(tokens_23))

study.panel1215 <- 
  expand.grid(start_interval15=unique(dat1215$start_interval15), 
              post_id = unique(dat1215$post_id)) %>%
  mutate(st_hour = hour(start_interval15)) %>% 
  filter(st_hour>=12) %>% 
  mutate(doty = yday(start_interval15)) %>%
  left_join(., dat1215)

transaction_panel2 <- study.panel1215 %>%
  replace(is.na(.), 0) %>%
  arrange(start_interval15) %>%
  group_by(post_id, doty) %>%
  mutate(lag01 = ifelse(is.na(lag(tokens_01)) == FALSE, lag(tokens_01), 0),
         lag02 = ifelse(is.na(lag(tokens_02,2)) == FALSE, lag(tokens_02,2), 0),
         lag03 = ifelse(is.na(lag(tokens_03,3)) == FALSE, lag(tokens_03,3), 0),
         lag04 = ifelse(is.na(lag(tokens_04,4)) == FALSE, lag(tokens_04,4), 0),
         lag05 = ifelse(is.na(lag(tokens_05,5)) == FALSE, lag(tokens_05,5), 0),
         lag06 = ifelse(is.na(lag(tokens_06,6)) == FALSE, lag(tokens_06,6), 0),
         lag07 = ifelse(is.na(lag(tokens_07,7)) == FALSE, lag(tokens_07,7), 0),
         lag08 = ifelse(is.na(lag(tokens_08,8)) == FALSE, lag(tokens_08,8), 0),
         lag09 = ifelse(is.na(lag(tokens_08,9)) == FALSE, lag(tokens_09,9), 0),
         lag10 = ifelse(is.na(lag(tokens_10,10)) == FALSE, lag(tokens_10,10), 0),
         lag11 = ifelse(is.na(lag(tokens_11,11)) == FALSE, lag(tokens_11,11), 0),
         lag12 = ifelse(is.na(lag(tokens_12,12)) == FALSE, lag(tokens_12,12), 0),
         lag13 = ifelse(is.na(lag(tokens_13,13)) == FALSE, lag(tokens_13,13), 0),
         lag14 = ifelse(is.na(lag(tokens_14,14)) == FALSE, lag(tokens_14,14), 0),
         lag15 = ifelse(is.na(lag(tokens_15,15)) == FALSE, lag(tokens_15,15), 0),
         lag16 = ifelse(is.na(lag(tokens_16,16)) == FALSE, lag(tokens_16,16), 0),
         lag17 = ifelse(is.na(lag(tokens_17,17)) == FALSE, lag(tokens_17,17), 0),
         lag18 = ifelse(is.na(lag(tokens_18,18)) == FALSE, lag(tokens_18,18), 0),
         lag19 = ifelse(is.na(lag(tokens_19,19)) == FALSE, lag(tokens_19,19), 0),
         lag20 = ifelse(is.na(lag(tokens_20,20)) == FALSE, lag(tokens_20,20), 0),
         lag21 = ifelse(is.na(lag(tokens_21,21)) == FALSE, lag(tokens_21,21), 0),
         lag22 = ifelse(is.na(lag(tokens_22,22)) == FALSE, lag(tokens_22,22), 0),
         lag23 = ifelse(is.na(lag(tokens_23,23)) == FALSE, lag(tokens_23,23), 0)) %>%
  mutate(occupancy = tokens_00 + lag01 + lag02 + lag03+ lag04 + lag05 + 
           lag06 + lag07 + lag08+ lag09 + lag10 + lag11+ lag12+ 
           lag13+ lag16+ lag19+ lag21+  
           lag14+ lag17+ lag20+ lag22+  
           lag15+ lag18+ lag21+ lag23) %>%
  filter(is.na(occupancy) == FALSE) %>%
  select(start_interval15, post_id, occupancy)

	##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/.RData')

##period3

dat1518 <- dat2 %>%
  mutate(st_hour = hour(start_interval15),
         end_hour = hour(end_interval15)) %>% 
  filter(st_hour>=15 & end_hour<18) %>% 
  mutate(length = end_interval15 - start_interval15,
         tokens = as.numeric(length )/ 900,
         tokens_00 = ifelse(tokens >= 1, 1, 0),
         tokens_01 = ifelse(tokens >= 2, 1, 0),
         tokens_02 = ifelse(tokens >= 3, 1, 0),
         tokens_03 = ifelse(tokens >= 4, 1, 0),
         tokens_04 = ifelse(tokens >= 5, 1, 0),
         tokens_05 = ifelse(tokens >= 6, 1, 0),
         tokens_06 = ifelse(tokens >= 7, 1, 0),
         tokens_07 = ifelse(tokens >= 8, 1, 0),
         tokens_08 = ifelse(tokens >= 9, 1, 0),
         tokens_09 = ifelse(tokens >= 10, 1, 0),
         tokens_10 = ifelse(tokens >= 11, 1, 0),
         tokens_11 = ifelse(tokens >= 12, 1, 0)) %>%
  group_by(start_interval15, post_id) %>%
  summarize(tokens_00 = sum(tokens_00),
            tokens_01 = sum(tokens_01),
            tokens_02 = sum(tokens_02),
            tokens_03 = sum(tokens_03),
            tokens_04 = sum(tokens_04),
            tokens_05 = sum(tokens_05),
            tokens_06 = sum(tokens_06),
            tokens_07 = sum(tokens_07),
            tokens_08 = sum(tokens_08),
            tokens_09 = sum(tokens_09),
            tokens_10 = sum(tokens_10),
            tokens_11 = sum(tokens_11))

study.panel1518 <- 
  expand.grid(start_interval15=unique(dat1518$start_interval15), 
              post_id = unique(dat1518$post_id)) %>%
  mutate(st_hour = hour(start_interval15)) %>% 
  filter(st_hour>=15) %>% 
  mutate(doty = yday(start_interval15)) %>%
  left_join(., dat1518)

transaction_panel3 <- study.panel1518 %>%
  replace(is.na(.), 0) %>%
  arrange(start_interval15) %>%
  group_by(post_id, doty) %>%
  mutate(lag01 = ifelse(is.na(lag(tokens_01)) == FALSE, lag(tokens_01), 0),
         lag02 = ifelse(is.na(lag(tokens_02,2)) == FALSE, lag(tokens_02,2), 0),
         lag03 = ifelse(is.na(lag(tokens_03,3)) == FALSE, lag(tokens_03,3), 0),
         lag04 = ifelse(is.na(lag(tokens_04,4)) == FALSE, lag(tokens_04,4), 0),
         lag05 = ifelse(is.na(lag(tokens_05,5)) == FALSE, lag(tokens_05,5), 0),
         lag06 = ifelse(is.na(lag(tokens_06,6)) == FALSE, lag(tokens_06,6), 0),
         lag07 = ifelse(is.na(lag(tokens_07,7)) == FALSE, lag(tokens_07,7), 0),
         lag08 = ifelse(is.na(lag(tokens_08,8)) == FALSE, lag(tokens_08,8), 0),
         lag09 = ifelse(is.na(lag(tokens_08,9)) == FALSE, lag(tokens_09,9), 0),
         lag10 = ifelse(is.na(lag(tokens_10,10)) == FALSE, lag(tokens_10,10), 0),
         lag11 = ifelse(is.na(lag(tokens_11,11)) == FALSE, lag(tokens_11,11), 0)) %>%
  mutate(occupancy = tokens_00 + lag01 + lag02 + lag03+ lag04 + lag05 + 
           lag06 + lag07 + lag08+ lag09 + lag10 + lag11) %>%
 filter(is.na(occupancy) == FALSE) %>%
  select(start_interval15, post_id, occupancy)

	##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/.RData')

transaction_panel1518 <- transaction_panel3 %>% 
  left_join(dat %>% select(post_id, street_block) %>% 
  unique())


transaction_panel <- transaction_panel3 %>% 
  full_join(transaction0912, by = c("start_interval15", "post_id"))%>% 
  full_join(transaction_panel2, by = c("start_interval15", "post_id"))

transaction_panel[is.na(transaction_panel)] <- 0

panel_join <- transaction_panel %>% 
  mutate(occupancy = occupancy.x+occupancy.y+occupancy,
         doty = doty.x+doty.y+doty) %>% 
  select(post_id, start_interval15, occupancy)%>% 
  left_join(dat2 %>% select(post_id, street_block) %>% 
  unique())
	##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/.RData')

```

```{r}
add_rate <- parking_rate %>% 
  group_by(street_block) %>% 
  summarize(rate= mean(rate))

panel_st_block <- panel_join %>%
  filter(occupancy<=1) %>% 
  mutate(count=1) %>% 
  group_by(street_block,start_interval15) %>% 
  summarize(occ_space = sum(occupancy),
            all_space = sum(count)) %>% 
  mutate(occupancy = occ_space/all_space) %>% 
  dplyr::select(-occ_space, -all_space) %>% 
  left_join(add_rate)


predictors_test <- panel_join %>% 
              count(post_id)

predictors <- trainData1 %>% 
  left_join(panel_join %>% 
              group_by(post_id, street_block)) %>%
  na.omit()

predictors <- predictors %>% 
  group_by(street_block) %>% 
  summarize(GEOID=first(GEOID.x),
            parkCount=mean(parkCount),
            parkArea=mean(parkArea),
            restaurantsCount=mean(restaurantsCount),
            restaurants_nn4=mean(restaurants_nn4),
            schoolCount=mean(schoolCount),
            schools_nn4=mean(schools_nn4),
            universitiesCount=mean(universitiesCount),
            parkingCount=mean(parkingCount),
            parkingArea=mean(parkingArea),
            parking_nn4=mean(parking_nn4),
            clinics_nn4=mean(clinics_nn4),
            hospitals_nn4=mean(hospitals_nn4),
            cinemasCount=mean(cinemasCount),
            stadiumsCount=mean(stadiumsCount),
            commerce_nn4=mean(commerce_nn4),
            commerceCount=mean(commerceCount),
            retailCount=mean(retailCount),
            retail_nn4=mean(retail_nn4),
            commonLeisure_nn4=mean(commonLeisure_nn4),
            fitnessCenterCount=mean(fitnessCenterCount),
            fitnessCenter_nn4=mean(fitnessCenter_nn4),
            gardens_nn4=mean(gardens_nn4),
            gardensCount=mean(gardensCount),
            companies_nn4=mean(companies_nn4),
            public_transport_nn1=mean(public_transport_nn1),
            public_transport_nn2=mean(public_transport_nn2),
            Census_MedHHInc=mean(Census_MedHHInc),
            Census_MedRent=mean(Census_MedRent),
            Census_pctWhite=mean(Census_pctWhite),
            Census_areaperpeople=mean(Census_areaperpeople),
            Census_pctAfricanAmericans=mean(Census_pctAfricanAmericans),
            Census_pctAsians=mean(Census_pctAsians))

train_data_1 <- panel_st_block %>% 
  left_join(predictors, by="street_block")

```

### Space Lag & Time Lag

Here, we merge all the time-space predictors as the basis for our subsequent models.

```{r}

space_lag <- train_data_1 %>% 
  dplyr::select(GEOID,start_interval15,occupancy) %>% 
  group_by(GEOID,start_interval15) %>% 
  summarize(lag_occ = mean(occupancy))

train_data_2 <- train_data_1 %>% 
  left_join(space_lag) 
train_data_3 <- train_data_2 %>% 
  mutate(hod = hour(start_interval15)) %>% 
  filter(hod>=9 & hod<18)

time_lag <- train_data_3 %>% 
  dplyr::select(street_block,start_interval15,occupancy,hod) %>% 
  arrange(start_interval15) %>% 
  mutate(lagHour = dplyr::lag(occupancy,4),
         lag2Hours = dplyr::lag(occupancy, 8),
         lag3Hours = dplyr::lag(occupancy,12),
         lag6Hours = dplyr::lag(occupancy, 24),
         lag1day = dplyr::lag(occupancy,36)) %>% 
  mutate(lagHour = replace(lagHour, hod>=9&hod<10, NA),
         lag2Hours = replace(lag2Hours, hod>=9&hod<11, NA),
         lag3Hours = replace(lag3Hours, hod>=9&hod<12, NA),
         lag6Hours = replace(lag6Hours, hod>=9&hod<15, NA),
         lag1day = replace(lag1day, is.na(lag1day), occupancy)) %>% 
  replace(is.na(.), 0) 

train_data_4 <- train_data_3 %>% 
  left_join(time_lag)

```

---

# 3. Data Exploratory

In this section we will analyze the data obtained in the previous section.

## 3.1. Parking Time Distribution

The first is the distribution of parking hours. In the SFMTA's parking meter charging policy, only the hours from Monday to Saturday, 9am to 6pm, are charged. Therefore, outside of these hours, the number of parking meters recorded by the parking meter is greatly reduced.
The most frequent parking is between 9am and 6pm, which is the main paid parking period, and the number of parking decreases as time goes by. This is in line with our perception of travel to work and parking situation.


```{r, fig.height=30, fig.width=12}
## Distribution
distribution <- dat2 %>% 
  mutate(dtime = format(as.POSIXct(start_interval15), format="%H")) %>% 
  dplyr::select(dtime) %>%
  mutate(dtime=as.numeric(distribution$dtime)) %>% 
  filter(dtime>=5 & dtime<22)
```
```{r}
barplot(table(distribution$dtime),
main = "Parking Time Distribution",
xlab = "Hour of a Day",
ylab = "Frequency")
```

![image1](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/Parking Time Distribution .png)

## 3.2.Parking Price Rate

Whether the cost of parking is related to the occupancy rate of parking is also a part we would like to know. The time period is divided into 9:00-10:00, 12:00-13:00, 15:00-16:00, 17:00-18:00 quadrants, and the pearson correlation is shown with the parking rate as the horizontal coordinate and the parking occupancy rate as the vertical coordinate.
From the graph, we can see that most of the parking meters have a parking rate between 0.4 and 0.5. As the parking rate increases, the change of occupancy rate is quite limited.

```{r, fig.height=20, fig.width=8}
## Parking price rate
price_rate <- train_data_4 %>% 
  filter(hod==9 | hod==12 | hod==15 | hod==17) %>% 
  select(occupancy, rate, hod)

image2 <- ggplot(price_rate)+             ##图表2 相关性散点图
  geom_point(aes(x = rate, 
                 y = occupancy))+ ###散点数据
  geom_smooth(aes(x = rate, 
                  y = occupancy), 
              method = "lm", se = FALSE)+ ###拟合直线
  facet_wrap(~hod, scales = "free",ncol=2)+
  labs(
    title = "Parking Price by Occupancy",
    subtitle = "On 9:00-10:00, 12:00-13:00, 15:00-16:00, 17:00-18:00 ",
    x="rate", 
    y="occupancy")

ggsave("Parking Price by Occupancy.jpg",width = 10,height = 20, image2)
```

![image2](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/Parking Price by Occupancy.jpg)

## 3.3. Occupancy by Day of Week

Which days of the week have higher occupancy rates is also one of our concerns. Because the SFMTA's parking meter is free on weekdays, the fact that a lower occupancy rate is recorded on weekdays does not mean that the parking occupancy rate is actually lower. For the rest of the time period, the daily parking occupancy rate is guaranteed to be around 0.5.

```{r}
## Occupancy by day of week
day_of_week <- train_data_4 %>% 
  select(occupancy,start_interval15) %>%
  mutate(dotw = wday(start_interval15, label=TRUE)) %>% 
  group_by(dotw) %>% 
  summarize(occupancy = mean(occupancy))
  
image3 <- ggplot(day_of_week, aes(x = dotw, y = occupancy)) + 
  geom_line(aes(group=1)) +
  labs(title="Parking Occupancy by day of week",
       x="Day of week", 
       y="Average Occupancy")+
     plotTheme
ggsave("Parking Price by day of week.png", image3)
```

![image3](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/Parking Price by day of week.png)

## 3.4. Parking Occupancy by day of week
Finally, the parking occupancy at different times of the day for different Geo_ids. The horizontal coordinate starts at 9 a.m. and ends at 6 p.m. For most Geo_ids, the parking occupancy always stays within the same interval.

```{r, fig.height=30, fig.width=10}
## Rate by time
Rate_time <- train_data_4 %>% 
  mutate(dtime = format(as.POSIXct(start_interval15), format="%H")) %>% 
  group_by(dtime, GEOID) %>% 
  summarize(rate = mean(rate))

image4 <- ggplot(Rate_time, aes(x = dtime, y = rate)) + 
  geom_line(aes(group=1)) +
  facet_wrap(~GEOID, scales = "free",ncol=5)+
  labs(title="Parking Occupancy by day of week",
       x="Day of week", 
       y="Average Occupancy")+
     plotTheme

ggsave("Rate by time.jpg",width = 10, height = 40,image4)
```

![image4](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/Rate by time.jpg)

---

# 4. Regression Model

In this section, we analyze the parking occupancy in combination with time lag and spatial lag.

Our predictor includes the spatial elements (restaurantCount, restaurant_nn4, schoolCount, schools_nn4), time lag (lagHour, lag2Hours, lag3Hours, lag6Hours), and Censusdata (MedHHInc, etc.)
```{r}

model_data_offcial <- train_data_4 %>% 
  mutate(dotw = wday(start_interval15, label=TRUE))


set.seed(3456)
trainIndex <- createDataPartition(y=paste(model_data_offcial$street_block,model_data_offcial$GEOID, model_data_offcial$dotw), p = .70,
                                  list = FALSE,
                                  times = 1)

parking.Train <- model_data_offcial[ trainIndex,]
parking.Test <- model_data_offcial[ -trainIndex,]

reg.training <- 
  lm(occupancy ~  rate + parkCount + parkArea + restaurantsCount + restaurants_nn4 + schoolCount + schools_nn4 + universitiesCount + parkingCount + parkingArea + parking_nn4 + clinics_nn4 + hospitals_nn4 + cinemasCount + stadiumsCount + commerce_nn4 + commerceCount + retailCount + retail_nn4 + commonLeisure_nn4 + fitnessCenterCount + fitnessCenter_nn4 + gardens_nn4 + gardensCount + companies_nn4 + public_transport_nn1 + public_transport_nn1 + Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_areaperpeople + Census_pctAfricanAmericans + Census_pctAsians + lag_occ + hod + dotw + lagHour + lag2Hours + lag3Hours + lag6Hours + lag1day,  data=parking.Train)

# 不用展示R方
#    stargazer(reg.training, 
#              type = 'text', 
#              title = "OLS Regression",
#              covariate.labels = c())
```


## 4.1. Goodness of fit

For the results of the model, the MAE of the model = 0.0395,MAPE of the model = 0.824

```{r}
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

parking.Test$occupancy.Predict = predict(reg.training, parking.Test)

parking.Test <-
  parking.Test %>%
  mutate(occupancy.Error = occupancy.Predict - occupancy,
         occupancy.AbsError = abs(occupancy.Predict - occupancy),
         occupancy.APE = (abs(occupancy.Predict - occupancy)) / occupancy.Predict)

MAE <- mean(parking.Test$occupancy.AbsError, na.rm = T)
MAPE <- mean(parking.Test$occupancy.APE, na.rm = T)
MAE = c(MAE) %>% format(., digits = 3)
MAPE = c(MAPE) %>% format(., digits = 3)
Model = c("Model Fitness")
summaryTable1 = cbind(Model, MAE, MAPE)

kable(summaryTable1, digits = 1, caption = "Table. Prediction precision of the first two models") %>%
  kable_classic(full_width = T)%>%
  footnote()
```

![image5](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/image5.png)

## 4.2. Examine Error Metrics for Accuracy
The following chart shows our predicted results compared to the actual results.The cyan line is the predicted result and the red line is the actual result. As can be seen in the same figure, the model presents a better prediction in terms of overall trend due to the inclusion of time lag. However, on the whole, the occupancy rate of the model prediction is lower compared to the actual situation. This leads to a situation where the predicted results appear in parallel to the actual results during the free hours of the day.

```{r}
parking_test_plot <- parking.Test %>% 
    mutate(interval60 = floor_date(ymd_hms(start_interval15), unit = "60 mins"))%>% 
    dplyr::select(interval60, street_block, occupancy, occupancy.Predict) %>%
    unnest()%>%
    na.omit() %>%
    gather(Variable, Value, -interval60, -street_block) %>%
    group_by(Variable, interval60) %>%
    summarize(Value = mean(Value))

image6 <- parking_test_plot %>% 
    ggplot(aes(interval60, Value, color=Variable)) + 
      geom_line(size = 1.1) + 
      labs(title = "Predicted/Observed Occupancy", subtitle = "SF; A test set of 2 weeks",  x = "Hour", y= "Occupancy") +
      plotTheme
##save.image(file='D:/UPenn/Fall2022/5080Policy/Final/Final_data_model/.RData')

ggsave("Predicted/Observed Occupancy.jpg",width = 10, height = 5,image6)
```

![image6](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/Predicted/Observed Occupancy.jpg)

---

# 5. Generalizability

## 5.1. Cross Validation 

In the cross-validation analysis, our results are as follows. Similar to the results of the test set, the MAE of the model = 0.0757, which is slightly imprecise compared to a distribution that occupies the range between 0 and 1.

```{r}
# R program to implement
# K-fold cross-validation
 
# setting seed to generate a
# reproducible random sampling
set.seed(125)
 
# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(occupancy ~ dotw+lagHour+lag2Hours+lag3Hours+lag6Hours+lag1day, data = park.cross,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)
```
![image8](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/FinalDataModelBackUp/CVResult.png)

---

# 6. Application Development

Finally we will share our user-specific interface design. Our app is intended to manage the occupancy of parking spots by helping SFMTA officials with pricing.

![image7](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/PPT/DEMO for 1209/image 10.png)

In this online interface, decision makers can view projections of future parking rates on certain streets by entering prices.
![image8](D:/Upenn/Upenn Lec/05-MUSA-508/Assign-Final/PPT/DEMO for 1209/image11.png)


