In the field of public policy, making residents aware of the programs they are eligible for has always been central to whether a policy actually delivers its intended benefits. In this analysis we face exactly that problem: how can we, as data scientists, help the government promote a home repair tax credit program so that more eligible residents benefit from it? To answer this question, we fit a logistic regression model to a range of resident characteristics, evaluate its predictions, and translate the results into recommendations and expected benefits.
options(scipen=10000000)
library(tidyverse)
library(kableExtra)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(scales)
library(data.table)
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")
housing <- read.csv("D:/Upenn/Upenn Lec/05-MUSA-508/Assign-05/Week_7/hw/housingSubsidy.csv")
Resident characteristics fall into two data types: continuous variables, such as a resident's age or the annual amount spent on housing maintenance, and categorical variables, such as occupation or education level. Each type calls for a different way of describing its association with the outcome. For continuous variables, we split the data by whether the resident actually took the home repair tax credit (no vs. yes) on the horizontal axis and plot the mean of each variable within the two groups as bars. The larger the gap between the two bars, the more pronounced the association tends to be. For example, the inflation rate and the previous-contacts indicator both show a strong association with taking the credit.
housing %>%
  dplyr::select(y, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
  ggplot(aes(y, value, fill = y)) +
    geom_bar(position = "dodge", stat = "summary", fun = "mean") +
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = palette2) +
    labs(x = "homeowner took the credit", y = "Value",
         title = "Feature associations with whether the homeowner took the credit",
         subtitle = "(continuous outcomes)") +
    theme(legend.position = "none")
housing %>%
  dplyr::select(y, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -y) %>%
  ggplot() +
  geom_density(aes(value, color = y), fill = "transparent") +
  facet_wrap(~Variable, scales = "free") +
  scale_color_manual(values = palette2) +  # color (not fill) is the mapped aesthetic here
  labs(title = "Feature distributions, credit vs. no credit",
       subtitle = "(continuous outcomes)")
Next are the categorical variables. Again splitting the data by whether the resident actually took the home repair tax credit (no vs. yes), we count the observations in each subcategory within the two groups and plot those counts as grouped bars, which visualizes the relationship between the outcome and each feature. The more the distribution across subcategories differs between the yes and no groups, the stronger the association; occupation and educational attainment are examples.
housing %>%
  dplyr::select(y, job, marital, education, taxLien, mortgage, taxbill_in_phl, contact, month, day_of_week, campaign, pdays) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
  ggplot(., aes(value, n, fill = y)) +
    geom_bar(position = "dodge", stat = "identity") +
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = palette2) +
    labs(x = "Credit", y = "Count",
         title = "Feature associations with the likelihood of credit",
         subtitle = "Categorical features") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
With this preliminary sense of each feature's relevance, the next step is to combine the feature data with a logistic regression model to produce our recommendations. We first split the data into a training set (65%) for fitting the model and a test set (35%) for evaluating it, then run the logistic regression on the training set.
set.seed(3456)
trainIndex <- createDataPartition(y = paste(housing$taxLien, housing$education), p = .65,
                                  list = FALSE,
                                  times = 1)
housingTrain <- housing[ trainIndex,]
housingTest  <- housing[-trainIndex,]
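As a quick sanity check (not part of the original workflow), we can confirm the split preserved the outcome's prevalence; since createDataPartition stratifies on the pasted taxLien/education combinations rather than on y itself, this is worth verifying:
# Illustrative check: outcome prevalence should be similar across the split
prop.table(table(housingTrain$y))
prop.table(table(housingTest$y))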
This first model includes every feature, and the regression output shows that many of the coefficients have p-values greater than 0.05, i.e., they are statistically insignificant.
housingModel <- glm(y_numeric ~ .,
                    data = housingTrain %>%
                      dplyr::select(-y),
                    family = binomial(link = "logit"))
summary(housingModel)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housingTrain %>% dplyr::select(-y))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1943 -0.4147 -0.3256 -0.2586 3.0926
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -278.85327760 139.96804153 -1.992 0.046342 *
## X -0.00001331 0.00006015 -0.221 0.824909
## age 0.01403305 0.00873220 1.607 0.108044
## jobblue-collar -0.32797503 0.27163327 -1.207 0.227271
## jobentrepreneur -0.41474972 0.43895099 -0.945 0.344727
## jobhousemaid 0.11787593 0.45568524 0.259 0.795883
## jobmanagement -0.52798701 0.31095067 -1.698 0.089512 .
## jobretired -0.41678128 0.38885628 -1.072 0.283804
## jobself-employed -0.85936919 0.45052464 -1.907 0.056458 .
## jobservices -0.15625210 0.28775184 -0.543 0.587123
## jobstudent -0.12256127 0.41895061 -0.293 0.769871
## jobtechnician -0.34064892 0.24896860 -1.368 0.171237
## jobunemployed 0.00698463 0.43362252 0.016 0.987149
## jobunknown -0.98454063 0.86728103 -1.135 0.256290
## maritalmarried -0.11230091 0.23346500 -0.481 0.630504
## maritalsingle 0.03104307 0.26773677 0.116 0.907695
## maritalunknown 1.60183882 1.23871333 1.293 0.195960
## educationbasic.6y 0.51798429 0.42188129 1.228 0.219523
## educationbasic.9y 0.44289711 0.33865468 1.308 0.190937
## educationhigh.school 0.24226477 0.33033899 0.733 0.463325
## educationilliterate -11.87185886 535.41147938 -0.022 0.982310
## educationprofessional.course 0.49448324 0.36306834 1.362 0.173212
## educationuniversity.degree 0.45266628 0.33045587 1.370 0.170742
## educationunknown 0.53513138 0.42052846 1.273 0.203188
## taxLienunknown -0.01495724 0.21407698 -0.070 0.944298
## taxLienyes -10.11713559 535.41149007 -0.019 0.984924
## mortgageunknown -0.31180939 0.50655531 -0.616 0.538193
## mortgageyes -0.10869431 0.14492166 -0.750 0.453242
## taxbill_in_phlyes 0.00211370 0.19493817 0.011 0.991349
## contacttelephone -1.11254297 0.31141354 -3.573 0.000354 ***
## monthaug 0.57341407 0.46743091 1.227 0.219922
## monthdec -0.25418106 0.85194737 -0.298 0.765434
## monthjul 0.20651187 0.37997601 0.543 0.586795
## monthjun -0.06145945 0.49278182 -0.125 0.900746
## monthmar 2.42531246 0.60735745 3.993 0.0000652 ***
## monthmay 0.17510806 0.32383613 0.541 0.588693
## monthnov -0.27510007 0.44063086 -0.624 0.532409
## monthoct 0.58010443 0.56615565 1.025 0.305534
## monthsep 0.32053540 0.66903917 0.479 0.631869
## day_of_weekmon -0.09048675 0.22277094 -0.406 0.684605
## day_of_weekthu 0.01702850 0.22152945 0.077 0.938729
## day_of_weektue -0.21626831 0.23145659 -0.934 0.350108
## day_of_weekwed -0.01629696 0.23302673 -0.070 0.944245
## campaign -0.06706867 0.03847471 -1.743 0.081301 .
## pdays -0.00048934 0.00080546 -0.608 0.543496
## previous 0.25188999 0.22915027 1.099 0.271666
## poutcomenonexistent 0.54030334 0.35407733 1.526 0.127023
## poutcomesuccess 1.19601053 0.80913495 1.478 0.139372
## unemploy_rate -1.38677195 0.54137687 -2.562 0.010420 *
## cons.price.idx 2.39574823 0.93242048 2.569 0.010188 *
## cons.conf.idx 0.05953345 0.02924320 2.036 0.041770 *
## inflation_rate -0.03039306 0.44901409 -0.068 0.946034
## spent_on_repairs 0.01057464 0.01109433 0.953 0.340510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.5 on 2686 degrees of freedom
## Residual deviance: 1465.4 on 2634 degrees of freedom
## AIC: 1571.4
##
## Number of Fisher Scoring iterations: 12
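Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. As an illustrative step beyond the original output, consider contacttelephone:
# Illustrative: convert a log-odds coefficient to an odds ratio
exp(coef(housingModel))["contacttelephone"]
# exp(-1.1125) ~= 0.33: telephone contact is associated with roughly
# one-third the odds of taking the credit, relative to the baseline contact type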
The model's 'Pseudo R Squared' is another measure of its quality; generally, the larger the value, the better the fit. For this model with no feature screening, McFadden's pseudo R-squared is 0.2190566.
pR2(housingModel)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -732.7089634 -938.2356658 411.0534047 0.2190566 0.1418519 0.2822386
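McFadden's pseudo R-squared is 1 minus the ratio of the fitted model's log-likelihood to the null (intercept-only) model's log-likelihood, which we can verify by hand from the pR2 output above:
# McFadden = 1 - llh / llhNull
1 - (-732.7089634) / (-938.2356658)   # ~0.2190566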
Next we evaluate the model on the test set. The distribution of test-set predictions shows that the model's Sensitivity is low: among homeowners who actually took the credit, the model correctly identifies only a small fraction. Its Specificity, by contrast, is high: among homeowners who did not take the credit, it correctly predicts nearly all of them. We therefore try to improve Sensitivity by adjusting the features.
## this model includes all the predictors
testProbs_housing <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                                Probs = predict(housingModel, housingTest, type = "response"))

ggplot(testProbs_housing, aes(x = Probs, fill = as.factor(Outcome))) +
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Credit", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs_housing <-
  testProbs_housing %>%
  mutate(predOutcome = as.factor(ifelse(testProbs_housing$Probs > 0.5, 1, 0)))

caret::confusionMatrix(testProbs_housing$predOutcome, testProbs_housing$Outcome,
                       positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1248 114
## 1 32 38
##
## Accuracy : 0.898
## 95% CI : (0.8812, 0.9132)
## No Information Rate : 0.8939
## P-Value [Acc > NIR] : 0.3217
##
## Kappa : 0.2952
##
## Mcnemar's Test P-Value : 0.00000000002033
##
## Sensitivity : 0.25000
## Specificity : 0.97500
## Pos Pred Value : 0.54286
## Neg Pred Value : 0.91630
## Prevalence : 0.10615
## Detection Rate : 0.02654
## Detection Prevalence : 0.04888
## Balanced Accuracy : 0.61250
##
## 'Positive' Class : 1
##
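Sensitivity and specificity can be read straight off the confusion matrix above; a quick manual check:
# Sensitivity = TP / (TP + FN): share of actual credit-takers the model flagged
38 / (38 + 114)     # 0.25
# Specificity = TN / (TN + FP): share of actual non-takers the model cleared
1248 / (1248 + 32)  # 0.975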
In the new model we drop the features with high p-values, i.e., the statistically insignificant ones, to simplify the model and improve its performance. The new model has Sensitivity = 0.25 and Specificity = 0.97656, a small improvement over the model before feature screening.
housingModel1 <- glm(y_numeric ~ .,
                     data = housingTrain %>%
                       dplyr::select(-y, -marital, -day_of_week, -pdays, -taxbill_in_phl,
                                     -taxLien, -mortgage, -education, -job, -inflation_rate),
                     family = binomial(link = "logit"))
summary(housingModel1)
##
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"),
## data = housingTrain %>% dplyr::select(-y, -marital, -day_of_week,
## -pdays, -taxbill_in_phl, -taxLien, -mortgage, -education,
## -job, -inflation_rate))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0547 -0.3967 -0.3288 -0.2825 2.8604
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -265.73737600 117.40272641 -2.263 0.023607 *
## X -0.00001671 0.00005916 -0.282 0.777644
## age 0.00508736 0.00607450 0.837 0.402315
## contacttelephone -1.08893335 0.30418754 -3.580 0.000344 ***
## monthaug 0.58427864 0.44801059 1.304 0.192178
## monthdec -0.18090958 0.80232253 -0.225 0.821604
## monthjul 0.16981871 0.36757728 0.462 0.644085
## monthjun -0.07942232 0.47626013 -0.167 0.867557
## monthmar 2.37364540 0.56258863 4.219 0.0000245226 ***
## monthmay 0.10506471 0.31080972 0.338 0.735336
## monthnov -0.34166596 0.35289194 -0.968 0.332950
## monthoct 0.57214291 0.48442554 1.181 0.237573
## monthsep 0.29031450 0.60925223 0.477 0.633711
## campaign -0.06688193 0.03873487 -1.727 0.084229 .
## previous 0.27327519 0.19694039 1.388 0.165258
## poutcomenonexistent 0.59630919 0.33314758 1.790 0.073466 .
## poutcomesuccess 1.66975239 0.31090779 5.371 0.0000000785 ***
## unemploy_rate -1.35290744 0.52677361 -2.568 0.010220 *
## cons.price.idx 2.31995160 0.83786861 2.769 0.005625 **
## cons.conf.idx 0.05727531 0.01986162 2.884 0.003930 **
## spent_on_repairs 0.00933384 0.00776910 1.201 0.229594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1876.5 on 2686 degrees of freedom
## Residual deviance: 1484.1 on 2666 degrees of freedom
## AIC: 1526.1
##
## Number of Fisher Scoring iterations: 6
pR2(housingModel1)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -742.0572795 -938.2356658 392.3567726 0.2090929 0.1358599 0.2703166
testProbs_housing1 <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                                 Probs = predict(housingModel1, housingTest, type = "response"))

testProbs_housing1 <-
  testProbs_housing1 %>%
  mutate(predOutcome = as.factor(ifelse(testProbs_housing1$Probs > 0.5, 1, 0)))

caret::confusionMatrix(testProbs_housing1$predOutcome, testProbs_housing1$Outcome,
                       positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1250 114
## 1 30 38
##
## Accuracy : 0.8994
## 95% CI : (0.8827, 0.9145)
## No Information Rate : 0.8939
## P-Value [Acc > NIR] : 0.2621
##
## Kappa : 0.2995
##
## Mcnemar's Test P-Value : 0.000000000004624
##
## Sensitivity : 0.25000
## Specificity : 0.97656
## Pos Pred Value : 0.55882
## Neg Pred Value : 0.91642
## Prevalence : 0.10615
## Detection Rate : 0.02654
## Detection Prevalence : 0.04749
## Balanced Accuracy : 0.61328
##
## 'Positive' Class : 1
##
Cross-validating both models lets us compare them on equal footing. The ROC curve plots model performance across every threshold, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis; the ROC metric reported below is the AUC (Area Under the Curve), a summary of predictive accuracy. Compared with the model before screening, the screened model improves on ROC and Specificity but shows little improvement in Sensitivity. (Upper chunk: CV of the model with all features; lower chunk: CV of the model with selected features.)
ctrl_housing <- trainControl(method = "cv", number = 100, classProbs = TRUE, summaryFunction = twoClassSummary)

housing_cv <- housing
# caret's twoClassSummary treats the first factor level as the event of interest,
# so relabel 'yes' to 'A_yes' so it sorts first alphabetically
housing_cv$y[which(housing_cv$y == 'yes')] <- 'A_yes'

cvFit_housing <- train(y ~ .,
                       data = housing_cv %>%
                         dplyr::select(-y_numeric),
                       method = "glm", family = "binomial",
                       metric = "ROC", trControl = ctrl_housing)
cvFit_housing_all_features <- cvFit_housing
cvFit_housing_all_features
## Generalized Linear Model
##
## 4119 samples
## 20 predictor
## 2 classes: 'A_yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4079, 4078, 4079, 4079, 4077, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7663551 0.217 0.9801652
ctrl_housing1 <- trainControl(method = "cv", number = 100, classProbs = TRUE, summaryFunction = twoClassSummary)

housing_cv1 <- housing
housing_cv1$y[which(housing_cv1$y == 'yes')] <- 'A_yes'

cvFit_housing1 <- train(y ~ .,
                        data = housing_cv1 %>%
                          dplyr::select(-y_numeric, -marital, -day_of_week, -pdays, -taxbill_in_phl,
                                        -taxLien, -mortgage, -education, -job, -inflation_rate),
                        method = "glm", family = "binomial",
                        metric = "ROC", trControl = ctrl_housing1)
cvFit_housing_selected_features <- cvFit_housing1
cvFit_housing_selected_features
## Generalized Linear Model
##
## 4119 samples
## 11 predictor
## 2 classes: 'A_yes', 'no'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4078, 4077, 4077, 4077, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7863968 0.219 0.9858408
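caret stores the per-fold metrics in each fit's resample element, so we can also look at how ROC, Sens, and Spec vary across the 100 folds; a sketch using the selected-feature fit above:
# Illustrative: across-fold distributions of the CV metrics
dplyr::select(cvFit_housing_selected_features$resample, -Resample) %>%
  gather(Metric, Value) %>%
  ggplot(aes(Value)) +
  geom_histogram(bins = 35, fill = "#FF006A") +
  facet_wrap(~Metric, scales = "free") +
  labs(title = "CV goodness of fit metrics - selected-feature model",
       subtitle = "Across-fold distributions of ROC, Sens, Spec")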
In the ROC graph, both axes run from 0 to 1. The closer the curve hugs the upper-left corner, the larger the area under the curve (AUC) and the better the model. A rough rule of thumb for relating AUC to model effectiveness:
0.90-1.00 = Excellent
0.80-0.90 = Good
0.70-0.80 = Fair
0.60-0.70 = Poor
0.50-0.60 = Fail
auc(testProbs_housing1$Outcome, testProbs_housing1$Probs)
## Area under the curve: 0.813
ggplot(testProbs_housing1, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - creditModel_SelectedFeatures")
In our cost/benefit accounting, two cells of the confusion matrix contribute nothing: the false negatives (homeowners who actually took the credit but whom the model missed) and the true negatives (homeowners who did not take the credit and whom the model correctly cleared). Both are counted as 0 in the equation. The benefits and costs come from the other two cells: the true positives (homeowners we contact who take the credit) generate revenue, while the false positives (homeowners on whom we spend outreach resources but who do not take the credit) generate losses.
With the model, the total benefit is 0 + 471,200 + 0 - 85,500 = 385,700. Without the model, we would forgo the 25% take-up among the 38 correctly identified credit-takers, a loss of (10,000 + 56,000 - 2,850 - 5,000) × 0.25 × 38 = 552,425.
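The arithmetic behind those totals, spelled out using the dollar assumptions embedded in the code below ($2,850 of outreach per contacted homeowner, a $5,000 credit, and $10,000 and $56,000 premiums to the repaired home and its neighbors, with 25% take-up among those contacted):
# Per true positive: 25% take the credit, 75% only absorb the outreach cost
per_TP <- (10000 + 56000 - 2850 - 5000) * 0.25 + 0.75 * (-2850)   # 12400
per_FP <- -2850                                                    # outreach wasted
per_TP * 38 + per_FP * 30                  # 471200 - 85500 = 385700
# Forgone benefit without the model: the 25% take-up among the 38 true positives
(10000 + 56000 - 2850 - 5000) * 0.25 * 38  # 552425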
The cost/benefit table below shows the project's returns more intuitively. The gains come mainly from the true positives and the losses from the false positives; since the count of true positives is driven by Sensitivity, improving the model's Sensitivity directly improves the project's return.
cost_benefit_table_housing1 <-
  testProbs_housing1 %>%
  count(predOutcome, Outcome) %>%
  summarize(True_Negative  = sum(n[predOutcome == 0 & Outcome == 0]),
            True_Positive  = sum(n[predOutcome == 1 & Outcome == 1]),
            False_Negative = sum(n[predOutcome == 0 & Outcome == 1]),
            False_Positive = sum(n[predOutcome == 1 & Outcome == 0])) %>%
  gather(Variable, Count) %>%
  mutate(Revenue =
           ifelse(Variable == "True_Negative", Count * 0,
           ifelse(Variable == "True_Positive", ((10000 + 56000 - 2850 - 5000) * .25 + .75 * (-2850)) * Count,
           ifelse(Variable == "False_Negative", 0 * Count,
           ifelse(Variable == "False_Positive", -2850 * Count, 0))))) %>%
  bind_cols(data.frame(Description = c(
    "We correctly predicted no credit",
    "We correctly predicted a credit",
    "We predicted no credit and customer credited",
    "We predicted a credit and customer did not get credit")))

kable(cost_benefit_table_housing1,
      caption = "Cost/Benefit Table") %>%
  kable_styling() %>%
  kable_classic(full_width = F, html_font = "Cambria")
Variable | Count | Revenue | Description |
---|---|---|---|
True_Negative | 1250 | 0 | We correctly predicted no credit |
True_Positive | 38 | 471200 | We correctly predicted a credit |
False_Negative | 114 | 0 | We predicted no credit and customer credited |
False_Positive | 30 | -85500 | We predicted a credit and customer did not get credit |
whichThreshold_housing1 <-
  iterateThresholds(
    data = testProbs_housing1, observedClass = Outcome, predictedProbs = Probs)

whichThreshold_housing1[1:5,]
## # A tibble: 5 × 10
## Count_TN Count_TP Count_FN Count_FP Rate_TP Rate_FP Rate_FN Rate_TN Accuracy
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 152 0 1279 1 0.999 0 0.000781 0.107
## 2 23 151 1 1257 0.993 0.982 0.00658 0.0180 0.122
## 3 65 150 2 1215 0.987 0.949 0.0132 0.0508 0.150
## 4 197 148 4 1083 0.974 0.846 0.0263 0.154 0.241
## 5 500 138 14 780 0.908 0.609 0.0921 0.391 0.446
## # … with 1 more variable: Threshold <dbl>
whichThreshold_housing1 <-
  whichThreshold_housing1 %>%
  dplyr::select(starts_with("Count"), Threshold) %>%
  gather(Variable, Count, -Threshold) %>%
  mutate(Revenue =
           case_when(Variable == "Count_TN" ~ Count * 0,
                     Variable == "Count_TP" ~ ((10000 + 56000 - 2850 - 5000) * .25 + .75 * (-2850)) * Count,
                     Variable == "Count_FN" ~ 0 * Count,
                     Variable == "Count_FP" ~ -2850 * Count))
The figure below traces each confusion matrix cell's revenue as the threshold varies. As explained above, FN and TN generate no revenue, so both sit at 0 throughout, while the TP and FP revenues change as the threshold increases in steps of 0.01. The figure shows that the FP losses converge to 0 faster than the TP gains as the threshold rises.
whichThreshold_housing1 %>%
  ggplot(., aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour = guide_legend(title = "Confusion Matrix"))
The following graph shows how total revenue changes as the threshold increases. The chart makes clear that total revenue first rises and then falls, with the maximum occurring at a threshold of about 0.13.
whichThreshold_revenue_housing1 <-
  whichThreshold_housing1 %>%
  group_by(Threshold) %>%
  summarize(Revenue = sum(Revenue))

ggplot(whichThreshold_revenue_housing1) +
  geom_line(aes(x = Threshold, y = Revenue), color = "red", size = 2) +
  geom_vline(xintercept = pull(arrange(whichThreshold_revenue_housing1, -Revenue)[1, 1])) +
  labs(title = "Model Revenues By Threshold For Test Sample",
       subtitle = "Vertical Line Denotes Optimal Threshold")
The last table compares revenue at the original threshold of 0.50 with the optimized threshold of 0.13: the optimized threshold roughly doubles total revenue (838,000 vs. 385,700).
outcome_housing <- whichThreshold_revenue_housing1 %>%
  filter(Threshold == 0.13 | Threshold == 0.50)

kable(outcome_housing, caption = "Revenue with different Thresholds") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Threshold | Revenue |
---|---|
0.13 | 838000 |
0.50 | 385700 |
In general, the model's value depends heavily on its Sensitivity, yet Sensitivity stays around 0.25 in every version of the model. I therefore do not recommend putting the model into production as-is. Two explanations seem plausible. First, the sample contains too few residents who actually took the credit, leaving the model too few true positives to learn from. Second, the available features may carry a large amount of uninformative data, pending further cleaning or even reselection. For better returns, the sample size should be increased, and the agency could collect more significant features to optimize the model.
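One further possibility, not tried here: the low Sensitivity partly reflects class imbalance (only about 11% of homeowners took the credit), and caret can rebalance the classes inside each training fold via the sampling argument of trainControl. A sketch of what that might look like, reusing the objects above; whether it actually raises Sensitivity on this data would need to be tested:
# Illustrative sketch: down-sample the majority class within each CV fold
ctrl_down <- trainControl(method = "cv", number = 100, classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          sampling = "down")
cvFit_down <- train(y ~ .,
                    data = housing_cv1 %>% dplyr::select(-y_numeric),
                    method = "glm", family = "binomial",
                    metric = "ROC", trControl = ctrl_down)
cvFit_down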