pacman::p_load(caTools, ggplot2, dplyr)


【A】Preparing the Data

Data: The quality of (medical) care for diabetic patients

D = read.csv("data/quality.csv")  # Read in dataset
summary(D)
    MemberID     InpatientDays      ERVisits     OfficeVisits    Narcotics    
 Min.   :  1.0   Min.   : 0.00   Min.   : 0.0   Min.   : 0.0   Min.   : 0.00  
 1st Qu.: 33.5   1st Qu.: 0.00   1st Qu.: 0.0   1st Qu.: 7.0   1st Qu.: 0.00  
 Median : 66.0   Median : 0.00   Median : 1.0   Median :12.0   Median : 1.00  
 Mean   : 66.0   Mean   : 2.72   Mean   : 1.5   Mean   :13.2   Mean   : 4.57  
 3rd Qu.: 98.5   3rd Qu.: 3.00   3rd Qu.: 2.0   3rd Qu.:18.5   3rd Qu.: 3.00  
 Max.   :131.0   Max.   :30.00   Max.   :11.0   Max.   :46.0   Max.   :59.00  
 DaysSinceLastERVisit      Pain        TotalVisits   ProviderCount
 Min.   :  6          Min.   :  0.0   Min.   : 0.0   Min.   : 5   
 1st Qu.:207          1st Qu.:  1.0   1st Qu.: 8.0   1st Qu.:15   
 Median :641          Median :  8.0   Median :15.0   Median :20   
 Mean   :481          Mean   : 15.6   Mean   :17.4   Mean   :24   
 3rd Qu.:731          3rd Qu.: 23.0   3rd Qu.:22.5   3rd Qu.:30   
 Max.   :731          Max.   :104.0   Max.   :69.0   Max.   :82   
 MedicalClaims     ClaimLines    StartedOnCombination AcuteDrugGapSmall
 Min.   : 11.0   Min.   : 20.0   Mode :logical        Min.   : 0.00    
 1st Qu.: 25.5   1st Qu.: 83.5   FALSE:125            1st Qu.: 0.00    
 Median : 37.0   Median :120.0   TRUE :6              Median : 1.00    
 Mean   : 43.2   Mean   :142.9                        Mean   : 2.69    
 3rd Qu.: 49.5   3rd Qu.:185.0                        3rd Qu.: 3.00    
 Max.   :194.0   Max.   :577.0                        Max.   :71.00    
    PoorCare    
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.252  
 3rd Qu.:0.500  
 Max.   :1.000  

Before modeling, we should examine the baseline probability P[y=1] in the data

base = table(D$PoorCare)   # counts of PoorCare = 0/1
base %>% prop.table        # baseline probability of each class

      0       1 
0.74809 0.25191 

🌻 sample.split() generates a logical splitting vector that preserves the proportion of PoorCare = 1 in both subsets

set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.7)  # split vector
table(split) %>% prop.table()
split
  FALSE    TRUE 
0.29771 0.70229 
table(D$PoorCare, split) %>% prop.table(2)
   split
      FALSE    TRUE
  0 0.74359 0.75000
  1 0.25641 0.25000

Splitting the data D into Train (TR) and Test (TS) sets

TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
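
As a quick sanity check (a minimal sketch, not part of the original output), the row counts of the two subsets should reflect the 0.7 split ratio:

nrow(TR)   # 92 rows, consistent with the 91 null-deviance df reported below
nrow(TS)   # the remaining 39 rows (~30%)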

Building and examining the model - glm1

glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)

Call:
glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
    data = TR)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.878  -0.635  -0.517  -0.224   2.146  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.5662     0.5249   -4.89 0.000001 ***
OfficeVisits   0.0814     0.0310    2.62   0.0087 ** 
Narcotics      0.0620     0.0320    1.94   0.0527 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 103.470  on 91  degrees of freedom
Residual deviance:  84.528  on 89  degrees of freedom
AIC: 90.53

Number of Fisher Scoring iterations: 4


【B】Model Coefficients and Marginal Effects

Model :

  logit(P[y=1]) = ln( P[y=1] / (1 - P[y=1]) ) = b0 + b1*OfficeVisits + b2*Narcotics

  equivalently  P[y=1] = 1 / (1 + exp(-(b0 + b1*OfficeVisits + b2*Narcotics)))

  with fitted values b0 = -2.5662, b1 = 0.0814 (OfficeVisits), b2 = 0.0620 (Narcotics).
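
As a sketch (the names b and x are just illustrative), the fitted equation can be evaluated by hand from coef(glm1) and checked against predict():

b = coef(glm1)                               # (Intercept), OfficeVisits, Narcotics
x = c(OfficeVisits = 12, Narcotics = 1)      # the medians (see the quantiles in 【C】)
eta = b[1] + b[2]*x[1] + b[3]*x[2]           # linear predictor = log-odds
1 / (1 + exp(-eta))                          # P[y=1]; should match predict(glm1, df, type="response") below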

Coefficients & Marginal Effect :

  A one-unit increase in a predictor adds its coefficient b to the log-odds, i.e. it multiplies the odds P/(1-P) by exp(b). This odds ratio exp(b) is the marginal effect on the odds scale; the corresponding increment in probability depends on the baseline probability, as illustrated below.


💡 Marginal Effects of Predictors

pop = function(p, k) { o = p/(1-p); o = k*o; o/(1+o) }    # apply odds ratio k to a baseline probability p
p0 = seq(0.1, 0.9, 0.1); k = 2                            # baseline probabilities and a fixed odds ratio
p1 = sapply(p0, pop, k)                                   # probabilities after the odds are multiplied by k
data.frame(k, p0, p1, increment = p1 - p0) %>% round(2)   # the increment varies even though k is constant
  k  p0   p1 increment
1 2 0.1 0.18      0.08
2 2 0.2 0.33      0.13
3 2 0.3 0.46      0.16
4 2 0.4 0.57      0.17
5 2 0.5 0.67      0.17
6 2 0.6 0.75      0.15
7 2 0.7 0.82      0.12
8 2 0.8 0.89      0.09
9 2 0.9 0.95      0.05
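
The same helper can be fed the odds ratios implied by glm1 (a sketch, not output from the original notes); the benchmark probability 0.178 is the median-level prediction derived in the next step:

or = exp(coef(glm1)[-1])                    # odds ratios for OfficeVisits and Narcotics
p0 = 0.178                                  # benchmark P[y=1] at the medians (computed below)
sapply(or, function(k) pop(p0, k) - p0)     # probability increment implied by each odds ratio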


Sometimes it is desirable to express the marginal effects as increments in probability. To do that we first need to fix a baseline probability. A common benchmark is to estimate P[y=1] with all predictors set to their medians. For example …

df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")
      1 
0.17831 

With both predictors set to their medians, the model predicts P[y=1] = 0.178.

exp(0.0814)
[1] 1.0848

In terms of odds ratio, the marginal effect of OfficeVisits is exp(0.0814) = 1.0848: each additional office visit multiplies the odds of poor care by 1.0848.

df = data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")
      1 
0.19054 

At the median level (P[y=1] = 0.178), a unit increment in OfficeVisits increases the predicted probability of poor care to 0.191, and …

df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics)+1)
predict(glm1, df, type="response")
      1 
0.18757 

a unit increment in Narcotics increases it to 0.188.
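
The two predict() calls above can be wrapped into a small helper (margin_at is a hypothetical name, not from the original notes) so the same exercise can be repeated at any baseline:

margin_at = function(ov, na) {                        # hypothetical helper
  grid = data.frame(
    OfficeVisits = c(ov, ov + 1, ov),
    Narcotics    = c(na, na,     na + 1))
  p = predict(glm1, grid, type = "response")
  c(baseline = p[[1]], d_OfficeVisits = p[[2]] - p[[1]], d_Narcotics = p[[3]] - p[[1]])
}
margin_at(median(D$OfficeVisits), median(D$Narcotics))   # reproduces the median-level numbers above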



【C】Practice

quantile(D$OfficeVisits)
  0%  25%  50%  75% 100% 
 0.0  7.0 12.0 18.5 46.0 
quantile(D$Narcotics)
  0%  25%  50%  75% 100% 
   0    0    1    3   59 


🗿 Quiz:
  ■ When OfficeVisits and Narcotics both equal their first quartile (25%):
    ■ What is the prob. that PoorCare = 1?
    ■ In terms of odds ratio, what are the marginal effects of the predictors?
    ■ In terms of probability increment, what are the marginal effects of the predictors?
  ■ When OfficeVisits and Narcotics both equal their third quartile (75%):
    ■ What is the prob. that PoorCare = 1?
    ■ In terms of odds ratio, what are the marginal effects of the predictors?
    ■ In terms of probability increment, what are the marginal effects of the predictors?
  ■ Comparing the above, can we tell the difference between the odds ratio and the probability increment?
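
A starting point for the quiz (a sketch only; the interpretation is left as the exercise) is to evaluate glm1 at the two quartiles and then repeat the +1 trick used above:

q = data.frame(
  OfficeVisits = quantile(D$OfficeVisits, c(0.25, 0.75)),
  Narcotics    = quantile(D$Narcotics,    c(0.25, 0.75)))
predict(glm1, q, type = "response")   # P[y=1] at the 1st and 3rd quartiles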