::p_load(caTools, ggplot2, dplyr) pacman
= read.csv("data/quality.csv") # Read in dataset
D summary(D)
MemberID InpatientDays ERVisits OfficeVisits Narcotics
Min. : 1.0 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
1st Qu.: 33.5 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 7.0 1st Qu.: 0.00
Median : 66.0 Median : 0.00 Median : 1.0 Median :12.0 Median : 1.00
Mean : 66.0 Mean : 2.72 Mean : 1.5 Mean :13.2 Mean : 4.57
3rd Qu.: 98.5 3rd Qu.: 3.00 3rd Qu.: 2.0 3rd Qu.:18.5 3rd Qu.: 3.00
Max. :131.0 Max. :30.00 Max. :11.0 Max. :46.0 Max. :59.00
DaysSinceLastERVisit Pain TotalVisits ProviderCount
Min. : 6 Min. : 0.0 Min. : 0.0 Min. : 5
1st Qu.:207 1st Qu.: 1.0 1st Qu.: 8.0 1st Qu.:15
Median :641 Median : 8.0 Median :15.0 Median :20
Mean :481 Mean : 15.6 Mean :17.4 Mean :24
3rd Qu.:731 3rd Qu.: 23.0 3rd Qu.:22.5 3rd Qu.:30
Max. :731 Max. :104.0 Max. :69.0 Max. :82
MedicalClaims ClaimLines StartedOnCombination AcuteDrugGapSmall
Min. : 11.0 Min. : 20.0 Mode :logical Min. : 0.00
1st Qu.: 25.5 1st Qu.: 83.5 FALSE:125 1st Qu.: 0.00
Median : 37.0 Median :120.0 TRUE :6 Median : 1.00
Mean : 43.2 Mean :142.9 Mean : 2.69
3rd Qu.: 49.5 3rd Qu.:185.0 3rd Qu.: 3.00
Max. :194.0 Max. :577.0 Max. :71.00
PoorCare
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.252
3rd Qu.:0.500
Max. :1.000
= table(D$PoorCare) # the base probability
base %>% prop.table base
0 1
0.74809 0.25191
set.seed(88)
= sample.split(
split $PoorCare, # 目標變數
DSplitRatio = 0.7 # 分割比率
) table(split) %>% prop.table()
split
FALSE TRUE
0.29771 0.70229
table(D$PoorCare, split) %>% prop.table(2)
split
FALSE TRUE
0 0.74359 0.75000
1 0.25641 0.25000
= subset(D, split == TRUE)
TR = subset(D, split == FALSE) TS
= glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
glm1 summary(glm1)
Call:
glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
data = TR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.878 -0.635 -0.517 -0.224 2.146
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5662 0.5249 -4.89 0.000001 ***
OfficeVisits 0.0814 0.0310 2.62 0.0087 **
Narcotics 0.0620 0.0320 1.94 0.0527 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 103.470 on 91 degrees of freedom
Residual deviance: 84.528 on 89 degrees of freedom
AIC: 90.53
Number of Fisher Scoring iterations: 4
模型:
Pr[y = 1] = 1/(1+exp(-f(x)))
\(Logit = f(x) = b_0 + b_1 x_1 + b_2 x_2 \;\; (1)\)
\(Logit = f(x) = -2.6461 + 0.0821 \times OfficeVisits + 0.0763 \times Narcotics \;\; (2)\)
係數:
\(Odd_0 = Exp(b_0 + b_1 x_1)\;\;(3)\)
\(Odd_1 = Exp[b_0 + b_1(x_1+1)] = Exp(b_0 + b_1 x_1 + b_1) = Exp(b_0 + b_1 x_1) \times Exp(b_1) \;\;(4)\)
\(Odd_1 = Odd_0 \times Exp(b_1) \:\:(5)\)
\(\frac{Odd_1}{Odd_0} = Exp(b_1)
\:\:(6)\)
機率和勝率之間的關係
💡 學習重點:
■
係數的指數就是勝率比;也就是說,\(x_i\)
每增加\(1\),\(y=1\)的勝率會變成原來的 \(Exp(b_i)\) 倍
■
各預測變數的(勝率)效果是相乘,而不是相加
■
機率和勝率之間的關係並不是線性的:
■
邏輯式回歸裡面各預測變數的勝率效果是固定的
■
但是他們的機率效果並不是固定的
■
我們需先推算原先的機率,才能推算變數的機率效果
= function(p, k) {o = p/(1-p); o = k * o; o/(1+o)}
pop = seq(0.1, 0.9, 0.1)
p0 = 2
k = sapply(seq(0.1, 0.9, 0.1), pop, k)
p1 data.frame(k, p0, p1, difference=p1-p0, multiplier=p1/p0) %>% round(2)
k p0 p1 difference multiplier
1 2 0.1 0.18 0.08 1.82
2 2 0.2 0.33 0.13 1.67
3 2 0.3 0.46 0.16 1.54
4 2 0.4 0.57 0.17 1.43
5 2 0.5 0.67 0.17 1.33
6 2 0.6 0.75 0.15 1.25
7 2 0.7 0.82 0.12 1.18
8 2 0.8 0.89 0.09 1.11
9 2 0.9 0.95 0.05 1.05
💡 相同的「勝率比(\(k=2\))」在不同的機率(0.1~0.9
)下會換算出不同的「機率差(5%~17%
)」
變數的邊際效果
當OfficeVisits
和Narcotics
分別都等於它們的中位數時,\(y=1\)的機率為
= data.frame(
df OfficeVisits = median(D$OfficeVisits),
Narcotics=median(D$Narcotics)
)predict(glm1, df, type="response")
1
0.17831
OfficeVisits
的邊際勝率效果是:exp(0.0814)=1.0848
exp(0.0814)
[1] 1.0848
= data.frame(
df OfficeVisits = median(D$OfficeVisits)+1,
Narcotics=median(D$Narcotics)
)predict(glm1, df, type="response")
1
0.19054
OfficeVisits
的邊際機率效果是:0.19054-0.17831=0.01223
-
OfficeVisits
和Narcotics
都等於中位數時,OfficeVisits
每增加一次PoorCare
的機率會增加1.223%
= data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics)+1)
df predict(glm1, df, type="response")
1
0.18757
= data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics)+1)
df predict(glm1, df, type="response")
1
0.20029
💡 學習重點:
■
係數的指數就是勝率比;也就是說,\(x_i\)
每增加\(1\),\(y=1\)的勝率會變成原來的 \(Exp(b_i)\) 倍
■
各預測變數的(勝率)效果是相乘,而不是相加
■
機率和勝率之間的關係並不是線性的:
■
邏輯式回歸裡面各預測變數的勝率效果是固定的
■
但是他們的機率效果並不是固定的
■
我們需先推算原先的機率,才能推算變數的機率效果
quantile(D$OfficeVisits)
0% 25% 50% 75% 100%
0.0 7.0 12.0 18.5 46.0
quantile(D$Narcotics)
0% 25% 50% 75% 100%
0 0 1 3 59
🗿 練習:
■
當OfficeVisits
和Narcotic
分別等於他們的第一分位(Q1
)時:
■ PoorCare = 1
的機率是?
■
兩個自變數的邊際勝率效果分別是?
■
兩個自變數的邊際機率效果分別是?
■
當OfficeVisits
和Narcotic
分別等於他們的第三分位(Q3
)時:
■ PoorCare = 1
的機率是?
■
兩個自變數的邊際勝率效果分別是?
■
兩個自變數的邊際機率效果分別是?
■
比較以上兩個題目的答案,我們可以觀察到什麼?