案例:糖尿病患醫療品質
pacman::p_load(caTools, ggplot2, dplyr)
D = read.csv("data/quality.csv") # Read in dataset
set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75) # split vector
TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)Fig 12.1 - 混淆矩陣與模型準確性指標
預測機率 Predicted Probability (Training)
par(cex=0.8)
pred = predict(glm1, type="response")
hist(pred)
abline(v=0.5, col='red')混淆矩陣 Confusion Matrix (Training)
cmx = table(Acture=TR$PoorCare, Predict=pred > 0.5)
cmx Predict
Acture FALSE TRUE
0 70 4
1 15 10
模型準確性指標 Accuracy Metrices (Training)
AccuracyMetrices = function(x, k=3) c(
accuracy = sum(diag(x))/sum(x), # 正確性
sensitivity = as.numeric(x[2,2]/rowSums(x)[2]), # 敏感性
specificity = as.numeric(x[1,1]/rowSums(x)[1]) # 明確性
) %>% round(k)
AccuracyMetrices(cmx) accuracy sensitivity specificity
0.808 0.400 0.946
預測機率 Predicted Probability (Testing)
par(cex=0.8)
pred2 = predict(glm1, newdata=TS, type="response")
hist(pred2, 10)
abline(v=0.5, col='red')混淆矩陣 Confusion Matrix (Testing)
cmx2 = table(Acture=TS$PoorCare, Predict=pred2 > 0.5)
cmx2 Predict
Acture FALSE TRUE
0 23 1
1 5 3
比較模型準確性指標 Accuracy Matrices (Testing)
sapply(list(Train=cmx, Test=cmx2), AccuracyMetrices) Train Test
accuracy 0.808 0.812
sensitivity 0.400 0.375
specificity 0.946 0.958
Fig 12.2 - 預測機率分佈、臨界機率、混淆矩陣
分類預測機率分佈 (cDPP) - Categorical Dist. of Predicted Prob. (Train)
data.frame(y=factor(TR$PoorCare), pred=pred) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Train)") +
xlab("predicted probability")
分類預測機率分佈 (cDPP) - Categorical Dist. of Predicted Prob. (Test)
#
#ROC - Receiver Operation Curve
par(mfrow=c(1,2), cex=0.8)
trAUC = colAUC(pred, y=TR$PoorCare, plotROC=T)
tsAUC = colAUC(pred2, y=TS$PoorCare, plotROC=T)AUC - Area Under Curve
c(trAUC, tsAUC)[1] 0.77459 0.79948
🗿 練習:
使用TR$MemberID以外的所有欄位,建立一個邏輯式回歸模型來預測PoorCare,並:
【A】
分別畫出Training和Testing的DPP
【B】
分別畫出Training和Testing的ROC
【C】
分別算出Training和Testing的ACC、SENS和SPEC
【D】
分別算出Training和Testing的AUC
【E】 跟用兩個預測變數的模型相比,這一個模型有比較準嗎?
【F】
為什麼它比較準(或比較不準)呢?
glm2 = glm(PoorCare ~ ., TR[,-c(1,8)], family=binomial)
summary(glm2)
Call:
glm(formula = PoorCare ~ ., family = binomial, data = TR[, -c(1,
8)])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4950 -0.5389 -0.3037 -0.0612 2.0898
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52721 1.24887 -1.22 0.2214
InpatientDays 0.13978 0.09725 1.44 0.1506
ERVisits -0.32645 0.24856 -1.31 0.1891
OfficeVisits 0.09852 0.04687 2.10 0.0355 *
Narcotics 0.04695 0.04567 1.03 0.3039
DaysSinceLastERVisit -0.00389 0.00188 -2.06 0.0391 *
Pain 0.00227 0.01538 0.15 0.8827
ProviderCount 0.02847 0.03330 0.85 0.3927
MedicalClaims 0.01568 0.02441 0.64 0.5205
ClaimLines -0.01055 0.00924 -1.14 0.2535
StartedOnCombinationTRUE 2.34587 1.66703 1.41 0.1594
AcuteDrugGapSmall 0.26808 0.10246 2.62 0.0089 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 111.888 on 98 degrees of freedom
Residual deviance: 72.527 on 87 degrees of freedom
AIC: 96.53
Number of Fisher Scoring iterations: 6
pTR = predict(glm2, type="response")
pTS = predict(glm2, TS, type="response")
cmTR = table(Acture=TR$PoorCare, Predict=pTR > 0.5)
cmTS = table(Acture=TS$PoorCare, Predict=pTS > 0.5)
sapply(list(Train1=cmx, Train2=cmTR, Test1=cmx2, Test2=cmTS), AccuracyMetrices) Train1 Train2 Test1 Test2
accuracy 0.808 0.798 0.812 0.844
sensitivity 0.400 0.440 0.375 0.500
specificity 0.946 0.919 0.958 0.958
par(mfrow=c(1,2), cex=0.8)
trAUC = colAUC(pred2, y=TS$PoorCare)
tsAUC = colAUC(pTS, y=TS$PoorCare)AUC - Area Under Curve
c(trAUC, tsAUC)[1] 0.79948 0.86458