案例:糖尿病患醫療品質
pacman::p_load(caTools, ggplot2, dplyr)
D = read.csv("data/quality.csv") # Read in dataset
set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75) # split vector
TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)
Fig 12.1 - 混淆矩陣與模型準確性指標
預測機率 Predicted Probability (Training)
par(cex=0.8)
pred = predict(glm1, type="response") #type="response"=>predict 機率, 不加的話會預設threshold=0.5,回傳0,1
#predict training data不用再給一次資料
hist(pred)
abline(v=0.5, col='red')
混淆矩陣 Confusion Matrix (Training)
Predict
Acture FALSE TRUE
0 70 4
1 15 10
模型準確性指標 Accuracy Metrices (Training)
Accuracy = function(x, k=3) c(
accuracy = sum(diag(x))/sum(x), #diag=>對角線數據
sensitivity = as.numeric(x[2,2]/rowSums(x)[2]),
specificity = as.numeric(x[1,1]/rowSums(x)[1])
) %>% round(k) #自製function #x:混淆矩陣; k:小數點位數
Accuracy(cmx)
accuracy sensitivity specificity
0.808 0.400 0.946
預測機率 Predicted Probability (Testing)
par(cex=0.8)
pred2 = predict(glm1, newdata=TS, type="response")
hist(pred2, 10)
abline(v=0.5, col='red')
混淆矩陣 Confusion Matrix (Testing)
Predict
Acture FALSE TRUE
0 23 1
1 5 3
模型準確性指標 Accuracy Matrices (Testing)
Train Test
accuracy 0.808 0.812
sensitivity 0.400 0.375
specificity 0.946 0.958
Fig 12.2 - 預測機率分佈、臨界機率、混淆矩陣
預測機率分佈 (DPP) - Distribution of Predicted Probability (Train)
data.frame(y=factor(TR$PoorCare), pred=pred) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Train)") +
xlab("predicted probability")
預測機率分佈 (DPP) - Distribution of Predicted Probability (Test)
data.frame(y=factor(TS$PoorCare), pred=pred2) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Test)") +
xlab("predicted probability")
ROC - Receiver Operation Curve
par(mfrow=c(1,2), cex=0.8)
trAUC = colAUC(pred, y=TR$PoorCare, plotROC=T) #colAUC=>caTools套件
tsAUC = colAUC(pred2, y=TS$PoorCare, plotROC=T)
AUC - Area Under Curve
[1] 0.77459 0.79948
🗿 練習:
使用TR$MemberID
以外的所有欄位,建立一個邏輯式回歸模型來預測PoorCare
,並:
glm2 = glm(PoorCare ~ InpatientDays + ERVisits + OfficeVisits + Narcotics + DaysSinceLastERVisit + Pain + ProviderCount + MedicalClaims + ClaimLines + StartedOnCombination + AcuteDrugGapSmall, TR, family=binomial)
#TotalVisits是NA,應是與其他自變數存在完全共線性,使係數估計值無法求出唯一解
summary(glm2)
Call:
glm(formula = PoorCare ~ InpatientDays + ERVisits + OfficeVisits +
Narcotics + DaysSinceLastERVisit + Pain + ProviderCount +
MedicalClaims + ClaimLines + StartedOnCombination + AcuteDrugGapSmall,
family = binomial, data = TR)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4950 -0.5389 -0.3037 -0.0612 2.0898
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52721 1.24887 -1.22 0.2214
InpatientDays 0.13978 0.09725 1.44 0.1506
ERVisits -0.32645 0.24856 -1.31 0.1891
OfficeVisits 0.09852 0.04687 2.10 0.0355 *
Narcotics 0.04695 0.04567 1.03 0.3039
DaysSinceLastERVisit -0.00389 0.00188 -2.06 0.0391 *
Pain 0.00227 0.01538 0.15 0.8827
ProviderCount 0.02847 0.03330 0.85 0.3927
MedicalClaims 0.01568 0.02441 0.64 0.5205
ClaimLines -0.01055 0.00924 -1.14 0.2535
StartedOnCombinationTRUE 2.34587 1.66703 1.41 0.1594
AcuteDrugGapSmall 0.26808 0.10246 2.62 0.0089 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 111.888 on 98 degrees of freedom
Residual deviance: 72.527 on 87 degrees of freedom
AIC: 96.53
Number of Fisher Scoring iterations: 6
【A】 分別畫出Training
和Testing
的DPP
#Training DPP
data.frame(y=factor(TR$PoorCare), pred=predTR) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Train)") +
xlab("predicted probability")
#Testing DPP
data.frame(y=factor(TS$PoorCare), pred=predTS) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Test)") +
xlab("predicted probability")
【B】 分別畫出Training
和Testing
的ROC
par(mfrow=c(1,2), cex=0.8)
TRAUC = colAUC(predTR, y=TR$PoorCare, plotROC=T) #colAUC=>caTools套件
TSAUC = colAUC(predTS, y=TS$PoorCare, plotROC=T)
【C】 分別算出Training
和Testing
的ACC
、SENS
和SPEC
Predict
Acture FALSE TRUE
0 68 6
1 14 11
Predict
Acture FALSE TRUE
0 23 1
1 4 4
Train Test
accuracy 0.798 0.844
sensitivity 0.440 0.500
specificity 0.919 0.958
【D】 分別算出Training
和Testing
的AUC
[1] 0.87568 0.86458
【E】 跟用兩個預測變數的模型相比,這一個模型有比較準嗎?