UNIT12A：類別模型的準確性指標

案例：糖尿病患醫療品質

pacman::p_load(caTools, ggplot2, dplyr)
D = read.csv("data/quality.csv")  # Read in dataset
set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75)  # split vector
TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)

【A】傳統準確性指標

Fig 12.1 - 混淆矩陣與模型準確性指標

Training Data

預測機率 Predicted Probability (Training)

par(cex=0.8)
pred = predict(glm1, type="response") #type="response"=>predict 機率, 不加的話會預設threshold=0.5,回傳0,1
#predict training data不用再給一次資料
hist(pred)
abline(v=0.5, col='red')

混淆矩陣 Confusion Matrix (Training)

cmx = table(Acture=TR$PoorCare, Predict=pred > 0.5) #threshold=0.5
cmx

      Predict
Acture FALSE TRUE
     0    70    4
     1    15   10

模型準確性指標 Accuracy Metrices (Training)

Accuracy = function(x, k=3) c(
  accuracy = sum(diag(x))/sum(x), #diag=>對角線數據
  sensitivity = as.numeric(x[2,2]/rowSums(x)[2]),
  specificity = as.numeric(x[1,1]/rowSums(x)[1])
  ) %>% round(k) #自製function #x:混淆矩陣; k:小數點位數
Accuracy(cmx)

   accuracy sensitivity specificity 
      0.808       0.400       0.946

Testing Data

預測機率 Predicted Probability (Testing)

par(cex=0.8)
pred2 = predict(glm1, newdata=TS, type="response") 
hist(pred2, 10)
abline(v=0.5, col='red')

混淆矩陣 Confusion Matrix (Testing)

cmx2 = table(Acture=TS$PoorCare, Predict=pred2 > 0.5)
cmx2

      Predict
Acture FALSE TRUE
     0    23    1
     1     5    3

模型準確性指標 Accuracy Matrices (Testing)

sapply(list(Train=cmx, Test=cmx2), Accuracy) #sapply:list裡面的每個元件都跑後面的function

            Train  Test
accuracy    0.808 0.812
sensitivity 0.400 0.375
specificity 0.946 0.958

【B】預測機率分佈、臨界機率、混淆矩陣

Fig 12.2 - 預測機率分佈、臨界機率、混淆矩陣

預測機率分佈 (DPP) - Distribution of Predicted Probability (Train)

data.frame(y=factor(TR$PoorCare), pred=pred) %>% 
  ggplot(aes(x=pred, fill=y)) + 
  geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
  ggtitle("Distribution of Predicted Probability (DPP,Train)") +
  xlab("predicted probability")

預測機率分佈 (DPP) - Distribution of Predicted Probability (Test)

data.frame(y=factor(TS$PoorCare), pred=pred2) %>% 
  ggplot(aes(x=pred, fill=y)) + 
  geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
  ggtitle("Distribution of Predicted Probability (DPP,Test)") +
  xlab("predicted probability")

【C】作業曲線(ROC)與辨識率(AUC)

ROC - Receiver Operation Curve

par(mfrow=c(1,2), cex=0.8)
trAUC = colAUC(pred, y=TR$PoorCare, plotROC=T) #colAUC=>caTools套件
tsAUC = colAUC(pred2, y=TS$PoorCare, plotROC=T)

AUC - Area Under Curve

c(trAUC, tsAUC)

[1] 0.77459 0.79948

🗿 練習：
使用TR$MemberID以外的所有欄位，建立一個邏輯式回歸模型來預測PoorCare，並：

glm2 = glm(PoorCare ~ InpatientDays + ERVisits + OfficeVisits + Narcotics + DaysSinceLastERVisit + Pain  + ProviderCount + MedicalClaims + ClaimLines + StartedOnCombination + AcuteDrugGapSmall, TR, family=binomial)
#TotalVisits是NA,應是與其他自變數存在完全共線性,使係數估計值無法求出唯一解

summary(glm2)


Call:
glm(formula = PoorCare ~ InpatientDays + ERVisits + OfficeVisits + 
    Narcotics + DaysSinceLastERVisit + Pain + ProviderCount + 
    MedicalClaims + ClaimLines + StartedOnCombination + AcuteDrugGapSmall, 
    family = binomial, data = TR)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4950  -0.5389  -0.3037  -0.0612   2.0898  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)   
(Intercept)              -1.52721    1.24887   -1.22   0.2214   
InpatientDays             0.13978    0.09725    1.44   0.1506   
ERVisits                 -0.32645    0.24856   -1.31   0.1891   
OfficeVisits              0.09852    0.04687    2.10   0.0355 * 
Narcotics                 0.04695    0.04567    1.03   0.3039   
DaysSinceLastERVisit     -0.00389    0.00188   -2.06   0.0391 * 
Pain                      0.00227    0.01538    0.15   0.8827   
ProviderCount             0.02847    0.03330    0.85   0.3927   
MedicalClaims             0.01568    0.02441    0.64   0.5205   
ClaimLines               -0.01055    0.00924   -1.14   0.2535   
StartedOnCombinationTRUE  2.34587    1.66703    1.41   0.1594   
AcuteDrugGapSmall         0.26808    0.10246    2.62   0.0089 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 111.888  on 98  degrees of freedom
Residual deviance:  72.527  on 87  degrees of freedom
AIC: 96.53

Number of Fisher Scoring iterations: 6

【A】分別畫出Training和Testing的DPP

predTR = predict(glm2, type="response")
predTS = predict(glm2, newdata=TS, type="response")

#Training DPP
data.frame(y=factor(TR$PoorCare), pred=predTR) %>% 
  ggplot(aes(x=pred, fill=y)) + 
  geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
  ggtitle("Distribution of Predicted Probability (DPP,Train)") +
  xlab("predicted probability")

#Testing DPP
data.frame(y=factor(TS$PoorCare), pred=predTS) %>% 
  ggplot(aes(x=pred, fill=y)) + 
  geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
  ggtitle("Distribution of Predicted Probability (DPP,Test)") +
  xlab("predicted probability")

【B】分別畫出Training和Testing的ROC

par(mfrow=c(1,2), cex=0.8)
TRAUC = colAUC(predTR, y=TR$PoorCare, plotROC=T) #colAUC=>caTools套件
TSAUC = colAUC(predTS, y=TS$PoorCare, plotROC=T)

【C】分別算出Training和Testing的ACC、SENS和SPEC

#Train
cmTR = table(Acture=TR$PoorCare, Predict=predTR > 0.5) #threshold=0.5
cmTR

      Predict
Acture FALSE TRUE
     0    68    6
     1    14   11

#Test
cmTS = table(Acture=TS$PoorCare, Predict=predTS > 0.5) #threshold=0.5
cmTS

      Predict
Acture FALSE TRUE
     0    23    1
     1     4    4

sapply(list(Train=cmTR, Test=cmTS), Accuracy)

            Train  Test
accuracy    0.798 0.844
sensitivity 0.440 0.500
specificity 0.919 0.958

【D】分別算出Training和Testing的AUC

c(TRAUC, TSAUC)

[1] 0.87568 0.86458

【E】跟用兩個預測變數的模型相比，這一個模型有比較準嗎？

#Ans: 有,因AUC比較大，Test data 的accuracy 也有提高

【F】為什麼它比較準(或比較不準)呢？

#Ans: 可能是因為有更多自變數可以去解釋應變數的變異

UNIT12A：類別模型的準確性指標

模型的準確性

中山大學管理學院第8組

2020-05-24 19:15:00

【A】傳統準確性指標

Training Data

Testing Data

【B】預測機率分佈、臨界機率、混淆矩陣

【C】作業曲線(ROC)與辨識率(AUC)

UNIT12A：類別模型的準確性指標

模型的準確性

中山大學管理學院 第8組

2020-05-24 19:15:00

【A】傳統準確性指標

Training Data

Testing Data

【B】預測機率分佈、臨界機率、混淆矩陣

【C】作業曲線(ROC)與辨識率(AUC)

中山大學管理學院第8組