pacman::p_load(dplyr, ggplot2, plotly, ggpubr)

批發商資料集

W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10)


[A] 用R做線性回歸模型

🌻 製作模型:lm()

md = lm(Milk ~ Grocery, W)

🌻 檢視模型:summary()

summary(md)

Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002

🌻 用模型做預測:predict

yhat = predict(md,W)


[B] 理論 vs 實證模型

y = W$Milk                # 目標變數
x = W$Grocery             # 解釋變數
b0 = md$coef[1]           # 捷距
b1 = md$coef[2]           # 斜率
yhat = b0 + b1 * x        # 預測值
er = y - yhat             # 殘差 (Residual)
# range(yhat - md$fitted.values)
# range(er - md$residuals)


[C] 畫出回歸線

\[ Milk_i = b_0 + b_1 Grocery_i\]

par(cex=0.8, mar=c(4,4,1,1))
plot(W$Grocery, W$Milk, pch=20, col="#80808080")
abline(b0, b1, col='red')

ggplot(aes(Grocery, Milk), data=W) + 
  geom_point(alpha=0.4, size=0.8) + 
  geom_smooth(method="lm", level=0.95, lwd=0.2) + 
  theme_bw() -> p
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
x_plotlyDomain, y_plotlyDomain
i This can happen when ggplot fails to infer the correct grouping structure in
  the data.
i Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

🌻 問題



[D] 使用模型做預測(估計)

new_data = tibble(Grocery = seq(1,5,0.5))
new_data
# A tibble: 9 x 1
  Grocery
    <dbl>
1     1  
2     1.5
3     2  
4     2.5
5     3  
6     3.5
7     4  
8     4.5
9     5  

🌻 \(\hat y\)信賴區間(Confidence Interval)

conf = predict(md, new_data, interval="confidence"); conf
     fit    lwr    upr
1 1.5670 1.4064 1.7275
2 1.9345 1.8030 2.0661
3 2.3021 2.1993 2.4049
4 2.6697 2.5949 2.7445
5 3.0373 2.9885 3.0861
6 3.4049 3.3746 3.4352
7 3.7725 3.7377 3.8074
8 4.1401 4.0830 4.1973
9 4.5077 4.4236 4.5918

🌻 \(y\)預測區間(Prediction Interval)

pred = predict(md, new_data, interval="prediction"); pred
     fit     lwr    upr
1 1.5670 0.94409 2.1898
2 1.9345 1.31853 2.5506
3 2.3021 1.69161 2.9127
4 2.6697 2.06329 3.2762
5 3.0373 2.43354 3.6411
6 3.4049 2.80235 4.0075
7 3.7725 3.16969 4.3753
8 4.1401 3.53559 4.7446
9 4.5077 3.90004 5.1154
df = cbind(new_data,conf,pred[,-1]) 
names(df) = c("x","yhat","c.lwr","c.upr","p.lwr","p.upr")
df
    x   yhat  c.lwr  c.upr   p.lwr  p.upr
1 1.0 1.5670 1.4064 1.7275 0.94409 2.1898
2 1.5 1.9345 1.8030 2.0661 1.31853 2.5506
3 2.0 2.3021 2.1993 2.4049 1.69161 2.9127
4 2.5 2.6697 2.5949 2.7445 2.06329 3.2762
5 3.0 3.0373 2.9885 3.0861 2.43354 3.6411
6 3.5 3.4049 3.3746 3.4352 2.80235 4.0075
7 4.0 3.7725 3.7377 3.8074 3.16969 4.3753
8 4.5 4.1401 4.0830 4.1973 3.53559 4.7446
9 5.0 4.5077 4.4236 4.5918 3.90004 5.1154
ggplot(df, aes(x)) +
  geom_point(aes(y=yhat),col='red',size=2) +
  geom_line(aes(y=yhat),col='red') +
  geom_line(aes(y=c.lwr),col='orange') +
  geom_line(aes(y=c.upr),col='orange') +
  geom_line(aes(y=p.lwr),col='magenta',linetype='dashed') +
  geom_line(aes(y=p.upr),col='magenta',linetype='dashed') +
  geom_point(data=W, aes(x=Grocery, y=Milk), size=0.8, alpha=0.5) +
  xlim(1,5) + ylim(1,5) + theme_bw()

🌻 95%預測區間(Prediction Interval)才會涵蓋95%的\(y\)


[E] 模型摘要功能

summary(md)

Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002
§ 模型的準確性
  • Residuals: 殘差的分布
  • R-squared: 模型所能解釋的(目標變數的)變異比率

💡 : \(b_1\) 係數代表平均而言, \(x\) 每增加一單位時,\(y\) 會增加的數量

  • Coefficients:
    • Estimate: 係數估計值
    • Std. Error: 係數估計值的標準差
    • Pr(>|t|): \(p\)值 (變數之間沒有關係的機率?)
    • Signif. codes: 顯著標記


§ 係數的分布與顯著性
curve(dnorm(x, 0.7352, 0.0301), -0.1, 1, n=400, xlab=bquote(italic(b[1])),
      main=bquote("The Distribution of Random Variable: " ~ italic(b[1])))
abline(v=qnorm(c(0.025, 0.975),0.7352, 0.0301), col="red")

💡 : 係數(\(b_0, b_1\))也都是隨機變數,他們都會有:

  • 點估計:期望值、最有可能的值
  • 區間估計:某一信心水準之下估計值可能出現的區間
  • 通常某預測變數係數的95%信賴區間不包含0,我們就可以說它與目標變數有”顯著”的相關性