UNIT09B：簡單線性回歸 Simple Linear Regression

pacman::p_load(dplyr, ggplot2, plotly, ggpubr)

批發商資料集

W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10)

[A] 用R做線性回歸模型

lm : 方法
md : 模型
Milk : 目標變數 Response, Dependent Variable (DV)
Grocery : 預測變數 Predictor, Independent Variable (IV)
W : 資料

🌻 製作模型：lm()

md = lm(Milk ~ Grocery, W)

🌻 檢視模型：summary()

summary(md)


Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002

🌻 用模型做預測：predict

yhat = predict(md,W)

[B] 理論 vs 實證模型

理論模型：$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \; \epsilon \in i.i.d. Normal Dist.$
- $\beta_0$, $beta_1$ - 係數
- $y_i$ - 目標變數
- $x_i$ - 預測變數
- $\epsilon_i$ - 誤差
模型估計/預測：$\hat{y}_i = b_0 + b_1 x_i$
- md$coefficient : $b_0$, $b_1$ 係數估計值
- md$fitted.value : $\hat{y}_i$ 目標變數估計值
- md$residuals : $e_i = y - \hat{y}$ 殘差

y = W$Milk                # 目標變數
x = W$Grocery             # 解釋變數
b0 = md$coef[1]           # 捷距
b1 = md$coef[2]           # 斜率
yhat = b0 + b1 * x        # 預測值
er = y - yhat             # 殘差 (Residual)

# range(yhat - md$fitted.values)

# range(er - md$residuals)

[C] 畫出回歸線

\[ Milk_i = b_0 + b_1 Grocery_i\]

par(cex=0.8, mar=c(4,4,1,1))
plot(W$Grocery, W$Milk, pch=20, col="#80808080")
abline(b0, b1, col='red')

ggplot(aes(Grocery, Milk), data=W) + 
  geom_point(alpha=0.4, size=0.8) + 
  geom_smooth(method="lm", level=0.95, lwd=0.2) + 
  theme_bw() -> p

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

ggplotly(p)

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation:
x_plotlyDomain, y_plotlyDomain
i This can happen when ggplot fails to infer the correct grouping structure in
  the data.
i Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

🌻 問題

為什麼大部分的資料點都沒有落在95%信賴區間呢？
模型估計的是$y$的平均值($\bar{y}|x$)、而不是$y$本身！

[D] 使用模型做預測(估計)

new_data = tibble(Grocery = seq(1,5,0.5))
new_data

# A tibble: 9 x 1
  Grocery
    <dbl>
1     1  
2     1.5
3     2  
4     2.5
5     3  
6     3.5
7     4  
8     4.5
9     5

🌻 $\hat y$ 的 信賴區間(Confidence Interval)

conf = predict(md, new_data, interval="confidence"); conf

     fit    lwr    upr
1 1.5670 1.4064 1.7275
2 1.9345 1.8030 2.0661
3 2.3021 2.1993 2.4049
4 2.6697 2.5949 2.7445
5 3.0373 2.9885 3.0861
6 3.4049 3.3746 3.4352
7 3.7725 3.7377 3.8074
8 4.1401 4.0830 4.1973
9 4.5077 4.4236 4.5918

🌻 $y$ 的 預測區間(Prediction Interval)

pred = predict(md, new_data, interval="prediction"); pred

     fit     lwr    upr
1 1.5670 0.94409 2.1898
2 1.9345 1.31853 2.5506
3 2.3021 1.69161 2.9127
4 2.6697 2.06329 3.2762
5 3.0373 2.43354 3.6411
6 3.4049 2.80235 4.0075
7 3.7725 3.16969 4.3753
8 4.1401 3.53559 4.7446
9 4.5077 3.90004 5.1154

df = cbind(new_data,conf,pred[,-1]) 
names(df) = c("x","yhat","c.lwr","c.upr","p.lwr","p.upr")
df

    x   yhat  c.lwr  c.upr   p.lwr  p.upr
1 1.0 1.5670 1.4064 1.7275 0.94409 2.1898
2 1.5 1.9345 1.8030 2.0661 1.31853 2.5506
3 2.0 2.3021 2.1993 2.4049 1.69161 2.9127
4 2.5 2.6697 2.5949 2.7445 2.06329 3.2762
5 3.0 3.0373 2.9885 3.0861 2.43354 3.6411
6 3.5 3.4049 3.3746 3.4352 2.80235 4.0075
7 4.0 3.7725 3.7377 3.8074 3.16969 4.3753
8 4.5 4.1401 4.0830 4.1973 3.53559 4.7446
9 5.0 4.5077 4.4236 4.5918 3.90004 5.1154

ggplot(df, aes(x)) +
  geom_point(aes(y=yhat),col='red',size=2) +
  geom_line(aes(y=yhat),col='red') +
  geom_line(aes(y=c.lwr),col='orange') +
  geom_line(aes(y=c.upr),col='orange') +
  geom_line(aes(y=p.lwr),col='magenta',linetype='dashed') +
  geom_line(aes(y=p.upr),col='magenta',linetype='dashed') +
  geom_point(data=W, aes(x=Grocery, y=Milk), size=0.8, alpha=0.5) +
  xlim(1,5) + ylim(1,5) + theme_bw()

🌻 95%預測區間(Prediction Interval)才會涵蓋95%的$y$

[E] 模型摘要功能

summary(md)


Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002

§ 模型的準確性

Residuals: 殘差的分布
R-squared: 模型所能解釋的(目標變數的)變異比率

💡 : $b_1$ 係數代表平均而言， $x$ 每增加一單位時，$y$ 會增加的數量

Coefficients:
- Estimate: 係數估計值
- Std. Error: 係數估計值的標準差
- Pr(>|t|): $p$值 (變數之間沒有關係的機率？)
- Signif. codes: 顯著標記

§ 係數的分布與顯著性

curve(dnorm(x, 0.7352, 0.0301), -0.1, 1, n=400, xlab=bquote(italic(b[1])),
      main=bquote("The Distribution of Random Variable: " ~ italic(b[1])))
abline(v=qnorm(c(0.025, 0.975),0.7352, 0.0301), col="red")

💡 : 係數($b_0, b_1$)也都是隨機變數，他們都會有：

點估計：期望值、最有可能的值
區間估計：某一信心水準之下估計值可能出現的區間
通常某預測變數係數的95%信賴區間不包含0，我們就可以說它與目標變數有”顯著”的相關性

UNIT09B：簡單線性回歸 Simple Linear Regression

中山大學管理學院卓雍然

2024-04-17 14:39:04

[A] 用R做線性回歸模型

[B] 理論 vs 實證模型

[C] 畫出回歸線

[D] 使用模型做預測(估計)

[E] 模型摘要功能

§ 模型的準確性

§ 係數的分布與顯著性

UNIT09B：簡單線性回歸 Simple Linear Regression

中山大學管理學院 卓雍然

2024-04-17 14:39:04

[A] 用R做線性回歸模型

[B] 理論 vs 實證模型

[C] 畫出回歸線

[D] 使用模型做預測(估計)

[E] 模型摘要功能

§ 模型的準確性

§ 係數的分布與顯著性

中山大學管理學院卓雍然