::p_load(dplyr, ggplot2, plotly, ggpubr) pacman
批發商資料集
= read.csv('data/wholesales.csv')
W $Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W3:8] = lapply(W[3:8], log, base=10) W[
lm
: 方法md
: 模型Milk
: 目標變數 Response, Dependent Variable (DV)Grocery
: 預測變數 Predictor, Independent Variable
(IV)W
: 資料🌻 製作模型:lm()
= lm(Milk ~ Grocery, W) md
🌻 檢視模型:summary()
summary(md)
Call:
lm(formula = Milk ~ Grocery, data = W)
Residuals:
Min 1Q Median 3Q Max
-1.1371 -0.1875 0.0219 0.1593 1.8732
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8318 0.1115 7.46 0.00000000000047 ***
Grocery 0.7352 0.0301 24.39 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared: 0.576, Adjusted R-squared: 0.575
F-statistic: 595 on 1 and 438 DF, p-value: <0.0000000000000002
🌻 用模型做預測:predict
= predict(md,W) yhat
md$coefficient
: \(b_0\), \(b_1\) 係數估計值md$fitted.value
: \(\hat{y}_i\) 目標變數估計值md$residuals
: \(e_i = y -
\hat{y}\) 殘差= W$Milk # 目標變數
y = W$Grocery # 解釋變數
x = md$coef[1] # 捷距
b0 = md$coef[2] # 斜率
b1 = b0 + b1 * x # 預測值
yhat = y - yhat # 殘差 (Residual) er
# range(yhat - md$fitted.values)
# range(er - md$residuals)
\[ Milk_i = b_0 + b_1 Grocery_i\]
par(cex=0.8, mar=c(4,4,1,1))
plot(W$Grocery, W$Milk, pch=20, col="#80808080")
abline(b0, b1, col='red')
ggplot(aes(Grocery, Milk), data=W) +
geom_point(alpha=0.4, size=0.8) +
geom_smooth(method="lm", level=0.95, lwd=0.2) +
theme_bw() -> p
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
x_plotlyDomain, y_plotlyDomain
i This can happen when ggplot fails to infer the correct grouping structure in
the data.
i Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
🌻 問題
= tibble(Grocery = seq(1,5,0.5))
new_data new_data
# A tibble: 9 x 1
Grocery
<dbl>
1 1
2 1.5
3 2
4 2.5
5 3
6 3.5
7 4
8 4.5
9 5
🌻 \(\hat y\) 的 信賴區間(Confidence Interval)
= predict(md, new_data, interval="confidence"); conf conf
fit lwr upr
1 1.5670 1.4064 1.7275
2 1.9345 1.8030 2.0661
3 2.3021 2.1993 2.4049
4 2.6697 2.5949 2.7445
5 3.0373 2.9885 3.0861
6 3.4049 3.3746 3.4352
7 3.7725 3.7377 3.8074
8 4.1401 4.0830 4.1973
9 4.5077 4.4236 4.5918
🌻 \(y\) 的 預測區間(Prediction Interval)
= predict(md, new_data, interval="prediction"); pred pred
fit lwr upr
1 1.5670 0.94409 2.1898
2 1.9345 1.31853 2.5506
3 2.3021 1.69161 2.9127
4 2.6697 2.06329 3.2762
5 3.0373 2.43354 3.6411
6 3.4049 2.80235 4.0075
7 3.7725 3.16969 4.3753
8 4.1401 3.53559 4.7446
9 4.5077 3.90004 5.1154
= cbind(new_data,conf,pred[,-1])
df names(df) = c("x","yhat","c.lwr","c.upr","p.lwr","p.upr")
df
x yhat c.lwr c.upr p.lwr p.upr
1 1.0 1.5670 1.4064 1.7275 0.94409 2.1898
2 1.5 1.9345 1.8030 2.0661 1.31853 2.5506
3 2.0 2.3021 2.1993 2.4049 1.69161 2.9127
4 2.5 2.6697 2.5949 2.7445 2.06329 3.2762
5 3.0 3.0373 2.9885 3.0861 2.43354 3.6411
6 3.5 3.4049 3.3746 3.4352 2.80235 4.0075
7 4.0 3.7725 3.7377 3.8074 3.16969 4.3753
8 4.5 4.1401 4.0830 4.1973 3.53559 4.7446
9 5.0 4.5077 4.4236 4.5918 3.90004 5.1154
ggplot(df, aes(x)) +
geom_point(aes(y=yhat),col='red',size=2) +
geom_line(aes(y=yhat),col='red') +
geom_line(aes(y=c.lwr),col='orange') +
geom_line(aes(y=c.upr),col='orange') +
geom_line(aes(y=p.lwr),col='magenta',linetype='dashed') +
geom_line(aes(y=p.upr),col='magenta',linetype='dashed') +
geom_point(data=W, aes(x=Grocery, y=Milk), size=0.8, alpha=0.5) +
xlim(1,5) + ylim(1,5) + theme_bw()
🌻 95%預測區間(Prediction Interval)才會涵蓋95%的\(y\)
summary(md)
Call:
lm(formula = Milk ~ Grocery, data = W)
Residuals:
Min 1Q Median 3Q Max
-1.1371 -0.1875 0.0219 0.1593 1.8732
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8318 0.1115 7.46 0.00000000000047 ***
Grocery 0.7352 0.0301 24.39 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared: 0.576, Adjusted R-squared: 0.575
F-statistic: 595 on 1 and 438 DF, p-value: <0.0000000000000002
💡 : \(b_1\) 係數代表平均而言, \(x\) 每增加一單位時,\(y\) 會增加的數量
curve(dnorm(x, 0.7352, 0.0301), -0.1, 1, n=400, xlab=bquote(italic(b[1])),
main=bquote("The Distribution of Random Variable: " ~ italic(b[1])))
abline(v=qnorm(c(0.025, 0.975),0.7352, 0.0301), col="red")
💡 : 係數(\(b_0, b_1\))也都是隨機變數,他們都會有:
0
,我們就可以說它與目標變數有”顯著”的相關性