::p_load(dplyr, ggplot2, corrplot) pacman
lm()
= read.csv('data/climate_change.csv')
D head(D)
Year Month MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols Temp
1 1983 5 2.556 345.96 1638.6 303.68 191.32 350.11 1366.1 0.0863 0.109
2 1983 6 2.167 345.52 1633.7 303.75 192.06 351.85 1366.1 0.0794 0.118
3 1983 7 1.741 344.15 1633.2 303.80 192.82 353.73 1366.3 0.0731 0.137
4 1983 8 1.130 342.25 1631.3 303.84 193.60 355.63 1366.4 0.0673 0.176
5 1983 9 0.428 340.17 1648.4 303.90 194.39 357.46 1366.2 0.0619 0.149
6 1983 10 0.002 340.30 1663.8 303.97 195.17 359.17 1366.1 0.0569 0.093
Splitting the data into Training and Testing data
= subset(D, Year <= 2006) # Train Data
TR = subset(D, Year > 2006) # Test Data TS
🌻 We prefer to measure model accuracy on
= lm(Temp~MEI+CO2+CH4+N2O+CFC.11+CFC.12+TSI+Aerosols,TR) m1
options(digits=4, scipen=10)
summary(m1)
Call:
lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 +
TSI + Aerosols, data = TR)
Residuals:
Min 1Q Median 3Q Max
-0.2589 -0.0591 -0.0008 0.0565 0.3243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -124.594260 19.886800 -6.27 0.0000000014310 ***
MEI 0.064205 0.006470 9.92 < 2e-16 ***
CO2 0.006457 0.002285 2.83 0.00505 **
CH4 0.000124 0.000516 0.24 0.81015
N2O -0.016528 0.008565 -1.93 0.05467 .
CFC.11 -0.006630 0.001626 -4.08 0.0000595728765 ***
CFC.12 0.003808 0.001014 3.76 0.00021 ***
TSI 0.093141 0.014755 6.31 0.0000000010959 ***
Aerosols -1.537613 0.213252 -7.21 0.0000000000054 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0917 on 275 degrees of freedom
Multiple R-squared: 0.751, Adjusted R-squared: 0.744
F-statistic: 104 on 8 and 275 DF, p-value: <2e-16
In the estimated model …
Simply speaking, for every regressors
However:
Fortunately:
Variance in \(y\) (SST) : \(\sum_i (y_i-\bar{y})^2\)
Variance in Residual (SSE) : \(\sum_i (\hat{y_i}-y_i)^2\)
Fraction of \(y\)’s variance explained by the model (\(R^2\)) : \(1 - \frac{SSE}{SST}\)
Root Mean Square Error (RMSE) : \(\sqrt{SSE/n}\)
Mean Absolute Error (MAE) : \(|y - \hat y| / n\)
= predict(m1) # no new data is required
pred = sum((pred - TR$Temp)^2) # SSE.train
SSE = sum((mean(TR$Temp) - TR$Temp)^2) # SST.train
SST = 1 - SSE/SST # R2.train
R2 = sqrt(SSE/nrow(TR)) # RMSE.train
RMSE c(SSE, SST, R2, RMSE)
[1] 2.31302 9.28526 0.75089 0.09025
= predict(m1, TS) # need to provide new data
pred = sum((pred - TS$Temp)^2) # SSE.test
SSE = sum((mean(TR$Temp) - TS$Temp)^2) # SST.test
SST = 1 - SSE/SST # R2.test
R2 = sqrt(SSE/nrow(TS)) # RMSE.test
RMSE c(SSE, SST, R2, RMSE)
[1] 0.21835 0.58602 0.62741 0.09538
Note that \(SST_{test}\) reference to {y}{train} rather than {y}{test}
\[ SST_{test} = \sum_{i=1}^n(\bar{y}_{train} - y_{i,test})^2\] So, \(R_{test}^2\) might be less than zero.
🗿 Discussion:
In the aspect
of business, which of the following do we care most?
■ the
coefficients (implying the effects)?
■ the \(p\)-values
■ the significance
■ anything else?
= c(3:5,7); cx = summary(m1)$coef
vx par(cex=0.8)
plot(0,0,xlim=c(-0.04,0.02),ylim=c(0,800),pch=20,
xlab="coefficients: the estimated value of b's",
ylab='probability density',
main='Probability Density Function of Coefficients')
abline(h=seq(0,800,100),v=seq(-0.04,0.03,0.0025),col='lightgray',lty=3)
for(i in vx) curve(dnorm(x,cx[i,1],cx[i,2]),add=T,col=i,lwd=2,n=1000)
abline(v=0,col='red')
legend("topleft",col=vx,lwd=2,legend=paste0(
"b",vx," ( ",rownames(cx)[vx]," )", c("**","","。","***") ) )
summary(m1)$coef[vx,]
Estimate Std. Error t value Pr(>|t|)
CO2 0.006457 0.0022846 2.8264 0.0050525
CH4 0.000124 0.0005158 0.2405 0.8101456
N2O -0.016528 0.0085649 -1.9297 0.0546693
CFC.12 0.003808 0.0010135 3.7573 0.0002097
🗿 Discussion:
1.Accroding to
the model summary, among
CO2
、CH4
、N2O
、CFC12
:
■ which gas has the largest effect on temperature? Is it
significant?
■ which gas has the most significant? Does it bear
the largest effect?
■ can you explain the inconsistency in the
above two question?
Correlation among regressors might bias the estimation of coefficients
cor(TR[3:10]) %>% round(2)
MEI CO2 CH4 N2O CFC.11 CFC.12 TSI Aerosols
MEI 1.00 -0.04 -0.03 -0.05 0.07 0.01 -0.15 0.34
CO2 -0.04 1.00 0.88 0.98 0.51 0.85 0.18 -0.36
CH4 -0.03 0.88 1.00 0.90 0.78 0.96 0.25 -0.27
N2O -0.05 0.98 0.90 1.00 0.52 0.87 0.20 -0.34
CFC.11 0.07 0.51 0.78 0.52 1.00 0.87 0.27 -0.04
CFC.12 0.01 0.85 0.96 0.87 0.87 1.00 0.26 -0.23
TSI -0.15 0.18 0.25 0.20 0.27 0.26 1.00 0.05
Aerosols 0.34 -0.36 -0.27 -0.34 -0.04 -0.23 0.05 1.00
cor(TR[3:10]) %>% corrplot.mixed(tl.cex=0.7, tl.col='black')
Remove high correlated regressors help to avoid collinearity
= lm(Temp~MEI+N2O+TSI+Aerosols,TR)
m2 summary(m2)
Call:
lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = TR)
Residuals:
Min 1Q Median 3Q Max
-0.2792 -0.0598 -0.0060 0.0567 0.3419
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -116.22686 20.22303 -5.75 0.00000002373584 ***
MEI 0.06419 0.00665 9.65 < 2e-16 ***
N2O 0.02532 0.00131 19.31 < 2e-16 ***
TSI 0.07949 0.01488 5.34 0.00000018937322 ***
Aerosols -1.70174 0.21800 -7.81 0.00000000000012 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0955 on 279 degrees of freedom
Multiple R-squared: 0.726, Adjusted R-squared: 0.722
F-statistic: 185 on 4 and 279 DF, p-value: <2e-16
R also have a tool that help to select the regressors automatically
= step(m1) m3
Start: AIC=-1348
Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
Df Sum of Sq RSS AIC
- CH4 1 0.000 2.31 -1350
<none> 2.31 -1348
- N2O 1 0.031 2.34 -1346
- CO2 1 0.067 2.38 -1342
- CFC.12 1 0.119 2.43 -1336
- CFC.11 1 0.140 2.45 -1333
- TSI 1 0.335 2.65 -1312
- Aerosols 1 0.437 2.75 -1301
- MEI 1 0.828 3.14 -1263
Step: AIC=-1350
Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
Df Sum of Sq RSS AIC
<none> 2.31 -1350
- N2O 1 0.031 2.35 -1348
- CO2 1 0.067 2.38 -1344
- CFC.12 1 0.130 2.44 -1337
- CFC.11 1 0.139 2.45 -1335
- TSI 1 0.335 2.65 -1314
- Aerosols 1 0.440 2.75 -1303
- MEI 1 0.831 3.14 -1265
summary(m3)
Call:
lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI +
Aerosols, data = TR)
Residuals:
Min 1Q Median 3Q Max
-0.2577 -0.0599 -0.0010 0.0559 0.3220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -124.515178 19.850113 -6.27 0.0000000013651 ***
MEI 0.064068 0.006434 9.96 < 2e-16 ***
CO2 0.006401 0.002269 2.82 0.0051 **
N2O -0.016021 0.008287 -1.93 0.0542 .
CFC.11 -0.006609 0.001621 -4.08 0.0000595362634 ***
CFC.12 0.003868 0.000981 3.94 0.0001 ***
TSI 0.093116 0.014729 6.32 0.0000000010355 ***
Aerosols -1.540206 0.212616 -7.24 0.0000000000044 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0916 on 276 degrees of freedom
Multiple R-squared: 0.751, Adjusted R-squared: 0.745
F-statistic: 119 on 7 and 276 DF, p-value: <2e-16