pacman::p_load(dplyr, ggplot2, corrplot)


[A] Multiple Regression - lm()

§ A1 The Global Warming Dataset
D = read.csv('data/climate_change.csv')
head(D)
  Year Month   MEI    CO2    CH4    N2O CFC.11 CFC.12    TSI Aerosols  Temp
1 1983     5 2.556 345.96 1638.6 303.68 191.32 350.11 1366.1   0.0863 0.109
2 1983     6 2.167 345.52 1633.7 303.75 192.06 351.85 1366.1   0.0794 0.118
3 1983     7 1.741 344.15 1633.2 303.80 192.82 353.73 1366.3   0.0731 0.137
4 1983     8 1.130 342.25 1631.3 303.84 193.60 355.63 1366.4   0.0673 0.176
5 1983     9 0.428 340.17 1648.4 303.90 194.39 357.46 1366.2   0.0619 0.149
6 1983    10 0.002 340.30 1663.8 303.97 195.17 359.17 1366.1   0.0569 0.093

Splitting the data into Training and Testing data

TR = subset(D, Year <= 2006)  # Train Data
TS = subset(D, Year > 2006)   # Test Data

🌻 We prefer to measure model accuracy on out of sample data

§ A2 Building Model
m1 = lm(Temp~MEI+CO2+CH4+N2O+CFC.11+CFC.12+TSI+Aerosols,TR)
§ A3 Model Summary
options(digits=4, scipen=10)
summary(m1)

Call:
lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
    TSI + Aerosols, data = TR)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2589 -0.0591 -0.0008  0.0565  0.3243 

Coefficients:
               Estimate  Std. Error t value        Pr(>|t|)    
(Intercept) -124.594260   19.886800   -6.27 0.0000000014310 ***
MEI            0.064205    0.006470    9.92         < 2e-16 ***
CO2            0.006457    0.002285    2.83         0.00505 ** 
CH4            0.000124    0.000516    0.24         0.81015    
N2O           -0.016528    0.008565   -1.93         0.05467 .  
CFC.11        -0.006630    0.001626   -4.08 0.0000595728765 ***
CFC.12         0.003808    0.001014    3.76         0.00021 ***
TSI            0.093141    0.014755    6.31 0.0000000010959 ***
Aerosols      -1.537613    0.213252   -7.21 0.0000000000054 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0917 on 275 degrees of freedom
Multiple R-squared:  0.751, Adjusted R-squared:  0.744 
F-statistic:  104 on 8 and 275 DF,  p-value: <2e-16

In the estimated model …

  • \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + ...\)

Simply speaking, for every regressors

  • \(b_i\) : the coefficient estimation indicates the \(x\)’s marginal effect on \(y\)
  • \(\sigma_i\) : the standard error indicates the uncertainty of the estimation
  • \(p_i\) : \(p\)-value - the probability that there’s no relationship between \(x_i\) and \(y_i\) ?

However

  • The above inferences are subject to a series of assumptions
  • It takes serious training to validate the coefficient estimations

Fortunately

  • We do not have too much problem, if we only care about the predicted value (\(\hat{y}\))


[B] Measures of Model Accuracy


§ B1 Measure Accuracy in the Training Data
pred =  predict(m1)                     # no new data is required
SSE = sum((pred - TR$Temp)^2)           # SSE.train
SST = sum((mean(TR$Temp) - TR$Temp)^2)  # SST.train
R2 = 1 - SSE/SST                        # R2.train
RMSE = sqrt(SSE/nrow(TR))               # RMSE.train
c(SSE, SST, R2, RMSE)
[1] 2.31302 9.28526 0.75089 0.09025


§ B2 Measure Accuracy in the Testing Data
pred =  predict(m1, TS)                 # need to provide new data
SSE = sum((pred - TS$Temp)^2)           # SSE.test
SST = sum((mean(TR$Temp) - TS$Temp)^2)  # SST.test
R2 = 1 - SSE/SST                        # R2.test
RMSE = sqrt(SSE/nrow(TS))               # RMSE.test
c(SSE, SST, R2, RMSE)
[1] 0.21835 0.58602 0.62741 0.09538

Note that \(SST_{test}\) reference to {y}{train} rather than {y}{test}

\[ SST_{test} = \sum_{i=1}^n(\bar{y}_{train} - y_{i,test})^2\] So, \(R_{test}^2\) might be less than zero.


🗿 Discussion:
  In the aspect of business, which of the following do we care most?
    ■ the coefficients (implying the effects)?
    ■ the \(p\)-values
    ■ the significance
    ■ anything else?



[C] The Coefficients as Random Variable

vx = c(3:5,7); cx = summary(m1)$coef  
par(cex=0.8)
plot(0,0,xlim=c(-0.04,0.02),ylim=c(0,800),pch=20,
     xlab="coefficients: the estimated value of b's",
     ylab='probability density',
     main='Probability Density Function of Coefficients')
abline(h=seq(0,800,100),v=seq(-0.04,0.03,0.0025),col='lightgray',lty=3)
for(i in vx) curve(dnorm(x,cx[i,1],cx[i,2]),add=T,col=i,lwd=2,n=1000)
abline(v=0,col='red')
legend("topleft",col=vx,lwd=2,legend=paste0(
  "b",vx," ( ",rownames(cx)[vx]," )", c("**","","。","***") ) )

summary(m1)$coef[vx,]
        Estimate Std. Error t value  Pr(>|t|)
CO2     0.006457  0.0022846  2.8264 0.0050525
CH4     0.000124  0.0005158  0.2405 0.8101456
N2O    -0.016528  0.0085649 -1.9297 0.0546693
CFC.12  0.003808  0.0010135  3.7573 0.0002097

🗿 Discussion:
  1.Accroding to the model summary, among CO2CH4N2OCFC12 :
    ■ which gas has the largest effect on temperature? Is it significant?
    ■ which gas has the most significant? Does it bear the largest effect?
    ■ can you explain the inconsistency in the above two question?




[D] Collinearity

Correlation among regressors might bias the estimation of coefficients

cor(TR[3:10]) %>% round(2)
           MEI   CO2   CH4   N2O CFC.11 CFC.12   TSI Aerosols
MEI       1.00 -0.04 -0.03 -0.05   0.07   0.01 -0.15     0.34
CO2      -0.04  1.00  0.88  0.98   0.51   0.85  0.18    -0.36
CH4      -0.03  0.88  1.00  0.90   0.78   0.96  0.25    -0.27
N2O      -0.05  0.98  0.90  1.00   0.52   0.87  0.20    -0.34
CFC.11    0.07  0.51  0.78  0.52   1.00   0.87  0.27    -0.04
CFC.12    0.01  0.85  0.96  0.87   0.87   1.00  0.26    -0.23
TSI      -0.15  0.18  0.25  0.20   0.27   0.26  1.00     0.05
Aerosols  0.34 -0.36 -0.27 -0.34  -0.04  -0.23  0.05     1.00
cor(TR[3:10]) %>% corrplot.mixed(tl.cex=0.7, tl.col='black')


[E] Adjust Model Spec.

Remove high correlated regressors help to avoid collinearity

m2 = lm(Temp~MEI+N2O+TSI+Aerosols,TR)  
summary(m2)

Call:
lm(formula = Temp ~ MEI + N2O + TSI + Aerosols, data = TR)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2792 -0.0598 -0.0060  0.0567  0.3419 

Coefficients:
              Estimate Std. Error t value         Pr(>|t|)    
(Intercept) -116.22686   20.22303   -5.75 0.00000002373584 ***
MEI            0.06419    0.00665    9.65          < 2e-16 ***
N2O            0.02532    0.00131   19.31          < 2e-16 ***
TSI            0.07949    0.01488    5.34 0.00000018937322 ***
Aerosols      -1.70174    0.21800   -7.81 0.00000000000012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0955 on 279 degrees of freedom
Multiple R-squared:  0.726, Adjusted R-squared:  0.722 
F-statistic:  185 on 4 and 279 DF,  p-value: <2e-16

R also have a tool that help to select the regressors automatically

m3 = step(m1)   
Start:  AIC=-1348
Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq  RSS   AIC
- CH4       1     0.000 2.31 -1350
<none>                  2.31 -1348
- N2O       1     0.031 2.34 -1346
- CO2       1     0.067 2.38 -1342
- CFC.12    1     0.119 2.43 -1336
- CFC.11    1     0.140 2.45 -1333
- TSI       1     0.335 2.65 -1312
- Aerosols  1     0.437 2.75 -1301
- MEI       1     0.828 3.14 -1263

Step:  AIC=-1350
Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols

           Df Sum of Sq  RSS   AIC
<none>                  2.31 -1350
- N2O       1     0.031 2.35 -1348
- CO2       1     0.067 2.38 -1344
- CFC.12    1     0.130 2.44 -1337
- CFC.11    1     0.139 2.45 -1335
- TSI       1     0.335 2.65 -1314
- Aerosols  1     0.440 2.75 -1303
- MEI       1     0.831 3.14 -1265
summary(m3)

Call:
lm(formula = Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + 
    Aerosols, data = TR)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2577 -0.0599 -0.0010  0.0559  0.3220 

Coefficients:
               Estimate  Std. Error t value        Pr(>|t|)    
(Intercept) -124.515178   19.850113   -6.27 0.0000000013651 ***
MEI            0.064068    0.006434    9.96         < 2e-16 ***
CO2            0.006401    0.002269    2.82          0.0051 ** 
N2O           -0.016021    0.008287   -1.93          0.0542 .  
CFC.11        -0.006609    0.001621   -4.08 0.0000595362634 ***
CFC.12         0.003868    0.000981    3.94          0.0001 ***
TSI            0.093116    0.014729    6.32 0.0000000010355 ***
Aerosols      -1.540206    0.212616   -7.24 0.0000000000044 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0916 on 276 degrees of freedom
Multiple R-squared:  0.751, Adjusted R-squared:  0.745 
F-statistic:  119 on 7 and 276 DF,  p-value: <2e-16