Download the R markdown file for this lecture.
We have now considered two types of testing problems in multiple linear regression: the omnibus F-test of whether all the regression coefficients are simultaneously zero, and t-tests of the importance of individual explanatory variables.
In this lecture, we investigate how to test the importance of a group of covariates simultaneously.
A linear model M0 is said to be nested within another model M1 if M0 can be recovered as a special case of M1 by fixing some of the parameters of M1 at particular values (typically zero).
For the paramo data, define model M1 by
\[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL} + \beta_3 \mbox{DEc} + \beta_4 \mbox{DNI}\]
Then the model M0 defined by
\[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_3 \mbox{DEc}\] is nested within M1 because we can obtain M0 by setting \(\beta_2 = \beta_4 = 0\) in M1.
The general ideas about model comparison from lecture 11 continue to apply.
We want a “cheap” model (i.e. one with few parameters);
We want a model that fits well (i.e. one with small RSS).
The choice of model is a question of whether the improvement in goodness of fit of the more complex model M1 over the simpler model M0 is worth the cost.
Selecting a model is equivalent to testing hypotheses about the parameters of M1 (the more complex model).
Suppose that linear model M0 has q explanatory variables (hence q+1 regression parameters, including intercept).
Suppose that M0 is nested within linear model M1 which has p > q explanatory variables (hence p+1 regression parameters, including intercept).
Comparison of the models can be achieved by testing
H0: \(\beta_j = 0\) for all \(j \in J\) (i.e. M0 adequate); versus
H1: \(\beta_j \ne 0\) for at least one \(j \in J\) (i.e. M1 better)
where J indexes the p-q variables that appear in M1 but not M0.
The F-statistic to test these hypotheses is
\[F = \frac{[RSS_{M0} - RSS_{M1}]/(p-q)}{RSS_{M1}/(n-p-1)}\]
As before, large values of F provide evidence against H0 (and hence evidence that we should prefer model M1 to M0).
If H0 is correct then F has an F distribution with \(p-q\) and \(n-p-1\) degrees of freedom.
Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{p-q,n-p-1}\) distribution, then the P-value is given by \[P = P(X \ge f_{obs})\]
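This calculation is easy to carry out in R. The following is a minimal sketch, assuming fit0 and fit1 are nested lm objects with fit0 the smaller model; the function name nested.F.test is hypothetical, and the built-in anova() function (used later in this lecture) does the same job.

nested.F.test <- function(fit0, fit1) {
    rss0 <- sum(resid(fit0)^2)  # RSS for the smaller model M0
    rss1 <- sum(resid(fit1)^2)  # RSS for the larger model M1
    df.num <- df.residual(fit0) - df.residual(fit1)  # p - q
    df.den <- df.residual(fit1)  # n - p - 1
    F.stat <- ((rss0 - rss1)/df.num)/(rss1/df.den)
    P <- pf(F.stat, df.num, df.den, lower.tail = FALSE)  # right-hand tail probability
    c(F = F.stat, P = P)
}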
As we have seen earlier, the importance of variables in a multiple regression is dependent upon context (i.e. what other variables are taken into account).
Retention of H0 (i.e. acceptance that model M0 is adequate) does not mean that the p-q variables indexed by J are unrelated to the response.
Rather, retention of H0 indicates that the variables indexed by J do not provide additional information about the response having adjusted for all the other (q) variables.
Suppose (for the purposes of argument) that a scientist theorizes that EL and DNI provide no further information about N once we have adjusted for AR and DEc.
Define M1 by \[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL} + \beta_3 \mbox{DEc} + \beta_4 \mbox{DNI}\] and M0 by \[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_3 \mbox{DEc}\] as in the previous example.
Testing the scientist’s theory requires that we test H0: \[\beta_2 = \beta_4 = 0~~~~\mbox{versus}~~~~~H_1: \beta_2, \beta_4 \mbox{ not both zero.}\]
Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv",
    header = TRUE, row.names = 1)
Paramo.lm0 <- lm(N ~ AR + DEc, data = Paramo)
anova(Paramo.lm0)
Analysis of Variance Table
Response: N
Df Sum Sq Mean Sq F value Pr(>F)
AR 1 508.92 508.92 12.937 0.004193 **
DEc 1 557.23 557.23 14.165 0.003134 **
Residuals 11 432.71 39.34
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Paramo.lm1 <- lm(N ~ AR + EL + DEc + DNI, data = Paramo)
anova(Paramo.lm1)
Analysis of Variance Table
Response: N
Df Sum Sq Mean Sq F value Pr(>F)
AR 1 508.92 508.92 11.3208 0.008328 **
EL 1 45.90 45.90 1.0211 0.338661
DEc 1 537.39 537.39 11.9541 0.007189 **
DNI 1 2.06 2.06 0.0457 0.835412
Residuals 9 404.59 44.95
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For model M0, calculations give \(RSS_{M0} = 432.7\) and q = 2.
For model M1, calculations give \(RSS_{M1} = 404.6\) and p = 4.
Hence the F-statistic is \[F = \frac{[RSS_{M0} - RSS_{M1}]/(p-q)}{RSS_{M1}/(n-p-1)} = \frac{[432.7 - 404.6]/(4-2)}{404.6/9} = 0.31\] (here n = 14, so n - p - 1 = 9).
The corresponding P-value is the right-hand tail probability: \[P(X \ge 0.31) = 0.74~~~~~~~~~~\mbox{where $X \sim F_{2,9}$}\]
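These figures are easily checked in R, plugging in the RSS values from the anova tables above:

f.obs <- ((432.71 - 404.59)/(4 - 2))/(404.59/9)  # F-statistic: 0.3128
pf(f.obs, df1 = 2, df2 = 9, lower.tail = FALSE)  # P-value: 0.7391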
We conclude that H0 can be retained (i.e. model M0 is adequate). There is no evidence that EL and DNI provide further information about N once we have adjusted for AR and DEc.
anova(Paramo.lm0, Paramo.lm1)
Analysis of Variance Table
Model 1: N ~ AR + DEc
Model 2: N ~ AR + EL + DEc + DNI
Res.Df RSS Df Sum of Sq F Pr(>F)
1 11 432.71
2 9 404.59 2 28.12 0.3128 0.7391
The omnibus F test of model fit discussed in lecture 10 is a particular case of the more general methodology for model comparison presented in this lecture.
Specifically, in the omnibus F test we always take M0 to be the null model with q = 0 explanatory variables, and M1 to be the full model with all p explanatory variables.
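For example, the omnibus test for the paramo data can be reproduced as a nested-model comparison against the intercept-only (null) model; Paramo.lm.null is a hypothetical name for this fit.

Paramo.lm.null <- lm(N ~ 1, data = Paramo)  # null model: intercept only (q = 0)
anova(Paramo.lm.null, Paramo.lm1)  # reproduces the omnibus F test for the full model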
F tests can be used to compare two models that differ by a single explanatory variable.
A t-test can also be used to assess the importance of any given explanatory variable.
There is a connection between the two approaches. Suppose we test H0: \[\beta_j = 0~~~~\mbox{versus}~~~~H_1:\beta_j \ne 0\] having adjusted for certain other variables. If the observed t-test statistic is \(t_{obs}\) and the observed F-test statistic is \(f_{obs}\), then \(f_{obs} = t_{obs}^2\).
The P-values from the two tests will be identical.
Suppose that we wish to test for the importance of EL having adjusted for AR (only).
Consider model M1 given by \[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL}\] and M0 by \[E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR}\]
We want to test H0: \[\beta_2 = 0~~~~\mbox{versus}~~~~H_1:\beta_2 \ne 0\] which is equivalent to comparing models M0 and M1.
We will perform both the t-test and the F-test using R.
Paramo.lm2 <- lm(N ~ AR, data = Paramo)
Paramo.lm3 <- lm(N ~ AR + EL, data = Paramo)
anova(Paramo.lm2, Paramo.lm3)
Analysis of Variance Table
Model 1: N ~ AR
Model 2: N ~ AR + EL
Res.Df RSS Df Sum of Sq F Pr(>F)
1 12 989.94
2 11 944.04 1 45.901 0.5348 0.4799
The F-statistic (from the anova() function output) is f = 0.5348 with corresponding P-value P=0.4799.
The t-statistic (from the summary() function output) is t=0.731 with corresponding P-value P=0.4799.
Note that \(f = 0.5348 = 0.731^2 = t^2\), up to rounding of the printed values.
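As a final check, this relationship can be verified in R using the models fitted above; the printed statistics agree with the exact identity up to rounding.

t.obs <- coef(summary(Paramo.lm3))["EL", "t value"]  # t-statistic for EL
f.obs <- anova(Paramo.lm2, Paramo.lm3)$F[2]  # F-statistic for the model comparison
t.obs^2  # identical to f.obs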