Download the R markdown file for this lecture.
In previous lectures we have looked at tests for single explanatory variables and groups of explanatory variables.
Sometimes we will have reasons for wanting to test hypotheses about a given group of covariates.
Often, however, a statistician will simply want to know what is the best group of explanatory variables for describing the response.
In this lecture we will investigate automatic variable selection methodologies.
Automated variable selection procedures work along the following lines:
Choose a model to start with (often the model with no covariates, or the model with all covariates included).
Test to see if there is an advantage in adding or removing covariates.
Repeat adding/removing variables until there is no advantage in changing the model.
The precise implementation depends on how we assess the importance of variables.
One approach is to use F tests (or equivalently t tests).
Data on production of cheddar cheese from the LaTrobe Valley of Victoria, Australia. Taste of the final product is related to the concentration of several chemicals in the cheese. 30 samples of cheese were tasted by experts, and the following variables recorded for them:
## Cheese <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/cheese.csv",
## header = TRUE)
str(Cheese)
'data.frame': 30 obs. of 4 variables:
$ taste : num 12.3 20.9 39 47.9 5.6 25.9 37.3 21.9 18.1 21 ...
$ Acetic: num 4.54 5.16 5.37 5.76 4.66 ...
$ H2S : num 3.13 5.04 5.44 7.5 3.81 ...
$ Lactic: num 0.86 1.53 1.57 1.81 0.99 1.09 1.29 1.78 1.29 1.58 ...
head(Cheese)
taste Acetic H2S Lactic
1 12.3 4.543 3.135 0.86
2 20.9 5.159 5.043 1.53
3 39.0 5.366 5.438 1.57
4 47.9 5.759 7.496 1.81
5 5.6 4.663 3.807 0.99
6 25.9 5.697 7.601 1.09
Of interest to the manufacturers to relate the cheese’s taste to the ‘chemical’ variables.
Therefore construct multiple linear regression model of taste
on other variables.
Variable selection will allow us to produce a parsimonious model.
Backwards variable selection starts with the full model (i.e. with all predictors):
<- lm(taste ~ ., data = Cheese) Cheese.lm.full
Consider effect of dropping each covariate using :
drop1(Cheese.lm.full, test = "F")
Single term deletions
Model:
taste ~ Acetic + H2S + Lactic
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2668.4 142.64
Acetic 1 0.55 2669.0 140.65 0.0054 0.941980
H2S 1 1007.66 3676.1 150.25 9.8182 0.004247 **
Lactic 1 533.32 3201.7 146.11 5.1964 0.031079 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R output tells us that
H2S
and Lactic
should not be dropped, since the models without these terms have a considerably worse fit than the full model (as evidenced by the p-values of 0.004 and 0.031 respectively).Acetic
from the model makes little difference in terms of model fit (p-value of 0.942 in comparison with the full model), so we should omit this variable.We can create a new model without using Acetic
:
<- update(Cheese.lm.full, . ~ . - Acetic, data = Cheese)
Cheese.lm.A summary(Cheese.lm.A)
Call:
lm(formula = taste ~ H2S + Lactic, data = Cheese)
Residuals:
Min 1Q Median 3Q Max
-17.343 -6.530 -1.164 4.844 25.618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.592 8.982 -3.072 0.00481 **
H2S 3.946 1.136 3.475 0.00174 **
Lactic 19.887 7.959 2.499 0.01885 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared: 0.6517, Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF, p-value: 6.551e-07
The general syntax for updating models is
update(old.model, new.formula)
Note that full stops in the updated formula stand for whatever was in the corresponding position in the old formula.
Look at effect of dropping single variable from :
drop1(Cheese.lm.A, test = "F")
Single term deletions
Model:
taste ~ H2S + Lactic
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2669.0 140.65
H2S 1 1193.52 3862.5 149.74 12.0740 0.001743 **
Lactic 1 617.18 3286.1 144.89 6.2435 0.018850 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H2S
nor Lactic
can be removed without an important loss of fit.Stepwise variable selection using F tests can be implemented in R using a combination of add1()
and drop1()
commands.
drop1()
in action when doing backwards variable selection.add1()
function tells us about the effects of adding each variable not currently in the model.We will perform stepwise variable selection starting from the null model (i.e. the model with no explanatory variables).
<- lm(taste ~ 1, data = Cheese)
Cheese.lm.null summary(Cheese.lm.null)
Call:
lm(formula = taste ~ 1, data = Cheese)
Residuals:
Min 1Q Median 3Q Max
-23.833 -10.983 -3.583 12.167 32.667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.533 2.968 8.266 4.1e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.26 on 29 degrees of freedom
Now consider adding single terms into the model.
add1(Cheese.lm.null, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions
Model:
taste ~ 1
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 7662.9 168.29
Acetic 1 2314.1 5348.7 159.50 12.114 0.001658 **
H2S 1 4376.7 3286.1 144.89 37.293 1.374e-06 ***
Lactic 1 3800.4 3862.5 149.74 27.550 1.405e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H2S
is included as most significant.<- update(Cheese.lm.null, . ~ . + H2S, data = Cheese)
Cheese.lm.B add1(Cheese.lm.B, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions
Model:
taste ~ H2S
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 3286.1 144.89
Acetic 1 84.41 3201.7 146.11 0.7118 0.40625
Lactic 1 617.18 2669.0 140.65 6.2435 0.01885 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We should add Lactic
to the model.
<- update(Cheese.lm.B, . ~ . + Lactic, data = Cheese)
Cheese.lm.C drop1(Cheese.lm.C, test = "F")
Single term deletions
Model:
taste ~ H2S + Lactic
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2669.0 140.65
H2S 1 1193.52 3862.5 149.74 12.0740 0.001743 **
Lactic 1 617.18 3286.1 144.89 6.2435 0.018850 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
No terms should be deleted from the model.
add1(Cheese.lm.C, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions
Model:
taste ~ H2S + Lactic
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2669.0 140.65
Acetic 1 0.55427 2668.4 142.64 0.0054 0.942
No need to add any more terms to Cheese.lm.C
.
So the best model as selected by the stepwise procedure is \[E[ \mbox{taste}] = -27.59 + 3.95 \, \mbox{H2S} + 19.89 \, \mbox{Lactic}.\]
This is the same model as was selected by backwards variable selection.
N.B. Stepwise and backwards variable selection will sometimes generate the same model, but this will not always be the case.