Download the R markdown file for this lecture.



In previous lectures we have looked at tests for single explanatory variables and groups of explanatory variables.

Sometimes we will have reasons for wanting to test hypotheses about a given group of covariates.

Often, however, a statistician will simply want to know what is the best group of explanatory variables for describing the response.

In this lecture we will investigate automatic variable selection methodologies.

Backwards, Forwards and Stepwise Variable Selection

Automated variable selection procedures work along the following lines:

  1. Choose a model to start with (often the model with no covariates, or the model with all covariates included).

  2. Test to see if there is an advantage in adding or removing covariates.

  3. Repeat adding/removing variables until there is no advantage in changing the model.

Backwards Variable Selection

  1. Start with model containing all possible explanatory variables.
  2. For each variable in turn, investigate effect of removing variable from current model.
  3. Remove the least informative variable, unless this variable is nonetheless supplying significant information about the response.
  4. Go to step 2. Stop only if all variables in the current model are important.

The precise implementation depends on how we assess the importance of variables.

One approach is to use F tests (or equivalently t tests).
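The equivalence is exact for any term with a single degree of freedom: the F statistic reported by drop1() is the square of the corresponding t statistic from summary(). A minimal sketch of this, using the built-in mtcars data rather than the cheese data analysed below:

```r
# For a single-df term, drop1()'s F statistic equals the squared
# t statistic from summary(). Illustrated on the built-in mtcars data.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
tval <- summary(fit)$coefficients["hp", "t value"]
Fval <- drop1(fit, test = "F")["hp", "F value"]
all.equal(Fval, tval^2)   # TRUE; the p-values agree as well
```

This is why removing one variable at a time can equivalently be judged from the t tests in the model summary.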

Example: Analysis of Cheese Tasting Data

Data on the production of cheddar cheese from the LaTrobe Valley of Victoria, Australia. The taste of the final product is related to the concentration of several chemicals in the cheese. 30 samples of cheese were tasted by experts, and the following variables were recorded for each sample: taste (the taste score assigned by the panel) and the chemical concentrations Acetic, H2S and Lactic.

Cheese <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/cheese.csv",
    header = TRUE)
str(Cheese)
'data.frame':   30 obs. of  4 variables:
 $ taste : num  12.3 20.9 39 47.9 5.6 25.9 37.3 21.9 18.1 21 ...
 $ Acetic: num  4.54 5.16 5.37 5.76 4.66 ...
 $ H2S   : num  3.13 5.04 5.44 7.5 3.81 ...
 $ Lactic: num  0.86 1.53 1.57 1.81 0.99 1.09 1.29 1.78 1.29 1.58 ...
head(Cheese)
  taste Acetic   H2S Lactic
1  12.3  4.543 3.135   0.86
2  20.9  5.159 5.043   1.53
3  39.0  5.366 5.438   1.57
4  47.9  5.759 7.496   1.81
5   5.6  4.663 3.807   0.99
6  25.9  5.697 7.601   1.09

Download cheese.csv

It is of interest to the manufacturers to relate the cheese's taste to the 'chemical' variables.

We therefore construct a multiple linear regression model of taste on the other variables.

Variable selection will allow us to produce a parsimonious model.

Backwards variable selection starts with the full model (i.e. with all predictors):

Cheese.lm.full <- lm(taste ~ ., data = Cheese)

Consider the effect of dropping each covariate using drop1():

drop1(Cheese.lm.full, test = "F")
Single term deletions

Model:
taste ~ Acetic + H2S + Lactic
       Df Sum of Sq    RSS    AIC F value   Pr(>F)   
<none>              2668.4 142.64                    
Acetic  1      0.55 2669.0 140.65  0.0054 0.941980   
H2S     1   1007.66 3676.1 150.25  9.8182 0.004247 **
Lactic  1    533.32 3201.7 146.11  5.1964 0.031079 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The R output tells us that Acetic is the least informative covariate (F = 0.0054, P = 0.942): it supplies no significant information about taste once H2S and Lactic are in the model. H2S and Lactic, by contrast, are both significant and should be retained.

We can create a new model without using Acetic:

Cheese.lm.A <- update(Cheese.lm.full, . ~ . - Acetic, data = Cheese)
summary(Cheese.lm.A)

Call:
lm(formula = taste ~ H2S + Lactic, data = Cheese)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.343  -6.530  -1.164   4.844  25.618 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -27.592      8.982  -3.072  0.00481 **
H2S            3.946      1.136   3.475  0.00174 **
Lactic        19.887      7.959   2.499  0.01885 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared:  0.6517,    Adjusted R-squared:  0.6259 
F-statistic: 25.26 on 2 and 27 DF,  p-value: 6.551e-07

The general syntax for updating models is

update(old.model, new.formula)

Note that full stops in the updated formula stand for whatever was in the corresponding position in the old formula.
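As a minimal illustration of this syntax (using the built-in mtcars data, not the cheese data):

```r
# update() refits a model with a modified formula; the full stops stand
# for the response and covariates of the old model.
fit.full <- lm(mpg ~ wt + hp + disp, data = mtcars)
fit.red  <- update(fit.full, . ~ . - disp)  # same model, minus disp
formula(fit.red)                            # mpg ~ wt + hp
```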

Look at the effect of dropping a single variable from Cheese.lm.A:

drop1(Cheese.lm.A, test = "F")
Single term deletions

Model:
taste ~ H2S + Lactic
       Df Sum of Sq    RSS    AIC F value   Pr(>F)   
<none>              2669.0 140.65                    
H2S     1   1193.52 3862.5 149.74 12.0740 0.001743 **
Lactic  1    617.18 3286.1 144.89  6.2435 0.018850 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both remaining variables are significant, so backwards selection stops here: the selected model is taste ~ H2S + Lactic.

Stepwise Variable Selection

  1. Start with some model, typically null model (with no explanatory variables) or full model (with all variables).
  2. For each variable currently in the model, investigate effect of removing it.
  3. Remove the least informative variable, unless this variable is nonetheless supplying significant information about the response.
  4. For each variable not currently in the model, investigate effect of including it.
  5. Include the most statistically significant variable not currently in the model (unless no significant variable exists).
  6. Go to step 2. Stop only if no change in steps 2–5.

Implementing Stepwise Variable Selection

Stepwise variable selection using F tests can be implemented in R using a combination of add1() and drop1() commands.
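For routine use, base R's step() function automates the whole add/drop cycle, although it compares models by AIC rather than by F tests, so it may not always select the same model as the manual procedure below. A sketch on the built-in mtcars data:

```r
# step() automates stepwise selection, scoring candidate models by AIC.
# direction = "both" allows terms to be both added and dropped.
null.model <- lm(mpg ~ 1, data = mtcars)
best <- step(null.model,
             scope = ~ wt + hp + disp,
             direction = "both",
             trace = 0)   # trace = 1 prints each add/drop step
formula(best)
```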

Return to Example

We will perform stepwise variable selection starting from the null model (i.e. the model with no explanatory variables).

Cheese.lm.null <- lm(taste ~ 1, data = Cheese)
summary(Cheese.lm.null)

Call:
lm(formula = taste ~ 1, data = Cheese)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.833 -10.983  -3.583  12.167  32.667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.533      2.968   8.266  4.1e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.26 on 29 degrees of freedom

Now consider adding single terms into the model.

add1(Cheese.lm.null, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions

Model:
taste ~ 1
       Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>              7662.9 168.29                      
Acetic  1    2314.1 5348.7 159.50  12.114  0.001658 ** 
H2S     1    4376.7 3286.1 144.89  37.293 1.374e-06 ***
Lactic  1    3800.4 3862.5 149.74  27.550 1.405e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • H2S is the most significant term, so it is added to the model.
  • There is no need to try dropping terms at this stage, since the model contains only one term.

Cheese.lm.B <- update(Cheese.lm.null, . ~ . + H2S, data = Cheese)
add1(Cheese.lm.B, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions

Model:
taste ~ H2S
       Df Sum of Sq    RSS    AIC F value  Pr(>F)  
<none>              3286.1 144.89                  
Acetic  1     84.41 3201.7 146.11  0.7118 0.40625  
Lactic  1    617.18 2669.0 140.65  6.2435 0.01885 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We should add Lactic to the model.

Cheese.lm.C <- update(Cheese.lm.B, . ~ . + Lactic, data = Cheese)
drop1(Cheese.lm.C, test = "F")
Single term deletions

Model:
taste ~ H2S + Lactic
       Df Sum of Sq    RSS    AIC F value   Pr(>F)   
<none>              2669.0 140.65                    
H2S     1   1193.52 3862.5 149.74 12.0740 0.001743 **
Lactic  1    617.18 3286.1 144.89  6.2435 0.018850 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No terms should be deleted from the model.

add1(Cheese.lm.C, scope = ~Acetic + H2S + Lactic, data = Cheese, test = "F")
Single term additions

Model:
taste ~ H2S + Lactic
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              2669.0 140.65               
Acetic  1   0.55427 2668.4 142.64  0.0054  0.942

No need to add any more terms to Cheese.lm.C.

So the best model as selected by the stepwise procedure is \[E[ \mbox{taste}] = -27.59 + 3.95 \, \mbox{H2S} + 19.89 \, \mbox{Lactic}.\]

This is the same model as was selected by backwards variable selection.

N.B. Stepwise and backwards variable selection will sometimes generate the same model, but this will not always be the case.