Lecture 28 Stepwise models
28.1 Automated Variable Selection Procedures
Automated variable selection procedures work along the following lines:
Choose a model to start with (often the model with no covariates, or the model with all covariates included).
Test to see if there is an advantage in adding or removing covariates.
Repeat adding/removing variables until there is no advantage in changing the model.
28.2 Forward Model Selection
In this technique:
All one-X variable models are computed, and the one which has the smallest p-value is chosen. (provided any of the simple linear regression models are significant at the 0.05% significance level, or whatever significance level is being used. )
Then all two-variable models which include the first chosen X-variable are computed. The variable whose regression coefficient is most significant (smallest p-value) is chosen (provided the p-value < 0.05)
Then all three X-variable models which include the two variables previously chosen are computed … We continue to add variables to the model until there are no more significant ones to add.
Then we stop.
Note the criterion for “significance” can be chosen by the user: e.g. One may choose to only include variables that have P-value < the usual significance level \(\alpha\)=0.05.
28.3 When to stop
Sometimes it is useful to consider a higher P-value for entry into the model for forward selection, such as P-value < \(\alpha\)= 0.25. This is because we don’t want to ignore any potentially useful variables. It is potentially better to first find the possible predictors, and then weed them out, than to not know about them.
For computing the P-value for entering a variable, we have two things we can look at. On the one hand we can look to see if the t-value for that variable’s regression coefficient is significant. (For example, when df=60, the t-value will be significant at \(\alpha=0.05\) if \(|t| > 2\)).
Equivalently, we since we are comparing two nested models, we can use the anova()
function (e.g. anova(\(model_1,model_2\)), and check if the F statistic is significant with P-value < 0.05.
The two approaches are equivalent since, when we add just one variable, \(F = t^2\) where \(F \sim F_{1,df}\).
28.4 Backwards Variable Selection
This starts with model containing all possible explanatory variables (and possibly interactions).
For each variable in turn, we investigate the effect of removing the variable (or interaction, if any) from current model.
Remove the least informative variable / interaction, unless this term is nonetheless supplying significant information about the response.
Go to step 2. Stop only if all variables/interactions in the current model are important.
28.5 Stepwise (direction =“both”)
The phrase Stepwise variable selection is usually taken to mean that we start with a forward selection, but after each variable is entered into the model we check whether any other variables have fallen below the significance criterion \(\alpha\), and if so we remove it from the model.
One approach is to use F tests (or equivalently t tests).
28.6 Example: Hours of sunshine backwards elimination
Use backwards elimination on a full model of Sun
from the climate
data.
## climate = read.csv("climate.csv", header = TRUE, row.names = 1)
climate.lm1 = lm(Sun ~ ., data = climate)
summary(climate.lm1)
Call:
lm(formula = Sun ~ ., data = climate)
Residuals:
Min 1Q Median 3Q Max
-252.41 -91.95 -2.43 94.71 254.49
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.703e+03 3.809e+03 -1.497 0.1459
Lat -5.104e+00 2.938e+01 -0.174 0.8634
Long 3.959e+01 1.947e+01 2.033 0.0520 .
MnJanTemp 7.200e+01 4.581e+01 1.572 0.1276
MnJlyTemp -1.887e+01 3.152e+01 -0.599 0.5544
Rain -8.373e-02 4.716e-02 -1.775 0.0871 .
Height 2.609e-01 2.833e-01 0.921 0.3652
Sea 2.064e+02 9.785e+01 2.109 0.0443 *
NorthIsland -1.856e+02 1.355e+02 -1.370 0.1820
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 158.7 on 27 degrees of freedom
Multiple R-squared: 0.5843, Adjusted R-squared: 0.4612
F-statistic: 4.745 on 8 and 27 DF, p-value: 0.001021
28.7
We would drop the variable with highest P-value, in this case Lat
. NB only drop one variable at a time. We can use the update()
function to save typing.
Call:
lm(formula = Sun ~ Long + MnJanTemp + MnJlyTemp + Rain + Height +
Sea + NorthIsland, data = climate)
Residuals:
Min 1Q Median 3Q Max
-254.409 -89.690 -3.238 90.960 252.261
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.048e+03 3.193e+03 -1.894 0.0686 .
Long 3.966e+01 1.913e+01 2.073 0.0475 *
MnJanTemp 7.767e+01 3.157e+01 2.460 0.0203 *
MnJlyTemp -1.671e+01 2.847e+01 -0.587 0.5619
Rain -8.055e-02 4.271e-02 -1.886 0.0697 .
Height 2.894e-01 2.266e-01 1.277 0.2120
Sea 2.078e+02 9.581e+01 2.169 0.0387 *
NorthIsland -1.777e+02 1.254e+02 -1.417 0.1674
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 155.9 on 28 degrees of freedom
Multiple R-squared: 0.5839, Adjusted R-squared: 0.4799
F-statistic: 5.613 on 7 and 28 DF, p-value: 0.000405
28.8
The highest P-value is for MnJlyTemp
, so we update to drop this one as well. Note instead of outputting the whole summary we can use the function drop1()
to inform us of our choices in terms of dropping further variables.
Single term deletions
Model:
Sun ~ Long + MnJanTemp + Rain + Height + Sea + NorthIsland
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 689322 368.96
Long 1 98319 787641 371.76 4.1363 0.05121 .
MnJanTemp 1 140484 829807 373.64 5.9102 0.02147 *
Rain 1 142964 832286 373.74 6.0145 0.02044 *
Height 1 76044 765366 370.73 3.1992 0.08413 .
Sea 1 125566 814888 372.98 5.2826 0.02894 *
NorthIsland 1 107455 796777 372.17 4.5207 0.04212 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F test indicates that Height
has p-value =0.08, so we can drop that one as well.
28.9
Single term deletions
Model:
Sun ~ Long + MnJanTemp + Rain + Sea + NorthIsland
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 765366 370.73
Long 1 129226 894592 374.34 5.0653 0.03190 *
MnJanTemp 1 91939 857305 372.81 3.6037 0.06731 .
Rain 1 102527 867893 373.25 4.0188 0.05409 .
Sea 1 54835 820201 371.22 2.1494 0.15303
NorthIsland 1 106633 871999 373.42 4.1797 0.04977 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we are told that Sea
has p-value= 0.15, so we drop this as well.
28.10
Single term deletions
Model:
Sun ~ Long + MnJanTemp + Rain + NorthIsland
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 820201 371.22
Long 1 124840 945040 374.32 4.7184 0.037609 *
MnJanTemp 1 209006 1029207 377.39 7.8995 0.008493 **
Rain 1 74019 894219 372.33 2.7976 0.104471
NorthIsland 1 165899 986099 375.85 6.2702 0.017751 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally we drop Rain
, which has p-value= 0.10.
28.11
Call:
lm(formula = Sun ~ Long + MnJanTemp + NorthIsland, data = climate)
Residuals:
Min 1Q Median 3Q Max
-349.06 -90.02 -15.92 98.84 392.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7504.65 3290.81 -2.280 0.02938 *
Long 47.33 19.83 2.387 0.02308 *
MnJanTemp 85.06 26.08 3.262 0.00263 **
NorthIsland -296.09 105.04 -2.819 0.00820 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 167.2 on 32 degrees of freedom
Multiple R-squared: 0.4535, Adjusted R-squared: 0.4023
F-statistic: 8.853 on 3 and 32 DF, p-value: 0.0002034
We see that annual hours of Sunshine are greater in places with high Longitude (i.e. eastern side of NZ), high mean January temperature and, perhaps surprisingly that hours are apparently lower in the North Island (but note NorthIsland
and Longitude
are correlated so this “apparently” may not be correct. )
28.12 Forward selection
We try using forward selection instead. Do we finish with the same model?
climate.null = lm(Sun ~ 1, data = climate)
add1(climate.null, scope = ~. - Lat + Long + MnJanTemp + MnJlyTemp + Height + Rain +
Sea + NorthIsland, test = "F")
Single term additions
Model:
Sun ~ 1
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 1636408 388.08
Long 1 344588 1291820 381.57 9.0694 0.004877 **
MnJanTemp 1 506057 1130350 376.76 15.2218 0.000429 ***
MnJlyTemp 1 182734 1453674 385.82 4.2740 0.046380 *
Height 1 90941 1545467 388.02 2.0007 0.166321
Rain 1 316258 1320150 382.35 8.1451 0.007302 **
Sea 1 148252 1488156 386.66 3.3871 0.074450 .
NorthIsland 1 95788 1540620 387.91 2.1140 0.155131
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
28.13
climate.s1 = update(climate.null, . ~ . + MnJanTemp)
add1(climate.s1, scope = ~Lat + Long + MnJanTemp + MnJlyTemp + Height + Rain + Sea +
NorthIsland, test = "F")
Single term additions
Model:
Sun ~ MnJanTemp
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 1130350 376.76
Lat 1 43712 1086639 377.34 1.3275 0.25753
Long 1 14102 1116248 378.31 0.4169 0.52295
MnJlyTemp 1 22898 1107452 378.03 0.6823 0.41472
Height 1 40 1130310 378.76 0.0012 0.97294
Rain 1 132170 998181 374.29 4.3696 0.04437 *
Sea 1 50284 1080066 377.12 1.5364 0.22390
NorthIsland 1 76944 1053407 376.22 2.4104 0.13007
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
28.14
Call:
lm(formula = Sun ~ MnJanTemp + Rain, data = climate)
Residuals:
Min 1Q Median 3Q Max
-353.07 -84.98 -30.65 86.74 383.28
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1032.83976 343.96323 3.003 0.00507 **
MnJanTemp 62.40295 19.12695 3.263 0.00257 **
Rain -0.08159 0.03903 -2.090 0.04437 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 173.9 on 33 degrees of freedom
Multiple R-squared: 0.39, Adjusted R-squared: 0.353
F-statistic: 10.55 on 2 and 33 DF, p-value: 0.0002869
The model chosen by forward selection is entirely different to the model by backwards elimination.
28.15
Sometimes the models are the same, but in this example they are not the same.
The reason for the difference is that sometimes two variables are both significant when together in a regression, but are not significant enough to enter one-at-a-time in forwards selection. That is why backwards elimination is usually the better approach.
28.16 Model Selection using AIC
An alternative approach to selecting a good model is to define a measure of the quality of a model, then choose the model of the highest quality.
The so called information criteria can be used as measures of model quality.
We will now examine information criteria in general, the Akaike information criterion (AIC) in particular, and consider their use in model selection.
28.17 Information Criteria
A good regression model (i.e. with a good choice of predictors) will:
- fit the data well;
- be simple (have as few predictors as possible).
Hence quality of a model can be measured by a quantity of the form \[\mbox{Measure of badness of fit} + k \times \mbox{Number of predictors} + \mbox{constant}\]
where k can be adjusted to control the relative importance of the two aspects of a model’s quality just given.
Note that any additive constant is of no real importance, since in comparing the quality of different models (our real aim) the constant term cancels out.
28.18 Akaike Information Criterion
Several information criteria in statistics are of the form of our simple equation for a “measure of badness of fit” given above.
One such measure is the Akaike Information Criterion (AIC). For a linear model with unknown error variance this is defined by \[AIC = n \log (RSS/n) + 2p + \mbox{constant}.\]
We can choose between models by selecting the one with smaller AIC.
N.B. This is not the only criterion in common use, but is perhaps the most common.
28.19 Back to the Climate data
We have found two different models, climate.lm6
and climate.s2
. These not only have different numbers of variables, but are not even nested! However we can compare them using the adjusted \(R^2\), or the function AIC()
.
Both criteria suggest that the model lm6
is to be preferred.
[1] 0.4023174
[1] 0.3530486
[1] 476.4903
[1] 478.4497
28.20
An alternative to AIC()
is the extractAIC()
command. The output from extractAIC()
command is number of regression parameters (p+1) and the model’s AIC.
[1] 4.0000 372.3268
[1] 3.0000 374.2861
28.21 AIC in Stepwise Variable Selection
We have seen that sequential variable selection can be conducted using a sequence of F tests.
An alternative is to compare models using AIC.
Specifically, at each step of the algorithm we change to the model with smallest AIC (from adding or dropping a single variable), and stop only when we cannot reduce AIC by single term additions or deletions.
This procedure is implemented using the step()
command in R.
28.22
Start: AIC=388.08
Sun ~ 1
Df Sum of Sq RSS AIC
+ MnJanTemp 1 506057 1130350 376.76
+ Long 1 344588 1291820 381.57
+ Rain 1 316258 1320150 382.35
+ Lat 1 220053 1416355 384.88
+ MnJlyTemp 1 182734 1453674 385.82
+ Sea 1 148252 1488156 386.66
+ NorthIsland 1 95788 1540620 387.91
+ Height 1 90941 1545467 388.02
<none> 1636408 388.08
Step: AIC=376.76
Sun ~ MnJanTemp
Df Sum of Sq RSS AIC
+ Rain 1 132170 998181 374.29
+ NorthIsland 1 76944 1053407 376.22
<none> 1130350 376.76
+ Sea 1 50284 1080066 377.12
+ Lat 1 43712 1086639 377.34
+ MnJlyTemp 1 22898 1107452 378.03
+ Long 1 14102 1116248 378.31
+ Height 1 40 1130310 378.76
- MnJanTemp 1 506057 1636408 388.08
Step: AIC=374.29
Sun ~ MnJanTemp + Rain
Df Sum of Sq RSS AIC
+ Sea 1 87195 910985 373.00
<none> 998181 374.29
+ NorthIsland 1 53140 945040 374.32
+ Long 1 12081 986099 375.85
+ Height 1 1193 996988 376.24
+ Lat 1 330 997850 376.27
+ MnJlyTemp 1 16 998165 376.29
- Rain 1 132170 1130350 376.76
- MnJanTemp 1 321969 1320150 382.35
Step: AIC=373
Sun ~ MnJanTemp + Rain + Sea
Df Sum of Sq RSS AIC
+ Height 1 94788 816197 371.04
+ MnJlyTemp 1 52902 858083 372.84
<none> 910985 373.00
+ Long 1 38986 871999 373.42
- Sea 1 87195 998181 374.29
+ NorthIsland 1 16393 894592 374.34
+ Lat 1 1945 909040 374.92
- Rain 1 169081 1080066 377.12
- MnJanTemp 1 214016 1125001 378.59
Step: AIC=371.04
Sun ~ MnJanTemp + Rain + Sea + Height
Df Sum of Sq RSS AIC
<none> 816197 371.04
+ NorthIsland 1 28556 787641 371.76
+ MnJlyTemp 1 22773 793424 372.02
+ Long 1 19420 796777 372.17
+ Lat 1 10788 805409 372.56
- Height 1 94788 910985 373.00
- Sea 1 180791 996988 376.24
- Rain 1 226572 1042769 377.86
- MnJanTemp 1 284136 1100333 379.79
28.23
Call:
lm(formula = Sun ~ MnJanTemp + Rain + Sea + Height, data = climate)
Residuals:
Min 1Q Median 3Q Max
-268.06 -104.31 1.62 90.67 344.42
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 872.1199 356.7010 2.445 0.02037 *
MnJanTemp 63.9085 19.4541 3.285 0.00253 **
Rain -0.1123 0.0383 -2.934 0.00625 **
Sea 203.2272 77.5551 2.620 0.01348 *
Height 0.3860 0.2034 1.897 0.06712 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 162.3 on 31 degrees of freedom
Multiple R-squared: 0.5012, Adjusted R-squared: 0.4369
F-statistic: 7.788 on 4 and 31 DF, p-value: 0.0001822
Note AIC is more liberal than the F test, and has allowed a non-significant variable in.
28.24 Backwards elimination via step() and AIC
The syntax below shows how to do a backwards elimination. This also results in including a non-significant variable. However this has higher adjusted R-square.
Start: AIC=372.48
Sun ~ Lat + Long + MnJanTemp + MnJlyTemp + Rain + Height + Sea +
NorthIsland
Df Sum of Sq RSS AIC
- Lat 1 760 680941 370.52
- MnJlyTemp 1 9027 689208 370.95
- Height 1 21370 701551 371.59
<none> 680181 372.48
- NorthIsland 1 47272 727453 372.90
- MnJanTemp 1 62239 742420 373.63
- Rain 1 79398 759579 374.45
- Long 1 104127 784308 375.61
- Sea 1 112082 792263 375.97
Step: AIC=370.52
Sun ~ Long + MnJanTemp + MnJlyTemp + Rain + Height + Sea + NorthIsland
Df Sum of Sq RSS AIC
- MnJlyTemp 1 8381 689322 368.96
<none> 680941 370.52
- Height 1 39668 720609 370.56
- NorthIsland 1 48846 729788 371.01
- Rain 1 86487 767428 372.82
- Long 1 104518 785460 373.66
- Sea 1 114410 795351 374.11
- MnJanTemp 1 147165 828106 375.56
Step: AIC=368.96
Sun ~ Long + MnJanTemp + Rain + Height + Sea + NorthIsland
Df Sum of Sq RSS AIC
<none> 689322 368.96
- Height 1 76044 765366 370.73
- Long 1 98319 787641 371.76
- NorthIsland 1 107455 796777 372.17
- Sea 1 125566 814888 372.98
- MnJanTemp 1 140484 829807 373.64
- Rain 1 142964 832286 373.74
28.25
Call:
lm(formula = Sun ~ Long + MnJanTemp + Rain + Height + Sea + NorthIsland,
data = climate)
Residuals:
Min 1Q Median 3Q Max
-245.527 -94.201 -9.587 98.988 253.293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.743e+03 3.114e+03 -1.844 0.0754 .
Long 3.808e+01 1.872e+01 2.034 0.0512 .
MnJanTemp 7.138e+01 2.936e+01 2.431 0.0215 *
Rain -9.205e-02 3.753e-02 -2.452 0.0204 *
Height 3.527e-01 1.972e-01 1.789 0.0841 .
Sea 1.740e+02 7.571e+01 2.298 0.0289 *
NorthIsland -2.188e+02 1.029e+02 -2.126 0.0421 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 154.2 on 29 degrees of freedom
Multiple R-squared: 0.5788, Adjusted R-squared: 0.4916
F-statistic: 6.641 on 6 and 29 DF, p-value: 0.0001691
Question: Which of these models is preferred?