Lecture 30 Variable Selection

In class version

In previous lectures we have looked at tests for single explanatory variables and groups of explanatory variables.

Often, however, a statistician will simply want to know which group of explanatory variables best describes the response.

In this lecture we will consider various criteria for model selection, and investigate automatic variable selection methodologies.

30.1 Overfitting and Underfitting

We begin by considering the following question: Should all the explanatory variables be used, or should some be omitted?
Why might we want to omit variables?

• Perhaps we want to detect which variables affect the response, so that we can avoid collecting unnecessary measurements in the future; this is called variable screening.

• Perhaps we want to predict from the model.
- Our predictions will be less accurate if the model is incorrectly specified.

We first consider the risks and benefits of including a predictor that might have coefficient \(\beta =0\).

30.1.1 Dropping or Including Variables that have \(\beta=0\)

Suppose we have a regression model with several variables. We might suspect one has regression coefficient \(\beta=0\). Should we drop it?
Some mathematical theory shows:

• If we omit a variable which (truly) has regression coefficient \(\beta=0\) then

  1. The variances of the remaining estimated regression coefficients are reduced.

  2. The variances of predictions from the model at any point \(x_0\) are reduced.

Both of these things are good.

• Conversely, if we include a variable whose true coefficient is \(\beta=0\), this increases these variances unnecessarily.
This is called the problem of overfitting, and we want to avoid it.

In practice we do not know if parameters are zero. We could use a hypothesis test to help us decide, but this is not a definitive answer.

30.1.2 Dropping or Including Variables that have \(\beta \ne 0\)

• On the other hand, suppose we make a mistake and omit a variable that has regression coefficient \(\beta \ne 0\).

Then we have the opposite problem, called underfitting. Mathematical theory shows that the consequences of underfitting are:

  1. if the wrongly-omitted variables are correlated with the remaining variables, then the estimates of the remaining regression coefficients \(\beta_0,\beta_1,\ldots,\beta_{p-1}\) are biased (i.e. their expected values may be a lot too big or too small).

  2. Further (whether or not the omitted variables are correlated with the existing ones) the predictions \(\hat y_i\) are biased.

30.2 A Numerical Illustration of Underfitting and Overfitting

In this artificial example, we consider an industrial process whose Yield is thought to depend on the process temperature at stage A of the process, and possibly also on the temperature at stage B.

i.e. We consider the model \[ Y=\beta_0 + \beta_1.tempA+ \beta_2.tempB +\varepsilon.\]

We will use the values of tempA, tempB and \(\varepsilon\) given in the table below. The errors \(\varepsilon\) are independent draws from a Normal distribution with mean 0 and standard deviation 1.

What we want to know is this: what are the implications of using an incorrect model, i.e. of making incorrect assumptions about the \(\beta\)s?
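(For reference, inputs of this kind could be generated along the following lines. This is only a sketch, not the code used to create Simulation1.csv; the seed, the temperature ranges, and the object name SimDemo are arbitrary choices.)

set.seed(123)                                                  # arbitrary seed
n <- 20
SimDemo <- data.frame(tempA = round(runif(n, 15, 21.5), 1))    # stage-A temperatures
SimDemo$tempB   <- round(SimDemo$tempA + rnorm(n, 0, 0.7), 1)  # stage-B temperatures, correlated with tempA
SimDemo$Epsilon <- round(rnorm(n), 5)                          # Normal(0, 1) errors
head(SimDemo)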

Download Simulation1.csv

Sim = read.csv("Simulation1.csv", header = TRUE)
Sim
   tempA tempB  Epsilon
1   16.5  16.9 -1.64960
2   17.1  16.8 -1.27669
3   15.0  15.8  0.45926
4   16.9  17.0  0.34253
5   21.2  20.6 -0.71849
6   18.2  17.8 -0.51455
7   19.3  19.9 -1.41754
8   16.7  16.0 -1.30384
9   16.3  16.3  0.40450
10  17.3  18.0 -0.18211
11  19.1  19.4  0.34204
12  20.8  20.1  1.25550
13  19.7  20.5  0.62556
14  17.7  18.6 -1.01822
15  16.8  16.4 -1.35425
16  15.9  16.6  0.81533
17  18.4  19.5  0.41898
18  21.3  20.9 -0.37490
19  19.7  19.6 -0.78441
20  17.0  16.3  0.20040
attach(Sim)

To understand the implications, we simulate two scenarios.

30.2.1 1. Suppose \(\beta_0=50\), \(\beta_1=10\) and \(\beta_2=0\).

That is, the correct model doesn’t depend on tempB at all. Thus for this illustration,
\[ Y = 50 + 10.tempA+ \varepsilon \]

The following shows the result of modelling \(Y\) by (A) tempA alone and (B) by tempA and tempB.

Y = 50 + 10 * tempA + Epsilon
lmA = lm(Y ~ tempA)
summary(lmA)

Call:
lm(formula = Y ~ tempA)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.32422 -0.78386 -0.02354  0.70616  1.47274 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.2597     2.0330   24.23 3.43e-15 ***
tempA        10.0251     0.1121   89.41  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8906 on 18 degrees of freedom
Multiple R-squared:  0.9978,    Adjusted R-squared:  0.9976 
F-statistic:  7995 on 1 and 18 DF,  p-value: < 2.2e-16
lmB = lm(Y ~ tempA + tempB)
summary(lmB)

Call:
lm(formula = Y ~ tempA + tempB)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36984 -0.74294 -0.05982  0.69150  1.63488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  48.7641     2.1367  22.822 3.43e-14 ***
tempA         9.7613     0.3383  28.852 7.01e-16 ***
tempB         0.2896     0.3500   0.827    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8985 on 17 degrees of freedom
Multiple R-squared:  0.9978,    Adjusted R-squared:  0.9976 
F-statistic:  3928 on 2 and 17 DF,  p-value: < 2.2e-16

The basic outcome is this: in both of the preceding analyses, the estimated regression coefficient \(b_1\) is very close to the correct value \(\beta_1 = 10\).

This illustrates that adding an unnecessary variable does not greatly change the estimates or introduce bias (10.0 or 9.8, compared with the correct value of 10).

The major effect is the increase in the standard errors of the estimates: in this case roughly a threefold increase, from about 0.11 to 0.34.
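These standard errors can be extracted and compared directly from the two fitted objects (a small check using lmA and lmB as fitted above):

# Estimates and standard errors for tempA under the two fits
summary(lmA)$coefficients["tempA", c("Estimate", "Std. Error")]
summary(lmB)$coefficients["tempA", c("Estimate", "Std. Error")]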

30.2.2 2. Suppose \(\beta_0=50\), \(\beta_1=5\) and \(\beta_2=5\).

i.e. we consider the model \[ Y = 50 + 5.tempA+ 5.tempB+ \varepsilon \]

The following shows the result of (A) regressing \(Y\) on tempA alone, and (B) regressing \(Y\) on tempA and tempB.

Y = 50 + 5 * tempA + 5 * tempB + Epsilon
lmA = lm(Y ~ tempA)
summary(lmA)

Call:
lm(formula = Y ~ tempA)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6070 -2.6463 -0.5516  2.8106  5.8295 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.8140     7.5733   7.634 4.74e-07 ***
tempA         9.5802     0.4177  22.937 8.94e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.318 on 18 degrees of freedom
Multiple R-squared:  0.9669,    Adjusted R-squared:  0.9651 
F-statistic: 526.1 on 1 and 18 DF,  p-value: 8.936e-15
lmB = lm(Y ~ tempA + tempB)
summary(lmB)

Call:
lm(formula = Y ~ tempA + tempB)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36984 -0.74294 -0.05982  0.69150  1.63488 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  48.7641     2.1367   22.82 3.43e-14 ***
tempA         4.7613     0.3383   14.07 8.48e-11 ***
tempB         5.2896     0.3500   15.11 2.75e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8985 on 17 degrees of freedom
Multiple R-squared:  0.9977,    Adjusted R-squared:  0.9974 
F-statistic:  3701 on 2 and 17 DF,  p-value: < 2.2e-16

In the first model, tempB is wrongly ignored. The result is that the effect of tempA is greatly overestimated: it is biased upwards to nearly twice its true value.

The standard error for tempA is also larger, but not nearly large enough for a confidence interval based on this regression to contain the true value \(\beta_1=5\).

In the second model, the values of \(b_1\) and \(b_2\) are close to the true ones, and the standard errors are the same as in the second analysis of the previous case.

In summary, if we omit a needed variable then we may introduce bias into our estimation.

The big problem with bias is that there may be no sign that we have made a mistake.

In some cases we don’t even have the needed variable, so we can’t try it in the model.

In contrast, as case 1 showed, wrongly including an unnecessary variable does not bias our estimates; it just increases our standard errors, and since those standard errors are reported we know exactly how bad things are. With a wrongly omitted variable we do not.

For this reason, we may decide to err on the side of caution and add in variables even if we are uncertain about their relevance. Then at least we can avoid bias.

30.3 Criteria for Model Selection

30.3.1 Bias and Variance

Model selection, then, comes down to balancing the effects of variance and bias.

  • If we omit a variable, it may dramatically decrease the variance of the other parameter estimates. This is a plus.

  • But if we make a mistake and omit an important variable, then the other parameter estimates may be biased. This is a minus.

So what should we choose? There is no settled rule for every circumstance, but in general we should probably be more concerned to avoid bias (especially since we don’t know how bad it is!) than to avoid a small increase in variance (at least we do know how bad that is).

Consequently we may decide to err on the side of including variables with P-values a little bigger than 0.05, especially if they seem to induce a sizeable change in the other coefficient estimates. This is another thing to watch out for in model selection.

30.3.2 Review of previous Model Selection Criteria

Coefficient of determination \(R^2\)

Recall that this is the proportion of total variation which is explained by the model (alternatively, 1 minus the ratio of unexplained variation to total variation): \[ R^2 = \frac{SS_{Regn}}{SST} = 1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar y)^2}. \] \(R^2\) is useful for comparing two models with the same number of regression parameters. If models have different numbers of regression parameters we instead use the adjusted coefficient of determination, described next.

Adjusted Coefficient of determination \(R^2_{Adj}\)

Recall that this is 1 minus the ratio of Mean Square Error to Mean Square Total: \[ R^2_{Adj} = 1 - \frac{MSE}{MST} = 1 - \frac{\sum(y_i-\hat y_i)^2/(n-p)}{\sum(y_i-\bar y)^2/(n-1)}.\] When considering adding a variable to a regression, \(R^2_{Adj}\) goes up if the variable is “worth” at least one row of data (one degree of freedom; equivalently, if its \(t\)-ratio exceeds 1 in absolute value), which is a pretty low bar to cross. But at least if a variable cannot meet that rule then we know to exclude it.

Residual Standard Error, S

This is the standard deviation of the residuals, \[ S = \sqrt{\frac{1}{n-p}\sum{(y_i - \hat y_i)^2}}, \] or in other words a measure of how much the \(y_i\) vary around the model predictions. We might choose to include a variable if it reduces \(S\).

P-value from the T-ratio

We usually prefer models that only include variables with coefficient P-value <0.05. However the “0.05” criterion is somewhat arbitrary and occasionally we allow in variables with slightly higher P-values.
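All four of these criteria can be read straight off a fitted model object in R. A small sketch, assuming the lmB object from the simulation above is still available:

s <- summary(lmB)
s$r.squared                    # R^2
s$adj.r.squared                # adjusted R^2
s$sigma                        # residual standard error S
s$coefficients[, "Pr(>|t|)"]   # P-values from the t-ratios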

30.4 Criteria based on prediction

A problem with all the previous criteria is that they choose a model based on how well it fits the current data; that model may not be so good at predicting new data.

Recall we defined deleted residuals
\[ e_{i,-i}= y_i- \hat y_{i,-i}, \] where \(\hat y_{i,-i}\) is the prediction of \(y_i\) based on fitting the model to all of the data except the \(i\)th observation.
This is a special case of a class of methods called “cross-validation”, where you fit a model to part of the data and see how well it predicts the rest.

Now combine the deleted residuals into
\[ \mbox{PRESS} = \sum_i e_{i,-i}^2 \]

From which we can derive
\[R^2_{Pred} = 1- \frac{PRESS}{ SS_{total} }\]
We can interpret \(R^2_{Pred}\) as approximately the amount of variation in new data which would be explained by our model.

So our criterion can be to choose the model with minimum PRESS or, equivalently, maximum \(R^2_{Pred}\).

We can calculate this, for example, using the following code from https://stackoverflow.com/questions/64224124/how-to-calculate-predicted-r-sq-in-r .

PRESS <- function(linear.model) {
    #' calculate the predictive residuals
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    #' calculate the PRESS
    PRESS <- sum(pr^2)
    return(PRESS)
}

pred_r_squared <- function(linear.model) {
    #' Use anova() to get the sum of squares for the linear model
    lm.anova <- anova(linear.model)
    #' Calculate the total sum of squares
    tss <- sum(lm.anova$"Sum Sq")
    # Calculate the predictive R^2
    pred.r.squared <- 1 - PRESS(linear.model)/(tss)
    return(pred.r.squared)
}
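As a quick check, these functions can be applied to the model lmB fitted in the simulation above. The shortcut used inside PRESS() is the standard identity \(e_{i,-i} = e_i/(1-h_{ii})\), so no refitting is needed; the sketch below (which builds a small data frame from the simulation objects, assuming they are still in the workspace) verifies this for the first observation.

PRESS(lmB)             # sum of squared deleted residuals
pred_r_squared(lmB)    # predictive R^2 for lmB

# Verify the shortcut e_{i,-i} = e_i / (1 - h_ii) for observation 1
d <- data.frame(Y = Y, tempA = tempA, tempB = tempB)
fit_minus1 <- lm(Y ~ tempA + tempB, data = d[-1, ])
d$Y[1] - predict(fit_minus1, newdata = d[1, ])          # deleted residual by refitting
residuals(lmB)[1] / (1 - lm.influence(lmB)$hat[1])      # deleted residual by the shortcut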

30.5 Example: Model Selection for Climate Data

Download Climate.csv

climate = read.csv("Climate.csv", header = TRUE)
attach(climate)
lm1 = lm(MnJlyTemp ~ Lat)
summary(lm1)

Call:
lm(formula = MnJlyTemp ~ Lat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6992 -0.9786  0.1566  1.0588  4.7930 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.42127    3.94299   8.730 3.35e-10 ***
Lat         -0.66187    0.09613  -6.885 6.25e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.822 on 34 degrees of freedom
Multiple R-squared:  0.5824,    Adjusted R-squared:  0.5701 
F-statistic: 47.41 on 1 and 34 DF,  p-value: 6.25e-08
summary(lm1)$sigma
[1] 1.822262
summary(lm1)$adj.r.squared
[1] 0.570072
pred_r_squared(lm1)
[1] 0.5376735
lm2 = lm(MnJlyTemp ~ Lat + Height)
summary(lm2)

Call:
lm(formula = MnJlyTemp ~ Lat + Height)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0607 -0.6215 -0.1021  0.5145  3.9898 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.8788366  2.4333203  13.512 5.30e-15 ***
Lat         -0.6005121  0.0596687 -10.064 1.38e-11 ***
Height      -0.0071984  0.0009542  -7.544 1.12e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.121 on 33 degrees of freedom
Multiple R-squared:  0.8467,    Adjusted R-squared:  0.8374 
F-statistic: 91.14 on 2 and 33 DF,  p-value: 3.639e-14
summary(lm2)$sigma
[1] 1.120587
summary(lm2)$adj.r.squared
[1] 0.8374207
pred_r_squared(lm2)
[1] 0.8258196
lm3 = lm(MnJlyTemp ~ Lat + Height + Sea)
summary(lm3)

Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7969 -0.4792 -0.0299  0.5042  3.7850 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.290016   2.290747  13.659 6.71e-15 ***
Lat         -0.587826   0.054584 -10.769 3.59e-12 ***
Height      -0.005121   0.001147  -4.463 9.38e-05 ***
Sea          1.294326   0.466063   2.777  0.00909 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.022 on 32 degrees of freedom
Multiple R-squared:  0.8765,    Adjusted R-squared:  0.8649 
F-statistic: 75.69 on 3 and 32 DF,  p-value: 1.275e-14
summary(lm3)$sigma
[1] 1.0215
summary(lm3)$adj.r.squared
[1] 0.8649013
pred_r_squared(lm3)
[1] 0.8551276

Each of the first three predictors has a significant P-value, \(S\) has gone down each time, and the adjusted \(R^2\) and \(R^2_{Pred}\) have gone up each time, so there is no reason to doubt that Lat, Height and Sea should all be included. Note also that when Sea was added, the regression coefficient for Height changed by about two standard errors, which suggests that Sea is correcting for bias; all the more reason to include it.

lm4 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland)
summary(lm4)

Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2667 -0.5054 -0.2098  0.4707  3.4088 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.378217   4.637932   4.609 6.56e-05 ***
Lat         -0.375517   0.101816  -3.688 0.000862 ***
Height      -0.004416   0.001109  -3.981 0.000385 ***
Sea          1.803422   0.483327   3.731 0.000766 ***
NorthIsland  1.559710   0.647795   2.408 0.022193 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9526 on 31 degrees of freedom
Multiple R-squared:  0.8959,    Adjusted R-squared:  0.8825 
F-statistic: 66.73 on 4 and 31 DF,  p-value: 8.723e-15
summary(lm4)$sigma
[1] 0.9525912
summary(lm4)$adj.r.squared
[1] 0.8825137
pred_r_squared(lm4)
[1] 0.870069

The same conclusions apply: NorthIsland should be included. Note that including it has induced big changes in the other coefficients (especially Lat), which again is evidence that it is important.

lm5 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain)
summary(lm5)

Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2829 -0.5042 -0.1944  0.4422  3.4651 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 19.6651523  4.6317227   4.246 0.000194 ***
Lat         -0.3437001  0.1009517  -3.405 0.001900 ** 
Height      -0.0049792  0.0011322  -4.398 0.000127 ***
Sea          1.6435808  0.4802716   3.422 0.001814 ** 
NorthIsland  1.7682036  0.6430084   2.750 0.010003 *  
Rain         0.0003583  0.0002170   1.651 0.109182    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9271 on 30 degrees of freedom
Multiple R-squared:  0.9046,    Adjusted R-squared:  0.8887 
F-statistic:  56.9 on 5 and 30 DF,  p-value: 2.11e-14
summary(lm5)$sigma
[1] 0.9271362
summary(lm5)$adj.r.squared
[1] 0.8887087
pred_r_squared(lm5)
[1] 0.8748455

This time the Rain variable is not significant (P ≈ 0.11), which goes against including Rain. However \(S\) has gone down, and the adjusted R-squared and pred_r_squared have both gone up. So we may decide that if our aim is to predict for new data, then we should include this variable in the model.

lm6 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain + Long)
summary(lm6)

Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain + 
    Long)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3360 -0.4832 -0.1712  0.4008  3.2814 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.8938252 21.1286658   0.326 0.746557    
Lat         -0.3272942  0.1053819  -3.106 0.004216 ** 
Height      -0.0050077  0.0011449  -4.374 0.000144 ***
Sea          1.6381187  0.4853578   3.375 0.002113 ** 
NorthIsland  1.5548897  0.7352242   2.115 0.043156 *  
Rain         0.0004002  0.0002295   1.744 0.091750 .  
Long         0.0701947  0.1132443   0.620 0.540196    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9368 on 29 degrees of freedom
Multiple R-squared:  0.9059,    Adjusted R-squared:  0.8864 
F-statistic: 46.51 on 6 and 29 DF,  p-value: 1.402e-13
summary(lm6)$sigma
[1] 0.9368005
summary(lm6)$adj.r.squared
[1] 0.8863764
pred_r_squared(lm6)
[1] 0.8566546

At this point all the indications are bad: the P-value for Long is above 0.5, the adjusted R-squared has gone down, \(S\) has gone up, and the pred_r_squared has gone down.

In conclusion we should include the first four variables, and optionally the fifth one, Rain.
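To see the whole comparison at a glance, the criteria can be collected into a single table. This is a small sketch, assuming lm1 to lm6 and the pred_r_squared() function defined earlier are still in the workspace:

# Tabulate the selection criteria for the six candidate models
models <- list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4, lm5 = lm5, lm6 = lm6)
data.frame(
  S       = sapply(models, function(m) summary(m)$sigma),
  adj.R2  = sapply(models, function(m) summary(m)$adj.r.squared),
  pred.R2 = sapply(models, pred_r_squared)
)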

30.5.1 Mallows’ \(C_p\)

Mallows’ \(C_p\) is another commonly used selection criterion, which seeks to balance the variance and bias in estimating the predicted values \(\hat y_i\).

Without going into details, we note that one common way of using Mallows’ \(C_p\) is equivalent to choosing the model with minimum \(S\).

So that gives a double reason to use \(S\) as a selection criterion: it can be thought of as providing a balance of variance and bias for prediction.
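For readers who want to compute it explicitly, one standard form of the statistic is \(C_p = SSE_p/\hat\sigma^2_{full} + 2p - n\), where \(\hat\sigma^2_{full}\) is the MSE of the largest model under consideration and \(p\) counts the coefficients in the candidate model; models with little bias should have \(C_p\) close to \(p\). A sketch for the climate models above, treating lm6 as the full model (again assuming lm1 to lm6 are still in the workspace):

# Mallows' Cp for each candidate model, treating lm6 as the "full" model
n <- nrow(model.frame(lm6))
mse_full <- summary(lm6)$sigma^2
sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4, lm5 = lm5, lm6 = lm6),
       function(m) sum(residuals(m)^2) / mse_full + 2 * length(coef(m)) - n)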