Variable selection
In previous lectures we have looked at tests for single explanatory
variables and groups of explanatory variables.
Often, however, a statistician will simply want to know what is the
best group of explanatory variables for describing the response.
In this lecture we will consider various criteria for model
selection, and cover an example on finding a suitable variable set “by
hand”.
Overfitting and Underfitting
We begin by considering the following question: Should all the
explanatory variables be used, or should some be omitted?
Why might we want to omit variables?
Perhaps we want to detect variables which affect the response but
avoid collecting unnecessary measurements in the future. This is called
variable screening
Perhaps we want to predict from the model. Our predictions will
be less accurate if the model is incorrectly specified.
We first consider the risks and benefits of including a predictor
that might have coefficient \(\beta=0\).
Dropping or Including Variables that have \(\beta=0\)
Suppose we have a regression model with several variables. We might
suspect one has regression coefficient \(\beta=0\). Should we drop it?
Some mathematical theory shows that if we omit a variable which
(truly) has regression coefficient \(\beta=0\) then
Both these things are good.
Conversely if we include a variable whose coefficient \(\beta=0\) this increases these variances
unnecessarily. This is called overfitting. We want to
avoid this.
Dropping or Including Variables that have \(\beta \ne 0\)
On the other hand suppose we make a mistake and omit a variable that
has regression coefficient \(\ne
0\).
Then we have the opposite problem, called
underfitting. Mathematical theory shows the consequence
of underfitting is that
if the wrongly-omitted variables are correlated with the
remaining variables then the expected values of the remaining regression
coefficients \(\beta_0,\beta_1,...\beta_{p-1}\) are
biased. (i.e. may be a too big or too small).
Whether or not the omitted variables are correlated with the
existing ones, the predictions \(\hat
y_i\) are biased.
Both these things are bad!
In practice we do not know if parameters are zero.
We could use a hypothesis test to help us decide, but this is not a
definitive answer.
A Numerical Illustration of Under and Overfitting
In this artificial example, we consider an industrial process whose
Yield is thought to depend on the process temperature at stage A of the
process, and possibly also on the temperature at stage B.
i.e. We consider the model \[ Y=\beta_0 +
\beta_1 A+ \beta_2 B +\varepsilon.\]
We will assume the values of \(A\)
and \(B\) and \(\varepsilon\) are given in the data below.
The errors are random \(\varepsilon \sim
\mbox{Normal}(0, 1)\).
Download simulation1.csv
## sim = read.csv("simulation1.csv", header = TRUE)
What we want to know is, what are the implications of using an
incorrect model, i.e. incorrect assumptions about the \(\beta\)s. To understand the implications,
we simulate two scenarios.
Scenario 1: \(\beta_0=50\), \(\beta_1=10\) and \(\beta_2=0\).
For this illustration, the correct model doesn’t depend on \(B\) at all: \[Y = 50 + 10 A + \varepsilon\]
The following shows the result of modelling \(Y\) by (1) \(A\) alone and (2) by \(A\) and \(B\).
dat <- sim |>
mutate(Y = 50 + 10 * A + epsilon)
lm1 = lm(Y ~ A, data = dat)
lm2 = lm(Y ~ A + B, data = dat)
lst(lm1, lm2) |>
map_dfr(tidy, .id = "model")
# A tibble: 5 × 6
model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 lm1 (Intercept) 49.3 2.03 24.2 3.43e-15
2 lm1 A 10.0 0.112 89.4 2.71e-25
3 lm2 (Intercept) 48.8 2.14 22.8 3.43e-14
4 lm2 A 9.76 0.338 28.9 7.01e-16
5 lm2 B 0.290 0.350 0.827 4.19e- 1
Adding the unnecessary variable doesn’t introduce bias (\(\hat\beta_1 \approx 10\)), but the standard
error of \(\hat\beta_1\) is higher in
the misspecified model.
Scenario 2: \(\beta_0=50\), \(\beta_1=5\) and \(\beta_2=5\).
We consider the model where \(Y\)
depends on both \(A\) and \(B\), \[ Y = 50 + 5 A + 5 B + \varepsilon
\]
dat <- sim |>
mutate(Y = 50 + 5 * A + 5 * B + epsilon)
lm1 = lm(Y ~ A, data = dat)
lm2 = lm(Y ~ A + B, data = dat)
lst(lm1, lm2) |>
map_dfr(tidy, .id = "model")
# A tibble: 5 × 6
model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 lm1 (Intercept) 57.8 7.57 7.63 4.74e- 7
2 lm1 A 9.58 0.418 22.9 8.94e-15
3 lm2 (Intercept) 48.8 2.14 22.8 3.43e-14
4 lm2 A 4.76 0.338 14.1 8.48e-11
5 lm2 B 5.29 0.350 15.1 2.75e-11
In the first model, \(B\) is wrongly
ignored and \(A\) is both biased and
has a larger standard error. In the second model, the values of \(\hat\beta_1\) and \(\hat\beta_2\) are close to the true ones
and the standard errors are the same as in the second analysis of
Scenario 1.
In summary, if we omit a needed variable then we may introduce bias
into our estimation.
The big problem with bias is that there may be no sign that we have a
mistake.
In some cases we don’t even have the needed variable, so we
can’t try it in the model.
At least in case 1, when we wrongly enter an unnecessary variable, it
doesn’t bias our estimates, just increases our standard errors. So at
least we know exactly how bad things are. But with an omitted variable
we don’t know how bad things are.
For this reason, we may decide to err on the side of caution and add
in variables even if we are uncertain about their relevance. Then at
least we can avoid bias.
Model Selection: Bias and Variance
Model selection, then, comes down to balancing the effects of
variance and bias.
If we omit a variable, it may dramatically decrease the variance
of the other parameter estimates. This is a plus.
But if the variable is important, then the other parameter
estimates may be biased. This is a minus.
So what should we choose? There is no settled rule for every
circumstance, but in general we should probably be more concerned to
avoid bias (especially since we don’t know how bad this is!) than to
avoid a small increase in variance (at least we do know how bad it is -
we can see the standard errors.)
Consequently we may decide to err on the side of including variables
with P-values a little bigger than 0.05, especially if they seem to
induce a sizeable change in the other coefficient estimates. This
is another thing to watch out for in model selection.
Review of Model Selection Criteria
There are a number of measures we can use for model selection:
Coefficient of determination \(R^2\)
Adjusted Coefficient of determination \(R^2_{Adj}\)
Residual Standard Error \(S\)
\(P\)-values of individual
coefficients
Each of these are slightly different, so it pays to consider more
than one.
Coefficient of determination \(R^2\)
Recall that this is the proportion of total variation which is
explained by the model. (Alternatively, 1 minus the ratio of unexplained
variation to total variation): \[ R^2 =
\frac{SS_{Regn}}{SST} = 1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar
y)^2}. \] \(R^2\) is useful for
comparing two models with the same number of regression
parameters. If models have different numbers of regression parameters we
use the adjusted version.
Adjusted Coefficient of determination \(R^2_{Adj}\)
Recall that this is 1 minus the ratio of Mean Square Error to Mean
Square Total: \[ R^2_{Adj} = 1 -
\frac{MSE}{MST} = 1 - \frac{\sum(y_i-\hat y_i)^2/(n-p)}{\sum(y_i-\bar
y)^2/(n-1)}.\] When considering adding a variable to a
regression, the \(R^2_{Adj}\) goes up
if the variable is “worth” at least one row of data (one degree of
freedom), which is a pretty low bar to cross.
But at least if a variable can’t meet that rule then we know to
exclude it.
Residual Standard Error \(S\)
This is the standard deviation of the residuals, \[ S = \frac{1}{n-p}\sum{(y_i - \hat y_i)^2}
\] or in other words the standard deviation of how the \(y_i\)s vary around the model predictions.
We might choose to include a variable if it reduces the \(S\).
\(P\)-value for individual
coefficients
We usually prefer models that only include variables with coefficient
\(P\)-value is small (e.g. \(< 0.05\)).
However the “0.05” criterion is somewhat arbitrary and occasionally
we allow in variables with slightly higher \(P\)-values to potentially reduce bias.
Example: Model Selection for Climate Data
We wish to model the mean temperature in July (winter) in Aotearoa
New Zealand using data from 36 towns. We have information on location
(latitude, longitude, height (elevation), whether they’re by the sea or
in the North island) as well as climate information (mean temperature in
January, total sunshine hours and total rainfall).
Download climate.csv
## climate = read_csv("climate.csv")
climate
# A tibble: 36 × 10
Place Lat Long MnJanTemp MnJlyTemp Rain Sun Height Sea NorthIsland
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Kaitaia 35.1 173. 19.3 11.7 1418 2113 80 1 1
2 Kerikeri 35.2 174 18.9 10.8 1682 2004 73 1 1
3 Dargavi… 36 174. 18.6 10.7 1248 1956 20 1 1
4 Whangar… 35.7 174. 19.7 11 1600 1925 29 1 1
5 Auckland 36.9 175. 19.4 11 1185 2102 49 1 1
6 Tauranga 37.7 176. 18.5 9.3 1349 2277 4 1 1
7 Hamilton 37.8 175. 17.8 8.3 1201 2006 40 0 1
8 Rotorua 38.2 176. 17.5 7.3 1439 1948 307 0 1
9 Gisborne 38.7 178 18.7 13.6 1058 2204 4 1 1
10 Taupo 38.7 177. 17.3 6.5 1178 2021 376 0 1
# ℹ 26 more rows
Climate Exploratory Data Analysis
climate |>
pivot_longer(-c(Place, MnJlyTemp)) |>
ggplot() + geom_point(mapping = aes(x = value, y = MnJlyTemp)) + facet_wrap(vars(name),
scales = "free_x", ncol = 4)
Higher winter temperature associations
- Lower elevations (height)
- Lower latitudes (further north)
- Higher longitudes (further east - recall how Ao/NZ is oriented on a
map - possibly correlation with latitude, will check next!)
- Higher summer temperatures
- North island (could just be latitude?)
- Increased rainfall - up to a point! (subtropical vs rainy westcoast
- interaction with Island/latitude??)
- Closeness to the sea
- Increased sunshine hours
Many of these are likely correlated.
Climate: Latitude/Longitude and North vs South
ggplot(climate) + geom_point(mapping = aes(x = Long, y = -Lat, shape = as_factor(NorthIsland),
col = MnJlyTemp)) + geom_text(mapping = aes(x = Long, y = -Lat, label = Place),
size = 2, hjust = "right", nudge_x = 0.1, nudge_y = 0.2) + coord_equal()
Climate: Latitude/Longitude and North vs South
Ok, so we can see that the temperature is being driven by latitude
(further north is warmer) - the longitude effect is possibly just due to
orientation. It’s clear there’s an elevation (height) effect as well,
though that is possibly confounded with whether the town is close to the
sea (it can’t really be both!)
So we’d expect some collinearity here and figuring out which are the
best variables to be used might take a bit of playing!
We’ll start with Latitude, then add in Height and Sea:
lm1 = lm(MnJlyTemp ~ Lat, data = climate)
lm2 = lm(MnJlyTemp ~ Lat + Height, data = climate)
lm3 = lm(MnJlyTemp ~ Lat + Height + Sea, data = climate)
Climate: model output
Call:
lm(formula = MnJlyTemp ~ Lat, data = climate)
Residuals:
Min 1Q Median 3Q Max
-3.6992 -0.9786 0.1566 1.0588 4.7930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.42127 3.94299 8.730 3.35e-10 ***
Lat -0.66187 0.09613 -6.885 6.25e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.822 on 34 degrees of freedom
Multiple R-squared: 0.5824, Adjusted R-squared: 0.5701
F-statistic: 47.41 on 1 and 34 DF, p-value: 6.25e-08
Call:
lm(formula = MnJlyTemp ~ Lat + Height, data = climate)
Residuals:
Min 1Q Median 3Q Max
-2.0607 -0.6215 -0.1021 0.5145 3.9898
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.8788366 2.4333203 13.512 5.30e-15 ***
Lat -0.6005121 0.0596687 -10.064 1.38e-11 ***
Height -0.0071984 0.0009542 -7.544 1.12e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.121 on 33 degrees of freedom
Multiple R-squared: 0.8467, Adjusted R-squared: 0.8374
F-statistic: 91.14 on 2 and 33 DF, p-value: 3.639e-14
Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea, data = climate)
Residuals:
Min 1Q Median 3Q Max
-1.7969 -0.4792 -0.0299 0.5042 3.7850
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.290016 2.290747 13.659 6.71e-15 ***
Lat -0.587826 0.054584 -10.769 3.59e-12 ***
Height -0.005121 0.001147 -4.463 9.38e-05 ***
Sea 1.294326 0.466063 2.777 0.00909 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.022 on 32 degrees of freedom
Multiple R-squared: 0.8765, Adjusted R-squared: 0.8649
F-statistic: 75.69 on 3 and 32 DF, p-value: 1.275e-14
Climate: First three predictors.
For these first three predictors:
- They have significant p-values
- \(S\) has gone down each time
- The adjusted \(R^2\) and has gone
up each time.
There is no reason to doubt that Lat
,
Height
and Sea
should all be included.
Also note the regression coefficient for Height
altered
by about 2 standard errors, which suggests that Sea
is
correcting for bias, all the more reason for including it.
This makes sense - sea level towns will mostly have a very low
height, but there’ll also be an additional warming effect of being close
to the sea.
Climate: Adding island
lm4 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland, data = climate)
lm4 |>
summary()
Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland, data = climate)
Residuals:
Min 1Q Median 3Q Max
-1.2667 -0.5054 -0.2098 0.4707 3.4088
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.378217 4.637932 4.609 6.56e-05 ***
Lat -0.375517 0.101816 -3.688 0.000862 ***
Height -0.004416 0.001109 -3.981 0.000385 ***
Sea 1.803422 0.483327 3.731 0.000766 ***
NorthIsland 1.559710 0.647795 2.408 0.022193 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9526 on 31 degrees of freedom
Multiple R-squared: 0.8959, Adjusted R-squared: 0.8825
F-statistic: 66.73 on 4 and 31 DF, p-value: 8.723e-15
Same conclusions as before, we should include
NorthIsland
Including it has induced big changes in the coefficients (especially
Lat
) which again is evidence it is important.
Why do you think adding North vs South island is useful over
and above Latitude? Wouldn’t Latitude do everything here??
Adding Rain
lm5 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain, data = climate)
lm5 |>
summary()
Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain,
data = climate)
Residuals:
Min 1Q Median 3Q Max
-1.2829 -0.5042 -0.1944 0.4422 3.4651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.6651523 4.6317227 4.246 0.000194 ***
Lat -0.3437001 0.1009517 -3.405 0.001900 **
Height -0.0049792 0.0011322 -4.398 0.000127 ***
Sea 1.6435808 0.4802716 3.422 0.001814 **
NorthIsland 1.7682036 0.6430084 2.750 0.010003 *
Rain 0.0003583 0.0002170 1.651 0.109182
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9271 on 30 degrees of freedom
Multiple R-squared: 0.9046, Adjusted R-squared: 0.8887
F-statistic: 56.9 on 5 and 30 DF, p-value: 2.11e-14
The Rain
variable is not significant (P=0.1) and \(S\) has gone up, which go against including
Rain
. However the adjusted R-squared has gone up. So we may
decide that if our aim is to predict for new data, then we should
include this variable in the model.
Adding Longitude
lm6 = lm(MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain + Long, data = climate)
lm6 |>
summary()
Call:
lm(formula = MnJlyTemp ~ Lat + Height + Sea + NorthIsland + Rain +
Long, data = climate)
Residuals:
Min 1Q Median 3Q Max
-1.3360 -0.4832 -0.1712 0.4008 3.2814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.8938252 21.1286658 0.326 0.746557
Lat -0.3272942 0.1053819 -3.106 0.004216 **
Height -0.0050077 0.0011449 -4.374 0.000144 ***
Sea 1.6381187 0.4853578 3.375 0.002113 **
NorthIsland 1.5548897 0.7352242 2.115 0.043156 *
Rain 0.0004002 0.0002295 1.744 0.091750 .
Long 0.0701947 0.1132443 0.620 0.540196
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9368 on 29 degrees of freedom
Multiple R-squared: 0.9059, Adjusted R-squared: 0.8864
F-statistic: 46.51 on 6 and 29 DF, p-value: 1.402e-13
At this point all the indications are bad: The P-value for
Long
is >0.5, the adjusted R-squared has gone down and S
has gone up.
In conclusion we should include the first four variables, and
optionally the fifth one, Rain
.
---
title: "Lecture 27: Variable Selection"
subtitle: 161.251 Regression Modelling
author: "Presented by Jonathan Marshall <J.C.marshall@massey.ac.nz>"  
date: "Week 10 of Semester 2, `r lubridate::year(lubridate::now())`"
output:
  html_document:
    code_download: true
    theme: yeti
    highlight: tango
  html_notebook:
    code_download: true
    theme: yeti
    highlight: tango
  ioslides_presentation:
    widescreen: true
    smaller: true
  word_document: default
  slidy_presentation: 
    theme: yeti
    highlight: tango
  pdf_document: default
---




<!--- Data is on
https://r-resources.massey.ac.nz/data/161251/
--->

```{r setup, purl=FALSE, include=FALSE}
library(knitr)
opts_chunk$set(dev=c("png", "pdf"))
opts_chunk$set(fig.height=6, fig.width=7, fig.path="Figures/", fig.alt="unlabelled")
opts_chunk$set(comment="", fig.align="center", tidy=TRUE)
options(knitr.kable.NA = '')
library(tidyverse)
library(broom)
```


<!--- Do not edit anything above this line. --->

## Variable selection

In previous lectures we have looked at tests for single explanatory variables and groups of explanatory variables.

Often, however, a statistician will simply want to know what is the     best group of explanatory variables for describing the response.

In this lecture we will consider various criteria for model selection, and cover an example on finding a suitable variable set "by hand".

## Overfitting and Underfitting

We begin by considering the following question:  Should all the explanatory variables be used, or should some be omitted?

Why might we want to omit variables? 

-	Perhaps we want to detect variables which affect the response  but avoid collecting unnecessary measurements in the future. This is called variable screening

-	Perhaps we want to predict from the model. Our predictions will be less accurate if the model is incorrectly specified.
	
We first consider the risks and benefits of including a predictor that might have coefficient $\beta=0$.

##  Dropping or Including Variables that have $\beta=0$

Suppose we have a regression model with several variables.  We might suspect one has regression coefficient $\beta=0$.  Should we drop it?

Some mathematical theory shows that if we omit a variable which (truly) has regression coefficient  $\beta=0$  then

- Variances of the remaining regression coefficients are reduced

- Variances of predictions from the model  at any point $x_0$  are reduced. 

Both these things are good.

Conversely if we include a variable whose coefficient $\beta=0$ this  increases these variances unnecessarily. This is called **overfitting**. We want to avoid this.

## Dropping or Including Variables that have $\beta \ne 0$

On the other hand suppose we make a mistake and omit a variable that has regression coefficient $\ne 0$.   

Then we have the opposite problem, called **underfitting**.  Mathematical theory shows the consequence of underfitting is that 

-	if the wrongly-omitted variables are correlated with the remaining variables then the expected values of the remaining regression coefficients $\beta_0,\beta_1,...\beta_{p-1}$ are **biased**. (i.e. may be a too big or too small).

- Whether or not the omitted variables are correlated with the existing ones, the predictions $\hat y_i$ are biased.

Both these things are bad!

**In practice we do not know if parameters are zero.** We could use a hypothesis test to help us decide, but this is not a definitive answer.

## A Numerical Illustration of Under and Overfitting

In this artificial example, we consider an industrial process whose Yield is thought  to depend on the process temperature at stage A of the process, and possibly also on the temperature at stage B.   

i.e.  We consider the model 
$$ Y=\beta_0 + \beta_1 A+ \beta_2 B +\varepsilon.$$
	
We will assume the values of $A$ and $B$ and $\varepsilon$ are given in the data below.  The errors are random $\varepsilon \sim \mbox{Normal}(0, 1)$.

`r xfun::embed_file("../../data/simulation1.csv")` 

```{r read Simulation1, eval=-1, echo=-2, message=FALSE}
sim = read.csv("simulation1.csv",header=TRUE)
sim = read.csv("../../data/simulation1.csv",header=TRUE)
```


What we want to know is, what are the implications of using an incorrect model, 	i.e. incorrect assumptions about the $\beta$s. To understand the implications, we simulate two scenarios. 

## Scenario 1: $\beta_0=50$, $\beta_1=10$ and $\beta_2=0$.

For this illustration, the correct model doesn't depend on $B$ at all:
$$Y  =  50 + 10 A + \varepsilon$$

The following shows the result of modelling $Y$ by (1) $A$ alone and  (2) by $A$ and $B$.

```{r overfitting}
dat <- sim |> mutate(Y = 50 + 10*A + epsilon)
lm1 = lm(Y ~ A, data=dat)
lm2 = lm(Y ~ A + B, data=dat)
lst(lm1, lm2) |> map_dfr(tidy, .id='model')
```

Adding the unnecessary variable doesn't introduce bias ($\hat\beta_1 \approx 10$), but the standard error of $\hat\beta_1$ is higher in the misspecified model.

## Scenario 2: $\beta_0=50$, $\beta_1=5$ and $\beta_2=5$.

We consider the model where $Y$ depends on both $A$ and $B$,
$$	Y  =  50 + 5 A + 5 B + \varepsilon $$

```{r underfitting}
dat <- sim |> mutate(Y = 50 + 5*A + 5*B + epsilon)
lm1 = lm(Y ~ A, data=dat)
lm2 = lm(Y ~ A + B, data=dat)
lst(lm1, lm2) |> map_dfr(tidy, .id='model')
```

In the first model, $B$ is wrongly ignored and $A$ is both biased and has a larger standard error. In the second model, the values of $\hat\beta_1$ and $\hat\beta_2$ are close to the true ones and the standard errors are the same as in the second analysis of Scenario 1.

##

In summary, if we omit a needed variable then we may introduce bias into our estimation.
	
The big problem with bias is that there may be no sign that we have a mistake. 

In some cases we don't even *have* the needed variable, so we can't try it in the model. 

At least in case 1, when we wrongly enter an unnecessary variable, it doesn't bias our estimates, just increases our standard errors. So at least we know exactly how bad things are. But with an omitted variable we don't know how bad things are.
	
For this reason, we may decide to err on the side of caution and add in variables even if we are uncertain about their relevance.  Then at least we can avoid bias. 

## Model Selection: Bias and Variance

Model selection, then, comes down to balancing the effects of variance and bias.

- If we omit a variable, it may dramatically decrease the variance of the other parameter estimates. This is a plus. 

- But if the variable is important, then the other parameter estimates may be biased.  This is a minus. 

So what should we choose?  There is no settled rule for every circumstance, but in general we should probably be more concerned to avoid bias (especially since we don't know how bad this is!) than to avoid a small increase in variance (at least we do know how bad it is - we can see the standard errors.)

Consequently we may decide to err on the side of including variables with P-values a little  bigger than 0.05, *especially if they seem to induce a sizeable change in the other coefficient estimates.*   This is another thing to watch out for in model selection. 

## Review of Model Selection Criteria

There are a number of measures we can use for model selection:

- Coefficient of determination $R^2$

- Adjusted Coefficient of determination $R^2_{Adj}$

- Residual Standard Error $S$

- $P$-values of individual coefficients

Each of these are slightly different, so it pays to consider more than one.

## Coefficient of determination $R^2$

Recall that this is the proportion of total variation  which is explained by the model. (Alternatively, 1 minus the ratio of unexplained variation to total variation):
$$ R^2 = \frac{SS_{Regn}}{SST}  = 1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar y)^2}. $$
$R^2$ is useful for comparing two models with the **same** number of regression parameters.   If models have different numbers of regression parameters we use the adjusted version.

## Adjusted Coefficient of determination $R^2_{Adj}$

Recall that this is 1 minus the ratio of Mean Square Error to Mean Square Total: 
$$ R^2_{Adj} = 1 - \frac{MSE}{MST}  = 1 - \frac{\sum(y_i-\hat y_i)^2/(n-p)}{\sum(y_i-\bar y)^2/(n-1)}.$$
When considering adding a variable to a regression, the $R^2_{Adj}$ goes up if the variable is "worth" at least one row of data (one degree of freedom), which is a pretty low bar to cross.

But at least if a variable can't meet that rule then we know to exclude it.

## Residual Standard Error $S$

This is the standard deviation of the residuals, 
$$ S = \frac{1}{n-p}\sum{(y_i - \hat y_i)^2} $$
or in other words the standard deviation of how the $y_i$s vary around the model predictions. We might choose to include a variable if it reduces the $S$.

## $P$-value for individual coefficients

We usually prefer models that only include variables with coefficient $P$-value is small (e.g. $< 0.05$).

However the "0.05" criterion is somewhat arbitrary and occasionally we allow in variables with slightly higher $P$-values to potentially reduce bias.

## Example:  Model Selection for Climate Data

We wish to model the mean temperature in July (winter) in Aotearoa New Zealand using data from 36 towns. We have information on location (latitude, longitude, height (elevation), whether they're by the sea or in the North island) as well as climate information (mean temperature in January, total sunshine hours and total rainfall).

`r xfun::embed_file("../../data/climate.csv")` 

```{r read climate,eval=-1, echo=-2, message=FALSE}
climate = read_csv("climate.csv")
climate = read_csv("../../data/climate.csv")
climate
```

## Climate Exploratory Data Analysis

```{r, fig.width=10, fig.height=4}
climate |>
  pivot_longer(-c(Place, MnJlyTemp)) |>
  ggplot() +
  geom_point(mapping=aes(x=value, y=MnJlyTemp)) +
  facet_wrap(vars(name), scales = 'free_x', ncol=4)
```

## Higher winter temperature associations

- Lower elevations (height)
- Lower latitudes (further north)
- Higher longitudes (further east - recall how Ao/NZ is oriented on a map - possibly correlation with latitude, will check next!)
- Higher summer temperatures
- North island (could just be latitude?)
- Increased rainfall - up to a point! (subtropical vs rainy westcoast - interaction with Island/latitude??)
- Closeness to the sea
- Increased sunshine hours

Many of these are likely correlated.

## Climate: Latitude/Longitude and North vs South

```{r}
ggplot(climate) +
  geom_point(mapping=aes(x=Long, y=-Lat, shape=as_factor(NorthIsland), col=MnJlyTemp)) +
  geom_text(mapping=aes(x=Long, y=-Lat, label=Place), size=2, hjust='right', nudge_x=0.1, nudge_y=0.2) +
  coord_equal()
```

## Climate: Latitude/Longitude and North vs South

Ok, so we can see that the temperature is being driven by latitude (further north is warmer) - the longitude effect is possibly just due to orientation. It's clear there's an elevation (height) effect as well, though that is possibly confounded with whether the town is close to the sea (it can't really be both!)

So we'd expect some collinearity here and figuring out which are the best variables to be used might take a bit of playing!

We'll start with Latitude, then add in Height and Sea:

```{r}
lm1 = lm( MnJlyTemp ~ Lat, data=climate)
lm2 = lm( MnJlyTemp ~ Lat + Height, data=climate)
lm3 = lm( MnJlyTemp ~ Lat + Height+ Sea, data=climate)
```

## Climate: model output

```{r}
lm1 |> summary()
```

##

```{r}
lm2 |> summary()
```

##

```{r}
lm3 |> summary()
```

## Climate: First three predictors.

For these first three predictors:

 - They  have significant p-values
 - $S$ has gone down each time
 - The adjusted $R^2$ and has gone up each time.
 
There is no reason to doubt that `Lat`, `Height` and `Sea` should all be included.   

Also note the regression coefficient for `Height` altered by about 2 standard errors, which suggests that `Sea` is correcting for bias, all the more reason for including it.

This makes sense - sea level towns will mostly have a very low height, but there'll also be an additional warming effect of being close to the sea.

## Climate: Adding island

```{r fourth variable}
lm4 = lm( MnJlyTemp ~ Lat+ Height+ Sea + NorthIsland, data=climate)
lm4 |> summary()
```

##

Same conclusions as before, we should include `NorthIsland`

Including it has induced big changes in the coefficients (especially `Lat`) which again is evidence it is important.

**Why do you think adding North vs South island is useful over and above Latitude? Wouldn't Latitude do everything here??**

## Adding Rain

```{r fifth variable}
lm5 = lm( MnJlyTemp ~ Lat+ Height+ Sea + NorthIsland+ Rain, data=climate)
lm5 |> summary()
```

##

The `Rain` variable is not significant (P=0.1) and $S$ has gone up, which go against including `Rain`.  However the adjusted R-squared has gone up.  So we may decide that if our aim is to predict for new data, then we should include this variable in the model.

## Adding Longitude

```{r sixth variable}
lm6 = lm( MnJlyTemp ~ Lat+ Height+ Sea + NorthIsland+ Rain + Long, data=climate)
lm6 |> summary()
```

##

At this point all the indications are bad:  The P-value for `Long` is >0.5, the adjusted R-squared has gone down and S has gone up.

In conclusion we should include the first four variables, and optionally the fifth one, `Rain`.