View the latest recording of this lecture

In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.

The comparison of nested general linear models using F tests follows naturally.

Natural ordering of GLMs

[Figure: natural ordering of general linear models (glm_comp.png)]

Comparison of Nested Linear Models

For any linear model \(M0\) nested within \(M1\), we can test:

  • \(H_0\): \(M0\) is adequate, versus
  • \(H_1\): \(M0\) is not adequate,

using an F test.

The hypotheses can be stated in terms of parameters of model \(M1\).

The F test statistic is

\[F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}\] where d is the difference in the number of (regression) parameters between the models, and r is the residual degrees of freedom for \(M1\).

If \(H_0\) is correct then F has an F distribution with \(d,r\) degrees of freedom.

Extremely large values of the test statistic provide evidence against \(H_0\).

Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{d,r}\) distribution, then the P-value is given by \(P = P(X \ge f_{obs})\).

This test can be carried out in R using anova(m0.lm, m1.lm) where m0.lm and m1.lm are respectively the \(M0\) and \(M1\) fitted models.
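The same calculation can be done by hand. Below is a minimal sketch (assuming m0.lm and m1.lm are fitted lm objects as above), using deviance() to extract each residual sum of squares and pf() for the upper-tail probability.

rss0 <- deviance(m0.lm)                       # RSS for the smaller model M0
rss1 <- deviance(m1.lm)                       # RSS for the larger model M1
d <- df.residual(m0.lm) - df.residual(m1.lm)  # difference in number of parameters
r <- df.residual(m1.lm)                       # residual degrees of freedom for M1
f.obs <- ((rss0 - rss1)/d)/(rss1/r)           # the F statistic defined above
pf(f.obs, d, r, lower.tail = FALSE)           # P-value; agrees with anova(m0.lm, m1.lm)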

The Samara Data revisited

Recall that we were modelling the mean speed of fall of samara (fruit from a maple tree) as a function of the disk loading (a covariate) and the tree from which it fell (a factor) for each fruit.

We fitted separate linear regressions for each of the three trees.

We shall now consider simpler models:

  • parallel regressions for each tree (i.e. equal slope for Load for all trees);
  • a single regression model for every fruit (i.e. equal slope and intercept for each tree).

We shall compare these models with each other, and with the separate regressions model.

Different Regressions Model

Download samara.csv

## Samara <- read.csv(file = "samara.csv", header = TRUE)
Samara$Tree <- factor(Samara$Tree)
Samara.lm.m2 <- lm(Velocity ~ Tree + Load + Tree:Load, data = Samara)
summary(Samara.lm.m2)

Call:
lm(formula = Velocity ~ Tree + Load + Tree:Load, data = Samara)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.120023 -0.049465 -0.001298  0.049938  0.145571 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.5414     0.2632   2.057   0.0488 *
Tree2        -0.8408     0.3356  -2.505   0.0181 *
Tree3        -0.2987     0.4454  -0.671   0.5078  
Load          3.0629     1.1599   2.641   0.0132 *
Tree2:Load    3.7343     1.5000   2.490   0.0188 *
Tree3:Load    0.8205     2.2837   0.359   0.7220  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07554 on 29 degrees of freedom
Multiple R-squared:  0.8436,    Adjusted R-squared:  0.8167 
F-statistic: 31.29 on 5 and 29 DF,  p-value: 7.656e-11

Parallel Regressions Model

Samara.lm.m1 <- lm(Velocity ~ Tree + Load, data = Samara)
summary(Samara.lm.m1)

Call:
lm(formula = Velocity ~ Tree + Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13572 -0.06027 -0.01576  0.05973  0.17130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07561    0.16871   0.448    0.657    
Tree2       -0.01047    0.03440  -0.304    0.763    
Tree3       -0.05879    0.04629  -1.270    0.213    
Load         5.12257    0.73875   6.934 8.88e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08101 on 31 degrees of freedom
Multiple R-squared:  0.8078,    Adjusted R-squared:  0.7892 
F-statistic: 43.43 on 3 and 31 DF,  p-value: 3.261e-11

Common Regressions Model

Samara.lm.m0 <- lm(Velocity ~ Load, data = Samara)
summary(Samara.lm.m0)

Call:
lm(formula = Velocity ~ Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16168 -0.05332 -0.01511  0.05528  0.17086 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.09326    0.10743  -0.868    0.392    
Load         5.82019    0.51119  11.386  5.7e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08067 on 33 degrees of freedom
Multiple R-squared:  0.7971,    Adjusted R-squared:  0.7909 
F-statistic: 129.6 on 1 and 33 DF,  p-value: 5.704e-13

Comparing Models

anova(Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Tree + Load
Model 2: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     31 0.20344                             
2     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Samara.lm.m0, Samara.lm.m1)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     33 0.21476                           
2     31 0.20344  2  0.011322 0.8626 0.4319
anova(Samara.lm.m0, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     33 0.21476                              
2     29 0.16549  4  0.049272 2.1585 0.09885 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments

The three fitted models are as follows:

  • Samara.lm.m2 is the separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and an RSS of 0.16549.
  • Samara.lm.m1 is the parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and an RSS of 0.20344.
  • Samara.lm.m0 is the single regression model. It has 2 regression parameters (1 for intercept, 1 for slope) and an RSS of 0.21476.

Comparison of \(M2\) and \(M1\) gives a very borderline result (P=0.05011). We would only just retain \(M1\) if we were working at the 5% significance level.

Comparison of \(M1\) and \(M0\) provides no evidence to prefer \(M1\) (P=0.4319).

Comparison of \(M2\) and \(M0\) shows that \(M0\) should be retained at the 5% significance level (though there is weak evidence in favour of \(M2\)).

What happens if you put all three models in the same anova?

anova(Samara.lm.m0, Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load
Model 3: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     33 0.21476                             
2     31 0.20344  2  0.011322 0.992 0.38306  
3     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value for comparing M0 and M1 is now 0.38306 instead of 0.4319. Why is there a change in the p-value?

The reason is that the denominator of the F test is based on the biggest model to hand. Here that is M2, so the denominator uses RSS=0.16549 on 29 df, whereas in the previous two-model comparison the biggest model was M1, with RSS=0.20344 on 31 df. This difference in the denominator is enough to change the p-value to 0.38306.
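As a quick check (a sketch using the rounded values printed above), the M0 versus M1 line of the three-model table can be reproduced directly:

# F for M0 vs M1, with the denominator taken from the largest model M2 (RSS = 0.16549 on 29 df)
f.obs <- ((0.21476 - 0.20344)/2)/(0.16549/29)
f.obs                                  # about 0.992
pf(f.obs, 2, 29, lower.tail = FALSE)   # about 0.383, matching the three-model table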

The lesson then is that if we are going to quote a p-value then we need to decide once and for all whether to use the different slopes model or not.

Fitted model is

Overall, it seems that the single regression model is probably adequate.

\[E[ \mbox{Velocity}] = -0.093 + 5.82 \mbox{Load}\]
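If we adopt the single regression model, predictions follow in the usual way; a minimal sketch, using a hypothetical disk loading of 0.2, is

# estimated mean velocity (with a confidence interval) at a hypothetical Load of 0.2
predict(Samara.lm.m0, newdata = data.frame(Load = 0.2), interval = "confidence")
# point estimate is roughly -0.093 + 5.82 * 0.2 = 1.07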

Comparison of Models for Body Fat

In this section, we are concerned with modelling a small data set on human body fat.

The response is percentage fat; Age is a numerical covariate and Gender is a factor.

Various model comparisons and model summaries are provided. Decide which is the best model, and write down the fitted model equation for males and females.


Data for Task

Download fat.csv

## Fat <- read.csv(file = "fat.csv", header = TRUE)
head(Fat)
  X Age Percent.Fat Gender
1 1  23         9.5      M
2 2  23        27.9      F
3 3  27         7.8      M
4 4  27        17.8      M
5 5  39        31.4      F
6 6  41        25.9      F

Coplot for Data from Task

The function coplot() produces a conditioning plot.

Syntax is coplot(y ~ x | A)

This plots y against x for each level of factor A.

coplot(Percent.Fat ~ Age | Gender, data = Fat)

[Figure: coplot of Percent.Fat against Age, conditioned on Gender]

A similar result is obtained using ggplot() with a grouping aesthetic specified; one possible version is sketched below.
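A minimal sketch, assuming the ggplot2 package is available (this uses a panel per gender rather than coplot()):

library(ggplot2)
# scatter plot of Percent.Fat against Age, one panel per Gender
ggplot(Fat, aes(x = Age, y = Percent.Fat)) +
  geom_point() +
  facet_wrap(~ Gender)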

R code for model fitting and comparisons

Fat.lm.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)
anova(Fat.lm.0, Fat.lm.1)
Analysis of Variance Table

Model 1: Percent.Fat ~ Age
Model 2: Percent.Fat ~ Gender + Age
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     16 529.66                              
2     15 360.88  1    168.79 7.0157 0.01824 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Fat.lm.1, Fat.lm.2)
Analysis of Variance Table

Model 1: Percent.Fat ~ Gender + Age
Model 2: Percent.Fat ~ Gender + Age + Gender:Age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)  
1     15 360.88                             
2     14 282.02  1    78.853 3.9144 0.0679 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(Fat.lm.0)

Call:
lm(formula = Percent.Fat ~ Age, data = Fat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2166  -3.3214  -0.8424   1.9466  12.0753 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2209     5.0762   0.635    0.535    
Age           0.5480     0.1056   5.191 8.93e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.754 on 16 degrees of freedom
Multiple R-squared:  0.6274,    Adjusted R-squared:  0.6041 
F-statistic: 26.94 on 1 and 16 DF,  p-value: 8.93e-05
summary(Fat.lm.1)

Call:
lm(formula = Percent.Fat ~ Gender + Age, data = Fat)

Residuals:
   Min     1Q Median     3Q    Max 
-6.638 -3.455 -1.103  3.297  8.952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  15.0708     6.2243   2.421   0.0286 *
GenderM      -9.7914     3.6966  -2.649   0.0182 *
Age           0.3392     0.1196   2.835   0.0125 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.905 on 15 degrees of freedom
Multiple R-squared:  0.7461,    Adjusted R-squared:  0.7123 
F-statistic: 22.04 on 2 and 15 DF,  p-value: 3.424e-05
summary(Fat.lm.2)

Call:
lm(formula = Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6756 -2.8862 -0.2464  1.9100  9.1641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  20.1116     6.2395   3.223  0.00613 **
GenderM     -29.2692    10.4098  -2.812  0.01386 * 
Age           0.2401     0.1204   1.994  0.06600 . 
GenderM:Age   0.5725     0.2893   1.978  0.06790 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.488 on 14 degrees of freedom
Multiple R-squared:  0.8016,    Adjusted R-squared:  0.7591 
F-statistic: 18.86 on 3 and 14 DF,  p-value: 3.455e-05

Task: Fitted model equation of chosen model

What is the preferred model’s equation?

Solution

Since we are talking about a one degree of freedom difference, the p-value from anova(Fat.lm.1, Fat.lm.2) is the same as the p-value for the interaction term in Fat.lm.2, namely 0.0679. So we don’t use the interaction model.

Similarly the difference between the first two models is one df, so the p-value from anova(Fat.lm.0, Fat.lm.1) would be the same as for the coefficient of GenderM, namely 0.0182.
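This one-degree-of-freedom equivalence is easy to check in R (a quick sketch, assuming the fitted models above are in the workspace): the F statistic from the anova comparison is the square of the corresponding t statistic.

# squared t value for GenderM in Fat.lm.1 equals the F statistic in anova(Fat.lm.0, Fat.lm.1)
summary(Fat.lm.1)$coefficients["GenderM", "t value"]^2   # 7.0157, as in the anova table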

So we need to use the second model.

The equation is Fat = 15.0708 - 9.7914 * GenderM + 0.3392 * Age

or Fat = 15.0708 + 0.3392 * Age for females and

Fat = 5.2794 + 0.3392 * Age for males.
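The gender-specific intercepts and common slope can also be read directly off the fitted model; a small sketch:

# intercepts and common slope implied by Fat.lm.1
b <- coef(Fat.lm.1)
c(female.intercept = unname(b["(Intercept)"]),
  male.intercept   = unname(b["(Intercept)"] + b["GenderM"]),
  slope            = unname(b["Age"]))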

---
title: "Lecture 26: Comparison of general linear models"
subtitle: 161.251 Regression Modelling
author: "Presented by Jonathan Godfrey <a.j.godfrey@massey.ac.nz>"  
date: "Week 9 of Semester 2, `r lubridate::year(lubridate::now())`"
output:
  html_document:
    code_download: true
    theme: yeti
    highlight: tango
  html_notebook:
    code_download: true
    theme: yeti
    highlight: tango
  ioslides_presentation:
    widescreen: true
    smaller: true
  word_document: default
  slidy_presentation: 
    theme: yeti
    highlight: tango
  pdf_document: default
---





[View the latest recording of this lecture](https://R-Resources.massey.ac.nz/videos/251L26.mp4)
<!--- Data is on
https://r-resources.massey.ac.nz/data/161251/
--->

```{r setup, purl=FALSE, include=FALSE}
library(knitr)
opts_chunk$set(dev=c("png", "pdf"))
opts_chunk$set(fig.height=6, fig.width=7, fig.path="Figures/", fig.alt="unlabelled")
opts_chunk$set(comment="", fig.align="center", tidy=TRUE)
options(knitr.kable.NA = '')
library(tidyverse)
library(broom)
```


<!--- Do not edit anything above this line. --->

In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.

The comparison of nested general linear models using *F* tests follows
    naturally.

## Natural ordering of GLMs


![image](../graphics/glm_comp.png)

## Comparison of Nested Linear Models

For any linear model $M0$ nested within $M1$, we can test:

* $H_0$: $M0$ adequate vs

* $H_1$: $M0$ not adequate.


using an *F* test.

The hypotheses can be stated in terms of parameters of model $M1$.

The *F* test statistic is

$$F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}$$ where *d* is
    the difference in the number of (regression) parameters between the
    models, and *r* is the residual degrees of freedom for $M1$.



If $H_0$ is correct then *F* has an *F* distribution with $d,r$
    degrees of freedom.

Extremely large values of the test statistic provide evidence against $H_0$.

Hence if $f_{obs}$ is the observed value of the test statistic, and *X* is a random variable from an $F_{d,r}$ distribution, then the *P*-value is given by $P = P(X \ge f_{obs})$.

This test can be carried out in R using `anova(m0.lm, m1.lm)` where `m0.lm` and `m1.lm` are respectively the $M0$ and $M1$ fitted models.

## The Samara Data revisited

Recall that we were modelling the mean speed of fall of samara
    (fruit from a maple tree) as a function of the disk loading (a
    covariate) and the tree from which it fell (a factor) for each
    fruit.

We fitted separate linear regressions for each of the three trees.

We shall now consider simpler models:
    
- Parallel regressions for each tree (i.e. equal slope for `Load` for all trees);
- A single regression model for every fruit (i.e. equal slope and intercept for each tree).
    
We shall compare these models with each other, and with the separate regressions model.


### Different Regressions Model

`r xfun::embed_file("../../data/samara.csv")`

```{r Samara.lm.m2, echo=-1, eval=-2}
Samara <- read.csv(file="../../data/samara.csv", header=TRUE)
Samara <- read.csv(file="samara.csv", header=TRUE)
Samara$Tree <- factor(Samara$Tree)
Samara.lm.m2 <- lm(Velocity ~ Tree + Load + Tree:Load, data = Samara)
summary(Samara.lm.m2)
```


### Parallel Regressions Model

```{r Samara.lm.m1}
Samara.lm.m1 <- lm(Velocity ~ Tree + Load, data=Samara)
summary(Samara.lm.m1)
```


### Common Regressions Model

```{r Samara.lm.m0}
Samara.lm.m0 <- lm(Velocity ~ Load, data=Samara)
summary(Samara.lm.m0)
```

### Comparing Models

```{r }
anova(Samara.lm.m1, Samara.lm.m2)
anova(Samara.lm.m0, Samara.lm.m1)
anova(Samara.lm.m0, Samara.lm.m2)
```

 

### Comments

The three fitted models are as follows:

- `Samara.lm.m2` is the separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and an RSS of *0.16549*.
- `Samara.lm.m1` is the parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and an RSS of *0.20344*.
- `Samara.lm.m0` is the single regression model. It has 2 regression parameters (1 for intercept, 1 for slope) and an RSS of *0.21476*.

Comparison of $M2$ and $M1$ gives a very borderline result (*P=0.05011*). We would only just retain $M1$ if we were working at the 5% significance level.

Comparison of $M1$ and $M0$ provides no evidence to prefer
    $M1$ (*P=0.4319*).

Comparison of $M2$ and $M0$ shows that $M0$ should be retained
    at the 5% significance level (though there is weak evidence in
    favour of $M2$).

#### What happens if you put all three models in the same anova? 
```{r }
anova(Samara.lm.m0,  Samara.lm.m1, Samara.lm.m2)
```

We see that the p-value for comparing M0 and M1 is now *0.38306*  instead of *0.4319*. Why is there a change in the p-value? 

The reason is that the denominator of the F test is based on the biggest model to hand. Here that is M2, so the denominator uses *RSS=0.16549* on *29* df, whereas in the previous two-model comparison the biggest model was M1, with *RSS=0.20344* on *31* df. This difference in the denominator is enough to change the p-value to *0.38306*. 

The lesson then is that if we are going to quote a p-value then we need to decide once and for all whether to use the `different slopes` model or not. 

#### Fitted model is

Overall, it seems that the single regression model is probably adequate.

$$E[ \mbox{Velocity}] = -0.093 + 5.82  \mbox{Load}$$

## Comparison of Models for Body Fat


In this section, we are concerned with modelling a small data set on human body fat.

The response is percentage fat; `Age` is a numerical covariate and `Gender` is a factor.

Various model comparisons and model summaries are provided. Decide which is the best model, and write down the fitted model equation for males and females.


![image](../graphics/bodyfat.jpg)

  

### Data for Task 

`r xfun::embed_file("../../data/fat.csv")`

```{r getFat, echo=-1, eval=-2}
Fat <- read.csv(file="../../data/fat.csv", header=TRUE)
Fat <- read.csv(file="fat.csv", header=TRUE)
head(Fat)
```

### Coplot for Data from Task 


The function `coplot()`  produces a **conditioning plot**.

Syntax is `coplot(y ~ x | A)`
    

This plots `y` against `x` for each level of factor `A`.

```{r}
coplot(Percent.Fat ~ Age | Gender, data = Fat)
```

A similar result is obtained using `ggplot()` with the `group` aesthetic specified. (not shown)

### R code for model fitting and comparisons

```{r fatModels}
Fat.lm.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)
anova(Fat.lm.0, Fat.lm.1)
anova(Fat.lm.1, Fat.lm.2)
summary(Fat.lm.0)
summary(Fat.lm.1)
summary(Fat.lm.2)
```

### Task: Fitted model equation of chosen model

What is the preferred model's equation?


##### Solution

Since we are talking about a one degree of freedom difference, the p-value from 
anova(Fat.lm.1, Fat.lm.2) is the same as the p-value for the interaction term in Fat.lm.2, namely 0.0679.  So we don't use the interaction model.  

Similarly the difference between the first two models is one df, so the p-value from anova(Fat.lm.0, Fat.lm.1) would be the same as for the coefficient of GenderM, namely 0.0182. 

So we need to use the second model.  

The equation is Fat = 15.0708 - 9.7914 * GenderM + 0.3392 * Age

or Fat = 15.0708 + 0.3392 * Age for females and

Fat = 5.2794 + 0.3392 * Age for males.
