Lecture 26 Comparison of general linear models

In class version

In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.

The comparison of nested general linear models using F tests follows naturally.

26.1 Natural ordering of GLMs

image
image

26.2 Comparison on Nested Linear Models

For any linear model \(M0\) nested within \(M1\), we can test:

  • \(H_0\): \(M0\) adequate vs

  • \(H_1\): \(M0\) not adequate.

using an F test.

The hypotheses can be stated in terms of parameters of model \(M1\).

The F test statistic is

\[F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}\] where d is the difference in the number of (regression) parameters between the models, and r is the residual degrees of freedom for \(M1\).

26.3 Comparison on Nested Linear Models continued

If \(H_0\) is correct then F has an F distribution with \(d,r\) degrees of freedom.

Extreme large values of the test statistic provide evidence against \(H_0\).

Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{d,r}\) distribution, then the P-value is given by \(P= P(X \ge f_{obs})\)

This test can be carried out in R using anova(m0.lm, m1.lm) where m0.lm and m1.lm are respectively the \(M0\) and \(M1\) fitted models.

26.4 The Samara Data revisited

Recall that we were modelling the mean speed of fall of samara (fruit from a maple tree) as a function of the disk loading (a covariate) and the tree from which it fell (a factor) for each fruit.

We fitted separate linear regressions for each of the three trees.

We shall now consider simpler models:

  • Parallel regressions for each tree (i.e. equal slope for Load for all trees);
  • A single regression model for every fruit (i.e. equal slope and intercept for each tree).

We shall compare these models with each other, and with the separate regressions model.

26.4.1 Different Regressions Model

Download samara.csv

## Samara <- read.csv(file = "samara.csv", header = TRUE)
Samara$Tree <- factor(Samara$Tree)
Samara.lm.m2 <- lm(Velocity ~ Tree + Load + Tree:Load, data = Samara)
summary(Samara.lm.m2)

Call:
lm(formula = Velocity ~ Tree + Load + Tree:Load, data = Samara)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.120023 -0.049465 -0.001298  0.049938  0.145571 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.5414     0.2632   2.057   0.0488 *
Tree2        -0.8408     0.3356  -2.505   0.0181 *
Tree3        -0.2987     0.4454  -0.671   0.5078  
Load          3.0629     1.1599   2.641   0.0132 *
Tree2:Load    3.7343     1.5000   2.490   0.0188 *
Tree3:Load    0.8205     2.2837   0.359   0.7220  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07554 on 29 degrees of freedom
Multiple R-squared:  0.8436,    Adjusted R-squared:  0.8167 
F-statistic: 31.29 on 5 and 29 DF,  p-value: 7.656e-11

26.4.2 Parallel Regressions Model

Samara.lm.m1 <- lm(Velocity ~ Tree + Load, data = Samara)
summary(Samara.lm.m1)

Call:
lm(formula = Velocity ~ Tree + Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13572 -0.06027 -0.01576  0.05973  0.17130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07561    0.16871   0.448    0.657    
Tree2       -0.01047    0.03440  -0.304    0.763    
Tree3       -0.05879    0.04629  -1.270    0.213    
Load         5.12257    0.73875   6.934 8.88e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08101 on 31 degrees of freedom
Multiple R-squared:  0.8078,    Adjusted R-squared:  0.7892 
F-statistic: 43.43 on 3 and 31 DF,  p-value: 3.261e-11

26.4.3 Common Regressions Model

Samara.lm.m0 <- lm(Velocity ~ Load, data = Samara)
summary(Samara.lm.m0)

Call:
lm(formula = Velocity ~ Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16168 -0.05332 -0.01511  0.05528  0.17086 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.09326    0.10743  -0.868    0.392    
Load         5.82019    0.51119  11.386  5.7e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08067 on 33 degrees of freedom
Multiple R-squared:  0.7971,    Adjusted R-squared:  0.7909 
F-statistic: 129.6 on 1 and 33 DF,  p-value: 5.704e-13

26.4.4 Comparing Models

anova(Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Tree + Load
Model 2: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     31 0.20344                             
2     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Samara.lm.m0, Samara.lm.m1)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     33 0.21476                           
2     31 0.20344  2  0.011322 0.8626 0.4319
anova(Samara.lm.m0, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     33 0.21476                              
2     29 0.16549  4  0.049272 2.1585 0.09885 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

26.4.5 Comments

Models fitted as follows:

  • Samara.lm.m2 is separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and RSS of 0.16549.
  • Samara.lm.m1 is parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and RSS of 0.20344.
  • Samara.lm.m0 is single regressions model. It has 2 regression parameters (1 for intercept, 1 for slope) and RSS of 0.21476.

Comparison of \(M2\) and \(M1\) gives very borderline result (P=0.05011). We would just barely retain \(M1\) if we were working at a 5% significance level.

Comparison of \(M1\) and \(M0\) provides no evidence to prefer \(M1\) (P=0.4319).

Comparison of \(M2\) and \(M0\) shows that \(M0\) should be retained at the 5% significance level (though there is weak evidence in favour of \(M2\)).

26.4.5.1 What happens if you put all three models in the same anova?

anova(Samara.lm.m0, Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load
Model 3: Velocity ~ Tree + Load + Tree:Load
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     33 0.21476                             
2     31 0.20344  2  0.011322 0.992 0.38306  
3     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value for comparing M0 and M1 is now 0.38306 instead of 0.4319. Why is there a change in the p-value?

The reason is that the denominator for the F test is based on the biggest model to hand (M2), namely using RSS=0.16549 on 29 df, whereas for the previous analysis the biggest model to hand was M1, using RSS=0.20344 on 31 df. This is enough of a difference to change the p-value to 0.38306.

The lesson then is that if we are going to quote a p-value then we need to decide once and for all whether to use the different slopes model or not.

26.4.5.2 Fitted model is

Overall, it seems that single regression is probably adequate.

\[E[ \mbox{Velocity}] = -0.093 + 5.82 \mbox{Load}\]

26.5 Comparison of Models for Body Fat

In this section, we are concerned with modelling a small data set on human body fat.

The response is percentage fat. Age is a numerical covariate, gender is a factor.

Various model comparisons and model summaries are provided. Decide which is the best model, and write down the fitted model equation for males and females.

image
image

26.5.1 Data for Task

Download fat.csv

## Fat <- read.csv(file = "fat.csv", header = TRUE)
head(Fat)
  X Age Percent.Fat Gender
1 1  23         9.5      M
2 2  23        27.9      F
3 3  27         7.8      M
4 4  27        17.8      M
5 5  39        31.4      F
6 6  41        25.9      F

26.5.2 Coplot for Data from Task

The function coplot() produces a conditioning plot.

Syntax is coplot(y ~ x | A)

This plots y against x for each level of factor A.

coplot(Percent.Fat ~ Age | Gender, data = Fat)

unlabelled

26.5.3 R code for models fitting and comparisons

Fat.lm.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)
anova(Fat.lm.0, Fat.lm.1)
Analysis of Variance Table

Model 1: Percent.Fat ~ Age
Model 2: Percent.Fat ~ Gender + Age
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     16 529.66                              
2     15 360.88  1    168.79 7.0157 0.01824 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Fat.lm.1, Fat.lm.2)
Analysis of Variance Table

Model 1: Percent.Fat ~ Gender + Age
Model 2: Percent.Fat ~ Gender + Age + Gender:Age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)  
1     15 360.88                             
2     14 282.02  1    78.853 3.9144 0.0679 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(Fat.lm.0)

Call:
lm(formula = Percent.Fat ~ Age, data = Fat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2166  -3.3214  -0.8424   1.9466  12.0753 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2209     5.0762   0.635    0.535    
Age           0.5480     0.1056   5.191 8.93e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.754 on 16 degrees of freedom
Multiple R-squared:  0.6274,    Adjusted R-squared:  0.6041 
F-statistic: 26.94 on 1 and 16 DF,  p-value: 8.93e-05
summary(Fat.lm.1)

Call:
lm(formula = Percent.Fat ~ Gender + Age, data = Fat)

Residuals:
   Min     1Q Median     3Q    Max 
-6.638 -3.455 -1.103  3.297  8.952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  15.0708     6.2243   2.421   0.0286 *
GenderM      -9.7914     3.6966  -2.649   0.0182 *
Age           0.3392     0.1196   2.835   0.0125 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.905 on 15 degrees of freedom
Multiple R-squared:  0.7461,    Adjusted R-squared:  0.7123 
F-statistic: 22.04 on 2 and 15 DF,  p-value: 3.424e-05
summary(Fat.lm.2)

Call:
lm(formula = Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6756 -2.8862 -0.2464  1.9100  9.1641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  20.1116     6.2395   3.223  0.00613 **
GenderM     -29.2692    10.4098  -2.812  0.01386 * 
Age           0.2401     0.1204   1.994  0.06600 . 
GenderM:Age   0.5725     0.2893   1.978  0.06790 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.488 on 14 degrees of freedom
Multiple R-squared:  0.8016,    Adjusted R-squared:  0.7591 
F-statistic: 18.86 on 3 and 14 DF,  p-value: 3.455e-05

26.5.4 Task: Fitted model equation of chosen model

What is the preferred model’s equation?

26.5.4.0.1 Solution

Since we are talking about a one degree of freedom difference, the p-value from anova(Fat.lm.1, Fat.lm.2) is the same as the p-value for the interaction term in Fat.lm.2, namely 0.0679. So we don’t use the interaction model.

Similarly the difference between the first two models is one df, so the p-value from anova(Fat.lm.0, Fat.lm.1) would be the same as for the coefficient of GenderM, namely 0.0182.

So we need to use the second model.

The equation is Fat = 15.0708 -9.7914 * GenderM + 0.3392 * Age

or Fat = 15.0708 + 0.3392 * Age for females and

Fat = 5.2794 + 0.3392* Age for males.