Lecture 26 Comparison of general linear models
In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.
The comparison of nested general linear models using F tests follows naturally.
26.2 Comparison on Nested Linear Models
For any linear model \(M0\) nested within \(M1\), we can test:
\(H_0\): \(M0\) adequate vs
\(H_1\): \(M0\) not adequate.
using an F test.
The hypotheses can be stated in terms of parameters of model \(M1\).
The F test statistic is
\[F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}\] where d is the difference in the number of (regression) parameters between the models, and r is the residual degrees of freedom for \(M1\).
If \(H_0\) is correct then F has an F distribution with \(d,r\) degrees of freedom.
Extreme large values of the test statistic provide evidence against \(H_0\).
Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{d,r}\) distribution, then the P-value is given by \(P= P(X \ge f_{obs})\)
This test can be carried out in R using anova(m0.lm, m1.lm)
where m0.lm
and m1.lm
are respectively the \(M0\) and \(M1\) fitted models.
26.3 The Samara Data revisited
Recall that we were modelling the mean speed of fall of samara (fruit from a maple tree) as a function of the disk loading (a covariate) and the tree from which it fell (a factor) for each fruit.
We fitted separate linear regressions for each of the three trees.
We shall now consider simpler models:
- Parallel regressions for each tree (i.e. equal slope for
Load
for all trees); - A single regression model for every fruit (i.e. equal slope and intercept for each tree).
We shall compare these models with each other, and with the separate regressions model.
26.3.1 Different Regressions Model
## Samara <- read.csv(file = "samara.csv", header = TRUE)
Samara$Tree <- factor(Samara$Tree)
Samara.lm.m2 <- lm(Velocity ~ Tree + Load + Tree:Load, data = Samara)
summary(Samara.lm.m2)
Call:
lm(formula = Velocity ~ Tree + Load + Tree:Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.120023 -0.049465 -0.001298 0.049938 0.145571
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5414 0.2632 2.057 0.0488 *
Tree2 -0.8408 0.3356 -2.505 0.0181 *
Tree3 -0.2987 0.4454 -0.671 0.5078
Load 3.0629 1.1599 2.641 0.0132 *
Tree2:Load 3.7343 1.5000 2.490 0.0188 *
Tree3:Load 0.8205 2.2837 0.359 0.7220
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.07554 on 29 degrees of freedom
Multiple R-squared: 0.8436, Adjusted R-squared: 0.8167
F-statistic: 31.29 on 5 and 29 DF, p-value: 7.656e-11
26.3.2 Parallel Regressions Model
Call:
lm(formula = Velocity ~ Tree + Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.13572 -0.06027 -0.01576 0.05973 0.17130
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07561 0.16871 0.448 0.657
Tree2 -0.01047 0.03440 -0.304 0.763
Tree3 -0.05879 0.04629 -1.270 0.213
Load 5.12257 0.73875 6.934 8.88e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08101 on 31 degrees of freedom
Multiple R-squared: 0.8078, Adjusted R-squared: 0.7892
F-statistic: 43.43 on 3 and 31 DF, p-value: 3.261e-11
26.3.3 Common Regressions Model
Call:
lm(formula = Velocity ~ Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.16168 -0.05332 -0.01511 0.05528 0.17086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09326 0.10743 -0.868 0.392
Load 5.82019 0.51119 11.386 5.7e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08067 on 33 degrees of freedom
Multiple R-squared: 0.7971, Adjusted R-squared: 0.7909
F-statistic: 129.6 on 1 and 33 DF, p-value: 5.704e-13
26.3.4 Comparing Models
Analysis of Variance Table
Model 1: Velocity ~ Tree + Load
Model 2: Velocity ~ Tree + Load + Tree:Load
Res.Df RSS Df Sum of Sq F Pr(>F)
1 31 0.20344
2 29 0.16549 2 0.037949 3.325 0.05011 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load
Res.Df RSS Df Sum of Sq F Pr(>F)
1 33 0.21476
2 31 0.20344 2 0.011322 0.8626 0.4319
Analysis of Variance Table
Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree + Load + Tree:Load
Res.Df RSS Df Sum of Sq F Pr(>F)
1 33 0.21476
2 29 0.16549 4 0.049272 2.1585 0.09885 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
26.4 Comparison of Models for Body Fat
In this section, we are concerned with modelling a small data set on human body fat.
The response is percentage fat. Age is a numerical covariate, gender is a factor.
Various model comparisons and model summaries are provided. Decide which is the best model, and write down the fitted model equation for males and females.
26.4.1 Data for Task
X Age Percent.Fat Gender
1 1 23 9.5 M
2 2 23 27.9 F
3 3 27 7.8 M
4 4 27 17.8 M
5 5 39 31.4 F
6 6 41 25.9 F
26.4.2 Coplot for Data from Task
The function coplot()
produces a conditioning plot.
Syntax is coplot(y ~ x | A)
This plots y
against x
for each level of factor A
.
A similar result si obtained using ggplot()
with the group
parameter specified. (not shown)
26.4.3 R code for models fitting and comparisons
Fat.lm.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)
anova(Fat.lm.0, Fat.lm.1)
Analysis of Variance Table
Model 1: Percent.Fat ~ Age
Model 2: Percent.Fat ~ Gender + Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 529.66
2 15 360.88 1 168.79 7.0157 0.01824 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: Percent.Fat ~ Gender + Age
Model 2: Percent.Fat ~ Gender + Age + Gender:Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15 360.88
2 14 282.02 1 78.853 3.9144 0.0679 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Percent.Fat ~ Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
Age 0.5480 0.1056 5.191 8.93e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.754 on 16 degrees of freedom
Multiple R-squared: 0.6274, Adjusted R-squared: 0.6041
F-statistic: 26.94 on 1 and 16 DF, p-value: 8.93e-05
Call:
lm(formula = Percent.Fat ~ Gender + Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-6.638 -3.455 -1.103 3.297 8.952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.0708 6.2243 2.421 0.0286 *
GenderM -9.7914 3.6966 -2.649 0.0182 *
Age 0.3392 0.1196 2.835 0.0125 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.905 on 15 degrees of freedom
Multiple R-squared: 0.7461, Adjusted R-squared: 0.7123
F-statistic: 22.04 on 2 and 15 DF, p-value: 3.424e-05
Call:
lm(formula = Percent.Fat ~ Gender + Age + Gender:Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-6.6756 -2.8862 -0.2464 1.9100 9.1641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.1116 6.2395 3.223 0.00613 **
GenderM -29.2692 10.4098 -2.812 0.01386 *
Age 0.2401 0.1204 1.994 0.06600 .
GenderM:Age 0.5725 0.2893 1.978 0.06790 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.488 on 14 degrees of freedom
Multiple R-squared: 0.8016, Adjusted R-squared: 0.7591
F-statistic: 18.86 on 3 and 14 DF, p-value: 3.455e-05
26.4.4 Task: Fitted model equation of chosen model
What is the preferred model’s equation?
26.4.4.0.1 Solution
Since we are talking about a one degree of freedom difference, the p-value from anova(Fat.lm.1, Fat.lm.2) is the same as the p-value for the interaction term in Fat.lm.2, namely 0.0679. So we don’t use the interaction model.
Similarly the difference between the first two models is one df, so the p-value from anova(Fat.lm.0, Fat.lm.1) would be the same as for the coefficient of GenderM, namely 0.0182.
So we need to use the second model.
The equation is Fat = 15.0708 -9.7914 * GenderM + 0.3392 * Age
or Fat = 15.0708 + 0.3392 * Age for females and
Fat = 5.2794 + 0.3392* Age for males.
26.3.5 Comments
Models fitted as follows:
Samara.lm.m2
is separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and RSS of 0.16549.Samara.lm.m1
is parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and RSS of 0.20344.Samara.lm.m0
is single regressions model. It has 2 regression parameters (1 for intercept, 1 for slope) and RSS of 0.21476.Comparison of \(M2\) and \(M1\) gives very borderline result (P=0.05011). We would just barely retain \(M1\) if we were working at a 5% significance level.
Comparison of \(M1\) and \(M0\) provides no evidence to prefer \(M1\) (P=0.4319).
Comparison of \(M2\) and \(M0\) shows that \(M0\) should be retained at the 5% significance level (though there is weak evidence in favour of \(M2\)).
26.3.5.1 What happens if you put all three models in the same anova?
We see that the p-value for comparing M0 and M1 is now 0.38306 instead of 0.4319. Why is there a change in the p-value?
The reason is that the denominator for the F test is based on the biggest model to hand (M2), namely using RSS=0.16549 on 29 df, whereas for the previous analysis the biggest model to hand was M1, using RSS=0.20344 on 31 df. This is enough of a difference to change the p-value to 0.38306.
The lesson then is that if we are going to quote a p-value then we need to decide once and for all whether to use the
different slopes
model or not.26.3.5.2 Fitted model is
Overall, it seems that single regression is probably adequate.
\[E[ \mbox{Velocity}] = -0.093 + 5.82 \mbox{Load}\]