Download the R markdown file for this lecture.
In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.
The comparison of nested general linear models using F tests follows naturally.
For any linear model M0 nested within M1, we can test
\[H_0:~~\mbox{*M0* adequate}~~~~~~\mbox{versus}~~~~~H_1:~~\mbox{*M0* not adequate}\]
using an F test.
The hypotheses can be stated in terms of parameters of model M1.
The F test statistic is
\[F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}\] where d is the difference in the number of (regression) parameters between the models, and r is the residual degrees of freedom for M1.
continued
If H0 is correct then F has an F distribution with \(d,r\) degrees of freedom.
Extreme large values of the test statistic provide evidence against H0.
Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{d,r}\) distribution, then the P-value is given by \(P= P(X \ge f_{obs})\)
This test can be carried out in R using anova(m0.lm, m1.lm)
where `m0.lm` and `m1.lm` are respectively the *M0* and *M1*
fitted models.
Recall that we were modelling the mean speed of fall of samara (fruit from a maple tree) as a function of the disk loading (a covariate) and the tree from which it fell (a factor) for each fruit.
We fitted separate linear regressions for each of the three trees.
We shall now consider simpler models:
Parallel regressions for each tree (i.e. equal slope for Load
for all trees);
A single regression model for every fruit (i.e. equal slope and intercept for each tree).
We shall compare these models with each other, and with the separate regressions model.
## Samara <- read.csv(file = "samara.csv", header = TRUE)
$Tree <- factor(Samara$Tree)
Samara<- lm(Velocity ~ Tree/Load, data = Samara)
Samara.lm.m2 summary(Samara.lm.m2)
Call:
lm(formula = Velocity ~ Tree/Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.120023 -0.049465 -0.001298 0.049938 0.145571
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5414 0.2632 2.057 0.0488 *
Tree2 -0.8408 0.3356 -2.505 0.0181 *
Tree3 -0.2987 0.4454 -0.671 0.5078
Tree1:Load 3.0629 1.1599 2.641 0.0132 *
Tree2:Load 6.7971 0.9511 7.147 7.26e-08 ***
Tree3:Load 3.8834 1.9672 1.974 0.0580 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.07554 on 29 degrees of freedom
Multiple R-squared: 0.8436, Adjusted R-squared: 0.8167
F-statistic: 31.29 on 5 and 29 DF, p-value: 7.656e-11
<- lm(Velocity ~ Tree + Load, data = Samara)
Samara.lm.m1 summary(Samara.lm.m1)
Call:
lm(formula = Velocity ~ Tree + Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.13572 -0.06027 -0.01576 0.05973 0.17130
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07561 0.16871 0.448 0.657
Tree2 -0.01047 0.03440 -0.304 0.763
Tree3 -0.05879 0.04629 -1.270 0.213
Load 5.12257 0.73875 6.934 8.88e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08101 on 31 degrees of freedom
Multiple R-squared: 0.8078, Adjusted R-squared: 0.7892
F-statistic: 43.43 on 3 and 31 DF, p-value: 3.261e-11
<- lm(Velocity ~ Load, data = Samara)
Samara.lm.m0 summary(Samara.lm.m0)
Call:
lm(formula = Velocity ~ Load, data = Samara)
Residuals:
Min 1Q Median 3Q Max
-0.16168 -0.05332 -0.01511 0.05528 0.17086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09326 0.10743 -0.868 0.392
Load 5.82019 0.51119 11.386 5.7e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08067 on 33 degrees of freedom
Multiple R-squared: 0.7971, Adjusted R-squared: 0.7909
F-statistic: 129.6 on 1 and 33 DF, p-value: 5.704e-13
anova(Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table
Model 1: Velocity ~ Tree + Load
Model 2: Velocity ~ Tree/Load
Res.Df RSS Df Sum of Sq F Pr(>F)
1 31 0.20344
2 29 0.16549 2 0.037949 3.325 0.05011 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
continued
anova(Samara.lm.m0, Samara.lm.m2)
Analysis of Variance Table
Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree/Load
Res.Df RSS Df Sum of Sq F Pr(>F)
1 33 0.21476
2 29 0.16549 4 0.049272 2.1585 0.09885 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concerned with modelling a small data set on human body fat.
Response is percentage fat.
Age is a numerical covariate.
Gender is a factor.
Various model comparisons and model summaries provided.
Decide which is the best model, and write down the fitted model equation for males and females.
## Fat <- read.csv(file = "fat.csv", header = TRUE)
Fat
Age Percent.Fat Gender
1 23 9.5 M
2 23 27.9 F
3 27 7.8 M
4 27 17.8 M
5 39 31.4 F
6 41 25.9 F
7 45 27.4 M
8 49 25.2 F
9 50 31.1 F
10 53 34.7 F
11 53 42.0 F
12 54 29.1 F
13 56 32.5 F
14 57 30.3 F
15 58 33.0 F
16 58 33.8 F
17 60 41.1 F
18 61 34.5 F
The function coplot()
produces a conditioning plot.
Syntax is coplot(y x | A)
This plots y
against x
for each level of factor A
.
.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender/Age, data = Fat)
Fat.lmanova(Fat.lm.0, Fat.lm.1)
Analysis of Variance Table
Model 1: Percent.Fat ~ Age
Model 2: Percent.Fat ~ Gender + Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 529.66
2 15 360.88 1 168.79 7.0157 0.01824 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Fat.lm.1, Fat.lm.2)
Analysis of Variance Table
Model 1: Percent.Fat ~ Gender + Age
Model 2: Percent.Fat ~ Gender/Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 15 360.88
2 14 282.02 1 78.853 3.9144 0.0679 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(Fat.lm.0)
Call:
lm(formula = Percent.Fat ~ Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
Age 0.5480 0.1056 5.191 8.93e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.754 on 16 degrees of freedom
Multiple R-squared: 0.6274, Adjusted R-squared: 0.6041
F-statistic: 26.94 on 1 and 16 DF, p-value: 8.93e-05
summary(Fat.lm.1)
Call:
lm(formula = Percent.Fat ~ Gender + Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-6.638 -3.455 -1.103 3.297 8.952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.0708 6.2243 2.421 0.0286 *
GenderM -9.7914 3.6966 -2.649 0.0182 *
Age 0.3392 0.1196 2.835 0.0125 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.905 on 15 degrees of freedom
Multiple R-squared: 0.7461, Adjusted R-squared: 0.7123
F-statistic: 22.04 on 2 and 15 DF, p-value: 3.424e-05
summary(Fat.lm.2)
Call:
lm(formula = Percent.Fat ~ Gender/Age, data = Fat)
Residuals:
Min 1Q Median 3Q Max
-6.6756 -2.8862 -0.2464 1.9100 9.1641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.1116 6.2395 3.223 0.00613 **
GenderM -29.2692 10.4098 -2.812 0.01386 *
GenderF:Age 0.2401 0.1204 1.994 0.06600 .
GenderM:Age 0.8125 0.2631 3.088 0.00802 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.488 on 14 degrees of freedom
Multiple R-squared: 0.8016, Adjusted R-squared: 0.7591
F-statistic: 18.86 on 3 and 14 DF, p-value: 3.455e-05
What is the preferred model’s equation?
Comments
Models fitted as follows:
Samara.lm.m2
is separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and RSS of 0.165492.Samara.lm.m1
is parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and RSS of 0.203441.Samara.lm.m0
is single regressions model. It has 2 regression parameters (1 for intercept, 1 for slope) and RSS of 0.214763.Comparison of M2 and M1 gives very borderline result (P=0.05011). Just retain M1 if we were working at a 5% significance level.
Comparison of M1 and M0 provides no evidence to prefer M1 (P=0.4319).
Comparison of M2 and M0 shows that M0 should be retained at the 5% significance level (though there is weak evidence in favour of M2).
Overall, it seems that single regression is probably adequate.
Fitted model is
\[E[ \mbox{Velocity}] = -0.093 + 5.82 \mbox{Load}\]