Download the R markdown file for this lecture.

In previous lectures we have considered model comparison of nested regression models, and nested ANOVA type models.

The comparison of nested general linear models using F tests follows naturally.

Natural ordering of GLMs

image

Comparison on Nested Linear Models

For any linear model M0 nested within M1, we can test

\[H_0:~~\mbox{*M0* adequate}~~~~~~\mbox{versus}~~~~~H_1:~~\mbox{*M0* not adequate}\]

using an F test.

The hypotheses can be stated in terms of parameters of model M1.

The F test statistic is

\[F = \frac{[RSS_{M0} - RSS_{M1}]/d}{RSS_{M1}/r}\] where d is the difference in the number of (regression) parameters between the models, and r is the residual degrees of freedom for M1.

Comparison on Nested Linear Models

continued

If H0 is correct then F has an F distribution with \(d,r\) degrees of freedom.

Extreme large values of the test statistic provide evidence against H0.

Hence if \(f_{obs}\) is the observed value of the test statistic, and X is a random variable from an \(F_{d,r}\) distribution, then the P-value is given by \(P= P(X \ge f_{obs})\)

This test can be carried out in R using anova(m0.lm, m1.lm)

where `m0.lm` and `m1.lm` are respectively the *M0* and *M1*
fitted models.

The Samara Data Redux

Recall that we were modelling the mean speed of fall of samara (fruit from a maple tree) as a function of the disk loading (a covariate) and the tree from which it fell (a factor) for each fruit.

We fitted separate linear regressions for each of the three trees.

We shall now consider simpler models:

Different Regressions Model

`Download samara.csv

## Samara <- read.csv(file = "samara.csv", header = TRUE)
Samara$Tree <- factor(Samara$Tree)
Samara.lm.m2 <- lm(Velocity ~ Tree/Load, data = Samara)
summary(Samara.lm.m2)

Call:
lm(formula = Velocity ~ Tree/Load, data = Samara)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.120023 -0.049465 -0.001298  0.049938  0.145571 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5414     0.2632   2.057   0.0488 *  
Tree2        -0.8408     0.3356  -2.505   0.0181 *  
Tree3        -0.2987     0.4454  -0.671   0.5078    
Tree1:Load    3.0629     1.1599   2.641   0.0132 *  
Tree2:Load    6.7971     0.9511   7.147 7.26e-08 ***
Tree3:Load    3.8834     1.9672   1.974   0.0580 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07554 on 29 degrees of freedom
Multiple R-squared:  0.8436,    Adjusted R-squared:  0.8167 
F-statistic: 31.29 on 5 and 29 DF,  p-value: 7.656e-11

Parallel Regressions Model

Samara.lm.m1 <- lm(Velocity ~ Tree + Load, data = Samara)
summary(Samara.lm.m1)

Call:
lm(formula = Velocity ~ Tree + Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13572 -0.06027 -0.01576  0.05973  0.17130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07561    0.16871   0.448    0.657    
Tree2       -0.01047    0.03440  -0.304    0.763    
Tree3       -0.05879    0.04629  -1.270    0.213    
Load         5.12257    0.73875   6.934 8.88e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08101 on 31 degrees of freedom
Multiple R-squared:  0.8078,    Adjusted R-squared:  0.7892 
F-statistic: 43.43 on 3 and 31 DF,  p-value: 3.261e-11

Common Regressions Model

Samara.lm.m0 <- lm(Velocity ~ Load, data = Samara)
summary(Samara.lm.m0)

Call:
lm(formula = Velocity ~ Load, data = Samara)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16168 -0.05332 -0.01511  0.05528  0.17086 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.09326    0.10743  -0.868    0.392    
Load         5.82019    0.51119  11.386  5.7e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08067 on 33 degrees of freedom
Multiple R-squared:  0.7971,    Adjusted R-squared:  0.7909 
F-statistic: 129.6 on 1 and 33 DF,  p-value: 5.704e-13

Comparing Models

anova(Samara.lm.m1, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Tree + Load
Model 2: Velocity ~ Tree/Load
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     31 0.20344                             
2     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

continued

anova(Samara.lm.m0, Samara.lm.m2)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Tree/Load
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     33 0.21476                              
2     29 0.16549  4  0.049272 2.1585 0.09885 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments

Models fitted as follows:

  • Samara.lm.m2 is separate regressions model. It has 6 regression parameters (3 for intercepts, 3 for slopes) and RSS of 0.165492.
  • Samara.lm.m1 is parallel regressions model. It has 4 regression parameters (3 for intercepts, 1 for slope) and RSS of 0.203441.
  • Samara.lm.m0 is single regressions model. It has 2 regression parameters (1 for intercept, 1 for slope) and RSS of 0.214763.

Comparison of M2 and M1 gives very borderline result (P=0.05011). Just retain M1 if we were working at a 5% significance level.

Comparison of M1 and M0 provides no evidence to prefer M1 (P=0.4319).

Comparison of M2 and M0 shows that M0 should be retained at the 5% significance level (though there is weak evidence in favour of M2).

Overall, it seems that single regression is probably adequate.

Fitted model is

\[E[ \mbox{Velocity}] = -0.093 + 5.82 \mbox{Load}\]

Comparison of Models for Body Fat

Concerned with modelling a small data set on human body fat.

Response is percentage fat.
Age is a numerical covariate.
Gender is a factor.

Various model comparisons and model summaries provided.

Decide which is the best model, and write down the fitted model equation for males and females.

image

R Code for Task

Download fat.csv

## Fat <- read.csv(file = "fat.csv", header = TRUE)
Fat
   Age Percent.Fat Gender
1   23         9.5      M
2   23        27.9      F
3   27         7.8      M
4   27        17.8      M
5   39        31.4      F
6   41        25.9      F
7   45        27.4      M
8   49        25.2      F
9   50        31.1      F
10  53        34.7      F
11  53        42.0      F
12  54        29.1      F
13  56        32.5      F
14  57        30.3      F
15  58        33.0      F
16  58        33.8      F
17  60        41.1      F
18  61        34.5      F

Coplot for Data from Task

The function coplot() produces a conditioning plot.

Syntax is coplot(y   x | A)

This plots y against x for each level of factor A.

Fat.lm.0 <- lm(Percent.Fat ~ Age, data = Fat)
Fat.lm.1 <- lm(Percent.Fat ~ Gender + Age, data = Fat)
Fat.lm.2 <- lm(Percent.Fat ~ Gender/Age, data = Fat)
anova(Fat.lm.0, Fat.lm.1)
Analysis of Variance Table

Model 1: Percent.Fat ~ Age
Model 2: Percent.Fat ~ Gender + Age
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     16 529.66                              
2     15 360.88  1    168.79 7.0157 0.01824 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Fat.lm.1, Fat.lm.2)
Analysis of Variance Table

Model 1: Percent.Fat ~ Gender + Age
Model 2: Percent.Fat ~ Gender/Age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)  
1     15 360.88                             
2     14 282.02  1    78.853 3.9144 0.0679 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(Fat.lm.0)

Call:
lm(formula = Percent.Fat ~ Age, data = Fat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2166  -3.3214  -0.8424   1.9466  12.0753 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2209     5.0762   0.635    0.535    
Age           0.5480     0.1056   5.191 8.93e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.754 on 16 degrees of freedom
Multiple R-squared:  0.6274,    Adjusted R-squared:  0.6041 
F-statistic: 26.94 on 1 and 16 DF,  p-value: 8.93e-05
summary(Fat.lm.1)

Call:
lm(formula = Percent.Fat ~ Gender + Age, data = Fat)

Residuals:
   Min     1Q Median     3Q    Max 
-6.638 -3.455 -1.103  3.297  8.952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  15.0708     6.2243   2.421   0.0286 *
GenderM      -9.7914     3.6966  -2.649   0.0182 *
Age           0.3392     0.1196   2.835   0.0125 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.905 on 15 degrees of freedom
Multiple R-squared:  0.7461,    Adjusted R-squared:  0.7123 
F-statistic: 22.04 on 2 and 15 DF,  p-value: 3.424e-05
summary(Fat.lm.2)

Call:
lm(formula = Percent.Fat ~ Gender/Age, data = Fat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6756 -2.8862 -0.2464  1.9100  9.1641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  20.1116     6.2395   3.223  0.00613 **
GenderM     -29.2692    10.4098  -2.812  0.01386 * 
GenderF:Age   0.2401     0.1204   1.994  0.06600 . 
GenderM:Age   0.8125     0.2631   3.088  0.00802 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.488 on 14 degrees of freedom
Multiple R-squared:  0.8016,    Adjusted R-squared:  0.7591 
F-statistic: 18.86 on 3 and 14 DF,  p-value: 3.455e-05

Task: Fitted model equation of chosen model

What is the preferred model’s equation?