Download the R markdown file for this lecture.



N.B. We use a command from the broom package in the example.

library(broom)

and to get rid of the pesky scientific notation for P values in our paragraphs, we use an additional function of our own:

PrettyPVal = function(x, digits = 4) {
    # this function gets rid of scientific notation when we want P values in the
    # paragraphs.
    Rounded = round(x, digits)
    Threshold = 10^(-digits)
    return(ifelse(Rounded < Threshold, paste0("P<", format(Threshold, scientific = FALSE)), 
        paste0("P=", Rounded)))
}

Sometimes it is unclear as to whether a particular explanatory variable should be regarded as a factor (categorical explanatory variable) or a numerical covariate.

For example, should the effect of a drug be modelled using a four level factor (zero, low, medium and high doses) or as a numeric variable (0, 10, 20, 30 ml doses)?

We will explore this issue in this lecture, and develop an appropriate methodology to compare the modelling approaches.

Vitamin C and Tooth Growth

We use an experiment into the effects of vitamin C on tooth growth to illustrate this question.

30 guinea pigs were divided (at random) into three groups of ten and dosed with vitamin C in mg (administered in orange juice). Group 1 dose was low (0.5mg), group 2 dose was medium (1mg) and group 3 dose was high (2mg). Length of odontoblasts (teeth) was measured as response variable.

a guinea pig

Exploratory Data Analysis

We fetch the data from the datasets package, and then extract only the guinea pigs receiving the Vitamin C supplement. (We could look at the other group later.)

data(ToothGrowth)
TG = ToothGrowth %>% filter(supp != "OJ")
plot(len ~ as.factor(dose), data = TG, col = "grey")

bartlett.test(len ~ as.factor(dose), data = TG)

    Bartlett test of homogeneity of variances

data:  len by as.factor(dose)
Bartlett's K-squared = 4.5119, df = 2, p-value = 0.1048

Initial Comments

The as.factor() function tells R to regard the (numeric variable) dose as a factor.

The levels are ordered in the natural way.

The box plot is slightly suggestive of a change of dispersion (spread) in the response between the groups, but Bartlett’s test does not provide clear evidence of heteroscedasticity (P-value is P=0.1048).

From the plot it seems highly plausible that mean tooth growth depends on vitamin C levels.

Question

Could the dependence of tooth growth on vitamin C be adequately modelled by a linear regression of len on dose, or is a factorial model with len and as.factor(dose) to be preferred?

Nesting of Linear Effects

Consider a one factor experiment in which the factor levels are defined in terms of a numerical variable x. In other words, all individuals at factor level j take the same value of x for this variable (\(j=1,2,\ldots,K\)).

We could fit a simple linear regression model to the data:

\[M0: Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i~~~~~~(i=1,2,\ldots,n)\]

We could fit a standard one-way factorial model which can be written as a multiple regression model:

\[M1: Y_i = \mu + \alpha_2 z_{i2} + \cdots + \alpha_K z_{iK}+\varepsilon_i~~~~~~(i=1,2,\ldots,n)\]

where \(z_{ij}\) is the indicator variable for unit i being at factor level j as usual.

We can show that model M0 is nested within model M1.

It follows that we can compare models M0 (the simpler one) and M1 (the more complex one) using an F test.

Specifically, we will test H0: model M0 adequate; versus H1: model M0 not adequate. using the F test statistic

\[F = \frac{[RSS_{M0} - RSS_{M1}]/(K-2)}{RSS_{M1}/(n-K)}.\]

M1 has n-K residual degrees of freedom.

The difference in the number of parameters between the models is K-2.

Hence the F-statistic has K-2, n-K degrees of freedom.

If fobs is the observed value of the test statistic, and X is a random variable from an FK-2,n-K distribution, then the P-value is given by

\[P= P(X \ge f_{obs})\]

Retention of H0 indicates that a simple linear regression model is adequate; i.e. the effect of the variable x is adequately modelled as linear.

Rejection of H0 indicates that any association between y and x is non-linear.

Analysis of the Guinea Pig Tooth Data

R Code for Model M0

Tooth.lm.0 <- lm(len ~ dose, data = TG)
summary(Tooth.lm.0)

Call:
lm(formula = len ~ dose, data = TG)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2264 -2.6029  0.0814  2.2288  7.4893 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.295      1.427   2.309   0.0285 *  
dose          11.716      1.079  10.860 1.51e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.685 on 28 degrees of freedom
Multiple R-squared:  0.8082,    Adjusted R-squared:  0.8013 
F-statistic: 117.9 on 1 and 28 DF,  p-value: 1.509e-11

R Code for Model M1

Tooth.lm.1 <- lm(len ~ as.factor(dose), data = TG)
summary(Tooth.lm.1)

Call:
lm(formula = len ~ as.factor(dose), data = TG)

Residuals:
   Min     1Q Median     3Q    Max 
-7.640 -2.248 -0.455  2.027  7.760 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)         7.980      1.109   7.196 9.71e-08 ***
as.factor(dose)1    8.790      1.568   5.605 6.02e-06 ***
as.factor(dose)2   18.160      1.568  11.580 5.58e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.507 on 27 degrees of freedom
Multiple R-squared:  0.8324,    Adjusted R-squared:   0.82 
F-statistic: 67.07 on 2 and 27 DF,  p-value: 3.357e-11

R Code to Compare Models

anova(Tooth.lm.0, Tooth.lm.1)
Analysis of Variance Table

Model 1: len ~ dose
Model 2: len ~ as.factor(dose)
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1     28 380.15                              
2     27 332.00  1    48.146 3.9155 0.05813 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions

The relationship between tooth length and vitamin C is found to be statistically significant using both the linear regression model (P<0.0001) and the factor model (P<0.0001).

The F test for comparing the linear effect and factor model gives an F-statistic of \(f=3.92\) on \(K - 2 = 1\) and \(n - K = 27\) degrees of freedom.

The corresponding P-value is \(P=0.058\). We would retain H0 at the 5% significance level, but there is certainly some evidence to doubt that should be modelled as a linear effect.

Analysis of Genetic Data

A haplotype is a unit of genetic information contained on a single chromosome. Because chromosomes come in pairs (one inherited from each parent), an individual may have either zero, one or two copies of any given haplotype. The effects of a haplotype on some phenotype (measurable physical attribute) may or may not be linear in the number of copies of the haplotype.

a chromosome

A measure of cholesterol was recorded on 35 subjects. Of these, 8 have zero copies, 16 had one copy, and 11 had two copies of a particular haplotype.

Fitting the number of copies of the haplotype as a linear effect gave a residual sum of squares of \(153.2\).

Fitting the number of copies of the haplotype as a factor gave a residual sum of squares of \(126.2\).

Question

What is the value of the F-statistic for testing whether the effect of haplotype is linear in the number of copies of the haplotype?

What are the degrees of freedom for this test statistic?