Download the R markdown file for this lecture.
N.B. We use a command from the broom
package in the example.
library(broom)
and to get rid of the pesky scientific notation for P values in our paragraphs, we use an additional function of our own:
= function(x, digits = 4) {
PrettyPVal # this function gets rid of scientific notation when we want P values in the
# paragraphs.
= round(x, digits)
Rounded = 10^(-digits)
Threshold return(ifelse(Rounded < Threshold, paste0("P<", format(Threshold, scientific = FALSE)),
paste0("P=", Rounded)))
}
Sometimes it is unclear as to whether a particular explanatory variable should be regarded as a factor (categorical explanatory variable) or a numerical covariate.
For example, should the effect of a drug be modelled using a four level factor (zero, low, medium and high doses) or as a numeric variable (0, 10, 20, 30 ml doses)?
We will explore this issue in this lecture, and develop an appropriate methodology to compare the modelling approaches.
We use an experiment into the effects of vitamin C on tooth growth to illustrate this question.
30 guinea pigs were divided (at random) into three groups of ten and dosed with vitamin C in mg (administered in orange juice). Group 1 dose was low (0.5mg), group 2 dose was medium (1mg) and group 3 dose was high (2mg). Length of odontoblasts (teeth) was measured as response variable.
We fetch the data from the datasets
package, and then extract only the guinea pigs receiving the Vitamin C supplement. (We could look at the other group later.)
data(ToothGrowth)
= ToothGrowth %>% filter(supp != "OJ")
TG plot(len ~ as.factor(dose), data = TG, col = "grey")
bartlett.test(len ~ as.factor(dose), data = TG)
Bartlett test of homogeneity of variances
data: len by as.factor(dose)
Bartlett's K-squared = 4.5119, df = 2, p-value = 0.1048
The as.factor()
function tells R to regard the (numeric variable) dose
as a factor.
The levels are ordered in the natural way.
The box plot is slightly suggestive of a change of dispersion (spread) in the response between the groups, but Bartlett’s test does not provide clear evidence of heteroscedasticity (P-value is P=0.1048).
From the plot it seems highly plausible that mean tooth growth depends on vitamin C levels.
Could the dependence of tooth growth on vitamin C be adequately modelled by a linear regression of len
on dose
, or is a factorial model with len
and as.factor(dose)
to be preferred?
Consider a one factor experiment in which the factor levels are defined in terms of a numerical variable x. In other words, all individuals at factor level j take the same value of x for this variable (\(j=1,2,\ldots,K\)).
We could fit a simple linear regression model to the data:
\[M0: Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i~~~~~~(i=1,2,\ldots,n)\]
We could fit a standard one-way factorial model which can be written as a multiple regression model:
\[M1: Y_i = \mu + \alpha_2 z_{i2} + \cdots + \alpha_K z_{iK}+\varepsilon_i~~~~~~(i=1,2,\ldots,n)\]
where \(z_{ij}\) is the indicator variable for unit i being at factor level j as usual.
We can show that model M0 is nested within model M1.
It follows that we can compare models M0 (the simpler one) and M1 (the more complex one) using an F test.
Specifically, we will test H0: model M0
adequate; versus H1: model M0 not adequate. using the F test statistic
\[F = \frac{[RSS_{M0} - RSS_{M1}]/(K-2)}{RSS_{M1}/(n-K)}.\]
M1 has n-K residual degrees of freedom.
The difference in the number of parameters between the models is K-2.
Hence the F-statistic has K-2, n-K degrees of freedom.
If fobs is the observed value of the test statistic, and X is a random variable from an FK-2,n-K distribution, then the P-value is given by
\[P= P(X \ge f_{obs})\]
Retention of H0 indicates that a simple linear regression model is adequate; i.e. the effect of the variable x is adequately modelled as linear.
Rejection of H0 indicates that any association between y and x is non-linear.
R Code for Model M0
.0 <- lm(len ~ dose, data = TG)
Tooth.lmsummary(Tooth.lm.0)
Call:
lm(formula = len ~ dose, data = TG)
Residuals:
Min 1Q Median 3Q Max
-8.2264 -2.6029 0.0814 2.2288 7.4893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.295 1.427 2.309 0.0285 *
dose 11.716 1.079 10.860 1.51e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.685 on 28 degrees of freedom
Multiple R-squared: 0.8082, Adjusted R-squared: 0.8013
F-statistic: 117.9 on 1 and 28 DF, p-value: 1.509e-11
R Code for Model M1
.1 <- lm(len ~ as.factor(dose), data = TG)
Tooth.lmsummary(Tooth.lm.1)
Call:
lm(formula = len ~ as.factor(dose), data = TG)
Residuals:
Min 1Q Median 3Q Max
-7.640 -2.248 -0.455 2.027 7.760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.980 1.109 7.196 9.71e-08 ***
as.factor(dose)1 8.790 1.568 5.605 6.02e-06 ***
as.factor(dose)2 18.160 1.568 11.580 5.58e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.507 on 27 degrees of freedom
Multiple R-squared: 0.8324, Adjusted R-squared: 0.82
F-statistic: 67.07 on 2 and 27 DF, p-value: 3.357e-11
R Code to Compare Models
anova(Tooth.lm.0, Tooth.lm.1)
Analysis of Variance Table
Model 1: len ~ dose
Model 2: len ~ as.factor(dose)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 380.15
2 27 332.00 1 48.146 3.9155 0.05813 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The relationship between tooth length and vitamin C is found to be statistically significant using both the linear regression model (P<0.0001) and the factor model (P<0.0001).
The F test for comparing the linear effect and factor model gives an F-statistic of \(f=3.92\) on \(K - 2 = 1\) and \(n - K = 27\) degrees of freedom.
The corresponding P-value is \(P=0.058\). We would retain H0 at the 5% significance level, but there is certainly some evidence to doubt that should be modelled as a linear effect.
A haplotype is a unit of genetic information contained on a single chromosome. Because chromosomes come in pairs (one inherited from each parent), an individual may have either zero, one or two copies of any given haplotype. The effects of a haplotype on some phenotype (measurable physical attribute) may or may not be linear in the number of copies of the haplotype.
A measure of cholesterol was recorded on 35 subjects. Of these, 8 have zero copies, 16 had one copy, and 11 had two copies of a particular haplotype.
Fitting the number of copies of the haplotype as a linear effect gave a residual sum of squares of \(153.2\).
Fitting the number of copies of the haplotype as a factor gave a residual sum of squares of \(126.2\).
What is the value of the F-statistic for testing whether the effect of haplotype is linear in the number of copies of the haplotype?
What are the degrees of freedom for this test statistic?