Download the R markdown file for this lecture.
In the previous lecture we looked at tests for a relationship between the response and at least one of the predictors in a multiple linear regression model.
In this lecture we will examine other types of test.
We will begin by looking at tests for a single predictor variable.
Consider the following question regarding the paramo biodiversity example:
Is the number of bird species related to the area of the site?
This question is essentially the same as asking, “Does area provide useful information about the number of bird species?”
The answer to this depends on the context in which the question is asked.
For example:
- `AR` may be helpful in understanding `N` in the absence of other
information;
- `AR` may not provide significant additional information once the
other explanatory variables are taken into account.
For example, suppose we are modelling a measure of reading ability (response variable) for children at primary school.
We would find that reading ability is related to height.
However, height probably does not provide additional useful information once age is taken into account.
The `summary()` command in R (and the standard output from other statistical packages) provides information for testing the importance of a covariate "taking into account all other variables in the model".
Specifically, given the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i~~~(i=1,2,\ldots,n)\) the output provides statistics for performing a t test of \[H_0: \beta_j = 0~~\mbox{versus}~~H_1: \beta_j \ne 0\] for any given variable \(x_j\), making no assumptions about the other regression parameters.
For the model \(E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL} + \beta_3 \mbox{DEc} + \beta_4 \mbox{DNI}\)
## Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv",
##                    header = TRUE, row.names = 1)
Paramo.lm <- lm(N ~ ., data = Paramo)
summary(Paramo.lm)
Call:
lm(formula = N ~ ., data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-10.6660 -3.4090 0.0834 3.5592 8.2357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.889386 6.181843 4.511 0.00146 **
AR 5.153864 3.098074 1.664 0.13056
EL 3.075136 4.000326 0.769 0.46175
DEc -0.017216 0.005243 -3.284 0.00947 **
DNI 0.016591 0.077573 0.214 0.83541
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.705 on 9 degrees of freedom
Multiple R-squared: 0.7301, Adjusted R-squared: 0.6101
F-statistic: 6.085 on 4 and 9 DF, p-value: 0.01182
Consider testing whether `N` is related to `AR`, having accounted for the other variables `EL`, `DEc` and `DNI`.
We will test \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \ne 0\).
The estimated regression coefficient for `AR` is \(\hat \beta_1 = 5.15\) with corresponding standard error \(SE(\hat \beta_1) = 3.10\).
The t-test statistic is \(\frac{\hat \beta_1 - 0}{SE(\hat \beta_1)} = \frac{5.1539}{3.098} = 1.664\).
The corresponding P-value, calculated from a t distribution with \(n - p - 1 = 14 - 4 - 1 = 9\) degrees of freedom, is \(P = 0.13\).
We conclude that the data do not provide evidence against \(H_0\).
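This P-value can be reproduced directly from the t distribution. A minimal check, using the estimate and standard error reported in the summary output above:

```r
# t statistic for H0: beta_1 = 0, using the estimate and SE from the summary
t_stat <- 5.153864 / 3.098074

# two-sided P-value from a t distribution with n - p - 1 = 14 - 4 - 1 = 9 df
p_value <- 2 * pt(-abs(t_stat), df = 9)

round(t_stat, 3)   # 1.664
round(p_value, 3)  # 0.131
```

This matches the `AR` row of the coefficient table.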
The previous test shows that area does not provide useful additional information about the number of bird species having taken account of (or adjusted for) the other variables.
This does not mean that `N` is unrelated to `AR`.
Consider now fitting a simple linear regression of `N` on `AR` alone.
The simpler model \(E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR}\)
Paramo.lm.new <- lm(N ~ AR, data = Paramo)
summary(Paramo.lm.new)
Call:
lm(formula = N ~ AR, data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-12.809 -4.404 -1.676 4.216 17.905
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.442 3.225 4.789 0.000442 ***
AR 8.041 3.237 2.484 0.028759 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.083 on 12 degrees of freedom
Multiple R-squared: 0.3395, Adjusted R-squared: 0.2845
F-statistic: 6.169 on 1 and 12 DF, p-value: 0.02876
Notice that the estimate \(\hat \beta_1\) in the simple linear regression differs from that in the multiple linear regression.
The same comment applies to \(\hat \beta_0\) and to the standard errors.
The t-test of \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \ne 0\) now has P-value \(P = 0.029\).
We therefore have evidence that the number of bird species is related to area.
The change from our earlier conclusion can be explained as follows: the information on `N` provided by `AR` is duplicated (to a considerable extent) by the information provided by the other variables.
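One quick way to see this duplication (a sketch, assuming the `Paramo` data frame has been loaded as above) is to inspect the correlations among the explanatory variables:

```r
# correlations among the explanatory variables; values well away from zero
# indicate that the predictors carry overlapping information about N
cor(Paramo[, c("AR", "EL", "DEc", "DNI")])
```

Strongly correlated predictors mean that, once the others are in the model, `AR` has little new information to contribute.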
Data are from 12 naval hospitals in the USA. The response variable, `ManHours`, is the monthly man-hours associated with maintaining the anaesthesiology service. The explanatory variables are:

Variable | Description |
---|---|
`Cases` | the number of surgical cases |
`Eligible` | the eligible population per thousand |
`OpRooms` | the number of operating rooms |
Source: A Handbook of Small Data Sets by Hand, Daly, Lunn, McConway and Ostrowski.
## Hospital <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/hospital.csv",
##                      header = TRUE)
Hospital.lm <- lm(ManHours ~ ., data = Hospital)
summary(Hospital.lm)
Call:
lm(formula = ManHours ~ ., data = Hospital)
Residuals:
Min 1Q Median 3Q Max
-218.669 -98.962 -3.585 112.429 195.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -114.5895 130.3392 -0.879 0.4049
Cases 2.0315 0.6778 2.997 0.0171 *
Eligible 2.2714 1.6820 1.350 0.2138
OpRooms 99.7254 42.2158 2.362 0.0458 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 165.5 on 8 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.98
F-statistic: 181 on 3 and 8 DF, p-value: 1.087e-07
- Is `ManHours` related to at least one of the explanatory variables?
- Does `Cases` provide additional information about `ManHours` when the other explanatory variables are taken into account?
- Does `Eligible` provide additional information about `ManHours` when the other explanatory variables are taken into account?
- Is `ManHours` related to `Eligible`?
To understand more about these data, look at scatter plots of the response against each of the explanatory variables.
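For example (assuming the `Hospital` data frame has been loaded as above), all pairwise scatter plots, including the response against each explanatory variable, can be produced in a single call:

```r
# scatter plot matrix of every pair of variables in the Hospital data
pairs(Hospital)
```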
T-tests can be used to test single parameters, and hence assess the importance of single variables in a multiple linear regression model.
The interpretation of these tests depends on what other variables are included in the model.
To test the importance of groups of multiple covariates requires F tests, using the methodology introduced in the previous lecture.
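As a sketch of such an F test (assuming both Paramo models fitted above are available), R's `anova()` function compares nested models. Here we test \(H_0: \beta_2 = \beta_3 = \beta_4 = 0\), i.e. whether `EL`, `DEc` and `DNI` jointly provide additional information once `AR` is in the model:

```r
# F test comparing the reduced model (AR only) against the full model
anova(Paramo.lm.new, Paramo.lm)
```

A small P-value would indicate that at least one of the extra variables is needed beyond `AR`.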