Download the R markdown file for this lecture.
In the previous lecture we looked at tests for a relationship between the response and at least one of the predictors in a multiple linear regression model.
In this lecture we will examine other types of test.
We will begin by looking at tests for a single predictor variable.
Consider the following question regarding the paramo biodiversity example:
Is the number of bird species related to the area of the site?
This question is essentially the same as asking, “Does area provide useful information about the number of bird species?”
The answer to this depends on the context in which the question is asked.
For example:
- `AR` may be helpful in understanding `N` in the absence of other
information;
- `AR` may not provide significant additional information once the
other explanatory variables are taken into account.
For example, suppose we are modelling a measure of reading ability (response variable) for children at primary school.
We would find that reading ability is related to height.
However, height probably does not provide additional useful information once age is taken into account.
The `summary()` command in R (and the standard output from other statistical packages) provides information for testing the importance of a covariate "taking into account all other variables in the model".
Specifically, given the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i~~~(i=1,2,\ldots,n)\) the output provides statistics for performing a t test of \[H_0: \beta_j = 0~~\mbox{versus}~~H_1: \beta_j \ne 0\] for any given variable \(x_j\), making no assumptions about the other regression parameters.
For the model \(E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL} + \beta_3 \mbox{DEc} + \beta_4 \mbox{DNI}\)
## Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv",
##                    header = TRUE, row.names = 1)
Paramo.lm <- lm(N ~ ., data = Paramo)
summary(Paramo.lm)
Call:
lm(formula = N ~ ., data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-10.6660 -3.4090 0.0834 3.5592 8.2357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.889386 6.181843 4.511 0.00146 **
AR 5.153864 3.098074 1.664 0.13056
EL 3.075136 4.000326 0.769 0.46175
DEc -0.017216 0.005243 -3.284 0.00947 **
DNI 0.016591 0.077573 0.214 0.83541
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.705 on 9 degrees of freedom
Multiple R-squared: 0.7301, Adjusted R-squared: 0.6101
F-statistic: 6.085 on 4 and 9 DF, p-value: 0.01182
Consider testing whether `N` is related to `AR`, having accounted for the other variables `EL`, `DEc` and `DNI`.
We will test \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \ne 0\).
The estimated regression coefficient for `AR` is \(\hat \beta_1 = 5.15\) with corresponding standard error \(SE(\hat \beta_1) = 3.10\).
The t-test statistic is \(\frac{\hat \beta_1 - 0}{SE(\hat \beta_1)} = \frac{5.1539}{3.098} = 1.664\).
The corresponding P-value, calculated from a t distribution with \(n - p - 1 = 14 - 4 - 1 = 9\) degrees of freedom, is \(P = 0.13\).
We conclude that the data do not provide evidence against \(H_0\).
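This P-value can be reproduced directly from the t distribution. A minimal check, using the estimate and standard error reported in the summary output above:

```r
# t statistic for H0: beta_1 = 0, using the estimate and SE from the summary
t_stat <- 5.153864 / 3.098074

# two-sided P-value from a t distribution with n - p - 1 = 14 - 4 - 1 = 9 df
p_value <- 2 * pt(-abs(t_stat), df = 9)

round(t_stat, 3)   # 1.664
round(p_value, 3)  # 0.131
```

This matches the `AR` row of the coefficient table.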
The previous test shows that area does not provide useful additional information about the number of bird species having taken account of (or adjusted for) the other variables.
This does not mean that `N` is unrelated to `AR`.
Consider now fitting a simple linear regression of `N` on `AR` alone.
The simpler model \(E[\mbox{N}] = \beta_0 + \beta_1 \mbox{AR}\)
Paramo.lm.new <- lm(N ~ AR, data = Paramo)
summary(Paramo.lm.new)
Call:
lm(formula = N ~ AR, data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-12.809 -4.404 -1.676 4.216 17.905
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.442 3.225 4.789 0.000442 ***
AR 8.041 3.237 2.484 0.028759 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.083 on 12 degrees of freedom
Multiple R-squared: 0.3395, Adjusted R-squared: 0.2845
F-statistic: 6.169 on 1 and 12 DF, p-value: 0.02876
Notice that the estimate \(\hat \beta_1\) in the simple linear regression differs from that in the multiple linear regression.
The same comment applies to \(\hat \beta_0\) and to the standard errors.
The t-test of \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \ne 0\) now has P-value \(P = 0.029\).
We therefore have evidence that the number of bird species is related to area.
The change from our earlier conclusion can be explained as follows: the information on `N` provided by `AR` is duplicated (to a considerable extent) by the information provided by the other variables.
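One quick way to see this duplication (a sketch, assuming the `Paramo` data frame has been loaded as above) is to inspect the correlations among the explanatory variables:

```r
# correlations among the explanatory variables; values well away from zero
# indicate that the predictors carry overlapping information about N
cor(Paramo[, c("AR", "EL", "DEc", "DNI")])
```

Strongly correlated predictors mean that, once the others are in the model, `AR` has little new information to contribute.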
Data are from 12 naval hospitals in the USA. The response variable, `ManHours`, is the monthly man-hours associated with maintaining the anaesthesiology service. The explanatory variables are:

Variable | Description |
---|---|
`Cases` | the number of surgical cases |
`Eligible` | the eligible population per thousand |
`OpRooms` | the number of operating rooms |
Source: A Handbook of Small Data Sets by Hand, Daly, Lunn, McConway and Ostrowski.
## Hospital <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/hospital.csv",
##                      header = TRUE)
Hospital.lm <- lm(ManHours ~ ., data = Hospital)
summary(Hospital.lm)
Call:
lm(formula = ManHours ~ ., data = Hospital)
Residuals:
Min 1Q Median 3Q Max
-218.669 -98.962 -3.585 112.429 195.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -114.5895 130.3392 -0.879 0.4049
Cases 2.0315 0.6778 2.997 0.0171 *
Eligible 2.2714 1.6820 1.350 0.2138
OpRooms 99.7254 42.2158 2.362 0.0458 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 165.5 on 8 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.98
F-statistic: 181 on 3 and 8 DF, p-value: 1.087e-07
- Is `ManHours` related to at least one of the explanatory variables?
- Does `Cases` provide additional information about `ManHours` when the other explanatory variables are taken into account?
- Does `Eligible` provide additional information about `ManHours` when the other explanatory variables are taken into account?
- Is `ManHours` related to `Eligible`?
To understand more about these data, look at scatter plots of the response against each of the explanatory variables.
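For example (assuming the `Hospital` data frame has been loaded as above), all pairwise scatter plots, including the response against each explanatory variable, can be produced in a single call:

```r
# scatter plot matrix of every pair of variables in the Hospital data
pairs(Hospital)
```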
T-tests can be used to test single parameters, and hence assess the importance of single variables in a multiple linear regression model.
The interpretation of these tests depends on what other variables are included in the model.
To test the importance of groups of multiple covariates requires F tests, using the methodology introduced in the previous lecture.
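As a sketch of such an F test (assuming both Paramo models fitted above are available), R's `anova()` function compares nested models. Here we test \(H_0: \beta_2 = \beta_3 = \beta_4 = 0\), i.e. whether `EL`, `DEc` and `DNI` jointly provide additional information once `AR` is in the model:

```r
# F test comparing the reduced model (AR only) against the full model
anova(Paramo.lm.new, Paramo.lm)
```

A small P-value would indicate that at least one of the extra variables is needed beyond `AR`.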