Download the R markdown file for this lecture.
This lecture will cover an introduction to regression with multiple explanatory variables.
For the linear model of time against dist, plot residuals against climb.
data(hills, package = "MASS")
Hills.lm <- lm(time ~ dist, data = hills)
plot(resid(Hills.lm) ~ climb, data = hills, xlab = "Climb")
This is a lurking variable plot.
For the Scottish Hill Racing data, the trend in the lurking variable plot of residuals against climb
indicates that the height climbed should be included in the model in addition to the race distance.
For applications such as this we need to extend simple linear regression to incorporate multiple predictors.
This extended methodology is called multiple linear regression.
The multiple linear regression model is \[Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i\] for i=1,2,…,n.
Note now that:
There are p predictor variables \(x_1, x_2, \ldots, x_p\).
The notation \(x_{ij}\) denotes the value of \(x_j\) for the ith individual.
\(\varepsilon_1, \ldots, \varepsilon_n\) are random errors satisfying assumptions A1–A4.
\(\beta_0, \beta_1, \ldots, \beta_p\) are the regression parameters.
The equation for the regression line and A1 imply that \[E[Y_i] = \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}\]
The parameters \(\beta_0,\beta_1, \ldots, \beta_p\) can be estimated by the method of least squares.
The least squares estimates (LSEs) \(\hat \beta_0,\hat \beta_1, \ldots ,\hat \beta_p\) are the values that minimize the following sum of squares: \[\begin{aligned} SS(\beta_0, \beta_1, \ldots , \beta_p) &= \sum_{i=1}^n (y_i - \mu_i)^2\\ &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip} )^2\end{aligned}\]
Software packages can quickly compute LSEs.
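As an illustration (using the Scottish hill races data from above, with climb added as a second predictor), lm() computes the LSEs directly; as a sanity check, the same estimates can be obtained by solving the normal equations by hand via the model matrix. The object names hills.lm2, X and beta_hat below are our own choices for this sketch.

```r
# Fit a two-predictor model by least squares using lm()
data(hills, package = "MASS")
hills.lm2 <- lm(time ~ dist + climb, data = hills)
coef(hills.lm2)

# Check: the LSEs solve the normal equations (X'X) beta = X'y,
# where X is the model matrix (a column of 1s, then dist and climb)
X <- model.matrix(~ dist + climb, data = hills)
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% hills$time))
beta_hat  # agrees with coef(hills.lm2)
```

This confirms that lm() is minimising exactly the sum of squares defined above; in practice we always use lm() rather than solving the normal equations ourselves.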
The definitions of a fitted value and a residual in multiple linear regression follow on naturally from simple linear regression.
The ith fitted value is \[\hat \mu_i = \hat \beta_0 + \hat \beta_1 x_{i1} + \ldots + \hat\beta_p x_{ip}\]
The ith residual is \(e_i = y_i - \hat \mu_i\).
The error variance, \(\sigma^2\), can be estimated (unbiasedly) by \[s^2 = \frac{1}{n-p-1} RSS\] where the residual sum of squares is \(RSS = \sum_{i=1}^n (y_i - \hat \mu_i)^2\).
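These definitions can be checked numerically. The sketch below (again using the hill races data, with object names of our own choosing) computes the fitted values, residuals, RSS and \(s^2\) from first principles and compares the result with R's built-in extractor functions.

```r
data(hills, package = "MASS")
hills.lm2 <- lm(time ~ dist + climb, data = hills)

mu_hat <- fitted(hills.lm2)        # ith fitted value, mu_i hat
e      <- hills$time - mu_hat      # ith residual, e_i = y_i - mu_i hat
RSS    <- sum(e^2)                 # residual sum of squares

n <- nrow(hills)
p <- 2                             # number of predictors (dist and climb)
s2 <- RSS / (n - p - 1)            # unbiased estimate of sigma^2

sqrt(s2)  # agrees with sigma(hills.lm2), the "residual standard
          # error" reported by summary(hills.lm2)
```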
A paramo is an exposed, high plateau in the tropical parts of South America. In the northern Andes, there is a pattern of ‘islands’ of vegetation within the otherwise bare paramo. An (observational) study was conducted to investigate the bird life in this region. One question of interest — what characteristics of these islands (if any) affect the diversity of bird species?
N - the number of species of bird present,
AR - area of the island in square kilometers,
EL - elevation in thousands of meters,
DEc - the distance from Ecuador in kilometers, and
DNI - distance to the nearest other island in kilometers.
Data source: Vuilleumier (1970), “Insular biogeography in continental regions. I. The northern Andes of South America”, American Naturalist, 104, 373-388.
We will fit a multiple linear regression model with response N
, and all other variables as predictors.
Loading the Data into R
Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv",
                   header = TRUE, row.names = 1)
head(Paramo)
N AR EL DEc DNI
Chiles 36 0.33 1.26 36 14
Las Papas-Coconuco 30 0.50 1.17 234 13
Sumapaz 37 2.03 1.06 543 83
Tolima-Quindio 35 0.99 1.90 551 23
Paramillo 11 0.03 0.46 773 45
Cocuy 21 2.17 2.00 801 14
Paramo.lm <- lm(N ~ ., data = Paramo)
summary(Paramo.lm)
Call:
lm(formula = N ~ ., data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-10.6660 -3.4090 0.0834 3.5592 8.2357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.889386 6.181843 4.511 0.00146 **
AR 5.153864 3.098074 1.664 0.13056
EL 3.075136 4.000326 0.769 0.46175
DEc -0.017216 0.005243 -3.284 0.00947 **
DNI 0.016591 0.077573 0.214 0.83541
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.705 on 9 degrees of freedom
Multiple R-squared: 0.7301, Adjusted R-squared: 0.6101
F-statistic: 6.085 on 4 and 9 DF, p-value: 0.01182
We should check the model diagnostics for a multiple linear regression model, following the same approach as for simple linear regression.
However, for a multiple linear regression, plots of residuals against each individual covariate can provide information beyond the residuals versus fitted values plot.
We can produce all of these plots in a single graphics window with
par(mfrow = c(2, 2))
plot(resid(Paramo.lm) ~ AR + EL + DEc + DNI, data = Paramo)
Note that when given a formula argument, the R command plot
produces a scatterplot of the left-hand-side variable (here the residuals) against each predictor on the right.
The first command sets graphical parameters so as to divide the plotting window into a 2 by 2 array.
par(mfrow = c(2, 2))
plot(Paramo.lm)
The diagnostic plots do not provide any compelling evidence that the model assumptions are invalid (though with only n = 14 islands, such plots have limited power to detect problems).
The model that we fitted contains four predictor variables. There are a number of questions that we could ask regarding this model:
Overall, is there significant evidence that the model is useful in explaining variation in the number of bird species?
Are all the predictor variables providing useful information about the response?
What would we lose (and gain) if we used a model with fewer predictor variables?
We will address these kinds of questions in the next couple of lectures.
Comments on Analysis of Paramo Biodiversity Data
The row.names = 1 option in the read.csv() command tells R that the first column in the text file contains row names.

A full stop on the right-hand side of an R model formula represents all variables in the data frame except for the response (on the left).
The fitted model is \[E[\mbox{N}] = 27.9 + 5.15~\mbox{AR} + 3.08~\mbox{EL} - 0.02~\mbox{DEc} + 0.02~\mbox{DNI}\]
The fitted model explains \(R^2 = 73\%\) of the variation in the response.
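The \(R^2\) value reported by summary() is defined as \(1 - RSS/TSS\), where TSS is the total sum of squares about the mean of the response. It can be reproduced by hand, as in this sketch (using the hills model, which does not require an internet connection; object names are our own):

```r
data(hills, package = "MASS")
fit <- lm(time ~ dist + climb, data = hills)

RSS <- sum(resid(fit)^2)                        # residual sum of squares
TSS <- sum((hills$time - mean(hills$time))^2)   # total sum of squares
R2  <- 1 - RSS / TSS

R2  # agrees with summary(fit)$r.squared
```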