Download the R markdown file for this lecture.
This lecture will cover an introduction to regression with multiple explanatory variables.
For the linear model of time against dist, plot residuals against climb.
data(hills, package = "MASS")
Hills.lm <- lm(time ~ dist, data = hills)
plot(resid(Hills.lm) ~ climb, data = hills, xlab = "Climb")
This is a lurking variable plot.
For the Scottish Hill Racing data, the trend in the lurking variable plot of residuals against climb
indicates that the height climbed should be included in the model in addition to the race distance.
For applications such as this we need to extend simple linear regression to incorporate multiple predictors.
This extended methodology is called multiple linear regression.
The multiple linear regression model is \[Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i\] for i=1,2,…,n.
Note now that:
There are p predictor variables \(x_1, x_2, \ldots, x_p\).
The notation \(x_{ij}\) denotes the value of \(x_j\) for the ith individual.
\(\varepsilon_1, \ldots, \varepsilon_n\) are random errors satisfying assumptions A1–A4.
\(\beta_0, \beta_1, \ldots, \beta_p\) are the regression parameters.
The equation for the regression line and A1 imply that \[E[Y_i] = \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}\]
The parameters \(\beta_0,\beta_1, \ldots, \beta_p\) can be estimated by the method of least squares.
The least squares estimates (LSEs) \(\hat \beta_0,\hat \beta_1, \ldots ,\hat \beta_p\) are the values that minimize the following sum of squares: \[\begin{aligned} SS(\beta_0, \beta_1, \ldots , \beta_p) &= \sum_{i=1}^n (y_i - \mu_i)^2\\ &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip} )^2\end{aligned}\]
Software packages can quickly compute LSEs.
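As an illustration (using the Scottish hill races data from above, with climb added as a second predictor), lm() computes the LSEs directly; as a sanity check, the same estimates can be obtained by solving the normal equations by hand via the model matrix. The object names hills.lm2, X and beta_hat below are our own choices for this sketch.

```r
# Fit a two-predictor model by least squares using lm()
data(hills, package = "MASS")
hills.lm2 <- lm(time ~ dist + climb, data = hills)
coef(hills.lm2)

# Check: the LSEs solve the normal equations (X'X) beta = X'y,
# where X is the model matrix (a column of 1s, then dist and climb)
X <- model.matrix(~ dist + climb, data = hills)
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% hills$time))
beta_hat  # agrees with coef(hills.lm2)
```

This confirms that lm() is minimising exactly the sum of squares defined above; in practice we always use lm() rather than solving the normal equations ourselves.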
The definitions of a fitted value and a residual in multiple linear regression follow on naturally from simple linear regression.
The ith fitted value is \[\hat \mu_i = \hat \beta_0 + \hat \beta_1 x_{i1} + \ldots + \hat\beta_p x_{ip}\]
The ith residual is \(e_i = y_i - \hat \mu_i\).
The error variance, \(\sigma^2\), can be estimated (unbiasedly) by \[s^2 = \frac{1}{n-p-1} RSS\] where the residual sum of squares is \(RSS = \sum_{i=1}^n (y_i - \hat \mu_i)^2\).
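These definitions can be checked numerically. The sketch below (again using the hill races data, with object names of our own choosing) computes the fitted values, residuals, RSS and \(s^2\) from first principles and compares the result with R's built-in extractor functions.

```r
data(hills, package = "MASS")
hills.lm2 <- lm(time ~ dist + climb, data = hills)

mu_hat <- fitted(hills.lm2)        # ith fitted value, mu_i hat
e      <- hills$time - mu_hat      # ith residual, e_i = y_i - mu_i hat
RSS    <- sum(e^2)                 # residual sum of squares

n <- nrow(hills)
p <- 2                             # number of predictors (dist and climb)
s2 <- RSS / (n - p - 1)            # unbiased estimate of sigma^2

sqrt(s2)  # agrees with sigma(hills.lm2), the "residual standard
          # error" reported by summary(hills.lm2)
```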
A paramo is an exposed, high plateau in the tropical parts of South America. In the northern Andes, there is a pattern of ‘islands’ of vegetation within the otherwise bare paramo. An (observational) study was conducted to investigate the bird life in this region. One question of interest — what characteristics of these islands (if any) affect the diversity of bird species?
N - the number of species of bird present,
AR - area of the island in square kilometers,
EL - elevation in thousands of meters,
DEc - the distance from Ecuador in kilometers, and
DNI - distance to the nearest other island in kilometers.
Data source: Vuilleumier (1970), “Insular biogeography in continental regions. I. The northern Andes of South America”, American Naturalist, 104, 373-388.
We will fit a multiple linear regression model with response N
, and all other variables as predictors.
Loading the Data into R
Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv",
                   header = TRUE, row.names = 1)
head(Paramo)
N AR EL DEc DNI
Chiles 36 0.33 1.26 36 14
Las Papas-Coconuco 30 0.50 1.17 234 13
Sumapaz 37 2.03 1.06 543 83
Tolima-Quindio 35 0.99 1.90 551 23
Paramillo 11 0.03 0.46 773 45
Cocuy 21 2.17 2.00 801 14
Paramo.lm <- lm(N ~ ., data = Paramo)
summary(Paramo.lm)
Call:
lm(formula = N ~ ., data = Paramo)
Residuals:
Min 1Q Median 3Q Max
-10.6660 -3.4090 0.0834 3.5592 8.2357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.889386 6.181843 4.511 0.00146 **
AR 5.153864 3.098074 1.664 0.13056
EL 3.075136 4.000326 0.769 0.46175
DEc -0.017216 0.005243 -3.284 0.00947 **
DNI 0.016591 0.077573 0.214 0.83541
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.705 on 9 degrees of freedom
Multiple R-squared: 0.7301, Adjusted R-squared: 0.6101
F-statistic: 6.085 on 4 and 9 DF, p-value: 0.01182
We should check the model diagnostics for a multiple linear regression model, following the same approach as for simple linear regression.
However, for a multiple linear regression, plots of residuals against each individual covariate can provide information beyond the residuals versus fitted values plot.
We can produce all of these plots in a single graphics window with
par(mfrow = c(2, 2))
plot(resid(Paramo.lm) ~ AR + EL + DEc + DNI, data = Paramo)
Note that when given a formula argument, the R command plot
produces a scatterplot of the left-hand-side variable (here the residuals) against each predictor on the right.
The first command sets graphical parameters so as to divide the plotting window into a 2 by 2 array.
par(mfrow = c(2, 2))
plot(Paramo.lm)
The diagnostic plots do not provide any compelling evidence that the model assumptions are invalid (though with only n = 14 islands, such plots have limited power to detect problems).
The model that we fitted contains four predictor variables. There are a number of questions that we could ask regarding this model:
Overall, is there significant evidence that the model is useful in explaining variation in the number of bird species?
Are all the predictor variables providing useful information about the response?
What would we lose (and gain) if we used a model with fewer predictor variables?
We will address these kinds of questions in the next couple of lectures.
Comments on Analysis of Paramo Biodiversity Data
The row.names = 1 option in the read.csv() command tells R that the first column in the text file contains row names.

A full stop on the right-hand side of an R model formula represents all variables in the data frame except for the response (on the left).
The fitted model is \[E[\mbox{N}] = 27.9 + 5.15~\mbox{AR} + 3.08~\mbox{EL} - 0.02~\mbox{DEc} + 0.02~\mbox{DNI}\]
The fitted model explains \(R^2 = 73\%\) of the variation in the response.
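The \(R^2\) value reported by summary() is defined as \(1 - RSS/TSS\), where TSS is the total sum of squares about the mean of the response. It can be reproduced by hand, as in this sketch (using the hills model, which does not require an internet connection; object names are our own):

```r
data(hills, package = "MASS")
fit <- lm(time ~ dist + climb, data = hills)

RSS <- sum(resid(fit)^2)                        # residual sum of squares
TSS <- sum((hills$time - mean(hills$time))^2)   # total sum of squares
R2  <- 1 - RSS / TSS

R2  # agrees with summary(fit)$r.squared
```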