Download the R markdown file for this lecture.

This lecture will cover an introduction to regression with multiple explanatory variables.

Scottish Hill Racing Data Revisited

For the linear model of time against dist , plot residuals against climb .

data(hills, package = "MASS")
Hills.lm <- lm(time ~ dist, data = hills)
plot(resid(Hills.lm) ~ climb, data = hills, xlab = "Climb")

This is a lurking variable plot.

The Multiple Linear Regression Model

The multiple linear regression model is \[Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i\] for i=1,2,…,n.

Note now that:

Parameter Estimation

Fitted Values and Residuals

Birds of the High Paramo

A paramo is an exposed, high plateau in the tropical parts of South America. In the northern Andes, there is a pattern of ‘islands’ of vegetation within the otherwise bare paramo. An (observational) study was conducted to investigate the bird life in this region. One question of interest — what characteristics of these islands (if any) affect the diversity of bird species?

hopefully the right kind of bird!

N - the number of species of bird present,

AR - area of the island in square kilometers ,

EL - elevation in thousands of meters,

DEc - the distance from Ecuador in kilometers, and

DNI - distance to the nearest other island in kilometers .

Data source: Vuilleumier (1970), “Insular biogeography in continental regions. I. The northern Andes of South America”, American Naturalist, 104, 373-388.

We will fit a multiple linear regression model with response N, and all other variables as predictors.

Modelling the Paramo Biodiversity Data

Loading the Data into R

## Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv", 
##     header = TRUE, row.names = 1)
head(Paramo)
                    N   AR   EL DEc DNI
Chiles             36 0.33 1.26  36  14
Las Papas-Coconuco 30 0.50 1.17 234  13
Sumapaz            37 2.03 1.06 543  83
Tolima-Quindio     35 0.99 1.90 551  23
Paramillo          11 0.03 0.46 773  45
Cocuy              21 2.17 2.00 801  14

Download paramo.csv

Fitting the Regression Model in R

Paramo.lm <- lm(N ~ ., data = Paramo)
summary(Paramo.lm)

Call:
lm(formula = N ~ ., data = Paramo)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6660  -3.4090   0.0834   3.5592   8.2357 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 27.889386   6.181843   4.511  0.00146 **
AR           5.153864   3.098074   1.664  0.13056   
EL           3.075136   4.000326   0.769  0.46175   
DEc         -0.017216   0.005243  -3.284  0.00947 **
DNI          0.016591   0.077573   0.214  0.83541   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.705 on 9 degrees of freedom
Multiple R-squared:  0.7301,    Adjusted R-squared:  0.6101 
F-statistic: 6.085 on 4 and 9 DF,  p-value: 0.01182

Comments on Analysis of Paramo Biodiversity Data

Diagnostics for the Paramo Biodiversity Data

Should check model diagnostics for a multiple linear regression model following the same approach as for a simple linear regression.

par(mfrow = c(2, 2))
plot(resid(Paramo.lm) ~ AR + EL + DEc + DNI, data = Paramo)
Plots of Residuals Against Each Predictor

Plots of Residuals Against Each Predictor

Standard Four Diagnostic Plots

par(mfrow = c(2, 2))
plot(Paramo.lm)
Residual analysis for first model.

Residual analysis for first model.

Conclusions for the Model for the Paramo Data

The diagnostic plots do not provide any compelling evidence that the model assumptions are invalid (but n small).