Download the R markdown file for this lecture.
Statistical models allow us to answer questions using data.
The validity of the conclusions that we reach is contingent upon the appropriateness of the models that we use.
Once you have fitted a statistical model to data, it is therefore important to check the assumptions underlying the model.
The tools to check these assumptions are often referred to as model diagnostics.
We shall look at assumption checking for simple linear regression in this lecture.
Recall that the linear regression model is \[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]
where we make the following assumptions:
A1: \(E[\varepsilon_i] = 0\) for \(i=1,\ldots,n\).
A2: \(\varepsilon_1, \ldots ,\varepsilon_n\) are independent.
A3: \(var(\varepsilon_i) = \sigma^2\) for \(i=1,\ldots,n\) (homoscedasticity).
A4: \(\varepsilon_1, \ldots ,\varepsilon_n\) are normally distributed.
Class Discussion: Interpret these assumptions, and think about situations in which each might fail.
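To make the discussion concrete, here is a small simulation sketch (invented for illustration, not part of the lecture's data) in which assumption A3 fails: the error standard deviation grows with the covariate, and the residuals fan out accordingly.

set.seed(1)                            # for reproducibility
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)    # error SD grows with x, violating A3
sim.lm <- lm(y ~ x)
plot(fitted(sim.lm), resid(sim.lm),
     xlab = "Fitted values", ylab = "Raw residuals")
abline(h = 0, lty = 2)                 # spread of residuals widens left to right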
The residuals (or raw residuals), \[\begin{aligned} e_i &= y_i - \hat \mu_i\\ &= y_i - \hat \beta_0 - \hat \beta_1 x_i~~~~~~~~~(i=1,\ldots ,n)\end{aligned}\] are a fundamental diagnostic aid.
These residuals are a kind of sample version of the error random variables \(\varepsilon_1,\ldots,\varepsilon_n\).
Consequently we can test the assumptions by seeing if they hold for \(e_1, \ldots ,e_n\) (instead of \(\varepsilon_1, \ldots ,\varepsilon_n\)).
A problem with using raw residuals as substitutes for the error terms is that the variance of the \(i\)th raw residual can be shown to be \[var(e_i) = \sigma^2 (1 - h_{ii})~~~~~~(i=1,\ldots,n)\] where \[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{s_{xx}}.\]
Hence, unlike the error terms, the raw residuals do not have equal variance.
It is therefore often preferable to work with the standardized residuals, defined by \[r_i = \frac{e_i}{s \sqrt{1-h_{ii}}},\] where \(s\) is the residual standard error (our estimate of \(\sigma\)).
The standardized residuals have an approximately standard normal, N(0,1), distribution and thus common variance.
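As a check on these formulas, both the leverages \(h_{ii}\) and the standardized residuals can be computed from their definitions and compared with R's built-in `hatvalues()` and `rstandard()` functions. The sketch below uses the built-in `cars` data set, since our example data have not yet been loaded.

fit <- lm(dist ~ speed, data = cars)   # any simple linear regression will do
x <- cars$speed
h <- 1/length(x) + (x - mean(x))^2/sum((x - mean(x))^2)  # h_ii from the formula
all.equal(unname(hatvalues(fit)), h)                     # TRUE
s <- summary(fit)$sigma                                  # residual standard error s
r <- resid(fit)/(s * sqrt(1 - h))                        # standardized residuals r_i
all.equal(unname(r), unname(rstandard(fit)))             # TRUE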
The hills data set contains records for Scottish hill races. The variables are:
- `dist`: distance (miles)
- `climb`: height gained (ft)
- `time`: record time (minutes)
We will first consider simple linear regression of `time` on `dist` in R.
Reading the Data into R
data(hills, package = "MASS")
Hills <- hills
## Hills <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/hills.csv",
##                   header = TRUE, row.names = 1)
The `row.names = 1` argument tells R that the first column of data in the comma separated values file hills.csv contains information which should provide the row names for the data (in this case, the names of race locations).
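A quick look at the data confirms that they were read in as intended, with the race locations as row names:

head(Hills)  # first few races; row names are race locations
str(Hills)   # 35 observations on dist, climb and time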
Hills.lm <- lm(time ~ dist, data = Hills)
summary(Hills.lm)
Call:
lm(formula = time ~ dist, data = Hills)
Residuals:
Min 1Q Median 3Q Max
-35.745 -9.037 -4.201 2.849 76.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.8407 5.7562 -0.841 0.406
dist 8.3305 0.6196 13.446 6.08e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.96 on 33 degrees of freedom
Multiple R-squared: 0.8456, Adjusted R-squared: 0.841
F-statistic: 180.8 on 1 and 33 DF, p-value: 6.084e-15
library(ggplot2)
ggplot(Hills, mapping = aes(x = dist, y = time)) + geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Race distance", y = "Winning time of race")
`geom_smooth()` using formula 'y ~ x'
The fitted model is (to 2 decimal places) \[E[\mbox{time}] = -4.84 + 8.33 \, \mbox{dist}\]
The P-value for testing whether the regression slope is zero is \(6.08\times 10^{-15}\). This is tiny, and indicates overwhelming evidence of (linear) dependence of race time on distance.
The R-squared statistic of 0.846 indicates that distance explains a large proportion (about 85%) of the variation in race record times.
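These quantities can also be extracted from the fitted model object directly, rather than read off the printed summary. A brief sketch (the prediction at dist = 10 is just an illustrative value):

coef(Hills.lm)                                      # estimated intercept and slope
summary(Hills.lm)$r.squared                         # R-squared
coef(summary(Hills.lm))["dist", "Pr(>|t|)"]         # P-value for the slope
confint(Hills.lm)                                   # 95% confidence intervals
predict(Hills.lm, newdata = data.frame(dist = 10))  # predicted time for a 10-mile race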
The command `resid()` returns the (raw) residuals for a fitted linear regression model.
resid(Hills.lm)
Greenmantle Carnethy Craig Dunain Ben Rha
0.09757971 3.20798304 -11.49201696 -12.03770125
Ben Lomond Goatfell Bens of Jura Cairnpapple
0.46407065 11.41407065 76.17042112 -8.77501696
Scolty Traprain Lairig Ghru Dollar
-7.06156077 -5.39201696 -35.74505318 6.23843923
Lomonds Cairn Table Eildon Two Cairngorm
-9.29861363 -1.00901696 -5.71333268 -6.21384173
Seven Hills Knock Hill Black Hill Creag Beag
-13.36866650 58.49935161 -15.22933268 -8.40978887
Kildcon Hill Meall Ant-Suidhe Half Ben Nevis Cow Hill
-4.20064839 3.58412351 2.49098304 6.11280780
N Berwick Law Creag Dubh Burnswark Largo Law
-1.46764839 -2.26410458 -10.70901696 -8.24456077
Criffel Acmony Ben Nevis Knockfarrel
1.19275494 -15.86156077 7.11915827 -12.75901696
Two Breweries Cockleroi Moffat Chase
25.14250874 -4.54633268 -1.93540365
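A convenient way to pick out the most extreme races is to sort the residuals by absolute size; from the output above, Bens of Jura, Knock Hill and Lairig Ghru stand out.

e <- resid(Hills.lm)
head(sort(abs(e), decreasing = TRUE), 3)  # races with the largest absolute residuals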
Assumption A1 is equivalent to stating that \(E[Y_i] = \beta_0 + \beta_1 x_i\).
Failure of A1 indicates that we are fitting the wrong mean function.
Looking for trends in plots of the residuals against the fitted values or a covariate is a useful diagnostic.
If A2 is correct, then residuals should appear randomly scattered about zero if plotted against time order (if known), fitted values or the predictor.
Long sequences of positive residuals followed by sequences of negative residuals suggest that the error terms are not independent.
If A3 is correct, then the spread of the (standardized) residuals about zero should not vary systematically with the fitted value or covariate.
A “fanning out” of the (standardized) residuals when plotted against the fitted values or a covariate is a common sign of failure of A3.
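Assumptions A1 and A3 can both be examined with a hand-made plot of the standardized residuals against the fitted values (the built-in diagnostics shown below automate this):

plot(fitted(Hills.lm), rstandard(Hills.lm),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)  # look for trend (A1 failing) or fanning out (A3 failing)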
A normal probability plot, or normal Q-Q plot, is a standard graphical tool for assessing the normality of a data set.
A normal Q-Q plot of the (standardized) residuals should look close to a straight line — curvature indicates a failure of A4.
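Such a plot can be produced by hand with `qqnorm()`; the built-in `plot(Hills.lm, which = 2)` used below gives essentially the same display.

qqnorm(rstandard(Hills.lm))  # sample quantiles against N(0,1) quantiles
qqline(rstandard(Hills.lm))  # reference line; marked curvature flags non-normality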
The `plot()` command applied to the fitted model `Hills.lm` produces four useful diagnostic plots. The `par()` command, run first, splits the graphics window into a two-by-two grid so that we can see all four graphs together.
par(mfrow = c(2, 2))
plot(Hills.lm)
The first three of these plots are:
plot(Hills.lm, which = 1)
This residual plot shows some data points with very large residuals.
The red line is a nonparametric estimate of trend (take care not to over-interpret).
Would you say there is evidence of increasing spread in residuals?
plot(Hills.lm, which = 2)
The normal Q-Q plot is far from a straight line, indicating failure of the normality assumption.
Note that this plot is heavily influenced by data points with large residuals.
plot(Hills.lm, which = 3)
The scale-location plot provides a hint that spread of residuals increases with fitted value.
Error variance \(\sigma^2\) may not be constant for all data points.
The linear regression model assumptions do not appear to be valid here, so conclusions based on this model are questionable.
The problems with the model seem largely due to a few odd data points, or outliers.