Download the R markdown file for this lecture.
Statistical models allow us to answer questions using data.
The validity of the conclusions that we reach is contingent upon the appropriateness of the models that we use.
Once you have fitted a statistical model to data, it is therefore important to check the assumptions underlying the model.
The tools to check these assumptions are often referred to as model diagnostics.
We shall look at assumption checking for simple linear regression in this lecture.
Recall that the linear regression model is \[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]
where we make the following assumptions:
A1: \(E[\varepsilon_i] = 0\) for \(i=1,\ldots,n\).
A2: \(\varepsilon_1, \ldots ,\varepsilon_n\) are independent.
A3: \(var(\varepsilon_i) = \sigma^2\) for \(i=1,\ldots,n\) (homoscedasticity).
A4: \(\varepsilon_1, \ldots ,\varepsilon_n\) are normally distributed.
Class Discussion: Interpret these assumptions, and think about situations in which each might fail.
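To make the discussion concrete, here is a small simulation sketch (invented for illustration, not part of the lecture's data) in which assumption A3 fails: the error standard deviation grows with the covariate, and the residuals fan out accordingly.

set.seed(1)                            # for reproducibility
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)    # error SD grows with x, violating A3
sim.lm <- lm(y ~ x)
plot(fitted(sim.lm), resid(sim.lm),
     xlab = "Fitted values", ylab = "Raw residuals")
abline(h = 0, lty = 2)                 # spread of residuals widens left to right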
The residuals (or raw residuals), \[\begin{aligned} e_i &= y_i - \hat \mu_i\\ &= y_i - \hat \beta_0 - \hat \beta_1 x_i~~~~~~~~~(i=1,\ldots ,n)\end{aligned}\] are a fundamental diagnostic aid.
These residuals are a kind of sample version of the error random variables \(\varepsilon_1,\ldots,\varepsilon_n\).
Consequently we can test the assumptions by seeing if they hold for \(e_1, \ldots ,e_n\) (instead of \(\varepsilon_1, \ldots ,\varepsilon_n\)).
A problem with using raw residuals as substitutes for the error terms is that the variance of the \(i\)th raw residual can be shown to be \[var(e_i) = \sigma^2 (1 - h_{ii})~~~~~~(i=1,\ldots,n)\] where \[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{s_{xx}}.\]
Hence, unlike the error terms, the raw residuals do not have equal variance.
It is therefore often preferable to work with the standardized residuals, defined by \[r_i = \frac{e_i}{s \sqrt{1-h_{ii}}},\] where \(s\) is the residual standard error (our estimate of \(\sigma\)).
The standardized residuals have an approximately standard normal, N(0,1), distribution and thus common variance.
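As a check on these formulas, both the leverages \(h_{ii}\) and the standardized residuals can be computed from their definitions and compared with R's built-in `hatvalues()` and `rstandard()` functions. The sketch below uses the built-in `cars` data set, since our example data have not yet been loaded.

fit <- lm(dist ~ speed, data = cars)   # any simple linear regression will do
x <- cars$speed
h <- 1/length(x) + (x - mean(x))^2/sum((x - mean(x))^2)  # h_ii from the formula
all.equal(unname(hatvalues(fit)), h)                     # TRUE
s <- summary(fit)$sigma                                  # residual standard error s
r <- resid(fit)/(s * sqrt(1 - h))                        # standardized residuals r_i
all.equal(unname(r), unname(rstandard(fit)))             # TRUE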
The hills data set contains records for Scottish hill races. The variables are:
- `dist`: distance (miles)
- `climb`: height gained (ft)
- `time`: record time (minutes)
We will first consider simple linear regression of `time` on `dist` in R.
Reading the Data into R
data(hills, package = "MASS")
Hills <- hills
## Hills <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/hills.csv",
##                   header = TRUE, row.names = 1)
The `row.names = 1` argument tells R that the first column of data in the comma separated values file hills.csv contains information which should provide the row names for the data (in this case, the names of race locations).
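A quick look at the data confirms that they were read in as intended, with the race locations as row names:

head(Hills)  # first few races; row names are race locations
str(Hills)   # 35 observations on dist, climb and time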
Hills.lm <- lm(time ~ dist, data = Hills)
summary(Hills.lm)
Call:
lm(formula = time ~ dist, data = Hills)
Residuals:
Min 1Q Median 3Q Max
-35.745 -9.037 -4.201 2.849 76.170
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.8407 5.7562 -0.841 0.406
dist 8.3305 0.6196 13.446 6.08e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.96 on 33 degrees of freedom
Multiple R-squared: 0.8456, Adjusted R-squared: 0.841
F-statistic: 180.8 on 1 and 33 DF, p-value: 6.084e-15
library(ggplot2)
ggplot(Hills, mapping = aes(x = dist, y = time)) + geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Race distance", y = "Winning time of race")
`geom_smooth()` using formula 'y ~ x'
The fitted model is (to 2 decimal places) \[E[\mbox{time}] = -4.84 + 8.33 \, \mbox{dist}\]
The P-value for testing whether the regression slope is zero is \(6.08\times 10^{-15}\). This is tiny, and indicates overwhelming evidence of (linear) dependence of race time on distance.
The R-squared statistic of 0.846 indicates that distance explains a large proportion (about 85%) of the variation in race record times.
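These quantities can also be extracted from the fitted model object directly, rather than read off the printed summary. A brief sketch (the prediction at dist = 10 is just an illustrative value):

coef(Hills.lm)                                      # estimated intercept and slope
summary(Hills.lm)$r.squared                         # R-squared
coef(summary(Hills.lm))["dist", "Pr(>|t|)"]         # P-value for the slope
confint(Hills.lm)                                   # 95% confidence intervals
predict(Hills.lm, newdata = data.frame(dist = 10))  # predicted time for a 10-mile race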
The command `resid()` returns the (raw) residuals for a fitted linear regression model.
resid(Hills.lm)
Greenmantle Carnethy Craig Dunain Ben Rha
0.09757971 3.20798304 -11.49201696 -12.03770125
Ben Lomond Goatfell Bens of Jura Cairnpapple
0.46407065 11.41407065 76.17042112 -8.77501696
Scolty Traprain Lairig Ghru Dollar
-7.06156077 -5.39201696 -35.74505318 6.23843923
Lomonds Cairn Table Eildon Two Cairngorm
-9.29861363 -1.00901696 -5.71333268 -6.21384173
Seven Hills Knock Hill Black Hill Creag Beag
-13.36866650 58.49935161 -15.22933268 -8.40978887
Kildcon Hill Meall Ant-Suidhe Half Ben Nevis Cow Hill
-4.20064839 3.58412351 2.49098304 6.11280780
N Berwick Law Creag Dubh Burnswark Largo Law
-1.46764839 -2.26410458 -10.70901696 -8.24456077
Criffel Acmony Ben Nevis Knockfarrel
1.19275494 -15.86156077 7.11915827 -12.75901696
Two Breweries Cockleroi Moffat Chase
25.14250874 -4.54633268 -1.93540365
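A convenient way to pick out the most extreme races is to sort the residuals by absolute size; from the output above, Bens of Jura, Knock Hill and Lairig Ghru stand out.

e <- resid(Hills.lm)
head(sort(abs(e), decreasing = TRUE), 3)  # races with the largest absolute residuals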
Assumption A1 is equivalent to stating that \(E[Y_i] = \beta_0 + \beta_1 x_i\).
Failure of A1 indicates that we are fitting the wrong mean function.
Looking for trends in plots of the residuals against the fitted values or a covariate is a useful diagnostic.
If A2 is correct, then residuals should appear randomly scattered about zero if plotted against time order (if known), fitted values or the predictor.
Long sequences of positive residuals followed by sequences of negative residuals suggest that the error terms are not independent.
If A3 is correct, then the spread of the (standardized) residuals about zero should not vary systematically with the fitted value or covariate.
A “fanning out” of the (standardized) residuals when plotted against the fitted values or a covariate is a common sign of failure of A3.
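Assumptions A1 and A3 can both be examined with a hand-made plot of the standardized residuals against the fitted values (the built-in diagnostics shown below automate this):

plot(fitted(Hills.lm), rstandard(Hills.lm),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)  # look for trend (A1 failing) or fanning out (A3 failing)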
A normal probability plot, or normal Q-Q plot, is a standard graphical tool for assessing the normality of a data set.
A normal Q-Q plot of the (standardized) residuals should look close to a straight line — curvature indicates a failure of A4.
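Such a plot can be produced by hand with `qqnorm()`; the built-in `plot(Hills.lm, which = 2)` used below gives essentially the same display.

qqnorm(rstandard(Hills.lm))  # sample quantiles against N(0,1) quantiles
qqline(rstandard(Hills.lm))  # reference line; marked curvature flags non-normality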
The `plot()` command applied to the fitted model `Hills.lm` produces four useful diagnostic plots. The `par()` command, run first, splits the graphics window into a two-by-two grid so that we can see all four graphs together.
par(mfrow = c(2, 2))
plot(Hills.lm)
The first three of these plots are:
plot(Hills.lm, which = 1)
This residual plot shows some data points with very large residuals.
The red line is a nonparametric estimate of trend (take care not to over-interpret).
Would you say there is evidence of increasing spread in residuals?
plot(Hills.lm, which = 2)
The normal Q-Q plot is far from a straight line, indicating failure of the normality assumption.
Note that this plot is heavily influenced by data points with large residuals.
plot(Hills.lm, which = 3)
The scale-location plot provides a hint that spread of residuals increases with fitted value.
Error variance \(\sigma^2\) may not be constant for all data points.
The linear regression model assumptions do not appear to be valid here, so conclusions based on this model are questionable.
The problems with the model seem largely due to a few odd data points, or outliers.