Lecture 6 Outliers and Influential Points

In class version

In the previous lecture we looked at diagnostic tools for linear regression models.

We saw that a model can appear inappropriate for the data because of one or two odd points.

In this lecture we will examine the issue of odd points in more depth.

6.1 What Exactly is an Outlier?

There is no hard and fast definition.

Essentially, an outlier is defined as a data point that emanates from a different model than do the rest of the data.

Notice that this makes the definition dependent on the model in question.

Figure: Outlier county in Florida

6.2 Detecting Outliers

Outliers may be detected by:

  • Inspection of a scatterplot of the data.
  • Inspection of a fitted line plot (where fitted regression line is superimposed on scatterplot).
  • Plotting residuals versus fitted values or covariate.

Any data point that lies an unusual distance from the fitted line, or has a large residual in comparison to the other data points, may be considered an outlier.
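
As a minimal sketch of the fitted line plot approach (using the hill racing data that is introduced formally in the next section), we can superimpose the least squares line on the scatterplot:

data(hills, package = "MASS")  # Scottish hill racing data
plot(time ~ dist, data = hills)  # scatterplot of the data
abline(lm(time ~ dist, data = hills))  # superimpose the fitted regression line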

6.3 Outliers for the Hill Racing Data

data(hills, package = "MASS")
Hills <- hills  # We used Upper case before.
Hills.lm <- lm(time ~ dist, data = Hills)
plot(Hills.lm, which = 1)
Figure: Residuals vs fitted plot for Hills data

Inspection of the plot of residuals versus fitted values suggests three possible outliers: Knock Hill, Bens of Jura, and Lairig Ghru. The most extreme is Bens of Jura.

The Knock Hill outlier may be due to a transcription error: a time of 18.65 minutes appears to have been recorded as 78.65 minutes. See the help page for this data set.
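
If you want to check the suspect record directly, the row name matches the one stored in the data:

Hills["Knock Hill", ]  # inspect the recorded values for this race
help("hills", package = "MASS")  # background on the suspected transcription error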

6.4 What to Do When You Find Outliers

Do not ignore them!

First, investigate whether data have been mis-recorded (go back to data source if possible).

If an outlier is not due to transcription errors, then it may need removing before refitting the model.

In general one should refit the model after removing even a single outlier, because removal of one outlier can alter the fitted model, and hence make other points appear less (or more) inconsistent with the model.

Other approaches covered in this course are to use a transformation, or weighted regression, to reduce an outlier's impact.

The field of robust statistics (not covered in this course) is concerned with more sophisticated ways of dealing with outliers.

6.5 Influence

6.5.1 Leverage

The extent of problems caused by an outlier depends on the amount of influence that this data point has on the fitted model.

The potential influence of a data point can be measured by its leverage. For the ith data point in a simple linear regression, leverage is given by

\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{s_{xx}}\]

Notice that:

  • leverage depends only on the x-value, so the potential influence of a data point does not depend on its response value;
  • data points with extreme x-values (i.e. far from \(\bar{x}\)) have greatest leverage.

As an aside, the leverages sum to the number of parameters fitted in the model, so the average leverage is p/n.
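
As a quick check of these facts (a sketch using the simple regression Hills.lm fitted above), we can compute the leverages directly from the formula and compare them with R's hatvalues():

# Leverages from the formula, for the simple linear regression on distance
x <- Hills$dist
h <- 1/nrow(Hills) + (x - mean(x))^2/sum((x - mean(x))^2)
all.equal(unname(hatvalues(Hills.lm)), h)  # should be TRUE
sum(hatvalues(Hills.lm))  # sums to p = 2 (intercept and slope)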

6.5.2 Cook’s Distance

The actual influence of a data point can be measured by calculating the change in parameter estimates between

  1. the model built on all the data, and

  2. the model built on data with the data point removed.

Cook’s distance is a measure of influence that is based on exactly this kind of approach.

Cook’s distance depends on both the x and y-values for a data point, and will be large for data points with both high leverage and large standardized residual.

A value of Cook’s distance greater than one is a cause for concern.
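
In R, cooks.distance() returns these values for a fitted model. As a sketch, the same numbers can be reconstructed from the standardized residuals and leverages using the standard identity \(D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}}\) (not derived in these notes):

d <- cooks.distance(Hills.lm)
d[d > 1]  # any points of concern?

# Reconstruct Cook's distance from standardized residuals and leverages
r <- rstandard(Hills.lm)
h <- hatvalues(Hills.lm)
p <- length(coef(Hills.lm))  # number of parameters (2 here)
all.equal(d, r^2 * h/(p * (1 - h)))  # should be TRUE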

6.6 Using plots to detect influence

The fourth plot produced in R by plot() applied to a linear model is a plot of standardized residuals against leverage, with contours marking extreme Cook’s distance.

plot(Hills.lm, which = 5)
Figure: Fourth diagnostic plot (residuals versus leverage) for Hills.lm

For the hill racing data, Lairig Ghru and Bens of Jura have much greater influence than other data points.

N.B. The names of observations appear here because the data is stored with rownames. How we import data is important.

Note that the which argument is set to 5, not 4 as you may have expected. There are actually six plots available, but most needs are served by plots 1, 2, 3, and 5.
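
If you do want to see all six at once, something along these lines works (plot 4 shows Cook's distance for each observation, and plot 6 plots Cook's distance against leverage):

par(mfrow = c(2, 3))
plot(Hills.lm, which = 1:6)
par(mfrow = c(1, 1))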

6.7 Conclusions on the Hill Racing Data

The Scottish Hill Racing Data appears to contain some outliers.

Some of these outliers are data points with high leverage, and hence may have a marked effect on the fitted model.

Recall that the definition of an outlier will depend on the model that we choose to fit.

Perhaps the outliers only appear as odd points because the model is inadequate.

For example, it may be that the variable climb has an important effect on race time and should be included in the model, as the sketch below illustrates.
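
As a quick sketch of that suggestion (Hills.lm2 is just an illustrative name), we could fit the larger model and see whether the residual plot still flags the same points:

Hills.lm2 <- lm(time ~ dist + climb, data = Hills)
plot(Hills.lm2, which = 1)  # residuals vs fitted for the model including climb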

6.8 An Exercise

Remove one of the points flagged as unusual in the above analyses. Then re-fit the model to the reduced data. Compare the fit of the two models by plotting their fitted lines on the scatter plot of the data.

Look at the residual analysis for your new model. Does removal of the point you chose affect the status of the other points?

6.8.1 A Solution

The following code removes the row with the largest residual (Bens of Jura) and then refits the linear model. To be explicit, the code:

  • makes the row names into an ordinary column so we can refer to them explicitly in other tidyverse commands (N.B. standard rownames are stored as an attribute, so they do not work so nicely with filter(), which operates on columns);
  • filters out the row we don't want;
  • reverts to having rownames so that our graphs have labelled points;
  • fits the same model to the reduced dataset.
library(tidyverse)  # for data manipulation
library(broom)  # going to sweep up a bunch of useful stuff...
augment(Hills.lm)
# A tibble: 35 × 9
   .rownames     time  dist .fitted   .resid   .hat .sigma    .cooksd .std.resid
   <chr>        <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
 1 Greenmantle   16.1   2.5    16.0   0.0976 0.0529   20.3    7.06e-7    0.00502
 2 Carnethy      48.4   6      45.1   3.21   0.0308   20.3    4.24e-4    0.163  
 3 Craig Dunain  33.6   6      45.1 -11.5    0.0308   20.2    5.44e-3   -0.585  
 4 Ben Rha       45.6   7.5    57.6 -12.0    0.0286   20.1    5.51e-3   -0.612  
 5 Ben Lomond    62.3   8      61.8   0.464  0.0288   20.3    8.25e-6    0.0236 
 6 Goatfell      73.2   8      61.8  11.4    0.0288   20.2    4.99e-3    0.580  
 7 Bens of Jura 205.   16     128.   76.2    0.0977   14.5    8.75e-1    4.02   
 8 Cairnpapple   36.4   6      45.1  -8.78   0.0308   20.2    3.17e-3   -0.447  
 9 Scolty        29.8   5      36.8  -7.06   0.0347   20.2    2.33e-3   -0.360  
10 Traprain      39.8   6      45.1  -5.39   0.0308   20.2    1.20e-3   -0.274  
# ℹ 25 more rows
Hills2 <- Hills |>
    rownames_to_column() |>
    # so we can use the row names explicitly, like a variable
    filter(rowname != "Bens of Jura") |>
    # filter out the one we don't want
    column_to_rownames()
# go back to having rownames so that our graphs have labelled points

Hills2.lm <- lm(time ~ dist, data = Hills2)
augment(Hills2.lm)
# A tibble: 34 × 9
   .rownames     time  dist .fitted  .resid   .hat .sigma  .cooksd .std.resid
   <chr>        <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
 1 Greenmantle   16.1   2.5    17.0  -0.957 0.0531   14.7 0.000129    -0.0679
 2 Carnethy      48.4   6      43.8   4.57  0.0311   14.7 0.00165      0.320 
 3 Craig Dunain  33.6   6      43.8 -10.1   0.0311   14.6 0.00811     -0.711 
 4 Ben Rha       45.6   7.5    55.2  -9.65  0.0295   14.6 0.00694     -0.676 
 5 Ben Lomond    62.3   8      59.1   3.20  0.0300   14.7 0.000778     0.224 
 6 Goatfell      73.2   8      59.1  14.2   0.0300   14.5 0.0152       0.992 
 7 Cairnpapple   36.4   6      43.8  -7.42  0.0311   14.7 0.00435     -0.520 
 8 Scolty        29.8   5      36.1  -6.39  0.0348   14.7 0.00364     -0.449 
 9 Traprain      39.8   6      43.8  -4.03  0.0311   14.7 0.00129     -0.283 
10 Lairig Ghru  193.   28     212.  -19.2   0.475    13.9 1.52        -1.83  
# ℹ 24 more rows

N.B. The augment() command added columns starting with a period, containing several of the values we want for checking assumptions; other added columns will be explained only if they are needed later. Note, though, that the first column of the augmented output is called .rownames, whereas we applied filter() to the data set itself, not to the results of augment().

plot(time ~ dist, data = Hills)
abline(Hills.lm, lty = 2)
abline(Hills2.lm, col = 2)

Figure 6.1: Models fitted to Hills data with and without the Bens of Jura race.

N.B. This form of plot() and abline() is old-fashioned, but still good value.

par(mfrow = c(2, 2))
plot(Hills2.lm)
Figure: Diagnostic plots for Hills2.lm

6.9 Deleted residuals

Suppose we think an outlier is dragging the line towards itself, to some extent hiding how unusual it is.

Suppose we were to delete the \(i^{th}\) data point and recalculate the line using the remaining points \(j = 1, 2, \ldots, n\), \(j \ne i\): \[ \hat y_{-i} = \hat\beta_{0,-i} + \hat\beta_{1,-i} \,\, x \] The \(-i\) notation means with the \(i^{th}\) point removed.

Since we can calculate this “deleted line” at any x, we can do it at the value of \(x_i\).

A deleted residual is \[ e_{i, -i} = y_i - \hat y_{i,-i}\]

We calculate these deleted residuals for each row \(i\) in turn.

We standardise each deleted residual using the variance calculated with the \(i^{th}\) point removed. What we get are known as studentized deleted residuals.

The R function rstudent() calculates the studentized deleted residuals for us.
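
As a sketch of what deletion actually does, we can refit without one point (say Bens of Jura) and recover its deleted residual directly; the shortcut \(e_{i,-i} = e_i/(1 - h_{ii})\) is a standard least squares identity, not something derived in these notes:

# Deleted residual for Bens of Jura, computed by actually refitting
i <- which(rownames(Hills) == "Bens of Jura")
fit.minus.i <- lm(time ~ dist, data = Hills[-i, ])
refit <- Hills$time[i] - predict(fit.minus.i, newdata = Hills[i, ])

# The same value via the shortcut, without refitting
shortcut <- resid(Hills.lm)[i]/(1 - hatvalues(Hills.lm)[i])
c(refit = unname(refit), shortcut = unname(shortcut))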

It is often helpful to plot them on the same graph as the standardized residuals. If all the points overlap, then no data values are influential. But the graph below immediately shows us that deleting the two outliers in the left-middle of the plot would have a big effect, so they are influential. However, deleting the point on the right would not change the model much, so it can be left alone.

plot(rstudent(Hills.lm) ~ predict(Hills.lm), col = "red", pch = "+", main = "Studentized Deleted Residuals")
points(rstandard(Hills.lm) ~ predict(Hills.lm), col = "black", pch = 1)

Figure: Studentized deleted residuals (red +) and standardized residuals (black circles) for Hills.lm

max(rstudent(Hills.lm))
[1] 5.53731
max(rstandard(Hills.lm))
[1] 4.018411

6.10 You might find the following commands useful.

max(cooks.distance(Hills.lm))
[1] 2.15456
max(cooks.distance(Hills2.lm))
[1] 1.517719
max(hatvalues(Hills.lm))  # leverages
[1] 0.4325145
max(hatvalues(Hills2.lm))
[1] 0.474975