Download the R markdown file for this lecture.

In the previous lecture we looked at diagnostic tools for linear regression models.

We saw that a model can appear inappropriate for the data because of one or two odd points.

In this lecture we will examine the issue of odd points in more depth.

What Exactly is an Outlier?

There is no hard and fast definition.

Essentially, an outlier is defined as a data point that emanates from a different model than do the rest of the data.

Notice that this makes the definition dependent on the model in question.

Outlier county in Florida

Detecting Outliers

Outliers may be detected by inspecting the residuals: any data point that lies an unusual distance from the fitted line, or has a large residual in comparison to the other data points, may be considered an outlier.

Outliers for the Hill Racing Data

data(hills, package = "MASS")
Hills <- hills  # We used Upper case before.
Hills.lm <- lm(time ~ dist, data = Hills)
plot(Hills.lm, which = 1)
Residuals vs fitted plot for the Hills data

Inspection of the plot of residuals versus fitted values suggests three possible outliers: Knock Hill, Bens of Jura, and Lairig Ghru. The most extreme is Bens of Jura.
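We can confirm this numerically by ranking the standardized residuals; a minimal sketch using rstandard():

sort(abs(rstandard(Hills.lm)), decreasing = TRUE)[1:3]  # the three most extreme standardized residuals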

The Knock Hill outlier may be due to a transcription error: a time of 18.65 minutes was recorded as 78.65 minutes. See the help page for this data set.

What to Do When You Find Outliers

Do not ignore them!

Investigate whether data have been mis-recorded (go back to data source if possible).

If an outlier is not due to transcription errors, then it may need removing before refitting the model.

In general one should refit the model after removing even a single outlier, because removal of one outlier can alter the fitted model, and hence make other points appear less (or more) inconsistent with the model.
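As a sketch of that refit-after-removal workflow (the choice of Knock Hill here is purely illustrative; update() re-runs lm() with an extra subset argument):

Hills.nokh <- update(Hills.lm, subset = rownames(Hills) != "Knock Hill")
coef(Hills.lm) - coef(Hills.nokh)  # how much the estimates move when one point is dropped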

The field of robust statistics is concerned with more sophisticated ways of dealing with outliers.
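For example, MASS::rlm() fits a regression that automatically downweights extreme points rather than deleting them; shown here only as a pointer, not as part of this course's workflow:

library(MASS)
Hills.rlm <- rlm(time ~ dist, data = Hills)
coef(Hills.rlm)  # compare with coef(Hills.lm)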

Influence

Leverage

The extent of problems caused by an outlier depends on the amount of influence that this data point has on the fitted model.

The potential influence of a data point can be measured by its leverage. For the ith data point in a simple linear regression, leverage is given by

\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{s_{xx}}, \qquad \text{where } s_{xx} = \sum_{j=1}^{n} (x_j - \bar x)^2.\]

Notice that:

  • leverage depends only on the x-value, so the potential influence of a data point does not depend on the response value;
  • data points with extreme x-values (i.e. far from \(\bar{x}\)) have greatest leverage.
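As a quick check of the formula above (a minimal sketch; hatvalues() extracts the leverages that R computes):

x <- Hills$dist
h <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(h, unname(hatvalues(Hills.lm)))  # should be TRUE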

Cook’s Distance

The actual influence of a data point can be measured by calculating the change in parameter estimates between

  1. the model built on all the data, and

  2. the model built on data with the data point removed.

Cook’s distance is a measure of influence that is based on exactly this kind of approach.

Cook’s distance depends on both the x and y-values for a data point, and will be large for data points with both high leverage and large standardized residual.

A value of Cook’s distance greater than one is a cause for concern.
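In R, cooks.distance() returns these values directly, so flagging worrying points takes one line (a sketch):

cd <- cooks.distance(Hills.lm)
cd[cd > 1]  # any data points exceeding the rule-of-thumb threshold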

Using plots to detect influence

The fourth plot produced in R by plot() applied to a linear model is a plot of standardized residuals against leverage, with contour lines marking extreme values of Cook’s distance.

plot(Hills.lm, which = 5)
Fourth diagnostic plot for Hills.lm

For the hill racing data, Lairig Ghru and Bens of Jura have much greater influence than other data points.

N.B. The names of the observations appear here because the data are stored with row names; how we import data is important.

You should observe that the which argument is set to 5, not 4 as you might have expected. There are actually six plots available, but the default set of plots 1, 2, 3, and 5 serves the vast majority of needs.
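For completeness, plot 4 is an index plot of Cook’s distance itself, and can be requested explicitly:

plot(Hills.lm, which = 4)  # Cook's distance against observation number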

Conclusions on the Hill Racing Data

The Scottish Hill Racing Data appears to contain some outliers.

Some of these outliers are data points with high leverage, and hence may have a marked effect on the fitted model.

Recall that the definition of an outlier will depend on the model that we choose to fit.

Perhaps the outliers only appear as odd points because the model is inadequate.

For example, it may be that the variable climb has an important effect on race time, and should be included in the model.
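As a quick sketch of that idea (model building is a later topic, so treat this only as a check; the model name here is just an illustration):

Hills.both <- lm(time ~ dist + climb, data = Hills)
summary(Hills.both)$coefficients  # does climb earn its place in the model?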

An Exercise

Remove one of the points flagged as unusual in the above analyses. Then re-fit the model to the reduced data. Compare the fit of the two models by plotting their fitted lines on the scatter plot of the data.

Look at the residual analysis for your new model. Does removal of the point you chose affect the status of the other points?

The Solution

library(tidyverse) # for data manipulation
library(broom) # going to sweep up a bunch of useful stuff...
augment(Hills.lm)
# A tibble: 35 x 9
   .rownames     time  dist .fitted   .resid   .hat .sigma    .cooksd .std.resid
   <chr>        <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
 1 Greenmantle   16.1   2.5    16.0   0.0976 0.0529   20.3    7.06e-7    0.00502
 2 Carnethy      48.4   6      45.1   3.21   0.0308   20.3    4.24e-4    0.163  
 3 Craig Dunain  33.6   6      45.1 -11.5    0.0308   20.2    5.44e-3   -0.585  
 4 Ben Rha       45.6   7.5    57.6 -12.0    0.0286   20.1    5.51e-3   -0.612  
 5 Ben Lomond    62.3   8      61.8   0.464  0.0288   20.3    8.25e-6    0.0236 
 6 Goatfell      73.2   8      61.8  11.4    0.0288   20.2    4.99e-3    0.580  
 7 Bens of Jura 205.   16     128.   76.2    0.0977   14.5    8.75e-1    4.02   
 8 Cairnpapple   36.4   6      45.1  -8.78   0.0308   20.2    3.17e-3   -0.447  
 9 Scolty        29.8   5      36.8  -7.06   0.0347   20.2    2.33e-3   -0.360  
10 Traprain      39.8   6      45.1  -5.39   0.0308   20.2    1.20e-3   -0.274  
# ... with 25 more rows
Hills %>%
  rownames_to_column() %>%               # so we can use the row names explicitly, like a variable
  filter(rowname != "Bens of Jura") %>%  # filter out the one we don't want
  column_to_rownames() -> Hills2         # restore row names so that our graphs have labelled points

Hills2.lm <- lm(time ~ dist, data = Hills2)
augment(Hills2.lm)
# A tibble: 34 x 9
   .rownames     time  dist .fitted  .resid   .hat .sigma  .cooksd .std.resid
   <chr>        <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
 1 Greenmantle   16.1   2.5    17.0  -0.957 0.0531   14.7 0.000129    -0.0679
 2 Carnethy      48.4   6      43.8   4.57  0.0311   14.7 0.00165      0.320 
 3 Craig Dunain  33.6   6      43.8 -10.1   0.0311   14.6 0.00811     -0.711 
 4 Ben Rha       45.6   7.5    55.2  -9.65  0.0295   14.6 0.00694     -0.676 
 5 Ben Lomond    62.3   8      59.1   3.20  0.0300   14.7 0.000778     0.224 
 6 Goatfell      73.2   8      59.1  14.2   0.0300   14.5 0.0152       0.992 
 7 Cairnpapple   36.4   6      43.8  -7.42  0.0311   14.7 0.00435     -0.520 
 8 Scolty        29.8   5      36.1  -6.39  0.0348   14.7 0.00364     -0.449 
 9 Traprain      39.8   6      43.8  -4.03  0.0311   14.7 0.00129     -0.283 
10 Lairig Ghru  193.   28     212.  -19.2   0.475    13.9 1.52        -1.83  
# ... with 24 more rows

N.B. The augment() command added columns, each starting with a period, that contain several of the values we want for checking assumptions; the other added columns will be explained only if they are needed in future. Note that although the first of these columns is called .rownames, we apply the filter() command to the data set itself, not to the results of augment().

The standard row names are stored as an attribute, so they do not work well with the filter() command, which operates on columns.
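If you prefer base R, the same point can be dropped by subsetting on the row names directly (equivalent to the pipeline above):

Hills2 <- Hills[rownames(Hills) != "Bens of Jura", ]  # row names are kept intact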

plot(time ~ dist, data = Hills)
abline(Hills.lm, lty = 2)
abline(Hills2.lm, col = 2)
Models fitted to the Hills data with and without the Bens of Jura race.

N.B. This form of plot() and abline() is old-fashioned, but still good value.

par(mfrow = c(2, 2))
plot(Hills2.lm)
Diagnostic plots for Hills2.lm

You might find the following commands useful.

max(cooks.distance(Hills.lm))
[1] 2.15456
max(cooks.distance(Hills2.lm))
[1] 1.517719
max(hatvalues(Hills.lm))  # leverages
[1] 0.4325145
max(hatvalues(Hills2.lm))
[1] 0.474975