Download the R markdown file for this lecture.
In the previous lecture we looked at diagnostic tools for linear regression models.
We saw that a model can appear inappropriate for the data because of one or two odd points.
In this lecture we will examine the issue of odd points in more depth.
There is no hard and fast definition.
Essentially, an outlier is defined as a data point that emanates from a different model than do the rest of the data.
Notice that this makes the definition dependent on the model in question.
Outliers may be detected by inspecting residuals: any data point that lies an unusual distance from the fitted line, or has a large residual in comparison to the other data points, may be considered an outlier.
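As a quick sketch of this idea (using the hills data loaded just below; the cut-off of 2 standardized residuals is a common rule of thumb, not a formal test), large residuals can be flagged directly with rstandard():

```r
data(hills, package = "MASS")
fit <- lm(time ~ dist, data = hills)
rs <- rstandard(fit)   # standardized residuals
rs[abs(rs) > 2]        # flag points more than 2 standard deviations out
```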
data(hills, package = "MASS")
Hills <- hills # We used Upper case before.
Hills.lm <- lm(time ~ dist, data = Hills)
plot(Hills.lm, which = 1)
Inspection of the plot of residuals versus fitted values suggests three possible outliers: Knock Hill, Bens of Jura, and Lairig Ghru. Most extreme is Bens of Jura.
The Knock Hill outlier may be due to a transcription error: a time of 18.65 minutes was recorded as 78.65 minutes. See the help page for this data.
Do not ignore them!
Investigate whether data have been mis-recorded (go back to data source if possible).
If an outlier is not due to transcription errors, then it may need removing before refitting the model.
In general one should refit the model after removing even a single outlier, because removal of one outlier can alter the fitted model, and hence make other points appear less (or more) inconsistent with the model.
The field of robust statistics is concerned with more sophisticated ways of dealing with outliers.
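As an illustrative sketch of that field (not developed further in this lecture), the MASS package provides rlm(), which fits the regression by M-estimation and automatically downweights outlying points rather than removing them:

```r
library(MASS)
data(hills)
fit.ols <- lm(time ~ dist, data = hills)
fit.rob <- rlm(time ~ dist, data = hills)   # Huber M-estimation by default
cbind(ols = coef(fit.ols), robust = coef(fit.rob))
round(sort(fit.rob$w)[1:3], 2)   # smallest weights: the most downweighted points
```

The weights component of the fitted object shows how much each observation was discounted; badly outlying points receive weights well below 1.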
The extent of problems caused by an outlier depends on the amount of influence that this data point has on the fitted model.
The potential influence of a data point can be measured by its leverage. For the ith data point in a simple linear regression, leverage is given by
\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{s_{xx}}\]
Notice that leverage depends only on the x-value of a data point: it takes its minimum of \(1/n\) when \(x_i = \bar x\), and grows as \(x_i\) moves further from the mean.
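The formula above can be verified by hand. In the following sketch, s_xx is taken to be \(\sum_j (x_j - \bar x)^2\); the hand-computed leverages agree with R's built-in hatvalues():

```r
data(hills, package = "MASS")
x   <- hills$dist
n   <- length(x)
sxx <- sum((x - mean(x))^2)
h   <- 1/n + (x - mean(x))^2 / sxx   # leverage from the formula above
fit <- lm(time ~ dist, data = hills)
all.equal(unname(hatvalues(fit)), h)  # TRUE: matches R's leverages
```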
The actual influence of a data point can be measured by calculating the change in parameter estimates between
the model built on all the data, and
the model built on data with the data point removed.
Cook’s distance is a measure of influence that is based on exactly this kind of approach.
Cook’s distance depends on both the x and y-values for a data point, and will be large for data points with both high leverage and large standardized residual.
A value of Cook’s distance greater than one is a cause for concern.
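The leave-one-out idea can be sketched directly: refit the model without each point in turn, compare the fitted values to those from the full model, and scale by the number of parameters p and the full model's residual variance. The result agrees with R's cooks.distance():

```r
data(hills, package = "MASS")
fit <- lm(time ~ dist, data = hills)
p   <- length(coef(fit))      # number of parameters (2 here)
s2  <- summary(fit)$sigma^2   # residual variance from the full fit
D <- sapply(seq_len(nrow(hills)), function(i) {
  fit.i <- lm(time ~ dist, data = hills[-i, ])           # drop point i
  sum((fitted(fit) - predict(fit.i, hills))^2) / (p * s2)
})
all.equal(unname(cooks.distance(fit)), D)  # TRUE
```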
The fourth plot produced in R by plot() applied to a linear model is a plot of standardized residuals against leverage, with contours marking extreme values of Cook's distance.
plot(Hills.lm, which = 5)
For the hill racing data, Lairig Ghru and Bens of Jura have much greater influence than other data points.
N.B. The names of observations appear here because the data are stored with rownames. How we import data is important.
You should observe that the which argument is not set to 4 as you may have expected. There are actually six plots available, but the vast majority of needs are served by plots 1, 2, 3, and 5.
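If you do want to see all six diagnostic plots at once, you can ask for them explicitly:

```r
data(hills, package = "MASS")
Hills.lm <- lm(time ~ dist, data = hills)
par(mfrow = c(2, 3))          # arrange six panels on one page
plot(Hills.lm, which = 1:6)   # all six diagnostic plots
```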
The Scottish Hill Racing Data appears to contain some outliers.
Some of these outliers are data points with high leverage, and hence may have a marked effect on the fitted model.
Recall that the definition of an outlier will depend on the model that we choose to fit.
Perhaps the outliers only appear as odd points because the model is inadequate.
For example, it may be that the variable climb has an important effect on race time, and should be included in the model.
Remove one of the points flagged as unusual in the above analyses. Then re-fit the model to the reduced data. Compare the fit of the two models by plotting their fitted lines on the scatter plot of the data.
Look at the residual analysis for your new model. Does removal of the point you chose affect the status of the other points?
library(tidyverse) # for data manipulation
library(broom) # going to sweep up a bunch of useful stuff...
augment(Hills.lm)
# A tibble: 35 x 9
.rownames time dist .fitted .resid .hat .sigma .cooksd .std.resid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Greenmantle 16.1 2.5 16.0 0.0976 0.0529 20.3 7.06e-7 0.00502
2 Carnethy 48.4 6 45.1 3.21 0.0308 20.3 4.24e-4 0.163
3 Craig Dunain 33.6 6 45.1 -11.5 0.0308 20.2 5.44e-3 -0.585
4 Ben Rha 45.6 7.5 57.6 -12.0 0.0286 20.1 5.51e-3 -0.612
5 Ben Lomond 62.3 8 61.8 0.464 0.0288 20.3 8.25e-6 0.0236
6 Goatfell 73.2 8 61.8 11.4 0.0288 20.2 4.99e-3 0.580
7 Bens of Jura 205. 16 128. 76.2 0.0977 14.5 8.75e-1 4.02
8 Cairnpapple 36.4 6 45.1 -8.78 0.0308 20.2 3.17e-3 -0.447
9 Scolty 29.8 5 36.8 -7.06 0.0347 20.2 2.33e-3 -0.360
10 Traprain 39.8 6 45.1 -5.39 0.0308 20.2 1.20e-3 -0.274
# ... with 25 more rows
Hills %>% rownames_to_column() %>% # so we can use them explicitly like a variable
  filter(rowname != "Bens of Jura") %>% # filter out the one we don't want
  column_to_rownames() -> Hills2 # go back to having rownames so that our graphs have labelled points
Hills2.lm <- lm(time ~ dist, data = Hills2)
augment(Hills2.lm)
# A tibble: 34 x 9
.rownames time dist .fitted .resid .hat .sigma .cooksd .std.resid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Greenmantle 16.1 2.5 17.0 -0.957 0.0531 14.7 0.000129 -0.0679
2 Carnethy 48.4 6 43.8 4.57 0.0311 14.7 0.00165 0.320
3 Craig Dunain 33.6 6 43.8 -10.1 0.0311 14.6 0.00811 -0.711
4 Ben Rha 45.6 7.5 55.2 -9.65 0.0295 14.6 0.00694 -0.676
5 Ben Lomond 62.3 8 59.1 3.20 0.0300 14.7 0.000778 0.224
6 Goatfell 73.2 8 59.1 14.2 0.0300 14.5 0.0152 0.992
7 Cairnpapple 36.4 6 43.8 -7.42 0.0311 14.7 0.00435 -0.520
8 Scolty 29.8 5 36.1 -6.39 0.0348 14.7 0.00364 -0.449
9 Traprain 39.8 6 43.8 -4.03 0.0311 14.7 0.00129 -0.283
10 Lairig Ghru 193. 28 212. -19.2 0.475 13.9 1.52 -1.83
# ... with 24 more rows
N.B. The augment() command added columns starting with a period, containing several of the values we want for checking assumptions; other added columns will only be explained if they are needed in future. Note that the first column is called .rownames, but we used the filter() command on the data set, not on the results of augment(). The standard rownames are an attribute, so they don't work so nicely with the filter() command, which operates on columns.
plot(time ~ dist, data = Hills)
abline(Hills.lm, lty = 2)
abline(Hills2.lm, col = 2)
N.B. This form of plot() and abline() is old-fashioned, but still good value.
par(mfrow = c(2, 2))
plot(Hills2.lm)
max(cooks.distance(Hills.lm))
[1] 2.15456
max(cooks.distance(Hills2.lm))
[1] 1.517719
max(hatvalues(Hills.lm)) # leverages
[1] 0.4325145
max(hatvalues(Hills2.lm))
[1] 0.474975
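A maximum leverage on its own is hard to judge. One common rule of thumb (an assumption here, not part of the lecture) flags any leverage exceeding twice the average, i.e. greater than 2p/n, where p is the number of model parameters:

```r
data(hills, package = "MASS")
Hills.lm <- lm(time ~ dist, data = hills)
h <- hatvalues(Hills.lm)
p <- length(coef(Hills.lm))   # 2 parameters: intercept and slope
n <- nrow(hills)
h[h > 2 * p / n]              # rule-of-thumb cut-off for high leverage
```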