Download the R markdown file for this lecture.
In the previous lecture we looked at the assumptions of the simple linear regression model.
In this lecture we shall consider what to do if the model assumptions fail.
A Reminder
Failure of A1 indicates that the relationship between the mean response and the predictor is not linear (as specified by the simple linear regression model).
Failure of this assumption is very serious — all conclusions from the model will be questionable.
One possible remedy is to transform the data: for example, if the responses curve upwards then a log transformation of the responses may help.
However, it is important to recognize that any transformation of the response will also affect the error variance. Hence fixing one problem (lack of linearity) by transformation may create another problem (heteroscedasticity).
An alternative approach is to use polynomial regression (covered later in the course).
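As a quick illustration of the transformation remedy, the following sketch uses simulated data (all variable names here are hypothetical, not from the lecture's dataset) in which the response curves upwards with the predictor, so that a log transformation of the response linearizes the relationship.

```r
# Sketch: simulated data where the response curves upward with x,
# so a log transform of y linearises the relationship.
set.seed(1)                                     # for reproducibility
x <- seq(1, 10, length.out = 50)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.1))   # multiplicative errors

fit.raw <- lm(y ~ x)        # violates A1: residuals show curvature
fit.log <- lm(log(y) ~ x)   # linear on the log scale

# On the log scale the fitted slope estimates the true value 0.3.
coef(fit.log)
```

A residual plot for fit.raw would show systematic curvature, while the residuals from fit.log should look patternless.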
Failure of A2 occurs most frequently because the data exhibit serial correlation in time (or space).
Failure of this assumption will leave parameter estimates unbiased, but standard errors will be incorrect.
Hence failure of A2 will render test results and confidence intervals unreliable.
We will look at regression models for time series later in the course.
Failure of this assumption will leave parameter estimates unbiased, but standard errors will be incorrect.
Hence failure of A3 will render test results and confidence intervals unreliable.
Perhaps the most common form of heteroscedasticity is when the error variance increases with the mean response.
A logarithmic transformation of both response and predictor variables can sometimes help in this case.
An alternative strategy is to use weighted linear regression (covered later in this lecture).
Failure of the assumption of normality for the distribution of the errors is not usually a serious problem.
Removal of (obvious) outliers will typically improve the normality of the standardized residuals.
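One way to assess the normality assumption in practice is a normal Q-Q plot of the standardized residuals. The sketch below uses simulated data (the model fit here is illustrative, not the lecture's example).

```r
# Sketch: assessing A4 (normality of errors) via standardized residuals.
set.seed(2)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

z <- rstandard(fit)    # standardized residuals
qqnorm(z); qqline(z)   # points close to the line suggest normality
```

An obvious outlier would appear as a point far from the reference line; removing it and refitting will typically bring the remaining standardized residuals closer to normality.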
Suppose that the variance is not constant: \(\mbox{Var}(Y_i) = v_i \sigma^2\) where \(v_1, v_2, \ldots, v_n\) are some pre-specified variance multipliers.
In that case the observations should not contribute equally to parameter estimation: data with lower variance should be given more weight.
We can achieve this by using a weighted sum of squares.
The weighted sum of squares is \[\begin{aligned} WSS(\beta_0, \beta_1) &= \sum_{i=1}^n w_i (y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n w_i (y_i - \mu_i)^2\end{aligned}\] where \(w_1, w_2, \ldots, w_n\) are weights given by \(w_i = c/v_i\) for some constant \(c\).
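The weighted least squares estimates are the values of \(\beta_0\) and \(\beta_1\) that minimize this criterion, which is exactly what lm() with a weights argument computes. A small sketch with simulated data (the variable names are illustrative) confirms that perturbing the fitted coefficients can only increase the WSS:

```r
# Sketch: weighted least squares minimises the weighted sum of squares.
set.seed(3)
x <- 1:20
v <- x                                        # variance multipliers v_i
y <- 2 + 0.5 * x + rnorm(20, sd = sqrt(v))    # Var(Y_i) = v_i * sigma^2
w <- 1 / v                                    # weights w_i = 1 / v_i

wss <- function(b0, b1) sum(w * (y - b0 - b1 * x)^2)

fit <- lm(y ~ x, weights = w)                 # weighted linear regression
b <- coef(fit)

# Moving either coefficient away from the WLS fit increases the WSS:
wss(b[1], b[2]) <= wss(b[1] + 0.1, b[2])
wss(b[1], b[2]) <= wss(b[1], b[2] + 0.1)
```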
Car insurance claims made by drivers aged 60+ in Britain in 1975. The claims were divided into 16 groups by the insurance agency. For each group, the average claim \(y\) and the number of claims \(x\) were recorded. The aim is to regress \(y\) on \(x\).
Define \(Z_{ij}\) to be the jth claim in the ith group.
Average claim for the ith group, \(Y_i\) (the response variable) given by \[Y_i = \frac{1}{x_i} \sum_{j=1}^{x_i} Z_{ij}\]
\[\mbox{Var}(Y_i) = \frac{1}{x_i^2} \mbox{Var} \left ( \sum_{j=1}^{x_i} Z_{ij} \right ) = \frac{1}{x_i^2} x_i \mbox{Var}(Z_{i1}) = \frac{\sigma^2_Z}{x_i}\] where \(\sigma^2_Z = \mbox{Var}(Z_{ij})\) is the variance of a single claim amount.
Hence \(\mbox{Var}(Y_i) = v_i \sigma^2\) where \(v_i = 1/x_i\).
Therefore apply weights \(w_i = 1/v_i = x_i\).
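The variance calculation above can be checked by simulation. The sketch below (with an arbitrary claim distribution, not the real insurance data) verifies empirically that the variance of the average of \(x_i\) claims is \(\sigma^2_Z / x_i\):

```r
# Sketch: the variance of a mean of x_i claims is sigma_Z^2 / x_i.
set.seed(4)
sigmaZ <- 10
group.mean <- function(xi) mean(rnorm(xi, mean = 200, sd = sigmaZ))

# Empirical variance of the average claim for groups of size 4 and 100:
v4   <- var(replicate(5000, group.mean(4)))     # approx sigmaZ^2 / 4   = 25
v100 <- var(replicate(5000, group.mean(100)))   # approx sigmaZ^2 / 100 = 1
c(v4, v100)
```

Larger groups give more precise average claims, which is why they deserve more weight in the regression.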
Insure <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/insure.csv",
                   header = TRUE)
head(Insure)
NClaim AveClaim
1 64 264
2 100 198
3 43 167
4 53 114
5 228 224
6 233 193
Insure.lm <- lm(AveClaim ~ NClaim, weights = NClaim, data = Insure)
summary(Insure.lm)
Call:
lm(formula = AveClaim ~ NClaim, data = Insure, weights = NClaim)
Weighted Residuals:
Min 1Q Median 3Q Max
-797.0 -359.3 -154.6 346.8 1288.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 220.52827 33.75251 6.534 1.33e-05 ***
NClaim 0.01414 0.20449 0.069 0.946
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 563.2 on 14 degrees of freedom
Multiple R-squared: 0.0003415, Adjusted R-squared: -0.07106
F-statistic: 0.004782 on 1 and 14 DF, p-value: 0.9458
Comments and Conclusions
Notice that the lm() command accepts an additional argument, weights, in which the user supplies the weights \(w_1, w_2, \ldots, w_n\). The appropriate weighting factor for these data is simply the number of claims, as we saw above.
The summary table for the analysis shows a highly non-significant P-value (P = 0.946) for the test of the regression slope.
We conclude that there is no evidence of a relationship between the average claim size and the number of claims in each group.