Download the R markdown file for this lecture.
In the previous lecture we looked at the assumptions of the simple linear regression model.
In this lecture we shall consider what to do if the model assumptions fail.
A Reminder
Failure of A1 indicates that the relationship between the mean response and the predictor is not linear (as specified by the simple linear regression model).
Failure of this assumption is very serious — all conclusions from the model will be questionable.
One possible remedy is to transform the data: for example, if the responses curve upwards then a log transformation of the responses may help.
However, it is important to recognize that any transformation of the response will also affect the error variance. Hence fixing one problem (lack of linearity) by transformation may create another problem (heteroscedasticity).
An alternative approach is to use polynomial regression (covered later in the course).
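As a quick illustration of the transformation remedy, the following sketch uses simulated data (all variable names here are hypothetical, not from the lecture's dataset) in which the response curves upwards with the predictor, so that a log transformation of the response linearizes the relationship.

```r
# Sketch: simulated data where the response curves upward with x,
# so a log transform of y linearises the relationship.
set.seed(1)                                     # for reproducibility
x <- seq(1, 10, length.out = 50)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.1))   # multiplicative errors

fit.raw <- lm(y ~ x)        # violates A1: residuals show curvature
fit.log <- lm(log(y) ~ x)   # linear on the log scale

# On the log scale the fitted slope estimates the true value 0.3.
coef(fit.log)
```

A residual plot for fit.raw would show systematic curvature, while the residuals from fit.log should look patternless.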
Failure of A2 occurs most frequently because the data exhibit serial correlation in time (or space).
Failure of this assumption will leave parameter estimates unbiased, but standard errors will be incorrect.
Hence failure of A2 will render test results and confidence intervals unreliable.
We will look at regression models for time series later in the course.
Failure of this assumption will leave parameter estimates unbiased, but standard errors will be incorrect.
Hence failure of A3 will render test results and confidence intervals unreliable.
Perhaps the most common form of heteroscedasticity is when the error variance increases with the mean response.
A logarithmic transformation of both response and predictor variables can sometimes help in this case.
An alternative strategy is to use weighted linear regression (covered later in this lecture).
Failure of the assumption of normality for the distribution of the errors is not usually a serious problem.
Removal of (obvious) outliers will typically improve the normality of the standardized residuals.
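One way to assess the normality assumption in practice is a normal Q-Q plot of the standardized residuals. The sketch below uses simulated data (the model fit here is illustrative, not the lecture's example).

```r
# Sketch: assessing A4 (normality of errors) via standardized residuals.
set.seed(2)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

z <- rstandard(fit)    # standardized residuals
qqnorm(z); qqline(z)   # points close to the line suggest normality
```

An obvious outlier would appear as a point far from the reference line; removing it and refitting will typically bring the remaining standardized residuals closer to normality.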
Suppose that the variance is not constant: \(\mbox{Var}(Y_i) = v_i \sigma^2\) where \(v_1, v_2, \ldots, v_n\) are some pre-specified variance multipliers.
In that case the observations should not contribute equally to parameter estimation: data with lower variance should be given more weight.
We can achieve this by using a weighted sum of squares.
The weighted sum of squares is \[\begin{aligned} WSS(\beta_0, \beta_1) &= \sum_{i=1}^n w_i (y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^n w_i (y_i - \mu_i)^2\end{aligned}\] where \(w_1, w_2, \ldots, w_n\) are weights given by \(w_i = c/v_i\) for some constant \(c\).
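The weighted least squares estimates are the values of \(\beta_0\) and \(\beta_1\) that minimize this criterion, which is exactly what lm() with a weights argument computes. A small sketch with simulated data (the variable names are illustrative) confirms that perturbing the fitted coefficients can only increase the WSS:

```r
# Sketch: weighted least squares minimises the weighted sum of squares.
set.seed(3)
x <- 1:20
v <- x                                        # variance multipliers v_i
y <- 2 + 0.5 * x + rnorm(20, sd = sqrt(v))    # Var(Y_i) = v_i * sigma^2
w <- 1 / v                                    # weights w_i = 1 / v_i

wss <- function(b0, b1) sum(w * (y - b0 - b1 * x)^2)

fit <- lm(y ~ x, weights = w)                 # weighted linear regression
b <- coef(fit)

# Moving either coefficient away from the WLS fit increases the WSS:
wss(b[1], b[2]) <= wss(b[1] + 0.1, b[2])
wss(b[1], b[2]) <= wss(b[1], b[2] + 0.1)
```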
Car insurance claims made by drivers aged 60+ in Britain in 1975. The claims were divided into 16 groups by the insurance agency. For each group, the average claim \(y\) and the number of claims \(x\) were recorded. The aim is to regress \(y\) on \(x\).
Define \(Z_{ij}\) to be the jth claim in the ith group.
Average claim for the ith group, \(Y_i\) (the response variable) given by \[Y_i = \frac{1}{x_i} \sum_{j=1}^{x_i} Z_{ij}\]
\[\mbox{Var}(Y_i) = \frac{1}{x_i^2} \mbox{Var} \left ( \sum_{j=1}^{x_i} Z_{ij} \right ) = \frac{1}{x_i^2} x_i \mbox{Var}(Z_{i1}) = \frac{\sigma^2_Z}{x_i}\] where \(\sigma^2_Z = \mbox{Var}(Z_{ij})\) is the variance of a single claim amount.
Hence \(\mbox{Var}(Y_i) = v_i \sigma^2\) where \(v_i = 1/x_i\).
Therefore apply weights \(w_i = 1/v_i = x_i\).
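The variance calculation above can be checked by simulation. The sketch below (with an arbitrary claim distribution, not the real insurance data) verifies empirically that the variance of the average of \(x_i\) claims is \(\sigma^2_Z / x_i\):

```r
# Sketch: the variance of a mean of x_i claims is sigma_Z^2 / x_i.
set.seed(4)
sigmaZ <- 10
group.mean <- function(xi) mean(rnorm(xi, mean = 200, sd = sigmaZ))

# Empirical variance of the average claim for groups of size 4 and 100:
v4   <- var(replicate(5000, group.mean(4)))     # approx sigmaZ^2 / 4   = 25
v100 <- var(replicate(5000, group.mean(100)))   # approx sigmaZ^2 / 100 = 1
c(v4, v100)
```

Larger groups give more precise average claims, which is why they deserve more weight in the regression.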
Insure <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/insure.csv",
                   header = TRUE)
head(Insure)
NClaim AveClaim
1 64 264
2 100 198
3 43 167
4 53 114
5 228 224
6 233 193
Insure.lm <- lm(AveClaim ~ NClaim, weights = NClaim, data = Insure)
summary(Insure.lm)
Call:
lm(formula = AveClaim ~ NClaim, data = Insure, weights = NClaim)
Weighted Residuals:
Min 1Q Median 3Q Max
-797.0 -359.3 -154.6 346.8 1288.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 220.52827 33.75251 6.534 1.33e-05 ***
NClaim 0.01414 0.20449 0.069 0.946
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 563.2 on 14 degrees of freedom
Multiple R-squared: 0.0003415, Adjusted R-squared: -0.07106
F-statistic: 0.004782 on 1 and 14 DF, p-value: 0.9458
Comments and Conclusions
Notice that the lm() command accepts an additional argument, weights, in which the user supplies the weights \(w_1, w_2, \ldots, w_n\). The appropriate weighting factor for these data is simply the number of claims, as we saw above.
The summary table for the analysis shows a highly non-significant P-value (P = 0.946) for the test of the regression slope.
We conclude that there is no evidence of a relationship between the average claim size and the number of claims in each group.