Download the R markdown file for this lecture.
This lecture will cover prediction for (Simple) Linear Regression.
Simple linear regression models can be used to predict the response at any given value of the predictor.
However, we should beware of predicting far beyond the range of the data (extrapolation).
Point predictions should be accompanied by corresponding prediction intervals, providing a range of plausible values.
Suppose that we want to predict the response when the predictor x takes the value x0.
Our model for the corresponding response, y0, is
\[Y_0 = \mu_0 + \varepsilon_0\] where \[\mu_0 = \beta_0 + \beta_1 x_0\]
It is natural to estimate \(\mu_0\) by
\[\hat \mu_0 = \hat \beta_0 + \hat \beta_1 x_0\]
We have no information about \(\varepsilon_0\), so we estimate it by its expected value; that is, we define \(\hat \varepsilon_0 = 0\) as our estimate of \(\varepsilon_0\).
Then \[\begin{aligned} \hat y_0 &= \hat \mu_0 + \hat \varepsilon_0\\ &= \hat \beta_0 + \hat \beta_1 x_0\end{aligned}\]
This should be fairly intuitive: we simply read the prediction from the fitted regression line.
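To see this numerically, the prediction can be computed directly from the estimated coefficients. A minimal sketch, using R's built-in `cars` data (stopping distance versus speed) purely for illustration, not the lecture's data:

```r
# Point prediction by hand: read the value off the fitted regression line
fit <- lm(dist ~ speed, data = cars)   # simple linear regression
b <- coef(fit)                         # b[1] = intercept, b[2] = slope
x0 <- 15
y0_hat <- b[1] + b[2] * x0             # hat(y0) = hat(beta0) + hat(beta1) * x0

# predict() returns the same value:
all.equal(unname(y0_hat),
          unname(predict(fit, newdata = data.frame(speed = x0))))
```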
The prediction was \(\hat y_0 = \hat \mu_0\).
But \(\hat \mu_0\) is also the natural estimate of \(\mu_0 = E[Y_0]\).
Hence the prediction is the same as the estimate of the mean response.
However, differences arise if we want to construct corresponding intervals.
The standard deviation of the sampling distribution of \(\hat \mu_0\) is
\[\sigma_{\hat\mu_0} = \sigma \sqrt{ \frac1n + \frac{ (x_0 - \bar x)^2} {s_{xx}} }\]
The standard error for \(\hat \mu_0\) is
\[SE(\hat \mu_0) = s \sqrt{ \frac1n + \frac{(x_0 - \bar x)^2} {s_{xx}} } \; .\]
It can be shown that the sampling distribution of \(\hat \mu_0\) satisfies
\[\frac{ \hat \mu_0 - \mu_0 }{SE(\hat \mu_0)} \sim t(n-2)\]
Hence a \(100(1-\alpha)\%\) confidence interval for \(\mu_0\) is given by
\[\left ( \hat \mu_0 - t_{\alpha/2} SE(\hat \mu_0), \; \hat \mu_0 + t_{\alpha/2} SE(\hat \mu_0) \right )\]
where \(t_{\alpha/2}\) is the critical value from the \(t\) distribution with \(n-2\) degrees of freedom.
This confidence interval provides a range of plausible values for the mean response \(\mu_0\).
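This interval can be reproduced directly from the formulas above. A sketch, again using the built-in `cars` data for illustration:

```r
# 95% confidence interval for the mean response, computed from the formulas
fit  <- lm(dist ~ speed, data = cars)
n    <- nrow(cars)
s    <- summary(fit)$sigma                     # residual standard error
xbar <- mean(cars$speed)
sxx  <- sum((cars$speed - xbar)^2)

x0      <- 15
mu0_hat <- sum(coef(fit) * c(1, x0))           # estimated mean response at x0
se_mu0  <- s * sqrt(1/n + (x0 - xbar)^2 / sxx) # SE(hat(mu0))
tcrit   <- qt(0.975, df = n - 2)               # t critical value, alpha = 0.05

c(lwr = mu0_hat - tcrit * se_mu0, upr = mu0_hat + tcrit * se_mu0)
# Agrees with predict(fit, newdata = data.frame(speed = x0), interval = "confidence")
```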
When it comes to finding the prediction error for \(y_0\), we must take the \(\varepsilon_0\) term into account:
\[\mbox{var}(y_0 - \hat y_0) = \mbox{var}(\hat \mu_0) + \mbox{var}(\varepsilon_0) = [SE(\hat \mu_0)]^2 + \sigma^2\]
Replacing \(\sigma^2\) by its estimate \(s^2\), we get the prediction error of \(\hat y_0\) as
\[PE(\hat y_0) = s \sqrt{ \frac1n + \frac{(x_0 - \bar{x})^2}{s_{xx} }+1} \; .\]
A \(100(1-\alpha)\%\) prediction interval for y0 is then
\[\left ( \hat{y_0} - t_{\alpha/2} PE(\hat{y_0}), \; \hat{y_0} + t_{\alpha/2} PE(\hat y_0) \right )\]
As the name suggests, the probability that y0 will actually lie in this prediction interval is \(1-\alpha\).
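The only change from the confidence-interval calculation is the extra \(+1\) under the square root, which accounts for the variability of \(\varepsilon_0\). A sketch with the same built-in `cars` data:

```r
# 95% prediction interval for a new observation at x0
fit  <- lm(dist ~ speed, data = cars)
n    <- nrow(cars)
s    <- summary(fit)$sigma
xbar <- mean(cars$speed)
sxx  <- sum((cars$speed - xbar)^2)

x0     <- 15
y0_hat <- sum(coef(fit) * c(1, x0))
pe     <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / sxx) # extra "1" for var(eps0)
tcrit  <- qt(0.975, df = n - 2)

c(lwr = y0_hat - tcrit * pe, upr = y0_hat + tcrit * pe)
# Agrees with predict(fit, newdata = data.frame(speed = x0), interval = "prediction")
```

Because of the extra term, the prediction interval is always wider than the confidence interval for the mean response at the same \(x_0\).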
We will use data on boiling point in degrees Fahrenheit (y) and pressure in inches of mercury (x), collected during an expedition in the Alps. Data source: A Handbook of Small Data Sets by Hand, Daly, Lunn, McConway and Ostrowski.
Boil <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/boiling.csv",
    header = TRUE)
Boil.lm <- lm(BPt ~ Pressure, data = Boil)
summary(Boil.lm)
Call:
lm(formula = BPt ~ Pressure, data = Boil)
Residuals:
Min 1Q Median 3Q Max
-1.22687 -0.22178 0.07723 0.19687 0.51001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 155.29648 0.92734 167.47 <2e-16 ***
Pressure 1.90178 0.03676 51.74 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.444 on 15 degrees of freedom
Multiple R-squared: 0.9944, Adjusted R-squared: 0.9941
F-statistic: 2677 on 1 and 15 DF, p-value: < 2.2e-16
x0 <- data.frame(Pressure = c(27, 32))
predict(Boil.lm, newdata = x0, interval = "prediction", level = 0.95)
fit lwr upr
1 206.6446 205.6590 207.6303
2 216.1536 215.0382 217.2690
ggplot(Boil, mapping = aes(y = BPt, x = Pressure)) + geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(y = "Boiling Point", x = "Pressure")
`geom_smooth()` using formula 'y ~ x'
The model fit looks good, so we should expect fairly tight prediction intervals.
Fitted model is \[E[\mbox{BPt}] = 155.3 + 1.902~\mbox{Pressure}\]
Pressure explains over 99% of the variation in boiling point.
When pressure is 27 inches of mercury, predicted temperature is 206.6 degrees Fahrenheit with 95% prediction interval (205.7, 207.6).
When pressure is 32 inches of mercury, prediction is 216.2 with 95% prediction interval (215.0, 217.3).
Use the model above to prove to yourself that the prediction intervals are wider than the corresponding confidence intervals, and that a 99% confidence interval is wider than a 95% one:
x0 <- data.frame(Pressure = c(27:32))
predict(Boil.lm, newdata = x0, interval = "prediction", level = 0.95)
fit lwr upr
1 206.6446 205.6590 207.6303
2 208.5464 207.5457 209.5472
3 210.4482 209.4266 211.4698
4 212.3500 211.3020 213.3980
5 214.2518 213.1724 215.3312
6 216.1536 215.0382 217.2690
predict(Boil.lm, newdata = x0, interval = "confidence", level = 0.95)
fit lwr upr
1 206.6446 206.3693 206.9200
2 208.5464 208.2212 208.8717
3 210.4482 210.0635 210.8329
4 212.3500 211.8999 212.8000
5 214.2518 213.7328 214.7707
6 216.1536 215.5633 216.7438
predict(Boil.lm, newdata = x0, interval = "confidence", level = 0.99)
fit lwr upr
1 206.6446 206.2640 207.0253
2 208.5464 208.0968 208.9961
3 210.4482 209.9163 210.9801
4 212.3500 211.7278 212.9722
5 214.2518 213.5343 214.9693
6 216.1536 215.3375 216.9696