Download the R markdown file for this lecture.
Simple linear regression seeks to model the relationship between the mean of the response y and a single predictor x.
You will have met simple linear regression in earlier stats papers, so some of the material in this lecture should be revision.
Nonetheless, simple linear regression provides a friendly context to introduce notation, remind you of fundamental concepts, and take a first look at statistical modelling with R.
The following R code reads the pulse data into a data frame called Pulse and then prints it.
Pulse <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/pulse.csv",
                  header = TRUE)
Pulse
Height Pulse
1 160 68
2 167 80
3 162 84
4 175 80
5 177 80
6 185 80
7 165 76
8 162 80
9 182 100
10 173 92
11 162 88
12 167 92
13 172 90
14 170 80
15 177 90
16 170 80
17 168 90
18 163 80
19 178 80
20 158 80
21 182 76
22 157 80
23 167 80
24 160 78
25 170 84
26 170 90
27 160 80
28 177 80
29 182 80
30 166 72
31 168 80
32 170 80
33 155 80
34 148 82
35 175 104
36 175 76
37 168 80
38 160 84
39 180 68
40 153 70
41 175 84
42 185 80
43 145 64
44 165 82
45 170 84
46 165 84
47 175 72
48 172 116
49 185 80
50 163 95
library(ggplot2)
ggplot(Pulse, mapping = aes(x = Height, y = Pulse)) + geom_point() +
    labs(y = "Resting pulse rate (beats per minute)", x = "Height (in centimeters)")
It is of interest to know whether pulse rate is truly dependent on height.
We view pulse as the response (y) variable and height as the predictor (x) variable.
We can formally test whether this really is the case by using a simple linear regression model.
Note that there is one observation (a patient with resting pulse 116) which lies some distance from the main bulk of the data.
We have quantitative data in pairs: \((x_1, y_1), \ldots, (x_n, y_n)\).
Here \(y_i\) denotes the response variable and \(x_i\) the explanatory variable for the \(i\)th individual.
We aim to estimate the mean response for any given value of the explanatory variable.
It therefore makes sense to think of \(x_1, \ldots, x_n\) as fixed constants, while \(y_1, \ldots, y_n\) are chance values taken by random variables \(Y_1, \ldots, Y_n\).
The linear regression model is
\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]
where
- \(\beta_0, \beta_1\) are regression parameters;
- \(\varepsilon_1, \ldots, \varepsilon_n\) are error terms, about which we make the following assumptions.
(A1) The expected value of the errors is zero: \(E[\varepsilon_i] = 0\) for \(i = 1, \ldots, n\).
(A2) The errors \(\varepsilon_1, \ldots, \varepsilon_n\) are independent.
(A3) The error variance is constant: \(\mbox{var}(\varepsilon_i) = \sigma^2\) (an unknown constant) for \(i = 1, \ldots, n\).
(A4) The errors \(\varepsilon_1, \ldots, \varepsilon_n\) are normally distributed.
N.B. We refer back to these four assumptions many times in this course; they will often be referred to using the shorthand (A1) to (A4).
The simple linear regression model given above, together with assumption (A1), implies that
\[E[Y_i] = \mu_i = \beta_0 + \beta_1 x_i\]
So we say that \(E[Y_i] = \mu_i\) is the expected value of \(Y_i\) given (or conditional upon) \(x_i\).
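To make the model and assumptions (A1) to (A4) concrete, here is a minimal simulation sketch in R; the predictor values, sample size and parameter choices (\(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 2\)) are purely illustrative and are not estimated from the pulse data.
set.seed(1)                             # for reproducibility
n <- 50
x <- runif(n, 150, 190)                 # predictor values, treated as fixed constants
beta0 <- 1; beta1 <- 2; sigma <- 2      # illustrative parameter values
eps <- rnorm(n, mean = 0, sd = sigma)   # errors satisfying (A1) to (A4)
y <- beta0 + beta1 * x + eps            # responses generated by the model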
Parameters \(\beta_0\) and \(\beta_1\) can be estimated from the data using the method of least squares.
Least squares estimates (LSEs) \(\hat{\beta_0}\) and \(\hat{\beta_1}\), are those values that minimize the sum of squares
\[\begin{aligned} SS(\beta_0, \beta_1) &= \sum_{i=1}^n (y_i - \mu_i)^2\\ &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \end{aligned}\]
It can be shown that \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are given by \[\hat{\beta_1} = \frac{s_{xy}}{s_{xx}}\] where
\(s_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\);
\(s_{xx} = \sum_{i=1}^n (x_i - \bar x)^2\);
and \(\hat{\beta_0} = \bar y - \hat{\beta_1} \bar x \; .\)
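As a numerical check on these formulas, we can compute \(s_{xy}\), \(s_{xx}\) and the least squares estimates directly from the pulse data and compare them with R's built-in lm() function; this is a sketch assuming the Pulse data frame read in earlier.
sxy <- sum((Pulse$Height - mean(Pulse$Height)) * (Pulse$Pulse - mean(Pulse$Pulse)))
sxx <- sum((Pulse$Height - mean(Pulse$Height))^2)
beta1.hat <- sxy / sxx                                            # slope estimate
beta0.hat <- mean(Pulse$Pulse) - beta1.hat * mean(Pulse$Height)   # intercept estimate
c(Intercept = beta0.hat, Slope = beta1.hat)
coef(lm(Pulse ~ Height, data = Pulse))                            # should agree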
The sampling distributions of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are:
\[\hat{\beta_1} \sim N \left(\beta_1, \frac{\sigma^2}{s_{xx}} \right )\] and \[\hat{\beta_0} \sim N \left(\beta_0 , \, \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{s_{xx}}\right ] \right ) \; .\]
What do these sampling distributions tell us?
Just for illustration, suppose that the true population model is \(Y_i = 1 + 2 x_i + \varepsilon_i\) (\(i = 1, 2, \ldots, n\)) with \(\varepsilon_i \sim N(0, 2^2)\).
To keep the calculations simple, let's also say that the covariate information we need is: sample size \(n = 32\), \(\bar{x} = \sqrt{2}\), and \(s_{xx} = 64\).
Calculate the probability that the slope estimator exceeds 3; that is, find \(P(\hat{\beta_1} > 3)\).
We know that the estimator \(\hat{\beta_1}\) is normally distributed, so we standardise and use the standard normal distribution, \(Z \sim N(0,1)\), to find the probability.
Standardising gives \(P(\hat{\beta_1} > 3) = P \left ( Z > \frac{3 - \beta_1}{\sigma/\sqrt{s_{xx}}} \right )\).
Inserting the values given above yields \(P \left ( Z > \frac{3 - 2}{2/\sqrt{64}} \right ) = P \left ( Z > \frac{1}{2/8} \right ) = P(Z > 4),\)
which is approximately zero.
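We can confirm this tail probability in R:
pnorm(4, lower.tail = FALSE)   # P(Z > 4), roughly 3e-05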
Your turn!
In practice, we are almost certainly working with sample data, which means that the error variance \(\sigma^2\) is also an unknown parameter.
It can be estimated using the residual sum of squares (RSS), defined by
\[RSS = \sum_{i=1}^n (y_i - \hat{\mu_i})^2 = \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i )^2 ,\]
where \(\hat{\mu_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) is the \(i\)th fitted value and \(e_i = y_i - \hat{\mu_i}\) is the \(i\)th residual.
Then an estimate of \(\sigma^2\) is given by
\[s^2 = \frac{1}{n-2} RSS.\]
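As a sketch using the Pulse data frame from earlier, the RSS and \(s^2\) can be computed from the fitted model and checked against the residual standard error that R reports.
fit <- lm(Pulse ~ Height, data = Pulse)   # fit the simple linear regression
RSS <- sum(residuals(fit)^2)              # residual sum of squares
s2 <- RSS / (nrow(Pulse) - 2)             # estimate of sigma^2
c(RSS = RSS, s2 = s2)
summary(fit)$sigma^2                      # squared residual standard error, should match s2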
We have seen the term \(\sigma^2\) in a number of equations. It is not known, so it must be estimated from our data. When we do not know this “true” variance, we use an estimate, the “sample variance”, instead. This means we must also alter the sampling distribution. The same thing happened when you used sample data to find an interval estimate of the population mean, or conducted a one-sample test about the population mean.
Standardizing the normal sampling distribution of \(\hat{\beta_1}\) gives
\[\frac{\hat{\beta_1} - \beta_1}{\sigma/\sqrt{s_{xx}}} \sim N(0,1)\]
Replacing \(\sigma^2\) with the estimate \(s^2\) changes the sampling distribution of \(\hat{\beta_1}\):
\[\frac{\hat{\beta_1} - \beta_1}{SE(\hat{\beta_1})} \sim t(n-2)\]
where \(SE(\hat{\beta_1}) = s/\sqrt{s_{xx}}\) is the estimated standard error of \(\hat{\beta_1}\).
The sampling distribution for \(\hat{\beta_1}\) forms the basis for testing hypotheses and constructing confidence intervals for \(\beta_1\).
For testing H0: \(\beta_1 = \beta_1^0\), against H1: \(\beta_1 \ne \beta_1^0\), use the test statistic
\[t = \frac{\hat{\beta_1} - \beta_1^0}{SE(\hat{\beta_1})}.\]
We then find the p-value corresponding to t. As usual, a smaller p-value indicates more evidence against H0.
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is
\[\left (\hat{\beta_1} - t_{\alpha/2}(n-2) SE(\hat{\beta_1}), \, \hat{\beta_1} + t_{\alpha/2}(n-2) SE(\hat{\beta_1}) \right ) .\]
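For example, a 95% confidence interval for the slope in the pulse data can be built from this formula, or obtained directly with R's confint() function; the sketch below assumes the Pulse data frame from earlier.
fit <- lm(Pulse ~ Height, data = Pulse)
beta1.hat <- coef(fit)["Height"]                                # estimated slope
se.beta1 <- summary(fit)$coefficients["Height", "Std. Error"]   # its standard error
tcrit <- qt(0.975, df = nrow(Pulse) - 2)                        # t_{alpha/2}(n - 2) for alpha = 0.05
c(beta1.hat - tcrit * se.beta1, beta1.hat + tcrit * se.beta1)   # interval by hand
confint(fit, "Height", level = 0.95)                            # built-in equivalent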
The t distribution is used for small samples when we do not know the population variance.
You’ll see that we still use the t distribution when samples are so large that it doesn’t actually matter. Why doesn’t it matter? Because the t distribution gets closer and closer to the normal distribution as the sample size increases.
The “fatter” tails of the t distribution mean that the critical values we use in confidence intervals are larger for smaller sample sizes.
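You can see this numerically by comparing the 97.5th percentile of the t distribution at several degrees of freedom with the corresponding standard normal value:
qt(0.975, df = c(5, 10, 30, 100, 1000))   # t critical values shrink as df grows
qnorm(0.975)                              # normal critical value, about 1.96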
For the pulse and height data, it can be shown that the fitted regression line (i.e. the regression line with least squares estimates in place of the unknown model parameters) is
\[E[\mbox{Pulse}] = 46.907 + 0.2098~\mbox{Height}\]
Also, it can be shown that \(SE(\hat{\beta_1}) = 0.1354\).
Finally, recall that n=50.
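These quantities can be reproduced in R by fitting the model with lm() and inspecting the coefficient table; this sketch again assumes the Pulse data frame from earlier.
fit <- lm(Pulse ~ Height, data = Pulse)
summary(fit)$coefficients   # estimates, standard errors, t statistics and p-values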
Question: Do the data provide evidence the pulse rate depends on height?
Solution: To answer this question, we will test
H0: \(\beta_1 = 0\), versus
H1: \(\beta_1 \ne 0\)
The test statistic is
\[t = \frac{\hat{\beta_1} - \beta_1^0}{SE(\hat{\beta_1})} = \frac{0.2098}{0.1354} = 1.549\]
The (two-sided) p-value is \(P = P(|T| > 1.549) = 2 \times P(T > 1.549) = 2 \times 0.06398 = 0.128\). We therefore conclude that the data provide no evidence of a relationship between pulse rate and height.
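The tail probability can be checked with R's t distribution function:
2 * pt(1.549, df = 48, lower.tail = FALSE)   # two-sided p-value, about 0.128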