Download the R markdown file for this lecture.
Simple linear regression seeks to model the relationship between the mean of the response y and a single predictor x.
You will have met simple linear regression in earlier stats papers, so some of the material in this lecture should be revision.
Nonetheless, simple linear regression provides a friendly context to introduce notation, remind you of fundamental concepts, and take a first look at statistical modelling with R.
The following R code reads the pulse data into a data frame called Pulse and then prints it.
Pulse <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/pulse.csv",
                  header = TRUE)
Pulse
Height Pulse
1 160 68
2 167 80
3 162 84
4 175 80
5 177 80
6 185 80
7 165 76
8 162 80
9 182 100
10 173 92
11 162 88
12 167 92
13 172 90
14 170 80
15 177 90
16 170 80
17 168 90
18 163 80
19 178 80
20 158 80
21 182 76
22 157 80
23 167 80
24 160 78
25 170 84
26 170 90
27 160 80
28 177 80
29 182 80
30 166 72
31 168 80
32 170 80
33 155 80
34 148 82
35 175 104
36 175 76
37 168 80
38 160 84
39 180 68
40 153 70
41 175 84
42 185 80
43 145 64
44 165 82
45 170 84
46 165 84
47 175 72
48 172 116
49 185 80
50 163 95
library(ggplot2)
ggplot(Pulse, mapping = aes(x = Height, y = Pulse)) + geom_point() +
    labs(y = "Resting pulse rate (beats per minute)", x = "Height (in centimeters)")
It is of interest to know whether pulse rate is truly dependent on height.
We view pulse as the response (y) variable and height as the predictor (x) variable.
We can formally test whether this really is the case by using a simple linear regression model.
Note that there is one observation (a patient with resting pulse 116) which lies some distance from the main bulk of the data.
We have quantitative data in pairs: \((x_1, y_1), \ldots, (x_n, y_n)\).
Here \(y_i\) denotes the response variable and \(x_i\) the explanatory variable for the \(i\)th individual.
We aim to estimate the mean response for any given value of the explanatory variable.
It therefore makes sense to think of \(x_1, \ldots, x_n\) as fixed constants, while \(y_1, \ldots, y_n\) are chance values taken by random variables \(Y_1, \ldots, Y_n\).
The linear regression model is
\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]
where
- \(\beta_0, \beta_1\) are regression parameters;
- \(\varepsilon_1, \ldots, \varepsilon_n\) are error terms, about which we make the following assumptions.
(A1) The expected value of the errors is zero: \(E[\varepsilon_i] = 0\) for \(i = 1, \ldots, n\).
(A2) The errors \(\varepsilon_1, \ldots, \varepsilon_n\) are independent.
(A3) The error variance is constant: \(\mbox{var}(\varepsilon_i) = \sigma^2\) (an unknown constant) for \(i = 1, \ldots, n\).
(A4) The errors \(\varepsilon_1, \ldots, \varepsilon_n\) are normally distributed.
N.B. We refer back to these four assumptions many times in this course; they will often be referred to using the shorthand (A1) to (A4).
The simple linear regression model given above, together with assumption (A1), implies that
\[E[Y_i] = \mu_i = \beta_0 + \beta_1 x_i\]
So we say that \(E[Y_i] = \mu_i\) is the expected value of \(Y_i\) given (or conditional upon) \(x_i\).
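To make the model and assumptions (A1) to (A4) concrete, here is a minimal simulation sketch in R; the predictor values, sample size and parameter choices (\(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 2\)) are purely illustrative and are not estimated from the pulse data.
set.seed(1)                             # for reproducibility
n <- 50
x <- runif(n, 150, 190)                 # predictor values, treated as fixed constants
beta0 <- 1; beta1 <- 2; sigma <- 2      # illustrative parameter values
eps <- rnorm(n, mean = 0, sd = sigma)   # errors satisfying (A1) to (A4)
y <- beta0 + beta1 * x + eps            # responses generated by the model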
Parameters \(\beta_0\) and \(\beta_1\) can be estimated from the data using the method of least squares.
Least squares estimates (LSEs) \(\hat{\beta_0}\) and \(\hat{\beta_1}\), are those values that minimize the sum of squares
\[\begin{aligned} SS(\beta_0, \beta_1) &= \sum_{i=1}^n (y_i - \mu_i)^2\\ &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \end{aligned}\]
It can be shown that \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are given by \[\hat{\beta_1} = \frac{s_{xy}}{s_{xx}}\] where
\(s_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\);
\(s_{xx} = \sum_{i=1}^n (x_i - \bar x)^2\);
and \(\hat{\beta_0} = \bar y - \hat{\beta_1} \bar x \; .\)
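As a numerical check on these formulas, we can compute \(s_{xy}\), \(s_{xx}\) and the least squares estimates directly from the pulse data and compare them with R's built-in lm() function; this is a sketch assuming the Pulse data frame read in earlier.
sxy <- sum((Pulse$Height - mean(Pulse$Height)) * (Pulse$Pulse - mean(Pulse$Pulse)))
sxx <- sum((Pulse$Height - mean(Pulse$Height))^2)
beta1.hat <- sxy / sxx                                            # slope estimate
beta0.hat <- mean(Pulse$Pulse) - beta1.hat * mean(Pulse$Height)   # intercept estimate
c(Intercept = beta0.hat, Slope = beta1.hat)
coef(lm(Pulse ~ Height, data = Pulse))                            # should agree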
The sampling distributions of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are:
\[\hat{\beta_1} \sim N \left(\beta_1, \frac{\sigma^2}{s_{xx}} \right )\] and \[\hat{\beta_0} \sim N \left(\beta_0 , \, \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{s_{xx}}\right ] \right ) \; .\]
What do these sampling distributions tell us?
Just for illustration, suppose that the true population model is \(Y_i = 1 + 2 x_i + \varepsilon_i\) (\(i = 1, 2, \ldots, n\)) with \(\varepsilon_i \sim N(0, 2^2)\).
To keep the calculations simple, let's also say that the covariate information we need is: sample size \(n = 32\), \(\bar{x} = \sqrt{2}\), and \(s_{xx} = 64\).
Calculate the probability that the slope estimator exceeds 3; that is, find \(P(\hat{\beta_1} > 3)\).
We know that the estimator \(\hat{\beta_1}\) is normally distributed, so we standardise and use the standard normal distribution, \(Z \sim N(0,1)\), to find the probability.
Standardising gives \(P(\hat{\beta_1} > 3) = P \left ( Z > \frac{3 - \beta_1}{\sigma/\sqrt{s_{xx}}} \right )\).
Inserting the values given above yields \(P \left ( Z > \frac{3 - 2}{2/\sqrt{64}} \right ) = P \left ( Z > \frac{1}{2/8} \right ) = P(Z > 4),\)
which is approximately zero.
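We can confirm this tail probability in R:
pnorm(4, lower.tail = FALSE)   # P(Z > 4), roughly 3e-05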
Your turn!
In practice, we are almost certainly working with sample data, which means that the error variance \(\sigma^2\) is also an unknown parameter.
It can be estimated using the residual sum of squares (RSS), defined by
\[RSS = \sum_{i=1}^n (y_i - \hat{\mu_i})^2 = \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i )^2 ,\]
where \(\hat{\mu_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) is the \(i\)th fitted value and \(e_i = y_i - \hat{\mu_i}\) is the \(i\)th residual.
Then an estimate of \(\sigma^2\) is given by
\[s^2 = \frac{1}{n-2} RSS.\]
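As a sketch using the Pulse data frame from earlier, the RSS and \(s^2\) can be computed from the fitted model and checked against the residual standard error that R reports.
fit <- lm(Pulse ~ Height, data = Pulse)   # fit the simple linear regression
RSS <- sum(residuals(fit)^2)              # residual sum of squares
s2 <- RSS / (nrow(Pulse) - 2)             # estimate of sigma^2
c(RSS = RSS, s2 = s2)
summary(fit)$sigma^2                      # squared residual standard error, should match s2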
We have seen the term \(\sigma^2\) in a number of equations. It is not known, so it must be estimated from our data. When we do not know this “true” variance, we use an estimate, the “sample variance”, instead. This means we must also alter the sampling distribution. The same thing happened when you used sample data to find an interval estimate of the population mean, or conducted a one-sample test about the population mean.
Standardizing the normal sampling distribution of \(\hat{\beta_1}\) gives
\[\frac{\hat{\beta_1} - \beta_1}{\sigma/\sqrt{s_{xx}}} \sim N(0,1)\]
Replacing \(\sigma^2\) with the estimate \(s^2\) changes the sampling distribution of \(\hat{\beta_1}\):
\[\frac{\hat{\beta_1} - \beta_1}{SE(\hat{\beta_1})} \sim t(n-2)\]
where \(SE(\hat{\beta_1}) = s/\sqrt{s_{xx}}\) is the estimated standard error of \(\hat{\beta_1}\).
The sampling distribution for \(\hat{\beta_1}\) forms the basis for testing hypotheses and constructing confidence intervals for \(\beta_1\).
For testing H0: \(\beta_1 = \beta_1^0\), against H1: \(\beta_1 \ne \beta_1^0\), use the test statistic
\[t = \frac{\hat{\beta_1} - \beta_1^0}{SE(\hat{\beta_1})}.\]
We then find the p-value corresponding to t. As usual, a smaller p-value indicates more evidence against H0.
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is
\[\left (\hat{\beta_1} - t_{\alpha/2}(n-2) SE(\hat{\beta_1}), \, \hat{\beta_1} + t_{\alpha/2}(n-2) SE(\hat{\beta_1}) \right ) .\]
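For example, a 95% confidence interval for the slope in the pulse data can be built from this formula, or obtained directly with R's confint() function; the sketch below assumes the Pulse data frame from earlier.
fit <- lm(Pulse ~ Height, data = Pulse)
beta1.hat <- coef(fit)["Height"]                                # estimated slope
se.beta1 <- summary(fit)$coefficients["Height", "Std. Error"]   # its standard error
tcrit <- qt(0.975, df = nrow(Pulse) - 2)                        # t_{alpha/2}(n - 2) for alpha = 0.05
c(beta1.hat - tcrit * se.beta1, beta1.hat + tcrit * se.beta1)   # interval by hand
confint(fit, "Height", level = 0.95)                            # built-in equivalent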
The t distribution is used for small samples when we do not know the population variance.
You’ll see that we still use the t distribution when samples are so large that it doesn’t actually matter. Why doesn’t it matter? Because the t distribution gets closer and closer to the normal distribution as the sample size increases.
The “fatter” tails of the t distribution mean that the critical values we use in confidence intervals are larger for smaller sample sizes.
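You can see this numerically by comparing the 97.5th percentile of the t distribution at several degrees of freedom with the corresponding standard normal value:
qt(0.975, df = c(5, 10, 30, 100, 1000))   # t critical values shrink as df grows
qnorm(0.975)                              # normal critical value, about 1.96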
For the pulse and height data, it can be shown that the fitted regression line (i.e. the regression line with least squares estimates in place of the unknown model parameters) is
\[E[\mbox{Pulse}] = 46.907 + 0.2098~\mbox{Height}\]
Also, it can be shown that \(SE(\hat{\beta_1}) = 0.1354\).
Finally, recall that n=50.
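These quantities can be reproduced in R by fitting the model with lm() and inspecting the coefficient table; this sketch again assumes the Pulse data frame from earlier.
fit <- lm(Pulse ~ Height, data = Pulse)
summary(fit)$coefficients   # estimates, standard errors, t statistics and p-values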
Question: Do the data provide evidence the pulse rate depends on height?
Solution: To answer this question, we will test
H0: \(\beta_1 = 0\), versus
H1: \(\beta_1 \ne 0\)
The test statistic is
\[t = \frac{\hat{\beta_1} - \beta_1^0}{SE(\hat{\beta_1})} = \frac{0.2098}{0.1354} = 1.549\]
The (two-sided) p-value is \(P = P(|T| > 1.549) = 2 \times P(T > 1.549) = 2 \times 0.06398 = 0.128\). We therefore conclude that the data provide no evidence of a relationship between pulse rate and height.
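The tail probability can be checked with R's t distribution function:
2 * pt(1.549, df = 48, lower.tail = FALSE)   # two-sided p-value, about 0.128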