Download the R markdown file for this lecture.

In this lecture we will look at simple linear regression modelling in R.

We shall use R to fit models (i.e. estimate the unknown parameters in models).

We shall discuss the interpretation of output from R.

The lm() Command

R uses the linear model command to fit models of this type.

The basic syntax is lm(formula, data) where:

Analysis of the Pulse Data:

## Pulse <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/pulse.csv", 
##     header = TRUE)
summary(Pulse)
     Height          Pulse      
 Min.   :145.0   Min.   : 64.0  
 1st Qu.:162.2   1st Qu.: 80.0  
 Median :169.0   Median : 80.0  
 Mean   :168.7   Mean   : 82.3  
 3rd Qu.:175.0   3rd Qu.: 84.0  
 Max.   :185.0   Max.   :116.0  

Download pulse.csv

It is common to use names for your models that make sense, especially to the person you will work with the most in future (you, yourself!). Remember, your future self will be pretty annoyed that your current self isn’t going to be able to answer questions about your funny selections of names, so go easy on yourself by being clear with your work.

The most common convention is to use a name that shows what data was used and what type of model was created. We will see <data>.lm lots in this course!

Pulse.lm <- lm(Pulse ~ 1 + Height, data = Pulse)
summary(Pulse.lm)

Call:
lm(formula = Pulse ~ 1 + Height, data = Pulse)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.666  -4.876  -1.520   3.424  33.012 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  46.9069    22.8793   2.050   0.0458 *
Height        0.2098     0.1354   1.549   0.1279  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.811 on 48 degrees of freedom
Multiple R-squared:  0.04762,   Adjusted R-squared:  0.02778 
F-statistic:   2.4 on 1 and 48 DF,  p-value: 0.1279

The most commonly sought numbers in that output are the coefficients

coef(Pulse.lm)
(Intercept)      Height 
  46.906927    0.209774 

But in simple regression, we often want to see how these coefficients create a line, and how that line looks alongside our data.

library(ggplot2)
ggplot(Pulse, mapping = aes(x = Height, y = Pulse)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE) + labs(ylab = "Resting pulse rate (beats per minute)", xlab = "Height (in centimeters) ")
`geom_smooth()` using formula 'y ~ x'
Plot of resting pulse rate (beats per minute) versus height (in centimeters) with fitted line added, for a sample of 50 hospital patients. Source: A Handbook of Small Data Sets by Hand, Daly, Lunn, McConway and Ostrowski.

Plot of resting pulse rate (beats per minute) versus height (in centimeters) with fitted line added, for a sample of 50 hospital patients. Source: A Handbook of Small Data Sets by Hand, Daly, Lunn, McConway and Ostrowski.

Comments on the R code for the Pulse Data

The read.csv() command reads (multivariate) data from a text file with commas separating the values.

The option header=TRUE indicates that the first line of the text file contains column headings (not data points).

The summary() command will typically give sensible output when applied to a variety of types of object. We saw it used to summarise the raw data, and to summarise the simple regression model we fitted.

The formula Pulse ~ 1 + Height specifies that Pulse is the response and Height the explanatory variable in the model. The 1 explicitly indicates inclusion of an intercept. (Most people leave it out.)

The geom_smooth() command adds a line fitted using the specified method for fitting the model. Note that we used "lm" here so that this line matches the results from fitting a model using lm() to this data.

Interpretation of Model for Pulse Data

The summary table of coefficients contains much relevant information. We can get this part of the summary using the tidy() command.

tidy(Pulse.lm) %>% kable()
term estimate std.error statistic p.value
(Intercept) 46.906927 22.8793281 2.050188 0.0458292
Height 0.209774 0.1354041 1.549245 0.1278917

The kable() command here makes the pretty table for presentation; you wouldn’t use it in an interactive situation.

The fitted model (i.e. model with parameters replaced by estimates) is \[E[\mbox{Pulse}] = 46.91 + 0.21 ~\mbox{Height}\]

The error standard deviation, \(\sigma\), is estimated by s=8.811 (the residual standard error) according to the output.

R2 for the Pulse Data

The (multiple) R-squared statistic, R2, is the square of the correlation between the observed and fitted responses.

It can be interpreted as the proportion of variation in the response that is explained by the predictor in the model.

Hence Height explains just R2 = 4.8% of the variation in the response according to the fitted model.

The slope estimate, \(\hat{\beta_1} = 0.209774\) has associated standard error \(SE(\hat \beta_1) = 0.1354041.\)

For testing H0: \(\beta_1 = 0\) versus H1: \(\beta_1 \ne 0\), the t-test statistic is

\[t = \hat \beta_1 / SE(\hat \beta_1) = 0.210/0.135 = 1.5492447\]

The corresponding P-value is P=0.1278917, just like when we did the analysis by hand.

We conclude that the data provide no evidence of a (linear) relationship between resting pulse rate and height.

Further Questions from the Pulse Data

Suppose (just for the purposes of argument) that we wish to test H0: \(\beta_1 = - 0.1\) versus H1: \(\beta_1 > - 0.1\). Use the R output given earlier to calculate the t-statistic for this test. How would you compute the associated P-value?

How would you calculate a 95% confidence interval for \(\beta_1\) the true slope ? What information do you need in addition to that which is provided in the R output?

the thinker Discuss these questions with your classmates.

Final Comments on the Model for Pulse Data

Our conclusion does not provide evidence that pulse is not related to height - that’s not the way hypothesis testing works.

Our conclusions are based on the assumption that the model is appropriate.

Failure of the model assumptions A1-A4 (described previously) would indicate an inappropriate model.

We need diagnostic tools to assess model adequacy.