Download the R markdown file for this lecture.
This lecture is an overview of regression and linear models.
This course deals with models that are used to explain how one random variable, Y, is affected by one or more other variables x1, x2, …, xp.
Here:
Y is called the response variable;
x1, x2, …, xp are called the explanatory variables, or regressors, or predictors, or covariates.
A statistical regression model specifies how the distribution of Y depends on the values x1, x2, …, xp (which are assumed to be fixed for the purposes of our analyses);
Regression models express the response distribution in terms of these values and also one or more unknown parameters.
Suppose you observe paired (x,y) data.
Class discussion: How is this different to a regression analysis?
We shall typically assume that the distribution of Y is normal for any given values of x1, x2, …, xp.
The model is then:
\[Y \sim N \left ( g(x_1, x_2, \ldots, x_p), \, \sigma^2 \right )\]
where \(g\) is some function of the explanatory variables and \(\sigma^2\) is the (unknown) error variance.
The model \(Y \sim N \left ( g(x_1, x_2, \ldots, x_p),\, \sigma^2 \right )\) can be expressed equivalently by
\[Y = g(x_1, x_2, \ldots, x_p) + \varepsilon\]
where \(\varepsilon \sim N(0,\, \sigma^2)\).
Notice that the mean (or expected) value of Y for this model is given by \(E[Y] = g(x_1, x_2, \ldots, x_p)\).
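A quick simulation sketch of this idea (the mean function \(g\) and all numerical values here are made up for illustration): if we draw many values of Y at a fixed x, their average should be close to g(x).

```r
set.seed(1)
g <- function(x) 2 + 3 * x               # hypothetical mean function g
x <- 1.5
y <- g(x) + rnorm(1e5, mean = 0, sd = 2)  # many draws of Y at this fixed x
mean(y)                                   # close to g(1.5) = 6.5
```

The sample mean of the simulated responses approximates \(E[Y] = g(x)\), as the model asserts.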
Usually we will assume that g is a parametric function.
Suppose you have data on a response variable y (e.g. blood pressure) and an explanatory variable x (e.g. a measurement of cholesterol).
We want to model the relationship between the mean value of y, and x.
We might use a simple linear regression model: \[E[Y] = \beta_0 + \beta_1 x\] where \(\beta_0\) and \(\beta_1\) are model parameters.
This is a linear model because \(E[Y]\) is linearly related to the parameters \(\beta_0\) and \(\beta_1\) (not because it is linearly related to x).
Linear models are easy to apply and interpret.
The mathematical theory underlying linear models is very well understood.
We can investigate the relationship between a response and lots of explanatory variables in a straightforward manner.
A linear model will often (but not always) provide an adequate approximation to reality.
Which of the following are linear models?
\(Y \sim N( \beta_0 + \beta_1 x^{\beta_2}, \, \sigma^2)\)
\(Y \sim N( \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3, \, \sigma^2)\)
\(Y \sim N( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \log(x_3), \, \sigma^2)\)
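Note that a polynomial model such as the cubic above is linear in its parameters, so it can still be fitted with `lm()`. A minimal sketch (the true parameter values are assumed purely for illustration):

```r
set.seed(1)
x <- seq(-2, 2, length.out = 200)
# Assumed true parameters: beta0 = 1, beta1 = 2, beta2 = -0.5, beta3 = 0.1
y <- 1 + 2 * x - 0.5 * x^2 + 0.1 * x^3 + rnorm(200, sd = 0.3)

# I() protects the arithmetic so x^2 and x^3 enter as regressors
fit <- lm(y ~ x + I(x^2) + I(x^3))
coef(fit)   # four estimated parameters
```

The model is linear in \(\beta_0, \ldots, \beta_3\) even though it is a cubic function of x.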
Descriptive modelling: just interested in better understanding the problem under study.
Prediction: predict the value of Y that will result from particular values of the explanatory variables.
Parameter estimation: want to estimate interpretable model parameters.
Variable screening: want to investigate which explanatory variables have an effect on the response.
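For the prediction use above, a fitted model can be combined with `predict()`; the data here are simulated and the parameter values assumed, purely as a sketch:

```r
set.seed(7)
d <- data.frame(x = 1:50)
d$y <- 10 + 2 * d$x + rnorm(50, sd = 3)   # assumed true relationship

fit <- lm(y ~ x, data = d)
# Predict Y at a new value of the explanatory variable, with an interval
predict(fit, newdata = data.frame(x = 25), interval = "prediction")
```

The point prediction estimates \(E[Y]\) at x = 25; the interval reflects the additional variability of an individual response.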
Regression analyses can be used to examine the association between response and predictor variables.
Possible interpretations of association:
Causation: y depends causally on x;
Common Response: y does not depend causally on x; both y and x are related (perhaps causally) to lurking variable z;
Confounding: x is (strongly) associated with lurking variable z, so it is then impossible to tell whether y depends causally on x or z.
Establishing causation is not easy.
Use a carefully designed experiment.
If not possible to conduct an experiment, then the following questions should help:
Is the association between the variables strong?
Is the association consistent?
Are higher doses associated with stronger responses?
Do the alleged causes precede the effect in time?
Is the alleged cause plausible?
If the answer to all of these questions is yes, then a causal link seems probable. For example, a causal link between lung cancer and smoking is now fully accepted by scientists because the above criteria are satisfied.
Regression models seek to represent dependence of a response on explanatory variables.
This course focuses (primarily) on models with a particular linear form.
Typically we will assume that the response is normally distributed.
Linear regression models can be used for description, prediction, parameter estimation and variable screening.