Download the R markdown file for this lecture.
Linear regression models can be conveniently expressed using matrix notation.
In this lecture, we will see how results for linear models are more easily derived and understood using matrix notation than without it.
Note also that the matrix approach is what good statistical software carries out behind the scenes.
In matrix notation, a linear regression model is written \[\boldsymbol{y} = X {\boldsymbol{\beta}} + {\boldsymbol{\varepsilon}} \label{eq:matrixLM}\]
where \(\boldsymbol{y}\) is the \(n \times 1\) response vector, \(X\) is the \(n \times (p+1)\) design matrix, \({\boldsymbol{\beta}}\) is the vector of \(p+1\) regression parameters, and \({\boldsymbol{\varepsilon}}\) is the vector of \(n\) error terms.
\(\boldsymbol{y} = \left [ \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array} \right ]\), \(X = \left [ \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p}\\ 1 & x_{21} & \ldots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \ldots & x_{np} \end{array} \right ]\), \(\boldsymbol{\beta} = \left [ \begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_p\\ \end{array} \right ]\), and \(\boldsymbol{\varepsilon} = \left [ \begin{array}{c} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\\ \end{array} \right ]\)
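As a concrete illustration, the sketch below builds a design matrix in R for a small simulated data set (the variable names and values are assumptions made only for this example); `model.matrix()` adds the column of ones corresponding to the intercept.

```r
# Hypothetical illustration: build a design matrix with two predictors
set.seed(1)
n   <- 6
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# model.matrix() adds the leading column of 1s for the intercept
X <- model.matrix(~ x1 + x2, data = dat)
X
```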
The mean (expected) value of the random vector \(\boldsymbol{y}\) is \[\begin{aligned} \boldsymbol{\mu} &= E[\boldsymbol{y}]\\ &= \left [ \begin{array}{c} E[y_1] \\ E[y_2] \\ \vdots \\ E[y_n] \end{array} \right ] \\ &= E[X {\boldsymbol{\beta}} + {\boldsymbol{\varepsilon}}]\\ &= X {\boldsymbol{\beta}},\end{aligned}\] since \(E[{\boldsymbol{\varepsilon}}] = \boldsymbol{0}\).
For observed responses \(\boldsymbol{y}\) the sum of squared errors can be written as \[SS({\boldsymbol{\beta}}) = (\boldsymbol{y} - X {\boldsymbol{\beta}})^T (\boldsymbol{y} - X {\boldsymbol{\beta}}).\]
For a simple linear regression model the design matrix is \[X = \left [ \begin{array}{cc} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{array} \right ]\] If we observe responses \(\boldsymbol{y}\), the least squares estimate of \({\boldsymbol{\beta}}\) is: \[\begin{aligned} \hat{\boldsymbol{\beta}} &= (X^T X)^{-1} X^T \boldsymbol{y} = \left [ \begin{array}{cc} n & n \bar{x}\\ n\bar{x} & \sum_i x_i^2 \end{array} \right ]^{-1} \left [ \begin{array}{cccc} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{array} \right ] \left [ \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array} \right ] \\ &= \frac{1}{n s_{xx}} \left [ \begin{array}{cc} \sum_i x_i^2 & - n \bar{x}\\ - n\bar{x} & n \end{array} \right ] \left [ \begin{array}{c} n \bar{y}\\ \sum_i x_i y_i \end{array} \right ] = \frac{1}{s_{xx}} \left [ \begin{array}{c} \bar{y} \sum_i x_i^2 - \bar{x} \sum_i{x_i y_i} \\ s_{xy} \end{array} \right ]\end{aligned}\] where \(s_{xx} = \sum_i (x_i - \bar{x})^2\) and \(s_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})\).
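The matrix formula can be applied directly in R. The following sketch uses simulated data (an assumption made for illustration) and checks the result against the coefficients reported by `lm()`.

```r
# Simulated simple linear regression data (values chosen only for illustration)
set.seed(2)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

X <- cbind(1, x)                           # design matrix: intercept column and x
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solves (X'X) b = X'y, i.e. b = (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x))                            # lm() gives the same estimates
```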
The vector of fitted values is given by \(\hat{\boldsymbol{\mu}} = X \hat{\boldsymbol{\beta}} = X (X^TX)^{-1} X^T \boldsymbol{y}\) and the vector of residuals by \(\boldsymbol{e} = \boldsymbol{y} - \hat{\boldsymbol{\mu}}\).
The equation for the fitted values just given can be re-written as \(\hat{\boldsymbol{\mu}} = H \boldsymbol{y} = \hat{\boldsymbol{y}}\), where \(H = X (X^TX)^{-1} X^T\) is often called the hat matrix because it “puts hats on things”!
We have the equality \(\hat{\boldsymbol{y}} = \hat{\boldsymbol{\mu}}\) because the same fitted values serve both as predictions of the responses and as estimates of their means.
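Continuing the simulated example above, the hat matrix, fitted values and residuals can be computed directly from these matrix expressions (a sketch for illustration; `lm()` itself uses a numerically more stable QR decomposition internally).

```r
# Reusing X and y from the sketch above: hat matrix, fitted values, residuals
H <- X %*% solve(t(X) %*% X) %*% t(X)   # H = X (X'X)^{-1} X'
mu_hat <- H %*% y                       # fitted values: H "puts a hat on" y
e <- y - mu_hat                         # residuals

all.equal(as.vector(mu_hat), unname(fitted(lm(y ~ x))))   # TRUE
```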
The variance-covariance matrix (also called simply the covariance matrix, or the dispersion matrix) has variances down the diagonal and covariances off the diagonal.
It can be shown that for a constant matrix \(M\) and random vector \(\boldsymbol{z}\) (of appropriate dimensions) \[\mbox{Var} (M\boldsymbol{z}) = M \mbox{Var} (\boldsymbol{z}) M^T.\]
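This identity can also be checked empirically by simulation; in the sketch below the matrix \(M\) and the choice \(\mbox{Var}(\boldsymbol{z}) = I\) are arbitrary assumptions made for illustration.

```r
# Simulation check of Var(Mz) = M Var(z) M' with an arbitrary M and Var(z) = I
set.seed(3)
M <- matrix(c(1, 2, 0, 1, -1, 3), nrow = 2)   # a 2 x 3 constant matrix
Z <- matrix(rnorm(3 * 1e5), ncol = 3)         # each row is a draw of z ~ N(0, I)

cov(Z %*% t(M))   # sample covariance of M z ...
M %*% t(M)        # ... is close to M I M' = M M'
```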
The covariance matrix of \(\hat{\boldsymbol{\beta}}\) is
\[\begin{aligned} \mbox{Var}(\hat{\boldsymbol{\beta}}) = \mbox{Var}[ (X^TX)^{-1} X^T \boldsymbol{y} ] &= (X^TX)^{-1} X^T \mbox{Var}(\boldsymbol{y}) [(X^TX)^{-1} X^T]^T\\ &= (X^TX)^{-1} X^T \sigma^2 I [(X^TX)^{-1} X^T]^T\\ &= \sigma^2 (X^TX)^{-1} X^TX (X^T X)^{-1}\\ &= \sigma^2 (X^TX)^{-1}\end{aligned}\]
We can use \(\mbox{Var}(\boldsymbol{y}) = \sigma^2 I\) since the responses are independent (and hence uncorrelated) and each has variance \(\sigma^2\).
The variances express the variability of the estimators from sample to sample.
The covariances describe the inter-dependence of estimators.
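Continuing the simulated example, the sketch below computes \(\hat{\sigma}^2 (X^TX)^{-1}\), with \(\sigma^2\) replaced by the usual residual-based estimate, and compares it with the covariance matrix reported by `vcov()`.

```r
# Estimated covariance matrix of beta-hat for the simulated example above
fit <- lm(y ~ x)
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)   # usual estimate of sigma^2
sigma2_hat * solve(t(X) %*% X)                       # sigma^2-hat * (X'X)^{-1}
vcov(fit)                                            # agrees with the line above
```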