Collinearity and Orthogonality: measuring multicollinearity and variance inflation factors.
Consider the model \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.\]
We normally interpret the regression parameter \(\beta_1\) as the rate of change in \(Y\) per unit change in \(X_1\), with \(X_2\) held constant.
However, if \(X_1\) and \(X_2\) are strongly related, then this interpretation may not be valid (or even possible). In particular if \(X_1\) increases, \(X_2\) may increase at the same time. An example would be \(X_1\) = personal income and \(X_2\) = household income. The effect of \(X_1\) is expressed both directly (through \(\beta_1\)) and indirectly (through \(\beta_2\)). This complicates both the interpretation and the estimation of these parameters.
If \(X_1\) and \(X_2\) show a total lack of any linear relationship (i.e. they are uncorrelated), they are said to be orthogonal.
In practice regressors are rarely orthogonal unless the regression arises from a carefully designed experiment.
Nevertheless a little non-orthogonality is usually not important to the analysis, especially in view of all the other variability involved.
However, sometimes \(X_1\) and \(X_2\) are so closely related that the regression results are very hard to interpret. For example, including one variable in a model already containing the other can radically alter the regression parameters. The parameters may also be very sensitive to slight changes in the data and have large standard errors.
Severe non-orthogonality is referred to as collinear data or multicollinearity. Multicollinearity:
- can be difficult to detect;
- is not strictly a modelling error but a problem of inadequate data;
- requires caution in interpreting results.
So we address the following questions:
- How does multicollinearity affect inference and forecasting?
- How does one detect multicollinearity?
- How does one improve the analysis if multicollinearity is present?
We will consider the effects of multicollinearity in the TempA and
TempB data considered earlier. Recall the model is
\[Y = 50 + 5\times TempA + 5\times TempB +
\varepsilon ~~~~ (*) \]
Though you would not have noticed it, the TempA and TempB data were highly correlated.
tempA tempB Epsilon
1 16.5 16.9 -1.64960
2 17.1 16.8 -1.27669
3 15.0 15.8 0.45926
4 16.9 17.0 0.34253
5 21.2 20.6 -0.71849
6 18.2 17.8 -0.51455
7 19.3 19.9 -1.41754
8 16.7 16.0 -1.30384
9 16.3 16.3 0.40450
10 17.3 18.0 -0.18211
11 19.1 19.4 0.34204
12 20.8 20.1 1.25550
13 19.7 20.5 0.62556
14 17.7 18.6 -1.01822
15 16.8 16.4 -1.35425
16 15.9 16.6 0.81533
17 18.4 19.5 0.41898
18 21.3 20.9 -0.37490
19 19.7 19.6 -0.78441
20 17.0 16.3 0.20040
attach(Sim)
Y = 50 + 5 * tempA + 5 * tempB + Epsilon
library(scatterplot3d)
scatterplot3d(cbind(tempA, tempB, Y), pch = 16, type = "h")
This correlation between the predictors is what resulted in the high standard errors for each regression coefficient.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.764145 2.1367008 22.82217 3.425706e-14
tempA 4.761300 0.3383262 14.07311 8.481311e-11
tempB 5.289623 0.3500034 15.11306 2.748190e-11
Now suppose we keep exactly the same numbers TempA, TempB and Epsilon, but break the correlation between the TempA and TempB columns. As a simple way of spreading them out, I randomize the order of numbers in column TempB.
tempBrandomized = tempB[order(rnorm(20))]
Yrand = 50 + 5 * tempA + 5 * tempBrandomized + Epsilon
scatterplot3d(cbind(tempA, tempBrandomized, Yrand), pch = 16, type = "h")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.863862 2.8433492 16.83362 4.901709e-12
tempA 5.017845 0.1141485 43.95894 5.980158e-19
tempBrandomized 5.084165 0.1180882 43.05395 8.491809e-19
The standard errors of the coefficients are about a third of what they were.
Note that really the only thing we have done here is to break the correlation between TempA and TempB, but this has greatly improved the analysis.
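The size of this improvement is predictable. For two standardized predictors with correlation \(r\), each coefficient's variance is inflated by the factor \(1/(1-r^2)\) (the variance inflation factor discussed later), so the standard error grows by the square root of that. A minimal sketch in Python, assuming a hypothetical correlation of \(r = 0.95\), roughly the strength of the tempA/tempB relationship:

```python
import math

# Hypothetical correlation between two standardized predictors;
# r = 0.95 is an assumed value of roughly the tempA/tempB size.
r = 0.95

# Variance inflation factor for either coefficient.
vif = 1 / (1 - r**2)

# The standard error is inflated by sqrt(VIF).
print(round(math.sqrt(vif), 2))  # 3.2
```

An inflation factor of about 3 matches the factor-of-three drop in standard errors seen when the correlation was broken.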
Consider the model \[Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon\]
If \(X\) and \(Z\) are uncorrelated, they are called orthogonal. Then:
- the least squares estimates \(\hat\beta_1\) and \(\hat\beta_2\) are independent of each other;
- the order of adding the variables does not affect the sums of squares in the ANOVA table.
In other words it doesn't matter whether we talk about the effect of \(X\) by itself or "\(X\) after \(Z\)": the effect is the same. Interpretation of the parameters is then a lot easier.
But it usually only happens in designed experiments with balanced numbers.
In summary, orthogonality of the explanatory variables allows easier interpretation of the results of an experiment and is an important goal in the design of experiments.
Suppose we have three sets of numbers \(X,Z,Y\) such that \[Y = \beta_0 +\beta_1 X + \beta_2 Z +\varepsilon\]
but that, unknown to us, \[Z = \alpha_0 + \alpha_1 X,\] at least up to some level of rounding error.
Then the regression model could be rewritten in terms of any arbitrary combination of the X and Z figures, such as \[Y = \beta_0 + \beta_1 X + \beta_2 [k(\alpha_0 + \alpha_1 X) + (1-k)Z] + \varepsilon = \gamma_0(k) + \gamma_1(k) X + \gamma_2(k) Z + \varepsilon\] for any real number \(k\). In other words there are an infinite number of possible solutions, and if we do get a numerical answer for the regression coefficients, the particular answer may depend on something as trivial as rounding error. Different amounts of rounding could give very different answers.
We describe this situation by saying the regression model is not estimable.
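A quick numerical illustration of non-estimability (hypothetical numbers, with the exact dependence \(Z = 3 + 2X\), so \(\alpha_0 = 3\) and \(\alpha_1 = 2\)): trading \(\beta_2\) off against \(\beta_0\) and \(\beta_1\) along the constraint gives a different coefficient vector with exactly the same fitted values, so least squares has no unique answer.

```python
# Hypothetical data with an exact linear dependence Z = 3 + 2*X.
X = [1.0, 2.0, 3.0, 4.0]
Z = [3 + 2*x for x in X]
# "True" model: Y = 10 + 1*X + 1*Z (no noise, to keep it simple).
Y = [10 + x + z for x, z in zip(X, Z)]

def fitted(b0, b1, b2):
    return [b0 + b1*x + b2*z for x, z in zip(X, Z)]

# For any k, (10 + 3k, 1 + 2k, 1 - k) fits exactly as well, because
# beta2's contribution can be rerouted through Z = 3 + 2X.
k = 0.5
alt = fitted(10 + 3*k, 1 + 2*k, 1 - k)
print(max(abs(a - y) for a, y in zip(alt, Y)))  # 0.0
```

Every choice of \(k\) reproduces the data perfectly, which is exactly the infinite family of solutions described above.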
A consequence of such high correlation between the predictor variables is that the regression coefficients may become totally uninterpretable (for example something you know has a positive relationship with the response ends up with a negative coefficient) and the regression coefficients may have very high standard errors.
So we recognize a potential problem. How do we assess whether it is a real problem in our data?
We could look at the correlations between the explanatory variables. Since correlation measures linear relationship, if two variables are highly correlated it may not make sense to keep them both in the model.
Example. Consider the following data, Electric.csv, which consists of household electricity bills (Bill) along with the explanatory variables Income, Persons and Area.
Each of these explanatory variables would be expected to be related to the Bill, and yet they are also related to each other, so we may not be able to have them all in the same model.
Bill Income Persons Area
1 228 3220 2 1160
2 156 2750 1 1080
3 648 3620 2 1720
4 528 3940 1 1840
5 552 4510 3 2240
6 636 3990 4 2190
7 444 2430 1 830
8 144 3070 1 1150
9 744 3750 2 1570
10 1104 4790 5 2660
11 204 2490 1 900
12 420 3600 3 1680
13 876 5370 1 2550
14 840 3180 7 1770
15 876 5910 2 2960
16 276 3020 2 1190
17 1236 5920 3 3130
18 372 3520 2 1560
19 276 3720 1 1510
20 540 4840 1 2190
21 1044 4700 6 2620
22 552 3270 2 1350
23 756 4420 2 1990
24 636 4480 2 2070
25 708 3820 4 1850
26 960 5740 2 2700
27 1080 5600 3 3030
28 480 3950 2 1700
29 96 2290 3 890
30 1272 5580 5 3270
31 1056 5820 2 2660
32 156 3160 2 1330
33 396 2880 4 1280
34 768 3780 3 1950
Bill Income Persons Area
Bill 1.0000000 0.8368788 0.4941247 0.9050448
Income 0.8368788 1.0000000 0.1426486 0.9612801
Persons 0.4941247 0.1426486 1.0000000 0.3656021
Area 0.9050448 0.9612801 0.3656021 1.0000000
This indicates that the Area of the house is very strongly correlated with the Income of the occupants (\(r = 0.961\)), and so both may say much the same thing about the electricity bill. We may not be able to have both of these in the model.
By contrast the number of persons in the house is almost unrelated to the Income (r = 0.143) and only weakly related to the Area (r = 0.366), so Persons provides more-or-less independent information.
The following graph helps one visualise the relationships expressed by the correlations.
The aim of the analysis is to explain electricity consumption (‘Bill’) in terms of household ‘Income’, ‘Persons’ and ‘Area’.
Call:
lm(formula = Bill ~ Area + Persons + Income)
Residuals:
Min 1Q Median 3Q Max
-223.36 -102.88 -8.53 78.61 331.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -358.44157 198.73583 -1.804 0.0813 .
Area 0.28110 0.22611 1.243 0.2234
Persons 55.08763 29.04515 1.897 0.0675 .
Income 0.07514 0.13609 0.552 0.5850
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 135.4 on 30 degrees of freedom
Multiple R-squared: 0.8514, Adjusted R-squared: 0.8365
F-statistic: 57.28 on 3 and 30 DF, p-value: 1.585e-12
Note (fact 1) the regression as a whole is highly significant.
However (fact 2) none of the regression coefficients in the model is significant.
So these two facts together suggest multicollinearity.
Note that by themselves both Income and Area are positively related to Bill and have small standard errors (see the individual fits below), but put together one predictor changes its sign and the standard errors are much larger.
This illustrates how multicollinearity can render the regression coefficients quite misleading.
Call:
lm(formula = Bill ~ Area)
Residuals:
Min 1Q Median 3Q Max
-216.58 -102.40 -30.67 106.13 292.44
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -211.64863 73.36173 -2.885 0.00695 **
Area 0.43760 0.03635 12.037 2.02e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 144.7 on 32 degrees of freedom
Multiple R-squared: 0.8191, Adjusted R-squared: 0.8135
F-statistic: 144.9 on 1 and 32 DF, p-value: 2.02e-13
Call:
lm(formula = Bill ~ Income)
Residuals:
Min 1Q Median 3Q Max
-288.36 -117.01 -46.72 134.27 441.57
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -425.16251 124.92953 -3.403 0.00181 **
Income 0.25899 0.02995 8.649 6.97e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 186.2 on 32 degrees of freedom
Multiple R-squared: 0.7004, Adjusted R-squared: 0.691
F-statistic: 74.8 on 1 and 32 DF, p-value: 6.974e-10
Call:
lm(formula = Bill ~ Income + Area)
Residuals:
Min 1Q Median 3Q Max
-221.3 -112.4 -19.8 86.4 297.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -52.23835 120.64872 -0.433 0.668
Income -0.13498 0.08229 -1.640 0.111
Area 0.64033 0.12857 4.981 2.27e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141 on 31 degrees of freedom
Multiple R-squared: 0.8336, Adjusted R-squared: 0.8228
F-statistic: 77.62 on 2 and 31 DF, p-value: 8.507e-13
One solution to multicollinearity is to delete one of the variables. However it can be difficult to decide which one to delete. For example suppose we particularly want to quantify the effect of Persons on the electricity bill:
Call:
lm(formula = Bill ~ Area + Persons)
Residuals:
Min 1Q Median 3Q Max
-223.77 -101.97 -13.34 83.17 322.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -255.94923 70.14230 -3.649 0.000959 ***
Area 0.40429 0.03615 11.183 2.08e-12 ***
Persons 42.03394 16.67947 2.520 0.017096 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 133.9 on 31 degrees of freedom
Multiple R-squared: 0.8499, Adjusted R-squared: 0.8402
F-statistic: 87.74 on 2 and 31 DF, p-value: 1.72e-13
Call:
lm(formula = Bill ~ Income + Persons)
Residuals:
Min 1Q Median 3Q Max
-220.49 -105.87 -22.33 85.43 345.76
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -575.4106 95.9010 -6.000 1.23e-06 ***
Income 0.2421 0.0222 10.905 3.89e-12 ***
Persons 85.3352 16.0031 5.332 8.28e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 136.6 on 31 degrees of freedom
Multiple R-squared: 0.8437, Adjusted R-squared: 0.8336
F-statistic: 83.68 on 2 and 31 DF, p-value: 3.204e-13
Should we estimate that each extra person adds 42 to the electricity bill, or 85? Of course the interpretation of these regression models differs slightly (in the first model it is the effect of each additional person among households with homes of the same area, while in the second it is the effect among households of the same income). But it still makes one uneasy to see such a big difference.
More generally (more than just looking at individual correlations) we could look at whether any of the variables are strongly predicted by a combination of the other regressors.
i.e. whether the \(R^2\) for regressing variable \(X_i\) on the other regressors \(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p\) is large. We call this \(R^2_i\).
If it is large then we probably don’t need \(X_i\) (or some other variables) in the model.
This is more general than just looking at simple correlations between variables, since the regression model picks up on combinations of variables.
A “rule of thumb” is that multicollinearity is a big problem if \(R_i^2 > 0.9\), i.e. the \(i\)th variable is more than 90% explained by the other predictors.
We can get \(R_i^2\) by doing the regressions ourselves, but there is an easier way.
A major impact of multicollinearity is found in the variances (and hence standard errors) of the regression coefficients.
1. If \(X_i\) is the only predictor in the model then \[\hat{\mbox{var}}(\hat\beta_i) = \frac{S^2}{\sum_j{(X_{ij} - \bar X_i)^2}}.\]
2. If \(X_i\) is in a model with other predictors then \[\hat{\mbox{var}}(\hat\beta_i) = S^2 \left(X^T X\right)^{-1}_{ii},\] where the last term is the corresponding diagonal entry of the variance-covariance matrix \(S^2 (X^T X)^{-1}\). This is always \(\ge\) the variance in case 1.
The ratio of these variances is called the Variance Inflation Factor (VIF). It turns out that \[VIF_i = \frac{1}{1- R^2_i}\] so we get \(VIF_i \ge 10\) if \(R_i^2 \ge 0.9\).
In conclusion, if the VIF is large, e.g. \(VIF_i > 10\), then multicollinearity is a big problem. On the other hand if the VIF is around 1, then the variable \(X_i\) is pretty much orthogonal to the other variables.
To obtain VIFs we need to invoke library(car).
Area Persons Income
44.141264 3.421738 39.035444
Clearly we need to do something to reduce the multicollinearity.
Note: if a regression model includes a factor, then vif() defaults to a generalised VIF. We will interpret it the same way.
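The VIFs can also be checked by hand: \(VIF_i\) equals the \(i\)th diagonal entry of the inverse of the predictors' correlation matrix. A sketch in Python, using only the three pairwise correlations printed earlier for the Electric data:

```python
# Pairwise correlations among the predictors, copied from the
# correlation matrix printed above.
r12 = 0.1426486  # Income-Persons
r13 = 0.9612801  # Income-Area
r23 = 0.3656021  # Persons-Area

# Determinant of the 3x3 correlation matrix of (Income, Persons, Area).
det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23

# Diagonal entries of the inverse matrix: these are the VIFs.
vif_income  = (1 - r23**2) / det
vif_persons = (1 - r13**2) / det
vif_area    = (1 - r12**2) / det

print(round(vif_area, 2), round(vif_persons, 2), round(vif_income, 2))
# 44.14 3.42 39.04, matching car::vif()
```

This agrees with the vif() output above to rounding accuracy.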
The following example illustrates multicollinearity in a context where it is not immediately visible. Chatterjee and Price discuss a dataset relating Sales to A (advertising expenditure), A1 (last year’s advertising), P (promotion expenditure), P1 (last year’s promotions expenditure) and other variables. The analyst eventually found out that an approximate budgetary constraint
\[ A + A1 + P + P1 = 5 \] was hidden in the data. The constraint was known to the advertising executives who had commissioned the study but it had not been mentioned to the statistical consultant who was given the data to analyze.
Annual Data on Advertising, Promotion, and Sales Expenses, and Sales:
- A = advertising expenditure (at time t)
- A1 = lagged advertising, i.e. at time t-1
- P = promotion expenditure (at time t)
- P1 = lagged promotion expenditure
- SE = sales expense
- S = aggregate sales
A P SE A1 P1 S
1 1.98786 1.0 0.30 2.01722 0.0 20.1137
2 1.94418 0.0 0.30 1.98786 1.0 15.1044
3 2.19954 0.8 0.35 1.94418 0.0 18.6838
4 2.00107 0.0 0.35 2.19954 0.8 16.0517
5 1.69292 1.3 0.30 2.00107 0.0 21.3010
6 1.74334 0.3 0.32 1.69292 1.3 17.8500
7 2.06907 1.0 0.31 1.74334 0.3 18.8756
8 1.01709 1.0 0.41 2.06907 1.0 21.2660
9 2.01906 0.9 0.45 1.01709 1.0 20.4847
10 1.06139 1.0 0.45 2.01906 0.9 20.5403
11 1.45999 1.5 0.50 1.06139 1.0 26.1844
12 1.87511 0.0 0.60 1.45999 1.5 21.7161
13 2.27109 0.8 0.65 1.87511 0.0 28.6959
14 1.11191 1.0 0.65 2.27109 0.8 25.8372
15 1.77407 1.2 0.65 1.11191 1.0 29.3199
16 0.95878 1.0 0.65 1.77407 1.2 24.1904
17 1.98930 1.0 0.62 0.95878 1.0 26.5897
18 1.97111 0.0 0.60 1.98930 1.0 22.2447
19 2.26603 0.7 0.60 1.97111 0.0 24.7994
20 1.98346 0.1 0.61 2.26603 0.7 21.1910
21 2.10054 1.0 0.60 1.98346 0.1 26.0344
22 1.06815 1.0 0.58 2.10054 1.0 27.3930
Call:
lm(formula = S ~ A + P + SE + A1 + P1)
Residuals:
Min 1Q Median 3Q Max
-1.8601 -0.9847 0.1323 0.7017 2.2046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.194 18.715 -0.758 0.4592
A 5.361 4.028 1.331 0.2019
P 8.372 3.586 2.334 0.0329 *
SE 22.521 2.142 10.512 1.36e-08 ***
A1 3.855 3.578 1.077 0.2973
P1 4.125 3.895 1.059 0.3053
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.32 on 16 degrees of freedom
Multiple R-squared: 0.9169, Adjusted R-squared: 0.8909
F-statistic: 35.3 on 5 and 16 DF, p-value: 4.289e-08
We see that, apart from Sales Expense (and a marginally significant effect of current promotions), there is only weak evidence of a relationship between advertising, promotions and sales.
A P SE A1 P1
36.941513 33.473514 1.075962 25.915651 43.520965
The VIF shows high multicollinearity for all advertising and promotions data. If we drop one of the variables the others change their coefficients unpredictably depending on which variable we drop.
For example, dropping A:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.5094338 2.4575979 4.2763032 5.103623e-04
P 3.7017455 0.7571140 4.8892842 1.382315e-04
SE 22.7941649 2.1803443 10.4543880 8.040607e-09
A1 -0.7692144 0.8745617 -0.8795427 3.913708e-01
P1 -0.9686584 0.7422511 -1.3050279 2.092725e-01
By contrast, dropping A1:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.76106517 2.6971385 2.13599157 4.752049e-02
A 1.14726708 0.9674889 1.18581942 2.520041e-01
P 4.61226335 0.8298268 5.55810393 3.469186e-05
SE 22.71199425 2.1450683 10.58800520 6.658248e-09
P1 0.03145725 0.8623592 0.03647813 9.713260e-01
Suppose we look at the relationships among the explanatory variables themselves, by regressing A on the others:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.60813189 0.14474201 31.8368663 1.352575e-16
P -0.87125259 0.04459078 -19.5388520 4.380113e-13
SE 0.05095381 0.12841295 0.3967965 6.964545e-01
A1 -0.86252031 0.05150795 -16.7453834 5.334057e-12
P1 -0.95013832 0.04371542 -21.7346283 7.654334e-14
This is approximately \(A = 5 - P - A1 - P1\), as in the hidden constraint.
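We can also check the budget constraint directly from the data table: for each year, \(A + P + A1 + P1\) should be close to 5. A sketch using just the first few rows of the table above:

```python
# (A, P, A1, P1) for the first four rows of the table above.
rows = [
    (1.98786, 1.0, 2.01722, 0.0),
    (1.94418, 0.0, 1.98786, 1.0),
    (2.19954, 0.8, 1.94418, 0.0),
    (2.00107, 0.0, 2.19954, 0.8),
]
totals = [a + p + a1 + p1 for a, p, a1, p1 in rows]
print([round(t, 2) for t in totals])  # each total is close to 5
```

Each yearly total sits within about 0.07 of 5, which is why the multicollinearity is so severe.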
As far as remedies for multicollinearity are concerned, we have a few options.
1. We can remove one (or more) of the correlated variables from the regression. Our aim is to fix the problem, but sometimes we have some choice of which variable to remove, in which case we would like the model to be as interpretable as possible. Sometimes the choice is between a variable with many missing values and one with few, and so we choose the latter if it allows us a bigger effective sample size \(n\).
2. We could construct another variable which is a combination of the two most highly correlated variables. It might be an ordinary average \((X_1+X_2)/2\), an average of standardized versions of the \(X_i\)s, or some other linear combination. Sometimes subtracting or dividing one variable by another will help. For example the incidence of a particular disease such as obesity may be higher in countries with high GDP and high Population. But GDP depends on the size of the economy, which is correlated with the Population. Dividing gives us per capita GDP, which is a better indicator of average wealth and hence may be more related to disease incidence.
3. (Not examinable.) Those who have studied multivariate statistics may find it useful to construct principal components or do a factor analysis of a set of correlated variables. Dropping less important components leads to a class of models called biased regression. Another form of biased regression shrinks the regression coefficients towards zero, a method called shrinkage estimation.
4. (Not examinable.) It is possible to fit regression models that impose linear constraints on the regression coefficients (such as forcing them to add to 100%). Such restricted least squares is also beyond the scope of this course. Therefore for this course we focus on method 1, dropping variables.
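The combined-variable idea (the second option above) can be sketched as follows. The numbers are just the first few Income and Area values from the Electric data, and the standardize helper is a hypothetical name, not from any library:

```python
def standardize(v):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m = sum(v) / len(v)
    s = (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5
    return [(x - m) / s for x in v]

# First few Income and Area values from Electric.csv (illustrative only).
income = [3220, 2750, 3620, 3940, 4510]
area   = [1160, 1080, 1720, 1840, 2240]

# Replace the two correlated predictors by a single combined variable:
# the average of their standardized versions.
combined = [(a + b) / 2 for a, b in zip(standardize(income), standardize(area))]
print(abs(sum(combined)) < 1e-9)  # True: combined variable is mean-zero
```

Standardizing first matters because Income and Area are on very different scales; a plain average would be dominated by the larger-scaled variable.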