Lecture 32 Multicollinearity

In class version

library(scatterplot3d)
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some

Collinearity and Orthogonality, Measuring multicollinearity, variance inflation factors.

32.1 Idea of Collinearity and Orthogonality

Consider the model \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.\]

We normally interpret the regression parameter \(\beta_1\) as the rate of change in \(Y\) per unit change in \(X_1\), with \(X_2\) held constant.

However, if \(X_1\) and \(X_2\) are strongly related, then this interpretation may not be valid (or even possible). In particular, if \(X_1\) increases, \(X_2\) may increase at the same time. An example would be \(X_1 =\) personal income and \(X_2 =\) household income. The effect of \(X_1\) is then expressed both directly (through \(\beta_1\)) and indirectly (through \(\beta_2\)). This complicates both the interpretation and the estimation of these parameters.

Now if \(X_1\) and \(X_2\) have no linear relationship at all (i.e. they are uncorrelated), then they are said to be orthogonal.

In practice regressors are rarely orthogonal unless the regression arises from a carefully designed experiment.

Nevertheless a little non-orthogonality is usually not important to the analysis, especially in view of all the other variability involved.

However sometimes \(X_1\) and \(X_2\) are so closely related that the regression results are very hard to interpret. For example inclusion of one variable into a model already containing the other variable can radically alter the regression parameters. The parameters may also be very sensitive to slight changes in the data and have large standard errors.

Severe non-orthogonality is referred to as collinear data or multicollinearity. This

  • can be difficult to detect

  • is not strictly a modeling error but a problem of inadequate data

  • requires caution in interpreting results.

So we address the following questions: How does multicollinearity affect inference and forecasting? How does one detect multicollinearity? How does one improve the analysis if multicollinearity is present?

32.2 A Numerical Example

We will consider the effects of multicollinearity in the TempA and TempB data considered earlier. Recall the model is
\[Y = 50 + 5\times TempA + 5\times TempB + \varepsilon ~~~~ (*) \]

Though you would not have noticed it, the TempA and TempB data were highly correlated.

Download Simulation1.csv

## Sim = read.csv("Simulation1.csv", header = TRUE)
Sim
   tempA tempB  Epsilon
1   16.5  16.9 -1.64960
2   17.1  16.8 -1.27669
3   15.0  15.8  0.45926
4   16.9  17.0  0.34253
5   21.2  20.6 -0.71849
6   18.2  17.8 -0.51455
7   19.3  19.9 -1.41754
8   16.7  16.0 -1.30384
9   16.3  16.3  0.40450
10  17.3  18.0 -0.18211
11  19.1  19.4  0.34204
12  20.8  20.1  1.25550
13  19.7  20.5  0.62556
14  17.7  18.6 -1.01822
15  16.8  16.4 -1.35425
16  15.9  16.6  0.81533
17  18.4  19.5  0.41898
18  21.3  20.9 -0.37490
19  19.7  19.6 -0.78441
20  17.0  16.3  0.20040
attach(Sim)
Y = 50 + 5 * tempA + 5 * tempB + Epsilon

library(scatterplot3d)
scatterplot3d(cbind(tempA, tempB, Y), pch = 16, type = "h")

(Figure: 3D scatterplot of Y against tempA and tempB.)

This high correlation is what produced the large standard errors seen below for each regression coefficient.

lm.cor = lm(Y ~ tempA + tempB)
summary(lm.cor)$coef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 48.764145  2.1367008 22.82217 3.425706e-14
tempA        4.761300  0.3383262 14.07311 8.481311e-11
tempB        5.289623  0.3500034 15.11306 2.748190e-11

Now suppose we keep exactly the same numbers TempA, TempB and Epsilon, but break the correlation between the TempA and TempB columns. As a simple way of spreading them out, I randomize the order of numbers in column TempB.

tempBrandomized = tempB[order(rnorm(20))]
Yrand = 50 + 5 * tempA + 5 * tempBrandomized + Epsilon
scatterplot3d(cbind(tempA, tempBrandomized, Yrand), pch = 16, type = "h")

(Figure: 3D scatterplot of Yrand against tempA and tempBrandomized.)

lm.uncor = lm(Yrand ~ tempA + tempBrandomized)
summary(lm.uncor)$coef
                 Estimate Std. Error  t value     Pr(>|t|)
(Intercept)     52.528341  3.5458633 14.81398 3.773542e-11
tempA            4.981583  0.1179155 42.24705 1.168037e-18
tempBrandomized  4.863221  0.1219853 39.86726 3.100717e-18

The standard errors of the two slope coefficients are now about 1/3 of what they were.

Note that really the only thing we have done here is to break the correlation between TempA and TempB, but this has greatly improved the analysis.
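A quick check of both claims, using the objects already fitted above (a minimal sketch; the exact numbers vary from run to run because the shuffle of tempB is random):

cor(tempA, tempB)              # high: the two temperature columns track each other closely
cor(tempA, tempBrandomized)    # typically near 0 once the order has been shuffled

# ratio of the slope standard errors: uncorrelated fit / correlated fit
summary(lm.uncor)$coef[2:3, "Std. Error"] / summary(lm.cor)$coef[2:3, "Std. Error"]
# both ratios come out at roughly 1/3 for the run shown above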

32.3 Orthogonal Data

Consider the model \[Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon\]

Recall that if \(X\) and \(Z\) are uncorrelated, they are called orthogonal. Then

  • the least squares estimates \(\hat\beta_1\) and \(\hat\beta_2\) are independent of each other

  • the order of adding the variables does not affect the sums of squares in the ANOVA table.

In other words it doesn’t matter whether we talk about the effect of \(X\) by itself or “\(X\) after \(Z\)”. The effect is the same.

  • Interpretation of the parameters is a lot easier.

  • But orthogonality usually only occurs in designed experiments with balanced numbers of observations.

In summary, orthogonality of the explanatory variables allows easier interpretation of the results of an experiment and is an important goal in the design of experiments.
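The order-of-entry property is easy to demonstrate with a small simulation. The sketch below is not part of the lecture data: x, z and y are simulated, and z is constructed to be exactly uncorrelated with x.

set.seed(1)
x = rnorm(40)
z = residuals(lm(rnorm(40) ~ x))   # residuals are exactly uncorrelated with x
y = 2 + 3 * x + 4 * z + rnorm(40)

anova(lm(y ~ x + z))               # sequential sums of squares for x, then z
anova(lm(y ~ z + x))               # the same sums of squares, entered in the other order

coef(lm(y ~ x))["x"]               # estimate of the x effect on its own
coef(lm(y ~ x + z))["x"]           # identical estimate after adding the orthogonal z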

32.4 Collinear data

Suppose we have three sets of numbers \(X,Z,Y\) such that \[Y = \beta_0 +\beta_1 X + \beta_2 Z +\varepsilon\]

but that, unknown to us, \[Z = \alpha_0 + \alpha_1 X,\] at least up to some level of rounding error.

Then the regression model could be rewritten in terms of any arbitrary combination of the X and Z figures, such as

\[Y = \beta_0 +\beta_1 X + \beta_2 [k(\alpha_0 + \alpha_1 X) + (1-k)Z] +\varepsilon\] \[= \gamma_0(k) +\gamma_1(k) X + \gamma_2(k) Z +\varepsilon\] for any real number \(k\). In other words there are an infinite number of possible solutions, and if we do get a numerical answer for the regression coefficients, the particular answer may depend on something as trivial as rounding error. Different amounts of rounding could give very different answers.
We describe this situation by saying the regression model is not estimable.

A consequence of such high correlation between the predictor variables is that the regression coefficients may become totally uninterpretable (for example, a variable you know has a positive relationship with the response may end up with a negative coefficient) and may have very high standard errors.
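R makes the “not estimable” situation easy to see. The minimal sketch below uses simulated data (x, z, y and z2 are illustrative names, not lecture data): z is an exact linear function of x, so the model matrix is singular and lm() reports an NA coefficient; adding a tiny amount of noise to z makes the model estimable again, but with very large standard errors.

set.seed(2)
x = rnorm(30)
z = 1 + 2 * x                  # exact linear relationship: z = alpha0 + alpha1 * x
y = 10 + 3 * x + 5 * z + rnorm(30)

coef(lm(y ~ x + z))            # the coefficient of z is NA: the model is not estimable

z2 = z + rnorm(30, sd = 0.01)  # break the exact relationship very slightly
summary(lm(y ~ x + z2))$coef   # now estimable, but with very large standard errors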

32.5 Measuring Multicollinearity

So we recognize a potential problem. How do we assess whether it is a real problem in our data?

32.5.1 Correlations Among The Predictors

We could look at the correlations between the explanatory variables. Since correlation is a measure of linear relationship, if two variables are highly correlated then it doesn’t make much sense to keep them both in the model.

Example. Consider the following data. The dataset, called Electric.csv, consists of household electricity bills, along with explanatory variables:

  • the income of the household;
  • the number of persons in the house; and
  • the area of the house.

Each of these explanatory variables would be expected to be related to the Bill, and yet they are also related to each other, so we may not be able to have them all in the same model.

Download Electric.csv

## electric = read.csv("Electric.csv", header = TRUE)
electric
   Bill Income Persons Area
1   228   3220       2 1160
2   156   2750       1 1080
3   648   3620       2 1720
4   528   3940       1 1840
5   552   4510       3 2240
6   636   3990       4 2190
7   444   2430       1  830
8   144   3070       1 1150
9   744   3750       2 1570
10 1104   4790       5 2660
11  204   2490       1  900
12  420   3600       3 1680
13  876   5370       1 2550
14  840   3180       7 1770
15  876   5910       2 2960
16  276   3020       2 1190
17 1236   5920       3 3130
18  372   3520       2 1560
19  276   3720       1 1510
20  540   4840       1 2190
21 1044   4700       6 2620
22  552   3270       2 1350
23  756   4420       2 1990
24  636   4480       2 2070
25  708   3820       4 1850
26  960   5740       2 2700
27 1080   5600       3 3030
28  480   3950       2 1700
29   96   2290       3  890
30 1272   5580       5 3270
31 1056   5820       2 2660
32  156   3160       2 1330
33  396   2880       4 1280
34  768   3780       3 1950
attach(electric)
cor(electric)
             Bill    Income   Persons      Area
Bill    1.0000000 0.8368788 0.4941247 0.9050448
Income  0.8368788 1.0000000 0.1426486 0.9612801
Persons 0.4941247 0.1426486 1.0000000 0.3656021
Area    0.9050448 0.9612801 0.3656021 1.0000000

This indicates that the Area of the house is very strongly correlated with the Income of the occupants (\(r = 0.961\)), and so both may say much the same thing about the electricity bill. We may not be able to have both of these in the model.

By contrast the number of persons in the house is almost unrelated (r = 0.143) to the Income and only weakly related to the Area (r = 0.366) so persons provides more-or-less independent information.

The following graph helps one visualise the relationships expressed by the correlations.

pairs(electric)

(Figure: pairs plot of the electric data.)

The question is:

32.5.2 How do we know this is causing real problems?

The aim of the analysis is to explain electricity consumption (‘Bill’) in terms of household ‘Income’, ‘Persons’ and ‘Area’.

electric.lm0 = lm(Bill ~ Area + Persons + Income)
summary(electric.lm0)

Call:
lm(formula = Bill ~ Area + Persons + Income)

Residuals:
    Min      1Q  Median      3Q     Max 
-223.36 -102.88   -8.53   78.61  331.46 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -358.44157  198.73583  -1.804   0.0813 .
Area           0.28110    0.22611   1.243   0.2234  
Persons       55.08763   29.04515   1.897   0.0675 .
Income         0.07514    0.13609   0.552   0.5850  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 135.4 on 30 degrees of freedom
Multiple R-squared:  0.8514,    Adjusted R-squared:  0.8365 
F-statistic: 57.28 on 3 and 30 DF,  p-value: 1.585e-12

Note (fact 1) the regression as a whole is highly significant.

However (fact 2) none of the regression coefficients in the model is significant.

So these two facts together suggest multicollinearity.

32.5.3 Contradictory Regression Coefficients

Note that, in the three fits below, Area and Income are each positively related to Bill when fitted on their own, and each has a small standard error. But when they are put together, Income changes sign and the standard errors are much larger.

This illustrates how multicollinearity can render the regression coefficients quite misleading.

elec.lm1 = lm(Bill ~ Area)
summary(elec.lm1)

Call:
lm(formula = Bill ~ Area)

Residuals:
    Min      1Q  Median      3Q     Max 
-216.58 -102.40  -30.67  106.13  292.44 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -211.64863   73.36173  -2.885  0.00695 ** 
Area           0.43760    0.03635  12.037 2.02e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 144.7 on 32 degrees of freedom
Multiple R-squared:  0.8191,    Adjusted R-squared:  0.8135 
F-statistic: 144.9 on 1 and 32 DF,  p-value: 2.02e-13
elec.lm2 = lm(Bill ~ Income)
summary(elec.lm2)

Call:
lm(formula = Bill ~ Income)

Residuals:
    Min      1Q  Median      3Q     Max 
-288.36 -117.01  -46.72  134.27  441.57 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -425.16251  124.92953  -3.403  0.00181 ** 
Income         0.25899    0.02995   8.649 6.97e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 186.2 on 32 degrees of freedom
Multiple R-squared:  0.7004,    Adjusted R-squared:  0.691 
F-statistic:  74.8 on 1 and 32 DF,  p-value: 6.974e-10
elec.lm3 = lm(Bill ~ Income + Area)
summary(elec.lm3)

Call:
lm(formula = Bill ~ Income + Area)

Residuals:
   Min     1Q Median     3Q    Max 
-221.3 -112.4  -19.8   86.4  297.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -52.23835  120.64872  -0.433    0.668    
Income       -0.13498    0.08229  -1.640    0.111    
Area          0.64033    0.12857   4.981 2.27e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141 on 31 degrees of freedom
Multiple R-squared:  0.8336,    Adjusted R-squared:  0.8228 
F-statistic: 77.62 on 2 and 31 DF,  p-value: 8.507e-13

One solution to multicollinearity is to delete one of the variables. However it can be difficult to decide which one to delete. For example suppose we particularly want to quantify the effect of Persons on the electricity bill:

elec.lm4 = lm(Bill ~ Area + Persons)
summary(elec.lm4)

Call:
lm(formula = Bill ~ Area + Persons)

Residuals:
    Min      1Q  Median      3Q     Max 
-223.77 -101.97  -13.34   83.17  322.35 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -255.94923   70.14230  -3.649 0.000959 ***
Area           0.40429    0.03615  11.183 2.08e-12 ***
Persons       42.03394   16.67947   2.520 0.017096 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 133.9 on 31 degrees of freedom
Multiple R-squared:  0.8499,    Adjusted R-squared:  0.8402 
F-statistic: 87.74 on 2 and 31 DF,  p-value: 1.72e-13
elec.lm5 = lm(Bill ~ Income + Persons)
summary(elec.lm5)

Call:
lm(formula = Bill ~ Income + Persons)

Residuals:
    Min      1Q  Median      3Q     Max 
-220.49 -105.87  -22.33   85.43  345.76 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -575.4106    95.9010  -6.000 1.23e-06 ***
Income         0.2421     0.0222  10.905 3.89e-12 ***
Persons       85.3352    16.0031   5.332 8.28e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 136.6 on 31 degrees of freedom
Multiple R-squared:  0.8437,    Adjusted R-squared:  0.8336 
F-statistic: 83.68 on 2 and 31 DF,  p-value: 3.204e-13

Should we estimate that each extra person adds 42 to the electric bill, or 85 to the bill? Of course the interpretation of these regression models differs slightly (in the first model it is the effect of each additional person among households with homes of the same area, while in the second it is the effect of each additional person among households with the same income). But it still makes one uneasy to see such a big difference.

32.5.4 Regressing The Predictors Among Themselves

More generally (more than just looking at individual correlations) we could look at whether any of the variables are strongly predicted by a combination of the other regressors.

i.e. whether the \(R^2\) for regressing variable \(X_i\) on the other regressors \(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p\) is large. We call this \(R^2_i\).

If it is large then we probably don’t need \(X_i\) (or some other variables) in the model.

This is more general than just looking at simple correlations between variables, since the regression model picks up on combinations of variables.

A “rule of thumb” is that multicollinearity is a big problem if \(R_i^2 > 0.9\), i.e. the \(i\)th variable is more than 90% explained by the other predictors.

We can get \(R_i^2\) by doing the regressions ourselves, but there is an easier way.

32.5.5 Variance Inflation Factors, VIFs

A major impact of multicollinearity is found in the variances (and hence standard errors) of the regression coefficients.

  1. If \(X_i\) is alone in the model \[Y = \beta_0 + \beta_1 X_i + \varepsilon,\] then \[\hat{\mbox{var}}(\hat\beta_1) = \frac{S^2}{\sum_j (X_{ij} - \bar X_i)^2}.\]

  2. If \(X_i\) is in a model with other predictors, then \[\hat{\mbox{var}}(\hat\beta_i) = S^2 \left(X^T X\right)^{-1}_{ii},\] i.e. the corresponding diagonal entry of the variance-covariance matrix \(S^2 (X^T X)^{-1}\). This is always \(\ge\) the single-predictor \(\hat{\mbox{var}}(\hat\beta_1)\) in 1.

The ratio of these variances is called the Variance Inflation Factor (VIF). It turns out that \[VIF_i = \frac{1}{1- R^2_i}\] so we get \(VIF_i \ge 10\) if \(R_i^2 \ge 0.9\).

In conclusion, if the VIF is large, e.g. \(VIF_i > 10\), then multicollinearity is a big problem. On the other hand, if the VIF is around 1, then the variable \(X_i\) is pretty much orthogonal to the other variables.
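Before using the shortcut in the car package, we can check the relationship \(VIF_i = 1/(1-R^2_i)\) by hand for one predictor. The sketch below uses the electric data already attached (R2.Area is just an illustrative name); the result should agree with the value for Area in the vif() output shown below.

R2.Area = summary(lm(Area ~ Persons + Income))$r.squared
R2.Area             # R_i^2 for Area: well above the 0.9 rule of thumb
1 / (1 - R2.Area)   # hand-computed VIF for Area: compare with vif() below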

To obtain VIFs we need to invoke library(car).

library(car)
vif(electric.lm0)
     Area   Persons    Income 
44.141264  3.421738 39.035444 

Clearly we need to do something to reduce the multicollinearity.

Note: if a regression model includes a factor, then vif() defaults to a generalised VIF. We will interpret it the same way.

32.6 Example: SAP Data

The following example illustrates multicollinearity in a context where it is not immediately visible. Chatterjee and Price discuss a dataset relating Sales to A (advertising expenditure), A1 (last year’s advertising), P (promotion expenditure), P1 (last year’s promotions expenditure) and other variables. The analyst eventually found out that an approximate budgetary constraint

\[ A + A1 + P + P1 = 5 \] was hidden in the data. The constraint was known to the advertising executives who had commissioned the study but it had not been mentioned to the statistical consultant who was given the data to analyze.

Annual Data on Advertising, Promotion, and Sales Expenses, and Sales:

A= advertising expenditure (at time t)

A1 denotes lagged values, i.e. at time t-1.

P denotes promotion expenditures (at time t)

P1 denotes the lagged values.

SE denotes sales expense and

S the aggregate sales.

Download SAP.csv

## SAPdata = read.csv("SAP.csv", header = TRUE)
SAPdata
         A   P   SE      A1  P1       S
1  1.98786 1.0 0.30 2.01722 0.0 20.1137
2  1.94418 0.0 0.30 1.98786 1.0 15.1044
3  2.19954 0.8 0.35 1.94418 0.0 18.6838
4  2.00107 0.0 0.35 2.19954 0.8 16.0517
5  1.69292 1.3 0.30 2.00107 0.0 21.3010
6  1.74334 0.3 0.32 1.69292 1.3 17.8500
7  2.06907 1.0 0.31 1.74334 0.3 18.8756
8  1.01709 1.0 0.41 2.06907 1.0 21.2660
9  2.01906 0.9 0.45 1.01709 1.0 20.4847
10 1.06139 1.0 0.45 2.01906 0.9 20.5403
11 1.45999 1.5 0.50 1.06139 1.0 26.1844
12 1.87511 0.0 0.60 1.45999 1.5 21.7161
13 2.27109 0.8 0.65 1.87511 0.0 28.6959
14 1.11191 1.0 0.65 2.27109 0.8 25.8372
15 1.77407 1.2 0.65 1.11191 1.0 29.3199
16 0.95878 1.0 0.65 1.77407 1.2 24.1904
17 1.98930 1.0 0.62 0.95878 1.0 26.5897
18 1.97111 0.0 0.60 1.98930 1.0 22.2447
19 2.26603 0.7 0.60 1.97111 0.0 24.7994
20 1.98346 0.1 0.61 2.26603 0.7 21.1910
21 2.10054 1.0 0.60 1.98346 0.1 26.0344
22 1.06815 1.0 0.58 2.10054 1.0 27.3930
attach(SAPdata)

SAP.lm = lm(S ~ A + P + SE + A1 + P1)

summary(SAP.lm)

Call:
lm(formula = S ~ A + P + SE + A1 + P1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8601 -0.9847  0.1323  0.7017  2.2046 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -14.194     18.715  -0.758   0.4592    
A              5.361      4.028   1.331   0.2019    
P              8.372      3.586   2.334   0.0329 *  
SE            22.521      2.142  10.512 1.36e-08 ***
A1             3.855      3.578   1.077   0.2973    
P1             4.125      3.895   1.059   0.3053    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.32 on 16 degrees of freedom
Multiple R-squared:  0.9169,    Adjusted R-squared:  0.8909 
F-statistic:  35.3 on 5 and 16 DF,  p-value: 4.289e-08

We see that, apart from Sales Expenses, there is only weak evidence of a relationship between advertising, promotions and sales.

vif(SAP.lm)
        A         P        SE        A1        P1 
36.941513 33.473514  1.075962 25.915651 43.520965 

The VIF shows high multicollinearity for all advertising and promotions data. If we drop one of the variables the others change their coefficients unpredictably depending on which variable we drop.

SAP.lm2 = lm(S ~ P + SE + A1 + P1)
summary(SAP.lm2)$coef
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 10.5094338  2.4575979  4.2763032 5.103623e-04
P            3.7017455  0.7571140  4.8892842 1.382315e-04
SE          22.7941649  2.1803443 10.4543880 8.040607e-09
A1          -0.7692144  0.8745617 -0.8795427 3.913708e-01
P1          -0.9686584  0.7422511 -1.3050279 2.092725e-01
SAP.lm3 = lm(S ~ A + P + SE + P1)
summary(SAP.lm3)$coef
               Estimate Std. Error     t value     Pr(>|t|)
(Intercept)  5.76106517  2.6971385  2.13599157 4.752049e-02
A            1.14726708  0.9674889  1.18581942 2.520041e-01
P            4.61226335  0.8298268  5.55810393 3.469186e-05
SE          22.71199425  2.1450683 10.58800520 6.658248e-09
P1           0.03145725  0.8623592  0.03647813 9.713260e-01

Suppose we look at the relationships among the variables.

SAPA.lm = lm(A ~ P + SE + A1 + P1)
summary(SAPA.lm)$coef
               Estimate Std. Error     t value     Pr(>|t|)
(Intercept)  4.60813189 0.14474201  31.8368663 1.352575e-16
P           -0.87125259 0.04459078 -19.5388520 4.380113e-13
SE           0.05095381 0.12841295   0.3967965 6.964545e-01
A1          -0.86252031 0.05150795 -16.7453834 5.334057e-12
P1          -0.95013832 0.04371542 -21.7346283 7.654334e-14

This is approximately A = 5 - P - A1 - P1, as in the hidden constraint.
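Since the SAP variables are attached, we can also check the constraint directly (budget is just an illustrative name): the four advertising and promotion components should add to roughly 5 in every year.

budget = A + A1 + P + P1   # sum of the four budget components for each year
round(budget, 2)           # each total is close to 5
range(budget)              # how tightly the constraint holds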

32.7 What do we do about multicollinearity?

As far as 161.251 is concerned, we have a few options.

  1. We can remove one (or more) of the correlated variables from the regression. Our aim is to fix the problem, but sometimes we have a bit of choice about which variable to remove, in which case we would like the model to be as interpretable as possible. Sometimes the choice is between a variable with many missing values and one with few, and so we choose the latter if it allows us a bigger effective sample size \(n\).

  2. We could construct another variable which is a combination of the two most highly correlated variables (a sketch using the electricity data follows this list). It might be an ordinary average \((X_1+X_2)/2\), an average of standardized versions of the \(X_i\)s, or some other linear combination. Sometimes subtracting or dividing one variable by another will help. For example the incidence of a particular disease such as obesity may be higher in countries with high GDP and high Population. But total GDP reflects the size of the economy, which is correlated with the Population. Dividing gives us per capita GDP, which is a better indicator of average wealth and hence may be more closely related to disease incidence.

  3. (Not examinable). Those who have studied multivariate statistics may find it useful to construct principal components or do a factor analysis of a set of correlated variables. Dropping less important components leads to a class of models called biased regression. Another form of biased regression shrinks the regression coefficients towards zero - a method called shrinkage estimators.

  4. (Not examinable). It is possible to fit regression models that impose linear constraints on the regression coefficients (such as forcing them to add to 100%). Such restricted least squares is also beyond the scope of this course. Therefore for this course we focus on method 1, dropping variables.
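As an illustration of option 2 with the electricity data (where Income and Area were the two most highly correlated predictors), one possible combined variable is the average of their standardized versions. This is only a sketch; IncArea and elec.lm6 are illustrative names, not part of the lecture code.

IncArea = as.numeric((scale(Income) + scale(Area)) / 2)   # average of standardized Income and Area
elec.lm6 = lm(Bill ~ IncArea + Persons)
summary(elec.lm6)$coef   # inspect the coefficients and their standard errors
vif(elec.lm6)            # both VIFs should now be much closer to 1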