Lecture 14: Model Choice Using Information Criteria

Download the R markdown file for this lecture.

In previous lectures we have looked at comparison of models, and automatic methods of variable selection, using F testing.

An alternative approach to selecting a good model is to define a measure of the quality of a model, then choose the model of the highest quality.
So called information criteria can be used as measures of model quality.

In this lecture, we will examine information criteria in general, the Akaike information criterion (AIC) in particular, and consider their use in model selection.

Information Criteria

A good regression model (i.e. with a good choice of predictors) will:

fit the data well;
be simple (have as few predictors as possible).

Hence quality of a model can be measured by a quantity of the form \[\mbox{Measure of badness of fit} + k \times \mbox{Number of predictors} + \mbox{constant}\] where k can be adjusted to control the relative importance of the two aspects of a model’s quality just given.

Note that any additive constant is of no real importance, since in comparing the quality of different models (our real aim) the constant term cancels out.

Akaike Information Criterion

Several information criteria in statistics are of the form of our simple equation for a “measure of badness of fit” given above.

One such measure is the Akaike Information Criterion (AIC). For a linear model with unknown error variance this is defined by \[AIC = n \log (RSS/n) + 2p + \mbox{constant}.\]

We can choose between models by selecting the one with smaller AIC.

N.B. This is not the only criterion in common use, but is perhaps the most common. Different software presents different criteria.

Back to the Paramo Data

Recall previously that we compared models M0 and M1 defined as follows: \[M0~~~~~~E[ \mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_3 \mbox{DEc}\] and \[M1~~~~~~E[ \mbox{N}] = \beta_0 + \beta_1 \mbox{AR} + \beta_2 \mbox{EL} + \beta_3 \mbox{DEc} + \beta_4 \mbox{DNI}\]

These models were fitted in R and stored as Paramo.lm0 and Paramo.lm1 respectively.

Selecting between these models by an F test involves testing H₀: \[\beta_2 = \beta_4 = 0~~~~\mbox{versus}~~~~~H_1: \beta_2, \beta_4 \mbox{ not both zero.}\]

Download paramo.csv

## Paramo <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/paramo.csv", 
##     header = TRUE, row.names = 1)
Paramo.lm0 <- lm(N ~ AR + DEc, data = Paramo)
Paramo.lm1 <- lm(N ~ AR + EL + DEc + DNI, data = Paramo)
AIC(Paramo.lm0)

[1] 95.7644

AIC(Paramo.lm1)

[1] 98.8237

We can compare the models using AIC (instead of using an F test).

An alternative to AIC() is the extractAIC() command. The output from extractAIC() command is number of regression parameters (p+1) and the model’s AIC.

AIC is smaller for model M0, so this model is to be preferred in accordance with our earlier findings.

AIC in Stepwise Variable Selection

We have seen that sequential variable selection can be conducted using a sequence of F tests.

An alternative is to compare models using AIC.

Specifically, at each step of the algorithm we change to the model with smallest AIC (from adding or dropping a single variable), and stop only when we cannot reduce AIC by single term additions or deletions.

This procedure is implemented using the step() command in R.

US Crime Rate Data

This example concerns crime data from 47 states of the USA in 1960. Our aim is to model the crime rate as a function of up to 15 potential explanatory variables. We will use stepwise variable selection based on AIC to choose a model. Data source: Ehrlich, I. (1973) “Participation in illegitimate activities: a theoretical and empirical investigation” Journal of Political Economy, 81, 521-565).

possible burglar

Example : Variables in the Data Set

Variable	Description
M	percentage of males aged 14-24
So	indicator variable for a southern state
Ed	mean years of schooling
Po1	police expenditure in 1960
Po2	police expenditure in 1959
LF	labour force participation rate
M.F	number of males per 1000 females
Pop	state population
NW	number of nonwhites per 1000 people
U1	unemployment rate of urban males 14-24
U2	unemployment rate of urban males 35-39
GDP	gross domestic product per head
Ineq	income inequality
Prob	probability of imprisonment
Time	average time served in state prisons
Crime	rate of crimes in a particular category per head of population

Example : R Code

data(UScrime, package = "MASS")  # data is in MASS package
USCrime <- UScrime %>% rename(Crime = y)
USCrime.lm.full <- lm(Crime ~ ., data = USCrime)

Notes:

Specification of the scope in a step() can be tricky when using dots in formula. So we will specify the upper scope explicitly by extracting the necessary details using formula() from an existing model.
The output from R is presented “iteration by iteration” and (relatively speaking) takes some time to push through for this data.

USCrime.lm.step <- step(USCrime.lm.full, scope = list(lower = Crime ~ 1, upper = formula(USCrime.lm.full)), 
    direction = "both")

Start:  AIC=514.65
Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + 
    U2 + GDP + Ineq + Prob + Time

       Df Sum of Sq     RSS    AIC
- So    1        29 1354974 512.65
- LF    1      8917 1363862 512.96
- Time  1     10304 1365250 513.00
- Pop   1     14122 1369068 513.14
- NW    1     18395 1373341 513.28
- M.F   1     31967 1386913 513.74
- GDP   1     37613 1392558 513.94
- Po2   1     37919 1392865 513.95
<none>              1354946 514.65
- U1    1     83722 1438668 515.47
- Po1   1    144306 1499252 517.41
- U2    1    181536 1536482 518.56
- M     1    193770 1548716 518.93
- Prob  1    199538 1554484 519.11
- Ed    1    402117 1757063 524.86
- Ineq  1    423031 1777977 525.42

Step:  AIC=512.65
Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + 
    GDP + Ineq + Prob + Time

       Df Sum of Sq     RSS    AIC
- Time  1     10341 1365315 511.01
- LF    1     10878 1365852 511.03
- Pop   1     14127 1369101 511.14
- NW    1     21626 1376600 511.39
- M.F   1     32449 1387423 511.76
- Po2   1     37954 1392929 511.95
- GDP   1     39223 1394197 511.99
<none>              1354974 512.65
- U1    1     96420 1451395 513.88
+ So    1        29 1354946 514.65
- Po1   1    144302 1499277 515.41
- U2    1    189859 1544834 516.81
- M     1    195084 1550059 516.97
- Prob  1    204463 1559437 517.26
- Ed    1    403140 1758114 522.89
- Ineq  1    488834 1843808 525.13

Step:  AIC=511.01
Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + 
    GDP + Ineq + Prob

       Df Sum of Sq     RSS    AIC
- LF    1     10533 1375848 509.37
- NW    1     15482 1380797 509.54
- Pop   1     21846 1387161 509.75
- Po2   1     28932 1394247 509.99
- GDP   1     36070 1401385 510.23
- M.F   1     41784 1407099 510.42
<none>              1365315 511.01
- U1    1     91420 1456735 512.05
+ Time  1     10341 1354974 512.65
+ So    1        65 1365250 513.00
- Po1   1    134137 1499452 513.41
- U2    1    184143 1549458 514.95
- M     1    186110 1551425 515.01
- Prob  1    237493 1602808 516.54
- Ed    1    409448 1774763 521.33
- Ineq  1    502909 1868224 523.75

Step:  AIC=509.37
Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + NW + U1 + U2 + GDP + 
    Ineq + Prob

       Df Sum of Sq     RSS    AIC
- NW    1     11675 1387523 507.77
- Po2   1     21418 1397266 508.09
- Pop   1     27803 1403651 508.31
- M.F   1     31252 1407100 508.42
- GDP   1     35035 1410883 508.55
<none>              1375848 509.37
- U1    1     80954 1456802 510.06
+ LF    1     10533 1365315 511.01
+ Time  1      9996 1365852 511.03
+ So    1      3046 1372802 511.26
- Po1   1    123896 1499744 511.42
- U2    1    190746 1566594 513.47
- M     1    217716 1593564 514.27
- Prob  1    226971 1602819 514.54
- Ed    1    413254 1789103 519.71
- Ineq  1    500944 1876792 521.96

Step:  AIC=507.77
Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + U1 + U2 + GDP + Ineq + 
    Prob

       Df Sum of Sq     RSS    AIC
- Po2   1     16706 1404229 506.33
- Pop   1     25793 1413315 506.63
- M.F   1     26785 1414308 506.66
- GDP   1     31551 1419073 506.82
<none>              1387523 507.77
- U1    1     83881 1471404 508.52
+ NW    1     11675 1375848 509.37
+ So    1      7207 1380316 509.52
+ LF    1      6726 1380797 509.54
+ Time  1      4534 1382989 509.61
- Po1   1    118348 1505871 509.61
- U2    1    201453 1588976 512.14
- Prob  1    216760 1604282 512.59
- M     1    309214 1696737 515.22
- Ed    1    402754 1790276 517.74
- Ineq  1    589736 1977259 522.41

Step:  AIC=506.33
Crime ~ M + Ed + Po1 + M.F + Pop + U1 + U2 + GDP + Ineq + Prob

       Df Sum of Sq     RSS    AIC
- Pop   1     22345 1426575 505.07
- GDP   1     32142 1436371 505.39
- M.F   1     36808 1441037 505.54
<none>              1404229 506.33
- U1    1     86373 1490602 507.13
+ Po2   1     16706 1387523 507.77
+ NW    1      6963 1397266 508.09
+ So    1      3807 1400422 508.20
+ LF    1      1986 1402243 508.26
+ Time  1       575 1403654 508.31
- U2    1    205814 1610043 510.76
- Prob  1    218607 1622836 511.13
- M     1    307001 1711230 513.62
- Ed    1    389502 1793731 515.83
- Ineq  1    608627 2012856 521.25
- Po1   1   1050202 2454432 530.57

Step:  AIC=505.07
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + GDP + Ineq + Prob

       Df Sum of Sq     RSS    AIC
- GDP   1     26493 1453068 503.93
<none>              1426575 505.07
- M.F   1     84491 1511065 505.77
- U1    1     99463 1526037 506.24
+ Pop   1     22345 1404229 506.33
+ Po2   1     13259 1413315 506.63
+ NW    1      5927 1420648 506.87
+ So    1      5724 1420851 506.88
+ LF    1      5176 1421398 506.90
+ Time  1      3913 1422661 506.94
- Prob  1    198571 1625145 509.20
- U2    1    208880 1635455 509.49
- M     1    320926 1747501 512.61
- Ed    1    386773 1813348 514.35
- Ineq  1    594779 2021354 519.45
- Po1   1   1127277 2553852 530.44

Step:  AIC=503.93
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob

       Df Sum of Sq     RSS    AIC
<none>              1453068 503.93
+ GDP   1     26493 1426575 505.07
- M.F   1    103159 1556227 505.16
+ Pop   1     16697 1436371 505.39
+ Po2   1     14148 1438919 505.47
+ So    1      9329 1443739 505.63
+ LF    1      4374 1448694 505.79
+ NW    1      3799 1449269 505.81
+ Time  1      2293 1450775 505.86
- U1    1    127044 1580112 505.87
- Prob  1    247978 1701046 509.34
- U2    1    255443 1708511 509.55
- M     1    296790 1749858 510.67
- Ed    1    445788 1898855 514.51
- Ineq  1    738244 2191312 521.24
- Po1   1   1672038 3125105 537.93

Example: Comments on R Output

The stepwise algorithm has selected \[E[ \mbox{Crime}] = -6426.10 + 9.33 \mbox{M} + 18.01 \mbox{Ed} + 10.27 \mbox{Po1} + 2.23 \mbox{M.F} \mbox{} -6.09 \mbox{U1} + 18.74 \mbox{U2} + 6.13 \mbox{Ineq} -3796.03 \mbox{Prob} \] as the “best” model.

While this is (probably) a perfectly reasonable model, it is usually unrealistic to hope to find the best model (even if we can define precisely what we mean by such a thing).

The AIC criteria tends to err on the side of including too many (rather than too few) predictors.

Example : The Final Model

summary(USCrime.lm.step)


Call:
lm(formula = Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob, 
    data = USCrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-444.70 -111.07    3.03  122.15  483.30 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6426.101   1194.611  -5.379 4.04e-06 ***
M               9.332      3.350   2.786  0.00828 ** 
Ed             18.012      5.275   3.414  0.00153 ** 
Po1            10.265      1.552   6.613 8.26e-08 ***
M.F             2.234      1.360   1.642  0.10874    
U1             -6.087      3.339  -1.823  0.07622 .  
U2             18.735      7.248   2.585  0.01371 *  
Ineq            6.133      1.396   4.394 8.63e-05 ***
Prob        -3796.032   1490.646  -2.547  0.01505 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 195.5 on 38 degrees of freedom
Multiple R-squared:  0.7888,    Adjusted R-squared:  0.7444 
F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Example: Conclusions

The crime rate is negatively associated with the probability of imprisonment.

Only one of Po1 and Po2 appear in the model because these predictors are highly interdependent.

The crime rate is positively associated with police expenditure; i.e. the more that is spent on policing, the higher the crime rate!

This association should not be interpreted as a causal relationship - obviously higher police spending does not cause a higher crime rate.