Chapter 7 Contingency Tables

Chapter Aim: To use the generalised linear model to analyse multi-way contingency tables. This requires a careful distinction between response and explanatory variables.

7.1 Introduction

The \(\chi^2\) test in a two-way contingency table is one of the most commonly used statistical tests. It looks very different from tests based on the linear model, and apparently very specialised, applicable only to two factors. In this chapter we shall see that it can be generalised to any number of factors and made to look like a GLM. The first set of examples shows that the data in the two-way table of a humble \(\chi^2\) test arise from a variety of sampling methods, and that the hypothesis tested can be expressed in more than one way.

7.2 Distributions in Contingency Tables

7.2.1 Examples of Two-Way Contingency Tables

Example 1: Comparing two binomial proportions

An example of this was Example 5.2.1, where the data came from two independent samples and were modelled by two independent binomial random variables. Type of treatment (drug or placebo) was the explanatory variable, and outcome (survive or die) was the response. The number of people on each treatment was fixed, which is why there were two binomial random variables. As in simple regression, the data was used to explore the relationship between an explanatory variable and a response variable.

Example 2: Multinomial distribution

In order to examine the relationship between pollution and the size of fish in a river, a sample of 100 trout was caught, and each fish was classified as over/under 450g and as having a high/low concentration of pollutant in its liver.

Table 7.1: Multinomial sampling: 100 trout with polluted livers.

                Pollutant concentration
Weight          Low     High    Total
Under 450g       13       29       42
Over 450g        32       26       58
Total            45       55      100

Here, because the total is fixed, the distribution is multinomial with four categories. The data in Table 7.1 would be used to explore the same type of question though, with pollutant being the explanatory and weight the response.
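The familiar \(\chi^2\) calculation for this table can be sketched directly. The fragment below (in Python rather than R, purely so that it is self-contained) computes the expected counts under independence and the Pearson statistic for Table 7.1; 3.84 is the 5% point of \(\chi^2_1\).

```python
# Pearson chi-squared test of independence for the trout of Table 7.1.
obs = [[13, 29],   # under 450g: low, high pollutant concentration
       [32, 26]]   # over 450g:  low, high pollutant concentration

row_totals = [sum(row) for row in obs]          # [42, 58]
col_totals = [sum(col) for col in zip(*obs)]    # [45, 55]
n = sum(row_totals)                             # 100

# Expected counts under independence: E_ij = (row total)(column total)/n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 2))   # prints 5.77, above the 5% critical value 3.84 on 1 df
```

The statistic exceeds 3.84, so weight and pollutant concentration appear to be associated.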

Example 3: Poisson distribution: Lizards on perches

In a simpler study than Example 5.4.2, 164 adult male Anolis sagrei lizards of Bimini were observed in a fixed time period, and their perch height and perch diameter were recorded.

Lizard preferences for perch height and diameter

                     Perch diameter (mm)
Perch height     \(\leq\) 10    > 10    Total
> 1.5m                    32      11       43
\(\leq\) 1.5m             86      35      121
Total                    118      46      164

Here there are no clear response or explanatory variables: both variables are responses reflecting the lizards’ preferences. The question asked is whether the two variables are associated.

If, as is reasonable to assume, lizards on each perch type are spotted independently and at random (a Poisson process) the total in each perch type should have a Poisson distribution.
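This conditioning argument can be checked numerically. The sketch below, with arbitrary illustrative rates (not estimates from the lizard data), verifies that if two cell counts are independent Poisson variables, then conditional on their total the first count has exactly a binomial distribution; with more than two cells the same argument gives a multinomial.

```python
from math import exp, comb, factorial

# Illustrative rates only: X1 ~ Poisson(l1), X2 ~ Poisson(l2), independent.
l1, l2, n = 3.0, 5.0, 10

def pois(k, lam):
    """Poisson probability mass function."""
    return exp(-lam) * lam ** k / factorial(k)

# The sum of independent Poissons is Poisson, so P(X1 + X2 = n) is:
p_total = pois(n, l1 + l2)

# Conditional on the total, X1 should be Binomial(n, l1/(l1 + l2)).
for k in range(n + 1):
    cond = pois(k, l1) * pois(n - k, l2) / p_total
    binom = comb(n, k) * (l1 / (l1 + l2)) ** k * (l2 / (l1 + l2)) ** (n - k)
    assert abs(cond - binom) < 1e-12
```

So conditioning the Poisson counts on their fixed total recovers exactly the multinomial sampling scheme of Example 2.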


Despite the different distributions and different questions, a \(\chi^2\) statistic would be appropriate for all three contingency tables described above, although its interpretation would differ. This is not a result you would ever have seen proved, although the statistic \(\sum (O - E)^2/E\) seems reasonable in each case. We will not prove the result for GLMs either, but it is:

In all cases mentioned, so long as the totals in the margin are treated as given, the likelihood function is identical to that of a GLM with the counts in each cell having a Poisson distribution with a log link function.

The model can be written rather like the two way ANOVA model. If \(\mu_{ij}\) is the mean count in one cell \[\begin{equation} \mu_{ij}=\exp\{\alpha+\beta_i +\gamma_j + \delta_{ij}\} \tag{7.1} \end{equation}\] allows for each cell to have any mean at all. This is the saturated model. The simpler model \[\begin{equation} \mu_{ij}=\exp\{\alpha+\beta_i +\gamma_j\} \tag{7.2} \end{equation}\] implies that the row variable and the column variable are independent.

To see this, note that the proportion of observations in the first row which are also in the first column is \[\begin{aligned} \frac{\mu_{11}}{\mu_{11}+\mu_{12}} &= \frac{ \exp\{ \alpha + \beta_1 + \gamma_1\} }{ \exp\{ \alpha + \beta_1 + \gamma_1\} + \exp\{\alpha + \beta_1 + \gamma_2\} } \nonumber \\ &=\frac{ \exp \gamma_1 }{ \exp\gamma_1 + \exp\gamma_2} \end{aligned}\] and likewise, that the proportion of observations in the second row which are also in the first column is the same value: \[\frac{\mu_{21}}{\mu_{21}+\mu_{22}} =\frac{ \exp \gamma_1 }{ \exp\gamma_1 + \exp\gamma_2}\]

So if \(\delta_{ij} = 0\), the distribution of counts across the columns is the same in each row, and therefore the two variables are independent. The right hand expressions (\(\alpha + \beta_i + \gamma_j + \delta_{ij})\) are the linear predictors of a GLM, where the \(\boldsymbol{X}\) matrix represents the categorical variables of the table.
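A quick numerical check of this algebra, using arbitrary illustrative parameter values rather than estimates from any data set:

```python
from math import exp

# Arbitrary parameters for the no-interaction model (7.2):
# mu_ij = exp(alpha + beta_i + gamma_j).
alpha, beta, gamma = 0.5, [0.2, -0.7], [1.1, 0.3]

mu = [[exp(alpha + b + g) for g in gamma] for b in beta]

# Proportion of each row's total that falls in the first column.
prop_row1 = mu[0][0] / (mu[0][0] + mu[0][1])
prop_row2 = mu[1][0] / (mu[1][0] + mu[1][1])

# Both equal exp(gamma_1) / (exp(gamma_1) + exp(gamma_2)), as derived above.
target = exp(gamma[0]) / (exp(gamma[0]) + exp(gamma[1]))
assert abs(prop_row1 - target) < 1e-12
assert abs(prop_row2 - target) < 1e-12
```

Whatever values \(\alpha\), \(\beta_i\) and \(\gamma_j\) take, the column proportions are the same in every row, which is exactly the independence condition.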

In the special case where a response variable has just two categories the number in one category has a binomial distribution, and the binomial/logit model could be used. Such a model leads to the same estimates and same likelihood as the Poisson/log model.

Note in passing that \[\mu_{ij}=\alpha+\beta_i + \gamma_j + \delta_{ij}\] is also a perfectly legitimate generalised linear model. With the log link \(\delta_{ij} = 0\) implies that the categories are multiplicatively independent. With the identity link they are additively independent, a reasonable model for a \(2\times2\) factorial experiment where the response is a count. The log link has a computational advantage in making a negative mean count impossible, but the link function should always be chosen to give a realistic model.

7.3 Log Linear Models for Contingency Tables

The Poisson/log model always has ‘count’ as its ‘response’ variable. The deviance is a measure of the total variation in the table, and the aim is, as always, to reduce it to an amount consistent with a \(\chi^2\) distribution. Having done this, the interaction terms are the ones which answer questions about the actual explanatory and response variables. In the two-way tables of Section 7.2.1 the interaction term answers whatever the question was, just as the \(\chi^2\) statistic did. The totals, corresponding to the main effects of an ANOVA model, describe the way the data was collected but give no information about the relationship between the variables.

Possibilities become much more complex with three-way tables, but the totals in each category, which are fitted by the main effect terms, must always be included in the model, because they merely describe the way the data was collected. Interactions between explanatory variables likewise just describe the way the data was collected, and they too should always be included in the model.

Example 1: A three way table — Diabetics

Diabetics classified by family history, dependence on insulin and age of onset.

                              Family History of Diabetes
                                   Yes            No
Dependence on Insulin           Yes    No      Yes    No
Age of onset < 45                 6     1       16     2
Age of onset \(\geq\) 45          6    33        8    48

The table above shows information from a group of diabetic patients. Family history is an explanatory variable and the other two are responses. The response is therefore bivariate, which makes the range of possible questions rather different from anything met before.

7.3.1 Notation

The minimal model describing the data as collected is the main effects only model. To avoid using a string of different Greek letters, suffixes are used to denote the variables.

\[\begin{equation} \log\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} \tag{7.3} \end{equation}\]

If there are N observations in total \[N p_{ijk} = \mu_{ijk}\]

or \[p_{ijk} = \frac{\mu_{ijk}}{N} = p_{i\cdot \cdot } p_{\cdot j\cdot } p_{\cdot \cdot k}\]

In the example

\(u_1\)= variable 1, family history

\(u_2\) = variable 2, dependence on insulin

\(u_3\) = variable 3, age at onset.

7.3.2 Types of model

The model in Equation (7.3) implies no association between any of the variables: family history does not affect age of onset or dependence on insulin, and there is no association between dependence and age of onset. Equation (7.3) follows from the elementary definition of independence:

\[\begin{equation} P\left[(U_1 = i) \cap (U_2 = j) \cap (U_3 = k)\right] = P(U_1 = i) P(U_2 = j) P(U_3 = k) \tag{7.4} \end{equation}\]

As a shorthand, this model is written \([1][2][3]\): each variable appears on its own.

Now, add an interaction between family history (\(U_1\)) and dependence on insulin (\(U_2\)). \[\begin{equation} \log\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} \tag{7.5} \end{equation}\] or \[p_{ijk} = p_{ij\cdot} p_{\cdot\cdot k}\]

This model implies that family history may explain dependence on insulin, but not age of onset, and there is no association between age of onset and dependence on insulin. The shorthand is \([12][3]\). Model \([13][2]\) is similar, and \([1][23]\) says that family history does not explain either of the response variables, but the two response variables are associated with each other. Numbers within the same brackets appear in the model with all their interactions.

Now add a second interaction, family history by age at onset. \[\log\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)}\] or \[p_{ijk} = p_{ij\cdot}p_{i\cdot k}\]

This is the model of conditional independence of variable 2 and 3 at each level of variable 1. For people with the same family history, age of onset and dependence on insulin are independent. If we looked just at the two response variables they would be associated but only because of the lurking variable, family history. The shorthand for this model is \([12][13]\). This would be consistent with both insulin dependence and age of onset resulting partially from family history.

The two other models of conditional independence, \([12][23]\) and \([13][23]\), are consistent with cause-and-effect chains such as family history affecting insulin dependence, which in turn affects age of onset. For people with the same level of insulin dependence there will be no association between family history and age of onset. Next consider: \[\log\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}\] or \[p_{ijk} = p_{ij\cdot}p_{i\cdot k} p_{\cdot jk}\] This model allows association between every pair of variables, but the association between any pair is the same at each level of the third variable. This is the model \([12][13][23]\).

Finally there is the saturated model \([123]\), which implies that no summary of the table is possible.

\(i\) Coefficients of minimal model

Analyses of diabetics data

               Value    Std. Error    t value
(Constant)     2.3849     0.1212        19.7
History       -0.2377     0.0939        -2.5
Age of onset   0.6675     0.1124         5.9
Insulin       -0.4236     0.0996        -4.3

\(ii\) Coefficients of minimal model plus Age : Insulin interaction

Analyses of diabetics data

                   Value    Std. Error    t value
(Constant)        2.0847     0.1714        12.2
History          -0.2377     0.0939        -2.5
Age of onset      0.7110     0.1700         4.2
Insulin           0.0593     0.1700         0.3
Onset : Insulin  -0.9370     0.1700        -5.5

\(iii\) Analysis of deviance

Analyses of diabetics data

                    Df    Deviance Drop    Resid. Df    Resid. Dev
NULL                                           7           120.24
+ History            1         6.59            6           113.65
+ Age of onset       1        43.54            5            70.11
+ Insulin            1        19.75            4            50.36
+ Onset : Insulin    1        48.82            3             1.54

The tables above show the results of fitting two models to the diabetics data.

The first model is specified as: \[Count ~\sim~ History ~+~ Insulin ~+~ Onset\] with distribution ‘poisson’ and link ‘log’.

This corresponds to the ‘independence’ model in the familiar two way contingency table. The expected number in each cell is \[n\hat{p}_{i\cdot \cdot}\hat{p}_{\cdot j\cdot}\hat{p}_{\cdot \cdot k}\] where the \(\hat{p}\)’s are the overall proportions of observations at each level of each factor.

The second model is specified as \[Count ~\sim~ History ~+~ Insulin ~+~ Onset ~+~ Onset:Insulin\] The Onset:Insulin term allows an association between age at onset and insulin dependence, so the fitted proportions no longer need to be the products of the marginal totals for these two factors. The very low deviance shows that this model fits very well.

In all there are no fewer than nine possible models. Their deviances are shown in Table 7.2. One might hope that not all these models would need to be fitted, but the drop in deviance caused by a term can depend on which other terms are present in the model, just as in ordinary regression. The importance of the interaction between the second and third factors, for example, is the difference between models 1 and 2, models 4 and 5, models 3 and 6, and models 7 and 8. In this example the changes in deviance between each of these pairs of models are very similar, but not identical. The differences would have been greater if the other interactions had been larger.

    # Fit all nine log-linear models of Table 7.2; the object names give
    # the model shorthand, e.g. Diabetics.12.3 is the model [12][3].
    data(Diabetics, package="ELMER")
    Diabetics.1.2.3 = glm(Count ~ History + Insulin + Onset, data=Diabetics, family=poisson(link="log"))
    deviance(Diabetics.1.2.3)
    df.residual(Diabetics.1.2.3)

    Diabetics.1.23 = glm(Count ~ History + Insulin*Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.13.2 = glm(Count ~ History*Onset + Insulin, data=Diabetics, family=poisson(link="log"))
    Diabetics.12.3 = glm(Count ~ History*Insulin + Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.12.23 = glm(Count ~ History*Insulin + Insulin*Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.13.23 = glm(Count ~ History*Onset + Insulin*Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.12.13 = glm(Count ~ History*Insulin + History*Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.12.13.23 = glm(Count ~ History*Insulin*Onset - History:Insulin:Onset, data=Diabetics, family=poisson(link="log"))
    Diabetics.123 = glm(Count ~ History*Insulin*Onset, data=Diabetics, family=poisson(link="log"))
Table 7.2: Deviances for all possible models for the diabetics data.

    Model               df    Deviance
1   \([1][2][3]\)        4      50.362
2   \([1][23]\)          3       1.543
3   \([13][2]\)          3      48.888
4   \([12][3]\)          3      49.812
5   \([12][23]\)         2       0.993
6   \([13][23]\)         2       0.069
7   \([12][13]\)         2      48.338
8   \([12][13][23]\)     1       0.066
9   \([123]\)            0       0.000
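Several of these models are decomposable: their fitted values have closed forms in terms of the observed margins, so the corresponding deviances can be checked without any model-fitting software. The sketch below (in Python, so that it is self-contained) reproduces models 1, 2 and 7 from the counts in the diabetics table.

```python
from math import log

# Observed counts, indexed (history, insulin, onset); levels are
# 'Y'/'N' for the first two and '<45'/'>=45' for age of onset.
counts = {('Y', 'Y', '<45'): 6,  ('Y', 'Y', '>=45'): 6,
          ('Y', 'N', '<45'): 1,  ('Y', 'N', '>=45'): 33,
          ('N', 'Y', '<45'): 16, ('N', 'Y', '>=45'): 8,
          ('N', 'N', '<45'): 2,  ('N', 'N', '>=45'): 48}
N = sum(counts.values())  # 120

def margin(pos):
    """Sum the counts over every factor not listed in pos."""
    m = {}
    for cell, o in counts.items():
        key = tuple(cell[p] for p in pos)
        m[key] = m.get(key, 0) + o
    return m

h, s, a = margin([0]), margin([1]), margin([2])
hs, ha, sa = margin([0, 1]), margin([0, 2]), margin([1, 2])

def deviance(fitted):
    """G^2 = 2 * sum O log(O/E); all observed counts here are positive."""
    return 2 * sum(o * log(o / fitted(c)) for c, o in counts.items())

# Model 1, [1][2][3]: E = n_i.. n_.j. n_..k / N^2
d1 = deviance(lambda c: h[(c[0],)] * s[(c[1],)] * a[(c[2],)] / N ** 2)
# Model 2, [1][23]:   E = n_i.. n_.jk / N
d2 = deviance(lambda c: h[(c[0],)] * sa[(c[1], c[2])] / N)
# Model 7, [12][13]:  E = n_ij. n_i.k / n_i..  (conditional independence)
d7 = deviance(lambda c: hs[(c[0], c[1])] * ha[(c[0], c[2])] / h[(c[0],)])

# The three deviances should match Table 7.2: 50.362, 1.543 and 48.338.
print(round(d1, 3), round(d2, 3), round(d7, 3))
```

The same closed forms explain the shorthand: within each bracket the margin is reproduced exactly, and the remaining structure is filled in by independence.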

There is an association between age of onset and dependence on insulin, but nothing else much seems to matter. If this association had arisen because family history had affected both age of onset and dependence on insulin, there would have been a drop in deviance between model 1 and 2, but not between 7 and 8. Model 7, which allows for an association between family history and both age of onset and dependence on insulin, would also have allowed for the observed association between insulin dependence and age of onset.

In conclusion, this has been a very straightforward data set. The associations can be shown in a two-way table of age of onset against insulin dependence, combining the levels of family history.

Example 2: Housing conditions in Copenhagen

Satisfaction and degree of contact with other residents classified by type of housing.

                     Contact With Other Residents
                          Low                     High
Satisfaction      Low   Medium   High     Low   Medium   High
Tower blocks       65     54      100      34     47      100
Apartments        130     76      111     141    116      191
Houses             67     48       62     130    105      104

The data used in this example are extracted from Table 8.5 of (Dobson and Barnett 2008), but it is actually a subset of the more detailed data set that is part of the MASS package that accompanies (Venables and Ripley 2002).

Residents living in rented homes built between 1960 and 1968 were questioned about their satisfaction and degree of contact with other residents. The type of building is clearly an explanatory variable and satisfaction is clearly a response, but degree of contact could be either. Totalling over levels of contact shows that satisfaction is highest in tower blocks. Looking along each row separately shows that satisfaction is generally highest with high contact with other residents, though less so for houses.

The sort of questions which might be asked are:

  1. What type of building gives highest satisfaction, forgetting about contact with other residents?

  2. Is degree of contact between residents associated with level of satisfaction?

  3. If there are differences in levels of satisfaction between buildings, can they be explained by different buildings leading to different levels of contact between residents?

If factors are numbered:

  1. Satisfaction.

  2. Degree of contact with other residents.

  3. Type of building.

The first question is relevant to a comparison between \([1][2][3]\) and \([13][2]\) or Models 1 and 2 in Table 7.3. This treats degree of contact as a second response variable.

The change in deviance, 34.53 with 4 degrees of freedom, is large, as the data clearly show.

The second question is relevant to a comparison between \([1][23]\) and \([12][23]\) or Models 3 and 4 in Table 7.3. Note that this question regards degree of contact as an explanatory variable, so in accordance with the principle that all explanatory variables should be included, \([23]\) must be in both models.

The change in deviance, 5.13 with 2 degrees of freedom, lies in the “unproven but cannot be ignored” category.

The last question is relevant to a comparison between \([12][23]\) and \([12][13][23]\) or Models 4 and 5 in Table 7.3.

\([12][23]\) implies that amongst residents with the same degree of contact with others, there is no association between satisfaction and buildings. Adding the \([13]\) term allows satisfaction to vary with building, even if degree of contact remains the same, and makes a large drop in the deviance.

data(Copenhagen, package="ELMER")
Copenhagen.1.2.3 = glm(Count~Satisfaction+Contact+Type, data=Copenhagen,  family=poisson(link="log"))
Copenhagen.13.2 = glm(Count~Satisfaction*Type+Contact, data=Copenhagen,  family=poisson(link="log"))
Copenhagen.1.23 = glm(Count~Satisfaction+Contact*Type, data=Copenhagen,  family=poisson(link="log"))
Copenhagen.12.23 = glm(Count~Satisfaction*Contact+Contact*Type, data=Copenhagen,  family=poisson(link="log"))
Copenhagen.12.23.13 = glm(Count~Satisfaction*Contact*Type-Satisfaction:Contact:Type, data=Copenhagen,  family=poisson(link="log"))
Table 7.3: Deviances for selected models for the Copenhagen housing data.

    Model               df    Deviance    Change in Deviance    Change in df
1   \([1][2][3]\)       12      89.35
2   \([13][2]\)          8      54.82            34.53                4
3   \([1][23]\)         10      50.29
4   \([12][23]\)         8      45.16             5.13                2
5   \([12][13][23]\)     4       6.89            38.27                4
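As with the diabetics data, the decomposable models here have closed-form fitted values, so the first two rows of Table 7.3 can be checked directly. A self-contained Python sketch:

```python
from math import log

# Observed counts, re-indexed as (satisfaction, contact, type).
data = {('Tower', 'Low'): (65, 54, 100),  ('Tower', 'High'): (34, 47, 100),
        ('Apart', 'Low'): (130, 76, 111), ('Apart', 'High'): (141, 116, 191),
        ('House', 'Low'): (67, 48, 62),   ('House', 'High'): (130, 105, 104)}
counts = {(s, c, t): o
          for (t, c), row in data.items()
          for s, o in zip(('Low', 'Med', 'High'), row)}
N = sum(counts.values())  # 1681

def margin(pos):
    """Sum the counts over every factor not listed in pos."""
    m = {}
    for cell, o in counts.items():
        key = tuple(cell[p] for p in pos)
        m[key] = m.get(key, 0) + o
    return m

sat, con, typ = margin([0]), margin([1]), margin([2])
sat_typ = margin([0, 2])

def deviance(fitted):
    return 2 * sum(o * log(o / fitted(c)) for c, o in counts.items())

# Model 1, [1][2][3]: E = n_i.. n_.j. n_..k / N^2
d1 = deviance(lambda c: sat[(c[0],)] * con[(c[1],)] * typ[(c[2],)] / N ** 2)
# Model 2, [13][2]:   E = n_i.k n_.j. / N
d2 = deviance(lambda c: sat_typ[(c[0], c[2])] * con[(c[1],)] / N)

# Should reproduce 89.35 and 54.82, and hence the change of 34.53 on 4 df.
print(round(d1, 2), round(d2, 2), round(d1 - d2, 2))
```

Models 4 and 5 are not decomposable in this way, which is why an iterative fitting routine such as R’s glm is needed for the rest of the table.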

Contact with others by satisfaction is significant only at the 10% level, but if it were omitted the final model would be \([13][23]\), which, with a deviance of 15.73 on 6 df, is rather too large. Interestingly, the association between satisfaction and contact with others is actually stronger amongst those living in the same building type than it is overall: if terms had been added in the order \([13]\) then \([12]\), contact with others by satisfaction would have lowered the deviance by 8.84, which is clearly significant. That is because the contact-by-satisfaction effect is in a different direction for houses than for tower blocks or apartments.

Finally a warning. Talk of explanatory and response variables conjures images of cause and effect. Banish them. Cause and effect can only be established by an experiment. It may seem common sense that building type causes satisfaction, but maybe less satisfied people buy houses, and people who like contact with others may well choose to live in tower blocks rather than houses.

7.4 Zero Cells

One difficulty with all GLMs, and one which is particularly relevant to count data, is that all inferences are based on statistics whose distributions are known only asymptotically. When a contingency table has many dimensions, many cells are likely to have zero counts. You will be familiar with rules for the \(\chi^2\) test suggesting lower limits for the expected number in each cell. The Poisson/log GLM cannot fit a zero count, and large negative coefficients with very large standard errors indicate that this is what it is trying to do. It is the fitted values which cannot be zero, so a few actual zeros may not matter. Often the fitting process proceeds smoothly until adding one particular interaction term makes several coefficients very large and negative. That interaction must be attempting to fit a zero count. Although the coefficients will have little value, the deviance should still be sensible and provide a general indication of whether the troublesome interaction is needed.
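A small illustration with a hypothetical \(2\times2\) table (not from the text): an independence model fits a table containing a sampling zero without difficulty, because its fitted values depend only on the margins, whereas the saturated model must reproduce the zero itself.

```python
from math import log

# Hypothetical 2x2 table with a sampling zero (illustrative only).
obs = [[0, 10],
       [12, 18]]
rows = [sum(r) for r in obs]          # [10, 30]
cols = [sum(c) for c in zip(*obs)]    # [12, 28]
n = sum(rows)                         # 40

# Independence fitted values are all positive, so the deviance is finite
# (using the convention 0*log(0) = 0 for the empty cell).
fit = [[r * c / n for c in cols] for r in rows]
dev = 2 * sum(o * log(o / e)
              for row, frow in zip(obs, fit)
              for o, e in zip(row, frow) if o > 0)
print([[round(e, 1) for e in row] for row in fit], round(dev, 2))
# prints [[3.0, 7.0], [9.0, 21.0]] 8.49

# The saturated model reproduces the table exactly, so its fitted value for
# the empty cell is 0; its log-linear coefficients would involve log(0),
# which software reports as a huge negative estimate with a huge standard error.
```

This is why the deviance remains usable as a guide even when individual coefficient estimates have blown up.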

7.5 Summary

Generalised linear models provide a powerful tool for analysing associations between categorical variables. Normally explaining variability in the response variable is the aim of fitting a GLM (or any linear model), but in contingency tables explaining the variability of counts is only the means to an end. An interaction term which lowers the deviance indicates an association between the corresponding categorical variables. Explanatory and response categorical variables look just the same as far as the model is concerned, emphasising (what is always true) that the distinction lies in their practical interpretation which is external to the model.

If there is only one response variable which has just two levels, that variable may be tested either as a category in a contingency table, as in this chapter, or as a binary response variable, as in Chapter 5. The answers will be just the same.

7.6 Exercises

  1. (Agresti 2007) presented an example on the counts of murderers given the death penalty in the state of Florida. The data can be used to judge the level of racial prejudice that might exist in the justice system as the race of the murderer and their victims are noted. Use a log-linear model to determine the appropriate model for this scenario and draw a conclusion on the possible prejudice that may exist. Do your results match those that could be found using a logistic regression approach? The data can be obtained using:
    data(DeathPenalty, package="ELMER")

This exercise could be used as an in-class tutorial.

  2. A sociological experiment examined the way racial descent and gender influenced people’s helpfulness towards a stranger. The data, a \(2\times2\times2\times2\) array, is shown in the table below.
                              Respondents
                    Female               Male                Total
Requestor      Help  Refuse Total   Help  Refuse Total   Help  Refuse Total
English
  females        23      0     23     24      3     27     47      3     50
  males          20      4     24     21      5     26     41      9     50
Asian
  females        25      2     27     17     11     28     42     13     55
  males           9     15     24     21      5     26     30     20     50

Students of similar age and dressed alike approached strangers in a busy shopping precinct and requested change for a phone call. If the stranger provided or looked for change the response was counted as helpful. Not replying or not looking were counted as unhelpful. The stranger’s gender was also noted. The data can be obtained using:

    data(Helpful, package="ELMER")

The students were either Asian or English, males or females.

1.  What are the explanatory and response variables?
2.  What is the minimal model for a Poisson/log analysis?
3.  Starting with the minimal model, add interactions until the deviance drops to a value consistent with random variation. Give an interpretation of this model.
4.  Does your model make sense when you look just at the proportions in the table? In other words, how well could you have predicted the model without formal analysis?

This exercise could be used as an in-class tutorial.

  3. In Examples 1 and 2 there were only three factors, leading to nine possible models; enumeration of all nine models was achievable. How many models are required for full enumeration of a four-factor situation? What strategy would you suggest for efficiently determining the optimal model for a four-factor situation?

References

Agresti, Alan. 2007. An Introduction to Categorical Data Analysis. Second. New Jersey: John Wiley & Sons.
Dobson, Annette J., and Adrian G. Barnett. 2008. An Introduction to Generalized Linear Models. Third. Boca Raton, Florida: Chapman & Hall.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer.