In this exercise you will:
RiverProbAll
data.data(RiverProbAll, package="ELMER")
str(RiverProbAll)
'data.frame': 91 obs. of 3 variables:
$ Flow : int 24 24 26 26 50 48 72 72 25 23 ...
$ Type : Factor w/ 3 levels "Canadian","Kayak",..: 1 3 2 1 1 1 1 2 1 2 ...
$ Problems: int 1 0 0 0 1 0 1 2 1 3 ...
The dataset RiverProbAll
gives the number of problems
experienced by travelers on the Whanganui river. This is the full data
set including the craft type as a categorical variable.
Fit a generalised linear model to this data using a Poisson distribution for the number of problems and a log link function.
Use comparisons of the deviance to make any simplifications to a full model you find possible.
Hint: The categorical variable could be altered.
<- glm(Problems~Flow*Type, family=poisson(link="log"),data=RiverProbAll)
River.glm1 |> summary() River.glm1
Call:
glm(formula = Problems ~ Flow * Type, family = poisson(link = "log"),
data = RiverProbAll)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3449 -1.2442 -0.4588 0.3276 2.3919
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.081955 0.457869 0.179 0.8579
Flow -0.012242 0.012351 -0.991 0.3216
TypeKayak 0.016486 0.590950 0.028 0.9777
TypeOther 3.788587 1.621478 2.337 0.0195 *
Flow:TypeKayak 0.003831 0.014986 0.256 0.7983
Flow:TypeOther -0.153219 0.068375 -2.241 0.0250 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 119.04 on 90 degrees of freedom
Residual deviance: 102.16 on 85 degrees of freedom
AIC: 210.66
Number of Fisher Scoring iterations: 6
|> anova(test="Chisq") River.glm1
Analysis of Deviance Table
Model: poisson, link: log
Response: Problems
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 90 119.04
Flow 1 5.0657 89 113.97 0.024404 *
Type 2 0.7330 87 113.24 0.693151
Flow:Type 2 11.0823 85 102.16 0.003922 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Analysis of deviance table shows us that the interaction term is significant so we cannot reduce the model by removing terms. However, the summary output shows more detail about the importance of individual levels of our categorical predictor.
The two terms relating to the difference between Canadians and kayaks
are not significant, but the terms for other craft show differences. We
could re-categorise the three levels into two in order to simplify our
model. Let’s make a new categorical variable based on Type
and fit a new version of the model.
library(dplyr)
= RiverProbAll |> mutate(Type2 = Type=="Other") RiverProbAll
<- glm(Problems~Flow*Type2, family=poisson(link="log"),data=RiverProbAll)
River.glm2 |> summary() River.glm2
Call:
glm(formula = Problems ~ Flow * Type2, family = poisson(link = "log"),
data = RiverProbAll)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3449 -1.2319 -0.4588 0.2682 2.4736
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.060632 0.283695 0.214 0.8308
Flow -0.009484 0.006975 -1.360 0.1740
Type2TRUE 3.809910 1.581148 2.410 0.0160 *
Flow:Type2TRUE -0.155977 0.067611 -2.307 0.0211 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 119.04 on 90 degrees of freedom
Residual deviance: 102.52 on 87 degrees of freedom
AIC: 207.02
Number of Fisher Scoring iterations: 6
|> anova(test="Chisq") River.glm2
Analysis of Deviance Table
Model: poisson, link: log
Response: Problems
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 90 119.04
Flow 1 5.0657 89 113.97 0.0244036 *
Type2 1 0.3904 88 113.58 0.5320729
Flow:Type2 1 11.0652 87 102.52 0.0008796 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our two models are nested. We can compare them using anova(). To confirm this, the more complicated model has a column indicating a difference between Canadian and Kayak, which is not used in our simpler model.
anova(River.glm1, River.glm2, test="Chisq")
Analysis of Deviance Table
Model 1: Problems ~ Flow * Type
Model 2: Problems ~ Flow * Type2
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 85 102.16
2 87 102.52 -2 -0.3597 0.8354
anova(River.glm2, River.glm1, test="Chisq")
Analysis of Deviance Table
Model 1: Problems ~ Flow * Type2
Model 2: Problems ~ Flow * Type
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 87 102.52
2 85 102.16 2 0.3597 0.8354
Strictly speaking, we ought to put the simpler model in the
anova()
first, but the switch doesn’t matter. It is better
to avoid the minus signs in model comparisons though! What we see is
that the extra distinction between Canadians and kayaks is not
necessary.
You should only move on to the second exercise using this dataset) once you have completed the above exercise.