Practical Computing Exercise for Week 5 :The first River Problems exercise to use all the data Solutions

Aims of this practical exercise

In this exercise you will:

Complete the exercise given in ELMER
get some practice using R markdown
fit a variety of models to the entire RiverProbAll data.

Before you undertake this exercise…

Get the data

data(RiverProbAll, package="ELMER")
str(RiverProbAll)

'data.frame':   91 obs. of  3 variables:
 $ Flow    : int  24 24 26 26 50 48 72 72 25 23 ...
 $ Type    : Factor w/ 3 levels "Canadian","Kayak",..: 1 3 2 1 1 1 1 2 1 2 ...
 $ Problems: int  1 0 0 0 1 0 1 2 1 3 ...

The Exercise

The dataset RiverProbAll gives the number of problems experienced by travelers on the Whanganui river. This is the full data set including the craft type as a categorical variable.

Fit a generalised linear model to this data using a Poisson distribution for the number of problems and a log link function.

Use comparisons of the deviance to make any simplifications to a full model you find possible.

Hint: The categorical variable could be altered.

The solution

River.glm1 <- glm(Problems~Flow*Type, family=poisson(link="log"),data=RiverProbAll)
River.glm1 |> summary()


Call:
glm(formula = Problems ~ Flow * Type, family = poisson(link = "log"), 
    data = RiverProbAll)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3449  -1.2442  -0.4588   0.3276   2.3919  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)     0.081955   0.457869   0.179   0.8579  
Flow           -0.012242   0.012351  -0.991   0.3216  
TypeKayak       0.016486   0.590950   0.028   0.9777  
TypeOther       3.788587   1.621478   2.337   0.0195 *
Flow:TypeKayak  0.003831   0.014986   0.256   0.7983  
Flow:TypeOther -0.153219   0.068375  -2.241   0.0250 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 119.04  on 90  degrees of freedom
Residual deviance: 102.16  on 85  degrees of freedom
AIC: 210.66

Number of Fisher Scoring iterations: 6

River.glm1 |> anova(test="Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: Problems

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                         90     119.04            
Flow       1   5.0657        89     113.97 0.024404 * 
Type       2   0.7330        87     113.24 0.693151   
Flow:Type  2  11.0823        85     102.16 0.003922 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Analysis of deviance table shows us that the interaction term is significant so we cannot reduce the model by removing terms. However, the summary output shows more detail about the importance of individual levels of our categorical predictor.

The two terms relating to the difference between Canadians and kayaks are not significant, but the terms for other craft show differences. We could re-categorise the three levels into two in order to simplify our model. Let’s make a new categorical variable based on Type and fit a new version of the model.

library(dplyr)
RiverProbAll = RiverProbAll |> mutate(Type2 = Type=="Other")

River.glm2 <- glm(Problems~Flow*Type2, family=poisson(link="log"),data=RiverProbAll)
River.glm2 |>     summary()


Call:
glm(formula = Problems ~ Flow * Type2, family = poisson(link = "log"), 
    data = RiverProbAll)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3449  -1.2319  -0.4588   0.2682   2.4736  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)     0.060632   0.283695   0.214   0.8308  
Flow           -0.009484   0.006975  -1.360   0.1740  
Type2TRUE       3.809910   1.581148   2.410   0.0160 *
Flow:Type2TRUE -0.155977   0.067611  -2.307   0.0211 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 119.04  on 90  degrees of freedom
Residual deviance: 102.52  on 87  degrees of freedom
AIC: 207.02

Number of Fisher Scoring iterations: 6

River.glm2 |>     anova(test="Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: Problems

Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                          90     119.04              
Flow        1   5.0657        89     113.97 0.0244036 *  
Type2       1   0.3904        88     113.58 0.5320729    
Flow:Type2  1  11.0652        87     102.52 0.0008796 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our two models are nested. We can compare them using anova(). To confirm this, the more complicated model has a column indicating a difference between Canadian and Kayak, which is not used in our simpler model.

anova(River.glm1, River.glm2, test="Chisq")

Analysis of Deviance Table

Model 1: Problems ~ Flow * Type
Model 2: Problems ~ Flow * Type2
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        85     102.16                     
2        87     102.52 -2  -0.3597   0.8354

anova(River.glm2, River.glm1, test="Chisq")

Analysis of Deviance Table

Model 1: Problems ~ Flow * Type2
Model 2: Problems ~ Flow * Type
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        87     102.52                     
2        85     102.16  2   0.3597   0.8354

Strictly speaking, we ought to put the simpler model in the anova() first, but the switch doesn’t matter. It is better to avoid the minus signs in model comparisons though! What we see is that the extra distinction between Canadians and kayaks is not necessary.

What to do next?

You should only move on to the second exercise using this dataset) once you have completed the above exercise.

Massey University — School of Mathematical and Computational Sciences

161.331 Biostatistics

Staff member responsible for this workshop: Jonathan Godfrey a.j.godfrey@massey.ac.nz