Practical Computing Exercise for Week 7: Wedderburn’s leaf blotch data — Solutions

Aims of this practical exercise

In this exercise you will:

  • compare use of family="binomial" versus family="quasibinomial"

The exercise

Wedderburn’s leaf blotch data represents the proportion of leaf blotch on ten varieties of barley grown at nine sites in 1965. The amount of leaf blotch is recorded as a percentage. Obtain the data using:

    data(LeafBlotch, package="ELMER")

and answer the following questions.

  1. obtain the data and check that variety and site are correctly defined.
LeafBlotch |> glimpse()
Rows: 90
Columns: 3
$ Prop    <dbl> 0.0005, 0.0000, 0.0125, 0.0250, 0.0550, 0.0100, 0.0500, 0.0500…
$ Site    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3,…
$ Variety <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,…

Both predictors are stored as factors.

  1. fit a GLM that is appropriate for this data, assuming the data are binomially distributed.
Leaf.glm = glm(Prop~Site+Variety, family=binomial(), data=LeafBlotch, weights=rep(100, nrow(LeafBlotch)))
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
anova(Leaf.glm, test="Chisq")
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
Analysis of Deviance Table

Model: binomial, link: logit

Response: Prop

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                       89     4080.3              
Site     8   1857.7        81     2222.7 < 2.2e-16 ***
Variety  9   1610.1        72      612.6 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Why can’t we fit the interaction model here.

There are insufficient degrees of freedom for the interaction term. Fitting the interaction saturates the model. Strictly speaking we could have fitted the interaction which gives us a p-value for the confounded interaction and residual (it is quite large!).

Leaf.glm2 = glm(Prop~Site*Variety, family=binomial(), data=LeafBlotch, weights=rep(100, nrow(LeafBlotch)))
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
anova(Leaf.glm2, test="Chisq")
Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Warning in eval(family$initialize): non-integer #successes in a binomial glm!
Analysis of Deviance Table

Model: binomial, link: logit

Response: Prop

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                            89     4080.3              
Site          8   1857.7        81     2222.7 < 2.2e-16 ***
Variety       9   1610.1        72      612.6 < 2.2e-16 ***
Site:Variety 72    612.6         0        0.0 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. produce a plot of the residuals vs the predicted values for this model and identify any problems that might exist.

:\{r plotLeaf.glm} plot(Leaf.glm,which=1)

  1. fit the following model.
Leaf.glm.quasi = glm(Prop~Site+Variety, family=quasibinomial(), data=LeafBlotch, weights=rep(100, nrow(LeafBlotch)))

The differences only appear in the summary() output.

summary(Leaf.glm.quasi)

Call:
glm(formula = Prop ~ Site + Variety, family = quasibinomial(), 
    data = LeafBlotch, weights = rep(100, nrow(LeafBlotch)))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.4435  -1.3537  -0.2028   0.9500   8.1003  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.0546     1.4220  -5.665 2.84e-07 ***
Site2         1.6391     1.4433   1.136 0.259871    
Site3         3.3265     1.3492   2.466 0.016066 *  
Site4         3.5822     1.3444   2.664 0.009510 ** 
Site5         3.5831     1.3444   2.665 0.009494 ** 
Site6         3.8933     1.3402   2.905 0.004875 ** 
Site7         4.7300     1.3348   3.544 0.000697 ***
Site8         5.5227     1.3346   4.138 9.38e-05 ***
Site9         6.7946     1.3407   5.068 3.00e-06 ***
Variety2      0.1501     0.7237   0.207 0.836289    
Variety3      0.6895     0.6724   1.025 0.308587    
Variety4      1.0482     0.6494   1.614 0.110910    
Variety5      1.6147     0.6257   2.581 0.011895 *  
Variety6      2.3712     0.6090   3.893 0.000219 ***
Variety7      2.5705     0.6065   4.238 6.58e-05 ***
Variety8      3.3420     0.6015   5.556 4.39e-07 ***
Variety9      3.5000     0.6013   5.820 1.51e-07 ***
Variety10     4.2530     0.6042   7.039 9.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 8.87778)

    Null deviance: 4080.3  on 89  degrees of freedom
Residual deviance:  612.6  on 72  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The standard errors for the coefficients are larger in the quasi model than in the binomial model and the dispersion parameter is larger, importantly so. The quasi model is a better option.

  1. produce a plot of the residuals vs the predicted values and identify if this is a better option than using the previous model.

:\{r plotLeaf.glm.quasi} plot(Leaf.glm.quasi, which=1)

  1. What is the dispersion parameter for this model?

Reading off the summary we see dispersion is 8.87778 which is very large and indicates serious over-dispersion.

  1. Describe the effect site and variety have on the proportion of leaf blotch.

Both site and variety have a noticeable impact on the amount of leaf blotch.

As an aside…

The agridat package includes this dataset and calls it wedderburn.barley. The help page for it presents several other ways to model the proportion that are beyond the expected learning for the 161.331 course. You might investigate if you have lots of time on your hands, or after the conclusion of the semester.