7.5 Summarizing Data with Standard Errors and Confidence Intervals

7.5.1 Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

7.5.2 Solution

Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:

library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

ca <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n(),
    se = sd / sqrt(n)
  )
#> `summarise()` has grouped output by 'Cult'. You can override using the
#> `.groups` argument.

ca
#> # A tibble: 6 x 6
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303 
#> 2 c39   d20     2.8  0.279    10 0.0882
#> 3 c39   d21     2.74 0.983    10 0.311 
#> 4 c52   d16     2.26 0.445    10 0.141 
#> 5 c52   d20     3.11 0.791    10 0.250 
#> 6 c52   d21     1.47 0.211    10 0.0667

7.5.3 Discussion

The summarise() function computes the columns in order, so later columns can refer to columns created earlier in the same call. That’s why se can be calculated from the sd and n columns.

The n() function counts rows, but if you want the count to exclude NA values in a particular column, you need a different technique. For example, to ignore any NAs in the HeadWt column, use sum(!is.na(HeadWt)).
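Here is a minimal sketch of the idea, using the same cabbages data (which has no missing HeadWt values, so both counts happen to agree; the n_rows and n_valid names are just illustrative):

# MASS and dplyr are already loaded above
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    n_rows  = n(),                  # counts every row, including NAs
    n_valid = sum(!is.na(HeadWt))   # counts only non-missing HeadWt values
  )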

7.5.3.1 Confidence Intervals

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the qt() function to get the quantile, then multiply that by the standard error. The qt() function will give quantiles of the t-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped t-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:

ciMult <- qt(.975, ca$n - 1)
ciMult
#> [1] 2.262157 2.262157 2.262157 2.262157 2.262157 2.262157
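For comparison, this multiplier is a bit larger than the familiar 1.96 from the normal distribution, because the t-distribution has heavier tails at small sample sizes (a quick check using base R’s qnorm()):

qnorm(.975)
#> [1] 1.959964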

Now we can multiply that vector by the standard error to get the 95% confidence interval:

ca$ci95 <- ca$se * ciMult
ca
#> # A tibble: 6 x 7
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se  ci95
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl> <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303  0.684
#> 2 c39   d20     2.8  0.279    10 0.0882 0.200
#> 3 c39   d21     2.74 0.983    10 0.311  0.703
#> 4 c52   d16     2.26 0.445    10 0.141  0.318
#> 5 c52   d20     3.11 0.791    10 0.250  0.566
#> 6 c52   d21     1.47 0.211    10 0.0667 0.151

This could be done in one line, like this:

ca$ci95 <- ca$se * qt(.975, ca$n - 1)

For a 99% confidence interval, use .995.
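For example, to store a 99% interval in its own column (the ci99 name here is just illustrative):

ca$ci99 <- ca$se * qt(.995, ca$n - 1)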

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

This function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle NAs and missing combinations, with the na.rm and .drop options. By default, it provides a 95% confidence interval, but this can be set with the conf.interval argument:

summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
                      conf.interval = .95, .drop = TRUE) {

  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else       length(x)
  }

  groupvars  <- rlang::syms(groupvars)
  measurevar <- rlang::sym(measurevar)

  datac <- data %>%
    dplyr::group_by(!!!groupvars) %>%
    dplyr::summarise(
      N             = length2(!!measurevar, na.rm = na.rm),
      sd            = sd     (!!measurevar, na.rm = na.rm),
      !!measurevar := mean   (!!measurevar, na.rm = na.rm),
      se            = sd / sqrt(N),
      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
      ci            = se * qt(conf.interval/2 + .5, N - 1)
    ) %>%
    dplyr::ungroup() %>%
    # Rearrange the columns so that sd, se, ci are last
    dplyr::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)

  datac
}

The following usage example has a 99% confidence interval and handles NAs and missing combinations:

# Remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21"))
# Set some values to NA
c2$HeadWt[c(1, 20, 45)] <- NA
summarySE(c2, "HeadWt", c("Cult", "Date"),
          conf.interval = .99, na.rm = TRUE, .drop = FALSE)
#> `summarise()` has grouped output by 'Cult'. You can override using the
#> `.groups` argument.
#> # A tibble: 5 x 7
#>   Cult  Date      N HeadWt    sd     se    ci
#>   <fct> <fct> <int>  <dbl> <dbl>  <dbl> <dbl>
#> 1 c39   d16       9   3.26 0.982 0.327  1.10 
#> 2 c39   d20       9   2.72 0.139 0.0465 0.156
#> 3 c39   d21      10   2.74 0.983 0.311  1.01 
#> 4 c52   d16      10   2.26 0.445 0.141  0.458
#> 5 c52   d20       9   3.04 0.809 0.270  0.905
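
If you just want the defaults (a 95% confidence interval, NAs not removed, and empty combinations dropped), the call is simply:

summarySE(cabbages, "HeadWt", c("Cult", "Date"))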

7.5.4 See Also

See Recipe ?? to use the values calculated here to add error bars to a graph.