7.5 Summarizing Data with Standard Errors and Confidence Intervals
7.5.1 Problem
You want to summarize your data with the standard error of the mean and/or confidence intervals.
7.5.2 Solution
Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
<- cabbages %>%
ca group_by(Cult, Date) %>%
summarise(
Weight = mean(HeadWt),
sd = sd(HeadWt),
n = n(),
se = sd / sqrt(n)
)#> `summarise()` has grouped output by 'Cult'. You can override using the
#> `.groups` argument.
ca#> # A tibble: 6 x 6
#> # Groups: Cult [2]
#> Cult Date Weight sd n se
#> <fct> <fct> <dbl> <dbl> <int> <dbl>
#> 1 c39 d16 3.18 0.957 10 0.303
#> 2 c39 d20 2.8 0.279 10 0.0882
#> 3 c39 d21 2.74 0.983 10 0.311
#> 4 c52 d16 2.26 0.445 10 0.141
#> 5 c52 d20 3.11 0.791 10 0.250
#> 6 c52 d21 1.47 0.211 10 0.0667
7.5.3 Discussion
The summarise()
function computes the columns in order, so you can refer to previous newly-created columns. That’s why se
can use the sd
and n
columns.
The n()
function gets a count of rows, but if you want to have it not count NA
values from a column, you need to use a different technique. For example, if you want it to ignore any NA
s in the HeadWt
column, use sum(!is.na(Headwt))
.
7.5.3.1 Confidence Intervals {#_confidence_intervals}
Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the qt()
function to get the quantile, then multiply that by the standard error. The qt()
function will give quantiles of the t-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped t-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.
This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:
<- qt(.975, ca$n - 1)
ciMult
ciMult#> [1] 2.262157 2.262157 2.262157 2.262157 2.262157 2.262157
Now we can multiply that vector by the standard error to get the 95% confidence interval:
$ci95 <- ca$se * ciMult
ca
ca#> # A tibble: 6 x 7
#> # Groups: Cult [2]
#> Cult Date Weight sd n se ci95
#> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 c39 d16 3.18 0.957 10 0.303 0.684
#> 2 c39 d20 2.8 0.279 10 0.0882 0.200
#> 3 c39 d21 2.74 0.983 10 0.311 0.703
#> 4 c52 d16 2.26 0.445 10 0.141 0.318
#> 5 c52 d20 3.11 0.791 10 0.250 0.566
#> 6 c52 d21 1.47 0.211 10 0.0667 0.151
This could be done in one line, like this:
$ci95 <- ca$se * qt(.975, ca$n - 1) ca
For a 99% confidence interval, use .995.
Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).
This function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle NA
s and missing combinations, with the na.rm
and .drop
options. By default, it provides a 95% confidence interval, but this can be set with the conf.interval
argument:
<- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
summarySE conf.interval = .95, .drop = TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
<- function(x, na.rm = FALSE) {
length2 if (na.rm) sum(!is.na(x))
else length(x)
}
<- rlang::syms(groupvars)
groupvars <- rlang::sym(measurevar)
measurevar
<- data %>%
datac ::group_by(!!!groupvars) %>%
dplyr::summarise(
dplyrN = length2(!!measurevar, na.rm = na.rm),
sd = sd (!!measurevar, na.rm = na.rm),
!!measurevar := mean (!!measurevar, na.rm = na.rm),
se = sd / sqrt(N),
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ci = se * qt(conf.interval/2 + .5, N - 1)
%>%
) ::ungroup() %>%
dplyr# Rearrange the columns so that sd, se, ci are last
::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)
dplyr
datac }
The following usage example has a 99% confidence interval and handles NA
s and missing combinations:
# Remove all rows with both c52 and d21
<- filter(cabbages, !(Cult == "c52" & Date == "d21" ))
c2 # Set some values to NA
$HeadWt[c(1, 20, 45)] <- NA
c2summarySE(c2, "HeadWt", c("Cult", "Date"),
conf.interval = .99, na.rm = TRUE, .drop = FALSE)
#> `summarise()` has grouped output by 'Cult'. You can override using the
#> `.groups` argument.
#> # A tibble: 5 x 7
#> Cult Date N HeadWt sd se ci
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 c39 d16 9 3.26 0.982 0.327 1.10
#> 2 c39 d20 9 2.72 0.139 0.0465 0.156
#> 3 c39 d21 10 2.74 0.983 0.311 1.01
#> 4 c52 d16 10 2.26 0.445 0.141 0.458
#> 5 c52 d20 9 3.04 0.809 0.270 0.905
7.5.4 See Also
See Recipe ?? to use the values calculated here to add error bars to a graph.