6.1 Making a Basic Histogram
6.1.2 Solution
Use geom_histogram() and map a continuous variable to x (Figure 6.1):
ggplot(faithful, aes(x = waiting)) +
  geom_histogram()
#> This is an untitled chart with no subtitle or caption.
#> It has x-axis 'waiting' with labels 50, 60, 70, 80, 90 and 100.
#> It has y-axis 'count' with labels 0, 10 and 20.
#> The chart is a bar chart with 30 vertical bars.
6.1.3 Discussion
All geom_histogram() requires is one column from a data frame or a single vector of data. For this example we’ll use the faithful data set, which contains two columns with data about the Old Faithful geyser: eruptions, which is the length of each eruption, and waiting, which is the length of time to the next eruption. We’ll only use the waiting variable in this example:
faithful
#>     eruptions waiting
#> 1       3.600      79
#> 2       1.800      54
#> 3       3.333      74
#>  ...<266 more rows>...
#> 270     4.417      90
#> 271     1.817      46
#> 272     4.467      74
If you just want to get a quick look at some data that isn’t in a data frame, you can get the same result by passing in NULL for the data frame and giving ggplot() a vector of values. This would have the same result as the previous code:
# Store the values in a simple vector
w <- faithful$waiting

ggplot(NULL, aes(x = w)) +
  geom_histogram()
By default, the data is grouped into 30 bins. This number of bins is an arbitrary default value, and may be too fine or too coarse for your data. You can change the size of the bins by specifying the binwidth, or you can divide the range of the data into a specific number of bins.
In addition, the default colors – a dark fill without an outline – can make it difficult to see which bar corresponds to which value, so we’ll also change the colors, as shown in Figure 6.2.
# Set the width of each bin to 5 (each bin will span 5 x-axis units)
ggplot(faithful, aes(x = waiting)) +
geom_histogram(binwidth = 5, fill = "white", colour = "black")
#> This is an untitled chart with no subtitle or caption.
#> It has x-axis 'waiting' with labels 40, 50, 60, 70, 80, 90 and 100.
#> It has y-axis 'count' with labels 0, 20, 40 and 60.
#> The chart is a bar chart with 11 vertical bars.
#> It has fill set to white.
#> It has colour set to black.
# Divide the x range into 15 bins
binsize <- diff(range(faithful$waiting))/15

ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = binsize, fill = "white", colour = "black")
#> This is an untitled chart with no subtitle or caption.
#> It has x-axis 'waiting' with labels 40, 60 and 80.
#> It has y-axis 'count' with labels 0, 10, 20, 30 and 40.
#> The chart is a bar chart with 16 vertical bars.
#> It has fill set to white.
#> It has colour set to black.
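As an alternative to computing a bin width by hand, geom_histogram() also accepts a bins argument that requests a number of bins directly. The following is a minimal sketch of that approach; the names p_bins and bin_data are illustrative only, and layer_data() is used just to inspect the bins that ggplot2 computed. Because bins = 15 leaves the alignment of the bin edges to ggplot2, the number and position of the bars need not match the manual calculation above, which reported 16 bars.

```r
library(ggplot2)

# Ask for 15 bins directly instead of deriving a bin width
p_bins <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 15, fill = "white", colour = "black")

# Pull out the computed bins: one row per bar, with its edges and count
bin_data <- layer_data(p_bins)
bin_data[, c("xmin", "xmax", "count")]

# Every observation lands in exactly one bin
sum(bin_data$count) == nrow(faithful)
```

If exact control over the edges matters, specifying binwidth together with boundary, as in the examples that follow, is the more explicit route.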
Sometimes the appearance of the histogram will be very dependent on the width of the bins and where the boundary points between the bins are. In Figure 6.3, we’ll use a bin width of 8. In the version on the left, we’ll use the boundary parameter to put boundaries at 31, 39, 47, etc., while in the version on the right, we’ll shift it over by 4, putting boundaries at 35, 43, 51, etc.:
# Save a base plot
faithful_p <- ggplot(faithful, aes(x = waiting))

faithful_p +
  geom_histogram(binwidth = 8, fill = "white", colour = "black", boundary = 31)
#> This is an untitled chart with no subtitle or caption.
#> It has x-axis 'waiting' with labels 40, 60, 80 and 100.
#> It has y-axis 'count' with labels 0, 20, 40 and 60.
#> The chart is a bar chart with 8 vertical bars.
#> Bar 1 is centered horizontally at 43, and spans vertically from 0 to 13.
#> Bar 2 is centered horizontally at 51, and spans vertically from 0 to 46.
#> Bar 3 is centered horizontally at 59, and spans vertically from 0 to 31.
#> Bar 4 is centered horizontally at 67, and spans vertically from 0 to 22.
#> Bar 5 is centered horizontally at 75, and spans vertically from 0 to 68.
#> Bar 6 is centered horizontally at 83, and spans vertically from 0 to 71.
#> Bar 7 is centered horizontally at 91, and spans vertically from 0 to 20.
#> Bar 8 is centered horizontally at 99, and spans vertically from 0 to 1.
#> It has fill set to white.
#> It has colour set to black.
faithful_p +
  geom_histogram(binwidth = 8, fill = "white", colour = "black", boundary = 35)
#> This is an untitled chart with no subtitle or caption.
#> It has x-axis 'waiting' with labels 60, 80 and 100.
#> It has y-axis 'count' with labels 0, 25, 50 and 75.
#> The chart is a bar chart with 7 vertical bars.
#> Bar 1 is centered horizontally at 47, and spans vertically from 0 to 32.
#> Bar 2 is centered horizontally at 55, and spans vertically from 0 to 45.
#> Bar 3 is centered horizontally at 63, and spans vertically from 0 to 23.
#> Bar 4 is centered horizontally at 71, and spans vertically from 0 to 34.
#> Bar 5 is centered horizontally at 79, and spans vertically from 0 to 93.
#> Bar 6 is centered horizontally at 87, and spans vertically from 0 to 40.
#> Bar 7 is centered horizontally at 95, and spans vertically from 0 to 5.
#> It has fill set to white.
#> It has colour set to black.
The results look quite different, even though they have the same bin size. The faithful data set is not particularly small, with 272 observations; with smaller data sets, this can be even more of an issue. When visualizing your data, it’s a good idea to experiment with different bin sizes and boundary points.
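One way to experiment with boundary points without comparing plots by eye is to tabulate the bins for each setting. This is a sketch rather than part of the recipe: bins_31 and bins_35 are illustrative names, and layer_data() is used only to extract the bin edges and counts that ggplot2 computed for each plot.

```r
library(ggplot2)

faithful_p <- ggplot(faithful, aes(x = waiting))

# Same bin width, two different boundary points
bins_31 <- layer_data(faithful_p + geom_histogram(binwidth = 8, boundary = 31))
bins_35 <- layer_data(faithful_p + geom_histogram(binwidth = 8, boundary = 35))

# Compare the edges and counts of the two binnings
bins_31[, c("xmin", "xmax", "count")]
bins_35[, c("xmin", "xmax", "count")]

# Both binnings account for the same observations; only the grouping differs
c(sum(bins_31$count), sum(bins_35$count))
```

The totals agree, so any difference in the shape of the two histograms comes entirely from where the bin edges fall.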
If your data has discrete values, it may matter that the histogram bins are asymmetrical: each bin is closed on one bound and open on the other. Which bound is closed is set by the closed argument of geom_histogram(). In current versions of ggplot2 the default is closed = "right", so with bin boundaries at 1, 2, 3, etc., the bins are (1, 2], (2, 3], and so on. With closed = "left", the bins are [1, 2), [2, 3), and so on. In other words, the first bin contains 1 but not 2, and the second bin contains 2 but not 3.
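To see why the closed side matters for discrete values, the sketch below bins a small made-up vector both ways using base R's cut(). This is only a stand-in for illustration, not ggplot2's internal binning code; the vector x and the break points are hypothetical. In geom_histogram() the corresponding choice is made with the closed argument.

```r
# Hypothetical discrete data: one 1, two 2s, three 3s
x <- c(1, 2, 2, 3, 3, 3)
breaks <- 0:4

# Closed on the lower bound: [0,1) [1,2) [2,3) [3,4)
left_closed <- table(cut(x, breaks = breaks, right = FALSE))

# Closed on the upper bound: (0,1] (1,2] (2,3] (3,4]
right_closed <- table(cut(x, breaks = breaks, right = TRUE))

as.vector(left_closed)   # 0 1 2 3
as.vector(right_closed)  # 1 2 3 0
```

The same six values shift by a whole bin depending on which side is closed, which is why it is worth setting the convention explicitly when the data sit exactly on the bin boundaries.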