7.6 Converting Data from Wide to Long
7.6.2 Solution
Use gather()
from the tidyr package. In the anthoming
data set, for each angle
, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:
library(gcookbook) # For the data set
anthoming#> angle expt ctrl
#> 1 -20 1 0
#> 2 -10 7 3
#> 3 0 2 3
#> 4 10 0 3
#> 5 20 0 1
We can reshape the data so that all the measurements are in one column. This will put the values from expt
and ctrl
into one column, and put the names into a different column:
library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#> angle condition count
#> 1 -20 expt 1
#> 2 -10 expt 7
#> ...<6 more rows>...
#> 9 10 ctrl 3
#> 10 20 ctrl 1
This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.
7.6.3 Discussion
In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when angle
is –20. In the output data frame, the two measurements, for expt
and ctrl
, are no longer in the same row, but we can still tell that they belong together because they have the same value of angle
.
The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called condition
, and the values are put into a new value column which we called count
.
You can designate the value columns from the source data by naming them individually, as we did above with expt
and ctrl
. gather()
automatically inferred that the ID variable was the remaining column, angle
. Another way to tell it which columns are values is to do the reverse: if you exclude the angle
column, then gather()
will infer that the value columns are the remaining ones, expt
and ctrl
.
gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)
There are other convenient shortcuts to specify which columns are values. For example expt:ctrl
means to select all columns between expt
and ctrl
(in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).
By default, gather()
will use all of the columns from the source data as either ID columns or value columnbs. That means that if you want to ignore some columns, you’ll need to filter them out first using the select()
function.
For example, in the drunk
data set, suppose we want to convert it to long format, keeping sex
in one column and putting the numeric values in another column. This time, we want the values for only the 0-29
and 30-39
columns, and we want to discard the values for the other age ranges:
# Our source data
drunk#> sex 0-29 30-39 40-49 50-59 60+
#> 1 male 185 207 260 180 71
#> 2 female 4 13 10 7 10
# Try gather() with just 0-29 and 30-39
%>%
drunk gather(age, count, "0-29", "30-39")
#> sex 40-49 50-59 60+ age count
#> 1 male 260 180 71 0-29 185
#> 2 female 10 7 10 0-29 4
#> 3 male 260 180 71 30-39 207
#> 4 female 10 7 10 30-39 13
That doesn’t look right! We told gather()
that 0-29
and 30-39
were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep sex
and discard the others. The solution is to use select()
to remove the unwanted columns first, and then gather()
.
library(dplyr) # For the select() function
%>%
drunk select(sex, "0-29", "30-39") %>%
gather(age, count, "0-29", "30-39")
#> sex age count
#> 1 male 0-29 185
#> 2 female 0-29 4
#> 3 male 30-39 207
#> 4 female 30-39 13
There are times where you may want to use use more than one column as the ID variables:
plum_wide#> length time dead alive
#> 1 long at_once 84 156
#> 2 long in_spring 156 84
#> 3 short at_once 133 107
#> 4 short in_spring 209 31
# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, "survival", "count", dead, alive)
#> length time survival count
#> 1 long at_once dead 84
#> 2 long in_spring dead 156
#> ...<4 more rows>...
#> 7 short at_once alive 107
#> 8 short in_spring alive 31
Some data sets don’t come with a column with an ID variable. For example, in the corneas
data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using melt():
# Make a copy of the data
<- corneas
co # Add an ID column
$id <- 1:nrow(co)
co
gather(co, "eye", "thickness", affected, notaffected)
#> id eye thickness
#> 1 1 affected 488
#> 2 2 affected 478
#> ...<12 more rows>...
#> 15 7 notaffected 464
#> 16 8 notaffected 476
Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with as.character()
, or a factor with factor()
.
7.6.4 See Also
See Recipe 7.7 to do conversions in the other direction, from long to wide.
See the stack()
function for another way of converting from wide to long.