7.6 Converting Data from Wide to Long

7.6.1 Problem

You want to convert a data frame from “wide” format to “long” format.

7.6.2 Solution

Use gather() from the tidyr package. In the anthoming data set, for each angle, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

library(gcookbook) # For the data set
anthoming
#>   angle expt ctrl
#> 1   -20    1    0
#> 2   -10    7    3
#> 3     0    2    3
#> 4    10    0    3
#> 5    20    0    1

We can reshape the data so that all the measurements are in one column. This will put the values from expt and ctrl into one column, and put the names into a different column:

library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#>    angle condition count
#> 1    -20      expt     1
#> 2    -10      expt     7
#>  ...<6 more rows>...
#> 9     10      ctrl     3
#> 10    20      ctrl     1

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

7.6.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when angle is –20. In the output data frame, the two measurements, for expt and ctrl, are no longer in the same row, but we can still tell that they belong together because they have the same value of angle.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called condition, and the values are put into a new value column which we called count.

You can designate the value columns from the source data by naming them individually, as we did above with expt and ctrl. gather() automatically inferred that the ID variable was the remaining column, angle. Another way to tell it which columns are values is to do the reverse: if you exclude the angle column, then gather() will infer that the value columns are the remaining ones, expt and ctrl.

gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)

There are other convenient shortcuts to specify which columns are values. For example expt:ctrl means to select all columns between expt and ctrl (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, gather() will use all of the columns from the source data as either ID columns or value columnbs. That means that if you want to ignore some columns, you’ll need to filter them out first using the select() function.

For example, in the drunk data set, suppose we want to convert it to long format, keeping sex in one column and putting the numeric values in another column. This time, we want the values for only the 0-29 and 30-39 columns, and we want to discard the values for the other age ranges:

# Our source data
drunk
#>      sex 0-29 30-39 40-49 50-59 60+
#> 1   male  185   207   260   180  71
#> 2 female    4    13    10     7  10

# Try gather() with just 0-29 and 30-39
drunk %>%
  gather(age, count, "0-29", "30-39")
#>      sex 40-49 50-59 60+   age count
#> 1   male   260   180  71  0-29   185
#> 2 female    10     7  10  0-29     4
#> 3   male   260   180  71 30-39   207
#> 4 female    10     7  10 30-39    13

That doesn’t look right! We told gather() that 0-29 and 30-39 were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep sex and discard the others. The solution is to use select() to remove the unwanted columns first, and then gather().

library(dplyr)  # For the select() function

drunk %>%
  select(sex, "0-29", "30-39") %>%
  gather(age, count, "0-29", "30-39")
#>      sex   age count
#> 1   male  0-29   185
#> 2 female  0-29     4
#> 3   male 30-39   207
#> 4 female 30-39    13

There are times where you may want to use use more than one column as the ID variables:

plum_wide
#>   length      time dead alive
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31
# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, "survival", "count", dead, alive)
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31

Some data sets don’t come with a column with an ID variable. For example, in the corneas data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using melt():

# Make a copy of the data
co <- corneas
# Add an ID column
co$id <- 1:nrow(co)

gather(co, "eye", "thickness", affected, notaffected)
#>    id         eye thickness
#> 1   1    affected       488
#> 2   2    affected       478
#>  ...<12 more rows>...
#> 15  7 notaffected       464
#> 16  8 notaffected       476

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with as.character(), or a factor with factor().

7.6.4 See Also

See Recipe 7.7 to do conversions in the other direction, from long to wide.

See the stack() function for another way of converting from wide to long.