Chapter 5
LURN… To Manipulate your Datasets

This chapter explains how to rearrange your data to meet your needs. Basic sorting, data extraction and re-combination are discussed.

To get the most of this chapter, you should have worked through Chapter 3 on entering data manually. You do not need to have completely read through Chapter 4 on importing data as the examples here do not rely on its content. Ultimately, you will want to use the material in Chapter 4 in conjunction with what follows here.

This chapter starts using the functionality that is built into R, but we also take a look at an add-on package called dplyr that is becomeing more and more commonly used.

5.1 Sorting the data into a different order

The sort() command is illustrated in this section. It works for sorting a single vector of numeric values into ascending or descending order.

> x=round(rnorm(10),2)
> sort(x)

 [1] -0.89 -0.67 -0.54 -0.40  0.06  0.07  0.10  0.11  0.16
[10]  0.23

> sort(x, decreasing=TRUE)

 [1]  0.23  0.16  0.11  0.10  0.07  0.06 -0.40 -0.54 -0.67
[10] -0.89

We use the order() command to work with data that are non-numeric vectors, as well as data in matrix or data.frame format. For example, we first sort the names of me and my three children initially given in descending order of age.

> Home=c("Jonathan", "Callum", "Annabelle", "Cordelia", "Hershey")
> order(Home)

[1] 3 2 4 5 1

> Home[order(Home)]

[1] "Annabelle" "Callum"    "Cordelia"  "Hershey"
[5] "Jonathan"

So we see that the order() command doesn’t actually do the re-ordering after all, but that we need to use the subscripts to extract the elements (the full set) from the pre-existing set.

We do need to take note of how the upper and lower case letters will be ordered.

> SomeLetters=c("A", "B", "a", "c", "b", "C")
> SomeLetters[order(SomeLetters)]

[1] "a" "A" "b" "B" "c" "C"

The order() command can prove a useful method of re-ordering experimental data if the run order has been stored as a variable. To illustrate this point we create a data.frame with two elements.

> x=round(rnorm(10),2)
> x

 [1] -0.35 -1.14  0.53 -0.46 -0.02 -1.13 -0.05  0.56  0.50
[10] -0.73

> RunOrder=sample(10)

Note that the sample() command used here has created a random permutation of the numbers one to ten.

> RunOrder

 [1]  9  4 10  3  2  6  5  8  1  7

> Data=data.frame(x,RunOrder)
> Data[order(RunOrder),]

       x RunOrder
9   0.50        1
5  -0.02        2
4  -0.46        3
2  -1.14        4
7  -0.05        5
6  -1.13        6
10 -0.73        7
8   0.56        8
1  -0.35        9
3   0.53       10

5.2 Extracting data using information on one variable

To illustrate the extraction of a set of observations from a data.frame, we make use of one of the internal datasets provided with the base installation of R. We retreiv the airquality dat using:

> data(airquality)

An explanation of this command is given in Section 4.5. This data set has its own help page that can be viewed by typing ?airquality at the R prompt.

We can ask for the rows of this data that match specified details using subscripting. Various operators exist for numerical comparisons; the simplest of which are < and >. For the purposes of brevity in our examples that follow, we use the nrow() command to show how many observations in the data meet the specified conditions. The following commands extract hte number of days where the wind speed was less than ten miles per hour and then the temperature was less than 60 degF.

> nrow(airquality[airquality$Wind<10,])

[1] 81

> nrow(airquality[airquality$Temp<60,])

[1] 8

If we want to know how many observations have an exact value, we use == not a single equals sign because it has a different meaning in R. The double equals sign is the one used for comparisons. For example, the number of days when the temperature was recorded as 69 degF can be found using:

> nrow(airquality[airquality$Temp==69,])

[1] 3

5.3 Extracting data using information on more than one variable

To print out the rows of the airquality data.frame where the temperature was less than 60 degrees and the wind was less than ten miles per hour, we would use the & operator between the two conditions that identify which rows are required.

> airquality[airquality$Wind<10&airquality$Temp<60,]

   Ozone Solar.R Wind Temp Month Day
21     1       8  9.7   59     5  21
27    NA      NA  8.0   57     5  27

5.4 Use of dplyr for data manipulation

The dplyr package is an alternative to the basic R functionality shown in this chapter. The package creator (Hadley Wickham) wants his package to make data manipulation easier for the end-user, and more fficient with respect to computation time. He has simplified data manipulations to a small set of commands: filter(), arrange(), sample_n(), sample_frac(), mutate(), and transmute() are discussed here, while the others will be discussed in Section 8.5 where they are more relevant.

You’ll need to install the package (see Section 16.1 for instructions) before running the function

> library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

to gain access to the commands just listed. Note that the head() command is used in many of the following examples to ensure that only a handful of rows are printed out instead of the full result. An alternative is to create a new data structure using the tbl_df() command. Let’s make a second copy of the air quality data that is the same as the original except for being made ready for use with the dplyr functions.

> airquality2 = tbl_df(airquality)

and then see the structure of the data using the glimpse() function that is a substitute for the str() command we’ve used previously.

> glimpse(airquality)

Observations: 153
Variables: 6
$ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, ...
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19...
$ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, ...
$ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69,...
$ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...

The glimpse() command works on a data.frame or the new structure called a tbl_df equally well.

The filter() and arrange() commands make many of the examples seen earlier in this chapter much easier. The filter() command is for extraction, while arrange() is for rearrangement. For example,

> filter(airquality, Wind<10&Temp<60)

  Ozone Solar.R Wind Temp Month Day
1     1       8  9.7   59     5  21
2    NA      NA  8.0   57     5  27

extracts the rows of the airquality data that meet the combinations of conditions specified. Use of filter() means we don’t need the data.frame attached, or to gain access to the variables in it using the dollar notation; we also don’t need the square brackets for common subscripting tasks. We could have stored the outcome in a new data.frame if we wished.

Perhaps the greatest gift of the dplyr package is its popularising a different way of combining commands, called the pipe operator. The last command issued can be re-written as:

> airquality %>% filter(Wind<10&Temp<60)

  Ozone Solar.R Wind Temp Month Day
1     1       8  9.7   59     5  21
2    NA      NA  8.0   57     5  27

This way of presenting commands is considered to be easier to read by many users which helps with the popularity of the dplyr package. The pipe operator comes from another package called magrittr so its use is not limited to the commands found in the dplyr package. This way of presenting multiple commands is called “chaining" in many resources. I like the simplicity that this offers so I have tried to use the pipe operator as often as I can from here on.

The filter() command is useful if we know the values a variable takes, but at times we just want to know more about the observations that are best or worst according to one (or more) variables. For example, if we wanted to find the hottest days, we can use the arrange() command instead of the order() command seen earlier.

> airquality2 %>% arrange(desc(Temp))

# A tibble: 153 x 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    76     203  9.70    97     8    28
 2    84     237  6.30    96     8    30
 3   118     225  2.30    94     8    29
 4    85     188  6.30    94     8    31
 5    NA     259 10.9     93     6    11
 6    73     183  2.80    93     9     3
 7    91     189  4.60    93     9     4
 8    NA     250  9.20    92     6    12
 9    97     267  6.30    92     7     8
10    97     272  5.70    92     7     9
# ... with 143 more rows

We can add more variables to sort the data by if we like. We used the desc to get the temperatures to be in descending order in this example.

The temperatures just listed are measured on the Fahrenheit scale. I live in a country that uses the centigrade scale so a conversion is needed to make these results more interpretable. Enter the mutate() and transmute() commands.

> airquality %>% mutate(TempC = (Temp - 32) * 5 / 9) %>% head()

  Ozone Solar.R Wind Temp Month Day TempC
1    41     190  7.4   67     5   1 19.44
2    36     118  8.0   72     5   2 22.22
3    12     149 12.6   74     5   3 23.33
4    18     313 11.5   62     5   4 16.67
5    NA      NA 14.3   56     5   5 13.33
6    28      NA 14.9   66     5   6 18.89

will create a new column in the returned data. N.B. the new data was not stored in a new object so this command’s effect was purely temporary.

Again, notice that we didn’t need to do all that much to add a new column and that this code is perhaps a little simpler than the alternative:

> airquality$TempC = (airquality$Temp - 32) * 5 / 9

which we saw earlier. The main difference between the two commands is that this one will add the new colun to the original data.frame. In this instance the difference is probably fairly trivial, but when we use much larger data sets, the storage of results is important because of speed and memory usage concerns.

The command

> airquality2 %>% transmute(TempC = (Temp - 32) * 5 / 9)

# A tibble: 153 x 1
   TempC
   <dbl>
 1  19.4
 2  22.2
 3  23.3
 4  16.7
 5  13.3
 6  18.9
 7  18.3
 8  15.0
 9  16.1
10  20.6
# ... with 143 more rows

creates the new variable as a stand alone object. It hasn’t been stored as an object so is only temporarily available.

Finally, the dplyr package also gives users two simple ways to extract a random sample from a data.frame. The commands sample_n() and sample_frac() extract a sample of a predetermined size and a specified fraction respectively. For example:

> airquality %>% sample_n(4)

    Ozone Solar.R Wind Temp Month Day
145    23      14  9.2   71     9  22
122    84     237  6.3   96     8  30
38     29     127  9.7   82     6   7
18      6      78 18.4   57     5  18

and

> airquality %>% sample_frac(0.05)

    Ozone Solar.R Wind Temp Month Day
41     39     323 11.5   87     6  10
39     NA     273  6.9   87     6   8
108    22      71 10.3   77     8  16
111    31     244 10.9   78     8  19
110    23     115  7.4   76     8  18
101   110     207  8.0   90     8   9
31     37     279  7.4   76     5  31
68     77     276  5.1   88     7   7

The dplyr package is under active development and so we should expect more functionality to become available, but there is also work being done to make the processing of data even faster so that it can handle huge data sets.