8 LURN... To Examine Data Numerically

In this chapter we see how to obtain and present simple numerical summaries suitable for describing data. The same data set used in Chapter 7 on creating graphs is used again.

8.1 Obtaining basic numerical summaries of data

The summary() command is a very useful tool. It behaves differently for R objects depending on the nature of the object it is working on. The air quality data we used for Chapter 7 for demonstrating basic graphing techniques is a data.frame object as evidenced by asking R what class the object is using the class() command

> summary(airquality)

     Ozone          Solar.R         Wind
Min.   :  1.0   Min.   :  7   Min.   : 1.70
1st Qu.: 18.0   1st Qu.:116   1st Qu.: 7.40
Median : 31.5   Median :205   Median : 9.70
Mean   : 42.1   Mean   :186   Mean   : 9.96
3rd Qu.: 63.2   3rd Qu.:259   3rd Qu.:11.50
Max.   :168.0   Max.   :334   Max.   :20.70
NA's   :37      NA's   :7
      Temp          Month           Day
Min.   :56.0   Min.   :5.00   Min.   : 1.0
1st Qu.:72.0   1st Qu.:6.00   1st Qu.: 8.0
Median :79.0   Median :7.00   Median :16.0
Mean   :77.9   Mean   :6.99   Mean   :15.8
3rd Qu.:85.0   3rd Qu.:8.00   3rd Qu.:23.0
Max.   :97.0   Max.   :9.00   Max.   :31.0

The summary() command finds the common numeric summary statistics for the continuous-valued variables and does a particulary useless job on the variables for month and day. Well it’s not really R’s fault. The month didn’t need to be stored numerically, but in any case we got what we asked for! It’s not R’s fault we asked to be told what the minimum day number was!

Notice that the summary() command extracted the minimum, maximum, median and mean for the variables in the data.frame. These are all available using corresponding min(), max(), median(), and mean() commands separately but they work in different ways on data frames. For example we might have thought to use the mean() command on a data.frame, but it is better practice to use the more specific colMeans() command instead.

> colMeans(airquality[,1:4])

Ozone Solar.R Wind Temp
NA NA 9.958 77.882

> colMeans(airquality[,1:4], na.rm=TRUE)

Ozone Solar.R Wind Temp
42.129 185.932 9.958 77.882

> mean(airquality[,1], na.rm=TRUE)

Note that I’ve told R which columns of the air quality data.frame to work with using the appropriate subscripting code. I’ve also needed to tell R to ignore the missing data for the two variables Ozone and Solar.R using the na.rm=TRUE. The output for the summary() command did tell us how many values for each variable were missing. Recall that missing values are represented by a “NA".

The min() and max() commands all work on the data by assuming we want the statistic for the whole list of numbers over all columns. For example

> #median(airquality[,1:4], na.rm=TRUE)
> min(airquality[,1:4], na.rm=TRUE)

> max(airquality[,1:4], na.rm=TRUE)

It used to be possible to ask R to use the median(), in a command like this, but as of version 2.14.0, an error message was returned instead of an answer.

We will need to see how to get the relevant medians, minima, and maxima for the columns separately later in this chapter, but for now let’s see how the relevant summary measures can be obtained for a single column of the air quality data. The attach() command allows us direct access to the variables by name.

> attach(airquality)

> median(Solar.R, na.rm=TRUE)

Notice that a consequence of missing values in our data is that some functions will return the NA value. If we want the quantity returned to be estimated using the available data we simply add the argument na.rm=TRUE which removes the values denoted with the NA.

The first and third quartiles are not so easily extracted however. We can get the first and third quartiles from the five number summary by extracting elements of the output returned by the fivenum() command. For example

[1] 1.0 18.0 31.5 63.5 168.0

There are some minor differences between the methods used by the summary() and fivenum() commands for the first and third quartiles. See the relevant help page using ?fivenum for an explanation.

Knowing how these commands work on simple sets of numbers is very useful for the more elegant presentations in the following sections.

8.2 Obtaining slightly more elegant summaries

When preparing reports it is unlikely that we would want a single statistic all that often. Rather, we are usually interested in exposing patterns or differences within the data. We do this graphically but we may need to extract the exact quantities plotted in the graph as well.

The tapply() and aggregate() commands are very useful for creating tables of results, sometimes referred to as pivot tables. The tapply() command has three arguments; the main variable of interest, the grouping variable(s), and the function we want performed on each group within the data. In contrast, the aggregate() command has a slightly different way of relating the response variable to one or more grouping variables. It is easier to show how these commands work in action so we’ll use the air quality data again and present various alternatives and the slightly different resulting output.

Let’s say we want to know the mean() wind speed over the five months data were collected. We could use

> tapply(Wind, Month, mean)

5 6 7 8 9
11.623 10.267 8.942 8.794 10.180

> aggregate(Wind~Month, data=airquality, mean)

  Month   Wind
1     5 11.623
2     6 10.267
3     7  8.942
4     8  8.794
5     9 10.180

These commands do the job very nicely; choose the one that delivers the output in the way you want it. We could ask for other quantities such as the minimum, maximum, or standard deviation within each month by referring to the relevant R command in the third argument. Note that the aggregate() command allows specification of the data.frame in which to find the variables in the formula; this approach is commonplace when we fit models for our data and avoids the need to make use of the attach() and detach() commands.

If we had multiple response variables of interest, we would need to repeat the command for each one. If we had more than one way of asking for the grouping though, the tapply() command is very useful indeed as it presents a two-way pivot table. The air quality data doesn’t have a second factor we can use for this illustration, so I’ve created a second variable which asks if the day of the month is in the second half of the month. In this case, the tapply() and aggregate() commands generate quite different outcomes.

> Day15 = Day > 15
> tapply(Wind, list(Month, Day15), mean)

   FALSE   TRUE
5 11.313 11.912
6 10.787  9.747
7  9.360  8.550
8  8.907  8.688
9  9.547 10.813

> aggregate(Wind ~ Month + Day15, data=airquality, mean)

   Month Day15   Wind
1      5 FALSE 11.313
2      6 FALSE 10.787
3      7 FALSE  9.360
4      8 FALSE  8.907
5      9 FALSE  9.547
6      5  TRUE 11.912
7      6  TRUE  9.747
8      7  TRUE  8.550
9      8  TRUE  8.688
10     9  TRUE 10.813

If we wanted more than one statistic of interest, we would need to repeat the command and change the function being referenced in the third argument. We can use the summary() command here though as it gives us more quantities.

> tapply(Wind, Month, summary)

$‘5‘
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    5.7     8.9    11.5    11.6    14.1    20.1

$‘6‘
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.7     8.0     9.7    10.3    11.5    20.7

$‘7‘
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.10    6.90    8.60    8.94   10.90   14.90

$‘8‘
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.30    6.60    8.60    8.79   11.20   15.50

$‘9‘
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.80    7.55   10.30   10.18   12.32   16.60

This printout isn’t particularly attractive, but it is quick and informative. If presentation was more important than getting results efficiently, we would look to improve the appearance — this means more difficult coding though.

There is one more feature of the tapply() command that must be pointed out. The simple commands shown thus far use the default settings of the relevant functions. In particular, the mean() command does not ignore missing values by default.

> tapply(Ozone, Month, mean)

5 6 7 8 9
NA NA NA NA NA

We now know that there are missing Ozone observations in each month. We can add a fourth argument to the command which will be passed on to the function mentioned in the third argument.

> tapply(Ozone, Month, mean, na.rm=TRUE)

5 6 7 8 9
23.62 29.44 59.12 59.96 31.45

8.3 Getting things printed how we want them

We will often need to watch the way R chooses to print out results. When a direct reference to data is sought, the order of the data is maintained. In our example using the tapply() command above, the months were numeric values. The order was probably the most logical one as it was in chronological order.

If we had a long list of results and it was the extremes we wanted to focus attention towards, the sort() and rev() commands would be most useful. For example, say we want to order the monthly average wind speeds found above in ascending order, then we would use

> sort(tapply(Wind, Month, mean))

8 7 9 6 5
8.794 8.942 10.180 10.267 11.623

So we now know that August has the smallest average wind speed and that May is the windiest month.

> rev(tapply(Wind, Month, mean))

9 8 7 6 5
10.180 8.794 8.942 10.267 11.623

and we can combine both commands to get the months ordered from windiest to least windy using:

> rev(sort(tapply(Wind, Month, mean)))

5 6 9 7 8
11.623 10.267 10.180 8.942 8.794

8.4 Correlation structure within a data set

The summaries given thus far in this chapter all refer to variables as single objects. Often we have multivariate data and want to know more about it. Looking for inter-relationships among variables can be achieved using correlation coefficients; this gives us a numerical equivalent to the scatter plot matrices shown in Section 7.6.

In fact, the commands used here resemble those used to generate the scatter plot matrix. I like to do this to save effort. My ability to copy and paste is better than my typing!

> cor(data.frame(Ozone, Solar.R, Wind, Temp))

        Ozone Solar.R   Wind   Temp
Ozone       1      NA     NA     NA
Solar.R    NA       1     NA     NA
Wind       NA      NA  1.000 -0.458
Temp       NA      NA -0.458  1.000

Having said this, it is now obvious that the missing values in some variables is causing grief. An additional argument is required.

> cor(data.frame(Ozone, Solar.R, Wind, Temp),
use = "pairwise.complete.obs")

          Ozone  Solar.R     Wind    Temp
Ozone    1.0000  0.34834 -0.60155  0.6984
Solar.R  0.3483  1.00000 -0.05679  0.2758
Wind    -0.6015 -0.05679  1.00000 -0.4580
Temp     0.6984  0.27584 -0.45799  1.0000

Often, R will allow arguments to be abbreviated. Unfortunately, this isn’t the case for this additional argument. It is the case for choosing the preferred correlation measure to be found though. The default is to calculate Pearson’s correlation coefficient. My personal preference is to use both this measure and Spearman’s rank correlation coefficient as it does not require the relationship to be linear, nor does it require the samples to be normally distributed. As well as finding Spearman’s measure, I’ve decided that printing correlations to five decimal places is unnecessary so have restricted the output using the round() command.

> round(cor(data.frame(Ozone, Solar.R, Wind, Temp),
use = "pairwise.complete.obs", method = "s"), 3)

         Ozone Solar.R   Wind   Temp
Ozone    1.000   0.348 -0.590  0.774
Solar.R  0.348   1.000 -0.001  0.207
Wind    -0.590  -0.001  1.000 -0.447
Temp     0.774   0.207 -0.447  1.000

8.5 Use of dplyr for data summarisation

In Section 5.4, we saw how the dplyr package offers an alternative to base R functionality for manipulating data. We now see that this package offers commands for summarising data. In particular, the commands: group_by(), summarise(), and count().

You’ll need to install the package (see Section 16.1 for instructions) before running the function

to gain access to the commands just listed. We will also make use of the dplyr data structure called a tbl_df instead of the common data.frame using:

> airquality2=tbl_df(airquality)

The summarise() and group_by() commands can be used instead of the tapply() command introduced earlier in this chapter. To get the overall average wind speed from the airquality data, we could use:

> airquality %>% summarise(mean(Wind, na.rm = TRUE))

mean(Wind, na.rm = TRUE)
1 9.958

which is more long-winded than just using the mean() command alone. The value of summarise() becomes more evident when we seek the means for groups within our data, such as:

> airquality%>% group_by(Month) %>% summarise(mean(Wind, na.rm = TRUE))

# A tibble: 5 x 2
  Month ‘mean(Wind, na.rm = TRUE)‘
  <int>                      <dbl>
1     5                      11.6
2     6                      10.3
3     7                       8.94
4     8                       8.79
5     9                      10.2

We could use other functions seen in this chapter for alternative summary statistics within these functions. A common task is to check on how much data we have for each group of interest. We could use summarise() and length() to count up the data, but the dplyr package gives us the count() function to make this even more efficient.

> airquality%>% count(Month)

# A tibble: 5 x 2
  Month     n
  <int> <int>
1     5    31
2     6    30
3     7    31
4     8    31
5     9    30

As an illustration of the real power of the pipe operator, say we wanted to show the hottest months by their average temperature within the air quality data. One way to do this using commands from the dplyr package is to:

> Grouped = group_by(airquality, Month)
> Summarised = summarise(Grouped, AveTemp=mean(Temp, na.rm = TRUE))
> arrange(Summarised, desc(AveTemp))

# A tibble: 5 x 2
  Month AveTemp
  <int>   <dbl>
1     8    84.0
2     7    83.9
3     6    79.1
4     9    76.9
5     5    65.5

Which has stored data at each step. Note in particular that a new variable was created by name in the summarise() command; this helps in the subsequent arrange() step. We might want to store the results at each step, but if we didn’t want to, then we could use piping like this:

> airquality2 %>%
   group_by(Month) %>%
   summarise(AveTemp = mean(Temp, na.rm = TRUE))  %>%
   arrange(desc(AveTemp))

# A tibble: 5 x 2
  Month AveTemp
  <int>   <dbl>
1     8    84.0
2     7    83.9
3     6    79.1
4     9    76.9
5     5    65.5

Note that commands that might be nested on one line in the standard way of combining commands can be split out onto multiple lines to make it even easier to see what has been managed by each command. This leaves room for comments to be added at the end of lines to help anyone else read the code being written.

8.6 Closing

If you are carrying on working with R you might wish to remove direct access to the data sets we used in this chapter by issuing the following commands