Chapter 3
LURN… To Enter Data

The purpose of this chapter is to show the novice R user how R stores data by introducing the shortcuts that make data entry a fairly simple task.

Entering screeds of data is not fun in any software tool. It’s more common for the R user to have the data they need already available from another source. This is covered in Chapter 4 on importing data from alternate sources.

Of course, if you really must enter data manually then you’d better read on; we can at least try to make it as painless as possible.

3.1 Using R as a simple calculator

R can be used to do basic operations whose results do not get stored as objects. We can also assign the answers to a variable. For example

> x=100/7
> x

[1] 14.29

This means we can use the variable by name later. For example

> 12*x

[1] 171.4

We can also use some basic mathematical functions such as the logarithm and square root via the log() and sqrt() commands — many other commands like this exist! For example

> x=sqrt(169)
> y=log(500)
> x*y

[1] 80.79

OK, these manipulations are trivial, but they can be used in conjunction with other data objects as we will see later. More detail on how to use R as a scientific calculator can be found in Chapter 19.
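
One detail of the functions used above is worth checking for yourself: log() returns the natural logarithm. The short sketch below compares it with log10() and with the base argument; the variable names are mine and are used only for illustration.

```r
# log() is the natural logarithm (base e)
NatLog = log(500)         # about 6.215

# log10() gives the base-10 logarithm
Log10 = log10(1000)       # 3

# any other base can be requested through the base argument
Log2 = log(8, base = 2)   # 3

# exp() undoes log(), up to rounding error
Back = exp(NatLog)        # 500
```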

3.2 A simple set of numbers

Operating on single values is rare. We are usually faced with numbers that we wish to use as a set. Entering them as a set is therefore necessary. The most basic way of entering a set of numbers is using the c() command. For example

> y=c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
> y

 [1]   1   4   9  16  25  36  49  64  81 100

is the set of the squares of the first ten natural numbers. We can obtain the numbers from 1 to 10 faster by issuing

> x=1:10
> x

 [1]  1  2  3  4  5  6  7  8  9 10

and therefore can obtain the desired set of the squares for the first ten natural numbers using shorter code based on a simple sequence.

> y=(1:10)^2
> y

 [1]   1   4   9  16  25  36  49  64  81 100

This is much more efficient than typing out the actual results as done previously. Note that the colon symbol is used for generating series of integers and that, in the order of mathematical operators, it is evaluated after the exponent; without the brackets, 1:10^2 would be read as 1:100, so the brackets around the sequence are essential. In this case, squaring a number is achieved through use of the caret symbol followed by a 2.
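
The effect of the brackets is easy to verify. The sketch below is only an illustration, with object names of my own choosing; it also shows that the colon is evaluated before ordinary subtraction.

```r
# The exponent is applied before the colon, so this is the sequence 1 to 100
Wrong = 1:10^2
length(Wrong)     # 100

# Brackets force the sequence to be built first and then squared
Right = (1:10)^2
length(Right)     # 10

# The colon comes before subtraction, so this is (1:10) - 1, the numbers 0 to 9
Shifted = 1:10 - 1
Shifted
```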

3.3 A simple set of text values

The c() command is good for entering any kind of data. We may need to enter a set of categories for example.

> Names = c("Jonathan", "Elizabeth", "Peter", "Jenna", "Callum", "Annabelle", "Cordelia")
> Names

[1] "Jonathan"  "Elizabeth" "Peter"     "Jenna"
[5] "Callum"    "Annabelle" "Cordelia"

Note two important features of this command. First, I capitalized the variable name on purpose. The vast majority of R commands are in lower case and there is one called “names”. I don’t want to confuse my data with an existing R command, so my preference is to use an upper case letter at the front of all my variable names that mean anything. Second, the character-valued data I entered were enclosed in quote marks. This means that the actual names (mine, my last dog and some family members) were stored. If I had omitted the quote marks, R would have looked for variables with the appropriate names, which are not defined in this instance. You could remove a quote or two to see what happens if you must.
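
If you would rather not break your own session, the sketch below shows roughly what happens when the quote marks are left out. It assumes no object called Jenna exists in your workspace; tryCatch() is used only so that the error message is captured instead of stopping the script, and the object names Pet and Message are mine.

```r
# With quotes: the text itself is stored
Pet = c("Jenna")

# Without quotes R looks for an object called Jenna, which does not exist;
# tryCatch() captures the resulting error message as a character string
Message = tryCatch(c(Jenna), error = function(e) conditionMessage(e))
Message

# R is case sensitive, so the names() command is untouched by creating Names
Names = c("Jonathan", "Elizabeth")
is.function(names)
```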

3.4 Logical indicators

R often uses logical indicators to tell us something. These variables take the values “TRUE” or “FALSE”.

> Human = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

You might suspect that this variable tells you whether the names given in the Names variable are for humans. The fourth name is therefore the dog. To extract the names of the humans we use the command

> Names[Human]

[1] "Jonathan"  "Elizabeth" "Peter"     "Callum"
[5] "Annabelle" "Cordelia"

Note here that the brackets used are square brackets. Use round brackets for functions, square for elements of an object.

It’s pretty simple to extract the dog’s name:

> Names[!Human]

[1] "Jenna"

In this situation the exclamation mark should be read as “not” and therefore picks up the elements where the logical variable is set to FALSE.
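
Logical indicators can also be counted and located. The following is a small sketch using the same two variables, re-entered here so that it runs on its own.

```r
Names = c("Jonathan", "Elizabeth", "Peter", "Jenna", "Callum", "Annabelle", "Cordelia")
Human = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of humans
sum(Human)             # 6

# which() converts a logical indicator into positions
which(!Human)          # 4

# Subscripting by position gives the same answer as subscripting by indicator
Names[which(!Human)]   # "Jenna"
```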

3.5 A note on subscripting

We’ve seen that we can find a subset of the set of names using the indicator variable, but it’s frequently useful to be able to extract one or more elements by their location. For example

> Names[1]

[1] "Jonathan"

gives the first name, and

> Names[2:3]

[1] "Elizabeth" "Peter"

extracts the second and third names. The set of names entered is a vector and has only one subscript to monitor. We will see how to subscript elements within a matrix or data.frame later.

It’s also often necessary to understand how R has stored an object. The class() command is useful, and so is the str() command. For example

> class(Names)

[1] "character"

> str(Names)

 chr [1:7] "Jonathan" "Elizabeth" "Peter" "Jenna" ...

We can get the names of all the people that aren’t me by

> Names[-1]

[1] "Elizabeth" "Peter"     "Jenna"     "Callum"
[5] "Annabelle" "Cordelia"

which of course assumes you know my name was given first. The subscripts have used square brackets in these examples. The type (and number) of brackets is crucial. If you open a bracket it must be closed, and closed by a bracket of the same type. Nesting brackets is quite acceptable.
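
A few more subscripting patterns follow the same rules. This sketch re-enters the names so that it is self-contained; note how round brackets are nested inside the square ones.

```r
Names = c("Jonathan", "Elizabeth", "Peter", "Jenna", "Callum", "Annabelle", "Cordelia")

# Several positions at once: c() builds the set of subscripts
Names[c(1, 4)]         # "Jonathan" "Jenna"

# A minus sign in front drops those positions instead
Names[-c(1, 4)]        # the other five names

# length() gives the number of elements, so this picks the last one
Names[length(Names)]   # "Cordelia"
```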

3.6 A patterned set of numbers

In many instances we need to generate series of values in a patterned way. Let’s say we want to generate variables that represent the twenty working days over a four week period. We want a set of week numbers, and then a set of the weekday names. In both situations we will use the rep() command. We will use three of its arguments: the set of values, the number of times each value is to be repeated (each), and the number of times the whole series should be repeated (times). The each and times arguments have default values so may not need to be stated explicitly.

> Week = rep(1:4, each=5)
> Week

 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

> Day = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times=4)
> Day

 [1] "Mon" "Tue" "Wed" "Thu" "Fri" "Mon" "Tue" "Wed" "Thu"
[10] "Fri" "Mon" "Tue" "Wed" "Thu" "Fri" "Mon" "Tue" "Wed"
[19] "Thu" "Fri"

These two variables can be brought together with the corresponding data using the data.frame() command illustrated later in this chapter.

3.7 Less pattern and more repetition

The rep() command is very flexible, and to be honest can either be a lot of fun to play with or just one big headache. Let’s say we want to generate the series of numbers which has one 1, two 2’s, three 3’s, four 4’s, and five 5’s. Instead of using a single constant for the number of times each element is repeated, we can choose the number of repeats for each element.

> rep(1:5, times=1:5)

 [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

Whatever you do, you should observe the way your series is coming out. For example, I would have expected the each argument, not times, to be used in the last example. Always check.
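
One way to check is to try the arguments side by side on a short series. The sketch below uses object names of my own; the comments show the series each call produces.

```r
# each repeats every element in turn
ByEach = rep(1:3, each = 2)               # 1 1 2 2 3 3

# a single value for times repeats the whole series
ByTimes = rep(1:3, times = 2)             # 1 2 3 1 2 3

# one value of times per element gives a different repeat count for each
Varying = rep(1:3, times = c(1, 2, 3))    # 1 2 2 3 3 3

# when both are given, each is applied first and then times
Both = rep(1:2, each = 2, times = 2)      # 1 1 2 2 1 1 2 2
```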

3.8 An incomplete pattern

Let’s say we want a set of numbers to be cycled, but know that we won’t use the full cycle. The rep() command has an argument that can stop the process early for us. Try using the length.out argument as follows

> rep(c(1,2,4), times=3, length.out=8)

[1] 1 2 4 1 2 4 1 2
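
When length.out is supplied it decides the final length, so the times argument in the call above is not strictly needed. The sketch below checks this; rep_len() is a shorthand in base R for the same idea, and the object names are mine.

```r
WithTimes = rep(c(1, 2, 4), times = 3, length.out = 8)
NoTimes = rep(c(1, 2, 4), length.out = 8)

# Both calls cycle through the values and stop after eight elements
identical(WithTimes, NoTimes)   # TRUE

# rep_len() takes the values and the desired length
Short = rep_len(c(1, 2, 4), 8)
Short
```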

3.9 Dates and times

The current time and date can be extracted using the date() command but this is not an object that can be manipulated — it is a character string only.

> date()

[1] "Thu Mar 29 15:17:17 2018"

> class(date())

[1] "character"

It might be useful to print this information as part of your documentation of an analysis. We often come back to an R session at a later time, and keeping track of when we did things might prove useful. You can see from the output given here the exact time and date this document was compiled. This format is not to be confused with what we would store or manipulate; it is just a printout of the current time and date. The Sys.Date() command returns the current date (without the time) as an object of class Date, which R stores internally as a number.

> Sys.Date()

[1] "2018-03-29"

> class(Sys.Date())

[1] "Date"

This printout is different from what we need when we want to store numerical values that represent the times and dates at which particular observations were taken. Base R does provide date and time classes that can be converted to numbers, but working with them in detail is beyond the scope of this chapter. It may prove simplest to store a date using its three constituent parts (day, month, and year) as separate numeric variables. Times should be stored using 24-hour format, taking care not to use a separator between the hour and minute values. Mathematical operations should not be done on these variables unless we convert the minutes to decimal fractions of an hour. In any situation you should decide what you will do with the data before choosing the format you wish to store it in.
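
As a brief illustration of this advice, the sketch below stores two made-up observation times in separate numeric parts, converts the 24-hour times to decimal hours, and then gives a glimpse of what can be achieved by joining the parts into dates. The values and object names are invented for the example.

```r
# Two invented observations, stored as separate numeric parts
Day = c(29, 5)
Month = c(3, 4)
Year = c(2018, 2018)
Time = c(1517, 930)   # 24-hour clock with no separator

# Integer division and remainder split the hours from the minutes
Hour = Time %/% 100
Minute = Time %% 100

# Minutes become a decimal fraction of an hour before doing any arithmetic
DecimalHour = Hour + Minute / 60   # about 15.28 and exactly 9.5

# ISOdate() joins the parts; as.Date() keeps only the date
When = as.Date(ISOdate(Year, Month, Day))
When[2] - When[1]                  # a difference of 7 days
```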

An example for storing details of months might be useful here.

> Months = as.factor(c(3,6,9,12,3,6,9,12))
> Months

[1] 3  6  9  12 3  6  9  12
Levels: 3 6 9 12

The as.factor() command tells R that these numbers are to be thought of as non-numeric data. A factor also has an associated attribute called levels. We can edit the levels directly and this will change our entire variable.

> levels(Months) = c("Mar", "Jun", "Sep", "Dec")
> Months

[1] Mar Jun Sep Dec Mar Jun Sep Dec
Levels: Mar Jun Sep Dec

As it happens, we didn’t need to actually explicitly state the month number when the variable was first created, but it is good practice to keep things logical!
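
The two steps above can also be done in one call. This is a sketch using the factor() command with its levels and labels arguments; MonthNumber is a name I have introduced for the raw values.

```r
MonthNumber = c(3, 6, 9, 12, 3, 6, 9, 12)

# levels says which values to expect, labels says what to call them
Months = factor(MonthNumber, levels = c(3, 6, 9, 12),
                labels = c("Mar", "Jun", "Sep", "Dec"))
levels(Months)       # "Mar" "Jun" "Sep" "Dec"

# Internally a factor holds level codes, not the original month numbers
as.integer(Months)   # 1 2 3 4 1 2 3 4

# table() counts how often each level occurs
table(Months)
```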

3.10 Larger data objects

There are two data object types that are quite similar but not the same. A matrix is a two-dimensional array of values of the same type — numeric, character, or logical. A data.frame looks like a matrix but can have variables of different types embedded within it. For example, we can create a new data.frame by combining the names and human status variables created earlier using

> data.frame(Names, Human)

      Names Human
1  Jonathan  TRUE
2 Elizabeth  TRUE
3     Peter  TRUE
4     Jenna FALSE
5    Callum  TRUE
6 Annabelle  TRUE
7  Cordelia  TRUE

We would usually assign the results of this command to a named object for storage.

> MyFirstDF = data.frame(Names, Human)
> str(MyFirstDF)

'data.frame': 7 obs. of  2 variables:
 $ Names: Factor w/ 7 levels "Annabelle","Callum",..: 6 4 7 5 2 1 3
 $ Human: logi  TRUE TRUE TRUE FALSE TRUE TRUE ...

Now we can see why we should not confuse names and Names. We can ask for the names of the variables in a data.frame using the names() command. For example

> names(MyFirstDF)

[1] "Names" "Human"

We can also now think about how we might extract the fourth name from the data.frame because this is the data structure we will work with the most. There are several alternatives.

> MyFirstDF[4,1]

[1] Jenna
7 Levels: Annabelle Callum Cordelia Elizabeth ... Peter

> MyFirstDF[4,"Names"]

[1] Jenna
7 Levels: Annabelle Callum Cordelia Elizabeth ... Peter

> MyFirstDF$Names[4]

[1] Jenna
7 Levels: Annabelle Callum Cordelia Elizabeth ... Peter

Notice that as well as the result we wanted, R has also printed the levels of the Names variable. This is because this variable has been determined to be a factor.
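
Whether text is converted to a factor is controlled by the stringsAsFactors argument of data.frame(). Its default changed from TRUE to FALSE in R 4.0.0, so a recent version of R will show the names as character data unless told otherwise. The sketch below states the argument explicitly so that it behaves the same in any version; MyCharDF and MyFactorDF are names I have made up for the comparison.

```r
Names = c("Jonathan", "Elizabeth", "Peter", "Jenna", "Callum", "Annabelle", "Cordelia")
Human = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Keep the names as plain text
MyCharDF = data.frame(Names, Human, stringsAsFactors = FALSE)
class(MyCharDF$Names)               # "character"
MyCharDF[4, "Names"]                # "Jenna", with no levels printed

# Convert the names to a factor, as in the output shown above
MyFactorDF = data.frame(Names, Human, stringsAsFactors = TRUE)
class(MyFactorDF$Names)             # "factor"

# as.character() recovers the text from a factor
as.character(MyFactorDF$Names[4])   # "Jenna"
```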

3.11 Appropriate data labelling

The construction of our first data.frame is slightly flawed. If the MyFirstDF data were going to describe all those beings I am in contact with, each row of details should be tied to the individual concerned. In this situation, allowing the Names object to be data rather than a label was probably not the wisest move. Let’s say that we want the year and month of birth for the individuals in the example, and that they should form a new data.frame, this time using the row.names argument to attach the names as row labels.

> Year = c(1971, 1945, 1925, 2003, 2010, 2012, 2013)
> Month = c("October", "October", "July", "October", "April", "June", "June")
> MySecondDF = data.frame(Year, Month, Human, row.names = Names)
> str(MySecondDF)

'data.frame': 7 obs. of  3 variables:
 $ Year : num  1971 1945 1925 2003 2010 ...
 $ Month: Factor w/ 4 levels "April","July",..: 4 4 2 4 1 3 3
 $ Human: logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
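
Once the names are row labels they can be used as subscripts. The sketch below rebuilds the data.frame so that it runs on its own and then extracts rows by label.

```r
Names = c("Jonathan", "Elizabeth", "Peter", "Jenna", "Callum", "Annabelle", "Cordelia")
Human = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
Year = c(1971, 1945, 1925, 2003, 2010, 2012, 2013)
Month = c("October", "October", "July", "October", "April", "June", "June")
MySecondDF = data.frame(Year, Month, Human, row.names = Names)

# A row label and a variable name pick out a single value
MySecondDF["Jenna", "Year"]               # 2003

# Leaving the second subscript empty returns whole rows
MySecondDF[c("Callum", "Annabelle"), ]

# Row labels can be combined with a logical indicator
rownames(MySecondDF)[!MySecondDF$Human]   # "Jenna"
```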

3.12 Other approaches

Data entry is tedious. Efficiency is therefore an important weapon in your armoury. When we plan experiments we are often interested in obtaining an observation (or multiple observations) for every combination of some factors. In this simple example, imagine there are three experimental factors, given the names h, w, and sex, and that each factor can take either of two levels. The expand.grid() command is a useful way to construct a data.frame.

> MyThirdDF = expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female"))
> MyThirdDF

   h   w    sex
1 60 100   Male
2 80 100   Male
3 60 300   Male
4 80 300   Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female

I’ve used a full printout of the resulting data.frame instead of using the str() command to show what R has created because the str() command gives additional information that we do not require at this point.
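
The number of rows produced by expand.grid() is the product of the numbers of levels, and the first factor changes fastest. The sketch below checks this; the third level for w in the second call is hypothetical and added only to show the row count changing.

```r
MyThirdDF = expand.grid(h = c(60, 80), w = c(100, 300), sex = c("Male", "Female"))
nrow(MyThirdDF)     # 2 * 2 * 2 = 8

# The first factor cycles fastest down the rows
MyThirdDF$h[1:4]    # 60 80 60 80

# A hypothetical design with three levels of w has 2 * 3 * 2 = 12 rows
Bigger = expand.grid(h = c(60, 80), w = c(100, 200, 300), sex = c("Male", "Female"))
nrow(Bigger)        # 12
```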

Let’s say we wish to add a new variable to this data.frame; a task common when designing an experiment. We can use the $ notation shown earlier in a new way. We add a set of eight random values extracted from a standard normal distribution here as an illustration using the rnorm() function.

> MyThirdDF$Response = rnorm(8)
> MyThirdDF

   h   w    sex Response
1 60 100   Male -0.53954
2 80 100   Male  0.09504
3 60 300   Male  0.16238
4 80 300   Male -0.40040
5 60 100 Female  0.07082
6 80 100 Female  0.06128
7 60 300 Female  0.22646
8 80 300 Female -0.66805

I like to create random data when planning an analysis. In this instance, the data are normally distributed which is not particularly important, but they are random which is important. There are many other functions that generate random data from other distributions; by convention, these functions all start with a letter “r” followed by a shortened form of the distribution’s name.
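
Random values differ every time they are generated, which is why your printout will not match mine. If you need the same values again, set.seed() fixes the starting point of the random number generator. The sketch below shows this along with a few of the other “r” functions in base R; the seed value and the object names are arbitrary choices of mine.

```r
# The same seed gives the same random values
set.seed(42)
First = rnorm(8)
set.seed(42)
Second = rnorm(8)
identical(First, Second)                 # TRUE

# Other distributions follow the same naming convention
Uniform = runif(8)                       # uniform values between 0 and 1
Counts = rpois(8, lambda = 3)            # Poisson counts with mean 3
Coin = rbinom(8, size = 1, prob = 0.5)   # eight 0/1 outcomes
```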

I do this because data that will be collected should be appropriate for the intended analysis to be conducted, and likewise the analysis should reflect the way data were collected. Creation of the random data means I can write the R commands that will generate the analysis, and once I have checked that the analysis is possible, I can then collect the data. When the data have been collected, I can then re-run the commands using the real data instead of the random data. This can save time during the analysis but more importantly, I can be confident that the data collection and analysis were planned well.
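
A minimal sketch of this workflow might look like the following. The linear model is purely my assumption about what the analysis could be, and Analyse, Design, TrialFit, and RealValues are hypothetical names; the point is only that the same command is run first on placeholder data and later on real data.

```r
# Hypothetical analysis: a linear model of Response on the three factors
Analyse = function(Data) {
  lm(Response ~ h + w + sex, data = Data)
}

# Step 1: build the design and fill it with random placeholder data
Design = expand.grid(h = c(60, 80), w = c(100, 300), sex = c("Male", "Female"))
Design$Response = rnorm(nrow(Design))

# Step 2: check that the analysis runs on the placeholder data
TrialFit = Analyse(Design)
coef(TrialFit)

# Step 3 (later): swap in the real observations and re-run the same command
# Design$Response = RealValues   # RealValues stands for the collected data
# FinalFit = Analyse(Design)
```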