This chapter covers the methods required to pull data from external sources into R. If you want to create data within R, you should be reading Chapter 3.
While I prefer to use comma delimited files exported from Excel, R handles many common formats, such as plain text files with space or tab delimiters. You need to know what format a file is in, preferably by opening it and checking that it looks as you expect. It is all too easy to rename a file with an extension that has meaning for your operating system but little relevance to R. A case in point is a file with the txt extension, which is commonly assumed to be a plain text file but tells us nothing about how its columns are separated. We need to know whether the first line of the file is actual data or the headings for the columns of data. We also need to know what symbol is used to separate the columns; spaces, commas, and tabs are the most common options. Each of these options has a distinct R command associated with it, but all of these commands link back to the same read.table() command.
The various commands are as follows
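A minimal sketch of the three most common readers, one per delimiter, is given here. The file names and their contents are made up for illustration and are written to a temporary working directory so that the code runs anywhere. Note that read.csv() and read.delim() expect a header row by default, whereas read.table() does not, which is why header = TRUE is stated for it.

```r
# Set-up only: made-up sample files, one per delimiter, in a temporary folder
old_wd <- setwd(tempdir())
writeLines(c("Name Weight", "Henrietta 2.1", "Clucky 1.8"), "chickens_space.txt")
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"), "chickens_comma.csv")
writeLines(c("Name\tWeight", "Henrietta\t2.1", "Clucky\t1.8"), "chickens_tab.txt")

# Space (or other white space) delimited: read.table()
BySpace <- read.table("chickens_space.txt", header = TRUE)
# Comma delimited: read.csv()
ByComma <- read.csv("chickens_comma.csv")
# Tab delimited: read.delim()
ByTab <- read.delim("chickens_tab.txt")

# Compare the three imported data.frames
all.equal(BySpace, ByComma)
all.equal(ByComma, ByTab)

setwd(old_wd)  # return to the original working directory
```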
Note that the txt extension appears as a possible extension for all delimiter types. R will not assume any extension for these commands, so you need to state the full filename, including its extension, when using them. Your operating system will associate a default program with some extensions. For example, txt files are opened in Notepad under Windows and, if Microsoft Office is installed on your machine, csv files are opened in Microsoft Excel.
To import a comma delimited file called chickens.csv, you would issue the following command
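A minimal sketch of that command is given here. The set-up lines only create a small made-up chickens.csv in a temporary working directory so that the example runs anywhere; the column names and values are assumptions, not real data.

```r
# Set-up only: a made-up chickens.csv in a temporary working directory
old_wd <- setwd(tempdir())
writeLines(c("Name,Weight,Feed",
             "Henrietta,2.1,corn",
             "Clucky,1.8,wheat",
             "Pecky,2.4,corn"), "chickens.csv")

# The import itself
Chickens <- read.csv("chickens.csv")
Chickens

setwd(old_wd)  # return to the original working directory
```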
In this most basic form, the read.csv() command will look for the chickens.csv file in the current working directory, import it into R, and store it as a data.frame called Chickens. By default, read.csv() assumes that the file has a header row and that no column of the file holds the row names. If your data file already had a column for the names of the chickens as the first column, you would issue the command
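One way to write this, sketched with the row.names argument of read.csv() and a made-up file whose first column holds the chicken names:

```r
# Set-up only: a made-up chickens.csv in a temporary working directory
old_wd <- setwd(tempdir())
writeLines(c("Name,Weight,Feed",
             "Henrietta,2.1,corn",
             "Clucky,1.8,wheat",
             "Pecky,2.4,corn"), "chickens.csv")

# Use the first column of the file as the row names
Chickens <- read.csv("chickens.csv", row.names = 1)
rownames(Chickens)

setwd(old_wd)  # return to the original working directory
```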
and if the data did not currently have any column headings you would issue the command
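A sketch using the header argument, again with a made-up file, this time with no headings in the first line. R then supplies the column names V1, V2, and so on.

```r
# Set-up only: a made-up chickens.csv without a header row
old_wd <- setwd(tempdir())
writeLines(c("Henrietta,2.1,corn",
             "Clucky,1.8,wheat",
             "Pecky,2.4,corn"), "chickens.csv")

# Tell R that the first line is data, not column headings
Chickens <- read.csv("chickens.csv", header = FALSE)
names(Chickens)

setwd(old_wd)  # return to the original working directory
```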
There are other settings to consider which you can investigate using the help for the read.csv() command by typing ?read.csv at the command prompt. This help page is actually a combined help page for the family of commands described in this section.
The examples above assumed you could put the data file into the correct working directory. So where is that? To find out where R thinks you are currently working, use the getwd() command. Note that the output may look a little strange to some users, especially Windows users. What do I mean? Look at the following:
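The call itself takes no arguments. The path it prints will of course depend on your own machine, so no output is shown with this sketch.

```r
# Ask R for the current working directory and print it
here <- getwd()
here
```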
The full path to the working directory where this chapter was processed has been displayed, starting with the letter associated with the hard drive, followed by a colon. Then the fun begins. The folder structure is represented using forward slashes, not the backslashes used by Windows, even though this work was processed on a Windows machine. The rest should be as expected, and you could find the right folder by looking in the appropriate place on your hard drive. The reason for R’s use of the forward slash is not entirely simple to explain, but in short it is because the backslash has a special use in R: it marks the start of an escape sequence inside a character string. For the moment, the choice of slash versus backslash is not important. It becomes important when we need to type out the path to the location of a file for ourselves.
I do try to keep each separate project in its own distinct directory, and moving the raw data file to that directory makes sense. It does not make sense to have multiple copies of a dataset though, so we need to know how to pull a file into R from a directory other than the working directory.
When I displayed the current working directory in the previous section using the getwd() command, we saw the way that R used forward slash symbols to denote the hierarchy of folders on our hard drive.
If we know the complete specification of the location of a file, all the way from the name of the hard drive down the directory tree to the actual location, the way we specify the location needs to match the way R has printed a path. That is, we use the forward slash, not a single backslash. If we do want to use backslashes, each one must be typed as a double backslash. It is better to use a single forward slash, however, because it works on all operating systems and means our code can be shared with users of any of them. You might not plan to do this right now, but let’s use good habits from the start.
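A minimal sketch of importing via a full path follows. So that it runs on any machine, the made-up file is placed in R's temporary folder and its full path is built with forward slashes; the Windows paths in the closing comments are hypothetical.

```r
# Set-up only: a made-up chickens.csv stored in R's temporary folder
data_dir <- normalizePath(tempdir(), winslash = "/")
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"),
           file.path(data_dir, "chickens.csv"))

# The full path, written with forward slashes
full_path <- paste0(data_dir, "/chickens.csv")
full_path
Chickens <- read.csv(full_path)

# On Windows the same idea might look like one of these (hypothetical path):
#   read.csv("C:/MyProject/MyData/chickens.csv")      # preferred, portable
#   read.csv("C:\\MyProject\\MyData\\chickens.csv")   # doubled backslashes
```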
It is common for data to be stored in a folder that is close to the one we are working in. We might have a folder called MyData which is within the working directory (a subfolder), or it might be a folder at the same level in the directory tree as the current working directory. In either case, we don’t need to specify the location using the full path. The term used to describe the location is a relative path because the reference to the location is relative to the current location having focus (our working directory).
If our data set is stored in the file chickens.csv in a subfolder called MyData, we can use
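A sketch of this call follows. The set-up lines create the MyData subfolder and a made-up chickens.csv inside a temporary working directory.

```r
# Set-up only: a MyData subfolder with a made-up file, in a temporary folder
old_wd <- setwd(tempdir())
dir.create("MyData", showWarnings = FALSE)
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"),
           "MyData/chickens.csv")

# Relative path: subfolder name, forward slash, file name
Chickens <- read.csv("MyData/chickens.csv")

setwd(old_wd)  # return to the original working directory
```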
to pull it into our current workspace. This is actually shorthand for the more complete form
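The same import sketched with the longer form, using the same made-up file and temporary working directory as set-up:

```r
# Set-up only: a MyData subfolder with a made-up file, in a temporary folder
old_wd <- setwd(tempdir())
dir.create("MyData", showWarnings = FALSE)
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"),
           "MyData/chickens.csv")

# Relative path that names the current folder explicitly
Chickens <- read.csv("./MyData/chickens.csv")

setwd(old_wd)  # return to the original working directory
```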
where the current folder is denoted using the single period followed by the first slash. Personally, I would prefer to see the more complete form of this relative path being used but it is personal preference only.
If the MyData folder was not a subfolder, but was on the same level in the directory tree as our current working directory, we would use the shorthand symbol for the parent directory (one level up the directory tree). This is done using a double period.
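A sketch of the parent directory form follows. The folder name MyProject is a hypothetical stand-in for the working directory; it sits next to MyData inside R's temporary folder, and the data are made up.

```r
# Set-up only: sibling folders MyData and MyProject in a temporary location
base <- tempdir()
dir.create(file.path(base, "MyData"), showWarnings = FALSE)
dir.create(file.path(base, "MyProject"), showWarnings = FALSE)
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"),
           file.path(base, "MyData", "chickens.csv"))

# Work inside MyProject, then go up one level and down into MyData
old_wd <- setwd(file.path(base, "MyProject"))
Chickens <- read.csv("../MyData/chickens.csv")

setwd(old_wd)  # return to the original working directory
```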
Relative paths are therefore quite useful because they avoid having to type out long paths. The full specification of a path might also become a problem when files for a project are moved from one storage device to another. For example, you might want to take your work to a friend, tutor, or colleague on a memory stick or other portable storage medium. Relative path referencing makes your work very transferable and transportable.
Many statistical software packages use their own file types for storing data. R is no different actually! The chief problem we have is to find a way of transferring data from one application to another. Like most other statistics programs, R doesn’t handle all other file types. Some files can be imported into R using the commands in the foreign package, but it is probably best just to avoid the problems from the start.
In many instances it will prove easiest to use copy and paste functionality within your operating system to take the data from whatever original source it was given to you in and put it into a suitable spreadsheet program. Then save it using the comma separated values format and read the data into R using the commands given in the previous section.
I recommend trying to obtain the data in an easily imported file type rather than attempting to use the functions in the foreign package.
Sometimes a data set is made available via the internet. If you can obtain the full URL for the downloadable data file then it can be entered into the read.csv() or read.table() commands. This exercise is seldom necessary except for data files that you know will be updated for distribution through the web. Some government agencies and financial database services do this.
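A minimal sketch of reading from a URL follows. To keep it runnable without an internet connection, it addresses a made-up local file through a file:// URL; the web address in the final comment is hypothetical and only shows that the call has the same shape.

```r
# Set-up only: a made-up file, addressed through a file:// URL
local_file <- file.path(tempdir(), "chickens.csv")
writeLines(c("Name,Weight", "Henrietta,2.1", "Clucky,1.8"), local_file)
local_path <- normalizePath(local_file, winslash = "/")
# Paths such as C:/... need an extra slash after file://
prefix <- if (startsWith(local_path, "/")) "file://" else "file:///"
data_url <- paste0(prefix, local_path)

# The URL is passed in place of a file name
Chickens <- read.csv(data_url)

# With a real web address the call looks the same (hypothetical URL):
#   Chickens <- read.csv("https://www.example.com/data/chickens.csv")
```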
The base installation of R includes a package called datasets. These data sets are useful for testing code and writing examples for insertion into documents like this one. Data sets contained in the datasets package are actually ready and waiting to be accessed, but often we want to bring the data into our current workspace using a command such as
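For instance, a sketch using airquality, one of the data sets in the datasets package (the choice of data set here is purely illustrative):

```r
# Copy the airquality data set into the current workspace
data(airquality)

# Confirm what has been loaded
str(airquality)
```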
The data() command looks in the datasets package by default. If we wanted to get some data from another package we would need to state the name of the package explicitly. For example,
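a sketch using the cats data set from the MASS package; the choice of data set is an illustrative assumption.

```r
# Name the package explicitly when the data set is not in datasets
data(cats, package = "MASS")

# Look at the first few rows
head(cats)
```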
(The MASS package is already installed by default.)
Often we will be using a particular data set because it is good for demonstrating functions within a particular package. If we are loading the package using the library() command to get access to its functions, we will have also made its data available. This is why the data from the datasets package is ready for use; this package is loaded by default whenever R is started.
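A short sketch of this behaviour, assuming the MASS package and its cats data set as the illustration:

```r
# Loading the package makes its functions and its data sets available
library(MASS)

# No call to data() is needed before using the data set
head(cats)
```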
Rather ironically, the suggestion for including this section came from my mother. All too frequently, data are entered by people who will not actually need to use the data themselves. Sometimes data are entered by different people and then compiled into a single dataset, and the codes used by one source are not necessarily the codes used by the others. Take, for example, the many ways we might enter data for the gender of individuals answering a survey. You might easily realise that an “M” means male, but R needs consistency.
It is extremely important that we have knowledge of the format of the data we import. Use of the str() command is a good start, and use of the head() command might also be useful.
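A minimal sketch of such a first look follows, using a made-up survey with inconsistent gender codes. The data and variable names are assumptions for illustration, and the table() call is my own addition for counting the distinct codes.

```r
# Made-up survey data with inconsistent codes for gender
Survey <- data.frame(ID = 1:6,
                     Gender = c("M", "male", "F", "m", "Female", "M"),
                     Age = c(34, 27, 45, 52, 38, 29))

# Structure: one line per variable, with its type and first few values
str(Survey)

# The first three rows of the data
head(Survey, n = 3)

# Tabulating a variable exposes the different codes in use
table(Survey$Gender)
```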
Here are some pointers to look out for:
It is all too common to need to import data from a spreadsheet application where the data is not conveniently placed in the upper-left corner of the first sheet. There are a number of such spreadsheet applications, but the most commonly used one is Microsoft Excel.
Data in spreadsheets is seldom ready for importing in the exact form we want it. If you need to extract a set of data that is in a named sheet or always appears in the same numbered sheet of a Microsoft Excel file, then you might investigate the openxlsx package. See Chapter 14 and the help page for the read.xlsx() command.
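A hedged sketch of reading from a named sheet follows, assuming the openxlsx package is installed (the code does nothing otherwise). The workbook, the sheet name Weights, and the data are all made up, and the table deliberately starts on the third row of the sheet.

```r
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Set-up only: a made-up workbook whose data start on row 3 of a named sheet
  xlsx_file <- file.path(tempdir(), "chickens.xlsx")
  wb <- openxlsx::createWorkbook()
  openxlsx::addWorksheet(wb, "Weights")
  openxlsx::writeData(wb, sheet = "Weights",
                      x = data.frame(Name = c("Henrietta", "Clucky", "Pecky"),
                                     Weight = c(2.1, 1.8, 2.4)),
                      startRow = 3)
  openxlsx::saveWorkbook(wb, xlsx_file, overwrite = TRUE)

  # Read the named sheet, starting at the row that holds the headings
  Chickens <- openxlsx::read.xlsx(xlsx_file, sheet = "Weights", startRow = 3)
  str(Chickens)
}
```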