Lecture 36 Appendix: Introduction to R and RStudio



In this lecture we will look at:

  • What is R?

  • Basic R syntax

  • Use of R via RStudio.

36.1 What is R?

  • R is a statistical software system.

  • R is a programming language that has many “inbuilt” statistical commands (e.g. to fit a linear regression).

  • R started life as a quasi-clone of the commercial package S-Plus, but the development of R and S-Plus is now slowly diverging.

  • Advantages over other statistics packages include flexibility, power, and quality of graphical display.

  • R is open-source software and part of the GNU project. It can be downloaded from http://cran.r-project.org/ and used for free.

  • There are versions of R for all common operating systems: Windows, Linux, and macOS.

36.2 A Little R History


Ross Ihaka (left) and Robert Gentleman.

  • R is a New Zealand invention!

  • R was originally developed by Ross Ihaka and Robert Gentleman in the Department of Statistics, University of Auckland.

  • The first (test) version of R was released publicly in 1995.

  • The R Development Core Team took over supervision of R in 1997.

  • This Team includes about 20 statisticians worldwide.

36.3 Starting R in the Computing Labs

  • Start R for the first time via the Start menu (or desktop icon if available).
  • Select R (version 4) from the appropriate program group.
  • RGui (R Graphical User Interface) should appear as displayed.
RGUI v3.3.0

N.B. Check the version number. Ideally we want to be sure that we are running the latest version, but do not change versions partway through the semester unless it is truly necessary.

36.4 Starting RStudio in the Computing Labs

  • Start RStudio for the first time via the Start menu (or desktop icon if available).
  • You should see the window split into sub-windows. Please note that RStudio is undergoing constant development. Your version may not look quite like the image displayed below.
RStudio using R 3.4.1

N.B. The contents of the RGui are all available in RStudio. We will work in RStudio because it offers many extra features.

36.5 Working directory

  • When you quit R you will get a pop-up asking “Save workspace image?”.

  • If you click yes, then the R workspace that you have created will be saved in your working directory.

  • You can find out where your working directory is by typing the command getwd() at the R prompt.

  • The working directory can be changed using the Change dir… command from the R File menu.

Note the command getwd() is in typewriter font. This font is used in lectures for R input and output (amongst other things). The parentheses are included to show you that this is a command; typing getwd alone will not give you what you want!
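
As a small sketch of the commands behind the menu items above: setwd() is the typed counterpart of Change dir…. The use of tempdir() here is only an assumption to make the example run anywhere; in practice you would give the path to your own course folder.

```r
# Show the current working directory (a character string holding a folder path).
old_wd <- getwd()
old_wd

# Move to another folder; tempdir() is used only because it always exists.
setwd(tempdir())
getwd()

# Return to where we started.
setwd(old_wd)
```

Storing the old directory first makes it easy to go back once you are done.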

36.6 Save your work using scripts!

  • Please get into the habit of writing all your R commands in an R script before you run them in the console.

  • An R script is a text file containing code, which can be run directly by highlighting it and then hitting CTRL-R (in RStudio, Ctrl+Enter does the same).

    • Comments following # symbols in script files are not executed.
  • To create a new script use the New script command from the R File menu or New File>R Script from the RStudio File menu.

  • To save your script, click on the scripts pane, and then go to File > Save As in the menu bar.

  • Scripts allow you to rerun your entire analysis without rewriting all the commands; they also make it easier to edit and proofread your code.
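
To make the points above concrete, here is what the contents of a very small script might look like. The filename and the height values are made up for illustration.

```r
# my_first_script.R (hypothetical filename)
# Lines starting with # are comments and are not executed.

heights <- c(1.62, 1.75, 1.80)   # made-up heights in metres
mean_height <- mean(heights)     # store the average
mean_height                      # display it

# A saved script can be rerun in full with source("my_first_script.R").
```

Highlighting the three code lines and running them gives the same result as typing each one at the prompt.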

36.7 Managing Your Code

Screen shot of R Script in R
Screen shot of R Script in RStudio

36.8 An even better way

Putting your R commands into the file that becomes your end-use document will make your workflow even more efficient. When we use RStudio, we can make use of the extensive features to create these documents.

Look out for the tutorial sheets introducing you to what we call R markdown documents. The lecture material you are viewing now was produced using R markdown.

More on this later…

36.9 A First Dip into R

36.9.1 Expressions and Assignments

Elementary commands are either expressions or assignments.

  • An expression simply displays the result of a calculation; the result is not retained in the computer’s memory.

  • An assignment passes the result of a calculation to a variable name (or ‘object’) which is stored; the result is not displayed.

36.9.2 Examples (for you to try)

3 + 4
[1] 7
  • The symbol > is the command prompt. This is where you would type 3 + 4. The answer will come back when you hit the <Enter> key.

  • Don’t worry too much at this stage about the [1].

x <- 3 + 4
x
[1] 7
  • The <- is called the “left assignment” operator; it assigns from right (3 + 4) to left (x).
  • Yes, it takes two keys to type it. There cannot be a space between the < and the -, and it is best practice to put a space on either side of <- in your work.
  • It is equivalent to x = 3 + 4, but many R users prefer <- because it always means something is created in your workspace (this is not always true for =).
  • A right assignment operator (->) also exists but is not commonly used.
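
The three assignment forms can be compared side by side. This is only a sketch; the object names total_a, total_b, total_c and m are invented for the example.

```r
# Three ways to store the value 7 under a name.
total_a <- 3 + 4     # left assignment (the form used in these notes)
total_b = 3 + 4      # equals sign; also assigns when typed at the prompt
3 + 4 -> total_c     # right assignment; legal but rarely used

# Inside a function call, = names an argument instead of creating an object:
# the x below belongs to mean() and is not stored in the workspace.
m <- mean(x = c(1, 2, 3))
m
```

The last line is one case where = does not create anything in your workspace, which is why many users keep = for arguments and <- for assignment.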

36.10 R Objects

  • All assigned variables (or any other R objects) are stored until overwritten or explicitly removed (deleted) by the command rm().

  • To list stored objects type ls() or objects().

x <- 8
y <- 3.1415
ls()
[1] "x" "y"
rm(x)
objects()
[1] "y"

36.11 R Syntax

  • R commands, e.g. ls(), rm(), are followed by parentheses which may contain additional information for the function.

  • Writing a command name without parentheses returns the R source code for the function. Try one…
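
A quick way to see the difference is to try the same function with and without parentheses. The choice of sd() here is arbitrary; some very basic functions (such as c) show only a short "Primitive" line rather than readable source code.

```r
# With parentheses: the function is called and its result is shown.
one <- sd(c(1, 2, 3))
one

# Without parentheses: R prints the function itself rather than running it.
sd

# getwd without () is likewise a function object, not a folder path.
is.function(getwd)
```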

36.12 Vectors in R

  • The command c() (for concatenate) creates vectors.
x <- c(2.3, 1.2, 2.4)
x
[1] 2.3 1.2 2.4
c(x, 9, x)
[1] 2.3 1.2 2.4 9.0 2.3 1.2 2.4

36.12.1 Regular Sequences

  • The expression 1:n denotes the sequence 1, 2,… n.

  • The expression seq(i,j,by=k) is a sequence from i to j in steps of k.

1:5
[1] 1 2 3 4 5
y <- seq(3, 10, by = 2)
y
[1] 3 5 7 9
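
A few further variations may be useful. The length.out argument and the rep() function are not covered in the notes above; they are standard base R, shown here only as a sketch.

```r
down   <- 5:1                         # counts down: 5 4 3 2 1
steps  <- seq(10, 1, by = -3)         # negative step: 10 7 4 1
evenly <- seq(0, 1, length.out = 5)   # five equally spaced values from 0 to 1
again  <- rep(c(1, 2), times = 3)     # repeats a vector: 1 2 1 2 1 2

down
steps
evenly
again
```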

36.13 Vector Arithmetic in R

  • R uses +, -, * and / for the basic arithmetic operations, and ^ for exponentiation (raising to a power).

    • Vector operations are done element by element, with recycling of short vectors if required.
x <- c(2, 3)
y <- c(1, 4, 5, 6)
2 * x
[1] 4 6
2 + x
[1] 4 5
y^2
[1]  1 16 25 36
x + y
[1] 3 7 7 9
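
Recycling deserves a second look, because R only complains when the longer length is not a multiple of the shorter one. The vectors below are made up for this sketch.

```r
short <- c(10, 20)
long  <- c(1, 2, 3, 4)
fits  <- short + long     # 10+1, 20+2, 10+3, 20+4

# Length 3 is not a multiple of 2: R still recycles, but gives a warning.
# suppressWarnings() is used only to keep this example quiet; remove it to see the warning.
odd    <- c(1, 2, 3)
result <- suppressWarnings(odd + short)   # 1+10, 2+20, 3+10

fits
result
```

A recycling warning usually means the two vectors were not meant to have different lengths, so it is worth checking the data when it appears.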

36.14 Types of vector

  • All the vectors we have seen so far have been numeric; some were integer which is a special type of number.

  • R also understands vectors of:

    • characters: letters, numerals, spaces, and other text.
    • logical values TRUE and FALSE; abbreviations T, F are often used.
    • factors (i.e. categorical variables); these may “look” like characters.
MyWords <- c("This", "is", "a", "character")
MyWords
[1] "This"      "is"        "a"         "character"
c(F, T, F, F)
[1] FALSE  TRUE FALSE FALSE
factor(c("Low", "Low", "Medium", "High", "High"))
[1] Low    Low    Medium High   High  
Levels: High Low Medium

N.B. All of these data types can be used in linear models, although character values are usually converted to factors.
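
The class() function reports which type a vector has, and the levels argument of factor() controls the order of the categories. The following is a minimal sketch; the object names are invented.

```r
size <- c("Low", "Low", "Medium", "High", "High")

class(size)               # "character"
class(c(1.5, 2))          # "numeric"
class(1:3)                # "integer"
class(c(TRUE, FALSE))     # "logical"

# By default the levels are sorted alphabetically: High, Low, Medium.
f_default <- factor(size)
levels(f_default)

# Supplying the levels explicitly gives the natural order instead.
f_custom <- factor(size, levels = c("Low", "Medium", "High"))
levels(f_custom)
table(f_custom)           # counts per level, in the chosen order
```

The order of the levels matters later on, for example in the layout of tables and plots.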

36.15 Logical Comparisons

  • Numerical vectors can be compared by inequalities.

  • == tests for equality; a single = is for assignment, i.e. it states what something is.

  • != denotes ‘not equal’.

  • >, >= etc. for inequalities.

(1:5) == (5:1)
[1] FALSE FALSE  TRUE FALSE FALSE
(1:5) > (5:1)
[1] FALSE FALSE FALSE  TRUE  TRUE

You might like to play with these comparisons. It is worth testing whether the brackets are needed.
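
As a starting point for that experiment, this sketch shows one case where brackets change the answer, together with a few functions for summarising logical vectors. The functions sum(), any() and all() are base R but are not introduced in the notes above.

```r
# The colon operator is evaluated before +, -, * and /, and before comparisons.
a <- 1:3 + 1      # same as (1:3) + 1, giving 2 3 4
b <- 1:(3 + 1)    # brackets change the meaning, giving 1 2 3 4

# Logical results can be counted or summarised.
big <- (1:5) > (5:1)
n_big    <- sum(big)    # number of TRUE values
some_big <- any(big)    # is at least one TRUE?
all_big  <- all(big)    # are they all TRUE?

a
b
n_big
```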

36.16 Indexing Vectors

  • To index components of a vector x, use the form x[...].

  • The square brackets can contain:

    • a numeric vector specifying which elements to keep (or, with negative values, which to drop);

    • a logical vector: only the elements matching TRUE are returned.

x <- c(1.1, 3.2, 4.3, 7.4)
x[c(2, 4)]
[1] 3.2 7.4
x[-2]
[1] 1.1 4.3 7.4
x[x > 3.5]
[1] 4.3 7.4
which(x > 3.5)
[1] 3 4

36.17 Data Frames

  • A data frame is a collection of column vectors each of the same length.

  • The vectors may be numeric, factor, or whatever.

  • Each particular column and row of a data frame is given a name which can be chosen by the user, or assigned a default by R.

employee <- c("Dilbert", "Wally", "Catbert", "TheBoss")
job <- factor(c("Engineer", "Engineer", "Manager", "Manager"))
x <- c(8, 1, NA, -2)
dilbert <- data.frame(employee, job, competence = x)
dilbert
  employee      job competence
1  Dilbert Engineer          8
2    Wally Engineer          1
3  Catbert  Manager         NA
4  TheBoss  Manager         -2

36.17.1 Attaching and Detaching

  • To access variables (columns) of a data frame:

    • first attach the data frame; or

    • use the dataframe$variable syntax.

rm(employee, job, x)
dilbert$competence
[1]  8  1 NA -2
job
Error in eval(expr, envir, enclos): object 'job' not found
attach(dilbert)
job
[1] Engineer Engineer Manager  Manager 
Levels: Engineer Manager
detach(dilbert)

N.B. Using attach() without detach() can lead to trouble. All is fine when things are done correctly, but the consequences of using these commands incorrectly are seldom seen at the time they are used, so when errors do come up the problem is difficult to diagnose. It is quite unusual to need these commands if you use modern ways of working. It is important to know how attach() and detach() work, but do look to avoid their use.
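
Some ways of working without attach() are sketched below. The data frame is re-created so the block runs on its own; with() and aggregate() are base R functions not otherwise covered in this lecture.

```r
# A small data frame like the one above.
dilbert <- data.frame(
  employee   = c("Dilbert", "Wally", "Catbert", "TheBoss"),
  job        = factor(c("Engineer", "Engineer", "Manager", "Manager")),
  competence = c(8, 1, NA, -2)
)

# Option 1: the $ syntax. na.rm = TRUE skips the missing value.
mean_all <- mean(dilbert$competence, na.rm = TRUE)

# Option 2: with() makes the columns visible for one expression only.
mean_with <- with(dilbert, mean(competence, na.rm = TRUE))

# Option 3: many functions accept a data argument.
by_job <- aggregate(competence ~ job, data = dilbert, FUN = mean)
by_job
```

None of these leaves anything attached afterwards, so there is nothing to forget to detach.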

36.18 Importing Data

  • The scan() command reads in text from a file as a single variable. You should use this command very infrequently.

  • The read.table() command is more flexible, importing data in tabular form and storing the result as a data frame.

  • Other commands exist for importing csv files and a host of other file formats.

  • The comma separated values (csv) file format is extremely common and is used often in the course. You should not need to use scan() or read.table().

IBM <- scan(file = "ibm.txt")
ibm
Error in eval(expr, envir, enclos): object 'ibm' not found
IBM
  [1]   64.37   62.50   63.50   63.37   63.12   67.37   65.37   67.50   67.00
 [10]   66.87   70.12   70.00   69.25   69.62   69.00   69.00   71.25   70.62
 [19]   70.12   71.00   70.37   70.37   68.62   69.37   69.12   69.62   68.37
 [28]   67.12   67.25   66.75   68.87   69.37   68.12   67.62   67.62   67.00
 [37]   67.25   66.25   66.00   65.12   65.00   62.00   63.37   63.50   62.75
 [46]   63.25   62.00   61.12   61.00   61.62   62.25   61.37   60.12   59.87
 [55]   59.12   59.62   58.87   58.25   56.75   54.12   54.75   53.00   57.50
 [64]   55.87   55.75   54.87   55.50   54.87   54.87   53.37   53.50   54.62
 [73]   54.25   53.50   53.00   52.00   51.12   51.25   51.25   51.25   53.87
 [82]   53.50   53.75   55.37   54.50   54.87   54.87   54.37   54.00   55.37
 [91]   54.87   55.12   53.87   52.12   52.25   53.25   53.00   52.50   53.00
[100]   53.37   52.75   52.87   53.75   54.75   54.75   55.25   56.37   54.37
[109]   55.37   56.00   56.37   58.12   57.00   57.12   56.75   57.50   58.00
[118]   58.00   58.87   60.37   58.87   59.62   57.87   57.87   58.87   59.37
[127]   59.75   59.25   59.75   58.75   59.50   60.62   61.12   61.12   62.00
[136]   61.75   61.50   61.62   62.75   65.00   63.75   64.25   65.62   65.12
[145]   66.00   65.12   64.87   64.75   64.37   65.00   65.12   65.50   65.25
[154]   65.12   64.75   64.25   65.62   65.75   65.25   66.75   66.75   66.75
[163]   68.87   68.75   66.37   66.00   66.87   67.50   67.37   67.12   66.75
[172]   65.87   65.00   65.50   65.50   66.12   67.50   66.62   66.50   64.25
[181]   66.12   65.75   66.37   65.87   66.12   65.62   67.25   66.50   67.00
[190]   68.12   66.87   67.50   66.12   64.62   63.75   64.12   65.62   65.37
[199]   66.25   68.00   67.75   70.00   70.12   69.25   70.50   69.75   70.37
[208]   68.62   68.00   68.50   68.00   67.75   65.87   66.62   66.12   66.62
[217]   65.50   65.12   66.62   67.50   67.50   68.50   66.50   67.12   67.25
[226]   67.12   70.87   71.25   71.62   71.62   72.00   71.37   72.00   70.37
[235]   70.37   69.50   68.75   68.75   68.12   66.50   67.87   68.00   68.37
[244]   67.50   66.37   66.12   63.75   64.50 6575.00   64.50

N.B. R is case sensitive, so ibm is different from IBM. Windows is not case sensitive, so a filename typed in the wrong case will still be found; other operating systems are case sensitive, though.

Life <- read.csv(file = "../../data/life.csv", header = TRUE)
head(Life)  # shows only the first six rows
  LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
1    70.5           4.0           370           74             67
2    53.5         315.0          6166           53             54
3    65.0           4.0           684           68             62
4    76.5           1.7           449           80             73
5    70.0           8.0           643           72             68
6    71.0           5.6          1551           74             68

The ../../data/ in this command is called a relative file path. Each .. means to look up one level from the current working directory, so ../../ goes up two levels; the file requested is in a subfolder called data at that level; the / is how we separate folders from subfolders or filenames. This may seem strange to Windows users who want to use a backslash instead. Get used to using the forward slash in R because it works for all operating systems.

The read.csv() command makes a number of assumptions about the way the file is formatted. A csv file is actually plain text with commas between values. Some people think it is a MS Excel file, but csv files existed long before Excel.
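
To see the plain-text nature of a csv file, the sketch below writes a tiny one and reads it back. The file contents and column names are made up, and a temporary file is used so nothing is left behind.

```r
# Write a tiny csv file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,lifeexp",
             "A,70.5",
             "B,53.5"),
           tmp)

# header = TRUE: the first line holds the column names.
# read.csv() assumes a comma between values unless told otherwise.
toy <- read.csv(file = tmp, header = TRUE)
str(toy)      # shows the type of each column
nrow(toy)     # number of data rows

unlink(tmp)   # remove the temporary file
```

Opening such a file in a text editor shows exactly the three lines passed to writeLines().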

36.19 Editing Data

  • Data can be edited by reassigning elements of a vector or data frame.

  • For a data frame or matrix A, A[i,j] is the (i,j)th element (row i, column j).

IBM[5]
[1] 63.12
IBM[5] <- 65.12
IBM[5]
[1] 65.12
Life[2, ]
  LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
2    53.5           315          6166           53             54
Life[2, 4]
[1] 53
Life[2, 4] <- 7166
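
Editing often starts with locating the value to change. The sketch below uses a made-up data frame and a hypothetical corrected value; it combines which() from the indexing section with assignment by row and column name.

```r
# A toy data frame (made-up values) so the block runs on its own.
toy <- data.frame(id = 1:4, weight = c(61.2, 58.9, 590.0, 63.4))

# Find the suspicious entry, then overwrite it by row number and column name.
bad <- which(toy$weight > 200)     # row number(s) of implausible weights
toy[bad, "weight"] <- 59.0         # hypothetical corrected value

# A whole column can be created or edited in one step; here kilograms become grams.
toy$weight_g <- toy$weight * 1000
toy
```

Doing such edits in a script, rather than in the data file itself, keeps a record of every change made.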

36.20 R Packages

  • R objects (functions, data etc.) are organized in libraries/packages.

  • Some are loaded by default when R starts each time.

    • E.g. the function ls() is part of the base package, which is automatically loaded.
  • Some packages need loading, using either the library() or require() command.

mvrnorm(1, mu = 0, Sigma = 1)
Error in mvrnorm(1, mu = 0, Sigma = 1): could not find function "mvrnorm"
library(MASS)
mvrnorm(1, mu = 0, Sigma = 1)
[1] -0.8160883
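
Two related tools are sketched here: the :: operator, which calls a single function from a package without loading the whole package, and search(), which lists what is currently attached. The check with requireNamespace() is only there so the example still runs if MASS happens not to be installed.

```r
# Check whether MASS is installed, without attaching it.
has_mass <- requireNamespace("MASS", quietly = TRUE)

if (has_mass) {
  # The :: operator calls one function from a package without library().
  draw <- MASS::mvrnorm(1, mu = 0, Sigma = 1)
  print(draw)
} else {
  message("MASS is not installed; install.packages(\"MASS\") would fetch it from CRAN.")
}

# search() lists the packages currently attached.
attached <- search()
attached
```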

36.21 Some R Functions to Get You Started

  • help() accesses R’s help system; e.g. help(ls). A quicker way is to use ?ls

  • mean(), sd(), min(), max() and range() give mean, standard deviation, minimum, maximum and range respectively for a vector argument. N.B. range() tells you the minimum and maximum as a pair, not the difference between them.

  • var() returns the variance of a vector argument, or the covariance (dispersion) matrix for a matrix argument.

  • summary() returns summary information dependent on argument type.

  • plot() produces a plot on the current graphics device. The type of plot depends on the type of argument. The simplest use is plot(x,y) which produces a scatter-plot of vectors x and y.
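
A short sketch of some of these functions in use, on made-up values with one missing observation. The na.rm argument is not mentioned above; it is a standard argument of these functions that tells them to ignore NA values.

```r
x <- c(2.3, 1.2, 2.4, NA)        # made-up values with one missing

mean(x)                          # NA, because of the missing value
m <- mean(x, na.rm = TRUE)       # drop the NA first
r <- range(x, na.rm = TRUE)      # minimum and maximum as a pair
spread <- diff(r)                # the difference between them
summary(x)                       # summary statistics plus a count of NAs

m
r
spread
```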

Practice is the key when learning R.

36.22 Your First R Exercise

  1. Use R to calculate

    1. 3456-789
    2. 23 × 34
    3. 133
  2. Write (efficient) code to create the following sequences:

    1. 2, 4, 6, … 100; that is, the even numbers up to 100

    2. 1,2,3,4,5,4,3,2,1

  3. Use the command y <- rnorm(100) to store 100 simulated standard normal random variables in the vector y.

    1. Find the mean and standard deviation of y.

    2. Find the largest simulated value. Which simulation number is this (e.g. the 54th or the 23rd)?