Lecture 33 Appendix: Introduction to R and RStudio
In this lecture we will look at:
What is R?
Basic R syntax
Use of R via RStudio.
33.1 What is R?
R is a statistical software system.
R is a programming language that has many “inbuilt” statistical commands (e.g. to fit a linear regression).
R started life as a quasi-clone of the commercial package S-Plus, but the development of R and S-Plus is now slowly diverging.
Advantages over other statistics packages include flexibility, power, and quality of graphical display.
R is open-source software, and part of the GNU project. It can be downloaded from
http://cran.r-project.org/
and used for free. There are versions of R for all common operating systems: Windows, Linux, and macOS.
33.2 A Little R History
Ross Ihaka (left) and Robert Gentleman.
R is a New Zealand invention!
R was originally developed by Ross Ihaka and Robert Gentleman in the Department of Statistics, University of Auckland.
The first (test) version of R was released in the public domain in 1995.
The R Development Core Team took over supervision of R in 1997.
This Team includes about 20 statisticians worldwide.
33.3 Starting R in the Computing Labs
- Start R for the first time via the Start menu (or desktop icon if available).
- Select R (version 4) from the appropriate program group.
- RGui (R Graphical User Interface) should appear as displayed.
N.B. Check the version number. Ideally we want to be sure that we are running the latest version, but do not change versions partway through the semester unless it is truly necessary.
33.4 Starting RStudio in the Computing Labs
- Start RStudio for the first time via the Start menu (or desktop icon if available).
- You should see the window split into sub-windows. Please note that RStudio is undergoing constant development. Your version may not look quite like the image displayed below.
N.B. The contents of the RGui are all available in RStudio. We will work in RStudio because it offers many extra features.
33.5 Working directory
When you quit R you will get a pop-up asking “Save workspace image?”.
If you click yes, then the R workspace that you have created will be saved in your working directory.
You can find out where your working directory is by typing the command `getwd()` at the R prompt. The working directory can be changed using the Change dir… command from the R File menu.
Note the command `getwd()` is in typewriter font. This font is used in lectures for R input and output (amongst other things). The parentheses are included to show you that this is a command; typing `getwd` alone will not give you what you want!
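As a small sketch of the working-directory commands: `getwd()` reports the current directory and `setwd()` is the command-line counterpart of the Change dir… menu item. The use of `tempdir()` as a destination is only an assumption for illustration; any existing folder would do.

```r
# Report the current working directory (the path differs on every machine).
wd <- getwd()
print(wd)

# Move to a temporary folder and straight back again.
# setwd() returns the previous directory, so it can be restored afterwards.
old <- setwd(tempdir())
setwd(old)

# Without parentheses, getwd is the function itself, not the directory.
is.function(getwd)
```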
33.6 Save your work using scripts!
Please get into the habit of writing all your R commands in an R script before you run them in the console.
An R script is a text file containing code which can be run directly by highlighting it and then hitting CTRL-R (CTRL-Enter in RStudio).
- Comments following `#` symbols in script files are not executed.
To create a new script use the New script command from the R File menu or New File>R Script from the RStudio File menu.
To save your script, click on the scripts pane, and then go to File > Save As in the menu bar.
Scripts allow you to rerun your entire analysis without re-writing all the commands; they also make it easier to edit and proofread your code.
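A script might look something like the sketch below. The file name and the data are made up for illustration only.

```r
# my_first_script.R (hypothetical file name)
# Lines starting with # are comments and are not executed.

heights <- c(1.62, 1.75, 1.80)  # made-up data, in metres
mean_height <- mean(heights)    # stored in the workspace, not displayed
mean_height                     # typing the name displays the value
```

Highlighting all of these lines and running them repeats the whole calculation in one go.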
33.8 An even better way
Putting your R commands into the file that becomes your end-use document will make your workflow even more efficient. When we use RStudio, we can make use of the extensive features to create these documents.
Look out for the tutorial sheets introducing you to what we call R markdown documents. The lecture material you are viewing now was produced using R markdown.
33.9 A First Dip into R
33.9.1 Expressions and Assignments
Elementary commands are either expressions or assignments.
An expression simply displays the result of a calculation; the result is not retained in the computer’s memory.
An assignment passes the result of a calculation to a variable name (or ‘object’) which is stored; the result is not displayed.
33.9.2 Examples (for you to try)
> 3+4
[1] 7
The symbol `>` is the command prompt. This is where you would type the `3+4`. The answer will come back when you hit the `<Enter>` key. Don’t worry too much at this stage about the `[1]`.
> x <- 3+4
> x
[1] 7
- The `<-` is called the “left assignment” operator, which assigns from right (`3+4`) to left (`x`).
- Yes, it takes two keys to get it; there cannot be a space between the `<` and the `-`, and it is best practice to put a space on either side of `<-` in your work.
- It is equivalent to `x=3+4`, but many R users prefer the `<-` because it always means something is created in your workspace (not always true for `=`).
- A right assignment operator also exists but is not commonly used.
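The three assignment forms mentioned above can be compared side by side. This is only a sketch; the names `y` and `z` are introduced here for illustration.

```r
x <- 3 + 4   # left assignment: nothing is displayed
x            # typing the name displays the stored value
y = 3 + 4    # also an assignment when typed at the prompt
3 + 4 -> z   # right assignment, rarely used

# Inside a function call, = names an argument instead of creating an object:
mean(x = c(1, 2, 3))  # the x in the workspace is left unchanged
x
```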
33.10 R Objects
All assigned variables (or any other R objects) are stored until overwritten or explicitly removed (deleted) by the command `rm()`.
To list stored objects type `ls()` or `objects()`.
[1] "x" "y"
[1] "y"
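Output like that shown above could come from commands along these lines. The values given to `x` and `y` are assumptions, and other objects will also be listed if your workspace is not empty.

```r
x <- 3 + 4
y <- 10
ls()      # lists the names of stored objects, here including "x" and "y"
rm(x)     # removes x from the workspace
ls()      # "x" is no longer listed
```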
33.11 R Syntax
R commands, e.g. `ls()`, `rm()`, are followed by parentheses which may contain additional information for the function.
Writing a command name without parentheses returns the R source code for the function. Try one…
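A quick way to see the difference between a command name with and without parentheses (a sketch; the name `objs` is made up):

```r
class(ls)            # ls without parentheses is the function object itself
objs <- ls()         # with parentheses the function is called
is.character(objs)   # ls() returns a character vector of object names
```

Typing `ls` on its own at the prompt prints the source code of the function.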
33.12 Vectors in R
- The command `c()` (for concatenate) creates vectors.
[1] 2.3 1.2 2.4
[1] 2.3 1.2 2.4 9.0 2.3 1.2 2.4
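The commands behind the output above are not shown; one possibility, using the hypothetical names `v` and `w`, is:

```r
v <- c(2.3, 1.2, 2.4)  # v is a hypothetical name
v
w <- c(v, 9, v)        # vectors and single numbers can be combined
w
length(w)              # number of elements in w
```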
33.13 Vector Arithmetic in R
R uses `+`, `-`, `*` and `/` for the basic arithmetic operations, and `^` for exponentiation (raising to a power).
- Vector operations are done element by element, with recycling of short vectors if required.
[1] 4 6
[1] 4 5
[1] 1 16 25 36
[1] 3 7 7 9
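The original commands are not shown, but the sketch below gives the same four results and illustrates recycling. The vectors `a`, `b` and `u` are assumptions chosen to match the output.

```r
a <- c(1, 2)
b <- c(3, 4)
a + b               # element by element: 4 6
a + 3               # the single number 3 is recycled: 4 5
u <- c(1, 4, 5, 6)
u^2                 # 1 16 25 36
u + c(2, 3)         # c(2, 3) is recycled to 2 3 2 3, giving 3 7 7 9
```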
33.14 Types of vector
All the vectors we have seen so far have been numeric; some were integer which is a special type of number.
R also understands vectors of:
- characters: letters, numerals, spaces, and other text.
- logical values `TRUE` and `FALSE`; abbreviations `T`, `F` are often used.
- factors (i.e. categorical variables); these may "look" like characters.
[1] "This" "is" "a" "character"
[1] FALSE TRUE FALSE FALSE
[1] Low Low Medium High High
Levels: High Low Medium
N.B. All of these data types can be used in linear models, although character values are usually converted to factors.
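Vectors of each type can be created as in the sketch below. The names `words`, `flags` and `rating` are made up, and the values are chosen to resemble the output above.

```r
words <- c("This", "is", "a", "character")
words
flags <- c(FALSE, TRUE, FALSE, FALSE)
flags
rating <- factor(c("Low", "Low", "Medium", "High", "High"))
rating            # levels are sorted alphabetically by default
class(words); class(flags); class(rating)
```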
33.15 Logical Comparisons
Numerical vectors can be compared by inequalities.
- `==` denotes equality; the `=` states what something is.
- `!=` denotes ‘not equal’.
- `>`, `>=` etc. for inequalities.
[1] FALSE FALSE TRUE FALSE FALSE
[1] FALSE FALSE FALSE TRUE TRUE
You might like to play with these comparisons. It is worth testing whether the brackets are needed.
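A few comparisons to start from are sketched here. The vector `x` is an assumption, chosen so that the first two results match the output above.

```r
x <- 1:5
x == 3             # FALSE FALSE TRUE FALSE FALSE
x > 3              # FALSE FALSE FALSE TRUE TRUE
x != 3             # TRUE wherever x is not 3
is_big <- (x > 3)  # brackets are optional here but make the intent clear
sum(is_big)        # TRUE counts as 1, so this counts the values above 3
```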
33.16 Indexing Vectors
To index components of a vector `x`, use the form `x[...]`.
The square brackets can contain:
- a numeric vector specifying the elements required;
- a logical vector: only the `TRUE` elements are returned.
[1] 3.2 7.4
[1] 1.1 4.3 7.4
[1] 4.3 7.4
[1] 3 4
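Both kinds of index are shown in this sketch. The values in `x` are hypothetical, picked to be consistent with the output above.

```r
x <- c(1.1, 3.2, 4.3, 7.4)      # hypothetical values
x[c(2, 4)]                      # numeric index: 3.2 7.4
x[c(TRUE, FALSE, TRUE, TRUE)]   # logical index: 1.1 4.3 7.4
x[x > 4]                        # logical index from a comparison: 4.3 7.4
which(x > 4)                    # positions of the TRUE elements: 3 4
```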
33.17 Data Frames
A data frame is a collection of column vectors each of the same length.
The vectors may be numeric, factor, or whatever.
Each particular column and row of a data frame is given a name which can be chosen by the user, or assigned a default by R.
employee <- c("Dilbert", "Wally", "Catbert", "TheBoss")
job <- factor(c("Engineer", "Engineer", "Manager", "Manager"))
x <- c(8, 1, NA, -2)
dilbert <- data.frame(employee, job, competence = x)
dilbert
  employee      job competence
1  Dilbert Engineer          8
2    Wally Engineer          1
3  Catbert  Manager         NA
4  TheBoss  Manager         -2
33.17.1 Attaching and Detaching
To access variables (columns) of a data frame:
- First `attach` the data frame; or
- Use `data.frame$variable` syntax.
[1] 8 1 NA -2
Error: object 'job' not found
[1] Engineer Engineer Manager Manager
Levels: Engineer Manager
N.B. Using `attach()` without `detach()` can lead to trouble. All is fine when things are done correctly, but the consequences of not using these commands correctly are seldom seen at the time they are used. When the errors come up it will be difficult to diagnose the problem. It is quite unusual to need these commands if you use modern ways of working. It is important to know how the `attach()` and `detach()` commands work, but do look to avoid their use.
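One way of working without `attach()` is sketched below, using the `dilbert` data frame from earlier. The use of `with()` is a suggestion added here, not necessarily the approach taken in the course.

```r
employee <- c("Dilbert", "Wally", "Catbert", "TheBoss")
job <- factor(c("Engineer", "Engineer", "Manager", "Manager"))
dilbert <- data.frame(employee, job, competence = c(8, 1, NA, -2))
rm(employee, job)   # the columns now live only inside the data frame

dilbert$competence  # 8 1 NA -2
dilbert$job
# with() evaluates an expression inside the data frame, so there is nothing to detach
avg <- with(dilbert, mean(competence, na.rm = TRUE))
avg
```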
33.18 Importing Data
The `scan()` command reads in text from a file as a single variable. You should use this command very infrequently.
The `read.table()` command is more flexible, importing data in tabular form and storing the result as a data frame.
Other commands exist for importing `csv` files, and a host of other file formats.
The comma separated values (csv) file format is extremely common and is used often in the course. You should not need to use `scan()` or `read.table()`.
Error: object 'ibm' not found
[1] 64.37 62.50 63.50 63.37 63.12 67.37 65.37 67.50 67.00
[10] 66.87 70.12 70.00 69.25 69.62 69.00 69.00 71.25 70.62
[19] 70.12 71.00 70.37 70.37 68.62 69.37 69.12 69.62 68.37
[28] 67.12 67.25 66.75 68.87 69.37 68.12 67.62 67.62 67.00
[37] 67.25 66.25 66.00 65.12 65.00 62.00 63.37 63.50 62.75
[46] 63.25 62.00 61.12 61.00 61.62 62.25 61.37 60.12 59.87
[55] 59.12 59.62 58.87 58.25 56.75 54.12 54.75 53.00 57.50
[64] 55.87 55.75 54.87 55.50 54.87 54.87 53.37 53.50 54.62
[73] 54.25 53.50 53.00 52.00 51.12 51.25 51.25 51.25 53.87
[82] 53.50 53.75 55.37 54.50 54.87 54.87 54.37 54.00 55.37
[91] 54.87 55.12 53.87 52.12 52.25 53.25 53.00 52.50 53.00
[100] 53.37 52.75 52.87 53.75 54.75 54.75 55.25 56.37 54.37
[109] 55.37 56.00 56.37 58.12 57.00 57.12 56.75 57.50 58.00
[118] 58.00 58.87 60.37 58.87 59.62 57.87 57.87 58.87 59.37
[127] 59.75 59.25 59.75 58.75 59.50 60.62 61.12 61.12 62.00
[136] 61.75 61.50 61.62 62.75 65.00 63.75 64.25 65.62 65.12
[145] 66.00 65.12 64.87 64.75 64.37 65.00 65.12 65.50 65.25
[154] 65.12 64.75 64.25 65.62 65.75 65.25 66.75 66.75 66.75
[163] 68.87 68.75 66.37 66.00 66.87 67.50 67.37 67.12 66.75
[172] 65.87 65.00 65.50 65.50 66.12 67.50 66.62 66.50 64.25
[181] 66.12 65.75 66.37 65.87 66.12 65.62 67.25 66.50 67.00
[190] 68.12 66.87 67.50 66.12 64.62 63.75 64.12 65.62 65.37
[199] 66.25 68.00 67.75 70.00 70.12 69.25 70.50 69.75 70.37
[208] 68.62 68.00 68.50 68.00 67.75 65.87 66.62 66.12 66.62
[217] 65.50 65.12 66.62 67.50 67.50 68.50 66.50 67.12 67.25
[226] 67.12 70.87 71.25 71.62 71.62 72.00 71.37 72.00 70.37
[235] 70.37 69.50 68.75 68.75 68.12 66.50 67.87 68.00 68.37
[244] 67.50 66.37 66.12 63.75 64.50 6575.00 64.50
N.B. R is case sensitive so `ibm` is different to `IBM`. Windows is not case sensitive, so a filename typed in the wrong case will still be found there; other operating systems are case sensitive though.
Life <- read.csv(file = "../../data/life.csv", header = TRUE)
head(Life) # shows only the first six rows
  LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
1    70.5           4.0           370           74             67
2    53.5         315.0          6166           53             54
3    65.0           4.0           684           68             62
4    76.5           1.7           449           80             73
5    70.0           8.0           643           72             68
6    71.0           5.6          1551           74             68
The `../../data/` in this command is called a relative file path. Each `..` means to look up one level from the current working directory; the file requested is in a subfolder called `data`; the `/` is how we separate folders from subfolders or filenames. This may seem strange to Windows users who want to use a backslash instead. Get used to using the forward slash in R because it works for all operating systems.
The `read.csv()` command makes a number of assumptions about the way the file is formatted. A `csv` file is actually plain text with commas between values. Some people think it is a MS Excel file, but csv files existed long before Excel.
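To see that a csv file really is plain text, the sketch below writes a tiny file to a temporary location and reads it back. The column names and values are made up, and the temporary file stands in for a real data file.

```r
# Write a tiny csv file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,LifeExp", "A,70.5", "B,53.5"), tmp)

readLines(tmp)                         # the file is just text with commas
small <- read.csv(file = tmp, header = TRUE)
small
str(small)                             # column types chosen by read.csv()
unlink(tmp)                            # tidy up the temporary file
```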
33.19 Editing Data
You can edit data by reassigning elements of a vector or data frame.
For a data frame or matrix, `A[i,j]` is the i,jth element (row i, column j) of the data frame `A`.
[1] 63.12
[1] 65.12
LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
2 53.5 315 6166 53 54
[1] 53
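Editing by reassignment can be sketched as follows. The data frame `scores` and its values are made up and are not the course data.

```r
scores <- data.frame(name = c("a", "b", "c"), mark = c(52, 610, 71))  # made-up data
scores[2, 2]            # row 2, column 2: the mistyped value 610
scores[2, 2] <- 61      # reassign just that element
scores[2, ]             # the whole of row 2 after the correction
scores$mark[3] <- 72    # the same idea using the column name
scores
```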
33.20 R Packages
R objects (functions, data etc.) are organized in libraries/packages.
Some are loaded by default when R starts each time.
- E.g. function `ls()` is part of the `base` package that is automatically loaded.
Some packages need loading, using either the `library()` or `require()` command.
[1] -0.8943594
[1] 0.3540624
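Loading a package can be sketched as below. MASS is used only as an example of a package that is normally installed alongside R; the check is there in case it is missing on your machine, and `has_mass` is a made-up name.

```r
search()                       # packages currently attached, including package:base

# Check that MASS is installed before trying to attach it.
has_mass <- requireNamespace("MASS", quietly = TRUE)
if (has_mass) {
  library(MASS)                # attach the package for this session
  print("package:MASS" %in% search())
}
```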
33.21 Some R Functions to Get You Started
- `help()` accesses R’s help system; e.g. `help(ls)`. A quicker way is to use `?ls`.
- `mean()`, `sd()`, `min()`, `max()` and `range()` give mean, standard deviation, minimum, maximum and range respectively for a vector argument. N.B. `range()` tells you the minimum and maximum as a pair, not the difference between them.
- `var()` returns the variance of a vector argument, or the covariance (dispersion) matrix for a matrix argument.
- `summary()` returns summary information dependent on argument type.
- `plot()` produces a plot on the current graphics device. The type of plot depends on the type of argument. The simplest use is `plot(x,y)` which produces a scatter-plot of vectors `x` and `y`.
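These functions can be tried on a small example. The vectors below are made-up data, and `diff()` is added here only to show how the width of the range could be found.

```r
x <- c(2, 4, 4, 5, 10)   # made-up data
y <- c(1, 3, 2, 6, 8)
mean(x)                  # 5
range(x)                 # 2 10, the minimum and maximum as a pair
diff(range(x))           # 8, the difference between them
var(x); sd(x)            # variance and standard deviation
summary(x)
plot(x, y)               # scatter-plot of y against x
```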
Practice is the key when learning R.
33.22 Your First R Exercise
Use R to calculate
- 3456-789
- \(23\times{}34\)
- \(13^3\)
Write (efficient) code to create the following sequences:
2, 4, 6, … 100; that is, the even numbers up to 100
1,2,3,4,5,4,3,2,1
Use the command `y <- rnorm(100)` to store 100 simulated standard normal random variables in the vector `y`.
Find the mean and standard deviation of `y`.
Find the largest simulated value. Which number simulation is this, e.g. 54th, 23rd?