In this lecture we will look at:
What is R?
Basic R syntax
Use of R via RStudio.
R is a statistical software system.
R is a programming language that has many “inbuilt” statistical commands (e.g. to fit a linear regression).
R started life as a quasi-clone of the commercial package S-Plus, but the development of R and S-Plus is now slowly diverging.
Advantages over other statistics packages include flexibility, power, and quality of graphical display.
R is open-source software, and part of the GNU project. It can be downloaded from http://cran.r-project.org/ and used for free.
There are versions of R for all common operating systems — Windows, Linux, and MacOS.
Ross Ihaka (left) and Robert Gentleman.
R is a New Zealand invention!
R was originally developed by Ross Ihaka and Robert Gentleman in the Department of Statistics, University of Auckland.
The first (test) version of R was made publicly available in 1995.
The R Development Core Team took over supervision of R in 1997.
This Team includes about 20 statisticians worldwide.
N.B. Check the version number. Ideally we want to be sure that we are running the latest version, but do not change versions partway through the semester unless it is truly necessary.
N.B. the contents of the RGUI are all available in RStudio. We will work in RStudio because it offers many extra features.
When you quit R you will get a pop-up asking “Save workspace image?”.
If you click yes, then the R workspace that you have created will be saved in your working directory.
You can find out where your working directory is by typing the command `getwd()` at the R prompt.
The working directory can be changed using the Change dir… command from the R File menu.
Note the command `getwd()` is in typewriter font. This font is used in lectures for R input and output (amongst other things).
The parentheses are included to show you that this is a command; typing `getwd` alone will not give you what you want!
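As a small sketch of the point about parentheses, the lines below call `getwd()` and then look at `getwd` on its own. The object name `wd` is just an illustration, and `setwd()` (not mentioned above) is the typed equivalent of the Change dir… menu command.

```r
# Ask R where the working directory currently is
wd <- getwd()
print(wd)             # a character string holding a folder path

# Without parentheses we are referring to the function itself, not its result
is.function(getwd)    # TRUE

# setwd() changes the working directory; here it is set to where it already is,
# so nothing actually moves
setwd(wd)
```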
Please get into the habit of writing all your R commands in an R script before you run them in the console.
An R script is a text file containing code which can be run directly by highlighting it then hitting CTRL-R (CTRL-Enter in RStudio).
Comments starting with `#` symbols in script files are not executed.
To create a new script use the New script command from the R File menu or New File>R Script from the RStudio File menu.
To save your script, click on the scripts pane, and then go to File > Save As in the menu bar.
Scripts allow you to rerun your entire analysis without re-writing all the commands; they also help with editing and proofreading your code.
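A script might look something like the sketch below. The file name, the object names and the numbers are all made up for illustration.

```r
# my_first_script.R  (hypothetical file name)
# Anything after a # is a comment: R skips it when the script is run.

heights <- c(1.62, 1.75, 1.80)   # made-up heights in metres
mean_height <- mean(heights)     # store the average
mean_height                      # a name on its own prints the stored value
```

Saving these lines in a script means the whole calculation can be rerun later by highlighting them and running them again.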
Putting your R commands into the file that becomes your end-use document will make your workflow even more efficient. When we use RStudio, we can make use of the extensive features to create these documents.
Look out for the tutorial sheets introducing you to what we call R markdown documents. The lecture material you are viewing now was produced using R markdown.
Elementary commands are either expressions or assignments.
An expression simply displays the result of a calculation; the result is not retained in the computer’s memory.
An assignment passes the result of a calculation to a variable name (or ‘object’) which is stored; the result is not displayed.
[1] 7
The symbol `>` is the command prompt. This is where you would type the `3+4`. The answer will come back when you hit the `<Enter>` key.
Don’t worry too much at this stage about the `[1]`.
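The difference between an expression and an assignment can be seen in a minimal sketch like this one (the prompt `>` is not typed; it is shown by R).

```r
3 + 4          # an expression: the result is printed but not stored

x <- 3 + 4     # an assignment: the result is stored in x and nothing is printed
x              # typing the name of the object prints the stored value
```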
[1] 7
`<-` is called the “left assignment” operator, which assigns from right (`3+4`) to left (`x`).
It is typed as the `<` and the `-` with no space between them, and it is best practice to put a space on either side of `<-` in your work.
The same assignment could be written `x=3+4`, but many R users prefer `<-` because it always means something is created in your workspace (this is not always true for `=`).
All assigned variables (or any other R objects) are stored until overwritten or explicitly removed (deleted) by the command `rm()`.
To list stored objects type `ls()` or `objects()`.
[1] "A" "A.mat" "B.mat" "Caffeine"
[5] "Caffeine.bt" "Caffeine.lm" "Caffeine.lm0" "Caffeine.lm2"
[9] "Caffeine.lv" "Caffeine.M0" "Caffeine.M1" "climate"
[13] "Climate" "climate.full" "Climate.lm0" "Climate.lm00"
[17] "climate.lm1" "Climate.lm1" "Climate.lm1.sum" "climate.lm2"
[21] "Climate.lm2" "climate.lm3" "Climate.lm3" "climate.lm4"
[25] "Climate.lm4" "climate.lm5" "Climate.lm5" "climate.lm6"
[29] "Climate.lm6" "climate.null" "climate.s1" "climate.s2"
[33] "climate.step" "climate.step2" "Coeffs" "Cows"
[37] "Cows.lm.1" "Cows.lm.2" "cps.wls" "cps.wls2"
[41] "CPS5" "cps5.lm" "CPS5grouped" "CPS5weighted"
[45] "D.mat" "dat" "e" "elec.lm1"
[49] "elec.lm2" "elec.lm3" "elec.lm4" "elec.lm5"
[53] "electric" "electric.lm0" "English" "English.lm1"
[57] "English.lm2" "f" "Fat" "Fat.lm.0"
[61] "Fat.lm.1" "Fat.lm.2" "Fev" "Fev.lm.4"
[65] "Fev.lm.5" "Fev.lm.6" "Fev.lm.poly" "Fev.lm4"
[69] "Fev.lm6" "Fev.pce1" "grouped.lm" "Hills"
[73] "Hills.lm" "Hospital" "Hospital.lm" "i"
[77] "Indy" "Indy.pce0" "Indy.pce1" "Indy.pce2"
[81] "k" "Knot1" "Knot2" "Lions"
[85] "Lions.lm1" "Lions.lm2" "lm.cor" "lm.uncor"
[89] "lm1" "lm2" "lm3" "lm4"
[93] "lm5" "lm6" "MyTable" "newdat"
[97] "NSrent" "Outfile" "PrettyPVal" "PulseData"
[101] "Rat" "Rat.lm.1" "Rat.lm.2" "RatDiet"
[105] "RatDiet.lm" "Samara" "Samara.lm" "Samara.lm.m0"
[109] "Samara.lm.m1" "Samara.lm.m2" "Samara.lm2" "SAP"
[113] "SAP.lm" "SAP.lm2" "SAP.lm3" "SAPA.lm"
[117] "scope" "sim" "sim_final" "sim_rand"
[121] "sup.lm" "sup.wls" "supervisors" "TG"
[125] "tmc" "Tooth.lm.0" "Tooth.lm.1" "ToothGrowth"
[129] "Tourism" "Tourism.gls" "Tourism.lm" "Tourism.lm.2"
[133] "x" "X1" "X2" "y"
[137] "y1" "y2" "y3" "y4"
[141] "Z1" "z2" "z3"
A second listing, printed once `x` has been removed, is identical to the one above except that "x" no longer appears (142 objects rather than 143).
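The following sketch shows `ls()` and `rm()` at work on a single object, and illustrates the remark that `=` does not always create something in the workspace. The values are made up, and your own listing will of course contain different objects.

```r
x <- 3 + 4                # creates x in the workspace
"x" %in% ls()             # TRUE: x is among the stored objects

rm(x)                     # remove x
"x" %in% ls()             # FALSE: x has gone

# Inside a function call, = only names an argument; no object is created
mean(x = c(1, 2, 3))
"x" %in% ls()             # still FALSE

# <- inside a function call does create the object as a side effect
mean(x <- c(1, 2, 3))
"x" %in% ls()             # TRUE
```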
R commands, e.g. `ls()`, `rm()`, are followed by parentheses which may contain additional information for the function.
Writing a command name without parentheses returns the R source code for the function. Try one…
`c()` (for concatenate) creates vectors.
[1] 2.3 1.2 2.4
[1] 2.3 1.2 2.4 9.0 2.3 1.2 2.4
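The commands that produced the two vectors printed above are not shown. A sketch that would give the same output, with the object names `x` and `y` being assumptions, is:

```r
x <- c(2.3, 1.2, 2.4)   # combine three numbers into one vector
x

y <- c(x, 9, x)         # vectors can themselves be combined with other values
y
length(y)               # 7 elements in total
```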
R uses `+`, `-`, `*` and `/` for the basic arithmetic operations, and `^` for exponentiation (raising to a power).
[1] 4 6
[1] 4 5
[1] 1 16 25 36
[1] 3 7 7 9
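Arithmetic on vectors works element by element. The sketch below uses made-up vectors `a` and `b`; they are not the ones behind the output above.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b      # 11 22 33: elements are added in pairs
b - a      # 9 18 27
a * 2      # 2 4 6: the single number 2 is reused for every element
b / a      # 10 10 10
a ^ 2      # 1 4 9
```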
All the vectors we have seen so far have been numeric; some contained integers, which are a special type of number.
R also understands vectors of:
- characters: letters, numerals, spaces, and other text.
- logical values `TRUE` and `FALSE`; abbreviations `T`, `F` are often used.
- factors (i.e. categorical variables); these may "look" like characters.
[1] "This" "is" "a" "character"
[1] FALSE TRUE FALSE FALSE
[1] Low Low Medium High High
Levels: High Low Medium
N.B. All of these data types can be used in linear models, although character values are usually converted to factors.
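Vectors of each of these types can be created with `c()` and `factor()`. In this sketch the object names `words`, `flags` and `dose` are hypothetical, and the values are chosen to resemble the output printed above.

```r
words <- c("This", "is", "a", "character")                   # character vector
flags <- c(FALSE, TRUE, FALSE, FALSE)                        # logical vector
dose  <- factor(c("Low", "Low", "Medium", "High", "High"))   # factor

class(words)    # "character"
class(flags)    # "logical"
class(dose)     # "factor"
levels(dose)    # "High" "Low" "Medium": levels are sorted alphabetically by default
```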
Numerical vectors can be compared by inequalities.
- `==` denotes equality; a single `=` states what something is (it assigns a value).
- `!=` denotes ‘not equal’.
- `>`, `>=` etc. are used for inequalities.
[1] FALSE FALSE TRUE FALSE FALSE
[1] FALSE FALSE FALSE TRUE TRUE
You might like to play with these comparisons. It is worth testing whether the brackets are needed.
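Here is a small sketch to experiment with. The vector `x` is made up, and the last two lines show one case where brackets turn out to be optional.

```r
x <- c(1, 2, 3, 2, 1)   # made-up values

x == 3              # FALSE FALSE  TRUE FALSE FALSE
x != 3              #  TRUE  TRUE FALSE  TRUE  TRUE
x >= 2              # FALSE  TRUE  TRUE  TRUE FALSE

(x > 1) & (x < 3)   # FALSE  TRUE FALSE  TRUE FALSE
x > 1 & x < 3       # same result: the comparisons are evaluated before &
```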
To index components of a vector `x`, use the form `x[...]`.
The square brackets can contain:
- a numeric vector specifying the positions of the elements required;
- a logical vector, in which case only the elements marked `TRUE` are returned.
[1] 3.2 7.4
[1] 1.1 4.3 7.4
[1] 4.3 7.4
[1] 3 4
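Both kinds of index can be tried on a short vector. The values of `x` below are chosen for illustration; negative indices and `which()` are extras not described above.

```r
x <- c(1.1, 3.2, 4.3, 7.4)   # illustrative values

x[c(2, 4)]     # 3.2 7.4: elements 2 and 4
x[-2]          # 1.1 4.3 7.4: a negative index drops that element
x[x > 4]       # 4.3 7.4: keep the elements where the condition is TRUE
which(x > 4)   # 3 4: the positions where the condition is TRUE
```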
A data frame is a collection of column vectors each of the same length.
The vectors may be numeric, factor, or whatever.
Each particular column and row of a data frame is given a name which can be chosen by the user, or assigned a default by R.
employee <- c("Dilbert", "Wally", "Catbert", "TheBoss")
job <- factor(c("Engineer", "Engineer", "Manager", "Manager"))
x <- c(8, 1, NA, -2)
dilbert <- data.frame(employee, job, competence = x)
dilbert
  employee      job competence
1  Dilbert Engineer          8
2    Wally Engineer          1
3  Catbert  Manager         NA
4  TheBoss  Manager         -2
To access variables (columns) of a data frame:
- first `attach` the data frame; or
- use the `data.frame$variable` syntax.
[1] 8 1 NA -2
Error: object 'job' not found
[1] Engineer Engineer Manager Manager
Levels: Engineer Manager
N.B. Using `attach()` without `detach()` can lead to trouble. All is fine when things are done correctly, but the consequences of not using these commands correctly are seldom seen at the time they are used. When the errors come up it will be difficult to diagnose the problem. It is quite unusual to need to use these commands if you use modern ways of working. It is important to know how the `attach()` and `detach()` commands work, but do look to avoid their use.
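Columns can be reached without attaching anything. The sketch below rebuilds the `dilbert` data frame so that it runs on its own; `[[ ]]` and `with()` are alternatives not covered above, shown here only as possibilities.

```r
employee <- c("Dilbert", "Wally", "Catbert", "TheBoss")
job <- factor(c("Engineer", "Engineer", "Manager", "Manager"))
dilbert <- data.frame(employee, job, competence = c(8, 1, NA, -2))

dilbert$competence      # one column, extracted by name
dilbert[["job"]]        # the same idea using double square brackets

# with() evaluates an expression using the columns of the data frame;
# na.rm = TRUE leaves the missing value out of the mean
with(dilbert, mean(competence, na.rm = TRUE))
```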
The `scan()` command reads in text from a file as a single variable. You should use this command very infrequently.
The `read.table()` command is more flexible, importing data in tabular form and storing the result as a data frame.
Other commands exist for importing `csv` files, and a host of other file formats.
The comma separated values (csv) file format is extremely common and is used often in the course. You should not need to use `scan()` or `read.table()`.
Error: object 'ibm' not found
[1] 64.37 62.50 63.50 63.37 63.12 67.37 65.37 67.50 67.00
[10] 66.87 70.12 70.00 69.25 69.62 69.00 69.00 71.25 70.62
[19] 70.12 71.00 70.37 70.37 68.62 69.37 69.12 69.62 68.37
[28] 67.12 67.25 66.75 68.87 69.37 68.12 67.62 67.62 67.00
[37] 67.25 66.25 66.00 65.12 65.00 62.00 63.37 63.50 62.75
[46] 63.25 62.00 61.12 61.00 61.62 62.25 61.37 60.12 59.87
[55] 59.12 59.62 58.87 58.25 56.75 54.12 54.75 53.00 57.50
[64] 55.87 55.75 54.87 55.50 54.87 54.87 53.37 53.50 54.62
[73] 54.25 53.50 53.00 52.00 51.12 51.25 51.25 51.25 53.87
[82] 53.50 53.75 55.37 54.50 54.87 54.87 54.37 54.00 55.37
[91] 54.87 55.12 53.87 52.12 52.25 53.25 53.00 52.50 53.00
[100] 53.37 52.75 52.87 53.75 54.75 54.75 55.25 56.37 54.37
[109] 55.37 56.00 56.37 58.12 57.00 57.12 56.75 57.50 58.00
[118] 58.00 58.87 60.37 58.87 59.62 57.87 57.87 58.87 59.37
[127] 59.75 59.25 59.75 58.75 59.50 60.62 61.12 61.12 62.00
[136] 61.75 61.50 61.62 62.75 65.00 63.75 64.25 65.62 65.12
[145] 66.00 65.12 64.87 64.75 64.37 65.00 65.12 65.50 65.25
[154] 65.12 64.75 64.25 65.62 65.75 65.25 66.75 66.75 66.75
[163] 68.87 68.75 66.37 66.00 66.87 67.50 67.37 67.12 66.75
[172] 65.87 65.00 65.50 65.50 66.12 67.50 66.62 66.50 64.25
[181] 66.12 65.75 66.37 65.87 66.12 65.62 67.25 66.50 67.00
[190] 68.12 66.87 67.50 66.12 64.62 63.75 64.12 65.62 65.37
[199] 66.25 68.00 67.75 70.00 70.12 69.25 70.50 69.75 70.37
[208] 68.62 68.00 68.50 68.00 67.75 65.87 66.62 66.12 66.62
[217] 65.50 65.12 66.62 67.50 67.50 68.50 66.50 67.12 67.25
[226] 67.12 70.87 71.25 71.62 71.62 72.00 71.37 72.00 70.37
[235] 70.37 69.50 68.75 68.75 68.12 66.50 67.87 68.00 68.37
[244] 67.50 66.37 66.12 63.75 64.50 6575.00 64.50
N.B. R is case sensitive so `ibm` is different to `IBM`. Windows file names are not case sensitive, so a file name typed with the wrong case will still be found; other operating systems are case sensitive, though.
Life <- read.csv(file = "../../data/life.csv", header = TRUE)
head(Life) # shows only the first six rows
  LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
1    70.5           4.0           370           74             67
2    53.5         315.0          6166           53             54
3    65.0           4.0           684           68             62
4    76.5           1.7           449           80             73
5    70.0           8.0           643           72             68
6    71.0           5.6          1551           74             68
The `../../data/` in this command is called a relative file path. Each `..` means to look up one level from the current working directory (so `../../` goes up two levels); the file requested is in a subfolder called `data`; the `/` is how we separate folders from subfolders or filenames. This may seem strange to Windows users who want to use a backslash instead. Get used to using the forward slash in R because it works for all operating systems.
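If you prefer not to type the slashes yourself, a path can be assembled from its parts. This sketch uses `file.path()` and `file.exists()`, which are not mentioned above, and the object name `life_path` is hypothetical.

```r
# Build the relative path from its parts; file.path() joins them with forward slashes
life_path <- file.path("..", "..", "data", "life.csv")
life_path                 # "../../data/life.csv"

# Check whether the file can be found from the current working directory;
# the answer depends on how your own folders are laid out
file.exists(life_path)
```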
The `read.csv()` command makes a number of assumptions about the way the file is formatted. A `csv` file is actually plain text with commas between values. Some people think it is an MS Excel file, but csv files existed long before Excel.
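To see that a csv file really is just text, the sketch below writes a tiny one to a temporary location and reads it back. The file contents and the object names `csv_file` and `scores` are made up for illustration.

```r
# Write a tiny csv file to a temporary location (the contents are made up)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,score",
             "Ann,71.5",
             "Bob,64"),
           csv_file)

readLines(csv_file)    # the raw file: plain text with commas between values

# read.csv() treats the first line as column names when header = TRUE
scores <- read.csv(file = csv_file, header = TRUE)
scores
str(scores)            # shows the type of each column
```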
Data can be edited by reassigning elements of a vector or data frame.
For a data frame or matrix, `A[i,j]` is the i,jth element (row i, column j) of the data frame `A`.
[1] 63.12
[1] 65.12
  LifeExp People.per.TV People.per.Dr LifeExp.Male LifeExp.Female
2    53.5           315          6166           53             54
[1] 53
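Editing a single entry might look like the following sketch. The data frame `A` and its values are invented here, with one deliberate typing error to correct.

```r
A <- data.frame(height = c(150, 1600, 170),   # 1600 is a deliberate typing error
                weight = c(50, 60, 70))

A[2, 1]              # row 2, column 1: the suspicious value
A[2, 1] <- 160       # reassign that single element
A[2, ]               # the whole of row 2 after the correction

A$weight[3] <- 72    # elements of a single column can be edited in the same way
A
```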
R objects (functions, data etc.) are organized in libraries/packages.
Some are loaded by default each time R starts; for example, `ls()` is part of the `base` package, which is automatically loaded.
Some packages need loading, using either the `library()` or `require()` command.
[1] 0.2727256
[1] -0.9755279
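The sketch below loads a package that ships with R and shows one difference between the two commands. The `splines` package is used only because it is always installed; `notARealPackage` is a made-up name that is assumed not to exist on your computer.

```r
library(splines)                      # load a package that ships with R
"package:splines" %in% search()       # TRUE: the package is now on the search path

# require() returns FALSE rather than stopping with an error when a package is missing
ok <- suppressWarnings(
  require("notARealPackage", character.only = TRUE, quietly = TRUE)
)
ok                                    # FALSE, assuming no such package is installed
```

In a script, `library()` is usually the safer choice because a missing package stops the script immediately instead of causing a puzzling error later on.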
`help()` accesses R’s help system; e.g. `help(ls)`. A quicker way is to use `?ls`
`mean()`, `sd()`, `min()`, `max()` and `range()` give the mean, standard deviation, minimum, maximum and range respectively for a vector argument. N.B. `range()` tells you the minimum and maximum as a pair, not the difference between them.
`var()` returns the variance of a vector argument, or the covariance (dispersion) matrix for a matrix argument.
`summary()` returns summary information dependent on argument type.
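These summary commands can be tried on any numeric vector. The data in this sketch are made up so that the answers are easy to check by hand.

```r
x <- c(2, 4, 6, 8)   # made-up data

mean(x)          # 5
sd(x)            # the square root of var(x)
min(x)           # 2
max(x)           # 8
range(x)         # 2 8: the minimum and maximum as a pair
diff(range(x))   # 6: the difference between them, if that is what you need
var(x)           # 20/3, about 6.67
summary(x)       # minimum, quartiles, median, mean and maximum for a numeric vector
```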
`plot()` produces a plot on the current graphics device. The type of plot depends on the type of argument. The simplest use is `plot(x,y)` which produces a scatter-plot of vectors `x` and `y`.
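A first scatter-plot could be produced along these lines. The data are made up, and the axis labels and title are optional extras.

```r
x <- 1:10          # made-up data
y <- x^2

plot(x, y)         # a basic scatter-plot of y against x

# The same plot with axis labels and a title
plot(x, y, xlab = "x", ylab = "x squared", main = "A simple scatter-plot")
```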
Practice is the key when learning R.
Use R to calculate
Write (efficient) code to create the following sequences:
2, 4, 6, … 100; that is, the even numbers up to 100
1,2,3,4,5,4,3,2,1
Use the command `y <- rnorm(100)` to store 100 simulated standard normal random variables in the vector `y`.
Find the mean and standard deviation of `y`.
Find the largest simulated value. Which simulation number is this (e.g. the 54th or the 23rd)?