Chapter 11LURN… To Do Basic Inference

This chapter is all about finding confidence intervals and performing hypothesis tests. While many textbooks separate these activities there is little point doing so when it comes time to using R as both are done at the same time in the majority of cases.

11.1 Confidence intervals and hypothesis tests for the mean of one population

The datasets package contains a set of annual precipitation values (in inches) for 70 U.S. cities. If we could assume this set to be a random sample (and we will even if it is not true), then we can find a confidence interval for the mean annual precipitation for U.S. cities on the whole.

You could use a manual approach to find the 95% confidence interval for the population mean (μ) yourself using the mean(), sd(), length(), and qt() commands as follows:

> mean(precip)

[1] 34.89

> sd(precip)

[1] 13.71

> length(precip)

[1] 70

> qt(0.975, length(precip)-1)

[1] 1.995

Note the use of the qt() command here. It has two arguments; one for the level of confidence (95% confidence relates to an upper tail area of 0.025) and the degrees of freedom (based on the sample size). The 95% confidence interval is therefore found using

> mean(precip) - qt(0.975, length(precip)-1) * sd(precip) / sqrt(length(precip))

[1] 31.62

> mean(precip) + qt(0.975, length(precip)-1) * sd(precip) / sqrt(length(precip))

[1] 38.15

or, with a nicer (quicker) command:

> mean(precip) + c(-1,1) * qt(0.975, length(precip)-1) * sd(precip) / sqrt(length(precip))

[1] 31.62 38.15

These calculations include all the commands needed for a hypothesis test, except the need to find the specific p-value of a hypothesis test. Let’s test the notion that the annual precipitation is actually greater than 30 inches per annum. This is a one-sided hypothesis test where the null hypothesis is that μ = 30.

> TestStat= (mean(precip)-30)/(sd(precip) / sqrt(length(precip)))
> pt(TestStat, length(precip)-1, lower.tail=FALSE)

[1] 0.001976

Note that the pt() command required us to think about the degrees of freedom and that the upper tail area was important to us. We now see that the null hypothesis is rejected as the p-value is very small.

These calculations are fine, but there is a quicker way! The t.test() command does it all for us — both confidence interval and hypothesis test.

> t.test(precip)

One Sample t-test

data:  precip
t = 21, df = 69, p-value <2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
31.62 38.15
sample estimates:
mean of x
34.89

This gives a confidence interval with the default level of confidence (95%), while

> t.test(precip, mu=30, alternative="greater")

One Sample t-test

data:  precip
t = 3, df = 69, p-value = 0.002
alternative hypothesis: true mean is greater than 30
95 percent confidence interval:
32.15   Inf
sample estimates:
mean of x
34.89

gives us the results for a one-sided hypothesis test and the corresponding confidence interval.

The t.test() command is used for many t-tests, notably for comparing two population means, whether the data are paired or not.

11.2 Confidence intervals and hypothesis tests for the difference of two population means

The first question you must ask yourself when working with two sets of values you wish to compare is if they are two independent samples or two measurements taken on one sample. If the latter is the case, then we have paired samples and will need to add an extra argument to our command to allow for the link. In both cases, the difference of the two population means is denoted μ1 - μ2, although as we will see, the linked sample case is really a reduction of the paired values to a one-sample t-test.

First, we see how to do a basic two sample t-test using the ... data.

11.2.1 Two paired samples

The sleep data that is in the datasets package has two recordings for each of ten individuals. In an experimental design sense, we would think of the individuals as a blocking factor. In fact we will see how this is done in Section ??.

The data set lists the twenty observations in a single column. This is a little unusual for paired data that will be analyzed using the t.test() command. Caution must be expressed because the variable that shows how the observations are paired is not stated. If the observations are not ordered the same for each group, then the paring will be done incorrectly. If for any reason you cannot be sure of the ordering of the data then the use of the aov() command as discussed in Section ?? is strongly advised.

Using the attach() command to be able to refer to the variables directly speeds up this process.

> attach(sleep)

We employ the t.test() command with the paired argument set to TRUE. We can use either of two approaches. Either we separate the two groups apart using subscripting

> t.test(extra[group=="1"], extra[group=="2"], paired=TRUE)

Paired t-test

data:  extra[group == "1"] and extra[group == "2"]
t = -4.1, df = 9, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4599 -0.7001
sample estimates:
mean of the differences
-1.58

which is the way we would normally work when we have the two measurements stored using two distinct variables; or, use the formula approach:

> t.test(extra~group, paired=TRUE)

Paired t-test

data:  extra by group
t = -4.1, df = 9, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4599 -0.7001
sample estimates:
mean of the differences
-1.58

You could of course alter the analysis by adding arguments to these commands to suit your needs. Experiment a little before detaching the sleep data using the detach() command:

> detach(sleep)

11.5 Hypothesis tests and confidence intervals for Correlation coefficients

In Section 8.4 we examined the correlation of the variables in the airquality data set using the cor() command. If we require a formal hypothesis test for the significance of the correlation coefficient we will need to use the cor.test() command. The difference between the two commands is that we could offer the entire data set as the object to work on with the cor() command while we need to specify two vectors for the cor.test() command.

There are problems with cherry-picking one pair of variables from a set to perform a post hoc test like this. We should probably consider finding the significance of all correlations of interest and adjusting the way we think about the likelihood of getting the set of hypothesis tests we observe as a combination using a Bonn Ferroni adjustment in our analysis; that is beyond the scope of this exposition so for the moment just assume we are (and were before data was collected) interested in only the correlation that exists between a single pair of variables. We use the Ozone and Wind variables from the airquality data.

To test the notion that there is zero correlation between the two variables we would probably consider using Pearson’s correlation coefficient which measures the strength of the linear relationship between a pair of variables, but use of Spearman’s rank correlation coefficient will measure the strength of any monotonic relationship that might exist. I prefer to use both at the same time in many circumstances.

> attach(airquality)
> cor.test(Ozone, Wind)

Pearson's product-moment correlation

data:  Ozone and Wind
t = -8, df = 110, p-value = 9e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7064 -0.4709
sample estimates:
cor
-0.6015

> cor.test(Ozone, Wind, method="spearman")

Warning in cor.test.default(Ozone, Wind, method = "spearman"): Cannot compute exact p-value with ties

Spearman's rank correlation rho

data:  Ozone and Wind
S = 410000, p-value = 3e-12
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.5902

> detach(airquality)

So it seems the two variables are correlated and have a monotonic relationship that implies that as wind increases the amount of ozone decreases and that this relationship is fairly linear. We will look at this relationship again when we look at linear regression in Chapter 12.

11.6 Testing the independence of two categorical variables

using the chisq.test command. Refer graphical representation to the vcd package.

11.7 Testing the normality of a distribution

norm test package here.

We can use the ks.test() function for the Kolmogorov-Smirnov test. It is suitable for comparing two samples, or a single sample against a theoretical distribution, or the shapiro.test() function which is specifically for testing for normality.