Download the R markdown file for this lecture.
We saw previously that when linear models are fitted to time series data, the residuals can display autocorrelation.
When this occurs, it suggests that the model errors are not independent.
In this lecture we shall examine:
Usually we fit a linear model to data by the method of least squares.
Under standard assumptions (A1)-(A4), the least squares estimators are unbiased for the regression parameters. In other words, \(E[\hat \beta_i] = \beta_i\) for i=1,…,p.
This unbiasedness property holds even if the error terms are autocorrelated.
However, standard errors given by standard least squares theory (and hence by standard R output) will usually be smaller than the true standard deviations of the parameter estimates.
The result is that t-test statistics will be inflated, and confidence intervals will be too narrow.
The previous discussion indicates that we need to account for autocorrelation in the errors when it is present.
This will require two steps:
We will turn first to modelling for autocorrelated errors using autoregressive models.
In a time series, the errors \(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\) can be thought of as a realization of a random process (in time).
We can model serial correlation between these errors by making each one explicitly dependent on the previous one: \[\varepsilon_t = \phi \varepsilon_{t-1} + \eta_t\] where \(\eta_t~(t=1,2,\ldots,n)\) is white noise, that is, a sequence of independent and identically distributed (normal) random variables with zero mean.
To illustrate an ACF plot for a function demonstrating autocorrelation, we create two new variables from white noise
= rnorm(1000)
Z1 = Z1[-1] + 0.85 * Z1[-1000]
X1 = Z1[-1] - 0.85 * Z1[-1000]
X2 par(mfrow = c(2, 2))
plot(X1, type = "l")
acf(X1)
plot(X2, type = "l")
acf(X2)
The X1
series has correlation of 0.51 at lag1, while the X2
series has autocorrelation of -0.47 at lag1
If residuals in a linear model for time series data display autocorrelation then we change the model to allow for an AR1 correlation structure for the residuals in our model to account for this.
N.B. More complex models are also possible. For example:
## Tourism <- read.csv(file = "Motel.csv", header = TRUE)
$Yr = Tourism$Year - 1979 + Tourism$Month/12
Tourism$Month <- factor(Tourism$Month) Tourism
<- lm(RoomNights ~ Yr + Month, data = Tourism)
Tourism.lm acf(resid(Tourism.lm))
There is a suggestion of exponential decay in autocorrelations. - AR1 model (with positive \(\phi\)) for errors seems reasonable.
The model incorporating the AR1 error process, and written in regression format, is as follows:
\[\begin{aligned} Y_{t} &=& \beta_0 + \beta_1 \mbox{Yr} + \beta_2 z_{t,2} + \beta_3 z_{t,3} + \ldots + \beta_{12} z_{t,12} + \varepsilon_t \\ \varepsilon_t &=& \phi \varepsilon_{t-1} + \eta_{t}\end{aligned}\]
where:
RoomNights
;To fit a model with AR1 errors, we must:
The AR1 parameter, \(\phi\), can be estimated by the method of (restricted) maximum likelihood.
The details are beyond the scope of this paper, but you will meet maximum likelihood estimation if you take 161.200: Statistical Modelling.
Once \(\phi\) is estimated, we can account for the AR1 nature of the errors by fitting the linear model using generalized least squares.
The mathematical details of generalized least squares will not be covered in this paper.
In R, linear models can be fitted using this technique using the gls()
command found in the nlme
add-on package. Call this package using library(nlme)
before using the gls()
command.
gls()
command uses the same model formula as lm()
, but also accepts an additional argument to describe the correlation structure for the errors.## Tourism <- read.csv(file = "Motel.csv", header = TRUE)
$Yr = Tourism$Year - 1979 + Tourism$Month/12
Tourism$Month <- factor(Tourism$Month)
Tourism<- gls(RoomNights ~ Yr + Month, correlation = corAR1(), data = Tourism)
Tourism.gls anova(Tourism.gls)
Denom. DF: 167
numDF F-value p-value
(Intercept) 1 45239.15 <.0001
Yr 1 1092.85 <.0001
Month 11 81.94 <.0001
summary(Tourism.gls)
Generalized least squares fit by REML
Model: RoomNights ~ Yr + Month
Data: Tourism
AIC BIC logLik
3722.941 3769.711 -1846.471
Correlation Structure: AR(1)
Formula: ~1
Parameter estimate(s):
Phi
0.392721
Coefficients:
Value Std.Error t-value p-value
(Intercept) 261682.96 4946.534 52.90229 0.0000
Yr 12747.00 382.435 33.33115 0.0000
Month2 -31194.09 4205.044 -7.41825 0.0000
Month3 23482.73 4959.531 4.73487 0.0000
Month4 -2170.47 5225.494 -0.41536 0.6784
Month5 -19419.12 5325.710 -3.64630 0.0004
Month6 -65827.11 5362.987 -12.27434 0.0000
Month7 -44522.34 5373.207 -8.28599 0.0000
Month8 -29768.74 5365.559 -5.54812 0.0000
Month9 -12039.27 5331.948 -2.25795 0.0252
Month10 26393.95 5238.620 5.03834 0.0000
Month11 16471.67 4988.022 3.30224 0.0012
Month12 -58631.94 4272.401 -13.72342 0.0000
Correlation:
(Intr) Yr Month2 Month3 Month4 Month5 Month6 Month7 Month8 Month9
Yr -0.640
Month2 -0.424 0.003
Month3 -0.498 -0.001 0.589
Month4 -0.522 -0.005 0.462 0.659
Month5 -0.529 -0.011 0.416 0.535 0.682
Month6 -0.529 -0.016 0.398 0.487 0.560 0.690
Month7 -0.526 -0.022 0.391 0.468 0.512 0.568 0.692
Month8 -0.521 -0.028 0.387 0.460 0.492 0.520 0.571 0.692
Month9 -0.514 -0.034 0.383 0.454 0.481 0.498 0.520 0.569 0.690
Month10 -0.501 -0.039 0.376 0.445 0.470 0.482 0.493 0.513 0.561 0.683
Month11 -0.473 -0.044 0.358 0.423 0.446 0.456 0.463 0.471 0.491 0.538
Month12 -0.400 -0.048 0.308 0.364 0.384 0.392 0.396 0.400 0.407 0.425
Mnth10 Mnth11
Yr
Month2
Month3
Month4
Month5
Month6
Month7
Month8
Month9
Month10
Month11 0.662
Month12 0.471 0.597
Standardized residuals:
Min Q1 Med Q3 Max
-2.5656600 -0.6285951 -0.1307054 0.5424996 3.2823396
Residual standard error: 14799.38
Degrees of freedom: 180 total; 167 residual
Comments on the Analysis
The use of an AR1 error process is obtained by setting
correlation=corAR1()
in the
gls()
command.Error variance and correlation estimated by REML (restricted maximum likelihood).
Estimated value of AR1 correlation parameter is \(\hat \phi = 0.393\).
Both
Yr
andMonth
are associated with RoomNights (P-value less than \(0.0001\) in both cases).Comparison of the output from
summary(Tourism.gls)
with the corresponding output from the earlier analysis (assuming independent errors) shows that the standard errors are usually larger here, particularly for the intercept and the coefficient ofYr
.The only remaining question is, was the use of
gls()
successful?