Lecture 30: Introduction to Linear Modelling for Time Series

Download the R markdown file for this lecture.

A time series comprises observations taken at a sequence of time points.

In most situations these time points will be (approximately) evenly spaced. For example:

Monthly rainfall figures
Quarterly unemployment rates

Linear models can be applied to time series data, but typically it will prove necessary to generalize the structure of the error terms from the simple independent errors that we have assumed so far.

Tourism in Victoria

Data are number of room nights occupied in hotels, motels and guesthouses in Victoria. Observations are monthly from January 1980 to December 1994. Data source: Australian Bureau of Statistics.

Download Motel.csv

Tourism <- read.csv(file = "Motel.csv", header = TRUE)
str(Tourism)

'data.frame':   180 obs. of  4 variables:
 $ Year      : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
 $ Month     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RoomNights: int  276986 260633 291551 275383 275302 231693 238829 274215 277808 299060 ...
 $ AvePrice  : num  27.7 28.7 28.6 28.3 28.7 ...

Tourism$Yr = Tourism$Year - 1979 + Tourism$Month/12
head(Tourism)

  Year Month RoomNights AvePrice       Yr
1 1980     1     276986    27.70 1.083333
2 1980     2     260633    28.67 1.166667
3 1980     3     291551    28.60 1.250000
4 1980     4     275383    28.34 1.333333
5 1980     5     275302    28.66 1.416667
6 1980     6     231693    28.57 1.500000

Note that Month is coded 1, 2, ..., 12 and is currently an integer valued variable.

Yr is constructed to represent a “fractional year after 1979” (accounting for month).

We will use the average price variable in the practical exercise for this week.

Tourism$NYear = 1979 + Tourism$Yr  # to get nicely spaced points along x axis
plot(RoomNights ~ NYear, xlab = "Year", type = "l", data = Tourism)

There is clear evidence of upward trend and seasonal (monthly) variation.

Variation in a Time Series

Possible sources of variation in a time series are:

Secular trend (or just trend): tendency of the series to increase or decrease over a long period of time.
Seasonal variation: describes fluctuations that recur during specific parts of the year (e.g. quarterly or monthly).
Residual variation (or innovations): the part of the variation which is not explained by long term trend or seasonal effects.
An additional cyclical source of variation (corresponding to business cycles, for example) is sometimes identified.
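
To make these sources of variation concrete, here is a minimal sketch that builds an artificial monthly series as the sum of a trend, a seasonal pattern and residual noise. All of the numbers (the slope, the sine-shaped seasonal effect, the noise standard deviation) are invented for illustration and have nothing to do with the Tourism data.

```r
# Simulated monthly series: trend + seasonal effect + residual variation.
# Every constant below is an arbitrary choice made for illustration.
set.seed(1)
n_years <- 10
month <- rep(1:12, times = n_years)
time_index <- seq_along(month) / 12              # time measured in years
trend <- 100 + 5 * time_index                    # linear secular trend
season_effect <- 10 * sin(2 * pi * month / 12)   # repeats every 12 months
noise <- rnorm(length(month), sd = 2)            # residual variation
y <- trend + season_effect + noise
```

Running plot(time_index, y, type = "l") on the result should show a pattern broadly similar in shape to a trending seasonal series.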

Modelling A Time Series

Time series data can be modelled using linear models (although there are a number of alternative approaches).

Long term trend can be modelled using polynomial regression.
Seasonal effects can be represented by specifying the seasons (e.g. months, quarters) as a factor in the model.
Additional covariates can sometimes be incorporated in such models (e.g. standard economic indicators may be included to help explain variation in sales data).
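
As a rough sketch of why the seasons are entered as a factor, the following fits two models to simulated data (the data frame sim and its columns are hypothetical, not the Tourism data). With a numeric month the model estimates a single slope across the year. With a factor it estimates a separate effect for each month.

```r
# Simulated data with a linear trend and a smooth seasonal pattern (illustrative only).
set.seed(2)
sim <- data.frame(
  t = (1:120) / 12,               # time in years
  month = rep(1:12, times = 10)   # month coded 1, ..., 12
)
sim$y <- 100 + 5 * sim$t + 10 * sin(2 * pi * sim$month / 12) + rnorm(120, sd = 2)

fit_numeric <- lm(y ~ t + month, data = sim)          # month treated as a number
fit_factor  <- lm(y ~ t + factor(month), data = sim)  # one effect per month

length(coef(fit_numeric))  # 3: intercept, t, month
length(coef(fit_factor))   # 13: intercept, t, 11 month contrasts
anova(fit_numeric, fit_factor)
```

The numeric coding cannot capture a pattern that rises and falls within the year, so the factor model should fit these simulated data far better.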

Back to the Tourism Data

Model Fitting and ANOVA

Tourism$Month <- factor(Tourism$Month)
Tourism.lm1 <- lm(RoomNights ~ Yr + Month, data = Tourism)
anova(Tourism.lm1)

Analysis of Variance Table

Response: RoomNights
           Df     Sum Sq    Mean Sq  F value    Pr(>F)    
Yr          1 5.4099e+11 5.4099e+11 2494.398 < 2.2e-16 ***
Month      11 1.5589e+11 1.4172e+10   65.346 < 2.2e-16 ***
Residuals 167 3.6219e+10 2.1688e+08                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(Tourism.lm1)


Call:
lm(formula = RoomNights ~ Yr + Month, data = Tourism)

Residuals:
   Min     1Q Median     3Q    Max 
-37721  -8900  -1826   7931  48624 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261275.9     4321.6  60.458  < 2e-16 ***
Yr           12705.7      254.1  50.010  < 2e-16 ***
Month2      -30740.9     5377.5  -5.717 4.90e-08 ***
Month3       24115.9     5377.7   4.484 1.35e-05 ***
Month4       -1464.5     5377.9  -0.272 0.785712    
Month5      -18682.5     5378.2  -3.474 0.000654 ***
Month6      -65076.5     5378.5 -12.099  < 2e-16 ***
Month7      -43764.3     5379.0  -8.136 8.89e-14 ***
Month8      -29006.2     5379.5  -5.392 2.35e-07 ***
Month9      -11274.0     5380.2  -2.095 0.037636 *  
Month10      27159.2     5380.9   5.047 1.16e-06 ***
Month11      17231.1     5381.7   3.202 0.001635 ** 
Month12     -57892.7     5382.5 -10.756  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14730 on 167 degrees of freedom
Multiple R-squared:  0.9506,    Adjusted R-squared:  0.947 
F-statistic: 267.8 on 12 and 167 DF,  p-value: < 2.2e-16

plot(RoomNights ~ NYear, xlab = "Year", data = Tourism)
lines(Tourism$NYear, fitted(Tourism.lm1), col = "blue")
title("Fitted Linear Model")

Comments

Yr (which is essentially time) is the appropriate covariate to track trend (not Year, which would ignore secular trend during a year).

It is important to remember to code Month as a factor so as to represent seasonal effects.

There is strong evidence of trend (\(P < 2.2 \times 10^{-16}\) for Yr) and of seasonality (\(P < 2.2 \times 10^{-16}\) for Month) in the data.

As might be expected, room bookings tend to be low in the winter months (June to August in Victoria). The pattern over summer is less clear. Perhaps December is a bad month because of the Christmas effect?

We have assumed a linear secular trend. Would quadratic be better?

More Model Fitting in R

Tourism.lm2 <- lm(RoomNights ~ poly(Yr, 2) + Month, data = Tourism)
anova(Tourism.lm1, Tourism.lm2)

Analysis of Variance Table

Model 1: RoomNights ~ Yr + Month
Model 2: RoomNights ~ poly(Yr, 2) + Month
  Res.Df        RSS Df Sum of Sq     F Pr(>F)
1    167 3.6219e+10                          
2    166 3.5762e+10  1 456501120 2.119 0.1474

No evidence that quadratic trend improves on linear (P=0.1474).
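
The F statistic in this table can be checked by hand from the quantities printed above. The sketch below uses the rounded values from the table, so it reproduces the F value and P-value only approximately. The variable names are chosen here for illustration.

```r
# Values copied from the ANOVA table comparing the two models.
extra_ss      <- 456501120   # reduction in RSS from adding the quadratic term
rss_quadratic <- 3.5762e10   # residual sum of squares for Model 2
df_quadratic  <- 166         # residual degrees of freedom for Model 2
extra_terms   <- 1           # number of extra parameters in Model 2

f_stat  <- (extra_ss / extra_terms) / (rss_quadratic / df_quadratic)
p_value <- pf(f_stat, extra_terms, df_quadratic, lower.tail = FALSE)
round(c(F = f_stat, P = p_value), 4)
```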

Residuals Versus Time

plot(Tourism$Yr, resid(Tourism.lm1), type = "l", xlab = "Year-1979")

The time plot of residuals for the linear trend model suggests that there are extended periods when the residuals are almost all negative (1986-1988) and extended periods where the residuals are almost all positive (1989-1990).

Such behaviour should not be observed if the errors are independent.

However, for time series data it is common for some correlation between residuals to remain even when the trend and seasonal variation have been removed.

This type of correlation in the sequence of residuals is usually called autocorrelation.

A quick test for autocorrelation

Sometimes it won’t be so easy to see patterns in plots. The Durbin-Watson Test was developed back when generating plots was a slow process.

N.B. The Durbin-Watson Test is not the only way to look for autocorrelation.

We need to use the lmtest package to get the dwtest() function. This function can be set to look for positive and/or negative autocorrelation, with the default action set to look for positive autocorrelation. You may need to install the lmtest package before running the following example:

library(lmtest)
dwtest(Tourism.lm1, alternative = "two.sided")


    Durbin-Watson test

data:  Tourism.lm1
DW = 1.2228, p-value = 3.893e-07
alternative hypothesis: true autocorrelation is not 0

So, we reject the null hypothesis that there is no autocorrelation in the residuals for the Tourism.lm1 model.
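
The Durbin-Watson statistic itself is simple to compute: it is the sum of squared differences between successive residuals divided by the sum of squared residuals, and it is roughly \(2(1 - r_1)\), where \(r_1\) is the lag-one correlation of the residuals. The sketch below illustrates this with base R only, using simulated values in place of real residuals. The simulated series and the AR coefficient of 0.4 are assumptions made for illustration.

```r
# Simulated stand-in for a residual series with positive autocorrelation.
set.seed(3)
e <- as.numeric(arima.sim(model = list(ar = 0.4), n = 180))

# Durbin-Watson statistic computed from its definition.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-one autocorrelation, treating the series as having mean zero.
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(DW = dw, approx = 2 * (1 - r1))
```

Values of DW near 2 are consistent with no autocorrelation, and values well below 2 suggest positive autocorrelation. By the same approximation, DW = 1.2228 for Tourism.lm1 corresponds to a lag-one correlation of roughly 0.39.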

Stationary Processes and Autocorrelation

Consider a random process in time: \(Z_t\) where \(t = 1, 2, \ldots\) and \(Z_t\) represents the value of the process at time \(t\).

This process is said to be (weakly) stationary if:

\(E[Z_t]\) and \(\mbox{Var}(Z_t)\) do not change with time \(t\).
The correlation \(\mbox{Corr}(Z_t, Z_{t+k})\) depends only on the time lag \(k\).

It is common to model the residuals from a time series as a stationary random process with zero mean.

For a stationary process, the autocorrelation function (or ACF) is defined by \[\rho(k) = \mbox{Corr}(Z_t, Z_{t+k})\]

The ACF (autocorrelation against time lag) can be plotted in R using the acf() command.
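
As a sketch of what acf() estimates, the following computes the sample autocorrelation directly and compares it with the built-in result. The helper sample_acf() and the simulated series z are hypothetical names introduced here, and the series is simulated rather than taken from the Tourism data.

```r
# Simulated series standing in for a set of residuals (illustrative only).
set.seed(4)
z <- as.numeric(arima.sim(model = list(ar = 0.4), n = 180))

# Sample autocorrelation at lag k: lagged cross-products of the centred
# series divided by the total sum of squares.
sample_acf <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  sum(d[1:(n - k)] * d[(1 + k):n]) / sum(d^2)
}

by_hand  <- sapply(1:4, function(k) sample_acf(z, k))
built_in <- acf(z, lag.max = 4, plot = FALSE)$acf[2:5]  # element 1 is lag 0
all.equal(by_hand, built_in)
```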

Tourism Data: ACF Plot for Residuals

acf(resid(Tourism.lm1))

The ACF plot indicates a correlation of about \(0.4\) at lag one. Hence consecutive residuals are positively dependent.
The dashed horizontal lines on the plot are a 95% confidence interval under the assumption that the true autocorrelation is zero.
Any correlation lying within this confidence interval may be just noise.
Any correlation lying outside this confidence interval is probably indicative of true serial dependence in the data.
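
The position of the dashed lines depends only on the length of the series: under the assumption of zero autocorrelation they sit at approximately \(\pm 1.96/\sqrt{n}\). A minimal sketch for a series of 180 observations, the length of the Tourism data:

```r
n <- 180                          # number of monthly observations
bound <- qnorm(0.975) / sqrt(n)   # approximate 95% limit when the true autocorrelation is zero
round(c(lower = -bound, upper = bound), 3)
```

A lag-one correlation of about 0.4 is therefore well outside limits of roughly \(\pm 0.15\).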

For the tourism data it seems that there is serial dependence in the data, since the correlations at lags 1, 2, 3 and 4 all extend beyond the confidence interval bounds.

The existence of several significant correlations is common when autocorrelation exists. If the correlation between the \(i\)th and \(j\)th variables is high, and so is the correlation between the \(j\)th and \(k\)th variables, we ought to expect the correlation between the \(i\)th and \(k\)th variables to be high as well. For a time series, the correlation between the observations at times \(t\) and \(t+1\) is the same as that between the observations at times \(t+1\) and \(t+2\), because both are pairs of successive observations. If observations are correlated with their preceding observations, and those preceding observations are correlated with their own preceding observations, then there is likely to be a correlation between observations that are two time steps apart. This logic extends to lags of three, four and more, and is especially likely to hold when the correlation at lag one is very high.
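
One simple way to see this carry-over is a first-order autoregressive process, in which each value depends directly only on the previous one. This is offered purely as an illustrative assumption, not as a model fitted to the tourism residuals. For such a process the theoretical autocorrelation at lag \(k\) is the lag-one correlation raised to the power \(k\), so a lag-one dependence alone produces non-zero correlations at higher lags.

```r
phi <- 0.4                                      # assumed lag-one autocorrelation
theoretical <- ARMAacf(ar = phi, lag.max = 4)   # theoretical ACF at lags 0 to 4
round(theoretical, 4)
all.equal(unname(theoretical), phi^(0:4))       # matches phi^k
```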