Download the R markdown file for this lecture.

Simple and multiple regression models are types of linear model.

Remember that the word linear refers to the fact that the response is linearly related to the regression parameters, not to the predictors.

In most regression models, the relationship between the response and each of the explanatory variables is linear.

We can represent more general relationships using polynomial functions of the explanatory variables.

The resulting models are often referred to as polynomial regression models.

Child Lung Function Data

FEV (forced expiratory volume) is a measure of lung function. The data comprise determinations of FEV on 318 female children who were seen in a childhood respiratory disease study in Massachusetts in the U.S. Each child's age (in years) was also recorded. We are interested in modelling FEV (the response) as a function of age.

Download fev.csv

Data source: Tager, I. B., Weiss, S. T., Rosner, B., and Speizer, F. E. (1979). Effect of parental cigarette smoking on pulmonary function in children. American Journal of Epidemiology, 110, 15-26.

Example : Scatter Plot

Fev <- read.csv(file = "https://r-resources.massey.ac.nz/161221/data/fev.csv", header = TRUE)
plot(FEV ~ Age, data = Fev)


The scatter plot of the data indicates that the relationship between FEV and age is not linear. We will therefore try modelling FEV as a polynomial in age.

The Polynomial Regression Model

The polynomial linear regression model is \[Y_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_i^2 + \ldots + \beta_p x_{i}^p + \varepsilon_i ~~~~~~~~~ (i=1,\ldots,n)\]

where Yi and xi are the response and explanatory variable observed on the ith individual.

\(\varepsilon_1, \ldots, \varepsilon_n\) are random errors satisfying the usual assumptions (A1) - (A4).

\(\beta_0, \beta_1, \ldots, \beta_p\) are the regression parameters.

The polynomial regression equation and (A1) imply that \[E[Y_i] = \mu_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_i^2 + \ldots + \beta_p x_{i}^p\]
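Equivalently (a standard restatement, not specific to these notes), the model can be written in matrix form, which makes explicit that it is linear in the parameters \(\beta_0, \ldots, \beta_p\) even though it is nonlinear in \(x\):

\[\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
X = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^p \\
1 & x_2 & x_2^2 & \cdots & x_2^p \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^p
\end{pmatrix}\]

The powers of \(x\) simply play the role of separate columns in the design matrix, which is why polynomial regression is still a linear model.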

Notes on The Polynomial Regression Model

We have defined a pth order polynomial regression model.

A polynomial model of order p should include all lower-order terms, i.e. every power of x from 1 up to p (together with the intercept).

This last point is important. As an example, a cubic polynomial should include linear and quadratic terms: \[E[Y_i] = \beta_0 + \beta_1 x_{i} + \beta_2 x_i^2 + \beta_3 x_{i}^3\] is correct; we should not omit (e.g.) the quadratic term: \[E[Y_i] = \beta_0 + \beta_1 x_{i} + \beta_3 x_{i}^3\]
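One way to see why the lower-order terms matter: the full cubic model is unaffected by a shift of origin in x, whereas a model with a "gap" is not. A minimal sketch with simulated data (the variable names and coefficients here are illustrative, not from the lecture):

```r
set.seed(1)
x <- runif(50, 5, 15)
y <- 1 + 0.2 * x + 0.05 * x^2 - 0.001 * x^3 + rnorm(50, sd = 0.1)

# Full cubic: shifting x by 10 does not change the fitted values,
# because the model space contains all polynomials of degree <= 3.
full.a <- lm(y ~ x + I(x^2) + I(x^3))
full.b <- lm(y ~ I(x - 10) + I((x - 10)^2) + I((x - 10)^3))
all.equal(fitted(full.a), fitted(full.b))  # TRUE

# Omitting the quadratic term breaks this: the fitted values now
# depend on where we place the origin of x.
gap.a <- lm(y ~ x + I(x^3))
gap.b <- lm(y ~ I(x - 10) + I((x - 10)^3))
isTRUE(all.equal(fitted(gap.a), fitted(gap.b)))  # FALSE
```

In other words, a model with missing lower-order terms makes an arbitrary assumption about where zero is on the x scale.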

Example : Model Fitting in R

A sixth order polynomial model can be fitted with

lm(FEV ~ Age + I(Age^2) + I(Age^3) + I(Age^4) + I(Age^5) + I(Age^6), data = Fev)

Notice that the powers of Age appear within the function I(). This function “protects” its contents, in the sense that R reads them as an ordinary arithmetic expression rather than as model formula notation.

For example, I(Age^4) is telling R to include the fourth power of Age in the model.

N.B. the operators *, / and ^ have a different meaning in a model formula (explained in later lectures).
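A small illustration of the difference, using simulated data (not from the lecture): inside a formula, ^ is formula syntax, so x^2 on its own does not square x, whereas I(x^2) does.

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x^2 + rnorm(20)

m.pow <- lm(y ~ x^2)     # ^ is formula syntax here: same model as y ~ x
m.lin <- lm(y ~ x)
all.equal(coef(m.pow), coef(m.lin))  # TRUE: no quadratic term was fitted

m.sq <- lm(y ~ I(x^2))   # I() protects ^: a genuine quadratic term
names(coef(m.sq))        # "(Intercept)" "I(x^2)"
```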

Modern practice is to avoid all that typing though!

Example : Sixth Order Model

Fev.lm.6 <- lm(FEV ~ poly(Age, 6, raw = TRUE), data = Fev)
summary(Fev.lm.6)

Call:
lm(formula = FEV ~ poly(Age, 6, raw = TRUE), data = Fev)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2047 -0.2791  0.0186  0.2509  0.9210 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)               -1.520e+00  5.052e+00  -0.301    0.764
poly(Age, 6, raw = TRUE)1  2.079e+00  3.533e+00   0.588    0.557
poly(Age, 6, raw = TRUE)2 -6.468e-01  9.735e-01  -0.664    0.507
poly(Age, 6, raw = TRUE)3  1.033e-01  1.359e-01   0.761    0.447
poly(Age, 6, raw = TRUE)4 -8.145e-03  1.017e-02  -0.801    0.424
poly(Age, 6, raw = TRUE)5  3.056e-04  3.882e-04   0.787    0.432
poly(Age, 6, raw = TRUE)6 -4.360e-06  5.931e-06  -0.735    0.463

Residual standard error: 0.3902 on 311 degrees of freedom
Multiple R-squared:  0.6418,    Adjusted R-squared:  0.6349 
F-statistic: 92.87 on 6 and 311 DF,  p-value: < 2.2e-16

Example : Fifth Order Model

Fev.lm.5 <- lm(FEV ~ poly(Age, 5, raw = TRUE), data = Fev)
summary(Fev.lm.5)

Call:
lm(formula = FEV ~ poly(Age, 5, raw = TRUE), data = Fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.19431 -0.27623  0.02088  0.24471  0.93266 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                1.750e+00  2.393e+00   0.731    0.465
poly(Age, 5, raw = TRUE)1 -3.209e-01  1.349e+00  -0.238    0.812
poly(Age, 5, raw = TRUE)2  3.714e-02  2.858e-01   0.130    0.897
poly(Age, 5, raw = TRUE)3  5.707e-03  2.860e-02   0.200    0.842
poly(Age, 5, raw = TRUE)4 -7.392e-04  1.359e-03  -0.544    0.587
poly(Age, 5, raw = TRUE)5  2.083e-05  2.467e-05   0.844    0.399

Residual standard error: 0.3899 on 312 degrees of freedom
Multiple R-squared:  0.6412,    Adjusted R-squared:  0.6354 
F-statistic: 111.5 on 5 and 312 DF,  p-value: < 2.2e-16

Example : Fourth Order Model

Fev.lm.4 <- lm(FEV ~ poly(Age, 4, raw = TRUE), data = Fev)
summary(Fev.lm.4)

Call:
lm(formula = FEV ~ poly(Age, 4, raw = TRUE), data = Fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20203 -0.27466  0.02218  0.24525  0.93709 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                3.5685906  1.0434415   3.420 0.000709 ***
poly(Age, 4, raw = TRUE)1 -1.3941151  0.4529027  -3.078 0.002267 ** 
poly(Age, 4, raw = TRUE)2  0.2712825  0.0692350   3.918 0.000110 ***
poly(Age, 4, raw = TRUE)3 -0.0181507  0.0044318  -4.096 5.37e-05 ***
poly(Age, 4, raw = TRUE)4  0.0004055  0.0001007   4.029 7.05e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3897 on 313 degrees of freedom
Multiple R-squared:  0.6404,    Adjusted R-squared:  0.6358 
F-statistic: 139.3 on 4 and 313 DF,  p-value: < 2.2e-16

Example : Notes on the Fitted Models

We started by trying a 6th order polynomial model, Fev.lm.6. The 6th order term can be removed (coefficient of poly(Age,6) not significant).

We then tried a 5th order polynomial model, Fev.lm.5. The 5th order term can be removed (coefficient of poly(Age,5) not significant).

We then tried a 4th order polynomial model, Fev.lm.4. The 4th order term was statistically significant, so this is our preferred model.
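This backwards elimination can equivalently be carried out with nested-model F-tests via anova(); when a single term is dropped, the F statistic equals the square of that term's t statistic. A sketch with simulated data (variable names and coefficients illustrative, not the FEV fit):

```r
set.seed(1)
age <- runif(100, 5, 15)
fev <- 1 + 0.3 * age - 0.01 * age^2 + rnorm(100, sd = 0.3)

m4 <- lm(fev ~ poly(age, 4, raw = TRUE))
m5 <- lm(fev ~ poly(age, 5, raw = TRUE))

# F-test for adding the 5th-order term to the 4th-order model;
# the F statistic equals the squared t statistic on that term in m5.
tab <- anova(m4, m5)
c(F = tab$F[2], t.squared = summary(m5)$coefficients[6, 3]^2)
```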

Our fitted model equation is: \[E[\mbox{FEV}] = 3.57 - 1.39~\mbox{Age} + 0.27~\mbox{Age}^2 - 0.018~\mbox{Age}^3 + 0.00041~\mbox{Age}^4\]

Plotting the fitted curve over the data, the model looks adequate for ages in the range 5 to 15 years. At the extreme ends of the age range the fitted curve exhibits spurious upward curvature. This feature is an artefact of the form of the polynomial rather than a characteristic of the data.

Example : Collinearity in the Fitted Model

There is noticeable collinearity between the powers of Age, producing (for example) lots of non-significant t-statistics for the regression coefficients in Fev.lm.6.

Collinearity in polynomial regression can lead to numerical instability in statistical software packages.

Problems of collinearity can be addressed by using orthogonal polynomials, coming up in a lecture quite soon.
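To see the collinearity numerically (with illustrative values, not the FEV data): successive raw powers of a positive variable are very highly correlated, while the columns produced by poly() in its default (orthogonal) form are uncorrelated by construction.

```r
x <- seq(5, 15, length.out = 100)  # ages similar in range to the FEV data

raw <- poly(x, 4, raw = TRUE)      # columns x, x^2, x^3, x^4
round(cor(raw), 3)                 # off-diagonal correlations all close to 1

orth <- poly(x, 4)                 # default: orthogonal polynomials
round(cor(orth), 3)                # off-diagonal correlations all 0
```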