Lecture 28 Nonlinear least squares

In class version

This topic introduces nonlinear regression models, iterative fitting by direct search, and the Gauss-Newton algorithm.

28.1 When the Response is a Nonlinear Function of the Regression Parameters

We consider what happens when the relationship between \(Y\) (response) and \(X\) (predictor) is known to be a particular curve that cannot be made linear by transformation.

That is, suppose \[ y = f(x) + \varepsilon\]

Where \(f(x)\) denotes some function of \(x\) (e.g. a curve), and where the error variation \(\varepsilon\) around the curve is normally distributed with constant variance as before.

Suppose we have a general idea of the shape of the curve, but the precise details depend on some unknown parameters, say \(\alpha, \beta\) and \(\gamma\).

The task, as in ordinary linear regression, reduces to that of estimating the parameters. Now in principle we can estimate \(\alpha, \beta\) and \(\gamma\) by least squares as before. That is, we simply choose those estimates that make the squared error of our curve as small as possible: i.e. the least squares problem is to minimize \[ SS_{error} = \sum \left(y_i - f(x_i; \alpha,\beta,\gamma )\right) ^2 \] or equivalently minimize the residual standard error \(S = \sqrt{SS_{error}/{(n-p)}}\).

The only difficulty or difference to the previous situation is that we do not have simple formulas to use arising out of the matrix linear algebra, as the curve \(f(x)\) is not linear in \(\alpha, \beta\) and \(\gamma\).
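To make the criterion concrete, SS_error can be minimized numerically with a general-purpose optimiser. The following is a minimal sketch only (the data, starting values and variable names are made up purely for illustration), using R's optim() with the logistic curve from the first example below:

## Minimal sketch: minimize SS_error numerically with optim() (made-up data)
set.seed(1)
x = seq(0, 10, length = 50)
y = 4 * plogis(-3 + 0.8 * x) + rnorm(50, sd = 0.2)   # plogis(z) = exp(z)/(1 + exp(z))

SSerror = function(par) {                            # par = c(alpha, beta0, beta1)
    yhat = par[1] * plogis(par[2] + par[3] * x)
    sum((y - yhat)^2)
}
optim(c(alpha = 3, beta0 = -2, beta1 = 1), SSerror)$par   # least squares estimates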

A couple of examples of functions \(f(x)\) that are nonlinear in the parameters are:

  1. Suppose

\[ f(x_i; \alpha,\beta_0,\beta_1) = \alpha~~ \frac{ e^{\beta_0+ \beta_1 x_i}}{ 1+e^{\beta_0+ \beta_1 x_i} } \]

This is the so-called logistic curve, which models a phenomenon (such as sales) that goes from zero to some upper limit \(\alpha\).

(Note \(\alpha\) is an asymptote for the curve but the individual data can go randomly above or below the curve).

This model cannot be made linear, but in principle we can use any mathematical method we wish to find those estimates of \(\alpha,\beta_0\),\(\beta_1\) that minimize
\(SS_{error} = \sum (y_i - \hat y_i )^2\) where \[\hat y_i = \hat\alpha~~\frac{e^{\hat\beta_0+ \hat\beta_1 x_i} }{ 1+e^{\hat\beta_0+ \hat\beta_1 x_i} } ~~.\]

  2. A somewhat simpler example, but also nonlinear, is

\[ y_i = \beta_0 + \beta_1 x_i^{\alpha} + \varepsilon_i \]

where \(\alpha\) is unknown. Again this cannot be made linear in \(\alpha,\beta_0\) and \(\beta_1\) but in this example the model fitting could be done simply by guessing \(\alpha\) and creating a new explanatory variable \(x_{new} = x^\alpha\). Then \(\beta_0\) and \(\beta_1\) can be found by linear regression of \(y\) on \(x_{new}\).
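As a minimal sketch of that idea (made-up data; the grid of \(\alpha\) values is invented for illustration), we can try a grid of guesses for \(\alpha\), fit a linear regression for each one, and keep the \(\alpha\) giving the smallest residual standard error:

## Direct search over alpha for y = beta0 + beta1 * x^alpha (made-up data)
set.seed(1)
x = 1:30
y = 5 + 2 * x^0.4 + rnorm(30, sd = 0.2)

alphas = seq(0.1, 1, by = 0.05)
S = sapply(alphas, function(a) summary(lm(y ~ I(x^a)))$sigma)   # residual standard error for each guess
best = alphas[which.min(S)]                                     # alpha with the smallest S on this grid
coef(lm(y ~ I(x^best)))                                         # beta0 and beta1 at that alpha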

28.3 The Gauss-Newton Algorithm

The alternative to using direct search is to use a mathematical technique such as the Gauss-Newton algorithm.

The idea of the algorithm is to approximate the curve \(f(x_i; \alpha,\beta,\gamma)\) by a function that is linear in \(\alpha,\beta\) and \(\gamma\): that is, a linearized approximation to the curve, taken about the current parameter estimates.

The mathematical ideas are in a Word document on Stream, but are not examinable.

The main points are:

  1. The process of model-fitting is iterative. That is, it involves a series of repeated linear regressions with slightly different data each time, converging to a fixed answer as we get closer and closer to the optimal estimates.

  2. The Gauss-Newton method is guaranteed to converge quickly to the optimal estimates, provided that our initial starting point is close enough.

  3. The method also produces approximate hypothesis tests and confidence intervals.
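Although the mathematical details are not examinable, a single Gauss-Newton step is easy to carry out "by hand" in R. The sketch below (made-up data and starting values, using the power curve \(y = \beta_0 + \beta_1 x^\alpha\) from the second example above) regresses the current residuals on the derivatives of the curve with respect to each parameter; the fitted coefficients are the adjustments to the current estimates.

## One Gauss-Newton step for y = beta0 + beta1 * x^alpha (made-up data and starting values)
set.seed(1)
x = 1:20
y = 2 + 3 * x^0.5 + rnorm(20, sd = 0.3)

theta = c(alpha = 0.4, beta0 = 1, beta1 = 2)             # starting guesses
f = function(th, x) th["beta0"] + th["beta1"] * x^th["alpha"]

r = y - f(theta, x)                                      # residuals at the current estimates
J = cbind(alpha = theta["beta1"] * x^theta["alpha"] * log(x),   # derivative w.r.t. alpha
          beta0 = rep(1, length(x)),                            # derivative w.r.t. beta0
          beta1 = x^theta["alpha"])                             # derivative w.r.t. beta1
step = coef(lm(r ~ J - 1))                               # linear regression of residuals on the derivatives
theta = theta + step                                     # updated estimates; repeat until they stop changing
theta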

In R, the nls() (nonlinear least squares) function implements the Gauss-Newton algorithm.

We need to specify the formula, and (preferably) reasonable starting values.

# Fit Manhours = beta0 + beta1 * Cases^alpha by nonlinear least squares
manhours.nls = nls(Manhours ~ beta0 + beta1 * (Cases^alpha), start = list(alpha = 0.25,
    beta0 = -16000, beta1 = 4400), data = manhours)

summary(manhours.nls)

Formula: Manhours ~ beta0 + beta1 * (Cases^alpha)

Parameters:
        Estimate Std. Error t value Pr(>|t|)
alpha  2.689e-01  1.585e-01   1.697    0.115
beta0 -1.463e+04  1.256e+04  -1.164    0.267
beta1  3.622e+03  5.954e+03   0.608    0.554

Residual standard error: 909.7 on 12 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 2.462e-06
plot(Manhours ~ Cases)
lines(predict(manhours.nls) ~ Cases)

[Figure: Manhours vs Cases with the fitted nonlinear curve]

We can calculate an approximate confidence interval for \(\alpha\) using \[ \hat\alpha \pm t_{n-p}(0.025)\times SE(\hat\alpha)\]

summary(manhours.nls)$coefficients
           Estimate   Std. Error    t value  Pr(>|t|)
alpha  2.689232e-01     0.158476  1.6969329 0.1154707
beta0 -1.462572e+04 12564.190279 -1.1640799 0.2670141
beta1  3.621670e+03  5954.344902  0.6082398 0.5543666
est = summary(manhours.nls)$coefficients[1]  # [1] = estimate of alpha (the matrix is stored column by column)
se = summary(manhours.nls)$coefficients[4]   # [4] = standard error of alpha
lower = est + qt(0.025, df = 15 - 3) * se    # n - p = 15 - 3 = 12 degrees of freedom
upper = est + qt(0.975, df = 15 - 3) * se
cbind(est, se, lower, upper)
           est       se       lower     upper
[1,] 0.2689232 0.158476 -0.07636641 0.6142128

We can conclude from this that we could not reject a hypothesis that \(\alpha\) equalled any value between -0.07 and 0.6.

In particular, we could have stuck with the square-root transformation (\(\alpha=0.5\)) or used the log transformation (recall this takes the place of \(\alpha=0\) in the ladder of powers).

28.4 Example: A Growth Curve Model with a Maximum

Recall the fev data relating forced expiratory volume (FEV) to girls’ ages. This seemed to flatten off after about age 10.

We will model the growth using a logistic curve with unknown maximum.

\[ f(x_i; \alpha,\beta_0,\beta_1) = \alpha~~\frac{e^{\beta_0+ \beta_1 x_i}}{ 1+e^{\beta_0+ \beta_1 x_i} } \]

Download FEV.csv

## fev <- read.csv(file = "fev.csv", header = TRUE)

FEV = fev$FEV
Age = fev$Age

plot(FEV ~ Age)

[Figure: scatterplot of FEV vs Age]

From the graph the maximum \(\alpha \approx 3\), so setting \[FEV \approx 3 ~~\frac{e^{\beta_0 + \beta_1 Age}}{ 1+ e^{\beta_0 + \beta_1 Age}}\] we can rearrange this as
\[1+ e^{\beta_0 + \beta_1 Age} \approx \frac{3}{FEV} ~ e^{\beta_0 + \beta_1 Age} \]

Gathering up the \(e^{\beta_0 + \beta_1 Age}\) terms gives \(1 \approx \left(\frac{3}{FEV} - 1\right) e^{\beta_0 + \beta_1 Age}\), so that

\[ \frac{1}{3/FEV - 1} \approx e^{\beta_0 + \beta_1 Age}\] and then if we take logs we can use the left-hand side as the “response” in a regression

\[ Y_{new} = \log\left(\frac{1}{3/FEV - 1}\right)\approx \beta_0 + \beta_1 Age \] to give us starting values for \(\beta_0\) and \(\beta_1\):

start.calc = lm(-log(3/FEV - 1) ~ Age)  # -log(3/FEV - 1) is the Y_new above, since log(1/u) = -log(u)
Warning in log(3/FEV - 1): NaNs produced
summary(start.calc)

Call:
lm(formula = -log(3/FEV - 1) ~ Age)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9603 -0.5223 -0.1590  0.2990  5.0842 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.30548    0.21129  -6.179 2.78e-09 ***
Age          0.28493    0.02226  12.799  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9443 on 238 degrees of freedom
  (78 observations deleted due to missingness)
Multiple R-squared:  0.4077,    Adjusted R-squared:  0.4052 
F-statistic: 163.8 on 1 and 238 DF,  p-value: < 2.2e-16

So we will use starting values \(\alpha=3\), \(\beta_0= -1.3\) and \(\beta_1=0.28\). (The NaN warning arises because a few FEV values exceed 3, making \(3/FEV - 1\) negative; those observations are simply dropped in this starting-value calculation.)

fev.nls = nls(FEV ~ alpha * exp(beta0 + beta1 * Age)/(1 + exp(beta0 + beta1 * Age)),
    start = list(alpha = 3, beta0 = -1.3, beta1 = 0.28))
summary(fev.nls)

Formula: FEV ~ alpha * exp(beta0 + beta1 * Age)/(1 + exp(beta0 + beta1 * 
    Age))

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
alpha  3.21164    0.08682   36.99   <2e-16 ***
beta0 -2.16822    0.21482  -10.09   <2e-16 ***
beta1  0.36561    0.03620   10.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3956 on 315 degrees of freedom

Number of iterations to convergence: 5 
Achieved convergence tolerance: 6.266e-06
plot(FEV ~ Age)
lines(30:200/10, predict(fev.nls, newdata = data.frame(Age = 30:200/10)), lty = 1,
    col = 2, lwd = 3)

[Figure: FEV vs Age with the fitted logistic curve]

This has converged. Note the residual standard error is 0.3956, which is slightly bigger than for the piecewise model looked at earlier (S = 0.39), but the curve seems more reasonable than one with a sharp corner.

We can get a slightly better fit by adding a quadratic term in Age. In the following we use starting value \(\beta_2=0\).

fev.nls2 = nls(FEV ~ alpha * exp(beta0 + beta1 * Age + beta2 * Age^2)/(1 + exp(beta0 +
    beta1 * Age + beta2 * Age^2)), start = list(alpha = 3, beta0 = -1.3, beta1 = 0.28,
    beta2 = 0), data = fev)
summary(fev.nls2)

Formula: FEV ~ alpha * exp(beta0 + beta1 * Age + beta2 * Age^2)/(1 + exp(beta0 + 
    beta1 * Age + beta2 * Age^2))

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
alpha  3.03273    0.05788  52.393  < 2e-16 ***
beta0 -0.06772    0.75358  -0.090  0.92846    
beta1 -0.23896    0.21936  -1.089  0.27684    
beta2  0.04437    0.01619   2.741  0.00647 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3906 on 314 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 2.889e-06
plot(FEV ~ Age)
lines(30:200/10, predict(fev.nls2, newdata = data.frame(Age = 30:200/10)), lty = 1,
    col = 2, lwd = 3)

[Figure: FEV vs Age with the fitted logistic curve including the quadratic term]

28.5 Piecewise Regression with an Unknown Knot

Recall that sometimes we wish to fit a piecewise regression model, but the location of the knot is unknown.

This then becomes a nonlinear regression model, with the location of the knot playing the role of the alpha parameter above (not meaning that it enters the equation in the same way, but that if the knot is known, then every other parameter can be fitted by linear regression).

Strictly speaking the Gauss-Newton algorithm does not quite work, as the knot gives a ‘corner’ to the lines, or even a jump in the level of the lines, and this means that the calculus on which the method was originally based breaks down. However, the nls() routine can be made to cope.
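An alternative in the spirit of direct search is to profile over a grid of candidate knots: for each fixed knot the model is linear, so it can be fitted with lm(), and we keep the knot giving the smallest residual standard error. A minimal sketch with made-up data (the variable names, coefficients and grid of knots are invented purely for illustration):

## Grid search over the knot k for y = beta0 + beta1*x + beta2*(x - k)+  (made-up data)
set.seed(1)
x = seq(0, 4, length = 100)
y = 0.34 - 0.02 * x + 0.03 * pmax(x - 1.5, 0) + rnorm(100, sd = 0.01)

knots = seq(0.5, 3.5, by = 0.05)
S = sapply(knots, function(k) summary(lm(y ~ x + pmax(x - k, 0)))$sigma)   # residual SE for each candidate knot
knots[which.min(S)]                                                        # knot with the smallest S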

28.5.1 Nonlinear regression for a change-slope model: the Thai Baht.

The data here refer to the Thai Baht exchange rate \(Y\) vs time (measured in hundreds of days, hDays). The graph below shows \(Y\) vs hDays with a cubic model. The model looks like it would be unsafe for extrapolation on the right.

Download baht.csv

## baht = read.csv("baht.csv", header = TRUE)

Y = baht$Y
hDays = baht$hDays

plot(Y ~ hDays)
cubic = lm(Y ~ poly(hDays, 3))
summary(cubic)$sigma
[1] 0.008618117
lines(predict(cubic) ~ hDays, col = 2, lwd = 2)

[Figure: Y vs hDays with the fitted cubic curve]

We consider a piecewise model \[ Y = \beta_0 + \beta_1 \, hDays + \beta_2 \, (hDays-k)_+ \] where \((u)_+\) denotes the positive part \(\max(u, 0)\), and where \(k\) is unknown but probably near 2. We will use this guess to give starting values for the other parameters.

start.guess = lm(Y ~ hDays + I((abs(hDays - 2) + hDays - 2)/2))  # (abs(u) + u)/2 equals max(u, 0), i.e. (hDays - 2)+
summary(start.guess)

Call:
lm(formula = Y ~ hDays + I((abs(hDays - 2) + hDays - 2)/2))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.024944 -0.005216  0.001020  0.004923  0.037731 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.3443525  0.0013566  253.84   <2e-16 ***
hDays                             -0.0201913  0.0008924  -22.63   <2e-16 ***
I((abs(hDays - 2) + hDays - 2)/2)  0.0219204  0.0012007   18.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008564 on 386 degrees of freedom
Multiple R-squared:  0.6141,    Adjusted R-squared:  0.6121 
F-statistic: 307.2 on 2 and 386 DF,  p-value: < 2.2e-16

This gives starting estimates \(\beta_0\approx 0.34, ~~ \beta_1 \approx -0.02,~~ \beta_2 \approx 0.022, ~~k=2\).

The following code fits the model, but some iteration control parameters were needed, presumably because of the corner. (Other statistical packages such as SPSS and Minitab have fewer problems fitting this model.)

baht.nls = nls(Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays - k)/2),
    start = list(beta0 = 0.34, beta1 = -0.02, beta2 = 0.022, k = 2), data = baht,
    control = nls.control(warnOnly = TRUE, minFactor = 1/2048))
Warning in nls(Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays -
: step factor 0.000244141 reduced below 'minFactor' of 0.000488281
summary(baht.nls)

Formula: Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays - 
    k)/2)

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
beta0  0.349459   0.001597  218.88   <2e-16 ***
beta1 -0.027767   0.001772  -15.67   <2e-16 ***
beta2  0.028657   0.001821   15.73   <2e-16 ***
k      1.579995   0.064909   24.34   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008074 on 385 degrees of freedom

Number of iterations till stop: 14 
Achieved convergence tolerance: 0.008517
Reason stopped: step factor 0.000244141 reduced below 'minFactor' of 0.000488281
plot(Y ~ hDays)
lines(predict(baht.nls) ~ hDays, col = 2, lwd = 3)

[Figure: Y vs hDays with the fitted piecewise (change-point) model]

The graph seems much more reasonable. Note the estimated knot \(k\) has moved from our starting guess of 2 to about 1.58.

In terms of residual standard error, which is the better-fitting model, the cubic or the change-point model? (From above, the cubic model had S = 0.008618.)