Lecture 28 Nonlinear least squares

In class version

This topic introduces nonlinear regression models, iterative fitting by direct search, and the Gauss-Newton algorithm.

28.1 When the Response is a Nonlinear Function of the Regression Parameters

We consider what happens when the relationship between \(Y\) (response) and \(X\) (predictor) is known to be a particular curve that cannot be made linear by transformation.

That is, suppose \[ y = f(x) + \varepsilon\]

Where \(f(x)\) denotes some function of \(x\) (e.g. a curve), and where the error variation \(\varepsilon\) around the curve is normally distributed with constant variance as before.

Suppose we have a general idea of the shape of the curve, but the precise details depend on some unknown parameters, say \(\alpha, \beta\) and \(\gamma\).

The task, as in ordinary linear regression, reduces to that of estimating the parameters. Now in principle we can estimate \(\alpha, \beta\) and \(\gamma\) by least squares as before. That is, we simply choose those estimates that make the squared error of our curve as small as possible: i.e. the least squares problem is to minimize \[ SS_{error} = \sum \left(y_i - f(x_i; \alpha,\beta,\gamma )\right) ^2 \] or equivalently minimize the residual standard error \(S = \sqrt{SS_{error}/{(n-p)}}\).

The only difficulty or difference to the previous situation is that we do not have simple formulas to use arising out of the matrix linear algebra, as the curve \(f(x)\) is not linear in \(\alpha, \beta\) and \(\gamma\).
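To make the criterion concrete, SS_error can be minimized numerically with a general-purpose optimiser. The following is a minimal sketch only (the data, starting values and variable names are made up purely for illustration), using R's optim() with the logistic curve from the first example below:

## Minimal sketch: minimize SS_error numerically with optim() (made-up data)
set.seed(1)
x = seq(0, 10, length = 50)
y = 4 * plogis(-3 + 0.8 * x) + rnorm(50, sd = 0.2)   # plogis(z) = exp(z)/(1 + exp(z))

SSerror = function(par) {                            # par = c(alpha, beta0, beta1)
    yhat = par[1] * plogis(par[2] + par[3] * x)
    sum((y - yhat)^2)
}
optim(c(alpha = 3, beta0 = -2, beta1 = 1), SSerror)$par   # least squares estimates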

A couple of examples of functions \(f(x)\) that are nonlinear in the parameters are:

  1. Suppose

\[ f(x_i; \alpha,\beta_0,\beta_1) = \alpha~~ \frac{ e^{\beta_0+ \beta_1 x_i}}{ 1+e^{\beta_0+ \beta_1 x_i} } \]

This is the so-called logistic curve, which models a phenomenon (such as sales) that goes from zero to some upper limit \(\alpha\).

(Note \(\alpha\) is an asymptote for the curve but the individual data can go randomly above or below the curve).

This model cannot be made linear, but in principle we can use any mathematical method we wish to find those estimates of \(\alpha,\beta_0\),\(\beta_1\) that minimize
\(SS_{error} = \sum (y_i - \hat y_i )^2\) where \[\hat y_i = \hat\alpha~~\frac{e^{\hat\beta_0+ \hat\beta_1 x_i} }{ 1+e^{\hat\beta_0+ \hat\beta_1 x_i} } ~~.\]

  2. A somewhat simpler example, but also nonlinear, is

\[ y_i = \beta_0 + \beta_1 x_i^{\alpha} + \varepsilon_i \]

where \(\alpha\) is unknown. Again this cannot be made linear in \(\alpha,\beta_0\) and \(\beta_1\) but in this example the model fitting could be done simply by guessing \(\alpha\) and creating a new explanatory variable \(x_{new} = x^\alpha\). Then \(\beta_0\) and \(\beta_1\) can be found by linear regression of \(y\) on \(x_{new}\).
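As a minimal sketch of that idea (made-up data; the grid of \(\alpha\) values is invented for illustration), we can try a grid of guesses for \(\alpha\), fit a linear regression for each one, and keep the \(\alpha\) giving the smallest residual standard error:

## Direct search over alpha for y = beta0 + beta1 * x^alpha (made-up data)
set.seed(1)
x = 1:30
y = 5 + 2 * x^0.4 + rnorm(30, sd = 0.2)

alphas = seq(0.1, 1, by = 0.05)
S = sapply(alphas, function(a) summary(lm(y ~ I(x^a)))$sigma)   # residual standard error for each guess
best = alphas[which.min(S)]                                     # alpha with the smallest S on this grid
coef(lm(y ~ I(x^best)))                                         # beta0 and beta1 at that alpha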

28.3 The Gauss-Newton Algorithm

The alternative to using direct search is to use a mathematical technique such as the Gauss-Newton algorithm.

The idea of the algorithm is to approximate the curve \(f(x_i; \alpha,\beta,\gamma)\) by a function that is linear in \(\alpha,\beta\) and \(\gamma\): that is, a linearized approximation to the curve, taken about the current parameter estimates.

The mathematical ideas are in a Word document on Stream, but are not examinable.

The main points are:

  1. The process of model-fitting is iterative. That is, it involves a series of repeated linear regressions with slightly different data each time, converging to a fixed answer as we get closer and closer to the optimal estimates.

  2. The Gauss-Newton method is guaranteed to converge quickly to the optimal estimates, provided that our initial starting point is close enough.

  3. The method also produces approximate hypothesis tests and confidence intervals.
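Although the mathematical details are not examinable, a single Gauss-Newton step is easy to carry out "by hand" in R. The sketch below (made-up data and starting values, using the power curve \(y = \beta_0 + \beta_1 x^\alpha\) from the second example above) regresses the current residuals on the derivatives of the curve with respect to each parameter; the fitted coefficients are the adjustments to the current estimates.

## One Gauss-Newton step for y = beta0 + beta1 * x^alpha (made-up data and starting values)
set.seed(1)
x = 1:20
y = 2 + 3 * x^0.5 + rnorm(20, sd = 0.3)

theta = c(alpha = 0.4, beta0 = 1, beta1 = 2)             # starting guesses
f = function(th, x) th["beta0"] + th["beta1"] * x^th["alpha"]

r = y - f(theta, x)                                      # residuals at the current estimates
J = cbind(alpha = theta["beta1"] * x^theta["alpha"] * log(x),   # derivative w.r.t. alpha
          beta0 = rep(1, length(x)),                            # derivative w.r.t. beta0
          beta1 = x^theta["alpha"])                             # derivative w.r.t. beta1
step = coef(lm(r ~ J - 1))                               # linear regression of residuals on the derivatives
theta = theta + step                                     # updated estimates; repeat until they stop changing
theta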

In R, the nls() (nonlinear least squares) function implements the Gauss-Newton algorithm.

We need to specify the formula, and (preferably) reasonable starting values.

# Fit Manhours = beta0 + beta1 * Cases^alpha by nonlinear least squares
manhours.nls = nls(Manhours ~ beta0 + beta1 * (Cases^alpha), start = list(alpha = 0.25,
    beta0 = -16000, beta1 = 4400), data = manhours)

summary(manhours.nls)

Formula: Manhours ~ beta0 + beta1 * (Cases^alpha)

Parameters:
        Estimate Std. Error t value Pr(>|t|)
alpha  2.689e-01  1.585e-01   1.697    0.115
beta0 -1.463e+04  1.256e+04  -1.164    0.267
beta1  3.622e+03  5.954e+03   0.608    0.554

Residual standard error: 909.7 on 12 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 2.462e-06
plot(Manhours ~ Cases)
lines(predict(manhours.nls) ~ Cases)

[Figure: Manhours vs Cases with the fitted nonlinear curve]

We can calculate an approximate confidence interval for \(\alpha\) using \[ \hat\alpha \pm t_{n-p}(0.025)\times SE(\hat\alpha)\]

summary(manhours.nls)$coefficients
           Estimate   Std. Error    t value  Pr(>|t|)
alpha  2.689232e-01     0.158476  1.6969329 0.1154707
beta0 -1.462572e+04 12564.190279 -1.1640799 0.2670141
beta1  3.621670e+03  5954.344902  0.6082398 0.5543666
est = summary(manhours.nls)$coefficients[1]  # [1] = estimate of alpha (the matrix is stored column by column)
se = summary(manhours.nls)$coefficients[4]   # [4] = standard error of alpha
lower = est + qt(0.025, df = 15 - 3) * se    # n - p = 15 - 3 = 12 degrees of freedom
upper = est + qt(0.975, df = 15 - 3) * se
cbind(est, se, lower, upper)
           est       se       lower     upper
[1,] 0.2689232 0.158476 -0.07636641 0.6142128

We can conclude from this that we could not reject a hypothesis that \(\alpha\) equalled any value between -0.07 and 0.6.

In particular, we could have stuck with the square-root transformation (\(\alpha=0.5\)) or used the log transformation (recall this takes the place of \(\alpha=0\) in the ladder of powers).

28.4 Example: A Growth Curve Model with a Maximum

Recall the fev data relating forced expiratory volume (FEV) to girls’ ages. This seemed to flatten off after about age 10.

We will model the growth using a logistic curve with unknown maximum.

\[ f(x_i; \alpha,\beta_0,\beta_1) = \alpha~~\frac{e^{\beta_0+ \beta_1 x_i}}{ 1+e^{\beta_0+ \beta_1 x_i} } \]

Download FEV.csv

## fev <- read.csv(file = "fev.csv", header = TRUE)

FEV = fev$FEV
Age = fev$Age

plot(FEV ~ Age)

[Figure: scatterplot of FEV vs Age]

From the graph the maximum \(\alpha \approx 3\), so setting \[FEV \approx 3 ~~\frac{e^{\beta_0 + \beta_1 Age}}{ 1+ e^{\beta_0 + \beta_1 Age}}\] we can rearrange this as
\[1+ e^{\beta_0 + \beta_1 Age} \approx \frac{3}{FEV} ~ e^{\beta_0 + \beta_1 Age} \]

Gathering up the \(e^{\beta_0 + \beta_1 Age}\) terms gives \(1 \approx \left(\frac{3}{FEV} - 1\right) e^{\beta_0 + \beta_1 Age}\), so that

\[ \frac{1}{3/FEV - 1} \approx e^{\beta_0 + \beta_1 Age}\] and then if we take logs we can use the left-hand side as the “response” in a regression

\[ Y_{new} = \log\left(\frac{1}{3/FEV - 1}\right)\approx \beta_0 + \beta_1 Age \] to give us starting values for \(\beta_0\) and \(\beta_1\):

start.calc = lm(-log(3/FEV - 1) ~ Age)  # -log(3/FEV - 1) is the Y_new above, since log(1/u) = -log(u)
Warning in log(3/FEV - 1): NaNs produced
summary(start.calc)

Call:
lm(formula = -log(3/FEV - 1) ~ Age)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9603 -0.5223 -0.1590  0.2990  5.0842 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.30548    0.21129  -6.179 2.78e-09 ***
Age          0.28493    0.02226  12.799  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9443 on 238 degrees of freedom
  (78 observations deleted due to missingness)
Multiple R-squared:  0.4077,    Adjusted R-squared:  0.4052 
F-statistic: 163.8 on 1 and 238 DF,  p-value: < 2.2e-16

So we will use starting values \(\alpha=3\), \(\beta_0= -1.3\) and \(\beta_1=0.28\). (The NaN warning arises because a few FEV values exceed 3, making \(3/FEV - 1\) negative; those observations are simply dropped in this starting-value calculation.)

fev.nls = nls(FEV ~ alpha * exp(beta0 + beta1 * Age)/(1 + exp(beta0 + beta1 * Age)),
    start = list(alpha = 3, beta0 = -1.3, beta1 = 0.28))
summary(fev.nls)

Formula: FEV ~ alpha * exp(beta0 + beta1 * Age)/(1 + exp(beta0 + beta1 * 
    Age))

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
alpha  3.21164    0.08682   36.99   <2e-16 ***
beta0 -2.16822    0.21482  -10.09   <2e-16 ***
beta1  0.36561    0.03620   10.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3956 on 315 degrees of freedom

Number of iterations to convergence: 5 
Achieved convergence tolerance: 6.266e-06
plot(FEV ~ Age)
lines(30:200/10, predict(fev.nls, newdata = data.frame(Age = 30:200/10)), lty = 1,
    col = 2, lwd = 3)

[Figure: FEV vs Age with the fitted logistic curve]

This has converged. Note the residual standard error is 0.3956, which is slightly bigger than for the piecewise model looked at earlier (S = 0.39), but the curve seems more reasonable than one with a sharp corner.

We can get a slightly better fit by adding a quadratic term in Age. In the following we use starting value \(\beta_2=0\).

fev.nls2 = nls(FEV ~ alpha * exp(beta0 + beta1 * Age + beta2 * Age^2)/(1 + exp(beta0 +
    beta1 * Age + beta2 * Age^2)), start = list(alpha = 3, beta0 = -1.3, beta1 = 0.28,
    beta2 = 0), data = fev)
summary(fev.nls2)

Formula: FEV ~ alpha * exp(beta0 + beta1 * Age + beta2 * Age^2)/(1 + exp(beta0 + 
    beta1 * Age + beta2 * Age^2))

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
alpha  3.03273    0.05788  52.393  < 2e-16 ***
beta0 -0.06772    0.75358  -0.090  0.92846    
beta1 -0.23896    0.21936  -1.089  0.27684    
beta2  0.04437    0.01619   2.741  0.00647 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3906 on 314 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 2.889e-06
plot(FEV ~ Age)
lines(30:200/10, predict(fev.nls2, newdata = data.frame(Age = 30:200/10)), lty = 1,
    col = 2, lwd = 3)

[Figure: FEV vs Age with the fitted logistic curve including the quadratic term]

28.5 Piecewise Regression with an Unknown Knot

Recall that sometimes we wish to fit a piecewise regression model, but the location of the knot is unknown.

This then becomes a nonlinear regression model, with the location of the knot playing the role of the alpha parameter above (not meaning that it enters the equation in the same way, but that if the knot is known, then every other parameter can be fitted by linear regression).

Strictly speaking the Gauss-Newton algorithm does not quite work, as the knot gives a ‘corner’ to the lines, or even a jump in the level of the lines, and this means that the calculus on which the method was originally based breaks down. However, the nls() routine can be made to cope.
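An alternative in the spirit of direct search is to profile over a grid of candidate knots: for each fixed knot the model is linear, so it can be fitted with lm(), and we keep the knot giving the smallest residual standard error. A minimal sketch with made-up data (the variable names, coefficients and grid of knots are invented purely for illustration):

## Grid search over the knot k for y = beta0 + beta1*x + beta2*(x - k)+  (made-up data)
set.seed(1)
x = seq(0, 4, length = 100)
y = 0.34 - 0.02 * x + 0.03 * pmax(x - 1.5, 0) + rnorm(100, sd = 0.01)

knots = seq(0.5, 3.5, by = 0.05)
S = sapply(knots, function(k) summary(lm(y ~ x + pmax(x - k, 0)))$sigma)   # residual SE for each candidate knot
knots[which.min(S)]                                                        # knot with the smallest S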

28.5.1 Nonlinear regression for a change-slope model: the Thai Baht.

The data here refer to the Thai Baht exchange rate \(Y\) vs time (measured in hundreds of days, hDays). The graph below shows \(Y\) vs hDays with a cubic model. The model looks like it would be unsafe for extrapolation on the right.

Download baht.csv

## baht = read.csv("baht.csv", header = TRUE)

Y = baht$Y
hDays = baht$hDays

plot(Y ~ hDays)
cubic = lm(Y ~ poly(hDays, 3))
summary(cubic)$sigma
[1] 0.008618117
lines(predict(cubic) ~ hDays, col = 2, lwd = 2)

[Figure: Y vs hDays with the fitted cubic curve]

We consider a piecewise model \[ Y = \beta_0 + \beta_1 \, hDays + \beta_2 \, (hDays-k)_+ \] where \((u)_+\) denotes the positive part \(\max(u, 0)\), and where \(k\) is unknown but probably near 2. We will use this guess to give starting values for the other parameters.

start.guess = lm(Y ~ hDays + I((abs(hDays - 2) + hDays - 2)/2))  # (abs(u) + u)/2 equals max(u, 0), i.e. (hDays - 2)+
summary(start.guess)

Call:
lm(formula = Y ~ hDays + I((abs(hDays - 2) + hDays - 2)/2))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.024944 -0.005216  0.001020  0.004923  0.037731 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.3443525  0.0013566  253.84   <2e-16 ***
hDays                             -0.0201913  0.0008924  -22.63   <2e-16 ***
I((abs(hDays - 2) + hDays - 2)/2)  0.0219204  0.0012007   18.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008564 on 386 degrees of freedom
Multiple R-squared:  0.6141,    Adjusted R-squared:  0.6121 
F-statistic: 307.2 on 2 and 386 DF,  p-value: < 2.2e-16

This gives starting estimates \(\beta_0\approx 0.34, ~~ \beta_1 \approx -0.02,~~ \beta_2 \approx 0.022, ~~k=2\).

The following code fits the model, but some iteration control parameters were needed, presumably because of the corner. (Other statistical packages such as SPSS and Minitab have fewer problems fitting this model.)

baht.nls = nls(Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays - k)/2),
    start = list(beta0 = 0.34, beta1 = -0.02, beta2 = 0.022, k = 2), data = baht,
    control = nls.control(warnOnly = TRUE, minFactor = 1/2048))
Warning in nls(Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays -
: step factor 0.000244141 reduced below 'minFactor' of 0.000488281
summary(baht.nls)

Formula: Y ~ beta0 + beta1 * hDays + beta2 * I((abs(hDays - k) + hDays - 
    k)/2)

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
beta0  0.349459   0.001597  218.88   <2e-16 ***
beta1 -0.027767   0.001772  -15.67   <2e-16 ***
beta2  0.028657   0.001821   15.73   <2e-16 ***
k      1.579995   0.064909   24.34   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008074 on 385 degrees of freedom

Number of iterations till stop: 14 
Achieved convergence tolerance: 0.008517
Reason stopped: step factor 0.000244141 reduced below 'minFactor' of 0.000488281
plot(Y ~ hDays)
lines(predict(baht.nls) ~ hDays, col = 2, lwd = 3)

[Figure: Y vs hDays with the fitted piecewise (change-point) model]

The graph seems much more reasonable. Note the estimated knot \(k\) has moved from our starting guess of 2 to about 1.58.

In terms of residual standard error, which is the better-fitting model, the cubic or the change-point model? (From above, the cubic model had S = 0.008618.)