We ask if studying mortality by sampling the dead is a good idea and whether it is reasonable to assume normally distributed remaining life after age 60. We answer *No* and *Maybe*.

Data come from Skellefteå and Umeå. The selection is all persons born between January 1, 1841 and December 31, 1850 and present in the data at age 60. They are followed from age 60 to death or right censoring (at most until December 31, 1950). The first six rows of the data frame:

```
id sex birthdate region civst enter exit event
28 18 male 1845.231 ske married 61.767 75.005 1
61 28 male 1843.632 ske widow 82.644 86.523 1
312 264 male 1848.941 ske widow 60.000 65.880 1
433 377 female 1845.310 ske widow 61.534 68.147 0
472 400 male 1844.062 ske widow 67.936 72.056 1
627 545 female 1847.188 ske married 60.000 85.510 1
```

We want to observe time since age 60 (exact!), so we subtract 60 from the ages `enter`

and `exit`

.

In the following, we are interested in the *sex difference* in expected remaining life after 60 (*e60*) (just as an illustration). The categorical variable `region`

is included as an example of a confounder.

OLS regression in survival analysis is a special case of the AFT regression model, see the next section.

*OLS* means assuming *normally distributed* remaining life after 60, and we can use standard linear regression, given that we are willing to assume an *accelerated failure time (AFT)* model for our data. In the case of right censoring, we need to fall back on maximum likelihood for the estimation. The function *survreg* in the **R** (R Core Team 2018) package **survival** facilitates this.

```
Call:
survreg(formula = Surv(exit, event) ~ sex + region, data = ols,
dist = "gaussian")
Value Std. Error z p
(Intercept) 16.474 0.2379 69.25 0.000000
sexfemale 0.913 0.2984 3.06 0.002208
regionume -1.322 0.3185 -4.15 0.000033
Log(scale) 2.192 0.0122 180.31 0.000000
Scale= 8.96
Gaussian distribution
Loglik(model)= -12402.8 Loglik(intercept only)= -12415.7
Chisq= 25.7 on 2 degrees of freedom, p= 2.6e-06
Number of Newton-Raphson Iterations: 3
n= 3883
```

So the *e60* for men from Skellefteå is 16.47 years and for women it is 0.91 more years. The difference is statistically significant.

We repeat the analyses with the data subset consisting of all those who were observerd to die in the parishes. This implies that 13 percent censored individuals are thrown away.

```
Call:
survreg(formula = Surv(exit, event) ~ sex + region, data = sols,
dist = "gaussian")
Value Std. Error z p
(Intercept) 15.24 0.2295 66.40 0.00e+00
sexfemale 1.31 0.2886 4.53 5.78e-06
regionume -1.45 0.3104 -4.67 2.94e-06
Log(scale) 2.12 0.0122 173.58 0.00e+00
Scale= 8.34
Gaussian distribution
Loglik(model)= -11857.7 Loglik(intercept only)= -11878.2
Chisq= 41.03 on 2 degrees of freedom, p= 1.2e-09
Number of Newton-Raphson Iterations: 2
n= 3350
```

So the *e60* for men from Skellefteå is now only 15.24 years and for women it is 1.31 more years. The difference is statistically significant,
and larger than in the case with full data.

Sampling of deaths leads to overestimation of mortality (underestimation of *e60*): This is a trivial (well-known) consequence. More seriously, it distorts the balance between the effects of categories (`sex`

, `region`

) on mortality in unpredictable ways.

The OLS model is an example of what is called an *accelerated failure time (AFT)* model in survival analysis, a competitor to the *proportional hazards (PH)* model.
A common assumption about old age mortality is that a *Gompertz* distribution fits well.

So we go back to the full data situation and fit a Gompertz AFT model. The *survreg* function does not include the Gompertz distribution, but the *aftreg* function in the package **eha** (Broström 2017) does.

```
Call:
aftreg(formula = Surv(exit, event) ~ sex + region, data = ols,
dist = "gompertz", param = "lifeExp")
Covariate W.mean Coef Life-Expn se(Coef) Wald p
sex
male 0.460 0 1 (reference)
female 0.540 -0.014 0.986 0.015 0.347
region
ske 0.697 0 1 (reference)
ume 0.303 -0.059 0.942 0.017 0.000
Baseline parameters:
log(scale) 2.617 0.026 0.000
log(shape) -1.224 0.057 0.000
Baseline life expectancy: 16.9
Events 3350
Total time at risk 58983
Max. log. likelihood -12246
LR test statistic 13.8
Degrees of freedom 2
Overall p-value 0.000986637
```

So the *e60* for Skellefteå men is 16.95 years, and for the women we get
16.71 years, a difference of -0.24 years, that is, *men* are better off. This difference is *not* statistically significant.

Yes, as long as condition on survival past the mode of the survival distribution, according to Kannisto (1996). So what is the mode in the life distribution in our data?

Well, it is closer to 80 than to 60, so the normal distribution assumption may be in doubt. We can compare the baseline (men in Skellefteå) distributions from the OLS fit and nonparametric estimation.

Looks quite ok, even though the OLS curve heavily overestimates the ages above 90. No simple parametric model will catch the heavy slow-down of the mortality increase in the high ages.

Maybe the PH model fits better than the AFT? Compare max log likelihoods.

```
Call:
phreg(formula = Surv(exit, event) ~ sex + region, data = ols,
dist = "gompertz")
Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
sex
male 0.460 0 1 (reference)
female 0.540 -0.036 0.964 0.035 0.296
region
ske 0.697 0 1 (reference)
ume 0.303 0.142 1.153 0.037 0.000
log(scale) 2.594 0.025 0.000
log(shape) -1.244 0.059 0.000
Events 3350
Total time at risk 58983
Max. log. likelihood -12246
LR test statistic 15.01
Degrees of freedom 2
Overall p-value 0.000551037
```

No, the difference is very small: 0.582 in favor of the PH model. Not much to argue about.

These are the mortality patterns for the four groups we are contemplating. Worth noting is that male mortality above age 90 is so low (compared to women).