1 Introduction

We ask if studying mortality by sampling the dead is a good idea and whether it is reasonable to assume normally distributed remaining life after age 60. We answer No and Maybe.

2 Data

Data come from Skellefteå and Umeå. The selection is all persons born between January 1, 1841 and December 31, 1850 and present in the data at age 60. They are followed from age 60 to death or right censoring (at most until December 31, 1950). The first six rows of the data frame:

     id    sex birthdate region   civst  enter   exit event
28   18   male  1845.231    ske married 61.767 75.005     1
61   28   male  1843.632    ske   widow 82.644 86.523     1
312 264   male  1848.941    ske   widow 60.000 65.880     1
433 377 female  1845.310    ske   widow 61.534 68.147     0
472 400   male  1844.062    ske   widow 67.936 72.056     1
627 545 female  1847.188    ske married 60.000 85.510     1

We want to observe time since age 60 (exact!), so we subtract 60 from the ages enter and exit.

In the following, we are interested in the sex difference in expected remaining life after 60 (e60) (just as an illustration). The categorical variable region is included as an example of a confounder.

3 OLS regression

OLS regression in survival analysis is a special case of the AFT regression model, see the next section.

3.1 Full data

OLS means assuming normally distributed remaining life after 60, and we can use standard linear regression, given that we are willing to assume an accelerated failure time (AFT) model for our data. In the case of right censoring, we need to fall back on maximum likelihood for the estimation. The function survreg in the R (R Core Team 2018) package survival facilitates this.


Call:
survreg(formula = Surv(exit, event) ~ sex + region, data = ols, 
    dist = "gaussian")
             Value Std. Error      z        p
(Intercept) 16.474     0.2379  69.25 0.000000
sexfemale    0.913     0.2984   3.06 0.002208
regionume   -1.322     0.3185  -4.15 0.000033
Log(scale)   2.192     0.0122 180.31 0.000000

Scale= 8.96 

Gaussian distribution
Loglik(model)= -12402.8   Loglik(intercept only)= -12415.7
    Chisq= 25.7 on 2 degrees of freedom, p= 2.6e-06 
Number of Newton-Raphson Iterations: 3 
n= 3883 

So the e60 for men from Skellefteå is 16.47 years and for women it is 0.91 more years. The difference is statistically significant.

3.2 Sampling the dead

We repeat the analyses with the data subset consisting of all those who were observerd to die in the parishes. This implies that 13 percent censored individuals are thrown away.


Call:
survreg(formula = Surv(exit, event) ~ sex + region, data = sols, 
    dist = "gaussian")
            Value Std. Error      z        p
(Intercept) 15.24     0.2295  66.40 0.00e+00
sexfemale    1.31     0.2886   4.53 5.78e-06
regionume   -1.45     0.3104  -4.67 2.94e-06
Log(scale)   2.12     0.0122 173.58 0.00e+00

Scale= 8.34 

Gaussian distribution
Loglik(model)= -11857.7   Loglik(intercept only)= -11878.2
    Chisq= 41.03 on 2 degrees of freedom, p= 1.2e-09 
Number of Newton-Raphson Iterations: 2 
n= 3350 

So the e60 for men from Skellefteå is now only 15.24 years and for women it is 1.31 more years. The difference is statistically significant, and larger than in the case with full data.

3.3 Conclusion

Sampling of deaths leads to overestimation of mortality (underestimation of e60): This is a trivial (well-known) consequence. More seriously, it distorts the balance between the effects of categories (sex, region) on mortality in unpredictable ways.

4 AFT regression

The OLS model is an example of what is called an accelerated failure time (AFT) model in survival analysis, a competitor to the proportional hazards (PH) model. A common assumption about old age mortality is that a Gompertz distribution fits well.

So we go back to the full data situation and fit a Gompertz AFT model. The survreg function does not include the Gompertz distribution, but the aftreg function in the package eha (Broström 2017) does.

Call:
aftreg(formula = Surv(exit, event) ~ sex + region, data = ols, 
    dist = "gompertz", param = "lifeExp")

Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
sex 
            male    0.460     0         1           (reference)
          female    0.540    -0.014     0.986     0.015     0.347 
region 
             ske    0.697     0         1           (reference)
             ume    0.303    -0.059     0.942     0.017     0.000 

Baseline parameters:
log(scale)                    2.617               0.026     0.000 
log(shape)                   -1.224               0.057     0.000 
Baseline life expectancy:  16.9 

Events                    3350 
Total time at risk         58983 
Max. log. likelihood      -12246 
LR test statistic         13.8 
Degrees of freedom        2 
Overall p-value           0.000986637

So the e60 for Skellefteå men is 16.95 years, and for the women we get 16.71 years, a difference of -0.24 years, that is, men are better off. This difference is not statistically significant.

4.1 Is the assumption of normally distributed residual life reasonable?

Yes, as long as condition on survival past the mode of the survival distribution, according to Kannisto (1996). So what is the mode in the life distribution in our data?

Well, it is closer to 80 than to 60, so the normal distribution assumption may be in doubt. We can compare the baseline (men in Skellefteå) distributions from the OLS fit and nonparametric estimation.

Looks quite ok, even though the OLS curve heavily overestimates the ages above 90. No simple parametric model will catch the heavy slow-down of the mortality increase in the high ages.

5 PH regression

Maybe the PH model fits better than the AFT? Compare max log likelihoods.

Call:
phreg(formula = Surv(exit, event) ~ sex + region, data = ols, 
    dist = "gompertz")

Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
sex 
            male    0.460     0         1           (reference)
          female    0.540    -0.036     0.964     0.035     0.296 
region 
             ske    0.697     0         1           (reference)
             ume    0.303     0.142     1.153     0.037     0.000 

log(scale)                    2.594               0.025     0.000 
log(shape)                   -1.244               0.059     0.000 

Events                    3350 
Total time at risk         58983 
Max. log. likelihood      -12246 
LR test statistic         15.01 
Degrees of freedom        2 
Overall p-value           0.000551037

No, the difference is very small: 0.582 in favor of the PH model. Not much to argue about.

6 A look at data

6.1 Full data

These are the mortality patterns for the four groups we are contemplating. Worth noting is that male mortality above age 90 is so low (compared to women).