Introduction

If you suffer from long execution times on huge data sets, and/or from integrity (confidentiality) problems, this is for you. I show how the theory of sufficient statistics may solve your problem, provided that you are willing to organize your data properly. Your integrity problem (if any) is solved on the fly.

I illustrate everything by example using R and RStudio. The principle is universal, however, and you could, as an exercise, think about how you would implement it in Stata, if you are so inclined. In any case, you may find pleasure in studying the R code included throughout.

The problem

I am analyzing a large data set from Lisa, trying some selected statistical models for a project based on survival analysis. Since Lisa contains yearly data, I am contemplating a discrete-time proportional hazards model, which I plan to fit with logistic regression (cloglog link). These are my data (first few rows):

head(suff)
##       id age hisclass    sex event
## 1 183091  66        1 female FALSE
## 2 183091  67        1 female FALSE
## 3 183091  67        1 female FALSE
## 4 183091  68        1 female FALSE
## 5 183091  68        1 female FALSE
## 6 183091  69        1 female FALSE

The variable hisclass is of primary interest. It is grouped into four ordered classes, where “1” is the “uppermost class” and “4” is the “lowermost.” All other “control variables” except sex are omitted, to make this example clear and easy. The variable event is the response, with TRUE signifying an observed death.

This data frame is quite large (but not huge): 1715655 records on 97776 individuals, with 6115 deaths in all.
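If you do not have access to the original register data but want to try the code below, a small simulated stand-in with the same structure can be used. The variable names match the real data, but the values (and therefore all timings and estimates) are made up for illustration.

set.seed(1)
n <- 100000   # far smaller than the real 1715655 records
suff <- data.frame(
    id       = sample(1:100000, n, replace = TRUE),
    age      = sample(40:89, n, replace = TRUE),   # 50 distinct ages
    hisclass = sample(1:4, n, replace = TRUE),
    sex      = factor(sample(c("male", "female"), n, replace = TRUE),
                      levels = c("male", "female")),
    event    = rbinom(n, 1, 0.004) == 1
)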

Let us run a logistic regression and at the same time measure execution time (on singasten, running Ubuntu 12.04 (very old!) with 64 GB RAM).

(tt <- system.time(fit <- glm(event ~ factor(age) + factor(hisclass) + sex, 
                              data = suff, family = binomial(link = cloglog))))
##    user  system elapsed 
## 271.396  11.292 439.282
res <- summary(fit)$coef[-(1:50), 1:2]
round(res, 4)
##                   Estimate Std. Error
## factor(hisclass)2   0.1760     0.0409
## factor(hisclass)3   0.3319     0.0505
## factor(hisclass)4   0.3718     0.0398
## sexfemale          -0.4726     0.0274

The problem with this is the execution time: it takes 439 seconds, more than seven minutes, to perform this simple analysis!

The solution: Sufficient statistics!

We realize that the explanatory variables can be combined in only 50 x 4 x 2 = 400 distinct ways, so the solution is to aggregate. For this to work, we need a variable that counts the number of repetitions of each combination. Let's call it count:

suff$count <- 1

It equals one for each of our original records, and now we aggregate event and count over all combinations:

newsuff <- aggregate(suff[, c("event", "count")], 
                     by = suff[, c("age", "hisclass", "sex")],
                     FUN = sum)
head(newsuff)
##   age hisclass  sex event count
## 1  40        1 male     6  6884
## 2  41        1 male     1  6932
## 3  42        1 male     1  6986
## 4  43        1 male     2  6982
## 5  44        1 male     1  7003
## 6  45        1 male     1  7030

These are the first six rows of a total of 398 in the new data frame. (Shouldn't it be 400? Yes, but combinations that do not occur in the data are dropped: two in our case.)
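If you are curious which two combinations are empty, one way to find out (using the suff and newsuff data frames above) is to build the full grid of combinations and see which ones the aggregation did not produce:

all_combos <- expand.grid(age = sort(unique(suff$age)),
                          hisclass = sort(unique(suff$hisclass)),
                          sex = levels(factor(suff$sex)))
missing <- merge(all_combos, newsuff, all.x = TRUE)
missing[is.na(missing$count), c("age", "hisclass", "sex")]   # the empty cells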

Now we can use our new data frame to run the same logistic regression as before, but we must formulate the response as a two-column matrix (successes, failures):

(tt2 <- system.time(fit2 <- glm(cbind(event, count - event) ~ 
                                    factor(age) + factor(hisclass) + sex, 
                                data = newsuff, family = binomial(link = cloglog))))
##    user  system elapsed 
##    0.02    0.00    0.02
res2 <- summary(fit2)$coef[-(1:50), 1:2]
round(res2, 4)
##                   Estimate Std. Error
## factor(hisclass)2   0.1760     0.0409
## factor(hisclass)3   0.3319     0.0505
## factor(hisclass)4   0.3718     0.0398
## sexfemale          -0.4726     0.0274

As you can see, the two results are identical, but the execution time in the aggregated case is only 0.02 seconds, compared to 439 seconds in the original case. That is, the aggregated analysis is roughly 22000 times faster!
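If you want the exact factor on your own machine, it can be computed from the two saved timings:

as.numeric(tt["elapsed"] / tt2["elapsed"])   # here 439.282 / 0.02, roughly 22000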

You may wonder whether I simply lost that time in the aggregation process instead. No, it took around five seconds (for the computer), and besides, it is a one-time job, if done cleverly.
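You can check that on your own data by wrapping the aggregation itself in system.time (this simply repeats the aggregation above, with timing):

system.time(
    newsuff <- aggregate(suff[, c("event", "count")],
                         by = suff[, c("age", "hisclass", "sex")],
                         FUN = sum)
)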

Why does this work? Can we be sure that it wasn't just a happy coincidence? Yes, the answer is given by the statistical theory of sufficient statistics: the table we created is a sufficient statistic for the model and the sample at hand. This means that, given the sufficient statistic (our table) and the model, the conditional distribution of the original sample does not depend on the parameters; it is pure random noise and thus carries no information about the parameters in our model.
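A quick numerical sanity check, using the fit and fit2 objects above: the estimates and standard errors agree to within numerical (convergence) tolerance, even though the two log-likelihoods differ by a constant that does not involve the parameters.

max(abs(coef(fit) - coef(fit2)))                            # essentially zero
max(abs(sqrt(diag(vcov(fit))) - sqrt(diag(vcov(fit2)))))    # ditto for the standard errors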

Conclusion

The principle of tabulating a huge data set works well if the sufficient statistic reduces the dimensionality of your data set significantly. A necessary first step is to avoid continuous covariates (group them). In many cases the distribution of the response must also allow for categorization. For instance, in a survival analysis situation (continuous time) you may want to apply Cox regression, which does not admit any reduction by sufficiency; but if you are willing to assume a piecewise constant hazards model, you are OK. It works essentially as above, but with Poisson regression instead of logistic regression, and with aggregation of exposure times instead of pure counts.
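Here is a minimal sketch of that Poisson variant, assuming a hypothetical data frame pwch with one record per individual and age interval, an exposure variable giving the time at risk in that interval, and the same covariates as above (all names here are made up for illustration):

## One record per individual and age interval; 'exposure' is time at risk there.
agg <- aggregate(pwch[, c("event", "exposure")],
                 by = pwch[, c("age", "hisclass", "sex")],
                 FUN = sum)
## Piecewise constant hazards: Poisson regression with log(exposure) as offset.
fit3 <- glm(event ~ offset(log(exposure)) + factor(age) + factor(hisclass) + sex,
            data = agg, family = poisson)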

In “real” problems, the generated table will be much larger than in this example. The one I am working with now generates a table with around 40000 rows, but that is still a huge gain compared to the original data frame with three million observations. So don't hesitate, just do it.

This is, by the way, a good example of the principle “think first”, sadly forgotten these days: computers do not think (although some argue and believe otherwise).

Oh, I almost forgot: where does integrity come into the picture? It is obvious: in the generated table, no individual information is available. As always, though, this is a truth with modifications: if some combination of attributes is very rare, a knowledgeable observer may be able to identify an individual. In general, however, it should even be possible to publish tables generated in this way.
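One simple precaution before publishing such a table is to check for cells with very small counts (the threshold 5 below is arbitrary, just for illustration):

subset(newsuff, count < 5)   # cells that might need suppression or further grouping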


Copyright © 2016 Göran Broström and CEDAR. All rights reserved.