Time series models for mortality and disease incidence

Model time-varying incidence rates given a time series of case (or death) counts and population at risk.

stan_rw(
  data,
  group,
  time,
  cor = FALSE,
  family = poisson(),
  prior = list(),
  chains = 4,
  cores = 1,
  iter = 3000,
  refresh = 1500,
  control = list(adapt_delta = 0.98),
  ...
)

Source

Brandt P and Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007.

Clayton, DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Boca Raton, FL: CRC Press, 1996. p. 275-302.

Donegan C, Hughes AE, and Lee SC (2022). Colorectal Cancer Incidence, Inequalities, and Prevention Priorities in Urban Texas: Surveillance Study With the "surveil" Software Package. JMIR Public Health & Surveillance 8(8):e34589. doi:10.2196/34589

Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2.28. 2021. https://mc-stan.org

Arguments

data

A data.frame containing the following columns:

Count: Number of cases or deaths; this column must be named 'Count'.
Population: Size of population at risk; this column must be named 'Population'.
time: Time period indicator. (Provide the (unquoted) column name using the time argument.)
group: Optional grouping variable. (Provide the (unquoted) column name using the group argument.)

group

If data is aggregated by demographic group, provide the (unquoted) name of the column in data containing the grouping structure, such as age brackets or race-ethnicity. E.g., if data has column names Year, Race, Cases, and Population, then you would provide group = Race.

time

Specify the (unquoted) name of the time variable in data, as in time = Year. This variable must be numeric-alike (i.e., as.numeric(data$time) will not fail).

cor

For correlated random walks use cor = TRUE; default value is FALSE. Note this only applies when the group argument is used.

family

The default specification is a Poisson model with log link function (family = poisson()). For a Binomial model with logit link function, use family = binomial().

prior

Optionally provide a named list with prior parameters. If any of the following items are missing, default priors will be assigned and printed to the console.

eta_1

The first value of log-risk in each series must be assigned a Gaussian prior probability distribution. Provide the location and scale parameters for each demographic group in a list, where each parameter is a k-length vector.

For example, with k=2 demographic groups, the following code will assign priors of normal(-6.5, 5) to the starting values of both series: prior = list(eta_1 = normal(location = -6.5, scale = 5, k = 2). Note, eta is the log-rate, so centering the prior for eta_1 on -6.5 is similar to centering the prior rate on exp(-6.5)*100,000 = 150 cases per 100,000 person-years at risk. Note, however, that the translation from log-rate to rate is non-linear.

sigma

Each demographic group has a scale parameter assigned to its log-rate. This is the scale of the annual deviations from the previous year's log-rate. The scale parameters are assigned independent half-normal prior distributions (these half normal distributions are restricted to be positive-valued only).

omega

If cor = TRUE, an LKJ prior is assigned to the correlation matrix, Omega.

chains

Number of independent MCMC chains to initiate (passed to sampling).

cores

The number of cores to use when executing the Markov chains in parallel (passed to sampling).

iter

Total number of MCMC iterations. Warmup draws are automatically half of iter.

refresh

How often to print the MCMC sampling progress to the console.

control

A named list of parameters to control Stan's sampling behavior. The most common parameters to control are adapt_delta, which may be raised to address divergent transitions, and max_treedepth. For example, control = list(adapt_delta = 0.99, max_treedepth = 13), may be a reasonable specification to address a divergent transitions or maximum treedepth warning from Stan.

...

Other arguments passed to sampling.

Value

The function returns a list, also of class surveil, containing the following elements:

summary: A data.frame with posterior means and 95 percent credible intervals, as well as the raw data (Count, Population, time period, grouping variable if any, and crude rates).
samples: A stanfit object returned by sampling. This contains the MCMC samples from the posterior distribution of the fitted model.
cor: Logical value indicating if the model included a correlation structure.
time: A list containing the name of the time-period column in the user-provided data and a data.frame of observed time periods and their index.
group: If a grouping variable was used, this will be a list containing the name of the grouping variable and a data.frame with group labels and index values.
family: The user-provided family argument.

Details

By default, the models have Poisson likelihoods for the case counts, with log link function. Alternatively, a Binomial model with logit link function can be specified using the family argument (family = binomial()).

For time t = 1,...n, the models assign Poisson probability distribution to the case counts, given log-risk eta and population at tirks P; the log-risk is modeled using the first-difference (or random-walk) prior:

 y_t ~ Poisson(p_t * exp(eta_t))
 eta_t ~ Normal(eta_{t-1}, sigma)
 eta_1 ~ Normal(-6, 5) (-Inf, 0)
 sigma ~ Normal(0, 1) (0, Inf)

This style of model has been discussed in Bayesian (bio)statistics for quite some time. See Clayton (1996).

The above model can be used for multiple distinct groups; in that case, each group will have its own independent time series model.

It is also possible to add a correlation structure to that set of models. Let Y_t be a k-length vector of observations for each of k groups at time t (the capital letter now indicates a vector), then:

 Y_t ~ Poisson(P_t * exp(Eta_t))
 Eta_t ~ MVNormal(Eta_{t-1}, Sigma)
 Eta_1 ~ Normal(-6, 5)  (-Inf, 0)
 Sigma = diag(sigma) * Omega * diag(sigma)
 Omega ~ LKJ(2)
 sigma ~ Normal(0, 1) (0, Inf)

where Omega is a correlation matrix and diag(sigma) is a diagonal matrix with scale parameters on the diagonal. This was adopted from Brandt and Williams (2007); for the LKJ prior, see the Stan Users Guide and Reference Manual.

If the binomial model is used instead of the Poisson, then the first line of the model specifications will be:

 y_t ~ binomial(P_t, inverse_logit(eta_t))

All else is remains the same. The logit function is log(r/(1-r)), where r is a rate between zero and one; the inverse-logit function is exp(x)/(1 + exp(x)).

Author

Connor Donegan (Connor.Donegan@UTSouthwestern.edu)

Examples

data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum)

fit <- stan_rw(dat, time = Year)

## print summary of results
print(fit)
print(fit$summary)

## plot time trends (rates per 10,000)
plot(fit, scale = 10e3)
plot(fit, style = 'lines', scale = 10e3)

## Summary with MCMC diagnostics (n_eff, Rhat; from Rstan)
print(fit$samples)

## cumulative percent change
fit_pc <- apc(fit)
print(fit_pc$cpc)
plot(fit_pc, cumulative = TRUE)

# \donttest{
## age-specific rates 
data(cancer)
cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
               chains = 3, iter = 1e3) # for speed only

## plot trends 
plot(fit, scale = 10e3)

## age-standardized rates
data(standard)
fit_stands <- standardize(fit,
                          label = standard$age,
                          standard_pop = standard$standard_pop)
print(fit_stands)
plot(fit_stands)

## percent change for age-standardized rates
fit_stands_apc <- apc(fit_stands)
plot(fit_stands_apc)
print(fit_stands_apc)
# }