Model time-varying incidence rates given a time series of case (or death) counts and population at risk.
Brandt P and Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007.
Clayton DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Boca Raton, FL: CRC Press, 1996. p. 275-302.
Donegan C, Hughes AE, and Lee SC (2022). Colorectal Cancer Incidence, Inequalities, and Prevention Priorities in Urban Texas: Surveillance Study With the "surveil" Software Package. JMIR Public Health & Surveillance 8(8):e34589. doi:10.2196/34589
Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2.28. 2021. https://mc-stan.org
A data.frame containing the following columns:
Number of cases or deaths; this column must be named 'Count'.
Size of population at risk; this column must be named 'Population'.
Time period indicator. (Provide the (unquoted) column name using the time argument.)
Optional grouping variable. (Provide the (unquoted) column name using the group argument.)
If data is aggregated by demographic group, provide the (unquoted) name of the column in data containing the grouping structure, such as age brackets or race-ethnicity. E.g., if data has column names Year, Race, Count, and Population, then you would provide group = Race.
Specify the (unquoted) name of the time variable in data, as in time = Year. This variable must be numeric-alike (i.e., as.numeric(data$time) will not fail).
For correlated random walks use cor = TRUE; the default value is FALSE. Note that this only applies when the group argument is used.
The default specification is a Poisson model with log link function (family = poisson()). For a Binomial model with logit link function, use family = binomial().
Optionally provide a named list with prior parameters. If any of the following items are missing, default priors will be assigned and printed to the console.
The first value of log-risk in each series must be assigned a Gaussian prior probability distribution. Provide the location and scale parameters for each demographic group in a list, where each parameter is a k-length vector.
For example, with k=2 demographic groups, the following code will assign priors of normal(-6.5, 5) to the starting values of both series: prior = list(eta_1 = normal(location = -6.5, scale = 5, k = 2)). Note that eta is the log-rate, so centering the prior for eta_1 on -6.5 is similar to centering the prior rate on exp(-6.5)*100,000, or about 150 cases per 100,000 person-years at risk. Note, however, that the translation from log-rate to rate is non-linear.
Each demographic group has a scale parameter assigned to its log-rate. This is the scale of the annual deviations from the previous year's log-rate. The scale parameters are assigned independent half-normal prior distributions (these half-normal distributions are restricted to be positive-valued only).
If cor = TRUE, an LKJ prior is assigned to the correlation matrix, Omega.
Number of independent MCMC chains to initiate (passed to sampling).
The number of cores to use when executing the Markov chains in parallel (passed to sampling).
Total number of MCMC iterations. Warmup draws are automatically half of iter.
How often to print the MCMC sampling progress to the console.
A named list of parameters to control Stan's sampling behavior. The most common parameters to control are adapt_delta, which may be raised to address divergent transitions, and max_treedepth. For example, control = list(adapt_delta = 0.99, max_treedepth = 13) may be a reasonable specification to address a divergent-transitions or maximum-treedepth warning from Stan.
Other arguments passed to sampling.
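The correspondence between the eta_1 prior and the rate scale, described above for the prior argument, can be checked with a few lines of base R. This is only a sketch: the variable names are illustrative and are not part of surveil, and the one-standard-deviation interval is chosen here purely to show how non-linear the translation is.

```r
# Illustrative names only; none of these objects belong to surveil.
eta_location <- -6.5   # prior location for the first log-rate
eta_scale <- 5         # prior scale for the first log-rate

# Centre of the prior expressed as cases per 100,000 person-years
rate_center <- exp(eta_location) * 1e5

# One prior standard deviation either side of the centre, on the log scale
log_interval <- eta_location + c(-1, 1) * eta_scale

# The same interval on the rate scale is strongly asymmetric
rate_interval <- exp(log_interval) * 1e5

round(rate_center)        # about 150 per 100,000
signif(rate_interval, 2)  # roughly 1 and 22,000 per 100,000
```

An interval that is symmetric around -6.5 on the log scale runs from about 1 to about 22,000 cases per 100,000 on the rate scale, which is why the prior is best reasoned about on the log scale.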
The function returns a list, also of class surveil, containing the following elements:
A data.frame with posterior means and 95 percent credible intervals, as well as the raw data (Count, Population, time period, grouping variable if any, and crude rates).
A stanfit object returned by sampling. This contains the MCMC samples from the posterior distribution of the fitted model.
Logical value indicating if the model included a correlation structure.
A list containing the name of the time-period column in the user-provided data and a data.frame of observed time periods and their index.
If a grouping variable was used, this will be a list containing the name of the grouping variable and a data.frame with group labels and index values.
The user-provided family argument.
By default, the models have Poisson likelihoods for the case counts, with log link function. Alternatively, a Binomial model with logit link function can be specified using the family argument (family = binomial()).
For time t = 1,...,n, the models assign a Poisson probability distribution to the case counts y_t, given log-risk eta_t and population at risk p_t; the log-risk is modeled using the first-difference (or random-walk) prior:
y_t ~ Poisson(p_t * exp(eta_t))
eta_t ~ Normal(eta_{t-1}, sigma)
eta_1 ~ Normal(-6, 5) (-Inf, 0)
sigma ~ Normal(0, 1) (0, Inf)
This style of model has been discussed in Bayesian (bio)statistics for quite some time. See Clayton (1996).
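To make the generative structure concrete, the random-walk Poisson model can be simulated forward in base R. This is a sketch under stated assumptions, not the package's fitting code: the population size, the starting log-risk, the value of sigma, and the years are all made up for illustration, and the priors are replaced by fixed values.

```r
set.seed(42)                      # for reproducibility of this sketch

n <- 20                           # number of time periods
p <- rep(250000, n)               # assumed constant population at risk
sigma <- 0.05                     # assumed scale of period-to-period changes

# Random walk on the log-risk: eta_t ~ Normal(eta_{t-1}, sigma)
eta <- numeric(n)
eta[1] <- -6.5                    # assumed starting log-risk
for (t in 2:n) {
  eta[t] <- rnorm(1, mean = eta[t - 1], sd = sigma)
}

# Poisson likelihood: y_t ~ Poisson(p_t * exp(eta_t))
y <- rpois(n, lambda = p * exp(eta))

# Arrange the draws using the column names described for the data argument
sim <- data.frame(Year = 2001:2020, Count = y, Population = p)
head(sim)
```

The resulting data frame has the Count and Population columns plus a numeric time column, so it mirrors the layout that the data argument expects.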
The above model can be used for multiple distinct groups; in that case, each group will have its own independent time series model.
It is also possible to add a correlation structure to that set of models. Let Y_t be a k-length vector of observations for each of k groups at time t (the capital letter now indicates a vector), then:
Y_t ~ Poisson(P_t * exp(Eta_t))
Eta_t ~ MVNormal(Eta_{t-1}, Sigma)
Eta_1 ~ Normal(-6, 5) (-Inf, 0)
Sigma = diag(sigma) * Omega * diag(sigma)
Omega ~ LKJ(2)
sigma ~ Normal(0, 1) (0, Inf)
where Omega is a correlation matrix and diag(sigma) is a diagonal matrix with scale parameters on the diagonal. This was adopted from Brandt and Williams (2007); for the LKJ prior, see the Stan Users Guide and Reference Manual.
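The decomposition Sigma = diag(sigma) * Omega * diag(sigma) can be illustrated in base R for k = 2 groups. The scale values, the correlation of 0.5, and the previous log-risks below are assumptions chosen for the sketch; in the fitted model these quantities are estimated rather than fixed.

```r
set.seed(7)

sigma <- c(0.1, 0.2)                       # assumed scale parameters, k = 2
Omega <- matrix(c(1, 0.5,
                  0.5, 1), nrow = 2)       # assumed correlation matrix

# Sigma = diag(sigma) * Omega * diag(sigma), using matrix multiplication
Sigma <- diag(sigma) %*% Omega %*% diag(sigma)

# One multivariate normal step: Eta_t ~ MVNormal(Eta_{t-1}, Sigma)
Eta_prev <- c(-6.5, -6.0)                  # assumed previous log-risks
L <- t(chol(Sigma))                        # lower-triangular factor: L %*% t(L) equals Sigma
Eta_next <- Eta_prev + as.vector(L %*% rnorm(2))

Sigma
cov2cor(Sigma)                             # recovers Omega
```

The diagonal of Sigma holds the squared scale parameters, and converting Sigma back to a correlation matrix returns Omega, which is why the two components can be given separate priors.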
If the binomial model is used instead of the Poisson, then the first line of the model specification will be:
y_t ~ binomial(p_t, inverse_logit(eta_t))
All else remains the same. The logit function is log(r/(1-r)), where r is a rate between zero and one; the inverse-logit function is exp(x)/(1 + exp(x)).
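The two link functions can be written directly from the formulas above and compared against base R's qlogis() and plogis(). The helper names logit and inv_logit are defined here for illustration only, and the rate of 0.0015 is an arbitrary example value.

```r
# Helper functions written from the formulas in the text (illustrative names)
logit <- function(r) log(r / (1 - r))
inv_logit <- function(x) exp(x) / (1 + exp(x))

r <- 0.0015                         # an example rate between zero and one
x <- logit(r)

inv_logit(x)                        # returns r again
all.equal(logit(r), qlogis(r))      # agrees with base R's logit
all.equal(inv_logit(x), plogis(x))  # agrees with base R's inverse logit

# For small rates, the logit and the log are nearly the same
c(logit = logit(r), log = log(r))
```

Because the logit and the log nearly coincide for small rates, the Poisson and binomial specifications should behave similarly when the outcome is rare.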
vignette("demonstration", package = "surveil")
vignette("age-standardization", package = "surveil")
apc
standardize
data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum)
fit <- stan_rw(dat, time = Year)
## print summary of results
print(fit)
print(fit$summary)
## plot time trends (rates per 10,000)
plot(fit, scale = 10e3)
plot(fit, style = 'lines', scale = 10e3)
## Summary with MCMC diagnostics (n_eff, Rhat; from rstan)
print(fit$samples)
## cumulative percent change
fit_pc <- apc(fit)
print(fit_pc$cpc)
plot(fit_pc, cumulative = TRUE)
# \donttest{
## age-specific rates
data(cancer)
cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
               chains = 3, iter = 1e3) # for speed only
## plot trends
plot(fit, scale = 10e3)
## age-standardized rates
data(standard)
fit_stands <- standardize(fit,
                          label = standard$age,
                          standard_pop = standard$standard_pop)
print(fit_stands)
plot(fit_stands)
## percent change for age-standardized rates
fit_stands_apc <- apc(fit_stands)
plot(fit_stands_apc)
print(fit_stands_apc)
# }