Space-time modeling in Stan: charting the evolution of U.S. mortality rates
This post is a tutorial on modeling spatial-temporal data using the Stan modeling language, with a focus on areal data. When we take the right approach, Stan can provide a great platform for spatial statistics (presumably, so would other Hamiltonian Monte Carlo samplers). I’ll illustrate by modeling mortality rates for U.S. states and D.C., covering the years 1999 through 2020.
The computational methods presented here are a fairly straigtforward extension of my previous work on CAR models for geostan (mainly presented in this OSF preprint) and time trend models for public health monitoring. I first began using these space-time models in my dissertation proposal and now, after a few years, I am getting around to sharing and using them. Feedback is welcome.
Contents:
- Disease mapping, minus the jargon
- The statistical models
- The mortality data
- Multiple AR models in Stan
- Multiple CAR models in Stan
- The CAR-AR model in Stan
- Model comparison
- Visualizing mortality trends
- Time-varying parameters
- Addressing MCMC sampling failures
- Resources
- How to cite
- References
- Endnotes
Disease mapping, minus the jargon
We’ll be modeling state mortality rates for the years 1999–2020, for women ages 35—44 only. The goal for this kind of modeling project, which is often referred to as ‘disease mapping’, is to make inferences about systematic disease risk for population segments. We want sound answers to questions like: are mid-life mortality rates rising in Wisconsin? How fast are they rising? Is mortality higher in Ohio than Pennsylvania?
When the question pertains to massive groups of people, then raw data from a good vital statistics system can give us a suitable answer in the form of crude incidence rates. (Answering a statistical question is not the same, obviously, as developing an understanding of the matter.) Whenever our research questions concern small or ‘smallish’ groups of people, the patterns and trends of interest become obscured by chance causes (or causes of morbidity that are not persistent over time), or we may ‘see’ patterns where there are none. We usually proceed with the supposition that the data is without much error, but we distinguish conceptually between actual morbidity or mortality (crude rates) and the systematic level of risk (natural and otherwise).1
The purpose of space-time disease mapping is to improve inferences about disease risk for a collection of population segments that are indexed by both time and geography. (Questions about aggregates or averages, like whether mortality has been rising on average in the population, are secondary; answers to those questions will emerge naturally from our disease mapping model if we are successful at making inferences about our collection of rates at each unit of observation.) In this case, we have mortality data for 50 states plus D.C. ( \( S = 51 \) ) spanning 22 years (\( T = 22 \)). This means we have a total of \(N = S \times T = 1,122 \) 'rates' about which we might like to make inferences.
The crude mortality rate is defined as $$r_{i,t} = \frac{y_{i,t}}{p_{i,t}},$$ where \( y \) is the mortality count and \( p \) is the population count. The way we improve upon the crude rate is by taking information from context. We look at rates in previous years, nearby places, similar age-groups, or other demographic segments. Having divided the population into any number of discrete, often well-motivated but at least semi-artificial categories (age groups, places, "races", etc.), we want our models to acknowledge that the categories are not so discrete after all.
The state-level data that we will be using here was chosen in part to keep the case study simple, rather than to illustrate all the problems of small numbers. Many of our rates are based on quite large numbers.
The statistical models
Our modeling framework takes after the hierarchical Bayesian approach developed by C. Wikle, M. Berliner, and N. Cressie (1998; after Berliner 1996). The CAR-AR model specification to be discussed here was described in that article. However, they adopted an approach similar to one described by Cliff and Ord (1981, p. 233), a kind of first-order vector auto-regression that inludes time-lagged spatial neighbors. Santos-Fernandez et al. (2022) extend this to model stream network data (implemented in the SSNbayes R package, with Stan).
The CAR-AR model (defined below) was proposed again by A. Rushworth, D. Lee, and R. Mitchell (2014), who implement it in the ‘CARBayesST’ R package (our specification differs from theirs only in minor ways). They reference a similar model that was introduced by M.D. Ugarte and others (Urgarte et al. 2012). Ugarte et al. are working within the alternative modeling framework introduced by Knorr-Held (2000), which remains very popular. I will note various advantages of the present framework after I have introduced it.
We start with the distinction between chance variation and systematic risk. Because we are modeling rare events, we use the Poisson distribution: $$y_{i,t} \sim Poisson(\mu_{i,t}).$$ This states that we expect the number of cases or deaths \( y \) at any site \( i \) and time \( t \) to fluctuate around a mean value, equal to the population size multiplied by the systematic risk: $$ \begin{aligned} \mu_{i,t} &= p_{i,t} \cdot \eta_{i,t} \\ &= exp(log(p_{i,t}) + log(\eta_{i,t})). \end{aligned} $$ Then the expectation of the observed rate is $$E {\Large[} \frac{y_{i,t}}{p_{i,t}} {\Large]} = \eta_{i,t}.$$ This Poisson model is our likelihood for the observed rates. The next part of the model provides a prior probability distribution for the systematic risk ('rates' or 'risk' for short).
Evolution through time
The next stage of the model is focused on inferences about the systematic risk. The remainder of the modeling will be working with them on the log-scale:
\[\phi_{i,t} = log(\eta_{i,t}).\]Systematic levels of risk typically evolve over time (except in catastrophes, when they jump up). We can view the (unknown) rate as a movement from its own value in the previous time period, helping to narrow down the range of plausible values for it:
\[\phi_{i,t} = \phi_{i, t-1} + \epsilon_{i,t},\]or equivalently,
\[\phi_{i,t} \sim Normal(\phi_{i, t-1}, \tau^2).\]Estimating the scale parameter for that normal distribution gives us the characteristic amount of year-to-year variability in the rates. This is known as a random walk or first-difference model.
We will add an auto-regressive (AR) coefficient to this model, rather than keep it fixed at 1 (as implied above):
\[\phi_{i,t} \sim Normal(\beta \cdot \phi_{i, t-1}, \tau^2).\]If we were to stop at this point we would have a perfectly valid model for our state incidence rates (understanding that we still need to assign some prior distributions for the parameters). To model the evolution of mortality rates in 50 states, for example, we would simply apply this auto-regressive Poisson model to each state (one model per state). We could do them all at once so that they all share the same AR coefficient; or all the AR coefficients could be fixed at 1; or each state could have its own AR coefficient. But in any case, the results may be perfectly usable for public health monitoring.
Spatial auto-regression
Now let’s start over and look at the rates differently. Instead of the time series auto-regression, we could improve our inferences about any given mortality rate by looking at the rates that are nearby and/or part of the same geographic region. Geographic trends are typically weaker than trends in time, so we won’t consider fixing the auto-correlation parameter at 1, as we did above.
As a starting point, a spatial auto-regression for any generic, spatially-indexed vector \( x \) may be written as $$x = \mu + \rho \cdot C (x - \mu) + \epsilon,$$ where \( \rho \) is an auto-correlation parameter, \( C \) is a sparse connectivity matrix, and \( \mu \) is an intercept (we could add covariates here too). The error term \( \epsilon \) is white-noise. The expectation of \( x - \mu \) is zero, but the expected value for the \( i \)th site, given that we know the other \( n-1 \) values, is \( \mu + \rho \cdot C (x - \mu) \) (Cressie 2015, 564). The equation states that our expectation for \( x_i \) will be informed by the overall average rate \( \mu \), by the values near to the \( i \)th location, and by the overall degree of spatial auto-correlation \( \rho \).
(There is a special type of spatial regresion that has the spatial lag of the dependent variable in its expectation, rather than spatially-lagged residuals. This material does not apply to that ‘spatial lag’ model. That model really is quite different from this, and it is not normally called for.)
Click for a short intoduction to \( C \) and spatial lags
The matrix \(C\) allows us to calculate the average of neighboring values for any focal observation; we call this the (mean) spatial lag. The \( i \)th row of \( C \) contains the weights required to calculate the spatial lag \( \tilde{ x } \) for the \( i \)th value of \( x \): \( \tilde{x}_i = \sum_j^n c_{i,j} \cdot x_j\). When we pre-multiply a vector \(x\) by \( C \), we apply this spatial lag operator to each element of \( x \) to yield a vector of spatially lagged values \( \tilde{x} = C x \). In a spatial auto-regressive model, the spatial lag of residuals \( C ( x - \mu ) \) is made more or less important to our expectations when it is multiplied by \( \rho \), the spatial auto-correlation (SA) parameter.
Lets say that \( x \) is indexed to the 48 contiguous U.S. states and D.C., so each \(x_i\) corresponds to a 'state'. We can construct an \( n \)-by-\( n \) adjacency matrix \( A \) that identifies which states are neighbors of one another. If we order states alphabetically so the first (\(i = 1\)) observation is for Alabama, then the first row of \( A \), the row vector denoted by \(A_{i,}\), will contain 1s in those index positions that correspond to the neighbors of Alabama (Florida \(i=9\), Georgia \(i=10\), Mississippi \(i=23\), Tennessee \(i=41\)) and the rest will be zeroes (including \(i=1\) for Alabama itself). Then \( \tilde{x}_i = \sum_j^n a_{i,j} \cdot x_j\) will yield the sum of the neighboring or spatially-lagged values for the \( i \)th value. Dividing this by the number of neighbors (4 for Alabama) gives the mean spatial lag.
The matrix \( C \) can be constructed out of \(A\), often by row-normalizing: dividing the elements of each row by their corresponding row sum. For Alabama, \(C_{1,}\) will contain \(1/4\) just where \(A_{1,}\) contains 1.
You can find an introduction to spatial connectivity matrices here.A comparison with a more standard model may be helpful. Remember that this part of the model is our prior distribution for the rates \( \phi \) (for the moment, these could be an \( S \)-length vector, or an \( S \times T \) vector). A simple prior probability model could state that the rates have a finite, characteristic amount of variation \( \tau \) around their mean \( \mu \): $$\phi_{i,t} \sim Normal( \mu, \tau^2).$$ As a prior distribution, this will assign low probability to values of \( \phi_{i,t} \) that are uncharacteristically far from \( \mu \). Once we have made an inference about the values of \( \mu \) and \( \tau \), we are positioned to say just what kind of value is 'uncharacteristic'. Then any crude rates that are 'uncharacteristic' will be assigned a low prior probability, as if to say 'I am skeptical that the systematic risk is as high (or low) as the crude rate alone would indicate'.
What happens next depends on the likelihood. With respect to the crude rates that already contain a lot of information (i.e., they are based on large population sizes), our inferences will not be sensitive to our prior skepticism; the likelihood will overwhelm the prior, and the posterior probability for \( \eta_{i,t} \) will be centered right on the crude rate \( r_{i,t} \). The smaller the population is, the more our inferences will be sensitive to our prior skepticism. That skepticism is saying 'I think the systematic risk here is probably more near to the overall mean \( \mu \) than the crude rate alone indicates'. This is the 'shrinkage' to the mean that hierarchical models often impose on estimates.
The main difference here, with our spatial auto-regression, is that we are getting more precise by 'shrinking' our estimates towards a local mean \( \mu + \rho \cdot C (\phi - \mu) \), rather than the overall mean. This can make quite the difference. Mortality rates in Mississippi, for example, may be outliers relative to the national average. But if you understand the U.S. South as a region then it is really inappropriate to impose that kind of skepticism on the rates. Similarly for Appalachia, or for any number of areas at various spatial scales. To use the standard model would be biased (in the common-sense meaning of the word).
The CAR model
There are multiple ways to translate our concept of an auto-regression from the one-dimensional temporal domain into the two-dimensional spatial domain (Cliff and Ord 1981). We will adopt what is known as the conditionally-specified spatial auto-regressive (CAR) model. The CAR model is a multivariate normal distribution, defined by its covariance matrix:
\[\begin{aligned} \phi \sim Normal(\mu, \Sigma) \\ \Sigma = (I - \rho \cdot C)^{-1} M \end{aligned}\]Here we have:
\[\begin{aligned} &I: \text{n-by-n identity matrix} \\ &\rho: \text{spatial auto-correlation parameter} \\ &C: \text{n-by-n connectivity matrix} \\ &M: \text{n-by-n diagonal matrix containing scale parameters,} \tau_i. \end{aligned}\]The range of values that \( \rho \) can take on is limited by our requirement that \( \Sigma \) be a properly specified covariance matrix; depending on how one creates \( C \), the range may (or may not) be similar to a correlation coefficient: its maximum value may be 1 or less than 1, and its minimum permissible value often differs from \( -1 \). In the most common specification ('WCAR', used here), the maximum value is 1. Whatever its range, in certain respects the spatial dependence parameter of the CAR model does not 'behave' similarly to Pearson's correlation coefficient. A value of 0.9 is a very strong degree of correlation, but it is a more moderate value for the auto-correlation parameter of the CAR model. (The simultaneously-specified spatial auto-regressive or SAR model does have an auto-correlation parameter that seems to satisfy the intuition that we develop through the study of correlation.)
The matrix \( M \) contains variance terms for each location. For our particular implementation of the CAR model, the variances will be sensitive to number of neighbors a site has: more neighbors means more information, thus we will have less uncertainty remaining (encoded as lower variance). When examining a site that has fewer neighbors, we gain less information by considering the neighboring values, so the uncertainty is greater (encoded as higher variance).
Specifically, we have a scalar variance parameter \( \tau \) multiplied by the inverse of the weights \( w_i \), where the weights are equal to the number of neighbors. So the diagonal elements of \( M \) are $$m_{i,i} = \tau_i = \frac{1}{w_i} \tau.$$
CARs only
Now lets return to our space-time mortality data. We are specifying an alternative to the simple vector auto-regression approach that we specified earlier. We can proceed in an analogous way, though, by applying the CAR model successively at each time period. Including the likelihood, we have:
\[\begin{aligned} &y_{i,t} \sim Poisson(p_{i,t} \cdot exp(\phi_{i,t})) \\ &\phi_{,t} \sim Normal( \mu, \Sigma ). \end{aligned}\]\( \phi_{,t} \) is an \( S \)-length vector, a snapshot of mortality for all \( S \) locations at time \( t \). If desired, one can allow the parameters of the model to vary over time. The mean \( \mu \) can evolve over time, as may the auto-correlation \( \rho \) and and scale \( \tau \) parameters (yielding \(\mu_t, \rho_t, \tau_t \)).
The last part of this model is to add prior distributions for those parameters. Because we are working with rates on the log scale, something like this is not unreasonable:
\[\begin{aligned} &\mu \sim Normal( -4, 4), \mu < 0 \\ &\tau \sim Normal(0, 1), \tau > 0. \end{aligned}\]The constraints on the parameters imply that those are half-normal distributions. The spatial auto-correlation parameter \( \rho \) will be assigned a uniform prior distribution across its full support, which is determined by the eigenvectors of \( C \).
Once again, we have a simple way to proceed but it is nonetheless a valid space-time model which can be (and, strictly speaking, is often) applied to public health data.
The CAR-AR model
Now we have two options for modeling our rates: the time-series auto-regression and the spatial auto-regression. Our third approach is to combine them. We model a time trend for each location, and then we apply the CAR model to their cross-sectional errors.
Altogether, our third model is: $$ \begin{aligned} &y_{i,t} \sim Poisson(p_{i,t} \cdot exp( \phi_{i,t}) ) \\ &\phi_{,t} \sim Normal(\beta \cdot \phi_{, t-1}, \Sigma ), t > 1 \\ &\phi_{,1} \sim Normal( \mu, \Sigma) \\ &\Sigma = (I - \rho \cdot C)^{-1} M, \end{aligned} $$ where we reserve the option to make any of the scalar parameters vary over time \( t \) or geography \( i \) (the possibilities indicated roughly by: \( \mu_t, \beta_t, \beta_i, \rho_t, \tau_t \) ).
That amounts to a complex space-time auto-correlation structure, but with this framework there is no need to specify the implied (potentially massive) covariance matrix (nor do we require a series of complementary ‘random effect’ parameter vectors). As Wikle, Berliner, and Cressie write,
The Bayesian hierarchical strategy that we propose allows complicated structure to be modelled in terms of means at various stages, rather than a model for a massive joint covariance matrix. (118)
For example, our inference about one rate \( \eta_{i, t} \) will be informed directly by its own past value \( \eta_{i, t-1} \) and therefore indirectly by the past values of neighboring values.
Another advantage of this particular (CAR-AR) model is that it maintains the distinction between serial and spatial auto-correlation. The two are usually not of the same degree, with the former generally being stronger than the latter. Hence, containing both an AR parameter \( \beta \) and an SA parameter \( \rho \) is a positive feature of this model. Another virtue is that the CAR-AR model can be reduced to either of its two component parts and still yield a reasonable model. It can be expanded with space- or time-varying parameters, though I suspect that provides more flexibility than most projects warrant.
Because the CAR and AR models are proper probability models, we don't have to worry about imposing additional constraints on any parameter vectors to make them identifiable. In short, the CAR-AR model has a number of advantages that distinguish it from the BYM/Knorr-Held framework.
Next we are going to take a look at the mortality data, and then we’ll implement each of our three model specifications using Stan and R.
The mortality data
I accessed mortality data from the CDC Wonder database and restricted the request to females between the ages of 35 and 44. I then cleaned up the column names. I’ll be using R for this. You can download the data abstract from my github page:
Here is a look at the columns of interest:
Year | State | Deaths | Population |
---|---|---|---|
1999 | Alabama | 728 | 351301 |
2000 | Alabama | 650 | 350582 |
2001 | Alabama | 713 | 346986 |
2002 | Alabama | 701 | 340884 |
2003 | Alabama | 674 | 335121 |
We’ll plot the crude rates for each state as a time trend, reported as deaths per 100,000 at risk.
Multiple AR models in Stan
This is the first of three model specifications that we will build. In this part, we model time trends for each state using the hierarchical Poisson AR models discussed above. We will apply one AR time trend model for each state.
The notation used above to describe the AR models was working element-wise, calculating probability densities for each observation. We want to vectorize the calculations as much as possible (avoid ‘for’ loops).
We pass the observations to Stan as an \( S \times T \) array.
Then we can declare our scalar parameters plus an array of rates phi
(\phi).
Inside the model
block, we are going to create another array, phi_mu
, to store the mean of our model for the rates. For the first time period, our model for the rates ( \phi_{i,1} ) is
For all other times the model is
\[\phi_{i,t} \sim Normal( \beta \cdot \phi_{i, t-1}, \tau^2).\]So our model block begins like this, looping over each time period, creating the mean as appropriate:
Now we can vectorize the log-probability density calculation, using the normal distribution for the prior model and the Poisson distribution for the likelihood (adding these inside the same ‘for’ loop in the model block):
We then add prior distributions for the scalar parameters:
Lastly, we are going to collect samples of the log-likelihood of each observation. We will use these later for model comparison. We have to loop over every observation to complete these calculations.
The loop is not computationally expensive, but storing large arrays of parameters can slow things down with large (N).
Click for the complete AR Stan model
Our mortality data is stored in the data.frame dat
. Creating arrays with the proper order requires a bit of care. We will first order our data.frame by state and year, and then prepare the list of data for Stan:
Click for practice creating arrays in R, they are tricky
[,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12This one allows for a visual check of the order [someone made a mistake with arrays today]
Now we can compile the model and draw samples:
Inference for Stan model: anon_model. 4 chains, each with iter=1000; warmup=500; thin=1; post-warmup draws per chain=500, total post-warmup draws=2000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat alpha -6.55 0 0.01 -6.57 -6.56 -6.55 -6.54 -6.53 26881 beta_ar 1.00 0 0.00 1.00 1.00 1.00 1.00 1.00 25911 tau 0.07 0 0.00 0.07 0.07 0.07 0.07 0.08 8691 Samples were drawn using NUTS(diag_e) at Thu Jan 9 17:40:14 2025. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
Before moving on, we are going to grab some results for Alaska and Hawaii. We’ll come back for these later.
Multiple CAR models in Stan
Our CAR models require two data inputs than are not in AR models: (1) a spatial connectivity matrix, and (2) some quantities that will help us calculate (more quickly) the probability density for the CAR model.
Also, before getting started, note that we will be dropping Hawaii and Alaska from the modeling at this point. Why? Because they are far from all the other states (physically and otherwise), so they won’t benefit from this type of model. The AR Poisson model would suffice for those states.
We are going to prepare the spatial data first, then we’ll work on our Stan model.
Preparing the spatial data
I will use the ‘geostan’ (Donegan 2022) R package for help with the CAR models, ‘tigris’ (Walker 2023) to download some cartographic boundaries, and ‘sf’ (Pebesma 2018) for GIS operations and mapping.
The CAR models we are using here are implemented in ‘geostan’, which uses the prep_car_data
function to convert a connectivity matrix into the data required by the Stan model. So we can use that here as well:
Contiguity condition: rook Number of neighbors per unit, summary: Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 3.000 4.000 4.367 6.000 8.000 Spatial weights, summary: Min. 1st Qu. Median Mean 3rd Qu. Max. 1 1 1 1 1 1 Range of permissible rho values: -1.392, 1
Click for distance-based connectivity
This code snippet shows how to create a distance-based CAR specification. Details can be found in the OSF preprint on CAR models in Stan (Donegan 2021). For our model, the adjacency-based connectivity matrix is a better option. You can try both and compare using DIC and residual diagnostics.Now we need to create our data list again (now \( S = 49 \)) and then append the the CAR parts to it:
Before starting, we can examine Moran’s I, the index of spatial auto-correlation, at teach time point (this ‘geostan’ vignette provides a short introduction to Moran’s I, which is also known as the Moran coefficient). This exercise can also serve as a quick check on our input data stan_dl
: if Moran’s I values are near zero then we may have just screwed up the indexing/order.
[1] 0.371 0.302 0.329 0.449 0.335 0.367 0.456 0.372 0.385 0.376 0.453 0.433 0.370 [14] 0.456 0.416 0.412 0.366 0.379 0.391 0.424 0.344 0.387
The results indicate moderately positive SA. A quick eyeball analysis suggests that the degree of SA was not changing much over time (it seems to be bouncing around 0.38, without any trend).
CAR models in Stan
It is possible to use Stan’s built-in multivariate normal distribution functions to fit CAR models. But we want to take advantage of the fact that the connectivity matrix is sparse, as this allows us to use sparse matrix multiplication. We can also use some well-known tricks for calculating the log-determinant of the covariance matrix. For details, see the OSF preprint on CAR models in Stan (Donegan 2021). According to that study, these models can sample about 10 times faster than CAR models in Nimble (a popular R package for MCMC with spatial models).
First, we will define a custom _lpdf
function in Stan. Second, we will update the data
block of our Stan model. Third, we will update the parameters
and model
blocks.
Below is our custom Stan function for calculating the log-probability of the CAR model, wcar_normal_lpdf
. For present purposes, just save this in a file (in your working directory) with the name car_lpdf.stan
.
Click for a more generally applicable CAR model
The wcar_lpdf
function is usually what you need. If you fit distance-based CAR models, you need a more general specification. The WCAR is generally faster than this, but this is still pretty good. This one works for any valid CAR specification (including WCAR). Again, see Donegan (2021) for details.
To make this function available to our Stan model, we put it in the functions
block at the top of the file:
Below are the data inputs we require (add these to the data
block). Each of these items is already found in our stan_dl
list, thanks to the geostan::prep_car_data
function.
Our parameters
block now needs: an intercept alpha
, a scale parameter tau
(allow positive values only in the declaration), and an SA parameter rho
. Permissible values for the SA parameter are determined by the eigenvalues lambda
(we saw those printed to the R console when we used prep_car_data
).
In our model
block, we will use phi_mu
again, but in this case phi_mu
is just a constant intercept. When we loop through the time periods, we calculate the CAR model (prior distribution) and then the Poisson (likelihood). The priors on our scalar parameters are the same as previously, except we drop the AR coefficient. The prior for rho
is uniform across its full support (we don’t need to declare that in Stan, its the default).
(Again, we could allow alpha
, tau
, and rho
to vary over time. In that case, we would declare them as vectors in the parameters
block, then adjust the loop as needed; e.g., assign alpha[tt]
to phi_mu
inside the ‘for’ loop. My intuition is that this is all or nothing: I would not allow one parameter to vary while fixing the others. They are not independent of one another.)
Click for the complete 'multiple CARs' Stan model
Running the CAR models
Save the Stan model code in a file named ‘CARs.stan’. Then we sample from the model and print some results. On my laptop, sampling completed in 5.5 seconds per chain; using parallel processing, the full runtime was 14 seconds.
There were no warnings from Stan, and the summary printout shows that the effective sample sizes are more than adequate for inference. For the purposes of disease mapping its important to check diagnostics for the log-rate estimates phi
; we can examine those using print(S2)
. Doing so (not shown below), we find that the effective samples sizes (ESS) for the log-rates are all quite high (over 2,000 and over 3,000), and the R-hats are all 1.00.
Inference for Stan model: anon_model. 4 chains, each with iter=1000; warmup=500; thin=1; post-warmup draws per chain=500, total post-warmup draws=2000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat alpha -6.52 0 0.02 -6.56 -6.54 -6.52 -6.51 -6.49 2645 1 rho 0.93 0 0.02 0.89 0.92 0.93 0.94 0.96 2319 1 tau 0.37 0 0.01 0.35 0.36 0.37 0.38 0.39 1650 1 Samples were drawn using NUTS(diag_e) at Fri Jan 10 08:07:12 2025. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
The CAR-AR model in Stan
Finally, we can fit our full CAR-AR model. We don’t need to make any changes to our stan_dl
data list (after the CAR models). We have all the peices ready for the Stan model, too, we just need to put them together.
To create this model, I started by copying and pasting the contents of 'ARs.stan'
into a new file named 'CAR_AR.stan'
. Relative to 'ARs.stan'
, only the following changes are needed:
-
Add a
functions
block with our#include car_lpdf.stan
line. -
In the
data
block, make sure you add the CAR parts. You can copy and paste fromCARs.stan'
. -
In the
parameters
block, include all of our parameters:alpha
,rho
,tau
,beta_ar
, andphi
. -
Finally, in the
model
block, we need to replace thenormal_lpdf
with ourwcar_normal_lpdf
. Again, you can copy and paste from'CARs.stan'
.
After making those changes, the 'CAR_AR.stan'
file will look like this:
We are ready to go now:
Inference for Stan model: anon_model. 4 chains, each with iter=1000; warmup=500; thin=1; post-warmup draws per chain=500, total post-warmup draws=2000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat alpha -6.56 0 0.04 -6.64 -6.58 -6.56 -6.53 -6.48 3019 1.00 beta_ar 1.00 0 0.00 1.00 1.00 1.00 1.00 1.00 2035 1.00 rho 0.98 0 0.01 0.96 0.97 0.98 0.98 0.99 1731 1.00 tau 0.09 0 0.00 0.08 0.08 0.09 0.09 0.09 426 1.02 Samples were drawn using NUTS(diag_e) at Fri Jan 10 08:56:57 2025. For each parameter, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
On my laptop, sampling finished in 23 seconds. Stan issued a warning about bulk effective sample size (ESS) being low. There is a low threshold for warnings, which is good. In this case, the results are fine for analysis: the parameter with lowest ESS is tau
(426), and ESS for the log-rates and log-likelihoods are all in the thousands.
I’m guessing that the warning was issued because the ESS for lp__
is only 366. The lp__
is the log-probability; I take this as an indicator for ESS that is relevant for joint probabilities, like differences between rates. So it is a fair ESS warning, although still fairly conservative.
If these results were being reported in a publication, I would run this again with more samples (iter = 2e3
or so should be fine).
There is one interesting feature of these results already. The estimate for rho
is 0.98, which is actually higher than it was in the ‘CARs only’ model. Why? The CAR model here is applied to deviations from the state time trends (the AR residuals). My reading of this is that the mortality rates for neighboring states are more similar in their year-by-year movements than they are in their levels.
Model comparison
For each of our three models, we collected samples for the log-likelihoods. Those are for computing information criteria. I will use the DIC for these models.
(You can use WAIC if you want to; if you check diagnostics using the loo R package, you may find that the diagnostics are crap for your spatial models. As I understand it, that is just a feature of WAIC with auto-correlated data, and it does not necessarily indicate a problem with your model.)
This code first defines an R function to calculate DIC from a Stan model (the model must contain a log_lik
parameter vector), and then applies it to our three models:
DIC penalty model 1 10536.99 804.73 AR 2 10517.93 1066.08 CAR 3 10001.27 624.69 CAR-AR
For interpretation: lower values of DIC indicate a better fit, and small differences between DICs are not to be given much weight.
The ‘CARs only’ model looks similar to the AR-only model, while the CAR-AR model has considerably lower DIC than both of the others (a difference of at least -500, including a considerably smaller penalty term). We’ll use the CAR-AR model for final section.
Visualizing mortality trends
Using the CAR-AR model results, we will first plot time trends for a selection of states. Next we’ll map mortality for 2020. Finally, we’ll chart percent change in mortality rates from 1999 to 2019 (stopping just before the pandemic).
Model-based time trends
Here we need to summarize MCMC samples by state and year. Once we extract a matrix of samples for phi
(\( \phi \)), we exponentiate them to get the rates \( \eta \). Then we need to identify which columns correspond to whatever state we are interested in. The regular expressions found in the code below are introduced for that task.
For each state, we do the following:
- Identify its corresponding column positions in
eta
. Each state has 22 columns ineta
, one per year. Each year has 2,000 MCMC samples. - Summarize the posterior probability distribution at each year: calculate the mean of the samples, and extreme quantiles (2.5%, 97.5%).
- Plot the mean (our estimate of the trend mortality risk) with credible intervals.
[1] 1 4 7 11 17 23 36 37 47
[1] ",1]" ",4]" ",7]" ",11]" ",17]" ",23]" ",36]" ",37]" ",47]"
Mapping mortality
Here we map mortality rate estimates for the year 2020. We'll start again by getting our matrix samples for \( \eta \), then we extract the columns corresponding to \( t = 22 \) (year 2020).
This function takes a vector of values \( x \), break points for assigning values to cateries, and a color palette. It returns a vector with appropriate colors from the palette together with labels for our map legend. The palette is from ColorBrewer.
I’m using the classInt
package to create the breaks (Bivand 2023).
Percent change analysis
This is our last analysis. We will use our matrix of MCMC samples for the rates \( \eta \) to produce samples from the posterior probability distributions for our quantities of interest, namely percent change over time. Then we will summarize those samples and put them into a figure that displays the change in mortality rates for each state plus D.C.
We can bring in Alaska and Hawaii from our AR-only model:
Now we add the summary of the results to a figure, displaying the states in order of their percent change while illustrating the range of plausible estimates with credible intervals.
Time-varying parameters
We can easily adjust our Stan models to specify time-varying parameters. Exploratory analysis and standard model comparison techniques (or workflows) can assist in deciding whether such an expansion of the model is worthwhile to consider. Our bit of exploratory analysis (time plots and Moran’s I) suggest that this level of flexibility is not warranted in this case.
For the AR only model, we can potentially allow the AR coefficient and the scale parameter to vary over time. (The intercept only applies to the first time period anyways.)
For the CARs only model, we can allow the intercept, spatial dependence, and scale parameters to vary with time.
For the CAR-AR model, the AR coefficient, spatial dependence, and scale parameters can vary with time. (The intercept only applies to the first time period anyways.)
We may want to be able to switch back and forth easily, between fixed and time-varying parameters. In the data
block, start by introducing a binary indicator fixed_ar
, which takes a value of either 0 or 1. Do the same with fixed_tau
.
We will use the fixed_ar
indicator as a ‘switch’ when declaring the parameter beta_ar
. If we declare beta_ar
to be a vector, we can use the switch to decide if it should be length 1 or length TT
:
Inside the square brackets is a conditional statement. You can read it as an ‘if-else’ statement: “If fixed_ar equal 1, make beta_ar a vector of length 1; otherwise, make it a vector of length TT-1.” (It needs to be TT-1 because the first time period does not have a past value to use, instead its centered on the intercept.)
Similarly, use the appropriate indicator when declaring tau
as a vector:
We can use the same technique in the model
block, using the contitional statement to determine the right option ‘on the fly’:
In terms of our workflow in R, the only modification required is the addition of the indicators:
The following drop-down style code blocks contain complete Stan models with options for time-varying parameters. Each also includes a flag for inclusion of the log-likelihood, keep_log_lik
. With large N, storing samples of log_lik
can reduce computational efficiency, due to memory overload. In that case, you can use this flag to drop those samples unless or until you need to use them.
Click for the 'AR only' Stan model with time-varying parameters
Click for the 'CARs only' Stan model with time-varying parameters
Click for the full CAR-AR model with options
Addressing MCMC sampling failures
All of our models sampled smoothly and quickly. If our data were more sparse, this may not have gone as smoothly. By ‘sparse’, I mean (based on limited experience) that many of our small areas have fewer than 10 observations. With sparse data, the CAR models I have provided above will tend to sample poorly (sampling may proceed in fits and starts, convergence statistics will be bad, and you may have warnings of divergent transitions). This is not due to a problem with, or limitation of, proper CAR models. Rather, this is a general phenomenon encountered in hierarchical modeling with MCMC.
If you use this, let me know how it goes (especially if it gives you trouble or you find an issue).
Re-parameterizing the CAR model
To fix it, we just have to change the way we encode our model. This is generally discussed as a change from the ‘centered’ to ‘non-centered’ parameterization (see, e.g., here).
To clarify, view our first CAR model like this: \begin{aligned} &y_i \sim Poisson(p_i \cdot exp(\phi_i)) \\ &\phi \sim Normal(\mu, \Sigma(\rho, \tau)) \end{aligned} where the CAR covariance matrix is specified in the usual way, with a spatial dependence parameter and a scale parameter (now made explicit in the notation). To fit a spatial auto-regressive model to sparse data, we can encode the same model in a slightly different way: \begin{aligned} &y_i \sim Poisson(p_i \cdot exp(\phi_i)) \\ &\phi = \mu + \tilde{\phi} \cdot \tau \\ &\tilde{\phi} \sim Normal(0, \Sigma(\rho, 1)), \end{aligned} where \( \tilde{\phi} \) has mean of zero and scale of 1. The latter method is known generally as the 'non-centered parameterization'.
This works well for sparse spatial data (it is implemented in geostan as the ‘zero-mean parameterization’ or zmp). For space-time modeling, this approach seems to require adjustment. It does not sample adequately.
In my experience, it samples much better in the CAR-AR model if we do not force the CAR model to have a mean of zero, but we do keep the scale parameter fixed at one. This is the parameterization that we want to use for sparse space-time data: \begin{aligned} &\phi = \tilde{\phi} \cdot \tau \\ &\tilde{\phi} \sim Normal(\mu, \Sigma(\rho, 1)) \end{aligned} where \( \mu \) may contains the AR model.
An important note: neither the standard (‘centered’) nor the adjusted, ‘sparse-data’ parameterization will work well in all situations. If you apply the latter parameterization to non-sparse data (like our state mortality rates) it will probably sample poorly. If one doesn’t seem to work well, try the other.
Stan code for the sparse-data fix
We can use the ‘sparse-data’ CAR model without making any other changes to our workflow.
For the CAR-AR model, our parameters
block needs to include \( \tilde{\phi} \) as phi_tilde
(replacing phi
):
Then the model
block needs to be adjusted, as:
To avoid having to change your workflow in R, adjust the generated quantities block too (so we get samples from the posterior for phi
):
You can run this model using the same data input as our standard CAR-AR model, stan_dl
.
Below, I provide full Stan code for this model but also apply the technique to each of our model types.
Click to view the complete Stan model with sparse-data parameterization
When using this Stan model, you have to specify which model type you want to use. In R:Resources
You can find the R code for this tutorial (as a single R script), here on the github page for this post, together with the data and Stan files.
Others who have worked on implementing spatial models in Stan include:
- Max Joseph’s work on proper CAR models. These methods are implemented in the ‘brms’ R package.
- Mitzi Morris’s work on intrinsic CAR models.
- James Hogg’s work on CAR models and disease modeling, including the Leroux CAR specification.
- Adam Howe’s contribution to adjusting ICAR models for disconnected graphs.
- Edgar Santos-Fernandez’s spatio-temporal stream network models and the SSNbayes R package.
- Marco Gramatica’s multivariate CAR models and misaligned areal data (with Peter Congdon and Slivia Liverani).
- My own work on proper CAR models and spatial econometric models for geostan, and extending Howe’s contributions with M. Morris.
I’m sure there are others; let me know if I’ve missed anything that you’ve found helpful.
How to cite
Attribution for the CAR models presented here should be to Donegan (2021) or Donegan (2022). Recommended citation for the blog post:
Donegan, C. (2025) “Space-time modeling in Stan: charting the evolution of U.S. mortality rates.” Retrieved [date] from https://connordonegan.github.io/statistics/public_health/2025/01/08/space-time-mortality.html
References
Berliner, L. Mark (1996). Hierarchical Bayesian time series models. In K. Hanson and R. Silver (eds), Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers, Dordrecht, pp. 15—22.
Bivand R (2023). classInt: Choose Univariate Class Intervals. R package version 0.4-10, https://CRAN.R-project.org/package=classInt.
Cliff, AD and JK Ord (1981). Spatial Processes: Models and Applications. Pion Press.
Cressie, Noel (2015 (1993)). Statistics for Spatial Data. Wiley Classics, Revised Edition.
Donegan, Connor (2021). Building spatial conditional autoregressive (CAR) models in the Stan programming language. OSF Preprints. https://doi.org/10.31219/osf.io/3ey65 Nb: the present blog post contains an updated implementation of the Stan models that were introduced by this OSF preprint.
Donegan, Connor (2022) ‘geostan’: An R package for Bayesian spatial analysis. The Journal of Open Source Software. 7, no. 79: 4716. https://joss.theoj.org/papers/10.21105/joss.04716
Knorr-Held, Leonhard (2000). Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine 19(17-18), 2555—2567.
Lee, Duncan, A. Rushworth, and G. Napier (2018). Spatio-Temporal Areal Unit Modeling in R with Conditional Autoregressive Priors Using the CARBayesST Package. Journal of Statistical Software, 84(9), 1–39 https://doi.org/10.18637/jss.v084.i09.
Morris, Mitzi, Katherine Wheeler-Martin, Dan Simpson, Stephen J. Mooney, Andrew Gelman, and Charles DiMaggio. Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in Stan. Spatial and spatio-temporal epidemiology 31 (2019): 100301.
Pebesma, Edzer (2018). Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
Rushworth, Alastair, Duncan Lee, and Richard Mitchell (2014). A spatio-temporal model for estimating the long-term effects of air pollution on respiratory hospital admissions in Greater London. Spatial and Spatio-temporal Epidemiology 10, 29—38.
Santos-Fernandez, E., Ver Hoef, J. M., Peterson, E. E., McGree, J., Isaak, D. J., & Mengersen, K. (2022). Bayesian spatio-temporal models for stream networks. Computational Statistics & Data Analysis, 170, 107446. Pre-print pdf
Urgart, Maria Dolores, Jaione Etxeberria, T. Goicoa, and E. Ardanaz. Gender-specific spatio-temporal patterns of colorectal cancer incidence in Navarre, Spain (1990–2005). Cancer Epidemiology 36, 254—262.
Walker, Kyle (2023). tigris: Load Census TIGER/Line Shapefiles. R package version 2.0.4, https://CRAN.R-project.org/package=tigris.
Wikle, Christopher K., L. Mark Berliner, and Noel Cressie (1998). Hierarchical Bayesian space-time models. Environmental and Ecological Statistics 5, 117—154.
Endnotes
-
At least, that is my best stab at communicating the meaning of ‘chance’ here and what exactly the target of our inference is. ↩