Obtain predicted values from a fitted model by providing new covariate values.
Goulard, Michael, Thibault Laurent, and Christine Thomas-Agnan (2017). About predictions in spatial autoregressive models: optimal and almost optimal strategies. Spatial Economic Analysis 12 (2-3): 304-325.
LeSage, James (2014). What Regional Scientists Need to Know about Spatial Econometrics. The Review of Regional Science 44: 13-32 (2014 Southern Regional Science Association Fellows Address).
LeSage, James, & Robert kelley Pace (2009). Introduction to Spatial Econometrics. Chapman and Hall/CRC.
A fitted model object of class geostan_fit
.
A data frame in which to look for variables with which to predict. Note that if the model formula includes an offset term, newdata
must contain the offset column (see examples below). If covariates in the model were centered using the centerx
argument, the predict.geostan_fit
method will automatically center the predictors in newdata
using the values stored in object$x_center
. If newdata
is missing, the fitted values of the model will be returned.
An N-by-1 matrix of MCMC samples for the intercept; this is provided by default. If used, note that the intercept needs to be provided on the scale of the linear predictor. This argument might be used if there is a need to incorporate the spatial trend term (as a spatially-varying intercept).
Optional vector of numeric values or a logical scalar to pass to scale
. Defaults to using object$x_center
. If the model was fit using centerx = TRUE
, then covariates were centered and their mean values are stored in object$x_center
and the predict
method will use them automatically to center newdata
; if the model was fit with centerx = FALSE
, then object$x_center = FALSE
and newdata
will not be centered.
If FALSE
, a matrix containing samples from the posterior distribution at each observation is returned. The default, TRUE
, will summarize results by providing an estimate (mean) and credible interval (formed by taking quantiles of the MCMC samples).
By default, results from predict
are on the scale of the linear predictor (type = "link")
). The alternative (type = "response"
) is on the scale of the response variable. For example, the default return values for a Poisson model on the log scale, and using type = "response"
will return the original scale of the outcome variable (by exponentiating the log values).
Logical. If add_slx = TRUE
, any spatially-lagged covariates that were specified through the 'slx' argument (of the model fitting function, e.g., stan_glm
) will be added to the linear predictor. The spatial lag terms will be calculated internally using object$C
, the spatial weights matrix used to fit the model. Hence, newdata
must have N = object$N
rows. Predictions from spatial lag models (SAR models of type 'SLM' and 'SDLM') always include the SLX terms (i.e., any value passed to add_slx
will be overwritten with TRUE
).
For SAR models of type 'SLM' or 'SDLM' only; use an approximation for matrix inversion? See details below.
Number of matrix powers to use with approx
.
Not used
If summary = FALSE
, a matrix of samples is returned. If summary = TRUE
(the default), a data frame is returned.
The primary purpose of the predict method is to explore marginal effects of covariates. The uncertainty present in these predictions refers to uncertainty in the expected value of the model. The expectation does not include the error term of the model (nb: one expects actual observations to form a cloud of points around the expected value). By contrast, posterior_predict returns the complete (posterior) predictive distribution of the model (the expectation plus noise).
The model formula will be taken from object$formula
, and then a model matrix will be created by passing newdata
to the model.frame function (as in: model.frame(object$formula, newdata)
. Parameters are taken from as.matrix(object, pars = c("intercept", "beta"))
.
The examples illustrate how to use the function in most cases.
Special considerations apply to models with spatially-lagged covariates and a spatially-lagged dependent variable (i.e., the 'SLM' and 'SDLM' models fit by stan_sar).
Spatially-lagged covariates which were included via the slx
argument will, by default, not be included in the predicted values. (The user can have greater control by manually adding the spatially-lagged covariate to the main model formula.) The slx
term will be be included in predictions if add_slx = TRUE
or if the fitted model is a SAR model of type 'SLM' or 'SDLM'. In either of those cases, newdata
must have the same number of rows as were used to fit the original data.
The typical 'marginal effect' interpretation of the regression coefficients does not hold for the SAR models of type 'SLM' or 'SDLM'. For details on these 'spillover effects', see LeSage and Pace (2009), LeSage (2014), and impacts.
Predictions for the spatial lag model (SAR models of type 'SLM') are equal to:
$$
(I - \rho W)^{-1} X \beta
$$
where \(X \beta\) contains the intercept and covariates. Predictions for the spatial Durbin lag model (SAR models of type 'SDLM') are equal to:
$$
(I - \rho W)^{-1} (X \beta + WX \gamma)
$$
where \(WX \gamma\) are spatially lagged covariates multiplied by their coefficients. For this reason, the predict.geostan_fit
method requires that newdata
have as many rows as the original data (so that nrow(newdata) == nrow(object$C)
); the spatial weights matrix will be taken from object$C
.
The inverse of the matrix \((I - \rho W)\) can be time consuming to compute (especially when iterating over MCMC samples). You can use approx = TRUE
to approximate the inverse using a series of matrix powers. The argument \(K\) controls how many powers to use for the approximation. As a rule, higher values of \(\rho\) require larger \(K\) to obtain accuracy. Notice that \(\rho^K\) should be close to zero for the approximation to hold. For example, for \(\rho = .5\) a value of \(K=8\) may suffice (\(0.5^8 = 0.004\)), but larger values of \(\rho\) require higher values of \(K\).
In generalized linear models (such as Poisson and Binomial models) marginal effects plots on the response scale may be sensitive to the level of other covariates in the model and to geographic location (given a spatially-varying mean value). If the model includes a spatial autocorrelation component (for example, you used a spatial CAR, SAR, or ESF model, or used the re
argument for random effects), by default these terms will be fixed at zero for the purposes of calculating marginal effects. If you want to change this, you can introduce a varying intercept manually using the alpha
argument.
data(georgia)
georgia$income <- georgia$income / 1e3
fit <- stan_glm(deaths.male ~ offset(log(pop.at.risk.male)) + log(income),
data = georgia,
re = ~ GEOID,
centerx = TRUE,
family = poisson(),
chains = 2, iter = 600) # for speed only
# note: pop.at.risk.male=1 leads to offset of log(pop.at.risk.male)=0
# so that the predicted values are rates
newdata <- data.frame(
income = seq(min(georgia$income),
max(georgia$income),
length.out = 200),
pop.at.risk.male = 1)
preds <- predict(fit, newdata, type = "response")
head(preds)
plot(preds$income,
preds$mean * 10e3,
type = "l",
ylab = "Deaths per 10,000",
xlab = "Income ($1,000s)")
# here the predictions are rates per 10,000
newdata$pop.at.risk.male <- 10e3
preds <- predict(fit, newdata, type = "response")
head(preds)
# plot range
y_lim <- c(min(preds$`2.5%`), max(preds$`97.5%`))
# plot line
plot(preds$income,
preds$mean,
type = "l",
ylab = "Deaths per 10,000",
xlab = "Income ($1,000s)",
ylim = y_lim,
axes = FALSE)
# add shaded cred. interval
x <- c(preds$income, rev(preds$income))
y <- c(preds$`2.5%`, rev(preds$`97.5%`))
polygon(x = x, y = y,
col = rgb(0.1, 0.2, 0.3, 0.3),
border = NA)
# add axes
yat = seq(0, 300, by = 20)
axis(2, at = yat)
xat = seq(0, 200, by = 10)
axis(1, at = xat)
# show county incomes
rug(georgia$income)