library(bbr)
library(bbr.bayes)
library(dplyr)
library(here)
library(tidyverse)
library(yspec)
library(loo)
MODEL_DIR <- here("model", "stan", "expo")
FIGURE_DIR <- here("deliv", "figure", "expo")
Introduction
There are often situations where you want to use your posterior samples as input to subsequent calculations. In some cases, this is easily done in R. In others, it can be convenient to perform the calculations in Stan.
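For example, a simple derived quantity is often easiest to compute directly in R from the posterior draws. Here is a minimal sketch, assuming bbr.bayes’s read_fit_model() helper and a model object like the mod0 used below (whose parameters include tv_log_ec50):
library(posterior)

# Read the fitted CmdStanMCMC object back from a bbr.bayes model
fit <- read_fit_model(mod0)

# Compute a derived quantity in R: EC50 on the natural scale
draws <- as_draws_df(fit$draws())
tv_ec50 <- exp(draws$tv_log_ec50)
quantile(tv_ec50, probs = c(0.05, 0.5, 0.95))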
To save the time associated with re-running the Stan MCMC sampler, cmdstan has implemented a generate_quantities method. bbr.bayes is designed to take advantage of this method, and this chapter provides an example of using it.
For more information about the generate_quantities method, see the Stand-alone Generated Quantities section of the CmdStan User’s Guide and the Generated Quantities section of the Stan User’s Guide.
Tools used
MetrumRG packages
- bbr: Manage, track, and report modeling activities through simple R objects, with a user-friendly interface between R and NONMEM®.
CRAN packages
- dplyr: A grammar of data manipulation.
Set up
Load required packages and set file paths to your model and figure directories.
Model creation and submission
Comparing models using ELPD
In Bayesian analysis, a commonly used metric for comparing models is the expected log predictive density (ELPD). Without a hold-out sample, ELPD is typically calculated using an approximate leave-one-out (LOO) cross-validation estimate.
For nonlinear mixed-effects models, such as the one we’re using here, the log-likelihood is not available in closed form, so we will derive it using a Monte Carlo approximation. Specifically, we want to calculate
\[ \log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) = \log \left( \int f(\mathbf{y}_i ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}) ~f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega}) ~ d\boldsymbol{\eta} \right) \] The Monte Carlo approximation to the integral is derived as:
\[\begin{align*} \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) &= \int \prod_{j=1}^{n_i} f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}) ~f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega}) ~ d\boldsymbol{\eta} \\ & \approx \frac{1}{M} \sum_{m=1}^M \prod_{j=1}^{n_i} f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) ~f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \\ & = \frac{1}{M} \sum_{m=1}^M \exp \left( \sum_{j=1}^{n_i} \log f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) + \log f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \right) \end{align*}\]
Thus, the approximation to \(\log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i)\) is given by
\[\begin{align*} \log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) & \approx \log \left( \sum_{m=1}^M \exp \left( A_i^{(m)} \right) \right) - \log M \\ & = \text{log\_sum\_exp}(A_i) - \log M \end{align*}\]
where \(M\) is the number of Monte Carlo samples, \(\boldsymbol{\eta}^{(m)}\) is a draw from the distribution \(f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega})\), log_sum_exp corresponds to the Stan function of the same name applied to the vector \(A_i = \left( A_i^{(1)}, \ldots, A_i^{(M)} \right)\), and
\[ A_i^{(m)} = \sum_{j=1}^{n_i} \log f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) + \log f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \]
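To make this concrete, here is a minimal R sketch of the approximation for a single subject. Everything in it is illustrative: y_i, mu_fun, theta, omega, and sigma are hypothetical stand-ins for the model-specific pieces, and the max-shifted log_sum_exp() mirrors Stan’s numerically stable built-in.
# Numerically stable log(sum(exp(x))), mirroring Stan's log_sum_exp()
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Monte Carlo approximation of log l_i for one subject (illustrative sketch):
# y_i = subject's observations; mu_fun(theta, eta) = model predictions;
# omega = SDs of the random effects; sigma = residual SD
log_lik_approx_i <- function(y_i, mu_fun, theta, omega, sigma, M = 1000) {
  A <- vapply(seq_len(M), function(m) {
    eta <- rnorm(length(omega), mean = 0, sd = omega)                     # eta^(m) ~ f(eta | Omega)
    sum(dnorm(y_i, mean = mu_fun(theta, eta), sd = sigma, log = TRUE)) +  # sum_j log f(y_ij | ...)
      sum(dnorm(eta, mean = 0, sd = omega, log = TRUE))                   # + log f(eta^(m) | Omega)
  }, numeric(1))
  log_sum_exp(A) - log(M)                                                 # log_sum_exp(A_i) - log M
}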
To calculate the LOO-ELPD for our models, we will take advantage of the ability in cmdstan to run a stand-alone generated quantities model – a model that takes as input the samples from a previous model and implements calculations through the generated quantities block. This can be an efficient approach to performing calculations with previously fitted models.
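For orientation, this is roughly what the underlying cmdstanr workflow looks like. This is a sketch only: bbr.bayes wraps these steps for us, and the file name, fit0, and standata below are placeholders.
library(cmdstanr)

# Compile the GQ-only model (placeholder path)
gq_mod <- cmdstan_model("mod0_gq.stan")

# Evaluate its generated quantities block over the parent model's
# posterior draws; no new sampling happens here
gq_fit <- gq_mod$generate_quantities(
  fitted_params = fit0,  # CmdStanMCMC fit (or posterior draws) from the parent model
  data = standata,       # same data list as the parent model, plus n_sim
  parallel_chains = 4
)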
We’ll demonstrate the process in detail with mod0. Let’s start by re-reading the parent model.
mod0 <- read_model(file.path(MODEL_DIR, 'mod0'))
Now, we’ll use the copy_model_as_stan_gq() function to make a copy of mod0 for use as a generated quantities model.
mod0_gq <- copy_model_as_stan_gq(
  .parent_mod = mod0,
  .new_model = 'mod0_gq',
  .overwrite = TRUE
)
Next, we’ll edit the Stan model file to make the following changes:
- Remove the model block.
- Replace the original generated quantities block (which simulated data from the model for the purpose of model evaluation) with code to approximate the log-likelihood.
- Add a variable denoting the number of Monte Carlo samples (n_sim) to the data block and to the input data.
(Note: removing the model block is not necessary; we have done this for the sake of clarity. If it is left in the code, Stan will ignore it.)
open_stanmod_file(mod0_gq)
The resulting code will look something like this:
data {
  int<lower=0> N;
  int<lower=0> n_id;
  vector[N] DV;
  vector[N] Conc;
  array[N] int ID;
  array[n_id] int start;
  array[n_id] int end;
  int n_sim; // Number of simulations used to approximate log-likelihood
}

parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  vector[n_id] eta_log_ec50;
  vector[n_id] eta_e0;
  real<lower=0> sigma;
  real<lower=0> omega_e0;
  real<lower=0> omega_log_ec50;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  vector[N] mu;

  ec50 = exp(tv_log_ec50 + omega_log_ec50 * eta_log_ec50);
  e0 = tv_e0 + omega_e0 * eta_e0;

  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax - e0[ID[i]]) * pow(Conc[i] / 100, gamma) /
            (pow(ec50[ID[i]] / 100, gamma) + pow(Conc[i] / 100, gamma));
  }
}

generated quantities {
  vector[n_id] log_like_approx;
  vector[n_id] mean_penalty;
  vector[n_id] mean_ll;
  {
    // temporary variables
    matrix[n_id, n_sim] sim_e0;
    matrix[n_id, n_sim] sim_ec50;
    matrix[n_id, n_sim] sim_penalty;
    vector[n_sim] sum_ll;
    matrix[N, n_sim] mutemp;
    matrix[N, n_sim] lltemp;
    real temp_eta_e0;
    real temp_eta_log_ec50;

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        temp_eta_e0 = normal_rng(0, 1);
        temp_eta_log_ec50 = normal_rng(0, 1);
        sim_ec50[i, j] = exp(tv_log_ec50 + omega_log_ec50 * temp_eta_log_ec50);
        sim_e0[i, j] = tv_e0 + omega_e0 * temp_eta_e0;
        // log f(eta^(m) | Omega), computed on the standardized scale
        sim_penalty[i, j] = std_normal_lpdf(temp_eta_log_ec50) +
                            std_normal_lpdf(temp_eta_e0);
        // sim_penalty[i,j] = multi_normal_lpdf([temp_eta_log_ec50,temp_eta_e0] | [0,0], diag_matrix([pow(omega_log_ec50,2),pow(omega_e0,2)]'));
      }
      mean_penalty[i] = mean(sim_penalty[i, ]);
    }

    for (i in 1:N) {
      for (j in 1:n_sim) {
        mutemp[i, j] = sim_e0[ID[i], j] + (emax - sim_e0[ID[i], j]) * pow(Conc[i] / 100, gamma) /
                       (pow(sim_ec50[ID[i], j] / 100, gamma) + pow(Conc[i] / 100, gamma));
        lltemp[i, j] = normal_lpdf(DV[i] | mutemp[i, j], sigma);
      }
    }

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        // A_i^(m): subject's observation log-likelihoods plus the eta penalty
        sum_ll[j] = sum(lltemp[start[i]:end[i], j]) + sim_penalty[i, j];
      }
      mean_ll[i] = mean(lltemp[start[i]:end[i], ]);
      // log_sum_exp(A_i) - log(M)
      log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
    }
  }
}
Next, we need to add n_sim to the data file. For a balance of reproducibility and speed, we’ll set n_sim = 1000.
open_standata_file(mod0_gq)
The resulting data will look something like this:
# Create Stan data
#
# This function must return the list that will be passed to the `data` argument
# of `cmdstanr::sample()`
#
# The `.dir` argument represents the absolute path to the directory containing
# this file. This is useful for building file paths to the input files you will
# load. Note: you _don't_ need to pass anything to this argument, you only use
# it within the function. `bbr` will pass in the correct path when it calls
# `make_standata()` under the hood.
make_standata <- function(.dir) {
  # read in any input data
  in_data <- readr::read_csv(
    file.path(.dir, "..", "..", "..", "..", "data/derived/exdata3.csv"),
    show_col_types = FALSE
  )

  list(
    N = nrow(in_data),
    n_id = length(unique(in_data$ID)),
    DV = in_data$fxa.inh,
    Conc = in_data$cobs,
    ID = in_data$ID,
    # first and last observation row for each subject
    start = in_data %>% mutate(num = 1:n()) %>% group_by(ID) %>% slice_head(n = 1) %>% pull(num),
    end = in_data %>% mutate(num = 1:n()) %>% group_by(ID) %>% slice_tail(n = 1) %>% pull(num),
    n_sim = 1000
  )
}
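Before submitting, it can be worth a quick sanity check that n_sim actually made it into the data list; a sketch, assuming bbr.bayes’s build_data() helper assembles the list from this file:
# Confirm the assembled data list includes n_sim (assumed helper: build_data())
standata0_gq <- build_data(mod0_gq)
standata0_gq$n_sim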
Let’s make sure to run the chains in parallel.
set_stanargs(mod0_gq, .stanargs = list(parallel_chains=4))
Finally, we’ll run the model. Note that this is pulling in the MCMC samples from mod0 and running them through the generated quantities block of model mod0_gq. No additional sampling is happening.
fit0_gq <- submit_model(mod0_gq)
We can pass the output from this fit to the loo function to calculate the approximate expected log predictive density (ELPD).
loo0 <- loo(fit0_gq$draws(variable = 'log_like_approx'))
loo0
.
. Computed from 4000 by 72 log-likelihood matrix
.
. Estimate SE
. elpd_loo -4552.6 25.1
. p_loo 6.4 0.7
. looic 9105.1 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.0.
.
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.
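As a refinement, loo() can also be given relative effective sample sizes, so the PSIS diagnostics account for autocorrelation in the MCMC draws. A sketch using loo::relative_eff(); the simpler call above, without r_eff, also works:
# Optional: supply relative effective sample sizes to loo()
ll_draws <- fit0_gq$draws(variable = 'log_like_approx')
r_eff <- relative_eff(exp(ll_draws))  # relative_eff() expects likelihood values, not log
loo0 <- loo(ll_draws, r_eff = r_eff)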
We’ll run through the same process for models 2 and 4.
ELPD for model 2
mod2 <- read_model(file.path(MODEL_DIR, 'mod2'))
mod2_gq <- copy_model_as_stan_gq(
  .parent_mod = mod2,
  .new_model = 'mod2_gq',
  .overwrite = TRUE
)
Next, we’ll edit the Stan model file to make the following changes:
- Add n_sim to the data block.
- Remove the model block.
- Replace the original generated quantities block (which simulated from the model) with code to approximate the log-likelihood.
open_stanmod_file(mod2_gq)
The resulting code will look something like this:
data {
  int<lower=0> N;
  int<lower=0> n_id;
  vector[N] DV;
  vector[N] Conc;
  array[N] int ID;
  array[n_id] int<lower=1, upper=N> id_start;
  array[n_id] int<lower=1, upper=N> id_end;
  int<lower=0, upper=1> prior_only;
  int n_sim;
}

parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  matrix[2, n_id] etas; // etas[1,i] = log_ec50; etas[2,i] = e0
  real<lower=0> sigma;
  vector<lower=0>[2] omega_sds; // SDs for IIV
  cholesky_factor_corr[2] L_Omega_corr;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  matrix[2, 2] Omega_corr;
  cov_matrix[2] Omega;
  vector[N] mu;

  {
    vector[2] tmp_eta;
    for (i in 1:n_id) {
      tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr) * etas[, i];
      ec50[i] = exp(tv_log_ec50 + tmp_eta[1]);
      e0[i] = tv_e0 + tmp_eta[2];
    }
  }

  Omega_corr = L_Omega_corr * L_Omega_corr';
  Omega = quad_form_diag(Omega_corr, omega_sds);

  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax - e0[ID[i]]) * pow(Conc[i] / 100, gamma) /
            (pow(ec50[ID[i]] / 100, gamma) + pow(Conc[i] / 100, gamma));
  }
}

generated quantities {
  vector[n_id] log_like_approx;
  vector[n_id] mean_penalty;
  vector[n_id] mean_ll;
  {
    // temporary variables
    matrix[n_id, n_sim] sim_e0;
    matrix[n_id, n_sim] sim_ec50;
    matrix[n_id, n_sim] sim_penalty;
    vector[n_sim] sum_ll;
    matrix[N, n_sim] mutemp;
    matrix[N, n_sim] lltemp;
    vector[2] tmp_eta;

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        tmp_eta = multi_normal_rng([0, 0], diag_matrix([1.0, 1.0]'));
        tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr) * tmp_eta;
        sim_ec50[i, j] = exp(tv_log_ec50 + tmp_eta[1]);
        sim_e0[i, j] = tv_e0 + tmp_eta[2];
        // log f(eta^(m) | Omega)
        sim_penalty[i, j] = multi_normal_lpdf(tmp_eta | [0, 0], Omega);
      }
      mean_penalty[i] = mean(sim_penalty[i, ]);
    }

    for (i in 1:N) {
      for (j in 1:n_sim) {
        mutemp[i, j] = sim_e0[ID[i], j] + (emax - sim_e0[ID[i], j]) * pow(Conc[i] / 100, gamma) /
                       (pow(sim_ec50[ID[i], j] / 100, gamma) + pow(Conc[i] / 100, gamma));
        lltemp[i, j] = normal_lpdf(DV[i] | mutemp[i, j], sigma);
      }
    }

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        sum_ll[j] = sum(lltemp[id_start[i]:id_end[i], j]) + sim_penalty[i, j];
      }
      mean_ll[i] = mean(lltemp[id_start[i]:id_end[i], ]);
      log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
    }
  }
}
Next, we’ll make the same edits to the data file (adding n_sim = 1000) and set the Stan arguments to run the chains in parallel.
set_stanargs(mod2_gq, .stanargs = list(parallel_chains=4))
Finally, we’ll run the model.
fit2_gq <- submit_model(mod2_gq)
loo2 <- loo(fit2_gq$draws(variable = 'log_like_approx'))
loo2
.
. Computed from 4000 by 72 log-likelihood matrix
.
. Estimate SE
. elpd_loo -4438.8 25.1
. p_loo 193.6 0.9
. looic 8877.7 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.1.
.
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.
ELPD for model 4
mod4 <- read_model(file.path(MODEL_DIR, 'mod4'))
mod4_gq <- copy_model_as_stan_gq(
  .parent_mod = mod4,
  .new_model = 'mod4_gq',
  .overwrite = TRUE
)
open_stanmod_file(mod4_gq)
The resulting code will look something like this:
data {
  int<lower=0> N;         // Number of observations
  int<lower=0> K;         // Number of covariates
  int<lower=0> n_id;      // Number of individuals
  vector[N] DV;
  vector[N] Conc;
  matrix[n_id, K] X_e0;   // Matrix of covariates (one row per subject) on E0
  matrix[n_id, K] X_ec50; // Matrix of covariates (one row per subject) on EC50
  array[N] int ID;
  array[n_id] int<lower=1, upper=N> id_start;
  array[n_id] int<lower=1, upper=N> id_end;
  int<lower=0, upper=1> prior_only;
  int n_sim;
}

parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  vector[K] beta_e0;
  vector[K] beta_log_ec50;
  matrix[2, n_id] etas; // etas[1,i] = log_ec50; etas[2,i] = e0
  real<lower=0> sigma;
  vector<lower=0>[2] omega_sds; // SDs for IIV
  cholesky_factor_corr[2] L_Omega_corr;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  matrix[2, 2] Omega_corr;
  cov_matrix[2] Omega;
  vector[N] mu;

  {
    vector[2] tmp_eta;
    for (i in 1:n_id) {
      tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr) * etas[, i];
      ec50[i] = exp(tv_log_ec50 + X_ec50[i, ] * beta_log_ec50 + tmp_eta[1]);
      e0[i] = tv_e0 + X_e0[i, ] * beta_e0 + tmp_eta[2];
    }
  }

  Omega_corr = L_Omega_corr * L_Omega_corr';
  Omega = quad_form_diag(Omega_corr, omega_sds);

  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax - e0[ID[i]]) * pow(Conc[i] / 100, gamma) /
            (pow(ec50[ID[i]] / 100, gamma) + pow(Conc[i] / 100, gamma));
  }
}

generated quantities {
  vector[n_id] log_like_approx;
  {
    // temporary variables
    matrix[n_id, n_sim] sim_e0;
    matrix[n_id, n_sim] sim_ec50;
    matrix[n_id, n_sim] sim_penalty;
    vector[n_sim] sum_ll;
    matrix[N, n_sim] mutemp;
    matrix[N, n_sim] lltemp;
    vector[2] tmp_eta;

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        tmp_eta = multi_normal_rng([0, 0], diag_matrix([1.0, 1.0]'));
        tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr) * tmp_eta;
        sim_ec50[i, j] = exp(tv_log_ec50 + X_ec50[i, ] * beta_log_ec50 + tmp_eta[1]);
        sim_e0[i, j] = tv_e0 + X_e0[i, ] * beta_e0 + tmp_eta[2];
        // log f(eta^(m) | Omega)
        sim_penalty[i, j] = multi_normal_lpdf(tmp_eta | [0, 0], Omega);
      }
    }

    for (i in 1:N) {
      for (j in 1:n_sim) {
        mutemp[i, j] = sim_e0[ID[i], j] + (emax - sim_e0[ID[i], j]) * pow(Conc[i] / 100, gamma) /
                       (pow(sim_ec50[ID[i], j] / 100, gamma) + pow(Conc[i] / 100, gamma));
        lltemp[i, j] = normal_lpdf(DV[i] | mutemp[i, j], sigma);
      }
    }

    for (i in 1:n_id) {
      for (j in 1:n_sim) {
        sum_ll[j] = sum(lltemp[id_start[i]:id_end[i], j]) + sim_penalty[i, j];
      }
      log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
    }
  }
}
Next, we make the edits to the Stan data and arguments and run the model.
open_standata_file(mod4_gq)
set_stanargs(mod4_gq, .stanargs = list(parallel_chains=4))
fit4_gq <- submit_model(mod4_gq)
loo4 <- loo(fit4_gq$draws(variable = 'log_like_approx'))
loo4
.
. Computed from 4000 by 72 log-likelihood matrix
.
. Estimate SE
. elpd_loo -4446.8 25.2
. p_loo 162.9 1.2
. looic 8893.7 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.1.
.
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.
Model comparison
Let’s compare the model fits based on ELPD:
loo_compare(loo0, loo2, loo4)
. elpd_diff se_diff
. model2 0.0 0.0
. model3 -8.0 0.5
. model1 -113.7 0.3
Based on this measure of model fit, it looks like models 2 (“model2” in the loo output) and 4 (“model3” in the loo output) are comparable, and both are preferred to model 0 (“model1” in the loo output).
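The generic “model1”/“model2”/“model3” labels reflect the order in which the loo objects were passed. A small sketch for getting self-explanatory row labels instead (the names below are our own, not something loo generates):
# Passing a named list of loo objects yields readable labels in the comparison
loo_compare(list(mod0 = loo0, mod2 = loo2, mod4 = loo4))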
Other resources
The following script from the GitHub repository is discussed on this page. If you’re interested in running this code, visit the About the GitHub Repo page first.
Generated Quantities Models script: generated-quantities.Rmd