Generated Quantities Models

Example generated quantities model workflow with bbr.bayes.

bbr
model management

Introduction

There are often situations where you want to use your posterior samples as input to subsequent calculations. In some cases, this is easily done in R. In others, it can be convenient to perform the calculations in Stan.

To save the time associated with re-running the Stan MCMC sampler, cmdstan has implemented a generate_quantities method. bbr.bayes is designed to take advantage of this method, and this chapter provides an example of using it.

For more information about the generate_quantities method, see the sections describing stand-alone generated quantities models in the quick-start portion of the Stand-alone Generated Quantities section of the CmdStan User’s Guide and the Generated Quantities section of the Stan User’s Guide.

Tools used


MetrumRG packages

bbr Manage, track, and report modeling activities, through simple R objects, with a user-friendly interface between R and NONMEM®.

CRAN packages

dplyr A grammar of data manipulation.

Set up


Load required packages and set file paths to your model and figure directories.

library(bbr)
library(bbr.bayes)
library(dplyr)
library(here)
library(tidyverse)
library(yspec)
library(loo)

MODEL_DIR <- here("model", "stan","expo")
FIGURE_DIR <- here("deliv", "figure","expo")

Model creation and submission

Comparing models using ELPD

In Bayesian analysis, a commonly used metric for comparing models is the expected log predictive density (ELPD). Without a hold-out sample, ELPD is typically calculated using an approximate leave one out (LOO) cross-validation estimate . To calculate LOO-ELPD, we need to derive the log-likelihood contribution for each independent unit. For our analysis, this corresponds to the likelihood at the subject level.

For non-linear mixed effect models, such as the one we’re using here, the log-likelihood is not available in a closed-form solution. So, we will derive it using a Monte Carlo approximation. Specifically, we want to calculate

\[ \log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) = \log \left( \int f(\mathbf{y}_i ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}) ~f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega}) ~ d\boldsymbol{\eta} \right) \] The Monte Carlo approximation to the integral is derived as:

\[\begin{align*} \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) &= \int \prod_{j=1}^{n_i} f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}) ~f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega}) ~ d\boldsymbol{\eta} \\ & \approx \frac{1}{M} \sum_{m=1}^M \prod_{j=1}^{n_i} f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) ~f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \\ & = \frac{1}{M} \sum_{m=1}^M \exp \left( \sum_{j=1}^{n_i} \log f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) + \log f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \right) \end{align*}\]

Thus, the approximation to \(\log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i)\) is given by

\[\begin{align*} \log \ell_i(\boldsymbol{\theta},\boldsymbol{\Omega} ~|~ \mathbf{y}_i) & \approx \log(\sum_{m=1}^M \exp(A_i)) - \log M \\ & = \text{log\_sum\_exp}(A_i) - \log M \end{align*}\]

where \(M\) is the number of Monte Carlo samples, \(\eta^{(m)}\) is a draw from the distribution \(f(\boldsymbol{\eta} ~|~ \boldsymbol{\Omega})\), log_sum_exp corresponding to the Stan function of the same name and

\[ A_i = \sum_{j=1}^{n_i} \log f(y_{ij} ~|~ \boldsymbol{\theta}, \boldsymbol{\Omega}, \boldsymbol{\eta}^{(m)}) + \log f(\boldsymbol{\eta}^{(m)} ~|~ \boldsymbol{\Omega}) \]

To calculate the LOO-ELPD for our models, we will take advantage of the ability in cmdstan to run a stand-alone generate quantities model – a model which takes as input the samples from a previous model and implements calculations through the generated quantities block. This can be an efficient approach to performing calculations with previously fitted models.

We’ll demonstrate the process in detail with mod0. Let’s start by re-reading the parent model.

mod0 <- read_model(file.path(MODEL_DIR,'mod0'))

Now, we’ll use the copy_model_as_stan_gq() function to make a copy of mod0 but for use as a generated_quantities model.

Use caution when using the .overwrite= TRUE argument in new_model and copy_model_from. It will overwrite existing files and output directory, if they exist.

mod0_gq <- copy_model_as_stan_gq(.parent_mod = mod0, 
                                 .new_model = 'mod0_gq', 
                                 .overwrite = TRUE)

Next, we’ll edit the Stan model file to make the following edits:

  • Remove the model block
  • Replace the original generated quantities block (which simulated data from the model for the purpose of model evaluation) with code to approximate the log-likelihood.
  • Add a variable denoting the number of Monte Carlo samples (n_sim) to the data block and to the input data

(Note: removing the model block is not necessary; we have done this for the sake of clarity. If it is left in the code, Stan will ignore it.)

open_stanmod_file(mod0_gq)

The resulting code will look something like this:


data {
  int<lower=0> N ;
  int<lower=0> n_id ;
  vector[N] DV;
  vector[N] Conc;
  array[N] int ID;
  array[n_id] int start;
  array[n_id] int end;
  
  int n_sim; // Number of simulations used to approximate log-likelihood
}

parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  
  vector[n_id] eta_log_ec50;
  vector[n_id] eta_e0;
  
  real<lower=0> sigma;
  real<lower=0> omega_e0;
  real<lower=0> omega_log_ec50;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  vector[N] mu;
  
  ec50 = exp(tv_log_ec50 + omega_log_ec50 * eta_log_ec50);
  e0 = tv_e0 + omega_e0 * eta_e0;
  
  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax-e0[ID[i]])*pow(Conc[i]/100,gamma) / (pow(ec50[ID[i]]/100,gamma) + pow(Conc[i]/100,gamma));
  }
  
}


generated quantities {

  vector[n_id] log_like_approx;
  vector[n_id] mean_penalty;
  vector[n_id] mean_ll;

      {
        // temporary variables
        matrix[n_id,n_sim] sim_e0;
        matrix[n_id, n_sim] sim_ec50;
        
        matrix[n_id,n_sim] sim_penalty;
        vector[n_sim] sum_ll;
        matrix[N,n_sim] mutemp;
        matrix[N,n_sim] lltemp;
        real temp_eta_e0;
        real temp_eta_log_ec50;
        
        for (i in 1:n_id) {
          for (j in 1:n_sim) {
            temp_eta_e0 = normal_rng(0,1);
            temp_eta_log_ec50 = normal_rng(0,1);
            sim_ec50[i,j] = exp(tv_log_ec50  + omega_log_ec50 * temp_eta_log_ec50);
            sim_e0[i,j] = tv_e0 + omega_e0 * temp_eta_e0;
            
           sim_penalty[i,j] = std_normal_lpdf(temp_eta_log_ec50) +
                              std_normal_lpdf(temp_eta_e0);
            // sim_penalty[i,j] = multi_normal_lpdf([temp_eta_log_ec50,temp_eta_e0] | [0,0], diag_matrix([pow(omega_log_ec50,2),pow(omega_e0,2)]'));
          }
          mean_penalty[i] = mean(sim_penalty[i,]);
        }
        
        
        for (i in 1:N) {
          for (j in 1:n_sim) {
            mutemp[i,j] = sim_e0[ID[i],j] + (emax-sim_e0[ID[i],j])*pow(Conc[i]/100,gamma) / (pow(sim_ec50[ID[i],j]/100,gamma) + pow(Conc[i]/100,gamma));
            lltemp[i,j] = normal_lpdf(DV[i] | mutemp[i,j], sigma) ;
          }
        }
        
        for (i in 1:n_id) {
          for (j in 1:n_sim) {
            sum_ll[j] = sum(lltemp[start[i]:end[i],j]) + sim_penalty[i,j];
          }
          mean_ll[i] = mean(lltemp[start[i]:end[i],]);
          log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
        }
      }
    
}

Next, we need to add n_sim to the data file. For a balance of reproducibility and speed, we’ll set n_sim=1000.

open_standata_file(mod0_gq)

The resulting data will look something like this:

# Create Stan data
#
# This function must return the list that will be passed to `data` argument
#   of `cmdstanr::sample()`
#
# The `.dir` argument represents the absolute path to the directory containing
#   this file. This is useful for building file paths to the input files you will
#   load. Note: you _don't_ need to pass anything to this argument, you only use
#   it within the function. `bbr` will pass in the correct path when it calls
#   `make_standata()` under the hood.
make_standata <- function(.dir) {
  # read in any input data
  in_data <- readr::read_csv(
    file.path(.dir, "..", "..", "..", "..", "data/derived/exdata3.csv"),
    show_col_types = FALSE
  )
  
  list(
    N = nrow(in_data),
    n_id = length(unique(in_data$ID)),
    DV = in_data$fxa.inh,
    Conc = in_data$cobs,
    ID = in_data$ID,
    start = in_data %>% mutate(num=1:n()) %>% group_by(ID) %>% slice_head(n=1) %>% pull(num),
    end = in_data %>% mutate(num=1:n()) %>% group_by(ID) %>% slice_tail(n=1) %>% pull(num),
    n_sim = 1000
  )
}

Let’s make sure to run the chains in parallel.

set_stanargs(mod0_gq,  .stanargs = list(parallel_chains=4))

Finally, we’ll run the model. Note that this is pulling in the MCMC samples from mod0 and running them through the generated quantities block of model mod0_gq. No additional sampling is happening.

fit0_gq <- submit_model(mod0_gq)

We can pass the output from this fit to the loo function to calculate the approximate expected log predictive density (ELPD).

loo0 <- loo(fit0_gq$draws(variable='log_like_approx'))

loo0
. 
. Computed from 4000 by 72 log-likelihood matrix
. 
.          Estimate   SE
. elpd_loo  -4552.6 25.1
. p_loo         6.4  0.7
. looic      9105.1 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.0.
. 
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.

We’ll run through the same process for models 2 and 4.

ELPD for model 2

mod2 <- read_model(file.path(MODEL_DIR,'mod2'))

Use caution when using the .overwrite= TRUE argument in new_model and copy_model_from. It will overwrite existing files and output directory, if they exist.

mod2_gq <- copy_model_as_stan_gq(.parent_mod = mod2, 
                                 .new_model = 'mod2_gq',
                                 .overwrite = TRUE)

Next, we’ll edit the Stan model file to make the following edits:

  • Add n_sim to the data block
  • Remove the model block
  • Replace the original generated quantities block (which simulated from the model) with code to approximate the log-likelihood.
open_stanmod_file(mod2_gq)

The resulting code will look something like this:


data {
  int<lower=0> N ;
  int<lower=0> n_id ;
  vector[N] DV;
  vector[N] Conc;
  array[N] int ID;
  array[n_id] int<lower=1, upper=N> id_start;
  array[n_id] int<lower=1, upper=N> id_end;
  
  int<lower=0,upper=1> prior_only;
  int n_sim;
}


parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  
  matrix[2,n_id] etas; // etas[i,1] = log_ec50; etas[i,2] = e0;

  real<lower=0> sigma;
  vector<lower=0>[2] omega_sds; // SDs for IIV
  cholesky_factor_corr[2] L_Omega_corr;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  matrix[2,2] Omega_corr;
  cov_matrix[2] Omega;
  vector[N] mu;
  

  {
    vector[2] tmp_eta;
    
    for (i in 1:n_id) {
      tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr)*etas[,i];
      ec50[i] = exp(tv_log_ec50 + tmp_eta[1]);
      e0[i] = tv_e0 + tmp_eta[2];
    }
  }
  
  Omega_corr = L_Omega_corr * L_Omega_corr';
  Omega = quad_form_diag(Omega_corr, omega_sds);
  
  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax-e0[ID[i]])*pow(Conc[i]/100,gamma) / (pow(ec50[ID[i]]/100,gamma) + pow(Conc[i]/100,gamma));
  }
}


generated quantities {
  vector[n_id] log_like_approx;
  vector[n_id] mean_penalty;
  vector[n_id] mean_ll;

      {
        // temporary variables
          matrix[n_id,n_sim] sim_e0;
          matrix[n_id, n_sim] sim_ec50;
          
          matrix[n_id,n_sim] sim_penalty;
          vector[n_sim] sum_ll;
          matrix[N,n_sim] mutemp;
          matrix[N,n_sim] lltemp;
          // vector[n_id] sum_ll;
          vector[2] tmp_eta;
          
          for (i in 1:n_id) {
            for (j in 1:n_sim) {
              tmp_eta = multi_normal_rng([0,0], diag_matrix([1.0,1.0]'));
              
              tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr)*tmp_eta;
              sim_ec50[i,j] = exp(tv_log_ec50 + tmp_eta[1]);
              sim_e0[i,j] = tv_e0 + tmp_eta[2];
              
              sim_penalty[i,j] = multi_normal_lpdf(tmp_eta | [0,0],Omega);
            }
            mean_penalty[i] = mean(sim_penalty[i,]);
          }
          
          
          for (i in 1:N) {
            for (j in 1:n_sim) {
              mutemp[i,j] = sim_e0[ID[i],j] + (emax-sim_e0[ID[i],j])*pow(Conc[i]/100,gamma) / (pow(sim_ec50[ID[i],j]/100,gamma) + pow(Conc[i]/100,gamma));
              lltemp[i,j] = normal_lpdf(DV[i] | mutemp[i,j], sigma) ;
            }
          }
          
          for (i in 1:n_id) {
            for (j in 1:n_sim) {
              sum_ll[j] = sum(lltemp[id_start[i]:id_end[i],j]) + sim_penalty[i,j];
            }
            mean_ll[i] = mean(lltemp[id_start[i]:id_end[i],]);
            log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
          }
      }
      
}

Next, we’ll make the same edits to the data file (add n_sim=1000) and set the Stan arguments to run the chains in parallel.

set_stanargs(mod2_gq,  .stanargs = list(parallel_chains=4))

Finally, we’ll run the model.

fit2_gq <- submit_model(mod2_gq)
loo2 <- loo(fit2_gq$draws(variable='log_like_approx'))

loo2
. 
. Computed from 4000 by 72 log-likelihood matrix
. 
.          Estimate   SE
. elpd_loo  -4438.8 25.1
. p_loo       193.6  0.9
. looic      8877.7 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.1.
. 
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.

ELPD for model 4

mod4 <- read_model(file.path(MODEL_DIR,'mod4'))

Use caution when using the .overwrite= TRUE argument in new_model and copy_model_from. It will overwrite existing files and output directory, if they exist.

mod4_gq <- copy_model_as_stan_gq(.parent_mod = mod4, 
                                 .new_model = 'mod4_gq', 
                                 .overwrite=TRUE)
open_stanmod_file(mod4_gq)

The resulting code will look something like this:


data {
  int<lower=0> N ; // Number of observations
  int<lower=0> K;  // Number of covariates
  int<lower=0> n_id ; // Number of individuals
  vector[N] DV;
  vector[N] Conc;
  matrix[n_id, K] X_e0;  // Matrix of covariates (one row per subject) on E0
  matrix[n_id, K] X_ec50;  // Matrix of covariates (one row per subject) on EC50
  array[N] int ID;
  array[n_id] int<lower=1, upper=N> id_start;
  array[n_id] int<lower=1, upper=N> id_end;
  
  int<lower=0,upper=1> prior_only;
  int n_sim;
}


parameters {
  real<lower=0, upper=100> emax;
  real tv_log_ec50;
  real tv_e0;
  real log_gamma;
  vector[K] beta_e0;
  vector[K] beta_log_ec50;

  matrix[2,n_id] etas; // etas[i,1] = log_ec50; etas[i,2] = e0;

  real<lower=0> sigma;
  vector<lower=0>[2] omega_sds; // SDs for IIV
  cholesky_factor_corr[2] L_Omega_corr;
}

transformed parameters {
  real tv_ec50 = exp(tv_log_ec50);
  real gamma = exp(log_gamma);
  vector[n_id] ec50;
  vector[n_id] e0;
  matrix[2,2] Omega_corr;
  cov_matrix[2] Omega;
  vector[N] mu;
  

  {
    vector[2] tmp_eta;
    
    for (i in 1:n_id) {
      tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr)*etas[,i];
      ec50[i] = exp(tv_log_ec50 + X_ec50[i,] * beta_log_ec50 + tmp_eta[1]);
      e0[i] = tv_e0 + X_e0[i,] * beta_e0 + tmp_eta[2];
    }
  }
  
  Omega_corr = L_Omega_corr * L_Omega_corr';
  Omega = quad_form_diag(Omega_corr, omega_sds);
  
  for (i in 1:N) {
    mu[i] = e0[ID[i]] + (emax-e0[ID[i]])*pow(Conc[i]/100,gamma) / (pow(ec50[ID[i]]/100,gamma) + pow(Conc[i]/100,gamma));
  }
}


generated quantities {
  vector[n_id] log_like_approx;
  
      {
        // temporary variables
          matrix[n_id,n_sim] sim_e0;
          matrix[n_id, n_sim] sim_ec50;
          
          matrix[n_id,n_sim] sim_penalty;
          vector[n_sim] sum_ll;
          matrix[N,n_sim] mutemp;
          matrix[N,n_sim] lltemp;
          // vector[n_id] sum_ll;
          vector[2] tmp_eta;
          
          for (i in 1:n_id) {
            for (j in 1:n_sim) {
              tmp_eta = multi_normal_rng([0,0], diag_matrix([1.0,1.0]'));
              
              tmp_eta = diag_pre_multiply(omega_sds, L_Omega_corr)*tmp_eta;
              sim_ec50[i,j] = exp(tv_log_ec50 + X_ec50[i,] * beta_log_ec50 + tmp_eta[1]);
              sim_e0[i,j] = tv_e0 + X_e0[i,] * beta_e0 + tmp_eta[2];
              
              sim_penalty[i,j] = multi_normal_lpdf(tmp_eta | [0,0],Omega);
            }
          }
          
          
          for (i in 1:N) {
            for (j in 1:n_sim) {
              mutemp[i,j] = sim_e0[ID[i],j] + (emax-sim_e0[ID[i],j])*pow(Conc[i]/100,gamma) / (pow(sim_ec50[ID[i],j]/100,gamma) + pow(Conc[i]/100,gamma));
              lltemp[i,j] = normal_lpdf(DV[i] | mutemp[i,j], sigma) ;
            }
          }
          
          for (i in 1:n_id) {
            for (j in 1:n_sim) {
              sum_ll[j] = sum(lltemp[id_start[i]:id_end[i],j]) + sim_penalty[i,j];
            }
            log_like_approx[i] = log_sum_exp(sum_ll) - log(n_sim);
          }
      }
      
}

Next, we make the edits to the Stan data and arguments and run the model.

open_standata_file(mod4_gq)
set_stanargs(mod4_gq,  .stanargs = list(parallel_chains=4))
fit4_gq <- submit_model(mod4_gq)
loo4 <- loo(fit4_gq$draws(variable='log_like_approx'))

loo4
. 
. Computed from 4000 by 72 log-likelihood matrix
. 
.          Estimate   SE
. elpd_loo  -4446.8 25.2
. p_loo       162.9  1.2
. looic      8893.7 50.3
. ------
. Monte Carlo SE of elpd_loo is 0.1.
. 
. All Pareto k estimates are good (k < 0.5).
. See help('pareto-k-diagnostic') for details.

Model comparison

Let’s compare the model fits based on ELPD:

loo_compare(loo0, loo2, loo4)
.        elpd_diff se_diff
. model2    0.0       0.0 
. model3   -8.0       0.5 
. model1 -113.7       0.3

Based on this measure of model fit, it looks like models 2 (“model2” in the loo output) and 4 (model3 in the loo output) are comparable and both are preferred to model 0 (model1 in the loo output).

Other resources


The following script from the GitHub repository is discussed on this page. If you’re interested running this code, visit the About the GitHub Repo page first.

Generated Quantities Models script: generated-quantities.Rmd