library(bbr)
library(bbr.bayes)
library(dplyr)
library(here)
library(tidyverse)
library(yspec)
library(cmdstanr)
set_cmdstan_path(path = here("Torsten", "v0.91.0", "cmdstan"))
<- here("model", "stan")
MODEL_DIR <- here("deliv", "figure") FIGURE_DIR
Introduction
Managing the model development processes in a traceable and reproducible manner can be a significant challenge. At Metrum Research Group (MetrumRG), we use bbr
and bbr.bayes
to streamline this process for fitting Bayesian (Bayes) models. bbr
and bbr.bayes
are R packages developed by MetrumRG that serve three primary purposes:
- Create and submit Stan or NONMEM® models for execution in parallel and/or on a high-performance compute (HPC) cluster (e.g., Metworx)
- Parse Stan or NONMEM® outputs into R objects to facilitate model evaluation and diagnostics in R
- Annotate the model development process for easier and more reliable traceability and reproducibility
This page demonstrates the following bbr
and bbr.bayes
functionality:
- Create an initial Stan/Torsten population pharmacokinetics (popPK) model
- Submit that model
Tools used
MetrumRG packages
bbr Manage, track, and report modeling activities, through simple R objects, with a user-friendly interface between R and NONMEM®.
bbr.bayes Extension of the bbr package to support Bayesian modeling with Stan or NONMEM®.
CRAN packages
dplyr A grammar of data manipulation.
Outline
This page details a range of tasks frequently performed throughout the modeling process such as defining, submitting, and annotating models. Typically, our scientists run this code in a scratch pad style script since it isn’t necessary for the entire coding history for each model to persist. Here, the code used each step of the process is provided as a reference, and a runnable version of this page is available in our GitHub repository: initial-model-submission.Rmd
.
If you’re a new bbr.bayes
user, we recommend you read through the Getting Started vignette before trying to run any code. This vignette includes some setup and configuration requirement (e.g., making sure bbr.bayes
can find your Torsten installation).
Please note, bbr.bayes
doesn’t edit the model structure in your Stan code. This page walks through a sequence of models that might evolve during a simple modeling project. All modifications to the Stan code are made directly by the modeler. Below, we’ve included comments asking you to edit your Stan code manually.
Set up
Load required packages and set file paths to your model and figure directories.
Initial model creation and submission
Create a model object
We begin with a fairly simple popPK model called ppkexpo1:
- linear two compartment model with first order absorption,
- no individual-specific covariates,
- centered random effect parameterization, and
- lognormal residual variation.
To begin modeling, first create a model object using new_model()
.
<- new_model(here(MODEL_DIR, "ppkexpo1"),
ppkexpo1 .model_type = "stan",
.overwrite = TRUE)
This generates an S3 object which you pass to all of the other bbr
functions that submit, summarize, or manipulate models. The first argument (.path
) must be the path to a directory that will contain the necessary files. For Stan models, you have to add .model_type = "stan"
to the new_model()
call. The above new_model()
statement creates a new model folder named ppkexpo1 (unless it already exists) and populates it with templates for the following files:
ppkexpo1.stan
: Stan language modelppkexpo1-standata.R
: function to generate Stan-formatted data listppkexpo1-init.R
: function to generate initial estimatesppkexpo1-stanargs.R
: list ofcmdstan
arguments
A ppkexpo1.yaml
file is also created in your model directory; this file automatically persists any tags, notes, and other model metadata. While it’s useful to know that this .yaml
file exists, you shouldn’t need to interact with it directly (i.e., in the RStudio file tabs). Instead, bbr
has a variety of functions (e.g., add_tags()
and add_notes()
shown below) that work with the model object to edit the information in the .yaml
file from your R console.
The roles of the template files created by new_model()
are likely self-evident from their names, but a bit more information about their inputs and outputs is provided below:
<run>.stan
is a file containing the Stan model code.<run>-standata.R
constructs a function returning a Stan-ready data object which is then passed to theCmdStanModel$sample()
call. You can use thebuild_data()
function to view the data object created by the function.<run>-stanargs.R
constructs a named list with all of the arguments that will be passed to theCmdStanModel$sample()
call. Seeset_stanargs()
for details on modifying.<run>-init.R
is a file which contains all necessary R code to create the initial values passed to theCmdStanModel$sample()
call.
Edit or replace the .stan
, -standata.R
, and -init.R
files
Now, you may manually edit the .stan
, -standata.R
, and -init.R
files by using the bbr.bayes
open_stanmod_file
, open_standata_file
, and open_staninit_file
functions. Alternatively, if you’ve already created any of those files, you may copy then use the add_stan_file
, add_standata_file
, or add_stan_init
functions. This is demonstrated below using files from the “demo” folder.
<- ppkexpo1 %>%
ppkexpo1 add_stanmod_file(here(MODEL_DIR, "demo", "ppkexpo1.stan")) %>%
add_standata_file(here(MODEL_DIR, "demo", "ppkexpo1-standata.R")) %>%
add_staninit_file(here(MODEL_DIR, "demo", "ppkexpo1-init.R"))
Here is the Stan model implementing the two compartmental popPK model with a centered parameterization.
. /*
. Base PPK model
. - Linear 2 compartment
. - Centered parameterization
. - lognormal residual variation
.
. Based on NONMEM PopPK FOCE example project
. */
.
. data{
. int<lower = 1> nId; // number of individuals
. int<lower = 1> nt; // number of events (rows in data set)
. int<lower = 1> nObs; // number of PK observations
. array[nObs] int<lower = 1> iObs; // event indices for PK observations
. int<lower = 1> nBlq; // number of BLQ observations
. array[nBlq] int<lower = 1> iBlq; // event indices for BLQ observations
. array[nt] real<lower = 0> amt, time;
. array[nt] int<lower = 1> cmt;
. array[nt] int<lower = 0> evid;
. array[nId] int<lower = 1> start, end; // indices of first & last observations
. vector<lower = 0>[nObs] cObs;
. real<lower = 0> LOQ;
. }
.
. transformed data{
. int<lower = 1> nRandom = 5, nCmt = 3;
. array[nt] real<lower = 0> rate = rep_array(0.0, nt),
. ii = rep_array(0.0, nt);
. array[nt] int<lower = 0> addl = rep_array(0, nt),
. ss = rep_array(0, nt);
. array[nId, nCmt] real F = rep_array(1.0, nId, nCmt),
. tLag = rep_array(0.0, nId, nCmt);
. }
.
. parameters{
. real<lower = 0> CLHat, QHat, V2Hat, V3Hat;
. // To constrain kaHat > lambda1 uncomment the following
. real<lower = (CLHat / V2Hat + QHat / V2Hat + QHat / V3Hat +
. sqrt((CLHat / V2Hat + QHat / V2Hat + QHat / V3Hat)^2 -
. 4 * CLHat / V2Hat * QHat / V3Hat)) / 2> kaHat; // ka > lambda_1
.
. real<lower = 0> sigma;
. vector<lower = 0>[nRandom] omega;
. corr_matrix[nRandom] rho;
. array[nId] vector[nRandom] theta;
. }
.
. transformed parameters{
. vector[nRandom] thetaHat = log([CLHat, QHat, V2Hat, V3Hat, kaHat]');
. // Individual parameters
. array[nId] real<lower = 0> CL = exp(theta[,1]),
. Q = exp(theta[,2]),
. V2 = exp(theta[,3]),
. V3 = exp(theta[,4]),
. ka = exp(theta[,5]);
.
. // Covariance matrix
. cov_matrix[nRandom] Omega = quad_form_diag(rho, omega);
.
. row_vector[nt] cHat; // predicted concentration
. matrix[nCmt, nt] x; // mass in all compartments
.
. for(j in 1:nId){
. x[, start[j]:end[j]] = pmx_solve_twocpt(time[start[j]:end[j]],
. amt[start[j]:end[j]],
. rate[start[j]:end[j]],
. ii[start[j]:end[j]],
. evid[start[j]:end[j]],
. cmt[start[j]:end[j]],
. addl[start[j]:end[j]],
. ss[start[j]:end[j]],
. {CL[j], Q[j], V2[j], V3[j], ka[j]});
.
. cHat[start[j]:end[j]] = x[2, start[j]:end[j]] / V2[j];
. }
. }
.
. model{
. // priors
. CLHat ~ normal(0, 10);
. QHat ~ normal(0, 10);
. V2Hat ~ normal(0, 50);
. V3Hat ~ normal(0, 100);
. kaHat ~ normal(0, 3);
. sigma ~ normal(0, 0.5);
. omega ~ normal(0, 0.5);
. rho ~ lkj_corr(2);
.
. // interindividual variability
. theta ~ multi_normal(thetaHat, Omega);
.
. // likelihood
. cObs ~ lognormal(log(cHat[iObs]), sigma); // observed data likelihood
. target += lognormal_lcdf(LOQ | log(cHat[iBlq]), sigma); // BLQ data likelihood
.
. }
.
. generated quantities{
. // Individual parameters for hypothetical new individuals
. array[nId] vector[nRandom] thetaNew =
. multi_normal_rng(rep_array(thetaHat, nId), Omega);
. array[nId] real<lower = 0> CLNew = exp(thetaNew[,1]),
. QNew = exp(thetaNew[,2]),
. V2New = exp(thetaNew[,3]),
. V3New = exp(thetaNew[,4]),
. kaNew = exp(thetaNew[,5]);
.
. row_vector[nt] cHatNew; // predicted concentration in new individuals
. matrix[nCmt, nt] xNew; // mass in all compartments in new individuals
. array[nt] real cPred, cPredNew;
.
. vector[nObs + nBlq] log_lik;
.
. // Simulations for posterior predictive checks
. for(j in 1:nId){
. xNew[, start[j]:end[j]] = pmx_solve_twocpt(time[start[j]:end[j]],
. amt[start[j]:end[j]],
. rate[start[j]:end[j]],
. ii[start[j]:end[j]],
. evid[start[j]:end[j]],
. cmt[start[j]:end[j]],
. addl[start[j]:end[j]],
. ss[start[j]:end[j]],
. {CLNew[j], QNew[j], V2New[j],
. V3New[j], kaNew[j]});
.
. cHatNew[start[j]:end[j]] = xNew[2, start[j]:end[j]] / V2New[j];
.
. cPred[start[j]] = 0.0;
. cPredNew[start[j]] = 0.0;
. // new observations in existing individuals
. cPred[(start[j] + 1):end[j]] = lognormal_rng(log(cHat[(start[j] + 1):end[j]]),
. sigma);
. // new observations in new individuals
. cPredNew[(start[j] + 1):end[j]] = lognormal_rng(log(cHatNew[(start[j] + 1):end[j]]),
. sigma);
. }
.
. // Calculate log-likelihood values to use for LOO CV
.
. // Observation-level log-likelihood
. for(i in 1:nObs)
. log_lik[i] = lognormal_lpdf(cObs[i] | log(cHat[iObs[i]]), sigma);
. for(i in 1:nBlq)
. log_lik[i + nObs] = lognormal_lcdf(LOQ | log(cHat[iBlq[i]]), sigma);
.
. }
The ppkexpo1-standata.R
file must create a function named make_standata()
which takes the name of the directory the file is located as input and returns a list suitable for use with Stan.
. # Create Stan data
. #
. # This function must return the list that will be passed to `data` argument
. # of `cmdstanr::sample()`
. #
. # The `.dir` argument represents the absolute path to the directory containing
. # this file. This is useful for building file paths to the input files you will
. # load. Note: you _don't_ need to pass anything to this argument, you only use
. # it within the function. `bbr` will pass in the correct path when it calls
. # `make_standata()` under the hood.
. make_standata <- function(.dir) {
. # read in any input data
. data1 <- readr::read_csv(file.path(.dir, '..', '..', '..', "data", "derived",
. "pk.csv"), na = ".") %>%
. mutate(AMT = if_else(is.na(AMT), 0, AMT))
.
. ## Write data1 to file in model directory
. save(data1, file = file.path(.dir, "data1.RData"))
.
. nt <- nrow(data1)
. start <- (1:nt)[!duplicated(data1$ID)]
. end <- c(start[-1] - 1, nt)
. nId <- length(unique(data1$ID))
.
. ## Indices of records containing observed data
. iObs <- with(data1, (1:nrow(data1))[BLQ == 0 & EVID == 0])
. nObs <- length(iObs)
. ## Indices of records containing BQL concentrations
. iBlq <- with(data1, (1:nrow(data1))[BLQ == 1 & EVID == 0])
. nBlq <- length(iBlq)
.
. stan_data <- with(data1,
. list(nId = nId,
. nt = nt,
. nObs = nObs,
. iObs = iObs,
. nBlq = nBlq,
. iBlq = iBlq,
. amt = 1000 * AMT,
. cmt = CMT,
. evid = EVID,
. time = TIME,
. start = start,
. end = end,
. cObs = DV[iObs],
. LOQ = 10
. ))
. return(stan_data)
. }
The ppkexpo1-init.R
file defines how we want the initial values generated for each Markov Chain Monte Carlo (MCMC) chain. Here’s how we constructed it for the ppkexpo1
model.
. # Create Stan initial values
. #
. # This function must return something that can be passed to the `init` argument
. # of `cmdstanr::sample()`. There are several options; see `?cmdstanr::sample`
. # for details.
. #
. # `.data` represents the list returned from `make_standata()` for this model.
. # This is provided in case any of your initial values are dependent on some
. # aspect of the data (e.g. the number of rows).
. #
. # `.args` represents the list of attached arguments that will be passed through to
. # cmdstanr::sample(). This is provided in case any of your initial values are
. # dependent on any of these arguments (e.g. the number of chains).
. #
. # Note: you _don't_ need to pass anything to either of these arguments, you only
. # use it within the function. `bbr` will pass in the correct objects when it calls
. # `make_init()` under the hood.
. #
. make_init <- function(.data, .args) {
. ### create initial estimates
. gen1init <- function(.data){
. nRandom <- 5
. nId = .data$nId
. list(
. CLHat = rlnorm(1, log(5), 0.5),
. QHat = rlnorm(1, log(5), 0.5),
. V2Hat = rlnorm(1, log(50), 0.5),
. V3Hat = rlnorm(1, log(100), 0.5),
. kaHat = rlnorm(1, log(2), 0.5),
. sigma = rlnorm(1, log(0.25), 0.5),
. omega = rlnorm(nRandom, log(0.25), 0.5),
. rho = diag(nRandom),
. ## theta = matrix(rep(log(c(5, 5, 50, 100, 2)), nId), nrow = nRandom)
. theta = matrix(rep(log(c(5, 5, 50, 100, 2)), ea = nId), ncol = nRandom)
. )
. }
. inits <- NULL
. for (n in 1:.args$chains)
. {
. inits[[n]] <- gen1init(.data)
. }
. return(inits)
. }
Lastly, we set the arguments for running the sampler. The only required argument is a random seed (for reproducibility). Other arguments, such as the number of cores, chains, warm-up iterations, adapt_delta parameter, etc., are optional. See the help file for cmdstanr::sample
for a list of possible options. This populates the contents of the ppkexpo1-stanargs.R
file.
<- ppkexpo1 %>%
ppkexpo1 set_stanargs(list(iter_warmup = 500,
iter_sampling = 500,
thin = 1,
chains = 4,
parallel_chains = 4,
seed = 1234,
save_warmup = FALSE),
.clear = TRUE)
Here is the resulting ppkexpo1-stanargs.R
file.
. list(
. chains = 4, iter_sampling = 500, iter_warmup = 500, parallel_chains = 4,
. save_warmup = FALSE, seed = 1234, thin = 1
. )
Annotating your model
bbr
has some great features that allow you to easily annotate your models. This helps you document your modeling process as you go and can be easily retrieved later for creating run logs that describe the entire analysis (shown later).
A well-annotated model development process does several important things:
- Makes it easy to update collaborators on the state of the project.
- Supports reproducibility and traceability of your results.
- Keeps the project organized in case you need to return to it at a later date.
The bbr
model object contains three annotation fields. The “description” is a character scalar (i.e., a single string), while “notes” and “tags” can contain character vectors of any length. These fields can be used for any purpose; however, we recommend the patterns below which have worked for us. These three fields can also be added to the model object at any time (before or after submitting the model).
Using description
Descriptions are typically brief; for example, “Base model” or “First covariate model”. The “description” field defaults to NULL if no description is provided. One benefit of this approach is that you can easily filter your run logs to the notable models using dplyr::filter(!is.null(description))
.
We’ll add a description denoting ppkexpo1 as the base popPK model.
<- ppkexpo1 %>%
ppkexpo1 add_description("Base popPK model")
Using notes
Notes are often used for free text observations about a particular model. Some modelers leverage notes as official annotations which might get pulled into a run log for a final report of some kind, while other modelers prefer to keep them informal and use them primarily for their own reference.
Submit the model
Now that you have created your model object, you can submit it to be run with submit_model()
.
## Check that the necessary files are present
check_stan_model(ppkexpo1)
## Fit the model using Stan
<- ppkexpo1 %>% submit_model(.overwrite = TRUE) ppkexpo1_fit
The submit_model
function will compile the Stan model and run the MCMC sampler, return a CmdStanMCMC object, and write the MCMC samples to the <run>/<run>-output
directory as .csv
files (one per chain). Additionally, submit_model
saves an .RDS
file containing all of the CmdStanMCMC object elements except the samples.
If the model has already been fitted, you can read it using the following statements without having to refit the model.
<- read_model(here(MODEL_DIR, "ppkexpo1"))
ppkexpo1 <- read_fit_model(ppkexpo1) ppkexpo1_fit
We can look at some simple summaries working directly with the CmdStanMCMC object. For example, the cmdstan_diagnose
slot provides a quick assessment of model diagnostics,
$cmdstan_diagnose() ppkexpo1_fit
. Processing csv files: /data/expo4-torsten-poppk/model/stan/ppkexpo1/ppkexpo1-output/ppkexpo1-202409271413-1-0c80d9.csv, /data/expo4-torsten-poppk/model/stan/ppkexpo1/ppkexpo1-output/ppkexpo1-202409271413-2-0c80d9.csv, /data/expo4-torsten-poppk/model/stan/ppkexpo1/ppkexpo1-output/ppkexpo1-202409271413-3-0c80d9.csv, /data/expo4-torsten-poppk/model/stan/ppkexpo1/ppkexpo1-output/ppkexpo1-202409271413-4-0c80d9.csv
.
. Checking sampler transitions treedepth.
. Treedepth satisfactory for all transitions.
.
. Checking sampler transitions for divergences.
. No divergent transitions found.
.
. Checking E-BFMI - sampler transitions HMC potential energy.
. The E-BFMI, 0.09, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
. If possible, try to reparameterize the model.
.
. Effective sample size satisfactory.
.
. The following parameters had split R-hat greater than 1.05:
. omega[2], omega[4], theta[138,2], Q[138], Omega[2,2], Omega[4,2], Omega[2,4], Omega[4,4]
. Such high values indicate incomplete mixing and biased estimation.
. You should consider regularizating your model with additional prior information or a more effective parameterization.
.
. Processing complete.
and the summary
slot provides a quick summary of the parameter distributions, split Rhat and effective sample sizes:
$summary(variables=c("CLHat", "QHat", "V2Hat", "V3Hat",
ppkexpo1_fit"kaHat", "omega", "sigma")) %>%
mutate(across(-variable, pmtables::sig))
. # A tibble: 11 × 10
. variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
. <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
. 1 CLHat 3.11 3.11 0.113 0.109 2.93 3.30 1.00 2.46e+03 1.57e+03
. 2 QHat 3.80 3.80 0.178 0.187 3.52 4.09 1.02 118 192
. 3 V2Hat 60.9 60.8 2.23 2.24 57.3 64.6 1.00 278 494
. 4 V3Hat 67.7 67.7 1.87 1.90 64.7 70.8 1.00 334 990
. 5 kaHat 1.61 1.60 0.117 0.116 1.43 1.82 1.00 340 778
. 6 omega[1] 0.436 0.435 0.0250 0.0254 0.397 0.479 1.00 2.38e+03 1.62e+03
. 7 omega[2] 0.131 0.123 0.0609 0.0657 0.0406 0.234 1.17 18.8 42.1
. 8 omega[3] 0.377 0.376 0.0266 0.0283 0.335 0.421 1.01 412 1.20e+03
. 9 omega[4] 0.182 0.182 0.0291 0.0288 0.135 0.231 1.05 66.6 413
. 10 omega[5] 0.545 0.543 0.0709 0.0713 0.433 0.667 1.01 183 470
. 11 sigma 0.214 0.214 0.00304 0.00297 0.208 0.219 1.00 2.00e+03 1.60e+03
Finally, add a note documenting our initial assessment of MCMC convergence.
<- ppkexpo1 %>%
ppkexpo1 add_notes("Initial assessment of convergence based on R-hat looks good except for some elements of Omega.")
Other resources
The following script from the GitHub repository is discussed on this page. If you’re interested running this code, visit the About the GitHub Repo page first.
Initial Model Submission script: initial-model-submission.Rmd