Reproducibility and Traceability

A note about how bbr.bayes enables reproducible and traceable research.

Reproducibility

We consider modeling results to be reproducible if running a model repeatedly with the same data, software, and hardware yields the same results. Modeling with Markov chain Monte Carlo (MCMC) methods is inherently stochastic, so exact reproducibility is difficult. However, the Stan developers have built a system that enables reproducible research under certain conditions (see the Stan documentation on reproducibility). bbr.bayes makes it easier to specify the model in a reproducible way.
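As a language-agnostic illustration of why a fixed seed matters for a stochastic sampler, here is a minimal sketch in Python. It is not Stan or bbr.bayes code: the `metropolis` function, its arguments, and the standard normal target are all hypothetical, chosen only to show that the same seed on the same software gives identical draws.

```python
import math
import random


def metropolis(seed, n_draws=1000, step=0.5):
    """Toy random-walk Metropolis sampler targeting a standard normal."""
    # A private generator seeded explicitly: no hidden global random state.
    rng = random.Random(seed)
    x = 0.0
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        # Log acceptance ratio for a standard normal target density.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        draws.append(x)
    return draws


same_seed_matches = metropolis(1234) == metropolis(1234)
different_seed_matches = metropolis(1234) == metropolis(4321)
```

In this sketch, rerunning with the same seed reproduces every draw exactly, while a different seed gives a different chain. Real samplers add further conditions (software versions, hardware, number of threads), which is why the text above says "under certain conditions".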

Aspects beyond the model specification (e.g., R package versions and hardware environment) are also important to control and track for reproducibility purposes, but these are considered outside the scope of bbr.bayes. There are many possible approaches to this. At Metrum Research Group (MetrumRG), we use our Metworx platform as a validated, stable, high-performance computing environment, and MPN and pkgr to manage R packages.

bbr.bayes does the following to promote the generation of reproducible posterior samples:

  1. Requires that a random seed be specified in order to run a model.
  2. Requires that the model definition, cmdstanr method arguments, data preparation code, and initial values specification be tracked in dedicated files on disk.
  3. Provides special handling to ensure reproducibility when the user specifies a function to generate the initial values, because passing a function as is to CmdStanModel$sample() can compromise reproducibility.
  4. Records the hashes of these inputs. (The check_up_to_date helper can be used to determine if any of these files have changed since the last run of the model.)
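The idea behind recording input hashes (item 4) can be sketched as follows. This is a minimal Python illustration, not the bbr.bayes implementation: the file names, the choice of SHA-256, and the helpers `record_hashes` and `changed_since` are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_hashes(paths):
    """Map each input file path to the hash of its current contents."""
    return {str(p): file_hash(p) for p in paths}


def changed_since(recorded):
    """List the files whose contents no longer match the recorded hash."""
    return sorted(p for p, h in recorded.items() if file_hash(p) != h)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical model inputs written to dedicated files on disk.
    model_file = Path(tmp) / "model.stan"
    args_file = Path(tmp) / "args.json"
    model_file.write_text("parameters { real mu; } model { mu ~ normal(0, 1); }")
    args_file.write_text(json.dumps({"seed": 1234, "chains": 4}))

    # Record hashes at "run time", then check again before and after an edit.
    recorded = record_hashes([model_file, args_file])
    unchanged = changed_since(recorded)

    args_file.write_text(json.dumps({"seed": 5678, "chains": 4}))
    changed = changed_since(recorded)
```

Because the hash depends only on file contents, any edit to a tracked input after the run is detectable, which is the kind of check the check_up_to_date helper is described as performing.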

Traceability

We consider a model to be traceable if its provenance can be tracked back to its source. bbr and bbr.bayes allow (and encourage) basing one model on another by using the copy_model_from() function. In addition to copying the model files, this function updates the new model's .yaml file to include a based_on field, which records the model that was copied to create the current model.

The bbr and bbr.bayes packages enable traceable modeling by:

  1. Making it possible to define one model as the child of another (copy_model_from).
  2. Providing various helpers (e.g., get_based_on, get_model_ancestry, and run_log) to inspect the modeling history and development.
  3. Recording which inputs produced the outputs by storing the inputs' hashes, as mentioned above.
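To make the parent/child idea concrete, the following Python sketch walks a chain of based_on links back to the original model. It is only an illustration of the concept: the `models` dictionary, the model names, the single-parent assumption, and the `ancestry` function are hypothetical and do not reflect how get_model_ancestry is implemented.

```python
def ancestry(models, name):
    """Return the ancestors of `name`, oldest first, by following based_on."""
    chain = []
    seen = {name}
    parent = models[name]["based_on"]
    while parent is not None:
        if parent in seen:
            # Guard against a malformed history that loops back on itself.
            raise ValueError(f"cycle detected at model {parent!r}")
        seen.add(parent)
        chain.append(parent)
        parent = models[parent]["based_on"]
    return list(reversed(chain))


# Hypothetical metadata, as it might be read from each model's YAML file.
models = {
    "mod1": {"based_on": None},
    "mod2": {"based_on": "mod1"},
    "mod3": {"based_on": "mod2"},
}

lineage = ancestry(models, "mod3")  # ["mod1", "mod2"]
```

Storing only the immediate parent in each model's metadata is enough to reconstruct the full development history, as long as every model in the chain remains on disk.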

Key principle: write the essential elements for reproducing and tracking models to disk

To facilitate reproducible and traceable modeling, bbr.bayes writes to disk all of the model elements required to reproduce a model run and track its provenance.

More information about the essential elements of a bbr.bayes model object can be found in the bbr.bayes Getting Started vignette.