library(tidyverse)
library(here)
library(mrgda)
library(yspec)
library(lastdose)
1 Introduction
This page will demonstrate our data assembly workflow as we derive a data set using a prepared data specification file and a source data set.
2 Tools used
2.1 MetrumRG Packages
yspec: Data specification, wrangling, and documentation for pharmacometrics.
mrgda: Data assembly helper functions.
lastdose: Calculate last dose amount and time since previous doses.
2.2 CRAN Packages
dplyr: A grammar of data manipulation.
3 Outline
A source data directory is provided as a collection of .sas7bdat files. These files are found in the data/source folder.
A data specification yaml file can be found at data/derived/da-spec.yml. A spec object will be created from this file using the yspec package.
Below, we will assemble PK observations and dose administration records, along with baseline covariates, to create a NONMEM-ready data set.
The mrgda package will be used to:
- read the source data into R
- assign ID consistently to each subject
- write a NONMEM-compliant csv file along with data assembly metadata
We will use the spec object and yspec package to:
- validate the derived dataset
- add labels to the derived dataset
- generate a data definition document in pdf format
4 Setup
First, we will load the required packages, source data, and data specification file.
4.1 Required packages
All packages will be installed from mpn via pkgr.
4.2 Load in source data
The source data is read in using read_src_dir() from the mrgda package. It is saved to the concise, descriptive variable name src_list.
src_list <- read_src_dir(here("data", "source"))
┌ read_src_dir Summary ────────────────────────┐
│ │
│ Number of domains successfully loaded: 5 │
│ Number of domains that failed to load: 0 │
│ │
└──────────────────────────────────────────────┘
src_list is a named list containing a data.frame for each source domain. Below we can see the contents of one of the domains, demographics (dm).
head(src_list$dm)
# A tibble: 6 × 25
STUDYID DOMAIN USUBJID SUBJID RFSTDTC RFENDTC RFXSTDTC RFXENDTC RFICDTC
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 CDISCPILOT01 DM 01-701-1… 1015 2014-0… 2014-0… 2014-01… 2014-07… ""
2 CDISCPILOT01 DM 01-701-1… 1023 2012-0… 2012-0… 2012-08… 2012-09… ""
3 CDISCPILOT01 DM 01-701-1… 1028 2013-0… 2014-0… 2013-07… 2014-01… ""
4 CDISCPILOT01 DM 01-701-1… 1033 2014-0… 2014-0… 2014-03… 2014-03… ""
5 CDISCPILOT01 DM 01-701-1… 1034 2014-0… 2014-1… 2014-07… 2014-12… ""
6 CDISCPILOT01 DM 01-701-1… 1047 2013-0… 2013-0… 2013-02… 2013-03… ""
# ℹ 16 more variables: RFPENDTC <chr>, DTHDTC <chr>, DTHFL <chr>, SITEID <chr>,
# AGE <dbl>, AGEU <chr>, SEX <chr>, RACE <chr>, ETHNIC <chr>, ARMCD <chr>,
# ARM <chr>, ACTARMCD <chr>, ACTARM <chr>, COUNTRY <chr>, DMDTC <chr>,
# DMDY <dbl>
4.3 Load and examine the spec object
The spec file (da-spec.yml) identifies the desired data columns for the data set we are deriving. We can load the spec file into an object in the R session using the ys_load() function from the yspec package.
spec <- ys_load(here("data/derived/da-spec.yml"))
Each row in the object contains attributes about one column in the data set. The first handful of rows can be viewed with the head() function.
head(spec)
name info unit short source
1 C cd- . Commented rows da-lookup
2 NUM --- . Row number da-lookup
3 ID --- . NONMEM ID number da-lookup
4 TIME --- hour Time after first dose da-lookup
5 DVID -d- . Dependent variable identifier da-lookup
6 EVID -d- . Event identifier da-lookup
7 AMT --- mg Dose amount da-lookup
8 DV --- . Dependent variable da-lookup
9 AGE --- years Age da-lookup
10 WT --- kg Weight da-lookup
4.4 Define output lists
During data assemblies, we work with two types of variables: subject level and time-varying.
Subject level variables are those where there is only one unique value per subject. Sex, race and most demographic/baseline values are examples of these.
Time-varying variables are those that change with time. Dosing records, PK observations and time-varying lab values are examples of these.
To organize the variables into their respective types, we create output lists below. All subject level variables are saved in the derived$sl list, while time-varying variables are in the derived$tv list.
derived <- list()
derived$sl <- list()
derived$tv <- list()
5 Demographics
This section will focus on deriving subject level variables from the demographics source domain.
5.1 Remove screen failures
First, we filter the data to only the subjects we are interested in. Subjects who do not meet the inclusion criteria for the study are marked as a Screen Failure. We want to remove these subjects.
dm0 <-
  src_list$dm %>%
  filter(ACTARM != "Screen Failure")
5.2 Grab covariates of interest
According to the specification file, we need to derive AGE and SEX, both of which can be found in the dm domain.
SEX is a categorical variable, which means we need to assign each category a numerical value according to the specification file.
spec$SEX
name value
col SEX
type numeric
short Sex
value 0 : Male
1 : Female
As shown above, males should be SEX = 0 and females SEX = 1.
dm1 <-
  dm0 %>%
  transmute(
    USUBJID,
    AGE,
    SEX = if_else(SEX == "F", 1, 0),
    ACTARM,
    STUDY = STUDYID,
    SUBJ = SUBJID
  )
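One thing to keep in mind with the recode above: if_else(SEX == "F", 1, 0) sends every non-missing value other than "F" to 0. The sketch below uses a small made-up data frame (the toy_dm name, its values, and the "U" code are invented for illustration and are not taken from the source data) to show one possible way of making the mapping explicit with case_when(), so that unexpected codes surface as NA.

```r
library(dplyr)

# Toy demographics data; values are illustrative only, not from the source data
toy_dm <- tibble(
  USUBJID = c("A-001", "A-002", "A-003"),
  SEX     = c("F", "M", "U")
)

toy_dm1 <- toy_dm %>%
  mutate(
    # Two-way recode as used above: anything that is not "F" becomes 0
    SEX_ifelse = if_else(SEX == "F", 1, 0),
    # Explicit recode: codes other than "M" and "F" become NA
    SEX_explicit = case_when(
      SEX == "M" ~ 0,
      SEX == "F" ~ 1,
      TRUE       ~ NA_real_
    )
  )

toy_dm1
```

Whether the stricter mapping is needed depends on the codes actually present in the dm domain; a quick count(dm0, SEX) would answer that.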
5.3 Save to subject level output list
With our variables of interest derived, we save dm1 to the derived$sl list.
derived$sl$dm <- dm1
6 Labs
This section will focus on deriving subject level variables from the labs source domain (lb).
From the specification file, we need to derive ALT, AST, BILI, and CREAT from this domain.
6.1 Filter to baseline values
The LBBLFL variable in the lb domain indicates which records are baseline measurements. We can use this variable and the subjects we derived from the dm domain to filter down to the records and subjects of interest.
lb0 <-
  src_list$lb %>%
  filter(LBBLFL == "Y") %>%
  filter(USUBJID %in% derived$sl$dm$USUBJID)
6.2 Manipulate the format of the data
Each row in the lb domain holds one lab test, meaning subjects have multiple rows in the data. We want to transform the data so that each subject has a single row with one column per lab.
LBTESTCD identifies the type of lab test and LBSTRESN contains the result. We want to update the data such that the LBTESTCD values are the names of the columns, filled with the LBSTRESN values.
lb1 <-
  lb0 %>%
  filter(LBTESTCD %in% c("ALT", "AST", "BILI", "CREAT")) %>%
  select(USUBJID, LBTESTCD, LBSTRESN) %>%
  pivot_wider(names_from = "LBTESTCD", values_from = "LBSTRESN")
lb2 <- lb1 %>% rename(SCR = CREAT)
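To make the reshaping step concrete, here is a minimal sketch on a made-up long-format table (toy_lb and its values are invented for illustration; only the column names follow the lb domain described above). It shows how one row per test becomes one row per subject.

```r
library(dplyr)
library(tidyr)

# Toy long-format labs: one row per subject and test (illustrative values)
toy_lb <- tibble(
  USUBJID  = c("A-001", "A-001", "A-002", "A-002"),
  LBTESTCD = c("ALT", "CREAT", "ALT", "CREAT"),
  LBSTRESN = c(27, 79.6, 23, 123.8)
)

# One row per subject, one column per test code; CREAT renamed to SCR
toy_wide <- toy_lb %>%
  pivot_wider(names_from = "LBTESTCD", values_from = "LBSTRESN") %>%
  rename(SCR = CREAT)

toy_wide
```

If a subject had more than one baseline record for the same test, pivot_wider() would warn and return list-columns, so it can be worth checking for duplicates before pivoting.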
6.3 Save to subject level output list
Since lb2 contains baseline labs, we save it to the derived$sl list.
derived$sl$lb <- lb2
7 Dosing
This section will focus on deriving time-varying dose administration records from the ex domain.
7.1 Filter to desired treatment type
The ex domain contains dosing records for all treatments given to the subject. We can use the EXTRT variable to filter the data to only the treatments of interest.
ex0 <-
  src_list$ex %>%
  filter(EXTRT %in% c("PLACEBO", "XANOMELINE"))
7.2 Define variables of interest
According to the specification file, EVID and DVID need to be set to the numeric decodes for dosing records. We check this below:
spec$EVID
name value
col EVID
type numeric
short Event identifier
value 0 : Observation event
1 : Dosing event
Dosing records also require an AMT variable representing the dose. We also need to obtain the date/time of the dosing administration. This information can be found in the EXSTDTC variable.
ex1 <-
  ex0 %>%
  transmute(
    USUBJID,
    EVID = 1,
    DVID = 1,
    DOSE = EXDOSE,
    AMT = DOSE,
    DATETIME = lubridate::ymd(EXSTDTC)
  )
7.3 Save to time-varying output list
With all dosing records captured, ex1 can be saved to the derived$tv list.
derived$tv$ex <- ex1
8 PK observations
This section will focus on deriving time-varying PK observations from the pc domain.
8.1 Filter to desired concentrations
The PCTEST variable contains a description of the concentration measured. We want to filter to only the concentrations we are interested in.
pc0 <-
  src_list$pc %>%
  filter(PCTEST == "XANOMELINE")
8.2 Define variables of interest
Similar to dosing records, EVID and DVID need to be set to the numeric decodes provided in the specification file.
We are also interested in the PCLLOQ column, which holds the lower limit of quantification (LLOQ) for each concentration sample. Concentrations below the LLOQ get a below limit of quantification (BLQ) flag of 1. In the code below, these records are identified by a < in the original result, PCORRES, and their DV is set to missing.
Additionally, the concentration values can be obtained from PCSTRESN and the date/time of the record from PCDTC.
pc1 <-
  pc0 %>%
  transmute(
    USUBJID,
    EVID = 0,
    DVID = 2,
    BLQ = if_else(grepl("<", PCORRES, fixed = TRUE), 1, 0),
    DV = if_else(BLQ == 0, PCSTRESN, NA_real_),
    LLOQ = PCLLOQ,
    DATETIME = lubridate::ymd_hms(PCDTC)
  )
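As a quick illustration of how the BLQ and DV derivations interact, the sketch below applies the same two expressions to a made-up set of records (toy_pc and its values are invented; in particular, the "<0.01" text format for a BLQ result is an assumption about how such results are reported).

```r
library(dplyr)

# Toy PK records; the first original result is reported as below the LLOQ
toy_pc <- tibble(
  USUBJID  = "A-001",
  PCORRES  = c("<0.01", "1.52", "0.84"),
  PCSTRESN = c(NA, 1.52, 0.84),
  PCLLOQ   = 0.01
)

toy_pc1 <- toy_pc %>%
  transmute(
    USUBJID,
    # Flag records whose original result contains a literal "<"
    BLQ  = if_else(grepl("<", PCORRES, fixed = TRUE), 1, 0),
    # Keep the numeric result only for quantifiable records
    DV   = if_else(BLQ == 0, PCSTRESN, NA_real_),
    LLOQ = PCLLOQ
  )

toy_pc1
```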
8.3 Save to time-varying output list
With all PK observations captured, pc1 can be saved to the derived$tv list.
derived$tv$pc <- pc1
9 Combine domains
This section will focus on combining the time-varying and subject level data we have assembled above. Once combined, final modifications will be made to match the data to the specification file.
9.1 Bind and join derived data
The data we have saved to derived$tv will be used to generate rows in our combined data set. Meanwhile, the data within derived$sl will add columns.
First, we will combine the data frames in derived$sl. Notice that we have one row per subject.
baseline_variables <- reduce(derived$sl, full_join, by = "USUBJID")
head(baseline_variables)
# A tibble: 6 × 11
USUBJID AGE SEX ACTARM STUDY SUBJ ALT AST BILI SCR WT
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01-701-1015 63 1 Placebo CDIS… 1015 27 40 10.3 79.6 54.4
2 01-701-1023 64 0 Placebo CDIS… 1023 23 21 12.0 124. 80.3
3 01-701-1028 71 0 Xanomeline … CDIS… 1028 26 24 18.8 124. 99.3
4 01-701-1033 74 0 Xanomeline … CDIS… 1033 16 20 13.7 133. 88.4
5 01-701-1034 77 1 Xanomeline … CDIS… 1034 15 23 10.3 88.4 62.6
6 01-701-1047 85 1 Placebo CDIS… 1047 22 25 6.84 88.4 67.1
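The reduce() call above joins the list elements pairwise, left to right. A minimal sketch on a made-up two-element list (toy_sl and its values are invented for illustration) shows the effect, including what happens when a subject is missing from one domain:

```r
library(dplyr)
library(purrr)

# Toy subject level list; subject A-002 has no lab record
toy_sl <- list(
  dm = tibble(USUBJID = c("A-001", "A-002"), AGE = c(63, 64)),
  lb = tibble(USUBJID = "A-001", ALT = 27)
)

# Equivalent to full_join(toy_sl$dm, toy_sl$lb, by = "USUBJID")
toy_baseline <- reduce(toy_sl, full_join, by = "USUBJID")

toy_baseline
```

Because full_join() keeps subjects from every domain, a subject absent from one domain still appears, with NA in that domain's columns.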
Next, we will bind together the time-varying data and then join on baseline_variables. We sort by subject and date/time.
nm0 <-
  bind_rows(derived$tv) %>%
  left_join(baseline_variables, by = "USUBJID") %>%
  arrange(
    USUBJID,
    DATETIME
  )
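Some variables need to be populated on every row within a subject. Here, DOSE is filled down within each subject (last observation carried forward, LOCF) and then up, so that records before the first dose also receive a value.
nm1 <-
  nm0 %>%
  group_by(USUBJID) %>%
  tidyr::fill("DOSE", .direction = "downup") %>%
  ungroup()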
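The behavior of the "downup" direction is easiest to see on a small example. The sketch below uses made-up records (toy_nm and its values are invented for illustration) in which the first subject has an observation before the first dose.

```r
library(dplyr)
library(tidyr)

# Toy combined records: DOSE is only known on dosing rows (EVID = 1)
toy_nm <- tibble(
  USUBJID = c("A-001", "A-001", "A-001", "A-002", "A-002"),
  EVID    = c(0, 1, 0, 1, 0),
  DOSE    = c(NA, 54, NA, 81, NA)
)

# Fill down first, then up, within each subject
toy_filled <- toy_nm %>%
  group_by(USUBJID) %>%
  fill("DOSE", .direction = "downup") %>%
  ungroup()

toy_filled
```

Grouping by subject before filling keeps one subject's dose from leaking into another subject's rows.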
9.2 Derive TIME and TAD
The lastdose package can be used to create TIME and TAD variables in the data set when the following columns are present:
- subject ID
- record time
- dose amount
- EVID
nm2 <-
  nm1 %>%
  lastdose(include_tafd = TRUE, time_units = "hours") %>%
  mutate(
    TIME = TAFD
  )
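For intuition about what the two time variables mean, the sketch below computes time after first dose (TAFD) and time after the most recent dose (TAD) by hand with dplyr on made-up records. This is not how lastdose is implemented, and it assumes the data are sorted by time and that each subject's first record is a dose; lastdose covers situations this sketch ignores, such as records before the first dose. The toy data frame and its values are invented for illustration.

```r
library(dplyr)

# Toy records for one subject, already sorted by time; values are made up
toy <- tibble(
  USUBJID  = "A-001",
  DATETIME = as.POSIXct(
    c("2014-01-02 08:00:00", "2014-01-02 10:00:00",
      "2014-01-03 08:00:00", "2014-01-03 09:30:00"),
    tz = "UTC"
  ),
  EVID = c(1, 0, 1, 0)
)

toy_times <- toy %>%
  group_by(USUBJID) %>%
  mutate(
    # Hours since the subject's first dosing record
    TAFD = as.numeric(
      difftime(DATETIME, min(DATETIME[EVID == 1]), units = "hours")
    ),
    # Hours since the most recent dosing record at or before this row
    TAD = TAFD - cummax(if_else(EVID == 1, TAFD, -Inf))
  ) %>%
  ungroup()

toy_times
```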
9.3 Assign ID variable
The mrgda package has a function assign_id() that can be used to derive the ID variable. It ensures each subject is assigned a consistent unique numerical value.
nm3 <-
  nm2 %>%
  assign_id(.subject_col = "USUBJID")
┌ ID Summary ───────────────────────────────────────────┐
│ │
│ Number of subjects detected and assigned IDs: 254 │
│ │
└───────────────────────────────────────────────────────┘
9.4 Final modifications
To conclude the data assembly, we need to ensure the variables in our data set match those in the specification file. We can use mrgmisc::pool() to check this:
mrgmisc::pool(names(spec), names(nm3))
$`names(spec)`
[1] "C" "NUM" "MDV"
$`names(nm3)`
[1] "DATETIME" "LLOQ" "BILI" "TAFD"
$both
[1] "ID" "TIME" "DVID" "EVID" "AMT" "DV" "AGE"
[8] "WT" "SEX" "SCR" "AST" "ALT" "TAD" "LDOS"
[15] "BLQ" "DOSE" "SUBJ" "USUBJID" "STUDY" "ACTARM"
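The three groups in this output correspond to ordinary set operations. The sketch below reproduces that kind of comparison with base R on two short, made-up name vectors (spec_names, data_names, and the list element names are invented for illustration); it shows the idea only and is not how mrgmisc::pool() is implemented.

```r
# Toy name vectors standing in for names(spec) and names(nm3)
spec_names <- c("C", "NUM", "ID", "TIME", "MDV")
data_names <- c("ID", "TIME", "DATETIME", "TAFD")

comparison <- list(
  spec_only = setdiff(spec_names, data_names),   # in the spec, missing from the data
  data_only = setdiff(data_names, spec_names),   # in the data, not in the spec
  both      = intersect(spec_names, data_names)  # present in both
)

comparison
```

Names that appear only on the spec side still need to be derived, while names that appear only on the data side are dropped when the final columns are selected.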
There are three more variables we need to add to the data: C, NUM and MDV. These are common NONMEM variables and can be derived as such. It’s important to wait to derive NUM until all other modifications have been made to the data.
nm4 <-
  nm3 %>%
  mutate(
    MDV = if_else(is.na(DV), 1, 0),
    C = ".",
    NUM = 1:n()
  )
The final derivation step is to save our final data set to derived$nm, keeping only the columns named in the spec.
derived$nm <- nm4 %>% select(names(spec))
9.5 Verify derived data with yspec
Before submitting our derived data set, we can leverage the spec object to prepare and verify our work.
yspec can compare the candidate data set against the specification. The ys_check() function performs this comparison and prints a message reporting the result.
ys_check(derived$nm, spec)
The data set passed all checks.
Everything looks to be in order!
9.6 Output data to csv
Since the data and spec object match, our final step is to write the data to data/derived/examp-da.csv. We can use the write_derived() function from mrgda.
write_derived(
.data = derived$nm,
.spec = spec,
.file = "data/derived/examp-da.csv"
)
You can view our finished product in the data/derived folder.
10 Documenting derived data
This section will provide an example for creating a define document and SAS xport file.
10.1 Write a data definitions document
We create a define document for examp-da.csv in the data/derived folder.
ys_document(
  spec,
  type = "regulatory",
  output_dir = here("data/derived"),
  output_file = "examp-da.pdf",
  build_dir = definetemplate(),
  author = "Kyle Baron"
)
When rendering the define document, the regulatory type is selected, which is set up to follow the format requested for pharmacometrics submissions. The definetemplate() format puts a more polished style on the output, but it can be omitted or updated.
We can look at our define document here: data/derived/examp-da.pdf.
11 Other resources
11.1 Full script
The following script from the GitHub repository is discussed on this page. If you are interested in running this code, visit the About the GitHub Repo page first.
Data preparation script: data-assembly.R
11.2 Additional content
Further uses of the package yspec can be seen in the exploratory data analysis (EDA) pages.
To advance to the EDA content, go here: