Data Preparation

Data assembly and documentation with mrgda.

yspec
mrgda
data assembly

1 Introduction


This page will demonstrate our data assembly workflow as we derive a data set using a prepared data specification file and a source data set.

2 Tools used


2.1 MetrumRG Packages

yspec Data specification, wrangling, and documentation for pharmacometrics.

mrgda Data assembly helper functions.

lastdose Calculate last dose amount and time since previous doses.

2.2 CRAN Packages

dplyr A grammar of data manipulation.

3 Outline


The source data is provided as a collection of .sas7bdat files in the data/source folder.

A data specification yaml file can be found at data/derived/da-spec.yml. A spec object will be created from this using the yspec package.

Below, we will assemble PK observations and dosing administration records, along with baseline covariates to create a NONMEM ready data set.

The mrgda package will be used to:

  • read the source data into R
  • assign ID consistently to each subject
  • write a NONMEM-compliant csv file along with data assembly metadata

We will use the spec object and yspec package to:

  • validate the derived dataset
  • add labels to the derived dataset
  • generate a data definition document in pdf format

4 Setup


First, we will load the required packages, source data, and data specification file.

4.1 Required packages

All packages will be installed from mpn via pkgr.

library(tidyverse)
library(here)
library(mrgda)
library(yspec)
library(lastdose)

4.2 Load in source data

The source data is read in using read_src_dir() from the mrgda package, and the result is stored in src_list.

src_list <- read_src_dir(here("data", "source"))
┌ read_src_dir Summary ────────────────────────┐
│                                              │
│   Number of domains successfully loaded: 5   │
│   Number of domains that failed to load: 0   │
│                                              │
└──────────────────────────────────────────────┘

src_list is a named list containing data.frames of each source domain. Below we can see the contents of one of the domains, demographics (dm).

head(src_list$dm)
# A tibble: 6 × 25
  STUDYID      DOMAIN USUBJID   SUBJID RFSTDTC RFENDTC RFXSTDTC RFXENDTC RFICDTC
  <chr>        <chr>  <chr>     <chr>  <chr>   <chr>   <chr>    <chr>    <chr>  
1 CDISCPILOT01 DM     01-701-1… 1015   2014-0… 2014-0… 2014-01… 2014-07… ""     
2 CDISCPILOT01 DM     01-701-1… 1023   2012-0… 2012-0… 2012-08… 2012-09… ""     
3 CDISCPILOT01 DM     01-701-1… 1028   2013-0… 2014-0… 2013-07… 2014-01… ""     
4 CDISCPILOT01 DM     01-701-1… 1033   2014-0… 2014-0… 2014-03… 2014-03… ""     
5 CDISCPILOT01 DM     01-701-1… 1034   2014-0… 2014-1… 2014-07… 2014-12… ""     
6 CDISCPILOT01 DM     01-701-1… 1047   2013-0… 2013-0… 2013-02… 2013-03… ""     
# ℹ 16 more variables: RFPENDTC <chr>, DTHDTC <chr>, DTHFL <chr>, SITEID <chr>,
#   AGE <dbl>, AGEU <chr>, SEX <chr>, RACE <chr>, ETHNIC <chr>, ARMCD <chr>,
#   ARM <chr>, ACTARMCD <chr>, ACTARM <chr>, COUNTRY <chr>, DMDTC <chr>,
#   DMDY <dbl>

4.3 Load and examine the spec object

The spec file (da-spec.yml) identifies the desired data columns for the data set we are deriving. We can load the spec file into an object in the R session using the ys_load() function from the yspec package.

spec <- ys_load(here("data/derived/da-spec.yml"))

Each row in the object contains attributes about one column in the data set. The first handful of rows can be viewed with the head() function.

head(spec)
   name info  unit                         short    source
1     C  cd-     .                Commented rows da-lookup
2   NUM  ---     .                    Row number da-lookup
3    ID  ---     .              NONMEM ID number da-lookup
4  TIME  ---  hour         Time after first dose da-lookup
5  DVID  -d-     . Dependent variable identifier da-lookup
6  EVID  -d-     .              Event identifier da-lookup
7   AMT  ---    mg                   Dose amount da-lookup
8    DV  ---     .            Dependent variable da-lookup
9   AGE  --- years                           Age da-lookup
10   WT  ---    kg                        Weight da-lookup

Each row in the spec object can have:

  • a name which must be ≤ 8 characters long (by default)
    • continuous data items have units associated with them
    • discrete data items (like SEQ or EVID) can have valid levels listed in the spec as well as decodes for those levels
  • a “short” name that can be used in figure labels and tables
  • a TeX-specific namespace so that the units can potentially be returned with LaTeX formatting
  • a flags argument that can be used later to extract only covariate specific columns from the spec

4.4 Define output lists

During data assemblies, we work with two types of variables: subject level and time-varying.

Subject level variables are those where there is only one unique value per subject. Sex, race and most demographic/baseline values are examples of these.

Time-varying variables are those that change with time. Dosing records, PK observations and time-varying lab values are examples of these.

To organize the variables into their respective types, we create output lists below. All subject level variables are saved in the derived$sl list, while time-varying variables are in the derived$tv list.

derived <- list()
derived$sl <- list()
derived$tv <- list()

5 Demographics


This section will focus on deriving subject level variables from the demographics source domain.

5.1 Remove screen failures

First we filter the data to only the subjects we are interested in. Subjects who do not meet the inclusion criteria for the study are marked as a Screen Failure. We want to remove these subjects.

dm0 <-
  src_list$dm %>%
  filter(ACTARM != "Screen Failure")

5.2 Grab covariates of interest

According to the specification file, we need to derive AGE and SEX, both of which can be found in the dm domain. We also carry along the subject and study identifiers and the actual treatment arm.

SEX is a categorical variable, which means we need to assign each category a numerical value according to the specification file.

spec$SEX
 name  value     
 col   SEX       
 type  numeric   
 short Sex       
 value 0 : Male  
       1 : Female

As shown above, males should be SEX = 0 and females SEX = 1.

dm1 <-
  dm0 %>%
  transmute(
    USUBJID,
    AGE,
    SEX = if_else(SEX == "F", 1, 0),
    ACTARM,
    STUDY = STUDYID,
    SUBJ = SUBJID
  )
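
Note that the if_else() call above maps every value other than "F" to 0, so an unexpected code would be treated as male. The following is a minimal sketch of a stricter recode, not part of the original script; the toy_dm data frame and its "U" value are invented for illustration.

```r
library(dplyr)

# Hypothetical toy demographics data; the "U" value is invented for illustration
toy_dm <- tibble(
  USUBJID = c("A-001", "A-002", "A-003"),
  SEX     = c("F", "M", "U")
)

toy_dm_recoded <-
  toy_dm %>%
  mutate(
    SEX = case_when(
      SEX == "M" ~ 0,
      SEX == "F" ~ 1,
      TRUE       ~ NA_real_  # anything unexpected becomes NA instead of 0
    )
  )

# List the records whose SEX could not be mapped to a level in the spec
filter(toy_dm_recoded, is.na(SEX))
```

With this approach an unmapped value shows up as a missing SEX that can be reviewed, rather than being silently assigned to one of the spec levels.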

5.3 Save to subject level output list

With our variables of interest derived, we save dm1 to the derived$sl list.

derived$sl$dm <- dm1

6 Labs


This section will focus on deriving subject level variables from the labs source domain (lb).

From the specification file, we need to derive ALT, AST, BILI, and CREAT from this domain.

6.1 Filter to baseline values

The LBBLFL variable in the lb domain indicates which records are baseline measurements. We can use this variable and the subjects we derived from the dm domain to filter down to records and subjects of interest.

lb0 <-
  src_list$lb %>%
  filter(LBBLFL == "Y") %>% 
  filter(USUBJID %in% derived$sl$dm$USUBJID) 

6.2 Manipulate the format of the data

Each row in the lb domain holds one lab test result, meaning subjects have multiple rows in the data. We want to transform the data so that each subject has a single row, with one column per lab test.

LBTESTCD identifies the type of lab test and LBSTRESN contains the result. We want to update the data such that the LBTESTCD values are the names of the columns, filled with the LBSTRESN values.

lb1 <-
  lb0 %>%
  filter(LBTESTCD %in% c("ALT", "AST", "BILI", "CREAT")) %>%
  select(USUBJID, LBTESTCD, LBSTRESN) %>% 
  pivot_wider(names_from = "LBTESTCD", values_from = "LBSTRESN") 

lb2 <- lb1 %>% rename(SCR = CREAT)
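
To make the reshaping step concrete, here is a small sketch on a hypothetical long-format table (toy_lb and its values are invented for illustration); it applies the same pivot_wider() and rename() calls used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format lab data: one row per subject and lab test
toy_lb <- tibble(
  USUBJID  = c("A-001", "A-001", "A-002", "A-002"),
  LBTESTCD = c("ALT", "CREAT", "ALT", "CREAT"),
  LBSTRESN = c(27, 79.6, 23, 123.8)
)

toy_wide <-
  toy_lb %>%
  pivot_wider(names_from = "LBTESTCD", values_from = "LBSTRESN") %>%
  rename(SCR = CREAT)  # match the column name used in the spec

# One row per subject, with columns USUBJID, ALT, SCR
toy_wide
```

If a subject had more than one baseline record for the same test, pivot_wider() would warn and return list-columns, so it can be worth checking for duplicates (for example with count(USUBJID, LBTESTCD)) before reshaping.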

6.3 Save to subject level output list

Since lb2 contains baseline labs, we save it to the derived$sl list.

derived$sl$lb <- lb2

7 Dosing


This section will focus on deriving time-varying dose administration records from the ex domain.

7.1 Filter to desired treatment type

The ex domain contains dosing records for all treatments given to the subject. We can use the EXTRT variable to filter the data to only treatments of interest.

ex0 <-
  src_list$ex %>%
  filter(EXTRT %in% c("PLACEBO", "XANOMELINE"))

7.2 Define variables of interest

According to the specification file, EVID and DVID need to be set to the numeric decodes for dosing records. We check this below:

spec$EVID
 name  value                
 col   EVID                 
 type  numeric              
 short Event identifier     
 value 0 : Observation event
       1 : Dosing event     

Dosing records also require an AMT variable representing the dose. We also need to obtain the date/time of the dosing administration. This information can be found in the EXSTDTC variable.

ex1 <-
  ex0 %>% 
  transmute(
    USUBJID,
    EVID = 1,
    DVID = 1,
    DOSE = EXDOSE,
    AMT = DOSE, 
    DATETIME = lubridate::ymd(EXSTDTC)
  )
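
The dosing records above are parsed with ymd(), which yields a date without a time of day, whereas date-times are parsed with ymd_hms(). The sketch below uses invented timestamps (not values from the source data) to show the difference, under the assumption that a date-only dose is treated as midnight UTC.

```r
library(lubridate)

# Hypothetical SDTM-style timestamps, invented for illustration
dose_date <- ymd("2014-01-02")               # date only
obs_time  <- ymd_hms("2014-01-02T08:30:00")  # date and time (UTC by default)

class(dose_date)  # "Date"
class(obs_time)   # "POSIXct" "POSIXt"

# Assumption: the date-only dose is placed at midnight UTC
dose_midnight <- as.POSIXct(dose_date, tz = "UTC")
elapsed_hours <- as.numeric(difftime(obs_time, dose_midnight, units = "hours"))
elapsed_hours  # 8.5
```

When a source domain records only the dosing date, any time-after-dose value derived from it carries this midnight assumption.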

7.3 Save to time-varying output list

With all dosing records captured, ex1 can be saved to the derived$tv list.

derived$tv$ex <- ex1

8 PK observations


This section will focus on deriving time-varying PK observations from the pc domain.

8.1 Filter to desired concentrations

The PCTEST variable contains a description of the concentration measured. We want to filter to only the concentrations we are interested in.

pc0 <-
  src_list$pc %>%
  filter(PCTEST == "XANOMELINE")

8.2 Define variables of interest

Similar to dosing records, EVID and DVID need to be set to the numeric decodes provided in the specification file.

We are also interested in the PCLLOQ column, which holds the lower limit of quantification (LLOQ) for each concentration sample. In the code below, a record is treated as below the limit of quantification (BLQ) when its original result (PCORRES) contains "<"; for these records the BLQ flag is set to 1 and DV is set to missing.

Additionally, the concentration values can be obtained from PCSTRESN and the date/time of the record from PCDTC.

pc1 <-
  pc0 %>%
  transmute(
    USUBJID,
    EVID = 0,
    DVID = 2,
    BLQ = if_else(grepl("<", PCORRES, fixed = TRUE), 1, 0),
    DV = if_else(BLQ == 0, PCSTRESN, NA_real_),
    LLOQ = PCLLOQ,
    DATETIME = lubridate::ymd_hms(PCDTC)
  )
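
The BLQ logic can be checked on a few hypothetical records. In this sketch the toy_pc values are invented for illustration; only the two if_else() expressions are taken from the code above.

```r
library(dplyr)

# Hypothetical toy PK records; values are invented for illustration
toy_pc <- tibble(
  PCORRES  = c("<0.01", "1.25", "0.48"),
  PCSTRESN = c(NA, 1.25, 0.48),
  PCLLOQ   = c(0.01, 0.01, 0.01)
)

toy_pc_flagged <-
  toy_pc %>%
  mutate(
    # a "<" in the original result marks the record as BLQ
    BLQ = if_else(grepl("<", PCORRES, fixed = TRUE), 1, 0),
    # BLQ records get a missing DV; all others keep the numeric result
    DV  = if_else(BLQ == 0, PCSTRESN, NA_real_)
  )

toy_pc_flagged
```

Only the first record is flagged, and its DV is missing.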

8.3 Save to time-varying output list

With all PK observations captured, pc1 can be saved to the derived$tv list.

derived$tv$pc <- pc1

9 Combine domains


This section will focus on combining the time-varying and subject level data we have assembled above. Once combined, final modifications will be made to match the data to the specification file.

9.1 Bind and join derived data

The data we have saved to derived$tv will be used to generate rows in our combined data set. Meanwhile, the data within derived$sl will add columns.

First, we will combine the data frames in derived$sl. Notice that we have one row per subject.

baseline_variables <- reduce(derived$sl, full_join, by = "USUBJID")
head(baseline_variables)
# A tibble: 6 × 11
  USUBJID       AGE   SEX ACTARM       STUDY SUBJ    ALT   AST  BILI   SCR    WT
  <chr>       <dbl> <dbl> <chr>        <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01-701-1015    63     1 Placebo      CDIS… 1015     27    40 10.3   79.6  54.4
2 01-701-1023    64     0 Placebo      CDIS… 1023     23    21 12.0  124.   80.3
3 01-701-1028    71     0 Xanomeline … CDIS… 1028     26    24 18.8  124.   99.3
4 01-701-1033    74     0 Xanomeline … CDIS… 1033     16    20 13.7  133.   88.4
5 01-701-1034    77     1 Xanomeline … CDIS… 1034     15    23 10.3   88.4  62.6
6 01-701-1047    85     1 Placebo      CDIS… 1047     22    25  6.84  88.4  67.1
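
As a small illustration of the reduce() pattern, the sketch below joins a hypothetical list of two subject-level tables (toy_sl is invented for illustration) and then confirms that the result still has one row per subject.

```r
library(dplyr)
library(purrr)

# Hypothetical subject-level tables, mirroring the structure of derived$sl
toy_sl <- list(
  dm = tibble(USUBJID = c("A-001", "A-002"), AGE = c(63, 64)),
  lb = tibble(USUBJID = c("A-001", "A-002"), ALT = c(27, 23))
)

# Join every table in the list, one after another, on USUBJID
toy_baseline <- reduce(toy_sl, full_join, by = "USUBJID")

# A subject-level table should have exactly one row per subject
stopifnot(!anyDuplicated(toy_baseline$USUBJID))
toy_baseline
```

A check like the stopifnot() line can catch a domain that accidentally contributes more than one row per subject before it multiplies rows in the combined data set.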

Next, we will bind together the time-varying data and then join on baseline_variables. We sort by subject and date/time.

nm0 <-
  bind_rows(derived$tv) %>%
  left_join(baseline_variables, by = "USUBJID") %>%
  arrange(
    USUBJID,
    DATETIME
  )

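Some variables, such as DOSE, are only populated on dosing records and need to be filled for every row within a subject. We carry the last value forward (LOCF) and then fill backward to cover any rows that come before the first dose.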

nm1 <-
  nm0 %>% 
  group_by(USUBJID) %>%
  tidyr::fill("DOSE", .direction = "downup") %>%
  ungroup()
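
The effect of .direction = "downup" can be seen on a hypothetical data set (toy_nm is invented for illustration) where the first record of a subject comes before any dose.

```r
library(dplyr)
library(tidyr)

# Hypothetical records: DOSE is only known on dosing rows (EVID == 1)
toy_nm <- tibble(
  USUBJID = c("A-001", "A-001", "A-001", "A-002", "A-002"),
  EVID    = c(0, 1, 0, 1, 0),
  DOSE    = c(NA, 54, NA, 81, NA)
)

toy_filled <-
  toy_nm %>%
  group_by(USUBJID) %>%
  fill("DOSE", .direction = "downup") %>%  # forward fill, then backward fill
  ungroup()

toy_filled$DOSE
# 54 54 54 81 81
```

Grouping by subject keeps one subject's dose from being carried into the next subject's rows.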

9.2 Derive TIME and TAD

The lastdose package can be used to create TIME and TAD variables in the data set when the following columns are present:

  • subject ID
  • record time
  • dose amount
  • EVID

nm2 <-
  nm1 %>%
  lastdose(include_tafd = TRUE, time_units = "hours") %>%
  mutate(
    TIME = TAFD
  )
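
To clarify what these two quantities mean, here is a hand-rolled sketch with dplyr. It is not how lastdose is implemented, and it assumes records are sorted by time within subject with a dose as the first record; toy_rec and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy records: times are in hours, values invented for illustration
toy_rec <- tibble(
  ID   = c(1, 1, 1, 1),
  TIME = c(0, 2, 24, 30),
  EVID = c(1, 0, 1, 0),
  AMT  = c(50, 0, 50, 0)
)

toy_times <-
  toy_rec %>%
  group_by(ID) %>%
  mutate(
    # time of the first dose for this subject
    FIRST_DOSE = min(TIME[EVID == 1]),
    # time of the most recent dose at or before each record
    LAST_DOSE  = cummax(if_else(EVID == 1, TIME, -Inf)),
    TAFD = TIME - FIRST_DOSE,  # time after first dose
    TAD  = TIME - LAST_DOSE    # time after most recent dose
  ) %>%
  ungroup()

toy_times %>% select(ID, TIME, EVID, TAFD, TAD)
```

In this toy example TAFD is 0, 2, 24, 30 and TAD is 0, 2, 0, 6. Records before the first dose and other edge cases are not handled by this sketch.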

9.3 Assign ID variable

The mrgda package has a function assign_id() that can be used to derive the ID variable. It ensures each subject is assigned a consistent unique numerical value.

nm3 <-
  nm2 %>%
  assign_id(.subject_col = "USUBJID")
┌ ID Summary ───────────────────────────────────────────┐
│                                                       │
│   Number of subjects detected and assigned IDs: 254   │
│                                                       │
└───────────────────────────────────────────────────────┘
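
The property we rely on is that every record for a subject carries the same numeric ID and no two subjects share one. The sketch below illustrates that idea in a simple way; it is not mrgda's implementation, and toy_subj is invented for illustration.

```r
library(dplyr)

# Hypothetical records for three subjects, deliberately out of order
toy_subj <- tibble(USUBJID = c("A-002", "A-002", "A-001", "A-003", "A-001"))

toy_ids <-
  toy_subj %>%
  mutate(ID = match(USUBJID, unique(USUBJID)))  # number subjects by first appearance

# Each subject maps to exactly one ID, and each ID to exactly one subject
id_map <- distinct(toy_ids, USUBJID, ID)
stopifnot(!anyDuplicated(id_map$USUBJID), !anyDuplicated(id_map$ID))

toy_ids$ID
# 1 1 2 3 2
```

IDs produced this way depend on row order, so they could change if the data were sorted differently or subjects were added, which is one reason to use a dedicated helper instead.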

9.4 Final modifications

To conclude the data assembly, we need to ensure the variables in our data set match those in the specification file. We can use mrgmisc::pool() to check this:

mrgmisc::pool(names(spec), names(nm3))
$`names(spec)`
[1] "C"   "NUM" "MDV"

$`names(nm3)`
[1] "DATETIME" "LLOQ"     "BILI"     "TAFD"    

$both
 [1] "ID"      "TIME"    "DVID"    "EVID"    "AMT"     "DV"      "AGE"    
 [8] "WT"      "SEX"     "SCR"     "AST"     "ALT"     "TAD"     "LDOS"   
[15] "BLQ"     "DOSE"    "SUBJ"    "USUBJID" "STUDY"   "ACTARM" 
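
The output above lists the names found only in the spec, only in the data, and in both. For readers without mrgmisc installed, a comparable comparison can be sketched in base R; the two name vectors here are shortened, hypothetical stand-ins for names(spec) and names(nm3).

```r
# Hypothetical, shortened stand-ins for names(spec) and names(nm3)
spec_names <- c("C", "NUM", "ID", "TIME", "MDV")
data_names <- c("ID", "TIME", "DATETIME", "TAFD")

toy_pool <- list(
  spec_only = setdiff(spec_names, data_names),   # still to be derived
  data_only = setdiff(data_names, spec_names),   # to be dropped before writing
  both      = intersect(spec_names, data_names)  # already in place
)

toy_pool
```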

There are three more variables we need to add to the data: C, NUM and MDV. These are common NONMEM variables and can be derived as shown below. It’s important to wait to derive NUM until all other modifications have been made to the data, so that the row numbers reflect the final row order.

nm4 <-
  nm3 %>% 
  mutate(
    MDV = if_else(is.na(DV), 1, 0),
    C = ".",
    NUM = 1:n()
  )

derived$nm <- nm4 %>% select(names(spec))

The last line of the code above is the final derivation step: it selects the columns named in the spec, in spec order, and saves the result to derived$nm.

9.5 Verify derived data with yspec

Before submitting our derived data set, we can leverage the spec object to prepare and verify our work.

The ys_check() function from yspec compares the candidate data set with the spec object and prints a message reporting the result of the comparison.

ys_check(derived$nm, spec)
The data set passed all checks.

Everything looks to be in order!

The call to ys_check() does a limited check of the data against the spec, including:

  1. all the columns are present and in the correct order
  2. levels of discrete data columns match up between the spec and the data
  3. the range for all continuous data items is within the range specified in the spec

Note that certain data requirements are also enforced when the spec is loaded, for example:

  1. all column names contain 8 or fewer characters
  2. the short field in the spec contains 40 or fewer characters for all columns
  3. the label field in the spec contains 40 or fewer characters for all columns
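
To give a feel for the kinds of checks listed above, here is a hand-rolled sketch in base R. It is not yspec's implementation; the toy data, column names, and allowed EVID levels are hypothetical stand-ins for what the spec would provide.

```r
# Hypothetical data set and spec information, invented for illustration
toy_data <- data.frame(ID = c(1, 1, 2), EVID = c(1, 0, 1), DV = c(NA, 1.2, NA))
toy_spec_names  <- c("ID", "EVID", "DV")  # expected columns, in order
toy_evid_levels <- c(0, 1)                # allowed values for EVID

checks <- c(
  # all columns present and in the expected order
  names_match  = identical(names(toy_data), toy_spec_names),
  # discrete column only contains levels listed in the spec
  evid_in_spec = all(toy_data$EVID %in% toy_evid_levels),
  # column names are 8 characters or fewer
  names_short  = all(nchar(names(toy_data)) <= 8)
)

checks
stopifnot(all(checks))
```

In practice ys_check() performs these comparisons directly from the spec object, so checks like these do not need to be written by hand.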

9.6 Output data to csv

Since the data and spec object match, our final step is to write the data to data/derived/examp-da.csv. We can use the write_derived() function from mrgda.

write_derived(
  .data = derived$nm,
  .spec = spec,
  .file = "data/derived/examp-da.csv"
)

You can view our finished product in the data/derived folder.

10 Documenting derived data


This section will provide an example for creating a define document and SAS xport file.

10.1 Write a data definitions document

We create a define document for examp-da.csv in the data/derived folder.

ys_document(
  spec, 
  type = "regulatory", 
  output_dir = here("data/derived"), 
  output_file = "examp-da.pdf", 
  build_dir = definetemplate(), 
  author = "Kyle Baron"
)

When rendering the define document, the regulatory type is selected, which is set up to follow the format requested for pharmacometrics submissions. The definetemplate() format puts a more polished style on the output, but it can be omitted or updated.

We can look at our define document at data/derived/examp-da.pdf.

11 Other resources


11.1 Full script

The following script from the GitHub repository is discussed on this page. If you are interested in running this code, visit the About the GitHub Repo page first.

Data preparation script: data-assembly.R

11.2 Additional content

Further uses of the package yspec can be seen in the exploratory data analysis (EDA) pages.
