Specifications: The Single Source of Truth

A herald_spec is the single source of truth for a clinical submission. It drives every downstream operation: metadata decoration, Define-XML generation, conformance validation, and submission packaging. Build it once; use it everywhere.

The spec slots

A herald_spec holds up to eleven data frames, each corresponding to a tab in a Pinnacle 21 specification workbook:

Slot	What it holds	Required?
`ds_spec`	Dataset-level info: label, class, structure, keys	Yes
`var_spec`	Variable-level info: label, type, length, format, order	Yes
`value_spec`	Value-level metadata (Where/Comment fields)	No
`codelist`	Controlled terminology: code/decode pairs	No
`study`	Study-level metadata: protocol, sponsor	No
`dictionaries`	Medical dictionaries (MedDRA, WHODrug)	No
`methods`	Derivation methods for ADaM	No
`comments`	Reviewer guide comments	No
`documents`	Supplemental documents	No
`arm_displays`	ADaM Results Metadata display definitions	No
`arm_results`	ADaM Results Metadata analysis results	No
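
Only `ds_spec` and `var_spec` are required. As a minimal sketch of what that requirement means, the hypothetical helper below (`missing_required_slots()` is illustrative and not part of herald) takes a plain named list of data frames and reports any required slot that is absent. It makes no claim about how herald_spec() enforces the rule internally.

```r
# Hypothetical helper, not part of herald: report required slots that are
# absent (or not a data frame) in a plain named list.
missing_required_slots <- function(slots,
                                   required = c("ds_spec", "var_spec")) {
  present <- vapply(required,
                    function(nm) is.data.frame(slots[[nm]]),
                    logical(1))
  required[!present]
}

# A list that supplies an optional slot but forgets var_spec
slots <- list(
  ds_spec  = data.frame(dataset = "DM", label = "Demographics"),
  codelist = data.frame(codelist_id = "SEX", term = "M",
                        decoded_value = "Male")
)

missing_required_slots(slots)
#> [1] "var_spec"
```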

Building a spec programmatically

For small studies or tests, build the spec inline. This approach is self-contained and needs no external files.

spec <- herald_spec(
  ds_spec = data.frame(
    dataset   = c("DM", "AE"),
    label     = c("Demographics", "Adverse Events"),
    keys      = c("STUDYID, USUBJID", "STUDYID, USUBJID, AESEQ"),
    structure = c("One record per subject", "One record per subject per AE"),
    stringsAsFactors = FALSE
  ),
  var_spec = data.frame(
    dataset   = c("DM","DM","DM","DM", "AE","AE","AE","AE","AE"),
    variable  = c("STUDYID","USUBJID","AGE","SEX",
                  "STUDYID","USUBJID","AESEQ","AETERM","AESTDTC"),
    label     = c("Study Identifier","Unique Subject Identifier","Age","Sex",
                  "Study Identifier","Unique Subject Identifier",
                  "Sequence Number of AE","Reported Term for the Adverse Event",
                  "Start Date/Time of AE"),
    data_type = c("text","text","integer","text",
                  "text","text","integer","text","text"),
    length    = c(12L,11L,8L,1L, 12L,11L,8L,200L,19L),
    order     = c(1L,2L,3L,4L, 1L,2L,3L,4L,5L),
    stringsAsFactors = FALSE
  ),
  codelist = data.frame(
    codelist_id   = c("SEX","SEX","RACE","RACE","RACE"),
    term          = c("M","F","WHITE","BLACK","ASIAN"),
    decoded_value = c("Male","Female","White","Black or African American","Asian"),
    stringsAsFactors = FALSE
  )
)

spec
#> 
#> ── herald_spec ──
#> 
#> • Datasets: 2
#> • Variables: 9
#> • Codelists: 2
#> Datasets: "DM" and "AE"

The print() method gives a quick summary. Use summary() for slot-level detail:

summary(spec)
#> 
#> ── herald_spec summary ──
#> 
#> study: 0 rows x 0 cols
#> ds_spec: 2 rows x 4 cols
#> var_spec: 9 rows x 6 cols
#> value_spec: NULL
#> codelist: 5 rows x 3 cols
#> dictionaries: NULL
#> methods: NULL
#> comments: NULL
#> documents: NULL
#> arm_displays: NULL
#> arm_results: NULL
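
When a spec is assembled by hand, the `keys` column in `ds_spec` and the rows of `var_spec` can drift apart. The sketch below is a hypothetical pre-flight check in base R (`undeclared_keys()` is not a herald function) that lists key variables not declared for their dataset. It assumes keys are stored as one comma-separated string per dataset, as in the example above, and uses a deliberately incomplete `var_spec` to show a hit.

```r
# Hypothetical pre-flight check, not part of herald: find key variables in
# ds_spec that are not declared for the same dataset in var_spec.
undeclared_keys <- function(ds_spec, var_spec) {
  out <- list()
  for (i in seq_len(nrow(ds_spec))) {
    ds <- ds_spec$dataset[i]
    # Keys are stored as one comma-separated string per dataset
    keys <- trimws(strsplit(ds_spec$keys[i], ",", fixed = TRUE)[[1]])
    declared <- var_spec$variable[var_spec$dataset == ds]
    out[[ds]] <- setdiff(keys, declared)
  }
  out
}

ds_spec <- data.frame(
  dataset = c("DM", "AE"),
  keys    = c("STUDYID, USUBJID", "STUDYID, USUBJID, AESEQ"),
  stringsAsFactors = FALSE
)
# AESEQ is intentionally left out of var_spec
var_spec <- data.frame(
  dataset  = c("DM", "DM", "AE", "AE"),
  variable = c("STUDYID", "USUBJID", "STUDYID", "USUBJID"),
  stringsAsFactors = FALSE
)

res <- undeclared_keys(ds_spec, var_spec)
res$AE
#> [1] "AESEQ"
```

An empty character vector for a dataset (here `res$DM`) means every key is declared.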

Inspecting a spec

List datasets

spec_datasets(spec)
#> [1] "DM" "AE"

Variable metadata for one dataset

spec_vars(spec, "AE")
#>   dataset variable                               label data_type length order
#> 1      AE  STUDYID                    Study Identifier      text     12     1
#> 2      AE  USUBJID           Unique Subject Identifier      text     11     2
#> 3      AE    AESEQ               Sequence Number of AE   integer      8     3
#> 4      AE   AETERM Reported Term for the Adverse Event      text    200     4
#> 5      AE  AESTDTC               Start Date/Time of AE      text     19     5

Codelist entries

spec_codelist(spec, "SEX")
#>   codelist_id term decoded_value
#> 1         SEX    M          Male
#> 2         SEX    F        Female

Study slot

# No study data frame was supplied in this example, so the lookup returns NULL
spec_study(spec, "protocol")
#> NULL
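
Judging by the printed output, the dataset and codelist accessors behave like row filters on the underlying data frames, with rows renumbered from 1. A rough base-R equivalent is sketched below. `filter_rows()` is a hypothetical stand-in, and nothing here is taken from herald's internals.

```r
# Hypothetical stand-in for the accessors: keep the rows where `column`
# equals `value`, then renumber rows from 1.
filter_rows <- function(df, column, value) {
  out <- df[df[[column]] == value, , drop = FALSE]
  rownames(out) <- NULL
  out
}

codelist <- data.frame(
  codelist_id   = c("SEX", "SEX", "RACE", "RACE", "RACE"),
  term          = c("M", "F", "WHITE", "BLACK", "ASIAN"),
  decoded_value = c("Male", "Female", "White",
                    "Black or African American", "Asian"),
  stringsAsFactors = FALSE
)

filter_rows(codelist, "codelist_id", "RACE")$term
#> [1] "WHITE" "BLACK" "ASIAN"
```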

Slot access: `@` vs `$`

herald_spec is an S7 object. Both @ and $ access slots, but they behave differently in the IDE:

spec$ds_spec     # works, but no IDE autocomplete
#>   dataset          label                    keys                     structure
#> 1      DM   Demographics        STUDYID, USUBJID        One record per subject
#> 2      AE Adverse Events STUDYID, USUBJID, AESEQ One record per subject per AE
spec@ds_spec     # works AND triggers autocomplete — use this
#>   dataset          label                    keys                     structure
#> 1      DM   Demographics        STUDYID, USUBJID        One record per subject
#> 2      AE Adverse Events STUDYID, USUBJID, AESEQ One record per subject per AE
spec$codelist
#>   codelist_id  term             decoded_value
#> 1         SEX     M                      Male
#> 2         SEX     F                    Female
#> 3        RACE WHITE                     White
#> 4        RACE BLACK Black or African American
#> 5        RACE ASIAN                     Asian
spec@codelist
#>   codelist_id  term             decoded_value
#> 1         SEX     M                      Male
#> 2         SEX     F                    Female
#> 3        RACE WHITE                     White
#> 4        RACE BLACK Black or African American
#> 5        RACE ASIAN                     Asian

Tip: Use @ for slot access in scripts and the console. IDEs (RStudio, Positron) autocomplete @ slot names; $ bypasses the autocomplete mechanism for S7 objects.

Reading specs from files

Pinnacle 21 Excel (real-world workflow)

# Reads all tabs automatically — ds_spec, var_spec, value_spec, codelist, etc.
spec <- read_spec("path/to/specification.xlsx")

read_spec() detects the file type from the extension: .xlsx triggers the P21 Excel parser, .xml triggers the Define-XML parser, .json triggers the herald JSON parser.
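
Extension-based dispatch of this kind can be sketched in a few lines of base R. The function name `detect_spec_format()` and the returned labels are hypothetical, chosen for illustration; read_spec()'s actual implementation may differ, for example in how it treats upper-case extensions.

```r
# Hypothetical sketch of extension-based dispatch; not herald's code.
detect_spec_format <- function(path) {
  # Normalise case so ".XML" and ".xml" are treated alike (an assumption)
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    xlsx = "p21_excel",
    xml  = "define_xml",
    json = "herald_json",
    stop("Unsupported spec file extension: '", ext, "'", call. = FALSE)
  )
}

detect_spec_format("path/to/specification.xlsx")
#> [1] "p21_excel"
detect_spec_format("define.XML")
#> [1] "define_xml"
```

Failing loudly on an unknown extension is a deliberate choice in this sketch: silently guessing a parser tends to produce confusing errors further downstream.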

Define-XML round-trip

Generate Define-XML from a spec, then read it back:

if (requireNamespace("xml2", quietly = TRUE)) {
  xml_path <- tempfile(fileext = ".xml")

  write_define_xml(spec, xml_path, validate = FALSE)
  spec2 <- read_spec_define(xml_path)

  # Variable metadata is preserved; inside if (), only the last value prints
  nrow(spec2@var_spec)
  spec2@ds_spec$label
}
#> [1] "Demographics"   "Adverse Events"

Herald JSON round-trip

JSON is plain text and diffs cleanly, which makes it well suited to version control: store your spec alongside your code.

if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_path <- tempfile(fileext = ".json")

  write_spec(spec, json_path)
  spec3 <- read_spec(json_path)

  identical(spec3@var_spec$variable, spec@var_spec$variable)
  identical(spec3@codelist$term,     spec@codelist$term)
}
#> [1] TRUE
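
A round-trip check is more informative when it reports which columns changed. The sketch below uses a hypothetical base-R helper (`compare_columns()`, not part of herald) to compare selected columns of two data frames. The `before` and `after` frames are toy stand-ins, not real reader output, and the integer-to-double change is simulated for illustration.

```r
# Hypothetical helper, not part of herald: return the names of the columns
# that are not identical between two data frames.
compare_columns <- function(before, after, columns) {
  same <- vapply(columns,
                 function(col) identical(before[[col]], after[[col]]),
                 logical(1))
  columns[!same]
}

before <- data.frame(variable = c("AGE", "SEX"), length = c(8L, 1L),
                     stringsAsFactors = FALSE)
# Simulate a reader that returns lengths as doubles instead of integers
after <- data.frame(variable = c("AGE", "SEX"), length = c(8, 1),
                    stringsAsFactors = FALSE)

compare_columns(before, after, c("variable", "length"))
#> [1] "length"
```

identical() is strict about storage type. Where an integer-versus-double difference is acceptable, `isTRUE(all.equal(x, y))` is a more forgiving comparison.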

Validating the spec itself

validate_spec() checks the spec structure before you touch any data. It runs the DD-prefix rules (Define-XML conformance rules) against the spec.

# Introduce a deliberate error: variable with no label
bad_spec <- herald_spec(
  ds_spec  = data.frame(dataset = "DM", label = "Demographics",
                        stringsAsFactors = FALSE),
  var_spec = data.frame(
    dataset = "DM", variable = "AGE", label = NA_character_,
    data_type = "integer", length = 8L,
    stringsAsFactors = FALSE
  )
)

result <- validate_spec(bad_spec)
result
#> 
#> ── herald validation ──
#> 
#> Datasets checked: 3
#> ℹ Spec checks only -- no conformance rules evaluated
#> Findings: 0 reject, 6 high, 0 medium, 0 low
#> 
#> ── Reject / High Impact
#> ✖ [DD0006] [High] datasets.dataset: Dataset name is missing. Each row in the Datasets sheet must have a dataset name.
#> ✖ [DD0006] [High] variables.dataset: Dataset name is missing. Each row in the Datasets sheet must have a dataset name.
#> ✖ [DD0007] [High] datasets.label: Dataset label is missing. Description is required for all ItemGroupDef in regulatory submissions.
#> ✖ [DD0021] [High] variables.variable: Variable name is missing. Each row in the Variables sheet must have a variable name.
#> ✖ [DD0022] [High] datasets.label: Variable label is missing. Description is required for all ItemDef corresponding to Variable definitions in regulatory submissions.
#> ✖ [DD0028] [High] variables.data_type: Text variable length exceeds 200 characters. SAS v5 Transport files restrict variable lengths to 200 characters.
result$findings
#>   rule_id impact   dataset  variable row        value expected
#> 1  DD0006   High  datasets   dataset   1           DM     <NA>
#> 2  DD0006   High variables   dataset   1           DM     <NA>
#> 3  DD0007   High  datasets     label   1 Demographics     <NA>
#> 4  DD0021   High variables  variable   1          AGE     <NA>
#> 5  DD0022   High  datasets     label   1 Demographics     <NA>
#> 6  DD0028   High variables data_type   1      integer     <NA>
#>                                                                                                                               message
#> 1                                                   Dataset name is missing. Each row in the Datasets sheet must have a dataset name.
#> 2                                                   Dataset name is missing. Each row in the Datasets sheet must have a dataset name.
#> 3                                   Dataset label is missing. Description is required for all ItemGroupDef in regulatory submissions.
#> 4                                                Variable name is missing. Each row in the Variables sheet must have a variable name.
#> 5 Variable label is missing. Description is required for all ItemDef corresponding to Variable definitions in regulatory submissions.
#> 6                    Text variable length exceeds 200 characters. SAS v5 Transport files restrict variable lengths to 200 characters.
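
To make the idea of a spec-level rule concrete, here is a minimal base-R sketch of a missing-label check that returns its findings as a data frame. `find_missing_labels()` is hypothetical and is not how herald implements its DD rules; the column names in the result are chosen for illustration only.

```r
# Hypothetical check, not herald's DD-rule implementation: flag rows of a
# variable-level data frame whose label is NA or blank.
find_missing_labels <- function(var_spec) {
  bad <- is.na(var_spec$label) | !nzchar(trimws(var_spec$label))
  data.frame(
    dataset  = var_spec$dataset[bad],
    variable = var_spec$variable[bad],
    row      = which(bad),
    message  = rep("Variable label is missing.", sum(bad)),
    stringsAsFactors = FALSE
  )
}

var_spec <- data.frame(
  dataset  = c("DM", "DM", "DM"),
  variable = c("AGE", "SEX", "RACE"),
  label    = c(NA, "Sex", "  "),   # one NA label, one whitespace-only label
  stringsAsFactors = FALSE
)

find_missing_labels(var_spec)$variable
#> [1] "AGE"  "RACE"
```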

Before vs After

Task	metacore	herald
Create spec object	`metacore::metacore(ds_spec, var_spec, value_spec, ...)`: 6+ separate data frames with strict R6 class requirements	`herald_spec(ds_spec, var_spec, ...)`: plain data frames, no R6 ceremony
Read P21 Excel	`metacore::spec_to_metacore("spec.xlsx")`	`read_spec("spec.xlsx")`
Access variable labels	`metacore$var_spec %>% filter(dataset == "DM") %>% pull(label)`	`spec_vars(spec, "DM")$label`
Check codelist	`metacore$codelist %>% filter(code_id == "SEX")`	`spec_codelist(spec, "SEX")`
Write to JSON	(not available)	`write_spec(spec, "spec.json")`
Validate spec	(not available)	`validate_spec(spec)`