
herald’s validation engine checks clinical datasets at two levels: built-in spec conformance checks that run from a herald_spec alone, and authority-specific conformance rules (CDISC CORE, FDA, PMDA) that check the regulatory requirements for each standard.

Two layers of validation

Layer 1 — Spec checks (HRL-* rules)
  ├── HRL-VAR: Variable presence and naming
  ├── HRL-LBL: Label match between data and spec
  ├── HRL-TYP: Type conformance (char vs numeric)
  ├── HRL-LEN: Length compliance
  ├── HRL-DS:  Dataset label match
  ├── HRL-CL:  Codelist membership
  ├── HRL-KEY: Key uniqueness
  └── HRL-CON: Cross-dataset consistency

Layer 2 — Conformance rules (CORE/FDA/PMDA)
  ├── CORE-*:  CDISC CORE rules (spec-agnostic)
  ├── FDAB*:   FDA business rules
  ├── FDAV-SD*: FDA SDTM validator rules
  └── PMDA*:   PMDA-specific rules

Layer 1 always runs when a spec is provided. Layer 2 requires either bundled rules or a prior fetch_herald_rules() call.

Basic validation — spec checks only

# Build a spec and dataset with deliberate errors
spec <- herald_spec(
  ds_spec = data.frame(
    dataset = "DM",
    label   = "Demographics",
    keys    = "STUDYID, USUBJID",
    stringsAsFactors = FALSE
  ),
  var_spec = data.frame(
    dataset   = c("DM","DM","DM","DM","DM"),
    variable  = c("STUDYID","USUBJID","AGE","SEX","RACE"),
    label     = c("Study Identifier","Unique Subject Identifier",
                  "Age","Sex","Race"),
    data_type = c("text","text","integer","text","text"),
    length    = c(12L,11L,8L,1L,200L),
    stringsAsFactors = FALSE
  )
)

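# DM with three deliberate errors:
# 1. AGE is character (spec says integer)
# 2. RACE is missing
# 3. USUBJID label is wrong
# The remaining columns and the dataset itself carry no labels at all,
# so they also produce label findings in the output below.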
dm <- data.frame(
  STUDYID = rep("CDISCPILOT01", 3L),
  USUBJID = c("01-701-1015","01-701-1023","01-701-1028"),
  AGE     = c("63","64","71"),   # ERROR: character, spec says integer
  SEX     = c("F","M","M"),
  stringsAsFactors = FALSE
)
attr(dm$USUBJID, "label") <- "Wrong Label"  # ERROR: mismatch with spec
dir <- tempfile()
dir.create(dir)

write_xpt(dm, file.path(dir, "dm.xpt"))

result <- validate(dir, spec = spec, rules = NULL)
result
#> 
#> ── herald validation ──
#> 
#> Datasets checked: 1
#>  Spec checks only -- no conformance rules evaluated
#> Findings: 1 reject, 1 high, 5 medium, 0 low
#> 
#> ── Reject / High Impact
#>  [HRL-VAR-001] [High] DM.RACE: Variable specified but not found in data
#>  [HRL-TYP-001] [Reject] DM.AGE: Type mismatch: spec expects 'integer' but column is character

Understanding the findings

result$findings[, c("rule_id", "impact", "dataset", "variable", "message")]
#>       rule_id impact dataset variable
#> 1 HRL-VAR-001   High      DM     RACE
#> 2 HRL-LBL-001 Medium      DM  STUDYID
#> 3 HRL-LBL-001 Medium      DM  USUBJID
#> 4 HRL-LBL-001 Medium      DM      AGE
#> 5 HRL-LBL-001 Medium      DM      SEX
#> 6 HRL-TYP-001 Reject      DM      AGE
#> 7  HRL-DS-001 Medium      DM         
#>                                                                   message
#> 1                                Variable specified but not found in data
#> 2                     Label mismatch: expected 'Study Identifier', got ''
#> 3 Label mismatch: expected 'Unique Subject Identifier', got 'Wrong Label'
#> 4                                  Label mismatch: expected 'Age', got ''
#> 5                                  Label mismatch: expected 'Sex', got ''
#> 6           Type mismatch: spec expects 'integer' but column is character
#> 7                 Dataset label mismatch: expected 'Demographics', got ''
result$summary
#> $reject
#> [1] 1
#> 
#> $high
#> [1] 1
#> 
#> $medium
#> [1] 5
#> 
#> $low
#> [1] 0
#> 
#> $total
#> [1] 7
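
The counts in result$summary can be reproduced from the impact column of the findings table. The sketch below rebuilds the seven findings shown above as a plain data frame (a stand-in for result$findings, so it runs without herald) and tallies them. The names findings, impact_levels, and counts are illustrative only.

```r
# Stand-in for result$findings, copied from the printed output above
findings <- data.frame(
  rule_id  = c("HRL-VAR-001", rep("HRL-LBL-001", 4), "HRL-TYP-001", "HRL-DS-001"),
  impact   = c("High", rep("Medium", 4), "Reject", "Medium"),
  dataset  = "DM",
  variable = c("RACE", "STUDYID", "USUBJID", "AGE", "SEX", "AGE", ""),
  stringsAsFactors = FALSE
)

# Tally by impact; explicit factor levels keep zero-count levels such as Low
impact_levels <- c("Reject", "High", "Medium", "Low")
counts <- table(factor(findings$impact, levels = impact_levels))
counts
sum(counts)
```

Using a factor with fixed levels means a level with no findings still appears with a count of zero, which matches the shape of the summary list.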

Impact levels:

Level   Meaning                                                     Typical action
Reject  Structural failure (wrong type, missing required variable)  Must fix before submission
High    Significant non-conformance                                 Fix before submission
Medium  Possible issue                                              Investigate
Low     Minor deviation                                             Document or fix
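
One way to act on these levels in a script is to stop when any Reject or High findings remain. This is a minimal sketch, assuming a list shaped like result$summary. has_blocking_findings() is a hypothetical helper, not part of herald, and the values in clean are made up.

```r
# Hypothetical helper: TRUE when Reject or High findings are present
has_blocking_findings <- function(summary) {
  (summary$reject + summary$high) > 0
}

# Stand-in for result$summary from the spec-only run above
spec_only <- list(reject = 1, high = 1, medium = 5, low = 0, total = 7)

# Made-up summary with nothing that blocks submission
clean <- list(reject = 0, high = 0, medium = 2, low = 1, total = 3)

has_blocking_findings(spec_only)
has_blocking_findings(clean)
```

In a CI job, the same check could wrap a call to stop() so that the pipeline fails while blocking findings remain.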

Filtering findings

# Only Reject and High impact
critical <- result$findings[result$findings$impact %in% c("Reject","High"), ]
critical[, c("rule_id", "impact", "variable", "message")]
#>       rule_id impact variable
#> 1 HRL-VAR-001   High     RACE
#> 6 HRL-TYP-001 Reject      AGE
#>                                                         message
#> 1                      Variable specified but not found in data
#> 6 Type mismatch: spec expects 'integer' but column is character

# Only a specific dataset
dm_findings <- result$findings[result$findings$dataset == "DM", ]
nrow(dm_findings)
#> [1] 7
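
Beyond subsetting, it can help to sort findings by severity and to see which rules fire most often. The sketch below uses a small stand-in for result$findings with the rule IDs and impacts from the output above. It uses only base R, and the variable names are illustrative.

```r
# Stand-in for result$findings (rule_id and impact columns only)
findings <- data.frame(
  rule_id = c("HRL-VAR-001", rep("HRL-LBL-001", 4), "HRL-TYP-001", "HRL-DS-001"),
  impact  = c("High", rep("Medium", 4), "Reject", "Medium"),
  stringsAsFactors = FALSE
)

# Order rows from most to least severe using an explicit level order
severity <- factor(findings$impact, levels = c("Reject", "High", "Medium", "Low"))
sorted <- findings[order(severity), ]
sorted$rule_id[1:2]

# Findings per rule, most frequent first
per_rule <- sort(table(findings$rule_id), decreasing = TRUE)
names(per_rule)[1]
```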

Running with conformance rules

Pass rules or config to validate() to add FDA, PMDA, or CDISC conformance rules on top of the spec checks.

# Bundled herald rules
result2 <- validate(dir, spec = spec, rules = "all")
#> ⠙ Evaluating rules [3865/3865] 100%
#> 
result2$summary
#> $reject
#> [1] 1
#> 
#> $high
#> [1] 10
#> 
#> $medium
#> [1] 5
#> 
#> $low
#> [1] 0
#> 
#> $total
#> [1] 16

# Or use a specific pre-built config
result3 <- validate(dir, spec = spec, config = "fda-sdtm-ig-3.3")
result3$summary
#> $reject
#> [1] 1
#> 
#> $high
#> [1] 4
#> 
#> $medium
#> [1] 5
#> 
#> $low
#> [1] 0
#> 
#> $total
#> [1] 10
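
To see what each rule selection adds, the summaries can be lined up side by side. The sketch below copies the three summaries printed above into plain lists (stand-ins for result$summary, result2$summary, and result3$summary) and binds them into one data frame. The names runs, comparison, and extra are illustrative.

```r
# Summaries copied from the three runs above
runs <- list(
  spec_only  = list(reject = 1, high = 1,  medium = 5, low = 0, total = 7),
  all_rules  = list(reject = 1, high = 10, medium = 5, low = 0, total = 16),
  fda_config = list(reject = 1, high = 4,  medium = 5, low = 0, total = 10)
)

# One row per run, with the run name as an ordinary column
comparison <- data.frame(
  run = names(runs),
  do.call(rbind, lapply(runs, as.data.frame)),
  row.names = NULL
)
comparison

# Findings added by conformance rules on top of the spec checks
extra <- comparison$total - comparison$total[comparison$run == "spec_only"]
extra
```

In this example the additional findings all land in the High column: the reject, medium, and low counts are identical across the three runs.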

Rules shortcut table

rules =   What runs
NULL      Spec checks only (HRL-*)
"fda"     FDA business + validator rules
"pmda"    PMDA-specific rules
"core"    CDISC CORE rules
"all"     All available rule sets (default in submit())

Config-based profiles

Configs are pre-built profiles that combine rule sets for a specific standard + version + authority. Auto-selection happens when standard and version match a bundled config:

# With standard and version supplied, auto-selection picks the matching config
result4 <- validate(dir, spec = spec,
                    standard = "sdtmig", version = "3.3")
#>  Auto-selected config: "fda-sdtm-ig-3.3"
result4$summary
#> $reject
#> [1] 1
#> 
#> $high
#> [1] 4
#> 
#> $medium
#> [1] 5
#> 
#> $low
#> [1] 0
#> 
#> $total
#> [1] 10

Browsing available rules

catalog <- rule_catalog()
catalog
#> 
#> ── herald rule catalog ──
#> 
#> Built-in: 10554 rules (3 sets)
#> `fda_rules()()` 2824 FDA
#> `adam_rules()()` 3865 CDISC-ADaM
#> `pmda_rules()()` 3865 PMDA
#> 
#> P21 Community: not configured (set `options(herald.p21_rules_path = "...")`)
#> 
#> Herald rules: cached but no rules found

# Convert to data frame for filtering
cat_df <- as.data.frame(catalog)
nrow(cat_df)
#> [1] 13
names(cat_df)
#> [1] "source"    "version"   "authority" "standard"  "set"       "count"    
#> [7] "path"
# Load a specific P21 pre-built configuration (requires P21 rules path)
# options(herald.p21_rules_path = "/path/to/p21/rules")
cfg <- rule_config(version = "3.3", authority = "fda", standard = "sdtmig")
cfg
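
Once the catalog is a data frame, ordinary base R subsetting applies. The sketch below uses a three-row stand-in built from the printed catalog above. The real cat_df has 13 rows and the seven columns listed earlier, and the exact values stored in its authority column are an assumption here.

```r
# Stand-in built from the printed catalog; not the real as.data.frame(catalog)
cat_df <- data.frame(
  authority = c("FDA", "CDISC-ADaM", "PMDA"),   # assumed spelling of the values
  count     = c(2824L, 3865L, 3865L),
  stringsAsFactors = FALSE
)

# The built-in sets add up to the "Built-in: 10554 rules" header
sum(cat_df$count)

# Keep only one authority's rows
fda_only <- cat_df[cat_df$authority == "FDA", ]
fda_only
```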

Fetching and updating rules

By default, herald uses rules bundled in the package installation. To get the latest rules from GitHub:

# Fetch latest herald + CDISC CORE rules (requires network)
fetch_herald_rules()

# Update only CDISC CORE rules
update_core_rules()

# Fetch CDISC CORE rules from the CDISC Library API (requires API key)
fetch_core_rules()

# See where rules are cached
herald_rules_cache_dir()

Custom operators

Register custom validation operators to extend the rule engine with organization-specific checks. The optional description is shown in rule_catalog() output and documents what the operator checks.

# Register a custom "starts_with" operator
register_operator(
  name        = "starts_with",
  fn          = function(col, prefix) !startsWith(as.character(col), prefix),
  description = "Violation when column values do not start with the given prefix"
)

# Now YAML rules can reference: operator: starts_with
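
Before registering an operator, it is worth running its function on a small vector to confirm which values count as violations. The sketch below reuses the function body from the example above. starts_with_strict() is a hypothetical variant, and how herald itself treats an NA result is not stated here, so the stricter handling is an assumption about what a rule author might want.

```r
# Same function body as the operator registered above
starts_with_violation <- function(col, prefix) {
  !startsWith(as.character(col), prefix)
}

ids <- c("01-701-1015", "02-701-1023", NA)

# FALSE = conforms, TRUE = violation, NA = missing input
starts_with_violation(ids, "01-")

# Hypothetical variant that flags missing values as violations
starts_with_strict <- function(col, prefix) {
  ok <- startsWith(as.character(col), prefix)
  is.na(ok) | !ok
}
starts_with_strict(ids, "01-")
```

The first version propagates NA because startsWith() returns NA for missing input. The second makes the policy explicit instead of leaving it to the caller.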

Validation reports

validation_report() exports findings to HTML or Excel. In the examples below, the output format follows the file extension of the path you pass:

if (requireNamespace("openxlsx2", quietly = TRUE)) {
  xlsx_path <- tempfile(fileext = ".xlsx")

  validation_report(result, xlsx_path)
  file.exists(xlsx_path)
}
#>  Wrote validation report to /tmp/RtmpqNe3G5/file467a50b78db5.xlsx
#> [1] TRUE
if (requireNamespace("htmltools", quietly = TRUE)) {
  html_out <- tempfile(fileext = ".html")
  validation_report(result, html_out)
  file.exists(html_out)
}
#>  Wrote validation report to /tmp/RtmpqNe3G5/file467a1557f99e.html
#> [1] TRUE

Verifying an HTML report

verify_html_report() checks that a generated HTML report is structurally sound — useful for automated testing or CI pipelines.

if (requireNamespace("htmltools", quietly = TRUE)) {
  # verify_html_report() takes the herald_validation object and re-checks structure
  verify_html_report(result)  # returns TRUE invisibly on success
}

Interactive HTML preview

In interactive R sessions (RStudio, Positron), printing a herald_validation object automatically opens an HTML summary in the IDE Viewer pane — no extra call required:

result   # prints to console AND opens Viewer in interactive sessions

Use validation_report() to save the report to a file.

Validating from a single file

validate() also accepts a single XPT file path instead of a directory:

single_xpt <- file.path(tempdir(), "dm.xpt")

write_xpt(dm, single_xpt)
result5 <- validate(single_xpt, spec = spec, rules = NULL)
result5$summary
#> $reject
#> [1] 1
#> 
#> $high
#> [1] 1
#> 
#> $medium
#> [1] 5
#> 
#> $low
#> [1] 0
#> 
#> $total
#> [1] 7

Note: Single-file validation runs spec checks and single-dataset rules only. Cross-dataset rules (e.g., checking that ADAE subjects exist in ADSL) require both datasets to be present. Pass a directory containing all related datasets, or use the files parameter to load specific files from different locations (see below).

Validating specific files across directories

Use files to select exactly which datasets to validate — useful when related datasets (e.g., ADAE + ADSL) live in different locations or when you want to validate a subset without moving files:

# Unnamed vector — dataset names inferred from file basenames
result6 <- validate(
  files = c("/adam_outputs/adae.xpt", "/shared/adsl.xpt"),
  spec  = spec
)

# Named list — explicit dataset names
result7 <- validate(
  files = list(
    ADAE = "/adam_outputs/adae.xpt",
    ADSL = "/shared/adsl.xpt"
  ),
  spec = spec
)

With two or more datasets, anchor auto-detection runs and cross-dataset rules fire normally. This is equivalent to placing both files in a single directory and calling validate(path = dir, datasets = c("ADAE", "ADSL")).
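
To make the relationship between the two forms concrete, the sketch below derives dataset names from file basenames and builds the equivalent named list. It illustrates the idea only and is not herald's actual inference code. In particular, upper-casing the names is an assumption.

```r
files <- c("/adam_outputs/adae.xpt", "/shared/adsl.xpt")

# Drop the directory and extension, then upper-case to get a dataset name
dataset_names <- toupper(tools::file_path_sans_ext(basename(files)))
dataset_names

# The explicit named-list form of the same mapping
named_files <- as.list(stats::setNames(files, dataset_names))
names(named_files)
```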

Filtering a directory to specific datasets also works (case-insensitive):

# Validate only ADAE and ADSL from a larger directory
result8 <- validate(
  path     = "/adam_outputs",
  datasets = c("adae", "adsl"),   # lowercase OK — matched case-insensitively
  spec     = spec
)
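
Case-insensitive matching of this kind can be sketched in base R by comparing upper-cased names. The directory contents below are made up, and the code illustrates the matching idea rather than herald's internals.

```r
available <- c("ADAE", "ADLB", "ADSL", "ADVS")   # made-up directory contents
requested <- c("adae", "adsl")

# Compare on upper-cased names so "adae" matches "ADAE"
selected <- available[toupper(available) %in% toupper(requested)]
selected

# Requested names with no match, worth reporting before validation starts
unmatched <- setdiff(toupper(requested), toupper(available))
unmatched
```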

herald_context

new_herald_context() creates a context object capturing the standard, version, and optionally a CT version or Define-XML path — used when running rules against a known IG.

Note: authority (FDA vs PMDA) is not part of the context. It lives in the config = or rules = parameter of validate(). The same SDTMIG 3.3 datasets can be checked against FDA rules or PMDA rules independently.

ctx <- new_herald_context(standard = "sdtmig", version = "3.3")
ctx
#> <herald_context>
#> Standard: sdtmig
#> Version: 3.3

Built-in rule ID prefixes

Prefix     Source                          Example
HRL-VAR-*  herald (variable presence)      HRL-VAR-001
HRL-LBL-*  herald (label checks)           HRL-LBL-001
HRL-TYP-*  herald (type checks)            HRL-TYP-001
HRL-LEN-*  herald (length checks)          HRL-LEN-001
HRL-DS-*   herald (dataset label)          HRL-DS-001
HRL-CL-*   herald (codelist)               HRL-CL-001
HRL-KEY-*  herald (key uniqueness)         HRL-KEY-001
HRL-CON-*  herald (cross-dataset)          HRL-CON-001
HRL-SD-*   herald SDTM gap-fill            HRL-SD-001
HRL-AD-*   herald ADaM gap-fill            HRL-AD-001
CORE-*     CDISC CORE                      CORE-000001
ADaM-*     CDISC ADaM rules                ADaM-005
CG*        CDISC general (cross-standard)  CG0001
SD*        CDISC SDTM                      SD0001
FDAB*      FDA business rules              FDAB001
FDAV-SD*   FDA SDTM validator              FDAV-SD001
AD*        PMDA ADaM                       AD0241B
OD*        PMDA other datasets             OD0071
DD*        Define-XML                      DD0001
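
The prefixes make it possible to split a findings table by layer. The sketch below tags rule IDs using the eight spec-check families listed at the top of this page. Treating every other prefix, including the HRL-SD-* and HRL-AD-* gap-fill rules, as Layer 2 is an assumption, since the page does not say which layer they belong to.

```r
rule_ids <- c("HRL-VAR-001", "HRL-DS-001", "HRL-SD-001", "CORE-000001", "FDAB001")

# The eight Layer 1 spec-check families
spec_check_pattern <- "^HRL-(VAR|LBL|TYP|LEN|DS|CL|KEY|CON)-"

layer <- ifelse(grepl(spec_check_pattern, rule_ids), "Layer 1", "Layer 2")
data.frame(rule_id = rule_ids, layer = layer)
```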

Custom controlled terminology

Organizations that maintain their own terminology (in NCI EVS Excel format) can layer it on top of the bundled CDISC CT. Company terms for the same codelist code take precedence; CDISC terms fill the rest.

register_ct() registers a CT file for the current R session. All subsequent validate() and submit() calls pick it up automatically:

# Register custom SDTM terminology (Excel, NCI EVS column layout)
register_ct("org-sdtm", "path/to/Custom_SDTM_Terminology.xlsx")

# Register custom ADaM terminology
register_ct("org-adam", "path/to/Custom_ADaM_Terminology.xlsx")

# Validate — registered CT is merged with bundled CDISC CT automatically
result <- validate(dir, spec = spec, rules = "all")

# One-off: ct_path applies CT only for this call (no session-wide registration)
result2 <- validate(dir, spec = spec, ct_path = "path/to/Custom_CT.xlsx")

# Inspect what is registered
list_ct()

# Remove all custom CT from the session
clear_ct()

The Excel file must follow the NCI EVS layout (same column structure as files downloaded from the NCI EVS browser or CDISC Library):

Column                        Description
Codelist Code                 NCI concept code, e.g. C66731
Codelist Name                 Human-readable codelist name
CDISC Submission Value        Submission term (used in data)
CDISC Definition              Term definition
Codelist Extensible (Yes/No)  Whether the codelist allows custom terms
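
The precedence rule described above (company terms win for a codelist code they define, CDISC terms fill the rest) can be sketched with two small data frames. This reads the rule at the codelist level, which is one interpretation, and it is not herald's merge code. The column names are shortened, the rows are illustrative, and C00000 is a placeholder code.

```r
# Illustrative rows only; real files carry the full NCI EVS column set
cdisc <- data.frame(
  codelist_code = c("C66731", "C66731", "C00000"),   # C00000 is a placeholder
  value         = c("F", "M", "TERM-A"),
  stringsAsFactors = FALSE
)
company <- data.frame(
  codelist_code = c("C66731", "C66731", "C66731"),
  value         = c("F", "M", "X"),
  stringsAsFactors = FALSE
)

# Company rows win for any codelist they define; CDISC rows fill in the others
cdisc_rest <- cdisc[!cdisc$codelist_code %in% company$codelist_code, ]
merged <- rbind(company, cdisc_rest)
merged
```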

Before vs After

Task               Old way                              herald
Validate datasets  Pinnacle 21 Community (Java, GUI)    validate(dir, spec = spec)
FDA SDTM rules     P21 Enterprise license               validate(dir, config = "fda-sdtm-ig-3.3")
PMDA rules         P21 Enterprise + PMDA configuration  validate(dir, rules = "pmda")
HTML report        P21 GUI export                       validation_report(result, "report.html")
Excel report       P21 GUI export                       validation_report(result, "report.xlsx")
Rule browsing      P21 GUI only                         rule_catalog()
Custom rules       Not available                        register_operator()