herald’s validation engine checks clinical datasets at two levels:
built-in spec conformance checks that run from a
herald_spec alone, and authority-specific conformance rules
(CDISC CORE, FDA, PMDA) that check the regulatory requirements for each
standard.
Two layers of validation
Layer 1 — Spec checks (HRL-* rules)
├── HRL-VAR: Variable presence and naming
├── HRL-LBL: Label match between data and spec
├── HRL-TYP: Type conformance (char vs numeric)
├── HRL-LEN: Length compliance
├── HRL-DS: Dataset label match
├── HRL-CL: Codelist membership
├── HRL-KEY: Key uniqueness
└── HRL-CON: Cross-dataset consistency
Layer 2 — Conformance rules (CORE/FDA/PMDA)
├── CORE-*: CDISC CORE rules (spec-agnostic)
├── FDAB*: FDA business rules
├── FDAV-SD*: FDA SDTM validator rules
└── PMDA*: PMDA-specific rules
Layer 1 always runs when a spec is provided. Layer 2 requires either
bundled rules or a prior fetch_herald_rules() call.
Basic validation — spec checks only
# Build a spec and dataset with deliberate errors
spec <- herald_spec(
ds_spec = data.frame(
dataset = "DM",
label = "Demographics",
keys = "STUDYID, USUBJID",
stringsAsFactors = FALSE
),
var_spec = data.frame(
dataset = c("DM","DM","DM","DM","DM"),
variable = c("STUDYID","USUBJID","AGE","SEX","RACE"),
label = c("Study Identifier","Unique Subject Identifier",
"Age","Sex","Race"),
data_type = c("text","text","integer","text","text"),
length = c(12L,11L,8L,1L,200L),
stringsAsFactors = FALSE
)
)
# DM with three deliberate errors:
# 1. AGE is character (spec says integer)
# 2. RACE is missing
# 3. USUBJID label is wrong
dm <- data.frame(
STUDYID = rep("CDISCPILOT01", 3L),
USUBJID = c("01-701-1015","01-701-1023","01-701-1028"),
AGE = c("63","64","71"), # ERROR: character, spec says integer
SEX = c("F","M","M"),
stringsAsFactors = FALSE
)
attr(dm$USUBJID, "label") <- "Wrong Label" # ERROR: mismatch with spec
dir <- tempfile()
dir.create(dir)
write_xpt(dm, file.path(dir, "dm.xpt"))
result <- validate(dir, spec = spec, rules = NULL)
result
#>
#> ── herald validation ──
#>
#> Datasets checked: 1
#> ℹ Spec checks only -- no conformance rules evaluated
#> Findings: 1 reject, 1 high, 5 medium, 0 low
#>
#> ── Reject / High Impact
#> ✖ [HRL-VAR-001] [High] DM.RACE: Variable specified but not found in data
#> ✖ [HRL-TYP-001] [Reject] DM.AGE: Type mismatch: spec expects 'integer' but column is character
Understanding the findings
result$findings[, c("rule_id", "impact", "dataset", "variable", "message")]
#> rule_id impact dataset variable
#> 1 HRL-VAR-001 High DM RACE
#> 2 HRL-LBL-001 Medium DM STUDYID
#> 3 HRL-LBL-001 Medium DM USUBJID
#> 4 HRL-LBL-001 Medium DM AGE
#> 5 HRL-LBL-001 Medium DM SEX
#> 6 HRL-TYP-001 Reject DM AGE
#> 7 HRL-DS-001 Medium DM
#> message
#> 1 Variable specified but not found in data
#> 2 Label mismatch: expected 'Study Identifier', got ''
#> 3 Label mismatch: expected 'Unique Subject Identifier', got 'Wrong Label'
#> 4 Label mismatch: expected 'Age', got ''
#> 5 Label mismatch: expected 'Sex', got ''
#> 6 Type mismatch: spec expects 'integer' but column is character
#> 7 Dataset label mismatch: expected 'Demographics', got ''
result$summary
#> $reject
#> [1] 1
#>
#> $high
#> [1] 1
#>
#> $medium
#> [1] 5
#>
#> $low
#> [1] 0
#>
#> $total
#> [1] 7
Impact levels:
| Level | Meaning | Typical action |
|---|---|---|
| Reject | Structural failure (wrong type, missing required variable) | Must fix before submission |
| High | Significant non-conformance | Fix before submission |
| Medium | Possible issue | Investigate |
| Low | Minor deviation | Document or fix |
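The impact levels lend themselves to a simple gate in a script or CI job. The sketch below is a minimal illustration and not part of herald: it works on a hand-built data frame that mimics the impact column of result$findings, and both the helper name count_by_impact() and the choice of blocking levels are assumptions.

```r
# Mock findings mirroring the `impact` column shown above (illustrative data only).
findings <- data.frame(
  rule_id = c("HRL-VAR-001", "HRL-LBL-001", "HRL-TYP-001"),
  impact  = c("High", "Medium", "Reject"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count findings per impact level, keeping empty levels.
count_by_impact <- function(findings,
                            levels = c("Reject", "High", "Medium", "Low")) {
  counts <- table(factor(findings$impact, levels = levels))
  stats::setNames(as.integer(counts), levels)
}

counts <- count_by_impact(findings)

# Assumed policy: Reject and High block a submission; Medium and Low do not.
blocking <- sum(counts[c("Reject", "High")])
gate_passed <- blocking == 0L

if (!gate_passed) {
  message(blocking, " blocking finding(s); fix before submission")
}
```

In a CI job the same check could end with stop() instead of message() so that the job fails when blocking findings remain.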
Filtering findings
# Only Reject and High impact
critical <- result$findings[result$findings$impact %in% c("Reject","High"), ]
critical[, c("rule_id", "impact", "variable", "message")]
#> rule_id impact variable
#> 1 HRL-VAR-001 High RACE
#> 6 HRL-TYP-001 Reject AGE
#> message
#> 1 Variable specified but not found in data
#> 6 Type mismatch: spec expects 'integer' but column is character
# Only a specific dataset
dm_findings <- result$findings[result$findings$dataset == "DM", ]
nrow(dm_findings)
#> [1] 7
Running with conformance rules
Use the rules or config argument of validate() to add FDA, PMDA, or CDISC
conformance rules.
# Bundled herald rules
result2 <- validate(dir, spec = spec, rules = "all")
#> ⠙ Evaluating rules [3772/3865] 98%
#> ⠙ Evaluating rules [3865/3865] 100%
#>
result2$summary
#> $reject
#> [1] 1
#>
#> $high
#> [1] 10
#>
#> $medium
#> [1] 5
#>
#> $low
#> [1] 0
#>
#> $total
#> [1] 16
# Or use a specific pre-built config
result3 <- validate(dir, spec = spec, config = "fda-sdtm-ig-3.3")
result3$summary
#> $reject
#> [1] 1
#>
#> $high
#> [1] 4
#>
#> $medium
#> [1] 5
#>
#> $low
#> [1] 0
#>
#> $total
#> [1] 10
Rules shortcut table
| rules = | What runs |
|---|---|
| NULL | Spec checks only (HRL-*) |
| "fda" | FDA business + validator rules |
| "pmda" | PMDA-specific rules |
| "core" | CDISC CORE rules |
| "all" (default in submit()) | All available rule sets |
Config-based profiles
Configs are pre-built profiles that combine rule sets for a specific
standard + version + authority. Auto-selection happens when
standard and version match a bundled
config:
# With standard and version supplied, auto-selection picks the matching config
result4 <- validate(dir, spec = spec,
standard = "sdtmig", version = "3.3")
#> ℹ Auto-selected config: "fda-sdtm-ig-3.3"
result4$summary
#> $reject
#> [1] 1
#>
#> $high
#> [1] 4
#>
#> $medium
#> [1] 5
#>
#> $low
#> [1] 0
#>
#> $total
#> [1] 10
Browsing available rules
catalog <- rule_catalog()
catalog
#>
#> ── herald rule catalog ──
#>
#> Built-in: 10554 rules (3 sets)
#> `fda_rules()()` 2824 FDA
#> `adam_rules()()` 3865 CDISC-ADaM
#> `pmda_rules()()` 3865 PMDA
#>
#> P21 Community: not configured (set `options(herald.p21_rules_path = "...")`)
#>
#> Herald rules: cached but no rules found
# Convert to data frame for filtering
cat_df <- as.data.frame(catalog)
nrow(cat_df)
#> [1] 13
names(cat_df)
#> [1] "source" "version" "authority" "standard" "set" "count"
#> [7] "path"
# Load a specific P21 pre-built configuration (requires P21 rules path)
# options(herald.p21_rules_path = "/path/to/p21/rules")
cfg <- rule_config(version = "3.3", authority = "fda", standard = "sdtmig")
cfg
Fetching and updating rules
By default, herald uses rules bundled in the package installation. To get the latest rules from GitHub:
# Fetch latest herald + CDISC CORE rules (requires network)
fetch_herald_rules()
# Update only CDISC CORE rules
update_core_rules()
# Fetch CDISC CORE rules from the CDISC Library API (requires API key)
fetch_core_rules()
# See where rules are cached
herald_rules_cache_dir()
Custom operators
Register custom validation operators to extend the rule engine with
organization-specific checks. The optional description is
shown in rule_catalog() output and documents what the
operator checks.
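An operator function can be tried on a plain vector before it is registered. The sketch below assumes, based on the operator description used in this section, that an operator returns TRUE for each violating value. Treating NA as a violation is an added assumption for illustration, not documented herald behavior, and starts_with_violation() is a hypothetical name.

```r
# Candidate operator body: TRUE marks a violation (value lacks the prefix).
# Counting NA as a violation is an assumption made for this sketch.
starts_with_violation <- function(col, prefix) {
  x <- as.character(col)
  is.na(x) | !startsWith(x, prefix)
}

usubjid <- c("01-701-1015", "02-701-1023", NA)
violations <- starts_with_violation(usubjid, "01-")
violations
```

Here the second value (wrong prefix) and the third (missing) are flagged, while the first passes. Without the is.na() guard, a missing value would yield NA rather than a clear TRUE or FALSE.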
# Register a custom "starts_with" operator
register_operator(
name = "starts_with",
fn = function(col, prefix) !startsWith(as.character(col), prefix),
description = "Violation when column values do not start with the given prefix"
)
# Now YAML rules can reference: operator: starts_with
Validation reports
validation_report() exports findings to HTML and
Excel:
if (requireNamespace("openxlsx2", quietly = TRUE)) {
xlsx_path <- tempfile(fileext = ".xlsx")
validation_report(result, xlsx_path)
file.exists(xlsx_path)
}
#> ✔ Wrote validation report to /tmp/RtmpqNe3G5/file467a50b78db5.xlsx
#> [1] TRUE
if (requireNamespace("htmltools", quietly = TRUE)) {
html_out <- tempfile(fileext = ".html")
validation_report(result, html_out)
file.exists(html_out)
}
#> ✔ Wrote validation report to /tmp/RtmpqNe3G5/file467a1557f99e.html
#> [1] TRUE
Verifying an HTML report
verify_html_report() checks that a generated HTML report
is structurally sound — useful for automated testing or CI
pipelines.
if (requireNamespace("htmltools", quietly = TRUE)) {
# verify_html_report() takes the herald_validation object and re-checks structure
verify_html_report(result) # returns TRUE invisibly on success
}
Interactive HTML preview
In interactive R sessions (RStudio, Positron), printing a
herald_validation object automatically opens an HTML
summary in the IDE Viewer pane — no extra call required:
result # prints to console AND opens Viewer in interactive sessions
Use validation_report() to save the report to a file.
Validating from a single file
validate() also accepts a single XPT file path instead
of a directory:
single_xpt <- file.path(tempdir(), "dm.xpt")
write_xpt(dm, single_xpt)
result5 <- validate(single_xpt, spec = spec, rules = NULL)
result5$summary
#> $reject
#> [1] 1
#>
#> $high
#> [1] 1
#>
#> $medium
#> [1] 5
#>
#> $low
#> [1] 0
#>
#> $total
#> [1] 7
Note: Single-file validation runs spec checks and single-dataset rules only. Cross-dataset rules (e.g., checking that ADAE subjects exist in ADSL) require both datasets to be present. Pass a directory containing all related datasets, or use the files parameter to load specific files from different locations (see below).
Validating specific files across directories
Use files to select exactly which datasets to validate —
useful when related datasets (e.g., ADAE + ADSL) live in different
locations or when you want to validate a subset without moving
files:
# Unnamed vector — dataset names inferred from file basenames
result6 <- validate(
files = c("/adam_outputs/adae.xpt", "/shared/adsl.xpt"),
spec = spec
)
# Named list — explicit dataset names
result7 <- validate(
files = list(
ADAE = "/adam_outputs/adae.xpt",
ADSL = "/shared/adsl.xpt"
),
spec = spec
)
With two or more datasets, anchor auto-detection runs and
cross-dataset rules fire normally. This is equivalent to placing both
files in a single directory and calling
validate(path = dir, datasets = c("ADAE", "ADSL")).
Filtering a directory to specific datasets also works (case-insensitive):
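The matching idea can be sketched in base R. This is only an illustration of case-insensitive name matching, under the assumption that dataset names are the upper-cased file basenames; it is not herald's implementation, and match_datasets() is a hypothetical helper.

```r
# Hypothetical helper: pick the files whose basename matches a requested
# dataset name, ignoring case. Not herald's actual implementation.
match_datasets <- function(paths, datasets) {
  # "adae.xpt" -> "ADAE"
  inferred <- toupper(tools::file_path_sans_ext(basename(paths)))
  keep <- inferred %in% toupper(datasets)
  stats::setNames(paths[keep], inferred[keep])
}

paths <- c("/data/adae.xpt", "/data/adsl.xpt", "/data/advs.xpt")
selected <- match_datasets(paths, datasets = c("adae", "Adsl"))
```

The result keeps only the ADAE and ADSL paths, named by their inferred dataset names, regardless of how the requested names were capitalized.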
herald_context
new_herald_context() creates a context object capturing
the standard, version, and optionally a CT version or Define-XML path —
used when running rules against a known IG.
Note: authority (FDA vs PMDA) is not part of the
context. It lives in the config = or rules =
parameter of validate(). The same SDTMIG 3.3 datasets can
be checked against FDA rules or PMDA rules independently.
ctx <- new_herald_context(standard = "sdtmig", version = "3.3")
ctx
#> <herald_context>
#> Standard: sdtmig
#> Version: 3.3
Built-in rule ID prefixes
| Prefix | Source | Example |
|---|---|---|
| HRL-VAR-* | herald (variable presence) | HRL-VAR-001 |
| HRL-LBL-* | herald (label checks) | HRL-LBL-001 |
| HRL-TYP-* | herald (type checks) | HRL-TYP-001 |
| HRL-LEN-* | herald (length checks) | HRL-LEN-001 |
| HRL-DS-* | herald (dataset label) | HRL-DS-001 |
| HRL-CL-* | herald (codelist) | HRL-CL-001 |
| HRL-KEY-* | herald (key uniqueness) | HRL-KEY-001 |
| HRL-CON-* | herald (cross-dataset) | HRL-CON-001 |
| HRL-SD-* | herald SDTM gap-fill | HRL-SD-001 |
| HRL-AD-* | herald ADaM gap-fill | HRL-AD-001 |
| CORE-* | CDISC CORE | CORE-000001 |
| ADaM-* | CDISC ADaM rules | ADaM-005 |
| CG* | CDISC general (cross-standard) | CG0001 |
| SD* | CDISC SDTM | SD0001 |
| FDAB* | FDA business rules | FDAB001 |
| FDAV-SD* | FDA SDTM validator | FDAV-SD001 |
| AD* | PMDA ADaM | AD0241B |
| OD* | PMDA other datasets | OD0071 |
| DD* | Define-XML | DD0001 |
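Some prefixes overlap (AD* also matches IDs that begin with ADaM-), so grouping findings by source needs a longest-match lookup. The sketch below shows one way to do that with a subset of the prefixes from the table; rule_source() is a hypothetical helper, not a herald function.

```r
# Prefix -> source lookup, taken from the table above (subset for brevity).
prefix_source <- c(
  "HRL-"    = "herald",
  "CORE-"   = "CDISC CORE",
  "ADaM-"   = "CDISC ADaM rules",
  "AD"      = "PMDA ADaM",
  "SD"      = "CDISC SDTM",
  "FDAB"    = "FDA business rules",
  "FDAV-SD" = "FDA SDTM validator"
)

# Hypothetical helper: return the source of the longest matching prefix,
# or NA when no prefix matches.
rule_source <- function(rule_id, lookup = prefix_source) {
  # Try longer prefixes first so "ADaM-" wins over "AD".
  prefixes <- names(lookup)[order(nchar(names(lookup)), decreasing = TRUE)]
  vapply(rule_id, function(id) {
    hit <- prefixes[startsWith(id, prefixes)]
    if (length(hit) == 0L) NA_character_ else unname(lookup[hit[1]])
  }, character(1), USE.NAMES = FALSE)
}

sources <- rule_source(c("HRL-TYP-001", "ADaM-005", "AD0241B", "FDAV-SD001", "XX1"))
```

Applied to a findings data frame, the same helper could add a source column, for example via rule_source(result$findings$rule_id), assuming rule_id holds IDs in the formats listed above.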
Custom controlled terminology
Organizations that maintain their own terminology (in NCI EVS Excel format) can layer it on top of the bundled CDISC CT. Company terms for the same codelist code take precedence; CDISC terms fill the rest.
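The precedence rule can be pictured as a keyed merge. The sketch below encodes one interpretation of it (a codelist defined in the custom file replaces the bundled codelist with the same code, and all other codelists come from the bundled CT) on toy data frames. It is not herald's internal merge, and the column names, codes, and terms are illustrative.

```r
# Toy CT tables; the codes and terms are illustrative, not real CDISC content.
cdisc_ct <- data.frame(
  codelist_code = c("C00001", "C00001", "C00002"),
  term          = c("A", "B", "X"),
  stringsAsFactors = FALSE
)
custom_ct <- data.frame(
  codelist_code = c("C00001", "C00001"),
  term          = c("A", "Z"),
  stringsAsFactors = FALSE
)

# Assumed semantics: any codelist present in `custom` replaces the bundled
# codelist with the same code; remaining codelists come from `bundled`.
merge_ct <- function(custom, bundled) {
  rest <- bundled[!bundled$codelist_code %in% custom$codelist_code, , drop = FALSE]
  merged <- rbind(custom, rest)
  rownames(merged) <- NULL
  merged
}

merged_ct <- merge_ct(custom_ct, cdisc_ct)
```

In this toy case codelist C00001 ends up with the custom terms A and Z, and C00002 keeps its bundled term X.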
register_ct() registers a CT file for the current R
session. All subsequent validate() and
submit() calls pick it up automatically:
# Register custom SDTM terminology (Excel, NCI EVS column layout)
register_ct("org-sdtm", "path/to/Custom_SDTM_Terminology.xlsx")
# Register custom ADaM terminology
register_ct("org-adam", "path/to/Custom_ADaM_Terminology.xlsx")
# Validate — registered CT is merged with bundled CDISC CT automatically
result <- validate(dir, spec = spec, rules = "all")
# One-off: ct_path applies CT only for this call (no session-wide registration)
result2 <- validate(dir, spec = spec, ct_path = "path/to/Custom_CT.xlsx")
# Inspect what is registered
list_ct()
# Remove all custom CT from the session
clear_ct()
The Excel file must follow the NCI EVS layout (same column structure as files downloaded from the NCI EVS browser or CDISC Library):
| Column | Description |
|---|---|
| Codelist Code | NCI concept code, e.g. C66731 |
| Codelist Name | Human-readable codelist name |
| CDISC Submission Value | Submission term (used in data) |
| CDISC Definition | Term definition |
| Codelist Extensible (Yes/No) | Whether the codelist allows custom terms |
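Before registering a file, it can help to confirm the column headers. The sketch below checks a data frame against the column names listed in the table; check_ct_columns() is a hypothetical helper, and reading the workbook itself (with an Excel reader of your choice) is left out.

```r
# Column names as listed in the table above.
required_cols <- c(
  "Codelist Code",
  "Codelist Name",
  "CDISC Submission Value",
  "CDISC Definition",
  "Codelist Extensible (Yes/No)"
)

# Hypothetical helper: return the required columns missing from `ct`.
check_ct_columns <- function(ct, required = required_cols) {
  setdiff(required, names(ct))
}

# Toy sheet with two columns left out; check.names = FALSE keeps the
# spaces in the column names.
ct <- data.frame(
  "Codelist Code" = "C66731",
  "Codelist Name" = "Example codelist",
  "CDISC Submission Value" = "F",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

missing_cols <- check_ct_columns(ct)
```

An empty result means every listed column is present; here the helper reports the two columns the toy sheet lacks.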
Before vs After
| Task | Old way | herald |
|---|---|---|
| Validate datasets | Pinnacle 21 Community (Java, GUI) | validate(dir, spec = spec) |
| FDA SDTM rules | P21 Enterprise license | validate(dir, config = "fda-sdtm-ig-3.3") |
| PMDA rules | P21 Enterprise + PMDA configuration | validate(dir, rules = "pmda") |
| HTML report | P21 GUI export | validation_report(result, "report.html") |
| Excel report | P21 GUI export | validation_report(result, "report.xlsx") |
| Rule browsing | P21 GUI only | rule_catalog() |
| Custom rules | Not available | register_operator() |
What to read next
- vignette("spec-management"): building specs for validation
- vignette("submission-workflow"): submit() runs validation automatically
- vignette("define-xml"): validating Define-XML with validate_spec_define()
