This vignette documents herald’s internal architecture for contributors and power users who need to extend the package or understand how it works at the implementation level.
Six-layer architecture
herald is organized into six layers, each with a clear responsibility:
Layer 5 submit() sub-submit.R, sub-manifest.R, sub-class.R
│
Layer 4 validate() val-engine.R, val-class.R, val-checks.R,
│ val-report.R, val-define.R, val-spec.R,
│ validate-context.R
│
Layer 3 write_define_xml() define-write.R, define-build.R,
│ define-build-arm.R
│
Layer 2 apply_spec() spec-apply.R, spec-metadata.R,
│ spec-accessors.R
│
Layer 1 herald_spec() spec-class.R, spec-read.R, spec-define.R,
│ spec-json.R, spec-write.R
│
Layer 0 read_xpt() xpt-read.R, xpt-write.R, xpt-header.R,
write_xpt() xpt-ieee.R, xpt-encoding.R
read_json() json-io.R
write_json()
You can enter at any layer. Most downstream consumers
(pharmacometrics, TFL pipelines) enter at Layer 2 with
apply_spec(). Tools that only need file conversion enter at
Layer 0.
Design decisions
S7 for spec and result classes, S3 for rules
herald_spec, herald_validation, and
herald_submission are S7 classes — typed
properties, method dispatch via S7::method(), structural
guarantees on construction. The S7 choice gives type safety without the
S4 ceremony.
herald_rule and herald_rule_catalog are
S3 — they are purely informational, no invariants to
enforce, minimal dispatch needed.
Pure R, no compiled code
XPT binary parsing uses readBin() +
writeBin() with pure R logic. IEEE 754 → IBM 370
floating-point conversion is implemented in xpt-ieee.R
using bit manipulation via rawShift() and
bitwAnd(). No Rcpp, no C wrappers.
This is a deliberate GxP design choice: every line of code that touches clinical data can be audited by a QA team with no C/C++ expertise.
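As a rough illustration of the byte-level tools named above (this is not code from xpt-ieee.R), the following base R sketch pulls the sign and biased exponent out of an IEEE 754 double using rawShift() and bitwAnd(). All variable names are illustrative assumptions.

```r
# Serialize one double to its 8 big-endian IEEE 754 bytes.
bytes <- writeBin(-118.625, raw(), endian = "big")

# The sign is the top bit of byte 1: shift right by 7.
sign_bit <- as.integer(rawShift(bytes[1], -7L))

# The 11-bit biased exponent spans the low 7 bits of byte 1
# and the high 4 bits of byte 2.
exp_hi <- bitwAnd(as.integer(bytes[1]), 0x7fL)
exp_lo <- as.integer(rawShift(bytes[2], -4L))
biased_exp <- bitwOr(bitwShiftL(exp_hi, 4L), exp_lo)

# Remove the IEEE bias of 1023 to get the power of two.
unbiased_exp <- biased_exp - 1023L
```

Every step is an ordinary R expression, so it can be inspected line by line in an interactive session, which is the kind of auditability the pure-R choice is aiming for.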
Path-first API
validate("/sdtm/", spec = "spec.xlsx") # first arg is always a path
submit("/sdtm/", spec = "spec.xlsx")
The first argument is always a directory or file path, never a data frame. This mirrors how regulators think about submissions (directory structures) and how CI pipelines are organized (file in → file out).
Spec-driven, format-blind validation
validate() loads data files and runs the same engine
regardless of whether the input is XPT or Dataset-JSON. The validation
engine never sees file format — it operates on data frames with metadata
attributes.
Attribute contract
apply_spec() sets a defined attribute contract on the
output data frame. Write functions (write_xpt(),
write_json()) read these attributes and serialize them to
the file format:
| Attribute | Level | Set by | Read by |
|---|---|---|---|
| label | column | apply_spec(), set_label() | write_xpt(), write_json() |
| format.sas | column | apply_spec(), set_format() | write_xpt() |
| sas.length | column | apply_spec(), set_length() | write_xpt() |
| label | data frame | apply_spec(), set_dataset_label() | write_xpt(), write_json() |
| dataset_name | data frame | read_xpt(), read_json(), apply_spec() | write_xpt(), write_json() |
| herald.dataset | data frame | apply_spec() | write_json() |
| herald.sort_keys | data frame | apply_spec(), sort_keys() | write_xpt(), write_json() |
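To make the contract concrete, here is a minimal base R sketch that sets a few of the attributes from the table by hand and reads them back the way a writer might. It does not call apply_spec(); the sample data, the "8." format value, and the helper col_meta() are assumptions made for illustration.

```r
# A toy dataset standing in for the output of apply_spec().
dm <- data.frame(
  USUBJID = c("01-001", "01-002"),
  AGE = c(34, 51),
  stringsAsFactors = FALSE
)

# Column-level attributes (names taken from the table above).
attr(dm$AGE, "label") <- "Age"
attr(dm$AGE, "format.sas") <- "8."
attr(dm$AGE, "sas.length") <- 8L

# Data-frame-level attributes.
attr(dm, "label") <- "Demographics"
attr(dm, "dataset_name") <- "DM"

# Hypothetical helper: collect one column-level attribute across columns,
# returning NA where a column does not carry it.
col_meta <- function(df, which) {
  vapply(df, function(col) {
    value <- attr(col, which, exact = TRUE)
    if (is.null(value)) NA_character_ else as.character(value)
  }, character(1))
}

labels <- col_meta(dm, "label")
```

Because the metadata travels on the data frame itself, a writer needs no second argument describing the spec; it only has to look up the attributes it knows about.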
XPT binary format internals
SAS V5 transport files follow a specific 80-byte record structure:
Library header (3 records × 80 bytes = 240 bytes)
Record 1: "HEADER RECORD*******LIBRARY HEADER RECORD!!!!..." magic bytes
Record 2: SAS system info (name, OS, creation date)
Record 3: Creation/modification timestamps
Member header (5 records × 80 bytes = 400 bytes per dataset)
Record 1: "HEADER RECORD*******MEMBER HEADER RECORD!!!!..." magic
Record 2: SAS member info
Record 3: Dataset name + label
Record 4: "HEADER RECORD*******DSCRPTR HEADER RECORD!!!!..."
Record 5: continuation
Namestr block (ceil(n_vars × 140 / 80) × 80 bytes)
One 140-byte namestr per variable:
- Bytes 1-2: variable type (1=numeric, 2=character)
- Bytes 3-4: hash (0)
- Bytes 5-6: variable length
- Bytes 7-8: variable number (1-based)
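- Bytes 9-16: variable name (padded with spaces)
- Bytes 17-56: variable label (padded with spaces)
- Bytes 57-64: format name
- Bytes 65-68: format length and decimals
- Bytes 69-72: justification flag and alignment fill
- Bytes 73-84: informat name, length, and decimals
- Bytes 85-88: position of the variable within the observation
- Bytes 89-140: unused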
Observation records (ceil(n_rows × row_width / 80) × 80 bytes in total)
Row data packed contiguously at row_width bytes per row; only the final record is padded to the 80-byte boundary
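The namestr layout can be exercised with a short base R sketch. It assumes the SAS TS-140 field positions (name in bytes 9-16, label in bytes 17-56, format name in bytes 57-64) and big-endian 2-byte integers; parse_namestr() and the helper functions are hypothetical names, not herald's internal API.

```r
# Parse the leading fields of one 140-byte namestr record.
parse_namestr <- function(raw140) {
  stopifnot(is.raw(raw140), length(raw140) == 140L)
  # XPT integers are big-endian 2-byte shorts.
  short_at <- function(pos) {
    readBin(raw140[pos:(pos + 1L)], "integer", size = 2L, endian = "big")
  }
  # Text fields are space-padded on the right.
  text_at <- function(from, to) {
    trimws(rawToChar(raw140[from:to]), which = "right")
  }
  list(
    type = short_at(1L),    # 1 = numeric, 2 = character
    length = short_at(5L),
    varnum = short_at(7L),
    name = text_at(9L, 16L),
    label = text_at(17L, 56L),
    format = text_at(57L, 64L)
  )
}

# Build a synthetic namestr for a numeric variable AGE to test the parser.
short_raw <- function(n) writeBin(as.integer(n), raw(), size = 2L, endian = "big")
pad <- function(x, width) charToRaw(formatC(x, width = width, flag = "-"))
namestr <- c(
  short_raw(1), short_raw(0), short_raw(8), short_raw(1),
  pad("AGE", 8), pad("Age", 40), pad("", 8),
  raw(140 - 64)  # remaining fields left as zero bytes in this toy record
)

parsed <- parse_namestr(namestr)
```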
IEEE 754 → IBM 370 floating point
SAS stores numeric values in IBM System/370 hexadecimal
floating-point format (not IEEE 754). herald converts in
xpt-ieee.R:
# IEEE 754 double → IBM 370 hex float
# 1. Handle special values (0, NA, +/-Inf, NaN)
# 2. Extract sign, biased exponent, mantissa from 8 raw bytes
# 3. Re-bias: IEEE exponent (base 2) → IBM exponent (base 16)
# 4. Pack into 8 bytes in IBM format
ieee_to_ibm <- function(x) { ... } # xpt-ieee.R
The SAS missing value system uses specific IBM float bit patterns:
. (period) is the standard missing;
._ and .A–.Z are special missings. herald maps
standard missing to NA.
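The conversion steps can be sketched for a single value as follows. This is an illustrative version that uses plain arithmetic rather than the bit manipulation described for xpt-ieee.R, and it is not the package's implementation. The function name ieee_to_ibm_sketch() is hypothetical, and the NA encoding (a "." byte, 0x2E, followed by seven zero bytes) is an assumption based on the SAS transport format convention.

```r
ieee_to_ibm_sketch <- function(x) {
  stopifnot(is.numeric(x), length(x) == 1L)
  # Step 1: special values.
  if (is.na(x)) return(as.raw(c(0x2e, rep(0, 7))))  # NA/NaN -> standard missing
  if (x == 0) return(raw(8))
  if (!is.finite(x)) stop("Inf cannot be represented in IBM hex float")

  # Steps 2-3: find the base-16 exponent e so that a / 16^e is in [1/16, 1).
  a <- abs(x)
  e <- floor(log2(a) / 4) + 1
  while (a / 16^e >= 1) e <- e + 1
  while (a / 16^e < 1 / 16) e <- e - 1
  if (e < -64 || e > 63) stop("value outside the IBM hex float range")

  # Step 4: pack the sign bit and excess-64 exponent, then a 56-bit fraction.
  frac <- a / 16^e
  bytes <- integer(8)
  bytes[1] <- (if (x < 0) 128L else 0L) + as.integer(e + 64)
  for (i in 2:8) {
    frac <- frac * 256        # scaling by a power of two is exact
    b <- floor(frac)
    bytes[i] <- as.integer(b)
    frac <- frac - b
  }
  as.raw(bytes)
}

ieee_to_ibm_sketch(1)         # 41 10 00 00 00 00 00 00
ieee_to_ibm_sketch(-118.625)  # c2 76 a0 00 00 00 00 00
```

A 53-bit IEEE significand always fits into the 56-bit IBM fraction, even with up to three leading zero bits from hexadecimal normalization, so in-range values convert without rounding. The IBM range is narrower than the IEEE range, which is why the sketch stops on out-of-range input.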
Rule engine internals
Operator dispatch
Operators are R functions registered in a lazy-init environment via
init_operations_operators(). The evaluation pipeline:
- load_rules() parses YAML → list of herald_rule objects
- evaluate_rule(rule, data, context) looks up rule$check$operator in the operator registry
- The operator function is called with (data[[rule$check$params$dataset]], rule$check$params)
- Returns a logical vector: TRUE = pass, FALSE = finding
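The pipeline above can be sketched in base R as a registry environment that is populated on first use. Only rule$check$operator and rule$check$params$dataset come from the description; the registry name, both operators (in_set, not_missing), the id field, and the function names are hypothetical, and the context argument is omitted for brevity.

```r
# Registry environment, filled lazily on first use.
.operator_registry <- new.env(parent = emptyenv())

init_operators_sketch <- function() {
  if (length(ls(.operator_registry)) > 0L) {
    return(invisible(.operator_registry))  # already initialized
  }
  # Each operator takes (dataset, params) and returns TRUE = pass per row.
  .operator_registry$in_set <- function(df, params) {
    df[[params$variable]] %in% params$values
  }
  .operator_registry$not_missing <- function(df, params) {
    v <- df[[params$variable]]
    !is.na(v) & nzchar(as.character(v))
  }
  invisible(.operator_registry)
}

evaluate_rule_sketch <- function(rule, data) {
  init_operators_sketch()
  op <- get0(rule$check$operator, envir = .operator_registry, inherits = FALSE)
  if (is.null(op)) stop("Unknown operator: ", rule$check$operator)
  op(data[[rule$check$params$dataset]], rule$check$params)
}

# A toy S3 rule and dataset.
rule <- structure(
  list(
    id = "DEMO001",
    check = list(
      operator = "in_set",
      params = list(dataset = "DM", variable = "SEX", values = c("M", "F", "U"))
    )
  ),
  class = "herald_rule"
)
data <- list(DM = data.frame(SEX = c("M", "X", "F")))

result <- evaluate_rule_sketch(rule, data)  # TRUE FALSE TRUE
```

Keeping operators in an environment means a new check can be added by registering one function, without touching the evaluation loop.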
Built-in rule sets
# FDA SDTM rules (requires bundled rules)
fda <- fda_rules()
length(fda)
#> [1] 2824
# ADaM rules
adam <- adam_rules()
length(adam)
#> [1] 3865
# PMDA rules
pmda <- pmda_rules()
length(pmda)
#> [1] 3865
Loading a config
# Load a named config from bundled or cache
cfg <- load_herald_config("fda-sdtm-ig-3.3")
names(cfg)
#> NULL
Anchor system
For cross-dataset rules (e.g., “AESTDTC must be after RFSTDTC”), herald builds an anchor index: a mapping from each dataset to the subject-level anchor (usually DM).
The anchor is detected via 4-tier heuristics in build_anchor_indexes():
1. ds_spec$structure contains “one record per subject”
2. ds_spec$class is “SPECIAL PURPOSE” or “SUBJECT LEVEL ANALYSIS DATASET”
3. Key analysis: dataset whose keys are a subset of all others’ keys
4. Data frequency: dataset with the most unique USUBJID-like columns
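The third tier (key analysis) is the easiest to illustrate in isolation. The sketch below is one possible reading of "keys are a subset of all others' keys", using made-up key lists; it is not the logic inside build_anchor_indexes().

```r
# Hypothetical key variables per dataset.
keys <- list(
  DM = c("STUDYID", "USUBJID"),
  AE = c("STUDYID", "USUBJID", "AESEQ"),
  VS = c("STUDYID", "USUBJID", "VSTESTCD", "VISITNUM")
)

# A dataset qualifies as anchor when its keys are contained
# in the keys of every other dataset.
is_anchor <- vapply(names(keys), function(candidate) {
  others <- keys[names(keys) != candidate]
  all(vapply(others, function(k) all(keys[[candidate]] %in% k), logical(1)))
}, logical(1))

anchor <- names(keys)[is_anchor]  # "DM"
```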
Spec class internals
herald_spec is an S7 class with 11 typed properties:
new_class(
"herald_spec",
properties = list(
ds_spec = class_data.frame | NULL,
var_spec = class_data.frame | NULL,
value_spec = class_data.frame | NULL,
codelist = class_data.frame | NULL,
study = class_data.frame | NULL,
dictionaries = class_data.frame | NULL,
methods = class_data.frame | NULL,
comments = class_data.frame | NULL,
documents = class_data.frame | NULL,
arm_displays = class_data.frame | NULL,
arm_results = class_data.frame | NULL
)
)
The $ accessor is overloaded via
method(`$`, herald_spec) to provide both
property access and column-level filtering syntax.
Define-XML builder
write_define_xml() builds an ODM 1.3 document using the
xml2 package:
xml2::xml_new_document()
└── <ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:def="http://www.cdisc.org/ns/def/v2.1"
xmlns:arm="http://www.cdisc.org/ns/arm/v1">
└── <Study OID="...">
└── <MetaDataVersion OID="..." Name="...">
├── <def:Standards>
├── <ItemGroupDef> × n_datasets
│ └── <ItemRef> × n_vars_in_dataset
├── <ItemDef> × n_vars_total
│ ├── <def:Origin>
│ └── <CodeListRef>
├── <CodeList> × n_codelists
│ └── <EnumeratedItem> × n_terms
├── <MethodDef> × n_methods
└── <arm:AnalysisResultDisplays> (if ARM data present)
Dependency philosophy
herald has only 4 hard dependencies (Imports):
| Package | Why it’s a hard dep |
|---|---|
| cli | All user-facing messages use cli_inform/warn/abort |
| rlang | Error handling (caller_env(), %\|\|%, abort()) |
| S7 | The spec and result classes are S7 |
| vctrs | vec_size() and vec_cast() in hot paths |
Everything else is in Suggests. If xml2 is not
installed, Define-XML generation is skipped with a warning. If
jsonlite is not installed, Dataset-JSON is unavailable. If
openxlsx2 is not installed, Excel reports are skipped. The
core pipeline (spec → XPT → validate) has zero optional
dependencies.
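A common base R pattern for this kind of graceful degradation is a requireNamespace() guard. The sketch below shows the idea under stated assumptions: with_optional() is a hypothetical helper, not part of herald, and it uses warning() where the package would presumably use cli.

```r
# Run `action` only if an optional package is installed;
# otherwise warn and skip the step.
with_optional <- function(pkg, action, what) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    warning(
      sprintf("Package '%s' is not installed; skipping %s.", pkg, what),
      call. = FALSE
    )
    return(invisible(NULL))
  }
  action()
}

# The step is skipped (with a warning) when the package is absent.
skipped <- suppressWarnings(
  with_optional("notARealPkg123", function() "done", "Define-XML generation")
)

# The step runs when the package is available ('stats' ships with R).
ran <- with_optional("stats", function() "done", "a demo step")
```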
What to read next
- vignette("herald"): the user-facing 5-minute workflow
- vignette("validation"): conformance rules from a user perspective
- vignette("migration-guide"): moving from metacore + xportr + P21
