Architecture and Internals

This vignette documents herald’s internal architecture for contributors and power users who need to extend the package or understand how it works at the implementation level.

Six-layer architecture

herald is organized into six layers, each with a clear responsibility:

Layer 5  submit()          sub-submit.R, sub-manifest.R, sub-class.R
          │
Layer 4  validate()        val-engine.R, val-class.R, val-checks.R,
          │                val-report.R, val-define.R, val-spec.R,
          │                validate-context.R
          │
Layer 3  write_define_xml() define-write.R, define-build.R,
          │                  define-build-arm.R
          │
Layer 2  apply_spec()      spec-apply.R, spec-metadata.R,
          │                spec-accessors.R
          │
Layer 1  herald_spec()     spec-class.R, spec-read.R, spec-define.R,
          │                spec-json.R, spec-write.R
          │
Layer 0  read_xpt()        xpt-read.R, xpt-write.R, xpt-header.R,
         write_xpt()       xpt-ieee.R, xpt-encoding.R
         read_json()       json-io.R
         write_json()

You can enter at any layer. Most downstream consumers (pharmacometrics, TFL pipelines) enter at Layer 2 with apply_spec(). Tools that only need file conversion enter at Layer 0.

Design decisions

S7 for spec and result classes, S3 for rules

herald_spec, herald_validation, and herald_submission are S7 classes — typed properties, method dispatch via S7::method(), structural guarantees on construction. The S7 choice gives type safety without the S4 ceremony.

herald_rule and herald_rule_catalog are S3 — they are purely informational, no invariants to enforce, minimal dispatch needed.
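
As a rough illustration of why S3 is enough here, a rule can be a plain classed list with a format/print method and nothing else. The constructor name new_rule() and the three fields shown are assumptions made for this sketch, not herald's actual internals.

```r
# Hypothetical constructor: a rule is just a named list with a class attribute.
new_rule <- function(rule_id, description, severity) {
  structure(
    list(rule_id = rule_id, description = description, severity = severity),
    class = "herald_rule"
  )
}

# The only dispatch needed is for display.
format.herald_rule <- function(x, ...) {
  sprintf("<%s> [%s] %s", x$rule_id, x$severity, x$description)
}

print.herald_rule <- function(x, ...) {
  cat(format(x), "\n", sep = "")
  invisible(x)
}

r <- new_rule("HRL-VAR-001", "Variable present in dataset", "reject")
print(r)
#> <HRL-VAR-001> [reject] Variable present in dataset
```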

Pure R, no compiled code

XPT binary parsing uses readBin() + writeBin() with pure R logic. IEEE 754 → IBM 370 floating-point conversion is implemented in xpt-ieee.R using bit manipulation via rawShift() and bitwAnd(). No Rcpp, no C wrappers.

This is a deliberate GxP design choice: every line of code that touches clinical data can be audited by a QA team with no C/C++ expertise.
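
To give a feel for what pure-R binary handling looks like, the following sketch round-trips two big-endian 2-byte integers and an 8-character name through a raw vector using only base functions. It is an illustration of the approach, not code taken from herald.

```r
# writeBin() returns a raw vector when its connection argument is raw().
type_len <- writeBin(c(1L, 8L), raw(), size = 2L, endian = "big")

# Left-justify the name and pad it with spaces to 8 bytes.
name_raw <- charToRaw(sprintf("%-8s", "USUBJID"))
bytes <- c(type_len, name_raw)

# Read the fields back from the raw vector.
fields <- readBin(bytes[1:4], what = "integer", n = 2L, size = 2L, endian = "big")
name <- trimws(rawToChar(bytes[5:12]), which = "right")

fields
#> [1] 1 8
name
#> [1] "USUBJID"
```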

Path-first API

validate("/sdtm/", spec = "spec.xlsx")   # first arg is always a path
submit("/sdtm/",   spec = "spec.xlsx")

The first argument is always a directory or file path, never a data frame. This mirrors how regulators think about submissions (directory structures) and how CI pipelines are organized (file in → file out).

validate() loads data files and runs the same engine regardless of whether the input is XPT or Dataset-JSON. The validation engine never sees file format — it operates on data frames with metadata attributes.

Attribute contract

apply_spec() attaches a fixed set of attributes to the output data frame and its columns. Write functions (write_xpt(), write_json()) read these attributes and serialize them to the file format:

Attribute	Level	Set by	Read by
`label`	column	`apply_spec()`, `set_label()`	`write_xpt()`, `write_json()`
`format.sas`	column	`apply_spec()`, `set_format()`	`write_xpt()`
`sas.length`	column	`apply_spec()`, `set_length()`	`write_xpt()`
`label`	data frame	`apply_spec()`, `set_dataset_label()`	`write_xpt()`, `write_json()`
`dataset_name`	data frame	`read_xpt()`, `read_json()`, `apply_spec()`	`write_xpt()`, `write_json()`
`herald.dataset`	data frame	`apply_spec()`	`write_json()`
`herald.sort_keys`	data frame	`apply_spec()`, `sort_keys()`	`write_xpt()`, `write_json()`
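
The contract is ordinary R attributes, so it can be inspected or reproduced by hand. The sketch below sets a few of the attributes from the table manually (instead of calling apply_spec()) and collects the column labels the way a writer might; the collection logic is an assumption for illustration, not herald's implementation.

```r
dm <- data.frame(USUBJID = c("01-001", "01-002"), AGE = c(34, 51))

# Column-level attributes
attr(dm$USUBJID, "label") <- "Unique Subject Identifier"
attr(dm$AGE, "label") <- "Age"
attr(dm$AGE, "format.sas") <- "8."

# Data-frame-level attributes
attr(dm, "label") <- "Demographics"
attr(dm, "dataset_name") <- "DM"

# Gather column labels, falling back to "" when a column has none.
col_labels <- vapply(dm, function(col) {
  lbl <- attr(col, "label", exact = TRUE)
  if (is.null(lbl)) "" else lbl
}, character(1))

col_labels
#>                     USUBJID                         AGE
#> "Unique Subject Identifier"                       "Age"
```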

XPT binary format internals

SAS V5 transport files follow a specific 80-byte record structure:

Library header (3 records × 80 bytes = 240 bytes)
  Record 1: "HEADER RECORD*******LIBRARY HEADER RECORD!!!!..." magic bytes
  Record 2: SAS system info (name, OS, creation date)
  Record 3: Modification timestamp

Member header (5 records × 80 bytes = 400 bytes per dataset)
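  Record 1: "HEADER RECORD*******MEMBER  HEADER RECORD!!!!..." magic
  Record 2: "HEADER RECORD*******DSCRPTR HEADER RECORD!!!!..."
  Record 3: SAS member info (dataset name, SAS version, OS, creation date)
  Record 4: Modification date, dataset label, dataset type
  Record 5: "HEADER RECORD*******NAMESTR HEADER RECORD!!!!..." (carries the variable count)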

Namestr block (ceil(n_vars × 140 / 80) × 80 bytes)
  One 140-byte namestr per variable:
  - Bytes 1-2:   variable type (1=numeric, 2=character)
  - Bytes 3-4:   hash (0)
  - Bytes 5-6:   variable length
  - Bytes 7-8:   variable number (1-based)
  - Bytes 9-16:  variable name (padded with spaces)
  - Bytes 17-56: variable label (padded with spaces)
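  - Bytes 57-64: format name
  - Bytes 65-68: format length and decimals (2 bytes each)
  - Bytes 69-70: justification (0=left, 1=right)
  - Bytes 71-72: unused
  - Bytes 73-80: informat name
  - Bytes 81-84: informat length and decimals (2 bytes each)
  - Bytes 85-88: position of the value within the observation
  - Bytes 89-140: reserved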

Observation records (ceil(n_rows × row_width / 80) × 80 bytes)
  Rows packed back to back at row_width bytes each; only the final record is padded (with spaces) to the 80-byte boundary
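
To make the namestr layout concrete, the sketch below computes the padded namestr block size and then builds and decodes one synthetic 140-byte namestr with base R only. The helper names (pad80, be_short, pad_chr, parse_namestr) are hypothetical and not part of herald; the offsets follow the public SAS V5 transport layout, with the 8-byte name field at bytes 9-16 and the 40-byte label at bytes 17-56.

```r
# Round a byte count up to the next 80-byte record boundary.
pad80 <- function(n_bytes) ceiling(n_bytes / 80) * 80
namestr_block <- pad80(3 * 140)   # three variables: 420 bytes of namestrs
namestr_block
#> [1] 480

# Helpers for big-endian shorts and space-padded character fields.
be_short <- function(x) writeBin(as.integer(x), raw(), size = 2L, endian = "big")
pad_chr <- function(x, width) charToRaw(paste0(x, strrep(" ", width - nchar(x))))

# Synthetic namestr: numeric variable AGE, length 8, variable number 1.
namestr <- c(
  be_short(1), be_short(0), be_short(8), be_short(1),
  pad_chr("AGE", 8), pad_chr("Age", 40),
  raw(140 - 56)                   # remaining fields left as zero bytes
)

parse_namestr <- function(b) {
  shorts <- readBin(b[1:8], "integer", n = 4L, size = 2L, endian = "big")
  list(
    type   = if (shorts[1] == 1L) "numeric" else "character",
    length = shorts[3],
    varnum = shorts[4],
    name   = trimws(rawToChar(b[9:16]), which = "right"),
    label  = trimws(rawToChar(b[17:56]), which = "right")
  )
}

parsed <- parse_namestr(namestr)
str(parsed)
```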

IEEE 754 → IBM 370 floating point

SAS transport files store numeric values in IBM System/370 hexadecimal floating-point format (not IEEE 754). herald performs the conversion in xpt-ieee.R:

# IEEE 754 double → IBM 370 hex float
# 1. Handle special values (0, NA, +/-Inf, NaN)
# 2. Extract sign, biased exponent, mantissa from 8 raw bytes
# 3. Re-bias: IEEE exponent (base 2) → IBM exponent (base 16)
# 4. Pack into 8 bytes in IBM format
ieee_to_ibm <- function(x) { ... }   # xpt-ieee.R

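SAS missing values are encoded as reserved bit patterns: the first byte holds the ASCII code of the missing indicator and the remaining seven bytes are zero. . (period) is the standard missing value; ._ and .A–.Z are special missings. herald represents standard missing as NA.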
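
The four conversion steps can be sketched for a single value as follows. This is not the code in xpt-ieee.R: it uses plain arithmetic instead of rawShift() and bitwAnd(), handles one scalar at a time, and assumes that every non-finite input (NA, NaN, ±Inf) is written as the standard missing pattern 0x2E followed by seven zero bytes. IBM hexadecimal floats store a sign bit, a 7-bit excess-64 base-16 exponent, and a 56-bit fraction.

```r
ieee_to_ibm_sketch <- function(x) {
  stopifnot(is.numeric(x), length(x) == 1L)

  # Step 1: special values (assumption: all non-finite inputs become missing).
  if (!is.finite(x)) return(as.raw(c(0x2E, rep(0, 7))))
  if (x == 0) return(raw(8))

  # Steps 2-3: find e such that abs(x) = frac * 16^e with 1/16 <= frac < 1.
  a <- abs(x)
  e <- floor(log2(a) / 4) + 1
  while (a / 16^e >= 1) e <- e + 1        # correct any rounding in log2()
  while (a / 16^e < 1 / 16) e <- e - 1
  if (e < -64 || e > 63) stop("value outside the IBM hexadecimal float range")

  # Step 4: peel off the 56-bit fraction one byte at a time (exact in doubles).
  frac <- a / 16^e
  mant <- numeric(7)
  for (i in 1:7) {
    frac <- frac * 256
    mant[i] <- floor(frac)
    frac <- frac - mant[i]
  }

  # First byte: sign bit plus excess-64 exponent.
  first <- e + 64 + if (x < 0) 128 else 0
  as.raw(c(first, mant))
}

ieee_to_ibm_sketch(1)
#> [1] 41 10 00 00 00 00 00 00
ieee_to_ibm_sketch(-118.625)
#> [1] c2 76 a0 00 00 00 00 00
```

Because the IBM fraction has 56 bits and an IEEE double has 53, every double inside the IBM exponent range converts without loss in this direction.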

Rule engine internals

YAML rule format

Each rule in inst/rules/engines/ is a YAML file:

rule_id: HRL-VAR-001
description: "Variable present in dataset"
standard: sdtmig
category: variable
severity: reject
executability: Hardcoded

check:
  operator: variable_present
  params:
    dataset: "{dataset}"
    variable: "{variable}"

Operator dispatch

Operators are R functions registered in a lazy-init environment via init_operations_operators(). The evaluation pipeline:

1. load_rules() parses YAML → list of herald_rule objects
2. evaluate_rule(rule, data, context) looks up rule$check$operator in the operator registry
3. The operator function is called with (data[[rule$check$params$dataset]], rule$check$params)
4. The operator returns a logical vector: TRUE = pass, FALSE = finding
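
A stripped-down version of this dispatch might look like the sketch below. It is written under several assumptions: the registry is a plain environment populated eagerly (herald initializes it lazily), the context argument is omitted, the {dataset} and {variable} placeholders are already resolved, and the body of variable_present is a guess at what such an operator would do.

```r
# Registry: operator name -> function(data, params)
operators <- new.env(parent = emptyenv())

# Hypothetical operator: passes when the variable exists in the dataset.
operators$variable_present <- function(data, params) {
  params$variable %in% names(data)
}

# Simplified evaluate_rule(): look up the operator, then call it.
evaluate_rule_sketch <- function(rule, data) {
  op <- get0(rule$check$operator, envir = operators, inherits = FALSE)
  if (is.null(op)) stop("unknown operator: ", rule$check$operator)
  op(data[[rule$check$params$dataset]], rule$check$params)
}

rule <- list(
  rule_id = "HRL-VAR-001",
  check = list(
    operator = "variable_present",
    params = list(dataset = "DM", variable = "USUBJID")
  )
)
datasets <- list(DM = data.frame(USUBJID = "01-001", AGE = 34))

result <- evaluate_rule_sketch(rule, datasets)
result
#> [1] TRUE
```

Keeping operators in an environment keyed by name means a new YAML rule can reuse an existing operator without any R code changes.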

Built-in rule sets

# FDA SDTM rules (requires bundled rules)
fda <- fda_rules()
length(fda)
#> [1] 2824

# ADaM rules
adam <- adam_rules()
length(adam)
#> [1] 3865

# PMDA rules
pmda <- pmda_rules()
length(pmda)
#> [1] 3865

Loading a config

# Load a named config from bundled or cache
cfg <- load_herald_config("fda-sdtm-ig-3.3")
names(cfg)
#> NULL

Anchor system

For cross-dataset rules (e.g., “AESTDTC must be after RFSTDTC”), herald builds an anchor index: a mapping from each dataset to the subject-level anchor (usually DM).

The anchor is detected by a four-tier heuristic in build_anchor_indexes():

1. ds_spec$structure contains “one record per subject”
2. ds_spec$class is “SPECIAL PURPOSE” or “SUBJECT LEVEL ANALYSIS DATASET”
3. Key analysis: dataset whose keys are a subset of all others’ keys
4. Data frequency: dataset with the most unique USUBJID-like columns
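
The third tier (key analysis) is the easiest to illustrate in isolation. The sketch below is one plausible reading of that tier, not the code in build_anchor_indexes(): given a named list of key variables per dataset, it flags the dataset whose keys are contained in every other dataset's keys. The example keys are invented for the illustration.

```r
keys <- list(
  DM = c("STUDYID", "USUBJID"),
  AE = c("STUDYID", "USUBJID", "AESEQ"),
  LB = c("STUDYID", "USUBJID", "LBSEQ")
)

# A dataset is an anchor candidate if all of its keys appear in every other
# dataset's key set.
is_anchor <- vapply(names(keys), function(ds) {
  others <- keys[setdiff(names(keys), ds)]
  all(vapply(others, function(k) all(keys[[ds]] %in% k), logical(1)))
}, logical(1))

anchor <- names(which(is_anchor))
anchor
#> [1] "DM"
```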

Spec class internals

herald_spec is an S7 class with 11 typed properties:

library(S7)

herald_spec <- new_class(
  "herald_spec",
  properties = list(
    ds_spec      = class_data.frame | NULL,
    var_spec     = class_data.frame | NULL,
    value_spec   = class_data.frame | NULL,
    codelist     = class_data.frame | NULL,
    study        = class_data.frame | NULL,
    dictionaries = class_data.frame | NULL,
    methods      = class_data.frame | NULL,
    comments     = class_data.frame | NULL,
    documents    = class_data.frame | NULL,
    arm_displays = class_data.frame | NULL,
    arm_results  = class_data.frame | NULL
  )
)

The $ accessor is overloaded via S7::method(`$`, herald_spec) to provide both property access and column-level filtering syntax.

Define-XML builder

write_define_xml() builds an ODM 1.3 document using the xml2 package:

xml2::xml_new_document()
  └── <ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
           xmlns:def="http://www.cdisc.org/ns/def/v2.1"
           xmlns:arm="http://www.cdisc.org/ns/arm/v1.0">
        └── <Study OID="...">
              └── <MetaDataVersion OID="..." Name="...">
                    ├── <def:Standards>
                    ├── <ItemGroupDef> × n_datasets
                    │     └── <ItemRef> × n_vars_in_dataset
                    ├── <ItemDef> × n_vars_total
                    │     ├── <def:Origin>
                    │     └── <CodeListRef>
                    ├── <CodeList> × n_codelists
                    │     └── <EnumeratedItem> × n_terms
                    ├── <MethodDef> × n_methods
                    └── <arm:AnalysisResultDisplays> (if ARM data present)

Dependency philosophy

herald has only 4 hard dependencies (Imports):

Package	Why it’s a hard dep
`cli`	All user-facing messages use `cli_inform/warn/abort`
`rlang`	Error handling (`caller_env()`, `%||%`, `abort()`)
`S7`	The spec and result classes are S7
`vctrs`	`vec_size()` and `vec_cast()` in hot paths

Everything else is in Suggests. If xml2 is not installed, Define-XML generation is skipped with a warning. If jsonlite is not installed, Dataset-JSON is unavailable. If openxlsx2 is not installed, Excel reports are skipped. The core pipeline (spec → XPT → validate) needs none of these optional packages.
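
The skip-with-a-warning behaviour follows the usual requireNamespace() guard for Suggests packages. The sketch below shows that pattern in base R; the helper name with_optional() is hypothetical, and herald itself reports through cli, so the real messages will look different.

```r
# Run `action` only if `pkg` is installed; otherwise warn and return NULL.
with_optional <- function(pkg, action, what) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    warning(
      sprintf("Package '%s' is not installed; skipping %s.", pkg, what),
      call. = FALSE
    )
    return(invisible(NULL))
  }
  action()
}

# 'stats' ships with R, so the action runs.
ok <- with_optional("stats", function() 42, "the demo step")

# A package that does not exist triggers the warning path.
skipped <- suppressWarnings(
  with_optional("surelyNotAnInstalledPackage", function() 42, "the demo step")
)
```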