
herald reads and writes SAS V5 and V8 transport (XPT) files entirely in base R — no SAS license, no haven, no compiled code. The binary format is implemented directly with readBin()/writeBin(), making every byte auditable. This matters in regulated environments where code provenance is a GxP requirement.

Writing XPT files

write_xpt() takes a data frame, a file path, and optional metadata:

dm <- data.frame(
  STUDYID = rep("CDISCPILOT01", 3L),
  USUBJID = c("01-701-1015", "01-701-1023", "01-701-1028"),
  AGE     = c(63L, 64L, 71L),
  SEX     = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

xpt_path <- file.path(tempdir(), "dm.xpt")

write_xpt(dm, xpt_path, label = "Demographics")
file.info(xpt_path)$size   # always a multiple of 80 bytes
#> [1] 1440
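
The 80-byte multiple comes from the transport format's fixed record size: content is laid out in 80-byte records and the last record is padded with blanks. The following is a minimal base R sketch of that padding using writeBin() and readBin(). It is not herald's writer, and `pad_to_record()` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: pad a raw vector with ASCII blanks (0x20)
# up to the next multiple of the record size.
pad_to_record <- function(bytes, record_size = 80L) {
  remainder <- length(bytes) %% record_size
  if (remainder == 0L) return(bytes)
  c(bytes, rep(as.raw(0x20), record_size - remainder))
}

payload <- charToRaw("CDISCPILOT0101-701-1015")  # 23 bytes of character data
record  <- pad_to_record(payload)

# Write the padded record and read it back byte for byte.
tmp <- tempfile(fileext = ".bin")
writeBin(record, tmp)
back <- readBin(tmp, what = "raw", n = file.size(tmp))

length(back) %% 80L == 0L   # TRUE: the file is a whole number of records
identical(back, record)     # TRUE: the bytes are unchanged on disk
```

By the same arithmetic, the 1440-byte file above holds 18 records of 80 bytes each.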

write_xpt() returns the input data frame invisibly, enabling pipes:

dm |> write_xpt("dm.xpt") |> write_json("dm.json")

V5 vs V8 transport format

| Feature | V5 (default) | V8 |
|---|---|---|
| Variable name length | 8 characters | 32 characters |
| Variable label length | 40 characters | 256 characters |
| Dataset label length | 40 characters | 256 characters |
| FDA submission requirement | ✓ Required | Not yet accepted |
| Long ADaM names (PARAMCD etc.) | ✓ Fine (≤8 chars) | Required for longer names |

# V8 for extended variable names
long_df <- data.frame(LONGVARNAME01 = c(1.5, 2.0))
xpt_v8 <- tempfile(fileext = ".xpt")

write_xpt(long_df, xpt_v8, version = 8L)
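
The V5 limits listed above can be checked before choosing a version. The sketch below flags variable names longer than 8 characters and labels longer than 40. `v5_violations()` is a hypothetical helper, not part of herald, and it assumes labels live in the `label` attribute of each column, as they do elsewhere in this article.

```r
# Hypothetical pre-flight check against the V5 limits (8-char names, 40-char labels).
v5_violations <- function(df, max_name = 8L, max_label = 40L) {
  # Collect each column's label, using "" when none is set.
  labels <- vapply(df, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) "" else as.character(lab)[1L]
  }, character(1))
  list(
    long_names  = names(df)[nchar(names(df)) > max_name],
    long_labels = names(df)[nchar(labels) > max_label]
  )
}

long_df <- data.frame(LONGVARNAME01 = c(1.5, 2.0))
v5_violations(long_df)$long_names   # "LONGVARNAME01" (13 characters)
```

An empty result for both elements suggests the data frame fits V5 on these two counts; any hit means either renaming or writing with version = 8L.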

Setting metadata before writing

The cleanest approach is to set metadata explicitly before writing. herald’s metadata helpers use tidy evaluation for concise syntax:

dm2 <- dm

# Set variable labels
dm2 <- set_label(dm2,
  STUDYID = "Study Identifier",
  USUBJID = "Unique Subject Identifier",
  AGE     = "Age",
  SEX     = "Sex"
)

# Set SAS display formats
dm2 <- set_format(dm2, AGE = "8.")

# Set SAS storage lengths (auto-computed if omitted)
dm2 <- set_length(dm2, STUDYID = 12L, USUBJID = 11L, AGE = 8L, SEX = 1L)

# Set dataset-level label
dm2 <- set_dataset_label(dm2, "Demographics")

# Inspect what was set
get_metadata(dm2)
#>   variable                     label format informat length      type
#> 1  STUDYID          Study Identifier   <NA>     <NA>     12 character
#> 2  USUBJID Unique Subject Identifier   <NA>     <NA>     11 character
#> 3      AGE                       Age     8.     <NA>      8   numeric
#> 4      SEX                       Sex   <NA>     <NA>      1 character
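
The lengths passed to set_length() above match the widest value in each character column, with 8 bytes for the numeric column. A base R sketch of how such lengths could be derived from the data follows. It rests on the assumption that character width is measured in bytes and that numerics use the full 8 bytes; herald's own auto-compute logic may differ, and `infer_lengths()` is a hypothetical name.

```r
dm <- data.frame(
  STUDYID = rep("CDISCPILOT01", 3L),
  USUBJID = c("01-701-1015", "01-701-1023", "01-701-1028"),
  AGE     = c(63L, 64L, 71L),
  SEX     = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Assumption: character columns need the widest non-missing value in bytes
# (at least 1); every other column is treated as an 8-byte numeric.
infer_lengths <- function(df) {
  vapply(df, function(col) {
    if (is.character(col)) {
      max(1L, nchar(col[!is.na(col)], type = "bytes"))
    } else {
      8L
    }
  }, integer(1))
}

infer_lengths(dm)   # STUDYID 12, USUBJID 11, AGE 8, SEX 1
```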

Now write_xpt() reads all these attributes automatically:

xpt2 <- file.path(tempdir(), "dm2.xpt")

write_xpt(dm2, xpt2)
dm3 <- read_xpt(xpt2)

attr(dm3$AGE,  "label")       # "Age"
#> [1] "Age"
attr(dm3$AGE,  "format.sas")  # "8."
#> [1] "8."
attr(dm3,      "label")       # "Demographics"
#> [1] "Demographics"

Reading XPT files

read_xpt() returns a data frame with all metadata preserved as attributes:

dm_back <- read_xpt(xpt_path)

# dm was written without variable labels, so only the dataset label comes back
attr(dm_back$STUDYID, "label")
#> NULL
attr(dm_back,         "label")
#> [1] "Demographics"

# Standard data frame operations work normally
nrow(dm_back)
#> [1] 3
names(dm_back)
#> [1] "STUDYID" "USUBJID" "AGE"     "SEX"

Column selection and row limiting

# Read only specific columns (efficient — avoids parsing unused data)
dm_small <- read_xpt(xpt2, col_select = c("STUDYID", "AGE"))
names(dm_small)
#> [1] "STUDYID" "AGE"

# Read only first N rows
dm_head  <- read_xpt(xpt2, n_max = 2L)
nrow(dm_head)
#> [1] 2

Date and datetime columns

SAS stores dates as days since 1960-01-01 and datetimes as seconds since 1960-01-01. herald converts automatically in both directions.

events <- data.frame(
  STUDYID = "CDISCPILOT01",
  USUBJID = "01-701-1015",
  DT      = as.Date("2014-03-15"),
  DTM     = as.POSIXct("2014-03-15 08:30:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

xpt_dt <- tempfile(fileext = ".xpt")

write_xpt(events, xpt_dt, dataset = "EVENTS")
events2 <- read_xpt(xpt_dt)

# identical() also compares storage type and attributes, so it can be FALSE
# even when the date values themselves match
identical(events$DT,  events2$DT)
#> [1] FALSE

# POSIXct round-trips (timezone may normalize to UTC)
as.numeric(events$DTM) == as.numeric(events2$DTM)
#> [1] TRUE
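
The epoch shift described at the start of this section can be reproduced in base R. This is a sketch of the arithmetic only, not herald's conversion code, and the four helper names are hypothetical.

```r
# R counts from 1970-01-01; SAS counts from 1960-01-01.
sas_epoch_days <- as.numeric(as.Date("1960-01-01"))  # -3653 relative to 1970-01-01

to_sas_date   <- function(x) as.numeric(x) - sas_epoch_days
from_sas_date <- function(n) as.Date(n + sas_epoch_days, origin = "1970-01-01")

to_sas_datetime   <- function(x) as.numeric(x) - sas_epoch_days * 86400
from_sas_datetime <- function(s) {
  as.POSIXct(s + sas_epoch_days * 86400, origin = "1970-01-01", tz = "UTC")
}

d   <- as.Date("2014-03-15")
dtm <- as.POSIXct("2014-03-15 08:30:00", tz = "UTC")

to_sas_date(d)                          # 19797 days since 1960-01-01
from_sas_date(to_sas_date(d)) == d      # TRUE

# Compare datetimes on their numeric value, as in the example above.
as.numeric(from_sas_datetime(to_sas_datetime(dtm))) == as.numeric(dtm)  # TRUE
```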

To store ISO 8601 character dates, which are common in SDTM variables such as AESTDTC and RFSTDTC, leave them as character columns. herald does not coerce character strings.

Character encoding

herald supports all SAS encoding identifiers. The default "wlatin1" is correct for FDA SDTM and ADaM submissions.

| Encoding | encoding = | When to use |
|---|---|---|
| Western Latin-1 | "wlatin1" (default) | FDA SDTM / ADaM |
| Latin-1 | "latin1" | European studies |
| UTF-8 | "utf-8" | Unicode content |
| Shift-JIS | "shift-jis" | PMDA Japanese submissions |
| EUC-JP | "euc-jp" | Legacy Japanese |

# PMDA submission with Japanese site names
write_xpt(dm, "dm.xpt", encoding = "shift-jis")
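
The choice of encoding also affects how many bytes a value occupies, which matters because character widths in a transport file are counted in bytes. The base R sketch below uses iconv() to show the effect. It does not show how herald maps SAS encoding names, and "latin1" here is the iconv name rather than a herald argument value.

```r
x <- "Caf\u00e9"   # "Café", written with an escape so the source stays ASCII

# Raw bytes of the same string under two encodings.
utf8_bytes   <- charToRaw(enc2utf8(x))
latin1_bytes <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

length(utf8_bytes)     # 5: the accented letter takes two bytes in UTF-8
length(latin1_bytes)   # 4: one byte per character in Latin-1
```

A length that is sufficient under one encoding can therefore truncate the same text under another, so it is worth checking widths after changing the encoding.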

Round-trip fidelity

dm_full <- set_label(dm,
  STUDYID = "Study Identifier",
  USUBJID = "Unique Subject Identifier",
  AGE     = "Age",
  SEX     = "Sex"
)
dm_full <- set_dataset_label(dm_full, "Demographics")

xpt_rt <- file.path(tempdir(), "dm_rt.xpt")

write_xpt(dm_full, xpt_rt)
dm_rt <- read_xpt(xpt_rt)

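# Character values come back identical. AGE reports FALSE because identical()
# also compares storage type and attributes (AGE was written as an integer column)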
identical(dm_rt$STUDYID, dm_full$STUDYID)
#> [1] TRUE
identical(dm_rt$AGE,     dm_full$AGE)
#> [1] FALSE

# Labels round-trip
attr(dm_rt$STUDYID, "label") == attr(dm_full$STUDYID, "label")
#> [1] TRUE
attr(dm_rt,         "label") == attr(dm_full,         "label")
#> [1] TRUE
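
The FALSE result for AGE above does not by itself mean the values changed. One plausible explanation, not verified against herald here, is that the integer column comes back as double, or comes back with extra attributes. The base R sketch below shows how identical() reacts to each, and how to compare on values instead.

```r
written <- c(63L, 64L, 71L)   # integer column, as in dm$AGE
read    <- c(63, 64, 71)      # the same values held as doubles

identical(written, read)            # FALSE: integer vs double
all(written == read)                # TRUE: the values match
isTRUE(all.equal(written, read))    # TRUE: all.equal() compares numerically

# An extra attribute is enough to make identical() fail as well.
tagged <- read
attr(tagged, "format.sas") <- "8."
identical(read, tagged)                        # FALSE: attributes differ
identical(as.vector(tagged), read)             # TRUE once attributes are dropped
```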

haven vs herald

| Feature | haven | herald |
|---|---|---|
| Pure R (no compiled C) | No | Yes |
| Auto-compute lengths | No, you must set them | Yes, computed from data |
| Dataset-level label | No | label = parameter |
| Sort by key variables | No | Reads herald.sort_keys attr |
| Return value | invisible(file) | invisible(x), pipeable |
| V8 support | Yes | Yes |
| Date/datetime | Partial | Full round-trip |
| Factor columns | Silently converts | Errors loudly, no surprises |
| Encoding map | Limited | Full SAS encoding table |
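
Since the comparison states that herald raises an error on factor columns, a data frame may need an explicit conversion first. A minimal base R sketch of that step, assuming the factor labels (not the underlying integer codes) are what should be stored:

```r
df <- data.frame(
  USUBJID = c("01-701-1015", "01-701-1023"),
  SEX     = factor(c("F", "M")),
  stringsAsFactors = FALSE
)

# Convert factor columns to their character labels; leave other columns as they are.
is_fct <- vapply(df, is.factor, logical(1))
df[is_fct] <- lapply(df[is_fct], as.character)

vapply(df, class, character(1))   # both columns are now "character"
```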