Reads SAS XPT transport files, either V5 (the FDA standard) or V8 (extended), into R data frames. The implementation is pure R and depends on neither SAS nor haven.

Usage

read_xpt(file, col_select = NULL, n_max = Inf, encoding = "wlatin1")

Arguments

file

File path to an .xpt file.

col_select

Character vector of column names to read. NULL (default) reads all columns.

n_max

Maximum number of rows to read. Inf (default) reads all rows.

encoding

Character encoding of the XPT file. Defaults to "wlatin1" (SAS WLATIN1 = Windows-1252), which is the standard encoding for SAS on Windows and a superset of 7-bit ASCII. Accepts SAS encoding names ("wlatin1", "latin1", "utf-8", "shift-jis"), aliases ("wlt1", "sjis"), or standard names ("WINDOWS-1252", "ISO-8859-1"). Set to NULL to pass bytes through without conversion.

Value

A data frame for single-member files, or a named list of data frames for multi-member files.
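
Because the return type depends on the number of members, downstream code may want to normalise it. The following is a minimal sketch in base R, not part of this package: as_member_list() is a hypothetical helper, and the objects `single` and `multi` are hand-built stand-ins for what read_xpt() might return rather than results read from real files.

```r
# Hypothetical helper (not provided by the package): wrap a single data frame
# in a one-element named list so both return shapes can be handled alike.
as_member_list <- function(x, name = "data") {
  if (is.data.frame(x)) {
    out <- list(x)
    names(out) <- name
    return(out)
  }
  x
}

# Stand-ins for read_xpt() results; in practice these would come from e.g.
#   res <- read_xpt("dm.xpt")   # hypothetical path
single <- data.frame(USUBJID = c("01", "02"))
multi <- list(
  DM = data.frame(USUBJID = "01"),
  AE = data.frame(AETERM = "HEADACHE")
)

members_single <- as_member_list(single, name = "DM")
members_multi <- as_member_list(multi)

# Both results can now be iterated the same way.
for (nm in names(members_multi)) {
  cat(nm, ":", nrow(members_multi[[nm]]), "row(s)\n")
}
```

Checking is.data.frame() first matters because a data frame is itself a list, so testing is.list() alone would not distinguish the two cases.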

Details

Date/datetime conversion

Numeric columns with a SAS date or datetime format are automatically converted to R Date or POSIXct classes using the SAS epoch (1960-01-01). The conversion is driven by the SAS format name recorded for each variable in the XPT file header (NAMESTR record), which is exposed in R as the format.sas attribute.

Date formats (e.g. DATE9., MMDDYY10., YYMMDD10., E8601DA.) produce R Date values. Datetime formats (e.g. DATETIME20., E8601DT., DATEAMPM.) produce R POSIXct values in UTC.

The format.sas attribute is preserved on converted columns for round-trip fidelity with write_xpt().
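
The epoch arithmetic described above can be illustrated in base R. This is a sketch of the underlying idea only, assuming the usual SAS conventions (dates stored as days, datetimes as seconds, both counted from 1960-01-01); it is not the package's internal code, and the input values are made up.

```r
sas_epoch <- as.Date("1960-01-01")

# SAS dates: number of days since 1960-01-01.
sas_days <- c(0, 21915, NA)
dates <- as.Date(sas_days, origin = sas_epoch)

# SAS datetimes: number of seconds since 1960-01-01 00:00:00.
sas_secs <- c(0, 86400, NA)
datetimes <- as.POSIXct(sas_secs, origin = "1960-01-01", tz = "UTC")

# Keeping the SAS format as an attribute does not disturb the Date class.
attr(dates, "format.sas") <- "DATE9."

format(dates)
format(datetimes, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Missing values pass through the arithmetic unchanged, which is consistent with missing SAS dates becoming NA dates.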

SAS missing values

  • Numeric SAS missing values (., .A-.Z, ._) are read as NA_real_. For date/datetime columns these become NA dates.

  • Character blanks (all spaces) are read as NA_character_.
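
As an illustration of the character rule, the sketch below maps all-space strings to NA_character_ in base R. It assumes fixed-width, space-padded values like those stored in XPT records; the vector is invented for the example, and whether read_xpt() also trims trailing spaces from non-blank values is not shown here.

```r
# Space-padded values as they might appear in a fixed-width character field.
raw_chr <- c("ASPIRIN ", "        ", "PLACEBO ")

# Treat values consisting only of spaces as missing.
is_blank <- grepl("^ +$", raw_chr)
clean <- raw_chr
clean[is_blank] <- NA_character_

# Numeric missing values all end up as the same NA_real_ in R.
num <- c(1.5, NA_real_)

is.na(clean)
is.na(num)
```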

Attributes

  • Column labels are stored as the "label" attribute on each column.

  • SAS formats are stored as the "format.sas" attribute on each column.

  • The dataset label is stored as the "label" attribute on the data frame.
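
These attributes can be inspected with base R's attr(). The sketch below builds a small data frame by hand to mimic the layout described above, since no real file is assumed; the column names and labels are illustrative only.

```r
# Hand-built stand-in for a data frame returned by read_xpt().
dm <- data.frame(
  AGE = c(34, 51),
  RFSTDTC = as.Date(c("2020-01-15", "2020-02-03"))
)
attr(dm$AGE, "label") <- "Age"
attr(dm$RFSTDTC, "label") <- "Reference Start Date"
attr(dm$RFSTDTC, "format.sas") <- "DATE9."
attr(dm, "label") <- "Demographics"

# Collect column labels into a named character vector (NA where absent).
col_labels <- vapply(dm, function(col) {
  lbl <- attr(col, "label", exact = TRUE)
  if (is.null(lbl)) NA_character_ else lbl
}, character(1))

dataset_label <- attr(dm, "label", exact = TRUE)
sas_format <- attr(dm$RFSTDTC, "format.sas", exact = TRUE)
```

Using exact = TRUE avoids attr()'s partial matching, which could otherwise pick up a differently named attribute that merely starts with the same characters.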

Character encoding

XPT files contain no encoding metadata. SAS on Windows defaults to WLATIN1 (Windows-1252), an extended ASCII encoding that is a superset of 7-bit ASCII. By default, read_xpt() converts WLATIN1 bytes to UTF-8. This is a no-op for pure ASCII files (all bytes < 0x80 are identical) and correctly handles extended characters commonly found in clinical data.

Supported SAS encoding names:

SAS name    Alias   Standard name
wlatin1     wlt1    WINDOWS-1252
latin1      lat1    ISO-8859-1
utf-8       utf8    UTF-8
us-ascii    ansi    US-ASCII
wlatin2     wlt2    WINDOWS-1250
wcyrillic   wcyr    WINDOWS-1251
shift-jis   sjis    CP932
euc-jp      jeuc    EUC-JP
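
One way to picture how a SAS name, alias, or standard name could resolve to a single encoding is a simple lookup table. The sketch below is an assumption about the general approach, not the package's actual implementation; resolve_encoding() and the `encodings` table are hypothetical names, and the case-insensitive matching is a choice made for this example.

```r
# Lookup table transcribed from the list of supported encoding names.
encodings <- data.frame(
  sas = c("wlatin1", "latin1", "utf-8", "us-ascii",
          "wlatin2", "wcyrillic", "shift-jis", "euc-jp"),
  alias = c("wlt1", "lat1", "utf8", "ansi",
            "wlt2", "wcyr", "sjis", "jeuc"),
  standard = c("WINDOWS-1252", "ISO-8859-1", "UTF-8", "US-ASCII",
               "WINDOWS-1250", "WINDOWS-1251", "CP932", "EUC-JP"),
  stringsAsFactors = FALSE
)

# Hypothetical resolver: accept any of the three spellings, ignoring case,
# and return the standard name.
resolve_encoding <- function(x) {
  key <- tolower(x)
  hit <- which(
    encodings$sas == key |
      encodings$alias == key |
      tolower(encodings$standard) == key
  )
  if (length(hit) == 0L) {
    stop("Unknown encoding: ", x)
  }
  encodings$standard[hit[1L]]
}

resolve_encoding("wlt1")
resolve_encoding("shift-jis")
resolve_encoding("ISO-8859-1")
```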

WLATIN1 extended ASCII characters commonly found in clinical data:

Byte   Unicode   Description
0x91   U+2018    Left single quote
0x92   U+2019    Right single quote
0x93   U+201C    Left double quote
0x94   U+201D    Right double quote
0x96   U+2013    En dash
0x97   U+2014    Em dash
0x85   U+2026    Horizontal ellipsis
0x99   U+2122    Trademark
0xA9   U+00A9    Copyright
0xAE   U+00AE    Registered
0xB0   U+00B0    Degree sign
0xB1   U+00B1    Plus-minus
0xB5   U+00B5    Micro sign
0xD7   U+00D7    Multiplication sign
0xE9   U+00E9    Latin small e acute
0xF1   U+00F1    Latin small n tilde
0xFC   U+00FC    Latin small u umlaut

See the full WLATIN1 map at https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT.
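
The byte-to-Unicode mapping can be checked with base R's iconv(). This sketch converts a few Windows-1252 bytes to UTF-8 and inspects the resulting code points; it illustrates the conversion itself and is not the package's code. It assumes the platform's iconv recognises the name "WINDOWS-1252", which is typical but not guaranteed everywhere.

```r
# Bytes for: left double quote, "5", plus-minus, "2", degree sign, "C",
# right double quote.
bytes <- as.raw(c(0x93, 0x35, 0xB1, 0x32, 0xB0, 0x43, 0x94))
wlatin1 <- rawToChar(bytes)

# Convert from Windows-1252 to UTF-8.
utf8 <- iconv(wlatin1, from = "WINDOWS-1252", to = "UTF-8")

# Unicode code points of the converted string.
code_points <- utf8ToInt(utf8)
sprintf("U+%04X", code_points)

# Pure ASCII text is unchanged by the same conversion.
ascii <- iconv("USUBJID", from = "WINDOWS-1252", to = "UTF-8")
```

The quote bytes 0x93 and 0x94 come out as U+201C and U+201D, while the ASCII digits and letters keep their original values.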