Extracting Common ADSL Variables Across Domains
Clinical Data Science Team
adsl_common_vars_extractor.Rmd
Introduction
In clinical data analysis, the ADSL (Subject-Level Analysis Dataset)
often contains critical variables that are reused across various
analysis domains (e.g., ADAE, ADLB, ADLBC). The
extract_common_adsl_variables() function in the
adrgOS package helps identify and summarize these
common variables, providing insights into variable provenance and
consistency across domains.
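The underlying idea can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes a metadata data frame with `Name` and `Dataset` columns (the names used in the sample data later in this vignette), the toy values are made up, and `find_common_adsl_vars()` is a hypothetical helper.

```r
# Toy variable metadata: one row per variable per dataset (made-up values).
meta <- data.frame(
  Name    = c("STUDYID", "USUBJID", "AGE",
              "STUDYID", "USUBJID", "AVAL",
              "STUDYID", "AGE", "PARAMCD"),
  Dataset = c("ADSL", "ADSL", "ADSL",
              "ADAE", "ADAE", "ADAE",
              "ADLBC", "ADLBC", "ADLBC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each ADSL variable, list the other datasets that
# also contain it, keeping only variables found in at least one of them.
find_common_adsl_vars <- function(meta, adsl = "ADSL") {
  adsl_vars <- unique(meta$Name[meta$Dataset == adsl])
  others <- meta[meta$Dataset != adsl & meta$Name %in% adsl_vars, ]
  # One sorted, de-duplicated vector of dataset names per variable.
  domains <- lapply(split(others$Dataset, others$Name),
                    function(d) sort(unique(d)))
  data.frame(
    Variable     = names(domains),
    DomainsCount = unname(vapply(domains, length, integer(1))),
    DomainsFound = unname(vapply(domains, paste, character(1),
                                 collapse = ", ")),
    stringsAsFactors = FALSE
  )
}

common <- find_common_adsl_vars(meta)
print(common)
```

In this toy input, `AVAL` and `PARAMCD` are dropped because they do not occur in ADSL, while `STUDYID` is reported with both ADAE and ADLBC.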
Installation
You can install adrgOS from GitHub:
# install.packages("remotes")
remotes::install_github("phuse-org/adrgOS")
Usage
Extracting variable information from define.xml
First, extract variable metadata from your Define-XML file using
extract_variable_info_from_define(). If you have a
Define-XML file in your package (inst/define.xml), you can
locate it with system.file().
library(adrgOS)
define_path <- system.file("define.xml", package = "adrgOS")
if (file.exists(define_path)) {
  variable_info <- extract_variable_info_from_define(define_path)
} else {
  stop("Define-XML file not found in the package.")
}
Extract common ADSL variables
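Before calling the function, it can be worth confirming that the metadata has the columns the extraction is likely to need. The sketch below assumes the column names used in the sample data later in this vignette (`Name`, `Dataset`, `DataType`, `Length`, `Description`, `Origin`); the actual output of extract_variable_info_from_define() may differ, and `check_variable_info()` is a hypothetical helper, not part of adrgOS.

```r
# Hypothetical helper: fail early if expected metadata columns are missing
# or if no rows belong to the ADSL dataset. Column names are assumptions.
check_variable_info <- function(df, adsl = "ADSL",
                                required = c("Name", "Dataset", "DataType",
                                             "Length", "Description",
                                             "Origin")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!any(df$Dataset == adsl)) {
    stop("No rows found for dataset ", adsl)
  }
  invisible(TRUE)
}

# Stand-in for the data frame returned by the Define-XML step.
toy_info <- data.frame(
  Name        = c("USUBJID", "USUBJID"),
  Dataset     = c("ADSL", "ADAE"),
  DataType    = "text",
  Length      = "20",
  Description = "Unique Subject ID",
  Origin      = "Assigned",
  stringsAsFactors = FALSE
)

ok <- check_variable_info(toy_info)

# Dropping columns triggers the error; capture its message for inspection.
bad <- tryCatch(
  check_variable_info(toy_info[, c("Name", "Dataset")]),
  error = function(e) conditionMessage(e)
)
print(bad)
```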
Use extract_common_adsl_variables() to identify
ADSL variables that also appear in other datasets.
common_vars <- extract_common_adsl_variables(
  variable_info_df = variable_info,
  adsl_dataset_name = "ADSL",   # Name of the ADSL dataset
  include_sas_info = TRUE,      # Include SAS field and length info
  min_domains_count = 1,        # Minimum number of domains
  sort_by = "domains_count"     # Sort by number of domains
)
# View the first few rows
head(common_vars)
#>   Variable SASFieldName DataType Length            Description  Origin
#> 1      AGE          AGE  integer      8                    Age Derived
#> 2   AGEGR1       AGEGR1     text      5     Pooled Age Group 1 Derived
#> 3  AGEGR1N      AGEGR1N  integer      8 Pooled Age Group 1 (N) Derived
#> 4     RACE         RACE     text     32                   Race Derived
#> 5    RACEN        RACEN  integer      8               Race (N) Derived
#> 6      SEX          SEX     text      1                    Sex Derived
#>   CodelistOID DomainsCount               DomainsFound  SASLength
#> 1                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 2                        4 ADADAS, ADAE, ADLBC, ADTTE    text(5)
#> 3                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 4                        4 ADADAS, ADAE, ADLBC, ADTTE   text(32)
#> 5                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 6                        4 ADADAS, ADAE, ADLBC, ADTTE    text(1)
Customizing the results
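Because the result is an ordinary data frame, it can also be filtered and reordered after the fact with base R. The sketch below does not call adrgOS; it uses a small hand-built stand-in with the `Variable`, `DomainsCount`, and `DomainsFound` columns seen in the printed output, and it assumes that `DomainsFound` is a comma-separated string, as the output suggests.

```r
# Hand-built stand-in for a result data frame (illustrative values only).
res <- data.frame(
  Variable     = c("AGE", "COMP24FL", "DSRAEFL", "EFFFL"),
  DomainsCount = c(4L, 2L, 1L, 1L),
  DomainsFound = c("ADADAS, ADAE, ADLBC, ADTTE", "ADADAS, ADLBC",
                   "ADLBC", "ADADAS"),
  stringsAsFactors = FALSE
)

# Keep variables found in at least two other datasets.
frequent <- res[res$DomainsCount >= 2, ]

# Keep variables that are carried into ADLBC specifically: split the
# comma-separated string and test for an exact dataset-name match.
found_list <- strsplit(res$DomainsFound, ", ", fixed = TRUE)
in_adlbc <- res[vapply(found_list, function(d) "ADLBC" %in% d, logical(1)), ]

# Order by descending count, breaking ties alphabetically by variable name.
ordered <- res[order(-res$DomainsCount, res$Variable), ]

print(frequent$Variable)
print(in_adlbc$Variable)
```

Splitting the string and matching whole names avoids false positives from substring matches, for example `ADLB` matching inside `ADLBC`. The function's own arguments, described in the following subsections, apply comparable filtering and sorting at extraction time.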
Filtering by domain frequency
To restrict the results to variables that appear in at least 3 other domains:
high_freq_vars <- extract_common_adsl_variables(
  variable_info_df = variable_info,
  min_domains_count = 3
)
head(high_freq_vars)
#>   Variable SASFieldName DataType Length            Description  Origin
#> 1      AGE          AGE  integer      8                    Age Derived
#> 2   AGEGR1       AGEGR1     text      5     Pooled Age Group 1 Derived
#> 3  AGEGR1N      AGEGR1N  integer      8 Pooled Age Group 1 (N) Derived
#> 4     RACE         RACE     text     32                   Race Derived
#> 5    RACEN        RACEN  integer      8               Race (N) Derived
#> 6      SEX          SEX     text      1                    Sex Derived
#>   CodelistOID DomainsCount               DomainsFound  SASLength
#> 1                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 2                        4 ADADAS, ADAE, ADLBC, ADTTE    text(5)
#> 3                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 4                        4 ADADAS, ADAE, ADLBC, ADTTE   text(32)
#> 5                        4 ADADAS, ADAE, ADLBC, ADTTE integer(8)
#> 6                        4 ADADAS, ADAE, ADLBC, ADTTE    text(1)
Excluding SAS-specific information
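In the printed outputs, the `SASLength` column looks like `DataType` and `Length` combined as `type(length)`. Assuming that is how it is built (an inference from the printed output, not something confirmed from the package source), the same string could be reproduced in base R:

```r
# Values copied from the printed output in this vignette.
vars <- data.frame(
  Variable = c("AGE", "AGEGR1", "RACE"),
  DataType = c("integer", "text", "text"),
  Length   = c("8", "5", "32"),
  stringsAsFactors = FALSE
)

# Assumed format: "<DataType>(<Length>)".
vars$SASLength <- sprintf("%s(%s)", vars$DataType, vars$Length)
print(vars)
```

If that assumption holds, the column can be rebuilt on demand from `DataType` and `Length`, which both remain in the output when SAS-specific fields are dropped.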
If you prefer a simpler output without SAS field names and lengths,
set include_sas_info = FALSE:
vars_no_sas <- extract_common_adsl_variables(
  variable_info_df = variable_info,
  include_sas_info = FALSE,
  sort_by = "variable_name"
)
head(vars_no_sas)
#>   Variable DataType Length                           Description  Origin
#> 1      AGE  integer      8                                   Age Derived
#> 2   AGEGR1     text      5                    Pooled Age Group 1 Derived
#> 3  AGEGR1N  integer      8                Pooled Age Group 1 (N) Derived
#> 4 COMP24FL     text      1 Completers of Week 24 Population Flag Derived
#> 5  DSRAEFL     text      1               Discontinued due to AE? Derived
#> 6    EFFFL     text      1              Efficacy Population Flag Derived
#>   CodelistOID DomainsCount               DomainsFound
#> 1                        4 ADADAS, ADAE, ADLBC, ADTTE
#> 2                        4 ADADAS, ADAE, ADLBC, ADTTE
#> 3                        4 ADADAS, ADAE, ADLBC, ADTTE
#> 4                        2              ADADAS, ADLBC
#> 5                        1                      ADLBC
#> 6                        1                     ADADAS
Working with sample data
When testing or demonstrating functionality, you can create a simple sample data frame:
sample_data <- data.frame(
  Name = c("STUDYID", "USUBJID", "AGE", "SEX",
           "STUDYID", "USUBJID", "AVAL",
           "STUDYID", "USUBJID", "PARAMCD"),
  Dataset = c("ADSL", "ADSL", "ADSL", "ADSL",
              "ADAE", "ADAE", "ADAE",
              "ADLBC", "ADLBC", "ADLBC"),
  DataType = c("text", "text", "integer", "text",
               "text", "text", "float",
               "text", "text", "text"),
  Length = c("12", "20", "3", "1",
             "12", "20", "8",
             "12", "20", "8"),
  Description = c("Study Identifier", "Unique Subject ID", "Age", "Sex",
                  "Study Identifier", "Unique Subject ID", "Analysis Value",
                  "Study Identifier", "Unique Subject ID", "Parameter Code"),
  Origin = c("Assigned", "Assigned", "Collected", "Collected",
             "Assigned", "Assigned", "Derived",
             "Assigned", "Assigned", "Derived"),
  stringsAsFactors = FALSE
)
# Extract from sample data
result <- extract_common_adsl_variables(sample_data)
print(result)
#>   Variable SASFieldName DataType Length       Description   Origin CodelistOID
#> 1  STUDYID      STUDYID     text     12  Study Identifier Assigned
#> 2  USUBJID      USUBJID     text     20 Unique Subject ID Assigned
#>   DomainsCount DomainsFound SASLength
#> 1            2  ADAE, ADLBC  text(12)
#> 2            2  ADAE, ADLBC  text(20)
Conclusion
This vignette demonstrated how to use
extract_common_adsl_variables() to identify and summarize
ADSL variables that recur across analysis domains.
The function helps verify data consistency and understand variable
reuse in clinical datasets.