SEND Data Conversion | Linked Data Validation for Preclinical Studies

The IRI creation method and values on this page are outdated as of 2020-02-05. Text to be updated.

Source Data

Publicly available preclinical study data was copied from the PHUSE TestDataFactory GitHub repository to the SENDConform repository for use in this project. The SAS transport XPT data files are available within the folder structure at:

    SENDConform/data/studies/Study Name

Source SAS transport files are read in by the conversion scripts and converted to Terse RDF Triple Language (Turtle, TTL) format. The resulting files are located in the /ttl folder:

    SENDConform/data/studies/Study Name/ttl

The /csv folder contains data in comma-delimited format for ease of viewing the data in Excel. These files are raw, source data and are not used for mapping into the triplestore.

    SENDConform/data/studies/Study Name/csv

The Demographics (DM) and Trial Summary (TS) domains from study CJ16050 “RE Function in Rats” (/data/studies/RE Function in Rats) are used for initial development and testing. The original XPT files are converted to Terse RDF Triple Language (TTL) format using the driver script r/driver.R. Additional data conversion methods using SAS or Python may be developed, time and expertise permitting.

The easiest approach is to convert the row-by-column source data to RDF using column names to identify the types of entities, rows as individuals, and each cell as a value for that individual. The superior approach taken in this project is to re-form the data to match ontologies that describe the types of entities and their relationships in the study, based on knowledge of both the data and the clinical trial process.
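
For contrast, the row-by-column approach can be sketched in a few lines. This is a minimal Python illustration, not the project's R code. The ex: prefix, the Row class name, and the sample rows are all hypothetical.

```python
def rows_to_triples(rows, prefix="ex:", row_type="Row"):
    """Naive conversion: one individual per row, one predicate per column."""
    triples = []
    for i, row in enumerate(rows, start=1):
        subject = f"{prefix}{row_type}_{i}"
        triples.append((subject, "rdf:type", f"{prefix}{row_type}"))
        for column, value in row.items():
            if value is None or value == "":
                continue  # a missing cell produces no triple
            triples.append((subject, f"{prefix}{column}", f'"{value}"'))
    return triples


# Hypothetical sample rows, for illustration only
sample_rows = [
    {"subjid": "00M01", "sex": "M", "rfendtc": ""},
    {"subjid": "00M02", "sex": "M", "rfendtc": "2016-12-07"},
]
naive_triples = rows_to_triples(sample_rows)
```

The result mirrors the table layout and carries no knowledge of what an animal subject or a reference interval is, which is why the project maps the data to ontologies instead.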

The converted RDF data used for developing SHACL is available here: SHACL/CJ16050Constraints/DM-CJ16050-R.TTL

Instructions on how to create validation reports in Stardog are available on the SEND Data Validation page.

Data Augmentation for Test Cases

The data conversion process adds observations that violate the various SHACL shape rule components. Test observations are identified by subjid and usubjid values containing the pattern 99Tn, in contrast to the original study data values of 00M0n. Test cases are documented in the file TestCases.xlsx.
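
The naming pattern makes test observations easy to separate from original study data. The following Python sketch assumes that n in 99Tn stands for one or more digits. The usubjid value shown in the usage lines is a hypothetical format.

```python
import re

# Assumption: "n" in the pattern 99Tn is one or more digits
TEST_PATTERN = re.compile(r"99T\d+")


def is_test_observation(subject_id: str) -> bool:
    """True when a subjid or usubjid value marks an added test observation."""
    return TEST_PATTERN.search(subject_id) is not None


is_test_observation("99T1")          # subjid of a test observation
is_test_observation("CJ16050_99T5")  # hypothetical usubjid format
is_test_observation("00M01")         # original study data
```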

Animal Subject IRIs

At first it may seem reasonable to use subjid or usubjid when forming IRIs for Animal Subjects. IRI creation is simple and the human-readable value facilitates traceability back to the original source. For example, the IRI for Subject 00M01 would be: cj16050:Animal_00M01

However, the use of subjid or usubjid is fraught with problems. Consider cases where:

A subjid is accidentally re-used and assigned to more than one animal. Two unique individuals would have the same ID number and the resulting RDF would have all observations assigned to a single IRI. It would be difficult to detect this duplication after data is converted to the graph.
The same animal is accidentally assigned two different subjid values. Observations that belong to one animal are then incorrectly split across two separate individuals.
A row of data is accidentally duplicated, a condition that could go undetected when converting the data to RDF.
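
The first problem is easy to reproduce. In this minimal Python sketch, two source rows that describe different animals share a subjid, so both end up under a single IRI. The sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_subjid_iri(rows, prefix="cj16050:Animal_"):
    """Key each row by an IRI built from subjid, as the naive approach would."""
    graph = defaultdict(list)
    for row in rows:
        graph[prefix + row["subjid"]].append(row)
    return dict(graph)


# Hypothetical data: the same subjid accidentally assigned to two animals
rows = [
    {"subjid": "00M01", "sex": "M"},
    {"subjid": "00M01", "sex": "F"},
]
merged = group_by_subjid_iri(rows)
# Both rows now sit under cj16050:Animal_00M01, and nothing in the graph
# records that they came from two separate source rows.
```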

A solution is to create IRIs for critical components like Animal Subject and Reference Interval that are independent from values in the source data. For the purpose of this prototype, an SHA-1 hash of a randomly generated value (with a known seed value) is used to create select IRIs where missing, duplicate, or partial data would be problematic. The long hash value is truncated to eight characters for ease of reference and discussion in this prototype.

When this method of IRI generation is followed:

IRIs remain constant throughout multiple project development runs over time.
IRIs for subjects, intervals, and other critical components become independent of the source data.
Testing for duplicate, missing, and incorrect instance data becomes possible thanks to IRIs that are independent from instance data.

Example Animal Subject IRI: cj16050:Animal_a6d09184
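
The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's R implementation: the seed value 1234 and the function name are hypothetical, and the resulting hashes will not match those in the project data.

```python
import hashlib
import random


def make_short_hashes(n_rows, seed=1234, length=8):
    """One truncated SHA-1 hash per source row, reproducible for a given seed."""
    rng = random.Random(seed)  # known seed, so repeated runs give the same values
    hashes = []
    for _ in range(n_rows):
        value = str(rng.random())  # random value, independent of the source data
        digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
        hashes.append(digest[:length])  # truncate for ease of reference
    return hashes


animal_iris = [f"cj16050:Animal_{h}" for h in make_short_hashes(3)]
```

Truncating to eight hexadecimal characters raises the chance of a collision as the number of rows grows. That is acceptable for a small prototype data set but would need review for larger ones.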

Methods to generate UIDs for subjects in real-world settings are beyond the mandate of this project. See the Technical Details page of the project Unique Identifiers for the Pharmaceutical Industry for more information on generating unique identifiers.

Reference Interval IRIs

The model for Reference Intervals for Animal Subjects is not intuitive and requires some explanation. Date values for reference start date (rfstdtc) and reference end date (rfendtc) are not directly attached to the Animal Subject IRI. Rather, the Animal Subject IRI cj16050:Animal_hashvalue is attached to a Reference Interval IRI cj16050:Interval_hashvalue which in turn has two date IRIs attached via the time:hasBeginning and time:hasEnd predicates (Figure 1).

Figure 1: Animal_99T1 (incomplete data)

Reference Interval IRIs are created even when the start date or end date is missing (Figure 2), because the data for the corresponding non-missing date must still be represented in the graph. A Reference Interval is also created when both start and end dates are missing, showing that the concept of the interval is still present but the data supporting it is not available.

Figure 2: Animal_99T5 Missing rfendtc

See SHACL Shapes for how validation shapes are constructed based on this model.
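
The shape of this model, including the missing-date cases, can be sketched in Python. The predicates study:hasReferenceInterval, time:hasBeginning, and time:hasEnd come from this page. The Date_ IRI naming and the sample date are assumptions made for illustration.

```python
def reference_interval_triples(short_hash, rfstdtc=None, rfendtc=None):
    """Triples linking an Animal Subject to its Reference Interval and dates."""
    animal = f"cj16050:Animal_{short_hash}"
    interval = f"cj16050:Interval_{short_hash}"
    # The interval is always created, even when both dates are missing
    triples = [(animal, "study:hasReferenceInterval", interval)]
    if rfstdtc:
        # Hypothetical naming convention for date IRIs
        triples.append((interval, "time:hasBeginning", f"cj16050:Date_{rfstdtc}"))
    if rfendtc:
        triples.append((interval, "time:hasEnd", f"cj16050:Date_{rfendtc}"))
    return triples


# Missing rfendtc: the interval and its start date are still represented
partial = reference_interval_triples("a6d09184", rfstdtc="2016-12-07")
# Both dates missing: only the link to the interval remains
empty = reference_interval_triples("a6d09184")
```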

RDF Conventions

Labels

skos:prefLabel is the primary label used in the graph. For controlled terms, skos:prefLabel contains the industry standard (CDISC) label, which is often in plural form (DAYS, WEEKS, etc.) while rdfs:label contains the W3C standard in singular form (DAY, WEEK, etc.). rdfs:label is optional for all other triples.

Additional RDF Conventions will be added.

Conversion Details

R Programs

R scripts for data conversion are located in the /r folder, directly below the project repository root folder in GitHub. (See Project Repository Structure for project folder structure details.)

Order	File	Description
1.	driver.R	Main driver program for data conversion. Graph metadata creation.
2.	DM-convert.R	DM instance data conversion to TTL, addition of observations to test constraints. (Under construction)
3.	TS-convert.R	TS instance data conversion to TTL, addition of observations to test constraints. (Not yet written)

Graph Metadata

Graph metadata, including data conversion date and graph version, is created within the driver.R script and exported to a TTL file for upload into the triplestore. A corresponding .csv file is created for SMS mapping purposes.

The .csv and .ttl files are located in the folder: \data\studies\Study Name\ttl

File	Role	Description
Graphmeta-StudyName.csv	Basic graph metadata	Description of graph content, status, version, and time stamp information.
Graphmeta-StudyName-map.TTL	SMS Map	Map CSV to Stardog graph.
Graphmeta-StudyName.TTL	RDF Triples	TTL file for loading directly into triplestore.

DM

The .csv and .ttl files are located in the folder: \data\studies\Study Name\ttl

File	Role	Description
DM-CJ16050.CSV	Demographics	May be a subset during development.
DM-CJ16050-R-map.TTL	SMS Map	Map CSV to Stardog graph.
DM-CJ16050-R.TTL	RDF Triples	TTL file for loading directly into triplestore.

Considerations for Study: CJ16050

Data Imputation

Values that are not in the original study data, or that are located in domains outside the pilot, are imputed as follows:

Variable	Value(s)	Description
SPECIESCD_IM	“Rat”	Species Code not specified in DM data file.
AGEUNIT_IM	“Week”	A representation of the age unit that is used to link to time namespace.
DURATION_IM	“P56D”	Duration code, derived from 8 weeks x 7 days/wk.
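
The DURATION_IM derivation is simple arithmetic: 8 weeks x 7 days/week = 56 days, written as the ISO 8601 duration P56D. A minimal Python sketch follows. The function name and the unit lookup table are hypothetical.

```python
# Days per age unit; only the units needed for this illustration
DAYS_PER_UNIT = {"DAY": 1, "WEEK": 7}


def iso_duration_days(age, unit):
    """Express an age as an ISO 8601 duration in days, e.g. 8 weeks as P56D."""
    return f"P{age * DAYS_PER_UNIT[unit.upper()]}D"


duration_im = iso_duration_days(8, "Week")
```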

Data

/data/studies/RE Function in Rats/ttl

File	Description	Contact
cj16050.ttl	Instance data file. Outdated as of 2019-08-02
cjprot.ttl	Nonclinical study protocol file for study CJ16050	AO
cj160500send.shapes.ttl	Combines the instance file with the SEND ontology to support automated SEND dataset creation. It currently recreates the first record of the pilot DM domain. TS not yet included.	AO
DM-CJ16050-R.csv	Data file created by R for mapping DM domain data to triplestore using SMS	TW
DM-CJ16050-R.TTL	Data file created by R for direct load into triplestore	TW
DM-CJ16050-R-map.TTL	SMS map for DM-CJ16050-R.csv to Stardog
Graphmeta-CJ16050.csv	Graph metadata file for mapping to triplestore using SMS	TW
GraphMeta-CJ16050.TTL	Graph metadata file for direct load into triplestore	TW
Graphmeta-CJ16050-map.TTL	SMS map for Graphmeta-CJ16050.csv to Stardog	TW
SENDConform-CJ106050LoadDriver.bat	Driver .BAT file that calls SENDConform-CJ106050LoadSequence.bat. Needed for Windows.	TW
SENDConform-CJ106050LoadSequence.bat	Loads data into Stardog using a series of SMS calls.
study.ttl	Ontology file from the CTDasRDF project, updated to support nonclinical data	AO
send.ttl	“bare bones” SEND ontology to allow exporting protocol and instance data into SEND format.	AO

TS

Future development

Data Mapping with Stardog SMS

Stardog Mapping Syntax (SMS) (stardog.com) is provided as an alternative data mapping and upload process. The same data conversion scripts that produce TTL files for upload into a triplestore also create a .CSV file that can be mapped to the database. The .CSV files do not contain the full set of data for evaluating the test cases.

Why create this additional data file when it does not contain the full set of values needed to evaluate the test cases? The team benefits from having an R Shiny app that reads in the SMS file and produces a visualization of the data schema. This schema is used during development to help ensure the nodes and relations are being constructed correctly. The visualization also aids in SHACL Shape and SPARQL query development.

Conversion and Mapping Details

The source data and R scripts used to create the .CSV files used by the SMS maps are documented earlier on this page, including generation of values like SHA-1 hashes used in both TTL and SMS methods.

Each CSV file has a corresponding map file in TTL format with “-map” appended to the name.

Graph Metadata

File	Role	Description
Graphmeta-StudyName.CSV	Basic graph metadata	Description of graph content, status, version, and time stamp information.
Graphmeta-StudyName-map.TTL	SMS Map	Map CSV to Stardog graph.

DM

File	Role	Description
DM-CJ16050.CSV	Demographics	May be a subset during development.
DM-CJ16050-R-map.TTL	SMS Map	Map CSV to Stardog graph.

SMS Format

The SMS files follow formatting rules that go beyond the Stardog specification, primarily due to weak parsing expressions in the R Shiny visualization code (this can easily be improved!). These rules include:

The subject is hard left on a line by itself.
Each predicate, object line:
- is indented at least one space.
- ends with a ; on the same line, with no trailing spaces.
No shorthand for predicates. Use ‘rdf:type’, not ‘a’.
The file must end with a carriage return on a line by itself.
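
These rules are mechanical enough to check automatically. The following Python sketch is a hypothetical helper, not part of the project. It interprets the last rule as "the file ends with a newline" and skips blank lines, comments, prefix declarations, and the closing "." line.

```python
def lint_sms(text):
    """Return a list of formatting problems found in SMS mapping text."""
    problems = []
    # Assumption: the final rule means the file ends with a newline
    if not text.endswith("\n"):
        problems.append("file does not end with a newline")
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, the closing ".", comments, and prefix declarations
        if not stripped or stripped == "." or stripped.startswith(("#", "@prefix", "PREFIX")):
            continue
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
        if line[0] in " \t":
            # Indented line: predicate and object
            if stripped.split()[0] == "a":
                problems.append(f"line {number}: use rdf:type instead of 'a'")
            if not stripped.endswith(";"):
                problems.append(f"line {number}: must end with ';'")
        elif len(stripped.split()) != 1:
            problems.append(f"line {number}: subject must be on a line by itself")
    return problems


good = (
    "cj16050:Animal_{DMROWSHORTHASH_IM}\n"
    "  rdf:type  study:AnimalSubject ;\n"
    '  skos:prefLabel  "Animal {subjid}"^^xsd:string ;\n'
    ".\n"
)
bad = "cj16050:Animal_{X} a study:AnimalSubject ;\n  a study:AnimalSubject ."
good_problems = lint_sms(good)
bad_problems = lint_sms(bad)
```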

This excerpt from the DM domain mapping file shows the AnimalSubject triples. Values within { } are substituted from the named columns in the .CSV file as the file is processed line-by-line.

Animal Subject

cj16050:Animal_{DMROWSHORTHASH_IM}
  rdf:type                    study:AnimalSubject ;
  skos:prefLabel              "Animal {subjid}"^^xsd:string ;
  study:hasReferenceInterval  cj16050:Interval_{DMROWSHORTHASH_IM} ;
  study:hasSubjectID          cj16050:SubjectIdentifier_{subjid} ;
  study:hasUniqueSubjectID    cj16050:UniqueSubjectIdentifier_{usubjid} ;
  study:memberOf              cjprot:Set_{setcd} ;
  study:memberOf              code:Species_{SPECIESCD_IM} ;
  study:participatesIn        cj16050:AgeDataCollection_{DMROWSHORTHASH_IM} ;
  study:participatesIn        cj16050:SexDataCollection_{DMROWSHORTHASH_IM} ;
.
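
The substitution step can be imitated in Python to preview what a mapping line produces for a given CSV row. This only illustrates the placeholder mechanism and is not how Stardog implements it. The pairing of hash a6d09184 with subjid 00M01 is hypothetical.

```python
import re

# Matches {COLUMN_NAME} placeholders in a mapping line
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, row):
    """Replace each {column} placeholder with that column's value from the row.

    Raises KeyError when the template names a column the row does not have.
    """
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


row = {"DMROWSHORTHASH_IM": "a6d09184", "subjid": "00M01"}
subject = fill_template("cj16050:Animal_{DMROWSHORTHASH_IM}", row)
label = fill_template('"Animal {subjid}"^^xsd:string', row)
```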

Data Upload using SMS

Mapping and upload is accomplished by issuing a series of import commands similar to the following, where a database named SENDConform is already present. The database name is specified after the import parameter and followed by the mapping file and CSV file:

stardog-admin virtual import SENDConform DM-CJ16050-R-map.TTL DM-CJ16050-R.CSV

A series of these commands is chained together in a batch file to upload all graphs at the same time, including additional files like the supporting ontology.
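
The command list for such a batch file can also be generated rather than typed by hand. The Python sketch below only builds the command strings and does not run them. The helper name is hypothetical, and the file pairs are taken from the tables on this page.

```python
import shlex


def import_commands(database, pairs):
    """Build one 'stardog-admin virtual import' command per (map, csv) pair."""
    return [
        shlex.join(["stardog-admin", "virtual", "import", database, map_file, csv_file])
        for map_file, csv_file in pairs
    ]


commands = import_commands(
    "SENDConform",
    [
        ("DM-CJ16050-R-map.TTL", "DM-CJ16050-R.CSV"),
        ("Graphmeta-CJ16050-map.TTL", "Graphmeta-CJ16050.csv"),
    ],
)
```

Writing the strings in commands to a .bat file, one per line, gives a load sequence similar to the one described above.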

Visualization

An RShiny app for visualizing the SMS files is available at /r/vis/SMSMapVis-appSEND. Paths within the file global.R must be changed to point to your local clone of the repository. Figure 1 shows a screenshot of the SMS files for the DM and Graph Metadata portions of the graph.

Figure 1: Screen shot from RShiny SMS visualization.