1 Introduction

Discrepancies in statistical analysis results have been discovered between different programming languages, even in fully qualified statistical computing environments. Subtle differences exist between the fundamental approaches implemented by each language, yielding results that differ yet are each correct in their own right. The existence of these differences causes unease for sponsor companies when submitting to a regulatory agency, as it is uncertain whether the agency will view them as problematic. In its Statistical Software Clarifying Statement, the US Food and Drug Administration (FDA) states that “FDA does not require use of any specific software for statistical analyses” and that “the computer software used for data management and statistical analysis should be reliable.” Observing differences across languages can reduce an analyst’s confidence in that reliability; understanding the source of any discrepancy can restore it.

This white paper aims to empower analysts to make informed choices on the implementation of statistical analyses when multiple languages yield different results. Our objective is not to prescribe what that choice should be, but rather to provide guidance on the types of questions an analyst should ask in order to identify the fundamental sources of discrepant results. These discrepancies may exist for a variety of reasons, which this paper will explore and illustrate with examples.

In this context, the risk of misinterpreting numerical differences in analysis results that arise solely from the choice of programming language can be mitigated, instilling confidence in both the sponsor company and the agency during the review period.

  • WIP Note: I don’t like this sentence but I need to think about how I’d actually want to change it.

1.1 Motivation

1.2 Background

As clinical data analytics evolves within the pharmaceutical industry, a large and noteworthy contingent of people and organizations have explored various computational technologies in an effort to reimagine how the story of the data collected during a clinical trial is told. These technologies, whether available commercially or as open source, offer new potential in a sponsor company’s ability to discover new medicines and demonstrate that they can be safely and effectively administered to patients for a given indication. We see applications of machine learning and artificial intelligence built into exploratory analyses, as well as automation of conventional reporting pipelines, both as expanded offerings of commercial products and through tools developed and shared as open source. We are witnessing a transformation in how clinical insights are delivered: from flat data files of rows and columns and compiled PDF reports to dynamic visualization platforms that enable a reviewer to explore the trial database in a three-dimensional way. And, most notably, because the tools that other industries most commonly use for these ‘new’ ways of data engineering, data analytics, and data reporting are often built on programming languages not historically used within the pharmaceutical industry, we are experiencing a dramatic shift away from dependence on a small set of commercially available solutions and toward embracing many languages, building and using the best-fit tools to extract the most knowledge from clinical data.

  • WIP Note:
  • Any value to discussing the changing demographic of analysts in the industry – R vs SAS as grad experience, for example
  • Want to trim this back and make it more concise. I like the storytelling piece, but I think we should also focus in on access to new/advanced statistical methods and the rapid advancements of open-source data analysis tools over the past 10 years or so.

This last piece has brought to light an element of our data analytics that was previously overlooked due to an overdependence on a single solution from one programming language. Within the clinical reporting pipeline (transforming patient-level clinical trial data from collection to submission), the industry has historically relied on comparing results to an independently generated second set of results as the primary form of quality control (QC). In the early years, comparisons were made on paper, and a human thoroughly verified that the number in the table matched the number independently derived by a second programmer. As technology progressed, electronic comparison of the output data presented in a table reduced the risk that the validator missed a discrepancy through human error. The theory is that if two people put the same inputs through two independently developed processes (the code) and achieve the same outcome, then the outcome must be right. It is not a perfect system; it can produce false positives, in which both implementations agree on a result that is wrong in the same way. Still, efficiencies were gained and quality improved.
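
As a simple illustration of the electronic comparison step, consider the R sketch below. This is a minimal, hypothetical example: the data frames production and qc stand in for the two independently derived result sets, and real pipelines would compare full output datasets.

    # Two hypothetical result sets for the same table, derived independently
    # by the production programmer and the QC programmer.
    production <- data.frame(param = c("n", "mean"), value = c(100, 12.5))
    qc         <- data.frame(param = c("n", "mean"), value = c(100, 12.5))

    # all.equal() returns TRUE when the result sets agree; otherwise it
    # describes the differences, which are surfaced for human review.
    all.equal(production, qc)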

However, until recently the QC process has nearly always been implemented with the same programming language used both for the generation of results (‘production’) and for independent QC. The industry’s shift toward exploring other languages has now raised questions such as “What if the numbers don’t match? Which is correct?”

For example, consider a comparison of rounding rules between SAS® and R. It is becoming well understood that the default rounding rules (implemented in each language’s round() function) differ, but only when the number being rounded is equidistant between the two possible results. The round() function in SAS rounds the number ‘away from zero’, meaning that 12.5 rounds to the integer 13. The round() function in Base R rounds the number ‘to even’, meaning that 12.5 rounds to the integer 12. SAS also provides the rounde() function, which rounds to even, and the janitor package in R contains round_half_up(), which rounds away from zero. In this use case, SAS produces a correct result from its round() function, based on its documentation, and so does R. Both are right according to what they document, yet they produce different results.
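
The difference can be demonstrated directly in R. Below is a minimal sketch; it assumes the janitor package is installed, and the corresponding SAS behavior is noted in comments rather than executed:

    # Base R round() follows 'round half to even' (per the IEC 60559 standard):
    round(12.5)            # returns 12
    round(13.5)            # returns 14

    # janitor::round_half_up() rounds halfway cases away from zero, matching
    # the default behavior of the SAS round() function (12.5 rounds to 13):
    library(janitor)
    round_half_up(12.5)    # returns 13
    round_half_up(13.5)    # returns 14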

  • WIP Note: Can we make a table to illustrate this?
  • I want to see how the rest of the paper pans out, but I think we could move this into use cases for discussion. Referencing rounding, I think it’s important to also note that round-to-even is based on the IEC 60559 standard

Now, the analyst has a choice to make if both R and SAS are in their toolbox – how do I round this result? To answer this question, the analyst needs to understand the rationale behind the round-to-even rule, the round-away-from-zero rule, and any other rounding rules that may exist. To our knowledge, this ‘how do I round’ question had never been asked with respect to clinical trial reporting until the difference between the R and SAS default rounding rules was discovered. The ‘correct’ answer is up to the analyst to determine and justify. It likely depends on considerations such as the impact on the resulting data story about the safety and efficacy of the investigational product.

Why should the analyst care? Why does it matter? One answer is that they want to tell the most accurate story of their data. Another answer, perhaps more important in the highly regulated pharmaceutical industry, is that a third-party reviewer will be assessing the integrity of the data. If the reviewer attempts to reproduce the same results and chooses a different language, the analyst needs to be able to explain why results may differ; otherwise, the integrity of the entire package may be questioned. By fully understanding the implications of choosing a statistical modeling implementation in Language A vs Language B, the analyst can communicate the rationale for the choice, based on sound statistical reasoning, and instill confidence in the regulatory body reviewing the submitted data.

It should be noted that in what follows, it is assumed that statistical packages and routines perform in a manner consistent with their documentation. The question at hand is not whether the procedures are accurate or reliable, but rather in what ways similar implementations across languages differ. Hence, we are not concerned with another major area of discussion within the industry – the so-called validation of packages and software.

1.3 Other Readings

  • Perhaps cite the TransCelerate MoA project
  • Perhaps cite other working groups or published conference proceedings