10 Numerical Matching

10.1 What if open source languages and commercial software (such as SAS), do not match numerically with the analysis results they give?

What if the same inputs yield similar, but numerically different results?
What if the same inputs yield drastically different results?
What is the truth? Which is correct? Can any be considered ‘correct’?

10.2 Historical Background

The statistical analysis system SAS was developed by a collective of eight USA southern state universities in the late 1960’s. The SAS Institute Inc. was founded in 1976, and their first product release of Base SAS, consisted of approximately 300,000 lines of code¹. During the first 30 years of existence, SAS software became renowned in the highly regulated medical research industry, for its well documented methodology and its high quality, reproducible, robust and reliable analysis implementation. This made it the number one data analysis tool in the pharmaceutical industry and a gold standard for regulatory submissions around the world.

R has been available since 1993 with packages being added continuously, providing a continual development cycle and user-led package development². This open-source language has a package repository called CRAN (Comprehensive R Archive Network) with over 21000 packages in it. The momentum and adoption of R in Pharma is growing, likely due to:

it being a common tool used by students, academics and other industries,
it’s ability to produce interactive reporting and graphics
the large quantity of open source development of packages, which is often conducted through github repositories. Recently, the pharmaverse³ has been created which pulls together multiple packages designed specifically to resolve the needs of conducting analysis in medical research with a focus on the regulatory needs.

It is commonly found that if the user does not change the default options in SAS and R for a particular analysis, then the results that are output are different! But why is this …?!

SAS and R would have experienced entirely different challenges during their early development periods. For SAS, latency (speed to do computations) was very low compared to modern computers. The time for a computer to implement a complex statistical analysis (or even a simple one) would have initially been very long! This would have made numerical approximations (faster algorithms) very attractive in order to reduce computational time required to conduct analysis. As computational power improved, the speed to do analysis rapidly improved. The need to simplify an analysis to an approximation or more simple algorithm was less important. Hence SAS increased its functionality adding more methods. However, due to its rigorous reproducibility and backwards compatibility commitments, the ‘default’ method remained as the original, and new methods (which were often more complex) were added as options to the original SAS procedures.

With R being developed in the 1990’s, it didn’t have such a restriction on speed of computation. This means that the methods that R defaults to, are often the ones that were most commonly used when that package was developed, or that were documented in the literature with better performance compared to an older methodology.

In conclusion, if you write code in SAS and R without specifying fully your analysis using the optional parameters or knowing your default options, it is very likely that the analysis they conduct could be different. In many cases, adding detail to your code (Specifying all options clearly), specifies exactly the analysis you want SAS or R to do, which then ensures that SAS and R are applying the same methods and you get the same results. However, there are still cases where some analyses are only available in SAS and some only in R.

The PHUSE CAMIS Project⁴ compares default analysis methods in SAS, R and Python, documents which options need to be specified in order to obtain a reproduction of the same analysis and identifies cases where software can not replicate the same analysis.

10.3 Why do we need languages / software to align on results and does it need to be identical?

10.3.1 Statistical Interpretation Perspective

From a Statistical Interpretation Perspective, results obtained from different languages or software which are in line (similar) but not identical would be interpreted in the same way. Hence the conclusion of the trial would not be affected by lack of identical replication of statistical analysis results.

In fact, it is common for regulators to require sensitivity analysis to explore how robust the analysis is to handling of missing data and model assumptions. Therefore, we would hope that applying slightly different methods, would not change interpretation of the results, otherwise we may have concerns regarding the robustness of our analysis and the transferability of those results to the real world.

If results did differ by a clinically important amount, then it would be very important to further investigate and justify the differences.

10.3.2 In Practice !

Despite the above, in practice, the medical research industry is governed by strict processes and Standard Operative Procedures (SOPs) to ensure quality of statistical analysis. For this reason, many companies apply a double programming approach to ensure 100% independent replication of results. This requires a full identical match to be obtained when two programmers do the same work independently. In addition, if the initial work is done by a Contract Research Organization (CRO), working with a pharmaceutical company, the analysis may be programmed a third time to replicate results in the two different companies systems. Finally, when results are submitted to the regulators, they will also program the results and attempt replication.

Therefore, in all these cases if a full identical match cannot be obtained, this introduces uncertainty, apprehension and nervousness about the results being presented. Are they correct ? ! Why don’t we get the same results doing the same analysis in another language / software.

It is likely that the statisticians, data analysts or programmers in all organizations would have to spend substantial time looking into the discrepancy, to try to provide a full explanation and justification, or do updates so a match can be obtained, before the analysis would be passed as acceptable and trustworthy. This is often at a critical time in the drug development program and can adversely affect relationships between customers / clients and regulators, if matches cannot be easily achieved.

10.4 Possible Reasons for Analysis Results being Different

Common reasons for not obtaining a match when using different software consist of:

Input dataset differences
- Check your input datasets are identical (different input = different output !)
- NOTE: This may need to also include the sort order since some algorithms may work differently if data is differently sorted
Different model or methods used compared to your statistical analysis plan (SAP)
- Perhaps the SAP clearly specifies the analysis, however your results is not in line, due to differences in the default options, handling of missing data or tied data, modelling assumptions, method, algorithms, modelling values, convergence methods, optimization routines or parameterization it is using
- Check the software / package documentation to verify it is doing the analysis method specified in the SAP. NOTE: you may have to do your own research to fully understand how methods are being implemented in software. Replication by hand or in additional programming to see if you can test the results versus what you would expect using the equations. Check on website help pages, StackOverflow etc, to see if others have already found the issue. Contact the software company or author to request clarity on the method the software is implementing.
- Check PHUSE CAMIS Project⁴ for any additional detail already available. If you find something not already documented, please consider contributing
- NOTE: sometimes the results may match in most cases, with exception for a particular scenario in your data (small samples, tied data, missing data or thresholds being met), which results in different handling of these cases being made by the software / languages and hence different methods being used and different results obtained.
SAP does not specify enough detail of what analysis is required
- SAP is ambiguous - there are different ways to do the analysis method and the SAP does not clearly specify the method that should be applied.
- It is not sufficient to write a SAP with the SAS defaults being assumed. The method needs to be fully specified such that it can be replicated in any language or software.
- If the method is not clearly specified in the SAP, default methods may be assumed, and these may be different across software / languages.
- Update SAP to be more specific.
Same Methods are being applied but different rounding of the final result. SAS and R round differently by default. See here for more detail.
There is a bug is the software / package derivation and the results are incorrect
- This is the least commonly found issue when renown software and packages are being used. It is more common that the method being implemented is actually planned by the software / package authors to be different, rather than incorrectly programmed.

10.5 Best practice steps

The White paper⁵ describes key considerations when understanding differences in statistical methodology implementations across programming languages.

The Key Message is to ensure that the SAP fully documents the analysis method you plan to do and ensure that the software you use is implementating that analysis method.

It is no longer acceptable, to write statistical analysis plans which describe the analysis in insufficient detail, expecting readers to be using the SAS defaults. Instead SAPs need to be written with full explanation of the method being applied, including any detail on convergence methods, continuity corrections applied, and options selected. This can often be challenging since the methodology can be statistically complex and statisticians have got so used to SAS, that there is still a bias towards using the methods that SAS does as default rather than the method which is best for the analysis of your data.

10.6 Example 1 : Signed Rank Test

SAP states: For this 2-period, cross-over study comparing Treatments A and B, a wilcoxon signed-rank test will be applied.

The analysis is run in SAS and R and slightly different p-values are obtained.

After much research, it is found to be due to the following:

SAS describes its method here⁶. For samples <=20 it applies the exact method (scaled binomial distributions). For samples >20 it uses an approximation method using the student’s t distribution. It also has a method for handling of tied data.

CAMIS repo describe the R options here. Using stats package (version 3.6.2), allows you to specify if you want to use an exact method or normal approximation for any sample size. However there is no method for handling tied data, R excludes these results from the analysis.

Hence for samples without ties and with <20 observations, SAS and R p-values match, however if you do have tied results or >20 observations, it’s likely that you will get slightly different p-values in SAS and R.

SAP should be updated (for example) to say: For this 2-period, cross-over study comparing Treatments A and B, a wilcoxon signed-rank test using a normal approximation as described in the stats package (version 3.6.2) will be applied.

10.7 Example 2: Survival Analysis, Kaplan-Meier and Log Rank Test

SAP states: Kaplan-Meier plots (with 95% CIs) of Overall Survival will be presented by treatment group. Overall survival will be analyzed using a stratified log-rank test with hazard ratio and 95% CI from the cox-proportional hazards model.

The analysis is run in SAS and R and although the p-values from the Cox-proportional hazard model are in agreement, we get slightly different confidence intervals on the Kaplan-Meier plots and for the estimate of the Hazard ratio (+/- 95% CIs) estimated from the Cox-proportional hazard model.

After much research, it is found to be due to the following as documented on the CAMIS repository here.

In SAS, the default Kaplan-Meier confidence interval calculation method is using the log-log method whereas in R package survival::survfit function the default method is the log method.
In SAS, the default method for handling of ties in a Cox-propotional hazard model is to use the breslow method, however, in R package survival::coxph , the default method is efron.

Both of these issues can be resolved but adding options to the code to specify the method. However, the SAP did not specify the methods to use fully enough.

SAP should be updated (for example) to say: Kaplan-Meier plots (with 95% CIs calculated using the log-log method) of Overall Survival will be presented by treatment group. Overall survival will be analyzed using a stratified log-rank test with hazard ratio and 95% CI from the cox-proportional hazards model calculated using the efron method in the case of tied survival times.

10.8 Discussion

Contribute to the discussion here in GitHub Discussions:
Do we need to match SAS numerically when using a different language?

Contribute to the CAMIS repository here.

10.9 References

https://www.sas.com/en_us/company-information/history.html
https://royalsocietypublishing.org/doi/10.1098/rsos.221550
https://pharmaverse.org/e2eclinical/
https://psiaims.github.io/CAMIS/
https://phuse.s3.eu-central-1.amazonaws.com/Deliverables/Data+Visualisation+%26+Open+Source+Technology/WP077.pdf
https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect029.htm