3 Releasing open source

Without open-source, many of the R packages we use today would never have developed or would be kept behind company firewalls. Open-source provides a mechanism for code sharing and collaboration, which in turn means talent can flow from company to company across our industry, we prevent duplication of the same post-competitive tools, and we move closer to decrease the burden on reviewers by bringing consistency in both our code and outputs in a submission.

Within the context of clinical reporting, we are often focussed on the benefit of inter-company collaboration on packages in the clinical domain. It is important to note though that there is already a strong track record of open source tools supported by Pharma having an impact on data science beyond our industry, examples include Pfizer and caret(Max Kuhn 2010), Eli Lilly and targets(Will Landau 2022) and Genentech/Roche and the R language (Ashlee Vance 2009; Michael Lawrence 2018).

3.1 Types of intellectual property

Intellectual Property (IP) is often bucketed into pre-competitive and competitive IP (Contreras 2016), with post-competitive being a less established term we will define in this guidance. In clinical reporting, we place significant resources into the collection and presentation of information that was collected on our competitive IP in confirmatory clinical trials. In order to help separate this simpler case from pre-competitive—here we define as post-competitive in clinical reporting as a unique scenario of code that takes data generated as part of confirmatory studies (e.g. a Phase III trial) and creates an output. Post-competitive IP is where the benefits of open sourcing and encouraging between company collaboration can be more clearly differentiated from potential competitive advantage in developing new medicines.

The following summarizes the three types of IP:

Pre-competitive: IP which is not a competitive advantage. This can be a complex definition, and will require guidance from company council. For instance, data standards may clearly be pre-competitive, but for anonymised data from historical trials, or an algorithm that generates risk scores for a certain outcome could provide a competitive advantage, or be defined as pre-competitive.
Competitive IP: Relevant examples in clinical reporting would be information on a new target, molecule or algorithm that provides an advantage in the creation of new medicines, or as a standalone data product that can be monetized.
Post-competitive IP: A less common term we have defined to be where code collaboration improves the efficency of insights, rather than the creation of insights that would otherwise not be possible. In the context of PHUSE collaborators, this includes packages that take CDISC data and apply templated data steps and visualizations to prepare a CSR, like those seen in the pharmaverse.

3.2 Preparing for release

3.2.1 Timing of releases relative to project lifecycle

As a general rule arising IP (Law Insider 2022), that is IP generated as part of the project, is simpler to handle than background IP that already exists. There is often a benefit to define what you want to do, decide if it would be open sourced, and if so, start it in an open-source setting. This also helps to encourage defining a clear scope from day one, and encourage others to engage early rather than initiate additional projects that later may not be compatible without significant re-factoring.

3.2.2 Impact of where you host a project

What are the differences between GitHub organizations that host packages like phuse-org, rinpharma, ropensci, openpharma, pharmaverse, pharmar, personal organisations, company owned organisations and organisations created to host a single project?

Ultimately, the licence chosen has an impact on how a package can be used, rather than the location the code is shared from. The location though can influence how a project is perceived. If it is hosted on a GitHub organisation with the name of a pharma company, relative to a pan-company organisation, it may imply that the project is ‘Company A’s’ project rather than something they wish to co-create. As a general rule, the recommendation would be to place it in a company’s organisation if you wish to remain control of the roadmap, but look to pan-company organisations if you wish to co-create and co-own the packages trajectory. Some examples are;

Personal Github orgs
- diffdf (gowerc/diffdf) and survival (therneau/survival) are examples of two repositories used in pharma hosted in Github orgs belonging to a specific individual.
Project/Initiative Github orgs
- openpharma: Open pharma is goverened by the non-profit Open Source in Pharma and will house packages that do not want to be associated with a specific company or organisation.
- pharmaverse: A sub-set of the pharmaverse clinical reporting repositories are also hosted on the pharmaverse Github org.
- pharmaR: Houses repositories from the R Validation Hub working group.
Company Github orgs
- Many companies maintain Github orgs either at the company or department in a company level, like GSK-Biostatistics, Roche, Genentech, Novartis, Merck
Organisation Github orgs
- phuse-org: PHUSE projects and working groups from PHUSE.
- ropensci, ropenscilabs, ropensci-docs, etc: rOpenSci maintains several GitHub orgs, with rOpenSci housing mature R packages contributed by their staff, or peer-reviewed.

3.2.3 Moving from inner sourced to open source

If a package started its development on an internal git server, or a private repository on github.com, there could be some risk of exposing data either in issues, or historical commits. These could range from screenshots of patient data, tables or other business confidential information in issues, to passwords or files in the git commit history that were deleted but not purged. The recommendation is to always flatten the commit history, and wipe issues by starting a new git repository when open sourcing unless you are certain no information can be leaked.

3.3 Avoiding the misuse of other intellectual property

When discussing the open sourcing of a codebase, it is important to flag to internal counsel existing external projects, and the overlap of scope with the project you intend to release.

It is possible that decisions made during development can become license liabilities. You may at first think it’s unlikely someone would copy and paste code into an already open sourced project, but as an example of a plausible scenario that has been seen by at least one guidelines author; a team need to implement a new function in an internal codebase. This function exists in another GPL-3 copy left licenced project. To add that project would introduce multiple dependencies that aren’t used by that particular function so a member of the team decides to copy the function into the package. Then the package is later open sourced under MIT, but the copy and pasted code is ‘forgotten about’. This would be a license breach as the authors have re-released copy-left code as permissive, and have incoporated the code without flagging it’s source, changes they made, and the original licence (GPL3 requirements).

The risk on already open sourced projects can also be lessened by a Contributor Licence Agreement (CLA; see the bot contributor-assistant for an example of CLA automation). A CLA helps ensure that anyone contributing to a project acknowledges specific terms expected of contributions, like the contributions are novel code and the author will abide by the projects licence terms. In the absence of a CLA it is important to ensure that all code within the package is original, and there is no culture of cannibalising external code and infringing on people’s copyright within the development team.

3.4 Mitigating reputational risks

What are the expectations when I release a package? Are there risks to my company’s brand having abandoned non-maintained packages?

In this guidance it is suggested to open-source early, yet doing so could expose projects that are not ready for use, might be cancelled before reaching v1.0 or are never successfully adopted. The ratio of failed to successful projects is an important consideration, but a skew in that ratio being a negative indicator can be mitigated if repositories are clear on what stage of the product life cycle they are at and make use of tools to inform users if a project has been deprecated, or are looking for new maintainers to take over the project.

While transperancy on lifecycles can help to ensure no negative reactions come from early software, robust software can have a positive effect on how others view your project. ROpenSci’s statistical software review guide includes many recomendations for reviewers that you can also take and apply to your software as you prepare for a version 1.0 release. The r-pks.org guide by Hadley Wickham also contains many of the best practices users may expect in a modern R package.

3.5 Licences for releasing code

Ultimately, the licence used for a project would require in-house counsel guidance on what licence is preferred.

All code open-sourced should have a licence. The licence has a standard location of being a text file called ‘LICENSE’ in the root of the project folder, a text file called ‘LICENSE.txt’, or a markdown file called ‘LICENSE.md’. Of particular note is that R packages often have the licence specified in the R specific location of the DESCRIPTION file, or may have it in both the standard and R specific locations (in rare cases these can also contradict so it is important to read both).

Generally, permissive licences are more common in clinical reporting, with the majority of pharmaverse R packages using an MIT or Apache 2.0 license. These licences allow distribution, commercial use and modification. One primary difference between MIT and Apache 2.0 is that the latter has patent protection language and rules around trademark usage, and may be preferred in larger projects due to its focus on more explicitly spelling out the terms.

As a general guidance, if the purpose of the project is to let future contributors freely use the code, MIT license is a concise permissive license to adopt. In the pharmaceutical industry, however, the patent of the code is often of concern in a post-competitive environment across companies, and thus an Apache 2.0 license could be more suitable. On the other hand, the copyleft license (e.g. GPLv2, GPLv3) demands any downstream derivatives to follow the same copyleft license of the source project and generally should be avoided. Sometimes, a company’s legal team might come up with their own license that is not listed as one of the approved open-source licenses. It is highly recommended to only use standard open-source licenses, as these are verified by the Open-Source Initiative, so others can easily understand the governance model of your project.

A licence is ideally one of the first commits made at project initiation, because a change in the license could impact many aspects of the project. With a permissive license, others have been granted permission to license modification from its inception. When under a permissive license, you could change to a license with more requirements, but this would not rescind the historical codebase that has a more permissive license.

3.6 Common governance models

Open-sourcing a project allows others to leverage the code, but the ultimate goal is often that the open-source community adopts and helps extend and evolve the project. How projects govern this shared development is diverse. A commonality across all projects is that the repository, and it’s main/production branch, will have some form of write access control, meaning a level of governance is present even if it’s not formalised.

There is no definitive definition of open-source governance models. The following models are based on mapping Redhat, opensource.com and Linux Foundation notes to the packages relevant to clinical reporting.

Single Entity: This category refers to a project where a single entity is the final decision maker, regardless of whether that single entity is an individual, a company or other legal entity. This governance model is sometimes referred to as the “privately open source”, “founder-leader”, or “benevolent dictator” model. The single entity controls which pull requests go to master and provides instruction on how new code should integrate in order to be accepted. Famous examples are Python until 2018 and Linux. Within pharmaverse.org, diffdf and many of the single company governed packages are an example of this governance model.
Steering Group: This category refers to a project where the ultimate decision-making capacity is shared between more than one entity. The structure of the group and manner in which the group makes decisions can vary. The name used to refer to the group can also vary, examples include “governing board”, “steering group”, and “council”. A famous example includes the relatively oligarchical Python Steering Council from 2018, however many projects prefer simple democracies, or merely that a specific number of approvals from among the contributing entities are required to approve acceptance to the production branch. Within pharmaverse.org, admiral is an example of this governance model.
Do-ocracy: This category refers to a project where access to the production branch is given out fairly freely, usually based on prior interactions with the primary contributors, or actual contributions via external pull requests. Trust is placed in the community to come to an agreement regarding acceptance to the production branch. This category is sometimes also referred to as a “self-governed” or “non-governed” governance model. Within pharmaverse.org, visR is an example of this governance model.
Foundation governed: A legal body (e.g., non-profit) assumes control - an example organisation is the Linux Foundation which governs many projects, while in pharma there are parallels to efforts like Transcelerate and OHDSI. There are no examples of this model within pharmaverse.org, but R/Pharma repositories do follow this model, where the registered non-profit Open Source in Pharma governs the github organisation.

3.7 The role of contracts in inter-company open-source ollaboration

Contributions to open-source code can come in many forms, and there is a great deal of diversity in projects relevant to clinical reporting. This is an emerging area for pharma companies, and so we will focus on promoting awareness, rather than giving firm guidelines.

3.7.1 Annotated examples of the role of contracts in open-source projects

When initiating a project like an R package, or when another company is considering investing in collaboration to an existing project, there could be a discussion on having a legal framework layered on top of the collaboration. To help contextualise this, we will use four example projects.

dplyr: The dplyr package is a ubiquitous in pharma, but is a generic data science package for data munging. The code owners are listed as individuals from a vendor, academia and a consultancy and it’s released under a permissive license. This package is extensively consumed, and a core dependency in data related packages like admiral. This package is heavily depended on pharma, but no legal agreement exists beyond the permissive licencing on the project.
gt: There is a large spread of table generation packages in pharma, but several pharma companies, including Roche and GSK, have publicly been exploring extensions that would allow the use of gt in TLG generation for CSRs. No legal agreement exists beyond the permissive licencing on the project.
pkglite: Submitting code to the FDA requires collapsing the contents into text files with restrictive formats. pkglite exists to collapse and reconstitute an R package before and after the eCTD submission portal. pkglite uses a copy-left license, and copyright is owned by Merck. No legal agreement exists beyond the copy-left licencing on the project.
admiral: admiral is an R package for creating ADaM datasets. The copyright is held between Roche and GSK, and it is permissively licensed. A contract exists between Roche and GSK on their collaboration model. Other companies have contributed and offered to extend admiral without legal contracts in place on the original codebase.

The examples above were intended to highlight that the majority of R packages used by pharma companies are done so without legal contracts in place, beyond the license of the project, even when some collaboration takes place.

It remains a discussion point though whether licenses are required, and the decision to create a license may become relevant if companies want to formally pool resources. It’s important to note that with permissively license projects, it is possible that if two entities want to take a package in different directions, they are able to by forking the project. So, contributions to another entities package are not lost to the contributing company.

3.8 Liability to users

One open question is often how does open-sourcing open a company up to liability, indemnity and warranties. We previously discussed CLA bots, as a mechanism to reinforce the need for contributions to be original, and never cannibalised from another project. For remaining risks from others using an open sourced codebase, licenses will include some language. As an example, 50% of the MIT license is devoted to this topic with the following working:

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.