Introduction

Motivation

Computation is an integral part of the preparation and content of modern biomedical scientific publications, and of the findings they report. Computations can range in scale from simple statistical routines run in Excel spreadsheets to massive orchestrations of very large primary datasets, computational workflows, software, cloud environments, and services. They typically produce data and generate images or tables as output. The authors’ scientific claims are supported by evidence that includes references to the theoretical constructs embodied in existing domain literature, and to the experimental or observational data and its analysis, represented in images or tables.

Today, researchers face increasingly strict requirements to leave a digital footprint of each preparation and analysis step in the derivation of a finding, in support of reproducibility and reuse of both data and tools. The practice widely recommended, and often required, by publishers and funders today is to archive and cite one’s own experimental data (Cousijn et al., 2018; Data Citation Synthesis Group, 2014; Fenner et al., 2019; Groth et al., 2020), and to make it FAIR (Wilkinson et al., 2016). These approaches were developed over more than a decade by a significant community of researchers, archivists, funders, and publishers, prior to the current recommendations (Altman et al., 2001; Altman & King, 2007; Borgman, 2012; Bourne et al., 2012; Brase, 2009; CODATA/ITSCI Task Force on Data Citation, 2013; King, 2007; Starr et al., 2015; Uhlir, 2012). There is increasing support among publishers and the data science community to recommend, in addition, archiving and citing the specific software versions used in analysis (Katz et al., 2021a; Smith et al., 2016), with persistent identification and standardized core metadata, to establish FAIRness for research software (Katz et al., 2021b; Lamprecht et al., 2020); and to require identification, via persistent identifiers, of critical research reagents (A. Bandrowski, 2014; A. E. Bandrowski & Martone, 2016; Prager et al., 2018).

How do we facilitate and unify these developments? Can we make the recorded digital footprints as broadly useful as possible in the research ecosystem, while their generation occurs as side-effects of processes inherently useful to the researcher – for example, in large scale data analytics and data commons environments?

The solution we developed is a reusable framework for building provenance-aware data commons environments, which we call FAIRSCAPE. It provides several features directly useful to the computational scientist, by simplifying and accelerating important data management and computational tasks; while providing, as metadata, an integrated evidence graph of the resources used in performing the work, allowing them to be retrieved, validated, reused, modified, and extended.

Evidence graphs are formal models inspired by a large body of work in abstract argumentation (Bench-Capon & Dunne, 2007; Brewka et al., 2014; Carrera & Iglesias, 2015; Cayrol & Lagasquie-Schiex, 2009; Dung, 1995; Dung & Thang, 2018; Gottifredi et al., 2018; Rahwan, 2009), and by analysis of evidence chains in biomedical publications (Clark et al., 2014; Greenberg, 2009, 2011), which shows that the evidence for correctness of any finding can be represented as a directed acyclic support graph, an Evidence Graph. When combined with a graph of challenges to statements or to their evidence, this becomes a bipolar argument graph, or argumentation system (Cayrol & Lagasquie-Schiex, 2009, 2010, 2013).

The nodes in these graphs can readily provide metadata about the objects related to the computation, including the computation parameters and history. Each set of metadata may be indexed by one or more persistent identifiers, as specified in the FAIR principles; and may include a URI by which the objects themselves may be retrieved, given the appropriate permissions. In this model, core metadata retrieved on resolution of a persistent identifier (PID) (Juty et al., 2020; Starr et al., 2015) will include an evidence graph for the object referenced by the PID. A link to the object’s evidence graph can be embedded in its metadata.

The central goals of FAIRSCAPE can be summarized as (1) to develop reusable cloud-based “data commons” frameworks adapted for very large-scale data analysis, providing significant value to researchers; and (2) to make the computations, data, and software in these environments fully transparent and FAIR (findable, accessible, interoperable, reusable). FAIRSCAPE supports a “data ecosystem” model (Grossman, 2019) in which computational results and their provenance are transparent, verifiable, citable, and FAIR across the research lifecycle. We combined elements of prior work, by ourselves and others, on provenance, abstract argumentation frameworks, data commons models, and citable research objects to create the FAIRSCAPE framework. This work significantly extends and refactors the identifier and Metadata Services we and our colleagues developed in the NIH Data Commons Pilot Project Consortium (Clark et al., 2018; Fenner et al., 2018; NIH Data Commons Pilot: Object Registration Service (ORS), 2018).

FAIRSCAPE has a unique position in comparison to other provenance-related, reproducibility-enabling, and “data commons” projects. We combine elements of all three approaches, while providing transparency, FAIRness, validation, and re-use of resources; and emphasize reusability of the FAIRSCAPE platform itself. Our goal is to enable researchers to implement effective and useful provenance-aware computational data commons in their own research environments, at any scale, while supporting full transparency of results across projects, via Evidence Graphs represented using a formal ontology.

Related Work

Works focusing on provenance per se (Alterovitz et al., 2018; Ellison et al., 2020), and the various workflow provenance systems (Khan et al., 2019; Papadimitriou et al., 2021; Yakutovich et al., 2021), are primarily concerned with very detailed documentation of each computation on one or more datasets. The W3C PROV model (Gil et al., 2013; Lebo et al., 2013; Moreau et al., 2013) was developed initially to support interoperability across the transformation logs of workflow systems. Our prior work on Micropublications (Clark et al., 2014), which extended and repurposed several core classes and predicates from W3C PROV, was preliminary work forming a basis for the EVI ontology (Al Manir et al., 2021a, 2021b).

The EVI ontology, used in FAIRSCAPE to represent evidence graphs, is concerned with creating reasonable transparency of the evidence supporting scientific claims, including computational results; it reuses the three major PROV classes Entity, Activity, and Agent as a basis for a detailed ontology and rule system for reasoning across the evidence for (and against) results. When a computational result is reused in any new computation, that information is added to the graph, whether or not the operations were controlled by a workflow manager. Challenges to results, datasets, or methods may also be added to the graph. While our current use of EVI focuses on computational evidence, it is designed to be extensible to objects across the full experimental and publication lifecycle.

Systems providing data commons environments, such as the various NCI and NHLBI cloud platforms (Birger et al., 2017; Brody et al., 2017; Lau et al., 2017; Malhotra et al., 2017; Wilson et al., 2017), provide many highly useful specialized capabilities for their domain users, including re-use of data and software; however, they have not focused extensively on re-use of their own frameworks, and they are centralized. As noted later in this article, FAIRSCAPE can be, and is meant to be, installed on public, private, or hybrid cloud platforms, “bare metal” clusters, and even high-end laptops, for use at varying scopes: personal, lab-wide, institution-wide, multi-center, etc.

Reproducibility platforms such as Whole Tale and CodeOcean (Brinckman et al., 2019; Chard et al., 2019; Merkys et al., 2017) attempt to take on a one-stop-shop role for researchers wishing to demonstrate, or at least assert, the reproducibility of their computational research. Of these, CodeOcean (https://codeocean.com) is a special case: it is run by a company and appears to be described principally in press releases rather than in peer-reviewed articles.

FAIRSCAPE’s primary goals are to enable construction of multi-scale computational data lakes, or commons; and to make results transparent for reuse across the digital research ecosystem, via FAIRness of data, software, and computational records. FAIRSCAPE supports reproducibility via transparency.

In many cases, such as the very large analytic workflows in our first use case, we believe no reviewer will attempt to replicate such large-scale computations, which ran for months on substantial resources. The primary use case will be validation via inspection, and en passant validation via software reuse.

FAIRSCAPE is not meant to be a one-stop shop. It is a transferable, reusable framework. It is not only intended to enable localized participation in a global, fully FAIR data and software ecosystem – it is itself FAIR software. The FAIRSCAPE software, including installation and deployment instructions, is available in the CERN Zenodo archive (Levinson et al., 2021); and in the FAIRSCAPE Github repository (https://github.com/fairscape/fairscape).

Enabling Transparency through EVI’s Formal Model

To enable the necessary results transparency across separate computations, we abstracted core elements of our micropublications model (Clark et al., 2014) to create EVI (http://w3id.org/EVI), an ontology of evidence relationships that extends W3C PROV to support specific evidence types found in biomedical publications, and to enable reasoning across deep evidence graphs, including the propagation of challenges deep into the graph: retractions, reagent contamination, errors detected in algorithms, disputed validity of methods, challenges to the validity of animal models, and others. EVI is based on the fundamental idea that scientific findings or claims are not facts, but assertions backed by some level of evidence, i.e., they are defeasible components of argumentation. Therefore, EVI focuses on the structure of evidence chains that support or challenge a result, and on providing access to the resources identified in those chains. Evidence in a scientific article is, in essence, a record of the provenance of the finding, result, or claim asserted as likely to be true, along with the theoretical background material supporting the result’s interpretation.

If the data and software used in analysis are all registered and receive persistent identifiers (PIDs) with appropriate metadata, a provenance-aware computational data lake, i.e., a data lake with provenance-tracking computational services, can be built that attaches evidence graphs to the output of each process. At some point, a citable object (a dataset, image, figure, or table) will be produced as part of the research. If this, too, is archived with its evidence graph as part of the metadata, and the final supporting object is cited directly in the text or in a figure caption, then the complete evidence graph may be retrieved as a validation of the object’s derivation, and as a set of URIs resolvable to reusable versions of the toolsets and data. Evidence graphs are themselves entities that can be consumed and extended at each transformation or computation.

The remainder of this article describes the approach, microservices architecture, and interaction model of the FAIRSCAPE framework in detail.

Materials and Methods

FAIRSCAPE Architectural Layers

FAIRSCAPE is built on a multi-layer set of components using a containerized microservice architecture (MSA) (Balalaie et al., 2016; Larrucea et al., 2018; Lewis & Fowler, 2014; Wan et al., 2018) running under Kubernetes (Burns et al., 2016). We run our local instance in an OpenStack (Adkins, 2016) private cloud environment, and maintain it using a DevOps deployment process (Balalaie et al., 2016; Leite et al., 2020). FAIRSCAPE may also be installed on laptops running minikube in Ubuntu Linux, MacOS, or Windows environments; and on Google Cloud managed Kubernetes. An architectural sketch of this model is shown in Fig. 1.

Fig. 1

FAIRSCAPE architectural layers and components

Ingress to microservices in the various layers is through a reverse proxy using an API gateway pattern. The top layer provides an interface through which end users work with raw data and the associated metadata. The middle layer is a collection of tightly coupled services that allow end users with proper authorization to submit and view their data, metadata, and the various types of computations performed on them. The bottom layer is built with special-purpose storage and analytics platforms for storing and analyzing data, metadata, and provenance information. All objects are assigned PIDs using local ARK (Kunze & Rodgers, 2008) assignment for speed, with global resolution for generality.

UI Layer

The User Interface layer in FAIRSCAPE offers end users several ways to use the framework’s functionality. A Python client simplifies calls to the microservices. Data, metadata, software, scripts, workflows, containers, etc. are all submitted and registered by end users from the UI Layer, which may be configured to include an interactive executable notebook environment such as Binder or Deepnote.
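
As a minimal sketch of a session through the Python client, assuming a hypothetical import path and constructor (the method names match the calls used in the use cases later in this article):

    # Sketch of a FAIRSCAPE Python client session. The import path and
    # constructor arguments are hypothetical placeholders.
    from fairscape_client import FairscapeClient

    FAIR = FairscapeClient(
        base_url="https://fairscape.example.org",  # your FAIRSCAPE gateway
        token="...",                               # identity token from login
    )

    # Register a dataset with its JSON-LD metadata; an ARK PID is returned.
    dataset_meta = {
        "@context": "https://w3id.org/EVI",
        "@type": "Dataset",
        "name": "Example vital signs dataset",
    }
    dataset_id = FAIR.upload_file("vitals.csv", dataset_meta)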

API Gateway

Access to the FAIRSCAPE environment is through an API gateway, mediated by a reverse proxy which dispatches calls to the various microservice endpoints. We use Traefik (https://traefik.io), a reverse proxy that we configure as a Kubernetes Ingress Controller, to dynamically configure and expose multiple microservices through a single API.

The endpoints of the services are exposed through the OpenAPI specification (formerly the Swagger Specification) (Miller et al., 2020), which defines a standard, language-agnostic interface for publishing RESTful APIs and allows service discovery. Accessing the services requires user authentication, which we implement using the Globus Auth authentication broker (Tuecke et al., 2016). Users of Globus Auth may be authenticated via a number of permitted authentication services, and are issued a token which serves as an identity credential. In our current installation we require use of the CommonShare authenticator, with site-specific two-factor authentication necessary to obtain an identity token. This token is then used by the microservices to determine a user’s permission to access various functionality.
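
As a minimal sketch of this pattern, assuming an illustrative deployment URL, route, and example ARK, a client call through the gateway attaches the identity token as a bearer credential:

    # Sketch of an authenticated request through the API gateway.
    # The base URL, route, and example ARK are illustrative placeholders.
    import requests

    token = "..."  # identity token issued by the configured auth broker

    resp = requests.get(
        "https://fairscape.example.org/mds/ark:/99999/fk4-demo-123",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()  # 401/403 here indicates a failed token check
    metadata = resp.json()   # JSON-LD metadata registered for the PID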

Authentication and Authorization Layer

Authentication and authorization (authN/authZ) in FAIRSCAPE are handled by Keycloak (Christie et al., 2020), a widely-used open source identity and access management tool.

When Traefik receives a request, it performs an authentication check against Keycloak, which determines whether the requestor holds a valid token for an identity permitted to perform the requested action.

We distribute FAIRSCAPE with a preconfigured Keycloak for basic username/password authentication and authorization of service requests. This can easily be modified to support alternative identity providers, including LDAP, OpenID Connect, and OAuth2.0, for institutional single sign-on. Services continue to interact in the same way even if the configured identity provider changes.

Within our local Keycloak configuration, we chose to define Globus Auth as the identity provider. Globus Auth then serves as a dispatching broker amongst multiple other possible final identity providers. We selected the login service at the University of Virginia as our final provider, providing two-factor authentication and institutional single sign-on. Keycloak is very flexible in allowing selection of various authentication schemes, such as LDAP, SAML, OAuth2.0, etc. Selection of authentication schemes is an administrator decision.

Microservices Layer

The microservices layer is composed of seven services: (1) Transfer, (2) Metadata, (3) Object, (4) Evidence Graph, (5) Compute, (6) Search, and (7) Visualization. These are described in more detail under FAIRSCAPE Microservice Components below. Each microservice performs its own request authorization, subsequent to Keycloak, enabling fine-grained access control.

Storage and Analytic Engine Layer

In FAIRSCAPE, an S3-compatible object store is required for storing objects, a document store for storing metadata, and a graph store for storing graph data. Persistence for these databases is configured through Kubernetes volumes, which map specific paths on containers to disk storage. The current release of FAIRSCAPE uses the S3-compatible MinIO as the object store, MongoDB as the document store, and Stardog as the graph store. Computations invoked by the Compute Service are managed by Kubernetes, Apache Spark, and the Nipype neuroinformatics workflow engine.

FAIRSCAPE Microservice Components

Transfer Service

This service transfers and registers digital research objects (datasets, software, etc.) and their associated metadata to the Commons. These objects are sent to the Transfer Service as binary data streams, which are then stored in MinIO object storage. The objects may include structured or unstructured data, application software, workflows, or scripts. The associated metadata contains essential descriptive information about these objects, such as context, type, name, textual description, author, location, and checksum. Metadata are expressed as JSON-LD and sent to the Metadata Service for further processing.

Hashing is used to verify correct transmission of the object – users are required to specify a hash which is then recomputed by the Object Service after the object is stored. Hash computation is currently based on the SHA-256 secure cryptographic hash algorithm (Dang, 2015). Upon successful execution, the service returns a PID of the object in the form of an ARK, which resolves to the metadata. The metadata includes, as is normal in PID architecture (Starr et al., 2015), a link to the actual data location.
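
As a minimal sketch, assuming an illustrative endpoint path, form-field names, and response field, an upload with a client-side SHA-256 hash might look like the following:

    # Sketch of a Transfer Service upload. The endpoint path, form fields,
    # and response field are assumptions for illustration; the SHA-256
    # hash is recomputed by the Object Service after storage.
    import hashlib
    import json
    import requests

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    dataset_meta = {
        "@context": "https://w3id.org/EVI",
        "@type": "Dataset",
        "name": "Example vital signs dataset",
        "sha256": sha256_of("vitals.csv"),
    }

    with open("vitals.csv", "rb") as f:
        resp = requests.post(
            "https://fairscape.example.org/transfer",
            files={"file": f},
            data={"metadata": json.dumps(dataset_meta)},
            headers={"Authorization": "Bearer <token>"},
        )
    pid = resp.json()["identifier"]  # an ARK resolving to the metadata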

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Transfer/0.1

Metadata Service

The Metadata Service handles metadata registration and resolution, including identifier minting in association with the object metadata. The Metadata Service takes user-POSTed JSON-LD metadata, uploads it to MongoDB and Stardog, and returns a PID. To retrieve the metadata for an existing PID, a user makes a GET call to the service. A PUT call to the service updates an existing PID with new metadata. While other services may read from MongoDB and Stardog directly, the Metadata Service handles all writes to MongoDB and Stardog.
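
The calls below sketch this POST/GET/PUT cycle; the base path and response field names are assumptions for illustration:

    # Sketch of Metadata Service calls: mint, resolve, and update a PID.
    import requests

    base = "https://fairscape.example.org/mds"  # illustrative route
    headers = {"Authorization": "Bearer <token>"}

    metadata = {
        "@context": "https://w3id.org/EVI",
        "@type": "Dataset",
        "name": "Example dataset",
    }

    # POST mints a PID and writes the metadata to MongoDB and Stardog.
    pid = requests.post(base, json=metadata, headers=headers).json()["identifier"]

    # GET resolves the PID back to its registered metadata.
    resolved = requests.get(f"{base}/{pid}", headers=headers).json()

    # PUT updates the metadata registered under the existing PID.
    metadata["description"] = "A short textual description"
    requests.put(f"{base}/{pid}", json=metadata, headers=headers)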

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Metadata-Service/0.1

Object Service

The Object Service provides a direct interface between the Transfer Service and MinIO, and maintains consistency between MinIO and the metadata store. The Object Service handles uploads of new objects as well as new versions of existing files. In both cases the Object Service accepts a file and a desired file location as inputs and, if the location is available, uploads the file to the desired location in MinIO and returns a PID representing the location of the uploaded file. A DELETE call to the service deletes the requested file from MinIO and deletes the PID holding the link to the data; the PID representing the object metadata, however, remains.
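
For example, a deletion call might look like the sketch below, with an illustrative route and test ARK; afterwards the metadata PID still resolves, but without a live data location:

    # Sketch of an Object Service deletion. Route and ARK are placeholders.
    import requests

    requests.delete(
        "https://fairscape.example.org/object/ark:/99999/fk4-demo-123",
        headers={"Authorization": "Bearer <token>"},
    )
    # The object and its location link are gone; the descriptive metadata
    # registered under the companion PID is retained.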

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Object-Service/0.1

Evidence Graph Service

The Evidence Graph Service creates a JSON-LD evidence graph of all provenance-related metadata for a PID of interest. The evidence graph documents all objects, such as datasets, software, and workflows, and the computations directly involved in creating the requested entity. The service accepts a PID as its input and runs a PATH query, built on top of the SPARQL query engine in Stardog, with the PID of interest as its source, to retrieve all supporting nodes. To retrieve an evidence graph for a PID, a user makes a GET call to the service.

The Evidence Graph Service plays an important role in reproducing computations. The evidence graph exposes persistent identifiers for all resources required to run a computation. A user can reproduce the same computation by invoking the appropriate services, available through the Python client, with the help of these identifiers, as sketched below. This feature allows a user to verify the accuracy of the results and detect any discrepancies.
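
The following sketch shows this pattern, assuming illustrative routes, JSON-LD traversal details, and request fields:

    # Sketch: retrieve an evidence graph, then re-run the computation it
    # records from the same dataset and software PIDs. The graph traversal
    # and request payload fields are assumptions for illustration.
    import requests

    base = "https://fairscape.example.org"
    headers = {"Authorization": "Bearer <token>"}

    graph = requests.get(
        f"{base}/evidencegraph/ark:/99999/fk4-result-42", headers=headers
    ).json()

    for node in graph.get("@graph", []):
        if node.get("@type") == "evi:Computation":
            requests.post(f"{base}/compute", headers=headers, json={
                "datasetID": node["usedDataset"],
                "softwareID": node["usedSoftware"],
                "type": "spark",
            })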

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Evidence-Graph/0.1

Compute Service

This service executes user-uploaded scripts, workflows, or containers on uploaded data. It currently offers two compute engines (Spark, Nipype), in addition to native Kubernetes container execution, to meet a variety of computational needs. Users may execute any script they wish, as long as they provide a Docker container with the required dependencies. To complete jobs, the service spawns specialized pods on Kubernetes, designed to perform domain-specific computations, that can be scaled to the size of the cluster. This service provides the essential ability to recreate computations based solely on identifiers. Data to be computed on must first be uploaded via the Transfer Service and issued an associated PID.

The service accepts PIDs for a dataset and for a script, software, or container as input, and produces a PID representing the activity to be completed. The request returns a job identifier from which job progress can be followed. Upon completion of a job, all outputs are automatically uploaded and assigned new PIDs with provenance-aware metadata. At job termination, the service performs a ‘cleanup’ operation, removing the completed job from the queue.
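
A job submission and polling loop might look like the following sketch, with illustrative routes and response fields:

    # Sketch of the Compute Service job lifecycle: submit PIDs, poll the
    # job identifier, then read the PIDs minted for the outputs.
    import time
    import requests

    base = "https://fairscape.example.org"
    headers = {"Authorization": "Bearer <token>"}

    job = requests.post(f"{base}/compute", headers=headers, json={
        "datasetID": "ark:/99999/fk4-dataset-1",
        "softwareID": "ark:/99999/fk4-script-1",
        "type": "spark",  # or "nipype", or a custom container
    }).json()

    while True:
        status = requests.get(
            f"{base}/compute/job/{job['jobID']}", headers=headers
        ).json()
        if status["state"] in ("completed", "failed"):
            break
        time.sleep(30)

    output_pids = status.get("outputs", [])  # provenance-aware output PIDs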

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Compute/0.1

Search Service

The Search Service allows users to search object metadata for strings of interest. It accepts a string as input, performs a search over all literals in the metadata for exact string matches, and returns a list of all PIDs with a literal containing the query string. It is invoked via a GET call to the service endpoint, with the search string as the argument.
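
For example (route and parameter name assumed for illustration):

    # Sketch of a Search Service call: exact-match search over metadata
    # literals, returning the matching PIDs.
    import requests

    hits = requests.get(
        "https://fairscape.example.org/search",
        params={"searchTerm": "heart rate"},
        headers={"Authorization": "Bearer <token>"},
    ).json()  # a list of PIDs whose metadata contains the string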

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Search/0.1

Visualization Service

This service allows users to visualize Evidence Graphs interactively in the form of nodes and directed edges, offering a consolidated view of the entities and the activities supporting correctness of the computed result. Our current visualization engine is Cytoscape (Shannon, 2003). Each node displays its relevant metadata information, including its type and PID, resolved in real-time.

The Visualization Service renders the graph on an HTML page.

An OpenAPI description of the interface is here:

https://app.swaggerhub.com/apis/FAIRSCAPE/Visualization/0.1

FAIRSCAPE Service Orchestration

FAIRSCAPE orchestrates a set of containers to provide patterns for object registration (including identifier minting and resolution), object retrieval, computation, search, evidence graph visualization, and object deletion. Following API ingress, authentication, and service dispatch, these patterns are orchestrated by microservice calls invoking the relevant service containers.

Object Registration

Object registration occurs initially via the Transfer Service, with an explicit user service call, and again automatically using the same service, each time a computation generates output. Objects in FAIRSCAPE may be software, containers, or datasets. Descriptive metadata must be specified for object registration to occur.

When invoked, the Transfer Service calls the Metadata Service (MDS) to mint a new persistent identifier, implemented as a locally generated Archival Resource Key (ARK), and to store it with the descriptive metadata, including the location of the newly registered object. MDS stores object metadata, including provenance, in both MongoDB and the Stardog graph store, allowing subsequent access to the object metadata by other services.

After minting an identifier and storing the metadata, the Transfer Service calls the Object Service to persist the new object, and then updates the metadata with the stored object location. Hashing is used to verify correct transmission of the object – users are required to specify a SHA256 hash on registration, which is then recomputed by the Object Service and verified after the object is stored. Internally computed hashes are provided for re-verification when the object is accessed. Failure of hashes to match generates an error.

Identifier Minting

The Metadata Service mints PIDs in the form of ARKs. Multiple alternative PIDs may exist for an object; PIDs resolve to the associated object-level metadata, including the object’s evidence graph and, with appropriate permissions, its location.

In the current deployment, ARKs created locally are registered to an assigned Name Assigning Authority Number. The ARK globally unique identifier ecosystem employs a flexible, minimalistic standard and existing infrastructure.
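
As a sketch of the identifier layout, an ARK combines a Name Assigning Authority Number (NAAN) with a locally minted name; 99999 is the ARK test NAAN, used here as a placeholder:

    # Sketch: split an ARK into its NAAN and locally minted name.
    def parse_ark(pid):
        assert pid.startswith("ark:"), "not an ARK"
        naan, _, name = pid[len("ark:"):].lstrip("/").partition("/")
        return naan, name

    print(parse_ark("ark:/99999/fk4-demo-123"))  # -> ('99999', 'fk4-demo-123')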

Identifier Resolution

ARK identifier resolution may be handled locally and/or by external resolver services such as Name-to-Thing (https://n2t.net). The Name-to-Thing resolver allows for Name Assigning Authority Numbers (NAAN) to have redirect rules for their ARKs, which forwards requests to the Name Mapping Authority Hostport for the corresponding commons. Each FAIRSCAPE instance should independently obtain a NAAN, and a DNS name for their local FAIRSCAPE installation, if they wish their ARKs to be resolved by n2t.net. DataCite DOI registration and resolution are planned for future work.

Object Retrieval

Objects are accessed by their PID, after prior resolution of the object’s PID to its metadata (MDS) and authorization of the user’s authentication token for data access on that object. Object access is either directly from the object store, or from wherever else the object may reside. Certain large objects residing in robust external archives may not be acquired into local object storage, but remain in place up to the point of computation.

Computation

When executing a workload through the Compute Service, data, software, and containers are referenced through their PIDs, and by no other means. The Compute Service uses the stored metadata to dereference the object locations, and transfers the objects to the managed containers. The Compute Service also creates a provenance record of its own execution, associated with an identifier of type evi:Computation. Upon completion of a job, the Compute Service stores the generated output through the Transfer Service. Running workloads in the Compute Service enables all data, results, and methods to be tracked via a connected evidence graph, with persistent identifiers available for every node.

The Compute Service executes computations using (a) a container specified by the user, (b) the Apache Spark service, or (c) the Nipype workflow engine. Like datasets and software (including scripts), computations are represented by assigned persistent identifiers. Objects are passed to the Compute Service by their PIDs, and the computation is formally linked to the software (or script) by the usedSoftware property, and to the input datasets by the usedDataset property.

Runtime parameters may be passed with the objects, in which case a single identifier is minted for the given parameters and connected to the computation via the ‘parameters’ property. At present, however, these parameters are not incorporated in the evidence graph.

The Compute Service spawns a Kubernetes pod with the input objects mounted in the /data directory by default. Upon completion of the job all output files in the /outputs directory are transferred to the object store and identifiers for them are minted with the property generatedBy. The generatedBy property references the identifier for the computation.
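
Expressed as a Python dict for illustration, the provenance a computation leaves behind might look like the following sketch; the exact JSON-LD serialization, context, and identifiers are assumptions:

    # Sketch of the provenance records linking a computation to its input,
    # software, and output by PID. Identifiers and context are placeholders.
    computation = {
        "@context": {"evi": "https://w3id.org/EVI#"},
        "@id": "ark:/99999/fk4-computation-7",
        "@type": "evi:Computation",
        "usedSoftware": "ark:/99999/fk4-script-1",  # mounted for execution
        "usedDataset": "ark:/99999/fk4-dataset-1",  # mounted under /data
    }
    output = {
        "@context": {"evi": "https://w3id.org/EVI#"},
        "@id": "ark:/99999/fk4-output-9",           # file found in /outputs
        "@type": "evi:Dataset",
        "generatedBy": "ark:/99999/fk4-computation-7",
    }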

Object Search

Object searches are performed by the Search Service, called directly on Service Dispatch. Search makes use of Stardog’s full text retrieval, which in turn is based on Apache Lucene.

Evidence Graph Visualization

Evidence graphs of any object acquired by the system may be visualized at any point in this workflow using the Visualization Service. Nipype provides a chart of the workflows it executes using the Graphviz package. Our evidence graph visualization is interactive, using the Cytoscape package (Shannon, 2003), and allows evidence graphs of multiple workflows in sequence to be displayed, whether or not they have been combined into a single flow.

Object Deletion

Objects are deleted by calls to the Object Service to clear the object from storage, which then calls MDS and nulls out the object location in the metadata record. Metadata is retained even though the object may cease to be held in the system, in accordance with the Data Citation Principles (Data Citation Synthesis Group, 2014).

Results

Two use cases are presented here to demonstrate the application of FAIRSCAPE services. The first use case performs analysis of time series algorithms while the second runs a neuroimaging workflow. For each use case, the operations involving data transfer, computation, and evidence graph generation are described below:

  • Use Case Demonstration 1: Highly Comparative Time Series Analysis (HCTSA) of NICU Data

Researchers at the Neonatal Intensive Care Unit (NICU) at the University of Virginia continuously monitor infants and collect vital signs such as heart rate (HR) and oxygen saturation. Patterns in vital sign measurements may indicate acute or chronic pathology among infants. In the past, a few standard statistics and algorithms specially designed for discovering certain pathologies were applied to similar vital sign data. In this work we additionally applied many time series algorithms from other domains, in the hope that these algorithms would be helpful for the prediction of unstudied outcomes. A total of 67 time series algorithms were recoded as Python scripts and run on the vital signs of 5997 infants collected over 10 years, 2009–2019. The data are then merged, sampled, and clustered to find representative time series algorithms which express unique characteristics of the time series, making it easier for researchers to quickly build models for outcomes where the physiology is not known. FAIRSCAPE services can be used to build such models.

A series of steps is required to execute and reproduce the HCTSA of NICU data: transferring the NICU data, Python scripts, and associated metadata to storage for later use; running the scripts in the FAIRSCAPE compute environment; and generating the evidence graph. The first script runs the time series algorithms on the vital sign data, while the second script performs clustering of these algorithms to generate a heatmap image. The evidence graph generated for all patients contains over 17,000 nodes. A simplified version of the computational analysis, based on a single patient, is described here, as the steps for executing the analysis are common to all patients. These steps are briefly described below:

Transfer Data, Software and Metadata

Before any computation is performed, FAIRSCAPE requires each piece of data and software to be uploaded to the object store with its metadata, using the POST method of the Transfer Service. The raw data file containing the vital signs and the associated metadata are uploaded first; the scripts and their associated metadata are uploaded next.

The upload_file function shown below is used to transfer the raw data file UVA_7219_HR.csv as the first parameter and the associated metadata, referenced by the variable dataset_meta, as the second parameter:

    raw_time_series_fs_id = FAIR.upload_file('UVA_7219_HR.csv', dataset_meta)

As part of the transfer, identifiers are minted by the Metadata Service for each successfully uploaded object. The variable raw_time_series_fs_id refers to the minted identifier returned by the function. Each identifier resolves to the uploaded object, which can be accessed only by an authorized user.

Time Series Data Analysis

Once the transfer is complete, the computation for the data analysis can be started. The computation takes the raw vital sign measurements as input, groups the measurements into 10-min intervals and runs each algorithm on them. FAIRSCAPE makes launching such a computation easy by executing the POST method of the compute service with identifiers of the data and script as parameters. The compute service creates an identifier with metadata pointing to the provided inputs and launches a Kubernetes pod to perform the computation. Upon completion of the script, all output files are assigned identifiers and stored in the object store.

The compute function takes the PIDs of the dataset and the software/script, and the type of job, such as Apache Spark, Nipype, or a custom container, as parameters. The PID it returns refers to the submitted job, and can be used to track the progress of the computation and its outputs.

The compute function shown below is used to launch a computation, passing the identifier of the raw data file, raw_ts_fs_id, as the first parameter; the identifier of the analysis script, raw_data_analysis_script_id, as the second parameter; and the job type, spark, as the third parameter:

    raw_data_analysis_job_id = FAIR.compute(raw_ts_fs_id, raw_data_analysis_script_id, 'spark')

The PID the compute function returns resolves to the submitted job, and is referenced by raw_data_analysis_job_id.

Clustering of Algorithms

The next computation step is to perform clustering of the algorithms. Many algorithms are from similar domains, and the operations they perform express similar characteristics. The HCTSA clustering script clusters these algorithms into highly correlated groups, from each of which a representative algorithm can be chosen. The compute service is then invoked with the identifiers of the clustering script and the processed data as parameters. The compute function below takes the PIDs of the processed time series feature set and the clustering script, and the spark job type, as input parameters, and returns a PID representing the job to generate the HCTSA heatmap (variable names in the call below are illustrative):

    HCTSA_heatmap_job_id = FAIR.compute(processed_time_series_fs_id, HCTSA_clustering_script_id, 'spark')

An image showing the clustered algorithms, shown in Fig. 2, is produced at the end of this step.

Fig. 2

NICU HCTSA clustering heatmap. X axis and Y axis are operations (algorithms using specific parameter sets); color is the correlation between algorithms. The large white squares are clusters of highly correlated operations, which suggest the dimensionality of the data may be greatly reduced by selecting “representative” algorithms from these clusters

Generating the Evidence Graph

An evidence graph is generated using the GET method of the Evidence Graph Service. Figure 3 illustrates all computations and the associated inputs and outputs for a single patient. The graph for all patients contains 17,995 nodes of types Image, Computation, Dataset, and Software. Each patient has a unique Raw Time Series Feature Set, a Raw Data Analysis computation, and a Processed Time Series file. The Raw Data Analysis Script, the HCTSA Clustering Script, the HCTSA Heatmap Generation, and the HCTSA Heatmap are shared among all patients.

Fig. 3

Simplified evidence graph for one patient’s computations. Vital signs = dark blue box, bottom right; computations = yellow boxes; processed data = dark blue box in middle; green box = heatmap of correlations

The simplified evidence graph in Fig. 3 contains 7 nodes, each with its own PID. A Computation (Raw Data Analysis) uses a Dataset (Raw Time Series Feature Set) as the input to a Software (Raw Data Analysis Script), representing the script that executes all the time series algorithms, and generates a Dataset (Processed Time Series Feature Set) as output. The next Computation (HCTSA Cluster Heatmap Generation) uses the processed Dataset generated during the previous computation as the input to a Software (HCTSA Clustering Script), which generates an Image (HCTSA Cluster Heatmap), representing the clustering of the algorithms, as output. The evidence_graph function takes the PID of the HCTSA heatmap image: evidence_graph_jsonld = FAIR.evidence_graph(HCTSA_heatmap_id), and generates the evidence graph for that PID, serialized in JSON-LD (shown in Fig. 4).

  • Use Case Demonstration 2: Neuroimaging Analysis Using the Nipype Workflow Engine

Fig. 4

JSON-LD Evidence Graph for patient computation as illustrated in Fig. 3

Data analysis in neuroimaging often requires multiple heterogeneous algorithms which sometimes lack transparent interoperability under a uniform platform. Workflows offer solutions to this problem by bringing the algorithms and software under a single umbrella. The open-source neuroimaging workflow engine Nipype combines heterogeneous neuroimaging analysis software packages under a uniform operating platform, resolving interoperability issues by allowing them to communicate with each other. Nipype provides access to a detailed representation of the complete execution of a workflow, consisting of inputs, outputs, and runtime parameters. The containerization-friendly release, detailed workflow representation, and minimal effort required to modify existing services to produce a deep evidence graph made Nipype an attractive target for integration within the FAIRSCAPE framework.

Among the services in FAIRSCAPE, only the Compute Service needed to be modified to run and interrogate Nipype. The modifications included repurposing the service to run the workflow from the Nipype-specific container generated by the Neurodocker tool, and to capture all entities from the internal graph generated after the workflow is executed. Whereas an evidence graph typically includes only the primary inputs and outputs, the deep evidence graph produced here additionally contains intermediate inputs and outputs. It provides a detailed understanding of each analysis performed using the computations, software, and datasets.

A workflow is considered simple if it consists of a sequence of processing steps, and complex if there is nesting of workflow execution such that the output of one workflow is used as the input to another workflow. The simple neuroimaging preprocessing workflow demonstrated here (Notter, 2020) involves steps to correct motion in functional images, co-register functional images to anatomical images, smooth the co-registered functional images, and detect artifacts in functional images. As part of data transfer, the dataset containing the images and the script to run the processing steps, with their associated metadata, were uploaded using the upload_file function as shown in the previous use case. The compute function is then used to launch a computation on the image dataset, which runs the processing script in the repurposed Nipype container. The only difference in the compute call is that it uses nipype as the third input parameter instead of spark when invoking the Compute Service. The full evidence graph, generated using the evidence_graph function as demonstrated above, is too large to document here due to space constraints; therefore, only the graph of the motion correction of functional images with FSL’s MCFLIRT is shown in Fig. 5. For additional details on this workflow, please consult the original Nipype tutorial (Notter, 2020).

Fig. 5

Evidence Graph visualization for the neuroimaging workflow execution

Discussion

FAIRSCAPE enables rapid construction of a shared digital commons environment and supports FAIRness within and outside that environment. It supports, at a detailed level, every requirement of the FAIR Principles (Wilkinson et al., 2016), including a deep and comprehensive provenance model via evidence graphs, contributing to more transparent science and improved reusability of methods.

Scientific rigor depends on the transparency of methods (including software) and materials (including data). The historian of science Steven Shapin described the approach developed with the first scientific journals as “virtual witnessing” (Shapin, 1984), and this is still valid today. The typical scientific reader does not actually reproduce the experiment, but is invited to review mentally every detail of how it was done, to the extent that s/he becomes a “virtual witness” to an envisioned live demonstration. That is clearly how most people read scientific papers, except perhaps when they are citing them, in which case less care is often taken. Scientists are not really incentivized to replicate experiments; their discipline rewards novelty.

The ultimate validation of any claim, once it has been accepted as reasonable on its face, comes with support from multiple distinct angles, by different investigators; with successful re-use of the materials and methods upon which it is based; and with consistency with some body of theory. If the materials and methods are sufficiently transparent and thoroughly disclosed as to be reusable, and they cannot be made to work, or give bad results, that debunks the original experiments; this is precisely how the promising-sounding STAP phenomenon was discredited (“RETRACTED ARTICLE: Stimulus-triggered fate conversion of somatic cells into pluripotency”, 2014; Shiu, 2014), before RIKEN’s elaborate formal effort to replicate the experiments (Ishii et al., 2014; RIKEN, 2014).

As a first step, then, it is not only a matter of reproducing experiments, but also of producing transparent evidence that the experiments have been done correctly. This permits challenges to the procedures to develop over time, especially through re-use of materials (including data) and methods, which today significantly include software and computing environments. We view these methods as extensible to materials such as reagents, using the RRID approach, and to other computational disciplines.

Conclusion

FAIRSCAPE is a reusable framework for scientific computations that provides a simplified interface for research users to an array of modern, dynamically scalable, cloud-based componentry. Our goal in developing FAIRSCAPE was to provide an ease-of-use (and re-use) incentive for researchers, while rendering all the artifacts marshalled to produce a result, and the evidence supporting them, Findable, Accessible, Interoperable, and Reusable. FAIRSCAPE can be used to construct, as we have done, a provenance-aware computational data lake or Commons. It supports transparent disclosure of the Evidence Graphs of computed results, with access to the persistent identifiers of the cited data or software, and to their stored metadata.

End-users do not need to learn a new programming language to use services provided by FAIRSCAPE. They require no additional special expertise, other than basic familiarity with Python and the skillsets they already possess in statistics, computational biology, machine learning, or other data science techniques. FAIRSCAPE provides an environment that makes large-scale computational work easier and results FAIRer. FAIRSCAPE is itself reusable and we have taken pains to provide well-documented straightforward installation procedures.

All resources in FAIRSCAPE are assigned identifiers which allow them to be shared. FAIRSCAPE allows users to capture the complete evidence graph of the tasks performed. These evidence graphs show all steps of the computations performed, along with the software and data that went into each computation. Evidence graphs, together with FAIRSCAPE’s other services, allow users to review and reproduce an experiment with significantly less overhead than other standard approaches. Users can see all computations that were performed, review the source code, and download all the data. This allows another party to reproduce the exact computations performed, apply the experimenter’s software to their own data, or apply their own methods to the experimenter’s data.

The optimal use case for a FAIRSCAPE installation is a local or multi-institution digital commons, in a managed Kubernetes environment. It can also be installed on high-end laptops for testing and development purposes as needed. We are actively looking for collaborators wishing to use, adapt, and co-develop this software.

FAIRSCAPE is not a tool for individual use; it is software for creating a high-efficiency collective environment. As a framework for sharing results and methods in the present, it also provides reliable deep provenance records across multiple computations, to support future reuse and to improve guarantees of reliability. This is the kind of effort we hope centers and institutions will increasingly make to support reliable and shareable computational research results. We have seen increasing examples of philanthropic RFAs directing investment to this area. In our own research, we have found FAIRSCAPE’s methods invaluable in supporting highly productive computational research.

The major barriers to widespread adoption of digital commons environments, in our view, have been the relative non-reusability (despite claims to the contrary) of existing commons frameworks, and the additional effort required to manage a FAIR digital commons. We feel that FAIRSCAPE is a contribution to resolving the reusability issue, and may also help to simplify some digital commons management issues. Time will tell if the philanthropic investments in these areas noted above are continued. We hope they are.

We plan several enhancements in future research and development on this project, such as integration of additional workflow engines, including engines for genomic analysis. We also intend to provide support for DataCite DOI and Software Heritage identifier (SWHID) (Software Heritage Foundation, 2020) registration, with metadata and data transfer to Dataverse instances. Transfer of data, software, and metadata to long-term digital archives such as these, which are managed at the scale of universities or countries, is important in providing long-term access guarantees beyond the life of an individual center or institute.

Many projects involving overlapping groups have worked to address parts of the scale, accessibility, verifiability, reproducibility, and reuse problems targeted by FAIRSCAPE. Such challenges are in large part outcomes of the transition of biomedical and other scientific research from print to digital, and our increasing ability to generate data and to run computations on it at enormous scale. We make use of many of these prior approaches in our FAIRSCAPE framework, providing an integrated model for FAIRness and reproducibility, with ease of use incentives. We believe it will be a helpful tool for constructing provenance-aware FAIR digital commons, as part of an interoperating model for reproducible biomedical science.

Information Sharing Statement