Abstract

Repeatability of study setups and reproducibility of research results from the underlying data are major requirements in science. Until now, abstract models for describing the structural logic of studies in the environmental sciences have been lacking, and tools for data management are insufficient. Sophisticated data management solutions that go beyond data file sharing are mandatory for repeatability and reproducibility. In particular, this implies the maintenance of coherent data along workflows. Design data concern elements from the three elementary operation domains: transformation, measurement and transaction. Operation design elements and method information are specified for each consecutive workflow segment, from field to laboratory campaigns. The strict linkage of operation design element values, operation values and objects is essential. To enable coherence of corresponding objects along consecutive workflow segments, the assignment of unique identifiers and the specification of their relations are mandatory. The abstract model presented here addresses these aspects, and the software DiversityDescriptions (DWB-DD) facilitates the management of such connected digital data objects and structures. DWB-DD allows for the individual specification of operation design elements and their linking to objects. Two workflow design use cases are given, one for DNA barcoding and another for the cultivation of fungal isolates. To publish such structured data, standard schema mapping and XML provision of digital objects are essential. Schemas useful for this mapping include the Ecological Metadata Language, the Schema for Meta-omics Data of Collection Objects and the Standard for Structured Descriptive Data. Data pipelines with DWB-DD include mapping and conversion between schemas as well as functions for data publishing and archiving according to the Open Archival Information System standard. This setting allows for the repeatability of study setups and the reproducibility of study results, and it supports working groups in structuring and maintaining their data from the beginning of a study. The theory of ‘FAIR++’ digital objects is introduced.

Introduction

A ‘replication crisis’ and ‘reproducibility crisis’ in the natural sciences have been under intensified discussion in recent years (1, 2, 3); both address the paradigm that scientists should be enabled to repeat study setups and reproduce study results in the future. Particularly in the life sciences, including ecology, there is, for several reasons, a lack of empirical studies that test earlier research findings by repetition (4). The reasons for the current crisis are manifold (5); for ecological and evolutionary research, they have been analysed in exemplary fashion (6). The challenge is connected with the task of producing, documenting and reporting on all domains and all kinds of data assets generated during the research process. Incomplete or erroneous data are rarely generated intentionally; more often they arise unintentionally and remain unrecognized as such (7).

In the environmental sciences, including ecology, flawed data may be generated as early as in the field, due to confusion of objects or object containers, and at subsequent stages due to mislabelling or errors during laboratory operations (8). In collaborative biodiversity studies describing and analysing species community structure and molecular, cellular and organismic interactions, such errors may be particularly frequent owing to shortcomings during the early phases of data management (9). Flat-structured data editing tools like spreadsheets have often been regarded as sufficient, probably because data management during early project phases is considered less relevant and is left to technicians’ stewardship. Yet for estimating the quality and reliability of the data products to be analysed, it is mandatory that all participants in the research process are involved to a considerable degree in this early data management. First practical guidelines to cope with this issue, particularly in long-term ecological monitoring projects, are available (10). Other researchers point to the lack of adequate basic data management procedures, of infrastructure and of sufficient human resources (11).

Recently, research data management in the context of data publication following the FAIR data principles has become a major topic and has been addressed by international and national initiatives (e.g. FAIRsharing, the GFBio project in Germany) (12, 13, 14, 15, 16). The demand for FAIRness has also strengthened evaluation and certification activities across the landscape of recognized scientific data repositories at various organizational levels regarding transparency, interoperability and reusability of data, with the aim of avoiding the creation of ‘data silos’ (17, 18).

Compared to the requirements of data reusability (19), the requirements of ‘repeatability’ of study setups and ‘reproducibility’ of research results go one step further, encompassing study operation design, applied methods, data provenance and dataflow details (20, 21). Reusability may often be considered a problem of the users (data consumers), namely how to handle, i.e. further process, the accessible identified data products (data sets) for their own use (9); it may also be regarded as a problem of appropriate data preservation (sometimes together with the software applied) and of the assignment of relevant properties and ontologies (22, 23). Regarding repeatability of study setups and reproducibility of resulting data, available data products often appear insufficient in completeness, quality and extent of documentation of the relations between operation design and method information.

Digital objects are generated along all steps of the workflow of a scientific study. Coming from object-oriented programming, the term has been defined as ‘a unit of information that includes properties (attributes or characteristics of the object) and may also include methods (means of performing operations on the object)’; see https://www2.archivists.org/glossary/terms/d/digital-object. Used in a more general context, digital objects are meaningful entities in the digital domain that have names (identities) as well as properties (24). The Digital Object Architecture, which addresses interoperability in heterogeneous networks, defines the term ‘digital object (DO)’ as ‘a sequence of bits, or a set of sequences of bits, incorporating a work or portion of a work or other information in which a party has rights or interests, or in which there is value, each of the sequences being structured in a way that is interpretable by one or more of the computational facilities, and having as an essential element an associated unique persistent identifier’ (25). This definition has recently been reflected and accepted for data and services in biodiversity science and geoscience (26) and is followed here.

The digital objects generated in a research study often insufficiently reflect the provenance and relations of objects, i.e. vertical (synchronous) and horizontal (successional) data coherence or concatenation, as well as the applied information structures, formats, standard schemas and ontologies. Thus, the study workflow with its segments and results is not representable as a whole. In recent years, the workflow for the publication of scientific result data has been improved (27). Still, insufficient attention has been given to data management during the early processes of generating data products, as one form of digital objects, and to the documentation of data handling, which is a prerequisite for reusable and reproducible scientific results. This has frequently resulted in data products with structured or semi-structured non-standardized content in various technical formats, accompanied only by certain standardized bibliographic information (28) and deposited in non-domain-specific ‘file sharing’ data repositories (9).

The present contribution describes an abstract model based on three elementary operation domains for all segments along research workflows, aimed at obtaining highly structured data products. Such a granular modelling approach is a precondition for generating interoperable bioscience and environmental data (29). The model describes scientific workflows as concatenated segments. The generated data products or, more generally, the generated digital objects comprise all information for the documentation of a study. This information should guarantee the repeatability of the conditions of observation and thus might allow, if all influential factors could be kept under control, for the reproducibility of study results. This article supplements two preceding ones, which are also dedicated to the management of environmental research data (30, 31).

Challenges of scientific data management

Environmental research focuses on the complexity of interactions in nature. A variety of observational and experimental setups are required for testing evolving scientific questions or hypotheses. Specific challenges of scientific data management therefore exist. To cope with these, the setup of appropriate data management plans (DMPs) is regarded as mandatory. Such DMPs should provide study design information and concepts for achieving reproducibility and ‘FAIRness’ of the resulting data, as well as repeatability of experiments (19, 32, 33). Furthermore, electronic laboratory notebooks (ELNs) or journals are used for documenting analysis data gained during operations in the laboratory. In addition, methodologies and scientific workflows applied during a study are documented in text documents and, more recently, in Scientific Data Management Systems (SDMS), often an integral part of laboratory information management systems (LIMS). Finally, parts of the information on applied methods are described and referred to in the methodological chapters of the resulting original publications. There is also increasing awareness that physical objects (environmental and other samples) require deposition in relevant material repositories (34, 35). Digital objects with data from measurements (along with design data and methodology information) from scientific workflows are supposed to be stored in relevant institutional or domain-specific, regional, national or international data repositories (e.g. those recommended by journals, by national funding agencies or by data infrastructure consortia like the German Federation for Biological Data (GFBio)) (36).

Community-agreed conceptual schemas for describing discipline-specific operations and measurement data provide a more or less comprehensive namespace for ‘variables’ and ‘parameters’, which are utilized as elements or descriptors in data management systems, data exchange documents and online data portals. However, it is another challenge to implement such schemas in standard database applications or virtual research environments, because they may be too generic, patchy or too specific to be used in scientific studies with different experimental setups. This entails that, on the one hand, study designs should be specific enough for the scientific questions or hypotheses through the use of discipline-specific ontologies and that, on the other, descriptors should be translatable onto the namespace elements and ontologies of community-agreed schemas.

The following abstract model addresses some of these challenges. Its applicability and suitability in practice have been tested with real-world use cases from ecological field and laboratory studies using an established generic SDMS.

Abstract model for analysing and describing FAIR digital objects in environmental and life sciences, and steps towards practice

The characteristics of the new model include the abstraction of workflow segmentation; workflow segment design; design elements and method information; the generation of measurement values; object identity and object identifiers; and the linking of workflow segments, operation designs and methods with design codes. Two use cases from the environmental and life sciences are added. Details on the related software implementation and considerations on mapping to standard conceptual schemas and ontologies are provided.

Workflow segmentation

During workflows in environmental research, a given number of physical objects and corresponding digital data objects are generated, the former by transformation of a preceding physical object, the latter by measurements on the physical object in focus or by transformation of a preceding digital data object. A workflow from environmental sampling to data analysis in the laboratory is therefore potentially divisible into a series of workflow segments according to the respective number of generated (intermediate or final) physical or digital objects. In the narrowest concept, a workflow segment may be demarcated by only one object and its corresponding measurement data (31). In a study, one to several workflow segments may constitute a study campaign; the combination of more than one segment in a campaign is due to practical reasons. The linkages between the segments within campaigns, or along a whole workflow, can be achieved by applying physical (or digital) object identifier relations, mainly the parent identity relation. Object identifiers are identifiers used for defining physical and digital object or unit identity (30). Further explanations are given in the chapter ‘Object identity and identifiers’ below and in Figure 1.
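As an illustration (a minimal sketch with assumed names, not a normative part of the model), the parent identity relation between objects of consecutive workflow segments can be expressed as follows:

```python
# Minimal sketch: each physical or digital object carries its own UUID
# and the UUID of the preceding (parent) object, so consecutive workflow
# segments can be traversed along the parent identity relation.
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkflowObject:
    label: str                       # human-readable description
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent_id: Optional[str] = None  # parent identity relation

# Segment 1: environmental sample; segment 2: DNA extract derived from it
sample = WorkflowObject(label="soil sample, borehole 3, horizon A")
extract = WorkflowObject(label="DNA extract", parent_id=sample.object_id)

assert extract.parent_id == sample.object_id  # segments are concatenated
```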

Figure 1

Segment of a (multi-segment) workflow or a (multi-segment) campaign with object identity (ID), operation design elements and method information, as well as measurement values, as assigned to physical (and digital) objects. Consecutive segments (indicated by the arrow at the lower right) are linked via parent identifier relations of the preceding physical objects or their digital representatives (parent identity relation). Operations are grouped according to the domains transformation, measurement and transaction (TF: transformation design, referring to domain 1; ME: measurement design, referring to domain 2; TA: transaction design, referring to domain 3) and are assigned to (physical and digital) objects by declaration or selection of descriptor states (categorical, various). Measurement (or observation) values are primarily generated from physical objects (and secondarily from digital objects).

Workflow segment design and object description

Workflow segments are composed of physical or digital objects, which are characterized by operation data and method information (Figure 1) according to a given study design. Operations within the sequential workflow steps concern activities in the field and in the laboratory, including object storage. In environmental research, design data of field activities in particular, like sampling and the measurement of spatiotemporal coordinates, are required for interpreting data gained later in the workflow by measurements taken from physical objects during subsequent workflow segments. This means that measurement design-based data are required for correlation with transformation design-based data, both for testing scientific hypotheses and for quality and quantity control of a given object in a workflow segment. Complete sets of data describing study and workflow (segment) designs are essential for the repeatability of study setups and the potential reproducibility of results. This applies to all types of research projects and studies with theory-driven and data-driven study designs and research perspectives (37).

Design elements describing the contextual properties of an object, as well as method information, may be assigned to domains according to the three elementary operations. The ‘transformation design’ (TF) refers to domain 1, the ‘measurement design’ (ME) to domain 2 and the ‘transaction design’ (TA) to domain 3. While design elements of domains 1 to 3 specify, e.g., spatial or other hierarchies in assays for transformations, storage, measurements and transfer, method information describes devices as well as the details and parameters of operations on (physical, digital) objects. Transformations based on the transformation design concern every kind of invasive treatment of an object and may also include the storage of objects. Digital objects generated by measurement, based on the measurement design, concern all kinds of data that are (non-invasively) generated from a physical and (secondarily) from a digital object; they describe object traits and thus represent the description of objects that have been generated by the transformation of a preceding object according to a given transformation design and method information. Transactions based on the transaction design concern object transfer (i.e. translocation or transport). The elements of all three design domains together represent the workflow segment designs, the overall study design and the designs of individual digital objects (31).

Operation design elements and method information

An overall study design, or the designs per workflow segment and assigned physical or digital objects, usually includes design elements of three domains (Figure 2) according to the elementary operations (31). Designs, which are composed of design elements (Table 1), represent the variables in a study and must be defined before starting a campaign. During data analysis after the campaigns, they represent the factors used for creating results that characterize the environmental conditions of an object. Transformation designs (TF, domain 1) often follow nested or crossed designs (38). Data objects gained by measurements taken from a physical (or digital) object, for quality control or for gathering trait information, follow the measurement design (ME, domain 2). The transaction design (TA, domain 3) mainly reflects the environmental conditions and spatial position of objects in the context of storage (room, freezer, shelf, microplate, etc.) and transport. The designs of material repositories and data repositories are usually hierarchical or nested, like those for transformation. Furthermore, storage may also be regarded as a kind of transformation (e.g. due to physical or chemical changes over time) and may therefore follow a transformation design accordingly.
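For illustration, the design of a single workflow segment may be represented as element lists per operation domain. The following sketch mirrors workflow segment 2 of Table 2; the dictionary layout is an assumption for illustration, not a prescribed serialization:

```python
# Sketch of one workflow segment design; element names are taken from
# Table 2 (workflow segment 2), grouped by operation domain.
segment_2_design = {
    "TF": ["DNA extraction"],                          # domain 1: transformation
    "ME": ["DNA extract concentration and purity"],    # domain 2: measurement
    "TA": ["storage box ID", "storage rack ID",        # domain 3: transaction
           "microplate ID", "microplate-internal position coordinate"],
    "MI": ["nucleic acids extraction protocol",        # method information
           "nucleic acids extract quality and quantity determination protocol",
           "intermediate object transfer into container protocol"],
}
```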

Figure 2

Digital object including information on physical object identity as well as operation design (OD) and method information (MI) structures and values of three elementary operation domains.

Table 1

Definitions of core terms

Workflow and dataflow
Workflow → A sequence of a given number of generated physical objects and corresponding digital data objects
Workflow segment → Section of a workflow, defined by a given object, the design of its generation, and its properties or traits; pre- and post-campaign activities are not considered part of a workflow segment
Study or workflow campaign → Section of a workflow, defined by the practical work in a study, comprising a given number of workflow segments (1–n) according to the number of generated (physical and digital) objects included; in the most granular representation of a workflow, campaigns and segments match 1:1; pre- and post-campaign activities are per se not considered part of a workflow campaign

Domains of elementary operations and their supplementation by method information
Elementary operations → Basic operations, namely transformation, measurement and transaction of an object
Domain 1: Transformation → Generation of (target) objects → For creating objects with new or other properties suitable for measurement
Domain 2: Measurement → Generation of data (specifying an object) → For proving successful transformation of objects; for enabling data analysis
Domain 3: Transaction → Generation of (spatial) structures → For making (physical and digital) objects findable for transformation and measurement
Method information (‘methodology’) → Selection of devices and specification of parameters for making processes functional in the context of domains 1–3

Operation design according to domains 1–3
Operation design → The design of operations in a workflow segment according to the elementary (domain 1–3) operations on an object
Factor → Element value used in (statistical) analysis, corresponding to the term ‘variable’
Design element → Generated factor for transformation, measurement and transaction of an object (mostly in an experiment)
Parameter → Element value used for describing the setup of devices for operations, corresponding to the term ‘constant’

Generation and assignment of object identifiers, operation design elements with method information, and measurement values
Identifiers (pre-campaign) to objects (containers) → Making objects identifiable
Operation design elements with method information (pre-campaign) to objects → Making objects distinctive, i.e. characterizing the objects
Measurement values (on-/post-campaign) to objects → For recording object traits

Generation and assignment of operation design codes
(Pre-campaign) generation and assignment of operation design codes to objects

While the values of domain 1 to 3 design elements are usually generated by declaration before the corresponding operations, values by measurement are generated during the respective activity. It should be pointed out that, aside from the a priori designs that are part of workflows as treated here, there also exist a posteriori design data that might be used as factors in subsequent data analysis. The latter concern the assignment of measurement data to classification systems. Examples are the mapping of operational taxonomic units obtained by DNA sequencing onto taxonomic, phylogenetic and functional classifications, of operational functional units (OFUs) onto functional classifications, or of spatiotemporal coordinates obtained by a global navigation satellite system (GNSS) onto elements, i.e. polygons, of thematic layers in a GIS analysis context.
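As a toy example (all values fabricated for illustration), such an a posteriori assignment might map measured OTU identifiers onto a taxonomic classification whose classes then serve as factors in analysis:

```python
# Illustrative a posteriori classification: measured OTU identifiers are
# mapped onto taxonomic classes after the campaign; the classes can then
# be used as factors in subsequent data analysis.
otu_to_taxon = {
    "OTU_001": "Fungi; Ascomycota",
    "OTU_002": "Fungi; Basidiomycota",
}
read_counts = [("OTU_001", 152), ("OTU_002", 87)]   # (OTU, read count)
classified = [(otu_to_taxon[otu], n) for otu, n in read_counts]
print(classified)  # [('Fungi; Ascomycota', 152), ('Fungi; Basidiomycota', 87)]
```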

The method information (MI) accordingly specifies the treatment details of the elementary operations to be performed on an object (Figure 2) and provides the parameters implied, which represent the constants within the frame of a study. In accordance with the three types of designs, method information refers to the transformation parameters of treatments and devices in the context of object treatment during a field or laboratory workflow segment. Transformation may include fragmentation or any other change of the properties of a physical object, or of the structural or even logical properties of a digital object. Method information also includes specifications of the measurement parameters that are required for gaining evidence of transformation success, i.e. quality and quantity control, but also for generating measurement values for scientific analysis. The transaction parameters concerning transfer details of objects are likewise specified by method information. Transaction protocols refer to the transfer of samples into a storage device or to the translocation of objects in general.

To ensure the repeatability of designs, such protocols describing details of the treatment of physical and digital objects are required. Information on processual details is normally provided in laboratory protocols, which are referenced in the methodological chapters of original research articles.

Operation design elements, i.e. research study descriptors, concern the selected or declared factors in data analyses. In contrast, method information with parameters describes the details of treatment. Together, design and methods thus describe the processual context of an object within workflow segments. Design elements may be based on specific classifications or combinations thereof. The possible scenarios are potentially infinite in number and mostly refer to the spatiotemporal patterns and hierarchies of experimental setups. They may follow standard terminologies, taxonomies and ontologies and may be supplemented via semantic enrichment through resources in internal networks and the internet, normally identified by Uniform Resource Identifiers (URIs).

Aside from the described domains for study design specification and method information, further domains might be recognized, such as administrative research study details, including legal issues (permits etc.), and bibliographic data relevant in the publication context. These domains, however, are not part of the research core data and are therefore not considered in the present context.

Generation of measurement values

Measurements on (physical and digital) objects generate values according to the domain 2 design and the corresponding method information. Measurement data are required for hypothesis testing and for quality and quantity control after transformation. A minimum set of measurement data includes measurement values and units and needs to be connected to spatiotemporal coordinates and further contextual information. For user convenience, it may be organized in an SDMS of its own, separate from domain 1 and 3 data. In such cases, measurement data are linked with transformation and transaction data via shared physical and digital object identifiers. If such a modular solution for scientific data management is applied, it is essential to use these shared identifiers as primary keys in both SDMSs (see the example ‘Software application DiversityDescriptions used as SDMS’ further below).
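A minimal sketch of such a modular linkage, assuming two dictionary-based stores standing in for the two SDMSs (all keys and values are illustrative): the shared object identifier acts as the primary key in both and permits a join of design and measurement data.

```python
# Design data (domains 1 and 3) and measurement data (domain 2) are held
# in separate stores; the shared object identifier (here a shortened
# stand-in for the full UUID) is the primary key in both.
design_data = {
    "1b9d6bcd": {"site": "S1", "soil_horizon": "A", "storage_box": "BOX07"},
}
measurement_data = {
    "1b9d6bcd": {"dna_concentration_ng_ul": 23.4, "purity_a260_a280": 1.8},
}

# Join the two stores on the identifiers present in both.
shared_ids = design_data.keys() & measurement_data.keys()
merged = {oid: {**design_data[oid], **measurement_data[oid]}
          for oid in shared_ids}
print(merged["1b9d6bcd"])
```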

Object identity and identifiers

Based on the specific study design, physical and digital objects are generated in each workflow segment. They are described by design data and method information (Figure 1; Figure 2; Table 1). For making (physical) objects recognizable in the field, in the laboratory and in storage, they need to obtain object identity. This is usually ensured by labelling their containers with physical object identifiers. These labels may include identifiers that are kept stable, unique and persistent only during the research project lifetime. Increasingly, however, identifiers such as universally unique identifiers (UUIDs, also called globally unique identifiers) (39, 40) or other persistent identifiers (PIDs), such as URIs or handles like the globally unique and persistent identifier for material samples (IGSN), are used already at the start of and along the whole workflow to maintain persistence during the research project lifetime and beyond, as far as (long-term) data storage, data archiving and data publication are concerned (40). Such identifiers are essential for referring to (environmental) objects, for generating intermediate (digital) objects and final products, and for labelling containers thereof during short- or long-term storage (Figures 2 and 3); they are also useful for documenting parent identity relationships (Figure 1). Unique object identifiers are also recommended for use as primary keys in SDMSs, along with design code assignment to objects (see further below). UUIDs of physical or digital objects are explicitly used for making an object distinctive (30, 41) and lack semantics. They are essential for data exchange in heterogeneous data networks but may be regarded as inconvenient for human readability. However, the first group of a version 4 UUID, comprising eight hexadecimal digits with a likelihood of repetition of 1:16^8 (= 1:4 294 967 296), appears sufficient for use as a human-readable ‘minimum-length’ identifier in a research project context, particularly when additionally linked to the corresponding full-length identifier in an SDMS.
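As a minimal sketch (using Python's standard uuid module; variable names are illustrative), such a shortened identifier can be derived from the first group of a version 4 UUID while keeping the link to the full identifier:

```python
# Derive a human-readable 'minimum-length' identifier from the first
# group of a version 4 UUID. Eight hexadecimal digits give
# 16**8 = 4 294 967 296 possible values, matching the 1:16^8 repetition
# likelihood stated above. The short form should remain linked to the
# full-length UUID in the SDMS.
import uuid

full_id = uuid.uuid4()
short_id = str(full_id).split("-")[0]   # first group: 8 hex digits

assert 16 ** 8 == 4_294_967_296
print(full_id)    # e.g. 1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed
print(short_id)   # e.g. 1b9d6bcd
```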

Linking workflow segments, operation designs and methods with design codes

For describing workflow segments, e.g. in an SDMS, the relevant design and method information elements need to be defined and used as descriptors. Required standard data types are numerical, Boolean, categorical, sequence and text (alphanumerical). As a precondition for recording data of a given workflow segment, the corresponding descriptors and concepts of numerical and text (alphanumeric) data types need to be fixed (relating design and methods via the generated physical object identifier). Data management during environmental studies is organized in two or more stages, comprising preparatory work in the pre-campaign phase, the field or laboratory campaigns or workflow segments themselves, and the subsequent processing and analysis of the generated data. Each campaign usually comprises one or a few workflow segments. In the pre-campaign phase, primary key object identifiers have to be assigned to the corresponding descriptors and descriptor states of design elements and method information. The setup of the workflow segment addresses the elementary operation domains of a given object by operation element definition and (sub-)classification.
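A sketch of such descriptor definitions, assuming a simple list-of-records layout (names, states and units are illustrative, not taken from a specific SDMS schema):

```python
# Descriptors for a workflow segment, each tied to an operation domain
# and carrying one of the standard data types named above.
descriptors = [
    {"name": "soil horizon type", "domain": "TF",
     "data_type": "categorical", "states": ["O", "A", "B", "C"]},
    {"name": "DNA extract concentration", "domain": "ME",
     "data_type": "numerical", "unit": "ng/µl"},
    {"name": "sample sterile", "domain": "TF", "data_type": "Boolean"},
    {"name": "storage box ID", "domain": "TA", "data_type": "text"},
]
```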

As described above, it is important that physical representatives of object identifiers (preferably UUIDs) are prepared and attached to objects or object containers (Figure 3). Their digital representatives are used as primary keys in an SDMS and assigned to operation designs, method information and measurement values. In addition, design element values, mostly of the transformation design, may be combined into an operation design code in the format of tuples of (TF, ME and TA) element values. Such codes provide information, e.g. on the details for a given object to be transferred into the respective container during processes of the respective workflow segment. Code-type information of this kind may be required, for instance, during sampling activities in the field and for controlling the setup of (bio-)assays. Design codes thus allow for identifying an object’s processual context and for guiding the operator through the respective workflow step or campaign (Figure 3; Tables 2 and 3). Labels assigned to object containers provide the necessary information in compact form and support manual or (semi-)automated processing in the field and laboratory. They may be unique within individual campaigns or workflow segments, but not necessarily within a complete research project or study. Object identifiers and transformation design codes may also be combined into representations as QR codes or barcodes (30).
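A minimal sketch of such a label string (format and values are our own illustration, not a prescribed encoding; a QR code could be generated from the same string with a third-party library):

```python
# Compose an operation design code (ODC) from tuples of transformation
# and transaction element values and pair it with the short identifier;
# the resulting string corresponds to the label shown in Figure 3.
tf_values = ("S1", "BH3", "HA", "R2")       # site, borehole, horizon, replicate
ta_values = ("BOX07", "RACK2", "MP01", "C4")  # box, rack, microplate, well
odc = "TF:" + "-".join(tf_values) + "|TA:" + "-".join(ta_values)

label_text = f"{odc}  ID:1b9d6bcd"  # short UUID group, linked to the full UUID
print(label_text)  # TF:S1-BH3-HA-R2|TA:BOX07-RACK2-MP01-C4  ID:1b9d6bcd
```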

Figure 3

Exemplary label with operation design code (ODC) and identifier (UUID) combined, both additionally represented as a QR code. To facilitate handling during a workflow segment, the labels attached to physical object containers (boxes for environmental samples, tubes for laboratory intermediate objects and for storage) need to be created beforehand.

Table 2

Use case 1 for workflow segments (campaigns). Environmental microbial community barcoding. Domains 1–3: transformation (TF), measurement (ME) and transaction design (TA) elements (E) and method information (MI) elements (E)

Workflow segment 1
TF: E 1: site number/ID | E 2: borehole number/ID | E 3: soil horizon type/depth definition | E 4: replicate/aliquot number/ID
ME: E 1: GPS data
TA: E 1: box ID | E 2: container ID
MI: E 1 (transformation): sampling protocol | E 2 (measurement): GPS (time, space) of borehole at site protocol | E 3 (transaction): sample into container transfer protocol

Workflow segment 2
TF: E 1: DNA extraction
ME: E 1: DNA extract concentration and purity
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: nucleic acids extraction protocol | E 2: nucleic acids extract quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 3
TF: E 1: DNA amplification
ME: E 1: PCR product concentration and product size
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: intermediate object amplification (PCR 1) protocol | E 2: amplicon quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 4
TF: E 1: PCR amplicon purification
ME: E 1: PCR product concentration and product size
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: amplicon purification (ExoSAP digestion) protocol | E 2: purified amplicon quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 5
TF: E 1: DNA amplification
ME: E 1: PCR product concentration and product size
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: intermediate object amplification (PCR 2) protocol | E 2: amplicon quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 6
TF: E 1: PCR amplicon purification
ME: E 1: PCR product concentration and product size
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: amplicon purification (magnetic beads) protocol | E 2: purified amplicon quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 7
TF: E 1: PCR amplicon pooling
ME: E 1: PCR product pool concentration
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: amplicon pooling protocol | E 2: library quality and quantity determination protocol | E 3: intermediate object transfer into container protocol

Workflow segment 8
TF: E 1: DNA library storing
ME: E 1: library storage temperature | E 2: library storage humidity | E 3: library storage light
TA: E 1: storage box ID | E 2: storage rack ID | E 3: microplate ID | E 4: microplate-internal position coordinate
MI: E 1: DNA library storage protocol | E 2: storage parameters control protocol | E 3: product transfer into container protocol
Table 3

Use case 2 for workflow segments (campaigns): fungal isolates, barcoding and phenotypic trait description. Domains 1–3: transformation (TF), measurement (ME) and transaction design (TA) elements (E 1–n) and method information (MI) elements (E 1–3)

Workflow segment 1
TF: E 1: site number/ID | E 2: borehole number/ID | E 3: soil horizon type/depth definition | E 4: replicate/aliquot number/ID
ME: E 1: GPS data
TA: E 1: box ID | E 2: container ID
MI: E 1 (transformation): sampling protocol | E 2 (measurement): GPS (time, space) of borehole at site protocol | E 3 (transaction): sample into container transfer protocol

Workflow segment 2
TF: E 1: (sub-)culture generation number | E 2: culture medium type | E 3: culture medium type variation | E 4: culture replicate number
ME: E 1: fungal colony growth rate
TA: E 1: microplate storage rack ID | E 2: microplate ID | E 3: microplate-internal position coordinate | E 4: aliquot number/ID
MI: E 1: fungal strain isolation and cultivation protocol | E 2: fungal culture growth measurement protocol | E 3: fungal culture translocation-inoculation protocol

Workflow segment 3
TF: E 1: DNA extraction
ME: E 1: DNA extract concentration and purity
TA: E 1: microplate storage rack ID | E 2: microplate ID | E 3: microplate-internal position coordinate | E 4: aliquot number/ID
MI: E 1: nucleic acid extraction protocol | E 2: nucleic acid quantity/quality measurement protocol | E 3: intermediate object transfer into container protocol

Workflow segment 4
TF: E 1: DNA amplification
ME: E 1: PCR product concentration and purity
TA: E 1: microplate storage rack ID | E 2: microplate ID | E 3: microplate-internal position coordinate | E 4: aliquot number/ID
MI: E 1: DNA amplification protocol | E 2: DNA amplicon quantity/quality measurement protocol | E 3: intermediate object transfer into container protocol

Workflow segment 5
TF: E 1: DNA isolate storage
ME: E 1: PCR product concentration
TA: E 1: room number/ID | E 2: freezing device ID | E 3: object/product container ID | E 4: aliquot number/ID
MI: E 1: DNA isolate storage protocol | E 2: storage parameters control protocol | E 3: product transfer into container protocol

Workflow segment 6
TF: E 1: DNA amplicon storage
ME: E 1: PCR product storage temperature
TA: E 1: room number/ID | E 2: freezing device ID | E 3: object/product container ID | E 4: aliquot number/ID
MI: E 1: DNA amplicon storage protocol | E 2: storage parameters control protocol | E 3: product transfer into container protocol

Workflow segment 7
TF: E 1: fungal culture staining
ME: E 1: fungal trait 1 | E 2: fungal trait 2 | E 3: fungal trait 3 | E 4: fungal trait 3 + n
TA: E 1: culture storage room number | E 2: culture storage rack ID | E 3: culture storage shelf number | E 4: storage box ID
MI: E 1: culture preparation (staining) for light microscopy protocol | E 2: culture measurement protocol with checklist of morphological traits (to be recorded in the measurement values database) | E 3: product transfer onto slide for light microscopy protocol

Workflow segment 8
TF: E 1: fungal culture storing
ME: E 1: culture storage temperature | E 2: culture storage humidity | E 3: culture storage light
TA: E 1: culture storage room number | E 2: culture storage rack ID | E 3: culture storage shelf number | E 4: storage box ID
MI: E 1: culture storage protocol | E 2: storage parameters control protocol | E 3: culture transfer into container protocol
Transformation design (TF)Measurement design (ME)Transaction design (TA)Method information (MI)
Workflow segment 1
E 1: site number/IDE 1: GPS dataE 1: box IDE 1 (transformation): sampling protocol
E 2: borehole number/IDE 2: container IDE 2 (measurement): GPS (time, space) of borehole at site protocol
E 3: soil horizon type/depth definitionE 3 (transaction): sample into container transfer protocol
E 4: replicate/aliquot number/ID
Workflow segment 2
E 1: (sub-)culture generation numberE 1: fungal colony growth rateE 1: microplate storage rack IDE 1: fungal strain isolation and cultivation protocol
E 2: culture medium typeE 2: microplate IDE 2: fungal culture growth measurement protocol
E 3: culture medium type variationE 3: microplate-internal position coordinateE 3: 1: fungal culture translocation-inoculation measurement protocol
E 4: culture replicate numberE 4: aliquot number/ID
Workflow segment 3
E 1: DNA extractionE 1: DNA extract concentration and purityE 1: microplate storage rack IDE 1: nucleic acid extraction protocol
E 2: microplate IDE 2: nucleic acid quantity/quality measurement protocol
E 3: microplate-internal position coordinateE 3: intermediate object transfer into container
E 4: aliquot number/ID
Workflow segment 4
E 1: DNA amplificationE 1: PCR product concentration and purityE 1: microplate storage rack IDE 1: DNA amplification protocol
E 2: microplate IDE 2: DNA amplificate quantity/quality measurement protocol
E 3: microplate-internal position coordinateE 3: intermediate object transfer into container protocol
E 4: aliquot number/ID
Workflow segment 5
E 1: DNA isolate storageE 1: PCR product concentrationE 1: room number/IDE 1: DNA isolates storage protocol
E 2: freezing device IDE 2: storage parameters control protocol
E 3: object/product container IDE 3: product transfer into container protocol
E 4: aliquot number/ID
Workflow segment 6
E 1: DNA amplicon storageE 1: PCR product storage temperatureE 1: room number/IDE 1: DNA amplicon storage protocol
E 2: freezing device IDE 2: storage parameters control protocol
E 3: object/product container IDE 3: product transfer into container protocol
E 4: aliquot number/ID
Workflow segment 7
E 1: fungal culture stainingE 1: fungal trait 1E 1: culture storage room numberE 1: culture preparation (staining) for light microscopy protocol
E 2: fungal trait 2E 2: culture storage rack IDE 2: culture measurement protocol with checklist of morphological traits (to be recorded in measurement values database)
E 3: fungal trait 3E 3: culture storage shelf numberE 3: product transfer onto slide for LM protocol
E 4: fungal trait 3 + nE 4: storage box ID
Workflow segment 8
E 1: fungal culture storingE 1: culture storage temperatureE 1: culture storage room numberE 1: culture storage protocol
E 2: culture storage humidityE 2: culture storage rack IDE 2: storage parameters control protocol
E 3: culture storage lightE 3: culture storage shelf numberE 3: culture transfer into container protocol
E 4: storage box ID
Table 3

Use case 2 for workflow segments (campaigns): Fungal isolates, barcoding and phenotypic trait description: Domains 1–3: transformation (TF), measurement (ME) and transaction design (TA) elements (E 1–n) and method information (MI) elements (E 1–3)

| Transformation design (TF) | Measurement design (ME) | Transaction design (TA) | Method information (MI) |
| --- | --- | --- | --- |
| Workflow segment 1 | | | |
| E 1: site number/ID | E 1: GPS data | E 1: box ID | E 1 (transformation): sampling protocol |
| E 2: borehole number/ID | — | E 2: container ID | E 2 (measurement): GPS (time, space) of borehole at site protocol |
| E 3: soil horizon type/depth definition | — | — | E 3 (transaction): sample into container transfer protocol |
| E 4: replicate/aliquot number/ID | — | — | — |
| Workflow segment 2 | | | |
| E 1: (sub-)culture generation number | E 1: fungal colony growth rate | E 1: microplate storage rack ID | E 1: fungal strain isolation and cultivation protocol |
| E 2: culture medium type | — | E 2: microplate ID | E 2: fungal culture growth measurement protocol |
| E 3: culture medium type variation | — | E 3: microplate-internal position coordinate | E 3: fungal culture translocation-inoculation protocol |
| E 4: culture replicate number | — | E 4: aliquot number/ID | — |
| Workflow segment 3 | | | |
| E 1: DNA extraction | E 1: DNA extract concentration and purity | E 1: microplate storage rack ID | E 1: nucleic acid extraction protocol |
| — | — | E 2: microplate ID | E 2: nucleic acid quantity/quality measurement protocol |
| — | — | E 3: microplate-internal position coordinate | E 3: intermediate object transfer into container protocol |
| — | — | E 4: aliquot number/ID | — |
| Workflow segment 4 | | | |
| E 1: DNA amplification | E 1: PCR product concentration and purity | E 1: microplate storage rack ID | E 1: DNA amplification protocol |
| — | — | E 2: microplate ID | E 2: DNA amplificate quantity/quality measurement protocol |
| — | — | E 3: microplate-internal position coordinate | E 3: intermediate object transfer into container protocol |
| — | — | E 4: aliquot number/ID | — |
| Workflow segment 5 | | | |
| E 1: DNA isolate storage | E 1: PCR product concentration | E 1: room number/ID | E 1: DNA isolate storage protocol |
| — | — | E 2: freezing device ID | E 2: storage parameters control protocol |
| — | — | E 3: object/product container ID | E 3: product transfer into container protocol |
| — | — | E 4: aliquot number/ID | — |
| Workflow segment 6 | | | |
| E 1: DNA amplicon storage | E 1: PCR product storage temperature | E 1: room number/ID | E 1: DNA amplicon storage protocol |
| — | — | E 2: freezing device ID | E 2: storage parameters control protocol |
| — | — | E 3: object/product container ID | E 3: product transfer into container protocol |
| — | — | E 4: aliquot number/ID | — |
| Workflow segment 7 | | | |
| E 1: fungal culture staining | E 1: fungal trait 1 | E 1: culture storage room number | E 1: culture preparation (staining) for light microscopy protocol |
| — | E 2: fungal trait 2 | E 2: culture storage rack ID | E 2: culture measurement protocol with checklist of morphological traits (to be recorded in measurement values database) |
| — | E 3: fungal trait 3 | E 3: culture storage shelf number | E 3: product transfer onto slide for LM protocol |
| — | E 4: fungal trait 3 + n | E 4: storage box ID | — |
| Workflow segment 8 | | | |
| E 1: fungal culture storing | E 1: culture storage temperature | E 1: culture storage room number | E 1: culture storage protocol |
| — | E 2: culture storage humidity | E 2: culture storage rack ID | E 2: storage parameters control protocol |
| — | E 3: culture storage light | E 3: culture storage shelf number | E 3: culture transfer into container protocol |
| — | — | E 4: storage box ID | — |

Field and laboratory work imply the generation of intermediate physical (and digital) objects or products by transformation, and the generation of measurement values for quality and quantity control or for data analysis during the campaign or post-campaign phases. During campaigns that comprise more than one workflow segment, considerable numbers of physical objects and corresponding data may be generated. Workflows usually end with the storage of physical and digital reference objects in a repository. The resulting digital objects, along with applied designs and method parameters and assigned to primary-key (physical) object identifiers, may be transformed into various technical formats and mapped to standard conceptual schemas and ontologies for subsequent (environmental) scientific analysis, data publication and long-term data deposition. Such thoroughly described digital objects may represent both the final and the starting stage of the data life cycle (12, 36), see below.

Two workflow design use case studies: species and community DNA barcoding and cultivation of fungal isolates

The present abstract model describes complete workflows of concatenated segments in environmental research. In recent years, the concept has been tested in field and laboratory contexts, e.g. for species and community DNA barcoding, based on an implementation in a generic SDMS. Samples collected in the field may be tissues of organisms (species and community) and various kinds of substrates (e.g. soil, water, air). Locations where environmental samples were collected have been described by a sequence of hierarchical elements, i.e. a code or number indicating the individual plot, identifiers or codes of objects on the plot (e.g. soil borehole numbers, tree individual or organism IDs), names of target substructures (e.g. soil horizons, plant or animal organs) and, optionally, replicate numbers. For scientific analysis, transformations or treatments of objects are manifold and may include the sectioning of environmental samples or parts thereof into replicates or aliquots. For an experimental setup, e.g. for determining the growth rates of microbial cultures, a hierarchical design may be chosen as follows: culture generation number, growth medium type, growth medium type variation and growth conditions (e.g. temperature). For species DNA barcoding (42), community DNA barcoding or metabarcoding projects (43, 44) or microarray hybridization experiments (45, 46, 47), transformation includes a series of steps: the extraction of nucleic acids, the amplification of DNA or RNA and (for metabarcoding) the pooling of amplicons, resulting in various intermediate object products (e.g. purified DNA extracts, purified PCR amplicons and DNA libraries) generated according to protocols, usually provided by the manufacturers of laboratory kits. This implies that laboratory work in an ‘omics’ context may be divided into at least three or four workflow segments with corresponding physical intermediate objects, each to be characterized according to design elements for transformation, measurement and transaction as well as to method information.
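To make the hierarchy concrete, the following minimal Python sketch composes such a hierarchical location description into a single sample designator. The field names and example values are hypothetical and not part of any published schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SamplingLocation:
    """Hierarchical description of where an environmental sample was taken."""
    plot_code: str                   # code/number of the individual plot
    object_id: str                   # e.g. soil borehole number or tree individual ID
    substructure: str                # target substructure, e.g. a soil horizon or organ
    replicate: Optional[int] = None  # optional replicate number

    def designator(self) -> str:
        """Concatenate the hierarchy levels into one readable designator."""
        parts = [self.plot_code, self.object_id, self.substructure]
        if self.replicate is not None:
            parts.append(f"rep{self.replicate}")
        return "/".join(parts)

loc = SamplingLocation("P07", "borehole-03", "horizon-Ah", replicate=2)
print(loc.designator())  # P07/borehole-03/horizon-Ah/rep2
```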

Use case 1 (environmental microbial community barcoding) provides an overview of an exemplary omics-driven microbial community barcoding workflow from sampling in the field to the creation and storage of sets of raw sequence data (Table 2). The operation design of the first workflow segment characterizes (a) the position of the sample at a given site (name or ID), (b) the object (number or ID) and (c) the object part (term); further, (d) the number of replicates or aliquots may be specified. Transformation in the first workflow segment concerns sample collection itself; measurement concerns recording the geo-coordinates of the sampling site by GNSS (e.g. GPS) or climate parameters by sensors. In subsequent workflow segments, operations on physical objects include, for instance, nucleic acid extraction, DNA amplification and amplicon pooling plus DNA sequencing, as well as corresponding measurements for quality and quantity control of nucleic acid extracts and amplicons by spectrophotometry and of raw read nucleic acid sequence patterns by a DNA sequencing device. While quality and quantity control measurements during nucleic acid processing are relevant for testing and proving the reliability of data, site parameters (design element values) and sequence patterns (measurement values) represent the raw material for scientific analysis.

Use case 2 (fungal isolates, barcoding and phenotypic trait description) gives a schematic overview of isolating fungal strains from environmental samples and of DNA extraction and amplification from fungal cultures for subsequent DNA barcoding to generate fungal marker gene sequence data (Table 3). The transformation design of the first workflow segment largely corresponds with that of use case 1. It is followed by workflow segments for fungal isolation and cultivation, establishing pure cultures and generating subcultures. Subsequent steps may include laboratory standard procedures of nucleic acid extraction and DNA amplification for DNA sequencing, as well as corresponding measurements for quality and quantity control of nucleic acid extracts and amplicons by spectrophotometry and the measurement of DNA sequences in a sequencing device. In addition, micromorphological and other traits of fungal strains are characterized by applying standard procedures.

For a complete documentation of the work- and dataflow from the field campaign to sequence pattern data in the context of data publishing, it is essential to provide the design data of the various workflow segments as part of the generated digital objects. The final objects to be published may include geolocation and other measurements obtained in the field, quality and quantity control data of intermediate objects or products in the laboratory, and final measurement data like raw and processed DNA sequence data. This entails the need to assign workflow segment design element values and method information to the corresponding identifiable objects, i.e. samples or intermediate objects. To obtain the data structure of complete and coherent workflows, successive (digital) objects of a given environmental research workflow have to be concatenated via parent identifiers pointing to the preceding object, as sketched below.
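The following Python sketch is a simplified stand-in for this concatenation principle (the labels and the in-memory registry are hypothetical, not the database implementation): UUID version 4 identifiers are assigned to digital objects, and a workflow chain is backtracked via parent identifiers.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DigitalObject:
    """A digital object of one workflow segment, linked to its predecessor."""
    label: str
    parent_id: Optional[uuid.UUID] = None                      # identifier of the preceding object
    object_id: uuid.UUID = field(default_factory=uuid.uuid4)   # UUID version 4

registry: Dict[uuid.UUID, DigitalObject] = {}

def register(label: str, parent: Optional[DigitalObject] = None) -> DigitalObject:
    """Create an object and record its parent relation in the registry."""
    obj = DigitalObject(label, parent_id=parent.object_id if parent else None)
    registry[obj.object_id] = obj
    return obj

# Concatenate objects of consecutive workflow segments via parent identifiers.
sample = register("soil sample")
extract = register("DNA extract", parent=sample)
amplicon = register("PCR amplicon", parent=extract)

# Backtracking: follow parent identifiers from the final object to the field sample.
node: Optional[DigitalObject] = amplicon
while node is not None:
    print(node.label, node.object_id)
    node = registry.get(node.parent_id) if node.parent_id else None
```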

Software applications for management of operation designs, method information and assigned measurement values

Nearly one decade ago, around 100 companies worldwide had set up and provided LIMSs (48). Thus, more than 200 LIMSs in the broad sense might exist now, most of them commercial. LIMS applications are devoted to the management of structured data. These software applications and services organize information on laboratory consumables and manage processes as well as information on process parameters and laboratory protocols. Central components mostly include the administrative and organizational domains for object management and for addressing the ways of generating and processing analysis results. The most advanced solutions also provide complete workflow modules and often comprise modules with traditional functionalities of so-called ‘electronic laboratory notebooks’ (ELNs) for organizing and storing semi-structured information like laboratory protocols. ELNs are widely used in academic research laboratories, being more flexible than most laboratory execution systems, which are applied in industrial laboratories (48).

SDMSs in this context (49) (https://en.wikipedia.org/wiki/Laboratory_information_management_system; https://www.limswiki.org/index.php/Scientific_data_management_system) are specifically structured to manage laboratory research data (raw data, analysis result data and documents), including certain functionalities for long-term preservation and archiving. Traditionally, SDMSs are implemented as part of LIMSs. Currently, most of these systems are being extended to handle structured research data. Several have interfaces to exchange contextual and bibliographic core data mapped to existing norms and (de facto) standards, among them certain ISO norms for geographic and analysis data and certain community-based standards as ratified by TDWG, GSC and ASTM committees.

Few commercial SDMS applications are committed to managing designs and resulting data like measurement data (see BSSN software, https://www.bssn-software.de/animl-de/, and Limsophy RIMS, https://www.limsophy.com/). Most of the medium- to large-sized environmental research laboratories at governmental agencies and non-university organizations now run mid- and long-term scientific approaches and have well-established in-house system solutions for those studies.

University research groups, however, are often confronted with temporary employment and forced to focus on short-term research topics. This may contrast with the growing complexity of data analysis pathways in biology, particularly in meta-omics research. However, there is an increasing range of options to use interoperable and scalable software, scripts, (sub-)discipline-specific services, web-based subject-specific data management workbenches, analysis platforms and pipelines (e.g. mothur (50), the UNITE platform and the PlutoF web workbench, both for fungal research (51, 52), and the QIIME 2 pipeline (53) for analysing microbial communities). In addition, platforms providing microservices and virtual research environments (VREs), e.g. with Jupyter and Galaxy components, are spreading (27, 54, 55), as are platforms for complete bioinformatics workflows (56, 57).

These pipeline software solutions, file-sharing repositories and data file documentations, however, mostly do not focus on the early stages of operational workflows and their segmentation, and often do not consider design details of the three domains required to ensure the repeatability of studies. Furthermore, these solutions are in general fixed in structure and not flexible enough to cope with the variations of hypothesis-driven research study design. Holistic approaches to model a database software solution appropriate for storing the study design data and the data of all steps of research workflows in a generic form are therefore scarce (58, 59). In summary, such software solutions for all-inclusive data management are recommendable particularly for long-term monitoring projects with an agreed study design.

Software application DiversityDescriptions used as SDMS

A software tool for the requested type of management of operation designs, corresponding data (domains 1–3) and method information is the Diversity Workbench relational SQL database component DiversityDescriptions (DWB-DD) (ver. 4.x, https://diversityworkbench.net/Portal/DiversityDescriptions with manual). The client-server application with a rich editing client is part of the Diversity Workbench database framework and is open source and free for download. It is appropriate for modelling and maintaining segmented workflows and creating correspondingly coherent data; thus, it supports hypothesis-driven design in environmental research. The setting up of descriptors (concepts) and predefined values (descriptor states) as well as their maintenance is addressed by the DWB-DD ‘descriptor editing interface’. Descriptors may be of categorical, numeric, sequence and text data types. Categorical states (or predefined element values) may correspond with values as provided, e.g., as design code elements (Figure 4). Details on descriptors may be continuously added as contextual data, i.e. resource data, using URIs. Designs of all three domains (Figure 1) may exhibit more or less complex hierarchies. The step of establishing a study design in a specific DWB-DD installation should therefore be completed before a study is initiated. Later on, the setting up of descriptions, i.e. the assignment of an object identifier to (predefined) design element values and the measurement values, is achieved via the DWB-DD ‘description editing interface’.
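The descriptor structure outlined above can be illustrated by a minimal Python sketch. It merely mirrors the categorical, numeric, sequence and text data types; the class names, the example states and the protocol URI are hypothetical and do not reproduce the actual DWB-DD data model:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class DataType(Enum):
    """The four basic descriptor data types supported by DWB-DD."""
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"
    SEQUENCE = "sequence"
    TEXT = "text"

@dataclass
class Descriptor:
    """A descriptor (concept) with optional predefined states and a resource URI."""
    name: str
    data_type: DataType
    states: List[str] = field(default_factory=list)  # predefined element values (categorical only)
    resource_uri: Optional[str] = None               # contextual resource data, e.g. a protocol URL

medium = Descriptor(
    name="culture medium type",
    data_type=DataType.CATEGORICAL,
    states=["PDA", "MEA", "SNA"],  # hypothetical predefined states
    resource_uri="https://example.org/protocols/cultivation",  # placeholder URI
)
growth_rate = Descriptor("fungal colony growth rate", DataType.NUMERIC)
```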

Figure 4

DiversityDescriptions enables the free definition of descriptors and descriptor states for representing descriptive data of a study item, based on various basic data types, and enrichment via ‘description scopes’ by linking Diversity Workbench modules and external web resources.

Measurement data may be entered into such an individually designed SDMS directly via the ‘form-editing’ and ‘grid-editing’ interfaces of the client software or may be imported from XML-formatted documents or CSV-formatted spreadsheet files via a data import wizard. Design data of the different domains and the corresponding research data (values) may be maintained in separate DWB-DD installations, together building an SDMS. The connection between corresponding domains of a given object is achieved by applying shared object identifiers (e.g. UUID version 4, group 1 characters) as primary keys or description values. The specification of data elements for identifiers of preceding (parent) objects allows for linking the respective objects to related ones in the workflow. DiversityDescriptions also allows for sophisticated user rights and access management.

The latest version of the data model of DiversityDescriptions was published in 2016 (60). It largely follows the information model described earlier (61), is based on an element triple structure and follows the conventions of the TDWG standard for Structured Descriptive Data (SDD: https://www.tdwg.org/standards/sdd/). As part of the Diversity Workbench database framework (62), the DWB-DD application is capable of identifying and linking data from other domain-specific modules, following the description scope concept. The linking of DWB-DD digital objects with DWB-internal and external web resources allows for building a data network with DWB-DD objects (Figure 4). Furthermore, DWB-DD allows for setting up a project-specific descriptor and value terminology, which in the case of meta-omics research might follow the Meta-omics Data of Collection Objects (MOD-CO) conceptual standard for naming design elements and predefined values for the three design domains (31, 63). Each descriptor and descriptor value might be linked via HTTP URI to external web resources or references and thus allows for linking online-published laboratory and experiment instructions, protocols and multimedia files (Figure 4).

Study operation elements mapped to standard conceptual schemas and ontologies, concatenation and coherence of data objects

Study designs depend on scientific questions and the inferred factors (variables) for testing hypotheses. It is rather obvious that design elements and method information, i.e. factors and parameters (invariables), are highly diverse and covered by existing ontologies and conceptual schemas only to some extent. User-driven specification of design elements for use as factors in subsequent analyses therefore implies that elements are also generated and applied that are not compliant with existing ontologies with regard to naming and meaning. For data publication, however, it is recommendable that elements of such study-specific proprietary schemas are mapped, as far as practicable, onto elements of community-agreed, service- or domain-specific conceptual standard schemas.

Any existing conceptual schema allowing the declaration of relations between object identifiers is in principle suitable to explicitly address design elements in a workflow along with method information and measurement data. Indeed, a considerable proportion of proprietary design elements as defined above may be mappable onto generic or domain-specific schemas and (de facto) standard schemas like the ISA Model, AnIML, UnitsML, MIxS, MOD-CO, the Ecological Metadata Language (EML) and ABCD (64, 65, 66, 67, 68, 31, 69, 70, 71), as well as others, e.g. those listed by GFBio under https://gfbio.biowikifarm.net/wiki/Data_exchange_standards,_protocols_and_formats_relevant_for_the_collection_data_domain_within_the_GFBio_network.

Certainly, most of these schemas were designed to publish digital objects from heterogeneous data sources and cover basic functions to make them findable and accessible within and among disciplines using bibliographic data. They often encompass very few schema elements addressing (digital) objects of early scientific workflow stages and workflow segment designs of study operations. Therefore, simple data publication with digital objects compiled in the above-mentioned domain standards alone is mostly not sufficient to guarantee true reusability of data objects and to keep the overall data structure coherent for addressing the repeatability of study setups.

There are two steps to realize concatenation and coherence of data objects based on a study-specific proprietary design. As a first step, a generic data exchange markup language and format like SDD might be used (72). This DELTA-related TDWG standard schema was ratified in 2005 and allows for organizing descriptions of given digital objects based on descriptors of various data types, including categorical ones with predefined values. There are, however, few applications that can manage, read and exchange data in this way. One reason for the still low acceptance may be that this TDWG standard is considered suitable only for ‘data describing a taxon or specimen’. However, SDD can be regarded as a generic standard with an XML-based schema, which allows for defining matrix data from all fields of ecological and environmental research.

As a second step to realize coherence, the structured and SDD-standardized study operation elements may be mapped across multiple conceptual schemas, sometimes with reference to controlled vocabularies, class hierarchies and ontologies like ENVO (73, 74) and MOD-CO (31). During this mapping effort, the first workflow segment has to be considered with priority to keep the coherence, respectively the concatenation, of data objects via child-parent relationships (31).

Functions to map and convert descriptors and descriptor values of digital data objects to conceptual schemas and schema elements play an important role as features of an SDMS. The Simple Knowledge Organization System (SKOS) provides a common data model and vocabulary for sharing and linking knowledge organization systems like thesauri and classification schemes (75, 76). It is recommended to use parts of this model to optimize the mapping features of the SDMS used. Thus, descriptor and descriptor state terms can be assigned to alternative schema element terms (i.e. schema concepts) for later use in a Semantic Web data publishing and archiving context. The five SKOS matching categories can be used for labelling semantic relations and the kind of mapping relations (77); this is essential for matching concepts, schema elements and assigned ontology terms to target schema elements that might not match exactly in every case. Ontology-based data management, ontology-to-ontology mapping and a model for conceptual schema transformation from a ‘domain schema’ to one or more ‘upper schemas’ exist (78, 79).
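The following Python sketch illustrates how such SKOS-labelled mappings could be represented. The five mapping properties are those of the SKOS recommendation; the example source and target terms, however, are hypothetical illustrations, not mappings taken from a published crosswalk:

```python
from dataclasses import dataclass
from enum import Enum

class SkosMatch(Enum):
    """The five SKOS mapping properties for labelling semantic relations."""
    EXACT = "skos:exactMatch"
    CLOSE = "skos:closeMatch"
    BROAD = "skos:broadMatch"
    NARROW = "skos:narrowMatch"
    RELATED = "skos:relatedMatch"

@dataclass(frozen=True)
class ConceptMapping:
    source_term: str   # descriptor or descriptor state term in the SDMS
    target_term: str   # element of the target conceptual schema or ontology
    relation: SkosMatch

mappings = [
    # hypothetical mappings for illustration only
    ConceptMapping("GPS data", "eml:geographicCoverage", SkosMatch.CLOSE),
    ConceptMapping("soil horizon type", "ENVO: soil horizon", SkosMatch.EXACT),
]
for m in mappings:
    print(f"{m.source_term} --{m.relation.value}--> {m.target_term}")
```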

Software application DiversityDescriptions used for schema mapping and XML provision of digital objects

DiversityDescriptions has a number of features as described above. It might be installed as a stand-alone SDMS in a scientific working group and provides mapping options to existing discipline-specific standard schemas and ontologies. Data conversion can be achieved by use of a sophisticated data export wizard: DWB-DD descriptors can be selected and mapped to create an XML schema with a compliant XSD document. SKOS categories and certain features of an ODMS are on board.

Thus, DWB-DD can be used as a flexible tool for the generation of highly structured and coherent digital objects or data products already during the research study, e.g. for serving tools for quality control and analysis. For this purpose, the application is ready to generate any XML file with operation design, method information and measurement values according to a predefined proprietary schema, to validate this schema and to export the file with content data together with the compliant XSD document. This flexible research data export can be done either by using the study design-specific vocabulary for content data or by using elements mapped to any domain standard schema (Figure 5).
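A minimal sketch of the downstream validation step, assuming hypothetical file names for an exported XML/XSD pair and using the third-party lxml library (not a DWB-DD API), could look as follows:

```python
from lxml import etree

# Hypothetical file names; the export wizard writes both documents together.
xml_doc = etree.parse("study_export.xml")                   # content data (design, methods, values)
schema = etree.XMLSchema(etree.parse("study_export.xsd"))   # the compliant XSD document

if schema.validate(xml_doc):
    print("export is valid against the proprietary schema")
else:
    for error in schema.error_log:
        print(error.message)
```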

Figure 5

Research data export from DiversityDescriptions provided as XML files with content data in study design-specific vocabulary, provided in various formats: (a) as XML (with local XSD), (b) as XML-EML (with core elements mapped to any further domain standard schema) and (c) as XML-SDD (with core elements mapped to any further domain standard schema).

Along with other Diversity Workbench components, DWB-DD is suitable to integrate internal DWB network information as description scope (Figure 4) and to act as a core management module in an institutional data repository. An example is GFBio, with data repositories and data centres committed to management, archiving and publication services according to the Open Archival Information System standard (http://www.oais.info/) (36). In this context, DWB data pipelines generate dissemination and archiving information packages. These packages (zip archives) may include standard schema-formatted, SDD-structured XML files for design, method information and measurement values (eventually with elements mapped to MOD-CO) as well as EML-structured files (Figure 5; 16). Bibliographic and administrative (Dublin Core) data elements that are mandatory for GBIF- and GFBio-mediated HTTP URI-resolved data provision may be handled via data repository-specific DWB data publication pipelines (optionally with DOI assignment). This data handling is not the subject of this article.

Discussion and conclusion: FAIR++ digital objects

As in other fields of research, environmental and life sciences research designs are based on scientific questions and hypotheses. Their documentation is essential to ensure the repeatability of studies and the potential reproducibility of results. Furthermore, it is indispensable to create and manage highly structured, enriched content data already at early stages of the data life cycle. This means that the establishment and maintenance of operation data (data structure, administrative and contextual data, ‘metadata’), established before or gathered during campaigns or consecutive workflow segments, together with the corresponding measurement data, are required to allow for backtracking coherent workflows.

If taken seriously, proper documentation and linkage of these data to every object generated during a workflow segment is mandatory, but it has sometimes been neglected in project data management and documentation. For truly original scientific studies, designs often need to be unique, i.e. standardized schemas are not applicable. Therefore, study designs in ecology are often considered too discipline-specific to be adequately representable as an SDMS pre-set. In consequence, the demand is increasing to find adequate conceptual and technical solutions that allow for maintaining even such specific data, with the option of mapping and publishing them onto community-agreed schemas later.

Due to the ostensible lack of adequate structures for maintaining coherent data, preliminary strategies to address or circumvent these challenges are frequently tolerated, like the segregation of data of different contexts (field work versus laboratory work) or the ‘projection’ of multidimensional structures into two-dimensional matrices, entailing a disruption of data relations. In consequence, designs and method information may be disconnected from measurement values or object trait data, which in turn makes it impractical to backtrack data published as table or diagram elements. Such lack of traceability is perceived by users of published data, who might have problems with their interpretation, for instance regarding outliers and the handling of missing data (7). The approach of restricting publishable research data to ‘relevant’ or ‘sanitised’ sets of data (6), ignoring that the relevance of published data is principally unforeseeable, is not considered an adequate solution either.

The present abstract model faces this challenge by proposing elementary structures as a suitable backbone for SDMSs. It includes the assignment of identifiers to objects, the assignment of object identifiers to freely designable design elements and method information, and object relations as well as their use to establish coherence between consecutive workflow segments. It may be argued that the present abstract model is not necessary because it is already in parts supported by several existing data management systems and can even be realized by use of spreadsheets. This, however, is not quite correct: even those SDMSs with a broad function scope are often not flexible enough to provide a frame of elements that investigators could define themselves as design elements and design element classes (corresponding to schema element definition). In addition, they are not able to handle the whole chain or network of relations required for complete coherence in a scientific workflow. The same holds for ‘flat-structure applications’ like spreadsheets, which are rather inconvenient for extended studies and not stable enough to allow for the management of complex data structures. Wikidata approaches as an option for open data management have to be analysed separately.

An SDMS for individual in-house and local, study-internal and institutional management of partly internal and sensitive research data is DiversityDescriptions. For data publication via internet services, Wikidata is a suitable option to (re-)organize structured digital data objects in a linked-open-data context. The bioschemas.org community project (80, 81) is going to build a semantic layer for the retrieval of life sciences website content through agreed ontologies. Hitherto, this project does not have structured FAIR digital data objects from research studies in focus. This might change as soon as it cooperates with the free knowledge base Wikidata (https://www.wikidata.org) on the retrieval of structured data.

The present abstract model and its prototype software implementation represent an approach that meets the requirements of both specificity and generality. It promotes a type of database application for (a) maintaining and publishing data according to study-specific schemas and workflows and, due to its flexibility, supporting work at a level beyond what most SDMS solutions allow for, and (b) facilitating the archiving of reusable data by mapping them to domain standard schemas and converting them to archival technical formats. The concept of managing workflow segment design and measurement data together is particularly suitable for keeping an overview of ongoing surveys or studies. It reduces the likelihood of wrongly assigning objects to design elements and to digital objects generated by measurement and, in consequence, contributes to setting up reliable and reproducible data. These are mandatory to fulfil reproducibility requirements and should finally refer to the identifiers of samples and (intermediate) products as deposited in public natural science collections and biobanks (34, 35, 82).

Figure 6

FAIR++: Reusability of physical and digital objects in a research setup is a precondition for the repeatability of operation designs on the respective object type. Analysis of digital objects (according to a certain design) entails the reproducibility of the operation results. Reproducibility of operations on physical objects, however, depends on whether or not environmental parameters are fully controlled. If not, a study setup can be repeated, but results may turn out to be different from the initial ones.

The FAIR guiding principles and recommendations ‘can be applied to a wide range of scholarly outputs’ (13). Confirming this, the European Commission expert group on FAIR data stated that FAIR principles should be applied to any digital objects, which may represent data, software or other research resources, and gave a precise definition of the understanding of reusability (83, 84). Mons et al. (14) stated that ‘FAIR refers to a set of principles, focused on ensuring that research objects are reusable, and actually will be reused, and so become as valuable as is possible’. While ‘reusability’ was widely under discussion as one of the core principles, ‘reproducibility’ was originally explained with only one example from the publication of so-called ‘non-data research objects’, i.e. ‘analytical workflows’ (13), and was not addressed at all one year later in a subsequent article (14); neither was ‘repeatability’ mentioned.

We therefore recommend to explicitly include ‘repeatability’ (of observation and measurement conditions) and ‘(potential) reproducibility’ (of observation and measurement results) as guiding principles in the FAIR principle context, especially for data products or assets. This implies that so-called ‘non-data assets’ sensu Wilkinson et al., e.g. scanned documents of lab protocols, will come into the focus of discussion for being re-structured. According to the proposed abstract model, this can be achieved by generating coherent digital data objects with structured and enriched design elements (contextual data). Thus, reusable ‘FAIR++’ digital objects and data will include detailed design data to guarantee repeatability and method information to allow for potential reproducibility of research results (Figure 6; 16).

FAIR++ digital data objects of high granularity and coherence, as far as they are freely and openly published on the internet, will increase the ‘analytic potential’ of research results (85) and allow for their interpretation also under strict scientific standards.

Web links and abbreviations

(In alphabetical order; all URLs last accessed on March 15th, 2020)

AnIML—Analytical Information Markup Language https://animl.org/

ASTM—ASTM Subcommittee E13.15 on Analytical Data https://www.astm.org/COMMIT/SUBCOMMIT/E1315.htm

ASTM—ASTM E1578–18, Standard Guide for Laboratory Informatics, ASTM International, West Conshohocken, PA, 2018, https://www.astm.org/

BIOAWARE—Life Sciences Data Management Software: https://www.bio-aware.com/

BSSN software: https://www.bssn-software.com/

Capterra—Capterra software search: https://www.capterra.com/workflow-management-software

DC—Dublin Core Metadata Initiative: DCMI Specifications: https://www.dublincore.org/specifications/

ELN—Electronic laboratory notebook, used in R&D labs; see list of software, e.g. LabCollector.

EML—Ecological Metadata Language. https://knb.ecoinformatics.org/tools/eml

GBIF—Global Biodiversity Information Facility: https://www.gbif.org/

GFBio—German Federation for Biological Data: http://www.gfbio.org

GFBio Wiki—Concepts and Standards in the Public Wiki of the German Federation for Biological Data: https://gfbio.biowikifarm.net/wiki/Concepts_and_Standards

GSC—Genomic Standards Consortium: https://press3.mcs.anl.gov/gensc/

IGSN e.V.—Implementation organization of the IGSN; the IGSN is a globally unique and persistent identifier (PID) for material samples: https://www.igsn.org/

INSDC—International Nucleotide Sequence Database Collaboration: http://www.insdc.org/

INSPIRE—INSPIRE Implementation Rules: https://inspire.ec.europa.eu/inspire-implementing-rules/51763

ISA—ISA Abstract Model: https://isa-specs.readthedocs.io/en/latest/isamodel.html

LabCollector: https://www.labcollector.com

LIMS—Laboratory information management system—see list of software, e.g. LabCollector.

LIMS—Laboratory information management system: https://en.wikipedia.org/wiki/Laboratory_information_management_system

MendeLIMS—MendeLIMS: A web-based laboratory information management system for clinical genome sequencing—Scientific Figure on ResearchGate. https://www.researchgate.net/figure/Comparison-among-different-LIMS-systems_tbl1_265093602

Profeza—SidSam Profeza Technologies India Private Limited: https://www.profeza.com/

SDD—Structured Descriptive Data: https://www.tdwg.org/standards/sdd/

SKOS—Simple Knowledge Organization System: http://www.w3.org/2004/02/skos/

SDMS—Scientific Data Management System: https://en.wikipedia.org/wiki/Laboratory_informatics

TDWG—TDWG group on Biodiversity Information Standards: http://www.tdwg.org/standards

Acknowledgements

We are grateful to Dr Stefan Seifert, Dr Markus Weiss (both München, Germany) and the members of the Archiving Working Group (WG 4) of the EU COST Action MOBILISE for stimulating discussions. Further, we appreciate the testing of the database in real-world contexts by the members of the UBT Mycology Dept. Philipp Baumann, Inna Gild, Gerasimos M. Gkoutselis, Renata Janssen, Theresa Pommerien and Nihal Telli and by the members of the ARAMOB project group, mainly Dr Florian Raub (Karlsruhe, Germany) and Ingo Wendt (Stuttgart, Germany).

Funding

Abstract model development and publication as well as Diversity Workbench software implementation have been supported by the LIS infrastructure programme of the German Research Foundation (DFG) with the projects MOD-CO (https://www.mod-co.net) (RA 731/16–1, TR 290/8–1), GFBio (https://www.gfbio.org) (TR 290/7–3), ARAMOB (https://aramob.de/de/projekt) (TR 290/9–1) and SFB 1357 (https://www.sfb-mikroplastik.uni-bayreuth.de/en/index.html) (TP C04). The Federal Ministry of Education and Research (BMBF) funded the project GBOL (https://www.bolgermany.de/) (01LI1501M), and the Federal Agency for Agriculture and Food (BLE) with the BMEL-BMU-Waldklimafonds (https://www.waldklimafonds.de/) funded the project Höhengradient (28WC412203). Further support was provided by the EU COST Action CA17106 (https://www.mobilise-action.eu/).

Conflict of interest. None declared.

References

1. Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science, 349, 943–945. doi: https://doi.org/10.1126/science.aac4716.

2. Stark, P.B. (2018) No reproducibility without preproducibility. Nature, 557, 613.

3. Martin, J. (2019) Reproducibility: the search for microbiome standards. BioTechniques, 67, 86–88.

4. Schnitzer, S.A. and Carson, W.P. (2016) Would ecology fail the repeatability test? BioScience, 66, 98–99. doi: https://doi.org/10.1093/biosci/biv176.

5. Fidler, F. and Wilcox, J. (2018) Reproducibility of scientific results. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy (Winter 2018 edition). https://plato.stanford.edu/archives/win2018/entries/scientific-reproducibility.

6. Fraser, H., Parker, T., Nakagawa, S. et al. (2018) Questionable research practices in ecology and evolution. PLoS ONE, 13, e0200303. doi: https://doi.org/10.1371/journal.pone.0200303.

7. Kang, H. (2013) The prevention and handling of the missing data. Korean J. Anesthesiol., 64, 402–406.

8. Hammerling, J.A. (2012) A review of medical errors in laboratory diagnostics and where we are today. Lab. Med., 43, 41–44. doi: https://doi.org/10.1309/LM6ER9WJR1IHQAUY.

9. White, E.P., Baldridge, E., Brym, Z.T. et al. (2013) Nine simple ways to make it easier to (re)use your data. Ideas Ecol. Evol., 6, 1–10. doi: https://doi.org/10.4033/iee.2013.6b.6.f.

10. Sutter, R.D., Wainscott, S.B., Boetsch, J.R. et al. (2015) Practical guidance for integrating data management into long-term ecological monitoring projects. Wildl. Soc. Bull., 39, 451–463. doi: https://doi.org/10.1002/wsb.548.

11. Cheah, P.Y., Day, N.P.J., Parker, M. et al. (2017) Sharing individual-level health research data: experiences, challenges and a research agenda. ABR, 9, 393–400.

12. Diepenbroek, M., Glöckner, F., Grobe, P. et al. (2014) Towards an integrated biodiversity and ecological research data management and archiving platform: the German Federation for the Curation of Biological Data (GFBio). In: Plödereder, E., Grunske, L., Schneider, E. and Ull, D. (eds). Informatik 2014 – Big Data Komplexität meistern. GI-Edition: Lecture Notes in Informatics (LNI) – Proceedings, 232, 1711–1724.

13. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J. et al. (2016) The FAIR guiding principles for scientific data management and stewardship. Sci. Data, 3, 160018. doi: https://doi.org/10.1038/sdata.2016.18.

14. Mons, B., Neylon, C., Velterop, J. et al. (2017) Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inform. Service. Use, 37, 49–56. doi: https://doi.org/10.3233/ISU-170824.

15. Sansone, S.-A., McQuilton, P., Rocca-Serra, P. et al. (2018) FAIRsharing: working with and for the community to describe and link data standards, repositories and policies. BioRxiv. doi: https://doi.org/10.1101/245183.

16. Harjes, J., Triebel, D., Link, A. et al. (2019) FAIR data in meta-omics research: using the MOD-CO schema to describe structural and operational elements of workflows from field to publication. Biodivers. Inf. Sci. Standards, 3, e37596. doi: https://doi.org/10.3897/biss.3.37596.

17. Assante, M., Candela, L., Castelli, D. and Tani, A. (2016) Are scientific data repositories coping with research data publishing? Data Sci. J., 15, 1–24. doi: https://doi.org/10.5334/dsj-2016-006.

18. Waide, R.B., Brunt, J.W. and Servilla, M.S. (2017) Demystifying the landscape of ecological data repositories in the United States. BioScience, 67, 1044–1051. doi: https://doi.org/10.1093/biosci/bix117.

19. Parsons, M.A., Gødoy, Ø., LeDrew, E. et al. (2011) A conceptual framework for managing very diverse data for complex, interdisciplinary science. J. Inf. Sci., 37, 555–569. doi: https://doi.org/10.1177/0165551511412705.

20. Leonelli, S. (2018) Re-thinking reproducibility as a criterion for research quality. Res. History Econ. Thought Methodol., 36B, 129–146.

21. Sandve, G.K., Nekrutenko, A., Taylor, J. and Hovig, E. (2013) Ten simple rules for reproducible computational research. PLoS Comput. Biol., 9, e1003285. doi: https://doi.org/10.1371/journal.pcbi.1003285.

22. Renear, A.H., Sacchi, S. and Wickett, K.M. (2010) Definitions of dataset in the scientific and technical literature. Proc. Am. Soc. Inf. Sci. Tech., 47, 1–4. doi: https://doi.org/10.1002/meet.14504701240.

23. Sacchi, S., Wickett, K.M., Renear, A.H. and Dubin, D. (2011) A framework for applying the concept of significant properties to datasets. Proc. Am. Soc. Inf. Sci. Tech., 48, 1–10. doi: https://doi.org/10.1002/meet.2011.14504801148.

24. Wittenburg, P., Strawn, G., Mons, B. et al. (2019) Digital objects as drivers towards convergence in data infrastructures. EUDAT B2Share.

25. DONA Foundation (2020) Digital Object Architecture. https://www.dona.net/digitalobject.

26. Lannom, L., Koureas, D. and Hardisty, A.R. (2020) FAIR data and services in biodiversity science and geoscience. Data Intell., 2, 122–130. doi: https://doi.org/10.1162/dint_a_00034.

27. Dallmeier-Tiessen, S., Khodiyar, V. and Murphy, F. (2017) Connecting data publication to the research workflow: a preliminary analysis. Int. J. Digit. Curat., 12. doi: https://doi.org/10.2218/ijdc.v12i1.533.

28. Zeng, M.L. (2004) Understanding Metadata. NISO, National Information Standards Organization, Baltimore, MD. http://www.metadataetc.org/metadatabasics/types.htm.

29. Sansone, S.-A., Rocca-Serra, P., Field, D. et al. (2012) Towards interoperable bioscience data. Nat. Genet., 44, 121–126. doi: https://doi.org/10.1038/ng.1054.

30. Triebel, D., Reichert, W., Bosert, S. et al. (2018) A generic workflow for effective sampling of environmental vouchers with UUID assignment and image processing. Database, 2018, bax096. doi: https://doi.org/10.1093/database/bax096.

31. Rambold, G., Yilmaz, P., Harjes, J. et al. (2019) Meta-omics data and collection objects (MOD-CO): a conceptual schema and data model for processing sample data in meta-omics research. Database, 2019, baz002. doi: https://doi.org/10.1093/database/baz002.

32. Deutsche Forschungsgemeinschaft – DFG (2015) DFG-Leitlinien zum Umgang mit Forschungsdaten. https://www.dfg.de/foerderung/antrag_gutachter_gremien/antragstellende/nachnutzung_forschungsdaten/index.html; Guidelines on the Handling of Research Data in Biodiversity Research. https://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/guidelines_biodiversity_research.pdf.

33. Michener, W.K. (2015) Ten simple rules for creating a good data management plan. PLoS Comput. Biol., 11, e1004525. doi: https://doi.org/10.1371/journal.pcbi.1004525.

34. Schindel, D.E. and Cook, J.A. (2018) The next generation of natural history collections. PLoS Biol., 16, e2006125. doi: https://doi.org/10.1371/journal.pbio.2006125.

35. Baker, M. (2012) Building better biobanks. Nature, 486, 141–146. doi: https://doi.org/10.1038/486141a.

36. Grobe, P., Gleisberg, M., Klasen, B. et al. (2019) Long-term reusability of biodiversity and collection data using a national federated data infrastructure. Biodivers. Inf. Sci. Standards, 3, e37414. doi: https://doi.org/10.3897/biss.3.37414.

37. Maass, W., Parsons, J., Purao, S. et al. (2018) Data-driven meets theory-driven research in the era of big data. Opportunities and challenges for information systems research. J. Assoc. Inf. Syst., 19, art. 1. doi: https://doi.org/10.17705/1jais.00526.

38. Shah, K.R. and Sinha, B.K. (2006) Nested experimental designs. In: Encyclopedia of Environmetrics. John Wiley & Sons, Ltd. doi: https://doi.org/10.1002/9780470057339.

39. Leach, P., Mealling, M. and Salz, R. (2005) A universally unique identifier (UUID) URN namespace. ‘Internet official protocol standards’ (STD 1), Standards Track. https://tools.ietf.org/html/rfc4122.

40. McMurry, J.A., Juty, N., Blomberg, N. et al. (2017) Identifiers for the 21st century: how to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol., 15, e2001414. doi: https://doi.org/10.1371/journal.pbio.2001414.

41. Balkić, Z., Šoštarić, D. and Horvat, G. (2012) GeoHash and UUID identifier for multi-agent systems. Lecture Notes Comput. Sci., 7327, 290–298. doi: https://doi.org/10.1007/978-3-642-30947-2_33.

42. Liu, J., Jiang, J., Song, S. et al. (2017) Multilocus DNA barcoding—species identification with multilocus data. Sci. Rep., 7, art. no. 16601. https://www.nature.com/articles/s41598-017-16920-2.

43. Fierer, N., Leff, J.W., Adams, B.J. et al. (2012) Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc. Natl. Acad. Sci. U.S.A., 109, 21390–21395. doi: https://doi.org/10.1073/pnas.1215210110.

44. Peršoh, D. (2015) Plant-associated fungal communities in the light of meta-omics. Fungal Diversity, 75, 1–25.

45. Boch, T., Reinwald, M., Postina, P. et al. (2015) Identification of invasive fungal diseases in immunocompromised patients by combining an Aspergillus specific PCR with a multifungal DNA-microarray from primary clinical samples. Mycoses, 58, 735–745. doi: https://doi.org/10.1111/myc.12424.

46. Bumgarner, R. (2013) DNA microarrays: types, applications and their future. Curr. Protoc. Mol. Biol., 22. doi: https://doi.org/10.1002/0471142727.mb2201s101.

47. Sturaro, L.L., Gonoi, T., Busso-Lopes, A.F. et al. (2018) Visible DNA microarray system as an adjunctive molecular test in identification of pathogenic fungi directly from a blood culture bottle. J. Clin. Microbiol., 56. doi: https://doi.org/10.1128/JCM.01908-17.

48. Skobelev, D.O., Zaytseva, T.M., Kozlov, A.D. et al. (2011) Laboratory information management systems in the work of the analytic laboratory. Meas. Tech., 53, 1182–1189.

49. Heyward, J.E. II (2009) Selection of a Scientific Data Management System (SDMS) Based on User Requirements. Indiana University-Purdue University Indianapolis, 5 pp. https://scholarworks.iupui.edu/handle/1805/2000.

50. Schloss, P.D., Westcott, S.L., Ryabin, T. et al. (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol., 75, 7537–7541. doi: https://doi.org/10.1128/AEM.01541-09.

51. Abarenkov, K., Tedersoo, L., Nilsson, R.H. et al. (2010) PlutoF—a web based workbench for ecological and taxonomic research, with an online implementation for fungal ITS sequences. Evol. Bioinform., 6, 189–196.

52. Nilsson, R.H., Larsson, K.-H., Taylor, A.F.S. et al. (2018) The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Res., 47(D1). doi: https://doi.org/10.1093/nar/gky1022.

53. Bolyen, E., Rideout, J.R., Dillon, M.R. et al. (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol., 37, 852–857. doi: https://doi.org/10.1038/s41587-019-0209-9.

54. Grüning, B.A., Rasche, E., Rebolledo-Jaramillo, B. et al. (2017) Jupyter and Galaxy: easing entry barriers into complex data analyses for biomedical researchers. PLoS Comput. Biol., 13, e1005425.

55. Khoonsari, P.E., Moreno, P. and Bergmann, S. (2019) Interoperable and scalable data analysis with microservices: applications in metabolomics. Bioinformatics, 35, 3752–3760. doi: https://doi.org/10.1093/bioinformatics/btz160.

56. Goble, C.A., Bhagat, J., Aleksejevs, S. et al. (2010) myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res., 38 (suppl. 2), W677–W682. doi: https://doi.org/10.1093/nar/gkq429.

57. Wolstencroft, K., Owen, S., Krebs, O. et al. (2015) SEEK: a systems biology data and model management platform. BMC Syst. Biol., 9, 33. doi: https://doi.org/10.1186/s12918-015-0174-y.

58. Roth, A., Jopp, R., Schäfer, R. and Kramer, G.W. (2006) Automated generation of AnIML documents by analytical instruments. JALA, 11, 247–253.

59. Schäfer, B.A., Poetz, D. and Kramer, G.W. (2004) Documenting laboratory workflows using the analytical information markup language. JALA, 9, 375–381. doi: https://doi.org/10.1016/j.jala.2004.10.003.

60. Hagedorn, G., Plank, A., Link, A. et al. (2016) DiversityDescriptions data model (ver. 3.0.15, 11 July 2016).

61. Hagedorn, G. (2007) Structuring descriptive data of organisms—requirement analysis and data models (Strukturierung organismischer Beschreibungsdaten—Anforderungsanalyse und Informationsmodelle).

62. Triebel, D., Hagedorn, G. and Rambold, G. (eds) (1999 onwards) Diversity Workbench—a virtual research environment for building and accessing biodiversity and environmental data. http://www.diversityworkbench.net.

63. Rambold, G., Yilmaz, P., Harjes, J. et al. (2018) MOD-CO schema—a conceptual schema for processing sample data in meta-omics research (version 1.0). https://mod-co.net/wiki/MOD-CO_Schema_Reference.

64. Sansone, S.-A., Rocca-Serra, P., Gonzalez-Beltran, A. et al. (2016) ISA model and serialization specifications 1.0. Zenodo.

65. Schäfer, B. (2018) Data Exchange in the Laboratory of the Future. A glimpse at AnIML and SiLA. Wiley Analytical Science, Hoboken, NJ. https://analyticalscience.wiley.com/do/10.1002/gitlab.17270/full/.

66. Celebi, I., Dragoset, R.A., Olsen, K.J. et al. (2010) Improving interoperability by incorporating UnitsML into markup languages. J. Res. Natl. Inst. Standards Technol., 115, 15–22. doi: https://doi.org/10.6028/jres.115.003.

67. Yilmaz, P., Kottmann, R. and Field, D. (2011) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat. Biotechnol., 29, 415–420.

68. Harjes, J., Triebel, D., Weibulat, T. et al. (2018) Managing and publishing fungal community barcoding data by use of the process-oriented schema MOD-CO and a GFBio data publication pipeline. Friedrich-Schiller-Universität Jena. doi: https://doi.org/10.22032/dbt.37811.

69. Holetschek, J., Dröge, G., Güntsch, A. et al. (2012) The ABCD of rich data access to natural history collections. Plant Biosyst., 146, 771–779.

70. Fegraus, E.H., Andelman, S., Jones, M.B. and Schildhauer, M. (2005) Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. ESA Bull., 86, 158–168.

71. Fichtmüller, D., Berendsohn, W., Dröge, G. et al. (2019) ABCD 3.0 ready to use. Biodivers. Inf. Sci. Standards, 3, e37214. doi: https://doi.org/10.3897/biss.3.37214.

72. Hagedorn, G., Thiele, K., Morris, R. and Heidorn, P.B. (2005) Structured descriptive data (SDD) W3C XML schema, version 1.0. Biodiversity Information Standards (TDWG). http://www.tdwg.org/standards/116.

73. Buttigieg, P.L., Morrison, N., Smith, B. et al. (2013) The environment ontology: contextualising biological and biomedical entities. J. Biomed. Semantics, 4, 43. doi: https://doi.org/10.1186/2041-1480-4-43.

74. Walls, R.L., Deck, J., Guralnick, R. et al. (2014) Semantics in support of biodiversity knowledge discovery: an introduction to the Biological Collections Ontology and related ontologies. PLoS One, 9, e89606. doi: https://doi.org/10.1371/journal.pone.0089606.

75. Miles, A. and Bechhofer, S. (2009) SKOS Simple Knowledge Organization System reference. W3C recommendation, World Wide Web Consortium, 18 August 2009. http://www.w3.org/TR/skos-reference/.

76. Baker, T., Bechhofer, S., Isaac, A. et al. (2013) Key choices in the design of Simple Knowledge Organization System (SKOS). J. Web Semant., 20, 35–49. doi: https://doi.org/10.1016/j.websem.2013.05.001.

77. Sun, H., De Roo, J., Twagirumukiza, M. et al. (2013) Validation rules for assessing and improving SKOS mapping quality. https://arxiv.org/abs/1310.4156.

78. Lenzerini, M. (2011) Ontology-based data management. CIKM, 2011, 5–6.

79. Calvanese, D., Kalayci, T.E., Montali, M., Santoso, A. and van der Aalst, W. (2018) Conceptual schema transformation in ontology-based data access. In: Faron Zucker, C., Ghidini, C., Napoli, A. and Toussaint, Y. (eds). Proc. of the 21st Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW). Lecture Notes in Computer Science, Springer, Basel, Switzerland.

80. Gray, A.J.G., Goble, C.A. and Jimenez, R. (2017) Bioschemas: from potato salad to protein annotation. In: International Semantic Web Conference (Posters, Demos & Industry Tracks). https://bioschemas.org.

81. Michel, F. and The Bioschemas Community (2018) Bioschemas & Schema.org: a lightweight semantic layer for life sciences websites. Biodivers. Inf. Sci. Standards, 2, e25836.

82. Güntsch, A., Hyam, R., Hagedorn, G. et al. (2017) Actionable, long-term stable and semantic web compatible identifiers for access to biological collection objects. Database, 2017, 1–9.

83. RTD (Directorate-General for Research and Innovation) (2018) Turning FAIR Into Reality. Final Report and Action Plan from the European Commission Expert Group on FAIR Data. Publication Office of the European Union, Luxembourg, Luxembourg. https://op.europa.eu/en/publication-detail/-/publication/7769a148-f1f6-11e8-9982-01aa75ed71a1/language-en (accessed December 22, 2019).

84. Lamprecht, A.-L., Garcia, L., Kuzak, M. et al. (2019) Towards FAIR principles for research software. Data Sci., 2019, 1–23. doi: https://doi.org/10.3233/DS-190026.

85. Palmer, C.L., Weber, N.M. and Cragin, M.H. (2011) The analytic potential of scientific data: understanding re-use value. Proc. Am. Soc. Inf. Sci. Tech., 48, 1–10. doi: https://doi.org/10.1002/meet.2011.14504801174.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.