Harmonization and standardization of data for a pan-European cohort on SARS- CoV-2 pandemic

Rinaldi, Eugenia; Stellmach, Caroline; Rajkumar, Naveen Moses Raj; Caroccia, Natascia; Dellacasa, Chiara; Giannella, Maddalena; Guedes, Mariana; Mirandola, Massimo; Scipione, Gabriella; Tacconelli, Evelina; Thun, Sylvia

doi:10.1038/s41746-022-00620-x

Download PDF

Article
Open access
Published: 14 June 2022

Harmonization and standardization of data for a pan-European cohort on SARS- CoV-2 pandemic

npj Digital Medicine volume 5, Article number: 75 (2022) Cite this article

3630 Accesses
10 Citations
7 Altmetric
Metrics details

Subjects

Abstract

The European project ORCHESTRA intends to create a new pan-European cohort to rapidly advance the knowledge of the effects and treatment of COVID-19. Establishing processes that facilitate the merging of heterogeneous clusters of retrospective data was an essential challenge. In addition, data from new ORCHESTRA prospective studies have to be compatible with earlier collected information to be efficiently combined. In this article, we describe how we utilized and contributed to existing standard terminologies to create consistent semantic representation of over 2500 COVID-19-related variables taken from three ORCHESTRA studies. The goal is to enable the semantic interoperability of data within the existing project studies and to create a common basis of standardized elements available for the design of new COVID-19 studies. We also identified 743 variables that were commonly used in two of the three prospective ORCHESTRA studies and can therefore be directly combined for analysis purposes. Additionally, we actively contributed to global interoperability by submitting new concept requests to the terminology Standards Development Organizations.

Methodological quality of COVID-19 clinical research

Article Open access 11 February 2021

OxCOVID19 Database, a multimodal data repository for better understanding the global impact of COVID-19

Article Open access 29 April 2021

ISARIC-COVID-19 dataset: A Prospective, Standardized, Global Dataset of Patients Hospitalized with COVID-19

Article Open access 30 July 2022

Introduction

The multinational initiative ORCHESTRA, funded by the European Commission, aims at establishing a new European-wide cohort. Based on existing and new large-scale clinical studies with different population cohorts, data from several centers and countries will be integrated to advance research on COVID-19. The work presented here shows how semantic interoperability was established within three studies belonging to ORCHESTRA. The purpose is to leverage the potential of knowledge contained in data which are presently scattered in different studies by merging them.

Interoperability can be broadly defined as “the ability of two or more systems or components to exchange information and to use the information that has been exchanged”¹. In order to make efficient use of data, it is recommended to follow the Findable, Accessible, Interoperable, Reusable (FAIR) principles. These principles facilitate knowledge discovery of scientific data and their associated algorithms and workflows by humans and machines². The use of interoperability standards that harmonize content and format of data, enhance the FAIRness of data, hence increasing their value. Semantic interoperability refers to the use of a common language to define concepts. This can be achieved by employing international terminologies and classifications that unambiguously define the meaning of concepts³.

The aim of our effort within ORCHESTRA was therefore to distinctly define each and every medical term, laboratory value, and other measurements and concepts used so that they can be uniquely identified and used by the partners to answer research questions⁴. Based on our experience with the standardization of the COVID-19–related German Consensus Dataset (GECCO)⁵ and similar to other FAIRification initiatives^6,7, we pursued the semantic representation of the concepts by mapping them to standard terminology codes provided by organizations such as SNOMED International^8,9, Logical Observation Identifiers Names and Codes (LOINC)^10,11, Anatomical Therapeutic Chemical (ATC)¹² and International Statistical Classification of Diseases and Related Health Problems (ICD)^13,14,15.

Our endeavor to map over 2500 COVID-19-related concepts to standard codes enabled us to:

Create a pool of standardized variables that can easily be merged with the same elements of new ORCHESTRA studies or with elements of external studies using the same terminologies and thus enhance their individual value.
Identify common elements (core data set)¹⁶ between two ORCHESTRA studies which also included most elements of the third study; these common elements can now easily be merged without the need for further transformation.

The process of harmonization and standardization of data is demanding, but very crucial¹⁷ to share data especially within a large scientific community. It enables an efficient processing of information coming from many different sources. If health data are structured according to international standards, data are much easier to merge and analyze. Also, the efforts needed for data cleaning and pre-processing are reduced. An extensive employment of these standard terminologies across different projects would generally expedite data analysis while also providing research initiatives with a larger base of data.

Results

Harmonized data

The harmonization and standardization efforts led to the creation of two data dictionaries, one each for the Long-Term Sequelae (LTS) and Fragile Population (FP) study. The specific data elements defined for the Genomics study were included in the LTS and FP studies.

A dedicated working group managed the identification and linkage of patients and their related samples throughout ORCHESTRA.

Figure 1 shows excerpts from the two data dictionaries of the above-mentioned studies.

**Fig. 1: Excerpts of both the Long-Term Sequelae and Fragile population data dictionaries and selected Genomics variables.**

Table 1 shows the data elements that were standardized for the LTS study broken down into categories in form of REDCap® instruments. The LTS study’s eCRF comprises of 1118 data elements representing questions, descriptive text, and calculations divided into 30 different instruments. The greatest number of elements are included in the ‘Treatment’, ‘Biochemistry’ and ‘Symptoms’ categories with 179, 126, and 97 elements respectively.

Table 1 Variables defined and used in the LTS study.

Full size table

Table 2 shows the data elements contained in the data dictionary of the FP study by category. A total of 1440 data elements consisting of questions, calculations and descriptive fields were defined by the study team and harmonized thereafter. The greatest number of questions was specified for the ‘Adverse Events Related to Anti-COVID-19 Vaccine’ instrument with a total of 210 elements, followed by 188 elements for ‘Baseline Information’ and 179 elements in the ‘COVID-19 Treatment’ instruments. Subject matter experts have created a ‘socioeconomic questionnaire’ for use in the LTS and FP studies. The adult version of the questionnaire comprises of 36 elements whereas the child version is more compact and contains only 15.

Table 2 Overview of the variables defined and used in the Fragile Population study.

Full size table

Core data set

The assignment of unique names using standard terminology codes to the variables in both the LTS and the FP studies (including genomics data) enabled us to identify 743 common data elements. Since the standard codes are used as variable names, all 743 common elements between the two studies have the same variable names.

Similar to the data dictionary, the core data set consists of a spreadsheet listing the core data elements (CDEs) and their metadata. Additionally, besides the informational category of each data item, it records the clinical study that it was defined for and whether or not the element belongs to a predefined questionnaire. This information is very useful for example when submitting entire questionnaires to SDOs. The core data set, just like the data dictionaries, are constantly updated with new codes once they are available.

The blue box in Fig. 2 highlights an instance of a variable for which no international terminology standard code existed when the mapping started, but for which a LOINC code was released later, following a term request to the SDO. The new LOINC code was subsequently added to the Field Annotation.

**Fig. 2: Excerpt of the Core Data Set.**

Table 3 shows how many common elements per informational category were identified between the LTS and FP studies that have been analyzed.

Table 3 Overview of common variables used as part of the metadata for the Long-Term Sequelae study’s and the Fragile Population study’s electronic case report form respectively.

Full size table

The CDEs that make up the ORCHESTRA core data set comprised of 707 variables that contain stand-alone questions and their respective value sets (answers) and 36 variables that belong to questionnaires and their value sets (Table 4). Some questions are closed-ended and contain a fixed list of permissible answers, others are open-ended and allow open answers in form of text strings or numbers. Within the stand-alone questions, 441 variables are open-ended, another 5 variables replace value sets with formulas used to calculate scores and 6 variables contain descriptive text that does not have a corresponding value set. Descriptive text CDEs are an artifact specific to REDCap® and are means to provide instructions for the person entering the patient data. In the subset of variables that are part of questionnaires, 8 variables are open-ended, and 27 variables have value sets comprising fixed parameters.

Table 4 Overview of all common variables identified as core data elements for the ORCHESTRA core data set.

Full size table

Figure 3 details which international codes were assigned to represent the CDEs. The total number of codes is lower than the total number of CDEs because for some of the eCRF questions no appropriate codes could be assigned, either because the concepts contained in the variable were too complex or because no standard codes were available as of yet.

Submissions

Submission to the most pertinent SDOs was evaluated for the data elements for which no corresponding international code could be found. Out of a total of 2558 variables and related answer lists, 125 concepts (57 stand-alone concepts and 68 items of Assessment scales/Questionnaires) were submitted to the SDOs SNOMED CT, LOINC, and NCI.

Table 5 shows the new concepts that were submitted to SNOMED CT and for which feedbacks are awaited.

Table 5 Healthcare concepts used in ORCHESTRA studies that were submitted to SNOMED CT for new code assignment.

Full size table

The process of submitting concepts to LOINC was started for all eleven questionnaires used within the LTS and FP studies and for 30 serology concepts defined in the Genomics study (Table 6). Out of these, submission has been completed and new codes were created for six questionnaires whereas for one we are still waiting for codes. There is an ongoing effort to obtain authors’ permissions to finish the submission of the four remaining questionnaires. In addition, LOINC codes were also received for the submitted serology variables.

Table 6 Questionnaires and serology concepts used in ORCHESTRA studies that were submitted to LOINC for new code creation.

Full size table

Terms provided by the NCI Thesaurus were used to code data elements that contained the highly specific genetic information defined by the Genomic study and included in FP and LTS studies as shown in Tables 1 and 2. Table 7 shows the concepts that were submitted and received coding by NCI.

Table 7 Details of the 15 concepts that were submitted to NCI to request creation of codes and the newly created codes.

Full size table

In total, 92 new international standard codes (comprising 42 single concepts and 50 questionnaires/assessment scales items) relevant to COVID-19 research and beyond have been created as a result of our efforts.

Figure 4 summarizes the results described in the paragraphs above.

**Fig. 4: Overview of harmonized data and submissions to standard developing organizations.**

Discussion

The effects of the COVID-19 pandemic have highlighted the need to gather new scientific insights into the pathology of the disease including its progression, effects on different population groups, vaccination monitoring, and long-term impact. Several SARS-CoV-2-related studies^28,29,30 collected data and added to the growing body of knowledge about the disease^31,32,33. ORCHESTRA has a multidisciplinary approach to estimate the clinical and social aspects of the burden of COVID-19 across different health care systems while improving comparability of data. The project includes 26 partners from 14 different countries in Europe in addition to 3 partners from non-European regions such as India, South America, and Africa. However, data collected often vary in structure and format, moreover are stored in different databases that are not interoperable, thus reducing the potential to answer important research questions.

Starting with three ORCHESTRA studies, we have addressed this problem by employing healthcare-specific interoperability standards that unambiguously identify variables and make health data more transferable to and interpretable by different IT systems and applications³⁴. As a result, we obtained all possible variables mapped to their respective domain-relevant standard codes which were also incorporated into variable IDs. This pool of standardized variables can now be used to facilitate merging of data of the examined studies with:

New ORCHESTRA studies that include any of those elements
Ongoing ORCHESTRA studies that, after being mapped to standard terminologies, are found to have matching variables
Any study that makes use of the same variables identified by the same standard terminologies

Additionally, introducing a standardized ID allowed us to identify 743 common elements between two of the three studies under consideration. Common data elements mainly fall into the ‘Treatment’, ‘Biochemistry’ and ‘Symptoms’ categories. In addition, most elements of the Genomics study were included in the pool of common data elements as well. In fact, the Genomics study is much smaller in terms of number of variables than the other two. But by focusing on sample-related genomics information, it enables a much-pursued deeper investigation of the SARS-CoV-2 virus and its variants.

The result is consistent with the notion that both studies would explore similar treatments that patients have received and would also request information about the same types of laboratory values and blood panels. Partners were very committed and motivated to increase possibilities of data sharing and analysis across studies and therefore made a real effort to adapt data definitions in each study to converge as much as possible on a common data set. Emphasis was also put on harmonizing the collection of data concerning symptoms experienced by patients.

Data collected by the LTS and FP studies for these 743 common data elements could be immediately merged and used for analysis without requiring any further transformation. The first analysis results for ORCHESTRA are expected in the first half of 2022. The harmonization work done so far will be utilized for other ongoing ORCHESTRA studies: each study’s data elements will be progressively mapped to standard terminologies, and new common elements will be identified and included in the COVID-19 core data set (CDS).

The use of CDEs to harmonize data originating from different studies has already been described^35,36 as a method to merge data in different medical disciplines. In their work, Meeuws et al.³⁷ show how the presence of CDEs on Traumatic Brain Injury has enabled a high level of harmonization in data from three different large multi-centric studies. The CDEs described usually relate to those provided by the American National Institute of Neurological Disorders and Stroke (NINDS) for a range of diseases (https://www.commondataelements.ninds.nih.gov/), but their mapping to standard terminologies is not always available. In our approach we are building the COVID-19 CDS along with the harmonization and mapping process. By doing so, we will on the one hand expand the pool of COVID-19-related standardized variables, and on the other hand we will continue to identify common data elements between studies and update our CDS.

In a similar approach, the development of a data model that includes mapping to SNOMED CT, LOINC, ICD-10, has been described for the Utah Newborn Screening (NBS) Program to improve the data exchange between healthcare providers and NBS programs. This leads to a reduced dependency on proprietary laboratory information systems³⁸.

A major challenge when harmonizing epidemiologic data is to fully understand the definition of the variables that need to be combined. The meaning of variable names and descriptions is not always understood in the same way from one person to another. The use of specific terminology standards sometimes implies a higher level of detail in the definition of the concept. For this reason, changes might be required in some data definitions to include more details. For example the variable “Hemoglobin” is represented in LOINC by different codes depending if the value is expressed in “mmol/L” or in “g/dL”; it is therefore necessary to specify the unit in the variable definition. This further specification of the information to be collected avoids different interpretations of the concept, implicitly supports the homogenization of the information, and ultimately improves the quality of the data. However, it is difficult to make changes to variables when data collection has already started. If data harmonization is not performed upfront, it needs to be done after data collection with a much bigger effort. As Rolland et al. observe “even with substantial documentation, it can be challenging to understand data that someone else has collected, without engaging in time-consuming conversations with the original data collectors”³⁹. Additionally, the quality of data, integrity, completeness as well as its ability to be traced can be compromised in this process.

Generally, to achieve a wider implementation of interoperability standards, political barriers in sharing data must be removed, goals should be aligned, and extensive collaboration between research and healthcare organizations is needed⁴⁰. The process to receive permission for reuse of data should also be well defined and transparent⁴¹. The pandemic could be an opportunity to broaden the use of CDEs beyond specific implementations by engaging the broader research and healthcare communities. Strong public advocacy is needed for the broad use of standards in healthcare research and in all data collections; this would speed up the extraction of knowledge from data by facilitating the exchange and merge of the information. In ORCHESTRA, an important result was achieved in identifying a first COVID-19 CDS that could be useful also for other external COVID-19 studies. However, the harmonization work within ORCHESTRA continues as other data collection protocols are being developed.

While it is important to make use of international healthcare standards, it is also desirable to actively contribute to the standardization efforts by submitting new concepts for coding when they are not already included in the terminologies. The acceptance by SDOs of the submitted concepts that were needed for use in the ORCHESTRA studies is a great achievement that resonates outside of this particular use case since it constitutes an active contribution to standardization. Since the COVID-19 pandemic, there have been new terms emerging every day. From specific laboratory test names to terms related to social distancing, there are many more new expressions, which were introduced during this period. In order to facilitate research on emerging infectious diseases, it is of particular importance to enrich the international standard terminologies with the new terms, by actively submitting them to the SDOs. Such new concepts and codes become thus available internationally and can be used globally to identify uniquely the same concepts also in other projects. Moreover, the two socio-economic questionnaires that were developed by experts within ORCHESTRA and have been submitted to LOINC could become reusable and available worldwide for other COVID-19-related studies and thus potentially provide comparable data on the socioeconomic effects of the infection. On the other hand, when new more specific codes become available, it is likely that some other codes already in use become obsolete or too generic. For this reason, it is important to reference also the new codes in the metadata.

A thorough evaluation of the use of LOINC in pathology laboratories showed some of its limitations and challenges⁴². Among other results, mapping inconsistencies especially in the properties of methods used were discovered between laboratories. The lack of explicit hierarchy in LOINC prevents the easy identification of related terms thus making the mapping process for granular differences in tests more challenging. We tried to mitigate the problem by working together with the experts and asking to confirm the methods used in dubious cases.

During the mapping process, we realized that molecular genetic diagnostics are not yet properly represented in the international terminologies considered, probably due to the new and fast-evolving methods and discoveries in this field. The knowledge of genes and genomes is indeed one of the rapidly growing areas of biomedical research. The high number of genetic tests with diverse attributes, involving over 20.000 genes, is posing new challenges to keeping the terminology systems up to date⁴³.

However, there is a continuous process allowing for improvement proposals to LOINC which are then implemented, such as the identification of LOINC’s implicit hierarchies⁴⁴. Additionally, an implementation guide for structured reporting of genetic tests was published by the HL7 Clinical Genomics Working Group containing guidance on how LOINC codes are to be used, and details on variable linkage to specific lists of permissible answers⁴⁵.

For information related to genetics, the National Cancer Institute Thesaurus turned out helpful in covering concepts that were not included in LOINC nor in SNOMED CT, probably due to the primary role of genomics in current cancer research.

The work described here shows that the combined use of standard terminologies is the best solution for embracing the different categories of information collected.

In summary, our work aimed at enhancing semantic interoperability within the international research community in the field of COVID-19 by making use of international standard terminologies and classifications. Data collected in different studies using the same CDEs can be merged directly without need for further transformation, thus accelerating research results.

Our pool of standardized variables can also be used beyond the project’s borders by other research initiatives.

Many aspects of the SARS-CoV-2 infection were still widely unknown at the beginning of our work and a language to describe them had not been fully built. Thus, we contributed to COVID-19 research by submitting new concepts (over 100 concepts in total) for coding to SDOs so that they could become available for research worldwide. Our approach expedites research collaborations and processing of results.

Methods

The work presented here does not involve human participants or data. It is based on the analysis of dataset definitions from ongoing or new studies coordinated by the ORCHESTRA partners, for which individual ethical approvals were obtained. In this study we are only investigating metadata, and no Human Subjects Research is involved.

Harmonization of studies

The process of harmonization started with three clinical studies in ORCHESTRA. The first study focuses on investigating long-term Sequelae after COVID-19 infection, hereafter referred to as the “Long-Term Sequelae” (LTS) study. The second prospective study focuses on patients considered fragile, hereafter referred to as the “Fragile Population” (FP) study. FP is an already ongoing study on post-vaccination monitoring and is also participating in the LTS study with fragile patients with history of prior COVID-19 diagnosis. For the FP study, we worked on harmonization of retrospective and prospective data concurrently. The third ORCHESTRA study, hereafter referred to as the “Genomics” study focuses on biobanking of patient samples, genomics, and viral–host interaction analysis.

Figure 5 depicts the workflow followed to standardize and harmonize the three prospective studies in ORCHESTRA. The clinical study teams for all three studies were the starting point for the activities. They defined the clinical concepts of their study protocols within a data dictionary that is used in the REDCap® electronic data capture (EDC) system to record patient data; subsequently, they submitted their data dictionaries to our standardization team.

**Fig. 5: Standardization and harmonization workflow.**

In the first instance all variables received from the LTS study were new and not standardized. Subsequently, we received updated variables in several iterations. This included changes to previously defined elements, removal of data elements, and addition of new data elements that had to be considered for the standardization and harmonization process. Our first activity was to review the received data dictionary and to assess whether the variables could be mapped to international standard terminology codes.

As part of that activity, the study data elements defined by clinical partners were compared with GECCO to identify information that had already been standardized i.e. associated with international standard terminologies and classifications. More information can be found in the “GECCO Data Set” section of the Supplementary Material.

If the variable was not already included in GECCO and therefore needed to be newly standardized, we chose the most appropriate standard terminology (Fig. 6) to be used for its representation enabling semantic interoperability.

For example, for general clinical concepts we used SNOMED CT because it is the world’s most comprehensive clinical healthcare terminology⁴⁶. Codes were searched using the SNOMED CT Browser’s International Edition⁴⁷ and assigned to data elements wherever appropriate. Laboratory values, vital signs, and questionnaires were mapped to LOINC codes that were selected using the LOINC search browser⁴⁸. LOINC played a very important role in defining the SARS-CoV-2 specific laboratory tests referenced in the ORCHESTRA studies’ electronic Case Report Forms (eCRFs);^49,50 it was chosen because it is a widely used terminology standard for health measurements, observations, and clinical documents⁵¹. For variables that aimed at collecting information on medication use, ATC codes were selected using the WHO’s ATC/DDD Index 2021⁵². Data that provided genetic, epigenetic, and sequencing information were generally assigned codes chosen through the use of the NCI Term Browser⁵³. The WHO’s search browser provided means of finding appropriate ICD-10 codes to assign to data elements detailing diseases and disorders. To code convalescent plasma treatment, the ISBT 128^54,55 standard’s lookup tools were used to assign codes to represent the respective data elements. Further information on international standard terminologies can be found in the Supplementary Material.

We firstly analyzed all 851 variables used in the LTS study and mapped them to the appropriate standard concepts whenever possible. The process was performed for both components of each variable: the question and its respective answers /value sets (if value sets were defined).

Whenever an appropriate code was available in a standard terminology to represent the clinical concept contained in a question or in the answer(s), we integrated it directly into the variable ID or answer ID respectively. The coding of variables along with any outstanding issues and questions were discussed in review meetings with the working groups. When a standard code for a clinical concept within the data dictionary did not exist, we started a submission process at the SDOs to request the creation of new specific codes for the concepts.

Due to the fact that submissions for new codes can take up to several months, we could not always integrate the codes in the variable names right away. In this case, we renamed the variable ID to a name that could represent the concept as closely as possible so that it could be promptly returned to the clinical teams and uploaded into REDCap® for immediate use. When a requested code was created and made available by the SDOs, it was integrated as field annotation in the REDCap® data dictionary.

A similar process was performed for the FP study for which we analyzed over 1250 variables and their respective answers and for the 267 sample-related variables of the Genomics study.

Coding within data capture tool

A dedicated instance of the EDC system REDCap® was provided by the Italian University Consortium CINECA (https://redcap-dev.orchestra.cineca.it) to support the definition and collection of the variables as well as the process of standardization and harmonization through the use of the data dictionary. The data dictionary is a specifically formatted spreadsheet with a.csv extension containing the metadata used to construct electronic data collection instruments through its upload in REDCap®. It is divided in several columns, each containing a different type of metadata. Two data dictionaries, for the three considered studies, were built to hold all the clinical study variables, namely questions and value sets to be collected through patient interviews at study visits. Clinical subject matter experts assigned different categories to the data by grouping them into ‘instruments’ according to the informational domains. These categories, defined as well in the data dictionaries, ranged from variables pertaining to patient admission, demographics, functional and physical exam results, clinical outcomes, symptoms, vaccination, imaging, samples, and socioeconomic situation. REDCap® accounts were provided both to the partners involved in the standardization process and to the scientific team defining the data elements to collect. This made it possible to interactively update the data dictionary while working on setting up the eCRFs for data collection. Supplementary Fig. 1 shows an example of the dataset Codebook, the human-readable version of the data dictionary. More information on REDCap® can be found in the dedicated section of the Supplementary Material.

Standardization was performed on the variables by identifying the corresponding standard codes and entering them in the data dictionary. In particular, we inserted them in the mandatory field for the variable name preceded by a prefix to identify the terminology used (e.g. sct for SNOMED CT or ln for LOINC) for every data element created in the eCRF for which a standard code was available. Analogous to the variable ID that identifies an element, every answer choice is identified with an ID code as well.

Where appropriate, concepts contained in the answers were also standardized and the respective international standard code was integrated into the answer ID.

Examples of this procedure are shown in Fig. 7. where the concepts “Graft Type” and “Lactate dehydrogenase” were respectively mapped to the SNOMED CT code 103403008|Type of graft (qualifier value)| and to the LOINC code 2532-0 Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma and then integrated into the Variable IDs. In Fig. 7a, it is possible to see how standard codes were also associated to the answer list.

**Fig. 7: Assignment of standard terminology codes to variable and answer IDs.**

REDCap® also provides the option to add an annotation to each element by use of the Field Annotation column that is part of the data dictionary. The Field Annotation was used to list other possible standard codes when available.

For many of the laboratory values, information related to a specific test was collected across several variables, covering details such as whether a test result was available, the test result itself, the date of the test, and unit of measurement. In order to ensure the link between the five details about one clinical test would remain obvious, we re-used the international code assigned to the variable ID of the actual test result and added suffixes to the variable IDs representing the additional informational domains. For example, the variable ID “ln_2157_6” contains the LOINC (abbreviated with ‘ln’) code 2157-6 for ‘Creatine kinase [Enzymatic activity/volume] in Serum or Plasma’. The additional four connected variables, ‘ln_2157_6_avail’ for the availability of the test information, ‘ln_2157_6_date’ for the date of the test, ‘ln_2157_6_unit’ to clarify the unit of measurement, all contain suffixes in addition to the standard code to help maintain the connection to the central variable, the test result in itself (Fig. 8a).

**Fig. 8: Incorporation of suffixes into the standardized variable names of data used in the Long-Term Sequelae and Fragile Population studies.**

When it came to variables that were mapped to ATC codes to describe a medication regimen, we decided to incorporate suffixes in the variable IDs to enable us to reflect clinical concepts like start and end date or dose and route of treatment. This was done by adding the suffixes “_start”, “_end”, “_dose” or “_route” to the variable ID (Fig. 8b).

Common data elements

Harmonization of data collection across clinical studies was based on the identification of CDEs^56,57.

CDEs can be defined as set “of a precisely defined question paired with a specified set of responses to the question that is common to multiple datasets or used across different studies”⁵⁸. We followed the American National Institutes of Health’s (NIH) approach in classifying data elements⁵⁹ and put an emphasis on only selecting data elements with high priority to define the core data set. Elements common to both the Long-Term Sequelae and Fragile Population studies were considered ‘High Priority’. Elements that were included in only one of the study data sets were not considered highly relevant and remained in the study-specific data sets. The core and the study-specific data sets were published on the public platform Art-Décor®⁶⁰ which is a free tool that facilitates the modeling of data sets with their bindings to terminologies.

An important part of the harmonization was to clarify the meaning of variables, ensure an unambiguous wording, and promote the use of the same variables across both studies. Meticulous efforts were undertaken by the partners to try to format them exactly the same way in terms of content, phrasing, and answer value sets. In this way, the variables could be assigned the same codes for the different partners, thus also facilitating the merging of data for research purposes.

To achieve that goal, we suggested adjustments to clinical partners when we identified variables that contained the same clinical concept in the question across both studies but where each study had slightly different answers in the value set of the question. The final decision to follow our suggestion rested with the working group who weighted the option against their clinical objectives.

One example of such a proposed change is the variable concerning the type of COVID-19 infection. Within the FP study, three answers were defined to the question: ‘primary infection‘, ‘re-infection ‘and ‘breakthrough-infection post-COVID-19 vaccination‘. In contrast, the LTS study only offered two answers as part of the variable value set: ‘primary infection‘ and ‘re-infection or breakthrough infection‘. As the three option value set was most precise, we suggested the LTS study adapted their variable to match the FP study’s. The proposal was accepted and the variable was added to the core data set for ORCHESTRA (Fig. 9a).

**Fig. 9: Harmonization of value sets for two common variables.**

In another case, we proposed to partners from the Fragile Population study to adapt their pregnancy status variable’s value set to match the one defined in the LTS study. This would change the value set from yes or no answers to answers that also incorporate information about the current trimester if a pregnancy was confirmed present (Fig. 9b).

Some concepts were identical between the two studies, but had different data collection time points. In this case, when naming the variables, a suffix was added to the standard unique code to identify the time point.

Genomics data variables were uniquely defined by the Genomics study team and were standardized following the same principles as the other two studies.

Software

The original data set definitions received from clinical partners were analyzed in LibreOffice Calc Version 7.1.1.2. The core data set was developed in Microsoft Excel 2016. All graphs shown in this publication were created in Microsoft PowerPoint 2016 while tables were built in Microsoft Excel 2016. The version of REDCap® used for the work described was 12.0.1. The standard terminologies used are: SNOMED CT International Edition July 2021, LOINC 2.71 ICD-10. ATC/DDD Index 2021, NCIt Version 21.11.e. To look up codes we used International SNOMED CT Browser Version 2021-7-31, SearchLOINC Version 2.20, ICD-10 online search application Version 2019, NCI Term Browser Version 2.19.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The metadata definitions for the three data sets are publicly available on the standard-enabling platform ART-DÉCOR® (https://art-decor.org/art-decor/decor-project--orch-). The data sets in ART DÉCOR® receive periodic updates to reflect possible changes in the CRFs, consequently, they might differ from the ones we refer to in the present work. The Data Dictionaries that were created are stored in a repository online (https://cloud.orchestra-cohort.eu/s/HeycD4xY7TACxLX. Access to the files can be granted upon reasonable request.

References

IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. IEEE Std 610 1–217 (1991) https://doi.org/10.1109/IEEESTD.1991.106963.
Solle, D. Be FAIR to your data. Anal. Bioanal. Chem. 412, 3961–3965 (2020).
Article CAS PubMed PubMed Central Google Scholar
Dugas, M. et al. Portal of medical data models: information infrastructure for medical research and healthcare. Database J. Biol. Databases Curation 2016, bav121 (2016).
Google Scholar
Kim, H. H., Park, Y. R., Lee, S. & Kim, J. H. Composite CDE: modeling composite relationships between common data elements for representing complex clinical data. BMC Med. Inform. Decis. Mak. 20, 147 (2020).
Article PubMed PubMed Central Google Scholar
Sass, J. et al. The German Corona Consensus Dataset (GECCO): a standardized dataset for COVID-19 research in university medicine and beyond. BMC Med. Inf. Decis Mak 20, (2020).
Kersloot, M. G. et al. De-novo FAIRification via an Electronic Data Capture system by automated transformation of filled electronic Case Report Forms into machine-readable data. J. Biomed. Inform. 122, 103897 (2021).
Article PubMed Google Scholar
Hwang, J. E., Park, H.-A. & Shin, S.-Y. Mapping the Korean National health checkup questionnaire to standard terminologies. Healthc. Inform. Res. 27, 287–297 (2021).
Article PubMed PubMed Central Google Scholar
El-Sappagh, S., Franda, F., Ali, F. & Kwak, K.-S. SNOMED CT standard ontology based on the ontology for general medical science. BMC Med. Inform. Decis. Mak. 18, 76 (2018).
Article PubMed PubMed Central Google Scholar
Højen, A. R., Sundvall, E. & Gøeg, K. R. Methods and applications for visualization of SNOMED CT concept sets. Appl. Clin. Inform. 5, 127–152 (2014).
Article PubMed PubMed Central Google Scholar
McDonald, C. J. et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin. Chem. 49, (2003).
Fiebeck, J. et al. Implementing LOINC - Current Status and Ongoing Work at a Medical University. Stud. Health Technol. Inform. 267, 59–65 (2019).
PubMed Google Scholar
Anatomical Therapeutic Chemical (ATC) Classification Index with Defined Daily Doses (DDDs): List A: Sorted According to ATC Code IncludingDefined Daily Doses (DDDs) for Plain Substances: List B: Alphabetically Sorted According to Nonproprietary Drug Name (only ATC 5th Levels are Included). (WHO Collaborating Centre for Drug Statistics Methodology, 1997).
Fung, K. W., Xu, J. & Bodenreider, O. The new International Classification of Diseases 11th edition: a comparative analysis with ICD-10 and ICD-10-CM. J. Am. Med. Inform. Assoc. JAMIA 27, 738–746 (2020).
Article PubMed Google Scholar
Park, H. et al. An information retrieval approach to ICD-10 classification. Stud. Health Technol. Inform. 264, 1564–1565 (2019).
PubMed Google Scholar
Mainor, A. J., Morden, N. E., Smith, J., Tomlin, S. & Skinner, J. ICD-10 coding will challenge researchers: caution and collaboration may reduce measurement error and improve comparability over time. Med. Care 57, e42–e46 (2019).
Article PubMed PubMed Central Google Scholar
Huser, V. & Amos, L. Analyzing real-world use of research common data elements. Amia. Annu. Symp. Proc. 2018, 602–608 (2018).
PubMed PubMed Central Google Scholar
Raisaro, J. L. et al. SCOR: A secure international informatics infrastructure to investigate COVID-19. J. Am. Med. Inform. Assoc. JAMIA 27, 1721–1726 (2020).
Article CAS PubMed Google Scholar
Klok, F. A. et al. The Post-COVID-19 Functional Status scale: a tool to measure functional status over time after COVID-19. Eur. Respir. J. 56, 2001494 (2020).
Article CAS PubMed PubMed Central Google Scholar
Mahler, D. A. & Wells, C. K. Evaluation of clinical methods for rating dyspnea. Chest 93, 580–586 (1988).
Article CAS PubMed Google Scholar
Casanova, C. et al. Differential effect of modified medical research council dyspnea, COPD assessment test, and clinical COPD questionnaire for symptoms evaluation within the new GOLD staging and mortality in COPD. CHEST 148, 159–168 (2015).
Article PubMed Google Scholar
Tsoris, A. & Marlar, C. A. Use Of The Child Pugh Score In Liver Disease. in StatPearls (StatPearls Publishing, 2022).
Marshall, J. C. et al. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect. Dis. 20, e192–e197 (2020).
Article CAS Google Scholar
Ware, J. J., Kosinski, M. & Keller, S. D. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med. Care 34, 220–233 (1996).
Article PubMed Google Scholar
Beck, J. G. et al. The impact of event scale–revised: psychometric properties in a sample of motor vehicle accident survivors. J. Anxiety Disord. 22, 187–198 (2008).
Article PubMed Google Scholar
Rotstein, S., Hudaib, A.-R., Facey, A. & Kulkarni, J. Psychiatrist burnout: a meta-analysis of Maslach Burnout Inventory means. Australas. Psychiatry Bull. R. Aust. N. Z. Coll. Psychiatr. 27, 249–254 (2019).
Google Scholar
Petrowski, K., Albani, C., Zenger, M., Brähler, E. & Schmalbach, B. Revised short screening version of the profile of mood states (POMS) from the German general population. Front. Psychol. 12, 631668 (2021).
Article PubMed PubMed Central Google Scholar
Shallcross, A., Lu, N. Y. & Hays, R. D. Evaluation of the psychometric properties of the five facet of mindfulness questionnaire. J. Psychopathol. Behav. Assess. 42, 271–280 (2020).
Article PubMed PubMed Central Google Scholar
Docherty, A. B. et al. Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study. BMJ (2020) https://doi.org/10.1136/bmj.m1985.
Jakob, C. E. M. et al. First results of the ‘Lean European Open Survey on SARS-CoV-2-Infected Patients (LEOSS)’. Infection 49, 63–73 (2021).
Article CAS PubMed Google Scholar
Kurth, F. et al. Studying the pathophysiology of coronavirus disease 2019: a protocol for the Berlin prospective COVID-19 patient cohort (Pa-COVID-19). Infection 48, 619–626 (2020).
Article CAS PubMed PubMed Central Google Scholar
Riley, S. et al. Resurgence of SARS-CoV-2: Detection by community viral surveillance. Science 372, 990–995 (2021).
Article CAS PubMed PubMed Central Google Scholar
Liu, Y.-C., Kuo, R.-L. & Shih, S.-R. COVID-19: The first documented coronavirus pandemic in history. Biomed. J. 43, 328–333 (2020).
Article PubMed PubMed Central Google Scholar
Chams, N. et al. COVID-19: a multidisciplinary review. Front. Public Health 8, 383 (2020).
Article PubMed PubMed Central Google Scholar
LOINC. Mission, Vision, and Principles for Open Terminology Development. https://loinc.org/principles/ (2022).
Edlow, B. L. et al. Common data elements for COVID-19 neuroimaging: a GCS-NeuroCOVID proposal. Neurocrit. Care 34, 365–370 (2021).
Article CAS PubMed PubMed Central Google Scholar
Le Gal, G. et al. Development and implementation of common data elements for venous thromboembolism research: on behalf of SSC Subcommittee on official Communication from the SSC of the ISTH. J. Thromb. Haemost. 19, 297–303 (2021).
Article PubMed Google Scholar
Meeuws, S. et al. Common data elements: critical assessment of harmonization between current multi-center traumatic brain injury studies. J. Neurotrauma 37, 1283–1290 (2020).
Article PubMed PubMed Central Google Scholar
Jones, D. et al. Towards a newborn screening common data model: the Utah newborn screening data model. Int. J. Neonatal Screen 7, 70 (2021).
Article PubMed PubMed Central Google Scholar
Rolland, B. et al. Toward rigorous data harmonization in cancer epidemiology. Res.: One Approach Am. J. Epidemiol. 182, 1033–1038 (2015).
Google Scholar
Kush, R. D. et al. FAIR data sharing: The roles of common data elements and harmonization. J. Biomed. Inform. 107, 103421 (2020).
Article CAS PubMed Google Scholar
Haendel, M. A. et al. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. J. Am. Med. Inform. Assoc. JAMIA 28, 427–443 (2021).
Article PubMed Google Scholar
Stram, M. et al. A survey of LOINC Code Selection Practices Among Participants of the College of American Pathologists Coagulation (CGL) and cardiac markers (CRT) Proficiency Testing Programs. Arch. Pathol. Lab. Med. 144, 586–596 (2020).
Article CAS PubMed Google Scholar
Drenkhahn, C. & Ingenerf, J. The LOINC content model and its limitations of usage in the laboratory domain. Stud. Health Technol. Inform. 270, 437–442 (2020).
PubMed Google Scholar
Drenkhahn, C., Duhm-Harbeck, P. & Ingenerf, J. Aggregation and visualization of laboratory data by using ontological tools based on LOINC and SNOMED CT. Stud. Health Technol. Inform. 264, 108–112 (2019).
PubMed Google Scholar
Deckard, J., McDonald, C. J. & Vreeman, D. J. Supporting interoperability of genetic data with LOINC. J. Am. Med. Inform. Assoc. JAMIA 22, 621–627 (2015).
Article PubMed Google Scholar
Millar, J. The Need for a Global Language - SNOMED CT Introduction. Stud. Health Technol. Inform. 225, 683–685 (2016).
PubMed Google Scholar
Rogers, J. & Bodenreider, O. SNOMED CT: Browsing the browsers. in vol. 410 (2008).
LOINC. SearchLOINC Home. https://loinc.org/search/ (2022).
Harris, P. A. et al. Research electronic data capture (REDCap)-a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 42, 377–381 (2009).
Article PubMed Google Scholar
Obeid, J. et al. Procurement of shared data instruments for research electronic data capture (REDCap). J. Biomed. Inform. 46, (2012).
Bodenreider, O., Cornet, R. & Vreeman, D. J. Recent Developments in Clinical Terminologies — SNOMED CT, LOINC, and RxNorm. Yearb. Med. Inform. 27, 129–139 (2018).
Article PubMed PubMed Central Google Scholar
Hollingworth, S. & Kairuz, T. Measuring medicine use: applying ATC/DDD methodology to Real-World Data. Pharmacy 9, 60 (2021).
Article PubMed PubMed Central Google Scholar
de Coronado, S. et al. The NCI Thesaurus quality assurance life cycle. J. Biomed. Inform. 42, 530–539 (2009).
Article PubMed Google Scholar
Distler, P. ISBT 128: a global information standard. Cell Tissue Bank 11, 365–373 (2010).
Article PubMed PubMed Central Google Scholar
Bégin, P. et al. Convalescent plasma for hospitalized patients with COVID-19: an open-label, randomized controlled trial. Nat. Med. 1–13 (2021) https://doi.org/10.1038/s41591-021-01488-2.
Mayer, C. S., Williams, N. & Huser, V. Analysis of data dictionary formats of HIV clinical trials. PLoS ONE 15, e0240047 (2020).
Article CAS PubMed PubMed Central Google Scholar
Mawji, A. et al. Common data elements for predictors of pediatric sepsis: a framework to standardize data collection. PLoS ONE 16, e0253051 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sheehan, J. et al. Improving the value of clinical research through the use of Common Data Elements (CDEs). Clin. Trials Lond. Engl. 13, 671–676 (2016).
Article Google Scholar
Grinnon, S. T. et al. NINDS COMMON DATA ELEMENT PROJECT – APPROACH AND METHODS. Clin. Trials Lond. Engl. 9, 322–329 (2012).
Article Google Scholar
Stellmach, C. & Rinaldi, E. Orchestra - Datasets. https://art-decor.org/art-decor/decor-datasets--orch-?id=&effectiveDate=&conceptId=&conceptEffectiveDate= (2022).

Download references

Acknowledgements

We would like to thank the following people for their contributions: Nina Haffer (Berlin Institute of Health, Charité Universitätsmedizin Berlin, Germany), Francesca Fanì, Federica Arbizzani (University of Bologna, Italy), Mattia D´Antonio (Cineca Consorzio Interuniversitario, Bologna, Italy), Elisa Gentilotti, Gaia Maccarrone, Fulvia Mazzaferri, and Giorgia Tomassini (University of Verona, Italy). The ORCHESTRA project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant agreement no. 101016167. The views expressed in this article are the sole responsibility of the authors, and the Commission is not responsible for any use that may be made of the information it contains. We would like to thank SNOMED CT international for waiving the License Fee for ORCHESTRA as a relevant research project.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Berlin Institute of Health (BIH), Charité – Universitätsmedizin Berlin, Berlin, Germany
Eugenia Rinaldi, Caroline Stellmach, Naveen Moses Raj Rajkumar & Sylvia Thun
University of Bologna, Bologna, Italy
Natascia Caroccia & Maddalena Giannella
Cineca Consorzio Interuniversitario, Bologna, Italy
Chiara Dellacasa & Gabriella Scipione
University of Verona, Verona, Italy
Mariana Guedes, Massimo Mirandola & Evelina Tacconelli

Authors

Eugenia Rinaldi
View author publications
You can also search for this author in PubMed Google Scholar
Caroline Stellmach
View author publications
You can also search for this author in PubMed Google Scholar
Naveen Moses Raj Rajkumar
View author publications
You can also search for this author in PubMed Google Scholar
Natascia Caroccia
View author publications
You can also search for this author in PubMed Google Scholar
Chiara Dellacasa
View author publications
You can also search for this author in PubMed Google Scholar
Maddalena Giannella
View author publications
You can also search for this author in PubMed Google Scholar
Mariana Guedes
View author publications
You can also search for this author in PubMed Google Scholar
Massimo Mirandola
View author publications
You can also search for this author in PubMed Google Scholar
Gabriella Scipione
View author publications
You can also search for this author in PubMed Google Scholar
Evelina Tacconelli
View author publications
You can also search for this author in PubMed Google Scholar
Sylvia Thun
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

E.R. and C.S. contributed equally and were primarily responsible for the methodology choices described in the work. They drafted the first version and following revisions of the manuscript under ST’s supervision. All other authors contributed additional content, edits, and references. All authors approved the final version.

Corresponding author

Correspondence to Eugenia Rinaldi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Rinaldi, E., Stellmach, C., Rajkumar, N.M.R. et al. Harmonization and standardization of data for a pan-European cohort on SARS- CoV-2 pandemic. npj Digit. Med. 5, 75 (2022). https://doi.org/10.1038/s41746-022-00620-x

Download citation

Received: 01 November 2021
Accepted: 19 May 2022
Published: 14 June 2022
DOI: https://doi.org/10.1038/s41746-022-00620-x

This article is cited by

Towards interoperability in infection control: a standard data model for microbiology
- Eugenia Rinaldi
- Cora Drenkhahn
- Sylvia Thun
Scientific Data (2023)
ISO/TS 21564:2019- based Evaluation of a Semantic Map between Variables in the ISARIC Freestanding Follow Up Survey and ORCHESTRA Studies
- Eugenia Rinaldi
- Sylvia Thun
- Caroline Stellmach
Journal of Medical Systems (2023)
Harmonising electronic health records for reproducible research: challenges, solutions and recommendations from a UK-wide COVID-19 research collaboration
- Hoda Abbasizanjani
- Fatemeh Torabi
- Ashley Akbari
BMC Medical Informatics and Decision Making (2023)

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Harmonized data

Core data set

Submissions

Discussion

Methods

Harmonization of studies

Coding within data capture tool

Common data elements

Software

Reporting summary

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links