Use of public datasets in the examination of multimorbidity: Opportunities and challenges

https://doi.org/10.1016/j.mad.2020.111310

Highlights

  • Large public datasets provide a great opportunity for the study of multimorbidity.

  • Linkage between datasets providing complementary information further expands scope and opportunities.

  • Data depth and complexity can make assuring individual anonymity a challenge.

  • Responsible management requires awareness of prevailing legislation and information governance requirements.

  • Valid and meaningful interpretation of large datasets also necessitates efficient and accurate data curation.

Abstract

The interrogation of established, large-scale datasets presents great opportunities in health data science for the linkage and mining of potentially disparate resources to create new knowledge in a fast and cost-efficient manner. The number of datasets that can be queried in the field of multimorbidity is vast, ranging from national administrative and audit datasets, large clinical, technical and biological cohorts, through to more bespoke data collections made available by individual organisations and laboratories. However, with these opportunities also come technical and regulatory challenges that require an informed approach. In this review, we outline the potential benefits of using previously collected data as a vehicle for research activity. We illustrate the added value of combining potentially disparate datasets to find answers to novel questions in the field. We focus on the legal, governance and logistical considerations required to hold and analyse data acquired from disparate sources and outline some of the solutions to these challenges. We discuss the infrastructure resources required and the essential considerations in data curation and informatics management, and briefly discuss some of the analysis approaches currently used.

Introduction

Multimorbidity affects approximately two-thirds of people aged over 60 years (Fortin et al., 2012; Mokraoui et al., 2016), with complex interactions between education, behaviour, socio-economic deprivation, and frailty contributing to disease risk (Marengoni et al., 2011; Vetrano et al., 2019). Interactions between diseases may be tracked at the individual and population levels, with the metabolic syndrome and osteoarthritis as common exemplars (Anderson and Felson, 1988; Saklayen, 2018). The advent of data science (the methods of recording, storing, and analysing data to effectively extract useful information), together with a capacity to manage data at scale that has increased exponentially since 2000 (Hilbert and Lopez, 2011), has led to an explosion in the discovery of such disease interactions in recent years using data mining, deep learning and big data technologies (Galetsi and Katsaliaki, 2020).

Here, the term “big data” is used to describe the study and applications of datasets that are too complex for traditional data-processing application software to adequately deal with (https://www.sas.com/en_gb/insights/big-data/what-is-big-data.html). The data may be described as “big” because of its volume, velocity (rate of accrual) or variety of formats (Fig. 1). The development of this field provides a great opportunity in the multimorbidity domain to gain insights for patient benefit through the linkage and efficient analysis of large and complementary datasets (Burstein et al., 2019; Pawar et al., 2020). In this review, we consider the types of data that are commonly recorded, with some examples of their application in the multimorbidity setting. We describe the governance frameworks that control access to, and terms of use for, large datasets. Our reference point is legislation that is applicable in the European Union and United Kingdom, but similar regulatory arrangements apply in most other countries. We also outline the infrastructure requirements for managing these data and briefly describe some of the computational approaches used for their analysis.

Section snippets

Why study pre-existing datasets?

Reutilisation of previously collected data provides a highly cost-effective use of resources. Large datasets may provide national coverage for a particular disease area, and thus a broad geographic picture of disease incidence or prevalence and of the association between input variables and endpoints. Routinely collected data, whether for administrative or other purposes, provides real-world information at scale, and thus can provide insights into a given disease that are generalisable to other

What types of dataset are available?

The types of data sources that may be integrated into modern analytical frameworks are limited mainly by data readability (format) and computational capacity. In the healthcare domain, data may be recorded digitally for several purposes. At the local level, electronic health records provide individual patient data for the purposes of direct patient care. Summary data of care episodes may be collected for administrative purposes, such as for billing or for resource management. These data may also

What can public datasets tell us about multimorbidity?

There are many examples of the use of linked datasets to better understand the relationships between different risk factors and disease susceptibility, patient care and outcomes. For example, Wolff et al. (2002) used Medicare and death data to demonstrate that in 1999, 82 % of beneficiaries had one or more chronic conditions and 65 % had multiple chronic conditions. They also demonstrated the exponential increase in hospitalisations and cost to society associated with increasing

What are the legal and governance requirements for data access?

Accessing routinely collected data for secondary purposes typically requires a series of permissions to be put in place through data requests to the scientific leads of the dataset manager, the data controller (who may be different to the manager) and the controllers of any data sources that the dataset will be linked with. Where personal data are to be processed, the legal bases under common and statutory law need to be established. This may involve patient consent or application to a

Infrastructure requirements

The analysis of large and complex datasets requires a suitable supporting hard and soft infrastructure. The hard infrastructure includes the technologies necessary for large-scale data analysis e.g. data storage, processing power and capacity, software systems and tools for analysis. The soft infrastructure includes the technical services to support the hard infrastructure, and the skills (research domain knowledge and analytical) needed to undertake valid and meaningful analysis.

High

Data curation and analysis

When data is collected at scale, it is likely that there will be inconsistencies in the recording of data, missing fields and records and general untidiness. Datasets may also be revised and evolved over multiple iterations with subtle differences emerging in the resultant data. It is important that those wishing to analyse these datasets are sufficiently familiar with the underlying dataset design, or metadata, to be able to model for these inconsistencies. Equally important is inclusion in

Summary and conclusions

Here we have given an overview of the opportunities and challenges that are faced when using and combining large public datasets to answer questions in multimorbidity. We have focussed on the legislation and resources required as they apply to the United Kingdom and the European Union, although similar principles apply elsewhere. We have highlighted some of the available opportunities, focussing on public datasets, where the challenges of volume, velocity and variety are more pronounced than in

References (27)

  • P. Galetsi et al. Big data analytics in health: an overview and bibliometric study of research activity. Health Info. Libr. J. (2020)

  • M. Hall et al. Patient and hospital determinants of primary percutaneous coronary intervention in England, 2003-2013. Heart (2016)

  • K. Harron et al. Challenges in administrative data linkage for research. Big Data Soc. (2017)