Use of public datasets in the examination of multimorbidity: Opportunities and challenges
Introduction
Multimorbidity affects approximately two-thirds of people aged over 60 years (Fortin et al., 2012; Mokraoui et al., 2016), with complex interactions between education, behaviour, socio-economic deprivation, and frailty contributing to disease risk (Marengoni et al., 2011; Vetrano et al., 2019). Interactions between diseases may be tracked at the individual and population levels, with the metabolic syndrome and osteoarthritis as common exemplars (Anderson and Felson, 1988; Saklayen, 2018). The advent of data science (the methods of recording, storing, and analysing data to extract useful information effectively), together with a capacity to manage data at scale that has grown exponentially since 2000 (Hilbert and Lopez, 2011), has led in recent years to an explosion in the discovery of such disease interactions using data mining, deep learning and big data technologies (Galetsi and Katsaliaki, 2020).
Here, the term “big data” is used to describe the study and application of datasets that are too complex for traditional data-processing software to handle adequately (https://www.sas.com/en_gb/insights/big-data/what-is-big-data.html). The data may be described as “big” because of their volume, velocity (rate of accrual) or variety of formats (Fig. 1). The development of this field provides a great opportunity in the multimorbidity domain to gain insights for patient benefit through the linkage and efficient analysis of large and complementary datasets (Burstein et al., 2019; Pawar et al., 2020). In this review, we consider the types of data that are commonly recorded, with some examples of their application in the multimorbidity setting. We describe the governance frameworks that control access to, and terms of use for, large datasets. Our reference point is legislation that is applicable in the European Union and United Kingdom, but similar regulatory arrangements apply in most other countries. We also outline the infrastructure requirements for managing these data and briefly describe some of the computational approaches used for their analysis.
Why study pre-existing datasets?
Reutilisation of previously collected data provides a highly cost-effective use of resources. Large datasets may provide national coverage for a particular disease area, and thus a broad geographic picture of disease incidence or prevalence and of the association between input variables and endpoints. Routinely collected data, whether for administrative or other purposes, provide real-world information at scale, and can thus provide insights into a given disease that are generalisable to other
What types of dataset are available?
The types of data source that may be integrated into modern analytical frameworks are limited mainly by data readability (format) and computational capacity. In the healthcare domain, data may be recorded digitally for several purposes. At the local level, electronic health records provide individual patient data for the purposes of direct patient care. Summary data of care episodes may be collected for administrative purposes, such as for billing or for resource management. These data may also
What can public datasets tell us about multimorbidity?
There are many examples of the use of linked datasets to better understand the relationships between different risk factors and disease susceptibility, patient care and outcomes. For example, Wolff et al. (2002) used Medicare and death data to demonstrate that in 1999, 82 % of beneficiaries had one or more chronic conditions and 65 % had multiple chronic conditions. They also demonstrated the exponential increase in hospitalisations and cost to society associated with increasing
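The kind of headline figure quoted above (the share of people with at least one, or with multiple, chronic conditions) can be sketched as a simple count over linked diagnosis records. The following is a minimal illustration only, not the method used by Wolff et al.; the function name, the record layout and the example data are all hypothetical.

```python
from collections import defaultdict


def multimorbidity_prevalence(person_ids, diagnoses):
    """Return (share with >= 1 condition, share with >= 2 conditions).

    person_ids: iterable of everyone in the denominator (hypothetical IDs).
    diagnoses:  iterable of (person_id, condition) pairs, e.g. from linked records.
    """
    conditions = defaultdict(set)
    for pid, condition in diagnoses:
        # A set ignores repeated codings of the same condition for one person.
        conditions[pid].add(condition)

    people = list(dict.fromkeys(person_ids))  # de-duplicate, keep order
    if not people:
        raise ValueError("empty denominator")

    n = len(people)
    one_plus = sum(1 for pid in people if len(conditions[pid]) >= 1)
    two_plus = sum(1 for pid in people if len(conditions[pid]) >= 2)
    return one_plus / n, two_plus / n


# Made-up example data, for illustration only.
people = ["p1", "p2", "p3", "p4"]
dx = [
    ("p1", "diabetes"), ("p1", "hypertension"),
    ("p2", "osteoarthritis"), ("p2", "osteoarthritis"),  # duplicate coding
    ("p3", "copd"), ("p3", "diabetes"), ("p3", "ckd"),
]
any_share, multi_share = multimorbidity_prevalence(people, dx)
```

Counting distinct conditions per person, rather than rows, is the design choice that matters here: routinely collected data often record the same diagnosis many times.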
What are the legal and governance requirements for data access?
Accessing routinely collected data for secondary purposes typically requires a series of permissions to be put into place through a data request to the scientific leads of the dataset manager, the data controller (who may be different to the manager) and the controllers of any data sources that the dataset will be linked with. Where personal data are to be processed, the legal bases under common and statutory law need to be established. This may involve patient consent or application to a
Infrastructure requirements
The analysis of large and complex datasets requires a suitable supporting hard and soft infrastructure. The hard infrastructure includes the technologies necessary for large-scale data analysis, such as data storage, processing power and capacity, and software systems and tools for analysis. The soft infrastructure includes the technical services to support the hard infrastructure, and the skills (research domain knowledge and analytical) needed to undertake valid and meaningful analysis.
High
Data curation and analysis
When data are collected at scale, it is likely that there will be inconsistencies in recording, missing fields and records, and general untidiness. Datasets may also be revised and evolve over multiple iterations, with subtle differences emerging in the resultant data. It is important that those wishing to analyse these datasets are sufficiently familiar with the underlying dataset design, or metadata, to be able to model for these inconsistencies. Equally important is inclusion in
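As a first curation step, the extent of missing or inconsistently coded fields can be quantified before any modelling. The sketch below is illustrative only: the list of tokens treated as missing, the field names and the example records are assumptions, and a real dataset's metadata should define what counts as missing.

```python
# Hypothetical codes that a dataset might use for "not recorded".
MISSING_TOKENS = {"", "na", "n/a", "unknown"}


def is_missing(value):
    """True if the value is None or a string matching a missing-value token."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def missingness_report(records, fields):
    """Return {field: fraction of records where the field is absent or missing}."""
    if not records:
        raise ValueError("no records")
    return {
        field: sum(is_missing(rec.get(field)) for rec in records) / len(records)
        for field in fields
    }


# Made-up records with inconsistent recording, for illustration only.
records = [
    {"id": "a", "sex": "F", "bmi": 27.1},
    {"id": "b", "sex": "unknown", "bmi": None},
    {"id": "c", "sex": "M"},                    # bmi field absent altogether
    {"id": "d", "sex": " NA ", "bmi": 31.0},    # padded, upper-case token
]
report = missingness_report(records, ["sex", "bmi"])
```

Treating an absent field, an explicit null and a coded placeholder as equivalent is itself an analytical decision, and one that may need revisiting between dataset versions.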
Summary and conclusions
Here we have given an overview of the opportunities and challenges that are faced when using and combining large public datasets to answer questions in multimorbidity. We have focussed on the legislation and resources required as they apply to the United Kingdom and the European Union, although similar principles apply elsewhere. We have highlighted some of the available opportunities, focussing on public datasets, where the challenges of volume, velocity and variety are more pronounced than in
References (27)
- et al. Epidemiology of multimorbidity and implications for health care, research, and medical education: a cross-sectional study. Lancet (2012)
- et al. The National Hip Fracture Database (NHFD) - Using a national clinical audit to raise standards of nursing care. Int. J. Orthop. Trauma Nurs. (2017)
- et al. A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: a privacy-protecting remote access system for health-related research and evaluation. J. Biomed. Inform. (2014)
- et al. Aging with multimorbidity: a systematic review of the literature. Ageing Res. Rev. (2011)
- et al. Rates of hip and knee joint replacement amongst different ethnic groups in England: an analysis of National Joint Registry data. Osteoarthr. Cartil. (2017)
- et al. Older patients undergoing emergency laparotomy: observations from the National Emergency Laparotomy Audit (NELA) years 1-4. Age Ageing (2020)
- et al. Factors associated with osteoarthritis of the knee in the first national Health and Nutrition Examination Survey (HANES I). Evidence for an association with overweight, race, and physical demands of work. Am. J. Epidemiol. (1988)
- et al. The association of pre-operative anaemia with morbidity and mortality after emergency laparotomy. Anaesthesia (2020)
- et al. Mapping 123 million neonatal, infant and child deaths between 2000 and 2017. Nature (2019)
- et al. A systematic review of prevalence studies on multimorbidity: toward a more uniform methodology. Ann. Fam. Med. (2012)
- Big data analytics in health: an overview and bibliometric study of research activity. Health Info. Libr. J.
- Patient and hospital determinants of primary percutaneous coronary intervention in England, 2003-2013. Heart
- Challenges in administrative data linkage for research. Big Data Soc.