The COVID-19 pandemic, which emerged from Wuhan, China, in December 2019, has resulted in at least 603 million cases and more than 6.4 million deaths worldwide (as of September 2022)1. Considerable additional disruption, morbidity and mortality have resulted from the social, economic and health-system consequences that ensued as several governments instituted a series of national and then more localized lockdowns2. The pandemic required a series of policy, public health and clinical decisions to be taken, with major consequences for societal functioning, economies and care provision. Taking these decisions was always going to be complex, but in most places this was exacerbated by the lack of relevant data3. By contrast, a handful of territories substantially developed their data capabilities over the course of the pandemic, generating important insights to guide their own national decisions and to inform international deliberations.

Key data sources should be available at each stage of a pandemic. Case studies of territories that have been positive outliers in their data capabilities offer potentially transferable lessons for being better equipped to generate data-enabled responses to future epidemics and pandemics. As the COVID-19 pandemic is not yet over, the ideas contained in this paper should be seen as a work in progress.

Data requirements

All pandemics have distinctive dimensions that depend on the nature of the responsible infectious agent, the speed of national and international non-pharmacological responses, and the availability and deployment of vaccines and therapeutic agents. It is, however, possible to identify some core phases of pandemics and therefore consider the data sources that should ideally be available to support decision-making during these phases. The core phases of pandemics are summarized in the World Health Organization (WHO) Pandemic Phases Framework, which was originally developed for influenza3.

Although most governments have, to some extent, developed their pandemic data response capabilities, a few have disproportionately contributed to the discovery of policy-relevant insights during COVID-19. Examples of such places include Iceland, Israel, Qatar, Scotland and Taiwan (some of which are discussed in Table 1).

Table 1 Case studies of national data infrastructures used to support pandemic responses

Having relevant datasets available is fundamental, but insufficient, to ensure capacity for data-enabled policy responses to pandemics. Also needed are permissions for different stakeholders to access data, ideally coordinated and granted by a national scientific committee, and the ability to curate, link, analyze, visualize, interpret and communicate these data to government bodies, policy makers, health-system leaders and other audiences, often across national boundaries. Each of these steps is time consuming, but time is a luxury not available amid the exponential growth of infections seen in pandemics. It is therefore crucial that due attention is given to the data infrastructure and pipeline as part of national pandemic preparedness plans.

Data infrastructure

There is a need to access disparate data, including electronic health records, travel records and other health-related data, ideally on every person and in as close to real time as possible. Key datasets can potentially be stored in a single central secure warehouse, as is the case for Qatar (Table 1). This requires adequate computational power, which can be substantial when dealing with millions of rows of data. Bringing together these disparate datasets can be done through deterministic or probabilistic approaches; where possible, this is most efficiently achieved using unique identifiers4,5. An alternative is to leave data in situ and deploy a service-oriented architecture (SOA), which creates interfaces between disparate datasets through application programming interfaces (APIs). This requires upfront engineering costs, but offers the potential for periodic synchronized updates and accompanying substantial reductions in downstream resource demands.
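Where a common unique identifier is available across datasets, deterministic linkage reduces to an exact join. The sketch below illustrates this with two hypothetical extracts (primary-care records and laboratory results); the identifier and field names are illustrative assumptions rather than any specific national specification.

```python
# Minimal sketch of deterministic record linkage on a shared unique identifier.
# The identifier ("person_id") and columns are hypothetical examples.
import pandas as pd

# Hypothetical primary-care extract.
primary_care = pd.DataFrame({
    "person_id": ["1001", "1002", "1003"],
    "age": [67, 34, 51],
    "risk_group": ["shielding", "none", "none"],
})

# Hypothetical laboratory testing extract.
lab_results = pd.DataFrame({
    "person_id": ["1002", "1003", "1004"],
    "specimen_date": ["2020-04-01", "2020-04-03", "2020-04-05"],
    "pcr_result": ["positive", "negative", "positive"],
})

# Deterministic linkage: an exact join on the unique identifier.
linked = primary_care.merge(lab_results, on="person_id", how="inner")
print(linked)
```

Probabilistic linkage, by contrast, scores candidate record pairs on partially identifying fields (such as name, date of birth and postcode) when no reliable shared identifier exists, at the cost of greater complexity and some misclassification.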

Information governance

Access to health and other sensitive data needs to be carefully regulated6 and requires a variety of processes to be in place to ensure that data are not used inappropriately. These checks are typically extensive and time consuming. However, the risk balance in providing access to these data needs to be shifted in the context of global emergencies such as pandemics. It is therefore important that policies and plans are in place, which may require special legislation. For example, Taiwan passed legislation to allow access to mobile-phone data (Table 1). Similarly, a Control of Patient Information (COPI) notice was issued by the UK Government’s then Secretary of State for Health and Social Care to allow sharing of confidential patient information among healthcare organizations and other relevant bodies in order to safeguard public health7.

Analytical capability

Another key rate-limiting factor in the ability to generate data-enabled insights is a lack of data-processing and analytical capability. There is a need for trained staff, ideally familiar with the datasets in question, who can, at pace, check, clean, link, analyze and help to visualize data for policy audiences and others. This requires staff with a range of skills to work together8. Taking the time to develop, for example, a data dictionary, and sharing source code, can greatly increase the efficiency of analysis and the transparency of methods.
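A data dictionary can be kept as a simple machine-readable specification that analysts and their code can both consult. The fragment below is a minimal sketch under assumed field names and allowed values; it is not drawn from any published national dictionary.

```python
# Illustrative data-dictionary fragment: field names, types and allowed values
# are hypothetical examples.
DATA_DICTIONARY = {
    "specimen_date": {
        "type": "date",
        "format": "YYYY-MM-DD",
        "description": "Date the swab was taken",
    },
    "pcr_result": {
        "type": "categorical",
        "allowed_values": ["positive", "negative", "void"],
        "description": "Result of the PCR test",
    },
}

def check_record(record: dict) -> list:
    """Return the fields whose values fall outside the dictionary's allowed values."""
    problems = []
    for field, spec in DATA_DICTIONARY.items():
        if spec["type"] == "categorical" and record.get(field) not in spec["allowed_values"]:
            problems.append(field)
    return problems

# Example: an unexpected test result is flagged for cleaning before analysis.
print(check_record({"specimen_date": "2020-04-01", "pcr_result": "unknown"}))
```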

Transparency

As ever, it is important that analyses are undertaken in transparent ways9, with provision for exploratory analyses. For example, it was unclear during the early stages of the pandemic which variables would be most useful to identify patients at greatest risk of poor COVID-19 outcomes, resulting in the need for several exploratory analyses. It is important that such exploratory analyses are transparently reported. Other recommendations for transparency include: reporting metadata; wherever possible, specifying statistical analysis plans (SAPs) in advance and making these publicly available; making source code available through a repository such as GitHub; and, where possible, making actual or synthetic data available to facilitate replication and validation studies and the training of new analysts. While the immediate need is to provide insights to policy makers, there is considerable merit in also publishing analyses in preprints and peer-reviewed journals to allow independent verification of methods and to share insights with the global community.
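One practical route to the last of these recommendations is to release a synthetic dataset that mirrors the structure, but not the individuals, of a sensitive patient-level table, so that code can be developed, shared and validated before it is run on the real data. The sketch below is a minimal illustration; the variable names and marginal distributions are invented for the example.

```python
# Minimal sketch: generate a synthetic cohort with the same structure as a
# hypothetical patient-level table. Variables and distributions are assumptions
# for illustration only and carry no real information about individuals.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

synthetic = pd.DataFrame({
    "age": rng.integers(0, 100, size=n),
    "sex": rng.choice(["F", "M"], size=n),
    "vaccine_doses": rng.choice([0, 1, 2, 3], size=n, p=[0.1, 0.2, 0.4, 0.3]),
    "pcr_result": rng.choice(["positive", "negative"], size=n, p=[0.15, 0.85]),
})

# The synthetic file can be shared alongside the analysis code for replication
# and for training new analysts.
synthetic.to_csv("synthetic_cohort.csv", index=False)
print(synthetic.head())
```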

International co-operation

There are numerous instances where it is important to be able to run analyses across countries, regions or globally10. However, this is difficult because it is seldom possible to move sovereign datasets across national boundaries; it therefore requires federated analyses, in which results generated within each jurisdiction are combined through some form of data synthesis. The most prominent example has been the Johns Hopkins Coronavirus Resource Center COVID-19 Testing Dashboard (Box 1).

Other examples include: analyses of data across UK nations to investigate the effect of lockdown measures on health-system functioning; investigation of rare vaccine safety signals, such as cerebral venous sinus thrombosis11,12; studies of the impact of variants of concern (Gamma in Brazil and Delta in Scotland) on disease severity and on waning of vaccine effectiveness13; and work undertaken across more than 40 countries through the International COVID-19 Data Alliance (ICODA; https://icoda-research.org) to investigate the effect of lockdown measures on perinatal outcomes14.
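In a federated analysis of this kind, only aggregate results leave each jurisdiction. The sketch below shows one simple form of synthesis, fixed-effect inverse-variance-weighted pooling of country-level effect estimates; the country names and numbers are invented for illustration and do not correspond to any of the studies cited above.

```python
# Sketch of federated synthesis: each country shares only an aggregate effect
# estimate (e.g. a log hazard ratio) and its standard error, which are pooled
# centrally. All values below are illustrative.
import math

country_estimates = {
    # country: (log effect estimate, standard error)
    "Country A": (0.65, 0.20),
    "Country B": (0.80, 0.25),
    "Country C": (0.55, 0.30),
}

# Fixed-effect inverse-variance weighting.
weights = {c: 1 / se ** 2 for c, (_, se) in country_estimates.items()}
pooled = sum(w * country_estimates[c][0] for c, w in weights.items()) / sum(weights.values())
pooled_se = math.sqrt(1 / sum(weights.values()))

print(f"Pooled log effect: {pooled:.2f} (SE {pooled_se:.2f})")
print(f"Pooled effect (exponentiated): {math.exp(pooled):.2f}")
```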

Conclusions

Ready access to high-quality, multi-dimensional data is fundamental to generating effective evidence and informed policy responses to pandemics, but most places have struggled with this. Many analyses need to extend across international boundaries, which is most likely to be achieved through federated analytical approaches, but this will require coordination between governments. A few territories have excelled in health data science during the pandemic; their experience offers a framework that might be developed and deployed in future epidemics and pandemics (Box 2).