Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study


Journal of Medical Internet Research (IF 5.8), Pub Date: 2021-05-06, DOI: 10.2196/25714
Uddhav Vaghela ₁, Simon Rabinowicz ₁, Paris Bratsos ₁, Guy Martin ₁, Epameinondas Fritzilas ₂, Sheraz Markar ₁, Sanjay Purkayastha ₁, Karl Stringer ₃, Harshdeep Singh ₄, Charlie Llewellyn ₂, Debabrata Dutta ₄, Jonathan M Clarke ₁, Matthew Howard ₂, […] ₁,₅, Ovidiu Serban ₆, James Kinross ₁


Background: The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented “infodemic”; the velocity and volume of data production have overwhelmed many key stakeholders such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis–related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query.

Objective: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data.

Methods: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources.

Results: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19.

Conclusions: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation.



Updated: 2021-05-06
