Health record hiccups—5,526 real-world time series with change points labelled by crowdsourced visual inspection, GigaScience

GigaScience (IF 9.2), Pub Date: 2023-07-28, DOI: 10.1093/gigascience/giad060
T Phuong Quan₁, Ben Lacey₂, Tim E A Peto₁, A Sarah Walker₁

Background: Large routinely collected data such as electronic health records (EHRs) are increasingly used in research, but the statistical methods and processes used to check such data for temporal data quality issues have not moved beyond manual, ad hoc production and visual inspection of graphs. With the prospect of EHR data being used for disease surveillance via automated pipelines and public-facing dashboards, automation of data quality checks will become increasingly valuable.

Findings: We generated 5,526 time series from 8 different EHR datasets and engaged >2,000 citizen-science volunteers to label the locations of all suspicious-looking change points in the resulting graphs. Consensus labels were produced using density-based clustering with noise, with validation conducted using 956 images containing labels produced by an experienced data scientist. Parameter tuning was done against 670 images and performance calculated against 286 images, resulting in a final sensitivity of 80.4% (95% CI, 77.1%–83.3%), specificity of 99.8% (99.7%–99.8%), positive predictive value of 84.5% (81.4%–87.2%), and negative predictive value of 99.7% (99.6%–99.7%). In total, 12,745 change points were found within 3,687 of the time series.

Conclusions: This large collection of labelled EHR time series can be used to validate automated methods for change point detection in real-world settings, encouraging the development of methods that can successfully be applied in practice. It is particularly valuable since change point detection methods are typically validated using synthetic data, so their performance in real-world settings cannot be assumed to be comparable. While the dataset focusses on EHRs and data quality, it should also be applicable in other fields.
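The consensus step mentioned in the abstract (density-based clustering with noise) can be illustrated with a minimal sketch. It assumes a DBSCAN-style rule applied to the one-dimensional positions that volunteers marked on a single graph. The function name `consensus_change_points` and the parameter values `eps` and `min_samples` are hypothetical choices for illustration; they are not the authors' code or the tuned parameters from the paper.

```python
from typing import List, Tuple


def consensus_change_points(
    marks: List[float], eps: float = 3.0, min_samples: int = 3
) -> Tuple[List[float], List[float]]:
    """Cluster 1-D volunteer marks with a DBSCAN-style rule.

    Returns (consensus locations, marks treated as noise).
    eps and min_samples are illustrative assumptions, not values from the paper.
    """
    pts = sorted(marks)
    n = len(pts)

    def neighbours(i: int) -> List[int]:
        # Indices of all marks within eps of mark i (mark i itself included).
        return [j for j in range(n) if abs(pts[j] - pts[i]) <= eps]

    # A mark is a core point if enough marks fall within eps of it.
    is_core = [len(neighbours(i)) >= min_samples for i in range(n)]
    labels = [-1] * n  # -1 means noise or not yet assigned
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outwards from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            k = stack.pop()
            for j in neighbours(k):
                if labels[j] == -1:
                    labels[j] = cluster
                    if is_core[j]:  # only core points keep expanding
                        stack.append(j)
        cluster += 1

    # Summarise each cluster by the mean position of its members.
    centres = []
    for c in range(cluster):
        members = [pts[i] for i in range(n) if labels[i] == c]
        centres.append(sum(members) / len(members))
    noise = [pts[i] for i in range(n) if labels[i] == -1]
    return centres, noise
```

For example, with marks at `[10, 11, 12, 11, 50, 51, 52, 80]`, `eps=3.0` and `min_samples=3`, the sketch returns consensus locations `[11.0, 51.0]` and treats the isolated mark at `80` as noise. Summarising a cluster by its mean is one simple choice; a median would be less sensitive to stray marks.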

Updated: 2023-07-28
