Cohort analytics: efficiency and applicability, The VLDB Journal



Cohort analytics: efficiency and applicability
The VLDB Journal (IF 2.8), Pub Date: 2020-08-27, DOI: 10.1007/s00778-020-00625-6
Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, Laks V. S. Lakshmanan

The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. The focus of existing work has been largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of the cohort's patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete via a reduction from the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling.” Second, we formalize cohort exploration as the problem of finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the number of potentially similar cohorts is huge. We prove NP-completeness via a reduction from the maximum edge subgraph problem. To address this complexity, we develop a multi-staged approach that limits the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets,” which are inspired by the double dictionary encoding proposed for keyword search. We evaluate the usefulness and efficiency of Core with an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. We also show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability to large and heterogeneous cohorts. Finally, we show the benefit of event sets in providing interactive performance for cohort exploration.
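
For readers unfamiliar with the Needleman–Wunsch algorithm mentioned in the abstract, the following is a minimal sketch of the classic, non-temporal version applied to two patient trajectories written as lists of event labels. It is not the heuristic proposed in the paper: the temporal extension, trajectory families, and stratified sampling are not reproduced here. The function name, the scoring values (match 1, mismatch -1, gap -1), and the event labels are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # placeholder inserted where one trajectory has no matching event


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,      # assumed score for two identical events
    mismatch: int = -1,  # assumed penalty for two different events
    gap: int = -1,       # assumed penalty for aligning an event with a gap
) -> Tuple[int, List[str], List[str]]:
    """Globally align two event sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] aligned with a gap
                score[i][j - 1] + gap,      # b[j-1] aligned with a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a: List[str] = []
    aligned_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_a.append(a[i - 1])
                aligned_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(GAP)
            i -= 1
        else:
            aligned_a.append(GAP)
            aligned_b.append(b[j - 1])
            j -= 1

    aligned_a.reverse()
    aligned_b.reverse()
    return score[n][m], aligned_a, aligned_b


if __name__ == "__main__":
    # Hypothetical trajectories; the labels are illustrative only.
    p1 = ["admission", "scan", "surgery", "discharge"]
    p2 = ["admission", "surgery", "discharge"]
    total, al1, al2 = needleman_wunsch(p1, p2)
    print(total)
    print(al1)
    print(al2)
```

In this example the second trajectory receives a gap opposite the "scan" event. Aligning many trajectories at once, as cohort representation requires, is the multiple sequence alignment setting that the abstract identifies as the source of hardness, which is why the paper resorts to a heuristic.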


Updated: 2020-08-27
