Multi-view representation learning for tabular data integration using inter-feature relationships
Journal of Biomedical Informatics (IF 4.5), Pub Date: 2024-02-10, DOI: 10.1016/j.jbi.2024.104602
Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S. Avidan, Yixin Chen, Christopher R. King

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in healthcare. This integration is usually resolved using meta-data such as feature names, which may be unavailable or ambiguous. Our goal is to design methods that create a mapping between structured tabular datasets derived from electronic health records independent of meta-data. We evaluate methods in the challenging case of numeric features without reliable and distinctive univariate summaries, such as nearly Gaussian and binary features. We assume that a small set of features are a priori mapped between two datasets, which share unknown identical features and possibly many unrelated features. We expect inter-feature relationships to be the main source of identification. We compare the performance of contrastive learning methods for feature representations, novel partial auto-encoders, mutual-information graph optimizers, and simple statistical baselines on simulated data, public datasets, the MIMIC-III medical-record changeover, and perioperative records from before and after a medical-record system change. Performance was evaluated using both mapping of identical features and reconstruction accuracy of examples in the format of the other dataset. Contrastive learning-based methods performed best overall, often substantially beating the literature baseline in matching and reconstruction, especially in the more challenging real-data experiments. Partial auto-encoder methods matched on par with contrastive methods in all synthetic and some real datasets, along with good reconstruction. The statistical method we created nonetheless performed reasonably well in many cases, with much less dependence on hyperparameter tuning.
When validating feature-match output in the EHR dataset, we found that some apparent mistakes were actually surrogate or related features, as reviewed by two subject matter experts. In simulation studies and real-world examples, we find that inter-feature relationships are effective at identifying matching or closely related features across tabular datasets when meta-data is not available. Decoder architectures are also reasonably effective at imputing features without an exact match.
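
The idea of identifying features through inter-feature relationships can be illustrated with a small sketch. This is not the authors' statistical method, whose details the abstract does not give. It is a minimal illustration under stated assumptions: each unmapped column is described by its Pearson correlations with the pre-mapped ("anchor") columns, and columns from the two datasets are paired greedily by the distance between these correlation fingerprints. All function names, the simulated weights, and the noise level are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fingerprints(anchors, features):
    """One row per unmapped column: its correlation with every anchor column."""
    return [[pearson(f, a) for a in anchors] for f in features]


def match_features(anchors_a, unmapped_a, anchors_b, unmapped_b):
    """Greedy one-to-one matching by fingerprint distance.

    Returns {column index in A: column index in B}. Anchors must be listed
    in the same (already mapped) order in both datasets.
    """
    fp_a = fingerprints(anchors_a, unmapped_a)
    fp_b = fingerprints(anchors_b, unmapped_b)
    pairs = sorted(
        (math.dist(u, v), i, j)
        for i, u in enumerate(fp_a)
        for j, v in enumerate(fp_b)
    )
    mapping, used_b = {}, set()
    for _, i, j in pairs:  # closest pairs are claimed first
        if i not in mapping and j not in used_b:
            mapping[i] = j
            used_b.add(j)
    return mapping


def simulate(n, weights, rng):
    """Hypothetical data: 3 Gaussian anchors, each feature a noisy linear mix of them."""
    anchors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    features = [
        [sum(w * a[r] for w, a in zip(ws, anchors)) + rng.gauss(0, 0.5)
         for r in range(n)]
        for ws in weights
    ]
    return anchors, features


def demo():
    rng = random.Random(0)
    weights = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, -1, 0)]
    perm = [2, 0, 3, 1]  # column j of B holds feature perm[j] of A
    # Dataset B: different rows, shuffled columns, plus one unrelated column.
    weights_b = [weights[k] for k in perm] + [(0, 0, 0)]
    anchors_a, unmapped_a = simulate(1000, weights, rng)
    anchors_b, unmapped_b = simulate(1000, weights_b, rng)
    return match_features(anchors_a, unmapped_a, anchors_b, unmapped_b)


if __name__ == "__main__":
    print(demo())
```

In this simulation the two datasets share no rows, so matching relies only on how each column relates to the anchors. Greedy pairing keeps the sketch short; an optimal assignment solver could be swapped in, and the unrelated column in the second dataset is simply left unmatched.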

Updated: 2024-02-10