Improving domain adaptation in de-identification of electronic health records through self-training
Journal of the American Medical Informatics Association (IF 6.4), Pub Date: 2021-08-07, DOI: 10.1093/jamia/ocab128
Shun Liao 1, 2 , Jamie Kiros 3 , Jiyang Chen 3 , Zhaolei Zhang 1, 2 , Ting Chen 3

Abstract
Objective
De-identification, the removal of protected health information entities, is a fundamental task in processing electronic health records. Deep learning models have proven to be promising tools for automating de-identification. However, when the target domain (where the model is applied) differs from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as the domain adaptation issue. In de-identification, this issue can make a model unreliable in deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain.
Materials and Methods
We introduce a self-training framework that addresses the domain adaptation issue by leveraging unlabeled data from the target domain. We validate its effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with a supervised baseline that directly deploys the model trained on the source domain.
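The abstract does not spell out the training procedure, so the following is only a minimal sketch of generic pseudo-label self-training, not the authors' implementation. It assumes a toy 1-D nearest-centroid classifier in place of a deep de-identification model; the names `fit_centroids`, `predict`, `self_train`, the `min_margin` confidence threshold, and all data values are hypothetical.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Train a toy 1-D nearest-centroid classifier: one mean per label."""
    return {lab: mean(x for x, y in zip(xs, ys) if y == lab)
            for lab in sorted(set(ys))}


def predict(centroids, x):
    """Return (label, margin); margin is the distance gap between the
    closest and second-closest centroid, used here as a confidence proxy."""
    dists = sorted((abs(x - c), lab) for lab, c in centroids.items())
    best_dist, best_label = dists[0]
    return best_label, dists[1][0] - best_dist


def self_train(source_x, source_y, target_x, rounds=2, min_margin=2.0):
    """Generic pseudo-labeling loop (an assumption, not the paper's recipe)."""
    # Step 1: supervised training on labeled source data only.
    model = fit_centroids(source_x, source_y)
    for _ in range(rounds):
        # Step 2: label the unlabeled target data, keeping confident predictions.
        pseudo_x, pseudo_y = [], []
        for x in target_x:
            label, margin = predict(model, x)
            if margin >= min_margin:
                pseudo_x.append(x)
                pseudo_y.append(label)
        # Step 3: retrain on source labels plus the current pseudo-labels.
        model = fit_centroids(source_x + pseudo_x, source_y + pseudo_y)
    return model


# Hypothetical data: the target domain is shifted relative to the source.
source_x, source_y = [0.0, 1.0, 4.0, 5.0], [0, 0, 1, 1]
target_x = [1.5, 2.0, 3.0, 6.0, 6.5, 7.0]  # unlabeled

baseline = fit_centroids(source_x, source_y)        # direct deployment
adapted = self_train(source_x, source_y, target_x)  # self-training

print("baseline label for 3.0:", predict(baseline, 3.0)[0])
print("adapted label for 3.0:", predict(adapted, 3.0)[0])
```

In this toy setup the source-only model assigns the target point 3.0 to class 1, while the self-trained model assigns it to class 0, because confident target predictions pull the centroids toward the target distribution.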
Results
On average, the proposed framework improves the F1-score by 5.38 compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test dataset, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.80). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge.
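For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from hypothetical entity counts (the `tp`, `fp`, and `fn` values are illustrative, not taken from the paper) and re-derives the i2b2-2014 to i2b2-2006 gain from the two scores quoted above.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 entities found correctly, 2 spurious, 2 missed.
example_f1 = f1_score(tp=8, fp=2, fn=2)

# Gain between the two reported F1-scores (given on a 0-100 scale).
reported_gain = round(85.41 - 76.61, 2)

print(example_f1, reported_gain)
```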
Conclusion
Our work demonstrates an effective self-training framework that boosts domain adaptation performance in the de-identification of electronic health records.


Updated: 2021-09-20