Survey on RNN and CRF models for de-identification of medical free text, Journal of Big Data


Survey on RNN and CRF models for de-identification of medical free text
Journal of Big Data (IF 8.6), Pub Date: 2020-09-04, DOI: 10.1186/s40537-020-00351-4
Joffrey L. Leevy, Taghi M. Khoshgoftaar, Flavio Villanustre

The increasing reliance on electronic health records (EHRs) in areas such as medical research calls for ample safeguards for patient privacy. These records often qualify as big data, and because a significant portion is stored as free (unstructured) text, we examine relevant work on automated free-text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used for the removal of protected health information (PHI) from free text. Our survey produced several informative findings. Firstly, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and other systems, namely rule-based algorithms. Secondly, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Thirdly, overfitting may be an issue when customized de-identification datasets are used during model training. Finally, statistical validation of performance scores and diversity during experimentation were largely ignored. In our comprehensive survey, we also identify major research gaps that should be considered for future work.


Updated: 2020-09-04
