Natural language generation for electronic health records, npj Digital Medicine


npj Digital Medicine (IF 12.4), Pub Date: 2019-01-29
Scott H Lee


One broad goal of biomedical informatics is to generate fully synthetic, faithfully representative electronic health records (EHRs) to facilitate data sharing between healthcare providers and researchers and to promote methodological research. A variety of methods exist for generating synthetic EHRs, but they cannot generate unstructured text, such as emergency department (ED) chief complaints, histories of present illness, or progress notes. Here, we use the encoder-decoder model, a deep learning algorithm that features in many contemporary machine translation systems, to generate synthetic chief complaints from discrete variables in EHRs, such as age group, gender, and discharge diagnosis. After being trained end-to-end on authentic records, the model can generate realistic chief complaint text that appears to preserve the epidemiological information encoded in the original record-sentence pairs. As a side effect of the model's optimization goal, these synthetic chief complaints are also free of relatively uncommon abbreviations and misspellings, and they include none of the personally identifiable information (PII) that was in the training data, suggesting that this model may be used to support the de-identification of text in EHRs. When combined with algorithms like generative adversarial networks (GANs), our model could be used to generate fully synthetic EHRs, allowing healthcare providers to share faithful representations of multimodal medical data without compromising patient privacy. This is an important advance that we hope will facilitate the development of machine-learning methods for clinical decision support, disease surveillance, and other data-hungry applications in biomedical informatics.
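To make the data flow in the abstract concrete, the following is a minimal sketch of an encoder-decoder that maps a set of discrete record variables to a short token sequence. It is not the paper's implementation: the feature names, the token vocabulary, the hidden size, the plain tanh recurrent cell, and the greedy decoding are all assumptions chosen for illustration, and the weights are random rather than trained, so the generated text carries no meaning.

```python
import math
import random

# Hypothetical vocabularies; the abstract does not list the real ones.
FEATURES = ["age:18-44", "age:45-64", "sex:F", "sex:M", "dx:asthma", "dx:fracture"]
TOKENS = ["<s>", "</s>", "sob", "wheezing", "fall", "arm", "pain"]
HIDDEN = 8  # assumed hidden-state size


def rand_matrix(rows, cols, rng):
    """Small random weight matrix (stands in for trained parameters)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_params(seed=0):
    rng = random.Random(seed)
    return {
        "enc": rand_matrix(HIDDEN, len(FEATURES), rng),  # record -> initial state
        "emb": rand_matrix(HIDDEN, len(TOKENS), rng),    # previous token -> input
        "rec": rand_matrix(HIDDEN, HIDDEN, rng),         # state -> next state
        "out": rand_matrix(len(TOKENS), HIDDEN, rng),    # state -> token scores
    }


def encode(record, params):
    """Encoder: multi-hot vector of discrete variables -> dense hidden state."""
    x = [1.0 if feature in record else 0.0 for feature in FEATURES]
    return [math.tanh(h) for h in matvec(params["enc"], x)]


def decode(state, params, max_len=6):
    """Decoder: emit one token per step, feeding each choice back in (greedy)."""
    tokens = []
    prev = TOKENS.index("<s>")
    for _ in range(max_len):
        prev_one_hot = [1.0 if i == prev else 0.0 for i in range(len(TOKENS))]
        inp = matvec(params["emb"], prev_one_hot)
        rec = matvec(params["rec"], state)
        state = [math.tanh(a + b) for a, b in zip(inp, rec)]
        scores = matvec(params["out"], state)
        # Skip index 0 so the start symbol is never generated.
        prev = max(range(1, len(TOKENS)), key=scores.__getitem__)
        if TOKENS[prev] == "</s>":
            break
        tokens.append(TOKENS[prev])
    return tokens


params = make_params(seed=0)
record = {"age:18-44", "sex:F", "dx:asthma"}
complaint = decode(encode(record, params), params)
print(" ".join(complaint))
```

In a trained system, the parameters would be fit end-to-end on authentic record-sentence pairs, and the decoder would typically sample or search over token probabilities rather than take the single highest score at each step.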


Updated: 2019-11-01
