Real-world data medical knowledge graph: construction and applications.
Artificial Intelligence in Medicine (IF 7.5), Pub Date: 2020-02-06, DOI: 10.1016/j.artmed.2020.101817
Linfeng Li, Peng Wang, Jun Yan, Yao Wang, Simin Li, Jinpeng Jiang, Zhe Sun, Buzhou Tang, Tsung-Hui Chang, Shenghui Wang, Yuting Liu

Objective

Medical knowledge graphs (KGs) are attracting attention from both academia and the healthcare industry because of their power in intelligent healthcare applications. In this paper, we introduce a systematic approach to building a medical KG from electronic medical records (EMRs), evaluated through both technical experiments and end-to-end application examples.

Materials and Methods

The original data set contains records of 16,217,270 de-identified clinical visits from 3,767,198 patients. The KG construction procedure comprises eight steps: data preparation, entity recognition, entity normalization, relation extraction, property calculation, graph cleaning, related-entity ranking, and graph embedding. We propose a novel quadruplet structure to represent medical knowledge in place of the classical KG triplet, together with a novel related-entity ranking function that considers probability, specificity and reliability (PSR). In addition, the probabilistic translation on hyperplanes (PrTransH) algorithm is used to learn graph embeddings for the generated KG.
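
For orientation, translation-on-hyperplanes models score a (head, relation, tail) triple by projecting the head and tail embeddings onto a relation-specific hyperplane and checking how well a translation vector connects them. The sketch below shows only the standard TransH score in plain Python. The probabilistic extension that PrTransH adds is not described in this abstract and is not modeled here, and the function and parameter names (`project`, `transh_score`, `w_r`, `d_r`) are illustrative assumptions.

```python
import math


def project(v, w):
    """Project vector v onto the hyperplane whose normal vector is w."""
    norm = math.sqrt(sum(x * x for x in w))
    unit = [x / norm for x in w]  # normalize so the projection is well defined
    dot = sum(a * b for a, b in zip(v, unit))
    return [a - dot * b for a, b in zip(v, unit)]


def transh_score(h, t, w_r, d_r):
    """Squared L2 distance ||h_perp + d_r - t_perp||^2; lower means more plausible."""
    h_p = project(h, w_r)
    t_p = project(t, w_r)
    return sum((a + d - b) ** 2 for a, d, b in zip(h_p, d_r, t_p))


# Toy 3-dimensional example with made-up vectors (not data from the paper).
print(transh_score([1, 2, 3], [2, 2, -5], [0, 0, 2], [1, 0, 0]))
```

In this toy case the head and tail differ only along the hyperplane normal and by the translation `d_r`, so the triple receives a score of zero, the most plausible value.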

Results

A medical KG with 9 entity types, including disease and symptom, was established; it contains 22,508 entities and 579,094 quadruplets. Compared with the term frequency-inverse document frequency (TF-IDF) method, the proposed ranking function increased the normalized discounted cumulative gain (NDCG@10) from 0.799 to 0.906. Embedding representations for all entities and relations were learned and shown to be effective through disease clustering.
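
As a reminder of what the ranking metric measures, the following is a minimal sketch of normalized discounted cumulative gain at a cutoff k. It assumes linear gains and a log2 rank discount, and it takes the ideal ordering from the same list of graded labels; the abstract does not state which variant the authors used, so treat these choices and the function names as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for one ranked list of related entities.
print(ndcg_at_k([3, 2, 1, 0], k=10))  # already in ideal order
print(ndcg_at_k([1, 3], k=2))         # best item ranked second
```

A ranking that already places the most relevant entities first scores 1.0, and any misordering lowers the value toward 0.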

Conclusion

The established systematic procedure can efficiently construct a high-quality medical KG from large-scale EMRs. The proposed ranking function PSR achieves the best performance under all relations, and the disease clustering result validates the learned embedding vectors as semantic representations of entities. Moreover, the resulting KG supports many successful applications thanks to its statistics-based quadruplets.

In the reliability definition, N_co^min is the minimum co-occurrence number and R is the base reliability value. The reliability value measures how reliable the relationship between S_i and O_ij is. The rationale is that the higher the co-occurrence number N_co(S_i, O_ij), the more reliable the relationship. However, the reliability values of two relationships should not differ much when both of their co-occurrence numbers are very large. In our study, we set N_co^min = 10 and R = 1 after some experiments. For instance, if the co-occurrence numbers of three relationships are 1, 100 and 10000, their reliability values are 1, 2.96 and 5, respectively.
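
The reliability formula itself is not reproduced in this excerpt. As a hedged illustration, the sketch below uses one assumed form, R + log10(max(1, N_co - N_co^min + 1)), chosen only because it reproduces the three example values quoted above and flattens for large co-occurrence counts. It should not be read as the authors' exact definition, and the function and parameter names are hypothetical.

```python
import math


def reliability(n_co, n_co_min=10, base=1.0):
    """Assumed reliability form: base + log10(max(1, n_co - n_co_min + 1)).

    Counts at or below n_co_min get the base value; above it, the value
    grows logarithmically, so very large counts differ only slightly.
    """
    return base + math.log10(max(1, n_co - n_co_min + 1))


for n in (1, 100, 10000):
    print(n, round(reliability(n), 2))
# prints:
# 1 1.0
# 100 2.96
# 10000 5.0
```

Under this assumption, a hundredfold increase in co-occurrence from 100 to 10000 raises the reliability by only about 2, which matches the stated intent that large counts should not produce large differences.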




Updated: 2020-02-06