Constructing fine-grained entity recognition corpora based on clinical records of traditional Chinese medicine.
BMC Medical Informatics and Decision Making (IF 3.5), Pub Date: 2020-04-06, DOI: 10.1186/s12911-020-1079-2
Tingting Zhang 1, Yaqiang Wang 2, Xiaofeng Wang 2, Yafei Yang 2, Ying Ye 1

BACKGROUND: In this study, we focus on building a fine-grained entity annotation corpus, together with a corresponding annotation guideline, for traditional Chinese medicine (TCM) clinical records. Our aim is to provide a basis for the future construction of fine-grained corpora from TCM clinical records.

METHODS: We developed a four-step approach suited to constructing a corpus from TCM clinical records. First, we determined the entity types included in this study through sample annotation. Second, we drafted a fine-grained annotation guideline by summarizing the characteristics of the dataset and referring to existing guidelines. Third, we iteratively updated the guideline until the inter-annotator agreement (IAA), measured by Cohen's kappa, exceeded 0.9. Finally, comprehensive annotation was performed while keeping the IAA above 0.9.

RESULTS: We annotated the 10,197 clinical records in five rounds. Four entity categories comprising 13 entity types were employed. The final fine-grained annotated entity corpus consists of 1104 entities and 67,799 tokens. The final IAA is 0.936 on average (across three annotators), indicating that the fine-grained entity recognition corpus is of high quality.

CONCLUSIONS: These results will provide a foundation for future research on corpus construction and named entity recognition tasks in the TCM clinical domain.
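
The abstract uses a Cohen's kappa of 0.9 as the agreement threshold and reports an average over three annotators, but it does not say how the average was formed. The sketch below shows one common way to do this, assuming token-level labels and a mean over all annotator pairs. The function names and the toy labels (`SYMPTOM`, `HERB`, `O`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators (an assumption,
    not necessarily the averaging used in the paper)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy token-level labels from three hypothetical annotators.
annotator_1 = ["SYMPTOM", "O", "HERB", "O", "SYMPTOM", "O"]
annotator_2 = ["SYMPTOM", "O", "HERB", "O", "O", "O"]
annotator_3 = ["SYMPTOM", "O", "HERB", "HERB", "SYMPTOM", "O"]

average_iaa = mean_pairwise_kappa([annotator_1, annotator_2, annotator_3])
meets_threshold = average_iaa > 0.9  # the threshold quoted in the abstract
```

Under this reading, a guideline revision round would repeat until `meets_threshold` holds on a sample annotated by all annotators.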

Updated: 2020-04-22