The impact of using different annotation schemes on named entity recognition, Egyptian Informatics Journal


Egyptian Informatics Journal (IF 5.2), Pub Date: 2020-11-19, DOI: 10.1016/j.eij.2020.10.004
Nasser Alshammari, Saad Alanazi

Named entity recognition (NER) is a subfield of information extraction that aims to detect and classify predefined named entities (e.g., people, locations, and organizations) in a body of text. Many researchers have studied the application of different machine learning models and features to NER. However, few research efforts have examined the annotation schemes used to label multi-token named entities. In this research, we studied seven annotation schemes (IO, IOB, IOE, IOBES, BI, IE, and BIES) and their impact on the task of NER using five different classifiers. Our experiment was conducted on an in-house dataset that consists of 27 Arabic medical articles with more than 62,000 tokens. The IO annotation scheme outperformed the other schemes with an F-measure of 84.44%. The closest competitor was the BIES scheme, which scored 72.78%. The scores of the remaining schemes ranged from 60.38% to 69.18%. Although the IO scheme achieved the best results, comparing it directly with the other schemes is not reasonable because, unlike them, it cannot identify consecutive entities. Therefore, we also investigated the ability to recognize consecutive entities and provided an analysis of running-time complexity.
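To make the point about consecutive entities concrete, here is a minimal Python sketch that encodes the same entity spans under four of the seven schemes (IO, IOB, IOE, and IOBES) and then decodes the IO tags back into spans. It is an illustration under stated assumptions, not the authors' implementation: the function names `encode` and `decode_io`, the English example sentence, and the `PER`/`LOC` labels are hypothetical, and IOB is assumed to place `B` on the first token of every entity, which the abstract does not specify. The BI, IE, and BIES schemes are left out because the abstract does not define them.

```python
from typing import List, Optional, Tuple

# A span is (start, end_exclusive, entity_type) over token indices.
Span = Tuple[int, int, str]


def encode(n_tokens: int, spans: List[Span], scheme: str) -> List[str]:
    """Tag each token under one scheme (assumed conventions, see lead-in)."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        last = end - 1
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "IOB":
                prefix = "B" if i == start else "I"
            elif scheme == "IOE":
                prefix = "E" if i == last else "I"
            elif scheme == "IOBES":
                if start == last:
                    prefix = "S"  # single-token entity
                elif i == start:
                    prefix = "B"
                elif i == last:
                    prefix = "E"
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unsupported scheme: {scheme}")
            tags[i] = f"{prefix}-{etype}"
    return tags


def decode_io(tags: List[str]) -> List[Span]:
    """Recover spans from IO tags by merging runs of the same entity type."""
    spans: List[Span] = []
    start = 0
    current: Optional[str] = None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        etype = None if tag == "O" else tag[2:]
        if etype != current:
            if current is not None:
                spans.append((start, i, current))
            start, current = i, etype
    return spans


# Hypothetical sentence with two adjacent person entities.
tokens = ["Ali", "Hassan", "Omar", "visited", "Cairo"]
gold = [(0, 2, "PER"), (2, 3, "PER"), (4, 5, "LOC")]

for scheme in ("IO", "IOB", "IOE", "IOBES"):
    print(scheme, encode(len(tokens), gold, scheme))

# Under IO the two adjacent PER entities collapse into a single span.
print(decode_io(encode(len(tokens), gold, "IO")))
```

In this sketch the IO tags for the two adjacent person names are indistinguishable from one three-token entity, so decoding yields a single `PER` span, while the `B`, `E`, and `S` markers in the other schemes keep the boundary between them.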


Updated: 2020-11-19
