Named entity recognition in Turkish: A comparative study with detailed error analysis
Information Processing & Management (IF 8.6), Pub Date: 2022-09-05, DOI: 10.1016/j.ipm.2022.103065
Oguzhan Ozcelik , Cagri Toraman

Named entity recognition aims to detect pre-determined entity types in unstructured text. Studies on this task are limited for low-resource languages such as Turkish. We provide a comprehensive study of Turkish named entity recognition by comparing the performance of existing state-of-the-art models on datasets from varying domains, to understand their generalization capability and to analyze why such models fail or succeed at this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, ranging from 80.8% on tweets to 96.1% on news articles. We find that, compared to traditional models, Transformer-based language models are more robust to entity types with small sample sizes and to longer named entities, yet all models perform poorly on longer named entities in social media. Moreover, when we shuffle 80% of the words in a sentence to imitate the flexible word order of Turkish, we observe greater performance deterioration in well-written texts (12%) than in noisy text (7%).
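
The word-order experiment described above can be illustrated with a minimal Python sketch: it randomly permutes 80% of the token positions in a sentence while carrying each token's NER label along, so gold annotations stay aligned. The function name and exact sampling procedure are hypothetical, not taken from the paper.

import random

def shuffle_word_order(tokens, labels, ratio=0.8, seed=0):
    """Randomly permute `ratio` of the token positions in a sentence,
    moving each token's NER label with it. Hypothetical sketch of the
    paper's word-order perturbation; the authors' procedure may differ."""
    rng = random.Random(seed)
    n = len(tokens)
    k = min(n, max(2, round(n * ratio)))   # need at least 2 positions to permute
    positions = rng.sample(range(n), k)    # which positions get shuffled
    targets = positions[:]
    rng.shuffle(targets)
    new_tokens, new_labels = tokens[:], labels[:]
    for src, dst in zip(positions, targets):
        new_tokens[dst], new_labels[dst] = tokens[src], labels[src]
    return new_tokens, new_labels

# Example with a Turkish sentence ("Ahmet gave a talk in Ankara yesterday."):
tokens = ["Ahmet", "dün", "Ankara'da", "bir", "konuşma", "yaptı", "."]
labels = ["B-PER", "O", "B-LOC", "O", "O", "O", "O"]
print(shuffle_word_order(tokens, labels, seed=42))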

Updated: 2022-09-06