CKMorph: A Comprehensive Morphological Analyzer for Central Kurdish
arXiv - CS - Computation and Language. Pub Date: 2021-09-17, DOI: arxiv-2109.08615
Morteza Naserzade, Aso Mahmudi, Hadi Veisi, Hawre Hosseini, Mohammad MohammadAmini

A morphological analyzer, a key component of many natural language processing applications, especially for morphologically rich languages, divides an input word into all of its constituent morphemes and identifies their morphological roles. In this paper, we introduce a comprehensive morphological analyzer for Central Kurdish (CK), a low-resource language with rich morphology. Building on the limited existing literature, we first assembled and systematically categorized a comprehensive collection of the language's morphological and morphophonological rules. We also collected and manually labeled a generative lexicon of nearly 10,000 verb, noun, and adjective stems, named entities, and other types of word stems. We used these rule sets and resources to implement the CKMorph Analyzer with finite-state transducers. To provide a benchmark for future research, we collected, manually labeled, and publicly shared test sets for evaluating the accuracy and coverage of the analyzer. CKMorph correctly analyzed 95.9% of the accuracy test set, which contains 1,000 CK words morphologically analyzed in context. Moreover, CKMorph gave at least one analysis for 95.5% of the 4.22M CK tokens in the coverage test set. A demonstration of the application and the resources, including the CK verb database and test sets, are openly accessible at https://github.com/CKMorph.
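The two evaluation measures mentioned in the abstract can be illustrated with a minimal sketch. This is not the CKMorph implementation: the `analyze` callable, the toy lookup table, and the placeholder tokens and tags are all hypothetical. The accuracy criterion used here (the gold analysis appears among the analyzer's outputs) is an assumption, since the abstract does not spell out the exact matching rule.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical analyzer interface: a word maps to a list of analysis strings.
Analyzer = Callable[[str], List[str]]


def coverage(analyze: Analyzer, tokens: Iterable[str]) -> float:
    """Fraction of tokens that receive at least one analysis."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    analyzed = sum(1 for tok in token_list if analyze(tok))
    return analyzed / len(token_list)


def accuracy(analyze: Analyzer, gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (word, gold_analysis) pairs whose gold analysis is
    among the analyzer's outputs (assumed criterion, see lead-in)."""
    pairs = list(gold)
    if not pairs:
        return 0.0
    correct = sum(1 for word, ref in pairs if ref in analyze(word))
    return correct / len(pairs)


# Toy stand-in for a finite-state analyzer: placeholder words and tags,
# not real Central Kurdish data.
TOY_ANALYSES = {
    "w1": ["stem1+SUF_A"],
    "w2": ["stem2+SUF_A", "stem2+SUF_B"],
}


def toy_analyze(word: str) -> List[str]:
    """Return every known analysis for a word, or an empty list."""
    return TOY_ANALYSES.get(word, [])


if __name__ == "__main__":
    print(coverage(toy_analyze, ["w1", "w2", "w3", "w1"]))
    print(accuracy(toy_analyze, [("w1", "stem1+SUF_A"), ("w2", "stem2+SUF_C")]))
```

Coverage only asks whether any analysis is produced, so it can be computed on a large unlabeled corpus, whereas accuracy needs manually labeled words.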

Updated: 2021-09-20