Restoring Arabic vowels through omission-tolerant dictionary lookup
Language Resources and Evaluation (IF 1.7), Pub Date: 2019-04-25, DOI: 10.1007/s10579-019-09464-6
Alexis Amid Neme, Sébastien Paumier

Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, fewer than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring the omitted vowels in speech technologies, little attention has been given to this problem in papers dedicated to written Arabic technologies. In this research, we present Arabic-Unitex, an Arabic language resource, with emphasis on vowel representation and encoding. Specifically, we present two dozen rules formalizing a detailed description of vowel omission in written text. They are typographical rules integrated into large-coverage resources for morphological annotation. For restoring vowels, our resources are capable of identifying words in which the vowels are not shown, as well as words in which the vowels are partially or fully included. By taking these rules into account, our resources are able to compute, for each word form, a list of compatible fully vowelized candidates through omission-tolerant dictionary lookup. In our previous studies, we proposed a straightforward encoding of a taxonomy for verbs (Neme in Proceedings of the international workshop on lexical resources (WoLeR) at ESSLLI, 2011) and broken plurals (Neme and Laporte in Lang Sci, 2013, http://dx.doi.org/10.1016/j.langsci.2013.06.002). While traditional morphology is based on derivational rules, our description is based on inflectional ones. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. The lexicon is built and updated manually and contains 76,000 fully vowelized lemmas. It is then inflected by means of finite-state transducers (FSTs), generating 6 million forms. The coverage of these inflected forms is extended by formalized grammars, which accurately describe agglutinations around a core verb, noun, adjective or preposition. A laptop needs one minute to generate the 6 million inflected forms in a 340-MB flat file, which is compressed in 2 min into 11 MB for fast retrieval. Our program analyzes running text at 5000 words/second (20 pages/second). Based on these comprehensive linguistic resources, we created a spell checker that detects any invalid or misplaced vowel in a fully or partially vowelized form. Finally, our resources provide a lexical coverage of more than 99 percent of the words used in popular newspapers, and restore vowels in words (out of context) simply and efficiently.
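
To make the idea of omission-tolerant lookup concrete, here is a minimal sketch in Python. It is not the Arabic-Unitex implementation, which relies on finite-state transducers and about two dozen typographical rules. It assumes a much simpler rule: a written form matches a fully vowelized candidate when both share the same letters and every diacritic actually written also appears on the same letter in the candidate. The three-entry lexicon and all function names (`split_units`, `skeleton`, `build_index`, `restore`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Arabic diacritics U+064B..U+0652: tanwin forms, fatha, damma, kasra, shadda, sukun.
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))

def split_units(word):
    """Split a word into (letter, set of diacritics attached to it) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a diacritic with no preceding letter is ignored
                letter, marks = units[-1]
                units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units

def skeleton(word):
    """Return the word with every diacritic removed."""
    return "".join(letter for letter, _ in split_units(word))

def compatible(written, candidate):
    """Assumed rule: written diacritics must be a subset of the candidate's, letter by letter."""
    w, c = split_units(written), split_units(candidate)
    if len(w) != len(c):
        return False
    return all(wl == cl and wm <= cm for (wl, wm), (cl, cm) in zip(w, c))

def build_index(vowelized_forms):
    """Index fully vowelized forms by their unvowelized skeleton."""
    index = defaultdict(list)
    for form in vowelized_forms:
        index[skeleton(form)].append(form)
    return index

def restore(written, index):
    """List the fully vowelized candidates compatible with a written form."""
    return [c for c in index.get(skeleton(written), []) if compatible(written, c)]

# Toy lexicon (hypothetical): three vowelizations of the skeleton k-t-b.
K, T, B = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA, DAMMATAN = "\u064E", "\u064F", "\u0650", "\u064C"
KATABA = K + FATHA + T + FATHA + B + FATHA      # "he wrote"
KUTIBA = K + DAMMA + T + KASRA + B + FATHA      # "it was written"
KUTUBUN = K + DAMMA + T + DAMMA + B + DAMMATAN  # "books"
index = build_index([KATABA, KUTIBA, KUTUBUN])

unvowelized = restore(K + T + B, index)         # all three candidates
partial = restore(K + DAMMA + T + B, index)     # only the two starting with "ku"
invalid = restore(K + KASRA + T + B, index)     # no candidate: kasra is misplaced
```

Under this assumed rule, an empty result for a form whose skeleton is in the index corresponds to the spell-checking case mentioned in the abstract: the letters are known, but a written vowel matches no entry.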

Updated: 2019-04-25