Computer Science Review (IF 12.9), Pub Date: 2019-09-25, DOI: 10.1016/j.cosrev.2019.100195
Abdul Jabbar, Saif ul Islam, Shafiq Hussain, Adnan Akhunzada, Manzoor Ilahi
With the advent of the globalization era, Internet-based resources for Urdu are growing in depth and breadth faster than ever, which calls for mechanisms for the computational processing of Urdu text. Information retrieval (IR) systems have become the major tool for seeking varied information on the web. Such systems match the variant forms of a word by reducing them to a common form with a stemmer. Broadly speaking, current Urdu stemmers fall into two major categories: linguistic-based stemmers and statistical stemmers. In this paper, we explain the applications in which stemming is used as a first step and highlight the challenges in Urdu text stemming. This is the first comparative study of state-of-the-art Urdu stemmers, based on distinct features such as the approach used, main idea, limitations, rules or affixes, data set, evaluation criteria, and claimed accuracy. A comparative analysis of state-of-the-art Urdu stemmers is performed using a standard data set. Finally, we outline the relevant research gaps in the literature and suggest recommendations for future research on Urdu text stemming.
Title: Comparative study of Urdu stemmers: approaches and challenges
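To make the linguistic-based category mentioned in the abstract concrete, here is a minimal Python sketch of suffix stripping. It is not taken from any of the stemmers the paper compares: the three-entry suffix list, the `MIN_STEM_LEN` threshold, and the function name `strip_suffix` are assumptions chosen only for illustration.

```python
# Minimal sketch of a rule-based (affix-stripping) stemmer.
# The suffix list is a hypothetical, deliberately tiny sample of
# Urdu plural endings; real linguistic stemmers use far larger rule sets.
SUFFIXES = [
    "\u06CC\u06BA",  # یں : plural ending, as in کتابیں
    "\u0648\u06BA",  # وں : oblique plural ending, as in کتابوں
    "\u0627\u062A",  # ات : Arabic-origin plural ending, as in سوالات
]

# Assumed safeguard: never strip a word down to fewer than two letters.
MIN_STEM_LEN = 2


def strip_suffix(word: str) -> str:
    """Remove the longest matching suffix, or return the word unchanged."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # کتابیں ("books") and کتابوں (oblique plural) should both map to کتاب.
    for w in ["\u06A9\u062A\u0627\u0628\u06CC\u06BA",
              "\u06A9\u062A\u0627\u0628\u0648\u06BA"]:
        print(w, "->", strip_suffix(w))
```

A list this small will wrongly strip words that merely happen to end in one of these letter sequences, which is one reason rule-based stemmers need exception handling and careful evaluation.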