An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis
Information Processing & Management (IF 7.4), Pub Date: 2020-09-16, DOI: 10.1016/j.ipm.2020.102368
Khawar Mehmood, Daryl Essam, Kamran Shafi, Muhammad Kamran Malik

Text normalization is the task of transforming lexically variant words to their canonical forms. Its importance becomes apparent when developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN uses the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: a transliteration-based encoder, a filter module, and a hash-code ranker. The encoder generates all possible hash codes for a single Roman Hindi/Urdu word, the filter module discards the irrelevant codes, and the ranker orders the remaining hash codes by relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate relative to the baseline. TERUN was then extended from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. In addition, the phonetic algorithms and TERUN were intrinsically evaluated using a dataset of 20,000 lexically variant words. The results clearly showed the superiority of TERUN over well-known phonetic algorithms.
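
The three-stage shape described in the abstract (generate every candidate code for a word, filter the candidates, rank what remains) can be illustrated with a toy sketch. This is not TERUN itself: the abstract does not give the encoding rules, the filtering criterion, or the ranking function, so the character table `CANDIDATES`, the code lexicon `LEXICON`, the "longest code, then highest count" ranking, and all function names below are assumptions invented purely for illustration.

```python
from itertools import groupby, product

# Hypothetical mapping from Roman letters to coarse code symbols.
# Vowels may be dropped ("") or kept, which is what creates several
# candidate codes per word. Invented for this sketch, not from the paper.
CANDIDATES = {
    "a": ("", "A"),
    "e": ("", "I"),
    "i": ("", "I"),
    "o": ("", "U"),
    "u": ("", "U"),
    "q": ("K",),
    "k": ("K",),
}

# Hypothetical lexicon: code -> (canonical spelling, corpus count).
LEXICON = {
    "BHT": ("bohat", 120),
    "KTAB": ("kitab", 40),
}


def encode(word):
    """Stage 1 (assumed): return every candidate code for a word."""
    # Collapse repeated letters so "kitaab" and "kitab" share candidates.
    letters = [ch for ch, _ in groupby(word.lower()) if ch.isalpha()]
    options = [CANDIDATES.get(ch, (ch.upper(),)) for ch in letters]
    return {"".join(combo) for combo in product(*options)}


def filter_codes(codes):
    """Stage 2 (assumed): keep only codes known to the lexicon."""
    return [code for code in codes if code in LEXICON]


def rank_codes(codes):
    """Stage 3 (assumed): prefer longer codes, then more frequent words."""
    return sorted(codes, key=lambda c: (-len(c), -LEXICON[c][1], c))


def normalize(word):
    """Map a word to a canonical form, or return it unchanged."""
    ranked = rank_codes(filter_codes(encode(word)))
    return LEXICON[ranked[0]][0] if ranked else word


if __name__ == "__main__":
    for variant in ("bohat", "bahut", "bohot", "kitaab", "xyz"):
        print(variant, "->", normalize(variant))
```

In this toy setup the spelling variants "bohat", "bahut", and "bohot" share the candidate code `BHT`, so they collapse to one canonical form, while a word with no known code is left unchanged.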




Updated: 2020-09-16