


A clustering framework for lexical normalization of Roman Urdu
Natural Language Engineering (IF 2.5), Pub Date: 2020-06-10, DOI: 10.1017/s1351324920000285
Abdul Rafae Khan, Asim Karim, Hassan Sajjad, Faisal Kamiran, Jia Xu

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over the respective alternative algorithms within our clustering framework for the lexical normalization of Roman Urdu.
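To make the pipeline described in the abstract more concrete, the following is a minimal sketch of its general shape, not the authors' implementation. It assumes a toy phonetic key in place of UrduPhone, Python's `difflib.SequenceMatcher` in place of the string matching component, fixed equal weights in place of learned ones, and a simple threshold-based medoid clustering loop in place of Lex-Var. Contextual features are omitted. All function names, the weights, and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical stand-in for a phonetic encoder (this is NOT UrduPhone).

    Keeps the first letter, drops later vowels, and collapses a letter that
    repeats the previously kept letter.
    """
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phonetic: float = 0.5, w_string: float = 0.5) -> float:
    """Weighted sum of a phonetic feature and a string feature (assumed weights)."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster_variants(words, threshold: float = 0.7, max_iter: int = 10):
    """Threshold-based medoid clustering (an illustrative sketch, not Lex-Var).

    A word joins its most similar medoid if the similarity reaches the
    threshold; otherwise it starts a new cluster. Medoids are then re-picked
    and the assignment repeats until the medoids stop changing.
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m), default=None)
            if best is not None and similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                clusters[w] = [w]
        # Re-pick each medoid as the member most similar to its cluster overall.
        updated = {}
        for members in clusters.values():
            if not members:
                continue
            medoid = max(members, key=lambda c: sum(similarity(c, o) for o in members))
            updated[medoid] = members
        clusters = updated
        if list(clusters) == medoids:
            break
        medoids = list(clusters)
    return clusters
```

Calling `cluster_variants(["acha", "achha", "nahi", "nhi"])` with these illustrative spellings returns a dictionary that maps each medoid to its group of variants. Raising `threshold` yields more, tighter clusters, and lowering it merges more spellings, which is the trade-off the abstract attributes to the similarity threshold.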


Updated: 2020-06-10
