Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study, Natural Language Engineering


Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study
Natural Language Engineering (IF 2.3), Pub Date: 2020-05-05, DOI: 10.1017/s135132492000011x
Taghreed Tarmom, William Teahan, Eric Atwell, Mohammad Ammar Alsalka

Code-switching in online communication, where a writer switches between two or more languages within a text, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To address this challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare a prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis (WEKA), working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching between the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved higher accuracy than SMO in the first experiment (99.8% versus 97.5%) and in the second (97.8% versus 80.7%). In the third experiment, PPM achieved lower accuracy than SMO (53.2% versus 60.2%). Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA, as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model was built from news websites and so may not represent MSA text on Facebook well. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
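The idea behind compression-based classification can be illustrated with a minimal Python sketch: train one character model per language, then assign a segment to the language whose model would encode it in the fewest bits. This is not the Tawa implementation and not PPM itself; it assumes a fixed order-2 character context with add-one smoothing as a simplified stand-in, and the class name `CharModel`, the function `classify`, and the tiny training strings are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: a simplified stand-in for PPM, which blends several
    context orders through escape probabilities.
    """

    def __init__(self, text, order=2):
        self.order = order
        self.alphabet = set(text)
        self.counts = defaultdict(Counter)  # context -> next-character counts
        padded = " " * order + text
        for i in range(order, len(padded)):
            self.counts[padded[i - order:i]][padded[i]] += 1

    def bits(self, text, alphabet_size):
        """Approximate code length of `text` in bits under this model."""
        padded = " " * self.order + text
        total = 0.0
        for i in range(self.order, len(padded)):
            ctx = self.counts.get(padded[i - self.order:i], Counter())
            p = (ctx[padded[i]] + 1) / (sum(ctx.values()) + alphabet_size)
            total -= math.log2(p)
        return total


def classify(segment, models):
    """Return the label whose model gives the shortest code length."""
    alphabet = set(segment)
    for model in models.values():
        alphabet |= model.alphabet
    size = len(alphabet)
    return min(models, key=lambda label: models[label].bits(segment, size))


# Toy training data (hypothetical; real corpora would be far larger).
models = {
    "english": CharModel("the cat sat on the mat and the dog ran to the park"),
    "egyptian": CharModel("انا عايز اروح البيت دلوقتي"),
}

for segment in ["the dog sat", "عايز اروح"]:
    print(segment, "->", classify(segment, models))
```

With models this small the sketch only separates text in different scripts, which mirrors the abstract's observation that Arabic versus English is the easy case; distinguishing dialects from MSA would need much larger training corpora and a stronger model.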


Updated: 2020-05-05
