Native Language Identification of Fluent and Advanced Non-Native Writers
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2) Pub Date: 2020-04-11, DOI: 10.1145/3383202
Raheem Sarwar, Attapol T. Rutherford, Saeed-Ul Hassan, Thanawin Rakthanmanon, Sarana Nutanong

Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language usage patterns within a text sample, and (iii) lack any outlier handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
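The abstract outlines a three-step pipeline: represent each text sample as a point set, retrieve the top-k stylistically similar text samples (SSTs) from a labelled corpus, and apply a probabilistic k-NN vote over those SSTs. The sketch below illustrates that pipeline under stated assumptions only: the sentence-chunking scheme, the three toy stylometric features (average word length, function-word rate, comma rate), the symmetric average nearest-neighbour set distance, the inverse-distance vote weighting, and the toy corpus labels are all illustrative choices, not the feature space or similarity measure used in the paper.

```python
# Minimal, illustrative sketch of the point-set + top-k SST + probabilistic k-NN idea.
# All concrete choices below are assumptions for demonstration only.
import re
from collections import Counter

import numpy as np

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "was"}


def sample_to_point_set(text, chunk_size=3):
    """Represent one text sample as a set of points (one per chunk of sentences),
    so that variation of usage patterns *within* the sample is preserved."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    points = []
    for i in range(0, len(sentences), chunk_size):
        chunk = " ".join(sentences[i:i + chunk_size])
        tokens = re.findall(r"[a-zA-Z']+", chunk.lower())
        if not tokens:
            continue
        counts = Counter(tokens)
        points.append([
            sum(len(t) for t in tokens) / len(tokens),              # average word length
            sum(counts[w] for w in FUNCTION_WORDS) / len(tokens),   # function-word rate
            chunk.count(",") / len(tokens),                         # comma rate
        ])
    return np.array(points)


def point_set_distance(ps_a, ps_b):
    """Symmetric average nearest-neighbour distance between two point sets
    (one plausible set distance; not necessarily the one used in the paper)."""
    d = np.linalg.norm(ps_a[:, None, :] - ps_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())


def predict_native_language(query_text, corpus, k=3):
    """Retrieve the top-k stylistically similar text samples (SSTs) and apply a
    probabilistic k-NN vote weighted by inverse distance."""
    query_ps = sample_to_point_set(query_text)
    top_k = sorted(
        ((point_set_distance(query_ps, sample_to_point_set(text)), label)
         for text, label in corpus),
        key=lambda pair: pair[0],
    )[:k]
    votes = Counter()
    for dist, label in top_k:
        votes[label] += 1.0 / (dist + 1e-9)      # closer SSTs carry more weight
    total = sum(votes.values())
    probabilities = {lang: v / total for lang, v in votes.items()}
    return max(probabilities, key=probabilities.get), probabilities


if __name__ == "__main__":
    # Toy UGC corpus: English samples labelled with the author's (hypothetical) L1.
    corpus = [
        ("I am thinking that this forum is very good. The people write much. I read always.", "German"),
        ("Honestly, I adore this forum, truly, it is charming. One reads, one replies, one enjoys.", "French"),
        ("This forum is very good, I am reading it every day. The posts are long. I like that.", "German"),
        ("It is, frankly, a delight. The discussions, so lively, amuse me. I reply often, with pleasure.", "French"),
    ]
    query = "Frankly, this thread is lovely, so charming. One comments, one laughs. I enjoy it, truly."
    print(predict_native_language(query, corpus, k=3))
```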
