The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification, International Journal of Intelligent Systems



International Journal of Intelligent Systems (IF 5.0), Pub Date: 2021-11-17, DOI: 10.1002/int.22746
Mihaela Găman¹, Radu Tudor Ionescu¹,²


Motivated by the seemingly high accuracy levels of machine learning (ML) models in Moldavian versus Romanian dialect identification and the increasing research interest in this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 evaluation campaign. The shared task included two subtask types: one that consisted of discriminating between the Moldavian and Romanian dialects and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores; for example, the top model for Moldavian versus Romanian dialect identification obtained a macro-$F_{1}$ score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared with ML models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, for example, when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles vs. tweets). We also analyze the most discriminative features of the best performing models, providing some explanations behind the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the ML performance on the MRC shared task can be improved through an ensemble based on stacking.
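For readers unfamiliar with the metric quoted in the abstract, macro-$F_{1}$ is the unweighted mean of the per-class $F_{1}$ scores, so both dialect classes count equally regardless of how many samples each has. The following is a minimal sketch only: the helper name `macro_f1` and the toy labels `MD` and `RO` are invented for illustration, and this is not the evaluation code used in the shared task.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration (not data from the paper).
y_true = ["MD", "MD", "MD", "RO", "RO", "RO"]
y_pred = ["MD", "MD", "RO", "RO", "RO", "RO"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-class scores are 0.8 for `MD` and 6/7 for `RO`, and the macro average is their plain mean.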


Updated: 2021-11-17
