SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness
arXiv - CS - Computation and Language. Pub Date: 2020-09-21, DOI: arxiv-2009.10195
Nathan Ng, Kyunghyun Cho, Marzyeh Ghassemi

Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.
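To make the corrupt-and-reconstruct idea concrete, the following is a minimal sketch of the augmentation loop, assuming a HuggingFace masked language model (bert-base-uncased). The masking rate, the sampling scheme, and the helper name ssmba_augment are illustrative assumptions, not the paper's exact configuration.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def ssmba_augment(sentence, mask_prob=0.15, num_samples=4):
    # Corrupt by masking random tokens, then reconstruct by sampling
    # replacements from the masked LM's predicted distribution.
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    # never corrupt special tokens like [CLS]/[SEP]
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    augmented = []
    for _ in range(num_samples):
        mask = (torch.rand(input_ids.shape) < mask_prob) & ~special
        if not mask.any():  # nothing was masked: keep the original
            augmented.append(sentence)
            continue
        corrupted = input_ids.clone()
        corrupted[mask] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(corrupted).logits           # (1, seq_len, vocab)
        probs = torch.softmax(logits[mask], dim=-1)    # one row per masked slot
        sampled = torch.multinomial(probs, 1).squeeze(-1)
        reconstructed = corrupted.clone()
        reconstructed[mask] = sampled
        augmented.append(
            tokenizer.decode(reconstructed[0], skip_special_tokens=True)
        )
    return augmented

# Each call yields noisy paraphrases that stay near the original sentence.
print(ssmba_augment("the movie was surprisingly good"))

Sampling from the model's distribution rather than taking the argmax keeps the reconstructions diverse, so repeated draws explore a neighborhood of the original example instead of collapsing to a single nearest neighbor.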

Updated: 2020-10-06