Fast Cross-domain Data Augmentation through Neural Sentence Editing
arXiv - CS - Computation and Language Pub Date : 2020-03-23 , DOI: arxiv-2003.10254
Guillaume Raille, Sandra Djambazovska, Claudiu Musat

Data augmentation promises to alleviate data scarcity. It matters most when the initial data is in short supply, which is also where existing methods struggle: with little data, learning the full data distribution is impossible. For natural language, sentence editing offers a solution, relying on small but meaningful changes to the original sentences. Learning which changes are meaningful, however, also requires large amounts of training data. We therefore aim to learn this in a source domain where data is abundant and apply it in a different, target domain where data is scarce: cross-domain augmentation. We create the Edit-transformer, a Transformer-based sentence editor that is significantly faster than the state of the art and also works cross-domain. We argue that, due to its structure, the Edit-transformer is better suited to cross-domain settings than its edit-based predecessors. We show this performance gap on Yelp-Wikipedia domain pairs. Finally, we show that, thanks to this cross-domain advantage, the Edit-transformer yields meaningful performance gains on several downstream tasks.
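To make the idea of edit-based augmentation concrete, here is a minimal toy sketch. It is not the paper's Edit-transformer (which *learns* which edits are meaningful from a source domain); it only illustrates the underlying notion of augmenting text through small edits to an original sentence, using hypothetical random delete/swap operations:

```python
import random

def edit_sentence(sentence, n_edits=1, rng=None):
    """Toy edit-based augmenter: applies small random edits
    (word deletion or adjacent-word swap) to a sentence.
    Illustrative only; the Edit-transformer instead learns
    meaningful edits with a Transformer."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    words = sentence.split()
    for _ in range(n_edits):
        if len(words) < 2:
            break
        op = rng.choice(["delete", "swap"])
        i = rng.randrange(len(words) - 1)
        if op == "delete":
            words.pop(i)
        else:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

# Augment a review-style sentence with two small edits.
print(edit_sentence("the food was absolutely delicious and fresh", n_edits=2))
```

Random edits like these preserve most of the sentence but can destroy meaning; the paper's contribution is learning, in a data-rich source domain, which edits remain meaningful, so they transfer to a data-poor target domain.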

Updated: 2020-03-24