Translation Transformers Rediscover Inherent Data Domains
arXiv - CS - Computation and Language · Pub Date: 2021-09-16 · arXiv: 2109.07864
Maksym Del, Elizaveta Korotkova, Mark Fishel

Many works have proposed methods to improve the performance of Neural Machine Translation (NMT) models in domain and multi-domain adaptation scenarios. However, an understanding of how NMT baselines represent text-domain information internally is still lacking. Here we analyze the sentence representations learned by NMT Transformers and show that they explicitly encode text-domain information, even though the models see only input sentences without domain labels. Furthermore, we show that this internal information is enough to cluster sentences by their underlying domains without supervision. We show that NMT models produce clusters better aligned with the actual domains than pre-trained language models (LMs) do. Notably, when computed at the document level, NMT cluster-to-domain correspondence nears 100%. We combine these findings with an approach to NMT domain adaptation that uses automatically extracted domains. Whereas previous work relied on external LMs for text clustering, we propose re-using the NMT model itself as a source of unsupervised clusters. We perform an extensive experimental study comparing the two approaches across two data scenarios, three language pairs, and both sentence-level and document-level clustering, showing performance equal or significantly superior to that of LMs.
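The pipeline the abstract describes (pool NMT encoder states into sentence vectors, cluster them without supervision, then check how well clusters match the true domains) can be illustrated with a minimal sketch. The MarianMT checkpoint, mean-pooling, and k-means below are illustrative assumptions, not necessarily the authors' exact setup:

```python
# Minimal sketch: cluster NMT encoder sentence representations and
# score cluster-to-domain correspondence. The checkpoint, pooling,
# and clustering choices here are assumptions for illustration.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-de-en"  # stand-in NMT checkpoint
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME).eval()

def encode(sentences):
    """Mean-pool the NMT encoder's hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model.get_encoder()(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding
        pooled = (hidden * mask).sum(1) / mask.sum(1)
    return pooled.numpy()

def cluster_to_domain_purity(vectors, domains, n_domains):
    """Unsupervised k-means; purity = fraction of sentences whose cluster's
    majority domain matches their own. `domains` holds integer labels."""
    clusters = KMeans(n_clusters=n_domains, n_init=10).fit_predict(vectors)
    correct = sum(
        np.bincount([domains[i] for i in np.where(clusters == c)[0]]).max()
        for c in range(n_domains)
        if (clusters == c).any()
    )
    return correct / len(domains)
```

Running the same routine on mean-pooled document vectors corresponds to the document-level setting in which the abstract reports near-100% cluster-to-domain correspondence.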

Updated: 2021-09-17