TED: Triple Supervision Decouples End-to-end Speech-to-text Translation
arXiv - CS - Computation and Language. Pub Date: 2020-09-21, DOI: arXiv-2009.09704
Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

An end-to-end speech-to-text translation (ST) model takes audio in a source language and outputs text in a target language. Inspired by neuroscience, where humans rely on separate perception and cognitive systems to process different kinds of information, we propose TED (Transducer-Encoder-Decoder), a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the target-sentence translation loss, TED includes two auxiliary supervising signals: one guides the acoustic transducer, which extracts acoustic features from the input, and the other guides the semantic encoder, which extracts semantic features relevant to the source transcription text. Our method achieves state-of-the-art performance on both English-French and English-German speech translation benchmarks.
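To make the triple-supervision idea concrete, below is a minimal sketch (not the authors' code) of how the three signals described in the abstract could be combined into one training objective: a main translation loss plus auxiliary losses on the acoustic transducer and the semantic encoder. The specific loss choices (CTC for the transducer, cross-entropy elsewhere), the tensor shapes, and the weights lambda_acoustic and lambda_semantic are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def triple_supervision_loss(
    transducer_logits,    # (T, B, V_src) frame-level logits from the acoustic transducer
    transcript,           # (B, S) source transcription token ids
    transcript_lens,      # (B,) lengths of each transcription
    input_lens,           # (B,) lengths of each audio feature sequence
    encoder_logits,       # (B, S, V_src) semantic-encoder predictions of the transcription
    decoder_logits,       # (B, U, V_tgt) decoder predictions of the translation
    translation,          # (B, U) target translation token ids
    lambda_acoustic=0.3,  # assumed auxiliary weight
    lambda_semantic=0.3,  # assumed auxiliary weight
):
    # Auxiliary signal 1: supervise the acoustic transducer with the source
    # transcription (a CTC-style alignment loss is assumed here).
    log_probs = F.log_softmax(transducer_logits, dim=-1)
    loss_acoustic = F.ctc_loss(log_probs, transcript, input_lens, transcript_lens)

    # Auxiliary signal 2: supervise the semantic encoder so its features are
    # predictive of the source transcription text.
    loss_semantic = F.cross_entropy(
        encoder_logits.reshape(-1, encoder_logits.size(-1)), transcript.reshape(-1)
    )

    # Main signal: target-sentence translation loss from the decoder.
    loss_translation = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)), translation.reshape(-1)
    )

    return loss_translation + lambda_acoustic * loss_acoustic + lambda_semantic * loss_semantic
```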

Updated: 2020-09-22