Serialized Output Training for End-to-End Overlapped Speech Recognition
arXiv - CS - Sound. Pub Date: 2020-03-28, DOI: arxiv-2003.12687
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers, as in permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and decoder modules take care of producing multiple transcriptions from overlapped speech. SOT has two advantages over PIT: (1) no limitation on the maximum number of speakers, and (2) the ability to model dependencies among the outputs for different speakers. We also propose a simple trick that allows SOT to be executed in $O(S)$, where $S$ is the number of speakers in the training sample, by using the start times of the constituent source utterances. Experimental results on the LibriSpeech corpus show that SOT models can transcribe overlapped speech with variable numbers of speakers significantly better than PIT-based models. We also show that SOT models can accurately count the number of speakers in the input audio.

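As a rough illustration of the serialized-reference idea (a minimal sketch, not the authors' implementation: the function name, the speaker-change token string, and the end-of-sentence token below are assumptions), the reference transcriptions of all speakers in a mixture can be concatenated in order of their utterance start times, separated by a speaker-change symbol. Fixing the order by start time avoids evaluating all $S!$ speaker permutations, which is the trick that keeps label construction cheap.

```python
from typing import List, Tuple

def build_sot_reference(
    utterances: List[Tuple[float, List[str]]],
    sc_token: str = "<sc>",    # assumed name for the speaker-change symbol
    eos_token: str = "<eos>",  # assumed name for the end-of-sentence symbol
) -> List[str]:
    """Serialize multi-speaker reference transcriptions into one token sequence.

    Each utterance is (start_time_in_seconds, token_list). Sorting by start
    time gives a first-in-first-out ordering, so no permutation search over
    the S speakers is needed when creating the training label.
    """
    ordered = sorted(utterances, key=lambda u: u[0])  # order speakers by start time
    serialized: List[str] = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            serialized.append(sc_token)  # mark the change to the next speaker
        serialized.extend(tokens)
    serialized.append(eos_token)  # single end-of-sentence for the whole mixture
    return serialized

# Example: two overlapped speakers; the second starts earlier, so it comes first.
refs = [
    (1.3, ["hello", "how", "are", "you"]),
    (0.0, ["i", "am", "fine"]),
]
print(build_sot_reference(refs))
# ['i', 'am', 'fine', '<sc>', 'hello', 'how', 'are', 'you', '<eos>']
```

A single attention-based decoder trained on such targets then emits one speaker's transcription after another from the overlapped input, and the number of emitted speaker-change symbols implicitly counts the speakers.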
Updated: 2020-08-11