Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

arXiv - CS - Sound. Pub Date: 2021-01-06, DOI: arxiv-2101.01853
Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition, and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to mismatch between the training and testing conditions, and it has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than the samples seen during training. In this work, we first apply to our E2E SA-ASR task a known decoding technique that was developed to perform single-speaker ASR on long-form audio. Then, we propose a novel method using a sequence-to-sequence model, called the hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments that are extracted from the original long-form input, and it outputs a single fused hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with the conventional decoding methods. Experiments using the LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
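
The data flow described in the abstract (cut the long recording into short segments, decode each one, then fuse the per-segment hypotheses) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the window and hop lengths, and the `<sep>` boundary token are hypothetical. The learned sequence-to-sequence stitcher itself is not shown; `naive_merge` is only a simple word-overlap baseline standing in for the fusion step, applied to the hypotheses of a single speaker.

```python
from typing import List, Tuple


def sliding_windows(total: float, win: float, hop: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of fixed-length windows covering [0, total].

    win and hop are illustrative parameters, not values from the paper.
    """
    if win <= 0 or hop <= 0:
        raise ValueError("win and hop must be positive")
    windows = []
    start = 0.0
    while True:
        end = min(start + win, total)
        windows.append((start, end))
        if end >= total:
            break
        start += hop
    return windows


def stitcher_input(hyps: List[str], sep: str = "<sep>") -> List[str]:
    """Concatenate per-window word sequences, marking window boundaries.

    The separator token is a hypothetical placeholder for whatever
    boundary symbol a learned stitcher would be trained with.
    """
    tokens: List[str] = []
    for i, hyp in enumerate(hyps):
        if i > 0:
            tokens.append(sep)
        tokens.extend(hyp.split())
    return tokens


def naive_merge(hyps: List[str]) -> List[str]:
    """Baseline fusion (not the learned stitcher): drop the longest exact
    word overlap between the merged text so far and the next window."""
    merged: List[str] = []
    for hyp in hyps:
        words = hyp.split()
        k = min(len(merged), len(words))
        while k > 0 and merged[-k:] != words[:k]:
            k -= 1
        merged.extend(words[k:])
    return merged
```

For example, `sliding_windows(10, 4, 2)` yields four windows with 50% overlap, and `naive_merge(["a b c", "b c d", "d e"])` yields the words `a b c d e`. An exact-match merge of this kind breaks as soon as the overlapping regions are recognized differently in adjacent windows, which is the situation a trained fusion model is meant to handle.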


Updated: 2021-01-07
