An End-to-end Architecture of Online Multi-channel Speech Separation
arXiv - CS - Sound Pub Date : 2020-09-07 , DOI: arxiv-2009.03141
Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie

Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system called unmixing, fixed-beamformer and extraction (UFE), which was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way through, an attentional selection module is proposed, in which an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.
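
The attentional selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no architectural details, so the function names, the toy feature vectors, and the hand-set scores below are all hypothetical. The sketch assumes that a separately developed pipeline would pick one beamformer output with a hard argmax, which blocks gradients, and contrasts that with a softmax-weighted blend over all beamformer outputs, which stays differentiable with respect to the scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(beam_outputs, scores):
    """Blend per-beam feature vectors using softmax attention weights.

    beam_outputs: list of B feature vectors, one per fixed beamformer
                  (toy data here, not real beamformed spectra).
    scores:       list of B unnormalised scores. Hypothetical: in a trained
                  model these would come from a learnable network.
    """
    weights = softmax(scores)
    dim = len(beam_outputs[0])
    blended = [
        sum(w * beam[d] for w, beam in zip(weights, beam_outputs))
        for d in range(dim)
    ]
    return weights, blended

def hard_select(beam_outputs, scores):
    """Assumed non-differentiable baseline: keep only the best-scoring beam."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return beam_outputs[best]

# Three toy beams with two-dimensional features; the first beam scores highest.
beams = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [2.0, 0.0, 0.0]
weights, blended = soft_select(beams, scores)
```

With these toy inputs the soft selection leans towards the first beam while still mixing in the others, whereas the hard selection returns the first beam unchanged.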

Updated: 2020-09-08