Dual-Path Modeling With Memory Embedding Model for Continuous Speech Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing



IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1), Pub Date: 4-7-2022, DOI: 10.1109/taslp.2022.3165712
Chenda Li ₁ , Zhuo Chen ₂ , Yanmin Qian ₁


Continuous speech separation (CSS) aims at separating overlap-free targets from a long, partially overlapped recording. Though it has shown promising results, the original CSS framework does not consider cross-window information and long-span dependency. To alleviate these limitations, this work introduces two novel methods that capture long-span knowledge implicitly and explicitly, respectively. First, we apply the dual-path (DP) modeling architecture to the CSS framework, where within-window and across-window information are jointly modeled by alternately stacked local and global processing modules. Second, to further capture the long-span dependency, we introduce a memory-based model for CSS. An additional memory pool is designed to extract an embedding from each small window, and inter-window communication is established on top of the memory embedding pool through an attention mechanism. This memory-based model can precisely control what information is transferred across windows, leading to both improved modeling capacity and interpretability. Experimental results on the LibriCSS dataset show that both strategies capture the long-span information of continuous speech well and significantly improve system performance. Moreover, further improvements are observed when the two methods are integrated.
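The two ideas in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' model: the function names, the non-overlapping windows, the parameter-free attention, and the mean-pooled window embedding are all assumptions made for illustration, and the published system uses learned layers throughout. The sketch only shows the data flow: alternating within-window and across-window mixing, followed by attention over a pool holding one embedding per window.

```python
import numpy as np

def split_windows(x, win):
    """Split a (T, D) sequence into (N, win, D) windows, zero-padding the tail.
    Assumption: non-overlapping windows, chosen here only for simplicity."""
    T, D = x.shape
    n = -(-T // win)  # ceiling division
    padded = np.zeros((n * win, D), dtype=x.dtype)
    padded[:T] = x
    return padded.reshape(n, win, D)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over axis -2, with no learned weights."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def dual_path_block(w):
    """One local-then-global pass over windows w of shape (N, K, D)."""
    # Local path: frames attend to other frames inside the same window.
    w = w + self_attention(w)
    # Global path: each frame position attends across all windows.
    g = np.swapaxes(w, 0, 1)  # (K, N, D)
    g = g + self_attention(g)
    return np.swapaxes(g, 0, 1)  # back to (N, K, D)

def memory_exchange(w):
    """Build a pool with one embedding per window, attend across the pool,
    and add the resulting context back to every frame of each window.
    Assumption: mean pooling stands in for a learned embedding extractor."""
    pool = w.mean(axis=1)  # (N, D)
    d = pool.shape[-1]
    weights = softmax(pool @ pool.T / np.sqrt(d), axis=-1)  # (N, N)
    context = weights @ pool  # (N, D)
    return w + context[:, None, :], weights
```

Calling `split_windows` on a `(50, 8)` array with `win=16` yields four windows of 16 frames each. The `weights` matrix returned by `memory_exchange` has one row per window, and each row sums to 1, so it can be read directly as how much each window draws on every other window. That matrix is the kind of quantity the abstract's interpretability claim refers to.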


Updated: 2024-08-28
