Speaker activity driven neural speech extraction
arXiv - CS - Sound, Pub Date: 2021-01-14, DOI: arxiv-2101.05516
Marc Delcroix, Katerina Zmolikova, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani

Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25%.
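To make the idea of a speaker activity clue concrete, the sketch below turns diarization output for one speaker into a frame-level 0/1 activity sequence, which is one plausible form such a clue could take. It is an illustration only: the abstract does not specify how ADEnet represents or consumes speaker activity, and the function name `segments_to_activity`, the (start, end) segment format in seconds, and the 10 ms frame shift are all assumptions introduced here.

```python
from typing import List, Tuple


def segments_to_activity(
    segments: List[Tuple[float, float]],
    num_frames: int,
    frame_shift: float = 0.01,  # assumed 10 ms frame shift, not from the paper
) -> List[int]:
    """Convert (start_sec, end_sec) segments into a per-frame 0/1 activity list.

    Hypothetical helper: a frame is marked active if it falls inside any
    segment attributed to the target speaker.
    """
    activity = [0] * num_frames
    for start, end in segments:
        # Round to the nearest frame index and clip to the valid range.
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            activity[t] = 1
    return activity


if __name__ == "__main__":
    # Hypothetical diarization output for one speaker, in seconds.
    target_segments = [(0.0, 0.02), (0.05, 0.07)]
    print(segments_to_activity(target_segments, num_frames=10))
    # prints [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
```

A sequence like this would be computed per speaker, so overlapping speech simply shows up as frames where more than one speaker's sequence is 1.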

Updated: 2021-01-15