AvaTr: One-Shot Speaker Extraction with Transformers
arXiv - CS - Sound Pub Date : 2021-05-03 , DOI: arxiv-2105.00609
Shell Xu Hu, Md Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias

To extract the voice of a target speaker mixed with a variety of other sounds, such as white or ambient noise and the voices of interfering speakers, we extend the Transformer network to attend to the information most relevant to the target speaker, given the characteristics of his or her voice as a form of contextual information. The idea has a natural interpretation in terms of selective attention theory. Specifically, we propose two models that incorporate the voice characteristics into the Transformer, based on different insights about where the feature selection should take place. Both models yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating the speech of novel speakers not seen during training.
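The core idea of conditioning attention on a target speaker's voice characteristics can be sketched as a single attention head in which a speaker embedding biases the queries. This is a minimal illustrative sketch, not the paper's actual architecture: the additive conditioning scheme, the function name, and all shapes here are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def speaker_conditioned_attention(mix_feats, spk_emb, Wq, Wk, Wv):
    """One attention head over the mixture's frame features.

    The target speaker's embedding is added to every frame before the
    query projection, so frames matching the target voice tend to score
    higher. (Conditioning by addition is an assumption for this sketch;
    the paper proposes two specific ways to inject voice characteristics.)
    """
    q = (mix_feats + spk_emb) @ Wq   # queries biased toward the target voice
    k = mix_feats @ Wk               # keys from the raw mixture features
    v = mix_feats @ Wv               # values from the raw mixture features
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # per-frame attention weights
    return attn @ v                  # re-weighted features for the target

# Tiny example: 5 mixture frames, 8-dim features, random projections.
rng = np.random.default_rng(0)
T, d = 5, 8
mix_feats = rng.normal(size=(T, d))
spk_emb = rng.normal(size=(d,))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = speaker_conditioned_attention(mix_feats, spk_emb, Wq, Wk, Wv)
```

The output has one re-weighted feature vector per input frame; a full model would stack such layers and decode the result back to a waveform.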

Updated: 2021-05-04