Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
arXiv - CS - Sound, Pub Date: 2020-04-07, DOI: arxiv-2004.03194
Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim

Currently, the most widely used approach to speaker verification is deep speaker embedding learning. In this approach, a speaker embedding vector is obtained by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which uses multi-scale features from different layers of the feature extractor, was recently introduced and shows superior performance on variable-duration utterances. To increase robustness to utterances of arbitrary duration, this paper improves MSA with a feature pyramid module. The module enhances the speaker-discriminative information in features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings from the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves on previous MSA methods while using fewer parameters. It also outperforms state-of-the-art approaches on both short and long utterances.
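The top-down pathway and lateral connections described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each layer's output is a plain list of frames (time steps by channels), replaces learned convolutions with fixed projection matrices, and uses mean pooling as a stand-in for whatever pooling the paper applies. All function names (`lateral`, `upsample`, `top_down`, `embed`) are hypothetical.

```python
from typing import List

Feature = List[List[float]]  # time steps x channels


def lateral(feature: Feature, weights: List[List[float]]) -> Feature:
    """Lateral connection: project every frame with a channels_in x channels_out matrix."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(frame, col)) for col in columns]
            for frame in feature]


def upsample(feature: Feature, target_len: int) -> Feature:
    """Nearest-neighbour upsampling along the time axis."""
    src_len = len(feature)
    return [feature[min(t * src_len // target_len, src_len - 1)]
            for t in range(target_len)]


def top_down(features: List[Feature],
             projections: List[List[List[float]]]) -> List[Feature]:
    """Enhance features ordered from shallow (fine time scale) to deep (coarse)."""
    laterals = [lateral(f, w) for f, w in zip(features, projections)]
    enhanced = [laterals[-1]]  # start from the deepest layer
    for lat in reversed(laterals[:-1]):
        up = upsample(enhanced[0], len(lat))
        # Merge the upsampled deeper map into the shallower one by addition.
        merged = [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(lat, up)]
        enhanced.insert(0, merged)
    return enhanced


def mean_pool(feature: Feature) -> List[float]:
    """Average over time so utterances of any duration give a fixed-size vector."""
    n = len(feature)
    return [sum(col) / n for col in zip(*feature)]


def embed(features: List[Feature],
          projections: List[List[List[float]]]) -> List[float]:
    """Pool each enhanced scale and concatenate the results into one embedding."""
    embedding: List[float] = []
    for f in top_down(features, projections):
        embedding.extend(mean_pool(f))
    return embedding


# Toy input: a shallow layer with 4 frames x 2 channels, a deep layer with 2 frames x 3 channels.
shallow = [[1, 0], [0, 1], [1, 1], [0, 0]]
deep = [[1, 1, 1], [2, 2, 2]]
projections = [
    [[1, 0], [0, 1]],          # shallow: identity, 2 -> 2 channels
    [[1, 0], [0, 1], [0, 0]],  # deep: 3 -> 2 channels
]
embedding = embed([shallow, deep], projections)
```

Because pooling averages over time, the embedding length depends only on the number of scales and projected channels (here 2 x 2 = 4), not on the utterance duration.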

Updated: 2020-11-09