Multichannel CRNN for Speaker Counting: an Analysis of Performance
arXiv - CS - Sound, Pub Date: 2021-01-06, DOI: arxiv-2101.01977
Pierre-Amaury Grumiaux, Srdan Kitic, Laurent Girin, Alexandre Guérin

Speaker counting is the task of estimating the number of people who are speaking simultaneously in an audio recording. For several audio processing tasks, such as speaker diarization, separation, localization, and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. In previous work, we addressed the speaker counting problem with a multichannel convolutional recurrent neural network that produces an estimate at a short-term frame resolution. In this work, we show that, for a given frame, there is an optimal position in the input sequence for best prediction accuracy. We empirically demonstrate the link between that optimal position, the length of the input sequence, and the size of the convolutional filters.
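
As a rough illustration of what "an optimal position in the input sequence" could mean in practice, the following minimal sketch computes the speaker-count accuracy separately for each frame position across a set of input sequences and picks the position with the highest accuracy. The function names (`accuracy_per_position`, `best_position`) and the toy data are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its evaluation protocol or results.

```python
from typing import List, Sequence


def accuracy_per_position(
    predictions: Sequence[Sequence[int]],
    labels: Sequence[Sequence[int]],
) -> List[float]:
    """Fraction of correctly counted frames at each position of the sequence.

    predictions[i][t] is the predicted number of speakers for frame t of
    sequence i; labels[i][t] is the corresponding ground-truth count.
    All sequences are assumed to have the same length.
    """
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")

    seq_len = len(predictions[0])
    correct = [0] * seq_len
    for pred_seq, label_seq in zip(predictions, labels):
        if len(pred_seq) != seq_len or len(label_seq) != seq_len:
            raise ValueError("all sequences must have the same length")
        for t, (pred, label) in enumerate(zip(pred_seq, label_seq)):
            if pred == label:
                correct[t] += 1

    return [count / len(predictions) for count in correct]


def best_position(accuracies: Sequence[float]) -> int:
    """Index of the frame position with the highest accuracy (first on ties)."""
    return max(range(len(accuracies)), key=lambda t: accuracies[t])


# Toy data invented for illustration only: 3 sequences of 4 frames each.
toy_predictions = [[1, 2, 2, 1], [0, 1, 1, 2], [2, 2, 3, 3]]
toy_labels = [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3]]

toy_accuracies = accuracy_per_position(toy_predictions, toy_labels)
toy_best = best_position(toy_accuracies)
```

Repeating such a measurement for different input sequence lengths and convolutional filter sizes is one way the relationship described in the abstract could be examined empirically, under the assumption that per-frame predictions and labels are available for every position.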

Updated: 2021-01-07