Deep ad-hoc beamforming
Computer Speech & Language (IF 3.1), Pub Date: 2021-02-14, DOI: 10.1016/j.csl.2021.101201
Xiao-Lei Zhang

Far-field speech processing is an important and challenging problem. In this paper, we propose deep ad-hoc beamforming, a deep-learning-based multichannel speech enhancement framework built on ad-hoc microphone arrays, to address the problem. It contains three novel components. First, it combines ad-hoc microphone arrays with deep-learning-based multichannel speech enhancement, which significantly reduces the probability that far-field acoustic conditions occur. Second, it groups the microphones around the speech source into a local microphone array via a supervised channel-selection framework based on deep neural networks. Third, it develops a simple time-synchronization framework to align channels with different time delays. Beyond these novelties, the proposed model is trained in a single-channel fashion, so it can easily adopt new developments in speech processing techniques. At test time, the framework can also flexibly incorporate any number of microphones without retraining or modification. We developed many implementations of the proposed framework and conducted extensive experiments in scenarios where the speech sources are far-field, randomly located, and blind to the microphones. Results on speech enhancement tasks show that our method outperforms its counterpart based on linear microphone arrays by a considerable margin in reverberant environments with both diffuse noise and point-source noise. We also tested the framework with different handcrafted features. The results show that, although well-designed features lead to high performance, the choice of features does not affect the conclusion on the effectiveness of the proposed framework.
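The abstract gives no implementation details of the supervised channel selection, so the sketch below only illustrates the general idea: rank the ad-hoc channels by a per-channel quality score and keep the top k to form the local microphone array around the speech source. The paper trains a deep neural network for this prediction; here a crude energy-based proxy (`proxy_score`, a hypothetical stand-in) is used only so the example runs.

```python
import numpy as np

def select_local_array(channels, k, score_fn):
    """Rank ad-hoc channels by a predicted quality score and keep the top k,
    forming the 'local microphone array' around the speech source."""
    scores = np.array([score_fn(x) for x in channels])
    keep = np.argsort(scores)[::-1][:k]          # indices of the k best channels
    return sorted(keep.tolist()), scores

def proxy_score(x, frame=512):
    """Crude stand-in for the paper's supervised DNN score predictor:
    dynamic range of frame energies, in dB (speech-bearing channels vary more)."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    e = (frames ** 2).mean(axis=1)
    return 10.0 * np.log10(e.max() / (e.mean() + 1e-12) + 1e-12)

rng = np.random.default_rng(0)
mics = [rng.normal(scale=s, size=16000) for s in (0.1, 1.0, 0.3)]
t = np.arange(4000) / 16000.0
mics[0][4000:8000] += np.sin(2 * np.pi * 200 * t)    # mic 0 is near the "speech"
picked, scores = select_local_array(mics, k=2, score_fn=proxy_score)
print("selected channels:", picked)
```

In the actual framework the score would come from a DNN trained on single-channel data, which is what lets the model scale to any number of microphones at test time without retraining.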
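The time-synchronization component is likewise only named in the abstract. A common baseline for aligning channels with unknown relative delays is GCC-PHAT cross-correlation against a reference channel, sketched below; this specific estimator is an assumption for illustration, not necessarily the paper's method.

```python
import numpy as np

def gcc_phat_shift(ref, sig, max_lag):
    """Estimate, via GCC-PHAT, the shift (in samples) that aligns `sig` to `ref`.
    The returned value can be passed directly to np.roll(sig, shift)."""
    n = len(ref) + len(sig)                        # zero-pad: linear correlation
    R = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)  # phase transform (PHAT)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # lags -max_lag..+max_lag
    return int(np.argmax(np.abs(cc))) - max_lag

def synchronize(channels, max_lag=800):
    """Align every channel to channel 0 by its estimated relative delay."""
    return [np.roll(x, gcc_phat_shift(channels[0], x, max_lag)) for x in channels]

fs = 16000
ref = np.sin(2 * np.pi * 220 * np.arange(fs) / fs) * np.hanning(fs)
delayed = np.roll(ref, 120)                        # simulate a 120-sample delay
print(gcc_phat_shift(ref, delayed, max_lag=800))   # -> -120, i.e. undoes the delay
```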




Updated: 2021-02-17