Mixture of Inference Networks for VAE-Based Audio-Visual Speech Enhancement
IEEE Transactions on Signal Processing (IF 4.6), Pub Date: 2021-03-17, DOI: 10.1109/tsp.2021.3066038
Mostafa Sadeghi, Xavier Alameda-Pineda

We address unsupervised audio-visual speech enhancement based on variational autoencoders (VAEs), where the prior distribution of the clean speech spectrogram is modeled using an encoder-decoder architecture. At enhancement (test) time, the trained generative model (decoder) is combined with a noise model whose parameters need to be estimated. Because the overall inference problem is non-convex, the initialization of the latent variables that describe the generative process of clean speech via the decoder is crucial. This is usually done by feeding the noisy audio and clean visual data to the trained encoder and using its output. Current audio-visual VAE models do not provide an effective initialization because the two modalities are tightly coupled (concatenated) in the associated architectures. To overcome this issue, we introduce the mixture of inference networks variational autoencoder (MIN-VAE). Two encoder networks take audio and visual data as input, respectively, and the posterior of the latent variables is modeled as a mixture of the two Gaussian distributions output by the encoders. The mixture variable is itself latent, so learning the optimal balance between the audio and visual encoders is unsupervised as well. By training a shared decoder, the overall network learns to adaptively fuse the two modalities. Moreover, at test time, the visual encoder, which takes (clean) visual data as input, is used for initialization. A variational inference approach is derived to train the proposed model. Thanks to the novel inference procedure and the robust initialization, the MIN-VAE outperforms both the standard audio-only and audio-visual counterparts.
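To make the mixture-posterior idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation; all class, function, and parameter names are hypothetical). Each encoder outputs the mean and log-variance of a Gaussian over the latent z, so the approximate posterior is q(z | a, v) = alpha * N(z; mu_a, Sigma_a) + (1 - alpha) * N(z; mu_v, Sigma_v), and sampling from it amounts to stochastically picking a component and reparameterizing it. For brevity the sketch uses a single learned mixture weight alpha, whereas the paper treats the mixture variable as latent and infers per-sample responsibilities as part of variational inference.

import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input feature vector to the mean and log-variance of a Gaussian."""
    def __init__(self, in_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class MINVAE(nn.Module):
    def __init__(self, audio_dim, visual_dim, latent_dim=32):
        super().__init__()
        self.audio_enc = GaussianEncoder(audio_dim, latent_dim)    # q(z | audio)
        self.visual_enc = GaussianEncoder(visual_dim, latent_dim)  # q(z | video)
        # Shared decoder: both encoders feed the same generative model,
        # which reconstructs a clean-speech spectrogram frame from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.Tanh(), nn.Linear(128, audio_dim))
        # Logit of a global mixture weight (a simplification; the paper infers
        # per-sample mixture responsibilities variationally).
        self.mix_logit = nn.Parameter(torch.zeros(1))

    def forward(self, audio, visual):
        mu_a, logvar_a = self.audio_enc(audio)
        mu_v, logvar_v = self.visual_enc(visual)
        alpha = torch.sigmoid(self.mix_logit)  # weight of the audio branch
        # Sample the mixture: choose one component per example, then
        # reparameterize the chosen Gaussian.
        comp = torch.bernoulli(alpha.expand(audio.shape[0], 1))
        mu = comp * mu_a + (1 - comp) * mu_v
        logvar = comp * logvar_a + (1 - comp) * logvar_v
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), (mu_a, logvar_a), (mu_v, logvar_v), alpha

In this sketch, the test-time initialization described in the abstract would correspond to taking the mean of the visual branch alone, e.g. z0 = model.visual_enc(visual)[0], since the visual input remains clean when the audio is noisy.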

Updated: 2021-04-06