Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders,arXiv - CS - Sound


arXiv - CS - Sound. Pub Date: 2020-01-13, DOI: arxiv-2001.04463
Kangle Deng, Aayush Bansal, and Deva Ramanan

We present an unsupervised approach that converts the speech input of any one individual into an output set of potentially infinitely many speakers. A user can stand in front of a microphone and make their favorite celebrity say the same words. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set (motivated by PCA/linear autoencoders). We use an exemplar autoencoder to learn the voice and specific style (emotions and ambiance) of a target speaker. In contrast to existing methods, the proposed approach can be extended to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio data per speaker. We also demonstrate the usefulness of our approach for generating video from audio signals and vice versa. We encourage the reader to visit our project webpage for synthesized examples: https://dunbar12138.github.io/projectpage/Audiovisual/
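
The PCA/linear-autoencoder motivation mentioned in the abstract can be illustrated with a small numerical sketch. This is not the authors' model: it is a toy rank-k PCA reconstruction in NumPy, with synthetic 3-D data standing in for a target speaker's training set. The function names and the data are assumptions made for illustration only. It shows that encoding and then decoding an out-of-sample point returns a point lying on the subspace spanned by the training data.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a rank-k PCA model, which is the optimal linear autoencoder for X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered training data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruct(x, mean, components):
    """Encode x into k coefficients, then decode back to input space."""
    code = (x - mean) @ components.T
    return mean + code @ components

def off_subspace(x, mean, components):
    """Distance of x from the affine subspace learned from the training set."""
    r = x - mean
    return np.linalg.norm(r - (r @ components.T) @ components)

rng = np.random.default_rng(0)
# Hypothetical training set: points near a single line in 3-D (toy stand-in
# for one speaker's data; not real audio features).
direction = np.array([1.0, 2.0, 2.0]) / 3.0
train = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 3))

mean, components = fit_linear_autoencoder(train, k=1)

# An out-of-sample input that lies far from the training line.
x_new = np.array([3.0, -1.0, 0.5])
x_hat = reconstruct(x_new, mean, components)

print("distance before:", off_subspace(x_new, mean, components))
print("distance after: ", off_subspace(x_hat, mean, components))
```

In this linear toy case the reconstruction is an orthogonal projection, so applying it a second time changes nothing. The abstract describes the nonlinear exemplar autoencoder only as being motivated by this behavior.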


Updated: 2020-01-14
