Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning
arXiv - CS - Sound. Pub Date: 2021-07-21. arXiv: 2107.09998
Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Deep generative models have recently achieved impressive performance in speech synthesis and music generation. However, compared with the generation of these domain-specific sounds, the generation of general sounds (such as car horns, dog barking, and gunshots) has received less attention, despite its wide range of potential applications. In our previous work, sounds were generated in the time domain using SampleRNN; however, this method struggles to capture long-range dependencies within sound recordings. In this work, we propose to generate sounds conditioned on sound classes via neural discrete time-frequency representation learning, which offers advantages in modelling long-range dependencies while retaining local fine-grained structure within a sound clip. We evaluate our proposed approach on the UrbanSound8K dataset against a SampleRNN baseline, using performance metrics that measure the quality and diversity of the generated sound samples. Experimental results show that our proposed method achieves significantly better diversity and comparable quality relative to the baseline.
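As a concrete illustration of the approach the abstract describes, the sketch below implements a minimal VQ-VAE-style tokenizer that encodes a mel-spectrogram into a discrete time-frequency representation and reconstructs it. All layer sizes, the codebook size, and the input shape are illustrative assumptions, not the authors' actual configuration; the class-conditional autoregressive prior over the token indices (the stage that performs generation) is omitted for brevity.

```python
# Minimal sketch of discrete time-frequency representation learning for
# conditional sound generation (VQ-VAE-style). Hyperparameters and layer
# shapes are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantization with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                              # z: (B, N, D)
        B, N, D = z.shape
        dists = torch.cdist(z.reshape(-1, D), self.codebook.weight)
        idx = dists.argmin(dim=-1).view(B, N)          # discrete T-F tokens
        zq = self.codebook(idx)                        # quantized vectors
        # Standard VQ-VAE codebook and commitment losses.
        vq_loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()                     # straight-through estimator
        return zq, idx, vq_loss

class SpectrogramVQVAE(nn.Module):
    """Encode a mel-spectrogram into discrete tokens and reconstruct it.
    A class-conditional autoregressive prior over `idx` (not shown) would
    then sample token sequences for a requested sound class, which the
    decoder maps back to a spectrogram."""
    def __init__(self, dim=64, num_codes=512):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 4, stride=2, padding=1),
        )
        self.vq = VectorQuantizer(num_codes, dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, mel):                            # mel: (B, 1, F_bins, T)
        h = self.enc(mel)                              # (B, D, F', T')
        B, D, Fp, Tp = h.shape
        z = h.flatten(2).transpose(1, 2)               # (B, F'*T', D)
        zq, idx, vq_loss = self.vq(z)
        recon = self.dec(zq.transpose(1, 2).reshape(B, D, Fp, Tp))
        return recon, idx, F.mse_loss(recon, mel) + vq_loss

if __name__ == "__main__":
    model = SpectrogramVQVAE()
    mel = torch.randn(2, 1, 64, 128)                   # dummy mel-spectrogram batch
    recon, tokens, loss = model(mel)
    print(recon.shape, tokens.shape)                   # (2, 1, 64, 128), (2, 512)
```

In a full pipeline, generation would proceed by sampling a token sequence from the class-conditional prior and decoding it; a separate vocoder or spectrogram-inversion step would typically convert the resulting spectrogram back to a waveform.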

Updated: 2021-07-22