MP3net: coherent, minute-long music generation from raw audio with a simple convolutional GAN



arXiv - CS - Sound. Pub Date: 2021-01-12, DOI: arxiv-2101.04785
Korneel van den Broek

We present a deep convolutional GAN that leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio samples with long-range coherence. The model uses a Modified Discrete Cosine Transform (MDCT) data representation, which includes all phase information. Phase generation is hence an integral part of the model. We leverage the auditory masking and psychoacoustic perception limits of the human ear to widen the true distribution and stabilize the training process. The model architecture is a deep 2D convolutional network in which each subsequent generator block increases the resolution along the time axis and adds a higher octave along the frequency axis. The deeper layers are connected to all parts of the output and have the context of the full track, which enables the generation of samples that exhibit long-range coherence. We use MP3net to create 95 s stereo tracks with a 22 kHz sample rate after training for 250 h on a single Cloud TPUv2. An additional benefit of the CNN-based architecture is that generation of new songs is almost instantaneous.


Updated: 2021-01-14
