Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs,arXiv - CS - Sound



arXiv - CS - Sound, Pub Date: 2021-01-07, DOI: arxiv-2101.02402
Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music as a sequence of tokens drawn from a finite, predefined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. We also propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs, and we employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiments show that, compared to state-of-the-art models, the proposed model converges 5 to 10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), with comparable quality in the generated music.
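The grouping step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the token type names (`bar`, `position`, `pitch`, `duration`, `velocity`), the `[ignore]` filler, and the function `to_compound_words` are assumptions made here for illustration. The sketch reads "expansion" as padding every compound word to the same set of slots and "compression" as merging neighboring tokens of the same family (note or metric) into one word.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (token type, value)

# Hypothetical token types; the paper's actual vocabulary may differ.
METRIC_TYPES = ("bar", "position")
NOTE_TYPES = ("pitch", "duration", "velocity")
ALL_TYPES = METRIC_TYPES + NOTE_TYPES
IGNORE = "[ignore]"  # assumed filler for slots a compound word does not use


def to_compound_words(tokens: List[Token]) -> List[Dict[str, str]]:
    """Group neighboring tokens of one family into fixed-width compound words."""
    words: List[Dict[str, str]] = []
    current = None
    for tok_type, value in tokens:
        if tok_type not in ALL_TYPES:
            raise ValueError(f"unknown token type: {tok_type}")
        family = "note" if tok_type in NOTE_TYPES else "metric"
        # Start a new word when the family changes or the slot is already filled.
        if (
            current is None
            or current["family"] != family
            or current[tok_type] != IGNORE
        ):
            # Expansion: every word carries all slots, unused ones hold IGNORE.
            current = {t: IGNORE for t in ALL_TYPES}
            current["family"] = family
            words.append(current)
        # Compression: the token becomes one field of the current word.
        current[tok_type] = value
    return words


tokens = [
    ("bar", "1"), ("position", "1/16"),
    ("pitch", "60"), ("duration", "4"), ("velocity", "80"),
    ("position", "5/16"),
    ("pitch", "64"), ("duration", "2"), ("velocity", "72"),
]
words = to_compound_words(tokens)
print(len(tokens), "tokens ->", len(words), "compound words")
```

In this toy input, nine individual tokens collapse into four compound words, which is the kind of sequence shortening the abstract refers to. A model along the lines described would then predict each slot of a word with its own output head rather than one token at a time.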


Updated: 2021-01-08
