Musical rhythm transcription based on Bayesian piece-specific score models capturing repetitions☆
Introduction
Music transcription is an actively studied but still unsolved problem in music information processing [1], [2]. One of its goals is to convert a musical performance signal into a human-readable symbolic musical score. While recent studies have achieved highly accurate pitch detection [3], [11], [33], rhythms must also be transcribed to obtain symbolic music representations [6], [7], [10], [15], [17], [18], [19], [20], [23], [31], [32]. Since a given performance admits many logically possible rhythm representations (including ones inappropriate for transcription) [6], using a (musical) score model (a.k.a. musical language model) that describes prior knowledge about musical scores is key to solving this problem. Here, we study the problem of rhythm transcription for monophonic music, i.e. the task of recognizing score-notated musical rhythms from human-performed MIDI data that contain onset-time deviations. A rhythm transcription method can also be used as a module of audio-to-score music transcription systems, as studied in [17], [20].
A common approach to music transcription is to integrate a score model and a performance/acoustic model so as to obtain a proper transcription that best fits an input performance signal, in analogy with statistical speech recognition [21]. This is a top-down approach that describes the generative process of musical performance signals with statistical models, in contrast to bottom-up approaches that extract features from the signals, such as ratios of inter-onset intervals [13], [25] or wavelet-transform representations of rhythmic features [30]. A major advantage of the former approach is the potential to utilize statistical machine learning techniques.
In constructing a score model for music transcription, capturing both local and global features of musical scores is considered to be effective. Among the local features, the sequential dependencies of musical notes can be used to induce output scores that obey the musical grammar. Among the global features, repetitive structures are commonly found in various styles of music [12], [26] and can be a useful guide for transcription, since they complement the local information of musical scores: by exploiting a repetitive structure, one can in effect cancel out timing deviations and other "noises" in performances. Conventional score models for music transcription that capture the local sequential dependencies of musical notes include Markov models [10], [19], [20], [22], [23], [24], [31], hidden Markov models (HMMs) [27], recurrent neural networks (RNNs) [28], and long short-term memory (LSTM) networks [34]. However, it is challenging to construct a score model that incorporates a repetitive structure for transcription, particularly because imposing such extensive constraints on output scores typically incurs prohibitively large computational costs.
Recently, Nakamura et al. [18] proposed a statistical score model incorporating a repetitive structure and derived a tractable transcription algorithm based on the model. A prominent feature of this method is that piece-specific score models for individual musical pieces are constructed and subsequently used for transcription, in contrast to the use of a generic score model for all target musical pieces in conventional methods. The model is formulated as a Bayesian model whereby a piece-specific score model is first generated from a generic score model and a musical score is then generated from the piece-specific model (Fig. 1). This model is similar to the topic model of natural language based on latent Dirichlet allocation (LDA) [5], where a word distribution for an individual text is generated from a generic word distribution for a collection of texts. Instead of a word distribution, a distribution of note patterns (i.e. subsequences of notes corresponding to bars) is considered as a score model for generating musical scores. The process for generating piece-specific models is described as a Dirichlet process [14] with a small concentration parameter, which induces sparse distributions of note patterns. As a consequence of the sparseness of the piece-specific score model, repetitions of note patterns are naturally induced in the generated score. Importantly, the piece-specific score model is itself stochastically generated, so the total model can describe unseen scores whose repetitive structure is not fixed in advance. Moreover, by incorporating a process of modifying the generated note patterns, the model can also describe approximate repetitions, which are commonly seen in music practice.
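As a rough illustration of how a small concentration parameter induces repetition, the following sketch approximates a draw from the Dirichlet process by a finite symmetric Dirichlet distribution over a toy vocabulary of note patterns; the pattern strings and parameter values are invented for illustration and are not those of the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of bar-level note patterns (invented for illustration).
patterns = ["q q q q", "e e q q q", "h q q", "q e e q e e", "h h"]
G0 = np.full(len(patterns), 1.0 / len(patterns))  # generic score model

def piece_specific_model(alpha):
    # Finite Dirichlet surrogate for a draw from DP(alpha, G0):
    # a small alpha yields a sparse distribution over patterns.
    return rng.dirichlet(alpha * G0)

sparse = piece_specific_model(alpha=1.0)     # piece-specific, repetitive
diffuse = piece_specific_model(alpha=200.0)  # close to the generic model

for name, p in [("sparse", sparse), ("diffuse", diffuse)]:
    bars = rng.choice(len(patterns), size=16, p=p)
    print(name, "distinct patterns in a 16-bar score:", len(set(bars.tolist())))
```

With a small concentration parameter, most of the probability mass falls on one or two patterns, so the sampled score repeats them; this is the mechanism by which the sparseness of the piece-specific model induces repetitions in generated scores.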
In [18], the efficacy of the score model was tested on the task of rhythm transcription for monophonic music. It was found that both the Bayesian construction of score models and the note modification process improved the transcription accuracy, showing the potential of the approach. It was also found, however, that the computational costs were too large for practical applications.
The purpose of this study is to find a score model capturing repetitions that can be practically used for rhythm transcription. We construct wider classes of Bayesian score models and conduct a systematic comparative evaluation of these models to find an optimal model in terms of transcription accuracy and computational costs. While repetitions were considered in units of note patterns in [18], it is theoretically possible to consider repetitions in units of notes. We construct Bayesian score models based on the note value Markov model (MM) [31] and the metrical MM [10], [23], which are advantageous for computational efficiency compared to the note pattern MM considered in [18]. We apply the constructed score models to rhythm transcription and conduct evaluations to examine the effect of capturing repetitions and to reveal the best model for the task.
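To make the two computationally lighter model families concrete, here is a minimal sketch (with made-up transition probabilities) contrasting a note value Markov model, whose state is the note value itself, with a metrical Markov model, whose state is the position within a bar and whose note value is implied by the jump between successive positions.

```python
import numpy as np

# Note value MM: state = note value in beats; P(r_n | r_{n-1}).
note_values = [0.5, 1.0, 2.0]          # eighth, quarter, half note
A_value = np.array([                   # illustrative transition matrix
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.4, 0.4],
])

# Metrical MM: state = metrical position on a quarter-note grid of a
# 4-beat bar; the note value is (next - current) mod 4 (0 meaning 4).
A_pos = np.array([                     # illustrative transition matrix
    [0.1, 0.4, 0.3, 0.2],
    [0.1, 0.1, 0.5, 0.3],
    [0.2, 0.1, 0.2, 0.5],
    [0.6, 0.1, 0.1, 0.2],
])

def prob_value_mm(values):
    """Probability of a note value sequence under the note value MM."""
    idx = [note_values.index(v) for v in values]
    p = 1.0 / len(note_values)         # uniform initial distribution
    for i, j in zip(idx[:-1], idx[1:]):
        p *= A_value[i, j]
    return p

def prob_metrical_mm(positions):
    """Probability of a metrical position sequence under the metrical MM."""
    p = 1.0 / A_pos.shape[0]           # uniform initial distribution
    for i, j in zip(positions[:-1], positions[1:]):
        p *= A_pos[i, j]
    return p

print(prob_value_mm([1.0, 1.0, 0.5, 0.5]))   # quarter-quarter-eighth-eighth
print(prob_metrical_mm([0, 1, 2, 3]))        # four quarters in one bar
```

Both models have state spaces of the size of the note value or beat-position alphabet, which is why they are computationally cheaper than a Markov model over whole bar-level note patterns, whose state space grows with the number of distinct patterns.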
After introducing basic score models for rhythms (note value MM, metrical MM, and note pattern MM) and presenting statistical analyses on the repetitive structure in Section 2, we formulate our models in Section 3. Bayesian extensions of the three types of Markov models and the integration of note modification processes (note divisions and onset shifts) are formulated there. A rhythm transcription method based on the score models is presented in Section 4. Evaluations are carried out in Section 5, where models are compared in terms of the transcription error rate and computation time. We also examine the influence of the parameterization of the relevant hyperparameters. The data and source code used in this study are available at https://bayesianscoremodel.github.io.
Section snippets
Repetitive structure and sparseness of piece-specific score models
We here define the form of music data we address, introduce basic score models, and present statistical analyses that indicate the necessity of considering piece-specific score models in constructing musical score models capturing the repetitive structure of music.
Overview of the studied models
Here, we explain the proposed score models. We construct score models based on three basic models: the note value MM, the metrical MM, and the note pattern MM. Each basic model is extended in a Bayesian manner to describe the repetitive structure and combined with note modification models to describe approximate repetitions. The incorporation of note modifications is described in Section 3.2 and the Bayesian extensions are formulated in Section 3.3. The list of constructed models based on the …
Rhythm transcription method
The aim of a rhythm transcription method is to estimate the score onset times (or note values) from input MIDI data with note onset times (or durations) represented in units of seconds. We first construct a musical performance generative model that describes the probability of the performance data given a score, by combining a score model (one of the aforementioned models) and a performance model explained in Section 4.1. We then derive a rhythm …
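The generate-and-decode idea behind such a method can be sketched as a small Viterbi decoder; the candidate note values, transition matrix, tempo, and Gaussian deviation width below are all invented for illustration and do not reproduce the paper's models.

```python
import numpy as np

note_values = np.array([0.5, 1.0, 1.5, 2.0])   # candidate values, in beats
A = np.full((4, 4), 0.1)                       # toy score model:
np.fill_diagonal(A, 0.7)                       # repeating a value is likely
logA = np.log(A)

def transcribe(iois_sec, beat_period=0.5, sigma=0.03):
    """Viterbi decoding of note values from performed inter-onset
    intervals (seconds): Gaussian timing deviations serve as the
    performance model and a Markov score model as the prior."""
    iois = np.asarray(iois_sec)
    mu = note_values * beat_period                 # ideal IOI per state
    logB = -0.5 * ((iois[:, None] - mu[None, :]) / sigma) ** 2
    delta = -np.log(len(note_values)) + logB[0]    # uniform initial dist.
    backptr = []
    for t in range(1, len(iois)):
        scores = delta[:, None] + logA             # scores[i, j]: i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + logB[t]
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return note_values[path]

# Performed quarter, quarter, eighth, eighth, half at ~120 BPM with jitter:
print(transcribe([0.51, 0.48, 0.26, 0.24, 1.02]))
```

The decoder snaps each noisy inter-onset interval to the most probable quantized note value while the Markov prior mildly favors continuing the previous value, which is the basic mechanism shared by the HMM-based transcription methods cited above.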
Evaluation
To evaluate the performance of the studied models, we conducted two evaluation experiments. In the first experiment (Section 5.2), in order to compare the effects of different model architectures, the predictive ability of the non-Bayesian score models is evaluated. In the second experiment (Section 5.3), the transcription accuracies of the models are measured to examine the effects of the Bayesian extensions and modification models. We also examine the influence of the hyperparameters of the …
Conclusion
We have studied the statistical description of the repetitive structure of musical notes using Bayesian score models and its application to rhythm transcription. The main results are summarized as follows.
- The repetitive structure of musical rhythms is reflected in the sparseness of piece-specific score models, and the Dirichlet process describing the generative process of piece-specific score models from a generic score model can explain the distribution of the entropies of piece-specific …
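The sparseness mentioned here can be quantified as the entropy of a piece's empirical note-pattern distribution; the following toy sketch (with invented pattern labels) shows that a repetitive piece yields a low-entropy, i.e. sparse, piece-specific distribution.

```python
import math
from collections import Counter

def pattern_entropy(bars):
    """Entropy (bits) of the empirical distribution of bar-level patterns."""
    counts = Counter(bars)
    n = len(bars)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive piece reuses few patterns -> sparse distribution, low entropy.
repetitive = ["A", "A", "B", "A", "A", "B", "A", "A"]
varied     = ["A", "B", "C", "D", "E", "F", "G", "H"]
print(pattern_entropy(repetitive))  # low
print(pattern_entropy(varied))      # maximal for 8 distinct patterns
```

Comparing such per-piece entropies against the entropy of the corpus-wide pattern distribution is one simple way to observe the gap that the Dirichlet-process construction is designed to model.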
CRediT authorship contribution statement
Eita Nakamura: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Writing - original draft. Kazuyoshi Yoshii: Data curation, Funding acquisition, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (34)
- et al., Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction, Curr. Biol. (2017)
- A hybrid graphical model for rhythmic parsing, Artif. Intell. (2002)
- et al., Automatic music transcription: An overview, IEEE Signal Process. Mag. (2019)
- et al., Automatic music transcription: Challenges and future directions, J. Intell. Inf. Syst. (2013)
- et al., An efficient temporally-constrained probabilistic model for multiple-instrument music transcription, Proc. ISMIR (2015)
- Pattern Recognition and Machine Learning (2006)
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- et al., Rhythm quantization for transcription, Comp. Mus. J. (2000)
- et al., The quantization of musical time: A connectionist approach, Comp. Mus. J. (1989)
- et al., The hierarchical hidden Markov model: Analysis and applications, Mach. Learn. (1998)
- RWC music database: Popular, classical and jazz music databases, Proc. ISMIR
- A learning-based quantization: Unsupervised estimation of the model parameters, Proc. ICMC
- Onsets and frames: Dual-objective piano transcription, Proc. ISMIR
- Sweet Anticipation: Music and the Psychology of Expectation
- Dirichlet processes, Chinese restaurant processes and all that
- Mental Processes: Studies in Cognitive Science
- Automatic word recognition based on second-order hidden Markov models, IEEE Trans. Speech Audio Process.
Cited by (4)
- Audio-to-Score Singing Transcription Based on Joint Estimation of Pitches, Onsets, and Metrical Positions With Tatum-Level CTC Loss, 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023)
- Study of Automatic Piano Transcription Algorithms based on the Polyphonic Properties of Piano Audio, IEIE Transactions on Smart Processing and Computing (2023)
- Dynamic cluster structure and predictive modelling of music creation style distributions, Royal Society Open Science (2022)
☆ This work is supported in part by JSPS KAKENHI Nos. 15K16054, 16H01744, 16J05486, and 19K20340; JST ACCEL No. JPMJAC1602; the Kyoto University Foundation; and the Kayamori Foundation. The work of EN was supported by the JSPS research fellowship (PD).