Musical rhythm transcription based on Bayesian piece-specific score models capturing repetitions☆
Introduction
Music transcription is an actively studied but still unsolved problem in music information processing [1], [2]. One of its goals is to convert a musical performance signal into a human-readable symbolic musical score. While recent studies have achieved highly accurate pitch detection [3], [11], [33], rhythms must also be transcribed to obtain symbolic music representations [6], [7], [10], [15], [17], [18], [19], [20], [23], [31], [32]. Since a given performance admits many logically possible rhythm representations (including ones inappropriate for transcription) [6], using a (musical) score model (a.k.a. musical language model) that describes prior knowledge about musical scores is key to solving this problem. Here, we study the problem of rhythm transcription for monophonic music, i.e. the task of recognizing score-notated musical rhythms from human-performed MIDI data that contain onset-time deviations. A rhythm transcription method can also be used as a module of audio-to-score music transcription systems, as studied in [17], [20].
A common approach to music transcription is to integrate a score model and a performance/acoustic model so as to obtain a proper transcription that best fits an input performance signal, in analogy with statistical speech recognition [21]. This is a top-down approach that describes the generative process of musical performance signals with statistical models, in contrast to bottom-up approaches that extract features from the signals, such as ratios of inter-onset intervals [13], [25] or wavelet-transform representations of rhythmic features [30]. A major advantage of the former approach is the potential to utilize statistical machine learning techniques.
In constructing a score model for music transcription, capturing both local and global features of musical scores is considered to be effective. Among the local features, the sequential dependencies of musical notes can be used to induce output scores that obey the musical grammar. Among the global features, repetitive structures are commonly found in various styles of music [12], [26] and can be a useful guide for transcription, since they complement the local information of musical scores: by exploiting a repetitive structure, one can in effect cancel out timing deviations and other "noises" in performances. Conventional score models for music transcription that capture the local sequential dependencies of musical notes include Markov models [10], [19], [20], [22], [23], [24], [31], hidden Markov models (HMMs) [27], recurrent neural networks (RNNs) [28], and long short-term memory (LSTM) networks [34]. However, it is challenging to construct a score model that incorporates a repetitive structure for transcription, particularly because imposing such extensive constraints on output scores typically incurs prohibitively large computational costs.
Recently, Nakamura et al. [18] proposed a statistical score model incorporating a repetitive structure and derived a tractable transcription algorithm based on the model. A prominent feature of this method is that piece-specific score models for individual musical pieces are constructed and subsequently used for transcription, in contrast to the use of a generic score model for all target musical pieces in conventional methods. The model is formulated as a Bayesian model whereby a piece-specific score model is first generated from a generic score model and a musical score is then generated from the piece-specific model (Fig. 1). This model is similar to the topic model of natural language based on latent Dirichlet allocation (LDA) [5], where a word distribution for an individual text is generated from a generic word distribution for a collection of texts. Instead of a word distribution, a distribution of note patterns (i.e. subsequences of notes corresponding to bars) is considered as a score model for generating musical scores. The process for generating piece-specific models is described as a Dirichlet process [14] with a small concentration parameter, which induces sparse distributions of note patterns. As a consequence of the sparseness of the piece-specific score model, repetitions of note patterns are naturally induced in the generated score. Importantly, the piece-specific score model is itself stochastically generated, so the total model can describe unseen scores whose repetitive structure is not fixed in advance. Moreover, by incorporating a process of modifying the generated note patterns, the model can also describe approximate repetitions, which are commonly seen in music practice.
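As a rough illustration of how a small concentration parameter induces repetition, the following sketch approximates a draw from the Dirichlet process by a finite symmetric Dirichlet distribution over a toy vocabulary of note patterns; the pattern strings and parameter values are invented for illustration and are not those of the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of bar-level note patterns (invented for illustration).
patterns = ["q q q q", "e e q q q", "h q q", "q e e q e e", "h h"]
G0 = np.full(len(patterns), 1.0 / len(patterns))  # generic score model

def piece_specific_model(alpha):
    # Finite Dirichlet surrogate for a draw from DP(alpha, G0):
    # a small alpha yields a sparse distribution over patterns.
    return rng.dirichlet(alpha * G0)

sparse = piece_specific_model(alpha=1.0)     # piece-specific, repetitive
diffuse = piece_specific_model(alpha=200.0)  # close to the generic model

for name, p in [("sparse", sparse), ("diffuse", diffuse)]:
    bars = rng.choice(len(patterns), size=16, p=p)
    print(name, "distinct patterns in a 16-bar score:", len(set(bars.tolist())))
```

With a small concentration parameter, most of the probability mass falls on one or two patterns, so the sampled score repeats them; this is the mechanism by which the sparseness of the piece-specific model induces repetitions in generated scores.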
In [18], the efficacy of the score model was tested on the task of rhythm transcription for monophonic music. It was found that both the Bayesian construction of score models and the note modification process improved the transcription accuracy, showing the potential of the approach. It was also found, however, that the computational costs were too large for practical applications.
The purpose of this study is to find a score model capturing repetitions that can be practically used for rhythm transcription. We construct wider classes of Bayesian score models and conduct a systematic comparative evaluation of these models to find an optimal model in terms of transcription accuracy and computational costs. While repetitions were considered in units of note patterns in [18], it is theoretically possible to consider repetitions in units of notes. We construct Bayesian score models based on the note value Markov model (MM) [31] and the metrical MM [10], [23], which are advantageous for computational efficiency compared to the note pattern MM considered in [18]. We apply the constructed score models to rhythm transcription and conduct evaluations to examine the effect of capturing repetitions and to reveal the best model for the task.
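To make the two computationally lighter model families concrete, here is a minimal sketch (with made-up transition probabilities) contrasting a note value Markov model, whose state is the note value itself, with a metrical Markov model, whose state is the position within a bar and whose note value is implied by the jump between successive positions.

```python
import numpy as np

# Note value MM: state = note value in beats; P(r_n | r_{n-1}).
note_values = [0.5, 1.0, 2.0]          # eighth, quarter, half note
A_value = np.array([                   # illustrative transition matrix
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.4, 0.4],
])

# Metrical MM: state = metrical position on a quarter-note grid of a
# 4-beat bar; the note value is (next - current) mod 4 (0 meaning 4).
A_pos = np.array([                     # illustrative transition matrix
    [0.1, 0.4, 0.3, 0.2],
    [0.1, 0.1, 0.5, 0.3],
    [0.2, 0.1, 0.2, 0.5],
    [0.6, 0.1, 0.1, 0.2],
])

def prob_value_mm(values):
    """Probability of a note value sequence under the note value MM."""
    idx = [note_values.index(v) for v in values]
    p = 1.0 / len(note_values)         # uniform initial distribution
    for i, j in zip(idx[:-1], idx[1:]):
        p *= A_value[i, j]
    return p

def prob_metrical_mm(positions):
    """Probability of a metrical position sequence under the metrical MM."""
    p = 1.0 / A_pos.shape[0]           # uniform initial distribution
    for i, j in zip(positions[:-1], positions[1:]):
        p *= A_pos[i, j]
    return p

print(prob_value_mm([1.0, 1.0, 0.5, 0.5]))   # quarter-quarter-eighth-eighth
print(prob_metrical_mm([0, 1, 2, 3]))        # four quarters in one bar
```

Both models have state spaces of the size of the note value or beat-position alphabet, which is why they are computationally cheaper than a Markov model over whole bar-level note patterns, whose state space grows with the number of distinct patterns.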
After introducing basic score models for rhythms (note value MM, metrical MM, and note pattern MM) and presenting statistical analyses on the repetitive structure in Section 2, we formulate our models in Section 3. Bayesian extensions of the three types of Markov models and the integration of note modification processes (note divisions and onset shifts) are formulated there. A rhythm transcription method based on the score models is presented in Section 4. Evaluations are carried out in Section 5, where models are compared in terms of the transcription error rate and computation time. We also examine the influence of the parameterization of the relevant hyperparameters. The data and source code used in this study are available at https://bayesianscoremodel.github.io.
Section snippets
Repetitive structure and sparseness of piece-specific score models
We here define the form of music data we address, introduce basic score models, and present statistical analyses that indicate the necessity of considering piece-specific score models in constructing musical score models capturing the repetitive structure of music.
Overview of the studied models
Here, we explain the proposed score models. We construct score models based on three basic models: the note value MM, the metrical MM, and the note pattern MM. Each basic model is extended in a Bayesian manner to describe the repetitive structure and combined with note modification models to describe approximate repetitions. The incorporation of note modifications is described in Section 3.2 and the Bayesian extensions are formulated in Section 3.3. The list of constructed models based on the …
Rhythm transcription method
The aim of a rhythm transcription method is to estimate the score onset times (or note values) from input MIDI data with note onset times (or durations) represented in units of seconds. We first construct a musical performance generative model that describes the probability of the performance data given a score, by combining a score model (one of the aforementioned models) and a performance model explained in Section 4.1. We then derive a rhythm …
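The generate-and-decode idea behind such a method can be sketched as a small Viterbi decoder; the candidate note values, transition matrix, tempo, and Gaussian deviation width below are all invented for illustration and do not reproduce the paper's models.

```python
import numpy as np

note_values = np.array([0.5, 1.0, 1.5, 2.0])   # candidate values, in beats
A = np.full((4, 4), 0.1)                       # toy score model:
np.fill_diagonal(A, 0.7)                       # repeating a value is likely
logA = np.log(A)

def transcribe(iois_sec, beat_period=0.5, sigma=0.03):
    """Viterbi decoding of note values from performed inter-onset
    intervals (seconds): Gaussian timing deviations serve as the
    performance model and a Markov score model as the prior."""
    iois = np.asarray(iois_sec)
    mu = note_values * beat_period                 # ideal IOI per state
    logB = -0.5 * ((iois[:, None] - mu[None, :]) / sigma) ** 2
    delta = -np.log(len(note_values)) + logB[0]    # uniform initial dist.
    backptr = []
    for t in range(1, len(iois)):
        scores = delta[:, None] + logA             # scores[i, j]: i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + logB[t]
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return note_values[path]

# Performed quarter, quarter, eighth, eighth, half at ~120 BPM with jitter:
print(transcribe([0.51, 0.48, 0.26, 0.24, 1.02]))
```

The decoder snaps each noisy inter-onset interval to the most probable quantized note value while the Markov prior mildly favors continuing the previous value, which is the basic mechanism shared by the HMM-based transcription methods cited above.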
Evaluation
To evaluate the performance of the studied models, we conducted two evaluation experiments. In the first experiment (Section 5.2), in order to compare the effects of different model architectures, the predictive ability of the non-Bayesian score models is evaluated. In the second experiment (Section 5.3), the transcription accuracies of the models are measured to examine the effects of the Bayesian extensions and modification models. We also examine the influence of the hyperparameters of the …
Conclusion
We have studied the statistical description of the repetitive structure of musical notes using Bayesian score models and its application to rhythm transcription. The main results are summarized as follows.
- The repetitive structure of musical rhythms is reflected in the sparseness of piece-specific score models, and the Dirichlet process describing the generative process of piece-specific score models from a generic score model can explain the distribution of the entropies of piece-specific …
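The sparseness mentioned here can be quantified as the entropy of a piece's empirical note-pattern distribution; the following toy sketch (with invented pattern labels) shows that a repetitive piece yields a low-entropy, i.e. sparse, piece-specific distribution.

```python
import math
from collections import Counter

def pattern_entropy(bars):
    """Entropy (bits) of the empirical distribution of bar-level patterns."""
    counts = Counter(bars)
    n = len(bars)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive piece reuses few patterns -> sparse distribution, low entropy.
repetitive = ["A", "A", "B", "A", "A", "B", "A", "A"]
varied     = ["A", "B", "C", "D", "E", "F", "G", "H"]
print(pattern_entropy(repetitive))  # low
print(pattern_entropy(varied))      # maximal for 8 distinct patterns
```

Comparing such per-piece entropies against the entropy of the corpus-wide pattern distribution is one simple way to observe the gap that the Dirichlet-process construction is designed to model.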
CRediT authorship contribution statement
Eita Nakamura: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Writing - original draft. Kazuyoshi Yoshii: Data curation, Funding acquisition, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (34)
- et al., Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction, Curr. Biol. (2017)
- A hybrid graphical model for rhythmic parsing, Artif. Intell. (2002)
- et al., Automatic music transcription: An overview, IEEE Signal Process. Mag. (2019)
- et al., Automatic music transcription: Challenges and future directions, J. Intell. Inf. Syst. (2013)
- et al., An efficient temporally-constrained probabilistic model for multiple-instrument music transcription, Proc. ISMIR (2015)
- Pattern Recognition and Machine Learning (2006)
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- et al., Rhythm quantization for transcription, Comp. Mus. J. (2000)
- et al., The quantization of musical time: A connectionist approach, Comp. Mus. J. (1989)
- et al., The hierarchical hidden Markov model: Analysis and applications, Mach. Learn. (1998)
- RWC music database: Popular, classical and jazz music databases, Proc. ISMIR
- A learning-based quantization: Unsupervised estimation of the model parameters, Proc. ICMC
- Onsets and frames: Dual-objective piano transcription, Proc. ISMIR
- Sweet Anticipation: Music and the Psychology of Expectation
- Dirichlet processes, Chinese restaurant processes and all that
- Mental Processes: Studies in Cognitive Science
- Automatic word recognition based on second-order hidden Markov models, IEEE Trans. Speech Audio Process.
Cited by (4)
- Audio-to-Score Singing Transcription Based on Joint Estimation of Pitches, Onsets, and Metrical Positions With Tatum-Level CTC Loss, 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023)
- Study of Automatic Piano Transcription Algorithms based on the Polyphonic Properties of Piano Audio, IEIE Transactions on Smart Processing and Computing (2023)
- Dynamic cluster structure and predictive modelling of music creation style distributions, Royal Society Open Science (2022)
☆ This work is supported in part by JSPS KAKENHI Nos. 15K16054, 16H01744, 16J05486, and 19K20340; JST ACCEL No. JPMJAC1602; the Kyoto University Foundation; and the Kayamori Foundation. The work of EN was supported by the JSPS research fellowship (PD).