Speech Communication, Volume 133, October 2021, Pages 9-22

NHSS: A speech and singing parallel database

https://doi.org/10.1016/j.specom.2021.07.002

Highlights

  • A publicly available database (NHSS) of parallel recordings of speech and singing.

  • The NHSS database supports comparative studies of speech and singing, and speech-to-singing conversion.

  • The database consists of recordings of spoken and sung vocals of English pop songs.

  • It contains 100 songs sung and spoken by 10 singers, resulting in 7 h of audio data.

  • Here we develop benchmark systems for different applications using the NHSS database.

Abstract

We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), referred to as the NUS-HLT Speak–Sing (NHSS) database. We release this database to the public to support research activities including, but not limited to, comparative studies of acoustic attributes of speech and singing signals, cooperative synthesis of speech and singing voices, and speech-to-singing conversion. The database consists of recordings of sung vocals of English pop songs, the spoken counterparts of the lyrics read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to 100 songs sung and spoken by 10 singers, resulting in a total of 7 h of audio data. There are 5 male and 5 female singers, each singing and reading the lyrics of 10 songs. In this paper, we discuss the design methodology of the database, analyze the similarities and dissimilarities in the characteristics of speech and singing voices, and provide strategies to exploit the relationships between these characteristics for converting one to the other. Using the NHSS database, we develop benchmark systems that can serve as references for speech-to-singing alignment, spectral mapping, and conversion.

Introduction

With the advent of deep learning, speech and singing technologies are often developed using neural networks that learn from large datasets. Beyond technology development, a proper database is also the basis for analyzing the characteristics of the signals themselves. Analysis of audio signals generally assists in studying the characteristics of the vocal-tract system and the excitation source signal responsible for producing them. A well-designed audio database can help us solve many real-world signal processing problems and develop related realistic applications.

The widespread interest in developing technologies related to speech and singing has led to several databases in the literature, each targeted towards a specific application. We present a brief review of some widely used speech and singing databases, along with existing speech and singing parallel corpora, to motivate our database development effort.

We start with a brief review of common speech databases and the design concepts behind them.

One broadly used database for speech signal analysis is the TIMIT corpus of read speech. It is designed to provide speech data for acoustic–phonetic studies, and for the development and evaluation of automatic speech recognition (ASR) systems (Garofolo et al., 1993). In the same context, the Switchboard database is a long-standing resource for speech research (Godfrey et al., 1992). Although Switchboard was originally developed for speaker recognition, it is widely used for conversational and spontaneous speech analysis, speech alignment, and spoken language understanding systems.

Among the ASR-related databases, the LibriSpeech corpus (Panayotov et al., 2015) is a sufficiently large, publicly available database of 1000 h of read English speech. This corpus is extensively used for language modeling and acoustic modeling in ASR research and development (Panayotov et al., 2015). Another significant speech corpus widely used for advancing ASR research is the Wall Street Journal (WSJ) CSR corpus (Paul and Baker, 1992). The AMI corpus, Aurora, and TIDIGITS are other such databases available for ASR research (Carletta et al., 2005, Hirsch and Pearce, 2000, Leonard, 1984).

Research on text-independent speaker identification is strongly supported by the international benchmarking events organized by NIST, and a large amount of data has been made available in this context (Martin and Greenberg, 2009). VoxCeleb is a large-scale audio-visual dataset (Nagrani et al., 2020) used in speaker identification. The RedDots dataset is another such publicly available database (Lee et al., 2015). In the context of text-dependent speaker verification, the RSR2015 database is designed for evaluation under different duration and lexical constraints (Larcher et al., 2014).

The CMU ARCTIC corpus (Kominek and Black, 2004, Sharma and Prasanna, 2016a) is one of the most popular databases for speech synthesis. The VCTK corpus (Veaux et al., 2012) is another popular database for multi-speaker text-to-speech (TTS). The LJSpeech corpus (Ito, 2017) includes 13,100 sentences, amounting to around 24 h of audio and text from audiobooks read by a single speaker. The Blizzard Challenge 2011 (BC2011) (King and Karaiskos, 2011) and Blizzard Challenge 2013 (BC2013) (King and Karaiskos, 2013) corpora also provide relatively large amounts of read speech from a single speaker.

Apart from the above-mentioned databases, there are numerous other speech corpora available to the community. Broadly, the design of these speech databases focuses on aspects such as applicability, appropriateness, amount of speech data, time-aligned transcription, and speaker and lexical variability. These speech databases serve as a common platform that has supported decades of research in speech recognition, speaker recognition, and speech synthesis, from traditional hidden Markov models to deep neural networks (Garofolo et al., 1993, Martin and Greenberg, 2009).

The long-standing effort in the literature to characterize the singing voice and to develop technologies in the field of music information retrieval (MIR) has resulted in several singing voice databases. We review some of the widely used databases for different singing applications.

One widely used database in MIR is the large Dataset of synchronized Audio, LyrIcs and notes (DALI) (Meseguer-Brocal et al., 2018), which stands as a reference database for various music-related research directions, including lyrics-to-audio alignment and automatic singing voice recognition. It contains 5358 songs with background accompaniment, time-aligned lyrics, and time-aligned notes. With the increase in popularity of mobile phone karaoke applications, singing data collected from Smule's Sing! has been made available for research, referred to as the digital archive of mobile performances (DAMP) dataset (Sing!, 2010). It contains more than 34K a cappella (solo) singing recordings of 301 songs with textual lyrics. Gupta et al. (2018) provided automatically time-aligned lyrics for the DAMP dataset, which increases the usefulness of this large singing voice corpus.

Another important research direction in MIR is singing voice extraction from background accompaniment. The iKala dataset (Chan et al., 2015) is specifically designed for singing voice extraction, query by humming, and melody extraction. The music accompaniment and the singing voice are provided along with human-labeled pitch contours and timestamped lyrics. In a similar direction, the MUSDB18 dataset (Rafii et al., 2017) comprises 150 full-length music tracks (10 h in duration) of different genres, along with their isolated drums, bass, vocals, and other stems.

The Million Song Dataset (Bertin-Mahieux et al., 2011) is a freely available collection of audio features and metadata for a million contemporary popular music tracks, which can be used for various applications in MIR research. Other singing voice databases are targeted towards specific applications such as music genre classification (Bogdanov et al., 2019, Schedl et al., 2006, McKay et al., 2006), singer recognition (Ellis, 2007, Knees et al., 2004, Bertin-Mahieux et al., 2011, Sharma et al., 2019a), and music emotion recognition (Aljanaki et al., 2016, Soleymani et al., 2013, Aljanaki et al., 2017).

Apart from those mentioned above, there are several other databases with singing voice recordings. Most of the databases available for MIR research can be used for the analysis of acoustic characteristics of the singing voice, and to develop statistical models for related technologies.

For applications like speech-to-singing conversion, speaker identity conversion, and singing voice conversion, a comparative study of speech and singing voice is required. This necessitates a multi-speaker speech and singing voice parallel database.

Despite many attempts, there are very few databases that support an effective comparative study of speech and singing voices. The NUS sung and spoken lyrics (NUS-48E) corpus (Duan et al., 2013) provides parallel recordings of speech and singing. It is a phonetically annotated corpus with 48 English songs. The lyrics of each song are sung and spoken by 12 subjects representing a variety of voice types and accents. There are 20 unique songs, each of which is covered by at least one male and one female subject, for a total duration of 169 min. Although the NUS-48E corpus is annotated at the phone level, the very short duration of phones and the presence of co-articulation effects, especially in singing, mean that the manually identified phone boundaries may not be reliable. The corpus is also of limited size, which restricts its applicability in developing statistical models or mapping schemes for related applications.

The MIR-1K database (Hsu and Jang, 2009) consists of 1000 song clips extracted from 110 recordings of Chinese pop songs sung by amateur singers. The database also provides human-labeled pitch values, unvoiced sounds and vocal/non-vocal segments, lyrics, and speech recordings of the lyrics for each clip. The MIR-1K database contains a total of 133 min of sung audio, with music accompaniment in a separate channel, and is specifically designed for research on singing voice separation. As the singing vocals in MIR-1K (Hsu and Jang, 2009) are not sung by professional singers, the database may not be suitable for speech-to-singing conversion research.

Another speech and singing parallel database, in Chinese and referred to as the RSS database, has been published in Shi et al. (2019a, 2019b). The RSS database consists of parallel speech and singing data from 10 male and 10 female singers and is used for text-dependent speaker verification. However, it does not provide any transcription or word/phoneme boundary information.

A primitive version of the database introduced in this paper is presented in Gao et al. (2018), referred to as the NUS-HLT Spoken Lyrics and Singing (SLS) corpus. The SLS corpus is a collection of speech and singing recordings for 100 songs without any manual intervention or annotation. A comparison of different speech and singing parallel databases is shown in Table 1.

A relatively large, well-designed, multi-speaker, uniform, and precisely time-aligned publicly available speech and singing parallel database is of crucial importance to MIR research. It would allow for comparative, quantitative, and systematic studies of speech and singing, as well as application-oriented statistical modeling.

Speaking and singing voices are produced by the human vocal-tract system excited by the glottal excitation signal. Although both are generated by the same production system, they tend to be two very different kinds of signals (Lindblom and Sundberg, 2014). The major differences between speaking and singing voices lie in loudness, pitch, duration, and formant variations (Lindblom and Sundberg, 2014, Fujisaki, 1983, Sundberg, 1970). Studying parallel utterances of spoken and sung voices can shed light on the different configurations of the vocal-tract system and glottal excitation that render these two kinds of signals, even though the linguistic content remains the same. A coherent understanding of the human voice production mechanism behind speaking and singing can benefit numerous applications, such as singing voice synthesis, conversion of the voice identity of one singer to another, and conversion of read lyrics to singing. These applications are under rapid development and in soaring demand from the entertainment industry, which necessitates a well-designed speech and singing parallel database (Vijayan et al., 2018b).

In order to demonstrate the significance of a sufficiently large, manually annotated speech and singing parallel database for technology development, we discuss speech-to-singing conversion and its sub-components in broad terms. Speech-to-singing conversion is the process of converting natural speech into singing with the same lyrical content, while retaining the speaker characteristics and achieving the correct prosody structure of a high-quality song. This technology allows a user to synthesize personalized singing just by reading the lyrics of a song, which is very appealing to the general public (Dong et al., 2014, Cen et al., 2012, Gupta et al., 2019).
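For illustration only, the following sketch conveys the basic idea with the WORLD vocoder (via the pyworld package): the F0 contour of a spoken utterance is replaced by that of a frame-aligned reference singing, while the speaker's own spectral envelope and aperiodicity are retained. This is a minimal sketch under the assumption that the two recordings are already time-aligned; the file names are placeholders, and it is not the conversion system developed later in this paper.

```python
# Minimal speech-to-singing sketch: keep the speaker's spectral envelope and
# aperiodicity from the spoken lyrics, but borrow the F0 contour from a
# frame-aligned reference singing. Assumes numpy, pyworld and soundfile are
# installed; the wav file names are placeholders.
import numpy as np
import pyworld as pw
import soundfile as sf

speech, fs = sf.read("speech_aligned.wav")            # user's read lyrics (time-aligned)
singing, fs_sing = sf.read("singing_reference.wav")   # reference singing
assert fs == fs_sing, "both recordings are assumed to share one sampling rate"

speech = np.asarray(speech, dtype=np.float64)
singing = np.asarray(singing, dtype=np.float64)

# WORLD analysis of the speech: F0, spectral envelope, aperiodicity
f0_sp, t_sp = pw.harvest(speech, fs)
sp_env = pw.cheaptrick(speech, f0_sp, t_sp, fs)
ap_env = pw.d4c(speech, f0_sp, t_sp, fs)

# F0 contour of the reference singing
f0_sing, _ = pw.harvest(singing, fs)

# With pre-aligned inputs the frame counts should match closely;
# trim to the shorter sequence for this illustration.
n = min(len(f0_sp), len(f0_sing))

# Re-synthesize with the singing F0 but the speaker's own envelope/aperiodicity
converted = pw.synthesize(f0_sing[:n], sp_env[:n], ap_env[:n], fs)
sf.write("converted_singing_sketch.wav", converted, fs)
```

In practice, temporal alignment (Section 4) and spectral mapping (Section 5) are essential before such re-synthesis approaches the quality of natural singing.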

The basic idea of speech-to-singing conversion is to modify the prosodic features of the user's speech with respect to a reference singing, and then generate the singing voice while retaining the user's voice identity. To convert the user's speech into appropriate singing, the mapping of prosodic characteristics from speech to singing is crucial. This mapping can be performed with alignment information between frames of speech and singing, which we refer to as speech-to-singing alignment. The quality of the synthesized singing depends to a significant extent on the accuracy of this alignment. Alignment strategies require a comparative study of speech and singing. Despite several differences between speech and singing, such as in duration, pitch, and spectral energy content, correct alignment can be achieved by exploiting the commonalities between the two. To study the speaker-independent commonalities between speech and singing, and to evaluate the efficacy of such temporal alignment methods, a multi-speaker speech and singing parallel database is of crucial importance.
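A common, generic way to obtain such a frame-level mapping is dynamic time warping (DTW) over short-time spectral features. The sketch below aligns MFCC sequences of a spoken and a sung rendition of the same lyrics using librosa; it is only an illustrative recipe with placeholder file names, not the alignment method benchmarked in Section 4.

```python
# Generic DTW-based speech-to-singing alignment sketch (not the benchmark
# system of this paper): align MFCC frames of parallel speech and singing.
import librosa
import numpy as np

speech, sr = librosa.load("speech.wav", sr=16000)    # placeholder file names
singing, _ = librosa.load("singing.wav", sr=16000)

# Frame-level spectral features with a 10 ms hop at 16 kHz
mfcc_speech = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=20, hop_length=160)
mfcc_singing = librosa.feature.mfcc(y=singing, sr=sr, n_mfcc=20, hop_length=160)

# DTW over the cosine distance between frames; wp holds (singing, speech)
# frame-index pairs along the optimal warping path, returned end-to-start.
D, wp = librosa.sequence.dtw(X=mfcc_singing, Y=mfcc_speech, metric="cosine")
wp = np.flip(wp, axis=0)  # put the path in chronological order

# Frame-level mapping: for each singing frame, the matched speech frame
sing_to_speech = {int(i): int(j) for i, j in wp}
print(f"{len(wp)} aligned frame pairs, total DTW cost {D[-1, -1]:.2f}")
```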

Due to the prominent differences in the formant characteristics of speech and singing spectra, spectral transformation/mapping techniques have been proposed to achieve better naturalness in the converted singing voice (Gao et al., 2019), while retaining the speaker information. Such spectral mapping also necessitates a speech and singing parallel database with a sufficient amount of audio data.
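As a minimal illustration of what such a mapping involves, and not the spectral mapping approach evaluated in this paper, one could fit a simple frame-wise regression from time-aligned speech spectra to singing spectra. The sketch below assumes pre-computed, DTW-aligned log-mel frames stored as NumPy arrays (hypothetical file names).

```python
# Illustrative frame-wise spectral mapping: learn a regression from aligned
# speech log-mel frames to singing log-mel frames (not the mapping method
# evaluated in this paper). File names are hypothetical placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

speech_mel = np.load("aligned_speech_mel.npy")    # shape: (num_frames, n_mels)
singing_mel = np.load("aligned_singing_mel.npy")  # same shape, DTW-aligned
assert speech_mel.shape == singing_mel.shape

# A small multilayer perceptron as the frame-wise mapping function
mapper = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200)
mapper.fit(speech_mel, singing_mel)

# At conversion time, predict singing-like spectra for unseen speech frames
predicted = mapper.predict(speech_mel[:100])
print(predicted.shape)
```

The point here is only the shape of the learning problem, namely paired, frame-aligned spectra; stronger, speaker-independent sequence models are discussed in Section 5.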

Furthermore, the unique features of singing vocals (Sundberg and Rossing, 1990, Dayme, 2009) and their perceptual effects (Saitou et al., 2004, Ohishi et al., 2006) are important research directions in the field of speech-to-singing conversion. The singing voice recorded from professional singers can also be used as a reference singing template for prosody generation in singing voice synthesis.

To precisely evaluate the effectiveness of an existing speech-to-singing conversion method, we can compare the singing converted from the user's speech to the user's own singing. Therefore, evaluation of singing voice synthesis or singing voice conversion can benefit greatly from such a database.

Development of the efficient statistical and adaptive systems mentioned above requires databases of parallel speech and singing utterances. A database of professional singers' singing and the corresponding speech for multiple songs would facilitate research in speech-to-singing conversion in terms of reference prosody generation, speech-to-singing alignment, spectral mapping, conversion methodology, and evaluation. As mentioned earlier, there are very few publicly available databases in this area to advance research on a common platform.

With this outlook on the importance of a speech and singing parallel database, we propose the NHSS database, which consists of spoken and sung lyrics of English popular songs along with manually labeled utterance- and word-level annotations. In this work, we also conduct studies on speech-to-singing alignment, spectral mapping, and speech-to-singing conversion, where we use existing methods to develop reference systems and report the findings. With these applications, we validate the usability of the NHSS database. This work may serve as a common platform for various applications that require parallel speech and singing recordings with the same lyrical content.

In Gao et al. (2018), the ongoing effort to collect and design the NHSS database is briefly explained from the perspective of its significance in different research areas. The work presented here differs significantly from Gao et al. (2018) in terms of the design of the complete database with manual intervention, time-aligned word/utterance annotation, detailed statistical analysis, and evaluation of related baseline systems, as can be seen from Table 1.

The rest of the paper is organized as follows. In Section 2, we explain the collection, processing, and organization of the NHSS database. The acoustic analysis and comparison of characteristics of speech and singing are described in Section 3. The methods for speech-to-singing alignment with respect to the NHSS database are presented and evaluated in Section 4. In Section 5, we present different approaches for spectral mapping, followed by the development of speech-to-singing conversion systems. Finally, we conclude the work in Section 6.


NUS-HLT speak–sing (NHSS) database

The NHSS database consists of 100 songs from 10 professional singers, each of them singing and speaking 10 songs. The database provides 4.75 h of sung and 2.25 h of spoken data. In total, we have 7 h of audio data with manually annotated utterance and word boundaries.
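To illustrate how such word-level annotations of parallel recordings might be consumed in downstream experiments, the sketch below parses a hypothetical "start end word" annotation format and compares word durations between sung and spoken renditions; the directory layout and file format shown here are placeholders, not the actual NHSS release format described in this section.

```python
# Hypothetical sketch of consuming word-level annotations of parallel
# recordings. The 'start end word' text format and the directory layout used
# here are placeholders, not the actual NHSS release format.
from dataclasses import dataclass

@dataclass
class WordSegment:
    word: str
    start: float  # seconds
    end: float    # seconds

def read_word_segments(path):
    """Parse a simple 'start end word' annotation file (hypothetical format)."""
    segments = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            start, end, word = line.strip().split(maxsplit=2)
            segments.append(WordSegment(word, float(start), float(end)))
    return segments

# The database design enforces one-to-one word correspondence between the
# sung and spoken renditions, so paired iteration is straightforward.
sung = read_word_segments("singer01/song01/sing_words.txt")
read = read_word_segments("singer01/song01/speech_words.txt")
for s, r in zip(sung, read):
    ratio = (s.end - s.start) / max(r.end - r.start, 1e-6)
    print(f"{s.word}: sung/spoken duration ratio = {ratio:.2f}")
```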

In the design of this database, we emphasize several factors such as audio quality, singer selection, one-to-one correspondence of words/utterances between spoken and sung audio, accurate word boundaries, usability for various

Analysis of acoustic properties

In this section, we discuss and analyze different acoustic characteristics of speech and singing signals with reference to the NHSS database.
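As a concrete starting point for such comparisons, the brief sketch below extracts F0 contours from a parallel spoken and sung utterance with librosa's pYIN implementation and summarizes their ranges; it is an illustrative recipe with placeholder file names, not the exact analysis procedure used in this section.

```python
# Illustrative F0-range comparison between a spoken and a sung rendition of
# the same lyrics; file names are placeholders.
import librosa
import numpy as np

def f0_stats(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]  # keep voiced frames only
    return np.median(f0), f0.min(), f0.max()

for label, path in [("speech", "speech.wav"), ("singing", "singing.wav")]:
    med, lo, hi = f0_stats(path)
    print(f"{label}: median F0 {med:.1f} Hz, range {lo:.1f}-{hi:.1f} Hz")
```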

Singing is a form of art that is rich in the expression of emotions, providing exciting and soothing feelings to listeners. The expressiveness in singing is generally delivered by singers using a wide range of variations in energy, duration, pitch, and spectrum. A trained singer is able to maneuver the subglottal pressure efficiently, thereby controlling

Speech-to-singing alignment

In this section, we discuss methods for speech-to-singing alignment and provide benchmark results on the NHSS database.

In order to study the relative characteristics between speech and singing and to facilitate conversion from one to another, we often need to establish a frame-level mapping between the two. This mapping between the frames of speech and singing is referred to as speech-to-singing alignment, which can be useful in speech-to-singing conversion. It has been observed that as the

Speech-to-singing conversion

We carry out another experiment on NHSS for speech-to-singing conversion, which serves as a reference system for readers. In speech-to-singing conversion, we convert speech read by a user (the user's speech) into his/her singing with the same lyrical content. The basic idea of speech-to-singing conversion is to transform the prosody and spectral features from the user's speech to those of the reference singing while preserving the speaker identity of the user. In this work, we particularly focus

Conclusion

In this paper, we present the development of the NHSS database, which consists of 7 h of speech and singing parallel recordings from multiple singers, along with utterance- and word-level annotations for the entire database. We discuss data collection, processing, and organization of the database in detail. We analyze some similar and distinctive prosodic characteristics of speech and singing with respect to the presented database.

We demonstrate that the NHSS database can be potentially used in

CRediT authorship contribution statement

Bidisha Sharma: Writing – original draft, Methodology, Investigation, Visualization. Xiaoxue Gao: Investigation, Writing – original draft. Karthika Vijayan: Visualization, Investigation. Xiaohai Tian: Writing – original draft, Visualization. Haizhou Li: Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research work is supported by the Academic Research Council, Ministry of Education (ARC, MOE) under Grant MOE2018-T2-2-127 (Learning Generative and Parameterized Interactive Sequence Models with RNNs).

References (84)

  • Carletta, J., et al. The AMI meeting corpus: A pre-announcement.
  • Cen, L., Dong, M., Chan, P., 2011. Segmentation of speech signals in template-based speech to singing conversion. In: ...
  • Cen, L., et al. Template-based personalized singing voice synthesis.
  • Chan, T.-S., Yeh, T.-C., Fan, Z.-C., Chen, H.-W., Su, L., Yang, Y.-H., Jang, R., 2015. Vocal activity informed singing ...
  • Dayme, M.A., 2009. Dynamics of the Singing Voice.
  • Dehak, N., et al., 2010. Front-end factor analysis for speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Dong, M., Lee, S.W., Li, H., Chan, P., Peng, X., Ehnes, J.W., Huang, D., 2014. I2R speech2singing perfects everyone's ...
  • Duan, Z., et al. The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech.
  • Ellis, D.P., 2007. Classifying music audio with timbral and chroma features. In: International Society for Music ...
  • Fujisaki, H. Dynamic characteristics of voice fundamental frequency in speech and singing.
  • Gao, X., Sisman, B., Das, R.K., Vijayan, K., 2018. NUS-HLT spoken lyrics and singing, SLS corpus. In: International ...
  • Gao, X., Tian, X., Das, R.K., Zhou, Y., Li, H., 2019. Speaker-independent spectral mapping for speech-to-singing ...
  • Garofolo, J.S., et al., 1993. TIMIT Acoustic Phonetic Continuous Speech Corpus.
  • Godfrey, J.J., et al. SWITCHBOARD: Telephone speech corpus for research and development.
  • Grey, J.M., et al., 1978. Perceptual effects of spectral modifications on musical timbres. J. Acoust. Soc. Am.
  • Gupta, C., et al., 2020. Automatic leaderboard: Evaluation of singing quality without a standard reference. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Gupta, C., Tong, R., Li, H., Wang, Y., 2018. Semi-supervised lyrics and solo-singing alignment. In: International ...
  • Gupta, C., Vijayan, K., Sharma, B., Gao, X., Li, H., 2019. NUS speak-to-sing: A web platform for personalized ...
  • Gupta, C., et al. Automatic lyrics alignment and transcription in polyphonic music: Does background music help?
  • Henrich, N., et al., 2011. Vocal tract resonances in singing: Strategies used by sopranos, altos, tenors, and baritones. J. Acoust. Soc. Am.
  • Hirsch, H.-G., Pearce, D., 2000. The Aurora experimental framework for the performance evaluation of speech recognition ...
  • Hsu, C.L., et al., 2009. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Ito, K., 2017. The LJ Speech dataset.
  • King, S.J., et al., 2011. The Blizzard Challenge 2011.
  • King, S.J., et al., 2013. The Blizzard Challenge 2013.
  • Knees, P., Pampalk, E., Widmer, G., 2004. Artist classification with web-based data. In: International Society for ...
  • Kominek, J., Black, A.W., 2004. The CMU Arctic speech databases. In: Fifth ISCA Workshop on Speech Synthesis. SSW5. pp. ...
  • Lee, K.A., Larcher, A., Wang, G., Kenny, P., Brümmer, N., Leeuwen, D.v., Aronowitz, H., Kockmann, M., Vaquero, C., Ma, ...
  • Leonard, R. A database for speaker-independent digit recognition.
  • Lindblom, B., et al. The human voice in speech and singing.
  • Martin, A.F., Greenberg, C.S., 2009. NIST 2008 speaker recognition evaluation: Performance across telephone and room ...
  • McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M., 2017. Montreal forced aligner: Trainable ...