Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Tejedor, Javier; Toledano, Doroteo T.; Lopez-Otero, Paula; Docio-Fernandez, Laura; Peñagarikano, Mikel; Rodriguez-Fuentes, Luis Javier; Moreno-Sandoval, Antonio

doi:10.1186/s13636-019-0156-x

Research
Open access
Published: 19 July 2019

Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Javier Tejedor ORCID: orcid.org/0000-0001-7699-5620¹,
Doroteo T. Toledano²,
Paula Lopez-Otero³,
Laura Docio-Fernandez⁴,
Mikel Peñagarikano⁵,
Luis Javier Rodriguez-Fuentes⁵ &
…
Antonio Moreno-Sandoval⁶

EURASIP Journal on Audio, Speech, and Music Processing volume 2019, Article number: 13 (2019) Cite this article

3756 Accesses
3 Citations
Metrics details

Abstract

The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research on this area is continuously fostered with the organization of QbE STD evaluations. This paper presents a multi-domain internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: MAVIR database, which comprises a set of talks from workshops; RTVE database, which includes broadcast television (TV) shows; and COREMAH database, which contains 2-people spontaneous speech conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task is still in progress, and the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.

1 Introduction

The huge amount of information stored in audio and audiovisual repositories makes it necessary to develop efficient methods for search on speech (SoS). Significant research has been carried out in this area from spoken document retrieval (SDR) [1–6], keyword spotting (KWS) [7–12], spoken term detection (STD) [13–18], and Query-by-Example Spoken Term Detection (QbE STD) [19–26] tasks.

STD aims to find terms within audio archives. It is based on a text-based input, commonly the word/phone transcription of the search term, and hence, STD is also called text-based STD. Query-by-Example Spoken Term Detection also aims to search within audio archives but is based on an acoustic (spoken) input. This is a highly valuable alternative for visually impaired people or when using devices that do not have a text-based input (such as smart speakers), and consequently, the query must be given in another format such as speech.

STD systems typically comprise three different stages: (1) the audio is decoded into word/subword lattices using an automatic speech recognition (ASR) subsystem trained for the target language; (2) a term detection subsystem searches the terms within those word/subword lattices to hypothesize detections; and (3) confidence measures are computed to rank detections. The STD systems are normally language-dependent and require large amounts of resources to be built.

On the other hand, QbE STD has been traditionally addressed using three different approaches: methods based on the word/subword transcription of the query, methods based on template matching of features, and hybrid approaches. These approaches are described below.

1.1 Methods based on the word/subword transcription of the spoken query

In these methods, first, the spoken query is decoded using an ASR system and then a text-based STD approach is employed to hypothesize detections. The errors produced in the transcription of the query can lead to significant performance degradation. In [21] and [27], the authors employ a Viterbi-based search on hidden Markov models (HMMs). In other works [19, 28–30] dynamic time warping (DTW) or variants of DTW are applied (e.g., non-segmental dynamic time warping (NS-DTW)) to align phone sequences. More sophisticated approaches [20, 31–33] employ word and syllable speech recognizers. In [34], the authors employ a phone-based speech recognizer and weighted finite state transducer (WFST)-based search, whereas in [35], they apply multilingual phone-based speech recognition from supervised and unsupervised acoustic models and sequential dynamic time warping for search. The works [36–38] propose the discovery of unsupervised acoustic features (e.g., bottleneck features) and unsupervised acoustic units for query/utterance representation, and [39] and the work by (Lopez-Otero et al.: Probabilistic information retrieval models for query-by-example spoken document retrieval, submitted to Multimed. Tools Appl.) make use of information retrieval models for QbE STD employing ASR.

1.2 Methods based on template matching

In these methods, sequences of feature vectors are extracted from both the input spoken queries and the utterances, which are then used in the search stage to hypothesize detections. Regarding the features used for query/utterance representation, Gaussian posteriorgrams are employed in [22, 29, 40, 41]; an i-vector-based approach for feature extraction is proposed in [42]; phone log-likelihood ratio-based features are used in [43]; posteriorgrams derived from various unsupervised tokenizers, supervised tokenizers, and semi-supervised tokenizers are employed in [44]; and posteriorgrams derived from a Gaussian mixture model (GMM) tokenizer, phoneme recognition, and acoustic segment modeling are used in [45]. Phoneme posteriorgrams have been widely used [34, 41, 46–54] and bottleneck features as well [34, 55–60]. Posteriorgrams from non-parametric Bayesian models are used in [61], articulatory class-based posteriorgrams are employed in [62], intrinsic spectral analysis is proposed in [63], unsupervised segment-based bag of acoustic words is employed in [64], and [65] is based on the sparse subspace modeling of posteriorgrams. An exhaustive feature set is proposed in [66], which includes Mel-frequency cepstral coefficients (MFCCs), spectral entropy, fundamental frequency, among others.

All these studies employ the standard DTW algorithm for query search, except for [40], which employs the NS-DTW algorithm, [41, 50, 51, 53, 56, 59, 61, 66] which employ the subsequence DTW (S-DTW) algorithm, [22] which presents a variant of the S-DTW algorithm, and [52] which employs the segmental DTW algorithm. An interesting alternative is [54] which proposes the use of hashing of the phone posteriors to speed-up search and to enable searching on massively large datasets.

These template matching-based methods were found to outperform subword transcription-based techniques in QbE STD [67] and can be effectively employed to build language-independent STD systems, since prior knowledge of the language involved in the speech data is not necessary.

1.3 Hybrid methods

These methods take advantage of the text-based STD approach and the approaches based on template matching by combining them to hypothesize detections. A powerful way of enhancing the performance relies on building hybrid (fused) systems that combine the two individual methods. Logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers is presented in [68–70]. An information retrieval technique to hypothesize detection and DTW-based score detection are proposed in [39]. Logistic regression-based fusion on DTW and phone-based systems is employed in [71–74]. DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring are employed in [75], and [76] adds a logistic regression-based approach for detection rescoring. Finally, [77] employs a syllable-based speech recognizer and dynamic programming at the triphone state level to output detections and DNN posteriorgram-based rescoring.

2 Methods

Research carried out in a certain area may be difficult to compare in the absence of a common evaluation framework. In QbE STD, research also suffers from this issue since the published systems typically employ different acoustic databases and different lists of queries that make system comparison impossible. In this context, international evaluations provide a unique framework to measure the progress of any technology, such as QbE STD in this case.

ALBAYZIN evaluation campaigns comprise an internationally open set of evaluations supported by the Spanish Thematic Network on Speech Technologies (RTTH^{Footnote 1}) and the ISCA Special Interest Group on Iberian Languages (SIG-IL^{Footnote 2}), which have been held biennially since 2006. These evaluation campaigns provide an objective mechanism to compare different systems and are a powerful way to promote research on different speech technologies [78–87].

Spanish is a major language in the world, and significant research has been conducted on it for ASR, KWS, and STD tasks [88–94]. The increasing interest in SoS around the world and the lack of SoS evaluations dealing with Spanish encouraged us to organize a series of QbE STD evaluations starting in 2012 and held biennially until 2018, aiming to evaluate the progress in this technology for Spanish. Each evaluation has been extended by incorporating new challenges. The main novelty of the fourth ALBAYZIN QbE STD evaluation is the addition of a new data domain, namely broadcast television (TV) shows, with the inclusion of shows from the Spanish public television Radio Televisión Española (RTVE). In addition, a novel conversational speech database has also been used to assess the validity of the submitted systems in an unseen data domain. Moreover, the queries used in one of the databases (MAVIR) in the ALBAYZIN 2016 QbE STD evaluation were kept to enable a straightforward comparison of the systems submitted to both evaluations.

The main objectives of this evaluation can be summarized as follows:

Organize the first Spanish QbE STD multi-domain evaluation whose systems are ranked according to different databases and different domains
Provide an evaluation and benchmark with increasing complexity in the search queries compared to the previous ALBAYZIN QbE STD evaluations

This evaluation is suitable for research groups/companies that work in speech recognition.

This paper is organized as follows: First, Section 3 presents the evaluation and a comparison with other QbE STD evaluations. Then, Section 4, the different systems submitted to the evaluation, along with a text-based STD system, are presented. Evaluation results and discussion are presented in Section 5, which includes the corresponding paired t tests [95] as statistical significance measure for system comparison. The Section 6 presents a post-evaluation analysis based on some properties of the queries and the fusion of the primary systems submitted to the evaluation. The last section outlines the main conclusions of the paper.

3 ALBAYZIN 2018 QbE STD evaluation

3.1 Evaluation overview

This evaluation involves searching queries given in spoken form within speech data, by indicating the appropriate audio files with the occurrences and timestamps that contain any of those queries.

The evaluation consists in searching different query lists within different sets of speech data. Speech data comprise different domains (workshop talks, broadcast TV shows, and 2-people conversations), for which individual datasets are given. The ranking of the evaluation results is based on the average system performance on the three datasets in the test experiments.

Two different types of queries are defined in this evaluation, in-vocabulary (INV) and out-of-vocabulary (OOV) queries. The OOV query set was defined to simulate the out-of-vocabulary words of a large vocabulary continuous speech recognition (LVCSR) system. In case participants employ LVCSR for processing the audio, these OOV words must be previously removed from the system dictionary, and hence, other methods have to be used for searching OOV queries. On the other hand, the INV queries could appear in the LVCSR system dictionary.

Participants could submit a primary system and up to four contrastive systems. No manual intervention was allowed for each developed system to generate the final output file, and hence, all the systems had to be fully automatic [96].

About 3 months were given to the participants for system development, and therefore, the QbE STD evaluation focuses on building QbE STD systems in a limited period of time. The training, development, and test data were released to the participants at different times. Training and development data were released by the end of June 2018. The test data were released by the beginning of September 2018. The final system submission was due by mid-October 2018. Final results were discussed at IberSPEECH 2018 conference by the end of November 2018.

3.2 Evaluation metrics

In QbE STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise it is called a false alarm (FA). If an actual occurrence is not detected, it is called a miss. The actual term-weighted value (ATWV) metric proposed by the National Institute of Standards and Technology (NIST) [96] has been used as the main metric for the evaluation. This metric integrates the hit rate and false alarm rate of each query into a single metric and then averages over all the queries:

$$ \text{ATWV}=\frac{1}{|\Delta|}\sum_{K \in \Delta}{\left(\frac{N^{K}_{\text{hit}}}{N^{K}_{\text{true}}} - \beta \frac{N^{K}_{\text{FA}}}{T-N^{K}_{\text{true}}}\right)}, $$

(1)

where Δ denotes the set of queries and |Δ| is the number of queries in this set. $N^{K}_{\text {hit}}$ and $N^{K}_{\text {FA}}$ represent the numbers of hits and false alarms of query K, respectively, and $N^{K}_{\text {true}}$ is the number of actual occurrences of K in the audio. T denotes the audio length in seconds, and β is a weight factor set to 999.9 as in [97]. This weight factor causes an emphasis placed on recall compared to precision with a ratio 10:1.

ATWV represents the term-weighted value (TWV) for a threshold given by the QbE STD system (usually tuned on development data). An additional metric, called maximum term-weighted value (MTWV) [96], can also be used to evaluate the performance of a QbE STD system. MTWV is the maximum TWV obtained by the QbE STD system for all possible thresholds, and hence does not depend on the tuned threshold. Therefore, MTWV represents an upper bound of the performance obtained by the QbE STD system. Results based on this metric are also presented to evaluate the system performance regardless of the decision threshold.

In addition to ATWV and MTWV, NIST also proposed a detection error tradeoff (DET) curve [98] to evaluate the performance of a QbE STD system working at various miss/FA ratios. Although DET curves were not used for the evaluation itself, they are also presented in this paper for system comparison.

In this work, the NIST STD evaluation tool [99] was employed to compute MTWV, ATWV, and DET curves.

3.3 Databases

Three different databases that comprise different acoustic conditions and domains have been employed for the evaluation: (1) MAVIR database, which was employed in all the previous ALBAYZIN QbE STD evaluations, is used for comparison purposes; (2) RTVE database, which consists of different programs recorded from the Spanish public television (Radio Televisión Española) and involves different broadcast TV shows; (3) COREMAH database, which contains conversational speech with two speakers per recording. For MAVIR and RTVE databases, three separate datasets (i.e., training, development, and test) were provided to the participants. For COREMAH database, only test data were provided. This allowed measuring the generalization capability of the systems in an unseen data domain. Tables 1, 2, and 3 include some database features such as the division into training, development, and test data of the speech files; the number of word occurrences; duration; the number of speakers; and average mean opinion score (MOS) [100] as a way to get an idea of the quality of each speech file in the different databases.

Table 1 Characteristics of the MAVIR database. Number of word occurrences (#occ.), duration (dur.) in minutes (min), number of speakers (#spk.), and average MOS (Ave. MOS)

Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Abstract

1 Introduction

1.1 Methods based on the word/subword transcription of the spoken query

1.2 Methods based on template matching

1.3 Hybrid methods

2 Methods

3 ALBAYZIN 2018 QbE STD evaluation

3.1 Evaluation overview

3.2 Evaluation metrics

3.3 Databases

3.3.1 MAVIR

3.3.2 RTVE

3.3.3 COREMAH

3.3.4 Query list selection

3.4 Comparison to other QbE STD international evaluations

4 Systems

4.1 A-Hybrid DTW+LVCSR system

4.1.1 Feature extraction in DTW-based systems

4.1.2 Query detection

4.1.3 LVCSR-based QbE STD

4.1.4 Calibration and fusion

4.2 B-Fusion DTW system

4.3 C-Phoneme-posteriorgram DTW system (C-PhonePost DTW)

4.4 D-LVCSR system

4.5 E-DTW system

4.5.1 Feature extraction

4.5.2 Query detection

4.6 F-Combined DTW system

4.6.1 Voice activity detection

4.6.2 Feature extraction

4.6.3 Dynamic time warping-based search

4.6.4 Calibration and fusion

4.7 G-Super bottleneck feature DTW system (G-Super-BNF DTW)

4.8 H-Multilingual bottleneck feature DTW system (H-Multilingual-BNF DTW)

4.9 I-Monophone bottleneck feature DTW system (I-Monoph.-BNF DTW)

4.10 J-Triphone bottleneck feature DTW system (J-Triph.-BNF DTW)

4.11 K-Text STD system

5 Evaluation results and discussion

5.1 Overall results

5.2 Development data

5.2.1 MAVIR

5.2.2 RTVE

5.3 Test data

5.3.1 MAVIR

5.3.2 RTVE

5.3.3 COREMAH

5.4 Analysis of development and test data DET curves

6 Post-evaluation analysis

6.1 Performance analysis of QbE STD systems for in-language and out-of-language queries

6.2 Performance analysis of QbE STD systems for single and multi-word queries

6.3 Performance analysis of QbE STD systems for INV and OOV queries

6.4 System fusion

6.4.1 Pre-processing

6.4.2 Calibration and fusion

6.4.3 Fusion results

6.5 Comparison to the ALBAYZIN 2016 QbE STD evaluation

6.6 Towards a language-independent STD system

7 Conclusions

Availability of data and materials

Notes

Abbreviations

References

Funding

Author information

Authors and Affiliations

Contributions

Authors’ contributions

Authors’ information

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords