
-
Dynamic out-of-vocabulary word registration to language model for speech recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2021-01-25 Norihide Kitaoka; Bohan Chen; Yuya Obashi
We propose a method of dynamically registering out-of-vocabulary (OOV) words by assigning the pronunciations of these words to pre-inserted OOV tokens, that is, by editing the pronunciations of the tokens. To do this, we add OOV tokens to an additional, partial copy of our corpus, either randomly or at selected part-of-speech (POS) tags in the chosen utterances, when training the language model (LM) for speech recognition
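As a rough illustration of adding OOV tokens to a partial copy of the training corpus, the sketch below inserts a placeholder token either at random positions or at selected POS tags. The token name, insertion rate, and tag set are illustrative assumptions, not the paper's settings.

```python
import random

OOV_TOKEN = "<OOV>"  # hypothetical marker; the paper's actual token name is not specified here

def insert_oov_tokens(sentences, rate=0.1, target_pos=None, seed=0):
    """Return an additional, partial copy of the corpus with OOV tokens inserted.

    sentences: list of lists of (word, pos_tag) pairs.
    rate: fraction of eligible positions that receive an OOV token.
    target_pos: if given, only words with these POS tags are replaced;
                otherwise positions are chosen at random.
    """
    rng = random.Random(seed)
    augmented = []
    for sent in sentences:
        new_sent = []
        for word, pos in sent:
            eligible = (target_pos is None) or (pos in target_pos)
            if eligible and rng.random() < rate:
                new_sent.append(OOV_TOKEN)   # stand-in that later receives a real pronunciation
            else:
                new_sent.append(word)
        augmented.append(new_sent)
    return augmented

corpus = [[("open", "VB"), ("the", "DT"), ("window", "NN")],
          [("play", "VB"), ("music", "NN")]]
print(insert_oov_tokens(corpus, rate=0.5, target_pos={"NN"}))
```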
-
Time–frequency scattering accurately models auditory similarities between instrumental playing techniques EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2021-01-11 Vincent Lostanlen; Christian El-Hajj; Mathias Rossignol; Grégoire Lafay; Joakim Andén; Mathieu Lagrange
Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, in both classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called “ordinary” technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of
-
Forward-backward recursive expectation-maximization for concurrent speaker tracking EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2021-01-09 Yuval Dorfan; Boaz Schwartz; Sharon Gannot
In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, at the cost of a short additional latency. Unlike classical target
-
Progressive loss functions for speech enhancement with deep neural networks EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2021-01-07 Jorge Llombart; Dayana Ribas; Antonio Miguel; Luis Vicente; Alfonso Ortega; Eduardo Lleida
The progressive paradigm is a promising strategy to optimize network performance for speech enhancement purposes. Recent works have shown different strategies to improve the accuracy of speech enhancement solutions based on this mechanism. This paper studies the progressive speech enhancement using convolutional and residual neural network architectures and explores two criteria for loss function optimization:
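A minimal sketch of the progressive idea, assuming a PyTorch toy model: each stage emits an intermediate estimate, and the loss is a weighted sum of per-stage errors. The architecture, stage count, and weights are placeholders, not the networks or criteria studied in the paper.

```python
import torch
import torch.nn as nn

class ProgressiveEnhancer(nn.Module):
    """Toy progressive enhancement stack: each stage refines the previous estimate."""
    def __init__(self, n_feats=257, n_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Linear(n_feats, n_feats), nn.ReLU(),
                          nn.Linear(n_feats, n_feats))
            for _ in range(n_stages))

    def forward(self, noisy):
        outputs, x = [], noisy
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)          # keep every intermediate estimate
        return outputs

def progressive_loss(outputs, clean, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of per-stage MSE losses; later stages weigh more heavily."""
    return sum(w * nn.functional.mse_loss(o, clean) for w, o in zip(weights, outputs))

noisy = torch.randn(8, 257)
clean = torch.randn(8, 257)
loss = progressive_loss(ProgressiveEnhancer()(noisy), clean)
loss.backward()
```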
-
Binaural speaker identification using the equalization-cancelation technique EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-12-03 Masoud Geravanchizadeh; Sina Ghalamiosgouei
In real applications, environmental effects such as additive noise and room reverberation lead to a mismatch between training and testing signals that substantially reduces the performance of far-field speaker identification. As a solution to this mismatch problem, in this paper, a new binaural speaker identification system is proposed which employs the well-known equalization-cancelation technique
-
Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-12-02 Shahin Amiriparian; Maurice Gerczuk; Sandra Ottl; Lukas Stappen; Alice Baird; Lukas Koebe; Björn Schuller
In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is directly trained on Mel-spectrograms of the audio samples. For the pre-trained
-
A simulation study on optimal scores for speaker recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-11-25 Dong Wang
In this article, we conduct a comprehensive simulation study for the optimal scores of speaker recognition systems that are based on speaker embedding. For that purpose, we first revisit the optimal scores for the speaker identification (SI) task and the speaker verification (SV) task in the sense of minimum Bayes risk (MBR) and show that the optimal scores for the two tasks can be formulated as a
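For context, a common embedding-based score for SI and SV is the cosine similarity; the NumPy sketch below (with random embeddings) shows closed-set identification by picking the highest-scoring enrolled speaker. It illustrates the scoring setup only, not the MBR-optimal scores derived in the paper.

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between two speaker embeddings (a common SV scoring rule)."""
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(np.dot(enroll, test))

def identify(test, enrolled):
    """Closed-set SI: pick the enrolled speaker with the highest score."""
    scores = {spk: cosine_score(emb, test) for spk, emb in enrolled.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
enrolled = {f"spk{i}": rng.normal(size=128) for i in range(3)}
print(identify(rng.normal(size=128), enrolled)[0])
```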
-
Depression-level assessment from multi-lingual conversational speech data using acoustic and text features EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-11-17 Cenk Demiroglu; Aslı Beşirli; Yasin Ozkanca; Selime Çelik
Depression is a widespread mental health problem around the world with a significant burden on economies. Its early diagnosis and treatment are critical to reduce the costs and even save lives. One key aspect to achieve that goal is to use technology and monitor depression remotely and relatively inexpensively using automated agents. There have been numerous efforts to automatically assess depression
-
DOANet: a deep dilated convolutional neural network approach for search and rescue with drone-embedded sound source localization EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-11-05 Alif Bin Abdul Qayyum; K. M. Naimul Hassan; Adrita Anika; Md. Farhan Shadiq; Md Mushfiqur Rahman; Md. Tariqul Islam; Sheikh Asif Imran; Shahruk Hossain; Mohammad Ariful Haque
Drone-embedded sound source localization (SSL) has an interesting application perspective in challenging search and rescue scenarios with bad lighting conditions or occlusions. However, the problem is complicated by severe drone ego-noise that may result in negative signal-to-noise ratios in the recorded microphone signals. In this paper, we present our work on drone-embedded SSL using recordings
-
Steerable differential beamformers with planar microphone arrays EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-11-04 Gongping Huang; Jingdong Chen; Jacob Benesty; Israel Cohen; Xudong Zhao
Humanoid robots need microphone arrays to acquire speech signals from the human communication partner while suppressing noise, reverberation, and interference. Unlike many other applications, microphone arrays in humanoid robots face restrictions in size and geometry. To address these challenges, this paper presents an approach to differential beamforming with arbitrary planar
-
Multichannel speaker interference reduction using frequency domain adaptive filtering EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-11-04 Patrick Meyer; Samy Elshamy; Tim Fingscheidt
Microphone leakage or crosstalk is a common problem in multichannel close-talk audio recordings (e.g., meetings or live music performances), which occurs when a target signal couples not only into its dedicated microphone but also into all other microphone channels. For further signal processing, such as automatic transcription of a meeting, multichannel speaker interference reduction is required
-
Noise power spectral density scaled SNR response estimation with restricted range search for sound source localisation using unmanned aerial vehicles EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-09-22 Benjamin Yen; Yusuke Hioka
A method to locate sound sources using an audio recording system mounted on an unmanned aerial vehicle (UAV) is proposed. The method introduces extension algorithms to apply on top of a baseline approach, which performs localisation by estimating the peak signal-to-noise ratio (SNR) response in the time-frequency and angular spectra with the time difference of arrival information. The proposed extensions
-
Estimation of acoustic echoes using expectation-maximization methods EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-08-08 Usama Saqib; Sharon Gannot; Jesper Rindom Jensen
Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications
-
Motor data-regularized nonnegative matrix factorization for ego-noise suppression EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-07-31 Alexander Schmidt; Andreas Brendel; Thomas Haubner; Walter Kellermann
Ego-noise, i.e., the noise a robot causes by its own motions, significantly corrupts the microphone signal and severely impairs the robot’s capability to interact seamlessly with its environment. Therefore, suitable ego-noise suppression techniques are required. For this, it is intuitive to also use motor data collected by proprioceptors mounted on the joints of the robot, since it describes the physical
-
A depthwise separable convolutional neural network for keyword spotting on an embedded system EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-06-25 Peter Mølgaard Sørensen; Bastian Epp; Tobias May
A keyword spotting algorithm implemented on an embedded system using a depthwise separable convolutional neural network classifier is reported. The proposed system was derived from a high-complexity system with the goal to reduce complexity and to increase efficiency. In order to meet the requirements set by hardware resource constraints, a limited hyper-parameter grid search was performed, which showed
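A depthwise separable convolution can be written compactly in PyTorch as a depthwise Conv2d (groups equal to the input channels) followed by a 1x1 pointwise convolution. The block below is a generic sketch, not the paper's keyword-spotting architecture or hyper-parameters.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (groups=in_channels) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.pointwise(self.depthwise(x))))

# e.g. a batch of feature maps over 40-band Mel spectrogram patches (channels-first)
x = torch.randn(4, 64, 40, 100)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([4, 128, 40, 100])
```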
-
Joint speaker localization and array calibration using expectation-maximization EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-06-09 Yuval Dorfan; Ofer Schwartz; Sharon Gannot
Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Owing to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple speaker localization and array calibration task, we propose a
-
Ensemble of convolutional neural networks to improve animal audio classification EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-05-26 Loris Nanni; Yandre M. G. Costa; Rafael L. Aguiar; Rafael B. Mangolin; Sheryl Brahnam; Carlos N. Silla
In this work, we present an ensemble for automated audio classification that fuses different types of features extracted from audio files. These features are evaluated, compared, and fused with the goal of producing better classification accuracy than other state-of-the-art approaches without ad hoc parameter optimization. We present an ensemble of classifiers that performs competitively on different
-
Quadratic approach for single-channel noise reduction EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-04-15 Gal Itzhak; Jacob Benesty; Israel Cohen
In this paper, we introduce a quadratic approach for single-channel noise reduction. The desired signal magnitude is estimated by applying a linear filter to a modified version of the observations’ vector. The modified version is constructed from a Kronecker product of the observations’ vector with its complex conjugate. The estimated signal magnitude is multiplied by a complex exponential whose phase
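The quadratic construction can be illustrated with NumPy: the Kronecker product of an observation vector with its complex conjugate yields a modified vector on which a linear filter acts as a quadratic function of the original observations. Dimensions and the filter below are arbitrary placeholders; the paper's actual filter design is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4
y = rng.normal(size=L) + 1j * rng.normal(size=L)   # stand-in STFT observation vector

# Modified observations: Kronecker product of the vector with its complex conjugate,
# giving an L^2-dimensional vector that carries second-order (quadratic) information.
y_tilde = np.kron(y, np.conj(y))
print(y_tilde.shape)          # (16,)

# A linear filter h applied to y_tilde then implements a quadratic function of y.
h = rng.normal(size=L * L)
magnitude_estimate = np.real(h @ y_tilde)
print(magnitude_estimate)
```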
-
Discriminative features based on modified log magnitude spectrum for playback speech detection EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-04-07 Jichen Yang; Longting Xu; Bo Ren; Yunyun Ji
In order to improve the performance of hand-crafted features to detect playback speech, two discriminative features, constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients, are proposed for playback speech detection in this work. They rely on our findings that variance-based modified log magnitude spectrum and mean-based modified log magnitude spectrum can enhance
-
Multiclass audio segmentation based on recurrent neural networks for broadcast domain data EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-03-05 Pablo Gimeno; Ignacio Viñals; Alfonso Ortega; Antonio Miguel; Eduardo Lleida
This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise, or a combination of these. The proposed system is based on the use of bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module
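For reference, a frame-level BLSTM classifier of the general kind described above can be sketched in PyTorch as follows; the feature size, hidden size, and class count are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BLSTMSegmenter(nn.Module):
    """Frame-level classifier: a bidirectional LSTM over a feature sequence,
    predicting one of C classes (e.g. speech / music / noise / mix) per frame."""
    def __init__(self, n_feats=64, hidden=128, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, frames, n_feats)
        h, _ = self.blstm(x)
        return self.out(h)                # per-frame class logits

logits = BLSTMSegmenter()(torch.randn(2, 500, 64))
print(logits.shape)                       # torch.Size([2, 500, 4])
```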
-
Binaural sound localization based on deep neural network and affinity propagation clustering in mismatched HRTF condition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-02-10 Jing Wang; Jin Wang; Kai Qian; Xiang Xie; Jingming Kuang
Binaural sound source localization is an important and widely used perceptually based method, and it has been applied in machine learning studies by many researchers based on the head-related transfer function (HRTF). Because the HRTF is closely related to human physiological structure, HRTFs vary between individuals. Related machine learning studies to date tend to focus on binaural localization in
-
Segment boundary detection directed attention for online end-to-end speech recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-01-30 Junfeng Hou; Wu Guo; Yan Song; Li-Rong Dai
Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Different from conventional attention models wherein the soft alignment is obtained by a pass over the entire input sequence, attention models
-
The aerodynamics of voiced stop closures EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-01-28 Luis M. T. Jesus; Maria Conceição Costa
Experimental data combining complementary measures based on the oral airflow signal is presented in this paper, exploring the view that European Portuguese voiced stops are produced in a similar fashion to Germanic languages. Four Portuguese speakers were recorded producing a corpus of nine isolated words with /b, d, ɡ/ in initial, medial and final word position, and the same nine words embedded in
-
Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2020-01-13 Mohammed Sidi Yakoub; Sid-ahmed Selouani; Brahim-Fares Zaidi; Asma Bouchair
In this paper, we use empirical mode decomposition and Hurst-based mode selection (EMDH) along with deep learning architecture using a convolutional neural network (CNN) to improve the recognition of dysarthric speech. The EMDH speech enhancement technique is used as a preprocessing step to improve the quality of dysarthric speech. Then, the Mel-frequency cepstral coefficients are extracted from the
-
Unsupervised adaptation of PLDA models for broadcast diarization EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-27 Ignacio Viñals; Alfonso Ortega; Jesús Villalba; Antonio Miguel; Eduardo Lleida
We present a novel model adaptation approach to deal with data variability for speaker diarization in a broadcast environment. Expensive human-annotated data can be used to mitigate the domain mismatch by means of supervised model adaptation approaches. By contrast, we propose an unsupervised adaptation method which does not need in-domain labeled data, only the recording that we are diarizing
-
Online/offline score informed music signal decomposition: application to minus one EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-23 Antonio Jesús Munoz-Montoro; Julio José Carabias-Orti; Pedro Vera-Candeas; Francisco Jesús Canadas-Quesada; Nicolás Ruiz-Reyes
In this paper, we propose a score-informed source separation framework based on non-negative matrix factorization (NMF) and dynamic time warping (DTW) that suits both offline and online systems. The proposed framework is composed of three stages: training, alignment, and separation. In the training stage, the score is encoded as a sequence of individual occurrences and unique combinations of notes
-
A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-16 Marc Freixes; Francesc Alías; Joan Claudi Socoró
Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual
-
Signal enhancement for communication systems used by fire fighters EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-12 Michael Brodersen; Achim Volmer; Gerhard Schmidt
So-called full-face masks are essential for fire fighters to ensure respiratory protection in smoke-diving incidents. While such masks are absolutely necessary for protection on the one hand, on the other hand they drastically impair the voice communication of fire fighters. For this reason, communication systems should be used to amplify the speech and, therefore, to improve the communication
-
Speech enhancement methods based on binaural cue coding EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-11 Xianyun Wang; Changchun Bao
Following the encoding and decoding mechanism of binaural cue coding (BCC), in this paper the speech and noise are treated as the left and right channel signals of the BCC framework, respectively. Subsequently, the speech signal is estimated from noisy speech when the inter-channel level difference (ICLD) and inter-channel correlation (ICC) between speech and noise are given. In this
-
Introducing phonetic information to speaker embedding for speaker verification EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-12-05 Yi Liu; Liang He; Jia Liu; Michael T. Johnson
Phonetic information is one of the most essential components of a speech signal, playing an important role for many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing
-
A new joint CTC-attention-based speech recognition model with multi-level multi-head attention EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-10-28 Chu-Xiong Qin; Wen-Lin Zhang; Dan Qu
A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction
-
Non-parallel dictionary learning for voice conversion using non-negative Tucker decomposition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-09-11 Yuki Takashima; Toru Nakashika; Tetsuya Takiguchi; Yasuo Ariki
Voice conversion (VC) is a technique of exclusively converting speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves when compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained
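The basic NMF-VC idea, reusing source activations with a target dictionary, can be sketched with scikit-learn's NMF. Here the target dictionary is a random placeholder, whereas in practice it is learned from training data, and the paper's non-negative Tucker extension is not shown.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V_src = np.abs(rng.normal(size=(257, 200)))        # stand-in source magnitude spectrogram

# Learn a source spectral dictionary W_src and activations H so that V_src ~= W_src @ H.
nmf = NMF(n_components=32, init="random", random_state=0, max_iter=300)
W_src = nmf.fit_transform(V_src)                    # (257, 32)
H = nmf.components_                                 # (32, 200)

# In NMF-VC, the activations H are reused with a paired target dictionary W_tgt.
W_tgt = np.abs(rng.normal(size=(257, 32)))          # placeholder; normally learned from training data
V_converted = W_tgt @ H
print(V_converted.shape)                            # (257, 200)
```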
-
ALBAYZIN 2018 spoken term detection evaluation: a multi-domain international evaluation in Spanish EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-09-02 Javier Tejedor; Doroteo T. Toledano; Paula Lopez-Otero; Laura Docio-Fernandez; Ana R. Montalvo; Jose M. Ramirez; Mikel Peñagarikano; Luis Javier Rodriguez-Fuentes
Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS-related task aiming to retrieve data from a speech repository given a textual representation of a search term (which can include one or more words). This paper presents a multi-domain internationally open evaluation for STD in Spanish. The
-
Room-localized speech activity detection in multi-microphone smart homes EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-08-27 Panagiotis Giannoulis; Gerasimos Potamianos; Petros Maragos
Voice-enabled interaction systems in domestic environments have attracted significant interest recently, being the focus of smart home research projects and commercial voice assistant home devices. Within the multi-module pipelines of such systems, speech activity detection (SAD) constitutes a crucial component, providing input to their activation and speech recognition subsystems. In typical multi-room
-
Articulation constrained learning with application to speech emotion recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-08-20 Mohit Shah; Ming Tu; Visar Berisha; Chaitali Chakrabarti; Andreas Spanias
Speech emotion recognition methods combining articulatory information with acoustic features have been previously shown to improve recognition performance. Collection of articulatory data on a large scale may not be feasible in many scenarios, thus restricting the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory
-
Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-07-19 Javier Tejedor; Doroteo T. Toledano; Paula Lopez-Otero; Laura Docio-Fernandez; Mikel Peñagarikano; Luis Javier Rodriguez-Fuentes; Antonio Moreno-Sandoval
The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research on this area is continuously fostered with the organization of QbE STD evaluations. This paper presents a multi-domain internationally open
-
Latent class model with application to speaker diarization EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-07-09 Liang He; Xianhong Chen; Can Xu; Yi Liu; Jia Liu; Michael T. Johnson
In this paper, we apply a latent class model (LCM) to the task of speaker diarization. LCM is similar to Patrick Kenny’s variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations. In contrast to the VB method, which is based on a generative model, LCM provides a framework allowing both generative and discriminative models. The discriminative
-
Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-06-26 Byeong-Yong Jang; Woon-Haeng Heo; Jung-Hyun Kim; Oh-Wook Kwon
We propose a new method for music detection from broadcast content using convolutional neural networks with a Mel-scale kernel. In this detection task, music segments must be annotated from the broadcast data, in which music, speech, and noise are mixed. The convolutional neural network is composed of a convolutional layer with a kernel that is trained to extract robust features. The Mel-scale
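As background, a Mel-scale filterbank of the kind such a kernel relates to can be generated with librosa; the sketch below applies it to a random magnitude spectrogram. The sampling rate, FFT size, and band count are illustrative, not the paper's values.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 16000, 512, 40
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (40, 257)

# Applying the filterbank to a magnitude spectrogram gives the Mel-scale representation
# that a trainable Mel-scale kernel could start from.
spec = np.abs(np.random.randn(n_fft // 2 + 1, 100))
mel_spec = mel_fb @ spec
print(mel_spec.shape)                                              # (40, 100)
```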
-
Robust singer identification of Indian playback singers EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-06-17 Deepali Y. Loni; Shaila Subbaraman
Singing voice analysis has been a topic of research supporting several applications in the domain of music information retrieval systems. One such major area is singer identification (SID). There has been an enormous increase in the production of movies and songs in the Bollywood industry over the last several decades. Surveying this extensive dataset of singers, the paper presents a singer identification system for Indian
-
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-06-17 Diego de Benito-Gorron; Alicia Lozano-Diez; Doroteo T. Toledano; Joaquin Gonzalez-Rodriguez
Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the implementation of several neural network-based systems for speech and music event
-
Replay attack detection with auditory filter-based relative phase features EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-06-10 Zeyan Oo; Longbiao Wang; Khomdet Phapatanaburi; Meng Liu; Seiichi Nakagawa; Masahiro Iwahashi; Jianwu Dang
There are many studies on distinguishing human speech from artificially generated speech and on automatic speaker verification (ASV), which aims to detect and identify whether given speech belongs to a given speaker. Recent studies demonstrate the success of the relative phase (RP) feature in speaker recognition/verification and in the detection of synthesized and converted speech. However, there are
-
An adaptive a priori SNR estimator for perceptual speech enhancement EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-06-07 Lara Nahma; Pei Chee Yong; Hai Huyen Dam; Sven Nordholm
In this paper, an adaptive-averaging a priori SNR estimator employing critical-band processing is proposed. The proposed method modifies the current decision-directed a priori SNR estimation to achieve faster tracking when the SNR changes. The decision-directed (DD) estimator employs a fixed weighting with a value close to one, which makes it slow in following the onsets of speech utterances. The proposed
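For reference, the classic fixed-weight decision-directed estimator that the proposed method adapts can be sketched in NumPy as below; the smoothing constant and Wiener gain are the textbook choices, not the paper's adaptive scheme.

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_power, alpha=0.98):
    """Classic decision-directed a priori SNR estimate, per frequency bin over frames.

    noisy_power, noise_power: arrays of shape (n_bins, n_frames).
    A fixed alpha close to one gives smooth but slowly reacting estimates,
    which is the behaviour the adaptive method aims to improve on.
    """
    n_bins, n_frames = noisy_power.shape
    xi = np.zeros((n_bins, n_frames))
    prev_clean_power = np.zeros(n_bins)
    for t in range(n_frames):
        gamma = noisy_power[:, t] / np.maximum(noise_power[:, t], 1e-12)   # a posteriori SNR
        xi[:, t] = alpha * prev_clean_power / np.maximum(noise_power[:, t], 1e-12) \
                   + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gain = xi[:, t] / (1.0 + xi[:, t])                                  # Wiener gain
        prev_clean_power = (gain ** 2) * noisy_power[:, t]
    return xi

rng = np.random.default_rng(0)
xi = decision_directed_snr(rng.random((257, 100)) + 1.0, np.full((257, 100), 0.5))
print(xi.shape)
```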
-
Feature trajectory dynamic time warping for clustering of speech segments EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-04-04 Lerato Lerato; Thomas Niesler
Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech
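A plain DTW distance, and the per-trajectory variant suggested by FTDTW (aligning each feature dimension independently and summing the distances), can be sketched as follows; this is a naive O(nm) implementation for illustration only, with random stand-in feature sequences.

```python
import numpy as np

def dtw_distance(x, y):
    """Plain DTW between two feature sequences (frames x dims) with Euclidean local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.random.randn(20, 13)   # e.g. MFCC sequences of two speech segments
b = np.random.randn(25, 13)
print(dtw_distance(a, b))

# Per-trajectory alignment: warp each feature dimension independently and sum.
ftdtw = sum(dtw_distance(a[:, d:d + 1], b[:, d:d + 1]) for d in range(a.shape[1]))
print(ftdtw)
```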
-
Loudness stability of binaural sound with spherical harmonic representation of sparse head-related transfer functions EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-03-15 Zamir Ben-Hur; David Lou Alon; Boaz Rafaely; Ravish Mehra
In response to renewed interest in virtual and augmented reality, the need for high-quality spatial audio systems has emerged. The reproduction of immersive and realistic virtual sound requires high resolution individualized head-related transfer function (HRTF) sets. In order to acquire an individualized HRTF, a large number of spatial measurements are needed. However, such a measurement process requires
-
Punctuation-generation-inspired linguistic features for Mandarin prosody generation EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-02-21 Chen-Yu Chiang; Yu-Ping Hung; Han-Yun Yeh; I-Bin Liao; Chen-Ming Pan
This paper proposes two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system. The first feature is the punctuation confidence (PC), which measures the likelihood that a major punctuation mark (MPM) can be inserted at a word boundary. The second feature is the quotation confidence (QC), which measures the likelihood that a word string is quoted
-
Dual supervised learning for non-native speech recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-01-14 Kacper Radzikowski; Robert Nowak; Le Wang; Osamu Yoshie
Current automatic speech recognition (ASR) systems achieve over 90–95% accuracy, depending on the methodology applied and datasets used. However, the level of accuracy decreases significantly when the same ASR system is used by a non-native speaker of the language to be recognized. At the same time, the volume of labeled datasets of non-native speech samples is extremely limited both in size and in
-
Decision tree SVM model with Fisher feature selection for speech emotion recognition EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-01-07 Linhui Sun; Sheng Fu; Fu Wang
The overall recognition rate decreases due to the increased emotional confusion in multi-class speech emotion recognition. To solve this problem, we propose a speech emotion recognition method based on a decision tree support vector machine (SVM) model with Fisher feature selection. At the feature selection stage, the Fisher criterion is used to filter out the feature parameters of higher distinguish
-
Discriminative frequency filter banks learning with neural networks EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2019-01-03 Teng Zhang; Ji Wu
Filter banks on spectra play an important role in many audio applications. Traditionally, the filters are linearly distributed on a perceptual frequency scale such as the Mel scale. To make the output smoother, these filters are often placed so that they overlap with each other. However, fixed-parameter filters usually originate from psychoacoustic experiments and are selected experimentally. To make
-
Automatic bird species recognition based on birds vocalization EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-12-14 Jiri Stastny; Michal Munk; Lubos Juranek
This paper deals with a project of Automatic Bird Species Recognition Based on Bird Vocalization. Eighteen bird species from 6 different families were analyzed. First, human factor cepstral coefficients representing the given signal were calculated from the individual recordings. In the next phase, using the voice activity detection system, segments of bird vocalizations were detected, from which a likelihood
-
Towards end-to-end speech recognition with transfer learning EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-11-21 Chu-Xiong Qin; Dan Qu; Lian-Hai Zhang
A transfer learning-based end-to-end speech recognition approach is presented in two levels in our framework. Firstly, a feature extraction approach combining multilingual deep neural network (DNN) training with matrix factorization algorithm is introduced to extract high-level features. Secondly, the advantage of connectionist temporal classification (CTC) is transferred to the target attention-based
-
Web-based environment for user generation of spoken dialog for virtual assistants EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-11-16 Ryota Nishimura; Daisuke Yamamoto; Takahiro Uchiya; Ichi Takumi
In this paper, a web-based spoken dialog generation environment is developed which enables users to edit dialogs with a video virtual assistant and also to select the assistant's 3D motions and tone of voice. In our proposed system, “anyone” can “easily” post/edit the contents of the dialog for the dialog system. The dialog type supported by the system is limited to the question-and-answer type
-
Robust image-in-audio watermarking technique based on DCT-SVD transform EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-10-01 Aniruddha Kanhe; Aghila Gnanasekaran
In this paper, a robust and highly imperceptible audio watermarking technique based on the discrete cosine transform (DCT) and singular value decomposition (SVD) is presented. The watermark image data are selectively embedded into the low-frequency components of the audio signal, making the watermarked audio highly imperceptible and robust. The imperceptibility of the proposed method is evaluated by computing
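A rough sketch of a DCT-SVD embedding pipeline, assuming SciPy/NumPy: take the DCT of an audio frame, arrange low-frequency coefficients into a matrix, perturb its singular values with watermark bits, and invert. The block size and embedding strength are arbitrary here and not the paper's parameters.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
audio = rng.normal(size=1024)                      # stand-in audio frame

# 1) DCT to reach a frequency-domain representation
coeffs = dct(audio, norm="ortho")

# 2) Arrange low-frequency coefficients into a matrix and take its SVD
block = coeffs[:256].reshape(16, 16)
U, S, Vt = np.linalg.svd(block, full_matrices=False)

# 3) Embed watermark bits by slightly perturbing the singular values
watermark_bits = rng.integers(0, 2, size=S.shape)
S_marked = S + 0.01 * watermark_bits

# 4) Rebuild the block and invert the DCT
coeffs[:256] = (U @ np.diag(S_marked) @ Vt).ravel()
watermarked = idct(coeffs, norm="ortho")
print(watermarked.shape)
```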
-
Relevance-based quantization of scattering features for unsupervised mining of environmental audio EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-29 Vincent Lostanlen; Grégoire Lafay; Joakim Andén; Mathieu Lagrange
The emerging field of computational acoustic monitoring aims at retrieving high-level information from acoustic scenes recorded by some network of sensors. These networks gather large amounts of data requiring analysis. To decide which parts to inspect further, we need tools that automatically mine the data, identifying recurring patterns and isolated events. This requires a similarity measure for
-
The use of long-term features for GMM- and i-vector-based speaker diarization systems EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-26 Abraham Woubie Zewoudie; Jordi Luque; Javier Hernando
Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. The other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficients are the most widely used features in speech-related
-
From raw audio to a seamless mix: creating an automated DJ system for Drum and Bass EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-24 Len Vande Veire; Tijl De Bie
We present the open-source implementation of the first fully automatic and comprehensive DJ system, able to generate seamless music mixes using songs from a given library much like a human DJ does. The proposed system is built on top of several enhanced music information retrieval (MIR) techniques, such as for beat tracking, downbeat tracking, and structural segmentation, to obtain an understanding
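As a toy illustration of the MIR building blocks involved, the sketch below runs librosa beat tracking on two stand-in signals and concatenates them with a naive linear crossfade; a real DJ system would additionally time-stretch, align downbeats, and pick structural cue points.

```python
import numpy as np
import librosa

sr = 22050
# Stand-ins for two tracks from the library; real use would load audio files.
track_a = 0.05 * np.random.randn(30 * sr)
track_b = 0.05 * np.random.randn(30 * sr)

tempo_a, beats_a = librosa.beat.beat_track(y=track_a, sr=sr)
tempo_b, beats_b = librosa.beat.beat_track(y=track_b, sr=sr)
print(tempo_a, tempo_b)

# Naive linear crossfade over the last/first few seconds of the two tracks.
fade = 5 * sr
ramp = np.linspace(0.0, 1.0, fade)
mix = np.concatenate([track_a[:-fade],
                      track_a[-fade:] * (1 - ramp) + track_b[:fade] * ramp,
                      track_b[fade:]])
print(mix.shape)
```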
-
AudioPairBank: towards a large-scale tag-pair-based audio content analysis EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-15 Sebastian Säger; Benjamin Elizalde; Damian Borth; Christian Schulze; Bhiksha Raj; Ian Lane
Recently, sound recognition has been used to identify sounds, such as the sound of a car, or a river. However, sounds have nuances that may be better described by adjective-noun pairs such as “slow car” and verb-noun pairs such as “flying insects,” which are underexplored. Therefore, this work investigates the relationship between audio content and both adjective-noun pairs and verb-noun pairs. Due
-
Piano multipitch estimation using sparse coding embedded deep learning EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-12 Xingda Li; Yujing Guan; Yingnian Wu; Zhongbo Zhang
As the foundation of many applications, the multipitch estimation problem has always been a focus of acoustic music processing; however, existing algorithms perform poorly due to its complexity. In this paper, we employ deep learning to address the piano multipitch estimation problem by proposing MPENet, based on a novel multimodal sparse incoherent non-negative matrix factorization (NMF) layer. This
-
Enhancement of speech dynamics for voice activity detection using DNN EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-09-12 Suci Dwijayanti; Kei Yamamori; Masato Miyoshi
Voice activity detection (VAD) is an important preprocessing step for various speech applications to identify speech and non-speech periods in input signals. In this paper, we propose a deep neural network (DNN)-based VAD method for detecting such periods in noisy signals using speech dynamics, which are time-varying speech signals that may be expressed as the first- and second-order derivatives of
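Speech dynamics of this kind are commonly approximated by delta and delta-delta features; a minimal librosa sketch is shown below, using a random stand-in signal and MFCCs as assumed base features (the paper's exact features may differ).

```python
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.random.randn(2 * sr)                  # stand-in noisy speech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc, order=1)       # first-order dynamics
delta2 = librosa.feature.delta(mfcc, order=2)      # second-order dynamics
features = np.vstack([mfcc, delta, delta2])        # per-frame input to a DNN-based VAD
print(features.shape)
```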
-
Robust emotional speech recognition based on binaural model and emotional auditory mask in noisy environments EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-08-28 Meysam Bashirpour; Masoud Geravanchizadeh
The performance of automatic speech recognition systems degrades in the presence of emotional states and in adverse environments (e.g., noisy conditions). This greatly limits the deployment of speech recognition applications in realistic environments. Previous studies in the emotion-affected speech recognition field focus on improving emotional speech recognition using clean speech data recorded in
-
An artificial patient for pure-tone audiometry EURASIP J. Audio Speech Music Proc. (IF 1.289) Pub Date : 2018-07-27 Alexander Kocian; Guido Cattani; Stefano Chessa; Wilko Grolman
The successful treatment of hearing loss depends on the individual practitioner’s experience and skill. So far, there is no standard available to evaluate the practitioner’s testing skills. To assess every practitioner equally, the paper proposes a first machine, dubbed the artificial patient (AP), which mimics a real patient with hearing impairment and operates in real time in a real environment. Following this