-
An investigation of domain adaptation in speaker embedding space for speaker recognition Speech Commun. (IF 1.417) Pub Date : 2021-01-23 Fahimeh Bahmaninezhad; Chunlei Zhang; John H.L. Hansen
Speaker recognition continues to grow as a research challenge in the field with expanded application in commercial, forensic, educational and general speech technology interfaces. However, challenges remain, especially for naturalistic audio streams including recordings with mismatch between train and test data (i.e., when train or system development data and enrollment/test data or application data
-
Exploiting the directional coherence function for multichannel source extraction Speech Commun. (IF 1.417) Pub Date : 2021-01-20 Shan Liang; Guanjun Li; Shuai Nie; ZhanLei Yang; WenJu Liu; Jianhua Tao
The desired speech detector plays an important role in controlling speech distortion in spatial-filtering-based speech enhancement algorithms. However, conventional complex coherence (CC) based algorithms can only distinguish coherent speech from diffuse noise. To improve performance in scenarios where both coherent interference and diffuse noise are present, we propose a directional
-
Speech signal processing on graphs: The graph frequency analysis and an improved graph Wiener filtering method Speech Commun. (IF 1.417) Pub Date : 2021-01-13 Tingting Wang; Haiyan Guo; Xue Yan; Zhen Yang
In this paper, we investigate a graph representation of speech signals and graph-based speech enhancement technology. Specifically, we first propose a new graph k-shift operator Ck to map speech signals into the graph domain and construct a novel graph Fourier basis for speech graph signals (SGSs) from its singular vectors. On this basis, we propose an improved graph Wiener filtering method based
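The teaser is truncated before the construction of Ck is given, so the following is only a minimal sketch of the general idea with illustrative names (a cyclic shift stands in for the authors' k-shift operator): build a shift operator for a frame of samples, take its SVD, and use the singular vectors as a graph Fourier basis.

```python
import numpy as np

def cyclic_shift_operator(n: int, k: int = 1) -> np.ndarray:
    """Adjacency-style operator that shifts a length-n graph signal by k nodes."""
    return np.roll(np.eye(n), k, axis=1)

def graph_fourier_basis(shift: np.ndarray) -> np.ndarray:
    """Use the left singular vectors of the (possibly non-symmetric) shift as a basis."""
    u, _, _ = np.linalg.svd(shift)
    return u

# Map one speech frame into the graph spectral domain and back.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)             # stand-in for a speech frame
C = cyclic_shift_operator(frame.size, k=1)   # hypothetical k-shift operator
U = graph_fourier_basis(C)
spectrum = U.T @ frame                       # graph Fourier transform
recovered = U @ spectrum                     # inverse transform
assert np.allclose(frame, recovered)
```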
-
Learning deep multimodal affective features for spontaneous speech emotion recognition Speech Commun. (IF 1.417) Pub Date : 2020-12-26 Shiqing Zhang; Xin Tao; Yuelong Chuang; Xiaoming Zhao
Recently, spontaneous speech emotion recognition has become an active and challenging research subject. This paper proposes a new method of spontaneous speech emotion recognition by using deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs). The proposed method initially generates three different audio inputs for multi-CNNs so as to learn deep multimodal
-
Development and structure of the VariaNTS corpus: A spoken Dutch corpus containing talker and linguistic variability Speech Commun. (IF 1.417) Pub Date : 2020-12-28 Floor Arts; Deniz Başkent; Terrin N. Tamati
Speech perception and spoken word recognition are not only affected by what is being said, but also by who is speaking. Currently, publicly available corpora of spoken Dutch do not offer a wide variety of linguistic materials produced by multiple talkers. The VariaNTS (Variatie in Nederlandse Taal en Sprekers) corpus is a Dutch spoken corpus that was developed to maximize both linguistic and talker
-
POLEMAD–A database for the multimodal analysis of Polish pronunciation Speech Commun. (IF 1.417) Pub Date : 2020-12-11 Robert Wielgat; Rafał Jędryka; Anita Lorenc; Łukasz Mik; Daniel Król
-
Phonetic accommodation to natural and synthetic voices: Behavior of groups and individuals in speech shadowing Speech Commun. (IF 1.417) Pub Date : 2020-12-29 Iona Gessinger; Eran Raveh; Ingmar Steiner; Bernd Möbius
The present study investigates whether native speakers of German phonetically accommodate to natural and synthetic voices in a shadowing experiment. We aim to determine whether this phenomenon, which is frequently found in human–human interaction (HHI), also occurs in human–computer interaction (HCI) involving synthetic speech. The examined features pertain to different phonetic domains: allophonic variation, schwa epenthesis, realization of pitch accents
-
Automatic quality control and enhancement for voice-based remote Parkinson’s disease detection Speech Commun. (IF 1.417) Pub Date : 2020-12-30 Amir Hossein Poorjam; Mathew Shaji Kavalekalam; Liming Shi; Jordan P. Raykov; Jesper Rindom Jensen; Max A. Little; Mads Græsbøll Christensen
The performance of voice-based Parkinson’s disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions caused mainly by degradation in test signals. In this paper, we address this mismatch by considering three types of degradation commonly encountered in remote voice analysis, namely background noise, reverberation and nonlinear distortion, and
-
A unified system for multilingual speech recognition and language identification Speech Commun. (IF 1.417) Pub Date : 2020-12-26 Danyang Liu; Ji Xu; Pengyuan Zhang; Yonghong Yan
In this paper, a multilingual automatic speech recognition (ASR) and language identification (LID) system is designed. In contrast to conventional multilingual ASR systems, this paper takes advantage of the complementarity of the ASR and LID modules. First, the LID module contributes to the language adaptive training of the multilingual acoustic model. Then, the ASR decoding information acts as the
-
Phonetic correlates of laryngeal and place contrasts of Burushaski Speech Commun. (IF 1.417) Pub Date : 2020-11-30 Qandeel Hussain
Burushaski is an endangered language isolate spoken in Hunza, Nager, and Yasin valleys of Gilgit, Northern Pakistan. The present study investigates the acoustic correlates of Hunza Burushaski’s three-way stop laryngeal contrast (voiceless unaspirated, voiceless aspirated, and voiced unaspirated) across five places of articulation (bilabial, dental, retroflex, palatal, and velar). A wide range of acoustic
-
The effect of intermittent noise on lexically-guided perceptual learning in native and non-native listening Speech Commun. (IF 1.417) Pub Date : 2020-12-17 Polina Drozdova; Roeland van Hout; Sven Mattys; Odette Scharenborg
There is ample evidence that both native and non-native listeners deal with speech variation by quickly tuning into a speaker and adjusting their phonetic categories according to the speaker’s ambiguous pronunciation. This process is called lexically-guided perceptual learning. Moreover, the presence of noise in the speech signal has previously been shown to change the word competition process by increasing
-
Acoustic differences in emotional speech of people with dysarthria Speech Commun. (IF 1.417) Pub Date : 2020-12-04 Lubna Alhinti; Heidi Christensen; Stuart Cunningham
Communicating emotion is essential in building and maintaining relationships. We communicate our emotional state not just with the words we use, but also how we say them. Changes in the rate of speech, short-term energy and intonation all help to convey emotional states like 'angry', 'sad' and 'happy'. People with dysarthria, the most common speech disorder, have reduced articulatory and phonatory
-
Model architectures to extrapolate emotional expressions in DNN-based text-to-speech Speech Commun. (IF 1.417) Pub Date : 2020-11-24 Katsuki Inoue; Sunao Hara; Masanobu Abe; Nobukatsu Hojo; Yusuke Ijima
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, the meaning of “extrapolate emotional expressions” is to borrow emotional expressions from others, and the collection of emotional speech uttered by target speakers is unnecessary. Although a DNN has potential power to construct DNN-based
-
RPCA-based real-time speech and music separation method Speech Commun. (IF 1.417) Pub Date : 2020-12-09 Mohaddeseh Mirbeygi; Aminollah Mahabadi; Akbar Ranjbar
Improving the performance of online speech and music separation is an NP-hard problem, and the separation optimization increases the complexity of the Robust Principal Component Analysis (RPCA) method, which is time-consuming for large matrix computations. This paper presents an RPCA-based speech and music separation method designed to reduce the computational complexity and be robust
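For context, the generic RPCA decomposition that such methods build on can be computed with principal component pursuit via an inexact augmented Lagrangian. The sketch below is that textbook baseline, not the authors' reduced-complexity real-time variant:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_pcp(M, lam=None, mu=None, n_iter=100):
    """Minimal principal-component-pursuit RPCA: M ~ L (low rank) + S (sparse)."""
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / np.sum(np.abs(M))
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Singular-value shrinkage step for the low-rank (music-like) part.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(soft_threshold(s, 1.0 / mu)) @ Vt
        # Soft-threshold step for the sparse (speech-like) part.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)   # dual-variable update
    return L, S
```

Applied to a magnitude spectrogram M, the low-rank part L tends to capture repetitive musical accompaniment while the sparse part S captures speech activity.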
-
Fusion of Deep Learning Features with Mixture of Brain Emotional Learning for Audio-Visual Emotion Recognition Speech Commun. (IF 1.417) Pub Date : 2020-12-03 Zeinab Farhoudi; Saeed Setayeshi
Multimodal emotion recognition is a challenging task because emotions are expressed through different modalities over time in video clips. Considering the spatial-temporal correlation present in video, we propose an audio-visual fusion model of deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model is composed of two
-
Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM Speech Commun. (IF 1.417) Pub Date : 2020-11-19 Bagus Tris Atmaja; Masato Akagi
Automatic speech emotion recognition (SER) by a computer is a critical component for more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential to taking further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture
-
Verbalization has regulatory influences on autonomic activity during recall of unpleasant experience Speech Commun. (IF 1.417) Pub Date : 2020-11-15 Antti Rantanen; Seppo Laukka; Antti Siipo; Suvi Tiinanen; Mika P. Tarvainen; Jukka Kortelainen; Matti Lehtihalmes; Tapio Seppänen
The purpose of the study was to examine the effects of verbalizing affective experience on heart-rate variability (HRV) and emotions. We measured the HRV of 35 subjects while they were viewing and verbalizing 48 affective pictures (IAPS). Our results showed that sympathetic activity was lower and parasympathetic activity higher when subjects were verbalizing unpleasant pictures than it was when they were
-
Speech enhancement using a DNN-augmented colored-noise Kalman filter Speech Commun. (IF 1.417) Pub Date : 2020-11-04 Hongjiang Yu; Wei-Ping Zhu; Benoit Champagne
In this paper, we propose a new speech enhancement system using a deep neural network (DNN)-augmented colored-noise Kalman filter. In our system, both clean speech and noise are modelled as autoregressive (AR) processes, whose parameters comprise the linear prediction coefficients (LPCs) and the driving noise variances. The LPCs are obtained through training a multi-objective DNN that learns the mapping
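A minimal sketch of the colored-noise Kalman filter itself is given below, assuming the AR parameters are already known (in the paper, a trained multi-objective DNN supplies the LPCs and driving-noise variances):

```python
import numpy as np

def companion(a):
    """Companion (transition) matrix for an AR process with coefficients a."""
    p = len(a)
    F = np.zeros((p, p))
    F[0, :] = a
    F[1:, :-1] = np.eye(p - 1)
    return F

def kalman_enhance(y, a_s, var_s, a_n, var_n):
    """Colored-noise Kalman filter: speech and noise are both AR processes,
    observed as their sum. y: noisy samples (float array)."""
    p, q = len(a_s), len(a_n)
    F = np.zeros((p + q, p + q))
    F[:p, :p] = companion(a_s)          # clean-speech dynamics
    F[p:, p:] = companion(a_n)          # colored-noise dynamics
    Q = np.zeros((p + q, p + q))
    Q[0, 0], Q[p, p] = var_s, var_n     # driving-noise variances
    H = np.zeros((1, p + q)); H[0, 0] = 1.0; H[0, p] = 1.0   # y = s + n
    x = np.zeros(p + q); P = np.eye(p + q)
    s_hat = np.empty_like(y)
    for t, yt in enumerate(y):
        x = F @ x; P = F @ P @ F.T + Q                      # predict
        k = P @ H.T / (H @ P @ H.T + 1e-12)                 # Kalman gain
        x = x + k[:, 0] * (yt - H @ x)                      # update
        P = (np.eye(p + q) - k @ H) @ P
        s_hat[t] = x[0]                 # current clean-speech estimate
    return s_hat
```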
-
Acoustic and temporal representations in convolutional neural network models of prosodic events Speech Commun. (IF 1.417) Pub Date : 2020-11-05 Sabrina Stehwien; Antje Schweitzer; Ngoc Thang Vu
Prosodic events such as pitch accents and phrase boundaries have various acoustic and temporal correlates that are used as features in machine learning models to automatically detect these events from speech. These features are often linguistically motivated, high-level features that are hand-crafted by experts to best represent the prosodic events to be detected or classified. An alternative approach
-
Masked multi-head self-attention for causal speech enhancement Speech Commun. (IF 1.417) Pub Date : 2020-10-29 Aaron Nicolson; Kuldip K. Paliwal
Accurately modelling the long-term dependencies of noisy speech is critical to the performance of a speech enhancement system. Current deep learning approaches to speech enhancement employ either a recurrent neural network (RNN) or a temporal convolutional network (TCN). However, RNNs and TCNs both demonstrate deficiencies when modelling long-term dependencies. Enter multi-head attention (MHA) — a
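For readers unfamiliar with the mechanism, here is a minimal numpy sketch of masked (causal) multi-head self-attention over a sequence of feature frames; it illustrates the masking idea only, not the authors' full enhancement network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mha(X, n_heads, Wq, Wk, Wv, Wo):
    """Multi-head self-attention with a causal mask: frame t attends only to 1..t."""
    T, d = X.shape
    dh = d // n_heads
    mask = np.triu(np.full((T, T), -np.inf), k=1)   # block attention to future frames
    heads = []
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        A = softmax(Q @ K.T / np.sqrt(dh) + mask)   # causal attention weights
        heads.append(A @ V)
    return np.concatenate(heads, axis=1) @ Wo

# Toy usage on 10 "spectral" frames of dimension 16 with 4 heads.
rng = np.random.default_rng(1)
T, d, H = 10, 16, 4
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
Y = causal_mha(X, H, Wq, Wk, Wv, Wo)   # Y[t] depends only on X[:t+1]
```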
-
Voks: Digital instruments for chironomic control of voice samples Speech Commun. (IF 1.417) Pub Date : 2020-11-02 Grégoire Locqueville; Christophe d’Alessandro; Samuel Delalez; Boris Doval; Xiao Xiao
This paper presents Voks, a new family of digital instruments that allow for real-time control and modification of pre-recorded voice signal samples. An instrument based on Voks is made of Voks itself, the synthesis software and a given set of chironomic (hand-driven) interfaces. Rhythm can be accurately controlled thanks to a new methodology, based on syllabic control points. Timing can also be controlled
-
Amplitude and Frequency Modulation-based features for detection of replay Spoof Speech Speech Commun. (IF 1.417) Pub Date : 2020-10-28 Madhu R. Kamble; Hemlata Tak; Hemant A. Patil
Replay attack poses a great threat to the Automatic Speaker Verification (ASV) system. This paper introduces Amplitude Modulation and Frequency Modulation-based features for replay Spoof Speech Detection (SSD) task. In this context, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using Energy Separation Algorithm (ESA). The speech signal is passed through bandpass
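The teaser cuts off before the filterbank details, but the core of the ESA is the classical Teager–Kaiser energy operator with a DESA-style demodulation. Below is a textbook sketch of that step (not the authors' exact pipeline; index alignment is approximate):

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa(x, fs):
    """DESA-1a style estimates of instantaneous amplitude (IA) and
    instantaneous frequency (IF, in Hz) for a band-passed signal x."""
    psi_x = teager(x)
    psi_dx = teager(np.diff(x))
    eps = 1e-12                         # guard against silent regions
    n = min(len(psi_x) - 1, len(psi_dx))
    ratio = np.clip(1.0 - psi_dx[:n] / (2.0 * psi_x[:n] + eps), -1.0, 1.0)
    omega = np.arccos(ratio)            # IF in radians/sample
    ia = np.sqrt(np.abs(psi_x[:n]) / (np.sin(omega) ** 2 + eps))
    return ia, omega * fs / (2 * np.pi)
```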
-
Phonetic Detail Encoding in Explaining Boundary-modulated Coarticulation Speech Commun. (IF 1.417) Pub Date : 2020-10-31 Shan Luo
This study examines the acoustic outputs in boundary coarticulation and attempts to explain the native and nonnative production differences in relation to speech planning. English consonant clusters across words produced by English (L1) and Mandarin (L2) speakers are analyzed. This study focuses on two segmental features (place of articulation and voicing) and computes three acoustic-based measurements
-
Computer-assisted assessment of phonetic fluency in a second language: a longitudinal study of Japanese learners of French Speech Commun. (IF 1.417) Pub Date : 2020-10-08 Sylvain Detey; Lionel Fontan; Maxime Le Coz; Saïd Jmel
Automatic second language (L2) speech fluency assessment has been one of the ultimate goals of several projects aiming at designing Computer-Assisted Pronunciation Training (CAPT) tools for L2 learners. Usually, three challenges must be tackled in order to solve the issues at stake: 1) Defining fluency from a threefold interdisciplinary perspective (acoustic and perceptual phonetics, computer science
-
Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels Speech Commun. (IF 1.417) Pub Date : 2020-10-09 Krishna Gurugubelli; Anil Kumar Vuppala; N.P. Narendra; Paavo Alku
Dysarthria is a motor speech disorder leading to imprecise articulation of speech. Acoustic analysis capable of detecting and assessing articulation errors is useful in dysarthria diagnosis and therapy. Since speakers with dysarthria experience difficulty in producing rhotics due to complex articulatory gestures of these sounds, the hypothesis of the present study is that duration of the rhotic approximant
-
Musical noise suppression using a low-rank and sparse matrix decomposition approach Speech Commun. (IF 1.417) Pub Date : 2020-09-21 Jishnu Sadasivan; Jitendra K. Dhiman; Chandra Sekhar Seelamantula
We address the problem of suppressing musical noise from speech enhanced using a short-time processing algorithm. Enhancement algorithms rely on noise statistics, and errors in estimating these statistics lead to residual noise in the enhanced signal. A frequently encountered residual noise type is the so-called musical noise, which is a consequence of spurious peaks occurring at random locations in the
-
Acoustic model-based subword tokenization and prosodic-context extraction without language knowledge for text-to-speech synthesis Speech Commun. (IF 1.417) Pub Date : 2020-09-24 Masashi Aso; Shinnosuke Takamichi; Norihiro Takamune; Hiroshi Saruwatari
This paper presents text tokenization and context extraction without using language knowledge for text-to-speech (TTS) synthesis. To generate prosody, statistical parametric TTS synthesis typically requires the professional knowledge of the target language. Therefore, languages suitable for TTS synthesis are limited to only rich-resource languages. To achieve TTS synthesis without using language knowledge
-
A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool Speech Commun. (IF 1.417) Pub Date : 2020-09-29 Pablo Brusco; Jazmín Vidal; Štefan Beňuš; Agustín Gravano
In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages
-
Perceptual realization of Greek consonants by Russian monolingual speakers Speech Commun. (IF 1.417) Pub Date : 2020-10-01 Georgios P. Georgiou; Natalia V. Perfilieva; Vladimir N. Denisenko; Natalia V. Novospasskaya
Nonnative sound perception might be challenging for adult listeners since they attune from a very young age to the phonological aspects of their native language and, thus, every nonnative sound is filtered through their first language. The present study investigates the perception of Greek consonants in both Consonant-Vowel (CV) and Vowel-Consonant (VC) syllable context by Russian monolingual speakers
-
B&Anet: Combining bidirectional LSTM and self-attention for end-to-end learning of task-oriented dialogue system Speech Commun. (IF 1.417) Pub Date : 2020-09-28 He Qun; Liu Wenjing; Cai Zhangli
Building dialogue systems plays an important role in modern life. Among them, task-oriented dialogue systems for resolving problems in everyday life are the most worth exploring. Motivated by the development of end-to-end approaches, a task-oriented dialogue model based on bidirectional LSTM and a self-attention mechanism is proposed. It not only makes good use of context but also effectively solves the long-term dependency
-
Modeling concurrent vowel identification for shorter durations Speech Commun. (IF 1.417) Pub Date : 2020-10-01 Harshavardhan Settibhaktini; Ananthakrishna Chintanpalli
Behavioral studies on concurrent vowels with a shorter duration (50 ms) show that overall identification scores across fundamental frequency (F0) differences are reduced when compared to a longer duration (200 ms). In the current study, we investigated the effect of shorter durations on concurrent vowel scores using the temporal responses of an auditory-nerve model (Zilany et al., 2014) with a modified
-
Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features Speech Commun. (IF 1.417) Pub Date : 2020-08-12 Linda Gerlach; Kirsty McDougall; Finnian Kelly; Anil Alexander; Francis Nolan
The present study investigates relationships between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that includes phonetic, perceptually-relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers from an accent-controlled database (DyViS, Standard
-
A time–frequency smoothing neural network for speech enhancement Speech Commun. (IF 1.417) Pub Date : 2020-09-21 Wenhao Yuan
In existing speech enhancement methods based on deep neural networks (DNNs), the network architectures are not designed specifically for speech enhancement and extract local features of noisy speech in a non-causal way. In this paper, inspired by the feature calculation method based on the time–frequency correlation in improved minima controlled recursive averaging (IMCRA), by using the long
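For reference, the IMCRA-style two-step smoothing that motivates the proposed network (averaging over frequency, then recursive averaging over time) can be sketched as follows; parameter values are illustrative:

```python
import numpy as np

def tf_smooth(power, alpha=0.8, freq_win=3):
    """Two-step smoothing used in IMCRA-style noise tracking:
    power is a (frames x bins) noisy power spectrogram."""
    frames, _ = power.shape
    kernel = np.ones(2 * freq_win + 1) / (2 * freq_win + 1)
    smoothed = np.empty_like(power)
    prev = power[0]
    for t in range(frames):
        freq_avg = np.convolve(power[t], kernel, mode="same")  # frequency smoothing
        prev = alpha * prev + (1 - alpha) * freq_avg           # temporal recursion
        smoothed[t] = prev
    return smoothed
```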
-
The Fharvard corpus: A phonemically-balanced French sentence resource for audiology and intelligibility research Speech Commun. (IF 1.417) Pub Date : 2020-09-05 Vincent Aubanel; C. Bayard; A. Strauß; J.-L. Schwartz
The current study describes the collection of a new phonemically balanced sentence resource for French, known as the Fharvard corpus. The resource consists of 700 sentences inspired by the original English Harvard sentences, along with audio recordings from one female and one male native French talker. Each sentence contains five mono- or bisyllabic keywords, and the sentences are grouped into 70 lists of
-
An empirical study of the effect of acoustic-prosodic entrainment on the perceived trustworthiness of conversational avatars Speech Commun. (IF 1.417) Pub Date : 2020-07-30 Ramiro H. Gálvez; Agustín Gravano; Štefan Beňuš; Rivka Levitan; Marian Trnka; Julia Hirschberg
Entrainment is the tendency of interlocutors to become more similar to each other in their way of speaking. This phenomenon has been repeatedly documented and is associated with multiple social aspects of human-human conversations. However, there is a dearth of research on the effects of spoken dialogue systems (SDSs) with implemented acoustic-prosodic (dis)entrainment policies. The goal of the present
-
Optimum step-size control for a variable step-size stereo acoustic echo canceller in the frequency domain Speech Commun. (IF 1.417) Pub Date : 2020-08-27 Zhenhai Yan, Feiran Yang, Jun Yang
The frequency-domain adaptive filter (FDAF) is a competitive candidate for stereo acoustic echo cancellation (SAEC) because of its good convergence and computational efficiency. However, there is a conflict between the convergence rate and the steady-state misalignment when using a constant step size. In this paper, a variable step-size approach is proposed for the stereo echo canceller in the frequency
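A minimal single-channel sketch of one FDAF block with a per-bin variable step size is shown below. The step-size heuristic here is illustrative only and is not the paper's optimum control rule; the overlap-save gradient constraint is also omitted for brevity:

```python
import numpy as np

def fdaf_block(x_blk, d_blk, W, P, mu_max=0.5, gamma=0.9, eps=1e-8):
    """One overlap-save block of a frequency-domain adaptive filter.
    x_blk: previous + current far-end blocks (length N = FFT size);
    d_blk: current microphone block (length N/2);
    W, P: filter weights and smoothed input power per bin (length N)."""
    N = len(W)
    X = np.fft.fft(x_blk, N)
    y = np.fft.ifft(X * W).real[N // 2:]            # echo estimate (valid half)
    e = d_blk - y                                   # error = mic minus echo estimate
    E = np.fft.fft(np.concatenate([np.zeros(N // 2), e]), N)
    P = gamma * P + (1 - gamma) * np.abs(X) ** 2    # smoothed input power
    # Heuristic: shrink the step in bins where the error is already small
    # relative to the input (a stand-in for optimum step-size control).
    mu = mu_max * np.abs(E) ** 2 / (np.abs(E) ** 2 + np.abs(X) ** 2 + eps)
    W = W + mu * np.conj(X) * E / (P + eps)         # normalized update
    return W, P, e
```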
-
A supervised non-negative matrix factorization model for speech emotion recognition Speech Commun. (IF 1.417) Pub Date : 2020-08-13 Mixiao Hou, Jinxing Li, Guangming Lu
Feature representation plays a critical role in speech emotion recognition (SER). As a method of data dimensionality reduction, Non-negative Matrix Factorization (NMF) can obtain a low-dimensional representation of data by matrix decomposition and make the data more distinguishable. In order to improve the recognition ability of NMF for SER, we study the potential of NMF and propose a supervised
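As background, plain NMF with Lee–Seung multiplicative updates (Euclidean loss) looks as follows; the paper's contribution adds label supervision on top of this base factorization:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Plain NMF: factor a non-negative matrix V ~ W @ H."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)); H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

# V could be a (features x utterances) matrix of emotion descriptors;
# the columns of H give the low-dimensional representation fed to a classifier.
```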
-
Sinusoidal model-based hypernasality detection in cleft palate speech using CVCV sequence Speech Commun. (IF 1.417) Pub Date : 2020-08-08 Akhilesh Kumar Dubey, S.R. Mahadeva Prasanna, S. Dandapat
Hypernasality in the speech of children with cleft palate is a consequence of velopharyngeal insufficiency. The spectral analysis of hypernasal speech shows the presence of nasal formants and anti-formants in the spectrum, which affects the harmonic intensity. The nasal formants increase, whereas the anti-formants decrease, the magnitude of the harmonics around their respective locations. Hence, the spectrum
-
Multimodal perception of prominence in spontaneous speech: A methodological proposal using mixed models and AIC Speech Commun. (IF 1.417) Pub Date : 2020-07-31 Miguel Jiménez-Bravo, Victoria Marrero-Aguiar
Research on prominence perception has made use of animated agents and controlled speech in experimental settings, but these methodologies have disregarded some aspects of the acoustic and visual correlates of prominence. To overcome these limitations we propose a new methodological approach using spontaneous speech data. For this, we created a small database with extracts from a television talent show
-
Parallel Representation Learning for the Classification of Pathological Speech: Studies on Parkinson’s Disease and Cleft Lip and Palate Speech Commun. (IF 1.417) Pub Date : 2020-07-29 J.C. Vasquez-Correa, T. Arias-Vergara, M. Schuster, J.R. Orozco-Arroyave, E. Nöth
Speech signals may contain different paralinguistic aspects, such as the presence of pathologies that affect the proper communication capabilities of a speaker. Those speech disorders have different origins depending on the type of disease: for instance, diseases with a morphological origin, such as cleft lip and palate, cause hypernasality, while those with a neurodegenerative origin, such as Parkinson's
-
Analysis of Glottal Inverse Filtering in the Presence of Source-Filter Interaction. Speech Commun. (IF 1.417) Pub Date : 2020-07-24 Anil Palaparthi,Ingo R Titze
The validity of glottal inverse filtering (GIF) for obtaining a glottal flow waveform from the radiated pressure signal, in the presence and absence of source-filter interaction, was studied systematically. A driven vocal fold surface model was used to generate source signals. A one-dimensional wave reflection algorithm was used to solve for acoustic pressures in the vocal tract. Several
-
Accuracy, recording interference, and articulatory quality of headsets for ultrasound recordings Speech Commun. (IF 1.417) Pub Date : 2020-07-11 Michael Pucher, Nicola Klingler, Jan Luttenberger, Lorenzo Spreafico
In this paper we evaluate the accuracy, recording interference, and articulatory quality of two different ultrasound probe stabilization headsets: a metallic Ultrasound Stabilisation Headset (USH) and UltraFit, a recently developed headset that is 3D printed in Nylon. To evaluate accuracy, we recorded three native speakers of German with different head sizes using an optical marker tracking system
-
Enhancement of cleft palate speech using temporal and spectral processing Speech Commun. (IF 1.417) Pub Date : 2020-07-09 Protima Nomo Sudro, S. R. Mahadeva Prasanna
The speech of the individuals with cleft palate (CP) is generally characterized by the presence of abnormal nasal resonances during the production of voiced sounds, primarily in vowels, and is called hypernasality. Hypernasality is present in more than 50% of the individuals with CP, and it often results in degraded speech, both in quality and intelligibility. The current work describes the signal
-
Vowels and tones as acoustic cues in Chinese subregional dialect identification Speech Commun. (IF 1.417) Pub Date : 2020-07-03 Huangmei Liu, Jie Liang, Vincent J. van Heuven, Wilbert Heeringa
The aim of the present perceptual study is to weight tones and vowels as acoustic cues in Chinese subregional dialect identification, and to test the credibility of the subregional dialect classification that has been proposed in the literature. Our findings show that listeners are able to pinpoint speakers’ subregional dialect even when only given monosyllabic Chinese word stimuli, either natural
-
Automatic intelligibility assessment of dysarthric speech using glottal parameters Speech Commun. (IF 1.417) Pub Date : 2020-07-01 N P Narendra, Paavo Alku
Objective intelligibility assessment of dysarthric speech can assist clinicians in diagnosis of speech disorders as well as in medical treatment. This study investigates the use of glottal parameters (i.e. parameters that describe the acoustical excitation of voiced speech, the glottal flow) in the automatic intelligibility assessment of dysarthric speech. Instead of directly predicting the intelligibility
-
Acoustical and perceptual characteristics of Mandarin consonants produced with an electrolarynx Speech Commun. (IF 1.417) Pub Date : 2020-06-30 Ke Xiao, Bo Zhang, Supin Wang, Mingxi Wan, Liang Wu
The electrolarynx (EL) is an electromechanical device that enables patients to produce voice following the surgical removal of their larynx. The purpose of this study is to understand the acoustic and perceptual characteristics of Mandarin consonants produced by EL speakers. First, the acoustic characteristics (including speech intensity, consonant duration, spectral peak, and F2 onset) of Mandarin
-
An Iterative Graph Spectral Subtraction Method for Speech Enhancement Speech Commun. (IF 1.417) Pub Date : 2020-06-30 Xue Yan, Zhen Yang, Tingting Wang, Haiyan Guo
In this paper, we investigate the application of graph signal processing (GSP) theory in speech enhancement. We first propose a set of shift operators to construct graph speech signals, and then analyze their spectrum in the graph Fourier domain. By leveraging the differences between the spectrum of graph speech and graph noise signals, we further propose the graph spectral subtraction (GSS) method
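For comparison, the conventional DFT-domain spectral subtraction that the graph method generalizes can be sketched in a few lines; the paper performs the analogous subtraction in the graph Fourier domain instead:

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd_est, floor=0.05):
    """Classical DFT-domain spectral subtraction baseline.
    noise_psd_est: per-bin noise power estimate, same length as rfft(noisy)."""
    spec = np.fft.rfft(noisy)
    power = np.abs(spec) ** 2
    # Subtract the noise power estimate and floor the residual to limit
    # musical noise from negative or near-zero differences.
    clean_power = np.maximum(power - noise_psd_est, floor * power)
    gain = np.sqrt(clean_power / (power + 1e-12))
    return np.fft.irfft(gain * spec, n=len(noisy))
```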
-
Significance of spectral cues in automatic speech segmentation for Indian language speech synthesizers Speech Commun. (IF 1.417) Pub Date : 2020-06-27 Arun Baby, Jeena J. Prakash, Aswin Shanmugam Subramanian, Hema A. Murthy
Building speech synthesis systems for Indian languages is challenging owing to the fact that digital resources for these languages are scarce. Vocabulary-independent speech synthesis requires that a given text be split at the level of the smallest sound unit, namely, the phone. The waveforms or models of phones are concatenated to produce speech. The waveforms corresponding to the phones
-
GEDI: Gammachirp envelope distortion index for predicting intelligibility of enhanced speech Speech Commun. (IF 1.417) Pub Date : 2020-06-17 Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani
In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDR_env, to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gammachirp
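A generic stand-in for the SDR_env idea (an envelope-domain signal-to-distortion ratio) is sketched below with a simple one-pole envelope extractor; GEDI itself uses a gammachirp auditory filterbank rather than this crude envelope:

```python
import numpy as np

def envelope(x, fs, cutoff=64.0):
    """Temporal envelope via rectification and a one-pole low-pass filter."""
    a = np.exp(-2 * np.pi * cutoff / fs)
    env = np.empty_like(x); prev = 0.0
    for i, v in enumerate(np.abs(x)):
        prev = a * prev + (1 - a) * v
        env[i] = prev
    return env

def envelope_sdr_db(clean, enhanced, fs):
    """Signal-to-distortion ratio between clean and enhanced envelopes, in dB."""
    ec, ee = envelope(clean, fs), envelope(enhanced, fs)
    dist = ee - ec
    return 10 * np.log10(np.sum(ec ** 2) / (np.sum(dist ** 2) + 1e-12))
```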
-
Automatic accent identification as an analytical tool for accent robust automatic speech recognition Speech Commun. (IF 1.417) Pub Date : 2020-06-04 Maryam Najafian, Martin Russell
We present a novel study of relationships between automatic accent identification (AID) and accent-robust automatic speech recognition (ASR), using i-vector based AID and deep neural network, hidden Markov Model (DNN-HMM) based ASR. A visualization of the AID i-vector space and a novel analysis of the accent content of the WSJCAM0 corpus are presented. Accents that occur at the periphery of AID space
-
DeepConversion: Voice conversion with limited parallel training data Speech Commun. (IF 1.417) Pub Date : 2020-06-04 Mingyang Zhang, Berrak Sisman, Li Zhao, Haizhou Li
A deep neural network approach to voice conversion usually depends on a large amount of parallel training data from source and target speakers. In this paper, we propose a novel conversion pipeline, DeepConversion, that leverages a large amount of non-parallel, multi-speaker data, but requires only a small amount of parallel training data. It is believed that we can represent the shared characteristics
-
The Hearing-Aid Speech Perception Index (HASPI) Speech Commun. (IF 1.417) Pub Date : 2020-05-24 James M. Kates; Kathryn H. Arehart
This paper presents a revised version of the Hearing-Aid Speech Perception Index (HASPI). The index is based on a model of the auditory periphery that incorporates changes due to hearing loss and is valid for both normal-hearing and hearing-impaired listeners. It is an intrusive metric that compares the time-frequency envelope and temporal fine structure (TFS) of a degraded signal to an unprocessed
-
Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features Speech Commun. (IF 1.417) Pub Date : 2020-05-22 Lamiaa Abdel-Hamid
Speech emotion recognition (SER) has recently been receiving increased interest due to the rapid advancements in affective computing and human-computer interaction. English, German, Mandarin, and Indian languages are among the most commonly considered for SER, along with other European and Asian languages. However, few studies have implemented Arabic SER systems due to the scarcity of available Arabic
-
A review of multi-objective deep learning speech denoising methods Speech Commun. (IF 1.417) Pub Date : 2020-05-21 Arian Azarang, Nasser Kehtarnavaz
This paper presents a review of multi-objective deep learning methods that have been introduced in the literature for speech denoising. After stating an overview of conventional, single objective deep learning, and hybrid or combined conventional and deep learning methods, a review of the mathematical framework of the multi-objective deep learning methods for speech denoising is provided. A representative
-
The effect of female voice on verbal processing Speech Commun. (IF 1.417) Pub Date : 2020-05-20 Laura Smorenburg, Aoju Chen
Previous studies have suggested that female voices may impede verbal processing. For example, words were remembered less well and lexical decision was slower when spoken by a female speaker. The current study tried to replicate this gender effect in an auditory semantic/associative priming task that excluded any effects of speaker variability and extended previous research by examining the role of
-
The interplay of prosodic cues in the L2: How intonation, rhythm, and speech rate in speech by Spanish learners of Dutch contribute to L1 Dutch perceptions of accentedness and comprehensibility Speech Commun. (IF 1.417) Pub Date : 2020-05-18 Lieke van Maastricht; Tim Zee; Emiel Krahmer; Marc Swerts
This study investigates the relative contribution of L2 intonation, rhythm, and speech rate to L1 perceptions of accentedness and comprehensibility. The intonation, rhythm, and speech rate of an L1 speaker of Dutch was transferred onto the segmental string of four Spanish learners of Dutch, resulting in eight conditions that reflect all possible combinations of these prosodic cues. Our results show
-
Single-channel speech enhancement with correlated spectral components: Limits-potential Speech Commun. (IF 1.417) Pub Date : 2020-05-16 Pejman Mowlaee, Johannes K.W. Stahl
In this paper, we investigate single-channel speech enhancement algorithms that operate in the short-time Fourier transform and take into account dependencies w.r.t. frequency. As a result of allowing for inter-frequency dependencies, the minimum mean square error optimal estimates of the short-time Fourier transform expansion coefficients are functions of complex-valued covariance matrices in general
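For reference, under zero-mean complex Gaussian assumptions with y = s + n and the STFT coefficients stacked into vectors across frequency, the MMSE estimate takes the familiar multivariate Wiener form (the paper's exact estimators may differ):

```latex
\hat{\mathbf{s}}_{\mathrm{MMSE}}
  = \mathbb{E}\{\mathbf{s}\mid\mathbf{y}\}
  = \boldsymbol{\Sigma}_{s}\,\bigl(\boldsymbol{\Sigma}_{s}
    + \boldsymbol{\Sigma}_{n}\bigr)^{-1}\mathbf{y}
```

The off-diagonal entries of the covariance matrices encode the inter-frequency dependencies; with diagonal covariances this reduces to the usual per-bin Wiener gain.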
-
Integrating lexical and prosodic features for automatic paragraph segmentation Speech Commun. (IF 1.417) Pub Date : 2020-05-11 Catherine Lai, Mireia Farrús, Johanna D. Moore
Spoken documents, such as podcasts or lectures, are a growing presence in everyday life. Being able to automatically identify their discourse structure is an important step toward understanding what a spoken document is about. Moreover, finer-grained units, such as paragraphs, are highly desirable for presenting and analyzing spoken content. However, little work has been done on discourse-based speech
-
HiLAM-state discriminative multi-task deep neural network in dynamic time warping framework for text-dependent speaker verification Speech Commun. (IF 1.417) Pub Date : 2020-05-06 Mohammad Azharuddin Laskar, Rabul Hussain Laskar
This paper builds on a multi-task Deep Neural Network (DNN), which provides an utterance-level feature representation called j-vector, to implement a Text-dependent Speaker Verification (TDSV) system. This technique exploits the speaker idiosyncrasies associated with individual pass-phrases. However, speaker information is known to be characteristic of more specific speech units and, thus, it is likely
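The DTW component of such a framework is the standard alignment recursion over frame-level feature sequences; here is a minimal sketch (the feature choice and scoring are the paper's contribution and are not shown):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two feature sequences (T x D)."""
    T, U = len(a), len(b)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, U]

# Typical use: align enrollment and test utterances of the same pass-phrase
# and threshold the (length-normalized) alignment cost.
```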
-
Analytic phase features for dysarthric speech detection and intelligibility assessment Speech Commun. (IF 1.417) Pub Date : 2020-05-01 Krishna Gurugubelli, Anil Kumar Vuppala
The objectives of the dysarthria assessment are to discriminate dysarthric speech from normal speech, to estimate the severity of dysarthria in terms of the dysarthric speech intelligibility, and to find the motor speech subsystem which causes defects in speech production. In this work, analytic phase features are investigated for the objective assessment of dysarthria. In this connection, the importance
Contents have been reproduced by permission of the publishers.