• arXiv.cs.SD Pub Date : 2020-01-19
Marcello Federico; Robert Enyedi; Roberto Barra-Chicote; Ritwik Giri; Umut Isik; Arvindh Krishnaswamy

We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine tuning of the duration of each utterance, and, finally, audio rendering that enriches text-to-speech output with background noise and reverberation extracted from the original audio. We report on a subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.

Updated: 2020-01-22
• arXiv.cs.SD Pub Date : 2020-01-20
Shreyas Ramoji; Prashant Krishnan V; Prachi Singh; Sriram Ganapathy

The state-of-the-art approach to speaker verification involves the extraction of discriminative embeddings like x-vectors followed by a generative model back-end using probabilistic linear discriminant analysis (PLDA). In this paper, we propose a pairwise neural discriminative model for the task of speaker verification which operates on a pair of speaker embeddings such as x-vectors/i-vectors and outputs a score that can be considered as a scaled log-likelihood ratio. We construct a differentiable cost function which approximates the speaker verification loss, namely the minimum detection cost. The pre-processing steps of linear discriminant analysis (LDA), unit length normalization and within-class covariance normalization are all modeled as layers of a neural model and the speaker verification cost functions can be back-propagated through these layers during training. We also explore regularization techniques to prevent overfitting, which is a major concern in using discriminative back-end models for verification tasks. The experiments are performed on the NIST SRE 2018 development and evaluation datasets. We observe average relative improvements of 8% in the CMN2 condition and 30% in the VAST condition over the PLDA baseline system.
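
The differentiable cost mentioned above can be illustrated with a small sketch. It assumes one common construction, replacing the hard threshold counts of the detection cost with sigmoids; the abstract does not give the authors' exact formula, and the function name, `alpha`, and the prior `p_target=0.05` are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detection_cost(scores, is_target, threshold, alpha=None,
                   p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Detection cost at a fixed threshold.

    alpha=None  -> hard counts (not differentiable).
    alpha=float -> sigmoid-smoothed counts (differentiable in the scores
                   and the threshold); larger alpha approaches the hard cost.
    p_target, c_miss and c_fa are illustrative assumptions.
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    tar, non = scores[is_target], scores[~is_target]
    if alpha is None:
        p_miss = np.mean(tar < threshold)    # targets rejected
        p_fa = np.mean(non >= threshold)     # non-targets accepted
    else:
        p_miss = np.mean(sigmoid(alpha * (threshold - tar)))
        p_fa = np.mean(sigmoid(alpha * (non - threshold)))
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Toy trial scores: two target trials, two non-target trials.
scores = [1.0, -1.0, -2.0, 2.0]
is_target = [True, True, False, False]
hard = detection_cost(scores, is_target, threshold=0.0)
soft = detection_cost(scores, is_target, threshold=0.0, alpha=50.0)
```

With a steep sigmoid the smoothed cost tracks the hard cost closely, while a gentler slope would give smoother gradients at the price of a looser approximation.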

Updated: 2020-01-22
• arXiv.cs.SD Pub Date : 2020-01-20
Hiroki Tamaru; Shinnosuke Takamichi; Naoko Tanji; Hiroshi Saruwatari

Thanks to developments in machine learning techniques, it has become possible to synthesize high-quality singing voices of a single singer. An open multispeaker singing-voice corpus would further accelerate the research in singing-voice synthesis. However, conventional singing-voice corpora only consist of the singing voices of a single singer. We designed a Japanese multispeaker singing-voice corpus called "JVS-MuSiC" with the aim to analyze and synthesize a variety of voices. The corpus consists of 100 singers' recordings of the same song, Katatsumuri, which is a Japanese children's song. It also includes another song that is different for each singer. In this paper, we describe the design of the corpus and experimental analyses using JVS-MuSiC. We investigated the relationship between 1) the similarity of singing voices and perceptual oneness of unison singing voices and between 2) the similarity of singing voices and that of speech. The results suggest that 1) there is a positive and moderate correlation between singing-voice similarity and the oneness of unison and that 2) the correlation between singing-voice similarity and speech similarity is weak. This corpus is freely available online.

Updated: 2020-01-22
• arXiv.cs.SD Pub Date : 2019-06-07
Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra

This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set due to the fact that they come from different web audio sources. This can correspond to a realistic scenario given by the difficulty in gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.

Updated: 2020-01-22
• arXiv.cs.SD Pub Date : 2019-12-11
Marius Cotescu; Thomas Drugman; Goeric Huybrechts; Jaime Lorenzo-Trueba; Alexis Moinet

We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize on unseen speakers, when trained with data from multiple speakers. We show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa.

Updated: 2020-01-22
• arXiv.cs.SD Pub Date : 2020-01-15
Andrea Valenti; Antonio Carta; Davide Bacciu

We address the challenging open problem of learning an effective latent space for symbolic music data in generative music modeling. We focus on leveraging adversarial regularization as a flexible and natural means to imbue variational autoencoders with context information concerning music genre and style. Throughout the paper, we show how Gaussian mixtures taking into account music metadata information can be used as an effective prior for the autoencoder latent space, introducing the first Music Adversarial Autoencoder (MusAE). The empirical analysis on a large scale benchmark shows that our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders. It is also able to create realistic interpolations between two musical sequences, smoothly changing the dynamics of the different tracks. Experiments show that the model can organise its latent space according to low-level properties of the musical pieces, as well as embed into the latent variables the high-level genre information injected from the prior distribution to increase its overall performance. This allows us to perform changes to the generated pieces in a principled way.

Updated: 2020-01-17
• arXiv.cs.SD Pub Date : 2020-01-15
Huy Phan; Ian V. McLoughlin; Lam Pham; Oliver Y. Chén; Philipp Koch; Maarten De Vos; Alfred Mertins

Generative adversarial networks (GAN) have recently been shown to be efficient for speech enhancement. Most, if not all, existing speech enhancement GANs (SEGANs) make use of a single generator to perform one-stage enhancement mapping. In this work, we propose two novel SEGAN frameworks, iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN). In the two proposed frameworks, the GAN architectures are composed of multiple generators that are chained to accomplish multiple-stage enhancement mapping which gradually refines the noisy input signals in stage-wise fashion. On the one hand, ISEGAN's generators share their parameters to learn an iterative enhancement mapping. On the other hand, DSEGAN's generators share a common architecture but their parameters are independent; as a result, different enhancement mappings are learned at different stages of the network. We empirically demonstrate favorable results obtained by the proposed ISEGAN and DSEGAN frameworks over the vanilla SEGAN. The source code is available at http://github.com/pquochuy/idsegan.
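
The difference between the two chaining schemes can be sketched with toy code. This is only an illustration under strong assumptions: the "generator" here is a hypothetical single linear layer with a tanh, not the SEGAN encoder-decoder, and no adversarial training is shown. The point is only how parameters are shared (ISEGAN) or kept independent (DSEGAN) across stages.

```python
import numpy as np

def make_generator(rng, dim):
    """Hypothetical toy generator parameters (one linear layer)."""
    return {"W": rng.standard_normal((dim, dim)) * 0.1, "b": np.zeros(dim)}

def apply_generator(params, x):
    """One enhancement stage: maps a signal frame to a refined frame."""
    return np.tanh(x @ params["W"] + params["b"])

def chain(generators, noisy):
    """Run the stages in sequence and keep every intermediate output."""
    outputs, x = [], noisy
    for params in generators:
        x = apply_generator(params, x)
        outputs.append(x)
    return outputs

rng = np.random.default_rng(0)
dim, n_stages = 8, 3

# ISEGAN-style: the same parameter set is reused at every stage.
shared = make_generator(rng, dim)
isegan_generators = [shared] * n_stages

# DSEGAN-style: same architecture, but independent parameters per stage.
dsegan_generators = [make_generator(rng, dim) for _ in range(n_stages)]

noisy = rng.standard_normal(dim)
isegan_outputs = chain(isegan_generators, noisy)
dsegan_outputs = chain(dsegan_generators, noisy)
```

In this sketch the ISEGAN-style chain has a single parameter set regardless of depth, while the DSEGAN-style chain grows linearly with the number of stages.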

Updated: 2020-01-17
• arXiv.cs.SD Pub Date : 2020-01-16
Bohan Zhai; Tianren Gao; Flora Xue; Daniel Rothchild; Bichen Wu; Joseph E. Gonzalez; Kurt Keutzer

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave.

Updated: 2020-01-17
• arXiv.cs.SD Pub Date : 2020-01-16
Chunyi Wang

A speech emotion recognition algorithm based on multi-feature and multi-lingual fusion is proposed in order to resolve the low recognition accuracy caused by the lack of large speech datasets and the low robustness of acoustic features in the recognition of speech emotion. First, handcrafted and deep automatic features are extracted from existing data in Chinese and English speech emotions. Then, the various features are fused respectively. Finally, the fused features of different languages are fused again and trained in a classification model. Comparing the fused features with the unfused ones, the results show that the fused features significantly enhance the accuracy of the speech emotion recognition algorithm. The proposed solution is evaluated on two Chinese corpora and two English corpora, and is shown to provide more accurate predictions compared to the original solution. As a result of this study, the multi-feature and multi-lingual fusion algorithm can significantly improve the speech emotion recognition accuracy when the dataset is small.

Updated: 2020-01-17
• arXiv.cs.SD Pub Date : 2020-01-14
Yanpei Shi; Qiang Huang; Thomas Hain

In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks. Furthermore, to increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in the time and frequency domains. To evaluate speaker identification and verification performance of the proposed approach, we test it on the VoxCeleb1 dataset, one of the most widely used benchmark datasets. Moreover, the robustness of our proposed approach is also tested on VoxCeleb1 data when corrupted by three types of interference (general noise, music, and babble) at different signal-to-noise ratio (SNR) levels. The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.

Updated: 2020-01-16
• arXiv.cs.SD Pub Date : 2020-01-15
Alexander Schindler; Thomas Lidy; Sebastian Böck

Deep learning has become the state of the art in visual computing and is steadily entering the Music Information Retrieval (MIR) and audio retrieval domains. In order to bring attention to this topic we propose an introductory tutorial on deep learning for MIR. Besides a general introduction to neural networks, the proposed tutorial covers a wide range of MIR relevant deep learning approaches. Convolutional Neural Networks are currently a de-facto standard for deep learning based audio retrieval. Recurrent Neural Networks have proven to be effective in onset detection tasks such as beat or audio-event detection. Siamese Networks have been shown to be effective in learning audio representations and distance functions specific for music similarity retrieval. We will incorporate both academic and industrial points of view into the tutorial. Accompanying the tutorial, we will create a GitHub repository for the content presented at the tutorial as well as references to state-of-the-art work and literature for further reading. This repository will remain public after the conference.

Updated: 2020-01-16
• arXiv.cs.SD Pub Date : 2020-01-11
Mingda Li; Weitong Ruan; Xinyue Liu; Luca Soldaini; Wael Hamza; Chengwei Su

In a modern spoken language understanding (SLU) system, the natural language understanding (NLU) module takes interpretations of an utterance from the automatic speech recognition (ASR) module as the input. The NLU module usually uses the first best interpretation of a given utterance in downstream tasks such as domain and intent classification. However, the ASR module might misrecognize some utterances, and the first best interpretation could be erroneous and noisy. Solely relying on the first best interpretation could make the performance of downstream tasks non-optimal. To address this issue, we introduce a series of simple yet efficient models for improving the understanding of the semantics of the input utterances by collectively exploiting the n-best speech interpretations from the ASR module.

Updated: 2020-01-16
• arXiv.cs.SD Pub Date : 2020-01-14
Bin Gu; Wu Guo

This paper presents an improved deep embedding learning method based on convolutional neural networks (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multi-scale convolution (MSCNN) is adopted in frame-level layers to capture complementary speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in the pooling layer, which can integrate more useful long-term speaker characteristics in the temporal pooling layer. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show the proposed BWSA can further improve the performance of the DNN embedding system.

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-14
Bin Gu; Wu Guo

The x-vector maps segments of arbitrary duration to vectors of fixed dimension using a deep neural network. Combined with the probabilistic linear discriminant analysis (PLDA) backend, the x-vector/PLDA has become the dominant framework in text-independent speaker verification. Nevertheless, how to extract x-vectors appropriate for the PLDA backend is a key problem. In this paper, we propose a Gaussian noise constrained network (GNCN) to extract x-vectors, which adopts a multi-task learning strategy with the primary task classifying the speakers and the auxiliary task just fitting the Gaussian noises. Experiments are carried out using the SITW database. The results demonstrate the effectiveness of our proposed method.

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-14
Charles Jankowski; Vishwas Mruthyunjaya; Ruixi Lin

Social robots deployed in public spaces present a challenging task for ASR because of a variety of factors, including noise at SNRs of 20 to 5 dB. Existing ASR models perform well for higher SNRs in this range, but degrade considerably with more noise. This work explores methods for providing improved ASR performance in such conditions. We use the AiShell-1 Chinese speech corpus and the Kaldi ASR toolkit for evaluations. We were able to exceed state-of-the-art ASR performance with SNR lower than 20 dB, demonstrating the feasibility of achieving relatively high performing ASR with open-source toolkits and hundreds of hours of training data, which are commonly available.

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-14
Jesse Engel; Lamtharn Hantrakul; Chenjie Gu; Adam Roberts

Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available at https://github.com/magenta/ddsp and we welcome further contributions from the community and domain experts.
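
To make the idea of a classic synthesis element with smooth, learnable controls concrete, here is a minimal additive harmonic oscillator in plain NumPy. It is a sketch only: it does not use the DDSP library or its API, the function name and arguments are hypothetical, and in a differentiable setting the same arithmetic would be written in an autodiff framework so that a network could predict `f0_hz`, `amplitude`, and `harmonic_weights`.

```python
import numpy as np

def harmonic_synth(f0_hz, amplitude, harmonic_weights, sample_rate=16000):
    """Sum of sinusoids at integer multiples of a time-varying f0.

    f0_hz, amplitude: per-sample controls, shape (n_samples,).
    harmonic_weights: relative strength of each harmonic (normalised here).
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    amplitude = np.asarray(amplitude, dtype=float)
    weights = np.asarray(harmonic_weights, dtype=float)
    weights = weights / weights.sum()

    harmonics = np.arange(1, len(weights) + 1)
    freqs = f0_hz[:, None] * harmonics[None, :]    # (n_samples, n_harmonics)
    audible = freqs < sample_rate / 2              # drop partials at/above Nyquist
    # Integrate instantaneous frequency to obtain phase.
    phase = 2.0 * np.pi * np.cumsum(freqs / sample_rate, axis=0)
    partials = np.sin(phase) * weights[None, :] * audible
    return amplitude * partials.sum(axis=1)

# One second of a 440 Hz tone with three decaying harmonics.
n = 16000
tone = harmonic_synth(np.full(n, 440.0), np.ones(n), [1.0, 0.5, 0.25])
```

Because pitch and loudness enter as explicit inputs, they can be changed independently at synthesis time, which is the kind of control the abstract describes.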

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-14
Ivan Kiskin; Adam D. Cobb; Lawrence Wang; Stephen Roberts

Mosquitoes are the only known vector of malaria, which leads to hundreds of thousands of deaths each year. Understanding the number and location of potential mosquito vectors is of paramount importance to aid the reduction of malaria transmission cases. In recent years, deep learning has become widely used for bioacoustic classification tasks. In order to enable further research applications in this field, we release a new dataset of mosquito audio recordings. With over a thousand contributors, we obtained 195,434 labels of two-second duration, of which approximately 10 percent signify mosquito events. We present an example use of the dataset, in which we train a convolutional neural network on log-Mel features, showcasing the information content of the labels. We hope this will become a vital resource for those researching all aspects of malaria, and add to the existing audio datasets for bioacoustic detection and signal processing.

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-13
Anant Khandelwal; E. B. Goud; Y. Chand; L. Kumar; S. Prasad; N. Agarwala; R. Singh

In this paper, two-microphone-based systems for audio zooming are proposed for the first time. The audio zooming application allows sound capture and enhancement from the front direction while attenuating interfering sources from all other directions. The complete audio zooming system utilizes beamforming based target extraction. In particular, the Minimum Power Distortionless Response (MPDR) beamformer and the Griffiths-Jim Beamformer (GJBF) are explored. This is followed by block thresholding for residual noise and interference suppression, and zooming effect creation. A number of simulation and real life experiments using a Samsung smartphone (Samsung Galaxy A5) were conducted. Objective and subjective measures confirm the rich user experience.
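
For readers unfamiliar with MPDR, the textbook weight formula is w = R⁻¹d / (dᴴR⁻¹d), where R is the spatial covariance of the microphone signals and d is the steering vector toward the target. The sketch below applies it to one frequency bin of a two-microphone array. The microphone spacing, frequency, diagonal loading, and random test data are all hypothetical, and nothing here reflects the authors' actual implementation.

```python
import numpy as np

def steering_vector(freq_hz, angle_rad, spacing_m=0.1, speed=343.0):
    """Far-field steering vector for two microphones (assumed geometry)."""
    delay = spacing_m * np.sin(angle_rad) / speed
    return np.array([1.0, np.exp(-2j * np.pi * freq_hz * delay)])

def mpdr_weights(R, d, diag_load=1e-6):
    """Textbook MPDR weights: minimise output power subject to w^H d = 1."""
    R = np.asarray(R, dtype=complex)
    d = np.asarray(d, dtype=complex)
    # Small diagonal loading keeps the solve well conditioned.
    loading = diag_load * np.trace(R).real / len(d)
    Rinv_d = np.linalg.solve(R + loading * np.eye(len(d)), d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Synthetic snapshots for one frequency bin: shape (n_mics, n_frames).
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 500)) + 1j * rng.standard_normal((2, 500))
R = X @ X.conj().T / X.shape[1]      # sample spatial covariance

d = steering_vector(freq_hz=1000.0, angle_rad=0.0)   # "front" direction
w = mpdr_weights(R, d)
response_to_target = np.vdot(w, d)   # w^H d, should be 1 (distortionless)
output = w.conj() @ X                # beamformed bin, one value per frame
```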

Updated: 2020-01-15
• arXiv.cs.SD Pub Date : 2020-01-12
Harishchandra Dubey; Dimitra Emmanouilidou; Ivan J. Tashev

Audio event classification is an important task for several applications such as surveillance and audio, video, and multimedia retrieval. There are approximately 3M people with hearing loss who can't perceive events happening around them. This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss. We propose a ladder network based audio event classifier that utilizes 5 s sound recordings derived from the Freesound project. We adopt state-of-the-art convolutional neural network (CNN) embeddings as audio features for this task. We also investigate the extreme learning machine (ELM) for event classification. In this study, the proposed classifiers are compared with a support vector machine (SVM) baseline. We propose signal and feature normalization that aims to reduce the mismatch between different recording scenarios. First, a CNN is trained on weakly labeled AudioSet data. Next, the pre-trained model is adopted as a feature extractor for the proposed CURE corpus. We incorporate the ESC-50 dataset as a second evaluation set. Results and discussions validate the superiority of the ladder network over the ELM and SVM classifiers in terms of robustness and increased classification accuracy. While the ladder network is robust to data mismatches, the simpler SVM and ELM classifiers are sensitive to such mismatches, where the proposed normalization techniques can play an important role. Experimental studies with the ESC-50 and CURE corpora elucidate the differences in dataset complexity and robustness offered by the proposed approaches.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2020-01-10
Seung Hee Yang; Minhwa Chung

Dysarthria is a motor speech impairment affecting millions of people. Dysarthric speech can be far less intelligible than that of non-dysarthric speakers, causing significant communication difficulties. The goal of our work is to develop a model for dysarthric to healthy speech conversion using Cycle-consistent GAN. Using 18,700 dysarthric and 8,610 healthy control Korean utterances that were recorded for the purpose of automatic recognition of voice keyboard in a previous study, the generator is trained to transform dysarthric to healthy speech in the spectral domain, which is then converted back to speech. Objective evaluation using automatic speech recognition of the generated utterance on a held-out test set shows that the recognition performance is improved compared with the original dysarthric speech after performing adversarial training, as the absolute WER has been lowered by 33.4%. This demonstrates that the proposed GAN-based conversion method is useful for improving dysarthric speech intelligibility.
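
Since the evaluation above is stated in terms of word error rate (WER), a short sketch of the standard definition may help: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. An absolute reduction is then a difference between two such rates in percentage points. The function below is a generic illustration; the scoring tool and text normalisation used in the paper are not given in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Made-up example sentences, not taken from the paper's data.
wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```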

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2020-01-13
Pranay Manocha; Adam Finkelstein; Zeyu Jin; Nicholas J. Bryan; Richard Zhang; Gautham J. Mysore

Assessment of many audio processing tasks relies on subjective evaluation which is time-consuming and expensive. Efforts have been made to create objective metrics but existing ones correlate poorly with human judgment. In this work, we construct a differentiable metric by fitting a deep neural network on a newly collected dataset of just-noticeable differences (JND), in which humans annotate whether a pair of audio clips are identical or not. By varying the type of differences, including noise, reverb, and compression artifacts, we are able to learn a metric that is well-calibrated with human judgments. Furthermore, we evaluate this metric by training a neural network, using the metric as a loss function. We find that simply replacing an existing loss with our metric yields significant improvement in denoising as measured by subjective pairwise comparison.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2020-01-13
Kangle Deng; Aayush Bansal; Deva Ramanan

We present an unsupervised approach that enables us to convert the speech input of any one individual to an output set of potentially-infinitely many speakers. One can stand in front of a mic and be able to make their favorite celebrity say the same words. Our approach builds on simple autoencoders that project out-of-sample data to the distribution of the training set (motivated by PCA/linear autoencoders). We use an exemplar autoencoder to learn the voice and specific style (emotions and ambiance) of a target speaker. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers in very little time using only two to three minutes of audio data from a speaker. We also exhibit the usefulness of our approach for generating video from audio signals and vice-versa. We encourage the reader to check out our project webpage for various synthesized examples: https://dunbar12138.github.io/projectpage/Audiovisual/

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2018-10-16
Jing-Xuan Zhang; Zhen-Hua Ling; Li-Juan Liu; Yuan Jiang; Li-Rong Dai

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by aligning the feature sequences of source and target speakers implicitly using an attention mechanism. At the conversion stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel-scale spectrograms are adopted as acoustic features which contain both excitation and vocal tract descriptions of speech signals. The bottleneck features extracted from source speech using an automatic speech recognition (ASR) model are appended as auxiliary input. A WaveNet vocoder conditioned on Mel-spectrograms is built to reconstruct waveforms from the outputs of the SCENT model. It is worth noting that our proposed method can achieve appropriate duration conversion, which is difficult in conventional methods. Experimental results show that our proposed method obtained better objective and subjective performance than the baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic models. This proposed method also outperformed our previous work which achieved the top rank in Voice Conversion Challenge 2018. Ablation tests further confirmed the effectiveness of several components in our proposed method.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2019-10-21
Mengfan Zhang; Zhongshu Ge; Tiejun Liu; Xihong Wu; Tianshu Qu

The head-related transfer function (HRTF) plays an important role in the construction of 3D auditory displays. This paper presents an individual HRTF modeling method using deep neural networks based on spatial principal component analysis. The HRTFs are represented by a small set of spatial principal components combined with frequency and individual-dependent weights. By estimating the spatial principal components using deep neural networks and mapping the corresponding weights to a quantity of anthropometric parameters, we predict individual HRTFs in arbitrary spatial directions. The objective and subjective experiments evaluate the HRTFs generated by the proposed method, the principal component analysis (PCA) method, and the generic method. The results show that the HRTFs generated by the proposed method and PCA method perform better than the generic method. For most frequencies the spectral distortion of the proposed method is significantly smaller than that of the PCA method in the high frequencies but significantly larger in the low frequencies. The evaluation of the localization model shows the PCA method is better than the proposed method. The subjective localization experiments show that the PCA and the proposed methods have similar performances in most conditions. Both the objective and subjective experiments show that the proposed method can predict HRTFs in arbitrary spatial directions.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2019-12-03
Wei Ping; Kainan Peng; Kexin Zhao; Zhao Song

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated 2-D convolutional architecture, while modeling the local variations using compact autoregressive functions. It provides a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow as special cases. WaveFlow can generate high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate waveforms with hundreds of thousands of time-steps. Furthermore, it can close the significant likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has 15× fewer parameters than WaveGlow and can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time on a V100 GPU without engineered inference kernels.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2019-12-14
Helin Wang; Yuexian Zou; Dading Chong; Wenwu Wang

Convolutional neural networks (CNNs) are one of the best-performing neural network architectures for environmental sound classification (ESC). Recently, attention mechanisms have been used in CNNs to capture the useful information from the audio signal for sound classification, especially for weakly labelled data where the timing information about the acoustic events is not available in the training data, apart from the availability of sound class labels. In these methods, however, the inherent time-frequency characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we propose a new method, called time-frequency enhancement block (TFBlock), in which temporal attention and frequency attention are employed to enhance the features from relevant frames and frequency bands. Compared with other attention mechanisms, in our method, parallel branches are constructed which allow the temporal and frequency features to be attended respectively in order to mitigate interference from the sections where no sound events happened in the acoustic environments. The experiments on three benchmark ESC datasets show that our method improves the classification performance and also exhibits robustness to noise.

Updated: 2020-01-14
• arXiv.cs.SD Pub Date : 2019-12-13
Hossein Zeinali; Kong Aik Lee; Jahangir Alam; Lukas Burget

This document describes the Short-duration Speaker Verification (SdSV) Challenge 2020. The main goal of the challenge is to evaluate new technologies for text-dependent (TD) and text-independent (TI) speaker verification (SV) in a short duration scenario. The proposed challenge evaluates SdSV with varying degrees of phonetic overlap between the enrollment and test utterances (cross-lingual). It is the first challenge with a broad focus on systematic benchmarking and analysis of varying degrees of phonetic variability in short-duration speaker recognition. We expect that modern methods (deep neural networks in particular) will play a key role.

Updated: 2020-01-13
• arXiv.cs.SD Pub Date : 2020-01-09
Maurantonio Caprolu; Savio Sciancalepore; Roberto Di Pietro

Short-range audio channels have a few distinguishing characteristics: ease of use, low deployment costs, and easy to tune frequencies, to cite a few. Moreover, thanks to their seamless adaptability to the security context, many techniques and tools based on audio signals have been recently proposed. However, while the most promising solutions are turning into valuable commercial products, acoustic channels are increasingly used also to launch attacks against systems and devices, leading to security concerns that could thwart their adoption. To provide a rigorous, scientific, security-oriented review of the field, in this paper we survey and classify methods, applications, and use-cases rooted on short-range audio channels for the provisioning of security services---including Two-Factor Authentication techniques, pairing solutions, device authorization strategies, defense methodologies, and attack schemes. Moreover, we also point out the strengths and weaknesses deriving from the use of short-range audio channels. Finally, we provide open research issues in the context of short-range audio channels security, calling for contributions from both academia and industry.

Updated: 2020-01-10
• arXiv.cs.SD Pub Date : 2019-07-18
Mohammad Eslami; Christiane Neuschaefer-Rube; Antoine Serrurier

The various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variations is crucial for understanding speech production, diagnosing speech disorders and planning therapy. Identifying key anatomical landmarks of these structures on medical images is a pre-requisite for any quantitative analysis, and the rising amount of data generated in the field calls for an automatic solution. The challenge lies in the high inter- and intra-speaker variability, the mutual interaction between the articulators and the moderate quality of the images. This study addresses this issue for the first time and tackles it by means of Deep Learning. It proposes a dedicated network architecture named Flat-net, and its performance is evaluated and compared with eleven state-of-the-art methods from the literature. The dataset contains midsagittal anatomical Magnetic Resonance Images for 9 speakers sustaining 62 articulations with 21 annotated anatomical landmarks per image. Results show that the Flat-net approach outperforms the former methods, leading to an overall Root Mean Square Error of 3.6 pixels/0.36 cm obtained in a leave-one-out procedure over the speakers. The implementation codes are also shared publicly on GitHub.

Updated: 2020-01-10
• arXiv.cs.SD Pub Date : 2020-01-08
Anup Anand Deshmukh; Catherine Soladie; Renaud Seguier

Emotion plays a key role in many applications like healthcare, to gather patients' emotional behavior. There are certain emotions which are given more importance due to their effectiveness in understanding human feelings. In this paper, we propose an approach that models human stress from audio signals. The research challenge in speech emotion detection is defining the very meaning of stress and being able to categorize it in a precise manner. Supervised Machine Learning models, including state-of-the-art Deep Learning classification methods, rely on the availability of clean and labelled data. One of the problems in affective computation and emotion detection is the limited amount of annotated stress data. The existing labelled stress emotion datasets are highly subject to the perception of the annotator. We address the first issue of feature selection by exploiting the use of traditional MFCC features in a Convolutional Neural Network. Our experiments show that Emo-CNN consistently and significantly outperforms the popular existing methods over multiple datasets. It achieves 90.2% categorical accuracy on the Emo-DB dataset. To tackle the second and more significant problem of subjectivity in stress labels, we use Lovheim's cube, which is a 3-dimensional projection of emotions. The cube aims at explaining the relationship between neurotransmitters and the positions of emotions in 3D space. The learnt emotion representations from the Emo-CNN are mapped to the cube using three-component PCA (Principal Component Analysis), which is then used to model human stress. This proposed approach not only circumvents the need for labelled stress data but also complies with the psychological theory of emotions given by Lovheim's cube. We believe that this work is the first step towards creating a connection between Artificial Intelligence and the chemistry of human emotions.

Updated: 2020-01-09
• arXiv.cs.SD Pub Date : 2020-01-08
Yin-Cheng Yeh; Wen-Yi Hsiao; Satoru Fukayama; Tetsuro Kitahara; Benjamin Genchel; Hao-Min Liu; Hao-Wen Dong; Yian Chen; Terence Leong; Yi-Hsuan Yang

Several prior works have proposed various methods for the task of automatic melody harmonization, in which a model aims to generate a sequence of chords to serve as the harmonic accompaniment of a given multiple-bar melody sequence. In this paper, we present a comparative study of the performance of a set of canonical approaches to this task, including a template matching based model, a hidden Markov based model, a genetic algorithm based model, and two deep learning based models. The evaluation is conducted on a dataset of 9,226 melody/chord pairs newly collected for this study, considering up to 48 triad chords, using a standardized training/test split. We report the result of an objective evaluation using six different metrics and a subjective study with 202 participants.

Updated: 2020-01-09
• arXiv.cs.SD Pub Date : 2020-01-08
Ondřej Mokrý; Pavel Rajmic

We deal with the problem of sparsity-based audio inpainting. A consequence of optimization approaches is actually the insufficient energy of the signal within the filled gap. We propose improvements to the audio inpainting framework based on sparsity and convex optimization, aiming at compensating for this energy loss. The new ideas are based on different types of weighting, both in the coefficient and the time domains. We show that our propositions improve the inpainting performance both in terms of the SNR and ODG. However, the autoregressive Janssen algorithm remains a strong competitor.

Updated: 2020-01-09
• arXiv.cs.SD Pub Date : 2020-01-08
Niko Moritz; Takaaki Hori; Jonathan Le Roux

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.7% and 7.0% WER for the "clean" and "other" test data of LibriSpeech, which to the best of our knowledge is the best published streaming end-to-end ASR result for this task.
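
As a rough illustration of the time-restricted self-attention mentioned in the abstract, the sketch below builds a boolean mask that lets each encoder frame attend only to a limited window of past and future frames. The function name and the `left`/`right` window sizes are hypothetical choices made for illustration; they are not taken from the paper, and the triggered attention mechanism is not covered here.

```python
def time_restricted_mask(num_frames, left, right):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper: `left` and `right` are the numbers of past and
    future frames visible to each position (illustrative values only).
    """
    mask = []
    for i in range(num_frames):
        row = [(i - left) <= j <= (i + right) for j in range(num_frames)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask as a grid: "x" marks an allowed (query, key) pair.
    for row in time_restricted_mask(num_frames=6, left=2, right=1):
        print("".join("x" if allowed else "." for allowed in row))
```

In a setup like this, a smaller `right` value means less look-ahead per layer; when several such layers are stacked, the effective look-ahead grows with the number of layers.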

Updated: 2020-01-09
• arXiv.cs.SD Pub Date : 2020-01-06
Stefan Lattner

In recent years, artificial neural networks (ANNs) have become a universal tool for tackling real-world problems. ANNs have also shown great success in music-related tasks including music summarization and classification, similarity estimation, computer-aided or autonomous composition, and automatic music analysis. As structure is a fundamental characteristic of Western music, it plays a role in all these tasks. Some structural aspects are particularly challenging to learn with current ANN architectures. This is especially true for mid- and high-level self-similarity, tonal and rhythmic relationships. In this thesis, I explore the application of ANNs to different aspects of musical structure modeling, identify some challenges involved and propose strategies to address them. First, using probability estimations of a Restricted Boltzmann Machine (RBM), a probabilistic bottom-up approach to melody segmentation is studied. Then, a top-down method for imposing a high-level structural template in music generation is presented, which combines Gibbs sampling using a convolutional RBM with gradient-descent optimization on the intermediate solutions. Furthermore, I motivate the relevance of musical transformations in structure modeling and show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments. For learning transformations in sequences, I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals. Furthermore, the applicability of these interval representations to a top-down discovery of repeated musical sections is shown. Finally, a recurrent variant of the GAE is proposed, and its efficacy in music prediction and modeling of low-level repetition structure is demonstrated.

Updated: 2020-01-08
• arXiv.cs.SD Pub Date : 2020-01-06
Vikramjit Mitra; Horacio Franco

Unseen or out-of-domain data can seriously degrade the performance of a neural network model, indicating the model's failure to generalize to unseen data. Neural net pruning can not only help to reduce a model's size but can improve the model's generalization capacity as well. Pruning approaches look for low-salient neurons that are less contributive to a model's decision and hence can be removed from the model. This work investigates if pruning approaches are successful in detecting neurons that are either high-salient (mostly active or hyper) or low-salient (barely active or hypo), and whether removal of such neurons can help to improve the model's generalization capacity. Traditional blind adaptation techniques update either the whole or a subset of layers, but have never explored selectively updating individual neurons across one or more layers. Focusing on the fully connected layers of a convolutional neural network (CNN), this work shows that it may be possible to selectively adapt certain neurons (consisting of the hyper and the hypo neurons) first, followed by a full-network fine tuning. Using the task of automatic speech recognition, this work demonstrates how the removal of hyper and hypo neurons from a model can improve the model's performance on out-of-domain speech data and how selective neuron adaptation can ensure improved performance when compared to traditional blind model adaptation.

Updated: 2020-01-08
• arXiv.cs.SD Pub Date : 2020-01-06
Zhong Meng; Yashesh Gaur; Jinyu Li; Yifan Gong

Predicting words and subword units (WSUs) as the output has been shown to be effective for the attention-based encoder-decoder (AED) model in end-to-end speech recognition. However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion. Little effort has been made to explicitly model the morphological relationships among WSUs. In this work, we propose a novel character-aware (CA) AED model in which each WSU embedding is computed by summarizing the embeddings of its constituent characters using a CA-RNN. This WSU-independent CA-RNN is jointly trained with the encoder, the decoder and the attention network of a conventional AED to predict WSUs. With CA-AED, the embeddings of morphologically similar WSUs are naturally and directly correlated through the CA-RNN in addition to the semantic and acoustic relations modeled by a traditional AED. Moreover, CA-AED significantly reduces the model parameters in a traditional AED by replacing the large pool of WSU embeddings with a much smaller set of character embeddings. On a 3400-hour Microsoft Cortana dataset, CA-AED achieves up to 11.9% relative WER improvement over a strong AED baseline with 27.1% fewer model parameters.

Updated: 2020-01-08
• arXiv.cs.SD Pub Date : 2020-01-06
Zhong Meng; Jinyu Li; Yashesh Gaur; Yifan Gong

Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing from either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S, the student always learns from both the teacher and the ground truth with a pair of adaptive weights assigned to the soft and one-hot labels quantifying the confidence on each of the knowledge sources. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvement over a strong E2E model trained with the same amount of far-field data.

Updated: 2020-01-08
• arXiv.cs.SD Pub Date : 2019-08-28
Andong Li; Minmin Yuan; Chengshi Zheng; Xiaodong Li

Recently, progressive learning has shown its capacity to improve speech quality and speech intelligibility when it is combined with deep neural network (DNN) and long short-term memory (LSTM) based monaural speech enhancement algorithms, especially in low signal-to-noise ratio (SNR) conditions. Nevertheless, due to a large number of parameters and high computational complexity, it is hard to implement in current resource-limited micro-controllers, and thus it is essential to significantly reduce both the number of parameters and the computational load for practical applications. For this purpose, we propose a novel progressive learning framework with causal convolutional recurrent neural networks called PL-CRNN, which takes advantage of both convolutional neural networks and recurrent neural networks to drastically reduce the number of parameters and simultaneously improve speech quality and speech intelligibility. Numerous experiments verify the effectiveness of the proposed PL-CRNN model and indicate that it yields consistently better performance than the PL-DNN and PL-LSTM algorithms, and that it achieves results close to or even better than those of the CRNN in terms of objective measurements. Compared with PL-DNN, PL-LSTM, and CRNN, the proposed PL-CRNN algorithm can reduce the number of parameters by up to 93%, 97%, and 92%, respectively.

Updated: 2020-01-08
• arXiv.cs.SD Pub Date : 2019-12-29
Thomas Drugman; Thierry Dutoit

The modeling of speech production often relies on a source-filter approach. Although methods parameterizing the filter have nowadays reached a certain maturity, there is still a lot to be gained for several speech processing applications in finding an appropriate excitation model. This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency. Both components are extracted from an analysis performed on a speaker-dependent dataset of pitch-synchronous residual frames. The deterministic part models the low-frequency contents and arises from an orthonormal decomposition of these frames. As for the stochastic component, it is a high-frequency noise modulated both in time and frequency. Some interesting phonetic and computational properties of the DSM are also highlighted. The applicability of the DSM in two fields of speech processing is then studied. First, it is shown that incorporating the DSM vocoder in HMM-based speech synthesis enhances the delivered quality. The proposed approach turns out to significantly outperform the traditional pulse excitation and provides a quality equivalent to STRAIGHT. In a second application, the potential of glottal signatures derived from the proposed DSM is investigated for speaker identification purpose. Interestingly, these signatures are shown to lead to better recognition rates than other glottal-based methods.

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2020-01-06
Yeongtae Hwang; Hyemin Cho; Hongsun Yang; Insoo Oh; Seong-Whan Lee

When training a sequence-to-sequence voice conversion model, we need to handle the issue of an insufficient number of speech tuples that consist of the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on the sequence-to-sequence voice conversion model. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we propose new policies for more data variations. To find the optimal hyperparameters of augmentation policies for voice conversion, we ran experiments based on a new metric, namely the deformation per deteriorating ratio. We observed the effect of these policies through experiments based on various training set sizes and combinations of augmentation policies. In the experimental results, the policies based on time-axis warping showed better performance than the other policies.
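
To make the idea of Mel-spectrogram augmentation more concrete, here is a minimal sketch of two SpecAugment-style policies, frequency masking and time masking, applied to a toy spectrogram stored as a list of rows. It is an illustration under stated assumptions rather than the authors' implementation: the function name, the zero fill value and the mask widths are hypothetical, and the time-warping policy and the new policies proposed in the paper are not shown.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Apply one frequency mask and one time mask to a copy of `spec`.

    `spec` is a list of mel-channel rows, each a list of frame values.
    Masked cells are set to 0.0 (an illustrative fill value).
    """
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: zero out f consecutive mel channels starting at f0.
    f = rng.randint(0, min(max_freq_width, n_mels))
    f0 = rng.randint(0, n_mels - f)
    for m in range(f0, f0 + f):
        out[m] = [0.0] * n_frames

    # Time mask: zero out t consecutive frames starting at t0.
    t = rng.randint(0, min(max_time_width, n_frames))
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        for k in range(t0, t0 + t):
            row[k] = 0.0
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[1.0] * 10 for _ in range(8)]  # toy 8-mel x 10-frame input
    masked = mask_spectrogram(spec, max_freq_width=3, max_time_width=4, rng=rng)
    for row in masked:
        print(" ".join(f"{v:.0f}" for v in row))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which can help when comparing policies across training set sizes.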

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2020-01-06
Cheng Yu; Ryandhimas E. Zezario; Jonathan Sherman; Yi-Yen Hsieh; Xugang Lu; Hsin-Min Wang; Yu Tsao

Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy conditions during testing, i.e., the performance is generally sub-optimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noise cannot optimally remove a specific noise type even though the noise type has been involved in the training data. These problems are common in real applications. In this paper, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. In the DAEME model, two stages are involved: offline and online. In the offline stage, we build multiple component models to form a multi-branched encoder based on a dynamically-sized decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the online stage, noisy speech is first processed by the tree and fed to each component model. The multiple outputs from these models are then integrated into the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics and the quality of subjective human listening tests.

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2020-01-02
Zhiyun Fan; Jie Li; Shiyu Zhou; Bo Xu

Recently, end-to-end (E2E) models have become a competitive alternative to conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch between training and testing conditions. In this paper, we use Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each time step, the encoder output attends to the i-vectors in the block and generates a weighted combined speaker embedding vector, which helps the model to normalize the speaker variations. The SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5% CER reduction (CERR) over the speaker-independent (SI) baseline. Moreover, we demonstrate that SAST still works quite well even if the i-vectors in the SKB all come from a data source other than the acoustic training set.
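
The attention step described above, in which one encoder output attends to the i-vectors in the speaker knowledge block and produces a weighted combination, can be sketched as follows. This is a simplified illustration, not the SAM as published: it assumes plain dot-product scoring and assumes that the encoder output has already been projected to the i-vector dimension. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_embedding(query, knowledge_block):
    """Attend from one encoder frame to every i-vector in the block.

    Assumption: `query` has already been projected to the i-vector
    dimension and plain dot-product scoring is used.
    """
    scores = [sum(q * v for q, v in zip(query, ivec)) for ivec in knowledge_block]
    weights = softmax(scores)
    dim = len(knowledge_block[0])
    # Weighted sum of the i-vectors, one dimension at a time.
    combined = [
        sum(w * ivec[d] for w, ivec in zip(weights, knowledge_block))
        for d in range(dim)
    ]
    return weights, combined


if __name__ == "__main__":
    skb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-dimensional "i-vectors"
    weights, emb = speaker_embedding([2.0, 0.0], skb)
    print("attention weights:", weights)
    print("speaker embedding:", emb)
```

Because the block is static, the same set of i-vectors is reused at every time step; only the attention weights change from frame to frame.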

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2020-01-06
Jianwei Yu; Shi-Xiong Zhang; Jian Wu; Shahram Ghorbani; Bo Wu; Shiyin Kang; Shansong Liu; Xunying Liu; Helen Meng; Dong Yu

Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e., end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state of the art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest that the proposed AVSR system outperformed the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produced recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2019-12-21
Qiuqiang Kong; Yin Cao; Turab Iqbal; Yuxuan Wang; Wenwu Wang; Mark D. Plumbley

Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification and sound event detection. Recently neural networks have been applied to solve audio pattern recognition problems. However, previous systems focus on small datasets, which limits the performance of audio pattern recognition systems. Recently in computer vision and natural language processing, systems pretrained on large datasets have generalized well to several tasks. However, there is limited research on pretraining neural networks on large datasets for audio pattern recognition. In this paper, we propose large-scale pretrained audio neural networks (PANNs) trained on AudioSet. We propose to use Wavegram, a feature learned from waveform, and the mel spectrogram as input. We investigate the performance and complexity of a variety of convolutional neural networks. Our proposed AudioSet tagging system achieves a state-of-the-art mean average precision (mAP) of 0.439, outperforming the best previous system of 0.392. We transferred a PANN to six audio pattern recognition tasks and achieve state-of-the-art performance in many tasks. Source code and pretrained models have been released.

Updated: 2020-01-07
• arXiv.cs.SD Pub Date : 2019-12-31
Sucheta Ghosh

Conversational discourse coherence depends on both linguistic and paralinguistic phenomena. In this work we combine both paralinguistic and linguistic knowledge into a hybrid framework through a multi-level hierarchy, which outputs the discourse-level topic structures. Laughter occurrences in the multiparty meeting transcripts of the ICSI database are used as paralinguistic information. A clustering-based algorithm is proposed that chooses the best topic-segment cluster from two independent, optimized clusterings, namely hierarchical agglomerative clustering and K-medoids. It is then iteratively hybridized with an existing lexical cohesion based Bayesian topic segmentation framework. The hybrid approach improves on the performance of both stand-alone approaches. This leads to a brief study of the interactions between topic structures and discourse relational structure. This training-free topic structuring approach can be applied to online understanding of spoken dialogs.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Weixin Liang; Zixuan Liu; Can Liu

Training Generative Adversarial Networks (GANs) for a new domain from scratch requires an enormous amount of training data and days of training time. To this end, we propose DAWSON, a Domain Adaptive Few-Shot Generation Framework for GANs based on meta-learning. A major challenge of applying meta-learning to GANs is to obtain gradients for the generator from evaluating it on development sets, due to the likelihood-free nature of GANs. To address this challenge, we propose an alternative GAN training procedure that naturally combines the two-step training procedure of GANs and the two-step training procedure of meta-learning algorithms. DAWSON is a plug-and-play framework that supports a broad family of meta-learning algorithms and various GANs with architectural variants. Based on DAWSON, we also propose MUSIC MATINEE, which is the first few-shot music generation model. Our experiments show that MUSIC MATINEE could quickly adapt to new domains with only tens of songs from the target domains. We also show that DAWSON can learn to generate new digits with only four samples in the MNIST dataset. We release source code implementations of DAWSON in both PyTorch and TensorFlow, generated music samples on two genres and the lightning video.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Kwangyoun Kim; Kyungmin Lee; Dhananjaya Gowda; Junmo Park; Sungsoo Kim; Sichen Jin; Young-Yoon Lee; Jinsu Yeo; Daehyun Kim; Seokyeong Jung; Jungin Lee; Myoungji Han; Chanwoo Kim

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain mainly by using joint training of connectionist temporal classifier (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training and data augmentation methods. In addition, we compressed our models to be more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization to bring the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we could achieve a 36% relative improvement on average in word error rate (WER) for target domains including the general domain.
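
To illustrate what 8-bit quantization of model weights can look like in its simplest form, the sketch below uses symmetric linear quantization with a single shared scale. This is a generic textbook scheme chosen for illustration, not necessarily the scheme used in the paper, and the function names are hypothetical. The low-rank approximation step is not shown.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.8, -1.0, 0.3, 0.0]  # toy weight vector
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    for w, q, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> {q:+4d} -> {r:+.4f}")
```

Storing each weight in 8 bits instead of a 32-bit float uses a quarter of the bits per weight, at the cost of a rounding error of at most half a quantization step in this scheme.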

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Thierry Dutoit

This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system. The Deterministic plus Stochastic Model of the residual signal we proposed in a previous work is compared to TDPSOLA, HNM and STRAIGHT. The four methods are compared through an important subjective test. The influence of the speaker gender and of the pitch modification ratio is analyzed. Despite its higher compression level, the DSM technique is shown to give similar or better results than other methods, especially for male speakers and important ratios of modification. The DSM turns out to be only outperformed by STRAIGHT for female voices.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Jerome Urbain; Thierry Dutoit

This paper addresses the issue of cough detection using only audio recordings, with the ultimate goal of quantifying and qualifying the degree of pathology for patients suffering from respiratory diseases, notably mucoviscidosis. A large set of audio features describing various aspects of the audio signal is proposed. These features are assessed in two steps. First, their intrinsic potential and redundancy are evaluated using mutual information-based measures. Secondly, their efficiency is confirmed relying on three classifiers: Artificial Neural Network, Gaussian Mixture Model and Support Vector Machine. The influence of both the feature dimension and the classifier complexity is also investigated.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Geoffrey Wilfart; Thierry Dutoit

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately, the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and allows the voiced excitation to be generated during synthesis. An improvement compared to the traditional excitation is reported while the synthesis engine footprint remains under about 1Mb.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Thierry Dutoit; Baris Bozkurt

This paper investigates the differences occurring in the excitation for different voice qualities. Its goal is two-fold. First, a large corpus containing three voice qualities (modal, soft and loud) uttered by the same speaker is analyzed, and significant differences in characteristics extracted from the excitation are observed. Secondly, rules of modification derived from the analysis are used to build a voice quality transformation system applied as a post-process to HMM-based speech synthesis. The system is shown to effectively achieve the transformations while maintaining the delivered quality.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Thomas Dubuisson; Thierry Dutoit

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. We discuss which characteristics are particularly informative or complementary for detecting voice pathologies.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-03
Liu Li; Feng Gang

Cued Speech (CS) is a communication system developed for deaf people, which exploits hand cues to complement speechreading at the phonetic level. Currently, it is estimated that CS has been adapted to over 60 languages; however, no official CS system is available for Mandarin Chinese. This article proposes a novel and efficient Mandarin Chinese CS system, satisfying the main criterion that the hand coding constitutes a complement to the lip movements. We propose to code vowels [i, u, y] as semiconsonants when they are followed by other Mandarin finals, which reduces the number of Mandarin finals to be coded from 36 to 16. We establish a coherent similarity between Mandarin Chinese and French vowels for the remaining 16 vowels, which allows us to take advantage of the French CS system. Furthermore, by investigating the lip viseme distribution based on a new corpus, an optimal allocation of the 16 Mandarin vowels to different hand positions is obtained. A Gaussian classifier was used to evaluate the average separability of different allocated vowel groups, which gives 92.08%, 92.33%, and 92.73% for the three speakers, respectively. The consonants are mainly designed according to their similarities with the French CS system, as well as some considerations on the special Mandarin consonants. In our system, the tones of Mandarin are coded with head movements.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2019-12-28
Thomas Drugman; Baris Bozkurt; Thierry Dutoit

Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main representative state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse filtering, and mixed-phase decomposition. These techniques are first submitted to an objective assessment test on synthetic speech signals. Their sensitivity to various factors affecting the estimation quality, as well as their robustness to noise are studied. In a second experiment, their ability to label voice quality (tensed, modal, soft) is studied on a large corpus of real connected speech. It is shown that changes of voice quality are reflected by significant modifications in glottal feature distributions. Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals. On the other hand, iterative and adaptive inverse filtering is recommended in noisy environments for its high robustness.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2019-12-28
Thomas Drugman; Thierry Dutoit

This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First, a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Second, within each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement and better noise robustness are reported. In addition, the GOI identification accuracy is promising for glottal source characterization.
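
The two-step structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the window type or length, so the Blackman window spanning 1.75 mean pitch periods, the rule "from each local minimum to the next local maximum" for the intervals, and every function name are hypothetical choices. The Linear Prediction residual is assumed to be computed elsewhere and passed in.

```python
import numpy as np

def mean_based_signal(speech, fs, mean_f0, factor=1.75):
    """Sliding windowed mean of the waveform (window choice is an assumption)."""
    half = int(round(factor * fs / mean_f0)) // 2
    window = np.blackman(2 * half + 1)
    window /= window.sum()
    return np.convolve(speech, window, mode="same")

def candidate_intervals(mbs):
    """Hypothetical interval rule: each local minimum up to the next local maximum."""
    d = np.diff(mbs)
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    intervals = []
    for m in minima:
        later = maxima[maxima > m]
        if later.size:
            intervals.append((int(m), int(later[0])))
    return intervals

def refine(residual, intervals):
    """Step two: pick the strongest residual peak inside each interval."""
    return [start + int(np.argmax(np.abs(residual[start:stop + 1])))
            for start, stop in intervals]
```

Because the mean-based signal oscillates at roughly the local pitch period, the first step yields about one interval per glottal cycle, and the second step only has to search a short span of the residual.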

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2019-12-29
Thomas Drugman; Geoffrey Wilfart; Thierry Dutoit

Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. To alleviate this problem, a more suitable model of the excitation should be adopted. To this end, we propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within an HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significant improvement for both male and female voices. In addition, the proposed model requires little computational load and memory, which is essential for its integration in commercial applications.
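
A rough sketch of the two components described above is given below. It assumes the pitch-synchronous residual frames have already been extracted and resampled to a common length, and it uses a brick-wall FFT mask as the high-pass filter; both are simplifying assumptions, and the function names and the `envelope` argument are hypothetical.

```python
import numpy as np

def pca_basis(frames, n_components):
    """Orthonormal basis (rows) obtained by PCA of equal-length residual frames."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_components]

def deterministic_part(frame, mean, basis):
    """Project a frame on the PCA basis and reconstruct it from the weights."""
    weights = basis @ (frame - mean)
    return mean + basis.T @ weights

def stochastic_part(n, fs, max_voiced_freq, envelope, rng):
    """White noise, high-passed above the maximum voiced frequency, then
    modulated in time by an energy envelope of length n."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[freqs < max_voiced_freq] = 0.0  # brick-wall high-pass (assumption)
    return np.fft.irfft(spectrum, n) * envelope
```

In this sketch, the excitation for one frame would be the sum of `deterministic_part(...)` restricted to the low band and `stochastic_part(...)` in the high band; how the two bands are combined in practice is not specified in the abstract.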

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-03
Li Liu; Gang Feng; Denis Beautemps; Xiao-Ping Zhang

Cued Speech (CS) is an augmented form of lip reading complemented by hand coding, and it is very helpful to deaf people. Automatic CS recognition can help communication between deaf people and others. Due to the asynchronous nature of lip and hand movements, fusing them in automatic CS recognition is a challenging problem. In this work, we propose a novel re-synchronization procedure for multi-modal fusion, which aligns the hand features with the lip features. It is realized by delaying the hand position and hand shape streams by their optimal hand preceding times, which are derived by investigating the temporal organization of hand position and hand shape movements in CS. This re-synchronization procedure is incorporated into a practical continuous CS recognition system that combines a convolutional neural network (CNN) with a multi-stream hidden Markov model (MSHMM). An improvement of about 4.6% is achieved, reaching 76.6% CS phoneme recognition correctness compared with 72.04% for the state-of-the-art architecture, which did not take into account the asynchrony of multi-modal fusion in CS. To our knowledge, this is the first work to tackle asynchronous multi-modal fusion in automatic continuous CS recognition.
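
The core of the re-synchronization idea, delaying the hand streams before fusing them with the lip stream, can be illustrated with a short sketch. It assumes all three streams are frame-aligned arrays of shape (frames, dims) at the same frame rate; the delay values are placeholders supplied by the caller, and padding the start of a delayed stream by repeating its first frame is an assumption, as are the function names.

```python
import numpy as np

def delay_stream(features, delay_frames):
    """Delay a (frames, dims) stream, padding the start with its first frame."""
    delay = max(0, min(int(delay_frames), len(features)))
    if delay == 0:
        return features.copy()
    pad = np.repeat(features[:1], delay, axis=0)
    return np.concatenate([pad, features[:len(features) - delay]], axis=0)

def resynchronize(lips, hand_position, hand_shape, pos_delay, shape_delay):
    """Delay each hand stream by its own preceding time, then stack with lips."""
    return np.concatenate(
        [lips,
         delay_stream(hand_position, pos_delay),
         delay_stream(hand_shape, shape_delay)],
        axis=1,
    )
```

Using separate delays for hand position and hand shape mirrors the abstract's statement that each has its own optimal preceding time; the fused array would then feed whatever recognizer follows.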

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2019-09-26
Chang-Le Liu; Szu-Wei Fu; You-Jin Lee; Yu Tsao; Jen-Wei Huang; Hsin-Min Wang

In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are in waveform format, the waveform-mapping-based SE methods can overcome the distortion caused by imperfect phase estimation, which may be encountered in spectral-mapping-based SE systems. So far, most waveform-mapping-based SE methods have focused on single-channel tasks. In this paper, we propose a novel fully convolutional network (FCN) with Sinc and dilated convolutional layers (termed SDFCN) for multichannel SE that operates in the time domain. We also propose an extended version of SDFCN, called the residual SDFCN (termed rSDFCN). The proposed methods are evaluated on two multichannel SE tasks, namely the dual-channel inner-ear microphones SE task and the distributed microphones SE task. The experimental results confirm the outstanding denoising capability of the proposed SE systems on both tasks and the benefits of using the residual architecture on the overall SE performance.

Updated: 2020-01-06
• arXiv.cs.SD Pub Date : 2020-01-01
Fenglin Ding; Wu Guo; Lirong Dai; Jun Du

Batch normalization (BN) is an effective method to accelerate model training and improve the generalization performance of neural networks. In this paper, we propose an improved batch normalization technique called attentive batch normalization (ABN) in Long Short-Term Memory (LSTM) based acoustic modeling for automatic speech recognition (ASR). In the proposed method, an auxiliary network is used to dynamically generate the scaling and shifting parameters in batch normalization, and attention mechanisms are introduced to improve their regularized performance. Furthermore, two schemes, frame-level and utterance-level ABN, are investigated. We evaluate our proposed methods on Mandarin and Uyghur ASR tasks. The experimental results show that the proposed ABN greatly improves the performance of batch normalization in terms of transcription accuracy for both languages.

Updated: 2020-01-04
• arXiv.cs.SD Pub Date : 2020-01-02
Thomas Drugman; Thomas Dubuisson; Thierry Dutoit

In most current approaches to speech processing, information is extracted from the magnitude spectrum. However, recent perceptual studies have underlined the importance of the phase component. The goal of this paper is to investigate the potential of using phase-based features for automatically detecting voice disorders. It is shown that group delay functions are appropriate for characterizing irregularities in the phonation. In addition, adherence to the mixed-phase model of speech is discussed. The proposed phase-based features are evaluated and compared to other parameters derived from the magnitude spectrum. Both streams are shown to be complementary. Furthermore, phase-based features turn out to convey a great amount of relevant information, leading to high discrimination performance.
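
For readers unfamiliar with group delay functions, the sketch below computes one for a single frame using the standard identity that avoids phase unwrapping: with X the spectrum of x[n] and Y the spectrum of n·x[n], the group delay is (X_R·Y_R + X_I·Y_I) / |X|². This is a generic textbook formulation, not necessarily the exact variant used in the paper; the function name and the small floor on the denominator are assumptions.

```python
import numpy as np

def group_delay(frame, n_fft=None):
    """Group delay (in samples) of one frame at each rfft frequency bin."""
    n_fft = n_fft or len(frame)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    power = np.abs(X) ** 2
    # Floor the denominator to avoid division by zero at spectral nulls.
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(power, 1e-12)
```

As a sanity check, a unit impulse placed at sample k has a flat group delay equal to k at every frequency.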

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
