• arXiv.cs.SD Pub Date : 2020-04-07
Youngmoon Jung; Seongmin Kye; Yeunju Choi; Myunghun Jung; Hoirin Kim

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, convolutional neural networks are mainly used as frame-level feature extractors, and speaker embeddings are extracted from the last layer of the feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor
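The MSA idea the abstract describes can be sketched as follows; the layer dimensions and the choice of average pooling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def multi_scale_aggregate(layer_features):
    # Average-pool each layer's frame-level features over time, then
    # concatenate the pooled vectors into one multi-scale embedding.
    pooled = [feats.mean(axis=0) for feats in layer_features]
    return np.concatenate(pooled)

# Hypothetical feature maps from three layers of a CNN extractor,
# each of shape (frames, dims); 64/128/256 are made-up sizes.
layers = [np.random.randn(100, 64),
          np.random.randn(50, 128),
          np.random.randn(25, 256)]
embedding = multi_scale_aggregate(layers)
```

The resulting embedding has dimension 64 + 128 + 256 = 448, one slice per feature-extractor layer.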

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-07
Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

Direct speech-to-image translation without text is an interesting and useful topic due to its potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-07
Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given that neural networks are vulnerable to adversarial examples, which are intentionally perturbed to remain almost imperceptible to humans. In this paper, we demonstrate the existence of universal adversarial perturbations (UAPs)

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-07
Jiguo Li; Xinfeng Zhang; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

Due to the widespread deployment of fingerprint/face/speaker recognition systems, attacking deep learning based biometric systems has drawn more and more attention. Previous research mainly studied attacks on vision-based systems, such as fingerprint and face recognition, while attacks on speaker recognition have not yet been investigated, although it is widely used in our daily life

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-07
Yi Zheng; Xianjie Yang; Xuyong Dang

A new label smoothing method for automatic speech recognition (ASR) is proposed in this paper, making use of human-level prior knowledge of a language: homophones. Compared with its forerunners, the proposed method uses pronunciation knowledge of homophones in a more complex way. End-to-end ASR models that learn the acoustic model and language model jointly and use characters as modelling units are necessary
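A minimal sketch of the underlying idea, assuming (this is my reading, not the paper's exact scheme) that the smoothing mass is spread over the target character's homophones instead of uniformly over the whole vocabulary:

```python
import numpy as np

def homophone_label_smoothing(target_idx, vocab_size, homophones, eps=0.1):
    # Place 1 - eps on the target character and distribute eps over the
    # characters that share its pronunciation; fall back to one-hot when
    # the target has no homophones. `homophones` maps index -> peer indices.
    dist = np.zeros(vocab_size)
    dist[target_idx] = 1.0 - eps
    peers = [i for i in homophones.get(target_idx, []) if i != target_idx]
    if peers:
        dist[peers] = eps / len(peers)
    else:
        dist[target_idx] = 1.0
    return dist
```

Compared with uniform label smoothing, the soft target only rewards confusions that are phonetically plausible.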

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-07
Robert Rehr; Timo Gerkmann

This paper analyzes the generalization of speech enhancement algorithms based on deep neural networks (DNNs) with respect to (1) the chosen features, (2) the size and diversity of the training data, and (3) different network architectures. To address (1), we compare three input features, namely logarithmized noisy periodograms, noise aware training (NAT) and signal-to-noise ratio (SNR) based noise
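The first of the compared input features, the logarithmized periodogram, can be computed per frame as below; the Hann window and frame length are conventional choices, not specifics from the paper.

```python
import numpy as np

def log_periodogram(frame, eps=1e-10):
    # Window the frame, take the one-sided FFT, and logarithmize the
    # squared magnitude; eps guards against log(0).
    windowed = frame * np.hanning(len(frame))
    return np.log(np.abs(np.fft.rfft(windowed)) ** 2 + eps)
```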

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-02-02
Rui Liu; Berrak Sisman; Feilong Bao; Guanglai Gao; Haizhou Li

Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated only

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-02-01
Kun Zhou; Berrak Sisman; Haizhou Li

Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0
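The "simple linear transform" of F0 that the abstract criticizes is commonly the Gaussian-normalized transform in the log-F0 domain; a sketch of that baseline (the function name and unvoiced-frame convention are my assumptions):

```python
import numpy as np

def convert_f0_linear(f0, mu_src, std_src, mu_tgt, std_tgt):
    # Mean-variance normalization of log-F0 from source to target
    # statistics; unvoiced frames (f0 == 0) pass through unchanged.
    out = np.zeros_like(f0, dtype=float)
    voiced = f0 > 0
    logf0 = np.log(f0[voiced])
    out[voiced] = np.exp((logf0 - mu_src) / std_src * std_tgt + mu_tgt)
    return out
```

A source frame at the source mean pitch maps exactly to the target mean pitch, which is all this baseline can guarantee; it cannot reshape F0 contours.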

Updated: 2020-04-08
• arXiv.cs.SD Pub Date : 2020-04-02
Bharat Padi; Anand Mohan; Sriram Ganapathy

The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and for speaker recognition) ignore the sequence information by

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2020-04-02
Ali Imran; Iryna Posokhova; Haneya N. Qureshi; Usama Masood; Sajid Riaz; Kamran Ali; Charles N. John; Muhammad Nabeel

The inability to test at scale has become humanity's Achilles' heel in the ongoing war against the COVID-19 pandemic. An agile, scalable and cost-effective testing approach, deployable at a global scale, can act as a game changer in this war. To address this challenge, building on the promising results of our prior work on cough-based diagnosis of a motley of respiratory diseases, we develop an Artificial Intelligence

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2020-04-01
Charles Bales; Charles John; Hasan Farooq; Usama Masood; Muhammad Nabeel; Ali Imran

5G is bringing new use cases to the forefront, one of the most prominent being machine learning empowered health care. Since respiratory infections are one of the notable modern medical concerns and coughs being a common symptom of this, a system for recognizing and diagnosing infections based on raw cough data would have a multitude of beneficial research and medical applications. In the literature

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2020-04-01
Nao Tokui

There has been significant progress in music generation techniques utilizing deep learning. However, it is still hard for musicians and artists to use these techniques in their daily music-making practice. This paper proposes a rhythm generation system based on a Variational Autoencoder (VAE) (Kingma, 2014), in which musicians can train a deep learning model only by selecting target MIDI files, then

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2020-04-02
Tharindu Fernando; Sridha Sridharan; Mitchell McLaren; Darshana Priyasad; Simon Denman; Clinton Fookes

This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2019-08-05
Lea Schönherr; Thorsten Eisenhofer; Steffen Zeiler; Thorsten Holz; Dorothea Kolossa

Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2019-09-06
Haiyang Xu; Hui Zhang; Kun Han; Yun Wang; Yiping Peng; Xiangang Li

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, emotion recognition will be beneficial

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2019-09-14
Chunxi Liu; Qiaochu Zhang; Xiaohui Zhang; Kritika Singh; Yatharth Saraf; Geoffrey Zweig

Towards developing high-performing ASR for low-resource languages, approaches to address the lack of resources are to make use of data from multiple languages, and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with lattice-free MMI

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2019-11-27
Jianwei Tai; Xiaoqi Jia; Qingjia Huang; Weijuan Zhang; Shengzhi Zhang

With the pervasiveness of voice control on smart devices, speaker verification is widely used as the preferred identity authentication mechanism due to its convenience. However, the task of "in-the-wild" speaker verification is challenging, considering the speech samples may contain lots of identity-unrelated information, e.g., background noise, reverberation, emotion, etc. Previous works focus on

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2019-11-25
António Ramires; Pritish Chandna; Xavier Favory; Emilia Gómez; Xavier Serra

We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input parameters

Updated: 2020-04-06
• arXiv.cs.SD Pub Date : 2020-04-02
Ali Aroudi; Tobias de Taillez; Simon Doclo

Identifying the target speaker in hearing aid applications is crucial to improve speech understanding. Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker from single-trial EEG recordings using auditory attention decoding (AAD) methods. AAD methods reconstruct the attended speech envelope from EEG recordings, based on a linear least-squares

Updated: 2020-04-03
• arXiv.cs.SD Pub Date : 2020-04-02
Haoyu Li; Szu-Wei Fu; Yu Tsao; Junichi Yamagishi

The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, with the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modifications. Specifically, we utilize an iMetricGAN approach
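The RMS-level constraint described above amounts to rescaling the modified signal back to the reference level; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def match_rms(modified, reference):
    # Rescale the modified signal so that its RMS level equals that of
    # the reference, preserving energy before and after modification.
    rms_ref = np.sqrt(np.mean(reference ** 2))
    rms_mod = np.sqrt(np.mean(modified ** 2))
    return modified * (rms_ref / rms_mod)
```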

Updated: 2020-04-03
• arXiv.cs.SD Pub Date : 2020-04-02
Wei Zhou; Wilfried Michel; Kazuki Irie; Markus Kitza; Ralf Schlüter; Hermann Ney

We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model
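The masking investigated above follows SpecAugment's recipe; a minimal sketch of the time- and frequency-masking step (mask counts and widths here are illustrative defaults, not the paper's settings):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, F=10, num_time_masks=2, T=20, rng=None):
    # Zero out random frequency bands and time spans of a
    # (frames, freq_bins) log-mel spectrogram, SpecAugment-style.
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    frames, bins = out.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)          # mask width in bins
        f0 = rng.integers(0, max(1, bins - f))
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)          # mask width in frames
        t0 = rng.integers(0, max(1, frames - t))
        out[t0:t0 + t, :] = 0.0
    return out
```

Varying `F`, `T`, and the mask counts is exactly the kind of masking sweep the abstract refers to.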

Updated: 2020-04-03
• arXiv.cs.SD Pub Date : 2020-04-02
Wei Zhou; Ralf Schlüter; Hermann Ney

In hybrid HMM based speech recognition, LSTM language models have been widely applied and achieved large improvements. The theoretical capability of modeling unlimited context suggests that no recombination should be applied in decoding. This motivates reconsidering full summation over the HMM-state sequences instead of the Viterbi approximation in decoding. We explore the potential gain from more

Updated: 2020-04-03
• arXiv.cs.SD Pub Date : 2020-04-02
Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

Updated: 2020-04-03
• arXiv.cs.SD Pub Date : 2020-03-31
Chao-Han Huck Yang; Jun Qi; Pin-Yu Chen; Xiaoli Ma; Chin-Hui Lee

Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance by the augmented adversarial

Updated: 2020-04-01
• arXiv.cs.SD Pub Date : 2019-11-28
Triantafyllos Afouras; Joon Son Chung; Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss

Updated: 2020-04-01
• arXiv.cs.SD Pub Date : 2020-03-28
Naoyuki Kanda; Yashesh Gaur; Xiaofei Wang; Zhong Meng; Takuya Yoshioka

This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers as with the permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and
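The target-construction step of SOT can be sketched as follows, assuming (as the framework suggests) that the overlapping speakers' transcriptions are ordered by start time and joined with a speaker-change token; the token spelling `<sc>` and dict layout are my assumptions.

```python
def serialize_targets(utterances, sc_token="<sc>"):
    # Sort each speaker's transcription by its start time and join them
    # with a speaker-change token, producing the single label sequence
    # that the one-output-layer model is trained to emit.
    ordered = sorted(utterances, key=lambda u: u["start"])
    return f" {sc_token} ".join(u["text"] for u in ordered)
```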

Updated: 2020-03-31
• arXiv.cs.SD Pub Date : 2020-03-28
Tara N. Sainath; Yanzhang He; Bo Li; Arun Narayanan; Ruoming Pang; Antoine Bruguier; Shuo-yiin Chang; Wei Li; Raziel Alvarez; Zhifeng Chen; Chung-Cheng Chiu; David Garcia; Alex Gruenstein; Ke Hu; Minho Jin; Anjuli Kannan; Qiao Liang; Ian McGraw; Cal Peyser; Rohit Prabhavalkar; Golan Pundak; David Rybach; Yuan Shangguan; Yash Sheth; Trevor Strohman; Mirko Visontai; Yonghui Wu; Yu Zhang; Ding Zhao

Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses

Updated: 2020-03-31
• arXiv.cs.SD Pub Date : 2020-03-30
Shivam Agarwal; Siddarth Venkatraman

Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside

Updated: 2020-03-31
• arXiv.cs.SD Pub Date : 2020-03-29
Andong Li; Chengshi Zheng; Cunhang Fan; Renhua Peng; Xiaodong Li

A person tends to generate dynamic attention towards speech under complicated environments. Based on this phenomenon, we propose a framework combining dynamic attention and recursive learning together for monaural speech enhancement. Apart from a major noise reduction network, we design a separated sub-network, which adaptively generates the attention distribution to control the information flow throughout

Updated: 2020-03-31
• arXiv.cs.SD Pub Date : 2019-12-10
Ruohan Gao; Tae-Hyun Oh; Kristen Grauman; Lorenzo Torresani

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities---a

Updated: 2020-03-31
• arXiv.cs.SD Pub Date : 2020-03-26
Arian Azarang; Nasser Kehtarnavaz

This paper presents a review of multi-objective deep learning methods that have been introduced in the literature for speech denoising. After stating an overview of conventional, single objective deep learning, and hybrid or combined conventional and deep learning methods, a review of the mathematical framework of the multi-objective deep learning methods for speech denoising is provided. A representative

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-26
Eunjeong Koh; Fatemeh Saki; Yinyi Guo; Cheng-Yu Hung; Erik Visser

This paper presents a new learning strategy for the Sound Event Detection (SED) system to tackle the issues of i) knowledge migration from a pre-trained model to a new target model and ii) learning new sound events without forgetting the previously learned ones without re-training from scratch. In order to migrate the previously learned knowledge from the source model to the target one, a neural adapter

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-27
Heinrich Dinkel; Yefei Chen; Mengyue Wu; Kai Yu

Traditional voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck for such supervised VAD training is its requirement for clean training data and frame-level labels. In contrast, we propose the GPVAD framework, which can be easily trained from noisy data in a weakly supervised fashion

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-27
Yi Luo; Nima Mesgarani

Many recent source separation systems are designed to separate a fixed number of sources out of a mixture. In the cases where the source activation patterns are unknown, such systems have to either adjust the number of outputs or identify invalid outputs from the valid ones. Iterative separation methods have gained much attention in the community as they can flexibly decide the number of outputs,

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-27
Michael A Lepori; Chaz Firestone

The rise of sophisticated machine-recognition systems has brought with it a rise in comparisons between human and machine perception. But such comparisons face an asymmetry: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or embedded in unconscious mental processes that may not be available

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-22
Sebastian Baunsgaard; Sebastian B. Wrede; Pınar Tozun

Automatic Speech Recognition (ASR) has increased in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in such assistants, in turn, has amplified the novel developments in ASR research. However, despite

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-27
Akhil Mathur; Anton Isopoussu; Fahim Kawsar; Nadia Berthouze; Nicholas D. Lane

Mobile and embedded devices are increasingly using microphones and audio-based computational models to infer user context. A major challenge in building systems that combine audio models with commodity microphones is to guarantee their accuracy and robustness in the real-world. Besides many environmental dynamics, a primary factor that impacts the robustness of audio models is microphone variability

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2019-10-14
Yi Luo; Zhuo Chen; Takuya Yoshioka

Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, the time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2019-10-30
Yi Luo; Zhuo Chen; Nima Mesgarani; Takuya Yoshioka

An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2019-11-01
Emna Rejaibi; Daoud Kadoch; Kamil Bentounes; Romain Alfred; Mohamed Daoudi; Abdenour Hadid; Alice Othmani

Automatic emotion recognition and Major Depressive Disorder (MDD) diagnosis are inherently challenging problems in health informatics applications. According to the World Health Organization, 300 million people were affected by depression in 2017, with only a third of them correctly identified. MDD is a persistent mood disorder in which the patient constantly feels negative emotions (low valence) and lacks excitement

Updated: 2020-03-30
• arXiv.cs.SD Pub Date : 2020-03-26
Yi-Chiao Wu; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Hayashi; Tomoki Toda

In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features

Updated: 2020-03-28
• arXiv.cs.SD Pub Date : 2020-03-14
Abhilash Jain

Transformers have recently taken center stage in language modeling, after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of the Transformer architectures BERT and Transformer-XL for the language modeling task. We use a sub-word model setting with the Finnish language and compare it to the previous state-of-the-art (SOTA) LSTM

Updated: 2020-03-28
• arXiv.cs.SD Pub Date : 2020-03-26
Wissam A. Jassim; Jan Skoglund; Michael Chinen; Andrew Hines

This study compares the performance of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. The coded speech is evaluated for different quality aspects: accuracy of pitch period estimation, word error rates for

Updated: 2020-03-28
• arXiv.cs.SD Pub Date : 2020-03-26
Joon Son Chung; Jaesung Huh; Seongkyu Mun; Minjae Lee; Hee Soo Heo; Soyeon Choe; Chiheon Ham; Sunghwan Jung; Bong-Jin Lee; Icksang Han

The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-class (same speaker) and large inter-class (different speakers) distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning
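Open-set verification with such embeddings typically scores an enrollment/test pair by cosine similarity and thresholds it; a minimal sketch (names and the threshold step are generic conventions, not specifics from the paper):

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    # Cosine similarity between two utterance-level embeddings; a trial
    # is accepted when the score exceeds a threshold tuned on dev data.
    num = float(np.dot(emb_enroll, emb_test))
    return num / (np.linalg.norm(emb_enroll) * np.linalg.norm(emb_test))
```

Small intra-class and large inter-class distances translate directly into same-speaker scores near 1 and different-speaker scores near 0.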

Updated: 2020-03-28
• arXiv.cs.SD Pub Date : 2020-03-24
Björn W. Schuller; Dagmar M. Schuller; Kun Qian; Juan Liu; Huaiyuan Zheng; Xiao Li

At the time of writing, the world has suffered more than 10,000 registered deaths from the COVID-19 epidemic since the outbreak, more than three months ago, of the coronavirus now officially known as SARS-CoV-2. Since then, tremendous efforts have been made worldwide to counter-steer and control the epidemic, by now labelled a pandemic. In this contribution, we provide an overview

Updated: 2020-03-26
• arXiv.cs.SD Pub Date : 2020-03-22
Andong Li; Chengshi Zheng; Linjuan Cheng; Renhua Peng; Xiaodong Li

In this paper, we propose a type of neural network with recursive learning in the time domain called RTNet for monaural speech enhancement, where the proposed network consists of three principal components. The first part is called stage recurrent neural network, which is proposed to effectively aggregate the deep feature dependencies across different stages with a memory mechanism and also remove

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2020-03-21
Davide Rocchesso; Maria Mannone

Concepts and formalism from acoustics are often used to exemplify quantum mechanics. Conversely, quantum mechanics could be used to achieve a new perspective on acoustics, as shown by Gabor's studies. Here, we focus in particular on the study of the human voice, considered as a probe to investigate the world of sounds. We present a theoretical framework that is based on observables of vocal production, and

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2020-03-23
Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated for estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2018-12-17

In this paper, a novel approach is proposed for the recognition of Persian phonemes in the Persian Consonant-Vowel Combination (PCVC) speech dataset. Nowadays, deep neural networks play a crucial role in classification tasks. However, the best results in speech recognition are not yet on par with human recognition rates. Deep learning techniques show outstanding performance over many other classification

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2019-07-01
Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of the WaveNet (WN) vocoder. The effectiveness of the WN vocoder in generating high-fidelity speech samples from given acoustic features has recently been proven. However, because of the fixed dilated convolution
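The core contrast with fixed dilation can be sketched as follows, under the assumption (mine, not stated in the abstract) that the pitch-dependent dilation covers roughly one pitch period divided by a dense factor; all names and defaults here are illustrative.

```python
def pitch_dependent_dilation(f0_hz, sample_rate=24000, dense_factor=4):
    # Hypothetical PDCNN-style rule: the dilation size adapts to the
    # instantaneous pitch, spanning one pitch period in samples divided
    # by a dense factor, with a floor of 1 sample.
    return max(1, round(sample_rate / (f0_hz * dense_factor)))
```

A fixed dilated convolution would use one constant here regardless of pitch, which is exactly what limits the pitch controllability of the plain WN vocoder.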

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2019-07-13
Siddique Latif; Rajib Rana; Sara Khalifa; Raja Jurdak; Julien Epps; Björn W. Schuller

Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2019-07-21
Yi-Chiao Wu; Patrick Lumban Tobing; Tomoki Hayashi; Kazuhiro Kobayashi; Tomoki Toda

In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated

Updated: 2020-03-24
• arXiv.cs.SD Pub Date : 2019-06-24
Charles C. Onu; Jonathan Lebensold; William L. Hamilton; Doina Precup

Despite continuing medical advances, the rate of newborn morbidity and mortality globally remains high, with over 6 million casualties every year. The prediction of pathologies affecting newborns based on their cry is thus of significant clinical interest, as it would facilitate the development of accessible, low-cost diagnostic tools. However, the inadequacy

Updated: 2020-03-20
• arXiv.cs.SD Pub Date : 2019-07-05
Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about

Updated: 2020-03-20
• arXiv.cs.SD Pub Date : 2020-03-17
Ke Hu; Tara N. Sainath; Ruoming Pang; Rohit Prabhavalkar

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as

Updated: 2020-03-19
• arXiv.cs.SD Pub Date : 2020-03-18
A. Fahim; P. N. Samarasinghe; T. D. Abhayapala

We propose a novel multi-source direction of arrival (DOA) estimation technique using a convolutional neural network algorithm which learns the modal coherence patterns of an incident soundfield through measured spherical harmonic coefficients. We train our model for individual time-frequency bins in the short-time Fourier transform spectrum by analyzing the unique snapshot of modal coherence for each

Updated: 2020-03-19
• arXiv.cs.SD Pub Date : 2020-03-18
Yuan Gong; Jian Yang; Christian Poellabauer

With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks. Previous efforts to address this concern have focused primarily on single-channel audio. In this paper, we introduce a novel neural network-based replay attack detection model that further leverages

Updated: 2020-03-19
• arXiv.cs.SD Pub Date : 2020-03-16
Pablo Alonso-Jiménez; Dmitry Bogdanov; Jordi Pons; Xavier Serra

Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference. To show the potential of this new interface with TensorFlow, we provide a number of pre-trained

Updated: 2020-03-18
• arXiv.cs.SD Pub Date : 2020-03-17
Jinyu Li; Rui Zhao; Eric Sun; Jeremy H. M. Wong; Amit Das; Zhong Meng; Yifan Gong

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM

Updated: 2020-03-18
• arXiv.cs.SD Pub Date : 2020-03-17
Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference

Updated: 2020-03-18
Contents have been reproduced by permission of the publishers.
