Current journal: arXiv - CS - Sound
  • Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification
    arXiv.cs.SD Pub Date : 2020-04-07
    Youngmoon Jung; Seongmin Kye; Yeunju Choi; Myunghun Jung; Hoirin Kim

    Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, convolutional neural networks are mainly used as a frame-level feature extractor, and speaker embeddings are extracted from the last layer of the feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor

  • Direct Speech-to-image Translation
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

  • Universal Adversarial Perturbations Generative Network for Speaker Recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given that neural networks are vulnerable to adversarial examples, which have been intentionally perturbed to remain almost imperceptible to humans. In this paper, we demonstrate the existence of universal adversarial perturbations (UAPs)

  • Learning to fool the speaker recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Due to the widespread deployment of fingerprint/face/speaker recognition systems, attacking deep learning based biometric systems has drawn more and more attention. Previous research mainly studied attacks on vision-based systems, such as fingerprint and face recognition, while attacks on speaker recognition have not yet been investigated, although such systems are widely used in our daily life

  • Homophone-based Label Smoothing in End-to-End Automatic Speech Recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Yi Zheng; Xianjie Yang; Xuyong Dang

    This paper proposes a new label smoothing method for automatic speech recognition (ASR) that makes use of human-level prior knowledge of a language, namely homophones. Compared with its forerunners, the proposed method uses the pronunciation knowledge of homophones in a more complex way. End-to-end ASR models that learn the acoustic model and language model jointly and use characters as modelling units are necessary
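    As a rough illustration of the idea (not necessarily the paper's exact scheme), homophone-aware label smoothing can concentrate the smoothing mass on characters that share the target's pronunciation instead of spreading it uniformly over the vocabulary; the `eps` value and homophone table below are assumptions for illustration:

```python
def homophone_smoothing(target, vocab, homophones, eps=0.1):
    """Sketch of homophone-aware label smoothing: put probability 1-eps
    on the target character and distribute eps over its homophones.
    Falls back to uniform smoothing when no homophones are known.
    (eps and the homophone table are illustrative assumptions.)"""
    dist = {c: 0.0 for c in vocab}
    dist[target] = 1.0 - eps
    same_sound = homophones.get(target, [])
    if same_sound:
        for c in same_sound:
            dist[c] += eps / len(same_sound)
    else:
        for c in vocab:
            dist[c] += eps / len(vocab)
    return dist
```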

  • SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-04-07
    Robert Rehr; Timo Gerkmann

    This paper analyzes the generalization of speech enhancement algorithms based on deep neural networks (DNNs) with respect to (1) the chosen features, (2) the size and diversity of the training data, and (3) different network architectures. To address (1), we compare three input features, namely logarithmized noisy periodograms, noise aware training (NAT) and signal-to-noise ratio (SNR) based noise
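    Of the compared inputs, the logarithmized noisy periodogram is the most straightforward to compute: the framewise log squared-magnitude STFT of the noisy signal. A minimal sketch, where the window choice and frame sizes are illustrative assumptions:

```python
import numpy as np

def log_periodogram_feats(noisy, frame_len=256, hop=128):
    """Logarithmized noisy periodogram features: log |STFT|^2 per frame.
    Hann window and frame/hop sizes are assumptions for illustration."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = noisy[i * hop:i * hop + frame_len] * win
        spec = np.fft.rfft(frame)
        # small floor avoids log(0) on silent frames
        feats.append(np.log(np.abs(spec) ** 2 + 1e-12))
    return np.array(feats)  # (n_frames, frame_len // 2 + 1)
```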

  • WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
    arXiv.cs.SD Pub Date : 2020-02-02
    Rui Liu; Berrak Sisman; Feilong Bao; Guanglai Gao; Haizhou Li

    Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated only

  • Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
    arXiv.cs.SD Pub Date : 2020-02-01
    Kun Zhou; Berrak Sisman; Haizhou Li

    Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0
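    The "simple linear transform" of F0 referred to here is commonly a Gaussian-normalized shift and scale in the log-F0 domain; a sketch of that baseline, assuming the log-F0 statistics have been precomputed from source- and target-emotion training data:

```python
import numpy as np

def convert_f0_linear(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Baseline log-Gaussian normalized F0 transform: map the source
    log-F0 statistics (mu_src, sigma_src) onto the target's
    (mu_tgt, sigma_tgt). Unvoiced frames (F0 == 0) are left at zero.
    Statistics are assumed to be precomputed offline."""
    f0_out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    logf0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((logf0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_out
```

    This transform only shifts and scales the F0 contour globally, which is exactly why it cannot capture richer emotional prosody patterns.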

  • Towards Relevance and Sequence Modeling in Language Recognition
    arXiv.cs.SD Pub Date : 2020-04-02
    Bharat Padi; Anand Mohan; Sriram Ganapathy

    The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and for speaker recognition) ignore the sequence information by

  • AI4COVID-19: AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App
    arXiv.cs.SD Pub Date : 2020-04-02
    Ali Imran; Iryna Posokhova; Haneya N. Qureshi; Usama Masood; Sajid Riaz; Kamran Ali; Charles N. John; Muhammad Nabeel

    The inability to test at scale has become the Achilles' heel of humanity's ongoing war against the COVID-19 pandemic. An agile, scalable and cost-effective testing approach, deployable at a global scale, can act as a game changer in this war. To address this challenge, building on the promising results of our prior work on cough-based diagnosis of a variety of respiratory diseases, we develop an Artificial Intelligence

  • Can Machine Learning Be Used to Recognize and Diagnose Coughs?
    arXiv.cs.SD Pub Date : 2020-04-01
    Charles Bales; Charles John; Hasan Farooq; Usama Masood; Muhammad Nabeel; Ali Imran

    5G is bringing new use cases to the forefront, one of the most prominent being machine learning empowered health care. Since respiratory infections are a notable modern medical concern and coughs are a common symptom of them, a system for recognizing and diagnosing infections based on raw cough data would have a multitude of beneficial research and medical applications. In the literature

  • Towards democratizing music production with AI: Design of Variational Autoencoder-based Rhythm Generator as a DAW plugin
    arXiv.cs.SD Pub Date : 2020-04-01
    Nao Tokui

    There has been significant progress in music generation techniques utilizing deep learning. However, it is still hard for musicians and artists to use these techniques in their daily music-making practice. This paper proposes a Variational Autoencoder (VAE)-based rhythm generation system, in which musicians can train a deep learning model only by selecting target MIDI files, then

  • Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection
    arXiv.cs.SD Pub Date : 2020-04-02
    Tharindu Fernando; Sridha Sridharan; Mitchell McLaren; Darshana Priyasad; Simon Denman; Clinton Fookes

    This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next

  • Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems
    arXiv.cs.SD Pub Date : 2019-08-05
    Lea Schönherr; Thorsten Eisenhofer; Steffen Zeiler; Thorsten Holz; Dorothea Kolossa

    Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one

  • Learning Alignment for Multimodal Emotion Recognition from Speech
    arXiv.cs.SD Pub Date : 2019-09-06
    Haiyang Xu; Hui Zhang; Kun Han; Yun Wang; Yiping Peng; Xiangang Li

    Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals, or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, emotion recognition will be beneficial

  • Multilingual Graphemic Hybrid ASR with Massive Data Augmentation
    arXiv.cs.SD Pub Date : 2019-09-14
    Chunxi Liu; Qiaochu Zhang; Xiaohui Zhang; Kritika Singh; Yatharth Saraf; Geoffrey Zweig

    Towards developing high-performing ASR for low-resource languages, common approaches to address the lack of resources are to make use of data from multiple languages and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with lattice-free MMI

  • SEF-ALDR: A Speaker Embedding Framework via Adversarial Learning based Disentangled Representation
    arXiv.cs.SD Pub Date : 2019-11-27
    Jianwei Tai; Xiaoqi Jia; Qingjia Huang; Weijuan Zhang; Shengzhi Zhang

    With the pervasiveness of voice control on smart devices, speaker verification is widely used as the preferred identity authentication mechanism due to its convenience. However, the task of "in-the-wild" speaker verification is challenging, considering the speech samples may contain lots of identity-unrelated information, e.g., background noise, reverberation, emotion, etc. Previous works focus on

  • Neural Percussive Synthesis Parameterised by High-Level Timbral Features
    arXiv.cs.SD Pub Date : 2019-11-25
    António Ramires; Pritish Chandna; Xavier Favory; Emilia Gómez; Xavier Serra

    We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input parameters

  • Improving auditory attention decoding performance of linear and non-linear methods using state-space model
    arXiv.cs.SD Pub Date : 2020-04-02
    Ali Aroudi; Tobias de Taillez; Simon Doclo

    Identifying the target speaker in hearing aid applications is crucial to improve speech understanding. Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker from single-trial EEG recordings using auditory attention decoding (AAD) methods. AAD methods reconstruct the attended speech envelope from EEG recordings, based on a linear least-squares

  • iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning
    arXiv.cs.SD Pub Date : 2020-04-02
    Haoyu Li; Szu-Wei Fu; Yu Tsao; Junichi Yamagishi

    The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, with the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modifications. Specifically, we utilize an iMetricGAN approach

  • The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment
    arXiv.cs.SD Pub Date : 2020-04-02
    Wei Zhou; Wilfried Michel; Kazuki Irie; Markus Kitza; Ralf Schlüter; Hermann Ney

    We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model
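    The SpecAugment masking investigated above is simple to sketch: random time and frequency stripes of the input spectrogram are zeroed during training. The mask widths and counts below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def spec_augment(spec, max_t=30, max_f=13, n_t=2, n_f=2, rng=None):
    """Minimal SpecAugment-style masking: zero out n_t random time
    stripes up to max_t frames wide and n_f random frequency stripes
    up to max_f bins wide. spec has shape (time, freq). All
    hyperparameters here are assumptions for illustration."""
    rng = rng or np.random.default_rng(0)
    out = spec.copy()
    T, F = out.shape
    for _ in range(n_t):
        w = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    for _ in range(n_f):
        w = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(1, F - w + 1))
        out[:, f0:f0 + w] = 0.0
    return out
```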

  • Full-Sum Decoding for Hybrid HMM based Speech Recognition using LSTM Language Model
    arXiv.cs.SD Pub Date : 2020-04-02
    Wei Zhou; Ralf Schlüter; Hermann Ney

    In hybrid HMM based speech recognition, LSTM language models have been widely applied and have achieved large improvements. Their theoretical capability of modeling unlimited context suggests that no recombination should be applied in decoding. This motivates reconsidering full summation over the HMM-state sequences instead of the Viterbi approximation in decoding. We explore the potential gain from more

  • Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios
    arXiv.cs.SD Pub Date : 2020-04-02
    Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

    The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

  • Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement
    arXiv.cs.SD Pub Date : 2020-03-31
    Chao-Han Huck Yang; Jun Qi; Pin-Yu Chen; Xiaoli Ma; Chin-Hui Lee

    Recent studies have highlighted adversarial examples as ubiquitous threats to deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net_At, to enhance adversarial speech signals. Specifically, we evaluate the model performance with interpretable speech recognition metrics and discuss the model performance with the augmented adversarial

  • ASR is all you need: cross-modal distillation for lip reading
    arXiv.cs.SD Pub Date : 2019-11-28
    Triantafyllos Afouras; Joon Son Chung; Andrew Zisserman

    The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss

  • Serialized Output Training for End-to-End Overlapped Speech Recognition
    arXiv.cs.SD Pub Date : 2020-03-28
    Naoyuki Kanda; Yashesh Gaur; Xiaofei Wang; Zhong Meng; Takuya Yoshioka

    This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers, as in permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and
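    The serialization idea can be sketched as concatenating the speakers' token sequences with a speaker-change token between them; the token names below are assumptions for illustration:

```python
def serialize_transcriptions(transcriptions):
    """Sketch of SOT-style label serialization: join per-speaker token
    sequences with a speaker-change token <sc> and end with <eos>, so a
    single output layer can emit all speakers' words one after another.
    (Token names are illustrative assumptions.)"""
    tokens = []
    for i, text in enumerate(transcriptions):
        if i > 0:
            tokens.append("<sc>")
        tokens.extend(text.split())
    tokens.append("<eos>")
    return tokens
```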

  • A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
    arXiv.cs.SD Pub Date : 2020-03-28
    Tara N. Sainath; Yanzhang He; Bo Li; Arun Narayanan; Ruoming Pang; Antoine Bruguier; Shuo-yiin Chang; Wei Li; Raziel Alvarez; Zhifeng Chen; Chung-Cheng Chiu; David Garcia; Alex Gruenstein; Ke Hu; Minho Jin; Anjuli Kannan; Qiao Liang; Ian McGraw; Cal Peyser; Rohit Prabhavalkar; Golan Pundak; David Rybach; Yuan Shangguan; Yash Sheth; Trevor Strohman; Mirko Visontai; Yonghui Wu; Yu Zhang; Ding Zhao

    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses

  • Deep Residual Neural Networks for Image in Speech Steganography
    arXiv.cs.SD Pub Date : 2020-03-30
    Shivam Agarwal; Siddarth Venkatraman

    Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside

  • A Recursive Network with Dynamic Attention for Monaural Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-03-29
    Andong Li; Chengshi Zheng; Cunhang Fan; Renhua Peng; Xiaodong Li

    A person tends to direct dynamic attention towards speech in complicated environments. Based on this phenomenon, we propose a framework combining dynamic attention and recursive learning for monaural speech enhancement. Apart from a major noise reduction network, we design a separate sub-network, which adaptively generates the attention distribution to control the information flow throughout

  • Listen to Look: Action Recognition by Previewing Audio
    arXiv.cs.SD Pub Date : 2019-12-10
    Ruohan Gao; Tae-Hyun Oh; Kristen Grauman; Lorenzo Torresani

    In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities, a

  • A Review of Multi-Objective Deep Learning Speech Denoising Methods
    arXiv.cs.SD Pub Date : 2020-03-26
    Arian Azarang; Nasser Kehtarnavaz

    This paper presents a review of multi-objective deep learning methods that have been introduced in the literature for speech denoising. After stating an overview of conventional, single objective deep learning, and hybrid or combined conventional and deep learning methods, a review of the mathematical framework of the multi-objective deep learning methods for speech denoising is provided. A representative

  • Incremental Learning Algorithm for Sound Event Detection
    arXiv.cs.SD Pub Date : 2020-03-26
    Eunjeong Koh; Fatemeh Saki; Yinyi Guo; Cheng-Yu Hung; Erik Visser

    This paper presents a new learning strategy for the Sound Event Detection (SED) system to tackle the issues of i) knowledge migration from a pre-trained model to a new target model and ii) learning new sound events without forgetting the previously learned ones and without re-training from scratch. In order to migrate the previously learned knowledge from the source model to the target one, a neural adapter

  • GPVAD: Towards noise robust voice activity detection via weakly supervised sound event detection
    arXiv.cs.SD Pub Date : 2020-03-27
    Heinrich Dinkel; Yefei Chen; Mengyue Wu; Kai Yu

    Traditional voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck for such supervised VAD training is its requirement for clean training data and frame-level labels. In contrast, we propose the GPVAD framework, which can be easily trained from noisy data in a weakly supervised fashion

  • Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss
    arXiv.cs.SD Pub Date : 2020-03-27
    Yi Luo; Nima Mesgarani

    Many recent source separation systems are designed to separate a fixed number of sources out of a mixture. In cases where the source activation patterns are unknown, such systems have to either adjust the number of outputs or identify invalid outputs from the valid ones. Iterative separation methods have gained much attention in the community as they can flexibly decide the number of outputs,

  • Can you hear me now? Sensitive comparisons of human and machine perception
    arXiv.cs.SD Pub Date : 2020-03-27
    Michael A Lepori; Chaz Firestone

    The rise of sophisticated machine-recognition systems has brought with it a rise in comparisons between human and machine perception. But such comparisons face an asymmetry: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or embedded in unconscious mental processes that may not be available

  • Training for Speech Recognition on Coprocessors
    arXiv.cs.SD Pub Date : 2020-03-22
    Sebastian Baunsgaard; Sebastian B. Wrede; Pınar Tozun

    Automatic Speech Recognition (ASR) has increased in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in such assistants, in turn, has amplified the novel developments in ASR research. However, despite

  • Mic2Mic: Using Cycle-Consistent Generative Adversarial Networks to Overcome Microphone Variability in Speech Systems
    arXiv.cs.SD Pub Date : 2020-03-27
    Akhil Mathur; Anton Isopoussu; Fahim Kawsar; Nadia Berthouze; Nicholas D. Lane

    Mobile and embedded devices are increasingly using microphones and audio-based computational models to infer user context. A major challenge in building systems that combine audio models with commodity microphones is to guarantee their accuracy and robustness in the real-world. Besides many environmental dynamics, a primary factor that impacts the robustness of audio models is microphone variability

  • Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation
    arXiv.cs.SD Pub Date : 2019-10-14
    Yi Luo; Zhuo Chen; Takuya Yoshioka

    Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches over conventional time-frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural
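    The dual-path workaround for such long inputs is to segment the sequence into overlapping chunks, so that intra-chunk and inter-chunk recurrences each process short sequences. A sketch of the segmentation step, where chunk length and hop are assumed hyperparameters:

```python
import numpy as np

def segment_sequence(x, chunk_len, hop):
    """Split a long (time, feat) sequence into overlapping chunks for
    dual-path processing: an intra-chunk RNN runs within each chunk and
    an inter-chunk RNN runs across chunks. Pads the tail so every chunk
    is full length. (chunk_len and hop are illustrative assumptions.)"""
    T = x.shape[0]
    pad = (-(T - chunk_len)) % hop if T > chunk_len else chunk_len - T
    xp = np.pad(x, ((0, pad), (0, 0)))
    starts = range(0, xp.shape[0] - chunk_len + 1, hop)
    return np.stack([xp[s:s + chunk_len] for s in starts])  # (n_chunks, chunk_len, feat)
```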

  • End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation
    arXiv.cs.SD Pub Date : 2019-10-30
    Yi Luo; Zhuo Chen; Nima Mesgarani; Takuya Yoshioka

    An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based

  • Clinical Depression and Affect Recognition with EmoAudioNet
    arXiv.cs.SD Pub Date : 2019-11-01
    Emna Rejaibi; Daoud Kadoch; Kamil Bentounes; Romain Alfred; Mohamed Daoudi; Abdenour Hadid; Alice Othmani

    Automatic emotion recognition and Major Depressive Disorder (MDD) diagnosis are inherently challenging problems in health informatics applications. According to the World Health Organization, 300M people were affected by depression in 2017, with only a third of them correctly identified. MDD is a persistent mood disorder in which the patient constantly feels negative emotions (low valence) and lacks excitement

  • Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression
    arXiv.cs.SD Pub Date : 2020-03-26
    Yi-Chiao Wu; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Hayashi; Tomoki Toda

    In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features

  • Finnish Language Modeling with Deep Transformer Models
    arXiv.cs.SD Pub Date : 2020-03-14
    Abhilash Jain

    Transformers have recently taken center stage in language modeling after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of the Transformer architectures BERT and Transformer-XL for the language modeling task. We use a sub-word model setting with the Finnish language and compare it to the previous state-of-the-art (SOTA) LSTM

  • Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders
    arXiv.cs.SD Pub Date : 2020-03-26
    Wissam A. Jassim; Jan Skoglund; Michael Chinen; Andrew Hines

    This study compares the performance of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. Performance of the coded speech is evaluated for different quality aspects: accuracy of pitch period estimation, the word error rates for

  • In defence of metric learning for speaker recognition
    arXiv.cs.SD Pub Date : 2020-03-26
    Joon Son Chung; Jaesung Huh; Seongkyu Mun; Minjae Lee; Hee Soo Heo; Soyeon Choe; Chiheon Ham; Sunghwan Jung; Bong-Jin Lee; Icksang Han

    The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-class (same speaker) and large inter-class (different speakers) distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning
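    The intra-/inter-class distance notion can be made concrete with a small utility that compares same-speaker and different-speaker embedding distances over a batch (an illustration of the evaluation idea, not the paper's training objective):

```python
import numpy as np

def embedding_distances(embs, labels):
    """Mean intra-class (same speaker) vs. inter-class (different
    speaker) Euclidean distance over a batch of utterance embeddings.
    A good embedding space yields small intra and large inter distances.
    (Illustrative utility; assumes both pair types occur in the batch.)"""
    intra, inter = [], []
    n = len(embs)
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(embs[i] - embs[j]))
            (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)
```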

  • COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis
    arXiv.cs.SD Pub Date : 2020-03-24
    Björn W. Schuller; Dagmar M. Schuller; Kun Qian; Juan Liu; Huaiyuan Zheng; Xiao Li

    At the time of writing, the world population is suffering from more than 10,000 registered deaths induced by the COVID-19 disease epidemic since the outbreak of the coronavirus, now officially known as SARS-CoV-2, more than three months ago. Since then, tremendous efforts have been made worldwide to counter-steer and control the epidemic, by now labelled a pandemic. In this contribution, we provide an overview

  • Monaural Speech Enhancement with Recursive Learning in the Time Domain
    arXiv.cs.SD Pub Date : 2020-03-22
    Andong Li; Chengshi Zheng; Linjuan Cheng; Renhua Peng; Xiaodong Li

    In this paper, we propose a type of neural network with recursive learning in the time domain, called RTNet, for monaural speech enhancement, where the proposed network consists of three principal components. The first part, called the stage recurrent neural network, is proposed to effectively aggregate deep feature dependencies across different stages with a memory mechanism and also remove

  • A Quantum Vocal Theory of Sound
    arXiv.cs.SD Pub Date : 2020-03-21
    Davide Rocchesso; Maria Mannone

    Concepts and formalism from acoustics are often used to exemplify quantum mechanics. Conversely, quantum mechanics could be used to achieve a new perspective on acoustics, as shown by Gabor's studies. Here, we focus in particular on the study of the human voice, considered as a probe to investigate the world of sounds. We present a theoretical framework that is based on observables of vocal production, and

  • Multi-task U-Net for Music Source Separation
    arXiv.cs.SD Pub Date : 2020-03-23
    Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

    A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated for estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

  • The Recognition Of Persian Phonemes Using PPNet
    arXiv.cs.SD Pub Date : 2018-12-17
    Saber Malekzadeh; Mohammad Hossein Gholizadeh; Hossein Ghayoumi zadeh; Seyed Naser Razavi

    In this paper, a novel approach is proposed for the recognition of Persian phonemes in the Persian Consonant-Vowel Combination (PCVC) speech dataset. Nowadays, deep neural networks play a crucial role in classification tasks. However, the best results in speech recognition do not yet match human recognition rates. Deep learning techniques show outstanding performance over many other classification

  • Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation
    arXiv.cs.SD Pub Date : 2019-07-01
    Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

    In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of the WaveNet (WN) vocoder. The effectiveness of the WN vocoder in generating high-fidelity speech samples from given acoustic features has been demonstrated recently. However, because of the fixed dilated convolution

  • Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
    arXiv.cs.SD Pub Date : 2019-07-13
    Siddique Latif; Rajib Rana; Sara Khalifa; Raja Jurdak; Julien Epps; Björn W. Schuller

    Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a

  • Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
    arXiv.cs.SD Pub Date : 2019-07-21
    Yi-Chiao Wu; Patrick Lumban Tobing; Tomoki Hayashi; Kazuhiro Kobayashi; Tomoki Toda

    In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated

  • Neural Transfer Learning for Cry-based Diagnosis of Perinatal Asphyxia
    arXiv.cs.SD Pub Date : 2019-06-24
    Charles C. Onu; Jonathan Lebensold; William L. Hamilton; Doina Precup

    Despite continuing medical advances, the rate of newborn morbidity and mortality globally remains high, with over 6 million casualties every year. The prediction of pathologies affecting newborns based on their cry is thus of significant clinical interest, as it would facilitate the development of accessible, low-cost diagnostic tools. However, the inadequacy

  • Zero-shot Learning for Audio-based Music Classification and Tagging
    arXiv.cs.SD Pub Date : 2019-07-05
    Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

    Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about
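The zero-shot mechanism the abstract describes is to score audio against label representations in a shared semantic space, so labels unseen at training time can still be ranked. A toy sketch under that assumption (vectors and label names are illustrative, not real word embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_tag(audio_emb, label_vectors):
    """Rank labels (seen or unseen) by cosine similarity in the shared space."""
    scores = {name: cosine(audio_emb, vec) for name, vec in label_vectors.items()}
    return max(scores, key=scores.get)

# toy 3-d "semantic" space
labels = {
    "rock":  np.array([1.0, 0.1, 0.0]),
    "jazz":  np.array([0.0, 1.0, 0.1]),
    "lo-fi": np.array([0.1, 0.0, 1.0]),   # an "unseen" label at training time
}
projected = np.array([0.05, 0.02, 0.9])   # audio embedding after projection
print(zero_shot_tag(projected, labels))   # -> lo-fi
```

The classifier never needs retraining when a new genre word is added; only its semantic vector is required.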

  • Deliberation Model Based Two-Pass End-to-End Speech Recognition
    arXiv.cs.SD Pub Date : 2020-03-17
    Ke Hu; Tara N. Sainath; Ruoming Pang; Rohit Prabhavalkar

    End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as
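The two-pass scheme rescores streamed n-best hypotheses with a second, non-streaming model. A minimal sketch of that rescoring step, with the LAS-style second pass stubbed out as a generic scoring callable (the interpolation weight and stub are assumptions for illustration):

```python
import math

def rescore_nbest(hypotheses, first_pass_scores, second_pass, weight=0.5):
    """Pick the hypothesis maximizing a blend of first- and second-pass log-scores.

    `second_pass` is any callable returning a log-score for a full hypothesis;
    in the paper this role is played by an attention-based deliberation decoder.
    """
    best, best_score = None, -math.inf
    for hyp, s1 in zip(hypotheses, first_pass_scores):
        s = (1 - weight) * s1 + weight * second_pass(hyp)
        if s > best_score:
            best, best_score = hyp, s
    return best

# toy stub: the second pass prefers hypotheses of about four words
second = lambda h: -1.0 * abs(len(h.split()) - 4)
print(rescore_nbest(["see the dog", "we see the dog"], [-2.0, -2.3], second))
# -> we see the dog (second pass overturns the first-pass ranking)
```

The first pass can stream with low latency while the second pass only runs once per n-best list, which is how the latency budget stays reasonable.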

  • Multi-Source DOA Estimation through Pattern Recognition of the Modal Coherence of a Reverberant Soundfield
    arXiv.cs.SD Pub Date : 2020-03-18
    A. Fahim; P. N. Samarasinghe; T. D. Abhayapala

    We propose a novel multi-source direction of arrival (DOA) estimation technique using a convolutional neural network algorithm which learns the modal coherence patterns of an incident soundfield through measured spherical harmonic coefficients. We train our model for individual time-frequency bins in the short-time Fourier transform spectrum by analyzing the unique snapshot of modal coherence for each
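The CNN input described here is the modal coherence of the spherical-harmonic (SH) coefficients at each time-frequency bin. A sketch of computing such a coherence matrix from SH snapshots (the normalization is a plausible choice, not necessarily the paper's exact definition):

```python
import numpy as np

def modal_coherence(alpha):
    """Modal coherence matrix of spherical-harmonic coefficients.

    alpha: complex array of shape (snapshots, modes) holding the SH
    coefficients of the soundfield for one time-frequency bin over several
    short-time snapshots. Returns a (modes, modes) normalized
    cross-coherence matrix usable as a CNN input feature.
    """
    R = alpha.conj().T @ alpha / alpha.shape[0]   # cross-correlation of modes
    d = np.sqrt(np.real(np.diag(R)))              # per-mode RMS magnitudes
    return R / np.outer(d, d)                     # normalize to coherence

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 4)) + 1j * rng.standard_normal((64, 4))
C = modal_coherence(a)
# the diagonal of a coherence matrix is 1 by construction
```

Stacking these matrices over frequency gives a fixed-size, array-geometry-aware input, which is what lets a pattern-recognition model replace classical subspace DOA methods.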

  • Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method
    arXiv.cs.SD Pub Date : 2020-03-18
    Yuan Gong; Jian Yang; Christian Poellabauer

    With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks. Previous efforts to address this concern have focused primarily on single-channel audio. In this paper, we introduce a novel neural network-based replay attack detection model that further leverages

  • TensorFlow Audio Models in Essentia
    arXiv.cs.SD Pub Date : 2020-03-16
    Pablo Alonso-Jiménez; Dmitry Bogdanov; Jordi Pons; Xavier Serra

    Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference. To show the potential of this new interface with TensorFlow, we provide a number of pre-trained

  • High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
    arXiv.cs.SD Pub Date : 2020-03-17
    Jinyu Li; Rui Zhao; Eric Sun; Jeremy H. M. Wong; Amit Das; Zhong Meng; Yifan Gong

    While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM

  • Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
    arXiv.cs.SD Pub Date : 2020-03-17
    Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

    In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference
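The pre-separation stage amounts to time-frequency masking of the mixture. A minimal sketch of that stage with soft ratio masks (the post-filter itself is not sketched; the mask formulation here is a generic assumption):

```python
import numpy as np

def ratio_masks(source_mag_estimates, eps=1e-8):
    """Soft time-frequency masks from per-source magnitude estimates."""
    total = sum(source_mag_estimates) + eps
    return [s / total for s in source_mag_estimates]

def pre_separate(mixture_mag, masks):
    """Pre-separation: apply each soft mask to the mixture magnitude.

    Residual interference left in these estimates is what an end-to-end
    post-filter would then be trained to remove.
    """
    return [m * mixture_mag for m in masks]

mix = np.array([[2.0, 4.0]])                      # |STFT| of the mixture
est = [np.array([[1.0, 3.0]]), np.array([[1.0, 1.0]])]
s1, s2 = pre_separate(mix, ratio_masks(est))
# the soft masks sum to ~1, so the two estimates add back up to the mixture
```

Because soft masks never fully suppress the competing speaker, each output keeps some cross-talk, motivating a learned second stage.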

Contents have been reproduced by permission of the publishers.