Current journal: arXiv - CS - Sound
  • Timbre Space Representation of a Subtractive Synthesizer
    arXiv.cs.SD Pub Date : 2020-09-24
    Cyrus Vahidi; George Fazekas; Charalampos Saitis; Alessandro Palladini

    In this study, we produce a geometrically scaled perceptual timbre space from dissimilarity ratings of subtractive synthesized sounds and correlate the resulting dimensions with a set of acoustic descriptors. We curate a set of 15 sounds, produced by a synthesis model that uses varying source waveforms, frequency modulation (FM) and a lowpass filter with an enveloped cutoff frequency. Pairwise dissimilarity

    Updated: 2020-09-25
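As an illustration of the pipeline sketched in the abstract above, the snippet below recovers a low-dimensional space from a pairwise dissimilarity matrix via multidimensional scaling (MDS) and correlates each dimension with an acoustic descriptor. The ratings and descriptor values are random placeholders, and the authors' exact scaling procedure may differ.

```python
# Minimal sketch: recover a low-dimensional perceptual space from
# pairwise dissimilarity ratings via MDS, then correlate each
# dimension with an acoustic descriptor. All values are placeholders.
import numpy as np
from sklearn.manifold import MDS
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_sounds = 15
ratings = rng.random((n_sounds, n_sounds))
dissim = (ratings + ratings.T) / 2          # symmetrize ratings
np.fill_diagonal(dissim, 0.0)

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
space = mds.fit_transform(dissim)           # (15, 3) timbre space

spectral_centroid = rng.random(n_sounds)    # placeholder descriptor
for d in range(space.shape[1]):
    r, p = pearsonr(space[:, d], spectral_centroid)
    print(f"dim {d}: r={r:.2f}, p={p:.3f}")
```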
  • The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
    arXiv.cs.SD Pub Date : 2020-09-24
    Lara Orlandic; Tomas Teijeiro; David Atienza

    Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings

    Updated: 2020-09-25
  • Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning
    arXiv.cs.SD Pub Date : 2020-09-24
    Daiki Takeuchi; Yuma Koizumi; Yasunori Ohishi; Noboru Harada; Kunio Kashino

    The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not yet

    Updated: 2020-09-25
  • FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning
    arXiv.cs.SD Pub Date : 2020-09-23
    Tedd Kourkounakis; Amirhossein Hajavi; Ali Etemad

    Strong presentation skills are valuable and sought-after in workplace and classroom environments alike. Of the possible improvements to vocal presentations, disfluencies and stutters in particular remain among the most common and prominent detractors from a speaker's delivery. Millions of people are affected by stuttering and other speech disfluencies, with the majority of the world having experienced

    Updated: 2020-09-25
  • A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
    arXiv.cs.SD Pub Date : 2020-09-22
    Yerbolat Khassanov; Saida Mussakhojayeva; Almas Mirzakhmetov; Alen Adiyev; Mukhamet Nurpeiissov; Huseyin Atakan Varol

    We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 335 hours of transcribed audio comprising over 154,000 utterances spoken by participants from different regions, age groups, and genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various

    Updated: 2020-09-23
  • End-to-End Speech Recognition and Disfluency Removal
    arXiv.cs.SD Pub Date : 2020-09-22
    Paria Jamshid Lou; Mark Johnson

    Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency

    Updated: 2020-09-23
  • End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands
    arXiv.cs.SD Pub Date : 2020-09-22
    Mohsen Jafarzadeh; Yonas Tadesse

    Speech is one of the most common forms of communication in humans. Speech commands are essential parts of multimodal controlling of prosthetic hands. In the past decades, researchers used automatic speech recognition systems for controlling prosthetic hands by using speech commands. Automatic speech recognition systems learn how to map human speech to text. Then, they used natural language processing

    Updated: 2020-09-23
  • Using Inaudible Audio and Voice Assistants to Transmit Sensitive Data over Telephony
    arXiv.cs.SD Pub Date : 2020-09-21
    Zhengxian He; Mohit Narayan Rajput; Mustaque Ahamad

    New security and privacy concerns arise due to the growing popularity of voice assistant (VA) deployments in home and enterprise networks. A number of past research results have demonstrated how malicious actors can use hidden commands to get VAs to perform certain operations even when a person may be in their vicinity. However, such work has not explored how compromised computers that are close to

    Updated: 2020-09-23
  • Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-09-21
    Hang Chen; Jun Du; Yu Hu; Li-Rong Dai; Bao-Cai Yin; Chin-Hui Lee

    In this paper, we propose a visual embedding approach to improving embedding aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place of articulation levels. We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embedding from noisy speech and lip videos

    Updated: 2020-09-22
  • End-to-End Speaker-Dependent Voice Activity Detection
    arXiv.cs.SD Pub Date : 2020-09-21
    Yefei Chen; Shuai Wang; Yanmin Qian; Kai Yu

    Voice activity detection (VAD) is an essential pre-processing step for tasks such as automatic speech recognition (ASR) and speaker recognition. A basic goal is to remove silent segments within an audio stream, while a more general VAD system could remove all the irrelevant segments such as noise and even unwanted speech from non-target speakers. We define the task, which only detects the speech from the

    Updated: 2020-09-22
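For intuition about the basic goal described in the entry above (removing silent segments), here is a minimal frame-energy VAD baseline. It is not the paper's end-to-end neural approach; the frame sizes and threshold are arbitrary choices.

```python
# Baseline sketch (not the paper's neural model): frame-energy VAD
# that flags frames above an energy threshold as speech.
import numpy as np

def energy_vad(x, frame_len=400, hop=160, threshold_db=-35.0):
    """Return a boolean speech/non-speech decision per frame."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    decisions = np.zeros(n_frames, dtype=bool)
    ref = np.max(np.abs(x)) + 1e-12          # reference amplitude
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        decisions[i] = 20 * np.log10(rms / ref) > threshold_db
    return decisions
```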
  • DiffWave: A Versatile Diffusion Model for Audio Synthesis
    arXiv.cs.SD Pub Date : 2020-09-21
    Zhifeng Kong; Wei Ping; Jiaji Huang; Kexin Zhao; Bryan Catanzaro

    In this work, we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation. The model is non-autoregressive, and converts a white-noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces

    Updated: 2020-09-22
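The sampling procedure the abstract above describes (white noise refined through a Markov chain with a fixed number of steps) follows the standard diffusion-model recipe. Below is a schematic sketch under that assumption; the noise schedule is a placeholder and `eps_model` stands in for the trained noise-prediction network.

```python
# Schematic diffusion sampling: start from white noise and denoise
# through a fixed number of Markov steps (standard DDPM recipe).
import torch

def sample(eps_model, length, T=50):
    betas = torch.linspace(1e-4, 0.05, T)        # placeholder schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, length)                   # white noise x_T
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]))    # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one Markov step
    return x                                     # structured waveform
```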
  • Detecting Acoustic Events Using Convolutional Macaron Net
    arXiv.cs.SD Pub Date : 2020-09-21
    Teck Kai Chan; Cheng Siong Chin

    In this paper, we propose to address the issue of the lack of strongly labeled data by using pseudo strongly labeled data that is approximated using Convolutive Nonnegative Matrix Factorization (CNMF). Using this pseudo strongly labeled data, we then train a new architecture combining a Convolutional Neural Network (CNN) with a Macaron Net (MN), which we term the Convolutional Macaron Net (CMN). As opposed

    Updated: 2020-09-22
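A rough sketch of how factorization-based pseudo labels can be derived for the entry above: decompose a magnitude spectrogram into spectral bases and temporal activations, then threshold the activations into frame-level labels. Plain NMF is used here for brevity (the paper uses the convolutive variant, CNMF), and the spectrogram is a random placeholder.

```python
# Sketch: frame-level pseudo labels from factorization activations.
import numpy as np
from sklearn.decomposition import NMF

spec = np.random.rand(257, 500)               # placeholder magnitude spectrogram
model = NMF(n_components=4, init="random", random_state=0, max_iter=300)
W = model.fit_transform(spec)                 # (freq, events) spectral bases
H = model.components_                         # (events, time) activations

# Frames where an event's activation exceeds a threshold become
# "strong" (frame-level) pseudo labels for that event class.
pseudo_labels = H > (0.5 * H.max(axis=1, keepdims=True))
```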
  • End-to-End Bengali Speech Recognition
    arXiv.cs.SD Pub Date : 2020-09-21
    Sayan Mandal; Sarthak Yadav; Atul Rai

    Bengali is a prominent language of the Indian subcontinent. However, while many state-of-the-art acoustic models exist for prominent languages spoken in the region, research and resources for Bengali are few and far between. In this work, we apply CTC-based CNN-RNN networks, a prominent deep-learning-based end-to-end automatic speech recognition technique, to the Bengali ASR task. We also propose and

    Updated: 2020-09-22
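The CTC objective at the core of such CNN-RNN end-to-end models aligns unsegmented acoustic frames with a label sequence. A minimal PyTorch sketch with random stand-in tensors:

```python
# CTC objective: align T acoustic frames with an unsegmented
# character sequence, without frame-level alignments.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, N, C = 100, 2, 60          # frames, batch, characters (incl. blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 20))                  # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # would flow into CNN-RNN parameters in practice
```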
  • Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias
    arXiv.cs.SD Pub Date : 2020-09-21
    Mufan Sang; Wei Xia; John H. L. Hansen

    In forensic applications, it is very common that only small naturalistic datasets consisting of short utterances in complex or unknown acoustic environments are available. In this study, we propose a pipeline solution to improve speaker verification on a small actual forensic field dataset. By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed

    Updated: 2020-09-22
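The abstract above does not spell out the distillation objective, so the following is only a generic teacher-student sketch: the student is pushed to match embeddings from a teacher pre-trained on large out-of-domain data. The function name and cosine formulation are illustrative, not the paper's.

```python
# Generic teacher-student distillation sketch (illustrative only):
# train the student to match a pre-trained teacher's embeddings.
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb):
    # Cosine-based matching; the paper's exact objective may differ.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()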
  • Learning a Lie Algebra from Unlabeled Data Pairs
    arXiv.cs.SD Pub Date : 2020-09-19
    Chris Ick; Vincent Lostanlen

    Deep convolutional networks (convnets) show a remarkable ability to learn disentangled representations. In recent years, the generalization of deep learning to Lie groups beyond rigid motion in $\mathbb{R}^n$ has made it possible to build convnets over datasets with non-trivial symmetries, such as patterns over the surface of a sphere. However, one limitation of this approach is the need to explicitly define

    Updated: 2020-09-22
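For context on the mathematics in the entry above: a Lie algebra element $A$ generates a one-parameter transformation group via the exponential map, $g(t) = \exp(tA)$, so a transformed pair satisfies $x' = \exp(tA)\,x$. Learning $A$ from unlabeled pairs $(x_i, x'_i)$ could, for instance, be posed as the regression below. This is one plausible formulation, not necessarily the authors' objective.

```latex
% One plausible objective for learning a Lie algebra generator A
% from unlabeled data pairs (x_i, x'_i):
\min_{A,\;\{t_i\}} \; \sum_i \bigl\| x'_i - \exp(t_i A)\, x_i \bigr\|_2^2
```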
  • A Machine Learning Approach to Detect Suicidal Ideation in US Veterans Based on Acoustic and Linguistic Features of Speech
    arXiv.cs.SD Pub Date : 2020-09-14
    Vaibhav Sourirajan; Anas Belouali; Mary Ann Dutton; Matthew Reinhard; Jyotishman Pathak

    Preventing Veteran suicide is a national priority. The US Department of Veterans Affairs (VA) collects, analyzes, and publishes data to inform suicide prevention strategies. Current approaches for detecting suicidal ideation mostly rely on patient self-report, which is inadequate and time-consuming. In this research study, our goal was to automate suicidal ideation detection from acoustic and linguistic

    Updated: 2020-09-22
  • Optimizing Speech Emotion Recognition using Manta-Ray Based Feature Selection
    arXiv.cs.SD Pub Date : 2020-09-18
    Soham Chattopadhyay; Arijit Dey; Hritam Basak

    Emotion recognition from audio signals has been regarded as a challenging task in signal processing as it can be considered as a collection of static and dynamic classification tasks. Recognition of emotions from speech data has relied heavily on end-to-end feature extraction and classification using machine learning models, though the absence of feature selection and optimization has restrained

    Updated: 2020-09-21
  • Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds
    arXiv.cs.SD Pub Date : 2020-09-17
    Piyush Bagad; Aman Dalmia; Jigar Doshi; Arsha Nagrani; Parag Bhamare; Amrita Mahale; Saurabh Rane; Neeraj Agarwal; Rahul Panicker

    Testing capacity for COVID-19 remains a challenge globally due to the lack of adequate supplies, trained personnel, and sample-processing equipment. These problems are even more acute in rural and underdeveloped regions. We demonstrate that solicited-cough sounds collected over a phone, when analysed by our AI model, have statistically significant signal indicative of COVID-19 status (AUC 0.72, t-test

    Updated: 2020-09-21
  • GuessTheMusic: Song Identification from Electroencephalography response
    arXiv.cs.SD Pub Date : 2020-09-17
    Dhananjay Sonawane; Krishna Prasad Miyapuram; Bharatesh RS; Derek J. Lomas

    A music signal comprises different features such as rhythm, timbre, melody, and harmony. Its impact on the human brain has been an exciting research topic for the past several decades. Electroencephalography (EEG) enables non-invasive measurement of brain activity. Leveraging the recent advancements in deep learning, we propose a novel approach for song identification using a Convolution Neural

    Updated: 2020-09-21
  • Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
    arXiv.cs.SD Pub Date : 2020-09-17
    Yukiya Hono; Kazuna Tsuboi; Kei Sawada; Kei Hashimoto; Keiichiro Oura; Yoshihiko Nankaku; Keiichi Tokuda

    This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis to enable fine control of the prosody and speaking styles of synthesized speech. However, the naturalness of speech degrades when these latent variables are obtained by sampling

    Updated: 2020-09-21
  • Online Speaker Diarization with Relation Network
    arXiv.cs.SD Pub Date : 2020-09-17
    Xiang Li; Yucheng Zhao; Chong Luo; Wenjun Zeng

    In this paper, we propose an online speaker diarization system based on Relation Network, named RenoSD. Unlike conventional diarization systems which consist of several independently-optimized modules, RenoSD implements voice-activity-detection (VAD), embedding extraction, and speaker identity association using a single deep neural network. The most striking feature of RenoSD is that it adopts a meta-learning

    Updated: 2020-09-20
  • Utterance-level Intent Recognition from Keywords
    arXiv.cs.SD Pub Date : 2020-09-17
    Wenda Chen; Jonathan Huang; Mark Hasegawa-Johnson

    This paper focuses on wake on intent (WOI) techniques for platforms with limited compute and memory. Our approach of utterance-level intent classification is based on a sequence of keywords in the utterance instead of a single fixed key phrase. The keyword sequence is transformed into four types of input features, namely acoustics, phones, word2vec and speech2vec for individual intent learning and

    Updated: 2020-09-20
  • Temporally Guided Music-to-Body-Movement Generation
    arXiv.cs.SD Pub Date : 2020-09-17
    Hsuan-Kai Kao; Li Su

    This paper presents a neural network model to generate a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models used to generate 2-D skeleton data in previous works, the proposed model incorporates an encoder-decoder architecture, as well as the self-attention mechanism to model the complicated dynamics in body movement sequences. To

    Updated: 2020-09-20
  • Similarity-based data mining for online domain adaptation of a sonar ATR system
    arXiv.cs.SD Pub Date : 2020-09-16
    Jean de Bodinat; Thomas Guerneve; Jose Vazquez; Marija Jegorova

    Due to the expensive nature of field data gathering, the lack of training data often limits the performance of Automatic Target Recognition (ATR) systems. This problem is often addressed with domain adaptation techniques; however, the currently existing methods fail to satisfy the constraints of resource- and time-limited underwater systems. We propose to address this issue via an online fine-tuning

    Updated: 2020-09-18
  • Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments
    arXiv.cs.SD Pub Date : 2020-09-15
    Haley Lepp; Gina-Anne Levow

    This study presents a corpus of turn changes between speakers in U.S. Supreme Court oral arguments. Each turn change is labeled on a spectrum of "cooperative" to "competitive" by a human annotator with legal experience in the United States. We analyze the relationship between speech features, the nature of exchanges, and the gender and legal role of the speakers. Finally, we demonstrate that the models

    Updated: 2020-09-18
  • When Automatic Voice Disguise Meets Automatic Speaker Verification
    arXiv.cs.SD Pub Date : 2020-09-15
    Linlin Zheng; Jiakang Li; Meng Sun; Xiongwei Zhang; Thomas Fang Zheng

    The technique of transforming a voice in order to hide the speaker's real identity is called voice disguise. Automatic voice disguise (AVD), which modifies the spectral and temporal characteristics of voices with miscellaneous algorithms, is easily conducted with software accessible to the public. AVD has posed a great threat to both human listening and automatic speaker verification (ASV)

    Updated: 2020-09-16
  • Controllable neural text-to-speech synthesis using intuitive prosodic features
    arXiv.cs.SD Pub Date : 2020-09-14
    Tuomo Raitio; Ramya Rasipuram; Dan Castellani

    Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this

    Updated: 2020-09-16
  • A study of vowel nasalization using instantaneous spectra
    arXiv.cs.SD Pub Date : 2020-09-14
    RaviShankar Prasad; B. Yegnanarayana

    Nasalization of vowels is a phenomenon where the oral and nasal tracts participate simultaneously in the production of speech. Acoustic coupling of the oral and nasal tracts results in a complex production system, which is subject to continuous change owing to glottal activity. Identification of the duration of nasalization in vowels, and the extent of coupling of oral and nasal tracts, is a challenging

    Updated: 2020-09-15
  • SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context
    arXiv.cs.SD Pub Date : 2020-09-11
    Mark Cartwright; Jason Cramer; Ana Elisa Mendez Mendez; Yu Wang; Ho-Hsiang Wu; Vincent Lostanlen; Magdalena Fuentes; Graham Dove; Charlie Mydlarz; Justin Salamon; Oded Nov; Juan Pablo Bello

    We present SONYC-UST-V2, a dataset for urban sound tagging with spatiotemporal information. This dataset is aimed at the development and evaluation of machine listening systems for real-world urban noise monitoring. While datasets of urban recordings are available, this dataset provides the opportunity to investigate how spatiotemporal metadata can aid in the prediction of urban sound tags. SONYC-UST-V2

    Updated: 2020-09-14
  • RECOApy: Data recording, pre-processing and phonetic transcription for end-to-end speech-based applications
    arXiv.cs.SD Pub Date : 2020-09-11
    Adriana Stan

    Deep learning enables the development of efficient end-to-end speech processing applications while bypassing the need for expert linguistic and signal processing features. Yet, recent studies show that good quality speech resources and phonetic transcription of the training data can enhance the results of these applications. In this paper, the RECOApy tool is introduced. RECOApy streamlines the steps

    Updated: 2020-09-14
  • Generalized Minimal Distortion Principle for Blind Source Separation
    arXiv.cs.SD Pub Date : 2020-09-11
    Robin Scheibler

    We revisit the source image estimation problem from blind source separation (BSS). We generalize the traditional minimum distortion principle to maximum likelihood estimation with a model for the residual spectrograms. Because residual spectrograms typically contain other sources, we propose to use a mixed-norm model that lets us finely tune sparsity in time and frequency. We propose to carry out the

    Updated: 2020-09-14
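A standard definition of the kind of mixed norm the abstract above refers to, applied to a residual spectrogram $R = (r_{f,t})$: the inner and outer exponents tune sparsity across frequency and time, respectively. The paper's exact parametrization may differ.

```latex
% Mixed lp/lq norm over a residual spectrogram R = (r_{f,t}):
\|R\|_{p,q} \;=\; \Biggl( \sum_{t} \Bigl( \sum_{f} |r_{f,t}|^{p} \Bigr)^{q/p} \Biggr)^{1/q}
```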
  • Speaker Diarization Using Stereo Audio Channels: Preliminary Study on Utterance Clustering
    arXiv.cs.SD Pub Date : 2020-09-10
    Yingjun Dong; Neil G. MacLaren; Yiding Cao; Francis J. Yammarino; Shelley D. Dionne; Michael D. Mumford; Shane Connelly; Hiroki Sayama; Gregory A. Ruark

    Speaker diarization is one of the actively researched topics in audio signal processing and machine learning. Utterance clustering is a critical part of a speaker diarization task. In this study, we aim to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. We generated processed audio signals by combining left- and right-channel audio signals in a few

    Updated: 2020-09-14
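The abstract above is truncated before listing the channel combinations, so the sketch below only shows two common ways of combining left and right channels into processed signals; it should not be read as the paper's exact set.

```python
# Common stereo channel combinations (illustrative placeholders).
import numpy as np

left, right = np.random.randn(2, 16000)   # placeholder stereo audio
mid = 0.5 * (left + right)                # average (mono downmix)
side = 0.5 * (left - right)               # difference (spatial cue)
```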
  • Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space
    arXiv.cs.SD Pub Date : 2020-08-22
    Sicheng Zhao; Yaxian Li; Xingxu Yao; Weizhi Nie; Pengfei Xu; Jufeng Yang; Kurt Keutzer

    Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states which cannot well reflect the complexity and subtlety of emotions, or train the matching

    Updated: 2020-09-14
  • A dataset and classification model for Malay, Hindi, Tamil and Chinese music
    arXiv.cs.SD Pub Date : 2020-09-09
    Fajilatun Nahar; Kat Agres; Balamurali BT; Dorien Herremans

    In this paper we present a new dataset, with musical excerpts from the three main ethnic groups in Singapore: Chinese, Malay and Indian (both Hindi and Tamil). We use this new dataset to train different classification models to distinguish the origin of the music in terms of these ethnic groups. The classification models were optimized by exploring the use of different musical features as the input

    Updated: 2020-09-11
  • Exploration of End-to-end Synthesisers for Zero Resource Speech Challenge 2020
    arXiv.cs.SD Pub Date : 2020-09-10
    Karthik Pandia D S; Anusha Prakash; Mano Ranjith Kumar; Hema A Murthy

    A spoken dialogue system for an unseen language is referred to as zero resource speech. It is especially beneficial for developing applications for languages that have low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units,

    Updated: 2020-09-11
  • ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets and Testing Framework
    arXiv.cs.SD Pub Date : 2020-09-10
    Kusha Sridhar; Ross Cutler; Ando Saabas; Tanel Parnamaa; Hannes Gamper; Sebastian Braun; Robert Aichner; Sriram Srinivasan

    The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report reasonable performance on synthetic datasets where the train and test samples come from the same underlying distribution

    Updated: 2020-09-11
  • Hardware Aware Training for Efficient Keyword Spotting on General Purpose and Specialized Hardware
    arXiv.cs.SD Pub Date : 2020-09-09
    Peter Blouw; Gurshaant Malik; Benjamin Morcos; Aaron R. Voelker; Chris Eliasmith

    Keyword spotting (KWS) provides a critical user interface for many mobile and edge applications, including phones, wearables, and cars. As KWS systems are typically 'always on', maximizing both accuracy and power efficiency is central to their utility. In this work we use hardware aware training (HAT) to build new KWS neural networks based on the Legendre Memory Unit (LMU) that achieve state-of-the-art

    Updated: 2020-09-11
  • Exploiting Multi-Modal Features From Pre-trained Networks for Alzheimer's Dementia Recognition
    arXiv.cs.SD Pub Date : 2020-09-09
    Junghyun Koo; Jie Hwan Lee; Jaewoo Pyo; Yujin Jo; Kyogu Lee

    Collecting and accessing a large amount of medical data is very time-consuming and laborious, not only because it is difficult to find specific patients but also because the confidentiality of patients' medical records must be preserved. On the other hand, deep learning models trained on easily collectible, large-scale datasets such as YouTube or Wikipedia offer useful representations

    Updated: 2020-09-10
  • VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
    arXiv.cs.SD Pub Date : 2020-09-09
    Quan Wang; Ignacio Lopez Moreno; Mert Saglam; Kevin Wilson; Alan Chiao; Renjie Liu; Yanzhang He; Wei Li; Jason Pelecanos; Marily Nika; Alexander Gruenstein

    We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under

    Updated: 2020-09-10
  • Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks
    arXiv.cs.SD Pub Date : 2020-09-09
    Helena Cuesta; Brian McFee; Emilia Gómez

    This paper addresses the extraction of multiple F0 values from polyphonic and a cappella vocal performances using convolutional neural networks (CNNs). We address the major challenges of ensemble singing, i.e., all melodic sources are vocals and singers sing in harmony. We build upon an existing architecture to produce a pitch salience function of the input signal, where the harmonic constant-Q transform

    Updated: 2020-09-10
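A sketch of the harmonic constant-Q transform (HCQT) input representation mentioned in the entry above: CQTs whose minimum frequencies are harmonic multiples of a base frequency are stacked into channels, giving the CNN aligned views of each pitch's harmonics. The C1 base of about 32.7 Hz and the harmonic set below follow common practice (after Bittner et al.); the paper's exact configuration may differ.

```python
# Sketch of an HCQT: stack CQTs whose fmin values are harmonic
# multiples of a base frequency, one harmonic per channel.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
harmonics = [0.5, 1, 2, 3, 4, 5]
hcqt = np.stack([
    np.abs(librosa.cqt(y, sr=sr, fmin=32.7 * h,
                       n_bins=72, bins_per_octave=12))
    for h in harmonics
])          # shape: (harmonics, freq_bins, time)
```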
  • Multi-modal Attention for Speech Emotion Recognition
    arXiv.cs.SD Pub Date : 2020-09-09
    Zexu Pan; Zhaojie Luo; Jichen Yang; Haizhou Li

    Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN) to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA,

    Updated: 2020-09-10
  • 1-Dimensional polynomial neural networks for audio signal related problems
    arXiv.cs.SD Pub Date : 2020-09-09
    Habib Ben Abdallah; Christopher J. Henry; Sheela Ramanna

    In addition to being extremely non-linear, modern problems require millions if not billions of parameters to solve or at least to get a good approximation of the solution, and neural networks are known to assimilate that complexity by deepening and widening their topology in order to increase the level of non-linearity needed for a better approximation. However, compact topologies are always preferred

    Updated: 2020-09-10
  • Toward the pre-cocktail party problem with TasTas$+$
    arXiv.cs.SD Pub Date : 2020-09-07
    Anyan Shi; Jiqing Han; Ziqiang Shi

    Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual} and TasTas \cite{shi2020speech}. In this paper, we propose two improvements of TasTas \cite{shi2020speech} for an end-to-end approach to monaural speech separation in pre-cocktail party problems

    Updated: 2020-09-10
  • AutoKWS: Keyword Spotting with Differentiable Architecture Search
    arXiv.cs.SD Pub Date : 2020-09-08
    Bo Zhang; WenFeng Li; Qingyuan Li; Weiji Zhuang; Xiangxiang Chu; Yujun Wang

    Smart audio devices are gated by an always-on lightweight keyword spotting program to reduce power consumption. It is however challenging to design models that have both high accuracy and low latency for accurate and fast responsiveness. Many efforts have been made to develop end-to-end neural networks, in which depthwise separable convolutions, temporal convolutions, and LSTMs are adopted as building

    Updated: 2020-09-10
  • Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
    arXiv.cs.SD Pub Date : 2020-09-08
    Rohan Kumar Das; Tomi Kinnunen; Wen-Chin Huang; Zhenhua Ling; Junichi Yamagishi; Yi Zhao; Xiaohai Tian; Tomoki Toda

    The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary performance

    Updated: 2020-09-10
  • Robust Digital Envelope Estimation Via Geometric Properties of an Arbitrary Real Signal
    arXiv.cs.SD Pub Date : 2020-09-07
    Carlos Tarjano; Valdecy Pereira

    The temporal amplitude envelope of a signal is essential for its complete characterization, being the primary information-carrying medium in spoken voice and telecommunications, for example. Envelope detection techniques have applications in areas such as health, sound classification and synthesis, seismology, and speech recognition. Nevertheless, a general method for the digital envelope detection of signals

    Updated: 2020-09-08
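For reference, the classical digital envelope estimate that such methods are usually compared against is the magnitude of the analytic signal obtained with the Hilbert transform. This is a baseline sketch, not the paper's geometric method.

```python
# Classical baseline: amplitude envelope as the magnitude of the
# analytic signal (Hilbert transform), shown on a synthetic AM tone.
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) * np.sin(2 * np.pi * 440 * t)  # AM tone
envelope = np.abs(hilbert(x))     # analytic-signal magnitude
```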
  • An End-to-end Architecture of Online Multi-channel Speech Separation
    arXiv.cs.SD Pub Date : 2020-09-07
    Jian Wu; Zhuo Chen; Jinyu Li; Takuya Yoshioka; Zhili Tan; Ed Lin; Yi Luo; Lei Xie

    Multi-speaker speech recognition has been one of the key challenges in conversation transcription as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered as a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the

    Updated: 2020-09-08
  • A Comparison of Virtual Analog Modelling Techniques for Desktop and Embedded Implementations
    arXiv.cs.SD Pub Date : 2020-09-06
    Jatin Chowdhury

    We develop a virtual analog model of the Klon Centaur guitar pedal circuit, comparing various circuit modelling techniques. The techniques analyzed include traditional modelling techniques such as nodal analysis and Wave Digital Filters, as well as a machine learning technique using recurrent neural networks. We examine these techniques in the contexts of two use cases: an audio plug-in designed to

    Updated: 2020-09-08
  • Non causal deep learning based dereverberation
    arXiv.cs.SD Pub Date : 2020-09-06
    Jorge Wuth; Richard M. Stern; Nestor Becerra Yoma

    In this paper we demonstrate the effectiveness of non-causal context for mitigating the effects of reverberation in deep-learning-based automatic speech recognition (ASR) systems. First, the value of non-causal context using a non-causal FIR filter is shown by comparing the contributions of previous vs. future information. Second, MLP- and LSTM-based dereverberation networks were trained to confirm

    Updated: 2020-09-08
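To make the notion of non-causal context in the entry above concrete: a non-causal FIR filter computes each output sample from both past and future input samples. In this small sketch the taps are a simple moving average, not the paper's learned filters.

```python
# Non-causal filtering: each output sample uses both past and
# *future* input samples, unlike a causal filter.
import numpy as np

def noncausal_fir(x, h):
    """Apply FIR taps h centered on each sample (len(h) must be odd)."""
    half = len(h) // 2
    padded = np.pad(x, half)
    return np.array([
        np.dot(h, padded[n : n + len(h)]) for n in range(len(x))
    ])

smoothed = noncausal_fir(np.random.randn(1000), np.ones(5) / 5)
```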
  • Libri-Adapt: A New Speech Dataset for Unsupervised Domain Adaptation
    arXiv.cs.SD Pub Date : 2020-09-06
    Akhil Mathur; Fahim Kawsar; Nadia Berthouze; Nicholas D. Lane

    This paper introduces a new dataset, Libri-Adapt, to support unsupervised domain adaptation research on speech recognition models. Built on top of the LibriSpeech corpus, Libri-Adapt contains English speech recorded on mobile and embedded-scale microphones, and spans 72 different domains that are representative of the challenging practical scenarios encountered by ASR models. More specifically, Libri-Adapt

    Updated: 2020-09-08
  • Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019
    arXiv.cs.SD Pub Date : 2020-09-06
    Archontis Politis; Annamaria Mesaros; Sharath Adavanne; Toni Heittola; Tuomas Virtanen

    Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of DCASE 2019 Challenge. A large-scale realistic dataset of

    Updated: 2020-09-08
  • Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
    arXiv.cs.SD Pub Date : 2020-09-06
    Songxiang Liu; Yuewen Cao; Disong Wang; Xixin Wu; Xunying Liu; Helen Meng

    This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq) based, non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq based synthesis module. During the training stage, an encoder-decoder based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder

    Updated: 2020-09-08
  • Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
    arXiv.cs.SD Pub Date : 2020-09-05
    Jingjun Liang; Ruichen Li; Qin Jin

    Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high manual annotation cost and inevitable label ambiguity, the development of emotion recognition datasets is limited in both scale and quality. Therefore, one of the key challenges is how to build effective models with limited data resources. Previous works have explored different approaches to tackle

    Updated: 2020-09-08
  • A multi-view approach for Mandarin non-native mispronunciation verification
    arXiv.cs.SD Pub Date : 2020-09-05
    Zhenyu Wang; John H. L. Hansen; Yanlu Xie

    Traditionally, the performance of non-native mispronunciation verification systems has relied on effective phone-level labelling of non-native corpora. In this study, a multi-view approach is proposed to incorporate discriminative feature representations which require less annotation for non-native mispronunciation verification of Mandarin. Here, models are jointly learned to embed acoustic sequence and

    Updated: 2020-09-08
  • Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification
    arXiv.cs.SD Pub Date : 2020-09-05
    Zhenyu Wang; Wei Xia; John H. L. Hansen

    Forensic audio analysis for speaker verification offers unique challenges due to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. The lack of real naturalistic forensic audio corpora with ground-truth speaker identity represents a major challenge in this field. It is also difficult to directly employ small-scale domain-specific data to train

    Updated: 2020-09-08
  • Towards Musically Meaningful Explanations Using Source Separation
    arXiv.cs.SD Pub Date : 2020-09-04
    Verena Haunschmid; Ethan Manilow; Gerhard Widmer

    Deep neural networks (DNNs) are successfully applied in a wide variety of music information retrieval (MIR) tasks. Such models are usually considered "black boxes", meaning that their predictions are not interpretable. Prior work on explainable models in MIR has generally used image processing tools to produce explanations for DNN predictions, but these are not necessarily musically meaningful, or

    Updated: 2020-09-08
  • Degradation effects of water immersion on earbud audio quality
    arXiv.cs.SD Pub Date : 2020-09-02
    Scott Beveridge; Steffen A. Herff; Estefanía Cano

    Earbuds are subjected to constant use and scenarios that may degrade sound quality. Indeed, a common fate of earbuds is being forgotten in pockets and faced with a laundry cycle (LC). Manufacturers' accounts of the extent to which LCs affect earbud sound quality are vague at best, leaving users to their own devices in assessing the damage caused. This paper offers a systematic, empirical approach to

    Updated: 2020-09-08
  • SEANet: A Multi-modal Speech Enhancement Network
    arXiv.cs.SD Pub Date : 2020-09-04
    Marco Tagliasacchi; Yunpeng Li; Karolis Misiunas; Dominik Roblek

    We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement

    Updated: 2020-09-08
  • Dense CNN with Self-Attention for Time-Domain Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-09-03
    Ashutosh Pandey; DeLiang Wang

    Speech enhancement in the time domain is becoming increasingly popular in recent years, due to its capability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder and decoder based architecture with skip connections. Each layer in the encoder and the

    Updated: 2020-09-08
  • Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
    arXiv.cs.SD Pub Date : 2020-09-03
    Amirhossein Hajavi; Ali Etemad

    Deep learning techniques have considerably improved speech processing in recent years. Speech representations extracted by deep learning models are being used in a wide range of tasks such as speech recognition, speaker recognition, and speech emotion recognition. Attention models play an important role in improving deep learning models. However, current attention mechanisms are unable to attend to

    Updated: 2020-09-05
Contents have been reproduced by permission of the publishers.