• arXiv.cs.SD Pub Date : 2020-07-11
Xianchao Wu; Chengyuan Wang; Qinying Lei

Current state-of-the-art AI-based classical music creation algorithms, such as Music Transformer, are trained on a single sequence of notes with time-shifts. The major drawback of the absolute time-interval expression is the difficulty of computing the similarity of notes that share the same note value yet different tempos, within one MIDI file or across files. In addition, the usage of a single sequence restricts

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-12
Zhichao Zhang; Shugong Xu; Shunqing Zhang; Tianhao Qiao; Shan Cao

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model to focus

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-14
Prateek Verma; Alessandro Ilic Mezza; Chris Chafe; Cristina Rottondi

Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications: it aims at revolutionizing the traditional concept of musical interaction by enabling remote musicians to interact and perform together through a telecommunication network. Ensuring realistic conditions for music performance, however, constitutes a significant engineering challenge due to extremely

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-14
Efthymios Tzinis; Zhepei Wang; Paris Smaragdis

In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-13
Hadi Abdullah; Kevin Warren; Vincent Bindschaedler; Nicolas Papernot; Patrick Traynor

Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy in neural networks. Like other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-12
Omkar Ranadive; Grant Gasser; David Terpay; Prem Seetharaman

We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-13
Alexey Tikhonov; Ivan P. Yamshchikov

This paper presents a large dataset of drum patterns and compares two different architectures of artificial neural networks that produce latent explorable spaces with some recognizable genre areas. Adversarially constrained autoencoder interpolations (ACAI) show better results in comparison with a standard variational autoencoder. To our knowledge, this is the first application of ACAI to drum-pattern

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Mudit Verma; Arun Balaji Buduru

Due to a drastic improvement in the quality of internet services worldwide, there is an explosion of multilingual content generation and consumption. This is especially prevalent in countries with large multilingual audiences, which are increasingly consuming media outside their linguistic familiarity/preference. Hence, there is an increasing need for real-time and fine-grained content analysis services

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Tomi Kinnunen; Héctor Delgado; Nicholas Evans; Kong Aik Lee; Ville Vestman; Andreas Nautsch; Massimiliano Todisco; Xin Wang; Md Sahidullah; Junichi Yamagishi; Douglas A. Reynolds

Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV and its use
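To make concrete what the EER metric measures (and why, as this abstract argues, a single operating point may not reflect application needs), here is a minimal numpy sketch, not code from the paper: the decision threshold is swept until the false-acceptance and false-rejection rates coincide.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Find the threshold where false-acceptance rate (FAR) equals
    false-rejection rate (FRR) and return the rate at that point."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostors
    frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected genuines
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

genuine = np.array([0.9, 0.8, 0.7, 0.6])   # scores for genuine trials
impostor = np.array([0.5, 0.4, 0.3, 0.2])  # scores for spoof/impostor trials
print(equal_error_rate(genuine, impostor))  # 0.0: the two score sets are separable
```

Because the EER collapses the whole score distribution to this single crossover point, it ignores class priors and error costs, which is exactly the gap application-aware metrics try to close.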

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Xian Shi; Qiangze Feng; Lei Xie

Code-switching (CS) is a common phenomenon, and recognizing CS speech is challenging. However, CS speech data are scarce and there is no common testbed in the relevant research. This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve ASR performance in Mandarin-English code-switching situations. 500 hours Mandarin

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-11
Takashi Oya; Shohei Iwase; Ryota Natsume; Takahiro Itazuri; Shugo Yamaguchi; Shigeo Morishima

In sound source localization using both visual and aural information, it remains unclear how much the image and sound modalities each contribute to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-11
Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However
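For context on the stacked dilated convolutions mentioned above, the receptive field of such a stack can be computed directly; this small sketch (an illustration, not the paper's code) shows how a doubling dilation cycle yields an exponentially large receptive field.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal convolutions:
    each layer with dilation d adds (kernel_size - 1) * d reachable samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style fixed dilation cycle 1, 2, 4, ..., 512 with kernel size 2:
print(receptive_field(2, [2 ** i for i in range(10)]))  # 1024
```

A pitch-dependent dilated convolution, as the abstract describes, would instead tie the dilation sizes to the instantaneous pitch period rather than fixing them in advance.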

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-10
Jae-Bin Kim; Seongkyu Mun; Myungwoo Oh; Soyeon Choe; Yong-Hyeok Lee; Hyung-Min Park

This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on subjectivity of each annotator, and their label noise is inevitable.
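The label refinement described here, dropping unreliable timestamps while keeping the order of events, can be pictured with a toy example (hypothetical event names and times, not the paper's data):

```python
# Each strong label: (event_class, onset_sec, offset_sec), timestamps annotated by hand.
strong = [("dog_bark", 2.4, 3.1), ("speech", 0.5, 1.9), ("siren", 4.0, 6.2)]

# Refine to a sequential label: keep only the order of occurrence,
# discarding the subjective, noisy onset/offset times.
sequential = [event for event, onset, offset in sorted(strong, key=lambda x: x[1])]
print(sequential)  # ['speech', 'dog_bark', 'siren']
```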

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
Konstantinos Drossos; Stylianos I. Mimilakis; Tuomas Virtanen

Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent convolutional-neural-network-based SED method proposed the usage of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED with a considerably small number of parameters. In this work we propose the expansion

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
David Hülsmeier; Marc René Schädler; Birger Kollmeier

Developing and selecting hearing aids is a time consuming process which could be simplified by using objective models. The framework for auditory discrimination experiments (FADE) accurately simulated the benefit of hearing aid algorithms. One simulation with FADE requires several hours of (un)processed signals, which is obstructive when the signals have to be recorded. We propose and evaluate a real-time

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
Hyeonseung Lee; Woo Hyun Kang; Sung Jun Cheon; Hyeongju Kim; Nam Soo Kim

Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention
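To make the latency problem concrete, here is a minimal numpy sketch of the global softmax attention that online schemes replace (an illustration, not the paper's model): every encoder frame receives a nonzero weight, so the full input must be available before decoding can start.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Global scaled-dot-product softmax attention over all encoder frames."""
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)       # every frame gets a strictly positive weight
    return weights @ values, weights

rng = np.random.default_rng(0)
T, d = 50, 8                        # 50 encoder frames, feature dimension 8
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
context, w = attend(rng.normal(size=d), keys, values)
print(w.shape, bool(np.isclose(w.sum(), 1.0)))  # (50,) True
```

Online attention schemes restrict this support (e.g., to a monotonically advancing window) so the decoder can emit outputs while audio is still arriving.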

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-09
Yuki Okamoto; Keisuke Imoto; Shinnosuke Takamichi; Ryosuke Yamanishi; Takahiro Fukumori; Yoichi Yamashita

Environmental sound synthesis is a technique for generating natural environmental sounds. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, their pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for describing the features of sounds. We

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Emre Çakır; Konstantinos Drossos; Tuomas Virtanen

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions:

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-08
Dawid Przyczyna; Maria Szaciłowska; Marek Przybylski; Marcin Strzelecki; Konrad Szaciłowski

Reservoir computing is an emerging, but very successful, approach to the processing and classification of various signals. It can be described as a model of transient computation, in which the input perturbs the internal dynamics of a chosen computational reservoir; the trajectory of these changes represents the computation performed by the system. The selection of a suitable computational substrate capable

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Hye-jin Shim; Jee-weon Jung; Ju-ho Kim; Ha-jin Yu

Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed using conventional non-linear activations (e.g. ReLU). Furthermore, making design choices to emphasize

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Yi Ren; Xu Tan; Tao Qin; Jian Luan; Zhou Zhao; Tie-Yan Liu

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-08
Nicolas Turpault (MULTISPEECH); Scott Wisdom (MULTISPEECH); Hakan Erdogan (MULTISPEECH); John Hershey (MULTISPEECH); Romain Serizel (MULTISPEECH); Eduardo Fonseca (MTG); Prem Seetharaman; Justin Salamon

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on the

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Nicolas Turpault (MULTISPEECH); Romain Serizel (MULTISPEECH)

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-06
Helin Wang; Yuexian Zou; Dading Chong

Recently, convolutional neural networks (CNN) have achieved the state-of-the-art performance in acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Abhinav Shukla; Stavros Petridis; Maja Pantic

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Surabhi Punjabi; Harish Arsikere; Zeynab Raeesy; Chander Chandak; Nikhil Bhave; Ankish Bansal; Markus Müller; Sergio Murillo; Ariya Rastrow; Sri Garimella; Roland Maas; Mat Hans; Athanasios Mouchtaris; Siegfried Kunzmann

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Valentin Leplat; Nicolas Gillis; Cédric Févotte

Blind spectral unmixing is the problem of decomposing the spectrum of a mixed signal or image into a collection of source spectra and their corresponding activations indicating the proportion of each source present in the mixed spectrum. To perform this task, nonnegative matrix factorization (NMF) based on the $\beta$-divergence, referred to as $\beta$-NMF, is a standard and state-of-the art technique
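As background on β-NMF, a minimal sketch with the standard multiplicative updates (a generic textbook version, not the authors' algorithm, which modifies β-NMF further):

```python
import numpy as np

def beta_nmf(V, rank, beta=1.0, n_iter=200, seed=0):
    """NMF with multiplicative updates minimizing the beta-divergence D_beta(V || WH).
    beta=2 is Euclidean, beta=1 Kullback-Leibler, beta=0 Itakura-Saito."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, rank)) + 1e-3   # source spectra
    H = rng.random((rank, N)) + 1e-3   # activations
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1))
        WH = W @ H
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T)
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 30))) + 1e-6  # toy mixed spectra
W, H = beta_nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 1.0)  # True: the rank-4 factorization beats the trivial zero baseline
```

The multiplicative form guarantees that W and H stay nonnegative, which is what lets the columns of W be read as source spectra and the rows of H as their activations.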

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-01
Joshua E. Siegel; Umberto Coda

We survey the state-of-the-art in offboard diagnostics for vehicles, their occupants, and environments, with particular focus on vibroacoustic approaches. We identify promising application areas including data-driven management for shared mobility and automated fleets, usage-based insurance, and vehicle, occupant, and environmental state and condition monitoring. We close by exploring the particular

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-07
Laetitia Jeancolas; Dijana Petrovska-Delacrétaz; Graziella Mangone; Badr-Eddine Benkelfat; Jean-Christophe Corvol; Marie Vidailhet; Stéphane Lehéricy; Habib Benali

Many articles have used voice analysis to detect Parkinson's disease (PD), but few have focused on the early stages of the disease and the gender effect. In this article, we have adapted the latest speaker recognition system, called x-vectors, in order to detect an early stage of PD from voice analysis. X-vectors are embeddings extracted from a deep neural network, which provide robust speaker representations

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-07
Zihan Pan; Malu Zhang; Jibin Wu; Haizhou Li

Inspired by the mammalian auditory localization pathway, in this paper we propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments, and implement this algorithm in a real-time robotic system with a microphone array. The key to this model lies in the MTPC scheme, which encodes the interaural time difference (ITD) cues into

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-07

This study approached the Hit Song Science problem with the aim of predicting which songs in the Afrobeats genre will become popular among Spotify listeners. A dataset of 2063 songs was generated through the Spotify Web API, with the provided audio features. Random Forest and Gradient Boosting algorithms proved to be successful, with F1 scores of approximately 86%.
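A hedged sketch of this kind of classification setup with scikit-learn, using synthetic stand-in features and labels (the real study used the Spotify-provided audio features such as danceability and energy; nothing here reproduces its data or results):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2063, 9))                    # stand-in for per-song audio features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # stand-in "hit / non-hit" label

# Hold out a test split, fit a Random Forest, evaluate with F1:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
print(round(f1, 2))
```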

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-06
Vineel Pratap; Anuroop Sriram; Paden Tomasello; Awni Hannun; Vitaliy Liptchinsky; Gabriel Synnaeve; Ronan Collobert

We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages and overall simplifying the deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-06
Stylianos Ioannis Mimilakis; Konstantinos Drossos; Gerald Schuller

In this work we present a method for unsupervised learning of audio representations, focused on the task of singing voice separation. We build upon a previously proposed method for learning representations of time-domain music signals with a re-parameterized denoising autoencoder, extending it by using the family of Sinkhorn distances with entropic regularization. We evaluate our method on the freely

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Pyry Pyykkönen; Stylianos I. Mimilakis; Konstantinos Drossos; Tuomas Virtanen

Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Khoa Nguyen; Konstantinos Drossos; Tuomas Virtanen

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Tianyan Zhou; Yong Zhao; Jian Wu

ResNet-based architectures have been widely adopted as speaker embedding extractors in speaker verification systems. Their standard topology and modularized design ease the human effort of hyperparameter tuning. Therefore, width and depth are left as two major dimensions to further improve ResNet's representation power. However, simply increasing width or depth is not efficient. In this paper, we investigate

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-04
Hengguan Huang; Fuzhao Xue; Hao Wang; Ye Wang

Lying at the core of human intelligence, relational thinking is characterized by initially relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge, consequently becoming a recognizable concept or object through coupling and transformation of these percepts. Such mental processes are difficult to model in real-world problems such as in conversational

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-04
Monica Sunkara; Srikanth Ronanki; Kalpit Dixit; Sravan Bodapati; Katrin Kirchhoff

Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-03
Pavel Denisov; Ngoc Thang Vu

Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. Therefore, these components are optimized independently from each other and the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings such as BERT to process acoustic

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-02
Jinhua Liang; Tao Zhang; Guoqing Feng

Model compression and acceleration are attracting increasing attention due to the demands of embedded devices and mobile applications. Research on efficient convolutional neural networks (CNNs) aims at removing feature redundancy by decomposing or optimizing the convolutional calculation. In this work, feature redundancy is assumed to exist among channels in CNN architectures, which provides some

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-03
Thomas Haubner; Andreas Brendel; Mohamed Elminshawi; Walter Kellermann

We present a noise-robust adaptation control strategy for block-online supervised acoustic system identification by exploiting a noise dictionary. The proposed algorithm takes advantage of the pronounced spectral structure which characterizes many types of interfering noise signals. We model the noisy observations by a linear Gaussian Discrete Fourier Transform-domain state space model whose parameters

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-03
Thomas Haubner; Andreas Brendel; Walter Kellermann

In this paper we present a novel algorithm for improved block-online supervised acoustic system identification in adverse noise scenarios by exploiting prior knowledge about the space of Room Impulse Responses (RIRs). The method is based on the assumption that the variability of the unknown RIRs is controlled by only a few physical parameters, describing, e.g., source position movements, and thus is

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-02
Joon Son Chung; Jaesung Huh; Arsha Nagrani; Triantafyllos Afouras; Andrew Zisserman

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Carmine Emanuele Cella; Daniele Ghisi; Vincent Lostanlen; Fabien Lévy; Joshua Fineberg; Yan Maresz

This paper introduces OrchideaSOL, a free dataset of samples of extended instrumental playing techniques, designed to be used as default dataset for the Orchidea framework for target-based computer-aided orchestration. OrchideaSOL is a reduced and modified subset of Studio On Line, or SOL for short, a dataset developed at Ircam between 1996 and 1998. We motivate the reasons behind OrchideaSOL and describe

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Eugene Kharitonov; Morgane Rivière; Gabriel Synnaeve; Lior Wolf; Pierre-Emmanuel Mazaré; Matthijs Douze; Emmanuel Dupoux

Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Nam Kyun Kim; Hong Kook Kim

This report proposes a polyphonic sound event detection (SED) method for the DCASE 2020 Challenge Task 4. The proposed SED method is based on semi-supervised learning to deal with different combinations of training datasets, such as the weakly labeled, unlabeled, and strongly labeled synthetic datasets. In particular, the target label of each audio clip from the weakly labeled or unlabeled dataset

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Chan Teck Kai; Chin Cheng Siong; Li Ye

For the DCASE 2020 Challenge Task 4, this paper proposed a combinative approach using Nonnegative Matrix Factorization (NMF) and a Convolutional Neural Network (CNN). The main idea begins with utilizing NMF to approximate strong labels for the weakly labeled data. Subsequently, based on the approximated strongly labeled data, two different CNNs are trained using a semi-supervised framework where one

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Zhuohao Chen; James Gibson; Ming-Chang Chiu; Qiaohong Hu; Tara K Knight; Daniella Meeker; James A Tulsky; Kathryn I Pollak; Shrikanth Narayanan

Empathy involves understanding other people's situation, perspective, and feelings. In clinical interactions, it helps clinicians establish rapport with a patient and support patient-centered care and decision making. Understanding physician communication through observation of audio-recorded encounters is largely carried out with manual annotation and analysis. However, manual annotation has a prohibitively

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Jordan J. Bird; Diego R. Faria; Anikó Ekárt; Cristiano Premebida; Pedro P. S. Ayrosa

In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-06-30
Keigo Kamo; Yuki Kubo; Norihiro Takamune; Daichi Kitamura; Hiroshi Saruwatari; Yu Takahashi; Kazunobu Kondo

In this paper, we address a statistical model extension of multichannel nonnegative matrix factorization (MNMF) for blind source separation, and we propose a new parameter update algorithm used in the sub-Gaussian model. MNMF employs full-rank spatial covariance matrices and can simulate situations in which the reverberation is strong and the sources are not point sources. In conventional MNMF, spectrograms

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Daichi Kitamura; Kohei Yatabe

Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such highly developed source

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-06-30
Anurag Kumar; Vamsi Krishna Ithapu

An important problem in machine auditory perception is to recognize and detect sound events. In this paper, we propose a sequential self-teaching approach to learning sounds. Our main proposition is that it is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data, and in these situations a single stage of learning is not sufficient. Our proposal is a sequential

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Thomas Dietzen; Marc Moonen; Toon van Waterschoot

Power spectral density (PSD) estimates of various microphone signal components are essential to many speech enhancement procedures. As speech is highly non-stationary, performance improvements may be gained by maintaining time-variations in PSD estimates. In this paper, we propose an instantaneous PSD estimation approach based on generalized principal components. Similarly to other eigenspace-based
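For contrast with the instantaneous estimator proposed here, the conventional way to track time-variations is recursive (exponential) averaging of the per-frame periodogram; the sketch below illustrates that baseline only, not the paper's method.

```python
import numpy as np

def recursive_psd(frames, alpha=0.9):
    """Exponentially-averaged PSD estimate, one value per STFT frame and bin.
    Smaller alpha tracks non-stationarity faster; alpha -> 1 approaches
    a long-term average that smears speech onsets."""
    psd = np.zeros(frames.shape[1])
    out = []
    for x in frames:                          # x: complex STFT coefficients of one frame
        psd = alpha * psd + (1 - alpha) * np.abs(x) ** 2
        out.append(psd.copy())
    return np.array(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64)) + 1j * rng.normal(size=(100, 64))  # toy STFT
P = recursive_psd(X)
print(P.shape)  # (100, 64)
```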

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Hangting Chen; Pengyuan Zhang

With the success of deep learning in speech signal processing, speaker-independent speech separation under the reverberant environment remains challenging. The deep attractor network (DAN) performs speech separation with speaker attractor, but it is conducted in the time-frequency domain, which is not optimal. The recently proposed convolutional time-domain audio separation network (Conv-TasNet) surpasses

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Kyle Bittner; Martine De Cock; Rafael Dowsley

Deep learning in audio signal processing, such as human voice audio signal classification, is a rich application area of machine learning. Legitimate use cases include voice authentication, gunfire detection, and emotion recognition. While there are clear advantages to automated human speech classification, application developers can gain knowledge beyond the professed scope from unprotected audio

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Yuma Koizumi; Daiki Takeuchi; Yasunori Ohishi; Noboru Harada; Kunio Kashino

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub indeterminacy

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Yuma Koizumi; Ryo Masumura; Kyosuke Nishida; Masahiro Yasuda; Shoichiro Saito

One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Nasim Alamdari; Edward Lobarinas; Nasser Kehtarnavaz

Existing prescriptive compression strategies used in hearing aid fitting are designed based on gain averages from a group of users, which are not necessarily optimal for a specific user. Nearly half of hearing aid users prefer settings that differ from the commonly prescribed settings. This paper presents a human-in-the-loop deep reinforcement learning approach that personalizes hearing aid compression

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Bowen Shi; Shane Settle; Karen Livescu

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the segment feature vectors defined using acoustic word embeddings. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-06-30
Maarten Van Segbroeck; Harish Mallidih; Brian King; I-Fan Chen; Gurpreet Chadha; Roland Maas

Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature to the time LSTM layers by modeling time-frequency

Updated: 2020-07-02
Contents have been reproduced by permission of the publishers.
