arXiv - CS - Sound
  • Transformer-XL Based Music Generation with Multiple Sequences of Time-valued Notes
    arXiv.cs.SD Pub Date : 2020-07-11
    Xianchao Wu; Chengyuan Wang; Qinying Lei

    Current state-of-the-art AI-based classical music creation algorithms, such as Music Transformer, are trained on a single sequence of notes with time-shifts. The major drawback of absolute time-interval expression is the difficulty of computing the similarity of notes that share the same note value yet different tempos, within one MIDI file or across MIDI files. In addition, the usage of a single sequence restricts

    Updated: 2020-07-15
  • Learning Frame Level Attention for Environmental Sound Classification
    arXiv.cs.SD Pub Date : 2020-07-12
    Zhichao Zhang; Shugong Xu; Shunqing Zhang; Tianhao Qiao; Shan Cao

    Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. In order to deal with this, we employ a frame-level attention model to focus

    Updated: 2020-07-15
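    A minimal sketch of frame-level attention pooling in the spirit of this entry (an assumed form, not the authors' exact architecture): each frame embedding receives a scalar score, and a softmax over time yields weights that down-weight silent or irrelevant frames before clip-level classification.

```python
import numpy as np

def attention_pool(H, w):
    """Attention-weighted pooling. H: (T, D) frame embeddings,
    w: (D,) learned scoring vector (random here for illustration)."""
    scores = H @ w                    # unnormalized per-frame scores, (T,)
    scores -= scores.max()            # subtract max for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()              # softmax over the time axis
    return alpha @ H                  # (D,) clip-level embedding

H = np.random.randn(100, 128)         # 100 frames of 128-dim features
w = np.random.randn(128)
clip_vec = attention_pool(H, w)       # would feed a classifier head
```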
  • A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications
    arXiv.cs.SD Pub Date : 2020-07-14
    Prateek Verma; Alessandro Ilic Mezza; Chris Chafe; Cristina Rottondi

    Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications: it aims at revolutionizing the traditional concept of musical interaction by enabling remote musicians to interact and perform together through a telecommunication network. Ensuring realistic conditions for music performance, however, constitutes a significant engineering challenge due to extremely

    Updated: 2020-07-15
  • Sudo rm -rf: Efficient Networks for Universal Audio Source Separation
    arXiv.cs.SD Pub Date : 2020-07-14
    Efthymios Tzinis; Zhepei Wang; Paris Smaragdis

    In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality

    Updated: 2020-07-15
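    A rough PyTorch sketch of the successive downsampling-and-resampling idea as read from this abstract (not the authors' code; all sizes are illustrative): features are extracted at progressively coarser time resolutions with one-dimensional convolutions, resampled back to the input rate, and aggregated by summation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResBlock(nn.Module):
    """Multi-resolution 1-D conv block with summed aggregation."""
    def __init__(self, channels, depth=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=5, padding=2)
             for _ in range(depth)])

    def forward(self, x):                       # x: (batch, channels, time)
        out, feats = torch.zeros_like(x), x
        for i, conv in enumerate(self.convs):
            if i > 0:
                feats = F.avg_pool1d(feats, 2)  # successive downsampling
            feats = F.relu(conv(feats))
            # resample back to the input resolution and aggregate
            out = out + F.interpolate(feats, size=x.shape[-1], mode='nearest')
        return out

x = torch.randn(2, 64, 16000)                   # two 1-second, 64-channel inputs
print(MultiResBlock(64)(x).shape)               # torch.Size([2, 64, 16000])
```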
  • The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems
    arXiv.cs.SD Pub Date : 2020-07-13
    Hadi Abdullah; Kevin Warren; Vincent Bindschaedler; Nicolas Papernot; Patrick Traynor

    Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy in neural networks. Like other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable

    Updated: 2020-07-15
  • OtoWorld: Towards Learning to Separate by Learning to Move
    arXiv.cs.SD Pub Date : 2020-07-12
    Omkar Ranadive; Grant Gasser; David Terpay; Prem Seetharaman

    We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics

    Updated: 2020-07-14
  • Drum Beats and Where To Find Them: Sampling Drum Patterns from a Latent Space
    arXiv.cs.SD Pub Date : 2020-07-13
    Alexey Tikhonov; Ivan P. Yamshchikov

    This paper presents a large dataset of drum patterns and compares two different architectures of artificial neural networks that produce latent explorable spaces with some recognizable genre areas. Adversarially constrained autoencoder interpolations (ACAI) show better results in comparison with a standard variational autoencoder. To our knowledge, this is the first application of ACAI to drum-pattern

    Updated: 2020-07-14
  • Fine-grained Language Identification with Multilingual CapsNet Model
    arXiv.cs.SD Pub Date : 2020-07-12
    Mudit Verma; Arun Balaji Buduru

    Due to a drastic improvement in the quality of internet services worldwide, there is an explosion of multilingual content generation and consumption. This is especially prevalent in countries with a large multilingual audience, who are increasingly consuming media outside their linguistic familiarity/preference. Hence, there is an increasing need for real-time and fine-grained content analysis services

    Updated: 2020-07-14
  • Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals
    arXiv.cs.SD Pub Date : 2020-07-12
    Tomi Kinnunen; Héctor Delgado; Nicholas Evans; Kong Aik Lee; Ville Vestman; Andreas Nautsch; Massimiliano Todisco; Xin Wang; Md Sahidullah; Junichi Yamagishi; Douglas A. Reynolds

    Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV and its use

    Updated: 2020-07-14
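    For context, the EER the abstract criticizes is computed as below (standard definition; this is not the tandem metric the paper develops): the operating point where the false-acceptance and false-rejection rates of a score-based detector coincide.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """EER of a detector that accepts trials whose score exceeds a threshold."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)               # sweep the threshold upwards
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()   # targets rejected so far
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets accepted
    i = np.argmin(np.abs(far - frr))         # point where the two rates cross
    return (far[i] + frr[i]) / 2

tgt = np.random.normal(2.0, 1.0, 1000)        # toy genuine-trial scores
non = np.random.normal(0.0, 1.0, 1000)        # toy impostor/spoof scores
print(f"EER ~ {compute_eer(tgt, non):.3f}")
```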
  • The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results
    arXiv.cs.SD Pub Date : 2020-07-12
    Xian Shi; Qiangze Feng; Lei Xie

    Code-switching (CS) is a common phenomenon and recognizing CS speech is challenging. But CS speech data is scarce and there's no common testbed in relevant research. This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve the ASR performance in the Mandarin-English code-switching situation. 500 hours Mandarin

    Updated: 2020-07-14
  • Do We Need Sound for Sound Source Localization?
    arXiv.cs.SD Pub Date : 2020-07-11
    Takashi Oya; Shohei Iwase; Ryota Natsume; Takahiro Itazuri; Shugo Yamaguchi; Shigeo Morishima

    In sound source localization that uses both visual and aural information, it presently remains unclear how much the image and sound modalities each contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two

    Updated: 2020-07-14
  • Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network
    arXiv.cs.SD Pub Date : 2020-07-11
    Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

    In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However

    Updated: 2020-07-14
  • Overcoming label noise in audio event detection using sequential labeling
    arXiv.cs.SD Pub Date : 2020-07-10
    Jae-Bin Kim; Seongkyu Mun; Myungwoo Oh; Soyeon Choe; Yong-Hyeok Lee; Hyung-Min Park

    This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on the subjectivity of each annotator, and their label noise is inevitable.

    Updated: 2020-07-13
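    A small illustration of the refinement this abstract describes, under my reading: strong labels (onset, offset, event) are reduced to a sequential label that keeps event order but discards the noisy timestamps.

```python
def to_sequential(strong_labels):
    """strong_labels: list of (onset, offset, event) tuples."""
    events = [event for _, _, event in sorted(strong_labels)]  # order by onset
    seq = []
    for event in events:
        if not seq or seq[-1] != event:       # collapse immediate repeats
            seq.append(event)
    return seq

strong = [(0.4, 1.2, "dog_bark"), (1.1, 2.0, "dog_bark"), (2.5, 3.0, "siren")]
print(to_sequential(strong))                  # ['dog_bark', 'siren']
```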
  • Conditioned Time-Dilated Convolutions for Sound Event Detection
    arXiv.cs.SD Pub Date : 2020-07-10
    Konstantinos Drossos; Stylianos I. Mimilakis; Tuomas Virtanen

    Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent convolutional-neural-network-based SED method proposed the usage of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED, with a considerably small amount of parameters. In this work we propose the expansion

    Updated: 2020-07-13
  • Towards accurate simulations of individual speech recognition benefits with real hearing aids with FADE
    arXiv.cs.SD Pub Date : 2020-07-10
    David Hülsmeier; Marc René Schädler; Birger Kollmeier

    Developing and selecting hearing aids is a time-consuming process which could be simplified by using objective models. The framework for auditory discrimination experiments (FADE) accurately simulated the benefit of hearing aid algorithms. One simulation with FADE requires several hours of (un)processed signals, which is obstructive when the signals have to be recorded. We propose and evaluate a real-time

    Updated: 2020-07-13
  • Gated Recurrent Context: Softmax-free attention for Online Encoder-Decoder Speech Recognition
    arXiv.cs.SD Pub Date : 2020-07-10
    Hyeonseung Lee; Woo Hyun Kang; Sung Jun Cheon; Hyeongju Kim; Nam Soo Kim

    Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention

    Updated: 2020-07-13
  • RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis
    arXiv.cs.SD Pub Date : 2020-07-09
    Yuki Okamoto; Keisuke Imoto; Shinnosuke Takamichi; Ryosuke Yamanishi; Takahiro Fukumori; Yoichi Yamashita

    Environmental sound synthesis is a technique for generating natural environmental sounds. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, the pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for explaining the features of sounds. We

    Updated: 2020-07-10
  • Multi-task Regularization Based on Infrequent Classes for Audio Captioning
    arXiv.cs.SD Pub Date : 2020-07-09
    Emre Çakır; Konstantinos Drossos; Tuomas Virtanen

    Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions:

    Updated: 2020-07-10
  • Information, communication and music: Recognition of musical dissonance and consonance in a simple reservoir computing system
    arXiv.cs.SD Pub Date : 2020-07-08
    Dawid Przyczyna; Maria Szaciłowska; Marek Przybylski; Marcin Strzelecki; Konrad Szaciłowski

    Reservoir computing is an emerging, but very successful approach towards processing and classification of various signals. It can be described as a model of transient computation, where the influence of the input changes the internal dynamics of a chosen computational reservoir. The trajectory of these changes represents the computation performed by the system. The selection of a suitable computational substrate capable

    Updated: 2020-07-10
  • Capturing scattered discriminative information using a deep architecture in acoustic scene classification
    arXiv.cs.SD Pub Date : 2020-07-09
    Hye-jin Shim; Jee-weon Jung; Ju-ho Kim; Ha-jin Yu

    Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed using conventional non-linear activations (e.g. ReLU). Furthermore, making design choices to emphasize

    Updated: 2020-07-10
  • DeepSinger: Singing Voice Synthesis with Data Mined From the Web
    arXiv.cs.SD Pub Date : 2020-07-09
    Yi Ren; Xu Tan; Tao Qin; Jian Luan; Zhou Zhao; Tie-Yan Liu

    In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing

    Updated: 2020-07-10
  • Improving Sound Event Detection In Domestic Environments Using Sound Separation
    arXiv.cs.SD Pub Date : 2020-07-08
    Nicolas Turpault (MULTISPEECH); Scott Wisdom (MULTISPEECH); Hakan Erdogan (MULTISPEECH); John Hershey (MULTISPEECH); Romain Serizel (MULTISPEECH); Eduardo Fonseca (MTG); Prem Seetharaman; Justin Salamon

    Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on the

    Updated: 2020-07-09
  • Training Sound Event Detection On A Heterogeneous Dataset
    arXiv.cs.SD Pub Date : 2020-07-08
    Nicolas Turpault (MULTISPEECH); Romain Serizel (MULTISPEECH)

    Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task

    Updated: 2020-07-09
  • Acoustic Scene Classification with Spectrogram Processing Strategies
    arXiv.cs.SD Pub Date : 2020-07-06
    Helin Wang; Yuexian Zou; Dading Chong

    Recently, convolutional neural networks (CNN) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing

    Updated: 2020-07-09
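    The two-dimensional spectrogram frontend the abstract refers to is typically a log-mel spectrogram; a minimal librosa sketch (parameter values are illustrative, not the paper's):

```python
import numpy as np
import librosa

sr = 22050
wav = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)             # (128, frames)
# log_mel is the 2-D image-like representation fed to the CNN
```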
  • Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
    arXiv.cs.SD Pub Date : 2020-07-08
    Abhinav Shukla; Stavros Petridis; Maja Pantic

    The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw

    Updated: 2020-07-09
  • Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
    arXiv.cs.SD Pub Date : 2020-07-08
    Surabhi Punjabi; Harish Arsikere; Zeynab Raeesy; Chander Chandak; Nikhil Bhave; Ankish Bansal; Markus Müller; Sergio Murillo; Ariya Rastrow; Sri Garimella; Roland Maas; Mat Hans; Athanasios Mouchtaris; Siegfried Kunzmann

    Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing

    Updated: 2020-07-09
  • Multi-Resolution Beta-Divergence NMF for Blind Spectral Unmixing
    arXiv.cs.SD Pub Date : 2020-07-08
    Valentin Leplat; Nicolas Gillis; Cédric Févotte

    Blind spectral unmixing is the problem of decomposing the spectrum of a mixed signal or image into a collection of source spectra and their corresponding activations indicating the proportion of each source present in the mixed spectrum. To perform this task, nonnegative matrix factorization (NMF) based on the $\beta$-divergence, referred to as $\beta$-NMF, is a standard and state-of-the-art technique

    Updated: 2020-07-09
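    As background, plain $\beta$-NMF with the standard multiplicative updates looks as follows (a textbook sketch for intuition; the paper's multi-resolution extension is not reproduced here):

```python
import numpy as np

def beta_nmf(V, rank, beta=1.0, n_iter=200, eps=1e-10):
    """Factorize nonnegative V ~= W @ H under the beta-divergence."""
    n_rows, n_cols = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V * WH ** (beta - 2)) @ H.T) / (WH ** (beta - 1) @ H.T)
        WH = W @ H + eps
        H *= (W.T @ (V * WH ** (beta - 2))) / (W.T @ WH ** (beta - 1))
    return W, H

V = np.abs(np.random.randn(257, 400))   # toy magnitude spectrogram
W, H = beta_nmf(V, rank=8, beta=1.0)    # beta=1 corresponds to KL divergence
```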
  • Surveying Off-Board and Extra-Vehicular Monitoring and Progress Towards Pervasive Diagnostics
    arXiv.cs.SD Pub Date : 2020-07-01
    Joshua E. Siegel; Umberto Coda

    We survey the state-of-the-art in offboard diagnostics for vehicles, their occupants, and environments, with particular focus on vibroacoustic approaches. We identify promising application areas including data-driven management for shared mobility and automated fleets, usage-based insurance, and vehicle, occupant, and environmental state and condition monitoring. We close by exploring the particular

    Updated: 2020-07-09
  • X-vectors: New Quantitative Biomarkers for Early Parkinson's Disease Detection from Speech
    arXiv.cs.SD Pub Date : 2020-07-07
    Laetitia Jeancolas; Dijana Petrovska-Delacrétaz; Graziella Mangone; Badr-Eddine Benkelfat; Jean-Christophe Corvol; Marie Vidailhet; Stéphane Lehéricy; Habib Benali

    Many articles have used voice analysis to detect Parkinson's disease (PD), but few have focused on the early stages of the disease and the gender effect. In this article, we have adapted the latest speaker recognition system, called x-vectors, in order to detect an early stage of PD from voice analysis. X-vectors are embeddings extracted from a deep neural network, which provide robust speaker representations

    Updated: 2020-07-08
  • Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network
    arXiv.cs.SD Pub Date : 2020-07-07
    Zihan Pan; Malu Zhang; Jibin Wu; Haizhou Li

    Inspired by the mammalian auditory localization pathway, in this paper we propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments, and implement this algorithm in a real-time robotic system with a microphone array. The key to this model is the MTPC scheme, which encodes the interaural time difference (ITD) cues into

    Updated: 2020-07-08
  • Predicting Afrobeats Hit Songs Using Spotify Data
    arXiv.cs.SD Pub Date : 2020-07-07
    Adewale Adeagbo

    This study approached the Hit Song Science problem with the aim of predicting which songs in the Afrobeats genre will become popular among Spotify listeners. A dataset of 2063 songs was generated through the Spotify Web API, with the provided audio features. Random Forest and Gradient Boosting algorithms proved to be successful, with F1 scores of approximately 86%.

    Updated: 2020-07-08
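    A hedged sketch of the modelling step with scikit-learn. The feature names mirror Spotify's audio features, while the CSV path and the `is_hit` column are hypothetical stand-ins for the paper's dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("afrobeats_tracks.csv")       # hypothetical dataset file
features = ["danceability", "energy", "tempo", "valence",
            "speechiness", "acousticness", "loudness"]
X_tr, X_te, y_tr, y_te = train_test_split(
    df[features], df["is_hit"], test_size=0.2, random_state=42)

for model in (RandomForestClassifier(), GradientBoostingClassifier()):
    model.fit(X_tr, y_tr)                       # fit on the training split
    print(type(model).__name__, f1_score(y_te, model.predict(X_te)))
```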
  • Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
    arXiv.cs.SD Pub Date : 2020-07-06
    Vineel Pratap; Anuroop Sriram; Paden Tomasello; Awni Hannun; Vitaliy Liptchinsky; Gabriel Synnaeve; Ronan Collobert

    We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants

    Updated: 2020-07-08
  • Revisiting Representation Learning for Singing Voice Separation with Sinkhorn Distances
    arXiv.cs.SD Pub Date : 2020-07-06
    Stylianos Ioannis Mimilakis; Konstantinos Drossos; Gerald Schuller

    In this work we present a method for unsupervised learning of audio representations, focused on the task of singing voice separation. We build upon a previously proposed method for learning representations of time-domain music signals with a re-parameterized denoising autoencoder, extending it by using the family of Sinkhorn distances with entropic regularization. We evaluate our method on the freely

    Updated: 2020-07-07
  • Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation
    arXiv.cs.SD Pub Date : 2020-07-06
    Pyry Pyykkönen; Stylianos I. Mimilakis; Konstantinos Drossos; Tuomas Virtanen

    Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music

    Updated: 2020-07-07
  • Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
    arXiv.cs.SD Pub Date : 2020-07-06
    Khoa Nguyen; Konstantinos Drossos; Tuomas Virtanen

    Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal

    Updated: 2020-07-07
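    One simple way to realize temporal sub-sampling between encoder layers (an assumed form, not necessarily the paper's): max-pool the feature sequence over time so the decoder attends over far fewer steps than the audio has frames.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(4, 1024, 128)      # (batch, frames, feature_dim)
# pool along the time axis with stride 2, halving the sequence length
pooled = F.max_pool1d(feats.transpose(1, 2), kernel_size=2).transpose(1, 2)
print(pooled.shape)                    # torch.Size([4, 512, 128])
```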
  • ResNeXt and Res2Net Structure for Speaker Verification
    arXiv.cs.SD Pub Date : 2020-07-06
    Tianyan Zhou; Yong Zhao; Jian Wu

    ResNet-based architecture has been widely adopted as the speaker embedding extractor in speaker verification systems. Its standard topology and modularized design ease the human effort of hyperparameter tuning. Therefore, width and depth are left as two major dimensions to further improve ResNet's representation power. However, simply increasing width or depth is not efficient. In this paper, we investigate

    Updated: 2020-07-07
  • Deep Graph Random Process for Relational-Thinking-Based Speech Recognition
    arXiv.cs.SD Pub Date : 2020-07-04
    Hengguan Huang; Fuzhao Xue; Hao Wang; Ye Wang

    Lying at the core of human intelligence, relational thinking is characterized by initially relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge, consequently becoming a recognizable concept or object through coupling and transformation of these percepts. Such mental processes are difficult to model in real-world problems such as in conversational

    Updated: 2020-07-07
  • Robust Prediction of Punctuation and Truecasing for Medical ASR
    arXiv.cs.SD Pub Date : 2020-07-04
    Monica Sunkara; Srikanth Ronanki; Kalpit Dixit; Sravan Bodapati; Katrin Kirchhoff

    Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or

    Updated: 2020-07-07
  • Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
    arXiv.cs.SD Pub Date : 2020-07-03
    Pavel Denisov; Ngoc Thang Vu

    Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. Therefore, these components are optimized independently from each other and the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings such as BERT to process acoustic

    Updated: 2020-07-06
  • Channel Compression: Rethinking Information Redundancy among Channels in CNN Architecture
    arXiv.cs.SD Pub Date : 2020-07-02
    Jinhua Liang; Tao Zhang; Guoqing Feng

    Model compression and acceleration are attracting increasing attention due to the demand for embedded devices and mobile applications. Research on efficient convolutional neural networks (CNNs) aims at removing feature redundancy by decomposing or optimizing the convolutional calculation. In this work, feature redundancy is assumed to exist among channels in CNN architectures, which provides some

    Updated: 2020-07-06
  • Noise-Robust Adaptation Control for Supervised System Identification Exploiting A Noise Dictionary
    arXiv.cs.SD Pub Date : 2020-07-03
    Thomas Haubner; Andreas Brendel; Mohamed Elminshawi; Walter Kellermann

    We present a noise-robust adaptation control strategy for block-online supervised acoustic system identification by exploiting a noise dictionary. The proposed algorithm takes advantage of the pronounced spectral structure which characterizes many types of interfering noise signals. We model the noisy observations by a linear Gaussian Discrete Fourier Transform-domain state space model whose parameters

    Updated: 2020-07-06
  • Online Supervised Acoustic System Identification exploiting Prelearned Local Affine Subspace Models
    arXiv.cs.SD Pub Date : 2020-07-03
    Thomas Haubner; Andreas Brendel; Walter Kellermann

    In this paper we present a novel algorithm for improved block-online supervised acoustic system identification in adverse noise scenarios by exploiting prior knowledge about the space of Room Impulse Responses (RIRs). The method is based on the assumption that the variability of the unknown RIRs is controlled by only a few physical parameters, describing, e.g., source position movements, and thus is

    Updated: 2020-07-06
  • Spot the conversation: speaker diarisation in the wild
    arXiv.cs.SD Pub Date : 2020-07-02
    Joon Son Chung; Jaesung Huh; Arsha Nagrani; Triantafyllos Afouras; Andrew Zisserman

    The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation

    Updated: 2020-07-03
  • OrchideaSOL: a dataset of extended instrumental techniques for computer-aided orchestration
    arXiv.cs.SD Pub Date : 2020-07-01
    Carmine Emanuele Cella; Daniele Ghisi; Vincent Lostanlen; Fabien Lévy; Joshua Fineberg; Yan Maresz

    This paper introduces OrchideaSOL, a free dataset of samples of extended instrumental playing techniques, designed to be used as default dataset for the Orchidea framework for target-based computer-aided orchestration. OrchideaSOL is a reduced and modified subset of Studio On Line, or SOL for short, a dataset developed at Ircam between 1996 and 1998. We motivate the reasons behind OrchideaSOL and describe

    Updated: 2020-07-03
  • Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
    arXiv.cs.SD Pub Date : 2020-07-02
    Eugene Kharitonov; Morgane Rivière; Gabriel Synnaeve; Lior Wolf; Pierre-Emmanuel Mazaré; Matthijs Douze; Emmanuel Dupoux

    Contrastive Predictive Coding (CPC), which is based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally

    Updated: 2020-07-03
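    A generic time-domain augmentation in the spirit of the abstract (this is not the WavAugment API): perturb the raw waveform with additive noise at a chosen SNR plus a random gain.

```python
import numpy as np

def augment(wav, rng, snr_db=15.0):
    """Additive-noise + gain augmentation applied directly to the waveform."""
    noise = rng.standard_normal(wav.shape)
    # scale the noise so signal power / noise power matches snr_db
    scale = np.sqrt((wav ** 2).mean()
                    / (10 ** (snr_db / 10) * (noise ** 2).mean() + 1e-12))
    gain = rng.uniform(0.8, 1.2)        # mild random gain jitter
    return gain * (wav + scale * noise)

rng = np.random.default_rng(0)
wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of A4 at 16 kHz
aug = augment(wav, rng)
```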
  • Polyphonic sound event detection based on convolutional recurrent neural networks with semi-supervised loss function for DCASE challenge 2020 task 4
    arXiv.cs.SD Pub Date : 2020-07-02
    Nam Kyun Kim; Hong Kook Kim

    This report proposes a polyphonic sound event detection (SED) method for the DCASE 2020 Challenge Task 4. The proposed SED method is based on semi-supervised learning to deal with different combinations of training datasets, such as a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. In particular, the target label of each audio clip from the weakly labeled or unlabeled dataset

    Updated: 2020-07-03
  • Semi-Supervised NMF-CNN For Sound Event Detection
    arXiv.cs.SD Pub Date : 2020-07-02
    Chan Teck Kai; Chin Cheng Siong; Li Ye

    For the DCASE 2020 Challenge Task 4, this paper proposed a combinative approach using Nonnegative Matrix Factorization (NMF) and Convolutional Neural Network (CNN). The main idea begins with utilizing NMF to approximate strong labels for the weakly labeled data. Subsequently, based on the approximated strongly labeled data, two different CNNs are trained using a semi-supervised framework where one

    Updated: 2020-07-03
  • Automated Empathy Detection for Oncology Encounters
    arXiv.cs.SD Pub Date : 2020-07-01
    Zhuohao Chen; James Gibson; Ming-Chang Chiu; Qiaohong Hu; Tara K Knight; Daniella Meeker; James A Tulsky; Kathryn I Pollak; Shrikanth Narayanan

    Empathy involves understanding other people's situation, perspective, and feelings. In clinical interactions, it helps clinicians establish rapport with a patient and support patient-centered care and decision making. Understanding physician communication through observation of audio-recorded encounters is largely carried out with manual annotation and analysis. However, manual annotation has a prohibitively

    Updated: 2020-07-03
  • LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity
    arXiv.cs.SD Pub Date : 2020-07-01
    Jordan J. Bird; Diego R. Faria; Anikó Ekárt; Cristiano Premebida; Pedro P. S. Ayrosa

    In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by

    Updated: 2020-07-03
  • Joint-Diagonalizability-Constrained Multichannel Nonnegative Matrix Factorization Based on Multivariate Complex Sub-Gaussian Distribution
    arXiv.cs.SD Pub Date : 2020-06-30
    Keigo Kamo; Yuki Kubo; Norihiro Takamune; Daichi Kitamura; Hiroshi Saruwatari; Yu Takahashi; Kazunobu Kondo

    In this paper, we address a statistical model extension of multichannel nonnegative matrix factorization (MNMF) for blind source separation, and we propose a new parameter update algorithm used in the sub-Gaussian model. MNMF employs full-rank spatial covariance matrices and can simulate situations in which the reverberation is strong and the sources are not point sources. In conventional MNMF, spectrograms

    Updated: 2020-07-02
  • Consistent Independent Low-Rank Matrix Analysis for Determined Blind Source Separation
    arXiv.cs.SD Pub Date : 2020-07-01
    Daichi Kitamura; Kohei Yatabe

    Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such highly developed source

    Updated: 2020-07-02
  • A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition
    arXiv.cs.SD Pub Date : 2020-06-30
    Anurag Kumar; Vamsi Krishna Ithapu

    An important problem in machine auditory perception is to recognize and detect sound events. In this paper, we propose a sequential self-teaching approach to learning sounds. Our main proposition is that it is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data, and in these situations a single stage of learning is not sufficient. Our proposal is a sequential

    Updated: 2020-07-02
  • Instantaneous PSD Estimation for Speech Enhancement based on Generalized Principal Components
    arXiv.cs.SD Pub Date : 2020-07-01
    Thomas Dietzen; Marc Moonen; Toon van Waterschoot

    Power spectral density (PSD) estimates of various microphone signal components are essential to many speech enhancement procedures. As speech is highly non-stationary, performance improvements may be gained by maintaining time-variations in PSD estimates. In this paper, we propose an instantaneous PSD estimation approach based on generalized principal components. Similarly to other eigenspace-based

    Updated: 2020-07-02
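    The generalized-eigenvalue machinery such eigenspace approaches build on can be sketched in a few lines (toy covariances, not the paper's estimator): jointly diagonalizing a noisy-speech and a noise covariance matrix separates speech-dominated from noise-dominated directions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Ryy = A @ A.T + 4 * np.eye(4)      # toy noisy-speech covariance
B = rng.standard_normal((4, 4))
Rnn = B @ B.T + np.eye(4)          # toy noise covariance

# generalized eigenpairs: Ryy @ V = Rnn @ V @ diag(w)
w, V = eigh(Ryy, Rnn)
# V jointly diagonalizes both matrices: V.T @ Rnn @ V = I and
# V.T @ Ryy @ V = diag(w); large w flags speech-dominated components
print(np.allclose(V.T @ Rnn @ V, np.eye(4)))   # True
```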
  • Exploring the time-domain deep attractor network with two-stream architectures in a reverberant environment
    arXiv.cs.SD Pub Date : 2020-07-01
    Hangting Chen; Pengyuan Zhang

    Despite the success of deep learning in speech signal processing, speaker-independent speech separation in reverberant environments remains challenging. The deep attractor network (DAN) performs speech separation with speaker attractors, but it operates in the time-frequency domain, which is not optimal. The recently proposed convolutional time-domain audio separation network (Conv-TasNet) surpasses

    Updated: 2020-07-02
  • Private Speech Characterization with Secure Multiparty Computation
    arXiv.cs.SD Pub Date : 2020-07-01
    Kyle Bittner; Martine De Cock; Rafael Dowsley

    Deep learning in audio signal processing, such as human voice audio signal classification, is a rich application area of machine learning. Legitimate use cases include voice authentication, gunfire detection, and emotion recognition. While there are clear advantages to automated human speech classification, application developers can gain knowledge beyond the professed scope from unprotected audio

    Updated: 2020-07-02
  • The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
    arXiv.cs.SD Pub Date : 2020-07-01
    Yuma Koizumi; Daiki Takeuchi; Yasunori Ohishi; Noboru Harada; Kunio Kashino

    This technical report describes the system participating to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub indeterminacy

    Updated: 2020-07-02
  • A Transformer-based Audio Captioning Model with Keyword Estimation
    arXiv.cs.SD Pub Date : 2020-07-01
    Yuma Koizumi; Ryo Masumura; Kyosuke Nishida; Masahiro Yasuda; Shoichiro Saito

    One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called

    Updated: 2020-07-02
  • Personalization of Hearing Aid Compression by Human-In-Loop Deep Reinforcement Learning
    arXiv.cs.SD Pub Date : 2020-07-01
    Nasim Alamdari; Edward Lobarinas; Nasser Kehtarnavaz

    Existing prescriptive compression strategies used in hearing aid fitting are designed based on gain averages from a group of users which are not necessarily optimal for a specific user. Nearly half of hearing aid users prefer settings that differ from the commonly prescribed settings. This paper presents a human-in-loop deep reinforcement learning approach that personalizes hearing aid compression

    Updated: 2020-07-02
  • Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
    arXiv.cs.SD Pub Date : 2020-07-01
    Bowen Shi; Shane Settle; Karen Livescu

    Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the segment feature vectors defined using acoustic word embeddings. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which

    Updated: 2020-07-02
  • Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition
    arXiv.cs.SD Pub Date : 2020-06-30
    Maarten Van Segbroeck; Harish Mallidih; Brian King; I-Fan Chen; Gurpreet Chadha; Roland Maas

    Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature to the time LSTM layers by modeling time-frequency

    Updated: 2020-07-02
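    A hedged PyTorch sketch of the FLSTM-then-time-LSTM layout described here (dimensions are illustrative): an LSTM first runs across the frequency axis of each frame, and its outputs feed a unidirectional LSTM over time.

```python
import torch
import torch.nn as nn

class FLSTMFrontend(nn.Module):
    def __init__(self, n_freq=80, f_hidden=32, t_hidden=256):
        super().__init__()
        self.f_lstm = nn.LSTM(1, f_hidden, batch_first=True)   # over frequency
        self.t_lstm = nn.LSTM(n_freq * f_hidden, t_hidden,
                              batch_first=True)                # over time

    def forward(self, x):                  # x: (batch, time, n_freq)
        b, t, f = x.shape
        z = x.reshape(b * t, f, 1)          # each frame: a length-f sequence
        z, _ = self.f_lstm(z)               # (b*t, f, f_hidden)
        z = z.reshape(b, t, -1)             # flatten per-frame FLSTM outputs
        out, _ = self.t_lstm(z)             # (b, t, t_hidden)
        return out

x = torch.randn(2, 100, 80)                 # 100 frames of 80 log-mel bins
print(FLSTMFrontend()(x).shape)             # torch.Size([2, 100, 256])
```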
Contents have been reproduced by permission of the publishers.