• arXiv.cs.SD Pub Date : 2020-07-11
Xianchao Wu; Chengyuan Wang; Qinying Lei

Current state-of-the-art AI-based classical music creation algorithms, such as Music Transformer, are trained on a single sequence of notes with time-shifts. The major drawback of the absolute time-interval expression is the difficulty of computing the similarity of notes that share the same note value yet different tempos, within one MIDI file or across files. In addition, the usage of a single sequence restricts

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-12
Zhichao Zhang; Shugong Xu; Shunqing Zhang; Tianhao Qiao; Shan Cao

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model to focus

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-14
Prateek Verma; Alessandro Ilic Mezza; Chris Chafe; Cristina Rottondi

Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications: it aims at revolutionizing the traditional concept of musical interaction by enabling remote musicians to interact and perform together through a telecommunication network. Ensuring realistic conditions for music performance, however, constitutes a significant engineering challenge due to extremely

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-14
Efthymios Tzinis; Zhepei Wang; Paris Smaragdis

In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-13
Hadi Abdullah; Kevin Warren; Vincent Bindschaedler; Nicolas Papernot; Patrick Traynor

Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy in neural networks. Like other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable

Updated: 2020-07-15
• arXiv.cs.SD Pub Date : 2020-07-12
Omkar Ranadive; Grant Gasser; David Terpay; Prem Seetharaman

We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-13
Alexey Tikhonov; Ivan P. Yamshchikov

This paper presents a large dataset of drum patterns and compares two different architectures of artificial neural networks that produce latent explorable spaces with some recognizable genre areas. Adversarially constrained autoencoder interpolations (ACAI) show better results in comparison with a standard variational autoencoder. To our knowledge, this is the first application of ACAI to drum-pattern

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Mudit Verma; Arun Balaji Buduru

Due to a drastic improvement in the quality of internet services worldwide, there is an explosion of multilingual content generation and consumption. This is especially prevalent in countries with large multilingual audiences, which are increasingly consuming media outside their linguistic familiarity/preference. Hence, there is an increasing need for real-time and fine-grained content analysis services

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Tomi Kinnunen; Héctor Delgado; Nicholas Evans; Kong Aik Lee; Ville Vestman; Andreas Nautsch; Massimiliano Todisco; Xin Wang; Md Sahidullah; Junichi Yamagishi; Douglas A. Reynolds

Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV and its use
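To make concrete what the EER metric measures (and why, as this abstract argues, a single operating point may not reflect application needs), here is a minimal numpy sketch, not code from the paper: the decision threshold is swept until the false-acceptance and false-rejection rates coincide.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Find the threshold where false-acceptance rate (FAR) equals
    false-rejection rate (FRR) and return the rate at that point."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostors
    frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected genuines
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

genuine = np.array([0.9, 0.8, 0.7, 0.6])   # scores for genuine trials
impostor = np.array([0.5, 0.4, 0.3, 0.2])  # scores for spoof/impostor trials
print(equal_error_rate(genuine, impostor))  # 0.0: the two score sets are separable
```

Because the EER collapses the whole score distribution to this single crossover point, it ignores class priors and error costs, which is exactly the gap application-aware metrics try to close.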

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-12
Xian Shi; Qiangze Feng; Lei Xie

Code-switching (CS) is a common phenomenon, and recognizing CS speech is challenging. However, CS speech data are scarce and there is no common testbed in the relevant research. This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve ASR performance in Mandarin-English code-switching situations. 500 hours Mandarin

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-11
Takashi Oya; Shohei Iwase; Ryota Natsume; Takahiro Itazuri; Shugo Yamaguchi; Shigeo Morishima

In sound source localization using both visual and aural information, it remains unclear how much the image and sound modalities each contribute to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-11
Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However
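For context on the stacked dilated convolutions mentioned above, the receptive field of such a stack can be computed directly; this small sketch (an illustration, not the paper's code) shows how a doubling dilation cycle yields an exponentially large receptive field.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal convolutions:
    each layer with dilation d adds (kernel_size - 1) * d reachable samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style fixed dilation cycle 1, 2, 4, ..., 512 with kernel size 2:
print(receptive_field(2, [2 ** i for i in range(10)]))  # 1024
```

A pitch-dependent dilated convolution, as the abstract describes, would instead tie the dilation sizes to the instantaneous pitch period rather than fixing them in advance.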

Updated: 2020-07-14
• arXiv.cs.SD Pub Date : 2020-07-10
Jae-Bin Kim; Seongkyu Mun; Myungwoo Oh; Soyeon Choe; Yong-Hyeok Lee; Hyung-Min Park

This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on subjectivity of each annotator, and their label noise is inevitable.
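The label refinement described here, dropping unreliable timestamps while keeping the order of events, can be pictured with a toy example (hypothetical event names and times, not the paper's data):

```python
# Each strong label: (event_class, onset_sec, offset_sec), timestamps annotated by hand.
strong = [("dog_bark", 2.4, 3.1), ("speech", 0.5, 1.9), ("siren", 4.0, 6.2)]

# Refine to a sequential label: keep only the order of occurrence,
# discarding the subjective, noisy onset/offset times.
sequential = [event for event, onset, offset in sorted(strong, key=lambda x: x[1])]
print(sequential)  # ['speech', 'dog_bark', 'siren']
```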

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
Konstantinos Drossos; Stylianos I. Mimilakis; Tuomas Virtanen

Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent convolutional-neural-network-based SED method proposed the usage of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED with a considerably small number of parameters. In this work we propose the expansion

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
David Hülsmeier; Marc René Schädler; Birger Kollmeier

Developing and selecting hearing aids is a time consuming process which could be simplified by using objective models. The framework for auditory discrimination experiments (FADE) accurately simulated the benefit of hearing aid algorithms. One simulation with FADE requires several hours of (un)processed signals, which is obstructive when the signals have to be recorded. We propose and evaluate a real-time

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-10
Hyeonseung Lee; Woo Hyun Kang; Sung Jun Cheon; Hyeongju Kim; Nam Soo Kim

Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention
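To make the latency problem concrete, here is a minimal numpy sketch of the global softmax attention that online schemes replace (an illustration, not the paper's model): every encoder frame receives a nonzero weight, so the full input must be available before decoding can start.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Global scaled-dot-product softmax attention over all encoder frames."""
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)       # every frame gets a strictly positive weight
    return weights @ values, weights

rng = np.random.default_rng(0)
T, d = 50, 8                        # 50 encoder frames, feature dimension 8
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
context, w = attend(rng.normal(size=d), keys, values)
print(w.shape, bool(np.isclose(w.sum(), 1.0)))  # (50,) True
```

Online attention schemes restrict this support (e.g., to a monotonically advancing window) so the decoder can emit outputs while audio is still arriving.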

Updated: 2020-07-13
• arXiv.cs.SD Pub Date : 2020-07-09
Yuki Okamoto; Keisuke Imoto; Shinnosuke Takamichi; Ryosuke Yamanishi; Takahiro Fukumori; Yoichi Yamashita

Environmental sound synthesis is a technique for generating natural environmental sounds. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, their pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for describing the features of sounds. We

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Emre Çakır; Konstantinos Drossos; Tuomas Virtanen

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions:

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-08
Dawid Przyczyna; Maria Szaciłowska; Marek Przybylski; Marcin Strzelecki; Konrad Szaciłowski

Reservoir computing is an emerging, but very successful, approach to the processing and classification of various signals. It can be described as a model of transient computation, in which the input perturbs the internal dynamics of a chosen computational reservoir; the trajectory of these changes represents the computation performed by the system. The selection of a suitable computational substrate capable

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Hye-jin Shim; Jee-weon Jung; Ju-ho Kim; Ha-jin Yu

Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed using conventional non-linear activations (e.g. ReLU). Furthermore, making design choices to emphasize

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-09
Yi Ren; Xu Tan; Tao Qin; Jian Luan; Zhou Zhao; Tie-Yan Liu

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing

Updated: 2020-07-10
• arXiv.cs.SD Pub Date : 2020-07-08
Nicolas Turpault (MULTISPEECH); Scott Wisdom (MULTISPEECH); Hakan Erdogan (MULTISPEECH); John Hershey (MULTISPEECH); Romain Serizel (MULTISPEECH); Eduardo Fonseca (MTG); Prem Seetharaman; Justin Salamon

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing for sound event detection. In this paper we start from a sound separation model trained on the

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Nicolas Turpault (MULTISPEECH); Romain Serizel (MULTISPEECH)

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularity is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of DCASE 2020 task

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-06
Helin Wang; Yuexian Zou; Dading Chong

Recently, convolutional neural networks (CNN) have achieved the state-of-the-art performance in acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Abhinav Shukla; Stavros Petridis; Maja Pantic

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Surabhi Punjabi; Harish Arsikere; Zeynab Raeesy; Chander Chandak; Nikhil Bhave; Ankish Bansal; Markus Müller; Sergio Murillo; Ariya Rastrow; Sri Garimella; Roland Maas; Mat Hans; Athanasios Mouchtaris; Siegfried Kunzmann

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-08
Valentin Leplat; Nicolas Gillis; Cédric Févotte

Blind spectral unmixing is the problem of decomposing the spectrum of a mixed signal or image into a collection of source spectra and their corresponding activations indicating the proportion of each source present in the mixed spectrum. To perform this task, nonnegative matrix factorization (NMF) based on the $\beta$-divergence, referred to as $\beta$-NMF, is a standard and state-of-the art technique
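As background on β-NMF, a minimal sketch with the standard multiplicative updates (a generic textbook version, not the authors' algorithm, which modifies β-NMF further):

```python
import numpy as np

def beta_nmf(V, rank, beta=1.0, n_iter=200, seed=0):
    """NMF with multiplicative updates minimizing the beta-divergence D_beta(V || WH).
    beta=2 is Euclidean, beta=1 Kullback-Leibler, beta=0 Itakura-Saito."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, rank)) + 1e-3   # source spectra
    H = rng.random((rank, N)) + 1e-3   # activations
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1))
        WH = W @ H
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T)
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 30))) + 1e-6  # toy mixed spectra
W, H = beta_nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 1.0)  # True: the rank-4 factorization beats the trivial zero baseline
```

The multiplicative form guarantees that W and H stay nonnegative, which is what lets the columns of W be read as source spectra and the rows of H as their activations.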

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-01
Joshua E. Siegel; Umberto Coda

We survey the state-of-the-art in offboard diagnostics for vehicles, their occupants, and environments, with particular focus on vibroacoustic approaches. We identify promising application areas including data-driven management for shared mobility and automated fleets, usage-based insurance, and vehicle, occupant, and environmental state and condition monitoring. We close by exploring the particular

Updated: 2020-07-09
• arXiv.cs.SD Pub Date : 2020-07-07
Laetitia Jeancolas; Dijana Petrovska-Delacrétaz; Graziella Mangone; Badr-Eddine Benkelfat; Jean-Christophe Corvol; Marie Vidailhet; Stéphane Lehéricy; Habib Benali

Many articles have used voice analysis to detect Parkinson's disease (PD), but few have focused on the early stages of the disease and the gender effect. In this article, we have adapted the latest speaker recognition system, called x-vectors, in order to detect an early stage of PD from voice analysis. X-vectors are embeddings extracted from a deep neural network, which provide robust speaker representations

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-07
Zihan Pan; Malu Zhang; Jibin Wu; Haizhou Li

Inspired by the mammalian auditory localization pathway, in this paper we propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments, and implement this algorithm in a real-time robotic system with a microphone array. The key to this model lies in the MTPC scheme, which encodes the interaural time difference (ITD) cues into

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-07

This study approached the Hit Song Science problem with the aim of predicting which songs in the Afrobeats genre will become popular among Spotify listeners. A dataset of 2063 songs was generated through the Spotify Web API, with the provided audio features. Random Forest and Gradient Boosting algorithms proved to be successful, with F1 scores of approximately 86%.
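A hedged sketch of this kind of classification setup with scikit-learn, using synthetic stand-in features and labels (the real study used the Spotify-provided audio features such as danceability and energy; nothing here reproduces its data or results):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2063, 9))                    # stand-in for per-song audio features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # stand-in "hit / non-hit" label

# Hold out a test split, fit a Random Forest, evaluate with F1:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
print(round(f1, 2))
```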

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-06
Vineel Pratap; Anuroop Sriram; Paden Tomasello; Awni Hannun; Vitaliy Liptchinsky; Gabriel Synnaeve; Ronan Collobert

We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages and overall simplifying the deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data by language (from 100 hours to 1100 hours). We compare three variants

Updated: 2020-07-08
• arXiv.cs.SD Pub Date : 2020-07-06
Stylianos Ioannis Mimilakis; Konstantinos Drossos; Gerald Schuller

In this work we present a method for unsupervised learning of audio representations, focused on the task of singing voice separation. We build upon a previously proposed method for learning representations of time-domain music signals with a re-parameterized denoising autoencoder, extending it by using the family of Sinkhorn distances with entropic regularization. We evaluate our method on the freely

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Pyry Pyykkönen; Stylianos I. Mimilakis; Konstantinos Drossos; Tuomas Virtanen

Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Khoa Nguyen; Konstantinos Drossos; Tuomas Virtanen

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-06
Tianyan Zhou; Yong Zhao; Jian Wu

ResNet-based architectures have been widely adopted as speaker embedding extractors in speaker verification systems. Their standard topology and modularized design ease the human effort of hyperparameter tuning. Therefore, width and depth are left as two major dimensions to further improve ResNet's representation power. However, simply increasing width or depth is not efficient. In this paper, we investigate

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-04
Hengguan Huang; Fuzhao Xue; Hao Wang; Ye Wang

Lying at the core of human intelligence, relational thinking is characterized by initially relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge, consequently becoming a recognizable concept or object through coupling and transformation of these percepts. Such mental processes are difficult to model in real-world problems such as in conversational

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-04
Monica Sunkara; Srikanth Ronanki; Kalpit Dixit; Sravan Bodapati; Katrin Kirchhoff

Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or

Updated: 2020-07-07
• arXiv.cs.SD Pub Date : 2020-07-03
Pavel Denisov; Ngoc Thang Vu

Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. Therefore, these components are optimized independently from each other and the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings such as BERT to process acoustic

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-02
Jinhua Liang; Tao Zhang; Guoqing Feng

Model compression and acceleration are attracting increasing attention due to the demands of embedded devices and mobile applications. Research on efficient convolutional neural networks (CNNs) aims at removing feature redundancy by decomposing or optimizing the convolutional calculation. In this work, feature redundancy is assumed to exist among channels in CNN architectures, which provides some

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-03
Thomas Haubner; Andreas Brendel; Mohamed Elminshawi; Walter Kellermann

We present a noise-robust adaptation control strategy for block-online supervised acoustic system identification by exploiting a noise dictionary. The proposed algorithm takes advantage of the pronounced spectral structure which characterizes many types of interfering noise signals. We model the noisy observations by a linear Gaussian Discrete Fourier Transform-domain state space model whose parameters

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-03
Thomas Haubner; Andreas Brendel; Walter Kellermann

In this paper we present a novel algorithm for improved block-online supervised acoustic system identification in adverse noise scenarios by exploiting prior knowledge about the space of Room Impulse Responses (RIRs). The method is based on the assumption that the variability of the unknown RIRs is controlled by only a few physical parameters, describing, e.g., source position movements, and thus is

Updated: 2020-07-06
• arXiv.cs.SD Pub Date : 2020-07-02
Joon Son Chung; Jaesung Huh; Arsha Nagrani; Triantafyllos Afouras; Andrew Zisserman

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Carmine Emanuele Cella; Daniele Ghisi; Vincent Lostanlen; Fabien Lévy; Joshua Fineberg; Yan Maresz

This paper introduces OrchideaSOL, a free dataset of samples of extended instrumental playing techniques, designed to be used as default dataset for the Orchidea framework for target-based computer-aided orchestration. OrchideaSOL is a reduced and modified subset of Studio On Line, or SOL for short, a dataset developed at Ircam between 1996 and 1998. We motivate the reasons behind OrchideaSOL and describe

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Eugene Kharitonov; Morgane Rivière; Gabriel Synnaeve; Lior Wolf; Pierre-Emmanuel Mazaré; Matthijs Douze; Emmanuel Dupoux

Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Nam Kyun Kim; Hong Kook Kim

This report proposes a polyphonic sound event detection (SED) method for the DCASE 2020 Challenge Task 4. The proposed SED method is based on semi-supervised learning to deal with different combinations of training datasets, such as the weakly labeled, unlabeled, and strongly labeled synthetic datasets. In particular, the target label of each audio clip from the weakly labeled or unlabeled dataset

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-02
Chan Teck Kai; Chin Cheng Siong; Li Ye

For the DCASE 2020 Challenge Task 4, this paper proposed a combinative approach using Nonnegative Matrix Factorization (NMF) and a Convolutional Neural Network (CNN). The main idea begins with utilizing NMF to approximate strong labels for the weakly labeled data. Subsequently, based on the approximated strongly labeled data, two different CNNs are trained using a semi-supervised framework where one

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Zhuohao Chen; James Gibson; Ming-Chang Chiu; Qiaohong Hu; Tara K Knight; Daniella Meeker; James A Tulsky; Kathryn I Pollak; Shrikanth Narayanan

Empathy involves understanding other people's situation, perspective, and feelings. In clinical interactions, it helps clinicians establish rapport with a patient and support patient-centered care and decision making. Understanding physician communication through observation of audio-recorded encounters is largely carried out with manual annotation and analysis. However, manual annotation has a prohibitively

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-07-01
Jordan J. Bird; Diego R. Faria; Anikó Ekárt; Cristiano Premebida; Pedro P. S. Ayrosa

In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by

Updated: 2020-07-03
• arXiv.cs.SD Pub Date : 2020-06-30
Keigo Kamo; Yuki Kubo; Norihiro Takamune; Daichi Kitamura; Hiroshi Saruwatari; Yu Takahashi; Kazunobu Kondo

In this paper, we address a statistical model extension of multichannel nonnegative matrix factorization (MNMF) for blind source separation, and we propose a new parameter update algorithm used in the sub-Gaussian model. MNMF employs full-rank spatial covariance matrices and can simulate situations in which the reverberation is strong and the sources are not point sources. In conventional MNMF, spectrograms

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Daichi Kitamura; Kohei Yatabe

Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such highly developed source

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-06-30
Anurag Kumar; Vamsi Krishna Ithapu

An important problem in machine auditory perception is to recognize and detect sound events. In this paper, we propose a sequential self-teaching approach to learning sounds. Our main proposition is that it is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data, and in these situations a single stage of learning is not sufficient. Our proposal is a sequential

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Thomas Dietzen; Marc Moonen; Toon van Waterschoot

Power spectral density (PSD) estimates of various microphone signal components are essential to many speech enhancement procedures. As speech is highly non-stationary, performance improvements may be gained by maintaining time-variations in PSD estimates. In this paper, we propose an instantaneous PSD estimation approach based on generalized principal components. Similarly to other eigenspace-based
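For contrast with the instantaneous estimator proposed here, the conventional way to track time-variations is recursive (exponential) averaging of the per-frame periodogram; the sketch below illustrates that baseline only, not the paper's method.

```python
import numpy as np

def recursive_psd(frames, alpha=0.9):
    """Exponentially-averaged PSD estimate, one value per STFT frame and bin.
    Smaller alpha tracks non-stationarity faster; alpha -> 1 approaches
    a long-term average that smears speech onsets."""
    psd = np.zeros(frames.shape[1])
    out = []
    for x in frames:                          # x: complex STFT coefficients of one frame
        psd = alpha * psd + (1 - alpha) * np.abs(x) ** 2
        out.append(psd.copy())
    return np.array(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64)) + 1j * rng.normal(size=(100, 64))  # toy STFT
P = recursive_psd(X)
print(P.shape)  # (100, 64)
```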

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Hangting Chen; Pengyuan Zhang

With the success of deep learning in speech signal processing, speaker-independent speech separation under the reverberant environment remains challenging. The deep attractor network (DAN) performs speech separation with speaker attractor, but it is conducted in the time-frequency domain, which is not optimal. The recently proposed convolutional time-domain audio separation network (Conv-TasNet) surpasses

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Kyle Bittner; Martine De Cock; Rafael Dowsley

Deep learning in audio signal processing, such as human voice audio signal classification, is a rich application area of machine learning. Legitimate use cases include voice authentication, gunfire detection, and emotion recognition. While there are clear advantages to automated human speech classification, application developers can gain knowledge beyond the professed scope from unprotected audio

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Yuma Koizumi; Daiki Takeuchi; Yasunori Ohishi; Noboru Harada; Kunio Kashino

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub indeterminacy

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Yuma Koizumi; Ryo Masumura; Kyosuke Nishida; Masahiro Yasuda; Shoichiro Saito

One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Nasim Alamdari; Edward Lobarinas; Nasser Kehtarnavaz

Existing prescriptive compression strategies used in hearing aid fitting are designed based on gain averages from a group of users, which are not necessarily optimal for a specific user. Nearly half of hearing aid users prefer settings that differ from the commonly prescribed settings. This paper presents a human-in-the-loop deep reinforcement learning approach that personalizes hearing aid compression

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-07-01
Bowen Shi; Shane Settle; Karen Livescu

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the segment feature vectors defined using acoustic word embeddings. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which

Updated: 2020-07-02
• arXiv.cs.SD Pub Date : 2020-06-30
Maarten Van Segbroeck; Harish Mallidih; Brian King; I-Fan Chen; Gurpreet Chadha; Roland Maas

Acoustic models in real-time speech recognition systems typically stack multiple unidirectional LSTM layers to process the acoustic frames over time. Performance improvements over vanilla LSTM architectures have been reported by prepending a stack of frequency-LSTM (FLSTM) layers to the time LSTM. These FLSTM layers can learn a more robust input feature to the time LSTM layers by modeling time-frequency

Updated: 2020-07-02
Contents have been reproduced by permission of the publishers.
