• arXiv.cs.SD Pub Date : 2021-01-15
Xinmeng Xu; Dongxiang Xu; Jie Jia; Yang Wang; Binbin Chen

The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to fuse audio and visual information,

Updated: 2021-01-18
• arXiv.cs.SD Pub Date : 2021-01-14
Javi Arango; Alex DeCaprio; Sunwoo Baik; Luca De Nardis; Stefanie Shattuck-Hufnagel; Maria Gabriella Di Benedetto

The purpose of this project was to derive a reliable estimate of the frequency of occurrence of the 30 phonemes - plus consonant geminated counterparts - of the Italian language, based on four selected written texts. Since no comparable dataset was found in previous literature, the present analysis may serve as a reference in future studies. Four textual sources were considered: Come si fa una tesi

Updated: 2021-01-18
• arXiv.cs.SD Pub Date : 2021-01-14
Shengchen Li; Ke Tian; Rui Wang

Heart sound (also known as phonocardiogram (PCG)) analysis is a popular way to detect cardiovascular diseases (CVDs). Most PCG analysis uses a supervised approach, which demands both normal and abnormal samples. This paper proposes a method of unsupervised PCG analysis that uses a beta variational auto-encoder ($\beta$-VAE) to model normal PCG signals. The best-performing model reaches an AUC (Area
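As a rough illustration of the β-VAE objective mentioned above (a generic sketch under the usual diagonal-Gaussian-posterior assumption, not the paper's exact model), the loss weights the KL term by β:

```python
import math

# Sketch of a beta-VAE training objective: reconstruction error plus a
# KL divergence term weighted by beta. Assumes a diagonal-Gaussian
# posterior q(z|x) = N(mu, exp(log_var)) and a standard-normal prior.
def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)
    kl = -0.5 * sum(1 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var)) / len(mu)
    return recon + beta * kl

# A perfect reconstruction with a standard-normal posterior has zero loss.
print(beta_vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0]))  # 0.0
```

For unsupervised anomaly detection, a sample whose loss is far above that of normal training data would be flagged as abnormal.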

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-14
Bastian Schnell; Goeric Huybrechts; Bartek Perz; Thomas Drugman; Jaime Lorenzo-Trueba

Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and make it possible to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-14
Simon Alexanderson; Éva Székely; Gustav Eje Henter; Taras Kucherenko; Jonas Beskow

Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-14
Dan Oneata; Alexandru Caranica; Adriana Stan; Horia Cucu

Quantifying the confidence (or conversely the uncertainty) of a prediction is a highly desirable trait of an automatic system, as it improves the robustness and usefulness in downstream tasks. In this paper we investigate confidence estimation for end-to-end automatic speech recognition (ASR). Previous work has addressed confidence measures for lattice-based ASR, while current machine learning research

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-14
Marc Delcroix; Katerina Zmolikova; Tsubasa Ochiai; Keisuke Kinoshita; Tomohiro Nakatani

Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-14
Akshay Krishna Sheshadri; Anvesh Rao Vijjini; Sukhdeep Kharbanda

Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by comparing the number of errors between the ground truth and the ASR system's transcription. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, Automatic WER Evaluation (e-WER) methods have
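The WER computation described here can be sketched with a standard word-level edit-distance dynamic program (a minimal illustration of the metric itself, not the paper's e-WER estimator):

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed via Levenshtein distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words: 1/3
```

The need for the manually transcribed `reference` argument is exactly the cost that e-WER methods aim to avoid.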

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-13
Qiong Hu; Tobias Bleisch; Petko Petkov; Tuomo Raitio; Erik Marchi; Varun Lakshminarasimhan

It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1)

Updated: 2021-01-15
• arXiv.cs.SD Pub Date : 2021-01-13
Manav Kaushik; Van Tung Pham; Eng Siong Chng

Automatic height and age estimation of speakers using acoustic features is widely used for the purposes of human-computer interaction, forensics, etc. In this work, we propose a novel approach of using an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder which is able to capture long-term
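The general mechanism of combining attention with a sequence encoder can be sketched as attention pooling over per-frame encoder outputs (a hypothetical illustration of the mechanism, not this paper's exact architecture; the frame scores would normally be produced by a learned layer):

```python
import math

# Attention pooling: softmax-normalize one learned score per frame,
# then take the weighted average of the frame feature vectors. The
# result is a fixed-size utterance embedding for the regressor.
def attention_pool(frames, scores):
    m = max(scores)                              # max-shift for stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [v / z for v in w]                       # softmax weights
    dim = len(frames[0])
    return [sum(w[t] * frames[t][d] for t in range(len(frames)))
            for d in range(dim)]

# Equal scores reduce to a plain mean over frames.
print(attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```

A frame with a much higher score dominates the pooled vector, which is how attention lets the model focus on the most informative regions of the utterance.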

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-13
Zhao Ren; Kun Qian; Fengquan Dong; Zhenyu Dai; Yoshiharu Yamamoto; Björn W. Schuller

Cardiovascular diseases are the leading cause of death and severely threaten human health in daily life. On the one hand, there have been dramatically increasing demands from both clinical practice and smart home applications for monitoring the heart status of subjects suffering from chronic cardiovascular diseases. On the other hand, experienced physicians who can perform an efficient auscultation

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-12
Korneel van den Broek

We present a deep convolutional GAN which leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio samples with long-range coherence. The model uses a Modified Discrete Cosine Transform (MDCT) data representation, which includes all phase information. Phase generation is hence an integral part of the model. We leverage the auditory masking and psychoacoustic perception

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-13
Max W. Y. Lam; Jun Wang; Dan Su; Dong Yu

Recent research on time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints in industrial applications. In this regard, we design a low-cost high-performance architecture, namely, the globally attentive locally recurrent (GALR) network. Like the dual-path RNN (DPRNN), we

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-12
Yangyong Zhang; Maliheh Shirvanian; Sunpreet S. Arora; Jianwei Huang; Guofei Gu

Voice-driven services (VDS) are being used in a variety of applications ranging from smart home control to payments using digital assistants. The input to such services is often captured via an open voice channel, e.g., using a microphone, in an unsupervised setting. One of the key operational security requirements in such a setting is the freshness of the input speech. We present AEOLUS, a security

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-13
Paritosh Parmar; Jaiden Reddy; Brendan Morris

Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance, or should we trust our ears over our eyes? Since current CNNs have difficulty processing long videos, how can shorter clips be sampled to best reflect the player's skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal

Updated: 2021-01-14
• arXiv.cs.SD Pub Date : 2021-01-12
Tsubasa Ochiai; Marc Delcroix; Tomohiro Nakatani; Rintaro Ikeshita; Keisuke Kinoshita; Shoko Araki

Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approach

Updated: 2021-01-13
• arXiv.cs.SD Pub Date : 2020-12-26
Ali M. Alagrami; Maged M. Eljazzar

Tajweed is a set of rules for reciting the Quran with correct pronunciation of the letters and all their qualities. This means that, while reading, every letter in the Quran must be given its due characteristics, applied to that particular letter in that specific situation, which may differ at other times. These characteristics include melodic rules, like where to stop and for

Updated: 2021-01-13
• arXiv.cs.SD Pub Date : 2021-01-09
Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

The task of speaker verification (SV) is to decide whether an utterance is spoken by a target or an imposter speaker. In most SV studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features and compared with a threshold for decision making. However, the generative model usually focuses on feature distributions and does not have the discriminative feature
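The score-versus-threshold decision described here can be sketched with toy one-dimensional Gaussian likelihood models (a hedged illustration of the generic LLR decision rule; real SV systems score high-dimensional speaker embeddings):

```python
import math

# Generic LLR decision rule: accept the target hypothesis when
# log p(x|target) - log p(x|imposter) exceeds a threshold theta.
# The unit-variance Gaussian models here are purely illustrative.
def llr_decision(x, target_mean, imposter_mean, sigma=1.0, theta=0.0):
    def log_gauss(v, mu, s):
        return -0.5 * math.log(2 * math.pi * s * s) - (v - mu) ** 2 / (2 * s * s)
    llr = log_gauss(x, target_mean, sigma) - log_gauss(x, imposter_mean, sigma)
    return ("target" if llr > theta else "imposter"), llr

# A score near the target model's mean is accepted; near the
# imposter model's mean it is rejected.
print(llr_decision(0.9, 1.0, 0.0)[0])  # target
print(llr_decision(0.1, 1.0, 0.0)[0])  # imposter
```

Moving `theta` trades false acceptances against false rejections, which is the operating-point choice the threshold controls.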

Updated: 2021-01-12
• arXiv.cs.SD Pub Date : 2021-01-05
Linda Liu; Yile Gu; Aditya Gourav; Ankur Gandhe; Shashank Kalmane; Denis Filimonov; Ariya Rastrow; Ivan Bulyko

As voice assistants become more ubiquitous, they are increasingly expected to support and perform well on a wide variety of use-cases across different domains. We present a domain-aware rescoring framework suitable for achieving domain-adaptation during second-pass rescoring in production settings. In our framework, we fine-tune a domain-general neural language model on several domains, and use an

Updated: 2021-01-12
• arXiv.cs.SD Pub Date : 2020-12-22
Eduardo Reck Miranda

This chapter presents a quantum computing-based approach to study and harness neuronal correlates of mental activity for the development of Brain-Computer Interface (BCI) systems. It introduces the notion of a logic of the mind, where neurophysiological data are encoded as logical expressions representing mental activity. Effective logical expressions are likely to be extensive, involving dozens of

Updated: 2021-01-12
• arXiv.cs.SD Pub Date : 2020-11-11
Goeric Huybrechts; Thomas Merritt; Giulia Comini; Bartek Perz; Raahil Shah; Jaime Lorenzo-Trueba

While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such

Updated: 2021-01-12
• arXiv.cs.SD Pub Date : 2021-01-08
Qing Wang; Jun Du; Hua-Xin Wu; Jia Pan; Feng Ma; Chin-Hui Lee

In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS focus on augmenting the limited training data with expanding

Updated: 2021-01-11
• arXiv.cs.SD Pub Date : 2021-01-08
Ruohan Gao; Kristen Grauman

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional

Updated: 2021-01-11
• arXiv.cs.SD Pub Date : 2021-01-07
Wen-Yi Hsiao; Jen-Yu Liu; Yin-Cheng Yeh; Yi-Hsuan Yang

To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset
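The multi-type token vocabulary described here can be sketched as follows (a hypothetical scheme for illustration; the token names and value ranges are assumptions, not the paper's exact vocabulary):

```python
# One musical note expands into several tokens of different types:
# pitch (MIDI note number), duration (ticks), velocity (dynamics),
# and onset position within the bar. Token names here are illustrative.
def tokenize_note(pitch: int, duration: int, velocity: int, onset: int):
    return [f"Pitch_{pitch}", f"Dur_{duration}", f"Vel_{velocity}", f"Pos_{onset}"]

# Middle C, held for 480 ticks, at velocity 96, on the downbeat:
print(tokenize_note(60, 480, 96, 0))  # ['Pitch_60', 'Dur_480', 'Vel_96', 'Pos_0']
```

A piece then becomes the concatenation of such per-note token groups, and the sequence model must learn both the within-note and between-note structure.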

Updated: 2021-01-08
• arXiv.cs.SD Pub Date : 2021-01-07
Chiang-Jen Peng; Yun-Ju Chan; Cheng Yu; Syu-Siang Wang; Yu Tsao; Tai-Shih Chi

Multi-task learning (MTL) and the attention technique have been proven to effectively extract robust acoustic features for various speech-related applications in noisy environments. In this study, we integrate MTL and the attention-weighting mechanism and propose an attention-based MTL (ATM) approach to realize a multi-model learning structure and to promote the speech enhancement (SE) and speaker

Updated: 2021-01-08
• arXiv.cs.SD Pub Date : 2021-01-06
Furkan Yesiler; Emilio Molina; Joan Serrà; Emilia Gómez

The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events. Due to various musical and non-musical changes in live performances, developing automatic SLI systems is still a challenging task that, despite its industrial relevance, has been under-explored in the academic literature

Updated: 2021-01-07
• arXiv.cs.SD Pub Date : 2021-01-06
Pierre-Amaury Grumiaux; Srdan Kitic; Laurent Girin; Alexandre Guérin

Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least it can be a strong advantage, in addition to enabling a low latency processing. In a previous work

Updated: 2021-01-07
• arXiv.cs.SD Pub Date : 2021-01-06
Chandan K A Reddy; Harishchandra Dubey; Kazuhito Koishida; Arun Nair; Vishak Gopal; Ross Cutler; Sebastian Braun; Hannes Gamper; Robert Aichner; Sriram Srinivasan

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which

Updated: 2021-01-07
• arXiv.cs.SD Pub Date : 2021-01-06
Chunheng Jiang; Jae-wook Ahn; Nirmit Desai

Collecting a sufficient amount of data that can represent various acoustic environmental attributes is a critical problem for distributed acoustic machine learning. Several audio data augmentation techniques have been introduced to address this problem, but they tend to remain simple manipulations of existing data and are insufficient to cover the variability of the environments. We propose a method

Updated: 2021-01-07
• arXiv.cs.SD Pub Date : 2021-01-06
Xuankai Chang; Naoyuki Kanda; Yashesh Gaur; Xiaofei Wang; Zhong Meng; Takuya Yoshioka

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between

Updated: 2021-01-07
• arXiv.cs.SD Pub Date : 2021-01-05
Anugunj Naman; Liliana Mancini

In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform suitably only when the training corpus is vast. This requirement for a big training corpus is a significant problem

Updated: 2021-01-06
• arXiv.cs.SD Pub Date : 2021-01-05
Fu-Shun Hsu; Chao-Jung Huang; Chen-Yi Kuo; Shang-Ran Huang; Yuan-Ren Cheng; Jia-Horng Wang; Yi-Lin Wu; Tzu-Ling Tzeng; Feipei Lai

Respiratory auscultation can help healthcare professionals detect abnormal respiratory conditions if adventitious lung sounds are heard. The state-of-the-art artificial intelligence technologies based on deep learning show great potential in the development of automated respiratory sound analysis. To train a deep learning-based model, a huge number of accurate labels of normal breath sounds and adventitious

Updated: 2021-01-06
• arXiv.cs.SD Pub Date : 2021-01-04
Yong Xu; Zhuohuang Zhang; Meng Yu; Shi-Xiong Zhang; Lianwu Chen; Dong Yu

Recently we proposed an all-deep-learning minimum variance distortionless response (ADL-MVDR) method where the unstable matrix inverse and principal component analysis (PCA) operations in the MVDR were replaced by recurrent neural networks (RNNs). However, it is not clear whether the success of the ADL-MVDR is owed to the calculated covariance matrices or following the MVDR formula. In this work, we

Updated: 2021-01-06
• arXiv.cs.SD Pub Date : 2021-01-04
Thejan Rajapakshe; Rajib Rana; Sara Khalifa

Reinforcement Learning (RL) is a semi-supervised learning paradigm in which an agent learns by interacting with an environment. Deep learning in combination with RL, called Deep Reinforcement Learning (deep RL), provides an efficient method to learn how to interact with the environment. Deep RL has gained tremendous success in gaming - such as AlphaGo - but its potential has rarely been explored for

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2021-01-03
Carlos Lordelo; Emmanouil Benetos; Simon Dixon; Sven Ahlbäck; Patrik Ohlsson

This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-training in one domain with fine-tuning in another. We propose

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2021-01-01
Siddique Latif; Heriberto Cuayáhuitl; Farrukh Pervez; Fahad Shamshad; Hafiz Shehbaz Ali; Erik Cambria

Deep reinforcement learning (DRL) is poised to revolutionise the field of artificial intelligence (AI) by endowing autonomous systems with high levels of understanding of the real world. Currently, deep learning (DL) is enabling DRL to effectively solve various intractable problems in various fields. Most importantly, DRL algorithms are also being employed in audio signal processing to learn directly

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2021-01-01
Daniel Szelogowski

Current AI-generated music lacks fundamental principles of good compositional techniques. By narrowing down implementation issues both programmatically and musically, we can create a better understanding of what parameters are necessary for a generated composition to be nearly indistinguishable from that of a master composer.

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2021-01-02
Jui Shah; Yaman Kumar Singla; Changyou Chen; Rajiv Ratn Shah

In recent times, BERT based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain with a multitude of models observing state-of-the-art results by using audio transformer models to encode speech. This begs the question of what are these audio transformer models learning. Moreover, although the standard

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2021-01-01
Alexander Lerch

Preprint for a book chapter introducing Audio Content Analysis. With a focus on Music Information Retrieval systems, this chapter defines musical audio content, introduces the general process of audio content analysis, and surveys basic approaches to audio content analysis. The various tasks in Audio Content Analysis are categorized into three classes: music transcription, music performance analysis

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2020-12-31
Kai Zhen; Mi Suk Lee; Jongmo Sung; Seungkwon Beack; Minje Kim

Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present

Updated: 2021-01-05
• arXiv.cs.SD Pub Date : 2020-12-31

Keyword spotting is the process by which computers find specific words or phrases in recorded speech. Deep neural network algorithms, as a powerful engine, can handle this problem if they are trained over an appropriate dataset. To this end, the football keyword dataset (FKD), a new keyword spotting dataset in Persian, was collected with crowdsourcing. This dataset contains nearly 31000 samples

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-31
Yang Zhang; Liqun Deng; Yasheng Wang

The front-end module in a typical Mandarin text-to-speech system (TTS) is composed of a long pipeline of text processing components, which requires extensive efforts to build and is prone to large accumulative model size and cascade errors. In this paper, a pre-trained language model (PLM) based model is proposed to simultaneously tackle the two most important tasks in TTS front-end, i.e., prosodic

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-30
Jose A. Gonzalez-Lopez; Miriam Gonzalez-Atienza; Alejandro Gomez-Alanis; Jose L. Perez-Cordoba; Phil D. Green

Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators. This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury. Most successful techniques so far adopt a supervised learning framework, in which time-synchronous articulatory-and-speech recordings

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-29

Machine hearing is an emerging area. Motivated by the need for a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. To this end, a novel and efficient supervised dictionary learning method is presented. Experiments are performed on both computational auditory scene (East Anglia and Rouen) and synthetic music

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-29
Björn W. Schuller; Harry Coppock; Alexander Gaskell

The COVID-19 pandemic has affected the world unevenly; while industrial economies have been able to produce the tests necessary to track the spread of the virus and mostly avoided complete lockdowns, developing countries have faced issues with testing capacity. In this paper, we explore the usage of deep learning models as a ubiquitous, low-cost, pre-testing method for detecting COVID-19 from audio

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-31
Shen Chen; Mingwei Zhang; Jiamin Cui; Wei Yao

Deep learning (DL) has brought about remarkable breakthroughs in processing images, video and speech due to its efficacy in extracting highly abstract representations and learning very complex functions. However, operating procedures for putting it to use in real cases are seldom reported. In this paper, we intend to address this problem by presenting a generalized operating procedure for DL from

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-29
Federico Landini; Ján Profant; Mireia Diez; Lukáš Burget

The recently proposed VBx diarization method uses a Bayesian hidden Markov model to find speaker clusters in a sequence of x-vectors. In this work we perform an extensive comparison of performance of the VBx diarization with other approaches in the literature and we show that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: CALLHOME, AMI and DIHARD II

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-25
Samiya A Alkhairy; Christopher A Shera

We develop an analytic model of the mammalian cochlea. We use a mixed physical-phenomenological approach by utilizing existing work on the physics of classical box-representations of the cochlea, and behavior of recent data-derived wavenumber estimates. Spatial variation is incorporated through a single independent variable that combines space and frequency. We arrive at closed-form expressions for

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-29
Daniel Korzekwa; Roberto Barra-Chicote; Szymon Zaporowski; Grzegorz Beringer; Jaime Lorenzo-Trueba; Alicja Serafinowicz; Jasha Droppo; Thomas Drugman; Bozena Kostek

This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as syllable nucleus. We propose an attention-based deep learning model

Updated: 2021-01-01
• arXiv.cs.SD Pub Date : 2020-12-26
Lam Pham; Huy Phan; Ross King; Alfred Mertins; Ian McLoughlin

This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input. Recordings of respiratory sound collected from patients are firstly transformed into spectrograms where both spectral and temporal information are well presented, referred to as front-end feature extraction. These spectrograms are then fed into the proposed network, referred to as back-end

Updated: 2020-12-29
• arXiv.cs.SD Pub Date : 2020-12-28
Apoorv Vyas; Srikanth Madikeri; Hervé Bourlard

In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data followed by supervised adaptation with LFMMI on three different datasets. Our results show that by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean

Updated: 2020-12-29
• arXiv.cs.SD Pub Date : 2020-12-28
Qinghua Sun; Kenji Nagamatsu

In this paper we propose a new cross-lingual Voice Conversion (VC) approach which can generate all speech parameters (MCEP, LF0, BAP) from one DNN model using PPGs (Phonetic PosteriorGrams) extracted from inputted speech using several ASR acoustic models. Using the proposed VC method, we tried three different approaches to build a multilingual TTS system without recording a multilingual speech corpus

Updated: 2020-12-29
• arXiv.cs.SD Pub Date : 2020-12-26
Dat Ngo; Lam Pham; Anh Nguyen; Ben Phan; Khoa Tran; Truong Nguyen

This paper proposes a robust deep learning framework used for classifying anomalies of respiratory cycles. Initially, our framework starts with a front-end feature extraction step. This step aims to transform the respiratory input sound into a two-dimensional spectrogram where both spectral and temporal features are well presented. Next, an ensemble of C-DNN and Autoencoder networks is then applied to

Updated: 2020-12-29
• arXiv.cs.SD Pub Date : 2020-12-24
Zhuohuang Zhang; Yong Xu; Meng Yu; Shi-Xiong Zhang; Lianwu Chen; Donald S. Williamson; Dong Yu

Many purely neural network based speech separation approaches have been proposed that greatly improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to automatic speech recognition (ASR). Minimum variance distortionless response (MVDR) filters strive to remove nonlinear distortions; however, these approaches either are not optimal for removing residual

Updated: 2020-12-29
• arXiv.cs.SD Pub Date : 2020-12-24
Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as to reduce the distribution discrepancy

Updated: 2020-12-25
• arXiv.cs.SD Pub Date : 2020-12-22
Yuchi Zhang; Willis Peng; Bastian Wandt; Helge Rhodin

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that

Updated: 2020-12-25
• arXiv.cs.SD Pub Date : 2020-12-24
Sundar Shrestha; Anand Koirala; Maksym Spiryagin; Qing Wu

The surface roughness between the wheel and rail has a huge influence on rolling noise level. The presence of the third body such as frost or grease at wheel-rail interface contributes towards change in adhesion coefficient resulting in the generation of noise at various levels. Therefore, it is possible to estimate adhesion conditions between the wheel and rail from the analysis of noise patterns

Updated: 2020-12-25
• arXiv.cs.SD Pub Date : 2020-12-23
Shinji Watanabe; Florian Boyer; Xuankai Chang; Pengcheng Guo; Tomoki Hayashi; Yosuke Higuchi; Takaaki Hori; Wen-Chin Huang; Hirofumi Inaguma; Naoyuki Kamo; Shigeki Karita; Chenda Li; Jing Shi; Aswin Shanmugam Subramanian; Wangyou Zhang

This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text to

Updated: 2020-12-25
• arXiv.cs.SD Pub Date : 2020-12-23
Deblin Bagchi; Shannon Wotherspoon; Zhuolin Jiang; Prasanna Muthukumar

Speech synthesis might hold the key to low-resource speech recognition. Data augmentation techniques have become an essential part of modern speech recognition training. Yet, they are simple, naive, and rarely reflect real-world conditions. Meanwhile, speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech. In this paper, we investigate the possibility

Updated: 2020-12-25
• arXiv.cs.SD Pub Date : 2020-12-23
Takaaki Saeki; Shinnosuke Takamichi; Hiroshi Saruwatari

Text-to-speech (TTS) synthesis, a technique for artificially generating human-like utterances from texts, has dramatically evolved with the advances of end-to-end deep neural network-based methods in recent years. The majority of these methods are sentence-level TTS, which can take into account time-series information in the whole sentence. However, it is necessary to establish incremental TTS, which

Updated: 2020-12-24
Contents have been reproduced by permission of the publishers.
