arXiv - CS - Sound
  • MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
    arXiv.cs.SD Pub Date : 2021-01-15
    Xinmeng Xu; Dongxiang Xu; Jie Jia; Yang Wang; Binbin Chen

    The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to fuse audio and visual information, …

    Updated: 2021-01-18
  • Estimation of the Frequency of Occurrence of Italian Phonemes in Text
    arXiv.cs.SD Pub Date : 2021-01-14
    Javi Arango; Alex DeCaprio; Sunwoo Baik; Luca De Nardis; Stefanie Shattuck-Hufnagel; Maria Gabriella Di Benedetto

    The purpose of this project was to derive a reliable estimate of the frequency of occurrence of the 30 phonemes - plus geminated consonant counterparts - of the Italian language, based on four selected written texts. Since no comparable dataset was found in previous literature, the present analysis may serve as a reference in future studies. Four textual sources were considered: Come si fa una tesi …

    Updated: 2021-01-18
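
    The underlying computation is simple: count phoneme occurrences over a phonemic transcription of each text and normalize by the total. A minimal sketch of that counting step (the toy transcriptions below are hypothetical, not the paper's corpus):

    ```python
    from collections import Counter

    def phoneme_frequencies(transcriptions):
        """Relative frequency of each phoneme across a corpus given as
        lists of phoneme symbols (one list per text)."""
        counts = Counter()
        for phones in transcriptions:
            counts.update(phones)
        total = sum(counts.values())
        return {ph: n / total for ph, n in counts.items()}

    # Hypothetical toy corpus: two Italian words in rough phonemic form.
    corpus = [["k", "a", "z", "a"],           # "casa"
              ["g", "a", "t", "t", "o"]]      # "gatto" (geminate /tt/)
    print(phoneme_frequencies(corpus))
    ```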
  • Unsupervised heart abnormality detection based on phonocardiogram analysis with Beta Variational Auto-Encoders
    arXiv.cs.SD Pub Date : 2021-01-14
    Shengchen Li; Ke Tian; Rui Wang

    Heart sound (also known as phonocardiogram, PCG) analysis is a popular way of detecting cardiovascular diseases (CVDs). Most PCG analysis uses a supervised approach, which demands both normal and abnormal samples. This paper proposes an unsupervised PCG analysis method that uses a beta variational auto-encoder ($\beta$-VAE) to model normal PCG signals. The best-performing model reaches an AUC (Area …

    Updated: 2021-01-15
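
    A β-VAE is the standard variational auto-encoder objective with the KL term scaled by a factor β; trained on normal PCG only, a recording that reconstructs poorly is flagged as abnormal, which is the usual unsupervised recipe. A minimal PyTorch sketch (the architecture below is an illustrative stand-in, not the paper's):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BetaVAE(nn.Module):
        """Minimal beta-VAE over fixed-length 1-D signal frames."""
        def __init__(self, in_dim=256, latent_dim=16):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
            self.mu = nn.Linear(128, latent_dim)
            self.logvar = nn.Linear(128, latent_dim)
            self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterization trick: sample z = mu + sigma * eps.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.dec(z), mu, logvar

    def beta_vae_loss(x, recon, mu, logvar, beta=4.0):
        # Reconstruction error plus beta-weighted KL divergence to N(0, I).
        recon_loss = F.mse_loss(recon, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + beta * kl
    ```

    At test time the loss of an unseen recording serves as the abnormality score; sweeping a threshold over that score is what yields the AUC the abstract reports.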
  • EmoCat: Language-agnostic Emotional Voice Conversion
    arXiv.cs.SD Pub Date : 2021-01-14
    Bastian Schnell; Goeric Huybrechts; Bartek Perz; Thomas Drugman; Jaime Lorenzo-Trueba

    Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data-hungry than text-to-speech models and make it possible to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less …

    Updated: 2021-01-15
  • Generating coherent spontaneous speech and gesture from text
    arXiv.cs.SD Pub Date : 2021-01-14
    Simon Alexanderson; Éva Székely; Gustav Eje Henter; Taras Kucherenko; Jonas Beskow

    Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: on the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted …

    Updated: 2021-01-15
  • An evaluation of word-level confidence estimation for end-to-end automatic speech recognition
    arXiv.cs.SD Pub Date : 2021-01-14
    Dan Oneata; Alexandru Caranica; Adriana Stan; Horia Cucu

    Quantifying the confidence (or, conversely, the uncertainty) of a prediction is a highly desirable trait of an automatic system, as it improves robustness and usefulness in downstream tasks. In this paper we investigate confidence estimation for end-to-end automatic speech recognition (ASR). Previous work has addressed confidence measures for lattice-based ASR, while current machine learning research …

    Updated: 2021-01-15
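
    A common baseline for this task (not necessarily one of the estimators evaluated in the paper) derives a word-level score from the per-token posteriors the end-to-end decoder already emits, e.g. the geometric mean of the token probabilities within each word:

    ```python
    import numpy as np

    def word_confidences(token_logprobs, word_boundaries):
        """Geometric-mean token probability per word.

        token_logprobs: decoder log-probabilities, one per output token.
        word_boundaries: (start, end) token-index pairs, one per word.
        """
        return [float(np.exp(np.mean(token_logprobs[s:e])))
                for s, e in word_boundaries]

    # Hypothetical 5-token hypothesis split into two words.
    lp = np.log(np.array([0.9, 0.8, 0.95, 0.4, 0.5]))
    print(word_confidences(lp, [(0, 3), (3, 5)]))  # high vs. low confidence
    ```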
  • Speaker activity driven neural speech extraction
    arXiv.cs.SD Pub Date : 2021-01-14
    Marc Delcroix; Katerina Zmolikova; Tsubasa Ochiai; Keisuke Kinoshita; Tomohiro Nakatani

    Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural …

    Updated: 2021-01-15
  • WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal Classification Paradigm
    arXiv.cs.SD Pub Date : 2021-01-14
    Akshay Krishna Sheshadri; Anvesh Rao Vijjini; Sukhdeep Kharbanda

    Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is calculated by comparing the number of errors between the ground truth and the ASR system's transcription. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio signals is a costly process, automatic WER evaluation (e-WER) methods have …

    Updated: 2021-01-15
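
    For reference, WER is the word-level Levenshtein (edit) distance between the reference and the hypothesis, normalized by the reference length:

    ```python
    def wer(ref, hyp):
        """(substitutions + deletions + insertions) / reference length."""
        r, h = ref.split(), hyp.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),  # substitution
                              d[i-1][j] + 1,                     # deletion
                              d[i][j-1] + 1)                     # insertion
        return d[len(r)][len(h)] / len(r)

    print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
    ```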
  • Whispered and Lombard Neural Speech Synthesis
    arXiv.cs.SD Pub Date : 2021-01-13
    Qiong Hu; Tobias Bleisch; Petko Petkov; Tuomo Raitio; Erik Marchi; Varun Lakshminarasimhan

    It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whispered speech, using only limited data. The following systems are proposed and assessed: 1) …

    Updated: 2021-01-15
  • End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN
    arXiv.cs.SD Pub Date : 2021-01-13
    Manav Kaushik; Van Tung Pham; Eng Siong Chng

    Automatic height and age estimation of speakers using acoustic features is widely used for purposes such as human-computer interaction and forensics. In this work, we propose a novel approach that uses an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder which is able to capture long-term …

    Updated: 2021-01-14
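
    The described combination, attention pooling over LSTM encoder states feeding a regression head, can be sketched generically (an illustration of the technique, not the authors' exact architecture):

    ```python
    import torch
    import torch.nn as nn

    class AttnLSTMRegressor(nn.Module):
        """LSTM encoder + additive attention pooling + regression head."""
        def __init__(self, feat_dim=40, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)   # one attention score per frame
            self.head = nn.Linear(hidden, 2)   # e.g. [height, age]

        def forward(self, x):                  # x: (batch, time, feat_dim)
            h, _ = self.lstm(x)                # (batch, time, hidden)
            w = torch.softmax(self.attn(h), dim=1)
            pooled = (w * h).sum(dim=1)        # attention-weighted summary
            return self.head(pooled)

    model = AttnLSTMRegressor()
    print(model(torch.randn(4, 300, 40)).shape)  # torch.Size([4, 2])
    ```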
  • Deep Attention-based Representation Learning for Heart Sound Classification
    arXiv.cs.SD Pub Date : 2021-01-13
    Zhao Ren; Kun Qian; Fengquan Dong; Zhenyu Dai; Yoshiharu Yamamoto; Björn W. Schuller

    Cardiovascular diseases are the leading cause of death and severely threaten human health in daily life. On the one hand, there have been dramatically increasing demands from both clinical practice and smart home applications for monitoring the heart status of subjects suffering from chronic cardiovascular diseases. On the other hand, experienced physicians who can perform an efficient auscultation …

    Updated: 2021-01-14
  • MP3net: coherent, minute-long music generation from raw audio with a simple convolutional GAN
    arXiv.cs.SD Pub Date : 2021-01-12
    Korneel van den Broek

    We present a deep convolutional GAN which leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio samples with long-range coherence. The model uses a Modified Discrete Cosine Transform (MDCT) data representation, which includes all phase information. Phase generation is hence an integral part of the model. We leverage the auditory masking and psychoacoustic perception …

    Updated: 2021-01-14
  • Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks
    arXiv.cs.SD Pub Date : 2021-01-13
    Max W. Y. Lam; Jun Wang; Dan Su; Dong Yu

    Recent research on time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints of industrial applications. In this regard, we design a low-cost, high-performance architecture, namely the globally attentive locally recurrent (GALR) network. Like the dual-path RNN (DPRNN), we …

    Updated: 2021-01-14
  • Practical Speech Re-use Prevention in Voice-driven Services
    arXiv.cs.SD Pub Date : 2021-01-12
    Yangyong Zhang; Maliheh Shirvanian; Sunpreet S. Arora; Jianwei Huang; Guofei Gu

    Voice-driven services (VDS) are being used in a variety of applications ranging from smart home control to payments using digital assistants. The input to such services is often captured via an open voice channel, e.g., using a microphone, in an unsupervised setting. One of the key operational security requirements in such a setting is the freshness of the input speech. We present AEOLUS, a security …

    Updated: 2021-01-14
  • Piano Skills Assessment
    arXiv.cs.SD Pub Date : 2021-01-13
    Paritosh Parmar; Jaiden Reddy; Brendan Morris

    Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance, or should we trust our ears over our eyes? Since current CNNs have difficulty processing long videos, how can shorter clips be sampled to best reflect the player's skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal …

    Updated: 2021-01-14
  • Neural Network-based Virtual Microphone Estimator
    arXiv.cs.SD Pub Date : 2021-01-12
    Tsubasa Ochiai; Marc Delcroix; Tomohiro Nakatani; Rintaro Ikeshita; Keisuke Kinoshita; Shoko Araki

    Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approach …

    Updated: 2021-01-13
  • Smartajweed Automatic Recognition of Arabic Quranic Recitation Rules
    arXiv.cs.SD Pub Date : 2020-12-26
    Ali M. Alagrami; Maged M. Eljazzar

    Tajweed is a set of rules for reading the Quran with the correct pronunciation of the letters and all their qualities while reciting. It means giving every letter in the Quran its due characteristics and applying them to that particular letter in the specific situation in which it is read, which may differ at other times. These characteristics include melodic rules, like where to stop and for …

    Updated: 2021-01-13
  • Integrating a joint Bayesian generative model in a discriminative learning framework for speaker verification
    arXiv.cs.SD Pub Date : 2021-01-09
    Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

    The task of speaker verification (SV) is to decide whether an utterance is spoken by a target or an imposter speaker. In most SV studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model of speaker features, and compared with a threshold for decision making. However, the generative model usually focuses on feature distributions and does not have the discriminative feature …

    Updated: 2021-01-12
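
    The decision rule itself is compact: accept the target hypothesis when the log-likelihood ratio of the two generative models exceeds a threshold. A toy sketch with diagonal-Gaussian speaker models (illustrative only, not the paper's models):

    ```python
    import numpy as np

    def log_gauss(x, mean, var):
        """Log density of a diagonal Gaussian."""
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def verify(x, target, imposter, threshold=0.0):
        # LLR = log p(x | target) - log p(x | imposter), compared to a threshold.
        llr = log_gauss(x, *target) - log_gauss(x, *imposter)
        return llr > threshold, llr

    # Hypothetical 3-dim speaker features for the two hypotheses.
    target = (np.array([1.0, 0.5, -0.2]), np.ones(3))
    imposter = (np.zeros(3), 2.0 * np.ones(3))
    print(verify(np.array([0.9, 0.4, -0.1]), target, imposter))
    ```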
  • Domain-aware Neural Language Models for Speech Recognition
    arXiv.cs.SD Pub Date : 2021-01-05
    Linda Liu; Yile Gu; Aditya Gourav; Ankur Gandhe; Shashank Kalmane; Denis Filimonov; Ariya Rastrow; Ivan Bulyko

    As voice assistants become more ubiquitous, they are increasingly expected to support and perform well on a wide variety of use-cases across different domains. We present a domain-aware rescoring framework suitable for achieving domain adaptation during second-pass rescoring in production settings. In our framework, we fine-tune a domain-general neural language model on several domains, and use an …

    Updated: 2021-01-12
  • On Interfacing the Brain with Quantum Computers: An Approach to Listen to the Logic of the Mind
    arXiv.cs.SD Pub Date : 2020-12-22
    Eduardo Reck Miranda

    This chapter presents a quantum computing-based approach to studying and harnessing neuronal correlates of mental activity for the development of Brain-Computer Interface (BCI) systems. It introduces the notion of a logic of the mind, where neurophysiological data are encoded as logical expressions representing mental activity. Effective logical expressions are likely to be extensive, involving dozens of …

    Updated: 2021-01-12
  • Low-resource expressive text-to-speech using data augmentation
    arXiv.cs.SD Pub Date : 2020-11-11
    Goeric Huybrechts; Thomas Merritt; Giulia Comini; Bartek Perz; Raahil Shah; Jaime Lorenzo-Trueba

    While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such …

    Updated: 2021-01-12
  • A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection
    arXiv.cs.SD Pub Date : 2021-01-08
    Qing Wang; Jun Du; Hua-Xin Wu; Jia Pan; Feng Ma; Chin-Hui Lee

    In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS focus on augmenting the limited training data by expanding …

    Updated: 2021-01-11
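
    The idea behind audio channel swapping is that permuting the microphone channels of a recording (with the spatial labels transformed to match) yields additional, spatially distinct training examples. A schematic numpy sketch; which permutations are valid for a given array geometry is specific to the paper and only assumed here:

    ```python
    import numpy as np

    def channel_swap(audio, perm):
        """audio: (channels, samples); perm: reordering of channel indices.
        Note: DOA labels must be remapped consistently with `perm`."""
        return audio[list(perm), :]

    x = np.random.randn(4, 16000)  # hypothetical 4-channel, 1-second clip
    augmented = [channel_swap(x, p) for p in [(1, 0, 3, 2), (2, 3, 0, 1)]]
    print([a.shape for a in augmented])
    ```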
  • VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
    arXiv.cs.SD Pub Date : 2021-01-08
    Ruohan Gao; Kristen Grauman

    We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional …

    Updated: 2021-01-11
  • Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs
    arXiv.cs.SD Pub Date : 2021-01-07
    Wen-Yi Hsiao; Jen-Yu Liu; Yin-Cheng Yeh; Yi-Hsuan Yang

    To apply neural sequence models such as the Transformer to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset …

    Updated: 2021-01-08
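
    The vocabulary issue is easy to see in code: each note expands into several typed tokens, which a compound representation groups into one super-token predicted in a single step. A schematic sketch (the token names below are hypothetical):

    ```python
    from collections import namedtuple

    # One note as a flat sequence of typed tokens (four decoding steps)...
    flat_tokens = ["pitch_60", "dur_quarter", "vel_80", "onset_0"]

    # ...versus one compound token grouping the co-occurring types,
    # predicted in one step by a model with one output head per type.
    Note = namedtuple("Note", ["pitch", "duration", "velocity", "onset"])
    compound = Note(pitch=60, duration="quarter", velocity=80, onset=0)

    print(len(flat_tokens), "steps vs. 1 step:", compound)
    ```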
  • Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario
    arXiv.cs.SD Pub Date : 2021-01-07
    Chiang-Jen Peng; Yun-Ju Chan; Cheng Yu; Syu-Siang Wang; Yu Tsao; Tai-Shih Chi

    Multi-task learning (MTL) and the attention technique have been proven to effectively extract robust acoustic features for various speech-related applications in noisy environments. In this study, we integrate MTL and the attention-weighting mechanism and propose an attention-based MTL (ATM) approach to realize a multi-model learning structure and to promote the speech enhancement (SE) and speaker …

    Updated: 2021-01-08
  • Investigating the efficacy of music version retrieval systems for setlist identification
    arXiv.cs.SD Pub Date : 2021-01-06
    Furkan Yesiler; Emilio Molina; Joan Serrà; Emilia Gómez

    The setlist identification (SLI) task addresses a music recognition use case where the goal is to retrieve the metadata and timestamps for all the tracks played in live music events. Due to various musical and non-musical changes in live performances, developing automatic SLI systems is still a challenging task that, despite its industrial relevance, has been under-explored in the academic literature …

    Updated: 2021-01-07
  • Multichannel CRNN for Speaker Counting: an Analysis of Performance
    arXiv.cs.SD Pub Date : 2021-01-06
    Pierre-Amaury Grumiaux; Srdan Kitic; Laurent Girin; Alexandre Guérin

    Speaker counting is the task of estimating the number of people that are simultaneously speaking in an audio recording. For several audio processing tasks such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, in addition to enabling low-latency processing. In a previous work …

    Updated: 2021-01-07
  • Interspeech 2021 Deep Noise Suppression Challenge
    arXiv.cs.SD Pub Date : 2021-01-06
    Chandan K A Reddy; Harishchandra Dubey; Kazuhito Koishida; Arun Nair; Vishak Gopal; Ross Cutler; Sebastian Braun; Hannes Gamper; Robert Aichner; Sriram Srinivasan

    The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which …

    Updated: 2021-01-07
  • Environment Transfer for Distributed Systems
    arXiv.cs.SD Pub Date : 2021-01-06
    Chunheng Jiang; Jae-wook Ahn; Nirmit Desai

    Collecting a sufficient amount of data that can represent various acoustic environmental attributes is a critical problem for distributed acoustic machine learning. Several audio data augmentation techniques have been introduced to address this problem, but they tend to remain simple manipulations of existing data and are insufficient to cover the variability of the environments. We propose a method …

    Updated: 2021-01-07
  • Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings
    arXiv.cs.SD Pub Date : 2021-01-06
    Xuankai Chang; Naoyuki Kanda; Yashesh Gaur; Xiaofei Wang; Zhong Meng; Takuya Yoshioka

    An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between …

    Updated: 2021-01-07
  • Fixed-MAML for Few Shot Classification in Multilingual Speech Emotion Recognition
    arXiv.cs.SD Pub Date : 2021-01-05
    Anugunj Naman; Liliana Mancini

    In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform well only when the training corpus is vast. This availability of a big training corpus is a significant problem …

    Updated: 2021-01-06
  • Development of a Respiratory Sound Labeling Software for Training a Deep Learning-Based Respiratory Sound Analysis Model
    arXiv.cs.SD Pub Date : 2021-01-05
    Fu-Shun Hsu; Chao-Jung Huang; Chen-Yi Kuo; Shang-Ran Huang; Yuan-Ren Cheng; Jia-Horng Wang; Yi-Lin Wu; Tzu-Ling Tzeng; Feipei Lai

    Respiratory auscultation can help healthcare professionals detect abnormal respiratory conditions if adventitious lung sounds are heard. The state-of-the-art artificial intelligence technologies based on deep learning show great potential in the development of automated respiratory sound analysis. To train a deep learning-based model, a huge number of accurate labels of normal breath sounds and adventitious …

    Updated: 2021-01-06
  • Generalized RNN beamformer for target speech separation
    arXiv.cs.SD Pub Date : 2021-01-04
    Yong Xu; Zhuohuang Zhang; Meng Yu; Shi-Xiong Zhang; Lianwu Chen; Dong Yu

    Recently we proposed an all-deep-learning minimum variance distortionless response (ADL-MVDR) method in which the unstable matrix inverse and principal component analysis (PCA) operations of the MVDR were replaced by recurrent neural networks (RNNs). However, it is not clear whether the success of the ADL-MVDR is owed to the calculated covariance matrices or to following the MVDR formula. In this work, we …

    Updated: 2021-01-06
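
    For context, the operations the RNNs replace are the covariance inverse and the PCA-based steering vector of the classical MVDR solution w = Φ_nn⁻¹v / (vᴴΦ_nn⁻¹v). A numpy sketch of that conventional baseline (not the proposed RNN beamformer):

    ```python
    import numpy as np

    def mvdr_weights(phi_nn, phi_ss):
        """Classical MVDR: steering vector from the principal eigenvector
        of the speech covariance (the PCA step), then the closed form."""
        _, eigvecs = np.linalg.eigh(phi_ss)
        v = eigvecs[:, -1]                    # principal component
        num = np.linalg.solve(phi_nn, v)      # phi_nn^{-1} v, no explicit inverse
        return num / (v.conj() @ num)

    M = 6                                      # hypothetical microphone count
    rng = np.random.default_rng(0)
    A = rng.standard_normal((M, M)); phi_nn = A @ A.T + M * np.eye(M)
    B = rng.standard_normal((M, 2)); phi_ss = B @ B.T
    print(mvdr_weights(phi_nn, phi_ss).shape)  # (6,) beamformer weights
    ```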
  • A novel policy for pre-trained Deep Reinforcement Learning for Speech Emotion Recognition
    arXiv.cs.SD Pub Date : 2021-01-04
    Thejan Rajapakshe; Rajib Rana; Sara Khalifa

    Reinforcement learning (RL) is a semi-supervised learning paradigm in which an agent learns by interacting with an environment. Deep learning in combination with RL, called deep reinforcement learning (deep RL), provides an efficient method for learning how to interact with the environment. Deep RL has gained tremendous success in gaming - such as AlphaGo - but its potential has rarely been explored for …

    Updated: 2021-01-05
  • Adversarial Unsupervised Domain Adaptation for Harmonic-Percussive Source Separation
    arXiv.cs.SD Pub Date : 2021-01-03
    Carlos Lordelo; Emmanouil Benetos; Simon Dixon; Sven Ahlbäck; Patrik Ohlsson

    This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-training in one domain with fine-tuning in another. We propose …

    Updated: 2021-01-05
  • A Survey on Deep Reinforcement Learning for Audio-Based Applications
    arXiv.cs.SD Pub Date : 2021-01-01
    Siddique Latif; Heriberto Cuayáhuitl; Farrukh Pervez; Fahad Shamshad; Hafiz Shehbaz Ali; Erik Cambria

    Deep reinforcement learning (DRL) is poised to revolutionise the field of artificial intelligence (AI) by endowing autonomous systems with high levels of understanding of the real world. Currently, deep learning (DL) is enabling DRL to effectively solve various intractable problems in many fields. Most importantly, DRL algorithms are also being employed in audio signal processing to learn directly …

    Updated: 2021-01-05
  • Generative Deep Learning for Virtuosic Classical Music: Generative Adversarial Networks as Renowned Composers
    arXiv.cs.SD Pub Date : 2021-01-01
    Daniel Szelogowski

    Current AI-generated music lacks fundamental principles of good compositional technique. By narrowing down implementation issues both programmatically and musically, we can create a better understanding of what parameters are necessary for a generated composition to be nearly indistinguishable from that of a master composer.

    Updated: 2021-01-05
  • What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure
    arXiv.cs.SD Pub Date : 2021-01-02
    Jui Shah; Yaman Kumar Singla; Changyou Chen; Rajiv Ratn Shah

    In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This raises the question of what these audio transformer models are learning. Moreover, although the standard …

    Updated: 2021-01-05
  • Audio Content Analysis
    arXiv.cs.SD Pub Date : 2021-01-01
    Alexander Lerch

    Preprint for a book chapter introducing Audio Content Analysis. With a focus on Music Information Retrieval systems, this chapter defines musical audio content, introduces the general process of audio content analysis, and surveys basic approaches to audio content analysis. The various tasks in Audio Content Analysis are categorized into three classes: music transcription, music performance analysis …

    Updated: 2021-01-05
  • Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding
    arXiv.cs.SD Pub Date : 2020-12-31
    Kai Zhen; Mi Suk Lee; Jongmo Sung; Seungkwon Beack; Minje Kim

    Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present …

    Updated: 2021-01-05
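
    One way to picture a psychoacoustically calibrated loss is a spectral reconstruction error weighted per frequency bin by perceptual importance, so errors the ear cannot hear are discounted. A generic sketch with an arbitrary stand-in weighting (the paper's actual masking-based weights are not reproduced here):

    ```python
    import numpy as np

    def weighted_spectral_loss(ref_mag, est_mag, weight):
        """MSE between magnitude spectra, scaled per bin by a perceptual
        weight (e.g. one derived from a masking threshold)."""
        return float(np.mean(weight * (ref_mag - est_mag) ** 2))

    bins = 257
    w = np.linspace(1.0, 0.1, bins)   # stand-in: de-emphasize high bins
    ref = np.abs(np.random.randn(100, bins))
    est = ref + 0.05 * np.random.randn(100, bins)
    print(weighted_spectral_loss(ref, est, w))
    ```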
  • EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting
    arXiv.cs.SD Pub Date : 2020-12-31
    Amir Mohammad Rostami; Ali Karimi; Mohammad Ali Akhaee

    Keyword spotting is the process of finding specific words or phrases in recorded speech by computer. Deep neural network algorithms, as a powerful engine, can handle this problem if they are trained on an appropriate dataset. To this end, the football keyword dataset (FKD), a new keyword spotting dataset in Persian, was collected with crowdsourcing. This dataset contains nearly 31000 samples …

    Updated: 2021-01-01
  • Unified Mandarin TTS Front-end Based on Distilled BERT Model
    arXiv.cs.SD Pub Date : 2020-12-31
    Yang Zhang; Liqun Deng; Yasheng Wang

    The front-end module in a typical Mandarin text-to-speech (TTS) system is composed of a long pipeline of text processing components, which requires extensive effort to build and is prone to large accumulative model size and cascade errors. In this paper, a pre-trained language model (PLM) based model is proposed to simultaneously tackle the two most important tasks in the TTS front-end, i.e., prosodic …

    Updated: 2021-01-01
  • Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis
    arXiv.cs.SD Pub Date : 2020-12-30
    Jose A. Gonzalez-Lopez; Miriam Gonzalez-Atienza; Alejandro Gomez-Alanis; Jose L. Perez-Cordoba; Phil D. Green

    Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators. This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury. Most successful techniques so far adopt a supervised learning framework, in which time-synchronous articulatory-and-speech recordings …

    Updated: 2021-01-01
  • Data-driven audio recognition: a supervised dictionary approach
    arXiv.cs.SD Pub Date : 2020-12-29
    Imad Rida

    Machine hearing is an emerging area. Motivated by the need for a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. To this end, a novel and efficient supervised dictionary learning method is presented. Experiments are performed on both computational auditory scene datasets (East Anglia and Rouen) and synthetic music …

    Updated: 2021-01-01
  • Detecting COVID-19 from Breathing and Coughing Sounds using Deep Neural Networks
    arXiv.cs.SD Pub Date : 2020-12-29
    Björn W. Schuller; Harry Coppock; Alexander Gaskell

    The COVID-19 pandemic has affected the world unevenly; while industrial economies have been able to produce the tests necessary to track the spread of the virus and mostly avoided complete lockdowns, developing countries have faced issues with testing capacity. In this paper, we explore the usage of deep learning models as a ubiquitous, low-cost, pre-testing method for detecting COVID-19 from audio …

    Updated: 2021-01-01
  • Generalized Operating Procedure for Deep Learning: an Unconstrained Optimal Design Perspective
    arXiv.cs.SD Pub Date : 2020-12-31
    Shen Chen; Mingwei Zhang; Jiamin Cui; Wei Yao

    Deep learning (DL) has brought about remarkable breakthroughs in processing images, video and speech due to its efficacy in extracting highly abstract representations and learning very complex functions. However, there are few reported operating procedures on how to put it to use in real cases. In this paper, we intend to address this problem by presenting a generalized operating procedure for DL from …

    Updated: 2021-01-01
  • Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks
    arXiv.cs.SD Pub Date : 2020-12-29
    Federico Landini; Ján Profant; Mireia Diez; Lukáš Burget

    The recently proposed VBx diarization method uses a Bayesian hidden Markov model to find speaker clusters in a sequence of x-vectors. In this work we perform an extensive comparison of the performance of VBx diarization with other approaches in the literature, and we show that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: CALLHOME, AMI and DIHARD II …

    Updated: 2021-01-01
  • An analytic physically motivated model of the mammalian cochlea
    arXiv.cs.SD Pub Date : 2020-12-25
    Samiya A Alkhairy; Christopher A Shera

    We develop an analytic model of the mammalian cochlea. We use a mixed physical-phenomenological approach by utilizing existing work on the physics of classical box-representations of the cochlea, and the behavior of recent data-derived wavenumber estimates. Spatial variation is incorporated through a single independent variable that combines space and frequency. We arrive at closed-form expressions for …

    Updated: 2021-01-01
  • Detection of Lexical Stress Errors in Non-native (L2) English with Data Augmentation and Attention
    arXiv.cs.SD Pub Date : 2020-12-29
    Daniel Korzekwa; Roberto Barra-Chicote; Szymon Zaporowski; Grzegorz Beringer; Jaime Lorenzo-Trueba; Alicja Serafinowicz; Jasha Droppo; Thomas Drugman; Bozena Kostek

    This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on neural text-to-speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model …

    Updated: 2021-01-01
  • Inception-Based Network and Multi-Spectrogram Ensemble Applied For Predicting Respiratory Anomalies and Lung Diseases
    arXiv.cs.SD Pub Date : 2020-12-26
    Lam Pham; Huy Phan; Ross King; Alfred Mertins; Ian McLoughlin

    This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input. Recordings of respiratory sound collected from patients are first transformed into spectrograms in which both spectral and temporal information are well presented, a step referred to as front-end feature extraction. These spectrograms are then fed into the proposed network, referred to as the back-end …

    Updated: 2020-12-29
  • Lattice-Free MMI Adaptation Of Self-Supervised Pretrained Acoustic Models
    arXiv.cs.SD Pub Date : 2020-12-28
    Apoorv Vyas; Srikanth Madikeri; Hervé Bourlard

    In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that, by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean …

    Updated: 2020-12-29
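
    To read the numbers: a relative WER improvement is the reduction expressed as a fraction of the baseline WER, e.g.:

    ```python
    def relative_wer_improvement(wer_base, wer_new):
        return (wer_base - wer_new) / wer_base

    # 10% relative: a hypothetical baseline WER of 5.0 dropping to 4.5.
    print(relative_wer_improvement(5.0, 4.5))  # 0.10
    ```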
  • Building Multilingual TTS using Cross-Lingual Voice Conversion
    arXiv.cs.SD Pub Date : 2020-12-28
    Qinghua Sun; Kenji Nagamatsu

    In this paper we propose a new cross-lingual voice conversion (VC) approach which can generate all speech parameters (MCEP, LF0, BAP) from one DNN model using PPGs (phonetic posteriorgrams) extracted from input speech using several ASR acoustic models. Using the proposed VC method, we tried three different approaches to build a multilingual TTS system without recording a multilingual speech corpus …

    Updated: 2020-12-29
  • Deep Learning Framework Applied for Predicting Anomaly of Respiratory Sounds
    arXiv.cs.SD Pub Date : 2020-12-26
    Dat Ngo; Lam Pham; Anh Nguyen; Ben Phan; Khoa Tran; Truong Nguyen

    This paper proposes a robust deep learning framework for classifying anomalies in respiratory cycles. Our framework starts with a front-end feature extraction step, which transforms the respiratory input sound into a two-dimensional spectrogram where both spectral and temporal features are well presented. Next, an ensemble of C-DNN and autoencoder networks is applied to …

    Updated: 2020-12-29
  • Multi-channel Multi-frame ADL-MVDR for Target Speech Separation
    arXiv.cs.SD Pub Date : 2020-12-24
    Zhuohuang Zhang; Yong Xu; Meng Yu; Shi-Xiong Zhang; Lianwu Chen; Donald S. Williamson; Dong Yu

    Many purely neural-network-based speech separation approaches have been proposed that greatly improve objective assessment scores, but they often introduce nonlinear distortions that are harmful to automatic speech recognition (ASR). Minimum variance distortionless response (MVDR) filters strive to remove such nonlinear distortions; however, these approaches either are not optimal for removing residual …

    Updated: 2020-12-29
  • Unsupervised neural adaptation model based on optimal transport for spoken language identification
    arXiv.cs.SD Pub Date : 2020-12-24
    Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

    Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) can be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as reducing the distribution discrepancy …

    Updated: 2020-12-25
  • AudioViewer: Learning to Visualize Sound
    arXiv.cs.SD Pub Date : 2020-12-22
    Yuchi Zhang; Willis Peng; Bastian Wandt; Helge Rhodin

    Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that …

    Updated: 2020-12-25
  • Wheel-Rail Interface Condition Estimation (W-RICE)
    arXiv.cs.SD Pub Date : 2020-12-24
    Sundar Shrestha; Anand Koirala; Maksym Spiryagin; Qing Wu

    The surface roughness between the wheel and rail has a huge influence on the rolling noise level. The presence of a third body such as frost or grease at the wheel-rail interface contributes to a change in the adhesion coefficient, resulting in the generation of noise at various levels. Therefore, it is possible to estimate adhesion conditions between the wheel and rail from the analysis of noise patterns …

    Updated: 2020-12-25
  • The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans
    arXiv.cs.SD Pub Date : 2020-12-23
    Shinji Watanabe; Florian Boyer; Xuankai Chang; Pengcheng Guo; Tomoki Hayashi; Yosuke Higuchi; Takaaki Hori; Wen-Chin Huang; Hirofumi Inaguma; Naoyuki Kamo; Shigeki Karita; Chenda Li; Jing Shi; Aswin Shanmugam Subramanian; Wangyou Zhang

    This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017, mainly to deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text to …

    Updated: 2020-12-25
  • Speech Synthesis as Augmentation for Low-Resource ASR
    arXiv.cs.SD Pub Date : 2020-12-23
    Deblin Bagchi; Shannon Wotherspoon; Zhuolin Jiang; Prasanna Muthukumar

    Speech synthesis might hold the key to low-resource speech recognition. Data augmentation techniques have become an essential part of modern speech recognition training. Yet, they are simple, naive, and rarely reflect real-world conditions. Meanwhile, speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech. In this paper, we investigate the possibility …

    Updated: 2020-12-25
  • Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model
    arXiv.cs.SD Pub Date : 2020-12-23
    Takaaki Saeki; Shinnosuke Takamichi; Hiroshi Saruwatari

    Text-to-speech (TTS) synthesis, a technique for artificially generating human-like utterances from text, has dramatically evolved with the advances of end-to-end deep neural network-based methods in recent years. The majority of these methods are sentence-level TTS, which can take into account time-series information across the whole sentence. However, it is necessary to establish incremental TTS, which …

    Updated: 2020-12-24
Contents have been reproduced by permission of the publishers.