Current journal: arXiv - CS - Sound
  • Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification
    arXiv.cs.SD Pub Date : 2020-04-07
    Youngmoon Jung; Seongmin Kye; Yeunju Choi; Myunghun Jung; Hoirin Kim

    Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, convolutional neural networks are mainly used as a frame-level feature extractor, and speaker embeddings are extracted from the last layer of the feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor

  • Direct Speech-to-image Translation
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

  • Universal Adversarial Perturbations Generative Network for Speaker Recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given that neural networks are vulnerable to adversarial examples, which have been intentionally perturbed to remain almost imperceptible to humans. In this paper, we demonstrate the existence of universal adversarial perturbations (UAPs)

  • Learning to fool the speaker recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Due to the widespread deployment of fingerprint/face/speaker recognition systems, attacking deep learning based biometric systems has drawn more and more attention. Previous research mainly studied attacks on vision-based systems, such as fingerprint and face recognition, while attacks on speaker recognition have not yet been investigated, although such systems are widely used in our daily life

  • Homophone-based Label Smoothing in End-to-End Automatic Speech Recognition
    arXiv.cs.SD Pub Date : 2020-04-07
    Yi Zheng; Xianjie Yang; Xuyong Dang

    This paper proposes a new label smoothing method for automatic speech recognition (ASR) that makes use of human-level prior knowledge of a language, namely homophones. Compared with its forerunners, the proposed method uses the pronunciation knowledge of homophones in a more complex way. End-to-end ASR models that learn the acoustic model and language model jointly and use characters as modelling units are necessary
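    As a rough illustration of the idea (not necessarily the paper's exact scheme), homophone-aware label smoothing can concentrate the smoothing mass on characters that share the target's pronunciation instead of spreading it uniformly over the vocabulary; the `eps` value and homophone table below are assumptions for illustration:

```python
def homophone_smoothing(target, vocab, homophones, eps=0.1):
    """Sketch of homophone-aware label smoothing: put probability 1-eps
    on the target character and distribute eps over its homophones.
    Falls back to uniform smoothing when no homophones are known.
    (eps and the homophone table are illustrative assumptions.)"""
    dist = {c: 0.0 for c in vocab}
    dist[target] = 1.0 - eps
    same_sound = homophones.get(target, [])
    if same_sound:
        for c in same_sound:
            dist[c] += eps / len(same_sound)
    else:
        for c in vocab:
            dist[c] += eps / len(vocab)
    return dist
```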

  • SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-04-07
    Robert Rehr; Timo Gerkmann

    This paper analyzes the generalization of speech enhancement algorithms based on deep neural networks (DNNs) with respect to (1) the chosen features, (2) the size and diversity of the training data, and (3) different network architectures. To address (1), we compare three input features, namely logarithmized noisy periodograms, noise aware training (NAT) and signal-to-noise ratio (SNR) based noise
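    Of the compared inputs, the logarithmized noisy periodogram is the most straightforward to compute: the framewise log squared-magnitude STFT of the noisy signal. A minimal sketch, where the window choice and frame sizes are illustrative assumptions:

```python
import numpy as np

def log_periodogram_feats(noisy, frame_len=256, hop=128):
    """Logarithmized noisy periodogram features: log |STFT|^2 per frame.
    Hann window and frame/hop sizes are assumptions for illustration."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = noisy[i * hop:i * hop + frame_len] * win
        spec = np.fft.rfft(frame)
        # small floor avoids log(0) on silent frames
        feats.append(np.log(np.abs(spec) ** 2 + 1e-12))
    return np.array(feats)  # (n_frames, frame_len // 2 + 1)
```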

  • WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
    arXiv.cs.SD Pub Date : 2020-02-02
    Rui Liu; Berrak Sisman; Feilong Bao; Guanglai Gao; Haizhou Li

    Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from acoustic features. As the loss function is usually calculated only

  • Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
    arXiv.cs.SD Pub Date : 2020-02-01
    Kun Zhou; Berrak Sisman; Haizhou Li

    Emotional voice conversion aims to convert the spectrum and prosody to change the emotional patterns of speech, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0
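    The "simple linear transform" of F0 referred to here is commonly a Gaussian-normalized shift and scale in the log-F0 domain; a sketch of that baseline, assuming the log-F0 statistics have been precomputed from source- and target-emotion training data:

```python
import numpy as np

def convert_f0_linear(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Baseline log-Gaussian normalized F0 transform: map the source
    log-F0 statistics (mu_src, sigma_src) onto the target's
    (mu_tgt, sigma_tgt). Unvoiced frames (F0 == 0) are left at zero.
    Statistics are assumed to be precomputed offline."""
    f0_out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    logf0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((logf0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_out
```

    This transform only shifts and scales the F0 contour globally, which is exactly why it cannot capture richer emotional prosody patterns.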

  • Towards Relevance and Sequence Modeling in Language Recognition
    arXiv.cs.SD Pub Date : 2020-04-02
    Bharat Padi; Anand Mohan; Sriram Ganapathy

    The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and for speaker recognition) ignore the sequence information by

  • AI4COVID-19: AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App
    arXiv.cs.SD Pub Date : 2020-04-02
    Ali Imran; Iryna Posokhova; Haneya N. Qureshi; Usama Masood; Sajid Riaz; Kamran Ali; Charles N. John; Muhammad Nabeel

    The inability to test at scale has become the Achilles' heel of humanity's ongoing war against the COVID-19 pandemic. An agile, scalable and cost-effective testing approach, deployable at a global scale, can act as a game changer in this war. To address this challenge, building on the promising results of our prior work on cough-based diagnosis of a variety of respiratory diseases, we develop an Artificial Intelligence

  • Can Machine Learning Be Used to Recognize and Diagnose Coughs?
    arXiv.cs.SD Pub Date : 2020-04-01
    Charles Bales; Charles John; Hasan Farooq; Usama Masood; Muhammad Nabeel; Ali Imran

    5G is bringing new use cases to the forefront, one of the most prominent being machine learning empowered health care. Since respiratory infections are a notable modern medical concern and coughs are a common symptom of them, a system for recognizing and diagnosing infections based on raw cough data would have a multitude of beneficial research and medical applications. In the literature

  • Towards democratizing music production with AI: Design of Variational Autoencoder-based Rhythm Generator as a DAW plugin
    arXiv.cs.SD Pub Date : 2020-04-01
    Nao Tokui

    There has been significant progress in music generation techniques utilizing deep learning. However, it is still hard for musicians and artists to use these techniques in their daily music-making practice. This paper proposes a Variational Autoencoder (VAE)-based rhythm generation system, in which musicians can train a deep learning model only by selecting target MIDI files, then

  • Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection
    arXiv.cs.SD Pub Date : 2020-04-02
    Tharindu Fernando; Sridha Sridharan; Mitchell McLaren; Darshana Priyasad; Simon Denman; Clinton Fookes

    This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next

  • Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems
    arXiv.cs.SD Pub Date : 2019-08-05
    Lea Schönherr; Thorsten Eisenhofer; Steffen Zeiler; Thorsten Holz; Dorothea Kolossa

    Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one

  • Learning Alignment for Multimodal Emotion Recognition from Speech
    arXiv.cs.SD Pub Date : 2019-09-06
    Haiyang Xu; Hui Zhang; Kun Han; Yun Wang; Yiping Peng; Xiangang Li

    Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals, or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, emotion recognition will be beneficial

  • Multilingual Graphemic Hybrid ASR with Massive Data Augmentation
    arXiv.cs.SD Pub Date : 2019-09-14
    Chunxi Liu; Qiaochu Zhang; Xiaohui Zhang; Kritika Singh; Yatharth Saraf; Geoffrey Zweig

    Towards developing high-performing ASR for low-resource languages, common approaches to address the lack of resources are to make use of data from multiple languages and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with lattice-free MMI

  • SEF-ALDR: A Speaker Embedding Framework via Adversarial Learning based Disentangled Representation
    arXiv.cs.SD Pub Date : 2019-11-27
    Jianwei Tai; Xiaoqi Jia; Qingjia Huang; Weijuan Zhang; Shengzhi Zhang

    With the pervasiveness of voice control on smart devices, speaker verification is widely used as the preferred identity authentication mechanism due to its convenience. However, the task of "in-the-wild" speaker verification is challenging, considering the speech samples may contain lots of identity-unrelated information, e.g., background noise, reverberation, emotion, etc. Previous works focus on

  • Neural Percussive Synthesis Parameterised by High-Level Timbral Features
    arXiv.cs.SD Pub Date : 2019-11-25
    António Ramires; Pritish Chandna; Xavier Favory; Emilia Gómez; Xavier Serra

    We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input parameters

  • Improving auditory attention decoding performance of linear and non-linear methods using state-space model
    arXiv.cs.SD Pub Date : 2020-04-02
    Ali Aroudi; Tobias de Taillez; Simon Doclo

    Identifying the target speaker in hearing aid applications is crucial to improve speech understanding. Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker from single-trial EEG recordings using auditory attention decoding (AAD) methods. AAD methods reconstruct the attended speech envelope from EEG recordings, based on a linear least-squares

  • iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning
    arXiv.cs.SD Pub Date : 2020-04-02
    Haoyu Li; Szu-Wei Fu; Yu Tsao; Junichi Yamagishi

    The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, with the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modifications. Specifically, we utilize an iMetricGAN approach

  • The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment
    arXiv.cs.SD Pub Date : 2020-04-02
    Wei Zhou; Wilfried Michel; Kazuki Irie; Markus Kitza; Ralf Schlüter; Hermann Ney

    We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model
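    The SpecAugment masking investigated above is simple to sketch: random time and frequency stripes of the input spectrogram are zeroed during training. The mask widths and counts below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def spec_augment(spec, max_t=30, max_f=13, n_t=2, n_f=2, rng=None):
    """Minimal SpecAugment-style masking: zero out n_t random time
    stripes up to max_t frames wide and n_f random frequency stripes
    up to max_f bins wide. spec has shape (time, freq). All
    hyperparameters here are assumptions for illustration."""
    rng = rng or np.random.default_rng(0)
    out = spec.copy()
    T, F = out.shape
    for _ in range(n_t):
        w = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    for _ in range(n_f):
        w = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(1, F - w + 1))
        out[:, f0:f0 + w] = 0.0
    return out
```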

  • Full-Sum Decoding for Hybrid HMM based Speech Recognition using LSTM Language Model
    arXiv.cs.SD Pub Date : 2020-04-02
    Wei Zhou; Ralf Schlüter; Hermann Ney

    In hybrid HMM based speech recognition, LSTM language models have been widely applied and have achieved large improvements. Their theoretical capability of modeling unlimited context suggests that no recombination should be applied in decoding. This motivates reconsidering full summation over the HMM-state sequences instead of the Viterbi approximation in decoding. We explore the potential gain from more

  • Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios
    arXiv.cs.SD Pub Date : 2020-04-02
    Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

    The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

  • Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement
    arXiv.cs.SD Pub Date : 2020-03-31
    Chao-Han Huck Yang; Jun Qi; Pin-Yu Chen; Xiaoli Ma; Chin-Hui Lee

    Recent studies have highlighted adversarial examples as ubiquitous threats to deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net_At, to enhance adversarial speech signals. Specifically, we evaluate the model performance with interpretable speech recognition metrics and discuss the model performance with the augmented adversarial

  • ASR is all you need: cross-modal distillation for lip reading
    arXiv.cs.SD Pub Date : 2019-11-28
    Triantafyllos Afouras; Joon Son Chung; Andrew Zisserman

    The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy loss

  • Serialized Output Training for End-to-End Overlapped Speech Recognition
    arXiv.cs.SD Pub Date : 2020-03-28
    Naoyuki Kanda; Yashesh Gaur; Xiaofei Wang; Zhong Meng; Takuya Yoshioka

    This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers, as in permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and
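    The serialization idea can be sketched as concatenating the speakers' token sequences with a speaker-change token between them; the token names below are assumptions for illustration:

```python
def serialize_transcriptions(transcriptions):
    """Sketch of SOT-style label serialization: join per-speaker token
    sequences with a speaker-change token <sc> and end with <eos>, so a
    single output layer can emit all speakers' words one after another.
    (Token names are illustrative assumptions.)"""
    tokens = []
    for i, text in enumerate(transcriptions):
        if i > 0:
            tokens.append("<sc>")
        tokens.extend(text.split())
    tokens.append("<eos>")
    return tokens
```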

  • A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
    arXiv.cs.SD Pub Date : 2020-03-28
    Tara N. Sainath; Yanzhang He; Bo Li; Arun Narayanan; Ruoming Pang; Antoine Bruguier; Shuo-yiin Chang; Wei Li; Raziel Alvarez; Zhifeng Chen; Chung-Cheng Chiu; David Garcia; Alex Gruenstein; Ke Hu; Minho Jin; Anjuli Kannan; Qiao Liang; Ian McGraw; Cal Peyser; Rohit Prabhavalkar; Golan Pundak; David Rybach; Yuan Shangguan; Yash Sheth; Trevor Strohman; Mirko Visontai; Yonghui Wu; Yu Zhang; Ding Zhao

    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses

  • Deep Residual Neural Networks for Image in Speech Steganography
    arXiv.cs.SD Pub Date : 2020-03-30
    Shivam Agarwal; Siddarth Venkatraman

    Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside

  • A Recursive Network with Dynamic Attention for Monaural Speech Enhancement
    arXiv.cs.SD Pub Date : 2020-03-29
    Andong Li; Chengshi Zheng; Cunhang Fan; Renhua Peng; Xiaodong Li

    A person tends to direct dynamic attention towards speech in complicated environments. Based on this phenomenon, we propose a framework combining dynamic attention and recursive learning for monaural speech enhancement. Apart from a major noise reduction network, we design a separate sub-network, which adaptively generates the attention distribution to control the information flow throughout

  • Listen to Look: Action Recognition by Previewing Audio
    arXiv.cs.SD Pub Date : 2019-12-10
    Ruohan Gao; Tae-Hyun Oh; Kristen Grauman; Lorenzo Torresani

    In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities, a

  • A Review of Multi-Objective Deep Learning Speech Denoising Methods
    arXiv.cs.SD Pub Date : 2020-03-26
    Arian Azarang; Nasser Kehtarnavaz

    This paper presents a review of multi-objective deep learning methods that have been introduced in the literature for speech denoising. After stating an overview of conventional, single objective deep learning, and hybrid or combined conventional and deep learning methods, a review of the mathematical framework of the multi-objective deep learning methods for speech denoising is provided. A representative

  • Incremental Learning Algorithm for Sound Event Detection
    arXiv.cs.SD Pub Date : 2020-03-26
    Eunjeong Koh; Fatemeh Saki; Yinyi Guo; Cheng-Yu Hung; Erik Visser

    This paper presents a new learning strategy for the Sound Event Detection (SED) system to tackle the issues of i) knowledge migration from a pre-trained model to a new target model and ii) learning new sound events without forgetting the previously learned ones and without re-training from scratch. In order to migrate the previously learned knowledge from the source model to the target one, a neural adapter

  • GPVAD: Towards noise robust voice activity detection via weakly supervised sound event detection
    arXiv.cs.SD Pub Date : 2020-03-27
    Heinrich Dinkel; Yefei Chen; Mengyue Wu; Kai Yu

    Traditional voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck for such supervised VAD training is its requirement for clean training data and frame-level labels. In contrast, we propose the GPVAD framework, which can be easily trained from noisy data in a weakly supervised fashion

  • Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss
    arXiv.cs.SD Pub Date : 2020-03-27
    Yi Luo; Nima Mesgarani

    Many recent source separation systems are designed to separate a fixed number of sources out of a mixture. In cases where the source activation patterns are unknown, such systems have to either adjust the number of outputs or identify invalid outputs from the valid ones. Iterative separation methods have gained much attention in the community as they can flexibly decide the number of outputs,

  • Can you hear me now? Sensitive comparisons of human and machine perception
    arXiv.cs.SD Pub Date : 2020-03-27
    Michael A Lepori; Chaz Firestone

    The rise of sophisticated machine-recognition systems has brought with it a rise in comparisons between human and machine perception. But such comparisons face an asymmetry: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or embedded in unconscious mental processes that may not be available

  • Training for Speech Recognition on Coprocessors
    arXiv.cs.SD Pub Date : 2020-03-22
    Sebastian Baunsgaard; Sebastian B. Wrede; Pınar Tozun

    Automatic Speech Recognition (ASR) has increased in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Home. The interest in such assistants, in turn, has amplified the novel developments in ASR research. However, despite

  • Mic2Mic: Using Cycle-Consistent Generative Adversarial Networks to Overcome Microphone Variability in Speech Systems
    arXiv.cs.SD Pub Date : 2020-03-27
    Akhil Mathur; Anton Isopoussu; Fahim Kawsar; Nadia Berthouze; Nicholas D. Lane

    Mobile and embedded devices are increasingly using microphones and audio-based computational models to infer user context. A major challenge in building systems that combine audio models with commodity microphones is to guarantee their accuracy and robustness in the real-world. Besides many environmental dynamics, a primary factor that impacts the robustness of audio models is microphone variability

  • Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation
    arXiv.cs.SD Pub Date : 2019-10-14
    Yi Luo; Zhuo Chen; Takuya Yoshioka

    Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches over conventional time-frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural
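    The dual-path workaround for such long inputs is to segment the sequence into overlapping chunks, so that intra-chunk and inter-chunk recurrences each process short sequences. A sketch of the segmentation step, where chunk length and hop are assumed hyperparameters:

```python
import numpy as np

def segment_sequence(x, chunk_len, hop):
    """Split a long (time, feat) sequence into overlapping chunks for
    dual-path processing: an intra-chunk RNN runs within each chunk and
    an inter-chunk RNN runs across chunks. Pads the tail so every chunk
    is full length. (chunk_len and hop are illustrative assumptions.)"""
    T = x.shape[0]
    pad = (-(T - chunk_len)) % hop if T > chunk_len else chunk_len - T
    xp = np.pad(x, ((0, pad), (0, 0)))
    starts = range(0, xp.shape[0] - chunk_len + 1, hop)
    return np.stack([xp[s:s + chunk_len] for s in starts])  # (n_chunks, chunk_len, feat)
```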

  • End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation
    arXiv.cs.SD Pub Date : 2019-10-30
    Yi Luo; Zhuo Chen; Nima Mesgarani; Takuya Yoshioka

    An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based

  • Clinical Depression and Affect Recognition with EmoAudioNet
    arXiv.cs.SD Pub Date : 2019-11-01
    Emna Rejaibi; Daoud Kadoch; Kamil Bentounes; Romain Alfred; Mohamed Daoudi; Abdenour Hadid; Alice Othmani

    Automatic emotion recognition and Major Depressive Disorder (MDD) diagnosis are inherently challenging problems in health informatics applications. According to the World Health Organization, 300M people were affected by depression in 2017, with only a third of them correctly identified. MDD is a persistent mood disorder in which the patient constantly feels negative emotions (low valence) and lacks excitement

  • Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression
    arXiv.cs.SD Pub Date : 2020-03-26
    Yi-Chiao Wu; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Hayashi; Tomoki Toda

    In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features

  • Finnish Language Modeling with Deep Transformer Models
    arXiv.cs.SD Pub Date : 2020-03-14
    Abhilash Jain

    Transformers have recently taken center stage in language modeling after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of the Transformer architectures BERT and Transformer-XL for the language modeling task. We use a sub-word model setting with the Finnish language and compare it to the previous state-of-the-art (SOTA) LSTM

  • Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders
    arXiv.cs.SD Pub Date : 2020-03-26
    Wissam A. Jassim; Jan Skoglund; Michael Chinen; Andrew Hines

    This study compares the performance of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. Performance of the coded speech is evaluated for different quality aspects: accuracy of pitch period estimation, the word error rates for

  • In defence of metric learning for speaker recognition
    arXiv.cs.SD Pub Date : 2020-03-26
    Joon Son Chung; Jaesung Huh; Seongkyu Mun; Minjae Lee; Hee Soo Heo; Soyeon Choe; Chiheon Ham; Sunghwan Jung; Bong-Jin Lee; Icksang Han

    The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-class (same speaker) and large inter-class (different speakers) distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning
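    The intra-/inter-class distance notion can be made concrete with a small utility that compares same-speaker and different-speaker embedding distances over a batch (an illustration of the evaluation idea, not the paper's training objective):

```python
import numpy as np

def embedding_distances(embs, labels):
    """Mean intra-class (same speaker) vs. inter-class (different
    speaker) Euclidean distance over a batch of utterance embeddings.
    A good embedding space yields small intra and large inter distances.
    (Illustrative utility; assumes both pair types occur in the batch.)"""
    intra, inter = [], []
    n = len(embs)
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(embs[i] - embs[j]))
            (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)
```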

  • COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis
    arXiv.cs.SD Pub Date : 2020-03-24
    Björn W. Schuller; Dagmar M. Schuller; Kun Qian; Juan Liu; Huaiyuan Zheng; Xiao Li

    At the time of writing, the world population is suffering from more than 10,000 registered deaths induced by the COVID-19 disease epidemic since the outbreak of the coronavirus, now officially known as SARS-CoV-2, more than three months ago. Since then, tremendous efforts have been made worldwide to counter-steer and control the epidemic, by now labelled a pandemic. In this contribution, we provide an overview

  • Monaural Speech Enhancement with Recursive Learning in the Time Domain
    arXiv.cs.SD Pub Date : 2020-03-22
    Andong Li; Chengshi Zheng; Linjuan Cheng; Renhua Peng; Xiaodong Li

    In this paper, we propose a type of neural network with recursive learning in the time domain, called RTNet, for monaural speech enhancement, where the proposed network consists of three principal components. The first part, called the stage recurrent neural network, is proposed to effectively aggregate deep feature dependencies across different stages with a memory mechanism and also remove

  • A Quantum Vocal Theory of Sound
    arXiv.cs.SD Pub Date : 2020-03-21
    Davide Rocchesso; Maria Mannone

    Concepts and formalism from acoustics are often used to exemplify quantum mechanics. Conversely, quantum mechanics could be used to achieve a new perspective on acoustics, as shown by Gabor's studies. Here, we focus in particular on the study of the human voice, considered as a probe to investigate the world of sounds. We present a theoretical framework that is based on observables of vocal production, and

  • Multi-task U-Net for Music Source Separation
    arXiv.cs.SD Pub Date : 2020-03-23
    Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

    A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated for estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

  • The Recognition Of Persian Phonemes Using PPNet
    arXiv.cs.SD Pub Date : 2018-12-17
    Saber Malekzadeh; Mohammad Hossein Gholizadeh; Hossein Ghayoumi zadeh; Seyed Naser Razavi

    In this paper, a novel approach is proposed for the recognition of Persian phonemes in the Persian Consonant-Vowel Combination (PCVC) speech dataset. Nowadays, deep neural networks play a crucial role in classification tasks. However, the best results in speech recognition do not yet match human recognition rates. Deep learning techniques show outstanding performance over many other classification

  • Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation
    arXiv.cs.SD Pub Date : 2019-07-01
    Yi-Chiao Wu; Tomoki Hayashi; Patrick Lumban Tobing; Kazuhiro Kobayashi; Tomoki Toda

    In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of the WaveNet (WN) vocoder. The effectiveness of the WN vocoder in generating high-fidelity speech samples from given acoustic features has been demonstrated recently. However, because of the fixed dilated convolution

  • Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition
    arXiv.cs.SD Pub Date : 2019-07-13
    Siddique Latif; Rajib Rana; Sara Khalifa; Raja Jurdak; Julien Epps; Björn W. Schuller

    Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a

  • Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
    arXiv.cs.SD Pub Date : 2019-07-21
    Yi-Chiao Wu; Patrick Lumban Tobing; Tomoki Hayashi; Kazuhiro Kobayashi; Tomoki Toda

    In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated

  • Neural Transfer Learning for Cry-based Diagnosis of Perinatal Asphyxia
    arXiv.cs.SD Pub Date : 2019-06-24
    Charles C. Onu; Jonathan Lebensold; William L. Hamilton; Doina Precup

    Despite continuing medical advances, the rate of newborn morbidity and mortality globally remains high, with over 6 million casualties every year. The prediction of pathologies affecting newborns based on their cry is thus of significant clinical interest, as it would facilitate the development of accessible, low-cost diagnostic tools. However, the inadequacy

  • Zero-shot Learning for Audio-based Music Classification and Tagging
    arXiv.cs.SD Pub Date : 2019-07-05
    Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

    Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about
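The zero-shot mechanism the abstract describes is to score audio against label representations in a shared semantic space, so labels unseen at training time can still be ranked. A toy sketch under that assumption (vectors and label names are illustrative, not real word embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_tag(audio_emb, label_vectors):
    """Rank labels (seen or unseen) by cosine similarity in the shared space."""
    scores = {name: cosine(audio_emb, vec) for name, vec in label_vectors.items()}
    return max(scores, key=scores.get)

# toy 3-d "semantic" space
labels = {
    "rock":  np.array([1.0, 0.1, 0.0]),
    "jazz":  np.array([0.0, 1.0, 0.1]),
    "lo-fi": np.array([0.1, 0.0, 1.0]),   # an "unseen" label at training time
}
projected = np.array([0.05, 0.02, 0.9])   # audio embedding after projection
print(zero_shot_tag(projected, labels))   # -> lo-fi
```

The classifier never needs retraining when a new genre word is added; only its semantic vector is required.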

  • Deliberation Model Based Two-Pass End-to-End Speech Recognition
    arXiv.cs.SD Pub Date : 2020-03-17
    Ke Hu; Tara N. Sainath; Ruoming Pang; Rohit Prabhavalkar

    End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as
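The two-pass scheme rescores streamed n-best hypotheses with a second, non-streaming model. A minimal sketch of that rescoring step, with the LAS-style second pass stubbed out as a generic scoring callable (the interpolation weight and stub are assumptions for illustration):

```python
import math

def rescore_nbest(hypotheses, first_pass_scores, second_pass, weight=0.5):
    """Pick the hypothesis maximizing a blend of first- and second-pass log-scores.

    `second_pass` is any callable returning a log-score for a full hypothesis;
    in the paper this role is played by an attention-based deliberation decoder.
    """
    best, best_score = None, -math.inf
    for hyp, s1 in zip(hypotheses, first_pass_scores):
        s = (1 - weight) * s1 + weight * second_pass(hyp)
        if s > best_score:
            best, best_score = hyp, s
    return best

# toy stub: the second pass prefers hypotheses of about four words
second = lambda h: -1.0 * abs(len(h.split()) - 4)
print(rescore_nbest(["see the dog", "we see the dog"], [-2.0, -2.3], second))
# -> we see the dog (second pass overturns the first-pass ranking)
```

The first pass can stream with low latency while the second pass only runs once per n-best list, which is how the latency budget stays reasonable.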

  • Multi-Source DOA Estimation through Pattern Recognition of the Modal Coherence of a Reverberant Soundfield
    arXiv.cs.SD Pub Date : 2020-03-18
    A. Fahim; P. N. Samarasinghe; T. D. Abhayapala

    We propose a novel multi-source direction of arrival (DOA) estimation technique using a convolutional neural network algorithm which learns the modal coherence patterns of an incident soundfield through measured spherical harmonic coefficients. We train our model for individual time-frequency bins in the short-time Fourier transform spectrum by analyzing the unique snapshot of modal coherence for each
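The CNN input described here is the modal coherence of the spherical-harmonic (SH) coefficients at each time-frequency bin. A sketch of computing such a coherence matrix from SH snapshots (the normalization is a plausible choice, not necessarily the paper's exact definition):

```python
import numpy as np

def modal_coherence(alpha):
    """Modal coherence matrix of spherical-harmonic coefficients.

    alpha: complex array of shape (snapshots, modes) holding the SH
    coefficients of the soundfield for one time-frequency bin over several
    short-time snapshots. Returns a (modes, modes) normalized
    cross-coherence matrix usable as a CNN input feature.
    """
    R = alpha.conj().T @ alpha / alpha.shape[0]   # cross-correlation of modes
    d = np.sqrt(np.real(np.diag(R)))              # per-mode RMS magnitudes
    return R / np.outer(d, d)                     # normalize to coherence

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 4)) + 1j * rng.standard_normal((64, 4))
C = modal_coherence(a)
# the diagonal of a coherence matrix is 1 by construction
```

Stacking these matrices over frequency gives a fixed-size, array-geometry-aware input, which is what lets a pattern-recognition model replace classical subspace DOA methods.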

  • Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method
    arXiv.cs.SD Pub Date : 2020-03-18
    Yuan Gong; Jian Yang; Christian Poellabauer

    With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks. Previous efforts to address this concern have focused primarily on single-channel audio. In this paper, we introduce a novel neural network-based replay attack detection model that further leverages

  • TensorFlow Audio Models in Essentia
    arXiv.cs.SD Pub Date : 2020-03-16
    Pablo Alonso-Jiménez; Dmitry Bogdanov; Jordi Pons; Xavier Serra

    Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference. To show the potential of this new interface with TensorFlow, we provide a number of pre-trained

  • High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
    arXiv.cs.SD Pub Date : 2020-03-17
    Jinyu Li; Rui Zhao; Eric Sun; Jeremy H. M. Wong; Amit Das; Zhong Meng; Yifan Gong

    While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM

  • Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
    arXiv.cs.SD Pub Date : 2020-03-17
    Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

    In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference
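The pre-separation stage amounts to time-frequency masking of the mixture. A minimal sketch of that stage with soft ratio masks (the post-filter itself is not sketched; the mask formulation here is a generic assumption):

```python
import numpy as np

def ratio_masks(source_mag_estimates, eps=1e-8):
    """Soft time-frequency masks from per-source magnitude estimates."""
    total = sum(source_mag_estimates) + eps
    return [s / total for s in source_mag_estimates]

def pre_separate(mixture_mag, masks):
    """Pre-separation: apply each soft mask to the mixture magnitude.

    Residual interference left in these estimates is what an end-to-end
    post-filter would then be trained to remove.
    """
    return [m * mixture_mag for m in masks]

mix = np.array([[2.0, 4.0]])                      # |STFT| of the mixture
est = [np.array([[1.0, 3.0]]), np.array([[1.0, 1.0]])]
s1, s2 = pre_separate(mix, ratio_masks(est))
# the soft masks sum to ~1, so the two estimates add back up to the mixture
```

Because soft masks never fully suppress the competing speaker, each output keeps some cross-talk, motivating a learned second stage.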

Contents have been reproduced by permission of the publishers.