AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries



arXiv - CS - Sound, Pub Date: 2021-04-28, DOI: arxiv-2104.13553
Woosung Choi, Minseok Kim, Marco A. Martínez Ramírez, Jaehwa Chung, Soonyoung Jung

This paper proposes a neural network that performs audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is "transparent"; it usually carries information from multiple sources, in contrast to a pixel in an image. To address this challenging problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.
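The "transparency" point can be illustrated with a toy sketch. This is not AMSS-Net and not the paper's benchmark; the sine-wave stand-ins for "vocals" and "drums", the sample rate, and the gain of 0.5 are all assumptions made for illustration. It shows that every mixture sample is a sum of contributions from both sources, so an operation applied to the mixture as a whole (here, a gain change) also alters the source that was meant to be preserved.

```python
import math

SAMPLE_RATE = 8000  # Hz; arbitrary value chosen for this toy example


def sine(freq_hz, n_samples, amplitude=1.0):
    """Return n_samples of a sine wave as a list of floats."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def mix(*sources):
    """Sum any number of equal-length sources sample by sample."""
    return [sum(samples) for samples in zip(*sources)]


def apply_gain(signal, gain):
    """Scale every sample of a signal by a constant factor."""
    return [gain * s for s in signal]


def max_abs_diff(a, b):
    """Largest absolute sample-wise difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


n = 256
vocals = sine(440.0, n, amplitude=0.5)  # hypothetical stand-in for "vocals"
drums = sine(110.0, n, amplitude=0.3)   # hypothetical stand-in for all other sources
mixture = mix(vocals, drums)            # each sample carries both sources

# Desired result of "halve the vocals": only the vocals change.
# Building it requires access to the separate sources.
target = mix(apply_gain(vocals, 0.5), drums)

# Naive attempt that only sees the mixture: the gain hits the drums as well.
naive = apply_gain(mixture, 0.5)

err_naive = max_abs_diff(naive, target)
print("max error of mixture-level gain vs. source-level target:", err_naive)
```

The residual error is exactly the unwanted change to the drums, which is why a system given only the mixture has to separate the sources in some form, explicitly or latently, before it can manipulate one of them.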


Updated: 2021-04-29
