AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries
arXiv - CS - Sound Pub Date : 2021-04-28 , DOI: arxiv-2104.13553
Woosung Choi, Minseok Kim, Marco A. Martínez Ramírez, Jaehwa Chung, Soonyoung Jung

This paper proposes a neural network that performs audio transformations on user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is "transparent"; it usually carries information from multiple sources, in contrast to a pixel in an image. To address this challenging problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.

Updated: 2021-04-29