Learning De-identified Representations of Prosody from Raw Audio
arXiv - CS - Sound, Pub Date: 2021-07-17, DOI: arxiv-2107.08248
Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. Whereas prior work has relied on conditioning models on bottlenecks, we introduce a set of inductive biases that exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations. Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. We use minimum description length probing to show that our representations have selectively learned the subcomponents of non-timbral prosody, and that the product quantizer naturally disentangles them without using bottlenecks. We derive an information-theoretic definition of speech de-identifiability and use it to demonstrate that our prosody representations are less identifiable than other speech representations.

Updated: 2021-07-20