Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text



arXiv - CS - Multimedia, Pub Date: 2020-03-27, DOI: arxiv-2003.12265
Alexander Schindler, Sergiu Gordea, Peter Knees

We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness. By applying Latent Semantic Indexing (LSI), we embed the corresponding textual information into a latent vector space, from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolutional Recurrent Neural Network (CRNN). In this way, we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual metadata provided by numerous European digital libraries. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection. The learned representations perform comparably to the baseline of handcrafted features and exceed it in similarity retrieval precision at higher cut-offs, while using only 15% of the baseline's feature vector length.
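The text-driven triplet selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy metadata strings, the function names (`lsi_embed`, `select_triplets`, `triplet_loss`), the similarity thresholds, and the margin are all assumptions made for illustration, and the CRNN that would produce the audio embeddings is left out. LSI is approximated here by a plain TF-IDF matrix followed by a truncated SVD in NumPy.

```python
import numpy as np

def lsi_embed(docs, n_topics=2):
    """Embed texts with TF-IDF followed by truncated SVD (a basic form of LSI)."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for tok in d.lower().split():
            counts[row, index[tok]] += 1.0
    # Smoothed inverse document frequency (one common variant, assumed here).
    df = (counts > 0).sum(axis=0)
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0
    tfidf = counts * idf
    # Keep only the strongest latent topics.
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    emb = u[:, :n_topics] * s[:n_topics]
    # Unit-normalise rows so that dot products are cosine similarities.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)

def select_triplets(text_emb, pos_thresh=0.8, neg_thresh=0.2):
    """Return (anchor, positive, negative) track indices from text similarity.

    The thresholds are hypothetical; the abstract does not state the values used.
    """
    sim = text_emb @ text_emb.T
    n = len(sim)
    triplets = []
    for a in range(n):
        pos = [j for j in range(n) if j != a and sim[a, j] >= pos_thresh]
        neg = [j for j in range(n) if j != a and sim[a, j] <= neg_thresh]
        triplets.extend((a, p, q) for p in pos for q in neg)
    return triplets

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy stand-in for per-track metadata text (invented for this sketch).
docs = [
    "folk song violin recording",
    "violin folk song dance",
    "bird call forest field",
    "forest bird call dawn",
]
triplets = select_triplets(lsi_embed(docs))
print(triplets)
```

In a training loop, each selected index triple would pick three audio tracks, the audio network would embed them, and `triplet_loss` would be applied to those audio embeddings, so that relatedness derived from text shapes the audio representation space without any structured labels.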


Updated: 2020-03-30
