Do We Need Sound for Sound Source Localization?
arXiv - CS - Sound Pub Date : 2020-07-11 , DOI: arxiv-2007.05722
Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo Yamaguchi, Shigeo Morishima

In sound source localization using both visual and aural information, it remains unclear how much each modality, image or sound, contributes to the result; in other words, do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing the task into two steps: (i) "potential sound source localization", which localizes possible sound sources using only visual information, and (ii) "object selection", which identifies which of those objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and, more importantly, we find that despite the constraint on available information, step (i) alone achieves similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated on the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects in this dataset can be identified using visual information alone, and thus that the dataset is inadequate for evaluating a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that forces both visual and aural information to be leveraged, and we verify this property through several experiments.
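The two-step decomposition described in the abstract can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the function names, the prototype-based visual scoring, the embedding dimensions, and the cosine-similarity threshold are all hypothetical stand-ins for the paper's learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def localize_potential_sources(region_features):
    # Step (i), hypothetical: score candidate regions using ONLY visual
    # features, here by projecting onto a fixed "sound-source" prototype.
    d = region_features.shape[1]
    prototype = np.ones(d) / np.sqrt(d)
    return region_features @ prototype

def select_sounding_objects(region_features, audio_embedding, threshold=0.5):
    # Step (ii), hypothetical: keep the candidates whose embedding is
    # consistent with the audio, measured by cosine similarity.
    sims = region_features @ audio_embedding
    sims = sims / (np.linalg.norm(region_features, axis=1)
                   * np.linalg.norm(audio_embedding) + 1e-8)
    return sims > threshold, sims

# Toy data: 4 candidate regions with 8-dim embeddings; the audio embedding
# is constructed to match region 2 (shared embedding space is assumed).
regions = rng.normal(size=(4, 8))
audio = regions[2] + 0.1 * rng.normal(size=8)

visual_scores = localize_potential_sources(regions)   # step (i)
mask, sims = select_sounding_objects(regions, audio)  # step (ii)
print(int(np.argmax(sims)))  # index of the region most consistent with the audio
```

In this toy setup, step (ii) selects region 2 because its embedding nearly coincides with the audio embedding, illustrating how aural information is only needed to disambiguate among the visually proposed candidates.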

Updated: 2020-07-14