Gestalt Principles Emerge When Learning Universal Sound Source Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing


IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1), Pub Date: 2022-05-27, DOI: 10.1109/taslp.2022.3178233
Han Li, Kean Chen, Bernhard U. Seeber


Sound source separation is an essential part of auditory scene analysis and remains an urgent challenge for machine hearing. In this paper, a fully convolutional time-domain audio separation network (ConvTasNet) is trained for universal two-source separation of mixtures drawn from speech, environmental sounds, and music. Beyond the separation performance of the network, our main concern is the underlying separation mechanisms. Through a series of classic auditory segregation experiments, we systematically explore the principles the network has learned for simultaneous and sequential organization. The results show that, without any prior knowledge of auditory scene analysis being imparted to it, the network spontaneously learns separation mechanisms from raw waveforms that are similar to those that have developed over many years in humans. The Gestalt principles for separation in the human auditory system are shown to be effective in our network: harmonicity; onset synchrony and common fate (coherent modulation in amplitude and frequency); proximity; continuity; and similarity. Because it follows Gestalt principles, the universal sound source separation network is not limited to specific sources and, like human hearing, can be applied to a variety of acoustic situations, providing new directions for solving the problem of auditory scene analysis.
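The abstract does not describe the test stimuli, so the following is only an illustrative sketch of the kind of probe signals that classic segregation experiments rely on: a harmonic complex with one optionally mistuned component (a harmonicity cue) and a second complex whose onset is delayed (an onset-synchrony cue). The function names, the 8 kHz sample rate, the fundamental frequencies, and all parameter values are assumptions chosen for illustration, not taken from the paper.

```python
import math


def harmonic_complex(f0, n_harmonics, duration, sample_rate=8000,
                     mistuned_harmonic=None, mistune_percent=0.0):
    """Equal-amplitude harmonic complex, peak-normalized to 1.0.

    mistuned_harmonic is a 1-based harmonic index (hypothetical parameter)
    whose frequency is shifted by mistune_percent percent.
    """
    n_samples = int(round(duration * sample_rate))
    samples = [0.0] * n_samples
    for k in range(1, n_harmonics + 1):
        freq = k * f0
        if k == mistuned_harmonic:
            freq *= 1.0 + mistune_percent / 100.0
        for n in range(n_samples):
            samples[n] += math.sin(2.0 * math.pi * freq * n / sample_rate)
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def delay(signal, seconds, sample_rate=8000):
    """Shift a signal later in time by zero-padding, keeping its length."""
    pad = int(round(seconds * sample_rate))
    return ([0.0] * pad + signal)[:len(signal)]


def mix(a, b):
    """Average two equal-length signals into a single-channel mixture."""
    return [0.5 * (x + y) for x, y in zip(a, b)]


# Harmonicity probe: the 4th harmonic of a 200 Hz complex is shifted by 8%.
in_tune = harmonic_complex(200.0, 10, 0.5)
mistuned = harmonic_complex(200.0, 10, 0.5,
                            mistuned_harmonic=4, mistune_percent=8.0)

# Onset-synchrony probe: a 300 Hz complex starts 50 ms after a 200 Hz one.
late_source = delay(harmonic_complex(300.0, 10, 0.5), 0.05)
mixture = mix(in_tune, late_source)
```

A mixture such as `mixture` could then be passed to a trained two-source separation model to check whether its two outputs split along the manipulated cue; that evaluation step is not sketched here because the abstract gives no detail about it.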


Updated: 2024-08-28
