X-DC: Explainable Deep Clustering Based on Learnable Spectrogram Templates.
Neural Computation ( IF 2.9 ) Pub Date : 2021-06-11 , DOI: 10.1162/neco_a_01392
Chihiro Watanabe 1 , Hirokazu Kameoka 1

Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. In particular, it has been shown that a monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to assign a continuous embedding vector to each time-frequency (TF) bin and to measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for TF bins dominated by the same speaker are drawn close to each other. One concern regarding DC is that the embedding process described by a DNN is a black box, which is usually very hard to interpret. A potential weakness of this noninterpretable black-box structure is that it lacks the flexibility to address mismatches between training and test conditions (caused, for instance, by reverberation). To overcome this limitation, in this letter we propose the concept of explainable deep clustering (X-DC), whose network architecture can be interpreted as a process of fitting learnable spectrogram templates to an input spectrogram followed by Wiener filtering. During training, the elements of the spectrogram templates and their activations are constrained to be nonnegative, which promotes sparsity in their values and thus improves interpretability. The main advantage of this framework is that its physically interpretable structure naturally allows us to incorporate a model adaptation mechanism into the network. We experimentally show that the proposed X-DC enables us to visualize and understand the clues the model uses to determine the embedding vectors while achieving speech separation performance comparable to that of the original DC models.
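The two ingredients named in the abstract can be sketched numerically. Below is a minimal, illustrative NumPy sketch (not the authors' implementation; all function names and shapes are assumptions): the standard deep-clustering affinity objective, which pulls embeddings of same-speaker TF bins together, and the construction of per-speaker Wiener masks from nonnegative spectrogram templates and activations, as X-DC's architecture is interpreted.

```python
import numpy as np

def dc_affinity_loss(V, Y):
    """Deep-clustering objective ||V V^T - Y Y^T||_F^2.
    V: (N, D) embedding vectors, one row per TF bin.
    Y: (N, C) one-hot dominant-speaker assignments.
    The loss is zero exactly when the pairwise affinities of the
    embeddings match the true same-speaker indicator matrix."""
    return np.sum((V @ V.T - Y @ Y.T) ** 2)

def wiener_masks(W, H, speaker_of_template, n_speakers, eps=1e-8):
    """Per-speaker Wiener masks from nonnegative templates.
    W: (F, K) spectrogram templates, H: (K, T) activations,
    speaker_of_template: length-K array mapping each template to a speaker.
    Returns masks of shape (C, F, T) that sum to ~1 over speakers."""
    # Per-speaker magnitude estimate S_c = W_c H_c from that
    # speaker's subset of templates and activations.
    S = np.stack([
        W[:, speaker_of_template == c] @ H[speaker_of_template == c, :]
        for c in range(n_speakers)
    ])  # (C, F, T)
    # Wiener filtering: each speaker's share of the total energy.
    return S / (S.sum(axis=0, keepdims=True) + eps)
```

Multiplying each mask element-wise with the mixture spectrogram yields the per-speaker spectrogram estimates; because the templates and activations are nonnegative, each mask value can be read off as one speaker's fractional contribution to a TF bin.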

Updated: 2021-06-11