Spoken language identification in unseen channel conditions using modified within-sample similarity loss
Pattern Recognition Letters (IF 5.1), Pub Date: 2022-04-16, DOI: 10.1016/j.patrec.2022.04.018
H. Muralikrishna, Dileep Aroor Dinesh

State-of-the-art spoken language identification (LID) systems use sophisticated training strategies to improve robustness to the unseen channel conditions found in real-world test samples. However, these approaches require training samples from multiple channels, with corresponding channel labels, which are not available in many cases. Recent research has shown that a channel-invariant representation of the speech can be learned using an auxiliary loss function called the within-sample similarity loss (WSSL), which does not require samples from multiple channels. Specifically, WSSL encourages the LID network to ignore channel-specific content in the speech by minimizing the similarity between two utterance-level embeddings of the same sample. However, because the WSSL approach operates at the sample level, it ignores channel variations that may be present across different training samples within the same dataset. In this work, we propose a modification to the WSSL approach to address this limitation. Along with the WSSL, the proposed modified WSSL (mWSSL) approach additionally considers similarities with two global-level embeddings that represent the average channel-specific content in a given mini-batch of training samples. This modification gives the network a better view of the channel-specific content in the training dataset, leading to improved performance in unseen channel conditions.
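The loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of cosine similarity, the batch-mean construction of the two global-level embeddings, and the weighting factor `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    # Cosine similarity along the last axis (broadcasts over the batch).
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def wssl(e1, e2):
    # Within-sample similarity loss: mean similarity between the two
    # utterance-level embeddings of each sample. Minimizing this pushes
    # the network to discard the channel-specific content they share.
    return float(np.mean(cosine_sim(e1, e2)))

def mwssl(e1, e2, alpha=0.5):
    # Modified WSSL (sketch): in addition to the per-sample term, penalize
    # similarity with two global-level embeddings, taken here as the
    # mini-batch means of the two embedding streams.
    g1 = e1.mean(axis=0, keepdims=True)
    g2 = e2.mean(axis=0, keepdims=True)
    global_term = np.mean(cosine_sim(e1, g2)) + np.mean(cosine_sim(e2, g1))
    return wssl(e1, e2) + alpha * float(global_term)

# Usage: e1, e2 are (batch, dim) embedding matrices produced by the network.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = mwssl(e1, e2)
```

In training, this auxiliary loss would be added to the primary LID classification loss, so the network learns language-discriminative embeddings while being penalized for retaining channel-specific content, both per sample and at the mini-batch level.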




Updated: 2022-04-21