Unsupervised Speech Signal-to-Symbol Transformation for Language Identification
Circuits, Systems, and Signal Processing (IF 2.3), Pub Date: 2020-04-28, DOI: 10.1007/s00034-020-01408-8
Saurabhchand Bhati, Shekhar Nayak, Sri Rama Murty Kodukula

This paper presents a new approach for unsupervised segmentation and labeling of acoustically homogeneous segments in speech signals. The virtual labels thus obtained are used to build unsupervised acoustic models in the absence of manual transcriptions. We refer to this approach as unsupervised speech signal-to-symbol transformation. It involves three main steps: (i) segmenting the speech signal into acoustically homogeneous regions, (ii) assigning consistent labels to acoustic segments with similar characteristics, and (iii) iteratively modeling the acoustic segments that share the same label. This work focuses on improving the initial segmentation and the acoustic segment labeling. A new kernel-Gram matrix-based approach is proposed for segmentation. It determines the number of segments automatically and achieves performance comparable to state-of-the-art algorithms. Segment labeling is formulated in a graph clustering framework. Because graph clustering methods require extensive computational resources for large datasets, a new graph growing-based strategy is proposed to make the algorithm scalable. A two-stage iterative modeling procedure alternately refines the segment boundaries and the segment labels. The proposed method achieves the highest normalized mutual information and purity on the TIMIT dataset. The quality of the virtual labels is assessed by building a language identification (LID) system for Indian languages, with a bigram language model built on these virtual phones. The LID system built using the virtual labels and the corresponding language model performs very close to a system trained using manual labels and to an i-vector-based LID system.
Fusing the scores of our unsupervised LID system with those of the i-vector approach outperforms the LID system built under the supervision of manual labels by a relative margin of 31.19%, demonstrating that unsupervised LID systems using virtual labels can be on par with supervised systems.
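
The abstract evaluates the virtual labels with purity and normalized mutual information (NMI). As a rough illustration of what these two scores measure, the sketch below computes both from a pair of label sequences. It is not the authors' evaluation code: the function names are hypothetical, the labels are assumed to be aligned one-to-one (for example, one label per frame), and NMI is normalized by the geometric mean of the two entropies, which is only one of several common conventions and may differ from the one used in the paper.

```python
import math
from collections import Counter


def purity(ref, hyp):
    """Share of items that carry the majority reference label of their cluster."""
    joint = Counter(zip(hyp, ref))
    best = {}  # cluster id -> size of its largest reference class
    for (cluster, _label), count in joint.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(ref)


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(ref, hyp):
    """Mutual information divided by sqrt(H(ref) * H(hyp)) (assumed normalization)."""
    n = len(ref)
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    joint = Counter(zip(ref, hyp))
    mi = sum(
        (c / n) * math.log(c * n / (ref_counts[r] * hyp_counts[h]))
        for (r, h), c in joint.items()
    )
    denom = math.sqrt(entropy(ref) * entropy(hyp))
    return mi / denom if denom > 0 else 0.0


# Toy data, invented for illustration only: manual phone labels vs. virtual labels.
manual = ["a", "a", "a", "b", "b", "b"]
virtual = [0, 0, 1, 1, 1, 1]
print(f"purity = {purity(manual, virtual):.3f}, NMI = {nmi(manual, virtual):.3f}")
```

Both scores are invariant to how the clusters are numbered, which is why they suit virtual labels whose identities carry no meaning of their own. Purity alone can be inflated by using many small clusters, so it is usually read together with NMI.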

Updated: 2020-04-28