Audio-visual Recognition of Overlapped speech for the LRS2 dataset


arXiv - CS - Sound. Pub Date: 2020-01-06, DOI: arxiv-2001.01656
Jianwei Yu, Shi-Xiong Zhang, Jian Wu, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu

Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state of the art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest that the proposed AVSR system outperformed the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction and produced recognition performance comparable to that of a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.
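The abstract reports its gains as absolute WER reductions, which are easy to confuse with relative ones. The following is a minimal sketch of the standard word-level edit-distance definition of WER and of the two ways a reduction can be quoted. It is not the authors' scoring pipeline; the function names and the example sentences are hypothetical and are not drawn from LRS2.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER, in percentage points (the style used in the abstract)."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Reduction as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    reference = "the cat sat on the mat"
    baseline = word_error_rate(reference, "the cat sat mat")
    improved = word_error_rate(reference, "the cat sat on a mat")
    print(f"baseline WER: {baseline:.2f}%")
    print(f"improved WER: {improved:.2f}%")
    print(f"absolute reduction: {absolute_reduction(baseline, improved):.2f} points")
    print(f"relative reduction: {relative_reduction(baseline, improved):.2f}%")
```

Under this definition, a figure such as "29.98% absolute" is a difference of 29.98 percentage points between two WER values, not a 29.98% fraction of the baseline WER.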


Updated: 2020-01-07
