MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition
arXiv - CS - Sound. Pub Date: 2021-02-25. DOI: arXiv:2102.12664. Authors: Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu
In this paper, we propose MixSpeech, a simple yet effective data augmentation
method based on mixup for automatic speech recognition (ASR). MixSpeech trains
an ASR model by taking a weighted combination of two different speech features
(e.g., mel-spectrograms or MFCC) as the input, and recognizing both text
sequences, where the two recognition losses use the same combination weight. We
apply MixSpeech on two popular end-to-end speech recognition models including
LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on
several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental
results show that MixSpeech achieves better accuracy than the baseline models
without data augmentation, and outperforms a strong data augmentation method
SpecAugment on these recognition tasks. Specifically, MixSpeech outperforms
SpecAugment with a relative PER improvement of 10.6$\%$ on TIMIT dataset, and
achieves a strong WER of 4.7$\%$ on WSJ dataset.
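The core mechanism described above can be sketched as follows: sample a mixing weight, blend the two feature tensors with it, and combine the two per-utterance recognition losses with that same weight. This is an illustrative sketch only, assuming same-shape (padded) features and a Beta-distributed weight as in standard mixup; the function names and the `alpha` parameter are hypothetical, not taken from the paper's code.

```python
import numpy as np

def mixspeech_batch(feat_a, feat_b, alpha=0.5, rng=None):
    """Blend two speech feature tensors (e.g. mel-spectrograms or MFCCs)
    with a single Beta(alpha, alpha)-sampled weight lam.

    Assumes feat_a and feat_b have the same shape, i.e. utterances have
    already been padded or trimmed to a common length.
    """
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    mixed = lam * feat_a + (1.0 - lam) * feat_b
    return mixed, lam

def mixspeech_loss(loss_a, loss_b, lam):
    """Combine the two recognition losses (one per original transcript)
    using the same weight lam that produced the mixed input."""
    return lam * loss_a + (1.0 - lam) * loss_b
```

The key design point carried over from mixup is that a single weight ties the input interpolation to the loss interpolation: the model sees one blended spectrogram but is supervised toward both text sequences, each in proportion to its share of the input.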
Updated: 2021-02-26