LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity
arXiv - CS - Sound. Pub Date: 2020-07-01, DOI: arxiv-2007.00659
Jordan J. Bird, Diego R. Faria, Anikó Ekárt, Cristiano Premebida, Pedro P. S. Ayrosa

In speech recognition problems, data scarcity often poses an issue because humans are reluctant to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify each subject against a large dataset of Flickr8k speakers, and is then compared to a transfer learning network performing the same task but with an initial weight distribution obtained by learning from the synthetic data generated by the two models. For all 7 subjects, the best results came from networks that had been exposed to synthetic data: the model pre-trained with LSTM-produced data achieved the best result 3 times and the GPT-2 equivalent 5 times (one subject's best result was a tie between the two models). Based on these results, we argue that speaker classification can be improved by combining a small amount of user data with exposure to synthetically generated MFCCs, which allows the networks to achieve near-maximal classification scores.
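To make the transfer-learning step concrete, the following is a minimal sketch (not the authors' code) of the idea described above: a speaker classifier is first pre-trained on synthetic MFCC vectors and then fine-tuned on the scarce real data, while the baseline trains the same network on the real data alone. The array shapes, layer sizes, class count, and the use of Keras are assumptions for illustration only.

```python
# Hedged sketch of pre-training on synthetic MFCCs, then fine-tuning on real data.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 26       # MFCC attributes per frame (assumed value)
N_SPEAKERS = 8    # e.g. one target subject vs. Flickr8k speakers (assumed value)

def build_classifier():
    """Small dense network over per-frame MFCC vectors (architecture is illustrative)."""
    model = models.Sequential([
        layers.Input(shape=(N_MFCC,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(N_SPEAKERS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Placeholder arrays standing in for (a) synthetic MFCCs produced by an LSTM or GPT-2
# generator and (b) the small amount of real MFCC data available per subject.
x_synth = np.random.randn(5000, N_MFCC).astype("float32")
y_synth = np.random.randint(0, N_SPEAKERS, size=5000)
x_real = np.random.randn(500, N_MFCC).astype("float32")
y_real = np.random.randint(0, N_SPEAKERS, size=500)

# 1) Pre-train on synthetic data: this supplies the initial weight distribution.
model = build_classifier()
model.fit(x_synth, y_synth, epochs=3, batch_size=64, verbose=0)

# 2) Transfer learning: continue training the same weights on the scarce real data.
model.fit(x_real, y_real, epochs=10, batch_size=32, verbose=0)

# The baseline is the same network trained on the real data alone (skipping step 1);
# comparing the two trainings yields the improvement reported in the abstract.
```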

Updated: 2020-07-06