STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning

arXiv - CS - Computation and Language. Pub Date: 2020-11-23, DOI: arxiv-2011.11387
Prakamya Mishra

In this paper, we present a novel multi-modal deep neural network architecture that uses speech and text entanglement to learn phonetically sound spoken-word representations. STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken word from the speech and text of its contextual spoken words, so that the model encodes meaningful latent representations. Unlike existing work, we use text along with speech for auditory representation learning, capturing semantic and syntactic information in addition to acoustic and temporal information. The latent representations produced by our model not only predicted the target phonetic sequences with an accuracy of 89.47% but also achieved results competitive with the textual word representation models Word2Vec and FastText (trained on textual transcripts) when evaluated on four widely used word similarity benchmark datasets. In addition, an investigation of the generated vector space demonstrated that the proposed model captures the phonetic structure of the spoken words. To the best of our knowledge, no existing work uses speech and text entanglement for learning spoken-word representations, which makes this work the first of its kind.
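The abstract does not spell out how the word similarity evaluation was scored. A common protocol for such benchmarks, assumed here and not confirmed by the abstract, is to compute the cosine similarity between the embeddings of each word pair and report the Spearman rank correlation against human similarity ratings. The sketch below illustrates that protocol; the function names, the toy embeddings, and the toy ratings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate_similarity(embeddings, pairs):
    """Correlate model cosine similarities with human ratings.

    Pairs containing a word without an embedding are skipped.
    """
    model_scores, human_scores = [], []
    for word_a, word_b, rating in pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Toy data for illustration only (not from the paper or any benchmark).
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
pairs = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 1.0),
    ("dog", "truck", 2.0),
]

rho = evaluate_similarity(embeddings, pairs)
print(f"Spearman correlation: {rho:.3f}")
```

With real data, `embeddings` would hold the latent spoken-word representations (or Word2Vec/FastText vectors for comparison) and `pairs` would come from a word similarity benchmark file; a higher correlation means the vector space agrees more closely with human judgments.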

Updated: 2020-11-25