End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN, arXiv - CS - Sound



arXiv - CS - Sound. Pub Date: 2021-01-13, DOI: arxiv-2101.05056
Manav Kaushik, Van Tung Pham, Eng Siong Chng

Automatic estimation of speaker height and age from acoustic features is widely used in human-computer interaction, forensics, and related applications. In this work, we propose a novel approach that uses an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder, which is able to capture long-term dependencies in the input acoustic features. We modify the conventionally used attention, which calculates context vectors as the sum of attention only across timeframes, by introducing a modified context vector that also takes into account total attention across encoder units, giving us a new cross-attention mechanism. Apart from this, we also investigate a multi-task learning approach for jointly estimating speaker height and age. We train and test our model on the TIMIT corpus. Our model outperforms several approaches in the literature. We achieve a root mean square error (RMSE) of 6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of 7.85 years and 8.75 years for male and female ages respectively. By tracking the attention weights allocated to different phones, we find that vowel phones are most important whilst stop phones are least important for the estimation task.
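The abstract gives no formulas, so the following is only a minimal sketch of the ideas it names, not the authors' implementation. `time_context` shows conventional attention pooling, where encoder outputs are summed across timeframes using softmax weights. `cross_context` is one hypothetical reading of "attention across encoder units", in which each unit of the context vector is additionally rescaled by a softmax weight over units. The scoring vectors `w` and `u`, the function names, and the toy values are all assumptions made for illustration. `rmse` is the standard root mean square error used to report the results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_context(H, w):
    """Conventional attention pooling over timeframes.

    H is a T x D list of encoder outputs (T frames, D encoder units).
    w is a length-D scoring vector (hypothetical learned parameter).
    Returns a length-D context vector.
    """
    scores = [sum(h * wd for h, wd in zip(frame, w)) for frame in H]
    alpha = softmax(scores)  # one attention weight per timeframe
    dims = len(H[0])
    return [sum(a * frame[d] for a, frame in zip(alpha, H)) for d in range(dims)]


def cross_context(H, w, u):
    """Assumed variant: also weight each encoder unit.

    u is a length-D scoring vector over encoder units (hypothetical).
    Each unit's score is u[d] times that unit's mean activation over time;
    this scoring rule is an illustrative choice, not taken from the paper.
    """
    context = time_context(H, w)
    frames = len(H)
    unit_means = [sum(frame[d] for frame in H) / frames for d in range(len(u))]
    beta = softmax([ud * m for ud, m in zip(u, unit_means)])  # one weight per unit
    return [b * c for b, c in zip(beta, context)]


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

With zero scoring vectors both softmaxes are uniform, so `time_context` reduces to the mean over frames and `cross_context` scales that mean by 1/D. In a trained model the context vector would feed a regression layer (or two, in the multi-task case) that outputs height and age.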


Updated: 2021-01-14
