Acoustic and temporal representations in convolutional neural network models of prosodic events
Speech Communication (IF 2.4), Pub Date: 2020-11-05, DOI: 10.1016/j.specom.2020.10.005
Sabrina Stehwien, Antje Schweitzer, Ngoc Thang Vu

Prosodic events such as pitch accents and phrase boundaries have various acoustic and temporal correlates that are used as features in machine learning models to automatically detect these events from speech. These features are often linguistically motivated, high-level features that are hand-crafted by experts to best represent the prosodic events to be detected or classified. An alternative approach is to use a neural network that is trained and optimized to learn suitable feature representations on its own. An open question, however, is what exactly the learned feature representation consists of, since the high-level output of a neural network is not readily interpreted. In this paper, we use a convolutional neural network (CNN) that learns such features from frame-based acoustic input descriptors. We are concerned with the question of what the CNN has learned after being trained on different datasets to perform pitch accent and phrase boundary detection. Specifically, we suggest a methodology for analyzing what temporal, acoustic and context information is latent in the learned feature representation. We use the output representations learned by the CNN to predict various manually computed (aggregated) features using linear regression. The results show that the CNN learns word duration implicitly, and indicate that certain acoustic features may help to locate relevant voiced regions in speech that are useful for detecting pitch accents and phrase boundaries. Finally, our analysis of the latent contextual information learned by the CNN involves a comparison with a sequential model (LSTM) to investigate similarities and differences in what both network types have learned.
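The probing methodology described above — predicting manually computed, aggregated features from a network's learned representations with linear regression — can be sketched in a few lines. The sketch below is illustrative only: it uses synthetic data in place of real CNN embeddings and a synthetic "word duration" target, and fits the linear probe with ordinary least squares. A high R² on held-out words would indicate that the target feature is linearly decodable from the representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-word representations produced by a trained CNN.
# In the paper these come from the network's output layer; here they
# are random vectors, with a target constructed to depend on them.
n_words, emb_dim = 1000, 32
emb = rng.normal(size=(n_words, emb_dim))

# Synthetic "word duration": a linear function of a few embedding
# dimensions plus noise, so the probe should recover it well.
true_w = np.zeros(emb_dim)
true_w[:4] = [0.8, -0.5, 0.3, 0.6]
duration = emb @ true_w + 0.1 * rng.normal(size=n_words)

# Train/test split over words.
X_tr, X_te = emb[:800], emb[800:]
y_tr, y_te = duration[:800], duration[800:]

# Linear probe: ordinary least squares with an intercept column.
A_tr = np.hstack([X_tr, np.ones((len(X_tr), 1))])
w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
pred = np.hstack([X_te, np.ones((len(X_te), 1))]) @ w

# Coefficient of determination (R^2) on held-out words.
r2 = 1.0 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
print(f"probe R^2 = {r2:.3f}")  # near 1.0: duration is linearly decodable
```

In the actual analysis, a low R² for a given hand-crafted feature would suggest that the feature is not (linearly) encoded in the learned representation, while a high R² — as for word duration in the paper's results — suggests the network has learned it implicitly.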




Updated: 2020-11-12