Unsupervised speech representation learning for behavior modeling using triplet enhanced contextualized networks, Computer Speech & Language



Computer Speech & Language (IF 3.1), Pub Date: 2021-04-22, DOI: 10.1016/j.csl.2021.101226
Haoqi Li, Brian Baucom, Shrikanth Narayanan, Panayiotis Georgiou

Speech encodes a wealth of information related to human behavior and has been used in a variety of automated behavior recognition tasks. However, extracting behavioral information from speech remains challenging, in part because training data are scarce: specific behavioral patterns often occur infrequently. Moreover, supervised behavioral modeling typically relies on domain-specific construct definitions and corresponding manually annotated data, which makes generalizing across domains difficult. In this paper, we exploit the stationary properties of human behavior within an interaction and present a representation learning method that captures behavioral information from speech in an unsupervised way. We hypothesize that nearby segments of speech share the same behavioral context and hence map onto similar underlying behavioral representations. We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework to capture the behavioral context and derive a manifold representation in which speech frames with similar behaviors lie closer together while frames with different behaviors remain farther apart. The models are trained on movie audio data and validated on diverse domains, including a couples therapy corpus and other publicly collected data (e.g., stand-up comedy). The encouraging results show the feasibility of unsupervised learning for cross-domain behavioral modeling.
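
The abstract does not state the loss used by TE-DCN, so the following is only a minimal sketch of a generic triplet margin loss, which is one common way to pull an anchor toward a "similar" sample and push it away from a "dissimilar" one. The function names, the Euclidean distance, the margin of 1.0, and the idea of taking the positive from a nearby segment and the negative from a distant one are all assumptions made for illustration, not details taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption (hypothetical, not from the paper): `positive` is the
    embedding of a speech segment close in time to `anchor`, and
    `negative` is the embedding of a segment far from it.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy 2-D embeddings, made up purely for illustration.
    anchor = [0.0, 0.0]
    nearby = [0.0, 1.0]         # distance 1.0 from the anchor
    far_negative = [3.0, 4.0]   # distance 5.0 from the anchor
    near_negative = [0.0, 0.5]  # distance 0.5 from the anchor

    # The negative is already farther than the positive by more than the margin.
    print(triplet_margin_loss(anchor, nearby, far_negative))   # 0.0
    # The negative is closer than the positive, so the loss is positive.
    print(triplet_margin_loss(anchor, nearby, near_negative))  # 1.5
```

In this sketch the loss is zero once the negative sits at least `margin` farther from the anchor than the positive does, which matches the stated goal of keeping frames with similar behaviors close and frames with different behaviors apart.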


Updated: 2021-05-09
