Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation
arXiv - CS - Human-Computer Interaction. Pub Date: 2021-06-16, DOI: arxiv-2106.08596
Van Thong Huynh, Guee-Sang Lee, Hyung-Jeong Yang, Soo-Hyung Kim

This paper presents an approach to the Evoked Expressions from Videos (EEV) challenge, which aims to predict the facial expressions evoked in viewers of a video. We take advantage of models pre-trained on large-scale computer vision and audio datasets to extract deep representations of the timestamps in each video. A temporal convolution network, rather than an RNN-like architecture, is used to model temporal relationships, owing to its advantages in memory consumption and parallelism. Furthermore, to handle timestamps with missing annotations, positional encoding is employed to preserve the temporal continuity of the input when these timestamps are discarded during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first-ranked performance in the EEV 2021 challenge.
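The role of positional encoding here can be illustrated with a minimal sketch: if frames with missing annotations are dropped, a standard sinusoidal positional encoding (evaluated at the original timestamp indices, not at the post-drop positions) lets the model see where each remaining frame actually sat in time. This is an assumption-laden illustration, not the authors' implementation; the function names, dimensions, and the use of NumPy are hypothetical.

```python
import numpy as np

def positional_encoding(timestamps, d_model):
    """Sinusoidal positional encoding evaluated at arbitrary timestamp
    indices, so frames dropped for missing annotations keep encodings
    that reflect their original temporal positions."""
    t = np.asarray(timestamps, dtype=np.float64)[:, None]     # (T, 1)
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]    # (1, d/2)
    angles = t / np.power(10000.0, 2.0 * i / d_model)         # (T, d/2)
    pe = np.empty((t.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)   # even feature dims get sine
    pe[:, 1::2] = np.cos(angles)   # odd feature dims get cosine
    return pe

# Frames 0..9 with annotations missing at 3, 4, and 7: only annotated
# timestamps are kept, but their encodings still reflect the original
# positions, preserving continuity for the temporal model.
kept = [0, 1, 2, 5, 6, 8, 9]
pe = positional_encoding(kept, d_model=8)
features = np.random.randn(len(kept), 8)   # stand-in for deep features
model_input = features + pe                # added before the temporal network
```

Because the encoding is a function of the raw timestamp, the gap between frames 2 and 5 remains visible to the model even though the intervening rows were removed from the training batch.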

Updated: 2021-06-17