Knowledge-driven Egocentric Multimodal Activity Recognition
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.2), Pub Date: 2020-12-17, DOI: 10.1145/3409332
Yi Huang, Xiaoshan Yang, Junyu Gao, Jitao Sang, Changsheng Xu

Recognizing activities from egocentric multimodal data collected by wearable cameras and sensors is gaining interest, as multimodal methods benefit from the complementarity of different modalities. However, since high-dimensional videos contain rich high-level semantic information while low-dimensional sensor signals describe simple motion patterns of the wearer, the large modality gap between the videos and the sensor signals poses a challenge for fusing the raw data. Moreover, the lack of large-scale egocentric multimodal datasets, due to the cost of the data collection and annotation processes, poses another challenge for employing complex deep learning models. To jointly deal with these two challenges, we propose a knowledge-driven multimodal activity recognition framework that exploits external knowledge to fuse multimodal data and reduce the dependence on large-scale training samples. Specifically, we design a dual-GCLSTM (Graph Convolutional LSTM) and a multi-layer GCN (Graph Convolutional Network) to collectively model the relations among activities and intermediate objects. The dual-GCLSTM is designed to fuse temporal multimodal features with top-down relation-aware guidance. In addition, we apply a co-attention mechanism to adaptively attend to the features of different modalities at different timesteps. The multi-layer GCN aims to learn relation-aware classifiers of activity categories. Experimental results on three publicly available egocentric multimodal datasets show the effectiveness of the proposed model.
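To make the co-attention idea in the abstract concrete, below is a minimal sketch (not the authors' released code) of per-timestep modality attention: video and sensor features are projected into a shared space, weighted adaptively at every timestep, and fused. The class name `ModalityCoAttention`, the feature dimensions, and the softmax-over-modalities design are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of co-attention fusion over video and sensor streams.
# All dimensions and the module design are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityCoAttention(nn.Module):
    def __init__(self, video_dim: int, sensor_dim: int, hidden_dim: int):
        super().__init__()
        # Project both modalities into a shared space so they can be compared.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.sensor_proj = nn.Linear(sensor_dim, hidden_dim)
        # Produces one attention logit per modality at every timestep.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, video_feat: torch.Tensor, sensor_feat: torch.Tensor):
        # video_feat: (batch, time, video_dim); sensor_feat: (batch, time, sensor_dim)
        v = torch.tanh(self.video_proj(video_feat))    # (B, T, H)
        s = torch.tanh(self.sensor_proj(sensor_feat))  # (B, T, H)
        logits = torch.cat([self.score(v), self.score(s)], dim=-1)  # (B, T, 2)
        weights = torch.softmax(logits, dim=-1)        # per-timestep modality weights
        fused = weights[..., 0:1] * v + weights[..., 1:2] * s       # (B, T, H)
        return fused, weights


if __name__ == "__main__":
    # Toy shapes: 2 clips, 8 timesteps, 512-d video features, 64-d sensor features.
    video = torch.randn(2, 8, 512)
    sensor = torch.randn(2, 8, 64)
    fused, weights = ModalityCoAttention(512, 64, 256)(video, sensor)
    print(fused.shape, weights.shape)  # torch.Size([2, 8, 256]) torch.Size([2, 8, 2])
```

In the paper's framework, a fused sequence of this kind would feed the dual-GCLSTM for temporal modeling, while the multi-layer GCN supplies relation-aware classifiers over activity categories; that wiring is described only at a high level in the abstract and is not reproduced here.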

Updated: 2020-12-17