Jointly Learning Visual Poses and Pose Lexicon for Semantic Action Recognition
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.3). Pub Date: 2020-02-01. DOI: 10.1109/tcsvt.2019.2890829
Lijuan Zhou, Wanqing Li, Philip Ogunbona, Zhengyou Zhang

A novel method for semantic action recognition through learning a pose lexicon is presented in this paper. A pose lexicon comprises a set of semantic poses, a set of visual poses, and a probabilistic mapping between the visual and semantic poses. This paper assumes that both the visual poses and the mapping are hidden, and proposes a method to simultaneously learn a visual pose model, which estimates the likelihood of an observed video frame being generated from hidden visual poses, and a pose lexicon model, which establishes the probabilistic mapping between the hidden visual poses and the semantic poses parsed from textual instructions. Specifically, the proposed method consists of a two-level hidden Markov model. One level represents the alignment between the visual poses and the semantic poses; the other represents a visual pose sequence, with each visual pose modeled as a Gaussian mixture. An expectation-maximization algorithm is developed to train the pose lexicon. With the learned lexicon, action classification is formulated as the problem of finding the maximum posterior probability of a given sequence of video frames following a given sequence of semantic poses, constrained by the most likely visual pose and alignment sequences. The proposed method was evaluated on the MSRC-12, WorkoutSU-10, WorkoutUOW-18, Combined-15, Combined-17, and Combined-50 action datasets using cross-subject, cross-dataset, zero-shot, and seen/unseen protocols.
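The core idea — marginalizing hidden visual poses through a learned lexicon to score a frame sequence against a semantic pose sequence — can be illustrated with a toy sketch. This is not the paper's full two-level HMM with EM training and alignment inference; it assumes a fixed 1:1 frame-to-semantic-pose alignment, single Gaussians instead of mixtures, and invented example values (`lexicon`, `visual_models`, the action labels) purely for illustration.

```python
import math

# Toy setup (assumed values, not from the paper):
# 2 semantic poses, 3 hidden visual poses, 1-D frame features.

# P(visual pose v | semantic pose s): the probabilistic mapping (lexicon).
lexicon = [
    [0.7, 0.2, 0.1],   # semantic pose 0
    [0.1, 0.3, 0.6],   # semantic pose 1
]

# Each visual pose modeled here as a single Gaussian (mean, std);
# the paper uses Gaussian mixtures per visual pose.
visual_models = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)]

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def frame_likelihood(x, s):
    """P(frame x | semantic pose s), marginalizing over hidden visual poses."""
    return sum(lexicon[s][v] * gauss_pdf(x, *visual_models[v])
               for v in range(len(visual_models)))

def sequence_loglik(frames, semantic_seq):
    """Log-likelihood of a frame sequence aligned 1:1 with semantic poses."""
    return sum(math.log(frame_likelihood(x, s))
               for x, s in zip(frames, semantic_seq))

# Classification: pick the action (semantic pose sequence) that maximizes
# the likelihood of the observed frames.
frames = [0.1, 0.2, 3.9, 4.2]
actions = {"raise_arm": [0, 0, 1, 1], "lower_arm": [1, 1, 0, 0]}
best = max(actions, key=lambda a: sequence_loglik(frames, actions[a]))
```

In the paper, the alignment between frames and semantic poses is itself hidden and inferred jointly with the visual pose sequence, which is what the two-level HMM and the EM algorithm provide; the sketch above fixes the alignment only to keep the marginalization step visible.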

Updated: 2020-02-01