Self-adjustable hyper-graphs for video pose estimation based on spatial-temporal subspace construction
Science China Information Sciences (IF 8.8). Pub Date: 2021-05-20. DOI: 10.1007/s11432-019-2869-x
Jizhou Ma, Shuai Li, Hong Qin, Aimin Hao, Qinping Zhao

In recent years, many supervised video pose estimation methods have achieved growing success on well-labeled training datasets. Nonetheless, when facing roughly-labeled training data, it remains challenging to intrinsically encode the spatial-temporal coherency of video contents for robust video pose estimation. Some studies aimed to directly improve and refine the existing confidence maps by incorporating spatial-temporal structure models [1, 2]. Li et al. [2] suggested fixing some reliable estimations and formulating the propagation processing as a 3D trajectory completion problem. In contrast, Moon et al. [1] assumed that state-of-the-art 2D human pose estimation methods have similar error distributions. Zhou and Torre [3] first learned a codebook from a motion capture dataset, and then employed a bi-linear model to estimate poses by matching the movement mode against the dense trajectory tracing result; this enables related patterns to be expressed by sub-patterns instead of a uniform probability distribution. However, this flexible codebook-based framework [3] still requires a large amount of extra annotated data to ensure the model's dataset-specific applicability.

To overcome this drawback, we advocate a new hierarchical hyper-graph approach based on intrinsic spatial-temporal subspace exploration and propagation. The "mis-matched" hyper-graph subspaces that result from imperfect data can be adaptively improved by exploiting the intrinsic continuities of visual contents. At the theoretical level, the key idea of subspace exploration is to design a maximum matching subspace (MMS) operator, which helps propagate highly correlated action information from local video frames to all video sequences in the spatial-temporal subspaces. The hyper-graph is built solely on our MMS metric, and it can synchronously encode cross-video action similarity, inner-video temporal coherency, and the synergetic relationship among different body joints. In contrast to a conventional "explicit hyper-graph", we construct an "implicit hyper-graph" by hierarchically representing relationships at different levels: we conceptually split the "explicit hyper-graph" into a series of sub-graphs (structure), which are formulated as optimized maximum matching subspaces, and these subspaces ("sub-graphs") are then re-connected with each other via a global MMS-operator-based affinity matrix (metric).

Given a set of videos belonging to the same action category, each video is divided into a group of overlapping short video segments. Initial poses are extracted with ResNet50 [4], and the segments are represented as $N_P$ pose sequences $P = \{P_1, P_2, P_3, \ldots, P_{N_P}\}$. A pose sequence $P_i \in \mathbb{R}^{n_k \times n_f}$ covers $n_k$ body joints over $n_f$ consecutive frames. To align two pose sequences, similar to [3], we apply Procrustes analysis to obtain a spatial transition matrix $Q$, which conducts an affine transformation including translation, rotation, and scaling.
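To make the segment construction above concrete, here is a minimal sketch (not from the paper) of splitting a per-frame pose track into overlapping short segments; the segment length `n_f = 8` and stride of 4 are illustrative choices, not values reported by the authors:

```python
import numpy as np

def make_segments(track, n_f=8, stride=4):
    """Split a per-frame joint track of shape (T, n_k, 2), e.g. from a
    ResNet50-based pose extractor, into overlapping short segments.

    Each segment is returned as an (n_k, n_f, 2) array, mirroring the
    joints-by-frames layout of the pose sequences P_i in the text.
    """
    segments = []
    for start in range(0, track.shape[0] - n_f + 1, stride):
        seg = track[start:start + n_f]           # (n_f, n_k, 2), frames first
        segments.append(seg.transpose(1, 0, 2))  # -> (n_k, n_f, 2)
    return segments
```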
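The per-frame alignment step can be sketched as follows, assuming each pose is a set of 2D joint coordinates. This is the standard orthogonal Procrustes solution (translation, rotation, and scale), not necessarily the authors' exact implementation of the transition matrix $Q$:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_align(src, dst):
    """Align joint set `src` (n_k, 2) to `dst` (n_k, 2).

    Returns the aligned copy of `src` together with the rotation R,
    scale s, and translation t such that aligned = s * src @ R + t.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Rotation minimizing ||src_c @ R - dst_c||_F
    R, _ = orthogonal_procrustes(src_c, dst_c)
    # Closed-form optimal scale, then the matching translation
    s = np.trace(dst_c.T @ (src_c @ R)) / np.trace(src_c.T @ src_c)
    t = mu_dst - s * mu_src @ R
    return s * src @ R + t, R, s, t
```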
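Finally, the sub-graphs are coupled through an MMS-operator-based affinity matrix. The MMS metric itself is not spelled out in this excerpt, so the sketch below (reusing `procrustes_align` from the previous snippet) substitutes the mean Procrustes alignment residual between segments under a Gaussian kernel, purely as an illustrative stand-in metric:

```python
import numpy as np

def affinity_matrix(segments, sigma=1.0):
    """Pairwise affinity over pose segments of shape (n_k, n_f, 2).

    NOTE: stand-in metric only; the paper's hyper-graph is built on the
    MMS metric, which this excerpt does not define in full.
    """
    n = len(segments)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Mean per-frame residual after Procrustes alignment
            res = np.mean([
                np.linalg.norm(
                    procrustes_align(segments[i][:, f], segments[j][:, f])[0]
                    - segments[j][:, f])
                for f in range(segments[i].shape[1])
            ])
            A[i, j] = A[j, i] = np.exp(-res ** 2 / (2 * sigma ** 2))
    return A
```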
