A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video


arXiv - CS - Robotics Pub Date : 2020-03-11 , DOI: arxiv-2003.14179
Junfa Liu, Juan Rojas, Zhijun Liang, Yihui Li, and Yisheng Guan

Spatio-temporal information is key to resolving occlusion and depth ambiguity in 3D pose estimation. Previous methods have focused on either temporal contexts or local-to-global architectures that embed fixed-length spatio-temporal information. To date, no proposal has effectively captured varying spatio-temporal sequences simultaneously and flexibly while also achieving real-time 3D pose estimation. In this work, we improve the learning of kinematic constraints in the human skeleton (posture, local kinematic connections, and symmetry) by modeling local and global spatial information via attention mechanisms. To adapt to single- and multi-frame estimation, a dilated temporal model is employed to process varying skeleton sequences. Importantly, we also carefully design the interleaving of spatial semantics with temporal dependencies to achieve a synergistic effect. To this end, we propose a simple yet effective graph attention spatio-temporal convolutional network (GAST-Net) that comprises interleaved temporal convolutional and graph attention blocks. Experiments on two challenging benchmark datasets (Human3.6M and HumanEva-I) and YouTube videos demonstrate that our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation. Code, video, and supplementary information are available at: http://www.juanrojas.net/gast/
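To make the two building blocks named in the abstract more concrete, here is a minimal pure-Python sketch of a dilated temporal convolution and of attention restricted to a skeleton graph. It is a toy illustration only, not the authors' GAST-Net implementation. The kernel size of 3, the dilations 1, 3 and 9, the dot-product attention scores, the three-joint chain, and all function names are assumptions chosen for illustration; the abstract specifies none of them.

```python
import math


def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output of stacked dilated
    1D convolutions with stride 1 (hypothetical helper)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(seq, weights, dilation):
    """'Valid' (unpadded) dilated 1D convolution over a list of floats."""
    k = len(weights)
    span = (k - 1) * dilation  # frames consumed by one kernel application
    return [
        sum(weights[j] * seq[t + j * dilation] for j in range(k))
        for t in range(len(seq) - span)
    ]


def graph_attention(features, neighbors):
    """Softmax attention restricted to each joint and its skeleton neighbours.

    features:  list of per-joint feature vectors (lists of floats)
    neighbors: dict mapping a joint index to the indices it connects to
    Scores are plain dot products, an assumed stand-in for whatever
    scoring function a real graph attention block would learn.
    """
    out = []
    for i, f_i in enumerate(features):
        idx = [i] + list(neighbors[i])
        scores = [sum(a * b for a, b in zip(f_i, features[j])) for j in idx]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * features[j][c] for e, j in zip(exps, idx))
            for c in range(len(f_i))
        ])
    return out


# Temporal side: three stacked layers shrink a 27-frame track to one value.
track = [float(t) for t in range(27)]
for d in (1, 3, 9):
    track = dilated_conv1d(track, [1 / 3, 1 / 3, 1 / 3], d)

# Spatial side: a hypothetical 3-joint chain (0 - 1 - 2) with 2-D features.
chain = {0: [1], 1: [0, 2], 2: [1]}
mixed = graph_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], chain)

print(receptive_field(3, [1, 3, 9]))  # 27
print(len(track))                     # 1
```

Stacking dilations this way grows the temporal window geometrically with depth, while a shorter input sequence simply passes through fewer valid positions. An interleaved design in the spirit of the abstract would alternate a temporal step like `dilated_conv1d` with a spatial step like `graph_attention`, although the exact ordering and block structure used by GAST-Net are not given here.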


Updated: 2020-10-21
