MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence ( IF 20.8 ) Pub Date : 2023-05-05 , DOI: 10.1109/tpami.2020.3021756
Shijie Li 1 , Yazan Abu Farha 1 , Yun Liu 2 , Ming-Ming Cheng 2 , Juergen Gall 1
With the success of deep learning in classifying short trimmed videos, more attention has been focused on temporally segmenting and classifying activities in long untrimmed videos. State-of-the-art approaches for action segmentation utilize several layers of temporal convolution and temporal pooling. Despite the capabilities of these approaches in capturing temporal dependencies, their predictions suffer from over-segmentation errors. In this paper, we propose a multi-stage architecture for the temporal action segmentation task that overcomes the limitations of the previous approaches. The first stage generates an initial prediction that is refined by the next ones. In each stage we stack several layers of dilated temporal convolutions covering a large receptive field with few parameters. While this architecture already performs well, lower layers still suffer from a small receptive field. To address this limitation, we propose a dual dilated layer that combines both large and small receptive fields. We further decouple the design of the first stage from the refining stages to address the different requirements of these stages. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our models achieve state-of-the-art results on three datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
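The core building block the abstract describes is a stack of dilated temporal convolutions whose dilation doubles at each layer, so the receptive field grows exponentially while the parameter count stays small. The sketch below (a minimal NumPy illustration, not the authors' implementation) shows a kernel-size-3 dilated 1D convolution with "same" padding and the resulting receptive-field arithmetic: for L layers with dilation 2^l at layer l, the receptive field is 2^(L+1) − 1 frames.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Kernel-size-3 dilated 1D convolution over a 1-D feature signal x,
    zero-padded so the output has the same length as the input.
    w holds the 3 filter taps; tap k looks k*dilation frames ahead."""
    pad = dilation
    xp = np.pad(x, pad)
    return sum(w[k] * xp[k * dilation : k * dilation + len(x)] for k in range(3))

def receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked dilated convolutions with
    dilation 2**l at layer l. For kernel 3 this is 2**(num_layers+1) - 1."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel - 1) * 2 ** layer
    return rf

# Ten layers already cover 2047 frames around each position.
print(receptive_field(10))
```

The dual dilated layer proposed in the paper pairs each such convolution (dilation 2^l) with a second one whose dilation runs in the opposite direction (2^(L−l)) and fuses the two, so even the lowest layers see a large temporal context; that fusion step is omitted here for brevity.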

Updated: 2020-09-04