Finite-Time Event-Triggered Stabilization for Discrete-Time Fuzzy Markov Jump Singularly Perturbed Systems
IEEE Transactions on Cybernetics (IF 9.4) Pub Date: 2022-09-30, DOI: 10.1109/tcyb.2022.3207430
Wenhai Qi , Can Zhang , Guangdeng Zong , Shun-Feng Su , Mohammed Chadli

Video captioning is the challenging task of automatically generating natural and meaningful textual descriptions for given videos. State-of-the-art methods aggregate spatial information in the video encoder at an early stage, which has two drawbacks: 1) Early aggregation in the encoder can discard considerable spatial detail, which may consequently lead to incorrect word choices in the subsequent text decoder. 2) Without text guidance, the spatial attention learned in the video encoder may not be discriminative enough. To solve these problems, we propose a Stay-in-Grid video CAPtioning method, SGCAP, which makes full use of grid-level spatial features and consists of a Bilinear Sequential Attention Encoder (BSAE) and a Cross-modal Sequential Attention Decoder (CSAD). The former fully explores and retains grid-level discriminative representations in the video encoder, while the latter performs late spatial aggregation in the decoder, attending to the most relevant regions under the supervision of the input words. Experimental results on three public datasets demonstrate the effectiveness of our method, showing superior performance over multiple state-of-the-art video captioning models. Source code and pre-trained models will be made publicly available.
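The core idea of late spatial aggregation can be illustrated with a minimal sketch: instead of pooling grid cells inside the video encoder, the decoder keeps every grid-level feature and lets a text-side query attend over all of them. The function below is an illustrative toy, not the authors' implementation; the feature shapes, the single-query scaled dot-product attention, and all names are assumptions for demonstration only.

```python
import numpy as np

def late_spatial_aggregation(text_query, grid_feats):
    """Toy sketch of text-guided late spatial aggregation.

    text_query: (d,) embedding of the current word context (assumed).
    grid_feats: (T, G, d) grid-level features for T frames x G cells,
                kept un-pooled by the encoder (the "stay-in-grid" idea).
    Returns a (d,) video context vector and the (T, G) attention map.
    """
    T, G, d = grid_feats.shape
    flat = grid_feats.reshape(T * G, d)       # keep every grid cell
    scores = flat @ text_query / np.sqrt(d)   # scaled dot-product scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # normalize over all cells
    context = weights @ flat                  # text-supervised pooling
    return context, weights.reshape(T, G)

# Toy usage: 2 frames, 4 grid cells each, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 4, 8))
query = rng.standard_normal(8)
ctx, attn = late_spatial_aggregation(query, feats)
```

Because pooling happens only here, after the text query is known, spatial detail survives the encoder and each generated word can attend to the grid cells most relevant to it.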

Updated: 2024-08-28