Elsevier

Pattern Recognition Letters

Volume 143, March 2021, Pages 19-26

Self-attention binary neural tree for video summarization

https://doi.org/10.1016/j.patrec.2020.12.016

Highlights

  • A self-attention binary neural tree (SABTNet) is proposed for video summarization.

  • SABTNet is the first attempt to combine self-attention modules and decision trees for video summarization.

  • SABTNet evaluates a video shot in a divide-and-conquer manner.

  • SABTNet performs hierarchical coarse-to-fine shot feature learning via self-attention.

  • Experimental results show that the proposed SABTNet achieves state-of-the-art performance.

Abstract

In this paper, we address the problem of shot-level video summarization, which aims at selecting a subset of video shots as a summary to represent the original video contents compactly and completely. Most existing methods rely on various network architectures to learn a single score predictor for shot ranking and selection. Different from these methods, we plug network feature learning into a binary neural tree to consider multi-path predictions for each shot, thus enabling the shot evaluation from different aspects. Due to the hierarchical structure of the tree, video shots can be coarse-to-fine encoded by imposing self-attention on them along branches, leading to favorable predictions. Extensive experiments were conducted on two real-world datasets, and the results reveal that the proposed method achieves superior performance in comparison with previous state-of-the-art methods.

Introduction

In recent years, video summarization has become a hot research topic, especially with the explosive increase of videos uploaded to video-sharing platforms by personal users. Owing to the lack of professional shooting and editing skills, most such videos are redundant and time-consuming to browse. In this case, video summarization can generate a compact and non-redundant summary of a given video while preserving its main information. This technology applies to many real-world domains, including movie recapping, surveillance video browsing, and sports video highlighting. Video summarization methods can operate at the object level [1], [2], frame level [3], or shot level [4], according to the visual primitives (objects, frames, or shots) being processed.

Frame-level and shot-level methods are the most commonly investigated in video summarization. Modern frame-level methods [3], [5], [6] usually rely on Long Short-Term Memory (LSTM) networks to capture long-term and short-term dependencies within a video, and on carefully designed frame scoring networks to predict the probability of each frame being selected into the summary. For example, an LSTM followed by a fully connected network is adopted to predict frame scores in Zhang et al. [5]. Recently, LSTM has also been combined with generative adversarial learning [7] and reinforcement learning [6] for frame selection training. Although frame-level methods obtain promising results, they face two major challenges: (i) adjacent frames are visually very similar, which makes it difficult for frame scoring networks to make accurate selections; (ii) the widely used LSTM networks struggle to capture long-term dependencies spanning more than 100 frames [4]. To address these challenges, a frequently used solution is hierarchical modeling inspired by the video-shots-frames hierarchy, giving rise to research on shot-level video summarization. By treating video summarization as a hierarchical modeling problem, current shot-level methods [4], [8], [9] encode the frames within each continuous shot into shot-level features and predict selection scores on segmented shots rather than on separate frames, which facilitates exploiting temporal similarities and dependencies within a video. In this paper, we also focus on shot-level video summarization. A shot here is a clip of the original video, consisting of a sequence of highly cohesive consecutive frames.
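To make the shot-level encoding step concrete, the sketch below pools per-frame features into one feature per shot given segment boundaries. Mean pooling here is only a stand-in for the learned Bi-LSTM encoder described later; the function name, shapes, and boundaries are illustrative assumptions, not from the original paper.

```python
import numpy as np

def encode_shots(frame_feats, boundaries):
    """Pool per-frame features into shot-level features.

    frame_feats: (num_frames, d) array of frame descriptors.
    boundaries:  list of (start, end) pairs, one per shot, end exclusive,
                 covering the video without overlap.
    Mean pooling stands in for the Bi-LSTM shot encoder here.
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in boundaries])

# A toy video of 10 frames with 4-d features, segmented into 3 shots.
frames = np.arange(40, dtype=float).reshape(10, 4)
shots = encode_shots(frames, [(0, 3), (3, 7), (7, 10)])
```

Scoring then operates on the `(num_shots, d)` matrix instead of individual frames, which sidesteps the near-duplicate-frame problem noted above.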
Different from previous methods that learn a single score predictor for frame/shot evaluation and ranking [4], [5] on top of an individual feature learning architecture, we evaluate a shot from different aspects via a tree structure. As shown in Fig. 1, people naturally follow a divide-and-conquer strategy to evaluate a shot in terms of diversity, representativeness, and other factors. Motivated by this intuition, we cast shot-level video summarization as a regression decision tree scoring problem, where different root-to-leaf paths correspond to different evaluation factors.

Inspired by Tanno et al. [10] and Ji et al. [11], we design a self-attention binary neural tree (SABTNet) for video summarization. Specifically, we start by dividing a video into several shots and adopt a bi-directional Long Short-Term Memory (Bi-LSTM) network to extract shot-level visual features. We also incorporate self-attention along the edges of the tree structure to focus on different parts of a video. For the non-leaf nodes, we adopt neural networks as routing functions to determine the root-to-leaf computational paths. Each leaf node carries a score predictor, which represents an assessment function of one factor for shot evaluation. This design combines neural networks and decision trees to gain the benefits of both: it not only possesses the sequence modeling ability of self-attention and the nonlinear fitting ability of neural networks, but also inherits the coarse-to-fine hierarchical feature learning ability of decision trees. In this way, different root-to-leaf paths in the SABTNet suggest different evaluation factors. The final prediction score of each video shot is obtained by a weighted summation of the predictions from all leaf nodes.

We follow ACNet [11] in using a full binary tree of fixed height as the decision tree structure and adopt the soft decision scheme [12]. We conduct extensive evaluations on two widely used datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of our proposed method over previous state-of-the-art methods. We also conduct several ablation studies to comprehensively understand the influence of important hyperparameters: the height of the tree, the number of self-attention heads, the number of stacked transformer blocks, and the summary rate.
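The soft decision scheme over a full binary tree can be sketched as follows: each internal node routes a shot feature left or right with a sigmoid probability, and the final score is the path-probability-weighted sum of the leaf predictors' outputs. This minimal numpy illustration uses linear routing and leaf functions and omits the self-attention modules along edges and all training details; every name and shape is a hypothetical stand-in, not the paper's actual architecture.

```python
import numpy as np

def soft_binary_tree_score(x, height, routers, leaves):
    """Score a shot feature x with a soft full binary tree.

    routers: (w, b) pairs for the internal nodes in breadth-first order.
    leaves:  (w, b) pairs for the leaf score predictors.
    Each internal node routes left with probability sigmoid(w @ x + b);
    the score is the path-probability-weighted sum of leaf outputs.
    """
    n_leaves = 2 ** height
    # path_prob[i] is the probability of reaching node i
    # (1-indexed, breadth-first), starting from the root.
    path_prob = {1: 1.0}
    for node in range(1, n_leaves):            # internal nodes 1 .. n_leaves-1
        w, b = routers[node - 1]
        p_left = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        path_prob[2 * node] = path_prob[node] * p_left
        path_prob[2 * node + 1] = path_prob[node] * (1.0 - p_left)
    score = 0.0
    for i in range(n_leaves):                  # leaves n_leaves .. 2*n_leaves-1
        w, b = leaves[i]
        leaf_score = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # keep score in (0, 1)
        score += path_prob[n_leaves + i] * leaf_score
    return score

rng = np.random.default_rng(0)
d, height = 16, 2                              # toy sizes: 3 internal nodes, 4 leaves
routers = [(rng.normal(size=d), 0.0) for _ in range(2 ** height - 1)]
leaves = [(rng.normal(size=d), 0.0) for _ in range(2 ** height)]
s = soft_binary_tree_score(rng.normal(size=d), height, routers, leaves)
```

Because the path probabilities over the leaves sum to one and each leaf output lies in (0, 1), the final score is a convex combination and stays in (0, 1), matching the shot-score range used by the method.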

The main contributions of our proposed method are summarized as follows:

  • We propose a novel yet simple self-attention binary neural tree for shot-level video summarization. The model evaluates a shot from different aspects in a divide-and-conquer manner, learning shot features in a coarse-to-fine fashion.

  • To the best of our knowledge, this work is the first attempt to combine self-attention and decision trees for video summarization.

  • We conduct various experiments on two benchmark datasets, SumMe [3] and TVSum [13]. The results show that the presented method achieves state-of-the-art performance.


Video summarization

Video summarization aims at shortening an input video into a compact summary, which keeps the main information of the video. Existing summarization methods roughly fall into the category of unsupervised or supervised learning. Our method belongs to the latter.

Unsupervised methods design hand-crafted models for a representative and diverse selection of video content [14]. Among various models, clustering could be the most intuitive one [15]: frames/shots with high visual similarities are grouped together, and representatives of the resulting clusters form the summary.

Approach

Our goal is to predict a score ranging from 0 to 1 for each shot. The higher the score a shot obtains, the more likely the shot will be selected into the final summary. To this end, we propose a self-attention binary neural tree (SABTNet) model, including the backbone network, shot encoding, branch routing, self-attention, and score prediction modules.
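Given per-shot scores in [0, 1], a summary is typically assembled by maximizing total score under a length budget (benchmark protocols commonly cap the summary at roughly 15% of the video length), which is a 0/1 knapsack problem. The sketch below is a generic dynamic-programming solution under the assumption of integer shot lengths in frames; it illustrates the standard selection step rather than this paper's specific pipeline.

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack: pick shots maximizing total score under a length budget.

    scores:  per-shot importance scores.
    lengths: integer shot lengths (frames).
    budget:  maximum total length of the summary.
    Returns the sorted indices of the selected shots.
    """
    n = len(scores)
    # best[i][c] = best total score using the first i shots within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if lengths[i - 1] <= c:
                cand = best[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand
    # Backtrack to recover which shots were taken.
    picked, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            picked.append(i - 1)
            c -= lengths[i - 1]
    return sorted(picked)

# Toy example: five shots, budget of 6 frames (~15% of a 40-frame video).
idx = select_shots([0.9, 0.2, 0.7, 0.4, 0.8], [3, 2, 3, 2, 3], 6)
```

Selection by thresholding or top-k ranking is also possible, but the knapsack formulation respects the length budget exactly, which is why it is the common choice in shot-level evaluation protocols.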

As shown in Fig. 2, given a video v with Mv frames, we first need to divide it into non-overlapping shots by shot segmentation algorithms [8],

Experiments

We begin this section by introducing the experimental setting. Then, we show and analyze the experimental results in detail. Finally, we display the qualitative results of our method.

Conclusion

In this paper, we propose a novel self-attention binary neural tree, SABTNet, for shot-level video summarization. It endows a binary tree with the capability of hierarchical feature refinement by drawing self-attention into each edge of the tree, which enables a divide-and-conquer shot evaluation along different root-to-leaf paths. Extensive comparisons on two real-world datasets show that the proposed SABTNet performs favorably against previous state-of-the-art methods.

This is the first step

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant 61976029.

References (31)

  • K. Zhang et al.

    Retrospective encoders for video summarization

    ECCV

    (2018)
  • R. Tanno et al.

    Adaptive neural trees

    ICML

    (2019)
  • R. Ji et al.

    Attention convolutional binary neural tree for fine-grained visual categorization

    CVPR

    (2020)
  • N. Frosst et al.

    Distilling a neural network into a soft decision tree

    CoRR

    (2017)
  • Y. Song et al.

    TVSum: summarizing web videos using titles

    CVPR

    (2015)