Self-attention binary neural tree for video summarization
Introduction
In recent years, video summarization has become a hot research topic, especially with the explosive increase in videos uploaded to video-sharing platforms by personal users. Because such videos typically lack professional shooting and editing, they are usually redundant and time-consuming to browse. In this case, one may need video summarization to generate a compact, non-redundant summary of a given video while preserving its main information. This technology can be applied to many real-world domains, including movie recapping, surveillance video browsing, and sports video highlighting. Video summarization methods operate at the object level [1], [2], frame level [3], or shot level [4], according to the visual primitives (objects, frames, or shots) being processed.
Frame-level and shot-level methods are the most commonly investigated in video summarization. Modern frame-level methods [3], [5], [6] usually rely on Long Short-Term Memory (LSTM) networks to capture long-term and short-term dependencies within a video, and on carefully designed frame scoring networks to predict the probability of each frame being selected into the video summary. For example, an LSTM followed by a fully connected network is adopted to predict frame scores in Zhang et al. [5]. Recently, LSTM has also been combined with generative adversarial learning [7] and reinforcement learning [6] for frame selection training. Although frame-level methods obtain promising results, they face two major challenges: (i) adjacent frames are visually very similar, which makes it difficult for the frame scoring networks to make accurate selections; and (ii) the LSTM networks widely used in frame-level methods struggle to capture long-term dependencies spanning more than 100 frames [4]. To address these two challenges, a frequently used solution is hierarchical modeling, inspired by the hierarchical video-shots-frames structure, which gives rise to shot-level video summarization. By treating video summarization as a hierarchical modeling problem, current shot-level methods [4], [8], [9] encode the frames within each continuous shot into shot-level features and predict selection scores on segmented shots rather than on separate frames, which facilitates exploiting temporal similarities and dependencies within a video. In this paper, we also focus on shot-level video summarization. A shot here is a clip of the original video consisting of a sequence of consecutive, highly cohesive frames.
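The shot-level pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's method: mean pooling stands in for the learned Bi-LSTM shot encoder, and `score_fn` is a hypothetical placeholder for a learned scoring module.

```python
import numpy as np

def summarize_shot_level(frame_feats, shot_bounds, score_fn, ratio=0.15):
    """Score segmented shots rather than individual frames.

    frame_feats: (n_frames, d) array of per-frame features.
    shot_bounds: list of (start, end) index pairs, end exclusive.
    score_fn:    maps a shot-level feature vector to an importance score.
    ratio:       fraction of shots kept in the summary.
    """
    # Encode each shot; mean pooling stands in for a learned Bi-LSTM encoder.
    shot_feats = np.stack([frame_feats[s:e].mean(axis=0) for s, e in shot_bounds])
    scores = np.array([score_fn(f) for f in shot_feats])
    k = max(1, int(round(ratio * len(shot_bounds))))
    keep = np.argsort(scores)[::-1][:k]  # top-k shots by predicted score
    return sorted(keep.tolist())

# Toy usage: 12 frames, 4 shots of 3 frames each, vector norm as a dummy scorer.
feats = np.arange(24, dtype=float).reshape(12, 2)
bounds = [(0, 3), (3, 6), (6, 9), (9, 12)]
picked = summarize_shot_level(feats, bounds, np.linalg.norm, ratio=0.5)
```

Scoring pooled shot features instead of raw frames sidesteps the near-duplicate-frame problem noted above, since adjacent, visually similar frames collapse into a single decision unit.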
Unlike previous methods that directly learn a single score predictor for frame/shot evaluation and ranking [4], [5] on top of a single feature learning architecture, we evaluate a shot from different aspects through a tree structure. As shown in Fig. 1, people tend to follow a divide-and-conquer strategy, evaluating a shot in terms of diversity, representativeness, and other factors. Motivated by this intuition, we view shot-level video summarization as a regression decision tree scoring problem, where different root-to-leaf paths correspond to different evaluation factors.
Inspired by Tanno et al. [10] and Ji et al. [11], we design a self-attention binary neural tree (SABTNet) for video summarization. Specifically, we start by dividing a video into several shots and adopt a bi-directional Long Short-Term Memory (Bi-LSTM) network to extract shot-level visual features. We also incorporate self-attention along the edges of the tree structure to focus on different parts of a video. For the non-leaf nodes, we adopt neural networks as routing functions to determine the root-to-leaf computational paths. Each leaf node is paired with a score predictor, which represents an assessment function of one factor for shot evaluation. This design combines neural networks and decision trees to gain the benefits of both: it not only possesses the sequence modeling ability of self-attention and the nonlinear fitting ability of neural networks, but also inherits the coarse-to-fine hierarchical feature learning ability of decision trees. By this means, different root-to-leaf paths in the SABTNet suggest different evaluation factors. The final prediction score of each video shot is obtained by a weighted summation of the predictions from all the leaf nodes in the tree.
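The soft tree computation described above — sigmoid routing at non-leaf nodes, a score predictor at each leaf, and a path-probability-weighted sum — can be sketched as follows. This is a simplified stand-in, not the paper's implementation: routing functions and leaf predictors are reduced to linear layers, and the self-attention applied along tree edges is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SoftBinaryTreeScorer:
    """Minimal soft-decision full binary tree over shot features.

    Each internal node routes with a sigmoid gate; each leaf holds its own
    (here linear) score predictor. The final shot score is the sum of leaf
    predictions weighted by the probability of their root-to-leaf paths.
    """

    def __init__(self, dim, height, seed=0):
        rng = np.random.default_rng(seed)
        n_inner = 2 ** height - 1          # internal routing nodes
        n_leaf = 2 ** height               # leaf score predictors
        self.route_w = rng.normal(size=(n_inner, dim)) * 0.1
        self.leaf_w = rng.normal(size=(n_leaf, dim)) * 0.1
        self.height = height

    def score(self, x):
        total = 0.0
        for leaf in range(2 ** self.height):
            prob, node = 1.0, 0
            for level in range(self.height):
                go_right = (leaf >> (self.height - 1 - level)) & 1
                p_right = sigmoid(self.route_w[node] @ x)
                prob *= p_right if go_right else (1.0 - p_right)
                node = 2 * node + 1 + go_right  # child index in heap layout
            total += prob * sigmoid(self.leaf_w[leaf] @ x)
        # Path probabilities sum to 1 and leaves lie in (0, 1),
        # so the weighted sum is a valid score in (0, 1).
        return total
```

Because the routing is soft, every leaf contributes to every shot's score, and the whole tree stays differentiable for end-to-end training.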
We follow ACNet [11] in using a full binary tree of fixed height as the decision tree structure, and adopt the soft decision scheme [12]. We conduct extensive evaluations on two widely used datasets, SumMe and TVSum. The results demonstrate the superiority of our proposed method over previous state-of-the-art methods. We also conduct ablation studies to comprehensively understand the influence of several important parameters: the height of the tree, the number of self-attention heads, the number of stacked transformer blocks, and the summary rate.
The main contributions of our proposed method are summarized as follows:
- We propose a novel yet simple self-attention binary neural tree for shot-level video summarization. The model follows a divide-and-conquer approach, evaluating a shot from different aspects via coarse-to-fine shot feature learning.
- To the best of our knowledge, this work is the first attempt to combine self-attention and decision trees for video summarization.
- We conduct extensive experiments on two benchmark datasets, SumMe [3] and TVSum [13]. The results show that the presented method achieves state-of-the-art performance.
Video summarization
Video summarization aims at shortening an input video into a compact summary that keeps the main information of the video. Existing summarization methods fall roughly into two categories, unsupervised and supervised learning; our method belongs to the latter.
Unsupervised methods design hand-crafted models for a representative and diverse selection of video content [14]. Among various models, clustering could be the most intuitive one [15]: frames/shots with high visual similarities are grouped together, and representatives are selected from each group.
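The clustering idea can be illustrated with a short sketch: group frame features with a plain k-means loop and pick the frame closest to each centroid as a keyframe. This is a generic illustration of clustering-based selection, not the specific models cited above.

```python
import numpy as np

def cluster_keyframes(feats, k, iters=20, seed=0):
    """Pick k keyframes by clustering frame features (simple k-means)
    and returning the frame nearest to each final centroid."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign every frame to its nearest centroid, then re-estimate.
        d = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
    # One representative (keyframe) per cluster: the closest frame to it.
    return sorted(int(d[:, c].argmin()) for c in range(k))

# Toy usage: two well-separated groups of frames yield one keyframe each.
frames = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
keyframes = cluster_keyframes(frames, k=2)
```

Selecting one representative per cluster enforces diversity by construction, which is why clustering remains a common unsupervised baseline.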
Approach
Our goal is to predict a score ranging from 0 to 1 for each shot. The higher the score a shot obtains, the more likely the shot will be selected into the final summary. To this end, we propose a self-attention binary neural tree (SABTNet) model, including the backbone network, shot encoding, branch routing, self-attention, and score prediction modules.
As shown in Fig. 2, given an input video, we first need to divide it into non-overlapping shots by shot segmentation algorithms [8].
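The first step above can be sketched with a simple threshold-on-difference segmenter: cut the video wherever consecutive frame features change sharply. This is only a stand-in for the change-point algorithms cited in the paper, not a reimplementation of them.

```python
import numpy as np

def segment_shots(frame_feats, thresh=0.5):
    """Split a video into non-overlapping shots at large feature changes.

    frame_feats: (n_frames, d) array of per-frame features.
    thresh:      cut wherever the feature difference between consecutive
                 frames exceeds this value (an assumed, tunable parameter).
    Returns (start, end) index pairs covering all frames, end exclusive.
    """
    diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(diffs) if d > thresh] + [len(frame_feats)]
    return [(s, e) for s, e in zip(cuts[:-1], cuts[1:]) if e > s]

# Toy usage: one abrupt content change produces two shots.
feats = np.array([[0.0], [0.1], [5.0], [5.1]])
shots = segment_shots(feats)
```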
Experiments
We begin this section by introducing the experimental setting. Then, we show and analyze the experimental results in detail. Finally, we display the qualitative results of our method.
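Evaluation on SumMe and TVSum conventionally reports the F-score between the predicted summary and user-annotated summaries. A minimal sketch of that metric, assuming summaries are represented as sets of selected frame indices:

```python
def summary_f_score(pred, gt):
    """Frame-level F-score between a predicted and a ground-truth summary.

    pred, gt: sets of selected frame indices.
    """
    overlap = len(pred & gt)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # overlapping fraction of the prediction
    recall = overlap / len(gt)       # overlapping fraction of the ground truth
    return 2 * precision * recall / (precision + recall)

# Toy usage: half the selected frames overlap in both directions.
f = summary_f_score({1, 2, 3, 4}, {3, 4, 5, 6})
```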
Conclusion
In this paper, we propose a novel self-attention binary neural tree, SABTNet, for shot-level video summarization. It endows a binary tree with the capability of hierarchical feature refinement by drawing self-attention into each edge of the tree, which favors a divide-and-conquer shot evaluation from different root-to-leaf paths. Extensive comparison on two real-world datasets shows that the proposed SABTNet performs favorably against previous state-of-the-art methods.
This is a first step toward combining self-attention and decision trees for video summarization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grant 61976029.
References (31)

- Unsupervised object-level video summarization with online motion auto-encoder, PRL, 2020.
- DFP-ALC: automatic video summarization using distinct frame patch index and appearance based linear clustering, PRL, 2019.
- Efficient CNN based summarization of surveillance videos for resource-constrained devices, PRL, 2020.
- From keyframes to key objects: video summarization by representative object proposal selection, CVPR, 2016.
- Creating summaries from user videos, ECCV, 2014.
- Hierarchical recurrent neural network for video summarization, ACM MM, 2017.
- Video summarization with long short-term memory, ECCV, 2016.
- Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward, AAAI, 2018.
- Unsupervised video summarization with adversarial LSTM networks, CVPR, 2017.
- HSA-RNN: hierarchical structure-adaptive RNN for video summarization, CVPR, 2018.
- Retrospective encoders for video summarization, ECCV, 2018.
- Adaptive neural trees, ICML, 2019.
- Attention convolutional binary neural tree for fine-grained visual categorization, CVPR, 2020.
- Distilling a neural network into a soft decision tree, CoRR, 2017.
- TVSum: summarizing web videos using titles, CVPR, 2015.