Elsevier

Pattern Recognition Letters

Volume 143, March 2021, Pages 19-26

Self-attention binary neural tree for video summarization

https://doi.org/10.1016/j.patrec.2020.12.016

Highlights

  • A self-attention binary neural tree (SABTNet) is proposed for video summarization.

  • SABTNet is the first attempt to combine self-attention modules and decision trees for video summarization.

  • SABTNet evaluates a video shot in a divide-and-conquer manner.

  • SABTNet performs hierarchical coarse-to-fine shot feature learning via self-attention.

  • Experimental results show that the proposed SABTNet achieves state-of-the-art performance.

Abstract

In this paper, we address the problem of shot-level video summarization, which aims at selecting a subset of video shots as a summary to represent the original video contents compactly and completely. Most existing methods rely on various network architectures to learn a single score predictor for shot ranking and selection. Different from these methods, we plug network feature learning into a binary neural tree to consider multi-path predictions for each shot, thus enabling the shot evaluation from different aspects. Due to the hierarchical structure of the tree, video shots can be coarse-to-fine encoded by imposing self-attention on them along branches, leading to favorable predictions. Extensive experiments were conducted on two real-world datasets, and the results reveal that the proposed method achieves superior performance in comparison with previous state-of-the-art methods.

Introduction

In recent years, video summarization has become a hot research topic, especially with the explosive increase of videos uploaded to video-sharing platforms by personal users. Owing to the lack of professional shooting and editing skills, most such videos are redundant and time-consuming to browse. In this case, video summarization can generate a compact and non-redundant summary of a given video while preserving its main information. This technology applies to many real-world domains, including movie recapping, surveillance video browsing, and sports video highlighting. Video summarization methods can operate at the object level [1], [2], frame level [3], or shot level [4], according to the visual primitives (objects, frames, or shots) being processed.

Frame-level and shot-level methods are the most commonly investigated in video summarization. Modern frame-level methods [3], [5], [6] usually rely on Long Short-Term Memory (LSTM) networks to capture long-term and short-term dependencies within a video, and on carefully designed frame scoring networks to predict the probability of each frame being selected into the summary. For example, an LSTM followed by a fully connected network is adopted to predict frame scores in Zhang et al. [5]. Recently, LSTM has also been combined with generative adversarial learning [7] and reinforcement learning [6] for frame selection training. Although frame-level methods obtain promising results, they face two major challenges: (i) adjacent frames are visually very similar, which makes it difficult for frame scoring networks to make accurate selections; (ii) the widely used LSTM networks struggle to capture long-term dependencies spanning more than 100 frames [4]. To address these challenges, a frequently used solution is hierarchical modeling inspired by the video-shots-frames hierarchy, giving rise to research on shot-level video summarization. By treating video summarization as a hierarchical modeling problem, current shot-level methods [4], [8], [9] encode the frames within each continuous shot into shot-level features and predict selection scores on segmented shots rather than on separate frames, which facilitates exploiting temporal similarities and dependencies within a video. In this paper, we also focus on shot-level video summarization. A shot here is a clip of the original video, consisting of a sequence of highly cohesive consecutive frames.
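To make the shot-level encoding step concrete, the sketch below pools per-frame features into one feature per shot given segment boundaries. Mean pooling here is only a stand-in for the learned Bi-LSTM encoder described later; the function name, shapes, and boundaries are illustrative assumptions, not from the original paper.

```python
import numpy as np

def encode_shots(frame_feats, boundaries):
    """Pool per-frame features into shot-level features.

    frame_feats: (num_frames, d) array of frame descriptors.
    boundaries:  list of (start, end) pairs, one per shot, end exclusive,
                 covering the video without overlap.
    Mean pooling stands in for the Bi-LSTM shot encoder here.
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in boundaries])

# A toy video of 10 frames with 4-d features, segmented into 3 shots.
frames = np.arange(40, dtype=float).reshape(10, 4)
shots = encode_shots(frames, [(0, 3), (3, 7), (7, 10)])
```

Scoring then operates on the `(num_shots, d)` matrix instead of individual frames, which sidesteps the near-duplicate-frame problem noted above.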
Different from previous methods that learn a single score predictor for frame/shot evaluation and ranking [4], [5] on top of an individual feature learning architecture, we evaluate a shot from different aspects via a tree structure. As shown in Fig. 1, people naturally follow a divide-and-conquer strategy to evaluate a shot in terms of diversity, representativeness, and other factors. Motivated by this intuition, we cast shot-level video summarization as a regression decision tree scoring problem, where different root-to-leaf paths correspond to different evaluation factors.

Inspired by Tanno et al. [10] and Ji et al. [11], we design a self-attention binary neural tree (SABTNet) for video summarization. Specifically, we start by dividing a video into several shots and adopt a bi-directional Long Short-Term Memory (Bi-LSTM) network to extract shot-level visual features. We also incorporate self-attention along the edges of the tree structure to focus on different parts of a video. For the non-leaf nodes, we adopt neural networks as routing functions to determine the root-to-leaf computational paths. Each leaf node carries a score predictor, which represents an assessment function of one factor for shot evaluation. This design combines neural networks and decision trees to gain the benefits of both: it not only possesses the sequence modeling ability of self-attention and the nonlinear fitting ability of neural networks, but also inherits the coarse-to-fine hierarchical feature learning ability of decision trees. In this way, different root-to-leaf paths in the SABTNet suggest different evaluation factors. The final prediction score of each video shot is obtained by a weighted summation of the predictions from all leaf nodes.

We follow ACNet [11] in using a full binary tree of fixed height as the decision tree structure and adopt the soft decision scheme [12]. We conduct extensive evaluations on two widely used datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of our proposed method over previous state-of-the-art methods. We also conduct several ablation studies to comprehensively understand the influence of important hyperparameters: the height of the tree, the number of self-attention heads, the number of stacked transformer blocks, and the summary rate.
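The soft decision scheme over a full binary tree can be sketched as follows: each internal node routes a shot feature left or right with a sigmoid probability, and the final score is the path-probability-weighted sum of the leaf predictors' outputs. This minimal numpy illustration uses linear routing and leaf functions and omits the self-attention modules along edges and all training details; every name and shape is a hypothetical stand-in, not the paper's actual architecture.

```python
import numpy as np

def soft_binary_tree_score(x, height, routers, leaves):
    """Score a shot feature x with a soft full binary tree.

    routers: (w, b) pairs for the internal nodes in breadth-first order.
    leaves:  (w, b) pairs for the leaf score predictors.
    Each internal node routes left with probability sigmoid(w @ x + b);
    the score is the path-probability-weighted sum of leaf outputs.
    """
    n_leaves = 2 ** height
    # path_prob[i] is the probability of reaching node i
    # (1-indexed, breadth-first), starting from the root.
    path_prob = {1: 1.0}
    for node in range(1, n_leaves):            # internal nodes 1 .. n_leaves-1
        w, b = routers[node - 1]
        p_left = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        path_prob[2 * node] = path_prob[node] * p_left
        path_prob[2 * node + 1] = path_prob[node] * (1.0 - p_left)
    score = 0.0
    for i in range(n_leaves):                  # leaves n_leaves .. 2*n_leaves-1
        w, b = leaves[i]
        leaf_score = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # keep score in (0, 1)
        score += path_prob[n_leaves + i] * leaf_score
    return score

rng = np.random.default_rng(0)
d, height = 16, 2                              # toy sizes: 3 internal nodes, 4 leaves
routers = [(rng.normal(size=d), 0.0) for _ in range(2 ** height - 1)]
leaves = [(rng.normal(size=d), 0.0) for _ in range(2 ** height)]
s = soft_binary_tree_score(rng.normal(size=d), height, routers, leaves)
```

Because the path probabilities over the leaves sum to one and each leaf output lies in (0, 1), the final score is a convex combination and stays in (0, 1), matching the shot-score range used by the method.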

The main contributions of our proposed method are summarized as follows:

  • We propose a novel yet simple self-attention binary neural tree for shot-level video summarization. The model evaluates a shot from different aspects in a divide-and-conquer manner, learning shot features in a coarse-to-fine fashion.

  • To the best of our knowledge, this work is the first attempt to combine self-attention and decision trees for video summarization.

  • We conduct various experiments on two benchmark datasets, SumMe [3] and TVSum [13]. The results show that the presented method achieves state-of-the-art performance.


Video summarization

Video summarization aims at shortening an input video into a compact summary, which keeps the main information of the video. Existing summarization methods roughly fall into the category of unsupervised or supervised learning. Our method belongs to the latter.

Unsupervised methods design hand-crafted models for a representative and diverse selection of video content [14]. Among various models, clustering could be the most intuitive one [15]: frames/shots with high visual similarities are grouped together, and representatives of the resulting clusters form the summary.

Approach

Our goal is to predict a score ranging from 0 to 1 for each shot. The higher the score a shot obtains, the more likely the shot will be selected into the final summary. To this end, we propose a self-attention binary neural tree (SABTNet) model, including the backbone network, shot encoding, branch routing, self-attention, and score prediction modules.
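Given per-shot scores in [0, 1], a summary is typically assembled by maximizing total score under a length budget (benchmark protocols commonly cap the summary at roughly 15% of the video length), which is a 0/1 knapsack problem. The sketch below is a generic dynamic-programming solution under the assumption of integer shot lengths in frames; it illustrates the standard selection step rather than this paper's specific pipeline.

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack: pick shots maximizing total score under a length budget.

    scores:  per-shot importance scores.
    lengths: integer shot lengths (frames).
    budget:  maximum total length of the summary.
    Returns the sorted indices of the selected shots.
    """
    n = len(scores)
    # best[i][c] = best total score using the first i shots within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if lengths[i - 1] <= c:
                cand = best[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand
    # Backtrack to recover which shots were taken.
    picked, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            picked.append(i - 1)
            c -= lengths[i - 1]
    return sorted(picked)

# Toy example: five shots, budget of 6 frames (~15% of a 40-frame video).
idx = select_shots([0.9, 0.2, 0.7, 0.4, 0.8], [3, 2, 3, 2, 3], 6)
```

Selection by thresholding or top-k ranking is also possible, but the knapsack formulation respects the length budget exactly, which is why it is the common choice in shot-level evaluation protocols.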

As shown in Fig. 2, given a video v with Mv frames, we first need to divide it into non-overlapping shots by shot segmentation algorithms [8],

Experiments

We begin this section by introducing the experimental setting. Then, we show and analyze the experimental results in detail. Finally, we display the qualitative results of our method.

Conclusion

In this paper, we propose a novel self-attention binary neural tree, SABTNet, for shot-level video summarization. It endows a binary tree with the capability of hierarchical feature refinement by drawing self-attention into each edge of the tree, which enables a divide-and-conquer shot evaluation along different root-to-leaf paths. Extensive comparisons on two real-world datasets show that the proposed SABTNet performs favorably against previous state-of-the-art methods.

This is the first step

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant 61976029.

References (31)

  • K. Zhang et al.

    Retrospective encoders for video summarization

    ECCV

    (2018)
  • R. Tanno et al.

    Adaptive neural trees

    ICML

    (2019)
  • R. Ji et al.

    Attention convolutional binary neural tree for fine-grained visual categorization

    CVPR

    (2020)
  • N. Frosst et al.

    Distilling a neural network into a soft decision tree

    CoRR

    (2017)
  • Y. Song et al.

    TVSum: summarizing web videos using titles

    CVPR

    (2015)