Static video summarization using multi-CNN with sparse autoencoder and random forest classifier, Signal, Image and Video Processing


Signal, Image and Video Processing (IF 2.0), Pub Date: 2020-10-06, DOI: 10.1007/s11760-020-01791-4
Madhu S. Nair, Jesna Mohan

A video summarization system detects the parts of an input video that carry its essential message, with the aim of generating a compact yet meaningful representation of the original video. This paper presents a novel method for detecting key-frames for static summarization. The method extracts feature vectors using four pre-trained Convolutional Neural Network models (Multi-CNN). These vectors are fed to a Sparse Autoencoder, which outputs a combined representation of the input feature vectors. A Random Forest classifier then extracts the key-frames of the input video based on the combined feature vectors. The method is evaluated on two datasets, VSUMM and OVP, against the user summaries provided in the ground truth. It achieves an average F-score of 0.83 on the VSUMM dataset and 0.82 on the OVP dataset. These results are promising compared with other state-of-the-art methods in the literature. The Multi-CNN model also generates high-quality summaries consistently for videos of all categories. Further experiments show that the Multi-CNN model combined with a Random Forest classifier performs better than the other classifiers considered in the study.
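
The abstract reports an average F-score against user summaries but does not spell out how an automatic key-frame is matched to a user-selected one. The sketch below illustrates one plausible way such a score could be computed. It assumes key-frames are identified by frame index and counted as matching when their indices differ by at most `tol` frames. That matching rule, and the names `f_score`, `mean_f_score`, and `tol`, are illustrative assumptions, not the authors' evaluation protocol. Only the formula F = 2PR / (P + R) is standard.

```python
from typing import Sequence


def f_score(auto: Sequence[int], user: Sequence[int], tol: int = 0) -> float:
    """F-score of an automatic summary against a single user summary.

    Assumption: frames are given as indices, and an automatic key-frame
    matches a user key-frame when their indices differ by at most `tol`.
    Matching is greedy and one-to-one.
    """
    unmatched = list(user)  # user frames not yet paired with an automatic frame
    matched = 0
    for frame in auto:
        hit = next((u for u in unmatched if abs(u - frame) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)  # each user frame can be matched only once
            matched += 1
    if matched == 0:
        return 0.0
    precision = matched / len(auto)  # share of automatic frames that matched
    recall = matched / len(user)     # share of user frames that were recovered
    return 2 * precision * recall / (precision + recall)


def mean_f_score(auto: Sequence[int],
                 user_summaries: Sequence[Sequence[int]],
                 tol: int = 0) -> float:
    """Average the F-score over all user summaries of one video."""
    scores = [f_score(auto, user, tol) for user in user_summaries]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical frame indices, for illustration only.
    auto_summary = [10, 50, 90, 130]
    users = [[10, 50, 200], [12, 90, 130, 300]]
    print(mean_f_score(auto_summary, users, tol=2))
```

A dataset-level figure like the ones quoted above would then be the mean of this per-video value over all videos, again under the stated matching assumption.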


Updated: 2020-10-07
