1 Introduction

Video has become increasingly popular in many applications in recent years owing to increased storage capacity, more advanced network architectures, and easy access to digital cameras, especially in mobile phones. According to recent statistics, more than 500 hours of video are uploaded to the Internet every minute, and this sharp rise in the number of videos is expected to continue in the coming decades as demand for video content grows [1]. This growth poses serious challenges for video indexing, archiving, and retrieval systems. The main subject of videos on social networking Web sites is human actions, and automatic classification of their semantic content is essential for the appropriate use and management of these videos. However, classifying video content remains a challenging task owing to the complexity of video data.

Action recognition problems have been addressed with deep learning approaches in both the image and video domains. Convolutional neural networks (CNNs) have achieved state-of-the-art results over the past decade, yet CNN applications to video-based tasks are not as successful as those in image domains, e.g., object detection [2], segmentation [3], pattern recognition [4], and classification [5, 6]. Therefore, the power of recurrent neural networks (RNNs) in sequence learning has been employed to gather temporal information and improve video classification performance. Although combining CNNs and RNNs has achieved good results [7, 8], representing temporal information remains a demanding problem due to complex variations in actions and dynamic backgrounds in videos.

The performance of action recognition has been improved remarkably by transfer learning and the use of extra training data. Extensive video datasets such as HMDB51 [9], UCF-101 [10], Sports-1M [11], and YouTube-8M [12] have been published, and state-of-the-art results have recently been reported on these benchmark datasets [13,14,15,16,17].

The majority of current video classification methods classify videos by assigning a label to each frame. Nonetheless, considering all frames equally weakens classification performance, as some frames carry more distinctive information than others. We argue that selecting keyframes is essential for better classification performance. Thus, this paper proposes a novel keyframe extraction method that identifies an action template to preserve the succinct content, so that the entire video is represented by a set of keyframes. The main novelty of this work is the proposed keyframe extraction algorithm, which employs an action template for each video to extract and select the most distinctive frames under both static and dynamic backgrounds, without the need for complex procedures.

The main contributions of this work are the identification of the best architecture for combining CNNs and RNNs for video classification and the proposal of action template-based keyframe extraction, which aims to extract more informative frames by calculating the similarity only between action regions rather than whole frames. The former was partly presented in [18], has been substantially extended here, and serves as a baseline to test the newly proposed method. Extensive experiments in this paper show that the action template-based keyframe extraction method significantly outperformed the frame selection methods used for comparison.

The rest of the paper is organized as follows: related work is reviewed in Sect. 2, and the proposed keyframe extraction method is described in Sect. 3. The experiments conducted are detailed in Sect. 4, while the results are analyzed and discussed in Sect. 5. Conclusions are given in Sect. 6.

2 Related work

Keyframe extraction approaches can generally be categorized into six groups: uniform sampling-based, shot boundary-based, shot activity-based, visual content-based, motion analysis-based, and clustering-based. Although uniform sampling-based methods are simple and computationally efficient, they may fail to represent a video in two scenarios: too few keyframes for a short but semantically important segment, and too many keyframes with similar content for a long static segment [19].

Early works on keyframe extraction focused on shot boundary-based techniques [20, 21]. Basically, this technique takes the first or middle frame of each shot as the keyframe after shot boundary detection [22]. Video shot boundary detection methods were reviewed by Dey et al. [22]. Although shot boundary-based methods are easy to use and generalize, the extracted keyframe cannot fully represent the visual content and the selection is not stable.

The shot activity-based approach selects as keyframe the frame that differs least from the other frames in terms of a given similarity measure. Based on this concept, Lagendijk et al. proposed a keyframe selection method under the assumption that ‘every keyframe represents a contiguous interval in a shot’ [23]. In their work, the limits of the intervals and the location of the keyframe within each interval are optimized. Similarly, the Lloyd–Max algorithm is used in the design of a scalar quantizer in [24].

The visual content-based approach has been explored for content-based information retrieval and keyframe-based video summarization. In this approach, visual features of video clips are extracted to identify keyframes in movie segments. Zhong and Smoliar proposed an integrated system solution using video content information obtained from a parsing process [25]. The human attention mechanism has also been simulated to produce semantic video summaries based on keyframe extraction: the visual attention of each frame is quantified using a descriptor named the attention quantifier, which captures color conspicuousness and attention-drawing motion [26]. There have been many further attempts to analyze the visual content of video for keyframe extraction in video partitioning and summarization [27, 28].

As for the motion analysis-based approach, a novel algorithm was proposed for selecting keyframes within shots by employing optical flow computations to detect local minima of action in a single shot [29]; the motion in a shot is measured with optical flow analysis, and keyframes are selected at the local minima of the action. Mizher et al. also proposed an action keyframe extraction method based on the L1-norm and accumulated optical flows [30]. A similar approach has been used for salient region-based keyframe extraction, combining optical flow with mutual information entropy [31].

Clustering-based methods have also been used to extract keyframes. The idea is that frames are grouped based on their low-level features using a clustering method such as K-means, and the frames most similar to the cluster centers are selected [23]. Dynamic Delaunay graph clustering through an iterative edge pruning technique has also been used to extract keyframes [32]. Tan et al. demonstrated the KGAF-means method, which combines K-means with the artificial fish swarm algorithm to extract keyframe sequences [33].

The method proposed in this paper aims to tackle some important limitations of the aforementioned approaches. Although shot-based approaches to keyframe extraction are easy to use, early approaches are unable to capture temporal information. Clustering-based approaches are sensitive to the type of adopted kernel and the number of clusters, and have high time complexity [34]. Furthermore, video is a special kind of media content with temporal information and complex backgrounds, and another limitation of the mentioned methods is that they handle differences over entire frames rather than over a specific region of interest. This paper proposes a novel approach based on the similarity between regions of interest in consecutive frames to address these limitations. Different from previous works, we employ an action template to find the region of interest for each video.

It is noteworthy that some related work on deep neural networks for video classification has been presented in our previous work [18].

3 The proposed method

Keyframe extraction is a principal pre-processing step in video analysis. The purpose of extracting keyframes is to obtain more discriminative information from the video in an effective manner. Each video has its own characteristics such as saturation, brightness, contrast, camera angle, vibration, blur, location of the action, number of actors, type of action, length, and background. Given the large number of variables in each video, treating all videos equally is a major weakness in keyframe extraction. It is therefore necessary to recognize the region of action in a continuous action video. Considering the variations in complex video data, finding the location of the action is a challenging task, and this paper proposes a new keyframe extraction method built around it.

3.1 The proposed keyframe extraction approach

In general, the location of the action in a video is related to the point on the screen to which the viewer's attention is drawn. It is observed that attention is mostly paid to the central area, both while recording and while watching. Therefore, the outer regions of the video frames are cropped off before identifying the region of interest. The area of action is then formulated as a region in the center of the video frames that produces either the biggest difference or the lowest similarity between consecutive frames, yielding a template for the video with which to track the action area. Calculating the difference only between regions of action throughout the video helps to extract keyframes more accurately and effectively by reducing the influence of possibly changing dynamic backgrounds.

The proposed keyframe extraction method consists of four steps: (1) identify an action template; (2) specify the location of an action; (3) calculate action similarities to find distinctive frames; and (4) select a preset number of keyframes in chronological order. The four steps of the proposed keyframe extraction method can be summarized as follows:

  1. Action template identification:

    • Frame decomposition.

    • Frame cropping.

    • Define three possible regions for action template.

    • Calculate mean squared error (MSE) for each possible region using the first two frames.

    • Choose the region that produces the largest MSE as an action template.

  2. Action location specification:

    • Find the region of interest on each frame by matching the action template against overlapped regions in each frame using the correlation coefficient defined in Eq. (3).

  3. Keyframe extraction:

    • Calculate the structural similarity measure \((S_{i})\) between regions of interest on consecutive frames \((f_{i}, f_{i-1})\).

    • Compare the similarity score against the threshold intervals \(T_{1} = [0.65, 0.90]\) and \(T_{2} = [0.65, 0.95]\) (these intervals were chosen by analyzing the significance of action changes in our experiments):

      $$\begin{aligned} & 0.65< S_{i}< 0.90\rightarrow \text {add } f_i\ \text {to primary list} (p_f) \\ & 0.65< S_{i} < 0.95\rightarrow \text {add } f_i\ \text {to alternative list} (a_f) \end{aligned}$$
    • Repeat the above until the end of the video, with \(N_{p_f}\) frames extracted into \(p_f\) and \(N_{a_f}\) frames extracted into \(a_f\).

  4. Keyframe selection:

    • Set the number of keyframes \((N_{k_f})\).

    • Find keyframe ratio \((k)\):

      $$\begin{aligned} k = {\left\{ \begin{array}{ll} \left\lfloor \frac{N_{p_f}}{N_{k_f}} \right\rfloor ,&{}\quad {\text {if}}\; N_{p_f}\ge N_{k_f}\\ \left\lfloor \frac{N_{a_f}}{N_{k_f}} \right\rfloor , &{}\quad {\text {otherwise}} \end{array}\right. } \end{aligned}$$
    • Return the indices of the keyframes by choosing one frame from every \(k\) frames of keyframe list \(p_f\) if \(N_{p_f}\ge N_{k_f}\), or of keyframe list \(a_f\) otherwise.

In the first step, each frame is cropped by taking an appropriate number of pixels out (depending on frame resolution) from each side of a frame to create a general action area. As depicted in Fig. 1, the inner area is then divided into three different candidate templates. It has been observed that background changes between consecutive frames result in small structural similarity (SSIM) and action differences lead to large MSE. The candidate area having the largest MSE between consecutive frames is assigned as an action template. The mean squared error between two regions of frames \(X\) and \(Y\) is computed as follows:

$$\begin{aligned} {\mathrm{MSE}} (X,Y) = \frac{1}{mn}\sum _{i=1}^{m}\sum _{j=1}^{n}{[Y(i,j) - X (i,j)]^2} \end{aligned}$$
(1)

where \(m\) and \(n\) are the number of rows and columns in the region of interest, respectively. The structural similarity formulated by Wang et al. [35] is adopted in this paper and defined as follows:

$$\begin{aligned} {\mathrm{SSIM}} (X,Y) = \frac{\big ( 2\mu _X \mu _Y + C_1\big )\big (2\sigma _{XY} + C_2\big )}{\big ( \mu _X^2 +\mu _Y^2 +C_1 \big )\big ( \sigma _X^2 +\sigma _Y^2 +C_2 \big )} \end{aligned}$$
(2)

where \(\mu _X\) and \(\mu _Y\) denote the averages of pixel values in \(X\) and \(Y\), respectively, \(\sigma _X^2\) and \(\sigma _Y^2\) are the variances of pixel values in \(X\) and \(Y\), respectively, and \(\sigma _{XY}\) is the covariance of pixel values in \(X\) and \(Y\). \(C_1=(k_1L)^2\) and \(C_2=(k_2L)^2\) are small constants (typically \(k_1=0.01\) and \(k_2=0.03\) [35]), where \(L\) denotes the dynamic range of the pixel values. MSE and SSIM are calculated for each frame region and used to select frames that represent the action well (Fig. 2).
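
To make the template identification step concrete, the following Python sketch computes the MSE of Eq. (1) for three central candidate regions of the first two frames and keeps the region with the largest MSE. The margin and the three-way split of the cropped area are illustrative assumptions (the actual cropping depends on frame resolution), and the helper names are ours, not the paper's.

import numpy as np

def mse(region_x, region_y):
    # Mean squared error between two equally sized regions, Eq. (1).
    diff = region_y.astype(np.float64) - region_x.astype(np.float64)
    return float(np.mean(diff ** 2))

def candidate_boxes(height, width, margin_ratio=0.15):
    # Three candidate template boxes inside a centrally cropped area,
    # roughly mirroring Fig. 1 (the margin ratio is an assumption).
    top, bottom = int(height * margin_ratio), int(height * (1 - margin_ratio))
    left, right = int(width * margin_ratio), int(width * (1 - margin_ratio))
    third = (right - left) // 3
    return [(top, bottom, left + i * third, left + (i + 1) * third) for i in range(3)]

def identify_action_template(frame0, frame1):
    # Choose the candidate region with the largest MSE between the first two frames.
    height, width = frame0.shape[:2]
    boxes = candidate_boxes(height, width)
    scores = [mse(frame0[t:b, l:r], frame1[t:b, l:r]) for (t, b, l, r) in boxes]
    t, b, l, r = boxes[int(np.argmax(scores))]
    return frame0[t:b, l:r], (t, b, l, r)   # the template patch and its box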

Fig. 1: Defining the possible locations of an action template. Red, green, and blue boxes represent the borders of three possible templates (Color figure online)

Fig. 2: An example of action template identification

In the second step, having determined the action template for each video, a correlation coefficient template matching method is used to find the position at which the template most closely matches a region of each frame. This operation slides the template throughout each frame and compares each overlapped patch of size \(w \times h\) to the template, where \(w\) and \(h\) are the width and height of the template, respectively. The best matches are then found as global maxima. Regarding color channels, the template summation is performed over all channels, and a separate mean value is used for each channel. The formula for the template matching method is:

$$\begin{aligned} \begin{aligned} R(x,y)= \sum _{x',y'} (T'(x',y') \cdot I'(x+x',y+y')) \end{aligned} \end{aligned}$$
(3)

where \(R(x,y)\) is the correlation coefficient score for a single overlapped position \((x,y)\), with \((x,y)\) representing the coordinates of each pixel in the frame. \(T'(x',y')\) is the mean-centered pixel value of the template \(T\) at \((x',y')\), where \((x',y')\) represents the coordinates of each pixel in the template, given as:

$$\begin{aligned} \begin{aligned} T'(x',y')=T(x',y') - \frac{1}{(w \cdot h)} \cdot \sum _{x'',y''} T(x'',y''). \\ \end{aligned} \end{aligned}$$
(4)

On the other hand, \(I'(x+x',y+y')\) is the mean-centered pixel value of a given frame \(I\) within the region overlapped with the template \(T\), given as:

$$\begin{aligned} \begin{aligned}&I'(x+x',y+y')=I(x+x',y+y') \\&\quad - \frac{1}{(w \cdot h)}\cdot \sum _{x'',y''} I(x+x'',y+y'')\\ \end{aligned} \end{aligned}$$
(5)

where \(x''=0,\ldots ,w-1\) and \(y''=0,\ldots ,h-1\) index the pixels of the template-sized window as it is moved over the frame. \(T(x',y')\) is the pixel value at \((x',y')\) in the template, while \(I(x+x',y+y')\) is the pixel value at the corresponding position in the frame. After the template matching procedure, the region of interest in each frame is localized at the position where the highest matching score occurs.
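
Eqs. (3)–(5) correspond to the mean-centered correlation used by OpenCV's TM_CCOEFF template matching mode; under that assumption, the localization step can be sketched in Python as follows (the function name is ours):

import cv2

def locate_action_region(frame, template):
    # Slide the template over the frame and evaluate R(x, y) of Eq. (3);
    # TM_CCOEFF mean-centres both the template and each overlapped patch.
    result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF)
    _, _, _, max_loc = cv2.minMaxLoc(result)   # global maximum of R(x, y)
    x, y = max_loc
    h, w = template.shape[:2]
    return frame[y:y + h, x:x + w], (x, y, w, h)   # region of interest and its box

In practice the frame and the template would be converted to the same data type (e.g., 8-bit grayscale) before matching.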

In step 3, it is very challenging to distinguish between action changes and background changes in consecutive frames. Through an extensive investigation, it has been found that the structural similarity between the two regions of interest in two consecutive frames is more sensitive to action changes than to background changes. After analyzing both background and action changes in these areas, two rules are proposed in this paper to specify the upper and lower bounds of the similarity range. Significant action changes mostly fall in the interval [0.65, 0.90], while lower similarity scores are mainly caused by dramatic changes in a dynamic background. Only rarely does the similarity between the regions of consecutive frames fall outside this range. However, if not enough keyframes are extracted using this interval, more frames are extracted by extending the upper bound of the interval up to 0.95.
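
The two rules can be sketched as follows, assuming scikit-image's structural_similarity and 8-bit grayscale regions of interest produced by the template matching step; the list and function names are illustrative.

from skimage.metrics import structural_similarity as ssim

T1 = (0.65, 0.90)   # primary similarity interval
T2 = (0.65, 0.95)   # relaxed (alternative) interval

def build_candidate_lists(rois):
    # rois: one region-of-interest patch per frame, all of the template size.
    primary, alternative = [], []
    for i in range(1, len(rois)):
        s_i = ssim(rois[i - 1], rois[i], data_range=255)   # SSIM of Eq. (2), 8-bit assumption
        if T1[0] < s_i < T1[1]:
            primary.append(i)        # frame index added to p_f
        if T2[0] < s_i < T2[1]:
            alternative.append(i)    # frame index added to a_f
    return primary, alternative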

Finally, keyframes are selected by using a keyframe ratio in chronological order. The pseudocode of the proposed algorithm is demonstrated in Algorithm 1.

Algorithm 1: The proposed action template-based keyframe extraction (pseudocode)
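
Since Algorithm 1 appears only as a figure, the final selection step can be sketched along the following lines; taking every k-th candidate in chronological order is our reading of the keyframe ratio defined in step 4.

def select_keyframes(primary, alternative, n_keyframes):
    # primary and alternative are the candidate lists p_f and a_f from step 3.
    candidates = primary if len(primary) >= n_keyframes else alternative
    if not candidates:
        return []
    k = max(len(candidates) // n_keyframes, 1)   # keyframe ratio k, floored
    return candidates[::k][:n_keyframes]         # one frame from every k candidates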

3.2 Deep neural network architectures based on VGG-16 for video classification

In 2014, Simonyan and Zisserman [6] introduced the VGG-16 network architecture, trained on 1000 image categories for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). VGG-16 consists of 16 weight layers (13 convolutional and 3 fully connected) with relatively small \(3\times 3\) convolution filters. We use the pre-trained VGG-16 network to transfer its learnt feature representations to video classification. In our previous work [18], ConvLSTM and LSTM with local features extracted using VGG-16 outperformed those using global features; thus, this paper uses the ConvLSTM(1) and LSTM(1) architectures from [18]. In addition to the newly proposed keyframe extraction method, the experiments in this paper use not only the first 20 but all 101 categories of the UCF-101 dataset, and the proposed methods are also evaluated on the KTH action recognition dataset. Moreover, the parameters of the networks are optimized using the hold-out validation method on the validation split of the training data. The two video classification architectures used to classify the 101 categories are shown in Fig. 3.

Fig. 3: The architectures of the networks used in the VGG-16-LSTM and VGG-16-ConvLSTM experiments

LSTM is one of the most common approaches for sequence modeling. Previous studies [36,37,38] have demonstrated that LSTM is a robust method for representing long-range dependencies. Its main advantage is that its memory cell \(c_t\) accumulates state information. The cell is modified by controlling the input gate \(i_t\) and the forget gate \(f_t\) at time step \(t\). When the cell is fed a new input \(x_t\), it accumulates the input information, provided that the input gate is on. If the forget gate is activated, the previous cell information \(c_{t-1}\) may be forgotten. The output gate \(o_t\) controls whether the current cell state \(c_t\) is propagated to the hidden state \(h_t\). In this study, we follow the hidden layer function of LSTM described in [39]:

$$\begin{aligned} i_t&= \sigma (W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} +b_i) \\ f_t&= \sigma (W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} +b_f)\\ c_t&= f_t \cdot c_{t-1} + i_t \cdot \tanh (W_{xc} x_t + W_{hc} h_{t-1} +b_c) \\ o_t&= \sigma (W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t +b_o) \\ h_t&= o_t \cdot \tanh (c_t) \end{aligned}$$
(6)

where ‘\(\cdot\)’ denotes the Hadamard product, \(\sigma\) represents the sigmoid function, and \(\tanh\) denotes the hyperbolic tangent function. In Eq. (6), \(W_{pq}\) and \(b_q\) are the weight matrices and biases for the respective gates, where the subscript \(p\) can be the input \(x\), the cell output \(c\), or the hidden state \(h\), and the subscript \(q\) can be the input gate \(i\), the forget gate \(f\), the memory cell \(c\), or the output gate \(o\).
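
As a concrete reference, a NumPy transcription of a single time step of Eq. (6) is given below; treating the peephole weights \(W_{ci}\), \(W_{cf}\), and \(W_{co}\) as vectors applied element-wise to the cell state is our assumption.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # p holds the parameters of Eq. (6): W_xi, W_hi, W_ci, b_i, W_xf, W_hf, W_cf, b_f,
    # W_xc, W_hc, b_c, W_xo, W_ho, W_co, b_o. Peephole weights (W_c*) are vectors here.
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t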

In VGG-16-LSTM, the local features extracted by VGG-16 from video frames are fed into LSTM to access spatiotemporal information. The number of units in the output space was set to 1024, and ReLU was used as the activation function.
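
A minimal Keras sketch of this VGG-16-LSTM pipeline is shown below; the input frame size, the sequence length, the flattening of the feature maps, and the final softmax layer are our assumptions, while the 1024 LSTM units with ReLU activation come from the text.

import tensorflow as tf

def build_vgg16_lstm(num_classes, seq_len=16, frame_shape=(224, 224, 3)):
    # Frozen pre-trained VGG-16 extracts per-frame features; an LSTM models the sequence.
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                      input_shape=frame_shape)
    vgg.trainable = False
    frames = tf.keras.Input(shape=(seq_len,) + frame_shape)
    x = tf.keras.layers.TimeDistributed(vgg)(frames)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
    x = tf.keras.layers.LSTM(1024, activation="relu")(x)   # 1024 units, ReLU activation
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(frames, outputs)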

An end-to-end trainable ConvLSTM was proposed for precipitation nowcasting by extending the fully connected LSTM to have convolutional structures in both the input-to-state and state-to-state transitions [40]. The purpose of precipitation nowcasting is to predict future precipitation intensity in a local area over a relatively short period of time, and it can be seen as a video prediction problem in which the weather radar acts as a fixed camera [41]. It was shown that ConvLSTM captures spatiotemporal correlations better than the fully connected LSTM for precipitation nowcasting [40]. We follow the formulation of ConvLSTM defined in [40], where ‘\(\circledast\)’ and ‘\(\cdot\)’ denote the convolution operator and the Hadamard product, respectively:

$$\begin{aligned} i_t&= \sigma (W_{xi} \circledast x_t + W_{hi}\circledast h_{t-1} + W_{ci} \cdot c_{t-1} +b_i) \\ f_t&= \sigma (W_{xf} \circledast x_t + W_{hf}\circledast h_{t-1} + W_{cf} \cdot c_{t-1} +b_f)\\ c_t&= f_t \cdot c_{t-1} + i_t \cdot \tanh (W_{xc} \circledast x_t + W_{hc}\circledast h_{t-1} +b_c) \\ o_t&= \sigma (W_{xo} \circledast x_t + W_{ho}\circledast h_{t-1} + W_{co} \cdot c_t +b_o) \\ h_t&= o_t \cdot \tanh (c_t). \end{aligned}$$
(7)

Inspired by that study, ConvLSTM is used to build a new architecture for video classification, taking advantage of its capacity to capture spatiotemporal information throughout a time series. We add one ConvLSTM layer on top of the spatial feature maps extracted by VGG-16 and use its hidden states for video classification. This layer contains 64 hidden states with \(7 \times 7\) kernels, and the convolution stride is set to 1 in the experiments described in Sect. 4.
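
Analogously, a hedged Keras sketch of VGG-16-ConvLSTM is given below; only the 64 hidden states, the \(7 \times 7\) kernel, and the stride of 1 are taken from the text, while the padding, flattening, and classification head are assumptions.

import tensorflow as tf

def build_vgg16_convlstm(num_classes, seq_len=16, frame_shape=(224, 224, 3)):
    # ConvLSTM over the sequence of spatial feature maps extracted by frozen VGG-16.
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                      input_shape=frame_shape)
    vgg.trainable = False
    frames = tf.keras.Input(shape=(seq_len,) + frame_shape)
    x = tf.keras.layers.TimeDistributed(vgg)(frames)            # per-frame feature maps
    x = tf.keras.layers.ConvLSTM2D(filters=64, kernel_size=(7, 7),
                                   strides=(1, 1), padding="same")(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dropout(0.5)(x)                         # dropout rate from Sect. 4.2
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(frames, outputs)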

4 Experiments

In this section, the datasets used, the experimental setup, and the evaluation method are described.

4.1 Datasets

In this study, the UCF-101 and KTH datasets are used to evaluate the neural network architectures with the proposed keyframe extraction method and two further keyframe extraction methods for classifying human actions in video clips. The UCF-101 dataset includes 13,320 clips from 101 non-overlapping classes, with a resolution of \(320 \times 240\) pixels. All clips in UCF-101 have a fixed frame rate of 25 frames per second (FPS). The minimum and maximum lengths of the clips are 1.06 s and 71.04 s, respectively. The KTH action recognition dataset consists of six types of human actions with over 2300 video sequences. Clips in this dataset have a fixed frame rate of 25 FPS and a resolution of \(160 \times 120\) pixels.

The UCF-101 dataset defines three training–testing splits to facilitate the benchmarking of algorithms. In our previous experiments [18], only the first 20 categories of the dataset were used, due to limited time and computing facilities, and the first training–testing split was adopted to generate training and testing data. In this paper, however, the experiments are conducted using all categories with the three training–testing splits of the UCF-101 dataset as well as the KTH dataset.

In the experiments, the hold-out method was used to split the training data into two subsets: 70% for training and the remaining 30% for validation. The testing dataset was never used during training and validation, but only for producing the testing accuracy of each tested method.
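
For example, with scikit-learn the 70/30 hold-out split of the official training data could be produced as follows; stratifying by class label is our assumption.

from sklearn.model_selection import train_test_split

def holdout_split(features, labels, val_fraction=0.30, seed=0):
    # Split only the official training data; the official test split is never touched.
    return train_test_split(features, labels, test_size=val_fraction,
                            stratify=labels, random_state=seed)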

4.2 Experiment design

During the training process, parameter tuning is carried out with the hold-out validation technique. The best parameters are identified based on the validation scores. After that, the model with the best parameters is evaluated on testing data by predicting unseen test videos’ classes.

The proposed network architectures are implemented using TensorFlow-gpu v1.12 on an NVIDIA TITAN X GPU with the CUDA v9.0 toolkit. The batch size is set to 128, and the cost is minimized using the stochastic gradient descent (SGD) optimizer. The number of epochs is determined using early stopping by observing the change in validation loss. Dropout is used as a regularization method, disabling neurons within the ConvLSTM and fully connected layers with a probability of 0.5.
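
A minimal Keras sketch of this training configuration is given below; the learning rate, patience, and maximum number of epochs are illustrative assumptions, while the batch size, SGD optimizer, early stopping on validation loss, and dropout rate come from the text.

import tensorflow as tf

def train(model, x_train, y_train, x_val, y_val):
    # Batch size 128, SGD optimizer, early stopping on the validation loss.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),  # learning rate assumed
                  loss="categorical_crossentropy", metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                  restore_best_weights=True)
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=128, epochs=100, callbacks=[early_stop])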

4.3 Evaluation method

In the experiments, confusion matrices are produced for performance analysis and accuracy is used for the comparison of the performances achieved by different architectures. 180 training, validation, and testing accuracy scores have been collected with the three training–testing splits released by the UCF-101 organization (10 times per split and 30 times per classifier). Similarly, 90 accuracy scores have been collected with the official training, validation, and test splits of the KTH dataset (15 times per classifier).

The Kolmogorov–Smirnov test is a normality test that compares the observed cumulative distribution with the cumulative distribution that would occur if the data were normally distributed [42]; it has been used as a numerical means of assessing normality. For the test of homogeneity of variances, the Levene statistic has been applied to the dependent variable and shows that the variances of the groups are homogeneous, based on both the mean and the median. Levene's test is simply a one-way analysis of variance on the absolute values of the differences between each observation and the mean of its group and is appropriate for testing the null hypothesis [43]. Furthermore, a one-way ANOVA has been conducted to compare the group variances and determine whether the differences between the results are significant. Afterward, Tukey's honest significant difference (HSD) test has been run to determine which specific group means differ. The results are presented in Sect. 5.
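
As an illustration, the statistical analysis described above can be reproduced with SciPy and statsmodels roughly as follows, where the accuracy scores per classifier are supplied as a dictionary (names and structure are placeholders):

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyse(groups):
    # groups maps a classifier name to a 1-D array of its accuracy scores.
    for name, scores in groups.items():
        z = (scores - np.mean(scores)) / np.std(scores, ddof=1)
        print(name, stats.kstest(z, "norm"))          # Kolmogorov-Smirnov normality test
    samples = list(groups.values())
    print(stats.levene(*samples, center="median"))    # homogeneity of variances
    print(stats.f_oneway(*samples))                   # one-way ANOVA
    scores = np.concatenate(samples)
    labels = np.concatenate([[name] * len(s) for name, s in groups.items()])
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))   # Tukey's HSD post hoc test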

5 Results and discussion

The VGG-16-ConvLSTM and VGG-16-LSTM architectures presented in our previous work [18] for video classification are used as baseline methods to evaluate the method proposed in this paper. One of the findings of the previous work [18] is that using local features can help achieve better classification performance than global features. The fundamental difference between local and global features is the way input frames are represented, as frame patches or whole frames, which provide different information about the input to the video classifier. Seven different classification networks using either local or global features extracted by the pre-trained VGG-16 were compared in the previous work [18]. The extracted features were fed into a newly added fully connected layer in the baseline VGG-16-VOTE (a), whereas the fully connected layer of VGG-16 was included in VGG-16-VOTE (b). Similar to the baseline methods, LSTM was employed to access spatiotemporal information over the features in VGG-16-LSTM (a) and VGG-16-LSTM (b). To test the effect of directional connections in the LSTM structure for action recognition, VGG-16-BLSTM (a) and VGG-16-BLSTM (b) were implemented using Bidirectional LSTM. The VGG-16-ConvLSTM architecture was proposed with convolutional structures in its state transitions. Table 1 shows the results obtained in the earlier study [18], in which VGG-16-ConvLSTM (82.04%) significantly outperformed the other networks, followed by VGG-16-LSTM (81.27%) with local features, at the 0.05 significance level \((p=.046)\). In the previous study, one frame per second was extracted to reduce the number of frames used in classification.

Table 1 Average accuracy scores achieved by different architectures on the UCF-101 dataset with 20 categories [18]

In this paper, experiments with the proposed keyframe extraction method were first conducted using the first 20 categories of the UCF-101 dataset to investigate how much the proposed keyframe extraction method improves the video classification performance of the LSTM- and ConvLSTM-based network architectures in comparison with our previous work [18].

Table 2 Average accuracy scores achieved by LSTM and ConvLSTM network architectures based on the UCF-101 (20 categories) using two keyframe extraction methods where (1) and (2) indicate one frame per second and the proposed method, respectively

Table 2 shows the results obtained on the first 20 categories of UCF-101 in which ConvLSTM(1) and LSTM(1) refer to the previous method selecting one frame per second, whereas ConvLSTM(2) and LSTM(2) indicate the proposed method. It can be seen that the architectures using keyframes extracted by the proposed method, ConvLSTM(2) and LSTM(2), outperformed the previous method, achieving accuracy scores of 88.15% and 83.10%, respectively.

Table 3 Average accuracy scores achieved by LSTM and ConvLSTM network architectures on datasets KTH and UCF-101 (101 categories) using three keyframe extraction methods where (1), (2), and (3) indicate one frame per second method, the proposed action template-based method, and optical flow-based keyframe extraction method, respectively

In order to draw more convincing conclusions, further experiments were conducted with the proposed method (2) using all 101 categories of the UCF-101 dataset and the KTH dataset, in comparison with two commonly used keyframe extraction methods: method (1) is a baseline that extracts one frame per second until the end of the video, and method (3) is a motion-based keyframe extraction method that selects keyframes at the local minima of action between optical flows [29]. The results are summarized in Table 3, in which ConvLSTM(2) achieved the best classification accuracy (71.13%) on the KTH dataset, followed by LSTM(2), ConvLSTM(1), ConvLSTM(3), LSTM(1), and LSTM(3), respectively. Similarly, ConvLSTM(2) outperformed the other methods on the UCF-101 dataset with an accuracy of 67.39%. As shown in Table 4, a one-way ANOVA indicates a statistically significant difference between the classifier groups \((F(11, 258)=304.725, p<.001)\).

Table 4 One-way ANOVA of performance achieved by different network architectures where df, SS, MS, and \(F\) refer to degrees of freedom, sum of squares, mean sum of squares, and F score, respectively
Table 5 Post hoc comparisons using Tukey’s HSD on KTH dataset
Table 6 Post hoc comparisons using Tukey’s HSD on UCF-101 dataset

The Tukey's HSD post hoc test results on the KTH dataset, shown in Table 5, indicate that the classification performance achieved by ConvLSTM(2) is statistically significantly higher than that of LSTM(1), ConvLSTM(1), LSTM(2), LSTM(3), and ConvLSTM(3) \((p<.05)\). There is a statistically significant difference between all pairs of methods except LSTM(1) and LSTM(2).

The Tukey’s HSD post hoc test results on the UCF-101 dataset are presented in Table 6, which demonstrate that ConvLSTM(2) significantly outperformed LSTM(1), ConvLSTM(1), LSTM(3), and ConvLSTM(3) \((p<.05)\).

The keyframe extraction method proposed in this paper uses action templates to identify the most important action-related regions in video frames. The experimental results have demonstrated that this action template-based approach can extract frames with distinctive actions and thereby significantly improve the performance of deep convolutional neural networks for action recognition in videos. Keyframe extraction methods are worth investigating because of their adaptability to video summarization systems and the performance improvement they bring to video classification approaches. They enable a more informative input representation while reducing the number of frames: with fewer but more informative frames, the input dimension is reduced and training time is shortened. Moreover, using selected keyframes can effectively improve video classification accuracy.

6 Conclusion

In this paper, a template-based keyframe extraction method is proposed, which employs action template-based similarity to extract keyframes for video classification tasks. Combining a pre-trained CNN with ConvLSTM achieved the highest classification accuracy among the tested architectures. Calculating the structural similarity between the corresponding regions of consecutive frames effectively prevents dynamic background noise from being treated as action in keyframe selection. The experimental results and the conducted analysis show that the proposed keyframe extraction method can select informative frames reliably and thus significantly improve the performance of deep neural network architectures for video classification. By finding the relevant area with the extracted action template, the proposed method successfully extracts proper keyframes from human action videos for video classification using deep neural networks. Although the proposed method has outperformed the two commonly used keyframe extraction methods, this study has a few limitations. One limitation is that the CNN architecture used in the evaluation of the proposed method is not a state-of-the-art architecture, so the reported accuracies are not the best achievable. The second limitation is that the technical infrastructure of the experiments was limited to a single GPU machine, which prevented more comprehensive experiments with larger batches. Nevertheless, the proposed keyframe extraction method significantly outperformed the commonly used keyframe extraction methods on two different datasets. Therefore, future work could focus on applying the proposed algorithm with more powerful architectures to real-world video classification and video summarization problems.