Introduction

Agriculture is one of the most significant sectors of the economy, and a large share of the world's population depends on it. Meanwhile, the recent growth of the global population necessitates an increase in crop production to meet food requirements worldwide [1]. However, challenging factors such as climate conditions and crop pests hinder farmers from protecting their crops and improving yields. Conventional crop pest identification relies on manually placed pest traps that are inspected to assess the categories of pests present in a field. Such methods are unreliable and often suffer from high error rates.

Furthermore, pest detection is often delayed due to the limited availability of agronomists. Moreover, the lack of technical knowledge about the various pest types, together with the high visual similarity between different insect species, makes it difficult to select the appropriate insecticides, resulting in extensive and indiscriminate pesticide application [2]. Delayed pest recognition can prevent farmers from taking timely countermeasures, causing massive damage to both the quality and quantity of crops. At the same time, identifying the type and extent of crop pest infestations is a tedious and time-consuming activity, whereas early recognition and timely spraying of pesticides can improve yields and strengthen the economy. Recent progress in machine learning (ML) and computer vision (CV) has therefore encouraged researchers to present computer-aided approaches that simplify this task and build effective automated insect identification systems.

Initially, conventional ML-based approaches used local descriptors such as the local binary pattern (LBP) [3], local ternary pattern (LTP) [4], SIFT [5], and SURF [6], with classifiers such as SVM [7] and K-means [8]. Such approaches have been heavily explored for pest detection and classification [9,10,11,12,13,14]. Although hand-crafted keypoint computation techniques are easier to use and require less training data, they are slow and depend on the skill of experienced human specialists. Moreover, recent progress in image acquisition has introduced increasingly challenging datasets on which these conventional, ready-to-use ML solutions are not promising for real-world pest detection and suffer a severe reduction in classification performance, partly because of ineffective hand-crafted keypoint computation.

Moreover, the same pest can appear with different poses and positions across images, producing varying keypoint vectors for the same insect. The research community has primarily worked on improving detection performance for specific categories of insects by proposing new keypoint solutions; such work lacks focus on novel frameworks for multi-category pest recognition tasks that must provide both insect localization and classification information to assist in pest monitoring [15, 16].

Recently, deep learning (DL)-based frameworks, e.g., convolutional neural networks (CNNs) [17], recurrent neural networks (RNNs) [18], and deep belief networks [19], have shown robustness in a variety of areas, including the agriculture sector. DL is a powerful approach for image analysis and object recognition, with superior effectiveness in classifying various categories of pests [20]. Transfer learning is an essential technique in DL, in which pre-trained frameworks are adapted to perform a new task. Deep transfer learning (DTL) reuses such networks for processing digital samples and performing predictive analytics, with better generalization power for pest recognition. DL-based approaches employ CNNs that can automatically extract discriminative keypoints from input data without the assistance of human specialists. Because of the considerable evolution of hardware, DL frameworks are extensively used to handle complicated problems in a reasonable amount of time. In agriculture, DL-based algorithms have proven to be highly accurate and have been effectively adapted to perform various tasks [21].

As progress in DL methods [22, 23] has exhibited promising results in object identification, extensive research has focused on more sophisticated object localization frameworks for better detection accuracy, e.g., Super-FAN [22] and unsupervised multi-stage keypoint learning [23]. Moreover, several CNN-based approaches, namely GoogLeNet [24], AlexNet (AN) [25], VGG [26], ResNet (RN) [27], R-CNN [28], Fast R-CNN [29], Faster R-CNN [30], and YOLO [31], have also been evaluated for pest detection and classification. Even though the aforementioned DL-based object identification frameworks have demonstrated robust performance in general object identification systems, their application to pest detection is still limited. Pest recognition has its own characteristics and differs from existing object identification and classification tasks [21]. Insect pests are small targets and are usually surrounded by complex environments in real-field images; thus, the identification network can easily be misled by the background while computing keypoints. In addition, because of the varying angles and distances at which they are captured in the field, there is considerable variation in pest size and posture, which makes accurate recognition more challenging. Moreover, distinct insect pest species often have a high degree of resemblance in appearance, and the same species may exist in many states such as egg, larva, pupa, and adult, indicating large intra- and inter-class variance. Furthermore, poor lighting and harsh environments further complicate automated identification. Therefore, a low-complexity automated framework for precise in-field pest recognition that improves both classification robustness and computational efficiency is still required. In this work, we present a cost-effective DL-based model for pest recognition and categorization using drones. The presented framework is based on a custom CornerNet model with DenseNet-100 serving as the backbone for deep feature extraction from the input samples. Our results show that the proposed technique can effectively localize and classify multiple pest species in the presence of high variation in shape, size, color, and position, and of variability across and within classes. The main contributions of our work are as follows:

  • We propose a low-complexity AI-based framework for drone systems based on a custom CornerNet model with DenseNet-100 for automated in-field pest recognition, improving the accuracy of classifying various pests.

  • We introduce a computationally efficient approach for precise insect pest detection, as CornerNet is a one-stage object detection framework.

  • We improve the classification accuracy of insect pests through the ability of DenseNet to compute deep keypoints and of the CornerNet model to resist over-fitting to the training data.

  • We perform a rigorous quantitative and qualitative evaluation of the presented technique on a publicly available and challenging benchmark dataset, namely IP102, to exhibit the efficacy of our method.

The rest of the paper is structured as follows: “Related work” reviews related work for insect pest recognition, while “Proposed method” provides a detailed description of the proposed framework. In “Experimental details and results”, we provide the details of experiments performed and a discussion on the results. Lastly, “Conclusion” concludes our research.

Related work

Recently, pest localization and classification have attracted the attention of the research community due to the immense development in computer vision. Numerous standard datasets are available for this purpose; however, they typically contain far fewer samples than the latest DL-based frameworks normally require. This section provides a thorough examination of previous work on the automated identification and categorization of crop pests.

Nanni et al. [32] presented a method that merges CNNs with saliency approaches to recognize and classify crop pests automatically. Initially, the saliency method was applied for data augmentation, and then five different CNN models, namely AN, DenseNet201, ShuffleNet (SN), GoogLeNet (GL), and MobileNetv2 (MN), were trained to classify the insects. This approach [32] improves pest classification accuracy; however, its performance degrades when identifying pest species with significant intra-class differences. In [33], the authors presented a novel CNN framework and compared it with existing DL models, i.e., AN, RN, GL, and VGG. Transfer learning and data augmentation were employed to prevent the network from overfitting and to improve classification accuracy.

Similarly, Li et al. [34] proposed an approach for the automated recognition and categorization of crop insects. Initially, an adaptive threshold (AT) algorithm was applied to the input sample to convert it into a binary image, on which morphological operations together with the watershed algorithm were used to acquire the region of interest (ROI). Then, the GrabCut technique was utilized to remove the background, and several DL models, namely VGG, GL, and RN, were applied to classify the pests in the input samples. However, these methods [33, 34] achieve better insect classification accuracy at the expense of longer computing time. Wang et al. [35] introduced a DL framework for mobile devices, namely DeepPest, to automatically detect and categorize insects. The method [35] employed contextual information as prior knowledge during training and worked well for the localization of small-sized insects. However, the approach [35] is unsuitable for many mobile devices due to processing limitations. Jiao et al. [36] introduced an anchor-free region CNN (AF-RCNN) for the automated localization and categorization of various classes of crop insects. Initially, a keypoints fusion unit was proposed to compute a representative set of features, particularly for small-sized pests. Next, an anchor-free region proposal network (AFRPN) was introduced to generate object proposals based on pest positions by employing the fused feature maps. Lastly, the AF-RCNN was trained to identify 24 classes of insects by integrating the AFRPN with Fast R-CNN into a single framework. This method [36] works well for the localization of small insects. However, its performance depends heavily on the hyper-parameter choices made during training.

Rodríguez et al. [37] proposed a framework for pest identification. Initially, the RGB sample was transformed into a quaternion matrix, to which a quaternion Gaussian low-pass filter was applied to remove noise. The processed sample was then subjected to Sangwine's method to obtain two colored keypoint maps in the horizontal and vertical directions. Both maps were converted to the HSV domain to distinguish the monotone horizontal and vertical edge maps. The obtained maps were then combined, and binarization together with morphological operations was applied to extract the ROIs. This method [37] is robust to pest detection under chrominance and size variations; however, its generalization performance can be further improved. Nam et al. [38] suggested a DL-based approach to locate and categorize crop insects. The single-shot detector (SSD) framework was used to compute in-depth features from input samples and classify the pests into their respective classes. The approach [38] achieved higher accuracy than previously developed methods; however, it was unable to detect small insects. CNN-based techniques need diverse training samples to achieve good accuracy, which existing datasets often lack. Li et al. [39] proposed a data augmentation-based approach to deal with such challenges. Data augmentation was applied during the training step by rotating input samples to different angles together with a cropping operation. This step produced diverse multi-scale samples that could be utilized to train a multi-scale insect identification framework. Various CNN models were trained to demonstrate the effectiveness of the proposed strategy. This technique [39] detects insects despite significant variations in position; however, it is computationally costly. A two-stage CNN framework was proposed in [40] to locate and categorize crop pests. Initially, a Global activated Feature Pyramid Network (GaFPN) was applied to compute a representative set of features from the input images. The calculated feature vector was then passed to a Local activated Region Proposal Network (LaRPN) to identify and classify the pests. The method [40] shows better pest classification performance; however, it is prone to overfitting, resulting in poor performance on unseen data. Another framework for pest detection was proposed in [41]. Initially, the input image was converted to greyscale, and the processed sample was compared to a reference image to identify the changes, which were saved as a feature vector. Density-Based Spatial Clustering (DBSCAN) was then applied to the calculated keypoints to cluster the pests in the samples. This approach [41] can effectively identify insects in noisy samples; however, it is computationally complex.

Nieuwenhuizen et al. [42] presented an approach to locate and classify insects in input samples. In the first step, annotations were developed from the input images and employed for transfer learning. The annotated images were then passed to a DL framework, namely Faster R-CNN, to localize and classify the insects, and in the last step the insects were counted manually. The approach [42] improves insect classification accuracy; however, few results were reported. In [43], the authors conducted a comparative analysis of various CNN-based frameworks, namely VGG, ResNet-50, ResNet-101, AN, and InceptionNet, together with SVM, KNN, and ELM classifiers. The CNN models were employed to compute in-depth features, which were later used to train the classifiers to distinguish the crop pests. It is concluded in [43] that in-depth features with SVM and ELM classifiers exhibit better classification accuracy. Liu et al. [44] proposed a DL-based model, namely PestNet, for classifying crop pests. In the first step, a Channel-Spatial Attention (CSA) module was integrated into the CNN for keypoint computation. Then, a Region Proposal Network (RPN) was employed to calculate region proposals and locate the positions of insects based on the extracted keypoint maps. Finally, a Position-Sensitive Score Map (PSSM) was applied to report the located pests together with their predicted classes. This approach [44] works well for the multiclass classification of crop pests, although at the expense of increased computational complexity. Another automated pest detection framework was introduced in [45]. After image preprocessing, EM and KMM were applied to obtain the ROIs. The GLCM matrix was then employed to compute image features from the obtained ROIs, and the resulting feature vector was used to train an SVM to classify the insects. The method [45] is robust to pest detection; however, it requires a substantial amount of time for data preparation and training. Rustia et al. [46] proposed an approach to localize and classify crop insects. After preprocessing, YOLO-V3 was employed to calculate deep features and classify the pests in the input samples. This technique [46] shows better insect detection accuracy; however, it is unable to locate pests under intense chrominance and light variations. Another CNN model, namely AN, was utilized in [47] to identify and categorize insects in images. The technique [47] exhibits better recognition accuracy; however, performance decreases when multiple pest species are present. Xia et al. [48] presented a technique to identify and classify insects. Initially, a data augmentation step was applied to improve the diversity of the data. The samples were then used to train the VGG-19 framework to localize and categorize the pests. This approach [48] works well for insect classification; however, its performance was evaluated on data covering a limited number of insect species. In a very recent work [73], a custom CenterNet framework with a DenseNet-77 backbone was presented to automate plant disease detection and categorization efficiently. The model outperformed the latest plant disease approaches and was able to efficiently locate and classify 38 types of crop diseases from the PlantVillage dataset.

Table 1 presents an analysis of existing techniques employed for pest detection and classification along with their limitations. From Table 1, it can be seen that although the research community has presented extensive work in the field of automated pest categorization, there is still a need for performance improvement.

Table 1 Comparative analysis of existing pest detection techniques

Proposed method

In this section, we discuss the framework presented for the automated identification and categorization of several crop pests in the field. The aim of this work is to propose a technique that is computationally efficient and capable of automatically extracting reliable image features without the need for any manual inspection. The proposed approach follows two phases: in the training phase, a set of samples from a standard dataset, namely IP102, is used to prepare the annotations, which are then used for model training. In the test phase, suspected samples are passed to the trained framework to evaluate the model's performance. More specifically, we have proposed an improved CornerNet model [49] that employs DenseNet-100 as its base network. Initially, the deep features of the input images are calculated by the DenseNet-100 framework; these are later used by CornerNet to locate the pests present on plant leaves and determine their corresponding categories. In the last step, performance is evaluated by employing several standard metrics used in the field of object detection. Figure 1 shows the structural description of the introduced pest detection and classification methodology.

Fig. 1

Visual representation of the introduced approach for pest recognition

Annotations

For effective DL-based model training, the ROIs must be specified precisely. To accomplish this, we have utilized the LabelImg tool [50] to develop the annotations; a few annotated images are shown in Fig. 2. The annotations specify the position and class of each pest and are saved in an XML file, from which the final training file used to train the model is generated.
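A minimal sketch of this conversion step is given below, assuming LabelImg's default Pascal VOC XML output; the directory and file names are illustrative only and are not taken from the paper.

```python
# Sketch: convert LabelImg (Pascal VOC) XML annotations into a flat list of
# (image, [(class, xmin, ymin, xmax, ymax), ...]) records for training.
import os
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_path):
    """Return (filename, [(class_name, xmin, ymin, xmax, ymax), ...])."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")  # pest class label
        bb = obj.find("bndbox")
        box = tuple(int(float(bb.findtext(k)))
                    for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, *box))
    return filename, boxes

def build_training_list(annotation_dir):
    records = []
    for fname in sorted(os.listdir(annotation_dir)):
        if fname.endswith(".xml"):
            records.append(parse_voc_xml(os.path.join(annotation_dir, fname)))
    return records

# Example (hypothetical path): records = build_training_list("IP102/annotations")
```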

Fig. 2

Sample annotated images of the IP102 dataset

CornerNet model

CornerNet [49] is a one-stage detector that locates ROIs using keypoint estimation. It predicts corners, i.e., the top-left (TL) and bottom-right (BR) points, to compute bounding boxes (bbox) that are more accurate and efficient compared with other anchor-based techniques [29, 51]. The overall architecture of the CornerNet model comprises two major components: the backbone network and the prediction head (Fig. 1). Initially, the model uses a backbone feature extraction network to compute a set of keypoint maps that are used to predict heatmaps (HMs), embeddings, offsets, and classes (C). The HMs provide the probability that a particular position is a TL/BR corner belonging to a specific class, while the embeddings serve to match corner pairs and the offsets adjust the corner positions. The highest-scoring TL and BR points are used to determine the exact location of the bbox, and the class is determined by utilizing the embedding distances of the most relevant feature pairs. The CornerNet model tends to outperform existing object detection frameworks [28,29,30,31]. However, the recognition of insect pests has unique properties, i.e., the small size of the pests and their visual similarity to the surroundings, that differentiate it from existing object recognition and classification tasks. In this work, we have customized the CornerNet model for the detection and classification of multiple pest species. We improved the backbone network of the CornerNet model to increase model effectiveness and achieve more accurate results for pest identification. The improved backbone computes high-level discriminative information that improves pest localization accuracy and overall classification performance. Moreover, the improved architecture is lightweight and computationally efficient compared with the original CornerNet model.

The motivation for employing the CornerNet model for pest identification is its ability to identify objects effectively using keypoint estimation, in contrast to previous models [29, 51,52,53,54]. Because the model locates objects from predicted corner keypoints, it removes the requirement of utilizing extensive anchor boxes for different target dimensions, unlike other one-stage object detection approaches such as SSD [52] and YOLO (v2, v3) [53]. Compared with two-stage approaches such as R-CNN [54], Fast R-CNN [29], and Faster R-CNN [51], the proposed approach is computationally efficient, as those methods use two steps to perform the object detection and classification task. Thus, the proposed DenseNet-100-based CornerNet approach better tackles the problems of existing techniques by providing a more robust framework that computes more reliable image features and, owing to its one-stage detection, also minimizes the estimation cost.

Custom CornerNet model

A backbone network extracts visual features that provide a semantic and robust representation of an image. Pests are small targets; therefore, more precise and discriminative features are required to distinguish them from complicated surroundings affected by varying acquisition angles, brightness, luminosity conditions, and blurring. The traditional CornerNet model was presented with the Hourglass104 feature extractor [49]. The limitation of the Hourglass network is that it is computationally expensive, i.e., it involves extensive network parameters and memory requirements, which unavoidably slows the detection process and reduces the overall efficiency of the model. Moreover, the accuracy of the feature extractor directly impacts the detection accuracy [55]. We have therefore customized the backbone network for the localization and classification of pests to improve framework robustness and achieve better performance, adopting DenseNet-100 [56] as the backbone for improved feature extraction and reduced computational complexity.

DenseNet-100 feature extractor

The DenseNet-100 contains four densely connected blocks with 100 layers and is shallower than Hourglass104. The basic architecture of the employed DenseNet-100 is presented in Fig. 1. The DenseNet-100 framework contains fewer parameters (7.08 M) than the Hourglass104 network (187 M), giving it a computational benefit. In DenseNets, all layers are directly connected to one another, and the keypoint maps of earlier layers are passed to subsequent layers. The DenseNet architecture promotes feature reuse and enhances the information flow throughout the network, which makes it suitable for efficiently handling the complex transformations required for pest localization [56]. The structural details of DenseNet-100 are elaborated in Table 2.

Table 2 Structure of DenseNet-100

The DenseNet comprises several Convolutional Layers (ConL), Dense Blocks (DnB), and Transition Layers (TrL). Figure 3 presents the DnB structure, which is the main component of the DenseNet framework. In Fig. 3, z0 represents the input layer with f0 feature maps. Moreover, Hn(.) is a compound function comprising three successive operations: a 3 × 3 ConL filter, Batch Normalization (BtN), and ReLU. Every Hn(.) operation generates f keypoint maps that are passed to the subsequent layer zn. As every layer takes the keypoint maps of all previous layers as input, this produces f × (n − 1) + f0 feature maps at the nth layer of a DnB, which causes the dimension of the keypoint maps to grow rapidly. Therefore, TrL layers are inserted between the DnBs to reduce the keypoint-map dimension. Each TrL contains BtN, a 1 × 1 ConL, and an average pooling layer, as demonstrated in Fig. 3.
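The following Keras sketch illustrates the dense block and transition layer described above. The growth rate f, the number of layers, and the stem convolution are illustrative values rather than the exact DenseNet-100 configuration; the code assumes the standard BN → ReLU → Conv ordering of the DenseNet compound function.

```python
# Illustrative dense block and transition layer (not the exact DenseNet-100 config).
import tensorflow as tf
from tensorflow.keras import layers

def h_op(x, growth_rate):
    """Compound function Hn(.): BatchNorm, ReLU, and a 3x3 convolution producing f maps."""
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Conv2D(growth_rate, 3, padding="same")(x)

def dense_block(x, n_layers, growth_rate):
    for _ in range(n_layers):
        y = h_op(x, growth_rate)           # f new feature maps
        x = layers.Concatenate()([x, y])   # reuse all earlier feature maps
    return x

def transition_layer(x, compression=0.5):
    """BN + 1x1 conv + average pooling to shrink the concatenated feature maps."""
    n_filters = int(x.shape[-1] * compression)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(n_filters, 1)(x)
    return layers.AveragePooling2D(2)(x)

# Minimal usage example with an illustrative stem convolution.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(24, 7, strides=2, padding="same")(inputs)
x = dense_block(x, n_layers=12, growth_rate=12)
x = transition_layer(x)
```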

Fig. 3

The architecture of a dense block and b transition block

Prediction module

The feature extraction network is followed by two distinct output branches: the TL corner and BR corner prediction branches. Each branch consists of a prediction module with a corner pooling layer placed on top of the backbone to pool features and generate three outputs: HMs, embeddings, and offsets. The prediction module is a modified residual block comprising two 3 × 3 ConL and a 1 × 1 residual connection, followed by a corner pooling layer. The corner pooling layer helps the network localize the corners better. The pooled features are passed to a 3 × 3 ConL-BtN layer, and the projection shortcut is added back. This modified residual block is then followed by a 3 × 3 ConL, which generates the HMs, embeddings, and offsets. The HMs are used to estimate the locations of the corner points. The offsets are employed to correct the corner locations, because a quantization error occurs when keypoints in the input image are mapped to the feature map. Since multiple pests may exist in an image, the embeddings are used to determine whether a TL and BR corner pair belongs to the same pest or to different pests.
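To make the corner pooling operation concrete, the following NumPy sketch shows top-left corner pooling: for every location, the maximum over everything below it in one feature map is summed with the maximum over everything to its right in a second map. This is a minimal 2-D illustration; the bottom-right variant would scan in the opposite directions, and a batched TensorFlow version would be analogous.

```python
import numpy as np

def top_left_corner_pool(f_top, f_left):
    """Top-left corner pooling on two 2-D feature maps of shape (H, W)."""
    # Scan bottom-to-top: flip rows, take a running max, flip back.
    top = np.flip(np.maximum.accumulate(np.flip(f_top, axis=0), axis=0), axis=0)
    # Scan right-to-left: flip columns, take a running max, flip back.
    left = np.flip(np.maximum.accumulate(np.flip(f_left, axis=1), axis=1), axis=1)
    return top + left
```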

Detection

To obtain the final bbox from the corner predictions, non-maximum suppression (NMS) is applied on the corner HMs via a 3 × 3 max-pooling layer. The top 100 TL corners and top 100 BR corners over all classes are then extracted from the HMs, and the predicted offsets are used to adjust the corner locations. TL and BR corners of the same class are paired according to the most similar embeddings, and pairs with an L1 embedding distance greater than 0.5 are eliminated. Soft-NMS is then applied to the obtained candidate bboxes to remove strongly overlapping boxes. The average score of the TL and BR corners is used as the detection score.
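A simplified decoding sketch is given below. It follows the steps just described (3 × 3 max-filter NMS on the heatmaps, top-k corner selection, and embedding-distance pairing with a 0.5 threshold), but the array layouts are illustrative and the offset correction and soft-NMS stages are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def decode_corners(tl_heat, br_heat, tl_emb, br_emb, k=100, emb_thresh=0.5):
    """Pair TL and BR corners into candidate boxes.

    tl_heat / br_heat: (C, H, W) corner heatmaps for C classes.
    tl_emb / br_emb:   (H, W) embedding maps.
    """
    def nms_topk(heat):
        # Keep only local maxima under a 3x3 max filter, then take the top-k scores.
        keep = heat * (heat == maximum_filter(heat, size=(1, 3, 3)))
        idx = np.argsort(keep.ravel())[::-1][:k]
        cls, ys, xs = np.unravel_index(idx, heat.shape)
        return cls, ys, xs, keep.ravel()[idx]

    tl_c, tl_y, tl_x, tl_s = nms_topk(tl_heat)
    br_c, br_y, br_x, br_s = nms_topk(br_heat)

    boxes = []
    for i in range(len(tl_s)):
        for j in range(len(br_s)):
            same_class = tl_c[i] == br_c[j]
            valid_geom = br_y[j] > tl_y[i] and br_x[j] > tl_x[i]
            dist = abs(tl_emb[tl_y[i], tl_x[i]] - br_emb[br_y[j], br_x[j]])
            if same_class and valid_geom and dist < emb_thresh:
                score = (tl_s[i] + br_s[j]) / 2.0   # average of the two corner scores
                boxes.append((tl_x[i], tl_y[i], br_x[j], br_y[j], tl_c[i], score))
    return boxes
```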

Loss function

CornerNet is trained end-to-end with a multi-task loss that improves its performance and allows it to precisely locate pests. The training loss L is the sum of four losses, defined as:

$$ L = L_{\det } + \alpha L_{{{\text{pull}}}} + \beta L_{{{\text{push}}}} + \gamma L_{{{\text{off}}}} , $$
(1)

where Ldet is the detection loss responsible for corner detection and is a variant of the focal loss, Lpull is the grouping loss responsible for grouping corners of the same bbox, Lpush is the corner separation loss responsible for separating corners of different bboxes, and Loff is the smooth L1 loss responsible for offset correction. The parameters α, β, and γ are the weights of the pull, push, and offset losses and are set as α = β = 0.1 and γ = 1. The Ldet is defined as:

$$ L_{\det } = \frac{ - 1}{M}\sum\limits_{i = 1}^{C} {\sum\limits_{x = 1}^{H} {\sum\limits_{y = 1}^{W} {\left\{ {\begin{array}{*{20}c} {(1 - T)^{\varphi } \log (T)} & {{\text{if}}(G) = 1} \\ {(1 - G)^{\omega } (T)^{\varphi } \log (1 - T)} & {{\text{otherwise}}} \\ \end{array} } \right.} } } . $$
(2)

Here, M is the number of pests in an image, and C, H, and W denote the number of channels (one per pest class), the height, and the width of the corner heatmaps, respectively. T and G denote \(T_{ixy}\) and \(G_{ixy}\), where \(T_{ixy}\) is the predicted score at position (x, y) for a pest of class i, and \(G_{ixy}\) is the corresponding ground-truth value. The hyperparameters \(\varphi\) and \(\omega\) control the contribution of each point and are set to 2 and 4, respectively.
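A NumPy sketch of Eq. (2) for a single image is shown below; it is an illustration under the stated settings (φ = 2, ω = 4), with M taken as the number of exact ground-truth corner locations (one per pest).

```python
import numpy as np

def corner_focal_loss(T, G, phi=2.0, omega=4.0, eps=1e-6):
    """Detection loss of Eq. (2) for one image.

    T: predicted corner heatmap of shape (C, H, W) with values in (0, 1).
    G: ground-truth heatmap, equal to 1 at corner locations with a Gaussian
       falloff elsewhere.
    """
    M = np.count_nonzero(G == 1)
    pos = (G == 1)
    pos_loss = ((1 - T[pos]) ** phi) * np.log(T[pos] + eps)
    neg_loss = ((1 - G[~pos]) ** omega) * (T[~pos] ** phi) * np.log(1 - T[~pos] + eps)
    return -(pos_loss.sum() + neg_loss.sum()) / max(M, 1)
```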

During downsampling, the output size is reduced compared with the original input image. Therefore, a pest location (a, b) in the input image is mapped to the location \(\left( {\frac{a}{n},\frac{b}{n}} \right)\) in the HMs, where n is the downsampling factor. Remapping locations from the HM back to the original-size input image introduces a precision loss that degrades the IoU for smaller bboxes. To resolve this problem, position offsets are calculated to adjust the corner locations and are given by:

$$ O_{k} = \left( {\frac{{a^{k} }}{n} - \left\lfloor {\frac{{a^{k} }}{n}} \right\rfloor ,\frac{{b^{k} }}{n} - \left\lfloor {\frac{{b^{k} }}{n}} \right\rfloor } \right), $$
(3)

where \(O_{k}\) denotes the computed offset, and \(a^{k}\) and \(b^{k}\) are the a and b coordinates of corner \(k\). For training, the smooth L1 function is used to compute Loff, which slightly adjusts the corner locations and is defined as:

$$ L_{{{\text{off}}}} = \frac{1}{M}\sum\limits_{k = 1}^{M} {{\text{Smooth}}\;L1\;{\text{Loss}}(O_{k} ,O^{\prime}_{k} )} . $$
(4)

An input image may contain multiple pests; thus, multiple TL and BR corners are computed in a single image. For each detected corner, the network predicts an embedding vector used to decide whether a pair of TL and BR corners belongs to the same pest. We apply the "pull" and "push" losses to train the network; they are defined as:

$$ L_{{{\text{pull}}}} = \frac{1}{M}\sum\limits_{i = 1}^{M} {[ (e_{{l_{i} }} - e_{i} )^{2} + (e_{{r_{i} }} - e_{i} )^{2} ]}, $$
(5)
$$ L_{{{\text{push}}}} = \frac{1}{M(M - 1)}\sum\limits_{i = 1}^{M} {\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne i \end{subarray} }^{M} {\max [0,\Delta - } } |e_{i} - e_{j} |], $$
(6)

where \(e_{{l_{i} }}\) is the embedding of the TL corner and \(e_{{r_{i} }}\) that of the BR corner of pest i, and \(e_{i}\) is the average of \(e_{{l_{i} }}\) and \(e_{{r_{i} }}\). The margin separating corners that belong to different pests is set to 1, i.e., \(\Delta = 1\) in all our experiments.
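For illustration, a NumPy sketch of Eqs. (5) and (6) for a single image is shown below, given the corner embeddings of the M ground-truth pests and Δ = 1.

```python
import numpy as np

def pull_push_losses(e_tl, e_br, delta=1.0):
    """Grouping (pull) and separation (push) losses of Eqs. (5) and (6).

    e_tl, e_br: 1-D arrays of length M holding the embeddings of the TL and
    BR corners of the M ground-truth pests in one image.
    """
    M = len(e_tl)
    e_mean = (e_tl + e_br) / 2.0
    pull = np.mean((e_tl - e_mean) ** 2 + (e_br - e_mean) ** 2)

    push = 0.0
    for i in range(M):
        for j in range(M):
            if i != j:
                push += max(0.0, delta - abs(e_mean[i] - e_mean[j]))
    push /= max(M * (M - 1), 1)
    return pull, push
```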

Experimental details and results

This section describes the implementation details and the experiments carried out to evaluate the performance of the suggested model. To comprehensively demonstrate the efficacy of the custom CornerNet model, we have evaluated it for pest recognition and classification and compared it with other models.

Dataset

In this work, we have utilized the IP102 insect pest recognition dataset [57] to evaluate the performance of the proposed model. This dataset contains 75,222 images covering 102 common insect pest classes. The IP102 dataset is organized hierarchically, with two super-classes, field crops (FC) and economic crops (EC), further divided into sub-classes depending on the particular crop types damaged by pest insects. The FC super-class contains five sub-classes, i.e., Rice, Corn, Wheat, Beet, and Alfalfa, whereas the EC super-class contains three sub-classes, i.e., Citrus, Vitis, and Mango. All these sub-classes are further divided into the 102 pest classes that define the pest insects associated with each specific crop. Further details of the classes and the number of samples in each class are given in [57]. It is worth noting that the images in the IP102 dataset are diverse, containing insects of very different ages, colors, sizes, and shapes. In addition, variations in luminosity, zoom level, and angle make the dataset representative of the complexities of real-life scenes and thus very challenging. Figure 4 presents some sample images of pests from various species in the IP102 dataset. It can be observed from Fig. 4 that the samples are challenging, exhibiting intricacies of various environmental factors such as varying lighting conditions or insects hidden in the background.

Fig. 4

Sample images from IP102 Dataset

Implementation details

The overall implementation of the proposed framework was carried out in TensorFlow using the Keras library. Table 3 presents the final training parameters for the custom CornerNet model. In our study, we tuned the model's hyperparameters by varying the number of epochs, the batch size, and the learning rate to obtain the final optimized model. Learning rates of 0.01, 0.001, and 0.0001 were explored with the Stochastic Gradient Descent (SGD) optimizer, the number of epochs was varied over 15, 25, 35, and 45, and the mini-batch size over 16, 32, and 64. To prevent overfitting, we set the dropout value to 0.3. The size of the input images was fixed at 224 × 224, and the data were randomly divided into training, validation, and test sets: 60% of the data were used for training, 10% for validation, and the remaining 30% for testing.
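A minimal sketch of the 60/10/30 split is given below; scikit-learn's train_test_split is used purely for illustration, and the image paths and labels are placeholders rather than the actual IP102 files.

```python
# Illustrative 60/10/30 split into training, validation, and test sets.
from sklearn.model_selection import train_test_split

image_paths = [f"IP102/images/{i:05d}.jpg" for i in range(1000)]  # placeholder paths
class_labels = [i % 102 for i in range(1000)]                     # placeholder labels

# First split off the 30% test set, then take 1/7 of the remainder
# (= 10% of the whole) for validation, leaving 60% for training.
train_x, test_x, train_y, test_y = train_test_split(
    image_paths, class_labels, test_size=0.30, random_state=42)
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=1 / 7, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # roughly 600, 100, 300
```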

Table 3 Training parameters for the proposed model

Evaluation parameters

For evaluating the performance of the proposed technique, we have used different quantitative metrics such as precision (P), recall (R), accuracy (Acc), Intersection over Union (IoU), and mean average precision (mAP). These metrics are computed as follows:

$$ P = \frac{{{\text{TP}}}}{{({\text{TP}} + {\text{FP}})}}, $$
(7)
$$ R = \frac{{{\text{TP}}}}{{({\text{TP}} + {\text{FN}})}}, $$
(8)
$$ {\text{Acc}} = \frac{{({\text{TP}} + {\text{TN}})}}{{({\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}})}}, $$
(9)
$$ {\text{IoU}} = \frac{{{\text{TP}}}}{{({\text{FN}} + {\text{FP}} + {\text{TP}})}}, $$
(10)
$$ {\text{mAP}} = \sum\limits_{i = 1}^{T} {\frac{{{\text{AP}}(t_{i} )}}{T}} , $$
(11)
$$ F1\_{\text{score}} = \frac{2 \times P \times R}{{(P + R)}}. $$
(12)

TP, TN, FP, and FN denote the true-positive, true-negative, false-positive, and false-negative cases, respectively. A pest that is present in an image and correctly detected and classified is counted as a TP, whereas a pest that is present but missed or misclassified is counted as an FN. A detection reported for a pest that is not present in the image is an FP, and correctly reporting the absence of a pest is a TN. The mAP computation is shown in Eq. (11), where AP is the average precision of each class, and t and T represent a test image and the total number of test images, respectively.
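As a small worked illustration of Eqs. (7)–(9) and (12), the following function computes the metrics directly from the four counts; the example counts in the comment are arbitrary.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, and F1 score from Eqs. (7)-(9) and (12)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Example with arbitrary counts: classification_metrics(57, 30, 35, 42)
```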

Insect pest localization results

The precise localization of pests is important for designing an effective automated pest recognition method. Therefore, we designed an experiment to assess the localization effectiveness of the proposed framework. For this analysis, we utilized all the test images from the IP102 database and present a few visual results in Fig. 5. From the reported results, we can see that the proposed approach is capable of locating pests of varying sizes, shapes, and colors. Additionally, our technique can effectively detect pests even under complex backgrounds, illumination and orientation changes, and varying acquisition angles. The localization ability of the proposed framework, based on keypoint estimation, allows it to identify and discriminate pests of various categories effectively and precisely. We computed the mAP and IoU to quantitatively measure the localization performance; these metrics show how well the proposed model localizes and recognizes the several pest categories. For localization, the IoU threshold is set to 0.5: a predicted region whose overlap with the ground truth is below this value is considered background; otherwise, it is considered a pest. The proposed framework achieved mAP and mean IoU values of 0.578 and 0.621, respectively. We can infer from these results that the presented technique can effectively detect and precisely localize pests even against diverse backgrounds.

Fig. 5

Sample detection results of insect pests using the proposed model

Insect pest classification results

The accurate categorization of various pests is important to demonstrate the robustness of a model. In the real world, a crop cultivation area may host multiple types of insects depending on the crop category. Therefore, we performed an experiment to measure the efficacy of the proposed technique in classifying insect pests based on the eight hierarchical crop categories. The trained CornerNet model was applied to all the test images from the IP102 dataset to accomplish this task. Table 4 shows the crop-based pest categorization performance of the proposed method in terms of recall, precision, and F1 score. It can be observed from the stated results that the presented framework achieved a precision, recall, and F1 score of 61.72%, 57.46%, and 59.39%, respectively, over all crop-specific insect classes. The reason for the robust pest classification performance is the employed keypoint computation technique, which represents each pest class in a discriminative and reliable manner. As a result, our custom CornerNet performs well in crop-wise pest identification, demonstrating the effectiveness of the introduced method.

Table 4 Class-wise crop-based insect classification performance of the proposed method

We have also reported the accuracies of the eight crop-wise pest classes in a boxplot in Fig. 6. The boxplot indicates the distribution of classification accuracy over the different classes. According to Fig. 6, our method attained average accuracy values of 0.484, 0.707, 0.593, 0.695, 0.497, 0.899, 0.773, and 0.851 for the eight crop classes, i.e., rice, corn, beet, wheat, alfalfa, mango, citrus, and vitis, respectively. More specifically, we achieved an average classification accuracy of 0.6874 with a low error rate over all classes, which exhibits the efficacy of the proposed method. It can be observed from Fig. 6 that our method achieved particularly promising results for crops like mango, vitis, and citrus. However, the proposed framework achieves low classification accuracy on some classes, such as rice and alfalfa, due to visual similarities with the background and high intra-class variance. In Fig. 7, we provide some example images from the IP102 dataset having a similar appearance; the samples in the same column show pests from different species whose visual characteristics are nevertheless similar.

Fig. 6

Accuracy of the proposed method over crop-wise pest classes

Fig. 7

Sample images of pests species having similar visual features (the label shows the pest-wise class and corresponding crop subclass)

In addition, Fig. 8 shows the normalized confusion matrix of the presented technique, which summarizes the crop-level pest classification results in terms of predicted and actual classes. To further demonstrate the recognition performance of the proposed model for each of the 102 pest species, we present the obtained accuracy values in Fig. 9. These results validate the robust performance of the proposed approach over the crop-wise pest categories and the 102 pest species.

Fig. 8

Confusion matrix of the proposed method over crop-wise pest classes

Fig. 9

Accuracy of the proposed method over 102 pest classes

Evaluation of DenseNet-100 model

Deep features are effective for image recognition tasks. We therefore conducted an analysis to evaluate the feature learning ability of the employed DenseNet-100 model compared with other deep feature extraction models for the pest identification and classification task. For this reason, the detection performance of the proposed custom CornerNet is compared with different base models, i.e., AlexNet [58], GoogleNet [59], VGGNet [60], ResNet-50 [61], ResNet-101 [61], Inception V4 [62], HourGlass104 [63], EfficientNet [64], and DenseNet-121 [56]. We adopted transfer learning to achieve better generalization on unseen data: all these base networks were pre-trained on ImageNet [65], and the last layer of each network was then fine-tuned on the IP102 database. For this experiment, the networks were trained for 30 epochs with mini-batch sizes of 16 and 64. In addition, the learning rate was set to 0.001 with the SGD algorithm and a momentum value of 0.9. We analyzed the classification results of these models on the IP102 database and their computational complexity in terms of network parameters.
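The sketch below illustrates this pre-train-then-fine-tune setup in Keras. Because DenseNet-100 is not bundled with Keras, the available DenseNet121 from keras.applications stands in as an example backbone; the frozen layers, 102-class head, and the reported SGD settings (learning rate 0.001, momentum 0.9) are used for illustration only.

```python
# Illustrative transfer-learning setup; DenseNet121 stands in for the backbone.
import tensorflow as tf
from tensorflow.keras import layers, models

backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg")
backbone.trainable = False                   # keep the ImageNet-pretrained layers frozen

model = models.Sequential([
    backbone,
    layers.Dense(102, activation="softmax"),  # fine-tuned head for the 102 pest classes
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)  # datasets prepared separately
```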

The comparative analysis of our approach with the other feature extraction models is given in Table 5, which presents the classification accuracies and standard deviations (STD). The STD reflects the consistency of a model's classification output: a higher STD indicates inconsistent behavior in pest recognition and classification. According to the results, the custom CornerNet with DenseNet-100 as the backbone network performs better than the other models. This is due to the effective deep feature computation of the DenseNet model, which provides a more accurate and diverse feature representation of the different insect pest species. Table 5 shows that the base frameworks, i.e., AlexNet, VGGNet, ResNet, Inception v4, and HourGlass, yield low performance for pest recognition. This could be due to their inability to learn the fine-level characteristics needed to distinguish multiple pest species in a complex background, resulting in a high misclassification rate. AlexNet attains the lowest accuracy of 41.8% when predicting pests over all 102 categories. The primary reason for the poor performance of this model is that the network is too simple to learn the complexities, i.e., the shape and texture, of the input pest data.

Table 5 Performance comparison of the proposed approach with other feature extraction models

In comparison, the deeper networks, i.e., ResNet-101, HourGlass, and DenseNet-121, are capable of learning more descriptive and fine-grained differences between many similar insect species. However, their performance is still low for identifying multiple pest classes. This might indicate that, because of their many network parameters, these models are more prone to overfitting on the IP102 pest classes with fewer training samples. In contrast, the custom CornerNet with DenseNet-100 reached the best performance (68.74% accuracy) in classifying the various pest species. The EfficientNet model attains the second-highest accuracy (60.2%); however, it is computationally more complex. The DenseNet-100, on the other hand, has just 7.08 million parameters, fewer than any of the other employed DL models.

The better pest classification performance of our approach stems from its improved network architecture, which allows the optimal reuse of features. For the base models, we used their original implementations, which are quite complex in structure and unable to extract reliable features for this task. Our approach overcomes the shortcomings of the comparative models by incorporating an efficient framework for discriminative keypoint computation that reuses features from earlier layers in each subsequent layer. As a result, it handles complex transformations accurately, resulting in improved performance. From this analysis, we can conclude that the proposed custom CornerNet with the DenseNet-100 backbone performs better than the other feature extraction models in terms of both accuracy and efficiency.

Performance comparison with ML-based classifiers

To evaluate the proposed method against ML-based classifiers trained on deep features, we performed an additional experiment. We used the IP102 dataset and divided it into 60%, 10%, and 30% for the training, validation, and test sets, respectively; the detailed experimental settings are described in "Implementation details". We extracted deep features from the three best-performing feature extraction models in Table 5, i.e., ResNet-50 [61], EfficientNet [64], and DenseNet-100 [56], and used them to train the ML classifiers, i.e., SVM and KNN; the classification results with standard deviations are shown in Table 6. From Table 6, it can be observed that the DenseNet-100-based deep features with the SVM and KNN classifiers achieved the best results among these combinations. However, our custom CornerNet model still obtained the best overall results. More specifically, DenseNet-100 with SVM and KNN as back-end classifiers achieved 52.5% and 50.4% accuracy, respectively, whereas the proposed custom CornerNet model achieved an accuracy of 68.74%. This illustrates that the proposed model provides a more accurate feature representation of the pests and copes better with over-fitting than the ML-based classifiers.
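The following sketch shows the shape of this baseline pipeline: fixed-length deep feature vectors from a frozen backbone are fed to scikit-learn SVM and KNN classifiers. The random feature matrices and the 1024-dimensional feature size are placeholders that merely keep the sketch self-contained; in practice the features would come from one of the backbones in Table 5.

```python
# Deep-features-plus-classical-classifier baseline (illustrative data).
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X_*: (N, D) deep feature vectors extracted from a frozen backbone;
# y_*: integer pest class labels. Random arrays keep the example runnable.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 1024)), rng.integers(0, 102, 500)
X_test, y_test = rng.normal(size=(200, 1024)), rng.integers(0, 102, 200)

svm = SVC(kernel="rbf").fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```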

Table 6 Performance comparison of the proposed approach with ML-based classifiers

Performance comparison with other object detection techniques

We have compared the performance of the proposed model with other state-of-the-art object detection methods. Accurate pest localization is important because a noisy background can mislead the classifier when the target pest is not apparent, and the presence of several pests can further complicate the detection process. Correct localization can also improve classification accuracy by ignoring irrelevant background information. To evaluate this, we have considered two-stage detectors, i.e., Fast R-CNN [29] and Faster R-CNN [51], and one-stage object detection models, i.e., SSD [52], YOLOv3 [53], RefineDet [66], and CornerNet [49], which have demonstrated robust performance on the COCO dataset [67]. We have assessed the performance of these models on the IP102 dataset to analyze their pest localization ability under different challenging conditions such as complex backgrounds, noise, luminosity, and variation in color, size, and shape. We have computed the mAP, a standard metric used in object recognition tasks, to conduct the performance analysis. Furthermore, we have computed the test times of all models to assess their computational complexity. Table 7 shows the mAP and inference times of the different object detection approaches with varying backbones for pest detection.

Table 7 Performance comparison of the proposed approach with other object detection methods

The results reported in Table 7 show the superior performance of the proposed model for pest identification compared with the others. It can be seen from Table 7 that the different object detection models show better performance with a powerful backbone, i.e., DenseNet, for the recognition of pests. The two-stage object detectors, Fast R-CNN and Faster R-CNN, show degraded performance and are computationally expensive, as these approaches use anchor boxes to identify potential regions of interest and then perform classification and regression to find the corresponding bboxes. In comparison, the one-stage networks RefineDet, SSD, and YOLOv3 directly determine the position and category of the object and show better performance. However, as the original implementations of these approaches are evaluated in this work, they cannot perform well in recognizing and locating pests under intense light variations. Figure 10 presents the visual results of the one-stage detection models on a test sample.

Fig. 10

Sample visual result of SSD, RefineDet, YOLOv3, and the proposed CornerNet model

Moreover, regarding computation speed, the one-stage detectors are shown to be faster than the two-stage detection models. Our model efficiently overcomes the limitations of these methods using a custom CornerNet model with DenseNet-100 as the backbone network. The reason for the improved performance is that the DenseNet backbone enables CornerNet to learn more representative features, which assist in better pest localization and classification into different categories. Furthermore, the CornerNet model provides a computational benefit over the other models due to its one-stage detection nature and takes only 0.23 s to process a sample.

Performance comparison with existing approaches

In this section, we present the comparison of the classification performance of our approach with results obtained by previous works [32, 68,69,70,71,72] over the same dataset, i.e., IP102 [57]. Table 8 compares pest insect classification results with existing approaches in terms of average accuracy.

Table 8 Performance comparison of the proposed method with existing techniques

In [68], the authors employed transfer learning to train deep-learning models (i.e., VGG-19, InceptionNetV3, and ResNet-50) for the classification of pest species and achieved the highest overall average accuracy of 57.08% using InceptionNetV3. However, manual cropping and data augmentation techniques were applied before training the model. Ayan et al. [69] employed CNNs (Inception-V3, Xception, and MobileNet) with an ensemble methodology, namely GAEnsemble, to improve the classification performance. Similarly, in [32], the authors combined CNNs and a saliency method to create an ensemble of classifiers and used the fusion-sum method at the output layer. However, these methods [32, 69] achieved accuracies of 61.93% and 67.13%, respectively, at the expense of slow computing speed because of the ensemble weight calculation. Zhou et al. [70] used the EquisiteNet model comprising double fusion with squeeze-and-excitation and max-feature-expansion blocks; the model achieved an accuracy of 52.32%, which is too low for practical use in the real world. The methods in [71, 72] used modified ResNet blocks incorporating feature reuse and feature fusion mechanisms for efficient feature computation and obtained accuracies of 55.24% and 55.43%, respectively. However, the ResNet-based architecture is computationally more expensive than DenseNet. These results clearly show that the proposed CornerNet model with DenseNet-100 outperforms the other studies by achieving an average accuracy of 68.74%. In particular, the reason for the improved performance is that the DenseNet effectively computes the feature maps by connecting the output of preceding layers as input to all subsequent layers, and the computed features are used by the CornerNet architecture for the localization and classification of the pests. This strongly enhances the performance of the proposed model for pest recognition and classification on the challenging IP102 dataset. Moreover, our approach is computationally efficient and robust enough to identify insects more precisely than existing approaches. As a result, we conclude that our technique has considerable potential for classifying target pests in the field using drones.

Conclusion

In our work, we have presented a low-cost DL-based framework for the automated recognition and categorization of crop pests in the field using drones. The presented method is based on a custom CornerNet model that employs DenseNet architecture as a backbone network for feature extraction. More precisely, we employed the DenseNet-100 network to extract a discriminative set of keypoints from the input samples. The custom CornerNet model is then trained to recognize various types of pests. We evaluated our approach on the IP102 dataset, a large-scale challenging pest recognition benchmark database comprising in-field captured images. Through extensive experimentation, we have shown the efficacy of our approach for real-world pest monitoring applications. The reported results showed that our method could accurately localize and classify pests of various categories in the presence of complex background and variations in pest shape, color, size, orientation, and luminosity. In the future, we intend to develop a more effective feature fusion approach to improve the performance of our method for fine-grained pest categorization.